Making research data ready to be accessible

Once data have been collected according to the scientific process and accepted methodology, decide whether they should be, or can be, made accessible. A few steps come first:

1. Selection

Not all data need to be made accessible. When choosing data to archive, consider conditions such as:

requirements of agencies financing scientific research
the scientific value of research data
uniqueness - it is worth checking that the data do not duplicate existing datasets
the possibility of replicating research findings - whether the data include all parameters needed to repeat the experiment
economic issues - the costs of managing and storing data, and whether they are justified

Research data do not have to be perfect; they may, for example, have gaps in measurements caused by outdoor conditions. It is important to point out such gaps and describe their causes.
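One way to point out gaps is to list them alongside the dataset. The following is a minimal sketch in Python, assuming measurements taken at a fixed interval; the function name find_gaps and the sample timestamps are hypothetical and only serve as an illustration.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_step):
    """Return (start, end) pairs where two consecutive readings are further
    apart than expected_step. Assumes timestamps are sorted ascending."""
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > expected_step:
            gaps.append((previous, current))
    return gaps

# Hypothetical hourly readings with one interruption
readings = [
    datetime(2023, 5, 1, 10),
    datetime(2023, 5, 1, 11),
    datetime(2023, 5, 1, 14),  # readings at 12:00 and 13:00 are missing
    datetime(2023, 5, 1, 15),
]

gaps = find_gaps(readings, timedelta(hours=1))
for start, end in gaps:
    print(f"Gap between {start.isoformat()} and {end.isoformat()}")
```

The resulting list can be copied into the dataset documentation together with the cause of each gap.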

2. Removing sensitive data

Sensitive data are data that allow the respondents under study to be identified. Two techniques are used to handle them:

anonymization - converting personal data so that they can no longer be attributed to an identified or identifiable respondent
pseudonymization - converting data so that they cannot be attributed to a given respondent without the use of additional information

Reversibility is the basic feature that distinguishes pseudonymization from anonymization. Anonymization is an irreversible process, whereas pseudonymization is reversible.
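The difference in reversibility can be sketched in Python. This is an illustration only, under stated assumptions: the function names, the "R-" code prefix and the sample records are hypothetical, and dropping a single field is shown merely to make the irreversibility visible. Real anonymization usually also has to deal with indirect identifiers.

```python
import secrets

def pseudonymize(records, id_field):
    """Replace id_field with a random code. Returns the converted records and
    a key table (code -> original value) that must be stored separately."""
    key_table = {}   # the "additional information" needed to reverse the process
    codes = {}       # original value -> code, so repeated respondents match
    converted = []
    for record in records:
        original = record[id_field]
        if original not in codes:
            code = "R-" + secrets.token_hex(4)
            while code in key_table:  # regenerate in the unlikely case of a clash
                code = "R-" + secrets.token_hex(4)
            codes[original] = code
            key_table[code] = original
        converted.append({**record, id_field: codes[original]})
    return converted, key_table

def anonymize(records, id_field):
    """Drop id_field entirely. No key table exists, so this cannot be undone."""
    return [{k: v for k, v in record.items() if k != id_field} for record in records]

# Hypothetical respondents
respondents = [
    {"name": "Respondent A", "age": 34},
    {"name": "Respondent B", "age": 51},
]

pseudonymized, key_table = pseudonymize(respondents, "name")
restored = [{**r, "name": key_table[r["name"]]} for r in pseudonymized]
anonymized = anonymize(respondents, "name")
```

With the key table, the restored list equals the original records; without it, the codes carry no link back to a respondent. The anonymized records have no such table at all.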

3. Choice of file format

Data should be published in a widely available format that does not require commercial software and uses a standard character encoding (ASCII, UTF-8). Consider the formats already established in your discipline, and do not force other users into additional conversions, which can degrade data quality.
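Whether a file already uses UTF-8 can be checked before publication. The sketch below uses only the Python standard library; the helper names is_utf8 and to_utf8 are hypothetical, and cp1250 is merely an assumed example of a legacy encoding.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (plain ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

# Assumed example: text saved in the legacy cp1250 encoding
legacy = "Gdańsk".encode("cp1250")
converted = to_utf8(legacy, "cp1250")

print(is_utf8(legacy))     # False
print(is_utf8(converted))  # True
```

The source encoding has to be known or determined separately; decoding with the wrong one may succeed silently and produce garbled characters.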

File formats:

Text, Documentation, Scripts: XML, PDF/A, HTML, Plain Text
Still Image: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG (digital negative), BMP, GIF
Geospatial: Shapefile (SHP, DBF, SHX), GeoTIFF, NetCDF
Graphic Image:
raster formats: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG, BMP, GIF
vector formats: Scalable Vector Graphics (SVG), AutoCAD Drawing Interchange Format (DXF), Encapsulated PostScript (EPS), Shapefile
Cartographic: Most complete data, GeoTIFF, GeoPDF, GeoJPEG2000, Shapefile
Audio: WAVE, AIFF, MP3, MXF, FLAC
Video: MOV, MPEG-4, AVI, MXF
Database: XML, CSV, TAB

4. Giving proper names to folders and files

First, try to answer a few questions: Which file names and what folder structure would be most useful to me if I wanted to use these files? What should a name include so that a specific dataset can be found without any problems? Consistency is the basic rule of organising files: once a structure and naming convention have been adopted, they should be maintained throughout.
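A naming convention is easier to maintain when names are generated rather than typed by hand. The sketch below shows one possible convention (project, description, ISO date, version); the pattern, the function name build_filename and the sample values are assumptions, not a prescribed standard.

```python
import re
from datetime import date

def build_filename(project, description, day, version, extension):
    """Build a name of the form project_description_YYYY-MM-DD_vNN.ext."""
    def slug(text):
        # Lower-case the text and replace every run of characters other than
        # a-z and 0-9 (including spaces and accented letters) with one hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return (
        f"{slug(project)}_{slug(description)}_"
        f"{day.isoformat()}_v{version:02d}.{extension}"
    )

# Hypothetical project and dataset
name = build_filename("Baltic Survey", "water temperature", date(2023, 5, 1), 2, "csv")
print(name)  # baltic-survey_water-temperature_2023-05-01_v02.csv
```

Dates in the YYYY-MM-DD form and zero-padded version numbers make files sort chronologically in an ordinary directory listing.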

5. Describing datasets with proper metadata

Data should be described so that they can be indexed, found and reused.
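A description can be kept as a small machine-readable record next to the data. The sketch below is only an illustration: the field names and the list of required fields are hypothetical, and a real repository will prescribe its own metadata schema.

```python
import json

# Hypothetical set of fields a repository might require
REQUIRED_FIELDS = ("title", "creators", "description", "keywords", "license", "date_created")

def missing_fields(metadata):
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]

# Hypothetical record; the license has not been filled in yet
metadata = {
    "title": "Water temperature measurements (example)",
    "creators": ["Example Author"],
    "description": "Hourly readings; gaps are documented in the README.",
    "keywords": ["water temperature", "time series"],
    "license": "",
    "date_created": "2023-05-01",
}

print(json.dumps(metadata, indent=2))
print(missing_fields(metadata))  # ['license']
```

Checking the record before deposit helps catch empty fields that would otherwise make the dataset harder to find.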

Did you know?

Some datasets need to be edited or cleaned. They may contain errors in, among other things, spelling or grammar, or in nominal values (several terms used for the same thing). Identifiers of objects or contexts (such as 'no. of catalogue' or 'place#') should be free from errors too. If you need a tool to 'clean' the gathered data, you can use, for example, OpenRefine.
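The kind of cleanup meant here, merging several spellings of one term, can also be sketched in a few lines of plain Python. This is not OpenRefine's interface, only an illustration of the idea; the function names, the mapping and the sample values are hypothetical.

```python
import re

def normalize(value):
    """Collapse whitespace, trim the ends and ignore letter case."""
    return re.sub(r"\s+", " ", value).strip().casefold()

def unify_terms(values, canonical):
    """Replace each value with its preferred term when its normalized form
    is listed in canonical; otherwise keep the value, trimmed."""
    return [canonical.get(normalize(value), value.strip()) for value in values]

# Hypothetical mapping from normalized variants to the preferred term
canonical = {"gdansk": "Gdańsk", "gdańsk": "Gdańsk"}

raw = ["Gdańsk", " gdansk", "GDAŃSK ", "Sopot"]
cleaned = unify_terms(raw, canonical)
print(cleaned)  # ['Gdańsk', 'Gdańsk', 'Gdańsk', 'Sopot']
```

Values that are not in the mapping are left as they were, apart from trimming, so unexpected terms stay visible for manual review.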

Making research data ready to be accessible | Gdańsk University of Technology
