Once data have been collected, it should be decided whether they should, or can, be made accessible, in line with the scientific process and accepted methodology. A few steps should be followed first:
1. Selecting data
It is not necessary to make all data accessible. There are some conditions worth taking into consideration when choosing data to be archived, such as:
- requirements of agencies financing scientific research
- the scientific value of research data
- uniqueness - it is worth checking that the data do not duplicate existing datasets
- the possibility of replicating research findings - whether the data include all the parameters needed to repeat the experiment
- economic issues - costs of managing and storing data with their justification
Research data do not have to be perfect; they can, e.g., have gaps in measurements caused by outdoor conditions. It is important to point out such gaps and describe their causes.
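One way to point out gaps in a measurement series is to list them together with a note on their cause. The sketch below is only an illustration: it assumes readings taken at a regular interval, and the names find_gaps and describe_gaps, the sample timestamps and the placeholder cause text are all hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_step):
    """Return (last_reading, next_reading) pairs that are further apart than expected_step."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > expected_step]


def describe_gaps(gaps, cause="to be described"):
    """Turn gaps into plain records that can be published next to the data."""
    return [
        {"from": a.isoformat(), "to": b.isoformat(), "cause": cause}
        for a, b in gaps
    ]


# Hypothetical hourly readings with two missing measurements (12:00 and 13:00).
readings = [
    datetime(2023, 5, 4, 10, 0),
    datetime(2023, 5, 4, 11, 0),
    datetime(2023, 5, 4, 14, 0),
    datetime(2023, 5, 4, 15, 0),
]

gaps = find_gaps(readings, timedelta(hours=1))
gap_notes = describe_gaps(gaps)
```

The cause field is left as a placeholder on purpose: only the person who collected the data can say why a reading is missing.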
2. Deleting sensitive data
Data that allow the respondents under study to be identified should be removed or transformed, using one of two techniques:
- anonymization - converting personal data so that they can no longer be attributed to an identified or identifiable respondent
- pseudonymization - converting data so that they cannot be attributed to a specific respondent without the use of additional information
Reversibility is the basic feature distinguishing pseudonymization from anonymization. Anonymization is an irreversible process, whereas pseudonymization is reversible.
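The difference can be illustrated with a minimal sketch. It assumes that each record carries a single direct identifier in a field called respondent_id; that field name, the pseudonymize function and the sample records are hypothetical. Identifiers are replaced with random codes, and the key table linking codes to respondents is the "additional information" that makes the process reversible.

```python
import secrets


def pseudonymize(records, id_field="respondent_id"):
    """Replace direct identifiers with random codes.

    Returns the transformed records and the key table. The key table must be
    stored separately from the shared data.
    """
    key_table = {}  # original identifier -> pseudonym
    used_codes = set()
    transformed = []
    for record in records:
        original = record[id_field]
        if original not in key_table:
            code = "P-" + secrets.token_hex(4)
            while code in used_codes:  # extremely unlikely, but keep codes unique
                code = "P-" + secrets.token_hex(4)
            used_codes.add(code)
            key_table[original] = code
        cleaned = dict(record)  # copy, so the input records stay untouched
        cleaned[id_field] = key_table[original]
        transformed.append(cleaned)
    return transformed, key_table


# Hypothetical survey answers; the e-mail address acts as the identifier.
records = [
    {"respondent_id": "anna.kowalska@example.org", "answer": 4},
    {"respondent_id": "jan.nowak@example.org", "answer": 2},
    {"respondent_id": "anna.kowalska@example.org", "answer": 5},
]

shared, key_table = pseudonymize(records)
```

As long as key_table exists, the codes can be traced back to respondents. Destroying it removes that route, although the remaining fields may still need checking for combinations that identify a person before the data can be treated as anonymized.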
3. Choosing file formats
Data should be published in a widely available format that does not require commercial software and uses a standard encoding (ASCII, UTF-8). It is worth considering the formats already established in your discipline, so that other users are not forced to convert the data and risk losing quality.
- Text, Documentation, Scripts: XML, PDF/A, HTML, Plain Text
- Still Image: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG (digital negative), BMP, GIF
- Geospatial: Shapefile (SHP, DBF, SHX), GeoTIFF, NetCDF
- Graphic Image:
raster formats: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG, BMP, GIF
vector formats: Scalable Vector Graphics (SVG), AutoCAD Drawing Interchange Format (DXF), Encapsulated PostScript (EPS), Shapefile
- Cartographic: Most complete data, GeoTIFF, GeoPDF, GeoJPEG2000, Shapefile
- Audio: WAVE, AIFF, MP3, MXF, FLAC
- Video: MOV, MPEG-4, AVI, MXF
- Database: XML, CSV, TAB
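For tabular data, exporting to CSV in UTF-8 needs nothing beyond a standard library. The following sketch uses Python's csv module; the function name to_utf8_csv and the sample rows are hypothetical and only show the idea.

```python
import csv
import io


def to_utf8_csv(rows, fieldnames):
    """Serialise a list of dicts as CSV and return the result as UTF-8 bytes."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


# Hypothetical measurements; non-ASCII site names survive the UTF-8 encoding.
rows = [
    {"site": "Łódź", "temperature_c": 12.5},
    {"site": "Kraków", "temperature_c": 11.0},
]

csv_bytes = to_utf8_csv(rows, ["site", "temperature_c"])
```

The resulting bytes can be written to a file as they are and opened later with any text editor or spreadsheet program that reads UTF-8.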
4. Giving proper names to folders and files
First, try to answer a few questions: Which file names and what structure would be most useful if I wanted to use the files myself? What should the names include so that a specific set of data can be found without any problems? Consistency is the basic rule of organising files: once adopted, the structure and naming convention should be maintained throughout.
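A naming convention is easier to keep when names are generated rather than typed by hand. The sketch below assumes one possible pattern, project_description_YYYYMMDD_vNN.ext; the pattern itself, the function build_file_name and the sample values are hypothetical, and your discipline or repository may recommend something different.

```python
import re
from datetime import date


def slug(text):
    """Lowercase the text and replace every run of other characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_file_name(project, description, recorded_on, version, extension):
    """Compose a name following the pattern project_description_YYYYMMDD_vNN.ext."""
    return "{}_{}_{}_v{:02d}.{}".format(
        slug(project),
        slug(description),
        recorded_on.strftime("%Y%m%d"),
        version,
        extension.lower().lstrip("."),
    )


name = build_file_name("Soil Survey", "pH measurements", date(2023, 5, 4), 2, ".CSV")
print(name)  # soil-survey_ph-measurements_20230504_v02.csv
```

Writing the date as YYYYMMDD and padding the version number with zeros means that an alphabetical listing of the folder is also a chronological one.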
5. Describing datasets with proper metadata
Data should be described so that they can be indexed, found and reused.
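A description can be as simple as a structured record stored next to the data. The sketch below borrows a few element names from Dublin Core (title, creator, date, description, subject, format, rights). Which fields are required is an assumption made for illustration, as are the names REQUIRED_FIELDS and missing_fields and the sample values; the repository you deposit in will have its own schema.

```python
import json

# Assumed minimum set of fields; a real repository defines its own requirements.
REQUIRED_FIELDS = ("title", "creator", "date", "description", "format", "rights")


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]


# Hypothetical record; 'rights' is left empty to show the check at work.
record = {
    "title": "Soil pH measurements, example survey",
    "creator": "Example Research Group",
    "date": "2023-05-04",
    "description": "Hypothetical field measurements used to illustrate a metadata record.",
    "subject": ["soil", "pH"],
    "format": "text/csv",
    "rights": "",
}

problems = missing_fields(record)
print(problems)  # ['rights']

# Plain JSON in UTF-8 keeps the description readable without special software.
metadata_json = json.dumps(record, ensure_ascii=False, indent=2)
```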
Did you know?
Some datasets need to be edited or cleaned. They can contain errors in, among other things, spelling or grammar, or in nominal values (inconsistent use of terms). Identifiers of objects or contexts (e.g. 'catalogue no.' or 'place#') should be free from errors too. If you need a tool to 'clean' the gathered data, you can use, e.g., OpenRefine.
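To give an idea of what such cleaning involves, the sketch below groups spellings that probably refer to the same term by reducing each value to a normalised key. It is a simplified illustration of key-based grouping, not the implementation of OpenRefine or any other tool; the names fingerprint and find_variants and the sample values are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a value to a key that ignores case, accents, punctuation and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accents
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return " ".join(sorted(set(text.split())))


def find_variants(values):
    """Return groups of distinct values that share the same fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical place names as they might appear in a raw dataset.
places = ["Kraków", "krakow", "Krakow ", "Nowa Huta", "Huta, Nowa", "Gdańsk"]

variants = find_variants(places)
for group in variants:
    print(group)
```

Each printed group is only a candidate for merging. Whether the variants really mean the same thing, and which spelling to keep, remains a decision for the researcher.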