UTUGuides: Research Data Management (the lifecycle of research data): Documentation, description and metadata

Instructions for Documentation

Content of good documentation
Readme-files

Well-organized and documented data is easy to use, share, open, preserve, and reuse. Documentation involves describing the methods, structure, and handling of the data. Keep documentation up to date throughout the research process. Documenting data only after the research has ended is significantly more difficult, and sometimes impossible.

Good data documentation enables:

Findability
Accessibility
Understandability
Evaluability
Long-term preservation
Reusability

Different disciplines have various documentation practices, which should be followed. At its simplest, a readme file describing the overall content is created with the data.

Good documentation includes:

Data collection methods: sampling, data collection process, used devices, and software
Quality assurance methods
File and folder structure
Version control
Access and usage conditions or confidentiality information
Names, identifiers, and descriptions of variables, datasets, and values
Explanation or definition of codes and classification systems used
Definitions of technical terms and abbreviations used
Codes for missing values and their reasons

Source: Fuchs, S., Koivula, H., Korhonen, T., Lindholm, T., Rauste, P., & Siipilehto, L. (2023, May 17). Data Organisation ABC workshop - Datan Organisoinnin ABC työpaja. Zenodo. https://doi.org/10.5281/zenodo.7944449

A ReadMe file binds the components of the data set together.

It brings together:

Connections between separate files
Data collection methods
Data quality information
Purpose of use
Limitations

The ReadMe file records documentation generated during data handling and information related to data quality. It also provides instructions for reusing the data.
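As an illustration, the ReadMe contents listed above could be turned into a reusable plain-text skeleton. The following is a minimal sketch, not a prescribed format: the function name `build_readme`, the section layout, and the example dataset title are all assumptions made for this example.

```python
# Section headings taken from the ReadMe description above;
# the layout of the generated file is an assumption.
README_SECTIONS = [
    "Connections between separate files",
    "Data collection methods",
    "Data quality information",
    "Purpose of use",
    "Limitations",
    "Instructions for reuse",
]


def build_readme(title: str, sections: dict[str, str]) -> str:
    """Return ReadMe text with one block per section.

    Sections without content are flagged so that gaps stay visible
    while the documentation is built up during the research process.
    """
    lines = [title, "=" * len(title), ""]
    for heading in README_SECTIONS:
        body = sections.get(heading, "").strip() or "TODO: not yet documented"
        lines += [heading, "-" * len(heading), body, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical dataset title and content, for illustration only.
    print(build_readme(
        "Example survey dataset",
        {"Purpose of use": "Collected for a hypothetical pilot study."},
    ))
```

Regenerating the file whenever a section is filled in keeps the ReadMe in step with the data, and the TODO markers show what is still undocumented.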

Examples from the University of Jyväskylä.

Cornell University examples.

Description and Metadata

Describing research materials is a part of the research process and helps others understand your research and data. Metadata, or data about data, is part of the research description and is typically the portion of the research material that can be openly found and used. Often, funding agencies require the disclosure of descriptive information about research materials. The easiest and most cost-effective way to produce descriptive metadata is step by step throughout the data's lifecycle.

The University of Turku currently does not have a dedicated service for publishing research metadata. Metadata is stored in a suitable repository or data service. The Finnish tool Qvain is recommended for describing data; datasets described with it can be found in Etsin. Metadata entered in Qvain is also transferred to the Finnish Metax metadata repository. Datasets can also be described directly through Metax's interface (Metax REST API).

Metadata can also be published in many general or discipline-specific data archives or repositories. However, not all of them allow metadata to be separated from the data itself. Many data archives use metadata standards or schemas that should be followed from the beginning of data description.

Instructions for data description and metadata

Metadata and descriptive information should be collected during the research process. Describing data post-research is often more laborious.

High-quality metadata serves as a researcher's calling card for their study. Metadata includes information about the data's:

Name
Production date
Producer
Format
Subject
Usage rights

Different disciplines have established practices for describing data and marking metadata. However, it is important that the basics are described regardless of the field, promoting the findability, accessibility, interoperability, and reusability of research data in line with FAIR principles. Descriptions can be stored as text files or using an appropriate metadata format.

Guidelines for storing metadata by the Data Archive:

Each research dataset should have its own directory, containing both the data and its descriptive information. Some descriptive details are included within the main data file (e.g., variable explanations or unit information), but most are stored in separate descriptive files.

According to the Finnish Social Science Data Archive, a research data description should include these elements:

Description of how the study was conducted
- Crucial information includes the original purpose of the data, the creators/principal investigators, the producers/funders, the population/universe covered, the unit of analysis (units of observation), the selection criteria of the study population, the data source, and how the data were collected, as well as any publications related to the data.
Instrument of data collection
- The data collection instrument, such as an interview form or the questions asked, is stored in all languages used.
Description of data files
- All properties of a single file should be described. It is recommended that the following aspects are documented for each file:
  - name of the file
  - file location (file path)
  - file size
  - file format
  - software used to create the file
  - date of creation
  - file creator
  - file version
  - access rights set for the file
Description of variables
- The details of variables are described as accurately as possible. Additionally, information about data processing and any changes made is included. Some of the information may be directly in the data file, while some is in the descriptive metadata. The variable description includes:
  - number of variables and units of observation
  - list of variables with the name and label of each variable as well as its location in the file and its values and value labels
  - frequency distribution of each variable
  - information on the classifications used, e.g. "main categories of the ISCO-88 were used in the occupational classification" or "country codes: 3-digit ISO 3166"
  - meanings of abbreviations used
  - codings for missing data
  - information on constructed variables (e.g. how the weight variables and sum variables were calculated)
  - recoding and standardising of variables
  - data protection measures taken
Availability information
- In the descriptive metadata, provide information on how the data is accessible and where it is stored.
Contextual information and paradata
- The descriptive metadata also includes information about the external conditions, as they are a prerequisite for the reuse and understanding of the research data.
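Several of the file properties listed under "Description of data files" can be collected automatically. The sketch below assumes the dataset lives in a single directory; the function names and column names are hypothetical. It records the last-modified time rather than a creation date, because a true creation date is not available portably across operating systems. The creator, the software used, the version, and the access rights still have to be added by hand.

```python
import csv
import io
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column names for the file inventory.
COLUMNS = ["name", "path", "size_bytes", "format", "modified_utc"]


def describe_files(root: Path) -> list[dict[str, str]]:
    """Collect basic properties for every file below the dataset directory."""
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        rows.append({
            "name": path.name,
            "path": path.relative_to(root).as_posix(),
            "size_bytes": str(stat.st_size),
            # File extension used as a rough stand-in for the file format.
            "format": path.suffix.lstrip(".").lower() or "unknown",
            "modified_utc": modified.isoformat(timespec="seconds"),
        })
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialise the inventory as CSV text so it can be stored with the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(describe_files(Path("."))))
```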

When describing research data, the aim is discoverability and usability, so the description should be as consistent and machine-readable as possible and should rely on existing standards and schemas wherever possible.
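Parts of the variable description listed earlier (number of variables and units of observation, frequency distributions, codings for missing data) can be produced in machine-readable form directly from the data. The following is a rough sketch for a small CSV file; the missing-data code `-9` and the function name are assumptions chosen for this example, not a recommended coding.

```python
import csv
import io
from collections import Counter

# Hypothetical missing-data coding; document the real codes used in the data.
MISSING_CODES = {"-9": "no answer"}


def summarise_variables(csv_text: str) -> dict:
    """Summarise variables: counts, frequency distributions, missing values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    variables = list(reader.fieldnames or [])
    summary = {
        "n_variables": len(variables),
        "n_units": len(rows),  # units of observation = data rows
        "variables": {},
    }
    for name in variables:
        values = [row[name] for row in rows]
        summary["variables"][name] = {
            "frequencies": dict(Counter(values)),
            "n_missing": sum(value in MISSING_CODES for value in values),
        }
    return summary


if __name__ == "__main__":
    example = "id,gender\n1,1\n2,2\n3,-9\n4,1\n"
    print(summarise_variables(example))
```

Value labels, classifications, and the derivation of constructed variables cannot be inferred from the data and still need to be written down by the researcher.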

There are numerous metadata standards, some of which are highly specific to particular disciplines. Researchers should use the standard of their own discipline. Lists of metadata standards are available in the DCC list and the Metadata Standards Catalog.

The most commonly used standards are Dublin Core and DataCite. Widely used metadata standards typically distinguish mandatory from optional fields; DataCite, for example, has both.

Note! A data repository or archive may also require a specific metadata standard. If you already know at the beginning of your research the data repository you intend to use for preserving and sharing your research data, collect the metadata according to the metadata standard used by the data repository.

Dublin Core and DataCite

The Dublin Core metadata format is standardized in SFS-ISO 15836 (Information and documentation). Part 1, SFS-ISO 15836-1:2020, covers the core elements, and Part 2, 15836-2:2020, defines the properties and classes identified by the Dublin Core community. Dublin Core has 15 core elements; in the standard itself all of them are optional and repeatable, although a repository or archive may require some of them. The content of the elements and other guidelines can be found, for example, from Dublin Core or Paladini.
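To make the 15 core elements concrete, the sketch below writes a simple Dublin Core record as XML with Python's standard library. The element names and the namespace `http://purl.org/dc/elements/1.1/` come from the Dublin Core element set; the wrapper element `metadata`, the function name, and the example values are assumptions for illustration only. A repository may expect a different wrapper or profile.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core core element set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def dublin_core_xml(fields: dict[str, list[str]]) -> str:
    """Serialise a record as XML; each element may be repeated."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    print(dublin_core_xml({
        "title": ["Example survey dataset"],
        "creator": ["Researcher, Example"],
        "date": ["2024"],
        "rights": ["CC BY 4.0"],
    }))
```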

While DataCite's schema doesn't have the status of an official standard, its usage is highly controlled. DataCite consists of 20 elements. The entire DataCite schema and guidelines can be found here.

Examples of DataCite XML-formatted metadata can be found here.

The national research data description tool, Qvain

Qvain is an easy-to-use tool for creating metadata for research datasets. Using Qvain does not require the research data to be in the IDA service, but the two are easy to link. Once a dataset has been described in Qvain, it can be found through the Etsin tool, from which the metadata is harvested to various services and platforms.

Guidance for using Qvain.

Check out CSC's video on publishing your data in Fairdata with Qvain.

Qvain requires certain information from all described research datasets:

Data source (i.e., where the data is located)
License and access rights (the license defines how the data in the dataset can be used; access rights indicate how the research dataset can be accessed, which may be restricted)
Title
Description and other basic information (description in Markdown syntax, including details such as publication date, keywords, discipline, language)
Actors (individuals and organizations)

Qvain can also include information such as:

Publications and other outputs related to the dataset
Geographical area
Time period
Infrastructure
Historical data and events
Project and funding

Metadata as free text

Metadata can also be produced informally, as long as the information is in a machine-readable format.

Important information includes:

Title of the dataset
Authors with their roles
Other data collectors with their roles and affiliations
Discipline or field of study
Funders
Purpose of the research dataset, i.e., basic project information
Time period
Description of the collected data and methods used
Quantity of the dataset
Description of files (file name, format, content)
Accessibility of the dataset for modification
Any publications based on the dataset

Examples of informal metadata:

Harvard: https://datamanagement.hms.harvard.edu/collect-analyze/documentation-metadata/readme-files
Cornell: https://data.research.cornell.edu/data-management/sharing/readme/
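One way to keep informal metadata machine-readable is to store it as a small JSON file next to the data. The sketch below is only an illustration: the key names and all values are hypothetical, and JSON is one possible format among several.

```python
import json


def to_machine_readable(record: dict) -> str:
    """Serialise informal metadata as JSON text (UTF-8 friendly, stable key order)."""
    return json.dumps(record, indent=2, ensure_ascii=False, sort_keys=True)


# Hypothetical record covering some of the items listed above.
record = {
    "title": "Example survey dataset",
    "authors": [{"name": "Example Researcher", "role": "principal investigator"}],
    "discipline": "Social sciences",
    "funders": ["Example Foundation"],
    "purpose": "Pilot study on a hypothetical topic",
    "time_period": "2023-01 to 2023-06",
    "files": [
        {"name": "survey.csv", "format": "CSV", "content": "Survey responses"},
    ],
    "publications": [],
}

if __name__ == "__main__":
    print(to_machine_readable(record))
```

Because the text can be parsed back with `json.loads`, the same file can later be mapped to a repository's metadata form or schema.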

The most well-known repositories use the following metadata standards or schemas:

| Repository | Standard/Schema | What else? |
| --- | --- | --- |
| Zenodo | DataCite | Mandatory fields: Publication date, title, authors, description, access right, license |
| Figshare | DataCite | Mandatory fields: Item title, item type, authors, categories, keywords, description, license |
| IDA/Qvain | Fairdata Metax data model | The mandatory fields in Qvain are: License, description of the dataset, title, publication date, keywords, author (individual or organization), and publisher (individual or organization). |
| Dryad | Dublin Core, DataCite, OAI-ORE, RDF Data Cube | Mandatory fields: Journal name; Title; Author(s); Abstract; Research domain; Keyword(s) |
| Pangaea | Darwin Core, Dublin Core, ISO 19115, DIF | Mandatory fields: Event; Expedition; PI; Author(s); Data set title; Reference(s); Method; Abstract |
| BOLD system | | BOLD = Barcode of Life Data System. E.g. for a photo the mandatory fields are: Image file; Original specimen; View metadata; Sample ID; License; License year; License contact. |
| Finnish Social Science Data Archive | DDI, Data Documentation Initiative | Mandatory fields: Data creator or collector's name, response on informing participants, dataset name and brief description, dataset size, reporter's name, background organization, and email. |
| EUDAT CDI B2SHARE / EUDAT B2SHARE | EUDAT Core and Extended schema | |
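Before depositing, a draft description can be checked against the mandatory fields listed for a repository above. The sketch below copies the field lists for Zenodo, Figshare, and IDA/Qvain from the list above; the dictionary layout, the lowercase key names, and the function name are assumptions, and the repositories' own deposit forms remain the authoritative source.

```python
# Mandatory fields as listed above; key names are lowercase by assumption.
MANDATORY_FIELDS = {
    "Zenodo": [
        "publication date", "title", "authors",
        "description", "access right", "license",
    ],
    "Figshare": [
        "item title", "item type", "authors", "categories",
        "keywords", "description", "license",
    ],
    "IDA/Qvain": [
        "license", "description", "title", "publication date",
        "keywords", "author", "publisher",
    ],
}


def missing_fields(repository: str, metadata: dict) -> list[str]:
    """Return the mandatory fields that are absent or empty in the draft."""
    return [
        field for field in MANDATORY_FIELDS[repository]
        if not metadata.get(field)
    ]


if __name__ == "__main__":
    # Hypothetical draft metadata, for illustration only.
    draft = {
        "title": "Example survey dataset",
        "authors": ["Example Researcher"],
        "description": "Responses to a hypothetical pilot survey.",
        "license": "CC BY 4.0",
    }
    for repository in MANDATORY_FIELDS:
        print(repository, "is missing:", missing_fields(repository, draft))
```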

FSD's archiving services use the DDI format in XML.


The DDI format supports the Data Archive's goal of preserving and archiving research datasets collected for studying Finnish society, people, and cultural phenomena.

According to the DDI format, the following aspects are described as clearly as possible in the metadata:

Author(s) of the research
Topic and content of the research
Selection or sampling method of the dataset
Data collection process
Unit of observation/dataset unit
Terms of use
File format(s)
Variables in quantitative datasets, number of variables
Question texts in survey questionnaires
Key documents in qualitative datasets (interview questions, call for papers, etc.).

For additional information and detailed instructions, please visit:

https://www.fsd.tuni.fi/en/services/depositing-data/ddi/

International instructions: