Skip to content

Tabular phenotypic data guidelines

This appendix is a collection of guidelines and examples for curating well-organized tabular phenotypic data.

Guidelines

These guidelines are intended to improve the organization and clarity of tabular phenotypic data like the participants file, sessions file, and phenotypic and assessment data.

They are recommendations and are by default ignored during validation. You can make them mandatory during validation by setting the AdditionalValidation key contains "Phenotype" in the dataset_description.json.

1. Aggregate data across sessions

Aggregate participant information across all sessions into one tabular TSV file per measurement or phenotypic assessment and store this file in the /phenotype directory. Demographic information is a special case and SHOULD be aggregated in the participants.tsv file at the root level of the dataset. It is RECOMMENDED to use the age column in the participants.tsv file to record participant age at every session in longitudinal or multi-session data sets.

2. Always pair tabular data with data dictionaries

Tabular phenotypic data MUST be prepared as one pair of a tabular file in tab-separated value (TSV) format and a corresponding data dictionary in JavaScript Object Notation (JSON) format. See the Tabular files section for more information.

3. Add MeasurementToolMetadata to each tabular phenotypic measurement tool

Whenever possible, it is RECOMMENDED to add MeasurementToolMetadata to each phenotype/<measurement_tool_name>.json data dictionary. This improves reusability and provides clarity about the measurement tool. See MeasurementToolMetadata in the glossary for more.

4. Ensure minimal annotation for phenotypic and assessment data

In phenotypic and assessment data, each measurement tool SHOULD have an independent aggregated data TSV file in which the user collects all subjects, sessions, and/or runs of data as one entry per row (with a row defined by the smallest unit of acquisition). In other words:

  • Each row MUST start with participant_id.

  • Each TSV file MUST contain a session_id column when multiple sessions1 are present in the data set regardless of whether those sessions are in the phenotype/ data, sub-<label>/ data, or a combination of the two.

  • If a measurement tool is acquired multiple times within a single session, a run_id column MUST be added to disambiguate the separate acquisitions.

  • A measurement tool’s acquisition time SHOULD be stored in the sessions.tsv file at the root-level of the dataset in the acq_time column.

Column name Requirement Level Data type Description
participant_id REQUIRED string A participant identifier of the form sub-<label>, matching a participant entity found in the dataset. Note that data for one participant MAY be represented across multiple rows in case of multiple sessions or runs, and therefore the entry in the participant_id column will be repeated.

The combination of participant_id, session_id and run_id MUST be unique.

This column must appear first in the file.
session_id OPTIONAL, but REQUIRED if sessions are defined in the dataset string A session identifier of the form ses-<label>, matching a session found in the dataset. A session_id column MUST be added to all tabular files in the phenotype directory as soon as multiple sessions are present in the data set regardless of whether those sessions are in the phenotype/ data, sub-<label>/ data, or a combination of the two.

The combination of participant_id, session_id and run_id MUST be unique.

This column must appear second in the file.
run_id OPTIONAL, but REQUIRED if there are multiple runs within any session string A run identifier that corresponds to an existing run-<index> entity used in a filename(s). A chronological run number is used when a measurement tool or assessment described by a tabular file was repeated within a session.

The combination of participant_id, session_id and run_id MUST be unique.

This column must appear third in the file.
HED OPTIONAL string Hierarchical Event Descriptor (HED) tags. See the HED Appendix for details.

This column may appear anywhere in the file.
Additional Columns OPTIONAL n/a Additional columns are allowed.

Furthermore, if you have to add a session_id column to the tabular phenotypic data, you then MUST also introduce a session directory to the imaging data, even if only one imaging session has been created. This guideline can be considered as "if anyone uses sessions, everyone uses sessions." And vice versa, if imaging data has session directories, all imaging data and tabular phenotypic data MUST have sessions.

This produces a file in which same-participant entries can take up as many rows as needed according to the smallest unit of acquisition.

5. Store demographic data in the participants file and instrument data in the phenotype directory

The participants file is for demographic data about the participant, including longitudinal information such as age. The phenotypic and assessment data directory is for phenotypic measurement instruments collected about the participants such as questionnaires, surveys, and cognitive assessments. Create one tabular file for each instrument in the phenotypic and assessment data directory.

6. Record participant properties in the participants file and session properties in the sessions file

Since the same participant_id and session_id columns can be used similarly in the participants file and the sessions file, use the two different files to instead differentiate properties of participants versus sessions. Properties of participants MAY include things like age, sex, race, or household income. Properties of sessions MAY include things like acquisition time, measurement device properties, and indoor or outdoor experimental conditions.

7. Use the sessions file at the root-level

If there is more than one session for any one participant, then it is RECOMMENDED to provide a sessions file at the dataset root. The sessions file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The data dictionary JSON file’s session_id field MUST include Levels with the description of each session_id.

8. Record acquisition time of all sessions with acq_time

It is RECOMMENDED to store acquisition time2 for tabular phenotypic data and store the time of acquisition of each row inside a column named acq_time in the sessions file. This is consistent with how acquisition time is recorded for MRI data and other time-sensitive measurements (for example systolic blood pressure).

Summary

This appendix described guidelines for best tabular phenotypic data. In summary, it is RECOMMENDED to always use the participants file and separate files by measurement instrument in the phenotypic and assessment data directory, since they each collect different information. If you use sessions, then the sessions file is also RECOMMENDED.

Examples

What follows are a few common use case examples for tabular phenotypic files.

1 participant session with both non-tabular and tabular phenotypic data

File tree

├─ phenotype/
│  ├─ measurement_tool.json 
│  └─ measurement_tool.tsv 
└─ sub-01/
   └─ anat/
      ├─ sub-01_T1w.json 
      └─ sub-01_T1w.nii.gz 

Contents of phenotype/measurement_tool.tsv

tsv participant_id measurement_1 measurement_2 sub-01 value1 value2

1 participant with 2 sessions, where 1 session is only tabular phenotype and the other is only imaging

With only one imaging and one phenotypic session each in this example you might want to merge both imaging and phenotypic data under one session. But it is more correct to have separate sessions for the imaging and phenotypic data, especially if the sessions were collected days, weeks, or months apart. You can denote both sessions and their acquisition time in the sessions.tsv file and have session_id Levels noted in the sessions.json sidecar. Below are a CORRECT and an INCORRECT example of prepared data following these guidelines.

CORRECT

File tree

├─ sessions.json 
├─ sessions.tsv 
├─ phenotype/
│  ├─ measurement_tool.json 
│  └─ measurement_tool.tsv 
└─ sub-01/
   └─ ses-MRI/
      └─ anat/
         ├─ sub-01_ses-MRI_T1w.json 
         └─ sub-01_ses-MRI_T1w.nii.gz 

Contents of sessions.tsv

tsv participant_id session_id acq_time sub-01 ses-pheno 2001-01-01T12:05:00 sub-01 ses-MRI 2001-03-01T13:14:00

Contents of phenotype/measurement_tool.tsv

tsv participant_id session_id measurement_1 measurement_2 sub-01 ses-pheno value1 value2

INCORRECT

File tree

├─ phenotype/
│  ├─ measurement_tool.json 
│  └─ measurement_tool.tsv 
└─ sub-01/
   └─ anat/
      ├─ sub-01_T1w.json 
      └─ sub-01_T1w.nii.gz 

Contents of phenotype/measurement_tool.tsv

tsv participant_id measurement_1 measurement_2 sub-01 value1 value2

A session directory MUST be present in the participant directory and the session_id column MUST be present in phenotype/measurement_tool.tsv as well. Sessions must be used consistently for the combination of tabular and non-tabular phenotypic data.

2 participants with a mix of tabular phenotypic data and imaging sessions

In this example, participants acquired both a phenotypic measurement tool and an MRI during ses-MRI1. sub-01 has a ses-MRI2 with no phenotypic measurement tool acquired and sub-02 has a ses-pheno where no MRI was acquired.

File tree

├─ sessions.json 
├─ sessions.tsv 
├─ phenotype/
│  ├─ measurement_tool.json 
│  └─ measurement_tool.tsv 
├─ sub-01/
│  ├─ ses-MRI1/
│  │  └─ anat/
│  │     ├─ sub-01_ses-MRI1_T1w.json 
│  │     └─ sub-01_ses-MRI1_T1w.nii.gz 
│  └─ ses-MRI2/
│     └─ anat/
│        ├─ sub-01_ses-MRI2_T1w.json 
│        └─ sub-01_ses-MRI2_T1w.nii.gz 
└─ sub-02/
   └─ ses-MRI1/
      └─ anat/
         ├─ sub-02_ses-MRI1_T1w.json 
         └─ sub-02_ses-MRI1_T1w.nii.gz 

Contents of sessions.tsv

tsv participant_id session_id acq_time sub-01 ses-MRI1 2001-01-01T11:12:00 sub-01 ses-MRI2 2001-07-01T13:14:00 sub-02 ses-MRI1 2001-01-181T15:16:00 sub-02 ses-pheno 2001-02-20T12:05:00

Contents of phenotype/measurement_tool.tsv

tsv participant_id session_id measurement_1 measurement_2 sub-01 ses-MRI1 value1 value2 sub-02 ses-MRI1 value3 value4 sub-02 ses-pheno value5 value6

3 participants with 3 different kinds of sessions among them

The ses-baseline session collects an MRI and tabular phenotypic data.

File tree

├─ participants.json 
├─ participants.tsv 
├─ sessions.json 
├─ sessions.tsv 
├─ phenotype/
│  ├─ survey.json 
│  └─ survey.tsv 
├─ sub-01/
│  ├─ ses-baseline/ 
│  └─ ses-followupMRI/ 
├─ sub-02/
│  └─ ses-baseline/ 
└─ sub-03/
   ├─ ses-baseline/ 
   └─ ses-followupMRI/ 

Contents of participants.tsv. Participant properties that can change from session to session belong here especially.

tsv participant_id session_id sex age gender race household_income sub-01 ses-baseline M 10 3 4 5 sub-01 ses-followupMRI M 10 3 4 5 sub-01 ses-interview M 11 4 4 6 sub-02 ses-baseline F 9 1 3 3 sub-02 ses-interview F 10 1 7 3 sub-03 ses-baseline F 11 2 10 4 sub-03 ses-followupMRI F 12 5 10 4

Contents of sessions.tsv.

tsv participant_id session_id acq_time sub-01 ses-baseline 2001-01-01T12:05:00 sub-01 ses-followupMRI 2001-07-01T13:33:00 sub-01 ses-interview 2002-01-01T11:21:00 sub-02 ses-baseline 2001-04-01T11:01:00 sub-02 ses-interview 2002-04-01T14:08:00 sub-03 ses-baseline 2001-09-01T11:45:00 sub-03 ses-followupMRI 2002-03-01T12:17:00

Contents of sessions.json. Note how the session_id Levels are clearly described.

{
    "participant_id": {
        "Description": "BIDS participant identifier"
    },
    "session_id": {
        "Description": "BIDS session identifier",
        "Levels": {
            "ses-baseline": "Baseline visit for MRI and assessments",
            "ses-followupMRI": "6-months after baseline MRI follow-up",
            "ses-interview": "1-year after baseline in-person follow-up"
        }
    },
    "acq_time": {
        "Description": "When the data acquisition started"
    }
}

Contents of phenotype/survey.tsv. Note how sub-03 does not have a row for ses-interview because that session was not collected and is absent above in the participants.tsv and sessions.tsv files.

tsv participant_id session_id question_1 question_2 question_3 sub-01 ses-baseline A 2 no sub-01 ses-interview A 3 yes sub-02 ses-baseline A 2 no sub-02 ses-interview B 1 unsure sub-03 ses-baseline B 3 no

For more complete examples, see the pheno00* bids-examples on GitHub.

Footnotes

1 A session is any logical grouping of imaging and behavioral data consistent across participants. Session can (but doesn't have to) be synonymous to a visit in a longitudinal study. In situations where different data types are obtained over several visits (for example fMRI on one day followed by DWI the day after) those can still be grouped in one session. Refer to the definition of session for more details.

2 Datetime format and the anonymization procedure are described in Units.