Tabular phenotypic data guidelines

This appendix is a collection of guidelines and examples for curating well-organized tabular phenotypic data.

Guidelines

These guidelines are intended to improve the organization and clarity of tabular phenotypic data like the participants file, sessions file, and phenotypic and assessment data.

They are recommendations and are by default ignored during validation. You can make them mandatory during validation by setting the AdditionalValidation key to contain "Phenotype" in the dataset_description.json.

1. Aggregate data across sessions

Aggregate participant information across all sessions into one tabular TSV file per measurement or phenotypic assessment and store this file in the /phenotype directory. Demographic information is a special case and SHOULD be aggregated in the participants.tsv file at the root level of the dataset. It is RECOMMENDED to use the age column in the participants.tsv file to record participant age at every session in longitudinal or multi-session data sets.

2. Always pair tabular data with data dictionaries

Tabular phenotypic data MUST be prepared as one pair of a tabular file in tab-separated value (TSV) format and a corresponding data dictionary in JavaScript Object Notation (JSON) format. See the Tabular files section for more information.

3. Add `MeasurementToolMetadata` to each tabular phenotypic measurement tool

Whenever possible, it is RECOMMENDED to add MeasurementToolMetadata to each phenotype/<measurement_tool_name>.json data dictionary. This improves reusability and provides clarity about the measurement tool. See MeasurementToolMetadata in the glossary for more.

4. Ensure minimal annotation for phenotypic and assessment data

In phenotypic and assessment data, each measurement tool SHOULD have an independent aggregated data TSV file in which the user collects all subjects, sessions, and/or runs of data as one entry per row (with a row defined by the smallest unit of acquisition). In other words:

Each row MUST start with participant_id.
Each TSV file MUST contain a session_id column when multiple sessions ¹ are present in the data set.
If a measurement tool is acquired multiple times within a single session, a run_id column MUST be added to disambiguate the separate acquisitions.
A measurement tool’s acquisition time SHOULD be stored in the sessions.tsv file at the root-level of the dataset in the acq_time column.

Column name	Requirement Level	Data type	Description
participant_id	REQUIRED	string	A participant identifier of the form `sub-<label>`, matching a participant entity found in the dataset. Note that data for one participant MAY be represented across multiple rows in case of multiple sessions or runs, and therefore the entry in the `participant_id` column will be repeated. The combination of `participant_id`, `session_id` and `run_id` MUST be unique. This column must appear first in the file.
session_id	OPTIONAL	string	A session identifier of the form `ses-<label>`, matching a session found in the dataset. A `session_id` column MUST be added to all tabular files in the phenotype directory as soon as multiple sessions are present in the data set. The combination of `participant_id`, `session_id` and `run_id` MUST be unique. This column must appear second in the file.
run_id	OPTIONAL, but REQUIRED if there are multiple runs within any session	string	A run identifier that corresponds to an existing `run-<index>` entity used in a filename(s). A chronological `run` number is used when a measurement tool or assessment described by a tabular file was repeated within a session. The combination of `participant_id`, `session_id` and `run_id` MUST be unique. This column must appear third in the file.
HED	OPTIONAL	string	Hierarchical Event Descriptor (HED) tags. See the HED Appendix for details. This column may appear anywhere in the file.
Additional Columns	OPTIONAL	`n/a`	Additional columns are allowed.

Furthermore, if you add a session_id column to the tabular phenotypic data, you MUST introduce a session directory to the imaging data, even if only one imaging session has been created. And vice versa, if imaging data has session directories, all imaging data and tabular phenotypic data MUST have sessions.

This produces a file in which same-participant entries can take up as many rows as needed according to the smallest unit of acquisition.

5. Store demographic data in the participants file and instrument data in the phenotype directory

The participants file is for demographic data about the participant, including longitudinal information such as age. The phenotypic and assessment data directory is for phenotypic measurement instruments collected about the participants such as questionnaires, surveys, and cognitive assessments. Create one tabular file for each instrument in the phenotypic and assessment data directory.

6. Record participant properties in the participants file and session properties in the sessions file

Since the same participant_id and session_id columns can be used similarly in the participants file and the sessions file, use the two different files to instead differentiate properties of participants versus sessions. Properties of participants MAY include things like age, sex, race, or household income. Properties of sessions MAY include things like acquisition time, measurement device properties, and indoor or outdoor experimental conditions.

7. Use the sessions file at the root-level

If there is more than one session for any one participant, then it is REQUIRED to provide a sessions file at the dataset root, even if this file only contains a two-column inventory of participant_id and session_id. The sessions.tsv file MUST list all sessions for all subjects across imaging and tabular phenotypic data. The sessions.json file’s session_id field MUST include "Levels" with the description of each session_id.

8. Record acquisition time of all sessions with `acq_time`

It is RECOMMENDED to store acquisition time² for tabular phenotypic data and store the time of acquisition of each row inside a column named acq_time in the sessions file. This is consistent with how acquisition time is recorded for MRI data and other time-sensitive measurements (for example systolic blood pressure).

Summary

This appendix described guidelines for best tabular phenotypic data. In summary, it is RECOMMENDED to always use the participants file and separate files by measurement instrument in the phenotypic and assessment data directory, since they each collect different information. If you use sessions, then the sessions file is also RECOMMENDED.

Examples

What follows are a few common use case examples for tabular phenotypic files.

1 participant session with both non-tabular and tabular phenotypic data

File tree

├─ phenotype/
│  ├─ measurement_tool.json 
│  └─ measurement_tool.tsv 
└─ sub-01/
   └─ anat/
      ├─ sub-01_T1w.json 
      └─ sub-01_T1w.nii.gz

Contents of phenotype/measurement_tool.tsv

participant_id	measurement_1	measurement_2
sub-01	value1	value2

1 participant with 2 sessions, where 1 session is only tabular phenotype and the other is only imaging

With only one imaging and one phenotypic session each in this example you might want to merge both imaging and phenotypic data under one session. But it is more correct to have separate sessions for the imaging and phenotypic data, especially if the sessions were collected days, weeks, or months apart. You can denote both sessions and their acquisition time in the sessions.tsv file and have session_id Levels noted in the sessions.json sidecar. Below are a CORRECT and an INCORRECT example of prepared data following these guidelines.

CORRECT

File tree

├─ sessions.json 
├─ sessions.tsv 
├─ phenotype/
│  ├─ measurement_tool.json 
│  └─ measurement_tool.tsv 
└─ sub-01/
   └─ ses-MRI/
      └─ anat/
         ├─ sub-01_ses-MRI_T1w.json 
         └─ sub-01_ses-MRI_T1w.nii.gz

Contents of sessions.tsv

participant_id	session_id	acq_time
sub-01	ses-pheno	2001-01-01T12:05:00
sub-01	ses-MRI	2001-03-01T13:14:00

Contents of phenotype/measurement_tool.tsv

participant_id	session_id	measurement_1	measurement_2
sub-01	ses-pheno	value1	value2

INCORRECT

File tree

├─ phenotype/
│  ├─ measurement_tool.json 
│  └─ measurement_tool.tsv 
└─ sub-01/
   └─ anat/
      ├─ sub-01_T1w.json 
      └─ sub-01_T1w.nii.gz

Contents of phenotype/measurement_tool.tsv

participant_id	measurement_1	measurement_2
sub-01	value1	value2

A session directory MUST be present in the participant directory and the session_id column MUST be present in phenotype/measurement_tool.tsv as well. Sessions must be used consistently for the combination of tabular and non-tabular phenotypic data.

2 participants with a mix of tabular phenotypic data and imaging sessions

In this example, participants acquired both a phenotypic measurement tool and an MRI during ses-MRI1. sub-01 has a ses-MRI2 with no phenotypic measurement tool acquired and sub-02 has a ses-pheno where no MRI was acquired.

File tree

├─ sessions.json 
├─ sessions.tsv 
├─ phenotype/
│  ├─ measurement_tool.json 
│  └─ measurement_tool.tsv 
├─ sub-01/
│  ├─ ses-MRI1/
│  │  └─ anat/
│  │     ├─ sub-01_ses-MRI1_T1w.json 
│  │     └─ sub-01_ses-MRI1_T1w.nii.gz 
│  └─ ses-MRI2/
│     └─ anat/
│        ├─ sub-01_ses-MRI2_T1w.json 
│        └─ sub-01_ses-MRI2_T1w.nii.gz 
└─ sub-02/
   └─ ses-MRI1/
      └─ anat/
         ├─ sub-02_ses-MRI1_T1w.json 
         └─ sub-02_ses-MRI1_T1w.nii.gz

Contents of sessions.tsv

participant_id	session_id	acq_time
sub-01	ses-MRI1	2001-01-01T11:12:00
sub-01	ses-MRI2	2001-07-01T13:14:00
sub-02	ses-MRI1	2001-01-181T15:16:00
sub-02	ses-pheno	2001-02-20T12:05:00

Contents of phenotype/measurement_tool.tsv

participant_id	session_id	measurement_1	measurement_2
sub-01	ses-MRI1	value1	value2
sub-02	ses-MRI1	value3	value4
sub-02	ses-pheno	value5	value6

3 participants with 3 different kinds of sessions among them

The ses-baseline session collects an MRI and tabular phenotypic data.

File tree

├─ participants.json 
├─ participants.tsv 
├─ sessions.json 
├─ sessions.tsv 
├─ phenotype/
│  ├─ survey.json 
│  └─ survey.tsv 
├─ sub-01/
│  ├─ ses-baseline/ 
│  └─ ses-followupMRI/ 
├─ sub-02/
│  └─ ses-baseline/ 
└─ sub-03/
   ├─ ses-baseline/ 
   └─ ses-followupMRI/

Contents of participants.tsv. Participant properties that can change from session to session belong here especially.

participant_id	session_id	sex	age	gender	race	household_income
sub-01	ses-baseline	M	10	3	4	5
sub-01	ses-followupMRI	M	10	3	4	5
sub-01	ses-interview	M	11	4	4	6
sub-02	ses-baseline	F	9	1	3	3
sub-02	ses-interview	F	10	1	7	3
sub-03	ses-baseline	F	11	2	10	4
sub-03	ses-followupMRI	F	12	5	10	4

Contents of sessions.tsv.

participant_id	session_id	acq_time
sub-01	ses-baseline	2001-01-01T12:05:00
sub-01	ses-followupMRI	2001-07-01T13:33:00
sub-01	ses-interview	2002-01-01T11:21:00
sub-02	ses-baseline	2001-04-01T11:01:00
sub-02	ses-interview	2002-04-01T14:08:00
sub-03	ses-baseline	2001-09-01T11:45:00
sub-03	ses-followupMRI	2002-03-01T12:17:00

Contents of sessions.json. Note how the session_id Levels are clearly described.

{
    "participant_id": {
        "Description": "BIDS participant identifier"
    },
    "session_id": {
        "Description": "BIDS session identifier",
        "Levels": {
            "ses-baseline": "Baseline visit for MRI and assessments",
            "ses-followupMRI": "6-months after baseline MRI follow-up",
            "ses-interview": "1-year after baseline in-person follow-up"
        }
    },
    "acq_time": {
        "Description": "When the data acquisition started"
    }
}

Contents of phenotype/survey.tsv. Note how sub-03 does not have a row for ses-interview because that session was not collected and is absent above in the participants.tsv and sessions.tsv files.

participant_id	session_id	question_1	question_2	question_3
sub-01	ses-baseline	A	2	no
sub-01	ses-interview	A	3	yes
sub-02	ses-baseline	A	2	no
sub-02	ses-interview	B	1	unsure
sub-03	ses-baseline	B	3	no

For more complete examples, see the pheno00* bids-examples on GitHub.

Footnotes

¹ A session is any logical grouping of imaging and behavioral data consistent across participants. Session can (but doesn't have to) be synonymous to a visit in a longitudinal study. In situations where different data types are obtained over several visits (for example fMRI on one day followed by DWI the day after) those can still be grouped in one session. Refer to the definition of session for more details.

² Datetime format and the anonymization procedure are described in Units.