Classify "phenotype/" as a datatype directory with no subject/session parent #1828

Open
effigies opened this issue May 20, 2024 · 6 comments
Labels
discussion ongoing discussion phenotype

Comments

@effigies
Collaborator

phenotype/ is a bit of an outlier in BIDS terms. Other folders at the top level are either entities (sub-<label>) or their contents are opaque to BIDS. phenotype/, on the other hand, is a collection of .tsv/.json files that are to be validated on the same terms as participants.tsv/participants.json.

I would suggest classifying phenotype as a datatype, distinct from other datatypes only in that it spans multiple subjects, and so the subject and session entities do not apply. BEP036 seems to go some way in a similar direction, permitting a pheno/ datatype within subjects/sessions.

In the (unmerged) PR #1672, I suggest using phenotype as a datatype for the purposes of filename validation, and then carve out some exceptions that allow us to use it that way without it being an official datatype. If we make it a datatype, then the exception can be removed. That it fits with very little modification to the schema and validation (bids-standard/bids-validator#1957), seems to me to be an argument for this classification.

The alternative, as I see it, is to consider phenotype a completely unique category of thing, and all implementations will need to have special code for handling it.
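
To make the classification concrete, here is a minimal Python sketch of path parsing in which phenotype is treated as a datatype that is valid at the dataset root. The `classify` helper, the two regular expressions, and the `ROOT_LEVEL_DATATYPES` set are hypothetical names made up for illustration; they are not taken from the BIDS schema or bids-validator.

```python
import re
from pathlib import PurePosixPath

# Hypothetical patterns and names for illustration only.
SUBJECT_DIR = re.compile(r"^sub-[A-Za-z0-9]+$")
SESSION_DIR = re.compile(r"^ses-[A-Za-z0-9]+$")

# Assumption: phenotype is the only datatype permitted without a subject parent.
ROOT_LEVEL_DATATYPES = {"phenotype"}


def classify(path):
    """Return (subject, session, datatype) for a dataset-relative path."""
    dirs = list(PurePosixPath(path).parts[:-1])  # directory components only
    subject = session = None
    if dirs and SUBJECT_DIR.match(dirs[0]):
        subject = dirs.pop(0)
        if dirs and SESSION_DIR.match(dirs[0]):
            session = dirs.pop(0)
    datatype = dirs[0] if dirs else None
    if subject is None and datatype not in ROOT_LEVEL_DATATYPES:
        datatype = None  # other top-level directories stay opaque
    return subject, session, datatype


print(classify("phenotype/questionnaire.tsv"))
# (None, None, 'phenotype')
print(classify("sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz"))
# ('sub-01', 'ses-01', 'anat')
```

In this sketch the only special handling is the membership test on `ROOT_LEVEL_DATATYPES`; everything after that point could treat phenotype files like files of any other datatype whose subject and session entities happen to be absent.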

@effigies
Collaborator Author

@ericearl @surchs I would appreciate your opinions on this. I think this would complement BEP036, in that subject- or session-specific phenotypes would naturally go inside a sub-<label>/[ses-<label>/]phenotype/ directory.

@surchs
Contributor

surchs commented May 21, 2024

Thanks @effigies for the ping!

I would suggest classifying phenotype as a datatype, distinct from other datatypes only in that it spans multiple subjects, and so the subject and session entities do not apply. BEP036 seems to go some way in a similar direction, permitting a pheno/ datatype within subjects/sessions.

Yes, we did discuss this option as the "segregated" representation, i.e. "put pheno data in the leaves of the tree" or as you say: "treat pheno as any other data type"

In the current version of BEP036 I am (we are?) leaning toward excluding the "segregated" option in favour of the "aggregated" option of a root level /phenotype directory. For me there are three arguments for going with "aggregated" over "segregated":

  • we probably don't want to have two distinct places / recommendations for pheno data, as this would be confusing / require additional tooling to ensure consistency and interoperability
  • if we have to pick one, let's pick the one that is closest to how people already store and handle phenotypic data. In my experience, people acquire, store, and curate phenotypic data in big tabular files - sometimes in parallel to acquiring imaging data (see e.g. bids-examples#432, "Example pheno dataset with irregular sessions for BEP tests", from @nikhil153)
  • under the "segregated" model, each phenotypic file would be pretty "atomic". E.g. if I collect questionnaire.tsv for subjects and sessions, then sub-01/ses-01/phenotype/questionnaire.tsv and sub-01/ses-02/phenotype/questionnaire.tsv each contain only one data row (from one "acquisition"). That makes it more challenging to ensure consistency of column names and formatting across these files - something we already struggle with in big single tabular files
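
To illustrate the consistency concern in the last point, here is a rough Python sketch of the extra check that the "segregated" layout would need: comparing the header row of every per-session file against a reference. The function names and the example file contents are invented for illustration and do not come from any existing BIDS tool.

```python
import csv
import io


def header_of(tsv_text):
    """Return the column names from the first line of a TSV string."""
    return next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))


def inconsistent_headers(files):
    """Map each path whose header differs from the first file's header to that header."""
    items = list(files.items())
    if not items:
        return {}
    reference = header_of(items[0][1])
    return {
        path: header_of(text)
        for path, text in items[1:]
        if header_of(text) != reference
    }


# Made-up example data: the second file capitalises a column name.
files = {
    "sub-01/ses-01/phenotype/questionnaire.tsv": "participant_id\tscore\nsub-01\t12\n",
    "sub-01/ses-02/phenotype/questionnaire.tsv": "participant_id\tScore\nsub-01\t14\n",
}
print(inconsistent_headers(files))
# {'sub-01/ses-02/phenotype/questionnaire.tsv': ['participant_id', 'Score']}
```

With a single aggregated questionnaire.tsv there is only one header row, so this particular class of mismatch cannot occur.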

I think this would complement BEP036, in that subject- or session-specific phenotypes would naturally go inside a sub-<label>/[ses-<label>/]phenotype/ directory.

I can see how treating phenotype like any other datatype would make things easier from a BIDS perspective. But at the moment, /phenotype at the root level is allowed in the BIDS spec. Should we allow both the root level phenotype directory and the sub-<label>/[ses-<label>/]phenotype/ directory, and in the same dataset? I guess not many of the BIDS datasets I have seen make use of the root level phenotype directory yet. So if we only allow sub-<label>/[ses-<label>/]phenotype/, that might work. But I think that'll be quite the hill to climb for users who want to organize their data in BIDS, and also for users who want to later do analysis on someone else's data and first have to aggregate things again.

Maybe I'm overcautious here - but in my experience phenotypic data can be the messiest part of a dataset and are often acquired / handled by non-technical people in a research team. So I'm a bit concerned about transforming data and storing them in a way that makes it easy for hard-to-detect inconsistencies to sneak in.

I think @barbarastrasser, you also commented on BEP036 about this topic because of your use cases, maybe you could add your thoughts too.

@effigies
Collaborator Author

I am not presently trying to make phenotype valid within subject, just to determine if it is a datatype. That it allows us the possibility of enabling it at lower levels if the use cases are compelling seems like an argument that it is that kind of thing. Saying so would not obligate us to define files with this datatype that show up in subject/session directories.

We cannot disable it at the root level in BIDS 1.x, in any case.

@surchs
Contributor

surchs commented May 22, 2024

Ah OK, guess I misread your question.

distinct from other datatypes only in that it spans multiple subjects, and so the subject and session entities do not apply

So you are proposing to turn phenotype into a BIDS datatype, but a special one that (for now) only exists at the directory root - unlike other BIDS datatypes that only exist in the sub-<label>/[ses-<label>/] directories, yes?

If we make it a datatype, then the exception can be removed. That it fits with very little modification to the schema and validation (bids-standard/bids-validator#1957), seems to me to be an argument for this classification.

I'm not very familiar with the BIDS schema or what the implication of such a change would be. From looking at your PR, my limited understanding is that the proposed change allows you to do some general checks for .tsv and .json files in a root level /phenotype directory. That makes sense to me, especially if it reduces special cases.

Maybe @ericearl would be better here to give feedback.

@effigies
Collaborator Author

So you are proposing to turn phenotype into a BIDS datatype, but a special one that (for now) only exists at the directory root - unlike other BIDS datatypes that only exist in the sub-<label>/[ses-<label>/] directories, yes?

Correct.

@barbarastrasser

Hi everyone,

Maybe first some high-level thoughts on the aggregated vs. segregated approach:

I think it depends a bit on how you look at the data. Is the aim to describe a participant in depth (maybe also interesting when looking up imaging and pheno data across datasets) or is the aim to describe a dataset in depth? For the former, it might be easier if everything that is collected is structured the same segregated way (especially when thinking about designing software for automatic querying etc.). For the latter, a phenotype directory in the dataset root should be sufficient, in my view.

But I also understand the user perspective. I agree that the aggregated format is the way people acquire phenotype data most of the time, and that it might be easier for them to handle than storing single rows, which is error-prone.

However, an issue I have witnessed with the current specification is that it is not flexible enough to satisfy the needs of researchers. I know that there are individual efforts going on to split the aggregated data row-wise and store each single line in the sub-<label>/[ses-<label>/]beh directory since there is no other possibility to store the data in a BIDS-compliant way.

The specific problems we encountered were that the way validation is currently handled does not allow for

  • storing pheno-only data (since there has to be a subject directory for every entry in the participants.tsv file)

  • storing varying questionnaires. Here, not every participant had to complete the same questionnaires, and therefore not all participant_id values listed in the participants.tsv are in each <measurement_tool>.tsv.

  • storing longitudinal phenotype data (e.g. the same questionnaire collected in all sessions), for which there is currently no recommendation.
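
As a rough illustration of the flexibility the first two points ask for, the Python sketch below compares the participant_id values of a phenotype file against participants.tsv and reports the two kinds of difference separately, so that a tool could treat "not measured" as acceptable while still flagging IDs that are unknown to participants.tsv. The function names, the result keys, and the example data are hypothetical, and this is not how the current validator behaves.

```python
import csv
import io


def participant_ids(tsv_text):
    """Collect the distinct participant_id values from a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["participant_id"] for row in reader}


def compare_ids(participants_tsv, phenotype_tsv):
    """Report IDs missing on either side, without deciding which is an error."""
    listed = participant_ids(participants_tsv)
    measured = participant_ids(phenotype_tsv)
    return {
        "unknown": sorted(measured - listed),       # in phenotype file only
        "not_measured": sorted(listed - measured),  # in participants.tsv only
    }


# Made-up example: sub-03 never completed the questionnaire, and sub-01
# has two rows (e.g. two sessions), which the set-based comparison tolerates.
participants = "participant_id\tage\nsub-01\t30\nsub-02\t41\nsub-03\t25\n"
questionnaire = "participant_id\tscore\nsub-01\t12\nsub-01\t14\nsub-02\t9\n"
print(compare_ids(participants, questionnaire))
# {'unknown': [], 'not_measured': ['sub-03']}
```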

I think people will use the phenotype directory more, whether it is in the subject directory or in the root directory, as long as there is flexibility to deal with cases like the ones described above.
