plugin idea: automatic metadata annotation #15

Open
cmungall opened this issue Dec 10, 2016 · 5 comments

@cmungall
Member

Reproducibility and provenance are increasingly important.

Makefiles and Makefile-like solutions such as biomake help with reproducibility: if the recipe and input files are provided in a GitHub repo, then in theory it is easy to re-execute the workflow and, hopefully, get the same answer.

However, if the final output files are submitted to a data repository, the provenance may not be immediately obvious. Initiatives such as BD2K are emphasizing the importance of metadata on all digital objects, which includes analysis results. Of course it is possible to annotate these artefacts manually, but why do that when it can be automated?

For any file derived by biomake, it should be possible to immediately see a graph of the objects used to derive it, together with complete metadata on each. This includes standard filesystem metadata (e.g. timestamps) as well as additional metadata. See also https://github.com/W3C-HCLSIG/HCLSDatasetDescriptions

This may be a heavyweight feature, so it may be best implemented as some kind of plugin.

@cmungall cmungall self-assigned this Dec 10, 2016
@ihh
Member

ihh commented Dec 10, 2016

It shouldn't be that hard to record metadata: just look for where the MD5 hash is updated and stick another hook in there.
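As a rough illustration of what such a hook might record, here is a minimal Python sketch. It is not biomake code (biomake itself is written in Prolog); the function name `file_metadata` and the record layout are assumptions made purely for illustration. It captures the MD5 digest alongside basic filesystem metadata at the point where a hash would be computed.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_metadata(path):
    """Return a metadata record for one file (hypothetical record layout)."""
    # Hash the file contents, as a build tool would when checking for changes.
    with open(path, "rb") as handle:
        digest = hashlib.md5(handle.read()).hexdigest()
    # Collect standard filesystem metadata in the same pass.
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "path": os.path.abspath(path),
        "md5": digest,
        "size_bytes": info.st_size,
        "modified": modified.isoformat(),
    }
```

A record like this could be appended to a log each time a target is rebuilt, giving the raw material for a provenance graph.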

@cmungall
Member Author

cmungall commented Oct 9, 2017

James Taylor suggests using PROV: https://twitter.com/jxtx/status/916406694674132992

@cmungall
Member Author

I am thinking of making a start on this very soon, using PROV-O as the vocabulary. This is also used by projects like wf4ever.

The basic model has 3 classes: entity, agent and activity.
[image: diagram of the PROV-O entity, agent and activity classes]

I think the primary agent would be biomake itself, with an acted-on-behalf-of edge to the person executing the workflow. The entity would be the file, and the activity would be the makefile recipe/rule.
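To make the mapping concrete, here is a hedged Python sketch that writes the proposed structure as Turtle for a single target. The `prov:` terms are from PROV-O, but the base IRI `http://example.org/biomake-run/`, the IRI layout, and the function names are hypothetical and not part of biomake. Strings are assembled by hand to keep the example dependency-free; a real implementation would more likely use an RDF library.

```python
from urllib.parse import quote

PROV = "http://www.w3.org/ns/prov#"
BASE = "http://example.org/biomake-run/"  # hypothetical base IRI


def iri(kind, name):
    """Build an IRI reference for a named resource (hypothetical layout)."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"


def stmt(subject, pairs):
    """Format one Turtle statement from (predicate, object) pairs."""
    body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
    return f"{subject} {body} ."


def prov_turtle(target, deps, user):
    """Describe one rule execution: files as entities, the rule as an activity."""
    agent = iri("agent", "biomake")
    person = iri("person", user)
    activity = iri("rule", target)
    dep_iris = [iri("file", d) for d in deps]
    blocks = [
        f"@prefix prov: <{PROV}> .",
        # biomake is the primary agent, acting on behalf of the user.
        stmt(agent, [("a", "prov:SoftwareAgent"),
                     ("prov:actedOnBehalfOf", person)]),
        stmt(person, [("a", "prov:Person")]),
        # The rule execution is the activity; it used each dependency.
        stmt(activity, [("a", "prov:Activity"),
                        ("prov:wasAssociatedWith", agent)]
             + [("prov:used", d) for d in dep_iris]),
        # The target file is an entity generated by that activity.
        stmt(iri("file", target), [("a", "prov:Entity"),
                                   ("prov:wasGeneratedBy", activity)]
             + [("prov:wasDerivedFrom", d) for d in dep_iris]),
    ]
    blocks += [stmt(d, [("a", "prov:Entity")]) for d in dep_iris]
    return "\n\n".join(blocks) + "\n"
```

Calling `prov_turtle("out.txt", ["in.txt"], "alice")` yields one activity linked to the biomake agent, with `out.txt` generated by it and derived from `in.txt`.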

The primary output would be RDF/Turtle, but we could also have JSON (as well as a native Prolog representation). Having some kind of dot/Graphviz export should also be simple.
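A dot export could be as small as the following Python sketch. It assumes the dependency graph is available as a plain mapping from each target to its inputs; that mapping and the names `to_dot` and `quote_id` are illustrative assumptions, not biomake's internal representation.

```python
def quote_id(name):
    """Quote a name as a DOT ID; only embedded double quotes need escaping."""
    return '"' + name.replace('"', '\\"') + '"'


def to_dot(dependencies):
    """Render {target: [inputs]} as a Graphviz digraph, edges input -> target."""
    lines = ["digraph provenance {", "  rankdir=LR;"]
    for target in sorted(dependencies):
        for dep in sorted(dependencies[target]):
            lines.append(f"  {quote_id(dep)} -> {quote_id(target)};")
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be passed to Graphviz, for example with `dot -Tsvg`, to draw the derivation graph.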

@cmungall
Member Author

Another possibility here is allowing the user to easily generate a bagit or bagit-ro for their folder once the workflow is executed.
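For reference, a very small bag can be produced with the standard library alone. The sketch below is an assumption-laden illustration rather than a proposal for biomake's interface: it copies a finished workflow folder into `data/`, writes the `bagit.txt` declaration, and writes an MD5 payload manifest in the layout described by the BagIt specification. The function name `make_bag` is hypothetical, and it does not attempt the additional research-object metadata that bagit-ro would need.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(src_dir, bag_dir):
    """Copy src_dir into a new minimal BagIt bag at bag_dir (sketch only)."""
    bag = Path(bag_dir)
    data = bag / "data"
    # The payload lives under data/; copytree creates bag_dir as needed.
    shutil.copytree(Path(src_dir), data)
    # Bag declaration file required by the BagIt layout.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Payload manifest: one "<checksum>  <path relative to bag>" line per file.
    lines = []
    for f in sorted(p for p in data.rglob("*") if p.is_file()):
        digest = hashlib.md5(f.read_bytes()).hexdigest()
        lines.append(f"{digest}  {f.relative_to(bag).as_posix()}\n")
    (bag / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

The manifest reuses the same MD5 digests a make-like tool already computes, so bag creation could share that bookkeeping.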

@ihh
Member

ihh commented Jan 18, 2018

@cmungall I like this, especially how clean the mapping to PROV-O is: I think most/all of those things in the diagram are already being calculated at some point in biomake.

cmungall added a commit that referenced this issue Jan 23, 2018
Default interceptor is a persistent store that logs actions
as unit clauses.

This could be extended to provide a complete workflow record,
as specified in #15