NF Support annex.private for clone/get and create. #7247
Punctuation, grammar
This adds a new reckless mode, "private", to clone (and, by extension, get). When used, it will set annex.private (git config option) before running `git annex init`. Manpage for git-annex says: "When this is set to true, no information about the repository will be recorded in the git-annex branch". Private mode is intended mainly for creating temporary clones (which are changed, pushed back into origin, and dropped) without cluttering `git-annex:uuid.log` -- repository information gets stored in `.git/annex/journal-private/` rather than in the git-annex branch. The git-annex branch still exists, and is used to record things not related to the private repository. See: https://git-annex.branchable.com/tips/cloning_a_repository_privately/ The change to `clone` also affects `get`, because `get` uses `clone_dataset()`. Mechanisms for storing reckless mode and inheriting it in subdatasets were already in place.
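The ordering described in this commit message (set `annex.private` first, then initialize the annex) can be sketched as a small helper that only builds the git command lines. This is a minimal sketch, not DataLad's code: the function name and its `private` argument are hypothetical, and the commands follow the git-annex tip linked above.

```python
from typing import List


def annex_init_commands(private: bool) -> List[List[str]]:
    """Return the git commands to run inside a fresh clone, in order.

    Hypothetical helper for illustration; it builds argument lists only
    and does not execute anything.
    """
    commands: List[List[str]] = []
    if private:
        # The option has to be in place before `git annex init` runs,
        # otherwise the repository is already recorded in the git-annex branch.
        commands.append(["git", "config", "annex.private", "true"])
    commands.append(["git", "annex", "init"])
    return commands
```

Each returned list could be passed to `subprocess.run(..., cwd=clone_path, check=True)` by a caller that wants to execute the steps.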
This will make --reckless private similar to other reckless clone modes (auto and shared-* use untrust, ephemeral uses dead). Note that with annex.private, the information goes into `.git/annex/journal-private/trust.log`. Personally, I suppose that in the context of private mode, using untrust is not necessary (other clones won't have information about the private one), but can act as an additional safeguard (for number of copies) when issuing a drop-from-elsewhere command within the private repository.
This adds a new parameter, "private", to the create command. When set, it will set the annex.private configuration option before calling git annex init. When the option is set, the created dataset won't record information about itself in the git-annex branch (but the branch will still exist and record e.g. information about other repositories). This can be useful for creating temporary datasets which will be discarded after being published, or for privacy reasons.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #7247 +/- ##
==========================================
+ Coverage 90.61% 91.60% +0.98%
==========================================
Files 325 325
Lines 43405 43467 +62
Branches 0 5830 +5830
==========================================
+ Hits 39333 39816 +483
+ Misses 4072 3636 -436
- Partials 0 15 +15
If the parent dataset is given explicitly to the create command, and it uses annex private mode, the created subdataset will also use private mode. This is similar to how clone --reckless works. Note: the current code allows overriding one way: datalad create -d super --private sub (sub will be private regardless of super), but not the other way (if super is private, it will take precedence even when setting private=False). However, I don't see much reason for such a setup, and I'd rather keep private as a True/False flag.
When using create with --private, also set datalad.clone.reckless so that cloning subdatasets into the dataset created this way defaults to using --reckless private.
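The two commit messages above (inheriting private mode from an explicit parent, and setting `datalad.clone.reckless`) can be read as one small decision. The sketch below is an illustration under assumptions: the function name is hypothetical, and it assumes `datalad.clone.reckless` is also set when private mode is inherited rather than requested, which the messages do not state explicitly.

```python
from typing import Dict


def private_create_config(private: bool, parent_private: bool) -> Dict[str, str]:
    """Return the config entries a private `create` would set (hypothetical).

    `private` is the flag given to create; `parent_private` says whether an
    explicitly given parent dataset already uses annex private mode.
    """
    # A private parent takes precedence, so private=False cannot override it.
    effective = private or parent_private
    if not effective:
        return {}
    return {
        # Must be set before `git annex init` is called.
        "annex.private": "true",
        # Assumption: applied for inherited private mode as well, so that
        # later clones into this dataset default to `--reckless private`.
        "datalad.clone.reckless": "private",
    }
```

Keeping `private` a plain boolean, as the commit message prefers, is what makes the `or` sufficient here; a three-state flag would be needed to let a subdataset opt out of a private parent.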
This should also cover get, because all that's relevant happens in create. Note that checking the presence of "annex.private" in config is not sufficient. To make a repository private, this option must be in place before git annex init is called. So to be sure, we also check the effects of this config, i.e. presence of "journal-private" & absence of new entries in git-annex:uuid.log.
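The check described above (look at the effects, not just the config value) might look roughly like the following. This is a sketch and not the PR's test code: the function name is hypothetical, the caller is expected to supply the text of `git-annex:uuid.log`, and it assumes each log line starts with the repository UUID.

```python
from pathlib import Path


def appears_private(repo: Path, uuid: str, uuid_log: str) -> bool:
    """Heuristic check that a repository was initialized in private mode.

    repo     -- path to the repository worktree
    uuid     -- annex UUID of this repository
    uuid_log -- content of uuid.log from the git-annex branch (may be empty)
    """
    # Private repositories keep their own records in a separate journal.
    has_private_journal = (repo / ".git" / "annex" / "journal-private").is_dir()
    # Assumption: every non-empty uuid.log line begins with a UUID.
    logged = {line.split()[0] for line in uuid_log.splitlines() if line.strip()}
    return has_private_journal and uuid not in logged
```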
Force-pushed from 8694600 to 13c276c.
Looking briefly at the test errors:
Thanks very much, @mslw! Sorry for missing this one.
Should the reckless mode be named "private" or maybe "temporary"?
I'm not sure either. Mostly b/c there's overlap with `ephemeral`. If `origin` isn't on the same file system and therefore can't be linked, then both modes are identical (in intention - implementation should follow).
However, since `ephemeral` is used already, I think adding `temporary` is confusing. So, `private`, and thus being a little more specific than other mode names, is the best I can see.
Re `untrust`:
the only value is as an additional safeguard for issuing a command to drop from elsewhere when working in a private repository.
Good point, I think. But since we likely need to declare dead anyway to guard against older annex, `untrust` in addition seems superfluous.
On that note: DataLad's minimum version for annex doesn't matter for this. A repo can be "touched" by git-annex w/o any DataLad being involved.
Should create also set datalad.clone.reckless config option so that clones made into the created private dataset respect that mode by default? Should create also check that config in potential parent datasets?
Yes, agree. Thanks!
Looking through the diff ...
Thanks for looking into it, I am really curious about your opinion. This seems like a small change in principle, but there are some interactions and decisions along the way (see e.g. commit message for 24fd32d).
See, but here's something I'm concerned about. I noticed a similar logic in the code of And if it wouldn't observe these settings, then there's little we can do about it other than make users aware of that. Both ephemeral and private setups exist for a good reason.
I think this warrants a ping to #6847 and #7232 because this PR uses a relatively new annex feature (though approaching 2 years old now). As this is an additional feature for This probably warrants a check / warning in the relevant commands though, that I didn't initially propose? Please advise. |
Good catch, I missed that! You're right, then, I suppose - we can't do a lot about it and
Setting private while knowing that installed annex can't respect it, doesn't seem to make sense to me. Further operation on that dataset would lead to obviously undesired results (hence second option above as an alternative). Note, that 1 isn't actually mutually exclusive with 3. It's probably 1+3 or 2. 3 could seem superfluous in that case, but I think having that protection at a lower level despite the intention to fail early is the safer approach WRT later code changes. |
So far, implementation looks good to me.
Not approving yet, b/c of unresolved question above.
A single lookup for annex.private is changed to use getbool instead of a literal comparison. This will return True for "true", "1", and the like (KeyError is handled if the key is not present). Suggested-by: Benjamin Poldrack <[email protected]>
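The difference between a literal comparison and a boolean lookup can be illustrated with a standalone helper. This is a sketch under assumptions: `config_bool` is a hypothetical function working on a plain mapping, not DataLad's `getbool`, and it only covers the common spellings git accepts for booleans.

```python
from typing import Mapping

_TRUE = {"true", "yes", "on", "1"}
_FALSE = {"false", "no", "off", "0", ""}


def config_bool(config: Mapping[str, str], key: str, default: bool = False) -> bool:
    """Interpret a git-style config value as a boolean (hypothetical helper)."""
    try:
        raw = config[key]
    except KeyError:
        # A missing key is not an error; fall back to the default.
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"not a boolean config value for {key}: {raw!r}")
```

With a literal comparison such as `config.get("annex.private") == "true"`, a value of `"1"` or `"True"` would wrongly be treated as unset.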
Thanks @bpoldrack for taking a closer look. Just a quick note that I didn't forget about this PR, but I focused on other things recently. I will make changes addressing your review (especially the thing when annex version doesn't support the private mode) at some point by the end of the week. |
Sorry, I ran out of steam. Will be unavailable in the coming week. @bpoldrack do you think you can take over? |
Yes, no worries. |
This introduces a version kludge indicating whether private mode is supported by git-annex and adjusts tests for clone and create to account for it. Furthermore, this patch makes `create` and `clone` fail if private mode was requested by command parameter but isn't supported by the installed git-annex. This seems preferable over a fallback solution like `git annex dead here`, since the behavior is not the same and whether such an approach suits the use case is up to the user to decide.
There's one deviation from failing early when private mode isn't supported: when the setting comes from the `datalad.clone.reckless` setting in a superdataset. In this case we fail only when actually trying to set this. The rationale for this is: at the point in time we read such a config, we don't even know whether we are creating/cloning an annex repo to begin with (could be plain git). To fail when we could succeed, while the instruction for private mode only comes from a general setting rather than by telling the executed command directly, seems wrong. Failing to `clone --reckless private` a git repo seems OK, because it can be rectified by the command call that doesn't need the option to begin with. But `clone -d.` a git repo while `datalad.clone.reckless` is set to private in the superdataset seems different, because a user would need to change the general setup for a config that isn't even supposed to have an effect on the dataset in question. It may arguably be rare to ever run into this situation, but we should be aware that users may have good reasons for different execution environments (incl. different annex versions) while working on the same datasets on the filesystem (where the config for the superdataset is likely to come from).
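The fail-early rule and its one exception can be sketched as follows. All names here are hypothetical and not taken from the patch, and the minimum git-annex version is deliberately a parameter rather than a hard-coded value, since the exact threshold is not stated in this conversation.

```python
from typing import Tuple


class PrivateModeUnsupported(RuntimeError):
    """Raised when private mode is explicitly requested but unavailable."""


def parse_annex_version(version: str) -> Tuple[int, ...]:
    """Turn e.g. '10.20230126-g1234abc' into (10, 20230126)."""
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def private_mode_usable(installed: str, minimum: str, from_parameter: bool) -> bool:
    """Decide how to react to a request for private mode.

    Returns True when the installed git-annex supports private mode.
    Raises if the request came from a command parameter and is unsupported.
    Returns False if the request only came from `datalad.clone.reckless`
    in a superdataset; the caller then fails later, and only if it really
    has to set annex.private on an annex repository.
    """
    supported = parse_annex_version(installed) >= parse_annex_version(minimum)
    if supported:
        return True
    if from_parameter:
        raise PrivateModeUnsupported(
            f"git-annex {installed} does not support annex.private"
        )
    return False
```

Deferring the failure in the config case keeps plain-git clones working on machines with an older git-annex, which is the scenario the commit message is concerned about.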
Force-pushed from fc91360 to 601e7a9.
@mslw do you think anything was left to be done for this PR? |
I think that with 601e7a9 by @bpoldrack this is complete, and I would be very happy to see it make part of the next minor release. Sorry for not making it clear. It seems that there are plans to raise the minimal required git-annex version with the next minor release, too (e.g #7232) - if it goes above |
Code Climate has analyzed commit da07f45 and detected 1 issue on this pull request. Here's the issue category breakdown:
View more on Code Climate. |
I fixed a merge conflict; at the moment travis & appveyor tests pass. I want to reiterate that I consider this PR complete. |
To me this PR looks good, and the failing benchmarks are already fixed outside of this PR. I'd really like to have this PR merged! I have played around with the new functions in toy repos, and they work fine and as I expect judging from the git-annex docs. I haven't had real use cases for private mode, though - maybe a second pair of eyes can spot something I might have missed?
- Support for annex private mode.
  A new `--reckless private` mode was added to the `clone`, `get` & `install` commands; a new `--private` option was added to `create`.
  Using private mode will configure the new clone with `annex.private=true`, meaning that the clone won't store any information about itself in the git-annex branch (the branch will still exist and contain information about other clones).
  This mode can be used for temporary clones, where changes are pushed (e.g. back to origin), and the temporary clone is promptly discarded.
- Ideally we would like to harmonize how `private` is specified across commands. But not yet sure if `--private` or `--reckless`.
- For `reckless` -- can't it be a combination of `shared-` and `private`, or `ephemeral` and `private`, or some other, in particular reckless, mode if we do want to quickly make some change in a clone to push back?
Ideally we would like to harmonize how private is specified across commands. But not yet sure if --private or --reckless
I'm trying to remember what was the reason I chose the current way, but can't. Probably just following the suggestion in #6456 to reuse the existing switch for clone, and then not feeling comfortable with calling the create operation "reckless".
For reckless -- can't it be a combination of shared- and private, or ephemeral and private, or some other, in particular reckless, mode if we do want to quickly make some change in a clone to push back?
Umm... For ephemeral & private: that wouldn't change anything because clone ephemeral already sets annex.private (which is the only DataLad usage of annex.private outside this PR). For shared & private, and yet-to-be-invented and private: I did not think of it, but I suppose it would technically be possible...
FWIW, I agree that it is useful to reuse the `--reckless` switch, especially when the combination `--reckless ephemeral --private` becomes possible but doesn't make sense. And I agree it makes sense to call it `--private` instead of reckless for `create`.
With datalad-next, this is how it would be done:
(see the What it does not do is setting Update: Linking datalad/datalad-next#534 (comment) that has the conclusion: on the fly setting is OK. |
Description of changes
This adds a new reckless mode, called "private", to `clone` (and, implicitly, `get` and `install`), and an analogous "private" option to `create`. When used, it will set the `annex.private` config option to `true` before `git annex init` is called. As a consequence, information about the cloned / created dataset will not be recorded in the git-annex branch (the branch will still exist, and track information about other repositories). Closes #6456
Private mode can be useful when working with temporary clones that are discarded after pushing (clone, make changes, push back to origin, discard clone). Compared to using `git annex dead here`, it does not clutter `uuid.log` and `trust.log` with information about dead clones; and compared with `reckless=ephemeral` it does not share the annex.
See Cloning a repository privately and configuration section in git-annex's manpage.
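The temporary-clone workflow mentioned in this description (clone, make changes, push back to origin, discard clone) could be scripted roughly as below. This is a sketch under assumptions: the function name and its arguments are hypothetical, it only assembles command lines, and the final discarding of the clone is left to the caller.

```python
from typing import List


def temporary_clone_workflow(url: str, path: str, message: str) -> List[List[str]]:
    """Build the datalad calls for a short-lived private clone (illustrative)."""
    return [
        # Private mode: the clone does not record itself in the git-annex branch.
        ["datalad", "clone", "--reckless", "private", url, path],
        # Changes are assumed to be made in `path` between clone and save.
        ["datalad", "save", "-d", path, "-m", message],
        # Publish the changes back to where the clone came from.
        ["datalad", "push", "-d", path, "--to", "origin"],
    ]
```

Because the clone never appears in `uuid.log`, removing it afterwards leaves no dead repository entry behind for other clones to carry around.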
Questions I still have
- `git annex untrust here` in the private mode. Other reckless modes do it (ephemeral uses dead) so I did as well, but kept it in a separate commit so it can be removed. But IMO this is less needed here, as the private repository is not mentioned in the annex branch, so other repositories can't learn about it. So it seems to me that the only value is as an additional safeguard for issuing a command to drop from elsewhere when working in a private repository.
- `clone_ephemeral` - when private mode is set, the trust information goes into `.git/annex/journal-private/` and not the annex branch, so I'm not sure if older annex versions would notice it. But, in any case, announcing dead doesn't seem to be a problem.
- Should `create` also set `datalad.clone.reckless` config option so that clones made into the created private dataset respect that mode by default? Should create also check that config in potential parent datasets?
edit: I made two separate commits that do it - one makes clone inherit `annex.private`, the other makes it set `datalad.clone.reckless`. See commit messages for caveats.
PR checklist
- `datalad.clone.reckless` also within `create` (and check the option during creation?)
- `CHANGELOG-missing` label to this pull request in order to have a snippet generated from its title; or use `scriv create` locally and include the generated file in the pull request, see scriv).