NF Support annex.private for clone/get and create. #7247

mslw · 2023-01-09T21:04:23Z

Description of changes

This adds a new reckless mode, called "private" to clone (and, implicitly, get and install), and an analogous "private" option to create. When used, it will set the annex.private config option to true before git annex init is called. As a consequence, information about the cloned / created dataset will not be recorded in the git-annex branch (the branch will still exist, and track information about other repositories). Closes #6456

Private mode can be useful when working with temporary clones that are discarded after pushing (clone, make changes, push back to origin, discard clone). Compared to using git annex dead here, it does not clutter uuid.log and trust.log with information about dead clones; and compared with reckless=ephemeral it does not share the annex.

See Cloning a repository privately and configuration section in git-annex's manpage.

Questions I still have

Should the reckless mode be named "private" or maybe "temporary"?
After reading about untrusted repositories I am not sure whether we need to do git annex untrust here in the private mode. Other reckless modes do it (ephemeral uses dead) so I did as well, but kept it in a separate commit so it can be removed. But IMO this is less needed here, as the private repository is not mentioned in the annex branch, so other repositories can't learn about it. So it seems to me that the only value is as an additional safeguard for issuing a command to drop from elsewhere when working in a private repository.
On the same note, I am not sure about this comment in clone_ephemeral - when private mode is set, the trust information goes into .git/annex/journal-private/ and not the annex branch, so I'm not sure if older annex versions would notice it. But, it any case, announcing dead doesn't seem to be a problem.
Should create also set datalad.clone.reckless config option so that clones made into the created private dataset respect that mode by default? Should create also check that config in potential parent datasets?
edit: I made two separate commits that do it - one makes clone inherit annex.private, the other makes it set datalad.clone.reckless. See commit messages for caveats.

PR checklist

Set datalad.clone.reckless also within create (and check the option during creation?)
More manual testing
Write tests
Create a changelog snippet (add the CHANGELOG-missing label to this pull request in order to have a snippet generated from its title; or use scriv create locally and include the generated file in the pull request, see scriv).

Punctuation, grammar

This adds a new reckless mode, "private", to clone (and, by extension, get). When used, it will set annex.private (git config option) before running `git annex init`. Manpage for git-annex says: "When this is set to true, no information about the repository will be recorded in the git-annex branch". Private mode is intended mainly for creating temporary clones (which are changed, pushed back into origin, and dropped) without cluttering `git-annex:uuid.log` -- repository information gets stored in `.git/annex/journal-private/` rather than in the git-annex branch. The git-annex branch still exists, and is used to record things not related to the private repository. See: https://git-annex.branchable.com/tips/cloning_a_repository_privately/ The change to `clone` also affects `get`, because `get` uses `clone_dataset()`. Mechanisms for storing reckless mode and inheriting it in subdatasets were already in place.

This will make --reckless private similar to other reckless clone modes (auto and shared-* use untrust, ephemeral uses dead). Note that with annex.private, the information goes into `.git/annex/journal-private/trust.log`. Personally, I suppose that in the context of private mode, using untrust is not necessary (other clones won't have information about the private one), but can act as an additional safeguard (for number of copies) when issuing a drop-from-elsewhere command within the private repository.

This adds a new parameter, "private", to the create command. When set, it will set the annex.private configuration option before calling git annex init. When the option is set, the created dataset won't record information about itself in the git-annex branch (but the branch will still exist and record e.g. information about other repositories). This can be useful for creating temporary datasets which will be discarded after being published, or for privacy reasons.

codecov · 2023-01-09T21:40:09Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (eeb7313) 90.61% compared to head (da07f45) 91.60%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7247      +/-   ##
==========================================
+ Coverage   90.61%   91.60%   +0.98%     
==========================================
  Files         325      325              
  Lines       43405    43467      +62     
  Branches        0     5830    +5830     
==========================================
+ Hits        39333    39816     +483     
+ Misses       4072     3636     -436     
- Partials        0       15      +15

Files	Coverage Δ
datalad/core/distributed/clone.py	`99.04% <100.00%> (+<0.01%)`	⬆️
datalad/core/distributed/tests/test_clone.py	`98.45% <100.00%> (+0.02%)`	⬆️
datalad/core/local/create.py	`99.34% <100.00%> (+0.76%)`	⬆️
datalad/core/local/tests/test_create.py	`98.47% <100.00%> (+0.07%)`	⬆️
datalad/interface/common_opts.py	`100.00% <ø> (ø)`
datalad/support/annexrepo.py	`90.60% <85.71%> (+0.35%)`	⬆️

... and 74 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

If the parent dataset is given explicitly to the create command, and it uses annex private mode, the created subdataset will also use private mode. This is similar to how clone --reckless works. Note: the current code allows overriding one way: datalad create -d super --private sub (sub will be private regardless of super), but not the other way (if super is private, it will take precedence even when setting private=False). However, I don't see much reason for such setup, and I'd rather keep private as a True/False flag.

When using create with --private, also set datalad.clone.reckless so that cloning subdatasets into thus created dataset would default to using --reckless private.

This should also cover get, because all that's relevant happens in create. Note that checking the presence of "annex.private" in config is not sufficient. To make a repository private, this option must be in place before git annex init ts called. So to be sure, we also check effects of this config, i.e. presence of "journal-private" & absence of new entries in git-annex::uuid.log.

mslw · 2023-01-13T18:12:05Z

Looking briefly at the test errors:

"Test old nose code" fails on installing dependencies, more specifically conda install git annex
On Travis, some runs fail on tests added by me. Runs that fail use git-annex 8.20201007, runs that pass use a more recent version. Older version does not support annex.private.

Tests for ephemeral clone (which also sets annex.private) execute similar piece if external_versions['cmd:annex'] >= "8.20210428". I don't think I should avoid the test for private though, as here effect of the config is the whole point. So probably I would have to also add annex version check to the clone and create commands. But maybe the required annex version will be increased for next minor release? For now, I won't modify create/clone further and wait for suggestions.

bpoldrack

Thanks very much, @mslw! Sorry for missing this one.

Should the reckless mode be named "private" or maybe "temporary"?

I'm not sure either. Mostly b/c there's overlap with ephemeral. If origin isn't on the same fie system and therefore can't be linked, then both modes are identical (in intention - implementation should follow).
However, since ephemeral is used already, I think adding temporary is confusing. So, private and thus being a little more specific than other mode names, is the best I can see.

Re untrust:

the only value is as an additional safeguard for issuing a command to drop from elsewhere when working in a private repository.

Good point, I think. But since we likely need to declare dead anyway to guard against older annex, untrust in addition seems superfluous.
On that note: DataLad's minimum version for annex doesn't matter for this. A repo can be "touched" by git-annex w/o any DataLad being involved.

Should create also set datalad.clone.reckless config option so that clones made into the created private dataset respect that mode by default? Should create also check that config in potential parent datasets?

Yes, agree. Thanks!

Looking through the diff ...

mslw · 2023-02-07T17:15:31Z

Thanks for looking into it, I am really curious about your opinion. This seems like a small change in principle, but there are some interactions and decisions along the way (see e.g. commit message for 24fd32d).

Re untrust:

the only value is as an additional safeguard for issuing a command to drop from elsewhere when working in a private repository.

Good point, I think. But since we likely need to declare dead anyway to guard against older annex, untrust in addition seems superfluous. On that note: DataLad's minimum version for annex doesn't matter for this. A repo can be "touched" by git-annex w/o any DataLad being involved.

See, but here's something I'm concerned about. I noticed a similar logic in the code of clone_ephemeral (comments, actually), and I'm not sure if I agree. When I declare annex dead here in a repository that is in private mode (which it would be in this case, as private needs to be declared very early on), the information goes into .git/annex/journal-private/trust.log, not the "ordinary" place (trust.log on git-annex branch). I did not check what, if anything, a git-annex version prior to introduction of private mode would do.

And if it wouldn't observe these settings, then there's little we can do about it other then make users aware of that. Both ephemeral and private setups exist for a good reason.

mslw · 2023-02-23T09:38:34Z

I think this warrants a ping to #6847 and #7232 because this PR uses a relatively new annex feature (though approaching 2 years old now).

As this is an additional feature for clone and create, I think it's fine if the minimally supported annex version (and indeed, one that's currently in Debian stable) does not support the private feature. In this case, the annex.private config option would be set, but not acted upon.

This probably warrants a check / warning in the relevant commands though, that I didn't initially propose? Please advise.

bpoldrack · 2023-03-08T15:04:20Z

When I declare annex dead here in a repository that is in private mode (which it would be in this case, as private needs to be declared very early on), the information goes into .git/annex/journal-private/trust.log, not the "ordinary" place (trust.log on git-annex branch).

Good catch, I missed that! You're right, then, I suppose - we can't do a lot about it and annex dead here doesn't actually mitigate.
Somewhat tied to the question of raising annex' minimum version. Since it's just an additional feature I agree - not imperative to raise. I think, there are three somewhat reasonable options here:

Have all relevant commands check the installed annex version at the beginning if that option (or the config) is set and error out for an invalid parameter
Use an annex version kludge like we do elsewhere and set private only if supported, declare git annex dead here as fallback (as the next best option to achieve something that may be good enough for purpose) alongside a warning that the proper thing requires annex XXX instead
annex version kludge for setting private, but no fallback - error out instead.

Setting private while knowing that installed annex can't respect it, doesn't seem to make sense to me. Further operation on that dataset would lead to obviously undesired results (hence second option above as an alternative).

Note, that 1 isn't actually mutually exclusive with 3. It's probably 1+3 or 2. 3 could seem superfluous in that case, but I think having that protection at a lower level despite the intention to fail early is the safer approach WRT later code changes.
Got to admit I don't have a clear preference. I suppose 1+3 is cleanest. 2 could be convenient but also a bit misleading.

bpoldrack

So far, implementation looks good to me.
Not approving yet, b/c of unresolved question above.

datalad/core/local/create.py

A single lookup for annex.private is changed to use getbool instead of a literal comparison. This will return True for "true", "1", and the like (KeyError is handled if the key is not present). Suggested-by: Benjamin Poldrack <[email protected]>

mslw · 2023-03-14T18:55:59Z

Thanks @bpoldrack for taking a closer look. Just a quick note that I didn't forget about this PR, but I focused on other things recently. I will make changes addressing your review (especially the thing when annex version doesn't support the private mode) at some point by the end of the week.

mslw · 2023-03-26T17:03:44Z

Sorry, I ran out of steam. Will be unavailable in the coming week. @bpoldrack do you think you can take over?

bpoldrack · 2023-03-27T06:50:05Z

@mslw

Sorry, I ran out of steam. Will be unavailable in the coming week. @bpoldrack do you think you can take over?

Yes, no worries.

This introduces a version kludge indicating whether private mode is supported by git-annex and adjusts tests for clone and create to account for it. Furthermore this patch makes `create` and `clone` fail if private mode was requested by command parameter but isn't supported by installed git-annex. This seems preferrable over a fallback solution like `git annex dead here`, since the behavior is not the same and whether such an approach suits the usecase is up to the user to decide. There's one deviation from failing early when private mode isn't supported: This is when the setting comes from the `datalad.clone.reckless` setting in a superdataset. In this case we fail only when actually trying to set this. The rationale for this is: At the point in time we read such a config, we don't even know whether we are creating/cloning an annex repo to begin with (could be plain git). To fail when we could succeed, while the instruction for private mode only comes from a general setting rather than by telling the executed command directly seems wrong. Failing to `clone --reckless private` a git repo seems OK, because it can be rectified by the command call that doesn't need the option to begin with. But `clone -d.` a git repo while `datalad.clone.reckless` is set to private in the superdataset seems different, because a user would need to change the general setup for a config that isn't even supposed to have an effect on the dataset in question. It may arguably be rare to ever run into this situation, but we should be aware, that users may have good reasons for different execution environments (incl. different annex versions), while working on the same datasets on the filesystem (where the config for the superdataset is likely to come from).

yarikoptic · 2023-06-03T02:41:31Z

@mslw do you think anything was left to be done for this PR?

mslw · 2023-06-05T10:13:56Z

@mslw do you think anything was left to be done for this PR?

I think that with 601e7a9 by @bpoldrack this is complete, and I would be very happy to see it make part of the next minor release. Sorry for not making it clear.

It seems that there are plans to raise the minimal required git-annex version with the next minor release, too (e.g #7232) - if it goes above 20210428, than that commit could as well be reverted. But given that there's no clear indication where the required version would ultimately land, I do not wish to create circular dependencies between PRs (kludges can also be removed later).

codeclimate · 2023-10-09T16:12:09Z

Code Climate has analyzed commit da07f45 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category	Count
Complexity	1

View more on Code Climate.

mslw · 2023-10-10T16:19:18Z

I fixed a merge conflict; at the moment travis & appveyor tests pass.

I want to reiterate that I consider this PR complete.

adswa

to me this PR looks good, and the failing benchmarks are fixed outside of this PR already. I'd really like to have this PR merged! I have played around with the new functions in toy repos, and they work fine and as I expect judging from the git-annex docs. I haven't had real usecases for private mode, though - maybe a second pair of eyes can spot something I might have missed?

yarikoptic · 2023-10-13T15:19:21Z

changelog.d/pr-7247.md

+- Support for annex private mode.
+ A new `--reckless private` mode was added to the `clone`, `get` & `install` commands; a new `--private` option was added to `create`.
+ Using private mode will configure the new clone with `annex.private=true`, meaning that the clone won't store any information about itself in the git-annex branch (the branch will still exist and contain information about other clones).
+ This mode can be used for temporary clones, where changes are pushed (e.g. back to origin), and the temporary clone is promptly discarded.


Ideally we would like to harmonize how private is specified across commands. But not yet sure if --private or --reckless:

For reckless -- can't it be a combination of shared- and private or ephemeral and private or some other in particular wreckless is we do want to quickly do some change in a clone to push back?

Ideally we would like to harmonize how private is specified across commands. But not yet sure if --private or --reckless

I'm trying to remember what was the reason I chose the current way, but can't. Probably just following the suggestion in #6456 to reuse the existing switch for clone, and then not feeling comfortable with calling the create operation "reckless".

For reckless -- can't it be a combination of shared- and private or ephemeral and private or some other in particular wreckless is we do want to quickly do some change in a clone to push back?

Umm... For ephemeral & private: that wouldn't change anything because clone ephemeral already sets annex.private (which is the only DataLad usage of annex.private outside this PR). For shared & private, and yet-to-be-invented and private: I did not think of it, but I suppose it would technically be possible...

FWIW, I agree that it is useful to reuse the --reckless switch, especially when the combination --reckless ephemeral --private becomes possible but doesn't make sense. And I agree it makes sense to call it --private instead of reckless for create

mih · 2023-11-07T15:31:32Z

With datalad-next, this is how it would be done:

❯ datalad -c annex.private=true create some
create(ok): /tmp/some (dataset)
❯ ls --no-icons some/.git/annex
fsck   index.lck        journal.lck  keysdb.lck    sentinal        smudge.lck
index  journal-private  keysdb       othertmp.lck  sentinal.cache  smudge.log

(see the journal-private directory)

What it does not do is setting annex.private in the local gitconfig. I assume, but have not confirmed, that this is necessary for full compliance.

Also see https://git-annex.branchable.com/tips/cloning_a_repository_privately/#comment-2725ecbd413dffad080eab473787c530

Update: Linking datalad/datalad-next#534 (comment) that has the conclusion: on the fly setting is OK.

mslw added 4 commits January 6, 2023 17:30

Tweak docstring for reckless slightly

08e0f0a

Punctuation, grammar

mslw added 3 commits January 12, 2023 19:56

Allow create to set datalad.clone.reckless private

2ab5c80

When using create with --private, also set datalad.clone.reckless so that cloning subdatasets into thus created dataset would default to using --reckless private.

mslw force-pushed the nf-annex-private branch from 8694600 to 13c276c Compare January 13, 2023 13:21

mslw added CHANGELOG-missing When a PR's description does not contain a changelog item, yet. semver-minor Increment the minor version when merged labels Jan 13, 2023

[release-action] Autogenerate changelog snippet for PR 7247

893162c

github-actions bot removed the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Jan 13, 2023

Add a changelog description for private mode

d9ffc28

mslw changed the base branch from maint to master January 13, 2023 13:47

mslw changed the title ~~NF (draft) Support annex.private for clone/get and create.~~ NF Support annex.private for clone/get and create. Jan 13, 2023

mslw marked this pull request as ready for review January 13, 2023 13:50

mslw added semver-minor Increment the minor version when merged and removed semver-minor Increment the minor version when merged labels Jan 13, 2023

mslw requested a review from bpoldrack January 18, 2023 14:12

bpoldrack reviewed Feb 7, 2023

View reviewed changes

bpoldrack reviewed Mar 8, 2023

View reviewed changes

datalad/core/local/create.py Outdated Show resolved Hide resolved

bpoldrack force-pushed the nf-annex-private branch from fc91360 to 601e7a9 Compare April 14, 2023 11:10

Merge branch 'master' into nf-annex-private

4360353

Merge branch 'master' into nf-annex-private

da07f45

adswa approved these changes Oct 13, 2023

View reviewed changes

yarikoptic reviewed Oct 13, 2023

View reviewed changes

mih mentioned this pull request Nov 7, 2023

Finalize annex.private support (if needed) datalad/datalad-next#534

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NF Support annex.private for clone/get and create. #7247

NF Support annex.private for clone/get and create. #7247

mslw commented Jan 9, 2023 •

edited

codecov bot commented Jan 9, 2023 •

edited

mslw commented Jan 13, 2023

bpoldrack left a comment

mslw commented Feb 7, 2023

mslw commented Feb 23, 2023

bpoldrack commented Mar 8, 2023

bpoldrack left a comment

mslw commented Mar 14, 2023

mslw commented Mar 26, 2023

bpoldrack commented Mar 27, 2023

yarikoptic commented Jun 3, 2023

mslw commented Jun 5, 2023

codeclimate bot commented Oct 9, 2023

mslw commented Oct 10, 2023

adswa left a comment

yarikoptic Oct 13, 2023

mslw Oct 18, 2023 •

edited

adswa Oct 18, 2023

mih commented Nov 7, 2023 •

edited

NF Support annex.private for clone/get and create. #7247

Are you sure you want to change the base?

NF Support annex.private for clone/get and create. #7247

Conversation

mslw commented Jan 9, 2023 • edited

Description of changes

Questions I still have

PR checklist

codecov bot commented Jan 9, 2023 • edited

Codecov Report

mslw commented Jan 13, 2023

bpoldrack left a comment

Choose a reason for hiding this comment

mslw commented Feb 7, 2023

mslw commented Feb 23, 2023

bpoldrack commented Mar 8, 2023

bpoldrack left a comment

Choose a reason for hiding this comment

mslw commented Mar 14, 2023

mslw commented Mar 26, 2023

bpoldrack commented Mar 27, 2023

yarikoptic commented Jun 3, 2023

mslw commented Jun 5, 2023

codeclimate bot commented Oct 9, 2023

mslw commented Oct 10, 2023

adswa left a comment

Choose a reason for hiding this comment

yarikoptic Oct 13, 2023

Choose a reason for hiding this comment

mslw Oct 18, 2023 • edited

Choose a reason for hiding this comment

adswa Oct 18, 2023

Choose a reason for hiding this comment

mih commented Nov 7, 2023 • edited

mslw commented Jan 9, 2023 •

edited

codecov bot commented Jan 9, 2023 •

edited

mslw Oct 18, 2023 •

edited

mih commented Nov 7, 2023 •

edited