Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NF Support annex.private for clone/get and create. #7247

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

mslw
Copy link
Contributor

@mslw mslw commented Jan 9, 2023

Description of changes

This adds a new reckless mode, called "private" to clone (and, implicitly, get and install), and an analogous "private" option to create. When used, it will set the annex.private config option to true before git annex init is called. As a consequence, information about the cloned / created dataset will not be recorded in the git-annex branch (the branch will still exist, and track information about other repositories). Closes #6456

Private mode can be useful when working with temporary clones that are discarded after pushing (clone, make changes, push back to origin, discard clone). Compared to using git annex dead here, it does not clutter uuid.log and trust.log with information about dead clones; and compared with reckless=ephemeral it does not share the annex.

See Cloning a repository privately and configuration section in git-annex's manpage.

Questions I still have

  • Should the reckless mode be named "private" or maybe "temporary"?
  • After reading about untrusted repositories I am not sure whether we need to do git annex untrust here in the private mode. Other reckless modes do it (ephemeral uses dead) so I did as well, but kept it in a separate commit so it can be removed. But IMO this is less needed here, as the private repository is not mentioned in the annex branch, so other repositories can't learn about it. So it seems to me that the only value is as an additional safeguard for issuing a command to drop from elsewhere when working in a private repository.
  • On the same note, I am not sure about this comment in clone_ephemeral - when private mode is set, the trust information goes into .git/annex/journal-private/ and not the annex branch, so I'm not sure if older annex versions would notice it. But, it any case, announcing dead doesn't seem to be a problem.
  • Should create also set datalad.clone.reckless config option so that clones made into the created private dataset respect that mode by default? Should create also check that config in potential parent datasets?
    edit: I made two separate commits that do it - one makes clone inherit annex.private, the other makes it set datalad.clone.reckless. See commit messages for caveats.

PR checklist

  • Set datalad.clone.reckless also within create (and check the option during creation?)
  • More manual testing
  • Write tests
  • Create a changelog snippet (add the CHANGELOG-missing label to this pull request in order to have a snippet generated from its title; or use scriv create locally and include the generated file in the pull request, see scriv).

This adds a new reckless mode, "private", to clone (and, by extension,
get). When used, it will set annex.private (git config option) before
running `git annex init`. Manpage for git-annex says: "When this is
set to true, no information about the repository will be recorded in
the git-annex branch".

Private mode is intended mainly for creating temporary clones (which
are changed, pushed back into origin, and dropped) without cluttering
`git-annex:uuid.log` -- repository information gets stored in
`.git/annex/journal-private/` rather than in the git-annex branch. The
git-annex branch still exists, and is used to record things not
related to the private repository. See:
https://git-annex.branchable.com/tips/cloning_a_repository_privately/

The change to `clone` also affects `get`, because `get` uses
`clone_dataset()`. Mechanisms for storing reckless mode and inheriting
it in subdatasets were already in place.
This will make --reckless private similar to other reckless clone
modes (auto and shared-* use untrust, ephemeral uses dead).

Note that with annex.private, the information goes into
`.git/annex/journal-private/trust.log`.

Personally, I suppose that in the context of private mode, using
untrust is not necessary (other clones won't have information about
the private one), but can act as an additional safeguard (for number
of copies) when issuing a drop-from-elsewhere command within the
private repository.
This adds a new parameter, "private", to the create command. When set,
it will set the annex.private configuration option before calling git
annex init. When the option is set, the created dataset won't record
information about itself in the git-annex branch (but the branch will
still exist and record e.g. information about other repositories).

This can be useful for creating temporary datasets which will be
discarded after being published, or for privacy reasons.
@codecov
Copy link

codecov bot commented Jan 9, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (eeb7313) 90.61% compared to head (da07f45) 91.60%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7247      +/-   ##
==========================================
+ Coverage   90.61%   91.60%   +0.98%     
==========================================
  Files         325      325              
  Lines       43405    43467      +62     
  Branches        0     5830    +5830     
==========================================
+ Hits        39333    39816     +483     
+ Misses       4072     3636     -436     
- Partials        0       15      +15     
Files Coverage Δ
datalad/core/distributed/clone.py 99.04% <100.00%> (+<0.01%) ⬆️
datalad/core/distributed/tests/test_clone.py 98.45% <100.00%> (+0.02%) ⬆️
datalad/core/local/create.py 99.34% <100.00%> (+0.76%) ⬆️
datalad/core/local/tests/test_create.py 98.47% <100.00%> (+0.07%) ⬆️
datalad/interface/common_opts.py 100.00% <ø> (ø)
datalad/support/annexrepo.py 90.60% <85.71%> (+0.35%) ⬆️

... and 74 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

If the parent dataset is given explicitly to the create command, and
it uses annex private mode, the created subdataset will also use
private mode. This is similar to how clone --reckless works.

Note: the current code allows overriding one way:
datalad create -d super --private sub
(sub will be private regardless of super), but not the other way (if
super is private, it will take precedence even when setting
private=False). However, I don't see much reason for such setup, and
I'd rather keep private as a True/False flag.
When using create with --private, also set datalad.clone.reckless so
that cloning subdatasets into thus created dataset would default to
using --reckless private.
This should also cover get, because all that's relevant happens in
create.

Note that checking the presence of "annex.private" in config is not
sufficient. To make a repository private, this option must be in place
before git annex init ts called. So to be sure, we also check effects
of this config, i.e. presence of "journal-private" & absence of new
entries in git-annex::uuid.log.
@mslw mslw added CHANGELOG-missing When a PR's description does not contain a changelog item, yet. semver-minor Increment the minor version when merged labels Jan 13, 2023
@github-actions github-actions bot removed the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Jan 13, 2023
@mslw mslw changed the base branch from maint to master January 13, 2023 13:47
@mslw mslw changed the title NF (draft) Support annex.private for clone/get and create. NF Support annex.private for clone/get and create. Jan 13, 2023
@mslw mslw marked this pull request as ready for review January 13, 2023 13:50
@mslw mslw added semver-minor Increment the minor version when merged and removed semver-minor Increment the minor version when merged labels Jan 13, 2023
@mslw
Copy link
Contributor Author

mslw commented Jan 13, 2023

Looking briefly at the test errors:

  • "Test old nose code" fails on installing dependencies, more specifically conda install git annex

  • On Travis, some runs fail on tests added by me. Runs that fail use git-annex 8.20201007, runs that pass use a more recent version. Older version does not support annex.private.

    Tests for ephemeral clone (which also sets annex.private) execute similar piece if external_versions['cmd:annex'] >= "8.20210428". I don't think I should avoid the test for private though, as here effect of the config is the whole point. So probably I would have to also add annex version check to the clone and create commands. But maybe the required annex version will be increased for next minor release? For now, I won't modify create/clone further and wait for suggestions.

@mslw mslw requested a review from bpoldrack January 18, 2023 14:12
Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much, @mslw! Sorry for missing this one.

Should the reckless mode be named "private" or maybe "temporary"?

I'm not sure either. Mostly b/c there's overlap with ephemeral. If origin isn't on the same fie system and therefore can't be linked, then both modes are identical (in intention - implementation should follow).
However, since ephemeral is used already, I think adding temporary is confusing. So, private and thus being a little more specific than other mode names, is the best I can see.

Re untrust:

the only value is as an additional safeguard for issuing a command to drop from elsewhere when working in a private repository.

Good point, I think. But since we likely need to declare dead anyway to guard against older annex, untrust in addition seems superfluous.
On that note: DataLad's minimum version for annex doesn't matter for this. A repo can be "touched" by git-annex w/o any DataLad being involved.

Should create also set datalad.clone.reckless config option so that clones made into the created private dataset respect that mode by default? Should create also check that config in potential parent datasets?

Yes, agree. Thanks!

Looking through the diff ...

@mslw
Copy link
Contributor Author

mslw commented Feb 7, 2023

Thanks for looking into it, I am really curious about your opinion. This seems like a small change in principle, but there are some interactions and decisions along the way (see e.g. commit message for 24fd32d).

Re untrust:

the only value is as an additional safeguard for issuing a command to drop from elsewhere when working in a private repository.

Good point, I think. But since we likely need to declare dead anyway to guard against older annex, untrust in addition seems superfluous. On that note: DataLad's minimum version for annex doesn't matter for this. A repo can be "touched" by git-annex w/o any DataLad being involved.

See, but here's something I'm concerned about. I noticed a similar logic in the code of clone_ephemeral (comments, actually), and I'm not sure if I agree. When I declare annex dead here in a repository that is in private mode (which it would be in this case, as private needs to be declared very early on), the information goes into .git/annex/journal-private/trust.log, not the "ordinary" place (trust.log on git-annex branch). I did not check what, if anything, a git-annex version prior to introduction of private mode would do.

And if it wouldn't observe these settings, then there's little we can do about it other then make users aware of that. Both ephemeral and private setups exist for a good reason.

@mslw
Copy link
Contributor Author

mslw commented Feb 23, 2023

I think this warrants a ping to #6847 and #7232 because this PR uses a relatively new annex feature (though approaching 2 years old now).

As this is an additional feature for clone and create, I think it's fine if the minimally supported annex version (and indeed, one that's currently in Debian stable) does not support the private feature. In this case, the annex.private config option would be set, but not acted upon.

This probably warrants a check / warning in the relevant commands though, that I didn't initially propose? Please advise.

@bpoldrack
Copy link
Member

When I declare annex dead here in a repository that is in private mode (which it would be in this case, as private needs to be declared very early on), the information goes into .git/annex/journal-private/trust.log, not the "ordinary" place (trust.log on git-annex branch).

Good catch, I missed that! You're right, then, I suppose - we can't do a lot about it and annex dead here doesn't actually mitigate.
Somewhat tied to the question of raising annex' minimum version. Since it's just an additional feature I agree - not imperative to raise. I think, there are three somewhat reasonable options here:

  • Have all relevant commands check the installed annex version at the beginning if that option (or the config) is set and error out for an invalid parameter
  • Use an annex version kludge like we do elsewhere and set private only if supported, declare git annex dead here as fallback (as the next best option to achieve something that may be good enough for purpose) alongside a warning that the proper thing requires annex XXX instead
  • annex version kludge for setting private, but no fallback - error out instead.

Setting private while knowing that installed annex can't respect it, doesn't seem to make sense to me. Further operation on that dataset would lead to obviously undesired results (hence second option above as an alternative).

Note, that 1 isn't actually mutually exclusive with 3. It's probably 1+3 or 2. 3 could seem superfluous in that case, but I think having that protection at a lower level despite the intention to fail early is the safer approach WRT later code changes.
Got to admit I don't have a clear preference. I suppose 1+3 is cleanest. 2 could be convenient but also a bit misleading.

Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, implementation looks good to me.
Not approving yet, b/c of unresolved question above.

datalad/core/local/create.py Outdated Show resolved Hide resolved
A single lookup for annex.private is changed to use getbool instead of
a literal comparison. This will return True for "true", "1", and the
like (KeyError is handled if the key is not present).

Suggested-by: Benjamin Poldrack <[email protected]>
@mslw
Copy link
Contributor Author

mslw commented Mar 14, 2023

Thanks @bpoldrack for taking a closer look. Just a quick note that I didn't forget about this PR, but I focused on other things recently. I will make changes addressing your review (especially the thing when annex version doesn't support the private mode) at some point by the end of the week.

@mslw
Copy link
Contributor Author

mslw commented Mar 26, 2023

Sorry, I ran out of steam. Will be unavailable in the coming week. @bpoldrack do you think you can take over?

@bpoldrack
Copy link
Member

@mslw

Sorry, I ran out of steam. Will be unavailable in the coming week. @bpoldrack do you think you can take over?

Yes, no worries.

This introduces a version kludge indicating whether private mode is
supported by git-annex and adjusts tests for clone and create to account
for it.
Furthermore this patch makes `create` and `clone` fail if private mode
was requested by command parameter but isn't supported by installed
git-annex. This seems preferrable over a fallback solution like `git
annex dead here`, since the behavior is not the same and whether such an
approach suits the usecase is up to the user to decide.

There's one deviation from failing early when private mode isn't
supported: This is when the setting comes from the
`datalad.clone.reckless` setting in a superdataset. In this case we fail
only when actually trying to set this. The rationale for this is:
At the point in time we read such a config, we don't even know whether
we are creating/cloning an annex repo to begin with (could be plain git).
To fail when we could succeed, while the instruction for private mode
only comes from a general setting rather than by telling the executed
command directly seems wrong. Failing to `clone --reckless private` a
git repo seems OK, because it can be rectified by the command call that
doesn't need the option to begin with. But `clone -d.` a git repo while
`datalad.clone.reckless` is set to private in the superdataset seems
different, because a user would need to change the general setup for a
config that isn't even supposed to have an effect on the dataset in
question.
It may arguably be rare to ever run into this situation, but we should
be aware, that users may have good reasons for different execution
environments (incl. different annex versions), while working on the same
datasets on the filesystem (where the config for the superdataset is
likely to come from).
@yarikoptic
Copy link
Member

@mslw do you think anything was left to be done for this PR?

@mslw
Copy link
Contributor Author

mslw commented Jun 5, 2023

@mslw do you think anything was left to be done for this PR?

I think that with 601e7a9 by @bpoldrack this is complete, and I would be very happy to see it make part of the next minor release. Sorry for not making it clear.

It seems that there are plans to raise the minimal required git-annex version with the next minor release, too (e.g #7232) - if it goes above 20210428, than that commit could as well be reverted. But given that there's no clear indication where the required version would ultimately land, I do not wish to create circular dependencies between PRs (kludges can also be removed later).

@codeclimate
Copy link

codeclimate bot commented Oct 9, 2023

Code Climate has analyzed commit da07f45 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 1

View more on Code Climate.

@mslw
Copy link
Contributor Author

mslw commented Oct 10, 2023

I fixed a merge conflict; at the moment travis & appveyor tests pass.

I want to reiterate that I consider this PR complete.

Copy link
Member

@adswa adswa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me this PR looks good, and the failing benchmarks are fixed outside of this PR already. I'd really like to have this PR merged! I have played around with the new functions in toy repos, and they work fine and as I expect judging from the git-annex docs. I haven't had real usecases for private mode, though - maybe a second pair of eyes can spot something I might have missed?

- Support for annex private mode.
A new `--reckless private` mode was added to the `clone`, `get` & `install` commands; a new `--private` option was added to `create`.
Using private mode will configure the new clone with `annex.private=true`, meaning that the clone won't store any information about itself in the git-annex branch (the branch will still exist and contain information about other clones).
This mode can be used for temporary clones, where changes are pushed (e.g. back to origin), and the temporary clone is promptly discarded.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Ideally we would like to harmonize how private is specified across commands. But not yet sure if --private or --reckless:
  • For reckless -- can't it be a combination of shared- and private or ephemeral and private or some other in particular wreckless is we do want to quickly do some change in a clone to push back?

Copy link
Contributor Author

@mslw mslw Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would like to harmonize how private is specified across commands. But not yet sure if --private or --reckless

I'm trying to remember what was the reason I chose the current way, but can't. Probably just following the suggestion in #6456 to reuse the existing switch for clone, and then not feeling comfortable with calling the create operation "reckless".

For reckless -- can't it be a combination of shared- and private or ephemeral and private or some other in particular wreckless is we do want to quickly do some change in a clone to push back?

Umm... For ephemeral & private: that wouldn't change anything because clone ephemeral already sets annex.private (which is the only DataLad usage of annex.private outside this PR). For shared & private, and yet-to-be-invented and private: I did not think of it, but I suppose it would technically be possible...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I agree that it is useful to reuse the --reckless switch, especially when the combination --reckless ephemeral --private becomes possible but doesn't make sense. And I agree it makes sense to call it --private instead of reckless for create

@mih
Copy link
Member

mih commented Nov 7, 2023

With datalad-next, this is how it would be done:

❯ datalad -c annex.private=true create some
create(ok): /tmp/some (dataset)
❯ ls --no-icons some/.git/annex
fsck   index.lck        journal.lck  keysdb.lck    sentinal        smudge.lck
index  journal-private  keysdb       othertmp.lck  sentinal.cache  smudge.log

(see the journal-private directory)

What it does not do is setting annex.private in the local gitconfig. I assume, but have not confirmed, that this is necessary for full compliance.

Also see https://git-annex.branchable.com/tips/cloning_a_repository_privately/#comment-2725ecbd413dffad080eab473787c530

Update: Linking datalad/datalad-next#534 (comment) that has the conclusion: on the fly setting is OK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor Increment the minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clone/get should support setting annex.private=true
5 participants