Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add config to avoid checking MD5 on get and put #928

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gaul
Copy link
Contributor

@gaul gaul commented Oct 30, 2017

Overload existing --no-check-md5. Object stores like S3Proxy cannot
return an MD5 ETag using some providers:

  • Atmos returns nothing
  • Azure returns an opaque ETag
  • B2 returns a SHA1

@gaul gaul force-pushed the put-check-etag branch 2 times, most recently from f8269e5 to 38ba00c Compare October 30, 2017 01:04
@gaul gaul changed the title Add config to avoid checking MD5 on put Add config to avoid checking MD5 on get and put Oct 30, 2017
@fviard
Copy link
Contributor

fviard commented Oct 30, 2017

Thanks for the PR.
Looking at all the difference regarding the "etag" handling, maybe we can do something more future proo with a little change to your PR?

Let's remove the relation to no-check-md5, and create a new config key like:
etag_type, etag_mode, ...
Its default value could be:
etag_mode=md5
(Having the same behavior as the existing one of s3cmd)

And in your case, we can use a value like:
etag_mode=none (or etag_mode=) (or etag_mode=other)
(Having a behavior that ignores etag, that is what you want with your pr i think).

With that, in the future, we could have something like:
etag_mode=sha1
to treat sha1 etags like B2 and so.

@gaul
Copy link
Contributor Author

gaul commented Oct 30, 2017

Disabling ETag is really the only sane choice. Some background, S3Proxy builds on top of Apache jclouds, a cross-cloud library, and jclouds returns a String from putBlob calls. This correlates with ETag for most providers, but jclouds returns the object name for Atmos and the SHA-1 for B2. One could argue these should return nothing since they cannot be meaningfully interpreted or used for things like conditional gets. Azure actually returns its ETag, but it is not an MD5. Thus there really is no precedent for non-MD5 ETags and adding extra ETag modes to s3cmd supports a very small population.

@fviard
Copy link
Contributor

fviard commented Oct 30, 2017

Most providers implementing the Amazon s3 protocol, are respecting the aws s3 behavior.
The issue is more with providers doing gateway with other protocols that not necessarily have the data. But these are not the main ones.
So, in my opinion, that is a lot better to try to support as much have them, maybe by having custom behaviors.
(And probably one setting in the case where etag is not supported or we don't know the mode)

I say "md5", but we can say "md5s3". Even if they allows themselves to change in the future, the algo is pretty deterministic and never changed so far. There is even some way to check the md5 for multipart files if you have a "fixed" block size.

Anyway, why the etag is useful?

  • First, like in the present case, that can validate that the received/stored file is not corrupted.
  • Using that makes "sync" a lot more efficient, because as there is no checksum or custom header in the default set of headers returned with a list, it means that we have to issue individual HEAD requests for each file to know his checksum to determine if it has to be uploaded or not. With the ETag, we can do that only for multipart files, and a smart server could be able to take advantage of that to have the multi part files also supported.

@gaul
Copy link
Contributor Author

gaul commented Oct 30, 2017

HTTPS should provide sufficient protection against corruption. The Azure ETag is opaque and includes some kind of timestamp it seems. The other two providers do not actually return an ETag header and the S3Proxy/jclouds behavior is bogus and likely should not return any header. Thus there is no work to do here except adding an ignore flag as this pull request does.

@fviard
Copy link
Contributor

fviard commented Nov 1, 2017

In my opinion, in such a case, https is not necessarily sufficient.
If we assume that the etag has a format that can be used, like the one with aws:

  1. Nothing prevents to use a simple http connection (and not an https), that will be especially the case with simple s3 clones used on local networks.
  2. https integrity check is not enough. Most services like aws, azure, ... have an https gateway as a frontend on the internet, and then the request is forwarded to different servers/services on the internal network. Normally, what is put in the etag by most services is dependent on the really stored file.

Regarding the patch, in my opinion it is a new config just to deal with specific lines that conflict with servers that have a different behavior. But nothing that would be really clear for an user or future proof.
Like, if you consider that the etag is not reliable for some services, why still letting it be used for "sync" file changes calculus, and so.

So, that's why i think that adding something more generic stating what is the supposed meaning of the ETag for a specific service could fix your issue and even have an added value.
(One of its value can be "None", "unknown", "garbage" thus giving the hint to s3cmd that it should not be used anywhere, even allowing to indicate to the user why his "sync" could be slower than with another service)

@gaul
Copy link
Contributor Author

gaul commented Nov 1, 2017

Checking ETag is not too useful given that most interesting uploads use multipart which also does not have a meaningful value. Most S3 tools which check ETag have some way to disable it, for example, aws/aws-sdk-java#560. If you will not accept this pull request as-is, please close it.

@harshavardhana
Copy link
Contributor

ETag check is ignored automatically if you add <unique-value>-N at the end @gaul @fviard

Overload existing --no-check-md5.  Object stores like S3Proxy cannot
return an MD5 ETag using some providers:

* Atmos returns nothing
* Azure returns an opaque ETag
* B2 returns a SHA1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants