Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Compress::Gzip error for gzip files with extra field #14550

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kojix2
Copy link
Contributor

@kojix2 kojix2 commented May 1, 2024

Hi.
If a Gzip file had a extra field, it could not be opened with Compress::Gzip::Reader.
This is because the length of an extra field, XLEN, is 2 bytes (uint16), but was treated as 1 byte.
This pull request aims to fix this issue.
Thank you.

Reference: BGZF file specifications: https://samtools.github.io/hts-specs/SAMv1.pdf
image

Reference: GZIP file format specification version 4.3
https://www.rfc-editor.org/rfc/rfc1952


   2.1. Overall conventions

      In the diagrams below, a box like this:

         +---+
         |   | <-- the vertical bars might be missing
         +---+

      represents one byte; a box like this:

         +==============+
         |              |
         +==============+

      represents a variable number of bytes.
      Each member has the following structure:

         +---+---+---+---+---+---+---+---+---+---+
         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
         +---+---+---+---+---+---+---+---+---+---+

      (if FLG.FEXTRA set)

         +---+---+=================================+
         | XLEN  |...XLEN bytes of "extra field"...| (more-->)
         +---+---+=================================+

@kojix2 kojix2 changed the title Fix Gzip::Compress::Reader error for gzip files with extra field Fix Compress::Gzip error for gzip files with extra field May 1, 2024
@crysbot
Copy link

crysbot commented May 1, 2024

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/compress-reader-cannot-open-a-file-in-bgzf-format-is-this-a-bug/6262/7

@Sija
Copy link
Contributor

Sija commented May 1, 2024

Spec would be nice :)

@oprypin oprypin linked an issue May 1, 2024 that may be closed by this pull request
@kojix2
Copy link
Contributor Author

kojix2 commented May 2, 2024

Hi @Sija

The current test is the following link. It already contains tests on Extra field.

https://github.com/crystal-lang/crystal/blob/989b7fb31451951696ea8c2aaf3e6f0cc01d56f0/spec/std/gzip/gzip_spec.cr

Here we use Gzip::Writer to create a gzip stream with an Extra field.

  Gzip::Writer.open(io) do |gzip|
    header = gzip.header
    header.modification_time = SAMPLE_TIME
    header.os = SAMPLE_OS
    header.extra = SAMPLE_EXTRA
    header.name = SAMPLE_NAME
    header.comment = SAMPLE_COMMENT

    io.bytesize.should eq(0)
    gzip.flush
    io.bytesize.should_not eq(0)

    gzip.print SAMPLE_CONTENTS
  end

The created gzip stream is then read using Gzip::Reader.

    Gzip::Reader.open(io) do |gzip|
      header = gzip.header.not_nil!
      header.modification_time.should eq(SAMPLE_TIME)
      header.os.should eq(SAMPLE_OS)
      header.extra.should eq(SAMPLE_EXTRA)
      header.name.should eq(SAMPLE_NAME)
      header.comment.should eq(SAMPLE_COMMENT)

      # Reading zero bytes is OK
      gzip.read(Bytes.empty).should eq(0)

      gzip.gets_to_end.should eq(SAMPLE_CONTENTS)
    end

The original code passed the test. Because it created an incorrect stream with XLEN size of one byte instead of two bytes and read it the wrong way.

This pull request failed the test the first time because I was only correcting the reading XLEN. So I also corrected the writing XLEN as well and now it passes the test.

Even then, should we really add another test? If so, what kind of test should we add?

One idea might be to have a hard-coded slice and check that it can be read by Gzip::Reader.

@oprypin Thanks for linking the issue. I think this also fixes the issue (#8933)

@Blacksmoke16 Blacksmoke16 linked an issue May 2, 2024 that may be closed by this pull request
@straight-shoota
Copy link
Member

There should be at least one test that ensures correctly reading an extra field, and one for correctly writing an extra field.
Each test should fail without this patch, and succeed with it.
If the existing test succeeds with and without this patch, it's insufficient. It just test that reading and writing are implementing the same format, but not that this is correct.

So it's probably a good idea to include a gzipped file with extra field as a test fixture and make sure we can read it and produce the same file when writing.

@kojix2
Copy link
Contributor Author

kojix2 commented May 2, 2024

@straight-shoota
Understood. Now I would like to add a test code that fails before the change and passes after the change.

(1) Add a small file test.gz with extra fields to spec/std/data .

https://github.com/crystal-lang/crystal/blob/master/spec/std/data

(2) Write some test code like the following

it "reads zip file with different extra in local file header and central directory header" do
Compress::Zip::File.open(datapath("test.zip")) do |zip|
zip.entries.size.should eq(2)
zip["one.txt"].open(&.gets_to_end).should eq("One")
zip["two.txt"].open(&.gets_to_end).should eq("Two")
end
end

Is my understanding correct?

@kojix2
Copy link
Contributor Author

kojix2 commented May 8, 2024

I created a draft Gzip file for testing.

C code to generate the Gzip file
https://gist.github.com/kojix2/bd9523fdd3c1e7f8be4ce2c91fd17435

Generated gzip file
test.gz

image

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compress::Gzip::Reader cannot open a file in BGZF format a bug when read extra field gzip compressed data
5 participants