Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(reporting): report meta-data information about chunks. #557

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

qkaiser
Copy link
Contributor

@qkaiser qkaiser commented Apr 14, 2023

Allow handlers to provide a dict value as part of a ValidChunk metadata attribute. That dictionary can contain any relevant metadata information from the perspective of the handler, but we advise handler writers to report parsed information such as header values.

This metadata dict is later reported as part of our ChunkReports and available in the JSON report file if the user requested one.

The idea is to expose metadata to further analysis steps through the unblob report. For example, a binary analysis toolkit would read the load address and architecture from a uImage chunk to analyze the file extracted from that chunk with the right settings.

A note on the 'as_dict' implementation.

The initial idea was to implement it in dissect.cstruct (see fox-it/dissect.cstruct#29), but due to expected changes in the project's API I chose to implement it in unblob so we're not dependent on another project.

Related to #16 and initial discussion in #16 (comment)


You can observe the changes like this:

poetry run unblob -vvv -e /tmp/out -f --report /tmp/report.json tests/integration/archive/sevenzip/__input__/cherry.7z

cat /tmp/report.json| jq '.[0].reports[-1].metadata'
{
  "magic": "7z��'\u001c",
  "version_maj": 0,
  "version_min": 4,
  "crc": 3845252377,
  "next_header_offset": 147,
  "next_header_size": 33,
  "next_header_crc": 2089891445
}

@qkaiser qkaiser added the enhancement New feature or request label Apr 14, 2023
@qkaiser qkaiser self-assigned this Apr 14, 2023
unblob/models.py Outdated Show resolved Hide resolved
unblob/report.py Outdated Show resolved Hide resolved
unblob/file_utils.py Outdated Show resolved Hide resolved
@qkaiser qkaiser force-pushed the 16-metadata-reporting branch 2 times, most recently from cfffb33 to 2ed0237 Compare April 18, 2023 15:40
@qkaiser
Copy link
Contributor Author

qkaiser commented Apr 18, 2023

@e3krisztian implemented the changes we talked about and introduced a test.

unblob/models.py Outdated Show resolved Hide resolved
unblob/models.py Show resolved Hide resolved
@@ -181,6 +181,7 @@ class ChunkReport(Report):
end_offset: int
size: int
is_encrypted: bool
metadata: dict = attr.ib(factory=dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering if we want to validate metadata dict, do we want to enforce that key is a string and value is of a certain type, or we are ok we anything, even nested meta-data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be great to somehow have a "namespace" or at least some convention on metadata variable naming?

What if we want to push data from multiple headers, or permissions, etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with enforcing a convention on metadata variable naming. Having a namespace would be too complicated since we can't foresee the metadata field names used by handlers.

I would enforce that metadata is a dict without nested data, keys must be strings and values must be base types.

I would convey information about files created (timestamps, permissions, owner) with something different since it involves way more complex structures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a validator. See 99944e3

@@ -70,4 +70,6 @@ def calculate_chunk(self, file: File, start_offset: int) -> Optional[ValidChunk]
# We read the signature header here to get the offset to the header database
first_db_header = start_offset + len(header) + header.next_header_offset
end_offset = first_db_header + header.next_header_size
return ValidChunk(start_offset=start_offset, end_offset=end_offset)
return ValidChunk(
start_offset=start_offset, end_offset=end_offset, metadata=header
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to pass all attributes from the header as metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This point came up when discussing with @e3krisztian yesterday. I think it's better to only pass the most relevant header attributes rather than the whole instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 75548d2

Allow handlers to provide a dict value as part of a ValidChunk metadata
attribute. That dictionnary can contain any relevant metadata
information from the perspective of the handler, but we advise handler
writers to report parsed information such as header values.

This metadata dict is later reported as part of our ChunkReports and
available in the JSON report file if the user requested one.

The idea is to expose metadata to further analysis steps through the
unblob report. For example, a binary analysis toolkit would read the load
address and architecture from a uImage chunk to analyze the file
extracted from that chunk with the right settings.

A note on the 'as_dict' implementation.

The initial idea was to implement it in dissect.cstruct (see
fox-it/dissect.cstruct#29), but due to expected
changes in the project's API I chose to implement it in unblob so we're
not dependent on another project.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants