Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add ByteStream resolved_mime_type property #7670

Closed
wants to merge 6 commits into from

Conversation

vblagoje
Copy link
Member

@vblagoje vblagoje commented May 8, 2024

Why:

Enhance the MIME type resolution process for the ByteStream data class. It introduces a more flexible and priority-based approach to determine the MIME type of a ByteStream, crucial for handling data from diverse sources with varying MIME type declarations.

What:

  • Introduce new feature (see diff) without breaking any existing clients

How can it be used:

  • MIME Type Resolution Flexibility: Users can now specify a custom priority for MIME type resolution when creating a ByteStream instance. This is particularly useful for applications that process both local files and streams of data fetched from the web
  • Enhanced Data Processing: With the improved MIME type resolution, applications can more accurately and easily determine the type of the data they are processing regardless of the origin of ByteStream (local, web resource etc etc)

How did you test it:

  • Unit Testing: Tests were added to verify the correct resolution of MIME types based on various prioritization scenarios, including cases with and without explicitly set MIME types and metadata.
  • No regressions were found

Notes for the reviewer:

  • NA

@github-actions github-actions bot added 2.x Related to Haystack v2.0 type:documentation Improvements on the docs labels May 8, 2024
@vblagoje vblagoje added ignore-for-release-notes PRs with this flag won't be included in the release notes. and removed type:documentation Improvements on the docs 2.x Related to Haystack v2.0 labels May 8, 2024
@github-actions github-actions bot added 2.x Related to Haystack v2.0 type:documentation Improvements on the docs labels May 8, 2024
@coveralls
Copy link
Collaborator

coveralls commented May 8, 2024

Pull Request Test Coverage Report for Build 9003061415

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 90.344%

Totals Coverage Status
Change from base Build 8989574784: 0.01%
Covered Lines: 6521
Relevant Lines: 7218

💛 - Coveralls

@vblagoje vblagoje changed the title Add ByteStream resolved_mime_type taking priority resolution flag into account feat: Add ByteStream resolved_mime_type property May 8, 2024
@vblagoje vblagoje marked this pull request as ready for review May 8, 2024 13:54
@vblagoje vblagoje requested review from a team as code owners May 8, 2024 13:54
@vblagoje vblagoje requested review from dfokina, masci and wochinge and removed request for a team May 8, 2024 13:54
Copy link
Member

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we made this PR bigger than needed? I wouldn't rely on objects' meta, let's just fix the expected behaviour

@@ -86,7 +86,7 @@ def run(self, sources: List[Union[str, Path, ByteStream]]) -> Dict[str, List[Uni
if isinstance(source, Path):
mime_type = self._get_mime_type(source)
elif isinstance(source, ByteStream):
mime_type = source.meta.get("content_type", None)
mime_type = source.resolved_mime_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only change should be

Suggested change
mime_type = source.resolved_mime_type
mime_type = source.mime_type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @masci @wochinge and I went back and forth on this one in the background for several rounds before deciding on this approach.

We can't employ this simple solution without breaking some existing components i.e. ContentFetcher that uses metadata of the bytestream to set content_type. We have diversion between local files that set mime_type directly and web based fetches that set content type. And now this property resolves the mime_type regardless of where the data comes from. That's the idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really breaking, it's more that those components are bugged no? I think by supporting the buggy behaviour we pile up tech debt and make the code more complex, is it really worth it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i can do that as well. We can find all the references of content_type and replace it, in one PR. Wdyt @wochinge ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this will work, we have to change a few things in various tests, in link content fetcher but the change won't have widespread impact on users. Thanks @masci

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that just relying on .mime_type seems simpler - hard for me to overlook the requirements of these other components though so I don't think i'm a good judge here

@vblagoje
Copy link
Member Author

Dropping in favour of #7681

@vblagoje vblagoje closed this May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 ignore-for-release-notes PRs with this flag won't be included in the release notes. topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FileTypeRouter should get mime type from ByteStream mime type attribute instead of `meta
4 participants