Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add ByteStream resolved_mime_type property #7670

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 1 addition & 1 deletion haystack/components/routers/file_type_router.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,7 @@ def run(self, sources: List[Union[str, Path, ByteStream]]) -> Dict[str, List[Uni
if isinstance(source, Path):
mime_type = self._get_mime_type(source)
elif isinstance(source, ByteStream):
mime_type = source.meta.get("content_type", None)
mime_type = source.resolved_mime_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only change should be

Suggested change
mime_type = source.resolved_mime_type
mime_type = source.mime_type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @masci @wochinge and I went back and forth on this one in the background for several rounds before deciding on this approach.

We can't employ this simple solution without breaking some existing components i.e. ContentFetcher that uses metadata of the bytestream to set content_type. We have diversion between local files that set mime_type directly and web based fetches that set content type. And now this property resolves the mime_type regardless of where the data comes from. That's the idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really breaking, it's more that those components are bugged no? I think by supporting the buggy behaviour we pile up tech debt and make the code more complex, is it really worth it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i can do that as well. We can find all the references of content_type and replace it, in one PR. Wdyt @wochinge ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this will work, we have to change a few things in various tests, in link content fetcher but the change won't have widespread impact on users. Thanks @masci

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that just relying on .mime_type seems simpler - hard for me to overlook the requirements of these other components though so I don't think i'm a good judge here

else:
raise ValueError(f"Unsupported data source type: {type(source).__name__}")

Expand Down
57 changes: 52 additions & 5 deletions haystack/dataclasses/byte_stream.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, Optional
from typing import Any, Dict, List, Optional


@dataclass
Expand All @@ -12,6 +12,7 @@ class ByteStream:
data: bytes
meta: Dict[str, Any] = field(default_factory=dict, hash=False)
mime_type: Optional[str] = field(default=None)
mime_type_resolution_priority: List[str] = field(default_factory=lambda: ["attribute", "meta"])

def to_file(self, destination_path: Path):
"""
Expand All @@ -24,21 +25,36 @@ def to_file(self, destination_path: Path):

@classmethod
def from_file_path(
cls, filepath: Path, mime_type: Optional[str] = None, meta: Optional[Dict[str, Any]] = None
cls,
filepath: Path,
mime_type: Optional[str] = None,
meta: Optional[Dict[str, Any]] = None,
mime_type_resolution_priority: Optional[List[str]] = None,
wochinge marked this conversation as resolved.
Show resolved Hide resolved
) -> "ByteStream":
"""
Create a ByteStream from the contents read from a file.

:param filepath: A valid path to a file.
:param mime_type: The mime type of the file.
:param meta: Additional metadata to be stored with the ByteStream.
:param mime_type_resolution_priority: The priority order of the mime type resolution
"""
with open(filepath, "rb") as fd:
return cls(data=fd.read(), mime_type=mime_type, meta=meta or {})
return cls(
data=fd.read(),
mime_type=mime_type,
meta=meta or {},
mime_type_resolution_priority=mime_type_resolution_priority or ["attribute", "meta"],
)

@classmethod
def from_string(
cls, text: str, encoding: str = "utf-8", mime_type: Optional[str] = None, meta: Optional[Dict[str, Any]] = None
cls,
text: str,
encoding: str = "utf-8",
mime_type: Optional[str] = None,
meta: Optional[Dict[str, Any]] = None,
mime_type_resolution_priority: Optional[List[str]] = None,
) -> "ByteStream":
"""
Create a ByteStream encoding a string.
Expand All @@ -47,8 +63,14 @@ def from_string(
:param encoding: The encoding used to convert the string into bytes
:param mime_type: The mime type of the file.
:param meta: Additional metadata to be stored with the ByteStream.
:param mime_type_resolution_priority: The priority order of the mime type resolution
"""
return cls(data=text.encode(encoding), mime_type=mime_type, meta=meta or {})
return cls(
data=text.encode(encoding),
mime_type=mime_type,
meta=meta or {},
mime_type_resolution_priority=mime_type_resolution_priority or ["attribute", "meta"],
)

def to_string(self, encoding: str = "utf-8") -> str:
"""
Expand All @@ -59,3 +81,28 @@ def to_string(self, encoding: str = "utf-8") -> str:
:raises: UnicodeDecodeError: If the ByteStream data cannot be decoded with the specified encoding.
"""
return self.data.decode(encoding)

@property
def resolved_mime_type(self) -> Optional[str]:
"""
Returns the resolved MIME type of the ByteStream based on the `mime_type_resolution_priority` property.

The MIME type is consolidated using two different contexts:
- `content_type`: Used for resources fetched from the web and stored in the `meta` field.
- `mime_type`: Used for local files.

The `mime_type_resolution_priority` property prioritizes the resolution of the MIME type based on the order
of preference, picking one of the two sources of truth. The default order is `["attribute", "meta"]`.

This property is useful because it can be used with any `ByteStream` instance, regardless of the origin of
creation, to conveniently determine the MIME type instead of checking both sources of truth separately.

:return: The MIME type if available, otherwise `None`.
"""
sources = {"meta": self.meta.get("content_type", None), "attribute": self.mime_type}

for source in self.mime_type_resolution_priority:
if sources[source]:
return sources[source]

return None
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
---
enhancements:
- |
Improved MIME type resolution in ByteStream dataclass to allow priority-based MIME type determination. Users can now specify a priority order for resolving the MIME type of a ByteStream, utilizing either the explicitly set mime_type attribute or the content_type found within the ByteStream's metadata. This enhancement simplifies handling of MIME types, especially when dealing with ByteStreams originating from various sources.
38 changes: 38 additions & 0 deletions test/dataclasses/test_byte_stream.py
Original file line number Diff line number Diff line change
Expand Up @@ -68,3 +68,41 @@ def test_to_file(tmp_path, request):
ByteStream(test_str.encode()).to_file(test_path)
with open(test_path, "rb") as fd:
assert fd.read().decode() == test_str


# tests resolved_mime_type with different priority orders
def test_resolved_mime_type_priority():
test_string = "Hello, world!"

b = ByteStream(
data=test_string.encode(),
mime_type="application/octet-stream",
meta={"content_type": "text/plain"},
mime_type_resolution_priority=["meta", "attribute"],
)
assert b.resolved_mime_type == "text/plain"

b.mime_type_resolution_priority = ["attribute", "meta"]
assert b.resolved_mime_type == "application/octet-stream"

b = ByteStream(
data=test_string.encode(), mime_type=None, meta={}, mime_type_resolution_priority=["meta", "attribute"]
)
assert b.resolved_mime_type is None


# test resolved_mime_type with default priority which is ["attribute", "meta"]
def test_resolved_mime_type_no_priority():
test_string = "Hello, world!"

b = ByteStream(data=test_string.encode(), mime_type="application/octet-stream", meta={"content_type": "text/plain"})
assert b.resolved_mime_type == "application/octet-stream"

b = ByteStream(data=test_string.encode(), mime_type=None, meta={"content_type": "text/plain"})
assert b.resolved_mime_type == "text/plain"

b = ByteStream(data=test_string.encode(), mime_type="application/octet-stream", meta={})
assert b.resolved_mime_type == "application/octet-stream"

b = ByteStream(data=test_string.encode(), mime_type=None, meta={})
assert b.resolved_mime_type is None