feat: Add MetadataBuilder #6636

Open: wants to merge 6 commits into base: main

Contributor

@vrunm vrunm commented Dec 22, 2023

Related Issues

fixes #5702
fixes #5700

Proposed Changes:

Adds a new component, MetadataBuilder, which takes a list of Documents together with the output of the Generator they were passed to, and stores that output in the Documents' metadata.

The Generator takes a list of Documents and returns replies and metadata.

The MetadataBuilder adds these replies and metadata to the meta field of the corresponding Document.

Best explained through an example:

In this example, three documents are passed to the Generator.
The Generator has generated three replies and metadata for these.
The MetadataBuilder adds the replies and metadata of the Generator as metadata to the three Document objects.
The MetadataBuilder then returns this List of Documents.

metadata_builder = MetadataBuilder()
documents = [
    Document(content="document_0"),
    Document(content="document_1"),
    Document(content="document_2"),
]
replies = ["reply_0", "reply_1", "reply_2"]
metadata = [{"key_0": "value_0"}, {"key_1": "value_1"}, {"key_2": "value_2"}]
result = metadata_builder.run(replies=replies, documents=documents, meta=metadata)
print(result)

{
    'documents': [
        {
            'id': '08005f665ae33e6b3d8fd4a33fc9f09157ff89e8a2f25698ea1f32127748aeeb',
            'content': 'document_0',
            'meta': {'reply': 'reply_0', 'key_0': 'value_0'}
        },
        {
            'id': '60576c63fb17ae9d5dc0cbcc6d7ddfe299bc44a12ff8b7233e73257a3152e9b2',
            'content': 'document_1',
            'meta': {'reply': 'reply_1', 'key_1': 'value_1'}
        },
        {
            'id': '038f78b217389bac27f9a4d690dba2b74d3139cea7ccfa955bcd5b332f3166aa',
            'content': 'document_2',
            'meta': {'reply': 'reply_2', 'key_2': 'value_2'}
        }
    ]
}
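
The merging step shown in this example can be sketched without Haystack. This is a minimal sketch, not the PR's implementation: `Doc` is a hypothetical stand-in for Haystack's Document, `add_replies_to_meta` is an illustrative helper, and the choice to let Generator metadata overwrite existing keys is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for Haystack's Document (content and meta only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def add_replies_to_meta(
    documents: List[Doc],
    replies: List[str],
    meta: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, List[Doc]]:
    """Pair each document with its reply (and optional metadata) by position."""
    if meta is None:
        meta = [{} for _ in documents]
    merged = []
    for doc, reply, extra in zip(documents, replies, meta):
        # Build a new meta dict so the input documents are left untouched.
        # Assumption: on a key clash, the Generator metadata wins.
        new_meta = {**doc.meta, "reply": reply, **extra}
        merged.append(Doc(content=doc.content, meta=new_meta))
    return {"documents": merged}


docs = [Doc(content=f"document_{i}") for i in range(3)]
result = add_replies_to_meta(
    docs,
    replies=["reply_0", "reply_1", "reply_2"],
    meta=[{"key_0": "value_0"}, {"key_1": "value_1"}, {"key_2": "value_2"}],
)
for doc in result["documents"]:
    print(doc.content, doc.meta)
```

Returning new objects rather than mutating the inputs keeps the sketch free of side effects; whether the actual component copies or mutates is not stated in this description.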

How did you test it?

Added unit tests for the cases where the component:

  • Receives a list of documents, replies and metadata.
  • Receives a list of documents and replies but no metadata.
  • Receives a Document list and a replies list of different lengths, without metadata.
  • Receives a Document list and a replies list of different lengths, with metadata.
  • Receives a Document list and a metadata list of different lengths.
  • Receives a Document list, replies list and metadata list that all differ in length.
  • Receives Documents whose metadata already contains a reply key.
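
The description does not say how the component reacts when the list lengths differ. The following is a minimal sketch of one possible check, assuming a mismatch raises `ValueError`; `check_lengths` and `raises_value_error` are hypothetical helpers, not names from the PR.

```python
from typing import Optional, Sequence


def check_lengths(
    documents: Sequence[object],
    replies: Sequence[object],
    meta: Optional[Sequence[object]] = None,
) -> None:
    """Raise ValueError if the inputs cannot be paired one-to-one (assumed behavior)."""
    if len(documents) != len(replies):
        raise ValueError(
            f"Got {len(documents)} documents but {len(replies)} replies."
        )
    if meta is not None and len(documents) != len(meta):
        raise ValueError(
            f"Got {len(documents)} documents but {len(meta)} metadata entries."
        )


def raises_value_error(*args) -> bool:
    """Demo helper: True if check_lengths rejects the given inputs."""
    try:
        check_lengths(*args)
    except ValueError:
        return True
    return False


print(raises_value_error(["d0", "d1"], ["r0", "r1"]))        # matching lengths, no meta
print(raises_value_error(["d0", "d1"], ["r0"]))              # fewer replies than documents
print(raises_value_error(["d0", "d1"], ["r0", "r1"], [{}]))  # fewer meta entries than documents
```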

Tests on Pipelines:

  • Added a test for a summarization Pipeline using a HuggingFaceLocalGenerator.

  • Added four tests for a RAG pipeline with the following Generators:

    • HuggingFaceLocalGenerator
    • HuggingFaceTGIGenerator
    • CohereGenerator
    • GradientGenerator

    The test checks:

    • Three results are obtained from the RAG pipeline.
    • Each result contains extracted answers from the generated responses.
    • The LLM reply has been added to the Document metadata correctly by the MetadataBuilder.

Checklist

@vrunm vrunm requested a review from a team as a code owner December 22, 2023 16:59
@vrunm vrunm requested review from anakin87 and removed request for a team December 22, 2023 16:59
@github-actions github-actions bot added the topic:tests, 2.x (Related to Haystack v2.0) and type:documentation (Improvements on the docs) labels Dec 22, 2023
@vrunm vrunm changed the title from "Add MetadataBuilder" to "feat: Add MetadataBuilder" Dec 22, 2023
@anakin87
Member

@vrunm thanks for this contribution...

I will take an in-depth look after Christmas!

@vrunm vrunm requested a review from a team as a code owner December 22, 2023 17:30
@vrunm vrunm requested review from dfokina and removed request for a team December 22, 2023 17:30
@anakin87
Member

I think we should discuss this component.
I want to understand if its scope overlaps with that of Jinja2Builder (#6608).
I would also like to pull in @mathislucka and @sjrl.

@vrunm I see that there are some conflicts, but in any case, I think there will be a few days to wait...

@sjrl
Contributor

sjrl commented Jan 2, 2024

I like this a lot! The only addition (based off just reading the example) that I would like to see is being able to specify the key name of the meta field where we would store the generated reply by the LLM. So something like

{
    'documents': [
        {
            'id': '08005f665ae33e6b3d8fd4a33fc9f09157ff89e8a2f25698ea1f32127748aeeb',
            'content': 'document_0',
            'meta': {'user_specified_key': 'reply_0', 'key_0': 'value_0'}
        },
        {
            'id': '60576c63fb17ae9d5dc0cbcc6d7ddfe299bc44a12ff8b7233e73257a3152e9b2',
            'content': 'document_1',
            'meta': {'user_specified_key': 'reply_1', 'key_1': 'value_1'}
        },
        {
            'id': '038f78b217389bac27f9a4d690dba2b74d3139cea7ccfa955bcd5b332f3166aa',
            'content': 'document_2',
            'meta': {'user_specified_key': 'reply_2', 'key_2': 'value_2'}
        }
    ]
}

since I could imagine scenarios of wanting to add multiple meta fields to Documents from calling LLMs multiple times (e.g. one call to summarize and maybe another call to extract entities).
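
A minimal sketch of that idea, using plain dicts in place of Document meta. The helper `add_to_meta` and its `meta_key` parameter are hypothetical names chosen for illustration, not the PR's API, and the sample values are made up.

```python
from typing import Any, Dict, List


def add_to_meta(
    metas: List[Dict[str, Any]], values: List[Any], meta_key: str
) -> List[Dict[str, Any]]:
    """Return copies of the meta dicts with values[i] stored under meta_key."""
    return [{**meta, meta_key: value} for meta, value in zip(metas, values)]


metas = [{"key_0": "value_0"}, {"key_1": "value_1"}]
# First LLM call produces summaries, second one produces extracted entities.
metas = add_to_meta(metas, ["summary_0", "summary_1"], meta_key="summary")
metas = add_to_meta(metas, [["Berlin"], ["Paris"]], meta_key="entities")
print(metas[0])
```

Because each call writes under its own key, the second call does not overwrite the result of the first.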

@mathislucka
Member

Thank you so much for the contribution @vrunm!

I agree with @sjrl here, the component would be more useful if the user could specify the key that should be used for meta. I would also change the naming of the parameters. replies is specific to the generator but what about cases where a user might want to add extracted entities from an EntityExtractor or maybe use a TextClassificationNode and then add the predicted label to meta?

I could see tonnes of applications here. You could even use it for embeddings. So the advanced embedding generation could look like Jinja2Builder > TextEmbedder > MetadataBuilder. This would give users complete flexibility on how they want to templatize the documents before embedding them.

Also for the case of multiple values that should be added, do people use separate MetadataBuilder instances for that?

Or could we maybe do something like this:

builder = MetadataBuilder(meta_keys=["entities", "summary"])

builder.run(data={"entities": [[...], [...]], "summary": ["...", "..."]}, documents=[doc1, doc2])

# would result in {"meta": {"entities": [...], "summary": "..."}} for each document
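
One way to read this proposal is that each key maps to a list with one value per document. A minimal sketch under that assumption follows; `build_meta` is a hypothetical helper, the sample data is invented, and raising on missing keys or mismatched lengths is an assumed design choice rather than something stated in the thread.

```python
from typing import Any, Dict, List


def build_meta(
    meta_keys: List[str], data: Dict[str, List[Any]], n_documents: int
) -> List[Dict[str, Any]]:
    """Return one meta dict per document, taking data[key][i] for document i."""
    for key in meta_keys:
        if key not in data:
            raise KeyError(f"No data given for meta key '{key}'.")
        if len(data[key]) != n_documents:
            raise ValueError(
                f"'{key}' has {len(data[key])} values for {n_documents} documents."
            )
    return [{key: data[key][i] for key in meta_keys} for i in range(n_documents)]


data = {
    "entities": [["Berlin", "Haystack"], ["Paris"]],
    "summary": ["Summary of doc1", "Summary of doc2"],
}
per_document_meta = build_meta(["entities", "summary"], data, n_documents=2)
print(per_document_meta[1])
```

A single instance can then cover several meta fields, which speaks to the question above about needing separate MetadataBuilder instances.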

Thinking about it that way, we could maybe rename the component to DocumentBuilder and give the user complete freedom on how they want to assemble their document?

@vrunm
Contributor Author

vrunm commented Jan 22, 2024

I have updated the component so that the user can now specify the keys to use in the meta.
An example of the new usage:

metadata_builder = MetadataBuilder(meta_keys=["entities", "summary"])
documents = [Document(content="document_0"), Document(content="document_1")]
data = {
    "entities": ["entity1", "entity2", "entity3"],
    "summary": ["Summary 1", "Summary 2", "Summary3"],
}
metadata = [{"": ""}, {"": ""}]
result = metadata_builder.run(documents=documents, data=data, meta=metadata)
print(result)

data = {
    'documents': [
        {
            'id': '08005f665ae33e6b3d8fd4a33fc9f09157ff89e8a2f25698ea1f32127748aeeb',
            'content': 'document_0',
            'meta': {
                'entities': ['entity1', 'entity2', 'entity3'],
                'summary': ['Summary 1', 'Summary 2', 'Summary3']
            }
        },
        {
            'id': '60576c63fb17ae9d5dc0cbcc6d7ddfe299bc44a12ff8b7233e73257a3152e9b2',
            'content': 'document_1',
            'meta': {
                'entities': ['entity1', 'entity2', 'entity3'],
                'summary': ['Summary 1', 'Summary 2', 'Summary3']
            }
        }
    ]
}

@anakin87
Member

Hey, @vrunm...

I am sorry to have kept you waiting so long.
See my comment in the original issue: #5702 (comment)

I would put work on this feature on hold until we have better defined what we expect and have made sure that this component fits neatly into a pipeline.

Labels: 2.x (Related to Haystack v2.0), topic:tests, type:documentation (Improvements on the docs)
Development

Successfully merging this pull request may close these issues.

MetadataBuilder DocumentsBuilder
4 participants