Support New Bedrock Embeddings (Cohere) #3315

Closed
metaskills opened this issue Nov 17, 2023 · 5 comments · May be fixed by #5167
Labels
auto:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features

Comments

@metaskills

New embedding models are available on Bedrock (listed below). These seem to use an input_type arg, which I would like to use too.

  • cohere.embed-english-v3
  • cohere.embed-multilingual-v3

dosubot bot commented Nov 17, 2023

🤖

Hello @metaskills,

I'm Dosu, a friendly bot here to assist you with your langchainjs issues and questions, and help you contribute to the project if you're interested. While we wait for a human maintainer to join us, I'll be working on your issue regarding the new Bedrock Embeddings (Cohere). I'll provide you with a full answer as soon as possible. Thank you for your patience!

@dosubot dosubot bot added the auto:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features label Nov 17, 2023

dosubot bot commented Nov 17, 2023

🤖

Hello @metaskills,

Nice to meet you! I'm Dosu, a bot here to assist you. I can help you tackle bugs, answer questions, and guide you in becoming a contributor. While we wait for a human maintainer, feel free to ask me any questions. I'm here to help!

Sure, you can integrate the input_type argument for the new Bedrock Embeddings (Cohere) - cohere.embed-english-v3 and cohere.embed-multilingual-v3 into the langchainjs repository.

You can add the input_type parameter to the CohereEmbeddingsParams interface and use it in the embedDocuments and embedQuery methods when calling the embeddingWithRetry method.

Here is how you can modify the code:

export interface CohereEmbeddingsParams extends EmbeddingsParams {
  modelName: string;
  inputType?: string; // Add this line

  /**
   * The maximum number of documents to embed in a single request. This is
   * limited by the Cohere API to a maximum of 96.
   */
  batchSize?: number;
}

// ...

export class CohereEmbeddings
  extends Embeddings
  implements CohereEmbeddingsParams
{
  modelName = "small";
  inputType = "text"; // Add this line

  // ...

  constructor(
    fields?: Partial<CohereEmbeddingsParams> & {
      verbose?: boolean;
      apiKey?: string;
    }
  ) {
    // ...

    this.inputType = fieldsWithDefaults?.inputType ?? this.inputType; // Add this line
  }

  // ...

  async embedDocuments(texts: string[]): Promise<number[][]> {
    // ...

    const batchRequests = batches.map((batch) =>
      this.embeddingWithRetry({
        model: this.modelName,
        texts: batch,
        input_type: this.inputType, // Add this line
      })
    );

    // ...
  }

  async embedQuery(text: string): Promise<number[]> {
    // ...

    const { body } = await this.embeddingWithRetry({
      model: this.modelName,
      texts: [text],
      input_type: this.inputType, // Add this line
    });

    // ...
  }

  // ...
}

This way, you can specify the input_type when creating an instance of CohereEmbeddings and it will be used when generating embeddings. If not specified, it defaults to "text".

Please note that the current implementation of the embeddingWithRetry method in the CohereEmbeddings class does not support the input_type parameter. The request parameter only includes model and texts properties. Therefore, you will need to modify the embeddingWithRetry method to include the input_type parameter in the request object.
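
A minimal sketch of what that request change might look like. The names `CohereEmbedRequest` and `buildEmbedRequest` are hypothetical and are not the actual langchainjs definitions, and the set of `input_type` values is my understanding of what the Cohere v3 embed models accept, so check it against the Cohere documentation.

```typescript
// Hypothetical types for illustration only; not the real langchainjs request type.
// Assumed set of values accepted by the Cohere v3 embed models.
type CohereInputType =
  | "search_document"
  | "search_query"
  | "classification"
  | "clustering";

interface CohereEmbedRequest {
  model: string;
  texts: string[];
  // Optional so that callers of older models can omit it.
  input_type?: CohereInputType;
}

// Builds the request object, adding input_type only when one is supplied.
function buildEmbedRequest(
  model: string,
  texts: string[],
  inputType?: CohereInputType
): CohereEmbedRequest {
  const request: CohereEmbedRequest = { model, texts };
  if (inputType !== undefined) {
    request.input_type = inputType;
  }
  return request;
}
```

Typing `input_type` as a union rather than a plain string lets the compiler reject values the API would likely refuse.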

I hope this helps! Let me know if you have any other questions.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


@BrianErikson

BrianErikson commented Nov 21, 2023

As supplemental information that I found during my investigation, it appears that embeddings/bedrock.ts#L88-L95 should be revised to accommodate the new Cohere embedding schema on AWS Bedrock:

// New Bedrock schema for Cohere Embedding
{
  "modelId": "cohere.embed-english-v3",
  "contentType": "application/json",
  "accept": "*/*",
  "body": {
    "texts": ["Hello world", "This is a test"],
    "input_type": "search_document"
  }
}
// Bedrock schema for Titan Embedding
{
  "modelId": "amazon.titan-embed-text-v1",
  "contentType": "application/json",
  "accept": "*/*",
  "body": {
    "inputText": "this is where you place your input text"
  }
}

AWS Bedrock Documentation on Cohere Embed input_type
Note: it appears that the existing schema still needs to be supported for Titan Embeddings users according to the AWS Bedrock API playground.
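
One way to support both schemas is to choose the request body by model family. The sketch below is only an illustration under stated assumptions: `buildEmbeddingBodies` is a hypothetical helper, and detecting Cohere models by the `cohere.` prefix of the model ID is my assumption, not something the Bedrock documentation prescribes.

```typescript
// Only the two input types used for retrieval are modelled in this sketch.
type CohereInputType = "search_document" | "search_query";

// Hypothetical helper: returns one JSON body per InvokeModel call.
// Cohere takes all texts in one body; Titan takes a single inputText per body.
function buildEmbeddingBodies(
  modelId: string,
  texts: string[],
  inputType: CohereInputType
): string[] {
  if (modelId.startsWith("cohere.")) {
    // Assumption: every Cohere embedding model ID starts with "cohere.".
    return [JSON.stringify({ texts, input_type: inputType })];
  }
  // Fallback: the existing Titan schema shown above.
  return texts.map((inputText) => JSON.stringify({ inputText }));
}
```

With this shape, Titan users keep one request per text, while Cohere users send a single request carrying the `input_type`.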

Here's a sample of the current error:
Error: An error occurred while embedding documents with Bedrock: Malformed input request: 3 schema violations found, please reformat your input and try again.

EDIT: I don't know how the contribution process works, or whether this is an agreeable design strategy for the LangChain devs, but I found a solution that worked for me, so I thought I'd contribute it back for whoever else is stuck on this. Here's a patched BedrockEmbeddings module that can be used in place of BedrockEmbeddings when the new Cohere embeddings are needed.

import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";
import { AwsCredentialIdentity, Provider } from "@aws-sdk/types";
import { Embeddings } from "langchain/embeddings/base";
import { BedrockEmbeddingsParams } from "langchain/embeddings/bedrock";

export type CredentialType =
  | AwsCredentialIdentity
  | Provider<AwsCredentialIdentity>;

export class BedrockCohereEmbeddings
  extends Embeddings
  implements BedrockEmbeddingsParams {
  model: string;

  client: BedrockRuntimeClient;

  // Note: not used below; each call sends all of its documents in a single request.
  batchSize = 512;

  constructor(fields?: BedrockEmbeddingsParams) {
    super(fields ?? {});

    this.model = fields?.model ?? "cohere.embed-english-v3";

    this.client =
      fields?.client ??
      new BedrockRuntimeClient({
        region: fields?.region,
        credentials: fields?.credentials,
      });
  }

  /**
   * Embeds an array of documents using the Bedrock model.
   * @param documents The array of documents to be embedded.
   * @param inputType The input type for the embedding process.
   * @returns A promise that resolves to a 2D array of embeddings.
   * @throws If an error occurs while embedding documents with Bedrock.
   */
  protected async _embedDocuments(documents: string[], inputType: string): Promise<number[][]> {
    return this.caller.call(async () => {
      try {
        const res = await this.client.send(
          new InvokeModelCommand({
            modelId: this.model,
            body: JSON.stringify({
              texts: documents.map((doc) => doc.replace(/\n+/g, ' ')),
              input_type: inputType,
            }),
            contentType: "application/json",
            accept: "application/json",
          })
        );

        const body = new TextDecoder().decode(res.body);
        return JSON.parse(body).embeddings;
      } catch (e) {
        console.error({
          error: e,
        });
        if (e instanceof Error) {
          throw new Error(
            `An error occurred while embedding documents with Bedrock: ${e.message}`
          );
        }

        throw new Error(
          "An error occurred while embedding documents with Bedrock"
        );
      }
    });
  }

  /**
   * Method that takes a document as input and returns a promise that
   * resolves to an embedding for the document.
   * @param document Document for which to generate an embedding.
   * @returns Promise that resolves to an embedding for the input document.
   */
  async embedQuery(document: string): Promise<number[]> {
    return this._embedDocuments([document], 'search_query').then((embeddings) => embeddings[0]);
  } 

  /**
   * Method to generate embeddings for an array of texts.
   * @param documents Array of texts for which to generate embeddings.
   * @returns Promise that resolves to a 2D array of embeddings for each input document.
   */
  async embedDocuments(documents: string[]): Promise<number[][]> {
    return this._embedDocuments(documents, 'search_document');
  }
}
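
The class above declares `batchSize` but sends every document in a single request. A minimal sketch of splitting the input into batches follows. `chunk` and `embedInBatches` are hypothetical helpers, and the default of 96 is the Cohere API limit quoted earlier in this thread; whether Bedrock enforces the same limit is an assumption.

```typescript
// Splits an array into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Embeds documents batch by batch, preserving input order.
// `embedBatch` stands in for a call such as _embedDocuments(batch, inputType).
async function embedInBatches(
  documents: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 96 // assumed limit, see the note above
): Promise<number[][]> {
  const results: number[][] = [];
  for (const batch of chunk(documents, batchSize)) {
    results.push(...(await embedBatch(batch)));
  }
  return results;
}
```

Batches are sent sequentially here to keep the sketch simple; issuing them in parallel would be faster but is more likely to hit throttling.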

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 21, 2024

dosubot bot commented Feb 21, 2024

🤖

Hi, @metaskills,

I'm helping the langchainjs team manage their backlog and am marking this issue as stale. From what I understand, the issue you opened requested support for the new Bedrock embeddings from Cohere, specifically the input_type argument for the cohere.embed-english-v3 and cohere.embed-multilingual-v3 models. The issue has received some activity, including a detailed response from me providing code modifications to integrate the input_type parameter into the CohereEmbeddings class. Additionally, BrianErikson has contributed supplemental information and a patched BedrockCohereEmbeddings module to accommodate the new Cohere embedding schema on AWS Bedrock.

Could you please confirm if this issue is still relevant to the latest version of the langchainjs repository? If it is, please let the langchainjs team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you!

@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 28, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 28, 2024
@vishal-oogway

I'd like to test the Cohere embeddings available on Bedrock and think this helper class would be really beneficial. Is it possible to get @BrianErikson's contribution merged into community extensions?

3 participants