Skip to content

State-of-the-Art Ember embedding model for retrieval augmented generation

License

Notifications You must be signed in to change notification settings

llmrails/ember-v1

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 

Repository files navigation

ember-v1

This model has been trained on an extensive corpus of text pairs that encompass a broad spectrum of domains, including finance, science, medicine, law, and various others. During the training process, we incorporated techniques derived from the RetroMAE and SetFit research papers.

We are pleased to offer this model as an API service through our platform, LLMRails. If you are interested, please don't hesitate to sign up.

Plans

  • The research paper will be published soon.
  • The v2 of the model is currently in development and will feature an extended maximum sequence length of 4,000 tokens.

Usage

Use with API request:

curl --location 'https://api.llmrails.com/v1/embeddings' \
--header 'X-API-KEY: {token}' \
--header 'Content-Type: application/json' \
--data '{
   "input": ["This is an example sentence"],
   "model":"embedding-english-v1" # equals to ember-v1
}'

API docs: https://docs.llmrails.com/embedding/embed-text
Langchain plugin: https://python.langchain.com/docs/integrations/text_embedding/llm_rails

Use with transformers:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "This is an example sentence",
    "Each sentence is converted"
]

tokenizer = AutoTokenizer.from_pretrained("llmrails/ember-v1")
model = AutoModel.from_pretrained("llmrails/ember-v1")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

Use with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = [
	"This is an example sentence",
    "Each sentence is converted"
]

model = SentenceTransformer('llmrails/ember-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))

Massive Text Embedding Benchmark (MTEB) Evaluation

Our model achieve state-of-the-art performance on MTEB leaderboard

Model Name Dimension Sequence Length Average (56)
bge-large-en-v1.5 1024 512 64.23
bge-base-en-v1.5 768 512 63.55
ember-v1 1024 512 63.54
text-embedding-ada-002 1536 8191 60.99

Limitation

This model exclusively caters to English texts, and any lengthy texts will be truncated to a maximum of 512 tokens.