embeddings #16

Open
pannous opened this issue Mar 7, 2024 · 8 comments

Labels: enhancement (New feature or request), question (Further information is requested)

pannous commented Mar 7, 2024

Hi, great project!

How hard would it be to extract embeddings from the LLMs?

eastriverlee added the "question" label Mar 7, 2024
eastriverlee (Owner) commented Mar 7, 2024

func encode(_ text: borrowing String) -> [Token] should do it.

let text = "hello world"
let llm = LLM(...)
let embeddings = llm.encode(text)

and for decoding:

let decodedText = llm.model.decode(embeddings)

pannous (Author) commented Mar 7, 2024

Thanks, I thought about vector embeddings though:
llm.embedding("King") ≈ llm.embedding("Queen") ≈ [Float]*768

[Token] would just be ≈ one int per word

eastriverlee (Owner) commented Mar 7, 2024

i think you have the wrong definition of LLM embeddings, and it's understandable because i was also once confused about the concept. you might want to check this comment. it's also the reason why i chose not to use the word "embedding" in this library.

if you want to test similarities between embeddings, you can cast the [Token] array output by the encode function to [Float] and use it in a vector DB, or check the cosine similarity between your choices.
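
for reference, a cosine similarity check over two equal-length [Float] vectors could look something like this (just a rough sketch; the cosineSimilarity helper is illustrative and not part of this library):

// plain cosine similarity between two equal-length float vectors
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

// usage, assuming you already produced two equal-length [Float] vectors somehow:
// print(cosineSimilarity(vectorA, vectorB))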

however, for checking similarity between simple words like "king" or "queen" in your example, i suggest you just use apple's Natural Language framework like this, since LLM tokens are chosen somewhat arbitrarily (as far as i know), so there is no guarantee that "king" and "queen" are more similar than "king" and "monitor".
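
something along these lines with NLEmbedding, using the english word vectors that ship with the OS (a rough sketch, untested here):

import NaturalLanguage

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // smaller cosine distance means more similar
    let kingQueen = embedding.distance(between: "king", and: "queen", distanceType: .cosine)
    let kingMonitor = embedding.distance(between: "king", and: "monitor", distanceType: .cosine)
    print(kingQueen, kingMonitor)

    // you can also pull out the raw vector if you want to store it in a vector DB
    let kingVector: [Double]? = embedding.vector(for: "king")
    print(kingVector?.count ?? 0)
}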

i haven't tested this myself, but for just checking similarities between sentences or words you could use the similarity-search-kit library, or use it together with this one.

pannous (Author) commented Mar 7, 2024

Thanks, very very helpful links!!!

LLM embeddings are usually a vector of floats; token encodings are a vector of ints, and casting those to float makes no sense. No confusion here ;)

eastriverlee (Owner) commented Mar 7, 2024

i was just saying that you have the option to cast an array of ints as an array of floats so that you can check cosine similarity. after all, an int array is also just a valid one-dimensional vector.

i'm glad i was able to help you!

eastriverlee (Owner) commented Mar 8, 2024

so, i researched this a bit further, since i wasn't so sure i understood the concept correctly. what you are referring to is indeed not part of an LLM itself. however, embedding models are usually used alongside LLMs, typically for text search, and that's where the confusion occurs. aside from the fact that some people refer to tokens as embeddings, that is.

for example, mistral offers the mistral-embed model, and openAI offers text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002. they can be used in conjunction with an LLM like mistral 7B, but they have no direct relation to the LLM itself. and since machine learning models are black boxes that we can't really look into so far, there is no direct way to retrieve the "actual internal representation" of text from an LLM either, other than tokens.
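
to make that concrete, getting a vector out of one of those hosted embedding models is just a separate http call, roughly like this (a sketch against openAI's documented /v1/embeddings endpoint; the fetchEmbedding name and apiKey are placeholders, and error handling is omitted):

import Foundation

// POST the text to the embeddings endpoint and read back the float vector
func fetchEmbedding(for text: String, apiKey: String) async throws -> [Double] {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/embeddings")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "text-embedding-3-small",
        "input": text
    ])
    let (data, _) = try await URLSession.shared.data(for: request)
    let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    let first = (json?["data"] as? [[String: Any]])?.first
    return first?["embedding"] as? [Double] ?? []
}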

pannous (Author) commented Mar 8, 2024

word2vec embeddings were not related to LLMs, but today embeddings are (mostly) produced via LLMs, or, as you correctly pointed out, via SLMs (small language models), although some believe that using truly large LLMs also gives better embeddings.

there is not a direct way to retrieve the "actual internal representation"

I think your research yielded a wrong result there. While using all of the current activations as an embedding would be overkill, LLM embeddings are indeed calculated from activations (a mean-pooling sketch follows this list):

• Pooling Strategies: Applying operations such as mean or max pooling over activations from one or more layers to create fixed-size embeddings.
• Concatenation of Multiple Layers: Combining activations from multiple layers to form a richer representation.
• Last Layer Embeddings: Using the activations from the last hidden layer of the model as the embedding for a word or sentence. (This indeed makes no sense if the last layer only outputs tokens.)
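
As a rough illustration, mean pooling is just an average over the token axis. A sketch, assuming you already have one hidden-state vector per token from the last hidden layer (hypothetical data, nothing LLM.swift exposes today):

// hiddenStates: one [Float] activation vector per input token,
// i.e. shape [tokenCount][hiddenSize], taken from the last hidden layer
func meanPooled(_ hiddenStates: [[Float]]) -> [Float] {
    guard let first = hiddenStates.first else { return [] }
    var pooled = [Float](repeating: 0, count: first.count)
    for state in hiddenStates {
        for (i, value) in state.enumerated() {
            pooled[i] += value
        }
    }
    let tokenCount = Float(hiddenStates.count)
    return pooled.map { $0 / tokenCount }
}
// the result is a single fixed-size sentence embedding,
// regardless of how many tokens the input had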

eastriverlee (Owner) commented Mar 8, 2024

thank you for the clarification and the clear explanation. i was the one who had the wrong idea. my bad. i'll look into this more, find a way to get embeddings through the methods you described, and keep you updated here. i really appreciate it. it's hard to get the right information in the LLM field as a non-researcher. i have to learn more on this.

i'll see if i can implement this in my library referencing this code:
https://github.com/ggerganov/llama.cpp/blob/master/examples/embedding/embedding.cpp#L54

until then, in the llama.cpp library that this one depends on, it seems you should be able to get the float embeddings you want by using float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id).
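
a rough sketch of how that C function could be wrapped from swift, assuming the context was created with embeddings enabled and the prompt already decoded (the llama module name, the wrapper name, and the exact imported signatures are assumptions, nothing in this library yet):

import llama // the llama.cpp C module this library builds on

// read back the pooled embedding of one sequence as a swift [Float]
func sequenceEmbedding(context: OpaquePointer, model: OpaquePointer, seqID: Int32) -> [Float]? {
    let dimension = Int(llama_n_embd(model)) // embedding size of the loaded model
    guard let pointer = llama_get_embeddings_seq(context, seqID) else { return nil }
    // copy the C float buffer into a swift array
    return Array(UnsafeBufferPointer(start: pointer, count: dimension))
}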

eastriverlee reopened this Mar 8, 2024
eastriverlee added the "enhancement" label Mar 8, 2024