fix dead link in README
#1137
We use `ruff` in CI and dev workflow now.
Modify some grammar, punctuation, and spelling errors.
…db#747) For object detection, each row may correspond to an image and each image can have multiple bounding boxes of x-y coordinates. This means that a `bbox` field is potentially "list of list of float". This adds support in our pydantic-pyarrow conversion for nested lists.
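The nested-list case can be illustrated with a small standalone sketch. This is not the project's converter: `arrow_type_name` and the `_SCALARS` table are hypothetical, use only the standard `typing` module, and return an Arrow-style type string rather than a real pyarrow type.

```python
from typing import List, get_args, get_origin

# Hypothetical mapping from Python scalar types to Arrow-style type names.
_SCALARS = {float: "double", int: "int64", str: "string", bool: "bool"}


def arrow_type_name(annotation) -> str:
    """Recursively describe a (possibly nested) list annotation."""
    if get_origin(annotation) is list:
        (item,) = get_args(annotation)
        return f"list<{arrow_type_name(item)}>"
    if annotation in _SCALARS:
        return _SCALARS[annotation]
    raise TypeError(f"unsupported annotation: {annotation!r}")


# A bbox field holding several boxes per image, each box a list of floats.
print(arrow_type_name(List[List[float]]))  # list<list<double>>
```

The recursion is the point of the sketch: each level of `List[...]` adds one level of list nesting around the item type.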
Closes lancedb#721

fts will return results as a pyarrow table. Pyarrow tables have a `filter` method, but it does not take SQL filter strings (only pyarrow compute expressions). Instead, we do one of two things to support `tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available, use duckdb to execute the SQL filter string on the pyarrow table.

Backup path: Otherwise, write the pyarrow table to a lance dataset and then do `to_table(filter=<filter>)`.

Neither is ideal. The default path has two issues:
1. It requires installing an extra library (duckdb).
2. duckdb mangles some fields (like fixed size list => list).

The backup path incurs a latency penalty (~20ms on ssd) to write the result set to disk.

In the short term, once lancedb#676 is addressed, we can write the dataset to "memory://" instead of disk, which makes the post filter evaluate much quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string on the pyarrow Table directly. One possibility is to use Substrait to generate pyarrow compute expressions from the SQL string. Or, if there's enough progress on pyarrow, it could support Substrait expressions directly (no ETA).

---------

Co-authored-by: Will Jones <[email protected]>
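The choice between the default path and the backup path in the message above can be sketched as an optional-dependency check. This is a minimal illustration, not the code from the PR: `pick_post_filter_backend` is a hypothetical name, and the sketch only decides which path would be taken without running any filter.

```python
import importlib.util


def pick_post_filter_backend() -> str:
    """Return "duckdb" when the optional dependency is importable, else "lance"."""
    if importlib.util.find_spec("duckdb") is not None:
        return "duckdb"  # default path: run the SQL filter with duckdb
    return "lance"  # backup path: write to a Lance dataset, then filter


print(pick_post_filter_backend())
```

Probing with `find_spec` avoids importing duckdb just to learn whether it is installed.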
If you add timezone information in the Field annotation for a datetime, it will now be passed to the pyarrow data type. I'm not sure how pyarrow enforces timezones; right now, it silently coerces to the timezone given in the column regardless of whether the input had the matching timezone or not. This is probably not the right behavior. Alternatively, we could require the user to make the pydantic model do the validation instead of doing that at the pyarrow conversion layer.
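To make the validation option concrete, here is a hedged sketch of a check a model-level validator could perform, using only the standard library. `require_timezone` is a hypothetical helper, not part of pydantic, pyarrow, or lancedb, and it compares fixed UTC offsets only.

```python
from datetime import datetime, timedelta, timezone


def require_timezone(value: datetime, expected: timezone) -> datetime:
    """Reject naive datetimes and datetimes whose UTC offset differs from `expected`."""
    if value.tzinfo is None:
        raise ValueError("naive datetime has no timezone")
    if value.utcoffset() != expected.utcoffset(None):
        raise ValueError("timezone does not match the column timezone")
    return value


column_tz = timezone(timedelta(hours=-8))
ok = datetime(2024, 1, 1, 12, 0, tzinfo=column_tz)
other = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

require_timezone(ok, column_tz)  # passes: offsets match
# Coercion keeps the instant but changes the wall-clock reading:
print(other.astimezone(column_tz))  # 2024-01-01 04:00:00-08:00
```

Silent coercion corresponds to the `astimezone` call: the instant is preserved, but the caller never learns that the input used a different timezone.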
API has changed significantly, namely `openai.Embedding.create` no longer exists. openai/openai-python#742 Update the OpenAI embedding function and put a minimum on the openai sdk version.
issue separate requests under the hood and concatenate results
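The commit message does not say which call is being split, so the following is only a generic sketch of the pattern it describes: send one request per chunk and concatenate the results. `run_in_batches` and its parameters are hypothetical names.

```python
from typing import Callable, List, Sequence


def run_in_batches(items: Sequence, batch_size: int,
                   request: Callable[[Sequence], List]) -> List:
    """Call `request` once per chunk and concatenate the per-chunk results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results: List = []
    for start in range(0, len(items), batch_size):
        results.extend(request(items[start:start + batch_size]))
    return results


# Stand-in for a real backend call: returns the length of each input string.
lengths = run_in_batches(["a", "bb", "ccc"], 2, lambda chunk: [len(s) for s in chunk])
print(lengths)  # [1, 2, 3]
```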
Add support for adding lists of string input (e.g., list of categorical labels) Follow-up items: lancedb#757 lancedb#758
Co-authored-by: Aidan <[email protected]>
I found it quite confusing to read through the documentation and then have to search for the submodule each class should be imported from. For example, it is cumbersome to have to navigate to another documentation page to find out that `EmbeddingFunctionRegistry` is from `lancedb.embeddings`.
If the input text is None, Tantivy raises an error complaining it cannot add a NoneType. We handle this upstream so None's are not added to the document. If all of the indexed fields are None then we skip this document.
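The skipping rule can be sketched without Tantivy. The helper below is hypothetical and works on plain dictionaries: it drops `None` fields and signals that a row should be skipped when no indexed field has a value.

```python
from typing import Dict, Iterable, Optional


def indexable_fields(row: Dict[str, Optional[str]],
                     fields: Iterable[str]) -> Optional[Dict[str, str]]:
    """Keep only non-None indexed fields; return None if nothing is left."""
    kept = {name: row[name] for name in fields if row.get(name) is not None}
    return kept or None


rows = [
    {"title": "intro", "body": None},  # body is dropped, document is kept
    {"title": None, "body": None},  # every indexed field is None: skipped
]
docs = [d for d in (indexable_fields(r, ["title", "body"]) for r in rows) if d is not None]
print(docs)  # [{'title': 'intro'}]
```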
In addition to lancedb#777, this pull request fixes more typos in the documentation for "Ingest Embedding Functions".
Addressed minor typos and grammatical issues to improve readability --------- Co-authored-by: Christopher Correa <[email protected]>
In Rust and Node, we have been swallowing filter validation errors. If there was an error in parsing the filter, then the filter was silently ignored, returning unfiltered results. Fixes lancedb#1081
This also refactors the rust lancedb index builder API (and, correspondingly, the nodejs API)
the integration test will be covered in another PR: lancedb/sophon#1876
…b#1097)

The LanceDB embeddings registry allows users to annotate the pydantic model used as table schema with the desired embedding function, e.g.:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField()
    text: str = openai.SourceField()
```

Tables created like this do not require embeddings to be calculated by the user explicitly, e.g. this works:

```python
table.add([{"id": "foo", "text": "rust all the things"}])
```

However, trying to construct pydantic model instances without `vector` doesn't work, because it's a required field. Instead, you need to add a default value:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField(default=None)
    text: str = openai.SourceField()
```

Then this completes without errors:

```python
table.add([Schema(id="foo", text="rust all the things")])
```

However, all of the vectors are filled with zeros. Instead, in `add_vector_col` we have to add an additional check so that the embedding generation is called.
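The kind of check described above can be sketched on plain dictionaries. This is not the library's `add_vector_col`: `fill_missing_vectors`, `toy_embed`, and the column names are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def fill_missing_vectors(rows: List[Row],
                         embed: Callable[[List[str]], List[List[float]]],
                         source: str = "text",
                         target: str = "vector") -> List[Row]:
    """Compute embeddings for rows whose vector is absent or None."""
    missing = [i for i, row in enumerate(rows) if row.get(target) is None]
    if missing:
        vectors = embed([str(rows[i][source]) for i in missing])
        for i, vec in zip(missing, vectors):
            rows[i] = {**rows[i], target: vec}
    return rows


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Toy embedding function (assumption): a one-dimensional vector of text length."""
    return [[float(len(t))] for t in texts]


rows = fill_missing_vectors([{"id": "foo", "text": "rust", "vector": None}], toy_embed)
print(rows[0]["vector"])  # [4.0]
```

Treating a `None` vector the same as a missing column is what keeps a default of `None` from silently turning into a zero-filled vector.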
Increasing event reporting interval from 5mins to 60mins
I know there's a larger effort to have the python client based on the core rust implementation, but in the meantime there have been several issues (lancedb#1072 and lancedb#485) with some of the azure blob storage calls due to pyarrow not natively supporting an azure backend. To this end, I've added an optional import of the fsspec implementation of azure blob storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to `pyarrow.fs`. I've modified the existing test and manually verified it with some real credentials to make sure it behaves as expected.

It should now be as simple as:

```python
import lancedb

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```

Thank you for this cool project and we're excited to start using this for real shortly! 🎉

And thanks to @dwhitena for bringing it to my attention with his prediction guard posts.

Co-authored-by: christiandilorenzo <[email protected]>
This PR fixes lancedb#1112. It turned out that K-means is currently used internally, so I figured adding that context to the docs would be nice.
…c API (lancedb#1113) In addition, there are also a number of changes in nodejs to the docstrings of existing methods because this PR adds a jsdoc linter.
@wjones127 after fixing lancedb#1112 I noticed something else in the docs. There's an odd chunk of the docs missing [here](https://lancedb.github.io/lancedb/guides/tables/#from-a-polars-dataframe). I can see the heading, but after clicking it the contents don't show.

![CleanShot 2024-03-15 at 23 40 17@2x](https://github.com/lancedb/lancedb/assets/1019791/04784b19-0200-4c3f-ae17-7a8f871ef9bd)

Upon inspection it was a markdown issue: one tab too many on a whole segment. This PR fixes it. It looks like this now and the sections appear again:

![CleanShot 2024-03-15 at 23 42 32@2x](https://github.com/lancedb/lancedb/assets/1019791/c5aaec4c-1c37-474d-9fb0-641f4cf52626)
This will make it easier for 3rd party integrations. They simply need to implement `IntoArrow` for their types in order for those types to be used in ingestion.
… unstable / experimental (lancedb#1131)
…lance_linalg (lancedb#1133) This PR originated from a request to add `Serialize` / `Deserialize` to `lance_linalg::distance::DistanceType`. However, that is a strange request for `lance_linalg` which shouldn't really have to worry about `Serialize` / `Deserialize`. The problem is that `lancedb` is re-using `DistanceType` and things in `lancedb` do need to worry about `Serialize`/`Deserialize` (because `lancedb` needs to support remote client). On the bright side, separating the two types allows us to independently document distance type and allows `lance_linalg` to make changes to `DistanceType` in the future without having to worry about backwards compatibility concerns.
solves lancedb#1086

Usage

Reranking with FTS:

```
retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite")
pylist = [
    {"text": "Carson City is the capital city of the American state of Nevada. At the 2010 United States Census, Carson City had a population of 55,274."},
    {"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."},
    {"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."},
    {"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "},
    {"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."},
    {"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."},
]
retriever.add(pylist)
retriever.create_fts_index("text", replace=True)

query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query, query_type="fts").limit(10).to_pandas())
print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas())
```

Result

```
                                                text                                              vector     score
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602
1  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046
2  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898
4  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346

                                                text                                              vector     score  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046          0.041462
```

## Vector Search usage:

```
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query).limit(10).to_pandas())
print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas())  # <-- Note: passing extra string query here
```

Results

```
                                                text                                              vector  _distance
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973
1  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884
2  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200
3  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654
4  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544

                                                text                                              vector  _distance  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654          0.041462
```
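In both result sets above, the reranked table is ordered by `_relevance_score`, highest first, while the original `score` or `_distance` column is carried along unchanged. A minimal sketch of that reordering step follows. It does not call Cohere or lancedb: `rerank` and `overlap_score` are hypothetical stand-ins, and the word-overlap scorer is a toy replacement for a real relevance model.

```python
from typing import Callable, Dict, List


def rerank(rows: List[Dict], query: str,
           score: Callable[[str, str], float]) -> List[Dict]:
    """Attach a `_relevance_score` to each row and sort by it, highest first."""
    scored = [{**row, "_relevance_score": score(query, row["text"])} for row in rows]
    return sorted(scored, key=lambda r: r["_relevance_score"], reverse=True)


def overlap_score(query: str, text: str) -> float:
    """Toy scorer (assumption): fraction of query words that appear in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


rows = [
    {"text": "capital punishment is legal in some states"},
    {"text": "Washington is the capital of the United States"},
]
for row in rerank(rows, "capital of the United States", overlap_score):
    print(round(row["_relevance_score"], 2), row["text"])
```

The reranker needs the query as a string, which is why the vector search usage passes `query=query` explicitly.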
    @@ -83,5 +83,5 @@ result = table.search([100, 100]).limit(2).to_pandas()
     ```

     ## Blogs, Tutorials & Videos
    -* 📈 <a href="https://blog.eto.ai/benchmarking-random-access-in-lance-ed690757a826">2000x better performance with Lance over Parquet</a>
    +* 📈 <a href="https://blog.lancedb.com/announcing-lancedb-5cb0deaa46ee-2/">2000x better performance with Lance over Parquet</a>
Should we actually use this one? https://blog.lancedb.com/announcing-lancedb-5cb0deaa46ee-2/ ?
that link is the same as in the diff, no? Or are you considering linking an entirely different page?
…benchmark (lancedb#1137) * fix build error on Mac and remove warning messages --------- Co-authored-by: Qian Zhu <[email protected]> Co-authored-by: qzhu <[email protected]>
No description provided.