Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add py.typed for mypy library stubs #1071

Open
wants to merge 397 commits into
base: main
Choose a base branch
from

Conversation

eduardjbotha
Copy link
Contributor

With lancedb as a dependency, running mypy on a project complains with the following error:

error: Skipping analyzing "lancedb": module is installed, but missing library stubs or py.typed marker  [import-untyped]

changhiskhan and others added 30 commits December 27, 2023 09:31
Closes lancedb#721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once lancedb#676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <[email protected]>
If you add timezone information in the Field annotation for a datetime
then that will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones, right now, it silently
coerces to the timezone given in the column regardless of whether the
input had the matching timezone or not. This is probably not the right
behavior. Though we could just make it so the user has to make the
pydantic model do the validation instead of doing that at the pyarrow
conversion layer.
API has changed significantly, namely `openai.Embedding.create` no
longer exists.
openai/openai-python#742

Update the OpenAI embedding function and put a minimum on the openai sdk
version.
issue separate requests under the hood and concatenate results
Add support for adding lists of string input (e.g., list of categorical
labels)

Follow-up items: lancedb#757 lancedb#758
I found that it was quite incoherent to have to read through the
documentation and having to search which submodule that each class
should be imported from.

For example, it is cumbersome to have to navigate to another
documentation page to find out that `EmbeddingFunctionRegistry` is from
`lancedb.embeddings`
If the input text is None, Tantivy raises an error
complaining it cannot add a NoneType. We handle this
upstream so None's are not added to the document.
If all of the indexed fields are None then we skip
this document.
In addition to lancedb#777, this pull request fixes more typos in the
documentation for "Ingest Embedding Functions".
Addressed minor typos and grammatical issues to improve readability

---------

Co-authored-by: Christopher Correa <[email protected]>
These examples don't work because of changes in openai api from version
1+
raise exception if fts index does not exist

---------

Co-authored-by: Chang She <[email protected]>
…ancedb#762)

By default tantivy-py uses 128MB heapsize. We change the default to 1GB
and we allow the user to customize this

locally this makes `test_fts.py` run 10x faster
Close lancedb#773 

we pass an empty table over IPC so we don't need to manually deal with
serde. Then we just return the schema attribute from the empty table.

---------

Co-authored-by: albertlockett <[email protected]>
closes lancedb#769 

Add unit test and documentation on using quotes to perform a phrase
query
addresses lancedb#797 

Problem: tantivy does not expose option to explicitly

Proposed solution here: 

1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replace nested double quotes with single quotes

I've also filed an upstream issue, if they support phrase queries
natively then we can get rid of our manual custom processing here.
westonpace and others added 27 commits February 29, 2024 10:55
This will eventually replace the remote table implementations in python
and node.
…PI (lancedb#1031)

I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking
changes from sync to async python.
…1047)

The renaming of `vectordb` to `lancedb` broke the [quick start
docs](https://lancedb.github.io/lancedb/basic/#__tabbed_5_3) (it's
pointing to a non-existent directory). This PR fixes the code snippets
and the paths in the docs page.

Additionally, more fixes related to indexing docs below 👇🏽.
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).

While doing this we also cleaned up some inconsistencies between the
SDKs:

* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
lancedb@cc5f213
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we cleanup tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.

Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)

---------

Co-authored-by: Will Jones <[email protected]>
)

The eslint rules specify some formatting requirements that are rather
strict and conflict with vscode's default formatter. I was unable to get
auto-formatting to setup correctly. Also, eslint has quite recently
[given up on
formatting](https://eslint.org/blog/2023/10/deprecating-formatting-rules/)
and recommends using a 3rd party formatter.

This PR adds prettier as the formatter. It restores the eslint rules to
their defaults. This does mean we now have the "no explicit any" check
back on. I know that rule is pedantic but it did help me catch a few
corner cases in type testing that weren't covered in the current code.
Leaving in draft as this is dependent on other PRs.
Arrow-js uses brittle `instanceof` checks throughout the code base.
These fail unless the library instance that produced the object matches
exactly the same instance the vectordb is using. At a minimum, this
means that a user using arrow version 15 (or any version that doesn't
match exactly the version that vectordb is using) will get strange
errors when they try and use vectordb.

However, there are even cases where the versions can be perfectly
identical, and the instanceof check still fails. One such example is
when using `vite` (e.g. vitejs/vite#3910)

This PR solves the problem in a rather brute force, but workable,
fashion. If we encounter a schema that does not pass the `instanceof`
check then we will attempt to sanitize that schema by traversing the
object and, if it has all the correct properties, constructing an
appropriate `Schema` instance via deep cloning.
1. filtering with fts mutated the schema, which caused schema mistmatch
problems with hybrid search as it combines fts and vector search tables.
2. fts with filter failed with `with_row_id`. This was because row_id
was calculated before filtering which caused size mismatch on attaching
it after.
3. The fix for 1 meant that now row_id is attached before filtering but
passing a filter to `to_lance` on a dataset that already contains
`_rowid` raises a panic from lance. So temporarily, in case where fts is
used with a filter AND `with_row_id`, we just force user to using the
duckdb pathway.

---------

Co-authored-by: Chang She <[email protected]>
The fact that we convert errors to strings makes them really hard to
work with. For example, in SaaS we want to know whether the underlying
`lance::Error` was the `InvalidInput` variant, so we can return a 400
instead of a 500.
…ble_names function from sync table_names function (lancedb#1059)

The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.

However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fallback to using the async
table_names function and so this PR does so. The one case we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fallback to the old behavior.
lancedb#1002 accidentally changed `checkout_latest` to do nothing if the table
was already in latest mode. This PR makes sure it forces a reload of the
table (if there is a newer version).
@wjones127
Copy link
Contributor

@eduardjbotha I would like to support this. However, we do not check LanceDB Python's annotations in CI yet, so I don't think many of them are event correct at the moment. Therefore, I'm reticent to turn this on until we start validating our codebase with mypy. I've filed #1117 to do that first. If you would like, you are welcome to open a PR for that.

alexkohler pushed a commit to alexkohler/lancedb that referenced this pull request Apr 20, 2024
provide schema conversio for both schema and reference
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet