From f4c892bbad3ce68ece88fae5232a9156ae0524a8 Mon Sep 17 00:00:00 2001 From: <> Date: Thu, 20 Apr 2023 17:56:29 +0000 Subject: [PATCH] Deployed 7e9864c with MkDocs version: 1.4.2
+The API supports combining multiple API instances into a single logical embeddings index. An example configuration is shown below.
+cluster:
+ shards:
+ - http://127.0.0.1:8002
+ - http://127.0.0.1:8003
+
This configuration aggregates the API instances above as index shards. Data is evenly split among the shards at index time. Queries are run in parallel against each shard and the results are joined together. This method allows horizontal scaling and supports very large index clusters.
+This method is only recommended for data sets with 1 billion+ records. The ANN libraries can easily support smaller data sizes, where this method is not worth the additional complexity. At this time, new shards cannot be added after building the initial index.
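The split-then-join behavior described above can be illustrated with a short sketch. This is not txtai's actual cluster implementation, only a minimal stand-in showing the idea: documents are distributed evenly across shards, and query results from each shard are merged and re-sorted by score.

```python
# Illustrative sketch of shard distribution and result merging
# (not txtai's actual cluster implementation)

def route(documents, nshards):
    """Round-robin documents across nshards buckets for even distribution."""
    shards = [[] for _ in range(nshards)]
    for i, document in enumerate(documents):
        shards[i % nshards].append(document)
    return shards

def merge(results, limit):
    """Join per-shard results and keep the top scoring hits."""
    merged = [hit for shard in results for hit in shard]
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)[:limit]

# 10 documents split evenly across 2 shards
shards = route([{"id": i, "text": f"doc {i}"} for i in range(10)], 2)

# Merge per-shard query results, keeping the best hit
top = merge([[{"id": 1, "score": 0.9}], [{"id": 7, "score": 0.5}]], limit=1)
```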
+See the link below for a detailed example covering distributed embeddings clusters.
+Notebook | +Description | ++ |
---|---|---|
Distributed embeddings cluster | +Distribute an embeddings index across multiple data nodes | +
Configuration is set through YAML. In most cases, YAML keys map to field names in Python. The previous section gave a full-featured example covering a wide array of configuration options.
+Each section below describes the available configuration settings.
+The configuration parser expects a top level embeddings
key to be present in the YAML. All embeddings configuration is supported.
The following example defines an embeddings index.
+path: index path
+writable: true
+
+embeddings:
+ path: vector model
+ content: true
+
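The example above mixes application-level settings (path, writable) with the top level embeddings section. A minimal sketch of how such a parsed configuration can be separated is shown below; the split helper is illustrative, not txtai's actual parser.

```python
# Parsed form of the example configuration above
config = {
    "path": "index path",       # where the index is saved/loaded
    "writable": True,           # allow indexing
    "embeddings": {             # top level embeddings key
        "path": "vector model",
        "content": True,
    },
}

def split(config):
    """Separate the embeddings section from application-level settings."""
    embeddings = config.get("embeddings", {})
    application = {k: v for k, v in config.items() if k != "embeddings"}
    return application, embeddings

application, embeddings = split(config)
```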
Three top level settings are available to control where indexes are saved and if an index is a read-only index.
+path: string
+
Path to save and load the embeddings index. Each API instance can only access a single index at a time.
+writable: boolean
+
Determines if the input embeddings index is writable (true) or read-only (false). This allows serving a read-only index.
+Cloud storage settings can be set under a cloud
top level configuration group.
Pipelines are loaded as top level configuration parameters. Pipeline names are automatically detected in the YAML configuration and created upon startup. All pipelines are supported.
+The following example defines a series of pipelines. Note that entries below are the lower-case names of the pipeline class.
+caption:
+
+extractor:
+ path: model path
+
+labels:
+
+summary:
+
+tabular:
+
+translation:
+
Under each pipeline name, configuration settings for the pipeline can be set.
+Workflows are defined under a top level workflow
key. Each key under the workflow
key is the name of the workflow. Under that is a tasks
key with each task definition.
The following example defines a workflow.
+workflow:
+ sumtranslate:
+ tasks:
+ - action: summary
+ - action: translation
+
Schedules a workflow using a cron expression.
+workflow:
+ index:
+ schedule:
+ cron: 0/10 * * * * *
+ elements: ["api params"]
+ tasks:
+ - task: service
+ url: api url
+ - action: index
+
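The schedule above uses a six-field cron expression where the leading field covers seconds, so `0/10 * * * * *` fires every 10 seconds. txtai delegates cron parsing to a scheduling library; the sketch below only illustrates how a step expression like `0/10` matches values.

```python
def matches_step(value, field):
    """Check a single cron field like '0/10', '*' or '30' against a value."""
    if field == "*":
        return True
    if "/" in field:
        start, step = field.split("/")
        start = 0 if start == "*" else int(start)
        return value >= start and (value - start) % int(step) == 0
    return value == int(field)

# '0/10' in the seconds field fires at 0, 10, 20, 30, 40 and 50 seconds
fires = [s for s in range(60) if matches_step(s, "0/10")]
```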
tasks: list
+
Expects a list of workflow tasks. Each element defines a single workflow task. All task configuration is supported.
+A shorthand syntax for creating tasks is supported. This syntax will automatically map task strings to an action:value
pair.
Example below.
+workflow:
+ index:
+ tasks:
+ - action1
+ - action2
+
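The shorthand expansion can be sketched as a simple transformation: bare strings become `action` dictionaries while fully specified task dictionaries pass through unchanged. This is an illustration of the mapping, not txtai's parser code.

```python
def expand(tasks):
    """Map shorthand task strings to action dictionaries."""
    return [task if isinstance(task, dict) else {"action": task} for task in tasks]

# Shorthand strings and full task definitions can be mixed
tasks = expand(["action1", "action2", {"action": "summary", "args": ["fr"]}])
```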
Each task element supports the following additional arguments.
+action: string|list
+
Both single and multi-action tasks are supported.
+The action parameter works slightly differently when passed via configuration. The parameter(s) need to be converted into callable method(s). If action is a pipeline that has been defined in the current configuration, it will use that pipeline as the action.
+There are three special action names index
, upsert
and search
. If index
or upsert
are used as the action, the task will collect workflow data elements and load them into the defined embeddings index. If search
is used, the task will execute embeddings queries for each input data element.
Otherwise, the action must be a path to a callable object or function. The configuration parser will resolve the function name and use that as the task action.
+task: string
+
Optionally sets the type of task to create. For example, this could be a file
task or a retrieve
task. If this is not specified, a generic task is created. The list of workflow tasks can be found here.
args: list
+
Optional list of static arguments to pass to the workflow task. These are combined with workflow data to pass to each __call__
.
+
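Combining static args with workflow data behaves conceptually like partial application: the static arguments are bound ahead of time and each call supplies only the data elements. The sketch below uses a hypothetical translate action to illustrate this; it is not txtai's task invocation code.

```python
from functools import partial

def translate(elements, target):
    """Stand-in for a translation action that takes a static target argument."""
    return [f"{element} -> {target}" for element in elements]

# args: ["fr"] in configuration is conceptually like binding the static
# argument up front; workflow data fills the remaining parameter per call
action = partial(translate, target="fr")
results = action(["hello"])
```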
txtai has a full-featured API, backed by FastAPI, that can optionally be enabled for any txtai process. All functionality found in txtai can be accessed via the API.
+The following is an example configuration and startup script for the API.
+Note: This configuration file enables all functionality. For memory-bound systems, splitting pipelines into multiple instances is a best practice.
+# Index file path
+path: /tmp/index
+
+# Allow indexing of documents
+writable: True
+
+# Embeddings index
+embeddings:
+ path: sentence-transformers/nli-mpnet-base-v2
+
+# Extractive QA
+extractor:
+ path: distilbert-base-cased-distilled-squad
+
+# Zero-shot labeling
+labels:
+
+# Similarity
+similarity:
+
+# Text segmentation
+segmentation:
+ sentences: true
+
+# Text summarization
+summary:
+
+# Text extraction
+textractor:
+ paragraphs: true
+ minlength: 100
+ join: true
+
+# Transcribe audio to text
+transcription:
+
+# Translate text between languages
+translation:
+
+# Workflow definitions
+workflow:
+ sumfrench:
+ tasks:
+ - action: textractor
+ task: url
+ - action: summary
+ - action: translation
+ args: ["fr"]
+ sumspanish:
+ tasks:
+ - action: textractor
+ task: url
+ - action: summary
+ - action: translation
+ args: ["es"]
+
Assuming this YAML content is stored in a file named config.yml, the following command starts the API process.
+CONFIG=config.yml uvicorn "txtai.api:app"
+
uvicorn is a full-featured, production-ready server with support for SSL and more. See the uvicorn deployment guide for details.
+The default port for the API is 8000. See the uvicorn link above to change this.
+txtai has a number of language bindings which abstract the API (see links below). Alternatively, code can be written to connect directly to the API. Documentation for a live running instance can be found at the /docs
url (i.e. http://localhost:8000/docs). The following example runs a workflow using cURL.
curl \
+ -X POST "http://localhost:8000/workflow" \
+ -H "Content-Type: application/json" \
+ -d '{"name":"sumfrench", "elements": ["https://github.com/neuml/txtai"]}'
+
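The same workflow call can be issued from Python using only the standard library. The snippet below builds the equivalent POST request; `urlopen` is left commented out since it requires a running API instance on localhost:8000.

```python
import json
from urllib import request

# Same payload as the cURL example above
payload = {"name": "sumfrench", "elements": ["https://github.com/neuml/txtai"]}

req = request.Request(
    "http://localhost:8000/workflow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# request.urlopen(req) would execute the workflow against a running instance
```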
A local instance can be instantiated. In this case, a txtai application runs internally, without any network connections, providing the same consolidated functionality. This enables running txtai in Python with configuration.
+The configuration above can be run in Python with:
+from txtai.app import Application
+
+# Load and run workflow
+app = Application("config.yml")
+app.workflow("sumfrench", ["https://github.com/neuml/txtai"])
+
See this link for a full list of methods.
+The API can be containerized and run. This will bring up an API instance without having to install Python, txtai or any dependencies on your machine!
+See this section for more information.
+The following programming languages have bindings with the txtai API:
+See the link below for a detailed example covering how to use the API.
+Notebook | +Description | ++ |
---|---|---|
API Gallery | +Using txtai in JavaScript, Java, Rust and Go | +
+API (Application)
+
+
+
+
+Base API template. The API is an extended txtai application, adding the ability to cluster API instances together.
+Downstream applications can extend this base template to add/modify functionality.
+ +txtai/api/base.py
class API(Application):
+ """
+ Base API template. The API is an extended txtai application, adding the ability to cluster API instances together.
+
+ Downstream applications can extend this base template to add/modify functionality.
+ """
+
+ def __init__(self, config, loaddata=True):
+ super().__init__(config, loaddata)
+
+ # Embeddings cluster
+ self.cluster = None
+ if self.config.get("cluster"):
+ self.cluster = Cluster(self.config["cluster"])
+
+ # pylint: disable=W0221
+ def search(self, query, limit=None, request=None):
+ # When search is invoked via the API, limit is set from the request
+ # When search is invoked directly, limit is set using the method parameter
+ limit = self.limit(request.query_params.get("limit") if request and hasattr(request, "query_params") else limit)
+
+ if self.cluster:
+ return self.cluster.search(query, limit)
+
+ return super().search(query, limit)
+
+ def batchsearch(self, queries, limit=None):
+ if self.cluster:
+ return self.cluster.batchsearch(queries, self.limit(limit))
+
+ return super().batchsearch(queries, limit)
+
+ def add(self, documents):
+ """
+ Adds a batch of documents for indexing.
+
+ Downstream applications can override this method to also store full documents in an external system.
+
+ Args:
+ documents: list of {id: value, text: value}
+
+ Returns:
+ unmodified input documents
+ """
+
+ if self.cluster:
+ self.cluster.add(documents)
+ else:
+ super().add(documents)
+
+ return documents
+
+ def index(self):
+ """
+ Builds an embeddings index for previously batched documents.
+ """
+
+ if self.cluster:
+ self.cluster.index()
+ else:
+ super().index()
+
+ def upsert(self):
+ """
+ Runs an embeddings upsert operation for previously batched documents.
+ """
+
+ if self.cluster:
+ self.cluster.upsert()
+ else:
+ super().upsert()
+
+ def delete(self, ids):
+ """
+ Deletes from an embeddings index. Returns list of ids deleted.
+
+ Args:
+ ids: list of ids to delete
+
+ Returns:
+ ids deleted
+ """
+
+ if self.cluster:
+ return self.cluster.delete(ids)
+
+ return super().delete(ids)
+
+ def count(self):
+ """
+ Total number of elements in this embeddings index.
+
+ Returns:
+ number of elements in embeddings index
+ """
+
+ if self.cluster:
+ return self.cluster.count()
+
+ return super().count()
+
+ def limit(self, limit):
+ """
+ Parses the number of results to return from the request. Allows range of 1-250, with a default of 10.
+
+ Args:
+ limit: limit parameter
+
+ Returns:
+ bounded limit
+ """
+
+ # Return between 1 and 250 results, defaults to 10
+ return max(1, min(250, int(limit) if limit else 10))
+
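The limit helper above bounds the number of results to the range 1-250, defaulting to 10 when no limit is passed. Restated as a standalone function, the clamp behaves as follows.

```python
def bound(limit):
    """Return between 1 and 250 results, defaulting to 10 (mirrors API.limit)."""
    return max(1, min(250, int(limit) if limit else 10))

# No limit -> default of 10; oversized limits clamp to 250; "0" clamps to 1
examples = [bound(None), bound("500"), bound("0"), bound(25)]
```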
add(self, documents)
+
+
+Adds a batch of documents for indexing.
+Downstream applications can override this method to also store full documents in an external system.
+ +Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
documents |
+ + | list of {id: value, text: value} |
+ required | +
Returns:
+Type | +Description | +
---|---|
+ | unmodified input documents |
+
txtai/api/base.py
def add(self, documents):
+ """
+ Adds a batch of documents for indexing.
+
+ Downstream applications can override this method to also store full documents in an external system.
+
+ Args:
+ documents: list of {id: value, text: value}
+
+ Returns:
+ unmodified input documents
+ """
+
+ if self.cluster:
+ self.cluster.add(documents)
+ else:
+ super().add(documents)
+
+ return documents
+
batchsearch(self, queries, limit=None)
+
+
+Finds documents in the embeddings model most similar to the input queries. Returns +a list of {id: value, score: value} sorted by highest score per query, where id is +the document id in the embeddings model.
+ +Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
queries |
+ + | queries text |
+ required | +
limit |
+ + | maximum results |
+ None |
+
Returns:
+Type | +Description | +
---|---|
list of {id |
+ value, score: value} per query |
+
txtai/api/base.py
def batchsearch(self, queries, limit=None):
+ if self.cluster:
+ return self.cluster.batchsearch(queries, self.limit(limit))
+
+ return super().batchsearch(queries, limit)
+
count(self)
+
+
+Total number of elements in this embeddings index.
+ +Returns:
+Type | +Description | +
---|---|
+ | number of elements in embeddings index |
+
txtai/api/base.py
def count(self):
+ """
+ Total number of elements in this embeddings index.
+
+ Returns:
+ number of elements in embeddings index
+ """
+
+ if self.cluster:
+ return self.cluster.count()
+
+ return super().count()
+
delete(self, ids)
+
+
+Deletes from an embeddings index. Returns list of ids deleted.
+ +Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
ids |
+ + | list of ids to delete |
+ required | +
Returns:
+Type | +Description | +
---|---|
+ | ids deleted |
+
txtai/api/base.py
def delete(self, ids):
+ """
+ Deletes from an embeddings index. Returns list of ids deleted.
+
+ Args:
+ ids: list of ids to delete
+
+ Returns:
+ ids deleted
+ """
+
+ if self.cluster:
+ return self.cluster.delete(ids)
+
+ return super().delete(ids)
+
index(self)
+
+
+Builds an embeddings index for previously batched documents.
+ +txtai/api/base.py
def index(self):
+ """
+ Builds an embeddings index for previously batched documents.
+ """
+
+ if self.cluster:
+ self.cluster.index()
+ else:
+ super().index()
+
search(self, query, limit=None, request=None)
+
+
+Finds documents in the embeddings model most similar to the input query. Returns +a list of {id: value, score: value} sorted by highest score, where id is the +document id in the embeddings model.
+ +Parameters:
+Name | +Type | +Description | +Default | +
---|---|---|---|
query |
+ + | query text |
+ required | +
limit |
+ + | maximum results, used if request is None |
+ None |
+
Returns:
+Type | +Description | +
---|---|
list of {id |
+ value, score: value} |
+
txtai/api/base.py
def search(self, query, limit=None, request=None):
+ # When search is invoked via the API, limit is set from the request
+ # When search is invoked directly, limit is set using the method parameter
+ limit = self.limit(request.query_params.get("limit") if request and hasattr(request, "query_params") else limit)
+
+ if self.cluster:
+ return self.cluster.search(query, limit)
+
+ return super().search(query, limit)
+
upsert(self)
+
+
+Runs an embeddings upsert operation for previously batched documents.
+ +txtai/api/base.py
def upsert(self):
+ """
+ Runs an embeddings upsert operation for previously batched documents.
+ """
+
+ if self.cluster:
+ self.cluster.upsert()
+ else:
+ super().upsert()
+
+
Scalable cloud-native applications can be built with txtai. The following container runtimes are supported.
+Images for txtai are available on Docker Hub for CPU and GPU installs. The CPU install is recommended when GPUs aren't available given the image is half the size.
+The base txtai images have no models installed and models will be downloaded each time the container starts. Caching the models is recommended as that will significantly reduce container start times. This can be done a couple different ways.
+docker run -v <local dir>:/models -e TRANSFORMERS_CACHE=/models --rm -it <docker image>
+
The txtai images found on Docker Hub are configured to support most situations. This image can be locally built with different options as desired.
+Examples build commands below.
+# Get Dockerfile
+wget https://raw.githubusercontent.com/neuml/txtai/master/docker/base/Dockerfile
+
+# Build Ubuntu 18.04 image running Python 3.7
+docker build -t txtai --build-arg BASE_IMAGE=ubuntu:18.04 --build-arg PYTHON_VERSION=3.7 .
+
+# Build image with GPU support
+docker build -t txtai --build-arg GPU=1 .
+
+# Build minimal image with the base txtai components
+docker build -t txtai --build-arg COMPONENTS= .
+
As mentioned previously, model caching is recommended to reduce container start times. The following commands demonstrate this. In all cases, it is assumed a config.yml file is present in the local directory with the desired configuration set.
+This section builds an image that caches models and starts an API service. The config.yml file should be configured with the desired components to expose via the API.
+The following is a sample config.yml file that creates an Embeddings API service.
+# config.yml
+writable: true
+
+embeddings:
+ path: sentence-transformers/nli-mpnet-base-v2
+ content: true
+
The next section builds the image and starts an instance.
+# Get Dockerfile
+wget https://raw.githubusercontent.com/neuml/txtai/master/docker/api/Dockerfile
+
+# CPU build
+docker build -t txtai-api .
+
+# GPU build
+docker build -t txtai-api --build-arg BASE_IMAGE=neuml/txtai-gpu .
+
+# Run
+docker run -p 8000:8000 --rm -it txtai-api
+
This section builds a scheduled workflow service. More on scheduled workflows can be found here.
+# Get Dockerfile
+wget https://raw.githubusercontent.com/neuml/txtai/master/docker/service/Dockerfile
+
+# CPU build
+docker build -t txtai-service .
+
+# GPU build
+docker build -t txtai-service --build-arg BASE_IMAGE=neuml/txtai-gpu .
+
+# Run
+docker run --rm -it txtai-service
+
This section builds a single run workflow. Example workflows can be found here.
+# Get Dockerfile
+wget https://raw.githubusercontent.com/neuml/txtai/master/docker/workflow/Dockerfile
+
+# CPU build
+docker build -t txtai-workflow .
+
+# GPU build
+docker build -t txtai-workflow --build-arg BASE_IMAGE=neuml/txtai-gpu .
+
+# Run
+docker run --rm -it txtai-workflow <workflow name> <workflow parameters>
+
One of the most powerful features of txtai is building YAML-configured applications with the "build once, run anywhere" approach. API instances and workflows can run locally, on a server, on a cluster or serverless.
+Serverless instances of txtai are supported on frameworks such as AWS Lambda, Google Cloud Functions, Azure Cloud Functions and Kubernetes with Knative.
+The following steps show a basic example of how to build a serverless API instance with AWS SAM.
+# config.yml
+writable: true
+
+embeddings:
+ path: sentence-transformers/nli-mpnet-base-v2
+ content: true
+
# template.yml
+Resources:
+ txtai:
+ Type: AWS::Serverless::Function
+ Properties:
+ PackageType: Image
+ MemorySize: 3000
+ Timeout: 20
+ Events:
+ Api:
+ Type: Api
+ Properties:
+ Path: "/{proxy+}"
+ Method: ANY
+ Metadata:
+ Dockerfile: Dockerfile
+ DockerContext: ./
+ DockerTag: api
+
Install AWS SAM
Run the following
+# Get Dockerfile and application
+wget https://raw.githubusercontent.com/neuml/txtai/master/docker/aws/api.py
+wget https://raw.githubusercontent.com/neuml/txtai/master/docker/aws/Dockerfile
+
+# Build the docker image
+sam build
+
+# Start API gateway and Lambda instance locally
+sam local start-api -p 8000 --warm-containers LAZY
+
+# Verify instance running (should return 0)
+curl http://localhost:8000/count
+
If successful, a local API instance is now running in a "serverless" fashion. This configuration can be deployed to AWS using SAM. See this link for more information.
+txtai scales with container orchestration systems. This can be self-hosted or with a cloud provider such as Amazon Elastic Kubernetes Service, Google Kubernetes Engine and Azure Kubernetes Service. There are also other smaller providers with a managed Kubernetes offering.
+A full example covering how to build a serverless txtai application on Kubernetes with Knative can be found here.
+The following describes parameters used to sync indexes with cloud storage. Cloud object storage, the Hugging Face Hub and custom providers are all supported.
+Parameters are set via the embeddings.load and embeddings.save methods.
+provider: string
+
Cloud provider. Can be one of the following:
+Cloud object storage. Set to one of these providers.
+Hugging Face Hub. Set to huggingface-hub.
Custom providers. Set to the full class path of the custom provider.
+container: string
+
Container/bucket/directory/repository name.
+In addition to the above common configuration, the cloud object storage provider has the following additional configuration parameters.
+key: string
+
Provider-specific access key. Can also be set via the ACCESS_KEY environment variable. Ensure the configuration file is secured if the key is stored there.
+secret: string
+
Provider-specific access secret. Can also be set via the ACCESS_SECRET environment variable. Ensure the configuration file is secured if the secret is stored there.
+host: string
+
Optional server host name. Set when using a local cloud storage server.
+port: int
+
Optional server port. Set when using a local cloud storage server.
+token: string
+
Optional temporary session token.
+region: string
+
Optional parameter to specify the storage region, provider-specific.
+The huggingface-hub provider supports the following additional configuration parameters. More on these parameters can be found in the Hugging Face Hub's documentation.
+revision: string
+
Optional Git revision id, which can be a branch name, a tag, or a commit hash.
+cache: string
+
Path to the folder where cached files are stored.
+token: string|boolean
+
Token to be used for the download. If set to True, the token will be read from the Hugging Face config folder.
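Putting these parameters together, a cloud configuration is a plain dictionary passed to the embeddings.save and embeddings.load methods. A minimal sketch, with a hypothetical repository name:

```python
# Hypothetical Hugging Face Hub sync configuration - the repository
# name below is illustrative, not a real repository
cloud = {
    "provider": "huggingface-hub",    # provider described above
    "container": "user/txtai-index",  # repository name
    "revision": "main",               # optional Git revision
}

# Would be passed as: embeddings.save("index.tar.gz", cloud=cloud)
```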
+The following describes the available embeddings configuration. These parameters are set via the Embeddings constructor.
+format: pickle|json
+
Sets the configuration storage format. Defaults to pickle.
path: string
+
Sets the path for a vectors model. When using a transformers/sentence-transformers model, this can be any model on the Hugging Face Hub or a local file path. Otherwise, it must be a local file path to a word embeddings model.
+method: transformers|sentence-transformers|words|external
+
Sentence embeddings method to use. If the method is not provided, it is inferred using the path.
The sentence-transformers and words methods require the similarity extras package to be installed.
+Builds sentence embeddings using a transformers model. While this can be any transformers model, it works best with models trained to build sentence embeddings.
+Same as transformers but loads models with the sentence-transformers library.
+Builds sentence embeddings using a word embeddings model. Transformers models are the preferred vector backend in most cases. Word embeddings models may be deprecated in the future.
+storevectors: boolean
+
Enables copying of a vectors model set in path into the embeddings models output directory on save. This option enables a fully encapsulated index with no external file dependencies.
+scoring: bm25|tfidf|sif
+
A scoring model builds weighted averages of word vectors for a given sentence. Supports BM25, TF-IDF and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
+pca: int
+
Removes n principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. After pooling of vectors creates a single sentence embedding, this method is applied.
+Sentence embeddings are loaded via an external model or API. Requires setting the transform parameter to a function that translates data into vectors.
+transform: function
+
When method is external, this function transforms input content into embeddings. The input to this function is a list of data. This method must return either a numpy array or list of numpy arrays.
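As a minimal sketch of that contract, the hypothetical function below maps each input to a fixed-length vector. A real implementation would call a model or external API and return numpy arrays; here a toy hash-based vectorizer stands in so the shape of the interface is clear:

```python
import hashlib

def transform(inputs):
    # Toy vectorizer: one fixed-length vector per input element.
    # A real external method would call a model or API here and
    # return a numpy array or list of numpy arrays.
    vectors = []
    for text in inputs:
        digest = hashlib.sha256(str(text).encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:8]])
    return vectors

# Would be wired in as: Embeddings({"method": "external", "transform": transform})
vectors = transform(["first text", "second text"])
```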
batch: int
+
Sets the transform batch size. This parameter controls how input streams are chunked and vectorized.
+encodebatch: int
+
Sets the encode batch size. This parameter controls the underlying vector model batch size. This often corresponds to a GPU batch size, which controls GPU memory usage.
+tokenize: boolean
+
Enables string tokenization (defaults to false). This method applies tokenization rules that only work with English language text and may increase the quality of English language sentence embeddings in some situations.
+instructions:
+ query: prefix for queries
+ data: prefix for indexing
+
Instruction-based models use prefixes to modify how embeddings are computed. This is especially useful with asymmetric search, which is when the query and indexed data are of vastly different lengths. In other words, short queries with long documents.
+E5-base is an example of a model that accepts instructions. It takes query: and passage: prefixes and uses those to generate embeddings that work well for asymmetric search.
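Conceptually, the configured prefixes are prepended to inputs before vectorization. The sketch below illustrates that behavior with E5-style prefixes; it is not txtai's internal code, just the idea:

```python
# E5-style instruction prefixes (illustrative values)
instructions = {"query": "query: ", "data": "passage: "}

def prepend(category, texts):
    # Prepend the category prefix before computing embeddings
    prefix = instructions.get(category, "")
    return [prefix + text for text in texts]

queries = prepend("query", ["feel good story"])
passages = prepend("data", ["Maine man wins $1M from $25 lottery ticket"])
```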
backend: faiss|hnsw|annoy|custom
+
Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to faiss. Additional backends require the similarity extras package to be installed. Add custom backends via setting this parameter to the fully resolvable class string.
Backend-specific settings are set with a corresponding configuration object having the same name as the backend (i.e. annoy, faiss, or hnsw). None of these are required and are set to defaults if omitted.
+faiss:
+ components: comma separated list of components - defaults to "Flat" for small
+ indices and "IVFx,Flat" for larger indexes where
+ x = 4 * sqrt(embeddings count)
+ nprobe: search probe setting (int) - defaults to x/16 (as defined above)
+ for larger indexes
+ quantize: store vectors with 8-bit precision vs 32-bit (boolean)
+ defaults to false
+ mmap: load as on-disk index (boolean) - trade query response time for a
+ smaller RAM footprint, defaults to false
+ sample: percent of data to use for model training (0.0 - 1.0)
+ reduces indexing time for larger (>1M+ row) indexes, defaults to 1.0
+
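The default component and nprobe calculations above can be worked through for a hypothetical 1M-row index:

```python
import math

# Hypothetical index size
count = 1_000_000

# x = 4 * sqrt(embeddings count)
x = 4 * int(math.sqrt(count))

# Default components string for a larger index
components = f"IVF{x},Flat"

# Default search probe setting: x / 16
nprobe = x // 16
```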
See the following Faiss documentation links for more information.
+ +hnsw:
+ efconstruction: ef_construction param for init_index (int) - defaults to 200
+ m: M param for init_index (int) - defaults to 16
+ randomseed: random-seed param for init_index (int) - defaults to 100
+ efsearch: ef search param (int) - defaults to None and not set
+
See Hnswlib documentation for more information on these parameters.
+annoy:
+ ntrees: number of trees (int) - defaults to 10
+ searchk: search_k search setting (int) - defaults to -1
+
See Annoy documentation for more information on these parameters. Note that Annoy indexes cannot be modified after creation; upserts/deletes and other modifications are not supported.
+content: boolean|sqlite|duckdb|custom
+
Enables content storage. When true, the default storage engine, sqlite, will be used. Also supports duckdb. Add custom storage engines via setting this parameter to the fully resolvable class string.
functions: list
+
List of user-defined SQL functions, only used when content is enabled. Each list element must be one of the following:
+query:
+ path: sets the path for the query model - this can be any model on the
+ Hugging Face Model Hub or a local file path.
+ prefix: text prefix to prepend to all inputs
+ maxlength: maximum generated sequence length
+
Query translation model. Translates natural language queries to txtai-compatible SQL statements.
+graph:
+ backend: graph network backend (string), defaults to "networkx"
+ batchsize: batch query size, used to query embeddings index (int)
+ defaults to 256
+ limit: maximum number of results to return per embeddings query (int)
+ defaults to 15
+ minscore: minimum score required to consider embeddings query matches (float)
+ defaults to 0.1
+ approximate: when true, queries only run for nodes without edges (boolean)
+ defaults to true
+ topics: see below
+
Enables graph storage. When set, a graph network is built using the embeddings index. Graph nodes are synced with each embeddings index operation (index/upsert/delete). Graph edges are created from the embeddings index when each of those calls completes.
+Add custom graph storage engines via setting the graph.backend parameter to the fully resolvable class string.
Defaults are tuned so that in most cases these values don't need to be changed.
+topics:
+ algorithm: community detection algorithm (string), options are
+ louvain (default), greedy, lpa
+ level: controls number of topics (string), options are best (default) or first
+ resolution: controls number of topics (int), larger values create more
+ topics, defaults to 100
+ labels: scoring index method used to build topic labels (string)
+ options are bm25 (default), tfidf, sif
+ terms: number of frequent terms to use for topic labels (int), defaults to 4
+ stopwords: optional list of stop words to exclude from topic labels
+ categories: optional list of categories used to group topics, allows
+ granular topics with broad categories grouping topics
+
Enables topic modeling. Defaults are tuned so that in most cases these values don't need to be changed (except for categories). These parameters are available for advanced use cases where one wants full control over the community detection process.
+
Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.
+The following code snippet shows how to build and search an embeddings index.
+from txtai.embeddings import Embeddings
+
+# Create embeddings model, backed by sentence-transformers & transformers
+embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
+
+data = [
+ "US tops 5 million confirmed virus cases",
+ "Canada's last fully intact ice shelf has suddenly collapsed, " +
+ "forming a Manhattan-sized iceberg",
+ "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
+ "The National Park Service warns against sacrificing slower friends " +
+ "in a bear attack",
+ "Maine man wins $1M from $25 lottery ticket",
+ "Make huge profits without work, earn up to $100,000 a day"
+]
+
+# Create an index for the list of text
+embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
+
+print("%-20s %s" % ("Query", "Best Match"))
+print("-" * 50)
+
+# Run an embeddings search for each query
+for query in ("feel good story", "climate change", "public health story", "war",
+ "wildlife", "asia", "lucky", "dishonest junk"):
+ # Extract uid of first result
+ # search result format: (uid, score)
+ uid = embeddings.search(query, 1)[0][0]
+
+ # Print text
+ print("%-20s %s" % (query, data[uid]))
+
An embeddings instance can be created as follows:
+embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
+
+The example above builds a transformers-based embeddings instance. In this case, a transformers model is used to vectorize data during indexing and search.
+The embeddings instance is configuration-driven based on what is passed in the constructor. Embeddings indexes store vectors and can optionally store content. Content storage enables additional filtering and data retrieval options.
+After creating a new embeddings instance, the next step is adding data to it.
+embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
+
The index method takes an iterable collection of tuples with three values.
| Element | Description |
| --- | --- |
| id | unique record id |
| data | input data to index, can be text, a dictionary or object |
| tags | optional tags string, used to mark/label data as it's indexed |
When the data element is a dictionary and it has a field named text, that will be used for indexing.
The input iterable can be a list or generator. Generators help with indexing very large datasets as only portions of the data are in memory at any given time.
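For example, a generator can stream (id, data, tags) tuples one record at a time. A minimal sketch (the source iterable here is an in-memory list, but it could just as well be an open file handle over a very large dataset):

```python
def stream(source):
    # Yield (id, data, tags) tuples one record at a time - source can
    # be any iterable, including a file handle over a large dataset
    for uid, line in enumerate(source):
        yield (uid, line.strip(), None)

# Only one record is materialized at a time
documents = stream(["first record\n", "second record\n"])
# Would be consumed as: embeddings.index(documents)
```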
+More information on indexing can be found in the index guide.
+Once data is indexed, it is ready for search.
+embeddings.search(query, limit)
+
The search method takes two parameters, the query and query limit. The results format is different based on whether content is stored or not.
+(id, score) when content is not stored
+{**query columns} when content is stored
Both natural language and SQL queries are supported. More information can be found in the query guide.
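A small helper illustrating how both result shapes can be handled uniformly. The dictionary shape below is a sketch; the exact columns returned depend on the query:

```python
def result_id(result):
    # (id, score) tuple when content is not stored,
    # dictionary of query columns when content is stored
    return result["id"] if isinstance(result, dict) else result[0]

# Both shapes resolve to the same id
tuple_result = (4, 0.08)
dict_result = {"id": 4, "text": "Maine man wins $1M from $25 lottery ticket", "score": 0.08}
```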
+See this link for a full list of embeddings examples.
+
This section gives an in-depth overview on how to index data with txtai. We'll cover vectorization, indexing, updating and deleting data.
+The most compute-intensive step in building an index is vectorization. The path parameter sets the path to the vector model. There is logic to automatically detect the vector model method but it can also be set directly.
+The batch and encodebatch parameters control the vectorization process. Larger values for batch will pass larger batches to the vectorization method. Larger values for encodebatch will pass larger batches for each vector encode call. In the case of GPU vector models, larger values will consume more GPU memory.
Data is buffered to temporary storage during indexing as embeddings vectors can be quite large (for example 768 dimensions of float32 is 768 * 4 = 3072 bytes per vector). Once vectorization is complete, a mmapped array is created with all vectors for Approximate Nearest Neighbor (ANN) indexing.
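The buffering math above scales linearly with index size; a worked example for a hypothetical 1M-row index:

```python
# 768 dimensions of float32 = 768 * 4 = 3072 bytes per vector
dimensions, float32_bytes = 768, 4
per_vector = dimensions * float32_bytes

# Approximate raw vector storage for a hypothetical 1M-row index, in GB
total_gb = per_vector * 1_000_000 / 1024**3
```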
+As mentioned above, computed vectors are stored in an ANN. There are various index backends that can be configured. Faiss is the default backend.
+Embeddings indexes can optionally store content. When this is enabled, the input content is saved in a database alongside the computed vectors. This enables filtering on additional fields and content retrieval.
+Data is loaded into an index with either an index or upsert call.
+embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
+embeddings.upsert([(uid, text, None) for uid, text in enumerate(data)])
+
The index call will build a brand new index, replacing an existing one. upsert will insert or update records. upsert ops do not require a full index rebuild.
Indexes can be stored in a directory using the save method.
+embeddings.save("/path/to/save")
+
Compressed indexes are also supported.
+embeddings.save("/path/to/save/index.tar.gz")
+
In addition to saving indexes locally, they can also be persisted to cloud storage.
+embeddings.save("/path/to/save/index.tar.gz", cloud={...})
+
This is especially useful when running in a serverless context or otherwise running on temporary compute. Cloud storage is only supported with compressed indexes.
+Embeddings indexes can be restored using the load method.
+embeddings.load("/path/to/load")
+
Content can be removed from the index with the delete method. This method takes a list of ids to delete.
+embeddings.delete(ids)
+
When content storage is enabled, reindex can be called to rebuild the index with new settings. For example, the backend can be switched from faiss to hnsw or the vector model can be updated. This prevents having to go back to the original raw data.
+embeddings.reindex({"path": "sentence-transformers/all-MiniLM-L6-v2", "backend": "hnsw"})
+
Dimensionality reduction with UMAP combined with HDBSCAN is a popular topic modeling method found in a number of libraries. txtai takes a different approach with a semantic graph.
+Enabling a graph network adds a semantic graph at index time as data is being vectorized. Vector embeddings are used to create relationships in the graph. Finally, community detection algorithms build topic clusters. Semantic graphs can also be used to analyze data connectivity.
+This approach has the advantage of only having to vectorize data once. It also has the advantage of better topic precision given there isn't a dimensionality reduction operation (UMAP). Semantic graph examples are shown below.
+Get a mapping of discovered topics to associated ids.
+embeddings.graph.topics
+
Show the most central nodes in the index.
+embeddings.graph.centrality()
+
Show how node 1 and node 2 are connected in the graph.
+embeddings.graph.showpath(id1, id2)
+
Graphs are persisted alongside an embeddings index. Each save and load will also save and load the graph.
+When using word vector backed models with scoring set, a separate call is required before calling index, as follows:
embeddings.score([(uid, text, None) for uid, text in enumerate(data)])
+embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
+
+Two calls are required to support generator-backed iteration of data. The scoring index requires a separate full pass over the data.
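The reason a single generator cannot serve both calls is that generators are exhausted after one pass, so a fresh generator must be built for each call. A minimal sketch:

```python
def documents(data):
    # Yield (id, data, tags) tuples
    for uid, text in enumerate(data):
        yield (uid, text, None)

data = ["first text", "second text"]

generator = documents(data)
first_pass = list(generator)   # scoring pass consumes the generator
second_pass = list(generator)  # nothing left for the index pass

# Build a fresh generator per call instead:
# embeddings.score(documents(data))
# embeddings.index(documents(data))
```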
+Scoring instances can also create a standalone keyword-based index (BM25, TF-IDF). See this link to learn more.
+Embeddings
+
+
+
+Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts +will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results +that have the same meaning, not necessarily the same keywords.
+ +txtai/embeddings/base.py
class Embeddings:
+ """
+ Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts
+ will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results
+ that have the same meaning, not necessarily the same keywords.
+ """
+
+ # pylint: disable = W0231
+ def __init__(self, config=None):
+ """
+ Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be
+ synchronized.
+
+ Args:
+ config: embeddings configuration
+ """
+
+ # Index configuration
+ self.config = None
+
+ # Dimensionality reduction and scoring index - word vectors only
+ self.reducer, self.scoring = None, None
+
+ # Embeddings vector model - transforms data into similarity vectors
+ self.model = None
+
+ # Approximate nearest neighbor index
+ self.ann = None
+
+ # Document database
+ self.database = None
+
+ # Resolvable functions
+ self.functions = None
+
+ # Graph network
+ self.graph = None
+
+ # Query model
+ self.query = None
+
+ # Index archive
+ self.archive = None
+
+ # Set initial configuration
+ self.configure(config)
+
+ def score(self, documents):
+ """
+ Builds a scoring index. Only used by word vectors models.
+
+ Args:
+ documents: list of (id, data, tags)
+ """
+
+ # Build scoring index over documents
+ if self.scoring:
+ self.scoring.index(documents)
+
+ def index(self, documents, reindex=False):
+ """
+ Builds an embeddings index. This method overwrites an existing index.
+
+ Args:
+ documents: list of (id, data, tags)
+ reindex: if this is a reindex operation in which case database creation is skipped, defaults to False
+ """
+
+ # Set configuration to default configuration, if empty
+ if not self.config:
+ self.configure(self.defaults())
+
+ # Create document database, if necessary
+ if not reindex:
+ self.database = self.createdatabase()
+
+ # Reset archive since this is a new index
+ self.archive = None
+
+ # Create graph, if necessary
+ self.graph = self.creategraph()
+
+ # Create transform action
+ transform = Transform(self, Action.REINDEX if reindex else Action.INDEX)
+
+ with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
+ # Load documents into database and transform to vectors
+ ids, dimensions, embeddings = transform(documents, buffer)
+ if ids:
+ # Build LSA model (if enabled). Remove principal components from embeddings.
+ if self.config.get("pca"):
+ self.reducer = Reducer(embeddings, self.config["pca"])
+ self.reducer(embeddings)
+
+ # Normalize embeddings
+ self.normalize(embeddings)
+
+ # Save index dimensions
+ self.config["dimensions"] = dimensions
+
+ # Create approximate nearest neighbor index
+ self.ann = ANNFactory.create(self.config)
+
+ # Add embeddings to the index
+ self.ann.index(embeddings)
+
+ # Save indexids-ids mapping for indexes with no database, except when this is a reindex action
+ if not reindex and not self.database:
+ self.config["ids"] = ids
+
+ # Index graph, if necessary
+ if self.graph:
+ self.graph.index(Search(self, True), self.batchsimilarity)
+
+ def upsert(self, documents):
+ """
+ Runs an embeddings upsert operation. If the index exists, new data is
+ appended to the index, existing data is updated. If the index doesn't exist,
+ this method runs a standard index operation.
+
+ Args:
+ documents: list of (id, data, tags)
+ """
+
+ # Run standard insert if index doesn't exist or it has no records
+ if not self.count():
+ self.index(documents)
+ return
+
+ # Create transform action
+ transform = Transform(self, Action.UPSERT)
+
+ with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
+ # Load documents into database and transform to vectors
+ ids, _, embeddings = transform(documents, buffer)
+ if ids:
+ # Remove principal components from embeddings, if necessary
+ if self.reducer:
+ self.reducer(embeddings)
+
+ # Normalize embeddings
+ self.normalize(embeddings)
+
+ # Append embeddings to the index
+ self.ann.append(embeddings)
+
+ # Save indexids-ids mapping for indexes with no database
+ if not self.database:
+ self.config["ids"] = self.config["ids"] + ids
+
+ # Graph upsert, if necessary
+ if self.graph:
+ self.graph.upsert(Search(self, True), self.batchsimilarity)
+
+ def delete(self, ids):
+ """
+ Deletes from an embeddings index. Returns list of ids deleted.
+
+ Args:
+ ids: list of ids to delete
+
+ Returns:
+ list of ids deleted
+ """
+
+ # List of internal indices for each candidate id to delete
+ indices = []
+
+ # List of deleted ids
+ deletes = []
+
+ if self.database:
+ # Retrieve indexid-id mappings from database
+ ids = self.database.ids(ids)
+
+ # Parse out indices and ids to delete
+ indices = [i for i, _ in ids]
+ deletes = sorted(set(uid for _, uid in ids))
+
+ # Delete ids from database
+ self.database.delete(deletes)
+ elif self.ann:
+ # Lookup indexids from config for indexes with no database
+ indexids = self.config["ids"]
+
+ # Find existing ids
+ for uid in ids:
+ indices.extend([index for index, value in enumerate(indexids) if uid == value])
+
+ # Clear config ids
+ for index in indices:
+ deletes.append(indexids[index])
+ indexids[index] = None
+
+ # Delete indices from ann embeddings
+ if indices:
+ # Delete ids from index
+ self.ann.delete(indices)
+
+ # Delete ids from graph
+ if self.graph:
+ self.graph.delete(indices)
+
+ return deletes
+
+ def reindex(self, config, columns=None, function=None):
+ """
+ Recreates the approximate nearest neighbor (ann) index using config. This method only works if document
+ content storage is enabled.
+
+ Args:
+ config: new config
+ columns: optional list of document columns used to rebuild data
+ function: optional function to prepare content for indexing
+ """
+
+ if self.database:
+ # Keep content and objects parameters to ensure database is preserved
+ config["content"] = self.config["content"]
+ if "objects" in self.config:
+ config["objects"] = self.config["objects"]
+
+ # Reset configuration
+ self.configure(config)
+
+ # Reset function references
+ if self.functions:
+ self.functions.reset()
+
+ # Reindex
+ if function:
+ self.index(function(self.database.reindex(columns)), True)
+ else:
+ self.index(self.database.reindex(columns), True)
+
+ def transform(self, document):
+ """
+ Transforms document into an embeddings vector.
+
+ Args:
+ document: (id, data, tags)
+
+ Returns:
+ embeddings vector
+ """
+
+ return self.batchtransform([document])[0]
+
+ def batchtransform(self, documents, category=None):
+ """
+ Transforms documents into embeddings vectors.
+
+ Args:
+ documents: list of (id, data, tags)
+ category: category for instruction-based embeddings
+
+ Returns:
+ embeddings vectors
+ """
+
+ # Convert documents into sentence embeddings
+ embeddings = self.model.batchtransform(documents, category)
+
+ # Reduce the dimensionality of the embeddings. Scale the embeddings using this
+ # model to reduce the noise of common but less relevant terms.
+ if self.reducer:
+ self.reducer(embeddings)
+
+ # Normalize embeddings
+ self.normalize(embeddings)
+
+ return embeddings
+
+ def count(self):
+ """
+ Total number of elements in this embeddings index.
+
+ Returns:
+ number of elements in this embeddings index
+ """
+
+ return self.ann.count() if self.ann else 0
+
+ def search(self, query, limit=None):
+ """
+ Finds documents most similar to the input queries. This method will run either an approximate
+ nearest neighbor (ann) search or an approximate nearest neighbor + database search depending
+ on if a database is available.
+
+ Args:
+ query: input query
+ limit: maximum results
+
+ Returns:
+ list of (id, score) for ann search, list of dict for an ann+database search
+ """
+
+ results = self.batchsearch([query], limit)
+ return results[0] if results else results
+
+ def batchsearch(self, queries, limit=None):
+ """
+ Finds documents most similar to the input queries. This method will run either an approximate
+ nearest neighbor (ann) search or an approximate nearest neighbor + database search depending
+ on if a database is available.
+
+ Args:
+ queries: input queries
+ limit: maximum results
+
+ Returns:
+ list of (id, score) per query for ann search, list of dict per query for an ann+database search
+ """
+
+ return Search(self)(queries, limit if limit else 3)
+
+ def similarity(self, query, data):
+ """
+ Computes the similarity between query and list of data. Returns a list of
+ (id, score) sorted by highest score, where id is the index in data.
+
+ Args:
+ query: input query
+ data: list of data
+
+ Returns:
+ list of (id, score)
+ """
+
+ return self.batchsimilarity([query], data)[0]
+
+ def batchsimilarity(self, queries, data):
+ """
+ Computes the similarity between list of queries and list of data. Returns a list
+ of (id, score) sorted by highest score per query, where id is the index in data.
+
+ Args:
+ queries: input queries
+ data: list of data
+
+ Returns:
+ list of (id, score) per query
+ """
+
+ # Convert queries to embedding vectors
+ queries = self.batchtransform(((None, query, None) for query in queries), "query")
+ data = self.batchtransform(((None, row, None) for row in data), "data")
+
+ # Dot product on normalized vectors is equal to cosine similarity
+ scores = np.dot(queries, data.T).tolist()
+
+ # Add index and sort desc based on score
+ return [sorted(enumerate(score), key=lambda x: x[1], reverse=True) for score in scores]
+
+ def explain(self, query, texts=None, limit=None):
+ """
+ Explains the importance of each input token in text for a query.
+
+ Args:
+ query: input query
+ texts: optional list of (text|list of tokens), otherwise runs search query
+ limit: optional limit if texts is None
+
+ Returns:
+ list of dict per input text where a higher token scores represents higher importance relative to the query
+ """
+
+ results = self.batchexplain([query], texts, limit)
+ return results[0] if results else results
+
+ def batchexplain(self, queries, texts=None, limit=None):
+ """
+ Explains the importance of each input token in text for a list of queries.
+
+ Args:
+ queries: input queries
+ texts: optional list of (text|list of tokens), otherwise runs search queries
+ limit: optional limit if texts is None
+
+ Returns:
+ list of dict per input text per query where a higher token scores represents higher importance relative to the query
+ """
+
+ return Explain(self)(queries, texts, limit)
+
+ def terms(self, query):
+ """
+ Extracts keyword terms from a query.
+
+ Args:
+ query: input query
+
+ Returns:
+ query reduced down to keyword terms
+ """
+
+ return self.batchterms([query])[0]
+
+ def batchterms(self, queries):
+ """
+ Extracts keyword terms from a list of queries.
+
+ Args:
+ queries: list of queries
+
+ Returns:
+ list of queries reduced down to keyword term strings
+ """
+
+ return Terms(self)(queries)
+
+ def exists(self, path=None, cloud=None, **kwargs):
+ """
+ Checks if an index exists at path.
+
+ Args:
+ path: input path
+ cloud: cloud storage configuration
+ kwargs: additional configuration as keyword args
+
+ Returns:
+ True if index exists, False otherwise
+ """
+
+ # Check if this exists in a cloud instance
+ cloud = self.createcloud(cloud=cloud, **kwargs)
+ if cloud:
+ return cloud.exists(path)
+
+ # Check if this is an archive file and exists
+ path, apath = self.checkarchive(path)
+ if apath:
+ return os.path.exists(apath)
+
+ # Return true if path has a config or config.json file and an embeddings file
+ return path and (os.path.exists(f"{path}/config") or os.path.exists(f"{path}/config.json")) and os.path.exists(f"{path}/embeddings")
+
+ def load(self, path=None, cloud=None, **kwargs):
+ """
+ Loads an existing index from path.
+
+ Args:
+ path: input path
+ cloud: cloud storage configuration
+ kwargs: additional configuration as keyword args
+ """
+
+ # Load from cloud, if configured
+ cloud = self.createcloud(cloud=cloud, **kwargs)
+ if cloud:
+ path = cloud.load(path)
+
+ # Check if this is an archive file and extract
+ path, apath = self.checkarchive(path)
+ if apath:
+ self.archive.load(apath)
+
+ # Load index configuration
+ self.config = self.loadconfig(path)
+
+ # Approximate nearest neighbor index - stores embeddings vectors
+ self.ann = ANNFactory.create(self.config)
+ self.ann.load(f"{path}/embeddings")
+
+ # Dimensionality reduction model - word vectors only
+ if self.config.get("pca"):
+ self.reducer = Reducer()
+ self.reducer.load(f"{path}/lsa")
+
+ # Embedding scoring index - word vectors only
+ if self.config.get("scoring"):
+ self.scoring = ScoringFactory.create(self.config["scoring"])
+ self.scoring.load(f"{path}/scoring")
+
+ # Sentence vectors model - transforms data to embeddings vectors
+ self.model = self.loadvectors()
+
+ # Query model
+ self.query = self.loadquery()
+
+ # Document database - stores document content
+ self.database = self.createdatabase()
+ if self.database:
+ self.database.load(f"{path}/documents")
+
+ # Graph network - stores relationships
+ self.graph = self.creategraph()
+ if self.graph:
+ self.graph.load(f"{path}/graph")
+
+ def save(self, path, cloud=None, **kwargs):
+ """
+ Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip.
+ In those cases, the index is stored as a compressed file.
+
+ Args:
+ path: output path
+ cloud: cloud storage configuration
+ kwargs: additional configuration as keyword args
+ """
+
+ if self.config:
+ # Check if this is an archive file
+ path, apath = self.checkarchive(path)
+
+ # Create output directory, if necessary
+ os.makedirs(path, exist_ok=True)
+
+ # Copy sentence vectors model
+ if self.config.get("storevectors"):
+ shutil.copyfile(self.config["path"], os.path.join(path, os.path.basename(self.config["path"])))
+
+ self.config["path"] = os.path.basename(self.config["path"])
+
+ # Save index configuration
+ self.saveconfig(path)
+
+ # Save approximate nearest neighbor index
+ self.ann.save(f"{path}/embeddings")
+
+ # Save dimensionality reduction model (word vectors only)
+ if self.reducer:
+ self.reducer.save(f"{path}/lsa")
+
+ # Save embedding scoring index (word vectors only)
+ if self.scoring:
+ self.scoring.save(f"{path}/scoring")
+
+ # Save document database
+ if self.database:
+ self.database.save(f"{path}/documents")
+
+ # Save graph
+ if self.graph:
+ self.graph.save(f"{path}/graph")
+
+ # If this is an archive, save it
+ if apath:
+ self.archive.save(apath)
+
+ # Save to cloud, if configured
+ cloud = self.createcloud(cloud=cloud, **kwargs)
+ if cloud:
+ cloud.save(apath if apath else path)
+
+ def close(self):
+ """
+ Closes this embeddings index and frees all resources.
+ """
+
+ self.config, self.reducer, self.scoring, self.model = None, None, None, None
+ self.ann, self.graph, self.query, self.archive = None, None, None, None
+
+ # Close database connection if open
+ if self.database:
+ self.database.close()
+ self.database, self.functions = None, None
+
+ def info(self):
+ """
+ Prints the current embeddings index configuration.
+ """
+
+ # Copy and edit config
+ config = self.config.copy()
+
+ # Remove ids array if present
+ config.pop("ids", None)
+
+ # Print configuration
+ print(json.dumps(config, sort_keys=True, default=str, indent=2))
+
+ def configure(self, config):
+ """
+ Sets the configuration for this embeddings index and loads config-driven models.
+
+ Args:
+ config: embeddings configuration
+ """
+
+ # Configuration
+ self.config = config
+
+ if self.config and self.config.get("method") != "transformers":
+ # Dimensionality reduction model
+ self.reducer = None
+
+ # Embedding scoring method - weighs each word in a sentence
+ self.scoring = ScoringFactory.create(self.config["scoring"]) if self.config and self.config.get("scoring") else None
+ else:
+ self.reducer, self.scoring = None, None
+
+ # Sentence vectors model - transforms data to embeddings vectors
+ self.model = self.loadvectors() if self.config else None
+
+ # Query model
+ self.query = self.loadquery() if self.config else None
+
+ def defaults(self):
+ """
+ Builds a default configuration.
+
+ Returns:
+ default configuration
+ """
+
+ return {"path": "sentence-transformers/all-MiniLM-L6-v2"}
+
+ def loadconfig(self, path):
+ """
+ Loads index configuration. This method supports both config pickle files and config.json files.
+
+ Args:
+ path: path to directory
+
+ Returns:
+ dict
+ """
+
+ # Configuration
+ config = None
+
+ # Determine if config is json or pickle
+ jsonconfig = os.path.exists(f"{path}/config.json")
+
+ # Set config file name
+ name = "config.json" if jsonconfig else "config"
+
+ # Load configuration
+ with open(f"{path}/{name}", "r" if jsonconfig else "rb") as handle:
+ config = json.load(handle) if jsonconfig else pickle.load(handle)
+
+ # Build full path to embedding vectors file
+ if config.get("storevectors"):
+ config["path"] = os.path.join(path, config["path"])
+
+ return config
+
+ def saveconfig(self, path):
+ """
+ Saves index configuration. This method saves to JSON if possible, otherwise it falls back to pickle.
+
+ Args:
+ path: path to directory
+
+ Returns:
+ dict
+ """
+
+ # Default to pickle config
+ jsonconfig = self.config.get("format", "pickle") == "json"
+
+ # Set config file name
+ name = "config.json" if jsonconfig else "config"
+
+ # Write configuration
+ with open(f"{path}/{name}", "w" if jsonconfig else "wb", encoding="utf-8" if jsonconfig else None) as handle:
+ if jsonconfig:
+ # Write config as JSON
+ json.dump(self.config, handle, default=str, indent=2)
+ else:
+ # Write config as pickle format
+ pickle.dump(self.config, handle, protocol=__pickle__)
+
+ def loadvectors(self):
+ """
+ Loads a vector model set in config.
+
+ Returns:
+ vector model
+ """
+
+ return VectorsFactory.create(self.config, self.scoring)
+
+ def loadquery(self):
+ """
+ Loads a query model set in config.
+
+ Returns:
+ query model
+ """
+
+ if "query" in self.config:
+ return Query(**self.config["query"])
+
+ return None
+
+ def checkarchive(self, path):
+ """
+ Checks if path is an archive file.
+
+ Args:
+ path: path to check
+
+ Returns:
+ (working directory, current path) if this is an archive, original path otherwise
+ """
+
+ # Create archive instance, if necessary
+ self.archive = ArchiveFactory.create()
+
+ # Check if path is an archive file
+ if self.archive.isarchive(path):
+ # Return temporary archive working directory and original path
+ return self.archive.path(), path
+
+ return path, None
+
+ def createcloud(self, **cloud):
+ """
+ Creates a cloud instance from config.
+
+ Args:
+ cloud: cloud configuration
+ """
+
+ # Merge keyword args and keys under the cloud parameter
+ config = cloud
+ if "cloud" in config and config["cloud"]:
+ config.update(config.pop("cloud"))
+
+ # Create cloud instance from config and return
+ return CloudFactory.create(config) if config else None
+
+ def createdatabase(self):
+ """
+ Creates a database from config. This method will also close any existing database connection.
+
+ Returns:
+ new database, if enabled in config
+ """
+
+ # Free existing database resources
+ if self.database:
+ self.database.close()
+
+ config = self.config.copy()
+
+ # Create references to callable functions
+ self.functions = Functions(self) if "functions" in config else None
+ if self.functions:
+ config["functions"] = self.functions(config)
+
+ # Create database from config and return
+ return DatabaseFactory.create(config)
+
+ def creategraph(self):
+ """
+ Creates a graph from config.
+
+ Returns:
+ new graph, if enabled in config
+ """
+
+ return GraphFactory.create(self.config["graph"]) if "graph" in self.config else None
+
+ def normalize(self, embeddings):
+ """
+ Normalizes embeddings using L2 normalization. Operation applied directly on array.
+
+ Args:
+ embeddings: input embeddings matrix
+ """
+
+ # Calculation is different for matrices vs vectors
+ if len(embeddings.shape) > 1:
+ embeddings /= np.linalg.norm(embeddings, axis=1)[:, np.newaxis]
+ else:
+ embeddings /= np.linalg.norm(embeddings)
+
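The normalize method above divides each embedding by its L2 norm, so every vector ends up with unit length and a dot product between two vectors equals their cosine similarity. A minimal pure-Python sketch of the same calculation (standalone illustration, no txtai or numpy dependency):

```python
import math

def l2_normalize(vector):
    # Divide each component by the vector's L2 norm
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# A 3-4-5 triangle: the normalized vector is (0.6, 0.8) with unit length
unit = l2_normalize([3.0, 4.0])
length = math.sqrt(sum(x * x for x in unit))
```

The method applies the same operation row by row when given a matrix rather than a single vector.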
__init__(self, config=None)

+
+
+ special
+
+
+Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+config | | embeddings configuration | None
+
txtai/embeddings/base.py
def __init__(self, config=None):
+ """
+ Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be
+ synchronized.
+
+ Args:
+ config: embeddings configuration
+ """
+
+ # Index configuration
+ self.config = None
+
+ # Dimensionality reduction and scoring index - word vectors only
+ self.reducer, self.scoring = None, None
+
+ # Embeddings vector model - transforms data into similarity vectors
+ self.model = None
+
+ # Approximate nearest neighbor index
+ self.ann = None
+
+ # Document database
+ self.database = None
+
+ # Resolvable functions
+ self.functions = None
+
+ # Graph network
+ self.graph = None
+
+ # Query model
+ self.query = None
+
+ # Index archive
+ self.archive = None
+
+ # Set initial configuration
+ self.configure(config)
+
batchexplain(self, queries, texts=None, limit=None)
+
+
+Explains the importance of each input token in text for a list of queries.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+queries | | input queries | required
+texts | | optional list of (text\|list of tokens), otherwise runs search queries | None
+limit | | optional limit if texts is None | None
+
+Returns:
+
+Type | Description
+---|---
+ | list of dict per input text per query where a higher token score represents higher importance relative to the query
+
txtai/embeddings/base.py
def batchexplain(self, queries, texts=None, limit=None):
+ """
+ Explains the importance of each input token in text for a list of queries.
+
+ Args:
+ queries: input queries
+ texts: optional list of (text|list of tokens), otherwise runs search queries
+ limit: optional limit if texts is None
+
+ Returns:
+ list of dict per input text per query where a higher token score represents higher importance relative to the query
+ """
+
+ return Explain(self)(queries, texts, limit)
+
batchsearch(self, queries, limit=None)
+
+
+Finds documents most similar to the input queries. This method will run either an approximate nearest neighbor (ann) search or an approximate nearest neighbor + database search depending on whether a database is available.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+queries | | input queries | required
+limit | | maximum results | None
+
+Returns:
+
+Type | Description
+---|---
+ | list of (id, score) per query for ann search, list of dict per query for an ann+database search
+
txtai/embeddings/base.py
def batchsearch(self, queries, limit=None):
+ """
+ Finds documents most similar to the input queries. This method will run either an approximate
+ nearest neighbor (ann) search or an approximate nearest neighbor + database search depending
+ on if a database is available.
+
+ Args:
+ queries: input queries
+ limit: maximum results
+
+ Returns:
+ list of (id, score) per query for ann search, list of dict per query for an ann+database search
+ """
+
+ return Search(self)(queries, limit if limit else 3)
+
batchsimilarity(self, queries, data)
+
+
+Computes the similarity between list of queries and list of data. Returns a list of (id, score) sorted by highest score per query, where id is the index in data.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+queries | | input queries | required
+data | | list of data | required
+
+Returns:
+
+Type | Description
+---|---
+ | list of (id, score) per query
+
txtai/embeddings/base.py
def batchsimilarity(self, queries, data):
+ """
+ Computes the similarity between list of queries and list of data. Returns a list
+ of (id, score) sorted by highest score per query, where id is the index in data.
+
+ Args:
+ queries: input queries
+ data: list of data
+
+ Returns:
+ list of (id, score) per query
+ """
+
+ # Convert queries to embedding vectors
+ queries = self.batchtransform(((None, query, None) for query in queries), "query")
+ data = self.batchtransform(((None, row, None) for row in data), "data")
+
+ # Dot product on normalized vectors is equal to cosine similarity
+ scores = np.dot(queries, data.T).tolist()
+
+ # Add index and sort desc based on score
+ return [sorted(enumerate(score), key=lambda x: x[1], reverse=True) for score in scores]
+
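Because embeddings are L2-normalized before indexing, the np.dot call in batchsimilarity computes cosine similarity directly. The scoring and ranking step can be sketched standalone in pure Python (illustrative only, not the txtai implementation):

```python
import math

def normalize(vector):
    # Scale the vector to unit length
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def rank(query, data):
    # Dot product on normalized vectors equals cosine similarity
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(row))) for row in data]

    # Pair each score with its index in data, sort descending by score
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

ranked = rank([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
```

As in batchsimilarity, each result id is the position of the matching entry in the input data list.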
batchterms(self, queries)
+
+
+Extracts keyword terms from a list of queries.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+queries | | list of queries | required
+
+Returns:
+
+Type | Description
+---|---
+ | list of queries reduced down to keyword term strings
+
txtai/embeddings/base.py
def batchterms(self, queries):
+ """
+ Extracts keyword terms from a list of queries.
+
+ Args:
+ queries: list of queries
+
+ Returns:
+ list of queries reduced down to keyword term strings
+ """
+
+ return Terms(self)(queries)
+
batchtransform(self, documents, category=None)
+
+
+Transforms documents into embeddings vectors.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+documents | | list of (id, data, tags) | required
+category | | category for instruction-based embeddings | None
+
+Returns:
+
+Type | Description
+---|---
+ | embeddings vectors
+
txtai/embeddings/base.py
def batchtransform(self, documents, category=None):
+ """
+ Transforms documents into embeddings vectors.
+
+ Args:
+ documents: list of (id, data, tags)
+ category: category for instruction-based embeddings
+
+ Returns:
+ embeddings vectors
+ """
+
+ # Convert documents into sentence embeddings
+ embeddings = self.model.batchtransform(documents, category)
+
+ # Reduce the dimensionality of the embeddings. Scale the embeddings using this
+ # model to reduce the noise of common but less relevant terms.
+ if self.reducer:
+ self.reducer(embeddings)
+
+ # Normalize embeddings
+ self.normalize(embeddings)
+
+ return embeddings
+
close(self)
+
+
+Closes this embeddings index and frees all resources.
+txtai/embeddings/base.py
def close(self):
+ """
+ Closes this embeddings index and frees all resources.
+ """
+
+ self.config, self.reducer, self.scoring, self.model = None, None, None, None
+ self.ann, self.graph, self.query, self.archive = None, None, None, None
+
+ # Close database connection if open
+ if self.database:
+ self.database.close()
+ self.database, self.functions = None, None
+
count(self)
+
+
+Total number of elements in this embeddings index.
+Returns:
+
+Type | Description
+---|---
+ | number of elements in this embeddings index
+
txtai/embeddings/base.py
def count(self):
+ """
+ Total number of elements in this embeddings index.
+
+ Returns:
+ number of elements in this embeddings index
+ """
+
+ return self.ann.count() if self.ann else 0
+
delete(self, ids)
+
+
+Deletes from an embeddings index. Returns list of ids deleted.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+ids | | list of ids to delete | required
+
+Returns:
+
+Type | Description
+---|---
+ | list of ids deleted
+
txtai/embeddings/base.py
def delete(self, ids):
+ """
+ Deletes from an embeddings index. Returns list of ids deleted.
+
+ Args:
+ ids: list of ids to delete
+
+ Returns:
+ list of ids deleted
+ """
+
+ # List of internal indices for each candidate id to delete
+ indices = []
+
+ # List of deleted ids
+ deletes = []
+
+ if self.database:
+ # Retrieve indexid-id mappings from database
+ ids = self.database.ids(ids)
+
+ # Parse out indices and ids to delete
+ indices = [i for i, _ in ids]
+ deletes = sorted(set(uid for _, uid in ids))
+
+ # Delete ids from database
+ self.database.delete(deletes)
+ elif self.ann:
+ # Lookup indexids from config for indexes with no database
+ indexids = self.config["ids"]
+
+ # Find existing ids
+ for uid in ids:
+ indices.extend([index for index, value in enumerate(indexids) if uid == value])
+
+ # Clear config ids
+ for index in indices:
+ deletes.append(indexids[index])
+ indexids[index] = None
+
+ # Delete indices from ann embeddings
+ if indices:
+ # Delete ids from index
+ self.ann.delete(indices)
+
+ # Delete ids from graph
+ if self.graph:
+ self.graph.delete(indices)
+
+ return deletes
+
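For indexes without a database, the method above resolves ids through the config ids list and tombstones deleted positions with None, so list positions stay aligned with the ANN index. That bookkeeping can be sketched standalone (hypothetical helper mirroring the logic above):

```python
def delete_ids(indexids, ids):
    # Find every position matching a candidate id
    indices = []
    for uid in ids:
        indices.extend(index for index, value in enumerate(indexids) if value == uid)

    # Record deleted ids and tombstone positions with None
    deletes = []
    for index in indices:
        deletes.append(indexids[index])
        indexids[index] = None

    return indices, deletes

indexids = ["doc1", "doc2", "doc3", "doc2"]
indices, deletes = delete_ids(indexids, ["doc2"])
```

Tombstoning instead of removing entries keeps the remaining positions valid after the matching vectors are dropped from the ANN index.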
exists(self, path=None, cloud=None, **kwargs)
+
+
+Checks if an index exists at path.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+path | | input path | None
+cloud | | cloud storage configuration | None
+kwargs | | additional configuration as keyword args | {}
+
+Returns:
+
+Type | Description
+---|---
+ | True if index exists, False otherwise
+
txtai/embeddings/base.py
def exists(self, path=None, cloud=None, **kwargs):
+ """
+ Checks if an index exists at path.
+
+ Args:
+ path: input path
+ cloud: cloud storage configuration
+ kwargs: additional configuration as keyword args
+
+ Returns:
+ True if index exists, False otherwise
+ """
+
+ # Check if this exists in a cloud instance
+ cloud = self.createcloud(cloud=cloud, **kwargs)
+ if cloud:
+ return cloud.exists(path)
+
+ # Check if this is an archive file and exists
+ path, apath = self.checkarchive(path)
+ if apath:
+ return os.path.exists(apath)
+
+ # Return true if path has a config or config.json file and an embeddings file
+ return path and (os.path.exists(f"{path}/config") or os.path.exists(f"{path}/config.json")) and os.path.exists(f"{path}/embeddings")
+
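The final check above treats a directory as a valid index only when it contains a config (or config.json) file alongside an embeddings file. That check can be exercised standalone (illustrative helper, not part of txtai):

```python
import os
import tempfile

def index_exists(path):
    # A directory is an index if it has a config file and an embeddings file
    hasconfig = os.path.exists(f"{path}/config") or os.path.exists(f"{path}/config.json")
    return bool(path and hasconfig and os.path.exists(f"{path}/embeddings"))

with tempfile.TemporaryDirectory() as path:
    before = index_exists(path)

    # Simulate a saved index by creating empty marker files
    for name in ("config.json", "embeddings"):
        with open(os.path.join(path, name), "w", encoding="utf-8") as handle:
            handle.write("")

    after = index_exists(path)
```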
explain(self, query, texts=None, limit=None)
+
+
+Explains the importance of each input token in text for a query.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+query | | input query | required
+texts | | optional list of (text\|list of tokens), otherwise runs search query | None
+limit | | optional limit if texts is None | None
+
+Returns:
+
+Type | Description
+---|---
+ | list of dict per input text where a higher token score represents higher importance relative to the query
+
txtai/embeddings/base.py
def explain(self, query, texts=None, limit=None):
+ """
+ Explains the importance of each input token in text for a query.
+
+ Args:
+ query: input query
+ texts: optional list of (text|list of tokens), otherwise runs search query
+ limit: optional limit if texts is None
+
+ Returns:
+ list of dict per input text where a higher token score represents higher importance relative to the query
+ """
+
+ results = self.batchexplain([query], texts, limit)
+ return results[0] if results else results
+
index(self, documents, reindex=False)
+
+
+Builds an embeddings index. This method overwrites an existing index.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+documents | | list of (id, data, tags) | required
+reindex | | if this is a reindex operation in which case database creation is skipped, defaults to False | False
+
txtai/embeddings/base.py
def index(self, documents, reindex=False):
+ """
+ Builds an embeddings index. This method overwrites an existing index.
+
+ Args:
+ documents: list of (id, data, tags)
+ reindex: if this is a reindex operation in which case database creation is skipped, defaults to False
+ """
+
+ # Set configuration to default configuration, if empty
+ if not self.config:
+ self.configure(self.defaults())
+
+ # Create document database, if necessary
+ if not reindex:
+ self.database = self.createdatabase()
+
+ # Reset archive since this is a new index
+ self.archive = None
+
+ # Create graph, if necessary
+ self.graph = self.creategraph()
+
+ # Create transform action
+ transform = Transform(self, Action.REINDEX if reindex else Action.INDEX)
+
+ with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
+ # Load documents into database and transform to vectors
+ ids, dimensions, embeddings = transform(documents, buffer)
+ if ids:
+ # Build LSA model (if enabled). Remove principal components from embeddings.
+ if self.config.get("pca"):
+ self.reducer = Reducer(embeddings, self.config["pca"])
+ self.reducer(embeddings)
+
+ # Normalize embeddings
+ self.normalize(embeddings)
+
+ # Save index dimensions
+ self.config["dimensions"] = dimensions
+
+ # Create approximate nearest neighbor index
+ self.ann = ANNFactory.create(self.config)
+
+ # Add embeddings to the index
+ self.ann.index(embeddings)
+
+ # Save indexids-ids mapping for indexes with no database, except when this is a reindex action
+ if not reindex and not self.database:
+ self.config["ids"] = ids
+
+ # Index graph, if necessary
+ if self.graph:
+ self.graph.index(Search(self, True), self.batchsimilarity)
+
info(self)
+
+
+Prints the current embeddings index configuration.
+txtai/embeddings/base.py
def info(self):
+ """
+ Prints the current embeddings index configuration.
+ """
+
+ # Copy and edit config
+ config = self.config.copy()
+
+ # Remove ids array if present
+ config.pop("ids", None)
+
+ # Print configuration
+ print(json.dumps(config, sort_keys=True, default=str, indent=2))
+
load(self, path=None, cloud=None, **kwargs)
+
+
+Loads an existing index from path.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+path | | input path | None
+cloud | | cloud storage configuration | None
+kwargs | | additional configuration as keyword args | {}
+
txtai/embeddings/base.py
def load(self, path=None, cloud=None, **kwargs):
+ """
+ Loads an existing index from path.
+
+ Args:
+ path: input path
+ cloud: cloud storage configuration
+ kwargs: additional configuration as keyword args
+ """
+
+ # Load from cloud, if configured
+ cloud = self.createcloud(cloud=cloud, **kwargs)
+ if cloud:
+ path = cloud.load(path)
+
+ # Check if this is an archive file and extract
+ path, apath = self.checkarchive(path)
+ if apath:
+ self.archive.load(apath)
+
+ # Load index configuration
+ self.config = self.loadconfig(path)
+
+ # Approximate nearest neighbor index - stores embeddings vectors
+ self.ann = ANNFactory.create(self.config)
+ self.ann.load(f"{path}/embeddings")
+
+ # Dimensionality reduction model - word vectors only
+ if self.config.get("pca"):
+ self.reducer = Reducer()
+ self.reducer.load(f"{path}/lsa")
+
+ # Embedding scoring index - word vectors only
+ if self.config.get("scoring"):
+ self.scoring = ScoringFactory.create(self.config["scoring"])
+ self.scoring.load(f"{path}/scoring")
+
+ # Sentence vectors model - transforms data to embeddings vectors
+ self.model = self.loadvectors()
+
+ # Query model
+ self.query = self.loadquery()
+
+ # Document database - stores document content
+ self.database = self.createdatabase()
+ if self.database:
+ self.database.load(f"{path}/documents")
+
+ # Graph network - stores relationships
+ self.graph = self.creategraph()
+ if self.graph:
+ self.graph.load(f"{path}/graph")
+
reindex(self, config, columns=None, function=None)
+
+
+Recreates the approximate nearest neighbor (ann) index using config. This method only works if document content storage is enabled.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+config | | new config | required
+columns | | optional list of document columns used to rebuild data | None
+function | | optional function to prepare content for indexing | None
+
txtai/embeddings/base.py
def reindex(self, config, columns=None, function=None):
+ """
+ Recreates the approximate nearest neighbor (ann) index using config. This method only works if document
+ content storage is enabled.
+
+ Args:
+ config: new config
+ columns: optional list of document columns used to rebuild data
+ function: optional function to prepare content for indexing
+ """
+
+ if self.database:
+ # Keep content and objects parameters to ensure database is preserved
+ config["content"] = self.config["content"]
+ if "objects" in self.config:
+ config["objects"] = self.config["objects"]
+
+ # Reset configuration
+ self.configure(config)
+
+ # Reset function references
+ if self.functions:
+ self.functions.reset()
+
+ # Reindex
+ if function:
+ self.index(function(self.database.reindex(columns)), True)
+ else:
+ self.index(self.database.reindex(columns), True)
+
save(self, path, cloud=None, **kwargs)
+
+
+Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip. In those cases, the index is stored as a compressed file.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+path | | output path | required
+cloud | | cloud storage configuration | None
+kwargs | | additional configuration as keyword args | {}
+
txtai/embeddings/base.py
def save(self, path, cloud=None, **kwargs):
+ """
+ Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip.
+ In those cases, the index is stored as a compressed file.
+
+ Args:
+ path: output path
+ cloud: cloud storage configuration
+ kwargs: additional configuration as keyword args
+ """
+
+ if self.config:
+ # Check if this is an archive file
+ path, apath = self.checkarchive(path)
+
+ # Create output directory, if necessary
+ os.makedirs(path, exist_ok=True)
+
+ # Copy sentence vectors model
+ if self.config.get("storevectors"):
+ shutil.copyfile(self.config["path"], os.path.join(path, os.path.basename(self.config["path"])))
+
+ self.config["path"] = os.path.basename(self.config["path"])
+
+ # Save index configuration
+ self.saveconfig(path)
+
+ # Save approximate nearest neighbor index
+ self.ann.save(f"{path}/embeddings")
+
+ # Save dimensionality reduction model (word vectors only)
+ if self.reducer:
+ self.reducer.save(f"{path}/lsa")
+
+ # Save embedding scoring index (word vectors only)
+ if self.scoring:
+ self.scoring.save(f"{path}/scoring")
+
+ # Save document database
+ if self.database:
+ self.database.save(f"{path}/documents")
+
+ # Save graph
+ if self.graph:
+ self.graph.save(f"{path}/graph")
+
+ # If this is an archive, save it
+ if apath:
+ self.archive.save(apath)
+
+ # Save to cloud, if configured
+ cloud = self.createcloud(cloud=cloud, **kwargs)
+ if cloud:
+ cloud.save(apath if apath else path)
+
score(self, documents)
+
+
+Builds a scoring index. Only used by word vectors models.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+documents | | list of (id, data, tags) | required
+
txtai/embeddings/base.py
def score(self, documents):
+ """
+ Builds a scoring index. Only used by word vectors models.
+
+ Args:
+ documents: list of (id, data, tags)
+ """
+
+ # Build scoring index over documents
+ if self.scoring:
+ self.scoring.index(documents)
+
search(self, query, limit=None)
+
+
+Finds documents most similar to the input query. This method will run either an approximate nearest neighbor (ann) search or an approximate nearest neighbor + database search depending on whether a database is available.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+query | | input query | required
+limit | | maximum results | None
+
+Returns:
+
+Type | Description
+---|---
+ | list of (id, score) for ann search, list of dict for an ann+database search
+
txtai/embeddings/base.py
def search(self, query, limit=None):
+ """
+ Finds documents most similar to the input queries. This method will run either an approximate
+ nearest neighbor (ann) search or an approximate nearest neighbor + database search depending
+ on if a database is available.
+
+ Args:
+ query: input query
+ limit: maximum results
+
+ Returns:
+ list of (id, score) for ann search, list of dict for an ann+database search
+ """
+
+ results = self.batchsearch([query], limit)
+ return results[0] if results else results
+
similarity(self, query, data)
+
+
+Computes the similarity between query and list of data. Returns a list of (id, score) sorted by highest score, where id is the index in data.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+query | | input query | required
+data | | list of data | required
+
+Returns:
+
+Type | Description
+---|---
+ | list of (id, score)
+
txtai/embeddings/base.py
def similarity(self, query, data):
+ """
+ Computes the similarity between query and list of data. Returns a list of
+ (id, score) sorted by highest score, where id is the index in data.
+
+ Args:
+ query: input query
+ data: list of data
+
+ Returns:
+ list of (id, score)
+ """
+
+ return self.batchsimilarity([query], data)[0]
+
terms(self, query)
+
+
+Extracts keyword terms from a query.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+query | | input query | required
+
+Returns:
+
+Type | Description
+---|---
+ | query reduced down to keyword terms
+
txtai/embeddings/base.py
def terms(self, query):
+ """
+ Extracts keyword terms from a query.
+
+ Args:
+ query: input query
+
+ Returns:
+ query reduced down to keyword terms
+ """
+
+ return self.batchterms([query])[0]
+
transform(self, document)
+
+
+Transforms document into an embeddings vector.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+document | | (id, data, tags) | required
+
+Returns:
+
+Type | Description
+---|---
+ | embeddings vector
+
txtai/embeddings/base.py
def transform(self, document):
+ """
+ Transforms document into an embeddings vector.
+
+ Args:
+ document: (id, data, tags)
+
+ Returns:
+ embeddings vector
+ """
+
+ return self.batchtransform([document])[0]
+
upsert(self, documents)
+
+
+Runs an embeddings upsert operation. If the index exists, new data is appended to the index, existing data is updated. If the index doesn't exist, this method runs a standard index operation.
+Parameters:
+
+Name | Type | Description | Default
+---|---|---|---
+documents | | list of (id, data, tags) | required
+
txtai/embeddings/base.py
def upsert(self, documents):
+ """
+ Runs an embeddings upsert operation. If the index exists, new data is
+ appended to the index, existing data is updated. If the index doesn't exist,
+ this method runs a standard index operation.
+
+ Args:
+ documents: list of (id, data, tags)
+ """
+
+ # Run standard insert if index doesn't exist or it has no records
+ if not self.count():
+ self.index(documents)
+ return
+
+ # Create transform action
+ transform = Transform(self, Action.UPSERT)
+
+ with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
+ # Load documents into database and transform to vectors
+ ids, _, embeddings = transform(documents, buffer)
+ if ids:
+ # Remove principal components from embeddings, if necessary
+ if self.reducer:
+ self.reducer(embeddings)
+
+ # Normalize embeddings
+ self.normalize(embeddings)
+
+ # Append embeddings to the index
+ self.ann.append(embeddings)
+
+ # Save indexids-ids mapping for indexes with no database
+ if not self.database:
+ self.config["ids"] = self.config["ids"] + ids
+
+ # Graph upsert, if necessary
+ if self.graph:
+ self.graph.upsert(Search(self, True), self.batchsimilarity)
+
+
This section covers how to query data with txtai. The simplest way to search for data is building a natural language string with the desired content to find. txtai also supports querying with SQL. We'll cover both methods here.
+In the simplest case, the query is text and the results are the indexed text most similar to the query text.
+embeddings.search("feel good story")
+embeddings.search("wildlife")
+
The queries above search the index for similarity matches on `feel good story` and `wildlife`. If content storage is enabled, a list of `{**query columns}` is returned. Otherwise, a list of `(id, score)` tuples is returned.
txtai supports more complex queries with SQL. This is only supported if content storage is enabled. txtai has a translation layer that analyzes input SQL statements and combines similarity results with content stored in a relational database.
+SQL queries are run through `embeddings.search` like natural language queries, but the examples below only show the SQL query for conciseness.
embeddings.search("SQL query")
+
The similar clause is a txtai function that enables similarity searches with SQL.
+SELECT id, text, score FROM txtai WHERE similar('feel good story')
+
The similar clause takes two arguments:
+similar("query", "number of candidates")
+
Argument | Description
---|---
query | natural language query to run
number of candidates | number of candidate results to return
The txtai query layer has to join results from two separate components, a relational store and a similarity index. With a similar clause, a similarity search is run and those ids are fed to the underlying database query.
+The number of candidates should be larger than the desired number of results when applying additional filter clauses. This ensures that `limit` results are still returned after applying additional filters. If the number of candidates is not specified, a default value based on the query limit is used.
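For example, the candidate count can be raised explicitly when heavy filtering is expected. A query sketch passing 50 candidates as the second argument to the similar clause (values chosen for illustration):

```sql
SELECT id, text, score FROM txtai
WHERE similar('feel good story', 50) AND score >= 0.15
LIMIT 10
```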
Content can be indexed in multiple ways when content storage is enabled. Remember that input documents take the form of `(id, data, tags)` tuples. If data is a string, then content is primarily filtered with similar clauses. If data is a dictionary, then all fields in the dictionary are indexed and searchable.
For example:
+embeddings.index([(0, {"text": "text to index", "flag": True,
+ "entry": "2022-01-01"}, None)])
+
With the above input data, queries can now have more complex filters.
+SELECT text, flag, entry FROM txtai WHERE similar('query') AND flag = 1
+AND entry >= '2022-01-01'
+
txtai's query layer automatically detects columns and translates queries into a format that can be understood by the underlying database.
+Nested dictionaries/JSON are supported. Nested columns, including names with spaces, can be escaped with bracket statements.
+embeddings.index([(0, {"text": "text to index",
+ "parent": {"child element": "abc"}}, None)])
+
SELECT text FROM txtai WHERE [parent.child element] = 'abc'
+
Note the bracket statement escaping the nested column with spaces in the name.

The goal of txtai's query language is to closely support all functions in the underlying database engine. The main challenge is ensuring dynamic columns are properly escaped into the engine's native query function.

Aggregation query examples:
```sql
SELECT count(*) FROM txtai WHERE similar('feel good story') AND score >= 0.15
SELECT max(length(text)) FROM txtai WHERE similar('feel good story')
AND score >= 0.15
SELECT count(*), flag FROM txtai GROUP BY flag ORDER BY count(*) DESC
```
txtai has support for storing and retrieving binary objects. Binary objects can be retrieved as shown in the example below.

```python
# Get an image
request = open("demo.gif", "rb")

# Insert record
embeddings.index([("txtai", {"text": "txtai executes machine-learning workflows.",
                             "object": request.read()}, None)])

# Query txtai and get associated object
query = "select object from txtai where similar('machine learning') limit 1"
result = embeddings.search(query)[0]["object"]
```
Custom, user-defined SQL functions extend selection, filtering and ordering clauses with additional logic. For example, the following snippet defines a function that translates text using a translation pipeline.

```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Translation

# Translation pipeline
translate = Translation()

# Create embeddings index
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2",
                         "content": True,
                         "functions": [translate]})

# Run a search using a custom SQL function
embeddings.search("""
select
  text,
  translation(text, 'de', null) 'text (DE)',
  translation(text, 'es', null) 'text (ES)',
  translation(text, 'fr', null) 'text (FR)'
from txtai where similar('feel good story')
limit 1
""")
```
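The mechanism underneath is familiar from SQLite itself, which allows registering a Python callable as a SQL function. The sketch below demonstrates that base mechanism with the standard library; it is an illustration of the concept, not txtai's internal code, and the `shout` function is a made-up toy example.

```python
import sqlite3

def shout(text):
    # Toy user-defined function
    return text.upper()

db = sqlite3.connect(":memory:")

# Register the Python callable as a 1-argument SQL function
db.create_function("shout", 1, shout)

db.execute("CREATE TABLE txtai (id TEXT, text TEXT)")
db.execute("INSERT INTO txtai VALUES ('0', 'feel good story')")

row = db.execute("SELECT shout(text) FROM txtai").fetchone()
print(row[0])  # FEEL GOOD STORY
```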
Natural language queries with filters can be converted to txtai-compatible SQL statements with query translation. For example:

```python
embeddings.search("feel good story since yesterday")
```

can be converted to a SQL statement with a similar clause and date filter.

```sql
select id, text, score from txtai where similar('feel good story') and
entry >= date('now', '-1 day')
```

This requires setting a query translation model. The default query translation model is t5-small-txtsql but this can easily be fine-tuned to handle different use cases.
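A query translation model is typically set in the embeddings configuration at index creation. The `query` section and Hub model path below are assumptions based on the default model named above; check the configuration reference and adjust for your own fine-tuned model.

```python
# Sketch: enabling query translation via the embeddings configuration.
# The "query" key and "NeuML/t5-small-txtsql" path are assumptions.
config = {
    "path": "sentence-transformers/nli-mpnet-base-v2",
    "content": True,
    "query": {"path": "NeuML/t5-small-txtsql"},
}

# embeddings = Embeddings(config)  # requires txtai installed
```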
When content storage is enabled, txtai becomes a dual storage engine. Content is stored in an underlying database (currently supports SQLite) along with an Approximate Nearest Neighbor (ANN) index. These components combine to deliver similarity search alongside traditional structured search.

The ANN index stores ids and vectors for each input element. When a natural language query is run, the query is translated into a vector and a similarity query finds the best matching ids. When a database is added into the mix, an additional step is applied. This step takes those ids and effectively inserts them as part of the underlying database query.

Dynamic columns are supported via the underlying engine. For SQLite, data is stored as JSON and dynamic columns are converted into json_extract clauses. This same concept can be expanded to other storage engines like PostgreSQL and could even work with NoSQL stores.
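The json_extract translation can be illustrated with plain SQLite. In this sketch (the table layout is a simplification, not txtai's internal schema), a dictionary field is stored as a JSON string and a dynamic column lookup becomes a json_extract clause:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (id TEXT, data TEXT)")

# Store the document dictionary as JSON text
db.execute("INSERT INTO sections VALUES (?, ?)",
           ("0", json.dumps({"text": "text to index", "flag": True})))

# A dynamic column like flag translates to a json_extract clause
row = db.execute(
    "SELECT json_extract(data, '$.flag') FROM sections"
).fetchone()
print(row[0])  # SQLite returns JSON true as integer 1
```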
The examples directory has a series of notebooks and applications giving an overview of txtai. See the sections below.

Build semantic/similarity/vector/neural search applications.
| Notebook | Description |
|---|---|
| Introducing txtai ▶️ | Overview of the functionality provided by txtai |
| Build an Embeddings index with Hugging Face Datasets | Index and search Hugging Face Datasets |
| Build an Embeddings index from a data source | Index and search a data source with word embeddings |
| Add semantic search to Elasticsearch | Add semantic search to existing search systems |
| Similarity search with images | Embed images and text into the same space for search |
| Distributed embeddings cluster | Distribute an embeddings index across multiple data nodes |
| What's new in txtai 4.0 | Content storage, SQL, object storage, reindex and compressed indexes |
| Anatomy of a txtai index | Deep dive into the file formats behind a txtai embeddings index |
| Custom Embeddings SQL functions | Add user-defined functions to Embeddings SQL |
| Model explainability | Explainability for semantic search |
| Query translation | Domain-specific natural language queries with query translation |
| Build a QA database | Question matching with semantic search |
| Embeddings components | Composable search with vector, SQL and scoring components |
| Semantic Graphs | Explore topics, data connectivity and run network analysis |
| Topic Modeling with BM25 | Topic modeling backed by a BM25 index |
| Prompt-driven search with LLMs | Embeddings-guided and Prompt-driven search with Large Language Models (LLMs) |
| Embeddings in the Cloud | Load and use an embeddings index from the Hugging Face Hub |
Transform data with language model backed pipelines.

| Notebook | Description |
|---|---|
| Extractive QA with txtai | Introduction to extractive question-answering with txtai |
| Extractive QA with Elasticsearch | Run extractive question-answering queries with Elasticsearch |
| Extractive QA to build structured data | Build structured datasets using extractive question-answering |
| Apply labels with zero shot classification | Use zero shot learning for labeling, classification and topic modeling |
| Building abstractive text summaries | Run abstractive text summarization |
| Extract text from documents | Extract text from PDF, Office, HTML and more |
| Text to speech generation | Generate speech from text |
| Transcribe audio to text | Convert audio files to text |
| Translate text between languages | Streamline machine translation and language detection |
| Generate image captions and detect objects | Captions and object detection for images |
| Near duplicate image detection | Identify duplicate and near-duplicate images |
| API Gallery | Using txtai in JavaScript, Java, Rust and Go |
Efficiently process data at scale.

| Notebook | Description |
|---|---|
| Run pipeline workflows ▶️ | Simple yet powerful constructs to efficiently process data |
| Transform tabular data with composable workflows | Transform, index and search tabular data |
| Tensor workflows | Performant processing of large tensor arrays |
| Entity extraction workflows | Identify entity/label combinations |
| Workflow Scheduling | Schedule workflows with cron expressions |
| Push notifications with workflows | Generate and push notifications with workflows |
| Pictures are a worth a thousand words | Generate webpage summary images with DALL-E mini |
| Run txtai with native code | Execute workflows in native code with the Python C API |
| Prompt templates and task chains | Build model prompts and connect tasks together with workflows |
Train NLP models.

| Notebook | Description |
|---|---|
| Train a text labeler | Build text sequence classification models |
| Train without labels | Use zero-shot classifiers to train new models |
| Train a QA model | Build and fine-tune question-answering models |
| Train a language model from scratch | Build new language models |
| Export and run models with ONNX | Export models with ONNX, run natively in JavaScript, Java and Rust |
| Export and run other machine learning models | Export and run models from scikit-learn, PyTorch and more |
Series of example applications with txtai. Links to hosted versions on Hugging Face Spaces are also provided.

| Application | Description | |
|---|---|---|
| Basic similarity search | Basic similarity search example. Data from the original txtai demo. | 🤗 |
| Book search | Book similarity search application. Index book descriptions and query using natural language statements. | Local run only |
| Image search | Image similarity search application. Index a directory of images and run searches to identify images similar to the input query. | 🤗 |
| Summarize an article | Summarize an article. Workflow that extracts text from a webpage and builds a summary. | 🤗 |
| Wiki search | Wikipedia search application. Queries the Wikipedia API and summarizes the top result. | 🤗 |
| Workflow builder | Build and execute txtai workflows. Connect summarization, text extraction, transcription, translation and similarity search pipelines together to run unified workflows. | 🤗 |
Below is a list of frequently asked questions and common issues encountered.

Issue

Embeddings query errors like this:

```
SQLError: no such function: json_extract
```

Solution

Upgrade to a Python version whose bundled SQLite supports json_extract.
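To check whether the current interpreter is affected, the standard library can probe the bundled SQLite directly. This diagnostic sketch is an assumption about a reasonable way to test for the JSON1 extension, not an official txtai utility:

```python
import sqlite3

db = sqlite3.connect(":memory:")
try:
    # json_extract is part of SQLite's JSON1 extension
    db.execute("SELECT json_extract('{\"a\": 1}', '$.a')")
    has_json = True
except sqlite3.OperationalError:
    has_json = False

print(sqlite3.sqlite_version, "json_extract available:", has_json)
```

If `has_json` is False, upgrading Python (which bundles a newer SQLite) resolves the error.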
Issue

Segmentation faults and similar errors on macOS

Solution

Downgrade PyTorch to <= 1.12. See issue #377 for more on this issue.
Issue

ContextualVersionConflict exception when importing certain libraries while running one of the example notebooks on Google Colab

Solution

Restart the kernel. See issue #409 for more on this issue.