Releases: JSv4/OpenContracts

v2.0.0 b1 - Add Data Extract and Corpus Querying

19 Jun 15:45
f55cdcf

2.0.0 Beta 1

Added Grid-based Data Extraction and Corpus Querying

This update extends the application's analytical capabilities with automated, background extraction of structured data from documents, which improves efficiency and scalability.

We've added several models on the backend:

  • Extract: Represents a headless, background annotation task linked to a Corpus and Fieldset.
  • Fieldset: Defines a reusable set of fields for Extracts, linked to Columns.
  • Column: Represents a discrete data structure to extract from a document, with various properties like query, match_text, output_type, and more.
  • Datacell: Represents extracted data for each column and document, storing data as JSON.
  • LanguageModel: Represents a language model to be used in the extraction process.
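
To make the relationships between these models more concrete, here is a minimal sketch using plain Python dataclasses. It is not the project's actual Django model code: apart from query, match_text, and output_type, which are named above, every field and method name is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Column:
    # query, match_text and output_type are named in the release notes;
    # "name" is a hypothetical addition for readability.
    name: str
    output_type: str
    query: Optional[str] = None
    match_text: Optional[str] = None


@dataclass
class Fieldset:
    # A reusable group of columns that several extracts can share.
    name: str
    columns: list[Column] = field(default_factory=list)


@dataclass
class Datacell:
    # One extracted value per (document, column) pair, held as JSON-like data.
    document_id: int
    column: Column
    data: dict[str, Any]


@dataclass
class Extract:
    # A background extraction job tying a corpus to a fieldset.
    corpus_id: int
    fieldset: Fieldset
    cells: list[Datacell] = field(default_factory=list)

    def grid(self) -> dict[int, dict[str, Any]]:
        """Arrange cells as {document_id: {column_name: data}}."""
        rows: dict[int, dict[str, Any]] = {}
        for cell in self.cells:
            rows.setdefault(cell.document_id, {})[cell.column.name] = cell.data
        return rows
```

The grid() helper mirrors the idea of an extract grid: one row per document and one cell per column.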

Improved Test Suite

  • LlamaIndex is now tested with vcr.py, so the corpus query and corpus extract tasks have realistic tests and mocks
  • Added many GraphQL query and endpoint tests

New GUI Elements

  • There is now an extract tab, along with a number of GUI elements that make it easy to construct an extract grid made up of documents, corpora, and reusable columns.
  • Within the Corpus view, there is a query tab you can use to ask questions of the corpus

What's Changed

Full Changelog: v1.3.0...v2.0.0b1

Add Nlm Parser

04 Jun 04:01
ef648e4

The major feature is the addition of the nlm-ingestor microservice, which will eventually replace the PAWLs preprocessor entirely (the preprocessor has periodic issues with certain document types). This allows us to import layout blocks along with the document and token layers.

What's Changed

  • Add Documentation on Annotation Creation Logic + Component(s) by @JSv4 in #113
  • Create overview.md by @JSv4 in #114
  • Add Nlm-ingestor by @JSv4 in #115
  • Add Structural Annotations and Vector Embeddings by @JSv4 in #116

Full Changelog: v1.2.2...v1.3.0

Upgrade Parser

13 Sep 04:50
be3de1c

I moved the PAWLs parser to its own repo and now point my dependency there. I also noticed that, while working to improve outputs for PDFs with poor image quality, I had made changes that went beyond bug fixes. These did improve the results, but they inadvertently introduced a scaling issue in the token coordinate system: tokens were offset from the image, so labeling was effectively broken. To fix the scaling issue, I rolled back the OCR quality workarounds in my new repo. They can be added back later, but, for now, OpenContracts functionality is restored.
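
The failure mode described here can be illustrated with a small sketch: token boxes only line up with the page image when both are mapped with the same scale factors. The Box type and scale_box function are hypothetical names for illustration and are not taken from the parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float


def scale_box(box: Box, page_size: tuple[float, float],
              image_size: tuple[float, float]) -> Box:
    """Map a token box from page coordinates to rendered-image pixels."""
    # One factor per axis, derived from the page and image dimensions.
    sx = image_size[0] / page_size[0]
    sy = image_size[1] / page_size[1]
    return Box(box.x * sx, box.y * sy, box.width * sx, box.height * sy)
```

If the image is rescaled during preprocessing but the token boxes are not (or the reverse), every box drifts away from the text it is meant to cover, which matches the offset described in this release.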

What's Changed

  • Fix Broken Coordinate System in Parser by @JSv4 in #112

Full Changelog: v1.2.1...v1.2.2

Add Annotated Document Import Mutation

13 May 03:49
c2b2902

Created a new format that encapsulates a document's PDF, its text, its PAWLs tokens, and all of its annotations, so that everything can be imported in a single API call. This will be useful for remote clients that process a document and then want to upload multiple annotations simultaneously. It will also support a planned feature to export single annotated documents in addition to entire corpora.
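
As a rough illustration of what bundling these pieces into one payload might look like, here is a minimal sketch. The key names and the function are hypothetical placeholders; the release notes do not specify the actual schema, so treat this as an assumption rather than the real import format.

```python
import base64
import json


def build_import_payload(pdf_bytes: bytes, text: str,
                         pawls_pages: list, annotations: list) -> str:
    """Bundle everything for one document into a single JSON string."""
    payload = {
        # All key names below are hypothetical, not the project's schema.
        "pdf_base64": base64.b64encode(pdf_bytes).decode("ascii"),
        "text": text,
        "pawls_pages": pawls_pages,
        "annotations": annotations,
    }
    return json.dumps(payload)
```

Encoding the PDF as base64 keeps the whole bundle valid JSON, so a remote client could send the document and its annotations together in one request.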

What's Changed

  • Added import task to import a single annotated doc. Also added a test. by @JSv4 in #110

Full Changelog: v1.2.0...v1.2.1

Add More Export Formats

10 Mar 06:10
bce304a

The main feature addition here is the ability to export documents into FUNSD-style annotations that can easily be loaded into LayoutLM-style models. There is also a LangChain export, but it's not fully baked yet: at the moment, it just exports full document text and metadata. This release also comes with a number of bug fixes.
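
For readers unfamiliar with the FUNSD layout, the sketch below converts simple annotation dictionaries into a FUNSD-style record, where each entity carries an id, text, box, label, words, and linking entry. The input shape and the to_funsd name are assumptions for illustration; this is not the project's exporter.

```python
def to_funsd(annotations: list[dict]) -> dict:
    """Convert simple annotation dicts into a FUNSD-style {"form": [...]} record."""
    form = []
    for idx, ann in enumerate(annotations):
        words = [{"text": w["text"], "box": list(w["box"])} for w in ann["words"]]
        # The entity box is the union of its word boxes: [x0, y0, x1, y1].
        box = [
            min(w["box"][0] for w in words),
            min(w["box"][1] for w in words),
            max(w["box"][2] for w in words),
            max(w["box"][3] for w in words),
        ]
        form.append({
            "id": idx,
            "text": " ".join(w["text"] for w in words),
            "box": box,
            "label": ann["label"],
            "words": words,
            "linking": [],  # No entity links in this sketch.
        })
    return {"form": form}
```

Each input annotation is assumed to have a label and a non-empty list of words, with word boxes given as [x0, y0, x1, y1].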

What's Changed

  • Fix Quickstart Docs by @JSv4 in #84
  • Fix Django Auth by @JSv4 in #86
  • Add Export Format Choice GUI by @JSv4 in #88
  • Quickstart updated to include steps to configure .env files. by @JSv4 in #89
  • Add Funsd Export by @JSv4 in #92

Full Changelog: v1.1.0...v1.2.0

v1.1.0 - Add Metadata Annotations and Improve Parser

28 Feb 04:31
44d85ff

Initial release of a version of OpenContracts that supports "metadata" annotations, which are essentially data fields the user (or API) can populate. Long-term, it'd be great to support multiple data types, but, for now, this is just string data. I've also rebuilt the document processing pipeline for higher performance and more robust handling of extreme variations in document sizes. Every document is split into single pages, and the pages are then added to a queue for processing. I still need to add some documentation on how to "tune" Celery workers for your machine. I'd suggest starting with --concurrency=1 (single-threaded) and then scaling the celery worker service via Docker Compose, but there are probably other approaches that would work too.
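
The page-splitting idea can be sketched with the standard library alone. This is only an illustration of the queueing pattern, using an in-process queue as a stand-in for the real task queue; the function names are hypothetical and the actual pipeline runs on Celery.

```python
from queue import Queue


def enqueue_pages(doc_id: str, page_count: int, work_queue: Queue) -> None:
    """Add one work item per page so no single task handles a whole document."""
    for page_index in range(page_count):
        work_queue.put((doc_id, page_index))


def drain(work_queue: Queue) -> list:
    """Stand-in for a single worker: handle items one at a time until empty."""
    processed = []
    while not work_queue.empty():
        processed.append(work_queue.get())
        work_queue.task_done()
    return processed
```

Because each item is a single page, a very large document becomes many small tasks instead of one long one, and throughput can be raised by adding workers rather than by making any one worker heavier.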

What's Changed

Full Changelog: v1.0.1...v1.1.0

v1.0.1 - Added API Token Authorization

20 Nov 00:25
87ad0b3

New Features:

This release adds an API Token Authorization mechanism so you can more easily integrate OpenContracts into backend services and infrastructure.
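
As a minimal sketch of how a backend service might attach such a token, the snippet below builds (but does not send) an HTTP request with an Authorization header. The URL, the header scheme, and the function name are all assumptions for illustration; consult the project documentation for the scheme OpenContracts actually expects.

```python
import urllib.request


def build_authorized_request(url: str, token: str,
                             scheme: str = "Bearer") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying an API token."""
    return urllib.request.Request(
        url,
        data=b"{}",
        headers={
            "Content-Type": "application/json",
            # The scheme is an assumption, not confirmed by the release notes.
            "Authorization": f"{scheme} {token}",
        },
        method="POST",
    )
```

Keeping the scheme as a parameter makes it easy to swap in whatever prefix the server requires without touching the rest of the client code.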

Chores:

A number of packages have been upgraded. See below.

What's Changed

Full Changelog: v1.0.0...v1.0.1

First Public Release

24 Oct 04:13

Initial public release, with sample deployments including Gremlin Analyzers.