Releases: JSv4/OpenContracts
v2.0.0 b1 - Add Data Extract and Corpus Querying
2.0.0 Beta 1
Added Grid-based Data Extraction and Corpus Querying
This update extends the analytical capabilities of the application: structured data can now be extracted from documents automatically, as a background task, which improves efficiency and scalability.
We've added several models on the backend:
- Extract: Represents a headless, background annotation task linked to a Corpus and Fieldset.
- Fieldset: Defines a reusable set of fields for Extracts, linked to Columns.
- Column: Represents a discrete data structure to extract from a document, with various properties like query, match_text, output_type, and more.
- Datacell: Represents extracted data for each column and document, storing data as JSON.
- LanguageModel: Represents a language model to be used in the extraction process.
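To make the relationships between these models concrete, here is a minimal sketch using plain Python dataclasses. It is not the actual backend code: the field names beyond query, match_text and output_type, the `grid` helper, and the use of integer ids are all assumptions made for illustration, and LanguageModel is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Column:
    # One discrete piece of data to extract from each document.
    name: str                      # hypothetical field, used as the grid header
    query: str
    output_type: str = "str"
    match_text: Optional[str] = None


@dataclass
class Fieldset:
    # A reusable set of columns that several extracts can share.
    name: str
    columns: list[Column] = field(default_factory=list)


@dataclass
class Datacell:
    # Extracted data for one (document, column) pair, kept JSON-like.
    document_id: int
    column: Column
    data: dict[str, Any]


@dataclass
class Extract:
    # A background extraction job over a corpus, driven by a fieldset.
    corpus_id: int
    fieldset: Fieldset
    cells: list[Datacell] = field(default_factory=list)

    def grid(self) -> dict[int, dict[str, Any]]:
        """Pivot cells into {document_id: {column_name: data}} (hypothetical helper)."""
        rows: dict[int, dict[str, Any]] = {}
        for cell in self.cells:
            rows.setdefault(cell.document_id, {})[cell.column.name] = cell.data
        return rows


parties = Column(name="parties", query="Who are the parties?", output_type="list")
fieldset = Fieldset(name="Basics", columns=[parties])
extract = Extract(corpus_id=1, fieldset=fieldset)
extract.cells.append(
    Datacell(document_id=42, column=parties, data={"data": ["Acme", "Globex"]})
)
```

Calling `extract.grid()` pivots the cells into one row per document, which is roughly the shape an extract grid needs for display.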
Improved Test Suite
- LlamaIndex is now tested with vcr.py, so the corpus query and corpus extract tasks have realistic tests and mocks.
- Added many GraphQL query and endpoint tests.
New GUI Elements
- There is now an extract tab, along with a number of GUI elements that make it easy to construct an extract grid made up of documents, corpora and reusable columns.
- Within the Corpus view, there is a query tab you can use to ask questions of the corpus.
What's Changed
Full Changelog: v1.3.0...v2.0.0b1
Add Nlm Parser
The major feature is the addition of the nlm-ingestor microservice, which will eventually replace the PAWLs preprocessor entirely (the preprocessor has periodic issues with certain doc types). This allows us to import layout blocks along with the document and token layers.
What's Changed
- Add Documentation on Annotation Creation Logic + Component(s) by @JSv4 in #113
- Create overview.md by @JSv4 in #114
- Add Nlm-ingestor by @JSv4 in #115
- Add Structural Annotations and Vector Embeddings by @JSv4 in #116
Full Changelog: v1.2.2...v1.3.0
Upgrade Parser
I moved the PAWLs parser to its own repo and am now pointing my dependency there. I also noticed that, in my work to improve outputs where PDF image quality is bad, I had made some changes beyond bug fixes. While these did improve the results, I inadvertently introduced a scaling issue in the token coordinate system: the tokens were offset from the image, so labeling was effectively broken. To fix the scaling issue, I rolled back the OCR quality workarounds in my new repo. These can be added back in later, but, for now, OpenContracts functionality is restored.
What's Changed
Full Changelog: v1.2.1...v1.2.2
Add Annotated Document Import Mutation
Created a new format that encapsulates a document's PDF, its text, its PAWLs tokens and all of its annotations, so that everything can be imported in a single API call. This will be useful for remote clients that process a document and then want to upload multiple annotations simultaneously. It will also support a planned feature to export single annotated documents in addition to entire corpuses.
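As a rough illustration of what such a bundle could look like, the sketch below packs the four pieces into one JSON payload. The class name, the field names and the annotation shape are hypothetical and are not taken from the actual import mutation; only the per-page PAWLs layout (a `page` object plus a `tokens` list) follows the usual PAWLs convention.

```python
import base64
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnnotatedDocumentBundle:
    """Hypothetical stand-in for the single-call import format."""
    title: str
    pdf_base64: str        # PDF bytes, base64-encoded so they survive JSON transport
    text: str              # extracted plain text
    pawls_pages: list      # PAWLs token layer, one entry per page
    annotations: list = field(default_factory=list)


def build_bundle(title, pdf_bytes, text, pawls_pages, annotations):
    """Assemble everything a remote client would send in one request."""
    return AnnotatedDocumentBundle(
        title=title,
        pdf_base64=base64.b64encode(pdf_bytes).decode("ascii"),
        text=text,
        pawls_pages=pawls_pages,
        annotations=annotations,
    )


bundle = build_bundle(
    title="Sample NDA",
    pdf_bytes=b"%PDF-1.4 placeholder",
    text="This Agreement is made between Acme and Globex.",
    pawls_pages=[
        {
            "page": {"index": 0, "width": 612, "height": 792},
            "tokens": [{"x": 72, "y": 72, "width": 30, "height": 10, "text": "This"}],
        }
    ],
    # Assumed annotation shape: a label plus the page and token indices it covers.
    annotations=[{"label": "Agreement Title", "page": 0, "token_indices": [0]}],
)

# A single JSON string that could be submitted in one API call.
payload = json.dumps(asdict(bundle))
```

Keeping the PDF, text, tokens and annotations in one object means a client never has to coordinate several uploads or wait for server-side parsing before attaching annotations.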
What's Changed
Full Changelog: v1.2.0...v1.2.1
Add More Export Formats
The main feature addition here is the ability to export documents into FUNSD-style annotations that can easily be loaded into LayoutLM-style models. There is also a LangChain export, but it's not fully baked yet. At the moment, it just exports full document text and metadata. This release also comes with a number of bug fixes.
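For readers unfamiliar with the FUNSD layout, here is a minimal sketch of how token-level annotations might be converted into it. This is not the exporter shipped in this release: the input shape (a label plus a non-empty list of x/y/width/height tokens) and both function names are assumptions. The output keys (`form`, `id`, `text`, `label`, `box`, `words`, `linking`) follow the published FUNSD annotation files.

```python
def token_box(token):
    """Convert an x/y/width/height token into a FUNSD [x0, y0, x1, y1] box."""
    return [
        token["x"],
        token["y"],
        token["x"] + token["width"],
        token["y"] + token["height"],
    ]


def to_funsd(annotations):
    """Build a FUNSD-style dict; each annotation must contain at least one token."""
    form = []
    for idx, ann in enumerate(annotations):
        words = [{"text": t["text"], "box": token_box(t)} for t in ann["tokens"]]
        boxes = [w["box"] for w in words]
        form.append(
            {
                "id": idx,
                "text": " ".join(w["text"] for w in words),
                "label": ann["label"],
                # The entity box is the union of its word boxes.
                "box": [
                    min(b[0] for b in boxes),
                    min(b[1] for b in boxes),
                    max(b[2] for b in boxes),
                    max(b[3] for b in boxes),
                ],
                "words": words,
                "linking": [],  # relations between entities are not modeled here
            }
        )
    return {"form": form}


sample = [
    {
        "label": "header",
        "tokens": [
            {"x": 10, "y": 20, "width": 30, "height": 10, "text": "Master"},
            {"x": 45, "y": 20, "width": 60, "height": 10, "text": "Agreement"},
        ],
    }
]
funsd = to_funsd(sample)
```

Each entity carries both its own bounding box and per-word boxes, which is what LayoutLM-style tokenizers typically expect when aligning text with layout.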
What's Changed
- Fix Quickstart Docs by @JSv4 in #84
- Fix Django Auth by @JSv4 in #86
- Add Export Format Choice GUI by @JSv4 in #88
- Quickstart updated to include steps to configure .env files. by @JSv4 in #89
- Add Funsd Export by @JSv4 in #92
Full Changelog: v1.1.0...v1.2.0
v1.1.0 - Add Metadata Annotations and Improve Parser
Initial release of a version of OpenContracts that supports "metadata" annotations, which are essentially data fields the user (or API) can populate. Long-term, it'd be great to support multiple data types, but, for now, this is just string data. I've also rebuilt the document processing pipeline for higher performance and more robust handling of extreme variations in document sizes. Every document is split into single pages, and the pages are then added to a queue for processing. I still need to add some documentation on how to "tune" Celery workers for your machine. I'd suggest starting with --concurrency=1 (single threaded) and then scaling the celery worker service via Docker Compose, but there are probably other approaches that would work too.
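The split-then-queue idea can be sketched with the standard library alone. This is only an illustration of the pattern, not the actual pipeline: it uses threads and `queue.Queue` in place of Celery, and `split_into_pages`, `process_page` and `run_pipeline` are hypothetical names. Here `worker_count` loosely plays the role of the number of scaled worker containers.

```python
import queue
import threading


def split_into_pages(doc_id, page_count):
    """Turn one document into one job per page."""
    return [(doc_id, page) for page in range(page_count)]


def process_page(job):
    """Placeholder for the real per-page parsing work."""
    doc_id, page = job
    return f"{doc_id}:page-{page}"


def run_pipeline(documents, worker_count=1):
    """Queue every page of every document, then drain the queue with workers."""
    jobs = queue.Queue()
    for doc_id, page_count in documents.items():
        for job in split_into_pages(doc_id, page_count):
            jobs.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        # All jobs are queued before workers start, so an empty queue means done.
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return
            output = process_page(job)
            with lock:
                results.append(output)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)


# A 2-page and a 3-page document yield five independent page jobs.
pages = run_pipeline({"a": 2, "b": 3}, worker_count=2)
```

Because each page is its own job, a very large document no longer blocks a worker for its full duration, and adding workers increases throughput without changing the job logic.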
What's Changed
- Bump sphinx from 5.0.2 to 5.3.0 by @dependabot in #29
- Bump mypy from 0.982 to 0.991 by @dependabot in #32
- Bump postgres from 15.0 to 15.1 in /compose/production/postgres by @dependabot in #31
- Bump flake8 from 4.0.1 to 5.0.4 by @dependabot in #30
- Bump django-celery-beat from 2.2.1 to 2.4.0 by @dependabot in #13
- Bump django-model-utils from 4.2.0 to 4.3.1 by @dependabot in #38
- Bump traefik from v2.9.4 to 2.9.5 in /compose/production/traefik by @dependabot in #40
- Bump django-environ from 0.8.1 to 0.9.0 by @dependabot in #36
- Bump django-stubs from 1.12.0 to 1.13.0 by @dependabot in #39
- Add .dockerignore by @JSv4 in #50
- Bump actions/checkout from 3.1.0 to 3.2.0 by @dependabot in #48
- Bump traefik from 2.9.5 to 2.9.6 in /compose/production/traefik by @dependabot in #47
- Bump redis from 3.5.3 to 4.4.0 by @dependabot in #45
- Bump drf-extra-fields from 3.2.1 to 3.4.1 by @dependabot in #42
- Bump sphinx from 5.3.0 to 6.1.0 by @dependabot in #55
- Bump actions/checkout from 3.2.0 to 3.3.0 by @dependabot in #54
- Bump actions/setup-node from 3.5.1 to 3.6.0 by @dependabot in #53
- Bump psycopg2 from 2.9.3 to 2.9.5 by @dependabot in #52
- Bump argon2-cffi from 21.1.0 to 21.3.0 by @dependabot in #44
- Add Backend Tweaks for Metadata Annotations by @JSv4 in #56
- Bump flake8 from 5.0.4 to 6.0.0 by @dependabot in #59
- Bump pytz from 2022.5 to 2022.7 by @dependabot in #57
- Bump pillow from 9.2.0 to 9.4.0 by @dependabot in #58
- Add GUI Elements to Filter on Metadata and Thumbnails for Docs by @JSv4 in #75
- Bump djangorestframework-stubs from 1.4.0 to 1.8.0 by @dependabot in #78
- Bump postgres from 15.1 to 15.2 in /compose/production/postgres by @dependabot in #72
- Bump redis from 4.4.0 to 4.5.1 by @dependabot in #71
- Make Document Processing Pipeline More Fault Tolerant by @JSv4 in #79
Full Changelog: v1.0.1...v1.1.0
v1.0.1 - Added API Token Authorization
New Features:
This release adds an API Token Authorization mechanism so you can more easily integrate OpenContracts into backend services and infrastructure.
Chores:
A number of packages have been upgraded. See below.
What's Changed
- Updated codecov badge. by @JSv4 in #10
- Added frontend .env file samples and guidance. by @JSv4 in #11
- Bump actions/checkout from 3.0.2 to 3.1.0 by @dependabot in #5
- Bump crispy-bootstrap5 from 0.6 to 0.7 by @dependabot in #8
- Bump black from 22.6.0 to 22.10.0 by @dependabot in #9
- Bump traefik from v2.8.7 to v2.9.1 in /compose/production/traefik by @dependabot in #2
- Bump mypy from 0.971 to 0.982 by @dependabot in #7
- Bump responses from 0.21.0 to 0.22.0 by @dependabot in #4
- Bump postgres from 14.5 to 15.0 in /compose/production/postgres by @dependabot in #1
- Bump actions/setup-node from 3.4.1 to 3.5.1 by @dependabot in #3
- Update Tests and Remove Configs by @JSv4 in #16
- Bump flake8-isort from 4.1.1 to 5.0.0 by @dependabot in #6
- Bump django-coverage-plugin from 2.0.2 to 2.0.3 by @dependabot in #15
- Bump typing-extensions from 4.3.0 to 4.4.0 by @dependabot in #18
- Bump pytz from 2021.3 to 2022.5 by @dependabot in #17
- Bump coverage from 6.2 to 6.5.0 by @dependabot in #14
- Add Cookie Consent by @JSv4 in #19
- Bump pydantic from 1.9.1 to 1.10.2 by @dependabot in #23
- Bump scikit-learn from 1.1.1 to 1.1.3 by @dependabot in #22
- Bump django-debug-toolbar from 3.2.2 to 3.7.0 by @dependabot in #21
- Bump traefik from v2.9.1 to v2.9.4 in /compose/production/traefik by @dependabot in #20
- Bump pytest-cov from 3.0.0 to 4.0.0 by @dependabot in #26
- Bump django-storages[boto3] from 1.12.3 to 1.13.1 by @dependabot in #24
- Bump celery from 5.2.1 to 5.2.7 by @dependabot in #25
- Add an API Token Auth Mechanism by @JSv4 in #33
- Update Test Env File by @JSv4 in #34
Full Changelog: v1.0.0...v1.0.1
First Public Release
Initial public release, with sample deployments including Gremlin Analyzers.