Releases: Unstructured-IO/unstructured
0.4.13
0.4.12
0.4.11
- Adds `partition_doc` for partitioning Word documents in `.doc` format. Requires `libreoffice`.
- Adds `partition_ppt` for partitioning PowerPoint documents in `.ppt` format. Requires `libreoffice`.
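
Because `.doc` and `.ppt` partitioning depends on LibreOffice being installed, a caller may want to check for it before partitioning. The following is a minimal sketch and not part of the library: `find_libreoffice` is a hypothetical helper, and the executable names `soffice` and `libreoffice` are assumptions about how LibreOffice is exposed on the `PATH`.

```python
import shutil
from typing import Optional

# Assumed executable names; adjust for your platform or install location.
CANDIDATE_NAMES = ("soffice", "libreoffice")


def find_libreoffice() -> Optional[str]:
    """Return the path of the first LibreOffice executable found on PATH, or None."""
    for name in CANDIDATE_NAMES:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    if location is None:
        print("LibreOffice not found; .doc and .ppt partitioning will likely fail.")
    else:
        print(f"LibreOffice found at {location}")
```

Checking up front gives a clearer error than a failed conversion deep inside a partitioning call.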
0.4.10
- Fixes `ElementMetadata` so that it's JSON serializable when the filename is a `Path` object.
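
To illustrate the kind of problem this fix addresses: `json.dumps` cannot encode `pathlib.Path` values, so a metadata object that stores one has to convert it to a string first. The sketch below uses a hypothetical `SketchMetadata` dataclass as a stand-in; it is not the library's `ElementMetadata` and may differ from how the fix was actually implemented.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for a metadata object; not the library's class."""

    filename: Optional[Union[str, Path]] = None

    def __post_init__(self) -> None:
        # json.dumps cannot encode Path objects, so normalise to str up front.
        if isinstance(self.filename, Path):
            self.filename = str(self.filename)

    def to_dict(self) -> dict:
        return asdict(self)


metadata = SketchMetadata(filename=Path("q1.docx"))
serialized = json.dumps(metadata.to_dict())
print(serialized)  # {"filename": "q1.docx"}
```

Converting at construction time keeps every later serialization path simple, at the cost of no longer holding the original `Path` object.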
0.4.9
- Added ingest modules and s3 connector
- Default to `url=None` for `partition_pdf` and `partition_image`
- Add ability to skip English specific check by setting the `UNSTRUCTURED_LANGUAGE` env var to `""`.
- Document `Element` objects now track metadata
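
As a rough illustration of how an environment variable can gate a language-specific check, here is a minimal sketch. Everything except the `UNSTRUCTURED_LANGUAGE` variable name is hypothetical: the `ENGLISH_WORDS` set, both helper functions, and the assumption that an unset variable behaves like `"en"` are illustrative choices, not the library's implementation.

```python
import os

# Tiny illustrative vocabulary; a real check would use a much larger word list.
ENGLISH_WORDS = {"the", "a", "is", "report", "annual", "parties", "agree"}


def has_known_english_word(text: str) -> bool:
    """True if any whitespace-separated token is in the illustrative word set."""
    return any(word.strip(".,;:!?").lower() in ENGLISH_WORDS for word in text.split())


def passes_language_check(text: str) -> bool:
    """Apply the English-word check unless UNSTRUCTURED_LANGUAGE is set to ""."""
    # Assumption: an unset variable behaves like "en".
    language = os.environ.get("UNSTRUCTURED_LANGUAGE", "en")
    if language == "":
        return True  # English-specific check skipped
    return has_known_english_word(text)
```

With the variable set to `""`, non-English text is no longer rejected by the English-word check in this sketch.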
0.4.8
- Modified XML and HTML parsers not to load comments.
0.4.7
- Added the ability to pull an HTML document from a url in `partition_html`.
- Added the ability to get file summary info from lists of filenames and lists of file contents.
- Added optional page break to `partition` for `.pptx`, `.pdf`, images, and `.html` files.
- Added `to_dict` method to document elements.
- Include more unicode quotes in `replace_unicode_quotes`.
0.4.6
- Loosen the default cap threshold to `0.5`.
- Add a `UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD` environment variable for controlling the cap ratio threshold.
- Unknown text elements are identified as `Text` for HTML and plain text documents.
- `Body Text` styles no longer default to `NarrativeText` for Word documents. The style information is insufficient to determine that the text is narrative.
- Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
- Adds an `Address` element for capturing elements that only contain an address.
- Suppress the `UserWarning` when detectron is called.
- Checks that titles and narrative text have at least one English word.
- Checks that titles and narrative text are at least 50% alpha characters.
- Restricts titles to a maximum word length. Adds a `UNSTRUCTURED_TITLE_MAX_WORD_LENGTH` environment variable for controlling the max number of words in a title.
- Updated `partition_pptx` to order the elements on the page
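
The cap ratio threshold can be pictured with a small sketch. It assumes that "cap ratio" means the fraction of words that start with an uppercase letter, and that text above the threshold is treated as something other than narrative prose; both are assumptions, and `cap_ratio` and `exceeds_cap_threshold` are hypothetical names rather than the library's functions. Only the environment variable name and the `0.5` default come from the notes above.

```python
import os

DEFAULT_CAP_THRESHOLD = 0.5  # default stated in the release notes


def cap_ratio(text: str) -> float:
    """Fraction of alphabetic words that start with an uppercase letter."""
    words = [word for word in text.split() if word[0].isalpha()]
    if not words:
        return 0.0  # empty or non-alphabetic text: avoid dividing by zero
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words)


def exceeds_cap_threshold(text: str) -> bool:
    """True when the share of capitalized words is above the configured threshold."""
    threshold = float(
        os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", DEFAULT_CAP_THRESHOLD)
    )
    return cap_ratio(text) > threshold
```

Under these assumptions, a heading such as "Annual Report Of The Board" has a ratio of 1.0 and exceeds the default threshold, while an ordinary sentence with one capitalized word does not.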
0.4.4
- Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects
- Fixed the healthcheck url path when partitioning images and PDFs via API
- Adds an optional `coordinates` attribute to document objects
- Adds `FigureCaption` and `CheckBox` document elements
- Added ability to split lists detected in `LayoutElement` objects
- Adds `partition_pptx` for partitioning PowerPoint documents
- LayoutParser models now download from Hugging Face Hub instead of DropBox
- Fixed file type detection for XML and HTML files on Amazon Linux
0.4.3
- Adds `requests` as a base dependency
- Fix in `exceeds_cap_ratio` so the function doesn't break with empty text
- Fix bug in `_parse_received_data`.
- Update `detect_filetype` to properly handle `.doc`, `.xls`, and `.ppt`.