# Datasets for Visual Information Extraction


## SROIE

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 626 | - | 347 | Receipt | English | Link2 | Entity Extraction | Entity F1-score |

SROIE is a dataset for the 2019 ICDAR Robust Reading Challenge on Scanned Receipts OCR and Information Extraction competition. It contains 973 samples, 626 for training and 347 for testing. Each receipt contains four kinds of key entities: Company, Address, Date, and Total.

Line-level OCR results and texts of key entities are available for each sample. However, it is important to note that the two annotations are not aligned. In order to perform Entity Extraction using token tagging approaches like LayoutLM, it is necessary to have tags for each word. This can be achieved either through rule-based methods or by manually re-labeling the data.
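A minimal sketch of the rule-based option, assuming whitespace-tokenized OCR words and a BIO tagging scheme; the field names and first-match policy here are illustrative assumptions, not part of SROIE's official tooling:

```python
def bio_tags(ocr_words, entities):
    """Assign a BIO tag to each OCR word by exact phrase matching.

    ocr_words: document words in reading order.
    entities: dict mapping a field name (e.g. 'total') to its annotated string.
    Returns one tag per word: 'O', 'B-<field>' or 'I-<field>'.
    """
    tags = ["O"] * len(ocr_words)
    for field, value in entities.items():
        target = value.split()
        n = len(target)
        for i in range(len(ocr_words) - n + 1):
            # Only tag an untagged window that exactly matches the entity text.
            if ocr_words[i:i + n] == target and all(t == "O" for t in tags[i:i + n]):
                tags[i] = f"B-{field}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{field}"
                break  # keep only the first occurrence
    return tags

words = ["ABC", "STORE", "TOTAL", "RM", "9.00"]
print(bio_tags(words, {"company": "ABC STORE", "total": "9.00"}))
# ['B-company', 'I-company', 'O', 'O', 'B-total']
```

Exact matching like this fails whenever the OCR text and the entity annotation disagree (different casing, merged tokens, OCR errors), which is precisely why rule-based tags tend to be noisier than manual re-labeling.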

Indeed, the quality of data annotation plays a crucial role in Entity Extraction performance. In our experiments with ViBERTgrid, training with high-quality annotations (re-labelled manually) yields an entity F1-score above 97, while training with poor-quality annotations (rule-based matching) yields an entity F1-score of around 60.
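For reference, the entity F1-score used throughout this page can be sketched as follows; the exact type-and-string matching convention is an assumption (it matches common SROIE practice, but individual benchmarks may relax it):

```python
from collections import Counter

def entity_f1(predictions, ground_truths):
    """Micro-averaged entity F1-score.

    predictions / ground_truths: per-document lists of (field, text) tuples.
    A prediction counts as a true positive only if both its field type and
    its text exactly match an unconsumed ground-truth entity.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predictions, ground_truths):
        gold_left = Counter(gold)
        for item in pred:
            if gold_left[item] > 0:
                gold_left[item] -= 1
                tp += 1
            else:
                fp += 1
        fn += sum(gold_left.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

pred = [[("total", "9.00"), ("company", "ABC")]]
gold = [[("total", "9.00"), ("company", "ABC STORE")]]
print(round(entity_f1(pred, gold), 2))  # 0.5
```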


## CORD

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 800 | 100 | 100 | Receipt | English | Link, Link2 | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |
| | | | | | | Document Structure Parsing | Structured Field F1-score, TED Acc |

CORD is an English receipt dataset proposed by Clova-AI. 1,000 samples are currently publicly available: 800 for training, 100 for validation, and 100 for testing. The receipt images were captured by cameras, which may introduce interference such as paper bending and background noise. The dataset nevertheless includes high-quality annotations, with key labels for each word and links between entities. It encompasses four main categories of key information, which can be further divided into 30 sub-key fields. Notably, the entities in CORD are hierarchically related, making the task of extracting all the structured fields particularly challenging for models.


## FUNSD

License: Commercial / Research

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 149 | - | 50 | Forms | English | Link, Link2 | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |

A dataset for Text Detection, Optical Character Recognition, Spatial Layout Analysis, and Form Understanding. It consists of 199 fully annotated forms, containing a total of 31,485 words, 9,707 semantic entities, and 5,304 relations. For each text segment and word, the dataset provides the corresponding OCR result. The annotations also include the category of each paragraph and the links between entities.


## XFUND

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 149 × 7 | - | 50 × 7 | Forms | Chinese, Japanese, Spanish, French, Italian, German, Portuguese | Link, Link2 | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |

XFUND is a multilingual form understanding benchmark that includes human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). It is an extension of the FUNSD dataset; its annotations and evaluation metrics are the same as those of FUNSD.


## EPHOIE

License: Access

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1183 | - | 311 | Paper Head | Chinese | Link, Link2 | Entity Extraction | Entity F1-score |

The EPHOIE dataset comprises 1,494 images collected and scanned from real examination papers from different schools in China. The authors cropped the paper-head regions, which contain all the key information. The texts consist of both handwritten and printed Chinese characters, arranged in horizontal and arbitrary quadrilateral shapes. The dataset also includes complex layouts and noisy backgrounds, which make it a demanding test of model generalization. In total, it covers 11 key categories, such as name, class, and student ID. Each character is annotated, so token classification models can be applied directly using the original labels.


## CER-VIR

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2989 | - | 1200 | Receipt | Chinese, English | Link, Link | Structure Parsing | Entity Matching Score |

The CER-VIR dataset contains receipts in both Chinese and English. Each sample contains key information including company, date, total, tax and items. The item field within each sample can be further divided into three subkeys: item name, item count, and item unit price. The task associated with this dataset involves extracting all the key fields from a given sample, including all the subkeys within the item field.

To ensure consistency, the extracted result should be properly formatted. For instance, date entities should be provided in the format of YYYY-MM-DD. The dataset also includes OCR results for reference. Additionally, the annotations of the key entities are provided in formatted string forms, which may differ from the actual content displayed in the image. This aspect of the dataset makes the task significantly more challenging compared to other existing benchmarks in the field of Visual Information Extraction.
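A minimal sketch of the kind of output normalization this implies: predicted date strings are converted to `YYYY-MM-DD` before comparison with the formatted ground truth. The set of accepted input formats below is an assumption for illustration, not part of the CER-VIR specification:

```python
from datetime import datetime

# Illustrative list of source formats a receipt might use; extend as needed.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%b %d, %Y", "%d %b %Y"]

def normalize_date(raw):
    """Return the date in YYYY-MM-DD form, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

print(normalize_date("25/12/2021"))    # 2021-12-25
print(normalize_date("Dec 25, 2021"))  # 2021-12-25
```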


## SIBR

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 600 | - | 400 | Receipts, Bills | Chinese, English | Link | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |
SIBR contains 1,000 images, including 600 Chinese invoices, 300 English bills of entry, and 100 bilingual receipts. It is well annotated, with 71,227 entity-level boxes and 39,004 links. In comparison to other real-scene datasets such as SROIE and EPHOIE, SIBR offers a wider range of appearances and more diverse structures.

The document images in SIBR pose additional challenges because they are sourced from real-world applications: severe noise, uneven illumination, image deformation, printing shift, and complicated links. Like FUNSD, SIBR contains three kinds of key information: question, answer, and header. Note that a multi-line entity in SIBR is represented as text segments connected by intra-links; models are required to recover the full entity given only the text-segment annotations.


## EATEN

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 271440 | 30160 | 400 | Train Ticket | Chinese | Link | Entity Extraction | Mean Entity Accuracy |
| 88200 | 9800 | 2000 | Passport | | | | Mean Entity Accuracy |
| 178200 | 19800 | 2000 | Business Card | | | | Entity F1-score |

The EATEN dataset covers three scenarios: Train Ticket, Passport, and Business Card.

The train ticket subset includes a total of 2k real images and 300k synthetic images. The real images were shot in a finance department under inconsistent lighting, with varying orientations, background noise, and imaging distortions. Train tickets contain 8 key categories.

The passport subset includes a total of 100k synthetic images with 7 key categories.

The business card subset contains 200k synthetic images with 10 key categories. The positions of the key entities are not fixed and some entities may be absent, which makes Visual Information Extraction challenging.

The Mean Entity Accuracy is calculated as $$ mEA = \frac{1}{I}\sum_{i=0}^{I-1}\mathbb{I}(y^i = g^i) $$ where $y^i$ denotes the prediction of the $i$-th field, $g^i$ denotes the corresponding ground truth, $I$ denotes the number of entities, and $\mathbb{I}$ is the indicator function that returns 1 if $y^i = g^i$ and 0 otherwise.
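A direct implementation of the formula above; comparing fields by exact string equality is an assumption about how predictions and ground truths are represented:

```python
def mean_entity_accuracy(predictions, ground_truths):
    """mEA: fraction of fields whose prediction exactly equals the ground truth.

    predictions / ground_truths: equal-length lists of field strings.
    """
    assert len(predictions) == len(ground_truths)
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

pred = ["2021-06-01", "Beijing", "G1234"]
gold = ["2021-06-01", "Shanghai", "G1234"]
print(round(mean_entity_accuracy(pred, gold), 3))  # 0.667 (2 of 3 fields correct)
```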


## WildReceipt

License: Commercial / Adapt / Share

The WildReceipt dataset is introduced by the mmocr repository, which follows the Apache License 2.0.

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1740 | - | 472 | Receipt | English | Link | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Node F1-score & Edge F1-score |

The WildReceipt dataset has two versions: CloseSet and OpenSet.

The CloseSet divides text boxes into 26 categories. There are 12 key-value pairs of fine-grained key information categories, such as (Prod_item_value, Prod_item_key), (Prod_price_value, Prod_price_key), and (Tax_value, Tax_key), plus two more "do not care" categories: Ignore and Others. The objective of the CloseSet is to apply Entity Extraction.

The OpenSet has only 4 possible categories: background, key, value, and others. The connectivity between nodes is annotated with edge labels: if a key node and a value node share the same edge label, they are connected by a valid edge. The objective of the OpenSet is to extract key-value pairs from the given sample.
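That pair-extraction step can be sketched as grouping key and value nodes by edge label; the node dictionary layout (`text`, `category`, `edge`) is a hypothetical format for illustration, not WildReceipt's actual annotation schema:

```python
from collections import defaultdict

def extract_pairs(nodes):
    """Group 'key' and 'value' nodes by shared edge label and emit pairs.

    nodes: list of dicts with 'text', 'category', and 'edge' fields.
    Nodes in categories other than 'key'/'value' are ignored.
    """
    groups = defaultdict(lambda: {"key": [], "value": []})
    for node in nodes:
        if node["category"] in ("key", "value"):
            groups[node["edge"]][node["category"]].append(node["text"])
    pairs = []
    for group in groups.values():
        for k in group["key"]:
            for v in group["value"]:
                pairs.append((k, v))
    return pairs

nodes = [
    {"text": "Total:", "category": "key", "edge": 1},
    {"text": "$9.00", "category": "value", "edge": 1},
    {"text": "Thank you", "category": "others", "edge": 0},
]
print(extract_pairs(nodes))  # [('Total:', '$9.00')]
```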


## Kleister

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 254 | 83 | 203 | Contracts (NDA) | English | Link | Entity Extraction | Entity F1-score |
| 1729 | 440 | 609 | Financial Reports (Charity) | English | Link | Entity Extraction | Entity F1-score |

The Kleister dataset contains two subsets: NDA and Charity.

The goal of the NDA task is to extract key information from NDAs (Non-Disclosure Agreements) about the involved parties, jurisdiction, contract term, and effective date. It contains 540 documents with 3229 pages.

The goal of the Charity task is to retrieve 8 kinds of key information including charity address (but not other addresses), charity number, charity name and its annual income and spending in GBP (British Pounds) in PDF files published by British charities. It contains 2788 financial reports with 61643 pages in total.


## VRDU

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10/50/100/200 | - | 300 | Registration Forms | English | Link | Entity Extraction & Document Parsing | Type-Aware Matching F1-score |
| 10/50/100/200 | - | 300 | Political Advertisements | English | | | |

This benchmark includes two datasets: Ad-buy Forms and Registration Forms. The documents consist of structured data with a comprehensive schema, including nested repeated fields. They have complex layouts that clearly distinguish them from long text documents, and they cover a variety of templates. Additionally, the OCR results are of high quality. The authors provide token-level annotations for the ground truth, so there is no ambiguity when mapping the annotations to the input text.

The Registration Forms subset contains 6 types of key fields: file_date, foreign_principal_name, registrant_name, registration_ID, signer_name, and signer_title. The Ad-buy Forms subset contains 9 key fields, including advertiser, agency, contract_ID, flight_start_date, flight_end_date, gross_amount, product, TV_address, and property. Furthermore, nested line_item fields (description, start_date, end_date, sub_price) are also annotated in the Ad-buy Forms subset.

### About the Type-Aware Matching F1-score

It is common practice to compare the extracted entity with the ground truth using strict string matching. However, such a simple approach may lead to unreasonable results in many scenarios. For example, "$ 40,000" does not match "40,000" because of the missing dollar sign when extracting the total price from a receipt, and "July 1, 2022" does not match "07/01/2022". Dates may appear in different formats in different parts of the document, and a model should not be arbitrarily penalized for picking the wrong instance. The VRDU authors therefore implement a different matching function for each entity name, based on the type associated with that entity: the evaluation scripts convert all price values into a numeric type before comparison, and date strings are parsed so that a standard date-equality function can determine equality.
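The idea can be sketched as follows; the matcher table, field names, and accepted date formats are illustrative assumptions, not VRDU's actual evaluation code:

```python
import re
from datetime import datetime

def match_price(pred, gold):
    """Compare prices numerically, ignoring currency symbols and separators."""
    def as_number(s):
        return float(re.sub(r"[^\d.]", "", s))
    return as_number(pred) == as_number(gold)

def parse_date(s):
    """Try a few common date formats; return a date or None."""
    for fmt in ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(s.strip(), fmt).date()
        except ValueError:
            continue
    return None

def match_date(pred, gold):
    p, g = parse_date(pred), parse_date(gold)
    return p is not None and p == g

# Per-field matcher table (field names are illustrative).
MATCHERS = {"gross_amount": match_price, "flight_start_date": match_date}

def type_aware_match(field, pred, gold):
    """Fall back to strict string equality for untyped fields."""
    return MATCHERS.get(field, lambda p, g: p == g)(pred, gold)

print(type_aware_match("gross_amount", "$ 40,000", "40,000"))               # True
print(type_aware_match("flight_start_date", "July 1, 2022", "07/01/2022"))  # True
```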


## POIE

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2250 | - | 750 | Product Nutrition Tables | English | Link | Entity Extraction | Entity F1-score |

The images in POIE contain Nutrition Facts labels from various real-world commodities. Compared with existing datasets, they show larger variance in layout, severe distortion, noisier backgrounds, and more entity types. POIE includes images with variable appearances and styles (structured, semi-structured, and unstructured), complex layouts, and backgrounds distorted by folds, bends, deformations, and perspective. POIE defines 21 entity types, and a few entities appear in multiple forms, which is common in the wild and quite challenging for VIE. In addition, entities often contain multiple words, and each entity appears at most once per image. These properties can help improve the robustness and generalization of VIE models for more challenging applications.