# Datasets for Visual Information Extraction


## SROIE

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 626 | - | 347 | Receipt | English | Link2 | Entity Extraction | Entity F1-score |

SROIE is a dataset for the 2019 ICDAR Robust Reading Challenge on Scanned Receipts OCR and Information Extraction competition. It contains 973 samples, 626 for training and 347 for testing. Each receipt contains four kinds of key entities: Company, Address, Date, and Total.

Line-level OCR results and texts of key entities are available for each sample. However, it is important to note that the two annotations are not aligned. In order to perform Entity Extraction using token tagging approaches like LayoutLM, it is necessary to have tags for each word. This can be achieved either through rule-based methods or by manually re-labeling the data.
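A minimal sketch of the rule-based option, assuming whitespace-tokenized OCR words and a BIO tagging scheme; the field names and first-match policy here are illustrative assumptions, not part of SROIE's official tooling:

```python
def bio_tags(ocr_words, entities):
    """Assign a BIO tag to each OCR word by exact phrase matching.

    ocr_words: document words in reading order.
    entities: dict mapping a field name (e.g. 'total') to its annotated string.
    Returns one tag per word: 'O', 'B-<field>' or 'I-<field>'.
    """
    tags = ["O"] * len(ocr_words)
    for field, value in entities.items():
        target = value.split()
        n = len(target)
        for i in range(len(ocr_words) - n + 1):
            # Only tag an untagged window that exactly matches the entity text.
            if ocr_words[i:i + n] == target and all(t == "O" for t in tags[i:i + n]):
                tags[i] = f"B-{field}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{field}"
                break  # keep only the first occurrence
    return tags

words = ["ABC", "STORE", "TOTAL", "RM", "9.00"]
print(bio_tags(words, {"company": "ABC STORE", "total": "9.00"}))
# ['B-company', 'I-company', 'O', 'O', 'B-total']
```

Exact matching like this fails whenever the OCR text and the entity annotation disagree (different casing, merged tokens, OCR errors), which is precisely why rule-based tags tend to be noisier than manual re-labeling.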

Indeed, the quality of data annotation plays a crucial role in Entity Extraction performance. In our experiments with ViBERTgrid, training with high-quality annotations (re-labelled manually) yields an entity F1-score above 97, while training with poor-quality annotations (rule-based matching) yields an entity F1-score of around 60.
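For reference, the entity F1-score used throughout this page can be sketched as follows; the exact type-and-string matching convention is an assumption (it matches common SROIE practice, but individual benchmarks may relax it):

```python
from collections import Counter

def entity_f1(predictions, ground_truths):
    """Micro-averaged entity F1-score.

    predictions / ground_truths: per-document lists of (field, text) tuples.
    A prediction counts as a true positive only if both its field type and
    its text exactly match an unconsumed ground-truth entity.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predictions, ground_truths):
        gold_left = Counter(gold)
        for item in pred:
            if gold_left[item] > 0:
                gold_left[item] -= 1
                tp += 1
            else:
                fp += 1
        fn += sum(gold_left.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

pred = [[("total", "9.00"), ("company", "ABC")]]
gold = [[("total", "9.00"), ("company", "ABC STORE")]]
print(round(entity_f1(pred, gold), 2))  # 0.5
```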


## CORD

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 800 | 100 | 100 | Receipt | English | Link, Link2 | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |
| | | | | | | Document Structure Parsing | Structured Field F1-score, TED Acc |

CORD is an English receipt dataset proposed by Clova-AI. 1,000 samples are currently publicly available: 800 for training, 100 for validation, and 100 for testing. The receipt images were captured by cameras, which may introduce interference such as paper bending and background noise. The dataset nevertheless includes high-quality annotations, with key labels for each word and links between entities. It encompasses four main categories of key information, which can be further divided into 30 sub-key fields. Notably, the entities in CORD are hierarchically related, making the task of extracting all the structured fields particularly challenging for models.


## FUNSD

License: Commercial / Research

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 149 | - | 50 | Forms | English | Link, Link2 | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |

A dataset for Text Detection, Optical Character Recognition, Spatial Layout Analysis, and Form Understanding. It consists of 199 fully annotated forms, containing a total of 31,485 words, 9,707 semantic entities, and 5,304 relations. For each text segment and word, the dataset provides the corresponding OCR result. The annotations also include the category of each paragraph and the links between entities.


## XFUND

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 149 × 7 | - | 50 × 7 | Forms | Chinese, Japanese, Spanish, French, Italian, German, Portuguese | Link, Link2 | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |

XFUND is a multilingual form understanding benchmark that includes human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). It is an extension of the FUNSD dataset; its annotations and evaluation metrics are the same as those of FUNSD.


## EPHOIE

License: Access

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1183 | - | 311 | Paper Head | Chinese | Link, Link2 | Entity Extraction | Entity F1-score |

The EPHOIE dataset comprises 1,494 images collected and scanned from real examination papers from different schools in China. The authors cropped the paper-head regions, which contain all the key information. The texts consist of both handwritten and printed Chinese characters, arranged in horizontal and arbitrary quadrilateral shapes. The dataset also includes complex layouts and noisy backgrounds, which make it a demanding test of model generalization. In total, it covers 11 key categories, such as name, class, and student ID. Each character is annotated, so token classification models can be applied directly using the original labels.


## CER-VIR

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2989 | - | 1200 | Receipt | Chinese, English | Link, Link | Structure Parsing | Entity Matching Score |

The CER-VIR dataset contains receipts in both Chinese and English. Each sample contains key information including company, date, total, tax and items. The item field within each sample can be further divided into three subkeys: item name, item count, and item unit price. The task associated with this dataset involves extracting all the key fields from a given sample, including all the subkeys within the item field.

To ensure consistency, the extracted result should be properly formatted. For instance, date entities should be provided in the format of YYYY-MM-DD. The dataset also includes OCR results for reference. Additionally, the annotations of the key entities are provided in formatted string forms, which may differ from the actual content displayed in the image. This aspect of the dataset makes the task significantly more challenging compared to other existing benchmarks in the field of Visual Information Extraction.
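A minimal sketch of the kind of output normalization this implies: predicted date strings are converted to `YYYY-MM-DD` before comparison with the formatted ground truth. The set of accepted input formats below is an assumption for illustration, not part of the CER-VIR specification:

```python
from datetime import datetime

# Illustrative list of source formats a receipt might use; extend as needed.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%b %d, %Y", "%d %b %Y"]

def normalize_date(raw):
    """Return the date in YYYY-MM-DD form, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

print(normalize_date("25/12/2021"))    # 2021-12-25
print(normalize_date("Dec 25, 2021"))  # 2021-12-25
```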


## SIBR

License: Commercial / Adapt / Share

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 600 | - | 400 | Receipts, Bills | Chinese, English | Link | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |
SIBR contains 1,000 images, including 600 Chinese invoices, 300 English bills of entry, and 100 bilingual receipts. It is well annotated, with 71,227 entity-level boxes and 39,004 links. In comparison to other real-scene datasets such as SROIE and EPHOIE, SIBR offers a wider range of appearances and more diverse structures.

The document images in SIBR pose additional challenges because they are sourced from real-world applications: severe noise, uneven illumination, image deformation, printing shift, and complicated links. Like FUNSD, SIBR contains three kinds of key information: question, answer, and header. Note that a multi-line entity in SIBR is represented as text segments connected by intra-links; models are required to recover the full entity given only the text-segment annotations.


## EATEN

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 271440 | 30160 | 400 | Train Ticket | Chinese | Link | Entity Extraction | Mean Entity Accuracy |
| 88200 | 9800 | 2000 | Passport | | | | Mean Entity Accuracy |
| 178200 | 19800 | 2000 | Business Card | | | | Entity F1-score |

The EATEN dataset covers three scenarios: Train Ticket, Passport, and Business Card.

The train ticket subset includes a total of 2k real images and 300k synthetic images. The real images were shot in a finance department under inconsistent lighting, with varying orientations, background noise, and imaging distortions. Train tickets contain 8 key categories.

The passport subset includes a total of 100k synthetic images with 7 key categories.

The business card subset contains 200k synthetic images with 10 key categories. The positions of the key entities are not fixed and some entities may be absent, which makes Visual Information Extraction challenging.

The Mean Entity Accuracy is calculated as $$ mEA = \frac{1}{I}\sum_{i=0}^{I-1}\mathbb{I}(y^i = g^i) $$ where $y^i$ denotes the prediction of the $i$-th field, $g^i$ denotes the corresponding ground truth, $I$ denotes the number of entities, and $\mathbb{I}$ is the indicator function that returns 1 if $y^i = g^i$ and 0 otherwise.
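A direct implementation of the formula above; comparing fields by exact string equality is an assumption about how predictions and ground truths are represented:

```python
def mean_entity_accuracy(predictions, ground_truths):
    """mEA: fraction of fields whose prediction exactly equals the ground truth.

    predictions / ground_truths: equal-length lists of field strings.
    """
    assert len(predictions) == len(ground_truths)
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

pred = ["2021-06-01", "Beijing", "G1234"]
gold = ["2021-06-01", "Shanghai", "G1234"]
print(round(mean_entity_accuracy(pred, gold), 3))  # 0.667 (2 of 3 fields correct)
```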


## WildReceipt

License: Commercial / Adapt / Share

The WildReceipt dataset is introduced by the mmocr repository, which follows the Apache License 2.0.

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1740 | - | 472 | Receipt | English | Link | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Node F1-score & Edge F1-score |

The WildReceipt dataset has two versions: CloseSet and OpenSet.

The CloseSet divides text boxes into 26 categories. There are 12 key-value pairs of fine-grained key information categories, such as (Prod_item_value, Prod_item_key), (Prod_price_value, Prod_price_key), and (Tax_value, Tax_key), plus two more "do not care" categories: Ignore and Others. The objective of the CloseSet is to apply Entity Extraction.

The OpenSet has only 4 possible categories: background, key, value, and others. The connectivity between nodes is annotated with edge labels: if a key node and a value node share the same edge label, they are connected by a valid edge. The objective of the OpenSet is to extract key-value pairs from the given sample.
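That pair-extraction step can be sketched as grouping key and value nodes by edge label; the node dictionary layout (`text`, `category`, `edge`) is a hypothetical format for illustration, not WildReceipt's actual annotation schema:

```python
from collections import defaultdict

def extract_pairs(nodes):
    """Group 'key' and 'value' nodes by shared edge label and emit pairs.

    nodes: list of dicts with 'text', 'category', and 'edge' fields.
    Nodes in categories other than 'key'/'value' are ignored.
    """
    groups = defaultdict(lambda: {"key": [], "value": []})
    for node in nodes:
        if node["category"] in ("key", "value"):
            groups[node["edge"]][node["category"]].append(node["text"])
    pairs = []
    for group in groups.values():
        for k in group["key"]:
            for v in group["value"]:
                pairs.append((k, v))
    return pairs

nodes = [
    {"text": "Total:", "category": "key", "edge": 1},
    {"text": "$9.00", "category": "value", "edge": 1},
    {"text": "Thank you", "category": "others", "edge": 0},
]
print(extract_pairs(nodes))  # [('Total:', '$9.00')]
```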


## Kleister

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 254 | 83 | 203 | Contracts (NDA) | English | Link | Entity Extraction | Entity F1-score |
| 1729 | 440 | 609 | Financial Reports (Charity) | English | Link | Entity Extraction | Entity F1-score |

The Kleister dataset contains two subsets: NDA and Charity.

The goal of the NDA task is to extract key information from NDAs (Non-Disclosure Agreements) about the involved parties, jurisdiction, contract term, and effective date. It contains 540 documents with 3229 pages.

The goal of the Charity task is to retrieve 8 kinds of key information including charity address (but not other addresses), charity number, charity name and its annual income and spending in GBP (British Pounds) in PDF files published by British charities. It contains 2788 financial reports with 61643 pages in total.


## VRDU

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10/50/100/200 | - | 300 | Registration Forms | English | Link | Entity Extraction & Document Parsing | Type-Aware Matching F1-score |
| 10/50/100/200 | - | 300 | Political Advertisements | English | | | |

This benchmark includes two datasets: Ad-buy Forms and Registration Forms. The documents consist of structured data with a comprehensive schema, including nested repeated fields. They have complex layouts that clearly distinguish them from long text documents, and they cover a variety of templates. Additionally, the OCR results are of high quality. The authors provide token-level annotations for the ground truth, so there is no ambiguity when mapping the annotations to the input text.

The Registration Forms subset contains 6 types of key fields: file_date, foreign_principal_name, registrant_name, registration_ID, signer_name, and signer_title. The Ad-buy Forms subset contains 9 key fields, including advertiser, agency, contract_ID, flight_start_date, flight_end_date, gross_amount, product, TV_address, and property. Furthermore, nested line_item fields (description, start_date, end_date, sub_price) are also annotated in the Ad-buy Forms subset.

### About the Type-Aware Matching F1-score

It is common practice to compare the extracted entity with the ground truth using strict string matching. However, such a simple approach may lead to unreasonable results in many scenarios. For example, "$ 40,000" does not match "40,000" because of the missing dollar sign when extracting the total price from a receipt, and "July 1, 2022" does not match "07/01/2022". Dates may appear in different formats in different parts of the document, and a model should not be arbitrarily penalized for picking the wrong instance. The VRDU authors therefore implement a different matching function for each entity name, based on the type associated with that entity: the evaluation scripts convert all price values into a numeric type before comparison, and date strings are parsed so that a standard date-equality function can determine equality.
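The idea can be sketched as follows; the matcher table, field names, and accepted date formats are illustrative assumptions, not VRDU's actual evaluation code:

```python
import re
from datetime import datetime

def match_price(pred, gold):
    """Compare prices numerically, ignoring currency symbols and separators."""
    def as_number(s):
        return float(re.sub(r"[^\d.]", "", s))
    return as_number(pred) == as_number(gold)

def parse_date(s):
    """Try a few common date formats; return a date or None."""
    for fmt in ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(s.strip(), fmt).date()
        except ValueError:
            continue
    return None

def match_date(pred, gold):
    p, g = parse_date(pred), parse_date(gold)
    return p is not None and p == g

# Per-field matcher table (field names are illustrative).
MATCHERS = {"gross_amount": match_price, "flight_start_date": match_date}

def type_aware_match(field, pred, gold):
    """Fall back to strict string equality for untyped fields."""
    return MATCHERS.get(field, lambda p, g: p == g)(pred, gold)

print(type_aware_match("gross_amount", "$ 40,000", "40,000"))               # True
print(type_aware_match("flight_start_date", "July 1, 2022", "07/01/2022"))  # True
```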


## POIE

| Train | Validate | Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2250 | - | 750 | Product Nutrition Tables | English | Link | Entity Extraction | Entity F1-score |

The images in POIE contain Nutrition Facts labels from various real-world commodities. Compared with existing datasets, they show larger variance in layout, severe distortion, noisier backgrounds, and more entity types. POIE includes images with variable appearances and styles (structured, semi-structured, and unstructured), complex layouts, and backgrounds distorted by folds, bends, deformations, and perspective. POIE defines 21 entity types, and a few entities appear in multiple forms, which is common in the wild and quite challenging for VIE. In addition, entities often contain multiple words, and each entity appears at most once per image. These properties can help improve the robustness and generalization of VIE models for more challenging applications.