Table of Contents Detection Module

Purpose

Detects and extracts tables of contents from a PDF file.

Given a PDF, it generates a TableOfContent element containing an array of TableOfContentItem elements, each holding the info for one item.

Searches for keywords and specific paragraph formats, then extracts the info from each detected paragraph with regular expressions.
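The format-matching step can be illustrated with a minimal sketch. It is not the module's actual implementation: the TocItemSketch shape, the "title, optional dot leaders, trailing page number" format, and the minShare threshold are all assumptions made for illustration.

```typescript
// Hypothetical item shape, for illustration only; the real
// TableOfContentItem may carry different fields.
interface TocItemSketch {
  title: string;
  page: number;
}

// Assumed paragraph format: a title, then either dot leaders ("....")
// or plain whitespace, then a trailing page number.
const TRAILING_PAGE = /^(.+?)(?:\s*\.{2,}\s*|\s+)(\d+)\s*$/;

// Returns the parsed item, or null when the paragraph does not fit the format.
function parseItem(paragraph: string): TocItemSketch | null {
  const match = TRAILING_PAGE.exec(paragraph.trim());
  if (!match) return null;
  return { title: match[1], page: parseInt(match[2], 10) };
}

// Treats a group of paragraphs as a TOC only if a large enough share of
// them look like TOC items (minShare is an assumed heuristic).
function extractToc(paragraphs: string[], minShare = 0.5): TocItemSketch[] {
  const items = paragraphs
    .map((p) => parseItem(p))
    .filter((item): item is TocItemSketch => item !== null);
  return items.length >= paragraphs.length * minShare ? items : [];
}
```

A call such as extractToc(["Contents", "1. Introduction ..... 3", "2. Methods 15"]) would yield two items under these assumptions. The share threshold is there because a single sentence that happens to end in a number would otherwise be mistaken for a TOC entry.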

The following is an example configuration for the table-of-contents-detection module:

[
  "table-of-contents-detection",
  {
    "pageKeywords": [
      "pag",
      "pagina",
      "page"
    ]
  }
]

pageKeywords: Optional. Array of strings meaning "page" that may precede a page number, used to search for TOC items with the format "page X - Section A". Defaults to "pag".
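As a rough sketch of how pageKeywords could drive matching of the "page X - Section A" format, the keywords can be combined into one case-insensitive regular expression. The names TocEntry, buildKeywordPattern and parseTocLine are hypothetical, and the accepted separators ("-" or ":") are an assumption; the module's own pattern may differ.

```typescript
// Hypothetical result shape, for illustration only.
interface TocEntry {
  page: number;
  title: string;
}

// Escape regex metacharacters so each keyword is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a pattern for lines such as "page 12 - Section A".
// The default mirrors the documented default keyword "pag".
function buildKeywordPattern(pageKeywords: string[] = ["pag"]): RegExp {
  // Longest keyword first, so "pagina" is tried before its prefix "pag".
  const alternatives = [...pageKeywords]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  return new RegExp(
    `^\\s*(?:${alternatives})\\.?\\s+(\\d+)\\s*[-:]\\s*(.+?)\\s*$`,
    "i"
  );
}

// Returns the page number and title, or null if the line does not match.
function parseTocLine(line: string, pageKeywords?: string[]): TocEntry | null {
  const match = buildKeywordPattern(pageKeywords).exec(line);
  if (!match) return null;
  return { page: parseInt(match[1], 10), title: match[2] };
}
```

With the keywords from the example configuration, parseTocLine("page 12 - Introduction", ["pag", "pagina", "page"]) would return page 12 with the title "Introduction". With the default keyword only, the same line would not match, since "page" is not followed by whitespace right after "pag".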

The accuracy is high on one-column documents.