programmers_guide_eng

Bogatenkova Anastasiya edited this page Dec 17, 2020 · 3 revisions

General scheme of pipeline

  1. User sends file and additional parameters via POST request.
  2. The API module saves the file in a temporary directory and calls the manager (the API code is in dedoc/api/dedoc_api.py).
  3. The manager renames the file, keeping its extension. It is important that the file name contains no spaces, non-ASCII symbols, injections, or other unnecessary content. After that the manager tries to convert the file with FileConverter. The manager's code is in dedoc/manager/dedoc_manager.py.
  4. FileConverter checks if it can convert file with the specified extension. If it's able to do that, then it performs the procedure and returns the name of the converted file. Otherwise it returns the input file name. Code is here dedoc_project/converters/file_converter.py.
  5. After conversion, DocParser extracts information from the file. It returns a pair: the UnstructuredDocument and a flag indicating whether the file can contain attachments. The code is in dedoc/readers/doc_parser.py.
  6. StructureConstructor builds the structured document. It takes an UnstructuredDocument as input and returns a DocumentContent. An example can be found in dedoc/structure_constructor/tree_constructor.py.
  7. MetadataExtractor enriches the document with metadata. Its code can be found in dedoc/metadata_extractor/basic_metadata_extractor.py.
  8*. (optional) Attachments are extracted and analyzed. This procedure is performed by the manager (each attachment file goes through stages 2 to 8).
  9. The user gets the result as a response.
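The manager's part of the scheme (steps 3 to 7) can be sketched roughly as follows. This is a minimal illustration, not the code from dedoc/manager/dedoc_manager.py: every function name here (`convert`, `parse`, `build_structure`, `add_metadata`, `handle_file`) is hypothetical, and using a random hex string for the new file name is an assumption.

```python
import os
import uuid


def convert(path: str) -> str:
    """Stand-in for FileConverter: returns the input path when nothing is converted."""
    return path


def parse(path: str) -> tuple:
    """Stand-in for DocParser: returns (unstructured document, can-have-attachments flag)."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return lines, False


def build_structure(lines: list) -> dict:
    """Stand-in for StructureConstructor."""
    return {"structure": lines}


def add_metadata(content: dict, original_name: str) -> dict:
    """Stand-in for MetadataExtractor."""
    return {"content": content, "metadata": {"file_name": original_name}}


def handle_file(tmp_dir: str, original_name: str) -> dict:
    # Step 3: rename the file, keeping only the extension of the original name.
    _, extension = os.path.splitext(original_name)
    safe_name = uuid.uuid4().hex + extension
    safe_path = os.path.join(tmp_dir, safe_name)
    os.rename(os.path.join(tmp_dir, original_name), safe_path)

    # Steps 4 to 7: convert, read, build the structure, add metadata.
    converted_path = convert(safe_path)
    lines, _can_have_attachments = parse(converted_path)
    content = build_structure(lines)
    return add_metadata(content, original_name)
```

The original name is kept only in the metadata, so nothing later in the pipeline has to handle spaces or unusual characters in a path.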

Every step in detail:

API

The API is responsible for processing requests and sending responses back. It also contains helper functions (e.g. for handling online docs, displaying the logo, etc.). The code is in dedoc/api/dedoc_api.py.

Manager

The manager performs the major part of the work but, as often happens, it does so by delegating tasks to its subordinates. The manager is responsible for all pipeline stages except receiving the request and sending the response back. It can process a file from a request as well as a file from the local file system. The manager is configured with a special config file (stored in dedoc/manager_config.py). The code is in dedoc/manager/dedoc_manager.py.

FileConverter

FileConverter tries to convert the file using its list of basic converters. It 'asks' every converter whether it can process a file with this particular extension; if one can, FileConverter returns the new name of the converted file. If none of the listed converters can perform the operation, FileConverter simply returns the original file name.
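The 'ask every converter' logic can be sketched as below. The class and method names (`BaseConverter`, `can_convert`, `do_convert`, `ToyConverter`) and the `.old`/`.new` extensions are assumptions made for illustration; the real interface may differ.

```python
import os
from typing import List


class BaseConverter:
    """Hypothetical converter interface."""

    def can_convert(self, extension: str) -> bool:
        raise NotImplementedError

    def do_convert(self, tmp_dir: str, name: str, extension: str) -> str:
        raise NotImplementedError


class ToyConverter(BaseConverter):
    """Toy converter that claims '.old' files and reports a '.new' result."""

    def can_convert(self, extension: str) -> bool:
        return extension.lower() == ".old"

    def do_convert(self, tmp_dir: str, name: str, extension: str) -> str:
        # A real converter would write the converted file into tmp_dir here.
        return name + ".new"


class FileConverter:
    def __init__(self, converters: List[BaseConverter]) -> None:
        self.converters = converters

    def convert(self, tmp_dir: str, file_name: str) -> str:
        name, extension = os.path.splitext(file_name)
        # Ask the converters one by one; the first that accepts the extension wins.
        for converter in self.converters:
            if converter.can_convert(extension):
                return converter.do_convert(tmp_dir, name, extension)
        # No converter applies: return the input file name unchanged.
        return file_name
```

Returning the unchanged name instead of raising an error lets the caller pass the result straight to the reading stage without checking whether a conversion took place.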

DocParser

DocParser has a list of basic readers that it uses to read the file. One by one, DocParser asks every listed reader whether it is able to read the file (this depends on the file extension). If no reader is able to read the file, BadFileFormatException is raised. Otherwise, the file is read by one of the readers.
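A minimal sketch of this reader selection, assuming a hypothetical reader interface with `can_read` and `read` methods. Only the names DocParser, BaseReader and BadFileFormatException come from the text above; `TxtReader` and `parse_file` are invented for the example.

```python
import os
from typing import List, Tuple


class BadFileFormatException(Exception):
    """Raised when no reader accepts the file."""


class BaseReader:
    """Hypothetical reader interface."""

    def can_read(self, extension: str) -> bool:
        raise NotImplementedError

    def read(self, path: str) -> Tuple[List[str], bool]:
        raise NotImplementedError


class TxtReader(BaseReader):
    """Toy reader for plain-text files."""

    def can_read(self, extension: str) -> bool:
        return extension == ".txt"

    def read(self, path: str) -> Tuple[List[str], bool]:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        # Second element: whether the document can have attachments.
        return lines, False


class DocParser:
    def __init__(self, readers: List[BaseReader]) -> None:
        self.readers = readers

    def parse_file(self, path: str) -> Tuple[List[str], bool]:
        extension = os.path.splitext(path)[1].lower()
        for reader in self.readers:
            if reader.can_read(extension):
                return reader.read(path)
        raise BadFileFormatException(f"no reader for extension {extension!r}")
```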

BaseReader results

BaseReader extracts the data and metadata of the document's content (UnstructuredDocument), together with a flag indicating whether the document can have attachments. UnstructuredDocument consists of a list of pages and lines, where every line is represented as a LineWithMeta object.

LineWithMeta contains the text, metadata about the text (the type of the line, the number of the line, etc.), a list of annotations (an annotation carries information about individual words and parts of the text), and a HierarchyLevel, which is needed for folding the document.
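As a rough picture of this data, the sketch below models a line with plain dataclasses. The field names (`line_type`, `line_id`, `start`, `end`, `name`, `value`) and the use of a tuple for the hierarchy level are assumptions, not the actual attributes of dedoc's classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    """Hypothetical annotation: marks text[start:end] with a named property."""
    start: int
    end: int
    name: str
    value: str


@dataclass
class LineWithMeta:
    """Hypothetical shape of a line produced by a reader."""
    text: str
    line_type: str                                     # e.g. "header" or "paragraph"
    line_id: int                                       # number of the line in the document
    hierarchy_level: Optional[Tuple[int, int]] = None  # (level1, level2)
    annotations: List[Annotation] = field(default_factory=list)


line = LineWithMeta(
    text="Chapter 1",
    line_type="header",
    line_id=0,
    hierarchy_level=(1, 1),
    annotations=[Annotation(start=0, end=7, name="bold", value="True")],
)
# The annotation covers the word "Chapter".
annotated = line.text[line.annotations[0].start:line.annotations[0].end]
```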

HierarchyLevel defines the nesting level, which is given by two numbers, level1 and level2 (the smaller the number, the more important the line). For example, given a sequence of lines with their nesting levels indicated in brackets, we can tell that the first line is the heading, the second one is nested in the first, and the third one is nested in the second.
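To make the nesting rule concrete, here is a small sketch with made-up lines and levels. It assumes that levels are compared lexicographically as (level1, level2) and that a line is nested in the nearest preceding line with a strictly smaller level; the actual comparison in dedoc may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True, order=True)
class HierarchyLevel:
    """Hypothetical model: ordered as the tuple (level1, level2)."""
    level1: int
    level2: int


def nesting_depths(levels: List[HierarchyLevel]) -> List[int]:
    """Return, for each line, how many preceding lines it is nested in."""
    stack: List[HierarchyLevel] = []
    depths: List[int] = []
    for level in levels:
        # Close every open line that is not more important than the current one.
        while stack and stack[-1] >= level:
            stack.pop()
        depths.append(len(stack))
        stack.append(level)
    return depths


# Illustrative lines with their nesting levels in brackets:
#   Title    (1, 1)
#   Section  (1, 2)
#   Item     (2, 1)
#   Section  (1, 2)
levels = [
    HierarchyLevel(1, 1),
    HierarchyLevel(1, 2),
    HierarchyLevel(2, 1),
    HierarchyLevel(1, 2),
]
depths = nesting_depths(levels)
```

Under these assumptions the first line is the heading (depth 0), the second is nested in the first, the third is nested in the second, and the fourth closes the previous section and returns to depth 1.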

How to implement your own extension to dedoc

Look here to get more information.