Administrative documents, reports, balance sheets, presentations: companies often have a wealth of data-rich PDF documents at their disposal. Often unused, unstructured and voluminous, these documents may contain text, images and tables in widely varying layouts, a diversity that complicates analysis and information extraction.
Automatically extracting data from PDF documents, as Menaps proposes, can be broken down into a sequence of steps called a pipeline: processes for collection, organisation, storage, accessibility and exploitation have to be designed and implemented. In this context, the mission of the Data Engineers within the Menaps teams is to build this pipeline.
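The steps described in this article can be sketched as a sequence of stages threading a shared state through the pipeline. This is a minimal illustration of the idea, not Menaps' actual code; the stage names and stubbed contents are hypothetical.

```python
# Minimal sketch of a pipeline as an ordered list of stages.
# Stage names, signatures and stubbed data are illustrative only.

def collect(state):
    """Gather source documents (here: just record which files to process)."""
    state["documents"] = ["report_2023.pdf"]  # placeholder input
    return state

def extract(state):
    """Convert each document into a raw textual representation (stubbed)."""
    state["raw"] = {doc: f"raw contents of {doc}" for doc in state["documents"]}
    return state

def structure(state):
    """Clean and structure the raw output into records (stubbed)."""
    state["records"] = [{"source": doc} for doc in state["raw"]]
    return state

PIPELINE = [collect, extract, structure]

def run_pipeline(stages, state=None):
    """Apply each stage in order, passing the shared state dict along."""
    state = {} if state is None else state
    for stage in stages:
        state = stage(state)
    return state

result = run_pipeline(PIPELINE)
```

Keeping each stage as a plain function makes it easy to test stages in isolation and to swap one implementation for another (e.g. a different extraction library) without touching the rest of the pipeline.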
The first step is to connect to the source databases containing the documents and retrieve them automatically, which avoids manual, repetitive and tedious downloading. This can be done, for example, with robots that simulate a human operator's actions (Robotic Process Automation, RPA).
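RPA platforms are typically commercial tools, but the collection step can be illustrated with a much simpler stand-in: a script that sweeps a shared folder where documents arrive and copies every PDF into a landing area for the pipeline. The folder layout below is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def collect_pdfs(source_dir, landing_dir):
    """Copy every PDF found under source_dir into landing_dir.

    A simplified stand-in for an RPA robot: instead of simulating
    clicks in a user interface, we sweep a shared folder where the
    source documents are dropped.
    """
    landing = Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(Path(source_dir).rglob("*.pdf")):
        target = landing / pdf.name
        shutil.copy2(pdf, target)
        copied.append(target)
    return copied

# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "share", Path(tmp) / "landing"
    (src / "sub").mkdir(parents=True)
    (src / "sub" / "report.pdf").write_bytes(b"%PDF-1.4 stub")
    print([p.name for p in collect_pdfs(src, dst)])  # ['report.pdf']
```

In production this sweep would run on a schedule (or react to file-system events) and record which documents have already been processed.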
Once the documents are available, they must be converted into a textual representation so that their information can be retrieved and used. We extract tables with the Python library Camelot, and text and images with Poppler.
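This extraction step might look something like the sketch below: Camelot's `read_pdf` parses tables into DataFrames, while Poppler's `pdftotext` and `pdfimages` command-line tools dump the text and embedded images. The function is written defensively so that any component whose tool or library is unavailable is simply reported as `None`; the output layout is an assumption.

```python
import shutil
import subprocess
from pathlib import Path

def extract_pdf(pdf_path, out_dir):
    """Extract tables (Camelot), text and images (Poppler) from one PDF.

    Returns a dict describing what could be extracted; any component
    whose tool/library is unavailable (or that fails on this file)
    is left as None.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    result = {"tables": None, "text": None, "images": None}

    # Tables: Camelot parses PDF tables into pandas DataFrames.
    try:
        import camelot
        tables = camelot.read_pdf(str(pdf_path), pages="all")
        result["tables"] = [t.df for t in tables]
    except Exception:  # camelot not installed, or PDF unparsable
        pass

    # Text: Poppler's pdftotext writes a plain-text rendition.
    if shutil.which("pdftotext"):
        txt = out / "text.txt"
        if subprocess.run(["pdftotext", str(pdf_path), str(txt)]).returncode == 0:
            result["text"] = txt

    # Images: Poppler's pdfimages dumps embedded images as PNG files.
    if shutil.which("pdfimages"):
        proc = subprocess.run(["pdfimages", "-png", str(pdf_path), str(out / "img")])
        if proc.returncode == 0:
            result["images"] = sorted(out.glob("img-*.png"))

    return result
```

Camelot works best on text-based (non-scanned) PDFs; scanned documents would additionally need an OCR step, which is out of scope here.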
The next steps are to parse, clean, manipulate, cross-reference and structure the data so that it can feed downstream analytical applications.
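As a small, self-contained illustration of this cleaning step, the helper below takes a raw extracted table (a list of rows of strings, as a table-extraction tool might produce), trims stray whitespace and line breaks, drops empty rows, and coerces numeric-looking cells, including `1 234,56`-style figures common in French documents, into floats. The sample data is invented.

```python
def clean_table(rows):
    """Normalise a raw extracted table: trim cells, drop empty rows,
    and coerce numeric-looking cells (including '1 234,56' style
    figures) to floats."""
    def coerce(cell):
        text = " ".join(cell.split())  # collapse whitespace and newlines
        compact = text.replace("\u00a0", "").replace(" ", "").replace(",", ".")
        try:
            return float(compact)
        except ValueError:
            return text

    cleaned = []
    for row in rows:
        values = [coerce(c) for c in row]
        if any(v != "" for v in values):  # drop fully empty rows
            cleaned.append(values)
    return cleaned

raw = [["Revenue ", "1 234,56"], ["", ""], ["Units", "42"]]
print(clean_table(raw))  # [['Revenue', 1234.56], ['Units', 42.0]]
```

Real pipelines layer more rules on top (date parsing, header detection, cross-referencing against reference data), but they follow the same shape: small, testable transformations applied row by row.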
To make the data available and exploitable by data analysts and data scientists, we load the extracted and transformed data into SQL and NoSQL databases. Finally, we develop APIs that simplify access to the data.
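For the SQL side, this loading step can be sketched with the standard-library `sqlite3` module; the table schema and sample records are hypothetical. In practice an HTTP API (built with a framework such as Flask or FastAPI) would then expose queries over this store to analysts.

```python
import sqlite3

def load_records(conn, records):
    """Create a table for the structured output and bulk-insert it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted "
        "(source TEXT, label TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO extracted (source, label, value) VALUES (?, ?, ?)",
        [(r["source"], r["label"], r["value"]) for r in records],
    )
    conn.commit()

# Hypothetical records produced by the cleaning/structuring step.
records = [
    {"source": "report.pdf", "label": "Revenue", "value": 1234.56},
    {"source": "report.pdf", "label": "Units", "value": 42.0},
]
conn = sqlite3.connect(":memory:")
load_records(conn, records)
count = conn.execute("SELECT COUNT(*) FROM extracted").fetchone()[0]
print(count)  # 2
```

Using parameterised `executemany` inserts keeps the load step safe against malformed cell contents and fast for large batches.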
Designing pipelines that automatically extract data from PDF documents is an essential step in making available data that is, at first glance, unusable. It gives Menaps' data scientists and data analysts structured data on which to build their descriptive, predictive or prescriptive analyses.
Julien LOUTON – Data Engineer Menaps