How does Canopy Extract work?

Last Updated: January 24, 2023

All the data we need is in table format

Our data is invariably in table format. Typically, we need to extract the following 3 tables from each PDF document:

  • Holdings
  • Transactions
  • Current Account Credits and Debits

Canopy Extract is designed to extract any table (not just the 3 tables above) from any PDF document. It does not extract charts or images, so it is not the right tool if that is what you need.

Extract needs the PDF document and an Excel Configuration file

To work, Canopy Extract needs two files:

  • PDF document to be extracted (e-PDF is preferred, but paper scans will also work)
  • Excel Configuration File (which describes the table to be extracted)
The Extract needs an Excel Configuration File (which describes the table to be extracted)

What does a typical PDF document look like?

Multilayer headers and nesting are the key challenges when extracting data from a PDF table.

Typical table in a Bank Statement
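
To illustrate why multilayer headers are awkward, here is a minimal Python sketch of one common way to flatten stacked header rows into a single label per column. It is not a description of how Canopy Extract works internally. The function name `flatten_headers`, the separator, and the sample column names are all hypothetical and chosen only for illustration.

```python
def flatten_headers(header_rows, sep=" / "):
    """Merge stacked header rows into one label per column.

    Assumption: a blank cell in an upper header row continues the
    spanning cell to its left (a merged header). The bottom row is
    left as-is, so a blank there simply means "no sub-header".
    """
    n_cols = max(len(row) for row in header_rows)
    levels = []
    for depth, row in enumerate(header_rows):
        # Normalise whitespace and pad short rows to the full width.
        cells = [c.strip() for c in row] + [""] * (n_cols - len(row))
        is_last = depth == len(header_rows) - 1
        if not is_last:
            # Forward-fill spanning headers across their blank cells.
            last = ""
            for i, cell in enumerate(cells):
                if cell:
                    last = cell
                cells[i] = last
        levels.append(cells)

    # Join the non-empty parts of each column from top to bottom.
    flat = []
    for column in zip(*levels):
        parts = [p for p in column if p]
        flat.append(sep.join(parts))
    return flat


# Hypothetical two-layer header, similar in shape to a holdings table.
header_rows = [
    ["Security", "Market Value", "", "Cost", ""],
    ["", "Local", "USD", "Local", "USD"],
]
columns = flatten_headers(header_rows)
print(columns)
```

The forward-fill step is a simplification: it cannot tell a genuinely empty upper cell from a merged one, which is part of why a per-table description of the layout is useful.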

What does an Excel Configuration file look like?

The Excel Configuration file for the table above is shown below. Further details are on the page "Parts of a Config File".

Excel Configuration file to extract the Holdings table in the image above