Make tables from PDF datasheets available as spreadsheets

matthijs · July 3, 2023, 6:34pm

Toradex publishes various datasheets for their modules and carrier boards. These datasheets contain tables with, among other things, pin listings that I would like to have available in spreadsheet form, to be used in pinout documentation for our project.

It seems others have also asked for this here and here. Those requests were solved by someone creating a one-off CSV (manually?) and by the Colibri Compatibility Guide (which seems to contain the kind of info I am looking for, but I cannot find a version for the Verdin that I am using).

I’m mostly interested in two tables from the PDF. First, one to map the SODIMM pins to the SoC pins (and, as a bonus, also all their ALTx functions), which is essentially this table:

Second, one to map the Dahlia board I/O connector pins to the SODIMM pins, which is essentially this table (plus the following one for X19):

Unfortunately, copy-pasting this data from the PDF file into a spreadsheet does not seem to produce a usable table (if someone knows a trick to achieve this, I’d be interested). I would expect the source files that these PDFs are generated from to have the tables in a more usable form. If those could be automatically extracted, that would be awesome.

Note that I also looked at the pinout designer tool (web version), but that seems to be function-oriented instead of pin-oriented, only produces output for pins assigned a function explicitly, does not seem to allow lookup of Dahlia connector pin numbers, nor adding comments or so, so I think that does not really help me here.

matthijs · July 3, 2023, 8:13pm

I made a bit more progress - turns out the camelot tool can actually process the PDF datasheets and extract the tables reasonably accurately. It’s not perfect: multi-line table cells are extracted as separate rows, but I created a small Python script that fixes that in most cases. Overall it works well enough to get usable data without too much effort.

Here are the commands I used:

camelot --format csv --output verdin.csv --pages 28,29,30 stream verdin_development_board_datasheet_v1.1.pdf 
camelot --format csv --output verdin.csv --pages 28,29,30 stream verdin_imx8m_plus_datasheet_v1.1.pdf 
camelot --format csv --output verdin2.csv --pages 18-24 stream ~/docs/Electronics/Datasheets/verdin_imx8m_plus_datasheet_v1.1.pdf

With this script for post-processing:

import csv
import sys

files = sys.argv[1:]
writer = csv.writer(sys.stdout)


def write(rows):
    # Assert all rows have the same length
    assert(len(rows) * len(rows[0]) == sum(map(len, rows)))
    output = [''.join(values) for values in zip(*rows)]
    writer.writerow(output)


for filename in files:
    with open(filename, newline='') as file:
        rows = []
        extra = 0

        for row in csv.reader(file):
            # This uses the second field to distinguish a normal row
            # from a word-wrapped fake row, since that field is never
            # word-wrapped for the content we are using.
            # This works by counting the leading rows (extra > 0), then
            # negating extra when we find the real row, so it ends up
            # back at 0 when we found all trailing rows (and then we
            # just concatenate all rows collected until then and output
            # them). Probably fragile, but works ok for our content.
            if not row[1]:
                extra += 1
            elif extra >= 0:
                extra *= -1
            else:
                # Encountering a normal row while expecting trailing
                # extra rows should not normally happen, except in weird
                # cases where the extra fields are not balanced top and
                # bottom, in which case this probably messes up the
                # output and should be manually fixed
                rows[-1][-1] += "Weird layout - FIX MANUALLY"
                write(rows)
                rows = []
                extra = 0

            row.append("")  # Room for error messages
            rows.append(row)

            # print(extra, row)

            if extra == 0:
                write(rows)
                rows = []

        if extra != 0:
            rows[-1][-1] += "Weird layout - FIX MANUALLY"
            write(rows)

With some more minor manual processing, this produces the following spreadsheet:

Dahlia Verdin Plus Pinout.ods (32.2 KB)

Next step is adding our custom expansion board pinout and cross-referencing those with the data from these tables, but I ran out of time, and wanted to share this result already.
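
A rough sketch of what that cross-referencing could look like, assuming both tables end up as CSV with a shared SODIMM pin column. The column names (`connector_pin`, `sodimm_pin`, `soc_ball`) and all pin values below are hypothetical and not taken from the datasheets.

```python
import csv
import io

# Made-up sample data; real column names and pin names will differ.
CONNECTOR_CSV = """\
connector_pin,sodimm_pin
X1-1,SODIMM_A
X1-2,SODIMM_B
X1-3,SODIMM_C
"""

SOC_CSV = """\
sodimm_pin,soc_ball
SODIMM_A,BALL_1
SODIMM_B,BALL_2
"""


def cross_reference(connector_file, soc_file):
    """Add the SoC ball to each connector pin by looking up its SODIMM pin."""
    soc_by_sodimm = {
        row["sodimm_pin"]: row["soc_ball"] for row in csv.DictReader(soc_file)
    }
    result = []
    for row in csv.DictReader(connector_file):
        # Leave the field empty when the SODIMM pin has no entry.
        soc_ball = soc_by_sodimm.get(row["sodimm_pin"], "")
        result.append({**row, "soc_ball": soc_ball})
    return result


table = cross_reference(io.StringIO(CONNECTOR_CSV), io.StringIO(SOC_CSV))
for entry in table:
    print(entry["connector_pin"], entry["sodimm_pin"],
          entry["soc_ball"] or "(no match)")
```

Keying the lookup on the SODIMM pin keeps the connector table as the primary list, so pins without a match still show up and can be checked by hand.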

josep.tx · July 4, 2023, 9:42am

Hello @matthijs,
Thanks for the info and for sharing this Python script.
In the meantime I have asked internally about your request. I will let you know if there are any updates.

Best regards,
Josep

josep.tx · July 11, 2023, 2:50pm

Hello @matthijs,
An update about this topic:

We discussed it internally and we will make the tables available in LaTeX format, along with the datasheets, whenever the team finds some time.
Then you can convert them to Excel format using an online conversion tool, such as
Convert LaTeX Table to Excel - Table Convert Online

But for the specific case of the Verdin iMX8MM, it will take some time.

Best regards,
Josep

matthijs · July 12, 2023, 12:26pm

Awesome, thanks for that!

For my particular case, I can move forward with the tables I extracted from the PDF, but I look forward to using the LaTeX tables in the future when I next need some other table.