Very interesting! I had never heard of Apache PDFBox before, I must give it a tr...

jahewson · on July 8, 2017

PDFBox committer here. If you want even lower-level access to the page content stream, without anything 'clever' at all, check out the PDFGraphicsStreamEngine class, which is a superclass of the text extraction and rendering classes. It gives you access to the raw glyphs. You can override PageRenderer too, for visual debugging, e.g. to render glyph bounding boxes. We have an interactive Swing PDFDebugger which does just that.

https://github.com/apache/pdfbox/blob/6f18d7c4bef4d23a22d/ex...

joosters · on July 8, 2017

Thanks for the guidance, I'll take a look.

robinhowlett · on July 8, 2017

Yes, I encountered similar issues, but I was able to solve many of them.

With PDFBox I was able to deal with the content at a very low level (on a per-character basis). For instance, when building a String, I would insert a pipe character wherever the distance between adjacent characters was greater than the width of the space character, and then detect that pipe when mapping the text to a particular field.

See the convertToText() method for an example: https://github.com/robinhowlett/chart-parser/blob/master/src...

and https://github.com/robinhowlett/chart-parser/blob/f8d651e9a1... for a place where I used this technique
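
A minimal sketch of the gap-detection idea described above, written without any PDFBox dependency so it runs on its own. The `Glyph` class and the `joinWithPipes` method are hypothetical names made up for this illustration and are not taken from chart-parser; with PDFBox, the equivalent values would presumably come from `TextPosition` (`getX()`, `getWidth()`, `getWidthOfSpace()`).

```java
import java.util.List;

final class GapSplitter {

    /** Hypothetical stand-in for a positioned glyph; not a PDFBox type. */
    static final class Glyph {
        final String text;      // the character(s) this glyph represents
        final float x;          // left edge of the glyph on the page
        final float width;      // horizontal width of the glyph
        final float spaceWidth; // width of a space in the glyph's font

        Glyph(String text, float x, float width, float spaceWidth) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    /**
     * Concatenates glyphs that are assumed to be on one line and sorted
     * left to right, inserting '|' wherever the horizontal gap between
     * two neighbours is wider than a space character.
     */
    static String joinWithPipes(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                // Gap = start of this glyph minus end of the previous one.
                float gap = g.x - (prev.x + prev.width);
                if (gap > prev.spaceWidth) {
                    sb.append('|'); // marks a likely column boundary
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

For example, glyphs "A" at x=0, "B" at x=5 and "7" at x=30, each 5 units wide with a space width of 3, come out as `AB|7`, which can then be split on the pipe to recover the fields. Real documents would also need line grouping and some tolerance for kerning, which this sketch leaves out.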

tcho · on July 8, 2017

Very cool, good to see the level of control this package allows.

bpicolo · on July 7, 2017

Huh, interesting. I looked around for PDF libs a while back and PDFBox didn't show up in the Google results. pdftk was the only one that appeared anywhere useful.

Edit: Looks like it's on the second page of results and I never made it that far, heh. Goes to show how much the first page of results biases what you find.