
Very interesting! I had never heard of Apache PDFBox before; I must give it a try. I have a similar program that parses horse racing PDFs from sites such as www.racehorserunner.com. These are of a much simpler format, but they cause endless problems for me when the PDFs have layout problems, such as one column being too long and overlapping with another, e.g. the last race on http://www.racehorserunner.com/Archives/ELP/ELP170702.pdf

All the PDF parsers I have tried cope very badly with these kinds of situations, and often try to be 'too clever', in that they value the final layout of the text over the individual strings.

Have you experienced similar problems with PDFBox, or does it handle formatting and layout fairly reliably?



PDFBox committer here. If you want even lower-level access to the page content stream, without anything 'clever' at all, check out the PDFGraphicsStreamEngine class, which is a superclass of the text extraction and rendering classes. It gives you access to the raw glyphs. You can override PageRenderer too, for visual debugging, e.g. to render glyph bounding boxes. We have an interactive Swing PDFDebugger which does just that.

https://github.com/apache/pdfbox/blob/6f18d7c4bef4d23a22d/ex...


Thanks for the guidance, I'll take a look.


Yes, I encountered similar issues, but I was able to solve many of them.

With PDFBox I was able to deal with the content at a very low level (on a per-character basis). For instance, when building a String, I would insert a pipe character whenever the distance between adjacent characters was greater than the width of the space character, and then detect that pipe when translating the text to a certain field.

See the convertToText() method for an example: https://github.com/robinhowlett/chart-parser/blob/master/src...

and https://github.com/robinhowlett/chart-parser/blob/f8d651e9a1... for where I used this technique
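
To make the gap idea concrete, here is a minimal sketch of it in plain Java. It is not the code from chart-parser and it does not use PDFBox directly: `Glyph`, `GapDelimiter`, and `joinWithPipes` are hypothetical names made up for illustration, with `Glyph` standing in for whatever per-character position data your parser exposes (in PDFBox that would come from its text position objects). It also assumes the glyphs are already sorted left to right on a single line.

```java
import java.util.Arrays;
import java.util.List;

public class GapDelimiter {

    /** Hypothetical stand-in for one positioned character on a line. */
    public static final class Glyph {
        public final String text;  // the character(s) this glyph renders
        public final float x;      // left edge of the glyph
        public final float width;  // advance width of the glyph

        public Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Concatenates glyph text, inserting '|' wherever the horizontal gap
     * between one glyph's right edge and the next glyph's left edge is
     * wider than a space character.
     */
    public static String joinWithPipes(List<Glyph> glyphs, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth) {
                    sb.append('|'); // likely a column boundary, not a word space
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

For example, with glyphs "A" at x=0, "B" at x=5 and "7" at x=30 (each 5 wide) and a space width of 3, `joinWithPipes` returns `AB|7`, which can then be split on the pipe to recover the fields. A real document would need extra handling for line breaks and overlapping columns, which this sketch leaves out.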


Very cool, good to see the level of control this package allows.


Huh, interesting. I was looking around for PDF libs previously and PDFBox didn't show up in my Google results. pdftk was the only one that showed up anywhere useful.

Edit: Looks like it's on the second page of results and I never made it that far, heh. Goes to show how biasing the first page of results is.



