
Re: [Omaha.pm] PDF to Text parsing



On Oct 4, 2012, at 2:31 AM, Rob Townley <rob.townley@gmail.com> wrote:
> Interesting.  i suppose your PDFs only contained images of text, but
> not actual text, hence the need for OCR?  If so, i may use this for
> something else.

PDF files often contain both (1) text with layout, placement, and font information, and (2) images. Those images may happen to have pixels in them which humans interpret as text, and those pixels can sometimes be OCR'd to produce text.

PDF::OCR2 does both of these things for you. It can be used to "extract all text and all image ocr from pdf". 
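
For anyone who wants to see the two steps spelled out, here is a rough sketch in Python. This is not PDF::OCR2's code or API. It assumes the poppler command-line tools (pdftotext, pdfimages) and tesseract are installed and on your PATH, and every function name in it is made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def text_layer_cmd(pdf_path):
    """Command that prints the PDF's embedded text layer to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def image_dump_cmd(pdf_path, prefix):
    """Command that writes each embedded image to a file named <prefix>-NNN.<ext>."""
    return ["pdfimages", str(pdf_path), str(prefix)]


def ocr_cmd(image_path):
    """Command that OCRs one image and prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout"]


def extract_all_text(pdf_path):
    """Return (text_layer, ocr_text) for one PDF.

    Assumption: pdftotext, pdfimages and tesseract are all installed.
    """
    for tool in ("pdftotext", "pdfimages", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    def run(cmd):
        # Run one external command and hand back whatever it printed.
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        return result.stdout

    # Step 1: the real text, if the PDF has any.
    text_layer = run(text_layer_cmd(pdf_path))

    # Step 2: dump the embedded images, then OCR each one.
    with tempfile.TemporaryDirectory() as tmp:
        run(image_dump_cmd(pdf_path, Path(tmp) / "img"))
        images = sorted(Path(tmp).iterdir())
        ocr_text = "\n".join(run(ocr_cmd(image)) for image in images)

    return text_layer, ocr_text
```

Calling extract_all_text("scan.pdf") (a hypothetical file name) gives you the two kinds of text separately, so you can see which one your PDFs actually have. A scanned document will usually come back with an empty or nearly empty text layer and everything in the OCR half.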

Again, it all depends on the PDF file.   :)

I'm guessing Chris was dealing with a directory full of images. 

HTH,

j