
Re: [Omaha.pm] PDF to Text parsing

To: Rob.Townley@gmail.com, "Perl Mongers of Omaha, Nebraska USA" <omaha-pm@pm.org>
Subject: Re: [Omaha.pm] PDF to Text parsing
From: Jay Hannah <jay@jays.net>
Date: Thu, 4 Oct 2012 04:29:53 -0500

On Oct 4, 2012, at 2:31 AM, Rob Townley <rob.townley@gmail.com> wrote:
> Interesting.  i suppose your PDFs only contained images of text, but
> not actual text, hence the need for OCR?  If so, i may use this for
> something else.

PDF files often contain both (1) text with layout, placement, and font information, and (2) images. Those images may happen to have pixels in them which humans interpret as text. Those pixels can sometimes be OCR'd to produce text.

PDF::OCR2 handles both of these for you. It can be used to "extract all text and all image ocr from pdf".

Again, it all depends on the PDF file.   :)
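
To make the "it depends" part concrete, here is a rough sketch of the two paths. It is not PDF::OCR2 and not what Chris ran. It is Python shelling out to poppler's pdftotext and pdfimages and to tesseract, all assumed to be on your PATH. The helper names, the 20-character threshold, and the "img" prefix are made up for illustration, and the -png flag needs a reasonably recent poppler.

```python
import shutil
import subprocess
from pathlib import Path


def text_layer_command(pdf_path, txt_path):
    """argv for poppler's pdftotext: pull out the embedded text layer."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def image_dump_command(pdf_path, prefix):
    """argv for poppler's pdfimages: write embedded images as PNG files."""
    return ["pdfimages", "-png", str(pdf_path), str(prefix)]


def ocr_command(image_path, out_base):
    """argv for tesseract: OCR one image into <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base)]


def needs_ocr(text_layer, min_chars=20):
    """Guess that a PDF is image-only when its text layer is nearly empty.

    min_chars is an arbitrary threshold picked for this sketch.
    """
    return len("".join(text_layer.split())) < min_chars


def pdf_to_text(pdf_path, workdir):
    """Return the text layer, falling back to OCR of the embedded images."""
    for tool in ("pdftotext", "pdfimages", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # Path 1: real text stored in the PDF.
    txt_path = workdir / "layer.txt"
    subprocess.run(text_layer_command(pdf_path, txt_path), check=True)
    text = txt_path.read_text(errors="replace")
    if not needs_ocr(text):
        return text

    # Path 2: no usable text layer, so dump the images and OCR each one.
    subprocess.run(image_dump_command(pdf_path, workdir / "img"), check=True)
    chunks = []
    for image in sorted(workdir.glob("img-*.png")):
        out_base = image.with_suffix("")
        subprocess.run(ocr_command(image, out_base), check=True)
        chunks.append(out_base.with_suffix(".txt").read_text(errors="replace"))
    return "\n".join(chunks)
```

Something like pdf_to_text("scan.pdf", "/tmp/scan-work") would give you the text layer when there is one and OCR output otherwise. A PDF that mixes real text and scanned pages would need both paths run and merged, which this sketch does not attempt.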

I'm guessing Chris was dealing with a directory full of images. 

HTH,

j

References:
- Re: [Omaha.pm] [omaha] PDF to Text parsing (Jay Hannah)
  - From: Chris Brandstetter <sirloxelroy@gmail.com>
- Re: [Omaha.pm] [omaha] PDF to Text parsing (Jay Hannah)
  - From: Rob Townley <rob.townley@gmail.com>
