
Re: [Omaha.pm] [omaha] PDF to Text parsing

To: Omaha Python Users Group <omaha@python.org>
Subject: Re: [Omaha.pm] [omaha] PDF to Text parsing
From: Jay Hannah <jay@jays.net>
Date: Tue, 2 Oct 2012 14:18:35 -0500
Cc: Nebraska USA Perl Mongers of Omaha <omaha-pm@pm.org>
In-reply-to: <9AECCD8B-C8AF-4EA6-B163-900DD2BD5EC3@unomaha.edu>
References: <CA+VdTb854=jZ4Le9qqEpeQA4P=E0RNc1pZNTBeu5Xg4WeGCCCQ@mail.gmail.com> <9AECCD8B-C8AF-4EA6-B163-900DD2BD5EC3@unomaha.edu>
Reply-to: "Perl Mongers of Omaha, Nebraska USA" <omaha-pm@pm.org>

On Oct 2, 2012, at 11:03 AM, "Rob Townley" <rob.townley@gmail.com> wrote:
> The ReEnergizeProgram.org auditor said that a big slowdown is getting
> all the data from PDF based bills from MUD and OPPD into a spreadsheet
> / database.  Sounds like they email stuff, copy-n-paste a lot, and then
> email on.
> 
> What perl/python/php modules would you recommend for parsing the text from PDF?

On Oct 2, 2012, at 11:10 AM, Burch Kealey <bkealey@unomaha.edu> wrote:
> Send us one as an example; this is really a trivial task.

Ya, send us an example PDF. There are 475 PDF libraries on CPAN, but your mileage will vary, and the only way to know for sure is to actually try them. Here are all the hits, and the one I'd probably try first for this job:

   https://metacpan.org/search?q=PDF
   https://metacpan.org/module/PDF::OCR2
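
For the Python folks on the thread, here is a rough sketch of what "actually try" might look like. It assumes the bills carry a real text layer (scanned images would need OCR instead), that poppler's pdftotext command-line tool is installed, and that the bills use labels like "Account Number" and "Amount Due". Those labels and the field names are made up for illustration; the real patterns have to come from looking at an actual MUD or OPPD bill.

```python
import csv
import re
import subprocess
import sys

# Hypothetical field labels: real MUD / OPPD bills will be worded differently,
# so adjust these patterns after inspecting actual extracted text.
FIELD_PATTERNS = {
    "account": re.compile(r"Account\s+Number:?\s*(\S+)", re.I),
    "amount_due": re.compile(r"Amount\s+Due:?\s*\$?([\d,]+\.\d{2})", re.I),
    "kwh_used": re.compile(r"kWh\s+Used:?\s*([\d,]+)", re.I),
}


def pdf_to_text(pdf_path):
    """Run poppler's pdftotext; the trailing '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_bill(text):
    """Pull each field out of the extracted text; missing fields stay blank."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # Drop thousands separators so a spreadsheet reads the value as a number.
        row[name] = match.group(1).replace(",", "") if match else ""
    return row


if __name__ == "__main__":
    # Usage: python bills.py bill1.pdf bill2.pdf > bills.csv
    writer = csv.DictWriter(sys.stdout, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    for path in sys.argv[1:]:
        writer.writerow(parse_bill(pdf_to_text(path)))
```

Blank cells in the CSV are the signal that a pattern did not match, which is usually the first thing to check when a bill layout changes.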

Good luck!  :)

j
Omaha Perl Mongers: http://omaha.pm.org

P.S.   PDF scraping is usually really gross. Government orgs often publish PDF archives as if those are data APIs, and they're really not. Poke MUD and OPPD to publish JSON or XML APIs / archives.
