Re: [Omaha.pm] [omaha] PDF to Text parsing (Jay Hannah)
Interesting. I suppose your PDFs contained only images of text rather
than actual text, hence the need for OCR? If so, I may use this for
something else.
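
For anyone reusing the idea, here is a rough sketch of the two decisions involved: whether a PDF's extracted text looks like a real text layer, and the whitelist cleanup from the quoted snippet below wrapped in a sub. The names has_text_layer and clean_ocr_text are hypothetical, and the 20-character threshold is an arbitrary assumption, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether text extracted from a PDF is a real
# text layer. Assumption: fewer than 20 letters/digits means the page is
# probably a scanned image and needs OCR.
sub has_text_layer {
    my ($extracted) = @_;
    return 0 unless defined $extracted;
    my $count = () = $extracted =~ /[0-9A-Za-z]/g;   # count alphanumerics
    return $count >= 20 ? 1 : 0;
}

# Hypothetical helper: the same whitelist idea as the quoted snippet.
# Removes every character that is not an ASCII letter, digit, space,
# newline, tab, or one of the listed punctuation marks.
sub clean_ocr_text {
    my ($text) = @_;
    $text =~ s/[^0-9A-Za-z \n\t\.\?\!\@\#\$\%\&\*\(\)\"\'\/\,\;\:\+\=\-]//g;
    return $text;
}

# Example: stray bytes from OCR are dropped, ordinary text is kept.
print clean_ocr_text("Amount due: \$42.10\x{A9}\r\n");
```

If has_text_layer returns true for the output of a plain text extractor, OCR can probably be skipped for that file; otherwise the OCR route described below is the fallback.
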
On Wed, Oct 3, 2012 at 8:30 PM, Chris Brandstetter
<sirloxelroy@gmail.com> wrote:
> I used Image::OCR::Tesseract ('get_ocr') and HTML::TextToHTML, then
> did something like this:
>
> "
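> # Run the OCR binary on the file and capture its text output.
> $text = qx($ocrbin \"$fullpath\");
> # Keep only letters, digits, whitespace, and common punctuation.
> $text =~ s/[^0-9A-Za-z\ \n\t\.\?\!\@\#\$\%\&\*\(\)\"\'\/\,\;\:\+\=\-]//gi;
> # Convert the cleaned text to HTML for the preview.
> my $conv = HTML::TextToHTML->new(default_link_dict => '');
> $ocrdhtml = $conv->process_chunk($text);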
> $contents = $text;
> $preview = $ocrdhtml;
> "
>
> Not the prettiest, but it worked to OCR and store the text and previews
> of the documents in a database.
>
>
> Chris Brandstetter
>
> -----BEGIN GEEK CODE BLOCK-----
> Version: 3.1
> GCS/IT d+(-) s++:++ a C++++$ UBLISXC*++++$ P++++$ L+++$ E-- W+++ N+ o K-
> w-- O M++$ V PS- PE Y+ PGP++ t++ 5+++ X+ R- tv-- b+>+++ DI D+ G+ e+ h++ r y?
> ------END GEEK CODE BLOCK------
>
>
>>> The ReEnergizeProgram.org auditor said that a big slowdown is getting
>>> all the data from PDF based bills from MUD and OPPD into a spreadsheet
>>> / database. Sounds like they email stuff, copy-n-paste a lot, and then
>>> email on.
>>>
>>> What perl/python/php modules would you recommend for parsing the text from PDF?
>>
>> On Oct 2, 2012, at 11:10 AM, Burch Kealey <bkealey@unomaha.edu> wrote:
>>> Send us one as an example; this is really a trivial task.
>> Ya, send us an example PDF. There are 475 PDF libraries on CPAN, but your mileage will vary and the only way to know for sure is to actually try... Here's all the hits, and the one I'd probably try first for this job:
>>
>> https://metacpan.org/search?q=PDF
>> https://metacpan.org/module/PDF::OCR2
>>
>> Good luck! :)
>>
>> j
>> Omaha Perl Mongers: http://omaha.pm.org
>>
>>
>>
>> P.S. PDF scraping is usually really gross. Government orgs often publish PDF archives as if those are data APIs, and they're really not. Poke MUD and OPPD to publish JSON or XML APIs / archives.
>>
>> _______________________________________________
>> Omaha-pm mailing list
>> Omaha-pm@pm.org
>> http://mail.pm.org/mailman/listinfo/omaha-pm