|
Answer» Example: folder c:\test\ contains hundreds of PDF files. I need to automate extraction of information inside a PDF file that looks like a table:

Code: [Select]
------------------- LOGO PIC ------------------
[u]TRANSACTION[/u]
NO  ITEM NAME  PCS  PRICE  TOTAL
1   AAAA       1    200    200
2   ITEM B     3    10     30
3   ITEM X     5    2      10
                           -------
                           240
20/04/2009
Approved by,
(blabalabal)

I need the result either in a variable, a text file, a string, or printed to the screen. The result can be in the following format, though any other format will also work:

Code: [Select]
AAAA,1,200
ITEM B,3,10
ITEM X,5,2
20/04/2009

How can I automate all of this? I have Adobe Acrobat and Foxit Reader installed, and I am open to a solution in any language, preferably one that can create an ADODB connection, such as VBScript. There are a lot of paid programs out there that can batch-convert PDF to text, but I am looking for a simple solution with EDITABLE source code.
It's not urgent, but I am curious about how to work this out.

A few ways:
1) On *nix, there is the pdftotext tool. You can search for a Windows port of it, which can extract the text you want. After that, do the string processing using batch.
2) I do not know of a PDF library for VBScript yet. I guess you could just call external tools from VBScript and do the same as in 1), but in VBScript.
3) Other programming languages like Perl (or Python/Ruby), where PDF libraries have been created to manipulate PDFs easily. An example using Perl's CAM::PDF module:
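Code: [Select]
use CAM::PDF;

my $pdf      = CAM::PDF->new('test.pdf');
my $page1    = $pdf->getPageText(1);
my @contents = split /\n/, $page1;
my $f        = 0;    # becomes 1 once the table header has been seen

for my $k (0 .. $#contents) {
    # skip separator lines
    if ($contents[$k] =~ /-----/) { next }
    # the date sits on the line just above "Approved by"
    if ($contents[$k] =~ /Approved by/) { print $contents[$k-1]; $f = 0; next }
    # header row: the item rows start after this line
    if ($contents[$k] =~ /NO|ITEM NAME/) { $f = 1; next }
    if ($f) {
        # columns are separated by two or more whitespace characters;
        # lines after the item rows (e.g. the 240 total) also pass through here
        my @l = split /\s{2,}/, $contents[$k];
        printf "%s,%s,%s\n", $l[1], $l[2], $l[3];
    }
}

output:

Code: [Select]
c:\test> perl test.pl
AAAA,1,200
ITEM B,3,10
ITEM X,5,2
240,,
,,
20/04/2009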
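For option 1 (pdftotext plus string processing), here is a rough Python sketch of the same idea, run over a whole folder. It is only a sketch under some assumptions: a pdftotext build is on the PATH, its -layout option keeps the columns on one line, each item row looks like "NO name PCS PRICE TOTAL" separated by whitespace, and the date sits alone on its line as dd/mm/yyyy. The function names (extract_text, parse_transaction, format_result, convert_folder) are made up for this example.

```python
import re
import subprocess
from pathlib import Path

# Item row: NO, ITEM NAME, PCS, PRICE, TOTAL (assumed layout, whitespace separated)
ROW = re.compile(r"^\s*\d+\s+(.+?)\s+(\d+)\s+(\d+)\s+\d+\s*$")
# Date alone on its line, dd/mm/yyyy (assumed format)
DATE = re.compile(r"^\s*(\d{2}/\d{2}/\d{4})\s*$")


def extract_text(pdf_path):
    """Run pdftotext (assumed to be on PATH) and return the text it prints."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],  # "-" sends text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_transaction(text):
    """Return ([(name, pcs, price), ...], date) found in the extracted text."""
    items, date = [], None
    for line in text.splitlines():
        row = ROW.match(line)
        if row:
            items.append((row.group(1), int(row.group(2)), int(row.group(3))))
            continue
        found = DATE.match(line)
        if found:
            date = found.group(1)
    return items, date


def format_result(items, date):
    """Build the comma-separated output asked for in the question."""
    lines = [f"{name},{pcs},{price}" for name, pcs, price in items]
    if date:
        lines.append(date)
    return "\n".join(lines)


def convert_folder(folder):
    """Map each PDF file name in the folder to its formatted result."""
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        items, date = parse_transaction(extract_text(pdf))
        results[pdf.name] = format_result(items, date)
    return results
```

Calling convert_folder(r"c:\test") would then give one result string per PDF, which can be printed or written to a text file. Unlike the Perl loop, the row pattern only accepts lines with all five columns, so the total line does not show up as "240,,".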
As for ADODB using Perl:

Code: [Select]
use Win32::OLE;

my $conn = Win32::OLE->new('ADODB.Connection');
$conn->{Provider} = "some provider";
$conn->Open;
..... more code ....

Thanks, ghostdog.
When I google "perl download", there are a few versions of Perl, such as CPAN Perl, Strawberry Perl, Active Perl, Perl Express, etc. Which one do you reckon is best to download? The OS is Win XP.
Is the CAM::PDF library built into the Perl distribution, or is it a separate download?

You can search for ActiveState Perl. I use that on Windows. Installing modules from CPAN is very simple: just type cpan at the command prompt

Code: [Select]
c:\test> cpan

then, from the cpan shell, type install CAM::PDF. Type ? in the cpan shell for more help. Another method to install Perl modules is PPM. See here for more info.

Thank you no.8 for the information and code.

No problem. Have fun.
|