1.

Can't We Just Extract The Text Without Parsing The Whole Document, Or Extract Text As It Is Parsed?

Answer»

Not really, for a few reasons.

  1. If the document is encrypted, you need to parse at least as far as the encryption dictionary before you can decrypt anything.
  2. Sometimes the PDFont contains vital information needed for text extraction.
  3. Text on a page does not have to be drawn in reading order. For example, if the page says "Hello World", the PDF could have been written so that "World" is drawn first, then the cursor moves to the left and "Hello" is drawn.
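
To make the third point concrete, here is a minimal sketch in plain Python. It does not use any PDF library; the `Fragment` class, the coordinates, and the `line_tolerance` value are all hypothetical and only illustrate why text emitted in draw order can come out scrambled until every fragment on the page has been seen and sorted by position.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text plus where it was drawn (hypothetical structure)."""
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline; larger y is higher on the page, as in PDF user space


def in_draw_order(fragments):
    # What you get if you emit text as it is parsed.
    return " ".join(f.text for f in fragments)


def in_reading_order(fragments, line_tolerance=2.0):
    # Group fragments into lines by baseline (top to bottom),
    # then sort each line left to right.
    ordered = sorted(fragments, key=lambda f: -f.y)
    lines = []
    for frag in ordered:
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )


# "World" is drawn first, then the cursor moves left and "Hello" is drawn.
fragments = [
    Fragment("World", x=140.0, y=700.0),
    Fragment("Hello", x=100.0, y=700.0),
]

print(in_draw_order(fragments))     # World Hello
print(in_reading_order(fragments))  # Hello World
```

The sort can only run once all fragments for the page are available, which is why extracting text while still parsing tends to give unreliable results.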


