Upgrading from iTextSharp to iText 7

Why am I getting duplicate pages extracted from iText7 C#?

Actually, it is not the same text being returned for sequential pages. Instead you get:

  • the text from page 1 when you extract page 1;
  • the text from pages 1 and 2 when you extract page 2;
  • the text from pages 1, 2, and 3 when you extract page 3;
  • ...

Often this happens in code that re-uses one text extraction strategy object for multiple pages, because the strategy keeps accumulating the text it has seen. But that's not the case in your code; you correctly create a new strategy object for each page. Thus the cause must be in the PDF itself.
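For comparison, here is a minimal sketch of that re-use pitfall and its usual fix. The file name "input.pdf" and the class name are placeholders, not taken from your code; the sketch assumes the iText 7 .NET kernel package is referenced.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

class StrategyReuseSketch
{
    static void Main()
    {
        // "input.pdf" is a hypothetical file name used only for illustration.
        using (var pdfDoc = new PdfDocument(new PdfReader("input.pdf")))
        {
            // Pitfall: one strategy instance shared by all pages.
            // It keeps the text of earlier pages, so the result grows page by page.
            var shared = new SimpleTextExtractionStrategy();
            for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
            {
                string text = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), shared);
                Console.WriteLine("Shared strategy, page {0}: {1} characters", i, text.Length);
            }

            // Usual fix: a fresh strategy instance for every page.
            for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
            {
                var fresh = new SimpleTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), fresh);
                Console.WriteLine("Fresh strategy, page {0}: {1} characters", i, text.Length);
            }
        }
    }
}
```

Since your code already follows the second pattern, strategy re-use is not what is happening here.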

And indeed, each page of your document does contain the contents of all previous pages, too, merely outside its crop box. To extract only the text in the respective page crop box you have to filter, e.g. like this:

using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

string SRC = @"285187.pdf";

PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));

Console.WriteLine("\n285187 Filtered\n============\n");

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    // A new strategy per page, so no text is carried over between pages
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDoc.GetPage(i);

    // Only let text inside the page's crop box reach the strategy
    var filter = new IEventFilter[1];
    filter[0] = new TextRegionEventFilter(pdfPage.GetCropBox());
    var filteredTextEventListener = new FilteredTextEventListener(strategy, filter);

    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, filteredTextEventListener);

    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine(currentText);
}

pdfDoc.Close();
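To check whether some other PDF has the same problem, one option is to compare the unfiltered and the crop-box-filtered text for each page. The following is only a rough diagnostic sketch; the class name is hypothetical, and the file name is the one used above.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

class CropBoxCheckSketch
{
    static void Main()
    {
        using (var pdfDoc = new PdfDocument(new PdfReader(@"285187.pdf")))
        {
            for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
            {
                var page = pdfDoc.GetPage(i);

                // Everything in the page content stream, visible or not
                string all = PdfTextExtractor.GetTextFromPage(
                    page, new SimpleTextExtractionStrategy());

                // Only the text located inside the crop box
                var filters = new IEventFilter[] { new TextRegionEventFilter(page.GetCropBox()) };
                var listener = new FilteredTextEventListener(
                    new SimpleTextExtractionStrategy(), filters);
                string inCropBox = PdfTextExtractor.GetTextFromPage(page, listener);

                Console.WriteLine("Page {0}: {1} characters unfiltered, {2} inside crop box",
                    i, all.Length, inCropBox.Length);
            }
        }
    }
}
```

If the unfiltered count grows steadily from page to page while the filtered count does not, the pages most likely carry content from earlier pages outside their crop boxes.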

Note: if the strategy is changed to LocationTextExtractionStrategy, the extracted content is the same as it was originally.
