Extracting Text from a PDF Using the iText7 C# Library



iText7 is an open-source library for creating, modifying, and reading PDF documents. At the time of writing, iText7 is the latest version in its family. Earlier versions also exist, but this article uses iText7.

Here we assume that the PDF document contains either plain text or text laid out in a tabular format. To read that text with iText7, use the approach below. Note that if the document contains images, this approach will not extract their content.

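using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Returns the text of every page so the caller can actually use it.
public static string ExtractTextFromPDF(string filePath)
{
    StringBuilder text = new StringBuilder();
    // Disposing the document closes it, along with the reader it wraps.
    using (PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath)))
    {
        // iText page numbers start at 1.
        for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
        {
            // Create a fresh strategy per page so text does not carry over between pages.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
        }
    }
    return text.ToString();
}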
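
Since we are also assuming tabular text, here is a small variant that may be worth trying. It is only a sketch: it swaps in iText7's LocationTextExtractionStrategy, which orders text by its position on the page, and collects the text of all pages into one string. The class name PdfTextDemo, the method name ExtractTextByLocation, and the file name sample.pdf are placeholders chosen for this example. Whether the output is better than with SimpleTextExtractionStrategy depends on how the PDF was produced.

```csharp
using System;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfTextDemo
{
    // Hypothetical helper: returns the text of all pages, one page after another.
    public static string ExtractTextByLocation(string filePath)
    {
        StringBuilder text = new StringBuilder();
        using (PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath)))
        {
            for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
            {
                // Position-aware strategy; a new instance is used for each page.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
            }
        }
        return text.ToString();
    }

    public static void Main(string[] args)
    {
        // "sample.pdf" is a placeholder; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.pdf";
        Console.WriteLine(ExtractTextByLocation(path));
    }
}
```

Neither strategy reads text that exists only inside images, so the limitation mentioned above still applies.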

 

Labels: iText7, iText7 library, iText7 c#




