In this article, we will look at how to read text and images from a PDF file in C# using an open source library.
To read text and images from a PDF file page by page in C#, you can use a PDF library such as iTextSharp or PdfSharp.
iTextSharp is a popular library for creating and manipulating PDF content. It allows developers to generate PDF documents from scratch, as well as modify existing PDFs.
The library provides a wide range of features, such as adding text, images, and tables, creating bookmarks, and encrypting and digitally signing PDF documents.
It is widely used for generating dynamic PDF documents in .NET applications.
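As a rough illustration of the "from scratch" side, the sketch below writes a one-page PDF containing a paragraph and a small table. It assumes the iTextSharp 5.x package is referenced. The class name PdfCreationSketch, the method name CreateSamplePdf, and the sample strings are hypothetical and chosen only for this example.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

static class PdfCreationSketch
{
    // Hypothetical helper: writes a small one-page PDF to outputPath.
    public static void CreateSamplePdf(string outputPath)
    {
        using (FileStream stream = new FileStream(outputPath, FileMode.Create))
        {
            Document document = new Document(PageSize.A4);
            // Bind the document to the output stream before opening it
            PdfWriter.GetInstance(document, stream);
            document.Open();

            // Add a line of text
            document.Add(new Paragraph("Hello from iTextSharp"));

            // Add a two-column table with a single row
            PdfPTable table = new PdfPTable(2);
            table.AddCell("Name");
            table.AddCell("Value");
            document.Add(table);

            // Closing the document finishes the PDF and flushes it to the stream
            document.Close();
        }
    }
}
```

A file produced this way can also serve as a small test input when trying out code that reads PDFs.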
Here is example code that reads text and images using iTextSharp:
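using System.Collections.Generic;
using System.Drawing;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

string filePath = "path/to/pdf/file.pdf";
PdfReader reader = new PdfReader(filePath);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // Extract text from the current page
    string pageText = PdfTextExtractor.GetTextFromPage(reader, page);
    // Get all the images from the current page; the parser calls
    // ImageCollector.RenderImage once for each image it encounters
    ImageCollector collector = parser.ProcessContent(page, new ImageCollector());
    // Iterate through the collected images
    foreach (Image image in collector.Images)
    {
        // Process the image as needed
    }
}
reader.Close();

// Render listener that keeps every image found on a page and ignores text events
class ImageCollector : IRenderListener
{
    public List<Image> Images { get; } = new List<Image>();
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
    public void RenderImage(ImageRenderInfo renderInfo)
    {
        // Convert the PDF image object into a System.Drawing.Image
        PdfImageObject imageObject = renderInfo.GetImage();
        Image image = imageObject?.GetDrawingImage();
        if (image != null) Images.Add(image);
    }
}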
The code above reads a PDF file, loops through its pages, and extracts the text and images from each page. The text is extracted using the PdfTextExtractor class, and each image is obtained as a PdfImageObject. The GetDrawingImage() method returns the image as a System.Drawing.Image object, which can be processed as needed.
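One possible way to fill in the "Process the image as needed" step is to save each extracted image to disk. The sketch below assumes System.Drawing is available (on newer .NET versions this means the System.Drawing.Common package on Windows). The class name ExtractedImageSaver, the method name SavePageImage, and the file naming scheme are hypothetical and not part of iTextSharp.

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

static class ExtractedImageSaver
{
    // Hypothetical helper: saves one extracted image as a PNG file named
    // after the page number and the image's position on that page.
    public static string SavePageImage(Image image, string outputDirectory, int page, int index)
    {
        // Create the output directory if it does not exist yet
        Directory.CreateDirectory(outputDirectory);
        string path = Path.Combine(outputDirectory, $"page{page}_image{index}.png");
        // Re-encode the image as PNG regardless of its original format
        image.Save(path, ImageFormat.Png);
        return path;
    }
}
```

Inside the foreach loop, a call such as ExtractedImageSaver.SavePageImage(image, "extracted", page, index) would write each image to the "extracted" folder, where index is a counter you increment per image on the page.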