Extract Data from PDF in C# with PDF Extractor
PDF files are widely used for storing documents because they preserve formatting across devices. Working with PDFs, however, often means extracting specific content (images, text, metadata, or structured data) for reuse, analysis, or editing. Once you are comfortable with PDF extraction, you can save time, improve your workflows, and gain deeper insight into the files you work with.
Key Features
PDFs often contain logos, diagrams, photos, or scanned images. Extracting these images lets you reuse them without copying entire pages.
High-Resolution Image Extraction: retrieve images exactly as they appear in your PDF for professional use.
Text extraction lets you convert the readable content of a PDF into editable text. This is especially helpful when you need to repurpose or analyze written content. Choose from three precision modes to suit your needs:
Pure Mode — Retains original formatting for structured output
Raw Mode — Extracts plain text without formatting
Flatten Mode — Removes special characters and formatting for clean, minimal text
Properties extraction lets you retrieve information about a PDF document. The available properties are FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, and Number of Pages.
PDF forms are widely used in applications, surveys, invoices, and contracts. They allow users to enter information directly into interactive fields. But once the forms are filled out, organizations often need to extract that data for storage, reporting, or analysis.
Add using Documentize; to your code.
Optionally, set a license with License.Set("license.lic");
To extract images, create ExtractImagesOptions with the input file path and other necessary settings.
Call PdfExtractor.Extract with the ExtractImagesOptions instance as the parameter.
Read the extracted images from ResultContainer.ResultCollection.
To extract text, create ExtractTextOptions and set the input PDF.
Call PdfExtractor.Extract with the ExtractTextOptions instance as the parameter and access the extracted text.
To export form data, create ExtractFormDataToDsvOptions to configure the export to CSV.
Call the PdfExtractor.Extract method, passing the options as the parameter.
PDF Extractor for .NET is a tool for extracting images, text, metadata, and form data from PDF documents. It integrates into your .NET application and gives you straightforward access to the content of your PDFs.
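The text extraction flow described on this page (create ExtractTextOptions, pass it to PdfExtractor.Extract, read the text) might look roughly like the sketch below. Only the names Documentize, License.Set, ExtractTextOptions, and PdfExtractor.Extract come from this page. The constructor that takes a file path, the printable return value, and the file name input.pdf are assumptions made for illustration, so check them against the library's API reference before relying on them.

```csharp
using System;
using Documentize; // namespace named on this page

class ExtractTextExample
{
    static void Main()
    {
        // Optional, per this page: apply a license file before extracting.
        // License.Set("license.lic");

        // Assumption: ExtractTextOptions accepts the input PDF path in its constructor.
        // "input.pdf" is a hypothetical file name.
        var options = new ExtractTextOptions("input.pdf");

        // Assumption: with text options, Extract returns the extracted text
        // (or an object whose string form is that text).
        var text = PdfExtractor.Extract(options);

        Console.WriteLine(text);
    }
}
```

The image and form-data flows follow the same shape with ExtractImagesOptions and ExtractFormDataToDsvOptions; how those options receive their input and output paths is not specified on this page.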
No, this plugin is specifically for extraction from PDFs. For other PDF-related tasks, you can explore the additional plugins available in the Documentize library or use the full library for document processing.
Extracting this data is useful for analyzing documents, preparing reports, and working with AI.
Currently, this plugin extracts images in PNG format and exports form data in CSV format. If you need other formats, such as JSON or XML, you may need to use additional tools or convert the output yourself.
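As one way of converting the output yourself, the sketch below turns comma-separated text into a JSON array using only the .NET standard library (System.Text.Json). It is not part of PDF Extractor. The class CsvToJson, the method ToJson, and the sample data are hypothetical, and the code assumes the exported file has a header row, uses a comma delimiter, and contains no quoted fields or embedded commas. Real-world CSV may need a proper parser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

static class CsvToJson
{
    // Converts simple comma-separated text (first line = header row)
    // into a JSON array with one object per data row.
    // Assumption: no quoted fields, embedded commas, or embedded newlines.
    public static string ToJson(string csv)
    {
        var lines = csv.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0) return "[]";

        var headers = lines[0].Split(',');
        var rows = new List<Dictionary<string, string>>();

        foreach (var line in lines.Skip(1))
        {
            var cells = line.Split(',');
            var row = new Dictionary<string, string>();
            for (var i = 0; i < headers.Length; i++)
            {
                // Missing trailing cells become empty strings.
                row[headers[i].Trim()] = i < cells.Length ? cells[i].Trim() : "";
            }
            rows.Add(row);
        }

        return JsonSerializer.Serialize(rows);
    }

    static void Main()
    {
        // Hypothetical sample; in practice, read the exported file with File.ReadAllText.
        var csv = "Name,City\nAda,London\nLinus,Helsinki";
        Console.WriteLine(ToJson(csv));
    }
}
```

Every value is kept as a string here; if the form contains numbers or dates, you would parse them before serializing.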
If the PDF is scanned or contains images of text, an OCR (Optical Character Recognition) process may be required to convert the image-based text into an editable format.