PDF Extractor in C#

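Extract Data from PDF in C# with PdfExtractor. PDFs are widely used for storing documents because they preserve formatting consistently across devices. When working with PDFs, however, you often need to extract specific content, such as images, text, metadata, or structured data, so that it can be reused, analyzed, or edited. Mastering PDF extraction helps you save time, streamline workflows, and gain deeper insight into the files you process.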

Key Features

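🔹 Extract Images

PDFs often contain logos, charts, photos, or scanned images. Extracting these images lets you reuse them without copying entire pages. High-resolution image extraction retrieves images from the PDF at their original quality for professional use.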

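🔹 Extract Text

Text extraction converts the readable content of a PDF into editable text. This is especially useful when you need to repurpose or analyze written content. Three formatting modes are available to suit your needs:

Pure Mode: preserves the original formatting for structured output

Raw Mode: extracts plain text without formatting

Flatten Mode: removes special characters and formatting to produce concise, minimal text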

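🔹 Extract Properties (Metadata)

Property extraction gives you information about the PDF document itself. Properties that may be of interest include: FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, Number of Pages.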

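🔹 Export AcroForms Data

PDF forms are widely used in applications, surveys, invoices, and contracts. Users can enter information directly into interactive fields. Once a form has been filled in, however, organizations usually need to extract that data for storage, reporting, or analysis.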

Getting Started

Download the assembly files, or install the package from NuGet.
Reference Documentize in your .NET project.
Add using Documentize; to your source file.
Optionally, set your license: License.Set("license.lic");
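
The setup steps above might look like the following in a console application. This is a minimal sketch: the using directive and the License.Set call come from the steps themselves, while the Program class is only illustrative and license.lic is a placeholder for the path to your own license file.

```csharp
using Documentize;

internal static class Program
{
    private static void Main()
    {
        // Optional step: apply a license before calling any extraction API.
        // "license.lic" is a placeholder path, not a file shipped with the library.
        License.Set("license.lic");
    }
}
```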

Why Choose PDF Extractor

Ideal for developers and businesses managing visual content in reports, presentations, and archives.
Fast, efficient extraction for easy content reuse.
Multiple extraction modes for maximum flexibility.
Seamless .NET integration for simplified workflows.
Supported operating systems include Windows 7-11, Windows Server 2003-2022, macOS (10.12+), and Linux.
Supports .NET framework versions from 4.0 to 8.0.
Compatible with various Microsoft Visual Studio versions.
Detailed, high-quality documentation.

How to Extract Images with PDF Extractor

Configure ExtractImagesOptions with the input file path and other necessary settings
Call PdfExtractor.Extract with the ExtractImagesOptions instance as the parameter
Access the extracted images through ResultContainer.ResultCollection

Via .NET
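
The image extraction steps above could be sketched roughly as follows. ExtractImagesOptions, PdfExtractor.Extract, and ResultCollection are named in the steps; the AddInput and AddOutput methods, the FileData and DirectoryData wrappers, the ToFile() call, and both paths are assumptions made for illustration and should be checked against the Documentize API reference.

```csharp
using System;
using Documentize;

internal static class ExtractImagesExample
{
    private static void Main()
    {
        // Options object named in the steps above.
        var options = new ExtractImagesOptions();

        // Assumption: inputs and outputs are registered through
        // AddInput/AddOutput using FileData and DirectoryData wrappers.
        options.AddInput(new FileData("input.pdf"));               // placeholder path
        options.AddOutput(new DirectoryData("extracted-images"));  // placeholder folder

        // Run the extraction; the result exposes ResultCollection.
        var results = PdfExtractor.Extract(options);

        foreach (var result in results.ResultCollection)
        {
            // Assumption: ToFile() returns the path of a saved image.
            Console.WriteLine(result.ToFile());
        }
    }
}
```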

How to Extract Text from PDF

Create an instance of ExtractTextOptions and set the input PDF
Call PdfExtractor.Extract with the ExtractTextOptions instance as the parameter, then access the extracted text

Via .NET
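
A minimal sketch of the text extraction steps above, under stated assumptions: ExtractTextOptions and PdfExtractor.Extract are named in the steps, but the constructor signature (input path plus a formatting mode) and the TextFormattingMode enum are assumed here to correspond to the Pure, Raw, and Flatten modes described earlier. The input path is a placeholder.

```csharp
using System;
using Documentize;

internal static class ExtractTextExample
{
    private static void Main()
    {
        // Assumption: the constructor takes the input path and a
        // TextFormattingMode value selecting one of the three modes.
        var options = new ExtractTextOptions("input.pdf", TextFormattingMode.Pure);

        // Extract and print whatever the call returns as text.
        var text = PdfExtractor.Extract(options);
        Console.WriteLine(text);
    }
}
```

Switching the assumed mode value would be the natural way to compare the output of the three modes on the same document.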

How to Export PDF Form Field Data

Create an instance of ExtractFormDataToDsvOptions to configure the process of exporting data to CSV
Add input and output files to the options
Call the PdfExtractor.Extract method, passing the options as a parameter

Via .NET
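
The export steps above might translate into something like the sketch below. ExtractFormDataToDsvOptions and PdfExtractor.Extract come from the steps; the constructor arguments (a delimiter character and a flag for including field names), the AddInput and AddOutput methods, and the FileData wrapper are assumptions for illustration. Both file paths are placeholders.

```csharp
using Documentize;

internal static class ExportFormDataExample
{
    private static void Main()
    {
        // Assumption: the constructor accepts the delimiter and a flag
        // that controls whether field names are written to the output.
        var options = new ExtractFormDataToDsvOptions(',', true);

        // Assumption: input and output files are added as FileData.
        options.AddInput(new FileData("form.pdf"));        // placeholder input
        options.AddOutput(new FileData("form-data.csv"));  // placeholder output

        // Export the form field values to the CSV file.
        PdfExtractor.Extract(options);
    }
}
```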

Frequently Asked Questions

What is PDF Extractor?

PDF Extractor for .NET is a tool for quickly extracting images, text, metadata, and form data from PDF documents. It integrates into your .NET application and offers a straightforward way to access the content of PDFs.

Can I use PDF Extractor for .NET for other PDF operations?

No, this plugin is specifically for extraction from PDFs. For other PDF-related tasks, you can explore the additional plugins available in the Documentize library or use the full library for document processing.

Why would I need to extract text/images/metadata/form data from a PDF?

Extracting this data is useful for analyzing documents, preparing reports, and working with AI.

What types of output formats does it support?

Currently, this plugin extracts images in PNG format and exports form data in CSV format. If you need other formats such as JSON or XML, you may need to use additional tools or convert the output yourself.

Can I extract text from scanned PDFs?

If the PDF is scanned or contains images of text, an OCR (Optical Character Recognition) process may be required to convert the image-based text into an editable format.

PDF Extractor in C# .NET

Extract images, text, metadata, and form data from PDF documents with Documentize