PDF Extractor in C#/.NET

Extract images, text, metadata, and form data from PDF documents

PDF Extractor

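Extract Data from PDF in C# with PDF Extractor. PDFs are widely used to store documents because they preserve formatting across devices. When working with PDFs, however, you often need to extract specific content, such as images, text, metadata, or structured data, so that it can be reused, analyzed, or edited. Knowing how to extract PDF content helps you save time, streamline workflows, and gain deeper insight into the files you work with.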

Key Features

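PDFs often contain logos, charts, photos, or scanned images. Extracting these images lets you reuse them without copying entire pages. High-resolution image extraction: images are extracted exactly as they appear in the PDF, which suits professional use.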

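Text extraction converts the readable content of a PDF into editable text. This is especially useful when you need to repurpose or analyze written content. Three precision modes are available to suit different needs: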

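Pure Mode: preserves the original formatting and produces structured output

Raw Mode: extracts plain text without preserving any formatting

Flatten Mode: removes special characters and formatting to produce clean, compact text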

Metadata extraction retrieves information about the PDF document. Properties of interest may include: FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, Number of Pages.

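PDF forms are widely used for applications, surveys, invoices, and contracts. Users can enter information directly into interactive fields. Once the forms are filled in, organizations often need to extract that data for storage, reporting, or analysis.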

Getting Started

Why Choose PDF Extractor

  • Ideal for developers and businesses managing visual content in reports, presentations, and archives.
  • Fast, efficient extraction for easy content reuse.
  • Multiple extraction modes for maximum flexibility.
  • Seamless .NET integration for simplified workflows.
  • Supported operating systems include Windows 7-11, Windows Server 2003-2022, macOS (10.12+), and Linux.
  • Supports framework versions from .NET Framework 4.0 through .NET 8.0.
  • Compatible with various Microsoft Visual Studio versions.
  • Detailed, high-quality documentation.

How to Extract Images with PDF Extractor

  • Configure ExtractImagesOptions with the input file path and any other necessary settings
  • Call PdfExtractor.Extract with the ExtractImagesOptions instance as the parameter
  • Access the extracted images through ResultContainer.ResultCollection
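
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: PdfExtractor.Extract, ExtractImagesOptions, and ResultCollection come from the steps above, while the Documentize namespace, AddInput, AddOutput, FileData, DirectoryData, and the paths are assumptions to verify against the API reference for your installed version.

```csharp
using System;
using Documentize; // assumed namespace of the PDF Extractor package

// Configure the extraction options.
// ASSUMPTION: AddInput/AddOutput, FileData, and DirectoryData are the
// helpers used to register input and output locations; paths are placeholders.
var options = new ExtractImagesOptions();
options.AddInput(new FileData("input.pdf"));
options.AddOutput(new DirectoryData("extracted-images"));

// Run the extraction; the returned container holds the results.
var results = PdfExtractor.Extract(options);

// Walk the result collection; each entry is expected to correspond
// to one extracted image.
foreach (var result in results.ResultCollection)
{
    Console.WriteLine(result);
}
```

Writing to an output directory rather than a single file fits the case where one PDF yields many images.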


How to Extract Text from PDF

  • Create an instance of ExtractTextOptions and set the input PDF
  • Call PdfExtractor.Extract with the ExtractTextOptions instance as the parameter, then access the extracted text
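
As a rough illustration of these two steps, a minimal sketch might look like the one below. ExtractTextOptions and PdfExtractor.Extract are named in the steps above. The Documentize namespace, the constructor that takes the input path, the file name, and the assumption that the call returns something printable as text are not confirmed here.

```csharp
using System;
using Documentize; // assumed namespace of the PDF Extractor package

// ASSUMPTION: ExtractTextOptions accepts the input PDF path in its
// constructor; "input.pdf" is a placeholder.
var options = new ExtractTextOptions("input.pdf");

// Run the extraction. The result is assumed to be the extracted text
// (or an object whose string form is that text).
var text = PdfExtractor.Extract(options);

Console.WriteLine(text);
```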


How to Export PDF Form Field Data

  • Create an instance of ExtractFormDataToDsvOptions to configure the process of exporting data to CSV
  • Add input and output files to the options
  • Call the PdfExtractor.Extract method, passing the options as a parameter
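
The export steps above could look roughly like this. Treat it as a sketch under stated assumptions: ExtractFormDataToDsvOptions and PdfExtractor.Extract come from the steps, whereas the Documentize namespace, the constructor arguments (delimiter and header flag), AddInput, AddOutput, FileData, and the file names are assumed for illustration and should be checked against the API reference.

```csharp
using Documentize; // assumed namespace of the PDF Extractor package

// ASSUMPTION: the constructor takes the field delimiter and a flag that
// controls whether field names are written as a header row.
var options = new ExtractFormDataToDsvOptions(',', true);

// ASSUMPTION: AddInput/AddOutput with FileData register the source PDF
// and the destination CSV file; both paths are placeholders.
options.AddInput(new FileData("filled-form.pdf"));
options.AddOutput(new FileData("form-data.csv"));

// Run the export; the form field values are written to the output file.
PdfExtractor.Extract(options);
```

A comma delimiter produces CSV. If the delimiter is configurable as assumed here, other delimiter-separated layouts (such as tab-separated values) would use the same flow.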


How to Extract Properties from PDF


Frequently Asked Questions

What is PDF Extractor?

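PDF Extractor for .NET is a powerful tool designed to extract images, text, metadata, or form data from PDF documents quickly and easily. It integrates seamlessly into your .NET applications and offers a user-friendly way to access the visual content in your PDFs.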

Can I use PDF Extractor for .NET for other PDF operations?

No, this plugin is designed specifically for extracting content from PDFs. For other PDF-related tasks, you can explore the additional plugins available in the Documentize library or use the full library for document processing.

Why would I need to extract text/images/metadata/form data from a PDF?

Extracting this data is useful for analyzing documents, preparing reports, and working with AI.

What types of output formats does it support?

Currently, this plugin extracts images in PNG format. Form data is exported to CSV format only. If you need other formats such as JSON or XML, you may need to use additional tools or convert the output yourself.

Can I extract text from scanned PDFs?

If the PDF is scanned or contains images of text, an OCR (Optical Character Recognition) process may be required to convert the image-based text into an editable format.
