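Extract PDF Data in C# with PDF Extractor
PDFs are widely used to store documents because they keep formatting consistent across devices. Working with PDFs, however, often means extracting specific content (images, text, metadata, or structured data) for reuse, analysis, or editing. Knowing how to extract that content can save time, streamline workflows, and yield more insight from your files.
Key Features
PDFs often contain logos, charts, photos, or scanned images. Extracting these images lets you reuse them without copying entire pages. High-resolution image extraction: fully preserves the original quality of the images in the PDF, making them suitable for professional use.
Text extraction converts the readable content of a PDF into editable text. This is especially useful when you need to reuse or analyze written content. Three precision modes are available:
Pure Mode: preserves the original formatting and produces structured output
Raw Mode: extracts plain text without formatting
Flatten Mode: removes special characters and formatting to give clean plain text
Property extraction retrieves information about the PDF document. Properties of interest include: FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, Number of Pages.
PDF forms are widely used for applications, surveys, invoices, and contracts, and let users fill in information directly in interactive fields. Once a form has been completed, however, organizations usually need to extract that data for storage, reporting, or analysis.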
In every case, start by adding using Documentize; to your code and, optionally, calling License.Set("license.lic");.

To extract images:
1. Create ExtractImagesOptions with the input file path and other necessary settings.
2. Call PdfExtractor.Extract with the ExtractImagesOptions instance as its parameter.
3. Read the results from ResultContainer.ResultCollection.

To extract text:
1. Create ExtractTextOptions and set the input PDF.
2. Call PdfExtractor.Extract with the ExtractTextOptions instance as its parameter, then access the extracted text.

To export form data:
1. Create ExtractFormDataToDsvOptions to configure the export to CSV.
2. Call the PdfExtractor.Extract method, passing the options as a parameter.

PDF Extractor for .NET is a tool for extracting images, text, metadata, or form data from PDF documents quickly and easily. It integrates into your .NET application and offers a user-friendly way to access PDF content.
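The extraction steps described above can be sketched in C# roughly as follows. The names License.Set, ExtractImagesOptions, ExtractTextOptions, ExtractFormDataToDsvOptions, PdfExtractor.Extract, and ResultContainer.ResultCollection come from the text. Everything else is an assumption to check against the Documentize API reference for your version: the constructor arguments, AddInput/AddOutput, FileDataSource, DirectoryDataSource, TextFormattingMode, and the file names. The sketch also needs the Documentize package and an input PDF to run.

```csharp
using System;
using Documentize;

public static class PdfExtractorSketch
{
    public static void Main()
    {
        // Optional step from the text: apply a license file first.
        License.Set("license.lic");

        // --- Images ---
        // ASSUMPTION: input and output are registered through AddInput/AddOutput
        // with FileDataSource / DirectoryDataSource; "input.pdf" and "images"
        // are placeholder paths.
        var imageOptions = new ExtractImagesOptions();
        imageOptions.AddInput(new FileDataSource("input.pdf"));
        imageOptions.AddOutput(new DirectoryDataSource("images"));
        ResultContainer imageResults = PdfExtractor.Extract(imageOptions);
        foreach (var result in imageResults.ResultCollection)
        {
            // One entry per extracted image.
            Console.WriteLine(result);
        }

        // --- Text ---
        // ASSUMPTION: the constructor takes the input path and a TextFormattingMode
        // value (Pure, Raw, or Flatten, matching the three modes in the text).
        var textOptions = new ExtractTextOptions("input.pdf", TextFormattingMode.Pure);
        // How the extracted text is exposed on the return value is not stated
        // in the text, so the result is kept in a var and printed as-is.
        var textResult = PdfExtractor.Extract(textOptions);
        Console.WriteLine(textResult);

        // --- Form data to CSV ---
        // ASSUMPTION: the constructor takes the delimiter and a flag that
        // controls whether field names are written as a header row.
        var formOptions = new ExtractFormDataToDsvOptions(',', true);
        formOptions.AddInput(new FileDataSource("form.pdf"));
        formOptions.AddOutput(new FileDataSource("form-data.csv"));
        PdfExtractor.Extract(formOptions);
    }
}
```

All three flows share one shape: build an options object, then pass it to PdfExtractor.Extract. Only the options type changes between images, text, and form data.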
This component is specifically for extraction from PDFs. For other PDF-related tasks, you can explore the additional components available in the Documentize library or use its full document-processing capabilities.
Extracting this data is useful for analyzing documents, preparing reports, and working with AI.
Currently, this component extracts images in PNG format and exports form data to CSV only. If you need other formats such as JSON or XML, you may need to use additional tools or convert the output yourself.
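If you need JSON rather than CSV, one way to convert the output yourself is a small post-processing step. The sketch below is not part of PDF Extractor. It uses only the .NET standard library, and the class and method names (FormCsvToJson, ToJson) are hypothetical. It assumes the exported file has field names in its first row and contains no quoted fields or embedded delimiters. For anything more complex, a dedicated CSV parser is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class FormCsvToJson
{
    // Converts simple CSV text (header row first, no quoted fields)
    // into a JSON array with one object per data row.
    public static string ToJson(string csv, char delimiter = ',')
    {
        string[] lines = csv.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        var rows = new List<Dictionary<string, string>>();
        if (lines.Length == 0)
        {
            return JsonSerializer.Serialize(rows);
        }

        string[] header = lines[0].Split(delimiter);
        for (int i = 1; i < lines.Length; i++)
        {
            string[] cells = lines[i].Split(delimiter);
            var row = new Dictionary<string, string>();
            for (int c = 0; c < header.Length; c++)
            {
                // Rows shorter than the header get empty strings for the missing cells.
                row[header[c]] = c < cells.Length ? cells[c] : "";
            }
            rows.Add(row);
        }
        return JsonSerializer.Serialize(rows);
    }

    public static void Main()
    {
        // Placeholder data standing in for an exported form-data file.
        string csv = "Name,City\nAda,London\nAlan,Manchester\n";
        Console.WriteLine(ToJson(csv));
    }
}
```

To convert a real export, replace the placeholder string with the file contents, for example File.ReadAllText("form-data.csv") from System.IO.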
If the PDF is scanned or contains images of text, an OCR (Optical Character Recognition) process may be required to convert the image-based text into an editable format.