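Extract Data from PDF in C# with PdfExtractor
PDFs are widely used to store documents because they keep formatting consistent across devices. When working with PDFs, however, you often need to extract specific content, such as images, text, metadata, or structured data, in order to reuse, analyze, or edit it. Knowing how to extract content from PDFs saves time, streamlines workflows, and gives you deeper insight into the files you handle.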
Key Features
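PDFs often contain logos, charts, photos, or scanned images. Extracting these images lets you reuse them without copying the entire page. High-resolution image extraction: retrieve images from the PDF at their original quality for professional use.
Text extraction converts the readable content of a PDF into editable text. This is especially useful when you need to repurpose or analyze written content. Choose among three modes according to your needs:
Pure Mode: preserves the original formatting for structured output
Raw Mode: extracts plain text without formatting
Flatten Mode: strips special characters and formatting to produce clean, minimal text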
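As a rough illustration of text extraction, the sketch below uses only the names this page mentions (using Documentize, License.Set, ExtractTextOptions, PdfExtractor.Extract). The constructor that takes a file path, the file name input.pdf, and the shape of the returned value are assumptions rather than confirmed API details, and mode selection is not shown; check the library reference before relying on it.

```csharp
using System;
using Documentize;

// Optional, per the steps on this page: apply a license file.
// License.Set("license.lic");

// ASSUMPTION: ExtractTextOptions accepts the input PDF path in its constructor.
var options = new ExtractTextOptions("input.pdf"); // hypothetical file name

// Call PdfExtractor.Extract with the options instance, as this page describes.
var result = PdfExtractor.Extract(options);

// ASSUMPTION: depending on the library version, the result may be the text itself
// or a container object (this page mentions ResultContainer.ResultCollection).
Console.WriteLine(result);
```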
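Property extraction lets you retrieve information about a PDF document. Properties of interest include: FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, Number of Pages.
PDF forms are widely used in applications, surveys, invoices, and contracts. Users can enter information directly into interactive fields. Once a form has been filled in, however, organizations usually need to extract that data for storage, reporting, or analysis.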
Add using Documentize; to your file.
Optionally, set the license with License.Set("license.lic");
Create ExtractImagesOptions with the input file path and other necessary settings
Call PdfExtractor.Extract with an instance of ExtractImagesOptions as parameter
Read the results from ResultContainer.ResultCollection
Create ExtractTextOptions and set the input PDF
Call PdfExtractor.Extract with an instance of ExtractTextOptions as parameter and access the extracted text
Create ExtractFormDataToDsvOptions to configure the process of exporting data to CSV
Call the PdfExtractor.Extract method, passing the options as a parameter
PDF Extractor for .NET is a tool for extracting images, text, metadata, and form data from PDF documents. It integrates into your .NET application and offers a straightforward way to access the content of PDFs.
No, this plugin is specifically for extraction from PDFs. For other PDF-related tasks, you can explore the additional plugins available in the Documentize library or use its full document-processing capabilities.
Extracting this data is useful for analyzing documents, preparing reports, and working with AI.
Currently this plugin extracts images in PNG format. Form data is exported specifically to CSV format. If you need other formats such as JSON or XML, you may need to use additional tools or convert the output yourself.
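If you need JSON instead of CSV, one way to convert the output yourself is sketched below with the .NET standard library only. It assumes the exported file has a header row of field names followed by one row of values per form, which this page does not confirm. It also assumes values contain no quoted fields or embedded delimiters. FormDataConverter and DsvToJson are hypothetical names introduced for this example, not part of Documentize.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class FormDataConverter
{
    // Converts delimiter-separated text (header row + data rows) into a JSON array.
    // ASSUMPTION: no quoted fields and no delimiters inside values.
    public static string DsvToJson(string dsv, char delimiter = ',')
    {
        var lines = dsv.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        var records = new List<Dictionary<string, string>>();
        if (lines.Length == 0) return JsonSerializer.Serialize(records);

        // The first line is treated as the list of form field names.
        var headers = lines[0].Split(delimiter);
        for (int i = 1; i < lines.Length; i++)
        {
            var values = lines[i].Split(delimiter);
            var record = new Dictionary<string, string>();
            for (int j = 0; j < headers.Length; j++)
            {
                // Missing trailing values become empty strings.
                record[headers[j].Trim()] = j < values.Length ? values[j].Trim() : "";
            }
            records.Add(record);
        }
        return JsonSerializer.Serialize(records);
    }
}
```

A typical call might read the exported file and print the result, for example Console.WriteLine(FormDataConverter.DsvToJson(System.IO.File.ReadAllText("form-data.csv"))); where form-data.csv is a placeholder path. For files with quoted values, a dedicated CSV parser would be a safer choice than splitting on the delimiter.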
If the PDF is scanned or contains images of text, an OCR (Optical Character Recognition) process may be required to convert the image-based text into an editable format.