PDF Extractor

Extract data from PDFs in C# with PDF Extractor. PDFs are widely used to store documents because they preserve formatting across devices. Working with PDFs, however, often means extracting specific content, such as images, text, metadata, or structured data, for reuse, analysis, or editing. Mastering PDF extraction techniques saves time, improves your workflow, and lets you get more out of the files you process.

Key Features

🔹 Image Extraction

PDFs often contain logos, charts, photos, or scanned images. Extracting them lets you reuse the images without copying the whole page. High-resolution image extraction: retrieve images exactly as they appear in the PDF, ready for professional use.

🔹 Text Extraction

Text extraction converts the readable content of a PDF into editable text. This is especially useful when you need to reuse or analyze written content. Choose from three modes to suit your needs:

Pure Mode: preserves the original formatting for structured output

Raw Mode: extracts plain text without formatting

Flatten Mode: removes special characters and formatting for clean, minimal text

🔹 Property Extraction (Metadata)

Property extraction lets you retrieve information about the PDF document. Properties you may be interested in: FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, Number of Pages.

🔹 Exporting Data from AcroForms

PDF forms are widely used in applications, surveys, invoices, and contracts. They let users enter information directly into interactive fields. Once a form has been filled in, however, organizations often need to extract that data for storage, reporting, or analysis.

Getting Started

Download the assembly files, or install the package from NuGet.
Reference Documentize in your .NET project.
Add using Documentize; to your source file.
Optionally, set your license with License.Set("license.lic");
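
The setup steps above can be sketched as follows. This is a minimal sketch that assumes the Documentize package is already referenced and that a file named license.lic is reachable from the working directory. Only the using directive and the License.Set call come from the steps above; the surrounding Program class is illustrative.

```csharp
using Documentize;

public static class Program
{
    public static void Main()
    {
        // Optional step: apply a license before calling any extractor.
        // "license.lic" is the placeholder path used in the steps above.
        License.Set("license.lic");
    }
}
```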

Why Choose PDF Extractor

Ideal for developers and businesses managing visual content in reports, presentations, and archives.
Fast, efficient extraction for easy content reuse.
Multiple extraction modes for maximum flexibility.
Seamless .NET integration for simplified workflows.
Supported operating systems include Windows 7-11, Windows Server 2003-2022, macOS (10.12+), and Linux.
Supports .NET versions from 4.0 to 8.0.
Compatible with various Microsoft Visual Studio versions.
Detailed, high-quality documentation.

How to Extract Images with PDF Extractor

Configure ExtractImagesOptions with the input file path and other necessary settings
Call PdfExtractor.Extract with the ExtractImagesOptions instance as the parameter
Access the extracted images through ResultContainer.ResultCollection
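
The three steps above might look like the sketch below. PdfExtractor.Extract, ExtractImagesOptions, and ResultCollection are named in the steps; AddInput, AddOutput, FileData, and DirectoryData are assumptions about how input and output are configured, and the file paths are hypothetical. Check every member name against the API reference for your version before relying on it.

```csharp
using System;
using Documentize;

public static class ExtractImagesExample
{
    public static void Main()
    {
        // Step 1: configure the options.
        // AddInput, AddOutput, FileData and DirectoryData are ASSUMED names.
        var options = new ExtractImagesOptions();
        options.AddInput(new FileData("input.pdf"));        // hypothetical input path
        options.AddOutput(new DirectoryData("extracted"));  // hypothetical output folder

        // Step 2: run the extraction.
        var results = PdfExtractor.Extract(options);

        // Step 3: walk the result collection (assumed to be enumerable).
        // What each entry prints depends on the library's result type.
        foreach (var result in results.ResultCollection)
        {
            Console.WriteLine(result);
        }
    }
}
```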

How to Extract Text from PDF

Create an instance of ExtractTextOptions and set the input PDF
Call PdfExtractor.Extract with the ExtractTextOptions instance as the parameter and access the extracted text
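
A minimal sketch of these two steps follows. It assumes that the ExtractTextOptions constructor accepts the input path together with a formatting mode, that the Pure, Raw, and Flatten modes described earlier are exposed through an enum named TextFormattingMode, and that Extract returns the text directly for text options. None of these details are stated in the steps above, so treat them as assumptions to verify.

```csharp
using System;
using Documentize;

public static class ExtractTextExample
{
    public static void Main()
    {
        // ASSUMPTION: constructor takes (input path, formatting mode), and
        // TextFormattingMode offers Pure, Raw and Flatten values.
        // "input.pdf" is a hypothetical path.
        var options = new ExtractTextOptions("input.pdf", TextFormattingMode.Pure);

        // ASSUMPTION: for text options, Extract returns the extracted text itself.
        var text = PdfExtractor.Extract(options);
        Console.WriteLine(text);
    }
}
```

Switching the mode argument would be the natural way to compare the structured, plain, and flattened outputs on the same file.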

Frequently Asked Questions

What is PDF Extractor?

PDF Extractor for .NET is a tool for extracting images, text, metadata, and form data from PDF documents quickly and easily. It integrates into your .NET application and offers a user-friendly way to access content from PDFs.

Can I use PDF Extractor for .NET for other PDF operations?

No, this plugin is specifically for extracting content from PDFs. For other PDF-related tasks, you can explore the additional plugins available in the Documentize library or use its full capabilities for document processing.

Why would I need to extract text/images/metadata/form data from a PDF?

Extracting this data is useful for analyzing documents, preparing reports, and working with AI.

What types of output formats does it support?

Currently, this plugin extracts images in PNG format and exports form data in CSV format. If you need other formats, such as JSON or XML, you may need to use additional tools or customize the output yourself.
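
As one way to customize the output yourself, the sketch below converts CSV text to JSON using only the .NET base libraries. It does not call the Documentize API. It assumes the exported CSV has a header row of field names followed by one row per form, which is an assumption about the file layout rather than something stated here. The class name FormCsvToJson is hypothetical. System.Text.Json ships with .NET Core 3.0 and later; on .NET Framework it is available as a NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.Json;

public static class FormCsvToJson
{
    // Splits one CSV line into fields, honoring double-quoted fields and "" escapes.
    // Limitation: line breaks inside quoted fields are not supported.
    public static List<string> SplitLine(string line, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // "" inside quotes is a literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }

    // ASSUMPTION: the first line holds field names; each later line is one record.
    public static string ToJson(string csv)
    {
        var lines = csv.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0) return "[]";

        var header = SplitLine(lines[0]);
        var records = new List<Dictionary<string, string>>();
        foreach (var line in lines.Skip(1))
        {
            var values = SplitLine(line);
            var record = new Dictionary<string, string>();
            for (int i = 0; i < header.Count; i++)
            {
                // Missing trailing values become empty strings.
                record[header[i]] = i < values.Count ? values[i] : "";
            }
            records.Add(record);
        }
        return JsonSerializer.Serialize(records);
    }
}
```

A call such as FormCsvToJson.ToJson(File.ReadAllText("forms.csv")), where forms.csv is a hypothetical exported file, would return a JSON array with one object per row.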

Can I extract text from scanned PDFs?

If the PDF is scanned or contains images of text, an OCR (Optical Character Recognition) process may be required to convert the image-based text into an editable format.

PDF Extractor in C#/.NET

Extract images, text, metadata, and form data from PDF documents