Documind is an open-source platform that uses AI to extract structured data from documents, particularly PDFs. You describe the fields you want in a customizable schema, and Documind automates the conversion and analysis of the document content and returns the matching data. It integrates through an API and can be deployed locally or in the cloud, which makes it useful for developers and businesses that want to streamline their data processing workflows.
Key Features
- PDF to Image Conversion: Converts PDFs into images so the AI model can analyze them.
- AI-Powered Data Extraction: Uses the OpenAI API to identify and structure relevant information from various document formats.
- Customizable Extraction Schemas: Define schemas that specify which fields to extract for each document type.
- Versatile Deployment: Runs in both local and cloud environments.
Explore the Hosted Version 🚀
A hosted version of Documind will be available soon, offering a fully managed API. Request access to extract data without installing anything yourself.
System Requirements
Documind depends on Ghostscript for PDF operations and GraphicsMagick for image processing. Install both before using Documind:
# Installation for macOS
brew install ghostscript graphicsmagick
# Installation for Debian/Ubuntu
sudo apt-get update
sudo apt-get install -y ghostscript graphicsmagick
Documind also requires Node.js (v18+) and npm.
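To catch an outdated runtime early, a startup check along these lines may help. This is a minimal sketch and not part of Documind: `meetsMinimumNode` is a hypothetical helper, and the threshold of 18 is taken from the requirement above.

```javascript
// Hypothetical helper (not part of Documind): returns true when a Node.js
// version string such as "18.17.0" or "v20.1.0" has a major version at or
// above `minMajor`.
function meetsMinimumNode(version, minMajor = 18) {
  const major = Number.parseInt(version.replace(/^v/, "").split(".")[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// process.versions.node holds the running version without a leading "v".
if (!meetsMinimumNode(process.versions.node)) {
  console.warn(`Node.js ${process.versions.node} detected; v18 or newer is required.`);
}
```

The check covers only Node.js. Ghostscript and GraphicsMagick still need to be installed with the commands above.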
Getting Started with Documind
Define Your Schema
To effectively use Documind, start by creating a schema that outlines the data fields you want to extract. For example, a schema to extract information from a bank statement might look like this:
const schema = [
  {
    name: "accountNumber",
    type: "string",
    description: "The account number of the bank statement."
  },
  {
    name: "openingBalance",
    type: "number",
    description: "The opening balance of the account."
  },
  {
    name: "transactions",
    type: "array",
    description: "List of transactions in the account.",
    children: [
      {
        name: "date",
        type: "string",
        description: "Transaction date."
      },
      {
        name: "creditAmount",
        type: "number",
        description: "Credit Amount of the transaction."
      },
      {
        name: "debitAmount",
        type: "number",
        description: "Debit Amount of the transaction."
      },
      {
        name: "description",
        type: "string",
        description: "Transaction description."
      }
    ]
  }
];
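Because a schema is plain data, it can be checked for obvious mistakes before any document is processed. The sketch below is an illustration, not part of Documind: `validateSchema` and `ASSUMED_TYPES` are hypothetical names, the list of allowed types is inferred only from the example above, and the rule that every `array` field needs `children` is an assumption.

```javascript
// Assumption: only the types that appear in the example schema are listed here.
const ASSUMED_TYPES = new Set(["string", "number", "array"]);

// Hypothetical helper (not part of Documind): walks a schema and returns a
// list of human-readable problems. An empty list means nothing was found.
function validateSchema(fields, path = "schema") {
  if (!Array.isArray(fields)) {
    return [`${path} must be an array of field definitions`];
  }
  const problems = [];
  fields.forEach((field, i) => {
    const where = `${path}[${i}]`;
    if (field === null || typeof field !== "object") {
      problems.push(`${where} must be an object`);
      return;
    }
    if (typeof field.name !== "string" || field.name === "") {
      problems.push(`${where}.name must be a non-empty string`);
    }
    if (!ASSUMED_TYPES.has(field.type)) {
      problems.push(`${where}.type "${field.type}" is not one of the assumed types`);
    }
    if (typeof field.description !== "string") {
      problems.push(`${where}.description must be a string`);
    }
    if (field.type === "array") {
      // Assumption: array fields describe their items through `children`.
      problems.push(...validateSchema(field.children, `${where}.children`));
    }
  });
  return problems;
}
```

Calling `validateSchema(schema)` on the bank statement schema returns an empty list under these assumptions.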
Run Documind
Use Documind to process your PDF documents with the following example code:
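import { extract } from 'documind';

// `schema` is the array defined in the previous step.
const runExtraction = async () => {
  const result = await extract({
    file: 'https://bank_statement.pdf',
    schema
  });

  console.log("Extracted Data:", result);
};

// Log any failure instead of leaving the rejection unhandled.
runExtraction().catch(console.error);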
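In application code it is usually worth handling a failed extraction explicitly. The wrapper below is a sketch under stated assumptions rather than Documind's own API: `extractOrThrow` is a hypothetical name, `extractFn` stands in for Documind's `extract` so the wrapper can be tried without the library installed, and it is assumed that a failed run either throws or resolves with `success` not set to `true`.

```javascript
// Sketch only: `extractFn` is a stand-in for Documind's `extract`.
// Assumption: failure shows up as a thrown error or as `success !== true`.
async function extractOrThrow(extractFn, file, schema) {
  let result;
  try {
    result = await extractFn({ file, schema });
  } catch (err) {
    throw new Error(`Extraction failed for ${file}: ${err.message}`);
  }
  if (!result || result.success !== true) {
    throw new Error(`Extraction for ${file} did not report success`);
  }
  // Return only the extracted fields, dropping the metadata around them.
  return result.data;
}
```

With Documind installed, this could be called as `extractOrThrow(extract, 'https://bank_statement.pdf', schema)`.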
Example Output
The output resembles the following:
{
  "success": true,
  "pages": 1,
  "data": {
    "accountNumber": "100002345",
    "openingBalance": 3200,
    "transactions": [
      {
        "date": "2021-05-12",
        "creditAmount": null,
        "debitAmount": 100,
        "description": "transfer to Tom"
      },
      {
        "date": "2021-05-12",
        "creditAmount": 50,
        "debitAmount": null,
        "description": "For lunch the other day"
      }
    ]
  },
  "fileName": "bank_statement.pdf"
}
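The extracted `data` is ordinary JSON, so it can be post-processed directly. As a rough illustration, the sketch below totals the credit and debit amounts, treating `null` as zero. `summarizeTransactions` is a hypothetical helper and `sampleData` is copied from the example output above. Neither is part of Documind.

```javascript
// Sample shaped like the `data` object in the example output above.
const sampleData = {
  openingBalance: 3200,
  transactions: [
    { date: "2021-05-12", creditAmount: null, debitAmount: 100, description: "transfer to Tom" },
    { date: "2021-05-12", creditAmount: 50, debitAmount: null, description: "For lunch the other day" }
  ]
};

// Hypothetical helper (not part of Documind): totals credits and debits,
// treating null or missing amounts as zero.
function summarizeTransactions(data) {
  const transactions = Array.isArray(data.transactions) ? data.transactions : [];
  const totalCredits = transactions.reduce((sum, t) => sum + (t.creditAmount ?? 0), 0);
  const totalDebits = transactions.reduce((sum, t) => sum + (t.debitAmount ?? 0), 0);
  return { totalCredits, totalDebits, net: totalCredits - totalDebits };
}

console.log(summarizeTransactions(sampleData));
// → { totalCredits: 50, totalDebits: 100, net: -50 }
```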
Contributing
We welcome contributions from the community! If you have suggestions for improvements or new features, please feel free to submit a pull request.