PitchHut logo
ValidEx
by liable_bronze_carmen
Streamline the extraction of structured data from unstructured sources.
Pitch

ValidEx is a Python library designed to simplify the retrieval and extraction of structured data from unstructured sources. With features like heuristic data cleaning, concurrency support, and model creation, it empowers users to efficiently manage data sources while minimizing errors and enhancing data quality.

Description

ValidEx is a comprehensive Python library designed to streamline the retrieval, extraction, and training of structured data from various unstructured sources. This powerful tool is ideal for developers and data scientists looking to automate and enhance their data extraction processes from a multitude of formats, including web pages, text files, PDFs, and more.

Features

  • Structured Data Extraction: Effortlessly parse and extract structured data from diverse sources, transforming unstructured content into usable datasets.
  • Heuristic Data Cleaning: Automatically perform text normalization, whitespace management, and deduplication to improve data quality.
  • Concurrency Support: Process multiple sources simultaneously to enhance efficiency and speed during data extraction.
  • Retry Mechanism: Automatically retry failed extraction attempts, ensuring a robust and reliable extraction process.
  • Hallucination Check: Implement strategies to detect and mitigate inaccuracies in extracted data, enhancing the reliability of outputs.
  • Fine-tuning Dataset Export: Generate datasets in JSONL format tailored for OpenAI chat fine-tuning, facilitating advanced machine learning applications.
  • Local Model Creation: Create custom extraction models that combine Named Entity Recognition (NER) and regular expressions for specialized data extraction needs.

Quick Start

To demonstrate its functionality, here is a simple usage example:

import validex
from pydantic import BaseModel

class Superhero(BaseModel):
    name: str
    age: int
    power: str
    enemies: list[str]

def main():
    app = validex.App()
    app.add("https://www.britannica.com/topic/list-of-superheroes-2024795")
    superheroes = app.extract(Superhero)
    print(f"Extracted superheroes: {list(superheroes)}")

if __name__ == "__main__":
    main()

This code initializes the ValidEx app, adds a URL for extraction, and outputs the extracted superhero data, showcasing the library's straightforward approach to data retrieval.

Experimental Features

ValidEx also supports experimental features such as data export and local model training:

app.export_jsonl("fine_tune.jsonl")
app.fit()
app.save("state.validex")

Additionally, the library allows for multi-model extraction to handle various data structures in a single call:

multi_results = app.multi_extract(Superhero, Superhero2)
print(f"Multi-extraction results: {multi_results}")

Contributing

Contributions to ValidEx are encouraged. Prospective contributors should fork the repository, create a new branch, commit their changes, and open a pull request. Please refer to the contributing guidelines for detailed steps.

0 comments

No comments yet.

Sign in to be the first to comment.