Wimsey
Easy and flexible data contracts for your data validation needs.
Pitch

Wimsey is a lightweight, open-source library for writing and testing data contracts. It works with your preferred dataframe library, from Pandas to Dask, and contracts can be written in YAML, JSON, or Python. It imports quickly and adds minimal overhead to your validation steps.

Description

Wimsey is a lightweight and flexible open-source library that simplifies implementing data contracts, so you can check the integrity and validity of your data with little effort. Its main features are:

  • 🐋 Compatible with Multiple DataFrame Libraries: Because it is built on Narwhals, Wimsey can test dataframes from a range of libraries, including Pandas, Polars, Dask, cuDF (RAPIDS), PyArrow, and Modin. This lets it fit into your existing workflow without a change of dataframe library.

  • 🎍 Versatile Contract Formats: Write contracts in your preferred format, whether that’s YAML, JSON, or Python. Choose the style that best fits your project and team preferences.

  • 🪶 Ultra Lightweight: Designed for quick imports with minimal overhead, Wimsey only requires two dependencies: Narwhals and FSSpec. This means you can get started without any bloat.

  • 🥔 Simple and Intuitive API: With just two straightforward functions for testing dataframes and a simple dataclass for results, Wimsey focuses on reducing the mental overhead typically associated with data validation.

Why Use Data Contracts?

Data contracts are a powerful tool for ensuring the correctness of data values, especially at critical boundary points. They specify conditions that data must satisfy, enabling users to validate incoming data efficiently. Here’s an example of how you might express a data contract in YAML:

- test: columns_should
  be:
    - first_name
    - last_name
    - rating
- column: rating
  test: max_should
  be_less_than_or_equal_to: 10
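
Since contracts can also be written as JSON, the same contract can be sketched as plain Python data and serialised with the standard library. This is only an illustration: it assumes Wimsey's JSON contracts mirror the YAML structure key for key, and the `contract` and `contract_json` names are made up for this example.

```python
import json

# The YAML contract above, expressed as a list of plain dictionaries.
# Assumption: the JSON form uses the same keys as the YAML form.
contract = [
    {"test": "columns_should", "be": ["first_name", "last_name", "rating"]},
    {"column": "rating", "test": "max_should", "be_less_than_or_equal_to": 10},
]

# Serialise to a JSON string; saving it to a file is left to the caller.
contract_json = json.dumps(contract, indent=2)
print(contract_json)
```

The first entry checks the set of column names, and the second checks that the maximum of the rating column does not exceed 10.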

Easy Validation and Testing

Wimsey provides two ways to check a dataframe against a contract:

  • Validate: A method that raises an error if any tests fail, returning your dataframe if successful. It integrates smoothly with Polars or Pandas pipe methods to ensure data quality at every step:
import polars as pl
import wimsey

df = (
  pl.read_csv("hopefully_nice_data.csv")
  .pipe(wimsey.validate, "tests.json")
  .group_by("name").agg(pl.col("value").sum())
)
  • Test: A single function that returns a FinalResult datatype, giving a clear overview of each individual test’s success or failure:
import pandas as pd
import wimsey

df = pd.read_csv("hopefully_nice_data.csv")
results = wimsey.test(df, "tests.yaml")

if results.success:
  print("Yay we have good data! 🥳")
else:
  print("Oh nooo, something's up! 😭")
  print([i for i in results.results if not i.success])
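
The difference between the two styles (raise on failure versus return a result object) can be sketched in plain Python. This is not Wimsey's implementation: `CheckResult`, `Summary`, `run_checks`, and `require_checks` are hypothetical names, and the data is a simple list of dictionaries rather than a dataframe.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Hypothetical stand-in for the outcome of one test."""
    name: str
    success: bool


@dataclass
class Summary:
    """Hypothetical stand-in for an overall result with per-test details."""
    success: bool
    results: list = field(default_factory=list)


def run_checks(rows, max_rating):
    """'Test' style: never raises on bad data, returns a Summary instead.

    Assumes rows is a non-empty list of dicts with a "rating" key.
    """
    highest = max(row["rating"] for row in rows)
    checks = [CheckResult("max_should", highest <= max_rating)]
    return Summary(all(c.success for c in checks), checks)


def require_checks(rows, max_rating):
    """'Validate' style: raises on failure, returns the data unchanged on success."""
    summary = run_checks(rows, max_rating)
    if not summary.success:
        failed = [c.name for c in summary.results if not c.success]
        raise ValueError(f"Failed checks: {failed}")
    return rows


good_rows = [{"rating": 7}, {"rating": 10}]
bad_rows = [{"rating": 11}]

# Passing data flows straight through, which is what makes the style pipe-friendly.
passed_through = require_checks(good_rows, 10)

# Failing data yields a result object that can be inspected instead of an exception.
report = run_checks(bad_rows, 10)
print(report.success, [c.name for c in report.results if not c.success])
```

Returning the input unchanged is what lets the raising style sit in the middle of a pipe chain, while the result-object style suits cases where failures should be logged or reported rather than halting the pipeline.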

Future Development

Wimsey is under active development. Planned work includes additional data tests, broader test coverage, performance improvements, and more informative error messages. We also aim to develop a user-friendly API for data profiling that generates a minimal set of tests from sample data.

We encourage community engagement! If you have suggestions, ideas, or want to add new tests, don’t hesitate to raise an issue or submit a pull request.

Discover the full potential of your data with Wimsey and ensure every data point meets your quality standards!