fruitstand
by sad_violet_ninnette
Efficient regression testing for LLM prompts to ensure consistent behavior.
Pitch

Fruitstand is a library for regression testing LLM prompts. By measuring response similarity rather than demanding exact matches, it helps verify that switching or upgrading models preserves expected behavior, preventing unintended regressions.

Description

Overview

The fruitstand library provides a practical way to regression test LLM (Large Language Model) prompts. Because LLM output is nondeterministic, exact-match assertions are unreliable; fruitstand instead compares responses by similarity, letting developers confirm that their applications still behave consistently when they upgrade or switch models.

Key Features

  • Similarity Threshold Testing: Rather than comparing exact outputs, fruitstand compares responses against a configurable similarity threshold, which is what keeps the intended functionality intact as models evolve (a sketch of the idea follows this list).

  • Baseline Creation: Establish a baseline using a specific LLM and model. This baseline serves as a reference point to verify that changing or upgrading models does not lead to undesirable behavior changes.
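
To make the threshold idea concrete, here is a minimal sketch (not fruitstand's internal implementation) of the comparison: embed the baseline response and the new model's response, then check whether their cosine similarity meets the threshold.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_similar_enough(baseline_embedding, test_embedding, threshold=0.85):
    # A test passes when the similarity meets or exceeds the threshold
    return cosine_similarity(baseline_embedding, test_embedding) >= threshold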

Use Cases

For instance, when an LLM is used for intent detection in a chatbot, it is critical that intents are still recognized correctly after a model change.

Example Prompt:

Based on the provided user prompt, determine if the user wanted to:
1. Change their address
2. Change their name
3. Cancel their subscription

User Prompt:
I would like to update my subscription.

In this scenario, fruitstand helps confirm that as models are modified, the correct intent is consistently identified.
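
In fruitstand's Python API (shown below), test data is a plain list of query strings, so this use case could be expressed by building one full prompt per user utterance to check. The second user prompt here is an illustrative addition:

intent_instructions = (
    "Based on the provided user prompt, determine if the user wanted to:\n"
    "1. Change their address\n"
    "2. Change their name\n"
    "3. Cancel their subscription\n"
    "\n"
    "User Prompt:\n"
)

user_prompts = [
    "I would like to update my subscription.",
    "I need my invoices sent to a different address.",
]

# One full query per user prompt; passed as test_data in the examples below
test_data = [intent_instructions + prompt for prompt in user_prompts]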

Running Tests

Fruitstand simplifies the process of testing LLM responses. Two main steps are involved:

  1. Creating a Baseline: This setup captures the expected behavior of your current model.
  2. Testing Other Models: Evaluate other LLMs (or newer versions of the same model) against the established baseline to confirm they still behave as expected.

Command Line Usage

Tests can be run from the command line, which makes them easy to automate. To create a baseline (the -q* flags configure the query LLM that generates responses, and the -e* flags configure the embedding model used to measure similarity):

fruitstand baseline -o ./baseline -f ./data/test_data.json -qllm openai -qm "gpt-4o-mini" -qkey sk-******** -ellm openai -em text-embedding-3-large -ekey sk-********

For testing:

fruitstand test -b ./baseline/baseline__openai_gpt-4o-mini__openai_text-embedding-3-large__1736980847061344.json -o ./test_results/data -f ./data/test_data.json -llm openai -m "gpt-4o-mini" -qkey sk-******** -ekey sk-******** -threshold 0.85

Python Integration

The library can also be used directly within Python applications:

from fruitstand import Fruitstand

fruitstand = Fruitstand()

openai_api_key = "your_openai_api_key"

# Create a baseline from the current model's responses
baseline_data = fruitstand.baseline(
    query_llm="openai",
    query_api_key=openai_api_key,
    query_model="gpt-4o-mini",
    embeddings_llm="openai",
    embeddings_api_key=openai_api_key,
    embeddings_model="text-embedding-3-large",
    test_data=[
        "How far is the earth from the sun?",
        "Where is Manchester in the UK?"
    ]
)

print("Baseline data:", baseline_data)

To test another model against the baseline:

# Compare another model's responses against the baseline
test_data = fruitstand.test(
    test_query_llm="openai",
    test_query_api_key=openai_api_key,
    test_query_model="gpt-4o-mini",
    baseline_embeddings_api_key=openai_api_key,
    baseline_data=baseline_data,
    test_data=[
        "How far is the earth from the sun?",
        "Where is Manchester in the UK?"
    ],
    success_threshold=0.85
)

print("Test data:", test_data)

Outcome

When the tests complete, a JSON document of results is produced, detailing for each test case:

  • Query: The input provided to the model.
  • Response: The output generated by the model during testing.
  • Status: Indicates whether the test was successful based on the similarity threshold specified.
  • Similarity: Represents the degree of similarity between the output and the baseline response.
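
As an illustration only (the exact keys and value formats in the real output may differ), a single result entry could look roughly like this:

# Hypothetical shape of one result entry; values are illustrative
result_entry = {
    "query": "How far is the earth from the sun?",
    "response": "The earth is roughly 93 million miles (about 150 million km) from the sun.",
    "status": "pass",        # met the 0.85 similarity threshold
    "similarity": 0.91,
}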