bug-in-the-code-stack
Spotting the unseen: Measure an LLM's bug-detection prowess.
Pitch

Introducing the Bug In The Code Stack benchmark, a tool designed to assess how well large language models (LLMs) identify bugs hidden in large codebases. By using Python source code as background noise, the benchmark measures LLM retrieval performance in realistic coding scenarios, with direct relevance to software development tools.

Description

Bug In The Code Stack is a benchmark designed to evaluate how well large language models (LLMs) detect bugs within extensive codebases. Using randomly assembled Python source code as the backdrop and syntactic bugs as the targets, it parallels the Needle In The Haystack evaluation: the model must retrieve a small piece of critical code-related information from a long context, and its accuracy is measured as that context grows. Such capabilities are pivotal for software engineering tools and co-pilot applications.
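To make the setup concrete, here is a minimal sketch of how such a sample might be assembled. The helper name and the depth convention are illustrative assumptions, not the repository's actual API:

def build_haystack(snippets, buggy_snippet, depth):
    """Assemble a code haystack with one buggy snippet at a fractional depth.

    snippets: clean Python source snippets used as background noise.
    buggy_snippet: a snippet containing a single syntactic bug (the needle).
    depth: float in [0, 1]; 0.5 places the bug halfway through the context.
    Illustrative helper; the repository's actual assembly code may differ.
    """
    insert_at = int(len(snippets) * depth)
    ordered = snippets[:insert_at] + [buggy_snippet] + snippets[insert_at:]
    return "\n\n".join(ordered)

Varying the number of background snippets and the depth value gives the context-length and placement axes along which models can be compared.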

Key Features

  • Comprehensive Evaluation: Each model is assessed using its latest available version, so results reflect current capabilities.
  • Diverse Models Tested: The benchmark covers a range of notable LLMs, including GPT-4o, GPT-4-Turbo, Claude 3 Opus, and others.

Example Bug Detection

To illustrate the task, consider the following code snippet:

1 | def fahrenheit_to_celsius(fahrenheit):
2 |     return (fahrenheit - 32) * 5.0/9.0
3 |
4 | def is_prime(num:
5 |     if num <= 1:
6 |         return False
7 |     for i in range(2, int(num**0.5) + 1):
8 |         if num % i == 0:
9 |             return False
10|     return True

Answer: 4, missing_parenthesis

The expected answer names the line number and the bug type: line 4 is missing the closing parenthesis in the is_prime signature.
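A response can then be graded on both fields. The sketch below assumes the model answers in the same "line, bug_type" format as the example; the benchmark's actual grading code may parse responses differently:

def score_response(response, gold_line, gold_bug):
    """Return True if a 'line, bug_type' answer matches the gold label."""
    try:
        line_part, bug_part = response.split(",", 1)
        return int(line_part.strip()) == gold_line and bug_part.strip() == gold_bug
    except ValueError:
        return False  # a malformed answer counts as incorrect

assert score_response("4, missing_parenthesis", 4, "missing_parenthesis")
assert not score_response("7, missing_colon", 4, "missing_parenthesis")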

Visualization of Results

The performance of the evaluated models can be compared through charts plotting their detection accuracy:

  • Comparison of Target Depth @ 0.5
  • Results for models including GPT-4o, GPT-4-Turbo, and Claude 3 Opus are linked from the repository, showing their effectiveness in bug detection; a plotting sketch follows this list.
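To reproduce this kind of comparison locally, a simple accuracy-versus-depth plot can be drawn with matplotlib. The numbers below are placeholders standing in for real benchmark output, and the model labels are deliberately generic:

import matplotlib.pyplot as plt

depths = [0.0, 0.25, 0.5, 0.75, 1.0]  # fractional bug placement depths
results = {                           # placeholder accuracies, not real results
    "model-a": [0.95, 0.90, 0.88, 0.91, 0.96],
    "model-b": [0.80, 0.72, 0.65, 0.70, 0.83],
}

for model, accuracy in results.items():
    plt.plot(depths, accuracy, marker="o", label=model)

plt.xlabel("Target depth (fraction of context)")
plt.ylabel("Bug detection accuracy")
plt.legend()
plt.show()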

Notebooks and Data Handling

The repository includes several notebooks that facilitate easy experimentation and data processing, such as:

  • Data Processing: notebooks/bug_in_the_code_stack_python_source_code_preprocessing.ipynb
  • Experimentation: Notebooks for running experiments across the various models are provided, so developers and researchers can jump right in (a sketch of a single trial follows this list).
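In the spirit of those notebooks, a single trial might look like the sketch below. It reuses the illustrative build_haystack helper from earlier and numbers the lines as in the example above; the API call follows the publicly documented OpenAI chat-completions interface, but the notebooks' actual prompting and settings may differ:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "The following Python code contains exactly one syntactic bug.\n"
    "Reply with 'line_number, bug_type' and nothing else.\n\n{code}"
)

def run_trial(haystack_code, model="gpt-4o"):
    """Ask a model to locate the single bug in numbered source code."""
    numbered = "\n".join(
        f"{i} | {line}"
        for i, line in enumerate(haystack_code.splitlines(), start=1)
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(code=numbered)}],
    )
    return response.choices[0].message.content.strip()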

Dataset Access

The benchmark uses a curated dataset for experiments, found in datasets/bug_in_the_code_stack_alpaca_dataset.csv. Additionally, all notebooks and datasets are available from the Bug In The Code Stack Google Drive.
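For a quick local look at the data, the CSV can be loaded with pandas. This sketch makes no assumptions about the column layout; it simply reports what the file contains:

import pandas as pd

df = pd.read_csv("datasets/bug_in_the_code_stack_alpaca_dataset.csv")
print(df.shape)    # number of samples and columns
print(df.columns)  # check the actual column names before relying on them
print(df.head())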

Whether you're a developer looking to enhance bug detection capabilities or a researcher interested in LLM performance, Bug In The Code Stack offers valuable insights and tools for your coding needs.