ucbepic/DataAgentBench


DAB: Data Agent Benchmark

πŸ”₯ DAB is the first benchmark for evaluating data agents on realistic, complex, data-oriented tasks. It is a collaborative effort between UC Berkeley and Hasura PromptQL.

🀝 We welcome contributions to the leaderboard! Submit a Pull Request following the instructions below to share your agent results and see them ranked on DAB.

DAB captures four core properties of real-world enterprise data workloads across industries:

  • Multi-database integration
  • Ill-formatted key joins
  • Unstructured text transformation
  • Domain knowledge

Unlike prior SQL-only or single-database benchmarks, DAB stresses agents under realistic enterprise data complexity.

πŸ† Leaderboard

Rank Model Pass@1 Date
1 PromptQL (Claude-Opus-4.6) 0.48 2026-03-02
2 Gemini-3-Pro 0.37 2026-03-02
3 GPT-5-mini 0.29 2026-03-02
4 GPT-5.2 0.25 2026-03-02
5 Kimi-K2 0.23 2026-03-02
6 Gemini-2.5-Flash 0.09 2026-03-02

How to Submit to the Leaderboard

To contribute your agent's results to the DAB leaderboard:

  1. Collect results from 50 runs on all queries across all datasets.

  2. Organize all your run results into a single JSON file following this structure:

    [
      {
        "dataset": "<dataset_name>",   // e.g., "bookreview"
        "query": "<query_id>",         // e.g., "1"
        "run": "<run_number>",         // 0–49 for 50 runs
        "answer": "<agent_generated_answer>"
      },
      ...
      // Include an entry for every run of every query across all datasets
    ]
  3. Submit a Pull Request to this repository including:

    • The JSON results file

    • A brief description of your agent configuration, including:

      • The name of your agent

      • Backbone LLM model(s) name and version

      • Whether dataset hints were used

      • Any additional notes or special settings you want to highlight

⚠️ You must include all 50 runs for every query across all datasets. Missing runs may result in your submission being rejected from the leaderboard.
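Before opening a PR, a quick sanity check can confirm that every (dataset, query) pair has all 50 runs. This is an illustrative helper, not part of the repo; the results-file path is whatever you named your JSON file:

```python
import json
from collections import defaultdict

def check_runs(path, expected_runs=50):
    """Return the (dataset, query) pairs that are missing runs,
    mapped to the sorted list of missing run numbers."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    runs = defaultdict(set)
    for e in entries:
        runs[(e["dataset"], e["query"])].add(int(e["run"]))
    expected = set(range(expected_runs))
    return {k: sorted(expected - v) for k, v in runs.items() if v != expected}
```

An empty dict means the file is complete.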

πŸ“Š Benchmark Overview

This benchmark contains 12 datasets and 54 queries across 9 domains and 4 DBMSes:

Dataset #DBs DBMSes #Tbl #Queries
agnews 2 MongoDB, SQLite 3 4
bookreview 2 PostgreSQL, SQLite 2 3
crmarenapro 6 DuckDB, PostgreSQL, SQLite 27 13
deps_dev_v1 2 DuckDB, SQLite 3 2
github_repos 2 DuckDB, SQLite 6 4
googlelocal 2 PostgreSQL, SQLite 2 4
music_brainz_20k 2 DuckDB, SQLite 2 3
pancancer_atlas 2 DuckDB, PostgreSQL 3 3
patents 2 PostgreSQL, SQLite 2 3
stockindex 2 DuckDB, SQLite 2 3
stockmarket 2 DuckDB, SQLite 2754 5
yelp 2 DuckDB, MongoDB 5 7

βš™οΈ Prerequisites

Before running DAB, please complete the following setup steps.

Clone the Repository

Some datasets in DAB contain database files larger than 50 MB, which are stored in Git LFS. To fetch the full datasets automatically, make sure Git LFS is enabled before cloning:

git lfs install

Then you can run:

git clone https://github.com/ucbepic/DataAgentBench.git
cd DataAgentBench

One database file of the PATENTS dataset, patent_publication.db, exceeds the Git LFS file-size limit (5GB) and is hosted on Google Drive instead. You can get it in either of two ways:

Option 1: Manually download the database to query_PATENTS/query_dataset/patent_publication.db

Option 2: Run the following script to download the database automatically:

bash download.sh

Install Dependencies

We recommend using a dedicated virtual environment to ensure reproducibility.

Using Conda (recommended):

conda env create -f environment.yaml
conda activate dabench

This will install all required dependencies specified in environment.yaml.

Setup Docker

  • Install Docker: follow the official guide.

    Version used in our experiments: 28.4.0

  • Build the Docker image. It comes with Python 3.12, pandas, and PyArrow pre-installed:

    docker build -t python-data:3.12 .

Setup Databases

DAB evaluates agents across multiple database systems, so you must install and configure the following databases locally.

  • PostgreSQL: Install PostgreSQL from the official website and start the server.
    • Minimum required version: 17.5
    • Version used in our experiments: 17.7 (Ubuntu 17.7-3.pgdg24.04+1)
  • MongoDB: Install MongoDB Community Edition from the official website and start the server.
    • Version used in our experiments: v8.2.1
  • SQLite & DuckDB: They operate directly on database files and do not require running a server.

Set Database Configurations

After installing all databases, you need to configure connection parameters to match your local setup.

Default configuration values (defined in db_config.py):

Parameter Default value
PG_CLIENT "psql"
PG_HOST "127.0.0.1"
PG_PORT 5432
PG_USER "postgres"
PG_PASSWORD ""
PG_DB "test"
MONGO_URI "mongodb://localhost:27017/"
SQLITE_PATH "data/mydb.sqlite"
DUCKDB_PATH "data/mydb.duckdb"

Option 1: Create a .env file in the project root, e.g.:

# PostgreSQL
PG_PASSWORD=your_password
PG_DB=mydb
# MongoDB (if authentication is required)
MONGO_URI=mongodb://username:password@localhost:27017/?authSource=admin

Option 2: Modify the defaults in db_config.py directly.
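Either way, resolution follows the usual environment-variable pattern: values from the environment (e.g. loaded from .env) take precedence over the defaults in the table above. A minimal sketch of that pattern (the actual loading code in db_config.py may differ):

```python
import os

def load_db_config():
    """Resolve connection parameters: environment variables (e.g. set via a
    .env file) override the defaults shown in the table above."""
    return {
        "PG_HOST": os.getenv("PG_HOST", "127.0.0.1"),
        "PG_PORT": int(os.getenv("PG_PORT", "5432")),
        "PG_USER": os.getenv("PG_USER", "postgres"),
        "PG_PASSWORD": os.getenv("PG_PASSWORD", ""),
        "PG_DB": os.getenv("PG_DB", "test"),
        "MONGO_URI": os.getenv("MONGO_URI", "mongodb://localhost:27017/"),
    }
```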

Add API credentials

Create a .env file in the project root and add your API keys:

AZURE_API_BASE=
AZURE_API_KEY=
AZURE_API_VERSION=
GEMINI_API_KEY=
TOGETHER_API_KEY=

Currently, we support:

  • Microsoft Azure API (for GPT models)
  • Google Gemini API (for Gemini models)
  • Together.AI API (for Kimi and Qwen models)

If you want to use a model not yet supported by default, you may register it in DataAgent.py:

# DataAgent.py (from line 76)
if "gpt" in deployment_name.lower():
    self.client = AzureOpenAI(
        api_key=os.getenv("AZURE_API_KEY"),
        api_version=os.getenv("AZURE_API_VERSION"),
        azure_endpoint=os.getenv("AZURE_API_BASE")
    )
# add a new model here as an `elif` branch
else:
    raise ValueError(f"Unsupported deployment name: {deployment_name}")
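For instance, an OpenAI-compatible endpoint (Together.AI exposes one) could be registered as a new branch. The sketch below mirrors the dispatch pattern as a standalone function; the Together branch and its model-name matching are illustrative assumptions, not code from the repo:

```python
import os

def make_client(deployment_name):
    """Illustrative client dispatch: pick an API client by model name.
    The Together.AI branch is an assumption; any OpenAI-compatible
    endpoint can be registered the same way via `base_url`."""
    name = deployment_name.lower()
    if "gpt" in name:
        from openai import AzureOpenAI
        return AzureOpenAI(
            api_key=os.getenv("AZURE_API_KEY"),
            api_version=os.getenv("AZURE_API_VERSION"),
            azure_endpoint=os.getenv("AZURE_API_BASE"),
        )
    elif "kimi" in name or "qwen" in name:
        from openai import OpenAI
        return OpenAI(
            api_key=os.getenv("TOGETHER_API_KEY"),
            base_url="https://api.together.xyz/v1",  # OpenAI-compatible endpoint
        )
    raise ValueError(f"Unsupported deployment name: {deployment_name}")
```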

▢️ Run the Benchmark

Run the Built-in Agent on a Single Query

DAB comes with a built-in agent. You can run it on a specific query as follows:

Example: Run a single trial of GPT-5-mini on query1 of the bookreview dataset, with up to 100 agent iterations and dataset hints enabled:

python run_agent.py \
    --dataset bookreview \
    --query_id 1 \
    --llm gpt-5-mini \
    --iterations 100 \
    --use_hints \
    --root_name run_0

Execution Logs

Logs for this run will be saved under:

query_bookreview/query1/logs/data_agent/run_0

The log directory has the following structure:

run_0/
β”œβ”€ exec_tool_work_dir/     <- Working directory for the `execute_python` tool (Docker)
β”œβ”€ final_agent.json        <- Full agent trajectory and execution statistics
β”œβ”€ llm_calls.jsonl         <- All LLM API calls made by the agent
└─ tool_calls.jsonl        <- All tool calls made by the agent
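Since llm_calls.jsonl and tool_calls.jsonl are JSON Lines files (one JSON object per line), they are easy to inspect programmatically. The record fields are not specified here, so this illustrative helper only counts entries:

```python
import json
from pathlib import Path

def count_jsonl(path):
    """Count records in a JSON Lines log (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

run_dir = Path("query_bookreview/query1/logs/data_agent/run_0")
# e.g. number of LLM API calls made in this run:
# count_jsonl(run_dir / "llm_calls.jsonl")
```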

Validate Agent Answer

DAB provides utilities to compute both the aggregated Pass@1 accuracy for a dataset and the correctness of a single run.

Compute Pass@1 (50 Runs)

We provide a script avg_accuracy.py to compute Pass@1 accuracy across 50 runs per query of a dataset.

Before running the script, make sure your run logs are organized under the following directory structure (you may need to first move them from the dataset folder to the results-<model_name>/ directory):

results-gpt-5-mini/
└─ query_bookreview/
   └─ query1/
      └─ data_agent/
         β”œβ”€ run_0/
         β”œβ”€ run_1/
         β”œβ”€ ...
         └─ run_49/

Then compute Pass@1 as follows:

from python_script.avg_accuracy import avg_acc

print(avg_acc("bookreview", "gpt-5-mini"))

This will aggregate validation results across runs and queries and report the final Pass@1 accuracy for the dataset.
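With one answer per run, Pass@1 for a single query reduces to the fraction of its 50 runs whose answer validates; the exact aggregation across queries is handled by avg_acc. A minimal illustration of the per-query quantity (for intuition only, not the repo's implementation):

```python
def pass_at_1(run_results):
    """Pass@1 for one query: the fraction of runs whose answer validated.
    `run_results` is a list of booleans, one per run (50 runs in DAB)."""
    return sum(run_results) / len(run_results)
```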

Validate a Single Run

After an agent run completes, you can validate its final answer against the ground truth:

from pathlib import Path
import json

query_dir = Path("query_bookreview/query1")      # the query's folder
run_dir = query_dir / "logs/data_agent/run_0"

with open(run_dir / "final_agent.json", encoding="utf-8") as f:
    llm_json = json.load(f)

llm_answer = llm_json["final_result"]
term_reason = llm_json["terminate_reason"]

# `validate` is defined in the query's own validate.py (see "Datasets and Queries")
if term_reason == "no_tool_call":
    validation_result = {"is_valid": False}
else:
    validation_result = validate(query_dir, llm_answer, term_reason)

The validation result follows this structure:

{
  "timestamp": "YYYYMMDD_HHMMSS",
  "query_name": "query1",
  "is_valid": True/False,   # Whether the agent’s answer matches the ground truth
  "reason": "...",          # Explanation for success or failure
  "ground_truth": "...",    # The ground-truth answer
  "llm_answer": "...",      # The agent's final answer
}

πŸ“ Datasets and Queries

Dataset

A dataset in DAB is organized as a folder under the project root. For example, for dataset bookreview:

query_bookreview/
β”œβ”€ query_dataset/                 <- All database files
β”‚  β”œβ”€ books_info.sql
β”‚  └─ review_query.db
β”œβ”€ query1/                        <- Each query stored in a separate folder
β”œβ”€ db_config.yaml                 <- Database connection configuration
β”œβ”€ db_description.txt             <- Description of the database schemas
└─ db_description_with_hint.txt   <- Optional hints for queries

Query

Each query is stored within its corresponding dataset folder:

query1/
β”œβ”€ query.json           <- The query as a double-quoted string
β”œβ”€ ground_truth.csv     <- The ground-truth answer in plain text
└─ validate.py          <- Python script to validate an agent's output

validate.py defines a validate function with the following signature:

def validate(llm_output: str):
    """
    Validate if the ground truth is present in the agent's answer.

    Returns:
        (True, "OK")   – if the answer matches the ground-truth
        (False, reason) – if it does not match
    """
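A minimal implementation consistent with that contract might look like this. It is a toy sketch: the real per-query scripts load the ground truth from ground_truth.csv and can be far more tolerant of formatting differences:

```python
GROUND_TRUTH = "42"  # toy value; in practice loaded from ground_truth.csv

def validate(llm_output: str):
    """Check whether the ground truth appears in the agent's answer."""
    if GROUND_TRUTH in llm_output:
        return (True, "OK")
    return (False, f"expected {GROUND_TRUTH!r} in answer")
```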

πŸ€– Create Your Customized Agents

DAB allows you to implement and run your own agents while leveraging built-in tools for database querying and Python execution. These utilities, located under tools/, provide:

  • πŸ”— Automatic Database Connection

    • Loads database configurations from db_config.yaml for each dataset.
    • No need to manually connect to PostgreSQL, MongoDB, SQLite, or DuckDB β€” all connections are handled automatically via db_utils.
  • πŸ›‘οΈ Read-only Database Querying

    • Ensures queries are read-only, preventing accidental writes that could pollute the data.
  • 🐳 Safe Python Execution Environment

    • Executes Python code with a 600-second timeout in a Docker environment with Python 3.12, pandas, and pyarrow pre-installed.
    • Protects your local machine from unsafe operations by the agent.
  • πŸ“„ Agent Reference Implementation

    • DataAgent.py is a fully functional built-in agent.
    • Use it as a template to implement and test your own agent.
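At its core, such an agent is a tool-calling loop. The sketch below is purely structural; the callables and dict shapes (`llm`, `tools`, `action["type"]`, and so on) are placeholders, not the actual API of tools/ or DataAgent.py:

```python
def run_agent(llm, tools, query, max_iterations=100):
    """Structural sketch of a tool-calling agent loop. `llm` proposes the
    next action given the history; `tools` maps tool names (e.g. database
    query, Python execution) to callables. All names are illustrative."""
    history = [{"role": "user", "content": query}]
    for _ in range(max_iterations):
        action = llm(history)                 # decide the next step
        if action["type"] == "final_answer":  # agent is done
            return action["content"]
        result = tools[action["tool"]](action["args"])  # run the chosen tool
        history.append({"role": "tool", "content": str(result)})
    return None  # hit the iteration limit without a final answer
```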
