ucbepic/DataAgentBench


DAB: Data Agent Benchmark

πŸ”₯ DAB is the first benchmark for evaluating data agents on realistic, complex, data-oriented tasks. It is a collaborative effort between UC Berkeley and Hasura PromptQL.

🀝 We welcome contributions to the leaderboard! Submit a Pull Request following the instructions below to share your agent results and see them ranked on DAB.

DAB captures four core properties of real-world enterprise data workloads across industries:

  • Multi-database integration
  • Ill-formatted key joins
  • Unstructured text transformation
  • Domain knowledge

Unlike prior SQL-only or single-database benchmarks, DAB stresses agents under realistic enterprise data complexity.

πŸ† Leaderboard

Rank Model Pass@1 Date
1 PromptQL (Claude-Opus-4.6) 0.48 2026-03-02
2 Gemini-3-Pro 0.37 2026-03-02
3 GPT-5-mini 0.29 2026-03-02
4 GPT-5.2 0.25 2026-03-02
5 Kimi-K2 0.23 2026-03-02
6 Gemini-2.5-Flash 0.09 2026-03-02

How to Submit to the Leaderboard

To contribute your agent's results to the DAB leaderboard:

  1. Collect results from 50 runs on all queries across all datasets.

  2. Organize all your run results into a single JSON file following this structure:

    [
      {
        "dataset": "<dataset_name>",   // e.g., "bookreview"
        "query": "<query_id>",         // e.g., "1"
        "run": "<run_number>",         // 0–49 for 50 runs
        "answer": "<agent_generated_answer>"
      },
      ...
      // Include an entry for every run of every query across all datasets
    ]
  3. Submit a Pull Request to this repository including:

    • The JSON results file

    • A brief description of your agent configuration, including:

      • The name of your agent

      • Backbone LLM model(s) name and version

      • Whether dataset hints were used

      • Any additional notes or special settings you want to highlight

⚠️ You must include all 50 runs for every query across all datasets. Missing runs may result in your submission being rejected from the leaderboard.
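Before opening a PR, a quick sanity check can confirm that every (dataset, query) pair has all 50 runs. This is an illustrative helper, not part of the repo; the results-file path is whatever you named your JSON file:

```python
import json
from collections import defaultdict

def check_runs(path, expected_runs=50):
    """Return the (dataset, query) pairs that are missing runs,
    mapped to the sorted list of missing run numbers."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    runs = defaultdict(set)
    for e in entries:
        runs[(e["dataset"], e["query"])].add(int(e["run"]))
    expected = set(range(expected_runs))
    return {k: sorted(expected - v) for k, v in runs.items() if v != expected}
```

An empty dict means the file is complete.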

πŸ“Š Benchmark Overview

This benchmark contains 12 datasets and 54 queries across 9 domains and 4 DBMSes:

Dataset #DBs DBMSes #Tbl #Queries
agnews 2 MongoDB, SQLite 3 4
bookreview 2 PostgreSQL, SQLite 2 3
crmarenapro 6 DuckDB, PostgreSQL, SQLite 27 13
deps_dev_v1 2 DuckDB, SQLite 3 2
github_repos 2 DuckDB, SQLite 6 4
googlelocal 2 PostgreSQL, SQLite 2 4
music_brainz_20k 2 DuckDB, SQLite 2 3
pancancer_atlas 2 DuckDB, PostgreSQL 3 3
patents 2 PostgreSQL, SQLite 2 3
stockindex 2 DuckDB, SQLite 2 3
stockmarket 2 DuckDB, SQLite 2754 5
yelp 2 DuckDB, MongoDB 5 7

βš™οΈ Prerequisites

Before running DAB, please complete the following setup steps.

Clone the Repository

Some datasets in DAB contain database files larger than 50 MB, which are stored in Git LFS. To fetch the full datasets automatically, make sure Git LFS is enabled before cloning:

git lfs install

Then you can run:

git clone https://github.com/ucbepic/DataAgentBench.git
cd DataAgentBench

One database file of the PATENTS dataset, patent_publication.db, exceeds the Git LFS file-size limit (5GB) and is hosted on Google Drive instead. You can get it in either of two ways:

Option 1: Manually download the database to query_PATENTS/query_dataset/patent_publication.db

Option 2: Run the following script to download the database automatically:

bash download.sh

Install Dependencies

We recommend using a dedicated virtual environment to ensure reproducibility.

Using Conda (recommended):

conda env create -f environment.yaml
conda activate dabench

This will install all required dependencies specified in environment.yaml.

Setup Docker

  • Install Docker: follow the official guide.

    Version used in our experiments: 28.4.0

  • Build the Docker image. It comes with Python 3.12, pandas, and PyArrow pre-installed:

    docker build -t python-data:3.12 .

Setup Databases

DAB evaluates agents across multiple database systems, so you must install and configure the following databases locally.

  • PostgreSQL: Install PostgreSQL from the official website and start the server.
    • Minimum required version: 17.5
    • Version used in our experiments: 17.7 (Ubuntu 17.7-3.pgdg24.04+1)
  • MongoDB: Install MongoDB Community Edition from the official website and start the server.
    • Version used in our experiments: v8.2.1
  • SQLite & DuckDB: They operate directly on database files and do not require running a server.

Set Database Configurations

After installing all databases, you need to configure connection parameters to match your local setup.

Default configuration values (defined in db_config.py):

Parameter Default value
PG_CLIENT "psql"
PG_HOST "127.0.0.1"
PG_PORT 5432
PG_USER "postgres"
PG_PASSWORD ""
PG_DB "test"
MONGO_URI "mongodb://localhost:27017/"
SQLITE_PATH "data/mydb.sqlite"
DUCKDB_PATH "data/mydb.duckdb"

Option 1: Create a .env file in the project root, e.g.:

# PostgreSQL
PG_PASSWORD=your_password
PG_DB=mydb
# MongoDB (if authentication is required)
MONGO_URI=mongodb://username:password@localhost:27017/?authSource=admin

Option 2: Modify the defaults in db_config.py directly.
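Either way, resolution follows the usual environment-variable pattern: values from the environment (e.g. loaded from .env) take precedence over the defaults in the table above. A minimal sketch of that pattern (the actual loading code in db_config.py may differ):

```python
import os

def load_db_config():
    """Resolve connection parameters: environment variables (e.g. set via a
    .env file) override the defaults shown in the table above."""
    return {
        "PG_HOST": os.getenv("PG_HOST", "127.0.0.1"),
        "PG_PORT": int(os.getenv("PG_PORT", "5432")),
        "PG_USER": os.getenv("PG_USER", "postgres"),
        "PG_PASSWORD": os.getenv("PG_PASSWORD", ""),
        "PG_DB": os.getenv("PG_DB", "test"),
        "MONGO_URI": os.getenv("MONGO_URI", "mongodb://localhost:27017/"),
    }
```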

Add API credentials

Create a .env file in the project root and add your API keys:

AZURE_API_BASE=
AZURE_API_KEY=
AZURE_API_VERSION=
GEMINI_API_KEY=
TOGETHER_API_KEY=

Currently, we support:

  • Microsoft Azure API (for GPT models)
  • Google Gemini API (for Gemini models)
  • Together.AI API (for Kimi and Qwen models)

If you want to use a model not yet supported by default, you may register it in DataAgent.py:

# DataAgent.py (from line 76)
if "gpt" in deployment_name.lower():
    self.client = AzureOpenAI(
        api_key=os.getenv("AZURE_API_KEY"),
        api_version=os.getenv("AZURE_API_VERSION"),
        azure_endpoint=os.getenv("AZURE_API_BASE")
    )
# add a new model here as an `elif` branch
else:
    raise ValueError(f"Unsupported deployment name: {deployment_name}")
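For instance, an OpenAI-compatible endpoint (Together.AI exposes one) could be registered as a new branch. The sketch below mirrors the dispatch pattern as a standalone function; the Together branch and its model-name matching are illustrative assumptions, not code from the repo:

```python
import os

def make_client(deployment_name):
    """Illustrative client dispatch: pick an API client by model name.
    The Together.AI branch is an assumption; any OpenAI-compatible
    endpoint can be registered the same way via `base_url`."""
    name = deployment_name.lower()
    if "gpt" in name:
        from openai import AzureOpenAI
        return AzureOpenAI(
            api_key=os.getenv("AZURE_API_KEY"),
            api_version=os.getenv("AZURE_API_VERSION"),
            azure_endpoint=os.getenv("AZURE_API_BASE"),
        )
    elif "kimi" in name or "qwen" in name:
        from openai import OpenAI
        return OpenAI(
            api_key=os.getenv("TOGETHER_API_KEY"),
            base_url="https://api.together.xyz/v1",  # OpenAI-compatible endpoint
        )
    raise ValueError(f"Unsupported deployment name: {deployment_name}")
```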

▢️ Run the Benchmark

Run the Built-in Agent on a Single Query

DAB comes with a built-in agent. You can run it on a specific query as follows:

Example: Run a single trial of GPT-5-mini on query1 of the bookreview dataset, with up to 100 agent iterations and dataset hints enabled:

python run_agent.py \
    --dataset bookreview \
    --query_id 1 \
    --llm gpt-5-mini \
    --iterations 100 \
    --use_hints \
    --root_name run_0

Execution Logs

Logs for this run will be saved under:

query_bookreview/query1/logs/data_agent/run_0

The log directory has the following structure:

run_0/
β”œβ”€ exec_tool_work_dir/     <- Working directory for the `execute_python` tool (Docker)
β”œβ”€ final_agent.json        <- Full agent trajectory and execution statistics
β”œβ”€ llm_calls.jsonl         <- All LLM API calls made by the agent
└─ tool_calls.jsonl        <- All tool calls made by the agent
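Since llm_calls.jsonl and tool_calls.jsonl are JSON Lines files (one JSON object per line), they are easy to inspect programmatically. The record fields are not specified here, so this illustrative helper only counts entries:

```python
import json
from pathlib import Path

def count_jsonl(path):
    """Count records in a JSON Lines log (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

run_dir = Path("query_bookreview/query1/logs/data_agent/run_0")
# e.g. number of LLM API calls made in this run:
# count_jsonl(run_dir / "llm_calls.jsonl")
```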

Validate Agent Answer

DAB provides utilities to compute both the aggregated Pass@1 accuracy for a dataset and the correctness of a single run.

Compute Pass@1 (50 Runs)

We provide a script avg_accuracy.py to compute Pass@1 accuracy across 50 runs per query of a dataset.

Before running the script, make sure your run logs are organized under the following directory structure (you may need to first move them from the dataset folder to the results-<model_name>/ directory):

results-gpt-5-mini/
└─ query_bookreview/
   └─ query1/
      └─ data_agent/
         β”œβ”€ run_0/
         β”œβ”€ run_1/
         β”œβ”€ ...
         └─ run_49/

Then compute Pass@1 as follows:

from python_script.avg_accuracy import avg_acc

print(avg_acc("bookreview", "gpt-5-mini"))

This will aggregate validation results across runs and queries and report the final Pass@1 accuracy for the dataset.
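With one answer per run, Pass@1 for a single query reduces to the fraction of its 50 runs whose answer validates; the exact aggregation across queries is handled by avg_acc. A minimal illustration of the per-query quantity (for intuition only, not the repo's implementation):

```python
def pass_at_1(run_results):
    """Pass@1 for one query: the fraction of runs whose answer validated.
    `run_results` is a list of booleans, one per run (50 runs in DAB)."""
    return sum(run_results) / len(run_results)
```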

Validate a Single Run

After an agent run completes, you can validate its final answer against the ground truth:

from pathlib import Path
import json

query_dir = Path("query_bookreview/query1")      # the query's folder
run_dir = query_dir / "logs/data_agent/run_0"

with open(run_dir / "final_agent.json", encoding="utf-8") as f:
    llm_json = json.load(f)

llm_answer = llm_json["final_result"]
term_reason = llm_json["terminate_reason"]

# `validate` is defined in the query's own validate.py (see "Datasets and Queries")
if term_reason == "no_tool_call":
    validation_result = {"is_valid": False}
else:
    validation_result = validate(query_dir, llm_answer, term_reason)

The validation result follows this structure:

{
  "timestamp": "YYYYMMDD_HHMMSS",
  "query_name": "query1",
  "is_valid": True/False,   # Whether the agent’s answer matches the ground truth
  "reason": "...",          # Explanation for success or failure
  "ground_truth": "...",    # The ground-truth answer
  "llm_answer": "...",      # The agent's final answer
}

πŸ“ Datasets and Queries

Dataset

A dataset in DAB is organized as a folder under the project root. For example, for dataset bookreview:

query_bookreview/
β”œβ”€ query_dataset/                 <- All database files
β”‚  β”œβ”€ books_info.sql
β”‚  └─ review_query.db
β”œβ”€ query1/                        <- Each query stored in a separate folder
β”œβ”€ db_config.yaml                 <- Database connection configuration
β”œβ”€ db_description.txt             <- Description of the database schemas
└─ db_description_with_hint.txt   <- Optional hints for queries

Query

Each query is stored within its corresponding dataset folder:

query1/
β”œβ”€ query.json           <- The query as a double-quoted string
β”œβ”€ ground_truth.csv     <- The ground-truth answer in plain text
└─ validate.py          <- Python script to validate an agent's output

validate.py defines a validate function with the following signature:

def validate(llm_output: str):
    """
    Validate if the ground truth is present in the agent's answer.

    Returns:
        (True, "OK")   – if the answer matches the ground-truth
        (False, reason) – if it does not match
    """
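A minimal implementation consistent with that contract might look like this. It is a toy sketch: the real per-query scripts load the ground truth from ground_truth.csv and can be far more tolerant of formatting differences:

```python
GROUND_TRUTH = "42"  # toy value; in practice loaded from ground_truth.csv

def validate(llm_output: str):
    """Check whether the ground truth appears in the agent's answer."""
    if GROUND_TRUTH in llm_output:
        return (True, "OK")
    return (False, f"expected {GROUND_TRUTH!r} in answer")
```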

πŸ€– Create Your Customized Agents

DAB allows you to implement and run your own agents while leveraging built-in tools for database querying and Python execution. These utilities, located under tools/, provide:

  • πŸ”— Automatic Database Connection

    • Loads database configurations from db_config.yaml for each dataset.
    • No need to manually connect to PostgreSQL, MongoDB, SQLite, or DuckDB β€” all connections are handled automatically via db_utils.
  • πŸ›‘οΈ Read-only Database Querying

    • Ensures queries are read-only, preventing accidental writes that could pollute the data.
  • 🐳 Safe Python Execution Environment

    • Executes Python code with a 600-second timeout in a Docker environment with Python 3.12, pandas, and pyarrow pre-installed.
    • Protects your local machine from unsafe operations by the agent.
  • πŸ“„ Agent Reference Implementation

    • DataAgent.py is a fully functional built-in agent.
    • Use it as a template to implement and test your own agent.
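At its core, such an agent is a tool-calling loop. The sketch below is purely structural; the callables and dict shapes (`llm`, `tools`, `action["type"]`, and so on) are placeholders, not the actual API of tools/ or DataAgent.py:

```python
def run_agent(llm, tools, query, max_iterations=100):
    """Structural sketch of a tool-calling agent loop. `llm` proposes the
    next action given the history; `tools` maps tool names (e.g. database
    query, Python execution) to callables. All names are illustrative."""
    history = [{"role": "user", "content": query}]
    for _ in range(max_iterations):
        action = llm(history)                 # decide the next step
        if action["type"] == "final_answer":  # agent is done
            return action["content"]
        result = tools[action["tool"]](action["args"])  # run the chosen tool
        history.append({"role": "tool", "content": str(result)})
    return None  # hit the iteration limit without a final answer
```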
