🔥 DAB is the first benchmark for evaluating data agents on realistic, complex, data-oriented tasks. It is a collaborative effort between UC Berkeley and Hasura PromptQL.

🤝 We welcome contributions to the leaderboard! Submit a Pull Request following the instructions below to share your agent results and see them ranked on DAB.
DAB captures four core properties of real-world enterprise data workloads across industries:
- Multi-database integration
- Ill-formatted key joins
- Unstructured text transformation
- Domain knowledge
Unlike prior SQL-only or single-database benchmarks, DAB stresses agents under realistic enterprise data complexity.
| Rank | Model | Pass@1 | Date |
|---|---|---|---|
| 1 | PromptQL (Claude-Opus-4.6) | 0.48 | 2026-03-02 |
| 2 | Gemini-3-Pro | 0.37 | 2026-03-02 |
| 3 | GPT-5-mini | 0.29 | 2026-03-02 |
| 4 | GPT-5.2 | 0.25 | 2026-03-02 |
| 5 | Kimi-K2 | 0.23 | 2026-03-02 |
| 6 | Gemini-2.5-Flash | 0.09 | 2026-03-02 |
To contribute your agent's results to the DAB leaderboard:

1. Collect results from 50 runs on all queries across all datasets.

2. Organize all your run results into a single JSON file following this structure:

   ```
   [
     {
       "dataset": "<dataset_name>",          // e.g., "bookreview"
       "query": "<query_id>",                // e.g., "1"
       "run": "<run_number>",                // 0-49 for 50 runs
       "answer": "<agent_generated_answer>"
     },
     ...  // Include an entry for every run of every query across all datasets
   ]
   ```

3. Submit a Pull Request to this repository including:
   - The JSON results file
   - A brief description of your agent configuration, including:
     - The name of your agent
     - Backbone LLM model(s) name and version
     - Whether dataset hints were used
     - Any additional notes or special settings you want to highlight
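The submission entries can be assembled mechanically from completed run logs. The sketch below is illustrative only: the helper name and the glob pattern are assumptions based on the log layout described later in this README, not part of DAB.

```python
import json
from pathlib import Path

def collect_results(root: Path) -> list[dict]:
    """Walk run logs under `root` and build leaderboard submission entries.

    Assumed layout (see the log-directory section of this README):
    query_<dataset>/query<id>/logs/data_agent/run_<n>/final_agent.json
    """
    entries = []
    pattern = "query_*/query*/logs/data_agent/run_*/final_agent.json"
    for final in sorted(root.glob(pattern)):
        run_dir = final.parent                      # .../run_<n>
        query_dir = run_dir.parents[2]              # .../query<id>
        dataset_dir = run_dir.parents[3]            # .../query_<dataset>
        answer = json.loads(final.read_text(encoding="utf-8")).get("final_result", "")
        entries.append({
            "dataset": dataset_dir.name.removeprefix("query_"),
            "query": query_dir.name.removeprefix("query"),
            "run": run_dir.name.removeprefix("run_"),
            "answer": answer,
        })
    return entries
```

Dumping `collect_results(Path("."))` with `json.dump` then produces a file in the required shape.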
- 📊 Benchmark Overview
- ⚙️ Prerequisites
- ▶️ Run the Benchmark
- 📊 Datasets and Queries
- 🤖 Create Your Customized Agents
This benchmark contains 12 datasets and 54 queries across 9 domains and 4 DBMSes:
| Dataset | #DBs | DBMSes | #Tbl | #Queries |
|---|---|---|---|---|
| agnews | 2 | MongoDB, SQLite | 3 | 4 |
| bookreview | 2 | PostgreSQL, SQLite | 2 | 3 |
| crmarenapro | 6 | DuckDB, PostgreSQL, SQLite | 27 | 13 |
| deps_dev_v1 | 2 | DuckDB, SQLite | 3 | 2 |
| github_repos | 2 | DuckDB, SQLite | 6 | 4 |
| googlelocal | 2 | PostgreSQL, SQLite | 2 | 4 |
| music_brainz_20k | 2 | DuckDB, SQLite | 2 | 3 |
| pancancer_atlas | 2 | DuckDB, PostgreSQL | 3 | 3 |
| patents | 2 | PostgreSQL, SQLite | 2 | 3 |
| stockindex | 2 | DuckDB, SQLite | 2 | 3 |
| stockmarket | 2 | DuckDB, SQLite | 2754 | 5 |
| yelp | 2 | DuckDB, MongoDB | 5 | 7 |
Before running DAB, please complete the following setup steps.
Some datasets in DAB contain large database files exceeding 50 MB, which are stored in Git LFS. To fetch the full datasets automatically, make sure Git LFS is enabled:

```
git lfs install
```

Then you can run:

```
git clone https://github.com/ucbepic/DataAgentBench.git
cd DataAgentBench
```

One database file of the PATENTS dataset, `patent_publication.db`, exceeds the Git LFS file-size limit (5 GB) and is hosted on Google Drive instead.
Option 1: Manually download the database to `query_PATENTS/query_dataset/patent_publication.db`.

Option 2: Run the following script to automatically download the database:

```
bash download.sh
```

We recommend using a dedicated virtual environment to ensure reproducibility.
Using Conda (recommended):

```
conda env create -f environment.yaml
conda activate dabench
```

This will install all required dependencies specified in `environment.yaml`.
1. Install Docker following the official guide.
   - Version used in our experiments: 28.4.0

2. Build the Docker image. The image includes Python 3.12, Pandas, and PyArrow pre-installed:

   ```
   docker build -t python-data:3.12 .
   ```
DAB evaluates agents across multiple database systems, so you must install and configure the following databases locally.
- PostgreSQL:
Install PostgreSQL from the official website and start the server.
- Minimum required version: 17.5
- Version used in our experiments: 17.7 (Ubuntu 17.7-3.pgdg24.04+1)
- MongoDB:
Install MongoDB Community Edition from the official website and start the server.
- Version used in our experiments: v8.2.1
- SQLite & DuckDB: They operate directly on database files and do not require running a server.
After installing all databases, you need to configure connection parameters to match your local setup.
Default configuration values (defined in db_config.py):
| Parameter | Default value |
|---|---|
| PG_CLIENT | "psql" |
| PG_HOST | "127.0.0.1" |
| PG_PORT | 5432 |
| PG_USER | "postgres" |
| PG_PASSWORD | "" |
| PG_DB | "test" |
| MONGO_URI | "mongodb://localhost:27017/" |
| SQLITE_PATH | "data/mydb.sqlite" |
| DUCKDB_PATH | "data/mydb.duckdb" |
Option 1: Create a `.env` file in the project root. E.g.,

```
# PostgreSQL
PG_PASSWORD=your_password
PG_DB=mydb

# MongoDB (if authentication is required)
MONGO_URI=mongodb://username:password@localhost:27017/?authSource=admin
```
Option 2: Modify the defaults in db_config.py.
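Either way, the effective configuration follows the usual override pattern: an environment variable (for example, one exported from `.env`) takes precedence over the default shown in the table above. A minimal sketch of that pattern, assuming the variable names from the table (the helper itself is illustrative, not db_config.py's actual code):

```python
import os

def load_pg_config() -> dict:
    """Return PostgreSQL connection settings, letting environment
    variables override the documented defaults."""
    return {
        "host": os.getenv("PG_HOST", "127.0.0.1"),
        "port": int(os.getenv("PG_PORT", "5432")),
        "user": os.getenv("PG_USER", "postgres"),
        "password": os.getenv("PG_PASSWORD", ""),
        "dbname": os.getenv("PG_DB", "test"),
    }
```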
Create a `.env` file in the project root and add your API keys:

```
AZURE_API_BASE=
AZURE_API_KEY=
AZURE_API_VERSION=
GEMINI_API_KEY=
TOGETHER_API_KEY=
```
Currently, we support
- Microsoft Azure API (for GPT models)
- Google Gemini API (for Gemini models)
- Together.AI API (for Kimi and Qwen models)
If you want to use a model not yet supported by default, you may register it in DataAgent.py:

```python
# DataAgent.py (from line 76)
if "gpt" in deployment_name.lower():
    self.client = AzureOpenAI(
        api_key=os.getenv("AZURE_API_KEY"),
        api_version=os.getenv("AZURE_API_VERSION"),
        azure_endpoint=os.getenv("AZURE_API_BASE")
    )
# add a new model here as an `elif` branch
else:
    raise ValueError(f"Unsupported deployment name: {deployment_name}")
```

DAB comes with a built-in agent. You can run the agent on a specific query as follows:
Example: Run a single trial of GPT-5-mini on query1 of the bookreview dataset, with up to 100 agent iterations and dataset hints enabled:
```
python run_agent.py \
    --dataset bookreview \
    --query_id 1 \
    --llm gpt-5-mini \
    --iterations 100 \
    --use_hints \
    --root_name run_0
```

Logs for this run will be saved under:

```
query_bookreview/query1/logs/data_agent/run_0
```
The log directory has the following structure:
```
run_0/
├─ exec_tool_work_dir/   <- Working directory for the `execute_python` tool (Docker)
├─ final_agent.json      <- Full agent trajectory and execution statistics
├─ llm_calls.jsonl       <- All LLM API calls made by the agent
└─ tool_calls.jsonl      <- All tool calls made by the agent
```
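The `.jsonl` logs store one JSON object per line, so they are easy to inspect programmatically. A minimal reader (the helper name is ours, not part of DAB):

```python
import json
from pathlib import Path

def read_jsonl(path: Path) -> list[dict]:
    """Load a JSON-Lines log (e.g. llm_calls.jsonl or tool_calls.jsonl),
    skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `len(read_jsonl(run_dir / "tool_calls.jsonl"))` gives the number of tool calls the agent made in a run.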
DAB provides utilities to compute both aggregated Pass@1 accuracy of a dataset and single-run correctness.
We provide a script avg_accuracy.py to compute Pass@1 accuracy across 50 runs per query of a dataset.
Before running the script, make sure your run logs are organized under the following directory structure (you may need to first move them from the dataset folder to the results-<model_name>/ directory):
```
results-gpt-5-mini/
└─ query_bookreview/
   └─ query1/
      └─ data_agent/
         ├─ run_0/
         ├─ run_1/
         ├─ ...
         └─ run_49/
```
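Reorganizing the logs into this layout can be scripted. The sketch below copies (rather than moves, to be safe) run directories from the per-dataset layout into the `results-<model_name>/` layout; the paths are assumptions taken from the directory trees shown in this README:

```python
import shutil
from pathlib import Path

def organize_results(dataset: str, model: str, root: Path = Path(".")) -> None:
    """Copy run logs from query_<dataset>/query<id>/logs/data_agent/run_<n>
    into results-<model>/query_<dataset>/query<id>/data_agent/run_<n>."""
    src_base = root / f"query_{dataset}"
    dst_base = root / f"results-{model}" / f"query_{dataset}"
    for run_dir in src_base.glob("query*/logs/data_agent/run_*"):
        query_name = run_dir.parents[2].name        # e.g. "query1"
        dst = dst_base / query_name / "data_agent" / run_dir.name
        dst.mkdir(parents=True, exist_ok=True)
        shutil.copytree(run_dir, dst, dirs_exist_ok=True)
```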
Then compute Pass@1 as follows:

```python
from python_script.avg_accuracy import avg_acc

print(avg_acc("bookreview", "gpt-5-mini"))
```

This will aggregate validation results across runs and queries and report the final Pass@1 accuracy for the dataset.
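Conceptually, the aggregation reduces to averaging per-run correctness within each query and then averaging across queries. An illustrative sketch (not `avg_accuracy.py`'s actual implementation):

```python
def pass_at_1(runs_by_query: dict[str, list[bool]]) -> float:
    """Mean correctness per query, averaged over all queries in a dataset."""
    per_query = [sum(runs) / len(runs) for runs in runs_by_query.values()]
    return sum(per_query) / len(per_query)
```

For example, a query solved in 1 of 2 runs plus a query solved in 2 of 2 runs yields a dataset Pass@1 of 0.75.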
After an agent run completes, you can validate its final answer against the ground truth:
```python
from pathlib import Path
import json

run_dir = Path("query_bookreview/query1/logs/data_agent/run_0")
query_dir = run_dir.parents[2]  # e.g., query_bookreview/query1

with open(run_dir / "final_agent.json", encoding="utf-8") as f:
    llm_json = json.load(f)

llm_answer = llm_json["final_result"]
term_reason = llm_json["terminate_reason"]

if term_reason == "no_tool_call":
    validation_result = {"is_valid": False}
else:
    # `validate` is provided by the query's validate.py
    validation_result = validate(query_dir, llm_answer, term_reason)
```

The validation result follows this structure:
```
{
    "timestamp": "YYYYMMDD_HHMMSS",
    "query_name": "query1",
    "is_valid": True/False,   # Whether the agent's answer matches the ground truth
    "reason": "...",          # Explanation for success or failure
    "ground_truth": "...",    # The ground-truth answer
    "llm_answer": "...",      # The agent's final answer
}
```

A dataset in DAB is organized as a folder under the project root. For example, for dataset bookreview:
```
query_bookreview/
├─ query_dataset/                 <- All database files
│  ├─ books_info.sql
│  └─ review_query.db
├─ query1/                        <- Each query stored in a separate folder
├─ db_config.yaml                 <- Database connection configuration
├─ db_description.txt             <- Description of the database schemas
└─ db_description_with_hint.txt   <- Optional hints for queries
```
Each query is stored within its corresponding dataset folder:
```
query1/
├─ query.json         <- The query as a double-quoted string
├─ ground_truth.csv   <- The ground-truth answer in plain text
└─ validate.py        <- Python script to validate an agent's output
```
`validate.py` defines a `validate` function with the following signature:

```python
def validate(llm_output: str):
    """
    Validate if the ground truth is present in the agent's answer.

    Returns:
        (True, "OK")     -> if the answer matches the ground truth
        (False, reason)  -> if it does not match
    """
```

DAB allows you to implement and run your own agents while leveraging built-in tools for database querying and Python execution. These utilities, located under tools/, provide:
- 🔌 Automatic Database Connection
  - Loads database configurations from `db_config.yaml` for each dataset.
  - No need to manually connect to PostgreSQL, MySQL, MongoDB, SQLite, or DuckDB; all connections are handled automatically via `db_utils`.
- 🛡️ Read-only Database Querying
  - Ensures queries are read-only, preventing accidental writes that could pollute the data.
- 🐳 Safe Python Execution Environment
  - Executes Python code with a 600-second timeout in a Docker environment with Python 3.12, `pandas`, and `pyarrow` pre-installed.
  - Protects your local machine from unsafe operations by the agent.
- 📖 Agent Reference Implementation
  - `DataAgent.py` is a fully functional built-in agent.
  - Use it as a template to implement and test your own agent.