offside609/text2sql

Text-to-SQL Agent

A production-grade Text-to-SQL agent built with LangGraph that converts natural language questions into SQL queries. The agent uses LLMs for language understanding, schema grounding, query planning, and SQL generation, with robust error handling and repair mechanisms.

Features

  • Natural Language to SQL: Convert questions like "Show me the top 10 pop songs from 2018" into SQL queries
  • Multi-Dataset Support: Works with musicoset_popularity, musicoset_metadata, and musicoset_songfeatures datasets
  • Intelligent Schema Understanding: Automatic schema introspection, foreign key detection, and join graph generation
  • Robust Error Handling: Automatic SQL repair and retry mechanisms
  • Safety Validation: Built-in security checks to prevent unsafe queries
  • Clarification Support: Asks for clarification when queries are ambiguous
  • LangSmith Integration: Built-in tracing support for observability

Project Structure

sqlagent/
├── agents/              # LangGraph agent implementation
│   ├── nodes/           # Individual graph nodes (20 nodes)
│   ├── conditions.py    # Conditional routing logic
│   ├── graph.py         # Main graph definition
│   ├── state.py         # State schema
│   └── tools/           # LLM clients and prompts
├── db/                  # Database utilities
│   ├── connection.py    # Database connection management
│   ├── load_sqlite.py   # CSV to SQLite loader
│   └── schema.py        # Schema introspection utilities
├── dataset/             # CSV data files
│   ├── musicoset_popularity/
│   ├── musicoset_metadata/
│   └── musicoset_songfeatures/
├── data/                # Generated database
│   └── db/
│       └── musicoset_popularity.sqlite
├── main.py              # CLI entry point
├── requirements.txt     # Python dependencies
└── .env                 # Environment variables (API keys)

Prerequisites

  • Python 3.9 or higher
  • pip or conda
  • LLM API key (OpenAI or Anthropic)

Installation

1. Clone or Navigate to the Project

cd /path/to/sqlagent

2. Install Dependencies

pip install -r requirements.txt

Required packages:

  • pandas>=2.0.0 - Data loading
  • openai>=1.0.0 - OpenAI LLM support
  • anthropic>=0.18.0 - Anthropic LLM support
  • langgraph>=0.0.40 - Graph-based agent framework
  • langchain>=0.1.0 - LangChain integration
  • python-dotenv>=1.0.1 - Environment variable management

3. Set Up Environment Variables

Create a .env file in the sqlagent/ directory:

# LLM API Keys (at least one required)
OPENAI_API_KEY=sk-your-openai-key-here
# OR
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here

# LangSmith Tracing (optional but recommended)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=lsv2-your-langsmith-key-here
LANGCHAIN_PROJECT=text-to-sql-agent

Note: If no LLM API key is provided, the agent falls back to rule-based methods, but LLM integration produces much better results.
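The key-or-fallback behavior described in the note can be sketched with a small helper. This is a hypothetical illustration based only on the environment variables above; `pick_provider` is not part of the actual codebase, whose selection logic lives under `agents/tools/`.

```python
import os

def pick_provider() -> str:
    """Pick an LLM provider based on which API key is set.

    Hypothetical helper mirroring the note above; the agent's real
    selection logic may differ.
    """
    if os.getenv("OPENAI_API_KEY"):
        return "openai"
    if os.getenv("ANTHROPIC_API_KEY"):
        return "anthropic"
    return "fallback"  # rule-based methods when no key is available
```

With python-dotenv (listed in the requirements), calling `load_dotenv()` at startup populates these variables from the `.env` file before any such check runs.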

Dataset Setup

Important: The dataset CSV files are not included in this repository due to their large size (some files exceed GitHub's 100 MB limit). You need to download and place them manually.

Required Folder Structure

Create the following exact folder structure in the sqlagent/ directory:

sqlagent/
├── dataset/
│   ├── musicoset_popularity/
│   │   ├── album_chart.csv
│   │   ├── album_pop.csv
│   │   ├── artist_chart.csv
│   │   ├── artist_pop.csv
│   │   ├── song_chart.csv
│   │   └── song_pop.csv
│   ├── musicoset_metadata/
│   │   ├── albums.csv
│   │   ├── artists.csv
│   │   ├── releases.csv
│   │   ├── songs.csv
│   │   └── tracks.csv
│   └── musicoset_songfeatures/
│       ├── acoustic_features.csv
│       └── lyrics.csv

Downloading the Datasets

You need to obtain the dataset files from the original sources:

  1. musicoset_popularity (6 CSV files):

    • album_chart.csv
    • album_pop.csv
    • artist_chart.csv
    • artist_pop.csv
    • song_chart.csv
    • song_pop.csv
  2. musicoset_metadata (5 CSV files):

    • albums.csv
    • artists.csv
    • releases.csv
    • songs.csv
    • tracks.csv
  3. musicoset_songfeatures (2 CSV files):

    • acoustic_features.csv
    • lyrics.csv (⚠️ Large file: ~142 MB)
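The same layout check can be done from Python. The snippet below is a hypothetical helper built from the file list above, not part of the repository:

```python
from pathlib import Path

# Expected CSV files per dataset folder, per the structure above.
EXPECTED = {
    "musicoset_popularity": [
        "album_chart.csv", "album_pop.csv", "artist_chart.csv",
        "artist_pop.csv", "song_chart.csv", "song_pop.csv",
    ],
    "musicoset_metadata": [
        "albums.csv", "artists.csv", "releases.csv", "songs.csv", "tracks.csv",
    ],
    "musicoset_songfeatures": ["acoustic_features.csv", "lyrics.csv"],
}

def missing_files(root: str = "dataset") -> list[str]:
    """Return the paths of expected CSVs that are absent under `root`."""
    base = Path(root)
    return [
        str(base / folder / name)
        for folder, names in EXPECTED.items()
        for name in names
        if not (base / folder / name).is_file()
    ]
```

`missing_files()` returns an empty list when all 13 files are in place.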

Quick Setup Commands

# Navigate to project directory
cd sqlagent

# Create dataset directory structure
mkdir -p dataset/musicoset_popularity
mkdir -p dataset/musicoset_metadata
mkdir -p dataset/musicoset_songfeatures

# Place your CSV files in the respective directories
# Example:
# cp /path/to/your/files/*.csv dataset/musicoset_popularity/

Verify Dataset Files

After placing the files, verify the structure:

# Check all files are present
ls -lh dataset/musicoset_popularity/*.csv
ls -lh dataset/musicoset_metadata/*.csv
ls -lh dataset/musicoset_songfeatures/*.csv

# Expected output: 13 CSV files total
find dataset -name "*.csv" | wc -l  # Should output: 13

Note: The dataset/ directory and its subdirectories are tracked by git, but the .csv files themselves are ignored (see .gitignore). This allows the folder structure to be preserved in the repository while excluding large data files.
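One way to get this track-folders-but-ignore-data behavior is with patterns along these lines (a sketch; the repository's actual .gitignore may differ, and keeping empty folders tracked typically also requires placeholder files such as .gitkeep):

```gitignore
# Ignore the large data files...
dataset/**/*.csv
# ...but keep placeholder files so the folder structure stays tracked
!dataset/**/.gitkeep
```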

Loading the Dataset

Prerequisites: Make sure you have set up the dataset files as described in the Dataset Setup section above.

The project includes three datasets that need to be loaded into SQLite:

  1. musicoset_popularity: Chart and popularity data (6 tables)
  2. musicoset_metadata: Album, artist, and song metadata (5 tables)
  3. musicoset_songfeatures: Acoustic features and lyrics (2 tables)

Step 1: Verify Dataset Files

Ensure all CSV files are present in the dataset/ directory with the correct folder structure:

# Check dataset structure
ls -R dataset/

# Verify file count
find dataset -name "*.csv" | wc -l  # Should output: 13

You should see:

  • dataset/musicoset_popularity/*.csv (6 files)
  • dataset/musicoset_metadata/*.csv (5 files)
  • dataset/musicoset_songfeatures/*.csv (2 files)

Step 2: Load Data into SQLite

Run the data loader:

python3 -m db.load_sqlite

Or, if you are not already in the project directory:

cd sqlagent
python3 -m db.load_sqlite

This will:

  • Create the SQLite database at data/db/musicoset_popularity.sqlite
  • Load all 13 CSV files into tables
  • Create indexes on foreign keys and commonly queried columns
  • Display progress and row counts for each table

Expected Output:

============================================================
Loading all dataset data into SQLite
============================================================
Loading album_chart.csv into table 'album_chart'...
  ✓ Loaded 471,706 rows into 'album_chart'
...
✓ Database created successfully at: .../musicoset_popularity.sqlite
============================================================
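At its core, loading one CSV boils down to the following. This is a simplified, stdlib-only sketch, not the actual `db/load_sqlite.py` — the real loader also types columns and creates the indexes mentioned above:

```python
import csv
import sqlite3
from pathlib import Path

def load_csv_to_sqlite(csv_path: str, db_path: str) -> int:
    """Load one CSV into a SQLite table named after the file stem.

    Simplified sketch: every column is stored as-is with no declared
    types and no indexes. Returns the number of rows loaded.
    """
    table = Path(csv_path).stem
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    with sqlite3.connect(db_path) as conn:
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)
```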

Step 3: Verify Database

You can verify the database was created:

sqlite3 data/db/musicoset_popularity.sqlite ".tables"

Should show all 13 tables.
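The same check can be run from Python using only the standard library (a hypothetical snippet, not part of the codebase):

```python
import sqlite3

def list_tables(db_path: str) -> list[str]:
    """Return the table names in a SQLite database (13 expected here)."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
        return [name for (name,) in rows]
```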

Running the Agent

Interactive Mode (Recommended)

Start the interactive CLI:

python3 main.py

This will start an interactive session where you can ask questions:

============================================================
Text-to-SQL Agent - Interactive Mode
============================================================
Enter your questions (type 'exit', 'quit', or 'q' to stop)
Type 'help' for usage examples

> Show me the top 10 pop songs from 2018

Read Graph.md to understand the graph and its conditional logic.
