A production-grade Text-to-SQL agent built with LangGraph that converts natural language questions into SQL queries. The agent uses LLMs for language understanding, schema grounding, query planning, and SQL generation, with robust error handling and repair mechanisms.
- Natural Language to SQL: Convert questions like "Show me the top 10 pop songs from 2018" into SQL queries
- Multi-Dataset Support: Works with musicoset_popularity, musicoset_metadata, and musicoset_songfeatures datasets
- Intelligent Schema Understanding: Automatic schema introspection, foreign key detection, and join graph generation
- Robust Error Handling: Automatic SQL repair and retry mechanisms
- Safety Validation: Built-in security checks to prevent unsafe queries
- Clarification Support: Asks for clarification when queries are ambiguous
- LangSmith Integration: Built-in tracing support for observability
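The schema-understanding feature above builds on SQLite's own introspection facilities. As a rough illustration (not the actual db/schema.py implementation), columns and foreign keys can be collected with PRAGMA statements:

```python
import sqlite3

def introspect(db_path: str) -> dict:
    """Collect columns and foreign keys for every table (illustrative sketch)."""
    conn = sqlite3.connect(db_path)
    schema = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [r[1] for r in conn.execute(f"PRAGMA table_info('{table}')")]
        # PRAGMA foreign_key_list rows: (id, seq, ref_table, from_col, to_col, ...)
        fks = [(r[3], r[2], r[4]) for r in conn.execute(
            f"PRAGMA foreign_key_list('{table}')")]
        schema[table] = {"columns": columns, "foreign_keys": fks}
    conn.close()
    return schema
```

A structure like this is enough to build a join graph: each foreign-key triple is an edge between two tables.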
sqlagent/
├── agents/ # LangGraph agent implementation
│ ├── nodes/ # Individual graph nodes (20 nodes)
│ ├── conditions.py # Conditional routing logic
│ ├── graph.py # Main graph definition
│ ├── state.py # State schema
│ └── tools/ # LLM clients and prompts
├── db/ # Database utilities
│ ├── connection.py # Database connection management
│ ├── load_sqlite.py # CSV to SQLite loader
│ └── schema.py # Schema introspection utilities
├── dataset/ # CSV data files
│ ├── musicoset_popularity/
│ ├── musicoset_metadata/
│ └── musicoset_songfeatures/
├── data/ # Generated database
│ └── db/
│ └── musicoset_popularity.sqlite
├── main.py # CLI entry point
├── requirements.txt # Python dependencies
└── .env # Environment variables (API keys)
- Python 3.9 or higher
- pip or conda
- LLM API key (OpenAI or Anthropic)
cd /path/to/sqlagent
pip install -r requirements.txt
Required packages:
- pandas>=2.0.0 - Data loading
- openai>=1.0.0 - OpenAI LLM support
- anthropic>=0.18.0 - Anthropic LLM support
- langgraph>=0.0.40 - Graph-based agent framework
- langchain>=0.1.0 - LangChain integration
- python-dotenv>=1.0.1 - Environment variable management
Create a .env file in the sqlagent/ directory:
# LLM API Keys (at least one required)
OPENAI_API_KEY=sk-your-openai-key-here
# OR
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here
# LangSmith Tracing (optional but recommended)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=lsv2-your-langsmith-key-here
LANGCHAIN_PROJECT=text-to-sql-agent
Note: The agent falls back to rule-based methods if no LLM API key is provided, but LLM integration produces much better results.
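The actual key-selection logic lives in the agent's LLM tooling and is not reproduced here; a minimal sketch of the precedence implied above (the function name pick_provider is illustrative, and in the real project python-dotenv's load_dotenv() would populate the environment from .env first):

```python
import os

def pick_provider(env=os.environ) -> str:
    """Return which LLM backend is available, mirroring the fallback order above."""
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return "rule-based-fallback"  # agent still runs, with weaker results
```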
Important: The dataset CSV files are not included in this repository due to their large size (some files exceed GitHub's 100 MB limit). You need to download and place them manually.
Create the following exact folder structure in the sqlagent/ directory:
sqlagent/
├── dataset/
│ ├── musicoset_popularity/
│ │ ├── album_chart.csv
│ │ ├── album_pop.csv
│ │ ├── artist_chart.csv
│ │ ├── artist_pop.csv
│ │ ├── song_chart.csv
│ │ └── song_pop.csv
│ ├── musicoset_metadata/
│ │ ├── albums.csv
│ │ ├── artists.csv
│ │ ├── releases.csv
│ │ ├── songs.csv
│ │ └── tracks.csv
│ └── musicoset_songfeatures/
│ ├── acoustic_features.csv
│ └── lyrics.csv
You need to obtain the dataset files from the original sources:
- musicoset_popularity (6 CSV files):
  - album_chart.csv
  - album_pop.csv
  - artist_chart.csv
  - artist_pop.csv
  - song_chart.csv
  - song_pop.csv
- musicoset_metadata (5 CSV files):
  - albums.csv
  - artists.csv
  - releases.csv
  - songs.csv
  - tracks.csv
- musicoset_songfeatures (2 CSV files):
  - acoustic_features.csv
  - lyrics.csv (⚠️ Large file: ~142 MB)
# Navigate to project directory
cd sqlagent
# Create dataset directory structure
mkdir -p dataset/musicoset_popularity
mkdir -p dataset/musicoset_metadata
mkdir -p dataset/musicoset_songfeatures
# Place your CSV files in the respective directories
# Example:
# cp /path/to/your/files/*.csv dataset/musicoset_popularity/
After placing the files, verify the structure:
# Check all files are present
ls -lh dataset/musicoset_popularity/*.csv
ls -lh dataset/musicoset_metadata/*.csv
ls -lh dataset/musicoset_songfeatures/*.csv
# Expected output: 13 CSV files total
find dataset -name "*.csv" | wc -l # Should output: 13
Note: The dataset/ directory and its subdirectories are tracked by git, but the .csv files themselves are ignored (see .gitignore). This preserves the folder structure in the repository while excluding large data files.
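The shell checks above can also be scripted. A small hypothetical checker (not part of the repository) that reports any expected CSVs that are missing:

```python
from pathlib import Path

# Expected layout, as described in the Dataset Setup section.
EXPECTED = {
    "musicoset_popularity": ["album_chart.csv", "album_pop.csv", "artist_chart.csv",
                             "artist_pop.csv", "song_chart.csv", "song_pop.csv"],
    "musicoset_metadata": ["albums.csv", "artists.csv", "releases.csv",
                           "songs.csv", "tracks.csv"],
    "musicoset_songfeatures": ["acoustic_features.csv", "lyrics.csv"],
}

def missing_files(root: str = "dataset") -> list:
    """Return relative paths of expected CSVs that are absent under root."""
    base = Path(root)
    return [f"{d}/{f}" for d, files in EXPECTED.items()
            for f in files if not (base / d / f).is_file()]
```

Run it from the sqlagent/ directory; an empty list means all 13 files are in place.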
Prerequisites: Make sure you have set up the dataset files as described in the Dataset Setup section above.
The project includes three datasets that need to be loaded into SQLite:
- musicoset_popularity: Chart and popularity data (6 tables)
- musicoset_metadata: Album, artist, and song metadata (5 tables)
- musicoset_songfeatures: Acoustic features and lyrics (2 tables)
Ensure all CSV files are present in the dataset/ directory with the correct folder structure:
# Check dataset structure
ls -R dataset/
# Verify file count
find dataset -name "*.csv" | wc -l # Should output: 13
You should see:
- dataset/musicoset_popularity/*.csv (6 files)
- dataset/musicoset_metadata/*.csv (5 files)
- dataset/musicoset_songfeatures/*.csv (2 files)
Run the data loader:
python3 -m db.load_sqlite
Or from the project root:
cd sqlagent
python3 -m db.load_sqlite
This will:
- Create the SQLite database at data/db/musicoset_popularity.sqlite
- Load all 13 CSV files into tables
- Create indexes on foreign keys and commonly queried columns
- Display progress and row counts for each table
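The actual loader is db/load_sqlite.py; its core step, reading each CSV into a same-named SQLite table, can be sketched with pandas (indexes and progress output are omitted here):

```python
import sqlite3
from pathlib import Path

import pandas as pd

def load_csv_dir(csv_dir: str, db_path: str) -> dict:
    """Load every CSV in csv_dir into a same-named table; return row counts."""
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    counts = {}
    for csv_file in sorted(Path(csv_dir).glob("*.csv")):
        table = csv_file.stem                      # e.g. song_pop.csv -> song_pop
        df = pd.read_csv(csv_file)
        df.to_sql(table, conn, if_exists="replace", index=False)
        counts[table] = len(df)
    conn.close()
    return counts
```

In the real loader this would be run once per dataset subdirectory, followed by CREATE INDEX statements on the foreign-key columns.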
Expected Output:
============================================================
Loading all dataset data into SQLite
============================================================
Loading album_chart.csv into table 'album_chart'...
✓ Loaded 471,706 rows into 'album_chart'
...
✓ Database created successfully at: .../musicoset_popularity.sqlite
============================================================
You can verify the database was created:
sqlite3 data/db/musicoset_popularity.sqlite ".tables"
This should show all 13 tables.
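The same check can be done from Python with only the standard library, which is handy if the sqlite3 CLI is not installed:

```python
import sqlite3

def list_tables(db_path: str) -> list:
    """Return the table names in the SQLite file, as the .tables command would."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    names = [r[0] for r in rows]
    conn.close()
    return names
```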
Start the interactive CLI:
python3 main.py
This will start an interactive session where you can ask questions:
============================================================
Text-to-SQL Agent - Interactive Mode
============================================================
Enter your questions (type 'exit', 'quit', or 'q' to stop)
Type 'help' for usage examples
> Show me the top 10 pop songs from 2018
Read Graph.md to understand the graph and its conditional routing logic.