Data Generator Tool

data-generator-tool is a small Rust CLI for generating fake tabular data from a JSON schema. It can also read existing CSV or Parquet data, then write it back out as a single file or a partitioned dataset.

It exists to make realistic-looking datasets easy to create for demos, local development, and data-processing experiments, without depending on production data.

What it does

Generates a polars::DataFrame from a schema file.
Writes output as CSV or Parquet.
Supports single-file output and partitioned output.
Can read back existing CSV/Parquet input for a simple round trip.
Uses deterministic per-row seeding for most generators, so repeated runs are stable when the seed is fixed.
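The README does not spell out how per-row seeds are derived, so the following is only a sketch of the general idea, not the tool's actual scheme. The helper names `mix64` and `row_seed` are hypothetical, and the mixing function is a standard SplitMix64 step chosen for illustration. The point is that a seed computed purely from the schema seed, the column, and the row number gives the same value no matter which worker thread handles the row.

```rust
/// SplitMix64 step: scrambles a 64-bit value into a well-mixed one.
/// Every operation here is reversible, so distinct inputs give distinct outputs.
fn mix64(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical helper (not taken from the tool's source): derive the seed
/// for one cell from the schema-level seed, a column index, and the row number.
fn row_seed(main_seed: u64, column_index: u64, row: u64) -> u64 {
    // Fold the column into the main seed first, then fold in the row.
    let column_seed = mix64(main_seed ^ mix64(column_index));
    mix64(column_seed ^ row)
}
```

Calling `row_seed(0, 1, 42)` twice returns the same value, while changing the row, the column, or the main seed changes it.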

Example workflow

Generate 1,000 rows of CSV:

cargo run -- -s schema.json -r 1000 -f csv -o out.csv

Read it back:

cargo run -- -i out.csv -f csv

Generate partitioned Parquet output into a directory:

mkdir out
cargo run -- -s schema.json -r 1000 -f parquet -o out

CLI options

The main flags can also be supplied through environment variables.

| Flag | Env var | Description | Default |
| --- | --- | --- | --- |
| `-s, --schema <SCHEMA_FILE>` | `DATAGEN_SCHEMA_FILE` | JSON schema file | `schema.json` |
| `-r, --rows <NUM_ROWS>` | `DATAGEN_NUM_ROWS` | Number of rows to generate | `10000` |
| `-t, --threads <NO_THREADS>` | `DATAGEN_NUM_THREADS` | Rayon worker threads | `1` |
| `-o, --output <OUTPUT_PATH>` | `DATAGEN_OUTPUT_PATH` | Output file or directory | not set |
| `-i, --input <INPUT_PATH>` | `DATAGEN_INPUT_PATH` | Read an existing file or dataset directory | not set |
| `-f, --format <FORMAT>` | `DATAGEN_OUTPUT_FORMAT` | `csv` or `parquet` | `csv` |

Notes:

-f/--format controls both read mode and write mode.
The tool does not infer CSV vs Parquet from the file extension.
-t/--threads sets RAYON_NUM_THREADS before generation starts.
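As a rough illustration of how an option value could be resolved, the sketch below picks the explicit flag first, then the environment variable, then the default from the table. That precedence order, the helper names `resolve` and `parse_format`, and the exact-lowercase matching are assumptions for illustration; the tool itself relies on clap for argument handling.

```rust
use std::env;

/// The two formats named in the options table.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Format {
    Csv,
    Parquet,
}

/// Hypothetical helper: explicit flag wins, then the environment variable,
/// then the documented default (assumed precedence).
fn resolve(flag: Option<String>, env_var: &str, default: &str) -> String {
    flag.or_else(|| env::var(env_var).ok())
        .unwrap_or_else(|| default.to_string())
}

/// Parse the format from the option value only; the file extension is
/// never consulted, matching the note above.
fn parse_format(value: &str) -> Result<Format, String> {
    match value {
        "csv" => Ok(Format::Csv),
        "parquet" => Ok(Format::Parquet),
        other => Err(format!("unknown format '{other}', expected csv or parquet")),
    }
}
```

For example, `resolve(None, "DATAGEN_OUTPUT_FORMAT", "csv")` falls back to `csv` when neither the flag nor the variable is set, and `parse_format` rejects anything other than the two documented values.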

Schema format

The schema is a JSON file with a main_seed and a columns array. Each column needs at least a name and type, and may include extra fields depending on the generator.

Minimal example:

{
  "main_seed": 0,
  "columns": [
    { "name": "id", "type": "RowNumber" },
    { "name": "name", "type": "FullName" },
    { "name": "active", "type": "Bool", "ratio": 0.5 }
  ]
}

Common column types

Implemented generators include:

Hash
Uuid
RowNumber
Number
Date
Bool
Enum
Words
Numerify
FirstName
LastName
FullName
Address
City
State
Zip
FreeEmail
CompanyName
PhoneNumber
StreetName

The example schema.json in this repo shows a broader mix of these types working together.

Some types accept additional fields:

seed — per-column override for the default seed
format — used by Hash, Numerify, and Date
min / max — for Number
start / end — for Date
ratio — for Bool
values — for Enum
count — for Words

If an unsupported type is used, generation currently panics.
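Until friendlier validation exists, a schema can be checked up front before running the generator. The sketch below is not part of the tool: `check_column_types` is a hypothetical helper that takes `(name, type)` pairs already extracted from the schema and compares them against the type names listed in this README, assuming the names are case-sensitive.

```rust
/// Generator names as listed in this README.
const KNOWN_TYPES: &[&str] = &[
    "Hash", "Uuid", "RowNumber", "Number", "Date", "Bool", "Enum", "Words",
    "Numerify", "FirstName", "LastName", "FullName", "Address", "City",
    "State", "Zip", "FreeEmail", "CompanyName", "PhoneNumber", "StreetName",
];

/// Hypothetical pre-flight check: returns one message per column whose
/// type is not in the list above, instead of panicking on the first one.
fn check_column_types(columns: &[(&str, &str)]) -> Result<(), Vec<String>> {
    let errors: Vec<String> = columns
        .iter()
        .filter(|(_, ty)| !KNOWN_TYPES.contains(ty))
        .map(|(name, ty)| format!("column '{name}': unsupported type '{ty}'"))
        .collect();

    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Passing `[("id", "RowNumber"), ("x", "Money")]` reports the `x` column and leaves `id` alone.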

Output behavior

Single file output

If -o points to a file path, the tool writes:

the data file itself, and
a sibling stats file named *-stats.csv or *-stats.parquet

The stats file contains basic string-column summaries such as column name, row count, and max string length.
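The naming rule for the sibling stats file can be sketched with `std::path`. The function name `stats_path` and the fallback stem `output` are hypothetical; only the `*-stats.csv` / `*-stats.parquet` pattern comes from the description above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: map `out.csv` to `out-stats.csv`, keeping the
/// directory and extension of the data file.
fn stats_path(output: &Path) -> PathBuf {
    // Fall back to a placeholder stem if the path has no usable file name.
    let stem = output
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("output");

    let name = match output.extension().and_then(|e| e.to_str()) {
        Some(ext) => format!("{stem}-stats.{ext}"),
        None => format!("{stem}-stats"),
    };

    output.with_file_name(name)
}
```

With this sketch, `data/out.parquet` maps to `data/out-stats.parquet`.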

Output layout

Partitioned output

If -o points to a directory, or ends with /, the tool writes a partitioned layout:

output/
  dataset=0/
    part-00000.csv
    part-00000-stats.csv
    part-00001.csv
    part-00001-stats.csv

For Parquet, the same layout is used with .parquet files.
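The rule for choosing the partitioned layout, and the file names inside it, might look like the following. This is a sketch under assumptions: the function names are hypothetical, and the zero-padded five-digit index is inferred from the example names above.

```rust
use std::path::Path;

/// Partitioned mode applies when the path ends with `/` or already
/// exists as a directory.
fn is_partitioned_output(output: &str) -> bool {
    output.ends_with('/') || Path::new(output).is_dir()
}

/// Data file name for a partition, e.g. `part-00001.csv`.
fn part_file_name(index: usize, extension: &str) -> String {
    format!("part-{index:05}.{extension}")
}

/// Matching stats file name, e.g. `part-00001-stats.csv`.
fn part_stats_file_name(index: usize, extension: &str) -> String {
    format!("part-{index:05}-stats.{extension}")
}
```

This also shows why the earlier example runs `mkdir out` first: without a trailing `/`, a path that does not yet exist as a directory would be treated as a single file.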

Existing .csv or .parquet files inside the dataset directory are cleaned up before rewriting.

CSV specifics

CSV output uses ; as the separator.
This applies to both single-file and partitioned CSV output.
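Anything that consumes the CSV output needs to be told about the `;` separator. As a minimal illustration, the hypothetical helper below splits one record on `;`. It does not handle quoted fields, so a proper CSV reader configured with `;` as the delimiter is the better choice for real data.

```rust
/// Naive split of one output line on `;` (no quoting support).
fn split_record(line: &str) -> Vec<&str> {
    line.trim_end_matches(|c| c == '\r' || c == '\n')
        .split(';')
        .collect()
}
```

For example, `split_record("1;Ada Lovelace;true\n")` yields the three fields `1`, `Ada Lovelace`, and `true`.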

Read mode

When -i/--input is set, the tool reads the given file or dataset directory instead of generating new data.

If the input path is a file, it reads a single CSV or Parquet file.
If the input path is a directory, it recursively reads partitioned files.
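The file-versus-directory behavior can be sketched with the standard library alone. This is not the tool's implementation: `collect_data_files` is a hypothetical helper, and both the extension filter and the decision to skip `-stats` siblings are assumptions made so the example has something concrete to select.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: return the single file, or walk a dataset
/// directory recursively and gather the data files found in it.
fn collect_data_files(input: &Path, extension: &str) -> io::Result<Vec<PathBuf>> {
    if input.is_file() {
        return Ok(vec![input.to_path_buf()]);
    }

    let mut files = Vec::new();
    let mut pending = vec![input.to_path_buf()];

    // Iterative depth-first walk over the directory tree.
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
                continue;
            }
            let matches_ext = path.extension().and_then(|e| e.to_str()) == Some(extension);
            // Assumption: stats siblings are summaries, not part of the dataset.
            let is_stats = path
                .file_stem()
                .and_then(|s| s.to_str())
                .map_or(false, |s| s.ends_with("-stats"));
            if matches_ext && !is_stats {
                files.push(path);
            }
        }
    }

    files.sort(); // stable order regardless of directory iteration order
    Ok(files)
}
```

Sorting the result keeps the read order stable across platforms, since `read_dir` makes no ordering guarantee.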

Notes and limitations

The tool does not infer CSV vs Parquet from the file extension.
Unsupported generator types cause a panic rather than a descriptive validation error.
The CLI is intentionally simple and geared toward local data generation, not large-scale orchestration.

Development notes

schema.json is the quickest smoke test for generator changes.
The project uses polars, rayon, fake, clap, serde_json, sha2, chrono, uuid, and strum.
On Windows, cargo test may require the MSVC build tools / link.exe.

Future improvements

Good next steps for the project would be:

friendlier schema validation and error messages
more supported fake-data generators
clearer reporting around unsupported column types
optional examples or presets for common dataset shapes

License

This project is licensed under the MIT License. See LICENSE for details.
