daergoth/DataGeneratorTool

Data Generator Tool

data-generator-tool is a small Rust CLI for generating fake tabular data from a JSON schema. It can also read existing CSV or Parquet data, then write it back out as a single file or a partitioned dataset.

This project was built to make it easy to create realistic-looking datasets for demos, local development, and data-processing experiments without depending on production data.

What it does

  • Generates a polars::DataFrame from a schema file.
  • Writes output as CSV or Parquet.
  • Supports single-file output and partitioned output.
  • Can read back existing CSV/Parquet input for a simple round trip.
  • Uses deterministic per-row seeding for most generators, so repeated runs are stable when the seed is fixed.
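As a rough illustration of what per-row seeding can look like, the sketch below derives a seed for each cell from the schema seed, a column index, and the row number. The names mix64 and row_seed are hypothetical and the mixing function (SplitMix64) is an assumption; the tool's actual derivation may differ. The point is only that a seed depending on nothing but these three inputs stays the same across runs, whatever order the rows are produced in.

```rust
/// SplitMix64 step: a well-known way to scramble a 64-bit value.
/// Assumption: the real tool may use a different mixing function.
fn mix64(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical helper: derive the seed for one cell from the schema's
/// main seed, the column position, and the row number.
fn row_seed(main_seed: u64, column_index: u64, row: u64) -> u64 {
    let column_seed = mix64(main_seed ^ mix64(column_index));
    mix64(column_seed.wrapping_add(row))
}
```

Calling row_seed(0, 1, 5) twice returns the same value, while changing the row or the main seed yields a different one.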

Example workflow

Generate 1,000 rows of CSV:

cargo run -- -s schema.json -r 1000 -f csv -o out.csv

Read it back:

cargo run -- -i out.csv -f csv

Generate partitioned Parquet output into a directory:

mkdir out
cargo run -- -s schema.json -r 1000 -f parquet -o out

CLI options

The main flags can also be supplied through environment variables.

| Flag | Env var | Description | Default |
|------|---------|-------------|---------|
| -s, --schema <SCHEMA_FILE> | DATAGEN_SCHEMA_FILE | JSON schema file | schema.json |
| -r, --rows <NUM_ROWS> | DATAGEN_NUM_ROWS | Number of rows to generate | 10000 |
| -t, --threads <NO_THREADS> | DATAGEN_NUM_THREADS | Rayon worker threads | 1 |
| -o, --output <OUTPUT_PATH> | DATAGEN_OUTPUT_PATH | Output file or directory | not set |
| -i, --input <INPUT_PATH> | DATAGEN_INPUT_PATH | Read an existing file or dataset directory | not set |
| -f, --format <FORMAT> | DATAGEN_OUTPUT_FORMAT | csv or parquet | csv |

Notes:

  • -f/--format controls both read mode and write mode.
  • The tool does not infer CSV vs Parquet from the file extension.
  • -t/--threads sets RAYON_NUM_THREADS before generation starts.
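The sketch below shows one way the flag, environment variable, and default could be combined for a single option. It assumes the usual clap precedence (an explicit flag wins over the environment variable, which wins over the default); the README does not state the order, and the helper name resolve is hypothetical.

```rust
use std::env;

/// Hypothetical helper: use the flag value if one was given, otherwise the
/// environment variable, otherwise the documented default.
/// Assumption: this precedence matches how the CLI resolves its options.
fn resolve(flag: Option<&str>, env_key: &str, default: &str) -> String {
    flag.map(str::to_owned)
        .or_else(|| env::var(env_key).ok())
        .unwrap_or_else(|| default.to_owned())
}
```

For example, resolve(None, "DATAGEN_NUM_ROWS", "10000") falls back to 10000 when the variable is not set.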

Schema format

The schema is a JSON file with a main_seed and a columns array. Each column needs at least a name and type, and may include extra fields depending on the generator.

Minimal example:

{
  "main_seed": 0,
  "columns": [
    { "name": "id", "type": "RowNumber" },
    { "name": "name", "type": "FullName" },
    { "name": "active", "type": "Bool", "ratio": 0.5 }
  ]
}

Common column types

Implemented generators include:

  • Hash
  • Uuid
  • RowNumber
  • Number
  • Date
  • Bool
  • Enum
  • Words
  • Numerify
  • FirstName
  • LastName
  • FullName
  • Address
  • City
  • State
  • Zip
  • FreeEmail
  • CompanyName
  • PhoneNumber
  • StreetName

The example schema.json in this repo shows a broader mix of these types working together.

Some types accept additional fields:

  • seed — per-column override for the default seed
  • format — used by Hash, Numerify, and Date
  • min / max — for Number
  • start / end — for Date
  • ratio — for Bool
  • values — for Enum
  • count — for Words

If an unsupported type is used, generation currently panics.
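Because an unsupported type only surfaces as a panic during generation, it can help to check the type names in a schema beforehand. The following is a minimal sketch of such a check, not part of the tool: the function name unsupported_types is hypothetical, and it assumes type names must match the list above exactly, including case.

```rust
/// Generator names copied from the list above.
const SUPPORTED_TYPES: &[&str] = &[
    "Hash", "Uuid", "RowNumber", "Number", "Date", "Bool", "Enum", "Words",
    "Numerify", "FirstName", "LastName", "FullName", "Address", "City",
    "State", "Zip", "FreeEmail", "CompanyName", "PhoneNumber", "StreetName",
];

/// Hypothetical pre-flight check: return every type name that is not in
/// the supported list. Assumption: matching is exact and case-sensitive.
fn unsupported_types<'a>(types: &[&'a str]) -> Vec<&'a str> {
    types
        .iter()
        .copied()
        .filter(|t| !SUPPORTED_TYPES.iter().any(|s| *s == *t))
        .collect()
}
```

An empty result means every column type in the schema appears in the list.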

Output behavior

Single file output

If -o points to a file path, the tool writes:

  • the data file itself, and
  • a sibling stats file named *-stats.csv or *-stats.parquet

The stats file contains basic string-column summaries such as column name, row count, and max string length.
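To locate the stats file from a script, the sibling name can be derived from the data path. The sketch below assumes the * in *-stats.csv stands for the data file's stem, which matches the part-00000.csv / part-00000-stats.csv pairs in the partitioned layout; stats_path is a hypothetical helper, not a function from the tool.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: map `out.csv` to `out-stats.csv`.
/// Returns None when the path has no file stem or no extension.
fn stats_path(data_path: &Path) -> Option<PathBuf> {
    let stem = data_path.file_stem()?.to_str()?;
    let ext = data_path.extension()?.to_str()?;
    Some(data_path.with_file_name(format!("{stem}-stats.{ext}")))
}
```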

Output layout

Partitioned output

If -o points to a directory, or ends with /, the tool writes a partitioned layout:

output/
  dataset=0/
    part-00000.csv
    part-00000-stats.csv
    part-00001.csv
    part-00001-stats.csv

For Parquet, the same layout is used with .parquet files.

Existing .csv or .parquet files inside the dataset directory are cleaned up before rewriting.
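The rule for choosing the partitioned layout and the part-file naming can be sketched as follows. Both function names are hypothetical, and the five-digit zero padding is inferred from the example tree rather than confirmed from the source code.

```rust
use std::path::Path;

/// Hypothetical check mirroring the rule above: a trailing `/` or an
/// existing directory selects the partitioned layout.
fn is_partitioned_target(output: &str) -> bool {
    output.ends_with('/') || Path::new(output).is_dir()
}

/// Hypothetical helper: zero-padded part names as in the example tree.
/// Assumption: the index is always padded to five digits.
fn part_file_name(index: usize, ext: &str) -> String {
    format!("part-{index:05}.{ext}")
}
```

Under these assumptions, a path such as out/ is treated as partitioned even before the directory exists, while out.csv is treated as a single file unless a directory with that name is already present.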

CSV specifics

  • CSV output uses ; as the separator.
  • This applies to both single-file and partitioned CSV output.
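Anything that consumes the CSV output has to split on ; rather than on a comma. The sketch below is a deliberately naive illustration that assumes no quoted fields and no embedded separators; split_record is a hypothetical helper, and a real CSV reader configured with ; as its separator is the better choice for actual data.

```rust
/// Naive split of one `;`-separated line.
/// Assumption: fields are not quoted and never contain `;` themselves.
fn split_record(line: &str) -> Vec<&str> {
    line.trim_end_matches(&['\r', '\n'][..])
        .split(';')
        .collect()
}
```

Commas are left untouched, so a value such as "Lovelace, Ada" stays in a single field.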

Read mode

When -i/--input is set, the tool reads the given file or dataset directory instead of generating new data.

  • If the input path is a file, it reads a single CSV or Parquet file.
  • If the input path is a directory, it recursively reads partitioned files.
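A recursive directory read can be pictured as a walk that collects every file with the expected extension. The sketch below uses only the standard library; collect_files is a hypothetical helper, not the tool's implementation. It does not filter out the *-stats files, since the README does not say how read mode treats them.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: append every file under `root` whose extension is
/// `ext` (for example "csv") to `out`, descending into subdirectories.
fn collect_files(root: &Path, ext: &str, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            // Partition directories such as dataset=0/ are walked recursively.
            collect_files(&path, ext, out)?;
        } else if path.extension().and_then(|e| e.to_str()) == Some(ext) {
            out.push(path);
        }
    }
    Ok(())
}
```

The order of the collected paths follows the directory listing, so sort the result if a stable order matters.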

Notes and limitations

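  • Because the format is never inferred from the file extension, pass -f parquet explicitly when reading or writing Parquet; the default is csv.
  • Unsupported generator types cause a panic instead of a schema validation error.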
  • The CLI is intentionally simple and geared toward local data generation, not large-scale orchestration.

Development notes

  • schema.json is the quickest smoke test for generator changes.
  • The project uses polars, rayon, fake, clap, serde_json, sha2, chrono, uuid, and strum.
  • On Windows, cargo test may require the MSVC build tools / link.exe.

Future improvements

Good next steps for the project would be:

  • friendlier schema validation and error messages
  • more supported fake-data generators
  • clearer reporting around unsupported column types
  • optional examples or presets for common dataset shapes

License

This project is licensed under the MIT License. See LICENSE for details.
