docs: add plan for workflow chaining #552
Conversation
Proposes replacing the in-place `allow_resize` mechanism with a `Pipeline` class that chains multiple generation stages. Each stage gets a fresh fixed-size tracker, and resize becomes a between-stage concern.
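A toy sketch of that shape (plain Python; `Stage`, `add_stage`, `run`, and the callback signature here are illustrative assumptions, not DataDesigner's actual API):

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Stage:
    name: str
    num_records: int
    # Stand-in for a generation backend: takes (num_records, seed_rows).
    generate: Callable

@dataclass
class Pipeline:
    stages: list = field(default_factory=list)

    def add_stage(self, stage: Stage, between: Optional[Callable] = None) -> "Pipeline":
        # `between` is a between-stage callback: it may filter or resize the
        # previous stage's output before the next stage seeds from it.
        self.stages.append((stage, between))
        return self

    def run(self) -> dict:
        results, seed = {}, None
        for stage, between in self.stages:
            rows = stage.generate(stage.num_records, seed)  # fresh fixed-size run
            if between is not None:
                rows = between(rows)  # resize happens between stages, not in place
            results[stage.name] = rows
            seed = rows  # the next stage seeds from this handoff
        return results

# Usage: a draft stage, a between-stage truncation, then an enrich stage.
pipeline = (
    Pipeline()
    .add_stage(Stage("draft", 4, lambda n, seed: [f"row{i}" for i in range(n)]),
               between=lambda rows: rows[:2])
    .add_stage(Stage("enrich", 2, lambda n, seed: [r + "+" for r in (seed or [])[:n]]))
)
results = pipeline.run()
```

The point of the sketch is only the control flow: each stage runs against a fixed record count, and any shrink/grow of the dataset happens in the callback seam between stages.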
* feat(models): deprecate implicit default provider routing

  Emit DeprecationWarning whenever the legacy "implicit default provider" path is exercised: `ModelConfig.provider=None`, the registry-level `ModelProviderRegistry.default`, the YAML `default:` key in `~/.data-designer/model_providers.yaml`, and the CLI's "Change default provider" workflow. `resolve_model_provider_registry` skips passing `default=` in the single-provider case so the common construction path stays quiet. Multi-provider registries still pass `default` (per `check_implicit_default`) and warn accordingly.

  Update docs, the package README, and test fixtures to specify `provider=` explicitly on every `ModelConfig`. New tests cover each warning entry point and pin the post-deprecation happy paths.

  Refs #589
  Made-with: Cursor

* fix(models): address PR #594 review feedback

  Greptile P1: `ProviderRepository.load` emitted its DeprecationWarning inside a `try/except Exception` block. Under `filterwarnings("error", DeprecationWarning)` the warn would raise, the except would swallow it, and `load()` would silently return None (losing the registry). Move the warn outside the catch-all so the strict-warning path no longer drops valid configs.

  Greptile P2 / johnnygreco: `_warn_on_implicit_provider` and `_warn_on_explicit_default` used `stacklevel=2`, which lands inside pydantic v2's validator dispatch rather than at the user's `ModelConfig(...)` / `ModelProviderRegistry(...)` call. That broke both attribution (the source line was unhelpful) and Python's once-per-location dedup (every call collapsed onto the same pydantic-internal key, suppressing all but the first warning). Introduce `data_designer.config.utils.warning_helpers.warn_at_caller`, which walks past the helper, validator, and any pydantic frames to find the user's call site and emits via `warnings.warn_explicit` with the user frame's `__warningregistry__`. This keeps attribution accurate and dedup keyed on the user's (filename, lineno).
  johnnygreco: align the `provider_repository.py` warning copy with the sibling site in `default_model_settings.py` ("specify `provider=` explicitly on each ModelConfig instead") so both YAML-default warning sites give the same migration instruction. The previous wording pointed users at "ModelConfig entries" inside `model_providers.yaml`, where ModelConfig entries don't actually live.

  johnnygreco: dedup the cascade in `DataDesigner.__init__`. With `model_providers=None` and a YAML `default:`, the user previously saw two DeprecationWarnings for the same root cause: `get_default_provider_name()` warns about the YAML key, then `resolve_model_provider_registry(...)` re-warns from `_warn_on_explicit_default`. Suppress the registry-level duplicate in the YAML-fallback branch via `warnings.catch_warnings()` so users see exactly one warning per user action.

  johnnygreco: tighten `_warn_on_explicit_default` to fire only when `default is not None`. Passing `default=None` explicitly is semantically equivalent to omitting it (the caller is opting *out* of a registry-level default) and shouldn't trigger the deprecation nudge.

  johnnygreco: add a `model_validate({...})` regression test for `ModelConfig` so the deserialization path (legacy on-disk configs) is pinned alongside the construction path.

  Tests:
  - Update `test_load_exists` and `test_save` to omit `default=` so the roundtrip stops exercising the deprecated YAML-default path unguarded (Greptile note).
  - Wrap `test_resolve_model_provider_registry_with_explicit_default`, `test_get_provider`, and `test_init_user_supplied_providers_preserve_first_wins_over_yaml_default` in `pytest.warns` so the suite stays green under `-W error::DeprecationWarning` (Greptile note).
  - Add `test_explicit_default_none_does_not_emit_deprecation_warning` to pin the tightened predicate.
  - Add `test_init_yaml_default_emits_single_deprecation_warning` to pin the cascade-dedup behavior.
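The cascade dedup described above (exactly one warning per user action in the YAML-fallback branch) follows a standard `catch_warnings` pattern. A self-contained sketch with stand-in functions mirroring the names in the review, not the real `DataDesigner.__init__`:

```python
import warnings

# Stand-ins mirroring the functions named in the review (not the real code).
def get_default_provider_name():
    warnings.warn("the YAML 'default:' key is deprecated; specify provider= "
                  "explicitly on each ModelConfig", DeprecationWarning, stacklevel=2)
    return "openai"

def resolve_registry(default=None):
    if default is not None:
        warnings.warn("registry-level default is deprecated",
                      DeprecationWarning, stacklevel=2)
    return {"default": default}

def init_with_yaml_fallback():
    # The user already got the YAML-key warning above, so the registry-level
    # duplicate for the same root cause is suppressed for this one call.
    name = get_default_provider_name()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return resolve_registry(default=name)
```

The suppression is scoped: only the one internal call inside the `with` block is silenced, and the filter state is restored on exit.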
  Refs #589
  Made-with: Cursor

* fix(models): make deprecation warnings visible under default filters

  andreatgretel (PR #594): the YAML-default warning in `get_default_provider_name` and the registry-default warning emitted from inside DataDesigner helpers were attributing to data_designer library frames, not user code. Python's default filter chain includes `ignore::DeprecationWarning`, so library-attributed entries are silenced — meaning a normal `DataDesigner()` call with a YAML `default:` set showed nothing, and `resolve_model_provider_registry` warnings were similarly invisible.

  Two related changes:
  1. `warn_at_caller`: extend the default skip-list from `("pydantic",)` to `("pydantic", "pydantic_core", "data_designer")` so the walk escapes both pydantic's validator-dispatch frames and data_designer helper frames before attributing. Also tighten the prefix predicate to exact-or-dotted-prefix matching (`name == p or name.startswith(p + ".")`) so that, e.g., `pydantic_helpers` is not falsely matched as part of `pydantic` (johnnygreco nit). Allow callers to pass a custom `skip_prefixes` for flexibility. Drop the "skip frames 0 and 1 unconditionally" guard now that prefix matching covers it.
  2. `get_default_provider_name`: switch from `warnings.warn(stacklevel=2)` to `warn_at_caller`. The previous stacklevel pointed into `default_model_settings.py`, which is a library file and therefore silenced under default filters. Verified the fix empirically with `python -W default`: the warning is now attributed to the user's call site and rendered.

  johnnygreco (PR #594): add the missing `test_explicit_default_none_does_not_emit_deprecation_warning` regression for the `self.default is not None` predicate landed in the prior round.

  Tests:
  - New `test_warning_helpers.py` pins prefix-matching precision (rejects `pydantic_helpers` / `data_designer_other`), default skip-list contents, attribution past skip-prefix frames, and per-call-site dedup behavior.
  - `test_get_default_provider_name_warning_attributes_to_user_frame` pins andreatgretel's repro for the YAML-default site.
  - `test_explicit_default_warning_attributes_to_user_frame` pins the multi-frame case: construction goes through `resolve_model_provider_registry`, so the walk has to escape both pydantic and data_designer before landing on the test file.
  - `test_explicit_default_none_does_not_emit_deprecation_warning` pins johnnygreco's predicate-tightening regression.

  3,124 tests pass (540 config + 1,923 engine + 653 interface; +10 net from this round).

  Refs #589
  Made-with: Cursor

* fix(models): apply warn_at_caller to remaining deprecation sites

  greptile-apps (PR #594, r3189904028): `ProviderRepository.load`'s YAML-default DeprecationWarning was using `warnings.warn(stacklevel=2)`, which attributes to whichever data_designer frame called `load()` — controllers, services, list/reset commands, agent introspection. Every real call path lands on `data_designer.cli.*`, which falls under Python's default `ignore::DeprecationWarning` filter and is silenced.

  An audit found two more sites with the same problem:
  - `DatasetBuilder._resolve_async_compatibility` (`allow_resize` / issue #552) used `stacklevel=4` to walk past `_resolve_async_compatibility -> build/build_preview -> interface -> user`. Brittle: any added frame (a decorator, async wrapping, the `try/except DeprecationWarning: raise` boundary) shifts attribution silently. The existing test passed only because it used `simplefilter("always")` with `record=True`, which records warnings regardless of attribution.
  - `ProviderController._handle_change_default` used `stacklevel=2`, which lands on the menu dispatcher in the same controller module. `print_warning` already shows the message visually, but programmatic observers (`pytest.warns`, `filterwarnings("error", ...)`) saw a library-attributed entry that default filters silenced.
  All three sites migrated to `warn_at_caller` (the helper from 247fa30) so attribution lands on the user's call site regardless of internal chain shape. `data_designer` is already in `DEFAULT_INTERNAL_PREFIXES`, so the walk escapes the entire library in one pass.

  Added attribution regression tests at each site asserting `warning.filename == __file__`. A future regression to `warnings.warn(stacklevel=N)` now fails CI instead of silently silencing the user-facing nudge:
  - `test_load_with_yaml_default_attributes_warning_to_caller` (test_provider_repository.py)
  - `test_resolve_async_compatibility` extended with the same assertion
  - `test_handle_change_default_emits_deprecation_warning` rewritten from `pytest.warns(...)` to a `catch_warnings(record=True)` block that filters for the message and asserts `filename == __file__` (`pytest.warns` does not check attribution, so the rewrite is required to actually catch the regression).

  3,125 tests pass (548 config + 1,923 engine + 654 interface).

  Refs #589
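The attribution-assertion test pattern referred to here can be sketched as follows; `some_deprecated_call` is a hypothetical stand-in for a library entry point:

```python
import warnings

def some_deprecated_call():
    # Hypothetical stand-in for a library entry point that warns.
    warnings.warn("some_deprecated_call is deprecated",
                  DeprecationWarning, stacklevel=2)

def test_warning_attributed_to_caller():
    # pytest.warns checks category and message but NOT the warning's
    # filename, so a stacklevel regression slips through it. Record
    # everything, then assert on attribution explicitly.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        some_deprecated_call()
    matches = [w for w in caught if "deprecated" in str(w.message)]
    assert matches, "expected the DeprecationWarning to be recorded"
    # In a real pytest module this comparison would use __file__.
    this_file = test_warning_attributed_to_caller.__code__.co_filename
    assert matches[0].filename == this_file

test_warning_attributed_to_caller()
```

The `simplefilter("always")` is what makes the test independent of the ambient filter chain; the filename assertion is what actually catches an attribution regression.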
…, fingerprint feature available

- Update `allow_resize` framing: it now logs a DeprecationWarning and falls back to sync (#553), and is no longer hard-rejected. Async is the default as of #592.
- Reference `DataDesignerConfig.fingerprint()` (#587) as the per-stage hash for resume invalidation.
- Rename `_validate_async_compatibility()` to `_resolve_async_compatibility()` to match current code.
- Mark Phase 2 step 1 as done; list the concrete docs that still need updates.
…ant, on-disk handoffs, DAG-ready, acreate sidecar

- Resolve the in-memory vs on-disk handoff question to always-on-disk inside `Pipeline`; reserve in-memory for the `to_config_builder()` notebook ergonomic.
- Add a Composability section: parent `DataDesigner` reuse is a load-bearing API contract for throttle coordination across stages and parallel branches.
- Add an Engine API surface section: `acreate()` as a small additive sidecar, independent of chaining v1 but a hard dependency for Phase 4.
- Promote DAG semantics from "future work" to "designed-in"; add Phase 4 (parallel branches via `asyncio.gather` over `acreate`); demote auto-chaining to Phase 5.
- New Resolved decisions section captures the three load-bearing API decisions; trim the Open questions list accordingly.
- Mention possible future external orchestration only as a vague composability constraint, with no commitment.
- Soften "Door open for external orchestration": drop the throttle-backend-as-seam framing; cross-reference Future considerations.
- Make the `acreate()` scope explicit (in-process); cross-process orchestration is not the same problem.
- Add a Phase 4 scope clarifier: branch parallelism, not stage pipelining.
- New Future considerations section: external orchestration (vague, uncommitted) and pipelined execution of dependent stages.
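Phase 4's branch parallelism (not stage pipelining) over an `acreate()`-style coroutine might look roughly like this in-process sketch; `acreate` here is a stand-in, not DataDesigner's API:

```python
import asyncio

# Stand-in for an acreate()-style coroutine (hypothetical, not the real API):
# each call represents one independent branch's generation run.
async def acreate(branch_name: str, num_records: int) -> dict:
    await asyncio.sleep(0)  # real work would await model calls here
    return {"branch": branch_name, "records": num_records}

async def run_parallel_branches() -> list:
    # Independent branches run side by side; stages *within* a branch would
    # still execute sequentially (branch parallelism, not stage pipelining).
    return await asyncio.gather(
        acreate("branch_a", 100),
        acreate("branch_b", 50),
    )

results = asyncio.run(run_parallel_branches())
```

Because both branches reuse the same process (and, in the plan's framing, the same parent `DataDesigner`), throttle coordination stays in one place.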
Review: PR #552 — docs: add plan for workflow chaining

Summary

This PR adds a single planning document at `plans/workflow-chaining/workflow-chaining.md`.
No code changes.

Findings

Architectural alignment — strong
Completeness — mostly complete, with a few areas to tighten before implementation
Feasibility — feasible as structured
Minor observations
Verdict

Strong plan. Architecturally aligned with the project's layering and invariants, grounded in existing primitives for feasibility, and thoughtful about forward compatibility (DAG-internal-v1, acreate-as-sidecar). The one place the plan would benefit from a tightening pass before implementation is the stage data contract and fingerprint scope: callback output layout, callback fingerprinting for resume, and inclusion of …

Recommend approving the plan with a note to resolve items 1–3 above before starting Phase 1 implementation.
Greptile Summary

This PR adds a design plan for workflow chaining in DataDesigner — an orchestration layer that sequences multiple generation stages, passing each stage's output on disk as the next stage's seed. As a secondary benefit, the plan also covers removal of `allow_resize` and simplification of sync/async engine convergence.
| Filename | Overview |
|---|---|
| plans/workflow-chaining/workflow-chaining.md | New design plan for workflow chaining; no code changes. Core design is sound and well-reasoned. Two design ambiguities: callback output path format contract is unresolved, and allow_empty short-circuit leaves the results shape undefined for skipped stages. |
### Issue 1 of 2
plans/workflow-chaining/workflow-chaining.md:96-117
**Callback output format contract is unspecified**
The plan states the callback "returns a path that the next stage will seed from," but doesn't define the format that `LocalFileSeedSource` expects at that path. Stage outputs use a `parquet-files/` subdirectory (shown in the artifact layout), and the callback example reads from `stage_output_path / "parquet-files"`. However, the same callback writes a flat file to `out / "data.parquet"` — not a `parquet-files/` subdirectory — and returns `out`. If `LocalFileSeedSource` expects the `parquet-files/` convention it would fail to find any data at the callback's returned path. The plan needs to resolve whether (a) callbacks must produce the same `parquet-files/` structure as stage outputs, or (b) `LocalFileSeedSource` is flexible enough to accept any directory with flat `.parquet` files at the root. The example in the filter-then-enrich use case at line 278 has the same inconsistency.
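For illustration, resolution (a) would have the callback reproduce the `parquet-files/` convention; this sketch uses placeholder files and a hypothetical `filter_stage_output` callback, so it shows only the layout contract, not real filtering:

```python
from pathlib import Path
import tempfile

# Hypothetical between-stage callback illustrating resolution (a): the
# callback mirrors the stage-output convention by writing under a
# parquet-files/ subdirectory, so the next stage's seed source looks in the
# same relative location for callback output as for a real stage output.
def filter_stage_output(stage_output_path: Path, work_dir: Path) -> Path:
    out = work_dir / "filtered"
    (out / "parquet-files").mkdir(parents=True, exist_ok=True)
    for src in sorted((stage_output_path / "parquet-files").glob("*.parquet")):
        # Real code would filter rows; this sketch just copies the files.
        (out / "parquet-files" / src.name).write_bytes(src.read_bytes())
    return out  # NOT out / "data.parquet": the layout matches stage outputs

# Exercise the layout contract with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    stage_out = Path(tmp) / "stage-0"
    (stage_out / "parquet-files").mkdir(parents=True)
    (stage_out / "parquet-files" / "part-0.parquet").write_bytes(b"placeholder")
    result = filter_stage_output(stage_out, Path(tmp))
    assert (result / "parquet-files" / "part-0.parquet").exists()
```

Under resolution (b) the copy target would instead be flat files at the root of `out`, and the seed source would have to accept both layouts; either way, the plan should state the contract once.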
### Issue 2 of 2
plans/workflow-chaining/workflow-chaining.md:117
**`allow_empty=True` short-circuit leaves `results` shape undefined**
When `allow_empty=True` causes the pipeline to short-circuit, the plan doesn't specify what `results` contains for the skipped stages. If a caller does `results["enriched"].load_dataset()` after a short-circuit, the expected behavior — `None`, an empty `DatasetCreationResults`, a sentinel object, or a `KeyError` — should be decided here. The choice affects how users write post-pipeline code that references all stage names unconditionally.
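One way to pin that contract, shown purely as an illustration (the `SkippedStage` sentinel is hypothetical, not the plan's decision): keep every declared stage name present in `results` and store a typed sentinel for skipped stages:

```python
from dataclasses import dataclass

# Hypothetical sentinel illustrating one possible contract: every declared
# stage name stays present in `results`; a skipped stage holds a typed
# sentinel instead of being absent, so unconditional post-pipeline code
# sees an explicit skip rather than a KeyError.
@dataclass(frozen=True)
class SkippedStage:
    reason: str

results = {
    "filtered": ["row0", "row1"],  # stand-in for a real stage's results
    "enriched": SkippedStage("upstream stage produced 0 rows"),
}

value = results["enriched"]
dataset = None if isinstance(value, SkippedStage) else value
```

Any of the four options the review lists would work; the sentinel variant has the advantage that `results[name]` never raises and the skip reason is carried along.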
Reviews (2): Last reviewed commit: "docs: address workflow chaining review c..."
Summary
Removal of `allow_resize` and simplification of sync/async engine convergence (deprecation already shipped in #553, chore: async engine readiness - blockers and polish before default).

What's in the plan
- `add_stage()`, `run()`, between-stage callbacks. Reuses the parent `DataDesigner` so all stages share one `ModelRegistry`/`ThrottleManager`.
- `to_config_builder()` convenience on results for lightweight notebook chaining.
- On-disk handoffs via `LocalFileSeedSource`. In-memory `DataFrameSeedSource` is reserved for the `to_config_builder()` notebook ergonomic and is explicitly not a Pipeline handoff.
- DAG-ready stage declarations (`depends_on=[...]`).
- `acreate()` engine sidecar. Small additive async API on `DataDesigner`. Independent of chaining v1; hard dependency for Phase 4. Enables in-process parallel-independent workflows via `asyncio.gather`.
- `allow_resize` removal, following the deprecation already in `main` from #553 (chore: async engine readiness - blockers and polish before default).
- Per-stage resume keyed on `DataDesignerConfig.fingerprint()` (#587, feat(config): add deterministic fingerprint for workflow configs) composed with `num_records`, DD version, and the upstream stage fingerprint.

Phases
- `to_config_builder()` (can ship independently).
- `acreate()` on `DataDesigner` (independent track; can land before/alongside/after Phase 1).
- `allow_resize` removal (deprecation already shipped in #553, chore: async engine readiness - blockers and polish before default; this phase finishes the removal).
- Parallel branches via `asyncio.gather` over `acreate()`. Hard dependency on the sidecar.

Resolved decisions
- Handoffs are always on-disk inside `Pipeline`; in-memory mode is reserved for `to_config_builder()`.
- `Pipeline` is constructed via `dd.pipeline()` and reuses the parent `DataDesigner` across all stages - load-bearing for throttle coordination.

Future considerations (uncommitted)
External orchestration remains a vague, uncommitted composability constraint; the v1 decisions (parent `DataDesigner` reuse, on-disk handoffs, no new engine surface) compose naturally with such a system.

Open questions
Preview support, config serialization for auto-chaining, naming, image/media column forwarding, downstream seeding scope.
No code changes - plan document only.