Skip to content

feat(engine): let column configs declare all model aliases for the startup health check#626

Open
nabinchha wants to merge 1 commit intomainfrom
feat/get-model-aliases-606
Open

feat(engine): let column configs declare all model aliases for the startup health check#626
nabinchha wants to merge 1 commit intomainfrom
feat/get-model-aliases-606

Conversation

@nabinchha
Copy link
Copy Markdown
Contributor

📋 Summary

Plugin column configs that depend on more than one model alias (generator + judge, critic, etc.) could not opt their secondary aliases into the standard startup model health check, and configs without a model_alias field crashed the collection loop with AttributeError. This PR adds a single overridable accessor — SingleColumnConfig.get_model_aliases() — so every column type, built-in or plugin, can declare exactly which model endpoints the startup health check should ping.

🔗 Related Issue

Fixes #606

🔄 Changes

  • New default accessor SingleColumnConfig.get_model_aliases() (base.py) reads model_alias via getattr so configs without that field return [] instead of raising. Built-in LLM, embedding, and image configs work unchanged via this default.
  • CustomColumnConfig.get_model_aliases() override (column_configs.py) surfaces decorator-declared aliases through the same hook, replacing the isinstance(config, CustomColumnConfig) branch in the builder.
  • _run_model_health_check_if_needed (dataset_builder.py) collapses to a single loop over config.get_model_aliases(). The CustomColumnConfig import in the engine is removed (no longer needed).
  • Plugin docs (docs/plugins/models.md) update the multi-model PairwiseJudgeColumnConfig example to override get_model_aliases() instead of the previous workaround that registration-validated extra aliases in _validate(). The "Health checks and scheduling" section is updated to reflect accessor-driven collection.
  • New tests in test_columns.py, test_dataset_builder.py, and test_custom.py cover: the default behavior on configs without model_alias, the primary-alias path on built-in LLM/embedding/image configs (parametrized), the override pattern from the docs example, the multi-alias collection in the builder (regression test for feat(engine): let plugin column configs declare all model aliases for the startup health check #606), and the no-op skip path for sampler-only configs.

🧪 Testing

  • pytest packages/data-designer-config/tests packages/data-designer-engine/tests packages/data-designer/tests passes (3262 tests)
  • Unit tests added: 6 new tests covering default behavior, built-in configs, the multi-model override pattern, custom column decorator metadata, and the dataset-builder health-check collection loop
  • ruff format --check and ruff check clean across the workspace
  • License headers up to date
  • E2E tests added/updated — N/A (config/builder layer, no new end-to-end surface)

✅ Checklist

  • Follows commit message conventions (feat(engine): prefix matches recent history)
  • Commits are signed off (DCO)
  • Architecture docs updated — N/A (AGENTS.md invariants unchanged; this is a hook on an existing class, not a layering change)

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py — this PR drops the column_type_is_model_generated(config.column_type) gate from the health-check loop, which the issue's proposed snippet kept. The reason: CustomColumnGenerator extends ColumnGenerator, not ColumnGeneratorWithModelRegistry, so that gate excludes custom columns and the existing isinstance branch would not actually be subsumed by the new accessor. Calling get_model_aliases() on every config relies on the default returning [] for non-model configs (samplers, expressions, validators, seed datasets), which is verified by the new test_run_model_health_check_skips_when_no_model_aliases test. The new test_run_model_health_check_collects_aliases_from_get_model_aliases test asserts that built-in LLM and CustomColumnConfig aliases are both collected through the single loop.
  • docs/plugins/models.md — the PairwiseJudgeColumnGenerator example loses its _validate() method. That method previously called get_model_config(...) for the secondary alias to fail-fast on registration errors. Now that get_model_aliases() opts that alias into run_health_check, the registration check happens for free and the endpoint is actually exercised at startup, so the manual workaround is redundant.

…artup health check

Plugin column configs that depend on more than one model alias (generator + judge,
critic, etc.) previously could not opt their secondary aliases into the standard
startup health check, and configs without a `model_alias` field crashed the
collection loop with AttributeError.

Add `SingleColumnConfig.get_model_aliases()` as the single override hook the
builder uses to enumerate aliases. The default returns the column's primary
`model_alias` (if any), so built-in LLM, embedding, and image columns work
unchanged. `CustomColumnConfig` overrides it to surface decorator-declared
aliases, replacing the special-case `isinstance` branch in the builder. Plugin
configs with multiple model fields override it to opt every endpoint into the
health check.

Fixes #606

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha requested a review from a team as a code owner May 9, 2026 00:24
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Docs preview: https://a39ab724.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 9, 2026

Greptile Summary

This PR introduces SingleColumnConfig.get_model_aliases() as the single source of truth for which model endpoints each column config registers with the startup health check. The refactor fixes two issues: an AttributeError crash when iterating configs that lack a model_alias field, and the inability of multi-model plugin configs to opt secondary aliases into the health check.

  • SingleColumnConfig.get_model_aliases() is added to base.py as the default hook; it uses getattr to read model_alias and returns [] when absent, eliminating the AttributeError path.
  • CustomColumnConfig.get_model_aliases() overrides the default to surface decorator-declared aliases; _run_model_health_check_if_needed in dataset_builder.py collapses its two-branch logic to a single update call per config.
  • The docs PairwiseJudgeColumnConfig example is updated to show the override pattern and the now-unnecessary _validate() workaround is removed.

Confidence Score: 5/5

The change is safe to merge — it fixes two real bugs with a minimal, well-tested API surface and no behavioural regressions on existing configs.

All built-in configs continue to work identically through the default implementation. The collapsed health-check loop relies on get_model_aliases() returning [] for non-model configs, which is verified by the new test_run_model_health_check_skips_when_no_model_aliases test. The multi-alias regression is covered by test_run_model_health_check_collects_aliases_from_get_model_aliases. No existing call sites of model_alias or CustomColumnConfig.model_aliases are broken by the refactor.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-config/src/data_designer/config/base.py Adds get_model_aliases() to SingleColumnConfig; uses getattr defensively so configs without model_alias return [] instead of raising.
packages/data-designer-config/src/data_designer/config/column_configs.py Adds CustomColumnConfig.get_model_aliases() override that delegates to the existing model_aliases property, routing decorator-declared aliases through the new hook.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py Collapses the two-branch health-check collection loop to a single config.get_model_aliases() call; removes the now-unused CustomColumnConfig import.
packages/data-designer-config/tests/config/test_columns.py New parametrized tests cover: empty-list default for configs without model_alias, primary-alias path for built-in configs, and the multi-alias override pattern.
packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py Two new builder-level tests: regression for #606 (multi-alias collection) and the no-op skip path when no config returns any alias.
packages/data-designer-engine/tests/engine/column_generators/generators/test_custom.py Existing integration test extended to assert that get_model_aliases() on a CustomColumnConfig returns the decorator-declared alias list.
docs/plugins/models.md Docs updated to reflect the accessor-driven health-check model; PairwiseJudgeColumnConfig example gains get_model_aliases() override and loses the redundant _validate() workaround.

Sequence Diagram

sequenceDiagram
    participant Builder as DatasetBuilder
    participant Config as SingleColumnConfig<br/>(base)
    participant CustomCfg as CustomColumnConfig<br/>(override)
    participant Registry as ModelRegistry

    Builder->>Builder: _run_model_health_check_if_needed()
    loop for each config in single_column_configs
        alt default (LLM / embedding / image configs)
            Builder->>Config: get_model_aliases()
            Config-->>Builder: [model_alias] or []
        else CustomColumnConfig
            Builder->>CustomCfg: get_model_aliases()
            CustomCfg-->>Builder: [alias_a, alias_b, ...]
        else plugin override (e.g. PairwiseJudge)
            Builder->>Config: get_model_aliases()
            Config-->>Builder: [model_alias, judge_model_alias]
        end
        Builder->>Builder: model_aliases.update(...)
    end
    alt model_aliases not empty
        Builder->>Registry: run_health_check(list(model_aliases))
        Registry-->>Builder: ok / raises
    end
Loading

Reviews (1): Last reviewed commit: "feat(engine): let column configs declare..." | Re-trigger Greptile

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Review: PR #626feat(engine): let column configs declare all model aliases for the startup health check

Summary

Replaces two special cases in _run_model_health_check_if_needed (a column_type_is_model_generated branch and an isinstance(config, CustomColumnConfig) branch) with a single polymorphic accessor SingleColumnConfig.get_model_aliases() on the config base class. The default reads model_alias via getattr (returning [] when absent), and CustomColumnConfig overrides it to surface the decorator-declared aliases. Fixes #606 — plugin configs that declare multiple model aliases can now opt every endpoint into the startup health check, and configs without a model_alias field no longer crash the loop.

Change is small and focused: ~24 lines of source changes, ~100 lines of tests, doc updates to match.

Findings

Design / architecture

  • Good fit for the "declare, don't orchestrate" contract. Moving the alias-collection decision from the engine's isinstance branches to a virtual method on the config is consistent with the registry/polymorphism pattern called out in AGENTS.md. The engine becomes one line and no longer needs to know about CustomColumnConfig.
  • Import layering improves. Dropping from data_designer.config.column_configs import CustomColumnConfig in dataset_builder.py reduces concrete-type coupling between engine and config. Direction is unchanged (engine → config) but the engine now only depends on the base class hook.
  • Blast radius is small. Only one call site (_run_model_health_check_if_needed) consumes the new accessor; the method is new and can't break subclasses that don't override it.

Correctness

  • Dropped column_type_is_model_generated gate is intentional and safe. The PR description explains the reasoning well: CustomColumnGenerator extends ColumnGenerator, not ColumnGeneratorWithModelRegistry, so the old gate excluded custom columns — which is exactly why the prior code needed the isinstance branch. The new code relies on the default get_model_aliases() returning [] for any config without a model_alias attribute, which is covered by test_run_model_health_check_skips_when_no_model_aliases.
  • getattr(self, "model_alias", None) is the right escape hatch. It correctly degrades to [] for configs with no model_alias field, fixing the AttributeError crash called out in the issue. The if alias else [] branch also eats empty strings, which is desirable (an empty alias would not resolve against ModelRegistry).
  • Backward compatibility preserved. Built-in LLMTextColumnConfig / LLMCodeColumnConfig / EmbeddingColumnConfig / ImageColumnConfig reach the same behavior as before through the default implementation. Any out-of-tree plugin that does not override get_model_aliases() also sees no change — only the primary model_alias is pinged, matching the previous contract.
  • Minor semantic shift to be aware of. The old gate restricted health checks to configs whose column_type was registered as model-generated. The new code will collect model_alias from any SingleColumnConfig subclass that happens to define a model_alias attribute, even if its generator is not model-backed. In practice this is only LLM/embedding/image configs today, so no current config is affected — but worth noting as a future footgun if someone adds a non-model column that carries a model_alias field for other reasons. Consider documenting this in the base-class docstring or, more conservatively, keeping the column_type_is_model_generated gate and only falling through to the accessor for CustomColumnConfig. The current approach is cleaner; flagging for awareness rather than as a blocker.

API shape

  • CustomColumnConfig now has both a model_aliases property (existing) and a get_model_aliases() method (new) that trivially returns self.model_aliases. The two names are close enough to cause confusion — model_aliases is the decorator-metadata getter, get_model_aliases() is the health-check hook. The override docstring helps, but a future refactor could unify these (e.g., have the base class's get_model_aliases() default be a property, or rename the existing property). Not a blocker for this PR.
  • get_model_aliases() is a public, non-abstract method on SingleColumnConfig. That's the right call for this change, but note it becomes part of the plugin author API surface — the docs update handles this.

Docs (docs/plugins/models.md)

  • The rewrite is accurate and reads well. The deleted paragraph explaining the _validate() registration-check workaround is genuinely obsolete now that run_health_check covers the same ground and pings the endpoint.
  • The PairwiseJudgeColumnGenerator example no longer defines _validate(). This is consistent with the new guidance, but existing plugins that still implement _validate() for this purpose will continue to work — it just becomes redundant, not broken. A one-line note pointing this out to readers with existing plugins might be worth adding, though not essential.
  • L163's "uncommon but valid" caveat about configs without a model_alias field is a nice touch.

Tests

  • Coverage is proportional to the change. The six new tests hit: default behavior on a field-less config, primary-alias path on four built-ins (parametrized), multi-alias override pattern, CustomColumnConfig override flowing through the property, and both the collection and skip branches of _run_model_health_check_if_needed.
  • test_run_model_health_check_collects_aliases_from_get_model_aliases is a strong regression test for feat(engine): let plugin column configs declare all model aliases for the startup health check #606 — it directly exercises the bug (secondary aliases not being pinged).
  • Minor gap: no explicit test for the model_alias="" empty-string edge case (the if alias branch would filter it). Not worth blocking on.
  • Test placement matches project convention (config tests in the config package, engine tests in the engine package).

Style / conventions

  • from __future__ import annotations present in touched source files (verified in base.py and column_configs.py context).
  • Type annotations are present and use modern syntax (list[str], str | None).
  • No relative imports added.
  • The alias: str | None = getattr(self, "model_alias", None) annotation is a slight over-claim — getattr could technically return anything, and model_alias on some config could be non-str. In practice model_alias is typed as str across built-in configs, so the annotation holds. Fine to leave as-is.

Security / performance

  • No security concerns. The change does not broaden what is health-checked in a way that would expose new endpoints — it simply lets multi-alias configs opt their secondary endpoints in, which is the explicit goal.
  • No performance concerns. The loop was already O(configs); it remains so.

Verdict

Approve with minor suggestions. This is a clean, well-scoped refactor that fixes a real bug (#606), removes a special case from the engine, and improves separation of concerns between engine and config. Tests are adequate and the docs are updated in lockstep. The only thing I'd consider adding is a one-line clarification in the base-class docstring (or in the plugin docs) that the accessor will be consulted for every SingleColumnConfig, so a plugin author who defines a model_alias on a non-model config should be aware the default collects it — but that's a nice-to-have, not a blocker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(engine): let plugin column configs declare all model aliases for the startup health check

1 participant