
Bound stop-token check to written tokens in dflash_generate #109

Open

SuperMarioYL wants to merge 1 commit into z-lab:main from SuperMarioYL:fix/stop-token-buffer-scan

Conversation

@SuperMarioYL

Problem

`dflash_generate` pre-allocates `output_ids` with `mask_token_id` past the
prompt at dflash/model.py:79. The in-loop early-exit check at the bottom of the
decode loop scanned the full pre-allocated tail:

    if stop_token_ids is not None and any(
        stop_token_id in output_ids[:, num_input_tokens:]
        for stop_token_id in stop_token_ids
    ):
        break

When `mask_token_id` happens to be one of the `stop_token_ids` (a
model-config-dependent edge case the project already cares about; see
#76 "Preserve output tokens that equal mask_token_id"), the
`mask_token_id` fill in the unwritten tail of the buffer satisfies the
`in` check on the very first iteration, and generation aborts after
one block.
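
A minimal weight-free reproduction of the failure mode (the token ids and
buffer size here are illustrative, not taken from any real model config):

    import torch

    mask_token_id = 2            # illustrative value
    stop_token_ids = [2]         # the edge case: stop id == mask id
    num_input_tokens = 4

    # Buffer pre-allocated as in dflash/model.py:79: prompt, then a
    # mask-filled tail that has not been written yet.
    output_ids = torch.full((1, 16), mask_token_id, dtype=torch.long)
    output_ids[0, :num_input_tokens] = torch.tensor([10, 11, 12, 13])

    # The legacy check trips before anything is generated: the unwritten
    # tail is full of mask_token_id, which is also a stop token here.
    assert any(
        stop_token_id in output_ids[:, num_input_tokens:]
        for stop_token_id in stop_token_ids
    )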

Fix

Aligning with the post-loop trim at model.py:151-155 (which already
uses `torch.isin` over the trimmed slice), the in-loop check now
scopes the scan to `output_ids[0, num_input_tokens : start + 1]`,
i.e. positions that have actually been written this run. The
pre-allocated `stop_token_tensor` is hoisted out of the loop so both
the in-loop and post-loop checks share it.
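
A sketch of the scoped check (variable names follow the diff; the concrete
tensor values and the `...` loop body are placeholders, not the real code):

    import torch

    num_input_tokens = 4
    start = 6                    # cursor: last position written this block
    stop_token_ids = [2]
    output_ids = torch.full((1, 16), 2, dtype=torch.long)  # tail still mask-filled
    output_ids[0, :start + 1] = torch.tensor([10, 11, 12, 13, 20, 21, 22])

    # Hoisted once before the decode loop; shared with the post-loop trim.
    stop_token_tensor = torch.tensor(stop_token_ids, device=output_ids.device)

    # In-loop early exit: scan only what has actually been written, so the
    # mask-filled tail can no longer trigger a spurious break.
    written = output_ids[0, num_input_tokens : start + 1]
    if torch.isin(written, stop_token_tensor).any():
        ...  # break out of the decode loop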

Tests

Added tests/test_model.py covering pure-Python / pure-tensor logic
that runs on CPU without weights:

  • build_target_layer_ids interpolation (1-/2-/4-layer cases)
  • extract_context_feature offset+concat shape and values
  • sample argmax / temperature paths
  • regression test for the buffer-scan pattern (sketched below):
    reproduces the legacy check firing spuriously, asserts the new
    check does not
  • sibling test confirming the new check still fires on a real
    stop token after the cursor advances
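
A condensed sketch of that regression pair (helper names and token ids are
illustrative, not the test file verbatim):

    import torch

    def scan_legacy(output_ids, num_input_tokens, stop_token_ids):
        # Old behavior: scans the entire pre-allocated tail.
        return any(t in output_ids[:, num_input_tokens:] for t in stop_token_ids)

    def scan_fixed(output_ids, num_input_tokens, start, stop_token_tensor):
        # New behavior: scans only positions written so far.
        written = output_ids[0, num_input_tokens : start + 1]
        return bool(torch.isin(written, stop_token_tensor).any())

    mask_token_id, stop_token_ids = 2, [2]
    output_ids = torch.full((1, 16), mask_token_id, dtype=torch.long)
    output_ids[0, :4] = torch.tensor([10, 11, 12, 13])
    stop_token_tensor = torch.tensor(stop_token_ids)

    # Before anything is written: legacy check fires spuriously, new one does not.
    assert scan_legacy(output_ids, 4, stop_token_ids)
    assert not scan_fixed(output_ids, 4, start=3, stop_token_tensor=stop_token_tensor)

    # After a real stop token lands at the cursor, the new check still fires.
    output_ids[0, 4] = 2
    assert scan_fixed(output_ids, 4, start=4, stop_token_tensor=stop_token_tensor)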

Wired in via a `[project.optional-dependencies]` `test` extra so
existing backends are unaffected:

    uv pip install -e ".[test]"
    python -m pytest tests/test_model.py -v
    # 8 passed in 9.15s
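
For reference, the extra is declared along these lines in pyproject.toml
(the exact dependency list is an assumption; only `pytest` is implied by
the commands above):

    [project.optional-dependencies]
    test = ["pytest"]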

Refs #76.

