
Bound stop-token check to written tokens in dflash_generate #109

Open

SuperMarioYL wants to merge 1 commit into z-lab:main from SuperMarioYL:fix/stop-token-buffer-scan

Conversation

@SuperMarioYL

Problem

`dflash_generate` pre-allocates `output_ids` with `mask_token_id` past the
prompt at dflash/model.py:79. The in-loop early-exit check at the bottom of the
decode loop scanned the full pre-allocated tail:

    if stop_token_ids is not None and any(
        stop_token_id in output_ids[:, num_input_tokens:]
        for stop_token_id in stop_token_ids
    ):
        break

When `mask_token_id` happens to be one of the `stop_token_ids` (a
model-config-dependent edge case the project already cares about; see
#76 "Preserve output tokens that equal mask_token_id"), the
`mask_token_id` fill in the unwritten tail of the buffer satisfies the
`in` check on the very first iteration, and generation aborts after
one block.
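
A minimal weight-free reproduction of the failure mode (the token ids and
buffer size here are illustrative, not taken from any real model config):

    import torch

    mask_token_id = 2            # illustrative value
    stop_token_ids = [2]         # the edge case: stop id == mask id
    num_input_tokens = 4

    # Buffer pre-allocated as in dflash/model.py:79: prompt, then a
    # mask-filled tail that has not been written yet.
    output_ids = torch.full((1, 16), mask_token_id, dtype=torch.long)
    output_ids[0, :num_input_tokens] = torch.tensor([10, 11, 12, 13])

    # The legacy check trips before anything is generated: the unwritten
    # tail is full of mask_token_id, which is also a stop token here.
    assert any(
        stop_token_id in output_ids[:, num_input_tokens:]
        for stop_token_id in stop_token_ids
    )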

Fix

Aligning with the post-loop trim at model.py:151-155 (which already
uses `torch.isin` over the trimmed slice), the in-loop check now
scopes the scan to `output_ids[0, num_input_tokens : start + 1]`,
i.e. positions that have actually been written this run. The
pre-allocated `stop_token_tensor` is hoisted out of the loop so both
the in-loop and post-loop checks share it.
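
A sketch of the scoped check (variable names follow the diff; the concrete
tensor values and the `...` loop body are placeholders, not the real code):

    import torch

    num_input_tokens = 4
    start = 6                    # cursor: last position written this block
    stop_token_ids = [2]
    output_ids = torch.full((1, 16), 2, dtype=torch.long)  # tail still mask-filled
    output_ids[0, :start + 1] = torch.tensor([10, 11, 12, 13, 20, 21, 22])

    # Hoisted once before the decode loop; shared with the post-loop trim.
    stop_token_tensor = torch.tensor(stop_token_ids, device=output_ids.device)

    # In-loop early exit: scan only what has actually been written, so the
    # mask-filled tail can no longer trigger a spurious break.
    written = output_ids[0, num_input_tokens : start + 1]
    if torch.isin(written, stop_token_tensor).any():
        ...  # break out of the decode loop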

Tests

Added tests/test_model.py covering pure-Python / pure-tensor logic
that runs on CPU without weights:

  • build_target_layer_ids interpolation (1-/2-/4-layer cases)
  • extract_context_feature offset+concat shape and values
  • sample argmax / temperature paths
  • regression test for the buffer-scan pattern (sketched below):
    reproduces the legacy check firing spuriously, asserts the new
    check does not
  • sibling test confirming the new check still fires on a real
    stop token after the cursor advances
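
A condensed sketch of that regression pair (helper names and token ids are
illustrative, not the test file verbatim):

    import torch

    def scan_legacy(output_ids, num_input_tokens, stop_token_ids):
        # Old behavior: scans the entire pre-allocated tail.
        return any(t in output_ids[:, num_input_tokens:] for t in stop_token_ids)

    def scan_fixed(output_ids, num_input_tokens, start, stop_token_tensor):
        # New behavior: scans only positions written so far.
        written = output_ids[0, num_input_tokens : start + 1]
        return bool(torch.isin(written, stop_token_tensor).any())

    mask_token_id, stop_token_ids = 2, [2]
    output_ids = torch.full((1, 16), mask_token_id, dtype=torch.long)
    output_ids[0, :4] = torch.tensor([10, 11, 12, 13])
    stop_token_tensor = torch.tensor(stop_token_ids)

    # Before anything is written: legacy check fires spuriously, new one does not.
    assert scan_legacy(output_ids, 4, stop_token_ids)
    assert not scan_fixed(output_ids, 4, start=3, stop_token_tensor=stop_token_tensor)

    # After a real stop token lands at the cursor, the new check still fires.
    output_ids[0, 4] = 2
    assert scan_fixed(output_ids, 4, start=4, stop_token_tensor=stop_token_tensor)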

Wired in via a `[project.optional-dependencies]` `test` extra so
existing backends are unaffected:

    uv pip install -e ".[test]"
    python -m pytest tests/test_model.py -v
    # 8 passed in 9.15s
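
For reference, the extra is declared along these lines in pyproject.toml
(the exact dependency list is an assumption; only `pytest` is implied by
the commands above):

    [project.optional-dependencies]
    test = ["pytest"]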

Refs #76.

