[Help]: Metis SE long-form inconsistency across chunks / degraded quality on 60s+ files #488

@shrisha108

Hi Amphion team,

Thank you for releasing Metis SE. We tested it carefully and the short-form restoration quality is genuinely impressive.

However, we are seeing a strong long-form consistency problem with speech enhancement, and I wanted to ask whether this is expected behavior of the current SE inference path, or whether we are missing the intended way to run it on longer audio.

What we observe

Metis SE works very well on short clips around 9 seconds.

But on longer inputs (for example 60+ seconds), quality becomes much less stable:

  • full-file SE inference sounds worse than short-form inference
  • if we split a long file into 9-second chunks, each chunk is restored with a noticeably different character
  • chunk-to-chunk differences are clearly audible:
    • some chunks are louder, others quieter
    • some chunks are cleaner, others more degraded
    • the restored voice/timbre/character is not consistent across the full file
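
To put a number on the "some chunks are louder, others quieter" observation, per-chunk RMS levels can be compared. The following is a minimal sketch, not the script used for the tests above. It assumes mono float samples in [-1, 1] held in a plain Python list (NumPy arrays would be the natural choice for real audio), and the helper names `rms_dbfs`, `chunk_levels`, and `level_spread_db` are made up for illustration.

```python
import math

def rms_dbfs(samples):
    """RMS level of a block of samples in dBFS (floats in [-1, 1])."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")

def chunk_levels(samples, sample_rate, chunk_seconds=9.0):
    """One RMS level (dBFS) per consecutive, non-overlapping chunk."""
    size = int(round(chunk_seconds * sample_rate))
    return [rms_dbfs(samples[i:i + size]) for i in range(0, len(samples), size)]

def level_spread_db(levels):
    """Difference between the loudest and quietest non-silent chunk, in dB."""
    finite = [lv for lv in levels if lv != float("-inf")]
    return max(finite) - min(finite) if finite else 0.0
```

Running `chunk_levels` on both the degraded input and the restored output makes it possible to separate level differences already present in the source from level differences introduced by the restoration.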

So in practice, chunking does not solve the problem: it only turns one bad long-form result into a sequence of local restorations of varying quality, stitched together.

What we already tested

We tried the following systematically:

  1. Full-file inference on long audio
  2. Fixed-size chunking
  3. Chunking with overlap + waveform crossfade
  4. Larger context windows with center-crop output
  5. Seed sweeps
  6. Greedy decoding modifications
  7. Parameter sweeps for stage1 / S2A settings
  8. Prompt-anchor experiments
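
For clarity, item 3 (chunking with overlap + waveform crossfade) means roughly the following. This is a minimal sketch under stated assumptions, not the exact code from these tests: samples are a plain list of floats, each restored chunk is assumed to have the same length as its input chunk, the crossfade is linear, and the function names are illustrative only.

```python
def split_with_overlap(samples, chunk_len, overlap):
    """Split into chunks of chunk_len samples; neighbours share `overlap` samples."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    hop = chunk_len - overlap
    chunks, start = [], 0
    while True:
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break  # this chunk already reaches the end of the signal
        start += hop
    return chunks

def crossfade_join(chunks, overlap):
    """Concatenate chunks, linearly crossfading over the shared region."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        n = min(overlap, len(out), len(chunk))
        for i in range(n):
            w = (i + 1) / (n + 1)      # fade-in weight of the new chunk
            j = len(out) - n + i       # matching position in the output so far
            out[j] = (1.0 - w) * out[j] + w * chunk[i]
        out.extend(chunk[n:])
    return out
```

In actual use, each chunk would be passed through SE inference between the split and the join. The crossfade only blends the two restorations inside the overlap region; it does nothing to make the restorations on either side agree with each other.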

Results

1. Seed behavior

  • same seed on the same chunk is deterministic
  • different seeds can produce noticeably different restorations for that chunk
  • but a seed that sounds good for chunk A does not generalize to chunk B

So there does not seem to be a single “good seed” for the whole file.

2. Overlap / context

  • overlap and crossfade only smooth transitions
  • they do not solve the underlying inconsistency
  • chunks still sound like independently regenerated restorations
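
The "larger context window with center-crop output" variant can be sketched as follows. This is illustrative only: `enhance` stands in for whatever SE inference call is used (a hypothetical callable taking and returning a list of samples), and the sketch assumes the model output is sample-aligned with its input.

```python
def enhance_with_context(samples, core_len, context_len, enhance):
    """Process each core segment with extra context on both sides,
    then keep only the output samples that correspond to the core."""
    out = []
    for core_start in range(0, len(samples), core_len):
        core_end = min(core_start + core_len, len(samples))
        # Widen the window by context_len on each side, clipped to the signal.
        win_start = max(0, core_start - context_len)
        win_end = min(len(samples), core_end + context_len)
        restored = enhance(samples[win_start:win_end])
        # Assumption: restored[k] corresponds to samples[win_start + k].
        offset = core_start - win_start
        out.extend(restored[offset:offset + (core_end - core_start)])
    return out
```

The model sees more surrounding audio for each segment, but every window is still restored independently, which matches what we hear: the seams are hidden, while the restoration character still changes from segment to segment.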

3. Greedy decoding

We modified decoding to reduce sampling randomness.

Result:

  • it reduced seed-dependent variation within a single chunk
  • but chunk-to-chunk inconsistency remained
  • overall quality was slightly worse
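
To be explicit about what "greedy" means here, the difference between sampling and greedy selection at a single decoding step looks like this. It is a generic sketch using only the Python standard library, not the Metis decoding code; `pick_token` and its `temperature == 0` convention are assumptions made for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, temperature=1.0, rng=None):
    """temperature == 0 selects the argmax (greedy);
    otherwise a token is sampled from softmax(logits / temperature)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Greedy selection removes the dependence on the seed for a given input, but the input itself differs from chunk to chunk, so it cannot by itself make neighbouring chunks agree.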

4. Prompt-anchor experiments

We also tested prompt-based chunking ideas inspired by the prompt-conditioned paths in Metis:

  • fixed prompt anchor
  • rolling prompt anchor (using previous restored chunk as prompt)

Result:

  • both improved consistency somewhat
  • rolling prompt was slightly better than fixed prompt
  • but the remaining inconsistency is still very noticeable, and the result is not good enough for high-quality long-form restoration
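
The two anchoring schemes differ only in which audio is passed as the prompt for each chunk. A minimal sketch of the control flow, assuming a hypothetical `enhance(chunk, prompt)` callable (whether the SE path is meant to accept a prompt at all is part of the question below):

```python
def enhance_with_prompt_anchor(chunks, enhance, first_prompt=None, rolling=True):
    """Restore chunks in order, conditioning each on a prompt.

    rolling=True  -> the prompt is always the previous restored chunk.
    rolling=False -> the prompt stays fixed: `first_prompt` if given,
                     otherwise the first restored chunk.
    """
    outputs = []
    prompt = first_prompt
    for chunk in chunks:
        restored = enhance(chunk, prompt)
        outputs.append(restored)
        if rolling or prompt is None:
            prompt = restored
    return outputs
```

With the rolling variant, any artifact in one restored chunk can propagate into the next prompt, so drift over a long file is a possible downside even though it sounded slightly better in our tests.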

Main question

Is Metis SE currently expected to work mainly on short clips, rather than to deliver consistent long-form speech enhancement?

Or is there an intended inference strategy for long audio that we are missing?

More specifically:

  1. Is long-form SE consistency a known limitation of the current Metis release?
  2. Is there any recommended way to do long-form SE beyond naive chunking?
  3. Is there any intended continuity-conditioning / prompt-conditioning mechanism for SE, similar in spirit to VC/TSE prompt usage?
  4. Are there any hidden or recommended inference settings for improving long-form consistency?

Practical summary

From our tests, the core issue seems to be:

  • short clips sound very good
  • long clips are not stable
  • chunking alone does not fix it
  • prompt anchoring helps slightly, but not enough

So the current behavior suggests that SE restoration is locally strong but does not preserve a stable restoration “character” over long audio.

If helpful, I can also provide a more compact reproduction report.

This would be extremely valuable for restoration of archival / lecture / real-world degraded speech, where long-form consistency matters much more than short demo quality.

Thanks again for the release.

Labels: bug (Something isn't working)