[Help]: Metis SE long-form inconsistency across chunks / degraded quality on 60s+ files #488

Hi Amphion team,

Thank you for releasing Metis SE. We tested it carefully and the short-form restoration quality is genuinely impressive.

However, we are seeing a strong long-form consistency problem with speech enhancement, and I wanted to ask whether this is expected behavior of the current SE inference path, or whether we are missing the intended way to run it on longer audio.

## What we observe

Metis SE works very well on short clips around 9 seconds.

But on longer inputs (for example 60+ seconds), quality becomes much less stable:

- full-file SE inference sounds worse than short-form inference
- if we split a long file into 9-second chunks, each chunk is restored with a noticeably different character
- chunk-to-chunk differences are clearly audible:
  - some chunks are louder, others quieter
  - some chunks are cleaner, others more degraded
  - the restored voice/timbre/character is not consistent across the full file

So in practice, chunking does not solve the problem: it only turns one bad long-form result into a sequence of locally good or bad restorations stitched together.
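
To put a number on the "louder / quieter" part of this, one option is to measure the RMS level of each restored chunk and look at the spread. The snippet below is a minimal, dependency-free sketch of that measurement, not Metis code; the function names and the 9-second default are illustrative choices, and it assumes mono float samples in [-1.0, 1.0].

```python
import math

def chunk_rms_dbfs(samples, sample_rate, chunk_seconds=9.0):
    """Return the RMS level (dBFS) of each consecutive chunk of a mono signal."""
    chunk_len = max(1, int(sample_rate * chunk_seconds))
    levels = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        mean_square = sum(x * x for x in chunk) / len(chunk)
        # Floor the value so digital silence does not produce log10(0).
        levels.append(10.0 * math.log10(max(mean_square, 1e-12)))
    return levels

def level_spread_db(levels):
    """Difference between the loudest and quietest chunk, in dB."""
    return max(levels) - min(levels)
```

Running this on both the degraded input and the restored output helps separate level drift introduced by the restoration from level changes that were already in the recording.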

## What we already tested

We tried the following systematically:

1. Full-file inference on long audio
2. Fixed-size chunking
3. Chunking with overlap + waveform crossfade
4. Larger context windows with center-crop output
5. Seed sweeps
6. Greedy decoding modifications
7. Parameter sweeps for stage1 / S2A settings
8. Prompt-anchor experiments
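
For clarity, item 4 (larger context windows with center-crop output) means roughly the following. This is a hedged sketch rather than the exact script: `enhance` is a hypothetical stand-in for whatever call runs Metis SE on one window, and it is assumed to return as many samples as it receives.

```python
def enhance_with_context(samples, enhance, core_len, context_len):
    """Restore each core segment with extra context on both sides,
    then keep only the samples that belong to the core segment.

    enhance: hypothetical callable, list of samples -> list of samples
             of the same length (an assumption of this sketch).
    """
    n = len(samples)
    out = []
    for core_start in range(0, n, core_len):
        core_end = min(core_start + core_len, n)
        # Widen the window, clamped to the file boundaries.
        win_start = max(0, core_start - context_len)
        win_end = min(n, core_end + context_len)
        restored = enhance(samples[win_start:win_end])
        # Drop the context and keep the core part of the restored window.
        offset = core_start - win_start
        out.extend(restored[offset:offset + (core_end - core_start)])
    return out
```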

## Results

### 1. Seed behavior
- same seed on the same chunk is deterministic
- different seeds can produce noticeably different restorations for that chunk
- but a seed that sounds good for chunk A does not generalize to chunk B

So there does not seem to be a single “good seed” for the whole file.

### 2. Overlap / context
- overlap and crossfade only smooth transitions
- they do **not** solve the underlying inconsistency
- chunks still sound like independently regenerated restorations
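
The overlap + crossfade scheme referred to here can be sketched as below. It is a minimal pure-Python illustration with a linear fade; the function names are made up for this example, and the restoration step itself (not shown) would run on each chunk between the two calls.

```python
def split_with_overlap(samples, chunk_len, overlap):
    """Split into chunks of chunk_len where neighbours share `overlap` samples.
    Requires 0 <= overlap < chunk_len."""
    hop = chunk_len - overlap
    chunks, start = [], 0
    while True:
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break
        start += hop
    return chunks

def crossfade_chunks(chunks, overlap):
    """Join chunks, linearly crossfading over the shared `overlap` samples."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        base = len(out) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight of the new chunk
            out[base + i] = (1.0 - w) * out[base + i] + w * chunk[i]
        out.extend(chunk[overlap:])
    return out
```

The fade only touches the overlap region. If one restored chunk comes back at a different level or timbre than its neighbour, everything outside the overlap keeps that difference, which matches what we hear.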

### 3. Greedy decoding
We modified decoding to reduce sampling randomness.

Result:
- it reduced seed-to-seed variation within the same chunk
- but chunk-to-chunk inconsistency remained
- overall quality was slightly worse

### 4. Prompt-anchor experiments
We also tested prompt-based chunking ideas inspired by the prompt-conditioned paths in Metis:

- fixed prompt anchor
- rolling prompt anchor (using previous restored chunk as prompt)

Result:
- both improved consistency somewhat
- rolling prompt was slightly better than fixed prompt
- but the inconsistency is still very noticeable, and the result is not good enough for high-quality long-form restoration
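
The two anchoring variants differ only in which restored chunk is passed as the prompt. The sketch below illustrates that control flow under stated assumptions: `enhance(chunk, prompt)` is a hypothetical wrapper, not an existing Metis SE entry point, and the fixed anchor is assumed here to be the first restored chunk.

```python
def enhance_with_prompt_anchor(chunks, enhance, rolling=True):
    """Restore chunks in order, conditioning each one on a prompt.

    enhance: hypothetical callable (chunk, prompt) -> restored chunk;
             prompt is None for the first chunk.
    rolling=True  -> prompt is the previous restored chunk.
    rolling=False -> prompt stays fixed to the first restored chunk
                     (an assumption of this sketch).
    """
    restored = []
    prompt = None
    for chunk in chunks:
        out = enhance(chunk, prompt)
        restored.append(out)
        # Rolling mode always advances the anchor; fixed mode sets it once.
        if rolling or prompt is None:
            prompt = out
    return restored
```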

## Main question

Is Metis SE currently expected to work mainly on short clips, rather than to deliver consistent long-form speech enhancement?

Or is there an intended inference strategy for long audio that we are missing?

More specifically:

1. Is long-form SE consistency a known limitation of the current Metis release?
2. Is there any recommended way to do long-form SE beyond naive chunking?
3. Is there any intended continuity-conditioning / prompt-conditioning mechanism for SE, similar in spirit to VC/TSE prompt usage?
4. Are there any hidden or recommended inference settings for improving long-form consistency?

## Practical summary

From our tests, the core issue seems to be:

- short clips sound very good
- long clips are not stable
- chunking alone does not fix it
- prompt anchoring helps slightly, but not enough

So the current behavior suggests that SE restoration is locally strong but does not preserve a stable restoration “character” over long audio.

If helpful, I can also provide a more compact reproduction report.

This would be extremely valuable for restoration of archival / lecture / real-world degraded speech, where long-form consistency matters much more than short demo quality.

Thanks again for the release.

