[Bug] Vulkan buffer-size OOM on Ryzen AI Max+/Strix Halo during Qwen3 conditioning for Flux.2-Klein #1718

### Git commit

Not stated in the downstream report. The log appears to come from a prebuilt Windows `sd-server` release. The latest upstream release the downstream maintainer checked at the time of the report was `master-721-8caa3f9`.

### Operating System & Version

Windows, exact version unknown.

### GGML backends

Vulkan

### Hardware

AMD Ryzen AI Max+ 395 / Strix Halo-class machine with 128 GB system RAM and 96 GB allocated/shared as VRAM.

### Command-line arguments used

The report came in through the Klein-Paint GUI, which launches `sd-server` with the selected Flux.2-Klein model trio and, depending on the GUI settings, some or all of these flags:

```text
--diffusion-fa
--offload-to-cpu
--vae-tiling
--cfg-scale 1.0
--listen-ip 127.0.0.1
--listen-port 7399
```

The reporter reproduced the failure with `--offload-to-cpu` both on and off, and with `--diffusion-fa` (flash attention) both on and off. The generation used the defaults: 1024x1024, the Euler sampler, and Flux.2-Klein with a Qwen3 text encoder.

### Steps to reproduce

1. Use a Ryzen AI Max+ 395 / Strix Halo-class system with large shared VRAM exposed to Vulkan, e.g. 96 GB allocated from 128 GB RAM.
2. Start `sd-server` with Flux.2-Klein and a Qwen3 text encoder.
3. Request a 1024x1024 image generation.
4. Try with `--offload-to-cpu` on/off and `--diffusion-fa` on/off.
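
The four on/off combinations from step 4 can be enumerated with a small script. This is only a sketch for anyone reproducing by hand: the flag names come from the list above, while `repro_matrix`, `BASE_FLAGS`, and the empty `model_args` placeholder are hypothetical names introduced here. The model-path arguments are deliberately left out because the report does not state them.

```python
from itertools import product

# Flags taken from the report; the model arguments are NOT known from the
# report, so they are passed in by the caller (placeholder, empty by default).
BASE_FLAGS = ["--vae-tiling", "--cfg-scale", "1.0",
              "--listen-ip", "127.0.0.1", "--listen-port", "7399"]
TOGGLES = ["--offload-to-cpu", "--diffusion-fa"]


def repro_matrix(binary="sd-server", model_args=()):
    """Return one command line per on/off combination of the toggled flags."""
    commands = []
    for enabled in product([False, True], repeat=len(TOGGLES)):
        flags = [flag for flag, on in zip(TOGGLES, enabled) if on]
        commands.append([binary, *model_args, *flags, *BASE_FLAGS])
    return commands


if __name__ == "__main__":
    for cmd in repro_matrix():
        print(" ".join(cmd))
```

According to the reporter, all four variants hit the same allocation failure.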

### What you expected to happen

Generation should either run successfully, or fail cleanly with an actionable error explaining that the Vulkan backend hit a per-buffer/device-buffer-size limit and suggesting relevant mitigations.

### What actually happened

The process fails while preparing Qwen3 conditioning graph weights. This looks related to #1290, but it is not the VAE decode path, so `--vae-tiling` is unlikely to address this particular failure.

Log excerpt:

```text
main.cpp:148  - listening on: http://127.0.0.1:7399
[system] Ready — capabilities via /sdcpp/v1/capabilities
stable-diffusion.cpp:4515 - generate_image 1024x1024
[INFO ] denoiser.hpp:776  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3539 - sampling using Euler method

ggml_extend.hpp:67   - ggml_vulkan: Failed to allocate pinned memory (Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory)
[WARN ]
model_loader.cpp:1236 - loading tensors completed, taking 3.35s (read: 2.64s, memcpy: 0.00s, convert: 0.03s, copy_to_backend: 0.00s)

ggml_vulkan: Device memory allocation of size 2489319424 failed.
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] ggml_extend.hpp:70   - alloc_tensor_range: failed to allocate Vulkan0 buffer of size 2489319424
[ERROR] model_manager.cpp:291  - model manager alloc compute params backend buffer failed, num_tensors = 298
[ERROR] ggml_extend.hpp:1897 - qwen3 prepare graph weights failed
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\src\conditioning/conditioner.hpp:1671: GGML_ASSERT(!hidden_states.empty()) failed
[system] Process exited (code=3221226505 signal=null)
```
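
For triage, the exit code on the last log line is easier to read in hexadecimal. The snippet below is a small sketch that reinterprets it as a 32-bit Windows status value; the helper names are hypothetical. The result, `0xC0000409`, is the status Windows reports for fast-fail terminations, which is consistent with the process aborting on the `GGML_ASSERT` rather than crashing somewhere unrelated. That interpretation is an inference from the log, not something the report confirms.

```python
EXIT_CODE = 3221226505  # from "[system] Process exited (code=3221226505 ...)"


def as_ntstatus(code):
    """Format a process exit code as an 8-digit hexadecimal NTSTATUS-style value."""
    return f"0x{code & 0xFFFFFFFF:08X}"


def as_signed32(code):
    """Reinterpret the same 32 bits as a signed integer, as some tools print it."""
    code &= 0xFFFFFFFF
    return code - (1 << 32) if code >= (1 << 31) else code


if __name__ == "__main__":
    print(as_ntstatus(EXIT_CODE))   # 0xC0000409
    print(as_signed32(EXIT_CODE))   # -1073740791
```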

### Additional context / environment details

This was reported downstream from Klein-Paint, a GUI wrapper around the prebuilt `sd-server`; the wrapper only starts/proxies `sd-server` and does not allocate model buffers itself.

The failure appears to be a Vulkan max-buffer-size / contiguous-buffer limitation rather than total memory exhaustion: the requested allocation is ~2.32 GiB (`2489319424` bytes) on a system configured with far more shared VRAM (96 GB). It may be related to the `GGML_VK_FORCE_MAX_BUFFER_SIZE` workaround mentioned in #1290, but this path involves Qwen3 text-encoder graph weights rather than VAE decode.
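
The size arithmetic can be checked with a few lines. This is a sketch only: the actual per-buffer limit reported by the driver is not in the log, so the 2 GiB cap used below is a hypothetical candidate, not a measured value. `env_with_buffer_cap` shows how the `GGML_VK_FORCE_MAX_BUFFER_SIZE` variable could be set for a test launch, assuming the value is given in bytes; whether it helps on this code path is untested.

```python
import os

GIB = 1024 ** 3
REQUESTED = 2_489_319_424  # bytes, from the "failed to allocate Vulkan0 buffer" line


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def env_with_buffer_cap(cap_bytes, base_env=None):
    """Return a copy of the environment with the Vulkan buffer cap set.

    Assumption: the variable takes a plain byte count.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GGML_VK_FORCE_MAX_BUFFER_SIZE"] = str(cap_bytes)
    return env


if __name__ == "__main__":
    print(f"requested: {to_gib(REQUESTED):.2f} GiB")
    # Hypothetical cap; the real device limit is unknown from the report.
    print("exceeds 2 GiB:", REQUESTED > 2 * GIB)
    print("fits in 96 GiB:", REQUESTED < 96 * GIB)
```

The request sits just above 2 GiB and far below the configured 96 GB, which is why a per-buffer limit looks more plausible than exhaustion of total memory.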

It would help if stable-diffusion.cpp could either:

1. avoid the single large Vulkan allocation for Qwen3 conditioning graph weights on Vulkan/UMA systems,
2. expose/document an appropriate workaround for this path, or
3. fail before the assert with an actionable error instead of `GGML_ASSERT(!hidden_states.empty())`.
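
To make options 1 and 3 concrete, here is a minimal sketch of the idea in Python: group tensors greedily into several buffers that each stay under a cap, and raise a descriptive error when a single tensor cannot fit. It is an illustration only, not stable-diffusion.cpp or ggml code. `plan_buffers` and its parameters are hypothetical, the sizes in the usage example are made up, and alignment padding is ignored.

```python
def plan_buffers(tensor_sizes, max_buffer_size):
    """Greedily group tensor indices into buffers no larger than max_buffer_size.

    Raises MemoryError with an actionable message if one tensor alone exceeds
    the cap, instead of failing later on an unrelated assertion.
    """
    buffers = []
    current = []
    current_size = 0
    for index, size in enumerate(tensor_sizes):
        if size > max_buffer_size:
            raise MemoryError(
                f"tensor {index} needs {size} bytes but the backend buffer "
                f"limit is {max_buffer_size} bytes; try a smaller or more "
                f"heavily quantized text encoder, or another backend"
            )
        # Start a new buffer when adding this tensor would cross the cap.
        if current and current_size + size > max_buffer_size:
            buffers.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        buffers.append(current)
    return buffers


if __name__ == "__main__":
    # Made-up sizes for illustration; the real per-tensor sizes are not in the log.
    print(plan_buffers([3, 3, 3, 3], max_buffer_size=6))
```

The point of the error branch is that the caller can report the limit and the offending size up front, rather than reaching `GGML_ASSERT(!hidden_states.empty())` with no usable context.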
