[Bug] Vulkan buffer-size OOM on Ryzen AI Max+/Strix Halo during Qwen3 conditioning for Flux.2-Klein #1718

Description

@thomas9120

Git commit

Unknown; the downstream report did not include it. The log appears to come from a prebuilt Windows sd-server release. The latest upstream release at the time of the report, as checked by the downstream maintainer, was master-721-8caa3f9.

Operating System & Version

Windows, exact version unknown.

GGML backends

Vulkan

Hardware

AMD Ryzen AI Max+ 395 / Strix Halo-class machine with 128 GB system RAM and 96 GB allocated/shared as VRAM.

Command-line arguments used

This came through the Klein-Paint GUI, which launches sd-server with the selected Flux.2-Klein model trio and these optional flags depending on settings:

--diffusion-fa
--offload-to-cpu
--vae-tiling
--cfg-scale 1.0
--listen-ip 127.0.0.1
--listen-port 7399

The reporter reproduced the failure with --offload-to-cpu both on and off, and with flash attention (--diffusion-fa) both on and off. The failing request used the defaults: 1024x1024, Euler sampling, and Flux.2-Klein with the Qwen3 text encoder.

Steps to reproduce

  1. Use a Ryzen AI Max+ 395 / Strix Halo-class system with large shared VRAM exposed to Vulkan, e.g. 96 GB allocated from 128 GB RAM.
  2. Start sd-server with Flux.2-Klein and a Qwen3 text encoder.
  3. Request a 1024x1024 image generation.
  4. Try with --offload-to-cpu on/off and --diffusion-fa on/off.
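The four on/off combinations from step 4 can be enumerated with a short script. This is only a sketch: the two toggled flags and the remaining flags come from this report, but MODEL_ARGS is a hypothetical placeholder because the report does not include the model arguments Klein-Paint passes, and keeping the other flags constant across runs is an assumption.

```python
from itertools import product

# The two flags the reporter toggled on and off.
TOGGLES = ["--offload-to-cpu", "--diffusion-fa"]

# Other flags listed in the report; assumed constant across runs here.
FIXED = ["--vae-tiling", "--cfg-scale", "1.0",
         "--listen-ip", "127.0.0.1", "--listen-port", "7399"]

# Hypothetical placeholder: the report does not show the model arguments.
MODEL_ARGS = ["<Flux.2-Klein and Qwen3 model arguments>"]


def repro_matrix():
    """Return one sd-server argument list per on/off combination."""
    runs = []
    for states in product([False, True], repeat=len(TOGGLES)):
        enabled = [flag for flag, on in zip(TOGGLES, states) if on]
        runs.append(["sd-server", *MODEL_ARGS, *enabled, *FIXED])
    return runs


for argv in repro_matrix():
    print(" ".join(argv))
```

According to the report, all four combinations fail the same way.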

What you expected to happen

Generation should either run successfully, or fail cleanly with an actionable error explaining that the Vulkan backend hit a per-buffer/device-buffer-size limit and suggesting relevant mitigations.

What actually happened

The process fails while preparing the Qwen3 conditioning graph weights. This looks related to #1290, but the failure is in the text-encoder conditioning path rather than the VAE decode path, so --vae-tiling is unlikely to address it.

Log excerpt:

main.cpp:148  - listening on: http://127.0.0.1:7399
[system] Ready — capabilities via /sdcpp/v1/capabilities
stable-diffusion.cpp:4515 - generate_image 1024x1024
[INFO ] denoiser.hpp:776  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3539 - sampling using Euler method

ggml_extend.hpp:67   - ggml_vulkan: Failed to allocate pinned memory (Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory)
[WARN ]
model_loader.cpp:1236 - loading tensors completed, taking 3.35s (read: 2.64s, memcpy: 0.00s, convert: 0.03s, copy_to_backend: 0.00s)

ggml_vulkan: Device memory allocation of size 2489319424 failed.
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] ggml_extend.hpp:70   - alloc_tensor_range: failed to allocate Vulkan0 buffer of size 2489319424
[ERROR] model_manager.cpp:291  - model manager alloc compute params backend buffer failed, num_tensors = 298
[ERROR] ggml_extend.hpp:1897 - qwen3 prepare graph weights failed
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\src\conditioning/conditioner.hpp:1671: GGML_ASSERT(!hidden_states.empty()) failed
[system] Process exited (code=3221226505 signal=null)
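The exit code on the last log line is easier to interpret in hexadecimal. A minimal sketch of the conversion (the helper name is made up for illustration):

```python
# Exit code copied from "[system] Process exited (code=3221226505 signal=null)".
EXIT_CODE = 3221226505


def as_ntstatus(code: int) -> str:
    """Format a Windows process exit code as a 32-bit hex NTSTATUS value."""
    return f"0x{code & 0xFFFFFFFF:08X}"


print(as_ntstatus(EXIT_CODE))  # 0xC0000409
```

0xC0000409 is STATUS_STACK_BUFFER_OVERRUN, which Windows also reports for fail-fast terminations such as abort() in the MSVC runtime. That is consistent with the process aborting on the failed GGML_ASSERT and does not by itself indicate a real stack overrun, though this reading is an inference from the log rather than something confirmed with a debugger.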

Additional context / environment details

This was reported downstream from Klein-Paint, a GUI wrapper around the prebuilt sd-server; the wrapper only starts/proxies sd-server and does not allocate model buffers itself.

The failure appears to be a Vulkan max-buffer-size / contiguous-buffer limitation rather than total memory exhaustion: the requested allocation is ~2.32 GiB (2489319424 bytes), a small fraction of the 96 GB of shared VRAM configured on this system. It may be related to the GGML_VK_FORCE_MAX_BUFFER_SIZE workaround mentioned in #1290, but this path involves Qwen3 text-encoder graph weights rather than VAE decode.
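The arithmetic behind that reading, as a quick check. The 2 GiB per-buffer limit below is an assumption for illustration only; the actual limit reported by the reporter's driver is not in the log. The 96 GB figure is treated as GiB for simplicity.

```python
GIB = 1 << 30

# Size from "ggml_vulkan: Device memory allocation of size 2489319424 failed."
REQUESTED = 2489319424

# Assumed per-buffer limit, for illustration; not taken from the report.
ASSUMED_LIMIT = 2 * GIB

# Shared VRAM configured on the reporter's machine (treated as GiB here).
CONFIGURED_VRAM = 96 * GIB


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


print(f"requested: {to_gib(REQUESTED):.2f} GiB")  # requested: 2.32 GiB
print("exceeds assumed per-buffer limit:", REQUESTED > ASSUMED_LIMIT)
print("fits in configured VRAM:", REQUESTED < CONFIGURED_VRAM)
```

If the device's limit really is in that range, a single 2.32 GiB buffer would be rejected regardless of how much memory is free, which matches the "exceeds device buffer size limit" wording in the log.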

It would help if stable-diffusion.cpp could either:

  1. avoid the single large Vulkan allocation for Qwen3 conditioning graph weights on Vulkan/UMA systems,
  2. expose/document an appropriate workaround for this path, or
  3. fail before the assert with an actionable error instead of GGML_ASSERT(!hidden_states.empty()).
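To make options 1 and 3 concrete, here is a rough sketch of the idea: group tensors into several buffers that each stay under the device limit, and raise a descriptive error when that is impossible. This is not how ggml or stable-diffusion.cpp allocates buffers; plan_buffers and BufferLimitError are hypothetical names, the tensor sizes are made up, and alignment and padding are ignored.

```python
class BufferLimitError(RuntimeError):
    """Raised when a single tensor cannot fit in any one buffer."""


def plan_buffers(tensor_sizes, max_buffer_size):
    """Greedily group tensor indices into buffers of at most max_buffer_size bytes."""
    buffers, current, used = [], [], 0
    for index, size in enumerate(tensor_sizes):
        if size > max_buffer_size:
            raise BufferLimitError(
                f"tensor {index} needs {size} bytes but the backend allows at most "
                f"{max_buffer_size} bytes per buffer; try a smaller or more heavily "
                f"quantized text encoder, or a different backend"
            )
        # Start a new buffer when the next tensor would overflow the current one.
        if current and used + size > max_buffer_size:
            buffers.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        buffers.append(current)
    return buffers


# Made-up example: 298 tensors (the count from the log) of 8 MB each, 1 GiB limit.
sizes = [8_000_000] * 298
plan = plan_buffers(sizes, 1 << 30)
print(len(plan), "buffers")  # 3 buffers
```

Even without the splitting, checking the requested size against the limit up front and reporting it this way would be more useful than the current GGML_ASSERT(!hidden_states.empty()) abort.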
