
[Feature Request] Visual gaze following during conversation #70

@xuruiray

Background

I originally wanted StackChan to look toward the person it is interacting with during AI Agent conversation. This is related to, but not exactly the same as, traditional sound-source following. The desired behavior is visual gaze / face following during LISTENING and SPEAKING, while keeping the existing official idle behavior otherwise.

Related PR: #72

The PR has now been changed to a standalone experiment only. It no longer changes the production firmware path.

Experiment Process

1. Lightweight visual heuristic in the main firmware

The first attempt added a conversation gaze modifier to the AI Agent path and used lightweight camera heuristics based on sampled skin-color / motion cues.

Result:

  • It could compile and flash, but real tracking quality was poor.
  • The target estimate drifted easily.
  • The head sometimes kept moving toward one side even when the face was centered.
  • It did not form a reliable closed-loop tracking behavior.

2. ESP-DL / face detection evaluation

I then evaluated a more standard vision path using Espressif ESP-DL / human_face_detect.

A standalone demo was created under:

experiments/vision_tracking_demo

The demo isolates only the parts needed for the experiment:

  • GC0308 DVP camera at 320x240 RGB565
  • ESP-DL human_face_detect
  • display debug dot for face position
  • yaw / pitch serial servo output

It intentionally does not start the full AI Agent stack, Wi-Fi/MQTT, audio AFE, LVGL app flow, or normal idle animation system.
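To drive the yaw / pitch output, a detected face position in the 320x240 frame first has to become an angular error relative to the image center. The following is a minimal sketch of that step, not the demo's actual code. `FaceBox`, `GazeError`, and `face_to_gaze_error` are hypothetical names, the field-of-view constants are assumed placeholders (not measured GC0308 values), and the pitch sign depends on how the camera and servos are mounted.

```cpp
#include <cassert>

// Hypothetical types: the detector's real output format is not shown in this issue.
struct FaceBox { int x0, y0, x1, y1; };        // pixel corners in the camera frame
struct GazeError { float yaw_deg, pitch_deg; }; // offset of the face from image center

constexpr int kFrameW = 320;        // frame size used by the demo (320x240 RGB565)
constexpr int kFrameH = 240;
constexpr float kHFovDeg = 60.0f;   // ASSUMED horizontal field of view, calibrate on hardware
constexpr float kVFovDeg = 45.0f;   // ASSUMED vertical field of view, calibrate on hardware

// Linear approximation: map the box center to [-1, 1] on each axis,
// then scale by half the field of view.
inline GazeError face_to_gaze_error(const FaceBox& b) {
    assert(b.x1 >= b.x0 && b.y1 >= b.y0);
    const float cx = 0.5f * static_cast<float>(b.x0 + b.x1);
    const float cy = 0.5f * static_cast<float>(b.y0 + b.y1);
    const float nx = (cx - kFrameW / 2.0f) / (kFrameW / 2.0f);
    const float ny = (cy - kFrameH / 2.0f) / (kFrameH / 2.0f);
    // Image y grows downward, so negate it to make "face above center" a positive pitch.
    return GazeError{nx * kHFovDeg / 2.0f, -ny * kVFovDeg / 2.0f};
}
```

A face centered in the frame yields an error near zero on both axes, which is also what the on-screen debug dot should show when the detector and servos agree.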

Result:

  • The standalone demo can build and run the camera -> face detection -> smoothing -> servo command loop.
  • With smoothing, deadband, short target hold, and servo rate limiting, the isolated demo became usable enough for basic face tracking.
  • The screen dot is helpful for debugging whether the detector position and servo command agree.
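The combination of smoothing, deadband, short target hold, and servo rate limiting mentioned above can be sketched per axis as follows. This is an illustrative reconstruction under stated assumptions, not the code from the demo: `AxisFilter`, `AxisFilterConfig`, and every default value are hypothetical and would need tuning on real hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// All values below are ASSUMED placeholders, not the demo's tuning.
struct AxisFilterConfig {
    float alpha = 0.4f;         // smoothing weight given to each new measurement
    float deadband_deg = 2.0f;  // ignore target offsets smaller than this
    float max_step_deg = 3.0f;  // rate limit: largest change per update
    int hold_frames = 3;        // keep the last target for this many missed frames
    float min_deg = -45.0f;     // mechanical limits of the axis
    float max_deg = 45.0f;
};

class AxisFilter {
public:
    explicit AxisFilter(AxisFilterConfig cfg = AxisFilterConfig{}) : cfg_(cfg) {
        assert(cfg_.alpha > 0.0f && cfg_.alpha <= 1.0f);
    }

    // error_deg: angular offset of the face from the image center (camera moves with the head).
    // Returns the new commanded servo angle for this axis.
    float update(bool has_face, float error_deg) {
        if (has_face) {
            // Smooth the absolute target angle, not the raw error.
            const float raw_target = command_ + error_deg;
            target_ = cfg_.alpha * raw_target + (1.0f - cfg_.alpha) * target_;
            missing_ = 0;
        } else if (++missing_ > cfg_.hold_frames) {
            target_ = command_;  // hold expired: stop where we are
        }
        float delta = target_ - command_;
        if (std::fabs(delta) < cfg_.deadband_deg) delta = 0.0f;                   // deadband
        delta = std::max(-cfg_.max_step_deg, std::min(cfg_.max_step_deg, delta)); // rate limit
        command_ = std::max(cfg_.min_deg, std::min(cfg_.max_deg, command_ + delta));
        return command_;
    }

private:
    AxisFilterConfig cfg_;
    float target_ = 0.0f;
    float command_ = 0.0f;
    int missing_ = 0;
};
```

One instance would be used for yaw and one for pitch. Smoothing the absolute target rather than the raw error means a briefly lost face leaves the head moving toward the last known position instead of freezing or snapping back.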

3. Integration back into the full AI Agent firmware

After the standalone demo became acceptable, I tried moving the idea back into the main AI Agent firmware so that idle motion before conversation would gaze toward the face while keeping other official behavior unchanged.

Result:

  • The behavior became much worse than the standalone demo.
  • The device sometimes rebooted during early iterations.
  • Tracking was sluggish or jittery when AI Agent, audio, network, UI, camera, and servo logic were all active.
  • Serial logs showed very low internal SRAM in several runs. In the worst full-firmware tracking runs, free internal SRAM dropped to a few KB, with low-water marks ranging from a few hundred bytes to a few KB.
  • While experimenting with visual tracking in the full firmware, I also observed "AFE: Ringbuffer of AFE(FEED) is full" warnings, MQTT/Wi-Fi instability, and slow response.
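Given how low free internal SRAM dropped, one defensive option is to run vision only while there is memory headroom. The sketch below is a hypothetical guard with hysteresis, not something present in the firmware: `VisionBudgetGate` and its thresholds are invented for illustration. On ESP-IDF the input could come from `heap_caps_get_free_size(MALLOC_CAP_INTERNAL)`, and the low-water mark from `heap_caps_get_minimum_free_size(MALLOC_CAP_INTERNAL)`. The free-byte count is passed in as a plain argument so the logic can be checked off-device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical guard: enables vision only when free internal SRAM is comfortably
// above a floor, and disables it again when memory gets tight.
class VisionBudgetGate {
public:
    VisionBudgetGate(std::size_t disable_below, std::size_t enable_above)
        : disable_below_(disable_below), enable_above_(enable_above) {
        // The gap between the two thresholds prevents rapid on/off toggling.
        assert(enable_above_ > disable_below_);
    }

    // Call periodically with the current free internal SRAM in bytes.
    // Returns whether vision should run right now.
    bool update(std::size_t free_internal_bytes) {
        if (enabled_ && free_internal_bytes < disable_below_) {
            enabled_ = false;
        } else if (!enabled_ && free_internal_bytes > enable_above_) {
            enabled_ = true;
        }
        return enabled_;
    }

private:
    std::size_t disable_below_;
    std::size_t enable_above_;
    bool enabled_ = false;  // start disabled until headroom is confirmed
};
```

A gate like this would not fix the underlying budget problem. It would only keep tracking from pushing the audio and network stacks over the edge, and the thresholds would have to come from measured logs.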

Current Conclusion

The current hardware can run a minimal face tracking demo, but the full AI Agent firmware appears to be very close to the ESP32-S3 internal SRAM / real-time scheduling limit.

My current conclusion is:

  • Basic visual face tracking is feasible as a standalone demo on this hardware.
  • Merging it directly into the production AI Agent path is not ready yet.
  • The limiting factor seems less like a single algorithm bug and more like a hardware/resource budget issue: CPU time, internal SRAM, camera quality, audio AFE, network stack, UI, and servo control are all competing at once.

Possible next directions:

  • make visual tracking an optional experimental build feature only;
  • use a much cheaper detector or ROI-based tracker after initial face detection;
  • run vision at a very low fixed rate in a fully asynchronous pipeline;
  • increase memory/partition budget if maintainers are comfortable with that;
  • offload vision to stronger hardware or a dedicated vision module;
  • use sound direction only as a fallback cue, not as the main first-stage solution.
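The "very low fixed rate in a fully asynchronous pipeline" direction could look roughly like this: a rate gate decides when the vision task may run, and a single-slot mailbox hands the newest target to the servo loop without either side queueing work for the other. This is a sketch under assumptions, not a proposed final design. `FixedRateGate`, `LatestTargetMailbox`, and `GazeTarget` are hypothetical names, and `std::mutex` is used for portability. In a FreeRTOS-based firmware, a length-1 queue written with `xQueueOverwrite` might serve the same purpose.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

struct GazeTarget {          // hypothetical message from vision to servo control
    float yaw_deg;
    float pitch_deg;
    std::uint32_t frame_id;
};

// Lets a caller run at most once per period. Unsigned subtraction keeps the
// comparison valid across millisecond counter wraparound.
class FixedRateGate {
public:
    explicit FixedRateGate(std::uint32_t period_ms) : period_ms_(period_ms) {
        assert(period_ms_ > 0);
    }
    bool due(std::uint32_t now_ms) {
        if (started_ && now_ms - last_ms_ < period_ms_) return false;
        started_ = true;
        last_ms_ = now_ms;
        return true;
    }
private:
    std::uint32_t period_ms_;
    std::uint32_t last_ms_ = 0;
    bool started_ = false;
};

// Single-slot mailbox: the producer overwrites stale data, and the consumer
// only ever sees the newest target and never waits for the producer.
class LatestTargetMailbox {
public:
    void publish(const GazeTarget& t) {
        std::lock_guard<std::mutex> lock(m_);
        slot_ = t;
        fresh_ = true;
    }
    // Returns false if nothing new has arrived since the last take().
    bool take(GazeTarget& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (!fresh_) return false;
        out = slot_;
        fresh_ = false;
        return true;
    }
private:
    std::mutex m_;
    GazeTarget slot_{};
    bool fresh_ = false;
};
```

In use, the vision task would call `due()` with a millisecond tick and `publish()` after each detection, while the servo loop calls `take()` on every cycle and keeps its previous target when it returns false. Dropping stale frames instead of queueing them is deliberate, since an old face position is worse than none for gaze control.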

Questions

  1. Has anyone successfully implemented face/gaze following on StackChan or similar ESP32-S3 + GC0308 + pan/tilt servo hardware?
  2. Is ESP-DL / human_face_detect considered realistic for the full StackChan AI Agent firmware, or should it stay as a standalone/optional experiment?
  3. Are there recommended camera settings, model choices, or lower-cost tracking approaches for this GC0308 camera?
  4. Would maintainers prefer an experimental demo PR, an optional Kconfig-based feature, or no production integration until stronger hardware is available?
  5. Is there a recommended architecture for this feature so that camera sampling, screen refresh, audio AFE, and servo control do not block each other?
