Background
I originally wanted StackChan to look toward the person it is interacting with during an AI Agent conversation. This is related to, but not the same as, traditional sound-source following. The desired behavior is visual gaze / face following during LISTENING and SPEAKING, while keeping the existing official idle behavior otherwise.
Related PR: #72
The PR has now been changed to a standalone experiment only. It no longer changes the production firmware path.
Experiment Process
1. Lightweight visual heuristic in the main firmware
The first attempt added a conversation gaze modifier to the AI Agent path and used lightweight camera heuristics based on sampled skin-color / motion cues.
Result:
- It could compile and flash, but real tracking quality was poor.
- The target estimate drifted easily.
- The head sometimes kept moving toward one side even when the face was centered.
- It did not form a reliable closed-loop tracking behavior.
2. ESP-DL / face detection evaluation
I then evaluated a more standard vision path using Espressif ESP-DL / human_face_detect.
A standalone demo was created under:
experiments/vision_tracking_demo
The demo isolates only the parts needed for the experiment:
- GC0308 DVP camera at 320x240 RGB565
- ESP-DL human_face_detect
- display debug dot for face position
- yaw / pitch serial servo output
It intentionally does not start the full AI Agent stack, Wi-Fi/MQTT, audio AFE, LVGL app flow, or normal idle animation system.
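The step from a detected face position to a yaw / pitch error can be sketched as below. This is a minimal illustration, not the demo's actual code: the FaceBox and GazeOffset types, the face_offset_deg function, the sign conventions, and the field-of-view parameters are all hypothetical, and the linear mapping is only a small-angle approximation. The real field of view of the GC0308 lens would have to be measured.

```cpp
#include <cassert>

// Hypothetical types for illustration only; this is not the ESP-DL result type.
struct FaceBox {
    int x_min, y_min, x_max, y_max;  // pixel corners in the camera frame
};

struct GazeOffset {
    float yaw_deg;    // positive: face is right of the image center (assumed convention)
    float pitch_deg;  // positive: face is above the image center (assumed convention)
};

// Maps the center of a face box to an angular offset from the optical axis.
// hfov_deg / vfov_deg are assumed camera field-of-view values, not measured ones.
GazeOffset face_offset_deg(const FaceBox& box, int frame_w, int frame_h,
                           float hfov_deg, float vfov_deg) {
    assert(frame_w > 0 && frame_h > 0);
    const float cx = 0.5f * static_cast<float>(box.x_min + box.x_max);
    const float cy = 0.5f * static_cast<float>(box.y_min + box.y_max);
    // Normalized offset from the image center, in [-0.5, 0.5].
    const float nx = cx / static_cast<float>(frame_w) - 0.5f;
    const float ny = cy / static_cast<float>(frame_h) - 0.5f;
    // Image y grows downward, so the pitch sign is flipped.
    return GazeOffset{nx * hfov_deg, -ny * vfov_deg};
}
```

A face centered in the 320x240 frame yields a zero offset under this mapping, which is a quick sanity check against the debug dot. Whether a positive yaw offset should increase or decrease the servo angle depends on camera mirroring and servo mounting, so the sign may need to be inverted on real hardware.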
Result:
- The standalone demo can build and run the camera -> face detection -> smoothing -> servo command loop.
- With smoothing, deadband, short target hold, and servo rate limiting, the isolated demo became usable enough for basic face tracking.
- The screen dot is helpful for debugging whether the detector position and servo command agree.
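The combination of smoothing, deadband, short target hold, and servo rate limiting mentioned above could look roughly like the per-axis filter below. This is a sketch under stated assumptions rather than the code from the demo: the AxisFilter class, its configuration fields, and the choice of an exponential moving average for smoothing are all illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical per-axis gaze filter; names and constants are illustrative.
struct AxisFilterConfig {
    float smoothing_alpha;  // EMA weight for new measurements, in (0, 1]
    float deadband_deg;     // errors smaller than this do not move the servo
    float hold_time_s;      // keep the last target this long after the face is lost
    float max_rate_deg_s;   // servo slew limit
};

class AxisFilter {
public:
    AxisFilter(const AxisFilterConfig& cfg, float initial_deg)
        : cfg_(cfg), target_deg_(initial_deg), command_deg_(initial_deg) {
        assert(cfg.smoothing_alpha > 0.0f && cfg.smoothing_alpha <= 1.0f);
    }

    // face_visible: whether the detector found a face in this frame.
    // measured_deg: angle toward the face (ignored when face_visible is false).
    // dt_s: time since the previous update. Returns the new servo command.
    float update(bool face_visible, float measured_deg, float dt_s) {
        assert(dt_s >= 0.0f);
        if (face_visible) {
            // Smoothing: exponential moving average of the measured angle.
            target_deg_ += cfg_.smoothing_alpha * (measured_deg - target_deg_);
            lost_time_s_ = 0.0f;
        } else {
            // Target hold: keep following the last target for a short time.
            lost_time_s_ += dt_s;
            if (lost_time_s_ > cfg_.hold_time_s) {
                target_deg_ = command_deg_;  // hold expired, stop moving
            }
        }
        const float error = target_deg_ - command_deg_;
        // Deadband: ignore small errors to avoid servo jitter.
        if (std::fabs(error) < cfg_.deadband_deg) {
            return command_deg_;
        }
        // Rate limit: move at most max_rate_deg_s * dt_s per update.
        const float max_step = cfg_.max_rate_deg_s * dt_s;
        command_deg_ += std::max(-max_step, std::min(error, max_step));
        return command_deg_;
    }

private:
    AxisFilterConfig cfg_;
    float target_deg_;
    float command_deg_;
    float lost_time_s_ = 0.0f;
};
```

One instance per axis (yaw and pitch) would be updated once per detection frame. The constants interact: a wide deadband hides detector noise but leaves a visible residual offset, while a low rate limit makes motion calm but slow to catch up.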
3. Integration back into the full AI Agent firmware
After the standalone demo became acceptable, I tried moving the idea back into the main AI Agent firmware so that idle motion before conversation would gaze toward the face while keeping other official behavior unchanged.
Result:
- The behavior became much worse than the standalone demo.
- The device sometimes rebooted during early iterations.
- Tracking was sluggish or jittery when AI Agent, audio, network, UI, camera, and servo logic were all active.
- Serial logs showed very low free internal SRAM in several runs. In the worst full-firmware tracking runs, free internal SRAM dropped to a few KB, and the observed low-water marks ranged from a few hundred bytes to a few KB.
- I also observed "AFE: Ringbuffer of AFE(FEED) is full" warnings, MQTT/Wi-Fi instability, and slow response while experimenting with visual tracking in the full firmware.
Current Conclusion
The current hardware can run a minimal face tracking demo, but the full AI Agent firmware appears to be very close to the ESP32-S3 internal SRAM / real-time scheduling limit.
My current conclusion is:
- Basic visual face tracking is feasible as a standalone demo on this hardware.
- Merging it directly into the production AI Agent path is not ready yet.
- The limiting factor seems less like a single algorithm bug and more like a hardware/resource budget issue: CPU time, internal SRAM, camera quality, audio AFE, network stack, UI, and servo control are all competing at once.
Possible next directions:
- make visual tracking an optional experimental build feature only;
- use a much cheaper detector or ROI-based tracker after initial face detection;
- run vision at a very low fixed rate in a fully asynchronous pipeline;
- increase memory/partition budget if maintainers are comfortable with that;
- offload vision to stronger hardware or a dedicated vision module;
- use sound direction only as a fallback cue, not as the main first-stage solution.
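As one possible shape for the "low fixed rate, fully asynchronous" direction, the sketch below decouples the vision task from the servo loop with a single-slot "latest value wins" mailbox and a fixed-rate gate. Everything here is hypothetical (FaceTarget, LatestTargetMailbox, FixedRateGate), and it uses std::mutex for portability. On the device, a FreeRTOS queue of length 1 written with xQueueOverwrite would be a natural equivalent, though I have not verified that this fits the current firmware's task layout.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical message passed from the vision task to the servo loop.
struct FaceTarget {
    float yaw_deg;
    float pitch_deg;
    unsigned frame_id;
};

// Single-slot mailbox: the producer overwrites, the consumer never waits for data.
class LatestTargetMailbox {
public:
    void publish(const FaceTarget& t) {
        std::lock_guard<std::mutex> lock(mutex_);
        latest_ = t;    // older unread targets are simply dropped
        fresh_ = true;
    }

    // Returns false when nothing new has arrived since the last take().
    bool take(FaceTarget& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!fresh_) {
            return false;
        }
        out = latest_;
        fresh_ = false;
        return true;
    }

private:
    std::mutex mutex_;
    FaceTarget latest_{};
    bool fresh_ = false;
};

// Lets the vision task run detection at most once per period_ms.
class FixedRateGate {
public:
    explicit FixedRateGate(uint32_t period_ms) : period_ms_(period_ms) {
        assert(period_ms > 0);
    }

    bool due(uint32_t now_ms) {
        // Signed difference keeps the comparison correct across timer wraparound.
        if (started_ && static_cast<int32_t>(now_ms - next_ms_) < 0) {
            return false;
        }
        started_ = true;
        next_ms_ = now_ms + period_ms_;
        return true;
    }

private:
    uint32_t period_ms_;
    uint32_t next_ms_ = 0;
    bool started_ = false;
};
```

The intent is that a slow or stalled detector only makes the gaze target stale. It cannot back-pressure audio feeding, UI refresh, or servo updates, because the servo loop calls take() and continues with its previous target when nothing new is available.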
Questions
- Has anyone successfully implemented face/gaze following on StackChan or similar ESP32-S3 + GC0308 + pan/tilt servo hardware?
- Is ESP-DL / human_face_detect considered realistic for the full StackChan AI Agent firmware, or should it stay as a standalone/optional experiment?
- Are there recommended camera settings, model choices, or lower-cost tracking approaches for this GC0308 camera?
- Would maintainers prefer an experimental demo PR, an optional Kconfig-based feature, or no production integration until stronger hardware is available?
- Is there a recommended architecture for this feature so that camera sampling, screen refresh, audio AFE, and servo control do not block each other?