Hi everyone.
I'm working on detecting articulatory tongue movements through a smartphone's front-facing camera. Specifically, I need to reliably classify vertical movement (up / down / neutral position) in real time. The environment is everyday use — regular lighting, no additional hardware or depth sensors. The target audience is children, which adds variability in both anatomy and behavior in front of the camera.
I've experimented with MediaPipe, but it provides no landmarks on the tongue itself, so it isn't viable as a standalone tongue detector. It works fine as an ROI localizer for the mouth area, but that's about it.
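For context, the ROI step I mean looks roughly like the sketch below. It assumes the lip landmarks arrive as normalized (x, y) pairs in the 0..1 range, which is how I receive them; the function name `mouth_roi` and the `pad` margin are my own illustrative choices, not part of any library.

```python
from typing import Iterable, Tuple


def mouth_roi(
    lip_points: Iterable[Tuple[float, float]],
    frame_w: int,
    frame_h: int,
    pad: float = 0.25,  # hypothetical margin, as a fraction of the lip box size
) -> Tuple[int, int, int, int]:
    """Return a padded (left, top, right, bottom) pixel box around the lips.

    lip_points are assumed to be normalized (x, y) coordinates in 0..1.
    The box is clipped to the frame; right and bottom are exclusive.
    """
    points = list(lip_points)
    if not points:
        raise ValueError("need at least one lip landmark")

    # Convert normalized coordinates to pixel coordinates.
    xs = [x * frame_w for x, _ in points]
    ys = [y * frame_h for _, y in points]

    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    # Pad the tight box so the tongue stays inside when it leaves the lip line.
    pad_w = (x1 - x0) * pad
    pad_h = (y1 - y0) * pad

    left = max(0, int(x0 - pad_w))
    top = max(0, int(y0 - pad_h))
    right = min(frame_w, int(x1 + pad_w) + 1)
    bottom = min(frame_h, int(y1 + pad_h) + 1)
    return left, top, right, bottom
```

With a NumPy-style image array, the crop would then be `frame[top:bottom, left:right]`. Everything downstream of that crop (the actual up / down / neutral classification) is the part I have no good answer for.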
A few questions I'd love to get input on:
Are there approaches that actually work for this kind of task, or is it inherently constrained given the capture conditions (mobile RGB camera, no depth, kids as users)? Where's the realistic ceiling in terms of accuracy?
Can anything be borrowed from adjacent fields — deformable object tracking, medical imaging (ultrasound tongue tracking, EdgeTrak), pose estimation tools like DeepLabCut (originally for animal behavior research), or even industrial object detection — where similar problems seem to be solved reasonably well?
Has anyone here actually built detection or tracking of intra-oral structures (tongue, teeth) on mobile hardware? What ended up working with high enough accuracy in production, and what were the main pitfalls — segmentation quality per frame, temporal consistency, on-device performance, or something else entirely?
Any pointers, papers, repos, or war stories would be hugely appreciated.
Thanks in advance.