The Problem: Static Playlists in a Dynamic World
Music is deeply personal, but our music players are surprisingly impersonal. We’ve all been there: You’re having a rough day, but your "Daily Mix" decides it’s the perfect time for high-energy dance pop. Or you're in the zone working, and a jarring ballad breaks your flow.
We curate playlists for specific moods, but scrolling through them takes effort. What if your phone could just look at you, understand how you’re feeling, and play the perfect track automatically?
In this post, we’re diving into the EmotionToMusic-App, an open-source Android project that bridges the gap between computer vision and music recommendation. We’ll explore how to build a pipeline that goes from Face -> Emotion -> Music in real-time, all on-device.
Technical Overview
This application is built natively in Kotlin and follows modern Android development practices. The core philosophy is on-device inference. By running the Machine Learning (ML) models locally, we ensure:
- Privacy: No images of the user are ever sent to a cloud server.
- Latency: The music reacts near-instantaneously, with no network round trip.
- Offline Capability: It works without an internet connection.
The Tech Stack
- Language: Kotlin
- Camera: CameraX (Jetpack library for easy camera lifecycle management)
- ML Engine: TensorFlow Lite (for emotion classification)
- Face Detection: Google ML Kit (to locate the face before analysis)
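If you want to reproduce this stack, the module-level Gradle setup looks roughly like the block below. The artifact coordinates are the standard ones for these libraries; the version numbers are illustrative placeholders, not pinned by the repo.

// build.gradle.kts (module) -- versions shown are placeholders; use the latest stable releases
dependencies {
    // CameraX: preview + analysis use cases bound to the lifecycle
    implementation("androidx.camera:camera-camera2:1.3.4")
    implementation("androidx.camera:camera-lifecycle:1.3.4")
    implementation("androidx.camera:camera-view:1.3.4")

    // ML Kit on-device face detection
    implementation("com.google.mlkit:face-detection:16.1.6")

    // TensorFlow Lite interpreter for the emotion model
    implementation("org.tensorflow:tensorflow-lite:2.14.0")
}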
How It Works
The application functions as a continuous loop. It doesn't just take a single photo; it analyzes the video stream frame by frame. Here is the high-level flow:
- Capture: CameraX intercepts a frame from the live preview.
- Detection: ML Kit scans the full frame to find a face.
- Preprocessing: The face is cropped, converted to grayscale, and resized (usually to 48x48 pixels) to match the model's input requirements.
- Inference: The processed image is fed into a TFLite model (often a CNN trained on the FER2013 dataset).
- Output: The model returns a probability array (e.g., [Happy: 0.8, Sad: 0.1, Neutral: 0.1]).
- Action: The app maps the highest probability emotion to a specific genre and triggers the Media Player.
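To make the Capture step concrete, here is a minimal sketch (not lifted from the repo) of how a CameraX ImageAnalysis use case can be bound to the lifecycle and pointed at the analyzer we'll meet in the next section. STRATEGY_KEEP_ONLY_LATEST drops stale frames so analysis never falls behind the preview.

import java.util.concurrent.Executors
import android.content.Context
import androidx.camera.core.CameraSelector
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.Preview
import androidx.camera.lifecycle.ProcessCameraProvider
import androidx.camera.view.PreviewView
import androidx.core.content.ContextCompat
import androidx.lifecycle.LifecycleOwner

// Sketch: bind the front camera's preview and analysis use cases to the lifecycle.
// EmotionAnalyzer and EmotionListener are the app's own classes (shown below).
fun startCamera(context: Context, owner: LifecycleOwner, previewView: PreviewView, listener: EmotionListener) {
    val providerFuture = ProcessCameraProvider.getInstance(context)
    providerFuture.addListener({
        val cameraProvider = providerFuture.get()

        val preview = Preview.Builder().build().also {
            it.setSurfaceProvider(previewView.surfaceProvider)
        }

        // Keep only the latest frame so the analyzer never works on a stale image
        val analysis = ImageAnalysis.Builder()
            .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
            .build()
            .also { it.setAnalyzer(Executors.newSingleThreadExecutor(), EmotionAnalyzer(listener)) }

        cameraProvider.unbindAll()
        cameraProvider.bindToLifecycle(owner, CameraSelector.DEFAULT_FRONT_CAMERA, preview, analysis)
    }, ContextCompat.getMainExecutor(context))
}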
The Implementation
Let's look at the code that powers this "AI DJ."
1. The Analyzer (The "Eye")
The heart of the app is the ImageAnalysis.Analyzer. This runs on a background thread and processes frames. Note how we use ML Kit first to find the face, ensuring we don't feed background noise to our emotion model.
class EmotionAnalyzer(private val listener: EmotionListener) : ImageAnalysis.Analyzer {

    @androidx.annotation.OptIn(androidx.camera.core.ExperimentalGetImage::class)
    override fun analyze(imageProxy: ImageProxy) {
        val mediaImage = imageProxy.image
        if (mediaImage != null) {
            val inputImage = InputImage.fromMediaImage(mediaImage, imageProxy.imageInfo.rotationDegrees)

            // Step 1: Detect Face
            FaceDetection.getClient().process(inputImage)
                .addOnSuccessListener { faces ->
                    if (faces.isNotEmpty()) {
                        // Step 2: Crop Face & Run Inference
                        val emotion = recognizeEmotion(faces[0], mediaImage)
                        listener.onEmotionDetected(emotion)
                    }
                }
                .addOnCompleteListener {
                    imageProxy.close() // Important: Release the frame so CameraX can deliver the next one!
                }
        } else {
            // No backing image: still release the frame, or the analyzer stalls
            imageProxy.close()
        }
    }
}
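A small detail worth noting: FaceDetection.getClient() with no arguments, as in the snippet above, uses ML Kit's default options. Since we only need a bounding box, a fast detector created once and reused for every frame keeps the per-frame cost down. One possible configuration (not necessarily what the repo ships) looks like this:

import com.google.mlkit.vision.face.FaceDetection
import com.google.mlkit.vision.face.FaceDetectorOptions

// Configure the detector once and reuse it for every frame instead of
// creating a new client inside analyze().
private val faceDetectorOptions = FaceDetectorOptions.Builder()
    .setPerformanceMode(FaceDetectorOptions.PERFORMANCE_MODE_FAST)      // favor speed over accuracy
    .setLandmarkMode(FaceDetectorOptions.LANDMARK_MODE_NONE)            // we only need the bounding box
    .setClassificationMode(FaceDetectorOptions.CLASSIFICATION_MODE_NONE)
    .build()

private val faceDetector = FaceDetection.getClient(faceDetectorOptions)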
2. The Model (The "Brain")
Once we have the face, we interpret it. The TensorFlow Lite interpreter takes the preprocessed byte buffer of the image and outputs a probability for each emotion class.
fun recognizeEmotion(face: Face, image: Image): String {
    // 1. Convert the YUV camera image to a Bitmap and crop it to the face bounding box
    val faceBitmap = BitmapUtils.cropToFace(image, face.boundingBox)

    // 2. Resize to the model's input size (e.g., 48x48), convert to grayscale,
    //    and pack the pixels into a buffer the interpreter can read
    val inputBuffer = BitmapUtils.preprocess(faceBitmap)

    // 3. Run Inference
    val output = Array(1) { FloatArray(7) } // 7 emotions (Happy, Sad, Angry, etc.)
    tfliteInterpreter.run(inputBuffer, output)

    // 4. Get the index of the highest confidence
    val maxIndex = output[0].indices.maxByOrNull { output[0][it] } ?: 0
    return emotionLabels[maxIndex] // e.g., "Happy"
}
3. The Mapper (The "DJ")
Finally, a simple controller maps the result to a playlist.
fun playMusicForEmotion(emotion: String) {
    val playlist = when (emotion) {
        "Happy" -> R.raw.upbeat_pop
        "Sad" -> R.raw.melancholy_piano
        "Angry" -> R.raw.heavy_rock
        else -> R.raw.chill_lofi
    }
    // Release whatever was playing before starting the new track
    mediaPlayer?.release()
    mediaPlayer = MediaPlayer.create(context, playlist).apply {
        isLooping = true
        start()
    }
}
Design Decisions & Challenges
1. Lighting is the Enemy
Computer vision models struggle in low light. During testing, faces crossed by shadows were often misclassified as "Angry" or "Sad."
Solution: We implemented a check for average luminosity. If the frame is too dark, the app pauses detection and prompts the user to move to the light.
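The post doesn't show the exact check, but assuming YUV_420_888 frames from CameraX, a cheap version samples the Y (luminance) plane and bails out below a threshold. The helper name and threshold here are illustrative, not from the repo:

import androidx.camera.core.ImageProxy

// Hypothetical helper: estimate average brightness from the Y (luminance) plane
// and skip inference when the scene is too dark.
fun isFrameTooDark(imageProxy: ImageProxy, threshold: Int = 60): Boolean {
    val yPlane = imageProxy.planes[0].buffer // plane 0 holds luminance values
    var sum = 0L
    var samples = 0
    var i = 0
    // Sample every 16th byte; row padding is ignored, which is fine for a rough average
    while (i < yPlane.remaining()) {
        sum += yPlane.get(i).toInt() and 0xFF
        samples++
        i += 16
    }
    val averageLuma = if (samples > 0) sum / samples else 0L
    return averageLuma < threshold
}

analyze() would run this before face detection; when it returns true, the frame is closed immediately and the UI can prompt the user to find better light.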
2. The Jitter Problem
Real-time inference is fast. Your face might register as "Happy" for 10 frames, "Neutral" for 1 frame, and "Happy" again. If we switched the song every time the emotion flickered, the experience would be terrible.
Solution: Smoothing. We use a buffer that stores the last 10 detected emotions and only changes the music if the dominant emotion changes for a sustained period (e.g., 2 seconds).
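The repo's exact smoothing code may differ, but the idea boils down to a small sliding window. EmotionSmoother below is a sketch of that logic with made-up defaults:

// Sketch of the smoothing idea; the window size and threshold are tunable, not taken from the repo.
class EmotionSmoother(private val windowSize: Int = 10) {
    private val window = ArrayDeque<String>()
    private var current: String? = null

    // Feed in each raw prediction; returns the new emotion only when it should trigger a track change.
    fun submit(emotion: String): String? {
        window.addLast(emotion)
        if (window.size > windowSize) window.removeFirst()

        // Find the most frequent emotion in the window
        val dominant = window.groupingBy { it }.eachCount().entries.maxByOrNull { it.value } ?: return null

        // Switch only when a *different* emotion clearly dominates a full window
        val shouldSwitch = window.size == windowSize &&
            dominant.key != current &&
            dominant.value > windowSize / 2
        if (shouldSwitch) {
            current = dominant.key
            return dominant.key
        }
        return null
    }
}

onEmotionDetected() would feed each raw prediction into submit() and only call playMusicForEmotion() when it returns a non-null value; a 10-frame window lines up with the 2-second figure if analysis runs at roughly 5 frames per second.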
Final Thoughts
The EmotionToMusic-App is a great example of how accessible AI has become. You don't need to be a data scientist to build "smart" apps; you just need to know how to wire the components together.
GitHub Repository: https://github.com/sisodiajatin/EmotionToMusic-App