This post is more of a walk through my thought process during this journey than a line-by-line code tutorial. I will try to cover as much as I can for people who are learning AI (with a focus on image processing/computer vision) or who are trying to start a project, just as I was when I started this one.
Here is the link to try it out:
My Final Year Project (FYP) for my degree started just as the COVID-19 pandemic hit. As classes moved online and people started working from home, I realized that my sitting posture got worse the longer I sat in front of my PC. We know what bad posture leads to over the years. This made me wonder: what if there was someone who could constantly monitor my posture and let me know once it starts declining (you get the idea)? At the same time, I was keen to explore and contribute to the healthcare sector.
Before thinking too hard about actually building the system, a few questions usually come to my mind. What value would this system bring? Does it benefit anyone other than me?
Those questions were answered pretty easily for me. An internship with an airline company made me realize that some flights are actually pretty lengthy! Unless you are seated in business class, seats can be rather small, and I personally tend to slouch after a while. Imagine if there was a small light bulb that blinks and reminds you that your holiday could be ruined by backache if you keep slouching or bending your neck forward to read the magazines.
And flights are not the only place this system could be implemented! It could be a reminder system in kindergartens so that good posture is cultivated from an early age, or perhaps a "pre-diagnosis" for you in the waiting lounge of a chiropractor.
Now that we have identified our problem and potential use cases, I think it is safe to proceed with building the system. Of course, it is essential to do some literature review and see what the experts in this subject have explored for this problem.
At the time of writing my FYP, most existing systems used an array of sensors placed on a chair (pressure) or even on the body (gyroscope), leaning more towards IoT. This data is then fed into a system and rule-based matching is applied. However, this method is not scalable, is complex to set up and is definitely not cheap to implement. So if we are going to build this system, it has to overcome those 3 drawbacks of the other systems.
The approach that immediately came to my mind was: using a webcam and a browser, we could record the user and infer information from that image. We can safely say that most laptops have a webcam, and webcams are not that expensive for desktop users. This would overcome the cost and complexity of setting up the system anywhere, at any time. Any user, including newbies, could just open the link in their browser, position their webcam at the correct angle and the system would be set up.
Since we are inferring information from an image, there are a few methods that could be used. For this project, however, we will explore the use of machine learning and computer vision to solve this particular posture problem.
With that draft approach of using machine learning established, data collection is a necessity for training the machine to learn. However, before blindly going around digging for data, the question is: what data are we looking for?
From the literature review done previously, we know that certain factors contribute to bad posture, including but not limited to neck bending, slouching and crossed legs. Based on this information, the data we are looking for is a full-body image of a person (sorry, shy people) sitting.
Typically, most of the tutorials out there would point you to Kaggle or some Google image repository as the data source (don't get me wrong, those are great places to get what you need). Unfortunately, I could not find a single data source of humans sitting on chairs, so manual work needed to be done. Without going into details, my phone camera worked pretty hard for those few weeks.
A Google Form was created and sent to some of my friends with the guideline above. Of course, some credit goes to gaming chair advertisements on the internet and illustrations of good/bad postures. I ended up with around 90 images, at which point the pandemic hit and I could not go out for more. 90 images is actually a small number; however, certain methods such as image augmentation (resize, crop, rotate etc.) can help increase the dataset size and variability. Luckily for me, my dataset was not heavily biased towards any label, meaning the labels were distributed fairly evenly across the dataset (e.g. 40 images with good posture and 50 with bad posture).
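To make the idea of augmentation a bit more concrete, here is a minimal sketch of two simple transforms (horizontal flip and crop). It is not the code I used for the project: the "image" here is just a small list of pixel rows so the example runs without any library, and the function names are made up for illustration. In practice you would reach for a library such as OpenCV or Pillow, which also handles resizing and rotation by small angles (those need interpolation, so they are left out here).

```python
# Illustrative sketch only: a tiny grayscale "image" stored as a list of rows.
# All function names here are hypothetical, not taken from the project.

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def crop(image, top, left, height, width):
    """Keep a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image):
    """Return the original image plus two simple variants of it."""
    rows, cols = len(image), len(image[0])
    return [
        image,
        flip_horizontal(image),
        # Drop the last row and column as a very crude crop.
        crop(image, 0, 0, rows - 1, cols - 1),
    ]


sample = [
    [1, 2, 3],
    [4, 5, 6],
]
variants = augment(sample)
print(len(variants), "images generated from 1 original")
```

One thing to keep in mind: not every transform makes sense for every problem. A mirrored sitting person is still a valid sitting person, but a heavily rotated one may no longer look like anything the system would see through a webcam.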
This section would be the interesting one for most beginners! So many acronyms, what are they even? CNN? RCNN? SSD?
To explain those acronyms, I think Rohith explained it well here. But I will still briefly talk about them as we go along. OpenCV was used for loading, processing and drawing bounding boxes on the images.
CNN detection approach
Being totally new and clueless, I started with the most common one that you hear about, the Convolutional Neural Network (CNN). I opened a TensorFlow image classification tutorial here, copy-pasted the code and replaced the dataset with my sitting posture images.
I separated my dataset for this approach by categorizing the images into "good" and "bad" posture. Image enhancement (blur, cropping to the ROI) was applied to the images to remove background noise. I ran it, randomly changed the number of epochs, and added and removed layers based on some recommendations from Google searches. However, I then realized that the CNN returns results exactly the way I trained it: either "good" or "bad". This is not what we want! We want a system that highlights what exactly is wrong with the posture. So to be more accurate, the problem we are solving is not a binary classification problem (good or bad).
Multi-label image classification
As the title states, multi-label. The first thought would be: cool, now from an image we have attributes/labels such as neck, legs and back, with values 0 (good posture) and 1 (bad posture). So I created a CSV file, with the first column being the image file name and the remaining columns being the attributes and their respective values. Same thing again: I copied some code from a Google search on multi-label classification, ran it and got the results. This time, the results came out pretty jumbled and fairly inaccurate. The issue is that we may have a label called neck, but we must always remember that the machine has no idea what a neck is, and it may associate that label with something else in the image that happens to persist across the entire dataset (perhaps the type of chair, or even the shape of the person). We have to be more specific in this case.
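For anyone wondering what such a CSV could look like, here is a small sketch that parses one into per-image label vectors. The column names (filename, neck, back, legs) and the file names are assumptions for illustration; my actual file may have differed.

```python
import csv
import io

# Hypothetical CSV contents: one row per image, 0 = good posture, 1 = bad posture.
SAMPLE_CSV = """filename,neck,back,legs
img_001.jpg,0,1,0
img_002.jpg,1,1,1
"""


def load_labels(csv_text):
    """Return the attribute names and a mapping of file name -> list of 0/1 values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column after the first one is treated as an attribute.
    attributes = reader.fieldnames[1:]
    labels = {
        row["filename"]: [int(row[name]) for name in attributes]
        for row in reader
    }
    return attributes, labels


attributes, labels = load_labels(SAMPLE_CSV)
print(attributes)
print(labels["img_001.jpg"])
```

Each image then gets one vector with an entry per attribute, which is what a multi-label classifier is trained to predict. Notice that nothing in this format tells the model where the neck is in the image, which is exactly the weakness described above.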
RCNN, FRCNN, SSD then YOLO
As we mature in our understanding of how this works, Region-based CNN (R-CNN), Fast R-CNN (F-RCNN) or Single Shot Detectors (SSD) will appear sooner or later. It became clear that object detection might be an approach for this problem. I have created my own simplified version of a manual R-CNN for testing and will share it in the next few posts!
But some might ask: those tutorials usually detect objects (apple, human, face, legs), so how can we get the model to know whether the legs are placed in a good posture? The most obvious answer could be: detect the legs, crop that part and send it to a CNN to classify whether it is a good posture. However, in my opinion this is computationally heavy and not suitable for a real-time system, and furthermore it is a lot of work to train a CNN for each attribute.
Here comes what I actually did out of being lazy. Instead of labeling the parts as neck, back, legs and buttocks, I labelled them as:
neck_good, neck_bad, backbone_good, backbone_bad, buttocks_good, buttocks_bad, legs_crossed, legs_straight
You get the idea. Images with people having their necks upright would be labelled as neck_good and vice versa.
LabelImg, made by tzutalin, was used for labeling the images. You just need to draw a rectangle around the region and then assign the respective label.
Next is training a model. While there are a few tutorials on building an R-CNN, I came across YOLO (v3 at the time of doing my FYP). Rather than building something from scratch, I could utilize what was already there and apply my use case to it. This method is known as transfer learning. Retraining the YOLO model took a really long time on my PC as it has many complex layers. Luckily, Google Colab is free and I do recommend that people check it out! (I will share my Colab project once I have sorted it out.)
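If you are curious what a drawn rectangle turns into, here is a rough sketch. LabelImg can save annotations either as Pascal VOC XML or as YOLO text files; I am assuming the YOLO text format here, where each line is a class id followed by the box center, width and height, all normalized by the image size. The class order and the helper name to_yolo_line are assumptions for illustration.

```python
# Class order is an assumption; in practice it must match the class list
# used when training the detector.
CLASS_NAMES = [
    "neck_good", "neck_bad",
    "backbone_good", "backbone_bad",
    "buttocks_good", "buttocks_bad",
    "legs_crossed", "legs_straight",
]


def to_yolo_line(label, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) into one YOLO label line."""
    x_min, y_min, x_max, y_max = box
    class_id = CLASS_NAMES.index(label)
    # YOLO stores the box center and size as fractions of the image dimensions.
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A made-up neck region on a 400x500 image.
print(to_yolo_line("neck_bad", (100, 50, 300, 250), 400, 500))
```

Because the posture quality is baked into the class name, the detector answers both "where is the neck" and "is it in a good position" in a single pass.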
After setting it up, I let it run and automatically saved snapshots every 1000 iterations to Google Drive.
YOLO performed excellently. I quickly built a front-end web interface using VueJS for posting images to a Python web server. That Python web server then passes the image to YOLO and returns the results in JSON format back to the front end.
Here is a snapshot of the result:
The YOLO results were highly accurate, and bounding boxes are colored red for bad and green for good. In cases where there are no humans in the image, the system returns a result saying that no human was detected.
To simulate what was intended, a "timeline" feature was created to capture an image at a set interval x, and a simple graph is then plotted to show posture degradation over time.
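To give a feel for the JSON that travels back to the front end, here is a minimal sketch of how detections could be turned into a response with a color per box. The field names (human_detected, results, color) and the shape of a detection are my assumptions for this example, not the exact schema of the project.

```python
import json

# Labels treated as "bad" for coloring purposes; everything else is "good".
BAD_LABELS = {"neck_bad", "backbone_bad", "buttocks_bad", "legs_crossed"}


def build_response(detections):
    """Turn a list of detections into a JSON string for the front end.

    Each detection is assumed to be a dict with 'label', 'confidence' and 'box'.
    """
    if not detections:
        # Nothing detected: tell the front end that no human was found.
        return json.dumps({"human_detected": False, "results": []})

    results = []
    for detection in detections:
        color = "red" if detection["label"] in BAD_LABELS else "green"
        results.append({**detection, "color": color})
    return json.dumps({"human_detected": True, "results": results})


example = [
    {"label": "neck_bad", "confidence": 0.91, "box": [120, 40, 220, 130]},
    {"label": "legs_straight", "confidence": 0.87, "box": [90, 300, 260, 480]},
]
print(build_response(example))
print(build_response([]))
```

Keeping the color decision on the server means the front end only has to draw what it receives.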
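One simple way to turn those periodic captures into something that can be plotted is to compute, for each capture, the fraction of detected parts that are in a bad position. This is only a sketch of the idea under my own assumptions (the scoring rule and the function names are made up here), not the exact calculation behind the graph.

```python
# Labels treated as "bad" posture for scoring purposes.
BAD_LABELS = {"neck_bad", "backbone_bad", "buttocks_bad", "legs_crossed"}


def bad_ratio(labels):
    """Fraction of detected labels in one capture that indicate bad posture."""
    if not labels:
        return None  # nothing detected, so there is nothing to score
    bad_count = sum(1 for label in labels if label in BAD_LABELS)
    return bad_count / len(labels)


def build_timeline(captures):
    """captures: list of (seconds_since_start, labels) -> list of (time, ratio) points."""
    return [(seconds, bad_ratio(labels)) for seconds, labels in captures]


captures = [
    (0, ["neck_good", "backbone_good"]),
    (30, ["neck_bad", "backbone_good"]),
    (60, ["neck_bad", "backbone_bad"]),
]
for seconds, ratio in build_timeline(captures):
    print(seconds, ratio)
```

A ratio that creeps upwards over the session is the "posture degradation" the graph is meant to show.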
User privacy is definitely the highest priority here. Remember that we are actually processing images of the user from their webcam, so the system is designed not to store any data. Images that are sent for processing are processed and immediately returned to the user. No information is written to disk. Building the prediction system in a Docker container and deploying it on Google Cloud Run also allows the instance to be destroyed once it is no longer being used. Feel free to view the source code above.
Even though the goal of the project has been achieved (and my FYP is completed), there is a lot that can still be worked on: bringing the model offline, building a better model of what a good posture is, and much more. I hope that this idea can be scaled for use in production, especially in the healthcare sector, and I would love to be involved in it! So please reach out to me if you intend to work on this project or anything similar!