<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maksym Tatariants</title>
    <description>The latest articles on DEV Community by Maksym Tatariants (@mtatariants).</description>
    <link>https://dev.to/mtatariants</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F593065%2F8315d1c8-dff5-4a4a-a171-9abccc644eae.jpg</url>
      <title>DEV Community: Maksym Tatariants</title>
      <link>https://dev.to/mtatariants</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mtatariants"/>
    <language>en</language>
    <item>
      <title>AR &amp; AI Technologies For Virtual Fitting Room Development</title>
      <dc:creator>Maksym Tatariants</dc:creator>
      <pubDate>Sun, 21 Mar 2021 16:11:17 +0000</pubDate>
      <link>https://dev.to/mobidev/ar-ai-technologies-for-virtual-fitting-room-development-2gbf</link>
      <guid>https://dev.to/mobidev/ar-ai-technologies-for-virtual-fitting-room-development-2gbf</guid>
      <description>&lt;p&gt;I hate shopping in brick and mortar stores. However, my interest in virtual shopping is not limited to the buyer experience only. With the MobiDev DataScience department, I’ve gained experience in working on AI technologies for virtual fitting. The goal of this article is to describe how these systems work from the inside.&lt;/p&gt;

&lt;h2&gt;How Virtual Fitting Technology Works&lt;/h2&gt;

&lt;p&gt;A few years ago, the “Try before you buy” strategy was an efficient customer engagement method in outfit stores. Now, this strategy lives on in the form of virtual fitting rooms. Fortune Business Insights &lt;a href="https://www.fortunebusinessinsights.com/industry-reports/virtual-fitting-room-vfr-market-100322" rel="noopener noreferrer"&gt;projects&lt;/a&gt; that the virtual fitting room market will reach USD 10.00 billion by 2027.&lt;/p&gt;

&lt;p&gt;To better understand the logic of virtual fitting room technology, let’s review the following example. Some time ago, we worked on an augmented reality (AR) footwear fitting room project. The fitting room works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The input video is split into frames and processed with a deep learning model that estimates the positions of a set of specific leg and foot keypoints.
Read the related article: &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach" rel="noopener noreferrer"&gt;3D Human Pose Estimation in Fitness Coach Apps&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A 3D footwear model is placed according to the detected keypoints so that its position and orientation look natural to the user.&lt;/li&gt;
&lt;li&gt;The 3D footwear model is rendered so that each frame displays realistic textures and lighting.&lt;/li&gt;
&lt;/ol&gt;
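&lt;p&gt;Step 2 of the list above can be sketched in code. The following is a minimal, hypothetical Python sketch (not the actual app code) that derives the scale, rotation, and anchor point for a footwear model from just two detected keypoints, heel and toe:&lt;/p&gt;

```python
import math

def anchor_transform(heel, toe, model_length=1.0):
    """Compute the similarity transform (scale, rotation, translation)
    that places a footwear model, whose own axis runs from (0, 0) to
    (model_length, 0), along the foot detected in the frame."""
    dx, dy = toe[0] - heel[0], toe[1] - heel[1]
    foot_len = math.hypot(dx, dy)        # foot length in pixels
    scale = foot_len / model_length      # enlarge the model to foot size
    angle = math.atan2(dy, dx)           # foot orientation in radians
    return scale, angle, heel            # the heel is the anchor point

# A foot pointing "up" in image coordinates (y grows downwards):
scale, angle, origin = anchor_transform((100, 200), (100, 100))
```

A production system would do this per frame and in 3D, but the principle is the same: the detected keypoints fully determine where and how the 3D model is drawn.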

&lt;p&gt;&lt;a href="https://mobidev.biz/wp-content/uploads/2020/09/ar-based-virtual-try-on-technology.gif" rel="noopener noreferrer"&gt;Utilization of ARKit for 3D human body pose estimation and 3D model rendering&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When working with &lt;a href="https://mobidev.biz/blog/arkit-guide-augmented-reality-app-development-ios" rel="noopener noreferrer"&gt;ARKit&lt;/a&gt; (Apple’s augmented reality framework), we discovered that it has rendering limitations. As you can see from the video above, the tracking accuracy is too low to use it for footwear positioning. The likely cause is a trade-off that favors inference speed over tracking accuracy, a choice that matters for apps working in real time.&lt;/p&gt;

&lt;p&gt;One more issue was the poor identification of body parts by the ARKit algorithm. Since the algorithm is designed to detect the whole body, it returns no keypoints when the processed image contains only part of the body. This is exactly the case for a footwear fitting room, where the algorithm has to process only a person’s legs.&lt;/p&gt;

&lt;p&gt;The conclusion was that virtual fitting room apps may require functionality beyond the standard AR libraries. Thus, it’s recommended to involve data scientists to develop a custom pose estimation model that detects keypoints on only one or two feet in the frame and operates in real time.&lt;/p&gt;

&lt;h2&gt;Virtual Fitting Room Solutions&lt;/h2&gt;

&lt;p&gt;The virtual fitting room technology market provides offerings for accessories, watches, glasses, hats, clothes, and others. Let’s review how some of these solutions work under the hood.&lt;/p&gt;

&lt;h3&gt;WATCHES&lt;/h3&gt;

&lt;p&gt;A good example of virtual watch try-on is the &lt;a href="https://apps.apple.com/us/app/ar-watches-augmented-reality/id1435312889" rel="noopener noreferrer"&gt;AR-Watches app&lt;/a&gt;, which lets users try on various watches. The solution is based on &lt;a href="https://en.wikipedia.org/wiki/ARTag" rel="noopener noreferrer"&gt;ARTag technology&lt;/a&gt;: specific markers are printed on a band that the user wears on the wrist in place of a watch. The computer vision algorithm processes the markers visible in the frame and identifies the camera’s position relative to them. To render the 3D object correctly, the virtual camera is then placed at the same location.&lt;/p&gt;
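&lt;p&gt;The geometry behind marker-based positioning can be illustrated with the pinhole camera model. This is a simplified sketch only (the real ARTag pipeline solves a full 6-DoF pose, not just distance), and all numbers are made up for illustration:&lt;/p&gt;

```python
def marker_distance(focal_px, marker_size_m, marker_px):
    """Pinhole-camera estimate: an object of real size S that appears
    s pixels wide lies at distance d = f * S / s from the camera,
    where f is the focal length expressed in pixels."""
    return focal_px * marker_size_m / marker_px

# A 4 cm ARTag marker spanning 80 px, seen by a camera
# with a 1000 px focal length:
d = marker_distance(1000, 0.04, 80)
```

Knowing the distance and the marker corners’ positions, the renderer can place the virtual camera so that the watch model lines up with the band.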

&lt;p&gt;Overall, the technology has its limits (for instance, not everybody has a printer at hand to print out the ARTag band). But if it matches the business use case, building a product of production-ready quality wouldn’t be that difficult. Probably the most important part would be creating proper 3D models to use.&lt;/p&gt;

&lt;p&gt;3D model rendering of a watch using the ARTag technology&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/yLnGjabCDD0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;SHOES&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://play.google.com/store/apps/details?id=by.wanna.apps.wsneakers&amp;amp;hl=en" rel="noopener noreferrer"&gt;Wanna Kicks&lt;/a&gt; and &lt;a href="https://apps.apple.com/us/app/sneakerkit/id1463772901" rel="noopener noreferrer"&gt;SneakerKit&lt;/a&gt; apps are a good demonstration of how AR and deep learning technologies might be applied for footwear.&lt;/p&gt;

&lt;p&gt;Virtual shoe try-on, Wanna Kicks app&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/02e20PkYeXQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Technically, such a solution utilizes a foot pose estimation model based on deep learning. The technology can be viewed as a special case of the widespread full-body &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach" rel="noopener noreferrer"&gt;3D pose estimation&lt;/a&gt; models, which estimate keypoint positions either directly in 3D or by lifting detected 2D keypoints into 3D coordinates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwm0pb236ggud860wwlw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwm0pb236ggud860wwlw.gif" alt="3d-foot-pose-estimation-virtual-try-on"&gt;&lt;/a&gt;&lt;br&gt;
3D foot pose estimation &lt;a href="https://labs.laan.com/blog/leveraging-photogrammetry-to-increase-data-annotation-efficiency-in-ML.html" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the 3D keypoints of the feet are detected, they can be used to create a parametric 3D model of a human foot, and to position and scale a footwear 3D model according to the geometric properties of that parametric model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtr3er6si4dt5emnaohl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtr3er6si4dt5emnaohl.gif" alt="3d-model-human-foot-virtual-try-on"&gt;&lt;/a&gt;&lt;br&gt;
Positioning of a 3D model of footwear on top of a detected parametric foot model &lt;a href="https://www.vyking.io/video/Vyking_SneakerStudio.mp4" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;
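&lt;p&gt;The positioning and scaling step can be sketched as follows. This is a hypothetical simplification (rotation is omitted to keep it short) that scales and translates a shoe mesh so its length matches the distance between the detected 3D heel and toe keypoints:&lt;/p&gt;

```python
import math

def fit_shoe_to_foot(shoe_vertices, heel3d, toe3d, shoe_length=1.0):
    """Scale a shoe mesh (XYZ vertices with the heel at the origin and
    length shoe_length along +X) to the measured foot length, then
    translate it to the detected heel keypoint."""
    foot_len = math.dist(heel3d, toe3d)  # foot length from 3D keypoints
    s = foot_len / shoe_length           # uniform scale factor
    return [tuple(s * v[i] + heel3d[i] for i in range(3))
            for v in shoe_vertices]

mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # toy two-vertex "shoe"
fitted = fit_shoe_to_foot(mesh, (2, 0, 0), (5, 0, 0))
```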

&lt;p&gt;Compared to full-body and face pose estimation models, foot pose estimation still faces certain challenges. The main issue is the lack of the 3D annotation data required for model training.&lt;/p&gt;

&lt;p&gt;The practical ways around this problem are to use &lt;a href="https://www.di.ens.fr/willow/research/surreal/" rel="noopener noreferrer"&gt;synthetic data&lt;/a&gt;, which means rendering photorealistic 3D models of human feet with keypoint annotations and training a model on them; or to use photogrammetry, which reconstructs a 3D scene from multiple 2D views to &lt;a href="https://labs.laan.com/blog/leveraging-photogrammetry-to-increase-data-annotation-efficiency-in-ML.html" rel="noopener noreferrer"&gt;decrease the amount of labeling needed&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This kind of solution is considerably more complicated. To enter the market with a ready-to-use product, you need to collect a large enough foot keypoint dataset (using synthetic data, photogrammetry, or a combination of both), train a customized pose estimation model that combines sufficient accuracy with real-time inference speed, test its robustness in various conditions, and create a foot model. We consider it a medium-complexity project in terms of technologies.&lt;/p&gt;
&lt;h3&gt;GLASSES&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.fittingbox.com/en/" rel="noopener noreferrer"&gt;FittingBox&lt;/a&gt; and &lt;a href="https://ditto.com/" rel="noopener noreferrer"&gt;Ditto&lt;/a&gt; companies considered AR technology for the virtual glasses try-on. The user should choose a glasses model from a virtual catalog and it is put on his/her eyes.&lt;/p&gt;

&lt;p&gt;Virtual glasses try-on and lenses simulation&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/p0dGmaiQKAg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This solution is based on a deep learning-powered pose estimation approach used for facial landmark detection, where the common annotation format includes 68 2D/3D facial landmarks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkfbig13z603oi6fpwzh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkfbig13z603oi6fpwzh.gif" alt="face-pose-estimation"&gt;&lt;/a&gt;&lt;br&gt;
Example of facial landmark detection in video. Note that the model in the video detects more than 68 landmarks &lt;a href="https://firebase.googleblog.com/2018/11/ml-kit-adds-face-contours-to-create.html" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such an annotation format allows differentiating the face contour, nose, eyes, eyebrows, and lips with a sufficient accuracy level. Training data and pre-trained models for face landmark estimation can be taken from open-source projects such as &lt;a href="https://github.com/1adrianb/face-alignment" rel="noopener noreferrer"&gt;Face Alignment&lt;/a&gt;, which provides face pose estimation functionality out of the box.&lt;/p&gt;
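&lt;p&gt;As an illustration of how the 68-landmark format is used, here is a small Python sketch (assuming the common iBUG ordering, where points 36–41 outline the left eye and 42–47 the right eye) that derives the anchor point for a glasses model; the landmark values are synthetic:&lt;/p&gt;

```python
def glasses_anchor(landmarks):
    """Given 68 (x, y) facial landmarks in the common iBUG ordering
    (points 36-41 outline the left eye, 42-47 the right eye), return
    the two eye centres and the midpoint where the nose bridge of a
    glasses model can be anchored."""
    def centre(pts):
        xs, ys = zip(*pts)
        return (sum(xs) / len(xs), sum(ys) / len(ys))
    left = centre(landmarks[36:42])
    right = centre(landmarks[42:48])
    bridge = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)
    return left, right, bridge

# Synthetic landmarks: each eye cluster collapsed to a single point.
pts = [(0, 0)] * 36 + [(10, 20)] * 6 + [(30, 20)] * 6 + [(0, 0)] * 20
left, right, bridge = glasses_anchor(pts)
```

The distance between the two eye centres also gives the scale for the frames, which is why this annotation format is sufficient for glasses try-on.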

&lt;p&gt;In terms of technologies, this kind of solution is not that complicated, especially if a pre-trained model is used as a basis for the &lt;a href="https://mobidev.biz/blog/custom-face-detection-recognition-software-development" rel="noopener noreferrer"&gt;face recognition task&lt;/a&gt;. But it’s important to consider that low-quality cameras and poor lighting conditions can be limiting factors.&lt;/p&gt;
&lt;h3&gt;SURGICAL MASKS&lt;/h3&gt;

&lt;p&gt;Amidst the COVID-19 pandemic, &lt;a href="https://zap.works/" rel="noopener noreferrer"&gt;ZapWorks&lt;/a&gt; launched an AR-based educational &lt;a href="https://viewtoo.arweb.app/?zid=z/bEPn1c&amp;amp;toolbar=0" rel="noopener noreferrer"&gt;app&lt;/a&gt; that instructs users on how to wear surgical masks properly. Technically, this app is also based on 3D facial landmark detection. As in the glasses try-on apps, the method extracts information about facial features and then renders the mask on top.&lt;/p&gt;

&lt;p&gt;AR for mask wear guide&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/HvTYcEQdrcc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;HATS&lt;/h3&gt;

&lt;p&gt;Since facial landmark detection models work well, hats are another frequently simulated AR item. All that is required to render a hat correctly on a person’s head is the 3D coordinates of a few keypoints indicating the temples and the center of the forehead. Virtual hat try-on apps have already been launched by &lt;a href="https://www.quytech.com/" rel="noopener noreferrer"&gt;QUYTECH&lt;/a&gt;, &lt;a href="https://www.banuba.com/" rel="noopener noreferrer"&gt;Banuba&lt;/a&gt;, and &lt;a href="https://www.vertebrae.com/" rel="noopener noreferrer"&gt;Vertebrae&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Baseball cap try-on&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/RAIm7blzkD0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;CLOTHES&lt;/h3&gt;

&lt;p&gt;Compared to shoes, masks, glasses, and watches, virtual try-on of 3D clothes still remains a challenge. The reason is that clothes deform as they take the shape of a person’s body. Thus, for a proper AR experience, a deep learning model should identify not only the basic keypoints at the human body’s joints but also the body shape in 3D.&lt;/p&gt;

&lt;p&gt;Looking at &lt;a href="https://github.com/facebookresearch/Densepose" rel="noopener noreferrer"&gt;DensePose&lt;/a&gt;, one of the most recent deep learning models, which maps the pixels of an RGB image of a person onto the 3D surface of the human body, we find that it’s still not quite suitable for augmented reality. DensePose’s inference speed is inadequate for real-time apps, and its body mesh detections are not accurate enough for fitting 3D clothing items. Improving the results would require collecting more annotated data, which is a time- and resource-consuming task.&lt;/p&gt;

&lt;p&gt;The alternative is to use 2D clothing items and 2D silhouettes of people. That’s what the &lt;a href="https://zeekit.me/" rel="noopener noreferrer"&gt;Zeekit&lt;/a&gt; company does, giving users the ability to apply a number of clothing types (dresses, pants, shirts, etc.) to their photo.&lt;/p&gt;

&lt;p&gt;2D clothing try-on, Zeekit&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/IXIbeBQwgDA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Strictly speaking, transferring 2D clothing images cannot be considered augmented reality, since the “reality” aspect implies real-time operation. Still, it can provide an unusual and immersive user experience. The underlying technologies comprise &lt;a href="https://towardsdatascience.com/generative-adversarial-networks-explained-34472718707a" rel="noopener noreferrer"&gt;Generative Adversarial Networks&lt;/a&gt;, &lt;a href="https://www.kdnuggets.com/2020/08/3d-human-pose-estimation-experiments-analysis.html" rel="noopener noreferrer"&gt;Human Pose Estimation&lt;/a&gt;, and &lt;a href="http://sysu-hcp.net/lip/index.php" rel="noopener noreferrer"&gt;Human Parsing&lt;/a&gt; models. The 2D clothes transferring algorithm may look as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the areas in the image corresponding to individual body parts&lt;/li&gt;
&lt;li&gt;Detect the position of each identified body part&lt;/li&gt;
&lt;li&gt;Produce a warped image of the transferred clothing item&lt;/li&gt;
&lt;li&gt;Apply the warped image to the image of the person with minimal artifacts&lt;/li&gt;
&lt;/ol&gt;
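&lt;p&gt;The last step of the list above boils down to mask-based compositing. A toy Python sketch on tiny grayscale “images” (nested lists of pixel values, with a binary clothing mask); real pipelines do the same per channel with soft alpha masks:&lt;/p&gt;

```python
def composite(person, cloth, mask):
    """Paste the warped clothing image onto the person image wherever
    the clothing mask is set; keep the person's pixels elsewhere."""
    return [
        [cloth[y][x] if mask[y][x] else person[y][x]
         for x in range(len(person[0]))]
        for y in range(len(person))
    ]

person = [[1, 1], [1, 1]]   # 2x2 "photo" of a person
cloth  = [[9, 9], [9, 9]]   # warped clothing image
mask   = [[0, 1], [1, 0]]   # where the clothing should appear
out = composite(person, cloth, mask)
```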

&lt;h3&gt;OUR EXPERIMENTS WITH 2D CLOTH TRANSFERRING&lt;/h3&gt;

&lt;p&gt;Since there are no ready-made pre-trained models for a virtual dressing room, we researched this field by experimenting with the &lt;a href="https://arxiv.org/abs/2003.05863" rel="noopener noreferrer"&gt;ACGPN model&lt;/a&gt;. The idea was to explore the model’s outputs in practice for 2D cloth transferring using various approaches.&lt;/p&gt;

&lt;p&gt;The model was applied to people’s images in constrained (samples from the training dataset, VITON) and unconstrained (any environment) conditions. In addition, we tested the limits of the model’s capabilities by not only running it on custom persons’ images but also using custom clothing images that were quite different from the training data.&lt;/p&gt;

&lt;p&gt;Here are examples of results we received during the research:&lt;/p&gt;

&lt;p&gt;1) Replication of results described in the “Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content” research paper, with the original data and our preprocessing models:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pqm5ew6xtvvz5bvwaae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pqm5ew6xtvvz5bvwaae.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Successful (A1-A3) and unsuccessful (B1-B3) replacement of clothing &lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;B1 – poor inpainting&lt;/li&gt;
&lt;li&gt;B2 – new clothes overlapping&lt;/li&gt;
&lt;li&gt;B3 – edge defects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Application of custom clothes to default person images:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xkk9i3wty1zy8zy7ye6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xkk9i3wty1zy8zy7ye6.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Clothing replacement using custom clothes&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – no defects &lt;/li&gt;
&lt;li&gt;Row B – some defects to be moderated &lt;/li&gt;
&lt;li&gt;Row C – critical defects
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3) Application of default clothes to the custom person images:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro5lt3hplgztlq5g2knb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro5lt3hplgztlq5g2knb.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Outputs of clothing replacement on images with an unconstrained environment&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – edge defects (minor)&lt;/li&gt;
&lt;li&gt;Row B – masking errors (moderate)&lt;/li&gt;
&lt;li&gt;Row C – inpainting and masking errors (critical) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4) Application of custom clothes to the custom person images:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh8v4y4129g9eeanu65x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh8v4y4129g9eeanu65x.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Clothing replacement with the unconstrained environment and custom clothing images&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – best results obtained from the model&lt;/li&gt;
&lt;li&gt;Row B – many defects to be moderated&lt;/li&gt;
&lt;li&gt;Row C – most distorted results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When analyzing the outputs, we discovered that virtual clothes try-on still has certain limitations. The main one is that the training data must contain paired images: the target clothing item and people wearing that item. In a real-world business scenario, collecting such data may be challenging. The other takeaways from the research are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ACGPN model produces rather good results on images of people from the training dataset, even when custom clothing items are applied.&lt;/li&gt;
&lt;li&gt;The model is unstable when it comes to processing the images of people captured in varying lighting, other environmental conditions, and unusual poses.&lt;/li&gt;
&lt;li&gt;The technology for creating virtual dressing room systems for transferring 2D clothing images onto the image of the target person in the wild is not yet ready for commercial applications. However, if the conditions are static, the expected results can be much better.&lt;/li&gt;
&lt;li&gt;The main limiting factor that holds back the development of better models is the lack of diverse datasets with people captured in outdoor conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In conclusion, I’d say that current virtual fitting rooms work well for items related to separate body parts like the head, face, feet, and arms. But for items that require the whole body to be detected, estimated, and modified, virtual fitting is still in its infancy. However, &lt;a href="https://mobidev.biz/blog/future-ai-machine-learning-trends-to-impact-business" rel="noopener noreferrer"&gt;AI evolves&lt;/a&gt; in leaps and bounds, and the best strategy is to stay tuned and keep trying.&lt;/p&gt;

&lt;p&gt;Written by Maksym Tatariants, Data Science Engineer at MobiDev.&lt;/p&gt;

&lt;p&gt;Full article originally published at &lt;a href="https://mobidev.biz/blog/ar-ai-technologies-virtual-fitting-room-development" rel="noopener noreferrer"&gt;https://mobidev.biz&lt;/a&gt;. It is based on MobiDev technology research and experience providing software development services.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Human Pose Estimation Technology 2021 Guide</title>
      <dc:creator>Maksym Tatariants</dc:creator>
      <pubDate>Fri, 12 Mar 2021 12:00:51 +0000</pubDate>
      <link>https://dev.to/mobidev/human-pose-estimation-technology-2021-guide-5ejd</link>
      <guid>https://dev.to/mobidev/human-pose-estimation-technology-2021-guide-5ejd</guid>
<description>&lt;p&gt;Is it possible for a technology solution to replace fitness coaches? Well, someone still has to motivate you by saying, “Come on, even my grandma can do better!” But from a technology point of view, this high-level requirement led us to 3D human pose estimation technology.&lt;/p&gt;

&lt;p&gt;In this article, I will describe our own experience of how 3D human pose estimation can be developed and implemented for the AI fitness coach solution.&lt;/p&gt;

&lt;h2&gt;What is Human Pose Estimation?&lt;/h2&gt;

&lt;p&gt;Human pose estimation is a computer vision-based technology that detects and analyzes human posture. Its main component is the modeling of the human body. The three most widely used types of human body models are skeleton-based, contour-based, and volume-based.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skeleton-based model&lt;/strong&gt; consists of a set of joints (keypoints) like ankles, knees, shoulders, elbows, wrists, and limb orientations comprising the skeletal structure of a human body. This model is used both in 2D and 3D human pose estimation techniques because of its flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contour-based model&lt;/strong&gt; consists of the contour and rough width of the body torso and limbs, where body parts are presented with boundaries and rectangles of a person’s silhouette. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume-based model&lt;/strong&gt; consists of 3D human body shapes and poses represented with geometric meshes and shapes, normally captured with 3D scans.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u9b18-KK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kv1wxxhvmp7f44d7b3o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u9b18-KK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kv1wxxhvmp7f44d7b3o7.png" alt="image" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/pdf/2006.01423.pdf"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, I am talking about &lt;strong&gt;skeleton-based models&lt;/strong&gt;, which may be detected from a 2D or 3D perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2D pose estimation&lt;/strong&gt; is based on the detection and analysis of X, Y coordinates of human body joints from an RGB image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3D pose estimation&lt;/strong&gt; is based on the detection and analysis of X, Y, Z coordinates of human body joints from an RGB image. &lt;/p&gt;

&lt;p&gt;When speaking about fitness applications involving human pose estimation, it’s better to use 3D estimation, since it analyzes human poses during physical activities more accurately.&lt;/p&gt;

&lt;p&gt;Talking about AI fitness coach apps, the common flow looks as follows: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture user’s movements while doing an exercise&lt;/li&gt;
&lt;li&gt;Analyze the correctness of an exercise performance &lt;/li&gt;
&lt;li&gt;Display mistakes to the user interface&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;How 3D Human Pose Estimation Works&lt;/h2&gt;

&lt;p&gt;Here is a visual example of how 3D human pose estimation technology detects keypoints on a human body:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HhSAHE7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mm6q9dbb82abg8i5sau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HhSAHE7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mm6q9dbb82abg8i5sau.png" alt="image" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process usually involves extracting the joints of a human body and then analyzing the pose with deep learning algorithms. If the human pose estimation system uses video as a data source, keypoints (joint locations) are detected from a sequence of frames rather than a single picture. This yields higher accuracy, since the system analyzes a person’s actual movement, not a static position.&lt;/p&gt;

&lt;p&gt;There are several ways to develop a 3D human pose estimation system for fitness. The most practical one is to train a deep learning model to extract 2D or 3D keypoints from the given images/frames.&lt;/p&gt;

&lt;p&gt;Using video streams from several cameras with different views of the same person doing exercises would grant better accuracy. But multi-camera setups are often unavailable, and analyzing several video streams takes more compute power.&lt;/p&gt;

&lt;p&gt;For our research, we used a single video source for the analysis and applied convolutional neural networks (CNNs) with dilated temporal convolutions (see the animation below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ORiMdhZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cst6o7n08ten5f2upckj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ORiMdhZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cst6o7n08ten5f2upckj.gif" alt="Alt Text" width="880" height="332"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/facebookresearch/VideoPose3D/blob/master/images/convolutions_anim.gif"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After analyzing the existing models, we concluded that &lt;a href="https://github.com/facebookresearch/VideoPose3D"&gt;VideoPose3D&lt;/a&gt; is the best fit for fitness app purposes. As input, it takes a set of detected 2D keypoints, produced by a 2D detector pre-trained on the COCO 2017 dataset. To predict the current position of each joint accurately, it processes visual data from several frames captured at different points in time.&lt;/p&gt;

&lt;h2&gt;How to Use Human Pose Estimation in AI Fitness Coach App&lt;/h2&gt;

&lt;p&gt;Digitalization has not spared the fitness industry. According to the Research and Markets &lt;a href="https://www.businesswire.com/news/home/20170724006151/en/27.4-Billion-Growth-Opportunities-Global-Digital-Fitness?utm_campaign=embodied-ai&amp;amp;utm_medium=email&amp;amp;utm_source=Revue%20newsletter"&gt;report&lt;/a&gt;, the digital fitness market size is expected to reach $27.4 billion by 2022.&lt;/p&gt;

&lt;p&gt;3D human pose estimation is a relatively new but rapidly evolving technology in digital fitness. Based on our analysis and practical experience with 3D human pose estimation systems, we have arrived at our own vision of how it can be implemented. Let’s review how such a system may be built so that it automatically analyzes movements in videos of users performing physical exercises.&lt;/p&gt;

&lt;p&gt;Assuming the goal of the system is to inspect the input video for common exercise mistakes and compare it with a reference video in which a professional athlete performs the same exercise, the flow looks as follows:&lt;/p&gt;

&lt;p&gt;1) Trimming the input video at the exercise start &amp;amp; end&lt;/p&gt;

&lt;p&gt;To indicate the start and end points, we can automatically detect the positions of body control points and apply arbitrary thresholds. For example, when squatting, it is possible to measure the angle of the arms and the height of the hands, and then, using arbitrary thresholds, detect the start and end points of an exercise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lONeyb5C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/krxd8fn04uqv1tyw7dhv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lONeyb5C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/krxd8fn04uqv1tyw7dhv.gif" alt="Alt Text" width="880" height="440"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=M-qAx0yGK9w"&gt;Video source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another option is to ask the user to indicate the start and the end of the exercise manually.&lt;/p&gt;
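&lt;p&gt;The threshold-based trimming described above can be sketched in a few lines. This is a minimal illustration, assuming a normalized hip-height signal has already been extracted per frame; the function name, rest level, and threshold value are all illustrative:&lt;/p&gt;

```python
import numpy as np

def detect_exercise_bounds(hip_heights, rest_level, threshold=0.05):
    """Trim a clip to the exercise itself: the exercise is considered
    active on frames where the hip height deviates from the standing
    (rest) level by more than an arbitrary threshold."""
    deviation = np.abs(np.asarray(hip_heights) - rest_level)
    active = np.flatnonzero(deviation > threshold)
    if active.size == 0:
        return None  # no exercise movement detected
    return int(active[0]), int(active[-1])

# Normalized hip heights: standing, one squat down and up, standing again.
hip_heights = [0.50, 0.50, 0.42, 0.30, 0.42, 0.50, 0.50]
print(detect_exercise_bounds(hip_heights, rest_level=0.50))  # (2, 4)
```

&lt;p&gt;The same idea applies to other control points, such as hand height or arm angle, depending on the exercise.&lt;/p&gt;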

&lt;p&gt;2) Detecting 2D and 3D keypoints on the user’s body&lt;/p&gt;

&lt;p&gt;3) Decomposing the exercise into phases&lt;/p&gt;

&lt;p&gt;Once the positions of the keypoints (joints) are extracted, they should be compared with those from the reference video. However, a direct comparison is impossible because the performance speed and the total number of repetitions may differ between the input and reference videos.&lt;/p&gt;

&lt;p&gt;These discrepancies can be resolved by decomposing the exercise into phases, as illustrated in the image below, where the squatting exercise is decomposed into two primary phases: squatting down and squatting up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAaIXOUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn91lesn5eqw3tl7d756.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAaIXOUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn91lesn5eqw3tl7d756.png" alt="image" width="880" height="551"&gt;&lt;/a&gt;&lt;br&gt;
Photo source: stronglifts.com&lt;/p&gt;

&lt;p&gt;The decomposition can be done by analyzing the keypoints detected in the input video frame by frame and then comparing them, by certain criteria, with the keypoints from the reference video.&lt;/p&gt;
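&lt;p&gt;A toy version of such phase decomposition, using the sign of the hips’ vertical velocity as the splitting criterion (a hypothetical criterion; a real system would smooth the keypoint signal before differencing it):&lt;/p&gt;

```python
import numpy as np

def decompose_squat_phases(hip_heights):
    """Split a squat repetition into 'down'/'up' phases by the sign of the
    frame-to-frame change in hip height, then merge consecutive frames
    with the same label into (phase, start_frame, end_frame) runs."""
    velocity = np.diff(hip_heights)
    labels = ["down" if v < 0 else "up" for v in velocity]
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs

print(decompose_squat_phases([0.50, 0.45, 0.35, 0.30, 0.36, 0.44, 0.50]))
# [('down', 0, 3), ('up', 3, 6)]
```

&lt;p&gt;Once both videos are decomposed this way, the n-th “down” phase of the input can be matched to the n-th “down” phase of the reference, regardless of speed or repetition count.&lt;/p&gt;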

&lt;p&gt;4) Searching for common mistakes&lt;/p&gt;

&lt;p&gt;Once the 3D keypoints and the phases of the exercise are detected, it’s time to find common technique mistakes in the input video. For example, in squatting, we can detect moments when the legs are bent (not straight) and the knees are closer to the center of the torso than the feet are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LmxQ61CL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v58e5uvo77f3czggr4yx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LmxQ61CL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v58e5uvo77f3czggr4yx.gif" alt="Alt Text" width="880" height="293"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtu.be/W73Mc0Gil9A"&gt;Video source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5) Comparing the input video frames with the reference ones&lt;/p&gt;

&lt;p&gt;Here we take a reference video in which the exercise is performed correctly, split it into phases, and detect the keypoints in each frame. Once the keypoints are detected and the exercise phases are defined in both the input and reference videos, we can compare each phase of the exercise as performed by the user and by the professional athlete.&lt;/p&gt;

&lt;p&gt;The step-by-step flow looks as follows:&lt;/p&gt;

&lt;p&gt;a. Slow down/accelerate the reference video in order to match the speed of the input one.&lt;/p&gt;

&lt;p&gt;b. Align both skeleton models of the user and a professional athlete so that their rotation angle and origins match.&lt;/p&gt;

&lt;p&gt;c. Normalize the size of both skeletons since reference and input videos can be captured from different distances.&lt;/p&gt;

&lt;p&gt;d. Compare keypoints frame by frame and detect motion inconsistencies.&lt;/p&gt;

&lt;p&gt;e. Repeat the flow separately for different groups of joints (e.g., feet position, knee position, hands and elbows position, etc.).&lt;/p&gt;
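&lt;p&gt;Steps b–d can be sketched in a few lines. This is a minimal illustration covering translation, scale, and per-joint comparison only; full rotation alignment (e.g., via the Kabsch algorithm) is omitted, and the joint indices, hip-to-neck normalization reference, and tolerance are all illustrative assumptions:&lt;/p&gt;

```python
import numpy as np

def normalize_skeleton(joints, hip=0, neck=1):
    """Center an (N, 3) joint array at the hip and scale it so the
    hip-to-neck distance is 1, making skeletons captured from different
    camera distances comparable (steps b and c, minus rotation)."""
    centered = joints - joints[hip]
    return centered / np.linalg.norm(centered[neck])

def frame_inconsistencies(user, reference, tolerance=0.15):
    """Step d: per-joint deviation between the normalized user and
    reference skeletons for one frame; joints whose deviation exceeds
    the tolerance are flagged as motion inconsistencies."""
    deviation = np.linalg.norm(
        normalize_skeleton(user) - normalize_skeleton(reference), axis=1)
    return np.flatnonzero(deviation > tolerance)

reference = np.array([[0., 0, 0], [0, 1, 0], [1, 1, 0]])  # hip, neck, wrist
user = reference * 0.5          # same pose, filmed from farther away
user_bad = reference * 0.5
user_bad[2] = [1.0, 0.25, 0]    # wrist out of position
print(frame_inconsistencies(user, reference))      # []
print(frame_inconsistencies(user_bad, reference))  # [2]
```

&lt;p&gt;Step a (speed matching) could be handled by resampling each reference phase to the length of the corresponding input phase, e.g., interpolating each joint coordinate over time; step e simply reruns the comparison restricted to one joint group at a time.&lt;/p&gt;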

&lt;p&gt;6) Display results and generate recommendations for a user&lt;/p&gt;

&lt;p&gt;When the whole analysis cycle is completed, the user will get results displayed in different formats. For example, the output may include interactive 3D reconstructions with mistake hints, so that the user can zoom in/out, go back, forward, or pause at a specific moment. It is also possible to collect and display movement statistics such as the number of repetitions, average speed and duration of one repetition, and others.&lt;/p&gt;

&lt;p&gt;Visually, the video-based 3D human pose estimation system looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4XTVk3dU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfyp5j2gdbme10utvhmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4XTVk3dU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfyp5j2gdbme10utvhmo.png" alt="image" width="880" height="1370"&gt;&lt;/a&gt;&lt;br&gt;
Photo sources: stronglifts.com,  Men’s Health channel &lt;/p&gt;

&lt;p&gt;In this article, I described how a 3D human pose estimation system works from the perspective of AI fitness coach app development, since this example illustrates the approach well. Note, however, that the flow may change depending on business requirements and other factors.&lt;/p&gt;

&lt;p&gt;Highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3D human pose estimation can be used to detect movement errors in fitness exercises.&lt;/li&gt;
&lt;li&gt;The selection of a proper 2D keypoint detector is critical in getting high-quality 3D keypoints.&lt;/li&gt;
&lt;li&gt;Occluded or fast-moving joints can be challenging to detect for 2D keypoint models and lead to incorrect/random predictions.&lt;/li&gt;
&lt;li&gt;When using pre-trained models, keep in mind that they will most likely not work well for unusual moves and body positions. You will probably need to fine-tune, or at least refine, the model on domain-specific or purposefully augmented data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Written by Maksym Tatariants, Data Science Engineer at MobiDev.&lt;/p&gt;

&lt;p&gt;Full article originally published at &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach"&gt;https://mobidev.biz&lt;/a&gt;. It is based on MobiDev technology research and experience providing software development services.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
