<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maksym Tatariants</title>
    <description>The latest articles on DEV Community by Maksym Tatariants (@mtatariants).</description>
    <link>https://dev.to/mtatariants</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F593065%2F8315d1c8-dff5-4a4a-a171-9abccc644eae.jpg</url>
      <title>DEV Community: Maksym Tatariants</title>
      <link>https://dev.to/mtatariants</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mtatariants"/>
    <language>en</language>
    <item>
      <title>AR &amp; AI Technologies For Virtual Fitting Room Development</title>
      <dc:creator>Maksym Tatariants</dc:creator>
      <pubDate>Sun, 21 Mar 2021 16:11:17 +0000</pubDate>
      <link>https://dev.to/mobidev/ar-ai-technologies-for-virtual-fitting-room-development-2gbf</link>
      <guid>https://dev.to/mobidev/ar-ai-technologies-for-virtual-fitting-room-development-2gbf</guid>
      <description>&lt;p&gt;I hate shopping in brick and mortar stores. However, my interest in virtual shopping is not limited to the buyer experience only. With the MobiDev DataScience department, I’ve gained experience in working on AI technologies for virtual fitting. The goal of this article is to describe how these systems work from the inside.&lt;/p&gt;

&lt;h2&gt;How Virtual Fitting Technology Works&lt;/h2&gt;

&lt;p&gt;A few years ago, the “Try before you buy” strategy was an efficient customer engagement method in outfit stores. Now, this strategy lives on in the form of virtual fitting rooms. Fortune Business Insights &lt;a href="https://www.fortunebusinessinsights.com/industry-reports/virtual-fitting-room-vfr-market-100322" rel="noopener noreferrer"&gt;projects&lt;/a&gt; that the virtual fitting room market will reach USD 10.00 billion by 2027.&lt;/p&gt;

&lt;p&gt;To better understand the logic of virtual fitting room technology, let’s review the following example. Some time ago, we worked on an augmented reality (AR) footwear fitting room project. The fitting room works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The input video is split into frames and processed with a deep learning model that estimates the positions of a set of specific leg and foot keypoints.
Read the related article: &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach" rel="noopener noreferrer"&gt;3D Human Pose Estimation in Fitness Coach Apps&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A 3D footwear model is placed according to the detected keypoints so that its position and orientation look natural to the user.&lt;/li&gt;
&lt;li&gt;The 3D footwear model is rendered so that each frame displays realistic textures and lighting.&lt;/li&gt;
&lt;/ol&gt;
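&lt;p&gt;Step 2 of the list above can be sketched in code. The following is a minimal, hypothetical Python sketch (not the actual app code) that derives the scale, rotation, and anchor point for a footwear model from just two detected keypoints, heel and toe:&lt;/p&gt;

```python
import math

def anchor_transform(heel, toe, model_length=1.0):
    """Compute the similarity transform (scale, rotation, translation)
    that places a footwear model, whose own axis runs from (0, 0) to
    (model_length, 0), along the foot detected in the frame."""
    dx, dy = toe[0] - heel[0], toe[1] - heel[1]
    foot_len = math.hypot(dx, dy)        # foot length in pixels
    scale = foot_len / model_length      # enlarge the model to foot size
    angle = math.atan2(dy, dx)           # foot orientation in radians
    return scale, angle, heel            # the heel is the anchor point

# A foot pointing "up" in image coordinates (y grows downwards):
scale, angle, origin = anchor_transform((100, 200), (100, 100))
```

A production system would do this per frame and in 3D, but the principle is the same: the detected keypoints fully determine where and how the 3D model is drawn.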

&lt;p&gt;&lt;a href="https://mobidev.biz/wp-content/uploads/2020/09/ar-based-virtual-try-on-technology.gif" rel="noopener noreferrer"&gt;Utilization of ARKit for 3D human body pose estimation and 3D model rendering&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When working with &lt;a href="https://mobidev.biz/blog/arkit-guide-augmented-reality-app-development-ios" rel="noopener noreferrer"&gt;ARKit&lt;/a&gt; (Apple’s augmented reality framework), we discovered that it has rendering limitations. As you can see from the video above, the tracking accuracy is too low to use it for footwear positioning. The likely cause is a trade-off that favors inference speed over tracking accuracy, a choice that matters for apps working in real time.&lt;/p&gt;

&lt;p&gt;One more issue was the poor identification of body parts by the ARKit algorithm. Since the algorithm is designed to detect the whole body, it returns no keypoints when the processed image contains only part of the body. This is exactly the case for a footwear fitting room, where the algorithm has to process only a person’s legs.&lt;/p&gt;

&lt;p&gt;The conclusion was that virtual fitting room apps may require functionality beyond the standard AR libraries. Thus, it’s recommended to involve data scientists to develop a custom pose estimation model that detects keypoints on only one or two feet in the frame and operates in real time.&lt;/p&gt;

&lt;h2&gt;Virtual Fitting Room Solutions&lt;/h2&gt;

&lt;p&gt;The virtual fitting room technology market provides offerings for accessories, watches, glasses, hats, clothes, and others. Let’s review how some of these solutions work under the hood.&lt;/p&gt;

&lt;h3&gt;WATCHES&lt;/h3&gt;

&lt;p&gt;A good example of virtual watch try-on is the &lt;a href="https://apps.apple.com/us/app/ar-watches-augmented-reality/id1435312889" rel="noopener noreferrer"&gt;AR-Watches app&lt;/a&gt;, which lets users try on various watches. The solution is based on &lt;a href="https://en.wikipedia.org/wiki/ARTag" rel="noopener noreferrer"&gt;ARTag technology&lt;/a&gt;: specific markers are printed on a band that the user wears on the wrist in place of a watch. The computer vision algorithm processes the markers visible in the frame and identifies the camera’s position relative to them. To render the 3D object correctly, the virtual camera is then placed at the same location.&lt;/p&gt;
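&lt;p&gt;The geometry behind marker-based positioning can be illustrated with the pinhole camera model. This is a simplified sketch only (the real ARTag pipeline solves a full 6-DoF pose, not just distance), and all numbers are made up for illustration:&lt;/p&gt;

```python
def marker_distance(focal_px, marker_size_m, marker_px):
    """Pinhole-camera estimate: an object of real size S that appears
    s pixels wide lies at distance d = f * S / s from the camera,
    where f is the focal length expressed in pixels."""
    return focal_px * marker_size_m / marker_px

# A 4 cm ARTag marker spanning 80 px, seen by a camera
# with a 1000 px focal length:
d = marker_distance(1000, 0.04, 80)
```

Knowing the distance and the marker corners’ positions, the renderer can place the virtual camera so that the watch model lines up with the band.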

&lt;p&gt;Overall, the technology has its limits (for instance, not everybody has a printer at hand to print out the ARTag band). But if it matches the business use case, building a product of production-ready quality wouldn’t be that difficult. Probably the most important part would be creating proper 3D models to use.&lt;/p&gt;

&lt;p&gt;3D model rendering of a watch using the ARTag technology&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/yLnGjabCDD0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;SHOES&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://play.google.com/store/apps/details?id=by.wanna.apps.wsneakers&amp;amp;hl=en" rel="noopener noreferrer"&gt;Wanna Kicks&lt;/a&gt; and &lt;a href="https://apps.apple.com/us/app/sneakerkit/id1463772901" rel="noopener noreferrer"&gt;SneakerKit&lt;/a&gt; apps are a good demonstration of how AR and deep learning technologies might be applied for footwear.&lt;/p&gt;

&lt;p&gt;Virtual shoe try-on, Wanna Kicks app&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/02e20PkYeXQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Technically, such a solution utilizes a foot pose estimation model based on deep learning. The technology can be viewed as a special case of the widespread full-body &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach" rel="noopener noreferrer"&gt;3D pose estimation&lt;/a&gt; models, which estimate keypoint positions either directly in 3D or by lifting detected 2D keypoints into 3D coordinates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwm0pb236ggud860wwlw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwm0pb236ggud860wwlw.gif" alt="3d-foot-pose-estimation-virtual-try-on"&gt;&lt;/a&gt;&lt;br&gt;
3D foot pose estimation &lt;a href="https://labs.laan.com/blog/leveraging-photogrammetry-to-increase-data-annotation-efficiency-in-ML.html" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the 3D keypoints of the feet are detected, they can be used to create a parametric 3D model of a human foot, and to position and scale a footwear 3D model according to the geometric properties of that parametric model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtr3er6si4dt5emnaohl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtr3er6si4dt5emnaohl.gif" alt="3d-model-human-foot-virtual-try-on"&gt;&lt;/a&gt;&lt;br&gt;
Positioning of a 3D model of footwear on top of a detected parametric foot model &lt;a href="https://www.vyking.io/video/Vyking_SneakerStudio.mp4" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;
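&lt;p&gt;The positioning and scaling step can be sketched as follows. This is a hypothetical simplification (rotation is omitted to keep it short) that scales and translates a shoe mesh so its length matches the distance between the detected 3D heel and toe keypoints:&lt;/p&gt;

```python
import math

def fit_shoe_to_foot(shoe_vertices, heel3d, toe3d, shoe_length=1.0):
    """Scale a shoe mesh (XYZ vertices with the heel at the origin and
    length shoe_length along +X) to the measured foot length, then
    translate it to the detected heel keypoint."""
    foot_len = math.dist(heel3d, toe3d)  # foot length from 3D keypoints
    s = foot_len / shoe_length           # uniform scale factor
    return [tuple(s * v[i] + heel3d[i] for i in range(3))
            for v in shoe_vertices]

mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # toy two-vertex "shoe"
fitted = fit_shoe_to_foot(mesh, (2, 0, 0), (5, 0, 0))
```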

&lt;p&gt;Compared to full-body and face pose estimation models, foot pose estimation still faces certain challenges. The main issue is the lack of the 3D annotation data required for model training.&lt;/p&gt;

&lt;p&gt;The practical ways around this problem are to use &lt;a href="https://www.di.ens.fr/willow/research/surreal/" rel="noopener noreferrer"&gt;synthetic data&lt;/a&gt;, which means rendering photorealistic 3D models of human feet with keypoint annotations and training a model on them; or to use photogrammetry, which reconstructs a 3D scene from multiple 2D views to &lt;a href="https://labs.laan.com/blog/leveraging-photogrammetry-to-increase-data-annotation-efficiency-in-ML.html" rel="noopener noreferrer"&gt;decrease the amount of labeling needed&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This kind of solution is considerably more complicated. To enter the market with a ready-to-use product, you need to collect a large enough foot keypoint dataset (using synthetic data, photogrammetry, or a combination of both), train a customized pose estimation model that combines sufficient accuracy with real-time inference speed, test its robustness in various conditions, and create a foot model. We consider it a medium-complexity project in terms of technologies.&lt;/p&gt;
&lt;h3&gt;GLASSES&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.fittingbox.com/en/" rel="noopener noreferrer"&gt;FittingBox&lt;/a&gt; and &lt;a href="https://ditto.com/" rel="noopener noreferrer"&gt;Ditto&lt;/a&gt; companies considered AR technology for the virtual glasses try-on. The user should choose a glasses model from a virtual catalog and it is put on his/her eyes.&lt;/p&gt;

&lt;p&gt;Virtual glasses try-on and lenses simulation&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/p0dGmaiQKAg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This solution is based on a deep learning-powered pose estimation approach used for facial landmark detection, where the common annotation format includes 68 2D/3D facial landmarks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkfbig13z603oi6fpwzh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkfbig13z603oi6fpwzh.gif" alt="face-pose-estimation"&gt;&lt;/a&gt;&lt;br&gt;
Example of facial landmark detection in video. Note that the model in the video detects more than 68 landmarks &lt;a href="https://firebase.googleblog.com/2018/11/ml-kit-adds-face-contours-to-create.html" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such an annotation format allows differentiating the face contour, nose, eyes, eyebrows, and lips with a sufficient accuracy level. Training data and pre-trained models for face landmark estimation can be taken from open-source projects such as &lt;a href="https://github.com/1adrianb/face-alignment" rel="noopener noreferrer"&gt;Face Alignment&lt;/a&gt;, which provides face pose estimation functionality out of the box.&lt;/p&gt;
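&lt;p&gt;As an illustration of how the 68-landmark format is used, here is a small Python sketch (assuming the common iBUG ordering, where points 36–41 outline the left eye and 42–47 the right eye) that derives the anchor point for a glasses model; the landmark values are synthetic:&lt;/p&gt;

```python
def glasses_anchor(landmarks):
    """Given 68 (x, y) facial landmarks in the common iBUG ordering
    (points 36-41 outline the left eye, 42-47 the right eye), return
    the two eye centres and the midpoint where the nose bridge of a
    glasses model can be anchored."""
    def centre(pts):
        xs, ys = zip(*pts)
        return (sum(xs) / len(xs), sum(ys) / len(ys))
    left = centre(landmarks[36:42])
    right = centre(landmarks[42:48])
    bridge = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)
    return left, right, bridge

# Synthetic landmarks: each eye cluster collapsed to a single point.
pts = [(0, 0)] * 36 + [(10, 20)] * 6 + [(30, 20)] * 6 + [(0, 0)] * 20
left, right, bridge = glasses_anchor(pts)
```

The distance between the two eye centres also gives the scale for the frames, which is why this annotation format is sufficient for glasses try-on.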

&lt;p&gt;In terms of technologies, this kind of solution is not that complicated, especially if a pre-trained model is used as a basis for the &lt;a href="https://mobidev.biz/blog/custom-face-detection-recognition-software-development" rel="noopener noreferrer"&gt;face recognition task&lt;/a&gt;. But it’s important to consider that low-quality cameras and poor lighting conditions can be limiting factors.&lt;/p&gt;
&lt;h3&gt;SURGICAL MASKS&lt;/h3&gt;

&lt;p&gt;Amidst the COVID-19 pandemic, &lt;a href="https://zap.works/" rel="noopener noreferrer"&gt;ZapWorks&lt;/a&gt; launched an AR-based educational &lt;a href="https://viewtoo.arweb.app/?zid=z/bEPn1c&amp;amp;toolbar=0" rel="noopener noreferrer"&gt;app&lt;/a&gt; that instructs users on how to wear surgical masks properly. Technically, this app is also based on 3D facial landmark detection. As in the glasses try-on apps, the method extracts information about facial features and then renders the mask on top.&lt;/p&gt;

&lt;p&gt;AR for mask wear guide&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/HvTYcEQdrcc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;HATS&lt;/h3&gt;

&lt;p&gt;Since facial landmark detection models work well, hats are another frequently simulated AR item. All that is required to render a hat correctly on a person’s head is the 3D coordinates of a few keypoints indicating the temples and the center of the forehead. Virtual hat try-on apps have already been launched by &lt;a href="https://www.quytech.com/" rel="noopener noreferrer"&gt;QUYTECH&lt;/a&gt;, &lt;a href="https://www.banuba.com/" rel="noopener noreferrer"&gt;Banuba&lt;/a&gt;, and &lt;a href="https://www.vertebrae.com/" rel="noopener noreferrer"&gt;Vertebrae&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Baseball cap try-on&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/RAIm7blzkD0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;CLOTHES&lt;/h3&gt;

&lt;p&gt;Compared to shoes, masks, glasses, and watches, virtual try-on of 3D clothes still remains a challenge. The reason is that clothes deform as they take the shape of a person’s body. Thus, for a proper AR experience, a deep learning model should identify not only the basic keypoints at the human body’s joints but also the body shape in 3D.&lt;/p&gt;

&lt;p&gt;Looking at &lt;a href="https://github.com/facebookresearch/Densepose" rel="noopener noreferrer"&gt;DensePose&lt;/a&gt;, one of the most recent deep learning models, which maps the pixels of an RGB image of a person onto the 3D surface of the human body, we find that it’s still not quite suitable for augmented reality. DensePose’s inference speed is inadequate for real-time apps, and its body mesh detections are not accurate enough for fitting 3D clothing items. Improving the results would require collecting more annotated data, which is a time- and resource-consuming task.&lt;/p&gt;

&lt;p&gt;The alternative is to use 2D clothing items and 2D silhouettes of people. That’s what the &lt;a href="https://zeekit.me/" rel="noopener noreferrer"&gt;Zeekit&lt;/a&gt; company does, giving users the ability to apply a number of clothing types (dresses, pants, shirts, etc.) to their photo.&lt;/p&gt;

&lt;p&gt;2D clothing try-on, Zeekit&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/IXIbeBQwgDA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Strictly speaking, transferring 2D clothing images cannot be considered augmented reality, since the “reality” aspect implies real-time operation. Still, it can provide an unusual and immersive user experience. The underlying technologies comprise &lt;a href="https://towardsdatascience.com/generative-adversarial-networks-explained-34472718707a" rel="noopener noreferrer"&gt;Generative Adversarial Networks&lt;/a&gt;, &lt;a href="https://www.kdnuggets.com/2020/08/3d-human-pose-estimation-experiments-analysis.html" rel="noopener noreferrer"&gt;Human Pose Estimation&lt;/a&gt;, and &lt;a href="http://sysu-hcp.net/lip/index.php" rel="noopener noreferrer"&gt;Human Parsing&lt;/a&gt; models. The 2D clothes transferring algorithm may look as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the areas in the image corresponding to individual body parts&lt;/li&gt;
&lt;li&gt;Detect the position of each identified body part&lt;/li&gt;
&lt;li&gt;Produce a warped image of the transferred clothing item&lt;/li&gt;
&lt;li&gt;Apply the warped image to the image of the person with minimal artifacts&lt;/li&gt;
&lt;/ol&gt;
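&lt;p&gt;The last step of the list above boils down to mask-based compositing. A toy Python sketch on tiny grayscale “images” (nested lists of pixel values, with a binary clothing mask); real pipelines do the same per channel with soft alpha masks:&lt;/p&gt;

```python
def composite(person, cloth, mask):
    """Paste the warped clothing image onto the person image wherever
    the clothing mask is set; keep the person's pixels elsewhere."""
    return [
        [cloth[y][x] if mask[y][x] else person[y][x]
         for x in range(len(person[0]))]
        for y in range(len(person))
    ]

person = [[1, 1], [1, 1]]   # 2x2 "photo" of a person
cloth  = [[9, 9], [9, 9]]   # warped clothing image
mask   = [[0, 1], [1, 0]]   # where the clothing should appear
out = composite(person, cloth, mask)
```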

&lt;h3&gt;OUR EXPERIMENTS WITH 2D CLOTH TRANSFERRING&lt;/h3&gt;

&lt;p&gt;Since there are no ready-made pre-trained models for a virtual dressing room, we researched this field by experimenting with the &lt;a href="https://arxiv.org/abs/2003.05863" rel="noopener noreferrer"&gt;ACGPN model&lt;/a&gt;. The idea was to explore the model’s outputs in practice for 2D cloth transferring using various approaches.&lt;/p&gt;

&lt;p&gt;The model was applied to people’s images in constrained (samples from the training dataset, VITON) and unconstrained (any environment) conditions. In addition, we tested the limits of the model’s capabilities by not only running it on custom persons’ images but also using custom clothing images that were quite different from the training data.&lt;/p&gt;

&lt;p&gt;Here are examples of results we received during the research:&lt;/p&gt;

&lt;p&gt;1) Replication of results described in the “Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content” research paper, with the original data and our preprocessing models:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pqm5ew6xtvvz5bvwaae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pqm5ew6xtvvz5bvwaae.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Successful (A1-A3) and unsuccessful (B1-B3) replacement of clothing &lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;B1 – poor inpainting&lt;/li&gt;
&lt;li&gt;B2 – new clothes overlapping&lt;/li&gt;
&lt;li&gt;B3 – edge defects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Application of custom clothes to default person images:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xkk9i3wty1zy8zy7ye6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xkk9i3wty1zy8zy7ye6.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Clothing replacement using custom clothes&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – no defects &lt;/li&gt;
&lt;li&gt;Row B – some defects to be moderated &lt;/li&gt;
&lt;li&gt;Row C – critical defects
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3) Application of default clothes to the custom person images:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro5lt3hplgztlq5g2knb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro5lt3hplgztlq5g2knb.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Outputs of clothing replacement on images with an unconstrained environment&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – edge defects (minor)&lt;/li&gt;
&lt;li&gt;Row B – masking errors (moderate)&lt;/li&gt;
&lt;li&gt;Row C – inpainting and masking errors (critical) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4) Application of custom clothes to the custom person images:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh8v4y4129g9eeanu65x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh8v4y4129g9eeanu65x.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Clothing replacement with the unconstrained environment and custom clothing images&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – best results obtained from the model&lt;/li&gt;
&lt;li&gt;Row B – many defects to be moderated&lt;/li&gt;
&lt;li&gt;Row C – most distorted results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When analyzing the outputs, we discovered that virtual clothes try-on still has certain limitations. The main one is that the training data must contain paired images: the target clothing item and people wearing that item. In a real-world business scenario, collecting such data may be challenging. The other takeaways from the research are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ACGPN model produces rather good results on images of people from the training dataset, even when custom clothing items are applied.&lt;/li&gt;
&lt;li&gt;The model is unstable when it comes to processing the images of people captured in varying lighting, other environmental conditions, and unusual poses.&lt;/li&gt;
&lt;li&gt;The technology for creating virtual dressing room systems for transferring 2D clothing images onto the image of the target person in the wild is not yet ready for commercial applications. However, if the conditions are static, the expected results can be much better.&lt;/li&gt;
&lt;li&gt;The main limiting factor that holds back the development of better models is the lack of diverse datasets with people captured in outdoor conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In conclusion, I’d say that current virtual fitting rooms work well for items related to separate body parts like the head, face, feet, and arms. But for items that require the whole body to be detected, estimated, and modified, virtual fitting is still in its infancy. However, &lt;a href="https://mobidev.biz/blog/future-ai-machine-learning-trends-to-impact-business" rel="noopener noreferrer"&gt;AI evolves&lt;/a&gt; in leaps and bounds, and the best strategy is to stay tuned and keep trying.&lt;/p&gt;

&lt;p&gt;Written by Maksym Tatariants, Data Science Engineer at MobiDev.&lt;/p&gt;

&lt;p&gt;Full article originally published at &lt;a href="https://mobidev.biz/blog/ar-ai-technologies-virtual-fitting-room-development" rel="noopener noreferrer"&gt;https://mobidev.biz&lt;/a&gt;. It is based on MobiDev technology research and experience providing software development services.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Human Pose Estimation Technology 2021 Guide</title>
      <dc:creator>Maksym Tatariants</dc:creator>
      <pubDate>Fri, 12 Mar 2021 12:00:51 +0000</pubDate>
      <link>https://dev.to/mobidev/human-pose-estimation-technology-2021-guide-5ejd</link>
      <guid>https://dev.to/mobidev/human-pose-estimation-technology-2021-guide-5ejd</guid>
<description>&lt;p&gt;Is it possible for a technology solution to replace fitness coaches? Well, someone still has to motivate you by saying, “Come on, even my grandma can do better!” But from a technology point of view, this high-level requirement led us to 3D human pose estimation technology.&lt;/p&gt;

&lt;p&gt;In this article, I will describe our own experience of how 3D human pose estimation can be developed and implemented for the AI fitness coach solution.&lt;/p&gt;

&lt;h2&gt;What is Human Pose Estimation?&lt;/h2&gt;

&lt;p&gt;Human pose estimation is a computer vision-based technology that detects and analyzes human posture. Its main component is the modeling of the human body. The three most widely used types of human body models are skeleton-based, contour-based, and volume-based.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skeleton-based model&lt;/strong&gt; consists of a set of joints (keypoints) like ankles, knees, shoulders, elbows, wrists, and limb orientations comprising the skeletal structure of a human body. This model is used both in 2D and 3D human pose estimation techniques because of its flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contour-based model&lt;/strong&gt; consists of the contour and rough width of the body torso and limbs, where body parts are presented with boundaries and rectangles of a person’s silhouette. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume-based model&lt;/strong&gt; consists of 3D human body shapes and poses represented with geometric meshes and shapes, normally captured with 3D scans.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u9b18-KK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kv1wxxhvmp7f44d7b3o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u9b18-KK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kv1wxxhvmp7f44d7b3o7.png" alt="image" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/pdf/2006.01423.pdf"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, I am talking about &lt;strong&gt;skeleton-based models&lt;/strong&gt;, which may be detected from a 2D or 3D perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2D pose estimation&lt;/strong&gt; is based on the detection and analysis of X, Y coordinates of human body joints from an RGB image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3D pose estimation&lt;/strong&gt; is based on the detection and analysis of X, Y, Z coordinates of human body joints from an RGB image. &lt;/p&gt;

&lt;p&gt;When speaking about fitness applications involving human pose estimation, it’s better to use 3D estimation, since it analyzes human poses during physical activities more accurately.&lt;/p&gt;

&lt;p&gt;Talking about AI fitness coach apps, the common flow looks as follows: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture user’s movements while doing an exercise&lt;/li&gt;
&lt;li&gt;Analyze the correctness of an exercise performance &lt;/li&gt;
&lt;li&gt;Display mistakes to the user interface&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;How 3D Human Pose Estimation Works&lt;/h2&gt;

&lt;p&gt;Here is a visual example of how 3D human pose estimation technology detects keypoints on a human body:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HhSAHE7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mm6q9dbb82abg8i5sau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HhSAHE7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mm6q9dbb82abg8i5sau.png" alt="image" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process usually involves extracting the joints of a human body and then analyzing the pose with deep learning algorithms. If the human pose estimation system uses video as a data source, keypoints (joint locations) are detected from a sequence of frames rather than a single picture. This yields higher accuracy, since the system analyzes a person’s actual movement, not a static position.&lt;/p&gt;

&lt;p&gt;There are several ways to develop a 3D human pose estimation system for fitness. The most practical one is to train a deep learning model to extract 2D or 3D keypoints from the given images/frames.&lt;/p&gt;

&lt;p&gt;Using video streams from several cameras with different views of the same person doing exercises would grant better accuracy. But multi-camera setups are often unavailable, and analyzing several video streams takes more compute power.&lt;/p&gt;

&lt;p&gt;For our research, we used a single video source for the analysis and applied convolutional neural networks (CNNs) with dilated temporal convolutions (see the animation below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ORiMdhZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cst6o7n08ten5f2upckj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ORiMdhZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cst6o7n08ten5f2upckj.gif" alt="Alt Text" width="880" height="332"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/facebookresearch/VideoPose3D/blob/master/images/convolutions_anim.gif"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After analyzing the existing models, we concluded that &lt;a href="https://github.com/facebookresearch/VideoPose3D"&gt;VideoPose3D&lt;/a&gt; is the best fit for fitness app purposes. As input, it takes a set of detected 2D keypoints, produced by a 2D detector pre-trained on the COCO 2017 dataset. To predict the current position of each joint accurately, it processes visual data from several frames captured at different points in time.&lt;/p&gt;

&lt;h2&gt;How to Use Human Pose Estimation in AI Fitness Coach App&lt;/h2&gt;

&lt;p&gt;Digitalization has not spared the fitness industry. According to the Research and Markets &lt;a href="https://www.businesswire.com/news/home/20170724006151/en/27.4-Billion-Growth-Opportunities-Global-Digital-Fitness?utm_campaign=embodied-ai&amp;amp;utm_medium=email&amp;amp;utm_source=Revue%20newsletter"&gt;report&lt;/a&gt;, the digital fitness market size is expected to reach $27.4 billion by 2022.&lt;/p&gt;

&lt;p&gt;3D human pose estimation is a relatively new but rapidly evolving technology in digital fitness. Based on our analysis and practical experience with 3D human pose estimation systems, we have arrived at our own vision of how it can be implemented. Let’s review how such a system may be built so that it automatically analyzes movements in videos of users performing physical exercises.&lt;/p&gt;

&lt;p&gt;Assuming the goal of the system is to inspect the input video for common exercise mistakes and compare it with a reference video in which a professional athlete performs the same exercise, the flow looks as follows:&lt;/p&gt;

&lt;p&gt;1) Trimming the input video at the exercise start &amp;amp; end&lt;/p&gt;

&lt;p&gt;To indicate the start and end points, we can automatically detect the positions of body control points and apply arbitrary thresholds. For example, when squatting, it is possible to measure the angle of the arms and the height of the hands, and then, using arbitrary thresholds, detect the start and end points of an exercise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lONeyb5C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/krxd8fn04uqv1tyw7dhv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lONeyb5C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/krxd8fn04uqv1tyw7dhv.gif" alt="Alt Text" width="880" height="440"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=M-qAx0yGK9w"&gt;Video source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another option is to ask the user to indicate the start and the end of the exercise manually.&lt;/p&gt;
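&lt;p&gt;The threshold-based trimming described above can be sketched in a few lines. This is a minimal illustration, assuming a normalized hip-height signal has already been extracted per frame; the function name, rest level, and threshold value are all illustrative:&lt;/p&gt;

```python
import numpy as np

def detect_exercise_bounds(hip_heights, rest_level, threshold=0.05):
    """Trim a clip to the exercise itself: the exercise is considered
    active on frames where the hip height deviates from the standing
    (rest) level by more than an arbitrary threshold."""
    deviation = np.abs(np.asarray(hip_heights) - rest_level)
    active = np.flatnonzero(deviation > threshold)
    if active.size == 0:
        return None  # no exercise movement detected
    return int(active[0]), int(active[-1])

# Normalized hip heights: standing, one squat down and up, standing again.
hip_heights = [0.50, 0.50, 0.42, 0.30, 0.42, 0.50, 0.50]
print(detect_exercise_bounds(hip_heights, rest_level=0.50))  # (2, 4)
```

&lt;p&gt;The same idea applies to other control points, such as hand height or arm angle, depending on the exercise.&lt;/p&gt;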

&lt;p&gt;2) Detecting 2D and 3D keypoints on the user’s body&lt;/p&gt;

&lt;p&gt;3) Decomposing the exercise into phases&lt;/p&gt;

&lt;p&gt;Once the positions of the keypoints (joints) are extracted, they should be compared with those from the reference video. However, a direct comparison is impossible because the performance speed and the total number of repetitions may differ between the input and reference videos.&lt;/p&gt;

&lt;p&gt;These discrepancies can be resolved by decomposing the exercise into phases, as illustrated in the image below, where the squatting exercise is decomposed into two primary phases: squatting down and squatting up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAaIXOUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn91lesn5eqw3tl7d756.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAaIXOUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn91lesn5eqw3tl7d756.png" alt="image" width="880" height="551"&gt;&lt;/a&gt;&lt;br&gt;
Photo source: stronglifts.com&lt;/p&gt;

&lt;p&gt;The decomposition can be done by analyzing the keypoints detected in the input video frame by frame and then comparing them, by certain criteria, with the keypoints from the reference video.&lt;/p&gt;
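&lt;p&gt;A toy version of such phase decomposition, using the sign of the hips’ vertical velocity as the splitting criterion (a hypothetical criterion; a real system would smooth the keypoint signal before differencing it):&lt;/p&gt;

```python
import numpy as np

def decompose_squat_phases(hip_heights):
    """Split a squat repetition into 'down'/'up' phases by the sign of the
    frame-to-frame change in hip height, then merge consecutive frames
    with the same label into (phase, start_frame, end_frame) runs."""
    velocity = np.diff(hip_heights)
    labels = ["down" if v < 0 else "up" for v in velocity]
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs

print(decompose_squat_phases([0.50, 0.45, 0.35, 0.30, 0.36, 0.44, 0.50]))
# [('down', 0, 3), ('up', 3, 6)]
```

&lt;p&gt;Once both videos are decomposed this way, the n-th “down” phase of the input can be matched to the n-th “down” phase of the reference, regardless of speed or repetition count.&lt;/p&gt;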

&lt;p&gt;4) Searching for common mistakes&lt;/p&gt;

&lt;p&gt;Once the 3D keypoints and the phases of the exercise are detected, it’s time to find common technique mistakes in the input video. For example, in squatting, we can detect moments when the legs are bent (not straight) and the knees are closer to the center of the torso than the feet are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LmxQ61CL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v58e5uvo77f3czggr4yx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LmxQ61CL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v58e5uvo77f3czggr4yx.gif" alt="Alt Text" width="880" height="293"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtu.be/W73Mc0Gil9A"&gt;Video source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5) Comparing the input video frames with the reference ones&lt;/p&gt;

&lt;p&gt;Here we take a reference video in which the exercise is performed correctly, split it into phases, and detect the keypoints in each frame. Once the keypoints are detected and the exercise phases are defined in both the input and reference videos, we can compare each phase of the exercise as performed by the user and by the professional athlete.&lt;/p&gt;

&lt;p&gt;The step-by-step flow looks as follows:&lt;/p&gt;

&lt;p&gt;a. Slow down/accelerate the reference video in order to match the speed of the input one.&lt;/p&gt;

&lt;p&gt;b. Align both skeleton models of the user and a professional athlete so that their rotation angle and origins match.&lt;/p&gt;

&lt;p&gt;c. Normalize the size of both skeletons since reference and input videos can be captured from different distances.&lt;/p&gt;

&lt;p&gt;d. Compare keypoints frame by frame and detect motion inconsistencies.&lt;/p&gt;

&lt;p&gt;e. Repeat the flow separately for different groups of joints (e.g., feet position, knee position, hands and elbows position, etc.).&lt;/p&gt;
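&lt;p&gt;Steps b–d can be sketched in a few lines. This is a minimal illustration covering translation, scale, and per-joint comparison only; full rotation alignment (e.g., via the Kabsch algorithm) is omitted, and the joint indices, hip-to-neck normalization reference, and tolerance are all illustrative assumptions:&lt;/p&gt;

```python
import numpy as np

def normalize_skeleton(joints, hip=0, neck=1):
    """Center an (N, 3) joint array at the hip and scale it so the
    hip-to-neck distance is 1, making skeletons captured from different
    camera distances comparable (steps b and c, minus rotation)."""
    centered = joints - joints[hip]
    return centered / np.linalg.norm(centered[neck])

def frame_inconsistencies(user, reference, tolerance=0.15):
    """Step d: per-joint deviation between the normalized user and
    reference skeletons for one frame; joints whose deviation exceeds
    the tolerance are flagged as motion inconsistencies."""
    deviation = np.linalg.norm(
        normalize_skeleton(user) - normalize_skeleton(reference), axis=1)
    return np.flatnonzero(deviation > tolerance)

reference = np.array([[0., 0, 0], [0, 1, 0], [1, 1, 0]])  # hip, neck, wrist
user = reference * 0.5          # same pose, filmed from farther away
user_bad = reference * 0.5
user_bad[2] = [1.0, 0.25, 0]    # wrist out of position
print(frame_inconsistencies(user, reference))      # []
print(frame_inconsistencies(user_bad, reference))  # [2]
```

&lt;p&gt;Step a (speed matching) could be handled by resampling each reference phase to the length of the corresponding input phase, e.g., interpolating each joint coordinate over time; step e simply reruns the comparison restricted to one joint group at a time.&lt;/p&gt;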

&lt;p&gt;6) Display results and generate recommendations for a user&lt;/p&gt;

&lt;p&gt;When the whole analysis cycle is completed, the user will get results displayed in different formats. For example, the output may include interactive 3D reconstructions with mistake hints, so that the user can zoom in/out, go back, forward, or pause at a specific moment. It is also possible to collect and display movement statistics such as the number of repetitions, average speed and duration of one repetition, and others.&lt;/p&gt;

&lt;p&gt;Visually, the video-based 3D human pose estimation system looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4XTVk3dU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfyp5j2gdbme10utvhmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4XTVk3dU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfyp5j2gdbme10utvhmo.png" alt="image" width="880" height="1370"&gt;&lt;/a&gt;&lt;br&gt;
Photo sources: stronglifts.com,  Men’s Health channel &lt;/p&gt;

&lt;p&gt;In this article, I described how a 3D human pose estimation system works from the perspective of AI fitness coach app development, since this example illustrates the approach well. Note, however, that the flow may change depending on business requirements and other factors.&lt;/p&gt;

&lt;p&gt;Highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3D human pose estimation can be used to detect movement errors in fitness exercises.&lt;/li&gt;
&lt;li&gt;The selection of a proper 2D keypoint detector is critical in getting high-quality 3D keypoints.&lt;/li&gt;
&lt;li&gt;Occluded or fast-moving joints can be challenging to detect for 2D keypoint models and lead to incorrect/random predictions.&lt;/li&gt;
&lt;li&gt;When using pre-trained models, keep in mind that they will most likely not work well for unusual moves and body positions. You will probably need to fine-tune, or at least refine, the model on domain-specific or purposefully augmented data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Written by Maksym Tatariants, Data Science Engineer at MobiDev.&lt;/p&gt;

&lt;p&gt;Full article originally published at &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach"&gt;https://mobidev.biz&lt;/a&gt;. It is based on MobiDev technology research and experience providing software development services.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
