<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MobiDev</title>
    <description>The latest articles on DEV Community by MobiDev (@mobidev).</description>
    <link>https://dev.to/mobidev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2097%2Ff74cdbe3-869a-453e-8fd0-4716046c144d.png</url>
      <title>DEV Community: MobiDev</title>
      <link>https://dev.to/mobidev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mobidev"/>
    <language>en</language>
    <item>
      <title>12 Augmented Reality Trends of 2023: New Milestones in Immersive Technology</title>
      <dc:creator>Andrew Makarov</dc:creator>
      <pubDate>Thu, 22 Sep 2022 12:50:50 +0000</pubDate>
      <link>https://dev.to/mobidev/12-augmented-reality-trends-of-2023-new-milestones-in-immersive-technology-35p5</link>
      <guid>https://dev.to/mobidev/12-augmented-reality-trends-of-2023-new-milestones-in-immersive-technology-35p5</guid>
      <description>&lt;p&gt;Innovative technologies transform science fiction into reality, and AR is undoubtedly one of them. Holograms, like in the Star Wars and the Marvel movies, now surround us in the real world, bringing a new immersive experience, and it’s more than just entertainment. Today, augmented reality is an effective business tool.&lt;/p&gt;

&lt;p&gt;Across a number of different industries like retail, business, gaming, healthcare, and even the military, augmented reality is used for solving various business challenges. It’s important to keep an eye on these technologies to know where the industry is heading. As we discuss these 12 augmented reality trends making moves in 2023, think about how these solutions may benefit your own business.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trend #1: Leap into the Metaverse
&lt;/h2&gt;

&lt;p&gt;It’s likely no surprise to you that augmented reality is being used alongside other metaverse technologies. The metaverse has flooded news media over the past year since Facebook’s ‘Meta’ rebranding. However, it’s not just marketing hogwash. One of the goals of metaverse technologies is to break down the barriers between the digital and physical worlds. Since augmented reality can display virtual objects embedded in our real world, various opportunities emerge for businesses and consumers alike. &lt;/p&gt;

&lt;h3&gt;
  
  
  AVATARS
&lt;/h3&gt;

&lt;p&gt;If we’re going to bring digital experiences into the real world, AR is a great start. Using body and face tracking, as well as advanced scene depth sensing, companies are already working on camera filters that accomplish this. Geenee AR and Ready Player Me partnered up to make this experience a reality. By inserting your avatar into Geenee’s WebAR Builder software, you can effectively ‘wear’ your avatar on camera. The software also takes into account cosmetic items on your Ready Player Me character, including accessories in the form of NFTs.&lt;/p&gt;

&lt;p&gt;This technology isn’t new. It’s been seen in use with apps like Snapchat and Instagram for a long time. However, the innovative element is how the app allows users to drop their avatar that they use on other platforms into the app and use it in AR. In the future, this technology could be used to better hybridize virtual meetings. If one person on your team is using a VR headset to attend a meeting and you’re attending without a VR headset, an AR avatar of the person could represent them at your meeting. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9orx6zm2pcrz7k1cxao.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9orx6zm2pcrz7k1cxao.jpg" alt="Horizon Workrooms, the “metaverted” meeting rooms, presented by Meta group on VivaTech 2022"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Horizon Workrooms, the “metaverted” meeting rooms, presented by Meta group on VivaTech 2022&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Making more cost-effective and powerful AR headsets is the number one barrier to entry here. Time will tell how the technology evolves. &lt;/p&gt;
&lt;h3&gt;
  
  
  SPATIAL AUDIO
&lt;/h3&gt;

&lt;p&gt;Although it may not seem like an augmented reality technology on the surface, spatial audio is very important for enhancing the immersion of AR experiences. Metaverse technologists are obsessed with including all of our five senses in the process, and our hearing is no exception. To make VR and AR experiences more immersive, 3D audio is needed. Users should be able to tell where a sound is coming from in 3D space based on their own position. &lt;/p&gt;

&lt;p&gt;Meta recently added an advanced audio engine to its Spark AR Studio for creating sound effects by mixing multiple sounds. This lets creators build multi-sensory effects that use both sight and sound to make people feel more immersed in the augmented reality experience. In this way, sounds can play in response to human interaction with an AR effect.&lt;/p&gt;
&lt;h3&gt;
  
  
  TAKING DIGITAL ITEMS INTO THE REAL WORLD
&lt;/h3&gt;

&lt;p&gt;Metaverse fans love taking things from the digital world into the real world and vice versa with AR. This technology has actually been around since before the metaverse craze. For example, Meta is working on displaying &lt;a href="https://sparkar.facebook.com/blog/introducing-digital-collectibles-in-ar/" rel="noopener noreferrer"&gt;digital collectibles in AR&lt;/a&gt;. All creators need to do is import their NFTs as 2D virtual objects into Instagram Stories and combine them with “See in AR” functionality. This will open up new opportunities for collectors and creators to access and share their NFTs beyond their wallets, and it is likely to quickly become one of the key augmented reality market trends in 2023.&lt;/p&gt;

&lt;p&gt;There has also been some buzz about virtual art, or art from the real world being offered as AR experiences. For example, Sotheby’s, the fourth oldest auction house in the world, has begun offering AR experiences to bidders through an Instagram filter that allows them to see art up for auction up close and personal. Sotheby’s used this technology to &lt;a href="https://www.the-outlet.com/posts/sothebys-instagram-filter" rel="noopener noreferrer"&gt;sell a painting for $121.2 million&lt;/a&gt;. &lt;/p&gt;
&lt;h2&gt;
  
  
  Trend #2: Augmented Reality Meets Artificial Intelligence
&lt;/h2&gt;

&lt;p&gt;There are two ways that artificial intelligence plays nicely with augmented reality: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Artificial intelligence powers facial and spatial recognition software needed for AR to function. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AR and AI solutions can work together to provide innovative solutions. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These roles aren’t necessarily exclusive. They tend to blend together quite a bit. &lt;/p&gt;
&lt;h3&gt;
  
  
  AI MAKES AR WORK BETTER
&lt;/h3&gt;

&lt;p&gt;Augmented reality and artificial intelligence are separate technologies. However, it’s no surprise that AI and AR work well together due to AR’s needs. Complicated algorithms must be used to make sense of sensor data of the environment. AI can simplify that process and make it more accurate than a model made exclusively by a human. &lt;/p&gt;

&lt;p&gt;An example of this in practice is the app ClipDrop. The app allows users to quickly digitize an item in the real world into a 3D object for use in programs like PowerPoint, Photoshop, Google Docs, and more. 3D scanning can be used to import real-world objects into metaverse environments as well. It may also be a great way for businesses to speed up the pipeline of offering items for virtual try-on experiences.&lt;/p&gt;

&lt;p&gt;Automatic design is another use case of combining AR and AI. An app called SketchAR is an example of this technology in action. Users can freely draw in AR using this app. However, they can also use an AI to draw for them. The AI can create structures quickly. This shows that it’s possible for AI programs to design objects in 3D space using the real world as the source environment. In the future, this may mean that AI will be able to design and create structures for use in the real world. &lt;/p&gt;
&lt;h2&gt;
  
  
  Trend #3: Mobile Augmented Reality is Evolving
&lt;/h2&gt;

&lt;p&gt;One of the main vehicles for delivering augmented reality experiences has been mobile devices. Most consumers have some kind of mobile device, and AR headsets haven’t gone mainstream for consumer use just yet. Because of that, businesses have found a number of opportunities to leverage mobile devices for AR. The technology has improved significantly as well over the years. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmmco6w6gpfihmj34jxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmmco6w6gpfihmj34jxr.png" alt="Mobile AR 2020-2021: ARKit + ARCore"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  GEOSPATIAL API FOR ARCORE
&lt;/h3&gt;

&lt;p&gt;In 2022, Google introduced a new API for &lt;a href="https://developers.googleblog.com/2022/05/Make-the-world-your-canvas-ARCore-Geospatial-API.html" rel="noopener noreferrer"&gt;geospatial experiences&lt;/a&gt;. This allows developers to create experiences that are tied to specific locations in space. In the past, AR experiences have been purely relative to the user or in arbitrary locations set by the user. &lt;/p&gt;

&lt;p&gt;The Geospatial API allows developers to set latitude and longitude coordinates for AR content. Scanning the physical space isn’t necessary either. It works very similarly to Apple ARKit Location Anchors, comparing images of the surrounding area against Google Street View imagery to determine a specific location nearly instantaneously. &lt;/p&gt;
&lt;h3&gt;
  
  
  ARKIT 6 ENHANCEMENTS
&lt;/h3&gt;

&lt;p&gt;Several new features were introduced by Apple for their &lt;a href="https://developer.apple.com/augmented-reality/arkit/" rel="noopener noreferrer"&gt;ARKit 6 upgrade&lt;/a&gt; at WWDC 2022. One of them is 4K video recording while ARKit content is in use. The depth API has also received an upgrade to make scene occlusion and other experiences much more realistic. Apple’s LiDAR scanner allows AR experiences to be prepared so quickly that they call the technology ‘instant AR’. &lt;/p&gt;

&lt;p&gt;Apple also improved its motion capture feature. When the camera is focused on another person, motion capture data can be taken from their movements and applied to a 3D model. Another recent upgrade is people occlusion, which allows virtual objects to pass in front of and behind people in the scene. &lt;/p&gt;

&lt;p&gt;One major example of these ARKit 6 enhancements in use is through &lt;a href="https://gamingsym.in" rel="noopener noreferrer"&gt;RoomPlan&lt;/a&gt;, a solution that leverages LiDAR scanning to quickly create floor plans of a house or other structure. &lt;/p&gt;
&lt;h3&gt;
  
  
  ARCORE VS ARKIT
&lt;/h3&gt;

&lt;p&gt;The competition between Apple and Google in the augmented reality arena has been more or less the same over the past several years. As usual, both technologies are on par with one another in terms of software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntywtm2l5oh8xr9t1cor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntywtm2l5oh8xr9t1cor.png" alt="Comparisonof ARKit and ARCore features in 2022"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, hardware is where things get more interesting. Apple’s LiDAR scanner and similar technologies found on higher-end Samsung devices can deliver the highest-quality AR experiences available. However, there’s a wide variety of hardware differences between Android devices, and many simply aren’t powerful enough to handle higher-end AR. &lt;/p&gt;

&lt;p&gt;Because of this, businesses need to be strategic about the kinds of AR experiences they want to offer. If they want to offer high-quality experiences to a smaller, wealthier audience with the devices that can handle it, they can let their imaginations run wild. However, if a business is looking for an accessible experience for more devices, they will need to tone things down a bit with a simpler app. &lt;/p&gt;
&lt;h2&gt;
  
  
  Trend #4: WebAR: Better Accessibility with Compromise
&lt;/h2&gt;

&lt;p&gt;Another important trend in augmented reality is WebAR. Powered by web browsers, WebAR doesn’t require users to download additional software. This is the best-case scenario for accessibility. However, it comes at a cost — WebAR offers the most basic AR experiences and lacks many of the features that native AR can offer on mobile devices. &lt;/p&gt;

&lt;p&gt;However, in some cases, WebAR can be very useful for simple experiences like adding filters to faces, changing the color of hair or objects, replacing backgrounds, and placing simple 3D objects. Simpler virtual try-on experiences are also possible with WebAR. These are used by a number of businesses like L’Oréal and Maybelline for their cosmetic products. &lt;/p&gt;

&lt;p&gt;Tom Emrich from 8th Wall, the world’s leading WebAR development platform, notes that WebAR is the key to bridging the gap between the virtual and physical worlds. Although WebAR isn’t very powerful at the moment, the evolution of WebAR may be one of the most important ways to engage with the Internet in the future. 8th Wall is continuing to improve WebAR technologies to fulfill this vision. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3iebb557luci5m5ea9jf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3iebb557luci5m5ea9jf.png" alt="AR-compatible Devices and Active Users 2020-2021"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Trend #5: Cross-Platform AR Gains Prominence
&lt;/h2&gt;

&lt;p&gt;One major challenge in developing AR is making apps cross-platform. The unfortunate truth is that cross-platform applications will most likely not reach the full potential of native ones, although they can be of very high quality if the right steps are taken. Cross-platform AR is easier to code and can result in a faster time to market; however, performance and presentation can suffer. &lt;/p&gt;

&lt;p&gt;Generally, it’s better to keep an app native if the app is very complex and needs to use the full potential of native features. However, if the app is simpler and doesn’t need extremely high performance, cross-platform will do just fine. &lt;/p&gt;

&lt;p&gt;For example, if you are creating an online store where 90% of the functionality is not platform-dependent, does not require maximum performance, and has a simple AR product preview module, then you can opt for cross-platform AR. But if the application’s functionality requires maximum performance or is platform-dependent, then the native option is better. This applies to projects such as 3D scanning or AR navigation.&lt;/p&gt;

&lt;p&gt;Working with an augmented reality development company is a great way to build cross-platform applications with the highest quality possible. This allows you not only to improve the quality of your product but also helps you focus on other aspects of your business. &lt;/p&gt;
&lt;h2&gt;
  
  
  Trend #6: AR Glasses, Future or Fiction?
&lt;/h2&gt;

&lt;p&gt;It seems like with every year that passes, comfortable and consumer-friendly AR glasses are just around the corner. One of the latest devices up in the air is Meta’s planned mixed reality headset currently called &lt;a href="https://mixed-news.com/en/metas-next-vr-ar-headset-cambria-all-you-need-to-know/" rel="noopener noreferrer"&gt;Cambria&lt;/a&gt;. This is a new product line separate from their successful Meta Quest 2. &lt;/p&gt;

&lt;p&gt;However, the Cambria headset seems to be geared more toward wealthier audiences looking to get an early experience with the future of AR. Because of this, it seems that Cambria isn’t the magic bullet everyone was hoping for. However, it may be a step in the right direction. &lt;/p&gt;

&lt;p&gt;Another important thing to watch is the evolution of Apple’s LiDAR scanner. Apple is one of the top companies predicted to introduce a consumer-focused AR headset or glasses in the future. In 2020, its advanced depth sensor was equipped on the iPad Pro, and later on the iPhone 12 Pro. The more this technology and its processing can be miniaturized, the more likely we are to see comfortable-to-wear ‘Apple Glasses’ in the future. &lt;/p&gt;

&lt;p&gt;In addition to AR glasses, there are even more innovative devices that promise to take a prominent place among future augmented reality trends. In June 2022, Mojo Vision Labs in Saratoga, California hosted the first demonstration of augmented reality smart contact lenses. Relying on eye tracking, communications, and software, the AR lenses integrate with a user interface to enable an augmented reality experience. Mojo Lens has a custom-tuned accelerometer, gyroscope, and magnetometer that continuously track eye movements to ensure that AR images remain still as the eyes move.&lt;/p&gt;
&lt;h2&gt;
  
  
  Trend #7: AR in Marketing
&lt;/h2&gt;

&lt;p&gt;There are a number of different applications for Augmented Reality in the marketing industry. For example, business cards are a popular and simple choice that can work with simple AR solutions. By adding interactivity to marketing material like a business card, you stand out from the competition and offer potential customers a whole new and exciting experience to get to know your company. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The MobiDev demo below shows how this business idea can be implemented using ARKit.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/u8njnXc7ziY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;AR manuals are also a popular choice among businesses looking to provide their customers with more detailed and feature-rich instructions and documentation. AR not only delivers information in an engaging way, but also significantly improves the user experience without forcing the buyer to spend a lot of effort mastering a given mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The MobiDev demo below presents a virtual user instruction for a coffee machine to show AR virtual manuals in action.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/6ceN7YgSEdU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;AR also has many opportunities for use in advertising. Web banner ads have decreased in popularity with users considerably over the years. The click-through rate of banner ads dropped from &lt;a href="https://www.spaceback.com/post/how-banner-blindness-is-forcing-brands-to-shift-their-display-ad-strategies" rel="noopener noreferrer"&gt;0.72% in 2016 to 0.35% in 2019&lt;/a&gt;. One reason may be that banner ads are disruptive to the content the user is trying to access. AR ads, in contrast, may provide more seamless access to content, obstructing it less. For example, with Facebook’s new augmented reality ads, users can access AR experiences directly from their timeline through special ads with various capabilities, including virtual try-on, placing virtual objects in their homes, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trend #8: Powering Indoor and Outdoor Navigation
&lt;/h2&gt;

&lt;p&gt;In 2022, AR navigation has become more fluid and achievable than ever before. Most importantly, the rise of technologies like Bluetooth Low Energy (BLE) antennas, Wi-Fi RTT and ultra wideband (UWB) make indoor navigation much more viable than in previous years. One of the most useful applications of this technology is for displaying AR directions in large indoor locations like distribution centers, shopping malls, and airports. &lt;/p&gt;
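
&lt;p&gt;To make the idea concrete, below is a minimal Python sketch of one classical approach to beacon-based positioning: converting BLE signal strength into a distance estimate with a log-distance path-loss model and trilaterating from several beacons. The constants are illustrative, and production systems (including Wi-Fi RTT and UWB ones) apply considerably more robust filtering on top.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def rssi_to_distance(rssi, tx_power=-59.0, n=2.0):
    # log-distance path-loss model: tx_power is the RSSI measured at 1 m,
    # n is the environment-dependent path-loss exponent (both illustrative)
    return 10 ** ((tx_power - rssi) / (10 * n))

def trilaterate(beacons, distances):
    # beacons: (N, 2) known beacon positions; distances: (N,) range estimates.
    # Subtracting the first circle equation from the rest gives a linear system.
    x0, y0 = beacons[0]
    d0 = distances[0]
    A, b = [], []
    for (x, y), d in zip(beacons[1:], distances[1:]):
        A.append([2 * (x - x0), 2 * (y - y0)])
        b.append(d0**2 - d**2 + x**2 + y**2 - x0**2 - y0**2)
    position, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return position   # least-squares estimate of the device's (x, y)

beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
distances = np.array([rssi_to_distance(r) for r in (-65.0, -72.0, -70.0)])
print(trilaterate(beacons, distances))
&lt;/code&gt;&lt;/pre&gt;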

&lt;p&gt;&lt;strong&gt;Watch the demo below to find out how MobiDev implemented mobile AR for navigation of the corporate campus.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/VmROm6nbElA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Something that shouldn’t be overlooked is this technology’s potential to be used by both consumer and business users. Just as a guest in a store may use AR indoor navigation to find the product they’re looking for, a distribution center worker may use it to find a particular item in their warehouse. Although comfortable and affordable glasses with AR capability aren’t quite here yet, the capacity for the business applications of AR in distribution centers, stores, and other sectors is there. &lt;/p&gt;

&lt;p&gt;With indoor navigation, buy online pick up in store (BOPIS) services can be made much more efficient. Team members tasked with ‘picking’ items in the store for order fulfillment can use AR directions to navigate straight to an item instead of following coordinate directions. This eliminates time spent looking through many similar items and finding the correct aisle and section of the store. All the team member has to do is hold up their device and see the directions on the screen. &lt;/p&gt;

&lt;p&gt;However, there are some limitations that need to be taken into account, such as items that have been misplaced around the store. If they have been moved by guests or incorrectly logged into the system, the team member might use AR navigation on their device to arrive at an empty spot on a shelf. &lt;/p&gt;

&lt;h2&gt;
  
  
  Trend #9: Healthcare and Augmented Reality
&lt;/h2&gt;

&lt;p&gt;According to &lt;a href="https://www2.deloitte.com/global/en/pages/life-sciences-and-healthcare/articles/global-health-care-sector-outlook.html" rel="noopener noreferrer"&gt;Deloitte Research&lt;/a&gt;, augmented reality and AI will transform the traditional healthcare business model by offering AR/MR-enabled hands-free solutions and AI-based diagnostic tools. For example, Microsoft HoloLens 2 can provide information to the surgeon while allowing them to use both of their hands during the procedure. &lt;/p&gt;

&lt;p&gt;With the continued restrictions associated with Covid-19, the use of augmented reality solutions is becoming increasingly important to address issues such as the complexity of remote patient support and the increased burden on hospitals. This includes both telesurgery solutions and mental health apps that are helping people to maintain psychological balance during these difficult times. For example, features such as drawing and annotating on the 3D screen can make communication between doctors and patients much easier and clearer. Remote assistance tools can also help clinicians support their patients while reducing downtime.&lt;/p&gt;

&lt;p&gt;Combined with machine learning algorithms, AR technology can become an efficient option for disease detection. Back in 2020, Google announced the development of an AR-based microscope for the Department of Defense (DoD) to improve the accuracy of cancer diagnosis and treatment. Such a hybrid device uses a camera to capture images in real time, which are then processed with computer diagnostics to immediately display results and detect diseases at an early stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trend #10: Augmented Reality Shopping Experiences
&lt;/h2&gt;

&lt;p&gt;The onset of the COVID-19 pandemic called for numerous innovations that could help extend experiences to online shoppers. Augmented reality was one of the technologies that benefitted the most from this disruption. It resulted in an explosion of virtual try-on solutions. &lt;/p&gt;

&lt;p&gt;Brands are actively adopting AR technology to improve the user experience when shopping online. For example, Dior has repeatedly launched AR experiences allowing customers to virtually try on shoes before buying. Back in 2020, Dior teamed up with Snapchat to create such an initiative for the first time. &lt;/p&gt;

&lt;p&gt;FRED Jewelry uses AR to let customers customize bracelets on the company website with a 3D configurator and try them on virtually. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa49aeft2ewklxsl928kb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa49aeft2ewklxsl928kb.jpg" alt="FRED Jewelry Virtual Try-On Presented on Viva Tech 2022"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;FRED Jewelry Virtual Try-On Presented on Viva Tech 2022&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SMART MIRRORS
&lt;/h3&gt;

&lt;p&gt;As quarantine lockdowns have come to an end and brick-and-mortar stores have seen customers return, there is still an opportunity for AR to help with in-store experiences too. Smart mirrors are a great way to enrich the in-store experience and reduce the load on fitting rooms. Customers can walk up to smart mirrors and try on clothes in-store with advanced AR technologies not available on their smartphones. &lt;/p&gt;

&lt;p&gt;Smart mirrors are also helpful in situations where certain clothing sizes aren’t available in store and need to be shipped to customers. In-store smart mirrors and at-home virtual fitting room technologies can both help with these needs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Trend #11: Augmented Reality in Manufacturing
&lt;/h2&gt;

&lt;p&gt;Many AR applications are consumer-focused. However, AR has a lot of potential for use in industries like manufacturing. For example, worker training can be enhanced with &lt;a href="https://www.automationworld.com/process/iiot/article/21259479/how-augmented-reality-became-a-serious-tool-for-manufacturing" rel="noopener noreferrer"&gt;AR experiences powered by CAD data&lt;/a&gt;. AR can also assist technicians through routine maintenance processes. AR applications can highlight elements of devices being worked on to guide technicians through the process at hand. This is generally more accessible through head-mounted solutions than through mobile applications. &lt;/p&gt;

&lt;p&gt;In simpler applications, AR can give workers more contextual information about objects in a factory when set up appropriately. By highlighting an object with a mobile device, a worker can learn more about it and see whether any action, such as maintenance, needs to be taken. &lt;/p&gt;

&lt;p&gt;AR also shows promise for remote troubleshooting. Remote support agents can place virtual markers on the screen for workers to follow on the other end of the call. This allows for richer and more valuable remote support in factory locations. &lt;/p&gt;

&lt;h2&gt;
  
  
  Trend #12: Augmented Reality in Automotive Industries
&lt;/h2&gt;

&lt;p&gt;Augmented reality has a number of different applications that can be useful for the automotive industry. One of the more futuristic and interesting technologies emerging in this space is AR highlighting on-road objects through the use of a &lt;a href="https://www.thevirtualreport.biz/feature/65274/augmented-reality-expected-to-disrupt-the-automotive-sector/" rel="noopener noreferrer"&gt;heads-up display (HUD)&lt;/a&gt;. This can make drivers aware of hazards and GPS directions without requiring them to take their eyes off the road. AR is also in use for entertainment and information, such as 3D car manuals and other applications. &lt;/p&gt;

&lt;h3&gt;
  
  
  5G AND PARKING
&lt;/h3&gt;

&lt;p&gt;One interesting application of AR in the automotive industry is for parking assistance. With the help of 5G connectivity, empty parking spaces can be highlighted on a driver’s heads-up display. This can also provide a great deal of data that can be useful for optimizing the layouts and operations of parking facilities like parking lots and garages. &lt;/p&gt;

&lt;h3&gt;
  
  
  WAKEUP APP: DRIVER AWARENESS ASSISTANCE
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://apps.apple.com/ru/app/wakeup-by-mobidev/id1633075969" rel="noopener noreferrer"&gt;WakeUp app&lt;/a&gt; developed by MobiDev is another great example of augmented reality in the automotive industry. The objective of WakeUp is to help keep drivers awake by using ARKit facial recognition technology to detect when a driver’s eyes are closed or their head is tilted. If the eyes remain closed or the head stays tilted for too long, the device plays an alarm to wake the driver up. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwt7xo4s2nths7olv9kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwt7xo4s2nths7olv9kk.png" alt="WakeUp app"&gt;&lt;/a&gt;&lt;/p&gt;
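
&lt;p&gt;ARKit’s face tracking is a Swift API, but the underlying alert logic is easy to sketch in Python. The snippet below assumes eye landmarks coming from any face-tracking library; the eye-aspect-ratio threshold and timing values are illustrative, not the ones used in WakeUp.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
import time

def eye_aspect_ratio(eye):
    # eye: six (x, y) landmarks around one eye, ordered corner-to-corner
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    # vertical eyelid gaps over horizontal eye width;
    # the ratio collapses toward zero as the eye closes
    return (dist(eye[1], eye[5]) + dist(eye[2], eye[4])) / (2.0 * dist(eye[0], eye[3]))

EAR_CLOSED = 0.2    # below this the eye is treated as closed (tune per camera)
ALARM_AFTER = 1.5   # seconds the eyes may stay closed before the alarm

closed_since = None

def on_frame(left_eye, right_eye):
    """Call once per video frame; returns True when the alarm should play."""
    global closed_since
    ear = (eye_aspect_ratio(left_eye) + eye_aspect_ratio(right_eye)) / 2.0
    if ear > EAR_CLOSED:      # eyes open: reset the timer
        closed_since = None
        return False
    if closed_since is None:
        closed_since = time.monotonic()
    return time.monotonic() - closed_since > ALARM_AFTER
&lt;/code&gt;&lt;/pre&gt;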

&lt;p&gt;There is room for this technology to grow. For example, the TrueDepth camera, with its infrared sensing, can help perform head and eye tracking in complete darkness. Artificial intelligence could also detect driver behaviors that indicate oncoming drowsiness and alert the driver before it’s too late. These are the directions in which we plan to develop these products in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Augmented Reality
&lt;/h2&gt;

&lt;p&gt;The augmented reality market will continue to grow as the years go by, especially as technology becomes more and more accessible to consumers. With there being a significant growth in the focus on metaverse technologies, AR is the next step for many businesses. Those who are playing the long game may want to jump into this sector a bit early. &lt;/p&gt;

&lt;p&gt;However, those looking to respond to more immediate growth and change may find better success in retail and mobile applications. AR-capable smartphones and tablets are everywhere and are great opportunities to advertise and extend conversion-driving experiences to users. &lt;/p&gt;

&lt;p&gt;With the market expected to reach &lt;a href="https://www.fortunebusinessinsights.com/augmented-reality-ar-market-102553" rel="noopener noreferrer"&gt;$97.76 billion in 2028&lt;/a&gt;, it’s clear that augmented reality is the future for many industries. That future will be determined by businesses that adapt to today’s challenges in new and innovative ways. Companies that offer rich AR experiences to their customers will be much better equipped to stand up alongside their competition.&lt;/p&gt;

</description>
      <category>mobile</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Applying AI for Early Dementia Diagnosis and Prediction</title>
      <dc:creator>Andrew Makarov</dc:creator>
      <pubDate>Mon, 18 Jul 2022 10:22:48 +0000</pubDate>
      <link>https://dev.to/mobidev/applying-ai-for-early-dementia-diagnosis-and-prediction-5en7</link>
      <guid>https://dev.to/mobidev/applying-ai-for-early-dementia-diagnosis-and-prediction-5en7</guid>
      <description>&lt;p&gt;&lt;em&gt;MobiDev would like to acknowledge and give its warmest thanks to the &lt;a href="https://dementia.talkbank.org"&gt;DementiaBank&lt;/a&gt; which made this work possible by providing the data set.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Mental illnesses and diseases that cause mental symptoms are difficult to diagnose due to the uneven nature of such symptoms. One such condition is dementia. While it’s impossible to cure dementia caused by degenerative diseases, early diagnostics help reduce symptom severity with proper treatment, or slow down the illness’s progression. Moreover, &lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3038529/"&gt;about 23%&lt;/a&gt; of dementia causes are believed to be reversible when diagnosed early. &lt;/p&gt;

&lt;p&gt;Communicative and reasoning problems are some of the earliest indicators used to identify patients at risk of developing dementia. Applying AI to audio and speech processing significantly improves diagnostic capabilities for dementia and helps spot early signs years before significant symptoms develop. &lt;/p&gt;

&lt;p&gt;In this study, we’ll describe our experience creating a speech processing model that predicts dementia risk, including the pitfalls and challenges in speech classification tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Speech Processing Techniques
&lt;/h2&gt;

&lt;p&gt;Artificial intelligence offers a range of techniques to classify raw audio information, which often passes through pre-processing and annotation. In audio classification tasks we generally strive to improve the sound quality and clean up any present anomalies before training the model. &lt;/p&gt;

&lt;p&gt;If we speak about classification tasks involving human speech, generally, there are two major types of audio processing techniques used for extracting meaningful information:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic speech recognition&lt;/strong&gt; or &lt;strong&gt;ASR&lt;/strong&gt; is used to recognize or transcribe spoken words into a written form for further processing, feature extraction, and analysis. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language processing&lt;/strong&gt; or &lt;strong&gt;NLP&lt;/strong&gt;, is a technique for understanding human speech in context by a computer. NLP models generally apply complex linguistic rules to derive meaningful information from sentences, determining syntactic and grammatical relations between words.&lt;/p&gt;

&lt;p&gt;Pauses in speech can also be meaningful to the results of a task, and audio processing models can also distinguish between different sound classes like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;human voices&lt;/li&gt;
&lt;li&gt;animal sounds&lt;/li&gt;
&lt;li&gt;machine noises&lt;/li&gt;
&lt;li&gt;ambient sounds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of the different sounds above may be removed from the target audio files because they can worsen overall audio quality or impact model prediction. &lt;/p&gt;

&lt;h3&gt;
  
  
  HOW DOES AI SPEECH PROCESSING APPLY TO DEMENTIA DIAGNOSIS?
&lt;/h3&gt;

&lt;p&gt;People with Alzheimer’s disease and dementia exhibit a number of communication impairments, such as reasoning struggles, focusing problems, and memory loss. Impairment in cognition can be spotted during neuropsychological testing performed on individuals.&lt;/p&gt;

&lt;p&gt;If captured in audio recordings, these defects can be used as features for training a classification model that will find the difference between a healthy person and an ill one. Since an AI model can process enormous amounts of data while maintaining classification accuracy, integrating this method into dementia screening can improve overall diagnostic accuracy. &lt;/p&gt;

&lt;p&gt;Dementia-detection systems based on neural networks have two potential applications in healthcare:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Early dementia diagnostics&lt;/strong&gt;. Using recordings of neuropsychological tests, patients can learn about the early signs of dementia long before brain cell damage occurs. Even phone recordings of test sessions appear to be an accessible and fast way to screen the population compared to conventional appointments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracking dementia progression&lt;/strong&gt;. Dementia is a progressive condition, which means its symptoms tend to progress and manifest differently over time. Classification models for dementia detection can also be used to track changes in a patient’s mental condition and learn how the symptoms develop, or how treatment affects manifestation. &lt;/p&gt;

&lt;p&gt;So now, let’s discuss how we can train the actual model, and what approaches appear most effective in classifying dementia.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you train AI to analyze dementia patterns?
&lt;/h2&gt;

&lt;p&gt;The goal of this experiment was to detect as many sick people as possible out of the available data. For this, we needed a classification model that was able to extract features and find the differences between healthy and ill people. &lt;/p&gt;

&lt;p&gt;The method used for dementia detection applies neural networks both for feature extraction and classification. Since audio data has a complex and continuous nature with multiple sonic layers, neural networks appear superior to traditional machine learning for feature extraction. In this research, two types of models were used: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Speech-representation neural network&lt;/em&gt;, which is responsible for extracting speech features (embeddings), and&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Classification model&lt;/em&gt; which learns patterns from feature-extractor output&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In terms of data, recordings of the &lt;em&gt;Cookie Theft&lt;/em&gt; neuropsychological examination were used to train the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SHyPgMlw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jo4amg98p87148v3pt5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SHyPgMlw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jo4amg98p87148v3pt5g.png" alt="Cookie theft graphic task for dementia diagnosis" width="850" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Image source: &lt;a href="https://www.researchgate.net/figure/The-standardized-Cookie-Theft-picture-Goodglass-and-Kaplan-1983_fig1_315999221"&gt;researchgate.net&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a nutshell, Cookie Theft is a graphic task that requires patients to describe the events happening in the picture. Since people suffering from early symptoms of dementia experience cognitive problems, they often fail to explain the scene in words, repeat thoughts, or lose the narrative chain. All of the mentioned symptoms can be spotted in recorded audio, and used as features for training classification models.&lt;/p&gt;

&lt;h3&gt;
  
  
  ANALYZING DATA
&lt;/h3&gt;

&lt;p&gt;For the model training and evaluation we used a &lt;a href="https://dementia.talkbank.org"&gt;DementiaBank&lt;/a&gt; dataset consisting of 552 Cookie Theft recordings. The data represents people of different ages split into two groups: healthy, and those diagnosed with Alzheimer’s disease, the most common cause of dementia. The DementiaBank dataset shows a balanced distribution of healthy and ill people, which means neural networks will consider both classes during the training procedure, without skewing toward only one class.&lt;/p&gt;

&lt;p&gt;The dataset contains samples of varying length, loudness, and noise level. The total length of the dataset is 10 hours 42 minutes, with an average audio length of 70 seconds. During preparation, it was noted that the recordings of healthy people are overall shorter, which is logical, since ill people struggle to complete the task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XuX3ORUk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dpkcto6ul9x1maak26wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XuX3ORUk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dpkcto6ul9x1maak26wt.png" alt="Audio length distribution in DementiaBank dataset" width="613" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, relying on speech length alone doesn’t guarantee meaningful classification results: some patients suffer only mild symptoms, and a model could become biased toward quick descriptions. &lt;/p&gt;

&lt;h3&gt;
  
  
  DATA PREPROCESSING
&lt;/h3&gt;

&lt;p&gt;Before actual training, the obtained data has to go through a number of preparation procedures. Audio processing models are sensitive to recording quality, as well as to the omission of words in sentences. Poor-quality data may worsen prediction results, since a model may struggle to find relationships in information when part of a recording is corrupted.&lt;/p&gt;

&lt;p&gt;Preprocessing sound entails cleaning out unnecessary noises, improving general audio quality, and annotating the required parts of an audio recording. Approximately 60% of the DementiaBank dataset was initially poor-quality data. We tested both AI and non-AI approaches to normalize loudness levels and reduce noise in the recordings. &lt;/p&gt;
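
&lt;p&gt;On the non-AI side, loudness normalization takes only a few lines with a Python audio library such as pydub. This is a minimal sketch; the target level is illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pydub import AudioSegment

TARGET_DBFS = -20.0   # illustrative loudness target

def normalize_loudness(path_in, path_out):
    audio = AudioSegment.from_file(path_in)
    # shift the clip's average loudness to the target level
    audio = audio.apply_gain(TARGET_DBFS - audio.dBFS)
    audio.export(path_out, format="wav")
&lt;/code&gt;&lt;/pre&gt;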

&lt;p&gt;&lt;a href="https://huggingface.co/speechbrain/metricgan-plus-voicebank"&gt;Huggingface MetricGan&lt;/a&gt; model was used to automatically improve audio quality, although the majority of the samples weren’t improved enough. Additionally, Python audio processing libraries and &lt;a href="https://www.audacityteam.org"&gt;Audacity&lt;/a&gt; were used to further improve data quality. &lt;/p&gt;

&lt;p&gt;For very poor quality audio, additional cycles of preprocessing may be required using different Python libraries, or audio mastering tools like &lt;a href="https://www.izotope.com/en/products/rx.html"&gt;Izotope RX&lt;/a&gt;. But in our case, the aforementioned preprocessing steps dramatically increased data quality. During preprocessing, the poorest-quality samples were deleted: 29 samples (29 min 50 sec of audio), only 4% of the total dataset length.&lt;/p&gt;

&lt;h3&gt;
  
  
  APPROACHES TO SPEECH CLASSIFICATION
&lt;/h3&gt;

&lt;p&gt;As you might remember, neural network models are used in conjunction to extract features and classify recordings. In speech classification tasks, there are generally two approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converting speech to text, and using text as an input for the classification model training. &lt;/li&gt;
&lt;li&gt;Extracting high-level speech representations to conduct classification on them. This approach is an end-to-end solution, since audio data doesn’t require conversion into other formats.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In our research, we use both approaches to see how they differ in terms of classification accuracy.&lt;/p&gt;

&lt;p&gt;Another important point is that all feature extractors were trained in two steps. On the first iteration, the model is pre-trained in a &lt;a href="https://jonathanbgn.com/2020/12/31/self-supervised-learning.html"&gt;self-supervised&lt;/a&gt; way on pretext tasks such as language modeling (auxiliary task). In the second step, the model is fine-tuned on downstream tasks in a standard supervised way using human-labeled data. &lt;/p&gt;

&lt;p&gt;The pretext task should force the model to encode the data to a meaningful representation that can be reused for fine-tuning later. For example, a speech model trained in a self-supervised way needs to learn about sound structure and characteristics to effectively predict the next audio unit. This speech knowledge can be re-used in a downstream task like converting speech into text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modeling
&lt;/h2&gt;

&lt;p&gt;To evaluate the results of model classification, we’ll use a set of metrics that will help us determine the accuracy of the model output.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; evaluates the fraction of actual dementia recordings that the model correctly identified. In other words, recall shows how many of the ill patients our model managed to catch. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt; indicates how many of the records classified as dementia are actually dementia cases. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;F1 Score was used as a metric that calculates the harmonic mean of recall and precision. The formula looks like this: F1 = 2 * Recall * Precision / (Recall + Precision).&lt;/p&gt;

&lt;p&gt;Additionally, for the first approach, where we converted audio to text, Word Error Rate (WER) was used to count the substitutions, deletions, and insertions between the extracted text and the target transcript. &lt;/p&gt;
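
&lt;p&gt;For clarity, here is a minimal sketch of how these metrics are computed from raw prediction counts. The WER call uses the jiwer library as one common choice; it is not necessarily the tooling used in this study.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def precision_recall_f1(tp, fp, fn):
    # tp: dementia records correctly flagged; fp: healthy records flagged
    # as dementia; fn: dementia records the model missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

# e.g. 40 true positives, 10 false positives, 5 false negatives
print(precision_recall_f1(40, 10, 5))   # (0.8, 0.888..., 0.842...)

# Word Error Rate between a reference transcript and an ASR hypothesis:
# one substitution + one deletion over five reference words = 0.4
import jiwer
print(jiwer.wer("the boy steals a cookie", "the boy steal cookie"))
&lt;/code&gt;&lt;/pre&gt;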

&lt;h3&gt;
  
  
  APPROACH 1: SPEECH-TO-TEXT IN DEMENTIA CLASSIFICATION
&lt;/h3&gt;

&lt;p&gt;For the first approach, two models were used as feature extractors: &lt;a href="https://huggingface.co/facebook/wav2vec2-base"&gt;wav2vec 2.0 base&lt;/a&gt; and &lt;a href="https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5"&gt;NEMO QuartzNet&lt;/a&gt;. While these models convert speech into text and extract features from it, the &lt;a href="https://huggingface.co/docs/transformers/model_doc/bert"&gt;HuggingFace BERT&lt;/a&gt; model performs the role of a classifier.&lt;/p&gt;

&lt;p&gt;The text extracted by wav2vec 2.0 appeared to be more accurate than the QuartzNet output. On the flip side, wav2vec 2.0 took significantly longer to process audio, which makes it less preferable for real-time tasks. In contrast, QuartzNet shows faster performance due to its lower number of parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d82IRJ07--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wf957gtvqyls9aiekc8o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d82IRJ07--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wf957gtvqyls9aiekc8o.png" alt="End-to-end dementia classification with AI" width="880" height="1370"&gt;&lt;/a&gt;&lt;/p&gt;
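
&lt;p&gt;For reference, transcribing a recording with a CTC fine-tuned wav2vec 2.0 checkpoint takes only a few lines with HuggingFace Transformers. This is a sketch: the exact checkpoints and decoding settings used in the study may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# a CTC head fine-tuned for English ASR; the bare base model does not transcribe
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("cookie_theft_042_clean.wav")   # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])   # the transcript later fed to BERT
&lt;/code&gt;&lt;/pre&gt;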

&lt;p&gt;The next step was feeding the text extracted by both models into the BERT classifier for training. Eventually, the training logs showed that BERT wasn’t learning at all. This could happen due to the following factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converting audio speech into text means losing information about pitch, pauses, and loudness. Once we extract the text, there is no way for feature extractors to convey this information, even though pauses are meaningful for dementia classification.&lt;/li&gt;
&lt;li&gt;The second reason is that the BERT model uses a predefined vocabulary to convert word sequences into tokens. Depending on the recording quality, the model can lose information it’s unable to recognize. This leads to the omission of, for example, malformed words that are still meaningful for the prediction results.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since this approach didn’t bring meaningful results, let’s proceed to the end-to-end processing approach and discuss its training results. &lt;/p&gt;

&lt;h3&gt;
  
  
  APPROACH 2: END-TO-END PROCESSING
&lt;/h3&gt;

&lt;p&gt;Neural networks represent a stack of layers, where each layer is responsible for catching certain information. In the early layers, models learn information about raw sound units, also called low-level audio features; these have no human-interpretable meaning. Deeper layers represent more human-understandable features like words and phonemes. &lt;/p&gt;

&lt;p&gt;The end-to-end approach entails using speech features from intermediate layers. In this case, speech representation models (ALBERT or HuBERT) were used as feature extractors. Both feature extractors were used for transfer learning, while the classification models were fine-tuned. For the classification tasks we used two custom s3prl downstream models: an attention-based classifier originally trained on the SNIPS dataset and a linear classifier originally trained on the Fluent commands dataset; both models were eventually fine-tuned on the DementiaBank dataset.&lt;/p&gt;
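
&lt;p&gt;As a simplified sketch of this idea, using HuggingFace checkpoints rather than the exact s3prl setup, one can mean-pool hidden states from a frozen HuBERT upstream model and train a small classification head on top:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
hubert.eval()   # frozen upstream model, used purely for transfer learning

head = nn.Linear(hubert.config.hidden_size, 2)   # healthy vs. dementia

def logits_for(waveform_16k):
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(inputs.input_values).last_hidden_state   # (1, T, 768)
    return head(hidden.mean(dim=1))   # mean-pool over time, then classify

# only the head is trained, e.g. with cross-entropy on DementiaBank labels
&lt;/code&gt;&lt;/pre&gt;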

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ER9LoeJT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vmzschjcmem85t1whfcv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ER9LoeJT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vmzschjcmem85t1whfcv.png" alt="Dementia models’ inference results" width="652" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the inference results of the end-to-end solution, using speech features instead of text with fine-tuned downstream models led to more meaningful results. Namely, the combination of HuBERT and an attention-based classifier shows the strongest result among all approaches. In this case, the classifiers learned to catch relevant information that helps differentiate between healthy people and those with dementia. &lt;/p&gt;

&lt;p&gt;For a detailed description of the models and fine-tuning methods used, you can download the PDF version of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to improve the results?
&lt;/h2&gt;

&lt;p&gt;Given the two different approaches to dementia classification with AI, we can derive a couple of recommendations to improve the model output:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use more data&lt;/strong&gt;. Dementia can have different manifestations depending on the cause and patient age, as symptoms vary from person to person. Obtaining more samples of dementia speech allows us to train models on more diverse data, which can result in more accurate classification. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improve the preprocessing procedure&lt;/strong&gt;. Besides the number of samples, data quality also matters. While we can’t correct the initial defects in speech or in the actual recording, preprocessing can significantly improve audio quality. This results in less meaningful information being lost during feature extraction and has a positive impact on training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alter models&lt;/strong&gt;. As the end-to-end experiments show, different upstream and downstream models yield different accuracy. Trying different model combinations for speech classification may improve classification accuracy. &lt;/p&gt;

&lt;p&gt;As the test results show, applying neural networks to the analysis of dementia audio recordings can generate accurate predictions. Training neural networks for speech classification tasks is a complex exercise that requires data science expertise as well as audio processing knowledge.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>datascience</category>
    </item>
    <item>
      <title>AR &amp; AI Technologies For Virtual Fitting Room Development</title>
      <dc:creator>Maksym Tatariants</dc:creator>
      <pubDate>Sun, 21 Mar 2021 16:11:17 +0000</pubDate>
      <link>https://dev.to/mobidev/ar-ai-technologies-for-virtual-fitting-room-development-2gbf</link>
      <guid>https://dev.to/mobidev/ar-ai-technologies-for-virtual-fitting-room-development-2gbf</guid>
      <description>&lt;p&gt;I hate shopping in brick and mortar stores. However, my interest in virtual shopping is not limited to the buyer experience only. With the MobiDev DataScience department, I’ve gained experience in working on AI technologies for virtual fitting. The goal of this article is to describe how these systems work from the inside.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Virtual Fitting Technology Works
&lt;/h2&gt;

&lt;p&gt;A few years ago, the “Try before you buy” strategy was an efficient customer engagement method in outfit stores. Now, this strategy exists in the form of virtual fitting rooms. Fortune Business Insights &lt;a href="https://www.fortunebusinessinsights.com/industry-reports/virtual-fitting-room-vfr-market-100322" rel="noopener noreferrer"&gt;predicted&lt;/a&gt; that the virtual fitting room market will reach USD 10.00 billion by 2027.&lt;/p&gt;

&lt;p&gt;To better understand the logic of virtual fitting room technology, let’s review the following example. Some time ago, we worked on an Augmented Reality (AR) footwear fitting room project. The fitting room works in the following way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The input video is split into frames and processed with a deep learning model which estimates the positions of a set of specific leg and foot keypoints.
Read the related article: &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach" rel="noopener noreferrer"&gt;3D Human Pose Estimation in Fitness Coach Apps&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A 3D model of footwear is placed according to the detected keypoints so that its orientation looks natural to the user (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;The 3D footwear model is rendered so that each frame displays realistic textures and lighting.&lt;/li&gt;
&lt;/ol&gt;
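
&lt;p&gt;To give a feel for step 2, the numpy sketch below aligns a 3D asset with detected keypoints, assuming we already know the matching reference points on the model. The Kabsch algorithm used here is a standard choice for this kind of rigid alignment, not necessarily the one used in the project.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def kabsch_rotation(model_pts, detected_pts):
    # both arrays are (N, 3): N matching keypoints on the 3D footwear model
    # and the same keypoints detected on the user's foot
    P = model_pts - model_pts.mean(axis=0)
    Q = detected_pts - detected_pts.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def place_model(model_pts, detected_pts):
    R = kabsch_rotation(model_pts, detected_pts)
    t = detected_pts.mean(axis=0) - R @ model_pts.mean(axis=0)
    return R, t   # apply as: world_point = R @ model_point + t
&lt;/code&gt;&lt;/pre&gt;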

&lt;p&gt;&lt;a href="https://mobidev.biz/wp-content/uploads/2020/09/ar-based-virtual-try-on-technology.gif" rel="noopener noreferrer"&gt;Utilization of ARKit for 3D human body pose estimation and 3D model rendering&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When working with &lt;a href="https://mobidev.biz/blog/arkit-guide-augmented-reality-app-development-ios" rel="noopener noreferrer"&gt;ARKit&lt;/a&gt; (Augmented Reality framework for Apple’s devices) we discovered that it has rendering limitations. As you can see from the video above, the tracking accuracy is too low to use for footwear positioning. The likely cause is that ARKit maintains inference speed at the expense of tracking accuracy, which is critical for apps working in real time.&lt;/p&gt;

&lt;p&gt;One more issue was the poor identification of body parts by the ARKit algorithm. Since this algorithm aims to identify the whole body, it doesn’t detect any keypoints if the processed image contains only a part of the body. This is exactly the case for a footwear fitting room, where the algorithm is supposed to process only a person’s legs. &lt;/p&gt;

&lt;p&gt;The conclusion was that virtual fitting room apps might require additional functionality beyond the standard AR libraries. Thus, it’s recommended to involve data scientists to develop a custom pose estimation model designed to detect keypoints on only one or two feet in the frame and operate in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual Fitting Room Solutions
&lt;/h2&gt;

&lt;p&gt;The virtual fitting room technology market provides offerings for accessories, watches, glasses, hats, clothes, and others. Let’s review how some of these solutions work under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  WATCHES
&lt;/h3&gt;

&lt;p&gt;A good example of virtual watch try-on is the &lt;a href="https://apps.apple.com/us/app/ar-watches-augmented-reality/id1435312889" rel="noopener noreferrer"&gt;AR-Watches app&lt;/a&gt;, which allows users to try on various watches. The solution is based on &lt;a href="https://en.wikipedia.org/wiki/ARTag" rel="noopener noreferrer"&gt;ARTag technology&lt;/a&gt;, which uses specific markers printed on a band worn on the user’s wrist in place of a watch to start a virtual try-on. The computer vision algorithm processes only the markers visible in the frame and identifies the camera’s position relative to them. After that, to render a 3D object correctly, the virtual camera is placed at the same location.&lt;/p&gt;

&lt;p&gt;Overall, the technology has its limits (for instance, not everybody has a printer at hand to print out the ARTag band). But if it matches the business use case, it wouldn’t be that difficult to create a production-quality product. Probably the most important part would be creating proper 3D objects to use.&lt;/p&gt;

&lt;p&gt;3D model rendering of a watch using the ARTag technology&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/yLnGjabCDD0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  SHOES
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://play.google.com/store/apps/details?id=by.wanna.apps.wsneakers&amp;amp;hl=en" rel="noopener noreferrer"&gt;Wanna Kicks&lt;/a&gt; and &lt;a href="https://apps.apple.com/us/app/sneakerkit/id1463772901" rel="noopener noreferrer"&gt;SneakerKit&lt;/a&gt; apps are a good demonstration of how AR and deep learning technologies might be applied for footwear.&lt;/p&gt;

&lt;p&gt;Virtual shoe try-on, Wanna Kicks app&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/02e20PkYeXQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Technically, such a solution utilizes a deep-learning-based foot pose estimation model. It can be considered a special case of the widespread full-body &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach" rel="noopener noreferrer"&gt;3D pose estimation&lt;/a&gt; models, which estimate the positions of selected keypoints either directly in 3D or by lifting detected 2D keypoints into 3D coordinates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwm0pb236ggud860wwlw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwm0pb236ggud860wwlw.gif" alt="3d-foot-pose-estimation-virtual-try-on"&gt;&lt;/a&gt;&lt;br&gt;
3D foot pose estimation &lt;a href="https://labs.laan.com/blog/leveraging-photogrammetry-to-increase-data-annotation-efficiency-in-ML.html" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the 3D keypoints of the feet are detected, they can be used to create a parametric 3D model of a human foot, and to position and scale a footwear 3D model according to the geometric properties of that parametric model.&lt;/p&gt;
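
&lt;p&gt;As a simplified illustration, assuming three named keypoints (heel, toe, ankle) and a shoe mesh authored for a known foot length, the scale and rigid transform could be derived like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Sketch: derive scale and a rigid transform for the footwear model from
# detected 3D foot keypoints (names and values are illustrative).
MODEL_FOOT_LENGTH = 0.27  # foot length the shoe mesh was authored for, meters

def fit_footwear_transform(heel, toe, ankle):
    heel, toe, ankle = (np.asarray(p, float) for p in (heel, toe, ankle))
    length = np.linalg.norm(toe - heel)
    scale = length / MODEL_FOOT_LENGTH       # match shoe size to the foot

    x = (toe - heel) / length                # forward axis of the foot
    up_hint = ankle - heel
    z = np.cross(x, up_hint)                 # lateral axis
    z /= np.linalg.norm(z)
    y = np.cross(z, x)                       # orthogonal "up" axis
    rotation = np.stack([x, y, z], axis=1)   # 3x3 rotation matrix
    translation = heel
    return scale, rotation, translation

scale, R, t = fit_footwear_transform(
    heel=[0.0, 0.0, 0.5], toe=[0.24, 0.02, 0.5], ankle=[0.02, 0.09, 0.5])
print(scale, R, t, sep="\n")&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;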

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtr3er6si4dt5emnaohl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtr3er6si4dt5emnaohl.gif" alt="3d-model-human-foot-virtual-try-on"&gt;&lt;/a&gt;&lt;br&gt;
Positioning of a 3D model of footwear on top of a detected parametric foot model &lt;a href="https://www.vyking.io/video/Vyking_SneakerStudio.mp4" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compared to the full-body/face pose estimation model, foot pose estimation still has certain challenges. The main issue is the lack of 3D annotation data required for model training. &lt;/p&gt;

&lt;p&gt;The optimal way to mitigate this problem is to use &lt;a href="https://www.di.ens.fr/willow/research/surreal/" rel="noopener noreferrer"&gt;synthetic data&lt;/a&gt;, which involves rendering photorealistic 3D models of human feet with keypoint annotations and training a model on that data; or to use photogrammetry, which reconstructs a 3D scene from multiple 2D views to &lt;a href="https://labs.laan.com/blog/leveraging-photogrammetry-to-increase-data-annotation-efficiency-in-ML.html" rel="noopener noreferrer"&gt;reduce the amount of labeling needed&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This kind of solution is considerably more complicated. To enter the market with a ready-to-use product, it is necessary to collect a large enough foot keypoint dataset (using synthetic data, photogrammetry, or a combination of both), train a customized pose estimation model (combining sufficiently high accuracy with real-time inference speed), test its robustness in various conditions, and create a foot model. We consider it a medium-complexity project in terms of technologies.&lt;/p&gt;
&lt;h3&gt;
  
  
  GLASSES
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.fittingbox.com/en/" rel="noopener noreferrer"&gt;FittingBox&lt;/a&gt; and &lt;a href="https://ditto.com/" rel="noopener noreferrer"&gt;Ditto&lt;/a&gt; companies considered AR technology for the virtual glasses try-on. The user should choose a glasses model from a virtual catalog and it is put on his/her eyes.&lt;/p&gt;

&lt;p&gt;Virtual glasses try-on and lenses simulation&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/p0dGmaiQKAg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This solution is based on a deep-learning-powered pose estimation approach used for facial landmark detection, where the common annotation format includes 68 2D/3D facial landmarks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkfbig13z603oi6fpwzh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkfbig13z603oi6fpwzh.gif" alt="face-pose-estimation"&gt;&lt;/a&gt;&lt;br&gt;
Example of facial landmark detection in video. Note that the model in the video detects more than 68 landmarks &lt;a href="https://firebase.googleblog.com/2018/11/ml-kit-adds-face-contours-to-create.html" rel="noopener noreferrer"&gt;(source)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such an annotation format allows differentiating the face contour, nose, eyes, eyebrows, and lips with a sufficient accuracy level. Rather than training a face landmark estimation model from scratch, you can start from open-source libraries such as &lt;a href="https://github.com/1adrianb/face-alignment" rel="noopener noreferrer"&gt;Face Alignment&lt;/a&gt;, which provide face pose estimation functionality out of the box.&lt;/p&gt;
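
&lt;p&gt;As a quick illustration, a minimal 68-point landmark script built on that library might look as follows. The input image name is illustrative, and the enum spelling varies between library versions (recent releases use LandmarksType.TWO_D):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cv2
import face_alignment  # github.com/1adrianb/face-alignment

# 68-point 2D landmark detector with pre-trained weights.
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D,
                                  device="cpu")

image = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2RGB)
landmarks = fa.get_landmarks(image)  # one (68, 2) array per detected face

if landmarks:
    for x, y in landmarks[0].astype(int).tolist():
        cv2.circle(image, (x, y), 2, (0, 255, 0), -1)
    # In the 68-point convention, points 36-47 outline the eyes, which is
    # what a glasses try-on uses to anchor and scale the frame model.
    eyes = landmarks[0][36:48]
    print("eye region center:", eyes.mean(axis=0))
cv2.imwrite("landmarks.jpg", cv2.cvtColor(image, cv2.COLOR_RGB2BGR))&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;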

&lt;p&gt;In terms of technologies, this kind of solution is not that complicated, especially if a pre-trained model is used as a basis for the &lt;a href="https://mobidev.biz/blog/custom-face-detection-recognition-software-development" rel="noopener noreferrer"&gt;face recognition task&lt;/a&gt;. But it’s important to consider that low-quality cameras and poor lighting conditions can be limiting factors.&lt;/p&gt;
&lt;h3&gt;
  
  
  SURGICAL MASKS
&lt;/h3&gt;

&lt;p&gt;Amidst the COVID-19 pandemic, &lt;a href="https://zap.works/" rel="noopener noreferrer"&gt;ZapWorks&lt;/a&gt; launched an AR-based educational &lt;a href="https://viewtoo.arweb.app/?zid=z/bEPn1c&amp;amp;toolbar=0" rel="noopener noreferrer"&gt;app&lt;/a&gt; that instructs users on how to wear surgical masks properly. Technically, this app is also based on a 3D facial landmark detection method. As in the glasses try-on app, this method provides information about facial features for the subsequent mask rendering.&lt;/p&gt;

&lt;p&gt;AR for mask wear guide&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/HvTYcEQdrcc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  HATS
&lt;/h3&gt;

&lt;p&gt;Given that facial landmark detection models work well, another frequently simulated AR item is the hat. All that is required for correct rendering of a hat on a person’s head are the 3D coordinates of a few keypoints: the temples and the center of the forehead. Virtual hat try-on apps have already been launched by &lt;a href="https://www.quytech.com/" rel="noopener noreferrer"&gt;QUYTECH&lt;/a&gt;, &lt;a href="https://www.banuba.com/" rel="noopener noreferrer"&gt;Banuba&lt;/a&gt;, and &lt;a href="https://www.vertebrae.com/" rel="noopener noreferrer"&gt;Vertebrae&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Baseball cap try-on&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/RAIm7blzkD0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  CLOTHES
&lt;/h3&gt;

&lt;p&gt;Compared to shoes, masks, glasses, and watches, virtual try-on of 3D clothes remains a challenge. The reason is that clothes deform as they take the shape of a person’s body. Thus, for a proper AR experience, a deep learning model should identify not only the basic keypoints at the human body’s joints but also the body shape in 3D. &lt;/p&gt;

&lt;p&gt;Looking at &lt;a href="https://github.com/facebookresearch/Densepose" rel="noopener noreferrer"&gt;DensePose&lt;/a&gt;, one of the most recent deep learning models, which maps the pixels of an RGB image of a person to the 3D surface of the human body, we find that it’s still not quite suitable for augmented reality. DensePose’s inference speed is not appropriate for real-time apps, and its body mesh detections lack the accuracy needed for fitting 3D clothing items. Improving these results would require collecting more annotated data, which is a time- and resource-consuming task. &lt;/p&gt;

&lt;p&gt;The alternative is to use 2D clothing items and 2D silhouettes of people. That’s what the &lt;a href="https://zeekit.me/" rel="noopener noreferrer"&gt;Zeekit&lt;/a&gt; company does, giving users the possibility to apply a number of clothing types (dresses, pants, shirts, etc.) to their photo.&lt;/p&gt;

&lt;p&gt;2D clothing try-on, Zeekit&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/IXIbeBQwgDA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Strictly speaking, the method of transferring 2D clothing images cannot be considered Augmented Reality, since the “Reality” aspect implies real-time operation; however, it can still provide an unusual and immersive user experience. The underlying technologies comprise &lt;a href="https://towardsdatascience.com/generative-adversarial-networks-explained-34472718707a" rel="noopener noreferrer"&gt;Generative Adversarial Networks&lt;/a&gt;, &lt;a href="https://www.kdnuggets.com/2020/08/3d-human-pose-estimation-experiments-analysis.html" rel="noopener noreferrer"&gt;Human Pose Estimation&lt;/a&gt;, and &lt;a href="http://sysu-hcp.net/lip/index.php" rel="noopener noreferrer"&gt;Human Parsing&lt;/a&gt; models. The 2D clothes transferring algorithm may look as follows (a simplified sketch is shown after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the areas in the image corresponding to individual body parts&lt;/li&gt;
&lt;li&gt;Detect the positions of the identified body parts&lt;/li&gt;
&lt;li&gt;Produce a warped image of the transferred clothing item&lt;/li&gt;
&lt;li&gt;Apply the warped image to the image of the person with minimal artifacts&lt;/li&gt;
&lt;/ol&gt;
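
&lt;p&gt;A heavily simplified sketch of steps 2-4 is shown below: it warps a flat garment image onto a person photo using three corresponding keypoints and alpha blending. The file names and keypoint coordinates are illustrative; a real system would obtain them from human parsing and pose estimation models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cv2
import numpy as np

person = cv2.imread("person.jpg")
garment = cv2.imread("shirt.png", cv2.IMREAD_UNCHANGED)  # assumed BGRA

src = np.float32([[40, 30], [260, 30], [150, 380]])    # points on the garment
dst = np.float32([[210, 190], [390, 200], [300, 520]]) # same points on body

M = cv2.getAffineTransform(src, dst)                   # step 3: warp
h, w = person.shape[:2]
warped = cv2.warpAffine(garment, M, (w, h))

alpha = warped[:, :, 3:4].astype(float) / 255.0        # garment mask
blended = warped[:, :, :3] * alpha + person * (1.0 - alpha)  # step 4: compose
cv2.imwrite("try_on.jpg", blended.astype(np.uint8))&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;GAN-based models like ACGPN replace this plain affine warp with learned warping and inpainting, which is what allows them to handle occlusions and varying body shapes.&lt;/p&gt;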

&lt;h3&gt;
  
  
OUR EXPERIMENTS WITH 2D CLOTH TRANSFERRING
&lt;/h3&gt;

&lt;p&gt;Since there are no ready pre-trained models for a virtual dressing room, we researched this field by experimenting with the &lt;a href="https://arxiv.org/abs/2003.05863" rel="noopener noreferrer"&gt;ACGPN model&lt;/a&gt;. The idea was to explore the outputs of this model in practice for 2D cloth transferring using various approaches.&lt;/p&gt;

&lt;p&gt;The model was applied to images of people in constrained (samples from the VITON training dataset) and unconstrained (any environment) conditions. In addition, we tested the limits of the model’s capabilities by running it not only on custom images of people but also on custom clothing images that were quite different from the training data.&lt;/p&gt;

&lt;p&gt;Here are examples of results we received during the research:&lt;/p&gt;

&lt;p&gt;1) Replication of results described in the “Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content” research paper, with the original data and our preprocessing models:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pqm5ew6xtvvz5bvwaae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pqm5ew6xtvvz5bvwaae.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Successful (A1-A3) and unsuccessful (B1-B3) replacement of clothing &lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;B1 – poor inpainting&lt;/li&gt;
&lt;li&gt;B2 – new clothes overlapping&lt;/li&gt;
&lt;li&gt;B3 – edge defects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) Application of custom clothes to default person images:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xkk9i3wty1zy8zy7ye6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xkk9i3wty1zy8zy7ye6.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Clothing replacement using custom clothes&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – no defects &lt;/li&gt;
&lt;li&gt;Row B – some defects to be moderated &lt;/li&gt;
&lt;li&gt;Row C – critical defects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3) Application of default clothes to the custom person images:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro5lt3hplgztlq5g2knb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro5lt3hplgztlq5g2knb.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Outputs of clothing replacement on images with an unconstrained environment&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – edge defects (minor)&lt;/li&gt;
&lt;li&gt;Row B – masking errors (moderate)&lt;/li&gt;
&lt;li&gt;Row C – inpainting and masking errors (critical) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4) Application of custom clothes to the custom person images:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh8v4y4129g9eeanu65x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh8v4y4129g9eeanu65x.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Clothing replacement with the unconstrained environment and custom clothing images&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row A – best results obtained from the model&lt;/li&gt;
&lt;li&gt;Row B – many defects to be moderated&lt;/li&gt;
&lt;li&gt;Row C – most distorted results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When analyzing the outputs, we discovered that virtual clothes try-on still has certain limitations. The main one is that the training data must contain paired images of the target garment and of people wearing that garment, which may be challenging to collect in a real-world business scenario. The other takeaways from the research are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ACGPN model outputs rather good results on images of people from the training dataset. This also holds when custom clothing items are applied.&lt;/li&gt;
&lt;li&gt;The model is unstable when processing images of people captured in varying lighting, other environmental conditions, or unusual poses.&lt;/li&gt;
&lt;li&gt;The technology for virtual dressing room systems that transfer 2D clothing images onto an image of the target person in the wild is not yet ready for commercial applications. However, if the conditions are static, the expected results can be much better.&lt;/li&gt;
&lt;li&gt;The main limiting factor that holds back the development of better models is the lack of diverse datasets with people captured in outdoor conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In conclusion, I’d say that current virtual fitting rooms work well for items related to separate body parts like the head, face, feet, and arms. But for items that require the full human body to be detected, estimated, and modified, virtual fitting is still in its infancy. However, &lt;a href="https://mobidev.biz/blog/future-ai-machine-learning-trends-to-impact-business" rel="noopener noreferrer"&gt;AI evolves&lt;/a&gt; in leaps and bounds, and the best strategy is to stay tuned and keep trying.&lt;/p&gt;

&lt;p&gt;Written by Maksym Tatariants, Data Science Engineer at MobiDev.&lt;/p&gt;

&lt;p&gt;Full article originally published at &lt;a href="https://mobidev.biz/blog/ar-ai-technologies-virtual-fitting-room-development" rel="noopener noreferrer"&gt;https://mobidev.biz&lt;/a&gt;. It is based on MobiDev technology research and experience providing software development services.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Human Pose Estimation Technology 2021 Guide</title>
      <dc:creator>Maksym Tatariants</dc:creator>
      <pubDate>Fri, 12 Mar 2021 12:00:51 +0000</pubDate>
      <link>https://dev.to/mobidev/human-pose-estimation-technology-2021-guide-5ejd</link>
      <guid>https://dev.to/mobidev/human-pose-estimation-technology-2021-guide-5ejd</guid>
      <description>&lt;p&gt;“Is it possible for a technology solution to replace fitness coaches? Well, someone still has to motivate you saying “Come On, even my grandma can do better!” But from a technology point of view, this high-level requirement led us to 3D human pose estimation technology. &lt;/p&gt;

&lt;p&gt;In this article, I will describe our own experience of how 3D human pose estimation can be developed and implemented for the AI fitness coach solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Human Pose Estimation?
&lt;/h2&gt;

&lt;p&gt;Human pose estimation is a computer vision technology that detects and analyzes human posture. The main component of human pose estimation is the modeling of the human body. The three most used types of human body models are the skeleton-based, contour-based, and volume-based models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skeleton-based model&lt;/strong&gt; consists of a set of joints (keypoints) like ankles, knees, shoulders, elbows, wrists, and limb orientations comprising the skeletal structure of a human body. This model is used both in 2D and 3D human pose estimation techniques because of its flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contour-based model&lt;/strong&gt; consists of the contour and rough width of the body torso and limbs, where body parts are presented with boundaries and rectangles of a person’s silhouette. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume-based model&lt;/strong&gt; consists of 3D human body shapes and poses represented by volume-based models with geometric meshes and shapes, normally captured with 3D scans.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u9b18-KK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kv1wxxhvmp7f44d7b3o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u9b18-KK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kv1wxxhvmp7f44d7b3o7.png" alt="image" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/pdf/2006.01423.pdf"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, I am talking about &lt;strong&gt;skeleton-based models&lt;/strong&gt;, which may be detected from a 2D or 3D perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2D pose estimation&lt;/strong&gt; is based on the detection and analysis of X, Y coordinates of human body joints from an RGB image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3D pose estimation&lt;/strong&gt; is based on the detection and analysis of X, Y, Z coordinates of human body joints from an RGB image. &lt;/p&gt;

&lt;p&gt;When speaking about fitness applications involving human pose estimation, it’s better to use 3D estimation, since it analyzes human poses during physical activities more accurately.&lt;/p&gt;

&lt;p&gt;Talking about AI fitness coach apps, the common flow looks as follows: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture user’s movements while doing an exercise&lt;/li&gt;
&lt;li&gt;Analyze the correctness of an exercise performance &lt;/li&gt;
&lt;li&gt;Display mistakes to the user interface&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How 3D Human Pose Estimation Works
&lt;/h2&gt;

&lt;p&gt;Here is a visual image of how 3D human pose estimation technology detects keypoints on a human body:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HhSAHE7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mm6q9dbb82abg8i5sau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HhSAHE7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6mm6q9dbb82abg8i5sau.png" alt="image" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process usually involves the extraction of joints on a human body, followed by analysis of the human pose by deep learning algorithms. If the human pose estimation system uses video records as a data source, keypoints (joint locations) are detected from a sequence of frames, not a single picture. This achieves better accuracy because the system analyzes an actual movement of a person, not a static position. &lt;/p&gt;

&lt;p&gt;There are several ways to develop a 3D human pose estimation system for fitness. The most practical is to train a deep learning model to extract 3D or 2D keypoints from the given images/frames.&lt;/p&gt;

&lt;p&gt;Using video streams from several cameras with different views of the same person doing exercises would certainly give better accuracy. But multi-camera setups are often not available, and analyzing several video streams takes more computing power.    &lt;/p&gt;

&lt;p&gt;For our research, we used a single video source for the analysis and applied convolutional neural networks (CNNs) with dilated temporal convolutions (see the video below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ORiMdhZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cst6o7n08ten5f2upckj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ORiMdhZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cst6o7n08ten5f2upckj.gif" alt="Alt Text" width="880" height="332"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/facebookresearch/VideoPose3D/blob/master/images/convolutions_anim.gif"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We analyzed the existing models and concluded that &lt;a href="https://github.com/facebookresearch/VideoPose3D"&gt;VideoPose3D&lt;/a&gt; is the optimal choice for fitness app purposes. As input, it takes a set of detected 2D keypoints, produced by a 2D detector pre-trained on the COCO 2017 dataset. To accurately predict a joint’s current position, it processes visual data from several frames captured at different points in time.&lt;/p&gt;
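
&lt;p&gt;To illustrate the idea of dilated temporal convolutions, here is a minimal PyTorch sketch of a model that lifts a 27-frame window of 2D keypoints (17 COCO joints) to a 3D pose. It shows the architecture family only and is not the official VideoPose3D implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torch.nn as nn

class TemporalLifter(nn.Module):
    def __init__(self, joints=17, channels=256):
        super().__init__()
        # Stacked dilated 1D convolutions give a 27-frame receptive field.
        self.net = nn.Sequential(
            nn.Conv1d(joints * 2, channels, kernel_size=3, dilation=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=3),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=9),
            nn.ReLU(),
            nn.Conv1d(channels, joints * 3, kernel_size=1),
        )

    def forward(self, keypoints_2d):
        # keypoints_2d has shape (batch, frames, joints, 2)
        b, f, j, _ = keypoints_2d.shape
        x = keypoints_2d.reshape(b, f, j * 2).permute(0, 2, 1)
        out = self.net(x)                      # (batch, joints*3, 1)
        return out[:, :, -1].reshape(b, j, 3)  # 3D pose for the window

model = TemporalLifter()
window = torch.randn(1, 27, 17, 2)             # 27 frames of 2D detections
print(model(window).shape)                     # torch.Size([1, 17, 3])&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;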

&lt;h2&gt;
  
  
  How to Use Human Pose Estimation in AI Fitness Coach App
&lt;/h2&gt;

&lt;p&gt;Digitalization has not spared the fitness industry. According to the Research and Markets &lt;a href="https://www.businesswire.com/news/home/20170724006151/en/27.4-Billion-Growth-Opportunities-Global-Digital-Fitness?utm_campaign=embodied-ai&amp;amp;utm_medium=email&amp;amp;utm_source=Revue%20newsletter"&gt;report&lt;/a&gt;, the digital fitness market size is expected to reach $27.4 billion by 2022.&lt;/p&gt;

&lt;p&gt;3D human pose estimation is a relatively new but rapidly evolving technology in digital fitness. Based on our analysis and practical experience working with 3D human pose estimation systems, we have formed our own vision of how such a system can be implemented. Let’s review how it may be built to analyze movements automatically using videos of users performing physical exercises. &lt;/p&gt;

&lt;p&gt;Assuming that the goal of the system is to inspect the input video for common exercise mistakes and compare it with a reference video in which a professional athlete performs the same exercise, the flow looks as follows:&lt;/p&gt;

&lt;p&gt;1) Cutting the input video at the exercise start &amp;amp; end&lt;/p&gt;

&lt;p&gt;To indicate the start and end points, we can automatically detect the positions of body control points and apply arbitrary thresholds. For example, in squatting, it is possible to detect the angle of the arms and the height of the hands, and then use thresholds on these values to detect the start and end points of the exercise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lONeyb5C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/krxd8fn04uqv1tyw7dhv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lONeyb5C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/krxd8fn04uqv1tyw7dhv.gif" alt="Alt Text" width="880" height="440"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=M-qAx0yGK9w"&gt;Video source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One more way is to ask the user to indicate the start and the end of the exercise performance manually.&lt;/p&gt;
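
&lt;p&gt;Returning to the automatic option, here is a minimal sketch of threshold-based detection for a squat, assuming hip heights have already been extracted per frame (the values and the 0.9 ratio are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def exercise_bounds(hip_heights, ratio=0.9):
    # Assume the clip starts with the person standing; frames where the
    # hips drop below a fraction of standing height are "inside" the rep.
    standing = np.median(hip_heights[:10])
    inside = np.less(hip_heights, standing * ratio)
    frames = np.flatnonzero(inside)
    if frames.size == 0:
        return None
    return frames[0], frames[-1]              # first and last active frame

heights = np.array([100, 100, 99, 100, 100, 100, 100, 100, 100, 100,
                    95, 88, 80, 74, 80, 88, 95, 100, 100, 100], float)
print(exercise_bounds(heights))               # (11, 15)&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;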

&lt;p&gt;2) Detecting 2D and 3D keypoints on the user’s body&lt;/p&gt;

&lt;p&gt;3) Decomposing of the exercise phases&lt;/p&gt;

&lt;p&gt;When the positions of keypoints (joints) are extracted, they should be compared with the positions from the reference video. However, we cannot make a direct comparison because the exercise speed and the total number of repetitions in the input and reference videos may differ.&lt;/p&gt;

&lt;p&gt;These discrepancies can be resolved by decomposing an exercise into phases. This is illustrated in the image below, where the squatting exercise is decomposed into two primary phases: squatting down and squatting up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAaIXOUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn91lesn5eqw3tl7d756.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZAaIXOUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sn91lesn5eqw3tl7d756.png" alt="image" width="880" height="551"&gt;&lt;/a&gt;&lt;br&gt;
Photo source: stronglifts.com&lt;/p&gt;

&lt;p&gt;The decomposition can be done by analyzing the keypoints detected in the input video frame by frame, and then comparing them, by certain criteria, with the keypoints from the reference video.&lt;/p&gt;

&lt;p&gt;4) Searching for common mistakes&lt;/p&gt;

&lt;p&gt;When 3D keypoints and the phases of an exercise are detected, it’s time to find common mistakes in the exercise technique in the input video. For example, in squatting, we can detect moments when the legs are bent (not straight) or the knees are closer to the torso center than the feet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LmxQ61CL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v58e5uvo77f3czggr4yx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LmxQ61CL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v58e5uvo77f3czggr4yx.gif" alt="Alt Text" width="880" height="293"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtu.be/W73Mc0Gil9A"&gt;Video source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5) Comparing the input video frames with the reference ones&lt;/p&gt;

&lt;p&gt;Here we should take a reference video, where the exercise is performed correctly, split it into phases, and detect keypoints in each frame. When the keypoints are detected and exercise phases defined in both input and reference videos, we can compare each phase of an exercise performed by a user and professional athlete.&lt;/p&gt;

&lt;p&gt;The step-by-step flow looks as follows:&lt;/p&gt;

&lt;p&gt;a. Slow down/accelerate the reference video in order to match the speed of the input one.&lt;/p&gt;

&lt;p&gt;b. Align both skeleton models of the user and a professional athlete so that their rotation angle and origins match.&lt;/p&gt;

&lt;p&gt;c. Normalize the size of both skeletons since reference and input videos can be captured from different distances.&lt;/p&gt;

&lt;p&gt;d. Compare keypoints frame by frame and detect motion inconsistencies.&lt;/p&gt;

&lt;p&gt;e. Repeat the flow separately for different groups of joints (e. g. feet position, knee position, hands and elbows position, etc.).&lt;/p&gt;
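
&lt;p&gt;Steps b-d can be illustrated with the following sketch, which aligns two skeletons by translation and torso-length normalization and then measures per-joint deviations. Rotation alignment (e.g., via a Procrustes fit) is omitted for brevity, and the joint layout is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def normalize(skeleton, pelvis=0, neck=1):
    centered = skeleton - skeleton[pelvis]           # b) common origin
    torso = np.linalg.norm(centered[neck])
    return centered / torso                          # c) size normalization

def joint_deviation(user_frame, ref_frame):
    u, r = normalize(user_frame), normalize(ref_frame)
    return np.linalg.norm(u - r, axis=1)             # d) per-joint distance

user = np.random.rand(17, 3)    # one frame of user keypoints (illustrative)
ref = np.random.rand(17, 3)     # matching frame from the reference video
dev = joint_deviation(user, ref)
worst = int(np.argmax(dev))
print("largest deviation at joint", worst, "=", round(float(dev[worst]), 3))&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;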

&lt;p&gt;6) Display results and generate recommendations for a user&lt;/p&gt;

&lt;p&gt;When the whole analysis cycle is completed, the user will get results displayed in different formats. For example, the output may include interactive 3D reconstructions with mistake hints, so that the user can zoom in/out, go back, forward, or pause at a specific moment. It is also possible to collect and display movement statistics such as the number of repetitions, average speed and duration of one repetition, and others.&lt;/p&gt;

&lt;p&gt;Visually, the video-based 3D human pose estimation system looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4XTVk3dU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfyp5j2gdbme10utvhmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4XTVk3dU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kfyp5j2gdbme10utvhmo.png" alt="image" width="880" height="1370"&gt;&lt;/a&gt;&lt;br&gt;
Photo sources: stronglifts.com,  Men’s Health channel &lt;/p&gt;

&lt;p&gt;In this article, I described how a 3D human pose estimation system works from the perspective of AI fitness coach app development because it illustrates well how the technology might work in practice. But please note that the flow may change depending on business requirements or other factors.&lt;/p&gt;

&lt;p&gt;Highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3D human pose estimation can be used to detect movement errors in fitness exercises.&lt;/li&gt;
&lt;li&gt;The selection of a proper 2D keypoint detector is critical in getting high-quality 3D keypoints.&lt;/li&gt;
&lt;li&gt;Occluded or fast-moving joints can be challenging to detect for 2D keypoint models and lead to incorrect/random predictions.&lt;/li&gt;
&lt;li&gt;When using pre-trained models, it is important to keep in mind that they will most likely not work well for unusual moves and body positions. You will probably need to fine-tune, or at least refine, a model on domain-specific or purposefully augmented data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Written by Maksym Tatariants, Data Science Engineer at MobiDev.&lt;/p&gt;

&lt;p&gt;Full article originally published at &lt;a href="https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach"&gt;https://mobidev.biz&lt;/a&gt;. It is based on MobiDev technology research and experience providing software development services.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI Assisted Real-Time Video Processing</title>
      <dc:creator>Serhii Maksymenko</dc:creator>
      <pubDate>Thu, 11 Mar 2021 13:36:32 +0000</pubDate>
      <link>https://dev.to/mobidev/ai-assisted-real-time-video-processing-5dlo</link>
      <guid>https://dev.to/mobidev/ai-assisted-real-time-video-processing-5dlo</guid>
      <description>&lt;p&gt;This article was written based on our research and expertise of building real-time video processing products, together with creating pipelines for applying Machine Learning and Deep Learning models.&lt;/p&gt;

&lt;p&gt;When it comes to real-time video processing, the data pipeline becomes more complex to handle. We strive to minimize latency in streaming video while also ensuring sufficient accuracy of the implemented models. &lt;/p&gt;

&lt;p&gt;Overall, the livestreaming industry has grown by up to 99% in hours watched year over year, according to &lt;a href="https://www.dailyesports.gg/streaming-platforms-show-massive-growth-since-last-year-except-mixer/"&gt;dailyesports.gg statistics&lt;/a&gt;, and this growth is transforming fan experiences, gaming, telemedicine, and more. Moreover, &lt;a href="https://www.grandviewresearch.com/press-release/global-video-streaming-market"&gt;Grand View Research reports&lt;/a&gt; that the video streaming market will be worth USD 184.27 billion by 2027.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI-driven Live Video Processing Use Cases
&lt;/h2&gt;

&lt;p&gt;Trained models able to detect specific objects are no simple thing to create. But when it comes to children in kindergarten, security is a top priority: such models may help prevent a kid from running away or slipping out. Another runaway example is animals leaving the borders of a farm, zoo, or reserve.&lt;/p&gt;

&lt;p&gt;Organizations that store and process facial images for identification and authentication sometimes need to implement security solutions to ensure privacy and meet GDPR data protection requirements. Examples include blurring faces while streaming conferences and meetings via YouTube, CCTV, or private channels, and on security cameras in manufacturing buildings and shopping malls.&lt;/p&gt;

&lt;p&gt;Another area, &lt;a href="https://mobidev.biz/blog/ai-visual-inspection-deep-learning-computer-vision-defect-detection"&gt;AI-based Visual Inspection for Defect Detection&lt;/a&gt;, has been implemented at manufacturing facilities that are on the way to becoming fully robotic. Computer vision makes it easier to spot manufacturing flaws: integrating deep learning methods allows a computerized system to differentiate parts, anomalies, and characters, imitating human visual inspection.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Speed up Real Time Video Processing?
&lt;/h2&gt;

&lt;p&gt;The technical problem we are solving is to blur the faces of video subjects quickly and accurately during live streaming, without quality loss, through the use of Artificial Intelligence. &lt;/p&gt;

&lt;p&gt;In short, video processing may be sketched as a series of consequent processes: decoding, computation, and encoding. However, the requirements on this serial process, such as speed, accuracy, and flexibility, make it more complicated than it looks at first glance. The final solution is supposed to be flexible in terms of input, output, and configuration. &lt;/p&gt;

&lt;p&gt;There are two ways to make processing faster while keeping accuracy at a reasonable level: 1) run work in parallel; 2) speed up the algorithms themselves.&lt;/p&gt;

&lt;p&gt;Basically, there are two approaches to parallelizing the processing: file splitting and pipeline architecture. &lt;/p&gt;

&lt;p&gt;The first one, file splitting, runs the algorithms in parallel on different parts of the video, so it might be possible to keep using slower yet more accurate models. The video is split into parts and processed in parallel; the splitting is a kind of virtual file generation, not real sub-file generation. However, this approach is not well suited to real-time processing, because it is difficult to pause, resume, or move the processing to a different position on the timeline. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R-49sLgp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2h22vckck0imim6z7vz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R-49sLgp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2h22vckck0imim6z7vz.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second one, pipeline architecture, splits and parallelizes the operations performed during processing rather than the video itself, while individual algorithms or their parts can also be accelerated with no significant loss of accuracy. Because of this, the pipeline approach is more flexible. &lt;/p&gt;

&lt;p&gt;Why is the pipeline approach more flexible? One of its benefits is the ease of swapping components to match requirements. Decoding can read from a video file, and encoding can write frames into another file. &lt;/p&gt;

&lt;p&gt;Alternatively, the input can be an RTSP stream from an IP camera, and the output can be a WebRTC connection in a browser or mobile application. A unified architecture based on a video stream covers all combinations of input and output formats. The computation process is not necessarily a monolithic operation. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jSKpgql8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvaejk9wgliczwon7n2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jSKpgql8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvaejk9wgliczwon7n2d.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Implement a Pipeline Approach
&lt;/h2&gt;

&lt;p&gt;As part of one of the projects, we had to process video in real time using AI algorithms.&lt;/p&gt;

&lt;p&gt;The pipeline was composed of decoding, face detection, face blurring, and encoding stages. Flexibility was essential in this case because the system had to process not only video files but also live streams in different formats. It showed good FPS in the 30-60 range, depending on the configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  INTERPOLATION WITH TRACKING
&lt;/h3&gt;

&lt;p&gt;We used a centroid-based tracking algorithm because it is easy to apply. When needed, other algorithms like Deep SORT can be used, but they noticeably impact speed if there are too many faces in the video. That’s why interpolation should be used in addition to tracking.&lt;/p&gt;

&lt;p&gt;What is the quality of interpolated frames? Since we skip some frames, we wanted to verify the quality of the interpolated ones. We therefore calculated the F1 metric and confirmed that interpolation does not introduce too many false positives and false negatives: the F1 value was around 0.95 for most of the video examples. &lt;/p&gt;
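
&lt;p&gt;Here is a sketch of the interpolation step, assuming the detector runs every few frames and centroid tracking has already associated the two detections of the same face:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def interpolate_boxes(box_a, box_b, num_between):
    # Linearly interpolate a tracked box for the frames between two
    # detector runs, relying on the assumption that faces move smoothly.
    box_a, box_b = np.asarray(box_a, float), np.asarray(box_b, float)
    steps = np.linspace(0.0, 1.0, num_between + 2)[1:-1]
    return [tuple(box_a + t * (box_b - box_a)) for t in steps]

# Detections (x, y, w, h) of the same face, five frames apart:
print(interpolate_boxes((100, 80, 40, 40), (120, 90, 40, 40), 4))&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;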

&lt;h3&gt;
  
  
  SHARING MEMORY
&lt;/h3&gt;

&lt;p&gt;The next stage is sharing memory. Sending raw frame data through a queue between Python processes is usually quite slow, so a faster way of exchanging data between the processes was needed.&lt;/p&gt;

&lt;p&gt;The PyTorch version of multiprocessing can pass a tensor handle through a queue so that another process just gets a pointer to the existing GPU memory. However, another approach was used: a system-level inter-process communication (IPC) mechanism for shared memory based on the POSIX API. The speed of inter-process communication improved dramatically with the help of Python libraries that provide an interface to this shared memory. &lt;/p&gt;
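
&lt;p&gt;The idea can be illustrated with the standard library’s multiprocessing.shared_memory (the project used a POSIX IPC library, but the principle is the same): only the buffer name crosses the process boundary, while the frame itself stays in shared memory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory

SHAPE, DTYPE = (720, 1280, 3), np.uint8   # illustrative frame format

def consumer(name):
    shm = SharedMemory(name=name)          # attach by name only
    frame = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)  # zero-copy view
    print("consumer sees mean pixel:", frame.mean())
    shm.close()

if __name__ == "__main__":
    shm = SharedMemory(create=True, size=int(np.prod(SHAPE)))
    frame = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    frame[:] = 128                         # "producer" writes a frame in place
    p = Process(target=consumer, args=(shm.name,))
    p.start()
    p.join()
    shm.close()
    shm.unlink()&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;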

&lt;h3&gt;
  
  
  MULTIPLE WORKERS OR MULTIPROCESSING
&lt;/h3&gt;

&lt;p&gt;Finally, several workers can be added to a pipeline component to reduce processing time. We applied this at the face detection stage, and it may be done for any heavy operation that doesn’t need ordered input. That said, the benefit depends on the operations performed inside the pipeline: in our case face detection was comparatively fast, and FPS could even drop after adding more detection workers.&lt;/p&gt;

&lt;p&gt;The time required to manage one more process can exceed the time gained by adding it. Also, neural networks used in multiple workers will compute tensors in a serial CUDA stream unless a separate stream is created for each network, which may be tricky to implement. &lt;/p&gt;

&lt;p&gt;Multiple workers, due to their concurrent nature, can’t guarantee that the output order matches the input sequence. Therefore, additional effort is required to restore the order in later pipeline stages such as encoding. Skipped frames can cause the same problem. &lt;/p&gt;

&lt;p&gt;Thus, with 2 workers running a model with a MobileNetV2 backbone, detection time was nearly halved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M_pxovCN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q52zg7rafjpj3wd87pvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M_pxovCN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q52zg7rafjpj3wd87pvf.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Develop an AI-Based Live Video Processing System
&lt;/h2&gt;

&lt;p&gt;How complex is it to apply AI to live video streams? In the basic scenario, the implementation consists of several stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adjusting a pre-trained neural network (or training one) to perform the required tasks&lt;/li&gt;
&lt;li&gt;Setting up a cloud infrastructure to enable video processing and scale to a certain point&lt;/li&gt;
&lt;li&gt;Building a software layer to pack the process and implement user scenarios (mobile applications, web and admin panels, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Creating a product like this, using a pre-trained NN and some simple application layers, takes 3-4 months to build an MVP. However, the details are crucial, and each product is unique in terms of scope and timeline.&lt;/p&gt;

&lt;p&gt;We strongly suggest our clients start with a Proof of Concept to explore the main and/or most complicated flow. Spending a few weeks exploring the best approaches and gaining results often secures the further development flow and brings confidence to both the client and the engineering team.&lt;/p&gt;

&lt;p&gt;Written by Serhii Maksymenko, Data Science Engineer at MobiDev.&lt;/p&gt;

&lt;p&gt;Full article originally published at &lt;a href="https://mobidev.biz/blog/ai-computer-vision-real-time-video-processing"&gt;https://mobidev.biz&lt;/a&gt;. It is based on MobiDev technology research and experience providing software development services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://unsplash.com/photos/QQp9prhHNbQ"&gt;Image Credit&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>ARKit vs ARCore: Comparison of Image Tracking Feature</title>
      <dc:creator>Andrew Makarov</dc:creator>
      <pubDate>Wed, 27 May 2020 18:33:21 +0000</pubDate>
      <link>https://dev.to/mobidev/arkit-vs-arcore-comparison-of-image-tracking-feature-19g9</link>
      <guid>https://dev.to/mobidev/arkit-vs-arcore-comparison-of-image-tracking-feature-19g9</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/mobidev/how-to-use-arkit-for-indoor-positioning-app-development-3p6e"&gt;previous article&lt;/a&gt;, I covered using ARKit to develop indoor navigation applications. By using visual markers or the ARReferenceImage function within ARKit and ARCore’s Augmented Images, it’s possible to create powerful and flexible indoor navigation apps.&lt;/p&gt;

&lt;p&gt;Now, I’ll go over each of these AR SDKs, comparing the two, discussing the relative accuracy of their image tracking functionality, and describing how to use them in contexts other than indoor navigation.&lt;/p&gt;

&lt;p&gt;Both ARCore's &lt;a href="https://developers.google.com/ar/develop/java/augmented-images"&gt;Augmented Images&lt;/a&gt; and ARKit’s &lt;a href="https://developer.apple.com/documentation/arkit/arreferenceimage"&gt;ARReferenceImage&lt;/a&gt;  can identify 2-dimensional images in reality and superimpose a virtual image over those real-world images. They’re also both capable of real-time tracking of the movement of images. &lt;/p&gt;

&lt;p&gt;As a developer, you can attach virtual images or other content to real-world surfaces; this opens up a variety of &lt;a href="https://mobidev.biz/blog/augmented-reality-marketing-sales?utm_source=devto&amp;amp;utm_medium=devto-arkitarcore&amp;amp;utm_campaign=arkitarcore"&gt;AR use cases in marketing&lt;/a&gt;, from virtual branding materials and business cards to advertising displays and outdoor posters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is better – ARKit or ARCore?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over the course of 2019, ARKit was considerably more popular than ARCore, with ARKit being used on around 600 million devices versus 400 million for ARCore. &lt;/p&gt;

&lt;p&gt;However, it’s worth noting that ARCore’s base of compatible Android devices &lt;a href="https://developers.googleblog.com/2019/05/ARCore-IO19.html"&gt;grew by around 150 million devices&lt;/a&gt; from December of 2018 to May of 2019.&lt;/p&gt;

&lt;p&gt;GitHub tells a similar tale, with more than 4,000 &lt;a href="https://github.com/search?q=ARKit"&gt;ARKit results&lt;/a&gt; versus more than 1,500 &lt;a href="https://github.com/search?q=ARCore"&gt;ARCore results&lt;/a&gt; as of May 2020. &lt;/p&gt;

&lt;p&gt;Each of these platforms offers roughly comparable tools for tapping into motion sensors, monitoring lighting changes, and understanding environments, and both are Unity framework compatible.&lt;/p&gt;

&lt;p&gt;ARCore has an advantage in the field of mapping. It can gather, parse, and store information about a 3-dimensional environment in a manner that allows for easy and simple re-access. &lt;/p&gt;

&lt;p&gt;With ARKit, a relatively small quantity of similar information is retained, and a ‘sliding window’ of recent experience data is all that’s available to access. &lt;/p&gt;

&lt;p&gt;ARCore creates a bigger mapping dataset, allowing for the possibility of increased stability and speed.&lt;/p&gt;

&lt;p&gt;The face detection/tracking feature for iOS devices is quicker and more accurate than the comparable feature for Android devices, thanks to the TrueDepth camera in iOS devices.&lt;/p&gt;

&lt;p&gt;When it comes to recognition and augmentation of images, ARKit is superior to a significant degree. The below video draws a comparison between how the two SDKs function.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/uVW40FtMOAY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The user examines the renowned Mona Lisa painting, and the app imposes a virtual picture over the real picture. Note that the Mona Lisa can blink her eyes as the user taps the virtual image via the app.&lt;/p&gt;

&lt;p&gt;Here, ARKit can surpass ARCore when it comes to delivering an immersive experience to app users. ARKit delivers a higher-quality image and can maintain image stability far better as the user moves their device around, which allows for &lt;a href="https://mobidev.biz/blog/arkit-guide-augmented-reality-app-development-ios?utm_source=devto&amp;amp;utm_medium=devto-arkitarcore&amp;amp;utm_campaign=arkitarcore"&gt;using ARKit in non-obvious applications&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At this point, it seems far too early to pick a winner or loser between ARKit and ARCore. While it will be fascinating to see how the two platforms develop and play to their respective strengths and weaknesses, right now it’s too close to call. &lt;/p&gt;

&lt;p&gt;In all likelihood, businesses will have to develop solutions able to be used by devices running both platforms.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>arcore</category>
      <category>arkit</category>
      <category>ios</category>
      <category>android</category>
    </item>
    <item>
      <title>How to use ARKit for indoor positioning app development</title>
      <dc:creator>Andrew Makarov</dc:creator>
      <pubDate>Thu, 21 May 2020 20:52:42 +0000</pubDate>
      <link>https://dev.to/mobidev/how-to-use-arkit-for-indoor-positioning-app-development-3p6e</link>
      <guid>https://dev.to/mobidev/how-to-use-arkit-for-indoor-positioning-app-development-3p6e</guid>
      <description>&lt;p&gt;The process of developing an app for indoor spaces navigation has three stages:&lt;/p&gt;

&lt;p&gt;1) Finding user’s position&lt;br&gt;
2) Calculating the route&lt;br&gt;
3) Rendering the route&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding User’s Position&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ARKit 2.0 contains an &lt;a href="https://developer.apple.com/documentation/arkit/arreferenceimage"&gt;ARReferenceImage&lt;/a&gt; function that can identify a 2-dimensional image within the real world and then use that image as a reference point for AR content.&lt;/p&gt;

&lt;p&gt;After the app scans the 2-dimensional visual marker placed on a floor surface or wall with the help of ARKit, it matches the marker against data in a remote cloud to find its exact coordinates in the real world.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XGfMGFK4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/whwc883i8g77kaa6pog3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XGfMGFK4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/whwc883i8g77kaa6pog3.png" alt="ARKit visual marker features"&gt;&lt;/a&gt;&lt;br&gt;
The ARReferenceImage object is made up of three data properties: an image, a name identifier, and the size of the image. The name field can be used as a unique identifier, which can then be linked to a cloud-based coordinate set.&lt;/p&gt;

&lt;p&gt;After the visual marker has been scanned, the resulting position of the user can be translated to 3-dimensional coordinates to represent our starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calculating the route&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since we can’t always reliably get a map of a given building with adequate scale and picture quality, we must create a custom map using the Cartesian coordinate system and then align it with azimuth and geo coordinates using Google Maps or a similar solution.&lt;/p&gt;

&lt;p&gt;An important note: AR Ruler is a tool with bias issues, so traditional measuring tools are preferable. &lt;/p&gt;

&lt;p&gt;Vector images allow for top-quality zooming with minimum transmitted data, resulting in excellent performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ra_kxdVo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q6xv4luvrgxko2ptueio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ra_kxdVo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q6xv4luvrgxko2ptueio.png" alt="Map with navigation graph and visual markers placed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’re then able to generate a navigation graph by connecting rooms and corridors with the placed visual marker locations.&lt;/p&gt;
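
&lt;p&gt;Once the graph exists, route calculation reduces to a shortest-path search. Here is a minimal sketch using Dijkstra’s algorithm over an illustrative graph in which nodes are rooms, corridor waypoints, and marker locations, and edge weights are distances in the custom map:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import heapq

def shortest_route(graph, start, goal):
    queue = [(0.0, start, [start])]        # (cost so far, node, path)
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, dist in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + dist, neighbor, path + [neighbor]))
    return None

graph = {
    "marker_lobby": [("corridor_a", 12.0)],
    "corridor_a": [("corridor_b", 20.0), ("room_101", 6.0)],
    "corridor_b": [("room_205", 9.0)],
}
print(shortest_route(graph, "marker_lobby", "room_205"))
# (41.0, ['marker_lobby', 'corridor_a', 'corridor_b', 'room_205'])&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;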

&lt;p&gt;&lt;strong&gt;Rendering the Route&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, it’s necessary to render the route itself. To begin, the image layer generated by the camera is overlaid with a 3-dimensional layer. As the route turns around corners, we’d expect a wall to hide it, but instead we see the entire route. The resulting output is confusing and doesn’t look natural or aesthetically pleasing.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ElC0AqZd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/d5i19gtdereqxsziefyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ElC0AqZd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/d5i19gtdereqxsziefyf.png" alt="Rendering virtual walls in AR indoor navigation app"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have three potential solutions to this issue.&lt;/p&gt;

&lt;p&gt;1) The first and simplest solution is to represent the route with an arrow, similar to the look of a compass. This works in certain contexts, but isn’t optimal for many use cases, including navigation apps.&lt;/p&gt;

&lt;p&gt;2) Another solution is to only output the route within a fixed proximity from the user. This can be implemented fairly quickly and easily and does represent a solution to the issue.&lt;/p&gt;

&lt;p&gt;3) The final, most progressive solution is to generate a low-poly building model with multiple 2-dimensional maps. This results in effectively clipping any part of the route that shouldn’t be visible. When it disappears around a corner, the route is clipped at that point. In a long stretch of straight hall, we’ll see the route until it vanishes around a corner. This type of route is easy for users to understand, and looks quite natural.&lt;/p&gt;

&lt;p&gt;That's how the process looks. The video shows how an ARKit-based app handles navigation inside an office building. &lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/VmROm6nbElA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;As a general rule, the two biggest factors in developing an AR indoor navigation app are the mapping and its overall complexity level.&lt;/p&gt;

&lt;p&gt;It’s worth noting that this method has a technical limitation. It needs an uninterrupted session to function. To maintain proper accuracy, the user has to maintain an active camera even after scanning the starting marker, all the way to the final destination. It’s possible to mitigate this limitation by working with technologies like &lt;a href="https://mobidev.biz/blog/augmented-reality-indoor-navigation-app-developement-arkit?utm_source=devto&amp;amp;utm_medium=devto-arnav&amp;amp;utm_campaign=arnav"&gt;Wi-Fi RTT to leverage new methods of indoor positioning&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's possible to do the same in ARCore with Augmented Images feature. Read my article about &lt;a href="https://dev.to/mobidev/arkit-vs-arcore-comparison-of-image-tracking-feature-19g9"&gt;ARKit vs ARCore comparison for image detection and tracking&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>arkit</category>
      <category>ios</category>
      <category>navigation</category>
      <category>ar</category>
    </item>
    <item>
      <title>How to build face recognition app</title>
      <dc:creator>Serhii Maksymenko</dc:creator>
      <pubDate>Sat, 21 Mar 2020 21:33:31 +0000</pubDate>
      <link>https://dev.to/mobidev/how-to-build-face-recognition-app-4hb1</link>
      <guid>https://dev.to/mobidev/how-to-build-face-recognition-app-4hb1</guid>
      <description>&lt;p&gt;&lt;a href="https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer"&gt;Machine Learning&lt;/a&gt; (ML) is playing a key role in a wide range of critical applications. &lt;/p&gt;

&lt;p&gt;Recently, I was tasked with designing a biometric identification system integrated with real-time camera streaming. The task had several constraints that required creative solutions. For instance, the system workflow had to not only detect faces but also recognize them near-instantly in order to expedite further action.&lt;/p&gt;

&lt;p&gt;A camera application triggers the detection and recognition workflow. The application, a local console app for Ubuntu and Raspbian written in Golang, is installed on a device connected to the camera.&lt;/p&gt;

&lt;p&gt;For the first launch, the app is configured through a JSON file containing the local camera ID and the camera reader type.&lt;/p&gt;
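&lt;p&gt;The production app is written in Golang, but here is a Python sketch of how such a first-launch config might look and be read. The field names are assumptions for illustration, not the project’s actual schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

# Hypothetical config.json (illustrative field names):
#   { "local_camera_id": 0, "camera_reader_type": "opencv" }
with open("config.json") as f:
    cfg = json.load(f)

camera_id = cfg["local_camera_id"]        # e.g. 0 for /dev/video0
reader_type = cfg["camera_reader_type"]   # e.g. "opencv" or "gstreamer"
&lt;/code&gt;&lt;/pre&gt;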

&lt;p&gt;For face detection we experimented with several approaches and found that a Caffe face tracking model and TensorFlow object detection models provided the best detection outcomes. Both can be run through the OpenCV library.&lt;/p&gt;
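&lt;p&gt;As a sketch of what running a Caffe face detector through OpenCV’s DNN module looks like, here is the widely used res10 SSD face detector from the OpenCV samples. The model file names and the 0.5 confidence cut-off are assumptions for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import cv2

# Load the SSD face detector (Caffe) distributed with OpenCV's samples.
net = cv2.dnn.readNetFromCaffe(
    "deploy.prototxt", "res10_300x300_ssd_iter_140000.caffemodel")

frame = cv2.imread("frame.jpg")
h, w = frame.shape[:2]

# Resize to the network's 300x300 input and subtract the training mean.
blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                             (300, 300), (104.0, 177.0, 123.0))
net.setInput(blob)
detections = net.forward()  # shape: (1, 1, N, 7)

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence &amp;gt; 0.5:
        # Box coordinates are relative; scale them back to pixels.
        box = detections[0, 0, i, 3:7] * [w, h, w, h]
        x1, y1, x2, y2 = [int(v) for v in box]
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
&lt;/code&gt;&lt;/pre&gt;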

&lt;p&gt;We don’t use Dlib for face detection. Instead, recognition is handled by a separate API that calculates feature vectors with dlib behind the scenes and compares them against references. We don’t call it for every video frame: as long as the bounding box hasn’t moved too fast between frames, we assume it’s the same person.&lt;/p&gt;
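&lt;p&gt;A sketch of that heuristic: track the face’s bounding box across frames and only call the recognition API when the box jumps too far. A center-distance check is one way to implement "didn’t move too fast"; the pixel threshold is an arbitrary assumption:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def same_track(prev_box, new_box, max_shift=40):
    # Boxes are (x1, y1, x2, y2). Treat the face as the same person
    # if its center moved less than max_shift pixels between frames.
    px = (prev_box[0] + prev_box[2]) / 2
    py = (prev_box[1] + prev_box[3]) / 2
    nx = (new_box[0] + new_box[2]) / 2
    ny = (new_box[1] + new_box[3]) / 2
    shift = ((nx - px) ** 2 + (ny - py) ** 2) ** 0.5
    return shift &amp;lt; max_shift  # if False, re-run recognition
&lt;/code&gt;&lt;/pre&gt;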

&lt;p&gt;When a face is captured, the image is cropped and transmitted to the backend via an HTTP form-data request. The backend API saves the image to the local file system and creates a record in the Detection Log along with a personID.&lt;/p&gt;
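&lt;p&gt;In Python terms, the upload step amounts to a multipart form-data POST. The endpoint and field names below are hypothetical, since the real backend API is internal:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

with open("face_crop.jpg", "rb") as f:
    resp = requests.post(
        "https://backend.example.com/api/detections",   # hypothetical URL
        files={"image": ("face_crop.jpg", f, "image/jpeg")},
        data={"camera_id": "0"},
    )
resp.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;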

&lt;p&gt;A background worker at the app’s backend picks up any record tagged “classified=false”. Using Dlib, the worker calculates a 128-dimensional descriptor vector of face features. Each feature vector is run against multiple reference images in the existing database: the application finds a match by computing the Euclidean distance between the feature vector of the live-stream image and the feature vectors stored for each person’s records and entries in the database.&lt;/p&gt;

&lt;p&gt;The code uses dlib’s predefined facial landmark point indices, where each index corresponds to a facial feature.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dl0rI1Fe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/430i767b7z7bgpowk3s7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dl0rI1Fe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/430i767b7z7bgpowk3s7.png" alt="Example of code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get started, you can use the dlib wrapper: &lt;a href="https://github.com/ageitgey/face_recognition"&gt;https://github.com/ageitgey/face_recognition&lt;/a&gt;&lt;br&gt;
There is an example of how to compare faces: &lt;a href="https://github.com/ageitgey/face_recognition/blob/master/examples/web_service_example.py"&gt;https://github.com/ageitgey/face_recognition/blob/master/examples/web_service_example.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A worker decides whether an image shows a known or an unknown individual based on the Euclidean distance between facial feature vectors. If the distance to a known person’s reference vectors is less than 0.6, the background worker sets that personID, marks the record as classified, and enters it into the Detection Log. If the distance is greater than 0.6, a new record is created, marked as unknown, and a new personID is entered into the Detection Log.&lt;/p&gt;
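&lt;p&gt;Using the linked face_recognition wrapper, the comparison boils down to a few lines; 0.6 is the library’s default tolerance, and the file names are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import face_recognition
import numpy as np

known = face_recognition.load_image_file("reference.jpg")
unknown = face_recognition.load_image_file("detected.jpg")

# 128-dimensional descriptor vectors (one per detected face).
known_vec = face_recognition.face_encodings(known)[0]
unknown_vec = face_recognition.face_encodings(unknown)[0]

distance = np.linalg.norm(known_vec - unknown_vec)
is_match = distance &amp;lt; 0.6  # the library's default tolerance
&lt;/code&gt;&lt;/pre&gt;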

&lt;p&gt;Images of unidentified persons are sent as notifications to the corresponding manager. We chose to implement chatbot messenger notifications and found that a simple alert chatbot could be implemented within 2-5 days.&lt;/p&gt;

&lt;p&gt;We created two chatbots: one with the Microsoft Bot Framework and the other with the Python-based Errbot. Once the chatbots are in place, security personnel or others can manually grant remote access to unknown individuals on a case-by-case basis.&lt;/p&gt;

&lt;p&gt;Captured images and their corresponding records are managed through an Admin Panel that acts as a portal to a database of stored photos and IDs. Both the Admin Panel and the database are prepared and populated before the biometric identification system goes live. Still, new unidentified images and IDs can be added to the existing database through the Admin Panel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jK__Sgn7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b0iaxwls2ax1fxswyysx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jK__Sgn7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b0iaxwls2ax1fxswyysx.png" alt="Face detection and recognition system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTES:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is necessary to note that the app’s backend is written in Golang and uses MongoDB collections to store employee data, and that all requests go through a RESTful API. Users can test the system on regular workstations prior to implementation.&lt;/p&gt;

&lt;p&gt;As unidentified images and IDs are added, the database will grow. Our use case employed a 200-entry database. Since the app works in real time and recognition must stay near-instant, the need to scale quickly becomes evident: if organizations add cameras or grow the database to 10,000 or more entries, real-time analysis and recognition speed can lag. To solve this issue we used parallelization. Using a load balancer and several web workers for simultaneous tasks, the system chunks the entire database, which allows for quick match searches and provides swift results.&lt;/p&gt;
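&lt;p&gt;To picture the chunked search, here is a minimal sketch assuming the descriptors sit in an N x 128 NumPy matrix. In production the chunks would live behind a load balancer and separate web workers rather than a local process pool:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
from multiprocessing import Pool

def best_match_in_chunk(args):
    # Return (min_distance, global_index) for the query within one chunk.
    query, chunk, offset = args
    dists = np.linalg.norm(chunk - query, axis=1)
    i = int(np.argmin(dists))
    return float(dists[i]), offset + i

def parallel_search(query, db_vectors, workers=4):
    # Split the N x 128 descriptor matrix into roughly equal chunks
    # (assumes the database has at least `workers` entries).
    chunks = np.array_split(db_vectors, workers)
    offsets = np.cumsum([0] + [len(c) for c in chunks[:-1]])
    with Pool(workers) as pool:
        results = pool.map(best_match_in_chunk,
                           [(query, c, o) for c, o in zip(chunks, offsets)])
    return min(results)  # (distance, index) of the closest entry
&lt;/code&gt;&lt;/pre&gt;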
&lt;p&gt;Anti-spoofing measures must be highly adaptable to bad actors who might try to gain entry using false facial images. Our team has put in place enhanced security measures and &lt;a href="https://mobidev.biz/blog/face-anti-spoofing-prevent-fake-biometric-detection?utm_source=devto&amp;amp;utm_medium=devto&amp;amp;utm_campaign=antisp"&gt;anti-spoofing features&lt;/a&gt; to counteract fraudulent attempts at access.&lt;/p&gt;

&lt;p&gt;While this case study is focused on facial recognition, the underlying technology can be used for a range of objects. &lt;a href="https://mobidev.biz/blog/custom-face-detection-recognition-software-development?utm_source=devto&amp;amp;utm_medium=devto&amp;amp;utm_campaign=face"&gt;Object recognition models can be trained&lt;/a&gt; to identify any other object once a dataset has been created.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
