<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anna Kovalenko</title>
    <description>The latest articles on DEV Community by Anna Kovalenko (@anna_kovalenko_).</description>
    <link>https://dev.to/anna_kovalenko_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F986302%2Fc5672369-7e7e-4b9e-a41f-0f1635acb861.jpg</url>
      <title>DEV Community: Anna Kovalenko</title>
      <link>https://dev.to/anna_kovalenko_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anna_kovalenko_"/>
    <language>en</language>
    <item>
      <title>What is Product Localization and How to Make it Work</title>
      <dc:creator>Anna Kovalenko</dc:creator>
      <pubDate>Thu, 08 Dec 2022 15:22:48 +0000</pubDate>
      <link>https://dev.to/anna_kovalenko_/what-is-product-localization-and-how-to-make-it-work-1kfo</link>
      <guid>https://dev.to/anna_kovalenko_/what-is-product-localization-and-how-to-make-it-work-1kfo</guid>
      <description>&lt;p&gt;You may have found yourself in a situation where you wanted to order a product from abroad, and the website for ordering was in a foreign language you don’t understand. Most likely, after that you purchased a product from another website or decided not to buy anything at all. This situation occurred because the international retailer hasn’t localized their product.&lt;/p&gt;

&lt;p&gt;So what is Product Localization?&lt;br&gt;
Product localization is the process of adapting and modifying a product to satisfy the language and cultural needs of customers in the target markets. To localize their products companies can translate texts, alter their packaging or modify the content on their websites so that it is appropriate and readable for international customers.&lt;/p&gt;

&lt;p&gt;If you are a native English speaker, you may never have run into translation difficulties, and product localization may seem unnecessary, since all of the websites and social media you use are already in English. However, only &lt;a href="https://preply.com/en/blog/english-language-statistics/#:~:text=Current%20research%20suggests%20that%20the,17%25%20of%20the%20world's%20population"&gt;1.35 billion&lt;/a&gt; people worldwide actually speak English, which is less than 20% of the world's population. So if you want to sell your product on the global market, be prepared for the fact that most of your potential customers will remain out of reach until you localize your product.&lt;/p&gt;

&lt;p&gt;Product localization helps attract the roughly 80% of people who do not speak English: &lt;a href="https://csa-research.com/Blogs-Events/CSA-in-the-Media/Press-Releases/Consumers-Prefer-their-Own-Language"&gt;a CSA Research survey&lt;/a&gt; found that 76% of consumers prefer purchasing products with information in their own language.&lt;/p&gt;

&lt;p&gt;It is important to localize products properly, though. There are many examples of localization that did not go according to plan.&lt;/p&gt;

&lt;p&gt;For example, in 2005 the well-known Swedish company IKEA released a bench named &lt;a href="https://www.cracked.com/article_17686_the-8-most-ill-conceived-product-names-all-time.html#:~:text=4-,The%20IKEA%20Fartfull,-IKEA%20is%20famous"&gt;“Fartfull”&lt;/a&gt;. The original name came from a Swedish word meaning “traveling in a vehicle with wheels”. Of course, to Swedes “Fartfull” made perfect sense, but in English the name has a “stinky” connotation. The product was eventually taken off the market, but not before people all around the world had a great laugh.&lt;/p&gt;

&lt;p&gt;So how do you make Product Localization work?&lt;br&gt;
There are a few steps you need to take to properly localize your product:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Content analysis
Everyone knows that planning is important for any process, and product localization requires planning as well. Localizers need to address a huge number of details, so a content analysis of the product is a crucial first step in the localization process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;During this step, the product localizers need to map out the necessary details that need to be localized. Usually these details are the product information, imagery, colors, currencies, measurements, help guides and other readable descriptive content.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Internationalization
Internationalization means building or designing the product in a way that allows it to support different languages from the start. This is particularly important if the product is a mobile app or a piece of software. If the localizers skip this step, they will spend much more time and effort localizing the product later.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Internationalization involves separating the source code from all the product elements that need to be localized. To internationalize a product, localizers choose a text encoding (usually UTF-8), add support for right-to-left languages, and choose a framework for exporting hard-coded strings as identifiers to allow for easy translation later.&lt;/p&gt;
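&lt;p&gt;The string-externalization idea can be sketched in a few lines of Python. This is a minimal illustration: the catalog contents and the helper name &lt;code&gt;t&lt;/code&gt; are invented here, not taken from any particular framework.&lt;/p&gt;

```python
# Minimal sketch of externalized UI strings (a gettext-style catalog).
# The catalog contents and the helper name are illustrative only.

CATALOG = {
    "en": {"checkout.title": "Your cart", "checkout.pay": "Pay now"},
    "de": {"checkout.title": "Ihr Warenkorb", "checkout.pay": "Jetzt bezahlen"},
}

def t(key, lang="en"):
    """Look up a string identifier, falling back to English."""
    return CATALOG.get(lang, CATALOG["en"]).get(key, CATALOG["en"][key])

print(t("checkout.pay", "de"))   # Jetzt bezahlen
print(t("checkout.pay", "fr"))   # falls back to: Pay now
```

&lt;p&gt;A real project would typically rely on gettext or a dedicated localization platform rather than a hand-rolled dictionary, but the principle is the same: code refers to string identifiers, and translations live in data.&lt;/p&gt;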

&lt;ol start="3"&gt;
&lt;li&gt;Adaptation of software, websites, marketing collateral, and documentation&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Software localization
Software localization is considered the most important step because if the localizer doesn’t localize the actual software product well, the other elements of the localization process won’t matter at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During software localization, the localizer needs to adapt UI elements, splash screens, in-app tooltips, and other in-product copy in a way that makes sense for the target market. Measurement units, date formats, and currencies also need to be changed to match local standards.&lt;/p&gt;
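&lt;p&gt;Here is a hedged sketch of the idea that display formats should live in data, not code. The two-entry format table is a simplification invented for illustration; production software would use full CLDR locale data (for example via the Babel library), which also handles details like decimal separators.&lt;/p&gt;

```python
from datetime import date

# Illustrative per-locale display conventions; the entries below are a
# hand-written simplification, not a complete locale database.
FORMATS = {
    "en-US": {"date": "%m/%d/%Y", "price": "USD {:.2f}"},
    "de-DE": {"date": "%d.%m.%Y", "price": "{:.2f} EUR"},
}

def display(d, amount, loc):
    fmt = FORMATS[loc]
    return d.strftime(fmt["date"]), fmt["price"].format(amount)

print(display(date(2022, 12, 8), 9.5, "en-US"))  # ('12/08/2022', 'USD 9.50')
print(display(date(2022, 12, 8), 9.5, "de-DE"))  # ('08.12.2022', '9.50 EUR')
```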

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Website localization&lt;br&gt;
The website is the first thing the customer sees, so it is important to properly localize all the web pages. Like software localization, website localization includes translating all text content into the target language and adapting images, videos, and other visual elements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marketing localization&lt;br&gt;
Marketing localization is a bit more difficult than software and website localization. The localizer cannot just translate the marketing materials into another language: they need to adapt marketing messages to resonate with the target market on a cultural level.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Marketing localization involves developing new slogans and taglines and creating new ad campaigns that take the characteristics of the new audience into account.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documentation localization
Documentation localization covers things like user manuals, how-to videos, FAQs, and so on. These documents are important because they let customers solve most potential problems on their own. Localizing them makes it less likely that a foreign user will need to contact customer support, which may not always be able to help because of the language barrier.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Feedback
Once localization is complete, it is important that the work is reviewed by people who are familiar with the language and culture of the target market. The main goal of this step is to make sure that the message of the original content has been translated or adapted correctly and will fit the new market.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Is there any technology that can help with Product Localization?&lt;br&gt;
Doing localization entirely by hand can be tiresome and, quite frankly, unproductive. It is easier to use tools that speed up and improve the localization process.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.nimdzi.com/language-technology-atlas"&gt;the report&lt;/a&gt; by Nimdzi Insights, the current language technology landscape includes more than 700 solutions. Here are a few you can use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Translation memory &lt;br&gt;
It’s a database of previously translated content that allows translators to reuse old translations. Translation memory databases will save you time and money and also make your translations more consistent and coherent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Term base &lt;br&gt;
A term base keeps all of the terminology used in a project in one place so that it can be reused in future projects, and lets you search for brand-specific or project-specific terms. Term bases help track changes in terminology and ensure that terms are translated correctly and consistently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Content integration &lt;br&gt;
It’s a feature that can adapt to agile processes. For localization to be agile, content must flow freely between different systems. In other words, your localization tool should be able to integrate with the rest of your tech stack, including your content management system (CMS), customer relationship management (CRM) system, ecommerce platform and so on. With content integration content can be translated and published without any manual intervention, quickly and without errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine translation &lt;br&gt;
Machine translation speeds up the process by translating content automatically. For product localization, it is important to choose a tool with strong, AI-driven machine translation capabilities, because you will be translating large amounts of content.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
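&lt;p&gt;To make the translation memory idea concrete, here is a toy fuzzy-match lookup in Python. The memory contents and the 0.85 similarity threshold are made up for illustration; real TM tools use more sophisticated segment matching.&lt;/p&gt;

```python
from difflib import SequenceMatcher

# Toy translation memory: reuse a stored translation when a new segment
# is similar enough to one translated before. Contents are illustrative.
MEMORY = {
    "Add to cart": "In den Warenkorb",
    "Proceed to checkout": "Zur Kasse gehen",
}

def tm_lookup(segment, threshold=0.85):
    best, best_score = None, 0.0
    for source, target in MEMORY.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if score >= threshold and score > best_score:
            best, best_score = target, score
    return best

print(tm_lookup("Add to cart"))  # In den Warenkorb (exact match)
```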

&lt;p&gt;Now you’re ready to implement a Product Localization strategy in your business. Good luck!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>CLIP: The link between Text and an Image</title>
      <dc:creator>Anna Kovalenko</dc:creator>
      <pubDate>Wed, 07 Dec 2022 23:38:19 +0000</pubDate>
      <link>https://dev.to/anna_kovalenko_/clip-the-link-between-text-and-an-image-4o31</link>
      <guid>https://dev.to/anna_kovalenko_/clip-the-link-between-text-and-an-image-4o31</guid>
      <description>&lt;p&gt;Maybe you’ve heard about OpenAI’s &lt;a href="https://openai.com/blog/clip/"&gt;CLIP model&lt;/a&gt; that was released last year, on January 5, 2021. Maybe you even tried to read an article on the OpenAI’s website or &lt;a href="https://arxiv.org/abs/2103.00020"&gt;the paper&lt;/a&gt; where the development of the CLIP model is described. Maybe you remember that the model is somehow related to deep learning, text-to-image generation and other cutting-edge technologies from the machine learning world. And maybe after all of this you want to understand CLIP on a deeper level. So let’s unpack the CLIP model.&lt;/p&gt;

&lt;p&gt;What is CLIP?&lt;br&gt;
Put simply, CLIP (Contrastive Language-Image Pre-training) is a multimodal model that combines knowledge of concepts and image descriptions written in English with semantic knowledge of images.&lt;/p&gt;

&lt;p&gt;The CLIP model was trained on 400 million image-text pairs: pictures matched with their descriptions written in English. That seems like a lot, but it is clearly still not every image on the Internet, and therefore not a picture and description of everything that exists in the world. Nevertheless, this training gave the CLIP model the ability to describe a remarkably wide range of images. If you input an image into the CLIP model, it will return the most suitable description, caption or summary of that image.&lt;/p&gt;

&lt;p&gt;This capability is called “zero-shot learning”. Most machine learning models learn one specific task. For example, they are trained to classify houses and cars, so they are expected to do well on exactly that task: classifying houses and cars. Generally, no one expects a model trained on houses and cars to be good at detecting, say, planes or trees. The CLIP model, however, tends to perform well on tasks it isn’t directly trained to do.&lt;/p&gt;

&lt;p&gt;CLIP can easily distinguish between images of simple objects like a house or a car, but also between more complex concepts like &lt;a href="https://paint.wtf/ranking/yBUM/QB28M6lEm5J5Ev9b0J8Q"&gt;a bumblebee that loves capitalism&lt;/a&gt; or &lt;a href="https://paint.wtf/ranking/4mcMU/IOICQrCENXTkoDKEWJAA"&gt;a lightbulb that is trying its best&lt;/a&gt;, even though CLIP has never seen such things in its training data: it simply has a generalized knowledge of what those English words and phrases mean and what those pixels represent.&lt;/p&gt;

&lt;p&gt;To summarize: the CLIP model is a vision-and-text model that was trained on hundreds of millions of images and their descriptions, can return the likeliest caption for a given image, and has “zero-shot” capabilities that let it accurately predict entire classes it wasn’t trained on.&lt;/p&gt;

&lt;p&gt;How does CLIP work?&lt;br&gt;
To connect images and text to one another, you need them to be embedded. Imagine you have one house and two cars. You could represent them as a dot on a graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Iin25X5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ie5j2esy0n5iupwrxiq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Iin25X5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ie5j2esy0n5iupwrxiq.png" alt="Image description" width="494" height="382"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://www.wolframalpha.com/input/?i=%281%2C2%29"&gt;WolframAlpha&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And, basically, that is the process of embedding. We just embedded the information on the XY grid. You can think of embedding as a way to transport information into mathematical space, just like we did: we took information about a house and two cars and transported it into mathematical space. The same thing can be done with text and images.&lt;/p&gt;

&lt;p&gt;The CLIP model consists of two sub-models, or two encoders: an image encoder and a text encoder.&lt;/p&gt;

&lt;p&gt;During training, the images and their text descriptions in a mini-batch are transformed into vectors of the same length by their corresponding encoders. After normalization, each image vector is pulled closer to its matching text vector and pushed apart from the other text vectors, and the same happens with the text vectors and their corresponding image vectors.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Contrastive pre-training&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BRbozzmB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/if8d31fmcwhboklkc9rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BRbozzmB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/if8d31fmcwhboklkc9rb.png" alt="Image description" width="880" height="569"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://openai.com/blog/clip/"&gt;OpenAI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Input images and texts are encoded, and their vector representations are used to build a similarity matrix (I*T is an inner product). During training, CLIP learns that the values on the diagonal represent correct pairings, so their similarity must be higher than the other values in the same row and column. This approach contrasts what the CLIP model knows goes together (diagonal values) with what doesn’t go together (off-diagonal values). Each row is a classification task: given an input image (I1), the CLIP model predicts the text. Similarly, each column is a classification task: given an input text (T1), the CLIP model predicts the image. During training, OpenAI used a very large mini-batch size of 32,768 (N in the figure above).&lt;/p&gt;
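&lt;p&gt;The contrastive setup described above can be sketched in a few lines of NumPy. Random vectors stand in for the real encoder outputs, and the batch and embedding sizes are made up for illustration:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                          # mini-batch size, embedding dim
img = rng.normal(size=(N, d))        # stand-in image encoder outputs
txt = rng.normal(size=(N, d))        # stand-in text encoder outputs

# L2-normalize so the inner product is cosine similarity
img = img / np.linalg.norm(img, axis=1, keepdims=True)
txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)

sim = img @ txt.T                    # N x N similarity matrix (I*T)

def xent_diagonal(logits):
    # Cross-entropy where the diagonal entries are the correct classes
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(p)).mean()

# Symmetric loss: rows classify images against texts, columns the reverse
loss = (xent_diagonal(sim) + xent_diagonal(sim.T)) / 2
```

&lt;p&gt;Minimizing this loss pushes the diagonal similarities up and the off-diagonal ones down, which is exactly the contrast described above.&lt;/p&gt;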

&lt;p&gt;During inference, one takes a set of labels, creates texts based on those labels and runs these texts through the text encoder. The text embeddings are then matched against the visual representation of the input image.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create dataset classifier from label text&lt;/li&gt;
&lt;li&gt;Use for zero-shot prediction
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J09Yi-Do--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qf5ono2y4uyqvwucr16m.png" alt="Image description" width="880" height="647"&gt;
Source: &lt;a href="https://openai.com/blog/clip/"&gt;OpenAI&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
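&lt;p&gt;The zero-shot inference step can be sketched the same way. The encoder below is a deterministic stand-in (a hash-seeded random projection), not the real CLIP encoder, so the predicted label is meaningless; the point is the mechanics of matching label embeddings to an image embedding:&lt;/p&gt;

```python
import zlib

import numpy as np

d = 16

def fake_encode(text):
    # Deterministic stand-in for a trained encoder; real CLIP runs a
    # transformer here. crc32 just seeds a reproducible random vector.
    seed = zlib.crc32(text.encode("utf-8"))
    v = np.random.default_rng(seed).normal(size=d)
    return v / np.linalg.norm(v)

# "A photo of a ..." prompts, embedded by the (stand-in) text encoder
labels = ["a photo of a dog", "a photo of a cat", "a photo of a plane"]
text_emb = np.stack([fake_encode(t) for t in labels])

image_emb = fake_encode("input image pixels")   # placeholder image embedding

logits = text_emb @ image_emb                   # cosine similarity per label
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over labels
best = labels[int(np.argmax(probs))]
```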

&lt;p&gt;Classic classification training cares only about the predefined labels. If the model is successful at finding houses and cars, it doesn’t matter whether the input is a photo or a sketch, how many storeys the house has, or which brand the car is. CLIP, by contrast, learns various aspects of images and pays attention to detail, thanks to its training approach coupled with a large dataset.&lt;/p&gt;

&lt;p&gt;The CLIP model is sensitive to the words used in image descriptions. The texts “a photo of a bird”, “a photo of a bird sitting near a bird feeder”, and “an image of a bird” all produce different probabilities when paired with the same image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--95vsUKJ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mwl2od8uh5ohir9vi03m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--95vsUKJ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mwl2od8uh5ohir9vi03m.png" alt="Image description" width="880" height="563"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://openai.com/blog/clip/"&gt;OpenAI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why does CLIP matter?&lt;br&gt;
The CLIP model has many advantages and applications due to its combination of computer vision and language processing. For example, CLIP is used in DALL·E 2, an AI system that can create realistic images and art from a text prompt in seconds. If you want to learn more about this AI system and CLIP’s role in it, read my article about DALL·E 2.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Machine Learning Filters: How social media incorporates Machine Learning</title>
      <dc:creator>Anna Kovalenko</dc:creator>
      <pubDate>Wed, 07 Dec 2022 23:28:03 +0000</pubDate>
      <link>https://dev.to/anna_kovalenko_/machine-learning-filters-how-social-media-incorporates-machine-learning-456e</link>
      <guid>https://dev.to/anna_kovalenko_/machine-learning-filters-how-social-media-incorporates-machine-learning-456e</guid>
      <description>&lt;p&gt;Sometimes, in the special moments of boredom, you may play around with Instagram, Snapchat or Tiktok filters just for the fun of it — putting on digital-makeup, fake freckles, sticking out your tongue to become a cute dog or swapping your face with a celebrity or your friend. And maybe sometimes, in those special moments, you may wonder how all of this works on a technical level. How all kinds of social media manage to match your face with the filters.&lt;/p&gt;

&lt;p&gt;Brief History&lt;br&gt;
The history of Machine Learning filters began with various face-changing startups developing face-modification technology. Basically, those startups were simultaneously working on applications that allowed users to modify their facial features in photos and videos and during video calls. Around 2015, this technology was adopted by various social media apps.&lt;/p&gt;

&lt;p&gt;Filters that can change facial features are part of the huge, cutting-edge field of computer vision. Computer vision creates a 3D space from image data using image processing technology: mathematical operations are performed on each individual pixel of the provided image to transform the picture. Computer vision technology is used not only by Snapchat but also by other social media apps like TikTok, Instagram, the face-modifying app Snow, and others.&lt;/p&gt;

&lt;p&gt;To create new filters and digital face modifiers, Meta Studio and Lens Studio were developed. These studios are tools for creating different lenses and filters: they allow you to create both Face Lenses for front-camera experiences and World Lenses for rear-camera experiences. Anyone, from a 3D professional to a Photoshop amateur, can use these tools to create their own customized filter.&lt;/p&gt;

&lt;p&gt;From an outside perspective, it sounds almost futuristic. But, surprisingly, it gets even more advanced. Last June, a newer Machine Learning-based technology was launched: an update to Meta and Lens Studio that allows developers to use Machine Learning algorithms to create more detailed, realistic and vivid filters.&lt;/p&gt;

&lt;p&gt;What Machine Learning Filters do&lt;br&gt;
Broadly speaking, the Machine Learning filter workflow lets a creator:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;create and train a neural network in an external tool (such as TensorFlow or PyTorch)&lt;/li&gt;
&lt;li&gt;import the trained model into Meta and Lens Studio projects&lt;/li&gt;
&lt;li&gt;test the filter on any smartphone&lt;/li&gt;
&lt;li&gt;distribute the finished Lens to social media users&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;How Machine Learning Filters work&lt;br&gt;
Engineers, development teams, Photoshop hobbyists and other casual creators can upload custom Machine Learning models directly into social media apps. These models must be compatible with the ONNX model format, or whatever format a specific app requires, but it’s a rather simple and convenient way to ship your customized Lens.&lt;/p&gt;

&lt;p&gt;Machine Learning filters are built not only on computer vision algorithms but also on others, such as style transfer, plus further algorithms the filter-development studios have come up with. These can be used as templates for new filters.&lt;/p&gt;

&lt;p&gt;There are 6 different templates available (via &lt;a href="https://docs.snap.com/lens-studio/references/guides/lens-features/machine-learning/lens-templates"&gt;the Lens Studio docs&lt;/a&gt;) that can be used for a number of cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Classification&lt;br&gt;
This algorithm recognizes particular situations (for example, a person wearing glasses or a person having a textured cultural hairstyle) and then carries out some action.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Object Detection&lt;br&gt;
This algorithm recognizes the presence of an object in the camera feed. The template comes with a car and food detection model and can visually call objects out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Style Transfer&lt;br&gt;
This algorithm allows you to transform images by applying a particular graphic or art style (for example, transform the camera feed into a Van Gogh painting or a comic book style)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Custom &amp;amp; Ground Segmentation&lt;br&gt;
This algorithm is used for identifying areas or objects and replacing them with custom textures and objects (for example, the algorithm comes with a pizza segmentation texture and uses Material Editor to make it look sizzling).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keyword Detection&lt;br&gt;
This algorithm is used for basic audio related Machine Learning models. Given a spectrogram analysis of the audio, it can return the probability of a spoken word.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi Object Detection&lt;br&gt;
This algorithm can detect 7 different classes of objects: cat, dog, potted plant, TV, car, bottle and cup.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
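&lt;p&gt;To make the keyword-detection template concrete, here is an illustrative sketch of the underlying idea: a model maps a spectrogram to a keyword probability. The shapes, the untrained weights and the pooling scheme are all invented for demonstration:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

def keyword_probability(spectrogram, weights, bias):
    # Average over time frames, then a single (untrained) linear unit
    # squashed through a sigmoid to get a probability of the keyword.
    features = spectrogram.mean(axis=1)
    score = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-score))

spec = rng.random((40, 100))            # 40 mel bins by 100 time frames
w, b = rng.normal(size=40), 0.0         # stand-in model parameters
p = keyword_probability(spec, w, b)     # a value between 0 and 1
```

&lt;p&gt;A real template would train the weights on labeled audio; this stub only shows the spectrogram-in, probability-out shape of the problem.&lt;/p&gt;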

&lt;p&gt;Anyone can use these algorithms as a starting point for creating their own Lens. They are flexible and can be modified, so you can also use them as the basis for your own template.&lt;/p&gt;

&lt;p&gt;What Machine Learning Filters contribute to Machine Learning and what they mean for ML engineers and developers&lt;br&gt;
Machine Learning Lenses are a promising distribution channel with great functionality. Beyond that, Machine Learning filters have other advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Machine Learning filters reduce barriers to entry&lt;br&gt;
As previously stated, with Machine Learning filters anyone can drop custom neural networks into the filter-development studios and then distribute them to millions of social media users, which is a promising and exciting possibility. With this technology, Machine Learning teams and engineers no longer need to create an entire mobile application from scratch to ship an algorithm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine Learning filters allow easier experimentation&lt;br&gt;
Without the need for teams and engineers to work through a full app release cycle to see their on-device models in action, it will be much easier to experiment with and test brand- and product-based Lenses powered by immersive Machine Learning features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine Learning filters democratize on-device Machine Learning&lt;br&gt;
For the overall development of Machine Learning technology, it is important to democratize it. One of the early developers of Machine Learning Lens technology, Hart Woolery, noted in &lt;a href="https://heartbeat.comet.ml/lens-studio-3-0-introduces-snapml-for-adding-custom-neural-networks-directly-to-snapchat-c2c32ed95b2b"&gt;one of his interviews&lt;/a&gt; that the potential of Machine Learning filters reminds him, in some ways, of how YouTube helped democratize video creation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next Step&lt;br&gt;
With all of that said, Machine Learning filters are a great leap forward for mobile Machine Learning and AI technology overall. So from now on, when you're playing around with TikTok or Snapchat filters, remember that it's cutting-edge technology: in a way, you're incorporating Machine Learning into your everyday life just by pressing a couple of buttons and having fun.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Drawing the Undrawable: DALL·E 2</title>
      <dc:creator>Anna Kovalenko</dc:creator>
      <pubDate>Wed, 07 Dec 2022 23:18:56 +0000</pubDate>
      <link>https://dev.to/anna_kovalenko_/drawing-the-undrawable-dalle-2-c2m</link>
      <guid>https://dev.to/anna_kovalenko_/drawing-the-undrawable-dalle-2-c2m</guid>
      <description>&lt;p&gt;Imagine that you can draw anything you want: from a bowl of soup that is actually a planet in the universe drawn in a classical digital art style to the portrait of a lady with a ruby necklace drawn in a Renaissance painting style. Believe it or not, you can generate those images — all imaginable and unimaginable things — using DALL·E 2.&lt;/p&gt;

&lt;p&gt;So what exactly is DALL·E 2?&lt;br&gt;
DALL·E 2 is an AI system that can create realistic images and art from a natural-language prompt in a couple of seconds. A prompt is a sentence of up to 400 characters describing the image you want to create. It can be a detailed descriptive sentence or just an emoji; the AI will create an image from it either way.&lt;/p&gt;

&lt;p&gt;Here are some random examples of prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cdn.openai.com/dall-e-2/demos/text2im/astronaut/basketball/childrens_book/0.jpg"&gt;An astronaut playing the basketball with cats in space as a children’s book illustration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cdn.openai.com/dall-e-2/demos/text2im/teddy_bears/mad_scientists/saturday-morning-cartoon/0.jpg"&gt;Teddy bears mixing sparkling chemicals as mad scientists as a 1990s Saturday morning cartoon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cdn.openai.com/dall-e-2/demos/text2im/soup/monster/knitted/0.jpg"&gt;A bowl of soup that looks like a monster knitted out of wool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cdn.openai.com/dall-e-2/demos/text2im/teddy_bears/groceries/ukiyo-e/0.jpg"&gt;Teddy bears&lt;/a&gt; (once again because AI makes them very cute) &lt;a href="https://cdn.openai.com/dall-e-2/demos/text2im/teddy_bears/groceries/ukiyo-e/0.jpg"&gt;shopping for groceries in the style of ukiyo-e&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DALL·E 2 has not been deliberately “taught” different art styles or artists’ techniques, nor has it been “taught” what a bowl of soup is or what figurines made of plasticine look like. DALL·E studied 650 million images and their descriptions and drew its own conclusions. DALL·E’s skills and abilities surprise even its creators, because even the developers of the AI system do not know exactly what DALL·E has and hasn’t learned, or how it interprets prompts.&lt;/p&gt;

&lt;p&gt;But how does DALL·E 2 actually work?&lt;br&gt;
To be honest, at the highest level, DALL·E 2 works pretty simply:&lt;/p&gt;

&lt;p&gt;Firstly, your text prompt is fed into a text encoder that is trained to map the prompt to a representation space.&lt;br&gt;
Secondly, a model called the prior maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt.&lt;br&gt;
Finally, an image decoder stochastically generates an image, which is a visual manifestation of the semantic information in your prompt.&lt;/p&gt;
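&lt;p&gt;The three-stage flow can be sketched with stub functions, each trained model replaced by a random stand-in so the data flow is visible. The function names follow the description above, not any real API:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt):
    # Stub: maps a prompt to a 512-dim text embedding
    return rng.normal(size=512)

def prior(text_embedding):
    # Stub: maps a text embedding to a corresponding image embedding
    return rng.normal(size=512)

def decoder(image_embedding):
    # Stub: stochastically "generates" a 64x64 RGB image
    return rng.random((64, 64, 3))

image = decoder(prior(text_encoder("a bowl of soup as a portal")))
```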

&lt;p&gt;But on the more detailed level, there are some steps that DALL·E takes to link related textual and visual abstractions.&lt;/p&gt;

&lt;p&gt;Step 1. How DALL·E 2 links Textual and Visual Semantics&lt;br&gt;
If you input a prompt “a bowl of soup that is a portal to another dimension in the style of Basquiat”, DALL·E 2 will output something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y_ns78D2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0hm97gex2j5hpx28okz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y_ns78D2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0hm97gex2j5hpx28okz9.png" alt="Image description" width="880" height="870"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But how does the AI system know how a textual concept like “a bowl of soup” is manifested in the visual space? The link between textual semantics and their visual representations in DALL·E 2 is learned by another OpenAI model called CLIP (Contrastive Language-Image Pre-training).&lt;/p&gt;

&lt;p&gt;CLIP is trained on millions of images and their captions, learning how much a given text prompt relates to an image. CLIP does not try to predict the caption for a given picture or artwork; instead, it learns how related a caption is to an image. This approach helps CLIP learn the link between textual and visual representations of the same abstract concept or object. DALL·E 2 relies on CLIP's ability to learn semantics from natural language.&lt;/p&gt;

&lt;p&gt;The principles of CLIP training are not really difficult:&lt;/p&gt;

&lt;p&gt;Firstly, all images and their associated captions are passed through their respective encoders, mapping all objects into an m-dimensional space.&lt;br&gt;
Next, the cosine similarity of each (image and text) pair is computed.&lt;br&gt;
The training objective is to simultaneously maximize the cosine similarity between the N correct encoded image and caption pairs and minimize the cosine similarity between the N² − N incorrect encoded image and caption pairs.&lt;/p&gt;
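
&lt;p&gt;The training objective above can be sketched in a few lines of plain Python. This is a toy illustration with list-based vectors and a made-up temperature value, not OpenAI’s actual implementation:&lt;/p&gt;

```python
import math

def cosine(u, v):
    # cosine similarity between two m-dimensional vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over an N x N
    cosine-similarity matrix. The N diagonal entries are the correct
    image-caption pairs, whose similarity is maximized; the N^2 - N
    off-diagonal entries are the incorrect pairings, pushed down."""
    n = len(image_embs)
    sims = [[cosine(image_embs[i], text_embs[j]) / temperature
             for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # average negative log-probability of the diagonal entry per row
        total = 0.0
        for i, row in enumerate(rows):
            log_denom = math.log(sum(math.exp(s) for s in row))
            total += log_denom - row[i]
        return total / len(rows)

    # image-to-caption direction plus caption-to-image direction
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))
```

&lt;p&gt;With matched pairs on the diagonal, the loss stays near zero; shuffling the captions makes it grow.&lt;/p&gt;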

&lt;p&gt;After the training, the CLIP model gets “frozen” and DALL·E 2 moves onto its next step.&lt;/p&gt;

&lt;p&gt;Step 2. How DALL·E 2 generates Images from Visual Semantics&lt;br&gt;
During this step DALL·E 2 learns how to reverse the image encoding mapping that the CLIP model just learned. CLIP learns a representation space in which it is easy to determine how textual and visual encodings relate to each other, but the image generation requires the AI system to learn how to exploit the representation space to create an image.&lt;/p&gt;

&lt;p&gt;DALL·E 2 uses the GLIDE model to perform the image generation. GLIDE, in turn, uses a Diffusion Model to create an image. In brief, Diffusion Models learn to generate data by reversing a gradual noising process, which makes them a good fit for the text-to-image generation process. If you want to learn more about Diffusion Models and Stable Diffusion, you can read my other article.&lt;/p&gt;
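
&lt;p&gt;The gradual noising process can be illustrated with a tiny sketch. This is a toy version of the closed-form forward step, with a made-up noise schedule and plain Python lists, not GLIDE’s actual code:&lt;/p&gt;

```python
import math
import random

def forward_diffuse(x0, t, betas):
    """Sample x_t from the closed-form forward noising process.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise,
    so as t grows the signal fades and only Gaussian noise remains.
    """
    alpha_bar = 1.0
    for i in range(t):
        alpha_bar *= 1.0 - betas[i]
    noise = [random.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * n
          for v, n in zip(x0, noise)]
    return xt, noise
```

&lt;p&gt;A denoising network is then trained to predict the added noise from the noised sample and the timestep; generation runs this chain in reverse, starting from pure noise.&lt;/p&gt;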

&lt;p&gt;Step 3. How DALL·E 2 maps from Textual Semantics to Corresponding Visual Semantics&lt;br&gt;
After GLIDE successfully generates an image that reflects the text prompt by image encoding, DALL·E 2 needs to actually find these encoded representations. Put simply, DALL·E needs to inject the text conditioning information from the text into the text-to-image generation process.&lt;/p&gt;

&lt;p&gt;Keep in mind that the CLIP model also learns a text encoder in addition to the image encoder. To map from the text encodings of image captions to the image encodings of their corresponding images, DALL·E 2 uses another model called the Prior. The Prior, too, is based on a Diffusion Model.&lt;/p&gt;

&lt;p&gt;The Prior consists of a decoder-only Transformer. It operates on an ordered sequence of:&lt;br&gt;
the tokenized caption&lt;br&gt;
the CLIP model’s text encodings of these tokens&lt;br&gt;
an encoding for the diffusion timestep&lt;br&gt;
the noised image passed through the CLIP model’s image encoder&lt;br&gt;
a final encoding whose output from the Transformer is used to predict the un-noised CLIP image encoding&lt;/p&gt;
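
&lt;p&gt;The ordered sequence above can be sketched as a simple concatenation. The function and argument names here are hypothetical placeholders, with each argument standing for a list of embedding vectors:&lt;/p&gt;

```python
def build_prior_sequence(caption_tokens, clip_text_encoding,
                         timestep_encoding, noised_image_encoding,
                         final_embedding):
    """Assemble the ordered input sequence for the decoder-only prior.

    The Transformer output at the position of the final embedding is
    the one used to predict the un-noised CLIP image encoding.
    """
    return (caption_tokens + clip_text_encoding + timestep_encoding
            + noised_image_encoding + final_embedding)
```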

&lt;p&gt;Step 4. How the developers of DALL·E 2 put all of it together&lt;br&gt;
Now all the components needed for DALL·E 2 are in place, and the final step is to chain them together for text-to-image generation:&lt;/p&gt;

&lt;p&gt;First, the CLIP text encoder maps the image description into the representation space.&lt;br&gt;
Next, the diffusion prior maps from the CLIP text encoding to a corresponding CLIP image encoding.&lt;br&gt;
Finally, the GLIDE generation model maps from the representation space into the image space with the help of reverse diffusion, generating one of many possible images that convey the semantic information of the text prompt.&lt;/p&gt;
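
&lt;p&gt;Put as code, the whole chain is just three function calls. The names below are hypothetical stand-ins for the trained models, not a real API:&lt;/p&gt;

```python
def dalle2_generate(prompt, clip_text_encoder, prior, glide_decoder):
    """Chain the three stages of the simplified text-to-image pipeline."""
    text_encoding = clip_text_encoder(prompt)   # prompt into representation space
    image_encoding = prior(text_encoding)       # text encoding into image encoding
    return glide_decoder(image_encoding)        # image encoding into pixels
```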

&lt;p&gt;And that’s basically it.&lt;/p&gt;

&lt;p&gt;With this technology DALL·E 2 can create realistic, detailed, brand new images and expand existing images beyond their original canvas, edit them and do other fun stuff. If you want to learn more, you can check out &lt;a href="https://openai.com/dall-e-2/#demos"&gt;DALL·E 2 website&lt;/a&gt; and try out the AI system yourself. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>StyleGAN-NADA: Blind Training and Other Wonders</title>
      <dc:creator>Anna Kovalenko</dc:creator>
      <pubDate>Wed, 07 Dec 2022 23:09:44 +0000</pubDate>
      <link>https://dev.to/anna_kovalenko_/stylegan-nada-blind-training-and-other-wonders-28lj</link>
      <guid>https://dev.to/anna_kovalenko_/stylegan-nada-blind-training-and-other-wonders-28lj</guid>
      <description>&lt;p&gt;Continuing the series of articles about the AI system DALL·E 2 and the models used in it, this time I will talk about the StyleGAN-NADA model, CLIP-Guided Domain Adaptation of Image Generators. If you want to learn more about the CLIP model, you can check out my other article!&lt;/p&gt;

&lt;p&gt;Introduction &amp;amp; Basics&lt;br&gt;
Imagine how cool it would be if you could describe a GAN with a text prompt (for example, Dog → The Joker) and get a complete generator that synthesizes images corresponding to the provided text query in any domain. Imagine how cool it would be if a generative model could be trained to produce those images without seeing any image beforehand at all. &lt;/p&gt;

&lt;p&gt;It is actually possible with the StyleGAN-NADA model. And it is really cool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B5y8v1E_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8cncgt8ot83g2ozi93at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B5y8v1E_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8cncgt8ot83g2ozi93at.png" alt="Image description" width="880" height="234"&gt;&lt;/a&gt;&lt;br&gt;
Dog → The Joker&lt;/p&gt;

&lt;p&gt;Leveraging the semantic power of large-scale CLIP (Contrastive Language-Image Pre-training) models, Rinon Gal and his colleagues present a text-driven method that shifts a generative model to new domains without collecting a single image from those domains. In other words, the StyleGAN-NADA model is trained blindly. All it takes is a natural language text prompt and a few minutes of training, and the method can adapt a generator across a great number of domains characterized by diverse styles and shapes.&lt;/p&gt;

&lt;p&gt;The domains that the StyleGAN-NADA covers are very specific and fun — or maybe a little bit creepy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jin2eVtF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/137j8xtz3sgkj405awuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jin2eVtF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/137j8xtz3sgkj405awuv.png" alt="Image description" width="880" height="223"&gt;&lt;/a&gt;&lt;br&gt;
Human → Mark Zuckerberg&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jP3SKvV_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fll2s5ybesmbiwu94th3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jP3SKvV_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fll2s5ybesmbiwu94th3.png" alt="Image description" width="880" height="215"&gt;&lt;/a&gt;&lt;br&gt;
Church → New York City&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EwzwRonr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vifod0tskwc04qdsgzm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EwzwRonr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vifod0tskwc04qdsgzm7.png" alt="Image description" width="880" height="230"&gt;&lt;/a&gt;&lt;br&gt;
Human → Zombie&lt;/p&gt;

&lt;p&gt;Why StyleGAN-NADA matters&lt;br&gt;
GAN training requires obtaining a multitude of images from a specific domain, which is usually a pretty difficult task. Of course, you can leverage the information learned by Vision-Language models such as the CLIP model, yet applying these models to manipulate pretrained generators to synthesize out-of-domain images is not that easy. That’s why the authors of the StyleGAN-NADA model propose to use dual generators and an adaptive layer selection procedure to increase training stability. Unlike other models and methods, StyleGAN-NADA works in a zero-shot manner and automatically selects a subset of layers to update at each iteration.&lt;/p&gt;

&lt;p&gt;Pre-training Setup&lt;br&gt;
It all starts with a pre-trained generator and two text prompts describing a direction of change (for example, “Dog” to “The Joker”). Instead of editing a single image, the authors of StyleGAN-NADA use the signal from the CLIP model in order to train the generator itself. So there is actually no need for training data, and the process is really fast. The training takes minutes or even less.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DISF1Dvk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dggnkwrt9m8wv2aa62o4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DISF1Dvk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dggnkwrt9m8wv2aa62o4.png" alt="Image description" width="880" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re interested in the more detailed overview of the training setup, here it is:&lt;/p&gt;

&lt;p&gt;The authors of the StyleGAN-NADA model initialize two intertwined generators, G-frozen and G-train, using the weights of a generator pre-trained on images from a source domain. The weights of G-frozen remain fixed throughout the whole process, while the weights of G-train are modified through optimization and an iterative layer-freezing scheme. The process shifts the domain of G-train according to a user-provided textual direction while maintaining a shared latent space.&lt;/p&gt;
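
&lt;p&gt;The dual-generator setup can be sketched like this. Weights are modeled as plain dicts here, which is a heavy simplification of a real StyleGAN2 checkpoint:&lt;/p&gt;

```python
import copy

def init_dual_generators(pretrained_weights):
    """Initialize G-frozen and G-train from the same pre-trained weights.

    G-frozen stays fixed as the source-domain reference; only G-train
    is updated during domain adaptation.
    """
    g_frozen = copy.deepcopy(pretrained_weights)
    g_train = copy.deepcopy(pretrained_weights)
    return g_frozen, g_train

def training_step(g_frozen, g_train, grads, lr=0.01):
    # Only G-train moves; G-frozen keeps providing the source-domain anchor.
    for name, g in grads.items():
        g_train[name] = g_train[name] - lr * g
    return g_train
```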

&lt;p&gt;How StyleGAN-NADA works&lt;br&gt;
The main goal of the method is to shift a pre-trained generator from a given source domain to a new target domain only with the textual prompts, without using images of the target domain. Here’s the training scheme that helps to achieve that goal:&lt;/p&gt;

&lt;p&gt;Network Architecture&lt;br&gt;
The model consists of two pretrained StyleGAN2 generators with a shared mapping network and the same latent space. The goal is to change the domain of one of the paired generators with a CLIP-based loss and keep the other fixed as a reference with a layer-freezing scheme that can adapt and select which layers to update at each iteration.&lt;/p&gt;

&lt;p&gt;CLIP-based Guidance&lt;br&gt;
There are 3 different types of losses that are used:&lt;/p&gt;

&lt;p&gt;Global target loss&lt;br&gt;
The global loss is the most intuitive CLIP loss. It minimizes the CLIP-space cosine distance between the generated images and the given target text prompt; however, used on its own it tends to either collapse to a single image or fool CLIP by adding per-pixel noise to the images.&lt;/p&gt;

&lt;p&gt;Directional loss&lt;br&gt;
It’s a more advanced loss that aligns the direction between the CLIP embeddings of images from the source and target domains with the CLIP-space direction between the corresponding text prompts.&lt;/p&gt;
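
&lt;p&gt;The directional loss can be written down in a few lines. A toy sketch with plain list vectors, assuming the CLIP embeddings are already computed:&lt;/p&gt;

```python
import math

def directional_clip_loss(img_src, img_tgt, txt_src, txt_tgt):
    """One minus the cosine similarity between the image-space direction
    (target image minus source image, in CLIP space) and the text-space
    direction (target prompt minus source prompt)."""
    d_img = [a - b for a, b in zip(img_tgt, img_src)]
    d_txt = [a - b for a, b in zip(txt_tgt, txt_src)]
    dot = sum(a * b for a, b in zip(d_img, d_txt))
    norm = (math.sqrt(sum(a * a for a in d_img))
            * math.sqrt(sum(b * b for b in d_txt)))
    return 1.0 - dot / norm
```

&lt;p&gt;When the two directions line up, the loss goes to zero; orthogonal directions give a loss of one.&lt;/p&gt;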

&lt;p&gt;Embedding-norm loss &lt;br&gt;
Embedding-norm loss uses a regularized version of StyleCLIP’s latent mapper that is used to reduce the number of semantic artifacts on synthesized images.&lt;/p&gt;

&lt;p&gt;Layer-Freezing&lt;br&gt;
It happens that some layers of the generator are more important for specific domains than others, hence at each iteration a set of W+ vectors is generated — a separate style vector for each layer in the generator. A number of StyleCLIP global optimization steps are performed to measure which layers changed the most. Only those most changed layers are updated, while all other layers are frozen for that iteration.&lt;/p&gt;

&lt;p&gt;Latent-Mapper&lt;br&gt;
During the last step, it is noted that the generator does not undergo a complete transformation for some shape changes. For some domain (for example “Dog” to “The Joker”) the resulting generator can output both dogs, and the jokers and everything that lies in-between. Therefore a StyleCLIP latent mapper can be trained to map all latent codes to the dog region of the latent space.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
So this is how StyleGAN-NADA, a CLIP-guided zero-shot method for Non-Adversarial Domain Adaptation of image generators, works. Although the StyleGAN-NADA is focused on StyleGAN, it can be applied to other generative architectures such as &lt;a href="https://github.com/boschresearch/OASIS/"&gt;OASIS&lt;/a&gt; and many others.&lt;/p&gt;

&lt;p&gt;The ability to blindly train intertwined generators leads to new cool possibilities. For example, with the StyleGAN-NADA model you can edit images in ways that are constrained almost only by your own creativity or synthesize paired cross-domain data and labeled images for downstream applications such as image-to-image translation. And it's only the beginning! The method surely will be developed in the future. Maybe this article inspired you to explore the world of textually-guided generation and abilities of the CLIP model yourself. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>How Generative AI changes the Process of Creating Content</title>
      <dc:creator>Anna Kovalenko</dc:creator>
      <pubDate>Wed, 07 Dec 2022 22:58:15 +0000</pubDate>
      <link>https://dev.to/anna_kovalenko_/how-generative-ai-changes-the-process-of-creating-content-4m24</link>
      <guid>https://dev.to/anna_kovalenko_/how-generative-ai-changes-the-process-of-creating-content-4m24</guid>
      <description>&lt;p&gt;There is a lot of controversy surrounding art and — in general — content created by Artificial Intelligence. Can the content be considered art if it is made by a computer? Is the AI generated content better than a human-made one? There are a lot of questions and disagreements but there are definitely some types of content — for example, journalistic news articles or illustrations and diagrams — that can be created automatically without employing human content creators. That’s when AI can help.&lt;/p&gt;

&lt;p&gt;What is Generative AI?&lt;br&gt;
Generative AI is a technology that can create new content by utilizing existing text prompts, images, videos, audio files and so on. With the Generative AI technology, computers can detect the underlying pattern related to the input and produce similar content. Generative AI already does a lot. It produces text and images, blog posts, program codes, poetry, artwork and even &lt;a href="https://www.washingtonpost.com/technology/2022/09/02/midjourney-artificial-intelligence-state-fair-colorado/" rel="noopener noreferrer"&gt;wins fine-arts competitions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are different Generative AI techniques. For example:&lt;/p&gt;

&lt;p&gt;Generative adversarial networks (GANs):&lt;br&gt;
GANs consist of two neural networks, a generator and a discriminator, that are pitted against each other until the two reach an equilibrium:&lt;br&gt;&lt;br&gt;
The generator network generates new data or content that resembles the source data&lt;br&gt;
The discriminator network differentiates between the source data and the generated data in order to recognize which is more similar to the original&lt;/p&gt;
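
&lt;p&gt;The contest between the two networks can be sketched as a pair of loss functions, a toy version of the original GAN objective with the discriminator outputs already given as probabilities:&lt;/p&gt;

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy the discriminator minimizes: it wants
    d_real near 1 (source data) and d_fake near 0 (generated data)."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """The generator wins when the discriminator scores its samples
    as real, i.e. when d_fake approaches 1."""
    return -math.log(d_fake)
```

&lt;p&gt;One network’s gain is the other’s loss, which is what makes the training adversarial.&lt;/p&gt;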

&lt;p&gt;Transformers:&lt;br&gt;
For example, GPT-3, LaMDA, and Wu-Dao. These transformers imitate cognitive attention and differentially weigh the significance of different parts of the input data. They are trained to understand language and images, learn classification tasks, and generate texts and images from massive datasets.&lt;/p&gt;

&lt;p&gt;Variational auto-encoders:&lt;br&gt;
Auto-encoders encode the input into compressed code while the decoder reproduces the initial information from this code. If chosen and trained correctly, this compressed representation stores the input data distribution in a much smaller dimensional representation.&lt;/p&gt;

&lt;p&gt;How can Generative AI be used?&lt;br&gt;
The Generative AI technology can be used for:&lt;/p&gt;

&lt;p&gt;Generating photographs of human faces, objects and scenes&lt;br&gt;
Generative AI can produce real looking photographs using, for example, the GAN models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55wuwt1qxvx2fv282981.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55wuwt1qxvx2fv282981.png" alt="Image description" width="750" height="494"&gt;&lt;/a&gt;&lt;br&gt;
If you want to learn more about the GAN models — especially the StyleGAN-NADA model — you can read &lt;a href="https://dev.to/anna_kovalenko_/stylegan-nada-blind-training-and-other-wonders-28lj"&gt;my article&lt;/a&gt; about it!&lt;/p&gt;

&lt;p&gt;Image-to-image conversion&lt;br&gt;
Generative AI can translate one image to another (for example, black and white photos to color photos; day photos to night photos; real photographs to artistic paintings in the famous artists’ styles and so on).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w72ucvpie0zzbrl3zbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w72ucvpie0zzbrl3zbf.png" alt="Image description" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Text-to-image Generation&lt;br&gt;
It produces realistic photographs from text prompts describing simple objects like birds and flowers and other basic scenes using, for example, the DALL·E 2 or the StackGAN systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fernw63usaw97np05xpvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fernw63usaw97np05xpvu.png" alt="Image description" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about the DALL·E 2 model, you can check out &lt;a href="https://dev.to/anna_kovalenko_/drawing-the-undrawable-dalle-2-c2m"&gt;my article&lt;/a&gt; about it!&lt;/p&gt;

&lt;p&gt;Film Restoration&lt;br&gt;
It restores and improves old images and old movies by upscaling them to 4K and beyond. It can generate 60 frames per second instead of the original 24 or fewer, eliminate noise and add color, for example, with the help of Topaz Gigapixel AI.&lt;/p&gt;

&lt;p&gt;You can watch an example of the AI Film Restoration &lt;a href="https://www.youtube.com/embed/yNQdVjQAQLg?autoplay=1" rel="noopener noreferrer"&gt;here&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Semantic Segmentation&lt;br&gt;
Semantic-image-to-photo translation converts inputs that are semantic images or sketches into photorealistic images or photographs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dx5y0w1v7gh0almb7x5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dx5y0w1v7gh0almb7x5.png" alt="Image description" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Face Frontal View Generation&lt;br&gt;
It generates front-on photos from photos taken at different angles for a face verification or face identification system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyc9abjpsp0nbgv2p8zii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyc9abjpsp0nbgv2p8zii.png" alt="Image description" width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photos to Emojis Transformation&lt;br&gt;
It changes real photos to emojis or cartoonish faces. This technology is used in Machine Learning filters on social media like Snapchat, Instagram or TikTok.&lt;/p&gt;

&lt;p&gt;If you want to learn more about Machine Learning filters, you can read &lt;a href="https://dev.to/anna_kovalenko_/machine-learning-filters-how-social-media-incorporates-machine-learning-456e"&gt;my article&lt;/a&gt; about it!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccjg1ewjn23u8n36vp3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccjg1ewjn23u8n36vp3z.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Face Aging &lt;br&gt;
It generates older versions of faces from photos. This technology is also used in Face Modification entertaining apps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqexdhgppgpcxomi63he.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqexdhgppgpcxomi63he.png" alt="Image description" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Media and entertainment&lt;br&gt;
There are different ways to apply Deep fake and other generative AI technologies. For example, deep fake technology can help with localization of content (e.g. dubbing of a movie). By using face synthesis and voice cloning such as the CRISPR technology, the actors’ original voices can be matched with a lip-sync.&lt;/p&gt;

&lt;p&gt;Benefits of Generative AI&lt;br&gt;
Enhanced Identity Protection&lt;br&gt;
Generative AI can create avatars that conceal the real appearance of people who, for any reason, do not want or are not comfortable disclosing their identities while being interviewed or working online.&lt;/p&gt;

&lt;p&gt;Improved Quality of Output&lt;br&gt;
Generative AI systems can help to get high-quality images, video, audio and other content even if the original input content is distorted and far from perfect.&lt;/p&gt;

&lt;p&gt;Decreased Financial and Reputational Risks&lt;br&gt;
Generative AI tools can quickly detect malicious or at least suspicious activities and prevent all kinds of damage to a business or a creator.&lt;/p&gt;

&lt;p&gt;Challenges of Generative AI&lt;br&gt;
Security&lt;br&gt;
Generative AI can be used for scamming people.&lt;/p&gt;

&lt;p&gt;Overestimation of Capabilities&lt;br&gt;
Enormous amounts of training data are needed for Generative AI algorithms to perform tasks. But even with that, GANs cannot create entirely new images or texts; they can only combine the visual and semantic information they have already learned.&lt;/p&gt;

&lt;p&gt;Unexpected Outcomes&lt;br&gt;
It’s not easy to control the behavior of some Generative AI models, for example, GANs. Generative AI models can behave unstably and generate something unexpected.&lt;/p&gt;

&lt;p&gt;Now that you know the basics of the Generative AI technology, you can use it to benefit your business and content creation in general. Good luck!&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>The Science behind Generative AI Models</title>
      <dc:creator>Anna Kovalenko</dc:creator>
      <pubDate>Wed, 07 Dec 2022 22:28:02 +0000</pubDate>
      <link>https://dev.to/anna_kovalenko_/the-science-behind-generative-ai-models-18ol</link>
      <guid>https://dev.to/anna_kovalenko_/the-science-behind-generative-ai-models-18ol</guid>
      <description>&lt;p&gt;I covered the topic of Generative AI and types of AI generated content in one of my previous articles. In this one, I will explain the science behind it and answer the important question: how do Generative AI Models work? &lt;/p&gt;

&lt;p&gt;If you have read my article “How Generative AI changes the Process of Creating Content”, you know what magical things Generative AI Models can do — from generating images from text prompts to restoring the old movies — and here is how:&lt;/p&gt;

&lt;p&gt;Discriminative modeling &amp;amp; Generative modeling&lt;br&gt;
There are two different types of modeling, and to understand the science behind Generative AI it is important to learn the difference between them.&lt;/p&gt;

&lt;p&gt;Discriminative modeling helps to classify existing data points (for example, images of elephants and mice into respective categories). It is mostly used in supervised machine learning tasks.&lt;/p&gt;

&lt;p&gt;Generative modeling helps to understand the structure of the dataset and generate similar datasets (for example, creating a photorealistic image of an elephant or a mouse). It is mostly used in unsupervised and semi-supervised machine learning tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_kCSzZYh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/60c937wnjf1dbedttqld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_kCSzZYh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/60c937wnjf1dbedttqld.png" alt="Image description" width="880" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: Altexsoft&lt;/p&gt;

&lt;p&gt;Discriminative modeling&lt;br&gt;
Like most machine learning models, discriminative algorithms classify input data with some set of features and predict a category, a label or a class to which an input data example belongs.&lt;/p&gt;

&lt;p&gt;The training data for machine learning models usually contains images — they are called samples — each of which has input features (X) and output class labels (Y). The training data for machine learning models can be, for example, photos of elephants and mice. And there is a neural net that looks at one of those photos, pays attention to the features that distinguish elephants and mice and tells whether that one specific photo is an elephant or a mouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PW6EcLzX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4mb3c0rg8aj7no693vyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PW6EcLzX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4mb3c0rg8aj7no693vyp.png" alt="Image description" width="880" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;p(y|x) = probability of y given x&lt;br&gt;
Source: Altexsoft&lt;/p&gt;

&lt;p&gt;The difference between elephants and mice can be limited to just two features x (for example, “the size of the body” and “the presence of the trunk”). In this case, each feature is a dimension so it is easy to present them in a two-dimensional data space. In the illustration above, the blue dots represent elephants and the red dots represent mice. The line represents the decision boundary and shows that the discriminative model learned to separate elephants from mice based on the x features.&lt;/p&gt;

&lt;p&gt;When the discriminative model is fully trained and can tell the difference between elephants and mice, it can “recall” what an elephant or a mouse looks like from what it has already seen during training.&lt;/p&gt;

&lt;p&gt;If the discriminative model comes across an image from a class it has never seen, for example, a house, it can still only say that it’s an elephant with some level of probability. During training, the predicted output (ŷ) is compared to the expected output (y) from the training dataset. Based on the comparison, it is easy to understand how and what in a model needs to be updated to create more accurate outputs for the given classes.&lt;/p&gt;

&lt;p&gt;So, to sum up, the discriminative model compresses information about the differences between elephants and mice but doesn’t try to understand what exactly elephants and mice are.&lt;/p&gt;

&lt;p&gt;Generative modeling&lt;br&gt;
Well, generative algorithms do the complete opposite. They don’t predict a label given to some features, they try to predict features given a certain label. For discriminative algorithms the relations between x and y are more important than the process of how to get x.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pgs4ZRdG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tg78z5xch1l5b4r7ovt6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pgs4ZRdG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tg78z5xch1l5b4r7ovt6.png" alt="Image description" width="880" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;p(x,y) = probability of x and y occurring together&lt;br&gt;
Source: Altexsoft&lt;/p&gt;

&lt;p&gt;Mathematically speaking, generative modeling can capture the probability of x and y occurring together. Generative algorithms learn the distribution of individual classes and features but not the boundary.&lt;/p&gt;

&lt;p&gt;So generative models help to answer the question of what an elephant or a mouse itself is. The illustration above shows that a generative model can predict both “the size of the body” and “the presence of the trunk” features of the animals, and also other features from a class. Basically, it means that the generative model learns features and the relations between them to get an idea of what those animals look like in general.&lt;/p&gt;
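
&lt;p&gt;One way to see the relationship in code: if a generative model has captured the joint probability p(x, y), the discriminative answer p(y given x) falls out of it by normalization. A toy sketch with a hand-made joint table:&lt;/p&gt;

```python
def conditional_from_joint(joint, x):
    """Recover the discriminative view p(y given x) from a generative
    joint distribution p(x, y), stored as a dict of (x, y) pairs."""
    total = sum(p for (xi, yi), p in joint.items() if xi == x)
    return {yi: p / total for (xi, yi), p in joint.items() if xi == x}
```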

&lt;p&gt;If the generative model knows what kinds of elephants and mice there are in general, then it also knows their differences. Such generative algorithms can learn to recreate images of elephants and mice, even ones that were not in their training dataset.&lt;/p&gt;

&lt;p&gt;Generative modeling can help to solve a lot of problems using cutting-edge technologies like GANs and transformer-based algorithms.&lt;/p&gt;

&lt;p&gt;GANs (Generative Adversarial Networks)&lt;br&gt;
A generative adversarial network is a machine learning algorithm that puts the two neural networks that are called the generator and the discriminator against each other. The contest between two neural networks takes the form of a zero-sum game, where one agent’s gain is another agent’s loss.&lt;/p&gt;

&lt;p&gt;GANs were invented in 2014 by Ian Goodfellow and his colleagues at the University of Montreal. If you want to look at their description of the GAN architecture, you can read their paper titled &lt;a href="https://arxiv.org/abs/1406.2661"&gt;“Generative Adversarial Networks”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DY3tTLIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jw5jz4fbzr9kori47q5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DY3tTLIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jw5jz4fbzr9kori47q5w.png" alt="Image description" width="880" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: Altexsoft&lt;/p&gt;

&lt;p&gt;The GAN architecture consists of two sub-models:&lt;/p&gt;

&lt;p&gt;Generator&lt;br&gt;
The generator is a neural net that creates fake samples from a random input vector (a list of randomly drawn numerical values).&lt;/p&gt;

&lt;p&gt;Discriminator&lt;br&gt;
The discriminator is a neural net that takes a given sample and decides whether it is a fake sample from the generator or a real sample from the domain. So, basically, the discriminator is a binary classifier that returns a probability: a number between 0 and 1. If the result is closer to 0, the sample is more likely to be fake; if it is closer to 1, the sample is more likely to be real.&lt;/p&gt;

&lt;p&gt;When working with visual content, the generator and the discriminator are often implemented as Convolutional Neural Networks (CNNs).&lt;/p&gt;
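
&lt;p&gt;As a rough sketch of the adversarial game described above, here is a toy GAN in plain NumPy (the 1-D Gaussian “real data”, the linear generator and all hyperparameters are illustrative assumptions, not the original paper’s setup): the generator learns to shift random noise toward the real data while a logistic-regression discriminator tries to tell the two apart.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# "Real" data: samples from a 1-D Gaussian the generator must mimic.
def real_batch(n):
    return rng.normal(4.0, 1.0, n)

wg, bg = 1.0, 0.0   # generator G(z) = wg*z + bg maps noise to fake samples
wd, bd = 0.1, 0.0   # discriminator D(x) = sigmoid(wd*x + bd) scores "realness"

lr, batch = 0.05, 64
for _ in range(2000):
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    xr = real_batch(batch)
    xf = wg * rng.normal(0.0, 1.0, batch) + bg
    dr, df = sigmoid(wd * xr + bd), sigmoid(wd * xf + bd)
    # Gradients of -log D(xr) - log(1 - D(xf)) w.r.t. wd and bd.
    wd -= lr * (-(1 - dr) * xr + df * xf).mean()
    bd -= lr * (-(1 - dr) + df).mean()

    # Generator step: push D(fake) toward 1 (try to fool the discriminator).
    z = rng.normal(0.0, 1.0, batch)
    xf = wg * z + bg
    df = sigmoid(wd * xf + bd)
    gx = -(1 - df) * wd          # gradient of -log D(xf) w.r.t. each fake
    wg -= lr * (gx * z).mean()
    bg -= lr * gx.mean()

# The mean of the generated samples (= bg) should have drifted toward 4.
print(f"generated mean: {bg:.2f}")
```

&lt;p&gt;Real GANs replace both linear maps with deep networks and compute these gradients by backpropagation, but the zero-sum structure of the updates is the same.&lt;/p&gt;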

&lt;p&gt;Transformer-based models&lt;br&gt;
Transformer-based models are deep neural networks that learn context, and therefore meaning, by tracking relationships in sequential data, such as the words in a sentence. That is why transformers are often used in Natural Language Processing tasks.&lt;/p&gt;

&lt;p&gt;Transformer-based models were first described in the 2017 paper &lt;a href="https://arxiv.org/abs/1706.03762"&gt;“Attention Is All You Need”&lt;/a&gt; by Ashish Vaswani and his colleagues at Google. Examples of transformer-based models include GPT-3 and LaMDA.&lt;/p&gt;

&lt;p&gt;Transformer-based models transform one sequence into another. They go through semi-supervised learning, which means they are pre-trained in an unsupervised manner on a large unlabeled dataset and then fine-tuned with supervised training to perform better on specific tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UruMcU6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ctbo5cr6bar1gkgvjloc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UruMcU6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ctbo5cr6bar1gkgvjloc.png" alt="Image description" width="880" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: Altexsoft&lt;/p&gt;

&lt;p&gt;A typical transformer has two parts in it:&lt;/p&gt;

&lt;p&gt;The encoder &lt;br&gt;
The encoder works on the input sequence. It extracts all features from a sequence, converts them into vectors (e.g. vectors representing the semantics and position of a word in a sentence) and then passes them to the decoder.&lt;/p&gt;

&lt;p&gt;The decoder&lt;br&gt;
The decoder works on the target output sequence. Each decoder layer receives the encoder outputs, derives context from them and generates the output sequence.&lt;/p&gt;

&lt;p&gt;Transformers work through sequence-to-sequence learning: the transformer takes a sequence of tokens (again, for example, the words in a sentence) and predicts the next token in the output sequence. It does this by iterating through the decoder layers.&lt;/p&gt;
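
&lt;p&gt;The relationship-tracking mechanism at the heart of a transformer is scaled dot-product attention. Here is a minimal NumPy sketch of that single operation (the token count and embedding dimension are arbitrary toy values):&lt;/p&gt;

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query position attends to every key position; the values
    are mixed according to softmax-normalized similarity scores."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
    return w @ V, w

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 tokens, embedding dimension 4
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)                  # (3, 4)
print(weights.sum(axis=-1))       # each row of weights sums to 1
```

&lt;p&gt;In a full transformer this operation is repeated across multiple heads and layers, with Q, K and V produced by learned linear projections of the token embeddings.&lt;/p&gt;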

&lt;p&gt;Now that you know the science behind generative AI models, you can read my other article about generative content, in which I describe how generative AI models can be used and what their advantages and disadvantages are.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is Stable Diffusion and why it matters</title>
      <dc:creator>Anna Kovalenko</dc:creator>
      <pubDate>Wed, 07 Dec 2022 22:21:06 +0000</pubDate>
      <link>https://dev.to/anna_kovalenko_/what-is-stable-diffusion-and-why-it-matters-3apf</link>
      <guid>https://dev.to/anna_kovalenko_/what-is-stable-diffusion-and-why-it-matters-3apf</guid>
      <description>&lt;p&gt;What is Stable Diffusion?&lt;br&gt;
Stable Diffusion is a text-to-image machine learning model released by &lt;a href="https://stability.ai/" rel="noopener noreferrer"&gt;Stability.Ai&lt;/a&gt; on August 22, 2022. Basically, the model is capable of generating highly detailed, photo-realistic images from text descriptions.&lt;/p&gt;

&lt;p&gt;The model was released publicly and is &lt;a href="https://github.com/CompVis/stable-diffusion#stable-diffusion-v1" rel="noopener noreferrer"&gt;open source&lt;/a&gt; so you can play around with it if you want. &lt;/p&gt;

&lt;p&gt;What can Stable Diffusion do?&lt;br&gt;
For now, the Stable Diffusion model allows users to:&lt;br&gt;
Convert text into brand-new realistic images at 512x512 pixels in a few seconds&lt;br&gt;
Use image modification, via image-to-image translation guided by a text prompt and upscaling, to transform an existing image into a new one&lt;br&gt;
Use GFP-GAN modeling for inpainting, a process that helps restore and upscale an existing image&lt;/p&gt;

&lt;p&gt;How does Stable Diffusion work?&lt;br&gt;
Stable Diffusion is a form of diffusion model (DM). Diffusion models were introduced back in 2015 and are trained with the objective of reversing successive applications of Gaussian noise to training images, so they can be thought of as a sequence of denoising autoencoders. There are different variants of DMs, but Stable Diffusion is powered by a variant known as Latent Diffusion, or the Latent Diffusion Model (LDM).&lt;/p&gt;
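
&lt;p&gt;The “successive applications of Gaussian noise” have a convenient closed form, sketched here in a few lines of NumPy (the schedule values follow the commonly used linear schedule and are illustrative, not Stable Diffusion’s exact configuration):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear schedule for the noise variance added at each of T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal retention

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0): the closed form equivalent to t
    successive applications of Gaussian noise to a clean image x0."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones((8, 8))                 # stand-in for a clean "image"
x_noisy = forward_diffuse(x0, 999)   # by the last step, almost pure noise
print(alphas_bar[999])               # roughly 4e-5: the signal is nearly gone
```

&lt;p&gt;The model is then trained to run this process in reverse, predicting and removing the noise one step at a time to turn pure noise back into an image.&lt;/p&gt;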

&lt;p&gt;Latent Diffusion is a breakthrough text-to-image synthesis technique. It was described by AI researchers at the Ludwig Maximilian University of Munich in a paper called “High-Resolution Image Synthesis with Latent Diffusion Models”. In short, a Latent Diffusion model first compresses images into a lower-dimensional latent space and learns to denoise there, instead of denoising image data directly in “pixel space”. This makes training cheaper and inference faster.&lt;/p&gt;

&lt;p&gt;How was Stable Diffusion trained?&lt;br&gt;
Stable Diffusion was trained on massive datasets collected by &lt;a href="https://laion.ai/" rel="noopener noreferrer"&gt;LAION&lt;/a&gt;, a non-profit open AI network that received funding from Stability.Ai, the owner of the Stable Diffusion model. Those datasets contained 120 million image-text pairs drawn from a complete set of nearly 6 billion image-text pairs.&lt;/p&gt;

&lt;p&gt;Overall, they sampled 12 million images. 47% of the total sample came from just 100 domains. Pinterest accounted for 8.5% of the entire dataset, followed by other sources such as WordPress.com, Blogspot, Flickr, DeviantArt, and Wikimedia Commons.&lt;/p&gt;

&lt;p&gt;Is Stable Diffusion being regulated?&lt;br&gt;
There are plenty of ethical, moral and legal concerns about the misuse of AI tools. Compared to previous comparable AI models, Stable Diffusion permits a wider range of images to be generated; for example, users can generate pictures of real people or existing brand logos. Another concern is that widespread use of image synthesis software may eventually cause human artists, along with photographers, models and everyone else involved in the creation of visual art, to gradually lose commercial viability to AI-based competitors.&lt;/p&gt;

&lt;p&gt;In response to these concerns, Stability.ai emphasized the importance of “ethical and legal” use of the model in its public release announcement. There are some important points:&lt;br&gt;
Users of the Stable Diffusion model are granted a “perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute” the Model, and any of its Complementary Material, such as its source code, and any Derivatives of the Model&lt;br&gt;
Users are also granted a “perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable” patent license to make, use, sell or offer to sell, import or otherwise transfer, the Model and any of its Complementary Material&lt;br&gt;
Users agree to use the Model or its Derivatives “in an ethical, moral and legal manner and contribute both to the community and discourse around it” and to not use the Model in a way which would cause harm to minors, defame anyone, discriminate against an individual or group or exploit the vulnerabilities of a specific group&lt;br&gt;
The Licensor does not assert any rights in the output users generate using the model&lt;/p&gt;

&lt;p&gt;Basically, the Stable Diffusion license relies on users self-regulating their own actions and their ability to “do the right thing”. However, the license does not impose any punishment for non-compliance with this social agreement between Stable Diffusion and its users.&lt;/p&gt;

&lt;p&gt;Commenting on this, the CEO of Stability.Ai, Emad Mostaque, explained that it is “peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology” and that the freedom given to users can provide an overall net benefit, even in spite of the potential negative consequences. Mostaque also adds that the image-generating AI systems that came before Stable Diffusion were developed behind closed doors and controlled by large corporations, whilst the availability of Stable Diffusion ends corporate control and dominance over such technologies.&lt;/p&gt;

&lt;p&gt;But is it considered art, though?&lt;br&gt;
The US Copyright Office says that it is not. In February 2022, the Office’s Review Board rejected a copyright application for the artwork “A Recent Entrance to Paradise”, which was generated by an AI algorithm that repurposed pictures to create an image resembling what a simulated dying brain might see.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm18ma66n281w5vshej9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm18ma66n281w5vshej9.png" alt="Image description" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Review Board stated that “human authorship is a prerequisite to copyright protection &amp;lt;...&amp;gt; but the Work [“A Recent Entrance to Paradise”] was autonomously created by artificial intelligence without any creative contribution from a human actor”.&lt;/p&gt;

&lt;p&gt;Yet AI art is relatively popular. For example, in 2018 the auction house Christie’s sold an AI-generated picture of a blurred face, titled “Portrait of Edmond Belamy”, for $432,500. But many claim that the only appeal of AI-generated art is the quirkiness of being made by a non-human.&lt;/p&gt;

&lt;p&gt;Arguing that AI-generated art is not really art, the Rutgers AI lab concluded:&lt;/p&gt;

&lt;p&gt;“Still, there’s something missing in [AI’s] artistic process: The algorithm might create appealing images, but it lives in an isolated creative space that lacks social context. Human artists, on the other hand, are inspired by people, places, and politics. They create art to tell stories and make sense of the world”. &lt;/p&gt;

</description>
      <category>bitcoin</category>
      <category>cryptocurrency</category>
      <category>cybersecurity</category>
      <category>blockchain</category>
    </item>
  </channel>
</rss>
