<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aditi Baheti</title>
    <description>The latest articles on DEV Community by Aditi Baheti (@aditi_baheti_f4a40487a091).</description>
    <link>https://dev.to/aditi_baheti_f4a40487a091</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1611004%2F3df2c2e9-9953-41d9-ba73-41b9faf22a81.jpg</url>
      <title>DEV Community: Aditi Baheti</title>
      <link>https://dev.to/aditi_baheti_f4a40487a091</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aditi_baheti_f4a40487a091"/>
    <language>en</language>
    <item>
      <title>Transforming Fashion with AI: Building a GenZ Trend Generator using Stable Diffusion 3 and DreamBooth LoRA</title>
      <dc:creator>Aditi Baheti</dc:creator>
      <pubDate>Mon, 26 Aug 2024 13:26:57 +0000</pubDate>
      <link>https://dev.to/aditi_baheti_f4a40487a091/transforming-fashion-with-ai-building-a-genz-trend-generator-using-stable-diffusion-3-and-dreambooth-lora-j3o</link>
      <guid>https://dev.to/aditi_baheti_f4a40487a091/transforming-fashion-with-ai-building-a-genz-trend-generator-using-stable-diffusion-3-and-dreambooth-lora-j3o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In a world where fashion trends change rapidly, staying ahead of the curve is both a challenge and an opportunity. The ability to predict and generate trendy designs can empower brands to cater to their audience’s evolving tastes more effectively. During a recent hackathon, our team, SheCodes from IIT Jodhpur, developed an AI-powered solution aimed at revolutionizing the "Forward" section of Myntra by generating trendy fashion designs targeted at GenZ. This blog will take you through the journey of creating this innovative project using state-of-the-art AI techniques, including Stable Diffusion 3, DreamBooth, and LoRA (Low-Rank Adaptation).&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Objective
&lt;/h3&gt;

&lt;p&gt;The project’s primary goal was to enhance Myntra's "Forward" section by automating the generation of fashionable dress designs tailored for GenZ users. We achieved this by fine-tuning a Stable Diffusion 3 model with DreamBooth LoRA, allowing the AI to learn and generate designs based on specific text prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team Composition
&lt;/h3&gt;

&lt;p&gt;The project was executed by SheCodes, a team of three dedicated members:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aditi Baheti&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aayushi Bhimani&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ritu Singh&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Stages
&lt;/h2&gt;

&lt;p&gt;Our project was divided into three major stages: &lt;strong&gt;Dataset Preparation&lt;/strong&gt;, &lt;strong&gt;Model Fine-Tuning&lt;/strong&gt;, and &lt;strong&gt;Inference &amp;amp; Deployment&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Dataset Preparation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Collecting the Data
&lt;/h4&gt;

&lt;p&gt;The foundation of any AI model lies in the quality of its dataset. We began by collecting a diverse set of images and captions from Myntra's "Forward" section. Each image was paired with a detailed text description, capturing essential attributes such as color, style, length, and pattern. This ensured that the model could learn the intricate details of fashion trends that resonate with GenZ.&lt;/p&gt;

&lt;h4&gt;
  
  
  Secure Image Identification with SHA-256 Hashing
&lt;/h4&gt;

&lt;p&gt;To manage and maintain the integrity of our dataset, we employed &lt;strong&gt;SHA-256 hashing&lt;/strong&gt;. This cryptographic technique provided a unique identifier for each image, enabling us to handle large datasets efficiently and avoid duplicate entries. By ensuring the uniqueness of each image, we could maintain a high standard of data quality throughout the project.&lt;/p&gt;
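&lt;p&gt;As a rough sketch, the deduplication step can look like this: hash each image file with SHA-256 and keep only the first file seen for each digest. The directory layout and file extension here are illustrative assumptions, not our exact pipeline.&lt;/p&gt;

```python
# Hypothetical sketch of SHA-256-based deduplication. File names and the
# dataset layout are assumptions for illustration.
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=8192):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate(image_dir):
    """Map each unique digest to the first image path that produced it."""
    unique = {}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        key = sha256_of_file(path)
        unique.setdefault(key, path)  # later duplicates are ignored
    return unique
```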

&lt;h4&gt;
  
  
  Computing High-Dimensional Embeddings
&lt;/h4&gt;

&lt;p&gt;The next step involved computing high-dimensional embeddings for the image-text pairs in our dataset. These embeddings serve as a condensed representation of the data, capturing the most important features that the model would later use to generate new designs. This was achieved using a pre-trained text encoder and image processing pipeline from the Stable Diffusion model.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Model Fine-Tuning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Loading Stable Diffusion 3 and DreamBooth LoRA
&lt;/h4&gt;

&lt;p&gt;Stable Diffusion 3, a state-of-the-art text-to-image generative model, formed the backbone of our project. We leveraged DreamBooth, a fine-tuning technique that allows the model to learn specific tasks, and LoRA, which enables fine-tuning with fewer parameters by focusing on low-rank adaptations. This combination allowed us to tailor the model specifically for generating fashion designs based on our curated dataset.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configuring LoRA and Training the Model
&lt;/h4&gt;

&lt;p&gt;The model was fine-tuned by adjusting the LoRA parameters, such as the rank and alpha values, to optimize learning. This involved iteratively training the model on our dataset while monitoring the training loss and using gradient accumulation to keep memory usage manageable. By the end of the training phase, the model could generate high-quality fashion designs that reflected current trends and resonated with GenZ preferences.&lt;/p&gt;
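&lt;p&gt;Our actual fine-tuning used the DreamBooth LoRA tooling for Stable Diffusion 3; as a self-contained illustration of the low-rank idea itself, here is a minimal LoRA-style linear layer in PyTorch. The rank and alpha values are illustrative, not our exact settings.&lt;/p&gt;

```python
# Minimal sketch of the LoRA idea: freeze a base linear layer and learn only
# a low-rank update (lora_b @ lora_a) scaled by alpha / rank.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        # low-rank factors: A is small random, B starts at zero so the
        # adapter initially leaves the base model's outputs unchanged
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path plus the trainable low-rank correction
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

&lt;p&gt;Only the two small factor matrices are trained, which is why LoRA fine-tuning needs far fewer parameters than updating the full model.&lt;/p&gt;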

&lt;h3&gt;
  
  
  3. Inference and Deployment
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Generating Fashion Designs
&lt;/h4&gt;

&lt;p&gt;With the model fine-tuned, we moved on to the inference phase, where the model was tasked with generating new fashion designs based on user-provided prompts. The ability of the model to interpret and creatively respond to these prompts was key to demonstrating the potential of AI in fashion design.&lt;/p&gt;

&lt;h4&gt;
  
  
  Deployment with Gradio and Hugging Face
&lt;/h4&gt;

&lt;p&gt;For deployment, we chose &lt;strong&gt;Gradio&lt;/strong&gt;, an open-source tool that makes it easy to create web-based interfaces for machine learning models. Integrated with Hugging Face, this setup allowed us to create a real-time, interactive experience where users could input their fashion preferences and receive AI-generated designs instantly. This deployment showcased how the model could be integrated into Myntra's platform to enhance user engagement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understanding the Execution Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foote33eyyu2q6bc1dbi9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foote33eyyu2q6bc1dbi9.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The overall execution of the project can be visualized in the flowchart provided. The process begins with collecting images from Myntra and applying SHA-256 hashing for secure identification. The images and captions are then transformed into embeddings using the DreamBooth fine-tuning method. These embeddings serve as the foundation for training the model, which is then fine-tuned using LoRA with Stable Diffusion 3. Finally, the model is deployed on Gradio, allowing users to generate fashion designs based on their prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Outputs
&lt;/h3&gt;

&lt;p&gt;To better understand the capabilities of our model, consider the following examples:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fby329hdav22u8621q26j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fby329hdav22u8621q26j.png" alt="Image description" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input Prompt&lt;/strong&gt;: "Blue, white floral print tiered fit &amp;amp; flare dress, above the knee length, Square neck, Short, puffed sleeves."

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: The model generates a dress design that closely matches the description, capturing the floral pattern, fit, and style as described.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmqgubwppgg862nc5s3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmqgubwppgg862nc5s3t.png" alt="Image description" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input Prompt&lt;/strong&gt;: "Beige-colored &amp;amp; black regular wrap top, Animal printed, V-neck, three-quarter, regular sleeves."

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: The AI outputs a design that mirrors the input description, including the wrap style, animal print, and sleeve length.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These examples highlight the model's ability to interpret complex fashion descriptions and translate them into visually appealing designs that align with current trends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Potential Impact
&lt;/h2&gt;

&lt;p&gt;While this project was developed within the scope of a hackathon, its implications are far-reaching. The ability to automate fashion design using AI could significantly enhance the creative process for designers, reduce the time and effort required to produce new collections, and offer personalized shopping experiences for users. By integrating this technology into a platform like Myntra, brands can stay ahead of trends and cater more effectively to their audience, particularly the GenZ demographic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits to Myntra
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For Designers&lt;/strong&gt;: Quick generation of diverse design options, reducing the time and effort required for the creative process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Users&lt;/strong&gt;: Personalized, trendy fashion suggestions that enhance the shopping experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Myntra&lt;/strong&gt;: Increased user engagement and potentially higher conversion rates, contributing to overall business growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Our project demonstrates the potential of AI to revolutionize the fashion industry. By fine-tuning a state-of-the-art diffusion model with DreamBooth and LoRA, we were able to create a system capable of generating high-quality, trend-aligned fashion designs tailored to the preferences of GenZ. While the project was developed for a hackathon, the techniques and models we explored have real-world applications that could transform how fashion is designed and consumed.&lt;/p&gt;

&lt;p&gt;We invite you to explore our &lt;a href="https://github.com/bahetiaditi/Finetune_stable_diffusion3" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for more details and to see how these techniques can be applied to other creative fields.&lt;/p&gt;

</description>
      <category>stablediffusion</category>
      <category>dreambooth</category>
      <category>lora</category>
      <category>hackathon</category>
    </item>
    <item>
      <title>An In-Depth Look at Audio Classification Using CNNs and Transformers</title>
      <dc:creator>Aditi Baheti</dc:creator>
      <pubDate>Wed, 03 Jul 2024 15:55:05 +0000</pubDate>
      <link>https://dev.to/aditi_baheti_f4a40487a091/an-in-depth-look-at-audio-classification-using-cnns-and-transformers-1981</link>
      <guid>https://dev.to/aditi_baheti_f4a40487a091/an-in-depth-look-at-audio-classification-using-cnns-and-transformers-1981</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Audio classification is a fascinating area of machine learning that involves categorizing audio signals into predefined classes. In this blog, we will delve into the specifics of an audio classification project, exploring the architectures, methodologies, and results obtained from experimenting with Convolutional Neural Networks (CNNs) and Transformers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;The project utilized the &lt;strong&gt;ESC-50 dataset&lt;/strong&gt;, a compilation of environmental audio clips categorized into 50 different classes. Specifically, the &lt;strong&gt;ESC-10 subset&lt;/strong&gt; was used, narrowing the dataset to 10 categories for more focused experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture 1: Convolutional Neural Networks (CNNs)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Setup
&lt;/h3&gt;

&lt;p&gt;The initial model setup for audio classification relied heavily on CNNs. These networks use stacked convolutional layers to progressively extract features from the audio signals, increasing the number of output channels from 16 to 64. Each convolutional layer is followed by a max-pooling layer that reduces the spatial dimensions and highlights the most salient features.&lt;/p&gt;

&lt;h4&gt;
  
  
  Original Model
&lt;/h4&gt;

&lt;p&gt;The original model focused solely on feature extraction, without dropout, early stopping, or other regularization techniques. This yielded a basic yet effective structure for learning the complex patterns in audio data.&lt;/p&gt;
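&lt;p&gt;An illustrative PyTorch version of this original architecture might look as follows; the kernel sizes and the single-channel spectrogram input shape are assumptions, not our exact configuration.&lt;/p&gt;

```python
# Sketch of the original CNN: three conv blocks growing the channels from
# 16 to 64, each followed by max-pooling, then a linear classifier.
import torch
import torch.nn as nn

class SoundClassifier(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(n_classes),  # infers the flattened size on first call
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```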

&lt;h4&gt;
  
  
  Enhanced Model
&lt;/h4&gt;

&lt;p&gt;To combat overfitting and improve generalization, several enhancements were made:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dropout&lt;/strong&gt;: Introduced to randomly deactivate neurons during training, thereby preventing over-reliance on specific paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early Stopping&lt;/strong&gt;: Implemented to halt training when validation performance plateaued, ensuring the model does not overfit to the training data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regularization&lt;/strong&gt;: Additional techniques were employed to further stabilize the training process and enhance generalization.&lt;/li&gt;
&lt;/ul&gt;
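&lt;p&gt;The early-stopping logic above can be sketched as a small helper that halts training once the validation loss stops improving; the patience value here is illustrative.&lt;/p&gt;

```python
# Minimal early-stopping helper: stop once validation loss has not improved
# for `patience` consecutive epochs.
class EarlyStopping:
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss >= self.best_loss:
            self.counter += 1          # no improvement this epoch
        else:
            self.best_loss = val_loss  # new best: reset the counter
            self.counter = 0
        return self.counter >= self.patience
```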

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;The use of k-fold cross-validation, with fold 1 reserved for validation, provided a comprehensive evaluation of the model's performance. Key observations from hyperparameter tuning include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Overfitting&lt;/strong&gt;: The enhanced model exhibited lower test losses and higher test accuracies, F1 scores, and ROC AUC values across all folds compared to the original model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following table summarizes the performance across different folds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Fold 2 (Original)&lt;/th&gt;
&lt;th&gt;Fold 2 (Enhanced)&lt;/th&gt;
&lt;th&gt;Fold 3 (Original)&lt;/th&gt;
&lt;th&gt;Fold 3 (Enhanced)&lt;/th&gt;
&lt;th&gt;Fold 4 (Original)&lt;/th&gt;
&lt;th&gt;Fold 4 (Enhanced)&lt;/th&gt;
&lt;th&gt;Fold 5 (Original)&lt;/th&gt;
&lt;th&gt;Fold 5 (Enhanced)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg. Training Accuracy&lt;/td&gt;
&lt;td&gt;63.49%&lt;/td&gt;
&lt;td&gt;51.15%&lt;/td&gt;
&lt;td&gt;68.77%&lt;/td&gt;
&lt;td&gt;43.67%&lt;/td&gt;
&lt;td&gt;68.64%&lt;/td&gt;
&lt;td&gt;55.49%&lt;/td&gt;
&lt;td&gt;67.55%&lt;/td&gt;
&lt;td&gt;49.84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg. Validation Accuracy&lt;/td&gt;
&lt;td&gt;34.25%&lt;/td&gt;
&lt;td&gt;38.42%&lt;/td&gt;
&lt;td&gt;39.17%&lt;/td&gt;
&lt;td&gt;35.00%&lt;/td&gt;
&lt;td&gt;38.54%&lt;/td&gt;
&lt;td&gt;40.64%&lt;/td&gt;
&lt;td&gt;38.44%&lt;/td&gt;
&lt;td&gt;43.97%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test Loss&lt;/td&gt;
&lt;td&gt;7.7658&lt;/td&gt;
&lt;td&gt;1.5196&lt;/td&gt;
&lt;td&gt;4.4111&lt;/td&gt;
&lt;td&gt;1.4217&lt;/td&gt;
&lt;td&gt;4.1973&lt;/td&gt;
&lt;td&gt;1.5789&lt;/td&gt;
&lt;td&gt;4.4777&lt;/td&gt;
&lt;td&gt;1.5499&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test Accuracy&lt;/td&gt;
&lt;td&gt;30.42%&lt;/td&gt;
&lt;td&gt;48.47%&lt;/td&gt;
&lt;td&gt;42.08%&lt;/td&gt;
&lt;td&gt;45.97%&lt;/td&gt;
&lt;td&gt;40.56%&lt;/td&gt;
&lt;td&gt;43.47%&lt;/td&gt;
&lt;td&gt;45.69%&lt;/td&gt;
&lt;td&gt;42.92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F1 Score&lt;/td&gt;
&lt;td&gt;0.26&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;0.41&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;0.44&lt;/td&gt;
&lt;td&gt;0.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROC AUC&lt;/td&gt;
&lt;td&gt;0.72&lt;/td&gt;
&lt;td&gt;0.88&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;td&gt;0.88&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;0.87&lt;/td&gt;
&lt;td&gt;0.80&lt;/td&gt;
&lt;td&gt;0.86&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Confusion Matrix and ROC Curve
&lt;/h3&gt;

&lt;p&gt;The confusion matrix and ROC curve for the best-performing fold (Fold 2) highlight the classifier's ability to distinguish most classes effectively. However, some misclassifications remain, suggesting the model needs further refinement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture 2: Transformers
&lt;/h2&gt;

&lt;p&gt;Transformers, known for their success in natural language processing, were adapted for audio classification in this project. The core of this architecture involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Convolutional Layers&lt;/strong&gt;: Used initially to extract basic audio features such as tones and rhythms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer Blocks&lt;/strong&gt;: Employed to process these features using attention mechanisms, enabling the model to focus on different parts of the audio sequence dynamically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Head Attention&lt;/strong&gt;: Utilized to attend to various representation subspaces simultaneously, enhancing the model's interpretive capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional Encodings&lt;/strong&gt;: Incorporated to retain the sequential order of the audio data, allowing the model to make effective use of positional information.&lt;/li&gt;
&lt;/ul&gt;
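&lt;p&gt;Put together, a hedged sketch of this architecture in PyTorch could look like the following; the layer counts, model width, and kernel size are illustrative, not our exact configuration.&lt;/p&gt;

```python
# Conv front end for local tone/rhythm features, sinusoidal positional
# encodings, a TransformerEncoder with multi-head attention, then mean
# pooling over time and a linear classification head.
import torch
import torch.nn as nn

class AudioClassifierWithTransformer(nn.Module):
    def __init__(self, n_classes=10, d_model=64, n_heads=2, n_layers=2):
        super().__init__()
        # conv layer extracts local features and halves the time length
        self.conv = nn.Conv1d(1, d_model, kernel_size=7, stride=2, padding=3)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def positional_encoding(self, length, d_model):
        pos = torch.arange(length).unsqueeze(1).float()
        i = torch.arange(0, d_model, 2).float()
        angle = pos / torch.pow(10000.0, i / d_model)
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe

    def forward(self, x):                        # x: (batch, 1, time)
        h = self.conv(x).transpose(1, 2)         # (batch, time', d_model)
        h = h + self.positional_encoding(h.size(1), h.size(2))
        h = self.encoder(h)
        return self.head(h.mean(dim=1))          # pool over time, classify
```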

&lt;h3&gt;
  
  
  Performance Metrics
&lt;/h3&gt;

&lt;p&gt;The transformer model was evaluated with different numbers of attention heads (1, 2, and 4). Key observations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two Heads Model&lt;/strong&gt;: This configuration outperformed others in terms of test accuracy and F1 score, suggesting an optimal balance between feature learning and generalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Four Heads Model&lt;/strong&gt;: Despite higher train accuracy, this model exhibited signs of overfitting, with less effective feature integration for classification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table below outlines the performance metrics for different configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Number of Heads&lt;/th&gt;
&lt;th&gt;Train Accuracy&lt;/th&gt;
&lt;th&gt;Valid Accuracy&lt;/th&gt;
&lt;th&gt;Test Accuracy&lt;/th&gt;
&lt;th&gt;Train Loss&lt;/th&gt;
&lt;th&gt;Valid Loss&lt;/th&gt;
&lt;th&gt;Test Loss&lt;/th&gt;
&lt;th&gt;F1 Score&lt;/th&gt;
&lt;th&gt;ROC AUC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 Head&lt;/td&gt;
&lt;td&gt;80.74%&lt;/td&gt;
&lt;td&gt;46.39%&lt;/td&gt;
&lt;td&gt;43.47%&lt;/td&gt;
&lt;td&gt;0.5412&lt;/td&gt;
&lt;td&gt;2.5903&lt;/td&gt;
&lt;td&gt;2.9106&lt;/td&gt;
&lt;td&gt;0.41&lt;/td&gt;
&lt;td&gt;0.82&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 Heads&lt;/td&gt;
&lt;td&gt;79.91%&lt;/td&gt;
&lt;td&gt;49.86%&lt;/td&gt;
&lt;td&gt;49.86%&lt;/td&gt;
&lt;td&gt;0.5778&lt;/td&gt;
&lt;td&gt;2.4115&lt;/td&gt;
&lt;td&gt;2.4757&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;0.86&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 Heads&lt;/td&gt;
&lt;td&gt;81.71%&lt;/td&gt;
&lt;td&gt;44.86%&lt;/td&gt;
&lt;td&gt;42.78%&lt;/td&gt;
&lt;td&gt;0.5759&lt;/td&gt;
&lt;td&gt;2.6297&lt;/td&gt;
&lt;td&gt;2.4895&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.84&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Enhanced Model with Transformers
&lt;/h3&gt;

&lt;p&gt;The enhanced model employed additional techniques such as gradient clipping and the AdamW optimizer, coupled with a learning rate scheduler. This configuration significantly improved the model's stability and generalization capabilities.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gradient Clipping&lt;/strong&gt;: Applied to prevent exploding gradients, ensuring stable training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AdamW Optimizer&lt;/strong&gt;: Recognized for its weight decay regularization, enhancing the model's performance on validation data.&lt;/li&gt;
&lt;/ul&gt;
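&lt;p&gt;One training epoch with these techniques can be sketched as below; the model, data loader, and hyperparameter values are stand-ins.&lt;/p&gt;

```python
# Sketch of the enhanced training step: AdamW with weight decay, gradient
# clipping before each optimizer step, and a per-epoch scheduler step.
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, scheduler, max_grad_norm=1.0):
    criterion = nn.CrossEntropyLoss()
    model.train()
    total_loss = 0.0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        # clip gradient norms before the update to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        total_loss += loss.item()
    scheduler.step()  # decay the learning rate once per epoch
    return total_loss / len(loader)
```

&lt;p&gt;AdamW is constructed with a nonzero weight_decay, which applies the decoupled weight-decay regularization mentioned above.&lt;/p&gt;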

&lt;p&gt;The enhanced model demonstrated superior performance across several metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Enhanced Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Train Accuracy&lt;/td&gt;
&lt;td&gt;79.81%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation Accuracy&lt;/td&gt;
&lt;td&gt;55.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test Accuracy&lt;/td&gt;
&lt;td&gt;58.19%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Train Loss&lt;/td&gt;
&lt;td&gt;0.6030&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation Loss&lt;/td&gt;
&lt;td&gt;1.5191&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test Loss&lt;/td&gt;
&lt;td&gt;1.1435&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F1 Score&lt;/td&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROC AUC&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Trainable Parameters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SoundClassifier&lt;/strong&gt;: Approximately 16.4 million trainable parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AudioClassifierWithTransformer&lt;/strong&gt;: About 8.9 million trainable parameters.&lt;/li&gt;
&lt;/ul&gt;
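&lt;p&gt;Counts like these can be reproduced for any PyTorch module by summing the element counts of all parameters that require gradients:&lt;/p&gt;

```python
# Count trainable parameters: sum numel() over parameters with requires_grad.
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```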

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project illustrates the potential of both CNNs and Transformers in audio classification tasks. While CNNs provide a solid foundation for feature extraction, Transformers offer advanced capabilities through attention mechanisms, enhancing the model's ability to interpret complex audio signals. By incorporating regularization techniques and advanced optimizers, the enhanced models achieved significant improvements in generalization and stability, highlighting the importance of these strategies in machine learning.&lt;/p&gt;

&lt;p&gt;The results underscore the effectiveness of using a combination of traditional convolutional methods and modern transformer architectures to tackle the challenges of audio classification, paving the way for further innovations in this exciting field.&lt;/p&gt;




</description>
      <category>transformers</category>
      <category>deeplearning</category>
      <category>ai</category>
      <category>cnn</category>
    </item>
    <item>
      <title>From Day to Night: Building a CycleGAN for Image Translation</title>
      <dc:creator>Aditi Baheti</dc:creator>
      <pubDate>Wed, 03 Jul 2024 10:26:49 +0000</pubDate>
      <link>https://dev.to/aditi_baheti_f4a40487a091/from-day-to-night-building-a-cyclegan-for-image-translation-3pjd</link>
      <guid>https://dev.to/aditi_baheti_f4a40487a091/from-day-to-night-building-a-cyclegan-for-image-translation-3pjd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to the exciting world of image translation! Have you ever wondered how a scene would look at night if you only have its day image? Using CycleGANs, we can transform images from one domain to another, like day to night and vice versa, without the need for paired examples. Let's dive into this fascinating journey and see how we can achieve this using CycleGANs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generative Adversarial Networks (GANs)
&lt;/h3&gt;

&lt;p&gt;Generative Adversarial Networks (GANs) pit two neural networks against each other: the Generator creates images and tries to fool the Discriminator, which learns to distinguish real images from generated ones. This adversarial process pushes the Generator to produce increasingly realistic images.&lt;/p&gt;

&lt;h3&gt;
  
  
  CycleGANs
&lt;/h3&gt;

&lt;p&gt;CycleGANs take GANs a step further by introducing cycle consistency. Instead of just one generator-discriminator pair, CycleGANs have two pairs, each learning to translate images from one domain to another. The cycle consistency ensures that if you translate an image from domain A to domain B and back to domain A, you should end up with the original image. This makes CycleGANs powerful for unpaired image-to-image translation tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkxgvzx4cdr3azmyrquy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkxgvzx4cdr3azmyrquy.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset Preparation
&lt;/h2&gt;

&lt;p&gt;Our dataset consists of day and night images. We split these images into training and testing sets to evaluate our model's performance. Specifically, we used 80% of the images for training and 20% for testing. The dataset comprises:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Training Day Images&lt;/strong&gt;: 417&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing Day Images&lt;/strong&gt;: 105&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training Night Images&lt;/strong&gt;: 181&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing Night Images&lt;/strong&gt;: 46&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Splitting the dataset this way lets us verify that the model generalizes to new, unseen data. Careful dataset preparation is crucial, as it directly impacts the model's performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hyperparameters
&lt;/h2&gt;

&lt;p&gt;Setting the right hyperparameters is key to training a successful model. For our CycleGAN, we carefully chose parameters such as the number of epochs, learning rate, batch size, and image size. These parameters control the training process and significantly influence the model's performance. Here are some of the essential hyperparameters we used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Epoch&lt;/strong&gt;: Starting epoch for training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n_epochs&lt;/strong&gt;: Total number of epochs for training, set to 200.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch Size&lt;/strong&gt;: Number of images fed into the model at once, set to 4.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning Rate&lt;/strong&gt;: Set to 0.0002, controls how much to change the model in response to the estimated error each time the model weights are updated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decay Start Epoch&lt;/strong&gt;: Epoch at which learning rate decay starts, set to 100.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Size&lt;/strong&gt;: Dimensions to which images are resized before feeding into the model, set to 128x128 pixels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Channels&lt;/strong&gt;: Number of color channels in the images, set to 3 (RGB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda_cyc&lt;/strong&gt;: Weight for the cycle consistency loss, set to 10.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda_id&lt;/strong&gt;: Weight for the identity loss, set to 5.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beta1 and Beta2&lt;/strong&gt;: Coefficients used for computing running averages of gradient and its square, set to 0.5 and 0.999, respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Augmentation
&lt;/h2&gt;

&lt;p&gt;To make our model robust, we applied data augmentation techniques such as resizing, normalizing, and random flipping. These augmentations help the model learn to generalize better by seeing various transformations of the input images. In our implementation, we used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resizing&lt;/strong&gt;: Images are resized to 128x128 pixels using bicubic interpolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt;: Pixel values are normalized to the range [-1, 1].&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random Flipping&lt;/strong&gt;: Although our current implementation does not include random flipping, this is a common technique used in data augmentation to make the model more robust.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Custom Dataset Class
&lt;/h2&gt;

&lt;p&gt;We created a custom dataset class to handle loading and transforming images from the day and night domains. This class reads images, applies the necessary transformations, and prepares the data for the model. It also supports unaligned image pairs, making it versatile for different datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unaligned Image Pairs
&lt;/h3&gt;

&lt;p&gt;In traditional supervised learning tasks, datasets consist of paired examples, where each input corresponds to a specific output. However, in many real-world scenarios, such paired datasets are not available. This is where unaligned image pairs come into play. Our dataset class supports unaligned image pairs, which means it can handle cases where the day and night images are not perfectly matched pairs. This flexibility is crucial for training on unpaired datasets, as it allows the model to learn from a broader range of examples, making it more generalizable.&lt;/p&gt;
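&lt;p&gt;A minimal sketch of such a dataset class is shown below; the directory layout and file pattern are assumptions. The key point is that the night image is drawn at random rather than matched to the day image by index.&lt;/p&gt;

```python
# Unaligned day/night dataset: images from the two domains are loaded
# independently, so no paired examples are required.
import random
from pathlib import Path
from torch.utils.data import Dataset
from PIL import Image

class DayNightDataset(Dataset):
    def __init__(self, day_dir, night_dir, transform=None):
        self.day_paths = sorted(Path(day_dir).glob("*.jpg"))
        self.night_paths = sorted(Path(night_dir).glob("*.jpg"))
        self.transform = transform

    def __len__(self):
        return max(len(self.day_paths), len(self.night_paths))

    def __getitem__(self, index):
        day = Image.open(self.day_paths[index % len(self.day_paths)]).convert("RGB")
        # unaligned: the night image is sampled at random, not matched by index
        night = Image.open(random.choice(self.night_paths)).convert("RGB")
        if self.transform:
            day, night = self.transform(day), self.transform(night)
        return {"day": day, "night": night}
```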

&lt;h2&gt;
  
  
  Replay Buffer
&lt;/h2&gt;

&lt;p&gt;A Replay Buffer stores previously generated images, which are then reused during training. This technique stabilizes training by showing the Discriminator a mix of recent and older generated images, preventing it from overfitting to the Generator's most recent outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importance and Advantages of Replay Buffer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stabilizes Training&lt;/strong&gt;: By providing a mix of recent and older generated images, it prevents the Discriminator from becoming too adapted to the most recent outputs of the Generator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improves Generalization&lt;/strong&gt;: By reusing images, it helps the Generator learn to produce more varied and realistic images over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Use of Data&lt;/strong&gt;: Ensures that generated images are not wasted and are used effectively to improve the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;In our implementation, the Replay Buffer stores up to 50 previously generated images. When new images are generated, there is a 50% chance that an image from the buffer will be used instead. This randomness helps in keeping the training process dynamic and effective.&lt;/p&gt;
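&lt;p&gt;The buffer described above can be sketched in a few lines; the 50-image capacity and the 50% swap probability match the description, while the rest is a minimal illustration rather than our exact implementation.&lt;/p&gt;

```python
import random

class ReplayBuffer:
    """Holds up to max_size previously generated images; once full,
    each new image has a 50% chance of being swapped with an older
    one from the buffer."""
    def __init__(self, max_size=50):
        self.max_size = max_size
        self.data = []

    def push_and_pop(self, images):
        out = []
        for img in images:
            if self.max_size > len(self.data):
                # Buffer not full yet: store and return the new image.
                self.data.append(img)
                out.append(img)
            elif random.random() > 0.5:
                # Return an old image; keep the new one in its place.
                i = random.randrange(self.max_size)
                out.append(self.data[i])
                self.data[i] = img
            else:
                out.append(img)
        return out
```

&lt;p&gt;The Discriminator is then trained on the returned batch, which mixes fresh and buffered fakes.&lt;/p&gt;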

&lt;h2&gt;
  
  
  LambdaLR
&lt;/h2&gt;

&lt;p&gt;LambdaLR is a learning rate scheduler that decays the learning rate after a set number of epochs. A gradual decay, rather than abrupt changes, helps the model converge smoothly and keeps training stable. Starting from the decay-start epoch, our scheduler reduces the learning rate linearly.&lt;/p&gt;
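&lt;p&gt;Assuming, for illustration, a 200-epoch run with decay starting at epoch 100 (hypothetical values, not ones stated above), the multiplier passed to PyTorch's LambdaLR can be written as:&lt;/p&gt;

```python
def linear_decay(epoch, n_epochs=200, decay_start=100):
    """Multiplier for the base learning rate: 1.0 until decay_start,
    then linearly down to 0.0 at n_epochs."""
    return 1.0 - max(0, epoch - decay_start) / (n_epochs - decay_start)
```

&lt;p&gt;In PyTorch this is wired up as &lt;code&gt;torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)&lt;/code&gt;, with the scheduler stepped once per epoch.&lt;/p&gt;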

&lt;h2&gt;
  
  
  Initialization of Convolutional Weights
&lt;/h2&gt;

&lt;p&gt;Correctly initializing the weights of the convolutional layers is vital for stable training. We used normal initialization with mean 0 and standard deviation 0.02, a standard practice for GANs. This speeds up convergence and leads to better results.&lt;/p&gt;
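&lt;p&gt;As a sketch of what this initialization produces (NumPy here for illustration; in PyTorch the equivalent is &lt;code&gt;nn.init.normal_(m.weight, 0.0, 0.02)&lt;/code&gt; applied via &lt;code&gt;model.apply&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def init_conv_weights(shape, mean=0.0, std=0.02):
    """Draw convolution weights from N(0, 0.02^2), the GAN-standard
    initialization described above."""
    return rng.normal(mean, std, size=shape)

# Example: a 64-filter, 3-channel, 7x7 convolution layer.
w = init_conv_weights((64, 3, 7, 7))
```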

&lt;h2&gt;
  
  
  Model Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generator
&lt;/h3&gt;

&lt;p&gt;Our Generator uses a ResNet architecture consisting of several convolutional layers, normalization layers, and residual blocks. Residual blocks are essential as they help in retaining the image features across layers, crucial for generating high-quality images. Here's a detailed breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Convolution Block&lt;/strong&gt;: Pads and convolves the input image to start the feature extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downsampling Layers&lt;/strong&gt;: Reduce the spatial dimensions, increasing the feature depth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residual Blocks&lt;/strong&gt;: We used 19 residual blocks that maintain the image's features while allowing deeper layers to learn more abstract representations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upsampling Layers&lt;/strong&gt;: Increase the spatial dimensions back to the original size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output Layer&lt;/strong&gt;: Produces the final translated image using a Tanh activation function.&lt;/li&gt;
&lt;/ul&gt;
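&lt;p&gt;The blocks above can be condensed into the following PyTorch sketch. It follows common CycleGAN conventions (reflection padding, InstanceNorm, two downsampling and two upsampling stages); the channel widths are assumptions, while the 19 residual blocks match our configuration.&lt;/p&gt;

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3),
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3),
            nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)  # skip connection preserves features

class ResNetGenerator(nn.Module):
    def __init__(self, n_residual=19):
        super().__init__()
        layers = [  # initial convolution block
            nn.ReflectionPad2d(3), nn.Conv2d(3, 64, 7),
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
        ]
        ch = 64
        for _ in range(2):  # downsampling: halve spatial size, double depth
            layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True)]
            ch *= 2
        layers += [ResidualBlock(ch) for _ in range(n_residual)]
        for _ in range(2):  # upsampling: restore spatial size
            layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2,
                                          padding=1, output_padding=1),
                       nn.InstanceNorm2d(ch // 2), nn.ReLU(inplace=True)]
            ch //= 2
        # Output layer: Tanh maps pixels to [-1, 1]
        layers += [nn.ReflectionPad2d(3), nn.Conv2d(ch, 3, 7), nn.Tanh()]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```

&lt;p&gt;Because the downsampling and upsampling stages mirror each other, the output image has the same spatial size as the input.&lt;/p&gt;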

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkvvqzgu15dyrf8t2sai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkvvqzgu15dyrf8t2sai.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  PatchGAN Discriminator
&lt;/h3&gt;

&lt;p&gt;Our Discriminator uses a PatchGAN architecture, focusing on classifying patches of the image as real or fake. This approach allows the model to capture fine details, making the generated images more realistic.&lt;/p&gt;

&lt;h4&gt;
  
  
  What is PatchGAN?
&lt;/h4&gt;

&lt;p&gt;PatchGAN is a type of GAN architecture that classifies each patch of the image as real or fake, rather than the entire image. This technique helps in capturing high-frequency details and textures, leading to more realistic outputs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Advantages of PatchGAN
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Detail Preservation&lt;/strong&gt;: By focusing on small patches, it helps in preserving fine details and textures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computational Efficiency&lt;/strong&gt;: A patch-level discriminator needs fewer parameters than a classifier over the whole image, making it faster and less resource-intensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Realism&lt;/strong&gt;: Helps in generating images that are more visually appealing and realistic by focusing on local features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Discriminator Architecture
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Convolutional Blocks&lt;/strong&gt;: Layers with convolution, normalization, and activation functions to extract features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PatchGAN Output&lt;/strong&gt;: Outputs a matrix representing the probability of each patch being real.&lt;/li&gt;
&lt;/ul&gt;
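&lt;p&gt;A minimal PatchGAN sketch along these lines, assuming the common 4&amp;times;4, stride-2 convolution blocks (the exact channel widths are assumptions):&lt;/p&gt;

```python
import torch
import torch.nn as nn

def disc_block(in_ch, out_ch, normalize=True):
    """Convolution, optional InstanceNorm, LeakyReLU: one feature stage."""
    layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1)]
    if normalize:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class PatchGANDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            *disc_block(3, 64, normalize=False),
            *disc_block(64, 128),
            *disc_block(128, 256),
            *disc_block(256, 512),
            # One score per patch rather than one per image:
            nn.Conv2d(512, 1, 4, padding=1),
        )

    def forward(self, x):
        return self.model(x)
```

&lt;p&gt;The output is a 2D grid of scores, one per receptive-field patch, which the loss then compares against all-real or all-fake targets.&lt;/p&gt;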

&lt;h2&gt;
  
  
  Loss Functions
&lt;/h2&gt;

&lt;p&gt;We employed three types of loss functions to train our CycleGAN:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial Loss&lt;/strong&gt;: Ensures that the generated images look realistic by fooling the Discriminator, implemented using Mean Squared Error (MSE) loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cycle Consistency Loss&lt;/strong&gt;: Ensures that translating an image to the other domain and back results in the original image, implemented using L1 loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity Loss&lt;/strong&gt;: Ensures that images already in the target domain are preserved during translation, also implemented using L1 loss.&lt;/li&gt;
&lt;/ol&gt;
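&lt;p&gt;The three losses reduce to MSE and L1 terms. The sketch below uses NumPy with tiny hypothetical tensors; the weightings of 10 for cycle consistency and 5 for identity are the CycleGAN paper's common defaults, not values quoted above.&lt;/p&gt;

```python
import numpy as np

def mse(a, b):   # adversarial (least-squares GAN) loss
    return float(np.mean((a - b) ** 2))

def l1(a, b):    # cycle-consistency and identity losses
    return float(np.mean(np.abs(a - b)))

# Hypothetical values standing in for model outputs:
d_out_fake = np.array([0.9, 0.8])        # Discriminator scores on fakes
real_labels = np.ones_like(d_out_fake)
x = np.array([0.2, -0.1])                # original image
x_cycled = np.array([0.25, -0.05])       # G_BA(G_AB(x))

adv_loss = mse(d_out_fake, real_labels)  # Generator wants scores near 1
cycle_loss = 10.0 * l1(x, x_cycled)      # lambda_cyc = 10, a common choice
identity_loss = 5.0 * l1(x, x)           # zero when already in target domain
total = adv_loss + cycle_loss + identity_loss
```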

&lt;h2&gt;
  
  
  Optimizers and Gradient Clipping
&lt;/h2&gt;

&lt;p&gt;We used Adam optimizers to update the weights of our models, with separate optimizers for the Generators and Discriminators. Gradient clipping was applied to prevent the gradients from exploding, which helps in stabilizing the training process.&lt;/p&gt;
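&lt;p&gt;Gradient clipping by norm amounts to rescaling the gradient whenever its L2 norm exceeds a threshold, which is what &lt;code&gt;torch.nn.utils.clip_grad_norm_&lt;/code&gt; does across all parameters of a model. A one-vector NumPy sketch:&lt;/p&gt;

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```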

&lt;h2&gt;
  
  
  Training Procedure
&lt;/h2&gt;

&lt;p&gt;The training process involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forward Pass&lt;/strong&gt;: Generate fake images using the Generators.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute Losses&lt;/strong&gt;: Calculate adversarial, cycle consistency, and identity losses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backward Pass&lt;/strong&gt;: Compute gradients and update model weights using the optimizers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient Clipping&lt;/strong&gt;: Clip gradients to a maximum value to prevent exploding gradients.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning Rate Scheduling&lt;/strong&gt;: Adjust the learning rate during training to ensure smooth convergence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We also used a Replay Buffer to store previously generated images and a LambdaLR scheduler to adjust the learning rate during training.&lt;/p&gt;
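&lt;p&gt;The five steps can be condensed into a single Generator update. The modules here are toy stand-ins so the step order stays visible; the Adam hyperparameters (lr=2e-4, betas=(0.5, 0.999)) are common GAN defaults rather than values stated above.&lt;/p&gt;

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real models are the ResNet Generator and
# PatchGAN Discriminator described earlier.
G = nn.Linear(8, 8)
D = nn.Linear(8, 1)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda=lambda e: 1.0)

real = torch.randn(4, 8)

# 1. Forward pass: generate fakes.
fake = G(real)
# 2. Compute losses (adversarial term only, for brevity).
loss_G = nn.functional.mse_loss(D(fake), torch.ones(4, 1))
# 3. Backward pass.
opt_G.zero_grad()
loss_G.backward()
# 4. Clip gradients to a maximum norm.
torch.nn.utils.clip_grad_norm_(G.parameters(), max_norm=1.0)
# 5. Update weights, then step the scheduler once per epoch.
opt_G.step()
sched_G.step()
```

&lt;p&gt;The Discriminators are updated in the same way with their own optimizers, using the Replay Buffer's mix of fresh and stored fakes.&lt;/p&gt;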

&lt;h2&gt;
  
  
  Evaluation
&lt;/h2&gt;

&lt;p&gt;During evaluation, we generated images from the validation set and compared them with the real images. This helps us understand how well the model has learned the mappings between the domains. We saved model checkpoints periodically to monitor progress.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualization and Results
&lt;/h3&gt;

&lt;p&gt;After training our CycleGAN model, it is crucial to visualize the results to assess the quality of the image translations. Below are the visualizations of the real and generated images along with the training loss curves.&lt;/p&gt;

&lt;h4&gt;
  
  
  Image Translations
&lt;/h4&gt;

&lt;p&gt;The first image grid showcases the real images from the day and night domains and their corresponding generated counterparts. Each row contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First Column&lt;/strong&gt;: Real day images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second Column&lt;/strong&gt;: Generated night images from the corresponding real day images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third Column&lt;/strong&gt;: Real night images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fourth Column&lt;/strong&gt;: Generated day images from the corresponding real night images.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksk4kbz0szgpebwieevs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksk4kbz0szgpebwieevs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Analysis of Image Translations
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visual Quality&lt;/strong&gt;: The generated night images capture the dark tones and lighting typical of nighttime scenes. Similarly, the generated day images retain the brightness and color characteristic of daytime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detail Preservation&lt;/strong&gt;: The model manages to preserve significant details from the original images, such as buildings, streets, and landscapes, while translating the overall ambiance from day to night and vice versa.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: There is a consistent style in the generated images, indicating that the model has learned the translation mapping effectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Training Loss Curves
&lt;/h4&gt;

&lt;p&gt;The second figure illustrates the training loss curves for both the Generator (G) and the Discriminator (D) over the training epochs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9yq31muqaty2yhk7tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9yq31muqaty2yhk7tw.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Analysis of Training Loss Curves
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generator Loss (G)&lt;/strong&gt;: The generator loss shows a decreasing trend, which suggests that the Generator is improving its ability to produce realistic images that can fool the Discriminator over time. There are fluctuations, which are typical in GAN training due to the adversarial nature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discriminator Loss (D)&lt;/strong&gt;: The discriminator loss remains relatively low and stable throughout the training process, indicating that the Discriminator effectively distinguishes between real and fake images. The stability of the discriminator loss is a good sign, suggesting that the training process is balanced.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Key Observations
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Training Stability&lt;/strong&gt;: The loss curves indicate that the training process was stable, with the Generator and Discriminator learning effectively from each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improvement Over Time&lt;/strong&gt;: The gradual decrease in the Generator loss highlights that the model becomes better at generating realistic images as training progresses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Balanced Adversarial Training&lt;/strong&gt;: The consistent discriminator loss shows that the Discriminator is performing its role effectively without overwhelming the Generator, ensuring a balanced adversarial process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These visualizations and analysis of the training loss curves demonstrate the effectiveness of our CycleGAN model in translating day images to night images and vice versa. The results indicate that the model has successfully learned the mappings between the two domains, producing realistic and visually appealing image translations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;CycleGANs are a powerful tool for image translation tasks without requiring paired datasets. By using adversarial, cycle consistency, and identity losses, CycleGANs can generate realistic translations between two domains. This implementation demonstrates the potential of CycleGANs in tasks such as day-to-night image translation, offering valuable insights into their workings and applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generalization
&lt;/h2&gt;

&lt;p&gt;The model we built for day-to-night image translation generalizes to other unpaired image-translation datasets as well. For instance, it can be used for tasks like translating horses to zebras, summer to winter landscapes, or even artistic style transfer. The same principles and architectures apply, making CycleGANs a versatile solution for many image-to-image translation problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detailed Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dataset Preparation&lt;/strong&gt;: Collected images from day and night domains, split them into training and testing sets, and applied data augmentations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameters&lt;/strong&gt;: Defined key parameters such as learning rate, batch size, and the number of epochs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Dataset Class&lt;/strong&gt;: Created a class to load and transform images, handling both aligned and unaligned image pairs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay Buffer&lt;/strong&gt;: Implemented a buffer to store and reuse previously generated images to stabilize training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LambdaLR&lt;/strong&gt;: Used a learning rate scheduler to adjust the learning rate during training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initialization of Convolutional Weights&lt;/strong&gt;: Applied normal initialization to the convolutional layers for stable training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Architecture&lt;/strong&gt;: Implemented Generators and Discriminators using ResNet and PatchGAN architectures, respectively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss Functions&lt;/strong&gt;: Used adversarial, cycle consistency, and identity losses to train the models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizers and Gradient Clipping&lt;/strong&gt;: Used Adam optimizers and applied gradient clipping to prevent exploding gradients.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training Loop&lt;/strong&gt;: Performed forward and backward passes, computed losses, updated model weights, and applied gradient clipping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation&lt;/strong&gt;: Generated images from the validation set and saved model checkpoints periodically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: Displayed real and generated images side by side, labeled for clarity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By following these detailed steps, we implemented a CycleGAN model capable of translating images between day and night domains, demonstrating the versatility and power of GAN-based image translation.&lt;/p&gt;

&lt;p&gt;Feel free to reach out if you have any questions or need further clarification on any part of the implementation. Happy coding!&lt;/p&gt;

</description>
      <category>cyclegan</category>
      <category>deeplearning</category>
      <category>gan</category>
      <category>ai</category>
    </item>
    <item>
      <title>Bridging Linguistic Diversity: Evaluating and Advancing AI for Indian Languages</title>
      <dc:creator>Aditi Baheti</dc:creator>
      <pubDate>Tue, 11 Jun 2024 17:07:33 +0000</pubDate>
      <link>https://dev.to/aditi_baheti_f4a40487a091/bridging-linguistic-diversity-evaluating-and-advancing-ai-for-indian-languages-1pm0</link>
      <guid>https://dev.to/aditi_baheti_f4a40487a091/bridging-linguistic-diversity-evaluating-and-advancing-ai-for-indian-languages-1pm0</guid>
      <description>&lt;h2&gt;
  
  
  Bridging Linguistic Diversity: Evaluating and Advancing AI for Indian Languages
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Introduction to Language Models and Their Benchmarks
&lt;/h3&gt;

&lt;p&gt;Large language models (LLMs) are at the heart of modern AI, enabling machines to understand and generate human language. The effectiveness of these models is gauged through benchmarks: standardized tests designed to evaluate their performance across various tasks. Benchmarks play a crucial role in identifying strengths, pinpointing weaknesses, and guiding improvements in LLMs.&lt;/p&gt;

&lt;p&gt;Key Aspects of Language Models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale:&lt;/strong&gt; The ability to process vast amounts of data efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptability:&lt;/strong&gt; Flexibility to perform a range of tasks from translation to summarization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual Understanding:&lt;/strong&gt; Comprehension of context and subtleties in language.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benchmarks: What, Why, and How
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What Are Benchmarks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Benchmarks are standardized datasets and tasks used to assess the performance of language models. They provide a common ground for comparing different models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Are They Important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Benchmarks help in understanding how well models perform across different tasks, identifying areas for improvement, and driving the development of more capable AI systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Are They Conducted?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Models are evaluated on predefined tasks using metrics such as accuracy, precision, and recall. These tasks range from sentiment analysis to natural language inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Benchmarks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GLUE (General Language Understanding Evaluation):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Evaluates general language understanding tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks:&lt;/strong&gt; Includes sentiment analysis, sentence similarity, and natural language inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; Comprehensive evaluation of model capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Primarily focused on English, which limits its applicability to other languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SUPERGLUE:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Designed to challenge more advanced models beyond what GLUE offers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks:&lt;/strong&gt; Includes Boolean QA, commonsense reasoning, and coreference resolution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; Introduces more complex tasks requiring deeper understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Resource-intensive and still centered on English.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hellaswag:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Tests commonsense reasoning by predicting plausible continuations of events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Source:&lt;/strong&gt; Derived from ActivityNet Captions and WikiHow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; Focuses on practical scenarios and everyday reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Primarily in English, specific to certain types of reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MMLU (Massive Multitask Language Understanding):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Evaluates the performance of models across a broad spectrum of subjects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks:&lt;/strong&gt; Includes questions from standardized tests and professional exams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advantages:&lt;/strong&gt; Broad coverage of subjects and real-world relevance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Performance can vary significantly with small changes in test conditions, such as the order of answers or symbols.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Developing and Evaluating LLMs for Indian Languages
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Journey of IndicLLMs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The journey of IndicLLMs began with IndicBERT in 2020, focusing on Natural Language Understanding (NLU). IndicBERT has over 400K downloads on Hugging Face, highlighting its widespread use. IndicBART followed in 2021, targeting Natural Language Generation (NLG). These models were developed with support from EkStep Foundation and Nilekani Philanthropies, despite the limited data and model scale available.&lt;/p&gt;

&lt;p&gt;With the introduction of large open models like Llama-2 and Mistral, the focus shifted towards adapting these models for Indic languages. Initiatives like OpenHathi (Base) and Airavata (Chat) have emerged, developing models tailored to different languages. These adaptations involve extending the tokenizer and embedding layer, followed by continual pretraining using data from existing multilingual corpora like mc4, OSCAR, and Roots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges in Indic Language Models:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Scarcity:&lt;/strong&gt; Limited high-quality datasets for many Indian languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dialectal Variations:&lt;/strong&gt; Managing diverse dialects and regional nuances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technological Gaps:&lt;/strong&gt; Need for more computational resources and standardized tools for development and evaluation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Indic-Only Models Are Necessary:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Despite the capabilities of models like GPT-3.5 and GPT-4, there are specific reasons why Indic-only models are essential:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization Efficiency:&lt;/strong&gt; Tokenizers trained primarily on English represent Indic scripts inefficiently, so the same content consumes far more tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance on Low-Resource Languages:&lt;/strong&gt; English models perform well with high-resource languages but struggle with low-to-medium resource languages like Oriya, Kashmiri, and Dogri.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy and Hallucinations:&lt;/strong&gt; Issues like hallucinations are more pronounced in Indic languages, significantly decreasing the accuracy of responses.&lt;/li&gt;
&lt;/ul&gt;
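&lt;p&gt;The tokenization point can be made concrete with a rough proxy: byte-level vocabularies pay more units per character for Devanagari than for Latin script, as a quick UTF-8 byte count shows (this is not an actual tokenizer, just an illustration of the imbalance).&lt;/p&gt;

```python
def bytes_per_char(text):
    """UTF-8 bytes per character: a rough proxy for how byte-level
    tokenizers fragment a script (not an actual tokenizer)."""
    return len(text.encode("utf-8")) / len(text)

english = "hello"
hindi = "नमस्ते"   # Devanagari code points take 3 bytes each in UTF-8
```

&lt;p&gt;Real subword tokenizers compound this further when their merge tables are learned mostly from English text, leaving Indic words split into many small fragments.&lt;/p&gt;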

&lt;h3&gt;
  
  
  Samanantar Dataset
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Samanantar is a large-scale parallel corpus collection designed to support machine translation and other NLP tasks. It contains 49.7 million sentence pairs between English and 11 Indic languages, representing two language families.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Collection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data for Samanantar was collected from various sources, including news articles, religious texts, and government documents. The process involved identifying parallel sentences, scoring their similarity, and post-processing to ensure quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creation Process:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Sentences:&lt;/strong&gt; Identifying sentences that are translations of each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring Function:&lt;/strong&gt; Using LaBSE embeddings to determine the likelihood of sentences being translation pairs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-Processing:&lt;/strong&gt; Removing duplicates and ensuring high-quality sentence pairs.&lt;/li&gt;
&lt;/ul&gt;
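&lt;p&gt;The scoring step reduces to a similarity between sentence embeddings. The sketch below uses cosine similarity over small stand-in vectors; Samanantar scores candidates with LaBSE embeddings, and the 0.8 threshold here is illustrative, not the paper's value.&lt;/p&gt;

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two sentence embeddings; candidate
    pairs above a chosen threshold are kept as likely translations."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings; the real pipeline uses LaBSE vectors.
en = np.array([0.9, 0.1, 0.2])       # English sentence
hi_good = np.array([0.88, 0.12, 0.18])  # likely translation
hi_bad = np.array([-0.1, 0.9, -0.4])    # unrelated sentence

keep_threshold = 0.8   # illustrative value
```

&lt;p&gt;Pairs scoring above the threshold move on to post-processing (deduplication and quality filtering); the rest are discarded.&lt;/p&gt;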

&lt;p&gt;&lt;strong&gt;Challenges in Data Collection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The inherent noisiness of web-sourced data is a significant challenge. The quality of content varies, often containing unwanted content like poorly translated text. Ensuring high-quality, relevant content is crucial, which is why human verification plays a vital role in the data collection pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sangraha Corpus: The Foundation for Indian LLMs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sangraha Verified:&lt;/strong&gt; Contains 48B tokens of high-quality, human-verified web crawled content in all 22 scheduled Indian languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sangraha Synthetic:&lt;/strong&gt; Includes 90B tokens from translations of English Wikimedia into 14 Indian languages and 72B tokens from transliterations into Roman script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sangraha Unverified:&lt;/strong&gt; Adds 24B tokens of high-quality, unverified data from existing collections like CulturaX and MADLAD-400.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  IndicGLUE
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IndicGLUE focuses on core NLU tasks such as sentiment analysis, NER, and QA. It covers 11 Indian languages and relies on machine translation for some of its datasets. However, it is not explicitly designed for zero-shot evaluation, which limits its applicability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;News Category Classification:&lt;/strong&gt; Classifying news articles into predefined categories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Named Entity Recognition (NER):&lt;/strong&gt; Identifying and classifying proper nouns and entities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headline Prediction:&lt;/strong&gt; Generating headlines for given texts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Question Answering (QA):&lt;/strong&gt; Answering questions based on given text passages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  IndicXTREME
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IndicXTREME is a human-supervised benchmark designed to evaluate models on nine diverse NLU tasks across 20 Indian languages. It includes 105 evaluation sets, with 52 newly created for this benchmark, ensuring high quality and relevance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Largest Monolingual Corpora:&lt;/strong&gt; IndicCorp with 20.9B tokens across 24 languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-Supervised Benchmark:&lt;/strong&gt; Emphasizes human-created or human-translated datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks:&lt;/strong&gt; Covers 9 diverse NLU tasks, including classification, structure prediction, QA, and text retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Shot Testing:&lt;/strong&gt; Designed to test the zero-shot multilingual capabilities of pretrained language models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages Over IndicGLUE:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Broader Coverage:&lt;/strong&gt; Evaluates more languages and tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher Quality:&lt;/strong&gt; Human supervision ensures better accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Shot Capabilities:&lt;/strong&gt; Tests generalization without specific training data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  OpenHathi and Airavata LLM Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenHathi:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developed By:&lt;/strong&gt; Sarvam AI and AI4Bharat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base Model:&lt;/strong&gt; Extended from Llama 2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus:&lt;/strong&gt; Foundational model for Hindi.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Features:&lt;/strong&gt; Trained on diverse Hindi datasets, open source for community use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Airavata:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developed By:&lt;/strong&gt; AI4Bharat and Sarvam AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base Model:&lt;/strong&gt; Fine-tuned from OpenHathi.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus:&lt;/strong&gt; Instruction-tuned model for assistive tasks in Hindi.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Features:&lt;/strong&gt; Uses IndicInstruct dataset, with 7B parameters, optimized for generating Hindi instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Issues with Machine Translations for Indian Languages
&lt;/h3&gt;

&lt;p&gt;Machine translations play a crucial role in building datasets for Indic language models, but they come with significant challenges and limitations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Loss:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue:&lt;/strong&gt; Machine translations often lose the nuanced meanings and context of the original text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Idiomatic expressions or cultural references can be inaccurately translated, leading to a loss of intended meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; This affects the comprehension and relevance of the translated text, which can mislead the language model during training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Partial Sentences:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue:&lt;/strong&gt; Translating partial sentences or phrases can lead to ambiguities and incorrect interpretations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A phrase in English might not have a direct counterpart in an Indic language, leading to incomplete or inaccurate translations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; This can result in fragmented or nonsensical data that negatively impacts the model's learning process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Order and Format Changes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue:&lt;/strong&gt; Changes in the order of words or the format of sentences during translation can significantly alter the meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; The structure of questions and answers can be altered, leading to inconsistencies in the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; This inconsistency can cause models to perform poorly, as they struggle to interpret the translated text correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bias Introduction:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue:&lt;/strong&gt; Automated translation processes can introduce or amplify biases present in the source or target languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Gender biases or cultural biases might be exaggerated or incorrectly represented in translations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; These biases can skew the training data, leading to biased language models that do not perform equitably across different user groups.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cultural Nuances:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue:&lt;/strong&gt; Capturing the cultural context and nuances specific to Indic languages is challenging for machine translations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Cultural references, local customs, and regional dialects might not be accurately translated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; This can lead to misunderstandings and misinterpretations, reducing the effectiveness and relevance of the language model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource Intensity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue:&lt;/strong&gt; Ensuring high-quality translations requires significant computational and human resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Manual verification and correction of machine-translated data can be resource-intensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; The high cost and effort involved can limit the scalability and feasibility of creating large, high-quality datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Addressing These Challenges
&lt;/h3&gt;

&lt;p&gt;To overcome these challenges, several strategies can be employed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Collaborative Translation Approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combining machine translation with human validation to ensure accuracy and cultural relevance.&lt;/li&gt;
&lt;li&gt;Involving native speakers and linguists in the translation process to maintain context and nuance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Standardized Guidelines:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developing clear guidelines for translators to maintain consistency and quality across translations.&lt;/li&gt;
&lt;li&gt;Training translators to understand the specific requirements and nuances of NLP tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Contextual Embedding Techniques:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using advanced embedding techniques to preserve the context and meaning of sentences during translation.&lt;/li&gt;
&lt;li&gt;Implementing thresholding and post-processing steps to filter out low-quality translations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multilingual Prompting Strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designing prompts that are suitable for multiple languages and contexts to improve model performance.&lt;/li&gt;
&lt;li&gt;Utilizing few-shot learning techniques to provide models with contextually relevant examples.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bias Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conducting regular bias audits on translated datasets to identify and address potential biases.&lt;/li&gt;
&lt;li&gt;Ensuring datasets include diverse sources and contexts to reduce the impact of any single bias.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource Optimization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using efficient translation tools and APIs to handle large-scale translations without compromising quality.&lt;/li&gt;
&lt;li&gt;Optimizing computational resources to manage the high demands of translation processes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing these strategies, we can create more accurate, culturally relevant, and effective language models for Indian languages, ensuring they are robust and equitable for all users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pariksha Benchmark
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Challenges with Existing Multilingual Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Lingual Contamination:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Even if the multilingual version of a benchmark is not contaminated, the original English version might be. Models can use knowledge of the English benchmark through cross-lingual transfer, making the multilingual benchmark indirectly contaminated.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss of Cultural and Linguistic Nuances:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Direct translations of benchmarks created in English and in a Western context often lose crucial cultural and linguistic nuances. Specialized models need to be evaluated on these dimensions to ensure relevance and accuracy.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsuitability of Standard Metrics:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Standard metrics used in many benchmarks, such as exact match and word overlap, are not suitable for Indian languages due to non-standard spellings. This can unfairly penalize a model for using slightly different spellings than those in the benchmark reference data.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Methodology:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-Step Process:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Curating Evaluation Prompts:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A diverse set of evaluation prompts is curated with the help of native speakers to ensure cultural and linguistic relevance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generating Model Responses:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Responses to the curated prompts are generated from the models under consideration, capturing a wide range of linguistic behaviors and outputs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation Settings:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The generated responses are evaluated in two settings:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Individual Evaluation:&lt;/strong&gt; Each response is evaluated on its own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pairwise Comparison:&lt;/strong&gt; Responses are compared against each other to determine which one is better.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constructing Leaderboards:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Scores from the evaluations are used to construct leaderboards, providing a clear ranking of model performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
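&lt;p&gt;As a minimal sketch, the four steps above can be wired together as a single pairwise-evaluation loop. The &lt;code&gt;judge&lt;/code&gt; and model callables here are placeholders of our own, not the benchmark's actual implementation:&lt;/p&gt;

```python
# Hypothetical sketch of the four-step pipeline: for each curated prompt,
# every pair of models is compared and the judgment (1 win, 0.5 draw, 0 loss
# for the first model) is recorded. "judge" stands in for a human annotator
# or an LLM evaluator.

def evaluate_pairwise(prompts, models, judge):
    """models: list of callables prompt -> response; judge returns 1/0.5/0."""
    results = []
    for prompt in prompts:
        for i, model_a in enumerate(models):
            for model_b in models[i + 1:]:
                resp_a, resp_b = model_a(prompt), model_b(prompt)
                results.append((model_a.__name__, model_b.__name__,
                                judge(prompt, resp_a, resp_b)))
    return results  # these outcomes feed the leaderboard construction
```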

&lt;p&gt;&lt;strong&gt;Introduction to ELO Rating System:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ELO rating system, widely used in competitive games like chess, measures the relative skill levels of players. In the Pariksha Benchmark, we adapt the ELO rating system to evaluate and compare the performance of AI models based on their responses to evaluation prompts. This system allows us to convert human preferences into ELO ratings, predicting the win rates between different models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formulas and Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Expected Score (EA):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdi2656wer64n92akrbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdi2656wer64n92akrbw.png" alt="Image description" width="238" height="50"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explanation:&lt;/strong&gt; This formula calculates the expected score for model A when compared to model B. (R_A) and (R_B) are the current ratings of models A and B, respectively. The expected score represents the probability of model A winning against model B.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Rating Update Formula:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faurwjjmiuysge996ifck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faurwjjmiuysge996ifck.png" alt="Image description" width="218" height="48"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explanation:&lt;/strong&gt; This formula updates the rating of model A after a comparison. (R_A) is the current rating, (R_A') is the new rating, (K) is a factor that determines the sensitivity of the rating system, (S_A) is the actual score (1 for a win, 0.5 for a draw, 0 for a loss), and (E_A) is the expected score calculated using the first formula. The rating is adjusted based on the difference between the expected outcome and the actual outcome, scaled by (K).&lt;/li&gt;
&lt;/ul&gt;
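&lt;p&gt;The two formulas translate directly into code. This is the standard chess-style ELO update; the K value of 32 below is a common default, not a value taken from the benchmark:&lt;/p&gt;

```python
# Standard ELO: expected score and rating update (K = 32 is an illustrative
# default; the benchmark may use a different sensitivity factor).

def expected_score(r_a: float, r_b: float) -> float:
    """E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400))"""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_rating(r_a: float, e_a: float, s_a: float, k: float = 32.0) -> float:
    """R_A' = R_A + K * (S_A - E_A), with S_A = 1 (win), 0.5 (draw), 0 (loss)."""
    return r_a + k * (s_a - e_a)

# Model A at 1200 beats model B at 1000: A was already favoured (E_A ≈ 0.76),
# so its rating rises only modestly.
e_a = expected_score(1200, 1000)
print(round(update_rating(1200, e_a, 1.0), 1))  # 1207.7
```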

&lt;p&gt;&lt;strong&gt;3. Bradley-Terry Model:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq140z00s6vlm0y0y4xwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq140z00s6vlm0y0y4xwy.png" alt="Image description" width="174" height="62"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explanation:&lt;/strong&gt; In the Bradley-Terry model, which estimates ratings by maximizing the likelihood of the observed pairwise outcomes, (p_i) and (p_j) are the performance parameters of models (i) and (j), respectively. The model assumes a fixed but unknown pairwise win rate and estimates the probability that model (i) will outperform model (j).&lt;/li&gt;
&lt;/ul&gt;
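&lt;p&gt;A minimal sketch of the Bradley-Terry win probability; the performance parameters below are illustrative values, and in practice they are fit by maximum likelihood over all observed comparisons:&lt;/p&gt;

```python
# Bradley-Terry: P(i beats j) = p_i / (p_i + p_j), where p_i and p_j are the
# (positive) performance parameters of the two models. Values are illustrative.

def bt_win_prob(p_i: float, p_j: float) -> float:
    return p_i / (p_i + p_j)

# A model with twice the performance parameter wins two times out of three.
print(round(bt_win_prob(2.0, 1.0), 3))  # 0.667
```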

&lt;p&gt;&lt;strong&gt;ELO Calculation Process:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-Step Process:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pairwise Comparison:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;For each prompt, responses from two models are compared.&lt;/li&gt;
&lt;li&gt;Human evaluators or an LLM decide which response is better.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected Score Calculation:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The expected score (E_A) is calculated for model A against model B using the first formula.&lt;/li&gt;
&lt;li&gt;This gives the probability of model A winning against model B.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rating Update:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;After the comparison, the actual score (S_A) is determined (1 for a win, 0.5 for a draw, 0 for a loss).&lt;/li&gt;
&lt;li&gt;The new rating (R_A') is calculated using the second formula, updating model A’s rating based on its performance relative to the expectation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bradley-Terry Model Application:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The Bradley-Terry model is used to estimate the fixed pairwise win-rate, ensuring that the order of comparisons does not affect the ratings.&lt;/li&gt;
&lt;li&gt;The probability of one model outperforming another is calculated to provide a robust comparison framework.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
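&lt;p&gt;Putting the steps together, a leaderboard can be built from a stream of pairwise outcomes. The match results below are synthetic stand-ins for human or LLM judgments, and the model names are placeholders:&lt;/p&gt;

```python
# Synthetic end-to-end ELO run: sequential rating updates over pairwise
# outcomes (s_a is the first model's score: 1 win, 0.5 draw, 0 loss).

def expected(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def build_leaderboard(matches, k=32.0, base=1000.0):
    ratings = {}
    for a, b, s_a in matches:
        r_a = ratings.setdefault(a, base)
        r_b = ratings.setdefault(b, base)
        e_a = expected(r_a, r_b)
        ratings[a] = r_a + k * (s_a - e_a)              # A's gain...
        ratings[b] = r_b + k * ((1 - s_a) - (1 - e_a))  # ...is B's loss
    return sorted(ratings.items(), key=lambda kv: -kv[1])

leaderboard = build_leaderboard([
    ("model-x", "model-y", 1.0),
    ("model-x", "model-z", 1.0),
    ("model-y", "model-z", 0.5),
])
print(leaderboard[0][0])  # model-x tops the board after winning both matches
```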

&lt;h3&gt;
  
  
  Individual Metrics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linguistic Acceptability:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Measures whether the text is in the correct language and grammatically correct, rated on a scale of 0-2.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task Quality:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Assesses whether the answer is of high quality and provides useful information, also rated on a scale of 0-2.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Checks whether the answer contains untrue or made-up facts, rated on a binary scale (0 or 1).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
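&lt;p&gt;A simple way to aggregate these per-response ratings into a per-model summary; the averaging scheme here is our own illustrative choice, not the benchmark's:&lt;/p&gt;

```python
# Averages the three ratings described above across a model's responses:
# linguistic acceptability (0-2), task quality (0-2), hallucination (0/1).

def summarize(annotations):
    """annotations: list of dicts with keys 'la', 'tq', 'hall'."""
    n = len(annotations)
    return {
        "linguistic_acceptability": sum(a["la"] for a in annotations) / n,
        "task_quality": sum(a["tq"] for a in annotations) / n,
        "hallucination_rate": sum(a["hall"] for a in annotations) / n,
    }

scores = summarize([
    {"la": 2, "tq": 2, "hall": 0},
    {"la": 1, "tq": 2, "hall": 1},
])
print(scores["hallucination_rate"])  # 0.5
```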

&lt;h3&gt;
  
  
  Inter-Rater Reliability Metrics:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Percentage Agreement (PA):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Calculates the percentage of items on which annotators agree, ranging from 0 (no agreement) to 1 (perfect agreement).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fleiss Kappa (κ):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Measures inter-annotator agreement, accounting for the agreement occurring by chance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kendall’s Tau:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A correlation coefficient that measures the relationship between two columns of ranked data, used to compare leaderboards obtained through various evaluation techniques.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
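&lt;p&gt;Two of these metrics are straightforward to compute directly; the self-contained sketches below are for illustration (libraries such as scipy.stats.kendalltau provide production implementations):&lt;/p&gt;

```python
from itertools import combinations

def percentage_agreement(ann_a, ann_b):
    """Fraction of items two annotators labelled identically (0 to 1)."""
    return sum(a == b for a, b in zip(ann_a, ann_b)) / len(ann_a)

def kendalls_tau(rank_a, rank_b):
    """Tau-a over two rank columns: (concordant - discordant) / total pairs."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if s > 0:
            concordant += 1
        elif s != 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Identical leaderboards give tau = 1.0; a fully reversed ranking gives -1.0.
print(kendalls_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
```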

&lt;p&gt;&lt;strong&gt;Higher agreement scores among human annotators compared to human-LLM pairs suggest that while GPT-4 performs well, human evaluators still provide more reliable and consistent evaluations. The variation across languages could point to specific challenges in those languages, such as syntax complexity or cultural nuances that GPT-4 might not fully grasp.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Way Forward: Developing Truly "Indian" Language Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Vision:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The goal is to develop models that go beyond multilingual capabilities to truly understand and generate culturally and contextually relevant content for all Indian users. This involves creating models that act as digital knowledge companions, comprehending cultural idioms, historical references, regional specifics, and diverse interaction styles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-Quality Data Curation:&lt;/strong&gt; Ensuring datasets are comprehensive, diverse, and of high quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human Supervision:&lt;/strong&gt; Leveraging language experts for data annotation and translation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad Evaluation:&lt;/strong&gt; Developing benchmarks like IndicXTREME to evaluate a wide range of tasks across multiple languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continual Adaptation:&lt;/strong&gt; Updating and refining models to keep pace with linguistic and cultural changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The development and evaluation of Indic language models are crucial for advancing AI in India. By focusing on comprehensive data curation, human supervision, and robust evaluation frameworks, we can create models that are not only multilingual but truly multicultural. Initiatives like IndicXTREME, OpenHathi, Airavata, and IndicMT Eval are paving the way for a future where AI can seamlessly interact with and understand the diverse linguistic landscape of India. As we continue to innovate and refine these models, we move closer to achieving truly inclusive and effective AI solutions for all Indian languages.&lt;/p&gt;




</description>
      <category>llm</category>
      <category>benchmark</category>
      <category>ai</category>
      <category>indicmodels</category>
    </item>
  </channel>
</rss>
