<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nikhil Reddy</title>
    <description>The latest articles on DEV Community by Nikhil Reddy (@nikhil_reddy_).</description>
    <link>https://dev.to/nikhil_reddy_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2802774%2Fbcb50a7c-e175-4067-a2cd-7fd7fe2df1ab.png</url>
      <title>DEV Community: Nikhil Reddy</title>
      <link>https://dev.to/nikhil_reddy_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nikhil_reddy_"/>
    <language>en</language>
    <item>
      <title>Cognitive RAG-Chain of Thought Approach in Retrieval Augmented Generation on Graph Data</title>
      <dc:creator>Nikhil Reddy</dc:creator>
      <pubDate>Sun, 02 Feb 2025 05:31:07 +0000</pubDate>
      <link>https://dev.to/nikhil_reddy_/cognitive-rag-chain-of-thought-approach-in-retrieval-augmented-generation-on-graph-data-5mf</link>
      <guid>https://dev.to/nikhil_reddy_/cognitive-rag-chain-of-thought-approach-in-retrieval-augmented-generation-on-graph-data-5mf</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In recent years, Retrieval-Augmented Generation (RAG) has transformed how large language models (LLMs) interact with structured and unstructured data. One of the most promising advancements in this field is integrating Chain-of-Thought (CoT) reasoning with RAG, especially when working with graph-based knowledge representations. This approach enhances reasoning, improves retrieval efficiency, and generates more accurate responses. In this blog, we explore how CoT-enhanced RAG can improve performance when dealing with graph data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding RAG and Graph Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Retrieval-Augmented Generation (RAG)?
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is an AI paradigm that enhances the capabilities of generative models by incorporating an external retrieval mechanism. Instead of relying solely on parametric memory, RAG fetches relevant information from external knowledge sources. This improves factual accuracy, reduces hallucinations, and provides contextually grounded responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Use Graph Data?
&lt;/h3&gt;

&lt;p&gt;Graphs are widely used to represent relationships between entities in various domains, such as knowledge graphs, social networks, and biomedical databases. Unlike textual data, graphs encode rich relational structures, making them highly expressive. However, they also pose challenges for traditional retrieval models, which often struggle to leverage these interconnected relationships effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation for Graph-Augmented LLMs
&lt;/h2&gt;

&lt;p&gt;Large Language Models (LLMs) sometimes generate content that appears plausible but lacks a factual foundation, a problem known as hallucination. Existing RAG models mitigate this issue by incorporating external text corpora as knowledge sources, treating each document as an independent knowledge unit.&lt;/p&gt;

&lt;p&gt;However, real-world information is rarely isolated; instead, it is interconnected, forming a graph of relationships. Knowledge graphs store information not just in textual form but also through structured connections between entities. These graphs often contain different types of edges, representing various relationships such as "is a part of," "is similar to," "was influenced by," and "has property." This diversity in relationships allows for richer retrieval and reasoning processes.&lt;/p&gt;
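&lt;p&gt;To make this concrete, a knowledge graph with typed edges can be sketched as a plain adjacency map; the entities and relations below are hypothetical:&lt;/p&gt;

```python
# Minimal typed knowledge graph: each node maps to (relation, target) pairs.
# Entities and relations are illustrative, not from a real dataset.
knowledge_graph = {
    "Transformer": [("is a part of", "LLM"), ("was influenced by", "Attention")],
    "LLM": [("has property", "parametric memory")],
    "Attention": [],
    "parametric memory": [],
}

def neighbors(graph, node, relation=None):
    """Return neighbors of node, optionally filtered by edge type."""
    return [tgt for rel, tgt in graph.get(node, [])
            if relation is None or rel == relation]
```

&lt;p&gt;Filtering traversal by relation type is what lets retrieval follow "is a part of" edges differently from "is similar to" edges.&lt;/p&gt;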

&lt;p&gt;&lt;strong&gt;Example of Knowledge Graphs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgzuejhzqio4rr2yg5sb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgzuejhzqio4rr2yg5sb.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional RAG with Graph Databases vs. Graph-CoT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Traditional RAG with Graph Databases (e.g., Neo4j)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses predefined query languages (like Cypher) to fetch relevant nodes and edges.&lt;/li&gt;
&lt;li&gt;Relies on direct lookup of structured graph data rather than reasoning iteratively.&lt;/li&gt;
&lt;li&gt;Typically retrieves subgraphs and provides them as static context to LLMs.&lt;/li&gt;
&lt;li&gt;Struggles with complex multi-hop reasoning beyond predefined paths.&lt;/li&gt;
&lt;li&gt;Does not dynamically adjust retrieval strategies based on the query.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Graph-CoT (Our Proposed System)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Iteratively reasons through the graph, identifying the necessary steps dynamically.&lt;/li&gt;
&lt;li&gt;Uses an LLM-driven reasoning mechanism to decide on retrieval steps rather than predefined queries.&lt;/li&gt;
&lt;li&gt;Expands queries dynamically based on previous retrievals, allowing multi-hop and inductive reasoning.&lt;/li&gt;
&lt;li&gt;Provides structured evidence aggregation, leading to more coherent and contextually aware responses.&lt;/li&gt;
&lt;li&gt;Adapts retrieval dynamically, selecting only the most relevant information instead of entire subgraphs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By enabling step-wise reasoning, Graph-CoT allows LLMs to retrieve and process information more effectively than traditional RAG approaches that rely solely on direct lookups in graph databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposed Method: Graph Chain-of-Thought (Graph-CoT)
&lt;/h2&gt;

&lt;p&gt;We introduce &lt;strong&gt;Graph Chain-of-Thought (Graph-CoT),&lt;/strong&gt; a step-by-step framework that enables LLMs to traverse knowledge graphs iteratively, identifying key information rather than processing entire subgraphs at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graph-CoT Framework
&lt;/h3&gt;

&lt;p&gt;Each iteration of Graph-CoT consists of three key sub-steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning:&lt;/strong&gt; The LLM analyzes the current information and determines what additional data is needed from the graph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interaction:&lt;/strong&gt; The LLM formulates actions to retrieve relevant nodes, check connections, or refine the search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution:&lt;/strong&gt; The system executes the requested retrieval, fetching the necessary data from the graph.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This iterative process continues until the LLM gathers sufficient information to generate an answer.&lt;/p&gt;
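&lt;p&gt;A minimal sketch of this loop, where llm_reason, llm_act, and execute_on_graph are illustrative stand-ins for the actual LLM and graph calls rather than a real API:&lt;/p&gt;

```python
# Illustrative Graph-CoT loop: reason -> interact -> execute, repeated
# until the LLM judges the gathered evidence sufficient.
def graph_cot(question, graph, llm_reason, llm_act, execute_on_graph, max_iters=5):
    context = []  # evidence gathered so far
    for _ in range(max_iters):
        thought = llm_reason(question, context)          # 1. Reasoning
        if thought.get("sufficient"):
            return thought["answer"]
        action = llm_act(thought)                        # 2. Interaction
        context.append(execute_on_graph(graph, action))  # 3. Execution
    # Stop after max_iters and answer with whatever evidence was collected.
    return llm_reason(question, context).get("answer")
```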

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;We implemented Graph-CoT using the &lt;strong&gt;Claude Sonnet API&lt;/strong&gt; for the LLM component. The retrieval and reasoning pipeline follows these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Initial Question Processing:&lt;/strong&gt; The LLM first determines the subject category of the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Reasoning:&lt;/strong&gt; If the LLM can confidently answer using its internal knowledge, it does so. Otherwise, it proceeds to retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Retrieval with Embeddings:&lt;/strong&gt; Using &lt;strong&gt;Sentence Transformer embeddings&lt;/strong&gt;, the system identifies the most relevant node.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information Sufficiency Check:&lt;/strong&gt; The LLM evaluates whether the retrieved node contains enough information. If not, it reasons again and explores further connections in the knowledge graph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph Traversal:&lt;/strong&gt; The process repeats, leveraging named edges and multi-hop connections until the LLM determines that it has gathered sufficient context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final Answer Generation:&lt;/strong&gt; Once enough relevant data is retrieved, the LLM synthesizes an accurate and coherent response.&lt;/li&gt;
&lt;/ol&gt;
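&lt;p&gt;Step 3 above, selecting the most relevant node, reduces to a nearest-neighbor search over node embeddings. A sketch with toy two-dimensional vectors standing in for Sentence Transformer embeddings:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_relevant_node(query_vec, node_embeddings):
    """Return the node whose embedding is closest to the query vector.
    In the real pipeline, the vectors come from a Sentence Transformer encoder."""
    return max(node_embeddings, key=lambda n: cosine(query_vec, node_embeddings[n]))

# Toy embeddings; real ones would be encoder outputs for each node's text.
nodes = {"paper_A": [1.0, 0.0], "paper_B": [0.0, 1.0]}
```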

&lt;h3&gt;
  
  
  Comparison with Hop-Based Retrieval
&lt;/h3&gt;

&lt;p&gt;We also compared Graph-CoT with hop-based retrieval strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hop-0:&lt;/strong&gt; Retrieves only the most relevant node found using sentence transformer embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop-1:&lt;/strong&gt; Retrieves the most relevant node along with its directly connected neighbors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hop-2:&lt;/strong&gt; Retrieves the most relevant node, its neighbors, and their respective neighbors, expanding the context further.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While increasing the hop level provides more contextual information, our proposed &lt;strong&gt;Graph-CoT system consistently outperforms all hop-based approaches&lt;/strong&gt; by dynamically reasoning and retrieving only the most relevant information at each step.&lt;/p&gt;
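&lt;p&gt;For reference, the hop-based baselines amount to a breadth-first expansion around the best-matching node; a sketch over a plain adjacency dict:&lt;/p&gt;

```python
from collections import deque

def hop_k_retrieve(graph, start, k):
    """Return start plus every node within k hops (the Hop-0/1/2 baselines)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen
```

&lt;p&gt;Note how the retrieved set grows monotonically with the hop level regardless of relevance, whereas Graph-CoT decides at each step which edge is worth following.&lt;/p&gt;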

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Integrating Chain-of-Thought reasoning with Retrieval-Augmented Generation offers a powerful approach to handling graph-based knowledge representations. By enabling step-wise reasoning, contextual query expansion, and structured retrieval, Graph-CoT significantly improves the efficiency and accuracy of knowledge graph augmentation for LLMs. Future research will focus on optimizing retrieval efficiency, exploring more advanced graph traversal strategies, and evaluating performance across diverse real-world applications.&lt;/p&gt;

&lt;p&gt;By structuring the thought process and leveraging graph-based retrieval, Graph-CoT paves the way for more reliable and interpretable AI systems capable of handling complex, interconnected knowledge sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/Nikhilreddy024/Cognitive-RAG" rel="noopener noreferrer"&gt;Cognitive-RAG&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>DeepFake Detection Using Convolutions with Attention (CoAtNet)</title>
      <dc:creator>Nikhil Reddy</dc:creator>
      <pubDate>Sat, 01 Feb 2025 23:45:24 +0000</pubDate>
      <link>https://dev.to/nikhil_reddy_/deepfake-detection-using-convolutions-with-attention-coatnet-oj4</link>
      <guid>https://dev.to/nikhil_reddy_/deepfake-detection-using-convolutions-with-attention-coatnet-oj4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;With the proliferation of deepfake technology, the ability to discern real from manipulated content has become a critical concern. Deepfake videos, often indistinguishable from authentic footage, pose serious threats to privacy, security, and the integrity of information.&lt;/p&gt;

&lt;p&gt;In response to this challenge, cutting-edge research in computer vision and machine learning has yielded innovative solutions for detecting deepfakes. Among these, the fusion of convolutional neural networks (ConvNets) and attention-based models has emerged as a promising approach.&lt;/p&gt;

&lt;p&gt;In this project, we present an in-depth exploration of deepfake detection using ConvNets with attention, specifically focusing on the CoAtNet architecture. CoAtNet, a novel family of image recognition models, seamlessly integrates ConvNets and attention mechanisms, offering a powerful tool for analyzing facial images extracted from videos.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Processing
&lt;/h2&gt;

&lt;p&gt;We used only 3 of the 50 data chunks from the DFDC dataset for this project, amounting to roughly 30 GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overview of preprocessing steps:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Face Extraction Using BlazeFace:&lt;br&gt;
BlazeFace is a lightweight and efficient face-detection model. It is utilized to extract faces from each frame of the videos in the DFDC dataset. This step ensures that only the relevant facial regions are considered for analysis. Here are some examples of face extractions from a video sample.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real video:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju0i30gki4sipu25skbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju0i30gki4sipu25skbt.png" alt="Image description" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepFake video:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e6eag8ocdiiba9f2t4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e6eag8ocdiiba9f2t4h.png" alt="Image description" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normalization of Pixel Values:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After face extraction, pixel values of the extracted facial images are normalized. Normalization standardizes the pixel values to a common scale, typically between 0 and 1, to improve the convergence and stability of the training process.&lt;/p&gt;

&lt;p&gt;mean = [0.485, 0.456, 0.406]&lt;/p&gt;

&lt;p&gt;std = [0.229, 0.224, 0.225]&lt;/p&gt;

&lt;p&gt;These mean and std values were provided by Kaggle for the DFDC competition; we use them to normalize the pixel values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu512wgwr2om25k7z500n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu512wgwr2om25k7z500n.png" alt="Image description" width="389" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure above shows an example of normalized pixel values. We first normalized the image's pixel values, then applied an inverse normalization to visualize it.&lt;/p&gt;
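&lt;p&gt;A minimal sketch of the normalization and its inverse (the inverse is only used to visualize normalized images), using the statistics quoted above:&lt;/p&gt;

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Scale 0-255 RGB values to [0, 1], then standardize per channel."""
    return [((v / 255.0) - m) / s for v, m, s in zip(rgb, MEAN, STD)]

def invert_normalize(normed):
    """Undo the normalization, recovering 0-255 values for visualization."""
    return [round((x * s + m) * 255.0) for x, s, m in zip(normed, STD, MEAN)]
```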
&lt;h3&gt;
  
  
  Augmentation Techniques with Albumentations
&lt;/h3&gt;

&lt;p&gt;Augmentation techniques from the Albumentations library are applied to increase the diversity and robustness of the training dataset. Albumentations introduces variations in the training data by applying transformations such as rotation, flipping, scaling, and color adjustments to the facial images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Geometric Transformations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RandomRotate90: Rotates the image by 90, 180, or 270 degrees (p=0.2).&lt;/p&gt;

&lt;p&gt;Transpose: Swaps rows and columns (potentially useful for text or certain object orientations, p=0.2).&lt;/p&gt;

&lt;p&gt;HorizontalFlip: Mirrors the image horizontally (p=0.5).&lt;/p&gt;

&lt;p&gt;VerticalFlip: Mirrors the image vertically (p=0.5).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Effects&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OneOf([GaussNoise()], p=0.2): Adds random noise to the image with a 0.2 probability; once one member of a OneOf group is chosen, the others are skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combined Transformations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ShiftScaleRotate: Applies a combination of random shift, scale, and rotation in a single step (p=0.2).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pixel-Level Adjustments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OneOf([CLAHE(clip_limit=2), Sharpen(), Emboss(), RandomBrightnessContrast()], p=0.2): With a 0.2 probability, exactly one of these transformations is applied:&lt;/p&gt;

&lt;p&gt;CLAHE: Contrast Limited Adaptive Histogram Equalization (improves local contrast).&lt;/p&gt;

&lt;p&gt;Sharpen: Enhances image edges.&lt;/p&gt;

&lt;p&gt;Emboss: Creates a raised or sunken effect.&lt;/p&gt;

&lt;p&gt;RandomBrightnessContrast: Randomly adjusts brightness and contrast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Color Adjustments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HueSaturationValue: Randomly modifies the image's hue (color), saturation (intensity), and value (brightness) with a 0.2 probability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhz94gwfsj2nvy994ftu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhz94gwfsj2nvy994ftu.png" alt="Image description" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;
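&lt;p&gt;The composition logic above, a per-transform probability p and OneOf picking a single member of a group, can be mimicked in a few lines of plain Python; the real pipeline uses the Albumentations classes named above:&lt;/p&gt;

```python
import random

def compose(transforms):
    """Apply each (fn, p) pair in order, each with probability p (Compose-style)."""
    def apply(img):
        out = img
        for fn, p in transforms:
            if p > random.random():
                out = fn(out)
        return out
    return apply

def one_of(options, p):
    """With probability p, apply exactly one randomly chosen option (OneOf-style)."""
    def fn(img):
        return random.choice(options)(img)
    return (fn, p)

# Toy "transforms" acting on a string stand-in for an image tensor.
hflip = (lambda img: img + "+hflip", 0.5)
noise = one_of([lambda img: img + "+noise"], 0.2)
pipeline = compose([hflip, noise])
```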

&lt;p&gt;&lt;strong&gt;Temporal Consistency vs. Face Extraction and Classification&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporal Consistency refers to maintaining coherence across sequential frames in video analysis, often achieved through models integrating time-based architectures like LSTM or GRU to capture temporal dependencies. However, recent advancements demonstrate that face extraction and classification alone can yield effective results without explicitly modeling temporal relationships.&lt;/li&gt;
&lt;li&gt;By focusing solely on face extraction and classification, without considering temporal consistency, the model can efficiently detect deepfake content while simplifying the architecture and reducing computational complexity.&lt;/li&gt;
&lt;li&gt;We focused solely on detecting image and video manipulations and ignored audio, whereas many of the current best models detect audio manipulations as well.&lt;/li&gt;
&lt;li&gt;The current models leverage efficient vision transformers, such as the Cross Efficient Vision Transformer (CEViT), which combines the efficiency of vision transformers with cross-modal fusion for improved performance across various tasks in computer vision.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  CoAtNet Architecture
&lt;/h2&gt;

&lt;p&gt;CoAtNet is a new family of image recognition models that combines the strengths of convolutional neural networks (ConvNets) and attention-based models (like Transformers). The CoAtNet model is specifically designed for efficient image classification tasks, making it well-suited for processing large volumes of facial images extracted from videos.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfrcvhbjziwty02xr2c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfrcvhbjziwty02xr2c8.png" alt="Image description" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The CoAtNet architecture comprises five stages (S0, S1, S2, S3, S4), each tailored to specific characteristics of the data and task at hand. Beginning with a simple 2-layer convolutional stem in S0, the subsequent stages employ a combination of MBConv blocks with squeeze-excitation (SE) and Transformer blocks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To optimize model performance, the CoAtNet authors adopt a strategic approach to stage selection. Convolution stages precede Transformer stages, leveraging convolution's proficiency at processing the local patterns common in early stages. This leads to four variants, C-C-C-C, C-C-C-T, C-C-T-T, and C-T-T-T, with varying degrees of convolution and Transformer stages. Through rigorous experimentation, they determined that the C-C-T-T configuration yields the best balance between generalization ability and model capacity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://arxiv.org/abs/2106.04803?source=post_page-----78c8b9f4d69d--------------------------------" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Farxiv.org%2Fstatic%2Fbrowse%2F0.3.4%2Fimages%2Farxiv-logo-fb.png" height="466" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://arxiv.org/abs/2106.04803?source=post_page-----78c8b9f4d69d--------------------------------" rel="noopener noreferrer" class="c-link"&gt;
          [2106.04803] CoAtNet: Marrying Convolution and Attention for All Data Sizes
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Transformers have attracted increasing interests in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets(pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; When pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Farxiv.org%2Fstatic%2Fbrowse%2F0.3.4%2Fimages%2Ficons%2Ffavicon-32x32.png" width="32" height="32"&gt;
        arxiv.org
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;Our approach in this project uses CoAtNet-0. The CoAtNet authors proposed five architectures (CoAtNet-0 to CoAtNet-4); CoAtNet-0 is the smallest, and we chose it to keep our detection model small and compact. Here is a brief explanation of our model layers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = Coatnet(image_size=(224, 224), in_channels=3, num_blocks=[2, 2, 3, 5, 2], channels=[64, 96, 192, 384, 768], num_classes=2)

model.summary()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 112, 112]           1,728 
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              GELU-3         [-1, 64, 112, 112]               0
            Conv2d-4         [-1, 64, 112, 112]          36,864
       BatchNorm2d-5         [-1, 64, 112, 112]             128
              GELU-6         [-1, 64, 112, 112]               0
         MaxPool2d-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 96, 56, 56]           6,144
       BatchNorm2d-9         [-1, 64, 112, 112]             128
           Conv2d-10          [-1, 256, 56, 56]          16,384
      BatchNorm2d-11          [-1, 256, 56, 56]             512
             GELU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]           2,304
      BatchNorm2d-14          [-1, 256, 56, 56]             512
             GELU-15          [-1, 256, 56, 56]               0
AdaptiveAvgPool2d-16            [-1, 256, 1, 1]               0
           Linear-17                   [-1, 16]           4,096
             GELU-18                   [-1, 16]               0
           Linear-19                  [-1, 256]           4,096
          Sigmoid-20                  [-1, 256]               0
               SE-21          [-1, 256, 56, 56]               0
           Conv2d-22           [-1, 96, 56, 56]          24,576
      BatchNorm2d-23           [-1, 96, 56, 56]             192
          PreNorm-24           [-1, 96, 56, 56]               0
           MBConv-25           [-1, 96, 56, 56]               0
      BatchNorm2d-26           [-1, 96, 56, 56]             192
           Conv2d-27          [-1, 384, 56, 56]          36,864
      BatchNorm2d-28          [-1, 384, 56, 56]             768
             GELU-29          [-1, 384, 56, 56]               0
           Conv2d-30          [-1, 384, 56, 56]           3,456
      BatchNorm2d-31          [-1, 384, 56, 56]             768
             GELU-32          [-1, 384, 56, 56]               0
AdaptiveAvgPool2d-33            [-1, 384, 1, 1]               0
           Linear-34                   [-1, 24]           9,216
             GELU-35                   [-1, 24]               0
           Linear-36                  [-1, 384]           9,216
          Sigmoid-37                  [-1, 384]               0
               SE-38          [-1, 384, 56, 56]               0
           Conv2d-39           [-1, 96, 56, 56]          36,864
      BatchNorm2d-40           [-1, 96, 56, 56]             192
          PreNorm-41           [-1, 96, 56, 56]               0
           MBConv-42           [-1, 96, 56, 56]               0
        MaxPool2d-43           [-1, 96, 28, 28]               0
           Conv2d-44          [-1, 192, 28, 28]          18,432
      BatchNorm2d-45           [-1, 96, 56, 56]             192
           Conv2d-46          [-1, 384, 28, 28]          36,864
      BatchNorm2d-47          [-1, 384, 28, 28]             768
             GELU-48          [-1, 384, 28, 28]               0
           Conv2d-49          [-1, 384, 28, 28]           3,456
      BatchNorm2d-50          [-1, 384, 28, 28]             768
             GELU-51          [-1, 384, 28, 28]               0
AdaptiveAvgPool2d-52            [-1, 384, 1, 1]               0
           Linear-53                   [-1, 24]           9,216
             GELU-54                   [-1, 24]               0
           Linear-55                  [-1, 384]           9,216
          Sigmoid-56                  [-1, 384]               0
               SE-57          [-1, 384, 28, 28]               0
           Conv2d-58          [-1, 192, 28, 28]          73,728
      BatchNorm2d-59          [-1, 192, 28, 28]             384
          PreNorm-60          [-1, 192, 28, 28]               0
           MBConv-61          [-1, 192, 28, 28]               0
      BatchNorm2d-62          [-1, 192, 28, 28]             384
           Conv2d-63          [-1, 768, 28, 28]         147,456
      BatchNorm2d-64          [-1, 768, 28, 28]           1,536
             GELU-65          [-1, 768, 28, 28]               0
           Conv2d-66          [-1, 768, 28, 28]           6,912
      BatchNorm2d-67          [-1, 768, 28, 28]           1,536
             GELU-68          [-1, 768, 28, 28]               0
AdaptiveAvgPool2d-69            [-1, 768, 1, 1]               0
           Linear-70                   [-1, 48]          36,864
             GELU-71                   [-1, 48]               0
           Linear-72                  [-1, 768]          36,864
          Sigmoid-73                  [-1, 768]               0
               SE-74          [-1, 768, 28, 28]               0
           Conv2d-75          [-1, 192, 28, 28]         147,456
      BatchNorm2d-76          [-1, 192, 28, 28]             384
          PreNorm-77          [-1, 192, 28, 28]               0
           MBConv-78          [-1, 192, 28, 28]               0
      BatchNorm2d-79          [-1, 192, 28, 28]             384
           Conv2d-80          [-1, 768, 28, 28]         147,456
      BatchNorm2d-81          [-1, 768, 28, 28]           1,536
             GELU-82          [-1, 768, 28, 28]               0
           Conv2d-83          [-1, 768, 28, 28]           6,912
      BatchNorm2d-84          [-1, 768, 28, 28]           1,536
             GELU-85          [-1, 768, 28, 28]               0
AdaptiveAvgPool2d-86            [-1, 768, 1, 1]               0
           Linear-87                   [-1, 48]          36,864
             GELU-88                   [-1, 48]               0
           Linear-89                  [-1, 768]          36,864
          Sigmoid-90                  [-1, 768]               0
               SE-91          [-1, 768, 28, 28]               0
           Conv2d-92          [-1, 192, 28, 28]         147,456
      BatchNorm2d-93          [-1, 192, 28, 28]             384
          PreNorm-94          [-1, 192, 28, 28]               0
           MBConv-95          [-1, 192, 28, 28]               0
        MaxPool2d-96          [-1, 192, 14, 14]               0
           Conv2d-97          [-1, 384, 14, 14]          73,728
        MaxPool2d-98          [-1, 192, 14, 14]               0
        Rearrange-99             [-1, 196, 192]               0
       LayerNorm-100             [-1, 196, 192]             384
          Linear-101             [-1, 196, 768]         147,456
         Softmax-102          [-1, 8, 196, 196]               0
          Linear-103             [-1, 196, 384]          98,688
         Dropout-104             [-1, 196, 384]               0
       Attention-105             [-1, 196, 384]               0
         PreNorm-106             [-1, 196, 384]               0
       Rearrange-107          [-1, 384, 14, 14]               0
       Rearrange-108             [-1, 196, 384]               0
       LayerNorm-109             [-1, 196, 384]             768
          Linear-110             [-1, 196, 768]         295,680
            GELU-111             [-1, 196, 768]               0
         Dropout-112             [-1, 196, 768]               0
          Linear-113             [-1, 196, 384]         295,296
         Dropout-114             [-1, 196, 384]               0
     FeedForward-115             [-1, 196, 384]               0
         PreNorm-116             [-1, 196, 384]               0
       Rearrange-117          [-1, 384, 14, 14]               0
     Transformer-118          [-1, 384, 14, 14]               0
       Rearrange-119             [-1, 196, 384]               0
       LayerNorm-120             [-1, 196, 384]             768
          Linear-121             [-1, 196, 768]         294,912
         Softmax-122          [-1, 8, 196, 196]               0
          Linear-123             [-1, 196, 384]          98,688
         Dropout-124             [-1, 196, 384]               0
       Attention-125             [-1, 196, 384]               0
         PreNorm-126             [-1, 196, 384]               0
       Rearrange-127          [-1, 384, 14, 14]               0
       Rearrange-128             [-1, 196, 384]               0
       LayerNorm-129             [-1, 196, 384]             768
          Linear-130            [-1, 196, 1536]         591,360
            GELU-131            [-1, 196, 1536]               0
         Dropout-132            [-1, 196, 1536]               0
          Linear-133             [-1, 196, 384]         590,208
         Dropout-134             [-1, 196, 384]               0
     FeedForward-135             [-1, 196, 384]               0
         PreNorm-136             [-1, 196, 384]               0
       Rearrange-137          [-1, 384, 14, 14]               0
     Transformer-138          [-1, 384, 14, 14]               0
       Rearrange-139             [-1, 196, 384]               0
       LayerNorm-140             [-1, 196, 384]             768
          Linear-141             [-1, 196, 768]         294,912
         Softmax-142          [-1, 8, 196, 196]               0
          Linear-143             [-1, 196, 384]          98,688
         Dropout-144             [-1, 196, 384]               0
       Attention-145             [-1, 196, 384]               0
         PreNorm-146             [-1, 196, 384]               0
       Rearrange-147          [-1, 384, 14, 14]               0
       Rearrange-148             [-1, 196, 384]               0
       LayerNorm-149             [-1, 196, 384]             768
          Linear-150            [-1, 196, 1536]         591,360
            GELU-151            [-1, 196, 1536]               0
         Dropout-152            [-1, 196, 1536]               0
          Linear-153             [-1, 196, 384]         590,208
         Dropout-154             [-1, 196, 384]               0
     FeedForward-155             [-1, 196, 384]               0
         PreNorm-156             [-1, 196, 384]               0
       Rearrange-157          [-1, 384, 14, 14]               0
     Transformer-158          [-1, 384, 14, 14]               0
       Rearrange-159             [-1, 196, 384]               0
       LayerNorm-160             [-1, 196, 384]             768
          Linear-161             [-1, 196, 768]         294,912
         Softmax-162          [-1, 8, 196, 196]               0
          Linear-163             [-1, 196, 384]          98,688
         Dropout-164             [-1, 196, 384]               0
       Attention-165             [-1, 196, 384]               0
         PreNorm-166             [-1, 196, 384]               0
       Rearrange-167          [-1, 384, 14, 14]               0
       Rearrange-168             [-1, 196, 384]               0
       LayerNorm-169             [-1, 196, 384]             768
          Linear-170            [-1, 196, 1536]         591,360
            GELU-171            [-1, 196, 1536]               0
         Dropout-172            [-1, 196, 1536]               0
          Linear-173             [-1, 196, 384]         590,208
         Dropout-174             [-1, 196, 384]               0
     FeedForward-175             [-1, 196, 384]               0
         PreNorm-176             [-1, 196, 384]               0
       Rearrange-177          [-1, 384, 14, 14]               0
     Transformer-178          [-1, 384, 14, 14]               0
       Rearrange-179             [-1, 196, 384]               0
       LayerNorm-180             [-1, 196, 384]             768
          Linear-181             [-1, 196, 768]         294,912
         Softmax-182          [-1, 8, 196, 196]               0
          Linear-183             [-1, 196, 384]          98,688
         Dropout-184             [-1, 196, 384]               0
       Attention-185             [-1, 196, 384]               0
         PreNorm-186             [-1, 196, 384]               0
       Rearrange-187          [-1, 384, 14, 14]               0
       Rearrange-188             [-1, 196, 384]               0
       LayerNorm-189             [-1, 196, 384]             768
          Linear-190            [-1, 196, 1536]         591,360
            GELU-191            [-1, 196, 1536]               0
         Dropout-192            [-1, 196, 1536]               0
          Linear-193             [-1, 196, 384]         590,208
         Dropout-194             [-1, 196, 384]               0
     FeedForward-195             [-1, 196, 384]               0
         PreNorm-196             [-1, 196, 384]               0
       Rearrange-197          [-1, 384, 14, 14]               0
     Transformer-198          [-1, 384, 14, 14]               0
       MaxPool2d-199            [-1, 384, 7, 7]               0
          Conv2d-200            [-1, 768, 7, 7]         294,912
       MaxPool2d-201            [-1, 384, 7, 7]               0
       Rearrange-202              [-1, 49, 384]               0
       LayerNorm-203              [-1, 49, 384]             768
          Linear-204              [-1, 49, 768]         294,912
         Softmax-205            [-1, 8, 49, 49]               0
          Linear-206              [-1, 49, 768]         197,376
         Dropout-207              [-1, 49, 768]               0
       Attention-208              [-1, 49, 768]               0
         PreNorm-209              [-1, 49, 768]               0
       Rearrange-210            [-1, 768, 7, 7]               0
       Rearrange-211              [-1, 49, 768]               0
       LayerNorm-212              [-1, 49, 768]           1,536
          Linear-213             [-1, 49, 1536]       1,181,184
            GELU-214             [-1, 49, 1536]               0
         Dropout-215             [-1, 49, 1536]               0
          Linear-216              [-1, 49, 768]       1,180,416
         Dropout-217              [-1, 49, 768]               0
     FeedForward-218              [-1, 49, 768]               0
         PreNorm-219              [-1, 49, 768]               0
       Rearrange-220            [-1, 768, 7, 7]               0
     Transformer-221            [-1, 768, 7, 7]               0
       Rearrange-222              [-1, 49, 768]               0
       LayerNorm-223              [-1, 49, 768]           1,536
          Linear-224              [-1, 49, 768]         589,824
         Softmax-225            [-1, 8, 49, 49]               0
          Linear-226              [-1, 49, 768]         197,376
         Dropout-227              [-1, 49, 768]               0
       Attention-228              [-1, 49, 768]               0
         PreNorm-229              [-1, 49, 768]               0
       Rearrange-230            [-1, 768, 7, 7]               0
       Rearrange-231              [-1, 49, 768]               0
       LayerNorm-232              [-1, 49, 768]           1,536
          Linear-233             [-1, 49, 3072]       2,362,368
            GELU-234             [-1, 49, 3072]               0
         Dropout-235             [-1, 49, 3072]               0
          Linear-236              [-1, 49, 768]       2,360,064
         Dropout-237              [-1, 49, 768]               0
     FeedForward-238              [-1, 49, 768]               0
         PreNorm-239              [-1, 49, 768]               0
       Rearrange-240            [-1, 768, 7, 7]               0
     Transformer-241            [-1, 768, 7, 7]               0
       AvgPool2d-242            [-1, 768, 1, 1]               0
          Linear-243                 [-1, 1000]         768,000
================================================================
Total params: 17,757,760
Trainable params: 17,757,760
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 382.18
Params size (MB): 67.74
Estimated Total Size (MB): 450.49
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Layers 1–7: Convolutional and Pooling Layers&lt;/p&gt;

&lt;p&gt;Layers 8–21: Mobile Inverted Bottleneck Convolution (MBConv)&lt;/p&gt;

&lt;p&gt;Layers 22–57: Additional MBConv Blocks and Downsampling&lt;/p&gt;

&lt;p&gt;Layers 58–74: Repeated MBConv Blocks and Attention Mechanisms&lt;/p&gt;

&lt;p&gt;Layers 75–98: Additional MBConv Blocks and Attention Mechanisms&lt;/p&gt;

&lt;p&gt;Layers 99–241: Transformer Blocks&lt;/p&gt;

&lt;p&gt;Layers 242–243: Classification Head&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stem Layer (S0):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conv2d-1 &amp;amp; BatchNorm2d-2: Initial convolutional layer followed by batch normalization for basic feature extraction.&lt;/li&gt;
&lt;li&gt;GELU-3: Applies the Gaussian Error Linear Unit (GELU) activation function to introduce non-linearity.&lt;/li&gt;
&lt;li&gt;Conv2d-4 &amp;amp; BatchNorm2d-5: Additional convolutional layer with batch normalization for feature enhancement.&lt;/li&gt;
&lt;li&gt;MaxPool2d-7: Max-pooling operation to reduce spatial dimensions and aggregate features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Convolution Blocks (S1–S2):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conv2d-22: Convolutional layer with batch normalization.&lt;/li&gt;
&lt;li&gt;SE-21: Squeeze-Excitation module for channel-wise feature recalibration.&lt;/li&gt;
&lt;li&gt;MBConv-25: Mobile inverted bottleneck convolution block for efficient feature extraction.&lt;/li&gt;
&lt;li&gt;MaxPool2d-43: Max-pooling to downsample feature maps.&lt;/li&gt;
&lt;/ul&gt;
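&lt;p&gt;Much of MBConv’s efficiency comes from depthwise separable convolutions, which split a standard convolution into a per-channel spatial convolution followed by a 1×1 pointwise convolution. A quick parameter-count comparison (our own illustration with a hypothetical 3×3, 192-to-384-channel block, not a layer taken from the summary above):&lt;/p&gt;

```python
def conv2d_params(k, c_in, c_out):
    """Parameter count of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution plus a 1 x 1 pointwise convolution (no bias)."""
    return k * k * c_in + c_in * c_out

# Hypothetical 3x3 block going from 192 to 384 channels:
standard = conv2d_params(3, 192, 384)                # 663,552 params
separable = depthwise_separable_params(3, 192, 384)  # 75,456 params
print(standard, separable)
```

&lt;p&gt;Roughly a 9× saving at this shape, which is why MBConv stacks stay lightweight even with many blocks.&lt;/p&gt;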

&lt;p&gt;&lt;strong&gt;Transformer Blocks (S3-S4):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attention Modules (Attention-105, Attention-125): Self-attention mechanism for capturing long-range dependencies.&lt;/li&gt;
&lt;li&gt;FeedForward Modules (FeedForward-115, FeedForward-135): Fully connected layers with activation functions for feature processing.&lt;/li&gt;
&lt;li&gt;Layer Normalization (LayerNorm-100, LayerNorm-109, LayerNorm-119, LayerNorm-129, LayerNorm-139): Normalizes activations across channels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Output Layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AvgPool2d-242: Average pooling operation to reduce spatial dimensions.&lt;/li&gt;
&lt;li&gt;Linear-243: Fully connected layer mapping feature representation to output space, typically representing class probabilities for deepfake detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CoAtNet-2:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CoAtNet-2 is very similar to CoAtNet-0, but its block counts and channel widths are scaled up, making it a larger model (~600 MB vs. ~200 MB for CoAtNet-0). We followed the same C-C-T-T stage structure as for CoAtNet-0.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;model = CoAtNet(image_size=(224, 224), in_channels=3, num_blocks=[2, 2, 6, 14, 2], channels=[128, 128, 256, 512, 1026], num_classes=2)&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;
  
  
  Training
&lt;/h2&gt;

&lt;p&gt;Since we trained our models for only 25 epochs to keep computation and training time manageable, the learning curves give us only a limited picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning curves of CoAtNet-0 model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Convergence: It’s difficult to determine convergence definitively from the plot. Both the training and validation loss curves flatten out towards the end, which suggests some degree of convergence, but a few more epochs would be needed to confirm whether the flattening continues or minor fluctuations remain.&lt;/li&gt;
&lt;li&gt;Gap between losses: A modest, persistent gap between training and validation loss, with validation loss higher, is expected and indicates the model generalizes reasonably well. A much larger gap would point to overfitting.&lt;/li&gt;
&lt;li&gt;Decreasing trend: The overall decreasing trend in both loss curves is very positive; it confirms the model is actively learning throughout training. The rate of decrease is also informative: a sharp initial drop followed by a plateau suggests the model captured the key patterns quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmdh0yukafvfy3cf3x1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmdh0yukafvfy3cf3x1h.png" alt="Image description" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning curves of CoAtNet-2 model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Convergence: Both the training and validation loss curves converge towards the end, indicating the model has learned and is not overfitting or underfitting significantly.&lt;/li&gt;
&lt;li&gt;Gap between losses: There is a noticeable gap between the training and validation loss, with validation loss being higher. This is expected and suggests some generalization error, as the model performs slightly better on training data.&lt;/li&gt;
&lt;li&gt;Decreasing trend: The overall decreasing trend of both loss curves over epochs is a positive sign, indicating the model is learning and improving its performance as training progresses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmio1bmka9kye3d6n7bmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmio1bmka9kye3d6n7bmv.png" alt="Image description" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Results of CoAtNet-0 model:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz51xpz0lvyvip8pcl1oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz51xpz0lvyvip8pcl1oj.png" alt="Image description" width="640" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We achieved 79% accuracy with our CoAtNet-0 model, which is decent but not satisfactory. Below is a plot of the confidence scores of predictions on the test set (a 0.1 split of the data).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nizaqmsq1xrkwoifr7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nizaqmsq1xrkwoifr7s.png" alt="Image description" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With CoAtNet-0, we were able to correctly classify 8 out of 10 videos from the test set. An example prediction is shown below, with a confidence score of around 0.77.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test-Video:&lt;/strong&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://drive.google.com/file/d/1G5d0ZyJe5qjEyRQK7_NDD0w73FY6fZC3/view?usp=sharing&amp;amp;amp%3Bsource=post_page-----78c8b9f4d69d--------------------------------" rel="noopener noreferrer" class="c-link"&gt;
          result.mp4 - Google Drive
        &lt;/a&gt;
      &lt;/h2&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fssl.gstatic.com%2Fimages%2Fbranding%2Fproduct%2F1x%2Fdrive_2020q4_32dp.png" width="32" height="32"&gt;
        drive.google.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;Our Prediction score: 0.7742524743080139&lt;/p&gt;

&lt;p&gt;The prediction score is on a scale of 0–1, where a prediction score &amp;gt;0.5 is fake and a prediction score &amp;lt;0.5 is real.&lt;/p&gt;
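&lt;p&gt;As a minimal sketch of this decision rule (our own helper, not code from the project):&lt;/p&gt;

```python
def classify(score, threshold=0.5):
    """Map a model confidence score in [0, 1] to a label.
    Scores above the threshold are flagged as fake."""
    return "fake" if score > threshold else "real"

print(classify(0.7742524743080139))  # the example prediction above: "fake"
```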

&lt;p&gt;&lt;strong&gt;Some Fun Test:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recently, Reid Hoffman, the co-founder of LinkedIn, shared a video showcasing his AI twin, which looks realistic enough to fool almost anyone. Our small model, however, classifies the AI twin as real, highlighting its limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results of CoAtNet 2 model:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F952r2jsgd3jxaaobrk8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F952r2jsgd3jxaaobrk8t.png" alt="Image description" width="640" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We achieved 89% accuracy with our CoAtNet-2 model, which extends CoAtNet-0 with more blocks and wider channels.&lt;/p&gt;

&lt;p&gt;We held out 0.1 of our training data as the test set. The test set is not balanced: it contains 30 real videos and 70 fake videos.&lt;/p&gt;
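&lt;p&gt;Because the test set is unbalanced, raw accuracy alone can be misleading; this is an illustration of ours, not part of the project code:&lt;/p&gt;

```python
# On a 30-real / 70-fake test set, a degenerate classifier that always
# answers "fake" already reaches 70% accuracy, which is why the
# confidence-score scatter matters more than the accuracy number alone.
y_true = ["real"] * 30 + ["fake"] * 70
y_pred = ["fake"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.7
```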

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvycae2inx9sh5wnkxaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvycae2inx9sh5wnkxaj.png" alt="Image description" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can clearly see that CoAtNet-2 performs very well on real videos, with zero errors, and keeps the confidence scores well separated: most real videos fall below 0.2 and most fake videos above 0.7.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Fun Test:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CoAtNet-2 correctly classified Reid Hoffman’s AI twin as fake, showing that the larger model performs well.&lt;/p&gt;

&lt;p&gt;Note: We used the face_recognition library to extract faces during prediction and testing because of its reliable face extraction. As mentioned earlier, we used BlazeFace during training because its lightweight architecture saves a lot of time and computation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;p&gt;Initially, we attempted to train the model on individual data chunks, saving and loading the model between chunks. However, this approach proved suboptimal and failed to yield satisfactory results.&lt;/p&gt;

&lt;p&gt;Subsequently, we experimented with training the model on the entire dataset using various batch sizes and epochs. Through rigorous testing, we discovered that the model performed best when the data was skewed to contain more real data than fake, with an optimal batch size of 32 and diminishing returns observed after 25 epochs.&lt;/p&gt;

&lt;p&gt;To address these challenges, we considered scaling up to larger models like CoAtNet-2 or CoAtNet-3, which could offer improved performance but come with significantly larger file sizes (over 900 MB after training).&lt;/p&gt;

&lt;p&gt;In our reference project by “The CVIT,” over 90% accuracy was achieved, but with a model size of around 1 GB. In contrast, our CoAtNet-0 model, at approximately 200 MB, correctly detects around 8 out of 10 videos in the DFDC dataset and shows promising performance on real-world deepfake videos.&lt;/p&gt;

&lt;p&gt;Despite our efforts, we reached a deadlock in further improving the small model’s performance. Currently, we are training CoAtNet-2 on three data chunks for 25 epochs with a batch size of 12, which takes approximately 30 hours on a limited GPU.&lt;/p&gt;

&lt;p&gt;Checkout the Project Code:&lt;br&gt;
&lt;a href="https://github.com/Nikhilreddy024/Deepfake_detection-using-Coatnet" rel="noopener noreferrer"&gt;https://github.com/Nikhilreddy024/Deepfake_detection-using-Coatnet&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CoAtNet: Marrying Convolution and Attention for All Data Sizes: &lt;a href="https://arxiv.org/pdf/2106.04803" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2106.04803&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/chinhsuanwu/coatnet-pytorch" rel="noopener noreferrer"&gt;https://github.com/chinhsuanwu/coatnet-pytorch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;CViT: &lt;a href="https://github.com/erprogs/CViT" rel="noopener noreferrer"&gt;https://github.com/erprogs/CViT&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://link.springer.com/article/10.1007/s42979-023-02294-y" rel="noopener noreferrer"&gt;https://link.springer.com/article/10.1007/s42979-023-02294-y&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2102.11126" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2102.11126&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/@LazyProgrammerOfficial" rel="noopener noreferrer"&gt;https://www.youtube.com/@LazyProgrammerOfficial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>A Robust DeepFake Detection system</title>
      <dc:creator>Nikhil Reddy</dc:creator>
      <pubDate>Sat, 01 Feb 2025 23:16:50 +0000</pubDate>
      <link>https://dev.to/nikhil_reddy_/a-robust-deepfake-detection-system-1h42</link>
      <guid>https://dev.to/nikhil_reddy_/a-robust-deepfake-detection-system-1h42</guid>
      <description>&lt;p&gt;Deepfake videos are a growing concern in today’s digital landscape due to their potential to spread misinformation and manipulate public opinion. In this project, we aim to develop a system that detects deepfake videos using the DeepFake Detection Challenge (DFDC) dataset from Kaggle. Our ultimate goal is to create a reliable tool for real-time detection of deepfakes, helping to preserve the authenticity of media content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Domain and Brief Project Description&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The advancement of artificial intelligence technologies has made it possible to create highly realistic yet fake videos. Deepfake videos use AI to superimpose faces or manipulate expressions, making it increasingly challenging to distinguish between real and fake content. This poses a threat to the credibility of online media and can have serious implications for society.&lt;/p&gt;

&lt;p&gt;Our project seeks to address this issue by creating a system capable of detecting deepfake videos. The system will take video content as input and provide a prediction on whether it is real or fake. The DFDC dataset, which contains a variety of real and fake videos, serves as the foundation for our training and evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploring Deep Fake&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deepfakes have spurred a great deal of research and development in detection, leading to the creation and evaluation of numerous models and architectures. Some models and architectures that have shown promising progress in deepfake detection include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;EfficientNet: EfficientNet is a family of models known for their efficient use of parameters and computational resources while maintaining high performance. Variants such as EfficientNet-B4 and EfficientNet-B4-ST have been employed in deepfake detection with success.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;XceptionNet: XceptionNet is an extension of InceptionNet and is known for its use of depthwise separable convolutions. It has been widely used in image classification tasks and is effective in detecting deepfake videos due to its strong feature extraction capabilities. XceptionNet for FaceForensics++: &lt;a href="https://paperswithcode.com/paper/faceforensics-learning-to-detect-manipulated" rel="noopener noreferrer"&gt;https://paperswithcode.com/paper/faceforensics-learning-to-detect-manipulated&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Attention Mechanisms: Attention-based models, such as self-attention and multi-head attention, have been used in conjunction with CNN architectures to capture long-range dependencies and improve deepfake detection accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensemble Models: Ensemble methods combine multiple models to improve overall performance and robustness. Ensembles of CNNs with self-attention mechanisms and other deep learning architectures have shown effectiveness in deepfake detection.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
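&lt;p&gt;For the ensemble approach in particular, the simplest combination rule is soft voting: average the fake-probability each model outputs. A minimal sketch (our own, with made-up scores):&lt;/p&gt;

```python
def soft_vote(fake_probs):
    """Soft-voting ensemble: average the fake-probability each model assigns."""
    return sum(fake_probs) / len(fake_probs)

# Hypothetical scores from three detectors for one video:
combined = soft_vote([0.81, 0.65, 0.90])
print(round(combined, 4))  # 0.7867
```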

&lt;p&gt;These models and architectures have contributed significantly to the progress of deepfake detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experimenting Model Architectures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While many deepfake detection approaches leverage temporal information from video data, our project explores model architectures that focus on spatial modeling without incorporating explicit temporal components.&lt;/p&gt;

&lt;p&gt;The rationale behind this approach is that even in the absence of temporal modeling, there may be inherent visual cues and artifacts within individual video frames that can reliably distinguish real from deepfake content. Facial textures, lighting inconsistencies, and other localized anomalies may be present in deepfake images that can be captured by models that prioritize spatial feature extraction and analysis.&lt;/p&gt;

&lt;p&gt;Our experiments involve the following types of model architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoAtNet: This model represents a hybrid approach, combining the strengths of CNNs and transformers. It incorporates depthwise convolutions from CNNs for efficient feature extraction and relative attention from transformers to capture long-range dependencies within image data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2106.04803" rel="noopener noreferrer"&gt;paper&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DenseNet with Vision Transformers: We are experimenting with integrating DenseNet architecture with vision transformers to leverage the rich feature extraction capabilities of DenseNet and the long-range context-handling ability of transformers.&lt;/li&gt;
&lt;li&gt;Efficient Vision Transformer: Similar to a CNN-ViT integration, this approach has proven successful in detecting deepfakes using the datasets we used. It offers an efficient architecture that balances feature extraction with transformer-based context analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://paperswithcode.com/paper/combining-efficientnet-and-vision" rel="noopener noreferrer"&gt;paper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Preprocessing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Proper data preprocessing is essential for training accurate and robust models. In our project, we focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Face Extraction: We employ various methods such as MTCNN (Multi-task Cascaded Convolutional Networks) for effective face extraction from videos.&lt;/li&gt;
&lt;li&gt;Data Augmentation: We use augmentations such as CutMix, which combine data samples in novel ways to expose the model to a wider variety of data during training, alongside model ensembling to improve robustness.&lt;/li&gt;
&lt;/ul&gt;
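&lt;p&gt;For reference, here is a toy CutMix sketch on nested-list “images” (our own simplified version; real pipelines operate on image tensors and sample the mixing ratio per batch):&lt;/p&gt;

```python
import random

def cutmix(img_a, img_b, label_a, label_b, lam, rng):
    """Paste a random rectangular patch of img_b into img_a (CutMix-style).

    img_a / img_b are H x W nested lists; labels are scalars in [0, 1].
    The mixed label is weighted by the fraction of img_a that survives.
    """
    h, w = len(img_a), len(img_a[0])
    # Patch side lengths scale with sqrt(1 - lam), as in the CutMix paper.
    cut_h = int(h * (1 - lam) ** 0.5)
    cut_w = int(w * (1 - lam) ** 0.5)
    y0 = rng.randrange(h - cut_h + 1)
    x0 = rng.randrange(w - cut_w + 1)
    mixed = [row[:] for row in img_a]
    for y in range(y0, y0 + cut_h):
        for x in range(x0, x0 + cut_w):
            mixed[y][x] = img_b[y][x]
    patch_frac = (cut_h * cut_w) / (h * w)
    mixed_label = (1 - patch_frac) * label_a + patch_frac * label_b
    return mixed, mixed_label

# Toy 4x4 "images": all-zero (real, label 0) and all-one (fake, label 1).
real = [[0] * 4 for _ in range(4)]
fake = [[1] * 4 for _ in range(4)]
img, lbl = cutmix(real, fake, 0.0, 1.0, lam=0.75, rng=random.Random(0))
print(lbl)  # 0.25: a quarter of the pixels now come from the fake image
```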

&lt;p&gt;&lt;strong&gt;Model Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After training the model, we will evaluate its performance using a test set from the DFDC dataset. This evaluation will help us understand how well the model is performing in detecting deepfakes and where improvements might be needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current best models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpkfokvjtlyrduq7fydn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpkfokvjtlyrduq7fydn.png" alt="Image description" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Cross Efficient Vision Transformer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AUC: 0.951&lt;/li&gt;
&lt;li&gt;Model: Cross Efficient Vision Transformer&lt;/li&gt;
&lt;li&gt;Tags: CNN+Transformer, Vision Transformer, EfficientNet&lt;/li&gt;
&lt;li&gt;Paper: Combining EfficientNet and Vision Transformers for Video Deepfake Detection&lt;/li&gt;
&lt;li&gt;Year: 2021&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Cross-Efficient Vision Transformer model represents an innovative approach by combining EfficientNet and Vision Transformers for deepfake detection. This hybrid architecture leverages the strengths of EfficientNet’s feature extraction capabilities and Vision Transformer’s long-range dependency analysis. The model’s AUC of 0.951 signifies its high effectiveness in detecting deepfake videos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Efficient Vision Transformer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AUC: 0.919&lt;/li&gt;
&lt;li&gt;Model: Efficient Vision Transformer&lt;/li&gt;
&lt;li&gt;Tags: CNN+Transformer, Vision Transformer, EfficientNet&lt;/li&gt;
&lt;li&gt;Paper: Combining EfficientNet and Vision Transformers for Video Deepfake Detection&lt;/li&gt;
&lt;li&gt;Year: 2021&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Efficient Vision Transformer is another powerful model that integrates EfficientNet and Vision Transformers. With an AUC of 0.919, this architecture has demonstrated strong performance in detecting deepfake videos, showcasing the potential of combining CNNs and transformers for this task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. EfficientNetB4 + EfficientNetB4ST + B4Att&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LogLoss: 0.4640&lt;/li&gt;
&lt;li&gt;Model: EfficientNetB4 + EfficientNetB4ST + B4Att&lt;/li&gt;
&lt;li&gt;Paper: &lt;a href="https://paperswithcode.com/paper/video-face-manipulation-detection-through" rel="noopener noreferrer"&gt;https://paperswithcode.com/paper/video-face-manipulation-detection-through&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Year: 2020&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensemble model combines variants of EfficientNet (EfficientNetB4 and EfficientNetB4ST) with B4Att to enhance deepfake detection. The approach uses multiple models in an ensemble to improve overall performance and robustness. The reported LogLoss of 0.4640 indicates a solid performance in detecting manipulated videos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges and Future Work&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While our model is still under development, we anticipate several challenges and areas for future work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distinguishing Real and Fake: The sophistication of deepfake technology presents a challenge in accurately distinguishing between real and fake content. Continued improvement and experimentation with model architectures will be key.&lt;/li&gt;
&lt;li&gt;Real-Time Detection: Achieving real-time deepfake detection is a long-term goal. This requires optimizing the model for speed and efficiency while maintaining accuracy.&lt;/li&gt;
&lt;li&gt;Scalability and Adaptability: Our system needs to be scalable and adaptable to different types of video content and evolving deepfake techniques.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools and Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this project, we leveraged several tools and resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DFDC Dataset from Kaggle: This comprehensive dataset provided real and fake videos for training and evaluation.&lt;/li&gt;
&lt;li&gt;Face detection libraries: BlazeFace, MTCNN, and face_recognition.&lt;/li&gt;
&lt;li&gt;Code and Resources: We are building upon existing code and resources for deep learning and computer vision, which provides a strong foundation for our project.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: Due to limited computational resources, we are training our model on only a subset of the DFDC dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While our deepfake detection model is still under development, our project holds great potential for addressing the pressing issue of deepfake videos. By leveraging advanced techniques such as vision transformers, we aim to create an accurate system for detecting deepfakes in real time.&lt;/p&gt;

&lt;p&gt;As we continue our work, we are excited to explore new approaches and make improvements to our model. This project will ultimately contribute to a safer and more trustworthy digital environment by combating the spread of misinformation and ensuring the authenticity of online media. We look forward to sharing our progress and results as our project evolves.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Depth-Driven Vision: With Object Detection and Depth Estimation</title>
      <dc:creator>Nikhil Reddy</dc:creator>
      <pubDate>Sat, 01 Feb 2025 23:02:19 +0000</pubDate>
      <link>https://dev.to/nikhil_reddy_/depth-driven-vision-with-object-detection-and-depth-estimation-4m06</link>
      <guid>https://dev.to/nikhil_reddy_/depth-driven-vision-with-object-detection-and-depth-estimation-4m06</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the world of computer vision, combining object detection with depth estimation has significantly improved how accurately we can analyze videos. This blog describes a project that uses two powerful tools: YOLO for detecting objects and Depth Anything for estimating depth. By blending these advanced models, the project not only detects objects reliably but also estimates how far they are from the camera’s reference point in each video frame.&lt;/p&gt;

&lt;p&gt;Imagine a scenario where a continuous stream of video frames undergoes processing, with YOLO diligently detecting specific objects and Depth_anything overlaying crucial depth information. This mix of technologies doesn’t just do object identification; it also shows us how far things are from the camera, giving us valuable insights into the spatial layout of each frame.&lt;/p&gt;

&lt;h2&gt;
  
  
  YOLO: You Only Look Once
&lt;/h2&gt;

&lt;p&gt;YOLO, short for “You Only Look Once,” is a pioneering object detection algorithm known for its speed and accuracy in real-time applications. It revolutionized the field by introducing a single neural network that predicts bounding boxes and class probabilities directly from full images in one evaluation, making it incredibly efficient. The evolution of the YOLO series has seen significant advancements from YOLOv1 to the latest YOLOv8. Each iteration has brought improvements in speed, accuracy, and model architecture. Notable versions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;YOLOv4: Introduced bag-of-freebies and bag-of-specials techniques, achieving impressive results.&lt;/li&gt;
&lt;li&gt;YOLOv5: Developed using PyTorch, it focused on user-friendliness for training and deployment, achieving high accuracy on the MS COCO dataset.&lt;/li&gt;
&lt;li&gt;Scaled-YOLOv4: Introduced scaling-up and scaling-down techniques for improved accuracy and speed.&lt;/li&gt;
&lt;li&gt;YOLOv8: The latest version released in January 2023 by Ultralytics, featuring an anchor-free approach, faster Non-maximum Suppression (NMS), and various enhancements for improved performance.&lt;/li&gt;
&lt;/ul&gt;
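<p>To illustrate the “single evaluation” idea, here is a toy, YOLOv1-style decoder (a deliberately simplified sketch, not the code of any real YOLO release): one pass over an S×S output grid turns each cell’s raw predictions into scored boxes.</p>

```python
import numpy as np

def decode_yolo_grid(pred, conf_thresh=0.5):
    """Decode a simplified S x S x (5 + C) YOLO-style output tensor.

    Each cell predicts one box: (x, y, w, h, objectness) plus C class scores,
    with x, y relative to the cell and w, h relative to the whole image.
    """
    S = pred.shape[0]
    boxes = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, obj = pred[row, col, :5]
            cls_scores = pred[row, col, 5:]
            cls_id = int(np.argmax(cls_scores))
            score = obj * cls_scores[cls_id]  # box score = objectness * class prob
            if score >= conf_thresh:
                # convert cell-relative center to image-relative coordinates
                cx, cy = (col + x) / S, (row + y) / S
                boxes.append((cx, cy, float(w), float(h), float(score), cls_id))
    return boxes

# One confident detection in cell (row 1, col 2) of a 3x3 grid with 2 classes
pred = np.zeros((3, 3, 7))
pred[1, 2] = [0.5, 0.5, 0.2, 0.3, 0.9, 0.1, 0.95]
print(decode_yolo_grid(pred))
```

<p>Real YOLO versions add anchors (or anchor-free heads), multi-scale feature maps, and Non-maximum Suppression on top of this basic decode-in-one-sweep structure.</p>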

&lt;h3&gt;
  
  
  YOLO-World
&lt;/h3&gt;

&lt;p&gt;The YOLO-World model, based on YOLOv8, is a real-time open-vocabulary object detection system that excels in detecting any object within an image based on descriptive texts. It offers a versatile tool for vision-based applications with lower computational demands while maintaining competitive performance. The model is user-friendly, easy to integrate into Python applications, and provides pre-trained weights for efficient deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time object detection with just a prompt.&lt;/li&gt;
&lt;li&gt;Code and pre-trained weights are available on the YOLO-World GitHub.&lt;/li&gt;
&lt;li&gt;Utilizes the CNN-based YOLO architecture for unmatched speed and accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;YOLO-World’s architecture efficiently fuses image features with text embeddings using components like YOLO Detector, Text Encoder, and RepVL-PAN for multi-level cross-modality fusion.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/AILab-CVC" rel="noopener noreferrer"&gt;
        AILab-CVC
      &lt;/a&gt; / &lt;a href="https://github.com/AILab-CVC/YOLO-World" rel="noopener noreferrer"&gt;
        YOLO-World
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      [CVPR 2024] Real-Time Open-Vocabulary Object Detection
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/AILab-CVC/YOLO-World./assets/yolo_logo.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FAILab-CVC%2FYOLO-World.%2Fassets%2Fyolo_logo.png" width="60%"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;a href="https://scholar.google.com/citations?hl=zh-CN&amp;amp;user=PH8rJHYAAAAJ" rel="nofollow noopener noreferrer"&gt;Tianheng Cheng&lt;/a&gt;&lt;sup&gt;&lt;span&gt;2,3,*&lt;/span&gt;&lt;/sup&gt;
&lt;a href="https://linsong.info/" rel="nofollow noopener noreferrer"&gt;Lin Song&lt;/a&gt;&lt;sup&gt;&lt;span&gt;1,📧,*&lt;/span&gt;&lt;/sup&gt;
&lt;a href="https://yxgeee.github.io/" rel="nofollow noopener noreferrer"&gt;Yixiao Ge&lt;/a&gt;&lt;sup&gt;&lt;span&gt;1,🌟,2&lt;/span&gt;&lt;/sup&gt;
&lt;a href="http://eic.hust.edu.cn/professor/liuwenyu/" rel="nofollow noopener noreferrer"&gt; Wenyu Liu&lt;/a&gt;&lt;sup&gt;&lt;span&gt;3&lt;/span&gt;&lt;/sup&gt;,
&lt;a href="https://xwcv.github.io/" rel="nofollow noopener noreferrer"&gt;Xinggang Wang&lt;/a&gt;&lt;sup&gt;&lt;span&gt;3,📧&lt;/span&gt;&lt;/sup&gt;,
&lt;a href="https://scholar.google.com/citations?user=4oXBp9UAAAAJ&amp;amp;hl=en" rel="nofollow noopener noreferrer"&gt;Ying Shan&lt;/a&gt;&lt;sup&gt;&lt;span&gt;1,2&lt;/span&gt;&lt;/sup&gt;
&lt;br&gt;
&lt;p&gt;* Equal contribution 🌟 Project lead 📧 Corresponding author&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;1&lt;/sup&gt; Tencent AI Lab,  &lt;sup&gt;2&lt;/sup&gt; ARC Lab, Tencent PCG
&lt;sup&gt;3&lt;/sup&gt; Huazhong University of Science and Technology
&lt;br&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Notice&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;YOLO-World is still under active development!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We recommend that everyone &lt;strong&gt;use English to communicate on issues&lt;/strong&gt;, as this helps developers from around the world discuss, share experiences, and answer questions together.&lt;/p&gt;
&lt;p&gt;For business licensing and other related inquiries, don't hesitate to contact &lt;code&gt;yixiaoge@tencent.com&lt;/code&gt;.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🔥 Updates&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;[2024-11-5]&lt;/code&gt;: We update the &lt;code&gt;YOLO-World-Image&lt;/code&gt; and you can try it at HuggingFace &lt;a href="https://huggingface.co/spaces/wondervictor/YOLO-World-Image" rel="nofollow noopener noreferrer"&gt;YOLO-World-Image (Preview Version)&lt;/a&gt;. It's a &lt;em&gt;preview&lt;/em&gt; version and we are still improving it! Detailed documents about training and few-shot inference are coming soon.&lt;br&gt;
&lt;code&gt;[2024-7-8]&lt;/code&gt;: YOLO-World now has been integrated into &lt;a href="https://github.com/StevenGrove/ComfyUI-YOLOWorld" rel="noopener noreferrer"&gt;ComfyUI&lt;/a&gt;! Come and try adding YOLO-World to your…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/AILab-CVC/YOLO-World" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Depth Anything: Revolutionizing Depth Estimation in Computer Vision
&lt;/h2&gt;

&lt;p&gt;Depth Anything is a model that represents a significant advancement in monocular depth estimation. Trained on a vast dataset comprising 1.5 million labeled images and over 62 million unlabeled images, Depth Anything stands out as a robust foundation model for Monocular Depth Estimation (MDE), offering strong features and capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of Depth Anything:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-Shot Relative Depth Estimation: Outperforming MiDaS v3.1 (BEiTL-512), Depth Anything excels in zero-shot relative depth estimation.&lt;/li&gt;
&lt;li&gt;Zero-Shot Metric Depth Estimation: Surpassing ZoeDepth, this model showcases superior zero-shot metric depth estimation capabilities.&lt;/li&gt;
&lt;li&gt;Optimal In-Domain Fine-Tuning: Through fine-tuning and evaluation on datasets like NYUv2 and KITTI, Depth Anything demonstrates exceptional performance.&lt;/li&gt;
&lt;li&gt;Enhanced ControlNet: The model enhances a depth-conditioned ControlNet based on its architecture, surpassing previous versions based on MiDaS.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://depth-anything.github.io/?source=post_page-----a02b7b135caa--------------------------------" rel="noopener noreferrer"&gt;
      depth-anything.github.io
    &lt;/a&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Project Architecture
&lt;/h2&gt;

&lt;p&gt;This project showcases a system that smoothly integrates cutting-edge models like YOLO-World and Depth_Anything to enhance video analysis capabilities significantly. By combining state-of-the-art technologies in object detection, depth estimation, and distance calculation, this system provides a holistic approach to video analysis, yielding impactful outcomes and meaningful visual insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Components:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input Processing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The video input undergoes frame-by-frame processing using OpenCV (cv2) to extract visual data for analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Object Detection with YOLO-World:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilizing the YOLO-World model, specific classes are identified within each frame.&lt;/li&gt;
&lt;li&gt;Detected objects are visually highlighted with bounding boxes and class labels for easy identification.&lt;/li&gt;
&lt;li&gt;Each detection returns the bounding-box coordinates, a confidence score, and a class ID.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict
import cv2
from ultralytics import YOLO

# Initialize YOLO model
model = YOLO('yolov8s-world.pt')
model.set_classes(["car"])  # Define custom classes

# Open video stream
cap = cv2.VideoCapture('road_test.mp4') 
# Store the track history
track_history = defaultdict(lambda: [])

# Loop through the video frames
while cap.isOpened():
    # Read a frame from the video
    success, frame = cap.read()

    if success:
        # Run YOLOv8 tracking on the frame, persisting tracks between frames
        results = model.track(frame, persist=True)
        print(results[0].boxes)
        # Visualize the results on the frame
        annotated_frame = results[0].plot()
        # Check if there is any detection and extract coordinates and confidence score
        if results[0].boxes.data.numel() &amp;gt; 0:
            detections = results[0].boxes.data[0].tolist()
            xmin, ymin, xmax, ymax, conf, class_id = detections
            print("xmin=", round(xmin))
            print("ymin=", round(ymin))
            print("xmax=", round(xmax))
            print("ymax=", round(ymax))
            print("conf=", conf)
        # Display the annotated frame
        cv2.imshow("Tracking", annotated_frame)

        # Break the loop if 'q' is pressed
        if cv2.waitKey(1) &amp;amp; 0xFF == ord("q"):
            break
    else:
        # Break the loop if the end of the video is reached
        break

# Release the video capture object and close the display window
cap.release()
cv2.destroyAllWindows()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Depth Estimation with Depth_Anything:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Depth_Anything pre-trained model comes into play, generating detailed depth maps based on the processed frames.&lt;/li&gt;
&lt;li&gt;These depth maps provide crucial spatial information essential for accurate distance calculation.&lt;/li&gt;
&lt;li&gt;Using the coordinates of the bounding box, we look up the pixel values of the target object in the depth map.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Using the frames collected, applying the depth_anything model to mask with depth maps. 
frame = cv2.cvtColor(raw_frame, cv2.COLOR_BGR2RGB) / 255.0
frame = transform({'image': frame})['image']
frame = torch.from_numpy(frame).unsqueeze(0).to(DEVICE)
with torch.no_grad():
    depth = depth_anything(frame)
depth = F.interpolate(depth[None], (frame_height, frame_width), mode='bilinear', align_corners=False)[0, 0]
depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
print(depth)
depth = depth.cpu().numpy().astype(np.uint8)
depth_color = cv2.applyColorMap(depth, cv2.COLORMAP_INFERNO)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Distance Calculation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilizing pixel values of the bounding box, the system calculates the relative distance of detected objects from the camera reference point.&lt;/li&gt;
&lt;li&gt;Here, the distance is measured along a single direction from the camera reference point to the center point of the object’s bounding box.&lt;/li&gt;
&lt;li&gt;Changes in distance for each frame are recorded, offering valuable insights into object movement and spatial dynamics.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Distance is inversely proportional to the pixel value of the center of the bounding box.
dist = depth[round((round(ymin)+round(ymax))/2)][round((round(xmin)+round(xmax))/2)]
dist = 1/(dist)
if dist &amp;gt; 0:
    dist_arr.append(dist)
#Here the distance we got doesn't have any units. It is later converted to meters using the relative distance technique.
#Note- You can use any other technique to calculate the relative distance.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
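<p>The conversion to meters mentioned in the code comment can be sketched as a one-point calibration: if the object’s true distance is known in a single reference frame, a scale factor maps all relative values to meters. This is only an illustrative sketch; the 12 m reference value and the sample distances are made-up assumptions.</p>

```python
def calibrate_scale(relative_ref, known_meters):
    """Scale factor derived from one frame where the true distance is known."""
    return known_meters / relative_ref

def to_meters(relative_values, scale):
    """Convert unitless relative distances into meters."""
    return [v * scale for v in relative_values]

# dist_arr holds unitless inverse-pixel distances taken from the depth map
dist_arr = [0.004, 0.005, 0.008, 0.010]
scale = calibrate_scale(dist_arr[0], known_meters=12.0)  # assumed reference
print(to_meters(dist_arr, scale))
```

<p>Any other calibration, for example using an object of known physical size, would work equally well; the key point is that monocular relative depth needs at least one external reference to become metric.</p>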



&lt;p&gt;&lt;strong&gt;Visualization and Output:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual representations include masked color maps depicting depth information and annotated frames showcasing detected objects along with their respective distances.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Frames in original colour with bounding box detecting the target object.
out_frame = cv2.rectangle(frame, (int(xmin), int(ymin)), (int(xmax), int(ymax)), (0, 255, 0), 2)
out_frame = cv2.putText(frame, f'{class_name} {conf:.2f} Dist: {dist:.3f}', (int(xmin), int(ymin) - 10),
                                      cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
#Masked Frames with depth maps with bounding boxes detecting the target object.
masked_frame = cv2.rectangle(depth_color, (int(xmin), int(ymin)), (int(xmax), int(ymax)), (0, 255, 0), 2)
masked_frame = cv2.putText(depth_color, f'{class_name} Distance: {dist:.3f}', (int(xmin), int(ymin) - 10),
                                      cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results &amp;amp; Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/726RbeFdVGk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In this project, we investigated the variation in distance (depth) of a car (object) over time (number of frames). We achieved this by employing a combination of YOLO for object detection and a depth estimation model to track the car’s distance from the camera across individual video frames.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Variation in distance (depth) of a car (object) over time (number of frames)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbpb0ox2hp11iaeg0w1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbpb0ox2hp11iaeg0w1s.png" alt="Image description" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graph depicts a clear trend: the car’s distance steadily increased with each frame, signifying a movement away from the camera. This implies the car traversed the field of view throughout the video. The distance fluctuated slightly at times, but the overall trajectory indicated a receding motion.&lt;/p&gt;

&lt;p&gt;The minimum observed distance placed the car in the camera’s immediate vicinity in the very first frame, suggesting the car was either stationary or initially approaching the camera. Conversely, the maximum distance captured the car at its furthest point from the camera within the video sequence.&lt;/p&gt;

&lt;p&gt;It’s important to acknowledge that this analysis captures the car’s distance along a single axis. Movement in other directions, such as lateral motion across the camera’s field of view, is not reflected in this data.&lt;/p&gt;

&lt;p&gt;This project demonstrates the effectiveness of YOLO and depth estimation models in analyzing car movement within a video. The generated distance data offers valuable insights for tasks like motion tracking and speed estimation.&lt;/p&gt;
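<p>As a pointer toward the speed-estimation use case mentioned above, per-frame distances plus the video frame rate are enough for a rough radial-speed estimate. This is a sketch with made-up numbers, not output from the actual project:</p>

```python
def radial_speeds(distances_m, fps):
    """Approximate radial speed (m/s) between consecutive frames."""
    dt = 1.0 / fps  # time elapsed between two frames
    return [(b - a) / dt for a, b in zip(distances_m, distances_m[1:])]

# Hypothetical calibrated distances in meters, one value per frame
dists = [12.0, 12.5, 13.1, 13.8]
print(radial_speeds(dists, fps=2))
```

<p>Positive values correspond to the receding motion observed in the graph; smoothing the distance series first (e.g. with a moving average) would reduce the frame-to-frame fluctuations noted earlier.</p>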

</description>
    </item>
    <item>
      <title>Monitoring the Yezin Dam: A Journey Through Time with Computer Vision</title>
      <dc:creator>Nikhil Reddy</dc:creator>
      <pubDate>Sat, 01 Feb 2025 22:23:21 +0000</pubDate>
      <link>https://dev.to/nikhil_reddy_/monitoring-the-yezin-dam-a-journey-through-time-with-computer-vision-2lmg</link>
      <guid>https://dev.to/nikhil_reddy_/monitoring-the-yezin-dam-a-journey-through-time-with-computer-vision-2lmg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Nestled near Yezin village, east of the Yangon-Mandalay Highway, lies the Yezin Dam — an essential infrastructure contributing to the agricultural prosperity of Zayarthiri Township in Nay Pyi Taw City, Myanmar. Constructed in 1975, the dam’s primary purpose is to facilitate irrigation for the surrounding agricultural areas and mitigate flooding from Sinthay and Yezin Streams.&lt;/p&gt;

&lt;p&gt;Motivation for Monitoring: The Yezin Dam stands as a testament to human-engineered solutions addressing crucial water resource challenges. The motivation behind monitoring this vital structure over time is deeply rooted in understanding its dynamic interactions with the environment. By tracking changes in water levels, we aim to unravel insights into climate influences, dam management practices, and the broader implications for the region’s water resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Timelapse: A Window to the Past&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To embark on this temporal exploration, we turn to Google Timelapse — a powerful tool harnessing the extensive satellite imagery archive from 1984 to 2022. This remarkable resource allows us to observe the evolution of landscapes, including the Yezin Dam, with unprecedented clarity. The time-lapse imagery offers a unique perspective, enabling the observation of long-term trends and environmental transformations.&lt;/p&gt;

&lt;p&gt;By leveraging Google Timelapse, we gain access to a visual chronicle of the Yezin Dam’s journey through decades. This comprehensive dataset serves as the foundation for our endeavor to employ advanced computer vision techniques, providing a nuanced understanding of how the dam and its surroundings have changed over time.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://earthengine.google.com/timelapse/?source=post_page-----750c82b9e012--------------------------------" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fearthengine.google.com%2Fstatic%2Fimages%2Ffb-landsat-sofala.jpg" height="421" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://earthengine.google.com/timelapse/?source=post_page-----750c82b9e012--------------------------------" rel="noopener noreferrer" class="c-link"&gt;
          Timelapse – Google Earth Engine
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Explore the dynamics of our changing planet over the past three and a half decades.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fearthengine.google.com%2Fstatic%2Fimages%2Fearth_engine-228.png" width="228" height="228"&gt;
        earthengine.google.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Data Collection Insights
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Google Timelapse Unveiled&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our exploration of the Yezin Dam’s temporal evolution commenced with the bountiful imagery from Google Timelapse, spanning 1984 to 2022. This visual journey provided a captivating narrative of the dam’s transformation over nearly four decades.&lt;/p&gt;

&lt;p&gt;Here are some sample satellite images of the Yezin Dam from Google Timelapse for selected years (1996–2002):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqpkox28utlzukbczi91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqpkox28utlzukbczi91.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crafting Uniformity Amidst Diversity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigating the terrain of data collection, we confronted a formidable challenge — divergent resolutions and sizes spanning different years. Achieving analytical consistency necessitated a dedicated effort to standardize these images. The intricate yet pivotal task of harmonizing disparate shapes and sizes became our focus, demanding precision to ensure uniformity across the dataset.&lt;/p&gt;

&lt;p&gt;Amidst this endeavor, we encountered additional challenges. The nature of the source images restricted any alteration to their resolution or dimensions, and attempts to upscale them proved counterproductive, yielding unexpected and noisy results; applying the model to upscaled images likewise produced outcomes contrary to our expectations. Despite these hurdles, we persisted, using online tools to transform the varied images into a cohesive collection with uniform dimensions and sizes.&lt;/p&gt;
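<p>The standardization step itself can be sketched as forcing every yearly snapshot to one target shape. A nearest-neighbour resize in pure NumPy is shown for illustration; in practice a library such as OpenCV or Pillow would do this, and the 512×512 target size is an assumption, not the project’s actual setting:</p>

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an H x W (x C) image array."""
    h, w = img.shape[:2]
    rows = (np.arange(out_h) * h) // out_h  # source row for each output row
    cols = (np.arange(out_w) * w) // out_w  # source column for each output column
    return img[rows][:, cols]

def standardize(images, shape=(512, 512)):
    """Bring images of mixed resolutions to one common shape."""
    return [resize_nearest(im, *shape) for im in images]

# Two placeholder frames with the kind of mixed resolutions the archive produced
frames = [np.zeros((450, 800, 3), dtype=np.uint8),
          np.zeros((720, 1280, 3), dtype=np.uint8)]
print([f.shape for f in standardize(frames)])
```

<p>Uniform shapes matter because segmentation models expect a fixed input size, and pixel-count comparisons across years are only meaningful when every image covers the frame with the same dimensions.</p>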
&lt;h2&gt;
  
  
  Model Selection: UNET Water Body Segmentation
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/gauthamk02" rel="noopener noreferrer"&gt;
        gauthamk02
      &lt;/a&gt; / &lt;a href="https://github.com/gauthamk02/pytorch-waterbody-segmentation" rel="noopener noreferrer"&gt;
        pytorch-waterbody-segmentation
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      PyTorch implementation of image segmentation for identifying water bodies from satellite images
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;title&lt;/th&gt;
  &lt;th&gt;emoji&lt;/th&gt;
  &lt;th&gt;colorFrom&lt;/th&gt;
  &lt;th&gt;colorTo&lt;/th&gt;
  &lt;th&gt;sdk&lt;/th&gt;
  &lt;th&gt;sdk_version&lt;/th&gt;
  &lt;th&gt;app_file&lt;/th&gt;
  &lt;th&gt;pinned&lt;/th&gt;
  &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;&lt;div&gt;Water Body Segmentation&lt;/div&gt;&lt;/td&gt;
  &lt;td&gt;&lt;div&gt;🤗&lt;/div&gt;&lt;/td&gt;
  &lt;td&gt;&lt;div&gt;blue&lt;/div&gt;&lt;/td&gt;
  &lt;td&gt;&lt;div&gt;gray&lt;/div&gt;&lt;/td&gt;
  &lt;td&gt;&lt;div&gt;gradio&lt;/div&gt;&lt;/td&gt;
  &lt;td&gt;&lt;div&gt;3.10.0&lt;/div&gt;&lt;/td&gt;
  &lt;td&gt;&lt;div&gt;app.py&lt;/div&gt;&lt;/td&gt;
  &lt;td&gt;&lt;div&gt;false&lt;/div&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;UNET Water Body Segmentation - PyTorch&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;This project contains the code for training and deploying a UNET model for water body segmentation from satellite images. The model is trained on the &lt;a href="https://www.kaggle.com/datasets/franciscoescobar/satellite-images-of-water-bodies" rel="nofollow noopener noreferrer"&gt;Satellite Images of Water Bodies&lt;/a&gt; from Kaggle. The model is trained using PyTorch and deployed using &lt;a href="https://gradio.app/" rel="nofollow noopener noreferrer"&gt;Gradio&lt;/a&gt; on &lt;a href="https://huggingface.co/spaces" rel="nofollow noopener noreferrer"&gt;Hugging Face Spaces&lt;/a&gt;.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🚀 Getting Started&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;All the code for training the model and exporting it to ONNX format is in the &lt;a href="https://github.com/gauthamk02/pytorch-waterbody-segmentation/tree/main/notebooks" rel="noopener noreferrer"&gt;notebooks&lt;/a&gt; folder, or you can use this &lt;a href="https://www.kaggle.com/code/gauthamkrishnan119/water-body-segmentation-pytorch" rel="nofollow noopener noreferrer"&gt;Kaggle Notebook&lt;/a&gt; to train the model. The &lt;a href="https://github.com/gauthamk02/pytorch-waterbody-segmentation/blob/main/app.py" rel="noopener noreferrer"&gt;app.py&lt;/a&gt; file contains the code for deploying the model with Gradio.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🤗 Demo&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;You can try out the model on &lt;a href="https://huggingface.co/spaces/gauthamk/water-body-segmentation" rel="nofollow noopener noreferrer"&gt;Hugging Face Spaces&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🖥️ Sample Inference&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/gauthamk02/pytorch-waterbody-segmentation/blob/main/samples/sample1.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fgauthamk02%2Fpytorch-waterbody-segmentation%2Fraw%2Fmain%2Fsamples%2Fsample1.png" alt="Sample Inference"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/gauthamk02/pytorch-waterbody-segmentation" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Our model of choice is the UNET water body segmentation model described above. Trained on Kaggle’s Satellite Images of Water Bodies dataset and published on Hugging Face Spaces, its strengths align directly with our project’s core objective: measuring how a water body changes over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why UNET?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our pursuit of understanding water body dynamics over time, UNET’s reputation for accurate semantic segmentation made it the natural choice. Its encoder-decoder architecture with skip connections preserves fine spatial detail, which is exactly what delineating the shoreline of Yezin Dam requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From Kaggle to ONNX&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exporting the Kaggle-trained model to the ONNX format lets us run inference with ONNX Runtime, independent of the original training framework, and integrates cleanly into our processing pipeline for Yezin Dam’s imagery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cv2
import numpy as np
import onnxruntime as rt
import matplotlib.pyplot as plt
import os

# Assuming your images are in a folder named 'image-set'
images_folder = '../image-set'
onnx_path = '../weights/model.onnx'

# Get a list of all image files in the folder
image_files = [f for f in os.listdir(images_folder) if f.endswith('.png')]

# Load the ONNX model
session = rt.InferenceSession(onnx_path)
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Lists to store data for plotting
years = []
waterbody_areas = []

for img_file in image_files:
    # Extract the year from the image file name (assuming the file name has the format 'YYYY.png')
    year = int(img_file[:4])

    # Construct the full path to the image
    img_path = os.path.join(images_folder, img_file)

    # Read and preprocess the image
    orig_image = cv2.imread(img_path)
    orig_dim = orig_image.shape[:2]
    orig_image = cv2.cvtColor(orig_image, cv2.COLOR_BGR2RGB)

    image = np.array(orig_image, dtype=np.float32) / 255.0
    image = cv2.resize(image, (256, 256))
    image = np.transpose(image, (2, 0, 1))
    image = np.expand_dims(image, axis=0)

    # Perform inference
    pred_onx = session.run([output_name], {input_name: image.astype(np.float32)})[0]
    pred_onx = pred_onx &amp;gt; 0.5
    pred_onx = cv2.resize(pred_onx[0, 0].astype(np.uint8), (orig_dim[1], orig_dim[0]))

    # Calculate and store waterbody area for each image
    waterbody_area = np.count_nonzero(pred_onx)
    years.append(year)
    waterbody_areas.append(waterbody_area)

    # Display the results (you can modify this part as needed)
    plt.figure(figsize=(10, 10))
    plt.subplot(1, 2, 1)
    plt.imshow(orig_image)
    plt.title('Original Image')
    plt.subplot(1, 2, 2)
    plt.imshow(pred_onx, cmap='gray')
    plt.title('Predicted Mask')
    plt.show()

# Plotting the waterbody areas over time
plt.figure(figsize=(10, 6))
plt.plot(years, waterbody_areas, marker='o')
plt.title('Waterbody Areas Over Time')
plt.xlabel('Year')
plt.ylabel('Waterbody Area')
plt.grid(True)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Results of Code Implementation and Analysis
&lt;/h2&gt;

&lt;p&gt;Below are original satellite images of Yezin Dam, exported from Google Earth Timelapse, alongside the predicted mask for the corresponding year:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 1984&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq32gvxqdin1swaerhrl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq32gvxqdin1swaerhrl.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 1989&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5p2oel5dboezyuq5adav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5p2oel5dboezyuq5adav.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 1993&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9smb2he8oygsug662a9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9smb2he8oygsug662a9f.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 1999&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F406znrh8z1v6ohgokuli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F406znrh8z1v6ohgokuli.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 2006&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwtsnh4uiojy4ldfif12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwtsnh4uiojy4ldfif12.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 2011&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo24wze063y9zosbzfi6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo24wze063y9zosbzfi6.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 2015&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0e6k72gu1girbxt3wdyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0e6k72gu1girbxt3wdyi.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 2019&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8x4c4dza7guzpvttn1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8x4c4dza7guzpvttn1w.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterbody area in 2022&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gi8c14p1u76l8n72cda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gi8c14p1u76l8n72cda.png" alt="Image description" width="598" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the plot of Yezin Dam’s water body area over time, from 1984 to 2022:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg98p4xz2evnnrcv44to0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg98p4xz2evnnrcv44to0.png" alt="Image description" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Early 1990s Surge (1993):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observation: Substantial increase in waterbody area, peaking in 1993 at 57,633 pixels.&lt;/li&gt;
&lt;li&gt;Possible reason: Increased rainfall during this period could have contributed to higher water levels.&lt;/li&gt;
&lt;li&gt;Possible reason: Changes in dam management practices or water release policies may have played a role.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1999 Drop:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observation: Significant drop in waterbody area in 1999, down to 19,064 pixels.&lt;/li&gt;
&lt;li&gt;Possible reason: Reduced rainfall or changes in weather patterns might have lowered water levels.&lt;/li&gt;
&lt;li&gt;Possible reason: Increased water usage, possibly for irrigation or other purposes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2006 Surge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observation: Remarkable surge in waterbody area in 2006, reaching 62,339 pixels.&lt;/li&gt;
&lt;li&gt;Possible reason: Heavy rainfall during this period could have raised water levels.&lt;/li&gt;
&lt;li&gt;Possible reason: Dam-related activities, such as reservoir management or construction, might have influenced the waterbody area.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2011 and 2019 Peaks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observation: Elevated waterbody areas in 2011 and 2019.&lt;/li&gt;
&lt;li&gt;Possible reason: Climate events such as increased rainfall or variations in precipitation patterns.&lt;/li&gt;
&lt;li&gt;Possible reason: Changes in dam operation or water management practices during these years.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recent Stability (2015–2022):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observation: More stable waterbody areas in recent years (2015–2022).&lt;/li&gt;
&lt;li&gt;Possible reason: Improved water management practices or reservoir control measures.&lt;/li&gt;
&lt;li&gt;Possible reason: Climate conditions leading to consistent water levels.&lt;/li&gt;
&lt;/ul&gt;
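&lt;p&gt;The areas reported above are raw pixel counts from the predicted mask. Converting them to physical area requires the ground resolution of the imagery, which varies across Google Earth Timelapse frames; the 30 m figure below (the native resolution of Landsat) is an illustrative assumption, not a measured value.&lt;/p&gt;

```python
# Convert a mask's pixel count to area, assuming square pixels of a
# known ground resolution (30 m is an assumption, not a measured value).
def pixels_to_km2(pixel_count, resolution_m=30.0):
    """Area in square kilometres covered by pixel_count pixels."""
    return pixel_count * (resolution_m ** 2) / 1e6

# Example: the 1993 peak of 57,633 mask pixels
area_1993 = pixels_to_km2(57633)  # about 51.87 km^2 under the 30 m assumption
```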
&lt;h2&gt;
  
  
  Insights from Research Paper
&lt;/h2&gt;

&lt;p&gt;We found a research paper on water level changes in Yezin Dam, published in the Scientific Research Journal (SCIRJ), Volume VII, Issue VIII, August 2019.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://drive.google.com/viewerng/viewer?url=https://www.scirj.org/papers-0819/scirj-P0819687.pdf" rel="noopener noreferrer"&gt;
      drive.google.com
    &lt;/a&gt;
&lt;/div&gt;



&lt;p&gt;Since the dam’s water level is closely related to its surface area, we reviewed this paper for additional insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Climate Conditions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The study area experiences a Tropical Savanna Climate (AW) according to the Köppen Climate Classification. The average temperature increased slightly from 26.97°C in 2006 to 28.14°C in 2016. However, this change alone may not be a dominant factor in water surface area variation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rainfall plays a crucial role. The reported total rainfall across the ten-year period was 1279.93 cm, with fluctuations in individual years (139.47 cm in 2006, 152.09 cm in 2011, and 137.39 cm in 2016). Changes in rainfall patterns could significantly impact water levels.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inflow and Outflow Data:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The study includes data on total inflow and outflow from the dam. In 2006, inflow was 48,716 m³, and outflow was 46,391 m³. In 2011, inflow increased to 58,473 m³, and outflow decreased to 36,683 m³. By 2016, both inflow and outflow significantly reduced to 19,938 m³ and 4,778 m³, respectively.&lt;/li&gt;
&lt;li&gt;The decrease in outflow may indicate increased water retention or reduced water release, affecting the overall water surface area.&lt;/li&gt;
&lt;/ul&gt;
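&lt;p&gt;The inflow and outflow figures quoted above imply a net retention (inflow minus outflow) for each study year; a quick calculation makes the trend explicit:&lt;/p&gt;

```python
# Net retention per study year, using the volumes quoted from the
# SCIRJ paper (cubic metres): inflow minus outflow.
flows = {2006: (48716, 46391), 2011: (58473, 36683), 2016: (19938, 4778)}
net_retention = {year: inflow - outflow
                 for year, (inflow, outflow) in flows.items()}
# net_retention: {2006: 2325, 2011: 21790, 2016: 15160}
```

&lt;p&gt;Net retention peaked in 2011, consistent with the elevated waterbody area we observed for that year.&lt;/p&gt;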

&lt;p&gt;&lt;strong&gt;Normalized Difference Water Index (NDWI):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The research employed NDWI using Landsat satellite images to assess water surface changes. NDWI is calculated based on the normalized relationship between the green and near-infrared portions of the spectrum. A decrease in NDWI values could indicate a reduction in water surface area.&lt;/li&gt;
&lt;/ul&gt;
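&lt;p&gt;NDWI itself is a one-line computation once the green and near-infrared bands are available. The sketch below is a generic NumPy illustration, not the paper’s exact processing chain; positive values typically indicate water.&lt;/p&gt;

```python
import numpy as np

def ndwi(green, nir):
    """Normalized Difference Water Index: (G - NIR) / (G + NIR)."""
    green = green.astype(np.float64)
    nir = nir.astype(np.float64)
    # Small epsilon guards against division by zero on dark pixels
    return (green - nir) / (green + nir + 1e-12)

# Tiny synthetic example: water reflects more green than near-infrared
green = np.array([[0.30, 0.10]])
nir = np.array([[0.05, 0.40]])
water_mask = ndwi(green, nir) > 0  # True where water is likely
```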

&lt;p&gt;&lt;strong&gt;Accuracy Assessment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The study assessed the accuracy of its methods, reporting an accuracy of 97% in 2006, 79.15% in 2011, and 92.2% in 2016. Discrepancies in accuracy may be attributed to factors like spatial resolution, ground data quality, or image processing techniques.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technical Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The research acknowledges limitations such as the failure of the Scan Line Corrector (SLC) in Landsat 7, leading to data gaps. Techniques like gap-filling were employed to address this issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Water Management Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Human interventions such as the construction of drainage channels, water supply for rural and urban areas, and irrigation for agriculture could contribute to changes in water surface area.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Land Use Dynamics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The combined use of Remote Sensing (RS) and Geographic Information System (GIS) helps detect land use dynamics. Changes in the surrounding land cover might impact water flow and storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The success of our vision pipeline rests on careful data preparation and model selection. We curated a robust set of satellite frames, paying close attention to image quality and relevance. The UNET water body segmentation model from Hugging Face Spaces aligns well with the project’s objectives, leveraging its proficiency at segmenting water bodies in satellite imagery, and the ONNX format made it straightforward to deploy in our workflow.&lt;/p&gt;

&lt;p&gt;The implementation follows a systematic path: load the model, preprocess each image, run inference, and generate masks highlighting the water body. Visualizing the results at every step aids debugging and improves interpretability. The resulting time series of water body areas offers valuable insight into the dynamics of Yezin Dam from 1984 to 2022, in line with our overarching goal of understanding and monitoring how water bodies evolve.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Unraveling Image Classification: Understanding Classification Model Behavior in Different Approaches</title>
      <dc:creator>Nikhil Reddy</dc:creator>
      <pubDate>Sat, 01 Feb 2025 21:50:04 +0000</pubDate>
      <link>https://dev.to/nikhil_reddy_/unraveling-image-classification-understanding-classification-model-behavior-in-different-approaches-lj0</link>
      <guid>https://dev.to/nikhil_reddy_/unraveling-image-classification-understanding-classification-model-behavior-in-different-approaches-lj0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In our pursuit of understanding image classification models, our goal was to determine the tipping point at which a classification model begins to struggle when faced with increasingly distorted images. We embarked on this journey by exploring two different approaches to image classification and observing how they react as images become progressively more challenging to classify due to noise or other disturbances.&lt;/p&gt;

&lt;p&gt;Through our exploration, we sought to uncover insights into the behavior of classification models under adverse conditions. By pinpointing the moment at which a model’s accuracy starts to decline, we aimed to gain a deeper understanding of the factors influencing its performance. This understanding could provide valuable insights for improving the robustness and reliability of image classification models in real-world scenarios, where images may often be corrupted or distorted.&lt;/p&gt;

&lt;p&gt;Our journey aimed to shed light on the nuanced interplay between image quality and classification accuracy, ultimately contributing to advancements in the development of more resilient and effective image classification systems.&lt;/p&gt;

&lt;p&gt;We started by diving into Teachable Machine and ResNet-50, two popular approaches to image classification, to see how each would react when faced with images that were progressively harder to classify.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 1: Teachable Machine
&lt;/h3&gt;

&lt;p&gt;Teachable Machine’s image classification model uses a CNN architecture to learn and recognize patterns within images, enabling it to classify them accurately based on the features learned during training.&lt;br&gt;
Teachable Machine offers an intuitive platform for training image classification models, making it accessible to users with varying levels of expertise. For our experiment, we curated a dataset by recording 500 unique image samples of bananas and apples. Here are some samples of the recorded input data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Banana Group&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhviqnme0438xtxnnncd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhviqnme0438xtxnnncd9.png" alt="Image description" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apple Group&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr631g0zdhi8dr67saq7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr631g0zdhi8dr67saq7z.png" alt="Image description" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Armed with this dataset, we set up a binary classification task to distinguish between the two fruits. The model performed commendably at classifying bananas across various levels of image corruption, including blurring, alterations mimicking apple markings, and changes in color and background. Here are some test data samples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Test Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4pbte0otibug0066qzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4pbte0otibug0066qzj.png" alt="Image description" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model performed very well on these test samples, since the disturbances were mild; in fact, it classified the bananas with 100% accuracy for all of the samples above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results of Sample Test Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F302fxnu256s4sp77qjw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F302fxnu256s4sp77qjw5.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, as our exploration progressed, we noticed a decline in accuracy when the model encountered unconventional inputs, such as images depicting partially eaten bananas. This decline hinted at the model’s limitations in generalizing beyond the confines of its training data.&lt;/p&gt;

&lt;p&gt;The model’s confidence in the banana class fell from 100% to 53%, then 29%, and finally 0%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unconventional Sample Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c5hke1y23md00zd4wz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c5hke1y23md00zd4wz5.png" alt="Image description" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Analyzing Teachable Machine’s Behavior
&lt;/h3&gt;

&lt;p&gt;We discerned several factors contributing to this decline in accuracy. Primarily, the lack of diversity in the model’s training data may have hindered its ability to represent the range of inputs encountered in real-world scenarios, so it struggled to generalize to novel or unconventional images. Additionally, Teachable Machine’s feature extraction may not be robust to image characteristics that deviate significantly from the training distribution, further limiting its performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 2: ResNet-50 from Hugging Face
&lt;/h3&gt;

&lt;p&gt;ResNet-50, a deep convolutional neural network that has been pre-trained on the extensive ImageNet dataset, stands out for its exceptional performance in handling image classification tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foossjwj9zx09q7ptfbwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foossjwj9zx09q7ptfbwq.png" alt="Image description" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For our investigation, we used the Hugging Face Inference API for the microsoft/resnet-50 model and subjected it to a series of progressively corrupted banana images. The corruptions included various levels of blurring and the overlaying of apple imagery at different opacities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overlaying Apple image on Banana&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmcg89r9mn6bak2y32c0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmcg89r9mn6bak2y32c0.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We subjected the classification model to a unique challenge: overlaying apple images onto banana images at varying opacities (25%, 40%, 50%, 60%, 75%). The results were intriguing. At 25% opacity, ResNet-50 confidently classified the image as a banana. As the opacity of the overlaid apple increased, the classification shifted toward apple, with probabilities reaching as high as 90% at the 50%, 60%, and 75% opacities. Notably, at 75% opacity, an unexpected 10% probability for pomegranate emerged, adding a surprising twist to the analysis.&lt;/p&gt;
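&lt;p&gt;The overlay experiment amounts to alpha blending the two images: the opacity is the weight given to the apple, and the remainder goes to the banana. A minimal NumPy sketch follows (OpenCV’s cv2.addWeighted does the same job); the 1x1 "images" are purely illustrative.&lt;/p&gt;

```python
import numpy as np

def blend(overlay, base, alpha):
    """Alpha-blend two same-sized uint8 images: alpha weights the overlay."""
    out = alpha * overlay.astype(np.float64) + (1.0 - alpha) * base.astype(np.float64)
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy 1x1 'images': a pure red apple pixel over a pure yellow banana pixel
apple = np.array([[[255, 0, 0]]], dtype=np.uint8)
banana = np.array([[[255, 255, 0]]], dtype=np.uint8)
half = blend(apple, banana, 0.5)  # [[[255, 127, 0]]]
```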

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qhj0c2lc5vrbpri3gtj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qhj0c2lc5vrbpri3gtj.png" alt="Image description" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm050xglwemmffcrzbtss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm050xglwemmffcrzbtss.png" alt="Image description" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the overlay experiment, we explored the impact of blurring on ResNet-50’s classification accuracy. For images with 50% blur, ResNet-50 displayed uncertainty, assigning only a small probability (0.228) to the banana class, with minor probabilities spread across various other classes. The most striking result came at 100% blur, where ResNet-50 failed to identify a banana at all, instead producing perplexing classifications such as matchstick, sunscreen, and stethoscope.&lt;/p&gt;
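&lt;p&gt;Progressive blurring of this kind can be mimicked with a simple mean (box) filter; repeated passes approximate a Gaussian. The original experiment’s exact blur settings are not specified, so the grayscale NumPy sketch below is illustrative only.&lt;/p&gt;

```python
import numpy as np

def box_blur(img, k=3):
    """Mean filter with a k x k window (k odd); edges padded by reflection."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="reflect")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

# A single bright pixel spreads evenly into its 3x3 neighbourhood
img = np.zeros((5, 5))
img[2, 2] = 9.0
blurred = box_blur(img)  # the centre 3x3 block becomes 1.0
```

&lt;p&gt;Applying the filter repeatedly, or with a larger window, produces progressively stronger blur, analogous to the 50% and 100% conditions above.&lt;/p&gt;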

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sbqwyryz4w780756gl0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sbqwyryz4w780756gl0.png" alt="Image description" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Analyzing ResNet-50 model’s Behavior
&lt;/h3&gt;

&lt;p&gt;These findings illuminate the intricacies of ResNet-50’s behavior when confronted with challenging image distortions. While the model demonstrates impressive resilience to certain types of corruption, such as overlaying images, its performance may falter in scenarios involving extreme blurring, underscoring the importance of understanding the limitations of image classification models in real-world applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In our exploration of image classification using Teachable Machine and ResNet-50, we uncovered fascinating insights into the behavior of these models when faced with progressively challenging scenarios. Teachable Machine, with its intuitive platform, showed commendable performance in classifying bananas across various levels of image corruption. However, it struggled with unconventional inputs, indicating limitations in its ability to generalize beyond the scope of its training data.&lt;/p&gt;

&lt;p&gt;Conversely, ResNet-50, a deep convolutional neural network pre-trained on the ImageNet dataset, demonstrated remarkable resilience to image corruption. It accurately classified images even when overlaid with apple markings at varying opacities, showcasing its robust feature extraction capabilities. However, its performance faltered in extreme blurring scenarios, highlighting the importance of understanding model limitations. Overall, our exploration sheds light on the complexities of image classification and underscores the need for further research into more robust and reliable models for real-world applications.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploring the Capabilities and Limitations of Image Analysis: A Journey with Gemini</title>
      <dc:creator>Nikhil Reddy</dc:creator>
      <pubDate>Sat, 01 Feb 2025 21:34:37 +0000</pubDate>
      <link>https://dev.to/nikhil_reddy_/exploring-the-capabilities-and-limitations-of-image-analysis-a-journey-with-gemini-170a</link>
      <guid>https://dev.to/nikhil_reddy_/exploring-the-capabilities-and-limitations-of-image-analysis-a-journey-with-gemini-170a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the era of advanced technology, the capabilities of online image analysis tools have become a topic of interest. This blog post documents my exploration using Gemini, a versatile online tool, to understand how well it can detect and analyze images, focusing on two categories: Object Identification with Image Variation and Image OCR Analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Object Identification with Image Variation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lion in a Plain Field:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w1ro7f5y1d9ybpivoh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w1ro7f5y1d9ybpivoh2.png" alt="Image description" width="199" height="230"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: The image you sent me shows a male African lion (Panthera Leo) walking across a dry grass field. The lion in the image is likely patrolling his territory. The dry grass field is typical of the African savanna. The lion in the image is an adult male. Adult male lions are easily recognizable by their large mane, which is a thick growth of hair around their neck and head. The lion is walking with a slow, deliberate stride. This is typical of lions when they are patrolling their territory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Man with Bear Statue:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2phd2xlpgxtchtpj1pw8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2phd2xlpgxtchtpj1pw8.png" alt="Image description" width="113" height="150"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: This picture shows a group of people gathered around a large bronze horse and rider statue. Depicts a horse rearing up on its hind legs with a rider seated on its back. The rider is dressed in military attire and appears to be holding a sword in the air. The statue is surrounded by a crowd of people, including men, women, and children, who are all looking up at it with awe and curiosity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dogs Sleeping:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67ucdwzez4zze1do0pbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67ucdwzez4zze1do0pbt.png" alt="Image description" width="300" height="225"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: Unable to provide any information, indicating a lack of understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lemur Hanging Inverted:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fite4s72kftlrdym33znc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fite4s72kftlrdym33znc.png" alt="Image description" width="128" height="255"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: Unable to process the unconventional behavior captured in the image.&lt;/p&gt;

&lt;p&gt;When presented with standard wildlife photography, Gemini exhibits remarkable accuracy in object identification and segmentation. The tool effortlessly makes sense of images that showcase straightforward and familiar situations, unveiling specifics about the species, their behavior, and their environment.&lt;/p&gt;

&lt;p&gt;However, when presented with unconventional imagery, such as the sleeping dogs and the inverted lemur, Gemini struggled to produce any interpretation, indicating potential limitations in its ability to comprehend and analyze non-traditional subjects or behaviors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image OCR Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Clear Text data — “W93BX”:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdrxdsj5nghftuvlju74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdrxdsj5nghftuvlju74.png" alt="Image description" width="315" height="73"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: The text in the image you sent is “W93BX.” I can read this because it is a relatively simple black-and-white image with no significant distortions or background noise. However, more complex images, such as those with color variations, background images, or overlapping characters, can be more difficult for me to read accurately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text with Lines — “R84CH”:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8iglu7tfape473jz5u4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8iglu7tfape473jz5u4.png" alt="Image description" width="297" height="65"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: The black-and-white photo you sent shows a musical note in the center of the image. Some lines and other symbols surround the note. The text is faint and difficult to read, but I can make out some of the letters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text with Background — “TSMS9”:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah2ip05fmntd6d85pykv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah2ip05fmntd6d85pykv.png" alt="Image description" width="296" height="63"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: The text in the image captcha you sent is “TSMS”. I can read this because the image is a relatively simple green and black image with few distractions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex — “y7hbw” with Colored Lines:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnab2nyioscn3jq7g8h31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnab2nyioscn3jq7g8h31.png" alt="Image description" width="300" height="77"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: I’m afraid I’m still unable to extract the text from the image you sent, even though it’s slightly clearer than the previous ones. The combination of blurriness, distortion, and overlapping characters makes it too difficult for me to accurately identify the letters and numbers.&lt;/p&gt;

&lt;p&gt;Gemini’s prowess in image decoding shines when confronted with simple designs. It adeptly deciphers clear text in uncluttered backgrounds. Nevertheless, as the complexity of images increases with elements like faint text, distortions, and overlapping characters, Gemini’s accuracy diminishes. The tool struggles when confronted with multifaceted challenges, emphasizing the significance of refining its capabilities for handling intricate visual data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Gemini, an online image analysis tool, demonstrates adeptness in standard object identification and segmentation, particularly excelling in decoding clear and common wildlife photography scenarios. It efficiently provides detailed insights into species, behavior, and surroundings. However, its limitations become apparent when faced with unconventional imagery, suggesting potential challenges in comprehending non-traditional subjects or behaviors.&lt;/p&gt;

&lt;p&gt;In the realm of image decoding, Gemini showcases proficiency with simple designs, accurately deciphering clear text against uncluttered backgrounds. Yet, its accuracy diminishes when confronted with complex images featuring faint text, distortions, and overlapping characters. This highlights the tool&#8217;s struggle in handling intricate visual data, emphasizing the need for continuous improvement to enhance its capabilities in decoding diverse and challenging images.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
