DEV Community: Amir Mullagaliev

SPO600: Project Stage III - Enhancing the Clone-Pruning Analysis Pass

Amir Mullagaliev — Sat, 19 Apr 2025 17:49:10 +0000

Introduction
Stage III Requirements
Implementation Approach
Implementation Details
- Data Structure Changes
- Function Tracking Logic
- Analysis Algorithm
- Complete Implementation
Testing and Results
- Test Case Design
- x86_64 Results
- aarch64 Results
- Comparison Between Architectures
Capabilities and Limitations
How to Reproduce My Results
Final Reflections
Conclusion

Introduction

Welcome to the third and final stage of my GCC Clone-Pruning Analysis Pass project! In this post, I'll describe how I extended my Stage II implementation to handle multiple cloned functions in a single program and tested it across both x86_64 and aarch64 architectures.

If you haven't read my previous post about Stage II, I'd recommend checking it out first, since this one builds directly on that work. In Stage II, I built a basic GCC pass that could analyze a single cloned function and determine whether its variants were similar enough to be pruned.

Stage III Requirements

For Stage III, we needed to:

Extend our code to handle multiple cloned functions in a single program
Create test cases with at least two cloned functions per program
Verify functionality on both x86_64 and aarch64 architectures
Test scenarios with mixed PRUNE and NOPRUNE recommendations
Clean up any remaining issues from Stage II

The biggest challenge was removing the assumption from Stage II that "there is only one cloned function in a program" and ensuring our implementation works correctly across architectures.

Implementation Approach

After reviewing my Stage II code, I identified several limitations that needed to be addressed:

The static variables used to track function information limited us to a single function with two variants
The simple comparison based on block and statement counts could be improved
The state management needed enhancement to handle multiple functions

My approach for Stage III focused on:

Replacing the static variables with a more sophisticated data structure to track multiple functions
Enhancing the comparison algorithm
Making sure the implementation works consistently across architectures
Creating comprehensive test cases for both architectures

Implementation Details

Data Structure Changes

The core of my improvement was replacing the simple static variables from Stage II:

static std::string previous_function_name = "";
static size_t previous_block_total = 0;
static size_t previous_statement_total = 0;

With a map to store information about all encountered functions:

struct FunctionVariant {
    std::string full_name;    // Complete function name with variant
    size_t block_count;       // Number of basic blocks
    size_t statement_count;   // Number of GIMPLE statements
};

static std::map<std::string, std::vector<FunctionVariant> > function_variants;

This data structure allows us to track multiple variants of multiple functions simultaneously, and keep all their relevant information organized by base function name.
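
To make the shape of this map concrete, here is a minimal sketch that can be compiled outside of GCC. The struct mirrors the one above; the `record_variant` and `make_sample_map` helpers, the variant names, and the counts are all made up for illustration and are not part of the pass.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct FunctionVariant {
    std::string full_name;        // Complete function name with variant
    std::size_t block_count;      // Number of basic blocks
    std::size_t statement_count;  // Number of GIMPLE statements
};

using VariantMap = std::map<std::string, std::vector<FunctionVariant>>;

// Hypothetical helper: file a variant under its base function name.
void record_variant(VariantMap &variants, const std::string &base_name,
                    const FunctionVariant &variant) {
    variants[base_name].push_back(variant);
}

// Build a small map with two cloned functions (illustrative numbers only).
VariantMap make_sample_map() {
    VariantMap variants;
    record_variant(variants, "vector_multiply", {"vector_multiply.default", 5, 20});
    record_variant(variants, "vector_multiply", {"vector_multiply.avx2", 9, 41});
    record_variant(variants, "count_bits", {"count_bits.default", 3, 8});
    record_variant(variants, "count_bits", {"count_bits.popcnt", 3, 8});
    return variants;
}
```

Each key ends up owning its own list of variants, so adding a second cloned function never disturbs the data already stored for the first one.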

Function Tracking Logic

First of all, I needed to improve my base function name extraction to handle different variant suffixes:

std::string get_base_function_name(function *fun) {
    struct cgraph_node *node = cgraph_node::get(fun->decl);
    std::string fname = (node != nullptr) ? std::string(node->name())
                                          : std::string(function_name(fun));

    // Handle resolver functions
    size_t pos = fname.find(".resolver");
    if (pos != std::string::npos) {
        return fname.substr(0, pos);
    }

    // Handle regular variants
    pos = fname.find('.');
    if (pos != std::string::npos) {
        return fname.substr(0, pos);
    }

    return fname;
}
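
The same extraction logic can be tried on plain strings without building GCC. This is a sketch under assumptions: `base_name_of` is a hypothetical string-only stand-in for `get_base_function_name`, and the sample names in the usage note are guesses at what clone names can look like, not copied from a real dump.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// String-only version of the logic above: cut at ".resolver" if present,
// otherwise cut at the first '.', otherwise return the name unchanged.
std::string base_name_of(const std::string &full_name) {
    std::size_t pos = full_name.find(".resolver");
    if (pos != std::string::npos) {
        return full_name.substr(0, pos);
    }

    pos = full_name.find('.');
    if (pos != std::string::npos) {
        return full_name.substr(0, pos);
    }

    return full_name;
}
```

For example, `base_name_of("count_bits.popcnt")` and `base_name_of("count_bits.resolver")` both give `count_bits`, which is what lets all variants of one function share a map key.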

I also added a function to find the default variant among all variants of a function:

FunctionVariant find_default_variant(const std::vector<FunctionVariant> &variants) {
    for (size_t i = 0; i < variants.size(); i++) {
        if (variants[i].full_name.find(".default") != std::string::npos) {
            return variants[i];
        }
    }

    for (size_t i = 0; i < variants.size(); i++) {
        if (variants[i].full_name.find('.') == std::string::npos) {
            return variants[i];
        }
    }

    return variants[0];
}

Analysis Algorithm

Here's the core logic for processing a function in the execute method:

unsigned int execute(function *fun) {
    FILE *out = (dump_file != nullptr) ? dump_file : stderr;

    struct cgraph_node *node = cgraph_node::get(fun->decl);
    std::string full_fname = (node != nullptr)
                           ? std::string(node->name())
                           : std::string(function_name(fun));

    print_frame_header(out, full_fname);

    if (is_resolver_function(full_fname)) {
        print_frame_footer(out, "ANALYSIS FINISHED (resolver function)");
        return 0;
    }

    size_t bb_count = 0;
    size_t gimple_count = 0;
    basic_block bb;
    FOR_EACH_BB_FN(bb, fun) {
        bb_count++;
        for (gimple_stmt_iterator gsi = gsi_start_bb(bb);
             !gsi_end_p(gsi);
             gsi_next(&gsi))
        {
            gimple_count++;
        }
    }

    std::string base_name = get_base_function_name(fun);
    FunctionVariant current_variant = {full_fname, bb_count, gimple_count};
    function_variants[base_name].push_back(current_variant);

    if (function_variants[base_name].size() > 1) {
        analyze_function_variants(out, base_name, function_variants[base_name]);
    } else {
        print_frame_footer(out, "First variant of this function - storing for comparison");
    }

    return 0;
}

And here's the function that analyzes variants to determine if they should be pruned:

void analyze_function_variants(FILE *out, const std::string &base_name,
                              const std::vector<FunctionVariant> &variants) {
    if (variants.size() <= 1) {
        return;
    }

    FunctionVariant default_variant = find_default_variant(variants);

    for (size_t i = 0; i < variants.size(); i++) {
        const FunctionVariant &variant = variants[i];

        if (variant.full_name == default_variant.full_name) {
            continue;
        }

        bool should_prune = (variant.block_count == default_variant.block_count &&
                            variant.statement_count == default_variant.statement_count);

        if (should_prune) {
            fprintf(out, "PRUNE: %s\n", base_name.c_str());
        } else {
            fprintf(out, "NOPRUNE: %s\n", base_name.c_str());
        }
        fprintf(out, "CLONE FOUND: %s\n", base_name.c_str());
        fprintf(out, "CURRENT: %s\n", variant.full_name.c_str());

        std::string border(60, '*');
        fprintf(out, "%s\n", border.c_str());
        fprintf(out, "*  End of Diagnostic for Clone Pair\n");
        fprintf(out, "%s\n", border.c_str());
    }
}

The key improvements are:

We track all variants of all functions
We process functions as they're encountered
When we have multiple variants of a function, we analyze them immediately
We identify the default variant to use as a baseline for comparison

Complete Implementation

For the complete implementation, check my GitHub repository:
tree-amullagaliev.cc

Testing and Results

Test Case Design

To thoroughly test the implementation, I created a complex test file with six different functions, each with different optimization characteristics:

simple_calculation: Basic scalar operations
vector_multiply: Vector operations
matrix_transpose: Matrix operations
count_bits: Bit counting
count_char: String processing
classify_number: A branch-heavy function

For x86_64, I used these target attributes:

__attribute__((target_clones("default", "arch=x86-64-v3")))
int simple_calculation(int a, int b, int c) {
    // function body
}

__attribute__((target_clones("default", "avx2")))
void vector_multiply(float *result, const float *a, const float *b, int size) {
    // function body
}

__attribute__((target_clones("default", "avx2")))
void matrix_transpose(float *dst, const float *src, int rows, int cols) {
    // function body
}

__attribute__((target_clones("default", "popcnt")))
int count_bits(uint64_t value) {
    // function body
}

__attribute__((target_clones("default", "sse4.1")))
int count_char(const char *str, char c) {
    // function body
}

__attribute__((target_clones("default", "arch=x86-64-v3")))
int classify_number(int x) {
    // function body
}

For aarch64, I used different attributes appropriate for that architecture:

__attribute__((target_clones("default", "simd")))
int simple_calculation(int a, int b, int c) {
    // same function body
}

__attribute__((target_clones("default", "sve")))
void vector_multiply(float *result, const float *a, const float *b, int size) {
    // same function body
}

// etc.

The full test files can be found here:

x86_64 Results

On x86_64, the results were:

====== Summary of PRUNE/NOPRUNE decisions ======
NOPRUNE: vector_multiply
PRUNE: classify_number
PRUNE: count_bits
PRUNE: count_char
PRUNE: matrix_transpose
PRUNE: simple_calculation

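Most functions were identified for pruning; only vector_multiply had different block or statement counts in its AVX2 clone. I was surprised to see that matrix_transpose was marked for pruning despite also using AVX2! Since the pass only compares basic block and statement counts, this suggests the AVX2 clone of matrix_transpose had the same shape as the default version at the point where the pass ran, so not every loop changes shape just because AVX2 is available.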

aarch64 Results

On aarch64, the results were different:

====== Summary of PRUNE/NOPRUNE decisions ======
NOPRUNE: matrix_transpose
NOPRUNE: vector_multiply
PRUNE: classify_number
PRUNE: count_bits
PRUNE: count_char
PRUNE: simple_calculation

Here, both vector_multiply AND matrix_transpose were marked as NOPRUNE. This was a fascinating finding!

Comparison Between Architectures

The most interesting observation was how the same function (matrix_transpose) was treated differently on each architecture:

On x86_64 with AVX2: PRUNE recommendation
On aarch64 with SVE: NOPRUNE recommendation

This suggests that, for this function, targeting SVE on aarch64 changed the block and statement counts while targeting AVX2 on x86_64 did not. I found this really intriguing since both are vector instruction sets, yet the compiler restructured the code differently for each. It highlights the importance of architecture-specific optimization.

Capabilities and Limitations

Capabilities

Multiple Function Support: Successfully handles any number of cloned functions in a program
Cross-Architecture Compatibility: Works on both x86_64 and aarch64
Mixed Decision Support: Can recommend PRUNE for some functions and NOPRUNE for others
Detailed Output: Provides clear diagnostic information for each clone pair

Limitations

Simple Comparison Metric: Still relies on basic block and statement counts, which may not capture all structural differences
Architecture-Specific Test Cases: Requires different test files for each architecture due to different valid target attributes
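
To illustrate the first limitation, here is a small sketch of how two bodies can match on counts and still differ. It is one possible direction, not something the pass does: the `Body` type is a hypothetical stand-in where each statement is reduced to an opcode name, whereas a real pass would read this information from GIMPLE.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a function body: one opcode name per statement,
// grouped by basic block.
using Block = std::vector<std::string>;
using Body = std::vector<Block>;

// Count-based check, in the spirit of the current metric.
bool same_counts(const Body &a, const Body &b) {
    if (a.size() != b.size()) return false;
    std::size_t statements_a = 0;
    std::size_t statements_b = 0;
    for (const Block &blk : a) statements_a += blk.size();
    for (const Block &blk : b) statements_b += blk.size();
    return statements_a == statements_b;
}

// Stricter check: every block must hold the same opcodes in the same order.
bool same_shape(const Body &a, const Body &b) {
    return a == b;
}
```

Two bodies such as `{{"mult", "add"}, {"ret"}}` and `{{"mult", "shift"}, {"ret"}}` pass `same_counts` but fail `same_shape`, which is exactly the kind of difference a count-only comparison cannot see.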

How to Reproduce My Results

To replicate my work:

First, create or modify the pass file:

cd ~/git/gcc/gcc
vi tree-amullagaliev.cc  # Copy the implementation code from GitHub

Rebuild GCC:

cd ~/gcc-build-001/
make -j$(nproc)

Create test files:

mkdir ~/stage3
cd ~/stage3
# Copy complex-clones-test.c and Makefile from GitHub

Test the implementation:

make
make run-test
make show-results

Here's the Makefile:

CC = ~/gcc-build-001/gcc/xgcc
CFLAGS = -B ~/gcc-build-001/gcc/ -g -O3 -ftree-vectorize -fdump-tree-amullagaliev

# Recipe lines below must start with a tab character, not spaces.
all: complex-test

complex-test: complex-clones-test.c
	$(CC) $(CFLAGS) complex-clones-test.c -o complex-test

run-test: complex-test
	./complex-test
	@echo "Test completed. Check the dump files for analysis results."

# "PRUNE:" also matches the "NOPRUNE:" lines, so one grep covers both decisions.
show-results:
	@echo "====== Analysis Results ======"
	@grep -A 3 -B 1 "PRUNE\\|NOPRUNE" complex-test-complex-clones-test.c.*.amullagaliev || echo "No results found"
	@echo ""
	@echo "====== Summary of PRUNE/NOPRUNE decisions ======"
	@grep -h "PRUNE:" complex-test-complex-clones-test.c.*.amullagaliev | sort -u

clean:
	rm -f complex-test *.o *.amullagaliev*

.PHONY: all run-test show-results clean

The full source code and test files are available in my GitHub repository for easy access and replication:
SPO600 Project Stage III

Final Reflections

Wow, working on this project has been an incredible learning experience! I've gained a much deeper understanding of:

GCC internals and how passes are implemented
Function Multi-Versioning and how it works across architectures
Cross-architecture development challenges
GIMPLE representation and analysis techniques

The most challenging part of Stage III was understanding how to properly track and analyze multiple function variants. I initially tried to use a finalize method to process all variants at the end, but this approach had issues. The solution of processing variants as they're encountered worked much better.

I was particularly surprised by the difference in optimization behavior between x86_64 and aarch64, especially for matrix operations. This highlighted how architecture-specific optimizations can lead to very different code structures, even for the same source code.

The most frustrating part was dealing with architecture-specific target attributes. I spent a lot of time figuring out which attributes were valid on aarch64 vs. x86_64. It was a bit of trial and error until I found a set of attributes that worked on each platform.

Conclusion

I've successfully extended my GCC clone-pruning analysis pass to handle multiple functions and tested it across architectures. The implementation now meets all the requirements for Stage III.

This whole project, from Stage I through III, has been a fascinating journey into compiler development. I've gained skills that I never expected to acquire and deepened my understanding of how compilers work at a fundamental level.

I want to express my sincere thanks to Professor Chris Tyler for his incredible guidance throughout this project. His clear explanations of GCC internals and compiler theory made the complex world of compilers much more approachable. Without his lectures and support, navigating the intricate structures of GCC would have been much more challenging. The skills I've gained in this course will be valuable throughout my career in software development.

This project marks the end of my SPO600 journey, but the knowledge and experience I've gained will stay with me for years to come.

SPO600 Lab 5: Adventures in Assembly Language

Amir Mullagaliev — Fri, 18 Apr 2025 23:52:40 +0000

Introduction
Lab Requirements
Implementing the Loop in AArch64
Implementing the Loop in x86_64
Comparing Assembly Languages
Debugging Headaches
Code Breakdown
Lessons Learned
Full Source Code
Conclusion

Introduction

I've completed Lab 5 for the SPO600 course, and let me tell you - working with assembly language is like trying to communicate with aliens using only hand gestures.

This lab focused on experimenting with assembler on both x86_64 and AArch64 platforms. I had to write programs that looped through numbers, converted them to characters, and printed them to the screen. Sounds simple, right? WRONG. Nothing is simple in assembly!

Lab Requirements

The lab required me to implement the following in both AArch64 and x86_64 assembly:

A basic loop that prints Loop 6 times
Modify it to print Loop: # where # is the loop index (0-5)
Extend it to print 2-digit numbers (00-32)
Suppress leading zeros
Change to hexadecimal output (0-20)

Implementing the Loop in AArch64

My very first roadblock with AArch64 was figuring out how to actually modify a buffer in memory. With higher-level languages, you'd just do something like message[6] = digit + '0'; but in assembly... nope! You need to load addresses, use registers, and do all kinds of register juggling.

For example, to print Loop: # with the index, I had to:

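mov     x20, x19               // copy the loop index
add     x20, x20, 48           // convert it to ASCII by adding '0' (48)

ldr     x1, =message           // address of the message buffer

strb    w20, [x1, digit_pos]   // store the digit character into the buffer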

The hardest part was definitely the 2-digit conversion. I spent way too long figuring out how to divide numbers in AArch64. Turns out you need udiv for division and msub to calculate the remainder:

mov     x20, x19            // dividend: the loop index
mov     x21, 10             // divisor
udiv    x22, x20, x21       // x22 = x20 / x21 (quotient, the tens digit)

msub    x23, x22, x21, x20  // x23 = x20 - (x22 * x21) = remainder
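
For comparison, the same quotient-then-remainder arithmetic can be written in C++. This is only a model of the two instructions; the helper names `div_rem` and `two_digits` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Mirrors the udiv + msub pair: compute the quotient first, then
// remainder = dividend - quotient * divisor.
std::pair<std::uint64_t, std::uint64_t> div_rem(std::uint64_t dividend,
                                                std::uint64_t divisor) {
    std::uint64_t quotient = dividend / divisor;              // udiv
    std::uint64_t remainder = dividend - quotient * divisor;  // msub
    return {quotient, remainder};
}

// Turn a loop index in the range 0..99 into its two ASCII digits.
std::pair<char, char> two_digits(std::uint64_t index) {
    std::pair<std::uint64_t, std::uint64_t> parts = div_rem(index, 10);
    return {static_cast<char>('0' + parts.first),
            static_cast<char>('0' + parts.second)};
}
```

For an index of 27, `div_rem(27, 10)` gives a quotient of 2 and a remainder of 7, which become the characters '2' and '7'.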

Implementing the Loop in x86_64

Working with x86_64 after AArch64 was like switching from "Japanese" to "German" - still foreign, but somehow differently confusing!

The x86_64 division was a total pain. You have to clear specific registers, put values in specific places, and the division gives both quotient AND remainder:

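mov     %r15,%rax     # dividend (loop index) goes in rax
mov     $0,%rdx       # clear rdx, because div divides rdx:rax
mov     $10,%rcx      # divisor
div     %rcx          # quotient ends up in rax, remainder in rdx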

And don't even get me started on the syntax differences! In AArch64, the destination register comes first:

mov x0, 1

But in the AT&T syntax that GNU as uses for x86_64, it's the other way around, with the source first and the destination last:

mov $1,%rax

I kept mixing them up, and my programs wouldn't assemble.

Comparing Assembly Languages

Now that I've worked with three assembly languages (6502, x86_64, and AArch64), here's my totally subjective ranking:

AArch64: Cleanest syntax and most consistent. The register naming makes sense (x0, x1, etc.), and the instruction names are mostly intuitive. The best part is having separate instructions for quotient and remainder.
6502: Simple and limited, which is actually nice for beginners.
x86_64: Most powerful but also most confusing. The register naming is historical (%rax, %rbx, %r15) with no obvious pattern. Sub-register names are cryptic (%al vs %ax vs %eax vs %rax). Division is a nightmare requiring specific register setup.

Debugging Headaches

Here's what my debugging process looked like:

Write code
Compile
Get cryptic error message
Stare at code for 10 minutes
Realize I used // for comments instead of # in the x86_64 GNU assembler source (AArch64 is the other way around: // starts a comment there)
Fix and repeat

The worst part was when the program assembled but didn't work right. With no debugger (or at least none that I knew how to use properly), I was basically adding write statements to see what was happening inside - like printf debugging.

Code Breakdown

Let's look at a small piece of the hexadecimal conversion in AArch64:

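cmp     x22, 10          // is the digit value 10 or more?
b.ge    high_alpha       // yes: it needs a letter

add     x22, x22, 48     // 0-9: add ASCII '0'
b       high_done

high_alpha:
add     x22, x22, 55     // 10-15: add 55 to reach 'A'-'F'

high_done: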

This code checks if a hex digit is 0-9 or A-F and converts it. For 0-9, we add 48 (ASCII for '0'). For 10-15, we add 55 to get 'A'-'F'.
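
The same two-way branch reads like this in C++. It is a sketch of the arithmetic only, and `hex_digit` is a name invented for this example.

```cpp
#include <cassert>

// Same branch structure as the assembly: values 0-9 get 48 ('0') added,
// values 10-15 get 55 added so that 10 maps to 65 ('A').
char hex_digit(unsigned value) {
    if (value >= 10) {
        return static_cast<char>(value + 55);
    }
    return static_cast<char>(value + 48);
}
```

The constant 55 is simply 'A' (65) minus 10, which is why adding it to 10 lands exactly on 'A'.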

Lessons Learned

Assembly is PRECISE: A single wrong register or memory address and everything breaks.
Different architectures = different paradigms: x86_64 and AArch64 handle things like division completely differently.
Comments are ESSENTIAL: Without comments, I'd have no idea what my own code was doing 5 minutes after writing it.
Register allocation matters: In higher level languages, variables just exist. In assembly, you need to carefully plan which registers to use for what.

Full Source Code

Here are the links to the full source code:

I'll just paste the AArch64 loop5.s code here as an example (I'm probably proudest of this one since it handles hex conversion :D):

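.data
message:
    .ascii "Loop: ##\n"
message_len = . - message     // length of the message in bytes
hex1_pos = 6                  // buffer offset of the high hex digit
hex2_pos = 7                  // buffer offset of the low hex digit
space = 32                    // ASCII code for a space

.text
.globl _start
min = 0                       // first loop index
max = 33                      // loop stops before this value (last printed: 32 = 0x20)
_start:
    mov     x19, min          // x19 holds the loop index

loop:

    mov     x20, x19          // copy the index
    mov     x21, 16           // divisor for hex digits
    udiv    x22, x20, x21     // x22 = high nibble (quotient)

    msub    x23, x22, x21, x20  // x23 = x20 - (x22 * x21) = remainder (low nibble)

    cmp     x22, 10           // is the high nibble 10 or more?
    b.ge    high_alpha        // yes: it needs a letter

    add     x22, x22, 48      // 0-9: add ASCII '0'
    b       high_done

high_alpha:
    add     x22, x22, 55      // 10-15: add 55 to reach 'A'-'F'

high_done:
    // Convert low nibble to ASCII
    cmp     x23, 10           // is the low nibble 10 or more?
    b.ge    low_alpha         // yes: it needs a letter

    add     x23, x23, 48      // 0-9: add ASCII '0'
    b       low_done

low_alpha:
    add     x23, x23, 55      // 10-15: add 55 to reach 'A'-'F'

low_done:
    ldr     x1, =message      // address of the message buffer

    cmp     x22, 48           // is the high digit '0'?
    b.ne    print_both        // no: print both digits

    mov     x24, space        // yes: suppress the leading zero
    strb    w24, [x1, hex1_pos]  // write a space instead
    b       print_low

print_both:
    strb    w22, [x1, hex1_pos]  // write the high digit

print_low:
    strb    w23, [x1, hex2_pos]  // write the low digit

    mov     x0, 1            // 1 is stdout
    mov     x2, message_len  // message length
    mov     x8, 64           // 64 is write
    svc     0                // invoke the syscall

    add     x19, x19, 1      // next index
    cmp     x19, max         // reached the end?
    b.ne    loop             // no: go around again

    mov     x0, 0            // set exit status to 0
    mov     x8, 93           // exit is syscall #93
    svc     0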

Conclusion

In conclusion, would I write assembly code in my free time? Probably not. But I have a much better understanding of what's happening under the hood of my programs now.

SPO600 - Lab 3: Building a Number Guessing Game in 6502 Assembly

Amir Mullagaliev — Fri, 18 Apr 2025 02:48:01 +0000

Introduction
Game Overview
Code Breakdown
- Initialization and Random Number Generation
- Text Output: Game Prompts
- Keyboard Input Handling
- Graphics Feedback
- Screenshots
- Attempt Tracking and Display
- Restart Mechanism
Reflection
- Challenges
- Limitations
Full Code

Introduction

This blog post is about my journey through Lab 3 of the SPO600 course, where I developed a Number Guessing Game using 6502 Assembly. The goal was to create a program that meets specific criteria:

Outputting to both text and graphics screens
Accepting keyboard input
Using arithmetic operations

Let's dive into how I tackled this challenge!

Game Overview

The game is simple but engaging:

The program generates a random number between 1 and 99.
The player guesses the number via keyboard input.
After each guess, the game prints feedback (too high or too low) and changes the colour of the graphics screen (red for high, blue for low, green for a win).
The player's number of attempts is tracked and displayed as A.

Code Breakdown

Initialization and Random Number Generation

The game starts by initializing the text screen (SCINIT) and testing the graphics screen with a flash effect (TEST_GRAPHICS). A random number is generated using the pseudo-random byte at memory location $FE:

LDA $FE      ; Load random byte
AND #$7F     ; Ensure positivity (bitwise AND)
CMP #100     ; Check if >=100
BCC SAVE_TARGET ; If <100, use it
LSR          ; Divide by 2 if out of range
SAVE_TARGET:
STA TARGET   ; Store as target number

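This keeps the value below 100 using a bit mask and a comparison. Note that the snippet on its own can still produce 0, so getting a true 1–99 range needs an additional check for zero.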
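
To see which values these steps can actually produce, the snippet can be modelled in C++ and run over every possible byte. This is a sketch of the logic only; `pick_target` is a made-up name, and the real random byte comes from the emulator at $FE.

```cpp
#include <cassert>
#include <cstdint>

// Model of: LDA $FE / AND #$7F / CMP #100 / BCC SAVE_TARGET / LSR
std::uint8_t pick_target(std::uint8_t random_byte) {
    // AND #$7F keeps the low 7 bits, giving 0..127
    std::uint8_t value = static_cast<std::uint8_t>(random_byte & 0x7F);
    if (value >= 100) {
        // LSR halves the out-of-range values 100..127 into 50..63
        value = static_cast<std::uint8_t>(value >> 1);
    }
    return value;
}
```

Every input byte maps to a result below 100. An input of 0 maps to 0, and the halved values pile up in the 50–63 range.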

Text Output: Game Prompts

The text screen displays instructions and feedback using the CHROUT ROM routine. For example, printing "GUESS" and handling newlines:

LDA #$47     ; 'G'
JSR CHROUT
LDA #$55     ; 'U'
JSR CHROUT
; ... repeats for remaining letters

Feedback like "HI" (too high) or "LO" (too low) is printed after each guess.

Keyboard Input Handling

The INPUT_GUESS subroutine reads digits from the keyboard using CHRIN and converts ASCII characters to numeric values. Two-digit inputs are handled by shifting the first digit into the tens place and adding the second:

; Convert first digit to tens place
LDA GUESS
ASL          ; ×2
STA TEMP2
ASL          ; ×4
ASL          ; ×8
CLC
ADC TEMP2    ; ×10 (8+2)
STA GUESS

; Add second digit
PLA
SEC
SBC #$30     ; ASCII to numeric (leaves carry set)
CLC          ; Clear carry so ADC does not add an extra 1
ADC GUESS
STA GUESS

This uses shifts (ASL) and arithmetic (ADC) to efficiently calculate the final guess.
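
The multiply-by-ten trick is easier to check in a higher-level language. The sketch below models the shifts in C++; `times_ten` and `parse_two_digits` are names invented for this example, and the carry flag handling of the 6502 has no counterpart here.

```cpp
#include <cassert>

// 8d + 2d == 10d, the same trick as the ASL / ASL / ASL + ADC sequence.
unsigned times_ten(unsigned digit) {
    unsigned twice = digit << 1;   // first ASL, the value saved in TEMP2
    unsigned eight = twice << 2;   // two more ASLs
    return eight + twice;          // ADC TEMP2
}

// Combine two ASCII digit characters into a number in the range 0..99.
unsigned parse_two_digits(char tens_char, char ones_char) {
    unsigned tens = static_cast<unsigned>(tens_char - '0');
    unsigned ones = static_cast<unsigned>(ones_char - '0');
    return times_ten(tens) + ones;
}
```

For the keystrokes '4' and '2', the tens digit becomes 32 + 8 = 40 and adding the ones digit gives 42.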

Graphics Feedback

The graphics screen (memory starting at $0200) is filled with a color based on the guess using FILL_SCREEN:

FILL_SCREEN:
  LDX #0
  LDY #0
FILL_LOOP:
  STA $0200,X ; Fill pages $0200–$05FF
  STA $0300,X
  ; ... continues for all pages
  INX
  BNE FILL_LOOP
  RTS

Colors are set using values like $05 (green) or $02 (red), updating the entire screen instantly.

Screenshots

User guesses low number

User guesses high number

User guesses the correct number (wins) and is asked whether to play another game

Attempt Tracking and Display

The ATTEMPTS(A) counter is incremented after each guess. For values ≥10, the number is split into tens and ones digits using a division loop:

DIVIDE_LOOP:
  CMP #10
  BCC DIGIT_READY ; Exit if <10
  SBC #10         ; Subtract 10
  INX             ; Count tens
  JMP DIVIDE_LOOP

This avoids complex division by repeatedly subtracting 10, demonstrating efficient arithmetic in assembly.
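
Here is the same repeated-subtraction idea as a C++ sketch. The function name `split_tens` is made up, and the model assumes the X register starts at 0 before the loop, which the excerpt above does not show.

```cpp
#include <cassert>
#include <utility>

// Repeated subtraction: returns {tens, ones} for a small non-negative value.
std::pair<unsigned, unsigned> split_tens(unsigned value) {
    unsigned tens = 0;       // X register, assumed to start at 0
    while (value >= 10) {    // CMP #10 / BCC DIGIT_READY
        value -= 10;         // SBC #10
        ++tens;              // INX
    }
    return {tens, value};    // what is left in A is the ones digit
}
```

For 23 attempts the loop runs twice, leaving 2 in the tens counter and 3 as the ones digit.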

Restart Mechanism

After winning, the player can press 'Y' to restart. This resets the screen and jumps back to the start:

RESTART:
  JSR SCINIT      ; Reinitialize screen
  JMP $0600       ; Restart program

Reflection

Writing this game in 6502 Assembly was both frustrating and rewarding. Here are the major challenges I faced and the limitations the program has:

Challenges

Random Number Range: Ensuring the random number stayed within 1–99 required masking (AND #$7F) and conditional checks.

Two-Digit Input: Converting ASCII input to a numeric value involved bit shifts and arithmetic, avoiding slow multiplication.

Limitations

The random number isn't uniformly distributed: bytes that land in 100–127 after masking are halved into 50–63, so those targets come up more often than the rest.
Input requires pressing Enter after each digit, which might feel unintuitive.

Full Code

Available here (paste into the 6502 Emulator to run).

OSD700 - RAG Integration: Stage 3

Amir Mullagaliev — Thu, 17 Apr 2025 02:04:14 +0000

Introduction
Tensorflow.js
Settings UI
Conclusion

Introduction

Now that stages 1 and 2 have landed in ChatCraft.org, it's time to work on stage 3.
Today, I am going to describe the embedding generation work that I am currently doing.

First, we have to stick to the proposed plan, which you can find here:

RAG on DuckDB Implementation Based on Prototype #868

mulla028 posted on Mar 29, 2025

Description

Recently, we implemented a prototype of RAG on DuckDB, and it proves that the implementation is doable for ChatCraft, so it's time to start working on it!

The implementation will take several steps; let's call them stages. Since we already have DuckDB set up using duckdb-wasm, the file loader, and the format-to-text extractors, we are skipping some of the steps (stages). Therefore, here are the steps we need to take in order to successfully implement it:

Proposed Implementation Stages

Stage 1: Create Two New Tables in IndexedDB
- Embeddings Table, with foreign key to a file
- Chunks Table, with foreign key to a file
Stage 2: Implement Chunking Logic
- Proper Chunking with overlap (cf. https://platform.openai.com/docs/assistants/tools/file-search#customizing-file-search-settings)
- Proper Chunking Storage in IndexedDB
Stage 3: Implement Embeddings Generation
- Allow using a cloud-based model or local (transformers.js or tensorflow.js)
Stage 4: Vector Search
- Use DuckDB's extension Called VSS
- Load Embeddings, Chunks, etc. into DuckDB
- Apply HNSW Indexing to Increase Speed of the Search ( HNSW Indexing Provided by VSS extension)
Stage 5: LLM Integration
- Modify Prompt Construction to Include Retrieved Context
- Implement Source Attribution in Responses
- Adjust Token Management to Account For Context
Stage 6: Query Processing
- Implement Embedding Generation for User Queries
- Use the Same Embedding Model as Documents for Consistency(text-embedding-3-small)

@humphd, @tarasglek please take a look at the proposed implementation stages, and approve them. Let me know if I am missing something :)


Stage 3 has just one point: "Allow using a cloud-based model or local (transformers.js or tensorflow.js)"

First, I had to figure out what tensorflow.js is.

Tensorflow.js

TensorFlow is a software library for machine learning and artificial intelligence; TensorFlow.js is its JavaScript counterpart, which runs models in the browser or in Node.js.

Alright, it sounds cool, but how should I use it? I did some research and found out that tensorflow.js has a model called Universal Sentence Encoder, whose embed method generates embeddings for the text passed to it.

Here are the links to the Universal Sentence Encoder's source code and the npm usage documentation. These resources helped me implement it.

The advantage of tensorflow.js over the OpenAI model that I have also implemented is that, once the model has been downloaded, it runs locally in the browser. It doesn't require an API key or a call to a cloud service for every embedding, which makes it a reliable option for ChatCraft users.

I would love to share the tensorflow.js implementation:

import { EmbeddingsProvider } from "./EmbeddingProvider";

/**
 * TensorFlow.js-based embedding provider
 * Uses Universal Sentence Encoder for local embedding generation
 */
export class TensorflowEmbeddingsProvider implements EmbeddingsProvider {
  readonly id = "tensorflow-use";
  readonly name = "TensorFlow Universal Sentence Encoder";
  readonly description = "Local embedding model using TensorFlow.js (512 dimensions)";
  readonly dimensions = 512;
  readonly maxBatchSize = 256;
  readonly defaultBatchSize = 128;
  readonly minBatchSize = 16;

  static readonly CONFIG = {
    dimensions: 512,
    maxBatchSize: 256,
    defaultBatchSize: 128,
    minBatchSize: 16,
  };

  private model: any = null;
  private isLoading: boolean = false;
  private loadPromise: Promise<void> | null = null;

  constructor() {}
  get CONFIG(): void {
    throw new Error("Method not implemented.");
  }
  /**
   * Load the model if it hasn't been loaded yet
   */
  private async loadModelIfNeeded(): Promise<void> {
    if (this.model) {
      return;
    }

    if (!this.loadPromise) {
      this.isLoading = true;

      this.loadPromise = (async () => {
        try {
          console.log("Loading Universal Sentence Encoder model...");

          await import("@tensorflow/tfjs");

          const use = await import("@tensorflow-models/universal-sentence-encoder");

          this.model = await use.load();
          console.log("Universal Sentence Encoder loaded successfully");
        } catch (err) {
          console.error("Failed to load Universal Sentence Encoder:", err);
          this.loadPromise = null;
          throw err;
        } finally {
          this.isLoading = false;
        }
      })();
    }

    return this.loadPromise;
  }

  /**
   * Generate an embedding vector for a single text
   */
  async generateEmbeddings(text: string): Promise<number[]> {
    const result = await this.generateBatchEmbeddings([text]);
    return result[0];
  }

  /**
   * Generate embedding vectors for multiple texts in batch
   */
  async generateBatchEmbeddings(texts: string[]): Promise<number[][]> {
    let embeddings;
    try {
      await this.loadModelIfNeeded();

      embeddings = await this.model.embed(texts);

      const arrays = await embeddings.array();

      return arrays;
    } catch (error: any) {
      console.error("Error generating TensorFlow embeddings:", error);
      throw new Error(`TensorFlow embedding error: ${error.message || "Unknown error"}`);
    } finally {
      if (embeddings) {
        embeddings.dispose();
      }
    }
  }
}

Here's the open PR:

[RAG] Stage - 3: Embeddings Generation #873

mulla028 posted on Apr 09, 2025

Description

This is stage 3 of #868. We are adding the capability to generate and store vector embeddings for document chunks. It introduces a modular embedding provider architecture and supports:

OpenAI's text-embedding-3-small API
Local tensorflow.js alternative

ChatCraftFile has been extended with methods to:

generate embeddings
store embeddings
manage embeddings

Integrated embedding generation with the use-file-import to automatically create embeddings after chunking.

New settings added to control:

the embedding provider
batch size
automatic generation preference

[!IMPORTANT] UI for the embedding preferences required! Therefore, we must land this PR with the updated Settings UI...

Test It

[!TIP] You may want to test it. In order to do so follow the steps below!

Open CloudFlare deployment below
Upload File >=300KB
Go to DevTools
Application → Storage → IndexedDB → ChatCraftDatabase → files
Voilà! You can see the generated embeddings! (I HOPE :D)

Screenshot

Question

Maybe we could reduce the minimum file size for chunking from 300KB → 100KB, and with it the minimum characters per chunk from 1000 → 300.

View on GitHub

Settings UI

Since this is an experimental feature that might be unstable or break, it should be turned off by default so that regular users aren't distracted by it. Therefore, we need a switch that turns the feature on without changing the code every time. Here's the result:

Conclusion

This PR is still in progress, but embeddings generation works well. Since it is the last week of the term, everyone is busy and doesn't have time to review my code, which is understandable. This is the end of the semester, but I will continue working on the RAG feature and running this blog. I really enjoy doing it!

OSD700 - RAG Integration: Stage 1 & 2

Amir Mullagaliev — Thu, 17 Apr 2025 00:56:53 +0000

Preface
Introduction
Stage 1
PR Expansion

Preface

In the previous post, I shared that I had successfully implemented a RAG prototype locally and described the steps I took. Moreover, we decided to try to make this feature work in ChatCraft.org, which means that the implementation is going to differ a little, since unlike my prototype it will run in a browser environment. Therefore, it requires a clear plan that will help land the feature step by step.

Introduction

After the prototype presentation, I filed an issue with the proposed plan. Within the next couple of hours, we discussed it, and eventually the professor adjusted and approved it.

Here's what the final proposal issue looks like:

RAG on DuckDB Implementation Based on Prototype #868

mulla028 posted on Mar 29, 2025

Description

Recently, we implemented a prototype of RAG on DuckDB. It proves that the implementation is doable for ChatCraft, so it's time to start working on it!

Proposed Implementation Stages

Stage 1: Create Two New Tables in IndexedDB
- Embeddings Table, with foreign key to a file
- Chunks Table, with foreign key to a file
Stage 2: Implement Chunking Logic
- Proper Chunking with overlap (cf. https://platform.openai.com/docs/assistants/tools/file-search#customizing-file-search-settings)
- Proper Chunking Storage in IndexedDB
Stage 3: Implement Embeddings Generation
- Allow using a cloud-based model or local (transformers.js or tensorflow.js)
Stage 4: Vector Search
- Use DuckDB's extension Called VSS
- Load Embeddings, Chunks, etc. into DuckDB
- Apply HNSW Indexing to Increase Speed of the Search (HNSW Indexing Provided by VSS extension)
Stage 5: LLM Integration
- Modify Prompt Construction to Include Retrieved Context
- Implement Source Attribution in Responses
- Adjust Token Management to Account For Context
Stage 6: Query Processing
- Implement Embedding Generation for User Queries
- Use the Same Embedding Model as Documents for Consistency (text-embedding-3-small)

@humphd, @tarasglek please take a look at the proposed implementation stages, and approve them. Let me know if I am missing something :)

View on GitHub

As you can see, it takes 6 stages. Could be more, but we already have these features:

Automatic Text Extraction of Uploaded File to IndexedDB
Chunking Logic (Thanks to one of the contributors)

Stage 1

In the first stage, I had to add chunks and embeddings to the file table in IndexedDB. During the implementation, we decided to keep the embeddings inside the chunks, so each chunk has its own vector embedding. It didn't require much time, just a couple of lines of code...

PR Expansion

I realized that the PR was too small to be landed on its own, so I had to expand it and implement the chunking logic for each file. That means I implemented Stages 1 and 2 in a single PR.

After a couple of hours, I pushed a bunch of commits:

Created FileChunk[] type
Implemented chunking logic
Added a condition so that files larger than 3MB are automatically chunked during import

The rest was cleanup. However, one of the contributors pointed me to a function that he had already implemented for chunking. It helped me a lot, since my own chunking logic had some problems and I had to rework it...

Eventually, the PR was approved and merged. You may take a look right here:

[RAG] Stages 1 & 2: New Columns and Chunking #870

mulla028 posted on Apr 01, 2025

Stage 1 for #868

Description

This is stage 1 of the RAG implementation. Since we've decided to use vector search, ChatCraft requires two new tables, as stated in the Proposed Implementation. However, @humphd suggested adding two new columns to the ChatCraftFileTable instead: chunks and embeddings. These are optional columns. chunks will contain the chunked text (planned for stage 2, next), and embeddings will contain the vector embeddings generated by the model from those chunks (planned for stage 3).

Small Concern

I totally understand that these columns are optional, but do we need to add them to the data schema as new fields in IndexedDB like this? I don't see any other optional column there, so I decided not to include them in the PR.

this.version(13).stores({
    files: "id, name, type, size, text, created, chunks, embeddings",
});

UPD: Decided to implement chunking here as well

View on GitHub

That means stages 1 and 2 are done, and we can move forward to stage 3: embeddings generation. It will be interesting :)

OSD700 - RAG on DuckDB

Amir Mullagaliev — Wed, 16 Apr 2025 23:15:13 +0000

Preface
Introduction
How it works?
Final Decision

Preface

In the last post dedicated to OSD700, I wrote that I had chosen a direction for my development work for the rest of the term. If you don't remember, read it. If you are busy, here's the short version: I decided to implement a local prototype of RAG and, based on the results, decide whether to integrate it into ChatCraft.org.

A month later, I am coming back with a group of posts about the results and, as a spoiler, I can tell you they are promising!

Introduction

Going back to the previous post, I am going to attach the prototype issue here:

Prototype RAG on DuckDB and File Attachments #803

humphd posted on Jan 27, 2025

ChatCraft has been expanded to include File Attachments and DuckDB, which supports querying files. The two features have been connected, so you can attach files, run SQL queries on them, get back results, download them, etc.

Now that we have this foundation, I think we have most of what we need for building a RAG solution, when file attachments are too large to put into the chat context.

I think the process would work like this:

user attaches some files with text we can extract (PDF, source code, Word Doc, etc)
somehow (UI? automatically based on file size) we decide when use these file attachments for RAG vs. embedding directly in the chat messages
we take the set of RAG-attachment-files and "index" them in DuckDB. Maybe we use full-text search or maybe we use vector search (see part 1, part 2)
when the user asks a question, we use their prompt to create a query, get back results from the indexed docs, and include relevant text context along with the original prompt

The initial version of this can be crude, without proper UI, optimal indexing, etc. We need to play a bit to get this right.

Likely, the best way to begin this work is to prototype it outside of ChatCraft using DuckDB and text files locally.

View on GitHub

Hopefully, it helped refresh your memory.

Research took about 2 weeks, and it really helped to understand what's happening and how, at least locally.

After those two weeks, I had to try to implement the feature locally and present it in the class. After the first failed attempt, I didn't give up and made it work.

Here's the repository where you can find my prototype solution:

mulla028 / duckdb-rag-prototype

CLI RAG prototype on DuckDB implemented for ChatCraft

duckdb-rag-prototype

CLI RAG prototype on DuckDB implemented for ChatCraft using vector search

Getting Started

Clone the Repository
Install DuckDB on your local machine (Optional, used for testing duckdb -ui)
Install the dependencies
- npm i
Create .env file and add there your OPENAI_API_KEY

How to use it?

Once you have cloned the repo you will get the populated data inside of the targeted folder called documents. Obviously, you may add any text file to process.

First of all you will need to process all the files using the command:

npm run rag -- process

This command will process all the files, segmenting them into chunks of sentences (default) or paragraphs. Eventually, it will generate vector embeddings of 1536 dimensions using the text-embedding-3-small model.

To use the paragraphs option:

npm run rag -- process -c "paragraphs"
npm run rag -- process --chunking "paragraphs"

…

View on GitHub

If you are willing to try it, I have written a README.md that explains how to use this prototype.

How it works?

Essentially, the process sounds really complex, but it consists of several simple stages and one complex one. Here they are:

Receive text from the file. (We learnt it during the first semester)
Chunk the text and store it in DuckDB. (Write logic that chunks text, e.g., by paragraph or by sentence.)
Generate the vector embeddings and store them in DuckDB. (Using OpenAI's text-embedding-3-small model)
Vector Search (Hard one)
- import DuckDB VSS extension
- Apply HNSW Indexing to Increase Speed of the Search (HNSW Indexing Provided by VSS extension)
Generate Embeddings for User Query and Generate Answer

Final Decision

Countless hours of RAG research and prototyping eventually paid off: the professor liked the way I understood and implemented this problem, and we decided to try integrating it into ChatCraft.org.

However, first I needed to write the proposal implementation issue, which would clearly identify all the steps required to successfully implement the feature.

SPO600 Project - Stage 2: Function Clone Detection and Analysis (Part 2)

Amir Mullagaliev — Mon, 07 Apr 2025 12:10:16 +0000

Reproducing My Setup
Detailed Test Results
- x86_64 Results
- aarch64 Challenges
Capabilities and Limitations
- What My Implementation Can Do
- Technical Limitations
Knowledge Gaps and Personal Reflections
Technical Improvements for Stage III
Conclusion

Reproducing My Setup

To replicate my work, follow these steps:

First, create or modify the pass file located at gcc/gcc/tree-amullagaliev.cc with the implementation shown in Part 1.
Navigate to your GCC build directory:

   cd ~/gcc-build-001/

Rebuild GCC with your modified pass:

   time make -j$(nproc)

Test your pass using the provided test cases:

   cd /path/to/test/directory
   tar -xzf /public/spo600-test-clone.tgz
   cd spo600/examples/test-clone
   make

Detailed Test Results

x86_64 Results

On the x86_64 platform, my pass successfully identified both the prune and no-prune cases:

For the prune test case:

;; Function scale_samples (scale_samples.default, funcdef_no=23, decl_uid=3954, cgraph_uid=24, symbol_order=23)

************************************************************
*  ANALYZATION - Examining Function: scale_samples                *
************************************************************
************************************************************
*  ANALYSIS FINISHED!
************************************************************

;; Function scale_samples.popcnt (scale_samples.popcnt, funcdef_no=25, decl_uid=3985, cgraph_uid=30, symbol_order=28)

************************************************************
*  ANALYZATION - Examining Function: scale_samples.popcnt         *
************************************************************
PRUNE: scale_samples
CLONE FOUND: scale_samples
CURRENT: scale_samples.popcnt
************************************************************
*  End of Diagnostic
************************************************************

For the no-prune test case:

;; Function scale_samples.arch_x86_64_v3 (scale_samples.arch_x86_64_v3, funcdef_no=25, decl_uid=3985, cgraph_uid=30, symbol_order=28)

************************************************************
*  ANALYZATION - Examining Function: scale_samples.arch_x86_64_v3 *
************************************************************
NOPRUNE: scale_samples
CLONE FOUND: scale_samples
CURRENT: scale_samples.arch_x86_64_v3
************************************************************
*  End of Diagnostic
************************************************************

aarch64 Challenges

On the aarch64 platform, I encountered this error:

gcc -D 'CLONE_ATTRIBUTE=__attribute__((target_clones("default","rng") ))'\
    -march=armv8-a -g -O3 -fno-lto  -ftree-vectorize  -fdump-tree-all -fdump-ipa-all -fdump-rtl-all \
    clone-test-core.c vol_createsample.o -o clone-test-aarch64-prune
clone-test-core.c:28:6: error: pragma or attribute 'target("rng")' is not valid
   28 | void scale_samples(int16_t *in, int16_t *out, int cnt, int volume) {
      |      ^~~~~~~~~~~~~
make: *** [Makefile:35: clone-test-aarch64-prune] Error 1

This error happens because rng is not a valid target attribute for the aarch64 architecture. Unlike x86_64, which has attributes like "popcnt" and "arch=x86-64-v3", aarch64 has a different set of supported CPU features.

This highlights an important cross-platform point: Function Multi-Versioning attributes are architecture-specific, and code that works on one architecture may need changes to work on another.

Capabilities and Limitations

What My Implementation Can Do

Base Function Identification: Correctly identifies the base name of cloned functions by stripping variant suffixes.
Resolver Function Detection: Recognizes and skips resolver functions, which handle runtime selection between clones.
Basic Structure Comparison: Compares the number of basic blocks and GIMPLE statements to determine whether functions are potentially equivalent.
Output Requirements: Produces the required PRUNE or NOPRUNE messages in the correct format.
State Management: Manages state between function calls to compare different clones, using std::string for storage rather than C-style character arrays, which makes the code more robust.

Technical Limitations

Superficial Comparison: The comparison is based only on block and statement counts, not on actual code semantics. Two functions with the same number of statements but different logic would incorrectly be considered identical.
No SSA Variable Normalization: The implementation doesn't normalize variable names before comparison, which a more robust solution would do.
Architecture Dependency: As shown by the aarch64 error, the current implementation doesn't fully handle cross-architecture differences in FMV attributes.
Single Clone Assumption: The code assumes only one cloned function with two variants, as per the project specs, but this doesn't scale to real-world codebases.
Limited Structural Analysis: My pass doesn't analyze the control flow structure within functions, which would be necessary for a truly robust clone detection algorithm.

Knowledge Gaps and Personal Reflections

This project revealed several knowledge gaps I need to address:

Architecture-Specific Compiler Features: I need a deeper understanding of how architecture-specific features are implemented across different platforms. My aarch64 issues highlighted this gap.
GCC GIMPLE Internals: While I understand the basics of the GIMPLE representation, I need a more thorough understanding of how to analyze and compare GIMPLE statements.
Cross-Architecture Testing: I need better strategies for developing and testing features that must work across different architectures.

I found the most challenging aspect to be understanding how to effectively compare function structures beyond simple metrics. Specifically, I struggled with how to normalize SSA variables and other identifiers to determine when two functions are semantically equivalent despite superficial differences.

The most interesting part was seeing how GCC implements Function Multi-Versioning: the resolver functions and the naming conventions used for variants gave me insight into how runtime feature detection works. Professor Chris Tyler's lectures were incredibly helpful in understanding these concepts and the inner workings of GCC. I couldn't have figured this out without his explanations!

Technical Improvements for Stage III

For Stage III, I plan to address these technical issues:

Deep Structural Comparison: Implement a GIMPLE statement-by-statement comparison that normalizes SSA variables, labels, and basic block numbers.
Architecture-Agnostic Approach: Modify the implementation to handle architecture-specific differences in a more robust way.
Hash-Based Signature: Generate a normalized hash or signature for each function structure to make comparisons more efficient and accurate.
Better State Management: Improve the current state management to handle more complex scenarios with multiple clones.
Control Flow Analysis: Add analysis of the control flow structure, which would capture the logical equivalence of functions beyond just statement counts.
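To give an idea of what the hash-based signature could look like, here is a minimal sketch that works on plain strings rather than on real GIMPLE. It assumes the statements have already been normalized somehow and are grouped per basic block. The names Block, fnv1a, and function_signature are made up for this illustration, and FNV-1a is just one convenient hash choice, not something the project requires.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Normalized statements of one basic block (illustrative representation).
using Block = std::vector<std::string>;

// 64-bit FNV-1a step: folds the bytes of `text` into a running hash.
std::uint64_t fnv1a(const std::string &text, std::uint64_t hash) {
    for (unsigned char c : text) {
        hash ^= c;
        hash *= 1099511628211ULL;  // FNV prime
    }
    return hash;
}

// Signature of a whole function: statements are hashed in order, with
// explicit separators so block and statement boundaries affect the result.
std::uint64_t function_signature(const std::vector<Block> &blocks) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (const Block &block : blocks) {
        hash = fnv1a("\x01", hash);      // marks the start of a block
        for (const std::string &stmt : block) {
            hash = fnv1a(stmt, hash);
            hash = fnv1a("\x02", hash);  // marks the end of a statement
        }
    }
    return hash;
}
```

Two variants with equal signatures would still deserve a full statement-by-statement comparison before printing PRUNE, since a hash can only rule candidates out cheaply; it cannot prove equivalence.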

Conclusion

Stage 2 has been challenging yet enlightening. Working directly with GCC has given me a deeper understanding of compiler optimization techniques, particularly Function Multi-Versioning.

The cross-architecture challenges I encountered were unexpected but provided valuable learning experiences about how compiler features can differ between platforms.

While my current implementation fulfills the basic requirements of the project, the limitations I identified provide clear direction for improvements in Stage III. I'm particularly interested in developing a more robust comparison algorithm that can accurately determine when two functions are semantically equivalent despite superficial differences.

Professor Chris Tyler's guidance and lectures have been instrumental in this journey, providing the foundation of knowledge needed to tackle these complex compiler topics. Without his clear explanations, navigating GCC's internal structures would have been much more difficult.

The journey continues in Stage III, where I'll improve these techniques and address the limitations identified here.

Note: All code for this project is available in my GitHub repository. Feel free to clone it and follow the steps above to reproduce my results.

SPO600: Project Stage 2 - Function Clone Detection and Analysis (Part 1)

Amir Mullagaliev — Mon, 07 Apr 2025 10:59:51 +0000

Introduction
Project Requirements
What is Function Multi-Versioning (FMV)?
Understanding the Challenge
My Implementation Approach
Key Functions and Data Structures

Introduction

Welcome back to my SPO600 blog! If you didn't read my previous post, in which I described creating a basic GCC pass that counts basic blocks and GIMPLE statements, you should definitely check it out first to understand the foundation of what we're building on.

In Stage 2, we tackle a more complex challenge: building a Clone-Pruning Analysis Pass for GCC. This pass analyzes functions cloned during compilation and determines if they are substantially similar enough to be pruned. It's a deep dive into GCC optimization processes and how we can extend compiler capabilities!

Learning about GCC was really interesting, especially through Professor Chris Tyler's lectures. His clear explanations helped me a lot to navigate the complexity of compiler development. Without his videos, I would probably still be trying to understand how GCC works!

Project Requirements

For Stage 2, we need to create a pass that:

Identifies functions that have been cloned (these will have names like function.variant)
Examines clones to determine if they are substantially the same or different
Outputs a diagnostic message indicating whether the functions should be pruned or not

To simplify the project, we were allowed to make these assumptions:

There is only one cloned function in the program
There are only two versions (clones) of that function (ignoring resolver)

What is Function Multi-Versioning (FMV)?

Before diving into implementation, it's worth understanding what function cloning or multi-versioning is in GCC.

Function Multi-Versioning is a technique where a compiler creates multiple versions of the same function, each optimized for different processor capabilities. For example, one version might use AVX instructions for newer processors, while another uses more basic instructions for compatibility.

NOTE! When the program runs, a resolver function chooses the appropriate version based on the actual CPU capabilities of the machine. This allows a single binary to run efficiently on different processor generations without needing separate builds.
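To make the resolver idea concrete, here is a minimal hand-written sketch of the kind of dispatch that GCC generates automatically for cloned functions. Everything in it is illustrative: CpuFeatures, scale_default, scale_popcnt, and resolve_scale are hypothetical names, and the feature flag is set by hand instead of being detected from the real CPU.

```cpp
#include <cstdint>

// Hypothetical feature record; a real resolver queries the CPU at load time.
struct CpuFeatures {
    bool has_popcnt = false;
};

using ScaleFn = void (*)(const int16_t *in, int16_t *out, int cnt, int volume);

// Baseline variant, safe on every processor.
void scale_default(const int16_t *in, int16_t *out, int cnt, int volume) {
    for (int i = 0; i < cnt; ++i)
        out[i] = static_cast<int16_t>((in[i] * volume) / 32768);
}

// Stand-in for a variant built for a newer CPU feature. Its body is
// identical on purpose: this is the situation a clone-pruning analysis
// is meant to detect.
void scale_popcnt(const int16_t *in, int16_t *out, int cnt, int volume) {
    for (int i = 0; i < cnt; ++i)
        out[i] = static_cast<int16_t>((in[i] * volume) / 32768);
}

// The "resolver": picks one variant based on the available features.
ScaleFn resolve_scale(const CpuFeatures &cpu) {
    return cpu.has_popcnt ? scale_popcnt : scale_default;
}
```

With real FMV, the compiler emits the variants and the resolver for you; the sketch only shows why keeping two identical variants around is wasted space.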

Understanding the Challenge

The challenge here is to determine when two function variants are substantially the same. According to the project specs, functions are substantially the same if they are identical except for identifiers like:

Temporary variable names
Static single assignment (SSA) variable names
Labels
Basic block numbers

If two cloned functions are substantially the same, there's no reason to keep both versions in the final binary, so we can prune the redundant one.
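One way to picture "identical except for identifiers" is to rename every SSA-style name to a canonical one, numbered by first appearance, and then compare the results. The sketch below does this on text lines that merely resemble a GIMPLE dump. It is not what my pass does, and it makes a loud assumption: SSA names look like _12 or x_5 (optional letters, an underscore, then digits). normalize_ssa and substantially_same are hypothetical helper names.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Rewrites SSA-style names (e.g. "_12", "x_5") to canonical names v0, v1, ...
// numbered by order of first appearance across the whole function body.
std::vector<std::string> normalize_ssa(const std::vector<std::string> &stmts) {
    static const std::regex ssa_name(R"(\b[A-Za-z]*_[0-9]+\b)");
    std::map<std::string, std::string> renamed;
    std::vector<std::string> result;

    for (const std::string &stmt : stmts) {
        std::string out;
        auto last = stmt.cbegin();
        for (std::sregex_iterator it(stmt.begin(), stmt.end(), ssa_name), end;
             it != end; ++it) {
            const std::smatch &m = *it;
            out.append(last, m[0].first);  // copy the text before the match
            auto found = renamed.find(m.str());
            if (found == renamed.end()) {
                found = renamed.emplace(
                    m.str(), "v" + std::to_string(renamed.size())).first;
            }
            out += found->second;          // emit the canonical name
            last = m[0].second;
        }
        out.append(last, stmt.cend());     // copy the remaining tail
        result.push_back(out);
    }
    return result;
}

// Under this sketch, two bodies are "substantially the same" if they
// match line for line after renaming.
bool substantially_same(const std::vector<std::string> &a,
                        const std::vector<std::string> &b) {
    return normalize_ssa(a) == normalize_ssa(b);
}
```

Labels and basic block numbers would need the same treatment; the sketch covers only the variable names to keep it short.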

My Implementation Approach

After studying the GCC codebase and our previous work from Stage 1, I decided to implement a relatively simple but effective approach:

Function Recognition: Identify the base function name by stripping away variant suffixes
Basic Comparison Metrics: Compare the structure of functions using:
- Number of basic blocks
- Number of GIMPLE statements

While this isn't a full structural comparison, it provides a solid first-pass heuristic. Functions with different block counts or statement counts are definitely different, while those with matching counts are likely similar (though not guaranteed).

Here's the core logic from my implementation:

// Clone comparison logic using previous_function_name as flag.
if (previous_function_name.empty()) {
    // No clone stored; store current information.
    previous_function_name = base_name;
    previous_block_total = bb_count;
    previous_statement_total = gimple_count;
    // For standalone function, print footer.
    print_frame_footer(out, "ANALYSIS FINISHED!");
} else {
    // A clone has already been stored; compare stored info with current function.
    if (previous_function_name == base_name &&
        previous_block_total == bb_count &&
        previous_statement_total == gimple_count)
    {
        fprintf(out, "PRUNE: %s\n", previous_function_name.c_str());
        fprintf(out, "CLONE FOUND: %s\n", previous_function_name.c_str());
        fprintf(out, "CURRENT: %s\n", full_fname.c_str());
    } else {
        fprintf(out, "NOPRUNE: %s\n", previous_function_name.c_str());
        fprintf(out, "CLONE FOUND: %s\n", previous_function_name.c_str());
        fprintf(out, "CURRENT: %s\n", full_fname.c_str());
    }
    print_frame_footer(out, "End of Diagnostic");
    // Clear previous_function_name to allow storing next clone.
    previous_function_name = "";
    previous_block_total = 0;
    previous_statement_total = 0;
}

My overall approach was:

Store information about the first clone encountered
When encountering a second clone with the same base name, compare it to the stored information
Output a PRUNE or NOPRUNE decision based on the comparison
Reset the stored state for potential future clone pairs
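The four steps above can be modelled outside GCC as a small state machine. This is only a sketch, not the pass itself: CloneTracker, FunctionStats, and observe are hypothetical names, and it keys the stored state by base name. That is an assumption that goes slightly beyond the single-function version shown earlier, made so the same logic could cope with more than one cloned function.

```cpp
#include <cstddef>
#include <map>
#include <string>

// The two metrics the comparison is based on.
struct FunctionStats {
    std::size_t blocks = 0;
    std::size_t statements = 0;
    bool operator==(const FunctionStats &o) const {
        return blocks == o.blocks && statements == o.statements;
    }
};

class CloneTracker {
public:
    // Returns "PRUNE: name" or "NOPRUNE: name" when a second variant of the
    // same base function arrives, or an empty string when the variant was
    // only stored for later.
    std::string observe(const std::string &base_name,
                        const FunctionStats &stats) {
        auto it = pending_.find(base_name);
        if (it == pending_.end()) {
            pending_[base_name] = stats;          // step 1: store first variant
            return "";
        }
        const bool same = (it->second == stats);  // step 2: compare
        pending_.erase(it);                       // step 4: reset this name
        return (same ? "PRUNE: " : "NOPRUNE: ") + base_name;  // step 3
    }

private:
    std::map<std::string, FunctionStats> pending_;
};
```

Keeping one entry per base name means an unrelated function arriving between two variants no longer disturbs the comparison.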

Key Functions and Data Structures

The main components of my implementation include:

Static Storage Variables: To maintain state between function calls

  static std::string previous_function_name = "";
  static size_t previous_block_total = 0;
  static size_t previous_statement_total = 0;

Base Name Extraction: Strips variant suffixes to find the base function name

  std::string get_base_function_name(function *fun) {
      struct cgraph_node *node = cgraph_node::get(fun->decl);
      std::string fname = (node != nullptr) ? std::string(node->name())
                                            : std::string(function_name(fun));
      size_t pos = fname.find(".resolver");
      if (pos != std::string::npos)
          return fname.substr(0, pos);
      pos = fname.find('.');
      if (pos != std::string::npos)
          return fname.substr(0, pos);
      return fname;
  }

Resolver Detection: Special handling for resolver functions

  bool is_resolver = (full_fname.find(".resolver") != std::string::npos);
  if (is_resolver) {
      print_frame_footer(out, "ANALYSIS FINISHED (resolver function)");
      return 0;
  }

Block and Statement Counting: Basic metrics for function comparison

  size_t bb_count = 0;
  size_t gimple_count = 0;
  basic_block bb;
  FOR_EACH_BB_FN(bb, fun) {
      bb_count++;
      for (gimple_stmt_iterator gsi = gsi_start_bb(bb);
           !gsi_end_p(gsi);
           gsi_next(&gsi))
      {
          gimple_count++;
      }
  }
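Since the snippets above depend on GCC internals, here is a minimal standalone sketch of just the name handling, which can be tried without building GCC. It works on plain strings like the function names printed in the dump files; is_resolver_name and base_name_of are illustrative names, not the ones used in the pass.

```cpp
#include <cstddef>
#include <string>

// True for the dispatcher emitted alongside the clones, e.g. "foo.resolver".
bool is_resolver_name(const std::string &full_name) {
    return full_name.find(".resolver") != std::string::npos;
}

// Everything before the first '.' is treated as the base name.
std::string base_name_of(const std::string &full_name) {
    const std::size_t dot = full_name.find('.');
    return dot == std::string::npos ? full_name : full_name.substr(0, dot);
}
```

For example, base_name_of("scale_samples.popcnt") gives "scale_samples". Cutting at the first dot also covers the ".resolver" suffix, so the separate resolver check is only needed to decide whether to skip the function.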

(In Part 2, I'll cover the testing process, results, limitations, and future improvements for the project.)

OSD700: Stage 4

Amir Mullagaliev — Thu, 13 Mar 2025 14:25:25 +0000

Introduction

In last week's lecture, the professor helped me pick the work to do for the rest of this term. My main goal was to minimize front-end development and gain experience in back-end or middle-end work. However, that doesn't mean I won't work on UI/UX; it means that I will work on developing a killer feature.

Therefore, we came up with an idea to develop a RAG on DuckDB.

What's RAG?

"Retrieval-Augmented Generation (RAG) is a hybrid AI framework that enhances language model outputs by combining the model's inherent knowledge with information retrieved from external sources. When a query is received, RAG first searches through connected databases, documents, or knowledge bases to find relevant information, then feeds this retrieved context alongside the original query into the language model. This approach addresses several limitations of standalone language models by providing access to up-to-date information beyond the model's training cutoff, reducing hallucinations by grounding responses in verified sources, enabling attribution to specific documents, and allowing for domain specialization without extensive model fine-tuning. RAG has become fundamental in enterprise AI applications, search engines, and customer support systems where factual accuracy and current information are essential." - Claude AI

RAG will help ChatCraft search through text files and give an answer based on the user's prompt. There is a filed issue that describes everything:

Prototype RAG on DuckDB and File Attachments #803

humphd posted on Jan 27, 2025

Now that we have this foundation, I think we have most of what we need for building a RAG solution, when file attachments are too large to put into the chat context.

I think the process would work like this:

user attaches some files with text we can extract (PDF, source code, Word Doc, etc)
somehow (UI? automatically based on file size) we decide when use these file attachments for RAG vs. embedding directly in the chat messages
we take the set of RAG-attachment-files and "index" them in DuckDB. Maybe we use full-text search or maybe we use vector search (see part 1, part 2)
when the user asks a question, we use their prompt to create a query, get back results from the indexed docs, and include relevant text context along with the original prompt

The initial version of this can be crude, without proper UI, optimal indexing, etc. We need to play a bit to get this right.

Likely, the best way to begin this work is to prototype it outside of ChatCraft using DuckDB and text files locally.

View on GitHub

What Have I Done?

I started working toward prototype implementation. It took me a while to research how everything works, and the first small steps were taken, but I consider my local prototype a super raw version.

Using langchain, openai and duckdb, I am working on the local version of this new feature before I start the web implementation and, eventually, implement it in ChatCraft! It will take some time, but I am really motivated to finish it and present it.

Conclusion

This is a small blog post, since I spent the week on research and a small part of the implementation. However, next week I will write a huge blog post on how to implement the RAG on DuckDB prototype locally. See y'all!

SPO600: Project Stage 1 - Basic GCC Pass

Amir Mullagaliev — Mon, 10 Mar 2025 04:03:04 +0000

Introduction
Steps To Create a Pass
- What is GCC Pass?
- Step 1 - Write a Pass
- Step 2 - Registering the Pass
- Step 3 - Add Object File
- Step 4 - Modify Header File
- Step 5 - Re-create the Makefile
Results
- Dump File Outputs
- Code Limitations
Conclusion

Introduction

This blog post is dedicated to SPO600's Project - Stage 1, where I work specifically with GCC. If you haven't read the previous post, where I described the steps to build the GCC compiler on AArch64 and x86_64 servers and eventually compared them, I highly recommend doing so! That way, you'll understand what I do in this post, and why and how I do it.

Stage one helps students prepare their environment and GCC for the second stage, where the major "heavy lifting" will happen. During this stage, I am creating a basic GCC pass for the current development version of the GCC compiler which:

Iterates through the code being compiled.
Prints the name of every function being compiled.
Prints a count of the number of basic blocks in each function.
Prints a count of the number of gimple statements in each function.

Modern GCC has poor documentation on how to create a pass, so I referred to video-lectures and documentation provided and created by our professor:

Steps To Create a Pass

Honestly, these steps do not require extraordinary knowledge, just patience and attention. By following these steps and the provided resources, anyone can reproduce what has been done in this stage. That makes sense, since we were told that this stage is about preparing the environment.

What is GCC Pass?

"A GCC pass is a modular component within the GNU Compiler Collection that performs a specific transformation or analysis task during the compilation process. Each pass operates on an intermediate representation of the code (such as GIMPLE or RTL), executing in a predetermined order within the compilation pipeline to transform source code into machine code. Passes can analyze code, optimize it, clean up after other passes, or implement target-specific transformations, with the pass manager coordinating their execution. Compiler developers can create custom passes to extend GCC's functionality, as in your project where you're implementing a pass to count basic blocks and GIMPLE statements within functions." - Claude AI (Sonnet 3.7)

Back to square one!

Step 1 - Write a Pass

Once I had built the GCC compiler, I had to look over the existing GCC passes to understand how they look and pick one of them as a template. To do so, I went to the GCC source tree, where I was able to find those passes.

Since I cloned GCC from the git repository, my source code is located at ~/git/gcc:

Move to the gcc sub-directory with cd gcc. You will get to ~/git/gcc/gcc, where the actual compiler implementation is located.
Look for the files starting with tree-*.cc or tree-*.c for passes that work on the tree/GIMPLE representation:

ll tree*.cc

You will find this kind of list of the passes implemented by GCC developers:

Pick one of these templates as the starting point.

The professor's first example, which I used as well, can be found at gcc/gcc/tree-nrv.cc.

Throughout this stage, I reproduced the professor's code to make sure that I was following along.

Test Pass may be found in my git repository, simply copy it if you want to keep things simple, and go along with this tutorial: click here to see the source code

This test pass simply outputs all of the compiled functions in the dump file.

My final source code implementation for the pass may be found here: final pass' source code

This implementation shows the name of each function along with its count of basic blocks and gimple statements, and prints the totals after every function. NOTE: The outputs will be presented in the final step.

Step 2 - Registering the Pass

It is very important to register the pass in passes.def in order for GCC to recognize the custom pass. Otherwise, it won't work.

This file is located at ~/git/gcc/gcc/passes.def. It lists the many passes that run during compilation, and they run in the listed order, so placement matters. I decided to do the same as the professor and put my pass right after:

NEXT_PASS (pass_nrv);

My modified file looks like this:

...
NEXT_PASS (pass_nrv);
NEXT_PASS (tree_amullagaliev);
...

Step 3 - Add Object File

I added the object file for my pass to the OBJS section of Makefile.in. My source file is tree-amullagaliev.cc, so I added tree-amullagaliev.o to the OBJS list.

It should look something like this:

...
tree-dfa.o \
tree-amullagaliev.o \
tree-diagnostic.o \
...

Step 4 - Modify Header File

For the entry I added to passes.def to resolve, I had to declare my pass in the tree-pass.h header file, which can be found at ~/git/gcc/gcc/tree-pass.h.

Add this declaration so that GCC can find the function that creates the pass:

extern gimple_opt_pass *make_tree_amullagaliev (gcc::context *ctxt);
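Steps 2 to 4 touch three different files, and it is easy to forget one of them. The following is a minimal sketch of a check, not part of the original workflow: the check_pass function and the mock directory are assumptions made for the demo, and the search strings simply mirror the naming used in this post. On a real tree you would point it at ~/git/gcc/gcc instead.

```bash
#!/usr/bin/env bash
set -euo pipefail

# check_pass DIR NAME: print every file in DIR that does not mention the pass.
# Returns non-zero if anything is missing. (Hypothetical helper for this demo.)
check_pass() {
  local dir=$1 name=$2 missing=0
  grep -qF "NEXT_PASS (tree_${name})" "$dir/passes.def"  || { echo "missing: passes.def";  missing=1; }
  grep -qF "tree-${name}.o"           "$dir/Makefile.in" || { echo "missing: Makefile.in"; missing=1; }
  grep -qF "make_tree_${name}"        "$dir/tree-pass.h" || { echo "missing: tree-pass.h"; missing=1; }
  return "$missing"
}

# Mock source directory so the sketch runs anywhere.
mock=$(mktemp -d)
echo 'NEXT_PASS (tree_amullagaliev);' > "$mock/passes.def"
printf 'tree-dfa.o \\\ntree-amullagaliev.o \\\n' > "$mock/Makefile.in"
: > "$mock/tree-pass.h"   # declaration deliberately left out

report=$(check_pass "$mock" amullagaliev) && status=ok || status=incomplete
echo "$report"
echo "status: $status"
rm -rf "$mock"
```

Because the mock tree-pass.h is empty, the sketch reports that file as missing. Once all three files mention the pass, it prints nothing but the ok status.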

Step 5 - Re-create the Makefile

As one of the final steps, I had to re-create the Makefile inside my build tree, ~/gcc-build-001. This is needed for the changes in Makefile.in to take effect, because the build system doesn't automatically detect changes made only to Makefile.in.

The easiest way is to delete the Makefile in the gcc sub-directory of the build tree, ~/gcc-build-001/gcc/Makefile, and let the next make run regenerate it. NOTE: This method avoids rebuilding everything!

IMPORTANT! Deleting the Makefile inside ~/gcc-build-001/gcc is required only after changing ~/git/gcc/gcc/Makefile.in. Later modifications to the pass itself don't require this step!

Here's how it looks in bash:

cd ~/gcc-build-001/gcc
rm Makefile
cd ..
time make -j$(nproc) |& tee build-xxx.log

Results

Once I had rebuilt GCC with the brand new pass, I was ready to test it!

First, I had to write some test code, which can be found here. I am leaving it for you to reuse; it closely follows the professor's code:

#include <stdio.h>

int foo(int p1, int p2) {
    return p1*p2;
}

int main() {
    int a = 12;
    int b = 13;

    int c;

    c = foo(a, b);

    printf("%d\n", c);

    return 0;
}

The next thing that I had to do was to create a Makefile inside the test directory, which was also uploaded to the git repository:

BINARIES=hello
CCFLAGS=-g -O0 -fno-builtin -fdump-tree-amullagaliev

all: ${BINARIES}

hello: hello.c
    gcc ${CCFLAGS} -o hello hello.c

clean:
    rm ${BINARIES} *.o || true

Notice the -fdump-tree-amullagaliev flag: it tells GCC to write a dump file for the pass I have just implemented.

Dump File Outputs

After completing all the steps above, I compiled the sample code using the make command:

make hello

Two files will appear: hello and hello.c.265t.amullagaliev.

Here are the dump file contents for the two passes, the test pass and the final pass:

Test-Pass Output:

;; Function foo (foo, funcdef_no=0, decl_uid=3929, cgraph_uid=1, symbol_order=0)

=== FUnction 1 Name 'printf' ===
=== FUnction 2 Name 'main' ===
=== FUnction 3 Name 'foo' ===


#### End amullagaliev diagnostics, start regular dump of current gimple ####


int foo (int p1, int p2)
{
  int D.3937;
  int _3;

  <bb 2> :
  _3 = p1_1(D) * p2_2(D);

  <bb 3> :
<L0>:
  return _3;

}



;; Function main (main, funcdef_no=1, decl_uid=3931, cgraph_uid=2, symbol_order=1)

=== FUnction 1 Name 'printf' ===
=== FUnction 2 Name 'main' ===
=== FUnction 3 Name 'foo' ===


#### End amullagaliev diagnostics, start regular dump of current gimple ####


int main ()
{
  int c;
  int b;
  int a;
  int D.3939;
  int _7;

  <bb 2> :
  a_1 = 12;
  b_2 = 13;
  c_5 = foo (a_1, b_2);
  printf ("%d\n", c_5);
  _7 = 0;

  <bb 3> :
<L0>:
  return _7;

}

As you can see, only function names were produced.

Final Pass Output

;; Function foo (foo, funcdef_no=0, decl_uid=2337, cgraph_uid=1, symbol_order=0)

===== Basic block count: 1 =====
----- Statement count: 1 -----
_3 = p1_1(D) * p2_2(D);
===== Basic block count: 2 =====
----- Statement count: 2 -----
<L0>:
----- Statement count: 3 -----
# VUSE <.MEM_4(D)>
return _3;
------------------------------------
Total Basic Blocks: 2
Total Gimple Statements: 3
------------------------------------

int foo (int p1, int p2)
{
  int D.2345;
  int _3;

  <bb 2> :
  _3 = p1_1(D) * p2_2(D);

  <bb 3> :
<L0>:
  return _3;

}



;; Function main (main, funcdef_no=1, decl_uid=2339, cgraph_uid=2, symbol_order=1)

===== Basic block count: 1 =====
----- Statement count: 1 -----
a_1 = 12;
----- Statement count: 2 -----
b_2 = 13;
----- Statement count: 3 -----
# .MEM_4 = VDEF <.MEM_3(D)>
c_5 = foo (a_1, b_2);
----- Statement count: 4 -----
# .MEM_6 = VDEF <.MEM_4>
printf ("%d\n", c_5);
----- Statement count: 5 -----
_7 = 0;
===== Basic block count: 2 =====
----- Statement count: 6 -----
<L0>:
----- Statement count: 7 -----
# VUSE <.MEM_6>
return _7;
------------------------------------
Total Basic Blocks: 2
Total Gimple Statements: 7
------------------------------------

int main ()
{
  int c;
  int b;
  int a;
  int D.2347;
  int _7;

  <bb 2> :
  a_1 = 12;
  b_2 = 13;
  c_5 = foo (a_1, b_2);
  printf ("%d\n", c_5);
  _7 = 0;

  <bb 3> :
<L0>:
  return _7;

}

This output meets all the requirements set by the professor, which are listed in the introduction.
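With more than a couple of functions, the dump gets long, so it can help to pull out only the totals. The following is a rough sketch, assuming the dump keeps exactly the ";; Function" header and the "Total ..." lines shown above. The sample file is built from those lines, and the awk script is my own illustration rather than part of the pass.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Sample dump containing only the lines the summary needs (copied from the output above).
dump=$(mktemp)
cat > "$dump" <<'EOF'
;; Function foo (foo, funcdef_no=0, decl_uid=2337, cgraph_uid=1, symbol_order=0)
Total Basic Blocks: 2
Total Gimple Statements: 3
;; Function main (main, funcdef_no=1, decl_uid=2339, cgraph_uid=2, symbol_order=1)
Total Basic Blocks: 2
Total Gimple Statements: 7
EOF

# Remember the current function name and block count, print one line per function.
summary=$(awk '
  /^;; Function /            { name = $3 }
  /^Total Basic Blocks:/     { blocks = $4 }
  /^Total Gimple Statements:/ { printf "%s: %d blocks, %d statements\n", name, blocks, $4 }
' "$dump")

echo "$summary"
rm -f "$dump"
```

To try it on a real dump, replace the sample file with the path to hello.c.265t.amullagaliev (the number in the file name may differ on your build).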

These two dump-files may be found on github as well: click here.

Code Limitations

I could fill another whole blog post with the limitations of this code. However, this pass only serves as a counter and function name printer :)

As for the capabilities, it does everything listed in the introduction.

Conclusion

I hope someone found this post helpful, and was able to reproduce everything written here!

Honestly, I faced some minor challenges. The first was attention: I had to keep track of many files and always make sure that I had registered the pass in the GCC source files. The second was the time it took to rebuild after re-creating the Makefile. For some reason, it took 15 minutes on x86 and just 2 minutes on AArch64. I think it happened because the servers were under heavy load from other students, since everyone was trying to build. I wouldn't be surprised if someone was building GCC from scratch :D

In my opinion, this is one of the most interesting courses; I am doing something new and mind-blowing. I really appreciate the professor's efforts and explanations. Even though we don't have much documentation, he still manages to deliver the content by writing his own tutorials and explaining everything clearly in his videos.

See you in the next posts. I have a lot of things to do: Lab03, Lab05 and the rest of the project stages!

SPO600: Lab 4 - Building GCC

Amir Mullagaliev — Fri, 07 Mar 2025 20:29:27 +0000

Introduction
- What is GCC?
- Where to Get Source Code?
Aarch64 vs x86
- CPU Specifications
- Cache Memory
Build Preparation
- Screen
Time to Build!
- x86
- AArch64
Install the Build
- Result
- Building C Programs
Experiments
- Changed Timestamp Experiment
- Result
- Rebuild Software Without Making Any Changes
- Result
Conclusion

Introduction

For those who have never read my blog, I am writing about the Labs and Project for Seneca Polytechnic's Software Optimization and Portability course, or SPO600.

Today, I am going to talk about lab number four. During this lab, I am going to build and install the GCC compiler using make.

First of all, I would like to share all of the resources I used in order to finish this lab:

GCC Build Guide - Provided by professor
GCC source code - Found on this page
Screen - Provided by professor
Makefile - Provided by professor

What is GCC?

Before I start, you have to understand what GCC is. The GNU Compiler Collection (GCC) is a collection of compilers from the GNU Project that supports various programming languages, hardware architectures and operating systems.

Where to Get Source Code?

The very first step of building GCC is to get the source code. I used the code provided by GNU here, so I cloned it:

git clone git://gcc.gnu.org/git/gcc.git

Since we are using two servers with two different architectures in this course: aarch64 and x86, I have to execute all the steps explained in this blog for both of them.

Going forward, let's understand the difference in those specifications that matter most in terms of this lab and make a prediction based on it. Mostly, we care about the speed of the build.

Aarch64 vs x86

CPU Specifications

Feature	x86_64	aarch64
CPU Cores	10 cores, 20 threads (2 threads per core)	16 cores, 16 threads (1 thread per core)
Clock Speed	3.70GHz base, up to 4.70GHz	Not specified
Vendor	Intel	ARM

Cache Memory

Cache Level	x86_64	aarch64
L1 Data	320 KiB total (32 KiB per core)	512 KiB total (32 KiB per core)
L1 Instruction	320 KiB total (32 KiB per core)	768 KiB total (48 KiB per core)
L2	10 MiB total (1 MiB per core)	8 MiB total (1 MiB per 2 cores)
L3	19.3 MiB (shared)	8 MiB (shared)

Based on the architectural comparison, some factors will likely influence the build performance.

Thread Advantage: The x86_64 system has 20 threads vs. 16 threads on the ARM system.
Cache Considerations: The ARM system has larger L1 caches, but the x86_64 system has a significantly larger L3 cache: 19.3 MiB vs 8 MiB. Building GCC involves processing large amounts of code, so the larger L3 cache may give x86_64 a meaningful advantage.
Instruction Sets: As far as I know, x86_64 has extensive specialized instruction sets that may accelerate certain compilation tasks, while the ARM system uses a more streamlined approach.

My prediction is that the x86_64 system will complete the GCC build faster.
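The thread counts also matter for how the build is launched. The build command later in this post uses -j 24 on both servers, which is more jobs than either machine has hardware threads. The sketch below only does that arithmetic with the numbers from the table and shows one way to read the thread count on the current machine. The report helper is hypothetical, and nothing here proves that the ratio explains the build times.

```bash
#!/usr/bin/env bash
set -euo pipefail

# report NAME THREADS JOBS: how many make jobs compete for each hardware thread.
report() {
  awk -v n="$1" -v t="$2" -v j="$3" \
    'BEGIN { printf "%s: %d jobs on %d threads = %.2f jobs per thread\n", n, j, t, j / t }'
}

# Thread counts come from the CPU table above; 24 is the -j value used in this post.
x86=$(report x86_64 20 24)
arm=$(report aarch64 16 24)
echo "$x86"
echo "$arm"

# Thread count of the machine running this script (nproc is GNU coreutils; getconf is a fallback).
local_threads=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "this machine: $local_threads threads, so a matching flag would be -j$local_threads"
```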

Build Preparation

First of all, I created two new directories:

mkdir -p git gcc-build-001

Then, I went to the git directory and cloned gcc:

cd git
git clone git://gcc.gnu.org/git/gcc.git

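As a next step, I had to configure the GCC source code for a custom build. Since we already have GCC located at /usr/local, we want to install another version at ~/gcc-test-001. I ran the configure script from inside the build directory, ~/gcc-build-001, which is where the Makefile ends up: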

~/git/gcc/configure --prefix=$HOME/gcc-test-001

Screen

Everyone who wants to build GCC should understand that it takes at least 20 minutes in the best case, and it may take up to several hours. Therefore, if something goes wrong with your connection to the server while the build is running, the build will be interrupted.

To protect against this, we can use the screen tool. It lets you detach from a session and do whatever you want in parallel while the work continues in that detached session. Surprisingly, it is a pretty simple tool to use.

Create a session (or reattach to an existing one):

screen -RaD

Run any long-running command you want:

make

Detach from the session while it finishes the task using this key combination:

Ctrl+A, then D

Reconnect to the session:

screen -RaD

Time to Build!

Before we start building GCC, we have to open a screen session so that we can detach from it later:

screen -RaD

Going forward, our Makefile is already located at ~/gcc-build-001, so we have to be inside this directory in order to start the build:

cd gcc-build-001

Once we are inside the appropriate directory, it is time to perform the build by typing make. However, we want to compare the time it takes and save the stdout and stderr inside of the build.log file. Therefore, we need a little more complex command than just make:

time make -j 24 |& tee build.log
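To see why the command uses |& rather than a plain pipe, here is a small sketch. The fake_build function is a stand-in for make that I made up for the demo. It writes one line to stdout and one to stderr, and the script counts how many lines reach the log in each case.

```bash
#!/usr/bin/env bash
set -euo pipefail

log=$(mktemp)

# Stand-in for make: one line on stdout, one on stderr.
fake_build() {
  echo "compiling passes.cc"
  echo "warning: something odd" >&2
}

# |& is bash shorthand for 2>&1 |, so tee receives both streams.
fake_build |& tee "$log" > /dev/null
both=$(grep -c '' "$log")

# A plain | forwards stdout only. stderr is sent to /dev/null here just to keep
# the demo quiet; without that it would go to the terminal, not to the log.
fake_build 2> /dev/null | tee "$log" > /dev/null
stdout_only=$(grep -c '' "$log")

echo "lines logged with |&: $both"
echo "lines logged with |:  $stdout_only"
rm -f "$log"
```

In bash, time is a keyword that applies to the whole pipeline, so the reported time covers both make and tee.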

I'll be back once it builds on both architectures with the results!

x86

x86_64 won the competition and finished first with a time of 47m37.647s, as I predicted!

AArch64

It took a year to finish. With a time of 124m55s, aarch64 architecture loses...as...expected...

Install the Build

After you've performed a build, it is time to install it by simply typing make install.

IMPORTANT!

To prove that the installation worked, I first recorded the version of gcc that had been installed earlier by the admin or by the system.

Now I am ready to install it!

As you remember, during the build preparation stage, we configured the GCC source code for the custom build by typing ~/git/gcc/configure --prefix=$HOME/gcc-test-001. After the installation finished, we can see that ~/gcc-test-001 appeared!

The installed GCC is now located there! To use it, we have to make a small adjustment to the environment:

PATH=$HOME/gcc-test-001/bin:$PATH

... to include the bin directory within the installation directory as the first directory!
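The reason the new directory has to come first is that the shell searches PATH from left to right and stops at the first match. Here is a minimal sketch of that behavior, assuming two temporary directories that stand in for the system location and the custom prefix. The demo-cc name is hypothetical, so it cannot clash with a real compiler.

```bash
#!/usr/bin/env bash
set -euo pipefail

root=$(mktemp -d)
mkdir -p "$root/system/bin" "$root/custom/bin"

# Two stand-in compilers with the same name in different directories.
printf '#!/bin/sh\necho "system build"\n' > "$root/system/bin/demo-cc"
printf '#!/bin/sh\necho "custom build"\n' > "$root/custom/bin/demo-cc"
chmod +x "$root/system/bin/demo-cc" "$root/custom/bin/demo-cc"

# Only the "system" copy is on PATH at first.
PATH="$root/system/bin:$PATH"
before=$(command -v demo-cc)

# Same adjustment as in the post: prepend the custom bin directory.
PATH="$root/custom/bin:$PATH"
hash -r   # forget any cached command locations
after=$(command -v demo-cc)

echo "before: $before"
echo "after:  $after"
rm -rf "$root"
```

Note that a PATH assignment typed at the prompt only lasts for the current shell session.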

Result

NOTE! As you can see, we have installed an experimental or development version of GCC, which differs from the initial one!

Building C Programs

This development version of GCC can compile simple C programs:

I could have stopped here. However, we have to run a couple more experiments!

Experiments

This lab includes only two experiments, which answer two questions:

How long does it take to rebuild GCC if we change the timestamp of one file?
How long does it take to rebuild GCC without making any changes, just by invoking the make command?

Changed Timestamp Experiment

First of all, we have to figure out how to change a timestamp...

I did some research, and it turns out that it is as simple as using the touch command... Honestly, I have used this command a lot, but only to create new files; I had never applied it to existing ones. It proves that we learn new things every day, no matter how much we already know...

Secondly, we have to find the file called passes.cc.

Here are the steps I took to complete this experiment:

Find the file called passes.cc:

find ~/git/gcc -name "passes.cc"

result:
/home/amullagaliev/git/gcc/gcc/passes.cc

Update timestamp:

touch /home/amullagaliev/git/gcc/gcc/passes.cc

Go to ~/gcc-build-001:

cd ~/gcc-build-001

Rebuild the software by re-issuing the make command.
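Why does touching one file trigger a rebuild? make compares modification times and rebuilds a target when one of its prerequisites is newer. The sketch below imitates only that comparison with bash's -nt test on two temporary files. The file names and the needs_rebuild helper are assumptions made for the demo, and no real compilation happens.

```bash
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
src="$work/passes.cc"
obj="$work/passes.o"

# Pretend the source was written first and the object was built a day later.
touch -t 202001010000 "$src"
touch -t 202001020000 "$obj"

# needs_rebuild SRC OBJ: true when SRC is newer than OBJ, the same timestamp
# comparison make relies on.
needs_rebuild() { [ "$1" -nt "$2" ]; }

if needs_rebuild "$src" "$obj"; then before="stale"; else before="up to date"; fi

# touch on an existing file only bumps its timestamp to "now".
touch "$src"

if needs_rebuild "$src" "$obj"; then after="stale"; else after="up to date"; fi

echo "before touch: $before"
echo "after touch:  $after"
rm -rf "$work"
```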

Result

Fortunately, it took only 59s to rebuild.

Rebuild Software Without Making Any Changes

This is the easiest part of the lab; I just have to re-issue the make command without any prior manipulations! To measure the time, I used: time make -j 24 |& tee build.log.

Result

It was the fastest build of the day, taking only 15s. I am so happy to see this number :D

These experiments went as I expected.

Conclusion

This lab took so much time due to the build time of GCC. However, I really enjoyed it. I am satisfied with the results, even though I had to start from scratch and rebuild twice.

Unfortunately, I wasn't able to go through all the steps on aarch64. The only thing that I performed there was the build, since the server doesn't have enough space... It has only 12MB available, which isn't enough to install GCC.

Anyway, I may reach back with the experimental results later once we get some more space at the Aarch64 server. Thanks a lot to those who read this, I put a lot of effort into my blog posts!

OSD700: Sprint 4 - Planning

Amir Mullagaliev — Thu, 06 Mar 2025 13:09:42 +0000

Introduction

This is the beginning of the second half of the term, which means that we have already made a significant number of contributions. However, it is not the end. At this point, I am preparing a clear plan with a set of goals that I want to achieve by the end of the term.

During the last lecture, the professor talked about the problems we are experiencing now. One of the problems is that we are working hard on the project, but we jump between the tasks without finishing them 100%. It was a pretty good point since I felt that changing the focus and goals of my work wasn't resulting in high-quality results and PRs.

Therefore, I decided to fix existing bugs caused by my previous PRs. Going forward, my primary purpose for the upcoming lecture is to set meaningful goals that I would be proud of after achieving them during the last half of the term.

Bug That I Finally Fixed

In the middle of the previous half of the term, I worked on the Files Attachment UI. Here's my PR:

[UI] File Attachment Added #804

mulla028 posted on Jan 27, 2025

Closes #794

Description (UPDATED)

This PR improves UX in terms of file management. Previously user was only able to see the file in the PromptForm, but now he is able to use paperclip icon to see all files attached in modal window, and manage them:

Attach files... (Same as in OptionsButton)
Delete (using removeFile from src/lib/fs.ts)
Download (using downloadFile from src/lib/fs.ts)

If user never attached the file(s) paperclip will trigger file attachment process without opening modal window.

Preview (UPDATED)

View on GitHub

During the implementation of the new UI feature, I faced a bug that prevented the user from attaching new files after deleting all of the existing ones. I was banging my head against the wall since I wasn't able to find the solution.

However, after the lecture that I was talking about in the introduction, Aldrin opened two issues addressing this problem.

Issue 1

Issue 2

Solution

To fix it, I had to dive into my own code to understand why this was happening. The cause was pretty simple: I used two <Input> elements, and one of them wasn't always rendered because of the !isAttached condition:

Before:

{!isAttached && (
  <Input
    multiple
    type="file"
    ref={fileInputRef}
    hidden
    onChange={handleFileChange}
    accept={acceptableFileFormats}
  />
)}

After:

<Input
  multiple
  type="file"
  ref={fileInputRef}
  hidden
  onChange={handleFileChange}
  accept={acceptableFileFormats}
/>

This way, a single input element is always rendered, with no duplication. Moreover, I had to refresh the fileInputRef after every deletion to ensure that it is available for future use.

Here's the PR I opened:

[FIX] File Input bug fixed #849

mulla028 posted on Mar 05, 2025

Closes #837 & #838

Description

During the implementation of #804 I couldn't fix the problem after the deletion of all files, I couldn't interact with paperclip button unless refreshed the page. This time I realized that problem was hidden inside of the <Input> elements that were duplicated, and also we didn't need a condition to render it, we had to render it in any case. Good lesson for me tho :)

View on GitHub

Result:

As you may notice, when the number of files drops to zero, the modal window now closes, and the user has to upload a new file to open the file attachments modal window again.

Conclusion

It is so hard to pick a focus area, and I would like to figure it out during today's lecture. Next week, you will learn about the new focus area I've picked. Thank you for reading, see you next week!
