DEV Community: mahesh_attarde

Essence of Machine Learning Hardware

mahesh_attarde — Wed, 05 Apr 2023 06:12:30 +0000

Interesting Notes from my work.

Here are a few design considerations for a machine learning ASIC. These design traits aim at efficiency/effectiveness per watt.

Design Parameter : ML Model

Feed-forward ANNs are faster: with no data feedback, the design is simpler.
LSTMs require accumulation of partial results, which increases complexity.
The type of neural network determines which hardware simplifications are possible; use domain knowledge.

Design Parameter : Hardware use

Common Sense Idea : Inference and training hardware are different (most of the time)

Hardware for training takes as input a large dataset, a cost function, an optimization function and the model
- A large dataset implies more memory for features (bag of features)
- The cost and optimization functions are the feedback that improves the model by identifying error; this implies "latency of feedback" matters.
- Support for debugging the model to look for training errors and generalization errors implies "profiling interfaces"
- Should have support for workflows
Hardware for inference
- Its input is a dataset (not as large as the training dataset) and the model.
- The model can be compressed based on the domain of use.

Design Parameter: Input Dataset

Common Sense Idea: Use known information to minimize the dataset

Use of encoding.
Use of precise data storage. Smaller register size implies better energy/storage efficiency. E.g. take an ML model with human age as an input parameter. Naively it is a 32-bit int, but a human age is < 100, which implies 7 bits. If, after understanding the use cases, the model only needs age ranges of width 10, that implies 4 bits of data.
Use of data quantization, e.g. bfloat16 or MSFP (great common sense idea)
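
As a rough illustration of the two ideas above, here is a minimal C++ sketch. It assumes age buckets of width 10 stored in 4 bits, and a bfloat16 conversion done by plain truncation of a float32 (real hardware usually rounds). All function names are made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: map an age in years to a 4-bit bucket of width 10.
// Ages 0-9 -> 0, 10-19 -> 1, ...; ages of 150 and above all map to 15.
inline std::uint8_t ageBucket(std::uint32_t ageYears) {
    std::uint32_t bucket = ageYears / 10;
    return static_cast<std::uint8_t>(bucket > 15 ? 15 : bucket);
}

// Two 4-bit buckets share one byte: the first goes in the high nibble.
inline std::uint8_t packBuckets(std::uint8_t first, std::uint8_t second) {
    return static_cast<std::uint8_t>(((first & 0x0F) << 4) | (second & 0x0F));
}

// float32 -> bfloat16 by keeping the top 16 bits
// (sign, 8 exponent bits, 7 mantissa bits). Truncation only, no rounding.
inline std::uint16_t toBfloat16(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return static_cast<std::uint16_t>(bits >> 16);
}

// bfloat16 -> float32 by padding the low 16 bits with zeros.
inline float fromBfloat16(std::uint16_t half) {
    std::uint32_t bits = static_cast<std::uint32_t>(half) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

With this layout, two age inputs cost one byte instead of eight, and a float costs two bytes instead of four, at the price of resolution.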

Design Parameter : Signal Data Processing

For vision and audio signals as input, FFT has saved computation significantly!!! (WOOWWWW!)

Design Parameter : ALU (PE) Computation

List all unit operations and reduce them to the bare minimum required; this implies minimum area.
For each operation, minimize register size -> argument datatypes.
For each operation, minimize microcode to the basics, e.g. MAC design (multiply and add): a MAC takes two N-bit registers, the multiply gives a 2N-bit result, and the add brings it back to N bits. Smaller registers and less microcode imply better energy/storage efficiency.
Unit Operations in ALU
- The granularity of a unit operation designed in hardware should maximize parallelization, at the cost of space first and time later.
- Examples of Unit Operation
  - Convolution aka (dot product), matrix multiply. Learnings from CPU-based MatMul led to systolic execution of MatMul (MAC) over general MatMul (GEMM).
  - Pooling aka (normalization), thresholding. Learning from comparator design in HDL: create a minimized comparator logic circuit. (Spatial reduction case)
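
The MAC description above can be sketched in C++ as follows. This is only an illustration under assumed widths (N = 8 bit operands, a 32-bit accumulator) and with hypothetical function names; it is not a hardware description.

```cpp
#include <cstddef>
#include <cstdint>

// One MAC step: two N-bit operands (N = 8 here) give a 2N-bit product,
// which is added into a wider accumulator so partial sums do not overflow.
inline std::int32_t mac(std::int32_t acc, std::int8_t a, std::int8_t b) {
    std::int16_t product = static_cast<std::int16_t>(a * b);  // 8b x 8b fits in 16b
    return acc + product;
}

// Dot product as a chain of MACs, the unit operation behind
// convolution and matrix multiply.
inline std::int32_t dotProduct(const std::int8_t* x, const std::int8_t* w,
                               std::size_t n) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc = mac(acc, x[i], w[i]);
    }
    return acc;
}

// Bring the wide accumulator back to N bits, saturating at the int8 limits.
inline std::int8_t saturateToInt8(std::int32_t acc) {
    if (acc > 127) return 127;
    if (acc < -128) return -128;
    return static_cast<std::int8_t>(acc);
}
```

The only storage that needs the wide type is the accumulator; operands and results stay at N bits.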

Design Parameter : MOV data

More cycles to get data implies more stalls in the ALU, which implies higher energy.
More frequent access to distant memory in the memory hierarchy increases energy consumption explosively.
Always pipeline data; move data in spatial and temporal ways.
- A machine learning programming language needs tensor loops for time and space maximization
Decoding Cost of MOV
- Reg < Cache < Buffer < DRAM
- Larger register size is better
- Double buffering is good
- RAM technology: SRAM, DRAM, HBM
- Buffer specialization: read-only and write-only buffers perform better than R/W buffers
- Quantifiable Design Concept

Co-designing PE-MEM

For a typical operation: 2x DDR MOV RD to Reg, 1 MAC, 1x DDR MOV WR from Reg
Considering NN Type,
- Calculate Order of Execution and Data Memory Access Patterns
  - idea check : Why not build a simulator that identifies patterns of data movement?
- For the accumulation PE pattern: adder-tree implementation, and systolic accumulation for matrices
- For the non-accumulation pattern: direct-wired multicast, and systolic multicast for matrices
- Keep stationary data in the closest register, without movement if possible
  - idea check : Register allocation feedback based on spill factor and movement factor.
  - Move input into registers; keep weights stationary, or keep partial sums and output stationary.
- Scale MAC and local memory up.
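
To make the "keep partial sums stationary" point concrete, here is a small C++ sketch that counts accesses to far memory for a dot product. The Traffic counter and both function names are hypothetical; the naive version follows the 2 reads, 1 MAC, 1 write pattern quoted above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical counter: every access to "far" memory (e.g. DDR) is tallied.
struct Traffic {
    std::size_t reads = 0;
    std::size_t writes = 0;
};

// Naive pattern: per MAC, 2 reads, then the partial sum is written back.
// Assumes x and w have the same length.
inline void dotNaive(const std::vector<int>& x, const std::vector<int>& w,
                     int& out, Traffic& t) {
    int acc = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        int a = x[i];
        int b = w[i];
        t.reads += 2;
        acc += a * b;
        out = acc;          // partial sum goes back to memory after every MAC
        t.writes += 1;
    }
}

// Output stationary: the partial sum stays in a register until the end.
inline void dotOutputStationary(const std::vector<int>& x,
                                const std::vector<int>& w,
                                int& out, Traffic& t) {
    int acc = 0;            // stationary partial sum
    for (std::size_t i = 0; i < x.size(); ++i) {
        t.reads += 2;
        acc += x[i] * w[i];
    }
    out = acc;              // single write at the end
    t.writes += 1;
}
```

For a length-n dot product the writes drop from n to 1, while the result is the same.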

Design parameter : ISA Decoder

CPU/GPU decoders are complex for ML ops, which brings inefficiency into the pipelining circuit.
Short decoder sequences are better: small and efficient, for unit ops only.
SIMD decoder <> VLIW decoder ckt

Design Parameter : Compiler Support

Input (tensor) slicing at programming language and backend level
Re-organize loop order, use polyhedral transformations for better memory movement, split loops into more loops.
Optimize for Area x Energy
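
Loop splitting and re-ordering can be sketched with a tiled matrix multiply. This is a minimal C++ illustration, assuming square row-major matrices and a hypothetical tile size parameter; a real compiler would pick the tile from the target's local memory size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major n x n

// Reference triple loop.
inline Matrix matmulNaive(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Same computation with every loop split into a tile loop and an
// intra-tile loop, so each tile of a, b and c is reused while it is close.
// tile must be greater than zero.
inline Matrix matmulTiled(const Matrix& a, const Matrix& b, std::size_t n,
                          std::size_t tile) {
    Matrix c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t kk = 0; kk < n; kk += tile)
            for (std::size_t jj = 0; jj < n; jj += tile)
                for (std::size_t i = ii; i < std::min(ii + tile, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + tile, n); ++k) {
                        const float aik = a[i * n + k];  // kept in a register
                        for (std::size_t j = jj; j < std::min(jj + tile, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
    return c;
}
```

Both functions accumulate each output element in the same k order, so they produce identical results; only the memory access pattern changes.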

Design Parameter : Choice of Hardware transistor tech

High frequency and high throughput result in heating
Low frequency, high throughput and optimal area serve better.
Dielectric silicon technology in nm
Photonic Technology
Calculate compute density and the compute-to-data-movement ratio to classify a design as IO dominant or compute dominant
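
One possible way to compute that ratio, sketched in C++. The matrix multiply counts assume the ideal case where each matrix moves exactly once, and machineBalance is a hypothetical figure (peak operations per second divided by peak bytes per second) that would come from the target hardware.

```cpp
#include <cstddef>

// Operations per byte moved.
inline double computeToDataRatio(double operations, double bytesMoved) {
    return operations / bytesMoved;
}

// n x n x n matrix multiply: 2*n^3 operations (multiply + add),
// three n x n matrices moved once each in the ideal case.
inline double matmulRatio(std::size_t n, std::size_t bytesPerElement) {
    double ops = 2.0 * n * n * n;
    double bytes = 3.0 * n * n * bytesPerElement;
    return computeToDataRatio(ops, bytes);
}

enum class Bound { IO, Compute };

// Below the machine balance the memory system is the limit,
// at or above it the compute units are.
inline Bound classify(double ratio, double machineBalance) {
    return ratio < machineBalance ? Bound::IO : Bound::Compute;
}
```

For matrix multiply the ratio grows linearly with n, which is why large MatMuls tend to be compute dominant while small ones are IO dominant.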

Notes on Code Review

mahesh_attarde — Fri, 17 Mar 2023 10:51:43 +0000

"Build the thing right" is the motivation for code reviews. Depending on context, each of the following points can be extended to any review.
My base understanding of reviewing someone's code comes from the Programmer Competency Matrix
[https://sijinjoseph.com/programmer-competency-matrix/].

Here is the checklist I noted down from two of the finest engineers.

Size of Code Review
- Is this a single-objective change?
- Multiple-objective change
  - Is it simple enough to review, or does it take more than 15 minutes?
    - If it takes longer, break it down
    - Check for the anti-pattern of clubbing multiple changes into one check-in
Requested Functionality Check
- Unit tests
  - are complete and correct ?
  - Missing tests?
- Algorithm
  - Time Complexity
  - Space Complexity
  - Contextual Appeal of Algorithm
- Data Structure
  - New Data Structure
    - Data Layout ?
    - Interfaces are stable?
    - Is there exposed state?
  - Existing Data structure
    - Is Data Structure Abused in usage?
  - Domain of Application
- Are there assumptions hidden in plain sight?
- Are there unhandled cases for those assumptions?
Coding Anti-Patterns
Code Style Consistency
- Style in Sync with existing style?
- Tool: clang-format
Language features
- Are there any Language Features abused?
- Can there be scope for using language features?
Memory Issue
- Is there chance of memory corruption?
- Are there any data races?
- Tool: valgrind
Performance Issue
- Deterioration due to integration of Algorithmic Functionality
- Performance Test
Concurrency Issue
- Multi-threaded
- Distributed
Contextual Appeal for Integration
- Environment variable vs command line options
Risk of Code Change
- Change affects only the area that was requested, guarded by an option with a default
- Is there a disaster scenario?
- Are there deployment issues?
Architectural Change
- Is there a library dependency? Is it explicitly mentioned or implicit?
- Does the architectural change add value over time / for maintenance?
Review of "Language" (C++ in this Context)
- Check Data types used, CV qualifiers, storage specifiers
- Overflow and underflow conditions?
- New Class
- Checklist of Default constructors, operators
- check behaviors are mocked with coverage
- and follow up on guideline based on use cases.
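
As an example of the "overflow and underflow conditions" item, a reviewer might ask for arithmetic on untrusted values to go through a checked helper. A minimal sketch, with a hypothetical checkedAdd function for 32-bit signed integers:

```cpp
#include <cstdint>
#include <limits>

// Writes a + b to result and returns true, or returns false
// (leaving result untouched) when the sum would leave the int32 range.
inline bool checkedAdd(std::int32_t a, std::int32_t b, std::int32_t& result) {
    const std::int32_t maxValue = std::numeric_limits<std::int32_t>::max();
    const std::int32_t minValue = std::numeric_limits<std::int32_t>::min();
    if (b > 0 && a > maxValue - b) return false;  // would overflow
    if (b < 0 && a < minValue - b) return false;  // would underflow
    result = a + b;
    return true;
}
```

The comparison is done before the addition because signed overflow itself is undefined behavior in C++.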

Although this checklist was learnt from an experienced programmer and an awesome engineer known to me,
it may be possible, after deep thought, to come up with a general framework that any rookie can use.

HTH!

Code Search and Navigation with livegrep

mahesh_attarde — Fri, 10 Mar 2023 15:48:45 +0000

Using Code Search : Livegrep

After trying out a bunch of indexing (tag) tools, both production and command-line, I found one that indexes a git repo, uses n-grams, has a web frontend and includes a docker setup. It is code search that just works!

(Original Github) [https://github.com/livegrep/livegrep]

These are the instructions to run the source indexer.

Go to the source directory.
Clone the repo and output the index file with the idx suffix into the current directory, mounted as /data in the container

docker run -v ${PWD}:/data ghcr.io/livegrep/livegrep/indexer /livegrep/bin/livegrep-github-reindex -repo doxygen/doxygen -http -dir /data

Create Network

docker network create livegrep

Create the backend process, load it with the idx file and accept RPC calls over gRPC

docker run -d --rm -v ${PWD}:/data --network livegrep --name livegrep-backend ghcr.io/livegrep/livegrep/base /livegrep/bin/codesearch -load_index /data/livegrep.idx -grpc 0.0.0.0:9999

Connect the web app to the backend process and publish the web application

docker run -d --rm --network livegrep --publish 8910:8910 ghcr.io/livegrep/livegrep/base /livegrep/bin/livegrep -docroot /livegrep/web -listen 0.0.0.0:8910 --connect livegrep-backend:9999

Using It
1. Open "http://localhost:8910" (the published port of the web application) in a browser
2. Search for a code literal
Best Parts
1. LIVE GREPPING! Grep works as you type. Faster than egrep, grep
2. Click on a search result, it takes you to the github file with anchor! :D

HTH!

[GDB-Quick] Using Eclipse Standalone debugger GUI

mahesh_attarde — Thu, 22 Sep 2022 14:08:09 +0000

GDB is a debugger whose basic functionality does everything from the command line. GDB-TUI provides a basic text UI for debugging source code with minimum dependencies. At times, a windowed GUI is a great help: you can look at a lot of information in one snapshot, edit it and view it without explicitly typing all the commands and shortcuts.
Here is how we can use eclipse-cdt as a standalone debugger.
No need to create projects!

Download Eclipse-CDT package
Unzip Eclipse-CDT.zip
Switch to the unzipped folder eclipse/plugins/org.eclipse.cdt.debug.application_*/scripts
/bin/sh ./cdtdebug.sh

This will install the debugger into your $HOME/cdtdebugger/ area

Now connect to a local process with a GUI prompt. No need to list processes on the terminal and then type the process id!
$HOME/cdtdebugger/cdtdebug.sh -a

Connect to a remote process
$HOME/cdtdebugger/cdtdebug.sh -r address:port

Debug a binary
$HOME/cdtdebugger/cdtdebug.sh -e executable

[GDB-Quick] Prints – No Need to do "std::cout" and compile again

mahesh_attarde — Fri, 21 Jan 2022 16:59:09 +0000

For debugging, the most obvious way to start is adding a print.
When we have gdb, we don't need that (assuming the software is built in debug mode).
Let's see alternate ways to do that in gdb.

Printing a variable at some line.

We have breakpoints at hand. When a breakpoint hits, we can print the required variable values or any message needed (like which if-else branch was taken (-q), ¯\_(ツ)_/¯ ).

(gdb) break source.cpp:50  <enter>
(gdb) commands <enter>
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>print "hello World!" 
>end

When the breakpoint hits in source file source.cpp at line 50, it prints "hello World!". No need to edit code or recompile it!

On a breakpoint we specify a series of commands to run. It can be as simple as printing a single variable or as involved as printing a complete data structure.

The same can be done with a dprintf breakpoint; however, the first approach provides more flexible and easier-to-use formatting. I hate dprintf!

Print but more efficient

Our first command of use is

(gdb) print a
(gdb) p a

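This is fine, even if a is an array.
At times we are interested in buffer contents irrespective of the data structure, treating it as a memory area.

(gdb) print a[0]@10

The left operand of @ is the first object in memory and the right operand is the count, so the above prints 10 consecutive elements starting at a[0]. For a pointer p, use *p@10. Try it and see how cool that feature becomes!

The x (examine) command does the same with explicit formatting: x/NFU takes a count N, a format letter F and a unit size U. For example, printing 20 words in hex starting at the address of a

(gdb) x/20xw &a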

We also want to check the same variable at different breakpoints

display <var_name>

is easier than doing p every time.

HTH, Happy Hacking!

[GDB-Quick] BreakPoints - Log/Source

mahesh_attarde — Mon, 17 Jan 2022 16:00:43 +0000

For debugging, the most obvious way to start is adding a print. The second most obvious is putting breakpoints and printing. As simple as it sounds, it may not be useful with large-scale software spanning a variety of architectures.

Since each debugging session can take several runs, we need an efficient setup, hence the following post.

Save Your Breakpoints for the future.

Save breakpoints to a file with this command.

(gdb) save breakpoints bpt_file

Rookie stuff!
Here is more: what is the right time to save breakpoints? :P
For system-level debugging, creating local breakpoint files is efficient, at times multiple ones, depending on the debugging flow.

The saved file is a list of gdb commands, so bpt_file can be edited in a text editor when needed.

Load Breakpoints from history

bpt_file holds gdb commands; source it from the command line or inside gdb, it works!

(gdb) source bpt_file

For VIM users

You can avoid writing breakpoint descriptions like file:line or class::function with a few keystrokes. Assign a key map in VIM that logs the current file:line_num into bpt_file as

break file:line

Now the VimScript to do that is

function! Brkpt(message, file)
  " Append the line to the breakpoint file (the file is created if missing)
  call writefile([a:message], expand(a:file), 'a')
endfun

function! BrkLog(fileName,lineNum)
  " Build a gdb command such as: break source.cpp:50
  let msg = 'break ' . a:fileName . ':' . a:lineNum
  echo msg
  call Brkpt(msg,'~/gdb.txt')
endfun

Copy the content into, say, a brk.vim file and source it in vim. Call this function on a keystroke.

:nnoremap <C-G> :call BrkLog(expand('%:t'),line('.'))<CR>

Lastly, while sourcing breakpoints, locations in libraries that are not loaded yet will error out by default. In that case just set

(gdb) set breakpoint pending on

Happy Hacking!

[DopeTales] HackScripts Memory Allocation Testing

mahesh_attarde — Thu, 23 Sep 2021 13:42:06 +0000

Very often we want to test the maximum memory that can be allocated in user space. It has interesting use cases. Mine relates to creating a 4 GB memory shadow. Basically, I implemented a sample DDR of size 4 GB and took snapshots of the simulated memory at intervals, so I need to know beforehand whether that much memory is available before running the actual application. The following script lets me do that in simple loops.

#include <stdio.h>
#include <stdlib.h>

#define MINREQ      0xFFF   // arbitrary minimum
#define MAXBLOCKS   4096    // upper bound on blocks held at once

int main(void)
{
    unsigned int required = (unsigned int)-1; // adapt to native uint
    void *mem = NULL;

    // Phase 1: halve the request until a single malloc succeeds
    for (;;) {
        printf ("Required %X\n", required);
        mem = malloc (required);
        if (mem != NULL)
            break;
        if ((required >>= 1) < MINREQ) {
            printf ("Cannot allocate enough memory\n");
            return (1);
        }
    }
    free (mem);

    // Phase 2: count how many blocks of that size can be held together
    static void *array[MAXBLOCKS];
    unsigned int count = 0;
    for ( ; count < MAXBLOCKS; count++) {
        mem = malloc (required);
        if (mem == NULL) {
            printf ("Cannot allocate more memory after %u blocks\n", count);
            break;
        }
        array[count] = mem;
    }

    printf ("Memory size allocated = %X x %u blocks\n", required, count);
    for (unsigned int i = 0; i < count; i++)
        free (array[i]);
    return 0;
}

It starts from the largest request and keeps halving it until malloc succeeds, then checks how many blocks of that size can be allocated at the current moment.

Happy Hacking!

[DopeTales] GDB Hit Counter Script for nth time break on function

mahesh_attarde — Wed, 22 Sep 2021 17:47:45 +0000

A useful hack around gdb. While debugging we often want to wait until a certain function is hit for the nth time. The following script does that, the easy Python way.

Code Explanation

We have a HitMe class which inherits from gdb.Breakpoint, a class defined in the gdb Python extension. It provides two methods: __init__ defines the breakpoint on the function, and stop is the handler called on hitting the breakpoint. The return value of stop tells gdb whether to halt or not.
hitCount is just a counter to keep track of the number of function hits.
In the code, BREAK_FUNCTION is hit 40 times without stopping; gdb halts on the 41st hit.

# script.py
import gdb

hitCount = 0

class HitMe(gdb.Breakpoint):
    def __init__(self):
        # Set the breakpoint on the function of interest
        gdb.Breakpoint.__init__(self, "BREAK_FUNCTION")

    def stop(self):
        # Called on every hit; returning True halts gdb, False resumes
        global hitCount
        hitCount = hitCount + 1
        print(hitCount)
        if hitCount > 40:
            return True
        return False

print("Ran Counter For Function!")
hit = HitMe()

Running This script

gdb --args binary.o arg1
...
(gdb) source script.py
(gdb) run

This is the quickest way to get there! Happy Hacking!!!

[DopeTales] Pairing Functions

mahesh_attarde — Thu, 11 Apr 2019 05:55:41 +0000

Pairing functions are surprisingly useful for increasing application performance.
As per Wiki, a pairing function uniquely encodes two natural numbers into a single natural number.
While designing system software, this feature can be exploited to reduce the data footprint of runtime data structures.

Here is a Dope-tale about one such use.
While designing compiler diagnostics or run-time error checkers, we record the error location as a line:column
pair from user code. A typical implementation for this use case is

typedef std::uint32_t                             Line;
typedef std::uint32_t                             Column;
typedef std::pair<Line,Column>                    Location;
typedef std::map<Line,Location>               LocationTable;
// OR
typedef std::multimap<Line,Location>          LocationTable1;

This implementation has the wrong semantics. The map version does not cover multiple errors on the same line,
and both versions fail because Line alone does not uniquely identify an error and its location.

So we define problem as

0 <= Line <= 2^32-1
0 <= Column <= 2^32-1

Given < Line,Column > --uniquely--> Error, we need an O(1) search
with a single key, like...

typedef std::unordered_map< LocationHash,ErrorDescriptor > LocationTable;

So we wish to pair two integers into one, hence pairing functions.
A pairing function is a computable bijection
π : N × N → N .
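
For reference, the best-known such bijection is the Cantor pairing function, which is the form the sample implementation in this post follows. Written out with its inverse and one worked example (x, y are the two inputs, z is the paired value):

```latex
\[
\pi(x, y) = \frac{(x + y)(x + y + 1)}{2} + y
\]
\[
w = \left\lfloor \frac{\sqrt{8z + 1} - 1}{2} \right\rfloor, \qquad
t = \frac{w(w + 1)}{2}, \qquad
y = z - t, \qquad
x = w - y
\]
\[
\pi(2, 3) = \frac{5 \cdot 6}{2} + 3 = 18, \qquad
z = 18 \;\Rightarrow\; w = \left\lfloor \frac{\sqrt{145} - 1}{2} \right\rfloor = 5,\; t = 15,\; y = 3,\; x = 2
\]
```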

Line and Column are natural numbers
Practically, 32 bits are enough to accommodate each of Line and Column
A key made from both can be accommodated in 64 bits

Here is a sample implementation

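#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <utility>

typedef std::uint32_t  UI32;
typedef std::uint64_t  UI64;
typedef std::uint64_t  Key64;
typedef std::uint32_t  LocationKey;
using LocationHash  = Key64;
typedef std::uint32_t  KeyCountType;     // hit count stored per location
typedef KeyCountType   ErrorDescriptor;  // assumption: in this sample the descriptor is just a count
typedef std::pair<UI32,UI32> Location;   // <Line,Column>
typedef std::pair<LocationHash,ErrorDescriptor> LocationListItemPair;
typedef std::unordered_map<LocationHash,ErrorDescriptor> LocationTable;

// Cantor pairing. std::binary_function was removed in C++17, so no base class.
// Exact while first + second < 2^32; beyond that the product overflows 64 bits.
struct PairHashFunction
{
    Key64 operator()(UI32 first,UI32 second) const{
        Key64 first64 = (Key64) first;
        Key64 second64 = (Key64) second;
        Key64 sum64 = (first64 + second64);
        Key64 hashValue64 = (sum64 * (sum64 + 1)) >> 1;
        return hashValue64 + second64;
    }
};

// Inverse of the pairing above (valid for hashValue < 2^61, so 8*hashValue+1 fits).
template<typename HashKeyType,typename KeyType>
struct UnPairHashFunctionTemplate
{
    void operator() (HashKeyType hashValue,KeyType *first, KeyType * second) const{
        // w = floor((sqrt(8z + 1) - 1) / 2)
        HashKeyType w = (HashKeyType) ((std::sqrt((double) ((hashValue << 3) + 1)) - 1) / 2);
        // sqrt works in double, so correct a possible off-by-one
        while (((w * (w + 1)) >> 1) > hashValue) --w;
        while ((((w + 1) * (w + 2)) >> 1) <= hashValue) ++w;
        HashKeyType consum =  (w * w  + w) >> 1;
        *second =  (KeyType) (hashValue -  consum);
        *first  =  (KeyType) (w - *second);
    }
};

typedef UnPairHashFunctionTemplate<Key64,UI32> UnPairHashFunction;

class HitTable{
    LocationTable table;
public:
    // Record one more hit at this <Line,Column>
    void hit(Location locKey){
        PairHashFunction hfunc;
        Key64 key = hfunc(locKey.first,locKey.second);
        ++table[key];   // operator[] value-initializes a missing entry to 0
    }

    bool isHit(Location locKey) const{
        PairHashFunction hfunc;
        return table.find(hfunc(locKey.first,locKey.second)) != table.end();
    }
};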
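
Since Line and Column are fixed at 32 bits each, a simpler alternative is worth noting. It is not the Cantor function and not the approach used above: it just places the two values side by side in a 64-bit key, which is also a one-to-one mapping over the whole 32-bit range. A minimal sketch, with hypothetical function names:

```cpp
#include <cstdint>
#include <utility>

using Key64 = std::uint64_t;

// Pack: line in the high 32 bits, column in the low 32 bits.
inline Key64 packLocation(std::uint32_t line, std::uint32_t column) {
    return (static_cast<Key64>(line) << 32) | column;
}

// Unpack: recover <line, column> from the key.
inline std::pair<std::uint32_t, std::uint32_t> unpackLocation(Key64 key) {
    return { static_cast<std::uint32_t>(key >> 32),
             static_cast<std::uint32_t>(key & 0xFFFFFFFFu) };
}
```

The trade-off is that this relies on the known bit widths, whereas a true pairing function works for natural numbers of any size.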

HTH!

[DopeTales] String operation slowdown in Source Generator Streams

mahesh_attarde — Mon, 17 Dec 2018 17:36:37 +0000

Compilers tend to manipulate readable strings far more than any other software. Source generators written a decade ago and six months back don't really differ that much. Here is an attempt to improve that.

Source generators tend to build strings by concatenating strings. The usual go-to solution for this problem is Attempt 1.

Attempt 1

std::string oBuffer;
oBuffer.append("ContextPrefix");
oBuffer.append("_");
oBuffer.append(pClassName.c_str());
Stream<<oBuffer.c_str();

This solution creates a separate string in memory, which results in memory allocations, resizing etc., and tends to slow down the whole software stack.
Concatenating strings with the "+" operator does no good either.

Attempt 2

StringBuffer  sBuffer;
sBuffer.append("ContextPrefix");
sBuffer.append("_");
sBuffer.append(pClassName.c_str());
Stream<<sBuffer.c_str();

This solution was inspired by Java/C# and, sadly, help from Stack Overflow. While C++ does not provide this, writing our own implementation of StringBuffer gives much more flexibility in optimizing the whole operation. Still, we end up with a separate buffer.

Attempt 3

typedef std::pair<const char *,const char *> PrefixString;
// Writes "<first>_<second>" straight into the stream, no temporary string
inline std::ostream& operator<< (std::ostream& pOut,const PrefixString& pPair){
    pOut<<pPair.first<<"_"<<pPair.second;
    return pOut;
}
/*Using PrefixString in Source Generator code */
void SourceWriter::ClassBegin(const char * pQualifiedName){
    PrefixString oStr =  std::make_pair(Constants::ContextPrefix,pQualifiedName);
    mOut << oStr;
}

The third attempt relies on the fact that the strings are already present in objects.
We need not have a separate buffer to create the output string. In the example shown above,
PrefixString is a pair that is just a placeholder for char string pointers. It can write the prefixed string directly, without any intermediate buffer. Moreover, operator << is inlined. Most of the runtime decisions are moved to compile time, hence a faster alternative.
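
A self-contained variant of the same idea, for trying it outside the source generator. It is a sketch only: PrefixedName and render are hypothetical names, and a small struct is used instead of std::pair so that the operator<< overload is found by argument-dependent lookup from any namespace.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Placeholder for two existing C strings; nothing is copied or concatenated.
struct PrefixedName {
    const char* prefix;
    const char* name;
};

// Streams "<prefix>_<name>" directly into the output.
inline std::ostream& operator<<(std::ostream& out, const PrefixedName& p) {
    return out << p.prefix << '_' << p.name;
}

// Helper used only to show the result as a string.
inline std::string render(const char* prefix, const char* name) {
    std::ostringstream out;
    out << PrefixedName{prefix, name};
    return out.str();
}
```

In real generator code the stream would be the output file itself, so the only buffer involved is the one the stream already owns.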

Hope this helps.