<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manish Aradwad</title>
    <description>The latest articles on DEV Community by Manish Aradwad (@toji).</description>
    <link>https://dev.to/toji</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F248646%2F0a145408-e21e-420d-9a5c-6c19c7656b92.png</url>
      <title>DEV Community: Manish Aradwad</title>
      <link>https://dev.to/toji</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/toji"/>
    <language>en</language>
    <item>
      <title>Learn HPC with me: CPU vs GPU</title>
      <dc:creator>Manish Aradwad</dc:creator>
      <pubDate>Sat, 23 Nov 2024 12:03:29 +0000</pubDate>
      <link>https://dev.to/toji/learn-hpc-with-me-cpu-vs-gpu-44eb</link>
      <guid>https://dev.to/toji/learn-hpc-with-me-cpu-vs-gpu-44eb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39aocsngcuuhg7wxg68e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39aocsngcuuhg7wxg68e.png" alt="A Ryzen CPU" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;20 Nov 2024&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I wasn’t able to devote much time to learning HPC in the last few (okay, lots of) days. The reason is mostly my laziness, but I have also been preparing for and giving interviews. Anyway, I finally convinced myself to start learning again, so I began rereading the 4th chapter from the beginning, even though I had previously stopped only a few pages into it.&lt;/p&gt;

&lt;p&gt;The first sentence of the chapter:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;In Chapter 1, Introduction, we saw that CPUs are designed to minimize the latency of instruction execution and that GPUs are designed to maximize the throughput of executing instructions.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;made me think more deeply about the difference between CPUs and GPUs.&lt;/p&gt;

&lt;p&gt;It’s a general consensus that GPUs are &lt;em&gt;better&lt;/em&gt; than CPUs, although this is true only in terms of execution speed, and only for problems whose solutions can be coded as parallel programs. But does this mean that &lt;strong&gt;any parallel program can, in general, also be coded as a sequential program&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;When I asked Claude:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Consider a problem statement whose solution can be implemented as a parallel solution. For example, conversion of an image from RGB to grayscale is one such problem.&lt;br&gt;
My question is: “In general, can any parallel solution be written as a sequential solution?” In other words, “Are there problems whose solutions are strictly parallel in nature and cannot be solved using sequential instructions?”&lt;br&gt;
Please let me know the correct answer with the proper logic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;it gave me the ultimate answer as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Church-Turing Thesis&lt;/strong&gt; states that any computational problem solvable by a parallel algorithm is also solvable by a sequential algorithm. No computational problem exists whose solution is strictly parallel and cannot be sequentially implemented. The differences lie in efficiency, not fundamental solvability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;which was convincing enough for me. So, after understanding this, let’s come back to the original question: “are CPUs worse than GPUs?” The answer is obviously no, because there are lots of things that CPUs do better than GPUs. That is why both CPUs and GPUs are equally important in a good gaming rig.&lt;/p&gt;
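
&lt;p&gt;For intuition, here is a minimal NumPy sketch (mine, not from the book) of the RGB-to-grayscale example mentioned above: the same computation can be written as a plain sequential loop, or expressed as a single data-parallel operation, which is exactly the kind of work a GPU spreads across many threads.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def grayscale_sequential(img):
    # one pixel at a time: a long chain of small, low-latency steps
    h, w, _ = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            r, g, b = img[i, j]
            out[i, j] = 0.299 * r + 0.587 * g + 0.114 * b
    return out

def grayscale_data_parallel(img):
    # the same computation expressed as one data-parallel operation over all pixels
    return img @ np.array([0.299, 0.587, 0.114])
&lt;/code&gt;&lt;/pre&gt;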

&lt;p&gt;CPUs are designed to minimize the latency of a single instruction, while GPUs are designed to maximize the throughput of instruction execution. Here is how these two goals shape the design philosophy of each type of processor:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Control Unit Design&lt;br&gt;
— CPU: Complex control unit with sophisticated branch prediction and speculation&lt;br&gt;
— GPU: Simple control units replicated many times, focusing on parallel execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cache and Memory&lt;br&gt;
— CPU: Large caches to reduce memory latency for individual operations&lt;br&gt;
— GPU: Smaller caches but higher memory bandwidth for parallel data access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Execution Units&lt;br&gt;
— CPU: Few but complex ALUs optimized for diverse operations&lt;br&gt;
— GPU: Many simple ALUs designed for parallel floating-point operations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pipeline Design&lt;br&gt;
— CPU: Deep pipelines with out-of-order execution to minimize stalls&lt;br&gt;
— GPU: Simpler pipelines with in-order execution, compensating with thread-level parallelism&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Thread Management&lt;br&gt;
— CPU: Optimized for few high-performance threads&lt;br&gt;
— GPU: Massive thread parallelism with hardware thread scheduler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Instruction Handling&lt;br&gt;
— CPU: Complex instruction decoder, branch prediction, speculative execution&lt;br&gt;
— GPU: SIMD (Single Instruction Multiple Data) architecture for parallel execution&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, after understanding this, I started wondering whether it’s possible to create something that improves the performance of this whole big system. This system is the whole stack of computation, wherein:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We start from the silicon used to make the chips, which combine with the von Neumann architecture to create the processors on which the solution to a problem statement is run as a program written using various programming paradigms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I wanted to know how the different chips-for-AI hardware startups and companies like Cerebras, Groq, Apple, Intel, Graphcore, etc. are making changes at different stages of this system to make things faster. Even programming languages like Mojo target another stage of the system.&lt;/p&gt;

&lt;p&gt;Hope I find enough time in the future to understand how these things work, but for now, I think this much wandering is enough.&lt;/p&gt;




</description>
      <category>nvidia</category>
      <category>programming</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
    <item>
      <title>"Learn HPC with me" kickoff</title>
      <dc:creator>Manish Aradwad</dc:creator>
      <pubDate>Sat, 23 Nov 2024 11:18:53 +0000</pubDate>
      <link>https://dev.to/toji/learn-hpc-with-me-kickoff-4djb</link>
      <guid>https://dev.to/toji/learn-hpc-with-me-kickoff-4djb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1y7qx29hsvty4cji0mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1y7qx29hsvty4cji0mg.png" alt="GPU Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hey there! Manish here on the other side. I am not really gonna introduce myself. &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fmanish-aradwad.bearblog.dev%2F" rel="noopener noreferrer"&gt;Here&lt;/a&gt;'s my personal portfolio if you wanna know more about me.&lt;/p&gt;

&lt;p&gt;In this blog, I'm kicking off my blog series (which I hopefully won't abandon in the future) about my journey of learning HPC. I am learning it from the &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fwww.amazon.in%2FProgramming-Massively-Parallel-Processors-Hands%2Fdp%2F0124159923" rel="noopener noreferrer"&gt;Programming Massively Parallel Processors&lt;/a&gt; book.&lt;/p&gt;

&lt;p&gt;I plan on finishing all the exercises as well, along with the concepts. So far I have finished the 3rd chapter and am currently going through the 4th. I take notes on my iPad while reading each chapter and then write the blogs on the &lt;em&gt;bearblog&lt;/em&gt; platform. That blog is available on my personal portfolio.&lt;/p&gt;

&lt;p&gt;I am treating those blogs as go-to notes for revising any concept in the future. Along with the notes blog, I will be writing casual blogs here about any new things I discover while learning HPC. The first one follows right after this kickoff post.&lt;/p&gt;

&lt;p&gt;If you've read this far then you're probably also interested in the blogs I will write. Here's the first one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://dev.to/toji/learn-hpc-with-me-cpu-vs-gpu-44eb"&gt;Learn HPC: CPU vs GPU&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Will see you later! 👋🏻&lt;/p&gt;

</description>
      <category>c</category>
      <category>gpu</category>
      <category>cuda</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why kNN doesn't scale...</title>
      <dc:creator>Manish Aradwad</dc:creator>
      <pubDate>Thu, 12 Oct 2023 17:55:50 +0000</pubDate>
      <link>https://dev.to/toji/why-knn-doesnt-scale-3461</link>
      <guid>https://dev.to/toji/why-knn-doesnt-scale-3461</guid>
      <description>&lt;p&gt;...And how Approximate Nearest Neighbors mitigates that. &lt;/p&gt;

&lt;h2&gt;
  
  
  k-Nearest Neighbours:
&lt;/h2&gt;

&lt;p&gt;k-Nearest Neighbors (kNN) is a simple and intuitive algorithm used for classification and regression tasks. Given a new data point, the algorithm searches the training dataset for the 'k' training examples that are closest to the point and then returns the output based on these 'k' examples.&lt;/p&gt;
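
&lt;p&gt;As a rough sketch (illustrative only, using NumPy and hypothetical names), the basic brute-force version looks something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    # brute force: compute the distance from the query to every training point, O(N)
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest examples
    # classify by majority vote among the k neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
&lt;/code&gt;&lt;/pre&gt;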

&lt;p&gt;kNN is used in some modern AI solutions, such as retrieval for LLMs. But one of the major issues with kNN is its scalability. There are three important points regarding this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scalability Issues of kNN:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Brute Force Search&lt;/strong&gt;: In its basic form, for every new data point, kNN performs a brute-force search, calculating the distance to every point in the training dataset. This means the time complexity is O(N), where N is the number of data points in the dataset, assuming each distance can be computed in constant time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Intensive&lt;/strong&gt;: kNN has to store the entire dataset, which can be memory intensive for large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Dimensionality&lt;/strong&gt;: As the dimensionality of the data increases, the distances between most pairs of points tend to become similar (the curse of dimensionality). This makes the distance metric less meaningful and degrades the performance of kNN.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Approximate Nearest Neighbors (ANN):
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Algorithm:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Approximate Nearest Neighbors (ANN) algorithms aim to find the nearest neighbors in sub-linear time by allowing a small probability of error. The key idea is that in many applications, an approximate answer is sufficient. Here I will explain one of the most common ANN algorithms, Locality-Sensitive Hashing (LSH). &lt;/p&gt;

&lt;p&gt;LSH is a technique used to hash similar dataset elements into the same buckets. For example, the vector embeddings of similar words in a vocabulary will land in the same bucket. Instead of typical hash buckets, though, you can think of it as clustering points by deciding whether they lie above or below a hyperplane. &lt;/p&gt;

&lt;p&gt;These hyperplanes are randomly generated and are used to divide the vector space into small sections. If the vector embeddings of words are two-dimensional, then each hyperplane is a line.&lt;/p&gt;

&lt;p&gt;To decide whether a vector embedding lies above or below a line, we take the &lt;a href="https://www.cuemath.com/geometry/projection-vector/" rel="noopener noreferrer"&gt;dot product&lt;/a&gt; of the embedding with the normal vector of the line (a hyperplane in higher dimensions), which amounts to &lt;em&gt;projecting the vector onto that direction&lt;/em&gt;. The sign of this dot product tells us whether the embedding lies above or below the line. &lt;/p&gt;
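
&lt;p&gt;A tiny illustrative example of this sign test in 2D (the names here are just placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

normal = np.array([1.0, 2.0])     # normal vector of the 2D "hyperplane" (a line)
word_a = np.array([0.5, 1.0])
word_b = np.array([-1.0, -0.5])

print(np.sign(normal @ word_a))   # 1.0, positive: word_a lies above the line
print(np.sign(normal @ word_b))   # -1.0, negative: word_b lies below the line
&lt;/code&gt;&lt;/pre&gt;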

&lt;p&gt;Here is a visualisation of a projection in 2D space&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eh59fzns9n429onbgx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eh59fzns9n429onbgx6.png" alt="Projection Visualisation" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is how sign of the projection tells us about its position&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc1ny8ih6halfd7x4brs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc1ny8ih6halfd7x4brs.png" alt="Sign of the projection" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's say you are trying to minimise the following distance:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;|| h - X ||&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;where &lt;code&gt;h&lt;/code&gt; is the test (query) vector and &lt;code&gt;X&lt;/code&gt; ranges over the vectors in the dataset.&lt;/p&gt;

&lt;p&gt;In plain kNN, you have to go through every &lt;code&gt;X&lt;/code&gt; in the dataset and compare it with &lt;code&gt;h&lt;/code&gt; using a metric like cosine similarity or Euclidean distance. Calculating this metric for every &lt;code&gt;X&lt;/code&gt; is very expensive for a large dataset. &lt;/p&gt;

&lt;p&gt;So, we do the following steps in the ANN:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use LSH to find the hash values of all the &lt;code&gt;X&lt;/code&gt;'s in the dataset (which involves only simple matrix multiplications).&lt;/li&gt;
&lt;li&gt;Find the hash value of the test value &lt;code&gt;h&lt;/code&gt;. &lt;/li&gt;
&lt;li&gt;Compare &lt;code&gt;h&lt;/code&gt; with only those &lt;code&gt;X&lt;/code&gt;'s which got the same hash value using the above metrics. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can probably guess, this method doesn't guarantee the same answer we would get from exact kNN. So, it is repeated for multiple iterations, generating new random planes each time, to increase the likelihood of getting the right answer.&lt;/p&gt;
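
&lt;p&gt;Here is a minimal NumPy sketch of these steps (illustrative only; the helper names and the bit-packing scheme are just one possible choice):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(vectors, planes):
    # Steps 1 and 2: one 0/1 bit per hyperplane (which side the vector falls on),
    # packed into a single integer bucket id per vector.
    bits = (np.sign(vectors @ planes.T) + 1) // 2
    return (bits @ (2 ** np.arange(planes.shape[0]))).astype(int)

def ann_query(h, X, num_planes=8):
    # Step 3: compare h only with the X's that landed in the same bucket.
    planes = rng.standard_normal((num_planes, X.shape[1]))
    x_buckets = lsh_buckets(X, planes)
    h_bucket = lsh_buckets(h.reshape(1, -1), planes)[0]
    candidates = np.where(x_buckets == h_bucket)[0]
    if candidates.size == 0:
        return None                      # retry with a new set of random planes
    dists = np.linalg.norm(X[candidates] - h, axis=1)
    return candidates[np.argmin(dists)]
&lt;/code&gt;&lt;/pre&gt;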

&lt;p&gt;This is how Approximate Nearest Neighbors with LSH trades accuracy for efficiency.&lt;/p&gt;

&lt;p&gt;Apart from LSH, ANN can be built on other methods too, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Random Projection Trees&lt;/strong&gt;: These are space partitioning data structures that divide the dataset into increasingly smaller subsets using random hyperplanes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;KD-Trees, Ball Trees&lt;/strong&gt;: These are also space partitioning trees but may not perform as well in very high dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To conclude, while kNN is straightforward and can be very accurate, it can be computationally expensive and infeasible for large datasets or high-dimensional data. ANN provides a trade-off between accuracy and speed, allowing efficient querying on large datasets. Depending on the specific requirements (e.g., acceptable error rate, query speed), one might choose an appropriate ANN method over traditional kNN.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>ai</category>
    </item>
    <item>
      <title>Assorted Neural Style Transfer: An extension of Vanilla NST</title>
      <dc:creator>Manish Aradwad</dc:creator>
      <pubDate>Sat, 02 Jul 2022 15:27:58 +0000</pubDate>
      <link>https://dev.to/toji/assorted-neural-style-transfer-an-extension-of-vanilla-nst-5e46</link>
      <guid>https://dev.to/toji/assorted-neural-style-transfer-an-extension-of-vanilla-nst-5e46</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Neural Style Transfer&lt;/em&gt;&lt;/strong&gt; explores methods for artistic style transfer based on Convolutional Neural Networks. The core idea proposed by &lt;a href="https://arxiv.org/abs/1508.06576" rel="noopener noreferrer"&gt;Gatys et al.&lt;/a&gt; became very popular, and with further research, &lt;a href="https://arxiv.org/abs/1603.08155" rel="noopener noreferrer"&gt;Johnson et al.&lt;/a&gt; overcame a significant limitation to achieve style transfer in real-time.&lt;/p&gt;

&lt;p&gt;This article uses the VGG16 model for &lt;a href="https://harishnarayanan.org/writing/artistic-style-transfer/" rel="noopener noreferrer"&gt;manual implementation&lt;/a&gt; of Neural Style Transfer. Another implementation is based on an &lt;a href="https://www.tensorflow.org/tutorials/generative/style_transfer" rel="noopener noreferrer"&gt;article&lt;/a&gt; by TensorFlow, which uses a pre-trained model for NST.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what is Assorted Neural Style Transfer? (Yes, I came up with that name myself 😅)
&lt;/h2&gt;

&lt;p&gt;Well, as we know, NST combines the style of Style Image with the content of Content Image as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeep25gatek3h81q1g96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeep25gatek3h81q1g96.png" alt="Vanilla NST" width="788" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We propose the &lt;strong&gt;Assorted NST&lt;/strong&gt;, which combines the style of 3 Style Images with the content of Content Image. Below are a few examples:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmasjk4xetp6uh4l8oc0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmasjk4xetp6uh4l8oc0x.png" alt="Input Images to Assorted NST" width="788" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91bc833l87y21hwaq0sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91bc833l87y21hwaq0sa.png" alt="The output of Assorted NST (Style Weights = [0.3, 0.3, 0.4])" width="557" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We not only combine the three styles, but we can also control how much weight to give to which style. The above output was generated with weights &lt;em&gt;[0.3, 0.3, 0.4]&lt;/em&gt;. The weights &lt;em&gt;[0.1, 0.1, 0.8]&lt;/em&gt; (where style 3 has more weight) will give the following output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o2hef3y3bzsnbo5sec7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o2hef3y3bzsnbo5sec7.png" alt="The output of Assorted NST (Style Weights = [0.1, 0.1, 0.8])" width="557" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  So, how does the Assorted NST work??
&lt;/h2&gt;

&lt;p&gt;It's pretty straightforward. Instead of giving the model a single style image as input, we take a weighted combination of all three style images and feed that to the model. Before taking the weighted combination, we resize the style images so they all have the same dimensions.&lt;/p&gt;


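&lt;p&gt;A minimal sketch of this blending step (illustrative only; it assumes TensorFlow, three style image paths, and a hypothetical &lt;code&gt;blend_styles&lt;/code&gt; helper rather than the exact code used in the notebooks):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import tensorflow as tf

def blend_styles(style_paths, weights, target_size=(256, 256)):
    # Resize every style image to the same dimensions, then take the
    # weighted combination to build a single "assorted" style image.
    styles = []
    for path in style_paths:
        raw = tf.io.read_file(path)
        img = tf.image.decode_image(raw, channels=3, dtype=tf.float32,
                                    expand_animations=False)
        styles.append(tf.image.resize(img, target_size))
    blended = tf.add_n([w * s for w, s in zip(weights, styles)])
    return blended[tf.newaxis, ...]   # batch dimension expected by the NST model
&lt;/code&gt;&lt;/pre&gt;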


&lt;p&gt;In this way, the model can extract the style of the corresponding final image, which can be used for final image generation.&lt;/p&gt;

&lt;p&gt;The above Assorted NST example is based on the &lt;a href="https://colab.research.google.com/drive/1zTUF0uKeAIv51V9cWxu8EAbKjBAlNkyv?usp=sharing" rel="noopener noreferrer"&gt;TF Hub's model&lt;/a&gt;, while below are some examples of &lt;a href="https://colab.research.google.com/drive/1fHHzQOzaO_F7nsJsq9qPCqofv-pFv5Wn?usp=sharing" rel="noopener noreferrer"&gt;manual Assorted NST implementation&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5tcqnn3bfkiv6ho93yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5tcqnn3bfkiv6ho93yc.png" alt="Input Images to Assorted NST" width="788" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk77jeojvo4t5nwhs6e6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk77jeojvo4t5nwhs6e6i.png" alt="The output of Assorted NST (Style Weights = [0.35, 0.1, 0.45])" width="557" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg6s07c9uynjvfu4iwyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg6s07c9uynjvfu4iwyf.png" alt="The output of Assorted NST (Style Weights = [0.75, 0.1, 0.15])" width="557" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Few Limitations of this method:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;For each set of content and style images, we have to fine-tune the weight values for the output to look good. There is no fixed set of weights that works for all images. If the weights are not chosen well, the outcome might look bad, like below:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb32b3q1mqqa76a7mqa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb32b3q1mqqa76a7mqa7.png" alt="Not Good Outputs!" width="788" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Output image generation takes almost 8 seconds per iteration in the manual implementation, and we need at least ten iterations to get a decent output. This can be reduced further using an end-to-end CNN model built explicitly for NST, as introduced by Johnson et al. (which is what the TF Hub implementation uses).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Thanks for reading!&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/mixed-neural-style-transfer-with-two-style-images-9469b2681b54" rel="noopener noreferrer"&gt;Mixed Neural Style Transfer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://harishnarayanan.org/writing/artistic-style-transfer/" rel="noopener noreferrer"&gt;Manual Implementation of NST&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tensorflow.org/tutorials/generative/style_transfer" rel="noopener noreferrer"&gt;TFHub's NST Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>python</category>
      <category>deeplearning</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>Creating gradient images using only Python NumPy</title>
      <dc:creator>Manish Aradwad</dc:creator>
      <pubDate>Sun, 26 Jun 2022 13:24:20 +0000</pubDate>
      <link>https://dev.to/toji/creating-gradient-images-using-only-python-numpy-3lgc</link>
      <guid>https://dev.to/toji/creating-gradient-images-using-only-python-numpy-3lgc</guid>
      <description>&lt;h2&gt;
  
  
  Here’s the backstory of this article:
&lt;/h2&gt;

&lt;p&gt;I was working on a computer vision project as part of an internship, and I needed a script that could generate custom B/W gradient images. By custom gradient, I mean images like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3iwiv7aawk645ysg0gzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3iwiv7aawk645ysg0gzt.png" alt="Cross Gradient" width="475" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f2n8sutxfnahn97z429.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f2n8sutxfnahn97z429.png" alt="Linear Gradient" width="475" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7og1x8dqfx8od84tid7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7og1x8dqfx8od84tid7.png" alt="Parabolic Gradient" width="475" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Along with the above structures, I also wanted to vary the spread of the gradient and combine different such structures to get the following samples:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F914qf7wxwl6bhof5752e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F914qf7wxwl6bhof5752e.png" alt="Combination of Cross and Parabolic Gradient" width="475" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7vq7rjvwyq5a2jv0zq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7vq7rjvwyq5a2jv0zq1.png" alt="Combination of Linear and Parabolic Gradient" width="475" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iuciae47iir9zuuec7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iuciae47iir9zuuec7n.png" alt="Combination of Two Parabolic Gradients" width="475" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I tried searching online but couldn’t find the solution I really wanted, so I decided to write this article to help others like me. I use NumPy arrays with loops to generate such gradient images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code:
&lt;/h2&gt;

&lt;p&gt;Gradients are essentially unevenly spaced arrays of numbers. So first, we have to write a function for creating unevenly spaced arrays. Refer to &lt;a href="https://stackoverflow.com/questions/32504766/python-generate-unevenly-spaced-array#:~:text=Find%20a%20function%20f%20which,*(ub%2Dlb)%20.&amp;amp;text=Now%20with%20the%20parameter%20spacing,your%20values%20gather%20around%20ub" rel="noopener noreferrer"&gt;this&lt;/a&gt; for an explanation:&lt;/p&gt;


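&lt;p&gt;Something along these lines (an illustrative sketch based on the linked answer; &lt;code&gt;uneven_space&lt;/code&gt; and its &lt;code&gt;spread&lt;/code&gt; parameter are hypothetical names):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def uneven_space(lb, ub, n, spread=2.0):
    # Map n evenly spaced points in [0, 1] through x**spread so the values
    # bunch up near lb (spread above 1) or near ub (spread below 1),
    # then rescale the result to the range [lb, ub].
    x = np.linspace(0.0, 1.0, n)
    return lb + (x ** spread) * (ub - lb)
&lt;/code&gt;&lt;/pre&gt;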


&lt;p&gt;We can use this function for creating different types of gradients as follows:&lt;/p&gt;


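&lt;p&gt;For example, illustrative versions of a linear, a parabolic, and a cross gradient built on top of &lt;code&gt;uneven_space&lt;/code&gt; could look like this (the exact formulas in the notebook may differ):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def linear_gradient(size=475, spread=2.0):
    # one row of unevenly spaced intensities, repeated down the image
    row = uneven_space(0, 255, size, spread)
    return np.tile(row, (size, 1)).astype(np.uint8)

def parabolic_gradient(size=475, spread=2.0):
    # intensity grows with the spread-th power of distance from the centre column
    x = np.abs(np.linspace(-1.0, 1.0, size))
    row = 255 * x ** spread
    return np.tile(row, (size, 1)).astype(np.uint8)

def cross_gradient(size=475, spread=2.0):
    # dark along the two centre axes, brighter away from them (a cross shape)
    x = np.abs(np.linspace(-1.0, 1.0, size)) ** spread
    return (255 * np.minimum.outer(x, x)).astype(np.uint8)
&lt;/code&gt;&lt;/pre&gt;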


&lt;p&gt;Finally, we can write a function which returns a gradient of one of the above types with random parameters:&lt;/p&gt;


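&lt;p&gt;A sketch of such a function (again illustrative; it just picks one of the builders above with a random spread):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def custom_gradient(size=475):
    # pick one of the gradient builders with a random spread parameter
    builder = random.choice([linear_gradient, parabolic_gradient, cross_gradient])
    return builder(size=size, spread=random.uniform(0.5, 3.0))
&lt;/code&gt;&lt;/pre&gt;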


&lt;p&gt;When we run the custom_gradient() function, it returns one of the gradient image types. The complete code is available in &lt;a href="https://colab.research.google.com/drive/1v_euV4liOWx0HdY97vxDh8K0d1fhQNZP?usp=sharing" rel="noopener noreferrer"&gt;this notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading :)&lt;/p&gt;

&lt;p&gt;Have a nice day!&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
