DEV Community: Sam Der

Getting Started With Local LLMs Using AnythingLLM

Sam Der — Sat, 07 Jun 2025 17:42:27 +0000

In this tutorial, we'll use AnythingLLM to load a model and ask it questions. AnythingLLM provides a desktop interface for sending queries to a variety of models.

Note: At the time of writing, AnythingLLM recommends that user machines have at least 2 GB of RAM, a 2-core CPU, and 5 GB of storage.

Navigate to the homepage, download the desktop application for your device, and follow the installation instructions.

At this point, you can choose from any model that's supported. The first selection, AnythingLLM, contains a collection of open source models, including Llama 3.2 and Gemma, that do not require any additional setup and are free to use. The remaining selections allow you to connect AnythingLLM to external, third-party providers such as OpenAI and Hugging Face, through API keys or self-hosted endpoints. For this tutorial, we'll choose AnythingLLM's Gemma 3 1 billion parameter model.

After proceeding, the model will begin downloading in the background. In the meantime, you can play around with your settings and create a workspace. AnythingLLM allocates workspaces where you can create chat threads, leverage different models, and configure custom settings for each one. For example, it is entirely possible (and even encouraged!) to create a new workspace that uses Llama 3.2 with an LLM temperature of 0.1.

Once your model is downloaded, you can begin asking it questions:

You can also attach files to the workspace and prompt the model about them for RAG-based capabilities. For example, I uploaded a Python file containing the code below and asked Gemma what it does:

def hello_world():
    return "Hello world"

if __name__ == "__main__":
    hello_world()

As expected, it is able to parse through the code and correctly describes what it does.

Training a Convolutional Neural Network on the CIFAR-10 Dataset

Sam Der — Sat, 26 Oct 2024 18:43:14 +0000

I've created a Kaggle notebook where I create a convolutional neural network to recognize images from the CIFAR-10 dataset. Feel free to take a look!

And with that, this machine learning series comes to a close! This was meant to be an introductory series to machine learning, so I'll continue to publish posts later down the line on different topics, including natural language processing. Thanks for reading and I hope you've been able to learn something through my posts so far!

Introduction to Convolutional Neural Networks

Sam Der — Sat, 19 Oct 2024 18:16:49 +0000

Now that we know how to create simple neural networks and how to optimize them, we can transition to more complex variants of neural networks. This week, I'll be covering one that's used very often in image processing and computer vision: convolutional neural networks (CNN).

Why CNNs?

In my first article, I gave a walkthrough of how to leverage scikit-learn to recognize digits by training a simple $k$-nearest neighbors model. You might think that you can apply the same methodology to recognizing certain objects in images.

Here's why that doesn't work. First off, all the images of digits shared the same black color and white background. In reality, digits are written down on paper or are shown on various displays with differing backgrounds and colors. Our $k$-nearest neighbors model would be overwhelmed by the staggering number of colors and differences in the image, implode on itself, and fail to make accurate predictions.

You might have seen the meme where some models failed to tell the difference between a chihuahua and a muffin. They do look pretty similar!

A CNN, on the other hand, is specifically designed to handle different aspects of an image, including colors and backgrounds across multiple environments. There are three concepts that CNNs rely on: convolution, pooling, and the fully-connected layer, in that order.

What is Convolution?

In the context of image processing, the convolution operation slides a small matrix, known as a kernel (or, equivalently, a filter), across the pixels of the image and calculates the dot product of the kernel and the pixels underneath it at each position. The dimensions of a kernel and its values depend on the operation being performed. If, for example, you were trying to blur an image, the values would be somewhat small around the edges of the matrix, but larger towards the center to more closely (but not exactly) preserve the color of the pixel that lies in the center of the kernel. Different kernels lead to different dot products, or in other words, highlight different aspects of an image, and a CNN leverages these kernels to deduce the characteristics of an image.

Sometimes, for images with lots of pixels, rather than convolving at each individual pixel, the process can be sped up by only processing rows at intervals (such as every other row or every third row). In the first case, you would be performing convolution with a stride length of 2 and in the second case, you would be doing it with a stride length of 3. Using a stride length greater than 1 causes the dot product to be less representative of an image as a result of skipping out on pixels in the image. At the same time, however, the chances of overfitting on an image would be greatly reduced. Again, a balance needs to be determined.

The above animation is a bit of a simplification, as pixels consist of RGB values rather than a single scalar, but the concept is still there. As the yellow rectangle moves across the matrix, that submatrix is multiplied element-wise by the kernel and then summed to a number that becomes the corresponding element of the convolved image.
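To make the sliding and the stride concrete, here is a minimal sketch in plain Python. The `convolve2d` helper, the 4 by 4 image, and the 2 by 2 kernel are all made up for illustration, the image holds a single scalar per pixel, and the stride is applied to both rows and columns. Like most deep learning libraries, it does not flip the kernel before multiplying.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists) with no padding."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for top in range(0, len(image) - k_rows + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k_cols + 1, stride):
            # Multiply the submatrix by the kernel element-wise and sum the products
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        out.append(row)
    return out


# Hypothetical 4 x 4 single-channel image and 2 x 2 kernel
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0],
    [0, 1],
]

stride_1 = convolve2d(image, kernel, stride=1)  # 3 x 3 output
stride_2 = convolve2d(image, kernel, stride=2)  # 2 x 2 output
```

With a stride of 2 the output shrinks from 3 by 3 to 2 by 2, which is the speed-up (and the loss of detail) described above.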

What is Pooling?

You can imagine that the more pixels an image has, the more computationally intensive convolution and interpreting the results gets for the model. Pooling takes the convolved image and compresses it in a manner similar to the convolution process. It slides a small window over the values and calculates a metric on them that outputs a single value for that element of the smaller matrix. The dimensions of the window once again depend on the developer's desires. If they are more focused on smoothing out noise, then the maximum of those values should be used. Otherwise, they can use the average of the values.

The output of the pooling process is another smaller matrix, which can subsequently be fed into further convolution and pooling layers.
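As a rough sketch of the idea, the snippet below pools a small matrix with non-overlapping 2 by 2 windows. The `pool2d` and `mean` helpers and the `feature_map` values are hypothetical, and real libraries offer more options (overlapping windows, padding) than this does.

```python
def pool2d(matrix, size=2, reduce=max):
    """Compress `matrix` by applying `reduce` to each non-overlapping size x size window."""
    out = []
    for top in range(0, len(matrix) - size + 1, size):
        row = []
        for left in range(0, len(matrix[0]) - size + 1, size):
            window = [
                matrix[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4 x 4 output of a convolution step
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(feature_map, size=2, reduce=max)   # [[7, 8], [9, 6]]
avg_pooled = pool2d(feature_map, size=2, reduce=mean)  # [[4.0, 5.0], [4.5, 3.0]]
```

Either way, the 4 by 4 matrix is reduced to 2 by 2, so the next layer has a quarter as many values to process.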

The Fully-Connected Layer

Once processing in the convolution and pooling layers is complete, the matrix is then flattened into a column vector (like what we've seen with MNIST), and then passed into a simple feed forward neural network with backpropagation. Activation functions such as ReLU and SoftMax can then be used to fit non-linear data and classify the results respectively.

This part of the CNN as a whole is called the fully-connected layer.
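Here is a small sketch of that last stage in plain Python: flatten a matrix, pass it through one hidden layer with ReLU, and turn the output scores into probabilities with softmax. The `pooled` matrix and every weight and bias are invented for illustration. A real network would learn the weights through backpropagation rather than have them written by hand.

```python
import math


def flatten(matrix):
    return [value for row in matrix for value in row]


def dense(inputs, weights, biases):
    """One fully-connected layer: every output is a weighted sum of every input."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtracting the maximum keeps exp() from overflowing without changing the result
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


pooled = [[7, 8], [9, 6]]   # hypothetical output of the last pooling layer
features = flatten(pooled)  # [7, 8, 9, 6]

# Hypothetical weights: 2 hidden neurons, then 3 output classes
hidden = relu(dense(features, [[0.1, -0.2, 0.3, 0.0], [-0.5, 0.1, 0.1, 0.2]], [0.0, 0.5]))
scores = dense(hidden, [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]], [0.0, 0.0, 0.0])
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
```

The probabilities sum to 1, and the predicted class is simply the index with the largest one.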

That's It For Now!

Take some time to digest what you read. It's complicated! I'll provide a PyTorch demo in next week's article. Thanks for reading and see you in the demo!

Optimizing Your Neural Networks

Sam Der — Sat, 12 Oct 2024 16:37:38 +0000

Last week I posted an article about how to build simple neural networks, specifically multi-layer perceptrons. This article will dive deeper into the specifics of neural networks to discuss how we can maximize the performance of a neural network by tweaking its configurations.

How Long to Train Your Model For

When training a model, you might think that if you train your model enough, the model will become flawless. This may be true, but that only holds for the dataset it was trained on. In fact, if you give it another set of data where the values are different, the model could output completely incorrect predictions.

To understand this further, let's say you were practicing every single day for your driver's exam by driving in a straight line without moving the wheel. (Please don't do this.) While you would probably perform very well on the drag strip, if you were told to make a left turn on the actual exam, you might end up turning into a STOP sign instead.

This phenomenon is called overfitting. Your model can learn all the aspects and patterns of the data it's trained on but if it learns a pattern that adheres to the training dataset too closely, then when given a new dataset, your model will perform poorly. At the same time, if you don't train your model enough, then your model won't be able to recognize patterns in other datasets properly. In this case, you would be underfitting.

An example of overfitting. The validation loss, represented by the orange line, is gradually increasing while the training loss, represented by the blue line, is decreasing.

In the example above, a great position to stop training your model would be right when the validation loss reaches its minimum. It's possible to do this with early stopping, which stops training once there is no improvement in validation loss after an arbitrary number of training cycles (epochs).
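The early stopping rule can be sketched in a few lines. The `early_stopping_epoch` helper, the `patience` name, and the list of validation losses below are all hypothetical. In practice the losses would come from evaluating the model after each epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without the
    validation loss improving on the best value seen so far.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # the rule never triggered, so every epoch ran


# Hypothetical validation losses: they improve, then creep back up
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
best_epoch = val_losses.index(min(val_losses))
```

Here training stops at epoch 6, three epochs after the best validation loss at epoch 3, so you would normally keep the weights saved at the best epoch rather than the final ones.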

Training your model is all about finding a balance between overfitting and underfitting while utilizing early stopping if necessary. That's why your training dataset should be as representative as possible of your overall population so that your model can more accurately make predictions on data it hasn't seen.

Loss Functions

Perhaps one of the most important training configurations that can be tweaked is the loss function, which is the "inaccuracy" between your model's predictions and their actual values. The "inaccuracy" can be represented mathematically in many different ways, one of the most common being mean squared error (MSE):

\begin{gather*} \text{MSE} = \frac{\sum_{i=1}^n (\bar{y_i} - y_i)^2}{n} \end{gather*}

where $\bar{y_i}$ is the model's prediction and $y_i$ is the true value. There's a similar variant called mean absolute error (MAE):

\begin{gather*} \text{MAE} = \frac{\sum_{i=1}^n |\bar{y_i} - y_i|}{n} \end{gather*}

What's the difference between these two and which one is better? The real answer is that it depends on a variety of factors. Let's consider a simple 2-dimensional linear regression example.

In many cases, there can be data points that act as outliers, points that are far away from other data points. In terms of linear regression, this means that there are a few points on the $xy$-plane that are far away from the rest of them. If you remember from your statistics classes, it's points like these that can significantly affect the linear regression line that's calculated.

A simple graph with points on (1, 1), (2, 2), (3, 3), and (4, 4)

If you wanted to think of a line that could cross all four points, then $y = x$ would be a great choice because this line would go through all the points.

A simple graph with points on (1, 1), (2, 2), (3, 3), and (4, 4) and the line $y = x$ going through it

However, let's say I decide to add another point at $(5, 1)$. Now what should the regression line be? Well, it turns out that it's completely different: $y = 0.2x + 1.6$.

A simple graph with points on (1, 1), (2, 2), (3, 3), (4, 4), and (5,1) with a linear regression line going through it.

Given the first four data points, we would expect the value of $y$ when $x = 5$ to be 5, but because the squared error of the outlier is so large, the regression line is "pulled downwards" significantly to reduce it.

This is just a simple example, but this poses a question that you, as a machine learning developer, need to stop and think about: How sensitive should my model be to outliers? If you want your model to be more sensitive to outliers, then you would choose a metric like MSE, because in that case, errors involving outliers are more pronounced due to the squaring and your model will adjust itself to minimize that. Otherwise, you would choose a metric like MAE, which doesn't care as much about outliers.
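To put numbers on this, the sketch below fits an ordinary least squares line to the five points and then scores both that line and $y = x$ with MSE and MAE. The `fit_line`, `mse`, and `mae` helpers are written here just for illustration, and ordinary least squares is assumed as the regression method because it is the one that minimizes MSE.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def mae(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 1)]  # (5, 1) is the outlier
xs = [x for x, _ in points]
ys = [y for _, y in points]

slope, intercept = fit_line(points)  # approximately 0.2 and 1.6

follows_trend = [float(x) for x in xs]               # predictions of y = x
least_squares = [slope * x + intercept for x in xs]  # predictions of the fitted line

mse_trend, mae_trend = mse(follows_trend, ys), mae(follows_trend, ys)
mse_fit, mae_fit = mse(least_squares, ys), mae(least_squares, ys)
```

Measured by MSE, the fitted line wins (about 1.28 versus 3.2), but measured by MAE, $y = x$ wins (0.8 versus about 0.96). The choice of loss function decides which line looks "better", and that is exactly the sensitivity to outliers you are choosing.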

Optimizers

In my previous post, I also discussed the concept of backpropagation, gradient descent, and how they work to minimize the loss of the model. The gradient is a vector that points towards the direction of greatest change. A gradient descent algorithm will calculate this vector and move in the exact opposite direction so that it eventually reaches a minimum.

Most optimizers have a specific learning rate, commonly denoted as $\alpha$, that they adhere to. Essentially, this represents how much the algorithm will move towards the minimum each time it calculates the gradient. Be careful of setting your learning rate to be too large! Your algorithm may never reach the minimum due to the large steps it takes that could repeatedly skip over the minimum.

[Tensorflow's neural network playground](https://playground.tensorflow.org) showing what can happen if you set the learning rate to be too large. Notice how the testing and training loss are both `NaN`.
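The same effect shows up on a toy problem. This sketch runs plain gradient descent on $f(x) = x^2$, a function picked purely for illustration, once with a small learning rate and once with one that is too large.

```python
def gradient_descent(start, learning_rate, steps):
    """Minimize f(x) = x**2, whose gradient is 2 * x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x
        x = x - learning_rate * gradient  # step in the direction opposite the gradient
    return x


small_steps = gradient_descent(start=5.0, learning_rate=0.1, steps=50)
large_steps = gradient_descent(start=5.0, learning_rate=1.1, steps=50)
```

With a learning rate of 0.1, each step shrinks $x$ by a factor of 0.8 and it ends up very close to the minimum at 0. With a learning rate of 1.1, each step overshoots and multiplies $x$ by -1.2, so it bounces from one side of the minimum to the other while moving further away.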

Going back to gradient descent, while it is effective in minimizing loss, it can significantly slow down the training process because the gradient of the loss function is calculated on the entire dataset at every step. There are several alternatives to gradient descent that are more efficient but have their respective downsides.

Stochastic Gradient Descent

One of the most popular alternatives to standard gradient descent is a variant called stochastic gradient descent (SGD). As with gradient descent, SGD has a fixed learning rate. But rather than running through the entire dataset like gradient descent, SGD randomly selects a small sample and updates the weights of your neural network based on that sample instead. Eventually, the parameter values converge to a point that approximately (but not exactly) minimizes the loss function. This is one of the downsides of SGD, as it doesn't always reach the exact minimum. Additionally, similar to gradient descent, it remains sensitive to the learning rate that you set.
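As a rough sketch, here is mini-batch SGD fitting a line in plain Python. The `sgd_fit` helper, its default values, and the noiseless dataset generated from $y = 3x + 1$ are all assumptions made for illustration.

```python
import random


def sgd_fit(xs, ys, learning_rate=0.01, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b with mini-batch stochastic gradient descent on MSE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random order, so each batch is a random sample
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [x / 10 for x in range(-20, 21)]  # 41 points between -2 and 2
ys = [3 * x + 1 for x in xs]           # hypothetical noiseless data: y = 3x + 1
w, b = sgd_fit(xs, ys)
```

Each update only looks at 4 of the 41 points, yet `w` and `b` still end up close to 3 and 1. On noisy data the final values would keep jittering around the minimum instead of settling on it exactly.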

The Adam Optimizer

The name, Adam, is derived from adaptive moment estimation. It essentially combines two variants of SGD to adjust the learning rate for each input parameter based on how often it gets updated during each training iteration (adaptive learning rate). At the same time, it also keeps track of past gradient calculations as a moving average to smooth out updates (momentum). However, because of its momentum characteristic, it can sometimes take longer to converge than other algorithms.
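Below is a minimal single-parameter sketch of the Adam update rule with its commonly used default values for `beta1`, `beta2`, and `epsilon`. The `adam_minimize` helper and the function being minimized are made up for illustration. Frameworks such as PyTorch apply the same idea to every weight in the network at once.

```python
import math


def adam_minimize(gradient, start, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    x = start
    m = 0.0  # moving average of gradients (momentum)
    v = 0.0  # moving average of squared gradients (adaptive learning rate)
    for t in range(1, steps + 1):
        g = gradient(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Correct the bias toward zero that both averages have in early steps
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x


# f(x) = (x - 3)**2 has its minimum at x = 3 and its gradient is 2 * (x - 3)
minimum = adam_minimize(lambda x: 2 * (x - 3), start=0.0)
```

The division by the square root of `v_hat` is what makes the step size adaptive, and `m_hat` is the momentum term that smooths out the updates.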

Putting it All Together

Now for an example!

I've created an example walkthrough on Google Colab that uses PyTorch to create a neural network that learns a simple linear relationship.

If you're a bit new to Python, don't worry! I've included some explanations that discuss what's going on in each section.

Reflection

While this obviously doesn't cover everything about optimizing neural networks, I wanted to at least cover a few of the most important concepts that you can take advantage of while training your own models. Hopefully you've learned something this week and thanks for reading!

An Introduction to Neural Networks

Sam Der — Sat, 05 Oct 2024 16:57:40 +0000

The rise of machine learning has been accompanied by a similar rise of tools and libraries to aid machine learning applications. From basic model training with scikit-learn to deep learning frameworks such as PyTorch and Tensorflow, these tools have come a long way in terms of advancing machine learning development.

But as someone starting out with machine learning, how does one make the jump from scikit-learn to PyTorch or Tensorflow? And how do we use these frameworks to create neural networks, one of the most powerful models in machine learning today?

What is a Neural Network?

A neural network is a different way of training a model. Rather than using an algorithm such as k-nearest neighbors to make predictions, a neural network takes in input(s), assigns weights to each input parameter, and performs mathematical calculations using these weights (along with several bias constants) to arrive at a prediction. The network will then adjust the weights based on the correctness of its predictions.

Simple neural networks generally consist of three kinds of layers, each of which contains nodes, or neurons. Input flows in one direction through the layers to the output. Specifically, these kinds of networks are called multi-layer perceptrons.

Input Layer

As the name suggests, each node in this layer represents an input parameter such as an individual pixel in an image.

Hidden Layers

These layers are where the mathematical calculations are performed. Given the input layer, the nodes in the hidden layer multiply the inputs by their weights and add any biases that provide additional weighting to an input to adjust its predictions. In short, a node essentially performs the following linear calculation:

\begin{gather*} y = wx + b \end{gather*}

$y$ can then get fed to a node in another hidden layer where it will be processed again or directly to the output layer.

It's possible to transform $y$ using what's called an activation function. These functions transform it into a non-linear function, which is extremely helpful for allowing the model to fit data points that are non-linear. One example is the Rectified Linear Unit (ReLU) function, which simply takes the maximum of $y$ and 0 and returns that as the node's output:

\begin{gather*} y = \max(0, wx + b) \end{gather*}
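In code, a single node could be sketched like this. The `neuron` helper, the inputs, the weights, and the bias are all hypothetical values chosen so the arithmetic is easy to follow.

```python
def neuron(inputs, weights, bias, activation=None):
    """Compute w . x + b for one node, optionally followed by an activation."""
    y = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(y) if activation else y


def relu(y):
    return max(0.0, y)


inputs = [0.5, -1.0]
weights = [2.0, 3.0]  # hypothetical weights
bias = 0.5            # hypothetical bias

linear_output = neuron(inputs, weights, bias)      # 2*0.5 + 3*(-1) + 0.5 = -1.5
relu_output = neuron(inputs, weights, bias, relu)  # max(0, -1.5) = 0.0
```

Without the activation, the node can only ever produce a straight-line function of its inputs. ReLU clips the negative part, which is what lets stacks of nodes fit non-linear data.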

Output Layer

The output layer represents the final output, or the prediction the neural network makes given some input data. This can range from probabilities that an input data point is of a certain class to regression predictions.

How a Neural Network Trains Itself

When creating a neural network, you specify a variety of different parameters, ranging from optimizers to loss functions. I'll be going over these in more detail in my next article, but for now, the most important concept to know of is gradient descent.

A model trains itself by evaluating its performance on a training dataset, calculating a loss function and adjusting the model's weights in response. A developer can choose the exact loss function to use, such as mean absolute error, as well as the way the weights are adjusted. One of the most common ways that weights are adjusted is through backpropagation, which uses gradient descent to fine-tune the model's weights so that it eventually reaches a local minimum of the loss function.

Note: A local minimum is not necessarily equivalent to a global minimum. This means that the best weights for your model might not be found even when using backpropagation!
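A quick way to see this is to run gradient descent on a function with two valleys. The function $f(x) = x^4 - 3x^2 + x$ below is picked purely for illustration and stands in for a loss function. Where the descent ends up depends on where it starts.

```python
def f(x):
    return x**4 - 3 * x**2 + x


def f_prime(x):
    return 4 * x**3 - 6 * x + 1


def descend(start, learning_rate=0.01, steps=1000):
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)  # move against the gradient
    return x


from_right = descend(start=2.0)   # settles near x = 1.13, a local minimum
from_left = descend(start=-2.0)   # settles near x = -1.30, the global minimum
```

Both runs stop at a point where the gradient is zero, but `f(from_left)` is noticeably lower than `f(from_right)`. The run that started on the right never finds out that a deeper valley exists.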

Understanding What's Available To You

Now that we understand how simple neural networks work, let's view what options are available to build our own!

In training a model, you can split your dataset into training and testing datasets. However, you can further split your dataset into training, validation, and testing datasets, where the validation dataset acts as a sanity check for your model before actually testing itself on the testing dataset.
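A three-way split might look like the sketch below. The `three_way_split` helper and its 70/10/20 proportions are assumptions for illustration, and the list of integers stands in for real labelled examples.

```python
import random


def three_way_split(rows, validation_fraction=0.1, test_fraction=0.2, seed=0):
    """Shuffle `rows` and cut them into training, validation, and testing lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_valid = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


rows = list(range(100))  # stand-in for 100 labelled examples
train, valid, test = three_way_split(rows)
```

Shuffling before cutting matters: if the rows were sorted by class, an unshuffled split could leave entire classes out of the training set.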

`scikit-learn`

from sklearn.neural_network import MLPClassifier

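mlp = MLPClassifier(
    hidden_layer_sizes=(5, 5, 5),  # three hidden layers with 5 nodes each
    activation='relu',
    solver='sgd',                  # the optimizer
    max_iter=100,
    validation_fraction=0.2,       # only used because early_stopping=True
    early_stopping=True
)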

mlp.fit(X_train, y_train)
mlp.predict(X_test)

This is pretty simple! In this case, mlp uses a multi-layer perceptron as a classifier. As you can see, you can customize a variety of parameters including the number of nodes in your layers, the activation function, the optimizer, and the maximum number of training iterations by the neural network.

Side Note: sgd stands for Stochastic Gradient Descent, which is more efficient in terms of training than standard gradient descent in that it uses smaller batches in determining how to adjust weights rather than the whole dataset.

PyTorch

import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

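# Use the GPU if one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = NeuralNetwork().to(device)
# A random 28 by 28 tensor standing in for one image of input data
X = torch.rand(1, 28, 28, device=device)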

# We still need to transform the model's output into something that makes sense!
# Right now, it's just the output of a bunch of linear calculations and activation functions.
# We can use the softmax activation function to convert our output into a prediction.
# Specifically, this returns the index of the class that has the maximum probability.
logits = model(X)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(dim=1)

This is slightly more complicated, but much more customizable than scikit-learn. For each layer, you can define the number of inputs and outputs and you can also define activation functions within each one.

The snippet above also introduces something called a tensor. A tensor is essentially a multi-dimensional array (a generalization of a matrix to any number of dimensions) that is optimized for machine learning purposes, for example by being able to run on a GPU.

Tensorflow

from tensorflow import keras
from tensorflow.keras import layers

input_shape = [X_train.shape[1]]

model = keras.Sequential([
    layers.Dense(units=512, activation='relu', input_shape=input_shape),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=1)
])

model.compile(loss="mae")
model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=64,
    epochs=100
)

Finally, this is an example of a neural network in Tensorflow. It's quite similar to the PyTorch example, with a few minor differences. Ultimately, the library you choose is up to you!

Optimizing Your Neural Networks

As you can see, there are a lot of options to configure when building your neural networks. I'll be going over these in my next article! See you in the next one!

Moving forward, we'll stick with PyTorch.

Important: Taking Advantage of Your GPUs

Machine learning training obviously requires a lot of intensive computation. With GPUs being much more efficient at mathematical calculations than CPUs, the more advanced machine learning libraries cater to GPUs by allowing you to write code that takes advantage of them during training.

Unfortunately, scikit-learn does not support using GPUs to train models.

import tensorflow as tf
with tf.device('/device:GPU:0'):
    # GPU code here
    ...

import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Getting Started With Machine Learning Using MNIST

Sam Der — Sat, 28 Sep 2024 15:38:53 +0000

What is Machine Learning?

Machine learning is one of the hottest topics in computer science in recent years. With the rise of software like ChatGPT by OpenAI, Copilot by Microsoft, and countless other large language models that are powered by machine learning, it's becoming ever more relevant in our daily lives. From helping us write emails to answering our questions about life and the pursuit of happiness, these chatbots seem almost omnipotent.

But how do we get from machine learning to large language models? How do we get from training a model that can, for example, identify handwritten digits to something as complex as ChatGPT? I'll be publishing posts every week that go further in depth into answering these questions, but for now, let's start with the basics.

The Basics

Machine learning is exactly what it sounds like: making a machine (or more specifically, a computer) learn. More precisely, it's writing code to train our computers so that they can be used to make predictions later down the line. The result of this training is called a model.

In the handwritten digits example, you might think that we can just use existing images to train a model, but we can't just use any image of a number. Since we each have our own way of writing digits, we can't just train on a screenshot of a number and then hope it can extrapolate to our handwriting. The numbers for a specific font are rendered exactly the same way on a computer screen whereas each time we write a number, there are extremely minor differences that could convince a computer that the number we wrote down is a different one. We need to train our model using the same type of data we want it to predict on. In other words, we want to train our model with images of handwritten digits so that it can do the same with images it hasn't seen before.

Google Colab

To get started, we can head over to Google Colab, a notebook-based coding environment. What that means is that you can split your code into individual cells that you can re-run and you can view the output of each one.

A simple example using the Fibonacci sequence. Competitive programmers know what I'm doing here :). If you don't, then hopefully you learned something new!

Notice how you don't need to use the print function to view output in this scenario. The return value from the last line of a cell gets outputted by default, but if you want to see output from other lines, then you'll have to use print.

Training A Model With `scikit-learn` To Recognize Digits

Getting Our Data

There's a popular library to get started with machine learning called scikit-learn that's already installed in Colab. We can use this to train our model using the MNIST dataset of handwritten digits.

# Importing the necessary libraries and functions that will be used later
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

mnist = fetch_openml('mnist_784')
mnist.data

The dataset: Each row represents an image and each column represents the weight of a pixel in that image.

mnist.target

The targets: This contains the digits each image corresponds to. For example, the image in the first row is a 5.

images = mnist.data.to_numpy()
plt.imshow(images[0].reshape(28, 28), cmap='gray')

An example image rendered using numpy and matplotlib.

Now that we have our images, let's proceed with building our model! Since we're classifying images to a certain digit and also just to keep things simple, we should use a classification algorithm. One of the most common algorithms of this type is k-nearest neighbors, which looks at data points that are near in terms of Euclidean distance (or some other distance metric) and outputs the classification that is most common among them. This is appropriate in our scenario because images that have similar pixel weights should represent the same number.
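To show what the algorithm is doing under the hood, here is a tiny hand-rolled sketch of k-nearest neighbors. The `knn_predict` helper and the 2-pixel "images" are hypothetical and only meant to keep the distances easy to check by hand. scikit-learn's version does the same job far more efficiently.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` with the most common label among its k nearest neighbors."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-pixel "images": three dark ones labelled "0", three bright ones labelled "1"
train_points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
train_labels = ["0", "0", "0", "1", "1", "1"]

near_origin = knn_predict(train_points, train_labels, (1, 1))
near_corner = knn_predict(train_points, train_labels, (8, 8))
```

The query at (1, 1) is surrounded by points labelled "0" and the one at (8, 8) by points labelled "1", so those are the predictions. With MNIST, each point simply has 784 coordinates instead of 2.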

Running Our Classifier

To train our model and test it appropriately, we'll need to split this large dataset into smaller training and testing datasets. Luckily, scikit-learn provides a function named train_test_split to do just that! It accepts the images and their targets as separate arrays and shuffles them in the same order, so each image stays paired with its digit. Since we didn't import it at the beginning, we'll import it from sklearn.model_selection first.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(images, mnist.target.to_numpy(), test_size=0.2)

The test size can be adjusted arbitrarily but usually 80/20 is a good split.

We can then use the KNeighborsClassifier that we imported at the beginning to train our model using our training dataset and then test the accuracy of our model using the test dataset. This snippet looks at the 5 nearest neighbors but feel free to change the number as you see fit!

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy_score(y_test, model.predict(X_test))

This output means that our model had an accuracy of 97% with the test dataset. That's a pretty good score! And congratulations, you just built your first machine learning model!

Where To Go From Here?

We used the scikit-learn library in this tutorial, but for more intensive machine learning applications that, for instance, require more customization of your parameters or utilization of compute resources, it would be better to use a library like PyTorch or Tensorflow. I'll be going over both in later posts!

In the meantime, thanks for sticking with me. See you in the next one!
