DEV Community: Aslan Vatsaev

Engineer's Guide to Local LLMs with LLaMA.cpp on Linux

Aslan Vatsaev — Mon, 15 Sep 2025 00:27:47 +0000

Introduction

In this write up I will share my local AI setup on Ubuntu that I use for my personal projects as well as professional workflows (local chat, agentic workflows, coding agents, data analysis, synthetic dataset generation, etc).

This setup is particularly useful when I want to generate large amounts of synthetic datasets locally, process large amounts of sensitive data with LLMs in a safe way, use local agents without sending my private data to third party LLM providers, or just use chat/RAGs in complete privacy.

What you'll learn

Compile LlamaCPP on your machine, set it up in your PATH, keep it up to date (compiling from source allows to use the bleeding edge version of llamacpp so you can always get latest features as soon as they are merged into the master branch)
Use llama-server to serve local models with very fast inference speeds
Setup llama-swap to automate model swapping on the fly and use it as your OpenAI compatible API endpoint.
Use systemd to setup llama-swap as a service that boots with your system and automatically restarts when the server config file changes
Integrate local AI in Agent Mode into your terminal with QwenCode/OpenCode

I will also share what models I use for different types of workflows and different advanced configurations for each model (context expansion, parallel batch inference, multi modality, embedding, rereanking, and more.

This will be a technical write up, and I will skip some things like installing and configuring basic build tools, CUDA toolkit installation, git, etc, if I do miss some steps that where not obvious to setup, or something doesn't work from your end, please let me know in the comments, I will gladly help you out, and progressively update the article with new information and more details as more people complain about specific aspects of the setup process.

Hardware

RTX3090 Founders Edition 24GB VRAM

The more VRAM you have the larger models you can load, but if you don't have the same GPU as long at it's an NVIDIA GPU it's fine, you can still load smaller models, just don't expect good agentic and tool usage results from smaller LLMs.

RTX3090 can load a Q5 quantized 30B Qwen3 model entirely into VRAM, with up to 140t/s as inference speed and 24k tokens context window (or up 110K tokens with some flash attention magic)

Prerequisites

Experience with working on a Linux Dev Box
Ubuntu 24 or 25
NVIDIA proprietary drivers installed (version 580 at the time of writing)
CUDA toolking installed
ROCm installed if you use an AMD GPU
Linux build tools + Git installed and configured

Architecture

Here is a rough overview of the architecture we will be setting up:

Installing and setting up Llamacpp

LlamaCpp is a very fast and flexible inference engine, it will allow us to run LLMs in GGUF format locally.

Clone the repo:

git clone git@github.com:ggml-org/llama.cpp.git

cd into the repo:

cd llama.cpp

compile llamacpp for CUDA:

cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON

If you have a different GPU, checkout the build guide here

cmake --build build --config Release -j --clean-first

This will create llama.cpp binaries in build/bin folder.

To update llamacpp to bleeding edge just pull the lastes changes from the master branch with git pull origin master and run the same commands to recompile

Add llamacpp to PATH

Depending on your shell, add the following to you bashrc or zshrc config file so we can execute llamacpp binaries in the terminal

export LLAMACPP=[PATH TO CLONED LLAMACPP FOLDER]
export PATH=$LLAMACPP/build/bin:$PATH

Test that everything works correctly:

llama-server --help

The output should look like this:

Test that inference is working correctly:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

Great! now that we can do inference, let move on to setting up llama swap

Installing and setting up llama swap

llama-swap is a light weight, proxy server that provides automatic model swapping to llama.cpp's server. It will automate the model loading and unloading through a special configuration file and provide us with an openai compatible REST API endpoint.

Download and install

Download the latest version from the releases page:

https://github.com/mostlygeek/llama-swap/releases

(look for llama-swap_159_linux_amd64.tar.gz )

Unzip the downloaded archive and put the llama-swap executable somewhere in your home folder (eg: ~/llama-swap/bin/llama-swap)

Add it to your path :

export PATH=$HOME/llama-swap/bin:$PATH

create an empty (for now) config file file in ~/llama-swap/config.yaml

test the executable

llama-swap --help

Before setting up llama-swap configuration we first need to download a few GGUF models .

To get started, let's download qwen3-4b and gemma gemma3-4b

Download and put the GGUF files in the following folder structure

~/models
├── google
│   └── Gemma3-4B
│       └── Qwen3-4B-Q8_0.gguf
└── qwen
    └── Qwen3-4B
        └── gemma-3-4b-it-Q8_0.gguf

Now that we have some ggufs, let's create a llama-swap config file.

Llama Swap config file

Our llama swap config located in ~/llama-swap/config.yaml will look like this:

macros:
  "Qwen3-4b-macro": >
    llama-server \
      --port ${PORT} \
      -ngl 80 \
      --ctx-size 8000 \
      --temp 0.7 \
      --top-p 0.8 \
      --top-k 20 \
      --min-p 0 \
      --repeat-penalty 1.05 \
      --no-webui \
      --timeout 300 \
      --flash-attn on \
      --jinja \
      --alias Qwen3-4b \
      -m /home/[YOUR HOME FOLDER]/models/qwen/Qwen3-4B/Qwen3-4B-Q8_0.gguf

  "Gemma-3-4b-macro": >
    llama-server \
      --port ${PORT} \
      -ngl 80 \
      --top-p 0.95 \
      --top-k 64 \
      --no-webui \
      --timeout 300 \
      --flash-attn on \
      -m /home/[YOUR HOME FOLDER]/models/google/Gemma3-4B/gemma-3-4b-it-Q8_0.gguf


models:
  "Qwen3-4b": # <-- this is your model ID when calling the REST API
    cmd: |
      ${Qwen3-4b-macro}
    ttl: 3600

  "Gemma3-4b":
    cmd: |
      ${Gemma-3-4b-macro}
    ttl: 3600

Start llama-swap

Now we can start llama-swap with the following command:


llama-swap --listen 0.0.0.0:8083 --config ~/llama-swap/config.yaml

You can access llama-swap UI at: http://localhost:8083

Here you can see all configured models, you can also load or unload them manually.

Inference

Let's do some inference via llama-swap REST API completions endpoint

Calling Qwen3:

curl -X POST http://localhost:8083/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {
      "role": "user",
      "content": "hello"
    }
  ],
  "stream": false,
  "model": "Qwen3-4b"
}' | jq

Calling Gemma3:

curl -X POST http://localhost:8083/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {
      "role": "user",
      "content": "hello"
    }
  ],
  "stream": false,
  "model": "Gemma3-4b"
}' | jq

You should see a response from the server that looks something like this, and llamaswap will automatically load the correct model into the memory with each request:



  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today? 😊"
      }
    }
  ],
  "created": 1757877832,
  "model": "Qwen3-4b",
  "system_fingerprint": "b6471-261e6a20",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 12,
    "prompt_tokens": 9,
    "total_tokens": 21
  },
  "id": "chatcmpl-JgolLnFcqEEYmMOu18y8dDgQCEx9PAVl",
  "timings": {
    "cache_n": 8,
    "prompt_n": 1,
    "prompt_ms": 26.072,
    "prompt_per_token_ms": 26.072,
    "prompt_per_second": 38.35532371893219,
    "predicted_n": 12,
    "predicted_ms": 80.737,
    "predicted_per_token_ms": 6.728083333333333,
    "predicted_per_second": 148.63073931406916
  }
}

Optional: Adding llamaswap as systemd service and setup auto restart when config file changes

If you don't want to manually run the llama-swap command everytime you turn on your workstation or manually reload the llama-swap server when you change your config you can leverage systemd to automate that away, create the following files:

Llamaswap service unit (if you are not using zsh adapt the ExecStart accordingly)

~/.config/systemd/user/llama-swap.service:

[Unit]
Description=Llama Swap Server
After=multi-user.target

[Service]
Type=simple
ExecStart=/usr/bin/zsh -l -c "source ~/.zshrc && llama-swap --listen 0.0.0.0:8083 --config ~/llama-swap/config.yaml"
WorkingDirectory=%h
StandardOutput=journal
StandardError=journal
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Llamaswap restart service unit

~/.config/systemd/user/llama-swap-restart.service:

[Unit]
Description=Restart llama-swap service
After=llama-swap.service

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl --user restart llama-swap.service

Llamaswap path unit (will allow to monitor changes in the llama-swap config file and call the restart service whenever the changes are detected):

~/.config/systemd/user/llama-swap-config.path

[Unit]
Description=Monitor llamaswap config file for changes
After=multi-user.target

[Path]
# Monitor the specific file for modifications
PathModified=%h/llama-swap/config.yaml
Unit=llama-swap-restart.service

[Install]
WantedBy=default.target

Enable and start the units:

sudo systemctl daemon-reload

systemctl --user enable llama-swap-restart.service llama-swap.service llama-swap-config.path

systemctl --user start llama-swap.service

Check that the service is running correctly:

systemctl --user status llama-swap.service

Monitor llamaswap server logs:

journalctl --user -u llama-swap.service -f

Whenever the llama swap config is updated, the llamawap proxy server will automatically restart, you can verify it by monitoring the logs and making an update to the config file.

If were able to get this far, congrats, you can start downloading and configuring your own models and setting up your own config, you can draw some inspiration from my config available here: https://gist.github.com/avatsaev/dc302228e6628b3099cbafab80ec8998

It contains some advanced configurations, like multi-modal inference, parallel inference on the same model, extending context length with flash attention and more

Connecting QwenCode to local models

Install QwenCode
And let's use it with Qwen3 Coder 30B Instruct locally (I recommend having at least 24GB of VRAM for this one 😅)

Here is my llama swap config:

macros:
  "Qwen3-Coder-30B-A3B-Instruct": >
    llama-server \
      --api-key qwen \
      --port ${PORT} \
      -ngl 80 \
      --ctx-size 110000 \
      --temp 0.7 \
      --top-p 0.8 \
      --top-k 20 \
      --min-p 0 \
      --repeat-penalty 1.05 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --no-webui \
      --timeout 300 \
      --flash-attn on \
      --alias Qwen3-coder-instruct \
      --jinja \
      -m /home/avatsaev/models/qwen/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf

models:
  "Qwen3-coder":
    cmd: |
      ${Qwen3-Coder-30B-A3B-Instruct}
    ttl: 3600

I'm using Unsloth's Dynamic quants at Q4 with flash attention and extending the context window to 100k tokens (with --cache-type-k and --cache-type-v flags), this is right at the edge of 24GBs of vram of my RTX3090.

You can download qwen coder ggufs here

For a test scenario let's create a very simple react app in typescript

Create an empty project folder ~/qwen-code-test
Inside this folder create an .env file with the following contents:

OPENAI_API_KEY="qwen"
OPENAI_BASE_URL="http://localhost:8083/v1"
OPENAI_MODEL="Qwen3-coder"

cd into the test directory and start qwen code:

cd ~/qwen-code-test 
qwen

make sure that the model is correctly set from your .env file:

I've installed Qwen Code Companion extenstion in VS Code, and here are the results, a fully local coding agent running in VS Code 😁

Follow for more, like, and let me know how you will use this setup in your workflows in the comments.

And if you have any questions, feel free to ask.

Local Intelligence: How to set up a local GPT Chat for secure & private document analysis workflow

Aslan Vatsaev — Mon, 03 Jun 2024 09:21:32 +0000

Intro

In this article, I'll walk you through the process of installing and configuring an Open Weights LLM (Large Language Model) locally such as Mistral or Llama3, equipped with a user-friendly interface for analysing your documents using RAG (Retrieval Augmented Generation). This setup allows you to analyse your documents without sharing your private and sensitive data with third-party AI providers such as OpenAI, Microsoft, Google, etc.

Prerequisites

You can use pretty much any machine you want, but it's preferable to use a machine a dedicated GPU or Apple Silicon (M1,M2,M3, etc) for faster inference.
Docker must be preinstalled

Installation

Ollama

Ollama is a service that allows us to easily manage and run local open weights models such as Mistral, Llama3 and more (see the full list of available models).
Ollama installation is pretty straight forward just download it from the official website and run Ollama, no need to do anything else besides the installation and starting the Ollama service.

Installing Ollama User Interface

Next step is installing the Ollama User Interface that will run on Docker, so Docker must be installed and running before installing the Ollama UI.

To install the UI simply run the following command in the terminal:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name ollama-webui --restart always -e WEBUI_AUTH=false ghcr.io/open-webui/open-webui:main

This will install and start the Ollama UI webserver locally on address http://localhost:3000/

Download a local model

Now that everything is up and running, we need to download a model.

Good general purpose models as of today (May 2024) are Llama3 (from Meta) and Mistral, in this article, I'll show how to install Mistral Instruct.

Go to Ollama library: https://ollama.com/library and type "mistral" in the search bar, then click on the first result:

Pick the instruct variant in the dropdown menu:

And copy the name and the tag of the model from the right side (don't copy the entire command just the model_name:tag part):

In the Ollama UI, click on the username, the bottom left corner, to display the pop over menu, click on "Settings":

Then click on "Models" on the sidebar. This form below allows us to download any model that Ollama supports

Paste the model tag mistral:instruct in the text field and click download:

The model installation is the same for any other models in the Ollama Library

Chat with the model

Once the model is downloaded, you can select it and set it as default:

Let's see if everything works by sending a message to the model:

Great! The model is loaded and running without any issues 🎉🥳

Now we can do some interesting things with it.

Analyse documents and data - RAG (Retrieval Augmented Generation)

You can upload documents and ask questions related to these documents, not only that, you can also provide a publicly accessible Web URL and ask the model questions about the contents of the URL (an online documentation for example). All files you add to the chat will always remain on your machine and won't be sent to the cloud.

Working with a PDF document example

Click the "+" icon in the chat and pick any PDF document you want:

I've uploaded the "Attention All you need" paper as a PDF document, and asked a specific question related to this document:

"What is the purpose of multi head attention mechanism?"

Let's check if the RAG worked correctly by looking into the original PDF document:

The RAG system was able to pinpoint the relevant part of the paper in order to answer the question 🎉

Ask questions about the contents of a Web Page

The URL of the web page must be publicly accessible, if you need to authenticate in order to view the page, the RAG won't work, so if you need to analyse a web page protected by auth, a workaround would be to first download it as PDF and upload it as a simple document.

In the chat field type # followed by a URL, for this example I'll use Doctolib's FAQ about handling relatives in your Doctolib account:

Saving the documents to your Workspace

You can also save your most often used documents in your workspace so you don't have to upload them every time, for that, click on "Workspace". go "Documents" tab, and upload your files here:

Later when you want to work with your documents, just go to chat, and type # in the message fields, you'll be presented with all documents from your work space, you can chose to work with one specific document or all of them in a single chat session:

This is just scratching the surface, the Ollama UI can be configured to make the retrieval even more performant with some tricks. If you're interested in advanced configuration and usage of this workflow let me know in the comments.

Simple state management in Angular with only Services and RxJS

Aslan Vatsaev — Sun, 10 Feb 2019 20:55:07 +0000

One of the most challenging things in software development is state management. Currently there are several state management libraries for Angular apps: NGRX, NGXS, Akita... All of them have different styles of managing state, the most popular being NGRX, which pretty much follows the FLUX/Redux principles from React world (basically using one way data flow and immutable data structures but with RxJS observable streams).

But what if you don't want to learn, setup, use an entire state management library, and deal with all the boilerplate for a simple project, what if you want to manage state by only using tools you already know well as an Angular developer, and still get the performance optimisations and coherency that state management libraries provide (On Push Change Detection, one way immutable data flow).

DISCLAIMER: This is not a post against state management libraries. We do use NGRX at work, and it really helps us to manage very complex states in very big and complex applications, but as I always say, NGRX complicates things for simple applications, and simplifies things for complex applications, keep that in mind.

In this write up, I'll show you a simple way of managing state by only using RxJS and Dependency Injection, all of our component tree will use OnPush change detection strategy.

Imagine we have simple Todo app, and we want to manage its state, we already have our components setup and now we need a service to manage the state, let's create a simple Angular Service:

// todos-store.service.ts

@Injectable({provideIn: 'root'})
export class TodosStoreService {


}

So what we need is, a way to provide a list of todos, a way to add todos, remove, filter, and complete them, we'll use getters/setters and RxJS's Behaviour Subject to do so:

First we create ways to read and write in todos:

// todos-store.service.ts

@Injectable({provideIn: 'root'})
export class TodosStoreService {

  // - We set the initial state in BehaviorSubject's constructor
  // - Nobody outside the Store should have access to the BehaviorSubject 
  //   because it has the write rights
  // - Writing to state should be handled by specialized Store methods (ex: addTodo, removeTodo, etc)
  // - Create one BehaviorSubject per store entity, for example if you have TodoGroups
  //   create a new BehaviorSubject for it, as well as the observable$, and getters/setters
  private readonly _todos = new BehaviorSubject<Todo[]>([]);

  // Expose the observable$ part of the _todos subject (read only stream)
  readonly todos$ = this._todos.asObservable();


  // the getter will return the last value emitted in _todos subject
  get todos(): Todo[] {
    return this._todos.getValue();
  }


  // assigning a value to this.todos will push it onto the observable 
  // and down to all of its subsribers (ex: this.todos = [])
  private set todos(val: Todo[]) {
    this._todos.next(val);
  }

  addTodo(title: string) {
    // we assaign a new copy of todos by adding a new todo to it 
    // with automatically assigned ID ( don't do this at home, use uuid() )
    this.todos = [
      ...this.todos, 
      {id: this.todos.length + 1, title, isCompleted: false}
    ];
  }

  removeTodo(id: number) {
    this.todos = this.todos.filter(todo => todo.id !== id);
  }


}

Now let's create a method that will allow us to set todo's completion status:

// todos-store.service.ts


setCompleted(id: number, isCompleted: boolean) {
  let todo = this.todos.find(todo => todo.id === id);

  if(todo) {
    // we need to make a new copy of todos array, and the todo as well
    // remember, our state must always remain immutable
    // otherwise, on push change detection won't work, and won't update its view    

    const index = this.todos.indexOf(todo);
    this.todos[index] = {
      ...todo,
      isCompleted
    }
    this.todos = [...this.todos];
  }
}

And finally an observable source that will provide us with only completed todos:

// todos-store.service.ts

// we'll compose the todos$ observable with map operator to create a stream of only completed todos
readonly completedTodos$ = this.todos$.pipe(
  map(todos => todos.filter(todo => todo.isCompleted))
)

Now, our todos store looks something like this:

// todos-store.service.ts


@Injectable({providedIn: 'root'})
export class TodosStoreService {

  // - We set the initial state in BehaviorSubject's constructor
  // - Nobody outside the Store should have access to the BehaviorSubject 
  //   because it has the write rights
  // - Writing to state should be handled by specialized Store methods (ex: addTodo, removeTodo, etc)
  // - Create one BehaviorSubject per store entity, for example if you have TodoGroups
  //   create a new BehaviorSubject for it, as well as the observable$, and getters/setters
  private readonly _todos = new BehaviorSubject<Todo[]>([]);

  // Expose the observable$ part of the _todos subject (read only stream)
  readonly todos$ = this._todos.asObservable();


  // we'll compose the todos$ observable with map operator to create a stream of only completed todos
  readonly completedTodos$ = this.todos$.pipe(
    map(todos => todos.filter(todo => todo.isCompleted))
  )

  // the getter will return the last value emitted in _todos subject
  get todos(): Todo[] {
    return this._todos.getValue();
  }


  // assigning a value to this.todos will push it onto the observable 
  // and down to all of its subsribers (ex: this.todos = [])
  private set todos(val: Todo[]) {
    this._todos.next(val);
  }

  addTodo(title: string) {
    // we assaign a new copy of todos by adding a new todo to it 
    // with automatically assigned ID ( don't do this at home, use uuid() )
    this.todos = [
      ...this.todos, 
      {id: this.todos.length + 1, title, isCompleted: false}
    ];
  }

  removeTodo(id: number) {
    this.todos = this.todos.filter(todo => todo.id !== id);
  }

  setCompleted(id: number, isCompleted: boolean) {
    let todo = this.todos.find(todo => todo.id === id);

    if(todo) {
      // we need to make a new copy of todos array, and the todo as well
      // remember, our state must always remain immutable
      // otherwise, on push change detection won't work, and won't update its view
      const index = this.todos.indexOf(todo);
      this.todos[index] = {
        ...todo,
        isCompleted
      }
      this.todos = [...this.todos];
    }
  }

}

Now our smart components can access the store and manipulate it easily:

(PS: Instead of managing immutability by hand, I'd recommend using something ImmutableJS)

// app.component.ts

export class AppComponent  {
  constructor(public todosStore: TodosStoreService) {}
}

<!-- app.component.html -->

<div class="all-todos">

  <p>All todos</p>

  <app-todo 
    *ngFor="let todo of todosStore.todos$ | async"
    [todo]="todo"
    (complete)="todosStore.setCompleted(todo.id, $event)"
    (remove)="todosStore.removeTodo($event)"
  ></app-todo>
</div>

And here is the complete and final result:

Full example on StackBlitz with a real REST API

This is a scalable way of managing state too, you can easily inject other store services into each other by using Angular's powerful DI system, combine their observables with pipe operator to create more complex observables, and inject services like HttpClient to pull data from your server for example. No need for all the NGRX boilerplate or installing other State Management libraries. Keep it simple and light when you can.

Follow me on Twitter for more interesting Angular related stuff: https://twitter.com/avatsaev

Create efficient Angular Docker images with Multi Stage Builds

Aslan Vatsaev — Tue, 13 Nov 2018 19:39:49 +0000

In this write up we’ll see how to dockerize an Angular app in an efficient manner with Docker’s Multi-Stage Builds.

At the time of writing this post, I'm using Angular v7

Prerequisites

NodeJS +8
Angular CLI (npm i -g @angular/cli@latest)
Docker +17.05
Basic understanding of Docker and Angular CLI commands

The plan

To dockerize a basic Angular app built with Angular CLI, we need to do the following:

npm install the dependencies (dev dependencies included)
ng build with --prod flag
move the artifacts from dist folder to a publicly accessible folder (via an an nginx server)
Setup an nginx config file, and spin up the http server

We’ll do this in 2 stages:

Build stage: will depend on a Node alpine Docker image
Setup stage: will depend on NGINX alpine Docker image and use the artifacts from the build stage, and the nginx config from our project.

Initialize an empty Angular project

$ ng new myapp

Add a default nginx config

At the root of your Angular project, create nginx folder and create a file named default.conf with the following contents (./nginx/default.conf):

server {

  listen 80;

  sendfile on;

  default_type application/octet-stream;


  gzip on;
  gzip_http_version 1.1;
  gzip_disable      "MSIE [1-6]\.";
  gzip_min_length   1100;
  gzip_vary         on;
  gzip_proxied      expired no-cache no-store private auth;
  gzip_types        text/plain text/css application/json application/javascript application/x-javascript text/xml application/xml application/xml+rss text/javascript;
  gzip_comp_level   9;


  root /usr/share/nginx/html;


  location / {
    try_files $uri $uri/ /index.html =404;
  }

}

Create the Docker file


### STAGE 1: Build ###

# We label our stage as ‘builder’
FROM node:10-alpine as builder

COPY package.json package-lock.json ./

## Storing node modules on a separate layer will prevent unnecessary npm installs at each build

RUN npm ci && mkdir /ng-app && mv ./node_modules ./ng-app

WORKDIR /ng-app

COPY . .

## Build the angular app in production mode and store the artifacts in dist folder

RUN npm run ng build -- --prod --output-path=dist


### STAGE 2: Setup ###

FROM nginx:1.14.1-alpine

## Copy our default nginx config
COPY nginx/default.conf /etc/nginx/conf.d/

## Remove default nginx website
RUN rm -rf /usr/share/nginx/html/*

## From ‘builder’ stage copy over the artifacts in dist folder to default nginx public folder
COPY --from=builder /ng-app/dist /usr/share/nginx/html

CMD ["nginx", "-g", "daemon off;"]

Build the image

$ docker build -t myapp .

Run the container

$ docker run -p 8080:80 myapp

And done, your dockerized app will be accessible at http://localhost:8080

And the size of the image is only ~15.8MB, which will be even less once pushed to a Docker repository.

You can see a full example in this GitHub repository: https://github.com/avatsaev/angular-contacts-app-example