<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rohan Sircar</title>
    <description>The latest articles on DEV Community by Rohan Sircar (@rohan-sircar).</description>
    <link>https://dev.to/rohan-sircar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3212852%2F940645d8-6627-492a-908b-1cb77f8bd400.jpeg</url>
      <title>DEV Community: Rohan Sircar</title>
      <link>https://dev.to/rohan-sircar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rohan-sircar"/>
    <language>en</language>
    <item>
      <title>Unlocking the Power of LLMs on OpenSUSE with AMD: A Step-by-Step Guide to Installing ROCm and Compiling llama.cpp</title>
      <dc:creator>Rohan Sircar</dc:creator>
      <pubDate>Thu, 12 Jun 2025 08:19:09 +0000</pubDate>
      <link>https://dev.to/rohan-sircar/unlocking-the-power-of-llms-on-opensuse-with-amd-a-step-by-step-guide-to-installing-rocm-and-1doe</link>
      <guid>https://dev.to/rohan-sircar/unlocking-the-power-of-llms-on-opensuse-with-amd-a-step-by-step-guide-to-installing-rocm-and-1doe</guid>
      <description>&lt;h4&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Large Language Models (LLMs) are transforming industries by enabling advanced natural language processing, automation, and decision-making. &lt;br&gt;
Running these models locally on custom hardware setups not only provides better control over performance but also ensures data privacy. If you have an AMD GPU and thought you’d need an NVIDIA GPU to run LLMs efficiently, think again! &lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;ROCm (Radeon Open Compute)&lt;/strong&gt;, you can harness the power of AMD GPUs for LLM inference at a fraction of the cost. 💪&lt;/p&gt;

&lt;p&gt;In this guide, I’ll walk you through setting up ROCm and compiling &lt;strong&gt;llama.cpp&lt;/strong&gt; on an &lt;strong&gt;OpenSUSE&lt;/strong&gt; system. Whether you're a developer, researcher, or AI enthusiast, this tutorial will equip you with the skills to run LLMs efficiently on your AMD GPU.&lt;/p&gt;


&lt;h4&gt;
  
  
  &lt;strong&gt;Section 1: Why ROCm and llama.cpp?&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ROCm Overview:&lt;/strong&gt; ROCm is AMD’s open software platform for GPU computing, designed to accelerate machine learning, AI, and high-performance computing workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a cost-effective alternative to NVIDIA’s CUDA, enabling AMD GPU users to run LLMs with GPU acceleration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp Introduction:&lt;/strong&gt; llama.cpp is a lightweight and efficient implementation of LLMs, optimized for both CPU and GPU execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s open source, highly customizable, and supports various quantization techniques to maximize performance, with solid ROCm support for running LLMs on AMD GPUs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why AMD GPUs?&lt;/strong&gt; AMD GPUs like the RX 7900 XTX offer exceptional value for LLM workloads.
With 24GB of VRAM and ~960 GB/s memory bandwidth, the RX 7900 XTX can handle larger, more sophisticated models while delivering impressive token generation speeds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NVIDIA GPUs like the RTX 4090 (24GB VRAM) cost 2.5 times as much, while the RTX 4080 (16GB VRAM) not only costs more but also limits your ability to run larger models. When it comes to LLMs, VRAM amount and memory bandwidth are the most critical factors—AMD GPUs excel on both fronts.&lt;/p&gt;
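&lt;p&gt;To see why VRAM matters so much, a back-of-the-envelope estimate of a model's weight footprint is params × bits-per-weight ÷ 8. The numbers below are illustrative (bits/weight varies by quant, and the KV cache and activations need extra headroom on top):&lt;/p&gt;

```shell
# Rough VRAM needed for a model's weights alone: params * bits-per-weight / 8.
# 4.5 bits/weight is a ballpark figure for a Q4_K_M quant; adjust for your quant.
params_b=32   # model size in billions of parameters
bits=4.5      # approximate bits per weight
awk -v p="$params_b" -v b="$bits" \
  'BEGIN { printf "~%.1f GB of VRAM for weights alone\n", p * 1e9 * b / 8 / 1e9 }'
```

&lt;p&gt;A 32B model at ~4.5 bits/weight needs roughly 18 GB just for weights, which fits in the 7900 XTX's 24GB but not in a 16GB card.&lt;/p&gt;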

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why OpenSUSE?&lt;/strong&gt; OpenSUSE is known for its stability, flexibility, and robust package management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its rolling-release packaging model keeps packages up to date, making it an ideal platform for experimenting with AI technologies, where the underlying dependencies and inference engines are constantly being updated.&lt;/p&gt;


&lt;h4&gt;
  
  
  &lt;strong&gt;Section 2: Preparing Your System&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Before diving into the installation, ensure your system meets the following requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A recent AMD GPU (e.g., Radeon RX 7900 series for RDNA3 or RX 6800 series for RDNA2).&lt;/li&gt;
&lt;li&gt;At least 16GB of system RAM (the more the better).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software:&lt;/strong&gt; OpenSUSE with root access and basic development tools installed.&lt;/li&gt;
&lt;/ul&gt;
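&lt;p&gt;Before adding repositories, it can save time to confirm that the kernel actually sees an AMD GPU and that the &lt;code&gt;amdgpu&lt;/code&gt; driver is loaded. A quick read-only check using standard Linux tools (not openSUSE-specific):&lt;/p&gt;

```shell
# List GPUs known to the PCI bus (if lspci is available).
if command -v lspci >/dev/null; then
    lspci | grep -iE 'vga|display|3d'
fi

# Check whether the amdgpu kernel driver is loaded.
if grep -qs '^amdgpu' /proc/modules; then
    echo "amdgpu driver loaded"
else
    echo "amdgpu driver not loaded; fix this before installing ROCm"
fi
```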


&lt;h4&gt;
  
  
  &lt;strong&gt;Section 3: Step-by-Step Installation Guide&lt;/strong&gt;
&lt;/h4&gt;
&lt;h5&gt;
  
  
  &lt;strong&gt;Step 1: Adding the ROCm Repository&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Add the ROCm repository to your OpenSUSE system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;zypper addrepo https://download.opensuse.org/repositories/science:GPU:ROCm/openSUSE_Factory/science:GPU:ROCm.repo
&lt;span class="nb"&gt;sudo &lt;/span&gt;zypper refresh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Step 2: Installing ROCm System Dependencies&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Install the necessary ROCm packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;zypper &lt;span class="k"&gt;in &lt;/span&gt;hipblas-common-devel rocminfo rocm-hip rocm-hip-devel rocm-cmake rocm-smi libhipblas2-devel librocblas4-devel libcurl-devel libmtmd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
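&lt;p&gt;Once the packages are installed, &lt;code&gt;rocminfo&lt;/code&gt; should list your GPU's architecture ID; the &lt;code&gt;gfx&lt;/code&gt; value it prints is exactly what you will pass to cmake in Step 4. A small sketch, assuming &lt;code&gt;rocminfo&lt;/code&gt; is on your PATH after the install above:&lt;/p&gt;

```shell
# Print the gfx architecture IDs ROCm can see (e.g. gfx1100 for a 7900 XTX).
if command -v rocminfo >/dev/null; then
    rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
else
    echo "rocminfo not found; re-check the package installation"
fi
```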



&lt;h5&gt;
  
  
  &lt;strong&gt;Step 3: Cloning the llama.cpp Repository&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Download the llama.cpp repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggml-org/llama.cpp.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Step 4: Setting Up the Build Environment&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Specify your AMD GPU target (e.g., &lt;code&gt;gfx1100&lt;/code&gt; for RDNA3 or &lt;code&gt;gfx1030&lt;/code&gt; for RDNA2) and configure the build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;HIPCXX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;hipconfig &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/clang"&lt;/span&gt; &lt;span class="nv"&gt;HIP_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;hipconfig &lt;span class="nt"&gt;-R&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; cmake &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-B&lt;/span&gt; build-rocm &lt;span class="nt"&gt;-DGGML_HIP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="nt"&gt;-DAMDGPU_TARGETS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gfx1100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Step 5: Compiling llama.cpp&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Build llama.cpp using the specified number of threads (e.g., 16):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build-rocm &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Step 6: Testing the Build&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Download a GGUF model from Hugging Face and test the setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;models
wget &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; qwen2.5-0.5b.gguf https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf
&lt;span class="nb"&gt;cd&lt;/span&gt; ..
./build-rocm/bin/llama-bench &lt;span class="nt"&gt;-m&lt;/span&gt; models/qwen2.5-0.5b.gguf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 1B Q4_K - Medium         | 373.71 MiB |   494.03 M | ROCm       |  99 |         pp512 |    30061.00 ± 201.52 |
| qwen2 1B Q4_K - Medium         | 373.71 MiB |   494.03 M | ROCm       |  99 |         tg128 |        271.27 ± 0.74 |

build: dc39a5e7 (5169)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  &lt;strong&gt;Section 4: Enhanced Flash Attention Using rocWMMA&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Flash Attention (FA) is a memory-efficient attention algorithm that enables larger context windows and better performance. To enhance FA performance, we need to build llama.cpp with the &lt;strong&gt;rocWMMA&lt;/strong&gt; library.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;Step 1: Cloning and Building rocWMMA&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Clone the rocWMMA repository and build it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ROCm/rocWMMA.git
&lt;span class="nb"&gt;cd &lt;/span&gt;rocWMMA
&lt;span class="nv"&gt;CC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/lib64/rocm/llvm/bin/clang &lt;span class="nv"&gt;CXX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/lib64/rocm/llvm/bin/clang++ &lt;span class="se"&gt;\&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-DROCWMMA_BUILD_TESTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-DROCWMMA_BUILD_SAMPLES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-DOpenMP_CXX_FLAGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-fopenmp"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-DOpenMP_CXX_LIB_NAMES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"omp"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-DOpenMP_omp_LIBRARY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/usr/lib/libgomp1.so"&lt;/span&gt;
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Step 2: Installing rocWMMA&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Install rocWMMA to the system directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/rocm
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nv"&gt;$USER&lt;/span&gt;:users /opt/rocm/
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
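&lt;p&gt;A quick sanity check that the install landed where the llama.cpp build (and the optional patch in Step 3) will look for it:&lt;/p&gt;

```shell
# rocWMMA is header-only as far as the llama.cpp build is concerned;
# this single header tree is all that needs to be in place.
if [ -f /opt/rocm/include/rocwmma/rocwmma.hpp ]; then
    echo "rocWMMA headers installed"
else
    echo "rocwmma.hpp missing; re-run the install target"
fi
```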



&lt;h5&gt;
  
  
  &lt;strong&gt;Step 3: Patching llama.cpp (Optional)&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;If rocWMMA is not detected, apply the provided patch to &lt;code&gt;ggml/src/ggml-hip/CMakeLists.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;diff --git a/ggml/src/ggml-hip/CMakeLists.txt b/ggml/src/ggml-hip/CMakeLists.txt
index 1fe8fe3b..9577203b 100644
--- a/ggml/src/ggml-hip/CMakeLists.txt
+++ b/ggml/src/ggml-hip/CMakeLists.txt
@@ -39,10 +39,12 @@ endif()
 find_package(hip     REQUIRED)
 find_package(hipblas REQUIRED)
 find_package(rocblas REQUIRED)
-if (GGML_HIP_ROCWMMA_FATTN)
-    CHECK_INCLUDE_FILE_CXX("rocwmma/rocwmma.hpp" FOUND_ROCWMMA)
-    if (NOT ${FOUND_ROCWMMA})
-        message(FATAL_ERROR "rocwmma has not been found")
+if(GGML_HIP_ROCWMMA_FATTN)
+    if(EXISTS "/opt/rocm/include/rocwmma/rocwmma.hpp")
+        set(FOUND_ROCWMMA TRUE)
+        include_directories(/opt/rocm/include)
+    else()
+        message(FATAL_ERROR "rocwmma.hpp not found at /opt/rocm/include/rocwmma/rocwmma.hpp")
     endif()
 endif()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this as &lt;code&gt;rocwmma.patch&lt;/code&gt; and apply it with git:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git apply rocwmma.patch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
&lt;strong&gt;Step 4: Building llama.cpp with rocWMMA&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Reconfigure and build llama.cpp with rocWMMA support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;HIPCXX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;hipconfig &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/clang"&lt;/span&gt; &lt;span class="nv"&gt;HIP_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;hipconfig &lt;span class="nt"&gt;-R&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; cmake &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-B&lt;/span&gt; build-wmma &lt;span class="nt"&gt;-DGGML_HIP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="nt"&gt;-DGPU_TARGETS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gfx1100 &lt;span class="nt"&gt;-DGGML_HIP_ROCWMMA_FATTN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="nt"&gt;-DCMAKE_CXX_FLAGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-I/opt/rocm/include/rocwmma"&lt;/span&gt;
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build-wmma &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Step 5: Testing the build&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Run llama-bench with FA enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build-wmma/bin/llama-bench &lt;span class="nt"&gt;-m&lt;/span&gt; models/qwen2.5-0.5b.gguf &lt;span class="nt"&gt;-fa&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./build-wmma/bin/llama-bench -m models/qwen2.5-0.5b.gguf -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 1B Q4_K - Medium         | 373.71 MiB |   494.03 M | ROCm       |  99 |  1 |           pp512 |   31558.16 ± 1501.69 |
| qwen2 1B Q4_K - Medium         | 373.71 MiB |   494.03 M | ROCm       |  99 |  1 |           tg128 |        269.67 ± 0.29 |

build: 053b1539 (5558)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  &lt;strong&gt;Section 5: Optimizing Your Setup&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Check Memory Usage:&lt;/strong&gt; Use the &lt;code&gt;rocm-smi&lt;/code&gt; command to monitor VRAM utilization by the LLM model and adjust the context window length for optimal performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Imatrix Quants:&lt;/strong&gt; Use &lt;strong&gt;imatrix quants&lt;/strong&gt; (e.g., IQ4_XS) for smaller model sizes with minimal quality loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Quantization:&lt;/strong&gt; Leverage cache quantization to minimize memory usage and fit larger context windows within the same GPU memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Resources:&lt;/strong&gt; Engage with the &lt;strong&gt;r/LocalLlama&lt;/strong&gt; subreddit and participate in discussions about the latest LLM advancements, model releases, and optimizations for llama.cpp.&lt;/li&gt;
&lt;/ul&gt;
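&lt;p&gt;As an illustration of the cache-quantization point, here is a sketch of launching &lt;code&gt;llama-server&lt;/code&gt; with a q8_0 KV cache so a larger context window fits in the same VRAM. The flag names assume a recent llama.cpp build and reuse the model path from Step 6; adjust both to your setup:&lt;/p&gt;

```shell
# Serve a model with a quantized (q8_0) KV cache and a 16k context.
# -ngl 99 offloads all layers to the GPU; -fa enables Flash Attention.
SERVER=./build-rocm/bin/llama-server
if [ -x "$SERVER" ]; then
    "$SERVER" -m models/qwen2.5-0.5b.gguf -ngl 99 -c 16384 \
        --cache-type-k q8_0 --cache-type-v q8_0 -fa
else
    echo "llama-server not built yet; complete Section 3 first"
fi
```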




&lt;h4&gt;
  
  
  &lt;strong&gt;Section 6: Why This Matters&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This &lt;strong&gt;llama.cpp + ROCm setup&lt;/strong&gt; enables you to run cutting-edge Large Language Models (LLMs) efficiently on cost-effective AMD hardware. By leveraging this solution, you can build &lt;strong&gt;fully private chatbots&lt;/strong&gt; and &lt;strong&gt;sophisticated Retrieval Augmented Generation (RAG) systems&lt;/strong&gt; that operate entirely on-premises. This ensures that sensitive data is never exposed to public cloud platforms, addressing critical privacy and security concerns for businesses and organizations.&lt;/p&gt;

&lt;p&gt;Additionally, this setup gives you &lt;strong&gt;complete control and customization&lt;/strong&gt; over your AI workflows. You can run fine-tuned models, optimize performance, and tailor the system to meet your specific needs without being constrained by vendor limitations or cloud dependencies.&lt;/p&gt;
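&lt;p&gt;Concretely, a running &lt;code&gt;llama-server&lt;/code&gt; instance from this build exposes an OpenAI-compatible HTTP API (port 8080 by default), so existing chatbot and RAG tooling can point at it instead of a cloud endpoint. A minimal sketch, assuming a server is already running locally:&lt;/p&gt;

```shell
# Query a locally running llama-server over its OpenAI-compatible API.
# Assumes the server is listening on localhost:8080 (llama-server's default).
curl -s --max-time 5 http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Summarize ROCm in one sentence."}]}' \
  || echo "no server reachable on localhost:8080"
```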




&lt;h4&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In this guide, we’ve walked through setting up ROCm, compiling llama.cpp, and enabling Flash Attention on OpenSUSE using an AMD GPU. With AMD’s cost-effective hardware and powerful software tools, you can run state-of-the-art LLMs without breaking the bank. These skills not only enhance your technical repertoire but also open doors to exciting opportunities in AI and machine learning.&lt;/p&gt;

&lt;p&gt;If you found this guide helpful, feel free to share it on LinkedIn or connect with me for further discussions. Let’s push the boundaries of what’s possible with LLMs—AMD style!&lt;/p&gt;

</description>
      <category>rocm</category>
      <category>opensuse</category>
      <category>llms</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
