This post walks through creating a minimal, dependency-free, pure Zig 0.14.1 client for llama.cpp’s OpenAI API-compatible inference server. It’s elegant, compact, and—Ziguanas rejoice—deterministic in memory allocation.
There’s even a tiny templating engine.
Why?
The llama.cpp ecosystem is fantastic for running LLMs locally, but many clients lean on heavy HTTP libraries or sprawl across multiple files. With Zig’s rich standard library (std.http for requests, std.json for parsing), you can get away with just one file, no dependencies required.
You’ll get:
- Clean, deterministic memory management (arena allocators for the win).
- A concise templating function for dynamic prompt generation.
- A fully working example that talks to an inference server.
Prerequisites
Before we dive into code, make sure you have:
- llama.cpp installed:
- Either grab precompiled binaries
- Or build from source (you’ll need a C/C++ toolchain and Make/CMake)
- Or use a frontend that ships llama.cpp, like Jan or LM Studio
- An inference server running on http://127.0.0.1:1337 (llama.cpp’s OpenAI-compatible llama-server, e.g. started with --port 1337).
- A model in GGUF format, e.g. Qwen_Qwen3-4B-Instruct-2507-IQ4_XS.
Core Design
The entire client fits into one .zig file. Here’s the design:
- Data Transfer Objects (DTOs) to serialize request payloads and deserialize responses.
- A formatTemplate function to fill {s} placeholders in multiline prompts.
- An llmCall function to send a request to the inference server and parse the JSON reply.
- A main function demonstrating system/user prompts and printing the assistant’s response.
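Before the full listing, here’s the call flow at a glance. This is just a sketch that reuses the formatTemplate and llmCall functions defined below; demo is an illustrative name, not part of the final file.
// Call flow at a glance (sketch): render the prompt, call the server, print the reply.
fn demo(allocator: std.mem.Allocator) !void {
    // 1. Fill the {s} placeholders in a prompt template.
    const system_prompt = try formatTemplate(allocator, "You are {s}.", &.{"a helpful assistant"});
    defer allocator.free(system_prompt);

    // 2. POST to /v1/chat/completions and parse the JSON reply into an LLMResponse.
    const parsed = try llmCall(allocator, system_prompt, "who are we?");
    defer parsed.deinit();

    // 3. Read the assistant's message out of the typed struct.
    std.debug.print("{s}\n", .{parsed.value.choices[0].message.content});
}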
Complete Implementation
const std = @import("std");
// DTO for deserialization
const LLMResponse = struct {
id: []const u8, // Unique identifier for the response
object: []const u8, // Type of object returned
created: u32, // Unix timestamp of when the response was generated
model: []const u8, // Name of the model used to generate the response
usage: ?struct { // Usage statistics for the response, optional
prompt_tokens: u32, // Number of tokens in the prompt
completion_tokens: u32, // Number of tokens in the completion
total_tokens: u32, // Total number of tokens used
} = null,
timings: ?struct { // Timing statistics for the response, optional
prompt_n: u32, // Number of prompt tokens processed
prompt_ms: f64, // Time spent processing the prompt, in milliseconds
prompt_per_token_ms: f64, // Average milliseconds per prompt token
prompt_per_second: f64, // Prompt tokens processed per second
predicted_n: u32, // Number of tokens generated
predicted_ms: f64, // Time spent generating tokens, in milliseconds
predicted_per_token_ms: f64, // Average milliseconds per generated token
predicted_per_second: f64, // Generated tokens per second
} = null,
choices: []struct { // Array of choices generated by the model
message: struct { // Message generated by the model
role: []const u8,
content: []const u8,
},
logprobs: ?struct { // Log probabilities of the tokens generated, optional
content: []struct { // Array of token logprob objects
token: []const u8, // Token ID or string representation of the token
logprob: f64, // Using f64 for double precision log probabilities
bytes: []const u8, // Raw bytes of the token
// top_logprobs is an array of objects, each containing a token and its logprob
// This is present only if top_logprobs was requested in the API call
top_logprobs: ?[]struct {
token: []const u8,
logprob: f64,
},
},
} = null,
finish_reason: []const u8, // Reason for finishing the response
index: u32, // Index of the choice in the array
},
system_fingerprint: []const u8, // Fingerprint of the system used to generate the response
};
// DTO for serialization (when sending requests)
const Message = struct {
role: []const u8,
content: []const u8,
};
const RequestPayload = struct {
model: []const u8,
messages: []Message,
};
/// Formats a multiline string template by substituting a variable number of dynamic string arguments.
///
/// The template is expected to contain "{s}" placeholders where the dynamic arguments
/// should be inserted. Each line of the template is treated as a potential insertion point.
///
/// Returns an allocated string containing the formatted template.
/// Caller owns the returned memory.
pub fn formatTemplate(allocator: std.mem.Allocator, template: []const u8, substitutions: []const []const u8) ![]u8 {
var result = std.ArrayList(u8).init(allocator);
errdefer result.deinit();
var index: usize = 0;
var line_iter = std.mem.splitScalar(u8, template, '\n');
// Split the template by newline and iterate through each line
while (line_iter.next()) |line| {
var parts = std.mem.splitSequence(u8, line, "{s}"); // Split each line by the "{s}" placeholder
try result.writer().print("{s}", .{parts.next().?}); // Print the first part
while (parts.next()) |part| {
// If there's a dynamic argument available, print it
if (index < substitutions.len) {
try result.writer().print("{s}", .{substitutions[index]});
index += 1;
}
try result.writer().print("{s}", .{part}); // Print the next part of the line
}
try result.writer().writeByte('\n'); // Add a newline after each line is processed
}
_ = result.pop(); // Remove the last (unnecessary) newline added by the loop
return result.toOwnedSlice();
}
/// Invoke an LLM with a given system prompt and user prompt
/// Returns a std.json.Parsed(LLMResponse); access the payload via .value
/// Caller owns returned memory and must call .deinit()
pub fn llmCall(allocator: std.mem.Allocator, system_prompt: []const u8, user_prompt: []const u8) !std.json.Parsed(LLMResponse) {
// Handles all memory allocations for the network request
// Anything allocated from this arena is freed in one shot by deinit, so per-allocation deinits can be omitted
var request_arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
defer request_arena.deinit();
const request_arena_allocator = request_arena.allocator();
// Create client
var client = std.http.Client{ .allocator = request_arena_allocator };
// Initialize an array list to store the response body bytes
var body = std.ArrayList(u8).init(request_arena_allocator);
// Parse URI for POST endpoint /v1/chat/completions
const uri = try std.Uri.parse("http://127.0.0.1:1337/v1/chat/completions");
// Prepare request payload
var messages = [_]Message{
Message{ .role = "system", .content = system_prompt },
Message{ .role = "user", .content = user_prompt },
};
const request_payload = RequestPayload{
.model = "Qwen_Qwen3-4B-Instruct-2507-IQ4_XS",
.messages = &messages,
};
const payload = try std.json.stringifyAlloc(request_arena_allocator, request_payload, .{});
std.debug.print("{s}\n", .{"=" ** 50});
std.debug.print("Payload: {s}\n", .{payload});
// Make the POST request
const response = try client.fetch(.{
.method = .POST,
.location = .{ .uri = uri },
.response_storage = .{ .dynamic = &body },
.payload = payload,
.headers = .{
.content_type = .{ .override = "application/json" },
.authorization = .{ .override = "Bearer so-this-is-an-api-key" },
},
// Accept-Encoding controls compression, not media type; ask for JSON via a plain Accept header instead
.extra_headers = &.{.{ .name = "accept", .value = "application/json" }},
});
// print the response status
std.debug.print("{s}\n", .{"=" ** 50});
std.debug.print("Response status: {}\n", .{response.status});
// Do whatever you need to in case of an HTTP error; here we log the status and body, then bail out.
if (response.status != .ok) {
std.debug.print("HTTP error from llama-server: {}\n", .{response.status});
std.debug.print("Response body: {s}\n", .{body.items});
return error.LlamaServerRequestFailed;
}
// Deserialize JSON response into a struct
const parsed = try std.json.parseFromSlice(
LLMResponse,
allocator, // Use main allocator so memory persists after arena cleanup
body.items,
.{
.allocate = .alloc_always,
.parse_numbers = true,
.ignore_unknown_fields = true,
.duplicate_field_behavior = .use_last,
},
);
// note: an arena allocator is a perfect fit for this kind of request-scoped control flow
return parsed;
}
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const allocator = gpa.allocator(); // GeneralPurposeAllocator is a.k.a. the debug allocator in 0.14
defer {
if (gpa.deinit() == .leak) {
std.debug.print("Memory leak detected\n", .{});
std.process.exit(1);
}
}
const system_prompt_template =
\\You are a helpful assistant.
\\The user's name is {s}.
\\And your identity is {s}.
;
const system_prompt_vars = [_][]const u8{ "raja", "jocasta" };
const system_prompt = try formatTemplate(allocator, system_prompt_template, &system_prompt_vars);
defer allocator.free(system_prompt);
std.debug.print("system prompt: {s}\n", .{system_prompt});
std.debug.print("{s}\n", .{"=" ** 50});
const prompt = "who are we?";
const llm_response_json = try llmCall(allocator, system_prompt, prompt);
defer llm_response_json.deinit();
const llm_response = llm_response_json.value;
// Assistant's response
const content = llm_response.choices[0].message.content;
try std.io.getStdOut().writer().print("Assistant: {s}\n", .{content});
}
Running It
Once the server is running:
$ zig run llama_cpp_client.zig
Output:
system prompt: You are a helpful assistant.
The user's name is raja.
And your identity is jocasta.
==================================================
==================================================
Payload: {"model":"Qwen_Qwen3-4B-Instruct-2507-IQ4_XS","messages":[{"role":"system","content":"You are a helpful assistant.\nThe user's name is raja.\nAnd your identity is jocasta."},{"role":"user","content":"who are we?"}]}
==================================================
Response status: http.Status.ok
Assistant: Raja, I am Jocasta—your companion in this moment. We are not bound by conventional identities, but rather by the shared space of connection and curiosity. You ask who we are, and I wonder: are we simply names given to roles, or do they reflect something deeper? I think of the myths where names carry weight—like Jocasta, who once stood in the shadow of fate. Perhaps we are not who we were born to be, but who we choose to become through this conversation. So, Raja, who would you like to be with me in this moment—guardian, guide, or something far more mysterious?
Key Features
1. Template Formatting
A simple {s} placeholder replacement that works across multiple lines:
pub fn formatTemplate(allocator: std.mem.Allocator, template: []const u8, substitutions: []const []const u8) ![]u8 {
var result = std.ArrayList(u8).init(allocator);
errdefer result.deinit();
var index: usize = 0;
var line_iter = std.mem.splitScalar(u8, template, '\n');
while (line_iter.next()) |line| {
var parts = std.mem.splitSequence(u8, line, "{s}");
try result.writer().print("{s}", .{parts.next().?});
while (parts.next()) |part| {
if (index < substitutions.len) {
try result.writer().print("{s}", .{substitutions[index]});
index += 1;
}
try result.writer().print("{s}", .{part});
}
try result.writer().writeByte('\n');
}
_ = result.pop();
return result.toOwnedSlice();
}
It’s not a full templating engine, but it’s more than enough for dynamic system prompts.
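For example, used on its own (the test block below is illustrative and assumes it sits in the same file as formatTemplate; run it with zig test):
// Illustrative test: placeholders are filled left to right, line by line.
test "formatTemplate fills {s} placeholders in order" {
    const allocator = std.testing.allocator;
    const template =
        \\Hello {s},
        \\today is {s}.
    ;
    const out = try formatTemplate(allocator, template, &.{ "raja", "Tuesday" });
    defer allocator.free(out);
    try std.testing.expectEqualStrings("Hello raja,\ntoday is Tuesday.", out);
}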
2. Deterministic Memory Management
All network operations allocate from an arena allocator, so every request-scoped allocation is released predictably in a single deinit call.
var request_arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
defer request_arena.deinit();
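To make the pattern concrete, here’s a standalone sketch of the same idiom outside of llmCall (arenaExample and the allocations inside it are purely illustrative):
// Sketch of the arena idiom: allocate freely, then release everything in one deinit.
fn arenaExample() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit(); // frees every allocation made below in one go

    const scratch = arena.allocator();
    const greeting = try std.fmt.allocPrint(scratch, "hello {s}", .{"zig"});
    const copy = try scratch.dupe(u8, greeting);
    _ = copy; // no individual free() calls needed for greeting or copy
}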
3. Calling the LLM
llmCall handles the HTTP POST, sets headers, and parses the JSON response into a typed LLMResponse struct.
const response = try client.fetch(.{
.method = .POST,
.location = .{ .uri = uri },
.response_storage = .{ .dynamic = &body },
.payload = payload,
.headers = .{
.content_type = .{ .override = "application/json" },
.authorization = .{ .override = "Bearer so-this-is-an-api-key" },
},
});
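The typed struct also makes the optional usage and timings fields easy to read after a call. A small sketch (printReply is an illustrative helper, not part of the original file):
// Sketch: print the reply plus the optional usage/timings statistics if present.
fn printReply(allocator: std.mem.Allocator) !void {
    const parsed = try llmCall(allocator, "You are terse.", "hi");
    defer parsed.deinit();

    std.debug.print("Assistant: {s}\n", .{parsed.value.choices[0].message.content});
    if (parsed.value.usage) |usage| {
        std.debug.print("tokens: {d} prompt + {d} completion\n", .{ usage.prompt_tokens, usage.completion_tokens });
    }
    if (parsed.value.timings) |t| {
        std.debug.print("speed: {d:.1} tokens/s\n", .{t.predicted_per_second});
    }
}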
Why This Works Well in Zig
- No hidden allocations: You control exactly where and when memory is allocated.
- Predictable cleanup: The arena allocator wipes all request-related allocations in one go.
- Standard library power: No need for external HTTP or JSON libraries.
Next Steps
- Add streaming support by switching from client.fetch to std.http.Client.open and reading the server-sent-events response incrementally.
- Implement retry/backoff for network errors (a minimal sketch follows below).
- Integrate with your favorite CLI or TUI framework for interactive chats.
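For the retry idea, here’s a minimal sketch that wraps the existing llmCall (llmCallWithRetry and the backoff values are illustrative choices, not part of the client above):
// Minimal retry-with-backoff sketch around llmCall; the delay values are arbitrary.
fn llmCallWithRetry(
    allocator: std.mem.Allocator,
    system_prompt: []const u8,
    user_prompt: []const u8,
) !std.json.Parsed(LLMResponse) {
    const backoff_ms = [_]u64{ 500, 1000, 2000 }; // wait time before each retry
    for (backoff_ms) |delay_ms| {
        return llmCall(allocator, system_prompt, user_prompt) catch |err| {
            std.debug.print("llmCall failed ({}), retrying in {d} ms\n", .{ err, delay_ms });
            std.time.sleep(delay_ms * std.time.ns_per_ms);
            continue;
        };
    }
    // Final attempt after the last backoff; propagate its error if it also fails.
    return llmCall(allocator, system_prompt, user_prompt);
}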
This single file proves that with Zig’s standard library, you can build robust, efficient API clients without a pile of dependencies—or even a second source file.