DEV Community: Zhongkai Fu

New Book: From Tensors to Tokens: Building a Multimodal LLM Inference Engine from Scratch with TensorSharp and Gemma 4 E4B

Zhongkai Fu — Tue, 21 Jul 2026 02:03:23 +0000

My new book is now available on Amazon! 😀

This book is written for AI application developers who are not satisfied with simply calling an LLM endpoint and want to understand model architectures and the internal workings of inference engines. It uses the open-source TensorSharp project and Google’s Gemma 4 E4B GGUF model as practical examples.

TensorSharp has achieved performance parity with llama.cpp across the main benchmarks, while outperforming it in several scenarios. The book explains some of the key performance optimizations and their implementations, including paged and prefix KV caching, continuous batching, GPU kernel fusion, and more.
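To make the paged KV cache idea concrete, here is a minimal C# sketch of the bookkeeping alone. It is not TensorSharp's implementation: the `PagedKvAllocator` class, its method names, and the block sizes are assumptions chosen for illustration. Each sequence owns a list of fixed-size block ids, so memory is claimed one block at a time rather than reserved up front for the maximum context length.

```csharp
using System;
using System.Collections.Generic;

// Usage: 4 physical blocks, 16 tokens per block.
var cache = new PagedKvAllocator(totalBlocks: 4, blockSize: 16);
for (int i = 0; i < 20; i++) cache.AppendToken(seqId: 1);   // 20 tokens span 2 blocks
Console.WriteLine($"seq 1 blocks: {cache.BlocksOf(1).Count}, free: {cache.FreeBlocks}");
cache.Release(seqId: 1);                                     // blocks return to the pool
Console.WriteLine($"free after release: {cache.FreeBlocks}");

// Hypothetical illustration of paged KV-cache bookkeeping (not TensorSharp's code).
public sealed class PagedKvAllocator
{
    private readonly int _blockSize;
    private readonly Stack<int> _free = new();
    private readonly Dictionary<int, List<int>> _blockTable = new();   // seqId -> block ids
    private readonly Dictionary<int, int> _tokenCount = new();         // seqId -> tokens stored

    public PagedKvAllocator(int totalBlocks, int blockSize)
    {
        if (totalBlocks <= 0 || blockSize <= 0) throw new ArgumentOutOfRangeException();
        _blockSize = blockSize;
        for (int id = totalBlocks - 1; id >= 0; id--) _free.Push(id);
    }

    public int FreeBlocks => _free.Count;

    public IReadOnlyList<int> BlocksOf(int seqId) =>
        _blockTable.TryGetValue(seqId, out var blocks)
            ? blocks
            : (IReadOnlyList<int>)Array.Empty<int>();

    // Reserves room for one more token and returns (physical block id, offset in block).
    public (int Block, int Offset) AppendToken(int seqId)
    {
        if (!_blockTable.TryGetValue(seqId, out var blocks))
        {
            blocks = new List<int>();
            _blockTable[seqId] = blocks;
            _tokenCount[seqId] = 0;
        }
        int count = _tokenCount[seqId];
        int offset = count % _blockSize;
        if (offset == 0)   // sequence is new or its last block is full: take a fresh block
        {
            if (_free.Count == 0) throw new InvalidOperationException("KV cache is out of blocks.");
            blocks.Add(_free.Pop());
        }
        _tokenCount[seqId] = count + 1;
        return (blocks[^1], offset);
    }

    // Returns all blocks of a finished sequence to the free pool.
    public void Release(int seqId)
    {
        if (!_blockTable.Remove(seqId, out var blocks)) return;
        foreach (int id in blocks) _free.Push(id);
        _tokenCount.Remove(seqId);
    }
}
```

A real engine would also store the key/value tensors per block and could share blocks between sequences with a common prefix; this sketch only shows why allocation granularity is a block rather than a whole context window.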

I chose Gemma 4 E4B, a dense model, because it is a compact multimodal model that supports images, audio, and video, making it suitable for a wide range of devices. TensorSharp also supports and is optimized for MoE and diffusion architectures, as well as model families such as Qwen and GPT-OSS. However, due to limitations in time and book length, these topics are not covered in this edition. Those interested can explore the project directly on GitHub or contact me for further discussion.

I selected GGUF because it is an inference- and edge-device-friendly model format. This is particularly relevant to the .NET ecosystem, where local applications, mobile applications, and game development are important use cases. TensorSharp also supports the Safetensors format, which it currently uses for VAE and LoRA models.

For clarity and ease of understanding, the book primarily presents the CPU code path. In practice, however, TensorSharp supports and is extensively optimized for multiple GPU backends, including NVIDIA CUDA, Apple Metal/MLX, and Vulkan for AMD, Intel, and other devices. More implementation details are available in the GitHub repository.

TensorSharp and this book focus exclusively on model inference. For model training, readers can refer to Seq2SeqSharp, one of my earlier open-source projects. Adding a trainer to TensorSharp itself would not be particularly difficult, but building a modern, efficient architecture that unifies training and inference is a much larger undertaking. I do not currently have sufficient resources to explore that topic properly, so it remains outside the project’s scope for now.

One more point: TensorSharp is a project built with the assistance of a native coding agent. Throughout the book, I have also incorporated my own views on how code-agent-driven projects should be developed and managed—perhaps a little personal perspective woven into the technical content.

The philosophy can be summarized as:

Contracts as the source of truth.
Test-driven development.
Evaluation before optimization.

Here is the book link on Amazon: From Tensors to Tokens: Building a Multimodal LLM Inference Engine from Scratch with TensorSharp and Gemma 4 E4B

Here is the TensorSharp GitHub repo: https://github.com/zhongkaifu/TensorSharp

Virtual Clothes Try-On by TensorSharp

Zhongkai Fu — Mon, 13 Jul 2026 06:33:07 +0000

The video shows a virtual clothes try-on demo built with TensorSharp, using the Unsloth Qwen Image Edit 2511 models.

Here are the models used in this demo:

Model family	Component	Hugging Face repo	File / notes
Qwen-Image-Edit	MMDiT DiT (the `--model` GGUF)	unsloth/Qwen-Image-Edit-2511-GGUF	e.g. `qwen-image-edit-2511-Q4_K_M.gguf`
Qwen-Image-Edit	Qwen-Image VAE (required)	QuantStack/Qwen-Image-Edit-GGUF	`VAE/Qwen_Image-VAE.safetensors` (place next to the DiT or pass `--qwen-image-vae`)
Qwen-Image-Edit	Qwen2.5-VL-7B text encoder (required)	unsloth/Qwen2.5-VL-7B-Instruct-GGUF	Optional vision mmproj: `mmproj-BF16.gguf` (same repo) for image-grounded edits
Qwen-Image-Edit	Lightning LoRA (optional, 4/8-step)	lightx2v/Qwen-Image-Edit-2511-Lightning	`Qwen-Image-Edit-2511-Lightning-4steps-V1.0-bf16.safetensors` via `--qwen-image-lora`

TensorSharp.Server (an OpenAI/Ollama-compatible API endpoint plus a web chat UI) can be launched with this command line:

TensorSharp.Server.exe --model c:\Works\models\qwen-image-edit-2511-Q4_K_M.gguf --qwen-image-vae c:\Works\models\Qwen_Image-VAE.safetensors --qwen-image-vl c:\Works\models\qwen-image-te-Qwen2.5-VL-7B-Q4_K_M.gguf --qwen-image-mmproj c:\works\models\Qwen2.5-VL-7B-mmproj-BF16.gguf --backend ggml_cuda --qwen-image-lora c:\Works\models\Qwen-Image-Edit-2511-Lightning-8steps-V1.0-bf16.safetensors

Here are the benchmark results compared with stable-diffusion.cpp:

Image editing (stable-diffusion)

Same input image, prompt, resolution, step count, cfg and seed for every engine. Timings are each engine's own pipeline timers (TensorSharp's [pipe-timing] phases + server elapsedSeconds; sd.cpp's phase logs + generate_image total), so weight-file loading and HTTP/process overhead are excluded on both sides. total (warm) is the steady-state request on an already-running server; first request (cold) additionally pays TensorSharp's per-request DiT rebuild + graph capture on a fresh server (a CLI engine has no such distinction). Lower is better.

Qwen-Image-Edit 2511 (Q2_K DiT + Lightning 4-step LoRA) — image_edit on CUDA, 544x1184, 4 steps

Engine	total (warm)	per step	sampling	text encode	VAE encode	VAE decode	first request (cold)
TensorSharp	40.44 s	7.57 s	30.27 s	7.45 s	0.54 s	1.51 s	54.11 s
stable-diffusion.cpp	48.16 s	9.43 s	37.73 s	4.47 s	1.92 s	2.57 s	—

TensorSharp vs stable-diffusion.cpp (ratio = stable-diffusion.cpp time / TensorSharp time; > 1.0× = TensorSharp faster): total (warm) 1.19×, per step 1.25×, sampling 1.25×, text encode 0.60×, VAE encode 3.56×, VAE decode 1.70×
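The ratios above can be reproduced from the table with a few lines of C#. This is only a sketch of the arithmetic; the `Speedup` helper is a hypothetical name and not part of TensorSharp.

```csharp
using System;
using System.Globalization;

// Timings in seconds, copied from the benchmark table above.
var phases = new (string Name, double TensorSharp, double SdCpp)[]
{
    ("total (warm)", 40.44, 48.16),
    ("per step",      7.57,  9.43),
    ("sampling",     30.27, 37.73),
    ("text encode",   7.45,  4.47),
    ("VAE encode",    0.54,  1.92),
    ("VAE decode",    1.51,  2.57),
};

foreach (var p in phases)
{
    Console.WriteLine($"{p.Name}: {Speedup.Format(Speedup.Ratio(p.TensorSharp, p.SdCpp))}");
}

public static class Speedup
{
    // ratio = baseline time / candidate time; above 1.0 means the candidate is faster.
    public static double Ratio(double candidateSeconds, double baselineSeconds) =>
        baselineSeconds / candidateSeconds;

    // Two decimals with an invariant decimal point, e.g. "1.19x".
    public static string Format(double ratio) =>
        ratio.ToString("0.00", CultureInfo.InvariantCulture) + "x";
}
```

Text encode is the one phase where the ratio drops below 1.0, which is why the warm total gains less than the sampling phase does.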

It also performs on par with llama.cpp on autoregressive LLM models. Details are here: https://github.com/zhongkaifu/TensorSharp/blob/main/docs/engine_comparison_report.md

TensorSharp is an open-source local inference engine for Unsloth (GGUF) LLMs, together with a set of applications. It supports many models from Unsloth, such as Gemma4, DiffusionGemma, Qwen3.6 with multimodal input (image, video, audio), and Qwen Image Edit, as well as reasoning and function/tool calling. It runs on Windows, macOS, and Linux and takes full advantage of the GPU through CUDA, Metal, and Vulkan. The API is compatible with the OpenAI and Ollama interfaces. Its performance is on par with llama.cpp.

This project is not just a C# wrapper around llama.cpp. It implements the entire LLM inference engine from the bottom up. With the CPU backend, execution is 100% pure C# code. Besides the CPU backend, I also implemented CUDA, MLX, and GGML backends, the latter including ggml_cuda, ggml_vulkan, ggml_metal, and ggml_cpu. The GGML backend references the GGML project as an external dependency, and I built a few fused operations on top of it at a higher level.

I learned a lot from other projects and applied those lessons to TensorSharp: paged KV cache and continuous batching from vLLM, an SSD-based cache for MoE models from oMLX, GGUF quantization from llama.cpp, and other optimizations for prefill and decode.

Any feedback and comments are welcome. If you like it, I would really appreciate it if you gave the project a star on GitHub: https://github.com/zhongkaifu/TensorSharp . Thanks in advance.

What Bun’s Rust Rewrite Tells Us About Rebuilding the AI Infrastructure Layer in C#

Zhongkai Fu — Sat, 11 Jul 2026 06:49:06 +0000

I used Google Translate to translate the entire original Chinese blog post and am posting it here. It is an interesting piece of research with insightful thoughts. I hope you like it.

This article is translated from the original blog in Chinese from https://www.cnblogs.com/shanyou/p/21309486

I. Lessons from Bun: System-level software must embrace compiled languages

In late 2025, the Bun team published a blog post that shocked the industry: "Rewriting Bun in Rust". They migrated 535,000 lines of Zig code to Rust in 11 days using 64 Claude instances.

1.1 Why rewrite?

Bun is a JavaScript runtime, and its core challenge lies in the fact that JavaScript is a garbage-collected language, while the runtime requires manual memory management at the underlying level. Zig provides extreme control, but it also introduces structural problems:

node:zlib: use-after-free crash.
node:http2: a re-entrant JS callback invalidated the hashmap.
UDPSocket.sendMany(): out-of-bounds write.
fs.watch(): memory leak caused by a GC root reference count underflow.

These are not edge cases, but structural problems in system-level software. When GC and manual memory management are intertwined, the compiler cannot verify lifetimes, and the team can only rely on the engineer's "extreme caution" and after-the-fact fuzzing/ASAN to remedy the situation.

1.2 Rust's Solution: Turning "Style Guidelines" into "Compiler Errors"

The Bun team tried various solutions and ultimately found that: "Homegrown smart pointers offer worse ergonomics than Rust, with none of the guarantees."

Rust's borrow checker transforms "memory safety style guidelines" into mandatory compile-time constraints. This is not just an improvement in the development experience but a fundamental change in the feedback loop: from "runtime crash → debug → fix" to "compile error → immediate correction".

1.3 Mapping to AI Infrastructure

AI infrastructure faces the same structural problems:

Bun's pain points	The corresponding pain points of AI infrastructure
Zig manual memory management + JS garbage collection hybrid → use-after-free	Python GIL + Dynamic Typing → Runtime Crashes, Memory Leaks, Concurrency Bottlenecks
With 500,000 lines of code, style guidelines are difficult to enforce.	The "glue code" of Python AI frameworks is difficult to maintain.
11 days x 64 Claude rewrites, cost $165,000	The maintenance cost of AI infrastructure increases exponentially with scale.

Key insight: AI inference services are shifting from "lab scripts" to "production infrastructure," and Python's dynamic typing and GIL are becoming system-level bottlenecks.

II. TensorSharp: A Breakthrough Validation of a C# AI Inference Engine

Before discussing the "potential" of C#, let's answer a fundamental question: Does C# already have the strength to compete with C++ in terms of AI inference performance?

The answer is: it already has that strength, and in places it is pulling ahead.

2.1 TensorSharp's Qwen Image Edit 2511 benchmark

TensorSharp is a deep learning inference engine implemented purely in C#, and recently added support for Qwen Image Edit 2511. In comparison with stable-diffusion.cpp (the de facto standard in C++):

Test conditions: CUDA · 544×1184 · 4 steps · Q2_K DiT + Lightning 4-step LoRA · same input, prompt, CFG, and seed

Metric	TensorSharp (C#)	stable-diffusion.cpp (C++)	C# advantage
Total time (warm)	40.44 seconds	48.16 seconds	1.19x faster
Time per step	7.57 seconds	9.43 seconds	1.25x faster
Sampling	30.27 seconds	37.73 seconds	1.25x faster
VAE encoding	0.54 seconds	1.92 seconds	3.56x faster
VAE decoding	1.51 seconds	2.57 seconds	1.70x faster

2.2 Key Breakthrough: C# performance ≈ C++, but engineering capabilities are far superior.

TensorSharp reveals a long-overlooked truth: C# has achieved C++-level performance in AI inference (and even surpasses it in VAE encoding/decoding), while retaining the engineering capabilities for full lifecycle management.

stable-diffusion.cpp and llama.cpp are masterpieces of C++—extremely high performance, but:

No type-safe API contracts.
No native DI container to manage model lifecycles.
No EF Core to manage generation history data.
No OpenTelemetry tracing of inference pipelines.
No .NET Aspire for one-click deployment to Kubernetes.
No Roslyn analyzers to catch configuration errors at compile time.

TensorSharp proves that C# can rival C++ in performance while providing full lifecycle management capabilities that C++ can never offer.

III. C# vs Rust vs Go: Language Selection for AI Infrastructure Layer

Bun's choice of Rust was correct: a JavaScript runtime requires extreme memory control. Go is a "cloud-native language": Kubernetes, Docker, and Istio are all written in Go. But AI infrastructure needs more than just "fast deployment"; it needs complete engineering capabilities "from requirements to evolution."

3.1 Core Differences

Dimension	Go	Rust	C#	Verdict
Memory model	GC (minimalist)	Ownership + borrow checker	GC + Span	C# wins: AI inference does not require extreme memory control, and Span is zero-cost.
Concurrency model	Goroutines	Tokio/async	async/await + TPL	Go wins on simplicity; C# wins on deep ecosystem integration.
Compilation time	Extremely fast (seconds)	Slow (10-30 minutes)	Quick (2-5 minutes)	Go wins: fastest; C# is fast enough.
Binary size	Very small (15MB)	Small (100KB-1MB)	Medium (45MB)	Go wins: smallest; C# is small enough, but more feature-rich.
Kubernetes support	Excellent (client-go)	(kube-rs)	Excellent (.NET K8s + Aspire)	Go wins: Kubernetes itself is written in Go; C# wins: Aspire offers a higher level of abstraction.
Observability	Manually configured OTel-go	Manually configured tracing	Native OTel .NET + Aspire Dashboard	C# wins big: OTel is a first-class citizen.
Database/ORM	GORM/sqlx with manual migrations	Diesel/SeaORM with compile-time verification	EF Core + Code First automated migrations	C# wins big: migrations and LINQ are killer productivity features.
API contract	Gin/Echo + manual validation	Axum/Tonic + manual validation	ASP.NET Core + Source Generator + JSON Schema	C# wins: serialization code is generated at compile time.
Dependency injection	No native implementation, relies on wire	No native support, relies on manual wiring	Native DI + lifecycle management + HostedService	C# wins big: dependency injection (DI) is a core design pattern in .NET.
Deployment toolchain	Docker + manual Kubernetes YAML	Cargo + manual configuration	.NET Aspire one-click Kubernetes generation	C# wins hands down: Aspire is the "cloud-native Spring Boot".
AI ecosystem integration	ONNX Go (community-maintained)	Candle/burn (emerging ecosystem)	ONNX C# (official) + TensorSharp + SK	C# wins hands down: Microsoft's official AI stack.
Development efficiency	2-4 weeks	3-6 months	1-2 weeks	C# wins: familiar GC model, large developer base.
Talent costs	$150K-$180K	$185K-$230K	$130K-$165K	C# wins: abundant talent and controllable costs.
Full lifecycle coverage	40% (coding + deployment)	30% (coding + compilation)	95% (requirements → evolution)	C# wins big: the only one to cover the entire lifecycle.

Conclusion: In the AI infrastructure layer, C# wins or ties in 10+ of the 14 dimensions. Rust leads in only 2 dimensions (concurrency safety, zero memory cost), while Go leads in 3 dimensions (compilation speed, image size, native Kubernetes) but lacks full lifecycle coverage.

3.2 Why should C# (instead of Go/Rust) be chosen for AI infrastructure?

                    Full lifecycle management requirements ↑
                         |
    Python  ←———————————·——————————→  Rust
    (Low control + low management)       |            (High control + low management)
                         |
                         ↓  System-level control requirements

              Go  ←——————·——————→  C#
           (High deployment + medium management)       (Medium control + high management)

                    C#'s sweet spot: upper-right quadrant
                    - 2x more complete full lifecycle management than Go
                    - 3-5x higher development efficiency than Rust
                    - TensorSharp proves performance ≈ C++

Bun chose Rust because a JavaScript runtime requires extreme memory control and deep interoperability with C++ libraries.

Go's positioning: a cloud-native language, but limited to the deployment layer.

OpenClaw chooses C#: AI infrastructure requires full lifecycle management + the native Microsoft ecosystem + team scalability + the C++-level performance proven by TensorSharp.

Go can help you "get the service running," while C# can help you "go from requirements to evolution": domain modeling, API contracts, compile-time checks, automatic migration, distributed tracing, one-click deployment, and the image/text inference engine itself.

IV. Performance Benchmarks: C# Native AOT outperforms Python and is on par with Go/Rust.

4.1 Cold Start and Deployment Efficiency

Language	Cold start (AWS Lambda, 1024MB)	Compared to Python
Python	325ms	baseline
Go	45ms	7.2x faster
Rust	30ms	10.8x faster
C# NativeAOT	35ms	9.3x faster

Deployment form	Image size	Compared to Python
Python AI inference	1,200MB	baseline
Go minimal	15MB	80x smaller
C# NativeAOT	45MB	26.7x smaller

4.2 Inference Throughput

TensorSharp Qwen Image Edit 2511 (CUDA · 544×1184 · 4 steps):

Metric	TensorSharp (C#)	sd.cpp (C++)	C# advantage
Total time (warm)	40.44s	48.16s	1.19x faster
Sampling	30.27s	37.73s	1.25x faster
VAE encoding	0.54s	1.92s	3.56x faster
VAE decoding	1.51s	2.57s	1.70x faster

ONNX Runtime DeepSeek R1 (RTX 4090 CUDA):

Model	PyTorch	ONNX Runtime (C#)	Advantages
DeepSeek 1.5B Int4	49.7 tok/s	313.3 tok/s	6.3x
DeepSeek 7B Int4	43.5 tok/s	161.0 tok/s	3.7x

4.3 Concurrency Performance

Concurrent users	Python RPS	C# RPS	Advantages
100	3,200	9,500	3.0x
500	4,200	42,000	10.0x
1000	4,500	78,000	17.3x

Concurrent users	Python memory	C# Memory	Advantages
1000	25,000MB	1,600MB	15.6x

4.4 General Calculation

Language	1GB JSON processing (AWS Lambda)	Speedup
Python	12,000ms	baseline
Go	3,200ms	3.8x
Rust	2,050ms	5.9x
C# NativeAOT	2,050ms	5.9x

4.5 Compile-time error catching: ∞ times advantage

Error Type	Python	Go	C#	Cost differences
Null reference	Production crash → Investigation → Rollback	panic → recovery	Roslyn compile-time interception	∞
Type mismatch	Runtime TypeError	Compilation error	Compilation error	∞
Resource leak	Memory overflow → Restart	Depends on GC	`using` compile-time checks	∞
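As a rough illustration of the C# column, the sketch below shows nullable reference annotations and a `using` scope. `ModelHandle` and `ModelRegistry` are hypothetical types invented for this example. Note that the null-dereference diagnostic (CS8602) is a warning by default and becomes a build error only when warnings are treated as errors, and that `using` guarantees deterministic disposal at scope exit rather than proving the absence of leaks.

```csharp
#nullable enable
using System;

// The "?" forces the caller to consider the "no model" case before dereferencing.
ModelHandle? maybe = ModelRegistry.Find("missing");
// Console.WriteLine(maybe.Name);   // would raise CS8602: dereference of a possibly null reference
Console.WriteLine(maybe?.Name ?? "(not loaded)");

using (var handle = new ModelHandle("demo"))
{
    Console.WriteLine($"inside scope, disposed = {handle.IsDisposed}");
}   // Dispose() runs here, even if the body throws

// Hypothetical resource wrapper used only for this illustration.
public sealed class ModelHandle : IDisposable
{
    public ModelHandle(string name) => Name = name;
    public string Name { get; }
    public bool IsDisposed { get; private set; }
    public void Dispose() => IsDisposed = true;   // a real handle would free native memory here
}

public static class ModelRegistry
{
    // The nullable return type tells the compiler and callers that null is possible.
    public static ModelHandle? Find(string name) =>
        name == "demo" ? new ModelHandle(name) : null;
}
```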

V. Microsoft Agent Framework: C# is always a "first-class citizen"

In October 2025, Microsoft released the Microsoft Agent Framework (MAF) Public Preview, merging AutoGen and Semantic Kernel into a unified framework.

Evolution Timeline

Time	Milestone	C# positioning
2023	Semantic Kernel first release	First released in C#, with Python to follow.
2024	SK Agent Framework RC	C# first-class citizen
May 2025	Azure AI Foundry GA	Unified runtime
October 2025	MAF Preview	AutoGen + SK merged
2026 Q1	MAF 1.0 GA	Production ready
2026 Q2	Process Framework GA	Deterministic workflow

VI. Token Economics: How C# Compresses Hidden Costs

Cost items	Python	Go	C#	C# optimization
Container Image	1,200MB	15MB	45MB	26.7x
Cold start	3-10s	<100ms	<100ms	30-100x
Concurrency model	GIL → multi-process memory explosion	Goroutines	async/await + thread pool	10x
Runtime error	Production crash	panic	Compile-time capture	∞
Observability	Manual third-party	Manual configuration	OTel native + Aspire	5x
Deployment Configuration	Manual Kubernetes YAML	Manual Kubernetes YAML	Aspire One-Click Generation	10x

TensorSharp has changed the cost model of image generation. The Python stack means a 1.2GB image, a 3-10s cold start, and uncontrollable memory; the C# stack means an image under 100MB, a cold start under 1s, a DiT that is rebuilt once and then reused, and controllable memory. This is exactly the economic basis that TokenHub needs.

VII. OpenClaw.NET: Practice of C# AI Native Infrastructure

┌─────────────────────────────────────────┐
│  Python algorithm layer (compatibility retained)               │
│  · PyTorch training · Jupyter prototyping            │
├─────────────────────────────────────────┤
│  MCP protocol (cross-language boundary)                   │
├─────────────────────────────────────────┤
│  C# AI-native infrastructure layer (OpenClaw.NET)     │
│  · TensorSharp (image/text inference engine)       │
│  · MetaSkill DAG (workflow orchestration)            │
│  · Harness engine (execution runtime)             │
│  · TokenHub (Token economics)               │
│  · AxonHub (data collection/CDC)                │
│  · Semantic Kernel (LLM orchestration)            │
│  · Microsoft Agent Framework (Agent lifecycle)│
│  · ONNX Runtime C# API (general-purpose inference)         │
├─────────────────────────────────────────┤
│  .NET runtime (NativeAOT + managed memory)       │
├─────────────────────────────────────────┤
│  Full lifecycle management layer (Aspire + OTel + EF Core)│
└─────────────────────────────────────────┘

Key design:

MCP protocol: Without rewriting PyTorch, it exposes Python's algorithmic capabilities as a "service" to the C# infrastructure layer.
TensorSharp: A pure C# engine that outperforms C++ sd.cpp, proving that C# is not just "glue" but an "engine".
C# exclusive tier: MetaSkill, Harness, TokenHub, AxonHub, TensorSharp. No Python/Go equivalents.

VIII. Philosophy: From Builder to Agent Leader to Taste

When TensorSharp enables C# developers to build AI engines with C++-level performance, and when .NET Aspire makes one-click deployment the default, a question arises: if "building engines" is no longer a privilege, where does the value of humanity lie?

The answer lies in three progressive concepts: Builder → AI Agent Leader → Taste.

8.1 Builder: Democratizing Tools

The groundbreaking significance of TensorSharp lies not in being 1.19x faster than C++, but in enabling a C# developer to build an image generation engine that surpasses stable-diffusion.cpp without needing to master CUDA kernel programming or understand the mathematical principles of DiT.

In the past: an SDE meant "someone who can write code", a professional skill honed through years of training.
Now: a Builder is "someone who uses code to realize ideas"; code is a means, not an end.

C#'s role: Lowering the barrier to entry for Builders. Aspire + SK + MAF + TensorSharp make "engines for everyone" a reality.

8.2 AI Agent Leader: From Execution to Decision Making

MetaSkill DAG is a perfect metaphor:

MetaSkill DAG defines workflows—not "writing code," but "defining the problem space."
The Harness engine executes workflows: not "debugging code," but "enabling agents to collaborate."
TokenHub tracks economics—not "optimizing performance," but "evaluating input and output."
Humans are responsible for "judging" and "calibrating"—not "fixing bugs," but rather "verifying whether the results match the intent."

In the past, ICs were "task executors": receiving requirements, breaking down tasks, writing code, submitting PRs, addressing review comments, and delivering features.

Now, the Agent Leader is the "decision-maker who defines tasks, selects tools, and evaluates outputs". Faced with an ambiguous business problem, they need to:

Define the problem: "We need a system that can automatically generate marketing images based on user descriptions" (translating business intent).
Select tools: "TensorSharp for image generation, SK for prompt optimization, and TokenHub for cost tracking" (a resource orchestration decision).
Orchestrate agents: "MetaSkill DAG: User Input → Prompt Optimization Agent → Image Generation Agent → Quality Assessment Agent → Output" (designing a collaborative workflow).
Verify results: "Does the image align with the brand's tone? Is the cost within budget? Is user feedback positive?" (value judgment).

C#'s role: Providing agent infrastructure. MetaSkill DAG, Harness, TokenHub, and MAF are not just "tools," but "operating systems for agent collaboration."

8.3 Taste: Humanity's Last Moat

Taste is not "preference," but rather a structured judgment ability that includes three progressive levels:

Technical Taste: "Is this implementation elegant?"

When AI can generate 100 architectural solutions, Taste decides which one is selected:

Is the code structure clear? Is the interface abstraction appropriate?
Architectural Evolvability: How many files need to be modified when requirements change?

In TensorSharp PR #81, the authors chose a specific DiT reconstruction strategy and the timing of CUDA Graph Capture—not the "only right" one, but rather an "elegant balance between performance, memory, and complexity."

Product Taste: "Is this feature worth implementing?"

When AI can generate an infinite number of functions, Taste decides where to allocate resources:

Are the user pain points real? Are the solutions simple?
Return on investment: Is the value of this feature worth the cognitive bandwidth consumed by the team?

In the design of TokenHub, one has to decide: Is tracking the "cost of each generated token" sufficient? Or is the "cumulative cost per user" necessary? Is "cost trend prediction for the next 7 days" necessary? This is product Taste.

Ethical Taste: "Should this technology exist?"

When AI can generate any content, Taste determines where the boundaries lie:

Can image generation systems be used for deepfakes? How can this be prevented?
Will AI services cause unemployment for certain groups? How can this be mitigated?
Does the agent system respect user privacy and autonomy?

KPMG's Clara AI requires an audit trail. That is not a technical requirement but an ethical one: a value judgment that "AI decisions must be explainable and auditable."

C#'s role: Freeing humans to focus on Taste. Aspire automates deployment, TensorSharp automates inference, and MAF automates agent orchestration, liberating humans from "execution" so they can focus on "judgment".

8.4 Evolutionary Logic of the Three-Layer Model

                    Value hierarchy ↑
                         |
    Taste  ←———————————·——————————→  Ethical judgment
    (Aesthetics/value/ethics)      |            (What deserves to exist)
                         |
                         ↓  Degree of automation

    Agent Leader  ←——————·——————→  Decision orchestration
    (Define/select/verify)      |            (Enable agents to collaborate)
                         |
                         ↓  Tool barrier

    Builder  ←———————————·——————————→  Execution and implementation
    (Use code to realize ideas)      |            (Build engines/write services)

This is not "replacement," but "sublimation":

The barrier to entry for Builders continues to decrease, eventually becoming a basic skill like "writing".
The Agent Leader has evolved from an "executor" to a "decision-maker," with its core value shifting from "writing code" to "defining problems, orchestrating agents, and validating results."
Taste is the eternal moat: no matter how powerful AI becomes, the judgment of "what is worth doing" always belongs to humans.

8.5 Philosophical Mapping of OpenClaw.NET

Level	Human role	OpenClaw.NET components	C# toolchain
Builder	Implementing ideas with code	TensorSharp, ONNX Runtime	NativeAOT, Span, Roslyn
Agent Leader	Define the problem, orchestrate the agents, and verify the results.	MetaSkill DAG, Harness, TokenHub	SK, MAF, Aspire
Taste	Determining "what's worth doing"	DDD, JSON-LD Ontology	Strongly typed systems, Nullable, Roslyn

IX. Design Proposal: From Passive Auditing to Proactive Taste Interception

9.1 Current Status of OpenClaw.NET: A Passive Audit Infrastructure

OpenClaw.NET currently implements:

Harness Contracts: inspectable Agent work plans (passive, does not change default behavior).
Evidence Bundles: inspectable runtime evidence, risks, and manual review items (passive, does not change default behavior).
Governance Ledger: persistent record of approval and oversight decisions (passive, does not change default behavior).
Plan-Execute-Verify Mode: proactive governance for high-risk tool execution (but only for security/compliance, not aesthetics/value).
user_input pause point: collects manual input, but not for value judgment.

These capabilities are all after-the-fact review: they record and expose information for manual review, but do not actively intercept based on judgments of "aesthetics/ethics/product value".

9.2 Design Proposal: Taste Review Node – From Passive to Proactive

Based on the existing architecture, a three-layer evolution direction is proposed:

Current state of OpenClaw.NET (implemented):
┌─────────────────────────────────────────┐
│  Passive Harness Contracts              │  ← Inspectable work plans, but no interception
│  Passive Evidence Bundles               │  ← Inspectable runtime evidence, but no interception
│  Passive Governance Ledger              │  ← Inspectable approval records, but no interception
│  Plan-Execute-Verify Mode             │  ← Proactive interception, but only for security/compliance
│  user_input pause point                      │  ← Manual input, but not value judgment
└─────────────────────────────────────────┘
           ↓ Evolution direction
Design proposal (Taste review node):
┌─────────────────────────────────────────┐
│  Active TasteGate feature                   │  ← Proactive interception based on aesthetics/ethics/product value
│  ITasteGate<TInput, TOutput> interface        │  ← Generic constraints verified at compile time
│  TasteDecision enum (Pass/Retry/Abort)  │  ← Output of value judgment
│  Constraint types defined by the Agent Leader              │  ← BrandTaste / EthicalTaste / TechnicalTaste
└─────────────────────────────────────────┘

9.3 Three-Tier Architecture Design Proposal

┌─────────────────────────────────────────┐
│  ① Taste constraint definition layer (led by the Agent Leader)  │
│  · Business intent translation → technical constraints                │
│  · Domain modeling (DDD) → entities, boundaries, aggregate roots    │
│  · Taste presets → aesthetic standards, brand tone, ethical red lines│
│  Output: domain model + Taste constraint document          │
├─────────────────────────────────────────┤
│  ② Review logic execution layer (designed by the Agent Leader + executed by AI)│
│  · MetaSkill DAG → workflow topology           │
│  · Tool selection → TensorSharp? ONNX? SK?    │
│  · Agent role assignment → Prompt/image/quality/cost │
│  · Harness engine → scheduling, state, failure recovery   │
│  Output: executable Agent collaboration graph              │
├─────────────────────────────────────────┤
│  ③ Taste validation layer (final judgment by the Agent Leader)  │
│  · Technical Taste → code structure, API design, evolvability│
│  · Product Taste → user value, brand consistency, return on investment│
│  · Ethical Taste → compliance, social impact, copyright safety  │
│  Output: Pass / Retry / Abort               │
└─────────────────────────────────────────┘

9.4 Example: MetaSkill DAG (Design Proposal) for an AI Marketing Image Generation System

User input (natural language)
    ↓
Taste constraints (brand tone) ←—— Agent Leader preset: "tech blue + minimalism + no people"
    ↓
Budget limit (Token quota) ←—— TokenHub configuration: cost per generation ≤ $0.50
    ↓
┌─────────────────────────────────────────┐
│  Agent layer (autonomous AI execution)                  │
│  · Prompt optimization Agent (SK)                │
│  · Image generation Agent (TensorSharp + CUDA)   │
│  · Quality evaluation Agent (CLIP + aesthetic score)       │
│  · Cost accounting Agent (TokenHub)              │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│  Taste review node (design proposal)                │
│  · Technical Taste → Is the API design intuitive? Is the architecture evolvable?│
│  · Product Taste → Image quality? Brand consistency? User value?│
│  · Ethical Taste → Copyright compliance? Deepfake risk?      │
└─────────────────────────────────────────┘
    ↓
    ├─→ Pass → output image + cost report
    ├─→ Retry → optimize Prompt → regenerate (maximum 3 attempts)
    └─→ Abort → record failure → trigger alert → human intervention

9.5 Key Design Principles (Design Proposal)

Principle 1: The location of the Taste review node is determined by "scope of influence × uncertainty".

Decision type	Influence	Uncertainty	Intervention method	Example
AI is fully autonomous	Low	Low	No review required	API routing, log format, caching strategy
AI-driven + human oversight	High	Low	Asynchronous review	Model version upgrade, automatic scaling
Human-led + AI-assisted	Low	High	Synchronous review	Prompt style, UI color scheme, and copywriting tone
Human intervention is necessary	High	High	Mandatory audit	Taste review, ethical boundaries, and architectural direction
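The table reads naturally as a two-input lookup. A minimal sketch follows; the enum and class names are hypothetical labels for the table's rows and columns, not OpenClaw.NET types.

```csharp
using System;

Console.WriteLine(ReviewPolicy.Decide(Level.High, Level.High));   // prints "MandatoryAudit"

public enum Level { Low, High }

// One value per "intervention method" row in the table above.
public enum Intervention { NoReview, AsynchronousReview, SynchronousReview, MandatoryAudit }

public static class ReviewPolicy
{
    // Maps (scope of influence, uncertainty) to the intervention the table prescribes.
    public static Intervention Decide(Level influence, Level uncertainty) =>
        (influence, uncertainty) switch
        {
            (Level.Low,  Level.Low)  => Intervention.NoReview,
            (Level.High, Level.Low)  => Intervention.AsynchronousReview,
            (Level.Low,  Level.High) => Intervention.SynchronousReview,
            (Level.High, Level.High) => Intervention.MandatoryAudit,
            _ => throw new ArgumentOutOfRangeException(nameof(influence)),
        };
}
```

Encoding the policy as a pure function keeps the placement rule testable and independent of any particular workflow engine.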

Principle 2: The Taste review output is not a binary "Approve/Reject", but a three-way "Pass/Retry/Abort".

Pass: Meets the Taste criteria; proceed to the next stage.
Retry: There is room for improvement; return the work to the upstream Agent for optimization (with a loop limit to prevent infinite retries).
Abort: A Taste red line has been crossed; the failure is recorded, an alert is triggered, and human intervention is required.
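
The three outcomes can be sketched as a bounded review loop. This is a minimal illustration, not code from OpenClaw.NET: `Verdict`, `Decision`, and `run_with_review` are hypothetical names, the limit of 3 attempts comes from the 9.4 diagram, and treating an exhausted retry budget as an Abort is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASS = "pass"
    RETRY = "retry"
    ABORT = "abort"


@dataclass
class Decision:
    verdict: Verdict
    reason: str = ""


MAX_ATTEMPTS = 3  # loop limit taken from the 9.4 diagram


def run_with_review(generate: Callable[[str], str],
                    review: Callable[[str], Decision],
                    prompt: str) -> tuple[Verdict, str, int]:
    """Generate and review until Pass, Abort, or the attempt limit is hit.

    Returns (verdict, output or failure reason, attempts used).
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        output = generate(prompt)
        decision = review(output)
        if decision.verdict is Verdict.PASS:
            return Verdict.PASS, output, attempt
        if decision.verdict is Verdict.ABORT:
            return Verdict.ABORT, decision.reason, attempt
        # Retry: feed the reviewer's reason back to the upstream step.
        prompt = f"{prompt}\nReviewer feedback: {decision.reason}"
    # Assumption: an exhausted retry budget escalates to a human.
    return Verdict.ABORT, "retry limit reached", MAX_ATTEMPTS
```

A caller would pass its own `generate` and `review` callables; on Abort, the returned reason is what gets logged and raised as an alert.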

Principle 3: Taste constraints should be encoded at compile time (design proposal).

// Design proposal code: an evolution direction based on OpenClaw.NET's existing type system
// Note: This is not an implementation in the current codebase, but illustrates how to encode Taste constraints as compile-time-verifiable C# types

// Taste constraints as types
public record BrandTaste(
    ColorPalette AllowedColors,           // Compile-time validation: only tech blue and minimalist white are allowed
    bool AllowHumanFaces,                   // Compile-time validation: human images are prohibited
    decimal MaxCostPerImage,                // Compile-time validation: per-image cost cap
    double MinStyleScore,                   // Compile-time validation: minimum style score (read by ProductTasteGate.Audit)
    EthicalConstraint[] Constraints,          // Compile-time validation: list of ethical constraints
    StyleGuideline StyleGuide                 // Compile-time validation: style guide
) : ITasteConstraint;

// Taste review node as a generic interface
public interface ITasteGate<TInput, TOutput>
    where TInput : ITasteAuditable           // Compile-time constraint: input must be auditable
    where TOutput : ITasteAuditable          // Compile-time constraint: output must be auditable
{
    TasteDecision Audit(TInput input, BrandTaste taste);
}

// Product Taste review implementation (design proposal)
public class ProductTasteGate : ITasteGate<GeneratedImage, ValidatedImage>
{
    public TasteDecision Audit(GeneratedImage input, BrandTaste taste)
    {
        // Compile-time validation: the type system ensures that input contains all required Taste audit fields
        if (input.StyleScore < taste.MinStyleScore)
            return TasteDecision.Retry("Insufficient style consistency; consider adjusting the Prompt");

        if (input.Cost > taste.MaxCostPerImage)
            return TasteDecision.Abort("Cost exceeds the Taste constraint; triggering a budget alert");

        if (!taste.AllowedColors.Contains(input.DominantColor))
            return TasteDecision.Retry("The dominant color does not match the brand tone; consider regenerating");

        return TasteDecision.Pass();
    }
}

9.6 Agent Leader Capability Model (Design Proposal)

Capability dimension	AI Agent (current)	Agent Leader (human)	Complementary relationship
Technical judgment	9/10	7/10	AI execution, human decision-making
Product insight	7/10	9/10	AI-assisted, human-led
Ethical sensitivity	4/10	9/10	AI-assisted, human-led
Systems thinking	8/10	9/10	AI-assisted, human-led
Aesthetic Intuition	3/10	9/10	AI-assisted, human-led
Risk awareness	6/10	9/10	AI-assisted, human-led
Evolutionary prediction	5/10	8/10	AI-assisted, human-led

9.7 The Transition Path from IC to Agent Leader (Design Proposal)

Stage	Role	Core competencies	Tools/Frameworks	Value output
Level 1: Builder	Code implementer	Coding, debugging, optimization	IDE, Git, CI/CD	Functional delivery
Level 2: Agent Operator	Agent operator	Prompt engineering, Agent configuration	SK, AutoGen	Agent efficiency
Level 3: Agent Leader	Agent leader	Problem definition, tool selection, process orchestration, Taste review	MetaSkill DAG, Harness, TokenHub	System value
Level 4: Taste Architect	Taste architect	Domain modeling, value judgment, ethical boundaries, evolution prediction	DDD, JSON-LD Ontology, Taste constraint type system	Organizational Taste

10. Conclusion: C# frees humans to focus on Taste

Bun chose Rust because browser engines require extreme memory control. Go is a cloud-native language, but its full-lifecycle coverage is only 40%. AI infrastructure needs a "full-lifecycle native language": C# has 95% coverage, leads in 10+ of the 14 dimensions, and TensorSharp surpasses C++'s stable-diffusion.cpp in image generation performance.

But more important than technology selection is the philosophical positioning and design proposal:

When TensorSharp enables C# developers to build AI engines with C++-level performance, when .NET Aspire makes one-click deployment the default, and when Semantic Kernel makes LLM orchestration as natural as writing LINQ, humans are freed from "execution" and can focus on "judgment."

Building on OpenClaw.NET's existing passive auditing infrastructure (Harness Contracts + Evidence Bundles + Governance Ledger), the Taste review node design proposal moves this "judgment" from "after the fact" to "in the loop":

Problem Definition Layer: The Agent Leader translates business intent and presets Taste constraints.
Agent Orchestration Layer: AI executes autonomously; the Agent Leader designs the process.
Taste Verification Layer: The Agent Leader injects human aesthetic, value, and ethical judgments at key nodes.

OpenClaw.NET essentially validates this proposition: using C# to build AI-native infrastructure lowers the barrier to entry for Builders, clarifies the role of the Agent Leader, and makes Taste humanity's last line of defense.

This is not a narrative of replacing Python, Go, Rust, or C++; each language remains irreplaceable in its own field. Rather, C# is taking on the higher-value layers of AI (productionization, services, infrastructure, and engines) while freeing humans to do what only humans can do: define problems, orchestrate agents, and verify Taste.

Performance Benchmark Cheat Sheet

Dimension	Python	Go	Rust	C++	C#	Best
Cold start	325ms	45ms	30ms	—	35ms	Rust
Container image size	1,200MB	15MB	100KB-1MB	—	45MB	Go
gRPC QPS	45K	920K	950K	—	1,000K+	C#
Image generation	—	—	—	48.16s	40.44s	C#
Token throughput	49.7	—	—	—	313.3	C#
Concurrent RPS	4,500	82K	95K	—	78K	Rust
Memory (1000 concurrent users)	25GB	1.4GB	1.2GB	—	1.6GB	Rust
Full life cycle	20%	40%	30%	10%	95%	C#
Deployment toolchain	Manual YAML	Manual YAML	Manual configuration	Manual Makefile	Aspire One-Click	C#
Observability	Manual third-party	Manual configuration	Manual configuration	none	Native + Dashboard	C#
AI ecosystem	Python native	Community maintenance	Emerging Ecosystems	C++ native	TensorSharp + ONNX + SK	C#

C# vs Go vs Rust vs Python vs C++: AI Infrastructure Selection Decision Tree

                    What is your scenario?
                         |
            ┌────────────┼────────────┬────────────┐
            ↓            ↓            ↓            ↓
      Algorithm research/experimentation   Cloud-native infrastructure   AI services/infrastructure   System kernel/engine
            |            |            |            |
         Python         Go            C#           Rust
            |            |            |            |
      · Jupyter      · K8s/Docker   · Inference services    · Browser engine
      · PyTorch      · Minimalist microservices   · Agent orchestration  · OS components
      · Rapid prototyping     · High-concurrency gateway   · Token economics · Safety-critical
      · Paper reproduction     · Monitoring/logging    · Image/text generation · Zero-cost memory
                         |            |
                         |            ↓
                         |           C++
                         |            |
                         |      · Legacy engines
                         |      · Hardware drivers
                         |      · Extreme optimization
                         |
                         ↓
                    TensorSharp proves:
                    C# can replace C++ for AI inference engines
                    while retaining full lifecycle management capabilities
                    freeing humans to focus on Taste
                    Design proposal for a Taste review node based on OpenClaw.NET
                    evolving the Agent Leader's judgment from "after the fact" to "in the loop"

TensorSharp supports Vulkan backend

Zhongkai Fu — Tue, 07 Jul 2026 02:17:34 +0000

Due to high demand for a Vulkan backend, I updated TensorSharp and released an initial version of the GGML Vulkan backend, which leverages the external GGML project. A native Vulkan backend will be implemented later. I tested it on an NVIDIA GeForce RTX 3080 Laptop GPU and on Intel(R) UHD Graphics, both on Windows, and both work. However, I do not have an AMD GPU, so I have no way to test it there. If you have an AMD GPU and would like to try it out, I would really appreciate it. Any feedback and comments are welcome.

Here is the benchmark I ran to compare it with llama.cpp:

Performance ratio — TensorSharp vs reference engines

Geomean of TensorSharp's per-scenario speedup over each reference engine on the same backend, across every scenario both engines ran (single-stream, MTP-off). A value > 1.0× means TensorSharp is faster (for decode / prefill throughput) or lower-latency (for TTFT); — = no overlapping cells. Per-scenario ratios are in each model's section below.

Model	Comparison	decode	prefill	TTFT
Gemma 4 E4B it (Q8_0, dense multimodal)	vs llama.cpp · Vulkan	0.93×	0.96×	0.95×
Gemma 4 12B it (QAT UD-Q4_K_XL, dense)	vs llama.cpp · Vulkan	1.18×	0.97×	0.95×
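
The geomean figures in the summary can be reproduced from the per-scenario numbers reported further down. The sketch below does this for the Gemma 4 E4B decode and TTFT columns; the helper name `geomean` and the dictionary layout are illustrative choices, and the inversion of the latency ratio (reference divided by TensorSharp) is inferred from the "> 1.0× = TensorSharp lower" convention rather than taken from the benchmark harness itself.

```python
import math


def geomean(values):
    """Geometric mean: exp of the arithmetic mean of the logs."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Gemma 4 E4B on Vulkan: (TensorSharp, llama.cpp) per scenario.
decode_tok_s = {
    "text_short": (41.6, 45.3),
    "text_long": (40.9, 44.5),
    "multi_turn": (41.3, 43.6),
    "function_call": (41.2, 44.4),
}
ttft_ms = {
    "text_short": (1203.0, 1187.0),
    "text_long": (2719.0, 1813.0),
    "multi_turn": (1235.0, 1422.0),
    "function_call": (1219.0, 1328.0),
}

# Throughput: higher is better, so speedup = TensorSharp / reference.
decode_speedup = geomean(ts / ref for ts, ref in decode_tok_s.values())
# Latency: lower is better, so speedup = reference / TensorSharp.
ttft_speedup = geomean(ref / ts for ts, ref in ttft_ms.values())

print(f"decode {decode_speedup:.2f}x, TTFT {ttft_speedup:.2f}x")
```

A geometric mean is the usual choice for averaging ratios because a 2× win and a 0.5× loss cancel to 1.0×, which an arithmetic mean would not show.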

Gemma 4 E4B it (Q8_0, dense multimodal) (gemma4-e4b)

Decode throughput (tok/s)

Scenario	TensorSharp · Vulkan	llama.cpp · Vulkan
text_short	41.6	45.3
text_long	40.9	44.5
multi_turn	41.3	43.6
function_call	41.2	44.4

Prefill throughput (tok/s)

Scenario	TensorSharp · Vulkan	llama.cpp · Vulkan
text_short	1641.7	1641.1
text_long	1157.0	1718.1
multi_turn	1695.5	1454.3
function_call	1661.2	1531.6

Time to first token (ms, lower is better)

Scenario	TensorSharp · Vulkan	llama.cpp · Vulkan
text_short	1203.0	1187.0
text_long	2719.0	1813.0
multi_turn	1235.0	1422.0
function_call	1219.0	1328.0

Performance ratio: TensorSharp vs reference (> 1.0× = TensorSharp faster)

Decode throughput

Scenario	vs llama.cpp · Vulkan
text_short	0.92×
text_long	0.92×
multi_turn	0.95×
function_call	0.93×

Prefill throughput

Scenario	vs llama.cpp · Vulkan
text_short	1.00×
text_long	0.67×
multi_turn	1.17×
function_call	1.08×

Time to first token (latency; > 1.0× = TensorSharp lower)

Scenario	vs llama.cpp · Vulkan
text_short	0.99×
text_long	0.67×
multi_turn	1.15×
function_call	1.09×

Gemma 4 12B it (QAT UD-Q4_K_XL, dense) (gemma4-12b)

Decode throughput (tok/s)

Scenario	TensorSharp · Vulkan	llama.cpp · Vulkan
text_short	31.3	31.1
text_long	31.4	30.0
multi_turn	30.9	31.6
function_call	60.8	31.9

Prefill throughput (tok/s)

Scenario	TensorSharp · Vulkan	llama.cpp · Vulkan
text_short	766.1	729.4
text_long	635.2	647.4
multi_turn	617.5	636.6
function_call	587.4	674.7

Time to first token (ms, lower is better)

Scenario	TensorSharp · Vulkan	llama.cpp · Vulkan
text_short	2578.0	2672.0
text_long	4953.0	4813.0
multi_turn	3391.0	3250.0
function_call	3531.0	3016.0

Performance ratio: TensorSharp vs reference (> 1.0× = TensorSharp faster)

Decode throughput

Scenario	vs llama.cpp · Vulkan
text_short	1.01×
text_long	1.05×
multi_turn	0.98×
function_call	1.91×

Prefill throughput

Scenario	vs llama.cpp · Vulkan
text_short	1.05×
text_long	0.98×
multi_turn	0.97×
function_call	0.87×

Time to first token (latency; > 1.0× = TensorSharp lower)

Scenario	vs llama.cpp · Vulkan
text_short	1.04×
text_long	0.97×
multi_turn	0.96×
function_call	0.85×

In case you are not familiar with TensorSharp, here is an introduction:

TensorSharp is an open-source local inference engine for Unsloth (GGUF) LLMs, together with a set of applications. It supports many models from Unsloth, such as Gemma4, DiffusionGemma, and Qwen3.6, with multimodal input (image, video, audio), image editing, reasoning, and function/tool calling. It runs on Windows, macOS, and Linux and takes full advantage of the GPU (CUDA, Metal, and Vulkan backends are supported). The API is fully compatible with the OpenAI and Ollama interfaces. Its performance is on par with llama.cpp.

This project is not just a C# wrapper around llama.cpp. It implements the entire LLM inference engine from the bottom up. If you use the CPU backend, execution is 100% pure C# code. Besides the CPU backend, I also implemented CUDA, MLX, and GGML backends. The GGML backend references the GGML project as an external project, and I built a few fused operations at a higher level.

I learned a lot from other projects and applied their ideas to TensorSharp, such as paged KV cache and continuous batching from vLLM, SSD-based caching for MoE models from oMLX, GGUF quantization from llama.cpp, and other optimizations for prefill and decode.
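
For readers new to the idea, a paged KV cache stores each sequence's keys and values in fixed-size blocks and keeps a per-sequence block table, so memory is allocated on demand instead of being reserved for the maximum context length. The toy allocator below only tracks block ownership (no tensors), and every name in it is hypothetical; it is a sketch of the general technique, not TensorSharp's or vLLM's implementation.

```python
class PagedKVCache:
    """Toy block allocator: tracks which physical blocks each sequence owns."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids
        self.lengths: dict[int, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        offset = length % self.block_size
        if offset == 0:
            # The sequence is new or its last block is full: take a fresh block.
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], offset

    def release(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token(seq_id=0)
# Three tokens with a block size of 2 occupy two blocks, leaving two free.
print(len(cache.block_tables[0]), len(cache.free_blocks))
```

Because blocks are returned to a shared pool as soon as a sequence finishes, the same pool can serve new requests mid-batch, which is what makes this layout a natural fit for continuous batching.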

Any feedback and comments are welcome. If you like it, I would really appreciate it if you could give this project a star on GitHub. Thanks in advance.

TensorSharp.ai Review: A .NET-Native Way to Run GGUF Models Locally

Zhongkai Fu — Tue, 23 Jun 2026 07:09:52 +0000

Why TensorSharp is interesting right now

Local AI is no longer just a Python or C++ story. TensorSharp is an open-source, .NET-native inference engine for GGUF models that gives developers three ways to work: a CLI for quick tests, an ASP.NET Core server with a browser chat UI, and OpenAI- plus Ollama-compatible HTTP APIs for drop-in integration. The official docs also position it as a real C# library you can embed via NuGet, which is the part that makes it stand out from many local-LLM tools that stop at “runs on localhost.”

If you are a general software developer, the shortest description is this: TensorSharp is for teams that want local or on-prem LLM inference without forcing their stack to revolve around Python. The home page promises that prompts, documents, and images never leave the machine, there are no per-token fees, and the engine speaks familiar OpenAI and Ollama wire formats. That makes it especially relevant for internal copilots, privacy-sensitive assistants, lab environments, and .NET shops that would rather embed inference than wrap a foreign runtime.

What TensorSharp actually ships

At the product level, TensorSharp bundles more than a model runner. Official docs describe TensorSharp.Cli for one-shot prompts, REPL usage, multimodal experiments, JSONL batch workflows, and benchmarks; TensorSharp.Server for browser chat plus REST APIs; and a set of NuGet packages for direct embedding in .NET code. Supported backends include pure C# CPU, GGML CPU, GGML Metal, GGML CUDA, direct CUDA, and Apple MLX, with Windows, macOS, and Linux support documented in the repo and wiki.

Model support is broader than you might expect for a young project. The official supported-models page lists Gemma 3 and 4, Qwen 3 and 3.5/3.6-family models, GPT-OSS, Nemotron-H, Mistral 3, and DiffusionGemma-style text-diffusion models. Multimodal support is also part of the story: Gemma 4 supports image, video, and audio input, while several other families support image input. Tool calling, structured outputs, and a thinking-mode flag are documented across the HTTP API surface.

One of the more compelling capabilities is compatibility. TensorSharp’s server exposes Ollama-style endpoints like /api/generate and /api/chat/ollama, plus OpenAI-style /v1/chat/completions. The docs explicitly show redirecting an OpenAI client to http://localhost:5000/v1, which lowers migration friction for existing apps. In practice, that means teams can test local inference without rewriting their application contracts from scratch.

Here is the kind of developer workflow the docs imply, distilled into one flow:

flowchart LR
    A[Pick a GGUF model] --> B[Build TensorSharp]
    B --> C[Choose backend]
    C --> D[Run CLI or start TensorSharp.Server]
    D --> E[Call OpenAI or Ollama-compatible API]
    E --> F[Add multimodal input or tool calls]
    F --> G[Tune batching, sampling, and benchmarks]

A minimal example from the official HTTP docs uses the standard OpenAI Python client against TensorSharp’s local endpoint:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[{"role": "user", "content": "Explain mixture-of-experts in one sentence."}],
    max_tokens=80,
)
print(resp.choices[0].message.content)
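
The Ollama-style endpoint mentioned earlier can be exercised without any client library. The following is a minimal sketch using only the Python standard library. It assumes that TensorSharp's /api/generate accepts the same request body as Ollama's own API (`model`, `prompt`, `stream`) and returns the text in a `response` field; those field names come from Ollama's conventions and were not verified against the TensorSharp docs. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # same local server as in the example above


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to the Ollama-style /api/generate endpoint."""
    # Assumed body shape, following Ollama's API conventions.
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{BASE_URL}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.load(resp)["response"]


# With a server running locally:
# print(generate("Qwen3-4B-Q8_0.gguf", "Explain mixture-of-experts in one sentence."))
```

Separating request construction from the network call keeps the payload easy to inspect or log before anything is sent.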

Where TensorSharp fits and where it does not

The biggest strength here is architectural fit for C# developers. TensorSharp is not just “compatible with .NET”; it is written in C#/.NET and exposes package layers for tensor primitives, runtime, models, and backends. If you want to keep inference inside an existing ASP.NET or service-oriented codebase, that is a strong differentiator from tools that mainly optimize for CLI convenience or Python-native serving. The project also documents advanced serving ideas like continuous batching, paged KV cache, and speculative decoding, which suggests it is trying to compete on systems design rather than on wrappers alone.

There are still tradeoffs. First, the setup is more “developer toolchain” than “double-click desktop app”: the quick start expects .NET 10, Git, and in some cases CUDA or Apple build tooling. Second, while the project publishes internal regression numbers and references a cross-engine benchmark matrix, the public-facing benchmark page is not yet as polished or comparative as what many buyers expect. Third, pricing, enterprise support, and formal compliance claims are unspecified in the reviewed materials, so teams with procurement or audit requirements will need direct clarification.

My take: TensorSharp looks most compelling for developers who want local GGUF inference with a real .NET embedding story, OpenAI-compatible integration, and enough systems-level optimization to move beyond toy demos. If you want the absolute easiest consumer-grade local setup, Ollama still looks simpler. If you want large-scale Python-first serving, vLLM remains the more established choice. But if your stack, team, and deployment model are already C#-heavy, TensorSharp is one of the more interesting projects to watch.

Pros: strong .NET-native embedding story, OpenAI/Ollama compatibility, multimodal support, multiple hardware backends, and official documentation for continuous batching and paged KV caching. Cons: public pricing/support details are unspecified, formal security/compliance claims are unspecified, and the public benchmark story is still more engineering-facing than buyer-facing.

Suggested Dev.to tags: dotnet, csharp, llm, local-ai, opensource

Comparison snapshot

Tool	Core focus	Unique strengths
TensorSharp.ai	Self-hosted GGUF inference for .NET developers	Native C# embedding via NuGet, OpenAI/Ollama-compatible APIs, multiple backends including MLX and GGML, documented multimodal + batching features
llama.cpp	Low-level C/C++ LLM inference across diverse hardware	Foundational GGUF ecosystem, minimal setup philosophy, broad hardware/performance focus
Ollama	Developer-friendly local model runtime and API	Easiest onboarding, polished CLI/runtime UX, local-first with optional cloud account plans and integrations
vLLM	High-throughput, memory-efficient LLM serving	Strong production-serving narrative, PagedAttention + continuous batching, broad hardware targets, OpenAI-compatible API

From a positioning standpoint, TensorSharp competes less on “friendliest consumer UX” than Ollama and less on “most established Python-serving engine” than vLLM. Its clearest niche is the developer who wants local or internal LLM serving with C# as a first-class implementation language, not just as a client calling out to another runtime.

Reader checklist, social blurbs, and source links

Quick fit checklist

You already build in C#/.NET and would benefit from embedding inference directly rather than calling a separate Python service.
You want local or on-prem inference with OpenAI- or Ollama-compatible APIs and no per-token metering.
You need GGUF support plus optional multimodal workflows such as image, video, or audio input.
You are comfortable validating performance, support expectations, and compliance requirements yourself because public pricing/support/security detail is still limited.

Tweet-length social blurbs

“TensorSharp is one of the more interesting local-AI projects I’ve seen for .NET teams: GGUF inference, OpenAI/Ollama-compatible APIs, multimodal support, and direct C# embedding in one stack. If your AI roadmap is C#-heavy, this is worth a look.”

“Ollama made local AI feel easy. TensorSharp makes it feel native to .NET. The big differentiator is not just localhost inference, but running and embedding GGUF models directly inside a C# application architecture.”

“If you want privacy-first local inference without per-token fees and you’d rather point your existing OpenAI client at localhost than rebuild your stack, TensorSharp has a compelling angle—especially on Apple Silicon and NVIDIA hardware.”

Source links

The primary materials used for this review were official TensorSharp pages plus official comparator pages for llama.cpp, Ollama, and vLLM.

TensorSharp: .NET Native Open Source Local LLM Inference Engine

Zhongkai Fu — Mon, 22 Jun 2026 17:09:36 +0000

TensorSharp
I would like to share my latest open-source project: a .NET-native local LLM inference engine and a set of applications. It supports many models, such as Gemma4, DiffusionGemma, and Qwen3.6, with multimodal input (image, video, audio), reasoning, and function/tool calling. It runs on Windows, macOS, and Linux and takes full advantage of the GPU. The API is fully compatible with the OpenAI and Ollama interfaces. Its performance is on par with llama.cpp.

This project is not just a C# wrapper around llama.cpp. It implements the entire LLM inference engine from the bottom up. If you use the CPU backend, execution is 100% pure C# code. Besides the CPU backend, I also implemented CUDA, MLX, and GGML backends. The GGML backend references the GGML project as an external project, and I built a few fused operations at a higher level.

Any feedback and comments are welcome. If you like it, I would really appreciate it if you could give this project a star on GitHub. Thanks in advance.