<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zhongkai Fu</title>
    <description>The latest articles on DEV Community by Zhongkai Fu (@zhongkaifu).</description>
    <link>https://dev.to/zhongkaifu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3997340%2F8de0c4c9-4495-4eca-aea0-688845bc68c9.jpg</url>
      <title>DEV Community: Zhongkai Fu</title>
      <link>https://dev.to/zhongkaifu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zhongkaifu"/>
    <language>en</language>
    <item>
      <title>TensorSharp.ai Review: A .NET-Native Way to Run GGUF Models Locally</title>
      <dc:creator>Zhongkai Fu</dc:creator>
      <pubDate>Tue, 23 Jun 2026 07:09:52 +0000</pubDate>
      <link>https://dev.to/zhongkaifu/tensorsharpai-review-a-net-native-way-to-run-gguf-models-locally-5b56</link>
      <guid>https://dev.to/zhongkaifu/tensorsharpai-review-a-net-native-way-to-run-gguf-models-locally-5b56</guid>
      <description>&lt;h3&gt;
  
  
  Why &lt;a href="https://tensorsharp.ai/" rel="noopener noreferrer"&gt;TensorSharp&lt;/a&gt; is interesting right now
&lt;/h3&gt;

&lt;p&gt;Local AI is no longer just a Python or C++ story. &lt;a href="https://tensorsharp.ai/" rel="noopener noreferrer"&gt;TensorSharp&lt;/a&gt; is an open-source, .NET-native inference engine for GGUF models that gives developers three ways to work: a CLI for quick tests, an ASP.NET Core server with a browser chat UI, and OpenAI- plus Ollama-compatible HTTP APIs for drop-in integration. The official docs also position it as a real C# library you can embed via NuGet, which is the part that makes it stand out from many local-LLM tools that stop at “runs on localhost.”&lt;/p&gt;

&lt;p&gt;If you are a general software developer, the shortest description is this: TensorSharp is for teams that want local or on-prem LLM inference without forcing their stack to revolve around Python. The home page promises that prompts, documents, and images never leave the machine, there are no per-token fees, and the engine speaks familiar OpenAI and Ollama wire formats. That makes it especially relevant for internal copilots, privacy-sensitive assistants, lab environments, and .NET shops that would rather embed inference than wrap a foreign runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  What TensorSharp actually ships
&lt;/h3&gt;

&lt;p&gt;At the product level, TensorSharp bundles more than a model runner. Official docs describe &lt;code&gt;TensorSharp.Cli&lt;/code&gt; for one-shot prompts, REPL usage, multimodal experiments, JSONL batch workflows, and benchmarks; &lt;code&gt;TensorSharp.Server&lt;/code&gt; for browser chat plus REST APIs; and a set of NuGet packages for direct embedding in .NET code. Supported backends include pure C# CPU, GGML CPU, GGML Metal, GGML CUDA, direct CUDA, and Apple MLX, with Windows, macOS, and Linux support documented in the repo and wiki. &lt;/p&gt;

&lt;p&gt;Model support is broader than you might expect for a young project. The official supported-models page lists Gemma 3 and 4, Qwen 3 and 3.5/3.6-family models, GPT-OSS, Nemotron-H, Mistral 3, and DiffusionGemma-style text-diffusion models. Multimodal support is also part of the story: Gemma 4 supports image, video, and audio input, while several other families support image input. Tool calling, structured outputs, and a thinking-mode flag are documented across the HTTP API surface. &lt;/p&gt;

&lt;p&gt;One of the more compelling capabilities is compatibility. TensorSharp’s server exposes Ollama-style endpoints like &lt;code&gt;/api/generate&lt;/code&gt; and &lt;code&gt;/api/chat/ollama&lt;/code&gt;, plus OpenAI-style &lt;code&gt;/v1/chat/completions&lt;/code&gt;. The docs explicitly show redirecting an OpenAI client to &lt;code&gt;http://localhost:5000/v1&lt;/code&gt;, which lowers migration friction for existing apps. In practice, that means teams can test local inference without rewriting their application contracts from scratch. &lt;/p&gt;
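
&lt;p&gt;As a rough illustration of the Ollama-style side, here is a minimal sketch that builds a request for &lt;code&gt;/api/generate&lt;/code&gt; using only the Python standard library. The &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;prompt&lt;/code&gt;, and &lt;code&gt;stream&lt;/code&gt; fields are an assumption based on the usual Ollama convention, and the helper names &lt;code&gt;build_generate_request&lt;/code&gt; and &lt;code&gt;send&lt;/code&gt; are hypothetical; check the TensorSharp HTTP API page for the exact schema.&lt;/p&gt;

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # host and port taken from the article's OpenAI example


def build_generate_request(model, prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for the Ollama-style /api/generate endpoint."""
    # Assumption: the body follows the Ollama convention of model, prompt, and stream fields.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=base_url + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON reply; requires a running server."""
    with urllib.request.urlopen(request, timeout=60) as reply:
        return json.loads(reply.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_generate_request("Qwen3-4B-Q8_0.gguf", "Say hello in five words.")
    print(req.get_method(), req.full_url)
```

&lt;p&gt;Keeping request construction separate from sending makes the payload easy to inspect before a server is running; calling &lt;code&gt;send(req)&lt;/code&gt; only works once TensorSharp.Server is listening on that port.&lt;/p&gt;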

&lt;p&gt;Here is the kind of developer workflow the docs imply, distilled into one flow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[Pick a GGUF model] --&amp;gt; B[Build TensorSharp]
    B --&amp;gt; C[Choose backend]
    C --&amp;gt; D[Run CLI or start TensorSharp.Server]
    D --&amp;gt; E[Call OpenAI or Ollama-compatible API]
    E --&amp;gt; F[Add multimodal input or tool calls]
    F --&amp;gt; G[Tune batching, sampling, and benchmarks]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A minimal example from the official HTTP docs uses the standard OpenAI Python client against TensorSharp’s local endpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:5000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3-4B-Q8_0.gguf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain mixture-of-experts in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where TensorSharp fits and where it does not
&lt;/h3&gt;

&lt;p&gt;The biggest strength here is architectural fit for C# developers. TensorSharp is not just “compatible with .NET”; it is written in C#/.NET and exposes package layers for tensor primitives, runtime, models, and backends. If you want to keep inference inside an existing ASP.NET or service-oriented codebase, that is a strong differentiator from tools that mainly optimize for CLI convenience or Python-native serving. The project also documents advanced serving ideas like continuous batching, paged KV cache, and speculative decoding, which suggests it is trying to compete on systems design rather than on wrappers alone. &lt;/p&gt;
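
&lt;p&gt;For readers unfamiliar with the term, the general idea behind a paged KV cache can be sketched in a few lines: the cache is split into fixed-size blocks, and each sequence keeps a table of the blocks it owns instead of one contiguous buffer. This is a toy illustration of the concept only, not TensorSharp's implementation; the class name &lt;code&gt;PagedKVCache&lt;/code&gt; and its methods are hypothetical.&lt;/p&gt;

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused blocks
        self.block_tables = {}  # sequence id to list of block ids
        self.lengths = {}  # sequence id to number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; return (block id, offset in block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # every owned block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

&lt;p&gt;Because blocks are handed out on demand and returned when a sequence finishes, memory is not reserved up front for each request's maximum length, which is what makes this layout a natural partner for continuous batching.&lt;/p&gt;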

&lt;p&gt;There are still tradeoffs. First, the setup is more “developer toolchain” than “double-click desktop app”: the quick start expects .NET 10, Git, and in some cases CUDA or Apple build tooling. Second, while the project publishes internal regression numbers and references a cross-engine benchmark matrix, the public-facing benchmark page is not yet as polished or comparative as what many buyers expect. Third, pricing, enterprise support, and formal compliance claims are unspecified in the reviewed materials, so teams with procurement or audit requirements will need direct clarification.&lt;/p&gt;

&lt;p&gt;My take: TensorSharp looks most compelling for developers who want local GGUF inference with a real .NET embedding story, OpenAI-compatible integration, and enough systems-level optimization to move beyond toy demos. If you want the absolute easiest consumer-grade local setup, Ollama still looks simpler. If you want large-scale Python-first serving, vLLM remains the more established choice. But if your stack, team, and deployment model are already C#-heavy, TensorSharp is one of the more interesting projects to watch. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; strong .NET-native embedding story, OpenAI/Ollama compatibility, multimodal support, multiple hardware backends, and official documentation for continuous batching and paged KV caching. &lt;strong&gt;Cons:&lt;/strong&gt; public pricing/support details are unspecified, formal security/compliance claims are unspecified, and the public benchmark story is still more engineering-facing than buyer-facing. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggested Dev.to tags:&lt;/strong&gt; &lt;code&gt;dotnet&lt;/code&gt;, &lt;code&gt;csharp&lt;/code&gt;, &lt;code&gt;llm&lt;/code&gt;, &lt;code&gt;local-ai&lt;/code&gt;, &lt;code&gt;opensource&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison snapshot
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Core focus&lt;/th&gt;
&lt;th&gt;Unique strengths&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TensorSharp.ai&lt;/td&gt;
&lt;td&gt;Self-hosted GGUF inference for .NET developers&lt;/td&gt;
&lt;td&gt;Native C# embedding via NuGet, OpenAI/Ollama-compatible APIs, multiple backends including MLX and GGML, documented multimodal + batching features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp&lt;/td&gt;
&lt;td&gt;Low-level C/C++ LLM inference across diverse hardware&lt;/td&gt;
&lt;td&gt;Foundational GGUF ecosystem, minimal setup philosophy, broad hardware/performance focus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;Developer-friendly local model runtime and API&lt;/td&gt;
&lt;td&gt;Easiest onboarding, polished CLI/runtime UX, local-first with optional cloud account plans and integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;td&gt;High-throughput, memory-efficient LLM serving&lt;/td&gt;
&lt;td&gt;Strong production-serving narrative, PagedAttention + continuous batching, broad hardware targets, OpenAI-compatible API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From a positioning standpoint, TensorSharp competes less on “friendliest consumer UX” than Ollama and less on “most established Python-serving engine” than vLLM. Its clearest niche is the developer who wants local or internal LLM serving with C# as a first-class implementation language, not just as a client calling out to another runtime. &lt;/p&gt;

&lt;h2&gt;
  
  
  Reader checklist, social blurbs, and source links
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Quick fit checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You already build in C#/.NET and would benefit from embedding inference directly rather than calling a separate Python service. &lt;/li&gt;
&lt;li&gt;You want local or on-prem inference with OpenAI- or Ollama-compatible APIs and no per-token metering. &lt;/li&gt;
&lt;li&gt;You need GGUF support plus optional multimodal workflows such as image, video, or audio input. &lt;/li&gt;
&lt;li&gt;You are comfortable validating performance, support expectations, and compliance requirements yourself because public pricing/support/security detail is still limited. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tweet-length social blurbs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“TensorSharp is one of the more interesting local-AI projects I’ve seen for .NET teams: GGUF inference, OpenAI/Ollama-compatible APIs, multimodal support, and direct C# embedding in one stack. If your AI roadmap is C#-heavy, this is worth a look.” &lt;/p&gt;

&lt;p&gt;“Ollama made local AI feel easy. TensorSharp makes it feel native to .NET. The big differentiator is not just localhost inference, but running and embedding GGUF models directly inside a C# application architecture.” &lt;/p&gt;

&lt;p&gt;“If you want privacy-first local inference without per-token fees and you’d rather point your existing OpenAI client at &lt;code&gt;localhost&lt;/code&gt; than rebuild your stack, TensorSharp has a compelling angle—especially on Apple Silicon and NVIDIA hardware.” &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source links&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The primary materials used for this review were official TensorSharp pages plus official comparator pages for llama.cpp, Ollama, and vLLM. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tensorsharp.ai/" rel="noopener noreferrer"&gt;TensorSharp home&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tensorsharp.ai/overview.html" rel="noopener noreferrer"&gt;TensorSharp Overview &amp;amp; Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tensorsharp.ai/server.html" rel="noopener noreferrer"&gt;TensorSharp Server &amp;amp; Web UI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tensorsharp.ai/http-api.html" rel="noopener noreferrer"&gt;TensorSharp HTTP API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tensorsharp.ai/code-api.html" rel="noopener noreferrer"&gt;TensorSharp C# Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tensorsharp.ai/models.html" rel="noopener noreferrer"&gt;TensorSharp Supported Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tensorsharp.ai/advanced.html" rel="noopener noreferrer"&gt;TensorSharp Advanced Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tensorsharp.ai/benchmarks.html" rel="noopener noreferrer"&gt;TensorSharp Benchmarks &amp;amp; Testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zhongkaifu/TensorSharp" rel="noopener noreferrer"&gt;TensorSharp GitHub repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zhongkaifu/TensorSharp/releases" rel="noopener noreferrer"&gt;TensorSharp v3.0.0.0 release notes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>dotnet</category>
      <category>csharp</category>
    </item>
    <item>
      <title>TensorSharp: .NET Native Open Source Local LLM Inference Engine</title>
      <dc:creator>Zhongkai Fu</dc:creator>
      <pubDate>Mon, 22 Jun 2026 17:09:36 +0000</pubDate>
      <link>https://dev.to/zhongkaifu/tensorsharp-net-native-open-source-local-llm-inference-engine-4ena</link>
      <guid>https://dev.to/zhongkaifu/tensorsharp-net-native-open-source-local-llm-inference-engine-4ena</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/zhongkaifu/TensorSharp" rel="noopener noreferrer"&gt;TensorSharp&lt;/a&gt;&lt;br&gt;
I would like to share my latest open-source project: a .NET-native local LLM inference engine with accompanying applications. It supports many models, such as Gemma4, DiffusionGemma, and Qwen3.6, with multimodal input (image, video, audio), reasoning, and function/tool calling. It runs on Windows, macOS, and Linux and takes full advantage of the GPU. The API is compatible with the OpenAI and Ollama interfaces, and its performance is on par with llama.cpp.&lt;/p&gt;

&lt;p&gt;This project is not just a C# wrapper around llama.cpp. It implements the entire LLM inference engine from the bottom up. If you use the CPU backend, execution is 100% pure C# code. Besides the CPU backend, I also implemented CUDA, MLX, and GGML backends. The GGML backend references the GGML project as an external dependency, and I built a few fused operations on top of it at a higher level.&lt;/p&gt;

&lt;p&gt;I learned a lot from other projects and applied their ideas to TensorSharp, such as paged KV cache and continuous batching from vLLM, SSD-based caching for MoE models from oMLX, GGUF quantization from llama.cpp, and other optimizations for prefill and decode.&lt;/p&gt;

&lt;p&gt;Any feedback and comments are welcome. If you like it, I would really appreciate it if you could give this project a star on GitHub. Thanks in advance.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
