<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joe Munene</title>
    <description>The latest articles on DEV Community by Joe Munene (@ghost_gi_m).</description>
    <link>https://dev.to/ghost_gi_m</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864674%2F025d6cad-2618-458b-8279-25178f560b99.jpg</url>
      <title>DEV Community: Joe Munene</title>
      <link>https://dev.to/ghost_gi_m</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ghost_gi_m"/>
    <language>en</language>
    <item>
      <title>A New Open-Source Security AI Model Being Built</title>
      <dc:creator>Joe Munene</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:31:46 +0000</pubDate>
      <link>https://dev.to/ghost_gi_m/a-new-opensource-security-ai-model-being-built-20de</link>
      <guid>https://dev.to/ghost_gi_m/a-new-opensource-security-ai-model-being-built-20de</guid>
      <description>&lt;h1&gt;
  
  
  I Built an Open-Source Cybersecurity LLM From Scratch in Python
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;What if you could build your own AI model — not fine-tune someone else's, not wrap an API — but actually build a transformer from scratch and train it on cybersecurity data?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's exactly what I did. And I'm releasing it under Apache 2.0 so anyone can use it, improve it, and build on it.&lt;/p&gt;

&lt;p&gt;Meet &lt;strong&gt;GhostLM&lt;/strong&gt; — an open-source, cybersecurity-focused language model built entirely from scratch in PyTorch. No pretrained weights. No wrappers. Every single component written by hand.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/joemunene-by/GhostLM" rel="noopener noreferrer"&gt;https://github.com/joemunene-by/GhostLM&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Built GhostLM
&lt;/h2&gt;

&lt;p&gt;Here's the thing about current AI models: they're incredibly powerful, but they weren't built for security. When you ask GPT-4 about a CVE or a CTF challenge, it gives you a reasonable answer, but it's reasoning from general knowledge, not from deep security context.&lt;/p&gt;

&lt;p&gt;I wanted a model that actually &lt;em&gt;understands&lt;/em&gt; cybersecurity language — the patterns, the terminology, the attack methodologies. And I wanted to build it myself, not because I thought I could out-engineer OpenAI, but because &lt;strong&gt;the best way to understand how something works is to build it from the ground up.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My goal was simple: create the first open-source, cybersecurity-focused language model that anyone can run, inspect, and improve.&lt;/p&gt;




&lt;h2&gt;
  
  
  What GhostLM Is
&lt;/h2&gt;

&lt;p&gt;GhostLM is a decoder-only transformer language model — the same architecture family as GPT-2, GPT-3, and Llama — but built entirely from scratch. No &lt;code&gt;transformers.AutoModel&lt;/code&gt;, no &lt;code&gt;from_pretrained()&lt;/code&gt;. Just raw PyTorch tensors and matrix multiplications.&lt;/p&gt;

&lt;p&gt;It comes in three sizes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Layers&lt;/th&gt;
&lt;th&gt;Dim&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ghost-tiny&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;~14.5M&lt;/td&gt;
&lt;td&gt;✅ Trained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ghost-small&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;~55M&lt;/td&gt;
&lt;td&gt;🔄 Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ghost-medium&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;~160M&lt;/td&gt;
&lt;td&gt;🔜 Future&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
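&lt;p&gt;For reference, the three variants in the table can be sketched as a config object. The field names, head counts, vocabulary size, and context length below are my illustrative assumptions, not necessarily GhostLM's actual config (with a GPT-2-style 50,257-token vocabulary and tied embeddings, ghost-tiny's parameter count does work out to roughly 14.5M):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class GhostConfig:
    n_layers: int
    d_model: int
    n_heads: int
    vocab_size: int = 50257   # assumed GPT-2-style BPE vocabulary
    block_size: int = 512     # assumed context length

# The three published sizes from the table above
# (head counts chosen so head_dim = 64, a common default)
GHOST_VARIANTS = {
    "ghost-tiny":   GhostConfig(n_layers=2,  d_model=256, n_heads=4),
    "ghost-small":  GhostConfig(n_layers=6,  d_model=512, n_heads=8),
    "ghost-medium": GhostConfig(n_layers=12, d_model=768, n_heads=12),
}
```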

&lt;p&gt;It's trained on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CVE vulnerability descriptions&lt;/strong&gt; from the NVD database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTF writeups&lt;/strong&gt; covering real challenge types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cybersecurity research papers&lt;/strong&gt; and abstracts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it's fully open source under Apache 2.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Let me show you what "built from scratch" actually looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Causal Self-Attention
&lt;/h3&gt;

&lt;p&gt;This is the core of every transformer. Here's GhostLM's implementation — no &lt;code&gt;F.scaled_dot_product_attention&lt;/code&gt;, no hidden magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Combined QKV projection and split
&lt;/span&gt;    &lt;span class="n"&gt;qkv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;c_qkv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qkv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Reshape to (B, n_heads, T, head_dim)
&lt;/span&gt;    &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Scaled dot-product attention
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply causal mask (lower triangular)
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;masked_fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;causal_mask&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Softmax + dropout + weighted sum
&lt;/span&gt;    &lt;span class="n"&gt;att&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attn_dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;att&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

    &lt;span class="c1"&gt;# Reassemble heads and project back
&lt;/span&gt;    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;contiguous&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resid_dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every line is intentional. The causal mask ensures the model can only attend to previous tokens (autoregressive). The attention weights are manually computed with the classic &lt;code&gt;QK^T / sqrt(d)&lt;/code&gt; formula.&lt;/p&gt;
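&lt;p&gt;The &lt;code&gt;self.causal_mask&lt;/code&gt; buffer used in the forward pass isn't shown above. A minimal sketch of how such a buffer is typically registered (the names mirror the snippet, but GhostLM's actual &lt;code&gt;__init__&lt;/code&gt; may differ):&lt;/p&gt;

```python
import torch
import torch.nn as nn

class CausalMaskDemo(nn.Module):
    """Registers a lower-triangular mask like the one used in the forward pass."""
    def __init__(self, block_size):
        super().__init__()
        # (block_size, block_size) lower-triangular ones, reshaped to
        # (1, 1, T, T) so it broadcasts over batch and head dimensions
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("causal_mask", mask.view(1, 1, block_size, block_size))

m = CausalMaskDemo(block_size=4)
# Row i of the mask is 1 only for columns j up to i: token i may attend
# to tokens 0..i, never to the future
```

Registering it as a buffer rather than a parameter keeps it out of the optimizer while still letting it move with the model on `.to(device)`.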

&lt;h3&gt;
  
  
  Transformer Block
&lt;/h3&gt;

&lt;p&gt;The block stacks attention and feed-forward layers with a pre-norm architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Pre-norm + self-attention with residual
&lt;/span&gt;    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln_1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Pre-norm + feed-forward with residual
&lt;/span&gt;    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ffn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln_2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why pre-norm?&lt;/strong&gt; I chose pre-normalization (LayerNorm before each sub-layer) over post-norm because it's significantly more stable for training, especially on smaller models. The gradients flow more cleanly through the residual connections, and you don't need as careful a learning rate schedule.&lt;/p&gt;
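&lt;p&gt;The &lt;code&gt;ffn&lt;/code&gt; sub-layer isn't shown in the block snippet. A plausible implementation is the standard GPT-2-style 4× expansion MLP; treat the widths, activation, and dropout below as my assumptions rather than GhostLM's exact code:&lt;/p&gt;

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: expand 4x, apply a nonlinearity, project back."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand
            nn.GELU(),                        # GPT-2-style activation (assumed)
            nn.Linear(4 * d_model, d_model),  # project back to model width
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

# Shape is preserved, so the residual addition in the block is well-defined
y = FeedForward(256)(torch.randn(2, 8, 256))
```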

&lt;h3&gt;
  
  
  Weight Tying
&lt;/h3&gt;

&lt;p&gt;One optimization that saves an entire &lt;code&gt;vocab_size × d_model&lt;/code&gt; matrix of parameters (roughly 13M at ghost-tiny's dimensions, assuming a GPT-2-sized vocabulary): the output projection layer shares weights with the token embedding. Instead of learning two separate matrices, we learn one and reuse it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lm_head&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same trick GPT-2 uses, and it works because the embedding and output projection are fundamentally doing the same thing — mapping between token space and hidden space.&lt;/p&gt;
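&lt;p&gt;A quick way to convince yourself the tying works, and to see what it saves. The sizes here assume ghost-tiny's 256-dim embedding and a GPT-2-sized vocabulary:&lt;/p&gt;

```python
import torch.nn as nn

vocab_size, d_model = 50257, 256   # assumed ghost-tiny dimensions
emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = emb.weight        # the tying line from the snippet above

# Both modules now point at the same storage: one matrix, not two
assert lm_head.weight.data_ptr() == emb.weight.data_ptr()

saved = vocab_size * d_model       # 12,865,792 parameters not duplicated
```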




&lt;h2&gt;
  
  
  Training Data
&lt;/h2&gt;

&lt;p&gt;The data pipeline is one of the most important parts of any ML project. GhostLM's pipeline collects from three sources:&lt;/p&gt;

&lt;h3&gt;
  
  
  NVD CVE Descriptions (Real Data)
&lt;/h3&gt;

&lt;p&gt;I hit the National Vulnerability Database REST API directly — no HuggingFace dependency needed. Paginated requests with rate limiting, parsing nested JSON responses, extracting English descriptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://services.nvd.nist.gov/rest/json/cves/2.0?resultsPerPage=2000&amp;amp;startIndex=0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vulnerabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;cve_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;descriptions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gave me &lt;strong&gt;9,925 real CVE descriptions&lt;/strong&gt; — the kind of text that says &lt;em&gt;"A buffer overflow in the XYZ component allows remote attackers to execute arbitrary code via crafted input."&lt;/em&gt;&lt;/p&gt;
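&lt;p&gt;The snippet above shows a single request; the pagination, rate limiting, and English-description filtering it mentions look roughly like this. This is a reconstruction, not GhostLM's actual &lt;code&gt;collect.py&lt;/code&gt;, and the injectable &lt;code&gt;get&lt;/code&gt; parameter is my addition so the loop can be exercised offline:&lt;/p&gt;

```python
import time

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def fetch_cve_descriptions(max_records=10000, page_size=2000, delay=6.0,
                           get=None):
    """Page through the NVD API, keeping only English CVE descriptions."""
    if get is None:
        import requests  # deferred so an offline test can inject a stub
        get = requests.get
    descriptions = []
    start = 0
    while True:
        resp = get(NVD_URL,
                   params={"resultsPerPage": page_size, "startIndex": start},
                   timeout=30)
        resp.raise_for_status()
        data = resp.json()
        for item in data.get("vulnerabilities", []):
            for desc in item["cve"]["descriptions"]:
                if desc["lang"] == "en":
                    descriptions.append(desc["value"])
                    break  # one English description per CVE
        start += page_size
        if start >= data.get("totalResults", 0) or start >= max_records:
            break
        time.sleep(delay)  # NVD asks unauthenticated clients to throttle
    return descriptions
```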

&lt;h3&gt;
  
  
  The Full Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NVD API → 9,925 CVE descriptions (real)
Synthetic papers → 500 security research abstracts
Synthetic CTF writeups → 500 challenge solutions
─────────────────────────────────────────────────
Total: 10,925 records → ~490,532 tokens
Train: 10,378 | Validation: 547
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline handles text cleaning (unicode normalization, whitespace stripping, non-printable character removal), tokenization, chunking, and train/val splitting — all in &lt;code&gt;data/collect.py&lt;/code&gt;.&lt;/p&gt;
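&lt;p&gt;The cleaning steps listed above can be sketched in a few lines of stdlib Python. This is an illustrative version, not the exact code in &lt;code&gt;data/collect.py&lt;/code&gt;:&lt;/p&gt;

```python
import unicodedata

def clean_text(text):
    # Unicode normalization: NFKC folds compatibility forms (ligatures, etc.)
    text = unicodedata.normalize("NFKC", text)
    # Drop non-printable characters such as control bytes, keep whitespace
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")
    # Collapse runs of whitespace and strip the edges
    return " ".join(text.split())
```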




&lt;h2&gt;
  
  
  Training Results
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. I trained ghost-tiny on a &lt;strong&gt;ThinkPad Yoga 11e with a Celeron N4100 and 4GB of RAM&lt;/strong&gt;. Yes, really.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loss Progression
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;th&gt;Train Loss&lt;/th&gt;
&lt;th&gt;Val Loss&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;10.84&lt;/td&gt;
&lt;td&gt;10.04&lt;/td&gt;
&lt;td&gt;Random initialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;7.12&lt;/td&gt;
&lt;td&gt;6.27&lt;/td&gt;
&lt;td&gt;First CVE patterns emerge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;5.89&lt;/td&gt;
&lt;td&gt;5.41&lt;/td&gt;
&lt;td&gt;Starting to form sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;4.63&lt;/td&gt;
&lt;td&gt;4.58&lt;/td&gt;
&lt;td&gt;Grammar improving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3,000&lt;/td&gt;
&lt;td&gt;3.91&lt;/td&gt;
&lt;td&gt;3.95&lt;/td&gt;
&lt;td&gt;Security vocabulary appearing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4,000&lt;/td&gt;
&lt;td&gt;3.52&lt;/td&gt;
&lt;td&gt;3.58&lt;/td&gt;
&lt;td&gt;Coherent attack descriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;3.38&lt;/td&gt;
&lt;td&gt;3.46&lt;/td&gt;
&lt;td&gt;Best checkpoint saved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The loss curve is healthy — train and validation are tracking closely, no signs of overfitting yet.&lt;/p&gt;
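&lt;p&gt;A sanity check on the table: the step-0 train loss of 10.84 is almost exactly &lt;code&gt;ln(50257) ≈ 10.82&lt;/code&gt;, the loss of a uniform guess over a GPT-2-sized vocabulary (my assumption about GhostLM's vocab), which is exactly what a freshly initialized model should score. Converting loss to perplexity makes the progress easier to read:&lt;/p&gt;

```python
import math

def perplexity(loss_nats):
    """Mean cross-entropy in nats to perplexity (effective branching factor)."""
    return math.exp(loss_nats)

uniform_loss = math.log(50257)   # ≈ 10.8: a random model over a ~50K vocab
ppl_best = perplexity(3.46)      # ≈ 31.8 at the 5,000-step checkpoint
```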

&lt;h3&gt;
  
  
  Generation at 5,000 Steps
&lt;/h3&gt;

&lt;p&gt;Here's what the model generates when prompted with &lt;em&gt;"A SQL injection attack works by"&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A SQL injection attack works by using the admin_user sequences in the web server. Web Application Firewall Evasion Techniques present a critical defense layer against commercial and model checking. Our model achieves 94% detection rate with transformer-based sequence modeling to identify common vulnerability patterns including buffer overflows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Is it perfect? No. It bleeds between topics (SQL injection → WAF → research paper language). But it's producing grammatically correct sentences with real security terminology. At 5,000 steps on a 14.5M parameter model running on a laptop from 2018, I'll take it.&lt;/p&gt;
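&lt;p&gt;For readers who want to reproduce this kind of generation, the decoding loop behind outputs like the one above is typically a temperature plus top-k sampler. This is a generic sketch, not GhostLM's own generation code; it assumes the model returns logits of shape &lt;code&gt;(batch, seq, vocab)&lt;/code&gt;:&lt;/p&gt;

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, idx, max_new_tokens, temperature=0.8, top_k=40):
    """Autoregressive decoding: append one sampled token at a time."""
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :] / temperature       # last position only
        v, _ = torch.topk(logits, top_k)                  # per-row top-k values
        # Mask everything below the k-th largest logit, then renormalize
        logits = logits.masked_fill(logits.lt(v[:, [-1]]), -float("inf"))
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)           # grow the context
    return idx
```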

&lt;h3&gt;
  
  
  Honest Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topic coherence&lt;/strong&gt; — the model jumps between subjects mid-generation. It needs more steps to learn to stay on topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memorization&lt;/strong&gt; — some outputs are lifted nearly verbatim from training data. More diverse data would help.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt; — 14.5M params is tiny. ghost-small (55M) will be a significant jump.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU training&lt;/strong&gt; — at ~1.8s per step, 10,000 steps takes hours. GPU or TPU is needed for serious training.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I've already applied for &lt;strong&gt;Google TPU Research Credits&lt;/strong&gt; to train ghost-small on proper hardware. The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ghost-tiny to 10,000+ steps&lt;/strong&gt; — finish what I started&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ghost-small on TPU/GPU&lt;/strong&gt; — 55M params with real compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Hub release&lt;/strong&gt; — public model weights anyone can download&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live demo on HuggingFace Spaces&lt;/strong&gt; — try GhostLM in your browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark vs GPT-2&lt;/strong&gt; — objective comparison on cybersecurity tasks&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The entire project is open source. Clone it, run it, break it, improve it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/joemunene-by/GhostLM.git
&lt;span class="nb"&gt;cd &lt;/span&gt;GhostLM

&lt;span class="c"&gt;# Install everything&lt;/span&gt;
make &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Download training data&lt;/span&gt;
make data

&lt;span class="c"&gt;# Train ghost-tiny on CPU&lt;/span&gt;
make train-tiny

&lt;span class="c"&gt;# Chat with the trained model&lt;/span&gt;
make chat

&lt;span class="c"&gt;# Run the web demo&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;gradio
python demo/app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm actively looking for contributors. If you want to help with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finding new cybersecurity datasets&lt;/li&gt;
&lt;li&gt;Implementing Flash Attention or RoPE&lt;/li&gt;
&lt;li&gt;Adding distributed training&lt;/li&gt;
&lt;li&gt;Writing documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out &lt;a href="https://github.com/joemunene-by/GhostLM/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING.md&lt;/a&gt; and open a PR.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I'm a 20-year-old computer science student in Nairobi, Kenya. I don't have access to massive compute clusters or research lab budgets. But I do have curiosity, persistence, and a belief that &lt;strong&gt;open-source AI shouldn't only come from well-funded labs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GhostLM is proof that you can build something meaningful from scratch with limited resources. The architecture is clean, the training pipeline works, and the model is learning. It's not going to replace GPT-4 — but it's a foundation that anyone can build on.&lt;/p&gt;

&lt;p&gt;If you found this interesting, star the repo, try it out, and let me know what you think. The best part of open source is that it gets better when more people are involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/joemunene-by/GhostLM" rel="noopener noreferrer"&gt;https://github.com/joemunene-by/GhostLM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0&lt;/p&gt;

&lt;p&gt;Built with ❤️ in Nairobi, Kenya 🇰🇪&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>llm</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
