I indexed 500+ security terms into a searchable glossary — what I learned about knowledge bases

#webdev #database #ai #knowledge

About eight months ago I decided that the cybersecurity site I run needed a proper glossary. Not a static page with 50 definitions, but a real searchable knowledge base — terms organized by category, each with a proper definition, related terms, difficulty level, and a clean URL.

We're now at 500+ terms across 9 categories: Active Directory (102 terms), General Security (101), AI Security (working toward 100), Hacking (working toward 100), Compliance, Cloud Security, DevSecOps, Forensics, and OT/ICS. The backend is MySQL, the frontend uses Meilisearch for full-text search, and the whole thing is part of a Go Fiber application.

Here's what I actually learned building it — not the "knowledge bases are great for SEO" takes, but the technical and editorial problems I didn't anticipate.

Slug uniqueness is harder than you think

Every glossary term gets a URL slug. Simple enough. The problem is that cybersecurity is full of acronyms, and acronyms collide.

"APT" could be Advanced Persistent Threat or Advanced Package Tool (Linux). "CA" is Certificate Authority or Certification Authority (same thing, two common names). "IR" is Incident Response or Infrared. "SOC" is Security Operations Center or System-on-Chip.

My initial slug generation was naive:

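import re
import unicodedata

def slugify(text):
    """Turn a term into a lowercase, hyphen-separated ASCII slug."""
    # Decompose accented characters, then drop anything that is not ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode()
    # Lowercase and remove everything except word characters, whitespace and hyphens
    text = re.sub(r'[^\w\s-]', '', text.lower())
    # Collapse runs of whitespace and hyphens into one hyphen and trim the ends
    return re.sub(r'[-\s]+', '-', text).strip('-')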

This works fine until an entry is filed under its acronym. Then "Advanced Persistent Threat" and "Advanced Package Tool" produce two slugs that both want to be apt, and "Active Directory" wants ad, which you've already given to "Access Denied."

My solution was a suffix-based disambiguation system: if a slug already exists, append the category abbreviation (apt-securite vs apt-linux). If that also conflicts, append a counter. The database enforces uniqueness with a UNIQUE constraint:

CREATE TABLE glossary_terms (
    id          INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    term        VARCHAR(255) NOT NULL,
    slug        VARCHAR(255) NOT NULL UNIQUE,
    category    ENUM('general','active-directory','ia','hacking',
                     'conformite','cloud','devsecops','forensics','ot') NOT NULL,
    definition  TEXT NOT NULL,
    difficulty  TINYINT UNSIGNED DEFAULT 1 COMMENT '1=beginner, 2=intermediate, 3=expert',
    related     JSON COMMENT 'array of related term slugs',
    sources     JSON COMMENT 'array of {title, url} objects',
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_category (category),
    INDEX idx_difficulty (difficulty),
    FULLTEXT INDEX ft_term_def (term, definition)
);

The UNIQUE constraint on slug means slug conflicts fail hard at insert time, which is the right behavior — better a failed insert I can handle than a silent overwrite of an existing term.
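The disambiguation order described above (base slug, then category suffix, then counter) can be sketched roughly as follows. This is an illustration under assumptions, not the site's actual code: it uses an in-memory SQLite table as a stand-in for MySQL, appends the raw category value rather than a separate abbreviation, and the names candidate_slugs and insert_term are hypothetical.

```python
import re
import sqlite3
import unicodedata


def slugify(text):
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = re.sub(r"[^\w\s-]", "", text.lower())
    return re.sub(r"[-\s]+", "-", text).strip("-")


def candidate_slugs(term, category, max_counter=50):
    """Yield slugs to try, in order: base, base-category, base-category-2, ..."""
    base = slugify(term)
    yield base
    yield f"{base}-{category}"
    for n in range(2, max_counter + 1):
        yield f"{base}-{category}-{n}"


def insert_term(conn, term, category, definition):
    """Insert a term and let the UNIQUE constraint reject slugs already taken."""
    for slug in candidate_slugs(term, category):
        try:
            with conn:  # commits on success, rolls back if the insert fails
                conn.execute(
                    "INSERT INTO glossary_terms (term, slug, category, definition) "
                    "VALUES (?, ?, ?, ?)",
                    (term, slug, category, definition),
                )
            return slug
        except sqlite3.IntegrityError:
            # Assumed here to mean "slug already taken"; try the next candidate
            continue
    raise RuntimeError(f"no free slug for {term!r}")


# SQLite stand-in for the MySQL table, reduced to the columns this sketch needs
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE glossary_terms (
        id INTEGER PRIMARY KEY,
        term TEXT NOT NULL,
        slug TEXT NOT NULL UNIQUE,
        category TEXT NOT NULL,
        definition TEXT NOT NULL
    )"""
)

slugs = [
    insert_term(conn, "APT", "hacking", "Advanced Persistent Threat"),
    insert_term(conn, "APT", "general", "Advanced Package Tool"),
    insert_term(conn, "APT", "general", "A third, hypothetical meaning"),
]
print(slugs)  # ['apt', 'apt-general', 'apt-general-2']
```

Letting the database reject the duplicate, instead of checking for the slug first and inserting afterwards, avoids a race between two concurrent inserts that both see the slug as free.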

Category taxonomy matters more than definitions

Early on I focused on writing good definitions. What I should have focused on was getting the category structure right first.

My original categories were too broad. "Security" as a category is meaningless when your entire glossary is about security. I split it five times before landing on the current structure. The problem with restructuring categories after you have 200 terms in the database is that you have to re-review every term's placement, update all the category-filtered search indexes, and regenerate the category landing pages.

The Active Directory category ended up being my most visited by a significant margin — 102 terms, well-defined scope (everything related to AD, LDAP, Kerberos, GPO, LDAP injection, pass-the-hash, BloodHound, etc.), and a clear audience (sysadmins and pentesters). Category coherence matters for readers and for how search clusters results.

A lesson I'd apply to any knowledge base: design your taxonomy with at least 10 example terms per category before committing to it. If you can't generate 10 clear examples without overlap, your categories are wrong.
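One way to make that rule of thumb mechanical is a small check over a draft taxonomy. The sketch below is a hypothetical helper (check_taxonomy is not part of the site's code): it flags categories with too few distinct examples and terms claimed by more than one category. The sample data is invented, and the threshold is lowered to 3 only to keep the example short.

```python
from collections import defaultdict

MIN_EXAMPLES = 10  # threshold from the rule of thumb above


def check_taxonomy(examples, min_examples=MIN_EXAMPLES):
    """Return (thin categories, terms claimed by more than one category)."""
    # Categories that cannot muster enough distinct example terms
    thin = {
        cat: len(set(terms))
        for cat, terms in examples.items()
        if len(set(terms)) < min_examples
    }
    # Record which categories claim each term, ignoring case and stray spaces
    owners = defaultdict(set)
    for cat, terms in examples.items():
        for term in terms:
            owners[term.strip().lower()].add(cat)
    overlaps = {t: sorted(cats) for t, cats in owners.items() if len(cats) > 1}
    return thin, overlaps


# Invented draft taxonomy for illustration
draft = {
    "active-directory": ["Kerberos", "GPO", "BloodHound", "pass-the-hash"],
    "hacking": ["pass-the-hash", "SQL injection", "phishing"],
    "security": ["phishing"],
}

thin, overlaps = check_taxonomy(draft, min_examples=3)
print(thin)      # {'security': 1}
print(overlaps)  # {'pass-the-hash': ['active-directory', 'hacking'], 'phishing': ['hacking', 'security']}
```

Some overlap is legitimate (pass-the-hash really is both an AD topic and an attack technique), so the output is a list of placements to decide on, not a pass/fail verdict.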

LLM-generated definitions need human review — especially for CVEs

I used an LLM to draft a significant portion of the definitions. At the scale of 500+ terms, writing every definition from scratch is not feasible for a small team. The drafting speed is genuinely useful.

The problem is accuracy variance. For well-established terms with lots of training data (RSA, TLS handshake, SQL injection, VLAN), the drafts were good and needed only minor edits. For specific CVEs, niche attack techniques, or anything that appeared after mid-2024, the quality dropped sharply. The model would describe the vulnerability pattern correctly but give the wrong CVSS v3 score, the wrong affected versions, or the wrong discovery timeline.

My review process became two-tier: general or conceptual terms are accepted after a quick read, while anything CVE-related or carrying version-specific technical claims gets mandatory manual fact-checking. CVE definitions specifically need a cross-reference against the NVD entry. This sounds obvious in retrospect.
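The routing itself is easy to automate even though the fact-checking is not. Here is a minimal sketch, assuming a regex on CVE identifiers and a crude heuristic for version-specific claims (a CVSS mention or a dotted version number). The function name review_tier, the tier labels, and both patterns are illustrative assumptions, not the process actually used on the site.

```python
import re

# CVE identifiers look like CVE-YYYY-NNNN, with four or more digits at the end
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)

# Assumed heuristic for "version-specific technical claims":
# a CVSS mention, or a dotted version number such as 2.4.1 or v1.3
VERSIONED_CLAIM = re.compile(r"\bCVSS\b|\bv?\d+\.\d+(?:\.\d+)*\b", re.IGNORECASE)


def review_tier(term, draft):
    """Decide how much human review an LLM-drafted definition needs."""
    text = f"{term} {draft}"
    if CVE_ID.search(text):
        return "nvd-cross-check"    # compare against the NVD entry
    if VERSIONED_CLAIM.search(text):
        return "manual-fact-check"  # verify versions and scores by hand
    return "quick-read"             # general or conceptual term


print(review_tier("Log4Shell", "Remote code execution in Log4j, tracked as CVE-2021-44228."))
# nvd-cross-check
print(review_tier("Example term", "Placeholder draft: fixed in version 2.4.1 of the product."))
# manual-fact-check
print(review_tier("TLS handshake", "Negotiation phase that establishes session keys."))
# quick-read
```

A heuristic like this will over-flag (any mention of "TLS 1.3" triggers a manual check), which is the right direction to err in when the cost of a miss is a published wrong CVSS score.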

For the AI security category, the term landscape itself is still evolving fast enough that definitions needed to be written fresh rather than drafted — the concepts are too recent and too inconsistently defined across sources for an LLM to have reliable training signal on them.

Search over a glossary needs fuzzy matching

A static MySQL FULLTEXT index works fine until users search for "kerberosting" (missing an 'a'), "phising" (missing the 'h'), or "ransomwar" (trailing character dropped). These are real queries from the logs. Exact-match fulltext search returns nothing.
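Each of those three queries is exactly one edit away from the term the user meant, which is why a single-typo allowance is enough to rescue them. The sketch below measures that with a plain Levenshtein distance. It only illustrates the distance involved and does not reproduce how any particular search engine matches typos internally.

```python
def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programming table."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]


queries = {
    "kerberosting": "kerberoasting",
    "phising": "phishing",
    "ransomwar": "ransomware",
}
for typed, intended in queries.items():
    print(f"{typed} -> {intended}: {edit_distance(typed, intended)}")
# kerberosting -> kerberoasting: 1
# phising -> phishing: 1
# ransomwar -> ransomware: 1
```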

Meilisearch handles this much better out of the box. Its typo-tolerance is configurable, and for a glossary where almost every query is a technical term, the right settings are:

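typoTolerance.minWordSizeForTypos.oneTypo: 4 (allow one typo for words of 4+ characters; the default is 5)
typoTolerance.minWordSizeForTypos.twoTypos: 8 (allow two typos for words of 8+ characters; the default is 9)
searchableAttributes: ["term", "definition"], in that order, so that matches in term rank above matches in definition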
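As a rough sketch, those settings can be sent to Meilisearch's index settings endpoint as a single JSON payload. The example uses only the standard library. The base URL, the index name glossary, and the API key are placeholders, and the helper names are hypothetical. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def glossary_settings():
    """Settings payload: term listed before definition, lower typo thresholds."""
    return {
        # Attribute order matters: earlier attributes rank higher
        "searchableAttributes": ["term", "definition"],
        "typoTolerance": {
            "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8},
        },
    }


def settings_request(base_url, index_uid, api_key):
    """Build (but do not send) a PATCH request for the index settings."""
    body = json.dumps(glossary_settings()).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/indexes/{index_uid}/settings",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Placeholder host, index name, and key
req = settings_request("http://localhost:7700", "glossary", "MASTER_KEY")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req)  # uncomment to apply against a running instance
```

Lowering oneTypo to 4 makes short acronyms fuzzier as well, so it is worth checking that a four-letter query does not start matching unrelated four-letter terms.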

The difference in user experience between exact-match and typo-tolerant search on a technical glossary is substantial. Someone who searches for "kerberosting" and gets no results doesn't search again — they leave.

The parts that worked better than expected

Linking related terms was worth the effort. When a term has a related array pointing to 3-4 other slugs, readers navigate more than twice as many pages per session. For a knowledge base where depth of understanding matters (you can't understand pass-the-hash without understanding NTLM hashes, which requires understanding authentication protocols), the links create a learning path rather than a dead end.
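Because related is just an array of slugs, nothing stops it from pointing at a term that was renamed or never written. A minimal sketch of two checks worth running over the data, assuming the glossary is loaded as a dict of slug to related slugs (the sample entries and function names are illustrative):

```python
from collections import deque


def dangling_links(terms):
    """Map each slug to the related slugs that do not exist in the glossary."""
    missing = {}
    for slug, related in terms.items():
        bad = [r for r in related if r not in terms]
        if bad:
            missing[slug] = bad
    return missing


def reading_order(terms, start):
    """Breadth-first walk over related links, starting from one term."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        slug = queue.popleft()
        order.append(slug)
        for nxt in terms.get(slug, []):
            # Skip dangling links and terms already queued
            if nxt in terms and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Illustrative data: "mimikatz" is referenced but has no entry of its own
terms = {
    "pass-the-hash": ["ntlm-hash", "mimikatz"],
    "ntlm-hash": ["authentication-protocols"],
    "authentication-protocols": [],
}

print(dangling_links(terms))                  # {'pass-the-hash': ['mimikatz']}
print(reading_order(terms, "pass-the-hash"))
# ['pass-the-hash', 'ntlm-hash', 'authentication-protocols']
```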

Difficulty levels were also useful in ways I didn't fully anticipate. Filtering to "beginner" terms gives a coherent introduction path. The distribution ended up roughly 20% beginner, 60% intermediate, 20% expert — which felt right for an audience that's mostly practitioners rather than newcomers.

The glossary is part of the broader content library at AYI NEDJIMI Consultants, where we also cover practical security topics — network hardening, compliance frameworks, threat analysis, and more. If you're in the security space, the free hardening checklists are probably the most practically useful thing on the site — 17 checklists for FortiGate, pfSense, Sophos, Active Directory, Windows Server, Linux, DNS security, and others.

Building this taught me more about taxonomy design and search relevance tuning than about cybersecurity itself. The technical problems in a knowledge base project are mostly not the ones you expect.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists for FortiGate, Palo Alto, pfSense, Sophos, Active Directory and more — PDF and Excel.