Santiago

Posted on Jun 2

Research, simulate, ship: building a D&D personality quiz

#showdev #ai #githubactions #learning

My friends and I are in the middle of a brutal Dungeons and Dragons (D&D) campaign called Tomb of Annihilation. We often talk about which class we would like to play. If you're not familiar, a D&D class is like an archetype that tells you what kind of character you're going to play and what they are good at. Are we oath-bound Paladins, wandering Rangers, or perhaps cunning Rogues?

Some of us have modeled our characters on ourselves; others, like me, play something wildly different. The biggest contrast was a friend who's a chief of staff by day and a brutish minotaur during the game.

That got me wondering: which class would each of us actually be in real life, not just who we want to play? The question quickly turned from banter to a project.

I decided to make a personality test inspired by ones I've taken at work or for my own insight, but instead of matching takers to personality types, it would match them to character classes. I could also use this project as an excuse to learn how GitHub Pages and Actions work together to deploy a site. Three interests collided: D&D, personality insight, and coding. In this article, I'll walk you through my process.

Read before you write

There are four assessments that shaped what I built, and my first step was to dig deeper into each and tie together their best attributes.

CliftonStrengths, from Gallup, gave me a ranked profile across many themes instead of a single label. Two people can share their top theme and still feel different because of what sits below it.

The Big Five gave me the scoring model and the kind of evidence to ask for. Each trait sits on a spectrum rather than as a binary type, so a person can lean 70 percent toward one end without being forced to pick a side, and research on the Big Five in everyday life showed that what people actually do is a stronger signal than what they say about themselves.

O*NET, run by the US Department of Labor, gave me a target length for the test. Their Interest Profiler runs in 10 to 20 minutes. Past that point, fatigue kicks in, and answers get worse, or people tap out. I aimed for under 10 minutes for the median taker.

Situational Judgment Tests (SJT) were the last piece. Instead of asking a taker to rate themselves on a scale, an SJT drops them into a realistic situation and asks what they would actually do. SJTs come from industrial psychology, where they outperform abstract personality questions at predicting how people behave under real conditions. For this test, they were how I planned to tell apart profiles that overlap on paper but feel different in real life.

After that, I switched topics. I wrote out all 13 D&D 5.5e classes and described the real-life person each one points at. The Wizard reads the docs and builds the mental model. The Bard changes the mood of a room. The Artificer turns knowledge into usable tools. The Cleric is care and service. The Paladin is an oath and a mission.

The two topics met when one told me the shape of the test and the other told me what the test was measuring. Both ended up in a single research doc, which is what every later decision point refers back to.

A note on tooling. The research was conducted both by hand and with AI. I picked the tests and decided what mattered out of each one. ChatGPT did the deep digging on each test and helped me collate the notes. The code that runs the quiz was written with Claude Code.

I needed more cowbell

That was a good start, but I still needed more context to shape the direction I wanted to take the test. I wrote two more docs: one that covered the engine's features and one to make the test feel more personal.

The first is the engine spec. It is the source of truth for the engine. It lists every class the quiz supports and every subclass that sits inside each one. It names the classes that overlap and how to tell them apart. It locks in the scoring rules, the result page contents, and the tests every build has to pass. When I change a feature, the spec and the code get updated together. Neither one leads.

Some of what the spec locks in came from later conversations, not the first sketch. Subclasses and multiclassing were not in the original plan. They came up when people started asking what should happen if the engine could not decide cleanly between two classes. Both ideas had to land in the spec before the engine could grow to support them. Each makes the personality test feel unique and more customized.

The spec holds the class overlap matrix. Wizard versus Artificer is the cleanest example. Both are smart and technical. The Wizard researches the root cause until the system makes sense. The Artificer builds the tool that makes the problem stop coming back. Bard versus Sorcerer, Cleric versus Paladin, and Ranger versus Druid all get the same treatment.

The differentiator questions fall straight out of that matrix. Here is the real Wizard versus Artificer question that ended up in the quiz:

The same annoying thing keeps breaking at work, home, in your group. After the third time, you: Build a fix, script, jig, checklist, automation so it stops repeating. (Artificer)

That answer leans Artificer for the main class, leans Armorer or Battle Smith inside Artificer for the subclass, and stamps "Fix Builder" on the trait facets. One click, three signals.

Six other answers sit underneath that one, each pointing at a different class. Same prompt, seven different people. That is the whole reason scenario questions exist.
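One way to picture "one click, three signals" is a single answer record that carries all three layers. This is only a sketch: the type names, field names, and the weight value are illustrative assumptions, not the quiz's actual data.

```typescript
// Hypothetical shape for one answer option; names and weights are illustrative.
type AnswerOption = {
  label: string;
  classWeights: Record<string, number>; // layer 1: main class scores
  subclassTags: string[];               // layer 2: counted only inside the winning class
  traitFacets: string[];                // layer 3: rolled up into trait badges
};

type Tally = {
  classScores: Record<string, number>;
  subclassTags: Record<string, number>;
  traitFacets: Record<string, number>;
};

function emptyTally(): Tally {
  return { classScores: {}, subclassTags: {}, traitFacets: {} };
}

// Record one click: all three layers are updated from the same answer.
function applyAnswer(tally: Tally, option: AnswerOption): Tally {
  for (const cls of Object.keys(option.classWeights)) {
    tally.classScores[cls] =
      (tally.classScores[cls] || 0) + (option.classWeights[cls] || 0);
  }
  for (const tag of option.subclassTags) {
    tally.subclassTags[tag] = (tally.subclassTags[tag] || 0) + 1;
  }
  for (const facet of option.traitFacets) {
    tally.traitFacets[facet] = (tally.traitFacets[facet] || 0) + 1;
  }
  return tally;
}

const buildAFix: AnswerOption = {
  label: "Build a fix, script, jig, checklist, automation so it stops repeating.",
  classWeights: { Artificer: 3 }, // assumed weight
  subclassTags: ["Armorer", "Battle Smith"],
  traitFacets: ["Fix Builder"],
};

const tally = applyAnswer(emptyTally(), buildAFix);
```

Keeping the three layers on the same record means a question can be reworded or reweighted in one place without the layers drifting apart.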

The spec also locks in the scoring rules. Main class first, multiclass rare, subclass second, traits third. Multiclass triggers only when the second class clears around 86 percent of the top score and both clear an activation floor. Subclass scoring runs only inside the winning class. Traits sit outside the class ladder entirely.

Behind every answer, the engine tracks three scoring layers in parallel. Class scores. Subclass tags, which are counted only inside the winning class. Trait facets, which become the trait badges on the result page.
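The scoring order above can be sketched in a few lines. The 0.86 ratio comes from the "around 86 percent" rule; the activation floor value and every function name here are assumptions made for illustration, not the engine's real code.

```typescript
// Assumed constants: the ratio mirrors the "around 86 percent" rule,
// the floor is a placeholder value.
const MULTICLASS_RATIO = 0.86;
const ACTIVATION_FLOOR = 10;

type Resolution = { main: string; multiclass: string | null };

function resolveClasses(classScores: Record<string, number>): Resolution {
  // Rank classes from highest score to lowest.
  const ranked = Object.keys(classScores).sort(
    (a, b) => (classScores[b] || 0) - (classScores[a] || 0)
  );
  if (ranked.length === 0) throw new Error("no class scores to resolve");

  const main = ranked[0];
  const second = ranked.length > 1 ? ranked[1] : null;
  const topScore = classScores[main] || 0;
  const secondScore = second !== null ? classScores[second] || 0 : 0;

  // Multiclass is rare: both classes must clear the floor and the
  // runner-up must reach the ratio of the top score.
  const qualifies =
    second !== null &&
    topScore >= ACTIVATION_FLOOR &&
    secondScore >= ACTIVATION_FLOOR &&
    secondScore >= MULTICLASS_RATIO * topScore;

  return { main: main, multiclass: qualifies ? second : null };
}

// Subclass scoring only looks at tags that belong to the winning class.
function pickSubclass(
  main: string,
  tagCounts: Record<string, number>,
  subclassesOf: Record<string, string[]>
): string | null {
  let best: string | null = null;
  let bestCount = 0;
  for (const sub of subclassesOf[main] || []) {
    const count = tagCounts[sub] || 0;
    if (count > bestCount) {
      best = sub;
      bestCount = count;
    }
  }
  return best;
}
```

For example, `resolveClasses({ Wizard: 20, Artificer: 18 })` reports Artificer as a multiclass because 18 clears 86 percent of 20, while `{ Wizard: 20, Artificer: 15 }` resolves to a plain Wizard.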

The second doc is a catalog of common hobbies and the skills they develop. Seven categories: Creative, Practical, Physical, Intellectual, Social, Digital, Calm. Each entry maps a hobby to the skills it actually develops in real life. The quiz uses this catalog as the backbone for its rank three hobby question, where the user picks one to three hobbies that fit them. Gardening leans Druid. Coding leans Artificer with a small Wizard pull. Martial arts leans Monk. Public speaking leans Bard. The catalog meant I never had to invent activities from scratch.
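A catalog like this could be stored as a lookup table that the hobby question reads from. In the sketch below, the class leanings follow the examples above, but the category assignments, skill lists, and numeric weights are assumptions made for illustration.

```typescript
// Hypothetical catalog entries. Leanings follow the article's examples;
// categories, skills, and weights are assumed.
type HobbyEntry = {
  category: string;
  skills: string[];
  classWeights: Record<string, number>;
};

const HOBBY_CATALOG: Record<string, HobbyEntry> = {
  Gardening: {
    category: "Calm",
    skills: ["patience", "observation"],
    classWeights: { Druid: 2 },
  },
  Coding: {
    category: "Digital",
    skills: ["problem solving", "tool building"],
    classWeights: { Artificer: 2, Wizard: 1 }, // small Wizard pull
  },
  "Martial arts": {
    category: "Physical",
    skills: ["discipline", "body control"],
    classWeights: { Monk: 2 },
  },
  "Public speaking": {
    category: "Social",
    skills: ["persuasion", "reading a room"],
    classWeights: { Bard: 2 },
  },
};

// Sum the class weights for the one to three hobbies a taker picked.
function scoreHobbies(picks: string[]): Record<string, number> {
  if (picks.length < 1 || picks.length > 3) {
    throw new Error("pick between one and three hobbies");
  }
  const scores: Record<string, number> = {};
  for (const hobby of picks) {
    const entry = HOBBY_CATALOG[hobby];
    if (!entry) throw new Error("unknown hobby: " + hobby);
    for (const cls of Object.keys(entry.classWeights)) {
      scores[cls] = (scores[cls] || 0) + (entry.classWeights[cls] || 0);
    }
  }
  return scores;
}
```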

Get hands on and see if the vision holds

When it comes to my projects, once I have a good foundation, I like to build a prototype and start playtesting it to check whether the vision is actually coming together or whether I need to reconsider any of it.

What came together at this stage:

A branching phase machine: baseline questions first, then tiebreakers if the top scores were close, then a subclass round inside the winning class, then the result card.
Scoring across 13 classes for every answer.
Multiclass detection that triggers only when the top two scores are close enough to matter.
Trait facets that get stamped on every answer and roll up into three trait badges.
A result card with the class icon, the subclass label, three trait badges, and the top class score bar chart.
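The branching in that list can be sketched as a small transition function. The phase names and the cutoff for "close" are assumptions for illustration; the real engine may define closeness differently.

```typescript
type Phase = "baseline" | "tiebreaker" | "subclass" | "result";

// Assumed cutoff: tiebreakers run when the runner-up is within 10 percent of the leader.
const TIEBREAK_RATIO = 0.9;

function topTwoAreClose(classScores: Record<string, number>): boolean {
  const sorted = Object.keys(classScores)
    .map((cls) => classScores[cls] || 0)
    .sort((a, b) => b - a);
  if (sorted.length < 2) return false;
  return sorted[0] > 0 && sorted[1] >= TIEBREAK_RATIO * sorted[0];
}

// Decide which phase follows the one that just finished.
function nextPhase(current: Phase, classScores: Record<string, number>): Phase {
  if (current === "baseline") {
    // Only ask tiebreakers when the baseline round could not separate the leaders.
    return topTwoAreClose(classScores) ? "tiebreaker" : "subclass";
  }
  if (current === "tiebreaker") return "subclass";
  // After the subclass round, and from the result itself, show the result card.
  return "result";
}
```

Because the transition depends only on the current phase and the scores so far, each branch can be exercised in a test without rendering any UI.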

The first version was functional but didn't have style. Buttons worked. Scores resolved. The result page rendered the right things in the right slots. I could open it, answer questions, and see a result that made sense for the answers I had given.

What it did not have yet was personality. During iteration, I leaned into the theme and tried to make the whole thing feel like an 8-bit RPG. I tried several color schemes inspired by the NES games I grew up with until I landed on one that felt right and stayed readable. Cream parchment background. Dark ink text. Warm pixel orange for the accents. A scanline overlay to suggest a CRT, a pixel font for the titles, and finally, monospace for the body.

The look and feel was not the point of the project, but it was what turned a working quiz into something unique you actually want to share with a friend.

Iterate through feedback loops

I rebuilt parts of the quiz more times than I can easily count. Each iteration fell into one of three feedback loops, each catching a different class of bug.

Loop 1 was gut feel. I took the quiz myself, made some adjustments, and then showed it to friends several times. That loop caught the obvious stuff. Questions that did not match how people talk. Hobby labels nobody understood. Choices that overlapped so badly that no answer felt right. Most fixes were small and obvious in retrospect. They were invisible from inside the data, because the data did not know the labels were bad.

Loop 2 was synthetic data. I wrote a tool that simulated 1000 attempts at taking the test. Each run picked uniformly random answers, and the simulator counted how often each class won. The first run came back lopsided. Druid at zero percent. Artificer at over half. That's not a balance problem. It's a structural problem, because under random input, Druid was unreachable as a class.

The fix was not to just boost the least likely classes. It was tracing every weight to find which combinations were stealing Druid's score. A few subclass tags were attached to options that fed neighboring classes harder than I had intended. The hobby weights for gardening, conservation, and wildlife care were sitting at less than half the strength they needed. After a few rounds of edits and running the simulation again, every class landed in the 50 to 110 win range across 1000 runs. Multiclass settled at about 5 percent.
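A simulator of this kind can be small. The sketch below is not the tool from the project: the question bank is a toy built so that every Druid option feeds Artificer at least as hard, and the seeded generator and function names are assumptions chosen to keep the runs repeatable.

```typescript
type SimOption = { classWeights: Record<string, number> };
type SimQuestion = { options: SimOption[] };

// Small seeded generator (a linear congruential generator) so runs are repeatable.
function seededRandom(seed: number): () => number {
  let state = seed % 4294967296;
  return function () {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Answer every question uniformly at random and count which class wins each run.
function simulate(
  questions: SimQuestion[],
  classes: string[],
  runs: number,
  rand: () => number
): Record<string, number> {
  const wins: Record<string, number> = {};
  for (const cls of classes) wins[cls] = 0;
  for (let run = 0; run < runs; run++) {
    const scores: Record<string, number> = {};
    for (const q of questions) {
      const pick = q.options[Math.floor(rand() * q.options.length)];
      for (const cls of Object.keys(pick.classWeights)) {
        scores[cls] = (scores[cls] || 0) + (pick.classWeights[cls] || 0);
      }
    }
    // Highest score wins; ties go to the class listed first.
    let winner = classes[0];
    for (const cls of classes) {
      if ((scores[cls] || 0) > (scores[winner] || 0)) winner = cls;
    }
    wins[winner] = (wins[winner] || 0) + 1;
  }
  return wins;
}

// Classes that never won are structurally unreachable, not merely unlucky.
function unreachableClasses(wins: Record<string, number>): string[] {
  return Object.keys(wins).filter((cls) => wins[cls] === 0);
}

// Toy bank: each Druid option also feeds Artificer at least as much.
const TOY_CLASSES = ["Artificer", "Wizard", "Druid"];
const TOY_BANK: SimQuestion[] = [
  { options: [
    { classWeights: { Artificer: 3 } },
    { classWeights: { Wizard: 2 } },
    { classWeights: { Druid: 1, Artificer: 2 } },
  ] },
  { options: [
    { classWeights: { Artificer: 2 } },
    { classWeights: { Wizard: 3 } },
    { classWeights: { Druid: 1, Artificer: 1 } },
  ] },
];

const wins = simulate(TOY_BANK, TOY_CLASSES, 1000, seededRandom(42));
const stuck = unreachableClasses(wins);
```

In this toy bank Druid cannot win whatever the random draws are, which is the same signature as the lopsided first run: the thing to fix is the weights that leak to a neighbor, not the number of runs.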

Loop 3 was a targeted public test. I tapped into Discord communities and asked them to take the quiz and give me feedback on whether the result felt like them. Some takers might have seen my posts on Discord before; others were strangers who only knew the link. They validated what was working and pointed out what could be sharper, such as shortening the questions and improving the layout.

Frankly, it sparked joy when the feedback came in positive. Seeing strangers recognize themselves in the result was the part that told me the quiz was connecting with people. The simulator could never have surfaced that signal, because the simulator did not know what a Bard or a Ranger felt like to an actual person.

The lesson I took out of those three rounds: go with your gut, then pivot or tune as you see what is and is not working.

Automate the site

Now that the quiz worked locally on my machine, I needed a place to host it. The code already lived on GitHub. The simplest path was to serve the live site from there, too. GitHub Pages does exactly that for static sites, and GitHub Actions automates the deployment. I push a change, the site rebuilds and redeploys, and I do not touch anything else.

The workflow runs on every push to main. It checks out the code, sets up Node (the JavaScript runtime the build needs), installs dependencies, runs the self tests, builds the site, and uploads the result as a Pages artifact. A second job deploys it. The whole pipeline, from push to live site, finishes in under a minute. Here is a brief example.

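jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Fetch the repository contents
      - uses: actions/checkout@v4
      # Install Node 20 and cache npm downloads between runs
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm install --no-audit --no-fund
      # Stop the pipeline before building if the self tests fail
      - run: npm test
      - run: npm run build
      # Package the build output in dist as the artifact the deploy job publishes
      - uses: actions/upload-pages-artifact@v3
        with:
          path: dist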

The full version can be found on my GitHub repo.

There are two small details I had to get right when setting this up.

The first was the subdirectory problem. GitHub Pages does not serve a project site from the domain root. It serves it at username.github.io/repo-name/. The build tool assumed the root, so the first deploy gave me a white page and a console full of 404s. The fix was a single line of base path config that tells the build to prefix every asset URL with the project subpath.
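To show what that one line changes, here is a sketch of the prefixing a base path setting performs. The article does not name the build tool, so `withBase` is a hypothetical helper rather than any tool's API. In Vite, as one example, the equivalent setting is the `base` option.

```typescript
// Hypothetical helper: prefix an asset path with the project subpath
// that GitHub Pages serves the site from.
function withBase(base: string, assetPath: string): string {
  // Normalize the base to a single leading slash and no trailing slash.
  const cleanBase = "/" + base.replace(/^\/+|\/+$/g, "");
  const prefix = cleanBase === "/" ? "/" : cleanBase + "/";
  // Drop leading slashes on the asset so the result has no double slash.
  return prefix + assetPath.replace(/^\/+/, "");
}

// With the default root base, the browser asks for /assets/app.js and gets a 404.
const brokenUrl = withBase("/", "assets/app.js");

// With the project subpath, the request lands where Pages actually serves the file.
const fixedUrl = withBase("/repo-name/", "assets/app.js");
```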

The second was link previews. When you paste a URL into Discord, Slack, or LinkedIn, the platform fetches the page's link preview metadata to build a preview card with a title, description, and image. A quiz built as a React app ships as a single HTML file, so every shared result link points at the same metadata. The first time a friend pasted his Bard link in our group chat, Discord showed the generic site title and the default image in the preview card. That defeats the part where someone wants to flex which class they landed in.

A small post build step writes 13 tiny HTML stubs, one per class. Each stub swaps in that class's icon and title for the preview metadata tags. Social crawlers cannot run JavaScript, so they read the stub and pull the right preview. Then the app reads the URL hash on load and reconstructs the full result no matter which stub the visitor landed on. The next time he pasted the link, the preview card showed the Bard icon and the Bard title.
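The stub idea can be sketched as two small functions: one that renders the preview metadata for a class, and one that reads the result back out of the URL hash. The Open Graph tags (`og:title`, `og:image`, `og:url`) are a common choice for link previews, but the article does not say which tags the quiz uses. The URL layout, icon path, and hash format below are likewise assumptions.

```typescript
type ClassMeta = { slug: string; title: string; icon: string };

// Escape text before placing it inside HTML attributes or elements.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render one static stub whose preview metadata names a single class.
// A real stub would also need to load the app bundle; that part is omitted here.
function renderStub(siteUrl: string, cls: ClassMeta): string {
  const pageUrl = siteUrl + cls.slug + "/";
  return [
    "<!doctype html>",
    '<html lang="en">',
    "<head>",
    '<meta charset="utf-8">',
    "<title>" + escapeHtml(cls.title) + "</title>",
    '<meta property="og:title" content="' + escapeHtml(cls.title) + '">',
    '<meta property="og:image" content="' + escapeHtml(siteUrl + cls.icon) + '">',
    '<meta property="og:url" content="' + escapeHtml(pageUrl) + '">',
    "</head>",
    '<body><div id="root"></div></body>',
    "</html>",
  ].join("\n");
}

// Read key=value pairs from a hash such as "#class=bard&sub=lore" (assumed format).
function parseResultHash(hash: string): Record<string, string> {
  const out: Record<string, string> = {};
  const body = hash.replace(/^#/, "");
  for (const pair of body.split("&")) {
    if (pair === "") continue;
    const eq = pair.indexOf("=");
    const key = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? "" : pair.slice(eq + 1);
    out[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return out;
}

const bardStub = renderStub("https://username.github.io/repo-name/", {
  slug: "bard",
  title: "Bard",
  icon: "icons/bard.png",
});
```

The crawler only ever sees the static tags in the stub, while a human visitor's browser still gets the full result from the hash, so the two never have to agree on anything beyond the class in the URL.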

Once the deployment was automatic, my testing moved off my local machine and onto the live site.

Take the quiz, look at the code

If you've ever wondered who you might actually be in a fantasy setting, take the quiz. It takes about 8 minutes, and I think you'll have fun with it. I'm always curious to know what people's results are and how they feel about them, so make sure to share yours.

The code is on my GitHub repo for those of you who want to dig deeper or be inspired to build your own project. If you do, leave a comment and share what you built.

Top comments (4)

Echo • Jun 2

The "research, simulate, ship" framing is the right read on how small side-projects get built today. Two things I'd add from running a similar shape:

1) The "simulate" step is the one that separates "fun experiment" from "actually useful artifact". Most LLM apps skip it — they ship the prompt-as-product and the user gets a different answer every time. Simulating means defining the answer space up front (5 classes -> 5 vectors, each with a small fixed vocabulary), then the LLM picks within it. The personality test only feels stable because the 5-class answer space is the product.

2) For a 5-question quiz, the interesting design choice is what the result looks like. A wall of text loses. A 1-line class + a 2-line "you would also make a great ___" is what makes people share it. The "share my result" mechanic is doing more work than the prompt engineering.

The choice to not use a real LLM in production here is also a feature — the deterministic output is what lets your friend who got "Paladin" last week still get the same answer today. LLM-as-judge would silently re-shuffle the class over time.

Curious whether the friend-as-chief-of-staff actually got "Paladin" or whether the result surprised them — those moments are where the artifact is doing its job.

Santiago • Jun 2

So far, all my friends landed about where they thought they would. From there, they experimented with answering the questions differently.

Nazar Boyko • Jun 2

The 1000-run simulator is the part that stuck with me. Most people would've just bumped the weights on the unreachable classes and called it balanced. Tracing it back to subclass tags bleeding into neighboring classes is the kind of fix that only shows up when you treat the quiz like a system, not a form. The Druid-at-zero / Artificer-at-half result is a great example of a structural bug masquerading as a tuning problem.
The 13 HTML-stub trick for link previews is also a really clean solve. I've hit that exact wall with SPA metadata and ended up reaching for SSR way too early when a post-build step would've done the job.
Curious how you landed on the ~86% multiclass threshold, was that gut feel from the early playtests, or did it fall out of the simulation distribution?

Santiago • Jun 2

I'm glad that you found the article insightful. For the multiclass threshold, you're right, it was mostly gut feel. I wanted it to be a rare event that felt special as people shared it.