I have an unconventional hobby. As part of a project, I visit schools around Czechia and introduce my programming job to kids. Since I always talk about AI tools, I decided to put together [a simple chatbot website] that communicates with my paid ChatGPT.
Using my No. 1 webdev framework, Nuxt, this is simple enough. Just add a module that someone has already developed to connect to the ChatGPT API, feed it your API key, and you can start sending prompts and receiving answers. Spice up the website with some basic styling and UI elements provided by the official Nuxt UI module, and it is almost good to go.
Almost, because the responses come back in Markdown format. I am not fully sure whether the connector module does that under the hood or whether it is the default for (mostly) programming-related questions, but either way, it has to be dealt with.
The cool thing is my text processing function is currently only 22 "active" lines of code.
Parsing the input
I identified a few patterns to deal with:
HTML tags
This is not really Markdown yet, but angle brackets should be replaced with their respective HTML entities so they are not rendered as actual tags. That's simple enough:
input = input.replaceAll('<', '&lt;')
input = input.replaceAll('>', '&gt;')
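The same step can be sketched as a standalone helper. The name escapeHtml is hypothetical, and the ampersand rule is an addition that the function in this article does not have; it is included on the assumption that a response may contain a literal & that should be shown as-is.

```typescript
// Hypothetical helper name; not part of the original function.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, '&amp;') // runs first so the entities added below are not escaped again
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
}

console.log(escapeHtml('<b>Tom & Jerry</b>'))
// → &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
```

Escaping has to happen before any other rule, because every later step injects real HTML tags that must not be escaped.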
Bold and italic font
Let's start cracking Markdown with basic typography. ChatGPT typically only uses double ** to emphasize points in lists. However, there are more variants. Italic is a single underscore OR a single asterisk; bold is a double asterisk OR a double underscore.
I am catching the patterns using regular expressions (remember, they are not your enemy, folks). I need to match the opening and closing pattern, then capture everything in between that is NOT the pattern character (to avoid the natural greediness of the regex).
I added a special rule saying the formatted text should occur after whitespace (the \s metacharacter). This solves the edge case when you ask how formatting is done in Markdown: the response comes back wrapped in backticks. This way, I can leave it untouched and present the actual syntax.
To get the desired formatting effect in HTML, I enclose the captured result in the respective HTML tags.
First look for double occurrences - those are the bold texts. The remaining ones are left for the italics. Unlike when parsing fully free text input, you can mostly rely on ChatGPT to maintain proper structure.
This is how it looks in the code:
// format bold font
input = input.replaceAll(/\s\*\*([^*]+)\*\*/g, `<strong>$1</strong>`)
input = input.replaceAll(/\s__([^_]+)__/g, `<strong>$1</strong>`)
// format italic font
input = input.replaceAll(/\s\*([^*]+)\*/g, `<em>$1</em>`)
input = input.replaceAll(/\s_([^_]+)_/g, `<em>$1</em>`)
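One detail worth knowing: the leading \s is part of the match, so the replacements above also remove the whitespace in front of the formatted text. Below is a minimal sketch of a variant that captures that whitespace and puts it back. The helper name formatEmphasis and the extra capture group are assumptions for illustration, not part of the original function.

```typescript
// Hypothetical helper: same bold/italic rules, but the leading whitespace is kept.
function formatEmphasis(input: string): string {
  return input
    // bold first, so the double markers are gone before the single ones are matched
    .replace(/(\s)\*\*([^*]+)\*\*/g, '$1<strong>$2</strong>')
    .replace(/(\s)__([^_]+)__/g, '$1<strong>$2</strong>')
    // whatever single markers remain are italics
    .replace(/(\s)\*([^*]+)\*/g, '$1<em>$2</em>')
    .replace(/(\s)_([^_]+)_/g, '$1<em>$2</em>')
}

console.log(formatEmphasis('This is **bold** and this is _italic_.'))
// → This is <strong>bold</strong> and this is <em>italic</em>.

console.log(formatEmphasis('Write `**bold**` to get bold text.'))
// → Write `**bold**` to get bold text.
```

The second call shows the backtick edge case: the markers follow a backtick rather than whitespace, so the syntax is left untouched.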
Code blocks
Code blocks are quite common when you ask programming-related questions. To put them to good use, they need to be visually set apart from the surrounding text. The idea is - when you find the first (opening) one, replace it with an opening <pre> tag. The next one should be replaced with a closing tag. Repeat that until there is no pattern left.
First look for the "big" blocks enclosed in triple backticks. A wrapping <div> is added to ensure they start on a new line:
while (input.includes('```')) {
  input = input.replace('```', '<div><pre>')
  input = input.replace('```', '</pre></div>')
}
What remains is the inline code between single backticks:
while (input.includes('`')) {
  input = input.replace('`', '<pre>')
  input = input.replace('`', '</pre>')
}
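Both loops follow the same open/close alternation, so they can be sketched as one small helper. The name replacePairs and its parameters are hypothetical. The fence marker is assembled from two pieces only so that the example itself survives Markdown rendering.

```typescript
// Hypothetical helper: swaps alternating occurrences of `marker`
// for an opening tag and a closing tag.
function replacePairs(input: string, marker: string, open: string, close: string): string {
  while (input.indexOf(marker) !== -1) {
    input = input.replace(marker, open)  // the first remaining marker opens the block
    input = input.replace(marker, close) // the next one closes it (no-op if there is none)
  }
  return input
}

const fence = '``' + '`' // three backticks
let rendered = fence + '\nconst a = 1\n' + fence + '\nThen call `a`.'

// Triple backticks must go first, otherwise the single-backtick pass would eat them.
rendered = replacePairs(rendered, fence, '<div><pre>', '</pre></div>')
rendered = replacePairs(rendered, '`', '<pre>', '</pre>')

console.log(rendered)
// → <div><pre>
//   const a = 1
//   </pre></div>
//   Then call <pre>a</pre>.
```

With an odd number of markers the last opening tag stays unclosed, so this relies on ChatGPT closing every block it opens.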
Headlines
Headlines are frequent inhabitants of ChatGPT responses. In Markdown, 1-6 # characters represent the h1-h6 tags. The rule says there has to be a space after them, and then everything until the end of the current line is considered a headline. And for the same reason as with bold/italic text, there should be whitespace right before the start (and not a backtick, as when ChatGPT tries to show the actual syntax).
Thus, for h1 the treatment looks like this:
input = input.replaceAll(/\s#\s([^\n]*)\n/g, `<h1>$1</h1>`)
And you can figure out how to do the rest for sure.
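If you would rather not write six nearly identical lines, the rule can be generated in a loop. This is only a sketch: the helper name formatHeadlines is hypothetical, and it builds the same pattern as the h1 line above for each level.

```typescript
// Hypothetical helper: applies the h1 rule to all six headline levels.
function formatHeadlines(input: string): string {
  for (let level = 6; level >= 1; level--) {
    const hashes = new Array(level + 1).join('#') // e.g. '###' for level 3
    // whitespace, the # run, one whitespace, then everything up to the line end
    const pattern = new RegExp('\\s' + hashes + '\\s([^\\n]*)\\n', 'g')
    input = input.replace(pattern, '<h' + level + '>$1</h' + level + '>')
  }
  return input
}

console.log(JSON.stringify(formatHeadlines('Intro\n## Setup\nText\n')))
// → "Intro<h2>Setup</h2>Text\n"
```

Note that the pattern needs whitespace in front of the # characters, so a headline on the very first line of a response is left as it is.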
Links
HTML links are a little bit tricky, but let's do it together. In Markdown, we have square brackets filled with the displayed text, directly followed by regular brackets containing the target link. We need to capture what's inside and inject it into the HTML <a> structure.
Matching brackets is not that straightforward, because they serve as metacharacters in regular expressions. Therefore, the target expression is infested with additional backslashes to escape them:
input = input.replaceAll(/\s\[([^\]]+)\]\(([^)]+)\)/g, `<a href="$2">$1</a>`)
Let's break it down a bit. You can use Regex101 to follow me.
- \s: just to avoid the syntax being lost when enclosed in backticks
- \[: literal match of the opening square bracket
- ([^\]]+): capturing group for the link text - anything that is NOT a closing square bracket (because of greediness), 1-n times
- \]: literal match of the closing square bracket
- \(: literal match of the opening bracket
- ([^)]+): capturing group for the link href - anything that is NOT a closing bracket, 1-n times
- \): literal match of the closing bracket
Aaand...done :)
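To see the two capturing groups at work without leaving your editor, here is a small sketch that runs the same pattern against a sample sentence. The sample text and the example.com URL are made up for illustration.

```typescript
// Same pattern as above, without the g flag so exec() returns the groups of the first match.
const linkPattern = /\s\[([^\]]+)\]\(([^)]+)\)/

const linkMatch = linkPattern.exec('Read the [docs](https://example.com) first.')
if (linkMatch) {
  console.log(linkMatch[1]) // → docs
  console.log(linkMatch[2]) // → https://example.com
}

// Syntax wrapped in backticks is not preceded by whitespace, so it does not match.
console.log(linkPattern.test('Use `[text](url)` for links.'))
// → false
```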
New lines
Finally, a small but important task. Replace literal line ends, which would otherwise be swallowed by the browser, with <br> tags to maintain the desired visual structure:
input = input.replaceAll('\n', `<br>`)
Wrapping up
Here you can see how my function looks in full. It already includes some basic Tailwind-based CSS styling.
function preFormat(input: string): string {
  if (!input) {
    return ''
  }
  // escape html tags
  input = input.replaceAll('<', '&lt;')
  input = input.replaceAll('>', '&gt;')
  // format bold font
  input = input.replaceAll(/\s\*\*([^*]+)\*\*/g, `<strong>$1</strong>`)
  input = input.replaceAll(/\s__([^_]+)__/g, `<strong>$1</strong>`)
  // format italic font
  input = input.replaceAll(/\s\*([^*]+)\*/g, `<em>$1</em>`)
  input = input.replaceAll(/\s_([^_]+)_/g, `<em>$1</em>`)
  // format markdown code blocks
  while (input.includes('```')) {
    input = input.replace('```', '<div class="my-2 p-1.5 bg-slate-200 opacity-75 text-black rounded"><pre>')
    input = input.replace('```', '</pre></div>')
  }
  // format markdown inline code
  while (input.includes('`')) {
    input = input.replace('`', '<pre class="inline-block p-0.5 bg-slate-200 opacity-75 text-black font-bold">')
    input = input.replace('`', '</pre>')
  }
  // format headlines
  input = input.replaceAll(/\s######\s([^\n]*)\n/g, `<h6 class="font-bold">$1</h6>`)
  input = input.replaceAll(/\s#####\s([^\n]*)\n/g, `<h5 class="font-bold">$1</h5>`)
  input = input.replaceAll(/\s####\s([^\n]*)\n/g, `<h4 class="font-bold">$1</h4>`)
  input = input.replaceAll(/\s###\s([^\n]*)\n/g, `<h3 class="text-lg font-bold">$1</h3>`)
  input = input.replaceAll(/\s##\s([^\n]*)\n/g, `<h2 class="text-xl font-bold">$1</h2>`)
  input = input.replaceAll(/\s#\s([^\n]*)\n/g, `<h1 class="text-2xl font-bold">$1</h1>`)
  // format links
  input = input.replaceAll(/\s\[([^\]]+)\]\(([^)]+)\)/g, `<a href="$2" class="hover:text-slate-300">$1</a>`)
  // format newlines -> br
  input = input.replaceAll('\n', `<br>`)
  return input
}
You can test the outcome by running some prompts against my website. If it runs out of credits, let me know (last time they expired without even being used up).
There is room for improvement. For example, lists are not rendered as proper HTML lists, but for its purpose it is already good enough.
If you have questions or suggestions, I'll be happy to address them in the comments section! Or open an issue in the GitHub repo.