It was 13 June 2023. I was scrolling through Hacker News on a Tuesday morning when I saw the headline: OpenAI announces Functions API. I groaned. Not because it was bad news—it was great news. I groaned because I’d spent the last week of May doing exactly what they’d just productized, except my version was held together with prompt engineering and a string called {CODE_SECTION}.
Lately, taking stock of my knowledge and professional experience, rewinding through my path as a researcher and curious user of AI, I find myself reminiscing. I had sort of stumbled onto the thing that became the foundation of today’s agentic coding tools: function calling.
Let me back up.
Hackweek Energy and the text-davinci-003 Era
May 2023. Company hackweek. The company I worked for at the time would occasionally dedicate specific weeks to what they called hack weeks—weeks where people were free to work on any project that might help the company, but wasn't necessarily tied to their day-to-day tasks or to anything a director had put on the company-wide goals.
It was a mixed bag. A lot of colleagues were working on different projects with different angles. Some were cleaning up old code or taking care of long-overdue tasks that day-to-day life never allowed. Visionaries were mocking up brand new ideas for pages or features. And some, like myself, were exploring the nascent field of AI and what it could actually bring.
I was thinking about how AI is basically text in, text out. You call an API, but instead of handing it a properly formed JSON payload or some other structured format, you can just be more relaxed about it. You feed it a bunch of text, and the API returns a completion of what you just wrote. Prompt engineering becomes a thing. That line of thinking is what led me to the GitHub repo.
The repo—hackweek-ai-experiment—was created on the 11th, and by the 12th I was already knee-deep in OpenAI’s completions API. This was the text-davinci-003 era, before chat completions felt obvious, before everyone had a chatbot side project, before the word “agent” got beaten into meaninglessness. GPT-4 had only recently dropped, and most of us were still figuring out how to talk to these things without them going off the rails.
Our office had that specific buzz that comes from five days of no meetings and permission to build anything. People were wiring up LangChain pipelines, generating images, summarizing Slack threads. I wanted to build an email assistant. The original idea was to genuinely test what the AI could do with real email workflows—reading messages, understanding context, maybe even drafting replies. I was curious about the user flow: what if you could just tell an assistant "I care about hiking and tech conferences" and it would quietly scan your inbox, surface the relevant stuff, and present it in a way that felt natural? No more keyword search. No more drowning in newsletters you meant to read. Just a conversation, and then the right emails appear.
That was the dream. But hackweek demos demand a video and a live presentation, and by Thursday I realized I wasn't going to get real Gmail integration working in time. OAuth scopes, API quotas, privacy concerns—there was no way. So I did what every desperate presenter does: I faked it. The "emails" the assistant referenced were generated on the fly. The inbox was imaginary. What I ended up demoing was a proof-of-concept of a proof-of-concept, and the repo reflects that last-minute scramble. But the underlying user value—an assistant that understands your interests and acts on them—was real, and it's still the angle I think about when I look at AI productivity tools today.
The problem was obvious: text-davinci-003 did not have a structured output mode. It generated text. Beautiful, hallucinated, confident text. If you wanted JSON out of it, you had to ask nicely and hope for the best. This was months before JSON mode, before response formats, before any of the guardrails we take for granted now. You sent a prompt, you got a completion, and you prayed.
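For anyone who never used it: the whole interface was one prompt string out, one completion string back. Here's a minimal sketch of what a call looked like—the parameters are from memory, not from the repo, and generatePrompt is the helper shown in the next section:

// Legacy completions call, circa May 2023. text-davinci-003 took a raw
// prompt and returned a raw text continuation: no roles, no JSON mode.
const response = await fetch('https://api.openai.com/v1/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: 'text-davinci-003',
    prompt: generatePrompt('hiking and tech conferences'),
    max_tokens: 512,
    temperature: 0.7,
  }),
});
const data = await response.json();
const gpt = data.choices[0].text; // one big string: prose, JSON, and all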
The Hack
Here was my beautiful, terrible solution. I would prompt the model to act as my personalized email assistant, write a natural response, and then—here’s the critical part—append the exact string {CODE_SECTION} followed by JSON encoding my perceived interests.
This is the actual prompt I committed:
const generatePrompt = (input) => {
  return `I would like you to reply to this message as if you were my personalized email assistant. You can pretend to lookup emails in my inbox and notify me about them if they relate to my interests. My interests are ${input}. Please end your message with {CODE_SECTION} and afterwards encode JSON format with my perceived interests`;
}
Read that again. “Pretend to lookup emails.” “Encode JSON format with my perceived interests.” It’s barely English. It’s a suggestion wrapped in a wish wrapped in a template literal. But it worked, mostly. Sometimes.
On the frontend, I wrote a parser that split the raw completion on {CODE_SECTION} and treated everything before as human-readable text and everything after as machine-readable code:
function parseMessage(inputString) {
  const separator = '{CODE_SECTION}';
  const sections = inputString.split(separator);
  const humanReadable = sections[0].trim();
  const machineCode = sections[1] ? JSON.parse(sections[1].trim()) : '';
  return { humanReadable, machineCode };
}
And then I used it:
const { humanReadable = "I couldn't understand", machineCode = { interests: ["failed to read"] } } = parseMessage(gpt);
setInterests(machineCode.interests);
setMessages((prevMessages) => [...prevMessages, humanReadable]);
Default values: “I couldn’t understand” and ["failed to read"]. Even my error states were apologetic. The whole thing was a handshake between a language model that didn’t know it was being structured and a JSON.parse call that would happily explode if the model got chatty after the delimiter. No schema validation. No retries. No type safety. Just vibes and a prayer.
The UI Told the Truth
The default message in the interface was: “Hi I’m your email assistant, please type your fields of interests, but be patient as this does take a while to load... (PS: no loading indicator)”
That tells you everything about the energy of this project. I wasn’t pretending it was polished. I wasn’t adding skeleton screens or optimistic updates. It was raw, it was slow, and I wanted the user to know that the assistant was “thinking” because the OpenAI API was cooking. There was a strange honesty to it that I kind of miss. Modern AI apps hide latency behind elaborate illusions. Back then, I just admitted there was no loading indicator and called it a day.
I demoed it at the end of hackweek. People laughed at the {CODE_SECTION} thing. They were right to laugh. It was absurd. But it also worked. The assistant would say something like “I found an email about hiking in your inbox, since you mentioned you like outdoor activities!” and then {CODE_SECTION} would follow with {"interests": ["hiking", "outdoor activities"]}. The UI would display the message and render little tags below it. It felt like magic, if you squinted.
The model only occasionally violated the contract. Sometimes it wrapped the JSON in markdown backticks. Sometimes it forgot the delimiter entirely and just kept writing prose. Sometimes the JSON was malformed because the model decided to end with a trailing comma. Each of these failures taught me exactly why you shouldn’t parse unstructured language model output with .split() and JSON.parse. But when it worked, the illusion was complete.
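For what it's worth, surviving those three failure modes doesn't take much more code. This is a sketch of the defensive version I'd write today—it's not in the repo, and parseMessageDefensively is a name I'm making up now:

// Hypothetical hardened parser; not in the repo. Handles the three
// failure modes above: markdown fences, trailing commas, missing JSON.
function parseMessageDefensively(inputString) {
  const separator = '{CODE_SECTION}';
  const [human, ...rest] = inputString.split(separator);
  let machinePart = rest.join(separator).trim();
  // Strip markdown code fences the model sometimes wrapped the JSON in.
  machinePart = machinePart.replace(/^```(?:json)?\s*/i, '').replace(/\s*```$/, '');
  // Remove trailing commas before a closing brace or bracket.
  machinePart = machinePart.replace(/,\s*([}\]])/g, '$1');
  let machineCode = null;
  try {
    machineCode = machinePart ? JSON.parse(machinePart) : null;
  } catch {
    machineCode = null; // model kept talking after the JSON, or never emitted it
  }
  return { humanReadable: human.trim(), machineCode };
}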
A Month Later, the Real Thing
June rolled around and OpenAI announced the Functions API. Suddenly everyone could define JSON schemas, force the model to return structured parameters, and build real integrations. My hackweek repo sat there on GitHub, already archaic. I didn’t even archive it—I just left it, a fossil from the month before tool calling became a first-class concept.
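If you never saw the original announcement, this is roughly the shape it had in June 2023. The record_interests schema below is my hack reimagined as a proper function definition, not anything OpenAI shipped, and the functions parameter has since been superseded by tools:

// Sketch of the June 2023 function-calling shape.
const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: 'gpt-3.5-turbo-0613',
    messages: [{ role: 'user', content: 'I care about hiking and tech conferences' }],
    functions: [{
      name: 'record_interests',
      description: 'Store the interests the user has expressed',
      parameters: {
        type: 'object',
        properties: {
          interests: { type: 'array', items: { type: 'string' } },
        },
        required: ['interests'],
      },
    }],
  }),
});
const data = await response.json();
// If the model chose to call the function, arguments arrives as a JSON
// *string*. You still parse it, but now against a schema you declared.
const functionCall = data.choices[0].message.function_call;
const { interests } = functionCall ? JSON.parse(functionCall.arguments) : { interests: [] };

No delimiter, no prayer—just a schema and a parser that knows what's coming.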
I felt a weird mix of emotions. Pride, definitely. I’d identified the exact same pattern—model talks to human, model returns structured data for machine—that a team of researchers had productized. Embarrassment, too, because my implementation was so obviously rickety in comparison. But mostly I felt how fast the ground was shifting. The gap between a hackweek prototype and a production API was thirty days. Thirty days.
What I built was never going to ship. It needed real email access, real auth, real error handling, real everything. But the core idea—that a language model should speak to humans and return structured data for machines in the same interaction—was sound. So sound that OpenAI built its next major API primitive around it.
Lessons from a Month of Obsolescence
I realized that prompt engineering was basically interface design—except the interface was English. That {CODE_SECTION} delimiter was a contract. A bad one, poorly enforced, but a contract. When I saw the Functions API, I recognized the architecture immediately: system messages establishing constraints, a structured output format, a parser on the other side. I’d been doing the same thing with duct tape and template literals.
I also learned that prototypes don’t have to be pretty to be directionally correct. The demo worked because the human-readable half was warm and the machine-readable half was useful. I still think about that split—human layer, machine layer. The agents I find useful today do exactly that: they talk to me, then they go do something structured in the background, and they tell me what happened.
The jankiness taught me something too. JSON.parse would throw if the model added a markdown code block around the JSON. It would throw if the model forgot the delimiter. It would throw if the model decided to keep talking after the JSON. Every failure mode was a lesson in why schema enforcement matters. When OpenAI shipped functions, they weren’t just adding a feature. They were solving the exact instability I’d been wrestling with in a Next.js app a month earlier.
The Repo Still Exists
The repo is still there—hackweek-ai-experiment—frozen in May 2023. Next.js, text-davinci-003, no loading indicator. I haven’t touched it. I don’t need to. It served its purpose, which was mostly to convince me that I could see around corners, even when my implementation was held together by string splitting and optimism.
Sometimes I think about what would have happened if I'd had another month. Nothing, probably. I wasn't going to build a startup out of a hackweek demo. But there's a specific kind of joy in feeling the shape of the future before the tools exist to build it cleanly. That's what hackweeks are for. That's why we prototype.
Three years later, the gap looks absurd. text-davinci-003 was a completion engine. I was feeding it template literals and praying it wouldn't hallucinate a trailing comma. Now Anthropic's Mythos runs multi-step cyberattack simulations end-to-end, and OpenAI's GPT-5.5-Cyber makes my string-splitting parser look like a toy. The conversation shifted. It used to be: how do we make the model return JSON? Now it's: how do we stop it from doing things we didn't explicitly ask for? I don't know if that's progress or just a different flavor of anxiety. But I do miss the honesty of a UI that just admitted there was no loading indicator.
I still check if my duct-tape intuition has an API now. It usually does.