DEV Community: Margaret Kashuba

HubSpot + OpenAI integration patterns: webhooks, properties, and the failure modes nobody tells you about

Margaret Kashuba — Tue, 02 Jun 2026 16:20:43 +0000

HubSpot's native AI features are a starter kit. They are fine for "summarize this email" and "draft a follow-up". The moment you want production AI behaviour inside a HubSpot workflow — multi-step reasoning, custom retrieval, write-back to structured properties, with proper retries and audit logging — you are off the marketing brochure and into engineering territory. This post is what I wish I had read before we shipped our first HubSpot + OpenAI integration for a B2B SaaS RevOps team.

If you want the broader business case, AltheraCode (the studio I work with) has a published case study on this. I'm going to skip the business case here. This is engineer-to-engineer.

The four surfaces where AI plugs into a HubSpot stack

You have exactly four real integration surfaces. Everything you read about HubSpot AI patterns reduces to one of these.

Workflow custom code actions. Node.js, runs serverless inside HubSpot, 20-second hard timeout, 128 MB memory cap. The most common path. Good for synchronous enrichment that fits in 20 seconds.
Webhooks out of HubSpot. Workflow → POST to your service → your service does whatever → writes back via HubSpot CRM API. The right choice when you need more than 20 seconds, more than 128 MB, or anything that needs queueing.
The CRM API directly, polled or event-driven via app subscriptions. You operate outside HubSpot entirely and treat HubSpot as a database with REST endpoints. Best for high-volume bulk operations.
Conversations API and timeline events. Underused. Lets you write AI-generated content into the contact or deal timeline without touching structured properties. Excellent for "summaries" that should be auditable but not searchable.

Pick the surface that fits the SLA you need, not the one that feels familiar.

Pattern 1: lead-intent enrichment as a workflow custom code action

Most common pattern. Contact enters a workflow on some engagement threshold; you call OpenAI; you write a summary property back. Looks like this:

exports.main = async (event, callback) => {
  const hubspot = require('@hubspot/api-client');
  const OpenAI = require('openai');

  const client = new hubspot.Client({ accessToken: process.env.HS_PRIVATE_APP_TOKEN });
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const contactId = event.object.objectId;

  const contact = await client.crm.contacts.basicApi.getById(
    contactId,
    ['email', 'company', 'jobtitle', 'linkedin_url', 'last_seen_url']
  );

  const prompt = `You write a single 4-sentence pre-call summary for an account executive.
Rules:
- Do NOT describe the company in marketing language.
- Reference exactly one specific, dated fact from the company's public footprint.
- Note the prospect's likely role in a buying committee.
- End with one concrete question the AE should ask first.

Inputs:
${JSON.stringify(contact.properties)}`;

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    temperature: 0.2,
    max_tokens: 200,
  });

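  // message.content can be null or empty; fail the action rather than write a blank summary.
  const summary = (completion.choices[0].message.content || '').trim();
  if (!summary) throw new Error('OpenAI returned an empty summary');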

  await client.crm.contacts.basicApi.update(contactId, {
    properties: { ai_pre_call_summary: summary }
  });

  callback({ outputFields: { summary } });
};

Three notes that cost me time.

You must add the @hubspot/api-client and openai packages explicitly in the custom code action's package list. The HubSpot UI does this for you only if you discover the right dropdown. I missed it for two hours.

event.object.objectId is reliable. event.inputFields is not — workflow inputs that look populated in the UI sometimes arrive as undefined, especially after a property rename. Always re-fetch from the CRM API, do not trust the inputs.

Use temperature: 0.2 or lower for any structured writeback. At 0.7 you will get cute summaries with one outlier sentence per week that breaks downstream parsing.

Pattern 2: AI-summarized deal notes with property writeback

A call lands in Gong (or Fathom, or Granola — same shape). You want the prospect's stated objection, timeline, and decision criteria written into structured properties on the HubSpot deal.

The naive version writes a free-text summary into a single Note property. The production version uses three separate custom properties and a function-calling response so you can actually run pipeline analytics on them later.

const tool = {
  type: 'function',
  function: {
    name: 'record_deal_signals',
    description: 'Record structured signals from a sales call.',
    parameters: {
      type: 'object',
      properties: {
        stated_objection: { type: 'string' },
        stated_timeline: {
          type: 'string',
          enum: ['no_timeline', '0_30_days', '30_90_days', '90_plus_days']
        },
        decision_criteria: {
          type: 'array',
          items: { type: 'string' }
        }
      },
      required: ['stated_timeline', 'decision_criteria']
    }
  }
};

const completion = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'You extract structured signals from sales call transcripts. Do not paraphrase. Quote.' },
    { role: 'user', content: transcript }
  ],
  tools: [tool],
  tool_choice: { type: 'function', function: { name: 'record_deal_signals' } },
  temperature: 0.1
});

const signals = JSON.parse(completion.choices[0].message.tool_calls[0].function.arguments);

await client.crm.deals.basicApi.update(dealId, {
  properties: {
    stated_objection: signals.stated_objection || '',
    stated_timeline: signals.stated_timeline,
    decision_criteria: (signals.decision_criteria || []).join(' | ')
  }
});

Two production lessons here.

Force tool_choice to a specific function rather than auto. Auto will occasionally return a chat message with no tool call when the model decides the input does not warrant one. You cannot tolerate that flakiness inside a workflow.

For the decision_criteria array, store as a pipe-delimited string in a single HubSpot property rather than trying to model a multi-select. HubSpot multi-select properties accept only predefined option values, which makes them brittle for free-text criteria quoted from a transcript.
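Even with tool_choice forced, I'd still validate the response before it touches the deal. Below is a minimal sketch of that check, assuming the record_deal_signals schema shown above. The helper name parseDealSignals and the choice to replace stray pipes with slashes are my own illustration, not something the SDK provides.

```javascript
// Allowed values mirror the enum in the record_deal_signals tool schema.
const TIMELINES = ['no_timeline', '0_30_days', '30_90_days', '90_plus_days'];

// Hypothetical helper: turns a chat completion message into HubSpot deal
// properties, or throws so the caller can retry or skip the writeback.
function parseDealSignals(message) {
  const call = message && message.tool_calls && message.tool_calls[0];
  if (!call || call.function.name !== 'record_deal_signals') {
    throw new Error('Model returned no record_deal_signals tool call');
  }

  let args;
  try {
    args = JSON.parse(call.function.arguments);
  } catch (err) {
    // Truncated output (for example, hitting a token limit) yields invalid JSON.
    throw new Error('Tool call arguments were not valid JSON');
  }

  if (!TIMELINES.includes(args.stated_timeline)) {
    throw new Error(`Unexpected stated_timeline: ${args.stated_timeline}`);
  }
  const criteria = Array.isArray(args.decision_criteria) ? args.decision_criteria : [];

  return {
    stated_objection: typeof args.stated_objection === 'string' ? args.stated_objection : '',
    stated_timeline: args.stated_timeline,
    // Remove the delimiter from individual items so the joined string splits cleanly later.
    decision_criteria: criteria.map((c) => String(c).replace(/\|/g, '/').trim()).join(' | ')
  };
}
```

The return value can be passed straight in as the properties object of the deal update. Throwing on a bad shape means the workflow sees a failed action instead of a half-written deal.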

Pattern 3: timeline events instead of properties

People reach for properties because they are the obvious unit. For AI-generated content that is interesting context but not data you'll filter pipeline reports on, write to the timeline instead.

await client.crm.timeline.eventsApi.create({
  eventTemplateId: 'YOUR_TEMPLATE_ID',
  email: contact.email,
  tokens: {
    summary: aiGeneratedNote,
    source: 'gong_call_2026_05_28',
    confidence: 0.92
  }
});

Timeline events are visible on the record, sortable by time, and don't pollute your property list. We use timeline events for "research summaries", "previous-call recap on next-call open", and "agent-flagged risk signals". Anything that's "context for a human" goes here. Anything you'll run a report on goes into a property.

Failure modes, by frequency

In rough order of how often each one bit me:

Workflow custom code 20-second timeout. If your AI call plus enrichment plus writeback cannot finish in 20 seconds, you must move to the second integration surface (webhooks out of HubSpot). I have seen people split a call across two custom code actions to "fit". Do not do this. The state management between them is a nightmare. Bite the bullet and externalise.
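While you are still inside a custom code action, it helps to track the time budget explicitly so a slow model call fails fast instead of being killed mid-writeback. Here is a rough sketch. The createBudget helper and the 4-second reserve are assumptions of mine; tune the reserve to how long your writeback actually takes.

```javascript
// HubSpot stops a custom code action at 20 seconds.
const HARD_LIMIT_MS = 20000;

// Hypothetical helper. `reserveMs` is time held back for the CRM writeback
// and the callback; `now` is injectable so the logic can be tested.
function createBudget({ reserveMs = 4000, now = Date.now } = {}) {
  const startedAt = now();
  return {
    // Milliseconds still available for the next upstream call.
    remaining() {
      return Math.max(0, HARD_LIMIT_MS - reserveMs - (now() - startedAt));
    },
    // True once there is not enough time left to attempt a call needing `needMs`.
    exhausted(needMs = 0) {
      return this.remaining() <= needMs;
    }
  };
}
```

Create the budget on the first line of exports.main, then pass { timeout: budget.remaining(), maxRetries: 0 } as the second argument to openai.chat.completions.create so the SDK gives up before HubSpot does. If remaining() is routinely close to zero, that is the signal to externalise.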

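HubSpot rate limits. The burst limit is on the order of 100 requests per 10 seconds for the v3 API, and the exact figure depends on your app type and subscription tier, so check what your portal actually gets. If your AI agent is enriching new contacts in bulk, you will hit this. Implement a leaky-bucket queue in front of your writes.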
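For reference, a leaky bucket in front of the writes can be quite small. This is a sketch under assumptions, not what we run in production: the defaults encode 100 requests per 10 seconds, and createLeakyBucket and drain are illustrative names. It is also in-memory, so it only works when a single process owns the writes.

```javascript
// Hypothetical leaky-bucket limiter. Set capacity and windowMs to your
// portal's actual limit; the defaults are only an example.
function createLeakyBucket({ capacity = 100, windowMs = 10000, now = Date.now } = {}) {
  let level = 0;        // requests currently counted against the window
  let lastLeak = now();

  // Drain the bucket in proportion to the time elapsed since the last check.
  function leak() {
    const t = now();
    level = Math.max(0, level - ((t - lastLeak) * capacity) / windowMs);
    lastLeak = t;
  }

  return {
    // Returns 0 if a request may go out now (and counts it), otherwise the
    // number of milliseconds to wait before trying again.
    reserve() {
      leak();
      if (level + 1 <= capacity) {
        level += 1;
        return 0;
      }
      return Math.ceil(((level + 1 - capacity) * windowMs) / capacity);
    }
  };
}

// Runs queued async write functions one at a time, sleeping when the bucket is full.
async function drain(queue, bucket, sleep = (ms) => new Promise((r) => setTimeout(r, ms))) {
  while (queue.length > 0) {
    const waitMs = bucket.reserve();
    if (waitMs > 0) {
      await sleep(waitMs);
      continue;
    }
    await queue.shift()();
  }
}
```

Each queue entry is a function that performs one CRM write, for example () => client.crm.contacts.basicApi.update(id, body). With several workers you would need a shared store behind the bucket instead.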

Property rename detection. A marketing ops person renames Lead_Score_v2 to Lead Score (V2). Half your code breaks. Build a nightly job that fetches property metadata and diff-checks against your code's expected schema.
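The diff itself is a few lines once you have the property metadata. A sketch of the comparison step follows; the EXPECTED map and the diffSchema name are illustrative, and I'm assuming each metadata entry carries at least an internal name and a type.

```javascript
// Expected schema lives in your repo. Names and types here are illustrative.
const EXPECTED = {
  ai_pre_call_summary: 'string',
  stated_timeline: 'enumeration',
  lead_score_v2: 'number'
};

// `actual` is the array of property definitions fetched from the CRM,
// where each entry has at least `name` (internal name) and `type`.
function diffSchema(expected, actual) {
  const byName = new Map(actual.map((p) => [p.name, p]));
  const missing = [];
  const typeChanged = [];

  for (const [name, type] of Object.entries(expected)) {
    const found = byName.get(name);
    if (!found) {
      missing.push(name);
    } else if (found.type !== type) {
      typeChanged.push({ name, expected: type, actual: found.type });
    }
  }
  return { missing, typeChanged, ok: missing.length === 0 && typeChanged.length === 0 };
}
```

In the nightly job, the second argument would be the results array from client.crm.properties.coreApi.getAll('contacts'), and a non-ok result pages whoever owns the integration.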

Custom code action package versions. HubSpot pins package versions on the runtime side. openai@4.x works; openai@5.x does not at the time of writing. Pin explicitly.

Idempotency on retries. Workflows retry on failure. Your AI action must be idempotent — same input, same output property update — or you will have duplicate timeline events and AE confusion. Use a request-ID derived from the contact ID and a hash of the prompt inputs.
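One way to derive that request ID, sketched with Node's built-in crypto module. The key format and the stableStringify helper are my own choices here. The important detail is sorting object keys before hashing, so the same inputs in a different property order still produce the same ID.

```javascript
const { createHash } = require('node:crypto');

// Serializes with sorted keys so property order does not change the hash.
function stableStringify(value) {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value && typeof value === 'object') {
    const body = Object.keys(value).sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(value[k])}`)
      .join(',');
    return `{${body}}`;
  }
  return JSON.stringify(value === undefined ? null : value);
}

// Hypothetical key format: contact ID plus a truncated hash of the prompt inputs.
function requestId(contactId, inputs) {
  const hash = createHash('sha256').update(stableStringify(inputs)).digest('hex');
  return `${contactId}:${hash.slice(0, 16)}`;
}
```

Store the ID next to the written value (or in your side log) and skip the model call and the writeback when the same ID has already been processed.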

Notes that turn out to be portal-visible. HubSpot has a "private note" UI affordance and a "shared portal note" property — they are not the same thing. Read the docs carefully, and default to draft-mode for any AI writeback that touches a customer-shared object.

Logging and observability

The unfun part. Production AI inside HubSpot needs three things you will not have on day one:

Every model call logged to a side store (we use a simple Cloudflare D1 instance with request_id, contact_id, model, prompt_hash, response, latency_ms, cost_cents).
A prompt_version property written alongside every AI-generated value so you can A/B prompts in production without losing history.
A weekly cost-and-quality report. Cost is easy. Quality is a small spreadsheet where the sales ops lead grades 20 sampled outputs each week. You do not skip this.
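As a rough illustration of the first two items, here is how one log row might be assembled. The buildLogRecord helper and its argument names are hypothetical. Token prices are passed in by the caller in USD per million tokens because they change over time, and the usage object is assumed to have the prompt_tokens and completion_tokens fields that a chat completion returns.

```javascript
const { createHash } = require('node:crypto');

// Builds one row for the side store. Column names follow the list above,
// plus prompt_version. Prices are caller-supplied, not official rates.
function buildLogRecord(call) {
  const {
    requestId, contactId, model, promptVersion, prompt, response,
    usage, startedAt, finishedAt, usdPerMTokIn, usdPerMTokOut
  } = call;

  // Token counts times USD-per-million-token prices; scaled to cents below.
  const usdTimesMillion =
    usage.prompt_tokens * usdPerMTokIn + usage.completion_tokens * usdPerMTokOut;

  return {
    request_id: requestId,
    contact_id: contactId,
    model,
    prompt_version: promptVersion,
    prompt_hash: createHash('sha256').update(prompt).digest('hex'),
    response,
    latency_ms: finishedAt - startedAt,
    cost_cents: (usdTimesMillion * 100) / 1e6
  };
}
```

Hashing the prompt rather than storing it keeps contact data out of the log while still letting you group calls by identical input.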

Closing

Real HubSpot AI engineering is more workflow-mechanics than model-engineering. The model is the easy part. The retries, the property hygiene, the 20-second budget, the rate limits, the portal-visibility surprises — that is where the work lives.

If you ship a HubSpot AI integration this quarter, the thing that will save you a sprint is reading the workflow custom code action docs cover to cover before you write your first prompt.

I'd be curious how others are handling property-rename detection. We brute-force it with a daily diff. There must be a smarter way.

Slack AI assistant for engineering: RAG, pgvector, and the parts that broke

Margaret Kashuba — Fri, 29 May 2026 13:49:06 +0000

TL;DR — Built a Slack-native AI assistant that answers engineering questions from Confluence, GitHub wikis, Notion, and PDFs. RAG with pgvector + hybrid retrieval + permissions enforced at retrieval (not at the LLM). Two mistakes cost us about two weeks. Here is the field report so you don't repeat them.

Sometime around the third sprint of this project I stopped believing that "internal AI assistant" was a real product category and started believing it was an interface problem dressed up as an AI problem. I want to write down why, because most of the posts I see about AI assistants focus on the wrong layer.

The team I worked with — a small AI engineering studio called AltheraCode — builds software for the construction-engineering world. Not the sexy bit — not BIM, not generative design — the boring middle. Specifications. Standards. Internal rules nobody reads until they have to. The kind of knowledge that lives in three different Confluence spaces, four shared drives, and the head of a senior engineer who's been there nine years and is one bad week away from going on sabbatical.

The brief was: build something so that when a junior engineer has a question on a Thursday afternoon, they don't have to DM that senior engineer and feel guilty about it.

Sounds simple. It isn't.

The thing nobody tells you about engineering knowledge

Wikis don't fail because the search is bad. They fail because asking is easier than searching, and humans pick the easier path every single time. Once a team has more than maybe 25 engineers, the dominant pattern for "how do I do X" is to ask in a channel, not to type into a search bar. The whole internal-search-engine industry has been quietly losing this fight for fifteen years.

So when you ship an AI assistant, you are not competing with the wiki. You are competing with #eng-help. The assistant has to live where engineers already are, answer faster than a human can type a reply, and be wrong less often than the wiki is stale. That last bar is lower than you'd hope and a lot harder than it sounds.

We picked Slack as the surface and stuck with it. There was an early conversation about also building a web UI "for the cases where Slack feels wrong" — we killed that idea in week two and I'm glad we did. Two interfaces would have meant half the attention on either of them.

What we actually built

In rough strokes:

Slack Bolt bot running in the client's AWS account, surfaced as a slash command, an @mention in any channel, and a DM.
Ingestion pipeline pulling from Confluence, GitHub wiki, repo READMEs and /docs, Notion, and a curated set of S3-hosted PDFs (technical specs and internal standards).
Embedding store — pgvector inside the existing PostgreSQL deployment.
Retriever layer with hybrid search: dense embeddings + BM25 keyword scoring + a re-ranker + per-document permission filter keyed on the user's Slack identity.
LLM layer — OpenAI's GPT-4 generation at the time, strictly grounded on retrieved chunks, with an explicit "I don't know" fallback path.
Citation surface — every answer in Slack ships with inline source links and a thumbs-up/down so we can grade quality in production.
Audit log — every query, retrieval, answer, and feedback signal in PostgreSQL with a 90-day retention window.

The architecture is not novel. The novelty is in two choices.

pgvector instead of a dedicated vector DB

People love to argue about this. Here is the version I will defend: if your corpus is in the hundreds of thousands of chunks and you already run a healthy Postgres, pgvector is fine. Better than fine — it saves you a piece of infrastructure that has a permanent operational tax. Pinecone and friends are wonderful at hundreds of millions of chunks. You probably don't have hundreds of millions of chunks. I'd rather optimise pgvector for another six months than babysit a new managed service.

Permissions enforced at retrieval, not in the prompt

This is the one I will die on. You cannot tell an LLM "don't reveal X" and trust it. Models will leak things you asked them not to leak, especially when a clever user asks the right way. The only correct answer is to not put the restricted thing in the context window in the first place.

We mapped Slack identity → SSO identity → per-document ACL, applied as a row-level filter on pgvector before the LLM ever sees anything. It looks like this at the SQL level:

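SELECT chunk_text, document_id, dist, kw_rank
FROM (
  SELECT c.chunk_text, c.document_id,
         c.embedding <=> $1 AS dist,                               -- $1: query embedding
         ts_rank(c.tsv, plainto_tsquery('english', $3)) AS kw_rank -- assumes a tsvector column tsv; $3: query text
  FROM chunks c
  WHERE EXISTS (SELECT 1 FROM document_acl a
                WHERE a.document_id = c.document_id
                  AND a.principal_id = ANY($2::text[]))            -- user's SSO groups
) scored
ORDER BY (dist * 0.7) + (1 - kw_rank) * 0.3
LIMIT 20;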

It's more work upfront and it pays back forever.
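The SQL filter is the real enforcement point. If you also want a belt-and-braces check in application code before the prompt is assembled, it might look like the sketch below. The lookup maps and function names are hypothetical stand-ins for the Slack-to-SSO mapping and the document_acl table.

```javascript
// Resolves a Slack user to the principals (SSO ID plus groups) they act as.
// `slackToSso` and `ssoGroups` are hypothetical Maps loaded from your directory.
function resolvePrincipals(slackUserId, slackToSso, ssoGroups) {
  const ssoId = slackToSso.get(slackUserId);
  if (!ssoId) return []; // unknown user: no principals, so nothing is retrievable
  return [ssoId, ...(ssoGroups.get(ssoId) || [])];
}

// Keeps only chunks whose document grants access to at least one principal.
// Documents with no ACL entry are denied (fail closed).
function filterByAcl(chunks, aclByDocument, principals) {
  const allowed = new Set(principals);
  return chunks.filter((chunk) => {
    const grants = aclByDocument.get(chunk.document_id) || [];
    return grants.some((p) => allowed.has(p));
  });
}
```

The important property is failing closed: an unmapped Slack user or a document without ACL rows yields an empty context, which the "I don't know" path then handles.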

The bits we got wrong

There were two real mistakes. I'll spare you the small ones.

Ingestion pipeline was undersized. The first version re-embedded the entire corpus every night. That was fine at ten thousand chunks. It fell over around two hundred thousand. We rewrote the pipeline to hash content and only re-embed pages that actually changed, plus carved out a separate fast-path for new pages arriving between nightly runs. That cost us about ten days. We should have caught it in design review.
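The change-detection step is simple enough to sketch. This is an illustration rather than our actual pipeline: planReembedding and contentHash are made-up names, pages are assumed to arrive as { id, text } objects, and the whitespace normalization is one possible choice.

```javascript
const { createHash } = require('node:crypto');

// Normalizes line endings and trailing whitespace so cosmetic edits
// do not force a re-embed, then hashes the result.
function contentHash(text) {
  const normalized = text.replace(/\r\n/g, '\n').replace(/[ \t]+$/gm, '').trim();
  return createHash('sha256').update(normalized).digest('hex');
}

// `pages` is [{ id, text }]; `storedHashes` maps page id -> hash from the last run.
function planReembedding(pages, storedHashes) {
  const toEmbed = [];
  const nextHashes = new Map();
  for (const page of pages) {
    const hash = contentHash(page.text);
    nextHashes.set(page.id, hash);
    if (storedHashes.get(page.id) !== hash) toEmbed.push(page.id);
  }
  // Pages that disappeared from the source should have their chunks deleted.
  const toDelete = [...storedHashes.keys()].filter((id) => !nextHashes.has(id));
  return { toEmbed, toDelete, nextHashes };
}
```

Persist nextHashes only after the embeddings for toEmbed have been written, so a crashed run retries the same pages next time.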

Hallucination guardrail was too permissive. Our first prompt told the model to "answer based on the retrieved documents and acknowledge when you don't have enough information." Reasonable on paper. In practice the model was acknowledging far less than it should. Engineers were getting confident-sounding wrong answers and they noticed fast. Trust in the bot started slipping in the second week.

We rewrote the system prompt to require an explicit citation per factual claim, and to refuse rather than guess when retrieval returned nothing meaningful. Roughly:

You answer engineering questions using ONLY the provided context chunks.
Rules:
1. Every factual claim must cite a chunk by ID, in the form [doc-XX].
2. If the retrieved chunks do not contain a confident answer, say:
   "I don't have a confident answer for that — try #eng-help."
3. Never invent function names, endpoints, JIRA tickets, or process steps.
4. If the question is ambiguous, ask one clarifying question instead of guessing.

Context chunks:
{retrieved_chunks}

Question:
{user_question}

Refusal rate went up. Trust came back. The lesson here is that "I don't know" is a feature, and you should treat it like one.
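A prompt rule is only a request, so it is worth checking the output too. Below is a minimal post-check, assuming chunk IDs follow the [doc-XX] form from the prompt above. The checkAnswer name and the refusal-prefix match are my own illustration. It verifies that citations exist and point at chunks that were actually retrieved, not that every individual claim is cited.

```javascript
// Start of the refusal sentence from rule 2 of the system prompt.
const REFUSAL_PREFIX = "I don't have a confident answer";

// `retrievedIds` is the list of chunk IDs that were placed in the context.
function checkAnswer(answer, retrievedIds) {
  if (answer.startsWith(REFUSAL_PREFIX)) {
    return { ok: true, refused: true, unknown: [] };
  }

  const cited = [...answer.matchAll(/\[(doc-[A-Za-z0-9_-]+)\]/g)].map((m) => m[1]);
  const known = new Set(retrievedIds);
  const unknown = cited.filter((id) => !known.has(id));

  // No citations at all, or a citation to a chunk the model never saw, fails the check.
  return { ok: cited.length > 0 && unknown.length === 0, refused: false, unknown };
}
```

When ok is false, the bot can send the refusal message instead of the answer. Clarifying questions (rule 4) carry no citations, so they would need their own branch before this check.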

If I were starting again I'd skip those two mistakes and reclaim about two weeks of calendar.

What it does on a normal Tuesday

The assistant answers questions like:

"What's our naming convention for new microservices?"
"Who owns the billing pipeline on-call rotation?"
"What's the difference between the v3 and v4 telemetry schemas?"
"How do we handle multi-tenant tenancy isolation in the report generator?"
"Where do I file a vendor-onboarding request and who approves it?"

These are answers that used to take five minutes of a senior engineer's time, or two days of a junior engineer's polite-but-frustrated asking around. The assistant answers most of them in under three seconds, with sources cited.

It doesn't generate code. It doesn't editorialise. It doesn't try to give opinions about how the architecture should work. It retrieves and rephrases, with sources cited. That boundary is what makes it trustworthy. Every time we have been tempted to widen the scope — "what if it could draft the PR description too" — we have reminded ourselves that the moment it starts inventing things, it stops being a knowledge surface and becomes a thing engineers have to double-check, which is the opposite of what we set out to build.

A few things I'd take to the next project

The interface is the product. A great RAG pipeline behind a mediocre Slack experience is a mediocre product. Invest disproportionately in the conversation design.
Refuse by default. "I'm not confident" beats a confident wrong answer every single time. Engineers learn fast which tools they can trust, and they don't come back when they get burned.
Permissions at retrieval, not in the prompt. Saying it twice because it's the single most common mistake I see in AI assistant projects, and the failure mode ends up in a postmortem.
Hybrid retrieval, not pure semantic. Engineering questions are full of exact identifiers — function names, error codes, internal acronyms. Pure embedding search misses these. Add BM25 and a re-ranker on day one.
Ship the boring parts first. Audit logging, feedback collection, citation rendering — all the stuff that doesn't feel like "the AI part" — is what makes the system survive contact with a real team. Skip it and you'll be retrofitting it under deadline pressure six months later.
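On the hybrid retrieval point: if you would rather not hand-tune score weights, reciprocal rank fusion is a common way to merge a dense ranking and a keyword ranking. To be clear, this is a different technique from the weighted sum in the SQL earlier, shown as an alternative sketch. fuseRankings is an illustrative name, and k = 60 is a conventional default rather than a tuned value.

```javascript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank) per item,
// where rank starts at 1. Items that appear high in several lists score best.
function fuseRankings(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Feed it the chunk IDs from the embedding search and from the keyword search, each already permission-filtered, and pass the fused top results to the re-ranker. Because it works on ranks, it sidesteps the fact that distance and keyword scores live on different scales.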

Why I'm writing this

Internal AI assistants are at an awkward stage. Every B2B tech company is building one. Most of them are quietly underperforming. Almost nobody writes honestly about why. The result is a lot of LinkedIn posts about "transforming knowledge work with AI" and not enough postmortems about the part where retrieval permissions almost leaked a sensitive document.

If you're scoping a similar project, the most useful thing I can offer is the list of mistakes above. Skip them. Spend the saved weeks on the parts of your own domain that I don't know about.

I'd love to read your version of this post when you ship.