HelperX

Posted on Jun 26

How I Detect Already-Replied Comments Without an API: Reply Deduplication

#automation #javascript #architecture #node

The Reply to Comments module in HelperX does something that sounds simple: auto-respond to comments on your posts. But "respond to comments" hides a hard sub-problem — how do you know which comments you've already replied to?

Reply to the same comment twice and you look like a broken bot. Skip a comment you haven't answered and you've left a follower hanging. The module has to maintain, for every post, an accurate picture of which comments have been answered and which haven't — and it has to do it without a clean "have I replied?" API flag, because no such flag exists.

This article is about the deduplication system we built. It's a small piece of code with a surprisingly large number of edge cases.

Why this is harder than it looks

The naive mental model: "reply to comments, then mark them as done." That has three problems.

Problem 1: You can't trust a "replied" flag because there isn't one. X doesn't expose a clean boolean for "has the post author replied to this comment." You have to infer it.

Problem 2: Comments arrive out of order and incrementally. A post accumulates comments over hours and days. When the module runs, it sees the current comment set — including some it has already processed and some that are new. It must tell them apart.

Problem 3: Replies can disappear or fail. A reply we attempted might have been rate-limited, failed silently, or been deleted. The module must distinguish "we replied successfully" from "we tried to reply" from "we never tried."

Without solving all three, the module either double-replies (bad) or skips unanswered comments (also bad).

The data model

The core of the solution is a persistent record of every comment we've seen and every reply we've sent. Two tables (simplified):

CREATE TABLE seen_comments (
  slot_id      TEXT NOT NULL,
  post_id      TEXT NOT NULL,
  comment_id   TEXT NOT NULL,
  author_id    TEXT NOT NULL,
  first_seen   INTEGER NOT NULL,   -- epoch ms
  PRIMARY KEY (slot_id, comment_id)
);

CREATE TABLE sent_replies (
  slot_id      TEXT NOT NULL,
  comment_id   TEXT NOT NULL,
  reply_id     TEXT,               -- null until a reply is confirmed sent
  status       TEXT NOT NULL,      -- 'pending' | 'sent' | 'failed' | 'skipped' | 'deleted'
  sent_at      INTEGER,            -- epoch ms; null until sent
  PRIMARY KEY (slot_id, comment_id)
);

seen_comments records what we've encountered. sent_replies records what we've done about it. The two-table split is deliberate: seeing a comment and successfully replying to it are different events that can be far apart in time (a reply fails, retries an hour later, succeeds).
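To make the split concrete, here is a minimal in-memory stand-in for the two tables. It is a sketch for experimenting, not the HelperX storage layer: `createStore` and the Map-backed storage are assumptions made for illustration, and the method names simply mirror the `db.*` calls used in the algorithm later in this article.

```javascript
// Hypothetical in-memory stand-in for seen_comments and sent_replies.
// Keys mirror the composite primary key (slot_id, comment_id).
function createStore() {
  const seenComments = new Map();
  const sentReplies = new Map();
  const key = (slotId, commentId) => JSON.stringify([slotId, commentId]);

  return {
    getSeenComment: (slotId, commentId) =>
      seenComments.get(key(slotId, commentId)) ?? null,

    recordSeenComment(slotId, postId, comment, now = Date.now()) {
      const k = key(slotId, comment.id);
      // Append-only: never overwrite the original first_seen timestamp.
      if (!seenComments.has(k)) {
        seenComments.set(k, {
          slot_id: slotId,
          post_id: postId,
          comment_id: comment.id,
          author_id: comment.author_id,
          first_seen: now,
        });
      }
    },

    getSentReply: (slotId, commentId) =>
      sentReplies.get(key(slotId, commentId)) ?? null,

    recordSentReply(slotId, commentId, { status, replyId = null }, now = Date.now()) {
      // Upsert: one row per comment, so a later 'sent' replaces an earlier 'failed'.
      sentReplies.set(key(slotId, commentId), {
        slot_id: slotId,
        comment_id: commentId,
        reply_id: replyId,
        status,
        sent_at: status === 'sent' ? now : null,
      });
    },
  };
}
```

The methods here are synchronous, but they can still be called with `await`, so the store can stand in for `db` when trying out the algorithm locally.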

The deduplication algorithm

When the module runs against a post, here's the logic:

async function processComments(slotId, postId, comments) {
  for (const comment of comments) {
    // Step 1: have we seen this comment before?
    const seen = await db.getSeenComment(slotId, comment.id);
    if (!seen) {
      await db.recordSeenComment(slotId, postId, comment);
    }

    // Step 2: have we already acted on it?
    const priorAction = await db.getSentReply(slotId, comment.id);
    if (priorAction &&
        (priorAction.status === 'sent' || priorAction.status === 'skipped')) {
      continue; // already replied, or deliberately skipped: nothing to do
    }

    // Step 3: should we reply at all?
    if (!shouldReply(slotId, comment)) {
      await db.recordSentReply(slotId, comment.id, { status: 'skipped' });
      continue;
    }

    // Step 4: attempt the reply
    const result = await attemptReply(slotId, comment);
    await db.recordSentReply(slotId, comment.id, {
      status: result.ok ? 'sent' : 'failed',
      replyId: result.replyId,
    });

    // Step 5: only send ONE reply per module run, even if more qualify
    break;
  }
}

A few things in there are worth unpacking.

The states that matter

The combination of seen_comments and sent_replies gives every comment one of five effective states:

seen?	sent_replies.status	Meaning	Action
no	—	Brand new comment	Evaluate, maybe reply
yes	(none)	Seen before, no action recorded yet	Evaluate, maybe reply
yes	`sent`	Replied successfully	Skip
yes	`failed`	Tried to reply, it failed	Retry
yes	`skipped`	Evaluated, deliberately not replied	Skip

The failed → retry path is critical. Without it, a single rate-limit or transient error would permanently silence that comment. With it, the next module run picks up where the last one left off.

The skipped state prevents the module from re-evaluating a comment it already decided not to reply to — which matters because the "should we reply?" check isn't free.
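The table can be read as a small pure function. The sketch below is one way to encode it; `nextAction` is a hypothetical name, and it covers only the statuses listed in the table, so any other status throws.

```javascript
// Maps the (seen?, status) pair from the table to an action.
// `status` is the sent_replies.status value, or null when no row exists.
function nextAction(seen, status) {
  if (!seen) return 'evaluate';         // brand new comment
  if (status == null) return 'evaluate'; // seen before, no action recorded yet
  switch (status) {
    case 'sent':
    case 'skipped':
      return 'skip';
    case 'failed':
      return 'retry';
    default:
      throw new Error(`unknown status: ${status}`);
  }
}
```

Keeping this mapping in one place makes it harder for the main loop and the table to drift apart.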

What "should we reply?" actually checks

Not every comment deserves a reply. The module is configured to reply only to comments that meet criteria, and the criteria are where most of the edge cases live.

function shouldReply(slotId, comment) {
  // 1. Don't reply to yourself
  if (comment.author_id === getSlotUserId(slotId)) return false;

  // 2. Don't reply to your own prior replies in the thread
  if (isMyReply(slotId, comment)) return false;

  // 3. Skip if the post author has already answered this comment
  if (hasAuthorReply(getPost(slotId, comment.post_id), comment)) return false;

  // 4. Only reply to direct comments on the post, not nested replies
  if (comment.in_reply_to !== comment.post_id) return false;

  // 5. Respect the comment age window
  if (Date.now() - comment.created_at > MAX_COMMENT_AGE) return false;

  return true;
}

The most interesting check is #3: has the post author already replied? This is the one with no clean API. We infer it by walking the comment's reply subtree:

function hasAuthorReply(post, comment) {
  // The post author is the slot's user. We look at all replies to
  // this comment and check if any are from the post author (us).
  return comment.replies.some(
    reply => reply.author_id === post.author_id
  );
}

If we (the post author) have already manually replied to a comment, the module leaves it alone. The module only fills in for comments the operator hasn't gotten to — it's a safety net, not a replacement for genuine engagement.
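The hasAuthorReply function above inspects only the direct replies to a comment. If nested replies should count too, the same idea extends to the whole subtree. The following is a sketch under an assumed data shape (each reply may carry its own `replies` array); `hasAuthorReplyDeep` is a hypothetical name, and whether a deeply nested author reply should suppress the auto-reply is a policy choice rather than something this article settles.

```javascript
// Assumed shape: { author_id: string, replies?: Array<same shape> }
// Iterative depth-first walk over every reply beneath `comment`.
function hasAuthorReplyDeep(post, comment) {
  const stack = [...(comment.replies ?? [])];
  while (stack.length > 0) {
    const reply = stack.pop();
    if (reply.author_id === post.author_id) return true;
    stack.push(...(reply.replies ?? []));
  }
  return false;
}
```

An explicit stack avoids recursion-depth problems on very long threads.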

The double-reply problem, in detail

The single most important invariant: never reply to the same comment twice. Here's how each failure mode is prevented.

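Race condition (two module runs overlap): Two runs could both read "no sent reply" and both attempt. We prevent this with a database-level claim: before attempting, the run atomically inserts a sent_replies row with status pending. The second run's insert fails (primary key collision), so only one run proceeds. Note that a retry of a failed comment already has a row under the same primary key, so claiming it would need a conditional update from failed to pending rather than an insert.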

async function claimComment(slotId, commentId) {
  try {
    await db.insertSentReply(slotId, commentId, { status: 'pending' });
    return true; // we claimed it
  } catch (e) {
    if (e.code === 'SQLITE_CONSTRAINT_PRIMARYKEY') return false; // someone else has it
    throw e;
  }
}

The pending → sent/failed transition happens after the reply attempt. A run that crashes mid-attempt leaves a pending row, which a janitor process later marks as failed (if it's been pending too long), making it eligible for retry.
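A janitor pass of this kind can be sketched in a few lines. This is an illustration, not the HelperX implementation: it works on plain row objects, `sweepStalePending` is a hypothetical name, the 10-minute timeout is an assumed value, and it relies on a `claimed_at` timestamp that the simplified schema in this article does not show.

```javascript
// ASSUMPTION: each pending row carries a `claimed_at` epoch-ms timestamp.
const PENDING_TIMEOUT_MS = 10 * 60 * 1000; // assumed threshold

// Marks pending rows older than the timeout as failed and returns how many changed.
function sweepStalePending(rows, now = Date.now(), timeoutMs = PENDING_TIMEOUT_MS) {
  let swept = 0;
  for (const row of rows) {
    if (row.status === 'pending' && now - row.claimed_at > timeoutMs) {
      row.status = 'failed'; // eligible for retry on a later run
      swept += 1;
    }
  }
  return swept;
}
```

In a real database this would be a single UPDATE with a WHERE clause on status and age, so the sweep itself stays atomic.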

Reply succeeded but we crashed before recording it: Worst case — the reply is live on X, but our DB has no record. Next run, we'd reply again. We mitigate by recording the reply immediately on success, before any other work, and by checking the live thread on startup to reconcile any pending rows against reality.

The same comment arrives with a different ID: X occasionally changes comment IDs across API versions or pagination boundaries. We additionally key on (post_id, author_id, text_hash) as a fallback identity, so a comment that reappears with a new ID but same author and text is still recognized as seen.
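One way to build such a fallback identity is to hash a normalized form of the comment text. The sketch below uses Node's built-in crypto module; `fallbackKey` is a hypothetical name, and the normalization steps (Unicode NFC, trimming, collapsing whitespace) are assumptions, since the article does not say how text_hash is computed.

```javascript
const { createHash } = require('node:crypto');

// Builds a fallback identity from (post_id, author_id, text_hash).
// ASSUMPTION: the normalization below is illustrative, not the production rule.
function fallbackKey(comment) {
  const normalized = comment.text.normalize('NFC').trim().replace(/\s+/g, ' ');
  const textHash = createHash('sha256').update(normalized, 'utf8').digest('hex');
  return `${comment.post_id}:${comment.author_id}:${textHash}`;
}
```

One trade-off to keep in mind: if the same author posts identical text twice on the same post, both comments collapse into one identity, so the second one would be treated as already seen.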

Idempotency across module restarts

The module restarts frequently — server deploys, crashes, manual operator pauses. None of these should cause double-replies or missed comments. The design guarantees this because:

seen_comments is append-only and persistent. A restart doesn't lose track of what we've seen.
sent_replies is persistent and atomic. A restart doesn't lose track of what we've done.
The algorithm is stateless apart from the database — given the DB state and the current comment set, it deterministically produces the same actions regardless of when it runs.

This is the real payoff of the two-table model. The module has no in-memory state that matters; everything it needs to make a correct decision is in the database. Restart it mid-run, mid-comment, mid-reply — the next run picks up correctly because the DB is the source of truth.
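The restart-safety property is easy to check once the decision is written as a pure function of persisted state. The sketch below is a simplification under stated assumptions: `planRun` is a hypothetical name, it looks only at recorded statuses (the "should we reply?" criteria are left out), and it treats pending as "someone else is working on it".

```javascript
// Pure planner: given persisted statuses and the current comment IDs,
// return the one comment this run should act on, or null if none qualifies.
function planRun(statusByCommentId, commentIds) {
  for (const id of commentIds) {
    const status = statusByCommentId.get(id);
    if (status === 'sent' || status === 'skipped' || status === 'pending') continue;
    return id; // no row yet, or 'failed': act on this one
  }
  return null;
}
```

Calling it twice with the same inputs yields the same answer, which is exactly what a restart in the middle of a run needs: nothing in process memory influences the outcome.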

The reconciliation job

Once a day, a background job reconciles sent_replies against reality:

For each sent reply, verify the reply still exists on X. If it was deleted, mark it as deleted (the comment becomes eligible for a fresh reply if it's still unanswered).
For each pending reply older than 10 minutes, check whether a reply actually went out (search the thread for our reply). If yes, mark sent; if no, mark failed.
For each failed reply, decide whether to retry (transient error) or give up (permanent error like a blocked user).
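The three rules above can be sketched as a per-row decision function. Everything about the inputs is an assumption made for illustration: `reconcileRow` is a hypothetical name, `thread` is a pre-fetched snapshot of the live thread, `claimed_at` and `permanent_error` are fields the simplified schema does not show, and "giving up" is modeled here as moving the row to skipped.

```javascript
const PENDING_MAX_AGE_MS = 10 * 60 * 1000; // threshold from the rule above

// Returns the status a sent_replies row should have after reconciliation.
// ASSUMED thread shape: { replyIds: Set<string>, hasReplyFromUs: boolean }
function reconcileRow(row, thread, now = Date.now()) {
  switch (row.status) {
    case 'sent':
      // Our reply is gone from X: mark deleted so the comment can be reconsidered.
      return thread.replyIds.has(row.reply_id) ? 'sent' : 'deleted';
    case 'pending':
      if (now - row.claimed_at <= PENDING_MAX_AGE_MS) return 'pending'; // too fresh to judge
      return thread.hasReplyFromUs ? 'sent' : 'failed';
    case 'failed':
      // Permanent errors (for example a blocked user) are not worth retrying.
      return row.permanent_error ? 'skipped' : 'failed';
    default:
      return row.status;
  }
}
```

Separating the decision from the network calls keeps the job testable: the fetch layer builds the `thread` snapshot, and this function only maps evidence to a status.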

This job is what makes the system self-healing. Without it, pending rows left behind by crashed runs never resolve, failures are never sorted into retryable and permanent, and deleted replies go unnoticed, so the operator slowly stops getting replies to their comments. The reconciliation job keeps "failed" a temporary state rather than a permanent one.

What we learned

1. Separate "seen" from "acted." They're different events at different times. Conflating them forces you to choose between missing comments (marking seen as acted) and double-acting (marking acted as merely seen).

2. Treat the reply attempt as a state machine, not a boolean. pending → sent | failed captures reality better than "did we reply?" The pending state specifically is what prevents concurrent double-replies.

3. The database is the source of truth, not memory. Every design decision should survive a process restart. If it doesn't, you have a correctness bug waiting for the next deploy.

4. Infer "already replied" defensively. When you can't trust a flag, walk the data (here, the reply subtree) and infer. It's more code, but it's correct where the flag would be absent.

5. Reconcile against reality periodically. Local state drifts. A daily job that checks local claims against the live system catches drift before it becomes visible to the operator.

The deduplication system is unglamorous — it's the kind of code that's invisible when it works and catastrophic when it doesn't. But "this bot replied to me three times" is the fastest way to make an account look automated, and a single visible double-reply undoes weeks of careful persona work. Getting this right is the price of admission for any auto-reply feature.

HelperX powers Reply to Comments with a deduplication layer that never double-replies, retries failures, and reconciles against the live thread daily. Free 30-day trial.
