CodeKing

Posted on May 21

"My DingTalk Coding Bot Said It Started the Task. Then It Never Sent the Result"

#tutorial #webdev #node #ai

The most annoying mobile-agent failure is not a crash.

It is the fake success message.

You send a task from DingTalk. The bot replies:

Task accepted.

Then Claude Code or Codex actually runs for a while, finishes the work, and nothing comes back to the phone.

That is worse than an immediate error. It makes you think the agent is still working, when the real problem is that the result fell out of the delivery path.

The setup

I have been building CliGate, a local AI gateway for Claude Code, Codex CLI, Gemini CLI, dashboard chat, and mobile channels.

The mobile-channel idea is simple:

send a task from DingTalk
route it to Claude Code or Codex on my machine
keep the runtime session attached to that DingTalk conversation
send approvals, questions, progress, and final results back to the same chat

The first part worked.

DingTalk could trigger the runtime.

The broken part was the final callback.

The first bug: the reply was tied to the assistant run, not the runtime

The channel layer used to behave too much like this:

inbound message
  -> assistant run
  -> immediate assistant reply
  -> send message back to DingTalk

That sounds fine until the assistant delegates to a long-running runtime.

In that case, the useful result is not the immediate assistant text. The useful result is the runtime terminal event:

runtime completed
runtime failed
runtime asks a question
runtime asks for approval

My old logic only waited for the final runtime result in a narrow multi-session case. If one assistant run produced multiple runtime sessions, it would fan in and wait. But the common path is just one delegated runtime:

/cligate ask Claude Code to fix this bug

That produced one runtime session, so the channel often got the "started" message and missed the real result.

The fix was small but important:

function shouldDeferBackgroundCallback(result = null) {
  const sessionIds = Array.isArray(result?.assistantRun?.relatedRuntimeSessionIds)
    ? result.assistantRun.relatedRuntimeSessionIds.filter(Boolean)
    : [];
  return result?.assistantRun?.status === ASSISTANT_RUN_STATUS.WAITING_RUNTIME
    && sessionIds.length > 0;
}

The old mental model was:

only wait when there are multiple runtime sessions

The correct model is:

if the assistant delegated to any runtime, wait for that runtime result before treating the channel reply as complete

That change made single-session mobile tasks behave like real tasks instead of fire-and-forget acknowledgements.
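To make the difference concrete, here is a minimal sketch comparing the two rules on a single-session run. The status values and the `shouldDeferOld` function are assumptions reconstructed for illustration; only the shape of the new check comes from the snippet above.

```javascript
// Hypothetical status constants; the real values in CliGate may differ.
const ASSISTANT_RUN_STATUS = Object.freeze({
  COMPLETED: 'completed',
  WAITING_RUNTIME: 'waiting_runtime'
});

// Collect the non-empty runtime session ids attached to an assistant run.
function relatedSessionIds(result) {
  const ids = result?.assistantRun?.relatedRuntimeSessionIds;
  return Array.isArray(ids) ? ids.filter(Boolean) : [];
}

// Old rule (reconstructed for illustration): defer only when fanning in
// several runtime sessions.
function shouldDeferOld(result = null) {
  return relatedSessionIds(result).length > 1;
}

// New rule: defer whenever the run is waiting on at least one runtime.
function shouldDeferNew(result = null) {
  return result?.assistantRun?.status === ASSISTANT_RUN_STATUS.WAITING_RUNTIME
    && relatedSessionIds(result).length > 0;
}

// The common case: one delegated runtime session.
const singleSession = {
  assistantRun: {
    status: ASSISTANT_RUN_STATUS.WAITING_RUNTIME,
    relatedRuntimeSessionIds: ['rt_1']
  }
};

console.log(shouldDeferOld(singleSession)); // false
console.log(shouldDeferNew(singleSession)); // true
```

With the old rule the channel treats the "started" text as the final reply. With the new rule it holds the reply open until the runtime reports back.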

The second bug: DingTalk's session webhook can lie by omission

DingTalk gives you a sessionWebhook for replying inside the inbound interaction window.

So the obvious implementation is:

if sessionWebhook exists and has not expired:
  send through sessionWebhook
else:
  send through App API

That is what I started with.

The problem is that the timestamp is not the whole truth. A session webhook can still look fresh locally while DingTalk rejects it server-side because the session was consumed or closed.

So this code was too optimistic:

if (sessionWebhook && (!expiredAt || expiredAt > now + 15_000)) {
  for (const chunk of textChunks) {
    result = await this.sendViaSessionWebhook(sessionWebhook, chunk);
  }
  return result;
}

If that send failed, the whole delivery failed.

The fix was to treat the session webhook as the cheap first attempt, not the only attempt:

if (sessionWebhook && (!expiredAt || expiredAt > now + 15_000)) {
  try {
    for (const chunk of textChunks) {
      result = await this.sendViaSessionWebhook(sessionWebhook, chunk);
    }
    return result;
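  } catch (err) {
    // The webhook looked fresh locally but the send failed.
    // Fall through to the App API. Chunks that already went out through
    // the webhook will be sent again by the fallback loop.
  }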
}

Then the provider falls back to the DingTalk App API:

for (const chunk of textChunks) {
  result = await this.sendViaAppApi({
    conversationId: conversation?.externalConversationId,
    text: chunk,
    robotCode: channelContext.robotCode || '',
    conversationType: channelContext.conversationType || '',
    senderStaffId: channelContext.senderStaffId || ''
  });
}

That made delivery much more reliable.

The important lesson: a webhook expiry timestamp is not a delivery guarantee.
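Here is a self-contained sketch of the same idea with injected senders, so it can run without DingTalk. It is a variation rather than the CliGate code: `sendWithFallback` and its parameters are hypothetical names, and unlike the snippets above it resumes the fallback from the first undelivered chunk instead of resending everything.

```javascript
// Hypothetical helper: try the session webhook first, then fall back to the
// App API for whatever has not been delivered yet.
async function sendWithFallback({
  chunks,
  sessionWebhook,
  expiredAt,
  sendViaWebhook,
  sendViaAppApi,
  now = Date.now()
}) {
  let delivered = 0; // index of the next chunk to send
  let result = null;
  const looksFresh = Boolean(sessionWebhook)
    && (!expiredAt || expiredAt > now + 15_000);

  if (looksFresh) {
    try {
      for (; delivered < chunks.length; delivered += 1) {
        result = await sendViaWebhook(sessionWebhook, chunks[delivered]);
      }
      return { path: 'sessionWebhook', result };
    } catch (err) {
      // The timestamp looked fine locally but the send was rejected.
      // `delivered` still points at the chunk that failed.
    }
  }

  for (; delivered < chunks.length; delivered += 1) {
    result = await sendViaAppApi(chunks[delivered]);
  }
  return { path: 'appApi', result };
}

// Demo with stub senders: the webhook rejects the second chunk.
const sent = [];
sendWithFallback({
  chunks: ['part 1', 'part 2', 'part 3'],
  sessionWebhook: 'https://example.invalid/session-webhook',
  expiredAt: Date.now() + 60_000,
  sendViaWebhook: async (_url, text) => {
    if (text === 'part 2') throw new Error('session closed');
    sent.push(`webhook:${text}`);
  },
  sendViaAppApi: async (text) => {
    sent.push(`appApi:${text}`);
  }
}).then((outcome) => {
  // path is 'appApi'; part 1 went through the webhook, parts 2 and 3
  // through the App API.
  console.log(outcome.path, sent);
});
```

Returning the path that actually delivered the message also gives the caller something concrete to log.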

The third bug was hidden in the registry

This one was more subtle.

CliGate supports channel provider instances. The raw provider template in the registry is not the same thing as a started provider instance.

The started instance has settings:

clientId
clientSecret
robotCode
mode
runtime defaults

The raw template does not.

That matters because DingTalk App API fallback needs credentials:

const clientId = chooseSetting(this.settings, 'clientId', 'appKey');
const clientSecret = chooseSetting(this.settings, 'clientSecret', 'appSecret');

If the outbound delivery sender asks the raw registry for dingtalk, it may get a provider object with no settings. Then the session webhook fails, the App API fallback starts, and the fallback has no credentials.

So the channel manager now injects an instance-aware registry shim into both the dispatcher and the delivery sender:

const instanceAwareRegistry = {
  get: (providerId, instanceId) => this.getInstance(providerId, instanceId)
};

this.outboundDispatcher.registry = instanceAwareRegistry;
this.outboundDispatcher.deliverySender?.setRegistry?.(instanceAwareRegistry);

The second assignment is the one that matters.

It is easy to update the dispatcher and forget that the actual send path lives one object deeper.
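The template-versus-instance gap is easier to see in a toy model. The sketch below assumes a hypothetical `ChannelManager` and registry shape; none of these names are taken from CliGate beyond `getInstance` and the `get(providerId, instanceId)` shim shown above.

```javascript
// Hypothetical raw registry: holds provider templates with no settings.
const rawRegistry = {
  templates: new Map([['dingtalk', { id: 'dingtalk', settings: null }]]),
  get(providerId) {
    return this.templates.get(providerId) || null;
  }
};

// Hypothetical channel manager that owns started provider instances.
class ChannelManager {
  constructor(registry) {
    this.registry = registry;
    this.instances = new Map();
  }

  // Starting a provider copies the template and attaches real settings.
  start(providerId, instanceId, settings) {
    const template = this.registry.get(providerId);
    if (!template) throw new Error(`unknown provider: ${providerId}`);
    const instance = { ...template, instanceId, settings: { ...settings } };
    this.instances.set(`${providerId}:${instanceId}`, instance);
    return instance;
  }

  getInstance(providerId, instanceId) {
    return this.instances.get(`${providerId}:${instanceId}`) || null;
  }

  // Same get() shape as the raw registry, but resolves started instances.
  instanceAwareRegistry() {
    return {
      get: (providerId, instanceId) => this.getInstance(providerId, instanceId)
    };
  }
}

const manager = new ChannelManager(rawRegistry);
manager.start('dingtalk', 'default', {
  clientId: 'demo-id',
  clientSecret: 'demo-secret'
});

// The raw template has nothing the App API fallback can use.
console.log(rawRegistry.get('dingtalk').settings); // null

// The instance-aware lookup returns the credentials.
const provider = manager.instanceAwareRegistry().get('dingtalk', 'default');
console.log(provider.settings.clientId); // demo-id
```

Any sender that still holds the raw registry gets the first result, which is why the shim has to reach every object on the send path.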

Runtime events now drive outbound delivery

The architecture I trust more is event-based:

runtime event
  -> find channel conversations tracking that runtime session
  -> format event for the channel
  -> arbitrate whether to send now or suppress
  -> send through provider instance
  -> record delivery

The dispatcher listens to runtime session events:

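this.unsubscribe = this.runtimeSessionManager.eventBus.subscribeAll((event) => {
  // Swallow delivery errors here so a failed send does not surface
  // as an unhandled rejection inside the event bus callback.
  this.handleRuntimeEvent(event).catch(() => {});
});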

Then it finds conversations tracking that runtime:

const conversations = this.conversationStore.listByTrackedRuntimeSessionId(event.sessionId);

And sends through the delivery sender:

await this.deliverySender.send({
  conversation: latestConversation,
  channel: latestConversation.channel,
  sessionId: event.sessionId,
  eventSeq: event.seq,
  message: {
    text: formatted.fullText || formatted.text || '',
    buttons: formatted.buttons || [],
    session,
    event
  }
});

That is the boundary I wanted.

The assistant may start the work, but the runtime event owns the runtime result.
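Put together, the pipeline can be sketched end to end with in-memory stand-ins. Everything here is an assumption for illustration: the `EventBus`, the `OutboundDispatcher` constructor, the `idle()` helper, and the message format are simplified placeholders, not the CliGate implementation.

```javascript
// Minimal in-memory event bus; subscribeAll returns an unsubscribe function.
class EventBus {
  constructor() {
    this.listeners = new Set();
  }
  subscribeAll(listener) {
    this.listeners.add(listener);
    return () => this.listeners.delete(listener);
  }
  publish(event) {
    for (const listener of this.listeners) listener(event);
  }
}

// Hypothetical dispatcher: maps runtime events to tracked conversations.
class OutboundDispatcher {
  constructor({ eventBus, conversationStore, deliverySender, onError = console.error }) {
    this.conversationStore = conversationStore;
    this.deliverySender = deliverySender;
    this.pending = new Set();
    this.unsubscribe = eventBus.subscribeAll((event) => {
      const task = this.handleRuntimeEvent(event)
        .catch(onError)
        .finally(() => this.pending.delete(task));
      this.pending.add(task);
    });
  }

  async handleRuntimeEvent(event) {
    const conversations =
      this.conversationStore.listByTrackedRuntimeSessionId(event.sessionId);
    for (const conversation of conversations) {
      await this.deliverySender.send({
        conversation,
        sessionId: event.sessionId,
        eventSeq: event.seq,
        message: { text: `[${event.type}] ${event.text || ''}`.trim() }
      });
    }
  }

  // Resolves once every in-flight delivery has settled (handy in tests).
  async idle() {
    await Promise.all([...this.pending]);
  }
}

const bus = new EventBus();
const outbox = [];
const dispatcher = new OutboundDispatcher({
  eventBus: bus,
  conversationStore: {
    listByTrackedRuntimeSessionId: (sessionId) =>
      (sessionId === 'rt_1' ? [{ id: 'conv_1', channel: 'dingtalk' }] : [])
  },
  deliverySender: {
    send: async (delivery) => {
      outbox.push(delivery);
    }
  }
});

bus.publish({ sessionId: 'rt_1', seq: 7, type: 'completed', text: 'fixed the failing test' });
dispatcher.idle().then(() => console.log(outbox[0].message.text));
// [completed] fixed the failing test
```

The assistant run never appears in this path. The runtime event alone decides when the chat hears back.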

I added tests for the boring parts

The boring parts are where channel bugs usually hide.

There is a test for DingTalk falling back to the App API when the session webhook is unavailable:

assert.match(String(calls[0].url), /oauth2\/accessToken/);
assert.match(String(calls[1].url), /robot\/oToMessages\/batchSend/);
assert.deepEqual(calls[1].body.userIds, ['staff_123']);
assert.equal(calls[1].body.robotCode, 'robot_123');
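For context, a `calls` array like the one in these assertions can be captured by injecting a fake `fetch`. The sketch below is an assumption about how such a test might be wired, not the actual test harness: `createFetchRecorder` and `sendDirectMessage` are hypothetical, the base URL and header name are placeholders, and the request body is simplified. Only the two path fragments and the asserted fields come from the assertions above.

```javascript
// Hypothetical recorder: a fake fetch that logs each request and replays
// canned JSON responses in order.
function createFetchRecorder(responses) {
  const calls = [];
  const queue = [...responses];
  const fetchImpl = async (url, options = {}) => {
    calls.push({
      url: String(url),
      method: options.method || 'GET',
      body: options.body ? JSON.parse(options.body) : null
    });
    const payload = queue.shift() ?? {};
    return { ok: true, status: 200, json: async () => payload };
  };
  return { calls, fetchImpl };
}

// Hypothetical App API sender, reduced to the two requests the test checks:
// fetch an access token, then send a one-to-one message.
async function sendDirectMessage({
  fetchImpl, baseUrl, clientId, clientSecret, robotCode, staffId, text
}) {
  const tokenResponse = await fetchImpl(`${baseUrl}/oauth2/accessToken`, {
    method: 'POST',
    body: JSON.stringify({ appKey: clientId, appSecret: clientSecret })
  });
  const { accessToken } = await tokenResponse.json();
  const sendResponse = await fetchImpl(`${baseUrl}/robot/oToMessages/batchSend`, {
    method: 'POST',
    headers: { 'x-demo-access-token': accessToken }, // placeholder header name
    body: JSON.stringify({ robotCode, userIds: [staffId], text })
  });
  return sendResponse.json();
}

const { calls, fetchImpl } = createFetchRecorder([{ accessToken: 'token_demo' }, {}]);
sendDirectMessage({
  fetchImpl,
  baseUrl: 'https://example.invalid/v1.0',
  clientId: 'demo-id',
  clientSecret: 'demo-secret',
  robotCode: 'robot_123',
  staffId: 'staff_123',
  text: 'done'
}).then(() => {
  console.log(/oauth2\/accessToken/.test(calls[0].url)); // true
  console.log(calls[1].body.robotCode); // robot_123
});
```

Injecting the fetch function keeps the test independent of the network and makes the order of requests part of what is verified.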

There is also coverage for group conversation fallback:

assert.match(String(calls[1].url), /robot\/groupMessages\/send/);

And the delivery sender records sent and suppressed deliveries into the assistant event ledger, so debugging does not depend on guessing whether the provider was called.

That is what I want for mobile agents: not just "send a message", but an auditable delivery path.
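A rough sketch of what "auditable" can mean in code: every send attempt ends as exactly one ledger record. The `DeliverySender` below, its dedupe rule, and the reason strings are assumptions made up for this example, not the CliGate ledger format.

```javascript
// Hypothetical ledger-backed sender: each attempt is recorded as
// 'sent', 'suppressed' (with a reason), or 'failed' (with a reason).
class DeliverySender {
  constructor({ provider, ledger }) {
    this.provider = provider;
    this.ledger = ledger;
    this.seen = new Set(); // keys of events already delivered or in flight
  }

  async send({ conversationId, sessionId, eventSeq, text }) {
    const key = `${conversationId}:${sessionId}:${eventSeq}`;
    const base = { conversationId, sessionId, eventSeq };

    if (this.seen.has(key)) {
      return this.record({ ...base, status: 'suppressed', reason: 'duplicate_event' });
    }
    if (!text) {
      return this.record({ ...base, status: 'suppressed', reason: 'empty_text' });
    }

    this.seen.add(key);
    try {
      await this.provider.sendMessage({ conversationId, text });
      return this.record({ ...base, status: 'sent' });
    } catch (err) {
      this.seen.delete(key); // allow a later retry of the same event
      return this.record({ ...base, status: 'failed', reason: String(err.message || err) });
    }
  }

  record(entry) {
    this.ledger.push(entry);
    return entry;
  }
}

const ledger = [];
const sender = new DeliverySender({
  provider: { sendMessage: async () => {} },
  ledger
});
const event = { conversationId: 'conv_1', sessionId: 'rt_1', eventSeq: 7, text: 'done' };

sender.send(event)
  .then(() => sender.send(event))
  .then(() => console.log(ledger.map((entry) => entry.status)));
// first entry is 'sent', second is 'suppressed'
```

When a result never reaches the phone, the ledger then answers the first debugging question directly: was the provider called, skipped, or rejected?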

The workflow after the fix

The flow I wanted now looks like this:

DingTalk message comes in.
CliGate routes it to the assistant or direct runtime path.
Claude Code or Codex starts a runtime session.
The DingTalk thread tracks that runtime session.
Runtime terminal events trigger outbound delivery.
The DingTalk session webhook is tried first while it still looks fresh.
If it is stale or the send fails, the App API fallback delivers the result.

The user sees the thing that matters:

Claude Code: fixed the failing test and updated the route handler.

not just:

Task accepted.

What I learned

Mobile coding agents need stronger delivery semantics than chat demos.

It is not enough to prove that the bot can receive a message. It has to survive the whole lifecycle:

accepted
started
waiting for approval
waiting for user input
completed
failed
delivered
suppressed with a reason

And if the channel has multiple send paths, the code has to treat the first path as an optimization, not the truth.

For DingTalk, that meant:

do not trust sessionWebhook freshness too much
fall back to App API when webhook send fails
make sure the sender uses the started provider instance, not the raw provider template
wait for runtime results even when there is only one runtime session

That is not the flashy part of building an AI coding agent.

But it is the part that decides whether you can actually trust it from your phone.

If you want to inspect the implementation, the project is here:

CliGate on GitHub

I am curious how other people are handling mobile agent delivery. Do you send one "task accepted" message, or do you wire final runtime events back into the original chat thread?
