<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: CY Ong</title>
    <description>The latest articles on DEV Community by CY Ong (@cy_ong_591).</description>
    <link>https://dev.to/cy_ong_591</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862840%2F10c73be7-415c-455c-85a4-2869f3a28e69.png</url>
      <title>DEV Community: CY Ong</title>
      <link>https://dev.to/cy_ong_591</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cy_ong_591"/>
    <language>en</language>
    <item>
      <title>Mixed document packs need triage before they need smarter extraction</title>
      <dc:creator>CY Ong</dc:creator>
      <pubDate>Sun, 05 Apr 2026 22:10:36 +0000</pubDate>
      <link>https://dev.to/cy_ong_591/mixed-document-packs-need-triage-before-they-need-smarter-extraction-2h8i</link>
      <guid>https://dev.to/cy_ong_591/mixed-document-packs-need-triage-before-they-need-smarter-extraction-2h8i</guid>
      <description>&lt;p&gt;Most document pipelines are easier to build when you assume each upload is one self-contained document with one obvious role.&lt;/p&gt;

&lt;p&gt;That assumption breaks quickly in production.&lt;/p&gt;

&lt;p&gt;Real workflows often receive mixed packs: an invoice plus a receipt, a KYC form plus an ID, a claim form plus supporting pages, or a trade packet with primary and secondary documents mixed together. If all of that goes into one extraction path unchanged, downstream interpretation becomes much harder than it needs to be.&lt;/p&gt;

&lt;h2&gt;What broke&lt;/h2&gt;

&lt;p&gt;In practice, the failures did not look dramatic. They looked operational.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supporting pages were interpreted like primary pages.&lt;/li&gt;
&lt;li&gt;Partial packets were handled like complete submissions.&lt;/li&gt;
&lt;li&gt;Similar-looking fields competed across pages that served different roles.&lt;/li&gt;
&lt;li&gt;Reviewers spent time figuring out page purpose before they could judge extraction quality.&lt;/li&gt;
&lt;li&gt;Schema logic got more complicated because the intake stage had already thrown away too much context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why a lot of “extraction issues” are really intake-order issues.&lt;/p&gt;

&lt;h2&gt;A practical approach&lt;/h2&gt;

&lt;p&gt;If I were designing this from scratch, I would add a triage layer before deep extraction.&lt;/p&gt;

&lt;p&gt;That layer would do a few simple things well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Classify document and page type early.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preserve packet structure&lt;/strong&gt; so pages remain grouped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mark the likely anchor page&lt;/strong&gt; for the workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separate supporting pages from primary pages.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route mixed or unclear packets for light review&lt;/strong&gt; before full schema mapping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Carry page role into downstream extraction&lt;/strong&gt; so interpretation stays grounded.&lt;/li&gt;
&lt;/ul&gt;
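
&lt;p&gt;As a minimal sketch of what that layer could look like, assuming an upstream classifier has already assigned each page a type: the type names, the ROLE_BY_TYPE mapping, and the Page, Packet, and triage names below are all hypothetical and only illustrate the steps listed above.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical page types and the role each one plays in a packet.
ROLE_BY_TYPE = {
    "invoice": "primary",
    "claim_form": "primary",
    "kyc_form": "primary",
    "receipt": "supporting",
    "id_document": "supporting",
}


@dataclass
class Page:
    number: int
    page_type: str          # output of an upstream classifier (assumed)
    role: str = "unknown"   # filled in by triage


@dataclass
class Packet:
    packet_id: str
    pages: list = field(default_factory=list)   # pages stay grouped per packet
    anchor_page: Optional[int] = None
    route: str = "extract"


def triage(packet):
    """Label page roles, pick an anchor, and route unclear packets to review."""
    for page in packet.pages:
        page.role = ROLE_BY_TYPE.get(page.page_type, "unknown")

    primaries = [p for p in packet.pages if p.role == "primary"]
    has_unknown = any(p.role == "unknown" for p in packet.pages)

    # Exactly one primary page gives an unambiguous anchor.
    if len(primaries) == 1:
        packet.anchor_page = primaries[0].number

    # Anything else (no anchor, several anchors, unknown pages) gets light review.
    if len(primaries) != 1 or has_unknown:
        packet.route = "light_review"
    return packet
```

&lt;p&gt;The role labels travel with the packet, so a downstream extractor can read the anchor page and each page role instead of guessing them again.&lt;/p&gt;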

&lt;p&gt;This does not need to be perfect to be useful. Even a modest triage step can make later extraction and review noticeably easier to reason about.&lt;/p&gt;

&lt;h2&gt;Why this helps&lt;/h2&gt;

&lt;p&gt;There are three concrete benefits.&lt;/p&gt;

&lt;h3&gt;1) Extraction becomes more explainable&lt;/h3&gt;

&lt;p&gt;If the system knows which page anchors the case, field mapping becomes easier to interpret later.&lt;/p&gt;

&lt;h3&gt;2) Reviewer effort drops&lt;/h3&gt;

&lt;p&gt;A reviewer who can immediately see page role and packet structure spends less time reconstructing the case manually.&lt;/p&gt;

&lt;h3&gt;3) Schema logic becomes less brittle&lt;/h3&gt;

&lt;p&gt;Instead of one giant extraction path that tries to account for every possible page, you can keep interpretation scoped to more realistic document roles.&lt;/p&gt;

&lt;h2&gt;Tradeoffs&lt;/h2&gt;

&lt;p&gt;There are tradeoffs, of course.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You now have one more stage in the pipeline.&lt;/li&gt;
&lt;li&gt;Triage mistakes can still happen.&lt;/li&gt;
&lt;li&gt;You need to retain packet-level context rather than flatten everything into one request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in most mixed-pack workflows, those tradeoffs are cheaper than the long-term cost of forcing every page through the same logic.&lt;/p&gt;

&lt;h2&gt;Implementation notes&lt;/h2&gt;

&lt;p&gt;A lightweight implementation can start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;packet-level grouping&lt;/li&gt;
&lt;li&gt;page-type classification&lt;/li&gt;
&lt;li&gt;role labeling&lt;/li&gt;
&lt;li&gt;review routing for unclear packs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that would I invest in more complex extraction behavior.&lt;/p&gt;

&lt;p&gt;A common mistake is to push complexity into the extractor first. That often makes the output look smarter while leaving the workflow harder to trust.&lt;/p&gt;

&lt;h2&gt;How I’d evaluate this&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Can the system preserve packet structure?&lt;/li&gt;
&lt;li&gt;Does it distinguish primary from supporting pages?&lt;/li&gt;
&lt;li&gt;Can reviewers see page role quickly?&lt;/li&gt;
&lt;li&gt;Does triage reduce ambiguous field mapping?&lt;/li&gt;
&lt;li&gt;Is the downstream schema easier to reason about after the change?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A lot of document systems become more reliable not because the extraction layer became more powerful, but because the intake path became more disciplined.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>automation</category>
    </item>
    <item>
      <title>Provenance is more useful than people think in document workflows</title>
      <dc:creator>CY Ong</dc:creator>
      <pubDate>Sun, 05 Apr 2026 22:10:22 +0000</pubDate>
      <link>https://dev.to/cy_ong_591/provenance-is-more-useful-than-people-think-in-document-workflows-5egj</link>
      <guid>https://dev.to/cy_ong_591/provenance-is-more-useful-than-people-think-in-document-workflows-5egj</guid>
      <description>&lt;p&gt;Teams often talk about provenance as if it were a reporting feature.&lt;/p&gt;

&lt;p&gt;In production document workflows, it is much more useful than that. Provenance becomes the thing that helps a reviewer understand a case, helps operations explain what happened, and helps engineering investigate why a workflow behaved the way it did.&lt;/p&gt;

&lt;p&gt;That is a workflow capability, not just a record-keeping habit.&lt;/p&gt;

&lt;h2&gt;What broke&lt;/h2&gt;

&lt;p&gt;The failure pattern is familiar:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A revised file appears and gets processed again.&lt;/li&gt;
&lt;li&gt;A field is questioned later, but the reviewer cannot easily see where it came from.&lt;/li&gt;
&lt;li&gt;The latest structured output exists, but the record of how it was produced is thin.&lt;/li&gt;
&lt;li&gt;Operations and engineering each hold part of the story.&lt;/li&gt;
&lt;li&gt;Internal review takes longer because the workflow did not preserve enough usable evidence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is when teams discover that having the final payload is not the same as having a trustworthy processing trail.&lt;/p&gt;

&lt;h2&gt;A practical approach&lt;/h2&gt;

&lt;p&gt;If the workflow needs to support review and change over time, I would build provenance directly into the operational design.&lt;/p&gt;

&lt;p&gt;That usually means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version-aware storage&lt;/strong&gt; for revised or resubmitted documents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Field-to-page context retention&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing records&lt;/strong&gt; that explain why a case was escalated&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reviewer-visible case history&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Structured reviewer outcomes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clear relationships between source files, extracted output, and review actions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
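
&lt;p&gt;One way to picture those relationships is a small case-history model. This is only a sketch under my own assumptions: FileVersion, FieldEvidence, CaseHistory, and their fields are hypothetical names, and everything is held in memory rather than in real storage.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileVersion:
    file_id: str
    version: int
    previous_file_id: str = ""   # empty string means first submission


@dataclass(frozen=True)
class FieldEvidence:
    field_name: str
    value: str
    file_id: str     # which version of the file the value came from
    page: int        # field-to-page context


@dataclass
class CaseHistory:
    case_id: str
    versions: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    events: list = field(default_factory=list)   # routing and review records

    def add_version(self, file_id):
        # Link each new file to the one it revises instead of overwriting it.
        previous = self.versions[-1].file_id if self.versions else ""
        version = FileVersion(file_id, len(self.versions) + 1, previous)
        self.versions.append(version)
        self.events.append(("file_received", file_id))
        return version

    def record_field(self, field_name, value, page):
        current = self.versions[-1].file_id
        self.evidence.append(FieldEvidence(field_name, value, current, page))

    def record_event(self, kind, detail):
        self.events.append((kind, detail))

    def where_from(self, field_name):
        """Return the most recent evidence for a field, or None."""
        matches = [e for e in self.evidence if e.field_name == field_name]
        return matches[-1] if matches else None
```

&lt;p&gt;Earlier evidence is kept rather than replaced, so a reviewer can compare the value from the first file with the value from the revised one.&lt;/p&gt;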

&lt;p&gt;The point is not to collect every possible log line. It is to retain the minimum evidence needed to make the workflow understandable later.&lt;/p&gt;

&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;A provenance layer helps three different users:&lt;/p&gt;

&lt;h3&gt;Reviewers&lt;/h3&gt;

&lt;p&gt;They can understand the current case without rebuilding the timeline by hand.&lt;/p&gt;

&lt;h3&gt;Operations teams&lt;/h3&gt;

&lt;p&gt;They can spot repeated patterns and see where the workflow keeps producing ambiguous cases.&lt;/p&gt;

&lt;h3&gt;Engineering teams&lt;/h3&gt;

&lt;p&gt;They can investigate behavior without depending entirely on anecdotal explanations from the queue.&lt;/p&gt;

&lt;p&gt;That is why provenance should be evaluated as part of workflow quality, not as a nice-to-have.&lt;/p&gt;

&lt;h2&gt;Tradeoffs&lt;/h2&gt;

&lt;p&gt;There are tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You will store more workflow context.&lt;/li&gt;
&lt;li&gt;You need to decide which evidence is genuinely useful.&lt;/li&gt;
&lt;li&gt;The review surface becomes more opinionated about what context matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But those tradeoffs are usually worth it in any workflow where version changes, disputes, or repeated exceptions are normal.&lt;/p&gt;

&lt;h2&gt;Implementation notes&lt;/h2&gt;

&lt;p&gt;One common mistake is to flatten everything into “latest file wins.” That may simplify storage, but it makes later review harder.&lt;/p&gt;

&lt;p&gt;Another mistake is to confuse provenance with verbose logging. More raw logs do not automatically create a clearer workflow. The useful question is whether a reviewer can answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What changed?&lt;/li&gt;
&lt;li&gt;Which file was used?&lt;/li&gt;
&lt;li&gt;Where did this value come from?&lt;/li&gt;
&lt;li&gt;Why did it move forward?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If not, the provenance model is probably too thin.&lt;/p&gt;

&lt;h2&gt;How I’d evaluate this&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Can revised files be linked to earlier versions?&lt;/li&gt;
&lt;li&gt;Is field-to-page context available during review?&lt;/li&gt;
&lt;li&gt;Can reviewers inspect history in one place?&lt;/li&gt;
&lt;li&gt;Are review outcomes retained?&lt;/li&gt;
&lt;li&gt;Is the processing trail useful for internal investigation?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where document workflows need stronger provenance, version visibility, and reviewer support, TurboLens/DocumentLens is the type of API-first layer I would evaluate alongside general extraction tooling and internal case systems.&lt;/p&gt;

&lt;p&gt;Disclosure: I work on DocumentLens at TurboLens.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Backpressure in document pipelines is an architecture problem, not only an ops problem</title>
      <dc:creator>CY Ong</dc:creator>
      <pubDate>Sun, 05 Apr 2026 22:09:31 +0000</pubDate>
      <link>https://dev.to/cy_ong_591/backpressure-in-document-pipelines-is-an-architecture-problem-not-only-an-ops-problem-3762</link>
      <guid>https://dev.to/cy_ong_591/backpressure-in-document-pipelines-is-an-architecture-problem-not-only-an-ops-problem-3762</guid>
      <description>&lt;p&gt;When document teams talk about reliability, they often focus on extraction quality first.&lt;/p&gt;

&lt;p&gt;That makes sense, but another issue shows up quickly in real workflows: backpressure. Documents arrive in bursts, review queues expand unevenly, retries accumulate, and the system starts feeling unreliable long before it visibly breaks.&lt;/p&gt;

&lt;p&gt;This is not just an operations issue. It is an architecture issue.&lt;/p&gt;

&lt;h2&gt;What broke&lt;/h2&gt;

&lt;p&gt;Backpressure usually appears through workflow symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean cases and unclear cases compete for the same path.&lt;/li&gt;
&lt;li&gt;Retries consume capacity that should be reserved for forward progress.&lt;/li&gt;
&lt;li&gt;Reviewers receive cases without enough context, which slows triage.&lt;/li&gt;
&lt;li&gt;Urgent documents get buried inside generic backlog handling.&lt;/li&gt;
&lt;li&gt;Monitoring focuses on service health but not queue composition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the workflow may still be “up,” but the design is already adding friction to every case that passes through it.&lt;/p&gt;

&lt;h2&gt;A practical approach&lt;/h2&gt;

&lt;p&gt;A more resilient document architecture separates concerns explicitly.&lt;/p&gt;

&lt;p&gt;I would usually want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A clean path for straightforward cases&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A distinct exception path&lt;/strong&gt; for review-bound ambiguity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retry logic isolated from human-review logic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue labels by reason&lt;/strong&gt;, not only by status&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Case evidence attached to every flagged item&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ownership rules&lt;/strong&gt; for who handles which failure class&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability at the queue level&lt;/strong&gt;, not only the service level&lt;/li&gt;
&lt;/ul&gt;
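
&lt;p&gt;A rough sketch of how lanes, reason labels, and ownership might fit together. The reason codes, lane names, team names, and the Case and Lanes classes are all assumptions made for illustration, and in-memory deques stand in for whatever queueing system is actually in use.&lt;/p&gt;

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical reason codes mapped to the lane and owner that handle them.
ROUTING = {
    "clean": ("clean", "automation"),
    "low_confidence_field": ("exception", "review_team"),
    "mixed_packet": ("exception", "review_team"),
    "upstream_timeout": ("retry", "platform_team"),
}


@dataclass
class Case:
    case_id: str
    reason: str                                   # why the case was routed
    evidence: dict = field(default_factory=dict)  # context for the reviewer


class Lanes:
    def __init__(self):
        # Retries get their own lane so they never compete with review work.
        self.queues = {"clean": deque(), "exception": deque(), "retry": deque()}

    def route(self, case):
        # Unknown reasons go to the exception lane rather than being dropped.
        lane, owner = ROUTING.get(case.reason, ("exception", "review_team"))
        if lane == "exception" and not case.evidence:
            raise ValueError("flagged cases must carry evidence")
        self.queues[lane].append(case)
        return lane, owner
```

&lt;p&gt;Rejecting a flagged case that arrives without evidence is a deliberate choice in this sketch: it forces the producer to attach context before a reviewer ever sees the item.&lt;/p&gt;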

&lt;p&gt;This architecture does not remove ambiguity. It makes ambiguity easier to contain.&lt;/p&gt;

&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;Backpressure gets expensive when every unclear document behaves like a surprise.&lt;/p&gt;

&lt;p&gt;If the workflow can classify and route uncertainty early, then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reviewers spend less time diagnosing the case&lt;/li&gt;
&lt;li&gt;urgent work is easier to isolate&lt;/li&gt;
&lt;li&gt;retries stop crowding the queue&lt;/li&gt;
&lt;li&gt;repeated failure modes become visible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why queue design belongs inside architecture review, not just inside operational cleanup discussions.&lt;/p&gt;

&lt;h2&gt;Tradeoffs&lt;/h2&gt;

&lt;p&gt;This kind of design adds structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more explicit lanes&lt;/li&gt;
&lt;li&gt;more routing metadata&lt;/li&gt;
&lt;li&gt;more opinionated queue ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the alternative is usually a single pipeline that becomes hard to reason about under uneven load.&lt;/p&gt;

&lt;h2&gt;Implementation notes&lt;/h2&gt;

&lt;p&gt;One useful implementation habit is to treat queue composition as a first-class metric: not just how many cases exist, but what kinds of cases exist and how long they remain unresolved.&lt;/p&gt;
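
&lt;p&gt;A composition metric can be very small. The sketch below assumes each open case is available as a (reason, enqueued_at) pair with times in seconds; the function name and the output keys are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict


def queue_composition(cases, now):
    """Summarize open cases by reason: how many, and how long the oldest has waited.

    cases is an iterable of (reason, enqueued_at) pairs; times are in seconds.
    """
    summary = defaultdict(lambda: {"count": 0, "oldest_age_s": 0})
    for reason, enqueued_at in cases:
        entry = summary[reason]
        entry["count"] += 1
        # Track the longest wait seen so far for this reason.
        entry["oldest_age_s"] = max(entry["oldest_age_s"], now - enqueued_at)
    return dict(summary)
```

&lt;p&gt;Emitting this per reason, rather than as a single backlog count, is what makes a growing class of ambiguous cases visible before the total looks alarming.&lt;/p&gt;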

&lt;p&gt;Another useful habit is to separate “document ambiguity” from “service instability.” Those are different conditions and deserve different responses.&lt;/p&gt;

&lt;h2&gt;How I’d evaluate this&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Are clean and unclear cases separated?&lt;/li&gt;
&lt;li&gt;Do retries have their own path?&lt;/li&gt;
&lt;li&gt;Can reviewers see why a case was routed?&lt;/li&gt;
&lt;li&gt;Is evidence attached to flagged cases?&lt;/li&gt;
&lt;li&gt;Does monitoring reflect backlog composition, not just uptime?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When teams need API-first document processing with exception-driven workflows and stronger queue-aware reliability design, TurboLens/DocumentLens is the kind of option I’d evaluate alongside broader extraction and orchestration tooling.&lt;/p&gt;

&lt;p&gt;Disclosure: I work on DocumentLens at TurboLens.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
