shonit kapoor

How I Replaced a $50,000/Year Licensed SDK with 800 Lines of Swift

Every year, our company wrote a check for a mobile check capture SDK.
Not a small check. A $50,000+ check. For a black box library that we couldn’t customize, couldn’t debug, and couldn’t update on our own timeline.
When Apple shipped a new iOS version, we held our breath waiting for the vendor to catch up. When we needed a UI tweak for a client, we filed a support ticket and waited weeks. When the SDK crashed in production, we had no stack trace that meant anything.
After three years of this, I decided to replace it entirely.
This is the story of how I built CheckCaptureKit — a complete in-house iOS check capture framework using only Apple’s native Vision framework — and what I learned along the way.

The Problem With Licensed SDKs Nobody Talks About

MiSnap is a well-known enterprise SDK for mobile check capture. It works. Banks use it. But when you’re a product team trying to move fast, it creates problems that compound over time.
You’re on their timeline, not yours. Every iOS beta season is a gamble. Will the vendor have an updated SDK ready before your release date? Sometimes yes. Sometimes no. And when no, your entire release is blocked.
Customization is a negotiation. Want to change the capture overlay color to match a new client’s brand? That’s a support ticket. Maybe a contract amendment. Definitely not a one-line code change.
Debugging is impossible. When something goes wrong in a black box, you get a generic error code and a PDF manual. No stack trace, no source, no answers.
The cost never goes down. Licensing fees only go up, especially when you’re locked in and switching costs are high.
I knew Apple’s Vision framework existed. I’d used it for other things. One afternoon I finally asked the serious question: could I build this myself?

The answer was yes — in about three weeks of focused development.

What Check Capture Actually Requires

Before writing a line of code, I mapped out exactly what the SDK was doing. It came down to five things:

  1. Document detection — find the check in the camera frame in real time
  2. Auto-capture — trigger capture automatically when the check is stable
  3. Image quality validation — reject blurry, dark, or skewed captures
  4. MICR extraction — read the routing and account numbers from the bottom
  5. Front and back capture — guide the user through both sides

That’s it. Five things. Apple’s Vision framework handles all of them natively, for free.
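To make that concrete, here is a rough sketch of how those five map onto Vision. The two request types are real Vision APIs; wiring them into a camera pipeline is what the rest of this post is about.

import Vision

// Rough mapping of the five requirements onto Vision primitives:
// 1 & 5. Find the check, per frame, per side -> rectangle detection
// 4.     Read the MICR line from the capture -> text recognition
// 2 & 3. Auto-capture and quality gates      -> app logic layered on
//        the observations these requests return
let detectCheck = VNDetectRectanglesRequest()
let readMICR = VNRecognizeTextRequest()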

The Architecture: Four Internal Layers

I designed CheckCaptureKit around four focused internal components, each with a single responsibility. This is the part most developers skip — they build one massive camera controller that does everything. That’s how you get untestable, unmaintainable code.
DocumentDetector runs continuously on a background queue, using Vision’s rectangle detection to find the check in frame 30 times per second. The key insight here is tuning the detection parameters specifically for check dimensions — most rectangle detectors are too generic out of the box.
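To illustrate the tuning point, here is a minimal detector sketch. The VNDetectRectanglesRequest parameters are real Vision API; the threshold values are illustrative guesses, not CheckCaptureKit’s tuned numbers.

import AVFoundation
import Vision

final class DocumentDetector {
    private let queue = DispatchQueue(label: "document-detector", qos: .userInitiated)

    func detect(in pixelBuffer: CVPixelBuffer,
                completion: @escaping (VNRectangleObservation?) -> Void) {
        let request = VNDetectRectanglesRequest { request, _ in
            completion(request.results?.first as? VNRectangleObservation)
        }
        // Checks are wide and short (roughly 2.2:1 to 3:1), so narrow
        // Vision's 0...1 aspect-ratio window. Values below are assumed.
        request.minimumAspectRatio = 0.30
        request.maximumAspectRatio = 0.55
        request.minimumSize = 0.4          // check should fill much of the frame
        request.quadratureTolerance = 15   // tolerate modest perspective skew
        request.maximumObservations = 1

        queue.async {
            // .right assumes a portrait camera feed; adjust for your setup.
            let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                orientation: .right)
            try? handler.perform([request])
        }
    }
}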
QualityAnalyzer is a gatekeeper. Before any capture is triggered, three independent quality checks must all pass. I won’t share the exact thresholds here — those came from weeks of real-world testing across different lighting conditions and check types — but the architecture of having explicit, tunable quality gates is what separates a professional implementation from a weekend project.
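The thresholds stay private, but the shape of the idea is simple to show: independent, tunable gates where every one must pass. All names and numbers below are placeholders, not the tuned values.

import Foundation

// Per-frame metrics, computed elsewhere (e.g. sharpness from a Laplacian
// variance pass). The fields here are illustrative.
struct FrameMetrics {
    let sharpness: Double
    let brightness: Double
    let skewDegrees: Double
}

protocol QualityGate {
    func passes(_ metrics: FrameMetrics) -> Bool
}

struct SharpnessGate: QualityGate {
    var minimum = 0.5                                   // placeholder threshold
    func passes(_ m: FrameMetrics) -> Bool { m.sharpness >= minimum }
}

struct BrightnessGate: QualityGate {
    var acceptable = 0.25...0.85                        // placeholder range
    func passes(_ m: FrameMetrics) -> Bool { acceptable.contains(m.brightness) }
}

struct SkewGate: QualityGate {
    var maxDegrees = 10.0                               // placeholder tolerance
    func passes(_ m: FrameMetrics) -> Bool { abs(m.skewDegrees) <= maxDegrees }
}

struct QualityAnalyzer {
    let gates: [QualityGate] = [SharpnessGate(), BrightnessGate(), SkewGate()]
    // Capture is only allowed when every gate passes.
    func approves(_ metrics: FrameMetrics) -> Bool {
        gates.allSatisfy { $0.passes(metrics) }
    }
}

Because each gate is a small value type with its own knobs, retuning for a new client is a configuration change, not a rewrite.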
AutoCaptureEngine was the hardest piece to get right. A single good frame isn’t enough — anyone who’s built real-time camera features knows that individual frames lie. The check needs to be stable and consistent across multiple consecutive frames before capture triggers. One good frame is luck. Three agreeing frames is confidence. The exact stability algorithm is in the playbook.
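The exact algorithm lives in the playbook, but the general shape of a stability buffer is no secret. A minimal sketch, with an assumed frame count and drift tolerance:

import CoreGraphics

final class StabilityBuffer {
    private var recent: [CGRect] = []
    private let required = 3               // consecutive agreeing frames (assumed)
    private let tolerance: CGFloat = 0.02  // max drift in normalized coordinates (assumed)

    // Feed one detected bounding box per frame; returns true once the
    // check has held still long enough to trigger capture.
    func add(_ box: CGRect) -> Bool {
        recent.append(box)
        if recent.count > required { recent.removeFirst() }
        guard recent.count == required, let first = recent.first else { return false }
        return recent.allSatisfy { candidate in
            abs(candidate.minX - first.minX) < tolerance &&
            abs(candidate.minY - first.minY) < tolerance &&
            abs(candidate.width - first.width) < tolerance &&
            abs(candidate.height - first.height) < tolerance
        }
    }

    // Call when detection drops out, so stale frames never count.
    func reset() { recent.removeAll() }
}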
MICRExtractor uses Vision’s text recognition with specific configuration for the MICR E-13B font used on every check. The tricky part isn’t the recognition — it’s the parsing logic that correctly identifies routing numbers, account numbers, and check numbers from the raw recognized text. Bank formats are not as standardized as you’d hope.
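To illustrate the parsing half: the VNRecognizeTextRequest settings below are real Vision API, while the parser is a simplified sketch that assumes the recognizer already produced a clean digit string. It leans on the public ABA routing checksum (3·(d1+d4+d7) + 7·(d2+d5+d8) + (d3+d6+d9) must be divisible by 10) to tell a routing number apart from any other nine digits.

import Foundation
import Vision

func makeMICRRequest() -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = false  // MICR is digits and symbols, not words
    return request
}

struct MICRParser {
    /// ABA checksum: 3·(d1+d4+d7) + 7·(d2+d5+d8) + (d3+d6+d9) ≡ 0 (mod 10).
    static func isValidRoutingNumber(_ candidate: String) -> Bool {
        let digits = candidate.compactMap { $0.wholeNumberValue }
        guard digits.count == 9 else { return false }
        let weights = [3, 7, 1, 3, 7, 1, 3, 7, 1]
        return zip(digits, weights).map(*).reduce(0, +) % 10 == 0
    }

    /// Returns the first checksum-valid nine-digit run in the recognized text.
    static func routingNumber(in recognized: String) -> String? {
        let pattern = try! NSRegularExpression(pattern: "\\d{9}")
        let full = NSRange(recognized.startIndex..., in: recognized)
        return pattern.matches(in: recognized, range: full)
            .compactMap { Range($0.range, in: recognized) }
            .map { String(recognized[$0]) }
            .first(where: isValidRoutingNumber)
    }
}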

The Public API: Where Complexity Goes to Hide

Here’s the part I’m most proud of. From the outside, all of that complexity disappears:

let kit = CheckCaptureKit(configuration: .default)
kit.present(from: self) { result in
    switch result {
    case .success(let capture):
        // frontImage, backImage, micrData — everything you need
        submitDeposit(capture)
    case .cancelled:
        break
    case .failure(let error):
        showError(error)
    }
}

One function call. The caller never sees DocumentDetector, QualityAnalyzer, or AutoCaptureEngine. They don’t need to. That’s the point.
A clean public API is worth more than clever internals. Design the surface your callers interact with first. Build everything else to support it.

The Results

After shipping to production across multiple client tenants:

Metric                   Before (MiSnap)     After (CheckCaptureKit)
Annual cost              $50,000+            $0
Capture success rate     94%                 97%
iOS update dependency    Vendor timeline     Our timeline
Customization time       Weeks               Hours
Debug visibility         None                Full source

The capture success rate improvement wasn’t magic — it came from tuning quality thresholds specifically for our clients’ real-world conditions. Something we could never do with a black box.

Three Things I Learned

Apple’s native frameworks are massively underestimated.
Vision, CoreML, CoreImage — these are production-grade tools that most developers treat as toys. Before paying for any third-party SDK, spend a week seriously evaluating what Apple already ships for free.

Stability buffers beat single-frame detection every time. This pattern applies far beyond check capture. Any real-time detection problem — face recognition, document scanning, AR anchoring — benefits from requiring consecutive agreeing frames before acting. One frame can be wrong. Three rarely are.

The cost of dependency is always higher than it looks. The $50k/year was visible on a budget spreadsheet. The engineering hours lost to vendor coordination, update blocking, and workarounds were invisible — but very real. Build vs buy decisions need to account for the full cost of being locked in.

Want the Full Implementation?

The architecture decisions above are the what and the why. The exact implementation — the quality thresholds, the stability algorithm, the MICR parsing logic, and how it all fits together with the camera pipeline — is in the iOS Architecture Playbook.
It also covers MVVM-C architecture, Combine in production, multi-tenant iOS design patterns, and XCUITest automation that actually survives CI. All from 8+ years of shipping production iOS apps across enterprise clients.
[iOS Architecture Playbook — $29, instant PDF download →]
https://shonitk.gumroad.com/l/rtiyj

Built something similar? Hit a wall with the Vision framework? Drop your questions in the comments.
