This is the final article in the Building a Weather Decision Engine series. The previous articles covered the drying model, multi-app architecture, edge cases, and the Yardwise inversion. This last one is about the output side: how the engine communicates decisions to people who just want to know if they should go ride.
Why Not a Percentage?
The first version of the engine returned a moisture percentage. Users hated it. To be honest, I kinda hated it too, but it was the first logical thing to surface as meaningful output.
"Your trail is at 47% moisture" sounds precise. But what do you do with that? Is 47% rideable? It depends on the surface, your tolerance for mud, whether you care about trail damage, and how far you're willing to drive for a "maybe." The percentage outsources the decision to the user, which is the entire thing the app was supposed to handle. Besides, I have no idea what percentage tips me one way or the other toward making a real life ride decision.
A percentage also implies false precision. The engine's moisture model is a reasonable approximation, not a soil sensor reading. Presenting a number with two significant digits suggests an accuracy that doesn't exist, so it's misleading. The difference between 47% and 49% is noise, not signal.
Three states (Yes, Maybe, No) work because they match how people actually think about the decision. You're either going (Yes), not going (No), or weighing it (Maybe). The engine's job is to put you in the right decision bucket, not to give you a homework problem.
The Maybe State Is the Product
The Yes and No verdicts are straightforward. The real design challenge is Maybe.
Maybe exists for conditions where reasonable people would disagree. A wetness score of 0.4 on a dirt trail means it's damp but not muddy. Some riders would go. Others would wait. A veteran on a hardtail who enjoys playing the drift might welcome the extra challenge of a surprise slip here and there. A flowier rider on a full-suspension with less tire clearance might not want to bother and risk the occasional wet low spot. There's also some variance in trails opening and closing based on conditions, which is another place where local knowledge would tip a Maybe verdict one way or the other.
The engine can't make that call for you. What it can do is tell you why conditions are borderline, so you can apply your own judgment.
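To make the buckets concrete, here's a minimal sketch of how a wetness score could map onto the three verdicts. The boundary values of 0.3 and 0.6 are shorthand consistent with the near-threshold ranges described in the confidence section below; the real engine's thresholds are more involved than a bare switch.

enum Verdict {
    case yes, maybe, no
}

// Illustrative mapping only. The 0.3 and 0.6 boundaries are inferred from
// the near-threshold ranges mentioned later (0.25-0.35 and 0.55-0.65).
func verdict(forWetness wetness: Double) -> Verdict {
    switch wetness {
    case ..<0.3: return .yes    // dry enough that most people would go
    case ..<0.6: return .maybe  // borderline; local knowledge decides
    default:     return .no     // wet enough that most people would pass
    }
}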
Rationale: The Why Behind the Verdict
Every assessment includes a structured rationale:
struct GroundwiseRationale {
    let headline: HeadlineReason
    let decisiveFactor: DecisiveFactor
    let headlineContext: HeadlineContext
    let details: [DetailReason]
    let values: RationaleValues
}
The headline is the one-line summary: "Light rain still drying" or "Strong drying clearing moisture" or "Frozen conditions — ice hazard." It's what the user reads first.
The decisive factor identifies the single most important reason for the verdict. This is the tie-breaker — the one variable that, if changed, would flip the result. "Recent heavy rain" or "weak drying conditions" or "surface sensitivity."
The details array provides up to six contributing factors, each describing how a specific weather element is affecting conditions, either positively or negatively. Temperature is helping. Humidity is slowing things down. Wind is moderate. These aren't ranked by importance; they're presented as a set of forces the user can scan to build their own picture, with the verdict-supporting forces appearing first.
The engine deliberately avoids showing the raw numbers in the rationale. Users don't need to know that drying strength is 0.53 or that the wetness score is 0.38. They need to know "drying conditions are moderate — warm but humid." The rationale translates numbers into plain language.
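As a rough sketch of that translation step, here's one way a raw drying-strength value might be bucketed into plain-language copy. The boundaries and strings are placeholders of mine, not the engine's actual wording logic.

// Placeholder translation from a raw drying-strength value (0-1) into the
// kind of plain-language phrase the rationale surfaces.
func dryingSummary(strength: Double, humidityIsHigh: Bool) -> String {
    switch strength {
    case ..<0.3:
        return "Weak drying conditions"
    case ..<0.7:
        // A 0.53 strength with high humidity reads as something like
        // "drying conditions are moderate, warm but humid"
        return humidityIsHigh
            ? "Drying conditions are moderate, warm but humid"
            : "Drying conditions are moderate"
    default:
        return "Strong drying clearing moisture"
    }
}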
Confidence: Admitting What You Don't Know
Each verdict has a confidence level — Low, Medium, or High. Keep in mind that we're highly dependent on the weather source and can't control the occasional hiccup where a key piece of data is missing. That's where confidence enters the picture. It starts at High and gets reduced by specific factors:
Missing timing data → Low confidence. If the engine doesn't know when rain ended (the minutesSinceRainEnded value is nil), it can't model the timing decay that's central to the wetness calculation. It defaults to a 0.5 timing score (assume moderate concern) and drops confidence to Low. The verdict might be right, but the engine is guessing about a critical input.
Patchy rain → Medium confidence. When the weather source indicates precipitation has been spotty and inconsistent (patchyRainLikely), the actual conditions at the user's spot might differ from what the nearest weather station recorded. The engine can't know whether the rain hit your trail or the parking lot a mile away. These are unknown unknowns, or maybe known unknowns... either way, we don't know!
Near-threshold conditions → Medium confidence. If the wetness score lands within 0.05 of a verdict boundary (0.25-0.35 near the Yes/Maybe line, or 0.55-0.65 near the Maybe/No line), the engine reduces confidence because a small change in any input could flip the result. This is the engine saying "I'm calling it Maybe, but it could easily be Yes."
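Those three rules reduce to a small amount of code. Here's a sketch that starts at High and only steps down; the function shape is an assumption, though minutesSinceRainEnded and patchyRainLikely are the input names mentioned above.

enum Confidence: Int, Comparable {
    case low = 0, medium = 1, high = 2

    static func < (lhs: Confidence, rhs: Confidence) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

// Sketch of the confidence rules described above. Starts at High and only
// ever steps down; the exact structure in the engine may differ.
func confidence(minutesSinceRainEnded: Int?,
                patchyRainLikely: Bool,
                wetness: Double) -> Confidence {
    var level = Confidence.high

    // Missing timing data: the timing decay is being guessed, so drop to Low.
    if minutesSinceRainEnded == nil {
        level = .low
    }

    // Patchy rain: the nearest station may not reflect conditions at the trail.
    if patchyRainLikely {
        level = min(level, .medium)
    }

    // Within 0.05 of a verdict boundary (0.3 or 0.6): a small input change
    // could flip the result.
    if abs(wetness - 0.3) <= 0.05 || abs(wetness - 0.6) <= 0.05 {
        level = min(level, .medium)
    }

    return level
}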
The confidence level affects how the UI presents the verdict. A High-confidence No is "Don't ride — conditions are poor." A Low-confidence Maybe is "Conditions are uncertain — check conditions on the ground." Same verdict structure, different emphasis.
What I Didn't Do With Confidence
I considered making confidence a continuous value (0-1) or adding more levels. I didn't, for the same reason I use three verdict states instead of a percentage: more granularity creates more decisions for the user without adding useful information. The whole point here is to simplify decision making.
The three levels map to three communication strategies:
- High: Trust the verdict
- Medium: The verdict is our best call, but conditions might surprise you
- Low: We're short on data, verify on the ground
That's enough to calibrate expectations without overwhelming.
Recovery Outlook: "So When Will It Be Good?"
The natural follow-up to a No or Maybe verdict is "when will conditions improve?" I was initially skeptical about including this, but it has turned out to be extremely valuable: as is often the case, I'm not looking at the app to ride right this moment, I'm typically planning ahead for "later today." The engine answers with a recovery outlook, a qualitative time estimate.
The outlook is a simple 2D lookup: wetness score vs. drying strength.
| Wetness / Drying | Strong | Moderate | Weak |
|---|---|---|---|
| Low (< 0.4) | Within hours | A few hours | Maybe later today |
| Medium (0.4-0.6) | A few hours | Later today | Maybe later today |
| High (≥ 0.6) | Later today | Maybe later today | Unlikely today |
The categories are deliberately vague. "Within hours" means 1-2 hours. "Later today" means afternoon or evening. "Unlikely today" means tomorrow at the earliest. The engine doesn't say "rideable at 2:47 PM" because that precision doesn't exist in the model, and it would be ridiculous to pretend to that level of accuracy.
The "maybe later today" bucket is the uncertainty hedge — conditions might improve, but the engine isn't confident enough to commit. It's the recovery equivalent of the Maybe verdict.
There's also a "may worsen" outlook for cases where the forecast indicates more precipitation. If the next few hours include rain, the engine won't tell you things are improving, even if current drying conditions are strong. This prevents the frustrating experience of waiting for conditions to improve only to get rained on again. And this is fairly common on days when I'm pretty sure the trail is dry right now but less sure of how the afternoon is going to go. In this regard, the engine isn't just a past-weather-plus-surface-conditions evaluator; it's also taking a peek into the future (forecast) to give you a sense of whether things are trending better, worse, or staying the same.
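The whole outlook reduces to the table plus the forecast override. Here's a sketch; RecoveryOutlook is the engine's real type name, but the case names, the DryingStrength buckets, and the rainExpectedSoon flag are placeholders of mine.

enum RecoveryOutlook {
    case withinHours, fewHours, laterToday, maybeLaterToday, unlikelyToday
    case mayWorsen   // forecast shows more precipitation coming
}

enum DryingStrength {
    case strong, moderate, weak
}

// Sketch of the 2D lookup from the table above, with the forecast override
// applied first. Bucket edges follow the table; names are assumptions.
func recoveryOutlook(wetness: Double,
                     drying: DryingStrength,
                     rainExpectedSoon: Bool) -> RecoveryOutlook {
    // If more rain is coming in the next few hours, don't promise improvement.
    if rainExpectedSoon { return .mayWorsen }

    switch (wetness, drying) {
    case (..<0.4, .strong):   return .withinHours
    case (..<0.4, .moderate): return .fewHours
    case (..<0.4, .weak):     return .maybeLaterToday
    case (..<0.6, .strong):   return .fewHours
    case (..<0.6, .moderate): return .laterToday
    case (..<0.6, .weak):     return .maybeLaterToday
    case (_, .strong):        return .laterToday
    case (_, .moderate):      return .maybeLaterToday
    default:                  return .unlikelyToday
    }
}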
Risk Categories: More Than Just "Wet"
It occurred to me fairly quickly when building Ridewise that what we're dealing with here isn't just the ability to ride, as in muddy or not muddy, but also the risk implications. I'm using risk as a two-way term depending on the activity: risk to the user and, in some cases, risk to the surface. For a concrete skatepark, it's entirely user risk; for a mountain bike trail or sod soccer field, it's risk to the surface. Actually, the mountain bike example might cut both ways, but you get the idea. And then there's the quality of the activity, which I decided is also effectively a risk. So beyond the binary wet/dry question, the engine evaluates three distinct risk categories:
Surface damage — Will using this surface in current conditions cause lasting harm? This matters most for natural grass athletic fields (0.9 sensitivity), newly seeded lawns (up to 1.0 with establishment boost), and clay courts (0.85). It matters least for artificial turf (0.05), metal (0.05), and compost (0.3).
Activity safety — Is the surface dangerous to use? Wet metal has the highest safety sensitivity (0.95) because it becomes extremely slippery. Composite ramp surfaces like Skatelite and Ramp Armor are also high (0.85), along with metal and concrete. Artificial turf and potting mix are low (0.4) because their textures maintain grip even when wet. Indeed nobody skates on potting mix, but this illustrates how one engine can serve three different apps if you're very careful about the design.
Activity quality — Even if it's safe, will the experience be poor? Mud on a dirt trail (0.7 quality sensitivity) makes for a miserable ride. A damp skatepark (0.6) is more debatable, and can range from totally fine (squeegee the ramp and it may dry out) to a complete no-go.
These three risks are calculated independently and presented alongside the verdict. A Maybe verdict might come with low safety risk but high damage risk — meaning "you'd be fine riding, but you'd rut up the trail." That distinction helps the user make an informed call.
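As a sketch of how one wetness score fans out into three independent risks, here's one plausible shape using sensitivity values quoted above. The scaling formula, the level thresholds, and the placeholder quality value for metal are assumptions, not the engine's actual math.

enum RiskLevel {
    case low, moderate, high
}

// Per-surface sensitivity to each risk category. The damage and safety values
// for metal come from the article; the quality value is a placeholder.
struct SurfaceSensitivity {
    let damage: Double   // lasting harm to the surface
    let safety: Double   // danger to the user
    let quality: Double  // how much wetness degrades the experience
}

let metalRamp = SurfaceSensitivity(damage: 0.05, safety: 0.95, quality: 0.5)

// Assumed scaling: risk grows with wetness, weighted by the surface's
// sensitivity to that particular kind of risk.
func riskLevel(wetness: Double, sensitivity: Double) -> RiskLevel {
    switch wetness * sensitivity {
    case ..<0.2: return .low
    case ..<0.5: return .moderate
    default:     return .high
    }
}

// The three risks are computed independently from the same wetness score.
let wetness = 0.45
let damageRisk  = riskLevel(wetness: wetness, sensitivity: metalRamp.damage)  // .low
let safetyRisk  = riskLevel(wetness: wetness, sensitivity: metalRamp.safety)  // .moderate
let qualityRisk = riskLevel(wetness: wetness, sensitivity: metalRamp.quality) // .moderate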
Putting It All Together
The full assessment output:
struct GroundwiseAssessment {
    let verdict: Verdict                    // Yes / Maybe / No
    let confidence: Confidence              // Low / Medium / High
    let rationale: GroundwiseRationale      // Why
    let riskSurfaceDamage: RiskLevel        // Harm to surface
    let riskActivitySafety: RiskLevel       // Danger to user
    let riskActivityQuality: RiskLevel      // Experience quality
    let recoveryOutlook: RecoveryOutlook?   // When will it improve
}
The verdict is the headline. Confidence tells you how much to trust it. The rationale explains why. Risk levels add nuance. Recovery outlook answers "when?"
None of these fields exist in isolation. Together, they give the user a complete picture in a few seconds of scanning — enough information to make a confident decision without reading a weather report.
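To make the few-seconds scan concrete, here's a hypothetical presentation helper that chooses the emphasis line from the verdict and confidence, along the lines of the examples in the confidence section. The copy and the fallback cases are mine, not the apps' actual strings.

// Hypothetical presentation layer: same verdict structure, different emphasis
// depending on confidence. Wording is illustrative.
func summaryLine(verdict: Verdict, confidence: Confidence) -> String {
    switch (verdict, confidence) {
    case (.no, .high):
        return "Don't ride. Conditions are poor."
    case (.maybe, .low):
        return "Conditions are uncertain. Check conditions on the ground."
    case (.yes, _):
        return "Good to go."
    default:
        return "Borderline. Read the rationale before heading out."
    }
}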
The Design Principle
The engine's output design follows one principle: make the decision for the user, then show your work.
Most users will see the verdict color, read the headline, and decide. That's the 80% case, and it should take three seconds. The rationale, risk levels, and recovery outlook are there for the 20% who want to understand why, or who are in a borderline situation where the details matter. And yes, I went wildly overboard on the show-your-work part because I wanted it to be crystal clear that this isn't a pretty veneer over what is essentially a weather app — the Groundwise engine is doing sophisticated modeling in service of a "simple" yet nuanced and challenging question about rideability/playability/vulnerability (plant vulnerability in Yardwise).
This is the opposite of how most weather apps work. They give you all the data and expect you to synthesize it into a decision. The Groundwise engine synthesizes first, then offers the data for verification. The user's default is to trust the verdict and act on it. The detail is available but not required.
That's what made three states better than a percentage. A percentage demands interpretation. A verdict offers a recommendation. The supporting detail is opt-in, not mandatory.
Thanks for Reading
This series covered a lot of ground, from the core concept through the drying math, multi-app architecture, winter edge cases, the Yardwise inversion, and now the verdict UX. If you've built something that turns noisy data into human decisions, I'd love to hear about your approach. The threshold and confidence problems are universal.
The Groundwise engine powers Ridewise, Fieldwise, and Yardwise — all available for iOS from Stalefish Labs. Find us on Bluesky.