DEV Community: SciForce

From OMOP Workflows to Living Evidence: SciForce at OHDSI Europe Symposium 2026

SciForce — Tue, 21 Jul 2026 14:38:17 +0000

Introduction

This April, Polina Talapova and Mariia Pahur represented SciForce at the 7th European OHDSI Symposium in Rotterdam – three vivid days of workshops, poster sessions, the MindMeetsMachines mapping competition, and an oral presentation aboard the SS Rotterdam, a retired ocean liner moored on the Maas river.

The symposium's theme was Continuous Collaboration for Living Evidence Generation. The word "living" matters here. Traditional evidence-generation projects are often designed as discrete studies. A living-evidence model places greater emphasis on reusable study definitions, regularly updated data, versioned vocabularies, repeatable analytical pipelines, and findings that can be reassessed as new information becomes available. Evidence generation is becoming more continuous and iterative. Supporting that model requires maintainable infrastructure rather than a one-off data conversion or analysis.

This transition was visible throughout the Rotterdam programme. The community is moving from "we converted our data to OMOP" toward "we maintain production-grade evidence infrastructure." That is not yet true of every implementation, but it reflects a clear direction among more mature OMOP programmes. The transition depends on standardized vocabularies, phenotyping, data quality, analytical tooling, and community coordination – all working together. A weakness in any one of these layers can compromise the comparability or reproducibility of the resulting evidence.

SciForce's contributions sat directly in that space: vocabulary infrastructure, participation in an expert-versus-machine terminology-mapping challenge, an OMOP extension table for environmental research, and applied clinical research workflows. This article covers what we brought to Rotterdam, what the programme showed about where the field is heading, and what trends are shaping real-world evidence infrastructure across the OHDSI community.

Why OHDSI Europe Matters: Many Countries, Shared Evidence Questions

The 7th European OHDSI Symposium brought together 375 attendees at the main event, following a two-day pre-symposium programme of workshops and community meetings. To understand why that gathering matters, and why the conversations happening there are consequential for healthcare research globally, it helps to start with the problem the community exists to solve.

A shared problem across fragmented data

Medical data is recorded differently everywhere – different coding systems, different EHR architectures, different local clinical practices. Yet the research questions are often the same: which treatments produce better outcomes in real clinical practice? What safety risks appear across patient populations? How consistently can a phenotype be identified in heterogeneous data?

Answering those questions reliably across multiple data sources requires four things to work together:

a shared data structure (OMOP CDM)
a shared semantic layer (OHDSI Standardized Vocabularies)
reusable analytical methods and tools (HADES, ATLAS, Achilles, DQD, and others)
local data expertise and governance capable of executing and validating the work at each participating site

OHDSI – the Observational Health Data Sciences and Informatics community – is the international open-science initiative of over 4,500 researchers, clinicians, developers, and data scientists that builds and maintains the first three, while the fourth remains the responsibility of each participating site.

In Europe, health data is distributed across many languages, national terminologies, healthcare systems, and governance frameworks. In many projects, legal, ethical, contractual, privacy, or operational constraints make centralizing patient-level data difficult or inappropriate. This makes Europe a practical test case for federated evidence generation: research coordinated across countries without moving patient data, with results that are nonetheless standardized and comparable. The model does not eliminate heterogeneity; it makes differences in implementation, vocabulary coverage, phenotype logic, and data quality more visible – and therefore more important to manage.

SciForce in OHDSI: A Decade of Work on Semantic Infrastructure

SciForce has been part of the OHDSI community since 2015. For us, Rotterdam therefore felt less like a conventional conference and more like a working session within a community to which our team has contributed over several years.

The work started with medical data mapping and grew from there into vocabulary development, concept creation, relationship logic, semantic QA, clinical validation, and OMOP CDM representation rules.

Our team members have contributed to a range of OHDSI vocabulary initiatives, including work involving RxNorm Extension, ICD-family vocabularies, ATC, dm+d, and CIEL. These efforts have addressed different parts of the same problem: representing local or national clinical terminology in a form that can support consistent OMOP-based analysis.

Alongside that, we participate in the Vocabulary WG, GIS WG, THEMIS, Psychiatry WG, Clinical Trial WG, Survey WG, and the second edition of the Book of OHDSI. These groups are where many practical modelling conventions, unresolved edge cases, and contribution processes are discussed.

What SciForce Brought to Rotterdam

SciForce contributed to the symposium through an oral presentation, Jackalope Plus participation in the MindMeetsMachines challenge, and four poster projects – some led by SciForce and others developed with collaborators across the OHDSI community.

"Dreaming about OHDSI Standardized Vocabularies"

Polina Talapova received the 2025 OHDSI Titan Award for Data Standards. Her presentation formed part of a symposium session examining the next stage of OHDSI’s development. Patrick Ryan and Renske Los had discussed the community’s future direction, while the Titan presentations reflected on how that direction could be translated into practical work.

Her answer:

Vocabularies need to stop being a lookup table behind Athena and start working as shared infrastructure: the layer that makes a cohort defined in Rotterdam mean the same thing as one defined in Seoul or São Paulo. Without that, the rest of the evidence stack is only as reliable as each organization's local mapping work – which varies enormously.

There's a real foundation to build on. OHDSI Standardized Vocabularies, Athena, Usagi, expert stewardship, contribution pathways, AI-assisted mapping tools coming from multiple directions. But the gaps are just as real. Contribution pathways that exist but nobody can find. The same mapping problems being solved independently by a dozen different teams. A handful of experts holding most of the knowledge. No QA process that actually scales. The community has standardization – what it doesn't yet have is infrastructure.

What Polina proposed:

Shift the model – organizations contribute source vocabularies; the community delivers reusable mappings back instead of everyone solving it alone.
Scale stewardship with AI – candidate mappings, duplicate detection, hierarchy enrichment, QA triage handled by LLMs, with humans in charge of decisions.
Build a shared workspace – one place for vocabulary intake, mapping, distribution, review, and QA instead of scattered tools.
Organize for collective progress – shared problem intake, visible contribution spaces, pathways from local prototype to community asset.
Coordinate and build – near-term: automation and intake. Mid-term: shared workspace. Long-term: community-scale semantic service.

The talk drew one question from the audience that framed it well: are we actually preparing the next generation to build this? Polina's answer was that the responsibility sits with the current generation – and that the tools and community to do it are already there.

Jackalope Plus in the MindMeetsMachines Vocabulary Challenge

Mariia Pahur presented Jackalope Plus in the competition and co-led one of the human mapping groups – so SciForce was working both sides of the exercise at the same time.

Jackalope Plus is an LLM-assisted terminology-mapping system designed to decompose complex source expressions, retrieve candidate concepts, and support structured expert review. For example, a source expression such as “MRI of the lumbar region” may be analysed into a procedure component and an anatomical component. Depending on the target vocabulary and applicable modelling conventions, the final representation may use a single precoordinated concept, a compositional structure, or another supported mapping pattern.

The system combines source-text processing, vector-based candidate retrieval, terminology metadata, and constrained language-model prompting. It supports batch processing through an API and presents candidate mappings for expert review and quality assurance. The point isn't to automate mapping, but to make the expert's job faster and more structured. At MindMeetsMachines, ten AI teams mapped procedural concepts in advance while human teams mapped the same set live.

For one reported challenge dataset of 292 codes, Jackalope Plus returned a candidate result for all but one code and achieved a reported ExactMatch agreement of 0.63 against the challenge reference. These results should be interpreted in the context of the dataset, reference standard, matching definitions, and evaluation protocol.
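
As a rough illustration of how an exact-match agreement figure of this kind can be computed, the sketch below compares predicted target concepts against a reference set. The codes, the concept IDs, the dictionary layout, and the rule that a missing candidate counts as a miss are all assumptions made for illustration; they are not the challenge's evaluation protocol.

```python
from typing import Optional


def exact_match_agreement(
    predicted: dict[str, Optional[int]],
    reference: dict[str, int],
) -> float:
    """Share of reference codes whose predicted concept equals the reference concept.

    A source code with no candidate (None or absent) is counted as a miss,
    which is an assumption of this sketch.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    hits = sum(
        1
        for code, concept_id in reference.items()
        if predicted.get(code) == concept_id
    )
    return hits / len(reference)


# Toy data with made-up source codes and concept IDs.
reference = {"A": 101, "B": 102, "C": 103, "D": 104}
predicted = {"A": 101, "B": 999, "C": 103, "D": None}
print(exact_match_agreement(predicted, reference))  # 2 of 4 codes agree
```

A different choice for missing candidates, or crediting broader and narrower matches, would change the number, which is why the matching definition matters when comparing systems.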

A separate benchmarking exercise reported automated processing of 749 source terms in approximately 16 minutes, compared with several hours of manual mapping effort. However, machine execution time and end-to-end expert mapping time are not directly equivalent unless preprocessing, review, adjudication, and QA are measured consistently. Variation among expert mappings also illustrates that terminology mapping may involve legitimate ambiguity rather than a single self-evident answer. The competition left SciForce with a concrete development list: better support for broadMatch and narrowMatch logic, better candidate ranking, uncertainty handling, and explainability for expert review.

Posters on OMOP Extensions, Mapping Quality, and Clinical Research Workflows

EXTERNAL_EXPOSURE: representing environmental and social context

The OHDSI GIS Working Group built an OMOP CDM extension table to capture environmental and social exposures: air pollution, deprivation indices, housing conditions, climate data. The associated Gaia tooling and Bridge2AI for Clinical Care – CHoRUS work demonstrate how such a model could connect environmental context with critical-care research data. The table remains a community extension rather than part of the current OMOP CDM core specification.

https://ohdsi-europe.org/images/symposium-2026/posters/74_poster.pdf

The larger modelling problem is significant. Environmental variables differ from conventional clinical events in temporal resolution, spatial granularity, provenance, uncertainty, and exposure assignment. A useful OMOP representation must preserve these properties rather than merely place values into a new table.
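
To make those properties concrete, here is a minimal sketch of a record that keeps them next to the value instead of discarding them. The class and every field name are hypothetical and chosen for readability; they are not the column list of the GIS Working Group's extension table.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ExposureRecord:
    """Illustrative record; field names are hypothetical, not the extension's columns."""

    person_id: int
    exposure_name: str       # what was measured, e.g. "PM2.5"
    value: float
    unit: str
    start_date: date         # temporal resolution: first day the value covers
    end_date: date           # temporal resolution: last day the value covers
    spatial_unit: str        # spatial granularity, e.g. "postcode" or "1 km grid"
    source: str              # provenance: which dataset produced the value
    assignment_method: str   # how the value was linked to this person
    uncertainty: Optional[float] = None  # e.g. a standard error, if reported

    def __post_init__(self) -> None:
        if self.end_date < self.start_date:
            raise ValueError("end_date must not precede start_date")

    @property
    def days_covered(self) -> int:
        """Number of calendar days the value applies to, inclusive."""
        return (self.end_date - self.start_date).days + 1


record = ExposureRecord(
    person_id=1,
    exposure_name="PM2.5",
    value=12.3,
    unit="ug/m3",
    start_date=date(2025, 1, 1),
    end_date=date(2025, 1, 31),
    spatial_unit="postcode",
    source="example-air-quality-dataset",
    assignment_method="residential address",
)
print(record.days_covered)
```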

Laterality mapping: when standardization removes clinically relevant detail

When a side-specific source diagnosis maps to a broader target concept without equivalent laterality, left–right information may be lost unless it is retained elsewhere in the data model. This does not occur in every mapping, but it is a recurrent risk when the target vocabulary lacks an equivalent precoordinated concept or when the mapping convention selects a broader standard concept. In the reported ophthalmology analysis, 3,116 ICD-10CM source concepts were associated with 730 standard ones. A pilot submission to SNOMED International was accepted in October 2025 and went into production in December. For studies in which laterality affects eligibility, treatment, recurrence, or outcomes, loss of this detail can alter the phenotype before analysis begins. The practical remedy may involve improved standard vocabulary content, explicit modifier representation, source-value preservation, or a combination of these approaches.

https://ohdsi-europe.org/images/symposium-2026/posters/66_abstract.pdf
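
A simple way to surface candidate cases of laterality loss is to compare side words in the source and target concept names. The sketch below does only that; the regular expression, the function names, and the example strings are assumptions for illustration, and a name-based check is a screening aid rather than the method used in the poster.

```python
import re

# Hypothetical pattern: side words that may appear in a concept name.
LATERALITY = re.compile(r"\b(left|right|bilateral)\b", re.IGNORECASE)


def laterality(name: str) -> set[str]:
    """Return the side words found in a concept name, lowercased."""
    return {match.lower() for match in LATERALITY.findall(name)}


def loses_laterality(source_name: str, target_name: str) -> bool:
    """True when the source names a side that the target name does not carry."""
    return bool(laterality(source_name) - laterality(target_name))


# Made-up concept names, not real vocabulary entries.
pairs = [
    ("Example disorder, right eye", "Example disorder"),
    ("Example disorder, left eye", "Example disorder of left eye"),
    ("Example disorder, unspecified eye", "Example disorder"),
]
for source_name, target_name in pairs:
    print(source_name, "->", target_name, loses_laterality(source_name, target_name))
```

Flagged pairs would still need expert review, because laterality can also be kept in a modifier or in the preserved source value rather than in the target concept name.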

Jackalope Plus benchmarking and a structured mapping workspace

The Jackalope Plus poster expanded on the challenge findings and proposed a three-mode mapping workspace: 1) manual mapping; 2) AI-assisted mapping with expert selection; 3) AI-generated mapping followed by structured expert QA. The value of this model is not that one mode is universally superior. It allows the workflow to be adapted to concept complexity, mapping risk, terminology maturity, and the level of evidence needed for approval.

https://ohdsi-europe.org/images/symposium-2026/posters/75_poster.pdf
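
One way to think about choosing among the three modes is as a routing rule over candidate confidence and mapping risk. The sketch below is an assumption-laden illustration: the function name, the score scale, and both thresholds are invented here and do not describe how Jackalope Plus routes work.

```python
from enum import Enum


class Mode(Enum):
    MANUAL = "manual mapping"
    AI_ASSISTED = "AI-assisted mapping with expert selection"
    AI_WITH_QA = "AI-generated mapping followed by structured expert QA"


# Placeholder thresholds on a 0..1 candidate score; real values would need calibration.
LOW_CONFIDENCE = 0.5
HIGH_CONFIDENCE = 0.9


def choose_mode(top_score: float, high_risk: bool) -> Mode:
    """Pick a workflow mode from the best candidate's score and a risk flag."""
    if top_score < LOW_CONFIDENCE:
        return Mode.MANUAL
    if high_risk or top_score < HIGH_CONFIDENCE:
        # High-risk concepts always get an expert choosing among candidates.
        return Mode.AI_ASSISTED
    return Mode.AI_WITH_QA


print(choose_mode(0.95, high_risk=False).value)
print(choose_mode(0.95, high_risk=True).value)
```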

First-episode psychosis study protocol

The first-episode psychosis project included 10 phenotype definitions, 986 diagnosis codes, and 11 vocabularies, with validation work conducted through OHDSI Phenotype Phebruary. The planned study asks which antipsychotic treatment pathways are associated with better outcomes in people with first-episode psychosis. OMOP is the infrastructure that makes a distributed study possible; it is not the clinical evidence itself. That distinction matters. A common data model can make an analysis executable across sites, but it cannot by itself guarantee that the phenotype, exposure definition, confounding strategy, outcome model, or source data are clinically valid.

https://ohdsi-europe.org/images/symposium-2026/posters/99_poster.pdf

Six Trends Shaping Real-World Evidence Infrastructure

The poster programme gave a clear picture of where the community's attention is. The official event report described 130 poster presentations and 15 software demonstrations. A substantial portion of the programme addressed data standards and management, including ETL pipelines, terminology mapping, vocabularies, data quality, and reproducible infrastructure.
The recurring question was not simply whether an organization could transform data into OMOP. It was how to keep that transformation accurate, reproducible, and useful over time.

1) OMOP implementation is becoming a production discipline

More mature organizations are treating OMOP as a maintained data product rather than a one-time conversion. That means accounting for source-system changes, incremental loading, vocabulary releases, CDM upgrades, quality monitoring, lineage, deployment reproducibility, and regression testing. For clients, this changes the scope of an OMOP programme. The deliverable is no longer only an ETL script and a populated schema; it is an operating model for detecting and managing change.
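
Regression testing in this setting can start very simply, for example by comparing per-table row counts between two ETL runs. The sketch below assumes the counts have already been collected into dictionaries; the function name, the 5% tolerance, and the numbers are illustrative, and a real check would sit alongside tools such as Achilles and DQD rather than replace them.

```python
def row_count_regressions(
    previous: dict[str, int],
    current: dict[str, int],
    max_drop: float = 0.05,
) -> dict[str, tuple[int, int]]:
    """Return tables whose row count fell by more than max_drop between runs.

    A table missing from the current run is treated as having zero rows.
    """
    flagged = {}
    for table, before in previous.items():
        after = current.get(table, 0)
        if before > 0 and (before - after) / before > max_drop:
            flagged[table] = (before, after)
    return flagged


# Made-up counts for three OMOP CDM tables.
previous_run = {"person": 1000, "condition_occurrence": 5000, "drug_exposure": 2000}
current_run = {"person": 1005, "condition_occurrence": 4000, "drug_exposure": 1950}
print(row_count_regressions(previous_run, current_run))
```

A drop is not always an error, since a vocabulary release or a corrected source extract can legitimately change counts, but it is the kind of change that should be explained before results are used.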

2) Vocabularies and mappings are critical infrastructure

ICD-to-SNOMED, drug normalization, LOINC hierarchy, RxNorm Extension, local vocabularies, exact, broad, and narrow mapping relationships, semantic drift, mapping QA – none of this is auxiliary terminology work. It determines whether cohorts, exposures, outcomes, and cross-database comparisons are actually reliable. A structurally valid OMOP database may therefore remain analytically unreliable if its mappings are incomplete, overly broad, semantically inconsistent, or no longer aligned with the vocabulary release in use.
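
Checking alignment with the vocabulary release in use can be sketched as a lookup of each mapping target in the CONCEPT table. The example below assumes the relevant rows have been loaded into a dictionary keyed by concept ID; `standard_concept` and `invalid_reason` are CONCEPT columns in the OMOP CDM, while the function name, the codes, and the IDs are made up.

```python
from typing import Optional


def stale_mappings(
    mappings: dict[str, int],
    concepts: dict[int, dict[str, Optional[str]]],
) -> list[tuple[str, int, str]]:
    """List mappings whose target is missing, invalid, or non-standard in this release."""
    problems = []
    for source_code, target_id in mappings.items():
        concept = concepts.get(target_id)
        if concept is None:
            problems.append((source_code, target_id, "target not in release"))
        elif concept["invalid_reason"] is not None:
            problems.append((source_code, target_id, "target invalid"))
        elif concept["standard_concept"] != "S":
            problems.append((source_code, target_id, "target not standard"))
    return problems


# Toy release: one valid standard concept, one deprecated, one non-standard.
concepts = {
    1: {"standard_concept": "S", "invalid_reason": None},
    2: {"standard_concept": None, "invalid_reason": "D"},
    3: {"standard_concept": None, "invalid_reason": None},
}
mappings = {"LOCAL-A": 1, "LOCAL-B": 2, "LOCAL-C": 3, "LOCAL-D": 4}
for problem in stale_mappings(mappings, concepts):
    print(problem)
```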

3) AI-assisted curation is growing – but it needs expert review

LLM-guided mapping, clinical NLP, annotation pipelines, AI-assisted SQL generation, knowledge graphs, conversational cohort tools – the AI layer is expanding across the OMOP ecosystem. FinnGen's 520,000 genotyped individuals are being used as an external genetic benchmark for phenotype quality – one example of AI-assisted methods being validated against independent evidence. The consistent message across posters and sessions wasn't "AI replaces experts." It was controlled AI embedded in workflows with human validation, provenance tracking, and QA built in.

4) Federated evidence generation is the practical model

Data stays local, especially in Europe. Shared protocols run across different sites. Outputs get standardized and compared. DARWIN EU, EHDEN, HERON-UK, cross-Nordic collaborations – these aren't pilot projects anymore. National Nodes are the coordination structures making it work country by country. Cross-Nordic federated machine learning on colorectal cancer data is running across Danish and Norwegian datasets – data staying local while insights travel, in practice.

5) The ecosystem is building more tools on top of OMOP

ATLAS 3.0, OmopViewer, OmopStudyBuilder, Shiny apps, dashboards, cohort tools, sandbox environments, MCP servers. OMOP is becoming a platform, not just a database format. The tooling layer is what makes standardized data usable for researchers and clinicians who aren't OMOP specialists.

6) The clinical expert / data engineer interface is where it gets hard

Phenotypes, concept sets, mapping rules, clinical criteria, validation, interpretation – this is where clinical meaning becomes computable logic. It's also where the most errors enter the pipeline and where the most expertise is required. Getting a diagnosis code right in a mapping table is one thing. Getting the clinical logic of a cohort definition right – timing, exclusions, edge cases, what counts as exposure – is another. Organizations therefore need review processes in which clinical experts, terminology specialists, data engineers, and study-methods specialists can inspect the same assumptions in forms each discipline can evaluate.

What These Trends Mean for Organizations Using OMOP

Six trends came through clearly at the symposium. SciForce is active in five of them – vocabulary infrastructure, AI-assisted mapping, GIS extensions, federated evidence generation, and the clinical expert / data engineer interface – and has been for most of the past decade.

- Vocabulary infrastructure
The problem the community keeps hitting – mapping quality, local terminology that doesn't connect to standard concepts, QA that doesn't scale – is the infrastructure SciForce has been building. Not as background work, but as the layer that determines whether everything built on top of it is actually reliable.

- LLM-assisted mapping
MindMeetsMachines showed Jackalope Plus working at production scale. What came out of the competition is a clear picture of what the next version needs: better broadMatch and narrowMatch logic, improved candidate ranking, uncertainty handling, explainability for expert review. A tool that already handled a full batch competitively among ten AI teams, with a concrete development list.

- GIS extensions and EXTERNAL_EXPOSURE
EXTERNAL_EXPOSURE takes OMOP into territory it didn't cover before – air pollution, deprivation indices, housing conditions, climate data. The same infrastructure work, applied to environmental and social exposures rather than clinical ones.

- Where the demand is growing
Sweden, Denmark, Ireland, France – more and more European systems are moving from initial OMOP adoption toward something they can maintain long-term. The consistent gap at that stage is semantic: vocabulary coverage, mapping quality, local terminology, QA.

Conclusion

Patrick Ryan's keynote ended with a question to the room: how would achieving the OHDSI dream impact you? The top answer, from 24 people, was: it helps explain medical mysteries that still haunt us.
The Rotterdam programme reinforced a practical lesson: reliable real-world evidence depends on more than converting records into a common schema. It requires maintainable ETL, defensible semantic mappings, transparent phenotype logic, expert validation, and governance capable of keeping all of these elements aligned as data and standards change.
SciForce contributes at that intersection of clinical meaning, terminology engineering, and data implementation. For organizations already using – or preparing to adopt – OMOP, a useful next step is to examine not only whether their data conforms to the CDM, but whether its semantic decisions can be reviewed, reproduced, and maintained over time.

Telehealth Platform Architecture: Building Secure, Scalable Virtual Care Systems

SciForce — Tue, 07 Jul 2026 14:21:30 +0000

Introduction

Building a telehealth platform at clinical scale means solving for hospital network restrictions, HIPAA compliance and auditability, and the data load of continuous remote monitoring – and the architecture decisions that determine whether it holds up are mostly made in the first few sprints.

The engineering debt from early decisions starts showing up at scale: video sessions dropping when hospital firewalls, restrictive egress policies, or network address translation prevent a direct media path; or PHI exposure in multi-party call architectures where the media-routing and encryption model was never explicitly defined. In healthcare, that debt is unusually expensive to carry – $7.42 million per breach on average. Breaches involving stolen credentials took an average of 292 days to identify and contain, although that is a cross-industry figure from the same analysis rather than a healthcare-specific average. With the virtual care market valued at $141.19 billion, that debt compounds across an expanding surface area.

What makes telehealth architecture genuinely hard is that these layers depend on each other: a service-boundary decision affects identity, data access, observability, and the potential blast radius of a failure; an EHR integration pattern determines what the AI layer can actually see at intake; and a WebRTC topology choice determines whether a clinical session survives a degraded hospital network at all.

Virtual Care Platform Architecture

Solving for video first is intuitive, but at scale the video feed is only one part of the stack. The real challenge is building an ecosystem where your backend choices don't box you into a compliance corner or prevent you from integrating with a hospital's existing EHR. The platform is a clinical workflow and data platform with a real-time communication layer. We'll break down how to structure your backend, handle patient data interoperability, and configure real-time streaming so the platform holds up under clinical, rather than consumer, load.

Modular Monolith vs. Microservices Telehealth Backend

In a HIPAA-regulated environment, backend architecture is partly a security decision. Service isolation can reduce the blast radius of a vulnerability, but only when service identities, credentials, networks, data stores, and authorization policies are genuinely isolated. A notification service and an EHR connector are not separated merely because they are deployed as different services.

The case for microservices comes with real operational overhead from day one:

Service discovery, inter-service auth, and distributed tracing all need to be in place before the first clinical workflow runs
CI/CD pipelines that handle dozens of independent deployment targets
Strong internal API contracts, or coupling creeps back in within a few sprints
Security monitoring that correlates activity across services, identities, queues, gateways, and databases

A monolith ships faster and is easier to reason about early on. Its limitations emerge when unrelated clinical capabilities share deployment cycles, privileges, resources, and failure modes. A vulnerability in one module does not automatically compromise the entire application, but insufficient internal separation can make containment and remediation more difficult. Video ingestion and auth also have very different load profiles, and scaling them together may be inefficient.

The hybrid path – a modular monolith with enforced internal boundaries, decomposed into services as compliance requirements and load demand – works, but only if those boundaries are drawn correctly from the start. Retrofitting them is the expensive version of the same decision.

The correct choice is therefore not “microservices or monolith” in isolation. It is the smallest architecture that can enforce the required security boundaries, scale the workloads that actually differ, and remain operable by the team responsible for it.

EHR Integration and Interoperability Architecture

Epic, Athenahealth, and Oracle Health, including former Cerner platforms, each expose patient data differently. Although standardized FHIR APIs are increasingly available, supported resources, profiles, authorization flows, write capabilities, app-registration processes, and local configurations still vary. A telehealth platform that builds direct connectors to each one ends up maintaining three separate integration layers that break independently. U.S. interoperability policy is moving the ecosystem toward standardized APIs, but the obligations are not uniform. ONC certification requirements have expanded access to FHIR-based APIs in certified health IT, while CMS-0057-F primarily requires specified payers to implement or expand patient, provider, payer-to-payer, and prior-authorization APIs beginning in 2027. These rules do not create universal bidirectional EHR write access for every telehealth platform.

FHIR Data Modeling and Clinical Semantics

FHIR R4 is what makes vendor-agnostic integration viable. A conformant FHIR interface gives these systems a common exchange model; it does not require them to use the same internal database schema. A result from a remote monitoring session can be represented as a FHIR Observation and, where the receiving system supports the required workflow, written back to the patient record. That still requires:

implementation profiles such as US Core or relevant specialty guides;
standardized terminologies and units;
patient and device identity resolution;
provenance and source-system metadata;
validation of clinically plausible values;
duplicate and correction handling;
authorization for the intended read or write operation.
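
As an illustration of several items on that list, the sketch below builds a heart-rate reading as a FHIR R4 Observation with a LOINC code, UCUM units, and a device reference, and rejects implausible values before anything is sent. The function name, the identifiers, and the plausibility bounds are assumptions; profile conformance, identity resolution, and authorization are out of scope here.

```python
import json


def heart_rate_observation(
    patient_id: str, bpm: float, effective: str, device_id: str
) -> dict:
    """Build a minimal FHIR R4 Observation for a heart-rate reading."""
    # Illustrative plausibility bounds only, not clinical guidance.
    if not 20 <= bpm <= 300:
        raise ValueError(f"implausible heart rate: {bpm}")
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/observation-category",
                "code": "vital-signs",
            }]
        }],
        "code": {
            "coding": [{
                "system": "http://loinc.org",
                "code": "8867-4",
                "display": "Heart rate",
            }]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "device": {"reference": f"Device/{device_id}"},
        "effectiveDateTime": effective,
        "valueQuantity": {
            "value": bpm,
            "unit": "beats/minute",
            "system": "http://unitsofmeasure.org",
            "code": "/min",
        },
    }


observation = heart_rate_observation(
    "example-patient", 72, "2026-07-01T10:15:00Z", "example-device"
)
print(json.dumps(observation, indent=2))
```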

FHIR resources are containers for exchange; semantic interoperability depends on how those resources are profiled, coded, validated, and governed. HL7 v2 messaging still runs in most hospital systems alongside FHIR, which means the integration layer needs to handle both, often simultaneously.

API Gateway, Access Control, and EHR Adapter Design

An API gateway sitting in front of the EHR is the primary healthcare API security control – handling read/write permissions, rate limiting, audit trails, and access control before patient data reaches the application layer. An API gateway alone does not normalize clinical semantics. Vendor-specific capabilities are typically handled through an integration layer or adapter model that translates supported workflows into an internal canonical contract while preserving source provenance. SMART on FHIR standardizes important authorization and application-launch patterns, but local registration and configuration differences remain.

EHR Write-Back, Synchronization, and Provenance

Reading from an EHR is often more manageable than writing back. Updating problem lists, posting encounter notes, or synchronizing medication changes from a remote session requires permissions and workflows that vendors and healthcare organizations may restrict. Write-back also requires explicit decisions about:

which system is authoritative;
whether an update creates, replaces, amends, or appends a record;
how concurrent changes are detected;
how retries remain idempotent;
how rejected or partially accepted updates are surfaced;
how provenance is retained.

This is a common source of production failure, and where the architecture needs explicit handling rather than optimistic assumptions.
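
Idempotent retries are one of the easier items on that list to sketch. The example below derives a key from the encounter and a canonical form of the payload, so a retried submission is not sent twice. The class, the in-memory store, and the `send` callable are hypothetical stand-ins; a production system would persist keys durably and follow whatever idempotency mechanism the target EHR actually supports.

```python
import hashlib
import json
from typing import Callable


def idempotency_key(encounter_id: str, payload: dict) -> str:
    """Stable key for one logical update: same encounter and content, same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{encounter_id}:{canonical}".encode()).hexdigest()


class WriteBackQueue:
    """Remembers completed updates so retries return the stored result."""

    def __init__(self, send: Callable[[dict], dict]) -> None:
        self._send = send
        self._results: dict[str, dict] = {}

    def submit(self, encounter_id: str, payload: dict) -> dict:
        key = idempotency_key(encounter_id, payload)
        if key in self._results:
            return self._results[key]  # already delivered, do not resend
        result = self._send(payload)   # if this raises, nothing is recorded
        self._results[key] = result
        return result


sent = []


def fake_send(payload: dict) -> dict:
    """Stand-in for an EHR client call; records what was sent."""
    sent.append(payload)
    return {"status": "accepted"}


queue = WriteBackQueue(fake_send)
queue.submit("enc-1", {"note": "follow-up in two weeks"})
queue.submit("enc-1", {"note": "follow-up in two weeks"})  # retry of the same update
print(len(sent))
```

Because a failed send records nothing, a later retry goes through again, while a duplicate of a successful update is answered from the store.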

WebRTC Architecture for Low-Latency Clinical Video

Many hospital networks run symmetric NAT configurations that block the direct peer-to-peer connections WebRTC attempts by default. This affects an estimated 30–40% of hospital networks – meaning a platform without a properly configured TURN relay server will drop sessions for a significant share of clinical users. TURN acts as a media relay when a direct connection is unavailable, but it adds network distance, infrastructure cost, and another availability dependency. TURN capacity should be deployed regionally and tested against the target environments. The relevant quality budget includes round-trip time, jitter, packet loss, bitrate adaptation, retransmission behavior, and recovery after network changes; there is no universal latency threshold that guarantees clinical adequacy for every workflow.
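
When testing against target environments, it helps to know how many sessions actually end up on the relay. ICE candidates carry a `typ` field (`host`, `srflx`, `prflx`, or `relay`), so a rough measure can be taken from the selected local candidates. The sketch below assumes those candidate lines have already been collected from client statistics; the function names and the sample lines, which use documentation IP ranges, are illustrative.

```python
from collections import Counter


def candidate_type(candidate_line: str) -> str:
    """Return the value after 'typ' in an ICE candidate line.

    Raises ValueError if the line has no 'typ' token.
    """
    tokens = candidate_line.split()
    return tokens[tokens.index("typ") + 1]


def relay_share(selected_candidates: list[str]) -> float:
    """Fraction of sessions whose selected candidate was a TURN relay."""
    if not selected_candidates:
        return 0.0
    counts = Counter(candidate_type(line) for line in selected_candidates)
    return counts["relay"] / len(selected_candidates)


# One selected candidate per session (made-up values).
selected = [
    "candidate:1 1 udp 2122260223 192.0.2.10 54321 typ host",
    "candidate:2 1 udp 1686052607 198.51.100.7 40000 typ srflx raddr 192.0.2.10 rport 54321",
    "candidate:3 1 udp 41885439 203.0.113.20 50000 typ relay raddr 198.51.100.7 rport 40000",
    "candidate:4 1 udp 41885439 203.0.113.21 50001 typ relay raddr 198.51.100.8 rport 40001",
]
print(relay_share(selected))
```

A high relay share at a given site is a signal to check TURN capacity and regional placement for that site.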

Codec selection – the algorithm used to compress and decompress the video stream – is a decision most platforms make once and rarely revisit. RFC 7742 mandates VP8 and H.264 as the WebRTC compliance baseline, but in a clinical context the choice has direct diagnostic implications: block artifacts from an under-resourced codec can obscure a skin lesion in a dermatology consult, and a codec that throttles on older hardware locks out the remote patient monitoring (RPM) use case entirely. The right choice depends on the device profile of the clinical population, the bandwidth constraints of the deployment environment, and whether the platform needs to support high-resolution diagnostic imaging or general teleconsultation.

High-resolution radiology or pathology images should generally remain in validated DICOM or specialty image workflows rather than being treated as ordinary WebRTC video. Live video and diagnostic image exchange are related but distinct requirements.

Session state is where the reliability requirements of clinical video diverge from consumer tooling. A mid-assessment dropout needs:

reconnect logic that restores the same encounter;
graceful degradation to audio-only when video becomes unsustainable;
clear network-quality indicators;
explicit signaling when image quality may be inadequate for the assessment;
safe handling of recording, consent, and participant changes.
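
The first two items can be expressed as a small decision function over measured network quality. The sketch below is illustrative only: the function name, the state labels, and both thresholds are placeholders, since no single threshold guarantees clinical adequacy and real values would be tuned per workflow.

```python
# Placeholder limits; real values depend on the clinical workflow and codec.
MAX_LOSS_FOR_VIDEO = 0.10      # fraction of packets lost
MAX_RTT_FOR_VIDEO_MS = 800.0   # round-trip time in milliseconds


def next_media_state(connected: bool, packet_loss: float, rtt_ms: float) -> str:
    """Decide what the session should do next given current network quality."""
    if not connected:
        # Rejoin the same encounter rather than starting a new one.
        return "reconnect"
    if packet_loss > MAX_LOSS_FOR_VIDEO or rtt_ms > MAX_RTT_FOR_VIDEO_MS:
        return "audio-only"
    return "video"


print(next_media_state(True, 0.02, 120.0))
print(next_media_state(True, 0.25, 120.0))
print(next_media_state(False, 0.0, 0.0))
```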

Case Study: HIPAA-Compliant Mental Health Platform with Role-Based Access Control

A digital mental health company had a system that stored neuropsychological test results in unstructured formats, had no role-based access controls, and used a general-purpose video tool without an appropriate Business Associate Agreement for the way PHI was being handled. For a platform handling sensitive psychiatric assessments, that was not a viable compliance position.

The rebuilt platform ran on AWS storage services covered by an applicable BAA, with the platform controls configured according to the project’s risk assessment and the cloud shared-responsibility model. The data layer used managed PostgreSQL with encrypted storage, controlled backup and restoration procedures, and TLS-protected network connections. Encryption keys and administrative access were managed separately from ordinary application credentials.

The access model was the architectural centerpiece: six distinct roles – Doctor, Nurse, Administrator, Analyst, Patient, and Family Member – each with explicitly scoped permissions. A Doctor initiates and reviews assessments. A Family Member can view only the information explicitly authorized by the patient or permitted through an applicable legal-representative relationship. An Analyst receives only the minimum data required for the approved analytical purpose, using de-identified or appropriately limited data where feasible. Every data access event (read, write, or export) was logged.
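
A deny-by-default permission check of this kind can be sketched in a few lines. The example covers only the three roles whose permissions are described above; the permission strings, the `Patient` class, and the function name are hypothetical and do not reproduce the project's actual access model.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical permission names for the roles described in the text.
ROLE_PERMISSIONS: dict[str, frozenset[str]] = {
    "Doctor": frozenset({"assessment:initiate", "assessment:review"}),
    "Analyst": frozenset({"dataset:read_deidentified"}),
    "Family Member": frozenset({"record:view_shared"}),
}


@dataclass
class Patient:
    patient_id: str
    # User IDs the patient has explicitly authorized to view shared information.
    shared_with: set[str] = field(default_factory=set)


def is_allowed(
    role: str,
    action: str,
    user_id: Optional[str] = None,
    patient: Optional[Patient] = None,
) -> bool:
    """Deny by default; shared-record access also needs the patient's grant."""
    if action not in ROLE_PERMISSIONS.get(role, frozenset()):
        return False
    if action == "record:view_shared":
        return patient is not None and user_id in patient.shared_with
    return True


patient = Patient("p1", {"u42"})
print(is_allowed("Doctor", "assessment:review"))
print(is_allowed("Family Member", "record:view_shared", "u42", patient))
print(is_allowed("Family Member", "record:view_shared", "u99", patient))
```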

To make the trail tamper-evident, security-relevant logs must be protected from ordinary application administrators through append-only storage, cryptographic integrity controls, or an appropriately isolated logging service.
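
One of the cryptographic integrity controls mentioned here is a hash chain, in which each entry's hash covers the previous entry's hash. The sketch below shows the idea with an in-memory list; the function names and the event fields are assumptions, and it is not a description of the logging used in the project.

```python
import hashlib
import json

GENESIS = "0" * 64  # hash used as the predecessor of the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append_entry(log: list[dict], event: dict) -> dict:
    """Append an event whose hash depends on the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, event),
    }
    log.append(entry)
    return entry


def verify(log: list[dict]) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_entry(audit_log, {"user": "u1", "action": "read", "record": "r10"})
append_entry(audit_log, {"user": "u2", "action": "export", "record": "r10"})
print(verify(audit_log))

audit_log[0]["event"]["action"] = "write"  # simulate tampering
print(verify(audit_log))
```

A chain only makes edits detectable. Someone able to rewrite the whole log could recompute every hash, so the latest hash still needs to be stored somewhere ordinary administrators cannot change, which is where append-only storage or an isolated logging service comes in.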

The reporting layer was built for use during active consultations: five visualization types per patient, a presentation mode for structured in-session review, and a treatment dynamics dashboard tracking progress across visits. Clinicians could pull up a patient's full testing history mid-consultation without leaving the interface.

Patient retention increased 20%, administrative workload dropped 30%, and infrastructure costs came down 25% in the evaluated project period.

Telehealth Security, Compliance, and Data Protection

In 2024, the HHS Office for Civil Rights confirmed 663 breaches affecting 242.9 million individuals – the Change Healthcare attack alone accounting for 192 million of them. For a telehealth platform, the entry points are not just the obvious ones: compromised video sessions, abused EHR APIs, unsecured patient endpoints, and vulnerable third-party SDKs all sit within the attack surface.

HIPAA sets the U.S. regulatory baseline for covered entities and business associates. For processing subject to the GDPR, health data are special-category data under Article 9. The organization must identify an applicable Article 6 legal basis and Article 9 condition, implement controller–processor agreements where required, assess international-transfer mechanisms, and apply privacy principles such as minimization, purpose limitation, and accountability. Explicit consent is one possible Article 9 condition, not a universal requirement, and GDPR does not impose a blanket rule that health data remain inside the EU.

HIPAA, GDPR, and International Compliance Requirements

HIPAA compliance in telemedicine software development extends further than the clinical data layer. A vendor that creates, receives, maintains, or transmits PHI on behalf of a covered entity or another business associate may qualify as a business associate and require a BAA. The analysis depends on the vendor’s role; not every network conduit or unrelated supplier is automatically a business associate. The 60-day breach notification clock starts from the point of discovery, which means forensic logging and incident response need to be built into the platform architecture before they're needed.

For platforms subject to GDPR, Article 9 requirements do not map cleanly onto HIPAA. The architecture may need to support:

documented legal bases and processing purposes;
granular consent where consent is the selected legal basis;
withdrawal and restriction workflows;
data-subject rights;
controller-processor agreements;
Data Protection Impact Assessments for high-risk processing;
lawful safeguards for transfers outside the EEA;
country-specific health-data requirements.

EU deployment is not simply a configuration switch on top of a HIPAA-oriented build. It requires deliberate decisions about legal roles, purposes, data flows, retention, cross-border transfers, and how patient rights are operationalized.

End-to-End Encryption for Patient-Doctor Consultations

Scaling a video session beyond two participants typically requires a relay server – a Selective Forwarding Unit (SFU) that receives media streams from each participant and forwards them to the others. Standard WebRTC uses DTLS-SRTP to protect media in transit. In a conventional SFU topology, the encrypted transport terminates at the SFU, meaning the server may have access to media content even though the links between each endpoint and the SFU are encrypted. The SFU does not need to decode the clinical meaning of the video to route it, but transport encryption alone does not necessarily provide end-to-end confidentiality from the server operator.

SFrame closes that gap by encrypting at the application layer before the media reaches the relay. The relay sees only the routing metadata it needs; the decrypted content never exists on the server.

At-rest encryption is the baseline. The implementation question is key management – whether encryption keys live in a dedicated service separate from the data they protect, or co-located with the infrastructure in a way that makes a single compromised credential a full exposure event. Separate key scopes for independent data classes and services can reduce blast radius, but the exact boundary should follow the system’s threat model, operational requirements, and recovery design.

Identity Management, MFA, and Audit Logging

Clinical environments create identity management conditions that standard enterprise architectures aren't designed for. Shared workstations, shift handoffs, and time pressure during patient care generate workarounds – shared credentials, persistent sessions, unlocked terminals – that undermine access controls regardless of how well they're designed on paper.

HIPAA's minimum necessary standard requires organizations to take reasonable steps to limit unnecessary or inappropriate access and disclosure. Authorization should therefore be enforced by trusted backend services and data-access policies, not only by hiding interface elements. The mental health platform described earlier enforced permissions at the API level – a request made with a family member's credentials directly against the API, bypassing the UI, returns the same result the UI would show: patient-facing summaries, not the underlying psychiatric assessment data.

Multi-factor authentication addresses the credential compromise vector that role granularity alone doesn't close. MFA enforced only at login doesn't protect a session hijacked after authentication. Token lifetimes, session invalidation on role change, and re-authentication requirements for high-sensitivity actions – exporting patient data, accessing psychiatric notes – are what determine whether MFA is a meaningful control or a compliance checkbox.
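
The session rules described above (token lifetime, invalidation on role change, step-up authentication for sensitive actions) can be expressed as a single check that runs on every request. The sketch below is illustrative only: the time limits, action names, and the `role_version` counter are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical values; real limits should come from the platform's risk assessment.
HIGH_SENSITIVITY = {"export_patient_data", "read_psychiatric_notes"}
MAX_SESSION_AGE_S = 8 * 3600  # session lifetime
MAX_MFA_AGE_S = 5 * 60        # how recent MFA must be for sensitive actions


@dataclass
class Session:
    user_id: str
    role_version: int   # bumped server-side whenever the user's roles change
    issued_at: float    # seconds since epoch
    last_mfa_at: float  # seconds since epoch


def check_session(session, action, current_role_version, now):
    """Return 'allow', 'require_mfa', or a deny reason for this request."""
    if session.role_version != current_role_version:
        return "deny_session_invalidated"  # roles changed after the session was issued
    if now - session.issued_at > MAX_SESSION_AGE_S:
        return "deny_session_expired"
    if action in HIGH_SENSITIVITY and now - session.last_mfa_at > MAX_MFA_AGE_S:
        return "require_mfa"  # step-up: re-authenticate before proceeding
    return "allow"
```

Passing `now` in explicitly keeps the function easy to test and avoids hidden dependence on the server clock.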

Audit logging commonly becomes incomplete when it is implemented in only one layer. A useful audit model correlates:

identity-provider and MFA events;
API authorization decisions;
application activity;
database access;
administrative and configuration changes;
exports and bulk downloads;
failed authentication and access attempts.

Database auditing can strengthen coverage, but it cannot by itself capture all application intent, identity context, exports, or authentication events. The platform therefore needs centralized, time-synchronized, access-controlled logs with retention and integrity protections appropriate to incident investigation. That completeness determines whether an incident can be scoped accurately or has to be treated as a worst-case exposure when the 60-day breach notification clock starts running.

Real-Time Patient Data Integration

A telehealth consultation is a snapshot. The clinical picture between sessions – blood pressure trends, glucose fluctuations, cardiac irregularities – lives in the data coming off wearables and home monitoring devices. Getting that data into the platform reliably, and in a format clinicians can act on, is where most integrations run into trouble.

Wearable and IoT Device Data Integration

Device manufacturers have no obligation to structure data the same way. A cardiac patch and a blood pressure cuff both produce clinical measurements, but the format, unit conventions, and device identifiers in their outputs are proprietary – determined by the manufacturer, not by any shared standard. A platform must not assume that measurements are comparable merely because they share a familiar display label. Getting that into a consistent, queryable format means mapping each device's schema to FHIR Observation resources – and as a 2025 Frontiers in Digital Health study on wearable data integration found, that translation step is still where most implementations run into trouble.

The normalization layer needs active maintenance. Device firmware or API updates can change outputs. Version-aware parsers, schema contracts, unit validation, plausible-range checks, and quarantine of unknown payloads are needed to prevent a changed format from silently becoming a clinically misleading value.
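
As a rough sketch of that normalization layer, the example below registers one parser per vendor and schema version, applies a plausible-range check, and quarantines anything it does not recognize. The vendor name `acme_cuff`, both payload shapes, and the range limits are invented for illustration; the LOINC code 8480-6 (systolic blood pressure) and the UCUM unit `mm[Hg]` are real, and the output is a simplified FHIR Observation that omits required context such as subject and timestamp.

```python
PARSERS = {}
quarantine = []  # (payload, reason) pairs held for human review

# Hypothetical plausibility limits in mmHg; real limits need clinical sign-off.
PLAUSIBLE_RANGE = {"8480-6": (40.0, 300.0)}
KPA_TO_MMHG = 7.50062


def parser(vendor, version):
    """Register a parser for one exact (vendor, schema version) pair."""
    def register(fn):
        PARSERS[(vendor, version)] = fn
        return fn
    return register


@parser("acme_cuff", "1")
def parse_acme_v1(payload):
    # Assumed v1 shape: systolic pressure in mmHg under "sys".
    return {"code": "8480-6", "value": float(payload["sys"])}


@parser("acme_cuff", "2")
def parse_acme_v2(payload):
    # Assumed v2 shape: the field was renamed and the unit changed to kPa.
    return {"code": "8480-6", "value": float(payload["systolic_kpa"]) * KPA_TO_MMHG}


def normalize(payload):
    """Return a simplified FHIR Observation dict, or None if quarantined."""
    key = (payload.get("vendor"), payload.get("schema_version"))
    parse = PARSERS.get(key)
    if parse is None:
        quarantine.append((payload, "unknown vendor or schema version"))
        return None
    try:
        measurement = parse(payload)
    except (KeyError, TypeError, ValueError) as exc:
        quarantine.append((payload, f"parse error: {exc!r}"))
        return None
    low, high = PLAUSIBLE_RANGE[measurement["code"]]
    if not low <= measurement["value"] <= high:
        quarantine.append((payload, "value outside plausible range"))
        return None
    return {
        "resourceType": "Observation",
        "status": "preliminary",  # device data not yet reviewed by a clinician
        "code": {"coding": [{"system": "http://loinc.org", "code": measurement["code"]}]},
        "valueQuantity": {
            "value": round(measurement["value"], 1),
            "unit": "mmHg",
            "system": "http://unitsofmeasure.org",
            "code": "mm[Hg]",
        },
    }
```

The range check is what catches the quiet failure: a kPa value of 16 arriving in a field the parser treats as mmHg is rejected instead of being stored as a systolic pressure of 16.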

Regulated devices generally provide more formal documentation and change control, but neither regulatory clearance nor vendor documentation guarantees that an integration interface will remain unchanged. Consumer devices may introduce additional limitations in validation, intended use, data access, and measurement accuracy.

The transport layer – Bluetooth for short-range continuous monitoring, cellular for devices that need to transmit independently of local network availability – is well established but not operationally trivial. Pairing failures, battery depletion, offline periods, clock drift, duplicate delivery, device replacement, and home-network variability remain part of the clinical data-quality problem.

Remote Patient Monitoring Architecture and Alert Design

Continuous glucose monitoring generates a reading every five minutes. With population-level alert thresholds, many of those readings will trigger alerts on values that are abnormal for the population but normal for that patient. Across a full monitoring panel, that alert volume is what makes clinicians stop trusting the channel.

A meta-analysis of 19 randomized controlled trials in chronic heart failure patients found that telehealth monitoring reduced all-cause hospitalization (OR=0.63) and heart failure-related hospitalization (OR=0.70) compared to standard care. Patient-specific baselines, persistence rules, rate-of-change criteria, symptom context, medication changes, and clinician-approved escalation policies may improve alert relevance. These rules must be clinically validated rather than inferred solely from retrospective data.

A useful RPM program must define not only when an alert fires but also:

who is responsible for reviewing it;
during which hours monitoring occurs;
how quickly the patient should expect a response;
what happens when data stop arriving;
how false positives and missed events are reviewed;
whether the platform is monitoring, screening, diagnosing, or recommending treatment.
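
To make the idea of a patient-specific baseline with a persistence rule concrete, here is a minimal sketch. The z-score limit, the three-reading persistence window, and the `min_sigma` floor are placeholder assumptions, not clinically validated settings.

```python
from statistics import mean, stdev


def should_alert(history, recent, z_limit=3.0, persistence=3, min_sigma=1.0):
    """Alert only when the last `persistence` readings all deviate from the
    patient's own baseline in the same direction.

    history: the patient's earlier readings, used as the baseline
    recent:  the newest readings, oldest first
    """
    if len(history) < 2 or len(recent) < persistence:
        return False  # not enough data to judge; handle missing data separately
    baseline = mean(history)
    # Floor the spread so a very stable baseline does not make every change alarming.
    sigma = max(stdev(history), min_sigma)
    window = recent[-persistence:]
    all_high = all((value - baseline) / sigma > z_limit for value in window)
    all_low = all((value - baseline) / sigma < -z_limit for value in window)
    return all_high or all_low
```

With a persistence window of three, a single outlier reading does not fire, while a sustained shift does. That trade of a short delay for fewer false alarms is exactly the kind of decision that needs clinical approval per measurement type.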

Edge Processing for Time-Sensitive Clinical Alerts

Some cardiac events may require rapid detection, while others can be reviewed asynchronously. Edge processing can reduce dependence on network availability and preserve timely local alerting, but the appropriate latency requirement depends on the device’s intended use and clinical risk. On-device inference keeps the threshold logic local – it fires in milliseconds, regardless of what the network is doing.

Cloud processing handles the rest: longitudinal trend analysis, ML model training, audit storage. PHI leaving the device introduces transit compliance requirements – encryption in transit, documented data flows, BAA coverage for every service it passes through.

Edge processing also introduces responsibilities:

safe model and ruleset updates;
rollback;
device attestation;
version tracking;
battery and compute constraints;
consistency between edge and cloud decisions;
auditability of which algorithm version generated an alert.

Escalation logic is where most RPM programs underinvest early. Which alerts reach which clinician role, at what threshold, through which channel, and with what escalation path if unacknowledged should be defined before deployment.
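
An escalation policy becomes testable once it is written down as data instead of living in people's heads. The sketch below assumes a made-up policy (role names and acknowledgment windows are hypothetical) and answers one question: who currently owns an unacknowledged alert?

```python
# Hypothetical policy: (responsible role, minutes allowed to acknowledge).
ESCALATION_POLICY = {
    "critical": [
        ("on_call_nurse", 5),
        ("on_call_physician", 10),
        ("clinical_supervisor", 15),
    ],
    "routine": [("care_team_inbox", 240)],
}


def current_owner(severity, minutes_since_fired, acknowledged=False):
    """Return (role, overdue) for an open alert, or None once acknowledged.

    overdue is True when every acknowledgment window in the path has expired.
    """
    if acknowledged:
        return None
    steps = ESCALATION_POLICY[severity]
    deadline = 0
    for role, ack_window in steps:
        deadline += ack_window
        if minutes_since_fired < deadline:
            return (role, False)
    # The path is exhausted: the last role still owns it, flagged as overdue.
    return (steps[-1][0], True)
```

For example, `current_owner("critical", 7)` returns the physician step because the nurse's five-minute window has passed. The overdue flag gives the program a concrete signal to review, instead of an alert that silently ages in a queue.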

Once clinicians lose confidence in the alert channel, recovering trust usually requires more than a threshold change: alert ownership, workflow fit, explainability, response burden, and historical false-positive patterns may all need to be addressed.

Case Study: Personalized Lab Result Analysis with OCR, NLP, and Longitudinal Trends

A result can remain inside a population reference interval while forming part of a clinically relevant longitudinal trend. Whether that trend warrants attention depends on the analyte, biological and analytical variation, specimen conditions, comorbidities, medication, and the broader clinical context. A single change from 5.2 to 6.1 mmol/L should not automatically be treated as clinically significant.
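
One statistic commonly used in laboratory medicine to account for biological and analytical variation is the reference change value (RCV). It is offered here as an illustration of the reasoning, not as the method used in the project. The coefficients of variation below are hypothetical, and exceeding the RCV indicates only that a change is unlikely to be noise, not that it is clinically significant.

```python
import math


def reference_change_value(cv_analytical, cv_within_subject, z=1.96):
    """RCV in percent: sqrt(2) * z * sqrt(CVa^2 + CVi^2), with CVs in percent."""
    return math.sqrt(2) * z * math.sqrt(cv_analytical ** 2 + cv_within_subject ** 2)


def percent_change(previous, current):
    """Relative change between two serial results, in percent."""
    return (current - previous) / previous * 100.0


def exceeds_rcv(previous, current, cv_analytical, cv_within_subject, z=1.96):
    """True if the change is larger than expected from variation alone."""
    rcv = reference_change_value(cv_analytical, cv_within_subject, z)
    return abs(percent_change(previous, current)) > rcv


# The same 5.2 -> 6.1 mmol/L change (about 17.3%) under two hypothetical analytes:
stable_analyte = exceeds_rcv(5.2, 6.1, cv_analytical=3.0, cv_within_subject=5.0)
variable_analyte = exceeds_rcv(5.2, 6.1, cv_analytical=3.0, cv_within_subject=8.0)
```

Under these assumed values the change exceeds the RCV for the low-variation analyte (about 16.2%) but not for the high-variation one (about 23.7%), which is why the same numeric change cannot be judged without knowing what was measured.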

The system built for a physician office laboratory network put the personalization into the pipeline. Incoming results – images, scans, emails from lab machines – were processed through OCR and NLP to extract structured data. Because OCR and NLP errors can change clinical meaning, extracted values, units, reference intervals, patient identifiers, and dates require confidence thresholds, validation rules, and a human-review pathway before clinical use. The analytical layer preserved the laboratory’s original reference interval while calculating longitudinal trends and deviations from the patient’s prior results. It did not redefine a laboratory reference range without clinical and analytical validation. Patient-facing explanations used clinician-approved language and clearly distinguished the reported result, general educational context, and any instruction to contact a healthcare professional. Any functionality that interprets results or recommends clinical action requires separate intended-use and regulatory assessment.

The delivery architecture was a microservice API on Kubernetes – separate endpoints for lab machine integration, IoT and health app integration, and white-label deployment – with long-term storage on AWS for historical trend analysis.

AI-Powered Triage and Patient Intake Portals

The routing decision a triage portal makes depends entirely on what data it can see. An RPM alert that fired at 6am, for example an irregular heart rhythm or blood pressure outside the patient's baseline, needs to be in front of the urgency scoring layer when the patient opens the intake form at 9am. If the portal is pulling from a static patient profile rather than a live feed from the monitoring and EHR layers, that alert isn't there, and the routing call gets made without it.

Clinician override paths, escalation policy, and mandatory human review thresholds need to be defined before go-live. In the United States, the regulatory status of smart triage algorithms depends on their intended use, intended user, output, level of automation, clinical risk, and whether the user can independently review the basis of the recommendation. Some CDS functions are excluded from the statutory definition of a medical device; others may be regulated device software and require an appropriate premarket pathway. Health-system procurement and governance reviews commonly examine model validation, algorithm-change control, human oversight, cybersecurity, data use, incident responsibility, and monitoring after deployment. The exact requirements vary by institution and the product’s regulatory status.

Conclusion

If the architecture decisions in this article feel familiar, some of them are probably already in production. The ones that aren't yet constrained (backend structure, EHR integration pattern, encryption model) are cheaper to revisit now than after the first hospital deployment exposes them.

A useful check: if a hospital network blocked the preferred video path today, could the platform fail over through a tested TURN configuration without changing application code? If a device changed its payload format, would the platform reject the unknown version or silently store a plausible-looking measurement? If an alert went unacknowledged, would the system know who was responsible for the next action?

Those questions reveal more about clinical readiness than a successful demonstration call.

SciForce works at this architecture layer: secure virtual care platforms, EHR and device interoperability, remote-monitoring pipelines, real-time health data analytics, and clinically responsible AI integration. For organizations moving from a functional prototype to a platform that must survive hospital networks, security review, and real clinical use, these are the decisions worth resolving early. If any of this looks familiar, let's talk.

AI in Medical Imaging: Improving Diagnostic Accuracy and Workflow

SciForce — Wed, 01 Jul 2026 16:11:18 +0000

Introduction

A radiologist on a standard hospital shift may read dozens to well over a hundred imaging studies, depending on subspecialty, setting, shift structure, and case complexity. Each one is a search for something that might be subtle, easy to miss, or buried in noise. At that volume, a non-trivial discrepancy or error rate is a known risk in radiology practice, especially under high workload and time pressure. Radiologists are working through growing imaging volumes with a workforce that has never fully caught up with demand, and fatigue, interruptions, case complexity, and system design all contribute.

AI is starting to move selected parts of this problem: deep learning reconstruction has cut MRI scan time by over 50% in clinical studies for selected protocols and institutions without sacrificing image quality, and CT nodule detection sensitivity has reached 95% in peer-reviewed benchmarks, although performance depends heavily on dataset, lesion type, threshold, and false-positive burden.

Adoption is accelerating because the evidence base is maturing in several high-volume use cases, and imaging volumes keep growing faster than the workforce can absorb them. In practice, most deployments stall on the same two problems: getting AI outputs into the radiologist's existing PACS view without a custom engineering project, and convincing clinical staff to trust a system they didn't ask for. A third problem is often underestimated: proving that the model still performs on local scanners, protocols, patient demographics, and reporting workflows after it leaves the benchmark dataset. Health systems getting real value from AI have approached it as a workflow problem rather than a software purchase.

Why AI Adoption in Medical Imaging Is Accelerating

AI in medical imaging has moved well past the research phase. By late 2025, the FDA had cleared 873 AI-enabled medical imaging devices, up from a handful a decade ago, and radiology accounts for the bulk of those approvals – which reflects both the maturity of imaging AI research and the volume of real-world deployment data now available to regulators. Regulatory clearance, however, is not the same as clinical fit. A cleared model may still fail to deliver value if it is poorly matched to local imaging protocols, patient mix, infrastructure, turnaround-time goals, or reporting culture.

Convolutional Neural Networks and Vision Transformers: What They Actually Do in a Clinical Context

Most approved medical imaging AI runs on convolutional neural networks, vision transformers, or a hybrid of both – and which one a vendor is using can give useful clues about the model’s inductive biases, data requirements, and likely failure modes.

CNNs are fast, well-validated, and good at finding what they were trained to find. The problem is they process an image through local filters that are progressively combined into higher-level features, which means anything that only makes sense in relation to distant anatomy may require additional architectural mechanisms, training examples, or post-processing to capture reliably. A subtle mediastinal shift suggesting tension pneumothorax is a useful example of a finding where local opacity detection alone is insufficient; the model must learn the broader spatial relationship between lung volume, mediastinum, pleural space, and clinical urgency.

Vision Transformers use attention mechanisms that can model relationships across distant image regions, which helps with spatially distributed findings. The catch is they need far more training data to generalize, and in radiology that data is harder to come by than the research papers suggest. Mixed scanner vendors, inconsistent acquisition protocols, and retrospective annotation noise are the norm, and each one degrades generalization.

Most serious radiology AI systems now combine both architectures. When deployment underperforms, one common cause is predictable: the model was validated on data that looked nothing like the department it landed in – different scanner manufacturers, different kV settings, different patient demographics. Validation results on data that matches your specific equipment and case mix matter more than aggregate benchmark performance across a curated dataset.
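
A simple way to see whether aggregate performance hides a local problem is to break validation results down by scanner, site, or protocol. The sketch below assumes each validated case is a `(group, truth, predicted)` tuple; the group labels are hypothetical.

```python
from collections import Counter


def metrics_by_group(cases):
    """Compute sensitivity and specificity separately for each group.

    cases: iterable of (group, truth, predicted), with truth/predicted as booleans.
    """
    counts = {}
    for group, truth, predicted in cases:
        c = counts.setdefault(group, Counter())
        if truth and predicted:
            c["tp"] += 1
        elif truth:
            c["fn"] += 1
        elif predicted:
            c["fp"] += 1
        else:
            c["tn"] += 1

    def ratio(numerator, denominator):
        # None signals "no cases of this kind", which is different from 0.0.
        return numerator / denominator if denominator else None

    return {
        group: {
            "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
            "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
            "n": sum(c.values()),
        }
        for group, c in counts.items()
    }
```

If one scanner family shows materially lower sensitivity than the pooled figure, that gap is the number to raise with the vendor. Small groups produce unstable estimates, so the `n` per group matters as much as the ratio.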

Feature Extraction in X-Ray and MRI Scans

Feature extraction is the part where model performance either holds up in the real world or doesn't. A chest X-ray model needs to distinguish ground-glass opacity from consolidation, and catch a small pleural effusion at the costophrenic angle that a fatigued reader might scroll past. Whether it does that reliably comes down to whether the training data reflected enough real-world imaging variation to teach the model where the boundaries actually are.

SciForce ran into this directly on a lung pathology detection project. The client needed to identify TB and COVID-19 from chest X-rays – two diseases whose visual signatures overlap, present differently depending on stage and patient demographics, and had already defeated several general-purpose classifiers the client tried before. The dataset was messy in the way real clinical data is: variable image quality, mixed acquisition conditions, no clean benchmark to train against. EfficientNet-B7 was picked after architecture selection focused on robustness to input-resolution variation and efficient scaling across network depth, width, and image resolution. The system reached 95% diagnostic accuracy and cut manual image review time by 25% in the evaluated project setting, and those numbers held in deployment because the development process prioritized deployment-representative data rather than a clean but artificial benchmark.

Generative AI for Image Enhancement and Reconstruction

The data problem in radiology AI is straightforward: real-world clinical datasets don't contain enough examples of rare pathologies or underrepresented patient populations to train models that generalize reliably. Generative AI is one promising tool for reducing that gap, but it is not a substitute for real-world validation.

Diffusion models can synthesize realistic imaging data conditioned on specific demographic and pathological characteristics, meaning a model may be augmented with synthetic examples designed to improve coverage of rare or underrepresented cases, provided that synthetic images are clinically reviewed and evaluated against external real-world data. A 2025 study found that adding synthetic data improved rare pathology detection by 33% AUC – a result that is encouraging but should not be generalized beyond the tested dataset and task. How synthetic data generation fits into an AI development pipeline is covered in more depth in our dedicated article.

Modality translation – generating a contrast-enhanced MRI sequence from a non-contrast scan – is an active research and validation area for situations where contrast is contraindicated or unavailable. Instead of rescheduling or accepting a diagnostic compromise, the model may estimate contrast-like information from already acquired data, but such outputs require strict validation and should not be treated as equivalent to acquired contrast-enhanced imaging unless cleared and clinically validated for that use.

Diffusion-based image reconstruction is the third application, and operationally the most visible. By reconstructing diagnostic-quality images from undersampled acquisition data, these models cut MRI scan time without degrading image fidelity in selected protocols – which means shorter scan sessions, fewer motion artifacts from patient movement, and better scanner utilization across the department when local validation confirms diagnostic non-inferiority. The same approach applies to low-dose CT and PET, where the tradeoff between radiation exposure and image quality has always been a clinical constraint.

Among these three areas, reconstruction currently has the clearest operational pathway in selected clinical imaging workflows, while synthetic augmentation and modality translation remain more dependent on use-case-specific validation, governance, and regulatory context. Published benchmarks are useful, but they do not replace local performance testing.

Reducing Human Error in Oncology and Radiology

Diagnostic error in radiology is largely a variability problem – the same chest X-ray read differently by two radiologists depending on when in the shift it lands. More precisely, diagnostic variability reflects perceptual limits, reader experience, workload, clinical context, report urgency, and uncertainty in the image itself. AI's most measurable contribution is narrowing that variability on high-volume, protocol-driven reads where the target finding is well-defined and the operating threshold is clinically acceptable.

Early Cancer Detection: Sensitivity vs. Specificity

Sensitivity and specificity are the two numbers that actually determine what a deployed AI system does to a radiology department – how many cancers it catches, and how many unnecessary callbacks it generates. Getting that balance right for a specific clinical context is harder than picking the model with the best published AUC.

The MASAI trial – 80,000 women, prospective, randomized, run inside a structured mammography screening program – found AI-supported screening detected cancer at 6.4 per 1,000 screened vs 5.0 in the control group, while cutting radiologist screen-reading workload by 44.2%. Those numbers came from a model tuned for two requirements that pull in opposite directions: sensitive enough to catch early-stage cancers, specific enough to keep false positive callbacks manageable. The trial conditions were controlled enough that both held simultaneously.

Lung cancer tells a more complicated story. A UK study evaluating seven commercially available AI devices against over 5,000 chest radiographs found sensitivity ranging from 20.8% to 77.8% across products – the weakest system missed nearly 80% of lung cancers, while false positive counts ranged from 10 to 2,039 per system. Three devices outperformed radiologists; four didn't. "AI for lung cancer detection" is not one thing, and the performance spread between products is wider than most clinical AI literature suggests.

That's the question to bring into any procurement conversation: at what sensitivity/specificity operating point was this model validated, does that match your clinical use case, and how does it perform against other products in the same category? The next question is just as important: how will performance be monitored after deployment, once scanner settings, patient flow, and reader behavior begin to change?
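
The operating-point question can be made concrete with a few lines of code: the same model, scored on the same cases, produces different sensitivity and specificity depending on the threshold chosen. The scores and labels below are toy values invented for the example, and the function assumes the data contain at least one positive and one negative case.

```python
def operating_point(scores, labels, threshold):
    """Sensitivity, specificity, and false-positive count at a given threshold.

    scores: model outputs in [0, 1]; labels: 1 for disease present, 0 for absent.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positives": fp,
    }


# Toy data: three positives followed by five negatives.
scores = [0.9, 0.8, 0.4, 0.7, 0.45, 0.2, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

strict = operating_point(scores, labels, threshold=0.5)
lenient = operating_point(scores, labels, threshold=0.35)
```

Lowering the threshold here recovers the missed positive but adds a false positive. A vendor's headline sensitivity is only meaningful alongside the specificity and callback volume at that same threshold.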

Computer-Aided Diagnosis (CAD) as a Second Opinion

CAD comes in two forms that get conflated constantly. CADe marks regions the radiologist may have missed and flags them for review. CADx goes further, characterizing what a finding might mean: malignancy likelihood, probable stage, tissue type. The distinction matters before any procurement conversation gets to pricing.

CADe targets a specific failure mode – perceptual error, where the finding was on the image but wasn't caught. Around 35% of lung nodules are missed during screening for this reason, a rate that reflects the difficulty of sustained pattern recognition at clinical volume. A system applying identical detection criteria to every scan, without fatigue, addresses that gap directly but may also create new false-positive, false-negative, and automation-bias risks if not governed carefully.

SciForce's lung pathology detection system treated sequencing as an architecture decision from the start. Prioritization logic restructured which cases reached the radiologist first – the AI changed the order of work rather than inserting findings into the read. False positive rate was a specific development target throughout, because once radiologists start treating alerts as noise, no amount of model accuracy recovers the clinical value.

For oncology AI beyond imaging, SciForce's lung cancer and lymphoma case study and two-part series on AI in cancer care (part 1, part 2) cover what rigorous validation looks like across the full treatment pathway.

Streamlining Radiology Workflows with Automated Prioritization

Most radiology worklists run on a first-in, first-out or urgency-modified queue, depending on institution and workflow rules. Under plain FIFO, a non-contrast head CT showing intracranial hemorrhage waits behind a routine knee MRI that arrived earlier: the queue has no awareness of what's inside. The same is true of a chest X-ray with pneumothorax if the workflow has no reliable way to surface critical findings early. In 2024, 976,000 scans waited more than one month in the UK alone – a 28% increase from 2023, described by the Royal College of Radiologists as the worst reporting backlog on record. Fixing the sequencing problem is where AI makes its most immediate operational difference.

Smart Triage: Prioritizing Critical Findings in the Worklist

Unlike FIFO, AI triage scores incoming scans for urgency and reorders the queue in real time. It also introduces a failure mode FIFO doesn't have: a scan whose finding the model missed gets pushed to the bottom and can wait longer than it would have under the original system. A waiting time ceiling – automatic escalation of any scan that has waited beyond a defined threshold, regardless of AI confidence score – is the fix. Most deployments don't include it by default.
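
A waiting time ceiling can be implemented as part of the sort key itself. In this sketch, the field names and the 60-minute ceiling are assumptions, and promoting overdue studies ahead of every AI-prioritized case is one possible policy among several an institution might choose.

```python
def order_worklist(studies, now_min, ceiling_min=60):
    """Order studies by AI urgency, but never let one wait past the ceiling.

    studies: list of dicts with "id", "ai_score" (0..1), and "arrived_min"
             (arrival time in minutes on the same clock as now_min).
    """
    def sort_key(study):
        waited = now_min - study["arrived_min"]
        if waited >= ceiling_min:
            # Overdue: goes to the front regardless of score, oldest first.
            return (0, study["arrived_min"], 0.0)
        # Otherwise: highest AI score first, ties broken by arrival time.
        return (1, -study["ai_score"], study["arrived_min"])

    return sorted(studies, key=sort_key)
```

A study the model scored near zero still reaches a radiologist within a bounded time, which caps the cost of a false negative at the ceiling instead of leaving it open-ended.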

When triage is built correctly, the turnaround time gains are well-documented. AI worklist prioritization reduced average pneumothorax report turnaround time from 80.1 minutes to 35.6 minutes compared to FIFO in a workflow simulation study. For intracranial hemorrhage, where delayed intervention directly affects survival, published implementation studies have reported shorter notification or turnaround times, and some before/after analyses have observed lower 30- and 120-day mortality after AI implementation; these findings are clinically important but should be interpreted with study-design limitations in mind.

SciForce's lung pathology detection system reduced critical case review time by 30–40% in the project setting – pneumonia, TB, and COVID-19 findings surfaced at the top of the queue while routine studies waited, without requiring radiologists to interact with a separate interface.

Integration with PACS (Picture Archiving and Communication Systems)

PACS is the system radiologists work in – it stores, retrieves, and displays medical images at the workstation where reads happen. Any AI tool that doesn't surface its output there gets treated as optional, and optional tools don't get used consistently. This is where most radiology AI deployments underperform in practice – the model works, but results land in a separate viewer that radiologists check when they remember to.

Institutions procuring AI tools from multiple vendors typically discover the integration problem after contracts are signed. Each tool arrives with its own PACS connection requirements and its own result format – what looked like a four-tool deployment becomes four separate integration projects, each with its own maintenance cycle. RSNA's 2024 IHE guidance addresses this directly: a standards-based orchestration layer – one integration point routing studies to the right models and returning results in a PACS-compatible format – keeps that from happening. The time to specify it is before the first vendor conversation, not after the third tool is already in production.
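
At its core, the routing half of such an orchestration layer is a table that maps study attributes to models. The sketch below is a deliberately small illustration and does not implement any IHE profile: the model names are hypothetical, while `Modality` and `BodyPartExamined` are standard DICOM attributes.

```python
# Hypothetical routing table: match criteria on DICOM attributes -> model name.
ROUTES = [
    ({"Modality": "CT", "BodyPartExamined": "HEAD"}, "ich_detector"),
    ({"Modality": "CR", "BodyPartExamined": "CHEST"}, "pneumothorax_detector"),
    ({"Modality": "DX", "BodyPartExamined": "CHEST"}, "pneumothorax_detector"),
]


def models_for(study):
    """Return the models whose criteria all match the study's attributes."""
    return [
        model
        for criteria, model in ROUTES
        if all(study.get(attr) == value for attr, value in criteria.items())
    ]
```

Adding a second or third model then means adding rows to one table, not building another PACS connection. Studies that match no route simply pass through to the normal worklist.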

For buyers, this changes the procurement checklist. The question is not only “What is the model’s AUC?” but also: Can it exchange DICOM objects and AI results using standards-based workflows? Can it write results back into the reporting environment? Can it be monitored? Can it fail safely? Can the hospital add a second or third model without rebuilding the integration layer?

Beyond Detection: From Imaging AI to Decision Support

In systematic reviews, override rates for rule-based CDSS alerts reach 90% – clinicians have learned to click through them, including when the alert is genuine.

Modern AI-driven CDSS can be designed to incorporate richer clinical context than fixed rule-based alerts, but they require careful validation, explainability, governance, and monitoring before they should influence care pathways. A rule-based system flags every patient on two specific medications regardless of whether the clinical team already reviewed the interaction last week. An AI-CDSS looks at the full picture – prior tolerance, current clinical context, actual patient-level benefit-risk information – but only if those data are available, reliable, and clinically validated for the intended use. When AI was integrated into the clinical decision pathway for intracranial hemorrhage, 30-day mortality dropped from 27.7% to 17.5% in one before/after implementation study.

SciForce case – Patient Similarity Networks

The analysis surfaced a long-term cardiovascular signal that hadn't been visible in the fragmented source data — the kind of finding that only becomes computable when patient records are standardized and analyzed at scale. The data existed – EHRs, lab results, genetic profiles – but it was scattered across incompatible systems and impossible to analyze as a whole.

SciForce brought it together into a single standardized environment, then built networks that clustered patients by how similar they actually were – clinically, genetically, demographically. For the first time, the client could see what happened to comparable patients over time: which cardiovascular events occurred, when, and in which subgroups.

The project illustrates a broader point: image-level AI is only one layer of clinical intelligence. For many medical and life-science questions, the harder task is connecting model outputs to standardized longitudinal data, comparable patient cohorts, outcomes, and evidence that can withstand clinical, regulatory, or payer scrutiny.

Conclusion

Most radiology AI can tell you it flagged something. Few can tell you why – and fewer still can tell you what happened to the last hundred patients where it flagged the same thing. That's the difference between a detection tool and a system that actually supports a decision.

The practical question is not whether an AI model can detect a finding in a benchmark dataset. The question is whether it can be validated on local data, integrated into the clinical workflow, monitored after deployment, and connected to decisions that matter. SciForce works at that intersection: medical AI development, imaging pipelines, standardized clinical data, workflow integration, and evidence generation. For teams moving from model output to clinically usable systems, that is where the real work begins.

If that's the problem you're solving, let's talk.

Sustainable AI: Strategies for Managing Compute Costs and Energy Efficiency

SciForce — Wed, 10 Jun 2026 13:51:20 +0000

Introduction

In 2025, the world’s data centers consumed 485 terawatt-hours (TWh) of electricity, with AI-related demand growing at 50%. By 2030, consumption is expected to reach 950 TWh – roughly twice today's level and approximately equal to the entire electricity consumption of Japan. Goldman Sachs forecasts that about 60% of new demand will be met by burning fossil fuels, increasing global carbon emissions by 220 million tons. And as the chart below shows, the emissions cost escalates sharply with each new generation of frontier model.

Better efficiency is part of what's driving this. The IEA reports that power consumption per AI task is declining at a rate it calls "unprecedented in energy history", but cheaper inference doesn’t reduce the overall footprint, because the savings are reinvested in growth. Five major tech companies collectively spent over $400 billion on data center infrastructure in 2025, with more planned for 2026.

Sustainable AI development is about treating compute the way we are used to treating any finite resource like oil: instrumenting, finding leaks, optimizing. What we see repeatedly, working with AI-driven organizations, is that the waste has usually been accumulating for months before anyone has the visibility to catch it. The organizations that fix that tend to discover that sustainability and cost reduction are the same project: reducing AI carbon footprint and minimizing the infrastructure bill turn out to be the result of the same AI cost optimization actions.

Energy-Efficient Model Training Techniques to Lower Resource Consumption

When talking about AI's energy footprint, the first instinct is to look at infrastructure: data centers and cooling systems are budgetable, and renewable energy contracts seem like a logical path to optimization. But the most powerful lever is the model itself: its architecture sets the ground for everything that follows. With training costs running from $79 million for GPT-4 to $170 million for Llama 3.1-405B, and frontier runs already being discussed in the billion-dollar range, getting architecture right has become as much a financial and environmental decision as an engineering one.

Weight pruning and model distillation

Think of a trained neural network as a dense web of numerical connections – millions or billions of them. Pruning asks which of those connections are actually doing useful work, and removes the ones that aren't. The result is a smaller, faster model that retains most of what the original learned. CMU's Bonsai method achieves 50% sparsity on a single consumer-grade GPU, with the resulting models running twice as fast as those produced by older weight-pruning techniques. The accuracy tradeoff that made pruning impractical is shrinking.
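To make the idea concrete, here is a minimal sketch of unstructured magnitude pruning, the simplest form of the technique: drop the weights with the smallest absolute values. It is not the Bonsai method, and the function name and sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices of the k weights closest to zero
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Hypothetical layer weights, pruned to 50% sparsity
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, 0.4, 0.0]
```

Real pruning pipelines usually remove whole structures (heads, channels, layers) and fine-tune afterwards, because scattered zeros alone rarely speed up inference on standard hardware.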

Knowledge distillation takes a complementary approach: instead of trimming an existing model, you train a smaller one to replicate the outputs of a larger one. The large model acts as a teacher; the smaller one learns to match its behavior on the tasks that matter. In production, distilled models can meaningfully reduce inference compute at negligible quality loss, though the savings depend on how far the student model departs from the teacher's architecture.

Quantization: from FP32 to INT8 and beyond

Every number stored inside a neural network takes up memory and costs compute to process. Model quantization reduces the precision of those numbers — from the 32-bit floating-point decimals (FP32) models are typically trained with, down to 16-bit floats (FP16), or simpler 8-bit and 4-bit integers (INT8, INT4). Less precision means smaller models that run faster and cost less to serve, while the quality loss turns out to be negligible in most cases.
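A minimal sketch of what "reducing precision" means in practice, assuming the simplest scheme (symmetric, per-tensor INT8 quantization). Production toolchains use more refined methods; the function names and sample values here are illustrative only.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)  # [50, -127, 0, 100]

# Storage per weight relative to FP32, from bit width alone
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}
relative_memory = {name: bits / BITS["FP32"] for name, bits in BITS.items()}
print(relative_memory)  # {'FP32': 1.0, 'FP16': 0.5, 'INT8': 0.25, 'INT4': 0.125}
```

The rounding error per weight is bounded by half the scale factor, which is why small weights such as 0.003 collapse to zero while the overall distribution is preserved.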

[Figure: Memory footprint (relative)]

Modern AI chips are physically designed to run faster at lower precision. Nvidia built its latest data center GPUs to accelerate INT8 and lower formats natively — so running a quantized model isn't fighting against the hardware, it's working with it. Researchers at the University of Washington measured up to ~8× higher serving throughput at INT4 compared to FP16, with only 1.4% accuracy loss on a 65-billion parameter model.

Until recently, quantizing a model this large required a rack of expensive server-grade GPUs. LeanQuant, presented at ICLR 2025, showed that it can be done on two off-the-shelf consumer GPUs in under a day.

Low-rank adaptation (LoRA) for efficient fine-tuning

Fine-tuning, or adapting a pre-trained model to a specific task or domain, traditionally means updating all of the model's weights on new data. For large models, that's computationally expensive and slow. LoRA sidesteps the problem by freezing the original model entirely and training only a small set of additional parameters that sit alongside it. The base model stays untouched; only the adapter gets updated.

The memory savings are significant. A 2025 benchmark found that LoRA-adapted Llama 3.1 8B required less than 9 GB of GPU memory, down from over 30 GB for full fine-tuning, while still outperforming the untuned base model by 36%. Combined with quantization, the gains compound further, making LoRA fine-tuning of large models practical on a single consumer GPU.

The most common failure mode is misconfigured rank – the key parameter that controls how much the adapter can learn. Set it too low and the adapter doesn't have enough capacity to pick up the target domain. Set it too high and you give back most of the memory savings. The subtler risk is queries that fall outside what the adapter was trained on: LoRA handles these worse than a fully fine-tuned model would, because the frozen base model and the adapter weren't built to work together on unfamiliar inputs. It works best when the target domain is narrow and the training data genuinely reflects production inputs, which is exactly the condition that determines whether the efficiency gains hold or quietly erode.
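The rank tradeoff can be seen from parameter counts alone. For a frozen weight matrix of shape d_out × d_in, a LoRA adapter trains two small matrices of shapes d_out × r and r × d_in. The sketch below assumes a hypothetical 4096 × 4096 projection layer; the dimensions and ranks are illustrative, not taken from the benchmark above.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


D = 4096  # hypothetical hidden size of one projection layer
full = full_params(D, D)

for rank in (8, 64, 512):
    share = 100 * lora_params(D, D, rank) / full
    print(f"rank={rank:>3}: {lora_params(D, D, rank):>9,} trainable params "
          f"({share:.2f}% of full fine-tuning)")
```

At rank 8 the adapter trains well under 1% of the layer's parameters; at rank 512 it trains a quarter of them, which illustrates how a high rank gives back much of the saving.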

In practice: Automated Retraining Without a GPU-Heavy Architecture

A public-sector healthcare organization needed to forecast disease spread across administrative districts: predicting next-day infection counts by location, updated automatically as new epidemiological data was published. The system had to operate without developer oversight: data ingestion, retraining, evaluation, and deployment all fully automated, with no manual quality assurance step in the loop.

The starting problem was the data itself. Incoming datasets had no schema documentation, field meanings had to be reverse-engineered manually, and the pipeline had to handle schema drift without introducing model bias or corrupting the training set. Missing time windows were filled via trend-based extrapolation, and the model could do its work only once that foundation was stable.

SciForce built an LSTM-based forecasting pipeline that ingested newly published public health data on a monthly schedule, retrained automatically, and promoted a new model only if it outperformed the incumbent on MAPE, MAE, and RMSE — otherwise the existing model stayed in production. Predictions were served via a REST API that accepted geographic coordinates, mapped them to administrative tracts, and returned both current case counts and next-day forecasts. The system achieved a MAPE of 5.35% across regional forecasts without a dedicated GPU cluster, and without a human in the retraining loop.
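The promotion rule described above can be sketched as a small gate. This is an illustration under stated assumptions, not the project's code: it reads "outperformed on MAPE, MAE, and RMSE" as "strictly better on all three", assumes non-zero actual values for MAPE, and uses made-up holdout numbers.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def should_promote(actual, incumbent_pred, candidate_pred):
    """Promote the candidate only if it beats the incumbent on every metric."""
    return all(
        metric(actual, candidate_pred) < metric(actual, incumbent_pred)
        for metric in (mape, mae, rmse)
    )


# Hypothetical holdout data: daily case counts and two models' forecasts
actual = [100, 120, 90, 110]
incumbent = [110, 115, 80, 120]
candidate = [102, 118, 93, 108]

print(should_promote(actual, incumbent, candidate))  # True
```

Requiring a win on every metric is a conservative choice: a candidate that improves the average error but worsens large misses (visible in RMSE) leaves the incumbent in production.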

Carbon-Aware Computing Strategies

The electricity powering a data center in Iowa at 2am on a windy Tuesday carries a very different carbon footprint from the same workload running on a coal-heavy grid at peak demand. Carbon-aware computing scheduling is about managing that difference: timing and routing workloads to take advantage of when and where the grid runs cleaner.

The infrastructure investment is already happening. Around 40% of all corporate renewable energy agreements signed in 2025 came from technology companies, and the pipeline of nuclear offtake agreements between data center operators and small modular reactor projects grew from 25 GW to 45 GW in less than a year. The theoretical case for green cloud computing is strong; large-scale, independently audited production results are not yet public.

Scheduling Training Jobs Based on Renewable Energy Availability

The potential of carbon-aware scheduling looks very different depending on which constraints apply to your environment. Research models from UMass Amherst show that freely routing any workload to the greenest region at the greenest time can reduce emissions by up to 96%. Add realistic capacity constraints, where green regions fill up fast and headroom restricts how much can migrate, and that drops to 51%. For organizations where GDPR or HIPAA blocks cross-jurisdiction routing entirely, the only remaining lever is time-shifting within a single region, which delivers around 3% for a long training run. The theoretical maximum for time-shifting alone is 19%, meaning most of the potential is already gone before a line of scheduling code gets written.

The technique applies to training jobs, not inference. Training is a batch workload that can be deferred or rerouted without affecting end users. Inference can't be treated the same way: a live query has no temporal flexibility, and cross-region routing introduces latency most production SLOs won't tolerate. Mapping those two constraints, workload type and data residency rules, is the work that determines whether carbon-aware scheduling is worth pursuing in a given environment, and how much it can realistically deliver.
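Time-shifting within one region reduces to a simple search: given an hourly carbon-intensity forecast, find the contiguous window with the lowest average for a deferrable job. The sketch below uses a sliding window; the forecast values (gCO2/kWh) are invented for illustration, and a real scheduler would pull them from a grid-data provider.

```python
def best_start_hour(intensity_forecast, job_hours):
    """Return (start_index, mean_intensity) of the cleanest contiguous window."""
    n = len(intensity_forecast)
    if job_hours <= 0 or job_hours > n:
        raise ValueError("job_hours must be between 1 and the forecast length")
    window = sum(intensity_forecast[:job_hours])
    best_start, best_sum = 0, window
    for start in range(1, n - job_hours + 1):
        # Slide the window one hour: add the new hour, drop the oldest one
        window += intensity_forecast[start + job_hours - 1] - intensity_forecast[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window
    return best_start, best_sum / job_hours


# Hypothetical 12-hour forecast for a single region, in gCO2/kWh
forecast = [450, 430, 400, 380, 300, 250, 240, 260, 350, 420, 460, 480]
start, mean_intensity = best_start_hour(forecast, job_hours=3)
run_now = sum(forecast[:3]) / 3

print(start, mean_intensity)  # 5 250.0
print(f"saving vs. starting immediately: {100 * (run_now - mean_intensity) / run_now:.0f}%")
```

The toy forecast swings widely, so the saving looks large. A training run that spans many days averages over the daily cycle, which is consistent with the much smaller time-shifting figures cited above.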

Cooling Systems — Air vs. Liquid Cooling in AI Data Centers

Five years ago, a standard server rack drew 5–10 kW. A rack of current AI accelerators draws 60–125 kW. Air-cooled systems handle around 5.4 kW per square meter; direct-to-chip liquid cooling handles 82.7 kW. At that ratio, liquid cooling in AI data centers stops being a preference and starts being a physical necessity.

Three-quarters of facilities still run perimeter air cooling as their primary system, per the Uptime Institute 2025 Cooling Survey. The obstacle is economics — the only publicly available non-vendor retrofit data, a California Energy Commission pilot, puts the simple payback period at around 12 years. The case for switching is capacity: liquid cooling reaches densities air can't. Liquid systems do use more water, though, so the environmental tradeoff is real and worth accounting for.

In practice: Predictive Cooling Maintenance Before Failures Hit Uptime

A data center operator serving enterprise clients in finance, healthcare, and e-commerce had a critical cooling pump that kept failing without warning. Each failure meant unplanned downtime, and standard maintenance cycles weren't catching the issue because the failure only became visible after it had already happened.

The complication: the client had no labeled dataset – no historical record of which sensor readings had preceded past failures. Rather than training a conventional predictive model, SciForce deployed an unsupervised anomaly detection approach using Isolation Forest across data from over 100 sensors monitoring temperature, pressure, and flow rates simultaneously. Multiple algorithms ran in parallel, with a majority-voting system flagging anomalies only when most algorithms agreed, reducing false positives while maintaining sensitivity. Correlation analysis then narrowed the critical monitoring surface from 100+ sensors down to 4 that were directly predictive of failure.
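The majority-voting idea can be sketched without any labeled data. The example below is an assumption-heavy illustration: it swaps in three simple statistical detectors (z-score, IQR, and median absolute deviation) as stand-ins for Isolation Forest and the other algorithms used in the project, and the sensor readings are invented.

```python
import statistics


def zscore_flags(values, threshold=3.0):
    """Flag points far from the mean, measured in standard deviations."""
    mean, stdev = statistics.fmean(values), statistics.pstdev(values)
    if stdev == 0:
        return [False] * len(values)
    return [abs(v - mean) / stdev > threshold for v in values]


def iqr_flags(values, k=1.5):
    """Flag points outside the interquartile fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


def mad_flags(values, threshold=3.5):
    """Flag points using the median absolute deviation (robust z-score)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def majority_vote(values, detectors):
    """Flag a point only when more than half of the detectors agree."""
    votes = [detector(values) for detector in detectors]
    needed = len(detectors) // 2 + 1
    return [sum(point_votes) >= needed for point_votes in zip(*votes)]


# Hypothetical pump outlet temperature readings with one excursion
readings = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 50.1, 49.7, 50.0, 62.5, 50.2, 49.9]
flags = majority_vote(readings, [zscore_flags, iqr_flags, mad_flags])
print([i for i, flagged in enumerate(flags) if flagged])  # [9]
```

Voting trades a little sensitivity for fewer false alarms: a point that only one detector dislikes is ignored, while the genuine excursion is flagged by all of them.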

The Financial ROI of Green AI Infrastructure

Inference costs for a GPT-3.5-level model fell from $20 per million tokens in late 2022 to $0.07 by October 2024: a 286x reduction in under two years. That kind of cost compression makes it easy to treat compute as effectively free. The problem is that aggregate demand grows faster than unit costs fall, and at the scale where AI infrastructure becomes a material line item, the idle waste adds up faster than the per-token savings. An H100 GPU runs $2–4 per hour, billed whether the cluster is active or not. At 70% usage, an 8-GPU cluster carries roughly $3,700–7,000 per month in idle costs alone. The waste is usually visible in the bill but invisible in the system, which is why per-job cost attribution tends to be the first thing that needs fixing.
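The idle-cost figure is simple arithmetic, and it is worth being able to rerun it for your own cluster. The sketch below assumes an average month of 730 billable hours, which is our assumption rather than a figure from the source; with different hour counts or rates the result shifts accordingly.

```python
HOURS_PER_MONTH = 730  # assumption: average month, billed continuously


def monthly_idle_cost(gpus, hourly_rate, utilization, hours=HOURS_PER_MONTH):
    """Cost of the GPU-hours that are billed but not used."""
    return gpus * hourly_rate * hours * (1 - utilization)


low = monthly_idle_cost(gpus=8, hourly_rate=2.0, utilization=0.70)
high = monthly_idle_cost(gpus=8, hourly_rate=4.0, utilization=0.70)
print(f"idle cost per month: ${low:,.0f} to ${high:,.0f}")

# Per-token price drop quoted above: $20 -> $0.07 per million tokens
price_drop = 20 / 0.07
print(f"price reduction: {price_drop:.0f}x")
```

Under these assumptions the idle cost comes out at about $3,500 to $7,000 per month, in the same range as the estimate quoted above.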

Reducing OpEx Through Efficient Compute Utilization

A 2026 empirical study tracking telemetry across 11,791 production GPU jobs found that only 61% of GPU time was doing useful work. The rest split between GPUs sitting empty between jobs and GPUs running a job but stalled rather than computing: that second category alone consumed 10.7% of runtime energy.

Pipeline bubbles are one of the main reasons utilization collapses inside large training runs. When a model trains across hundreds or thousands of GPUs, the work gets split into stages: different GPUs handle different parts of the computation. These stages don't always hand off to each other cleanly, leaving GPUs allocated and billed while they wait for the next stage to be ready. A NeurIPS 2025 paper found pipeline bubbles consume 15–30% of a training job's GPU allocation under typical configurations, exceeding 60% at the largest scales. Fixing the scheduling logic and getting the stages to hand off more cleanly recovered up to 63% more utilization on an 8,000-GPU run.
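A simplified textbook model shows why bubbles appear and how scheduling choices shrink them. For a synchronous, GPipe-style schedule with p pipeline stages and m micro-batches, each GPU idles for p − 1 of the m + p − 1 time slots. This is a sketch of that idealized formula, not the method or the measurements of the paper cited above.

```python
def bubble_fraction(stages, microbatches):
    """Idle share of each GPU's time in an idealized synchronous pipeline."""
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be positive")
    return (stages - 1) / (microbatches + stages - 1)


for m in (8, 32, 64):
    print(f"8 stages, {m:>2} micro-batches: {bubble_fraction(8, m):.1%} idle")
```

In this model, moving from 8 to 64 micro-batches on an 8-stage pipeline cuts the idle share from roughly 47% to under 10%, which is the intuition behind getting stages to hand off more cleanly.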

In practice: Cutting Idle Infrastructure With Event-Driven Processing

A video processing platform handling conversion, compression, and optimization for media companies and individual creators was running servers around the clock — billing continuously regardless of whether any videos were in the queue. During quiet periods like weekends or late nights, CPU and memory sat idle at full cost. During spikes, the same infrastructure couldn't scale fast enough, causing processing backlogs. Manual monitoring staff had to intervene to clear delays and restart failed jobs.

SciForce rebuilt the pipeline around AWS Fargate and Amazon ECS — containers spun up only when a video was uploaded and shut down immediately on completion. A Python-based dispatcher handled routing, error detection, and automatic restarts, eliminating manual oversight entirely. The results: infrastructure costs fell 50%, processing time dropped 40%, concurrent upload capacity doubled, and labor cost from manual monitoring fell 30%. Every gain came from eliminating idle resource consumption — no new hardware, no model changes.

In practice: Reducing LLM Spend by Routing Only the Right Queries to the Model

An enterprise performance management platform had consolidated HR, sales, financial, and operational metrics into a single AI-powered system — but every query, regardless of complexity, was routed through the same LLM path. Pulling a specific sales figure from a known source cost roughly the same as summarizing six months of trend data, because both went through GPT-4. The result was slow response times, high inference costs, and an AI hallucination rate that made some outputs unreliable for business decisions.

SciForce built a hybrid processing layer that separated queries by what they actually required. Simple lookups — employee stats, sales figures, predefined reports — went through vector search and rule-based retrieval. Summarization, trend analysis, and complex analytical tasks went to the LLM. After benchmarking seven models on response speed, deployment cost, and RAG performance, GPT-4o-mini was selected for LLM-routed queries. Guardrails were added to filter queries and validate responses before they reached end users.

The outcome: LLM usage fell 37–46%, AI processing costs dropped 39%, simple lookups got 32–38% faster, and hallucinations fell 68%. Efficiency and quality moved in the same direction, because the system was finally being asked to do what it was designed for.
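The routing decision at the heart of this design can be sketched in a few lines. The version below is deliberately naive: it uses hypothetical keyword patterns to split queries, whereas the production system relied on vector search, rule-based retrieval, and guardrails. The pattern lists and route names are assumptions for illustration.

```python
import re

# Hypothetical patterns; a real router would be tuned on logged queries
ANALYTIC_PATTERNS = [r"\bsummari[sz]e\b", r"\btrends?\b", r"\bwhy\b", r"\bcompare\b", r"\bforecast\b"]
LOOKUP_PATTERNS = [r"\bhow many\b", r"\bwhat (is|was) the\b", r"\bheadcount\b"]


def route(query):
    """Return 'retrieval' for simple lookups and 'llm' for analytical questions."""
    q = query.lower()
    if any(re.search(p, q) for p in ANALYTIC_PATTERNS):
        return "llm"
    if any(re.search(p, q) for p in LOOKUP_PATTERNS):
        return "retrieval"
    return "llm"  # when unsure, fall back to the more capable path


print(route("What was the Q3 sales figure for EMEA?"))   # retrieval
print(route("Summarize six months of sales trends"))     # llm
```

Checking analytical patterns first means a mixed query ("what was the trend in Q3?") still reaches the LLM, so the cheap path only handles questions it can answer reliably.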

Conclusion

A right-sized model running at lower precision generates less heat, which means less cooling load, which means a lower PUE. Fewer retraining cycles mean fewer GPU hours, which shrinks the window that carbon-aware scheduling needs to cover. Taking compute seriously as a resource is what connects all of it. The organizations that do this well tend to find that knowing what's running, what it costs, and whether it needs to be is most of the work.

SciForce works with AI-driven organizations on the full range of these challenges, from the model to the infrastructure bill. If anything in this article looks familiar, let's talk.

Predictive Maintenance in 2026: How AI, Edge Computing, and Agentic Systems Turn Detection Into Action

SciForce — Thu, 04 Jun 2026 15:23:26 +0000

Equipment failures don't happen out of the blue: pressure drifts lower, or a vibration pattern shifts slightly, weeks or months before the failure. None of these changes is big enough to cause an incident on its own, but the trend shows that action is already necessary.

BlueScope, an Australian steel manufacturer, used to monitor its equipment through visual checks and basic low-level switches, until it introduced the Siemens Senseye predictive maintenance system. Half a year after installation, one of the sensors spotted a gradual drop in hydraulic tank levels and sent a warning well before the pressure drop became critical. The maintenance staff had enough time to investigate, find a leak, and fix it in a scheduled maintenance window. Over time, predictive maintenance prevented 1,950 hours of unplanned downtime and 53 complete process interruptions for BlueScope.

The market already recognizes the value: Grand View Research states the global PdM industry was valued at $14.29 billion in 2025, and is expected to reach $98.16 billion by 2033, at a CAGR of 27.9%. Growth is driven by mounting pressure to eliminate unplanned downtime and by the integration of AI and edge computing into maintenance operations. Manufacturing and energy lead adoption today; aerospace and defense is the fastest-growing segment.

How Predictive Maintenance Works in 2026

The hard part is no longer whether a system can detect a subtle signal: it can. What matters more is that the alert reaches the right technician and results in a work order, rather than staying on the dashboard. So what do we let the system handle on its own, and where is human judgement still necessary?

The Architecture Behind Modern PdM

A modern PdM system has four jobs: collect sensor data, transmit and store it reliably, run models that distinguish real signals from noise, and route alerts to the right people in time to act. Each layer depends on the one below it, and each has its own failure mode.

Layer 1 — Sensors and Data Collection

Getting sensor coverage used to be the hard part: multiple devices, cable runs, and commissioning took a financial toll on the business before the work even started. Today, a single wireless unit can measure multiple parameters at the same time, and installation has become easier as well, especially for complex or remote equipment.

But the easier it is to collect data, the more noise there is to deal with. We worked with a data center operator that had 107 sensors running on its cooling system, and one of the pumps still kept failing. With more than a hundred signals available, nobody could see which of them mattered. We compared the sensor data against the failure dates and found that only four sensors consistently changed before each failure. The others were also delivering real data that reflected the state of the system, but that data wasn't relevant to this particular failure.
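One simple way to run that comparison is to ask, for each sensor, how far its readings shift in the window just before each failure relative to its normal variation. The sketch below is an assumption about how such a screen could look, not the analysis used in the project; sensor names, readings, failure indices, and the cut-off of 2.0 are all hypothetical.

```python
import statistics


def pre_failure_shift(readings, failure_idx, window=3):
    """Mean shift before failures, in units of the sensor's baseline std deviation."""
    pre = set()
    for f in failure_idx:
        pre.update(range(max(0, f - window), f))
    pre_vals = [readings[i] for i in sorted(pre)]
    base_vals = [v for i, v in enumerate(readings) if i not in pre]
    spread = statistics.pstdev(base_vals)
    if spread == 0 or not pre_vals:
        return 0.0
    return abs(statistics.fmean(pre_vals) - statistics.fmean(base_vals)) / spread


# Hypothetical hourly readings; failures occurred at indices 7 and 14
sensors = {
    "pump_pressure": [5.0, 5.1, 4.9, 5.0, 4.6, 4.4, 4.2, 5.0, 5.1, 4.9, 5.0, 4.5, 4.3, 4.1, 5.0],
    "ambient_temp": [21.0, 21.2, 20.9, 21.1, 21.0, 21.2, 20.8, 21.1, 20.9, 21.0, 21.2, 21.1, 20.9, 21.0, 21.1],
}
failures = [7, 14]

relevant = [name for name, vals in sensors.items() if pre_failure_shift(vals, failures) > 2.0]
print(relevant)  # ['pump_pressure']
```

A screen like this only ranks candidates. Whether a shortlisted sensor is truly predictive still needs checking against failures the analysis has not seen.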

Layer 2 — Transmission and Storage

Most PdM systems today successfully combine edge and cloud architecture. The ultimate deciding factor is whether the decision needs to happen close to the machine, or whether it requires a wider data overview.

Edge is the default when it comes to high-speed or high-precision operations: servo correction, defect rejection, or safety response can't wait for a network round trip. The same applies to remote or offshore premises, or plants with inconsistent connectivity: if transmission drops, cloud models train on gaps and alerts reflect machine state from hours ago. Another major factor is data control: in heavily regulated industries like oil and gas or aerospace, data can't leave the building, so on-premise deployment is the only viable option regardless of what the scalability argument says.

Cloud is the pick where the system needs a broader view: model training across multiple facilities or long-term trend analysis need more data than a single facility can produce. But this only works if the edge is feeding the cloud consistently — without a reliable learning loop, models run stale and nobody notices until the alerts start missing things they used to catch.

Most organizations outside regulated industries end up combining both edge and cloud. This delivers value only if both layers are well-coordinated — otherwise, the edge runs stale models and the cloud trains on unreliable data.

Layer 3 — Modeling and Anomaly Detection

Most modeling failures come down to trust or time. If the system fires more alerts than the maintenance team can reasonably process, trust erodes. If the model was accurate when deployed but gradually became less reliable as conditions changed, the decline can go unnoticed until something fails that the system should have caught.

Södra's three pulp-and-paper mills had 1000 sensors that produced between 300 and 500 alerts every week because the threshold-based system couldn't distinguish between natural process variation and real failure. When they showed the model what normal operations and failures looked like for each individual asset over time, they started receiving about 20 alarms per week.
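The difference between a plant-wide threshold and a per-asset baseline can be shown with two machines that run at different normal levels. This is a minimal sketch under invented numbers (vibration in mm/s, a fixed limit of 4.0, a 3-sigma band); it is not Södra's model.

```python
import statistics

FIXED_LIMIT = 4.0  # hypothetical plant-wide vibration threshold, mm/s


def fit_baseline(history):
    """Learn what 'normal' looks like for one asset."""
    return statistics.fmean(history), statistics.pstdev(history)


def is_anomalous(value, baseline, k=3.0):
    """Flag a reading that leaves the asset's own k-sigma band."""
    mean, spread = baseline
    return abs(value - mean) > k * spread


# Hypothetical history: a quiet pump and a fan that normally runs rough
history = {
    "pump_a": [1.0, 1.1, 0.9, 1.0, 1.05, 0.95],
    "fan_b": [4.4, 4.6, 4.5, 4.5, 4.55, 4.45],
}
latest = {"pump_a": 2.0, "fan_b": 4.6}

fixed_alerts = {asset: value > FIXED_LIMIT for asset, value in latest.items()}
baseline_alerts = {
    asset: is_anomalous(value, fit_baseline(history[asset]))
    for asset, value in latest.items()
}
print(fixed_alerts)     # {'pump_a': False, 'fan_b': True}
print(baseline_alerts)  # {'pump_a': True, 'fan_b': False}
```

The fixed limit alarms on the fan's ordinary behaviour and misses the pump doubling its vibration; the per-asset baseline does the opposite, which is the mechanism behind the drop in alert volume.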

We ran into both problems when we worked on adding an anomaly detection layer to a client’s monitoring platform. They already had good sensor coverage, sending vibration and temperature data to the cloud, but didn’t have labeled failure data, so we had to train the model from scratch. We assessed several algorithms, selected the most consistent one, and built a retraining scheduler that updates the model every 14 days.

Layer 4 — Routing, Action, and Human Oversight

Detection is important, but the overall value of a PdM deployment depends on who sees the alert and how fast they act on it. The strongest deployments combine automation and human oversight, handling routine steps automatically: routing the anomaly, drafting a work order, checking spare parts, and notifying the necessary team. Ambiguous and consequential cases reach human specialists, and the system should already provide them with the necessary context instead of firing an alert for them to investigate.

Omya caught a developing gearbox bearing fault on one of their roller mills when vibration started drifting 0.5 to 1 mm/s above the model's baseline. A case was opened, the signal tracked over several weeks, and the bearing was replaced before it failed. When the maintenance team intervened, they had a case backed by weeks of trend data.

The SCG Chemicals gas turbine case shows what happens when the system and human experts disagree. In September 2023, the system spotted an anomaly in the turbine's cooling zone and identified a stator ring as the likely source. In December, the manufacturer inspected the turbine and said it was fine. SCG Chemicals didn't force an immediate intervention, but prepared spare parts and waited for the next planned shutdown. When the machine was inspected in June 2024, the damage was confirmed. The model was right, and the incident was resolved successfully because the anomaly and its exact location were detected before the issue became visible to the manufacturer, and the machine was able to operate for eight months after detection.

The Changing Role of the Maintenance Team

PdM doesn’t remove human maintenance work, but it removes much of the overhead around it: manual inspections, unexpected failures at 2am, or chasing a fault across three systems that don’t talk to each other. That overhead is most of what maintenance teams do: industry benchmarks put hands-on wrench time at 18 to 30% for most facilities, meaning 70 to 82% of a technician's day is already going to everything except the skilled work.

What’s filling the freed-up time? In a mature PdM environment, the technician spends more time reviewing what the system flagged and deciding whether action is needed, based on what they know about that specific machine or line. Sometimes the system catches a real developing fault, sometimes it’s reacting to a normal process it hasn't seen before. In the SCG Chemicals case, the anomaly flagged was so subtle that even the manufacturer's inspection couldn't see it 4 months after the initial detection. The human judgement to wait until the scheduled shutdown was right, and no algorithm was positioned to make that decision.

Predictive Maintenance Trends Shaping 2026

The 2026 PdM trends are often presented as a list of separate advances: the models are getting smarter, the sensors are easier to deploy, and edge computing works faster. All of this is true, but the major shift is integration that closes the gap between detection and action: the right alert reaches the right person with context already prepared, the response chain runs without waiting for human initiation, and the system can act on a detected anomaly rather than just report it.

IoT as Operational Infrastructure

IoT sensors are the foundation the rest of the stack runs on — without reliable data coming in, there's nothing for the models to learn from and nothing for the agentic layer to act on. Sensor coverage used to be the hard part, but now entry-level kits from ifm and Tractian cost hundreds of dollars per monitored asset and are installed wirelessly within minutes. The main question now is how to make the most out of data that was already collected.

Ajinomoto's amino acid plant in Eddyville, Iowa had years of process data before they started building a predictive model, and the first task was deciding what data to keep. Shutdowns, upsets, and abnormal operating periods had to come out of the training set. A model that learns from those periods treats disruption as normal and starts flagging healthy operation as suspicious.

Once the baseline was clean, the model flagged a fluidized bed dryer whose blower motor was running harder than it should. No standard alarm had fired. The team inspected during a scheduled wash day and found the bed 80% blocked with caramelized product — cleared on schedule, not during an unplanned stoppage. The plant now avoids 10 to 15 hours of unplanned downtime per month across their monitored assets.

Edge AI — Intelligence at the Source

The more assets you monitor continuously, the more decisions need to happen faster than a cloud round-trip allows. It takes about 200 milliseconds for sensor data to travel to a cloud server and back. On a high-speed production line with built-in inspection checking 600 units per minute, a 200ms delay means 2 potentially defective items may pass through before the system can respond. On a live production line, an electrical fault needs to trigger a shutdown in under 20 milliseconds. With cloud processing taking 50 to 500ms, by the time the response comes back, the safe shutdown window is already closed.
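The latency argument is arithmetic, and it helps to have it in a form you can rerun with your own line speeds and deadlines. The sketch below uses the figures from the paragraph above; the function names are ours.

```python
def units_passed(units_per_minute, latency_ms):
    """How many units move past the inspection point while a response is in flight."""
    return units_per_minute / 60 * latency_ms / 1000


def meets_deadline(latency_ms, deadline_ms):
    """True if the response arrives within the safety window."""
    return latency_ms <= deadline_ms


print(units_passed(600, 200))  # 2.0 units slip through during a 200 ms round trip

# Cloud latency range vs. a 20 ms shutdown deadline
for latency in (50, 200, 500):
    print(latency, "ms ->", "in time" if meets_deadline(latency, 20) else "too late")
```

Even the best-case cloud figure of 50 ms misses a 20 ms shutdown deadline, which is why this class of decision has to run on or next to the machine.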

In 2025, Siemens embedded Armv9-based AI processors directly into production line sensors. When a bearing runs above its optimal temperature range, the sensor slows the motor, rebalances the load, and activates a cooling cycle. The response happens on the chip, without the data leaving the machine.

Most industrial facilities run equipment that's decades old — machines built before wireless connectivity existed, too costly or critical to replace. Edge devices make those assets monitorable without modifying them, acting as an intelligence layer on top of existing infrastructure. Managing that layer across multiple sites and machine types is its own engineering challenge — one we've covered in depth in our guide to DevOps for embedded systems.

Agentic AI — From Prediction to Autonomous Action

At the earliest stages of PdM development, the system’s job ended at the alert, and what happened next depended on how experienced the alert recipient was, whether they spotted the useful alert among a hundred false ones, and whether they were even on shift. With hundreds of assets monitored continuously and edge devices detecting faults in milliseconds, the volume of signals that need a response has outgrown what a human-initiated workflow can keep up with. Agentic AI removes this variability: a detected anomaly can now trigger case opening, relevant data collection, drafting a work order, checking spare parts, and scheduling a technician.

In one research deployment at a ceramic manufacturer in Italy, the system monitored hydraulic presses, kilns, and glazing lines using four specialized agents working in sequence. Sensing agents detected equipment anomalies, reasoning agents classified the fault type and estimated remaining useful life, action agents checked spare parts availability and scheduled the repair, coordination agents managed the handoffs. The system handled 92% of decisions autonomously, escalating the remaining 8% to humans when confidence was low or safety-critical assets were involved.

The SCG Chemicals case shows what that 8% actually protects against. When the system flagged the cooling zone anomaly and the manufacturer's inspection came back clean, no autonomous workflow was positioned to resolve that contradiction. The decision to prepare spare parts and hold until the planned shutdown wasn't a routing step — it was a judgment call about whom to trust, made by a person, without a protocol that would have produced the same outcome automatically.

In agentic PdM, the boundary between what the system can handle and what needs human approval has to be defined before deployment: otherwise, instead of getting value from agentic PdM, companies replace old maintenance problems with governance ones. We've covered the practical steps for governing agentic workflows in more depth separately.
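One way to make that boundary explicit is to write it down as a small, reviewable rule before go-live. The sketch below is illustrative only: the `Diagnosis` fields, the `decide_handling` function, and the 0.85 confidence threshold are assumptions, not details from any deployment described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    asset_id: str
    fault_type: str
    confidence: float        # model confidence in [0, 1]
    safety_critical: bool    # set per asset by the maintenance team


# Hypothetical threshold; the real value is a governance decision.
MIN_AUTONOMOUS_CONFIDENCE = 0.85


def decide_handling(diagnosis: Diagnosis) -> str:
    """Return who owns the next step for a diagnosed anomaly."""
    if diagnosis.safety_critical:
        return "escalate_to_human"
    if diagnosis.confidence < MIN_AUTONOMOUS_CONFIDENCE:
        return "escalate_to_human"
    return "handle_autonomously"


kiln = Diagnosis("kiln-2", "burner_drift", confidence=0.97, safety_critical=True)
press = Diagnosis("press-5", "seal_wear", confidence=0.91, safety_critical=False)
glazer = Diagnosis("glaze-1", "unknown", confidence=0.40, safety_critical=False)

decisions = [decide_handling(d) for d in (kiln, press, glazer)]
```

Keeping the rule this small means the people accountable for the escalation policy can read and sign off on it without reading the agent code.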

Digital Twins — Now Powered by Generative AI

A digital twin is a virtual model of physical equipment where you can simulate failure scenarios, test maintenance strategies, and train predictive models without waiting for real failures. What you can simulate is limited: the twin is only as good as the failure data it's built on — and that same ceiling determines how much an agentic system can handle autonomously. A reasoning agent classifies faults confidently only for failure modes it has seen enough times to recognize. Everything outside that boundary gets escalated to a human, which is exactly the variability agentic AI was supposed to remove.

The limitation shows when it comes to rare failures that don't generate enough training data: a turbine blade fracture that happens once in twenty years is only one example — not enough for a model to learn the pattern. Other failures are disasters you genuinely hope will never happen, which means hoping your training dataset stays empty.

Rather than waiting for rare failures to accumulate, teams use generative models, primarily GANs and diffusion models, to create synthetic datasets that simulate those failure conditions at scale. A model can then be trained on thousands of virtual examples of something that may have occurred once in the real world, or never. A 2026 review of 86 studies on synthetic data in predictive maintenance found this approach being used across heavy machinery and industrial processes, specifically where real failure data is too rare or too consequential to wait for.

Healthcare — Reliability as a Patient Safety Issue

West China Hospital runs one of the busiest radiology departments in the world. When their CT scanner fails unexpectedly, a patient doesn't get scanned, a procedure gets rescheduled, a clinical decision gets made without the information it needed. That's what makes equipment reliability in healthcare a different problem from equipment reliability in manufacturing.

Their CT predictive maintenance program worked from real-time data across more than two million exposures collected between 2019 and 2023. The model predicted overheating events within a 20-minute window and arcing faults roughly one to two days in advance: specific failure modes, on specific equipment, with lead times calibrated to what the clinical environment actually needs to respond.

That changes who owns the problem. "The vibration readings look slightly elevated" triggers a maintenance ticket. "This scanner has a 70% probability of an arcing fault within 48 hours" triggers a patient scheduling conversation.

SMEs — PdM Without the Enterprise Budget

A three-person maintenance team can now run a monitored plant on a modest monthly budget — the hardware is affordable, the software is subscription-based, and managed service providers handle the modeling. What takes longer is the conversation that should happen before the first alert fires: which assets get priority, who has authority to pull a machine offline, and what counts as a signal worth acting on versus background noise.

There's a business case angle that smaller operations often miss. Insurers writing equipment breakdown coverage for manufacturers treat documented sensor monitoring as a risk reduction — and price it accordingly. Operations with continuous monitoring and maintenance records qualify for premium credits that unmonitored plants don't. That discount typically runs 10 to 15% on equipment breakdown premiums. For a smaller operation scrutinizing every line of the business case, it's a return that shows up regardless of whether the system ever catches a specific failure.

Case studies

Preventing Recurrent Pump Failures in a Datacenter Cooling System

A technology company running large data centers for clients in finance, healthcare, and e-commerce had a recurring problem with a pump in their cooling infrastructure. It kept failing despite regular inspections, and every time nobody knew anything was wrong until the pump had already failed.

The client's dataset contained over 100 sensor parameters monitoring temperature, pressure, flow rates, and system behavior. The main problem was that there was no labeling connecting specific sensor readings to failure events.

We built an unsupervised anomaly detection system using Isolation Forest, ECOD, and One-Class SVM. To filter out single-algorithm noise, we flagged an anomaly only when at least two of the three detectors agreed on it.
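The agreement rule can be sketched independently of the detectors themselves. The example below assumes each detector has already produced one boolean flag per sample; the function name `consensus_flags` and the sample flags are hypothetical and only illustrate the voting step, not the project's actual pipeline.

```python
from typing import Sequence


def consensus_flags(
    detector_flags: Sequence[Sequence[bool]],
    min_agree: int = 2,
) -> list[bool]:
    """Flag a sample only when at least `min_agree` detectors flag it."""
    if not detector_flags:
        return []
    n_samples = len(detector_flags[0])
    if any(len(flags) != n_samples for flags in detector_flags):
        raise ValueError("all detectors must score the same samples")
    # zip(*...) walks the detectors' votes sample by sample.
    return [sum(votes) >= min_agree for votes in zip(*detector_flags)]


# Hypothetical per-sample outputs of the three detectors (True = anomaly).
isolation_forest = [False, True, True, False, True]
ecod = [False, False, True, False, True]
one_class_svm = [True, False, True, False, False]

flags = consensus_flags([isolation_forest, ecod, one_class_svm])
```

In this made-up run, the first two samples are each flagged by only one detector and are dropped, while the third and fifth samples pass the two-vote threshold.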

Once we had failure data, we ran a correlation analysis against the known pump replacement dates and identified 4 out of 107 sensors that consistently changed behavior before each incident. The client now has a real-time monitoring system watching those 4 sensors — when the pattern appears, they get an early warning.

Real-Time Machine Monitoring and Anomaly Detection Solution

A client came to us with a condition monitoring platform that already had solid infrastructure — wireless sensors capturing triaxial vibration and velocity data, streaming via MQTT through AWS IoT Core into MongoDB. Their customers could see machine status, run time, downtime, and sensor readings in real time. None of it gave them any warning before something failed.

They needed an anomaly detection layer on top of what already existed. Six algorithms were tested against a single criterion: how consistently each one characterized normal behavior — because an inconsistent model generates false positives, and false positives are how a maintenance team learns to mute the alerts. COPOD flagged near-continuously across the triaxial acceleration and velocity readings, making it effectively unusable in a live environment. HBOS produced the most stable characterization of normal behavior across all six sensor features and became the default — consistent enough to trust, light enough to run continuously.

Each machine gets its own model trained on its own sensor history, because a motor and a conveyor don't share a baseline. Models retrain automatically every 14 days, so accuracy doesn't drift as conditions change and nobody has to trigger retraining manually.
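To make the histogram-based idea and the per-machine setup concrete, here is a deliberately simplified sketch. `SimpleHBOS` is a hypothetical, stripped-down re-implementation written for illustration; it is not the production code and not a library API, and the sensor histories are invented.

```python
import math


class SimpleHBOS:
    """Simplified histogram-based outlier score (illustrative only)."""

    def __init__(self, n_bins: int = 10, floor: float = 1e-3):
        self.n_bins = n_bins
        self.floor = floor      # density used for empty or unseen bins
        self.features = []      # per feature: (low, high, width, heights)

    def _bin(self, value, low, width):
        return min(int((value - low) / width), self.n_bins - 1)

    def fit(self, rows):
        self.features = []
        for column in zip(*rows):   # one column per sensor feature
            low, high = min(column), max(column)
            width = (high - low) / self.n_bins or 1.0
            counts = [0] * self.n_bins
            for value in column:
                counts[self._bin(value, low, width)] += 1
            peak = max(counts)
            heights = [max(c / peak, self.floor) for c in counts]
            self.features.append((low, high, width, heights))
        return self

    def score(self, row) -> float:
        """Higher score means the row looks less like the training data."""
        total = 0.0
        for value, (low, high, width, heights) in zip(row, self.features):
            inside = low <= value <= high
            height = heights[self._bin(value, low, width)] if inside else self.floor
            total += math.log(1.0 / height)
        return total


# Invented histories: (vibration, velocity) readings per machine.
history = {
    "motor-1": [(1.0 + 0.01 * i, 5.0 + 0.02 * i) for i in range(100)],
    "conveyor-7": [(20.0 + 0.1 * i, 0.5 + 0.001 * i) for i in range(100)],
}
# One model per machine, trained only on that machine's own history.
models = {machine: SimpleHBOS().fit(rows) for machine, rows in history.items()}

normal_score = models["motor-1"].score((1.5, 6.0))
odd_score = models["motor-1"].score((9.0, 6.0))
```

A scheduled job could rebuild the `models` dictionary from recent history every 14 days; the scheduling mechanism itself is outside this sketch.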

Conclusion

Detection is the part that gets talked about. Sensors, models, accuracy rates — these are the problems the industry has largely solved, and there's no shortage of vendors ready to demonstrate them.

What happens after the alert fires is harder to talk about than the detection itself. Someone has to see it, decide it's worth acting on, and have the authority to do something about it. The system needs a defined boundary between what it handles alone and what it escalates — and that boundary needs to hold up when the model and the expert reach different conclusions. Most organizations have the detection layer working. The organizational work around it is where implementations are still catching up.

If you're still in the monitoring-and-alerting phase, get in touch to talk through what the next step looks like for your operation.

Why Healthcare AI Fails in the Real World

SciForce — Wed, 27 May 2026 14:02:04 +0000

Introduction

In 2018, a clinical informaticist launched a tool to handle intake forms and clinical notes so doctors could spend less time typing and more time doctoring. A small study with 18 medical students suggested that the Cydoc smart intake form could substantially reduce note-writing time while maintaining note quality, although broader validation in practicing clinicians was still needed. By August 2025, the company was gone.

The postmortem names the main reason: Cydoc lived outside the EHR. Doctors had to copy the notes from the Cydoc interface and paste them into the EHR, which meant working in two windows and adding an extra workflow step for routine clinical documentation. The founder later described the lack of EHR integration as a fatal adoption mistake.

Cydoc isn’t an exception. Even with a strong model, healthcare AI projects can fail when they add friction to already complex clinical workflows. A Gartner survey of infrastructure and operations leaders conducted in late 2025 found that only 28% of AI use cases fully succeeded and met ROI expectations, while 20% failed outright; poor data quality, limited data availability, and weak workflow integration were among the reported barriers. From pre-build through pilot and scale, the same mistakes are made, and the good news is that they are not inevitable.

Pre-Build: Set Up to Fail

Pre-build failures are the easiest to miss because there is nothing to debug yet and nothing live to roll back. By the time the consequences show up, fixing them can be significantly more expensive than preventing them during product design, data access planning, and workflow discovery.

Cydoc knew from the beginning that EHR integration mattered: the founder had lived through broken EHR workflows in her own clinical training. But the company couldn't afford to build it, so they shipped without it and postponed the problem. The EHR integration never arrived, and Cydoc spent years trying to sell a tool that required clinicians to change their workflow instead of fitting into it.

Solving the Wrong Problem

The most common pre-build failure starts when someone finds something the model can do well and only then starts looking for a clinical problem to attach it to.

The tool gets built, the scores look good, and nobody uses it. An alert that confirms what a physician already suspects, or points at a risk they can't act on in that moment, gets ignored regardless of how accurate it is.

Before building anything, find one clinician who deals with the problem you are targeting and ask when exactly it happens in their shift, what they do now, and whether a tool like yours would genuinely make the job easier or just add more friction. For healthcare AI, “user discovery” is not a marketing exercise; it is a clinical safety, adoption, and implementation requirement. Sometimes the answer points away from AI entirely, and accepting that at the very beginning saves months of work and thousands of dollars.

Counting on Data That Isn’t There

The common mistake is thinking that the data will look something like a labeled research dataset. Real EHR data is chaos: a large share of clinically meaningful information exists in unstructured notes, reports, and narratives, and much of it is not mirrored in structured fields. Any project counting on clean, analysis-ready data will hit this wall.

A 2025 study across 1.8 million patient records found that only 13% of clinically relevant concepts in free text had any equivalent in structured fields. At the visit level, where a clinician documents a specific encounter, that dropped to 7%.

On top of that, the same diagnosis gets coded differently across departments, and missing values follow patterns that reflect documentation culture rather than patient reality. A model trained on this may treat these artifacts as clinical signals.

SciForce ran into this semantic standardization problem while building internal healthcare AI tools: terms from source systems that wouldn't map to standard vocabularies, clinical details lost in conversion, specialists pulled into weeks of manual work without consistent results. That's how Jackalope was born – an ML-powered tool for automating medical data standardization across OMOP CDM and SNOMED CT. For teams building healthcare AI, this is not a peripheral data-cleaning task; it is the layer that determines whether a model can be trained, validated, explained, and reused across sites.

Treating Data Access Like a Detail

Paperwork and patient data access are a common point of collapse: you need ethics board approval, permission to use de-identified data, IT security clearance, and often data use agreements. In many institutions, these processes are sequential or only partially parallel, which turns data access into a project-critical dependency rather than an administrative detail.

A study across 277 protocols found that ethics review takes 112 days on average across 10 VA Institutional Review Boards – now imagine the time needed for a small startup. A 2025 multi-site study documented that data use agreements take 26 months to execute, with actual data extraction taking another 14-22 months. At this scale, two months of training a model easily become years of waiting for approval.

The practical response is to start the paperwork from day one, before the model architecture is even sketched. In the meantime, use publicly available datasets like MIMIC-III/IV from PhysioNet or the eICU Collaborative Research Database to train your model. Synthetic data can be useful for testing pipelines, interfaces, privacy-preserving workflows, and some model-development assumptions, but it should not be treated as a substitute for validation on representative real-world clinical data.

Pilot: Workflow Pushes Back

Every pilot starts the same way: the demo goes well, someone says "this could really change things", and two months later, no one is using the product.

Cydoc had paying customers who weren't using the product because it meant changing a workflow that already worked well enough. A tool can be technically sound, clinically relevant, and still end up unused for reasons that have nothing to do with the model.

Accuracy Without Clinical Value

Getting good scores during internal validation is a success, but it’s not a sufficient reason to deploy the model.

A 2025 JAMA Network Open study reviewed same-admission AI models in the literature and found that 40.2% of them were trained on ICD codes as input data to predict mortality. However, ICD codes are assigned by billing staff after the patient is discharged and describe the final diagnosis, not what was known at the beginning of the treatment. In the authors’ mortality prediction experiment, models using ICD codes achieved very high AUROC values, illustrating label leakage rather than clinically usable prospective prediction. To avoid a similar situation, audit every input to confirm it is actually available at the moment the clinician needs to use the model. Even a small second-institution validation cohort can catch what internal testing misses.
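An input audit of this kind can start as a simple table of when each feature first exists in the record. The sketch below is a minimal illustration: the feature names, the availability times, and the 6-hour prediction point are all invented, and `audit_features` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    # Hours after admission at which the value first exists in the record.
    available_after_hours: float


def audit_features(features, prediction_hour):
    """Split features into those usable at prediction time and those that leak."""
    usable = [f.name for f in features if f.available_after_hours <= prediction_hour]
    leaking = [f.name for f in features if f.available_after_hours > prediction_hour]
    return usable, leaking


# Invented example: the model is meant to run 6 hours after admission.
candidate_inputs = [
    Feature("heart_rate", 0.5),
    Feature("first_lactate", 3.0),
    Feature("discharge_icd_codes", 120.0),  # coded only after discharge
]

usable, leaking = audit_features(candidate_inputs, prediction_hour=6.0)
```

Anything that lands in `leaking` should be removed from training, however much it improves the validation score.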

Too Many Alerts, Too Little Action

After enough false alerts that give clinicians nothing specific to act on, they learn that the interruption isn’t worth it.

External validations of Epic sepsis prediction models have repeatedly shown that performance can vary by site, threshold, patient population, and implementation context. And even when an alert fired correctly, it often arrived after sepsis had already been identified by other means. Alerts should not only be accurate, but also arrive in time and give clinicians enough information to act differently because of them.

Another question is whether an alert system is the right interface at all. For a healthcare technology provider, SciForce built an LLM-powered semantic search that lets a doctor ask a question about a specific patient – in plain language, at the moment they're ready to act, and get a relevant answer pulled from the patient's records. This is a different design philosophy: instead of pushing another alert into an overloaded workflow, the system supports clinician-initiated retrieval at the point of decision.

One More Dashboard Nobody Wanted

A reliable predictor of pilot failure is a tool that requires clinicians to leave the system they already work in. Cydoc lived outside the EHR, which meant the clinical staff had to manage a second interface: one extra step for each patient on every shift.

Duke University hit a related workflow-integration challenge with Sepsis Watch. The sepsis prediction tool was deployed on a separate iPad, which meant nurses had to monitor the iPad, cross-reference the patient chart, and manually pass the alert to the treating physician. The hospital had to create an entirely new nursing role to connect AI and the clinical workflow. This doesn’t mean the system failed clinically. Duke later reported expansion of Sepsis Watch. But it does show that successful AI deployment may require new labor, new roles, and active workflow repair, not just a model and an interface.

Johns Hopkins solved the same problem differently. They embedded a similar sepsis model directly as a clickable icon in the existing EHR interface, with no separate system or login required. Across five hospitals, 89% of alerts were evaluated, and patients whose alerts were confirmed within three hours showed an 18.7% reduction in mortality. The lesson is not that one interface pattern always wins. It is that adoption depends on whether the tool fits the clinical decision pathway, accountability structure, and timing of care.

Scale: Works Here, Fails There

A successful pilot means the model worked for one institution. Turning it into a widely adopted and commercially successful product requires consistent performance at new sites, regulatory clearance, and an architecture that scales without being rebuilt from scratch.

Same Model, New Reality

A 2026 multicenter study tested the Epic Sepsis model across numerous hospitals. The model assigns each patient a sepsis risk score based on their clinical data, but the same cutoff doesn’t work well for all hospitals. To catch 60% of sepsis cases, one hospital would need a risk score cutoff of 14, while another would need 37. An analysis across a network of nine hospitals showed that performance ranged from poor to acceptable, with no single benchmark that worked well across all sites.

Take two hospitals: a large urban teaching hospital treating post-surgical complications and ICU patients, and a smaller regional hospital receiving lower-acuity cases.

Naturally, the average patient from an urban hospital has a higher baseline sepsis risk than one from a regional site. That alone shifts the scoring baseline. The first hospital is also likely to have stronger lab infrastructure, more advanced equipment, and more detailed documentation. That means that the model trained on its data would rely on a richer data picture. A single configuration wouldn't work equally well for both sites: set the cutoff too high and the model would miss sepsis in regional hospitals; set it too low, and the model would flood the urban hospital with false alerts.
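The effect of site differences on the cutoff can be shown with a small calculation. In the sketch below, the risk scores are invented so that the resulting cutoffs echo the 14 and 37 mentioned above; they are not data from the study, and `cutoff_for_sensitivity` is a hypothetical helper.

```python
import math


def cutoff_for_sensitivity(septic_scores, target_sensitivity):
    """Highest cutoff at which the target share of septic patients scores at or above it."""
    ranked = sorted(septic_scores, reverse=True)
    # Small epsilon guards against floating-point noise in the product.
    needed = math.ceil(target_sensitivity * len(ranked) - 1e-9)
    return ranked[needed - 1]


# Invented risk scores of confirmed sepsis cases at two sites.
urban_teaching = [12, 18, 25, 31, 37, 40, 44, 52, 58, 63]
regional = [5, 7, 9, 11, 14, 17, 20, 24, 30, 35]

urban_cutoff = cutoff_for_sensitivity(urban_teaching, 0.60)
regional_cutoff = cutoff_for_sensitivity(regional, 0.60)

# Share of regional sepsis cases caught if the urban cutoff is reused there.
caught = sum(score >= urban_cutoff for score in regional) / len(regional)
```

In this invented example, reusing the urban cutoff at the regional site would catch none of its sepsis cases, which is the "set the cutoff too high" failure described above.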

You need to deal with this problem before deployment: avoid institution-specific dependencies, and run second-site validation during development, rather than after signing the contract. Even without such dramatic site differences, patient populations still change over time, clinical practices evolve, and documentation quirks shift. Together, those changes can quietly degrade model performance in production before anyone notices. To avoid this, continuous monitoring and retraining need to be planned during development.

For a public healthcare organization monitoring region-wide infection spread, SciForce built a pipeline with automated retraining triggered when a drift score exceeded a defined threshold. The same practice can be applied to multi-site deployments, where each new site introduces the model to a different data environment. For clients, this changes the procurement question from “Can you build a model?” to “Can you operate and monitor this model safely after deployment?”
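A drift-triggered retraining check can be sketched as follows. The source does not say which drift score was used, so the population stability index (PSI) here is an assumed stand-in, and the 0.2 threshold is a common rule of thumb rather than a value from the project.

```python
import math


def population_stability_index(reference, live, n_bins=10):
    """PSI between a reference sample and a live sample of one feature."""
    low, high = min(reference), max(reference)
    width = (high - low) / n_bins or 1.0

    def bin_shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - low) / width)
            counts[max(0, min(index, n_bins - 1))] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-4) for count in counts]

    ref_shares, live_shares = bin_shares(reference), bin_shares(live)
    return sum(
        (l - r) * math.log(l / r) for r, l in zip(ref_shares, live_shares)
    )


DRIFT_THRESHOLD = 0.2  # assumption: rule-of-thumb value, tune per deployment


def should_retrain(reference, live, threshold=DRIFT_THRESHOLD):
    return population_stability_index(reference, live) > threshold


# Invented feature values: training-time distribution vs. two live windows.
training_window = [i / 100 for i in range(100)]
stable_window = list(training_window)
shifted_window = [0.5 + i / 200 for i in range(100)]

retrain_stable = should_retrain(training_window, stable_window)
retrain_shifted = should_retrain(training_window, shifted_window)
```

For a multi-site rollout, the same check could compare each new site's data against the training distribution before the model goes live there.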

Regulatory Surprise

The line between a clinical decision support tool and a regulated medical device is not obvious.

For non-device clinical decision support, the FDA focuses on statutory criteria including whether the software analyzes medical information rather than images or device signals, whether it supports rather than replaces professional judgment, and whether the clinician can independently review the basis for the recommendation.

The most consequential factors are intended use, transparency, and whether the clinician can independently review the basis for the recommendation. A tool that says "this patient has sepsis" is making a diagnostic claim and is likely regulated. A tool that says "three of the seven sepsis criteria are present in this record, here are the values" is surfacing information and leaving the judgment to the clinician, making it more likely to fall outside the regulated category. This distinction is not a loophole: it must be reflected consistently in product design, labeling, user interface, validation strategy, and sales language.

Kintsugi hit the regulatory wall hard. They built a machine learning tool for anxiety and depression screening based on short free-speech voice samples. A peer-reviewed study across about 15,000 participants found sensitivity of 71.3% and specificity of 73.5% in detecting moderate or severe depression – a result comparable to other mental health screening tools.

To scale as a diagnostic AI product, the company needed FDA De Novo authorization. De Novo is the regulatory pathway for products novel enough that no FDA-cleared equivalent exists to point to – the longer, more expensive route compared to the standard 510(k). For FY2026, FDA user fees are $26,067 for a 510(k) and $173,782 for a De Novo request. Review timelines vary: the FDA's De Novo goal is 150 FDA review days, excluding time on hold, and studies of AI/ML-enabled devices have reported longer median review times for De Novo than for 510(k).

The venture-backed product was ultimately unable to survive that timeline, combined with the cost of the clearance process. In February 2026, Kintsugi shut down commercial operations and open-sourced its work.

Map your intended use case against the FDA's four-factor test before committing to a product architecture. If there is any uncertainty, engage a regulatory consultant: the cost of early advice is a fraction of what a late discovery costs.

Architecture That Doesn’t Travel

Most early healthcare AI products are built around one institution's specific setup. That works for a pilot. The problem starts when you scale to a second site with a different EHR vendor, unfamiliar data structures and new ways of recording clinical information.

One architectural fix is to build the integration layer around standards such as HL7 FHIR where appropriate, while recognizing that FHIR alone does not solve terminology mapping, local workflow variation, historical data extraction, or analytics-ready cohort construction. Certified EHRs are now required to support FHIR-based APIs under the 21st Century Cures Act, which means a standardized data layer is achievable without custom extraction work at each new site. This creates a more realistic path to standard integration, but not a guarantee of plug-and-play deployment.

When a German university hospital needed to connect observational research data to operational clinical workflows, SciForce built an OMOP CDM to HL7 FHIR conversion pipeline that made real-time data exchange between the two systems possible.

For a US health insurer working across multiple hospital systems with inconsistent data formats, SciForce built a cloud-native pipeline on Snowflake conforming to the PCORnet CDM standard, turning what would have been a custom integration project at each new site into a repeatable process. This is the implementation layer many healthcare AI products underestimate: not model development, but repeatable, governed data movement across heterogeneous clinical environments.

Conclusion

Across all three stages, most of the factors that determine whether a healthcare AI project fails or survives are not about model performance. By the time the model is ready to deploy, they are already locked in by decisions made months or years earlier.

Clinical AI is hard, the regulatory environment is still maturing, and some projects fail for genuinely unpredictable reasons. But many of the most damaging failure modes are predictable: weak workflow fit, inaccessible data, label leakage, alert fatigue, site-specific model behavior, unclear regulatory strategy, and architecture that cannot travel. While successful deployment isn’t guaranteed, removing the nine most predictable reasons for failure is a much better starting point.

At SciForce, we treat healthcare AI deployment as an infrastructure problem before we treat it as a modeling problem. That means building the data layer, terminology mapping, interoperability strategy, monitoring logic, and clinical workflow fit early enough to prevent predictable failure. If your AI product is moving from prototype to pilot, or from pilot to scale, this is the moment to examine whether the architecture is ready for real clinical environments.

Explore more of our insights on building healthcare AI that actually ships → https://sciforce.solutions/case-studies?tag=healthcare

DevOps Meets Generative AI: Building, Testing, and Deploying LLM-Powered Apps

SciForce — Wed, 20 May 2026 13:25:37 +0000

Last spring, OpenAI released a GPT-4o update that made the model hard to trust: it returned sycophantic and less reliable answers than usual, even though nothing was changed in users’ prompts and workflows.

When an LLM system starts drifting in production, the deployment history doesn’t catch it early: nothing changed in the codebase, and the provider didn’t announce any official update either. Meanwhile, a provider might have adjusted a classifier without notice, and a request that worked fine yesterday starts returning confidently wrong answers today.

If you are already running delivery pipelines, the entire process looks familiar. However, an LLM pipeline has a different kind of release object, where a minor change in prompt, model version, or guardrail can alter system behavior, even though the main codebase was never touched.

What Shapes LLM Production Behavior

While application code gets versioned carefully, changes to prompts, retrieval settings, and guardrails often happen without a formal record, making it harder to identify what exactly caused the drift in model behavior.

- Prompts
Sometimes, the reason for regression is a minor change in system prompt: someone changes a sentence targeting one edge case, and an unrelated query category unexpectedly starts performing worse. This happens when multiple people can edit the prompt directly, leaving the edit outside the release record.

- Model versions
In May 2025, Google redirected two dated Gemini endpoints to a newer model without notice. Developers building on gemini-2.5-pro-preview-03-25 found that the software behaved differently than it had the day before. Afterward, Google updated its documentation to clarify what “stable” and “preview” meant for different endpoint types. If the app behaves oddly, the provider might have updated the model without notice, so it is worth checking which exact model versions show up in your API responses.

- Retrieval configuration and source data
In RAG systems, answers can drift because the index got stale or because someone changed chunking, the ranker, top-k, or the embedding model – and none of these changes makes the app throw an error. As a result, a financial reporting assistant can start citing figures from outdated quarterly reports because the knowledge base was updated without refreshing the index.

- Guardrails
Guardrail rules are often managed outside the main app release process. The compliance team might tighten a refusal rule in a separate console, and the app starts rejecting queries that used to work, with no change on the engineering side.

- Evaluation
A test set built when the product launched doesn't automatically update as the product evolves. A model can keep passing eval while production has moved on: the query mix has shifted, and cases that were rare at launch now make up much of the workload.
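The model-version check mentioned under "Model versions" can be sketched as a small log scan. The example assumes each logged response is a dictionary with a "model" key naming the version that actually served the request; the real field name differs between providers, and the version strings below are invented.

```python
from collections import Counter


def unexpected_model_versions(responses, pinned_version):
    """Count responses served by any model version other than the pinned one."""
    served = Counter(r.get("model", "<missing>") for r in responses)
    return {
        version: count
        for version, count in served.items()
        if version != pinned_version
    }


# Hypothetical logged responses; only the "model" field matters here.
logged = [
    {"model": "example-model-2025-03-25", "text": "..."},
    {"model": "example-model-2025-03-25", "text": "..."},
    {"model": "example-model-2025-05-06", "text": "..."},
]

surprises = unexpected_model_versions(logged, "example-model-2025-03-25")
```

A non-empty result is a cue to rerun the eval set before investigating prompts or retrieval.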

Building the delivery pipeline

In traditional software delivery, the release surface is mostly code. In LLM systems it expands to include prompts, model versions, retrieval configuration, and guardrails – components that affect production behavior just as much as the application, but rarely get the same release controls.

Knowing when a release is good enough to ship

In a traditional release you have to make sure that the software runs correctly. When deploying an LLM system, you have to make sure that it behaves acceptably and safely across the full range of inputs it will encounter in production.

Golden prompts

Golden prompts are fixed test cases that reflect what the system is supposed to do. For a customer support assistant, a golden prompt checks whether the assistant correctly identified the issue, pointed to the right support article, avoided making things up, and escalated when necessary.

When preparing a release, each golden prompt is checked on those dimensions, with pass/fail criteria defined before the evaluation. Some checks can be automated, while ambiguous, user-facing, or high-risk outputs still need human attention. Not every failure is equally important: a failure to escalate or a wrong citation blocks the release immediately, while slightly worse phrasing on a low-traffic query probably doesn't.
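The difference between blocking and non-blocking failures can be encoded directly in the release gate. This is a minimal sketch: the check names, the `BLOCKING_CHECKS` set, and the sample results are assumptions chosen to match the customer support example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    prompt_id: str
    check: str      # e.g. "escalation", "citation", "phrasing"
    passed: bool


# Assumption: which checks block a release is decided before the evaluation.
BLOCKING_CHECKS = {"escalation", "citation"}


def release_decision(results):
    """Block on any failed blocking check; report other failures as warnings."""
    failed = [r for r in results if not r.passed]
    blocking = [r for r in failed if r.check in BLOCKING_CHECKS]
    warnings = [r for r in failed if r.check not in BLOCKING_CHECKS]
    return {"ship": not blocking, "blocking": blocking, "warnings": warnings}


results = [
    CheckResult("refund-policy", "citation", passed=True),
    CheckResult("refund-policy", "phrasing", passed=False),
    CheckResult("angry-customer", "escalation", passed=False),
]

decision = release_decision(results)
```

Here the phrasing failure is recorded as a warning, while the missed escalation alone is enough to stop the release.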

Baseline comparison

Eval scores are less stable than they look. One study on prompt sensitivity found accuracy swings of up to 76 percentage points from formatting differences alone, with no change to meaning. That is why every candidate release needs to be measured against the production version: without that reference, even a strong score can be a regression from what is already running.
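A baseline comparison is more informative when it looks at individual cases as well as the aggregate score. In the sketch below, the case IDs and results are invented, and `compare_to_baseline` with its `max_drop` tolerance is a hypothetical helper.

```python
def compare_to_baseline(baseline, candidate, max_drop=0.02):
    """Compare per-case pass/fail results of a candidate against production."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no eval cases in common")
    baseline_rate = sum(baseline[c] for c in shared) / len(shared)
    candidate_rate = sum(candidate[c] for c in shared) / len(shared)
    # Cases production handles correctly but the candidate now gets wrong.
    regressions = [c for c in shared if baseline[c] and not candidate[c]]
    return {
        "baseline_rate": baseline_rate,
        "candidate_rate": candidate_rate,
        "regressions": regressions,
        "within_tolerance": baseline_rate - candidate_rate <= max_drop,
    }


# Invented results on the same eval set (True = case passed).
production = {"q1": True, "q2": True, "q3": False, "q4": True}
candidate = {"q1": True, "q2": False, "q3": True, "q4": True}

report = compare_to_baseline(production, candidate)
```

Both versions score 75% in this example, yet the candidate breaks a case that production handles, which an aggregate score alone would hide.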

Controlled rollout

Staged deployment strategies let you validate the release in production before committing to it fully. Shadow testing sends user requests in parallel through both the current and new versions, but users only see the responses from the current one. Canary testing goes further and shows the new version's responses to a small share of real users. If something goes wrong, you catch it on a small slice of traffic and roll back before it spreads. Before you start, decide what "something is wrong" means: worse reply quality, more refusals, or higher cost per query.
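A canary rollout needs two pieces: a stable way to pick the canary users and a pre-agreed definition of "something is wrong". The sketch below assumes hash-based bucketing by user ID and three invented metrics where higher is worse; all names, values, and limits are hypothetical.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


def rollback_reasons(current, canary, allowed_increase):
    """List metrics where the canary is worse than current by more than allowed."""
    return [
        metric
        for metric, limit in allowed_increase.items()
        if canary[metric] - current[metric] > limit
    ]


# Invented metrics; for all three, higher is worse.
current = {"refusal_rate": 0.02, "cost_per_query_usd": 0.010, "bad_answer_rate": 0.05}
canary = {"refusal_rate": 0.09, "cost_per_query_usd": 0.011, "bad_answer_rate": 0.05}
allowed_increase = {"refusal_rate": 0.02, "cost_per_query_usd": 0.005, "bad_answer_rate": 0.01}

reasons = rollback_reasons(current, canary, allowed_increase)
```

Hashing the user ID keeps each user on the same version across requests, so their experience does not flip between responses.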

Versioning

A quality gate is only as good as the release record behind it. If the record doesn't include the exact versions of the prompt, retrieval and guardrail configuration, eval set, and embedding model that are going live, you might be testing last week's setup.

Any single change to any of them should trigger reevaluation, because even one edit can break the entire construction.
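One lightweight way to enforce this is to fingerprint the whole release record and compare it with the fingerprint that was evaluated. The manifest keys and version strings below are invented for illustration; only the hashing approach is the point.

```python
import hashlib
import json


def release_fingerprint(manifest: dict) -> str:
    """Stable hash of a release manifest, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_reevaluation(evaluated: dict, going_live: dict) -> bool:
    return release_fingerprint(evaluated) != release_fingerprint(going_live)


# Hypothetical manifest of everything that shapes production behavior.
evaluated = {
    "prompt": "support-system-prompt@v14",
    "model": "example-model-2025-03-25",
    "retrieval": {"chunk_size": 512, "top_k": 5, "embedding_model": "embed-v2"},
    "guardrails": "policy@v7",
    "eval_set": "golden@2026-05-01",
}

# Someone edits the prompt after the evaluation ran.
going_live = {**evaluated, "prompt": "support-system-prompt@v15"}

stale = needs_reevaluation(evaluated, going_live)
```

Storing the fingerprint alongside the eval report makes it easy to show which exact configuration a passing score refers to.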

Deploying without losing the gains

Clearing every quality gate doesn't guarantee a smooth release. Inference workloads fail differently from standard web apps under concurrency, and adding hardware doesn't resolve bottlenecks caused by provider-side rate limits or by a queue backing up under long-context requests.

Cost behavior is also harder to predict than token billing alone would suggest. Context growth in lengthy conversations, retrieval payloads, tool-call recursion, and retry loops on failed calls all compound, which is how inference comes to account for 80–90% of total cost of ownership in production GenAI deployments. One way to cut inference costs is query routing: it's faster and cheaper to run routine lookups through deterministic search or rule-based logic than through the model.
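Query routing can start with a single rule placed in front of the model. The sketch below assumes one routine lookup type, order status, recognized by a regular expression; the pattern and the route names are hypothetical, and a real router would cover more cases.

```python
import re

# Assumption: order-status questions can be answered by a database lookup.
ORDER_STATUS = re.compile(r"\border\s+#?(\d+)\b", re.IGNORECASE)


def route_query(query: str) -> str:
    """Send routine lookups to deterministic logic, everything else to the LLM."""
    if ORDER_STATUS.search(query):
        return "deterministic_lookup"
    return "llm"


routes = [
    route_query("Where is order #4821?"),
    route_query("Why was my refund smaller than expected?"),
]
```

Logging how many queries take each route also shows how much traffic never needed the model in the first place.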

Keeping it reliable once it's live

Once the system is in production, the question shifts from whether it behaves correctly to whether you know when it stops. Factors that affect production LLM behavior, such as a provider update, a guardrail adjustment, or users phrasing requests differently, don't always leave obvious signals, and the challenge is to catch the shifts before users do.

Monitoring what matters

Specific metrics, like retry volume and path shifts, can catch tool-use problems early, but in practice the signal often becomes visible only when the bill arrives and users start complaining. Cost growth is easy to overlook as a monitoring problem because it compounds slowly, and failed work still costs money: Azure’s documentation confirms that content filter rejections and timeouts get billed even when processing fails. Set cost thresholds in advance for cost per query, cost per workflow, token growth, and retry spend, and monitor against them.
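
A minimal sketch of a threshold check over one monitoring window. The record fields and the threshold values are assumptions made for illustration; real limits depend on the workload.

```python
from dataclasses import dataclass

@dataclass
class QueryCost:
    workflow: str
    cost_usd: float         # spend on the attempt that succeeded
    retry_cost_usd: float   # spend on failed attempts that were still billed

# Hypothetical thresholds, agreed before launch rather than after the first bill.
THRESHOLDS = {
    "cost_per_query_usd": 0.05,
    "retry_spend_share": 0.15,  # retry spend as a share of total spend
}

def check_costs(window: list[QueryCost]) -> list[str]:
    """Return alert messages for one monitoring window of query records."""
    if not window:
        return []
    total = sum(q.cost_usd + q.retry_cost_usd for q in window)
    retry_spend = sum(q.retry_cost_usd for q in window)
    alerts = []
    if total / len(window) > THRESHOLDS["cost_per_query_usd"]:
        alerts.append("cost per query above threshold")
    if total > 0 and retry_spend / total > THRESHOLDS["retry_spend_share"]:
        alerts.append("retry spend share above threshold")
    return alerts
```

Tracking retry spend as its own share makes billed failures visible even when the average cost per query still looks acceptable.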

Where human judgment stays in the loop

While automated evaluation catches a lot, it misses things a human would notice. Automated checks can let confidently wrong answers through, while a human looking at real outputs over time would spot patterns: the system consistently mishandling certain types of requests, or plausible but wrong answers becoming more frequent.

Ownership, decisions, and accountability

Governance in LLM systems tends to fail quietly, usually for the same reason: basic questions have no owner. Who can block a release? What counts as a production incident? What happens when output quality drops after a provider update nobody initiated?

When responsibility for the app, user experience, guardrails, and eval set is split across different departments, these questions often go unanswered. As a result, when something breaks with no trace in the codebase, there is no designated person to decide whether the regression is acceptable or whether to declare an incident.

What this looks like in practice

The client’s enterprise performance management platform was slow, expensive, and hard to debug. Two problems were compounding each other.

The first was routing: simple queries that could be handled by a database call were being processed by the LLM instead, just like complex analytical tasks. Based on internal benchmarking, making a database call would have been roughly 40x cheaper and 10x faster.

The second was traceability: the platform had been built with a separate ML model for each end customer, so when outputs degraded, there was no reliable way to tell whether the cause was the model, the retrieval configuration, or something else.

What we changed

We replaced the per-client model architecture with a shared vector search foundation and added rule-based routing that directs simple lookups to the database and complex ones to the LLM. To handle complex requests, we tested several models on client data: GPT-4, GPT-4o, GPT-4o-mini, Mistral, and Mixtral. GPT-4o-mini offered the best balance, matching the effectiveness of GPT-4o at a lower cost.
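
To illustrate the idea of rule-based routing, here is a deliberately small sketch. The patterns and keywords are hypothetical and are not the rules used in the project.

```python
import re

# Hypothetical patterns for queries that ask for one known number or record.
LOOKUP_PATTERNS = [
    re.compile(r"^(what (is|was)|show|get)\b.*\b(revenue|sales|headcount|budget)\b"),
    re.compile(r"^how many\b"),
]
# Hypothetical signals that a query needs reasoning over several data points.
ANALYSIS_HINTS = ("why", "summarize", "summarise", "trend", "compare", "explain", "forecast")

def route(query: str) -> str:
    """Return 'database' for simple lookups and 'llm' for everything else."""
    text = query.strip().lower()
    words = set(re.findall(r"[a-z]+", text))
    if any(hint in words for hint in ANALYSIS_HINTS):
        return "llm"        # analysis wording wins over lookup wording
    if any(pattern.search(text) for pattern in LOOKUP_PATTERNS):
        return "database"
    return "llm"            # default to the more capable path when unsure
```

Defaulting to the LLM when no rule matches trades some cost for safety: a misrouted lookup is only expensive, while a misrouted analytical question would return a wrong answer.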

All prompts, retrieval settings, and guardrails were versioned, making it possible to assess each release candidate based on consistent benchmarks.

The routing layer got its own test set and regression checks, plus periodic recalibration as user queries evolved. The hybrid architecture was no simpler, but it was testable and versioned, which made it easier to manage than the original.

Results

LLM usage dropped by 37-46% depending on workload type, and latency for simple lookups improved by 32-38%. 68% fewer outputs were flagged as irrelevant or misleading. Manual reconciliation work (the analyst time spent catching and correcting output errors) decreased by 58%.

Conclusion

There's usually a moment, somewhere between the successful demo and the first production incident, when the operational gap becomes obvious. A useful starting point: if something went wrong with your current system today – output degrading, behavior shifting, costs spiking – could you tell within an hour what combination of model, prompt, retrieval configuration, and source data caused it? If the answer is no, that's where to start.

If you want to run that diagnostic on your current system, we're happy to do it with you.

Want to make your LLM systems more reliable, scalable, and cost-efficient in production? Read our articles about LLM and DevOps on the blog 👉 https://sciforce.solutions/blog?tag=LLM&tag=dev-ops

How FinOps Reduces Cloud and GPU Spend for AI-Driven Companies

SciForce — Thu, 07 May 2026 09:48:39 +0000

Introduction

At some point in an AI company's growth, the GPU bill stops making sense: you find yourself looking at a cluster that runs at 3 am for a model that never shipped.

That's the bill that eventually lands on someone's desk, and the first instinct is a cleanup: identify waste and kill orphaned resources. That worked when cloud spend drifted slowly enough for a monthly review to catch up, but in 2025, AI infrastructure spending grew 166% year over year.

A job runs, and the bill for it arrives two weeks later. By then, the same misconfigured job has run again and again. The bill review becomes a historical reconstruction of what the job was supposed to do and who approved it, and by that time the people who could answer those questions have moved on to the next experiment.

Why AI Costs Break Normal Cost Logic

A standard cloud bill is predictable because you spend more when you do more. AI workloads cost the same whether the hardware is working or idle, and an idle GPU doesn't throw alerts the way a failed process does; it just runs, or rather doesn't run, at full price. The costs build in the background while the dashboards stay quiet.

The Bill Behaves Less Predictably

When a GPU is involved, you can run the same cluster for two weeks with a different job schedule and receive a different bill each time. Because GPU infrastructure is 5-10x more expensive than standard compute, calling the difference between those bills impressive is putting it mildly.

Inference is the major cost driver in AI workflows: Gartner puts inference costs at 55% of AI-optimized IaaS spending by 2026 and expects them to reach 65% by 2029. Unlike training jobs, inference doesn’t have a shutdown schedule, and once it makes up the majority of spend, an unmanaged cost per query grows the bill with every new user.

Low GPU Usage Gets Expensive Fast

The AI Infrastructure Alliance’s 2024 survey states that only 7% of organizations exceed 85% GPU usage at peak, while 53% sit between 51% and 70%, and 15% never even break 50%. Most idle capacity comes from clusters sized for worst-case demand that never arrives and from finished training jobs whose environments stay active in case someone needs them soon.

H100 capacity runs $2–4 per GPU-hour, billed whether the cluster is active or not. At 70% usage, an 8-GPU cluster carries roughly $3,700 a month in idle costs on a specialized provider and about $7,000 on a major one.
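
The arithmetic behind those figures can be sketched as follows. The 730-hour month and the two hourly rates are assumptions picked from within the quoted $2–4 range, not provider quotes.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours

def monthly_idle_cost(gpus: int, rate_per_gpu_hour: float, utilization: float) -> float:
    """Cost of the GPU-hours that are paid for but not used in one month."""
    idle_share = 1.0 - utilization
    return gpus * HOURS_PER_MONTH * idle_share * rate_per_gpu_hour

# Assumed rates: $2.10 for a specialized provider, $4.00 for a major one.
specialized = monthly_idle_cost(gpus=8, rate_per_gpu_hour=2.10, utilization=0.70)
major = monthly_idle_cost(gpus=8, rate_per_gpu_hour=4.00, utilization=0.70)
```

With these assumptions the results come out at roughly $3,680 and $7,010 a month, in line with the figures above.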

Where The Money Leaks

Cutting cloud waste has been the top optimization priority for nine years, and for five of those years waste was actively declining. Flexera’s 2026 State of the Cloud Report shows that this year cloud waste grew from 27% to 29%, with AI workloads as the major driver. The table below runs through the most common cloud waste categories.

The next model version is often already in training while the environments from the previous one are still running. Shutdown schedules and TTLs would help, but configuring them is rarely at the top of anyone’s priority list. According to the Harness FinOps in Focus 2025 report, 68% of developers don't have fully automated cost-saving practices in place, and 86% state that it takes at least a week to find idle and orphaned resources and take action.

The State of FinOps 2025 report states that 63% of organizations are actively managing AI spend; however, FinOps in Focus reports that only 39% of developers have full visibility into unused resources.

This shows that while cost visibility has grown, most organizations still haven’t built an attribution layer that allows them to act on it. Without attribution, cost visibility means watching the dashboard more closely and wondering why the bill doesn’t move, which is far from a traceable and controlled bill.

What FinOps Looks Like in Practice

When spinning up a training job, engineers can check the history of similar jobs on the same model at roughly the same volume and estimate its cost before committing resources. If a job is running at 3x the estimate, it can be killed mid-run before it blows up the bill.
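
A small sketch of that check. Using the median of comparable runs as the estimate is an assumption made here for illustration; the 3x factor comes from the paragraph above.

```python
from statistics import median

def estimate_cost(history_usd: list[float]) -> float:
    """Estimate a new job's cost as the median of comparable past runs."""
    if not history_usd:
        raise ValueError("no comparable runs to estimate from")
    return median(history_usd)

def should_kill(spend_so_far_usd: float, estimate_usd: float, factor: float = 3.0) -> bool:
    """Flag a running job once its spend exceeds `factor` times the estimate."""
    return spend_so_far_usd > factor * estimate_usd
```

The median is less sensitive than the mean to one runaway job in the history, which matters when the history itself contains the overspends you are trying to prevent.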

This is how FinOps works: engineers see the financial consequences of their decisions in real time. Spend is traceable at the moment it’s created, oversized jobs can be stopped early, and the final bill stops being a surprise.

Per-job attribution makes it possible, and it must exist before any job runs. Without it, the next engineer deciding whether to rerun a job has no way to know the last one cost $800, or that three nearly identical runs already happened this month.

Start with idle infrastructure

Non-production environments are the easiest place to start. They don't serve users, shutting them down automatically won't affect product performance, and most platforms support it natively. The reason it doesn't happen: restarting a GPU environment takes time, and the engineer who ran the job expects to come back to it.
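
An automatic shutdown rule can be as simple as the sketch below. The four-hour TTL, the stage names, and the `Environment` record are hypothetical; the actual stop call depends on the platform and is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Environment:
    name: str
    stage: str                # e.g. "dev", "staging", "prod"
    last_activity: datetime   # last job or login seen on the environment

IDLE_TTL = timedelta(hours=4)  # assumed time-to-live for idle non-production

def environments_to_stop(envs: list[Environment], now: datetime) -> list[str]:
    """Names of non-production environments idle for longer than the TTL."""
    return [
        env.name
        for env in envs
        if env.stage != "prod" and now - env.last_activity > IDLE_TTL
    ]
```

A TTL long enough to cover a lunch break but shorter than a night addresses the restart-time objection: the engineer who returns within a few hours still finds the environment running.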

Reduce the cost of live workloads

In many GenAI workloads, inference can account for 80-90% of total spend. If every request is routed to the most expensive model path by default, cost per query stays high whether or not the task needs that level of reasoning. We ran into exactly that with one of our clients: simple lookups were taking the same expensive path as the work that actually needed the model.

Tracking What Runs

Enforce tagging at the pipeline level: model version and experiment ID as required fields. For resources already running without it, match costs using pipeline logs and timestamps; historical spend without attribution is largely unrecoverable, and the clock starts from when instrumentation goes in.
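
Enforcement at the pipeline level can be sketched as a check that runs before anything is provisioned. `submit_job` is a hypothetical entry point, not a call from any specific tool; the two required fields are the ones named above.

```python
REQUIRED_TAGS = ("model_version", "experiment_id")

def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return required tag names that are absent or blank."""
    return [name for name in REQUIRED_TAGS if not tags.get(name, "").strip()]

def submit_job(job_name: str, tags: dict[str, str]) -> str:
    """Hypothetical pipeline entry point: refuse to launch untagged jobs."""
    missing = missing_tags(tags)
    if missing:
        raise ValueError(f"{job_name}: missing required tags: {', '.join(missing)}")
    return f"{job_name} accepted"
```

Failing at submission is cheaper than reconciling afterwards, because the person who knows the experiment ID is still at the keyboard.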

ClearML, Weights & Biases, and cloud-native cost explorers like AWS Cost Explorer surface per-job cost data accurately once that metadata is consistently in place. The metrics worth tracking: cost per training run, GPU usage by job, and time-to-detection for idle resources.

How this played out in real systems

Neither of these cases started as a cost project: the cost results showed up because the underlying infrastructure problem got fixed. When the infrastructure stops working against itself, the bill reflects it.

400,000 customers, one infrastructure standard

The original brief was compliance — PCI-DSS, ISO, HIPAA across every AWS region. Meeting those standards required every region to be built on identical configurations.

SciForce moved the client's infrastructure to a single repeatable standard using Terraform and Terragrunt, so every region was built and managed from the same source. Deployments were automated through a Jenkins-to-Concourse transition and Wavefront monitoring was added to catch deviations early.

As a result, the time necessary for configuration and migration dropped by 52%, and the deployments on new compute resources became 63% faster. Once the infrastructure stopped drifting from region to region, the cost picture got easier to control, and total infrastructure TCO improved by 50%.

Query routing decision that cut AI processing costs by 39%

The client's AI assistant was answering every question the same way: routing all queries through the LLM regardless of what was being asked. Pulling a sales figure for last quarter costs roughly the same as summarizing six months of trend data if both go through GPT-4. One of those queries needs the model. The other doesn't.

SciForce built a hybrid processing layer that separated the two. Simple lookups, such as employee stats and sales figures, went through vector search and rule-based retrieval. Summarization and trend analysis went to the LLM. In practice, if a query was pulling a specific number from a known source, it didn’t need the model. If it needed the model to think, it went there.

After assessing seven models on speed, cost, and response quality, SciForce chose GPT-4o-mini for the LLM-routed queries because it held up on quality at a fraction of the cost of larger models. Guardrails were added to filter queries and validate responses, reducing hallucinations and costs.

The financial result was a reduction of up to 46% in LLM usage and a 39% drop in AI query-processing costs. Query routing also improved overall tool performance: simple lookups are now processed 32% faster, and answers contain 68% fewer hallucinations.

Conclusion

The bill arrived. You can't explain it. And because you can't explain this one, you can't prevent the same mistakes from reappearing next month.

FinOps breaks this loop by putting a price tag on each job during provisioning. Attribution helps you predict the job's cost by comparing it to similar jobs before committing to it. If the job is already active but overspending, you can notice it early to stop it before it compounds the bill.

Which training job drove last month's GPU spend? If that takes more than a few minutes to answer, the attribution layer isn't there yet. SciForce can help build it.

DevOps for Embedded Systems: A Modern Guide for Manufacturers

SciForce — Wed, 29 Apr 2026 14:16:31 +0000

Intro

Firmware failures don’t stay confined to software. They stop lines, knock out motors, and ruin batches. Once production is down, firmware stops being “just code.” Even so, many manufacturers still treat firmware as a fixed machine component: ship it once, assume it will hold up, and deal with the fallout later.

That approach breaks down fast at scale. Last year, 61% of manufacturers faced unplanned downtime, causing nearly $1 billion in losses. At the same time, the software estate keeps getting larger. With 40 billion IoT devices expected by 2034, the embedded code running inside controllers, vision systems, and gateways is becoming harder to ignore and harder to update safely.

Embedded DevOps is the delivery model for that environment. It gives a disciplined way to release, validate, and support firmware changes across thousands of deployed devices without turning an update into a shutdown.

How Embedded Systems Run Plant Operations

Embedded systems support jobs where timing slips show up immediately. A servo may correct position 10,000 times each second, and a vision system may reject a defective part in less than a millisecond. That work stays on the device rather than in the cloud because adding network latency or connection loss to the control path is unacceptable.

That local processing follows a continuous on-device cycle: sensors capture physical conditions such as position, speed, temperature, and current, and a processor (an MCU or MPU) runs the embedded software, typically on an RTOS or Linux. The control logic then checks those readings against rules, setpoints, and safety limits, and actuators such as motors, valves, and relays execute the resulting command.

The cycle repeats hundreds or thousands of times per second. That’s why predictable timing matters more here than in almost any other software.

Alongside the control loop, most plants run a second path for telemetry, diagnostics, and configuration. It touches every piece of equipment on the line: controllers, vision cameras, drives, AGVs, and condition monitoring nodes. Data flows upward through a gateway or edge layer into a stack of higher-level systems, each at a different scope and timescale.

At the shop floor, SCADA handles live monitoring and alarms — the operator's window into what the line is doing right now. One layer up, MES connects that real-time picture to production execution: work orders, quality records, traceability. Above that, cloud or analytics platforms collect data across sites for fleet-level monitoring and remote service.

The devices feeding this stack range from small microcontrollers handling a single control task to Linux-based edge computers running machine vision or on-device AI. That range matters because any update process has to work across all of it.

Why Embedded Delivery is Slow and High-Risk

A bad embedded release can stop a line, leave a device dead on boot, or create a safety incident. The software is tied to physical hardware, so validation depends on specific equipment, environmental conditions, and production context that are hard to reproduce.

Validation constraints and late surprises

HIL (hardware-in-the-loop) benches are expensive, limited in number, and hard to scale. Most teams have two or three for an entire product portfolio. That scarcity forces serialised testing, which pushes hardware-related issues late in the cycle, often to final integration, sometimes to the shop floor itself.

Compounding this: reproducing a build from three years ago means finding the exact compiler version, SDK, and hardware revision that existed then. Without disciplined build environment management, that's often impossible. The result is a rebuild that's slightly different from what originally shipped, with no way to detect the difference.

Hardware and variant complexity

A single update may need to run on thousands of machines, each with slightly different hardware. Over a ten-year product lifecycle, a manufacturer might replace a sensor or chip when the original is discontinued. A supplier changes a component without announcement. A customer in Germany runs custom safety logic that conflicts with the standard release. Each of these is a quiet fork in the test matrix, and the matrix compounds faster than any team can validate it manually.

Real-world release risk

In manufacturing, a software bug is a physical event. Unplanned downtime costs between $10,000 and $500,000 per hour, depending on the industry. At that level, even a short outage gets expensive fast. A bad update can send a specialist on-site to recover the system by hand. That is enough to make every firmware release slow, cautious, and heavily approved.

Security and compliance pressure

Patching embedded devices has always been operationally difficult. Now it's also a compliance requirement. Regulators and enterprise customers increasingly require a Software Bill of Materials (SBOM) — a full inventory of every software component inside a device, and expect vulnerabilities to be addressed within defined timeframes. The problem is that the same narrow maintenance windows that make updates risky also make rapid patching nearly impossible. Security and operational stability are pulling in opposite directions, and most embedded teams don't yet have a process that satisfies both.

Organizational friction

Development, QA, and operations often work in silos, with manual handoffs and paper approvals replacing automated checks. Nobody clearly owns the basic question of what software is running on which machines in the field, so when something breaks, teams end up tracing versions through spreadsheets, emails, and service notes instead of checking a reliable record. That slows containment and drags out release decisions, because nobody can say with confidence what is running where.

Embedded DevOps for manufacturers: the operating model that removes bottlenecks

When a field issue surfaces at 2 am, four things determine how fast you can respond: whether you can identify exactly what's running on the affected units, whether you can reproduce the build that shipped to them, whether you have test evidence showing what was validated and on what hardware, and whether there's a clear record of how that release was approved.

Embedded DevOps is the operating model that builds that path. It covers how a change becomes a signed, traceable release, how it's validated on real hardware, how it reaches the factory floor, and how it rolls out across deployed devices without putting production at risk.

1. Build and release integrity

Most embedded release problems trace back to the same two questions: what did we ship, and can we rebuild it exactly? Build integrity is what puts both within reach.

The foundation is repeatable builds: the same code and build inputs producing the same binary regardless of who runs it or where. In practice, that means pinning toolchains, compilers, and SDKs as versioned dependencies, standardizing the build environment (usually containerized), and recording build inputs on every run: repo revision, toolchain version, build flags, feature toggles, target profile. Without this, two engineers running the same build get subtly different outputs and have no way to detect the difference.
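
Recording build inputs on every run is what makes a later comparison possible. A minimal sketch, assuming a simple record with the fields listed above; the revision, toolchain, and profile values in the usage note are placeholders.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BuildInputs:
    repo_revision: str
    toolchain_version: str
    build_flags: tuple[str, ...]
    feature_toggles: tuple[str, ...]
    target_profile: str

def diff_inputs(a: BuildInputs, b: BuildInputs) -> dict[str, tuple]:
    """Return every recorded input that differs between two builds."""
    left, right = asdict(a), asdict(b)
    return {key: (left[key], right[key]) for key in left if left[key] != right[key]}
```

For example, comparing a shipped build recorded with a placeholder toolchain "gcc-12.2" against a rebuild recorded with "gcc-13.1" returns only the `toolchain_version` entry, which turns a subtle binary difference into a visible one.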

Once a build is a release candidate, it needs to be treated as a controlled product rather than a file on someone's laptop. That means:

Immutable artifacts: the same binary is promoted forward, never rebuilt for the same version
Clear identification: version and build ID linked to a specific commit and target device family
Signing at build time, verification at deployment
Central storage with metadata: supported targets, minimum bootloader version, compatibility notes

From there, artifacts move through stages: dev builds for daily work, validation builds backed by hardware test evidence, release builds approved for factory provisioning and field rollout. Only artifacts with the right evidence advance. That gate is what prevents a build that passed unit tests but never touched real hardware from reaching the factory floor.
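
The promotion gate can be expressed as a set comparison between the evidence an artifact carries and the evidence a stage requires. The evidence names below are assumptions chosen for illustration.

```python
# Evidence each stage requires before an artifact may enter it (assumed names).
REQUIRED_EVIDENCE = {
    "dev": set(),
    "validation": {"unit_tests", "signature_verified", "hardware_test_report"},
    "release": {"unit_tests", "signature_verified", "hardware_test_report", "release_approval"},
}

def missing_evidence(artifact_evidence: set[str], target_stage: str) -> set[str]:
    """Evidence still needed before the artifact can enter the target stage."""
    return REQUIRED_EVIDENCE[target_stage] - artifact_evidence

def can_promote(artifact_evidence: set[str], target_stage: str) -> bool:
    """Allow promotion only when all evidence for the target stage exists."""
    return not missing_evidence(artifact_evidence, target_stage)
```

An artifact that carries only `unit_tests` and `signature_verified` is rejected for the validation stage here, which is the case the gate exists to stop.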

2. Validation in layers (fast early, hardware where it matters)

Hardware-related issues are most costly after a change is already queued for a bench, a factory build, or a site rollout. The layered approach exists for one reason: to catch problems as early as possible and save limited HIL benches for where they're genuinely needed.

Per-change gates: unit checks, static analysis, packaging and signature verification. Fast enough to run on every commit, broad enough to catch most integration problems before anything touches hardware.
SIL (software-in-the-loop): timing edge cases, protocol logic, regression across configurations. Anything that you can prove in simulation gets proven here, without competing for bench time.
HIL (hardware-in-the-loop): reserved for what only hardware can prove: sensor behavior, timing jitter, driver interactions, power and thermal limits. Routing every change through HIL is what turns benches into bottlenecks.
Release readiness: boot and update paths, including failure cases, safety and stop behavior, performance under load. The final gate before anything reaches the factory floor.
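
The layers above can be sketched as a simple selection rule. The area names are hypothetical, and running SIL on every change is an assumption of this sketch rather than a fixed rule.

```python
# Hypothetical areas where only hardware can prove a change
# (sensor behavior, timing jitter, driver interactions, power and thermal limits).
HARDWARE_ONLY_AREAS = {"drivers", "sensor_io", "timing", "power", "thermal"}

def layers_for_change(touched_areas: set[str], release_candidate: bool = False) -> list[str]:
    """Return the validation layers a change should pass, in order."""
    layers = ["per_change", "sil"]          # assumed cheap enough to run every time
    if touched_areas & HARDWARE_ONLY_AREAS:
        layers.append("hil")                # spend bench time only when needed
    if release_candidate:
        layers.append("release_readiness")  # final gate before the factory floor
    return layers
```

Keeping the hardware-only list explicit makes bench usage a reviewable decision instead of a habit.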

3. Lab and factory readiness (hardware evidence + traceability)

Most teams treat the lab as a shared resource: a few benches, booked informally, with results that vary depending on who ran the test. At scale, that stops working. A lab-as-a-service model makes hardware testing consistent and predictable:

Scheduled access with queuing and reservations
Standardized remote controls for power cycling, flashing, and log capture
Automatic evidence capture on every run: firmware version, hardware revision, run ID, logs
One supported provisioning workflow instead of a collection of scripts that only one engineer fully understands

Factory integration is a different problem. A factory-ready pipeline provisions device identity, locks in calibration and configuration, and records evidence that enables containment when something goes wrong in the field. Every shipped unit needs a traceable thread connecting it back to its release:

Serial number and device identity
Firmware build ID and configuration version
Calibration records and end-of-line test results
Shipment batch

Without that thread, containing a field issue means manually cross-referencing build logs, shipping records, and test results — work that can take days and still leave gaps.
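
With the thread in place, containment becomes a query. A minimal sketch, assuming one record per shipped unit with the fields listed above; the record layout is illustrative, not a real schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnitRecord:
    serial_number: str
    firmware_build_id: str
    config_version: str
    calibration_ref: str   # pointer to calibration and end-of-line test results
    shipment_batch: str

def affected_units(records: list[UnitRecord], bad_build_id: str) -> dict[str, list[str]]:
    """Group the serial numbers running a given build by shipment batch."""
    by_batch: dict[str, list[str]] = {}
    for rec in records:
        if rec.firmware_build_id == bad_build_id:
            by_batch.setdefault(rec.shipment_batch, []).append(rec.serial_number)
    return by_batch
```

Grouping by shipment batch answers the first containment question directly: which customers and sites need to hear about the problem.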

4. Fleet operations and risk control

Deploying to thousands of devices in the field is where a bad release does the most damage and where the ability to intervene is most limited. The pipeline doesn't end at the factory floor.

Safe rollouts

Most rollout failures come from expanding too fast, before there is enough evidence that the update is stable in real conditions. The fix is a staged deployment with hard health gates.

Rollout sequence: internal and lab devices → pilot line or site → phased expansion by plant and device family
Expansion criteria: stability and boot behavior, plausible sensor ranges, communications under load, control-loop timing, fault and alarm rates
Recovery readiness: rollback and safe-mode behavior defined before rollout starts, with A/B partitions or an equivalent mechanism tested as part of release readiness
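
The rollout sequence and its health gates can be sketched as a small decision function. The stage names, metrics, and limits are hypothetical; real limits depend on the device family.

```python
ROLLOUT_STAGES = ["internal_lab", "pilot_site", "plant_phase_1", "plant_phase_2"]

# Hypothetical hard health gates; any value above its limit blocks expansion.
HEALTH_LIMITS = {
    "boot_failure_rate": 0.0,
    "fault_rate_per_device_day": 0.05,
    "control_loop_overruns": 0,
}

def next_action(stage: str, health: dict[str, float]) -> str:
    """Decide whether to expand, finish, or roll back after a stage."""
    breached = [name for name, limit in HEALTH_LIMITS.items() if health[name] > limit]
    if breached:
        return "rollback: " + ", ".join(breached)
    position = ROLLOUT_STAGES.index(stage)
    if position == len(ROLLOUT_STAGES) - 1:
        return "complete"
    return "expand to " + ROLLOUT_STAGES[position + 1]
```

Because the gates are data rather than judgment calls made during the rollout, the expansion decision is the same at 2 am as it is in a planning meeting.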

Support also needs structured logs, crash data where feasible, and a diagnostics playbook that works under pressure.

Controls that match the risk

The right amount of process depends on the change. Updating a timing-critical safety path isn’t the same decision as changing a configuration parameter, and treating them the same way is what slows teams down without making releases safer. Test tiers should reflect that, aligned to change impact across per-change, nightly, and pre-release stages.

Security, compliance, and variant management follow the same logic. SBOM generation, signature verification at deployment, and a record of what is running where belong in the pipeline by default. So do explicit versioning rules across SKUs, hardware revisions, and supplier changes, with defined compatibility contracts and support horizons.

SciForce case study: Safeguarding Cooling Systems to Save a Data Center

A technology company operating large data centers had a recurring issue: a critical pump in the cooling system kept failing without warning. Each failure led to unplanned downtime. Regular inspections didn’t solve it because the team usually discovered the problem only after the pump had already failed.

Cooling systems are controlled and monitored through on-site industrial equipment (sensors, controllers, and gateways). The value comes from fast detection close to the equipment and reliable signals that can trigger action before a breakdown – exactly the kind of environment where embedded and edge systems live.

Key constraint: the available sensor data wasn’t labeled with “failure / no failure,” so a standard supervised predictive model couldn’t be trained immediately.

What SciForce built

SciForce created a real-time anomaly detection pipeline using data from 100+ sensors (temperature, pressure, flow rate, and other operational readings). To reduce noise and improve reliability, we applied multiple anomaly detection methods (including Isolation Forest, ECOD, and One-Class SVM) and used majority voting: an event was flagged only when most methods agreed.
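
The voting step itself is simple once each method has produced a flag per reading. This sketch shows only the vote; the per-method flags are made-up inputs, and the detectors themselves are not implemented here.

```python
def majority_vote(flags_per_method: list[list[bool]]) -> list[bool]:
    """Flag a reading only when more than half of the methods flag it."""
    if not flags_per_method:
        return []
    n_methods = len(flags_per_method)
    n_steps = len(flags_per_method[0])
    if any(len(flags) != n_steps for flags in flags_per_method):
        raise ValueError("every method must score the same readings")
    return [
        sum(flags[i] for flags in flags_per_method) * 2 > n_methods
        for i in range(n_steps)
    ]

# Hypothetical per-method outputs for five consecutive readings.
isolation_forest = [False, True, False, True, False]
ecod = [False, True, False, False, False]
one_class_svm = [True, True, False, False, False]
combined = majority_vote([isolation_forest, ecod, one_class_svm])
```

Only the second reading is flagged in `combined`: the first and fourth readings each trip a single method, which is exactly the kind of noise the vote is meant to filter out.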

We then compared detected anomalies with known pump replacement dates and used correlation analysis to identify which sensor patterns appeared consistently before failures. This narrowed monitoring down to four critical sensors and enabled an early-warning system that can be surfaced at the edge (local alerts) and/or forwarded upstream for monitoring and reporting.

Results

30% fewer false alarms
25% less unplanned downtime related to pump failures
20% faster maintenance response time
40% higher detection accuracy

Getting anomaly detection right took careful work: 100+ sensors, multiple methods, and majority voting to filter noise. Keeping it right requires an update process that doesn't quietly change what the system does. That's what embedded DevOps is built to protect.

Conclusion

Most firmware update processes run on assumptions — the build matches what shipped, hardware hasn't drifted since the last release. In manufacturing, broken assumptions show up on the floor.

Embedded DevOps puts evidence where the assumptions were. You know what's running, you can rebuild what shipped, and there's a recovery path that's been tested rather than improvised. Firmware updates don't get easier. The risks just stop being surprises.

If that gap sounds familiar, SciForce runs readiness assessments that show exactly where the process breaks down and what it takes to fix it.

Agentic AI vs. Chatbots: Why 40% of Enterprises Are Switching to Autonomous Workflows

SciForce — Wed, 18 Mar 2026 16:22:03 +0000

Introduction: The Shift from Conversational AI to Autonomous Execution

Chatbots helped businesses get started with AI, but their impact has been limited — they respond to questions, follow scripts, and stop at the conversation. They don’t take action.

AI agents do. These systems can plan, decide, and carry out tasks across tools like CRMs, ERPs, and internal platforms — all with minimal human input. They act more like digital team members than assistants.

Gartner projects that by 2026, 40% of enterprise applications will include task-specific AI agents, up from under 5% in 2025. According to Cloudera, 96% of enterprises are expanding their use of AI agents, especially in operations, analytics, and IT.

This article breaks down what AI agents are, how they differ from traditional chatbots, where they’re already being used, and why they’re becoming essential to the next phase of enterprise automation.

What Is an Autonomous AI Agent, and Why It’s More Than a Chatbot

Autonomous AI agents are software systems that set goals, make decisions, and complete tasks across business tools with minimal human involvement. They operate independently, respond to real-time changes, and take action based on triggers, schedules, or incoming data.

These agents can manage multi-step workflows across platforms like CRMs, ERPs, and internal applications. They stay active, adapt to new information, and carry out tasks such as tracking progress, sending updates, or moving work through systems.

With their speed, flexibility, and ability to work across systems, AI agents are becoming a valuable part of how enterprises streamline operations and scale efficiently.

Core Capabilities

Autonomous AI agents stand out by combining several advanced abilities that allow them to operate across complex enterprise environments. These core capabilities make them well suited for high-impact, repetitive, or time-sensitive tasks:

1. Goal understanding: A request comes in (a user message, a system event, or a scheduled trigger). The agent identifies the goal, the objects involved (lead, ticket, invoice, KPI), and the expected output.

2. Planning: It creates a short plan: which steps to run, what data is needed, which tools to use, and what a successful result looks like.

3. Multi-step execution: The agent runs the steps in order. Each step produces an intermediate result that guides the next step until the workflow is complete.

4. Tool integration: It connects to business systems through APIs or connectors to read records, update fields, create tasks, send messages, or trigger automations.

5. Memory & context: It keeps track of what has happened in the workflow and uses relevant history when needed, such as prior actions, open tasks, or preferences.

6. Quality checks: Before sending a final answer or taking an action, it verifies key data points, checks consistency, and flags uncertain results.

7. Human oversight: For higher-risk actions or unclear cases, it pauses and asks for approval or escalates to a person with a clear summary and recommended next steps.

8. Security & access: All actions follow permissions and policy rules. Sensitive data is protected, and key actions are logged for auditing.

9. Monitoring: It records operational metrics such as success rate, speed, tool errors, and cost, so teams can measure performance and improve the system over time.

Together, these capabilities let an agent turn requests or system events into completed work across business tools. It can run tasks step by step, keep context, check results, and escalate unclear cases—while following access rules and tracking performance.
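
A heavily simplified sketch of how multi-step execution, shared context, and human oversight fit together. The `Step` and `RunLog` types and the example plan are hypothetical; planning, tool connectors, and access control are left out.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[dict], dict]  # reads the context, returns new facts
    high_risk: bool = False         # high-risk steps need human approval

@dataclass
class RunLog:
    completed: list[str] = field(default_factory=list)
    escalated: list[str] = field(default_factory=list)

def run_workflow(plan: list[Step], context: dict,
                 approve: Callable[[Step, dict], bool]) -> RunLog:
    """Run steps in order, carrying context forward and pausing on risk."""
    log = RunLog()
    for step in plan:
        if step.high_risk and not approve(step, context):
            log.escalated.append(step.name)   # stop and hand over to a person
            break
        context.update(step.action(context))  # intermediate result feeds the next step
        log.completed.append(step.name)
    return log

# Hypothetical two-step plan: look up an invoice, then refund it.
plan = [
    Step("load_invoice", lambda ctx: {"amount": 1200}),
    Step("issue_refund", lambda ctx: {"refunded": ctx["amount"]}, high_risk=True),
]
```

When the `approve` callback declines, the run stops before the refund and the log records the escalation, so the person taking over sees what was done and what is waiting.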

What About Chatbots and Copilots?

Many organizations began their AI journey with chatbots — simple tools built to handle FAQs, support tickets, and basic customer service tasks. More recently, AI copilots have entered the picture, offering helpful suggestions, content generation, and automation within specific apps like Microsoft 365 or Salesforce.
Both have proven useful in supporting productivity and handling repetitive requests. However, their capabilities are limited when it comes to running real business operations:

Chatbots are designed for short, reactive conversations.

-- They work well for high-volume tasks like password resets or order status checks.
-- But they lack memory, initiative, and the ability to execute multi-step processes.
-- They typically operate on the surface of systems, without deep integration.

Copilots provide more intelligent assistance within tools.

-- They help users draft emails, summarize documents, or trigger in-app automation.
-- But they still rely on user input, don’t retain long-term context, and remain confined to single platforms.
-- They cannot act independently or coordinate tasks across systems.

While both play a role in improving user experience and reducing task load, they’re ultimately support tools — not autonomous workers. For enterprises aiming to coordinate complex workflows, automate decisions, and scale operations without scaling headcount, AI agents offer the next level of capability.

Why Enterprises Are Switching to AI Agents

Many companies are looking for ways to move faster, cut manual work, and handle more complex operations without adding extra staff. Tools like chatbots and basic automation can help with small, routine tasks — but they’re limited when it comes to connecting systems or making decisions. AI agents fill that gap. They run entire workflows from start to finish, work across platforms like CRMs or ERPs, and respond to changes in real time.

- Operational efficiency at scale

AI agents automate manual, high-volume tasks across departments like finance, IT, HR, and sales — cutting workload and speeding up execution. Organizations report over 60% reduction in manual work when using agents for internal processes. In sales, for example, agents now handle lead follow-up, outreach, and CRM updates that previously required dedicated staff.

- Capabilities beyond chatbots and automation

Agents manage complex workflows like compliance checks, procurement coordination, and dynamic task routing. Unlike traditional tools, they adapt to changing inputs and operate across systems in real time.

- Strategic competitiveness

Companies see AI agents as critical to staying agile and efficient. 93% of IT leaders plan to deploy agents by 2025, aiming for faster decisions and better coordination across platforms.

- Always-on responsiveness

Agents work continuously in the background, reacting instantly to triggers, data changes, and events, helping teams respond faster and avoid delays in areas like support or supply chain.

- Enterprise-ready deployment models

Adoption is growing fast: 66% of companies are building agents on AI infrastructure platforms like Azure or AWS, while 60% are using agent capabilities already built into platforms like Salesforce or Microsoft Dynamics.

AI Agents Across US and European Markets

AI agents are moving from pilots to real use in industries where work is complex and heavily process-driven. In many cases, they handle high-volume, multi-step tasks inside business systems, while people oversee exceptions and controls. The examples below show how this is happening in finance, logistics, and healthcare across the US and Europe, followed by the main challenges leaders should plan for before scaling.

Finance

Banks are moving beyond basic GenAI assistants toward autonomous, multi-step workflows in onboarding/KYC, back-office accounting, and financial crime operations:

Goldman Sachs has described building autonomous systems with Anthropic for trade and transaction accounting and for client vetting and onboarding.
JPMorgan is scaling its LLM Suite across the organization, with access for about 250,000 employees and roughly half using it nearly daily, and has begun deploying agentic AI for more complex tasks, including generating an investment banking deck in about 30 seconds.
McKinsey reports the largest gains come when agents run end-to-end compliance workflows with human oversight: one practitioner can typically supervise 20+ agents, enabling ~200%–2,000% productivity gains in KYC/AML in their experience.

Logistics / supply chain

Reuters reports that freight and logistics players including DHL, Ryder, and Flexport are among 70+ enterprise customers using AI agents. These deployments target routine coordination tasks that slow operations down at scale, such as rate negotiation and appointment booking – work that otherwise ties up teams with high-volume calls, emails, and status updates.

Healthcare

Healthcare is starting to use AI agents in areas where automation can be controlled and supervised, such as patient outreach, scheduling, and revenue-cycle operations. Universal Health Services has deployed Hippocratic AI’s agents to make post-discharge follow-up calls, with escalation to staff when needed. In the UK, Somerset NHS Foundation Trust reports that an outpatient booking virtual assistant is projected to save 600 staff hours per week and £456,000 per year at target adoption. McKinsey also estimates that agent-driven revenue-cycle workflows could cut providers’ cost to collect by 30–60% by automating steps like eligibility checks, denials handling, and follow-ups under governance.

Challenges and What to Plan For

AI agents can bring major improvements to how businesses work, but there are also challenges to consider before rolling them out. A recent Cloudera report (2025) shows that the top concerns for companies are data privacy (53%), connecting with older systems (40%), and high setup costs (39%). These are valid concerns — but with the right preparation around systems, oversight, and team support, businesses can manage the risks and get strong results from using agents.

- Trust and Oversight
Right now, only 27% of organizations fully trust AI agents. For agents to take action safely, companies need ways to review, explain, and control what the agent does. Adding human checks, alerts, and clear logs helps build confidence — especially in industries with strict rules.

- System Integration
Many older systems weren’t built to work with AI agents. Without the right APIs or data access, agents can’t do their job. Companies need to assess where updates are needed and make sure tools can connect and share data reliably.

- Changing Roles and Teams
As agents take over repetitive tasks, people’s roles shift toward supervising, reviewing, and improving outcomes. This brings new KPIs and the need for training. Teams should prepare for new workflows and invest in skills that support working alongside AI.

- Compliance and Ethics
Rules like GDPR and the EU AI Act require companies to keep AI decisions clear, fair, and traceable. It’s important to build in ways to monitor agent behavior, explain results, and follow local regulations.

Case study: From Legacy Chatbot to Advanced Enterprise Analytics with LLM Integration

A multi-industry enterprise performance management provider built an AI-enabled platform to centralize business metrics and improve decision-making. In practice, the product interprets user goals (e.g., “why did hiring slow down?”), retrieves the right data across systems, applies policy controls, and returns validated outputs as summaries, reports, or alerts.

What was holding them back

The client’s constraints were mainly about reliable execution across systems:

Fragmented data meant the tool couldn’t reliably execute cross-system requests (HR + CRM + finance + ops) without manual reconciliation.
LLM overuse made the “brain” too expensive and slow for routine actions (simple lookups shouldn’t require full reasoning).
Accuracy risk created low trust in decisions, especially for executive dashboards and KPI explanations.
Security and compliance requirements required strict tool permissions and auditability before any autonomous execution could be considered safe.
Unstructured inputs needed an efficient pipeline so the tool could “read” documents without turning every step into a costly LLM call.

What SciForce implemented

SciForce redesigned the legacy Rasa-based chatbot into an intelligent execution workflow that combines orchestration, tool use, and controls:

- Single source of truth (tool-ready data layer): unified HR, CRM, finance, and operational data so an agent can retrieve consistent KPI evidence across systems.
- Hybrid routing (agent orchestration): the system decides how to execute each request: fast retrieval/rules for lookups, LLM reasoning for complex tasks like summarization, trend analysis, and forecasting.
- Guardrails + validation (safe agent behavior): query filtering, response checks, role-based access control, and audit logs—so the agent can act within policy and reduce misleading outputs.
- Document intelligence pipeline (multi-tool execution): parsers for structured sources, LLM only when ambiguity requires deeper interpretation, reducing cost while keeping coverage broad.
- API-first modular design (scalable tool integration): microservices + APIs so the agent can plug into enterprise systems, scale, and deploy cloud or on-prem depending on governance requirements.
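The hybrid routing idea can be illustrated with a small sketch. This is not the client system's implementation: the KPI store, the regular expression, and the `call_llm` placeholder are all assumptions made for the example. It only shows the general pattern of answering simple lookups cheaply and reserving the LLM path for everything else.

```python
import re

# Hypothetical KPI store standing in for the unified data layer.
KPI_STORE = {"headcount": 412, "open_roles": 37}

# Assumed pattern for "simple lookup" questions; a real router would be richer.
LOOKUP_PATTERN = re.compile(r"^(what is|show|get)\s+(the\s+)?(?P<kpi>[a-z_ ]+?)\??$")


def call_llm(question: str) -> str:
    """Placeholder for an LLM call; a real system would invoke a model here."""
    return f"[LLM analysis requested for: {question}]"


def route(question: str) -> tuple[str, str]:
    """Return (path, answer): cheap lookup when possible, LLM otherwise."""
    match = LOOKUP_PATTERN.match(question.strip().lower())
    if match:
        key = match.group("kpi").strip().replace(" ", "_")
        if key in KPI_STORE:
            return "lookup", str(KPI_STORE[key])
    # Unknown KPIs and open-ended questions fall through to the LLM path.
    return "llm", call_llm(question)


print(route("What is the headcount?"))
print(route("Why did hiring slow down?"))
```

Logging which path each request takes is what makes a figure like "reduction in LLM usage" measurable in the first place.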

Results

The redesigned system delivered measurable improvements in execution efficiency, reliability, and trust:

58% reduction in manual reconciliation of metrics (less human “glue work” between tools)
68% reduction in hallucination rate (higher trust in agent outputs)
37-46% reduction in LLM usage (smarter orchestration, lower cost)
32-38% lower latency for simple lookups (faster routine execution)
39% reduction in AI processing costs (better resource allocation)
47% reduction in dashboard navigation time (faster access to answers for execs/analysts)

Conclusion

For most organizations, the opportunity with AI agents is simple: faster execution across the systems where work already happens. Start with one workflow that repeats daily, define guardrails and escalation rules, and measure impact with a short scorecard: time saved, cost per case, error rate, and adoption. Once the numbers hold, scaling becomes a business decision, not a technical debate.

Which workflow would you want to automate first – and what result would make the pilot a clear win?

The Rise of Virtual Hospitals: How AI Copilots are Managing the Full Patient Journey

SciForce — Thu, 12 Mar 2026 11:21:09 +0000

Introduction

The COVID-19 pandemic changed how healthcare works. When in-person visits dropped, telehealth, remote monitoring, and home care quickly became necessary, and many of these solutions are now here to stay.

Virtual hospitals and AI copilots are leading this shift. Virtual hospitals use video calls, remote monitoring, and mobile care teams to deliver hospital-level care at home. AI copilots support clinicians by drafting, summarizing, coding, and prioritizing information, while clinical decisions remain clinician-owned, with clear override mechanisms and auditability.

In 2025 surveys, documentation was the dominant AI use case; reported time savings (1-4 hours per day at the upper end) varied widely by workflow and measurement method. In the same surveys, administrative inbox automation (including faxes) was also reported as a material efficiency gain, but results depend on how “time saved” is measured and verified.

For healthcare leaders, virtual care and AI are becoming central to staying competitive. The strategic question is no longer whether virtual care and AI are feasible, but whether they can be deployed safely and measured reliably at scale.

The Virtual Hospital: A New Care Delivery Architecture

In this article, “virtual hospital” refers to two related models:

Hospital-at-home — substitutive acute inpatient-level care delivered at home
Virtual wards — remote monitoring and rapid response supporting early discharge or step-down care

These models deliver inpatient-level protocols and oversight for selected patients. Rather than replicating full inpatient infrastructure at home, both hospital-at-home and virtual ward models achieve safety through continuous monitoring, rapid escalation, and explicit eligibility criteria. Chronic Remote Patient Monitoring (RPM) may rely on a similar technology stack but remains operationally distinct from substitutive acute care, with different eligibility criteria and KPIs.

Programs should state upfront: who qualifies, who does not, and what triggers immediate escalation.

Scaling a virtual hospital is as much regulatory and financial as it is clinical. The model must map to reimbursable pathways (acute substitutive care vs step-down monitoring vs chronic RPM), define clinician accountability, and ensure credentialing and licensure for the jurisdictions served. Operationally, this includes documentation standards, consent and privacy requirements, device data policies, and clear liability boundaries for escalation decisions and adverse events.

Care is coordinated from a central clinical hub, while in-home services, including nursing, phlebotomy, imaging, infusions, oxygen setup, and medication delivery, provide the hands-on layer required for acute pathways. Through video visits, remote vital monitoring, and shared EHRs, patients remain continuously connected to their care team. This enables coordinated management of conditions such as post-surgical recovery, heart failure, chronic obstructive pulmonary disease (COPD) and infections. Further, operationally defined SLAs (not general principles), conservative thresholds and explicit decision rights ensure that escalation is fast, consistent, and auditable.

System impact should be measured with operationally defined KPIs:

An ‘avoided admission’ should be counted only when a patient meets pre-defined clinical criteria that would ordinarily trigger admission (e.g., ED evaluation + admission order intent, or protocol-defined admission threshold) but is safely managed at home without inpatient admission within a defined window (e.g., 72 hours).
‘Avoided bed-days’ should be calculated as the difference between expected inpatient LOS for a matched pathway and actual days managed virtually, using the same attribution rules.
Alert performance should be tracked as: alert rate per patient-day, actionable alert yield (% leading to intervention), time-to-acknowledge, and time-to-intervention - measured from system timestamps, not self-report.
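As a rough illustration of the definitions above, the sketch below computes avoided bed-days and the alert metrics from system timestamps. The `Alert` record and function names are hypothetical, and the use of means (rather than medians or percentiles) is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    fired: datetime
    acknowledged: datetime
    intervened: Optional[datetime]  # None when no intervention followed


def avoided_bed_days(expected_los_days: float, virtual_days: float) -> float:
    """Expected inpatient LOS for the matched pathway minus days managed virtually."""
    return expected_los_days - virtual_days


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def alert_performance(alerts: list[Alert], patient_days: float) -> dict:
    """Alert rate, actionable yield, and response times from system timestamps."""
    if not alerts or patient_days <= 0:
        raise ValueError("need at least one alert and a positive patient-day count")
    actionable = [a for a in alerts if a.intervened is not None]
    ack = [_minutes(a.fired, a.acknowledged) for a in alerts]
    act = [_minutes(a.fired, a.intervened) for a in actionable]
    return {
        "alert_rate_per_patient_day": len(alerts) / patient_days,
        "actionable_yield_pct": 100 * len(actionable) / len(alerts),
        "mean_time_to_ack_min": sum(ack) / len(ack),
        "mean_time_to_intervention_min": sum(act) / len(act) if act else None,
    }


# Illustrative data only.
t0 = datetime(2026, 1, 1, 8, 0)
sample = [
    Alert(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=20)),
    Alert(t0, t0 + timedelta(minutes=15), None),
]
metrics = alert_performance(sample, patient_days=4)
```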

In addition, the safety of a virtual hospital depends on data governance and auditability. Every transformation - unit normalization, terminology mapping, threshold logic, and risk score configuration - should be version-controlled, traceable, and reviewable, with clear ownership for changes. Data quality checks should run continuously (missingness, out-of-range values, device connectivity gaps, timestamp integrity, and duplicate events). For AI components, drift monitoring must be explicit: changes in population case-mix, sensor behavior, or documentation patterns should trigger recalibration reviews and, when needed, rollback to a prior validated configuration.
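The continuous data quality checks listed above could look roughly like the following sketch for a single patient's stream of readings. The record shape, the `PLAUSIBLE` ranges, and the issue codes are assumptions for illustration; real limits would be set under clinical governance.

```python
from datetime import datetime, timedelta

# Hypothetical plausibility ranges, not clinical guidance.
PLAUSIBLE = {"heart_rate": (20, 250), "spo2": (50, 100)}


def check_readings(readings: list[dict], max_gap: timedelta) -> list[tuple[str, int]]:
    """Return (issue_code, reading_index) pairs for one patient's readings,
    processed in arrival order."""
    issues = []
    seen = set()
    prev_ts = None
    for i, r in enumerate(readings):
        ts, kind, value = r["ts"], r["kind"], r.get("value")
        # Missingness and out-of-range values
        if value is None:
            issues.append(("missing", i))
        elif kind in PLAUSIBLE:
            low, high = PLAUSIBLE[kind]
            if not low <= value <= high:
                issues.append(("out_of_range", i))
        # Duplicate events: same timestamp and measurement type
        if (ts, kind) in seen:
            issues.append(("duplicate", i))
        seen.add((ts, kind))
        # Timestamp integrity and device connectivity gaps
        if prev_ts is not None:
            if ts < prev_ts:
                issues.append(("out_of_order", i))
            elif ts - prev_ts > max_gap:
                issues.append(("gap", i))
        prev_ts = ts
    return issues


# Illustrative data only.
t0 = datetime(2026, 1, 1, 8, 0)
readings = [
    {"ts": t0, "kind": "heart_rate", "value": 72},
    {"ts": t0, "kind": "heart_rate", "value": 72},
    {"ts": t0 + timedelta(minutes=5), "kind": "spo2", "value": 120},
    {"ts": t0 + timedelta(hours=2), "kind": "heart_rate", "value": None},
    {"ts": t0 + timedelta(hours=1), "kind": "heart_rate", "value": 80},
]
found = check_readings(readings, max_gap=timedelta(minutes=30))
```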

How the Architecture Works (System View)

The three-layer operating model describes who does what; the five-domain technology stack describes which systems enable it.

Patient-Side Care Layer

This layer is where care is delivered to the patient at home. It includes remote monitoring devices, video consultations, and mobile clinical teams. Vital signs are tracked through connected tools, while nurses and other clinicians provide in-home services such as check-ups, tests, imaging, and medication administration.

Hospital-at-home delivers inpatient-level protocols and oversight for selected patients, supported by continuous monitoring and rapid escalation rather than on-site hospital infrastructure. Eligibility depends on clinical stability, predictable care needs, adequate home environment, social support, and the ability to escalate safely when required.

Orchestration & Data Layer

This layer orchestrates care delivery by connecting clinical teams, patients, and operational workflows into a unified system. It integrates EHRs with data from monitoring devices, labs, and imaging while coordinating staffing, equipment, medication delivery, and transport. AI supports triage, risk scoring, and real-time alerts to enable early detection of deterioration and timely intervention.

At scale, AI-driven triage and risk scoring require clinical-grade governance, including version-controlled logic, auditability, continuous performance monitoring, and recalibration to mitigate model drift and alert fatigue. Operational deployment must align with reimbursement, licensure, and medico-legal accountability frameworks.

Clinical Command Layer (24/7)

A multidisciplinary team monitors incoming remote patient monitoring (RPM) data streams (vitals, symptom reports, and results as they are finalized), resolves alerts, and executes escalation pathways: virtual consults, dispatch of in-home teams, and rapid transfer to the emergency department (ED) or inpatient care when thresholds are met.

Technology Stack

Rather than relying on a single platform, the virtual hospital is built on integrated capability layers that together form a digital and clinical operating system, supporting continuous data capture, communication, clinical intelligence, care coordination, and system-wide integration across the full patient journey.

- Sensing (data capture)

Remote patient monitoring devices, wearables, and diagnostic peripherals that collect vital signs and clinical measurements.
Examples: Philips RPM, Masimo, iRhythm (ECG), Dexcom (glucose), Omron (BP), Current Health (acquired by Best Buy Health and later divested back to its co-founder in 2025).

- Communication (clinical interaction)

Secure video, messaging, and virtual ward platforms used for consultations and team coordination.
Examples: consumer telehealth platforms (e.g., Teladoc/Amwell), enterprise collaboration (e.g., Teams/Zoom for Healthcare), and national virtual visit services (e.g., NHS Attend Anywhere).

- Intelligence (AI and analytics)

AI systems for triage, risk prediction, clinical decision support, and early-warning alerts.
Examples: Corti (clinical copilot and documentation), Viz.ai (stroke detection), Aidoc (radiology AI), Azure Health Bot.
Early warning scores embedded in EHRs (including proprietary deterioration indices) can support escalation workflows, but performance is context-dependent and requires local validation and ongoing calibration.

- Coordination (workflow and logistics)
Scheduling, routing, care pathway automation, and home-care orchestration.
Examples: Medically Home (now DispatchHealth), Epic Care Coordination, Salesforce Health Cloud, GetWell, WellSky.

- Integration (clinical backbone)

Interoperable EHRs and connected imaging, lab, and pharmacy systems that provide a unified patient record.
Examples: clinical information systems (Epic, MEDITECH, Veradigm); picture archiving and communication systems (PACS) from GE Healthcare and Siemens Healthineers; pharmacy systems such as Omnicell and BD Pyxis.

These layers together form the digital and operational foundation that enables virtual hospitals to deliver coordinated, continuously monitored care as an integrated system, rather than as standalone telehealth services.

AI Copilots: The Digital Workforce of Modern Care

AI copilots are software assistants embedded into healthcare workflows that support clinicians in real time. They process clinical interactions and patient data, generate documentation, flag risks, and assist with decision-making across the care process. Positioned as workflow and attention management systems, AI copilots summarize, draft, and prioritize, while clinical decisions remain clinician-owned with explicit audit trails and override mechanisms. Unlike traditional tools that handle isolated tasks, AI copilots work across systems and workflows, reducing administrative burden and improving efficiency, especially in virtual and hybrid care models that require continuous monitoring and coordination.

Key Functions and Value of AI Copilots

AI copilots support clinical teams by handling routine work and highlighting important information at the right time.

- Automated documentation and coding:
AI copilots capture clinical conversations and patient details to create notes, summaries, and codes, reducing manual paperwork and documentation errors.

- Predictive support for triage and patient risk:
When implemented with the governance described above, AI copilots help identify higher-risk patients and support faster, more accurate triage decisions by analyzing vital signs, test results, and symptoms.

- Patient interaction through natural language:
Chat and voice tools allow patients to report symptoms, ask questions, and receive guidance, while collecting structured information for care teams.

- Real-time alerts and decision support:
AI copilots notify clinicians of changes or risks that need attention, helping teams respond quickly and safely without unnecessary alerts. Noise reduction is not a one-time feature: it requires continuous measurement of alert burden per clinician, time-to-acknowledge, and escalation yield, with thresholds adjusted under clinical governance.

AI Copilots in Real Clinical Use

AI copilots are already being used in healthcare as clinician-facing assistants built directly into daily workflows. These systems work continuously in the background, reduce administrative effort, and support clinical decisions rather than performing isolated tasks.

- Nuance DAX Copilot (Microsoft)

An ambient AI copilot that listens to clinician–patient conversations and automatically creates clinical notes inside the EHR. Vendor case studies report significant per-encounter time savings (7 minutes per patient); measured impact varies widely across organizations depending on workflow, baseline documentation burden, and how “time saved” is captured.

- Corti (NHS and emergency care)

A real-time clinical copilot used in emergency and urgent care settings. It supports documentation and highlights quality and safety issues during live interactions. According to vendor-reported data, deployments show up to 80% less documentation time and 40% fewer errors.

- Innovaccer Provider Copilot
Provider copilots such as Innovaccer’s are designed to pre-summarize the chart, draft notes, and surface care gaps before and after visits, aiming to reduce cognitive load and standardize follow-through.

A Practical Guide to Implementing Virtual Hospitals and AI Copilots

As virtual hospitals and AI copilots become part of everyday healthcare, the main challenge is no longer adopting new tools, but making them work reliably at scale. Many organizations already use virtual care or AI, yet struggle to turn these efforts into a consistent operating model.

This guide focuses on the practical choices that help healthcare teams implement virtual hospitals and AI copilots effectively in daily clinical operations.

Step 1: Define the scope before the technology

A common early mistake is trying to virtualize everything at once. Successful programs begin with a narrow, clearly defined scope.
This typically includes:

Specific patient cohorts, such as post-acute recovery, chronic condition monitoring, or early discharge cases
Clear clinical boundaries that define what can be treated virtually and when escalation to in-person care is required
A limited set of workflows to virtualize first

Virtual hospitals work best where monitoring is frequent, deterioration can be identified early, and escalation pathways are well defined. Starting with a focused scope helps teams build safety, trust, and operational clarity before expanding to broader use cases. Safety depends on explicit eligibility and exclusion rules - clinical stability, predictable trajectory, home environment readiness, and defined “no-go” conditions - rather than broad promises of “hospital-level care for everyone.”
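One way to keep eligibility and exclusion rules explicit is to encode them as a small, reviewable function that returns both a decision and the reasons behind it. The sketch below is illustrative only: the `Candidate` fields mirror the criteria named above, and the example no-go label is hypothetical rather than a clinical recommendation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    clinically_stable: bool
    predictable_trajectory: bool
    home_ready: bool
    no_go_conditions: list[str]  # hypothetical labels, e.g. "needs_icu_level_care"


def eligibility(c: Candidate) -> tuple[bool, list[str]]:
    """Return (eligible, reasons); any failed rule or no-go condition excludes."""
    reasons = []
    if not c.clinically_stable:
        reasons.append("not clinically stable")
    if not c.predictable_trajectory:
        reasons.append("trajectory not predictable")
    if not c.home_ready:
        reasons.append("home environment not ready")
    reasons += [f"no-go: {cond}" for cond in c.no_go_conditions]
    return (not reasons, reasons)


ok = eligibility(Candidate(True, True, True, []))
blocked = eligibility(Candidate(True, True, False, ["needs_icu_level_care"]))
```

Returning the reasons alongside the decision keeps each exclusion auditable and makes it easier to see which rule is filtering out the most patients.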

At this stage, SciForce works with healthcare teams to translate clinical goals into clearly defined patient cohorts, data requirements, and initial workflows that can be safely supported by virtual care and AI copilots.

Step 2: Assign single ownership, not shared responsibility

Virtual hospitals and AI copilots often lose momentum when ownership is unclear. When too many teams share responsibility, decisions slow down and accountability fades. In successful programs:

One executive is clearly responsible for results
Clinical, operational, and digital teams support the program, but do not jointly own it
Decision-making authority for clinical rules, escalation paths, and technology choices is clearly defined

Organizations that make progress treat virtual care as a core service with clear leadership, not as a side project spread across multiple teams.

Step 3: Integrate into existing workflows before adding intelligence

AI copilots deliver real value only when they are embedded into everyday clinical workflows. Tools that sit outside core systems may perform well in pilots, but they are rarely used consistently in routine care.

In practice, this means copilots must deliver documentation, alerts, and clinical summaries inside the EHR, without requiring clinicians to switch tools or manage parallel processes. In virtual hospitals, copilots act as the connective layer between continuous care activity and the clinical record, translating ongoing monitoring and interactions into usable, timely information.

At this stage, a common blocker is fragmented and inconsistently coded medical data, which limits what copilots can reliably surface. Data quality and model governance are prerequisites: provenance, terminology consistency, and auditable transformations are required before AI outputs can be safely embedded into clinical workflows. Jackalope, developed by the SciForce team, automates the standardization of clinical data (EHRs, claims, registry, and clinical trial data), improving mapping precision by up to 25% and reducing processing time by 50% compared to manual mapping.

Step 4: Use AI to prioritize attention, not replace judgment

In virtual hospitals, continuous monitoring generates far more data than clinical teams can review manually. AI copilots are most effective when they manage this information flow and protect clinician attention, rather than attempting to automate clinical decisions.

- Filter high-volume data in real time
AI systems continuously analyze vital signs, lab results, device data, and patient-reported inputs, reducing noise and identifying early signs of deterioration.

- Escalate only actionable cases
Instead of sending constant alerts, AI prioritizes patients and events that require timely human intervention, helping teams respond before conditions worsen.

- Keep clinical decisions with clinicians
AI copilots should prioritize and summarize, while clinical decisions remain clinician-owned with auditability and clear escalation pathways. Patient similarity networks reinforce this model by providing contextual comparisons to similar cases, helping clinicians recognize meaningful deviations and assess risk without automating clinical judgment.

This model is especially important in virtual hospitals, where many patients are monitored at the same time. SciForce builds AI systems that help clinicians focus on the most important cases first, enabling faster and more effective responses while keeping all treatment decisions and escalation with human care teams.

Step 5: Design escalation pathways before launch

In virtual hospitals, safety depends on clear escalation rather than perfect prediction, with AI copilots identifying risk early and clinicians responding decisively.

1. Automated risk detection: AI continuously monitors patient data and flags early signs of deterioration.
2. Clinical review: A nurse or physician assesses the alert using recent trends and contextual information.
3. Remote intervention: Care is adjusted through virtual consultation or in-home services when appropriate.
4. In-person escalation: Patients are rapidly transferred to emergency or inpatient care when risk thresholds are met.

Escalation pathways should be defined through operational Service Level Agreements (SLAs), including time-to-acknowledge alerts, time-to-virtual contact, time-to-dispatch in-home teams, and time-to-transfer when emergency or inpatient care is required.
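A minimal sketch of how such SLAs might be checked against system timestamps is shown below. The stage names and target durations in `SLA_TARGETS` are placeholders invented for the example; actual values are a clinical governance decision.

```python
from datetime import datetime, timedelta

# Hypothetical SLA targets, measured from the moment the alert fired.
SLA_TARGETS = {
    "acknowledged": timedelta(minutes=5),
    "virtual_contact": timedelta(minutes=15),
    "team_dispatched": timedelta(minutes=60),
}


def sla_breaches(alert_fired: datetime, events: dict[str, datetime],
                 now: datetime) -> list[str]:
    """List stages completed late, or still open past their deadline."""
    breaches = []
    for stage, target in SLA_TARGETS.items():
        deadline = alert_fired + target
        completed = events.get(stage)
        if completed is None:
            if now > deadline:
                breaches.append(f"{stage}: overdue")
        elif completed > deadline:
            breaches.append(f"{stage}: late")
    return breaches


# Illustrative data only.
fired = datetime(2026, 1, 1, 9, 0)
events = {
    "acknowledged": fired + timedelta(minutes=3),
    "virtual_contact": fired + timedelta(minutes=20),
}
early = sla_breaches(fired, events, now=fired + timedelta(minutes=30))
later = sla_breaches(fired, events, now=fired + timedelta(minutes=90))
```

Distinguishing "late" from "overdue" separates retrospective reporting from cases that still need someone to act.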

Safety at scale depends more on conservative thresholds and clearly defined decision rights than on perfect prediction: AI flags risk, clinicians adjudicate, and escalation follows pre-agreed pathways.

Step 6: Measure impact at the system level

Time saved by individual tools is rarely a reliable indicator of success. Organizations that scale virtual hospitals and AI copilots focus instead on system-level outcomes that reflect capacity, quality, and cost. In practice, this means tracking metrics such as:

Patients managed per clinician
Readmissions and avoided admissions
Speed of escalation and intervention
Coverage hours achieved without staffing increases
Length of stay (virtual versus in-hospital)
Emergency department visits avoided
Time from alert to clinical intervention
Usage of in-home services compared to inpatient resources

System-level metrics must be defined using clear operational definitions — for example, what qualifies as an “avoided admission,” how readmissions are attributed, and how alert-to-intervention intervals are measured across systems.

Measuring system-level impact depends on aligning virtual care, clinical, and utilization data into one consistent view. SciForce supports this through healthcare ETL and data integration work that enables reliable measurement across care settings, including large-scale standardization of clinical and claims data.

Step 7: Expand deliberately, not opportunistically

Successful teams expand virtual hospitals and AI copilots only after core workflows are stable and outcomes are consistently measured. Expansion usually happens in stages, starting with additional patient cohorts, then extending to new AI-assisted workflows, and eventually to broader geographic coverage.

In mature programs, growth follows proven operational readiness and clinical confidence, rather than vendor availability or short-term opportunities.

Conclusion

Virtual hospitals and AI copilots are becoming part of the core healthcare operating model. The real challenge is not adoption, but execution: integrating AI into clinical workflows, connecting fragmented data, and scaling virtual care safely and reliably. Scaling reliably requires four foundations: explicit eligibility/exclusion rules, governed escalation SLAs, interoperable data with auditability, and outcome measurement with clear definitions.

At SciForce, we focus on the foundations that make this possible: AI-driven clinical intelligence, healthcare data integration, and end-to-end medical software development.

If your organization is planning or refining a virtual hospital, virtual ward, or AI copilot initiative, book a free consultation to assess readiness, define safe clinical scope, and identify practical next steps.

The DevOps Metrics That Matter in 2026 (And the Ones That Don’t)

SciForce — Thu, 05 Mar 2026 12:23:50 +0000

Introduction

DevOps metrics are no longer limited to engineering teams. In 2026, they directly affect costs, delivery speed, and business risk.

The financial impact of failure makes this clear. New Relic’s 2025 Observability Forecast shows that high-impact IT outages carry a median cost of $2 million per hour, or more than $33,000 per minute. The median annual cost of such outages reaches $76 million per organization.

When downtime carries this level of cost, the metrics used to guide delivery and operations stop being technical details and start shaping financial outcomes.

This exposes a gap in how DevOps is often measured. Metrics like commits, builds, or tickets closed say little about system resilience, recovery speed, or the true cost of failure. What matters instead is how quickly changes can be delivered safely, how fast incidents are detected and resolved, and how reliably systems operate under load.

In 2026, the DevOps metrics that matter are the ones that connect speed, reliability, and cost efficiency to real business outcomes. This article explains which metrics belong on that list — and which ones don’t.

Why DevOps Metrics Changed and Why It Matters Now

The way DevOps metrics have changed reflects a shift in cost and risk, not in tools or workflows.

Flexera’s 2025 State of the Cloud Report shows that 84% of organizations struggle with cloud cost management, while 50% already run generative AI workloads in the cloud. These workloads scale fast, rely on expensive infrastructure, and increase the financial impact of inefficient delivery and system instability.

This changes what DevOps decisions mean in practice. Cloud and AI environments can grow instantly, and small inefficiencies or failures quickly turn into higher costs and broader risk.

As a result, DevOps outcomes now have direct financial consequences:

A deployment can increase infrastructure spend within minutes
A reliability issue can affect multiple services or regions
An inefficient pipeline increases cost and risk over time

In this environment, activity-based metrics lose their value. Counts of commits, builds, or tickets completed show effort, not results. They don’t explain whether delivery is improving, systems are becoming more stable, or costs are under control.

Modern DevOps metrics focus on outcomes instead:

How quickly changes reach production
How often those changes fail
How fast teams recover from incidents
How much it costs to run and scale systems

These metrics make delivery speed, reliability, and cost visible at the same time — and set the direction for the sections that follow.

The DevOps Metrics That Actually Matter

Modern DevOps metrics fall into three groups that show how software delivery creates and protects value. They measure how fast ideas reach production, how reliably systems operate, and how efficiently infrastructure spend is used.

These groups are based on widely used industry approaches, including DORA metrics for delivery performance, reliability measures from SRE practices, and cost metrics from FinOps, rather than internal activity counts.

Together, these metrics show whether DevOps is improving real outcomes. The sections below focus on the measures that consistently relate to delivery speed, system stability, and cost control.

1. Speed Metrics: How Fast Ideas Turn into Value

Speed metrics show how quickly changes move from code to production. In the DORA framework, speed is measured through deployment frequency and lead time for changes, which reflect how efficiently work flows through delivery. Delays matter because slower delivery pushes feedback out, raises risk, and postpones value.

1.1 Deployment Frequency (DORA metric)

Deployment frequency measures how often an organization releases code to production.

Higher deployment frequency usually reflects a delivery process built around small, incremental changes rather than large, infrequent releases:

Smaller changes reduce the blast radius of failures
Rollbacks are simpler and faster
Issues are easier to trace to a specific change

Frequent deployments also reduce the time between implementation and real-world feedback:

Ideas are validated sooner in real environments
Unsuccessful changes are detected earlier
Adjustments can be made before costs escalate

Deployment frequency ultimately reflects how quickly an organization can respond to demand and adapt to change.
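
As a rough illustration, deployment frequency can be derived from a list of production deployment dates. The sketch below groups deployments by ISO week; the function name and the sample dates are hypothetical, and a real pipeline would read them from its own deployment log.

```python
from collections import Counter
from datetime import date


def deployments_per_week(deploy_dates):
    """Count production deployments per ISO (year, week) bucket."""
    counts = Counter()
    for d in deploy_dates:
        iso_year, iso_week, _ = d.isocalendar()
        counts[(iso_year, iso_week)] += 1
    return dict(counts)


# Hypothetical deployment dates, for illustration only.
deploys = [
    date(2026, 3, 2),
    date(2026, 3, 3),
    date(2026, 3, 5),
    date(2026, 3, 10),
]
weekly = deployments_per_week(deploys)
for (year, week), count in sorted(weekly.items()):
    print(f"{year}-W{week:02d}: {count} deployments")
```

Weekly buckets are only one possible choice. Teams that release many times a day may prefer daily counts, while teams with slower cadences may need monthly ones.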

1.2 Lead Time for Changes (DORA metric)

Lead time for changes measures how long it takes for a code change to move from commit to production.

Short lead times indicate an efficient delivery pipeline with minimal friction. Long lead times signal growing coordination overhead and higher cost of delay:

Feedback arrives later
Learning slows down
Planning becomes less predictable

As lead time increases, even small changes accumulate into larger, riskier releases. This raises the likelihood of failures and increases recovery effort.

Among DevOps metrics, lead time is one of the clearest indicators of delivery efficiency. Reducing lead time improves responsiveness, lowers coordination costs, and enables faster iteration without sacrificing control.
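
A minimal sketch of the calculation, assuming each change is recorded as a pair of commit and production deployment timestamps. The median is used here instead of the mean so that one unusually slow change does not dominate the result; that choice, the function name, and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(changes):
    """Median of (deployed_at - committed_at) over a list of changes."""
    if not changes:
        raise ValueError("at least one change is required")
    seconds = [
        (deployed_at - committed_at).total_seconds()
        for committed_at, deployed_at in changes
    ]
    return timedelta(seconds=median(seconds))


# Hypothetical (committed_at, deployed_at) pairs.
changes = [
    (datetime(2026, 3, 2, 9, 0), datetime(2026, 3, 2, 13, 0)),   # 4 hours
    (datetime(2026, 3, 2, 10, 0), datetime(2026, 3, 3, 10, 0)),  # 24 hours
    (datetime(2026, 3, 3, 8, 0), datetime(2026, 3, 3, 16, 0)),   # 8 hours
]
lead_time = median_lead_time(changes)
print(f"Median lead time: {lead_time}")
```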

2. Reliability Metrics: How DevOps Protects Revenue

Reliability metrics describe how safely changes are introduced and how systems behave under failure. They capture how often changes fail, how quickly services recover, and how consistently systems remain available over time.

2.1 Change Failure Rate (DORA metric)

Change failure rate measures how often deployments lead to incidents, rollbacks, or degraded service.

A low change failure rate suggests stable releases and effective checks before deployment. When the rate increases, it signals higher risk, even if changes are delivered quickly:

More incidents that affect users
Greater effort spent on reactive work
Lower confidence in the release process

High deployment frequency alone does not reduce risk. If the change failure rate is high, delivery becomes less predictable and downtime exposure increases.
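
The point can be made concrete with a small calculation. The sketch below computes change failure rate as failed deployments divided by total deployments and compares two hypothetical teams; the team figures are invented for illustration, and what counts as a "failed" deployment is left to each organization's own definition.

```python
def change_failure_rate(deployments: int, failed: int) -> float:
    """Share of deployments that led to an incident, rollback, or degradation."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    if not 0 <= failed <= deployments:
        raise ValueError("failed must be between 0 and deployments")
    return failed / deployments


# Hypothetical weekly figures for two teams.
fast_team = change_failure_rate(deployments=50, failed=10)
steady_team = change_failure_rate(deployments=10, failed=1)

print(f"Fast team:   {fast_team:.0%} of changes fail")
print(f"Steady team: {steady_team:.0%} of changes fail")
```

In this example the faster team ships five times as often but also produces ten failed changes per week instead of one, which is why deployment frequency and change failure rate need to be read together.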

2.2 Mean Time to Restore (DORA metric)

Mean Time to Restore (MTTR) measures how quickly service is restored after an incident. Since failures are inevitable in complex systems, recovery speed often matters more than avoiding every failure. Lower MTTR limits the impact of outages by:

Reducing total downtime
Reducing the number of services and users affected
Lowering revenue and productivity loss

Improvements in monitoring, alerting, incident response, and rollback automation usually appear first as faster recovery times.
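
As a simple sketch, MTTR can be computed as the average time between the start of an incident and the moment service is restored. The incident records below are hypothetical, and the definition of "started" (first user impact versus first alert) is an assumption each team has to settle for itself.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average of (restored_at - started_at) over a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [restored_at - started_at for started_at, restored_at in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical (started_at, restored_at) pairs.
incidents = [
    (datetime(2026, 3, 1, 10, 0), datetime(2026, 3, 1, 10, 30)),  # 30 minutes
    (datetime(2026, 3, 8, 14, 0), datetime(2026, 3, 8, 15, 30)),  # 90 minutes
    (datetime(2026, 3, 20, 2, 0), datetime(2026, 3, 20, 3, 0)),   # 60 minutes
]
mttr = mean_time_to_restore(incidents)
print(f"MTTR: {mttr}")
```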

2.3 Availability (Derived reliability metric)

Availability measures how consistently systems remain operational.

Rather than tracking individual incidents, it summarizes the overall reliability outcome experienced by users. It captures the cumulative effect of delivery and recovery practices over time.

Availability reflects the combined effect of:

How often changes fail
How quickly systems recover when they do

High availability does not imply the absence of failures. It indicates that failures are infrequent, short-lived, and contained well enough that overall service continuity is preserved.
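
One common way to express this is availability as the share of a time window without downtime, where downtime is approximated as the number of incidents multiplied by the average restore time. The sketch below uses that approximation with hypothetical numbers; it ignores partial degradation and overlapping incidents.

```python
def availability(window_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the window during which the service was up."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    if not 0 <= downtime_minutes <= window_minutes:
        raise ValueError("downtime_minutes must be within the window")
    return 1 - downtime_minutes / window_minutes


WINDOW_MINUTES = 30 * 24 * 60  # a 30-day window

# Hypothetical month: 3 incidents, restored in 60 minutes on average.
incident_count = 3
baseline_pct = availability(WINDOW_MINUTES, incident_count * 60) * 100

# Same number of incidents, but restore time cut to 30 minutes.
faster_recovery_pct = availability(WINDOW_MINUTES, incident_count * 30) * 100

print(f"Baseline availability:      {baseline_pct:.2f}%")
print(f"With faster recovery:       {faster_recovery_pct:.2f}%")
```

With the failure count unchanged, halving the restore time is enough to raise availability, which shows how the two inputs combine into one outcome.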

3. Cost & Efficiency Metrics: DevOps and Margins

Cost and efficiency metrics connect delivery performance to financial outcomes. They show whether speed and reliability are achieved efficiently or depend on rising infrastructure spend, and whether delivery costs scale in proportion to value.

3.1 Unit Economics

Unit economics measure cost per unit of value, such as cost per transaction, user, deployment, or service. The concept comes from business and finance, but it has become increasingly relevant in DevOps as cloud-native systems scale.

In modern environments, delivery frequency, infrastructure usage, and reliability decisions directly affect unit cost. As a result, DevOps teams influence whether costs grow in proportion to the value delivered or outpace it.

Unit economics matter more than total cloud spend because they show how costs behave as usage grows:

Stable or declining unit costs indicate scalable systems
Rising unit costs signal inefficiencies that compound with growth

Without unit economics, teams may reduce cloud bills in the short term while masking structural cost problems that reappear at scale.
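
A short worked example, assuming cost per transaction as the unit of value. The monthly figures are hypothetical and chosen to show a falling cloud bill alongside a rising unit cost.

```python
def cost_per_unit(total_cost: float, units: int) -> float:
    """Cost divided by the number of units of value delivered."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Hypothetical (month, cloud spend in USD, transactions served).
months = [
    ("2026-01", 100_000.0, 2_000_000),
    ("2026-02", 90_000.0, 1_500_000),
]
unit_costs = {
    month: cost_per_unit(spend, transactions)
    for month, spend, transactions in months
}

for month, spend, _ in months:
    print(f"{month}: spend ${spend:,.0f}, cost per transaction ${unit_costs[month]:.3f}")
```

Here the bill drops by $10,000 while the cost per transaction rises from $0.05 to $0.06. Total spend alone would report an improvement; the unit cost shows the opposite.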

3.2 Resource Usage and Waste

Resource usage metrics show how much of the available compute, storage, and networking capacity is actually used.

Low usage means paying for resources that sit idle. Common reasons include provisioning for peak load that rarely occurs, idle workloads left running, inefficient scaling rules, and duplicated environments. Examples include:

Servers with consistently low CPU or memory usage
Databases sized far beyond actual demand
Development or staging environments left running when not in use
Storage volumes allocated well above what is needed

Raising utilization lowers costs without slowing delivery or reducing reliability. In many cases, it is the fastest way to improve margins because it removes waste already built into the system.
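
A minimal sketch of how such waste might be surfaced, assuming average CPU utilization and monthly cost are already available per resource. The 20% threshold, the resource names, and the figures are all assumptions for illustration; a real review would also look at memory, storage, and peak load before resizing anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    avg_cpu_percent: float
    monthly_cost: float


def underused(resources, cpu_threshold=20.0):
    """Return resources whose average CPU usage is below the threshold."""
    return [r for r in resources if r.avg_cpu_percent < cpu_threshold]


# Hypothetical fleet snapshot.
fleet = [
    Resource("api-prod", avg_cpu_percent=62.0, monthly_cost=1200.0),
    Resource("staging-db", avg_cpu_percent=4.0, monthly_cost=800.0),
    Resource("batch-worker", avg_cpu_percent=11.5, monthly_cost=450.0),
]

idle = underused(fleet)
idle_spend = sum(r.monthly_cost for r in idle)

for r in idle:
    print(f"{r.name}: {r.avg_cpu_percent}% CPU, ${r.monthly_cost:,.0f}/month")
print(f"Monthly spend on underused resources: ${idle_spend:,.0f}")
```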

What to Stop Measuring — and What to Measure Instead

As DevOps becomes responsible for cost, reliability, and margins, not all metrics remain useful. Many commonly tracked metrics show how busy teams are, but not whether delivery is actually improving. When decisions are based on these signals, teams may look productive while speed, stability, and cost efficiency fail to improve. Measuring activity creates motion, not meaningful progress.

Metrics That Distort Decision-Making

The following metrics are still widely used, but provide limited insight into delivery effectiveness or financial impact:

- Number of commits or pull requests
High commit or PR volume reflects coding activity, not how quickly changes reach production or how stable they are once deployed.

- Tickets closed or story points completed
These metrics track workload throughput within a team, but stop at the planning boundary. They don’t show whether work reaches production, increases risk, or leads to faster feedback and value.

- Build counts or pipeline runs
Frequent builds show pipeline activity, not delivery performance. Build volume alone does not reflect lead time, failure rate, or recovery speed.

- Total cloud spend (without context)
Total spend alone does not show whether higher costs reflect growth, better performance, or wasted capacity, and it can hide rising unit costs.

These metrics can improve in isolation while delivery outcomes, reliability, and margins quietly deteriorate.

Why Activity Metrics Fail the Business

Activity metrics are easy to collect and report, but they say little about whether delivery is actually improving. They show how busy teams are, not the results of their work.

Because of this, they fail to answer the questions leadership needs to understand:

Are we delivering value faster, or just doing more work?
Is reliability improving, or are we building hidden risk?
Do costs grow in line with the business, or faster?

Without cost and outcome context, activity metrics push teams to optimize individual tasks or tools instead of improving the delivery system as a whole.

What to Measure Instead

The outcome-focused metrics covered in the earlier sections align delivery performance with business results:

Deployment frequency and lead time show how quickly value reaches production
Change failure rate and MTTR reveal delivery risk and recovery cost
Availability reflects long-term service reliability
Unit economics show whether systems scale profitably
Resource usage exposes waste built into infrastructure

Conclusion

In 2026, DevOps maturity is about results, not activity. What matters is whether delivery improves speed, reliability, and cost efficiency at the same time.

Metrics that focus on activity can make teams look productive, but they don’t show whether systems are becoming faster, more stable, or cheaper to run. The metrics that matter connect delivery work to financial outcomes. They help teams see trade-offs and understand whether systems scale efficiently or deteriorate as they grow.