Maria Artamonova for Red5

Originally published at red5.net

GPU vs CPU: How to Cut Live Streaming & AI Processing Costs?

The question of how to cut live streaming and AI processing costs by weighing GPU against CPU usage comes up more and more as businesses look to optimize operating costs and improve ROI. In this blog, you will learn practical ways to reduce costs across real-time streaming workflows as well as AI processing in live streaming environments. If you want to skip ahead, you can jump directly to the second part.

Watch a short video on this topic by our co-founder and CEO, Chris Allen.

Introduction

Minimizing operating costs in real-time video streaming has never been more important, or more challenging, than it is now, as AI models begin to permeate the marketplace.

Not only do users have to determine whether a given real-time streaming platform and their choice of AI models meet expectations; they must also be sure they’re spending as little as possible on all the end-to-end processing that goes into achieving their goals.

Keeping live stream processing costs as low as possible requires looking under the operational hood to see how a given platform provider’s processing requirements are divided between reliance on low-cost Central Processing Units (CPUs) and much costlier hardware acceleration via Graphics Processing Units (GPUs) or other custom processors used by cloud providers, such as Google’s Tensor Processing Units (TPUs) and Amazon’s Graviton. This touches on how money is spent on cloud resources at every juncture, from basic ingestion, transcoding, mixing and routing functionality to executing the intelligence that determines how all the elements are packaged for each end user.

Of course, in the case of the hyperdynamic AI processing marketplace, there’s only so much general cost guidance to convey beyond exploring how to take advantage of the fact that AI processing doesn’t always have to involve GPUs or other high-cost processors. However, as shall be seen, there are ways beyond the GPU vs. CPU assessment in which streaming platforms can be used to keep AI usage costs in check.

The good news is that costs don’t need to be the barrier some observers claim when it comes to executing highly scalable real-time streaming solutions or any type of live streaming with AI. Our goal here is to shed some light on why that’s the case.

Part 1: Cost Containment In Real-Time Streaming

CPU and GPU Per-Instance Costs

| Category | CPU | GPU |
| --- | --- | --- |
| Typical cloud instance cost | Under $1 per instance | $3–$10+ per instance |
| Processing style | Serial execution of complex tasks | Massive parallel execution for large data sets |
| Price stability | Predictable year-to-year | Volatile; varies by vendor and demand |
| Impact on real-time streaming | Works for logic and workflow tasks if tuned | Great for heavy AI/encoding but expensive at scale |

Judging the cost implications of CPU vs. GPU usage is complicated, but it begins with the basic contrast in cost per cloud instance. Looking at prices charged by various cloud providers, we find CPU-based cloud instances running in the sub-dollar range and per-instance single-GPU usage costs ranging from around $3 to over $10.

These differences can have monumental bottom-line consequences for choices made at the semiconductor level. However, looking at pricing in that light, beyond simply acknowledging there’s a gap in basic unit usage costs, is not all that helpful, given how the respective chip designs are shaped by what they’re meant to accomplish.

CPU and GPU cores alike handle millions of calculations per second and rely on internal memory to facilitate performance. But CPUs, as the brains of a computer, are optimized for serially processing a complex set of related tasks, whereas GPUs are designed to run calculations in parallel to accommodate processing of immense amounts of data related to a given task. A single CPU package can have anywhere from one to dozens of cores, each far more powerful than a GPU core, while the latest-generation GPUs come with core counts exceeding 15,000.

Complicating matters, the costs of GPU instances and clusters vary greatly among hyperscalers and smaller providers of cloud computing services. To cite just one example of the pricing disparities among hyperscalers, the per-instance hourly cost in 2025, as measured in Verda’s study, for an eight-GPU cluster of Nvidia H100s (80GB of memory per GPU) was $88.49 on Google Cloud Platform, $55.04 on AWS and $98.32 on Microsoft Azure. As a further data point, Oracle Cloud Infrastructure offers an 8× H100 bare-metal node under the instance shape BM.GPU.H100.8, priced at US $80.00/hour (i.e., US $10.00 per GPU-hour).

| Cloud Provider | Hourly Price (8× H100) | Price per GPU-Hour | Source |
| --- | --- | --- | --- |
| Google Cloud Platform (GCP) | $88.49/hr | $11.06 | Verda GPU Pricing Comparison (2025) |
| Amazon Web Services (AWS) | $55.04/hr | $6.88 | Verda GPU Pricing Comparison (2025) |
| Microsoft Azure | $98.32/hr | $12.29 | Verda GPU Pricing Comparison (2025) |
| Oracle Cloud Infrastructure (OCI) | $80.00/hr | $10.00 | Oracle official GPU pricing (updated 2025) |
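
For quick comparisons of quotes like these, the per-GPU-hour figure is just the hourly cluster price divided by the GPU count. A minimal sketch in Python, using the 2025 snapshot above (prices drift constantly, so treat the numbers as a point-in-time reference):

```python
# Per-GPU-hour cost is the hourly cluster price divided by the GPU count.
# Prices are the 2025 snapshot cited above and change frequently.
cluster_prices_8x_h100 = {
    "GCP": 88.49,
    "AWS": 55.04,
    "Azure": 98.32,
    "OCI": 80.00,
}

for provider, hourly_price in cluster_prices_8x_h100.items():
    print(f"{provider}: ${hourly_price / 8:.2f} per GPU-hour")
```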

Moreover, GPU and related commodity hardware accelerator costs are in a highly volatile state as options multiply, with AMD, Intel and ever more entrants, including newcomers like OpenAI’s chip-making venture, Apple’s anticipated foray into the field, and many others looking to dent Nvidia’s dominance. What that means for any narrowing of the price gap between traditional CPUs and GPUs over time is hard to say.

But, for now, the analysis of how semiconductor processing costs impact real-time streaming platforms is fairly straightforward. Here the complexities relevant to cost containment revolve around the fact that anything done with CPUs to minimize processing costs must also meet quality goals without compromising latency.

Cost Containment at the Speed of Thought

In other words, the use of CPUs should not push end-to-end live streaming latencies above the real-time benchmark Red5 sets at 250ms, which ensures there’s no interval long enough to be perceived as lag between source and reception. Moreover, the more processing required to support the kind of enhanced use-case versatility intrinsic to Red5’s Experience Delivery Network (XDN) Architecture, the more challenging cost containment through reliance on CPUs becomes.
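
To make that budget concrete, here is an illustrative breakdown. The per-stage numbers are assumptions for the sake of the example, not Red5 measurements, but they show how every stage has to fit under the 250ms ceiling:

```python
# Illustrative end-to-end latency budget. The per-stage figures are
# assumptions for illustration, not Red5 measurements.
REAL_TIME_BUDGET_MS = 250

stages_ms = {
    "capture + encode": 60,
    "ingest at Origin Node": 20,
    "Relay/Edge transport": 80,
    "last-mile delivery": 40,
    "decode + render": 30,
}

total = sum(stages_ms.values())
print(f"Total: {total}ms; headroom under budget: {REAL_TIME_BUDGET_MS - total}ms")
```

Any stage that balloons, a slow CPU-based transcode for example, eats the headroom and pushes the stream out of real-time territory.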

It’s a tall order, but Red5 has met the challenge with architectural and software engineering that enables reliance on CPUs for the vast majority of real-time interactive streaming use cases. This is cost containment with end-user experiences delivered, as our tagline says, “at the speed of thought.” (See the Addendum for an overview of XDN Architecture, the distinctions between the Red5 Cloud software-as-a-service (SaaS) and Red5 Pro self-hosted approaches to XDN implementation, and the use-case versatility enabled by the Red5 TrueTime toolsets, SDKs and other innovations.)

We begin with an exploration of the ways in which Red5’s support for all real-time streaming use cases serves as a cost-saving foundation for any instances involving the use of AI. The breadth of applications, with support as needed for social or collaborative interactivity, encompasses sports, esports and other live content distribution, multiplayer gaming, sports betting, virtual casino gambling, online auctions, e-commerce, distributed live production, large-scale public safety surveillance, and much else, including the use of extended reality (XR) technology wherever applicable in these scenarios.

Going even further, Red5 has secured additional cloud resources to enable the lowest pricing yet seen for real-time streaming in U.S. markets. As described at length in this blog, the recently introduced Red5 Cloud Pay-as-You-Grow plan allows entry-level customers, without even registering credit cards, to stream up to 50 gigabytes per month, consuming up to 6,000 instance hours, at no cost on a continuing basis.

In any month exceeding 50 GB of usage, these customers, upon registering payment methods, pay just $0.08 per GB and $0.69 per instance hour, reverting to free usage in any succeeding month at or below 50 GB. Whenever customers reach the point where they want to scale to much higher monthly volumes, with real-time streaming to audiences numbering into the millions at continental and transcontinental distances, they can seamlessly transition to Red5 Cloud resources running on the OCI platform.
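
As a back-of-the-envelope illustration of those rates, here is a minimal sketch. The assumption that all usage in an over-50 GB month is billed (rather than just the overage) is ours, so check the plan terms for the actual billing rules:

```python
# Rough estimate for the Pay-as-You-Grow rates quoted above.
# Assumption (ours, not confirmed plan terms): once a month exceeds the
# 50 GB free tier, all of that month's usage is billed at the listed rates.
FREE_TIER_GB = 50
RATE_PER_GB = 0.08             # USD
RATE_PER_INSTANCE_HOUR = 0.69  # USD

def estimated_monthly_cost(gb_streamed: float, instance_hours: float) -> float:
    if gb_streamed <= FREE_TIER_GB:
        return 0.0
    return gb_streamed * RATE_PER_GB + instance_hours * RATE_PER_INSTANCE_HOUR

print(estimated_monthly_cost(40, 3000))  # 0.0 -- within the free tier
print(estimated_monthly_cost(500, 720))  # ~536.8
```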

Part 2: AI Cost Containment In Live Streaming

The abundance of generative AI solutions available for applications directly impacting payloads delivered over A/V streaming infrastructures of all kinds has added a complex dimension to cost assessment for decision makers everywhere. Here, as mentioned earlier, the bedrock cost-containment principle is that decision makers should avoid acting on the assumption that GPUs are always needed to handle tasks related to the various permutations of AI, including Large Language Model (LLM) and Vision Language Model (VLM) versions of generative AI (genAI) and the decision-making capabilities of agentic AI.

It’s also important to note that, as shown by innovations introduced by Red5, a real-time streaming platform can play a major role in facilitating efficient, cost-saving uses of generative, agentic and the new AI formulations that will be exploding into economic reality in the years ahead. We’ll explore these developments in the second part of this section.

Choosing Semiconductors for AI Deep Learning and Inference Processing

As in the case of assessing CPU vs. GPU possibilities in streaming-related processing, decision makers should take into account the full scope of outcomes from using either type of processor for AI, beyond just the basic per-instance cost differential. If analysis shows that a GPU instance costs four or five times as much as a CPU instance but delivers ten times the execution results, the GPU path turns out to be the less costly one (see the cost-per-throughput sketch after the table below).

| Scenario | CPU | GPU |
| --- | --- | --- |
| Training generative AI models | Used only for serially sequenced machine-learning algorithms that do not require parallel computing | Usually the right choice due to immense processing power |
| Inference execution in commercial AI-assisted solutions | Often used to handle real-time execution tasks | Required only when inference must process massive workloads in real time, such as speech-to-text, translation, visual object analysis, or any task requiring very high QPS |
| Cost efficiency considerations | Lower per-instance cost but slower; may be more costly overall if throughput is insufficient | Higher per-instance cost but can deliver 10x the execution results, making it the lower-cost option when throughput matters |
| Decision criteria | Balanced perspective required; may be preferred for inference unless workload is massive | Preferred when parallel processing or extremely high throughput is required |
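
The “ten times the execution results” logic boils down to comparing cost per unit of work rather than cost per hour. A minimal sketch with illustrative placeholder numbers (the prices and throughputs below are not benchmarks):

```python
# Compare processors on cost per unit of work, not cost per hour.
# Prices and throughput figures are illustrative placeholders, not benchmarks.
def cost_per_million_inferences(hourly_price_usd: float, inferences_per_sec: float) -> float:
    inferences_per_hour = inferences_per_sec * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000

cpu_cost = cost_per_million_inferences(hourly_price_usd=0.90, inferences_per_sec=50)
gpu_cost = cost_per_million_inferences(hourly_price_usd=4.50, inferences_per_sec=500)

print(f"CPU: ${cpu_cost:.2f} per million inferences")  # $5.00
print(f"GPU: ${gpu_cost:.2f} per million inferences")  # $2.50 -- 5x the price, 10x the work
```

Here the GPU costs five times as much per hour but does ten times the work, so it wins on cost per inference; flip the throughput ratio and the CPU wins.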

That said, when it comes to the deep learning that goes into training generative AI models for a particular application, GPUs, with their immense processing power, are usually the right choice. But there are some training scenarios where CPUs are a better fit, such as serially sequenced machine-learning algorithms that don’t require parallel computing.

On the other hand, CPUs can often handle the processing required at the inference stage of commercially marketed AI-assisted solutions, even when the task must be completed in real time. With AI inference, the nuances determining which type of processor is the best choice dictate a more balanced perspective than in the case of the GPU-dominated deep learning processes.

For example, it takes GPUs to train a recommendation system to analyze user behavior and tie it to available viewing options, but CPUs can take care of executing a recommendation based on a particular user’s interactions with content. If, however, the inference stage involves real-time execution of massive processing workloads, as happens in speech-to-text, language translation, visual object analysis and anything else that requires very high numbers of queries per second (QPS), GPU-enabled parallel processing is the way to go.

Taking all these factors into consideration, it’s clear that keeping the processing costs associated with AI in check requires an open-minded, thorough and highly informed approach to choosing processors and their cloud hosts. As IT solutions supplier HorizonIQ puts it in a recent blog, “By strategically combining the computational capabilities of CPUs and GPUs, businesses can maximize performance while minimizing costs.”

The Streaming Platform’s Role in AI Cost Containment

Apart from users’ choices of AI processing environments, there are steps, as illustrated by Red5, that streaming platform providers can take to play a direct role in AI-related cost containment. This starts with facilitating ready access to a broad selection of AI solutions.

In addition, the platform’s architecture should support the flexibility and functionality essential to getting the most out of AI applications, including those that are meant to support specific use cases running on the platform as well as those that can be applied to enhance performance of the platform itself. On all counts, decision makers will find that nothing rivals Red5’s XDN Architecture as the foundation for using AI applications with real-time multidirectional streaming.

As described in this blog, Red5 has pre-integrated AI solutions offered by a wide range of partners for use in live production, streamed in-venue sports viewing experiences, multilingual closed captioning, public safety surveillance, interactive dispersed user engagement in live sports and esports, online betting, video games, e-commerce and telehealth, and much else. Leveraging its portfolio of open APIs, Red5 is continually adding new AI-solution integrations.

At the same time, Red5 has vastly expanded the usefulness of AI in live streaming by introducing, for the first time anywhere, a means by which A/V frames live streamed at any commonly used frames-per-second (fps) rate can be extracted on the fly, not only from Red5 Cloud XDN streams but from conventional HTTP-based streams as well. This is a game changer for applications aided by LLMs or VLMs, which, until now, have not been executed to their full potential owing to the time consumed by pulling frames for AI processing.

*Diagram illustrating real-time frame extraction and encoding for AI in live streaming using SRT.*

Whether HTTP- or XDN-based streaming is involved, the Red5 Cloud frame-extraction service employs Red5’s unique cloud-hosted real-time transcoding process to deliver extracted frames in whatever resolutions and bitrate profiles users choose on the service portal. Within the Red5 Cloud domain, Red5 has taken an AI-agnostic approach to making the broadest range of applications available for real-time frame extraction. As a result, just about any relevant AI solution can be programmatically applied on the Red5 Cloud platform to execute tasks at warp speed.
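
The heavy lifting happens inside the Red5 Cloud service, but the underlying pattern, sampling frames from a live stream at a target rate and handing them to a vision model, can be sketched with off-the-shelf tools. Here is a minimal illustration using OpenCV; the stream URL and analyze_frame() are hypothetical placeholders, and this is in no way the Red5 Cloud API:

```python
import cv2  # pip install opencv-python

# Generic frame-sampling pattern only; this is NOT the Red5 Cloud API.
# The URL and analyze_frame() below are hypothetical placeholders.
STREAM_URL = "rtsp://example.com/live"  # placeholder live-stream endpoint
TARGET_FPS = 2                          # frames per second to hand to the model

def analyze_frame(frame) -> None:
    """Placeholder for a VLM or object-detection call."""
    print("got frame with shape", frame.shape)

cap = cv2.VideoCapture(STREAM_URL)
source_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unreported
step = max(int(source_fps / TARGET_FPS), 1)

frame_index = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if frame_index % step == 0:
        analyze_frame(frame)  # a real system would queue this off the read loop
    frame_index += 1

cap.release()
```

Doing this server-side in real time, as the Red5 service does, removes the client-side pull-and-decode latency that has held LLM- and VLM-assisted workflows back.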

The possibilities are limitless. Some AI-related applications now enabled by real-time A/V frame extraction relate to surveillance or other scenarios where there’s a need to detect specific objects or faces, variations in street and highway traffic patterns, fires and other emergencies, criminal activity, or unwanted audio and video elements in the live stream flow, such as swear words or indecent exposure.

Other applications have to do with pulling and formatting high-quality still images to capture key moments in sports competition, generate promotional thumbnails, or pinpoint defective parts in factory production, to name just a few use cases that benefit in ways far beyond what once could only be done by extracting screenshots from stored files.

Conclusion

There will be much more to come as we expand our partnerships and devise new ways to put AI to use with XDN Architecture. Now and into the foreseeable future, Red5 will be at the forefront, providing the means by which AI can be applied to maximum effect at the lowest possible cost.

And with the savings resulting from low-latency use of software running on CPUs, Red5 will remain the go-to source for real-time streaming that minimizes the need for hardware-accelerated processing at all points end to end.

Addendum: XDN Overview

Whether real-time streaming is implemented through the Red5 Cloud service or via Red5 Pro, XDN infrastructure runs on commodity servers in locations that can be dynamically configured by the XDN Stream Manager to seamlessly operate in cloud clusters as Origin, Relay and Edge Nodes. One or more Origin Nodes in a cluster ingest and stream encoded content out to Relay Nodes, each of which serves an array of Edge Nodes that deliver live unicast streams to endpoints in their assigned service areas.

Origin Node placements can be optimized to accommodate ingestion of massive volumes of streams at minimum latency in interactive scenarios serving all end users through co-locations with XDN Edge Nodes. By leveraging both containerized and virtual machine-based iterations of datacenter virtualization, the XDN platform enables the flexibility and speed of resource utilization that is essential to unlimited scalability and fail-safe redundancy.

The multidirectional flexibility of this architecture can be applied so that whenever anyone, at any moment of a streamed session, chooses to generate a video, that user’s stream is conveyed, along with any marketing enhancements, to everyone else. Whatever the use case might be, it doesn’t matter whether just a few, thousands or even millions of users are engaged or where they are.

Regardless of usage scale or transmission distance, persistent performance across all our customers’ XDN applications confirms that the latency incurred from live production output through ingest at Origin Nodes and transport through Relay and Edge Nodes to rendering on end-user devices is no greater than 250ms, hence our use of the term “real-time streaming,” which applies to any instance where, as in digital voice communications, the lag time is imperceptible to users.

Latencies can drop as low as 50ms when end-to-end distances are limited to in-region footprints. Lower latencies in the case of 5G network streaming are also attained when XDN access and egress distances are reduced with connectivity to cell sites that have been equipped by mobile carriers affiliated with the Amazon Web Services (AWS) Wavelength program to support direct on-ramps to the AWS cloud. Red5 is the only real-time streaming supplier authorized for pre-integration with AWS Wavelength Zones, which allows customers’ 5G stream flows to bypass multiple internet hops in their connections to XDN Nodes hosted by AWS.

Wherever XDN infrastructure is deployed, cluster-wide redundancy essential to fail-safe operations is enabled by the Stream Manager’s autoscaling mechanism through platform controllers designed to work with cloud providers’ APIs. With comprehensive performance monitoring, the Stream Manager executes the load balancing essential to persistent high performance across the entire infrastructure without manual intervention. And in the event of a malfunctioning node component, it can instantly shift processing to another appliance in that node.

Currently, WebRTC is the most used XDN streaming mode owing to the client-side support provided by all the major browsers, including Chrome, Edge, Firefox, Safari and Opera, which eliminates the need for plug-ins or purpose-built hardware. Alternatively, if a mobile device with built-in client support for RTSP is receiving the stream, the platform transmits via RTSP. The client-optimized flexibility of XDN architecture also extends to packaging ingested RTMP, MPEG-TS and SRT encapsulations for transport over RTP when devices compatible with these protocols can’t be reached via WebRTC or RTSP.
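
That fallback order can be pictured as a simple preference ladder. Here is a hedged sketch of the selection logic; it is a simplification, since the actual XDN negotiation surely involves more than capability flags:

```python
# Simplified sketch of the delivery-protocol preference described above.
# The real XDN negotiation is more involved; this captures only the ordering.
def select_delivery_protocol(client_capabilities: set[str]) -> str:
    if "webrtc" in client_capabilities:  # all major browsers, no plug-ins
        return "webrtc"
    if "rtsp" in client_capabilities:    # mobile devices with built-in RTSP clients
        return "rtsp"
    return "rtp"  # repackage ingested RTMP/MPEG-TS/SRT for transport over RTP

print(select_delivery_protocol({"webrtc", "rtsp"}))  # webrtc
print(select_delivery_protocol({"rtsp"}))            # rtsp
print(select_delivery_protocol(set()))               # rtp
```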

It’s also important to note that, over time, there’s a good chance the emerging IETF Media Over QUIC (MOQ) standard will become the dominant real-time streaming protocol. As discussed in this blog, Red5 is taking steps toward incorporating MOQ into the XDN Architecture for customers’ use once the standard is finalized, which is expected to occur sometime in 2026.

The basic difference between working in the Red5 Cloud and Red5 Pro environments has to do with how the XDN infrastructure is implemented and managed over time. The full end-to-end real-time multidirectional streaming supported by Red5 Cloud is automatically implemented in accord with user requirements on the global Oracle Cloud Infrastructure (OCI), which spans 50 geographic regions on six continents.

In response to customers’ input setting geographical reach, targeted user counts and other basic parameters on their service portals, the Red5 Cloud service instantly activates resources hosted by the global Oracle Cloud Infrastructure for implementations precisely tuned to their needs. The service includes sustained managed support for maintenance, changes to original parameters and other needs throughout the entire engagement life cycle.

Each Red5 Cloud instantiation of a customer’s XDN infrastructure and its subsequent modifications remain dedicated exclusively to that customer’s use in perfect alignment with the use case requirements. This is a major departure from the shared usage platforms operated by other suppliers of WebRTC cloud services, where pre-formatted use-case applications are offered on a take-it-or-leave-it basis.

Customers choosing to pursue the Red5 Pro DevOps approach can mount XDN infrastructure in public or private clouds utilizing a comprehensive portfolio of Red5 Pro SDKs and open APIs with recourse to assistance from Red5 personnel. Public cloud XDN infrastructures built with Red5 Pro can operate seamlessly with no loss of latency in cross-cloud scenarios involving the leading cloud providers, including AWS, Google Cloud, Microsoft Azure, OCI and others that have been pre-integrated with the platform, as well as many more that can be integrated for XDN use with the aid of the Terraform open-source multi-cloud toolset.

Whatever approach customers take to deploying their XDN infrastructures, XDN Architecture accords them unparalleled freedom to respond with speed and precision to new opportunities. The architecture’s reliance on open-source technology and APIs together with the availability of application-specific TrueTime tools bundled with native iOS, macOS, Windows, Android, Linux, and HTML5 SDKs provides the flexibility they need to tailor applications as they see fit.

Red5 uses its open-source APIs to create an ever-expanding ecosystem of partners whose solutions are pre-integrated into XDN Architecture to deliver best-of-breed solutions, individually or in whatever combinations work, to reduce time to market as well as the time it takes to introduce ongoing service enhancements. The applications range across support for cloud computing, backend transport, asset management, storage, transcoding, packaging, content protection, conventional CDN tie-ins, and a host of value-added features developed by Red5 and its partners to capitalize on these capabilities.
