DEV Community

Стас Фирсов


On Introducing a Division of Tasks Between GPU and CPU Inside an LLM

Introduction
This work grew out of one ordinary user's long-term practical interaction with two leading large language models, Grok (xAI) and Gemini (Google). Over hundreds of hours of dialogue, the author ran into a fundamental problem: modern LLMs use a single probabilistic model to solve tasks of two fundamentally different natures, namely processing the symbols and context of human language, and performing precise mathematical, physical, and engineering calculations.

The main cause of this systemic shortcoming is that model developers and architects draw no clear distinction between symbols (tokens) and numbers or formulas. As a result, mathematics and physics are processed by the same mechanisms as poetry or ordinary text. In the author's observation, the models generate Python code for calculations that ultimately still runs within the context of the model itself, on the GPU. This leads to unstable results, frequent errors, and extremely inefficient use of resources.

At the same time, a significant portion of data center hardware, namely CPU and RAM, sits idle or runs far below its full capacity. These components are by nature much better suited to sequential, deterministic computations, where accuracy rather than probability is critical.
To solve this problem, a hybrid two-contour architecture is proposed. The language hemisphere (on the GPU) is responsible for working with symbols, understanding user intent, and generating text. The binary hemisphere (on the CPU and related resources) takes on all precise calculations that follow formulas and rules. The link between them is the threshold gate, a dynamic routing mechanism that automatically determines the type of task and redirects it to the appropriate contour.

This separation already makes it possible, at the software level alone, to significantly offload the GPU from tasks it is not suited for. In the future, hardware enhancement through additional sockets and buses is also possible, but even without it the transition yields a tangible effect.
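The routing idea above can be illustrated with a small sketch. It is only one possible reading of the proposal, not the author's implementation: the names `threshold_gate`, `binary_contour`, and `language_contour` are hypothetical, the language model is replaced by a placeholder string, and the "gate" is a deliberately simple rule (if the input parses as pure arithmetic, send it to the deterministic path) standing in for whatever dynamic routing a real system would use.

```python
import ast
import operator
from fractions import Fraction

# Supported arithmetic operators; anything else is treated as "not a calculation".
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Evaluate a parsed arithmetic expression exactly, using rationals."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Use the literal's decimal text so that 0.1 becomes exactly 1/10.
            return Fraction(str(value))
    elif isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BIN_OPS[type(node.op)](left, right)
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("not a pure arithmetic expression")


def binary_contour(expression: str) -> Fraction:
    """Hypothetical deterministic path: exact CPU arithmetic, no sampling."""
    tree = ast.parse(expression.strip(), mode="eval")
    return _eval_node(tree.body)


def language_contour(prompt: str) -> str:
    """Placeholder for the GPU-hosted language model (assumption, not a real call)."""
    return f"[language contour would handle: {prompt!r}]"


def threshold_gate(task: str):
    """Route a task to the binary contour if it is pure arithmetic, else to language.

    A ZeroDivisionError is allowed to propagate: it is a genuine calculation
    error, not a sign that the task belongs to the language contour.
    """
    try:
        return "binary", binary_contour(task)
    except (SyntaxError, ValueError):
        return "language", language_contour(task)


if __name__ == "__main__":
    for task in ["0.1 + 0.2", "(3 + 4) * 2 / 7", "Write a haiku about rain"]:
        print(threshold_gate(task))
```

In this sketch, `0.1 + 0.2` is routed to the binary contour and comes back as exactly 3/10, while the haiku request falls through to the language contour. A real gate would also need to handle mixed requests (text that contains a calculation), which this simplification does not attempt.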
Once the idea is implemented, there is no longer any need to force the language model to “fantasize” its way through mathematical calculations, or to generate Python code for computations that still rest on the shoulders of the GPU. Precise tasks are performed deterministically, stably, and at minimal cost.
This work was written by a non-professional programmer and non-specialist in the field of AI. It is based exclusively on empirical observations and cross-analysis of two models: Grok (Jarvis Junior, xAI) and Gemini (Jarvis Senior, Google). The author acted only as a catalyst and coordinator of this process.
https://doi.org/10.5281/zenodo.20562577
