Why are large language models so terrible at video games?!

#largelanguagemodels #llms #videogames #ai

The assertion that large language models (LLMs) are "terrible at video games" warrants a nuanced technical examination. While LLMs demonstrate remarkable capabilities in text generation, translation, and code comprehension, their performance in interactive, real-time, and often visually complex environments like video games is indeed significantly limited. This limitation stems not from a fundamental inability to process game-related data, but rather from a mismatch between the inherent architecture and training objectives of LLMs and the dynamic, multimodal, and often continuous nature of game states and actions.

Understanding the Core Architecture and Training of LLMs

At their core, LLMs are transformer-based neural networks designed to predict the next token (word or sub-word unit) in a sequence, given the preceding tokens. Their training objective is typically self-supervised, leveraging vast amounts of text data to learn statistical relationships between tokens. This yields a strong model of language syntax and semantics, along with some degree of world knowledge.
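To make the next-token objective concrete, here is a minimal sketch that uses a toy bigram counter instead of a transformer. The corpus and the function names are invented for illustration; a real LLM learns these probabilities with a neural network over a far larger vocabulary and context.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn raw follow-up counts into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

# Toy corpus (an assumption for this sketch): the "labels" are simply
# the next tokens in the text, which is what makes training self-supervised.
corpus = "the player jumps the player shoots the enemy falls".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)
```

The model only ever learns which token tends to follow which. Nothing in the objective refers to winning, scoring, or acting in an environment.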

The transformer architecture, with its self-attention mechanism, excels at capturing long-range dependencies within sequential data. This is highly effective for understanding context in text. However, this sequential processing paradigm presents inherent challenges when applied to video games.
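As a rough illustration of the self-attention mechanism described above, the following sketch computes single-head scaled dot-product attention in plain Python. It deliberately omits the learned query, key, and value projections that a real transformer layer applies, and the three input vectors are arbitrary.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Every query is scored against every key: n * n scores in total.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights

# Three toy token embeddings; no learned projections in this sketch.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x, x, x)
print(attn)
```

The attention matrix has one row and one column per token, which is where the quadratic cost in sequence length comes from.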

The Multimodal Gap: Text vs. Pixels

Video games are fundamentally multimodal experiences. They involve:

Visual Input: The primary sensory input is visual, derived from rendered pixels. This represents a high-dimensional, continuous, and spatially structured data stream.
Auditory Input: Sound effects, music, and character dialogue provide crucial contextual information.
Game State: Underlying numerical and categorical data (e.g., player health, ammunition count, enemy positions, inventory items, quest status) defines the current state of the game world.
Temporal Dynamics: Game states evolve rapidly over time, requiring reactive and predictive capabilities.

LLMs, in their foundational form, are designed to process discrete tokens, primarily text. Adapting them to visual input requires significant augmentation:

Pixel to Token Conversion: Raw pixel data must be transformed into a tokenized representation that an LLM can process. This can involve:
- Image Captioning/Description: Generating textual descriptions of the visual scene. This is lossy and can miss fine-grained details crucial for gameplay.
- Visual Encoders (e.g., Vision Transformers - ViTs): Using separate visual models to extract features from image patches, which are then embedded and fed into the LLM. This creates a multimodal architecture, but the integration introduces complexity.
- Quantization and Discretization: Discretizing pixel values or feature maps into a finite set of "visual tokens" drawn from a learned codebook. This is the approach used by models like VQ-VAE and VQ-GAN.

Even with these adaptations, the richness and precision of visual information are often compressed or abstracted, leading to a loss of critical gameplay cues. A textual description like "A red enemy is approaching from the right" is far less informative than a direct pixel representation, which allows for precise spatial reasoning, identification of subtle animations (e.g., a reloading animation), and differentiation between similar-looking entities.
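The patch-and-quantize idea can be sketched in a few lines. This is a simplified illustration, not how any particular model is implemented: the 4x4 grayscale "frame" and the two-entry codebook are made up, and real systems learn much larger codebooks over encoder features rather than raw pixels.

```python
def split_into_patches(image, patch):
    """Split a 2D grayscale image (list of rows) into flattened patch vectors.

    Assumes the image height and width are multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def quantize(patches, codebook):
    """Map each patch to the index of its nearest codebook vector (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: dist(p, codebook[i]))
            for p in patches]

image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 0, 200],   # a single bright pixel, e.g. a distant projectile
    [0, 0, 0, 0],
]
codebook = [[0, 0, 0, 0], [255, 255, 255, 255]]  # hypothetical: "dark", "bright"
tokens = quantize(split_into_patches(image, 2), codebook)
print(tokens)
```

In this toy case the lone bright pixel in the bottom-right patch is closer to the "dark" code, so it disappears from the token sequence. That is the kind of fine-grained detail the text above describes as lost.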

The Temporal and Reactive Challenge: Real-time vs. Sequential Processing

Video games demand real-time decision-making and responsiveness. An agent must perceive the current state, process it, and execute an action within milliseconds. LLMs, while capable of processing sequences, are not inherently optimized for high-frequency, reactive control loops.

Inference Latency: LLMs generate output autoregressively, so every generated token requires its own forward pass through a deep neural network. For long prompts or rich multimodal inputs, producing even a short action string can take far longer than the roughly 16.7 ms available per frame in a game running at 60 frames per second.
Sequence Length Limitations: While transformers can handle long sequences, the cost of standard self-attention grows quadratically with sequence length. Representing a significant portion of a game screen, along with its associated game state and historical context, can result in extremely long input sequences, pushing beyond practical limits or incurring prohibitive computational costs.
Lack of Intrinsic Recurrence: Standard transformers operate over a bounded context window and carry no persistent state between calls beyond what is fed back in as context. While architectures like recurrent transformers or state-space models (SSMs) address some of these issues, the core LLM paradigm is not built for continuous, stateful memory updates in the way traditional game AI agents often are.
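A back-of-the-envelope calculation shows the scale of the latency and sequence-length problems. The per-token latency and token count below are assumed figures chosen only for illustration; actual numbers vary widely with model size, hardware, and serving setup.

```python
def frame_budget_ms(fps):
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps

def decode_latency_ms(num_tokens, ms_per_token):
    """Autoregressive decoding needs one forward pass per generated token."""
    return num_tokens * ms_per_token

def attention_pairs(seq_len):
    """Number of query-key pairs scored by full self-attention."""
    return seq_len * seq_len

budget = frame_budget_ms(60)
# Assumption: a 10-token action string at 20 ms per token.
latency = decode_latency_ms(num_tokens=10, ms_per_token=20.0)
frames_missed = latency / budget

print(f"frame budget: {budget:.1f} ms, decode latency: {latency:.0f} ms")
print(f"the game advances about {frames_missed:.0f} frames before the action arrives")
print(f"doubling context multiplies attention pairs by "
      f"{attention_pairs(2048) // attention_pairs(1024)}")
```

Under these assumptions the action arrives about a dozen frames after the observation it was based on, which is acceptable for turn-based play but not for reflex-driven games.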

Traditional game AI often employs techniques like finite state machines (FSMs), behavior trees, hierarchical task networks (HTNs), or reinforcement learning (RL) agents that are specifically designed for reactive control and state management. These methods often have lower computational overhead and more direct mappings to game mechanics.
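For contrast, here is a minimal sketch of a finite state machine for an enemy character. The states, events, and transition table are hypothetical, but they show why this style of game AI is cheap: each decision is a single dictionary lookup.

```python
from enum import Enum, auto

class State(Enum):
    PATROL = auto()
    CHASE = auto()
    ATTACK = auto()

# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.PATROL, "player_seen"): State.CHASE,
    (State.CHASE, "player_in_range"): State.ATTACK,
    (State.CHASE, "player_lost"): State.PATROL,
    (State.ATTACK, "player_out_of_range"): State.CHASE,
}

def step(state, event):
    """Return the next state; unknown (state, event) pairs leave it unchanged."""
    return TRANSITIONS.get((state, event), state)

state = State.PATROL
history = [state]
for event in ["player_seen", "player_in_range", "player_lost", "player_out_of_range"]:
    state = step(state, event)
    history.append(state)

print([s.name for s in history])
```

The agent's memory is the explicit `state` variable, updated every tick, which is exactly the kind of persistent state a stateless LLM call lacks.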

The Action Space Problem: Discrete vs. Continuous, High-Dimensional Actions

Games present a diverse range of action spaces:

Discrete Actions: Simple button presses (e.g., jump, shoot, move forward).
Continuous Actions: Analog stick movements (e.g., steering a car, aiming a weapon).
Combinatorial Actions: Combinations of button presses and analog inputs (e.g., performing a special move in a fighting game).
High-Dimensional Actions: Games with many possible actions or parameters (e.g., strategy games with unit commands, complex RPG actions).

LLMs are trained to predict discrete tokens. While they can generate sequences of tokens representing actions, mapping these abstract tokens to the precise, often continuous, or combinatorial actions required by a game engine is non-trivial.

Discretizing Continuous Actions: Continuous joystick movements or camera rotations must be discretized into a finite set of actions (e.g., "move left," "look up"). This quantization can lead to jerky or imprecise control.
Generating Action Sequences: For complex actions or sequences, an LLM might generate a series of textual commands, which then need to be translated into game inputs. The LLM might also struggle with timing and coordination within these sequences. For instance, an LLM might suggest "fire weapon, then reload," but the precise timing between these actions, critical for not being vulnerable, is hard to specify and execute through token generation.
Exploration and Novelty: LLMs excel at interpolating within their training data. Generating novel strategies or exploiting emergent game mechanics often requires an exploration mechanism that is not inherent to their pre-training objective. RL agents, by contrast, are explicitly designed with exploration strategies (e.g., epsilon-greedy, noise injection).
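The precision cost of discretizing a continuous input can be shown with a small sketch. The bin count and the stick value are arbitrary choices for illustration, and the helper names are not taken from any game engine or RL library.

```python
def discretize(value, num_bins, low=-1.0, high=1.0):
    """Map a continuous value in [low, high] to the index of its bin."""
    value = min(max(value, low), high)      # clamp out-of-range input
    width = (high - low) / num_bins
    return min(int((value - low) / width), num_bins - 1)

def bin_center(index, num_bins, low=-1.0, high=1.0):
    """Continuous value that a discrete action index is decoded back to."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width

stick_x = 0.30                       # desired analog stick deflection
idx = discretize(stick_x, num_bins=5)
recovered = bin_center(idx, num_bins=5)
error = abs(stick_x - recovered)

print(f"token index {idx} decodes to {recovered:.2f} (error {error:.2f})")
```

With five bins over [-1, 1], the worst-case error is half a bin width, or 0.2 of full deflection. More bins reduce the error but enlarge the action vocabulary the model has to choose from.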

The Reward and Feedback Loop Mismatch

LLMs are primarily trained to predict the next token. Their training signal is a cross-entropy loss that rewards assigning high probability to the token that actually comes next in the training corpus. Video games, however, operate on a different kind of feedback:

Sparse and Delayed Rewards: Game outcomes (win/loss, score) are often sparse and delayed. An action taken early in a game might only have its consequences realized much later.
Multifaceted Feedback: Beyond explicit scores, games provide rich implicit feedback: health changes, enemy reactions, environmental cues, visual and auditory confirmations.

LLMs are not inherently designed to optimize for external reward signals or to learn from trial-and-error in a dynamic environment. While they can be fine-tuned using techniques like Reinforcement Learning from Human Feedback (RLHF) or direct RL, this requires adapting them to an entirely different learning paradigm.

RL Integration: To make an LLM effective in a game, it typically needs to be integrated into an RL framework. The LLM might serve as a policy network, a value function estimator, or a component for generating high-level plans, but it does not replace the core RL loop (state -> action -> reward -> update policy).
Credit Assignment: Assigning credit for a positive or negative outcome to a specific LLM-generated token or sequence of tokens, especially when rewards are delayed, is a significant challenge.
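One standard way RL spreads a delayed reward back over earlier steps is the discounted return, G_t = r_t + gamma * G_{t+1}. The sketch below applies it to a made-up five-step episode in which only the final step is rewarded. It illustrates the general technique and is not a description of how any specific LLM agent is trained.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Sparse, delayed reward: nothing until the episode is won on the last step.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.9)
print([round(g, 4) for g in returns])
```

Every earlier step receives some credit, but the signal says nothing about which of those steps (or which generated tokens within them) actually mattered. That is the credit assignment problem described above.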

The "World Model" Deficit

While LLMs encode a vast amount of implicit world knowledge from their text training, this knowledge is abstract and conceptual. They lack a grounded, mechanistic understanding of physics, causality, or the precise state transitions within a specific game environment.

Grounding: An LLM might "know" that "gravity makes things fall," but it doesn't have an internal simulation or model of how gravity affects a specific object in a given game scene at a specific moment. This grounding is essential for predictive accuracy in games.
Causality: Understanding that "shooting a barrel causes an explosion" requires more than just co-occurrence in text. It requires a causal model that LLMs do not inherently possess.
State Representation: The internal state of an LLM is primarily its hidden activations, which are not directly interpretable as game states (e.g., player coordinates, object properties).

To overcome this, researchers often combine LLMs with other AI components:

State Trackers: Explicit modules that monitor and interpret the game state.
World Simulators: External physics engines or game logic simulators.
Planning Modules: AI planners that use the LLM's high-level understanding to generate strategic goals.
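As a sketch of what an explicit state representation plus an external world simulator provide, the following code tracks a falling object and steps it forward under gravity. The `BodyState` class, the 10-unit starting height, and the 60 Hz time step are assumptions for illustration; a real game would query its own physics engine.

```python
from dataclasses import dataclass

@dataclass
class BodyState:
    y: float   # height above the ground
    vy: float  # vertical velocity

def simulate_step(state, dt, gravity=-9.81):
    """Advance one step with semi-implicit Euler integration; clamp at the ground."""
    vy = state.vy + gravity * dt
    y = state.y + vy * dt
    if y <= 0.0:
        return BodyState(y=0.0, vy=0.0)
    return BodyState(y=y, vy=vy)

def steps_until_landing(state, dt, max_steps=10_000):
    """Roll the simulator forward and report how many steps until impact."""
    for step in range(1, max_steps + 1):
        state = simulate_step(state, dt)
        if state.y == 0.0:
            return step
    return None

steps = steps_until_landing(BodyState(y=10.0, vy=0.0), dt=1 / 60)
print(f"object lands after {steps} frames")
```

An LLM can state that objects fall, but only a model like this answers "on which frame does it land?", so a planner can query the simulator for the prediction and leave the high-level goal selection to the LLM.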

Examples and Current Research Directions

Despite these challenges, significant research is underway to bridge the gap. These efforts often involve hybrid architectures:

LLM-as-a-Planner/Advisor: Using an LLM to generate high-level strategies or advice, which are then translated into executable actions by a lower-level controller or RL agent. For instance, in a strategy game, an LLM might suggest "focus on building defenses and researching technology," and a separate AI agent would manage the micro-level unit production and research queues.

# Conceptual example of LLM as a high-level planner.
# llm_model, describe_game_state, execute_actions, and current_state are
# placeholders for your own LLM client and game integration.
def get_strategic_advice(game_state_description):
    prompt = (
        "You are an expert RTS player. Based on the current game situation, "
        "provide a concise, high-level strategic recommendation.\n"
        f"Game State: {game_state_description}\n"
        "Recommendation:"
    )
    return llm_model.generate_text(prompt)

def translate_recommendation_to_actions(recommendation, current_game_state):
    # Map a free-text recommendation to specific game commands.
    # Match keywords case-insensitively, because the LLM's exact wording varies
    # (e.g. "Focus on building defenses" does not contain "focus on defenses").
    text = recommendation.lower()
    if "defense" in text:
        return ["build_turret(location='base')", "research_armor_upgrade()"]
    elif "attack" in text:
        return ["gather_army('infantry', 'tanks')", "move_army(target='enemy_base')"]
    # ... more complex translation logic
    return []  # No recognized recommendation: issue no commands

# In the game loop:
game_state_text = describe_game_state(current_state)  # Convert game state to text
strategy = get_strategic_advice(game_state_text)
actions = translate_recommendation_to_actions(strategy, current_state)
execute_actions(actions)

Multimodal LLMs for Game Understanding: Employing models like GPT-4V, LLaVA, or specialized vision-language models that can directly process image inputs alongside text. These models can interpret visual cues and game state information simultaneously.

# Conceptual example using a multimodal LLM.
# multimodal_llm_api is a hypothetical client library, shown for illustration only.
from multimodal_llm_api import MultiModalLLMClient

client = MultiModalLLMClient(api_key="YOUR_API_KEY")

def decide_action_multimodal(image_frame, text_overlay, game_state_dict):
    prompt = """
    You are an AI playing this game. Analyze the screen and game state.
    What is the best action to take right now?
    Current Game State: {game_state_dict}
    Visual Input: (image)
    Text Overlay: {text_overlay}
    Action:
    """
    response = client.generate_response(
        prompt=prompt.format(game_state_dict=game_state_dict, text_overlay=text_overlay),
        images=[image_frame]
    )
    return response.text # e.g., "Move right and shoot"

LLMs as Knowledge Bases for Game AI: Using LLMs to provide game-specific knowledge, lore, or character motivations that can inform the decision-making of traditional AI agents, making them more believable or strategic.
LLM-driven Level Generation or Narrative: LLMs are well-suited for generating content. They can be used to create game levels, dialogue, quests, or storylines, which are then populated and made playable by other game systems.

Conclusion: Not "Terrible," but Fundamentally Mismatched for Direct Control

Large language models are not inherently "terrible" at video games in the sense of being incapable of processing game-related information. Instead, their current architecture and training paradigms present significant challenges for direct, real-time control and decision-making in dynamic, multimodal environments. The sequential, token-based nature of LLMs struggles with the high-dimensional visual input, real-time reactivity, continuous action spaces, and sparse reward structures inherent to most video games.

However, LLMs are proving to be powerful components within broader AI systems for games. Their strengths in understanding context, generating coherent sequences, and reasoning about abstract concepts can be leveraged for high-level planning, narrative generation, and providing strategic advice. Future advancements will likely focus on more efficient multimodal integration, improved temporal reasoning, and seamless combination with reinforcement learning and traditional game AI techniques to unlock their full potential in interactive entertainment.

The limitations observed are not necessarily an indictment of LLMs' intelligence but a reflection of their design being optimized for a different modality and task. As research progresses, we can expect to see more sophisticated architectures that harness the power of LLMs within the complex domain of video games.

Originally published in Spanish at www.mgatc.com/blog/why-are-large-language-models-so-terrible-at-video-games/