Hybrid Async-Native Engine for Python: A Design Concept for Review

Introduction

I've been working on a design concept for a hybrid execution engine that aims to bridge Python's asyncio with native C-level threading. I'm sharing this here to get community feedback, identify potential issues, and learn from more experienced developers. This is an experimental concept, and I'm certain there are drawbacks and edge cases I haven't considered.

Please note: This is a theoretical design I've developed. I have not benchmarked it against production systems, and the performance characteristics I describe are based on my understanding of how the components should behave, not empirical measurements.

Design Goals

My primary goal was to explore ways to reduce task orchestration overhead in async Python applications. Standard asyncio works well for I/O-bound tasks, but I wanted to experiment with moving the task queue into native C memory to see if this could reduce scheduling overhead.

Core Architecture Concept

1. The "Bus" - A Native Task Queue

The central idea is a C-implemented task queue (I'm calling it the "Bus") that uses atomic operations for task management:

  • Push Operation: Adds tasks to the queue using atomic pointer updates
  • Pop Operation: Workers retrieve tasks using mutex locks
  • Wait State: Idle workers use condition variables to minimize CPU usage
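
To make the intended semantics concrete, here is a minimal Python model of the Bus. This is only a sketch: the real design calls for C with atomic operations, so threading.Lock and threading.Condition stand in for the mutex and condition variable, and the policy parameter is my shorthand for the LIFO/FIFO choice.

import threading
from collections import deque

class Bus:
    """Toy Python model of the native Bus: a policy-aware task queue."""

    def __init__(self, policy="FIFO"):
        self._tasks = deque()
        self._lock = threading.Lock()                     # stands in for the C mutex
        self._not_empty = threading.Condition(self._lock)
        self._policy = policy                             # 'FIFO' or 'LIFO'

    def push(self, task):
        # The C design uses atomic pointer updates; a lock models that here.
        with self._not_empty:
            self._tasks.append(task)
            self._not_empty.notify()                      # wake one idle worker

    def pop(self):
        # Idle workers block on the condition variable, keeping CPU usage low.
        with self._not_empty:
            while not self._tasks:
                self._not_empty.wait()
            if self._policy == "LIFO":
                return self._tasks.pop()
            return self._tasks.popleft()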

Visual Architecture Overview

                          ┌─────────────────────────────────┐
                          │   Python Application Layer      │
                          │  (Submits Tasks via setup())    │
                          └────────────┬────────────────────┘
                                       │
                                       ▼
                          ┌─────────────────────────────────┐
                          │    Native C "Bus" (Queue)       │
                          │   ┌─────────────────────────┐   │
                          │   │  Task Priority Queue    │   │
                          │   │  (LIFO/FIFO Policy)     │   │
                          │   │  Atomic PUSH/POP Ops    │   │
                          │   └─────────────────────────┘   │
                          └─────┬───────────────────┬───────┘
                                │                   │
                ┌───────────────┴─────┐   ┌────────┴──────────────┐
                │                     │   │                       │
                ▼                     ▼   ▼                       ▼
    ┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐
    │  GIL Thread 1       │  │  GIL Thread 2       │  │  C Thread 1         │
    │  ┌───────────────┐  │  │  ┌───────────────┐  │  │  ┌───────────────┐  │
    │  │ Event Loop    │  │  │  │ Event Loop    │  │  │  │ C Function    │  │
    │  │ 1000+ async   │  │  │  │ 1000+ async   │  │  │  │ Execution     │  │
    │  │ I/O tasks     │  │  │  │ I/O tasks     │  │  │  │ (Parallel)    │  │
    │  └───────────────┘  │  │  └───────────────┘  │  │  └───────────────┘  │
    │  Requires GIL       │  │  Requires GIL       │  │  No GIL Needed      │
    └─────────────────────┘  └─────────────────────┘  └──────────┬──────────┘
             │                         │                          │
             │                         │              ┌───────────▼───────────┐
             │                         │              │  Shared Memory Bus    │
             │                         │              │  (Non-GIL Only)       │
             │                         │              │  ┌─────────────────┐  │
             │                         │              │  │ Atomic Read/    │  │
             │                         │              │  │ Write Operations│  │
             │                         │              │  └─────────────────┘  │
             │                         │              │  Zero-Copy Data      │
             │                         │              │  Sharing Between C   │
             │                         │              └──────────┬───────────┘
             │                         │                         │
             │                         │              ┌──────────▼──────────┐
             │                         │              │  C Thread 2         │
             │                         │              │  ┌───────────────┐  │
             │                         │              │  │ C Function    │  │
             │                         │              │  │ Execution     │  │
             │                         │              │  │ (Parallel)    │  │
             │                         │              │  └───────────────┘  │
             │                         │              │  No GIL Needed      │
             │                         │              └─────────────────────┘
             │                         │                         │
             └─────────────┬───────────┴─────────────────────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │  Results /      │
                  │  Completion     │
                  └─────────────────┘

Thread Pool Composition

| Configuration | GIL Threads | C Threads | Total Workers |
| ------------- | ----------- | --------- | ------------- |
| Example 1 | 2 | 2 | 4 |
| Example 2 | 3 | 1 | 4 |
| Example 3 | 1 | 7 | 8 |

Constraint: gil_threads ≤ workers and c_threads = workers - gil_threads
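
Expressed as code, the constraint is a one-line validation plus a subtraction. This sketch (the function name is hypothetical) reproduces the table above:

def split_workers(workers: int, gil_threads: int) -> tuple[int, int]:
    """Derive the GIL/C thread split, enforcing gil_threads <= workers."""
    if not 0 <= gil_threads <= workers:
        raise ValueError("gil_threads must satisfy 0 <= gil_threads <= workers")
    return gil_threads, workers - gil_threads

# Example 3 from the table: 8 workers with 1 GIL thread leaves 7 C threads.
assert split_workers(8, 1) == (1, 7)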

Task Type Routing Table

| Task Type | Can Execute On | Execution Model | Memory Access | Typical Use Case |
| --------- | -------------- | --------------- | ------------- | ---------------- |
| Python Coroutine (async/await) | GIL Threads Only | Concurrent I/O (1000s per thread) | Python heap only | Web requests, file I/O, database queries |
| C Function Pointer | C Threads (Preferred) | True Parallel Execution | Shared Memory Bus + Local | CPU-intensive math, data processing, encoding |
| C Function Pointer (Alternative) | GIL Threads (If C threads busy)* | Sequential Execution | Local memory only | Fallback when C threads saturated |

*Design question: Should this fallback be allowed?

Memory Architecture Comparison

| Memory Region | Accessible By | Thread-Safe | Use Case | Size Limit |
| ------------- | ------------- | ----------- | -------- | ---------- |
| Main Task Bus | All Threads | Yes (Mutex/Atomic) | Task queue management | Defined by shared_amount |
| Shared Memory Bus | C Threads Only | Yes (Atomic Ops) | Zero-copy data sharing between C tasks | Defined by shared_memory parameter |
| Python Heap | GIL Threads Only | Yes (GIL Protected) | Python objects and coroutine state | System memory limit |
| Thread-Local Memory | Each Thread | N/A (No sharing) | Thread-specific temporary data | Stack/heap limits |

Capacity Examples

| Workers | GIL Threads | C Threads | Concurrent I/O Capacity | Parallel CPU Tasks |
| ------- | ----------- | --------- | ----------------------- | ------------------ |
| 4 | 2 | 2 | ~2,000 (1000 × 2) | 2 simultaneous |
| 8 | 3 | 5 | ~3,000 (1000 × 3) | 5 simultaneous |
| 8 | 6 | 2 | ~6,000 (1000 × 6) | 2 simultaneous |
| 16 | 4 | 12 | ~4,000 (1000 × 4) | 12 simultaneous |

2. Dual Execution Paths

The design attempts to support two types of workloads:

Path A: Python Coroutines (GIL-dependent)

  • Standard async/await functions
  • Requires GIL acquisition during execution
  • Falls back to single-threaded concurrency model

Path B: C Function Pointers (Non-GIL)

  • Direct C function execution
  • Bypasses Python interpreter
  • Theoretical true parallelism across cores

Known concern: I'm not sure how to safely marshal complex Python objects into the Non-GIL path without creating memory management issues.

3. Configuration Parameters

setup(
    workers=4,              # Total thread pool size (capped at CPU count)
    gil_threads=2,          # Number of GIL-enabled threads (must be ≤ workers)
    tasks_per_thread=100,   # Virtual queue depth per worker
    shared_amount="1GB",    # Memory ceiling for backpressure
    shared_memory="512MB",  # Shared Memory Bus size (Non-GIL threads only)
    policy='LIFO'           # Task ordering strategy
)

Important constraint: gil_threads must always be less than or equal to workers. This parameter determines how many threads in the pool are designated for Python coroutines (which require the GIL).

Example: If workers=4 and gil_threads=2:

  • 2 threads are designated for GIL-based Python coroutines
  • 2 threads are designated for Non-GIL C function execution
  • The 2 C threads share access to a 512MB Shared Memory Bus

Shared Memory Bus (Non-GIL Only)

In addition to the main Task Bus, the design includes a separate Shared Memory Bus that provides zero-copy data sharing between Non-GIL C threads:

Key Features:

  • Exclusive to C Threads: Only Non-GIL threads can access this memory region
  • Lock-Free Access: Uses atomic operations for thread-safe reads and writes
  • Zero-Copy Sharing: C threads can share large data structures without serialization
  • Direct Memory Access: C functions can read/write shared memory pointers directly
  • Isolated from Python: GIL threads cannot access this memory, maintaining strict separation

Use Cases:

  • Sharing intermediate computation results between parallel C tasks
  • Building shared lookup tables or caches for C functions
  • Passing large arrays or buffers between C threads without copying
  • Coordinating state across parallel C computations

Design Question: Should there be a mechanism to synchronize data between the Shared Memory Bus and Python objects, or should this remain completely isolated? The isolation is safer but limits interoperability.

API Methods and Functionality

The engine exposes several methods for task submission, memory management, and introspection:

Task Submission Methods

spawn(coroutine)

Submits a Python coroutine to the Task Bus for execution on GIL threads.

async def fetch_data(url):
    data = await do_request(url)  # placeholder for a real async I/O call
    return data

engine.spawn(fetch_data("https://example.com"))

Behavior:

  • Task is pushed to the main Task Bus
  • Routed only to GIL-enabled threads
  • Executes within an asyncio event loop
  • Returns immediately (non-blocking submission)
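
This submission pattern can be approximated today with asyncio.run_coroutine_threadsafe(), which hands a coroutine to an event loop running on another thread and returns a future immediately. A sketch under that assumption (the engine itself is hypothetical; everything below is plain standard library):

import asyncio
import threading

def start_gil_thread() -> asyncio.AbstractEventLoop:
    """Run an event loop on a dedicated thread, as a 'GIL thread' would."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop

async def fetch_data(url):
    await asyncio.sleep(0.1)        # placeholder for real async I/O
    return url

loop = start_gil_thread()
# Non-blocking submission: a concurrent.futures.Future is returned at once.
future = asyncio.run_coroutine_threadsafe(fetch_data("https://example.com"), loop)
print(future.result())              # block only when the result is needed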

c_spawn(function_pointer, args)

Submits a C function pointer to the Task Bus for execution on Non-GIL threads.

# Assuming you have a C function registered
c_function_ptr = get_c_function("process_array")
engine.c_spawn(c_function_ptr, (array_data, size))

Behavior:

  • Task is pushed to the main Task Bus with high priority
  • Routed preferentially to Non-GIL C threads
  • Executes directly in C without Python interpreter
  • Can access Shared Memory Bus for data sharing

Design Question: How should arguments be marshaled? Should only C-native types be allowed, or should there be automatic conversion for simple Python types (int, float, bytes)?
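
For what it's worth, ctypes already embodies one answer: foreign calls release the GIL for their duration, and arguments must be declared as C-native types, so the marshaling is explicit. A small illustration (this assumes the C math library can be located on the host system; it is not the engine's API):

import ctypes
import ctypes.util
from concurrent.futures import ThreadPoolExecutor

# ctypes releases the GIL around each foreign call, so two Python threads
# calling into C can genuinely run on two cores.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.argtypes = [ctypes.c_double]   # explicit C-native typing is one
libm.sqrt.restype = ctypes.c_double      # narrow answer to the question above

with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(libm.sqrt, [2.0, 3.0])))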

Shared Memory Bus Methods

memory_bus_push(key, data, size)

Writes data to the Shared Memory Bus accessible by all Non-GIL threads.

# Push a large array to shared memory
array_ptr = get_array_pointer(my_data)
engine.memory_bus_push("computation_input", array_ptr, size_bytes)

Behavior:

  • Allocates space in the Shared Memory Bus
  • Uses atomic operations to update memory pointers
  • Returns a key/handle for retrieval
  • Only accessible from C threads
  • Error if: Called from GIL thread context or memory limit exceeded

memory_bus_pop(key)

Retrieves data from the Shared Memory Bus.

# Retrieve shared data in a C function
data_ptr = engine.memory_bus_pop("computation_input")

Behavior:

  • Returns pointer to shared memory region
  • Does not copy data (zero-copy access)
  • Multiple threads can read simultaneously
  • Design Question: Should pop remove the data or just retrieve it? Should there be separate memory_bus_get() for non-destructive reads?
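
To pin down the semantics I have in mind, here is a toy Python model of both operations. The real design would hand out raw C pointers; a dict of memoryviews stands in here, and I have split the read into pop() and get() as one possible answer to the destructive-read question:

import threading

class MemoryBus:
    """Toy model: named zero-copy buffers behind a single lock."""

    def __init__(self, capacity: int):
        self._buffers: dict[str, memoryview] = {}
        self._capacity = capacity
        self._used = 0
        self._lock = threading.Lock()        # the C design would use atomics

    def push(self, key: str, data: bytes) -> str:
        with self._lock:
            if self._used + len(data) > self._capacity:
                raise MemoryError("Shared Memory Bus limit exceeded")
            self._buffers[key] = memoryview(bytearray(data))
            self._used += len(data)
            return key                       # handle for later retrieval

    def get(self, key: str) -> memoryview:
        with self._lock:                     # non-destructive, zero-copy read
            return self._buffers[key]

    def pop(self, key: str) -> memoryview:
        with self._lock:                     # destructive read frees the budget
            buf = self._buffers.pop(key)
            self._used -= len(buf)
            return buf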

Monitoring and Introspection Methods

get_stats()

Returns current engine statistics and performance metrics.

stats = engine.get_stats()
print(stats)
# Output:
# {
#   'total_tasks_submitted': 15420,
#   'tasks_completed': 15102,
#   'tasks_in_queue': 318,
#   'gil_threads_active': 2,
#   'gil_threads_idle': 0,
#   'c_threads_active': 3,
#   'c_threads_idle': 1,
#   'shared_memory_used': 245760000,  # bytes
#   'shared_memory_available': 291240000,
#   'avg_task_latency_ms': 2.4,
#   'bus_contention_count': 47
# }

Returned Metrics:

  • Task counters (submitted, completed, queued)
  • Thread utilization per type
  • Memory usage statistics
  • Performance metrics (latency, throughput)
  • Contention/blocking events

thread_info(thread_id=None)

Returns detailed information about specific threads or all threads.

# Get info for all threads
all_threads = engine.thread_info()

# Get info for specific thread
thread_5 = engine.thread_info(thread_id=5)
# Output:
# {
#   'thread_id': 5,
#   'type': 'C_THREAD',
#   'state': 'RUNNING',
#   'current_task': 'process_matrix_42',
#   'tasks_completed': 1847,
#   'cpu_time_ms': 45230,
#   'idle_time_ms': 892,
#   'last_active': '2026-01-13T10:23:45'
# }

Information Provided:

  • Thread type (GIL vs C)
  • Current state (RUNNING, IDLE, WAITING, BLOCKED)
  • Task execution history
  • CPU time and idle time
  • Current task identifier

get_call_stack(task_id=None)

Returns a call stack representation for tasks, similar to frame info but as a custom Python object.

# Get current task's call stack
stack = engine.get_call_stack()

# Get specific task's call stack
stack = engine.get_call_stack(task_id="task_12345")

# Stack object structure (custom Python object)
for frame in stack.frames:
    print(f"Function: {frame.function_name}")
    print(f"Location: {frame.file}:{frame.line}")
    print(f"Type: {frame.execution_type}")  # 'PYTHON' or 'C_NATIVE'
    print(f"Thread: {frame.thread_id}")
    print(f"Timestamp: {frame.timestamp}")

Custom Stack Object Properties:

  • frames: List of frame objects (not standard Python frame objects)
  • Each frame contains:
    • function_name: Name of the function/coroutine
    • file: Source file (or "C_NATIVE" for C functions)
    • line: Line number (or 0 for C functions)
    • execution_type: 'PYTHON' or 'C_NATIVE'
    • thread_id: Which thread is executing this frame
    • timestamp: When this frame was entered
    • memory_refs: References to Shared Memory Bus if applicable

Design Question: Should the stack object be immutable (snapshot at call time) or live-updating? Should it include memory access history for C threads?
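
If the snapshot answer is chosen, the frame object sketches naturally as a frozen dataclass (all names hypothetical, mirroring the property list above):

from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    """One immutable entry in a call-stack snapshot."""
    function_name: str
    file: str                  # source file, or "C_NATIVE" for C functions
    line: int                  # 0 for C functions
    execution_type: str        # 'PYTHON' or 'C_NATIVE'
    thread_id: int
    timestamp: float
    memory_refs: tuple = ()    # Shared Memory Bus handles, if any

@dataclass(frozen=True)
class CallStack:
    frames: tuple[Frame, ...] = ()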

Additional Utility Methods

pause() / resume()

Temporarily pause task execution without shutdown.

engine.pause()  # Stop accepting new tasks, finish current ones
engine.resume()  # Resume task acceptance

clear_memory_bus()

Clears all data from the Shared Memory Bus.

engine.clear_memory_bus()  # Free all shared memory allocations

set_priority(task_id, priority)

Adjusts task priority in the queue.

engine.set_priority("task_12345", priority=10)  # Higher priority

API Summary Table

| Method | Purpose | Accessible From | Returns | Blocking |
| ------ | ------- | --------------- | ------- | -------- |
| spawn() | Submit Python coroutine | Python | None | No |
| c_spawn() | Submit C function | Python | None | No |
| memory_bus_push() | Write to shared memory | Python/C | Key handle | No |
| memory_bus_pop() | Read from shared memory | C threads | Data pointer | No |
| get_stats() | Engine statistics | Python | Dict | No |
| thread_info() | Thread details | Python | Dict/List | No |
| get_call_stack() | Task call stack | Python | Custom object | No |
| pause() | Pause execution | Python | None | No |
| resume() | Resume execution | Python | None | No |
| wait_all() | Graceful completion | Python | None | Yes |
| shutdown() | Immediate termination | Python | None | No |

Thread Allocation and Task Routing

The engine divides the worker pool into two groups:

GIL-Enabled Threads (Python coroutine handlers):

  • These threads can acquire the Python GIL
  • They execute Python async functions and coroutines
  • Limited to the number specified in gil_threads parameter
  • Handle Python object manipulation safely

Non-GIL C Threads (Native function handlers):

  • Calculated as: c_threads = workers - gil_threads
  • These threads never acquire the GIL
  • Execute pure C function pointers directly
  • Provide true parallel execution across CPU cores
  • Cannot safely interact with Python objects

Task Execution Logic

When the Bus receives a task, the routing works as follows:

  1. Python Coroutine Task Arrives:

    • Checks if any GIL-enabled threads are available
    • If yes: Assigns to a free GIL thread immediately
    • If no: Task waits in queue until a GIL thread becomes free
    • C threads cannot execute these tasks (they lack GIL access)
  2. C Function Pointer Task Arrives:

    • First checks if any Non-GIL C threads are available
    • If yes: Assigns to a free C thread immediately
    • If no: Can optionally use a GIL thread (after releasing GIL for the C execution)
    • Prefers C threads for optimal performance
  3. All GIL Threads Occupied:

    • Python coroutine tasks must wait in the Bus queue
    • C function tasks continue to execute on C threads without blocking
    • This prevents Python workload from blocking native execution
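
Stripped of details, the routing reduces to a type check feeding two queues. This sketch keeps the strict no-borrowing variant so the trade-off in the question below stays visible:

import asyncio
import queue

gil_queue = queue.Queue()      # drained only by GIL threads
c_queue = queue.Queue()        # drained only by C threads

def route(task):
    """Send coroutines to GIL threads and everything else to C threads."""
    if asyncio.iscoroutine(task):
        gil_queue.put(task)    # queues up if all GIL threads are busy
    else:
        c_queue.put(task)      # C tasks never wait behind Python tasks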

Design Question I'm Uncertain About: Should C function tasks be allowed to "borrow" GIL threads when C threads are full but GIL threads are idle? Or should they strictly wait for a C thread to become available? The first approach maximizes utilization, but the second maintains cleaner separation.

4. Lifecycle Management

The design includes two shutdown approaches:

Graceful Shutdown (wait_all()):

  • Blocks until all tasks complete
  • Ensures all results are committed
  • Standard cleanup process

Immediate Shutdown (shutdown()):

  • Broadcasts termination signal
  • Stops workers mid-execution
  • Frees resources immediately
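
A conventional sketch of the two paths, using a sentinel for graceful drain and an event flag for immediate stop. This assumes FIFO draining and the toy Bus from earlier; the genuinely hard part, interrupting a task mid-execution, is exactly what the concern below is about:

import threading

STOP = object()                # sentinel task: a worker exits on popping it
stop_now = threading.Event()   # immediate-shutdown broadcast flag

def wait_all(bus, workers: list[threading.Thread]):
    """Graceful: queued tasks drain first, then one sentinel stops each worker."""
    for _ in workers:
        bus.push(STOP)
    for w in workers:
        w.join()

def shutdown():
    """Immediate: workers check the flag between tasks and bail out.
    A task already running in C cannot be safely interrupted from here."""
    stop_now.set()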

Major concern: The immediate shutdown could leave C-level state corrupted or cause memory leaks if tasks are holding native resources. I haven't figured out how to make this safe in all scenarios.

Theoretical Performance Characteristics

Based on my understanding of the architecture, I expect:

  • Task creation: Should be faster than Python asyncio due to reduced abstraction layers
  • Scheduling: Native C loop should have lower overhead than Python event loop
  • Parallelism: Non-GIL path should enable true multi-core execution for CPU-bound tasks
  • I/O Concurrency: Each GIL thread can manage thousands of async I/O operations simultaneously, multiplied across all GIL threads (e.g., 3 GIL threads × 1,000 concurrent I/O ops = 3,000 total concurrent I/O operations)
  • Hybrid Workloads: CPU-intensive C tasks run in parallel on C threads while I/O-intensive Python coroutines handle thousands of concurrent connections across GIL threads

Important caveat: These are expectations, not proven results. Actual performance would depend on implementation quality, workload characteristics, and numerous factors I may not have considered. The overhead of managing multiple event loops and coordinating between thread types might offset these theoretical gains.

Known Issues and Questions

I'm aware of several potential problems:

  1. Memory Safety: How do I safely handle Python object lifetimes in the Non-GIL execution path?

  2. Shared Memory Bus Synchronization: How do C threads coordinate access to the Shared Memory Bus? Should I use lock-free algorithms, atomic operations, or some form of lightweight locking?

  3. Memory Isolation: Is complete isolation between the Shared Memory Bus (C threads) and Python heap (GIL threads) the right approach? Or should there be a safe bridge mechanism?

  4. Data Corruption Risk: If multiple C threads write to the same Shared Memory Bus location simultaneously, how do I prevent race conditions and data corruption?

  5. GIL Thread Allocation: Is the fixed separation between GIL and Non-GIL threads the right approach? Should there be dynamic reallocation based on workload patterns?

  6. Thread Starvation: If all GIL threads are busy, Python tasks queue up even when C threads are idle. Is there a better way to handle this imbalance?

  7. GIL Interaction: The interleaving between GIL and Non-GIL tasks might create complex synchronization issues I haven't anticipated.

  8. Error Handling: How should exceptions in C-level tasks be propagated back to Python?

  9. Resource Cleanup: The immediate shutdown approach seems risky. What's the proper way to ensure clean termination, especially for data in the Shared Memory Bus?

  10. Lock Contention: Under high load, the mutex in the Bus_Pop operation might become a bottleneck.

  11. Thread Borrowing: If C threads are fully occupied but GIL threads are idle, should C tasks be allowed to execute on GIL threads (after releasing the GIL for C execution)?

  12. Shared Memory Leaks: How do I track and free memory allocations in the Shared Memory Bus when C tasks complete or fail?

Seeking Community Input

I'd greatly appreciate feedback on:

  • Architecture Review: Are there fundamental flaws in this approach that would prevent it from working?

  • Thread Allocation Strategy: Is the fixed gil_threads parameter the right approach, or should the engine dynamically adjust thread roles based on workload?

  • Task Routing Logic: How should the engine handle scenarios where one thread type is fully occupied while the other is idle?

  • Lock-Free Alternatives: Would a lock-free circular buffer be more appropriate than my mutex-based Bus?

  • Data Marshaling: What's the safest way to pass Python objects to C functions without GIL protection?

  • Shutdown Safety: How can I make the immediate shutdown path safe for all types of tasks?

  • Similar Work: Has something like this been attempted before? What were the results?

  • Python/C Boundary: Am I underestimating the overhead of crossing the Python/C boundary frequently?

  • GIL Management: Are there better ways to manage multiple threads competing for the GIL besides my fixed allocation approach?

  • Shared Memory Bus Design: Is a separate shared memory region for Non-GIL threads a good idea? What are the best practices for lock-free data structures in this context?

  • Memory Synchronization: Should there be any mechanism to transfer data between the Shared Memory Bus and Python objects, or should they remain completely isolated?

  • Atomic Operations: What atomic primitives should I use for the Shared Memory Bus? Compare-and-swap (CAS)? Memory barriers? Relaxed ordering?

  • Memory Allocation Strategy: Should the Shared Memory Bus use a pre-allocated pool, dynamic allocation, or a hybrid approach?

  • Feasibility: Will this design actually work in practice, or am I missing something fundamental about how Python's GIL and C extensions interact?

  • Implementation in Python: Can this concept be built using Python's C API and existing libraries, or would it require patches to CPython itself? Are there existing Python tools or frameworks that could help implement this?

  • Performance Boost: If implemented correctly, would this actually provide meaningful performance improvements over standard asyncio, or would the overhead of thread management and the Bus negate any gains?

Conclusion

This is an exploratory design concept, and I'm sharing it to learn from the community. I'm certain there are aspects I haven't thought through properly, and I welcome criticism and suggestions. If this approach has fundamental issues that make it impractical, I'd rather know now.

Has anyone worked on similar hybrid async/native systems? What challenges did you encounter?

Thank you for taking the time to review this concept.


Author: @hejhdiss (Muhammed Shafin P)

This is a theoretical design document. Performance claims are based on architectural expectations, not empirical testing. I am seeking feedback to identify issues.
