Hybrid Async-Native Engine for Python: A Design Concept for Review

Introduction

I've been working on a design concept for a hybrid execution engine that aims to bridge Python's asyncio with native C-level threading. I'm sharing this here to get community feedback, identify potential issues, and learn from more experienced developers. This is an experimental concept, and I'm certain there are drawbacks and edge cases I haven't considered.

Please note: This is a theoretical design I've developed. I have not benchmarked it against production systems, and the performance characteristics I describe are based on my understanding of how the components should behave, not empirical measurements.

Design Goals

My primary goal was to explore ways to reduce task orchestration overhead in async Python applications. Standard asyncio works well for I/O-bound tasks, but I wanted to experiment with moving the task queue into native C memory to see if this could reduce scheduling overhead.

Core Architecture Concept

1. The "Bus" - A Native Task Queue

The central idea is a C-implemented task queue (I'm calling it the "Bus") that uses atomic operations for task management:

  • Push Operation: Adds tasks to the queue using atomic pointer updates
  • Pop Operation: Workers retrieve tasks using mutex locks
  • Wait State: Idle workers use condition variables to minimize CPU usage
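
To make the intended semantics concrete, here is a minimal Python model of the Bus. This is only a sketch: the real design calls for C with atomic operations, so threading.Lock and threading.Condition stand in for the mutex and condition variable, and the policy parameter is my shorthand for the LIFO/FIFO choice.

import threading
from collections import deque

class Bus:
    """Toy Python model of the native Bus: a policy-aware task queue."""

    def __init__(self, policy="FIFO"):
        self._tasks = deque()
        self._lock = threading.Lock()                     # stands in for the C mutex
        self._not_empty = threading.Condition(self._lock)
        self._policy = policy                             # 'FIFO' or 'LIFO'

    def push(self, task):
        # The C design uses atomic pointer updates; a lock models that here.
        with self._not_empty:
            self._tasks.append(task)
            self._not_empty.notify()                      # wake one idle worker

    def pop(self):
        # Idle workers block on the condition variable, keeping CPU usage low.
        with self._not_empty:
            while not self._tasks:
                self._not_empty.wait()
            if self._policy == "LIFO":
                return self._tasks.pop()
            return self._tasks.popleft()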

Visual Architecture Overview

                          ┌─────────────────────────────────┐
                          │   Python Application Layer      │
                          │  (Submits Tasks via setup())    │
                          └────────────┬────────────────────┘
                                       │
                                       ▼
                          ┌─────────────────────────────────┐
                          │    Native C "Bus" (Queue)       │
                          │   ┌─────────────────────────┐   │
                          │   │  Task Priority Queue    │   │
                          │   │  (LIFO/FIFO Policy)     │   │
                          │   │  Atomic PUSH/POP Ops    │   │
                          │   └─────────────────────────┘   │
                          └─────┬───────────────────┬───────┘
                                │                   │
                ┌───────────────┴─────┐   ┌────────┴──────────────┐
                │                     │   │                       │
                ▼                     ▼   ▼                       ▼
    ┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐
    │  GIL Thread 1       │  │  GIL Thread 2       │  │  C Thread 1         │
    │  ┌───────────────┐  │  │  ┌───────────────┐  │  │  ┌───────────────┐  │
    │  │ Event Loop    │  │  │  │ Event Loop    │  │  │  │ C Function    │  │
    │  │ 1000+ async   │  │  │  │ 1000+ async   │  │  │  │ Execution     │  │
    │  │ I/O tasks     │  │  │  │ I/O tasks     │  │  │  │ (Parallel)    │  │
    │  └───────────────┘  │  │  └───────────────┘  │  │  └───────────────┘  │
    │  Requires GIL       │  │  Requires GIL       │  │  No GIL Needed      │
    └─────────────────────┘  └─────────────────────┘  └──────────┬──────────┘
             │                         │                          │
             │                         │              ┌───────────▼───────────┐
             │                         │              │  Shared Memory Bus    │
             │                         │              │  (Non-GIL Only)       │
             │                         │              │  ┌─────────────────┐  │
             │                         │              │  │ Atomic Read/    │  │
             │                         │              │  │ Write Operations│  │
             │                         │              │  └─────────────────┘  │
             │                         │              │  Zero-Copy Data      │
             │                         │              │  Sharing Between C   │
             │                         │              └──────────┬───────────┘
             │                         │                         │
             │                         │              ┌──────────▼──────────┐
             │                         │              │  C Thread 2         │
             │                         │              │  ┌───────────────┐  │
             │                         │              │  │ C Function    │  │
             │                         │              │  │ Execution     │  │
             │                         │              │  │ (Parallel)    │  │
             │                         │              │  └───────────────┘  │
             │                         │              │  No GIL Needed      │
             │                         │              └─────────────────────┘
             │                         │                         │
             └─────────────┬───────────┴─────────────────────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │  Results /      │
                  │  Completion     │
                  └─────────────────┘

Thread Pool Composition

| Configuration | GIL Threads | C Threads | Total Workers |
| ------------- | ----------- | --------- | ------------- |
| Example 1 | 2 | 2 | 4 |
| Example 2 | 3 | 1 | 4 |
| Example 3 | 1 | 7 | 8 |

Constraint: gil_threads ≤ workers and c_threads = workers - gil_threads
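
Expressed as code, the constraint is a one-line validation plus a subtraction. This sketch (the function name is hypothetical) reproduces the table above:

def split_workers(workers: int, gil_threads: int) -> tuple[int, int]:
    """Derive the GIL/C thread split, enforcing gil_threads <= workers."""
    if not 0 <= gil_threads <= workers:
        raise ValueError("gil_threads must satisfy 0 <= gil_threads <= workers")
    return gil_threads, workers - gil_threads

# Example 3 from the table: 8 workers with 1 GIL thread leaves 7 C threads.
assert split_workers(8, 1) == (1, 7)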

Task Type Routing Table

| Task Type | Can Execute On | Execution Model | Memory Access | Typical Use Case |
| --------- | -------------- | --------------- | ------------- | ---------------- |
| Python Coroutine (async/await) | GIL Threads Only | Concurrent I/O (1000s per thread) | Python heap only | Web requests, file I/O, database queries |
| C Function Pointer | C Threads (Preferred) | True Parallel Execution | Shared Memory Bus + Local | CPU-intensive math, data processing, encoding |
| C Function Pointer (Alternative) | GIL Threads (If C threads busy)* | Sequential Execution | Local memory only | Fallback when C threads saturated |

*Design question: Should this fallback be allowed?

Memory Architecture Comparison

| Memory Region | Accessible By | Thread-Safe | Use Case | Size Limit |
| ------------- | ------------- | ----------- | -------- | ---------- |
| Main Task Bus | All Threads | Yes (Mutex/Atomic) | Task queue management | Defined by shared_amount |
| Shared Memory Bus | C Threads Only | Yes (Atomic Ops) | Zero-copy data sharing between C tasks | Defined by shared_memory parameter |
| Python Heap | GIL Threads Only | Yes (GIL Protected) | Python objects and coroutine state | System memory limit |
| Thread-Local Memory | Each Thread | N/A (No sharing) | Thread-specific temporary data | Stack/heap limits |

Capacity Examples

| Workers | GIL Threads | C Threads | Concurrent I/O Capacity | Parallel CPU Tasks |
| ------- | ----------- | --------- | ----------------------- | ------------------ |
| 4 | 2 | 2 | ~2,000 (1000 × 2) | 2 simultaneous |
| 8 | 3 | 5 | ~3,000 (1000 × 3) | 5 simultaneous |
| 8 | 6 | 2 | ~6,000 (1000 × 6) | 2 simultaneous |
| 16 | 4 | 12 | ~4,000 (1000 × 4) | 12 simultaneous |

2. Dual Execution Paths

The design attempts to support two types of workloads:

Path A: Python Coroutines (GIL-dependent)

  • Standard async/await functions
  • Requires GIL acquisition during execution
  • Falls back to single-threaded concurrency model

Path B: C Function Pointers (Non-GIL)

  • Direct C function execution
  • Bypasses Python interpreter
  • Theoretical true parallelism across cores

Known concern: I'm not sure how to safely marshal complex Python objects into the Non-GIL path without creating memory management issues.

3. Configuration Parameters

setup(
    workers=4,              # Total thread pool size (capped at CPU count)
    gil_threads=2,          # Number of GIL-enabled threads (must be ≤ workers)
    tasks_per_thread=100,   # Virtual queue depth per worker
    shared_amount="1GB",    # Memory ceiling for backpressure
    shared_memory="512MB",  # Shared Memory Bus size (Non-GIL threads only)
    policy='LIFO'           # Task ordering strategy
)

Important constraint: gil_threads must always be less than or equal to workers. This parameter determines how many threads in the pool are designated for Python coroutines (which require the GIL).

Example: If workers=4 and gil_threads=2:

  • 2 threads are designated for GIL-based Python coroutines
  • 2 threads are designated for Non-GIL C function execution
  • The 2 C threads share access to a 512MB Shared Memory Bus

Shared Memory Bus (Non-GIL Only)

In addition to the main Task Bus, the design includes a separate Shared Memory Bus that provides zero-copy data sharing between Non-GIL C threads:

Key Features:

  • Exclusive to C Threads: Only Non-GIL threads can access this memory region
  • Lock-Free Access: Uses atomic operations for thread-safe reads and writes
  • Zero-Copy Sharing: C threads can share large data structures without serialization
  • Direct Memory Access: C functions can read/write shared memory pointers directly
  • Isolated from Python: GIL threads cannot access this memory, maintaining strict separation

Use Cases:

  • Sharing intermediate computation results between parallel C tasks
  • Building shared lookup tables or caches for C functions
  • Passing large arrays or buffers between C threads without copying
  • Coordinating state across parallel C computations

Design Question: Should there be a mechanism to synchronize data between the Shared Memory Bus and Python objects, or should this remain completely isolated? The isolation is safer but limits interoperability.

API Methods and Functionality

The engine exposes several methods for task submission, memory management, and introspection:

Task Submission Methods

spawn(coroutine)

Submits a Python coroutine to the Task Bus for execution on GIL threads.

async def fetch_data(url):
    data = await do_request(url)  # placeholder for a real async I/O call
    return data

engine.spawn(fetch_data("https://example.com"))

Behavior:

  • Task is pushed to the main Task Bus
  • Routed only to GIL-enabled threads
  • Executes within an asyncio event loop
  • Returns immediately (non-blocking submission)
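
This submission pattern can be approximated today with asyncio.run_coroutine_threadsafe(), which hands a coroutine to an event loop running on another thread and returns a future immediately. A sketch under that assumption (the engine itself is hypothetical; everything below is plain standard library):

import asyncio
import threading

def start_gil_thread() -> asyncio.AbstractEventLoop:
    """Run an event loop on a dedicated thread, as a 'GIL thread' would."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop

async def fetch_data(url):
    await asyncio.sleep(0.1)        # placeholder for real async I/O
    return url

loop = start_gil_thread()
# Non-blocking submission: a concurrent.futures.Future is returned at once.
future = asyncio.run_coroutine_threadsafe(fetch_data("https://example.com"), loop)
print(future.result())              # block only when the result is needed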

c_spawn(function_pointer, args)

Submits a C function pointer to the Task Bus for execution on Non-GIL threads.

# Assuming you have a C function registered
c_function_ptr = get_c_function("process_array")
engine.c_spawn(c_function_ptr, (array_data, size))

Behavior:

  • Task is pushed to the main Task Bus with high priority
  • Routed preferentially to Non-GIL C threads
  • Executes directly in C without Python interpreter
  • Can access Shared Memory Bus for data sharing

Design Question: How should arguments be marshaled? Should only C-native types be allowed, or should there be automatic conversion for simple Python types (int, float, bytes)?
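
For what it's worth, ctypes already embodies one answer: foreign calls release the GIL for their duration, and arguments must be declared as C-native types, so the marshaling is explicit. A small illustration (this assumes the C math library can be located on the host system; it is not the engine's API):

import ctypes
import ctypes.util
from concurrent.futures import ThreadPoolExecutor

# ctypes releases the GIL around each foreign call, so two Python threads
# calling into C can genuinely run on two cores.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.argtypes = [ctypes.c_double]   # explicit C-native typing is one
libm.sqrt.restype = ctypes.c_double      # narrow answer to the question above

with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(libm.sqrt, [2.0, 3.0])))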

Shared Memory Bus Methods

memory_bus_push(key, data, size)

Writes data to the Shared Memory Bus accessible by all Non-GIL threads.

# Push a large array to shared memory
array_ptr = get_array_pointer(my_data)
engine.memory_bus_push("computation_input", array_ptr, size_bytes)

Behavior:

  • Allocates space in the Shared Memory Bus
  • Uses atomic operations to update memory pointers
  • Returns a key/handle for retrieval
  • Only accessible from C threads
  • Error if: Called from GIL thread context or memory limit exceeded

memory_bus_pop(key)

Retrieves data from the Shared Memory Bus.

# Retrieve shared data in a C function
data_ptr = engine.memory_bus_pop("computation_input")

Behavior:

  • Returns pointer to shared memory region
  • Does not copy data (zero-copy access)
  • Multiple threads can read simultaneously
  • Design Question: Should pop remove the data or just retrieve it? Should there be separate memory_bus_get() for non-destructive reads?
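
To pin down the semantics I have in mind, here is a toy Python model of both operations. The real design would hand out raw C pointers; a dict of memoryviews stands in here, and I have split the read into pop() and get() as one possible answer to the destructive-read question:

import threading

class MemoryBus:
    """Toy model: named zero-copy buffers behind a single lock."""

    def __init__(self, capacity: int):
        self._buffers: dict[str, memoryview] = {}
        self._capacity = capacity
        self._used = 0
        self._lock = threading.Lock()        # the C design would use atomics

    def push(self, key: str, data: bytes) -> str:
        with self._lock:
            if self._used + len(data) > self._capacity:
                raise MemoryError("Shared Memory Bus limit exceeded")
            self._buffers[key] = memoryview(bytearray(data))
            self._used += len(data)
            return key                       # handle for later retrieval

    def get(self, key: str) -> memoryview:
        with self._lock:                     # non-destructive, zero-copy read
            return self._buffers[key]

    def pop(self, key: str) -> memoryview:
        with self._lock:                     # destructive read frees the budget
            buf = self._buffers.pop(key)
            self._used -= len(buf)
            return buf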

Monitoring and Introspection Methods

get_stats()

Returns current engine statistics and performance metrics.

stats = engine.get_stats()
print(stats)
# Output:
# {
#   'total_tasks_submitted': 15420,
#   'tasks_completed': 15102,
#   'tasks_in_queue': 318,
#   'gil_threads_active': 2,
#   'gil_threads_idle': 0,
#   'c_threads_active': 3,
#   'c_threads_idle': 1,
#   'shared_memory_used': 245760000,  # bytes
#   'shared_memory_available': 291240000,
#   'avg_task_latency_ms': 2.4,
#   'bus_contention_count': 47
# }

Returned Metrics:

  • Task counters (submitted, completed, queued)
  • Thread utilization per type
  • Memory usage statistics
  • Performance metrics (latency, throughput)
  • Contention/blocking events

thread_info(thread_id=None)

Returns detailed information about specific threads or all threads.

# Get info for all threads
all_threads = engine.thread_info()

# Get info for specific thread
thread_5 = engine.thread_info(thread_id=5)
# Output:
# {
#   'thread_id': 5,
#   'type': 'C_THREAD',
#   'state': 'RUNNING',
#   'current_task': 'process_matrix_42',
#   'tasks_completed': 1847,
#   'cpu_time_ms': 45230,
#   'idle_time_ms': 892,
#   'last_active': '2026-01-13T10:23:45'
# }

Information Provided:

  • Thread type (GIL vs C)
  • Current state (RUNNING, IDLE, WAITING, BLOCKED)
  • Task execution history
  • CPU time and idle time
  • Current task identifier

get_call_stack(task_id=None)

Returns a call stack representation for tasks, similar to frame info but as a custom Python object.

# Get current task's call stack
stack = engine.get_call_stack()

# Get specific task's call stack
stack = engine.get_call_stack(task_id="task_12345")

# Stack object structure (custom Python object)
for frame in stack.frames:
    print(f"Function: {frame.function_name}")
    print(f"Location: {frame.file}:{frame.line}")
    print(f"Type: {frame.execution_type}")  # 'PYTHON' or 'C_NATIVE'
    print(f"Thread: {frame.thread_id}")
    print(f"Timestamp: {frame.timestamp}")

Custom Stack Object Properties:

  • frames: List of frame objects (not standard Python frame objects)
  • Each frame contains:
    • function_name: Name of the function/coroutine
    • file: Source file (or "C_NATIVE" for C functions)
    • line: Line number (or 0 for C functions)
    • execution_type: 'PYTHON' or 'C_NATIVE'
    • thread_id: Which thread is executing this frame
    • timestamp: When this frame was entered
    • memory_refs: References to Shared Memory Bus if applicable

Design Question: Should the stack object be immutable (snapshot at call time) or live-updating? Should it include memory access history for C threads?
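
If the snapshot answer is chosen, the frame object sketches naturally as a frozen dataclass (all names hypothetical, mirroring the property list above):

from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    """One immutable entry in a call-stack snapshot."""
    function_name: str
    file: str                  # source file, or "C_NATIVE" for C functions
    line: int                  # 0 for C functions
    execution_type: str        # 'PYTHON' or 'C_NATIVE'
    thread_id: int
    timestamp: float
    memory_refs: tuple = ()    # Shared Memory Bus handles, if any

@dataclass(frozen=True)
class CallStack:
    frames: tuple[Frame, ...] = ()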

Additional Utility Methods

pause() / resume()

Temporarily pause task execution without shutdown.

engine.pause()  # Stop accepting new tasks, finish current ones
engine.resume()  # Resume task acceptance

clear_memory_bus()

Clears all data from the Shared Memory Bus.

engine.clear_memory_bus()  # Free all shared memory allocations

set_priority(task_id, priority)

Adjusts task priority in the queue.

engine.set_priority("task_12345", priority=10)  # Higher priority

API Summary Table

| Method | Purpose | Accessible From | Returns | Blocking |
| ------ | ------- | --------------- | ------- | -------- |
| spawn() | Submit Python coroutine | Python | None | No |
| c_spawn() | Submit C function | Python | None | No |
| memory_bus_push() | Write to shared memory | Python/C | Key handle | No |
| memory_bus_pop() | Read from shared memory | C threads | Data pointer | No |
| get_stats() | Engine statistics | Python | Dict | No |
| thread_info() | Thread details | Python | Dict/List | No |
| get_call_stack() | Task call stack | Python | Custom object | No |
| pause() | Pause execution | Python | None | No |
| resume() | Resume execution | Python | None | No |
| wait_all() | Graceful completion | Python | None | Yes |
| shutdown() | Immediate termination | Python | None | No |

Thread Allocation and Task Routing

The engine divides the worker pool into two groups:

GIL-Enabled Threads (Python coroutine handlers):

  • These threads can acquire the Python GIL
  • They execute Python async functions and coroutines
  • Limited to the number specified in gil_threads parameter
  • Handle Python object manipulation safely

Non-GIL C Threads (Native function handlers):

  • Calculated as: c_threads = workers - gil_threads
  • These threads never acquire the GIL
  • Execute pure C function pointers directly
  • Provide true parallel execution across CPU cores
  • Cannot safely interact with Python objects

Task Execution Logic

When the Bus receives a task, the routing works as follows:

  1. Python Coroutine Task Arrives:

    • Checks if any GIL-enabled threads are available
    • If yes: Assigns to a free GIL thread immediately
    • If no: Task waits in queue until a GIL thread becomes free
    • C threads cannot execute these tasks (they lack GIL access)
  2. C Function Pointer Task Arrives:

    • First checks if any Non-GIL C threads are available
    • If yes: Assigns to a free C thread immediately
    • If no: Can optionally use a GIL thread (after releasing GIL for the C execution)
    • Prefers C threads for optimal performance
  3. All GIL Threads Occupied:

    • Python coroutine tasks must wait in the Bus queue
    • C function tasks continue to execute on C threads without blocking
    • This prevents Python workload from blocking native execution
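
Stripped of details, the routing reduces to a type check feeding two queues. This sketch keeps the strict no-borrowing variant so the trade-off in the question below stays visible:

import asyncio
import queue

gil_queue = queue.Queue()      # drained only by GIL threads
c_queue = queue.Queue()        # drained only by C threads

def route(task):
    """Send coroutines to GIL threads and everything else to C threads."""
    if asyncio.iscoroutine(task):
        gil_queue.put(task)    # queues up if all GIL threads are busy
    else:
        c_queue.put(task)      # C tasks never wait behind Python tasks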

Design Question I'm Uncertain About: Should C function tasks be allowed to "borrow" GIL threads when C threads are full but GIL threads are idle? Or should they strictly wait for a C thread to become available? The first approach maximizes utilization, but the second maintains cleaner separation.

4. Lifecycle Management

The design includes two shutdown approaches:

Graceful Shutdown (wait_all()):

  • Blocks until all tasks complete
  • Ensures all results are committed
  • Standard cleanup process

Immediate Shutdown (shutdown()):

  • Broadcasts termination signal
  • Stops workers mid-execution
  • Frees resources immediately
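
A conventional sketch of the two paths, using a sentinel for graceful drain and an event flag for immediate stop. This assumes FIFO draining and the toy Bus from earlier; the genuinely hard part, interrupting a task mid-execution, is exactly what the concern below is about:

import threading

STOP = object()                # sentinel task: a worker exits on popping it
stop_now = threading.Event()   # immediate-shutdown broadcast flag

def wait_all(bus, workers: list[threading.Thread]):
    """Graceful: queued tasks drain first, then one sentinel stops each worker."""
    for _ in workers:
        bus.push(STOP)
    for w in workers:
        w.join()

def shutdown():
    """Immediate: workers check the flag between tasks and bail out.
    A task already running in C cannot be safely interrupted from here."""
    stop_now.set()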

Major concern: The immediate shutdown could leave C-level state corrupted or cause memory leaks if tasks are holding native resources. I haven't figured out how to make this safe in all scenarios.

Theoretical Performance Characteristics

Based on my understanding of the architecture, I expect:

  • Task creation: Should be faster than Python asyncio due to reduced abstraction layers
  • Scheduling: Native C loop should have lower overhead than Python event loop
  • Parallelism: Non-GIL path should enable true multi-core execution for CPU-bound tasks
  • I/O Concurrency: Each GIL thread can manage thousands of async I/O operations simultaneously, multiplied across all GIL threads (e.g., 3 GIL threads × 1,000 concurrent I/O ops = 3,000 total concurrent I/O operations)
  • Hybrid Workloads: CPU-intensive C tasks run in parallel on C threads while I/O-intensive Python coroutines handle thousands of concurrent connections across GIL threads

Important caveat: These are expectations, not proven results. Actual performance would depend on implementation quality, workload characteristics, and numerous factors I may not have considered. The overhead of managing multiple event loops and coordinating between thread types might offset these theoretical gains.

Known Issues and Questions

I'm aware of several potential problems:

  1. Memory Safety: How do I safely handle Python object lifetimes in the Non-GIL execution path?

  2. Shared Memory Bus Synchronization: How do C threads coordinate access to the Shared Memory Bus? Should I use lock-free algorithms, atomic operations, or some form of lightweight locking?

  3. Memory Isolation: Is complete isolation between the Shared Memory Bus (C threads) and Python heap (GIL threads) the right approach? Or should there be a safe bridge mechanism?

  4. Data Corruption Risk: If multiple C threads write to the same Shared Memory Bus location simultaneously, how do I prevent race conditions and data corruption?

  5. GIL Thread Allocation: Is the fixed separation between GIL and Non-GIL threads the right approach? Should there be dynamic reallocation based on workload patterns?

  6. Thread Starvation: If all GIL threads are busy, Python tasks queue up even when C threads are idle. Is there a better way to handle this imbalance?

  7. GIL Interaction: The interleaving between GIL and Non-GIL tasks might create complex synchronization issues I haven't anticipated.

  8. Error Handling: How should exceptions in C-level tasks be propagated back to Python?

  9. Resource Cleanup: The immediate shutdown approach seems risky. What's the proper way to ensure clean termination, especially for data in the Shared Memory Bus?

  10. Lock Contention: Under high load, the mutex in the Bus_Pop operation might become a bottleneck.

  11. Thread Borrowing: If C threads are fully occupied but GIL threads are idle, should C tasks be allowed to execute on GIL threads (after releasing the GIL for C execution)?

  12. Shared Memory Leaks: How do I track and free memory allocations in the Shared Memory Bus when C tasks complete or fail?

Seeking Community Input

I'd greatly appreciate feedback on:

  • Architecture Review: Are there fundamental flaws in this approach that would prevent it from working?

  • Thread Allocation Strategy: Is the fixed gil_threads parameter the right approach, or should the engine dynamically adjust thread roles based on workload?

  • Task Routing Logic: How should the engine handle scenarios where one thread type is fully occupied while the other is idle?

  • Lock-Free Alternatives: Would a lock-free circular buffer be more appropriate than my mutex-based Bus?

  • Data Marshaling: What's the safest way to pass Python objects to C functions without GIL protection?

  • Shutdown Safety: How can I make the immediate shutdown path safe for all types of tasks?

  • Similar Work: Has something like this been attempted before? What were the results?

  • Python/C Boundary: Am I underestimating the overhead of crossing the Python/C boundary frequently?

  • GIL Management: Are there better ways to manage multiple threads competing for the GIL besides my fixed allocation approach?

  • Shared Memory Bus Design: Is a separate shared memory region for Non-GIL threads a good idea? What are the best practices for lock-free data structures in this context?

  • Memory Synchronization: Should there be any mechanism to transfer data between the Shared Memory Bus and Python objects, or should they remain completely isolated?

  • Atomic Operations: What atomic primitives should I use for the Shared Memory Bus? Compare-and-swap (CAS)? Memory barriers? Relaxed ordering?

  • Memory Allocation Strategy: Should the Shared Memory Bus use a pre-allocated pool, dynamic allocation, or a hybrid approach?

  • Feasibility: Will this design actually work in practice, or am I missing something fundamental about how Python's GIL and C extensions interact?

  • Implementation in Python: Can this concept be built using Python's C API and existing libraries, or would it require patches to CPython itself? Are there existing Python tools or frameworks that could help implement this?

  • Performance Boost: If implemented correctly, would this actually provide meaningful performance improvements over standard asyncio, or would the overhead of thread management and the Bus negate any gains?

Conclusion

This is an exploratory design concept, and I'm sharing it to learn from the community. I'm certain there are aspects I haven't thought through properly, and I welcome criticism and suggestions. If this approach has fundamental issues that make it impractical, I'd rather know now.

Has anyone worked on similar hybrid async/native systems? What challenges did you encounter?

Thank you for taking the time to review this concept.


Author: @hejhdiss (Muhammed Shafin P)

This is a theoretical design document. Performance claims are based on architectural expectations, not empirical testing. I am seeking feedback to identify issues.
