Introduction
I've been working on a design concept for a hybrid execution engine that aims to bridge Python's asyncio with native C-level threading. I'm sharing this here to get community feedback, identify potential issues, and learn from more experienced developers. This is an experimental concept, and I'm certain there are drawbacks and edge cases I haven't considered.
Please note: This is a theoretical design I've developed. I have not benchmarked it against production systems, and the performance characteristics I describe are based on my understanding of how the components should behave, not empirical measurements.
Design Goals
My primary goal was to explore ways to reduce task orchestration overhead in async Python applications. Standard asyncio works well for I/O-bound tasks, but I wanted to experiment with moving the task queue into native C memory to see if this could reduce scheduling overhead.
Core Architecture Concept
1. The "Bus" - A Native Task Queue
The central idea is a C-implemented task queue (I'm calling it the "Bus") that uses atomic operations for task management (a rough Python sketch follows the list):
- Push Operation: Adds tasks to the queue using atomic pointer updates
- Pop Operation: Workers retrieve tasks using mutex locks
- Wait State: Idle workers use condition variables to minimize CPU usage
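To make the intended semantics concrete, here is a minimal Python sketch of the Bus. It is an approximation only: the real design calls for a C implementation where push uses atomic pointer updates, while this sketch uses one condition variable (and its lock) for everything.

```python
import threading
from collections import deque

class BusSketch:
    """Rough Python stand-in for the native Bus.

    Push and pop share one lock here; the actual design would implement
    push with atomic pointer updates and reserve the mutex for pop.
    """

    def __init__(self, policy: str = "FIFO"):
        self._tasks = deque()
        self._cond = threading.Condition()  # wait state for idle workers
        self._policy = policy

    def push(self, task) -> None:
        with self._cond:
            self._tasks.append(task)
            self._cond.notify()  # wake exactly one idle worker

    def pop(self):
        with self._cond:
            while not self._tasks:
                self._cond.wait()  # sleep instead of spinning
            if self._policy == "LIFO":
                return self._tasks.pop()
            return self._tasks.popleft()
```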
Visual Architecture Overview
┌─────────────────────────────────┐
│ Python Application Layer │
│ (Submits Tasks via setup()) │
└────────────┬────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Native C "Bus" (Queue) │
│ ┌─────────────────────────┐ │
│ │ Task Priority Queue │ │
│ │ (LIFO/FIFO Policy) │ │
│ │ Atomic PUSH/POP Ops │ │
│ └─────────────────────────┘ │
└─────┬───────────────────┬───────┘
│ │
┌───────────────┴─────┐ ┌────────┴──────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ GIL Thread 1 │ │ GIL Thread 2 │ │ C Thread 1 │
│ ┌───────────────┐ │ │ ┌───────────────┐ │ │ ┌───────────────┐ │
│ │ Event Loop │ │ │ │ Event Loop │ │ │ │ C Function │ │
│ │ 1000+ async │ │ │ │ 1000+ async │ │ │ │ Execution │ │
│ │ I/O tasks │ │ │ │ I/O tasks │ │ │ │ (Parallel) │ │
│ └───────────────┘ │ │ └───────────────┘ │ │ └───────────────┘ │
│ Requires GIL │ │ Requires GIL │ │ No GIL Needed │
└─────────────────────┘ └─────────────────────┘ └──────────┬──────────┘
│ │ │
│ │ ┌───────────▼───────────┐
│ │ │ Shared Memory Bus │
│ │ │ (Non-GIL Only) │
│ │ │ ┌─────────────────┐ │
│ │ │ │ Atomic Read/ │ │
│ │ │ │ Write Operations│ │
│ │ │ └─────────────────┘ │
│ │ │ Zero-Copy Data │
│ │ │ Sharing Between C │
│ │ └──────────┬───────────┘
│ │ │
│ │ ┌──────────▼──────────┐
│ │ │ C Thread 2 │
│ │ │ ┌───────────────┐ │
│ │ │ │ C Function │ │
│ │ │ │ Execution │ │
│ │ │ │ (Parallel) │ │
│ │ │ └───────────────┘ │
│ │ │ No GIL Needed │
│ │ └─────────────────────┘
│ │ │
└─────────────┬───────────┴─────────────────────────┘
│
▼
┌─────────────────┐
│ Results / │
│ Completion │
└─────────────────┘
Thread Pool Composition
| Configuration | GIL Threads | C Threads | Total Workers |
|---|---|---|---|
| Example 1 | 2 | 2 | 4 |
| Example 2 | 3 | 1 | 4 |
| Example 3 | 1 | 7 | 8 |
Constraint: `gil_threads ≤ workers` and `c_threads = workers - gil_threads`
Task Type Routing Table
| Task Type | Can Execute On | Execution Model | Memory Access | Typical Use Case |
|---|---|---|---|---|
| Python Coroutine (async/await) | GIL Threads Only | Concurrent I/O (1000s per thread) | Python heap only | Web requests, file I/O, database queries |
| C Function Pointer | C Threads (Preferred) | True Parallel Execution | Shared Memory Bus + Local | CPU-intensive math, data processing, encoding |
| C Function Pointer (Alternative) | GIL Threads (If C threads busy)* | Sequential Execution | Local memory only | Fallback when C threads saturated |
*Design question: Should this fallback be allowed?
Memory Architecture Comparison
| Memory Region | Accessible By | Thread-Safe | Use Case | Size Limit |
|---|---|---|---|---|
| Main Task Bus | All Threads | Yes (Mutex/Atomic) | Task queue management | Defined by `shared_amount` |
| Shared Memory Bus | C Threads Only | Yes (Atomic Ops) | Zero-copy data sharing between C tasks | Defined by `shared_memory` parameter |
| Python Heap | GIL Threads Only | Yes (GIL Protected) | Python objects and coroutine state | System memory limit |
| Thread-Local Memory | Each Thread | N/A (No sharing) | Thread-specific temporary data | Stack/heap limits |
Capacity Examples
| Workers | GIL Threads | C Threads | Concurrent I/O Capacity | Parallel CPU Tasks |
|---|---|---|---|---|
| 4 | 2 | 2 | ~2,000 (1000 × 2) | 2 simultaneous |
| 8 | 3 | 5 | ~3,000 (1000 × 3) | 5 simultaneous |
| 8 | 6 | 2 | ~6,000 (1000 × 6) | 2 simultaneous |
| 16 | 4 | 12 | ~4,000 (1000 × 4) | 12 simultaneous |
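As a worked check of the table above, the derivation in code (the ~1,000 concurrent I/O operations per GIL thread is this post's working assumption, not a measurement):

```python
def capacity(workers: int, gil_threads: int, io_per_gil_thread: int = 1000) -> dict:
    """Derive the capacity columns from the configuration columns."""
    c_threads = workers - gil_threads
    return {
        "concurrent_io": gil_threads * io_per_gil_thread,
        "parallel_cpu_tasks": c_threads,
    }

print(capacity(8, 3))   # {'concurrent_io': 3000, 'parallel_cpu_tasks': 5}
print(capacity(16, 4))  # {'concurrent_io': 4000, 'parallel_cpu_tasks': 12}
```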
2. Dual Execution Paths
The design attempts to support two types of workloads:
Path A: Python Coroutines (GIL-dependent)
- Standard async/await functions
- Requires GIL acquisition during execution
- Runs under asyncio's usual single-threaded concurrency model within each event loop
Path B: C Function Pointers (Non-GIL)
- Direct C function execution
- Bypasses Python interpreter
- Theoretical true parallelism across cores
Known concern: I'm not sure how to safely marshal complex Python objects into the Non-GIL path without creating memory management issues.
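One mitigation I can imagine (a sketch, not part of the design itself) is to pin submitted arguments: the submitting side keeps a strong reference from submission until completion, so the garbage collector can never free memory a Non-GIL thread still points into. `c_spawn` and the completion callback are the hypothetical APIs from this post.

```python
# Hypothetical sketch: pin task arguments so a Non-GIL thread never
# holds a pointer into memory that Python has already freed.
_pinned: dict = {}  # task_id -> args (strong references)

def submit_c_task(engine, task_id, c_function_ptr, args):
    _pinned[task_id] = args              # pin while the task is in flight
    engine.c_spawn(c_function_ptr, args)

def on_task_complete(task_id):
    _pinned.pop(task_id, None)           # unpin once the C side is done
```

This only addresses lifetimes; it does not make the object's internals safe to read without the GIL, which is why restricting the Non-GIL path to plain C data (see the marshaling question below) may still be necessary.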
3. Configuration Parameters
```python
setup(
    workers=4,             # Total thread pool size (capped at CPU count)
    gil_threads=2,         # Number of GIL-enabled threads (must be <= workers)
    tasks_per_thread=100,  # Virtual queue depth per worker
    shared_amount=1 * 1024**3,    # Memory ceiling for backpressure (1 GB)
    shared_memory=512 * 1024**2,  # Shared Memory Bus size, Non-GIL threads only (512 MB)
    policy='LIFO',         # Task ordering strategy
)
```
Important constraint: gil_threads must always be less than or equal to workers. This parameter determines how many threads in the pool are designated for Python coroutines (which require the GIL).
Example: If workers=4 and gil_threads=2:
- 2 threads are designated for GIL-based Python coroutines
- 2 threads are designated for Non-GIL C function execution
- The 2 C threads share access to a 512MB Shared Memory Bus
Shared Memory Bus (Non-GIL Only)
In addition to the main Task Bus, the design includes a separate Shared Memory Bus that provides zero-copy data sharing between Non-GIL C threads:
Key Features:
- Exclusive to C Threads: Only Non-GIL threads can access this memory region
- Lock-Free Access: Uses atomic operations for thread-safe reads and writes
- Zero-Copy Sharing: C threads can share large data structures without serialization
- Direct Memory Access: C functions can read/write shared memory pointers directly
- Isolated from Python: GIL threads cannot access this memory, maintaining strict separation
Use Cases:
- Sharing intermediate computation results between parallel C tasks
- Building shared lookup tables or caches for C functions
- Passing large arrays or buffers between C threads without copying
- Coordinating state across parallel C computations
Design Question: Should there be a mechanism to synchronize data between the Shared Memory Bus and Python objects, or should this remain completely isolated? The isolation is safer but limits interoperability.
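For illustration, here is a deliberately simplified in-process sketch of the Shared Memory Bus: a keyed registry of raw C buffers with a fixed byte budget. All names are hypothetical, and a plain lock stands in for the lock-free atomic operations the design actually calls for.

```python
import ctypes
import threading

class SharedMemoryBusSketch:
    """Keyed registry of raw C buffers with a fixed byte budget.

    The real design is lock-free and C-side only; the lock here just
    keeps the sketch correct under Python threading.
    """

    def __init__(self, capacity_bytes: int):
        self._capacity = capacity_bytes
        self._used = 0
        self._lock = threading.Lock()
        self._entries: dict = {}

    def push(self, key: str, data: bytes) -> str:
        with self._lock:
            if self._used + len(data) > self._capacity:
                raise MemoryError("Shared Memory Bus capacity exceeded")
            self._entries[key] = ctypes.create_string_buffer(data, len(data))
            self._used += len(data)
        return key  # handle for later retrieval

    def pop(self, key: str):
        with self._lock:
            buf = self._entries.pop(key)
            self._used -= len(buf)
            return buf
```

Keeping the registry keyed makes the `memory_bus_push()`/`memory_bus_pop()` API described below straightforward to express.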
API Methods and Functionality
The engine exposes several methods for task submission, memory management, and introspection:
Task Submission Methods
spawn(coroutine)
Submits a Python coroutine to the Task Bus for execution on GIL threads.
```python
import asyncio

async def fetch_data(url):
    await asyncio.sleep(0.1)  # placeholder async I/O operation
    return f"response from {url}"

engine.spawn(fetch_data("https://example.com"))
```
Behavior:
- Task is pushed to the main Task Bus
- Routed only to GIL-enabled threads
- Executes within an asyncio event loop
- Returns immediately (non-blocking submission)
c_spawn(function_pointer, args)
Submits a C function pointer to the Task Bus for execution on Non-GIL threads.
```python
# Assuming you have a C function registered
c_function_ptr = get_c_function("process_array")
engine.c_spawn(c_function_ptr, (array_data, size))
```
Behavior:
- Task is pushed to the main Task Bus with high priority
- Routed preferentially to Non-GIL C threads
- Executes directly in C without Python interpreter
- Can access Shared Memory Bus for data sharing
Design Question: How should arguments be marshalled? Should only C-native types be allowed, or should there be automatic conversion for simple Python types (int, float, bytes)?
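If only simple types were allowed, the conversion layer might be as small as a whitelist that maps each permitted Python type to a C-native representation and rejects everything else (hypothetical sketch):

```python
import ctypes

# Hypothetical whitelist: only int, float, and bytes cross the boundary.
_CONVERTERS = {
    int: ctypes.c_long,
    float: ctypes.c_double,
    bytes: lambda b: ctypes.create_string_buffer(b, len(b)),
}

def marshal_arg(value):
    converter = _CONVERTERS.get(type(value))
    if converter is None:
        raise TypeError(
            f"{type(value).__name__} cannot be marshaled into the Non-GIL path"
        )
    return converter(value)
```

Automatic conversion beyond these few types would reintroduce the object-lifetime problems noted earlier, so a strict whitelist seems like the safer starting point.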
Shared Memory Bus Methods
memory_bus_push(key, data, size)
Writes data to the Shared Memory Bus accessible by all Non-GIL threads.
```python
# Push a large array to shared memory
array_ptr = get_array_pointer(my_data)
engine.memory_bus_push("computation_input", array_ptr, size_bytes)
```
Behavior:
- Allocates space in the Shared Memory Bus
- Uses atomic operations to update memory pointers
- Returns a key/handle for retrieval
- Only accessible from C threads
- Error if: Called from GIL thread context or memory limit exceeded
memory_bus_pop(key)
Retrieves data from the Shared Memory Bus.
```python
# Retrieve shared data in a C function
data_ptr = engine.memory_bus_pop("computation_input")
```
Behavior:
- Returns pointer to shared memory region
- Does not copy data (zero-copy access)
- Multiple threads can read simultaneously
Design Question: Should pop remove the data or just retrieve it? Should there be a separate `memory_bus_get()` for non-destructive reads?
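The two semantics the question contrasts differ by a single operation; a toy version over a plain dict:

```python
_entries = {}

def memory_bus_pop(key):
    """Destructive: the entry is removed; later readers get a KeyError."""
    return _entries.pop(key)

def memory_bus_get(key):
    """Non-destructive: the entry stays visible to other C threads."""
    return _entries[key]
```

A destructive pop gives a clear ownership transfer; a non-destructive get suits shared lookup tables, so supporting both may be reasonable.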
Monitoring and Introspection Methods
get_stats()
Returns current engine statistics and performance metrics.
```python
stats = engine.get_stats()
print(stats)
# Output:
# {
#     'total_tasks_submitted': 15420,
#     'tasks_completed': 15102,
#     'tasks_in_queue': 318,
#     'gil_threads_active': 2,
#     'gil_threads_idle': 0,
#     'c_threads_active': 3,
#     'c_threads_idle': 1,
#     'shared_memory_used': 245760000,       # bytes
#     'shared_memory_available': 291240000,  # bytes
#     'avg_task_latency_ms': 2.4,
#     'bus_contention_count': 47
# }
```
Returned Metrics:
- Task counters (submitted, completed, queued)
- Thread utilization per type
- Memory usage statistics
- Performance metrics (latency, throughput)
- Contention/blocking events
thread_info(thread_id=None)
Returns detailed information about specific threads or all threads.
```python
# Get info for all threads
all_threads = engine.thread_info()

# Get info for specific thread
thread_5 = engine.thread_info(thread_id=5)
# Output:
# {
#     'thread_id': 5,
#     'type': 'C_THREAD',
#     'state': 'RUNNING',
#     'current_task': 'process_matrix_42',
#     'tasks_completed': 1847,
#     'cpu_time_ms': 45230,
#     'idle_time_ms': 892,
#     'last_active': '2026-01-13T10:23:45'
# }
```
Information Provided:
- Thread type (GIL vs C)
- Current state (RUNNING, IDLE, WAITING, BLOCKED)
- Task execution history
- CPU time and idle time
- Current task identifier
get_call_stack(task_id=None)
Returns a call stack representation for tasks, similar to frame info but as a custom Python object.
```python
# Get current task's call stack
stack = engine.get_call_stack()

# Get specific task's call stack
stack = engine.get_call_stack(task_id="task_12345")

# Stack object structure (custom Python object)
for frame in stack.frames:
    print(f"Function: {frame.function_name}")
    print(f"Location: {frame.file}:{frame.line}")
    print(f"Type: {frame.execution_type}")  # 'PYTHON' or 'C_NATIVE'
    print(f"Thread: {frame.thread_id}")
    print(f"Timestamp: {frame.timestamp}")
```
Custom Stack Object Properties:
- `frames`: List of frame objects (not standard Python frame objects). Each frame contains:
  - `function_name`: Name of the function/coroutine
  - `file`: Source file (or "C_NATIVE" for C functions)
  - `line`: Line number (or 0 for C functions)
  - `execution_type`: 'PYTHON' or 'C_NATIVE'
  - `thread_id`: Which thread is executing this frame
  - `timestamp`: When this frame was entered
  - `memory_refs`: References to Shared Memory Bus if applicable
Design Question: Should the stack object be immutable (snapshot at call time) or live-updating? Should it include memory access history for C threads?
Additional Utility Methods
pause() / resume()
Temporarily pause task execution without shutdown.
```python
engine.pause()   # Stop accepting new tasks, finish current ones
engine.resume()  # Resume task acceptance
```
clear_memory_bus()
Clears all data from the Shared Memory Bus.
```python
engine.clear_memory_bus()  # Free all shared memory allocations
```
set_priority(task_id, priority)
Adjusts task priority in the queue.
```python
engine.set_priority("task_12345", priority=10)  # Higher priority
```
API Summary Table
| Method | Purpose | Accessible From | Returns | Blocking |
|---|---|---|---|---|
| `spawn()` | Submit Python coroutine | Python | None | No |
| `c_spawn()` | Submit C function | Python | None | No |
| `memory_bus_push()` | Write to shared memory | Python/C | Key handle | No |
| `memory_bus_pop()` | Read from shared memory | C threads | Data pointer | No |
| `get_stats()` | Engine statistics | Python | Dict | No |
| `thread_info()` | Thread details | Python | Dict/List | No |
| `get_call_stack()` | Task call stack | Python | Custom object | No |
| `pause()` | Pause execution | Python | None | No |
| `resume()` | Resume execution | Python | None | No |
| `wait_all()` | Graceful completion | Python | None | Yes |
| `shutdown()` | Immediate termination | Python | None | No |
Thread Allocation and Task Routing
The engine divides the worker pool into two groups, sketched in code after the lists below:
GIL-Enabled Threads (Python coroutine handlers):
- These threads can acquire the Python GIL
- They execute Python async functions and coroutines
- Limited to the number specified in the `gil_threads` parameter
- Handle Python object manipulation safely
Non-GIL C Threads (Native function handlers):
- Calculated as: `c_threads = workers - gil_threads`
- These threads never acquire the GIL
- Execute pure C function pointers directly
- Provide true parallel execution across CPU cores
- Cannot safely interact with Python objects
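A bookkeeping-only sketch of this split is below. One caveat: in pure Python every thread acquires the GIL while running bytecode, so this models the partitioning arithmetic and thread creation, not the actual GIL semantics; the worker bodies are hypothetical stubs.

```python
import asyncio
import threading

def gil_worker():
    """Stub: each GIL thread would run its own asyncio event loop."""
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_forever()

def c_worker():
    """Stub: a real C worker would pop function pointers off the Bus."""

def build_pool(workers: int, gil_threads: int):
    if gil_threads > workers:
        raise ValueError("gil_threads must be <= workers")
    gil_pool = [threading.Thread(target=gil_worker, daemon=True)
                for _ in range(gil_threads)]
    c_pool = [threading.Thread(target=c_worker, daemon=True)
              for _ in range(workers - gil_threads)]
    return gil_pool, c_pool
```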
Task Execution Logic
When the Bus receives a task, the routing works as follows:
1. Python Coroutine Task Arrives:
   - Checks if any GIL-enabled threads are available
   - If yes: Assigns to a free GIL thread immediately
   - If no: Task waits in queue until a GIL thread becomes free
   - C threads cannot execute these tasks (they lack GIL access)
2. C Function Pointer Task Arrives:
   - First checks if any Non-GIL C threads are available
   - If yes: Assigns to a free C thread immediately
   - If no: Can optionally use a GIL thread (after releasing the GIL for the C execution)
   - Prefers C threads for optimal performance
3. All GIL Threads Occupied:
   - Python coroutine tasks must wait in the Bus queue
   - C function tasks continue to execute on C threads without blocking
   - This prevents the Python workload from blocking native execution
Design Question I'm Uncertain About: Should C function tasks be allowed to "borrow" GIL threads when C threads are full but GIL threads are idle? Or should they strictly wait for a C thread to become available? The first approach maximizes utilization, but the second maintains cleaner separation.
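Condensed into code, the routing decision (including the optional borrowing the question above asks about) might look like this sketch; `task_kind` and `allow_borrow` are hypothetical names:

```python
def route(task_kind: str, gil_idle: bool, c_idle: bool, allow_borrow: bool = False):
    """Return the worker group that should take the task, or None to queue it."""
    if task_kind == "coroutine":
        # Coroutines require the GIL; C threads can never take them.
        return "gil" if gil_idle else None
    if task_kind == "c_function":
        if c_idle:
            return "c"                  # preferred: true parallelism
        if allow_borrow and gil_idle:
            return "gil"                # the open question: borrow a GIL thread
        return None                     # otherwise wait for a C thread
    raise ValueError(f"unknown task kind: {task_kind}")
```

For example, `route("c_function", gil_idle=True, c_idle=False, allow_borrow=True)` returns `"gil"`, which is exactly the borrowing case in question.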
4. Lifecycle Management
The design includes two shutdown approaches:
Graceful Shutdown (wait_all()):
- Blocks until all tasks complete
- Ensures all results are committed
- Standard cleanup process
Immediate Shutdown (shutdown()):
- Broadcasts termination signal
- Stops workers mid-execution
- Frees resources immediately
Major concern: The immediate shutdown could leave C-level state corrupted or cause memory leaks if tasks are holding native resources. I haven't figured out how to make this safe in all scenarios.
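For completeness, a hypothetical usage sketch of the two lifecycle paths, reusing the API from the earlier examples:

```python
engine = setup(workers=4, gil_threads=2)
engine.spawn(fetch_data("https://example.com"))

# Graceful path: block until every queued and running task finishes.
engine.wait_all()

# Immediate path: broadcast termination. As noted above, this can strand
# native resources held by in-flight C tasks.
engine.shutdown()
```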
Theoretical Performance Characteristics
Based on my understanding of the architecture, I expect:
- Task creation: Should be faster than Python asyncio due to reduced abstraction layers
- Scheduling: Native C loop should have lower overhead than Python event loop
- Parallelism: Non-GIL path should enable true multi-core execution for CPU-bound tasks
- I/O Concurrency: Each GIL thread can manage thousands of async I/O operations simultaneously, multiplied across all GIL threads (e.g., 3 GIL threads × 1,000 concurrent I/O ops = 3,000 total concurrent I/O operations)
- Hybrid Workloads: CPU-intensive C tasks run in parallel on C threads while I/O-intensive Python coroutines handle thousands of concurrent connections across GIL threads
Important caveat: These are expectations, not proven results. Actual performance would depend on implementation quality, workload characteristics, and numerous factors I may not have considered. The overhead of managing multiple event loops and coordinating between thread types might offset these theoretical gains.
Known Issues and Questions
I'm aware of several potential problems:
Memory Safety: How do I safely handle Python object lifetimes in the Non-GIL execution path?
Shared Memory Bus Synchronization: How do C threads coordinate access to the Shared Memory Bus? Should I use lock-free algorithms, atomic operations, or some form of lightweight locking?
Memory Isolation: Is complete isolation between the Shared Memory Bus (C threads) and Python heap (GIL threads) the right approach? Or should there be a safe bridge mechanism?
Data Corruption Risk: If multiple C threads write to the same Shared Memory Bus location simultaneously, how do I prevent race conditions and data corruption?
GIL Thread Allocation: Is the fixed separation between GIL and Non-GIL threads the right approach? Should there be dynamic reallocation based on workload patterns?
Thread Starvation: If all GIL threads are busy, Python tasks queue up even when C threads are idle. Is there a better way to handle this imbalance?
GIL Interaction: The interleaving between GIL and Non-GIL tasks might create complex synchronization issues I haven't anticipated.
Error Handling: How should exceptions in C-level tasks be propagated back to Python?
Resource Cleanup: The immediate shutdown approach seems risky. What's the proper way to ensure clean termination, especially for data in the Shared Memory Bus?
Lock Contention: Under high load, the mutex in the Bus_Pop operation might become a bottleneck.
Thread Borrowing: If C threads are fully occupied but GIL threads are idle, should C tasks be allowed to execute on GIL threads (after releasing the GIL for C execution)?
Shared Memory Leaks: How do I track and free memory allocations in the Shared Memory Bus when C tasks complete or fail?
Seeking Community Input
I'd greatly appreciate feedback on:
Architecture Review: Are there fundamental flaws in this approach that would prevent it from working?
Thread Allocation Strategy: Is the fixed `gil_threads` parameter the right approach, or should the engine dynamically adjust thread roles based on workload?
Task Routing Logic: How should the engine handle scenarios where one thread type is fully occupied while the other is idle?
Lock-Free Alternatives: Would a lock-free circular buffer be more appropriate than my mutex-based Bus?
Data Marshaling: What's the safest way to pass Python objects to C functions without GIL protection?
Shutdown Safety: How can I make the immediate shutdown path safe for all types of tasks?
Similar Work: Has something like this been attempted before? What were the results?
Python/C Boundary: Am I underestimating the overhead of crossing the Python/C boundary frequently?
GIL Management: Are there better ways to manage multiple threads competing for the GIL besides my fixed allocation approach?
Shared Memory Bus Design: Is a separate shared memory region for Non-GIL threads a good idea? What are the best practices for lock-free data structures in this context?
Memory Synchronization: Should there be any mechanism to transfer data between the Shared Memory Bus and Python objects, or should they remain completely isolated?
Atomic Operations: What atomic primitives should I use for the Shared Memory Bus? Compare-and-swap (CAS)? Memory barriers? Relaxed ordering?
Memory Allocation Strategy: Should the Shared Memory Bus use a pre-allocated pool, dynamic allocation, or a hybrid approach?
Feasibility: Will this design actually work in practice, or am I missing something fundamental about how Python's GIL and C extensions interact?
Implementation in Python: Can this concept be built using Python's C API and existing libraries, or would it require patches to CPython itself? Are there existing Python tools or frameworks that could help implement this?
Performance Boost: If implemented correctly, would this actually provide meaningful performance improvements over standard asyncio, or would the overhead of thread management and the Bus negate any gains?
Conclusion
This is an exploratory design concept, and I'm sharing it to learn from the community. I'm certain there are aspects I haven't thought through properly, and I welcome criticism and suggestions. If this approach has fundamental issues that make it impractical, I'd rather know now.
Has anyone worked on similar hybrid async/native systems? What challenges did you encounter?
Thank you for taking the time to review this concept.
Author: @hejhdiss (Muhammed Shafin P)
This is a theoretical design document. Performance claims are based on architectural expectations, not empirical testing. I am seeking feedback to identify issues.