🧠 How We Built Our Own ZIP Handler from Scratch: Complete Technical Journey (Pagonic Project)


A journey of building a production-ready ZIP engine from scratch with AI support, reaching a peak extraction speed of 602.6 MB/s.


🧠 Introduction

In my previous articles, I shared how I built a modern ZIP engine using AI tools and achieved spectacular performance improvements. But the real story goes deeper - it's about building our own ZIP handler from scratch instead of relying on Python's built-in zipfile module.

This article tells the complete technical journey of creating zip_handler.py - a 4220-line production-ready ZIP engine with AI-assisted optimizations, achieving 602.6 MB/s extraction speed. ZIP64 support is in development and test results for 4GB+ files will be shared when completed.


🎯 Challenge: Why We Built Our Own ZIP Handler

πŸ’‘ What you'll learn in this section: Limitations of standard libraries, our vision, and why we decided to develop a custom solution.

Problems with Standard Libraries

  • Python's zipfile: general-purpose design with little room for optimization
  • Performance bottleneck: our 2.8 MB/s baseline was unacceptable
  • ZIP64 inflexibility: no control over how 4GB+ files are handled
  • Limited customization: no hooks for applying AI-assisted optimizations

Our Vision

  • Custom ZIP parser: Full control over format parsing
  • AI-assisted optimizations: Pattern recognition and adaptive strategies
  • Hardware acceleration: SIMD CRC32 and memory operations
  • Production performance: 600+ MB/s target (achieved!)

πŸ—οΈ Architecture: Building the Foundation

πŸ—οΈ What you'll learn in this section: System architecture, component structure, and fundamental design decisions.

Core Components

```
zip_handler.py (4220 lines)
β”œβ”€β”€ ZIP Format Parser (zip_structs.py)
β”œβ”€β”€ SIMD Optimizations (simd_crc32.py)
β”œβ”€β”€ Hybrid Decompressor (hybrid_decompressor.py)
β”œβ”€β”€ Buffer Pool System (buffer_pool.py)
β”œβ”€β”€ AI Optimization Engine (ai_optimizer.py)
└── Parallel Processing (zip_parallel_orchestrator.py)
```

Key Design Decisions

  • Modular architecture: Each component <400 lines for Copilot compatibility
  • Hybrid strategy: Fast path for small files, optimized path for large files
  • Thread-safe design: Proper synchronization for parallel processing
  • Backward compatibility: Works with existing ZIP files

πŸ”§ Technical Implementation: Deep Dive

πŸ”§ What you'll learn in this section: Technical implementation of each component, challenges faced, and solutions.

1. ZIP Format Parser (zip_structs.py)

Challenge: Understanding and implementing ZIP file format from scratch

Solution:

  • Created dataclass structures for all ZIP headers
  • Implemented offset-based binary parsing
  • Added ZIP64 support for large files
  • Built robust error handling

πŸ”§ Key Code:

```python
from dataclasses import dataclass

@dataclass
class CentralDirectoryEntry:
    signature: int                 # 0x02014b50 ("PK\x01\x02")
    version_made_by: int           # system that created the file
    compression_method: int        # 0 = store, 8 = deflate
    crc32: int                     # CRC-32 checksum
    compressed_size: int           # compressed size in bytes
    uncompressed_size: int         # original size in bytes
    filename: str = ""             # file name
    local_header_offset: int = 0   # offset of the local file header
```
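
To make the offset-based binary parsing concrete, here is a minimal sketch of how one central directory entry could be decoded with Python's struct module, reusing the dataclass above. The 46-byte fixed layout follows the ZIP specification (APPNOTE.TXT); the helper itself is illustrative, not the project's actual parser.

```python
import struct

CDE_FORMAT = "<IHHHHHHIIIHHHHHII"      # little-endian fixed fields per APPNOTE.TXT
CDE_SIZE = struct.calcsize(CDE_FORMAT)  # 46 bytes

def parse_central_directory_entry(buf: bytes, offset: int) -> CentralDirectoryEntry:
    fields = struct.unpack_from(CDE_FORMAT, buf, offset)
    if fields[0] != 0x02014b50:
        raise ValueError("bad central directory signature")
    filename_len = fields[10]
    name_start = offset + CDE_SIZE
    filename = buf[name_start:name_start + filename_len].decode("utf-8", "replace")
    return CentralDirectoryEntry(
        signature=fields[0],
        version_made_by=fields[1],
        compression_method=fields[4],
        crc32=fields[7],
        compressed_size=fields[8],
        uncompressed_size=fields[9],
        filename=filename,
        local_header_offset=fields[16],
    )
```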

2. SIMD CRC32 Optimization (simd_crc32.py)

Challenge: CRC32 validation was a major bottleneck

Solution:

  • Hardware-accelerated CRC32 with crc32c library
  • Fallback to zlib.crc32 for compatibility
  • Achieved 8-9x speed improvement

⚑ Key Code:

```python
import zlib

def fast_crc32(data: bytes, initial: int = 0) -> int:
    try:
        import crc32c
        # Hardware acceleration (SSE4.2). Note: crc32c uses the Castagnoli
        # polynomial (CRC-32C), which differs from the CRC-32 stored in ZIP
        # headers, so ZIP validation must rely on the zlib variant.
        return crc32c.crc32c(data, initial)
    except ImportError:
        return zlib.crc32(data, initial) & 0xffffffff  # pure-software fallback
```
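
Both branches accept a running value, so the same helper works incrementally over a stream. A short usage sketch (the input path is hypothetical):

```python
crc = 0
with open("large_file.bin", "rb") as f:   # hypothetical input file
    while chunk := f.read(1024 * 1024):   # 1MB chunks
        crc = fast_crc32(chunk, crc)      # carry the running checksum forward
```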

3. Hybrid Fast Path Strategy (hybrid_decompressor.py)

Challenge: Different file sizes require different optimization strategies

Solution:

  • Small files (<10MB): Direct zlib decompression
  • Large files (β‰₯10MB): Buffer pools and optimized streams
  • Automatic strategy selection based on file size

πŸš€ Key Code:

```python
def decompress_data(self, compressed_data: bytes, filename: str = "unknown") -> bytes:
    decision_size = len(compressed_data)       # size drives the strategy choice
    if decision_size < self.threshold_bytes:   # small: direct zlib call
        return self._fast_path_decompress(compressed_data, filename)
    else:                                      # large: buffer pools + streaming
        return self._optimized_path_decompress(compressed_data, filename)
```
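
For context, a minimal sketch of what the fast path might look like. ZIP members store raw deflate streams with no zlib header, which is why wbits is -15; the method name comes from the excerpt above, but the body is an assumption.

```python
import zlib

def _fast_path_decompress(self, compressed_data: bytes, filename: str) -> bytes:
    # One-shot decompression: no streaming or buffer-pool overhead,
    # which is exactly what small (<10MB) members want.
    return zlib.decompress(compressed_data, -15)  # -15 = raw deflate, as in ZIP
```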

4. Buffer Pool System (buffer_pool.py)

Challenge: Memory fragmentation and repeated allocations

Solution:

  • Pre-allocated buffer pools (64KB to 8MB)
  • Thread-safe buffer reuse
  • Memory pressure management
  • Achieved 100% hit rate

πŸ’Ύ Key Code:

```python
import threading

class BufferPool:
    def __init__(self, max_buffers_per_size: int = 10):
        self.max_buffers_per_size = max_buffers_per_size
        self.standard_sizes = [
            64 * 1024,         # 64KB - small files
            256 * 1024,        # 256KB - medium files
            1024 * 1024,       # 1MB - large files
            4 * 1024 * 1024,   # 4MB - very large files
            8 * 1024 * 1024,   # 8MB - huge files
        ]
        self._pools = {size: [] for size in self.standard_sizes}  # free lists
        self._lock = threading.Lock()                             # thread safety
```
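
A hedged sketch of how acquire/release might sit on top of those pools; the method names are illustrative, not necessarily the project's API.

```python
def acquire(self, size: int) -> bytearray:
    # Round up to the smallest standard size that fits the request.
    pool_size = next((s for s in self.standard_sizes if s >= size),
                     self.standard_sizes[-1])
    with self._lock:
        pool = self._pools[pool_size]
        if pool:
            return pool.pop()      # hit: reuse an existing buffer
    return bytearray(pool_size)    # miss: allocate a fresh one

def release(self, buf: bytearray) -> None:
    with self._lock:
        pool = self._pools.get(len(buf))
        if pool is not None and len(pool) < self.max_buffers_per_size:
            pool.append(buf)       # return to the pool for the next caller
```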

5. AI-Assisted Optimization (ai_optimizer.py)

Challenge: How to automatically select optimal parameters for each file

Solution:

  • Pattern recognition for 5 file types
  • Adaptive compression levels (1-9)
  • Dynamic chunk sizing (64KB-4MB)
  • Performance prediction

πŸ€– Key Code:

```python
def get_intelligent_strategy(self, file_path: str, file_size: int) -> Dict[str, Any]:
    # memory_pressure and recent_perf come from the engine's resource
    # monitoring and performance history (gathered elsewhere, not shown).
    file_profile = self._analyze_file_characteristics(file_path, file_size)
    strategy = self._ai_decision_engine(file_profile, memory_pressure, recent_perf)
    return strategy
```
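
A hypothetical call, to show the shape of the result:

```python
# Hypothetical usage; the returned dict carries the knobs described below.
strategy = optimizer.get_intelligent_strategy("logs/app.log", 48 * 1024 * 1024)
# e.g. {'compression_level': 9, 'method': 'deflate', 'chunk_size': 1048576}
```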

πŸ“Š Performance Results: From 2.8 to 602.6 MB/s

Current Benchmark Results

```
Baseline (Python zipfile):   2.8 MB/s
Our ZIP Handler:             602.6 MB/s (extraction)
Improvement:                 +21,421%
Compression Speed:           333.0 MB/s (peak)
Extraction Speed:            602.6 MB/s (peak)
```

πŸ“ˆ Performance Comparison Chart

```
Speed (MB/s)    Baseline    Our Handler
    700 ─                                    ╭─ 602.6
    600 ─                                ╭───╯
    500 ─                            ╭───╯
    400 ─                        ╭───╯
    300 ─                    ╭───╯ 333.0
    200 ─                ╭───╯
    100 ─            ╭───╯
      0 ┼────────────╯
         Extraction   Compression
```

πŸ† Success Metrics

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric           β”‚ Baseline    β”‚ Ours        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Extraction Speed β”‚ 2.8 MB/s    β”‚ 602.6 MB/s  β”‚
β”‚ Compression      β”‚ 1.5 MB/s    β”‚ 333.0 MB/s  β”‚
β”‚ Memory Usage     β”‚ 500 MB      β”‚ 24.5 MB     β”‚
β”‚ Test Success     β”‚ 85%         β”‚ 100%        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

Strategy Performance

  • Parallel Extraction: 459.6 MB/s (average) - 602.6 MB/s (peak)
  • Modular Compression: 217.1 MB/s (average) - 333.0 MB/s (peak)
  • AI Pattern Detection: 64 successful detections
  • Memory Efficiency: Average 24.5 MB usage

Test Coverage

  • 112 tests: 100% pass rate
  • 1MB-1GB file range: Full support
  • Cross-platform: Windows/Linux compatibility
  • Production ready: Thread-safe and robust

System Information

  • CPU: 12 cores (ideal for high parallel performance)
  • RAM: 15.93 GB total, 6.16 GB available
  • Disk: 464.98 GB total, 181.54 GB free
  • Platform: Windows 10

Note: These parallel extraction speeds owe much to the 12-core processor and ample RAM; expect lower numbers on more modest hardware.


πŸ€– AI Integration: Beyond Traditional Optimization

Pattern Recognition System

```python
file_type_patterns = {
    'text':       {'compression_level': 9, 'method': 'deflate', 'chunk_size': 1024*1024},
    'binary':     {'compression_level': 6, 'method': 'deflate', 'chunk_size': 2*1024*1024},
    'image':      {'compression_level': 3, 'method': 'store',   'chunk_size': 4*1024*1024},
    'archive':    {'compression_level': 1, 'method': 'store',   'chunk_size': 8*1024*1024},
    'executable': {'compression_level': 7, 'method': 'deflate', 'chunk_size': 512*1024}
}
```

Adaptive Strategy Selection

  • File size analysis: Automatic categorization
  • Content type detection: Entropy-based analysis (sketched after this list)
  • System resource monitoring: Memory and CPU pressure
  • Performance history: Learning from previous operations
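
The entropy-based detection can be illustrated with a short sketch: high byte entropy suggests already-compressed data (store it), low entropy suggests redundant text (compress hard). The thresholds below are illustrative assumptions, not the project's tuned values.

```python
import math
from collections import Counter

def shannon_entropy(sample: bytes) -> float:
    # Bits per byte, 0.0 (constant) to 8.0 (uniformly random).
    if not sample:
        return 0.0
    n = len(sample)
    return -sum(c / n * math.log2(c / n) for c in Counter(sample).values())

def classify_content(sample: bytes) -> str:
    h = shannon_entropy(sample)
    if h > 7.5:
        return "archive"   # near-random: already compressed, just store
    if h < 5.0:
        return "text"      # highly redundant: compress hard
    return "binary"
```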

πŸš€ Advanced Features: Parallel Processing and Future Plans

Current Features

  • Parallel Extraction: 602.6 MB/s peak performance (12-core system)
  • Thread-safe extraction: Multiple files simultaneously
  • Buffer pool integration: Thread-safe memory management
  • AI Pattern Recognition: 64 successful detections
  • Memory Pool Optimization: Average 24.5 MB usage
  • Multi-core Optimization: Maximum performance on 12-core systems

Future Plans

  • ZIP64 Support: In development (for 4GB+ files)
  • Stress Tests: Extreme large files (5GB-10GB) tests planned
  • Cloud Integration: Remote file processing support
  • Enterprise Features: Advanced security and compliance

πŸ› οΈ Development Challenges and Solutions

Challenge 1: Copilot File Size Limits and AI Crashes

Problem: The 4220-line zip_handler.py exceeded Copilot's context limits, and the assistant started crashing constantly

Personal Experience: "I was fed up with Copilot. Lines kept increasing and AI kept crashing. After my long planning was done, I said 'this will work' and switched to Cursor. Problem solved."

Solution:

  • Modular architecture with <400 line components
  • Extracted optimizations to separate modules
  • Improved tool compatibility while maintaining functionality
  • Cursor transition: Started using Cursor when Copilot limits were exceeded

Challenge 2: Thread Safety

Problem: Parallel processing caused race conditions

Solution:

  • Global locks for folder creation
  • Thread-safe buffer pools
  • Thread-isolated file handles
  • Proper exception handling

Challenge 3: Memory Management

Problem: Large files caused memory overflow

Solution:

  • Buffer pooling system
  • Streaming decompression (sketched after this list)
  • Memory-mapped file support
  • Adaptive chunk sizing
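
A minimal sketch of the streaming idea, assuming raw deflate input and file-like source and destination objects; memory stays bounded by the chunk size regardless of member size.

```python
import zlib

def stream_decompress(src, dst, chunk_size: int = 1024 * 1024) -> None:
    d = zlib.decompressobj(-15)            # raw deflate, as stored in ZIP
    while chunk := src.read(chunk_size):   # chunk_size would come from adaptive sizing
        dst.write(d.decompress(chunk))
    dst.write(d.flush())                   # drain any buffered tail
```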

πŸ“ˆ Lessons Learned: The Reality of AI-Assisted Development

What Works Well

  • AI for architecture: ChatGPT helped design modular structure
  • Pattern recognition: AI was excellent at defining optimization patterns
  • Code generation: Copilot was great for repetitive boilerplate
  • Testing: AI helped create comprehensive test suites

Code Example - AI Pattern Recognition:

```python
# AI excelled at this type of pattern definition
file_type_patterns = {
    'text':   {'compression_level': 9, 'chunk_size': 1024*1024},
    'binary': {'compression_level': 6, 'chunk_size': 2*1024*1024},
    'image':  {'compression_level': 3, 'chunk_size': 4*1024*1024}
}
```

What's Difficult

  • Large file processing: AI struggled with complex memory management
  • Performance optimization: Required manual fine-tuning
  • Thread safety: Required careful manual review
  • Integration complexity: AI couldn't handle complete system integration

Code Example - Manual Thread Safety:

```python
import os
import threading

# AI couldn't handle this thread-safety logic; it was written and reviewed by hand.
class ThreadSafeExtractor:
    def __init__(self):
        self._folder_locks = {}               # one lock per output folder
        self._global_lock = threading.Lock()  # guards the lock registry itself

    def extract_file(self, zip_path: str, output_dir: str):
        folder_path = os.path.dirname(output_dir)
        with self._global_lock:
            if folder_path not in self._folder_locks:
                self._folder_locks[folder_path] = threading.Lock()
        with self._folder_locks[folder_path]:
            os.makedirs(folder_path, exist_ok=True)
```

Key Insights

  • AI is a tool, not a replacement: Manual intervention was often necessary
  • Modular design is critical: Keeps files manageable for AI tools
  • Testing is essential: Comprehensive validation of AI-generated code required
  • Performance requires iteration: Multiple optimization cycles necessary

Code Example - Modular Design:

```python
# Before: 4220 lines - AI crashed
class ZIPHandler:
    def __init__(self):
        # 4000+ lines of code
        pass

# After: modular - AI works perfectly
# zip_handler.py (200 lines)
# zip_structs.py (150 lines)
# simd_crc32.py (100 lines)
# hybrid_decompressor.py (300 lines)
```

🎯 Future Roadmap: What's Next

Short Term (1-2 weeks)

  • ZIP64 support: Full support for 4GB+ files (in development)
  • Stress tests: Benchmark for 5GB-10GB extreme large files
  • GUI integration: User-friendly interface

Medium Term (1 month)

  • Additional formats: 7z, RAR support
  • Cloud integration: Remote file processing
  • Enterprise features: Advanced security and compliance

Long Term (3 months)

  • Community release: Make project open source
  • Plugin system: Extensible architecture
  • Performance optimization: 700+ MB/s target (above current 602.6 MB/s)

πŸ’‘ Insights: Building Production Software with AI

Technical Insights

  1. Custom implementations, when optimized for specific use cases, can outperform standard libraries
  2. Modular architecture is essential for AI-assisted development
  3. Performance optimization requires multiple iterations and careful measurement
  4. Thread safety and error handling are critical for production systems

AI Development Insights

  1. AI excels at pattern recognition and code generation but struggles with complex system integration
  2. Manual intervention is often necessary for performance-critical code
  3. Testing is more important than ever when using AI-generated code
  4. Documentation and clear architecture help AI tools work more effectively

Business Insights

  1. Custom solutions can provide competitive advantage in performance-critical applications
  2. AI-assisted development can accelerate development but requires expert supervision
  3. Performance optimization can be a significant differentiator in software products
  4. Modular, maintainable code is essential for long-term success

🎯 Conclusion: Journey Summary

πŸ† Achievements: Building our own ZIP handler from scratch was a challenging but rewarding journey.

πŸ“Š Results We Achieved

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric                  β”‚ Target          β”‚ Achieved        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Extraction Performance  β”‚ 150+ MB/s       β”‚ 602.6 MB/s      β”‚
β”‚ Compression Performance β”‚ 100+ MB/s       β”‚ 333.0 MB/s      β”‚
β”‚ Test Success            β”‚ 85%+            β”‚ 100%            β”‚
β”‚ AI Pattern Detection    β”‚ 50+             β”‚ 64              β”‚
β”‚ Memory Efficiency       β”‚ <100 MB         β”‚ 24.5 MB         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

πŸ”‘ Key Lessons

  1. AI-assisted development can create powerful custom solutions that outperform standard libraries
  2. Careful architecture, comprehensive testing, and expert supervision are essential
  3. Modular design is critical for AI tools
  4. Performance optimization requires multiple iterations

πŸš€ Future Vision

This project shows that with the right approach, AI tools can help developers build sophisticated, high-performance software that would be difficult to create manually.

πŸ“ Note: ZIP64 support is in development and test results for 4GB+ files will be shared when completed. Additionally, stress tests for 5GB-10GB extreme large files are planned.

πŸ’» System Requirements: These performance results were achieved on a powerful 12-core system; the parallel extraction speeds specifically benefit from multi-core hardware.


πŸ“¦ Project: Pagonic ZIP Engine

πŸ‘€ Developer: SetraTheXX

πŸš€ Performance: 602.6 MB/s extraction speed (peak, 12-core system)

πŸ€– AI Integration: Pattern recognition and adaptive optimization

πŸ’» Test System: 12 cores, 16GB RAM, Windows 10


πŸ’¬ Personal Experiences: Questions and Answers

The most valuable lessons and personal experiences I learned throughout this journey:

🎯 Biggest Challenge: AI Tool Limitations

Question: "What was the biggest challenge you faced in this project?"

Answer: The biggest challenge was AI tools starting to crash as file size increased. When zip_handler.py reached 4000+ lines, Copilot completely crashed. Every change would freeze the IDE and AI would just give up.

Code Example - The Problem:

```python
# This file grew to 4220 lines - Copilot couldn't handle it
class ZIPHandler:
    def __init__(self):
        # 4000+ lines of code
        # Copilot: "I give up, this is too complex"
        pass

# Solution: split into modules of <400 lines each
# zip_handler.py (200 lines)
# zip_structs.py (150 lines)
# simd_crc32.py (100 lines)
# hybrid_decompressor.py (300 lines)
```

Personal Experience: "I was fed up with Copilot. Lines kept increasing and AI kept crashing. After my long planning was done, I said 'this will work' and switched to Cursor. Problem solved."

This experience taught me the practical limits of AI tools and showed the importance of modular architecture.

🧠 Technical Learning: From Naive to Systematic Development

Question: "What was your biggest technical learning from this project?"

Answer: The biggest learning was how to develop software systematically even with AI assistance. I started with a naive approach - just asking AI to build features - but quickly learned that real progress requires a structured methodology.

Development Evolution:

  1. Phase 1: Template-First Development - Learned to create standardized module templates (50% speedup)
  2. Phase 2: Copy-Paste Engineering - Learned to systematically identify and reuse proven code blocks
  3. Phase 3: Manual-AI Hybrid Approach - Learned to manually implement code with AI guidance when tools hit limits
  4. Phase 4: Modular Architecture - Realized keeping files under 300 lines is critical for AI tool compatibility

This approach became so systematic that I documented it in detailed planning files like 02_SIKISTIRMA_MOTORU.md.

πŸ€– AI Integration: Surprises and Realities

Question: "What surprised you most about AI in the development process?"

Answer: What surprised me was AI being excellent at pattern recognition and code generation but struggling with complex system integration. AI was great at defining optimization patterns but required manual intervention for complex memory management and thread safety.

What Works Well:

  • AI for architectural design
  • Pattern recognition and optimization strategies
  • Boilerplate code generation
  • Test suite creation

What's Difficult:

  • Complex memory management
  • Performance-critical optimizations
  • Thread safety
  • Complete system integration

πŸ“Š Performance Insights: Biggest Surprise

Question: "Which performance optimization surprised you most and why?"

Answer: The impact of the buffer pooling system surprised me most. It started as a simple memory management optimization but achieved dramatic performance improvement with 100% hit rate.

Key Insight: Sometimes the simplest optimizations create the biggest impact. Buffer pooling improved performance through smart memory management rather than complex algorithms.

πŸš€ Future Plans: Next Big Challenge

Question: "What big challenge are you planning to tackle next?"

Answer: ZIP64 support and stress tests for extreme large files (5GB-10GB). ZIP64 is currently in development and I'll share test results for 4GB+ files when completed.

Future Goals:

  • Complete ZIP64 support (4GB+ files)
  • 5GB-10GB extreme large file stress tests
  • 700+ MB/s performance target (above current 602.6 MB/s)
  • Cloud integration and enterprise features

πŸ˜… Funny/Frustrating Moments: Educational Experiences

Question: "Did you have any funny or frustrating moments during development?"

Answer: Yes! Copilot constantly crashing while I kept telling myself 'this time it will definitely work' and trying again was funny in hindsight. Every change would freeze the IDE in 4000+ line files, but I still hoped 'maybe this time.'

Educational Moment: When I finally decided to switch to Cursor, I restructured the entire project into modular components and the problem was solved. This taught me the lesson "accept tool limitations and adapt."

Personal Lesson: Sometimes the best solution isn't fighting with the current tool, but finding the right tool or changing approach.


πŸš€ Next Steps

πŸ’‘ You Try Too!

If you're inspired by this project, you can start your own AI-assisted development journey:

  1. Start with a small project - Simple optimizations instead of complex systems
  2. Use modular design - Manageable file sizes for AI tools
  3. Write comprehensive tests - Validate AI-generated code correctness
  4. Measure performance - Track progress with concrete metrics

πŸ’¬ Interaction

Would you like to share your AI-assisted development experiences too?

  • What challenges did you face?
  • How did you solve them?
  • Which AI tools did you use?

I'd love to compare notes! πŸš€


πŸ‘¨β€πŸ’» Developer Information

Developer: SetraTheXX

Project: Pagonic ZIP Engine

GitHub: SetraTheXX (coming soon)

Contact: Available through GitHub

Specialization: AI-assisted development, performance optimization, custom ZIP implementations

πŸ› οΈ Technical Stack

  • Language: Python 3.x
  • AI Tools: GitHub Copilot, Cursor, ChatGPT
  • Performance: 602.6 MB/s extraction speed (peak)
  • Architecture: Modular, thread-safe, production-ready
  • Testing: 112 tests, 100% pass rate

🎯 Current Focus

  • ZIP64 support development
  • Extreme large file testing (5GB-10GB)
  • Performance optimization to 700+ MB/s
  • Open source release preparation

πŸ“ˆ Achievements

  • Built custom ZIP handler from scratch
  • Achieved 21,421% performance improvement over baseline
  • Implemented AI-assisted pattern recognition
  • Created modular, maintainable architecture

This project demonstrates the power of AI-assisted development when combined with systematic methodology and expert supervision.
