Ahirton Lopes

Posted on May 1

🏈 TensorCraft Playbook: From Classroom CNNs to Cloud TPUs with Keras

#ai

📖 Chapter 1: The Offensive Formation (CNN Architecture)

Every winning strategy requires solid fundamentals. In Deep Learning, this means designing a well-balanced Convolutional Neural Network (CNN) architecture. Our starting point is a classic structure, designed for hierarchical extraction of spatial patterns from 32x32 pixel images.

Tactical Architecture Analysis

The baseline model follows a sequence of operations that mirrors the hierarchy of the visual cortex:

Convolution (Feature Extraction): Conv2D layers apply learned filters. Early layers respond to low-level patterns (edges, textures), while deeper layers combine them into high-level features (parts of objects such as cars or birds).

Downsampling (Spatial Reduction): MaxPooling2D layers reduce spatial dimensions while keeping the strongest activation in each window, which makes the model more robust to small translations.

Regularization: Dropout helps reduce overfitting by randomly zeroing a fraction of activations during training, which discourages the network from relying on specific neurons.

Decision Zone (Classification): Fully connected layers combine all extracted features to deliver the “touchdown”: the final classification through a softmax activation function across 10 classes.

# Strategic architecture - Reference: Demo_CNN_CIFAR10.ipynb
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def build_attack_formation():
    model = Sequential()
    # First block: 32 filters to capture basic textures
    model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu', input_shape=(32, 32, 3)))
    model.add(MaxPooling2D(pool_size=2))

    # Second block: increasing the depth to 64 filters
    model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=2))

    # Third block: 128 filters for complex abstractions
    model.add(Conv2D(filters=128, kernel_size=2, padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=2))

    # Transition and regularization layer
    model.add(Dropout(0.3))
    model.add(Flatten())

    # Fully connected: the "Red Zone" before the touchdown
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(10, activation='softmax'))  # The final touchdown

    return model
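
To sanity-check the formation before training, it can help to trace the tensor shapes by hand. The sketch below is plain Python arithmetic, not Keras itself; the helper names `conv_same_params` and `trace_formation` are hypothetical. It assumes the layer settings shown above: `padding='same'` keeps height and width, and each `MaxPooling2D(pool_size=2)` halves them.

```python
def conv_same_params(kernel, in_ch, out_ch):
    # Weights (kernel x kernel x in_ch per filter) plus one bias per filter.
    return (kernel * kernel * in_ch + 1) * out_ch


def trace_formation(size=32, channels=3, filters=(32, 64, 128),
                    kernel=2, pool=2, hidden=100, classes=10):
    """Return (flattened_length, trainable_params) for the layer stack above."""
    params = 0
    for out_ch in filters:
        params += conv_same_params(kernel, channels, out_ch)
        channels = out_ch   # padding='same' leaves height and width unchanged
        size //= pool       # each pooling layer halves height and width
    flat = size * size * channels       # what Flatten hands to the Dense layers
    params += (flat + 1) * hidden       # Dense(100): weights + biases
    params += (hidden + 1) * classes    # Dense(10): weights + biases
    return flat, params


flat, total = trace_formation()
print(flat, total)  # 2048 247478
```

The spatial size goes 32 → 16 → 8 → 4, so Flatten produces 4 × 4 × 128 = 2048 values, and most of the parameters sit in the first Dense layer rather than in the convolutions.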

📖 Chapter 2: Knowing the Opponent (CPU vs. GPU vs. TPU)

Before stepping onto the field, a strategist must understand the strengths and weaknesses of each processing unit. During CIFAR-10 training, switching across devices reveals the classic compute-bound vs. memory-bound trade-off.

1. CPU: The Master Builder (Low Latency)

The Central Processing Unit (CPU) is designed for complex logic and sequential execution. It has fewer cores, but they are highly optimized, fast, and versatile.

In the Playbook: The CPU is ideal for data preprocessing and loading (ETL). However, for large-scale matrix multiplications it becomes a bottleneck: even with multiple cores and vector units, it offers far less parallelism than an accelerator.

2. GPU: The Workforce Army (Massive Parallelism)

The Graphics Processing Unit (GPU) has been adapted for AI workloads due to its thousands of cores operating in parallel.

In the Playbook: GPUs excel at CNN workloads because each filter can be processed as an independent parallel task. The main challenge here is the Memory Wall: moving data between VRAM and compute cores can become slower than the computation itself.
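
One common way to reason about the Memory Wall is a roofline-style estimate: attainable performance is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below is illustrative only; the peak and bandwidth figures are hypothetical and do not describe any specific GPU.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline estimate: performance is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bandwidth * intensity)


# Hypothetical accelerator: 10 TFLOP/s peak compute, 500 GB/s memory bandwidth.
PEAK = 10e12
BANDWIDTH = 500e9

for intensity in (2, 20, 200):  # FLOPs performed per byte moved
    perf = attainable_flops(PEAK, BANDWIDTH, intensity)
    bound = "memory-bound" if perf < PEAK else "compute-bound"
    print(f"{intensity:>3} FLOP/byte -> {perf / 1e12:.1f} TFLOP/s ({bound})")
```

Under these assumed numbers, a kernel that does only 2 FLOPs per byte can use a tenth of the peak, no matter how many cores are available.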

3. TPU: The Linear Algebra Specialist (Systolic Array)

The Tensor Processing Unit (TPU) is a purpose-built AI accelerator (ASIC). Unlike GPUs, it leverages a systolic array architecture.

In the Playbook: Think of data flowing like waves through a grid of multipliers, with partial results passed directly from one cell to the next instead of returning to main memory at every step. This cuts memory traffic during matrix multiplication, and throughput tends to improve with larger batch sizes until the array is fully utilized.
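
The wave-like data flow can be simulated in a few lines. This is a minimal sketch of an output-stationary systolic schedule in plain Python, written for illustration; it is not how a TPU is programmed, and the function name `systolic_matmul` is hypothetical. Each cell (i, j) of the grid holds one output value, and operands reach it with a delay of i + j cycles.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) following an output-stationary schedule."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    steps = n + m + k - 2  # cycles until the last operand pair reaches the far corner
    for t in range(steps):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # operands reach cell (i, j) skewed by i + j cycles
                if 0 <= s < k:
                    c[i][j] += a[i][s] * b[s][j]  # multiply-accumulate inside the cell
    return c, steps


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)  # [[19, 22], [43, 50]] 4
```

The accumulators never leave their cells during the computation, which is the property that reduces trips to main memory.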

📖 Chapter 3: The Scaling Strategy (TPUStrategy & XLA)

If Chapter 1 was about building the team and Chapter 2 about understanding the field, Chapter 3 is about tactical communication. Integrating TPUs into a Keras workflow requires only minimal code changes, but it triggers deep transformations in how execution happens under the hood.

1. TPUStrategy: The Team Captain

In a Cloud TPU setup (such as a v3-8), you are not working with a single accelerator, but with 8 processing cores operating in unison. The tf.distribute.TPUStrategy API enables synchronous distributed training through model replication.

How it works: The model is replicated across all 8 cores. Each global batch is split into 8 shards, so each core receives a different slice of the input data (the global batch size divided by 8), computes gradients independently, and then all cores synchronize (all-reduce) these gradients before updating the model weights. This orchestration keeps all “players” aligned: every replica holds identical weights after each step.
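
The replicate, shard, and synchronize cycle can be sketched without any accelerator. The example below simulates one synchronous data-parallel step sequentially in plain Python for a one-parameter model; `grad_mse` and `sync_step` are hypothetical names, and the averaging line stands in for the all-reduce that the real strategy performs across cores. It assumes the global batch divides evenly across replicas.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-parameter model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def sync_step(w, global_batch, num_replicas=8, lr=1e-4):
    """One synchronous data-parallel update, simulated sequentially."""
    per_replica = len(global_batch) // num_replicas
    shards = [global_batch[r * per_replica:(r + 1) * per_replica]
              for r in range(num_replicas)]
    local_grads = [grad_mse(w, shard) for shard in shards]  # computed independently
    avg_grad = sum(local_grads) / num_replicas              # stand-in for all-reduce
    return w - lr * avg_grad, avg_grad


data = [(float(x), 3.0 * x) for x in range(1, 65)]  # global batch of 64 examples
new_w, avg_grad = sync_step(w=0.0, global_batch=data)
print(avg_grad == grad_mse(0.0, data))  # True: same gradient as one big batch
```

With equal-sized shards, averaging the per-replica gradients gives the same update as a single device processing the whole global batch. In TensorFlow, the global batch size is commonly set to a per-replica batch size multiplied by `strategy.num_replicas_in_sync`.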

2. XLA: Runtime Optimization

XLA (Accelerated Linear Algebra) is the domain-specific compiler responsible for optimizing TensorFlow operations.

Operation Fusion: Instead of executing a Conv2D operation followed by a ReLU activation as separate steps, each requiring its own round trip to memory, XLA can fuse them into a single kernel. This reduces memory bandwidth usage and can significantly improve performance.
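
The idea behind fusion can be shown conceptually in plain Python. This is not XLA and it does not compile anything; an element-wise scale stands in for the convolution, and both function names are hypothetical. The point is only the difference between two passes with an intermediate buffer and one combined pass.

```python
def scale_then_relu_unfused(xs, weight):
    scaled = [weight * x for x in xs]     # pass 1: writes a full intermediate buffer
    return [max(0.0, s) for s in scaled]  # pass 2: reads that buffer back


def scale_then_relu_fused(xs, weight):
    # One pass: each value is scaled and clipped before moving to the next one.
    return [max(0.0, weight * x) for x in xs]


values = [-2.0, -0.5, 0.0, 1.5, 3.0]
print(scale_then_relu_fused(values, 2.0))  # [0.0, 0.0, 0.0, 3.0, 6.0]
```

Both versions return the same result; the fused one simply avoids materializing `scaled`. In TensorFlow, XLA compilation of a function can be requested with `tf.function(jit_compile=True)`.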

3. Code Evolution: From Classroom to Cluster

Here is how a simple educational model evolves into a high-performance distributed training setup:

import tensorflow as tf

# 1. Locate the compute resource (TPU)
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# 2. Define the model inside the strategy scope
with strategy.scope():
    # Your own CIFAR-10 architecture goes here
    model = build_attack_formation()

    # The optimizer and the loss are also distributed automatically
    model.compile(
        loss='categorical_crossentropy',  # expects one-hot encoded labels
        optimizer='rmsprop',
        metrics=['accuracy']
    )

📖 Chapter 4: Field Logistics (Data Pipeline & I/O)

In football, having the best quarterback means nothing if the ball does not reach their hands at the right time. In high-performance computing, the main adversary of TPUs is data starvation. If your input pipeline is slow, the TPU will sit idle waiting for the CPU to read and process data, wasting expensive computational resources.

Continuous Flow Strategy

To ensure the “engine” never stalls, the TensorCraft Playbook implements three key logistics tactics:

TFRecords and Protocol Buffers:

Instead of reading thousands of individual JPG files, the CIFAR-10 dataset is serialized into TFRecord binary files. This enables high-throughput sequential reads and significantly reduces filesystem overhead.

Parallel ETL with tf.data:

We leverage map operations with num_parallel_calls=tf.data.AUTOTUNE, delegating image preprocessing tasks such as resizing and normalization across multiple CPU cores simultaneously.

Software Pipelining (Prefetch):

The key component is .prefetch(). While the TPU is processing the current batch (Step N), the CPU is already preparing and asynchronously sending the next batch (Step N+1) to TPU memory.

# High-performance logistics
import tensorflow as tf

def prepare_pipeline(dataset, batch_size):
    ds = dataset.cache()  # In-memory cache for small datasets such as CIFAR-10
    ds = ds.shuffle(buffer_size=1000)
    ds = ds.batch(batch_size, drop_remainder=True)
    # The secret of the logistics: decouple production from consumption
    ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return ds
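
To see why decoupling production from consumption matters, here is a rough timing model in plain Python. It is an idealized sketch, not a measurement: `epoch_time` is a hypothetical helper, the per-batch costs are made-up numbers, and it assumes preparation and compute overlap perfectly once prefetching is on.

```python
def epoch_time(num_batches, prep_s, compute_s, prefetch=False):
    """Idealized wall-clock time for one epoch, in seconds."""
    if not prefetch:
        # The accelerator waits while each batch is prepared, then computes.
        return num_batches * (prep_s + compute_s)
    # With overlap, only the first preparation and the last compute are exposed;
    # every step in between costs whichever side is slower.
    return prep_s + (num_batches - 1) * max(prep_s, compute_s) + compute_s


print(epoch_time(10, prep_s=2, compute_s=3))                 # 50
print(epoch_time(10, prep_s=2, compute_s=3, prefetch=True))  # 32
```

In this model the pipelined epoch is bounded by the slower of the two stages, so a slow input pipeline still starves the accelerator even with prefetching.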

📖 Chapter 5: The Real Scoreboard (Performance, Constraints, and Field Lessons)

The true validation of any playbook lies in the scoreboard. However, rather than presenting an idealized benchmark, this chapter reflects the real-world behavior of the system under practical execution conditions.

Experimental Methodology

All experiments were conducted using Google Colab Pro, which provides access to higher-performance infrastructure than the standard Colab environment.

To maintain experimental consistency, the following variables were strictly controlled:

Same notebook and execution pipeline
Same CNN architecture
Same CIFAR-10 dataset
Same preprocessing and tf.data pipeline
Same batch size and number of epochs

The only variable changed across experiments was the execution environment:

CPU runtime
GPU runtime
TPU runtime

This setup isolates infrastructure as the independent variable, enabling a direct comparison of performance across hardware configurations.

Scenario	Effective Hardware	Time (s)	Throughput (img/s)	Validation Accuracy
CPU Runtime	CPU	167.84	1,489	72.42%
GPU Runtime	GPU	34.66	7,213	73.36%
TPU Runtime	CPU (TPU host)	30.42	8,218	72.27%
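
The scoreboard can be cross-checked with a few lines of arithmetic. The sketch below uses only the times and throughputs from the table above; the variable names are illustrative. It derives the speedup of each run over the CPU baseline and the approximate number of images each run processed (time multiplied by throughput).

```python
runs = {
    "CPU Runtime": (167.84, 1489),  # (time in seconds, throughput in img/s)
    "GPU Runtime": (34.66, 7213),
    "TPU Runtime": (30.42, 8218),
}

baseline_s = runs["CPU Runtime"][0]
summary = {}
for name, (seconds, img_per_s) in runs.items():
    summary[name] = {
        "speedup_vs_cpu": round(baseline_s / seconds, 2),
        "images_processed": round(seconds * img_per_s),
    }
    print(name, summary[name])
```

All three runs come out at roughly 250,000 images processed, which is consistent with the same workload being timed in each environment, and the speedups come out at about 4.84× (GPU) and 5.52× (TPU runtime).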

Key Observations

1. GPU Acceleration Works as Expected

The GPU execution delivered a significant performance improvement:

~4.8× faster than CPU
Highest validation accuracy
Efficient parallelization of convolutional operations

This confirms the expected advantage of GPUs in CNN workloads.

2. TPU Runtime Did Not Execute on TPU

Despite selecting a TPU runtime, the system reported:

effective_hardware = CPU
verified_tpu = false
failure in TPUClusterResolver initialization

This indicates that the workload did not run on TPU hardware, but instead on the host CPU of the TPU environment.
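
A check of this kind can be reduced to a small, testable function. The sketch below is an assumption about how such a verification could look, not the notebook's actual code; `effective_hardware`, `verify_runtime`, and the `verified` key are hypothetical names. It takes a list of device-type strings so it can run anywhere.

```python
def effective_hardware(device_types):
    """Return the most capable accelerator present in a list of device-type names."""
    for kind in ("TPU", "GPU"):
        if kind in device_types:
            return kind
    return "CPU"


def verify_runtime(selected, device_types):
    """Compare the runtime that was selected with the devices actually visible."""
    actual = effective_hardware(device_types)
    return {
        "selected": selected,
        "effective_hardware": actual,
        "verified": selected == actual,
    }


# A TPU runtime where only the host CPU is visible, as in the experiment above.
print(verify_runtime("TPU", ["CPU"]))
```

With TensorFlow, the device list could be built from `[d.device_type for d in tf.config.list_logical_devices()]`; TPU devices generally only show up there after the TPU system has been initialized, so a failed resolver leaves only the CPU.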

3. Unexpected Result: TPU Runtime Outperformed GPU

Even without TPU acceleration, the TPU runtime delivered:

Highest throughput (8,218 img/s)
Fastest execution time (30.42s)

This performance exceeded the GPU run.

Technical Interpretation

This result highlights a non-obvious but critical aspect of modern ML systems:

Execution environment matters as much as hardware type.

When selecting a TPU runtime in Colab Pro, the execution environment changes in two ways:

Access to a TPU accelerator (not successfully initialized in this case)
Allocation of a different host machine, typically a more powerful CPU infrastructure

As a result, the model ran on:

High-performance TPU host CPU

rather than:

Standard Colab CPU

This would explain the ~5.5× improvement over the baseline CPU run and the slight (~1.14×) advantage over the GPU run.

Why TPU Failed to Initialize

The inability to use TPU was not related to the model, but to environment constraints:

Missing or incompatible TensorFlow runtime
Dependency conflicts (ml_dtypes, JAX, TensorFlow)
Runtime inconsistencies in Colab TPU environments
Failure to resolve TPU via TPUClusterResolver

This reinforces the operational complexity of TPU-based workflows.

Engineering Lessons

This experiment surfaces several important insights:

Infrastructure is part of the model: performance depends on the execution environment, not only on architecture.
Hardware labels can be misleading: selecting TPU does not guarantee TPU execution.
TPUs require mature environments: they are not yet fully plug-and-play in notebook-based workflows.
Reproducibility is fragile in managed environments: even controlled experiments can yield unexpected behaviors.
Data pipeline efficiency remains critical: high-performance hardware still depends on efficient data delivery.

Conclusion

The scoreboard is no longer incomplete; it is revealing.

This experiment demonstrates that performance in deep learning is not solely determined by model design or nominal hardware selection. Instead, it emerges from the interaction between model, infrastructure, and execution environment.

Perhaps the most important takeaway is this:

In real-world machine learning systems, infrastructure decisions can outweigh hardware assumptions. In this case, a TPU runtime without TPU acceleration still outperformed a GPU execution, not because of the TPU itself, but because of the underlying system it provided.
