Special Thanks: I would like to express my gratitude to APMIC and the Twinkle AI community for their assistance, which made the completion of this article possible.
Original Chinese Post: https://medium.com/@simon3458/twinkleai-gemma-3-t1-4b-adk-agent-d3309665f448
As Large Language Model (LLM) technology enters a stage of maturity, the focus of developers has shifted from simple "conversation generation" to "AI Agents" capable of autonomous planning and execution. However, creating an Agent that understands Taiwan's local culture and can accurately execute complex tool calls presents two major challenges: first, general-purpose models often lack understanding of local regulations and context; second, the high cost of GPU computing power limits the widespread adoption of applications.
This article walks through building an AI Agent application service by combining the matrix-computation strengths of Google TPUs, Twinkle AI's gemma-3-4B-T1-it open-source model (optimized specifically for the Taiwanese context), and the Google ADK (Agent Development Kit).
We will start with the underlying architecture of TPUs to explain why they are accelerators for AI inference. Next, we will introduce how Twinkle AI solves "alignment drift" and strengthens Function Calling capabilities. Finally, we will conduct a hands-on walkthrough using Google Colab, stacking and integrating these technologies to build an AI Agent from scratch that is responsive, understands Taiwanese linguistic habits, and can actually query stock information.
I. Introduction to Google TPU
1. What is a TPU (Tensor Processing Unit)?
Google TPU is a "Domain-Specific Architecture" (DSA) integrated circuit tailored specifically for machine learning workloads. Unlike traditional processors, which must handle a wide variety of general-purpose tasks, the TPU's core design adopts a "Systolic Array" architecture. The design mimics a beating heart: data flows rhythmically between thousands of arithmetic units inside the chip. This lets the TPU sharply reduce memory accesses when performing matrix multiplication, the core operation of neural networks, thereby mitigating the "von Neumann bottleneck" of traditional computer architectures and achieving extremely high computational density and efficiency.
2. The Difference Between TPU and GPU
The fundamental difference between the two lies in the philosophical opposition of "Specialization" vs. "Generalization."
GPU (Graphics Processing Unit): essentially a general-purpose parallel processor, originally designed for graphics rendering. It retains a large amount of control logic and cache to handle complex instruction streams, which gives it high flexibility and a powerful CUDA software ecosystem well suited to highly variable research and applications.
TPU: sacrifices generality (it cannot efficiently handle non-matrix operations) and strips out hardware units irrelevant to AI, dedicating the freed-up silicon to matrix units. This gives TPUs higher computational efficiency on large-scale, static computation graphs in specific formats (such as bfloat16). The development barrier is higher than with GPUs, however, when dealing with dynamic control flow or custom operators, and optimization usually relies on the XLA compiler and the JAX framework.
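To make the "static computation graph" point concrete, here is a minimal JAX sketch of the kind of workload a TPU excels at: a large bfloat16 matrix multiply that XLA compiles once via jax.jit and then replays. The shapes are arbitrary illustration values.

```python
import jax
import jax.numpy as jnp

@jax.jit
def matmul(a, b):
    # A single large matrix multiply: the operation a systolic array is built for.
    return jnp.dot(a, b)

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
b = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)

out = matmul(a, b)            # first call: XLA compiles the graph for the TPU
print(out.shape, out.dtype)   # (4096, 4096) bfloat16
```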
II. Introduction to the Twinkle AI gemma-3-4B-T1-it Model
gemma-3-4B-T1-it is a 4B parameter model launched by Twinkle AI based on the Google Gemma 3 architecture. It aims to solve the "alignment drift" problem caused by uneven data in mainstream foundation models and to practice the concept of "Sovereign AI."
The model has been deeply optimized for the Taiwanese context, correcting:
- Vocabulary Misuse: e.g., using the Taiwanese term for "quality" (品質) rather than the mainland term (質量), which in Taiwan means physical "mass".
- Legal and Institutional Hallucinations: citing current laws of the Republic of China (Taiwan) rather than laws of the PRC.
- Cultural Meme Disconnects: understanding internet slang from communities such as PTT and Dcard.
Through Gemma 3's "Local-Global Hybrid Attention Mechanism" and a 128K-token context window, T1-4B-it achieves deep cultural alignment at a lightweight scale, positioning itself as a language model focused on Agent workflows and local needs.
Regarding dataset selection and ecosystem collaboration, T1-4B-it adopts a rigorous data strategy. Training data includes lianghsun/tw-reasoning-instruct (reasoning instructions designed for the Taiwan context), nvidia/Nemotron (instruction following), lianghsun/tw-contract-review-chat (contract review), and Chain-of-Thought (CoT) data prepared by Kerg (such as tw_mm_R1). We thank APMIC for providing the critical computing infrastructure that made this possible.
To strengthen Function Calling, T1-4B-it is trained with the Hermes tool-call format, giving it strong Agent capabilities. The model can handle four levels of calling complexity: single function, multiple functions, parallel functions, and parallel multiple functions. In the BFCL evaluation it achieved an overall accuracy of 84.5%, with accuracy on the multiple-function Abstract Syntax Tree (AST) category reaching as high as 89%. This shows that, at the 4B-parameter scale, it offers tool-use and automated-execution capabilities that surpass many 7B or 13B models.
Example: Using Google ADK to handle parallel function calls simultaneously.
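As a rough illustration of the "parallel functions" level, here is a minimal sketch of how such a request could look against an OpenAI-compatible endpoint serving the model (as set up in Section III). The tool name, local URL, and prompt are assumptions for illustration only, not the ADK example project referenced above.

```python
from openai import OpenAI

# Point the client at a vLLM server hosting the model (see Section III).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",  # hypothetical tool
        "description": "Query the latest price of a Taiwan-listed stock.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="twinkle-ai/gemma-3-4B-T1-it",
    messages=[{"role": "user",
               "content": "Compare the share prices of TSMC (2330) and MediaTek (2454)."}],
    tools=tools,
)

# A "parallel functions" answer shows up as multiple entries in tool_calls,
# one per ticker, which the caller executes and feeds back to the model.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```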
For detailed information, please visit HuggingFace:
twinkle-ai/gemma-3-4B-T1-it
III. Hands-on: Launching an AI Agent Service on Google Colab via vLLM and Google ADK
If you are not familiar with Google ADK tools, you can read this article first: https://medium.com/@simon3458/google-adk-tools-intro-202504-3181fd6ab567

(Thanks to Twinkle AI community friend Thomas for assisting with the graphics!)
The main goal of this project is to deploy a Twinkle AI Gemma 3 T1 4B model in a Google Colab TPU v5e-1 environment and transform it into an AI Agent capable of executing specific tasks.
1. Environment Preparation and Dependency Installation
Hardware Setup: Confirm the current execution environment is Google TPU v5e-1, hardware designed specifically to accelerate machine learning workloads.
Core Packages: Install the vLLM inference engine with TPU acceleration support; this is key to fast model serving. Also install the OpenAI SDK and LiteLLM for the subsequent API connection and forwarding (a minimal install sketch follows).
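A minimal install sketch, assuming standard PyPI packages; the exact packaging of TPU-enabled vLLM can differ between releases, so treat the package names as assumptions.

```python
import subprocess
import sys

# Install the inference engine and the API/bridging libraries into the Colab runtime.
for pkg in ("vllm", "openai", "litellm[proxy]"):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", pkg], check=True)
```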
2. Launching vLLM Inference Service
Load Model: Start the vLLM server via terminal commands and load the Twinkle AI Gemma-3-4B-T1-it model.
Enable Advanced Features: Configure startup parameters to enable the model's "Auto Tool Choice" and the "Hermes Tool Parser," giving the model the ability to understand and call external tools (see the launch sketch after this list).
Verify Service Status:
- Check if the model API is successfully online.
- Perform simple conversation tests to confirm the model responds normally.
- Critical Test: Test if the model can correctly parse "Function Calling" (e.g., asking about database structures to confirm the model returns the correct tool execution request).
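A minimal launch-and-verify sketch. The flags shown (--enable-auto-tool-choice, --tool-call-parser hermes) follow vLLM's tool-calling options, but the port, wait time, and any TPU-specific flags are assumptions; the "critical test" for tool calling follows the same pattern as the parallel-call sketch in Section II.

```python
import subprocess
import time
from openai import OpenAI

# Start the OpenAI-compatible vLLM server in the background with tool calling enabled.
server = subprocess.Popen([
    "vllm", "serve", "twinkle-ai/gemma-3-4B-T1-it",
    "--port", "8000",
    "--enable-auto-tool-choice",     # let the model decide when to call a tool
    "--tool-call-parser", "hermes",  # parse Hermes-format tool calls
])

time.sleep(300)  # crude wait for download and compilation; poll /v1/models in practice

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Simple conversation test: confirm the model is online and responds normally.
chat = client.chat.completions.create(
    model="twinkle-ai/gemma-3-4B-T1-it",
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)
print(chat.choices[0].message.content)
```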
3. Setting up the LiteLLM API Bridge
Configure Forwarding Rules: Create a configuration file that forwards standard OpenAI-format API requests to the backend vLLM service. This standardizes the model interface for compatibility with Google ADK (a minimal config sketch follows this step).
Start Proxy Service: Launch the LiteLLM proxy server in the background and monitor it until the service is fully ready (model list appears).
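A minimal sketch of the LiteLLM configuration and proxy launch, assuming the vLLM server from the previous step is listening on port 8000; the model alias and proxy port are arbitrary choices.

```python
import subprocess

# Minimal proxy config: route the "gemma-3-4b-t1-it" alias to the local vLLM server.
config = """\
model_list:
  - model_name: gemma-3-4b-t1-it
    litellm_params:
      model: openai/twinkle-ai/gemma-3-4B-T1-it
      api_base: http://localhost:8000/v1
      api_key: EMPTY
"""
with open("litellm_config.yaml", "w") as f:
    f.write(config)

# Start the proxy in the background; it is ready once /v1/models lists the alias.
proxy = subprocess.Popen(["litellm", "--config", "litellm_config.yaml", "--port", "4000"])
```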
4. Integrating Google ADK (Agent Development Kit)
Get Agent Example: Download a pre-written Stock Query Agent example project from GitHub.
Install Agent Dependencies: Install the Python packages required for the Agent project.
Set Environment Variables: Configure the keys and API addresses needed for the Agent connection, pointing them to the LiteLLM service we just set up.
Run and Test Agent: Launch the Google ADK command-line interface and converse with the Agent (e.g., ask for TSMC's stock price). The Agent will automatically recognize the need, call the stock-query tool, retrieve the data, and finally use the Gemma model to generate a natural-language response (a minimal agent sketch follows).
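A minimal sketch of how such an agent can be wired up with the google-adk Python package; the tool, model alias, and key below are hypothetical stand-ins for the actual example project on GitHub.

```python
import os
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

# Point the OpenAI-compatible client at the LiteLLM proxy
# (the "Set Environment Variables" step above).
os.environ["OPENAI_API_BASE"] = "http://localhost:4000"
os.environ["OPENAI_API_KEY"] = "sk-anything"

def get_stock_price(ticker: str) -> dict:
    """Hypothetical tool: return a quote for a Taiwan-listed ticker such as '2330'."""
    return {"ticker": ticker, "price": 1000.0, "currency": "TWD"}  # mock data

root_agent = Agent(
    name="stock_agent",
    model=LiteLlm(model="openai/gemma-3-4b-t1-it"),  # alias defined in the LiteLLM config
    instruction=(
        "You answer questions about Taiwanese stocks. Call get_stock_price when "
        "the user asks for a quote, then reply in natural language."
    ),
    tools=[get_stock_price],
)
```

Running the ADK CLI (e.g., adk run or adk web) in the project directory then lets you hold the conversation described above.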
5. (Optional) Establishing a Remote Development Tunnel
Set up ngrok: Use the ngrok tool to expose the API service running inside Colab to the public internet (a minimal sketch follows this step).
Local Connection: Developers can then build the ADK frontend or agent logic on their local machine while the heavy model inference runs on the TPU in Colab, achieving an efficient "Local Development, Cloud Inference" workflow.
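A minimal sketch using the pyngrok package, assuming the LiteLLM proxy from step 3 is on port 4000 and you have an ngrok auth token.

```python
from pyngrok import ngrok

ngrok.set_auth_token("YOUR_NGROK_TOKEN")   # placeholder: your personal ngrok token
tunnel = ngrok.connect(4000, "http")       # expose the LiteLLM proxy port
print("Public API base:", tunnel.public_url)
```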
This process demonstrates the complete integration from underlying model deployment and mid-layer API forwarding to upper-layer Agent application logic, utilizing Google Colab's TPU computing power to build intelligent AI applications.
IV. Conclusion
This hands-on exercise is not only a demonstration of a technology stack; it also verifies the enormous potential of combining "Specialized Hardware" with "Localized Small Models."
Through the specialized acceleration of Google TPU v5e, we showed that even a lightweight 4B-parameter model, when paired with high-quality localized instruction fine-tuning (such as Twinkle AI's gemma-3-4B-T1-it) and an appropriate inference stack (vLLM + Google ADK), can demonstrate logical reasoning and tool-use capabilities beyond its size class.
This solution offers three important insights for developers:
Compute is no longer a high wall: TPUs provide an efficient alternative to GPUs. Through platforms like Colab, developers can access powerful matrix computing resources with a lower barrier to entry.
Localization is crucial: The performance of the Twinkle AI model proves that models which solve "cultural disconnects" and "regulatory hallucinations" are better suited for actual business and life scenarios—an advantage general-purpose models struggle to replace.
Standardization of Agent Development: The introduction of Google ADK and standardized APIs (LiteLLM) evolves Agent development from "hand-crafting Prompts" to modular engineering practices, significantly improving development efficiency and stability.
With the open-sourcing of the Google Gemma 3 architecture and the ubiquity of TPU cloud resources, we are on the eve of a blossoming of AI applications. I hope this tutorial helps developers in various fields quickly build intelligent assistants that understand local languages and solve real problems, truly realizing the democratization and innovation of AI technology.
I am Simon
Hello everyone, I am Simon Liu (Liu Yu-wei), an AI Solutions Expert and currently a Google Developer Expert (AI Role). I look forward to helping enterprises implement Artificial Intelligence technologies to solve problems.
If this article was helpful to you, please give it a clap on Medium and follow my personal account so you can read my future articles at any time. You are welcome to leave comments on my LinkedIn to provide feedback and discuss AI-related topics with me. I look forward to being of help to everyone!
My Personal Website:
https://simonliuyuwei.my.canva.site/link-in-bio