Thanawat Wongchai

Posted on Jun 23 • Originally published at apidog.com

วิธีรัน Google Gemma 3 270M แบบโลคอล: AI ส่วนตัวที่รวดเร็วสำหรับนักพัฒนา

กำลังมองหาโมเดลภาษา AI ขนาดเล็กที่รันบนเครื่องได้จริงโดยไม่ต้องพึ่งคลาวด์อยู่ใช่ไหม? Google Gemma 3 270M เป็นโมเดลที่เล็กที่สุดในซีรีส์ Gemma มีพารามิเตอร์ 270 ล้านตัว เหมาะสำหรับนักพัฒนาที่ต้องการเพิ่มฟีเจอร์ AI แบบความหน่วงต่ำและรักษาข้อมูลไว้ในเครื่อง เช่น การสร้างข้อความ, Q&A, การสรุป, การจัดหมวดหมู่ข้อความ และเวิร์กโฟลว์ API ภายในองค์กร

ลองใช้ Apidog วันนี้

เคล็ดลับ: หากคุณต้องการต่อ Gemma 3 270M เข้ากับแอปผ่าน API ให้ใช้ Apidog เพื่อออกแบบ ทดสอบ จำลอง และจัดทำเอกสาร API สำหรับ endpoint ที่เชื่อมกับโมเดล AI ในเครื่อง ช่วยให้คุณตรวจสอบ request/response และทำให้ฟีเจอร์ AI พร้อมใช้งานได้เร็วขึ้น

ทำไมต้องใช้ Gemma 3 270M สำหรับงาน AI ในเครื่อง?

Gemma 3 270M เหมาะกับงานที่ต้องการ:

ความเป็นส่วนตัวบนอุปกรณ์: ข้อมูลไม่จำเป็นต้องออกจากเครื่อง
ความหน่วงต่ำ: เหมาะกับฟีเจอร์แบบโต้ตอบ เช่น chatbot หรือ assistant ภายใน
ใช้ทรัพยากรน้อย: รันได้บนแล็ปท็อป เดสก์ท็อป หรืออุปกรณ์ที่ทรัพยากรจำกัด
ต้นทุนคงที่: ลดการพึ่งพา API AI บนคลาวด์สำหรับงานทั่วไป

โมเดลรองรับหน้าต่างบริบทสูงสุด 32,000 โทเค็น และมีตัวเลือก Quantization เช่น Q4_0 QAT เพื่อช่วยลดการใช้หน่วยความจำ โดยโหมด INT4 สามารถใช้หน่วยความจำน้อยกว่า 200MB ซึ่งมีประโยชน์มากสำหรับ Edge, mobile และระบบที่ต้องการ deploy แบบ lightweight

สถาปัตยกรรม Gemma 3 270M: จุดที่ทำให้โมเดลเล็กแต่ใช้งานได้จริง

Gemma 3 270M สร้างบนสถาปัตยกรรมแบบ Transformer โดยมีองค์ประกอบหลักดังนี้:

พารามิเตอร์ 170 ล้านตัวสำหรับ embedding รองรับคลังคำศัพท์ 256,000 โทเค็น
พารามิเตอร์ 100 ล้านตัวสำหรับ Transformer blocks
รองรับหลายภาษา และสามารถปรับเข้ากับงานเฉพาะทางได้
INT4 quantization, Rotary Position Embeddings และ Group Query Attention เพื่อเพิ่มประสิทธิภาพด้านความเร็วและหน่วยความจำ

งานที่เหมาะกับโมเดลขนาดนี้ ได้แก่:

การทำตามคำสั่งสั้น ๆ
การดึงข้อมูลจากข้อความ
การสรุป
การตรวจสอบข้อความตาม policy หรือ compliance
การจัดหมวดหมู่ข้อความ
การสร้างคำตอบสำหรับระบบ Q&A ภายใน

ผลการทดสอบแสดงว่า Gemma 3 270M ทำคะแนน F1 ได้ดีใน IFEval จึงเหมาะกับงานที่ต้องการ instruction following แต่ยังต้องควบคุมการใช้หน่วยความจำและพลังงาน

ประโยชน์หลักของการรัน Gemma 3 270M ในเครื่อง

ข้อมูลอยู่ในเครื่อง: ลดความเสี่ยงจากการส่งข้อมูลไปยังบริการภายนอก
ตอบสนองเร็ว: เหมาะกับแอปที่ต้องการ interactive UX
ไม่มีค่าใช้จ่าย API ต่อ request: เหมาะกับงานที่มี request จำนวนมากหรือทำงานซ้ำ ๆ
ประหยัดพลังงาน: ใช้พลังงานเพียง 0.75% ของแบตเตอรี่ Pixel 9 Pro สำหรับการสนทนาแบบ INT4 จำนวน 25 ครั้ง
ปรับแต่งได้: ใช้ LoRA เพื่อ fine-tune กับชุดข้อมูลเฉพาะโดเมน
เหมาะกับทีมเล็ก: ทดลอง prototype และ iterate ได้โดยไม่ต้องตั้ง infrastructure ขนาดใหญ่

ข้อกำหนดของระบบ: ต้องใช้ฮาร์ดแวร์แบบไหน?

Gemma 3 270M เข้าถึงได้ง่ายสำหรับนักพัฒนาส่วนใหญ่:

รูปแบบการใช้งาน	ข้อกำหนดที่แนะนำ
CPU inference	RAM 4GB และ CPU สมัยใหม่ เช่น Intel Core i5
GPU acceleration	NVIDIA GPU ที่มี VRAM 2GB สำหรับโมเดล quantized
Apple Silicon	ใช้ MLX-LM เพื่อประสิทธิภาพสูง
Fine-tuning	RAM 8GB และ GPU VRAM 4GB สำหรับชุดข้อมูลขนาดเล็ก
OS	Windows, macOS หรือ Linux
Software	Python 3.10+
Storage	ประมาณ 1GB สำหรับไฟล์โมเดล

เลือกเครื่องมือสำหรับรัน Gemma 3 270M ในเครื่อง

คุณสามารถเลือกเครื่องมือตาม workflow ที่ต้องการ:

Hugging Face Transformers: เหมาะกับนักพัฒนาที่ต้องการเขียน Python และ integrate เข้ากับแอป
LM Studio: เหมาะกับการทดลองผ่าน GUI โดยไม่ต้องเขียนโค้ดมาก
llama.cpp: เหมาะกับงานที่ต้องการ performance และควบคุม runtime ระดับต่ำ
MLX: เหมาะกับเครื่อง Apple M-series

คำแนะนำแบบเร็ว:

ถ้าต้องการทดลองเร็ว: LM Studio
ถ้าต้องการฝังใน backend/API: Transformers
ถ้าต้องการ runtime เบาและเร็ว: llama.cpp
ถ้าใช้ Apple Silicon: MLX

วิธีรัน Gemma 3 270M ด้วย Hugging Face Transformers

1. ติดตั้งไลบรารี

pip install transformers torch

ถ้าต้องการใช้ Hugging Face Hub CLI:

pip install huggingface_hub

2. โหลด tokenizer และโมเดล

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "google/gemma-3-270m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"
)

3. สร้าง response จาก prompt

input_text = "Explain quantum computing in simple terms."

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

4. เปิดใช้ 4-bit Quantization เพื่อลดหน่วยความจำ

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)

5. Login Hugging Face หากโมเดลต้องใช้สิทธิ์เข้าถึง

from huggingface_hub import login

login(token="your_hf_token")

รับ token ได้จากบัญชี Hugging Face ของคุณ แล้วตั้งค่าเป็น environment variable แทนการ hardcode ใน production:

export HF_TOKEN="your_hf_token"

สร้าง Local API สำหรับ Gemma 3 270M ด้วย FastAPI

หากต้องการเรียกโมเดลจาก frontend หรือ service อื่น ให้ห่อโมเดลเป็น HTTP API

1. ติดตั้ง FastAPI

pip install fastapi uvicorn transformers torch

2. สร้างไฟล์ `main.py`

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM

app = FastAPI()

model_name = "google/gemma-3-270m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 200

@app.post("/generate")
def generate_text(payload: GenerateRequest):
    inputs = tokenizer(payload.prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=payload.max_new_tokens
    )

    text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return {
        "prompt": payload.prompt,
        "response": text
    }

3. รัน API

uvicorn main:app --host 0.0.0.0 --port 8000

4. ทดสอบด้วย cURL

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize why local AI models are useful.",
    "max_new_tokens": 120
  }'

จากนั้นคุณสามารถนำ endpoint นี้ไปออกแบบ ทดสอบ และทำเอกสาร API ใน Apidog ได้ โดยกำหนด request body, response schema และ mock response สำหรับทีม frontend

วิธีรัน Gemma 3 270M ด้วย LM Studio

LM Studio เหมาะกับการทดลองโมเดลผ่าน GUI โดยไม่ต้อง setup โค้ดมาก

1. ดาวน์โหลดและติดตั้ง LM Studio

ดาวน์โหลดจาก lmstudio.ai

2. ค้นหาโมเดล

ค้นหา "gemma-3-270m" ใน Model Hub

3. ดาวน์โหลดเวอร์ชัน Quantized

เลือกเวอร์ชันที่เหมาะกับเครื่อง เช่น Q4_0

4. โหลดโมเดลและตั้งค่าการสร้างข้อความ

ค่าที่ใช้เริ่มต้นได้:

Context: 32k
Temperature: 1.0
Top-p: 0.95
Top-k: 64

5. เปิดใช้ GPU Offloading หากมี

ถ้าเครื่องมี GPU ให้เปิด GPU offloading เพื่อเพิ่มความเร็วในการ generate token

LM Studio เหมาะสำหรับ prototype, demo ภายในทีม หรือการทดลอง prompt ก่อนนำไปเขียนเป็น API จริง

วิธีรัน Gemma 3 270M ด้วย llama.cpp

ถ้าต้องการ runtime ที่เบาและควบคุมได้มากขึ้น โดยเฉพาะบนระบบที่ทรัพยากรจำกัด ให้ใช้ llama.cpp

1. Clone และ build llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

2. ดาวน์โหลดไฟล์ GGUF จาก Hugging Face

huggingface-cli download unsloth/gemma-3-270m-it-GGUF --include "*.gguf"

3. รันโมเดล

./llama-cli -m gemma-3-270m-it-Q4_K_M.gguf \
  -p "Build a simple AI app."

4. Build พร้อม CUDA สำหรับ NVIDIA GPU

make GGML_CUDA=1

จากนั้นตั้งค่า GPU layers เช่น:

./llama-cli -m gemma-3-270m-it-Q4_K_M.gguf \
  -p "Build a simple AI app." \
  --n-gpu-layers 999

ตัวอย่างการใช้งาน Gemma 3 270M ในเวิร์กโฟลว์ API

1. วิเคราะห์ความรู้สึก

prompt = "Classify the sentiment as Positive, Neutral, or Negative: This product is amazing!"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

ตัวอย่างผลลัพธ์:

Positive

2. สรุปข้อความ

text = """
Long article here...
"""

prompt = f"Summarize the following text in 3 bullet points:\n{text}"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)

3. ตอบคำถามจาก knowledge base

context = """
Climate change is primarily driven by greenhouse gas emissions from burning fossil fuels,
deforestation, and industrial processes.
"""

question = "อะไรคือสาเหตุของการเปลี่ยนแปลงสภาพภูมิอากาศ?"

prompt = f"""
Answer the question using only the context.

Context:
{context}

Question:
{question}
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=120)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

4. ดึงข้อมูลเอนทิตีจากข้อความ

ตัวอย่าง prompt สำหรับดึงข้อมูลแบบ structured:

prompt = """
Extract entities from the text as JSON.

Text:
Patient reports headache and fever for 2 days. Prescribed paracetamol.

Return JSON with:
- symptoms
- duration
- medication
"""

เหมาะกับงานประมวลผลข้อมูลในเครื่อง เช่น ข้อความภายในองค์กร หรือข้อมูลที่ไม่ต้องการส่งออกไปยัง cloud service

เคล็ดลับ: ใช้ Apidog เพื่อออกแบบ endpoint เช่น /generate, /summarize, /classify และ /extract-entities พร้อมกำหนด schema ของ request/response ให้ทีม frontend และ backend ทำงานร่วมกันได้ง่ายขึ้น

การปรับแต่ง Gemma 3 270M สำหรับงานเฉพาะทางด้วย LoRA

สำหรับงานเฉพาะโดเมน คุณสามารถใช้ LoRA เพื่อลดจำนวนพารามิเตอร์ที่ต้อง train

1. ติดตั้ง PEFT

pip install peft

2. เพิ่ม LoRA adapter

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"]
)

model = get_peft_model(model, lora_config)

3. Train ด้วย Transformers Trainer

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=100
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()

แนวทางที่ควรทำ:

ใช้ชุดข้อมูลขนาดเล็กที่มีคุณภาพก่อนเพิ่มขนาด dataset
แยก train/validation set เพื่อวัด overfitting
บันทึก adapter แยกจาก base model เพื่อสลับงานได้เร็ว
ติดตาม loss และตัวอย่าง output ระหว่าง train

เคล็ดลับการเพิ่มประสิทธิภาพ

ใช้ 4-bit หรือ 8-bit quantization เพื่อลด RAM/VRAM
Batch inference เมื่อมี request หลายรายการ
ปรับ generation parameters เช่น temperature=1.0, top_k=64, top_p=0.95
เปิด mixed precision บน GPU ที่รองรับ
ตรวจสอบ VRAM ด้วย nvidia-smi
จำกัด max_new_tokens ตาม use case เพื่อลด latency
อัปเดตไลบรารี เช่น transformers, torch, bitsandbytes เพื่อรับ performance improvement ล่าสุด

ข้อควรระวัง:

อย่าใส่ BOS token ซ้ำใน prompt
ตรวจสอบ context window เพื่อป้องกัน prompt ถูกตัดทอน
อย่า hardcode token หรือ secret ใน source code
แยก prompt template ออกจาก business logic เพื่อปรับแต่งง่าย

สรุป: ใช้ Gemma 3 270M เพื่อสร้าง AI local-first

Gemma 3 270M เหมาะสำหรับนักพัฒนาที่ต้องการเพิ่มความสามารถด้านภาษาในแอปโดยยังควบคุม latency, privacy และต้นทุนได้ดี คุณสามารถเริ่มจาก LM Studio เพื่อทดลอง prompt จากนั้นย้ายไป Hugging Face Transformers หรือ llama.cpp เพื่อสร้าง API และ deploy ในสภาพแวดล้อมจริง

หากต้องต่อโมเดลเข้ากับ backend, frontend หรือ integration อื่น ๆ ให้ใช้ Apidog เพื่อออกแบบ ทดสอบ mock และทำเอกสาร API สำหรับ workflow ที่ใช้ Gemma 3 270M ได้เป็นระบบมากขึ้น