If you want to stop the "student A copied student B" drama in class, assignments, or the office, this is a practical tutorial for building a plagiarism checker that understands Indonesian: it can check the whole text, per paragraph, and per sentence, and it stores every submission in a text bank (Postgres + Qdrant), so when you upload 40 assignments they are all checked against each other automatically.
Architecture overview
- Frontend & API gateway: Nuxt (server route) receives /check requests and forwards them to the embedding/check service.
- Embedding & search service: Python + FastAPI + sentence-transformers/paraphrase-multilingual-mpnet-base-v2 + qdrant-client, handling embedding, search, and vector upserts.
- Vector DB: Qdrant stores the embeddings (vector size = 768, distance = Cosine).
- Relational DB: PostgreSQL stores document metadata (unique doc_id, title, created_at, raw_text).
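To get a feel for why cosine similarity over multilingual embeddings catches paraphrases, here is a minimal sketch using the same model as the service. The example sentences and the printed value are illustrative, not measured output:

# Minimal sketch: score two paraphrased Indonesian sentences with the same
# model used by the service. Actual scores vary; close paraphrases usually
# score high, which is what the 0.8 threshold later builds on.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

a = "Pemanasan global menyebabkan naiknya permukaan air laut."
b = "Kenaikan permukaan laut disebabkan oleh pemanasan global."

emb = model.encode([a, b], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {score:.3f}")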
Prerequisites (tools & versions)
- Docker + docker-compose
- Python 3.10+
- Node.js + Nuxt 3
- (Optional) GPU if you want faster inference
1) Docker: run Postgres + Qdrant quickly
Create a simple docker-compose.yml:
version: '3.8'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: pguser
      POSTGRES_PASSWORD: pgpass
      POSTGRES_DB: plagiarism
    ports:
      - "5432:5432"
    volumes:
      - ./pgdata:/var/lib/postgresql/data
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - ./qdrant_storage:/qdrant/storage
Run:
docker-compose up -d
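Optionally, a quick connectivity check before moving on. This is a minimal sketch; it assumes the credentials from the compose file above and that the Python dependencies installed in step 2 are already available:

# Quick sanity check that both containers are reachable.
# Assumes the credentials from docker-compose.yml above and the Python
# dependencies from step 2 (sqlalchemy, psycopg2-binary, qdrant-client).
from sqlalchemy import create_engine, text
from qdrant_client import QdrantClient

engine = create_engine("postgresql+psycopg2://pguser:pgpass@localhost:5432/plagiarism")
with engine.connect() as conn:
    print("postgres:", conn.execute(text("SELECT version()")).scalar())

qdrant = QdrantClient(url="http://localhost:6333")
print("qdrant collections:", qdrant.get_collections())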
2) Backend embedding & search (Python + FastAPI)
Create a virtualenv and install the main dependencies:
python -m venv venv
source venv/bin/activate
pip install -U pip
pip install fastapi uvicorn sentence-transformers qdrant-client sqlalchemy psycopg2-binary python-multipart numpy
Create the file service/app.py; here is a concise but complete version:
# service/app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.http.models import VectorParams, Distance, PointStruct
from sqlalchemy import create_engine, Column, String, Text, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker
from datetime import datetime
import re
import uuid
# CONFIG
QDRANT_COLLECTION = "plagiarism_vectors"
VECTOR_SIZE = 768
THRESHOLD = 0.8
QDRANT_URL = "http://localhost:6333"
DATABASE_URL = "postgresql+psycopg2://pguser:pgpass@localhost:5432/plagiarism"
# Init
app = FastAPI(title="Plagiarism Checker Service")
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2') # loads on startup
qdrant = QdrantClient(url=QDRANT_URL)
engine = create_engine(DATABASE_URL)
Base = declarative_base()
SessionLocal = sessionmaker(bind=engine)
# DB model
class Document(Base):
    __tablename__ = "documents"
    doc_id = Column(String, primary_key=True, index=True)
    title = Column(String)
    text = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)

Base.metadata.create_all(bind=engine)
# Ensure Qdrant collection exists
try:
    # get_collection raises if the collection does not exist yet;
    # recreating it here would wipe the existing text bank on every restart
    qdrant.get_collection(QDRANT_COLLECTION)
except Exception:
    qdrant.create_collection(
        collection_name=QDRANT_COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE)
    )
# Helpers
def split_paragraphs(text: str):
    paras = [p.strip() for p in re.split(r'\n{2,}', text) if p.strip()]
    return paras if paras else [text.strip()]

def split_sentences(paragraph: str):
    # simple rule-based sentence split (works reasonably for Indonesian). Improve with spaCy if needed.
    sents = re.split(r'(?<=[\.\?\!])\s+', paragraph.strip())
    return [s.strip() for s in sents if s.strip()]

def cosine_sim(a: np.ndarray, b: np.ndarray):
    den = (np.linalg.norm(a) * np.linalg.norm(b))
    if den == 0:
        return 0.0
    return float(np.dot(a, b) / den)

def chunk_text_all(text: str):
    """Return list of chunks with type info: full, paragraphs, sentences"""
    chunks = []
    # full text
    chunks.append({"type": "full", "index": 0, "text": text.strip()})
    paras = split_paragraphs(text)
    for i, p in enumerate(paras):
        chunks.append({"type": "paragraph", "index": i, "text": p})
        sents = split_sentences(p)
        for j, s in enumerate(sents):
            chunks.append({"type": "sentence", "index": f"{i}-{j}", "text": s})
    return chunks
# Request model
class CheckRequest(BaseModel):
    doc_id: str
    title: str
    text: str
@app.post("/check")
def check_and_add(req: CheckRequest):
    doc_id = req.doc_id
    title = req.title
    text = req.text

    # 1) chunk
    chunks = chunk_text_all(text)
    texts = [c["text"] for c in chunks]

    # 2) embed
    embeddings = model.encode(texts, show_progress_bar=False)
    embeddings = np.array(embeddings)  # shape (n, 768)

    # 3) search each chunk in Qdrant (exclude same doc_id results to avoid self-match)
    results = {"full": [], "paragraphs": [], "sentences": []}
    for i, c in enumerate(chunks):
        vec = embeddings[i].tolist()
        # search top 5
        hits = qdrant.search(collection_name=QDRANT_COLLECTION, query_vector=vec, limit=5, with_payload=True, with_vectors=True)
        matches = []
        for h in hits:
            payload = h.payload or {}
            source_doc = payload.get("doc_id")
            # skip matches from the same doc (this doc may already be in the bank)
            if source_doc == doc_id:
                continue
            # compute exact cosine using the vector returned by Qdrant
            if hasattr(h, "vector") and h.vector is not None:
                sim = cosine_sim(np.array(vec), np.array(h.vector))
            else:
                # fall back to Qdrant's score if the vector is not returned
                sim = float(getattr(h, "score", 0.0))
            matches.append({
                "score": sim,
                "source_doc_id": source_doc,
                "source_text": payload.get("text"),
                "source_type": payload.get("chunk_type"),
                "source_index": payload.get("chunk_index")
            })
        # sort matches descending
        matches = sorted(matches, key=lambda x: x["score"], reverse=True)
        entry = {
            "chunk_type": c["type"],
            "chunk_index": c["index"],
            "text": c["text"],
            "top_matches": matches[:5]
        }
        if c["type"] == "full":
            results["full"].append(entry)
        elif c["type"] == "paragraph":
            results["paragraphs"].append(entry)
        else:
            results["sentences"].append(entry)

    # 4) store doc metadata in Postgres (prevent duplicates by doc_id)
    db = SessionLocal()
    existing = db.query(Document).filter(Document.doc_id == doc_id).first()
    if not existing:
        newdoc = Document(doc_id=doc_id, title=title, text=text)
        db.add(newdoc)
        db.commit()
    db.close()
    # 5) upsert all chunks to Qdrant (a deterministic ID per doc_id + chunk prevents duplicates)
    points = []
    for i, c in enumerate(chunks):
        # Qdrant point IDs must be unsigned integers or UUIDs, so derive a stable
        # UUID from the composite key instead of passing the raw string
        pid = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}__{c['type']}__{c['index']}"))
        payload = {
            "doc_id": doc_id,
            "title": title,
            "chunk_type": c["type"],
            "chunk_index": c["index"],
            "text": c["text"]
        }
        points.append(PointStruct(id=pid, vector=embeddings[i].tolist(), payload=payload))
    qdrant.upsert(collection_name=QDRANT_COLLECTION, points=points)
    # 6) build report: flag any sentences/paragraphs above THRESHOLD
    flagged = {"sentences": [], "paragraphs": []}
    for s in results["sentences"]:
        if s["top_matches"] and s["top_matches"][0]["score"] >= THRESHOLD:
            flagged["sentences"].append({
                "text": s["text"],
                "best_match": s["top_matches"][0]
            })
    for p in results["paragraphs"]:
        if p["top_matches"] and p["top_matches"][0]["score"] >= THRESHOLD:
            flagged["paragraphs"].append({
                "text": p["text"],
                "best_match": p["top_matches"][0]
            })
    return {
        "status": "ok",
        "doc_id": doc_id,
        "scores": results,
        "flagged": flagged
    }
Quick notes:
- The /check endpoint accepts {doc_id, title, text}.
- It runs the check (full/paragraph/sentence), then inserts metadata into Postgres and upserts all chunks to Qdrant.
- Self-matches are excluded during search (so a new document does not match against itself).
- The default threshold is 0.8 (adjust via THRESHOLD).
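To get a feel for the request/response shape, here is a minimal client-side sketch. It assumes the requests package is installed (it is not in the pip list above) and the service is running on port 8000; the fields mirror the return value of /check:

# Minimal example call to /check (assumes `pip install requests` and the
# FastAPI service running on http://localhost:8000).
import requests

resp = requests.post("http://localhost:8000/check", json={
    "doc_id": "tugas-001",
    "title": "Esai pemanasan global",
    "text": "Pemanasan global menyebabkan naiknya permukaan air laut.\n\nParagraf kedua di sini."
})
report = resp.json()

# Print any flagged sentences together with their best match
for item in report["flagged"]["sentences"]:
    best = item["best_match"]
    print(f'{item["text"]!r} ~ {best["source_doc_id"]} ({best["score"]:.3f})')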
3) Nuxt (frontend + server route)
You can create a server API route in Nuxt that forwards the request to FastAPI (or call FastAPI directly from the frontend, but going through a server route is safer).
Example server/api/check.post.ts (Nuxt 3 / Nitro):
// server/api/check.post.ts
import { defineEventHandler, readBody } from 'h3'
export default defineEventHandler(async (event) => {
  const body = await readBody(event)
  // change the URL if the service runs on a different host/port
  const FASTAPI_URL = process.env.FASTAPI_URL || "http://localhost:8000/check"
  const res = await $fetch(FASTAPI_URL, {
    method: "POST",
    body,
    headers: { "Content-Type": "application/json" }
  })
  return res
})
Frontend form (simplified):
<template>
  <form @submit.prevent="submit">
    <input v-model="docId" placeholder="doc id (unique)"/>
    <input v-model="title" placeholder="title"/>
    <textarea v-model="text" placeholder="paste the assignment text"></textarea>
    <button>Check</button>
  </form>
  <div v-if="report">
    <h3>Flagged:</h3>
    <div v-for="s in report.flagged.sentences" :key="s.text">
      <b>Sentence:</b> {{ s.text }} — <i>match</i>: {{ s.best_match.source_doc_id }} ({{ s.best_match.score.toFixed(3) }})
      <div>source text: {{ s.best_match.source_text }}</div>
    </div>
    <div v-for="p in report.flagged.paragraphs" :key="p.text">
      <b>Paragraph:</b> {{ p.text }} — match: {{ p.best_match.source_doc_id }} ({{ p.best_match.score.toFixed(3) }})
    </div>
  </div>
</template>
<script setup>
import { ref } from 'vue'

const docId = ref(`task-${Date.now()}`)
const title = ref('')
const text = ref('')
const report = ref(null)

async function submit(){
  report.value = null
  const res = await $fetch('/api/check', {
    method: 'POST',
    body: { doc_id: docId.value, title: title.value, text: text.value }
  })
  report.value = res
}
</script>
4) Quick testing (local)
- Run FastAPI: uvicorn service.app:app --reload --port 8000
- Run Nuxt, open the form, submit 2 similar documents → check flagged in the response.
- Try uploading 40 assignments: give each one a unique doc_id (e.g. tugas-2025-09-25-001). The system stores them all in the bank and checks them against each other; a small batch-upload sketch follows this list.
- If you have a textbook or paper the students use as a reference, you can add it to the bank as well.
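A minimal batch-upload sketch. The folder name, file layout, doc_id scheme, and the requests dependency are assumptions for illustration:

# Upload every .txt file in ./tugas as a separate document.
# Folder name, doc_id scheme, and the requests dependency are assumptions.
from pathlib import Path
import requests

for path in sorted(Path("tugas").glob("*.txt")):
    doc_id = f"tugas-2025-09-25-{path.stem}"   # any unique, stable ID works
    text = path.read_text(encoding="utf-8")
    resp = requests.post("http://localhost:8000/check", json={
        "doc_id": doc_id,
        "title": path.stem,
        "text": text
    })
    report = resp.json()
    n_sent = len(report["flagged"]["sentences"])
    n_para = len(report["flagged"]["paragraphs"])
    print(f"{doc_id}: {n_sent} flagged sentences, {n_para} flagged paragraphs")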
5) Production tips / tuning
- Threshold: 0.8 for a strong indication (adjust as needed). Heavily paraphrased text may need 0.7–0.75.
- Top_k: 5–10 is enough for detection.
- Batching: embed in batches (e.g. 128 texts per batch) for performance.
- GPU: move the model to the GPU (device='cuda') if you process many documents; see the sketch after this list.
- Dedup: use doc_id as the primary key and keep /check idempotent.
- Self-match exclusion: already applied; make sure the search always filters on payload.doc_id != doc_id.
- Tokenization: the sentence splitter above is a heuristic. For higher accuracy, use an Indonesian tokenizer (a spaCy model or transformer-based segmentation).
- Privacy: store the text only if your policy/privacy policy allows it (sensitive coursework / business documents).
- Scaling: Qdrant can run as a cluster; use sharding & replicas at larger scale.
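A minimal sketch of batched encoding on GPU when available. The batch size and the placeholder texts are illustrative defaults, not tuned values:

# Batched embedding on GPU when available; falls back to CPU otherwise.
# The batch size of 128 is an illustrative default, not a tuned value.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    device=device,
)

texts = [f"contoh kalimat nomor {i}" for i in range(1000)]  # placeholder chunks
embeddings = model.encode(texts, batch_size=128, show_progress_bar=True)
print(embeddings.shape)  # (1000, 768)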
6) Deployment checklist
- [ ] Docker compose running (Postgres + Qdrant)
- [ ] FastAPI deployed (container / VM) and reachable from Nuxt
- [ ] Nuxt server deployed (FASTAPI_URL environment variable set)
- [ ] Postgres backups, Qdrant snapshots
- [ ] Monitoring: embedding latency, Qdrant search time
Closing (TL;DR)
- The flow: Nuxt → FastAPI (embedding + Qdrant) → Postgres.
- The checker works on: whole text, per paragraph, per sentence.
- Output: arrays of scores & flagged items (sentences/paragraphs > 0.8, each with its text and source reference).
- Every /check call also adds the document to the bank (unique by doc_id), so assignments end up checking each other.
Disclaimer
This tutorial was created by AI. If you find errors, treat them as part of the learning process. Better yet, use them to write a more accurate tutorial of your own.