
soy

Posted on • Originally published at media.patentllm.org

Individual Developer's Portfolio Strategy: Running 13 Projects on a Single RTX 5090

13 Project List

The portfolio consists of the following categories:

Legal Tech

  • Contract Auto-Generation Tool (Clause suggestion with Streamlit + Gemini API)
  • Case Law Search System (Fast search of case law documents with SQLite FTS5)
  • Legal Compliance Chatbot (Article interpretation support with Gemini)

Chemical Simulation

  • Molecular Structure Prediction Model (FP8 Quantized ResNet)
  • Reaction Rate Calculation Engine (CUDA kernel optimized)

Shogi AI

  • Fuka40B (FP8 Quantized ResNet40x384, 80 layers)
  • Fuka2025Q2-20b (FP8 Policy Evaluation Model)
  • Floodgate Strategy Engine
  • ttzl-ex (TensorRT Inference Optimization)
  • Shogi Data Analysis Pipeline

Others

  • Minecraft AI Assistant (vLLM Resident)
  • Stock Data Visualization Dashboard
  • Research Note Management System

Standardizing the Technology Stack

Search Infrastructure: SQLite FTS5

SQLite FTS5 was adopted to standardize search functionality across all projects. Its built-in BM25 ranking provides fast, highly relevant search over patent documents and case law data.
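The FTS5 + BM25 setup can be sketched in a few lines of Python. The table and column names here are illustrative, not the actual schema of the case law system:

```python
import sqlite3

# In-memory database for illustration; the real projects use files on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE cases USING fts5(title, body)")
conn.executemany(
    "INSERT INTO cases (title, body) VALUES (?, ?)",
    [
        ("Patent dispute 2023", "claim construction and prior art"),
        ("Contract breach 2024", "liquidated damages clause interpretation"),
    ],
)
# bm25() returns a relevance score (lower is more relevant),
# so ordering by it ascending puts the best match first
rows = conn.execute(
    "SELECT title, bm25(cases) FROM cases WHERE cases MATCH ? "
    "ORDER BY bm25(cases)",
    ("clause",),
).fetchall()
print(rows[0][0])  # title of the most relevant hit
```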

Common UI: Streamlit

Streamlit is used for the frontend of all applications, standardizing the display of responses when integrating with the Gemini API.

import streamlit as st
from google import genai

# The client reads the API key from the GEMINI_API_KEY environment variable
client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Extract clauses from the patent document"
)
st.markdown(f"**Suggested clause**:\n{response.text}")

GPU Sharing Strategy

vLLM Resident Architecture

To maximize the utilization of the RTX 5090's 32GB VRAM, vLLM is launched as a resident process. The inference engine is switched according to the model size for each project.
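One way to keep vLLM resident is a systemd user service, which also matches the `vllm.service` unit that the monitoring script below stops. This is a sketch: the model name, port, and memory fraction are placeholders, not the actual configuration:

```ini
# ~/.config/systemd/user/vllm.service (illustrative)
[Unit]
Description=Resident vLLM inference server

[Service]
# Cap vLLM's share of the 32GB VRAM so TensorRT engines can coexist
ExecStart=/usr/bin/env vllm serve Qwen/Qwen2.5-7B-Instruct \
  --gpu-memory-utilization 0.6 \
  --port 8000
Restart=on-failure

[Install]
WantedBy=default.target
```

Enable it with `systemctl --user enable --now vllm.service`; the watchdog can then stop and restart it without touching the other projects.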

TensorRT Switching Logic

In Shogi AI, models are optimized with TensorRT.

trtexec \
  --onnx=models/eval/model_fp8.onnx \
  --fp8 \
  --minShapes=input1:1x62x9x9,input2:1x57x9x9 \
  --optShapes=input1:256x62x9x9,input2:256x57x9x9 \
  --maxShapes=input1:256x62x9x9,input2:256x57x9x9 \
  --saveEngine=model_fp8_trt

GPU Usage Monitoring

# Poll GPU utilization every 60s and stop the resident vLLM service
# when other workloads push usage above the 80% threshold
while true; do
  usage=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | tr -d ' ')
  if [ "$usage" -gt 80 ]; then
    systemctl --user stop vllm.service
  fi
  sleep 60
done

Cloudflare + Caddy Publishing Infrastructure

All web projects are published using Cloudflare Tunnel + Caddy. Caddy functions as a reverse proxy, handling HTTPS termination and routing.
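A minimal Caddyfile for this setup might look like the following. The hostnames and ports are placeholders, not the actual deployment; Cloudflare Tunnel forwards traffic to Caddy, which then routes each hostname to the matching Streamlit app:

```text
# Caddyfile sketch (hostnames and ports are illustrative)
contracts.example.com {
    reverse_proxy localhost:8501   # contract auto-generation tool
}

caselaw.example.com {
    reverse_proxy localhost:8502   # case law search system
}
```

Since the tunnel terminates the public connection, Caddy only needs to listen locally and map hostnames to backend ports.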

Horizontal Security Deployment

A common security policy is applied across all projects.

  • API keys are managed via environment variables and are not hardcoded in the code.
  • Branch protection is configured to require PRs.
  • Periodic log-auditing scripts run automatically.
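The first bullet can be sketched in Python. `require_api_key` is an illustrative helper, not part of any of the projects above; it fails fast at startup instead of letting a hardcoded key slip into the code:

```python
import os

def require_api_key(name: str = "GEMINI_API_KEY") -> str:
    """Fetch an API key from the environment, failing fast if it is missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set {name} before starting the app")
    return key
```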

Operational Tips

  • Standardized on CUDA 12.8 to resolve version conflicts between projects.
  • Managed per-project library paths using environment variables.
  • Automatically stops services when GPU utilization exceeds a threshold.
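The first two tips can be combined in a per-project environment file sourced before launch. The paths below are illustrative, assuming a standard CUDA 12.8 install location:

```shell
# Per-project CUDA environment, pinned to CUDA 12.8 (paths are illustrative)
export CUDA_HOME=/usr/local/cuda-12.8
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
```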

Summary

To maximize the utilization of the RTX 5090's 32GB VRAM, the following three points were prioritized:

  • Building a Common Infrastructure: Standardized search and UI with SQLite FTS5 and Streamlit.
  • Dynamic Resource Management: Optimized based on model load with vLLM + TensorRT switching.
  • Horizontal Security Deployment: Standardized authentication processes.

Particularly in the Shogi AI project, the combination of FP8 quantization and TensorRT achieved significant inference speed improvements compared to FP16. In personal development, balancing "freedom in technology selection" and "the importance of a common infrastructure" is key to success.

This article was generated by Nemotron-Nano-9B-v2-Japanese and formatted and validated by Gemini 2.5 Flash.
