<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: cosmos0709</title>
    <description>The latest articles on DEV Community by cosmos0709 (@cosmos0709).</description>
    <link>https://dev.to/cosmos0709</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4007148%2F960ff88d-d00d-44a6-9cae-44fce98cf8b9.png</url>
      <title>DEV Community: cosmos0709</title>
      <link>https://dev.to/cosmos0709</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cosmos0709"/>
    <language>en</language>
    <item>
      <title>Show Dev: Self-reinforcing K-pop data pipeline using Spring Boot and pgvector (Built on OCI Free Tier)</title>
      <dc:creator>cosmos0709</dc:creator>
      <pubDate>Mon, 29 Jun 2026 02:21:06 +0000</pubDate>
      <link>https://dev.to/cosmos0709/show-dev-self-reinforcing-k-pop-data-pipeline-using-spring-boot-and-pgvector-built-on-oci-free-5g7</link>
      <guid>https://dev.to/cosmos0709/show-dev-self-reinforcing-k-pop-data-pipeline-using-spring-boot-and-pgvector-built-on-oci-free-5g7</guid>
      <description>&lt;p&gt;Hi everyone,&lt;/p&gt;

&lt;p&gt;I'm a backend developer based in Seoul. I built k-cosmos, an interactive web-based 3D music space that maps K-pop tracks based on 768-dimensional vector embeddings.&lt;/p&gt;

&lt;p&gt;The main reason I had to build this from scratch is that there's no clean, structured K-pop metadata or emotional tag dataset available anywhere.&lt;/p&gt;

&lt;p&gt;How the pipeline grows itself&lt;br&gt;
It runs on an autonomous background sync cycle. First, the system ingests tracks and uses an LLM to analyze their mood and aesthetic. Then the LLM reverse-engineers low-latency search keywords from that analysis. These keywords are written back to the database and feed the next day's ingestion scheduler, so the system expands its data footprint without human intervention.&lt;/p&gt;

&lt;p&gt;Architectural decisions under hard constraints&lt;br&gt;
Since I am running everything on the OCI free tier with around 4,000 tracks, I had to resolve several performance bottlenecks at the database and thread layer.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preventing HikariCP starvation&lt;br&gt;
Waiting on slow external network I/O (Gemini API, Wikipedia) inside a DB transaction is a severe anti-pattern that leads to connection pool exhaustion. I used TransactionTemplate to split the work into three phases with tight transaction boundaries:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Phase 1 (Short TX): Claim the target track via FOR UPDATE SKIP LOCKED and immediately flip the status to PROCESSING to isolate rows for worker concurrency. Commit and release the connection.&lt;/p&gt;

&lt;p&gt;Phase 2 (Zero TX): Perform the heavy external network I/O and embedding generation while holding zero active DB connections.&lt;/p&gt;

&lt;p&gt;Phase 3 (Short TX): Open a short transaction to persist the final structured entity data.&lt;br&gt;
The entire flow runs on Java 21 virtual threads, so a worker blocked on I/O does not tie up a platform thread.&lt;/p&gt;
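
&lt;p&gt;The three phases above can be sketched in plain Java without Spring or a real database. This is only an illustration under stated assumptions, not the project's code: the class ThreePhaseSketch, its methods, and the in-memory arrays are hypothetical, a Semaphore stands in for the HikariCP pool, and a lock on the status array stands in for FOR UPDATE SKIP LOCKED (waiting workers block briefly here instead of skipping). It uses platform threads so it runs on older JDKs; on Java 21 the same Runnable could be started with Thread.ofVirtual().start(worker).&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.concurrent.Semaphore;

// Hypothetical in-memory stand-in for the tracks table and the connection pool.
class ThreePhaseSketch implements Runnable {
    static final String PENDING = "PENDING";
    static final String PROCESSING = "PROCESSING";
    static final String COMPLETED = "COMPLETED";

    final Semaphore pool;      // one permit = one pooled DB connection
    final String[] status;     // row status, indexed by track id
    final String[] embedding;  // persisted result, indexed by track id

    ThreePhaseSketch(Semaphore pool, String[] status, String[] embedding) {
        this.pool = pool;
        this.status = status;
        this.embedding = embedding;
    }

    // Phase 1 (short TX): claim one PENDING row and flip it to PROCESSING.
    // Returns the claimed track id, or -1 when nothing is left to claim.
    int claim() {
        pool.acquireUninterruptibly();
        try {
            synchronized (status) {
                int id = Arrays.asList(status).indexOf(PENDING);
                if (id != -1) {
                    status[id] = PROCESSING;
                }
                return id;
            }
        } finally {
            pool.release(); // "commit" and hand the connection back
        }
    }

    // Phase 2 (zero TX): slow external work while holding no connection.
    String analyze(int id) {
        try {
            Thread.sleep(20); // stands in for the LLM and embedding calls
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "embedding-" + id;
    }

    // Phase 3 (short TX): persist the result and mark the row COMPLETED.
    void persist(int id, String result) {
        pool.acquireUninterruptibly();
        try {
            synchronized (status) {
                embedding[id] = result;
                status[id] = COMPLETED;
            }
        } finally {
            pool.release();
        }
    }

    @Override
    public void run() {
        for (int id = claim(); id != -1; id = claim()) {
            persist(id, analyze(id));
        }
    }

    // Runs the given number of workers over the tracks and returns the final statuses.
    static String[] runAll(int tracks, int workers, int poolSize) {
        String[] status = new String[tracks];
        Arrays.fill(status, PENDING);
        Semaphore pool = new Semaphore(poolSize);
        Thread[] threads = new Thread[workers];
        for (int w = 0; w != workers; w++) {
            threads[w] = new Thread(new ThreePhaseSketch(pool, status, new String[tracks]));
            threads[w].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return status;
    }

    public static void main(String[] args) {
        // Three workers share a pool of one connection without starving each other.
        System.out.println(Arrays.toString(runAll(6, 3, 1)));
    }
}
```

&lt;p&gt;Because the permit is released before the slow call in phase 2, a pool of size one is enough for several workers to make progress; holding the permit across analyze() would serialize them on the slowest network call.&lt;/p&gt;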

&lt;ol start="2"&gt;
&lt;li&gt;Diversity-enforced vector search&lt;br&gt;
Raw pgvector cosine-distance search created an echo chamber: a single popular artist's massive discography monopolized the recommendation space. To keep exploration diverse, I moved a two-stage query (a nearest-neighbor candidate pool, then a per-artist ROW_NUMBER() cap) directly into the database layer rather than post-processing in application memory:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(:&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cosmos_tracks&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'COMPLETED'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;clusterId&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(:&lt;/span&gt;&lt;span class="n"&gt;excludeIds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(:&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;poolSize&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;diversified&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;artist&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;artist_rank&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;diversified&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;artist_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;maxPerArtist&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="k"&gt;limit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the candidate lookup in the index-friendly ORDER BY ... LIMIT form while capping each artist's share of the results, all in a single round trip.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Defensive budgeting&lt;br&gt;
The self-reinforcing loop tends toward exponential data growth, which can quickly exhaust free-tier LLM quotas. To control this, the engine flattens and randomizes the entire task grid (Collections.shuffle()) every day at midnight, spreading the query budget evenly across all latent moods before the daily hard ceiling is reached.&lt;/li&gt;
&lt;/ol&gt;
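
&lt;p&gt;A minimal sketch of the flatten, shuffle, and cap idea described above. The shape of the task grid is an assumption (here, every mood paired with every keyword), and the names BudgetedGridSketch, Task, and planDay are hypothetical; only Collections.shuffle() comes from the post.&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

class BudgetedGridSketch {
    // One unit of LLM work: a search keyword explored under a mood (assumed shape).
    record Task(String mood, String keyword) {}

    // Returns the tasks to run today, never more than dailyBudget.
    static Task[] planDay(String[] moods, String[] keywords, int dailyBudget, Random rng) {
        // Flatten the mood x keyword grid into a single array.
        Task[] grid = new Task[moods.length * keywords.length];
        int next = 0;
        for (String mood : moods) {
            for (String keyword : keywords) {
                grid[next++] = new Task(mood, keyword);
            }
        }
        // Arrays.asList returns a view backed by the array, so this shuffles grid in place.
        Collections.shuffle(Arrays.asList(grid), rng);
        // Keep only as many tasks as the daily ceiling allows.
        int allowed = Math.max(0, Math.min(dailyBudget, grid.length));
        return Arrays.copyOf(grid, allowed);
    }

    public static void main(String[] args) {
        Task[] today = planDay(
                new String[] {"dreamy", "energetic", "melancholic"},
                new String[] {"city pop", "ballad", "synth"},
                4,
                new Random());
        for (Task task : today) {
            System.out.println(task);
        }
    }
}
```

&lt;p&gt;Shuffling before truncating means that when the budget runs out, the skipped tasks are a random sample rather than always the moods at the end of the grid.&lt;/p&gt;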

&lt;p&gt;I deliberately chose a Thymeleaf SSR hybrid architecture instead of splitting out a separate SPA, to keep a single deployment unit and maintain high operational visibility (P6Spy, Actuator).&lt;/p&gt;

&lt;p&gt;Live Project: &lt;a href="https://cosmos.codeghost.cloud/" rel="noopener noreferrer"&gt;https://cosmos.codeghost.cloud/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm very happy to discuss any architectural or design decisions. Let me know your thoughts or hit me with any questions!&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>java</category>
      <category>webdev</category>
      <category>springboot</category>
    </item>
  </channel>
</rss>
