<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gaurav Thorat</title>
    <description>The latest articles on DEV Community by Gaurav Thorat (@gaurav_thorat_669a72b30ba).</description>
    <link>https://dev.to/gaurav_thorat_669a72b30ba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3958574%2F6eccd7f7-5725-492d-b1e0-5b5164fa661e.jpg</url>
      <title>DEV Community: Gaurav Thorat</title>
      <link>https://dev.to/gaurav_thorat_669a72b30ba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gaurav_thorat_669a72b30ba"/>
    <language>en</language>
    <item>
      <title>Why Your React Frontend Crashes When an LLM Streams Malformed JSON</title>
      <dc:creator>Gaurav Thorat</dc:creator>
      <pubDate>Tue, 09 Jun 2026 09:50:02 +0000</pubDate>
      <link>https://dev.to/gaurav_thorat_669a72b30ba/why-your-react-frontend-crashes-when-an-llm-streams-malformed-json-1k69</link>
      <guid>https://dev.to/gaurav_thorat_669a72b30ba/why-your-react-frontend-crashes-when-an-llm-streams-malformed-json-1k69</guid>
      <description>&lt;p&gt;&lt;strong&gt;A production-minded walkthrough with a live Next.js demo — JSON.parse() vs partial-json + Zod for real-time AI dashboards.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Canonical URL:&lt;/strong&gt; &lt;a href="https://gauravthorat-portfolio.vercel.app/blog/react-llm-stream-json-parser" rel="noopener noreferrer"&gt;https://gauravthorat-portfolio.vercel.app/blog/react-llm-stream-json-parser&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>react</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Why 99% of RAG Apps Crash in Production (Naive vs Scaled Node.js)</title>
      <dc:creator>Gaurav Thorat</dc:creator>
      <pubDate>Fri, 29 May 2026 14:01:10 +0000</pubDate>
      <link>https://dev.to/gaurav_thorat_669a72b30ba/why-99-of-rag-apps-crash-in-production-naive-vs-scaled-nodejs-1jp7</link>
      <guid>https://dev.to/gaurav_thorat_669a72b30ba/why-99-of-rag-apps-crash-in-production-naive-vs-scaled-nodejs-1jp7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; I am a frontend developer transitioning into AI engineering, sharing real experiments and learnings from building production-style RAG systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your RAG pipeline works perfectly on Friday. Then Monday hits. &lt;strong&gt;1,000 users query at once.&lt;/strong&gt; Suddenly everything breaks: 502 errors, ECONNRESET, OpenAI 429 rate limits, Pinecone timeouts. The demo wasn't wrong—it just wasn't built for production concurrency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Video:&lt;/strong&gt; &lt;a href="https://youtu.be/-2aS3Yl5-5M" rel="noopener noreferrer"&gt;https://youtu.be/-2aS3Yl5-5M&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/gauravthorath/rag-scale-demo" rel="noopener noreferrer"&gt;https://github.com/gauravthorath/rag-scale-demo&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Full article:&lt;/strong&gt; &lt;a href="https://gauravthorat-portfolio.vercel.app/blog/rag-production-architecture" rel="noopener noreferrer"&gt;https://gauravthorat-portfolio.vercel.app/blog/rag-production-architecture&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The Monday morning problem&lt;/h2&gt;

&lt;p&gt;Locally: chunk docs → embed → upsert to Pinecone → query → LLM. Simple.&lt;/p&gt;

&lt;p&gt;Under load: socket exhaustion, connection pool saturation, API 429s, token costs exploding.&lt;/p&gt;
&lt;h2&gt;Naive RAG (what most people build first)&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;SAMPLE_CHUNKS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embedOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;embedModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SAMPLE_CHUNKS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="nx"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`demo-naive-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pinecone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Pinecone&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pineconeKey&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;DEMO_NAMESPACE&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Why it breaks at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One embedding call per chunk&lt;/li&gt;
&lt;li&gt;One upsert per vector&lt;/li&gt;
&lt;li&gt;No batching, no connection reuse, no retries&lt;/li&gt;
&lt;li&gt;A new client instance on every call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each chunk costs one embedding request and one upsert, so 3 chunks × 2 requests × 1,000 users is 6,000 outbound API calls before any retries. Sockets and rate limits run out fast.&lt;/p&gt;
&lt;h2&gt;Production pattern&lt;/h2&gt;

&lt;p&gt;Same RAG logic. Better infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Singleton Pinecone client:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Pinecone&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;indexCache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Index&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getPineconeIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;indexName&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;Index&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;indexName&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nf"&gt;getEnv&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;PINECONE_INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;indexCache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getPineconeClient&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;indexCache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Embedding batching:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;64 texts → 1 API call instead of 64. That cuts round trips and the number of requests counted against rate limits; the tokens billed stay the same.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This batching happens inside a single process. When you run multiple servers, add a shared cache such as Redis and a task queue to coordinate work across them.&lt;/p&gt;
&lt;/blockquote&gt;
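
&lt;p&gt;A minimal sketch of the batching idea above, under stated assumptions: &lt;code&gt;fakeEmbedBatch&lt;/code&gt; is a hypothetical stand-in for the real embeddings request (it is not part of the demo repo), and the default batch size of 64 simply mirrors the figure quoted above. The helper slices the inputs into batches, makes one request per batch, and returns the vectors in input order.&lt;/p&gt;

```typescript
// Hypothetical stand-in for a real embeddings request: one call, many texts.
// It counts calls so the effect of batching is visible without network access.
let requestCount = 0;
async function fakeEmbedBatch(texts: string[]) {
  requestCount += 1;
  // Toy "embedding": a one-dimensional vector holding the text length.
  return texts.map((text) => [text.length]);
}

// Embed all texts using one request per batch instead of one per text.
// Results come back in the same order as the inputs.
async function embedInBatches(
  texts: string[],
  embedBatch: typeof fakeEmbedBatch,
  batchSize = 64,
) {
  if (!Number.isInteger(batchSize) || 1 > batchSize) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const vectors: number[][] = [];
  for (let start = 0; texts.length > start; start += batchSize) {
    const batch = texts.slice(start, start + batchSize);
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}

// 64 texts fit in a single batch, so this run issues a single request.
const sampleTexts = Array.from({ length: 64 }, (_, i) => `chunk ${i}`);
embedInBatches(sampleTexts, fakeEmbedBatch).then((vectors) => {
  console.log(vectors.length, "vectors from", requestCount, "request(s)");
});
```

&lt;p&gt;Passing the embed function in as a parameter keeps the batching logic testable offline. In a real pipeline it would presumably wrap the &lt;code&gt;openai.embeddings.create&lt;/code&gt; call shown earlier.&lt;/p&gt;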

&lt;h2&gt;Naive vs production&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Naive&lt;/th&gt;
&lt;th&gt;Production&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New Pinecone client per call&lt;/td&gt;
&lt;td&gt;Singleton client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One embedding per chunk&lt;/td&gt;
&lt;td&gt;Batched embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One upsert per vector&lt;/td&gt;
&lt;td&gt;Bulk upsert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw env vars&lt;/td&gt;
&lt;td&gt;Zod validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No retries&lt;/td&gt;
&lt;td&gt;Backoff + retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No metrics&lt;/td&gt;
&lt;td&gt;Tracing + metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Before real scale&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Exponential backoff + jitter on OpenAI and Pinecone&lt;/li&gt;
&lt;li&gt;Top-K + reranking (don't dump every chunk into the prompt)&lt;/li&gt;
&lt;li&gt;Distributed rate limiting across instances&lt;/li&gt;
&lt;li&gt;Metrics: embed latency, retrieval quality, token usage&lt;/li&gt;
&lt;li&gt;Stable vector IDs for safe retries&lt;/li&gt;
&lt;/ol&gt;
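
&lt;p&gt;The first item in the list can be sketched roughly as follows. Everything here is illustrative rather than taken from the demo repo: the names &lt;code&gt;withRetry&lt;/code&gt;, &lt;code&gt;backoffDelayMs&lt;/code&gt;, and &lt;code&gt;isRetryable&lt;/code&gt; are hypothetical, the defaults are arbitrary, and the sketch assumes the thrown error exposes a numeric &lt;code&gt;status&lt;/code&gt; property. It uses "full jitter": each wait is a random value between zero and an exponentially growing, capped ceiling.&lt;/p&gt;

```typescript
// Resolve after the given number of milliseconds.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Full-jitter delay: a random value in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random = Math.random,
) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Assumption: errors carry a numeric `status`. Retry on 429 and 5xx only.
function isRetryable(err: unknown) {
  if (typeof err !== "object" || err === null) return false;
  const status = (err as { status?: number }).status;
  if (status === undefined) return false;
  return status === 429 || status >= 500;
}

// Run `fn`, retrying retryable failures up to `maxAttempts` total attempts.
// Typed loosely to keep the sketch short; a real helper would be generic
// over the result type.
async function withRetry(
  fn: () => unknown,
  maxAttempts = 5,
  baseMs = 200,
  capMs = 5000,
  wait = sleep,
) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      const lastAttempt = attempt + 1 >= maxAttempts;
      if (lastAttempt || !isRetryable(err)) throw err;
      await wait(backoffDelayMs(attempt, baseMs, capMs));
    }
  }
}

// Hypothetical call site, wrapping the batched embeddings request:
// const res = await withRetry(() => openai.embeddings.create({ model, input: inputs }));
```

&lt;p&gt;The jitter matters because many clients that fail at the same moment would otherwise all retry at the same moment too. Spreading the waits at random avoids that synchronized second wave.&lt;/p&gt;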

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/gauravthorath/rag-scale-demo
&lt;span class="nb"&gt;cd &lt;/span&gt;rag-scale-demo
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run naive
npm run production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use separate Pinecone namespaces so runs don't overwrite each other.&lt;/p&gt;

&lt;h2&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;Most RAG tutorials stop at "it answers my PDF." Production is about surviving concurrency, retries, rate limits, and cost pressure.&lt;/p&gt;

&lt;p&gt;Questions or repo fixes? Drop a comment. I reply here and on YouTube.&lt;/p&gt;

&lt;p&gt;Originally published on my portfolio: &lt;a href="https://gauravthorat-portfolio.vercel.app/blog/rag-production-architecture" rel="noopener noreferrer"&gt;https://gauravthorat-portfolio.vercel.app/blog/rag-production-architecture&lt;/a&gt;&lt;/p&gt;

</description>
      <category>node</category>
      <category>typescript</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
