<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akshit Sharma</title>
    <description>The latest articles on DEV Community by Akshit Sharma (@akshit_sharma_321b0b789a4).</description>
    <link>https://dev.to/akshit_sharma_321b0b789a4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2926598%2F03055afd-a7bc-455e-90af-8551ccae3132.jpg</url>
      <title>DEV Community: Akshit Sharma</title>
      <link>https://dev.to/akshit_sharma_321b0b789a4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akshit_sharma_321b0b789a4"/>
    <language>en</language>
    <item>
      <title>Everyone is using LLMs wrong</title>
      <dc:creator>Akshit Sharma</dc:creator>
      <pubDate>Sat, 09 May 2026 19:19:18 +0000</pubDate>
      <link>https://dev.to/akshit_sharma_321b0b789a4/everyone-is-using-llms-wrong-18hp</link>
      <guid>https://dev.to/akshit_sharma_321b0b789a4/everyone-is-using-llms-wrong-18hp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bs5s6j22ukstdltyubq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bs5s6j22ukstdltyubq.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Can an LLM "See" a Room Just by Listening?
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Everyone is using LLMs wrong.&lt;/strong&gt; We are obsessed with Speech-to-Text. We take audio, flatten it into text, and feed it to a chatbot.&lt;/p&gt;

&lt;p&gt;But what happens if you bypass the text entirely?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if you feed the &lt;em&gt;raw acoustic tensor data&lt;/em&gt; directly into a native multimodal LLM and ask it to connect the physical dots?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I am running an experiment to see if an AI can blindly "see" a room just by listening to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Experimental Setup
&lt;/h3&gt;

&lt;p&gt;I am blasting a broadband 3 kHz–6 kHz chirp (a mathematical "bell ping") inside a highly reverberant, non-rectangular tile box. (A minimal generation sketch follows the setup list below.)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Bounds:&lt;/strong&gt; Tight. &lt;code&gt;X=104"&lt;/code&gt;, &lt;code&gt;Y=101"&lt;/code&gt;, &lt;code&gt;Z=97"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Geometry:&lt;/strong&gt; It is a geometric nightmare. There’s a 21.5-inch inward offset on one wall acting as a diffuser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Obstacles:&lt;/strong&gt; A urinal sitting at coordinate &lt;code&gt;(0, 45, 32)&lt;/code&gt; and a toilet at &lt;code&gt;(0, 76, 0)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Target:&lt;/strong&gt; My primary test target is located precisely at &lt;code&gt;(0, 23, 15)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
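
&lt;p&gt;As a reference point, here is a minimal sketch of how such a test chirp could be generated. The 48 kHz sample rate, 50 ms duration, fade length, and filename are my assumptions; only the 3 kHz–6 kHz band comes from the setup above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy.signal import chirp
from scipy.io import wavfile

FS = 48_000        # assumed sample rate; the post does not specify one
DURATION = 0.05    # 50 ms sweep length, also an assumption

t = np.linspace(0, DURATION, int(FS * DURATION), endpoint=False)

# Linear sweep across the 3 kHz to 6 kHz band described above
ping = chirp(t, f0=3_000, t1=DURATION, f1=6_000, method="linear")

# 2 ms fade-in/out so the speaker does not click at the window edges
fade = np.minimum(1.0, np.minimum(t, DURATION - t) / 0.002)

wavfile.write("ping.wav", FS, (0.8 * ping * fade).astype(np.float32))&lt;/code&gt;&lt;/pre&gt;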

&lt;h3&gt;
  
  
  The Traditional Approach
&lt;/h3&gt;

&lt;p&gt;Normally, calculating this requires hardcore Synthetic Aperture Sonar (SAS) math. You sweep a directional emitter, capture the returns at multiple microphone positions (&lt;code&gt;M1&lt;/code&gt;, &lt;code&gt;M2&lt;/code&gt;, etc.), and compute the Time Difference of Arrival (TDOA).&lt;/p&gt;

&lt;p&gt;For my target object, the acoustic shadow lands at &lt;strong&gt;~1.92 ms&lt;/strong&gt;.&lt;/p&gt;
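
&lt;p&gt;For readers who want the classical step spelled out: TDOA is essentially a cross-correlation peak. A toy sketch, assuming two time-aligned mono captures &lt;code&gt;m1&lt;/code&gt; and &lt;code&gt;m2&lt;/code&gt; at a 48 kHz sample rate (the names and rate are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy.signal import correlate

FS = 48_000  # assumed sample rate

def tdoa_seconds(m1, m2):
    """Delay of m2 relative to m1, taken from the cross-correlation peak."""
    xc = correlate(m2, m1, mode="full")
    lag = int(np.argmax(np.abs(xc))) - (len(m1) - 1)
    return lag / FS

# Sanity check on the 1.92 ms figure: at ~343 m/s, sound covers about
# 13,500 inches per second, so 1.92 ms of delay is roughly 26 inches
# of extra path length.
print(0.00192 * 13_500)  # ~25.9 inches&lt;/code&gt;&lt;/pre&gt;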

&lt;h3&gt;
  
  
  The "Crazy" Idea
&lt;/h3&gt;

&lt;p&gt;Instead of writing a manual C++ or Julia script to subtract the empty room baseline ($\Delta R(t)$) and plot intersecting hyperbolas, what if we just feed the raw, overlapping multipath reflection audio directly into an LLM as native tokens?&lt;/p&gt;
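
&lt;p&gt;For contrast, the manual route being dismissed here might start like this. A Python sketch rather than C++ or Julia, with hypothetical, time-aligned recordings:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def delta_r(r_room, r_empty):
    """Residual reflections: the capture with the object in place minus
    the empty-room baseline. Assumes both captures are time-aligned."""
    n = min(len(r_room), len(r_empty))
    return r_room[:n] - r_empty[:n]

def residual_peak_ms(residual, fs=48_000):
    """Arrival time (ms) of the strongest new echo in the residual."""
    return 1000.0 * int(np.argmax(np.abs(residual))) / fs&lt;/code&gt;&lt;/pre&gt;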

&lt;p&gt;&lt;strong&gt;No text. No pre-processing. Just raw environmental interaction data.&lt;/strong&gt;&lt;/p&gt;
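
&lt;p&gt;Concretely, the "no pre-processing" pipeline might look like the sketch below. The model name, prompt, and audio file are placeholders; any audio-native multimodal chat API with a similar shape would do.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("room_ping.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # placeholder audio-native model
    modalities=["text"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This is a 3-6 kHz chirp recorded in a tiled room. "
                     "From the reflections alone, describe the room's "
                     "dimensions and any objects inside it."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(resp.choices[0].message.content)&lt;/code&gt;&lt;/pre&gt;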

&lt;p&gt;If a model can pick up the hidden sequential structure of billions of text tokens, can it implicitly learn the physics of sound propagation? Can it identify that the localized energy redistribution at 1.92 ms is an 8-inch object, while the dense tail after 10 ms is just the tile ceiling?&lt;/p&gt;
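
&lt;p&gt;That hypothesis is at least checkable with conventional math: compare the residual energy in a short window around 1.92 ms against the tail after 10 ms. The window edges below are my guesses, not measured values:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

FS = 48_000  # assumed sample rate

def window_energy(residual, t0_ms, t1_ms):
    """Sum of squared residual samples between t0_ms and t1_ms."""
    i0 = int(t0_ms * FS / 1000)
    i1 = int(t1_ms * FS / 1000)
    return float(np.sum(residual[i0:i1] ** 2))

# With dr = delta_r(r_room, r_empty) from the earlier sketch:
# early = window_energy(dr, 1.5, 2.5)    # compact bump: the object?
# late  = window_energy(dr, 10.0, 50.0)  # diffuse tail: the ceiling?&lt;/code&gt;&lt;/pre&gt;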

&lt;p&gt;&lt;strong&gt;Can an LLM act as a biological auditory cortex and map a 3D coordinate system entirely in the dark?&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>showdev</category>
      <category>discuss</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
