<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MindsDB</title>
    <description>The latest articles on DEV Community by MindsDB (@mindsdb).</description>
    <link>https://dev.to/mindsdb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8790%2Fd558afe9-d790-4387-b479-07d7040912ec.png</url>
      <title>DEV Community: MindsDB</title>
      <link>https://dev.to/mindsdb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mindsdb"/>
    <language>en</language>
    <item>
      <title>How Web Apps Like Midday Gain a Competitive Edge with Supabase and MindsDB</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Wed, 04 Feb 2026 16:42:45 +0000</pubDate>
      <link>https://dev.to/mindsdb/how-web-apps-like-midday-gain-a-competitive-edge-with-supabase-and-mindsdb-4aoi</link>
      <guid>https://dev.to/mindsdb/how-web-apps-like-midday-gain-a-competitive-edge-with-supabase-and-mindsdb-4aoi</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Chandre Van Der Westhuizen, Community &amp;amp; Marketing Co-ordinator at MindsDB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SaaS companies live and die by their metrics. Data is collected in operational systems, copied into warehouses, transformed through pipelines, and finally visualized in dashboards. Finance teams learn where to click. Analysts learn how to maintain the machinery. Leadership waits for answers.&lt;/p&gt;

&lt;p&gt;This approach worked when businesses were simpler and questions were predictable. But modern SaaS companies don't operate that way anymore. Metrics change weekly. Pricing models evolve. Usage patterns shift quickly. And the most important questions are often the ones no dashboard was designed to answer.&lt;/p&gt;

&lt;p&gt;This is why finance analytics is reaching an inflection point.&lt;br&gt;
Monthly Recurring Revenue (MRR), Annual Recurring Revenue (ARR), active users, churn risk - these numbers drive decisions across finance, product, and leadership. Tools like &lt;strong&gt;Midday.ai&lt;/strong&gt; exist to make those metrics visible and actionable.&lt;/p&gt;

&lt;p&gt;But there's a growing challenge beneath the surface: As data grows more complex, dashboards alone are no longer enough.&lt;br&gt;
This is where &lt;strong&gt;MindsDB&lt;/strong&gt; comes in - especially for teams already using &lt;strong&gt;Supabase&lt;/strong&gt; as their primary data store.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Bottleneck Isn't Just Data - It’s Distance
&lt;/h2&gt;

&lt;p&gt;Finance teams using web applications already have the data they need. What they lack is proximity to insight.&lt;/p&gt;

&lt;p&gt;Every additional layer between the question and the answer introduces friction. Data must be replicated. Dashboards must be updated. Someone must interpret the results. By the time insight arrives, the moment has often passed.&lt;/p&gt;

&lt;p&gt;Worse, trust erodes along the way. Numbers differ across tools. Reports disagree. Teams argue about which system is the source of truth.&lt;/p&gt;

&lt;p&gt;In finance, uncertainty isn't just inconvenient. It's dangerous.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i4g6etuf4g6mq1yrxuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i4g6etuf4g6mq1yrxuo.png" alt="web apps data challenges" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Supabase Has Become the System of Record for SaaS Finance
&lt;/h2&gt;

&lt;p&gt;Platforms like Midday.ai, alongside Rally and Deriv, reflect a broader shift in how modern SaaS products are built.&lt;/p&gt;

&lt;p&gt;Supabase is no longer just a backend convenience. It increasingly holds the most critical business data: subscriptions, invoices, usage, customer activity, and even pre-aggregated metrics like MRR and ARR.&lt;/p&gt;

&lt;p&gt;For many teams, Supabase is the financial system of record, especially given its row-level security and all-in-one backend.&lt;/p&gt;

&lt;p&gt;And that raises an uncomfortable question. If the authoritative data already lives in Postgres, why do we keep copying it elsewhere just to understand it?&lt;/p&gt;
&lt;h2&gt;
  
  
  When Intelligence Lives Where the Data Does
&lt;/h2&gt;

&lt;p&gt;MindsDB challenges the assumption that analytics must live outside operational systems.&lt;/p&gt;

&lt;p&gt;Instead of moving financial and user data into separate analytics stacks, MindsDB allows AI to run directly on Supabase. Queries execute where the data lives. Answers reflect the current state of the business. Every result can be traced back to real rows.&lt;/p&gt;

&lt;p&gt;This is not just an architectural preference. It fundamentally changes how finance teams interact with their data. &lt;/p&gt;

&lt;p&gt;Many AI analytics tools fall short for finance teams because their answers feel disconnected from reality. When numbers lack context, can't be audited, or appear probabilistic, trust breaks down in a domain that demands precision. MindsDB addresses this by grounding every answer directly in live Supabase data, with no hidden transformations or black-box aggregation, making each result fully traceable to its source.&lt;/p&gt;

&lt;p&gt;When intelligence lives next to the source of truth, freshness and trust stop being tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl05cv2yog3w6buu56zst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl05cv2yog3w6buu56zst.png" alt="MindsDB + Supabase" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  MindsDB and Supabase: Bringing AI Intelligence Directly to Your Web Apps’ Data
&lt;/h2&gt;

&lt;p&gt;Supabase provides a reliable Postgres-based system of record, while MindsDB turns that data into a queryable intelligence layer, enabling natural-language analytics, hybrid search, and AI agents that stay grounded in real rows. The result is faster insights, fewer pipelines to maintain, and analytics that teams can trust, because every answer remains fresh, explainable, and tied back to the source of truth.&lt;/p&gt;

&lt;p&gt;For web apps like Midday, MindsDB’s federated query engine allows Supabase to make financial intelligence a native part of the product rather than a separate analytics layer. This reduces infrastructure complexity for the product team while delivering a more flexible, differentiated experience for finance teams who need accurate insights they can act on immediately.&lt;/p&gt;

&lt;p&gt;Let’s explore a use case for Midday where we connect web application data hosted in Supabase to MindsDB, build Knowledge Bases and perform Hybrid Search.&lt;/p&gt;

&lt;p&gt;Please note, we will make use of synthetic data for Midday hosted in Supabase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Access MindsDB’s GUI via &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; locally or MindsDB’s extension on &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker-desktop" rel="noopener noreferrer"&gt;Docker Desktop.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Configure your default models in the MindsDB GUI by navigating to Settings → Models.&lt;/li&gt;
&lt;li&gt;Navigate to Manage Integrations in Settings and install the dependencies for Supabase.&lt;/li&gt;
&lt;li&gt;Once you have installed the dependencies for Supabase, you can connect to it via our SQL Editor.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;display_name&lt;/span&gt;  &lt;span class="c1"&gt;--- display name for database.&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'supabase'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;--- name of the mindsdb handler&lt;/span&gt;
&lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"postgres.gyjaxewghgebqavchydf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;--- the user to authenticate&lt;/span&gt;
   &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"your_password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;--- the password to authenticate the user (placeholder; use your own)&lt;/span&gt;
   &lt;span class="nv"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"aws-1-us-west-1.pooler.supabase.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;--- the host name of the Supabase connection&lt;/span&gt;
   &lt;span class="nv"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"5432"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;--- the port to use when connecting&lt;/span&gt;
   &lt;span class="nv"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"postgres"&lt;/span&gt;           &lt;span class="c1"&gt;--- database name&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://docs.mindsdb.com/integrations/vector-db-integrations/pgvector" rel="noopener noreferrer"&gt;PGVector&lt;/a&gt; will be used as the vector storage for the Knowledge Bases; the linked docs explain how to connect to it.&lt;/p&gt;
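
&lt;p&gt;As a minimal sketch of that connection, assuming the &lt;code&gt;heroku&lt;/code&gt; prefix in the later &lt;code&gt;storage = heroku.supabase_1&lt;/code&gt; setting is the name of a PGVector connection, and with every parameter value below a placeholder to replace with your own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch only: connect a PGVector database to use as Knowledge Base storage.
-- "heroku" is assumed to match the storage prefix used later in this tutorial.
CREATE DATABASE heroku
WITH ENGINE = 'pgvector',
PARAMETERS = {
   "host": "your-pgvector-host",     -- placeholder host name
   "port": 5432,                     -- default Postgres port
   "database": "postgres",           -- placeholder database name
   "user": "your_user",              -- placeholder user
   "password": "your_password"       -- placeholder password
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Knowledge Base then refers to a table inside this connection as &lt;code&gt;connection_name.table_name&lt;/code&gt;.&lt;/p&gt;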

&lt;p&gt;We will make use of two tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;midday_user_data&lt;/code&gt; which stores details about organizations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;midday_metrics&lt;/code&gt; which stores metrics like MRR, ARR, NRR and churn risk.&lt;/li&gt;
&lt;/ul&gt;
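
&lt;p&gt;Before building a Knowledge Base, it can help to confirm that the connection works by previewing a few rows. A minimal sketch, assuming the Supabase connection was named &lt;code&gt;supabase&lt;/code&gt; (the name the later queries use in place of the &lt;code&gt;display_name&lt;/code&gt; placeholder) and that both tables live in the &lt;code&gt;public&lt;/code&gt; schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Preview organization and subscription data
-- (column names are the ones used in the Knowledge Base definitions in this tutorial)
SELECT org_id, org_name, plan_name, subscription_status
FROM supabase.public.midday_user_data
LIMIT 5;

-- Preview the monthly metrics
SELECT org_id, metric_month, mrr, arr, churn_risk_score
FROM supabase.public.midday_metrics
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If either query fails, check the connection name and schema before moving on.&lt;/p&gt;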
&lt;h2&gt;
  
  
  Querying Supabase Data with MindsDB Knowledge Bases and Hybrid Search
&lt;/h2&gt;

&lt;p&gt;This section walks through how to use &lt;strong&gt;MindsDB &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview" rel="noopener noreferrer"&gt;Knowledge Bases&lt;/a&gt; and &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/hybrid_search" rel="noopener noreferrer"&gt;Hybrid Search&lt;/a&gt;&lt;/strong&gt; on top of &lt;strong&gt;Supabase data&lt;/strong&gt; to power AI-native analytics inside a web application. Using a Midday-style dataset, we’ll show how operational and financial data stored in Supabase can be transformed into a queryable knowledge layer that supports questions without moving or duplicating data. By the end of this tutorial, you’ll see how combining semantic search with structured filters enables accurate, explainable insights directly from your live Postgres source with the precision of SQL.&lt;/p&gt;

&lt;p&gt;Now that you have connected your Supabase data to MindsDB, you can create a Knowledge Base using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/create" rel="noopener noreferrer"&gt;CREATE KNOWLEDGE_BASE syntax&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
    &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;heroku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;supabase_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_is_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_login_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_is_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscription_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plan_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;billing_interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscription_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscription_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscription_cancel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seats_included&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seats_purchased&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span 
class="n"&gt;usage_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;active_minutes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events_ingested&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reports_generated&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;org_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert your data using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/insert_data" rel="noopener noreferrer"&gt;INSERT INTO&lt;/a&gt; syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_is_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_login_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_is_active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscription_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plan_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;billing_interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscription_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscription_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subscription_cancel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seats_included&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seats_purchased&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoice_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usage_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span 
class="n"&gt;active_minutes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events_ingested&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reports_generated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;midday_user_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can query the Knowledge Base to see the inserted data using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/query" rel="noopener noreferrer"&gt;SELECT&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frulao5f4zsdogdi6jp03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frulao5f4zsdogdi6jp03.png" alt="KB1" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can follow the same steps for the midday_metrics table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Create Knowledge Base&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
    &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;heroku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;supbase_metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'mrr'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'arr'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'active_users'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'net_revenue_retention'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'churn_risk_score'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'metric_month'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'org_id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Insert Data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;metric_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;active_users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;net_revenue_retention&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;churn_risk_score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;midday_metrics&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Query the Knowledge Base&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsrqgnkfbsy3jowiz76s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsrqgnkfbsy3jowiz76s.png" alt="KB2" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MindsDB’s Hybrid Search lets you combine keyword and semantic search in a single SQL query. Let’s see what insights we can obtain from the Midday data stored in Supabase.&lt;/p&gt;
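
&lt;p&gt;As a rough sketch of the syntax, a hybrid query adds &lt;code&gt;hybrid_search = true&lt;/code&gt; to a search on the &lt;code&gt;content&lt;/code&gt; column and can be combined with ordinary metadata filters. The search phrase below is hypothetical, and the flag name follows the Hybrid Search docs linked earlier, so verify it against your MindsDB version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch only: hybrid (keyword + semantic) search with a metadata filter
SELECT *
FROM user_data_kb
WHERE content = 'fintech'              -- hypothetical phrase, matched against the content columns
  AND hybrid_search = true             -- enable keyword matching alongside semantic matching
  AND subscription_status = 'active'   -- structured filter on a metadata column
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The queries that follow use the same pattern of filtering on &lt;code&gt;content&lt;/code&gt; together with metadata columns.&lt;/p&gt;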

&lt;p&gt;Identify which paying customers are most at risk right now and immediately see whether low engagement is a plausible driver.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Find high churn-risk customers and pull their engagement signals&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
 &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'2025-10'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;churn_risk_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;70&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org_is_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the fastest “save revenue” query: finance and customer service teams can prioritize outreach based on risk and product behavior in one view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwgym5zszqrz5vfy344v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwgym5zszqrz5vfy344v.png" alt="mindsdb" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surface customers who contribute meaningful recurring revenue but aren’t adopting the product.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--“High MRR, low usage” accounts (classic expansion + retention signal)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mrr&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;249&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_minutes&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subscription_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Low adoption is one of the strongest leading indicators of churn; catching it early protects ARR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2btwhjow2mg0jnpn6kzu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2btwhjow2mg0jnpn6kzu.png" alt="mindsdb" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can detect revenue leakage and billing failures without waiting for finance tooling or delayed reports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Accounts with payment issues + current subscription context&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
 &lt;span class="n"&gt;invoice_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'failed'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;subscription_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;org_is_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_is_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failed invoices are a direct threat to cash flow and can also signal account distress or impending churn. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq85pm3e9of6qfxfrldp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq85pm3e9of6qfxfrldp0.png" alt="mindsdb" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find customers who have outgrown their current plan footprint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Seat overages: customers growing beyond plan assumptions&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
 &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
 &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subscription_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seats_purchased&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seats_included&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'2025-12'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seat overages are a clean expansion motion—great for upsell workflows and improving NRR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjuy8amzxycz6c3h5cbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjuy8amzxycz6c3h5cbi.png" alt="mindsdb" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Identify customers still paying but no longer engaged—often a precursor to cancellation at renewal time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--“Silent churn” candidates: active subscription, but no recent logins&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
 &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
 &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subscription_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_login_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-12-15'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'2025-12'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a proactive retention list for finance and customer service, especially ahead of renewals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8aqvs8fveaheyhpav47m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8aqvs8fveaheyhpav47m.png" alt="mindsdb" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flag accounts contributing recurring revenue but shrinking or at risk of contraction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--MRR present but retention weak&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-01'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mrr&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'99'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net_revenue_retention&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'1.00'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org_is_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Contraction reduces ARR quietly—catching it early helps protect revenue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid8rhq952shp32c8susi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid8rhq952shp32c8susi.png" alt="mindsdb" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;
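
&lt;p&gt;As a rough illustration of the 1.00 threshold in the query above: NRR is commonly computed as starting recurring revenue plus expansion, minus contraction and churn, divided by starting recurring revenue. The sketch below uses a hypothetical helper name and made-up figures; this article does not show how the &lt;code&gt;net_revenue_retention&lt;/code&gt; column is actually derived.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def net_revenue_retention(starting_mrr, expansion, contraction, churned):
    """Return NRR as a ratio; 1.00 means existing-customer revenue held flat."""
    if starting_mrr == 0:
        raise ValueError("starting_mrr must be non-zero")
    return (starting_mrr + expansion - contraction - churned) / starting_mrr


# Hypothetical account: 200 MRR at the start of the period, one downgrade worth 20
nrr = net_revenue_retention(starting_mrr=200, expansion=0, contraction=20, churned=0)
print(f"{nrr:.2f}")  # 0.90, which falls below the 1.00 threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An account like this still contributes MRR, which is why a revenue filter alone would not surface it.&lt;/p&gt;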

&lt;p&gt;Find cases where revenue retention looks good but the subscription shows cancellation signals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--NRR strong but subscription status at risk&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
 &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-12'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net_revenue_retention&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subscription_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'canceled'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helps detect “NRR lag” (metrics still look fine, but renewal risk is real).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4whlsy4l2tae0n4mxezs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4whlsy4l2tae0n4mxezs.png" alt="mindsdb" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find the accounts that could cause the biggest revenue loss if they churn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Revenue risk: high MRR + high churn risk&lt;/span&gt;
SELECT *
FROM midday_metrics_kb
JOIN user_data_kb
ON midday_metrics_kb.id = user_data_kb.id
WHERE midday_metrics_kb.content = '2026-01'
 AND midday_metrics_kb.mrr &amp;gt;= '249'
 AND midday_metrics_kb.churn_risk_score &amp;gt;= 0.70
 AND user_data_kb.org_is_active = 'true';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows finance and customer service teams to prioritize saves by “revenue at risk,” not just customer count.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0f90b2xdi40xxwbl0c63.png" alt="mindsdb"&gt;&lt;/p&gt;

&lt;p&gt;Find accounts expanding and engaging broadly across users.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--High NRR + high active_users (product-market fit signal)&lt;/span&gt;
SELECT *
FROM midday_metrics_kb
JOIN user_data_kb
ON midday_metrics_kb.id = user_data_kb.id
WHERE midday_metrics_kb.content = '2026-01'
 AND midday_metrics_kb.net_revenue_retention &amp;gt;= '1.10'
 AND midday_metrics_kb.active_users &amp;gt;= '10'
 AND user_data_kb.org_is_active = 'true';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is a high-confidence “sticky + growing” segment, useful for forecasting and expansion playbooks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofbsajwu6fwoad5l7aw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofbsajwu6fwoad5l7aw7.png" alt="mindsdb" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Validate contraction risk using real usage behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--NRR weak + low engagement (risk validation)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;
 &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-01'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="s1"&gt;'2025-12'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;midday_metrics_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net_revenue_retention&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_minutes&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_data_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_is_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms whether retention issues stem from product adoption rather than from pricing or billing factors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls7jbnpwjahvwvoc6b0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls7jbnpwjahvwvoc6b0x.png" alt="mindsdb" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Web Apps Like Midday.ai Using Supabase
&lt;/h2&gt;

&lt;p&gt;Web apps like Midday.ai are evolving from static dashboards into systems that help teams reason about financial health. By pairing Supabase as the source of truth with MindsDB Knowledge Bases and Hybrid Search, these platforms can deliver real-time, conversational financial insights that remain accurate and explainable. This represents a shift toward AI-native analytics that finance teams can trust and act on.&lt;/p&gt;

&lt;p&gt;At MindsDB, we believe getting real-time answers should be simple. If your AI analytics needs are more complex, we’re here to help. &lt;a href="https://mindsdb.com/contact" rel="noopener noreferrer"&gt;Contact our team&lt;/a&gt;, which understands that AI analytics demands precision and scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: From Dashboards to Dialogue
&lt;/h2&gt;

&lt;p&gt;The future of finance analytics isn't more charts or denser dashboards.&lt;/p&gt;

&lt;p&gt;It’s dialogue.&lt;/p&gt;

&lt;p&gt;A continuous conversation between the business, its data, and the people responsible for decisions. Supabase provides the foundation. MindsDB provides the intelligence. Knowledge Bases and Hybrid Search ensure that intelligence operates safely.&lt;/p&gt;

&lt;p&gt;Together, they point toward a world where finance teams spend less time navigating tools and more time understanding what's really happening in their business.&lt;/p&gt;

&lt;p&gt;And that's where meaningful decisions begin.&lt;/p&gt;

</description>
      <category>supabase</category>
      <category>ai</category>
      <category>sql</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Building a Semantic Search Knowledge Base with MindsDB</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Wed, 28 Jan 2026 16:30:26 +0000</pubDate>
      <link>https://dev.to/mindsdb/building-a-semantic-search-knowledge-base-with-mindsdb-5107</link>
      <guid>https://dev.to/mindsdb/building-a-semantic-search-knowledge-base-with-mindsdb-5107</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Andriy Burkov, Ph.D. &amp;amp; Author, MindsDB Advisor&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What happens when a developer searches for "how to make async HTTP calls" but your documentation says "asynchronous network requests"? Traditional keyword search fails—even though the content is exactly what they need.&lt;/p&gt;

&lt;p&gt;This is the fundamental limitation of keyword search: it matches words, not meaning.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll build a semantic search system using MindsDB that understands user intent. Using 2 million Stack Overflow posts, we'll create knowledge bases with two different vector storage backends—&lt;strong&gt;pgvector&lt;/strong&gt; and &lt;strong&gt;FAISS&lt;/strong&gt;—and compare their performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How MindsDB knowledge bases convert text into searchable vectors&lt;/li&gt;
&lt;li&gt;Setting up pgvector (PostgreSQL-based) and FAISS (Facebook AI Similarity Search) storage&lt;/li&gt;
&lt;li&gt;Combining semantic search with metadata filters&lt;/li&gt;
&lt;li&gt;Building an AI agent that uses your knowledge base to answer questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A MindsDB account (cloud or self-hosted)&lt;/li&gt;
&lt;li&gt;PostgreSQL database with the Stack Overflow dataset&lt;/li&gt;
&lt;li&gt;An OpenAI API key for embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Semantic Search Works
&lt;/h2&gt;

&lt;p&gt;Before we dive in, let's understand the key difference between keyword and semantic search:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Keyword Search&lt;/th&gt;
&lt;th&gt;Semantic Search&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Matching&lt;/td&gt;
&lt;td&gt;Exact words&lt;/td&gt;
&lt;td&gt;Meaning/intent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query: "async HTTP"&lt;/td&gt;
&lt;td&gt;Misses "asynchronous requests"&lt;/td&gt;
&lt;td&gt;Finds both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles synonyms&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Understands context&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Semantic search works by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: Converting text into numerical vectors using an embedding model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storing&lt;/strong&gt;: Saving these vectors in a vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Querying&lt;/strong&gt;: Converting the search query to a vector and finding the closest matches&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;MindsDB handles all of this through its Knowledge Base abstraction.&lt;/p&gt;
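
&lt;p&gt;For intuition only, the three steps above can be sketched in plain Python. The vectors here are hand-written toy values standing in for real embeddings, and the dictionary stands in for a vector database; this is an illustration of the idea, not how MindsDB implements its Knowledge Bases.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Steps 1 and 2 (embed and store): toy vectors keyed by the text they represent
store = {
    "asynchronous network requests": [0.9, 0.1, 0.0],
    "sorting a list of tuples": [0.0, 0.2, 0.9],
}

# Step 3 (query): a made-up vector for "how to make async HTTP calls"
query_vector = [0.8, 0.2, 0.1]

# The closest stored vector wins, even though no keywords overlap
best_match = max(store, key=lambda text: cosine_similarity(store[text], query_vector))
print(best_match)  # asynchronous network requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A real embedding model produces vectors with hundreds or thousands of dimensions, but the ranking principle is the same.&lt;/p&gt;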

&lt;h2&gt;
  
  
  Installing Dependencies
&lt;/h2&gt;

&lt;p&gt;We need two packages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mindsdb_sdk&lt;/strong&gt;: Python client for interacting with MindsDB servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pandas&lt;/strong&gt;: For working with query results as DataFrames
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Connecting to the MindsDB Cloud Instance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to your MindsDB instance
&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_MINDSDB_URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# e.g., 'https://cloud.mindsdb.com' for MindsDB Cloud
&lt;/span&gt;    &lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_USERNAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connected to MindsDB server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connected to MindsDB server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  4. Connecting to the Data Source
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;success_msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query executed successfully&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute a SQL query and handle &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;already exists&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; errors gracefully.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success_msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;already exists&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resource already exists - skipping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to your PostgreSQL database containing Stack Overflow data
&lt;/span&gt;&lt;span class="nf"&gt;run_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE DATABASE pg_sample
    WITH ENGINE = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    PARAMETERS = {
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PG_USER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PG_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PG_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created pg_sample database connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Created pg_sample database connection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
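
&lt;p&gt;If you would rather not hardcode credentials in a notebook, one option is to assemble the same statement from environment variables. This is a minimal sketch: the helper name &lt;code&gt;build_create_database_sql&lt;/code&gt; and the variable names &lt;code&gt;PG_USER&lt;/code&gt;, &lt;code&gt;PG_PASSWORD&lt;/code&gt;, and &lt;code&gt;PG_HOST&lt;/code&gt; are assumptions made for illustration, not part of MindsDB.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os


def build_create_database_sql(name, params):
    """Render a CREATE DATABASE statement for the postgres engine from a dict."""
    return (
        f"CREATE DATABASE {name}\n"
        'WITH ENGINE = "postgres",\n'
        f"PARAMETERS = {json.dumps(params, indent=4)}"
    )


# Hypothetical environment variable names; the placeholders are used when unset
params = {
    "user": os.environ.get("PG_USER", "YOUR_PG_USER"),
    "password": os.environ.get("PG_PASSWORD", "YOUR_PG_PASSWORD"),
    "host": os.environ.get("PG_HOST", "YOUR_PG_HOST"),
    "port": "5432",
    "database": "sample",
}

sql = build_create_database_sql("pg_sample", params)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The resulting string can then be passed to the &lt;code&gt;run_query&lt;/code&gt; helper in place of the inline statement. Avoid printing it, since it contains the password.&lt;/p&gt;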

&lt;p&gt;Let's verify the connection by exploring the data. Check the dataset size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get total row count
&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) as cnt FROM pg_sample.stackoverflow_2m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset size: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cnt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dataset size: 2,000,000 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Show 10 records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Test sample data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM pg_sample.stackoverflow_2m LIMIT 10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Display as a nice table (in Jupyter notebooks)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;PostTypeId&lt;/th&gt;
      &lt;th&gt;AcceptedAnswerId&lt;/th&gt;
      &lt;th&gt;ParentId&lt;/th&gt;
      &lt;th&gt;Score&lt;/th&gt;
      &lt;th&gt;ViewCount&lt;/th&gt;
      &lt;th&gt;Body&lt;/th&gt;
      &lt;th&gt;Title&lt;/th&gt;
      &lt;th&gt;ContentLicense&lt;/th&gt;
      &lt;th&gt;FavoriteCount&lt;/th&gt;
      &lt;th&gt;CreationDate&lt;/th&gt;
      &lt;th&gt;LastActivityDate&lt;/th&gt;
      &lt;th&gt;LastEditDate&lt;/th&gt;
      &lt;th&gt;LastEditorUserId&lt;/th&gt;
      &lt;th&gt;OwnerUserId&lt;/th&gt;
      &lt;th&gt;Tags&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;4.0&lt;/td&gt;
      &lt;td&gt;522&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;An explicit cast to `double` like this isn't n...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;CC BY-SA 4.0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2008-07-31T22:17:57.883&lt;/td&gt;
      &lt;td&gt;2019-10-21T14:03:54.607&lt;/td&gt;
      &lt;td&gt;2019-10-21T14:03:54.607&lt;/td&gt;
      &lt;td&gt;5496973.0&lt;/td&gt;
      &lt;td&gt;9.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1404.0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2199&lt;/td&gt;
      &lt;td&gt;784860.0&lt;/td&gt;
      &lt;td&gt;Given a `DateTime` representing a person's bir...&lt;/td&gt;
      &lt;td&gt;How do I calculate someone's age based on a Da...&lt;/td&gt;
      &lt;td&gt;CC BY-SA 4.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2008-07-31T23:40:59.743&lt;/td&gt;
      &lt;td&gt;2023-02-02T18:38:32.613&lt;/td&gt;
      &lt;td&gt;2022-07-27T22:34:36.320&lt;/td&gt;
      &lt;td&gt;3524942.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;c#,.net,datetime&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1248.0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;1644&lt;/td&gt;
      &lt;td&gt;197314.0&lt;/td&gt;
      &lt;td&gt;Given a specific `DateTime` value, how do I di...&lt;/td&gt;
      &lt;td&gt;Calculate relative time in C#&lt;/td&gt;
      &lt;td&gt;CC BY-SA 4.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2008-07-31T23:55:37.967&lt;/td&gt;
      &lt;td&gt;2022-09-05T11:26:30.187&lt;/td&gt;
      &lt;td&gt;2022-07-10T00:19:55.237&lt;/td&gt;
      &lt;td&gt;16790137.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;c#,datetime,time,datediff,relative-time-span&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;491&lt;/td&gt;
      &lt;td&gt;173083.0&lt;/td&gt;
      &lt;td&gt;What is the difference between [Math.Floor()](...&lt;/td&gt;
      &lt;td&gt;Difference between Math.Floor() and Math.Trunc...&lt;/td&gt;
      &lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2008-08-01T00:59:11.177&lt;/td&gt;
      &lt;td&gt;2022-04-22T08:59:43.817&lt;/td&gt;
      &lt;td&gt;2017-02-25T17:42:17.810&lt;/td&gt;
      &lt;td&gt;6495084.0&lt;/td&gt;
      &lt;td&gt;11.0&lt;/td&gt;
      &lt;td&gt;.net,math&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;31.0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;319&lt;/td&gt;
      &lt;td&gt;23465.0&lt;/td&gt;
      &lt;td&gt;I have an absolutely positioned `div` containi...&lt;/td&gt;
      &lt;td&gt;Why did the width collapse in the percentage w...&lt;/td&gt;
      &lt;td&gt;CC BY-SA 4.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2008-07-31T22:08:08.620&lt;/td&gt;
      &lt;td&gt;2021-01-29T18:46:45.963&lt;/td&gt;
      &lt;td&gt;2021-01-29T18:46:45.963&lt;/td&gt;
      &lt;td&gt;9134576.0&lt;/td&gt;
      &lt;td&gt;9.0&lt;/td&gt;
      &lt;td&gt;html,css,internet-explorer-7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;11.0&lt;/td&gt;
      &lt;td&gt;347&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Here's how I do it\n\n```\nvar ts = new TimeSp...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;CC BY-SA 4.0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2008-07-31T23:56:41.303&lt;/td&gt;
      &lt;td&gt;2020-06-13T10:30:44.397&lt;/td&gt;
      &lt;td&gt;2020-06-13T10:30:44.397&lt;/td&gt;
      &lt;td&gt;238419.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;701&lt;/td&gt;
      &lt;td&gt;277780.0&lt;/td&gt;
      &lt;td&gt;Is there a standard way for a web server to be...&lt;/td&gt;
      &lt;td&gt;Determine a user's timezone&lt;/td&gt;
      &lt;td&gt;CC BY-SA 4.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2008-08-01T00:42:38.903&lt;/td&gt;
      &lt;td&gt;2022-03-29T07:31:31.320&lt;/td&gt;
      &lt;td&gt;2020-12-03T03:37:56.313&lt;/td&gt;
      &lt;td&gt;584192.0&lt;/td&gt;
      &lt;td&gt;9.0&lt;/td&gt;
      &lt;td&gt;html,browser,timezone,user-agent,timezone-offset&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;7.0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;794&lt;/td&gt;
      &lt;td&gt;70633.0&lt;/td&gt;
      &lt;td&gt;I want to assign the decimal variable "trans" ...&lt;/td&gt;
      &lt;td&gt;How to convert Decimal to Double in C#?&lt;/td&gt;
      &lt;td&gt;CC BY-SA 4.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2008-07-31T21:42:52.667&lt;/td&gt;
      &lt;td&gt;2022-09-08T05:07:26.033&lt;/td&gt;
      &lt;td&gt;2022-09-08T05:07:26.033&lt;/td&gt;
      &lt;td&gt;16124033.0&lt;/td&gt;
      &lt;td&gt;8.0&lt;/td&gt;
      &lt;td&gt;c#,floating-point,type-conversion,double,decimal&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;17&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;26.0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;198&lt;/td&gt;
      &lt;td&gt;85547.0&lt;/td&gt;
      &lt;td&gt;How do I store binary data in [MySQL](http://e...&lt;/td&gt;
      &lt;td&gt;Binary Data in MySQL&lt;/td&gt;
      &lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2008-08-01T05:09:55.993&lt;/td&gt;
      &lt;td&gt;2020-12-03T03:37:51.763&lt;/td&gt;
      &lt;td&gt;2020-12-03T03:37:51.763&lt;/td&gt;
      &lt;td&gt;584192.0&lt;/td&gt;
      &lt;td&gt;2.0&lt;/td&gt;
      &lt;td&gt;mysql,database,binary-data,data-storage&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;49.0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;193&lt;/td&gt;
      &lt;td&gt;101180.0&lt;/td&gt;
      &lt;td&gt;If I have a trigger before the update on a tab...&lt;/td&gt;
      &lt;td&gt;Throw an error preventing a table update in a ...&lt;/td&gt;
      &lt;td&gt;CC BY-SA 4.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2008-08-01T12:12:19.350&lt;/td&gt;
      &lt;td&gt;2021-01-29T12:57:17.153&lt;/td&gt;
      &lt;td&gt;2021-01-29T12:57:17.153&lt;/td&gt;
      &lt;td&gt;14152908.0&lt;/td&gt;
      &lt;td&gt;22.0&lt;/td&gt;
      &lt;td&gt;mysql,database,triggers&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Stack Overflow dataset contains 2 million posts, covering both questions (&lt;code&gt;PostTypeId=1&lt;/code&gt;) and answers (&lt;code&gt;PostTypeId=2&lt;/code&gt;). Key columns include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id&lt;/code&gt; - Unique identifier for each post&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Body&lt;/code&gt; - The content we'll make semantically searchable&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Title&lt;/code&gt; - The title of the post (questions only)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Tags&lt;/code&gt; - Programming language and topic tags (e.g., &lt;code&gt;python&lt;/code&gt;, &lt;code&gt;javascript&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Score&lt;/code&gt; - Community voting score, useful for prioritizing high-quality content&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ViewCount&lt;/code&gt; - Popularity metric for filtering&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PostTypeId&lt;/code&gt; - Type of post (1=question, 2=answer)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AcceptedAnswerId&lt;/code&gt; - ID of the accepted answer (for questions)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CreationDate&lt;/code&gt;, &lt;code&gt;LastActivityDate&lt;/code&gt;, &lt;code&gt;LastEditDate&lt;/code&gt; - Timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This rich metadata lets us combine semantic search with traditional filters, for example finding Python questions about async programming with a score above 10.&lt;/p&gt;
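
&lt;p&gt;As a rough illustration of that combination, the sketch below assembles a query that pairs a semantic condition on &lt;code&gt;content&lt;/code&gt; with filters on &lt;code&gt;Tags&lt;/code&gt; and &lt;code&gt;Score&lt;/code&gt;. The helper &lt;code&gt;build_kb_query&lt;/code&gt; is hypothetical and only builds a string. It assumes the knowledge base accepts a numeric comparison on a metadata column in the same way it accepts the &lt;code&gt;LIKE&lt;/code&gt; filter used in the query section of this tutorial.&lt;/p&gt;

```python
def build_kb_query(kb_name: str, search_text: str, tag: str, min_score: int, limit: int = 10) -> str:
    """Build a knowledge base query that mixes semantic search with metadata filters.

    Hypothetical helper: it only formats a string and does not talk to MindsDB.
    """
    # Double any single quotes so the values cannot break out of the SQL literals.
    safe_text = search_text.replace("'", "''")
    safe_tag = tag.replace("'", "''")
    return (
        f"SELECT * FROM {kb_name} "
        f"WHERE content = '{safe_text}' "   # semantic part: matched by embedding similarity
        f"AND Tags LIKE '%{safe_tag}%' "    # metadata filter on the tag list
        f"AND Score > {int(min_score)} "    # metadata filter on the vote score (assumed supported)
        f"LIMIT {int(limit)}"
    )


query = build_kb_query("kb_stack_vector", "async programming", "python", 10)
print(query)
```

&lt;p&gt;The resulting string could be passed to the same &lt;code&gt;server.query(...)&lt;/code&gt; call used elsewhere in this tutorial once the knowledge base has been created and loaded.&lt;/p&gt;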

&lt;h2&gt;
  
  
  4. Setting Up Vector Storage Backends
&lt;/h2&gt;

&lt;p&gt;MindsDB supports multiple vector storage options. We'll set up both pgvector and the recently added FAISS backend, then compare their query speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  pgvector (PostgreSQL Extension)
&lt;/h3&gt;

&lt;p&gt;pgvector is a PostgreSQL extension for vector similarity search. It's ideal when you want to keep vectors alongside your relational data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Create pgvector database connection
run_query("""
    CREATE DATABASE pg_vector
    WITH ENGINE = "pgvector",
    PARAMETERS = {
        "user": "YOUR_PG_USER",
        "password": "YOUR_PG_PASSWORD",
        "host": "YOUR_PG_HOST",
        "port": "5432",
        "database": "vector"
    }
""", "Created pg_vector database connection")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Created pg_vector database connection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
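
&lt;p&gt;To keep credentials out of the notebook, one option is to read them from environment variables and build the statement from those. The sketch below is a minimal example under that assumption: the variable names (&lt;code&gt;PG_USER&lt;/code&gt;, &lt;code&gt;PG_PASSWORD&lt;/code&gt;, &lt;code&gt;PG_HOST&lt;/code&gt;, &lt;code&gt;PG_PORT&lt;/code&gt;, &lt;code&gt;PG_DATABASE&lt;/code&gt;) and the helper &lt;code&gt;build_pgvector_statement&lt;/code&gt; are hypothetical, and the output mirrors the statement shown above.&lt;/p&gt;

```python
import json
import os


def build_pgvector_statement(env=os.environ) -> str:
    """Build the CREATE DATABASE statement from environment variables.

    The variable names are illustrative; use whatever your deployment defines.
    """
    params = {
        "user": env["PG_USER"],
        "password": env["PG_PASSWORD"],
        "host": env["PG_HOST"],
        "port": env.get("PG_PORT", "5432"),
        "database": env.get("PG_DATABASE", "vector"),
    }
    # json.dumps produces the double-quoted key/value form used in the tutorial.
    return (
        "CREATE DATABASE pg_vector\n"
        'WITH ENGINE = "pgvector",\n'
        f"PARAMETERS = {json.dumps(params, indent=4)}"
    )


# Example with placeholder values; real values would come from the shell environment.
example_env = {"PG_USER": "demo", "PG_PASSWORD": "secret", "PG_HOST": "localhost"}
statement = build_pgvector_statement(example_env)
print(statement)
```

&lt;p&gt;The generated string can then be handed to &lt;code&gt;run_query&lt;/code&gt; in place of the hard-coded statement.&lt;/p&gt;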
&lt;h3&gt;
  
  
  FAISS (Facebook AI Similarity Search)
&lt;/h3&gt;

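&lt;p&gt;FAISS is a similarity search library developed by Facebook AI Research and optimized for fast lookups over large vector collections. Here it is connected through the &lt;code&gt;duckdb_faiss&lt;/code&gt; engine, which persists its data to a local directory.&lt;/p&gt;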
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Create FAISS database connection
run_query("""
    CREATE DATABASE db_faiss
    WITH ENGINE = 'duckdb_faiss',
    PARAMETERS = {
        "persist_directory": "/home/ubuntu/faiss"
    }
""", "Created db_faiss database connection")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Created db_faiss database connection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Choosing Between pgvector and FAISS
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;FAISS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integration with existing PostgreSQL&lt;/td&gt;
&lt;td&gt;Maximum query speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native PostgreSQL storage&lt;/td&gt;
&lt;td&gt;File-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good (PostgreSQL limits)&lt;/td&gt;
&lt;td&gt;Excellent (billions of vectors)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires PostgreSQL extension&lt;/td&gt;
&lt;td&gt;Standalone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good (~19s for 2M vectors)&lt;/td&gt;
&lt;td&gt;Excellent (~5s for 2M vectors)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For this tutorial, we'll implement both so you can see the performance difference firsthand.&lt;/p&gt;
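
&lt;p&gt;A single timed run can be noisy, so when comparing backends it may help to repeat each query a few times and look at the median. The sketch below shows one way to do that with the standard library. &lt;code&gt;time_call&lt;/code&gt; is a hypothetical helper, and the workload passed to it here is only a stand-in for a function that runs one search query.&lt;/p&gt;

```python
import statistics
import time


def time_call(fn, repeats: int = 5) -> dict:
    """Run fn several times and summarize the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in workload; in the tutorial this would be a call that runs one search query.
summary = time_call(lambda: sum(range(100_000)), repeats=3)
print(f"median: {summary['median']:.4f}s over {summary['runs']} runs")
```

&lt;p&gt;Reporting the minimum and maximum alongside the median makes it easier to spot runs affected by caching or a cold connection.&lt;/p&gt;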
&lt;h2&gt;
  
  
  5. Creating Knowledge Bases
&lt;/h2&gt;

&lt;p&gt;We now have a table of relational data and two vector stores for the embeddings, so we can create a knowledge base on each storage backend.&lt;/p&gt;

&lt;p&gt;The knowledge base will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; model for generating embeddings&lt;/li&gt;
&lt;li&gt;Store the post &lt;code&gt;Body&lt;/code&gt; as searchable content&lt;/li&gt;
&lt;li&gt;Include metadata fields for filtering results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Knowledge Base with pgvector Storage
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
def kb_exists(kb_name):
    """Check if a knowledge base already exists."""
    try:
        result = server.query("SELECT name FROM information_schema.knowledge_bases").fetch()
        return kb_name in result['name'].values
    except Exception:
        return False

# Create pgvector knowledge base
if kb_exists("kb_stack_vector"):
    print("kb_stack_vector already exists - skipping creation")
else:
    run_query("""
        CREATE KNOWLEDGE_BASE kb_stack_vector
        USING
            storage = pg_vector.stack,
            embedding_model = {
                "provider": "openai",
                "model_name": "text-embedding-3-small"
            },
            content_columns = ['Body'],
            metadata_columns = [
                "PostTypeId",
                "AcceptedAnswerId",
                "ParentId",
                "Score",
                "ViewCount",
                "Title",
                "ContentLicense",
                "FavoriteCount",
                "CreationDate",
                "LastActivityDate",
                "LastEditDate",
                "LastEditorUserId",
                "OwnerUserId",
                "Tags"
            ]
    """, "Created kb_stack_vector knowledge base")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Created kb_stack_vector knowledge base
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Knowledge Base with FAISS Storage
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Create FAISS knowledge base
if kb_exists("kb_stack_faiss"):
    print("kb_stack_faiss already exists - skipping creation")
else:
    run_query("""
        CREATE KNOWLEDGE_BASE kb_stack_faiss
        USING
            storage = db_faiss.stack,
            embedding_model = {
                "provider": "openai",
                "model_name": "text-embedding-3-small"
            },
            content_columns = ['Body'],
            metadata_columns = [
                "PostTypeId",
                "AcceptedAnswerId",
                "ParentId",
                "Score",
                "ViewCount",
                "Title",
                "ContentLicense",
                "FavoriteCount",
                "CreationDate",
                "LastActivityDate",
                "LastEditDate",
                "LastEditorUserId",
                "OwnerUserId",
                "Tags"
            ]
    """, "Created kb_stack_faiss knowledge base")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Created kb_stack_faiss knowledge base
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Understanding the Parameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;storage&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Specifies the vector database connection and table name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;embedding_model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configuration for the embedding model (provider and model name)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content_columns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Columns to embed and make semantically searchable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metadata_columns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Columns available for filtering (not embedded, but stored)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
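
&lt;p&gt;Because the two knowledge bases differ only in name and &lt;code&gt;storage&lt;/code&gt;, the statement can be generated from one shared definition so the column lists cannot drift apart. The sketch below is one possible way to do this. &lt;code&gt;build_create_kb&lt;/code&gt; and &lt;code&gt;METADATA_COLUMNS&lt;/code&gt; are hypothetical names, and the sketch assumes double-quoted list items are accepted for &lt;code&gt;content_columns&lt;/code&gt; just as they are for &lt;code&gt;metadata_columns&lt;/code&gt; in the statements above.&lt;/p&gt;

```python
import json

# Same metadata columns as in the CREATE KNOWLEDGE_BASE statements above.
METADATA_COLUMNS = [
    "PostTypeId", "AcceptedAnswerId", "ParentId", "Score", "ViewCount",
    "Title", "ContentLicense", "FavoriteCount", "CreationDate",
    "LastActivityDate", "LastEditDate", "LastEditorUserId", "OwnerUserId", "Tags",
]


def build_create_kb(kb_name: str, storage: str) -> str:
    """Return a CREATE KNOWLEDGE_BASE statement for the given storage table."""
    embedding_model = {"provider": "openai", "model_name": "text-embedding-3-small"}
    return (
        f"CREATE KNOWLEDGE_BASE {kb_name}\n"
        "USING\n"
        f"    storage = {storage},\n"
        f"    embedding_model = {json.dumps(embedding_model)},\n"
        f"    content_columns = {json.dumps(['Body'])},\n"
        f"    metadata_columns = {json.dumps(METADATA_COLUMNS)}"
    )


statements = {
    "kb_stack_vector": build_create_kb("kb_stack_vector", "pg_vector.stack"),
    "kb_stack_faiss": build_create_kb("kb_stack_faiss", "db_faiss.stack"),
}
print(statements["kb_stack_faiss"])
```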
&lt;h2&gt;
  
  
  6. Loading Data into Knowledge Bases
&lt;/h2&gt;

&lt;p&gt;Now we'll insert the Stack Overflow data into our knowledge bases. This process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetches data from the source table in batches&lt;/li&gt;
&lt;li&gt;Generates embeddings for content columns using the OpenAI API&lt;/li&gt;
&lt;li&gt;Stores vectors and metadata in the vector database&lt;/li&gt;
&lt;/ol&gt;
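
&lt;p&gt;To get a feel for the scale of the first step, the sketch below works out how many batches a load of this size produces. It is only an illustration of batching arithmetic, not MindsDB's internal implementation. &lt;code&gt;iter_batches&lt;/code&gt; is a hypothetical helper, the row count is the rounded 2 million figure quoted for the dataset, and the batch size is the value used in this tutorial's &lt;code&gt;INSERT&lt;/code&gt; statements.&lt;/p&gt;

```python
import math


def iter_batches(total_rows: int, batch_size: int):
    """Yield (start, end) row offsets for each batch; end is exclusive."""
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)


total_rows = 2_000_000   # rounded size of the Stack Overflow sample
batch_size = 1000        # matches the batch_size used in this tutorial's INSERT

batches = list(iter_batches(total_rows, batch_size))
print(len(batches))                         # number of batches
print(math.ceil(total_rows / batch_size))   # same count, computed directly
print(batches[0], batches[-1])              # first and last offset ranges
```

&lt;p&gt;Each batch also triggers embedding requests for its content, so the batch count gives a lower bound on how many round trips the load involves.&lt;/p&gt;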
&lt;h3&gt;
  
  
  Loading Data into pgvector Knowledge Base
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
def is_kb_empty(kb_name):
    """Check if a knowledge base is empty (fast - only fetches 1 row)."""
    result = server.query(f"SELECT id FROM {kb_name} LIMIT 1").fetch()
    return len(result) == 0

if is_kb_empty("kb_stack_vector"):
    print("kb_stack_vector is empty - starting data insertion...")
    server.query("""
        INSERT INTO kb_stack_vector
        SELECT * FROM pg_sample.stackoverflow_2m 
        USING 
            batch_size = 1000, 
            track_column = id
    """).fetch()
    print("Data insertion started for kb_stack_vector")
else:
    print("kb_stack_vector is not empty - skipping data insertion")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data insertion started for kb_stack_vector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Loading Data into FAISS Knowledge Base
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
if is_kb_empty("kb_stack_faiss"):
    print("kb_stack_faiss is empty - starting data insertion...")
    server.query("""
        INSERT INTO kb_stack_faiss
        SELECT * FROM pg_sample.stackoverflow_2m 
        USING 
            batch_size = 1000, 
            track_column = id
    """).fetch()
    print("Data insertion started for kb_stack_faiss")
else:
    print("kb_stack_faiss is not empty - skipping data insertion")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data insertion started for kb_stack_faiss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

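&lt;p&gt;Wait until the data insertion is complete before running the queries in the next section; a partially loaded knowledge base returns incomplete results.&lt;/p&gt;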
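
&lt;p&gt;One simple way to wait is to poll a row count until it stops changing. The sketch below shows that pattern in isolation. &lt;code&gt;wait_until_stable&lt;/code&gt; and &lt;code&gt;count_fn&lt;/code&gt; are hypothetical: &lt;code&gt;count_fn&lt;/code&gt; stands in for whatever call returns the current number of rows in the knowledge base, which this tutorial does not specify, so a simulated counter is used here.&lt;/p&gt;

```python
import time


def wait_until_stable(count_fn, interval_seconds: float = 30.0, stable_checks: int = 3) -> int:
    """Poll count_fn until it returns the same value stable_checks times in a row."""
    last = count_fn()
    unchanged = 1
    while stable_checks > unchanged:
        time.sleep(interval_seconds)
        current = count_fn()
        # Reset the streak whenever the count moves; otherwise extend it.
        unchanged = unchanged + 1 if current == last else 1
        last = current
    return last


# Stand-in counter that simulates a load finishing at 3000 rows.
simulated = iter([1000, 2000, 3000, 3000, 3000])
final_count = wait_until_stable(lambda: next(simulated), interval_seconds=0.0)
print(final_count)
```

&lt;p&gt;A stalled load looks the same as a finished one to this check, so it is worth comparing the final count against the expected size of the source table.&lt;/p&gt;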

&lt;h2&gt;
  
  
  7. Querying the Knowledge Bases
&lt;/h2&gt;

&lt;p&gt;Once data is loaded, you can perform semantic searches combined with metadata filtering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Semantic Search
&lt;/h3&gt;

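&lt;p&gt;Search for content related to "8-bit music", restricted to posts tagged &lt;code&gt;python&lt;/code&gt;. The &lt;code&gt;content&lt;/code&gt; condition matches by semantic similarity rather than exact keywords, and each query is timed so the two backends can be compared:&lt;/p&gt;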

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import time

# Semantic search on pgvector KB
start = time.time()
results_vector = server.query("""
    SELECT * FROM kb_stack_vector 
    WHERE content = '8-bit music'
    AND Tags LIKE '%python%'
    LIMIT 10
""").fetch()
elapsed_vector = time.time() - start
print(f"pgvector query time: {elapsed_vector:.2f} seconds")
display(results_vector)

# Semantic search on FAISS KB
start = time.time()
results_faiss = server.query("""
    SELECT * FROM kb_stack_faiss 
    WHERE content = '8-bit music'
    AND Tags LIKE '%python%'
    LIMIT 10
""").fetch()
elapsed_faiss = time.time() - start
print(f"FAISS query time: {elapsed_faiss:.2f} seconds")
display(results_faiss)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgvector query time: 19.21 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;chunk_id&lt;/th&gt;
      &lt;th&gt;chunk_content&lt;/th&gt;
      &lt;th&gt;distance&lt;/th&gt;
      &lt;th&gt;relevance&lt;/th&gt;
      &lt;th&gt;ContentLicense&lt;/th&gt;
      &lt;th&gt;ViewCount&lt;/th&gt;
      &lt;th&gt;LastEditDate&lt;/th&gt;
      &lt;th&gt;Score&lt;/th&gt;
      &lt;th&gt;AcceptedAnswerId&lt;/th&gt;
      &lt;th&gt;OwnerUserId&lt;/th&gt;
      &lt;th&gt;LastActivityDate&lt;/th&gt;
      &lt;th&gt;Tags&lt;/th&gt;
      &lt;th&gt;LastEditorUserId&lt;/th&gt;
      &lt;th&gt;PostTypeId&lt;/th&gt;
      &lt;th&gt;ParentId&lt;/th&gt;
      &lt;th&gt;Title&lt;/th&gt;
      &lt;th&gt;FavoriteCount&lt;/th&gt;
      &lt;th&gt;CreationDate&lt;/th&gt;
      &lt;th&gt;metadata&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;1118266&lt;/td&gt;
      &lt;td&gt;1118266:Body:1of2:0to971&lt;/td&gt;
      &lt;td&gt;Im trying to engineer in python a way of trans...&lt;/td&gt;
      &lt;td&gt;0.605447&lt;/td&gt;
      &lt;td&gt;0.622879&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;1694.0&lt;/td&gt;
      &lt;td&gt;2009-07-13T08:32:20.797&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2010-03-17T15:16:17.060&lt;/td&gt;
      &lt;td&gt;python,audio&lt;/td&gt;
      &lt;td&gt;12855.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;List of values to a sound file&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-07-13T08:27:25.393&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,audio', 'Score': 0, 'Title': ...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;974071&lt;/td&gt;
      &lt;td&gt;974071:Body:1of1:0to791&lt;/td&gt;
      &lt;td&gt;I have a mosquito problem in my house. This wo...&lt;/td&gt;
      &lt;td&gt;0.615257&lt;/td&gt;
      &lt;td&gt;0.619097&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;55695.0&lt;/td&gt;
      &lt;td&gt;2017-05-23T12:32:21.507&lt;/td&gt;
      &lt;td&gt;44&lt;/td&gt;
      &lt;td&gt;974291.0&lt;/td&gt;
      &lt;td&gt;51197.0&lt;/td&gt;
      &lt;td&gt;2020-02-12T22:24:39.977&lt;/td&gt;
      &lt;td&gt;python,audio,mp3,frequency&lt;/td&gt;
      &lt;td&gt;-1.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Python library for playing fixed-frequency sound&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2009-06-10T07:05:02.037&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,audio,mp3,frequency', 'Score'...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;1967040&lt;/td&gt;
      &lt;td&gt;1967040:Body:1of1:0to224&lt;/td&gt;
      &lt;td&gt;I am confused because there are a lot of progr...&lt;/td&gt;
      &lt;td&gt;0.626904&lt;/td&gt;
      &lt;td&gt;0.614665&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;6615.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;1968691.0&lt;/td&gt;
      &lt;td&gt;237934.0&lt;/td&gt;
      &lt;td&gt;2021-08-10T10:40:59.217&lt;/td&gt;
      &lt;td&gt;python,audio&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;How can i create a melody? Is there any sound-...&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2009-12-27T21:04:34.243&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,audio', 'Score': 7, 'Title': ...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1118266&lt;/td&gt;
      &lt;td&gt;1118266:Body:2of2:972to1430&lt;/td&gt;
      &lt;td&gt;The current solution I'm thinking of involves ...&lt;/td&gt;
      &lt;td&gt;0.627442&lt;/td&gt;
      &lt;td&gt;0.614461&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;1694.0&lt;/td&gt;
      &lt;td&gt;2009-07-13T08:32:20.797&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2010-03-17T15:16:17.060&lt;/td&gt;
      &lt;td&gt;python,audio&lt;/td&gt;
      &lt;td&gt;12855.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;List of values to a sound file&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-07-13T08:27:25.393&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,audio', 'Score': 0, 'Title': ...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;1344884&lt;/td&gt;
      &lt;td&gt;1344884:Body:1of1:0to327&lt;/td&gt;
      &lt;td&gt;I want to learn how to program a music applica...&lt;/td&gt;
      &lt;td&gt;0.643957&lt;/td&gt;
      &lt;td&gt;0.608289&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2205.0&lt;/td&gt;
      &lt;td&gt;2017-05-23T12:11:22.607&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;1346272.0&lt;/td&gt;
      &lt;td&gt;164623.0&lt;/td&gt;
      &lt;td&gt;2022-04-14T09:12:07.197&lt;/td&gt;
      &lt;td&gt;python,perl,waveform&lt;/td&gt;
      &lt;td&gt;-1.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Programming a Self Learning Music Maker&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2009-08-28T03:28:03.937&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,perl,waveform', 'Score': 7, '...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;2376505&lt;/td&gt;
      &lt;td&gt;2376505:Body:1of2:0to968&lt;/td&gt;
      &lt;td&gt;Write a function called listenToPicture that t...&lt;/td&gt;
      &lt;td&gt;0.645214&lt;/td&gt;
      &lt;td&gt;0.607824&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;3058.0&lt;/td&gt;
      &lt;td&gt;2010-03-04T02:28:26.703&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;285922.0&lt;/td&gt;
      &lt;td&gt;2010-03-06T05:27:48.017&lt;/td&gt;
      &lt;td&gt;python,image,audio&lt;/td&gt;
      &lt;td&gt;34397.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;How do I loop through every 4th pixel in every...&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2010-03-04T02:26:22.603&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,image,audio', 'Score': 0, 'Ti...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;2226853&lt;/td&gt;
      &lt;td&gt;2226853:Body:1of1:0to877&lt;/td&gt;
      &lt;td&gt;I'm trying to write a program to display PCM d...&lt;/td&gt;
      &lt;td&gt;0.654162&lt;/td&gt;
      &lt;td&gt;0.604536&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;12425.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;2226907.0&lt;/td&gt;
      &lt;td&gt;210920.0&lt;/td&gt;
      &lt;td&gt;2015-07-25T11:16:16.747&lt;/td&gt;
      &lt;td&gt;python,audio,pcm&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Interpreting WAV Data&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2010-02-09T05:01:25.703&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,audio,pcm', 'Score': 7, 'Titl...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;1561104&lt;/td&gt;
      &lt;td&gt;1561104:Body:1of1:0to306&lt;/td&gt;
      &lt;td&gt;Is there a way to do this? Also, I need this t...&lt;/td&gt;
      &lt;td&gt;0.668074&lt;/td&gt;
      &lt;td&gt;0.599494&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;1303.0&lt;/td&gt;
      &lt;td&gt;2020-06-20T09:12:55.060&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1561314.0&lt;/td&gt;
      &lt;td&gt;151377.0&lt;/td&gt;
      &lt;td&gt;2012-01-29T00:01:18.230&lt;/td&gt;
      &lt;td&gt;python,pygame,pitch&lt;/td&gt;
      &lt;td&gt;-1.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Playing sounds with python and changing their ...&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-10-13T15:44:54.267&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,pygame,pitch', 'Score': 1, 'T...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;1382998&lt;/td&gt;
      &lt;td&gt;1382998:Body:4of4:2649to3382&lt;/td&gt;
      &lt;td&gt;```\n¼ éíñ§ÐÌëÑ » ¼ ö ® © ’\n0 1\n2 10\n3 10\n...&lt;/td&gt;
      &lt;td&gt;0.670654&lt;/td&gt;
      &lt;td&gt;0.598568&lt;/td&gt;
      &lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
      &lt;td&gt;12497.0&lt;/td&gt;
      &lt;td&gt;2011-06-09T06:00:51.243&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;1383721.0&lt;/td&gt;
      &lt;td&gt;6946.0&lt;/td&gt;
      &lt;td&gt;2015-06-04T17:13:43.323&lt;/td&gt;
      &lt;td&gt;python,unicode&lt;/td&gt;
      &lt;td&gt;6946.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;latin-1 to ascii&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2009-09-05T10:44:40.167&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,unicode', 'Score': 18, 'Title...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;1837686&lt;/td&gt;
      &lt;td&gt;1837686:Body:1of2:0to950&lt;/td&gt;
      &lt;td&gt;I wish to take a file encoded in UTF-8 that do...&lt;/td&gt;
      &lt;td&gt;0.675999&lt;/td&gt;
      &lt;td&gt;0.596659&lt;/td&gt;
      &lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
      &lt;td&gt;3016.0&lt;/td&gt;
      &lt;td&gt;2011-10-15T13:17:24.520&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2011-10-15T13:17:24.520&lt;/td&gt;
      &lt;td&gt;python,c,utf-8,compression&lt;/td&gt;
      &lt;td&gt;12113.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Compressing UTF-8(or other 8-bit encoding) to ...&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2009-12-03T04:43:05.963&lt;/td&gt;
      &lt;td&gt;{'Tags': 'python,c,utf-8,compression', 'Score'...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
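
&lt;p&gt;The results are returned per chunk, so one post can appear more than once (post &lt;code&gt;1118266&lt;/code&gt; shows up as chunks &lt;code&gt;1of2&lt;/code&gt; and &lt;code&gt;2of2&lt;/code&gt; above). The sketch below parses the &lt;code&gt;chunk_id&lt;/code&gt; and keeps the best-scoring chunk per post. The &lt;code&gt;id:column:XofY:starttoend&lt;/code&gt; layout is inferred from the output shown here rather than from a documented format, and &lt;code&gt;ChunkRef&lt;/code&gt;, &lt;code&gt;parse_chunk_id&lt;/code&gt; and &lt;code&gt;best_chunk_per_post&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import re
from typing import NamedTuple


class ChunkRef(NamedTuple):
    post_id: int
    column: str
    part: int
    parts: int
    start: int
    end: int


# Layout inferred from the result table, e.g. '1118266:Body:1of2:0to971'.
CHUNK_ID_PATTERN = re.compile(r"^(\d+):(\w+):(\d+)of(\d+):(\d+)to(\d+)$")


def parse_chunk_id(chunk_id: str) -> ChunkRef:
    """Split a chunk_id into post id, source column, part numbers and offsets."""
    match = CHUNK_ID_PATTERN.match(chunk_id)
    if match is None:
        raise ValueError(f"Unexpected chunk_id format: {chunk_id!r}")
    post_id, column, part, parts, start, end = match.groups()
    return ChunkRef(int(post_id), column, int(part), int(parts), int(start), int(end))


def best_chunk_per_post(rows):
    """Keep the highest-relevance row for each post, sorted by relevance descending."""
    best = {}
    for row in rows:
        post_id = parse_chunk_id(row["chunk_id"]).post_id
        if post_id not in best or row["relevance"] > best[post_id]["relevance"]:
            best[post_id] = row
    return sorted(best.values(), key=lambda r: r["relevance"], reverse=True)


# Three rows taken from the pgvector result table above.
rows = [
    {"chunk_id": "1118266:Body:1of2:0to971", "relevance": 0.622879},
    {"chunk_id": "974071:Body:1of1:0to791", "relevance": 0.619097},
    {"chunk_id": "1118266:Body:2of2:972to1430", "relevance": 0.614461},
]
for row in best_chunk_per_post(rows):
    print(row["chunk_id"], row["relevance"])
```

&lt;p&gt;With a DataFrame result, the same idea could be applied by converting it to records first, for example with &lt;code&gt;results_vector.to_dict("records")&lt;/code&gt;.&lt;/p&gt;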

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAISS query time: 5.04 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;distance&lt;/th&gt;
      &lt;th&gt;chunk_id&lt;/th&gt;
      &lt;th&gt;chunk_content&lt;/th&gt;
      &lt;th&gt;relevance&lt;/th&gt;
      &lt;th&gt;ContentLicense&lt;/th&gt;
      &lt;th&gt;ViewCount&lt;/th&gt;
      &lt;th&gt;LastEditDate&lt;/th&gt;
      &lt;th&gt;Score&lt;/th&gt;
      &lt;th&gt;AcceptedAnswerId&lt;/th&gt;
      &lt;th&gt;OwnerUserId&lt;/th&gt;
      &lt;th&gt;ParentId&lt;/th&gt;
      &lt;th&gt;LastEditorUserId&lt;/th&gt;
      &lt;th&gt;LastActivityDate&lt;/th&gt;
      &lt;th&gt;Tags&lt;/th&gt;
      &lt;th&gt;PostTypeId&lt;/th&gt;
      &lt;th&gt;FavoriteCount&lt;/th&gt;
      &lt;th&gt;Title&lt;/th&gt;
      &lt;th&gt;CreationDate&lt;/th&gt;
      &lt;th&gt;metadata&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;1118266&lt;/td&gt;
      &lt;td&gt;0.605468&lt;/td&gt;
      &lt;td&gt;1118266:Body:1of2:0to971&lt;/td&gt;
      &lt;td&gt;Im trying to engineer in python a way of trans...&lt;/td&gt;
      &lt;td&gt;0.622871&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;1694.0&lt;/td&gt;
      &lt;td&gt;2009-07-13T08:32:20.797&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;12855.0&lt;/td&gt;
      &lt;td&gt;2010-03-17T15:16:17.060&lt;/td&gt;
      &lt;td&gt;python,audio&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;List of values to a sound file&lt;/td&gt;
      &lt;td&gt;2009-07-13T08:27:25.393&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;974071&lt;/td&gt;
      &lt;td&gt;0.615225&lt;/td&gt;
      &lt;td&gt;974071:Body:1of1:0to791&lt;/td&gt;
      &lt;td&gt;I have a mosquito problem in my house. This wo...&lt;/td&gt;
      &lt;td&gt;0.619109&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;55695.0&lt;/td&gt;
      &lt;td&gt;2017-05-23T12:32:21.507&lt;/td&gt;
      &lt;td&gt;44&lt;/td&gt;
      &lt;td&gt;974291.0&lt;/td&gt;
      &lt;td&gt;51197.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;-1.0&lt;/td&gt;
      &lt;td&gt;2020-02-12T22:24:39.977&lt;/td&gt;
      &lt;td&gt;python,audio,mp3,frequency&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;Python library for playing fixed-frequency sound&lt;/td&gt;
      &lt;td&gt;2009-06-10T07:05:02.037&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;1967040&lt;/td&gt;
      &lt;td&gt;0.626923&lt;/td&gt;
      &lt;td&gt;1967040:Body:1of1:0to224&lt;/td&gt;
      &lt;td&gt;I am confused because there are a lot of progr...&lt;/td&gt;
      &lt;td&gt;0.614657&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;6615.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;1968691.0&lt;/td&gt;
      &lt;td&gt;237934.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2021-08-10T10:40:59.217&lt;/td&gt;
      &lt;td&gt;python,audio&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;How can i create a melody? Is there any sound-...&lt;/td&gt;
      &lt;td&gt;2009-12-27T21:04:34.243&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1118266&lt;/td&gt;
      &lt;td&gt;0.627461&lt;/td&gt;
      &lt;td&gt;1118266:Body:2of2:972to1430&lt;/td&gt;
      &lt;td&gt;The current solution I'm thinking of involves ...&lt;/td&gt;
      &lt;td&gt;0.614454&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;1694.0&lt;/td&gt;
      &lt;td&gt;2009-07-13T08:32:20.797&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;12855.0&lt;/td&gt;
      &lt;td&gt;2010-03-17T15:16:17.060&lt;/td&gt;
      &lt;td&gt;python,audio&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;List of values to a sound file&lt;/td&gt;
      &lt;td&gt;2009-07-13T08:27:25.393&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;1344884&lt;/td&gt;
      &lt;td&gt;0.643955&lt;/td&gt;
      &lt;td&gt;1344884:Body:1of1:0to327&lt;/td&gt;
      &lt;td&gt;I want to learn how to program a music applica...&lt;/td&gt;
      &lt;td&gt;0.608289&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2205.0&lt;/td&gt;
      &lt;td&gt;2017-05-23T12:11:22.607&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;1346272.0&lt;/td&gt;
      &lt;td&gt;164623.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;-1.0&lt;/td&gt;
      &lt;td&gt;2022-04-14T09:12:07.197&lt;/td&gt;
      &lt;td&gt;python,perl,waveform&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;Programming a Self Learning Music Maker&lt;/td&gt;
      &lt;td&gt;2009-08-28T03:28:03.937&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;2376505&lt;/td&gt;
      &lt;td&gt;0.645192&lt;/td&gt;
      &lt;td&gt;2376505:Body:1of2:0to968&lt;/td&gt;
      &lt;td&gt;Write a function called listenToPicture that t...&lt;/td&gt;
      &lt;td&gt;0.607832&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;3058.0&lt;/td&gt;
      &lt;td&gt;2010-03-04T02:28:26.703&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;285922.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;34397.0&lt;/td&gt;
      &lt;td&gt;2010-03-06T05:27:48.017&lt;/td&gt;
      &lt;td&gt;python,image,audio&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;How do I loop through every 4th pixel in every...&lt;/td&gt;
      &lt;td&gt;2010-03-04T02:26:22.603&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;2226853&lt;/td&gt;
      &lt;td&gt;0.654112&lt;/td&gt;
      &lt;td&gt;2226853:Body:1of1:0to877&lt;/td&gt;
      &lt;td&gt;I'm trying to write a program to display PCM d...&lt;/td&gt;
      &lt;td&gt;0.604554&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;12425.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;2226907.0&lt;/td&gt;
      &lt;td&gt;210920.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2015-07-25T11:16:16.747&lt;/td&gt;
      &lt;td&gt;python,audio,pcm&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;Interpreting WAV Data&lt;/td&gt;
      &lt;td&gt;2010-02-09T05:01:25.703&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;1561104&lt;/td&gt;
      &lt;td&gt;0.668055&lt;/td&gt;
      &lt;td&gt;1561104:Body:1of1:0to306&lt;/td&gt;
      &lt;td&gt;Is there a way to do this? Also, I need this t...&lt;/td&gt;
      &lt;td&gt;0.599501&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;1303.0&lt;/td&gt;
      &lt;td&gt;2020-06-20T09:12:55.060&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1561314.0&lt;/td&gt;
      &lt;td&gt;151377.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;-1.0&lt;/td&gt;
      &lt;td&gt;2012-01-29T00:01:18.230&lt;/td&gt;
      &lt;td&gt;python,pygame,pitch&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Playing sounds with python and changing their ...&lt;/td&gt;
      &lt;td&gt;2009-10-13T15:44:54.267&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;1382998&lt;/td&gt;
      &lt;td&gt;0.670668&lt;/td&gt;
      &lt;td&gt;1382998:Body:4of4:2649to3382&lt;/td&gt;
      &lt;td&gt;```\n¼ éíñ§ÐÌëÑ » ¼ ö ® © ’\n0 1\n2 10\n3 10\n...&lt;/td&gt;
      &lt;td&gt;0.598563&lt;/td&gt;
      &lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
      &lt;td&gt;12497.0&lt;/td&gt;
      &lt;td&gt;2011-06-09T06:00:51.243&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;1383721.0&lt;/td&gt;
      &lt;td&gt;6946.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;6946.0&lt;/td&gt;
      &lt;td&gt;2015-06-04T17:13:43.323&lt;/td&gt;
      &lt;td&gt;python,unicode&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;latin-1 to ascii&lt;/td&gt;
      &lt;td&gt;2009-09-05T10:44:40.167&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;1837686&lt;/td&gt;
      &lt;td&gt;0.675986&lt;/td&gt;
      &lt;td&gt;1837686:Body:1of2:0to950&lt;/td&gt;
      &lt;td&gt;I wish to take a file encoded in UTF-8 that do...&lt;/td&gt;
      &lt;td&gt;0.596664&lt;/td&gt;
      &lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
      &lt;td&gt;3016.0&lt;/td&gt;
      &lt;td&gt;2011-10-15T13:17:24.520&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;12113.0&lt;/td&gt;
      &lt;td&gt;2011-10-15T13:17:24.520&lt;/td&gt;
      &lt;td&gt;python,c,utf-8,compression&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;Compressing UTF-8(or other 8-bit encoding) to ...&lt;/td&gt;
      &lt;td&gt;2009-12-03T04:43:05.963&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Analyzing the Results
&lt;/h2&gt;

&lt;p&gt;Notice how the search for "8-bit music" returned posts about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converting values to sound files&lt;/li&gt;
&lt;li&gt;Playing fixed-frequency sounds&lt;/li&gt;
&lt;li&gt;Creating melodies programmatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these posts contain the exact phrase "8-bit music," yet they're all semantically relevant to chiptune/retro audio generation. This is the power of semantic search.&lt;/p&gt;

&lt;p&gt;Also note the roughly &lt;strong&gt;4x speed improvement&lt;/strong&gt; with FAISS (about 5 seconds versus 19 seconds for pgvector). For production systems with high query volumes, a gap of this size adds up quickly.&lt;/p&gt;
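
&lt;p&gt;To make the comparison reproducible, the timing pattern used in this tutorial can be wrapped in a small helper. This is a minimal sketch: &lt;code&gt;timed&lt;/code&gt; and &lt;code&gt;speedup&lt;/code&gt; are hypothetical names introduced here for illustration, and the two timings are the rounded figures quoted above.&lt;/p&gt;

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Rounded timings quoted above: pgvector about 19 s, FAISS about 5 s
print(f"FAISS speedup over pgvector: {speedup(19, 5):.1f}x")
```

&lt;p&gt;With the connection used in this tutorial, a query could be measured as &lt;code&gt;results, seconds = timed(lambda: server.query(sql).fetch())&lt;/code&gt;. Running each query a few times and comparing the medians would likely give a steadier number than a single run.&lt;/p&gt;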

&lt;h3&gt;
  
  
  Combined Semantic and Metadata Filtering
&lt;/h3&gt;

&lt;p&gt;Find AJAX-related posts tagged with jQuery that have high view counts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pgvector: Semantic search with metadata filters
start = time.time()
results = server.query("""
    SELECT * FROM kb_stack_vector 
    WHERE content = 'ajax'
        AND Tags LIKE '%jquery%'
        AND ViewCount &amp;gt; 1000.0
        AND relevance &amp;gt; 0.6
    LIMIT 10
""").fetch()
print(f"pgvector query time: {time.time() - start:.2f} seconds")
display(results)

# FAISS: Semantic search with metadata filters
start = time.time()
results = server.query("""
    SELECT * FROM kb_stack_faiss 
    WHERE content = 'ajax'
        AND Tags LIKE '%jquery%'
        AND ViewCount &amp;gt; 1000.0
        AND relevance &amp;gt; 0.6
    LIMIT 10
""").fetch()
print(f"FAISS query time: {time.time() - start:.2f} seconds")
display(results)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgvector query time: 5.76 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;chunk_id&lt;/th&gt;
      &lt;th&gt;chunk_content&lt;/th&gt;
      &lt;th&gt;distance&lt;/th&gt;
      &lt;th&gt;relevance&lt;/th&gt;
      &lt;th&gt;ContentLicense&lt;/th&gt;
      &lt;th&gt;ViewCount&lt;/th&gt;
      &lt;th&gt;LastEditDate&lt;/th&gt;
      &lt;th&gt;Score&lt;/th&gt;
      &lt;th&gt;AcceptedAnswerId&lt;/th&gt;
      &lt;th&gt;OwnerUserId&lt;/th&gt;
      &lt;th&gt;LastActivityDate&lt;/th&gt;
      &lt;th&gt;Tags&lt;/th&gt;
      &lt;th&gt;LastEditorUserId&lt;/th&gt;
      &lt;th&gt;PostTypeId&lt;/th&gt;
      &lt;th&gt;ParentId&lt;/th&gt;
      &lt;th&gt;Title&lt;/th&gt;
      &lt;th&gt;FavoriteCount&lt;/th&gt;
      &lt;th&gt;CreationDate&lt;/th&gt;
      &lt;th&gt;metadata&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;1400637:Body:28of32:25627to26627&lt;/td&gt;
      &lt;td&gt;o.ajax({type:"POST",url:E,data:G,success:H,dat...&lt;/td&gt;
      &lt;td&gt;0.427265&lt;/td&gt;
      &lt;td&gt;0.700641&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;1400637:Body:30of32:27488to28356&lt;/td&gt;
      &lt;td&gt;O=false;T.onload=T.onreadystatechange=function...&lt;/td&gt;
      &lt;td&gt;0.453764&lt;/td&gt;
      &lt;td&gt;0.687870&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;1400637:Body:27of32:24691to25626&lt;/td&gt;
      &lt;td&gt;rn this},serialize:function(){return o.param(t...&lt;/td&gt;
      &lt;td&gt;0.454629&lt;/td&gt;
      &lt;td&gt;0.687460&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1424774&lt;/td&gt;
      &lt;td&gt;1424774:Body:2of2:934to1745&lt;/td&gt;
      &lt;td&gt;var self = this;\n        $.ajax({\n  ...&lt;/td&gt;
      &lt;td&gt;0.461486&lt;/td&gt;
      &lt;td&gt;0.684235&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;3601.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1426940.0&lt;/td&gt;
      &lt;td&gt;173350.0&lt;/td&gt;
      &lt;td&gt;2020-06-08T10:43:45.037&lt;/td&gt;
      &lt;td&gt;jquery,loops&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Loop with 8 times&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-09-15T02:02:58.927&lt;/td&gt;
      &lt;td&gt;{'Tags': 'jquery,loops', 'Score': 1, 'Title': ...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;1400637:Body:31of32:28357to29238&lt;/td&gt;
      &lt;td&gt;N=function(X){if(J.readyState==0){if(P){clearI...&lt;/td&gt;
      &lt;td&gt;0.462191&lt;/td&gt;
      &lt;td&gt;0.683905&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;546344&lt;/td&gt;
      &lt;td&gt;546344:Body:2of3:902to1764&lt;/td&gt;
      &lt;td&gt;var before = function() { $(loading).show() ;...&lt;/td&gt;
      &lt;td&gt;0.463258&lt;/td&gt;
      &lt;td&gt;0.683407&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;1463.0&lt;/td&gt;
      &lt;td&gt;2009-02-13T16:17:38.170&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;546642.0&lt;/td&gt;
      &lt;td&gt;2755.0&lt;/td&gt;
      &lt;td&gt;2009-02-13T16:37:59.867&lt;/td&gt;
      &lt;td&gt;javascript,jquery,ajax&lt;/td&gt;
      &lt;td&gt;2755.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Using jQuery, how can I store the result of a ...&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2009-02-13T15:25:00.963&lt;/td&gt;
      &lt;td&gt;{'Tags': 'javascript,jquery,ajax', 'Score': 0,...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;1279625&lt;/td&gt;
      &lt;td&gt;1279625:Body:2of3:782to1754&lt;/td&gt;
      &lt;td&gt;```\n&amp;lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ...&lt;/td&gt;
      &lt;td&gt;0.468882&lt;/td&gt;
      &lt;td&gt;0.680790&lt;/td&gt;
      &lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
      &lt;td&gt;1130.0&lt;/td&gt;
      &lt;td&gt;2016-12-03T07:00:58.213&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1279881.0&lt;/td&gt;
      &lt;td&gt;58375.0&lt;/td&gt;
      &lt;td&gt;2016-12-03T07:00:58.213&lt;/td&gt;
      &lt;td&gt;events,jquery,getjson&lt;/td&gt;
      &lt;td&gt;6637668.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Trouble with jQuery Ajax timing&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-08-14T19:06:28.043&lt;/td&gt;
      &lt;td&gt;{'Tags': 'events,jquery,getjson', 'Score': 0, ...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;1400637:Body:32of32:29239to30048&lt;/td&gt;
      &lt;td&gt;L(){if(M.complete){M.complete(J,R)}if(M.global...&lt;/td&gt;
      &lt;td&gt;0.468944&lt;/td&gt;
      &lt;td&gt;0.680761&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;1775625&lt;/td&gt;
      &lt;td&gt;1775625:Body:5of9:3144to4049&lt;/td&gt;
      &lt;td&gt;}\n\n}\n&amp;lt;/script&amp;gt;\n\n\n\n&amp;lt;script type=...&lt;/td&gt;
      &lt;td&gt;0.472723&lt;/td&gt;
      &lt;td&gt;0.679014&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2100.0&lt;/td&gt;
      &lt;td&gt;2009-11-21T14:46:00.250&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1776406.0&lt;/td&gt;
      &lt;td&gt;212889.0&lt;/td&gt;
      &lt;td&gt;2009-11-21T19:03:52.070&lt;/td&gt;
      &lt;td&gt;jquery,form-submit&lt;/td&gt;
      &lt;td&gt;212889.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;jQuery - Multiple form submission trigger unre...&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;2009-11-21T14:32:41.383&lt;/td&gt;
      &lt;td&gt;{'Tags': 'jquery,form-submit', 'Score': 1, 'Ti...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;1400637:Body:26of32:23690to24690&lt;/td&gt;
      &lt;td&gt;nclick")}o(function(){var L=document.createEle...&lt;/td&gt;
      &lt;td&gt;0.477784&lt;/td&gt;
      &lt;td&gt;0.676689&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAISS query time: 2.50 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;distance&lt;/th&gt;
      &lt;th&gt;chunk_id&lt;/th&gt;
      &lt;th&gt;chunk_content&lt;/th&gt;
      &lt;th&gt;relevance&lt;/th&gt;
      &lt;th&gt;ContentLicense&lt;/th&gt;
      &lt;th&gt;ViewCount&lt;/th&gt;
      &lt;th&gt;LastEditDate&lt;/th&gt;
      &lt;th&gt;Score&lt;/th&gt;
      &lt;th&gt;AcceptedAnswerId&lt;/th&gt;
      &lt;th&gt;OwnerUserId&lt;/th&gt;
      &lt;th&gt;ParentId&lt;/th&gt;
      &lt;th&gt;LastEditorUserId&lt;/th&gt;
      &lt;th&gt;LastActivityDate&lt;/th&gt;
      &lt;th&gt;Tags&lt;/th&gt;
      &lt;th&gt;PostTypeId&lt;/th&gt;
      &lt;th&gt;FavoriteCount&lt;/th&gt;
      &lt;th&gt;Title&lt;/th&gt;
      &lt;th&gt;CreationDate&lt;/th&gt;
      &lt;th&gt;metadata&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;0.427243&lt;/td&gt;
      &lt;td&gt;1400637:Body:28of32:25627to26627&lt;/td&gt;
      &lt;td&gt;o.ajax({type:"POST",url:E,data:G,success:H,dat...&lt;/td&gt;
      &lt;td&gt;0.700651&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;0.453769&lt;/td&gt;
      &lt;td&gt;1400637:Body:30of32:27488to28356&lt;/td&gt;
      &lt;td&gt;O=false;T.onload=T.onreadystatechange=function...&lt;/td&gt;
      &lt;td&gt;0.687867&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;0.454589&lt;/td&gt;
      &lt;td&gt;1400637:Body:27of32:24691to25626&lt;/td&gt;
      &lt;td&gt;rn this},serialize:function(){return o.param(t...&lt;/td&gt;
      &lt;td&gt;0.687479&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;1424774&lt;/td&gt;
      &lt;td&gt;0.461469&lt;/td&gt;
      &lt;td&gt;1424774:Body:2of2:934to1745&lt;/td&gt;
      &lt;td&gt;var self = this;\n        $.ajax({\n  ...&lt;/td&gt;
      &lt;td&gt;0.684243&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;3601.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1426940.0&lt;/td&gt;
      &lt;td&gt;173350.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2020-06-08T10:43:45.037&lt;/td&gt;
      &lt;td&gt;jquery,loops&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Loop with 8 times&lt;/td&gt;
      &lt;td&gt;2009-09-15T02:02:58.927&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;0.462233&lt;/td&gt;
      &lt;td&gt;1400637:Body:31of32:28357to29238&lt;/td&gt;
      &lt;td&gt;N=function(X){if(J.readyState==0){if(P){clearI...&lt;/td&gt;
      &lt;td&gt;0.683886&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;546344&lt;/td&gt;
      &lt;td&gt;0.463237&lt;/td&gt;
      &lt;td&gt;546344:Body:2of3:902to1764&lt;/td&gt;
      &lt;td&gt;var before = function() { $(loading).show() ;...&lt;/td&gt;
      &lt;td&gt;0.683416&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;1463.0&lt;/td&gt;
      &lt;td&gt;2009-02-13T16:17:38.170&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;546642.0&lt;/td&gt;
      &lt;td&gt;2755.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;2755.0&lt;/td&gt;
      &lt;td&gt;2009-02-13T16:37:59.867&lt;/td&gt;
      &lt;td&gt;javascript,jquery,ajax&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;Using jQuery, how can I store the result of a ...&lt;/td&gt;
      &lt;td&gt;2009-02-13T15:25:00.963&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;1279625&lt;/td&gt;
      &lt;td&gt;0.468854&lt;/td&gt;
      &lt;td&gt;1279625:Body:2of3:782to1754&lt;/td&gt;
      &lt;td&gt;```\n&amp;lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ...&lt;/td&gt;
      &lt;td&gt;0.680803&lt;/td&gt;
      &lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
      &lt;td&gt;1130.0&lt;/td&gt;
      &lt;td&gt;2016-12-03T07:00:58.213&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1279881.0&lt;/td&gt;
      &lt;td&gt;58375.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;6637668.0&lt;/td&gt;
      &lt;td&gt;2016-12-03T07:00:58.213&lt;/td&gt;
      &lt;td&gt;events,jquery,getjson&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Trouble with jQuery Ajax timing&lt;/td&gt;
      &lt;td&gt;2009-08-14T19:06:28.043&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;0.468931&lt;/td&gt;
      &lt;td&gt;1400637:Body:32of32:29239to30048&lt;/td&gt;
      &lt;td&gt;L(){if(M.complete){M.complete(J,R)}if(M.global...&lt;/td&gt;
      &lt;td&gt;0.680767&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;1775625&lt;/td&gt;
      &lt;td&gt;0.472740&lt;/td&gt;
      &lt;td&gt;1775625:Body:5of9:3144to4049&lt;/td&gt;
      &lt;td&gt;}\n\n}\n&amp;lt;/script&amp;gt;\n\n\n\n&amp;lt;script type=...&lt;/td&gt;
      &lt;td&gt;0.679007&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2100.0&lt;/td&gt;
      &lt;td&gt;2009-11-21T14:46:00.250&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1776406.0&lt;/td&gt;
      &lt;td&gt;212889.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;212889.0&lt;/td&gt;
      &lt;td&gt;2009-11-21T19:03:52.070&lt;/td&gt;
      &lt;td&gt;jquery,form-submit&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;jQuery - Multiple form submission trigger unre...&lt;/td&gt;
      &lt;td&gt;2009-11-21T14:32:41.383&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;1400637&lt;/td&gt;
      &lt;td&gt;0.477785&lt;/td&gt;
      &lt;td&gt;1400637:Body:26of32:23690to24690&lt;/td&gt;
      &lt;td&gt;nclick")}o(function(){var L=document.createEle...&lt;/td&gt;
      &lt;td&gt;0.676688&lt;/td&gt;
      &lt;td&gt;CC BY-SA 2.5&lt;/td&gt;
      &lt;td&gt;2741.0&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:16:59.430&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1400656.0&lt;/td&gt;
      &lt;td&gt;107129.0&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;8590.0&lt;/td&gt;
      &lt;td&gt;2013-08-05T16:07:54.400&lt;/td&gt;
      &lt;td&gt;javascript,jquery&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;Stop reload for ajax submitted form&lt;/td&gt;
      &lt;td&gt;2009-09-09T16:12:46.057&lt;/td&gt;
      &lt;td&gt;{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Understanding Query Results
&lt;/h3&gt;

&lt;p&gt;The query returns these columns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Original document ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chunk_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Identifier for the text chunk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chunk_content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The actual text content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metadata&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;JSON object with all metadata fields&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;distance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Vector distance (lower = more similar)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;relevance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Relevance score (higher = more relevant, 0-1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Filtering by Relevance
&lt;/h3&gt;

&lt;p&gt;Get only highly relevant results:&lt;/p&gt;
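
&lt;p&gt;A minimal sketch of such a query, built in Python so the threshold can be varied. The helper name &lt;code&gt;build_relevance_query&lt;/code&gt; and its parameters are hypothetical; the knowledge base name and the 0.6 threshold are the ones used elsewhere in this tutorial. It assumes the resulting string is then passed to &lt;code&gt;server.query(...).fetch()&lt;/code&gt;, as in the other examples.&lt;/p&gt;

```python
def build_relevance_query(kb_name, search_text, min_relevance=0.6, limit=10):
    """Build a knowledge-base query that keeps only rows above a relevance threshold."""
    # relevance is documented as a 0-1 score, so reject thresholds outside that range
    if min_relevance > 1.0 or 0.0 > min_relevance:
        raise ValueError("min_relevance must be between 0 and 1")
    # Escape single quotes so the search text stays a valid SQL string literal
    safe_text = search_text.replace("'", "''")
    return (
        f"SELECT * FROM {kb_name} "
        f"WHERE content = '{safe_text}' "
        f"AND relevance > {min_relevance} "
        f"LIMIT {int(limit)}"
    )


sql = build_relevance_query("kb_stack_faiss", "ajax", min_relevance=0.6)
print(sql)
```

&lt;p&gt;Raising the threshold trades recall for precision: fewer rows come back, but each one is a closer match to the search text.&lt;/p&gt;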

&lt;h2&gt;
  
  
  The Power of Combined Filtering
&lt;/h2&gt;

&lt;p&gt;The query we just ran demonstrates MindsDB's hybrid search capability:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM kb_stack_faiss
WHERE content = 'ajax'              -- Semantic match
    AND Tags LIKE '%jquery%'        -- Metadata filter
    AND ViewCount &amp;gt; 1000            -- Popularity threshold
    AND relevance &amp;gt; 0.6;            -- Quality threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This finds posts that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Are semantically similar to "ajax" (not just keyword matches)&lt;/li&gt;
&lt;li&gt;Are tagged with jQuery&lt;/li&gt;
&lt;li&gt;Have significant engagement (&amp;gt;1000 views)&lt;/li&gt;
&lt;li&gt;Meet a minimum relevance score&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This combination is hard to express with traditional keyword search alone, and with a raw vector database it would typically require custom code to merge the vector results with the metadata filters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def run_query_ignore_exists(sql, success_msg="Query executed successfully"):
    """Execute a query, ignoring errors such as 'already exists' or 'does not exist'."""
    try:
        result = server.query(sql).fetch()
        print(success_msg)
        return result
    except RuntimeError:
        return None  # Best-effort step: swallow the error and continue


# Recreate the MindsDB agent: drop any previous version, then create it
run_query_ignore_exists("""
    DROP AGENT stackoverflow_agent
""", "Dropped stackoverflow_agent")

run_query("""
    CREATE AGENT stackoverflow_agent
    USING
        model = {
            "provider": "openai",
            "model_name": "gpt-4.1"
        },
        data = {
            "knowledge_bases": ["mindsdb.kb_stack_faiss"]
        },
        prompt_template = '
            You are a helpful programming assistant. 
            mindsdb.kb_stack_faiss is a knowledge base that contains Stack Overflow questions and answers.
            Use this knowledge to provide accurate, helpful responses to programming questions.
            Include code examples when relevant.
            You must base your answer on the Stack Overflow questions and answers extracted from mindsdb.kb_stack_faiss.
            If you failed to get the results from mindsdb.kb_stack_faiss, answer I could not get the results from mindsdb.kb_stack_faiss.
            Print the chunk ID for each question and answer you based your answer on.
            IMPORTANT: Use a limit of 100 in your query to the knowledge base.
        '
""", "Created stackoverflow_agent")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dropped stackoverflow_agent
Created stackoverflow_agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

# Query the agent and time the round trip
start = time.time()
response = server.query("""
    SELECT answer
    FROM stackoverflow_agent 
    WHERE question = 'Compare JavaScript to TypeScript for building web services'
""").fetch()
print(f"Agent response time: {time.time() - start:.2f} seconds\n")
print(response['answer'].iloc[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent response time: 63.44 seconds

&lt;p&gt;To compare JavaScript and TypeScript for building web services, let's look at insights from Stack Overflow posts (see chunk IDs for reference):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JavaScript:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JavaScript is a dynamic, weakly typed, prototype-based language with first-class functions (&lt;a href="https://stackoverflow.com/posts/1253285" rel="noopener noreferrer"&gt;1253285:Body:1of1:0to384&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;It is the default language for web development, both on the client (browser) and, with Node.js, on the server (&lt;a href="https://stackoverflow.com/posts/870980" rel="noopener noreferrer"&gt;870980:Body:1of1:0to133&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;JavaScript is flexible and widely supported, but its lack of static typing can lead to runtime errors and makes large codebases harder to maintain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TypeScript:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;While not directly mentioned in the top results, TypeScript is a superset of JavaScript that adds static typing and modern language features. It compiles to JavaScript, so it runs anywhere JavaScript does.&lt;/li&gt;
&lt;li&gt;TypeScript helps catch errors at compile time, improves code readability, and is especially beneficial for large projects or teams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Web Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JavaScript (with Node.js) is commonly used to build RESTful APIs and web services (&lt;a href="https://stackoverflow.com/posts/208051" rel="noopener noreferrer"&gt;208051:Body:1of1:0to147&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;TypeScript is increasingly popular for the same purpose, as it provides all the benefits of JavaScript plus type safety and better tooling (e.g., autocompletion, refactoring).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Summary Table:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;JavaScript&lt;/th&gt;
&lt;th&gt;TypeScript&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Typing&lt;/td&gt;
&lt;td&gt;Dynamic, weakly typed&lt;/td&gt;
&lt;td&gt;Static typing (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tooling&lt;/td&gt;
&lt;td&gt;Good, but less type-aware&lt;/td&gt;
&lt;td&gt;Excellent (autocompletion, refactor)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning Curve&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Slightly higher (due to types)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Checking&lt;/td&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Compile-time + runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ecosystem&lt;/td&gt;
&lt;td&gt;Huge, universal&lt;/td&gt;
&lt;td&gt;Same as JS, plus TS-specific tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintainability&lt;/td&gt;
&lt;td&gt;Can be challenging in large code&lt;/td&gt;
&lt;td&gt;Easier in large codebases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For small projects or rapid prototyping, JavaScript is sufficient and easy to start with.&lt;/li&gt;
&lt;li&gt;For larger projects, teams, or when maintainability and reliability are priorities, TypeScript is generally preferred.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;References:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://stackoverflow.com/posts/1253285" rel="noopener noreferrer"&gt;1253285:Body:1of1:0to384&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stackoverflow.com/posts/870980" rel="noopener noreferrer"&gt;870980:Body:1of1:0to133&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stackoverflow.com/posts/208051" rel="noopener noreferrer"&gt;208051:Body:1of1:0to147&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want more specific code examples or a deeper dive into either technology, let me know!&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
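
&lt;p&gt;Because the prompt asks the agent to print chunk IDs, the citations can be pulled out of the answer text programmatically. The sketch below is one way to do that; the function name &lt;code&gt;parse_chunk_ids&lt;/code&gt; is hypothetical, and the chunk ID layout (post ID, column, chunk index, character range) is an assumption inferred from the IDs shown in this tutorial, not a documented format.&lt;/p&gt;

```python
import re

# Assumed chunk ID layout, inferred from IDs such as "1253285:Body:1of1:0to384":
# post id, source column, chunk index "of" chunk count, character start "to" end.
CHUNK_ID_PATTERN = re.compile(r"(\d+):(\w+):(\d+)of(\d+):(\d+)to(\d+)")


def parse_chunk_ids(answer_text):
    """Return one dict per distinct chunk ID cited in an agent answer."""
    seen = set()
    citations = []
    for match in CHUNK_ID_PATTERN.finditer(answer_text):
        chunk_id = match.group(0)
        if chunk_id in seen:
            continue  # the agent may cite the same chunk more than once
        seen.add(chunk_id)
        post_id, column, index, total, start, end = match.groups()
        citations.append({
            "chunk_id": chunk_id,
            "post_id": int(post_id),
            "column": column,
            "chunk_index": int(index),
            "chunk_count": int(total),
            "char_range": (int(start), int(end)),
            # URL shape copied from the links in the sample answer
            "url": f"https://stackoverflow.com/posts/{post_id}",
        })
    return citations


sample = "See 1253285:Body:1of1:0to384 and 870980:Body:1of1:0to133; again 1253285:Body:1of1:0to384."
for citation in parse_chunk_ids(sample):
    print(citation["chunk_id"], citation["url"])
```

&lt;p&gt;Extracted IDs can then be checked against the &lt;code&gt;chunk_id&lt;/code&gt; column of the knowledge base to confirm that each cited chunk actually exists.&lt;/p&gt;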
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We've built a complete semantic search system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes 2 million Stack Overflow posts&lt;/li&gt;
&lt;li&gt;Supports both pgvector and FAISS backends&lt;/li&gt;
&lt;li&gt;Combines semantic search with metadata filtering&lt;/li&gt;
&lt;li&gt;Powers an AI agent for natural language queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;FAISS is much faster than pgvector for pure search queries&lt;/li&gt;
&lt;li&gt;Metadata filtering lets you narrow results by tags, scores, and dates&lt;/li&gt;
&lt;li&gt;Knowledge bases abstract complexity—no need to manage embeddings manually&lt;/li&gt;
&lt;li&gt;Agents can leverage knowledge bases for RAG-style applications&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Try different embedding models&lt;/li&gt;
&lt;li&gt;Add more data sources&lt;/li&gt;
&lt;li&gt;Build a chat interface&lt;/li&gt;
&lt;li&gt;Explore different chunking strategies&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>W.I.S.H – Whatever I Say Happens – Programming</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Thu, 22 Jan 2026 16:37:25 +0000</pubDate>
      <link>https://dev.to/mindsdb/wish-whatever-i-say-happens-programming-1cfb</link>
      <guid>https://dev.to/mindsdb/wish-whatever-i-say-happens-programming-1cfb</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Jorge Torres, Co-founder &amp;amp; CEO at MindsDB&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When Anyone Can Build Software, What Will CIOs Actually Invest In?
&lt;/h2&gt;

&lt;p&gt;Every two weeks, I find myself sitting pretty on an airplane, with just enough connectivity to triage emails and Slack notifications, but not enough to do real work. These flights have become my thinking time: moments to step back and consider how the forces reshaping technology will affect our business, our customers, and the decisions that technology leaders face. Some of these thought experiments feel worth sharing. Here's the first one of 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  The End of Proprietary as a Moat
&lt;/h2&gt;

&lt;p&gt;We're entering a software era where claiming "proprietary technology" as a competitive advantage is starting to feel not quaint but naive, like something written in crayon. Engineering roadmaps as differentiators? Increasingly questionable. "Competitive moats" built purely on code? More like suggestions.&lt;/p&gt;

&lt;p&gt;What we’ve been calling "vibe coding", the casual ability to describe what you want and have AI generate working code, is rapidly evolving into something more significant. Call it W.I.S.H programming: Whatever I Say, Happens. And I don't mean entertaining but useless demos or toy applications. I mean functional backends, polished interfaces, scalable cloud infrastructure, tested and deployable. The whole stack.&lt;/p&gt;

&lt;p&gt;Today, if you understand software architecture at the level of a competent project manager—someone who grasps how the components of a SaaS application connect, even if you've never written a line of code—you can build genuinely useful applications by describing what you need in plain English. Yes, there's still an interpretation dance between your intent and what the AI produces. Yes, quality and reliability still depend on your technical judgment about what needs testing and how components should integrate. But these systems are improving rapidly. Soon, they'll embed every best practice known for every layer of the stack automatically, bringing a level of attention to detail that currently requires years of engineering experience. All there, ready for you to just say it and see it magically happen before your eyes.&lt;/p&gt;

&lt;p&gt;We're approaching a threshold where functional software can be thought into existence rather than coded into reality in the traditional sense. Even if that is a far-fetched exaggeration, I invite you to imagine a future where software is a commodity and where, unless we keep our ears to the ground, we will all just be selling or buying reheated lasagna at different price points.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gmail Paradox
&lt;/h2&gt;

&lt;p&gt;Follow this trajectory to its logical conclusion, and you arrive at an uncomfortable question: In a world where software becomes a commodity, where does durable value actually reside?&lt;/p&gt;

&lt;p&gt;Consider Gmail. In the software-as-commodity future, cloning Gmail will be a weekend project. Building better versions of Gmail—with improved spam filtering, smarter categorization, better search—will be similarly trivial. The technical barriers that once protected established products are dissolving.&lt;/p&gt;

&lt;p&gt;And yet.&lt;/p&gt;

&lt;p&gt;No one is going to successfully storm that castle with an overnight Gmail clone, no matter how technically superior. Why? Because Gmail by itself isn't the product. Google Workspace is. And Google Workspace isn't just software—it's organizational infrastructure. It's the accumulated context of every conversation your company has ever had, every calendar invite, every shared document, every nested folder that someone has been curating since 2014.&lt;/p&gt;

&lt;p&gt;Companies don't use Google Workspace because it's the theoretically optimal solution. They use it because they are now full of people who have 47 nested folders that they’ve been curating since 2014, and if you try to migrate them to anything else, they will burn the place to the ground. They are recurring customers because migration would require organizational consensus that simply doesn't exist. That accumulated context isn't a feature. It's gravity. It's the business.&lt;/p&gt;

&lt;p&gt;The principles that governed SaaS when the industry began—network effects, switching costs, data gravity—will still apply a decade from now. But they'll apply to different things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Survives
&lt;/h2&gt;

&lt;p&gt;In a future where anyone can build anything, value doesn't reside solely in what you build, even if it elegantly solves a clearly defined problem. If it solves that problem for a single team in isolation, it may be displaced by internally vibed tools that fit their specific workflow perfectly.&lt;/p&gt;

&lt;p&gt;CIOs are going to face a constant calibration: Is this externally-sourced solution preventing technical debt, or is it creating friction we could eliminate by building something custom? When the cost of building drops dramatically, the calculus changes.&lt;/p&gt;

&lt;p&gt;So where does sustained value live? I keep returning to a deceptively simple question: &lt;em&gt;What do humans still need to agree on?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer points toward systems that can't simply be vibed into existence by any individual team, because their value depends on coordination across organizational boundaries. These are the systems where building in isolation creates more problems than it solves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trust as the Scarce Resource
&lt;/h2&gt;

&lt;p&gt;Here's the realization that keeps crystallizing for me: AI is racing to make building software dramatically cheaper than it's ever been. But the faster that race progresses, the more expensive trust becomes.&lt;/p&gt;

&lt;p&gt;Think about the implications. When every team can spin up internal tools overnight, who decides which source of truth actually matters? When you can clone any workflow application, who ensures the clone doesn't harbor compliance violations or security vulnerabilities that no one thought to check?&lt;/p&gt;

&lt;p&gt;The winners in this landscape won't simply be the builders of workflow applications. They'll be the providers of what I've started calling "agreement infrastructure"—the systems of record where decisions get made, documented, and recognized by other parties. Approvals. Reviews. Planning. Compliance attestation. The unglamorous connective tissue between "I built a thing" and "other people can rely on it."&lt;/p&gt;

&lt;p&gt;Wherever humans still need to make decisions together—and have those decisions carry weight beyond their immediate team—there's a durable business.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Synthesis
&lt;/h2&gt;

&lt;p&gt;The question I keep wrestling with: Can you deliver both? Can you provide the creative velocity that W.I.S.H programming enables while also building genuine agreement infrastructure, the kind that creates organizational gravity rather than technical debt?&lt;/p&gt;

&lt;p&gt;That intersection—between radical accessibility and institutional trust—may be where the gravity rules of traditional SaaS and the new era of software commoditization find equilibrium. Where sustainable businesses will actually be built.&lt;/p&gt;

&lt;p&gt;For CIOs navigating investment decisions, I'd suggest the filter isn't "what problems does this solve?" It's "what agreements does this enable, and how hard would those agreements be to recreate?"&lt;/p&gt;

&lt;p&gt;The lasagna might be getting commoditized. The kitchen where everyone agrees to eat together? That's still worth building.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Building AI-Powered Data Analytics with MindsDB Enterprise: From Natural Language to Charts</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Mon, 19 Jan 2026 10:48:58 +0000</pubDate>
      <link>https://dev.to/mindsdb/building-ai-powered-data-analytics-with-mindsdb-enterprise-from-natural-language-to-charts-49o3</link>
      <guid>https://dev.to/mindsdb/building-ai-powered-data-analytics-with-mindsdb-enterprise-from-natural-language-to-charts-49o3</guid>

&lt;p&gt;In this tutorial, I will show how to build an AI-powered analytics system using MindsDB Minds that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand natural language questions about your data&lt;/li&gt;
&lt;li&gt;Automatically generate and execute SQL queries&lt;/li&gt;
&lt;li&gt;Return formatted answers with dynamically generated charts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this tutorial, you'll have a working system that transforms questions like &lt;em&gt;"What's the total sales revenue by product category?"&lt;/em&gt; into actionable insights complete with visualizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Setting up a MindsDB Mind with database connectivity&lt;/li&gt;
&lt;li&gt;Asking a MindsDB Mind questions about your data and getting answers programmatically&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting, ensure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8 or higher installed&lt;/li&gt;
&lt;li&gt;A MindsDB API key (please &lt;a href="https://mindsdb.com/contact" rel="noopener noreferrer"&gt;contact MindsDB&lt;/a&gt; to get one)&lt;/li&gt;
&lt;li&gt;Basic familiarity with Python and SQL concepts&lt;/li&gt;
&lt;li&gt;The following Python packages installed:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pandas&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;openai&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minds-sdk&lt;/code&gt; (MindsDB Python client)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;You can install the required packages with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="n"&gt;minds&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Dataset: Web Sales Analytics
&lt;/h2&gt;

&lt;p&gt;This tutorial uses a web sales dataset stored in PostgreSQL available in your Minds dashboard. The dataset consists of four related tables that model an e-commerce business:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Key Columns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;websales_orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Order information&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;ship_date&lt;/code&gt;, &lt;code&gt;ship_mode&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;websales_sales&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sales transactions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;sales&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, &lt;code&gt;discount&lt;/code&gt;, &lt;code&gt;profit&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;websales_products&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Product catalog&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;sub_category&lt;/code&gt;, &lt;code&gt;product_name&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;websales_customers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Customer information&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;segment&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt;, &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Table Relationships
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;websales_sales.product_id  → websales_products.product_id
websales_sales.order_id    → websales_orders.order_id
websales_sales.customer_id → websales_customers.customer_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This schema supports a wide range of analytical questions — from product performance and customer segmentation to shipping analysis and regional trends.&lt;/p&gt;
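
&lt;p&gt;To make the relationships concrete, here is a small sketch of the kind of join a question like "total sales revenue by product category" implies. It uses an in-memory SQLite database as a stand-in for the PostgreSQL datasource, includes only the columns the query needs, and the rows are invented for illustration; the SQL a Mind actually generates may differ.&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL datasource; rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE websales_products (product_id TEXT PRIMARY KEY, category TEXT);
    CREATE TABLE websales_sales (order_id TEXT, product_id TEXT, sales REAL);
    INSERT INTO websales_products VALUES ('P1', 'Furniture'), ('P2', 'Technology');
    INSERT INTO websales_sales VALUES
        ('O1', 'P1', 100.0), ('O2', 'P1', 50.0), ('O3', 'P2', 200.0);
""")

# Follow websales_sales.product_id to websales_products.product_id,
# then aggregate revenue per category.
rows = conn.execute("""
    SELECT p.category, SUM(s.sales) AS total_sales
    FROM websales_sales AS s
    JOIN websales_products AS p ON s.product_id = p.product_id
    GROUP BY p.category
    ORDER BY total_sales DESC
""").fetchall()

for category, total_sales in rows:
    print(category, total_sales)
conn.close()
```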

&lt;p&gt;For simplicity, and to keep the focus on the Minds features, this tutorial uses an existing dataset. If you are not familiar with data sources in MindsDB, see &lt;a href="https://docs.mindsdb.com/mindsdb_sql/sql/create/database" rel="noopener noreferrer"&gt;the documentation on connecting data sources&lt;/a&gt; to learn how to make existing databases available for querying in MindsDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Import Required Libraries
&lt;/h2&gt;

&lt;p&gt;To start coding, let's import the necessary libraries for API communication, data handling, and MindsDB client operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;minds.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OpenAI library serves as the client for communicating with a Mind. Its API format is a popular choice supported by many chatbot and LLM providers, and MindsDB supports it as well.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Client&lt;/code&gt; is a MindsDB client we will use for Mind management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configure API Credentials
&lt;/h2&gt;

&lt;p&gt;Next, we set the MindsDB API URL, API key, and Mind name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://mdb.ai/api/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MINDS_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE_YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# "CREATE_YOUR_KEY at https://mindsdb.com/contact"
&lt;/span&gt;&lt;span class="n"&gt;MIND_NAME&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_sales_demo_mind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;DATA_SOURCES&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgres_web_sales_datasource&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tables&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;websales_orders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;websales_sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;websales_products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;websales_customers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see what each parameter defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BASE_URL&lt;/code&gt; is the MindsDB API endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MINDS_API_KEY&lt;/code&gt; is your personal API key for authentication. As noted in the prerequisites, contact MindsDB to obtain one.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MIND_NAME&lt;/code&gt; is a unique identifier for your Mind. Choose a descriptive name that reflects its purpose.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DATA_SOURCES&lt;/code&gt; specifies which database tables the Mind should have access to:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;name&lt;/code&gt; is a reference name for this datasource configuration. This should match an existing datasource in your MindsDB environment.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tables&lt;/code&gt; is a list of table names the Mind is allowed to query. Restricting access helps the AI focus on relevant data.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
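
&lt;p&gt;Hardcoding the key is fine for a quick demo, but for anything shared it is safer to read it from the environment. A minimal sketch, assuming the key is stored in an environment variable named &lt;code&gt;MINDS_API_KEY&lt;/code&gt; (the variable name and the &lt;code&gt;load_api_key&lt;/code&gt; helper are illustrative choices, not part of the Minds SDK):&lt;/p&gt;

```python
import os


def load_api_key(env_var="MINDS_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    # Treat the tutorial's placeholder value as "not configured"
    if not key or key == "CREATE_YOUR_KEY":
        raise RuntimeError(f"Set the {env_var} environment variable to your MindsDB API key")
    return key


# Demo only: a placeholder value so the sketch runs end to end.
os.environ.setdefault("MINDS_API_KEY", "demo-key-for-illustration")
MINDS_API_KEY = load_api_key()
print("API key loaded:", len(MINDS_API_KEY), "characters")
```

&lt;p&gt;Failing early with a clear message avoids a less obvious authentication error later, when the client makes its first request.&lt;/p&gt;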

&lt;h2&gt;
  
  
  Step 3: Initialize the MindsDB Client
&lt;/h2&gt;

&lt;p&gt;With credentials configured, we can now create a client instance that handles all communication with the MindsDB platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MINDS_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Client&lt;/code&gt; object provides methods for creating, updating, and managing Minds. It handles authentication automatically using the provided API key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Create the Prompt Template
&lt;/h2&gt;

&lt;p&gt;The prompt template is the heart of your Mind's behavior. It instructs the AI on how to interpret questions, generate SQL, and format responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PROMPT_TEMPLATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
# ROLE AND TASK

You are a precise data analyst with access to a SQL execution tool named **sql_db_query**.
You MUST (a) generate SQL, (b) EXECUTE it via sql_db_query, and (c) based ONLY on the returned rows,
return the answer in markdown format and always create a chart whenever possible.
Generate URL-encoded charts via quickchart.io like this:

https://quickchart.io/chart?c=%7Btype%3A%27line%27%2Cdata%3A%7Blabels%3A%5B%27Jan%27%2C%27Feb%27%2C%27Mar%27%2C%27Apr%27%2C%27May%27%2C%27Jun%27%5D%2Cdatasets%3A%5B%7Blabel%3A%27Sales%27%2Cdata%3A%5B65%2C59%2C80%2C81%2C56%2C95%5D%2CborderColor%3A%27rgb(75%2C192%2C192)%27%7D%5D%7D%7D
https://quickchart.io/chart?c=%7Btype%3A%27pie%27%2Cdata%3A%7Blabels%3A%5B%27CompanyA%27%2C%27CompanyB%27%2C%27CompanyC%27%2C%27Others%27%5D%2Cdatasets%3A%5B%7Bdata%3A%5B35%2C25%2C20%2C20%5D%2CbackgroundColor%3A%5B%27%23FF6384%27%2C%27%2336A2EB%27%2C%27%23FFCE56%27%2C%27%234BC0C0%27%5D%7D%5D%7D%7D
https://quickchart.io/chart?c=%7Btype%3A%27bar%27%2Cdata%3A%7Blabels%3A%5B%27North%27%2C%27South%27%2C%27East%27%2C%27West%27%5D%2Cdatasets%3A%5B%7Blabel%3A%27Revenue%27%2Cdata%3A%5B120%2C190%2C300%2C250%5D%7D%5D%7D%7D

# SCHEMA (use schema-qualified names ONLY; do NOT include datasource name)
- postgres_web_sales_datasource.websales_orders     (order_id, order_date, ship_date, ship_mode, ...)
- postgres_web_sales_datasource.websales_sales      (order_id, product_id, customer_id, sales, quantity, discount, profit)
- postgres_web_sales_datasource.websales_products   (product_id, category, sub_category, product_name)
- postgres_web_sales_datasource.websales_customers  (customer_id, customer_name, segment, country, city, state, postal_code, region)

# JOINS

- websales_sales.product_id  -&amp;gt; websales_products.product_id
- websales_sales.order_id    -&amp;gt; websales_orders.order_id
- websales_sales.customer_id -&amp;gt; websales_customers.customer_id

# GUIDELINES

When answering questions, follow these guidelines:
    For questions about database tables and their contents:
    - Use the sql_db_query to query the tables directly
    - You can join tables if needed to get comprehensive information
    - **Important Rule for SQL Queries:** If you formulate an SQL query as part of answering a user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question,
    you *must* then use the `sql_db_query` tool to execute that query and get its results.
    The SQL query string itself is NOT the final answer to the user unless the user has specifically asked for the query.
    Your final AI response should be based on the *results* obtained from executing the query.
    For factual questions, ALWAYS use the available tools to look up information rather than relying on your internal knowledge.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's review what the prompt above defines:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ROLE AND TASK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defines the AI's persona and core responsibilities, and provides URL-encoded QuickChart.io examples from which the AI can learn the chart format&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SCHEMA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Documents available tables and their columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JOINS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Explains table relationships for multi-table queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GUIDELINES&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enforces the execute-then-answer pattern: run the query with &lt;code&gt;sql_db_query&lt;/code&gt;, then answer from the returned rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these sections is mandatory, but each one helps the AI understand your data and what you are trying to achieve.&lt;/p&gt;
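
&lt;p&gt;If the schema changes often, the SCHEMA section can be generated rather than typed by hand. The following is a minimal sketch and not part of the original notebook: the &lt;code&gt;TABLES&lt;/code&gt; dictionary and the &lt;code&gt;build_schema_section&lt;/code&gt; helper are hypothetical names, and only the two tables whose columns are listed in full in the prompt are included.&lt;/p&gt;

```python
# Hypothetical helper: builds the SCHEMA section of the prompt from a dict.
DATASOURCE = "postgres_web_sales_datasource"

# Table name to column names, copied from the prompt template.
TABLES = {
    "websales_sales": ["order_id", "product_id", "customer_id",
                       "sales", "quantity", "discount", "profit"],
    "websales_products": ["product_id", "category", "sub_category", "product_name"],
}


def build_schema_section(datasource, tables):
    """Return one '- datasource.table (col, col, ...)' line per table."""
    lines = ["# SCHEMA"]
    for table, columns in tables.items():
        lines.append(f"- {datasource}.{table} ({', '.join(columns)})")
    return "\n".join(lines)


schema_section = build_schema_section(DATASOURCE, TABLES)
print(schema_section)
```

&lt;p&gt;The resulting string could then be spliced into the prompt template in place of the hand-written list, which keeps the prompt and the column list from drifting apart.&lt;/p&gt;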

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why QuickChart.io?&lt;/strong&gt; Minds can generate plots without requiring the user to provide their own plotting API, but these plots are currently available in the Minds UI. Because we are using Minds via the Python API, we rely on an external service for plots instead. QuickChart.io generates chart images from URL-encoded chart configurations, so charts can be embedded in Markdown without any client-side JavaScript.&lt;/p&gt;
&lt;/blockquote&gt;
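
&lt;p&gt;To make the URL format concrete, here is a minimal sketch of how such a chart URL can be assembled by hand. The &lt;code&gt;build_chart_url&lt;/code&gt; helper is a hypothetical name, the data is illustrative, and the sketch assumes the service accepts strict JSON in the &lt;code&gt;c&lt;/code&gt; parameter (the examples in the prompt use the looser JavaScript-object style with unquoted keys).&lt;/p&gt;

```python
import json
from urllib.parse import quote, unquote

# Chart.js-style configuration; labels and values here are illustrative only.
chart_config = {
    "type": "bar",
    "data": {
        "labels": ["North", "South", "East", "West"],
        "datasets": [{"label": "Revenue", "data": [120, 190, 300, 250]}],
    },
}


def build_chart_url(config):
    """Serialize the config to compact JSON and percent-encode it into 'c'."""
    payload = json.dumps(config, separators=(",", ":"))
    return "https://quickchart.io/chart?c=" + quote(payload, safe="")


url = build_chart_url(chart_config)
print(url)

# Round trip: decoding the query value gives back the same configuration.
decoded = json.loads(unquote(url.split("?c=", 1)[1]))
print(decoded == chart_config)
```

&lt;p&gt;In this tutorial the Mind writes these URLs itself; the sketch only shows what the encoded string contains.&lt;/p&gt;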

&lt;h2&gt;
  
  
  Step 5: Create the Mind
&lt;/h2&gt;

&lt;p&gt;With all components ready, we create the Mind. This registers the AI agent with MindsDB, connecting it to the specified datasource and prompt template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;minds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MIND_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;datasources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DATA_SOURCES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PROMPT_TEMPLATE&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;replace=True&lt;/code&gt; makes the new Mind overwrite any existing Mind with the same name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Create the OpenAI-Compatible Client
&lt;/h2&gt;

&lt;p&gt;MindsDB exposes Minds through an OpenAI-compatible API. This means you can use the standard OpenAI Python client to interact with your Mind:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;oa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MINDS_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7: Define Helper Functions
&lt;/h2&gt;

&lt;p&gt;As you will see shortly, we interact with our Mind the same way we would with a chatbot. First, though, we need a few helper functions to handle API responses, extract the final answer from the AI's reasoning trace, and render Markdown output.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.1 Non-Streaming Response Collection
&lt;/h3&gt;

&lt;p&gt;For simple use cases where you don't need real-time output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Collect the final response from the Mind.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MIND_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;full_response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The low &lt;code&gt;temperature=0.1&lt;/code&gt; setting makes responses more deterministic, which suits data analysis where consistency matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.2 Streaming Response with Reasoning Trace
&lt;/h3&gt;

&lt;p&gt;For a better user experience, streaming shows the AI's reasoning trace in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_and_collect_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Stream the LLM response, print reasoning trace, but only return the final answer
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== QUESTION ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== LLM REASONING TRACE START ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MIND_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Streaming error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Fallback to non-streaming
&lt;/span&gt;        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MIND_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== LLM REASONING TRACE END ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract only the final answer (everything after "I finished executing the SQL query")
&lt;/span&gt;    &lt;span class="n"&gt;final_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_final_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function provides transparency into the AI's decision-making process, showing the SQL it generates and executes before presenting the final answer.&lt;/p&gt;
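
&lt;p&gt;The chunk-handling logic can be exercised without a network call. The sketch below is an illustration under stated assumptions: &lt;code&gt;make_chunk&lt;/code&gt; and &lt;code&gt;accumulate&lt;/code&gt; are hypothetical helpers, and the &lt;code&gt;SimpleNamespace&lt;/code&gt; objects are stand-ins that only mimic the &lt;code&gt;choices[0].delta.content&lt;/code&gt; shape used in the function above. They are not real client types.&lt;/p&gt;

```python
from types import SimpleNamespace


def make_chunk(text):
    """Build a stand-in object shaped like a streamed chat-completion chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate(chunks):
    """Concatenate the text deltas of a stream, skipping empty or missing pieces."""
    parts = []
    for chunk in chunks:
        if not getattr(chunk, "choices", None):
            continue  # some chunks carry no choices at all
        delta = getattr(chunk.choices[0], "delta", None)
        content = getattr(delta, "content", None)
        if content is not None:
            parts.append(content)
    return "".join(parts)


# Simulated stream: a None delta and a chunk with no choices are ignored.
fake_stream = [
    make_chunk("Query executed: "),
    make_chunk(None),
    SimpleNamespace(choices=[]),
    make_chunk("SELECT 1"),
]
result = accumulate(fake_stream)
print(result)
```

&lt;p&gt;Collecting the pieces in a list and joining once at the end is a small design choice: it avoids repeated string concatenation on long responses.&lt;/p&gt;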

&lt;h3&gt;
  
  
  7.3 Extract Final Answer
&lt;/h3&gt;

&lt;p&gt;The AI's response includes both reasoning steps and the final answer. This function extracts just the user-facing content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_final_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Extract the final answer from the full response using the end-of-reasoning marker
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;end_marker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I finished executing the SQL query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Find the position of the end-of-reasoning marker
&lt;/span&gt;    &lt;span class="n"&gt;marker_pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_marker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;marker_pos&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract everything after the marker
&lt;/span&gt;        &lt;span class="n"&gt;final_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;marker_pos&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_marker&lt;/span&gt;&lt;span class="p"&gt;):].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_answer&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Fallback: if marker not found, return the full response
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;end_marker&lt;/code&gt; string acts as a delimiter between the AI's internal reasoning and the polished response meant for end users.&lt;/p&gt;
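
&lt;p&gt;A variant that keeps both halves can be handy when the reasoning should be logged rather than discarded. This is a minimal sketch: &lt;code&gt;split_trace&lt;/code&gt; is a hypothetical helper, and the sample response text is made up for illustration. Only the marker string itself comes from the function above.&lt;/p&gt;

```python
END_MARKER = "I finished executing the SQL query"


def split_trace(full_response, end_marker=END_MARKER):
    """Return (reasoning, answer); reasoning is empty when the marker is absent."""
    before, found, after = full_response.partition(end_marker)
    if not found:
        return "", full_response.strip()
    return before.strip(), after.strip()


# Hypothetical response text, made up for this illustration.
sample = (
    "Query executed: SELECT COUNT(*) FROM orders\n"
    "I finished executing the SQL query\n"
    "There are 42 orders."
)
reasoning, answer = split_trace(sample)
print(answer)

# Without the marker, the whole text is treated as the answer.
print(split_trace("There are 42 orders."))
```

&lt;p&gt;As in the original function, a response without the marker is returned whole, so the caller always gets some answer text back.&lt;/p&gt;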

&lt;h3&gt;
  
  
  7.4 Render Markdown Output
&lt;/h3&gt;

&lt;p&gt;Finally, we need to display the formatted response with charts rendered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Render the markdown content generated by the Mind.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
        &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Fallback for non-Jupyter environments: simply print the markdown string
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Jupyter environments, this renders the Markdown inline, including the chart images. If IPython is not installed, the function falls back to printing the raw Markdown, chart URLs included.&lt;/p&gt;
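
&lt;p&gt;Outside a notebook, it can be useful to pull the chart links out of the Markdown so they can be opened or downloaded separately. The sketch below is one possible approach, not part of the original notebook: &lt;code&gt;find_chart_urls&lt;/code&gt; and &lt;code&gt;IMAGE_PATTERN&lt;/code&gt; are hypothetical names, and the sample text is a shortened illustration.&lt;/p&gt;

```python
import re

# Matches Markdown images: ![alt text](url). The URL part is any run of
# non-whitespace characters, so parentheses inside the URL are tolerated.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\((\S+)\)")


def find_chart_urls(markdown_text):
    """Return a list of (alt_text, url) pairs for every image in the text."""
    return IMAGE_PATTERN.findall(markdown_text)


# Illustrative input, shortened from the kind of answer shown in this article.
sample = (
    "Here is a bar chart:\n\n"
    "![Revenue](https://quickchart.io/chart?c=%7Btype%3A%27bar%27%7D)\n"
)
charts = find_chart_urls(sample)
for alt, url in charts:
    print(f"{alt}: {url}")
```

&lt;p&gt;The pattern assumes each image link is followed by whitespace or the end of the text; two image links written back to back with no space between them would be merged into one match.&lt;/p&gt;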

&lt;h2&gt;
  
  
  Step 8: Create the Main Query Interface
&lt;/h2&gt;

&lt;p&gt;We wrap everything into a single, easy-to-use function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Ask the Mind a question, show the reasoning trace, and render the final answer as Markdown
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stream_and_collect_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;#response = collect_response(question)
&lt;/span&gt;    &lt;span class="nf"&gt;render_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the primary interface users will interact with: simply call &lt;code&gt;ask()&lt;/code&gt; with a natural-language question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example Queries and Results
&lt;/h2&gt;

&lt;p&gt;Let's see the system in action with real business questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query 1: Sales by Product Category
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the total sales revenue by product category?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== QUESTION ===
What's the total sales revenue by product category?

=== LLM REASONING TRACE START ===

I will now generate the SQL query to answer your question.Here is the generated SQL query along with its execution result:
Query executed: SELECT wp.category, SUM(s.sales) as total_sales_revenue 
FROM postgres_web_sales_datasource.websales_sales s 
JOIN postgres_web_sales_datasource.websales_products wp ON s.product_id = wp.product_id 
GROUP BY wp.category

Results: 3 rows x 2 columns

category                total_sales_revenue
Hybrid Work Essentials 1525521.9           
  Smart Office Devices 4288440.0           
     Connected Devices 1288424.8           Here are the total sales revenues by product category:

- **Smart Office Devices:** $4,288,440.00
- **Hybrid Work Essentials:** $1,525,521.90
- **Connected Devices:** $1,288,424.80

Here is a bar chart showing the sales revenue by category:

![Sales Revenue by Category](https://quickchart.io/chart?c=%7Btype%3A%27bar%27%2Cdata%3A%7Blabels%3A%5B%27Smart%20Office%20Devices%27%2C%27Hybrid%20Work%20Essentials%27%2C%27Connected%20Devices%27%5D%2Cdatasets%3A%5B%7Blabel%3A%27Total%20Sales%20Revenue%27%2Cdata%3A%5B4288440.0%2C1525521.9%2C1288424.8%5D%7D%5D%7D%7D)

=== LLM REASONING TRACE END ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquickchart.io%2Fchart%3Fc%3D%257Btype%253A%2527bar%2527%252Cdata%253A%257Blabels%253A%255B%2527Smart%2520Office%2520Devices%2527%252C%2527Hybrid%2520Work%2520Essentials%2527%252C%2527Connected%2520Devices%2527%255D%252Cdatasets%253A%255B%257Blabel%253A%2527Total%2520Sales%2520Revenue%2527%252Cdata%253A%255B4288440.0%252C1525521.9%252C1288424.8%255D%257D%255D%257D%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquickchart.io%2Fchart%3Fc%3D%257Btype%253A%2527bar%2527%252Cdata%253A%257Blabels%253A%255B%2527Smart%2520Office%2520Devices%2527%252C%2527Hybrid%2520Work%2520Essentials%2527%252C%2527Connected%2520Devices%2527%255D%252Cdatasets%253A%255B%257Blabel%253A%2527Total%2520Sales%2520Revenue%2527%252Cdata%253A%255B4288440.0%252C1525521.9%252C1288424.8%255D%257D%255D%257D%257D" alt="Sales Revenue by Category" width="1000" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
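
&lt;p&gt;Because the chart URL is written by the model, it is worth confirming that the plotted numbers match the query results. The following is a minimal sketch of such a check, using the chart URL and the totals from the Query 1 output. The variable names and the regular expression are illustrative choices, not part of the original notebook.&lt;/p&gt;

```python
import re
from urllib.parse import unquote

# Chart URL copied from the Query 1 output.
chart_url = (
    "https://quickchart.io/chart?c=%7Btype%3A%27bar%27%2Cdata%3A%7Blabels%3A%5B"
    "%27Smart%20Office%20Devices%27%2C%27Hybrid%20Work%20Essentials%27%2C"
    "%27Connected%20Devices%27%5D%2Cdatasets%3A%5B%7Blabel%3A%27Total%20Sales%20Revenue%27"
    "%2Cdata%3A%5B4288440.0%2C1525521.9%2C1288424.8%5D%7D%5D%7D%7D"
)

# Totals copied from the "Results" table in the same output (table order).
expected = [1525521.9, 4288440.0, 1288424.8]

config_text = unquote(chart_url.split("?c=", 1)[1])
print(config_text)

# The config is a JavaScript-style object, not strict JSON, so the numeric
# series is pulled out with a regular expression instead of json.loads.
series = re.search(r"data:\[([0-9.,]+)\]", config_text).group(1)
plotted = [float(value) for value in series.split(",")]

# The chart lists categories in a different order, so compare sorted values.
print(sorted(plotted) == sorted(expected))
```

&lt;p&gt;The check compares values only; matching each value to its label would need the &lt;code&gt;labels&lt;/code&gt; array to be parsed as well.&lt;/p&gt;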

&lt;h3&gt;
  
  
  Query 2: Revenue by Shipping Mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the total sales revenue for each shipping mode, and which delivery option generates the most revenue?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== QUESTION ===
What is the total sales revenue for each shipping mode, and which delivery option generates the most revenue?

=== LLM REASONING TRACE START ===

I will now generate the SQL query to answer your question.Here is the generated SQL query along with its execution result:
Query executed: SELECT wo.ship_mode, SUM(ws.sales) AS total_revenue
FROM postgres_web_sales_datasource.websales_orders wo
JOIN postgres_web_sales_datasource.websales_sales ws ON wo.order_id = ws.order_id
GROUP BY wo.ship_mode
ORDER BY total_revenue DESC

Results: 3 rows x 2 columns

ship_mode         total_revenue
 Premium Express 2407209.8     
    Eco Delivery 2397529.8     
Instant Delivery 2297646.0     Here is the total sales revenue generated by each shipping mode:

| Shipping Mode       | Total Revenue  |
|---------------------|----------------|
| Premium Express     | $2,407,209.80  |
| Eco Delivery        | $2,397,529.80  |
| Instant Delivery    | $2,297,646.00  |

The shipping mode that generates the most revenue is **Premium Express** with a total revenue of $2,407,209.80.

![Revenue by Shipping Mode](https://quickchart.io/chart?c=%7Btype%3A%27bar%27%2Cdata%3A%7Blabels%3A%5B%27Premium%20Express%27%2C%27Eco%20Delivery%27%2C%27Instant%20Delivery%27%5D%2Cdatasets%3A%5B%7Blabel%3A%27Total%20Revenue%27%2Cdata%3A%5B2407209.8%2C2397529.8%2C2297646.0%5D%2CbackgroundColor%3A%5B%27%234BC0C0%27%2C%27%23FFCE56%27%2C%27%23FF6384%27%5D%7D%5D%7D%7D)

=== LLM REASONING TRACE END ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquickchart.io%2Fchart%3Fc%3D%257Btype%253A%2527bar%2527%252Cdata%253A%257Blabels%253A%255B%2527Premium%2520Express%2527%252C%2527Eco%2520Delivery%2527%252C%2527Instant%2520Delivery%2527%255D%252Cdatasets%253A%255B%257Blabel%253A%2527Total%2520Revenue%2527%252Cdata%253A%255B2407209.8%252C2397529.8%252C2297646.0%255D%252CbackgroundColor%253A%255B%2527%25234BC0C0%2527%252C%2527%2523FFCE56%2527%252C%2527%2523FF6384%2527%255D%257D%255D%257D%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquickchart.io%2Fchart%3Fc%3D%257Btype%253A%2527bar%2527%252Cdata%253A%257Blabels%253A%255B%2527Premium%2520Express%2527%252C%2527Eco%2520Delivery%2527%252C%2527Instant%2520Delivery%2527%255D%252Cdatasets%253A%255B%257Blabel%253A%2527Total%2520Revenue%2527%252Cdata%253A%255B2407209.8%252C2397529.8%252C2297646.0%255D%252CbackgroundColor%253A%255B%2527%25234BC0C0%2527%252C%2527%2523FFCE56%2527%252C%2527%2523FF6384%2527%255D%257D%255D%257D%257D" alt="Revenue by Shipping Mode" width="1000" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Query 3: Customer Segment Comparison
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How does total sales revenue compare across different customer segments (Startup vs Enterprise vs others)?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== QUESTION ===
How does total sales revenue compare across different customer segments (Startup vs Enterprise vs others)?

=== LLM REASONING TRACE START ===

I will now generate the SQL query to answer your question.Here is the generated SQL query along with its execution result:
Query executed: SELECT c.segment, SUM(s.sales) AS total_sales_revenue
FROM postgres_web_sales_datasource.websales_sales s
JOIN postgres_web_sales_datasource.websales_customers c ON s.customer_id = c.customer_id
GROUP BY c.segment

Results: 3 rows x 2 columns

segment        total_sales_revenue
   Enterprise 2137724.0           
Remote Worker 1285370.2           
      Startup 3679287.8           Here's how the total sales revenue compares across different customer segments:

- **Enterprise**: $2,137,724
- **Remote Worker**: $1,285,370.2
- **Startup**: $3,679,287.8

As we can see, the "Startup" segment generates the highest total sales revenue, followed by "Enterprise", and finally "Remote Worker".

Here's a bar chart illustrating the total sales revenue for each customer segment:

![Total Sales Revenue by Segment](https://quickchart.io/chart?c=%7Btype%3A%27bar%27%2Cdata%3A%7Blabels%3A%5B%27Enterprise%27%2C%27Remote%20Worker%27%2C%27Startup%27%5D%2Cdatasets%3A%5B%7Blabel%3A%27Total%20Sales%20Revenue%27%2Cdata%3A%5B2137724%2C1285370.2%2C3679287.8%5D%7D%5D%7D%7D)

=== LLM REASONING TRACE END ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquickchart.io%2Fchart%3Fc%3D%257Btype%253A%2527bar%2527%252Cdata%253A%257Blabels%253A%255B%2527Enterprise%2527%252C%2527Remote%2520Worker%2527%252C%2527Startup%2527%255D%252Cdatasets%253A%255B%257Blabel%253A%2527Total%2520Sales%2520Revenue%2527%252Cdata%253A%255B2137724%252C1285370.2%252C3679287.8%255D%257D%255D%257D%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquickchart.io%2Fchart%3Fc%3D%257Btype%253A%2527bar%2527%252Cdata%253A%257Blabels%253A%255B%2527Enterprise%2527%252C%2527Remote%2520Worker%2527%252C%2527Startup%2527%255D%252Cdatasets%253A%255B%257Blabel%253A%2527Total%2520Sales%2520Revenue%2527%252Cdata%253A%255B2137724%252C1285370.2%252C3679287.8%255D%257D%255D%257D%257D" alt="Total Sales Revenue by Segment" width="1000" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
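
&lt;p&gt;The chart links in the traces above follow a simple pattern: a chart configuration is URL-encoded and appended to &lt;code&gt;https://quickchart.io/chart?c=&lt;/code&gt;. As a minimal sketch, the helper below rebuilds the Query 3 link from its result rows. The name &lt;code&gt;quickchart_bar_url&lt;/code&gt; is hypothetical (it is not part of MindsDB or QuickChart), and the sketch assumes labels that contain no apostrophes.&lt;/p&gt;

```python
from urllib.parse import quote


def quickchart_bar_url(labels, values, series_label):
    """Build a QuickChart bar-chart URL shaped like the links in the traces.

    Hypothetical helper for illustration; labels must not contain apostrophes.
    """
    label_list = ",".join("'" + str(label) + "'" for label in labels)
    value_list = ",".join(str(value) for value in values)
    config = (
        "{type:'bar',data:{labels:[" + label_list + "],"
        + "datasets:[{label:'" + series_label + "',data:[" + value_list + "]}]}}"
    )
    # Percent-encode every reserved character so the config survives as one query value.
    return "https://quickchart.io/chart?c=" + quote(config, safe="")


# Values copied from the Query 3 result set above.
segments = ["Enterprise", "Remote Worker", "Startup"]
totals = [2137724, 1285370.2, 3679287.8]
chart_url = quickchart_bar_url(segments, totals, "Total Sales Revenue")
print(chart_url)
```

&lt;p&gt;In the tutorial the Mind writes this link itself, guided by the prompt template. A helper like this is mainly useful if you want to post-process query results or verify a generated link.&lt;/p&gt;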

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you learned how to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up MindsDB Minds — Create intelligent AI agents connected to your databases and allow users to query data without writing SQL&lt;/li&gt;
&lt;li&gt;Design effective prompt templates — Guide the AI with schema information, examples, and behavioral rules&lt;/li&gt;
&lt;li&gt;Generate dynamic visualizations — Leverage QuickChart.io for automatic chart generation&lt;/li&gt;
&lt;li&gt;Handle streaming responses — Provide real-time feedback and transparency into AI reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Future Steps
&lt;/h2&gt;

&lt;p&gt;To extend this tutorial, consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add more chart types: Expand the prompt template with examples for scatter plots, area charts, and multi-series visualizations&lt;/li&gt;
&lt;li&gt;Expand the datasource: Connect additional tables or databases to answer more complex cross-domain questions&lt;/li&gt;
&lt;li&gt;Build a web interface: Wrap the &lt;code&gt;ask()&lt;/code&gt; function in a Flask or FastAPI application for broader access&lt;/li&gt;
&lt;/ol&gt;
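
&lt;p&gt;As a rough starting point for the web interface idea, here is a dependency-free sketch that uses the standard library's WSGI support in place of Flask or FastAPI. The &lt;code&gt;ask&lt;/code&gt; function below is a placeholder standing in for the tutorial's function, and the &lt;code&gt;q&lt;/code&gt; query parameter, the JSON shape, and the port are all assumptions.&lt;/p&gt;

```python
import json
from urllib.parse import parse_qs


def ask(question):
    # Placeholder: swap in the tutorial's ask() and return its final answer text.
    return "You asked: " + question


def app(environ, start_response):
    """Minimal WSGI app: GET /?q=your+question returns a JSON answer."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    question = params.get("q", [""])[0].strip()
    if question:
        status = "200 OK"
        payload = {"question": question, "answer": ask(question)}
    else:
        status = "400 Bad Request"
        payload = {"error": "missing q parameter"}
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Run the app with the standard library's development server."""
    from wsgiref.simple_server import make_server

    with make_server(host, port, app) as server:
        server.serve_forever()
```

&lt;p&gt;Calling &lt;code&gt;serve()&lt;/code&gt; starts a local development server. The same &lt;code&gt;app&lt;/code&gt; callable can later be replaced by a Flask or FastAPI route once you need authentication, streaming, or production hosting.&lt;/p&gt;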

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By now, you’ve built a complete AI-powered analytics system using MindsDB Minds — one that can translate natural language into SQL, execute queries safely, and return clear insights with automatically generated charts. What once required dashboards, analysts, and BI tools can now be done by simply asking a question.&lt;/p&gt;

&lt;p&gt;This approach doesn’t just make analytics easier — it makes insights accessible to anyone, regardless of technical skill. MindsDB Minds turns your database into a conversational interface, unlocking faster decisions, richer exploration, and a more intuitive way to understand your business data.&lt;/p&gt;

&lt;p&gt;As you continue exploring, you can extend your Mind with additional datasets, new chart types, and even a web interface to share with your team. This is just the starting point. With MindsDB, you have everything you need to build interactive, intelligent analytics experiences that grow with your data.&lt;/p&gt;

&lt;p&gt;If you're ready to take the next step, whether that means exploring more examples, connecting new sources, or building your own AI-powered analytics applications, &lt;a href="https://mindsdb.com/contact" rel="noopener noreferrer"&gt;contact our team for a demo.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>MindsDB in 2025: From SQL to the Universal AI Data Hub</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Tue, 30 Dec 2025 00:33:25 +0000</pubDate>
      <link>https://dev.to/mindsdb/mindsdb-in-2025-from-sql-to-the-universal-ai-data-hub-40k8</link>
      <guid>https://dev.to/mindsdb/mindsdb-in-2025-from-sql-to-the-universal-ai-data-hub-40k8</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Alejandro Cantu, Senior Product Manager &amp;amp; Martyna Slawinska, Technical Product Manager at MindsDB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In 2025, the challenge of connecting enterprise data to AI shifted from "if" to "how fast." For the MindsDB open-source community, this year was a journey of evolution—moving beyond predictive models to building the essential infrastructure for the Agentic Web.&lt;/p&gt;

&lt;p&gt;We began the year with a focus on Agentic AI, believing it would fundamentally reshape how we build and interact with software. Over the last 12 months, we’ve worked alongside our contributors to make that vision accessible to everyone. By sticking to our core philosophy—Connect, Unify, Respond—we have grown into a platform where developers can confidently build self-reasoning agents on top of any data source.&lt;/p&gt;

&lt;p&gt;Here is the scorecard for 2025 and a look at how we got here.&lt;/p&gt;

&lt;h3&gt;
  
  
  2025 By The Numbers
&lt;/h3&gt;

&lt;p&gt;It was a record-breaking year for activity in our repository.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;11 Major Monthly Releases (v25.1 to v25.11).&lt;/li&gt;
&lt;li&gt;1,500+ Pull Requests merged to main.&lt;/li&gt;
&lt;li&gt;37,400+ GitHub Stars (and counting).&lt;/li&gt;
&lt;li&gt;800+ Active Contributors driving the ecosystem forward.&lt;/li&gt;
&lt;li&gt;500k+ Docker pulls&lt;/li&gt;
&lt;li&gt;850k+ pip installs (no mirrors)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr30ux5fou1rdwf0ga5fa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr30ux5fou1rdwf0ga5fa.png" alt="mindsdb" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. CONNECT: The Universal Adapter for Agents
&lt;/h3&gt;

&lt;p&gt;The Goal: Make every data source "AI-Ready" instantly.&lt;/p&gt;

&lt;p&gt;The biggest shift in 2025 was moving beyond simple database connections to becoming the universal language for AI Agents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One Interface to Rule Them All:&lt;/strong&gt; By abstracting 200+ data sources behind a &lt;strong&gt;universal SQL-like interface&lt;/strong&gt;, we removed the need for agents to learn hundreds of different APIs. Whether it's a Postgres database, a Slack channel, or a Salesforce CRM, your agent logic stays simple, portable, and decoupled from the underlying data complexity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Big Pivot (MCP Support):&lt;/strong&gt; In Q1, we re-architected our API layer to support the &lt;a href="https://mindsdb.com/blog/mindsdb-now-supports-model-context-protocol-the-unified-ai-data-hub-your-enterprise-needs" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;. This effectively turned MindsDB into a "universal adapter," allowing agents (like Claude Desktop) to plug into any backend data source without custom code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google MCP Toolbox (Nov):&lt;/strong&gt; We collaborated with Google to &lt;a href="https://mindsdb.com/blog/mindsdb-supercharges-google-s-mcp-toolbox-with-unstructured-data-support" rel="noopener noreferrer"&gt;supercharge their MCP Toolbox&lt;/a&gt;, bringing unstructured data support to the broader developer ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Integrations:&lt;/strong&gt; We didn't just add more handlers; we made them smarter. We rolled out verified, agent-ready integrations for major platforms like Snowflake, BigQuery, Salesforce, Oracle, Databricks, PostgreSQL, MySQL, SQL Server, and Gong, enhancing them with advanced metadata extraction so AI agents can autonomously understand and navigate your data schemas. (Plus, we have Jira and Elasticsearch coming soon!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa0p46drp85a0cfy67mh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa0p46drp85a0cfy67mh.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. UNIFY: Democratizing RAG
&lt;/h3&gt;

&lt;p&gt;The Goal: Treat unstructured data (PDFs, Docs) just like a SQL table.&lt;/p&gt;

&lt;p&gt;In Spring, we solved the "Last Mile" problem of AI: retrieval. We believed you shouldn't need a PhD in vector databases to build a knowledge bot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://mindsdb.com/blog/beyond-keywords-introducing-mindsdb-knowledge-bases-for-rag-and-semantic-search" rel="noopener noreferrer"&gt;Knowledge Bases (April):&lt;/a&gt;&lt;/strong&gt; We officially launched &lt;strong&gt;Knowledge Bases&lt;/strong&gt;, allowing users to ingest documents and query them with SQL-like syntax. This was enabled by a massive Q1 refactor of our storage engine to handle embeddings natively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://mindsdb.com/blog/introducing-mindsdb-s-hybrid-search-find-what-matters-in-a-sea-of-enterprise-data" rel="noopener noreferrer"&gt;Hybrid Search (August):&lt;/a&gt;&lt;/strong&gt; We upgraded the engine to support &lt;strong&gt;Hybrid Search&lt;/strong&gt;, blending keyword accuracy (BM25) with semantic understanding (Vector).

&lt;ul&gt;
&lt;li&gt;Result: AI answers that are both factually accurate and contextually rich.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Virtual Metadata Tables (June):&lt;/strong&gt; We introduced &lt;code&gt;META_HANDLER_INFO&lt;/code&gt; and &lt;code&gt;META_COLUMNS&lt;/code&gt;, allowing agents to introspect the database structure themselves—a critical step for self-healing agent workflows.&lt;/li&gt;

&lt;/ul&gt;
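
&lt;p&gt;To make the idea of blending keyword and semantic relevance concrete, here is a generic sketch of weighted score fusion. It is not MindsDB's implementation: the function names, the min-max normalization, and the &lt;code&gt;alpha&lt;/code&gt; weight are assumptions chosen for illustration, and the scores are made-up sample values.&lt;/p&gt;

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the 0..1 range."""
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {doc: 1.0 for doc in scores}
    return {doc: (score - lowest) / (highest - lowest) for doc, score in scores.items()}


def hybrid_rank(keyword_scores, vector_scores, alpha=0.5):
    """Blend keyword and vector scores; alpha is the weight of the vector side."""
    keyword = min_max(keyword_scores)
    vector = min_max(vector_scores)
    docs = set(keyword) | set(vector)
    blended = {
        doc: alpha * vector.get(doc, 0.0) + (1 - alpha) * keyword.get(doc, 0.0)
        for doc in docs
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: "a" wins on keywords, "b" wins on semantic similarity.
keyword_scores = {"a": 10.0, "b": 5.0, "c": 0.0}
vector_scores = {"a": 0.2, "b": 0.9, "c": 0.4}
ranking = hybrid_rank(keyword_scores, vector_scores, alpha=0.5)
print([doc for doc, _ in ranking])
```

&lt;p&gt;With an equal weighting, a document that scores reasonably well on both signals can outrank one that dominates only a single signal, which is the practical appeal of hybrid retrieval.&lt;/p&gt;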

&lt;p&gt;See how this allows you to &lt;a href="https://www.youtube.com/watch?v=HN4fHtS4mvo" rel="noopener noreferrer"&gt;search unstructured data with SQL precision here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7btogu9ral3u2setohjb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7btogu9ral3u2setohjb.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. RESPOND: Talking to Your Data
&lt;/h3&gt;

&lt;p&gt;The Goal: A conversational interface for your data.&lt;/p&gt;

&lt;p&gt;We wanted to empower agent builders—from expert engineers to "vibe coders"—to interact with data intuitively. Getting answers from your data is easier than ever; you can now use natural language instead of writing complex SQL yourself.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat with Your Data (May):&lt;/strong&gt; We released the &lt;a href="https://mindsdb.com/newsroom/chat-with-your-data-mindsdb-launches-open-source-ai-interface-for-databases-and-documents" rel="noopener noreferrer"&gt;Open Source Chat Interface.&lt;/a&gt; This major UX overhaul unified structured SQL querying and unstructured document chat into one seamless window, allowing developers to prototype agent interactions instantly.

&lt;ul&gt;
&lt;li&gt;Technical Win: This required merging our websocket handling and state management into a unified "Agent" backend.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Agent Skills:&lt;/strong&gt; Throughout Q3, we refined how Agents handle SQL generation. They can now autonomously decide when to query a database vs. when to search a Knowledge Base, effectively giving your applications a "Natural Language API" for any connected data source.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fars3mmswyof0zljh71k3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fars3mmswyof0zljh71k3.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. EXPERIENCE: A GUI Built for Builders
&lt;/h3&gt;

&lt;p&gt;The Goal: A seamless, intuitive environment for AI development.&lt;/p&gt;

&lt;p&gt;We didn't just upgrade the engine; we overhauled the dashboard. In 2025, the MindsDB GUI evolved into a full-fledged IDE for AI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Refined Workflow:&lt;/strong&gt; We introduced a richer tab experience with drag-and-drop organization, session persistence, and per-tab storage, ensuring you never lose context when switching between tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Overhaul:&lt;/strong&gt; From the highly requested Dark Mode to a revamped full-width sidebar, the interface is now cleaner and easier to navigate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated Resources:&lt;/strong&gt; We embedded documentation directly into the GUI, so you can find answers without leaving your workspace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless Onboarding:&lt;/strong&gt; A new onboarding process makes getting started with MindsDB faster and more intuitive for new users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Management:&lt;/strong&gt; Enhanced model management screens provide deeper visibility and control, making it easier to oversee your AI models as you scale. You can now easily configure and switch between top-tier providers including &lt;strong&gt;OpenAI, Anthropic, Google Gemini, NVIDIA NIM, Ollama, and AWS Bedrock&lt;/strong&gt;, giving you the flexibility to choose the right model for the job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication &amp;amp; Access:&lt;/strong&gt; We added standalone local login support with a fresh brand identity and streamlined OAuth support for tools like Microsoft Teams and Google Calendar.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Foundation: Stability, Performance &amp;amp; Security
&lt;/h2&gt;

&lt;p&gt;While features grab headlines, reliability runs production. This year saw the most significant engineering overhauls in our history.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime and Operational Stability (September)
&lt;/h3&gt;

&lt;p&gt;In v25.9, we made significant changes to improve runtime stability and operational reliability.&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminated orphaned background processes in containerized environments&lt;/li&gt;
&lt;li&gt;Reduced memory usage across typical deployments&lt;/li&gt;
&lt;li&gt;Simplified setup and operational troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance: Faster Inference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smarter SQL:&lt;/strong&gt; We overhauled planning, pruning, and pushdown; optimized handlers; reduced cold-start penalties; and validated everything with large-scale benchmarks. The result: faster federated SQL, smarter agents, and a cleaner architecture ready for the next step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster Query Execution:&lt;/strong&gt; Previously, MindsDB retrieved handler information directly from the underlying database – a process that added overhead to every query. We’ve now re-engineered this flow so that handler data is read as lightweight, in-code metadata. This architectural shift eliminates unnecessary I/O and streamlines execution. See results below.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Startup and Shutdown Times:&lt;/strong&gt; Historically, each new process spent 5+ seconds of CPU time initializing the MindsDB SQL parser, because a set of complex parsing tables was being regenerated on every startup. We have mitigated this by adopting and integrating smart caching capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Logging and Configurability:&lt;/strong&gt; We improved the clarity, structure, and consistency of log output, making it easier for users to diagnose issues and monitor system behavior. In addition, logging levels are now fully configurable via the MindsDB configuration file.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security: Enterprise-Grade Trust
&lt;/h3&gt;

&lt;p&gt;Trust is paramount. Throughout 2025, we hardened the platform to meet the needs of our growing enterprise user base.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SOC2 Certification:&lt;/strong&gt; We officially achieved SOC2 compliance, validating our commitment to the highest standards of data security and operational governance. You can verify our security posture in real-time at our new &lt;a href="https://trust.mindsdb.com/?_gl=1*16vmggn*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;Trust Center.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proactive Patching:&lt;/strong&gt; We stayed ahead of the curve, patching critical CVEs (e.g., CVE-2024-45853) and updating dependencies like Werkzeug and Ray before they became risks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret Management:&lt;/strong&gt; We overhauled how environment secrets and SSL verifications are handled, ensuring that your connection to enterprise data sources remains hermetically sealed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb887utit8xvwqa3ad26y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb887utit8xvwqa3ad26y.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready for 2026
&lt;/h2&gt;

&lt;p&gt;2025 was the year MindsDB matured from a tool into a platform. Whether you’re building customer support agents, financial analysis bots, or just querying your database in natural language, the foundation is stronger than ever.&lt;/p&gt;

&lt;p&gt;Thank you to our community, our contributors, and our users for building with us. See you in 2026!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>MySQL &amp; MindsDB Unlocks Intelligent Content Discovery For Web CMS with Knowledge Bases and Cursor</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Tue, 30 Dec 2025 00:08:01 +0000</pubDate>
      <link>https://dev.to/mindsdb/mysql-mindsdb-unlocks-intelligent-content-discovery-for-web-cms-with-knowledge-bases-and-cursor-13ob</link>
      <guid>https://dev.to/mindsdb/mysql-mindsdb-unlocks-intelligent-content-discovery-for-web-cms-with-knowledge-bases-and-cursor-13ob</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Chandre Van Der Westhuizen, Community &amp;amp; Marketing Co-ordinator at MindsDB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Content teams run on data — not just page views, but topic trends, engagement behavior, author performance, and semantic relationships across hundreds or thousands of articles. And most of this content lives in MySQL databases powering CMS platforms like WordPress, Ghost, Webflow, Strapi, or custom-built systems.&lt;/p&gt;

&lt;p&gt;But while MySQL gives you structured data (titles, metadata, authors, timestamps), it wasn’t designed to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Which articles discuss AI trends and have high engagement?”&lt;/li&gt;
&lt;li&gt;“Which authors consistently publish top-performing content?”&lt;/li&gt;
&lt;li&gt;“Show me content with a high bounce rate but low click-through rate.”&lt;/li&gt;
&lt;li&gt;“How many blog posts relate semantically to ‘cloud security’?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions require semantic understanding + structured analytics, something that traditionally requires multiple systems: ETL jobs, a separate vector database, a search index, and various analytics tools.&lt;/p&gt;

&lt;p&gt;MindsDB removes all of that overhead by allowing MySQL data to be queried, enriched, and analyzed with AI directly — using Knowledge Bases, Hybrid Search, and the MCP Server with Cursor for natural-language access.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insights Content Marketing Teams Don't Have Access To
&lt;/h2&gt;

&lt;p&gt;Content-driven companies—from media publishers to SaaS documentation teams—run on data stored inside MySQL-based Web CMS systems. Titles, slugs, authors, tags, publish dates, engagement metrics—everything lives neatly in tables.&lt;/p&gt;

&lt;p&gt;But the actual content—the article bodies, guides, announcements, long-form pages—lives as unstructured text. When you need to search across hundreds or thousands of pieces, traditional CMS search breaks down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keyword search only catches exact matches&lt;/li&gt;
&lt;li&gt;Metadata queries can’t capture semantic meaning&lt;/li&gt;
&lt;li&gt;Editors waste time digging for articles, reusing content manually, or re-creating pieces that already exist&lt;/li&gt;
&lt;li&gt;AI-driven tools struggle because the CMS provides no semantic understanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is where MindsDB changes the game.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By connecting your MySQL CMS directly to MindsDB, you can build AI-native search, insights, and content automation—without ETL pipelines, custom vector stores, or rewriting your CMS stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1jlm7nwy0lgq0exy7aj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1jlm7nwy0lgq0exy7aj.png" alt="mindsdb" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  MindsDB Bridges The Gap for Content Teams Working in MySQL
&lt;/h2&gt;

&lt;p&gt;For teams managing large volumes of CMS data inside MySQL, MindsDB unlocks capabilities that traditional SQL alone cannot provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered insight directly on top of your existing MySQL tables — no migrations or ETL.&lt;/li&gt;
&lt;li&gt;Hybrid Search that blends semantic meaning with precise SQL filters for richer content discovery.&lt;/li&gt;
&lt;li&gt;Knowledge Bases that turn unstructured CMS text into searchable, analyzable AI-ready data.&lt;/li&gt;
&lt;li&gt;Natural-language querying via MindsDB’s MCP Server + Cursor, enabling editors and analysts to “ask” questions instead of writing complex SQL.&lt;/li&gt;
&lt;li&gt;Smarter decision-making around content performance, SEO, user behavior, and trends — all while keeping MySQL as your source of truth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let’s explore how you can build an intelligent AI layer on top of your Web CMS data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqatx27oa2n7lvajz0t9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqatx27oa2n7lvajz0t9q.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Elevate Your MySQL CMS: Unlock AI-Powered Content Intelligence with MindsDB
&lt;/h2&gt;

&lt;p&gt;Now that we’ve explored why AI unlocks deeper insight from web-based content systems, let’s walk through how to actually build this inside MindsDB. In the next steps, you’ll learn how to take CMS data already stored in MySQL—titles, slugs, tags, authors, publish dates, full text—and convert it into powerful &lt;strong&gt;AI-ready &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview?_gl=1*qlerv4*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;Knowledge Bases&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To demonstrate this, we will make use of a sample Web CMS dataset hosted in MySQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Access MindsDB’s GUI via &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker?_gl=1*15ti9cl*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; locally or &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker-desktop?_gl=1*15ti9cl*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;MindsDB’s extension on Docker Desktop.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Configure your default models in the MindsDB GUI by navigating to Settings → Models.&lt;/li&gt;
&lt;li&gt;Navigate to Manage Integrations in Settings and install the dependencies for MariaDB.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After the dependencies have been successfully installed, you can connect your MySQL data to MindsDB.&lt;/p&gt;

&lt;p&gt;You can establish a connection with MySQL using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/sql/create/database?_gl=1*10x1wp9*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;CREATE DATABASE&lt;/a&gt; syntax in MindsDB’s SQL Editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'mysql'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"samples.mindsdb.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3306&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"MindsDBUser123!"&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This connection will give you access to the Web CMS data and CMS Performance Metrics data.&lt;/p&gt;
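
&lt;p&gt;If you would rather not paste credentials into the editor by hand, one option is to render the statement from a parameter dictionary. The sketch below assumes a hypothetical helper, &lt;code&gt;create_database_sql&lt;/code&gt;, and a hypothetical environment variable, &lt;code&gt;CMS_MYSQL_PASSWORD&lt;/code&gt;. It only builds the SQL text and does not talk to MindsDB.&lt;/p&gt;

```python
import json
import os


def create_database_sql(name, engine, parameters):
    """Render a CREATE DATABASE statement in the shape shown above.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if not name.isidentifier():
        raise ValueError("connection name must be a plain identifier")
    # JSON output matches the double-quoted key/value style of PARAMETERS.
    params_json = json.dumps(parameters, indent=4)
    lines = [
        "CREATE DATABASE " + name,
        "WITH ENGINE = '" + engine + "',",
        "PARAMETERS = " + params_json + ";",
    ]
    return "\n".join(lines)


connection = {
    "host": "samples.mindsdb.com",
    "port": 3306,
    "database": "public",
    "user": "user",
    # Assumed variable name; falls back to the sample password from the tutorial.
    "password": os.environ.get("CMS_MYSQL_PASSWORD", "MindsDBUser123!"),
}
statement = create_database_sql("mysql", "mysql", connection)
print(statement)
```

&lt;p&gt;The printed statement can be pasted into the SQL Editor as-is. For your own database, replace the sample host and user and supply the password through the environment.&lt;/p&gt;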

&lt;h3&gt;
  
  
  How MindsDB Unifies MySQL Web CMS Data with Knowledge Bases
&lt;/h3&gt;

&lt;p&gt;MindsDB Knowledge Bases turn your existing CMS data into an AI-ready search and reasoning layer. Instead of treating content as plain text inside tables, Knowledge Bases enrich it with metadata, embeddings, and hybrid search capabilities. This means your articles, pages, logs, and performance metrics become semantically searchable, instantly retrievable, and usable by AI agents without moving or duplicating data.&lt;/p&gt;

&lt;p&gt;We will make use of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Web CMS data that stores information about blog content&lt;/li&gt;
&lt;li&gt;The CMS Performance Metrics data, which stores information about how each blog post performs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To start, we will create a Knowledge Base using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/create?_gl=1*l7i6q0*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;CREATE KNOWLEDGE_BASE&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
    &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;heroku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'slug'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'author'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'updated_at'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'tags'&lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'content'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the parameters provided: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;web_cms_kb: The name of the Knowledge Base.&lt;/li&gt;
&lt;li&gt;storage: The table where the Knowledge Base’s embeddings are stored. Here we use the PGVector database we connected to earlier (heroku) and name the storage table cms.&lt;/li&gt;
&lt;li&gt;metadata_columns: The columns stored as metadata, which can be used for metadata filtering.&lt;/li&gt;
&lt;li&gt;content_columns: The columns whose content is embedded for semantic search.&lt;/li&gt;
&lt;li&gt;id_column: The column that uniquely identifies each source data row in the Knowledge Base.&lt;/li&gt;
&lt;/ul&gt;
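
&lt;p&gt;To confirm the Knowledge Base was registered with the settings you expect, you can inspect it before inserting any data. This is a minimal sketch that assumes the &lt;code&gt;SHOW KNOWLEDGE_BASES&lt;/code&gt; and &lt;code&gt;DESCRIBE KNOWLEDGE_BASE&lt;/code&gt; statements are available in your MindsDB version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List the Knowledge Bases in the current project
SHOW KNOWLEDGE_BASES;

-- Show the configuration of the one we just created
DESCRIBE KNOWLEDGE_BASE web_cms_kb;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;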

&lt;p&gt;You can insert the data using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/insert_data?_gl=1*1kt2io6*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;INSERT INTO&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;real_cms_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once inserted, you can select the data in the Knowledge Bases using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/query?_gl=1*1mkg8z*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;SELECT&lt;/a&gt; syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3b4ufmgzqfk9rceohe61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3b4ufmgzqfk9rceohe61.png" alt="mindsdb" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can follow the same instructions to create a Knowledge Base for the Performance Metrics data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create Knowledge Base&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
    &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;heroku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cms_performances_kb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'views'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'clicks'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ctr'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'avg_time_seconds'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bounce_rate'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'date'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'cms_id'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert data into Knowledge Base&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`views`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_time_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bounce_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`date`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cms_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cms_6month_performance_metrics&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Select Knowledge Base&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax7s2pewfahwuoqr5kqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax7s2pewfahwuoqr5kqo.png" alt="mindsdb" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the Knowledge Bases are created, we can perform Hybrid Search to gain some insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hybrid Search: The Missing Link Between Structured MySQL Data and AI Understanding
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/hybrid_search?_gl=1*129vmlt*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;Hybrid Search&lt;/a&gt; in MindsDB brings together the best of semantic search and traditional filtering so teams can find the right content—not just content that contains matching keywords.&lt;/p&gt;

&lt;p&gt;For MySQL-backed CMS systems, this means your content finally becomes discoverable in the way people actually search—using natural language, intent, and structured filters together.&lt;/p&gt;

&lt;p&gt;CMS teams quickly understand how much of their content relates to a specific topic—guiding editorial planning, SEO focus, and audience targeting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Identify how many articles relate to a certain subject&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'food'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It finds all CMS articles whose content semantically relates to the subject “food”, using hybrid search to surface both direct matches and deeper contextual connections. You can identify if this topic has been explored yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fdzpzleuorybwzpdozs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fdzpzleuorybwzpdozs.png" alt="mindsdb" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;
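
&lt;p&gt;If a broad term such as “food” returns too many loosely related rows, the search can be tightened. The sketch below is a variation on the query above, not part of the original walkthrough. It assumes the &lt;code&gt;relevance&lt;/code&gt; filter and the explicit &lt;code&gt;hybrid_search&lt;/code&gt; flag described in the MindsDB Knowledge Base documentation, where &lt;code&gt;hybrid_search_alpha&lt;/code&gt; ranges from 0 (keyword relevance weighs most) to 1 (semantic similarity weighs most). The 0.5 threshold and the limit of 10 are arbitrary illustrative values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Variation: keep only reasonably relevant matches
SELECT * FROM web_cms_kb
WHERE content = 'food'
AND hybrid_search = true          -- explicit flag shown in the docs; may be optional in your version
AND hybrid_search_alpha = 0.8     -- closer to 1 favors semantic similarity
AND relevance &amp;gt;= 0.5           -- illustrative threshold, tune for your data
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Lowering the alpha toward 0 should favor exact keyword matches, which tends to suit identifiers or product names better than broad topics.&lt;/p&gt;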

&lt;p&gt;The following query returns blog posts with at least 7,000 views, joined with their CMS details so you can see which authors wrote them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Identify authors whose blogs bring in high views&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;views&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;7000&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CMS teams can quickly identify top-performing authors, understand what drives engagement, and prioritize creators or topics that meaningfully grow traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx228p04tjc9uodilq4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx228p04tjc9uodilq4u.png" alt="mindsdb" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;
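
&lt;p&gt;The threshold query above lists individual high-view posts. To judge whether an author performs well consistently, an aggregate over the source tables may be more direct. This sketch assumes that &lt;code&gt;title&lt;/code&gt; identifies the same article in both MySQL tables (as the &lt;code&gt;id_column&lt;/code&gt; settings above suggest) and that MindsDB can run the join and aggregation across them. The aliases and output column names are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Average views per author across the metrics period
SELECT c.author,
       COUNT(*) AS metric_rows,
       AVG(m.`views`) AS avg_views
FROM mysql.cms_6month_performance_metrics AS m
JOIN mysql.real_cms_data AS c
  ON m.title = c.title
GROUP BY c.author
ORDER BY avg_views DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because this reads the MySQL tables directly, it complements the Knowledge Base query rather than replacing it: it ranks authors by numbers alone and does not use semantic search.&lt;/p&gt;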

&lt;p&gt;The following query finds finance-tagged blog posts that attract at least 4,000 clicks and keep readers engaged for at least 200 seconds (a little over 3 minutes).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Identify blogs with a specific tag that have high clicks and an average reading time of at least 200 seconds&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'finance'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;avg_time_seconds&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helps teams identify which finance-related content performs best so they can replicate or expand on successful topics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5jjrf2z14hkgp6bnwlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5jjrf2z14hkgp6bnwlb.png" alt="mindsdb" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query identifies blog posts that lose readers quickly — high bounce rate, low click-through rate, and very short time-on-page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Identify blogs that have a high bounce rate, a low CTR, and a short time on page&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
&lt;span class="n"&gt;bounce_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ctr&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;avg_time_seconds&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;cms_performance_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;web_cms_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It alerts content teams to pages that may need rewriting, updated SEO, better UX, or clearer topic alignment to improve performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcmkju4qe7sskq3zp03q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcmkju4qe7sskq3zp03q.png" alt="mindsdb" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hybrid search gives CMS teams a powerful new lens for understanding their content — blending semantic meaning with structured performance data to answer questions that were previously tedious or impossible to query in SQL alone. Once this intelligence is available inside MindsDB’s Knowledge Bases, the next step is making it accessible anywhere your team works.&lt;/p&gt;

&lt;p&gt;That’s where &lt;a href="https://docs.mindsdb.com/model-context-protocol/overview?_gl=1*80vwwp*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;MindsDB’s MCP Server&lt;/a&gt; and &lt;a href="https://docs.mindsdb.com/model-context-protocol/cursor_usage?_gl=1*80vwwp*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;Cursor's MCP Client&lt;/a&gt; come in.&lt;/p&gt;

&lt;p&gt;In the next section, we’ll show how to expose these same hybrid-search insights to natural-language queries inside Cursor — allowing developers, editors, and analysts to interact with your MySQL-backed CMS data as effortlessly as chatting with an AI agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  From SQL to Natural Language: Using MindsDB’s MCP Server + Cursor with Your Web CMS Data in MySQL
&lt;/h2&gt;

&lt;p&gt;With &lt;a href="https://docs.mindsdb.com/model-context-protocol/overview?_gl=1*noinpu*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;MindsDB’s MCP Server&lt;/a&gt; and Cursor’s MCP Client, your MySQL-based CMS data becomes accessible through natural-language queries. Instead of writing SQL or building custom endpoints, editors, analysts, and engineers can simply ask questions such as “Which articles trended last month?” or “Show posts tagged ‘AI’ with high engagement” and get grounded results directly from your database. This bridges the gap between technical data access and everyday content decision-making, making AI-powered insights available to anyone on your team.&lt;/p&gt;

&lt;p&gt;To begin, make sure MindsDB is running so that its MCP server is available to Cursor. The command below starts it in Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run --name mindsdb_container -p 47334:47334 -p 47335:47335 mindsdb/mindsdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
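
&lt;p&gt;Cursor can only answer questions about data that this MindsDB instance can see. As a quick check, the statements below can be run in the MindsDB editor (served on port 47334 by the container above). This sketch assumes the connection and Knowledge Base names match the ones used earlier in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The mysql and heroku connections should be listed here
SHOW DATABASES;

-- Each Knowledge Base should return a row once data has been inserted
SELECT * FROM web_cms_kb LIMIT 1;
SELECT * FROM cms_performance_kb LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;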



&lt;p&gt;Once this is up and running, you can access Cursor and follow the below instructions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to Cursor’s Settings and open the MCP tab.&lt;/li&gt;
&lt;li&gt;Select Tools and MCP.&lt;/li&gt;
&lt;li&gt;Select Add Custom MCP. It will automatically open a tab to the &lt;code&gt;mcp.json&lt;/code&gt; file. Add the following details and save the changes:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mindsdb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:47334/mcp/sse"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4. Navigate back to the Cursor Settings tab, where MindsDB should now be listed under ‘Installed MCP Servers’.&lt;br&gt;
5. Select ‘Toggle AI Pane’ and make sure the mode is set to Agent with the model set to Auto.&lt;/p&gt;

&lt;p&gt;You can now start chatting with your data. Ask the Agent to access your data in the MySQL database:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Which authors produce the most engaging content over time?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9ub26x8rlys9u7802z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9ub26x8rlys9u7802z1.png" alt="mindsdb" width="800" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This identifies high-performing creators, helps allocate editorial resources, and informs hiring, content planning, and promotion decisions. Cursor provides the top 5 authors with their metrics, as well as detailed overall insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Which content topics or tags drive the most traffic and engagement?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw2uyk6lkmpotdce0mkn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw2uyk6lkmpotdce0mkn.png" alt=" " width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This flags content that ranks or gets traffic but fails to convert — the clearest signal for SEO optimization, rewriting, or improving titles, snippets, or structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Which articles have a high bounce rate and low engagement?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk8xrtjmq1nf8xs0g7i0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk8xrtjmq1nf8xs0g7i0.png" alt=" " width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cursor lists 8 articles with their performance metrics, followed by key patterns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flax41obyw46azs3ri1ea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flax41obyw46azs3ri1ea.png" alt=" " width="800" height="729"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This identifies pages harming user experience and SEO quality signals; these are high-priority candidates for revision, redesign, or content replacement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Which articles are trending upward or downward in performance?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzzishu0meb961jfnad4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzzishu0meb961jfnad4.png" alt=" " width="800" height="694"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see early signals of growth or decline; essential for capitalizing on rising content and repairing slipping performers before they lose rankings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Which tags or topics are underrepresented but performing strongly?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz6h1n6t2id1r869vlg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz6h1n6t2id1r869vlg9.png" alt=" " width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cursor provides an overall comparison of the averages, along with recommendations and a summary:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqzv5nygmjky1oyrv5k2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqzv5nygmjky1oyrv5k2.png" alt=" " width="800" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This helps teams discover high-potential opportunities — underserved content themes that deliver strong engagement and should be expanded into content clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Which titles have strong SEO indicators based on long-term CTR trends?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl7l2ukrme8u4vazicu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl7l2ukrme8u4vazicu3.png" alt=" " width="800" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we get a list of articles with a summary below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrm6qgfhbx0guk09i6yz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrm6qgfhbx0guk09i6yz.png" alt=" " width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Titles with consistently strong CTR trends signal which content performs well in search over time, helping teams focus SEO efforts where they have the most impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example Use Cases Powered by MindsDB + MySQL + MCP
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. SEO Insights Assistant
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Ask: “What topics performed best in October by traffic?”&lt;/li&gt;
&lt;li&gt;How it works: MindsDB blends SQL metrics with hybrid retrieval to evaluate topic performance using both structured (views, CTR) and unstructured (content themes) data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Editorial Planning Assistant
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Ask: “Show me all drafts tagged ‘sustainability’ that are over 1,000 words.”&lt;/li&gt;
&lt;li&gt;How it works: MySQL filters structured fields while Knowledge Bases surface semantic matches, helping editors plan upcoming content efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Content Quality &amp;amp; Redundancy Auditor
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Ask: “Find duplicate or highly similar articles across the blog.”&lt;/li&gt;
&lt;li&gt;How it works: Hybrid search compares article embeddings, making it easy to detect overlap, outdated pages, or pieces needing consolidation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Author Performance Dashboard
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Ask: “Which authors published fewer than 3 posts last month?”&lt;/li&gt;
&lt;li&gt;How it works: MindsDB queries author output trends, enabling managers to spot gaps in coverage and support team-wide performance insights.&lt;/li&gt;
&lt;/ul&gt;
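
&lt;p&gt;For comparison, the author question above can also be written by hand. The sketch below assumes &lt;code&gt;created_at&lt;/code&gt; holds the publication date in &lt;code&gt;mysql.real_cms_data&lt;/code&gt;. The date range is a placeholder for “last month” and should be adjusted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Authors with fewer than 3 posts in the chosen month
SELECT author, COUNT(*) AS posts
FROM mysql.real_cms_data
WHERE created_at &amp;gt;= '2025-10-01'   -- placeholder start date
  AND created_at &amp;lt; '2025-11-01'    -- placeholder end date (exclusive)
GROUP BY author
HAVING COUNT(*) &amp;lt; 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Authors with no posts in the period have no rows to count, so they will not appear in this result.&lt;/p&gt;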

&lt;h4&gt;
  
  
  5. Content Gap &amp;amp; Opportunity Analyzer
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Ask: “Which trending topics from external sources are missing on our blog?”&lt;/li&gt;
&lt;li&gt;How it works: Combine CMS data with external knowledge sources via MCP, revealing blind spots and high-value content opportunities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters (and What You Gain)
&lt;/h2&gt;

&lt;p&gt;Content teams rely on MySQL to store massive volumes of articles, metadata, and performance metrics — but turning that raw data into actionable insight usually requires dashboards, manual queries, or complex BI setups. MindsDB removes these barriers by letting teams &lt;strong&gt;search semantically, analyze trends instantly, and ask natural-language questions directly against their CMS data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With Knowledge Bases, Hybrid Search, and MCP + Cursor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You unlock deeper content intelligence that isn’t possible with SQL alone.&lt;/li&gt;
&lt;li&gt;You discover patterns across topics, engagement, and SEO performance without manual digging.&lt;/li&gt;
&lt;li&gt;You give editors, marketers, and SEO analysts AI-powered visibility into what drives results.&lt;/li&gt;
&lt;li&gt;You connect structured (MySQL) and unstructured (content) data, enabling decisions grounded in both narrative and numbers.&lt;/li&gt;
&lt;li&gt;You eliminate ETL and extra infrastructure, using the MySQL data you already have.&lt;/li&gt;
&lt;li&gt;You future-proof your CMS workflow by enabling AI-native querying, automation, and discovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MindsDB turns your MySQL-based CMS into a real-time, AI-driven content intelligence engine — without changing where your data lives.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MySQL has long been the backbone of many web CMS platforms: reliable, structured, and central to how content teams store, organize, and publish information. But as content ecosystems grow more complex and audiences expect higher-quality, more relevant material, traditional SQL alone isn’t enough to surface the insights teams need.&lt;/p&gt;

&lt;p&gt;MindsDB transforms this entire landscape.&lt;/p&gt;

&lt;p&gt;By pairing your existing MySQL data with Knowledge Bases, Hybrid Search, and MCP Server + Cursor, MindsDB turns ordinary CMS tables into an intelligent content engine—one capable of understanding semantic meaning, revealing performance patterns, and answering nuanced editorial questions in natural language.&lt;/p&gt;

&lt;p&gt;Most importantly, you achieve all of this without moving your data, rewriting your CMS, or introducing new databases. MySQL stays your source of truth—MindsDB simply makes it smarter.&lt;/p&gt;

&lt;p&gt;As you move forward, the combination of MySQL + MindsDB provides a scalable, future-ready foundation for content discovery, analytics, and decision-making. Whether you're running a global media operation, a SaaS documentation hub, or a content-driven startup, MindsDB helps your teams work faster, learn more, and create better content—all powered by the data you already own. If you would like to see MindsDB in action, &lt;a href="https://mindsdb.com/contact" rel="noopener noreferrer"&gt;contact our team.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your CMS just became intelligent.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>mcp</category>
      <category>ai</category>
      <category>sql</category>
    </item>
    <item>
      <title>Blend Hybrid Retrieval with Structured Data using MindsDB Knowledge Bases</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Mon, 29 Dec 2025 23:17:24 +0000</pubDate>
      <link>https://dev.to/mindsdb/blend-hybrid-retrieval-with-structured-data-using-mindsdb-knowledge-bases-4267</link>
      <guid>https://dev.to/mindsdb/blend-hybrid-retrieval-with-structured-data-using-mindsdb-knowledge-bases-4267</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Andriy Burkov, Ph.D. &amp;amp; Author, MindsDB Advisor&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This tutorial is a follow-up to &lt;a href="https://mindsdb.com/blog/fast-track-knowledge-bases-how-to-build-semantic-ai-search-by-andriy-burkov" rel="noopener noreferrer"&gt;an earlier tutorial&lt;/a&gt;, where we took the first steps in creating and using a MindsDB Knowledge Base. In this follow-up, we will walk through building a semantic search knowledge base on the famous Enron Emails Dataset. In the previous tutorial we simply used an existing dataset; here we'll preprocess the original dataset by extracting structured attributes (also known as metadata) from it using Named Entity Recognition (NER). We will then create a knowledge base and perform both semantic and metadata-filtered searches.&lt;/p&gt;

&lt;p&gt;Before we get our hands dirty, let's refresh some basics. &lt;a href="https://info.mindsdb.com/e3t/Ctc/W5+113/d58S2n04/MWc75T6rldNW3KxJJ-3xSYHKW1PHNH55z43mJMZjDX45nR3bW8wM7ks6lZ3lkVmdC_97H6rWBW7-qwfg3BZpN7W3r0pn-6_0n6_W47q9bY6MHH-fW5Fq-884mCcw4W4Wtp1T3lr-vMW4K6cj83qc3bTW2R1vjj61n0zGN6l3vjysFVl_W76cGYS2hq9NSW52FV4475HB3TVQMCfm6pVjRgW7RjB-613hTj9V8tFrz2L8DflW96XBHT4Lb8_nW80LxBM1hMn8nW3Ml3xG2qXt6DW6RcBj_3lj_l5W1N6qRy5WYvXPW4f6cmr8wTH6WW6kSq-R8Dqh2rN5drnVWy8q3nW4qDMyd5N9QQ4N6PZg6-2FhQlVFp87G1PtznbVDckpd2xP1n2N4-xf3lxfZbWN4dgRjrkbsw1W7kw82q3KklqcW53kBBy63xBxpW4k4Lkx8PKF3BW1xVlJg4lW08WV1M9Wc5rdX7xW92cDv124CGg3W5MxTkM71sxg3W7t8CDv7CKmJzW1TSL1V7j37MyW50hpBc49xyRcW8g1Fkd3MMVBCW7nBNpW4XfL4WW8WTM-Z7NClb9W6RFtcB2RbHBVN5_C6L6VgFHqW57zG5F1XwwXSf4VN6Xb04?_gl=1*1kqa3ms*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExMjMwMDQ5MjkuMTc2NzA0OTYzMy4xNzY3MDQ5NjMz" rel="noopener noreferrer"&gt;Download the webinar code and materials here&lt;/a&gt; to follow along the tutorial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw13pzwmukpdn2b9923mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw13pzwmukpdn2b9923mn.png" alt="mindsdb"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Introduction to Knowledge Bases in MindsDB
&lt;/h2&gt;

&lt;p&gt;Knowledge Bases (KB) in MindsDB provide advanced semantic search capabilities, allowing you to find information based on meaning rather than just keywords. They use embedding models to convert text into vector representations and store them in vector databases for efficient similarity searches.&lt;/p&gt;

&lt;p&gt;In addition to searching for knowledge nuggets by semantic similarity (soft search criteria), MindsDB KBs let the user combine soft criteria with hard ones, called "metadata," which behave like regular relational database table columns.&lt;/p&gt;
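&lt;p&gt;To make the soft/hard distinction concrete, here is a minimal pure-Python sketch. It is not how MindsDB implements Knowledge Bases; the records, the three-dimensional vectors, and the &lt;code&gt;search&lt;/code&gt; helper are all made up for illustration. The hard criteria discard rows by exact metadata match, and the soft criterion ranks whatever remains by cosine similarity.&lt;/p&gt;

```python
import math

# Hypothetical toy records: a hand-made 3-dimensional "embedding" plus metadata.
# A real knowledge base would get the vectors from an embedding model.
RECORDS = [
    {"id": "email_000001", "vector": [0.9, 0.1, 0.0], "sender": "alice@example.com", "year": 2001},
    {"id": "email_000002", "vector": [0.8, 0.2, 0.1], "sender": "bob@example.com", "year": 2000},
    {"id": "email_000003", "vector": [0.0, 0.1, 0.9], "sender": "alice@example.com", "year": 2001},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, metadata_filter, top_k=2):
    """Apply the hard metadata filter first, then rank the survivors by similarity."""
    candidates = [
        r for r in RECORDS
        if all(r.get(key) == value for key, value in metadata_filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vector, r["vector"]),
        reverse=True,
    )
    return [r["id"] for r in ranked[:top_k]]


# Soft criterion: closeness to the query vector. Hard criterion: exact sender match.
results = search([1.0, 0.0, 0.0], {"sender": "alice@example.com"})
print(results)
```

&lt;p&gt;Note that &lt;code&gt;email_000002&lt;/code&gt; is semantically close to the query but never appears in the result, because the sender filter removes it before ranking. That is the behaviour the metadata columns are meant to provide.&lt;/p&gt;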

&lt;blockquote&gt;
&lt;p&gt;In this tutorial, we assume that the user has a free open-source MindsDB instance running in their local environment. Please follow &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker" rel="noopener noreferrer"&gt;these steps&lt;/a&gt; to set it up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To demonstrate both soft and hard searches in a MindsDB KB, we'll use the Enron Corpus, one of the largest publicly available collections of corporate emails, containing over 500,000 emails from Enron executives during the years leading up to the company's collapse in 2001. The dataset is particularly interesting because it contains real business communications, including scandal-related content, which makes it well suited to demonstrating knowledge base search capabilities.&lt;/p&gt;

&lt;p&gt;Named Entity Recognition is the technique we'll use to automatically extract those structured attributes—such as people, organizations, dates, and locations—from the raw email text. These extracted entities will become the metadata columns in our knowledge base, allowing us not only to search semantically by meaning, but also to filter results using precise, structured criteria like sender, company, or time period.&lt;/p&gt;
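&lt;p&gt;As a rough sketch of that last step, the snippet below turns a list of recognized entities into one metadata value per column. The entity list is hard-coded and made up (in the tutorial it would come from the NER model), and the column names and the comma-joined cell format are assumptions chosen for illustration. &lt;code&gt;PERSON&lt;/code&gt;, &lt;code&gt;ORG&lt;/code&gt;, &lt;code&gt;GPE&lt;/code&gt;, and &lt;code&gt;DATE&lt;/code&gt; are entity labels used by spaCy's English models.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical NER output for one email, as (text, label) pairs.
# The labels follow spaCy's naming; the values are made up.
entities = [
    ("Kenneth Lay", "PERSON"),
    ("Enron", "ORG"),
    ("Houston", "GPE"),
    ("October 2001", "DATE"),
    ("Enron", "ORG"),
]

# Assumed mapping from entity label to metadata column name.
LABEL_TO_COLUMN = {
    "PERSON": "people",
    "ORG": "organizations",
    "GPE": "locations",
    "DATE": "dates",
}


def entities_to_metadata(entity_pairs):
    """Group entity texts by column, dropping duplicates but keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in entity_pairs:
        column = LABEL_TO_COLUMN.get(label)
        if column is not None and text not in grouped[column]:
            grouped[column].append(text)
    # Join values so each column holds a single string, like a table cell.
    return {column: ", ".join(grouped[column]) for column in LABEL_TO_COLUMN.values()}


metadata = entities_to_metadata(entities)
print(metadata)
```

&lt;p&gt;Each key of the resulting dictionary can then be stored as a column next to the email text and used as a hard filter at query time.&lt;/p&gt;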

&lt;h2&gt;
  
  
  2. Setting Things Up
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Dependencies Installation
&lt;/h3&gt;

&lt;p&gt;First, let's install the dependencies and set up NER. We will use spaCy for this, since its pretrained models can automatically extract entities such as people, organizations, dates, and locations from the raw email text. Those extracted entities will then be transformed into structured metadata columns, which we’ll store alongside the email content and later use to power metadata-aware queries in our MindsDB knowledge base.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;mindsdb&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="n"&gt;yaspin&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;

&lt;span class="c1"&gt;# Download spaCy English model for Named Entity Recognition
&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt; &lt;span class="n"&gt;en_core_web_sm&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Dependencies installed successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2 Dataset Selection and Download
&lt;/h3&gt;

&lt;p&gt;We'll download the Enron email dataset from &lt;a href="https://huggingface.co/datasets/snoop2head/enron_aeslc_emails" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;, a large collection of real-world corporate messages from the Enron corpus, paired with their original subject lines and cleaned body text. Each entry includes the email’s metadata (such as sender, recipients, and timestamp) along with the full message content, organized into standard train/validation/test splits so the dataset can be used for tasks like summarization, classification, or downstream NLP experiments.&lt;/p&gt;

&lt;p&gt;For this tutorial, we will use only the train split of the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Download Enron Emails Dataset from Hugging Face
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Load the Enron dataset (536k emails)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloading Enron emails dataset...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snoop2head/enron_aeslc_emails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset columns:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_email_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Parse raw email text to extract subject, body, and metadata&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;email_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Initialize result dictionary
&lt;/span&gt;    &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract subject
&lt;/span&gt;    &lt;span class="n"&gt;subject_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Subject:\s*(.*?)(?:\n|$)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;subject_match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subject_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract from
&lt;/span&gt;    &lt;span class="n"&gt;from_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;From:\s*(.*?)(?:\n|$)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;from_match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;from_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract to
&lt;/span&gt;    &lt;span class="n"&gt;to_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;To:\s*(.*?)(?:\n|$)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;to_match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;to_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract date
&lt;/span&gt;    &lt;span class="n"&gt;date_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Date:\s*(.*?)(?:\n|$)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;date_match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;date_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract body (everything after the headers)
&lt;/span&gt;    &lt;span class="c1"&gt;# Look for the end of headers (usually marked by double newline or start of actual content)
&lt;/span&gt;    &lt;span class="n"&gt;header_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\n\s*\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;header_end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;header_end&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;():].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Fallback: try to find content after common header patterns
&lt;/span&gt;        &lt;span class="n"&gt;body_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(?:Subject:.*?\n.*?\n|X-.*?\n)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;body_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;body_start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;():].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email_text&lt;/span&gt;

    &lt;span class="c1"&gt;# Clean up body text
&lt;/span&gt;    &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\n+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;

&lt;span class="c1"&gt;# Parse first few emails to understand structure
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parsing email structure...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sample_emails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;parsed_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sample_emails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# The dataset might have different column names, let's check
&lt;/span&gt;    &lt;span class="n"&gt;email_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Find the column with email content
&lt;/span&gt;            &lt;span class="n"&gt;email_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;email_content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_email_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;parsed_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    Downloading Enron emails dataset...
    Dataset shape: &lt;span class="o"&gt;(&lt;/span&gt;535703, 1&lt;span class="o"&gt;)&lt;/span&gt;
    Dataset columns: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
    Parsing email structure...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's print some records to see what's inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_parsed_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Sample of parsed emails:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_parsed_sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Email #&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;From: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;To: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Body Preview: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Successfully parsed email structure!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Columns extracted: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df_parsed_sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sample of parsed emails:
====================================================================================================

Email #1
ID: email_000000
From: phillip.allen@enron.com
To: tim.belden@enron.com
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
Subject: Body:
Body Preview: Here is our forecast
--------------------------------------------------------------------------------

Email #2
ID: email_000001
From: phillip.allen@enron.com
To: john.lavorato@enron.com
Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)
Subject: Re:
Body Preview: Traveling to have a business meeting takes the fun out of the trip. Especially if you have to prepare a presentation. I would suggest holding the business plan meetings here then take a trip without a...
--------------------------------------------------------------------------------

Email #3
ID: email_000002
From: phillip.allen@enron.com
To: leah.arsdall@enron.com
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
Subject: Re: test
Body Preview: test successful. way to go!!!
--------------------------------------------------------------------------------

Email #4
ID: email_000003
From: phillip.allen@enron.com
To: randall.gay@enron.com
Date: Mon, 23 Oct 2000 06:13:00 -0700 (PDT)
Subject: Body:
Body Preview: Randy, Can you send me a schedule of the salary and level of everyone in the scheduling group. Plus your thoughts on any changes that need to be made. (Patti S for example) Phillip
--------------------------------------------------------------------------------

Email #5
ID: email_000004
From: phillip.allen@enron.com
To: greg.piper@enron.com
Date: Thu, 31 Aug 2000 05:07:00 -0700 (PDT)
Subject: Re: Hello
Body Preview: Let's shoot for Tuesday at 11:45.
--------------------------------------------------------------------------------

Successfully parsed email structure!
Columns extracted: ['subject', 'body', 'from', 'to', 'date', 'email_id']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
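&lt;p&gt;Eyeballing five records only goes so far. As a complement, here is a minimal sketch that counts blank values per field, assuming each parsed record is a plain dict with the six keys listed in the output above. The helper name &lt;code&gt;count_blank_fields&lt;/code&gt; and the two sample records are hypothetical and not part of the tutorial code.&lt;/p&gt;

```python
from collections import Counter

# Field names taken from the "Columns extracted" output above
FIELDS = ["subject", "body", "from", "to", "date", "email_id"]

def count_blank_fields(records, fields=FIELDS):
    """Count, per field, how many records have a missing or whitespace-only value."""
    blanks = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None or not str(value).strip():
                blanks[field] += 1
    return blanks

# Hypothetical records for illustration; the second one has blank fields on purpose
sample_records = [
    {"subject": "Re: test", "body": "test successful. way to go!!!",
     "from": "phillip.allen@enron.com", "to": "leah.arsdall@enron.com",
     "date": "Wed, 18 Oct 2000 03:00:00 -0700 (PDT)", "email_id": "email_000002"},
    {"subject": "", "body": "Here is our forecast",
     "from": "phillip.allen@enron.com", "to": "  ",
     "date": None, "email_id": "email_000000"},
]

blank_counts = count_blank_fields(sample_records)
for field in FIELDS:
    print(f"{field}: {blank_counts[field]} blank")
```

&lt;p&gt;If &lt;code&gt;parse_email_text&lt;/code&gt; returns plain dicts, the same helper could be pointed at &lt;code&gt;parsed_samples&lt;/code&gt; to see how many records lack a subject, recipient, or date before any further processing.&lt;/p&gt;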

&lt;p&gt;The data looks good, so let's load spaCy and its pretrained English NLP model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;displacy&lt;/span&gt;
&lt;span class="c1"&gt;# Load the spaCy model for NER; a missing model raises OSError, handled below
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading spaCy model for Named Entity Recognition...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;nlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en_core_web_sm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spaCy model &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en_core_web_sm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; loaded successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;OSError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to load spaCy model:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loading spaCy model for Named Entity Recognition...
spaCy model 'en_core_web_sm' loaded successfully!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  2.3 Preparing the Dataset for the Knowledge Base
&lt;/h2&gt;

&lt;p&gt;So far, we have a raw collection of email messages, but we need a dataset to create a knowledge base from. In this dataset, we want natural-language text for soft semantic search and named attributes for hard filtering of rows.&lt;/p&gt;
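&lt;p&gt;To make the distinction concrete, the sketch below splits one record into text used for semantic search and attributes used for exact-match filtering. It is only an illustration: the column grouping, the helper names &lt;code&gt;split_record&lt;/code&gt; and &lt;code&gt;hard_filter&lt;/code&gt;, and the choice of example rows are assumptions, not the schema built later in this tutorial.&lt;/p&gt;

```python
# Assumed grouping, for illustration only
CONTENT_COLUMNS = ("subject", "body")      # free text, matched by meaning
METADATA_COLUMNS = ("from", "to", "date")  # named attributes, matched exactly

def split_record(record):
    """Split a record into (content, metadata) dicts by column group."""
    content = {col: record[col] for col in CONTENT_COLUMNS}
    metadata = {col: record[col] for col in METADATA_COLUMNS}
    return content, metadata

def hard_filter(records, conditions):
    """Keep only records whose attributes equal every condition in the dict."""
    return [
        record for record in records
        if all(record.get(col) == value for col, value in conditions.items())
    ]

# Two rows copied from the parsed sample printed earlier
records = [
    {"email_id": "email_000002", "subject": "Re: test",
     "body": "test successful. way to go!!!",
     "from": "phillip.allen@enron.com", "to": "leah.arsdall@enron.com",
     "date": "Wed, 18 Oct 2000 03:00:00 -0700 (PDT)"},
    {"email_id": "email_000004", "subject": "Re: Hello",
     "body": "Let's shoot for Tuesday at 11:45.",
     "from": "phillip.allen@enron.com", "to": "greg.piper@enron.com",
     "date": "Thu, 31 Aug 2000 05:07:00 -0700 (PDT)"},
]

# Hard filtering narrows the rows by exact attribute value
matches = hard_filter(records, {"to": "greg.piper@enron.com"})
content, metadata = split_record(matches[0])
print(content)
print(metadata)
```

&lt;p&gt;The filter takes a dict rather than keyword arguments because &lt;code&gt;from&lt;/code&gt; is a reserved word in Python and cannot be used as a keyword argument name.&lt;/p&gt;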

&lt;p&gt;The first step in preparing a dataset for a KB is cleaning it up and making sure it has a unique ID column. The second step is extracting named entities from the cleaned records. We will perform both steps in the cell below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_EMAILS_TO_PROCESS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500_000&lt;/span&gt;
&lt;span class="n"&gt;MIN_BODY_SIZE_CHARS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_email_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Clean and prepare email text for processing&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to string and clean
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove excessive whitespace and newlines
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\n+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove common email artifacts
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-----Original Message-----.*$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MULTILINE&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;________________________________.*$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MULTILINE&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_entities_with_ner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nlp_model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract named entities using spaCy NER&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="c1"&gt;# Limit text length to avoid memory issues
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nlp_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;entities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;entity_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Skip very short entities
&lt;/span&gt;                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PERSON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  &lt;span class="c1"&gt;# Geopolitical entities, locations
&lt;/span&gt;                &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MONEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EVENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PRODUCT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Remove duplicates (keeping first-seen order) and cap the list per type
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromkeys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]))[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Max 5 entities per type
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entities&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error processing text: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="c1"&gt;# Use the real Enron dataset that was loaded earlier
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📧 Working with real Enron dataset: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; emails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset columns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sample a reasonable subset for this tutorial (full dataset is very large)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sampling real Enron emails for processing...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_EMAILS_TO_PROCESS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Process the real emails and extract entities
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing real Enron emails and extracting entities...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;processed_emails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_sample&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing emails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Get the raw email content from 'text' column
&lt;/span&gt;    &lt;span class="n"&gt;email_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;email_content&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse the email using the function from Cell 3
&lt;/span&gt;    &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_email_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Clean the content
&lt;/span&gt;    &lt;span class="n"&gt;subject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clean_email_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clean_email_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MIN_BODY_SIZE_CHARS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Skip very short emails
&lt;/span&gt;        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract entities from both subject and content
&lt;/span&gt;    &lt;span class="n"&gt;full_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;entities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_entities_with_ner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nlp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create email ID
&lt;/span&gt;    &lt;span class="n"&gt;email_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;processed_email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;email_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Limit length
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Limit length  
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_sent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Limit length
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# .get(..., []) tolerates the empty dict returned when entity extraction fails
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money_amounts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dates_mentioned&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content_length&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entity_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;processed_emails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Convert to DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_emails&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✅ Processed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; real Enron emails with entities extracted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📧 Working with real Enron dataset: 535703 emails
Dataset columns: ['text']
Sampling real Enron emails for processing...
Processing real Enron emails and extracting entities...


Processing emails: 100%|███████████████████████████████████████████████| 500000/500000 [2:55:44&amp;lt;00:00, 47.42it/s]



✅ Processed 453905 real Enron emails with entities extracted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
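
&lt;p&gt;One detail of the extraction step is worth a note. The &lt;code&gt;list(set(...))[:5]&lt;/code&gt; call removes duplicates, but a Python set does not keep insertion order, so which five entities survive per type is arbitrary and can differ between runs. If the first-mentioned entities should win, one possible alternative is to deduplicate with &lt;code&gt;dict.fromkeys&lt;/code&gt;, which keeps first-seen order. The sketch below is an assumption and not part of the original notebook: the helper name &lt;code&gt;dedupe_keep_order&lt;/code&gt; and the sample entity list are made up for illustration.&lt;/p&gt;

```python
def dedupe_keep_order(items, limit=5):
    """Drop repeated items, keep first-seen order, and return at most `limit`."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(items))[:limit]


# Made-up entity mentions, in the order they might appear in one email
mentions = ["Enron", "FERC", "Enron", "Dynegy", "FERC", "Calpine", "Reliant", "Duke"]

top_mentions = dedupe_keep_order(mentions)
print(top_mentions)
```

&lt;p&gt;With this input the call returns Enron, FERC, Dynegy, Calpine and Reliant on every run, whereas the set-based version may keep a different five each time.&lt;/p&gt;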

&lt;p&gt;We now have a pandas DataFrame with one row per email that passed the length filters, holding the cleaned subject and body alongside the extracted entities. Let's print a few rows and some summary statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Show a sample of processed data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 Sample of Enron emails processed:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📧 Real Email #&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🆔 ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;👤 From: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;👤 To: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📅 Date: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_sent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📝 Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;👥 Persons: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None detected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🏢 Organizations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None detected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📍 Locations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None detected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💰 Money: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money_amounts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money_amounts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None detected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💬 Content Preview: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show statistics on real data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📈 Real Data Processing Statistics:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;• Total real emails processed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;• Average content length: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content_length&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; characters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;• Average entities per email: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entity_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;• Emails with persons mentioned: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;• Emails with organizations mentioned: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;• Emails with money amounts: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money_amounts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show some interesting real examples
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🔍 Most interesting real emails (by entity count):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;top_emails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nlargest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entity_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_emails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📧 High-entity email from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📝 Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;👥 Persons: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🏢 Organizations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💰 Money: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money_amounts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📊 Sample of Enron emails processed:
========================================================================================================================

📧 Real Email #1
🆔 ID: email_000000
👤 From: daren.farmer@enron.com
👤 To: susan.trevino@enron.com
📅 Date: Fri, 10 Dec 1999 08:33:00 -0800 (PST)
📝 Subject: Re: Meter 5892 - UA4 1996 and 1997 Logistics Issues
👥 Persons: Daren J Farmer/HOU, Susan, Meter 5892 - UA4 1996, Mary M Smith/HOU, Susan D Trevino
🏢 Organizations: Volume Management
📍 Locations: UA4
💰 Money: None detected
💬 Content Preview: Susan, I need you to do the research on this meter. You will need to review the various scheduling systems to see how this was handled prior to 2/96. You can also check with Volume Management to see i...
----------------------------------------------------------------------------------------------------

📧 Real Email #2
🆔 ID: email_000001
👤 From: eric.bass@enron.com
👤 To: jason.bass2@compaq.com, phillip.love@enron.com, bryan.hull@enron.com,
📅 Date: Fri, 18 Aug 2000 05:03:00 -0700 (PDT)
📝 Subject: DRAFT
👥 Persons: Bcc
🏢 Organizations: None detected
📍 Locations: Rice Village
💰 Money: None detected
💬 Content Preview: Cc: timothy.blanchard@enron.com Bcc: timothy.blanchard@enron.com Remember, the draft is this Sunday at 11:45 am at BW-3 in Rice Village. Please try to be there on time so we can start promptly. -Eric
----------------------------------------------------------------------------------------------------

📧 Real Email #3
🆔 ID: email_000003
👤 From: larry.campbell@enron.com
👤 To: pdrumm@csc.com
📅 Date: Mon, 31 Jul 2000 09:53:00 -0700 (PDT)
📝 Subject: More July CED-PGE
👥 Persons: Susan Fick, Patty
🏢 Organizations: None detected
📍 Locations: None detected
💰 Money: None detected
💬 Content Preview: Patty Could you please forward this to Susan Fick. I don't have her e-mail. LC
----------------------------------------------------------------------------------------------------

📧 Real Email #4
🆔 ID: email_000004
👤 From: phillip.allen@enron.com
👤 To: christi.nicolay@enron.com, james.steffes@enron.com, jeff.dasovich@enron.com,
📅 Date: Wed, 13 Dec 2000 07:04:00 -0800 (PST)
📝 Subject: Body:
👥 Persons: None detected
🏢 Organizations: None detected
📍 Locations: None detected
💰 Money: None detected
💬 Content Preview: Attached are two files that illustrate the following: As prices rose, supply increased and demand decreased. Now prices are beginning to fall in response these market responses.
----------------------------------------------------------------------------------------------------

📧 Real Email #5
🆔 ID: email_000005
👤 From: kurt.lindahl@elpaso.com
👤 To: atsm@chewon.com, aarmstrong@sempratrading.com, neilaj@texaco.com,
📅 Date: Tue, 31 Jul 2001 08:28:00 -0700 (PDT)
📝 Subject: El Paso
👥 Persons: Origination El Paso, Tx 77252-2511, Kurt Lindahl Sr., Rob Bryngelson
🏢 Organizations: the ElPaso Corporation, El Paso, Global LNG Division, El Paso Merchant Energy, Business Development
📍 Locations: Houston
💰 Money: None detected
💬 Content Preview: Dear Friends and Colleagues, This note is to inform you that I have joined El Paso Merchant Energy in their Global LNG Division reporting to Rob Bryngelson, Managing Director, Business Development. Pl...
----------------------------------------------------------------------------------------------------

📈 Real Data Processing Statistics:
• Total real emails processed: 453905
• Average content length: 1474 characters
• Average entities per email: 8.1
• Emails with persons mentioned: 383312
• Emails with organizations mentioned: 363549
• Emails with money amounts: 63303

🔍 Most interesting real emails (by entity count):

📧 High-entity email from tradersummary@syncrasy.com
📝 Subject: Syncrasy Daily Trader Summary for Wed, Jan 16, 2002
👥 Persons: Data, NC ERCOT(SP, Max, Aquila, Andy Weingarten
🏢 Organizations: Trader Summary, ERCOT(SP, SPP(= SP, Average-Daily Maximum Temperature', MAPP(HP
💰 Money: 37 -1 MAIN(CTR, 50,000, 43 -1 MAIN(CTR, 36 -1 MAIN(CTR, 40 -1 WSCC(RK

📧 High-entity email from lucky@icelandair.is
📝 Subject: Iceland Food Festival
👥 Persons: Hotel Klopp, Rich, Mar 1 - National Beer Day, David Rosengarten, Subject
🏢 Organizations: Reykjav?k/K?pavogur, Party, Party Gourmet Dinner, BWI, SCENIC SIGHTSEEING Blue Lagoon
💰 Money: 65, 66, 69, 50, 55

📧 High-entity email from truorange@aol.com
📝 Subject: True Orange, November 27, Part 2
👥 Persons: Sooners, Jody Conradt, ESPN, Harris, Northwestern
🏢 Organizations: Oregon State, K-State, Texas A&amp;amp;M, SEC, Big East
💰 Money: $1.1 million, $1.9 million, $2.5 million, $1.2 million, 750,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The data looks good, so let's save it to a CSV file that we will later load into our knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save processed real data
&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;enron_emails_processed_real.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✅ Real Enron emails saved to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;enron_emails_processed_real.csv&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Real Enron emails saved to 'enron_emails_processed_real.csv'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  2.3 Connecting to the Vector Store
&lt;/h2&gt;

&lt;p&gt;When you create a MindsDB Knowledge Base, MindsDB splits the text into chunks and uses an external text embedding model to convert each chunk into an embedding vector. Embedding vectors are numerical arrays with the following property: if two texts are semantically similar, their embedding vectors are close to each other in the vector space. This lets us compare two texts semantically by applying a mathematical operation (such as cosine similarity) to their vectors and measuring how close they are.&lt;/p&gt;
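
&lt;p&gt;To make the idea of "close in the vector space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values rather than output from a real embedding model, and the variable names are only illustrative; real embeddings typically have hundreds or thousands of dimensions.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made-up numbers, not real model output)
meter_question = [0.9, 0.1, 0.0]
meter_answer = [0.8, 0.2, 0.1]
football_draft = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(meter_question, meter_answer)
sim_unrelated = cosine_similarity(meter_question, football_draft)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

&lt;p&gt;With these toy vectors, the two meter-related texts score close to 1, while the unrelated pair scores close to 0. A vector store performs this kind of comparison across its stored chunks to find the ones most similar to a query.&lt;/p&gt;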

&lt;p&gt;These embedding vectors need to be stored somewhere. Many vector databases exist, including several open-source ones. MindsDB supports ChromaDB by default. However, ChromaDB doesn't support the &lt;code&gt;LIKE&lt;/code&gt; operator, which is a standard part of relational database SELECT queries. Because this tutorial relies on &lt;code&gt;LIKE&lt;/code&gt;, we will use a different open-source vector store, PGVector, which is part of the Postgres ecosystem.&lt;/p&gt;
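
&lt;p&gt;As a reminder of what &lt;code&gt;LIKE&lt;/code&gt; does, the sketch below runs a pattern-matching filter against a tiny in-memory table. It uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module purely as a stand-in, not PGVector or MindsDB, and the &lt;code&gt;emails&lt;/code&gt; table is a hypothetical example seeded with three subjects from the sample output above.&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in to show LIKE semantics
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id TEXT, subject TEXT)")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?)",
    [
        ("email_000000", "Re: Meter 5892 - UA4 1996 and 1997 Logistics Issues"),
        ("email_000001", "DRAFT"),
        ("email_000003", "More July CED-PGE"),
    ],
)

# % matches any run of characters, so this keeps subjects containing "Meter"
rows = conn.execute(
    "SELECT id, subject FROM emails WHERE subject LIKE ?", ("%Meter%",)
).fetchall()
conn.close()

for email_id, subject in rows:
    print(email_id, subject)
```

&lt;p&gt;Only the first row is returned, because it is the only subject that contains "Meter". Note that SQLite's &lt;code&gt;LIKE&lt;/code&gt; is case-insensitive for ASCII characters by default, whereas in Postgres &lt;code&gt;LIKE&lt;/code&gt; is case-sensitive and &lt;code&gt;ILIKE&lt;/code&gt; is the case-insensitive variant.&lt;/p&gt;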

&lt;p&gt;For this tutorial, we provisioned a PGVector instance on AWS. You can install it locally too. &lt;a href="https://www.datacamp.com/tutorial/pgvector-tutorial" rel="noopener noreferrer"&gt;Here's how you can do it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's create a vector database connection, &lt;code&gt;enron_kb_pgvector&lt;/code&gt;, which will store the knowledge base's embedding vectors:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
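&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to the local MindsDB server so that `server` is defined for the queries below
&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:47334&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Drop an existing pgvector database if it exists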
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🗑️  Dropping existing pgvector database...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP DATABASE IF EXISTS enron_kb_pgvector;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Dropped existing database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  Drop error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create fresh pgvector database connection
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        CREATE DATABASE enron_kb_pgvector
        WITH ENGINE = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pgvector&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        PARAMETERS = {
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;c3hsmn51hjafhh.cluster-czrs8kj4isg7.us-east-1.rds.amazonaws.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 5432,
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;df1f3i5s2jrksf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u36kd0g64092pk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PGVECTOR_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
        };
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Created pgvector database connection &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;enron_kb_pgvector&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Database connection error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🗑️  Dropping existing pgvector database...
✅ Dropped existing database
✅ Created pgvector database connection 'enron_kb_pgvector'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  2.4 Uploading the Dataset to MindsDB
&lt;/h2&gt;

&lt;p&gt;Now let's connect to our local MindsDB instance and upload the dataset:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Remember that this tutorial assumes you have a free, open-source MindsDB instance running in your local environment. Please follow &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker" rel="noopener noreferrer"&gt;these steps&lt;/a&gt; to set it up.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to the MindsDB server
&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:47334&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connected to MindsDB server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List available databases to confirm connection
&lt;/span&gt;&lt;span class="n"&gt;databases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;databases&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available databases:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;databases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Drop the enron_kb knowledge base if it already exists
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🗑️  Dropping knowledge bases...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP KNOWLEDGE_BASE IF EXISTS enron_kb;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Dropped knowledge base enron_kb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  KB drop error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check if df_processed exists and has real data
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 Checking real processed Enron data...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Columns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ df_processed is empty. Please run the cell that creates it first.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No processed data available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df_upload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Using &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; real processed Enron emails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;NameError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Error: df_processed not found. Please run the cell that creates it first.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cannot continue without real processed data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_for_upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Clean text data for safe upload to MindsDB&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;

    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove problematic characters that might cause encoding issues
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[^\w\s\-\.\@\,\;\:\!\?\(\)\[\]\/]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove excessive whitespace
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Limit length to prevent upload issues
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1997&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Clean the real data for upload
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cleaning real Enron data for upload...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Clean text fields
&lt;/span&gt;&lt;span class="n"&gt;text_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_sent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money_amounts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dates_mentioned&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_for_upload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Ensure numeric columns are properly typed
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content_length&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entity_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📋 Final real Enron dataset for upload:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sample from addresses: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sample subjects: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Upload to MindsDB
&lt;/span&gt;&lt;span class="n"&gt;files_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enron_emails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Delete existing table if it exists
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;files_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dropped existing table &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# Upload real Enron data
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Uploading real Enron emails to MindsDB...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;files_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_upload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Created table files.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with real Enron data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Verify upload with real data
&lt;/span&gt;    &lt;span class="n"&gt;sample_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT email_id, subject, persons, organizations FROM files.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; LIMIT 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✅ Sample real Enron data uploaded:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sample_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📧 &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   👥 Persons: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   🏢 Orgs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check total count
&lt;/span&gt;    &lt;span class="n"&gt;count_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) as total FROM files.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;total_emails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 Total real Enron emails uploaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_emails&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Upload failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✅ Real Enron data upload process completed!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connected to MindsDB server
Available databases:
- files
- movies_kb_chromadb
🗑️  Dropping knowledge bases...
✅ Dropped knowledge base enron_kb

📊 Checking real processed Enron data...
Shape: (453905, 15)
Columns: ['email_id', 'from_address', 'to_address', 'date_sent', 'subject', 'content', 'persons', 'organizations', 'locations', 'money_amounts', 'dates_mentioned', 'events', 'products', 'content_length', 'entity_count']
✅ Using 453905 real processed Enron emails
Cleaning real Enron data for upload...

📋 Final real Enron dataset for upload:
Shape: (453905, 15)
Sample from addresses: ['daren.farmer@enron.com', 'eric.bass@enron.com', 'larry.campbell@enron.com']
Sample subjects: ['Re: Meter 5892 - UA4 1996 and 1997 Logistics Issues', 'DRAFT', 'More July CED-PGE']
Dropped existing table enron_emails
Uploading real Enron emails to MindsDB...
✅ Created table files.enron_emails with real Enron data

✅ Sample real Enron data uploaded:
📧 email_000000: Re: Meter 5892 - UA4 1996 and 1997 Logistics Issues...
   👥 Persons: Daren J Farmer/HOU, Susan, Meter 5892 - UA4 1996, Mary M Smith/HOU, Susan D Trevino
   🏢 Orgs: Volume Management
📧 email_000001: DRAFT...
   👥 Persons: Bcc
   🏢 Orgs: None
📧 email_000003: More July CED-PGE...
   👥 Persons: Susan Fick, Patty
   🏢 Orgs: None
📧 email_000004: Body:...
   👥 Persons: None
   🏢 Orgs: None
📧 email_000005: El Paso...
   👥 Persons: Origination El Paso, Tx 77252-2511, Kurt Lindahl Sr., Rob Bryngelson
   🏢 Orgs: the ElPaso Corporation, El Paso, Global LNG Division, El Paso Merchant Energy, Business Development

📊 Total real Enron emails uploaded: 453905

✅ Real Enron data upload process completed!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
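&lt;p&gt;To see what &lt;code&gt;clean_for_upload&lt;/code&gt; does to a single value, the minimal sketch below re-implements the same two regular expressions and the 2000-character cap as a standalone function. The name &lt;code&gt;clean_text&lt;/code&gt; and the sample string are hypothetical, and the pandas &lt;code&gt;isna&lt;/code&gt; check is swapped for a plain &lt;code&gt;None&lt;/code&gt; check so the snippet runs without any dependencies.&lt;/p&gt;

```python
import re

MAX_LEN = 2000  # same cap as in clean_for_upload


def clean_text(text):
    """Standalone copy of the cleaning logic (hypothetical name, no pandas)."""
    if text is None or text == '':
        return ''
    text = str(text)
    # Replace anything outside the allowed character set with a space
    text = re.sub(r'[^\w\s\-\.\@\,\;\:\!\?\(\)\[\]\/]', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # A non-empty tail past MAX_LEN means the text exceeds the cap
    if text[MAX_LEN:]:
        text = text[:MAX_LEN - 3] + '...'
    return text.strip()


# Hypothetical sample value with quotes, a tab and a dollar sign
sample = 'Re:  Q3 "forecast"\t$1,200 due -- see attached!'
cleaned = clean_text(sample)
truncated = clean_text('x' * 5000)

print(cleaned)
print(len(truncated))
```

&lt;p&gt;Characters outside the allow-list, such as quotes and the dollar sign, are replaced with spaces, so a value like $1,200 ends up stored as 1,200. That is worth keeping in mind for columns such as &lt;code&gt;money_amounts&lt;/code&gt;.&lt;/p&gt;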
&lt;h2&gt;
  
  
  2.4 Creating a Knowledge Base
&lt;/h2&gt;

&lt;p&gt;Now, let's create a knowledge base &lt;code&gt;enron_kb&lt;/code&gt; from our email data. We'll use OpenAI's &lt;code&gt;text-embedding-3-large&lt;/code&gt; embedding model to convert the text into vectors. Note the &lt;code&gt;storage = enron_kb_pgvector.enron_vectors&lt;/code&gt; parameter, which tells MindsDB to use our PGVector vector store. If we omit this parameter, the default ChromaDB vector store is used.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Drop existing knowledge base if it exists
&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP KNOWLEDGE_BASE IF EXISTS enron_kb;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create knowledge base with pgvector storage
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;kb_creation_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE KNOWLEDGE_BASE enron_kb
    USING
        storage = enron_kb_pgvector.enron_vectors,
        embedding_model = {{
           &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
           &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
        }},
        metadata_columns = [
            &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, 
            &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;money_amounts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dates_mentioned&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
            &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content_length&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entity_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
            &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date_sent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
        ],
        content_columns = [&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;],
        id_column = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;kb_creation_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Created knowledge base &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;enron_kb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; with email address and date filtering support&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Knowledge base creation error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Created knowledge base 'enron_kb' with email address and date filtering support
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
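&lt;p&gt;The optional nature of the &lt;code&gt;storage&lt;/code&gt; parameter can be illustrated with a small sketch. The helper &lt;code&gt;build_create_kb_sql&lt;/code&gt; below is hypothetical (it is not part of the MindsDB SDK). It only assembles the statement text in the same shape as the query above, and it leaves out &lt;code&gt;storage&lt;/code&gt; when none is given, which is the case where the default vector store applies. The shortened metadata list is for illustration only.&lt;/p&gt;

```python
import json


def build_create_kb_sql(kb_name, embedding_model, content_columns,
                        metadata_columns, id_column, storage=None):
    """Hypothetical helper: assemble the text of a CREATE KNOWLEDGE_BASE statement."""
    params = []
    if storage is not None:
        # Only emitted when a vector store is named explicitly
        params.append(f"storage = {storage}")
    params.append(f"embedding_model = {json.dumps(embedding_model)}")
    params.append(f"metadata_columns = {metadata_columns!r}")
    params.append(f"content_columns = {content_columns!r}")
    params.append(f"id_column = {id_column!r}")
    body = ",\n    ".join(params)
    return f"CREATE KNOWLEDGE_BASE {kb_name}\nUSING\n    {body};"


embedding = {"provider": "openai", "model_name": "text-embedding-3-large"}

# Without storage: the statement relies on the default vector store
default_sql = build_create_kb_sql(
    "enron_kb", embedding, ["content"], ["subject", "persons"], "email_id")

# With storage: the statement points at the PGVector table
pgvector_sql = build_create_kb_sql(
    "enron_kb", embedding, ["content"], ["subject", "persons"], "email_id",
    storage="enron_kb_pgvector.enron_vectors")

print(default_sql)
print(pgvector_sql)
```

&lt;p&gt;The values are interpolated as-is, so a helper like this should only ever receive trusted, hard-coded identifiers.&lt;/p&gt;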

&lt;p&gt;Now let's insert our email data into the knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Insert the email data into the knowledge base (including the new metadata columns)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;yaspin&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaspin&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;yaspin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inserting emails into updated knowledge base...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;insert_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            INSERT INTO enron_kb
            SELECT email_id,
                   subject,
                   persons,
                   organizations,
                   locations,
                   money_amounts,
                   dates_mentioned,
                   events,
                   products,
                   content_length,
                   entity_count,
                   from_address,
                   to_address,
                   date_sent,
                   content
            FROM   files.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
            USING
                batch_size = 200,
                threads = 10,
                error = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;skip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
                track_column = email_id;
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Emails inserted successfully into updated knowledge base!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Insert error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Emails inserted successfully into updated knowledge base!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
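&lt;p&gt;As a rough sanity check on the &lt;code&gt;USING&lt;/code&gt; options, the arithmetic below estimates how many batches the insert produces. It assumes that &lt;code&gt;batch_size&lt;/code&gt; is the number of rows per batch and that batches are spread evenly across &lt;code&gt;threads&lt;/code&gt;. How MindsDB actually schedules the work may differ. The row count is the one reported for &lt;code&gt;files.enron_emails&lt;/code&gt; above.&lt;/p&gt;

```python
import math

total_rows = 453905  # row count reported after the upload
batch_size = 200     # value from the USING clause
threads = 10         # value from the USING clause

# Number of batches needed to cover every row (assumed: one batch = batch_size rows)
batches = math.ceil(total_rows / batch_size)

# Rows left over for the final, partially filled batch
last_batch_rows = total_rows - (batches - 1) * batch_size

# Batches each thread would handle under an even split (an assumption)
batches_per_thread = math.ceil(batches / threads)

print(f"batches: {batches}")
print(f"rows in last batch: {last_batch_rows}")
print(f"batches per thread: {batches_per_thread}")
```

&lt;p&gt;With these settings the insert works out to 2270 batches, which helps explain why embedding the full dataset takes a while.&lt;/p&gt;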

&lt;p&gt;Let's see what the data in the KB looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;search_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM enron_kb;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;search_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT count(*) FROM enron_kb;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;id&lt;/th&gt;
      &lt;th&gt;chunk_id&lt;/th&gt;
      &lt;th&gt;chunk_content&lt;/th&gt;
      &lt;th&gt;metadata&lt;/th&gt;
      &lt;th&gt;relevance&lt;/th&gt;
      &lt;th&gt;distance&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;email_024653&lt;/td&gt;
      &lt;td&gt;email_024653:content:3of3:1993to2382&lt;/td&gt;
      &lt;td&gt;Palmer of Caminus Corp on European Markets! ht...&lt;/td&gt;
      &lt;td&gt;{'events': None, '_source': 'TextChunkingPrepr...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;email_024585&lt;/td&gt;
      &lt;td&gt;email_024585:content:1of1:0to429&lt;/td&gt;
      &lt;td&gt;Most of you already know, but the move is taki...&lt;/td&gt;
      &lt;td&gt;{'events': None, '_source': 'TextChunkingPrepr...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;email_024902&lt;/td&gt;
      &lt;td&gt;email_024902:content:1of2:0to997&lt;/td&gt;
      &lt;td&gt;To facilitate these changes, you received an O...&lt;/td&gt;
      &lt;td&gt;{'events': None, '_source': 'TextChunkingPrepr...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;email_025043&lt;/td&gt;
      &lt;td&gt;email_025043:content:1of1:0to801&lt;/td&gt;
      &lt;td&gt;According to our system records, you have not ...&lt;/td&gt;
      &lt;td&gt;{'events': None, '_source': 'TextChunkingPrepr...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;email_025044&lt;/td&gt;
      &lt;td&gt;email_025044:content:1of3:0to996&lt;/td&gt;
      &lt;td&gt;The attached preliminary comments were finaliz...&lt;/td&gt;
      &lt;td&gt;{'events': None, '_source': 'TextChunkingPrepr...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;...&lt;/th&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;676450&lt;/th&gt;
      &lt;td&gt;email_024824&lt;/td&gt;
      &lt;td&gt;email_024824:content:1of1:0to585&lt;/td&gt;
      &lt;td&gt;Cc: m..love@enron.com, scott.palmer@enron.com ...&lt;/td&gt;
      &lt;td&gt;{'events': None, '_source': 'TextChunkingPrepr...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;676451&lt;/th&gt;
      &lt;td&gt;email_024829&lt;/td&gt;
      &lt;td&gt;email_024829:content:1of3:0to998&lt;/td&gt;
      &lt;td&gt;20 [IMAGE] CO.O.L. Travel Specials [IMAGE] Wed...&lt;/td&gt;
      &lt;td&gt;{'events': 'Love Field', '_source': 'TextChunk...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;676452&lt;/th&gt;
      &lt;td&gt;email_024829&lt;/td&gt;
      &lt;td&gt;email_024829:content:2of3:999to1998&lt;/td&gt;
      &lt;td&gt;on either Monday, May 21 or Tuesday, May 22, 2...&lt;/td&gt;
      &lt;td&gt;{'events': 'Love Field', '_source': 'TextChunk...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;676453&lt;/th&gt;
      &lt;td&gt;email_024829&lt;/td&gt;
      &lt;td&gt;email_024829:content:3of3:1999to2391&lt;/td&gt;
      &lt;td&gt;IN 139 - Louisville, KY return to top Featured...&lt;/td&gt;
      &lt;td&gt;{'events': 'Love Field', '_source': 'TextChunk...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;676454&lt;/th&gt;
      &lt;td&gt;email_024804&lt;/td&gt;
      &lt;td&gt;email_024804:content:1of1:0to354&lt;/td&gt;
      &lt;td&gt;FYI -- David Leboe in Investor Relations autho...&lt;/td&gt;
      &lt;td&gt;{'events': None, '_source': 'TextChunkingPrepr...&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;676455 rows × 6 columns&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;count_0&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;676455&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As we can see, the data looks very much like a regular relational table. However, because it's a knowledge base rather than a regular database connection, we can use a special syntax that mixes semantic similarity with regular SQL "WHERE" constructs.&lt;/p&gt;

&lt;p&gt;You may also notice that the knowledge base contains chunks rather than the original texts of the email messages. Each chunk has its own embedding vector, which allows finding more granular pieces of content similar to the user's question.&lt;/p&gt;
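
&lt;p&gt;Since one email can be split into several chunks, it can be handy to map a chunk back to its source email and position. The sketch below is only an illustration: the helper name &lt;code&gt;parse_chunk_id&lt;/code&gt; is hypothetical, and the &lt;code&gt;chunk_id&lt;/code&gt; layout (document id, column, "NofM", "starttoend") is an assumption inferred from the sample rows above, not a documented format.&lt;/p&gt;

```python
def parse_chunk_id(chunk_id):
    """Split a chunk_id such as 'email_024653:content:3of3:1993to2382' into its parts.

    The layout is inferred from the sample output above (an assumption).
    """
    # rsplit from the right so that any colons inside the document id stay intact
    doc_id, column, position, char_range = chunk_id.rsplit(":", 3)
    index, total = position.split("of")
    start, end = char_range.split("to")
    return {
        "doc_id": doc_id,
        "column": column,
        "chunk_index": int(index),
        "chunk_total": int(total),
        "char_start": int(start),
        "char_end": int(end),
    }


parts = parse_chunk_id("email_024653:content:3of3:1993to2382")
print(parts["doc_id"], parts["chunk_index"], parts["chunk_total"])
# prints: email_024653 3 3
```

&lt;p&gt;With the parts separated, chunks can be grouped by &lt;code&gt;doc_id&lt;/code&gt; and sorted by &lt;code&gt;chunk_index&lt;/code&gt; to read an email's chunks in order.&lt;/p&gt;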

&lt;h2&gt;
  
  
  3. Performing Semantic Searches
&lt;/h2&gt;

&lt;p&gt;Now that our knowledge base is ready (or being populated), let's do some Q&amp;amp;A. For convenience, we will set up a utility function &lt;code&gt;answer_question_about_enron&lt;/code&gt; that takes a question about the data plus optional values for the attributes the data is expected to contain, such as people's names, organizations, locations, etc.: the attributes that the NER was supposed to have extracted. This utility function combines the inputs into a SELECT query using the MindsDB syntax. For example, if our question/request is "I need to see emails mentioning fraud." and we want to see only emails from "John Smith", the SELECT query constructed by &lt;code&gt;answer_question_about_enron&lt;/code&gt; would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    id, 
    chunk_content, 
    relevance,
    metadata
FROM enron_kb
WHERE content = 'I need to see emails mentioning fraud.' AND persons LIKE '%John Smith%'
ORDER BY relevance DESC
LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above query returns emails that discuss fraud even if the word "fraud" itself doesn't appear in their text. This is a soft (semantic) search. In addition, only emails whose "persons" attribute contains "John Smith" are returned. This is a hard search: a conventional SQL filter.&lt;br&gt;
&lt;/p&gt;
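
&lt;p&gt;One caveat when building such queries from user input: a question or name containing a single quote (for example "Enron's") would break a query assembled by plain string interpolation. The following is a minimal sketch of one way to guard against that, assuming the SQL parser accepts the standard doubled-quote escape; the helpers &lt;code&gt;sql_literal&lt;/code&gt; and &lt;code&gt;build_where_clause&lt;/code&gt; are hypothetical names used for illustration and are not part of MindsDB.&lt;/p&gt;

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_where_clause(question, persons=None):
    """Combine the semantic condition with optional LIKE filters on 'persons'."""
    # Soft (semantic) part: the question is compared against the content column
    conditions = ["content = " + sql_literal(question)]

    # Hard part: accept a single name or a list of names
    if isinstance(persons, str):
        persons = [persons]
    for person in persons or []:
        conditions.append("persons LIKE " + sql_literal("%" + person + "%"))

    return " AND ".join(conditions)


print(build_where_clause("Who discussed Enron's accounting?", persons="John Smith"))
# prints: content = 'Who discussed Enron''s accounting?' AND persons LIKE '%John Smith%'
```

&lt;p&gt;The same idea extends to the other metadata columns (organizations, locations, and so on) by adding one LIKE condition per value.&lt;/p&gt;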

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;

&lt;span class="c1"&gt;# Set up OpenAI client (replace with your API key)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer_question_about_enron&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;money_amounts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subjects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Answer questions about Enron using the knowledge base with metadata filtering

    Args:
        question (str): The question to ask
        persons (str or list): Person name(s) to filter by
        organizations (str or list): Organization name(s) to filter by  
        locations (str or list): Location name(s) to filter by
        money_amounts (str or list): Money amount(s) to filter by
        subjects (str or list): Subject keyword(s) to filter by

    Returns:
        str: Generated answer based on relevant emails
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_like_conditions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Helper function to create LIKE conditions for single values or lists&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="c1"&gt;# Convert single value to list for uniform processing
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
            &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

        &lt;span class="n"&gt;conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;column_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; LIKE &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;conditions&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🤔 Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build WHERE clause with optional filters
&lt;/span&gt;    &lt;span class="n"&gt;where_conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle persons filter
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;person_conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_like_conditions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;person_conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;where_conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person_conditions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by persons: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by persons: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle organizations filter
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;org_conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_like_conditions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;org_conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;where_conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org_conditions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by organizations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by organizations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle locations filter
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;loc_conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_like_conditions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;loc_conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;where_conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc_conditions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by locations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by locations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle money_amounts filter
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;money_amounts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;money_conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_like_conditions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;money_amounts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;money_amounts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;money_conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;where_conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;money_conditions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;money_amounts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by money amounts: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;money_amounts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by money amounts: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;money_amounts&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle subjects filter
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;subjects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;subject_conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_like_conditions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subjects&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;subject_conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;where_conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject_conditions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subjects&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by subjects: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subjects&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Filtering by subjects: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;subjects&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;where_clause&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AND &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;where_conditions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;search_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT 
            id, 
            chunk_content, 
            relevance,
            metadata
        FROM enron_kb
        WHERE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;where_clause&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
        ORDER BY relevance DESC
        LIMIT 100;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 SQL Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;search_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;search_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ No relevant emails found matching your criteria.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Show sample results
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📧 Result #&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🆔 Email ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📊 Relevance: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📝 Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;👥 Persons: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🏢 Organizations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📍 Locations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="n"&gt;preview&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💬 Content: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;preview&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error processing result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Prepare context for GPT
&lt;/span&gt;        &lt;span class="n"&gt;context_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="n"&gt;email_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Email ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Persons mentioned: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;persons&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Organizations mentioned: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Content: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
                &lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Create prompt for GPT
&lt;/span&gt;        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        You are an expert analyst studying the Enron corporate emails dataset. Based ONLY on the following 
        email excerpts from the Enron corpus, answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question.

        EMAIL EXCERPTS FROM ENRON CORPUS:
        &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        QUESTION: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        Instructions:
        - Provide a factual answer based only on the email content provided above
        - If the emails mention specific people, organizations, or amounts, include those details
        - If the emails don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t contain enough information to answer the question, state that clearly
        - Reference specific email IDs when making claims
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🤖 Generating answer using GPT-4...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful analyst answering questions about Enron emails. Use only the provided email content and be specific about sources.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;💡 ANSWER:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Error during search: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error during search: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# Process the three original questions with metadata filtering
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== ENRON EMAIL ANALYSIS WITH METADATA FILTERING ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📋 Question 1:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;answer1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;answer_question_about_enron&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What concerns did Sherron Watkins express to Ken Lay in her email about Enron&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="s"&gt;s accounting practices?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Watkins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📋 Question 2:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;answer2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;answer_question_about_enron&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How did David Delainey justify inflating Mariner&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="s"&gt;s valuation from $250M to $600M in his email to Ken Lay?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delainey&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📋 Question 3:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;answer3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;answer_question_about_enron&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How did Tim DeSpain coach Ken Lay on what to tell credit rating agencies about Enron&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="s"&gt;s financial condition?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Moody&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
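
&lt;p&gt;The three calls above pass each question with its apostrophe already doubled (&lt;code&gt;Enron''s&lt;/code&gt;) so that the text can sit inside a single-quoted SQL literal. A minimal sketch of moving that step into the code is shown below. The helper names &lt;code&gt;escape_sql_literal&lt;/code&gt; and &lt;code&gt;build_where_clause&lt;/code&gt; are hypothetical and are not part of MindsDB or of the function above; the clause shape simply mirrors the queries printed in the output that follows.&lt;/p&gt;

```python
def escape_sql_literal(text):
    """Double every single quote so the text is safe inside a '...' SQL literal."""
    return text.replace("'", "''")


def build_where_clause(question, persons=None, organizations=None):
    """Assemble a WHERE clause in the same shape as the logged queries.

    Hypothetical helper: one semantic-search condition on `content`,
    plus one LIKE condition per person or organization filter.
    """
    conditions = [f"content = '{escape_sql_literal(question)}'"]
    for name in persons or []:
        conditions.append(f"persons LIKE '%{escape_sql_literal(name)}%'")
    for org in organizations or []:
        conditions.append(f"organizations LIKE '%{escape_sql_literal(org)}%'")
    return " AND ".join(conditions)


# The caller can now write the question naturally, with a single apostrophe.
clause = build_where_clause(
    "How did Tim DeSpain coach Ken Lay on what to tell credit rating agencies "
    "about Enron's financial condition?",
    organizations=["Moody"],
)
print(clause)
```

&lt;p&gt;This sketch only handles single quotes. It does not escape the LIKE wildcards &lt;code&gt;%&lt;/code&gt; and &lt;code&gt;_&lt;/code&gt; inside filter values, and if the client library you use supports parameterized queries, those are generally the safer choice.&lt;/p&gt;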




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== ENRON EMAIL ANALYSIS WITH METADATA FILTERING ===

📋 Question 1:
🤔 Question: What concerns did Sherron Watkins express to Ken Lay in her email about Enron''s accounting practices?
🔍 Filtering by persons: Watkins, Lay
🔍 SQL Query:
        SELECT
            id,
            chunk_content,
            relevance,
            metadata
        FROM enron_kb
        WHERE content = 'What concerns did Sherron Watkins express to Ken Lay in her email about Enron''s accounting practices?' AND persons LIKE '%Watkins%' AND persons LIKE '%Lay%'
        ORDER BY relevance DESC
        LIMIT 100;

✅ Found 16 results

📧 Result #1
🆔 Email ID: email_048101
📊 Relevance: 0.7053
📝 Subject: The key questions I asked Lay on Aug 22
👥 Persons: Sherron S. Watkins, Lay
🏢 Organizations: Enron Corp.
📍 Locations: None
💬 Content: Sherron S. Watkins Vice President, Enron Corp. 713-345-8799 office 713-416-0620 cell
------------------------------------------------------------

📧 Result #2
🆔 Email ID: email_335299
📊 Relevance: 0.6832
📝 Subject: TEAM 4 - HR ENERGY COMMERCE SUBPOENA (1) (01/14/02) AND (2) (12/10/01)
👥 Persons: Ken Lay, Sherron Watkins, JEDI
🏢 Organizations: BLUE DOG, LJM2, 09 09The, the RAP TEAM, Enron
📍 Locations: V E, electr
💬 Content: Please search your files and collect all records and documents covered by o r relevant to the following requests: (1) 09All records relating to any investigations/review of the allegations raised by S...
------------------------------------------------------------

📧 Result #3
🆔 Email ID: email_178480
📊 Relevance: 0.6727
📝 Subject: TEAM 4 - HR ENERGY COMMERCE SUBPOENA (1) (01/14/02) AND (2)
👥 Persons: Ken Lay, Sherron Watkins, JEDI, Bcc
🏢 Organizations: BLUE DOG, LJM2, k..heathman@enron.com, minutes , 09 09The
📍 Locations: V E, electr
💬 Content: (12/10/01) Cc: k..heathman@enron.com, team.response@enron.com Bcc: k..heathman@enron.com, team.response@enron.com We remain in the process of gathering information sought by various governm ental agen...
------------------------------------------------------------
🤖 Generating answer using GPT-4...

💡 ANSWER:
Sherron Watkins expressed concerns to Ken Lay about Enron's accounting practices in her email, stating that she was "incredibly nervous that we will implode in a wave of accounting scandals." This concern was highlighted in an email discussing the broader context of Enron's financial issues, where it was noted that Andersen, the government, and Enron itself had access to financial data indicating the company's potential collapse (Email ID: email_446563). Additionally, her concerns were significant enough to prompt investigations and reviews of the allegations she raised in her August memo to Ken Lay, as mentioned in emails discussing subpoenas and document requests (Email IDs: email_335299 and email_178480).

====================================================================================================


📋 Question 2:
🤔 Question: How did David Delainey justify inflating Mariner''s valuation from $250M to $600M in his email to Ken Lay?
🔍 Filtering by persons: Delainey
🔍 SQL Query:
        SELECT
            id,
            chunk_content,
            relevance,
            metadata
        FROM enron_kb
        WHERE content = 'How did David Delainey justify inflating Mariner''s valuation from $250M to $600M in his email to Ken Lay?' AND persons LIKE '%Delainey%'
        ORDER BY relevance DESC
        LIMIT 100;

✅ Found 100 results

📧 Result #1
🆔 Email ID: email_006062
📊 Relevance: 0.7015
📝 Subject: Mariner
👥 Persons: Delainey, Ken, Kase Lawal, Bcc
🏢 Organizations: un, IPO, Mariner, E P
📍 Locations: None
💬 Content: Cc: jeff.donahue@enron.com, raymond.bowen@enron.com Bcc: jeff.donahue@enron.com, raymond.bowen@enron.com Ken, in response to your note, I am not aware of any official dialogue with Mr. Kase Lawal abou...
------------------------------------------------------------

📧 Result #2
🆔 Email ID: email_400597
📊 Relevance: 0.7015
📝 Subject: Mariner
👥 Persons: Delainey, Ken, Kase Lawal, Bcc
🏢 Organizations: un, IPO, Mariner, E P
📍 Locations: None
💬 Content: Cc: jeff.donahue@enron.com, raymond.bowen@enron.com Bcc: jeff.donahue@enron.com, raymond.bowen@enron.com Ken, in response to your note, I am not aware of any official dialogue with Mr. Kase Lawal abou...
------------------------------------------------------------

📧 Result #3
🆔 Email ID: email_275215
📊 Relevance: 0.7015
📝 Subject: Mariner
👥 Persons: Delainey, Ken, Kase Lawal, Bcc
🏢 Organizations: un, IPO, Mariner, E P
📍 Locations: None
💬 Content: Cc: jeff.donahue@enron.com, raymond.bowen@enron.com Bcc: jeff.donahue@enron.com, raymond.bowen@enron.com Ken, in response to your note, I am not aware of any official dialogue with Mr. Kase Lawal abou...
------------------------------------------------------------
🤖 Generating answer using GPT-4...

💡 ANSWER:
David Delainey justified inflating Mariner's valuation from $250M to $600M based on several factors mentioned in the emails. According to the content of multiple emails (email IDs: email_006062, email_400597, email_275215, email_372317, email_280761, email_372732, email_332901), the justification included:

1. **Successful Wells**: Mariner had enjoyed a series of successful wells that were expected to be booked in reserve reports by the following March.
2. **Increases in Gas and Oil Prices**: There were significant increases in gas and oil prices, which contributed to the higher valuation.
3. **Reserve Growth**: The reserve growth was a key factor in the increased valuation.
4. **Current Energy Prices**: The current energy prices at the time supported the higher valuation.
5. **Future Goals**: The goal was to demonstrate three to four quarters of increasing operating cash flow and reserves growth before attempting further actions.

These factors collectively contributed to the stretch target valuation of $600M, which Delainey noted was not incredibly out of line given the circumstances.
📋 Question 3:
🤔 Question: How did Tim DeSpain coach Ken Lay on what to tell credit rating agencies about Enron''s financial condition?
🔍 Filtering by organizations: Moody
🔍 SQL Query:
        SELECT
            id,
            chunk_content,
            relevance,
            metadata
        FROM enron_kb
        WHERE content = 'How did Tim DeSpain coach Ken Lay on what to tell credit rating agencies about Enron''s financial condition?' AND organizations LIKE '%Moody%'
        ORDER BY relevance DESC
        LIMIT 100;

✅ Found 100 results

📧 Result #1
🆔 Email ID: email_028284
📊 Relevance: 0.6674
📝 Subject: Yesterday s Call: Feedback
👥 Persons: Good Luck, Jeff P.S., Cal Ed
🏢 Organizations: LJM, ENE, Moody s, Fastow, SEC
📍 Locations: Citi, Skilling
💬 Content: Ken, Thanks for having the call yesterday. I am a believer in Enron and we are buying your debt. Here s short feedback on the call. I give the call a B-/C grade. If you want a good example of a compan...
------------------------------------------------------------

📧 Result #2
🆔 Email ID: email_268144
📊 Relevance: 0.6636
📝 Subject: Moody s Annual Review Meeting
👥 Persons: Jeff McMahon, Stephen Moore - Relationship, Foley, Tim, Ben
🏢 Organizations: Sierra Pacific, EBS, International Asset Sales, Wholesale Services, Moody s
📍 Locations: California
💬 Content: Director, and Stephen Moore - Relationship Manager (our analyst). Diaz and Moore are very familiar with the Enron credit profile. Foley is their boss. He apparently is the leader of their ratings comm...
------------------------------------------------------------

📧 Result #3
🆔 Email ID: email_152456
📊 Relevance: 0.6528
📝 Subject: Moody s and Standard Poor s
👥 Persons: John Diaz, Ben, Bcc, Andy, Tim DeSpain
🏢 Organizations: Credit Ratings - emphasize, Moody s Call:, Standard Poor s, EBS, Dhabol
📍 Locations: None
&lt;h2&gt;
  
  
  💬 Content: Cc: &lt;a href="mailto:ben.glisan@enron.com"&gt;ben.glisan@enron.com&lt;/a&gt;, &lt;a href="mailto:andrew.fastow@enron.com"&gt;andrew.fastow@enron.com&lt;/a&gt; Bcc: &lt;a href="mailto:ben.glisan@enron.com"&gt;ben.glisan@enron.com&lt;/a&gt;, &lt;a href="mailto:andrew.fastow@enron.com"&gt;andrew.fastow@enron.com&lt;/a&gt; Two conference calls have been tenatively scheduled to allow you to directly discuss Enron s commit...
&lt;/h2&gt;

&lt;p&gt;🤖 Generating answer using GPT-4...&lt;/p&gt;

&lt;p&gt;💡 ANSWER:&lt;br&gt;
Tim DeSpain, along with Andy and Ben, coached Ken Lay on what to tell credit rating agencies about Enron's financial condition by emphasizing several key assurances. According to email ID: email_152456 and email ID: email_372468, they advised Ken Lay to stress the following points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commitment to Maintaining Credit Ratings&lt;/strong&gt;: They emphasized that maintaining credit ratings was critical to Enron's fundamental businesses, particularly gas and power marketing. They noted that both counterparties and creditors placed significant importance on Enron's consistent rating profile.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Strength of Core Businesses&lt;/strong&gt;: They highlighted that Enron's core businesses were strong, positioning Enron as the leading franchise in energy marketing. They anticipated continued strength in financial performance from the commodity groups.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These points were intended to assure the credit rating agencies of Enron's financial stability and commitment to its credit ratings.&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we've successfully built a sophisticated question-answering system over the Enron email corpus by combining MindsDB's Knowledge Base capabilities with Named Entity Recognition. This demonstrates how modern AI tools can transform unstructured text into a queryable, intelligent knowledge base.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Achievements
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated metadata extraction&lt;/strong&gt;: By leveraging spaCy's NER models, we automatically extracted structured entities (people, organizations, locations) from raw email text, turning unstructured data into records that support both semantic and structured queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid search capabilities&lt;/strong&gt;: The knowledge base supports both soft search criteria (semantic similarity through embeddings) and hard search criteria (metadata filtering), allowing for precise and flexible information retrieval. Combining the two improves precision and cuts down on irrelevant results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplified query interface&lt;/strong&gt;: MindsDB abstracts away the complexity of vector databases, embedding models, and similarity calculations behind a familiar SQL interface. A simple condition on the &lt;code&gt;content&lt;/code&gt; column in the WHERE clause of a SELECT statement triggers semantic search, which makes it accessible to anyone familiar with SQL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Practical RAG implementation&lt;/strong&gt;: By integrating the knowledge base with a chat LLM, we've created a Retrieval-Augmented Generation (RAG) system that answers complex questions by first retrieving relevant context and then generating an answer grounded in it, which helps reduce hallucinations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
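
&lt;p&gt;As a minimal sketch of how the soft and hard criteria combine in one statement, the query below reuses the &lt;code&gt;enron_kb&lt;/code&gt; knowledge base and the &lt;code&gt;organizations&lt;/code&gt; metadata column from the queries above. The search phrase, the relevance threshold of 0.6, and the limit are illustrative assumptions rather than tuned values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Soft criterion: semantic match on content.
-- Hard criterion: metadata filter on organizations.
-- The 0.6 threshold is an illustrative assumption; adjust it for your data.
SELECT id, chunk_content, relevance, metadata
FROM enron_kb
WHERE content = 'guidance for credit rating agency calls'
  AND organizations LIKE '%Moody%'
  AND relevance &amp;gt;= 0.6
ORDER BY relevance DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;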

&lt;h3&gt;
  
  
  Real-World Applications
&lt;/h3&gt;

&lt;p&gt;The techniques demonstrated in this tutorial have broad applications beyond the Enron dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Corporate knowledge management&lt;/strong&gt;: Search through internal documents, emails, and reports using both semantic queries and metadata filters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal discovery&lt;/strong&gt;: Find relevant communications filtered by sender, recipient, date range, or mentioned entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer support&lt;/strong&gt;: Build intelligent support systems that can search through product documentation and past support tickets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research analysis&lt;/strong&gt;: Query academic papers, research notes, or experimental data with combined semantic and structured filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;To extend this project, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expanding entity types&lt;/strong&gt;: Extract additional metadata such as monetary amounts, dates, or custom domain-specific entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning embeddings&lt;/strong&gt;: Use domain-specific embedding models for improved semantic matching in specialized fields&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal knowledge bases&lt;/strong&gt;: Incorporate documents, images, and other file types into your knowledge base&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced filtering&lt;/strong&gt;: Implement complex boolean logic and date-range queries for more sophisticated searches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production deployment&lt;/strong&gt;: Scale the system to handle larger datasets and concurrent users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more information on MindsDB Knowledge Bases and advanced features, visit the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Watch the recording of the live webinar on YouTube:&lt;br&gt;


  &lt;iframe src="https://www.youtube.com/embed/fKjX71-5Xyk"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>knowledgebases</category>
      <category>opensource</category>
    </item>
    <item>
      <title>MariaDB &amp; MindsDB Turns WooCommerce Data to Insights with Real-Time AI Analytics for eCommerce Teams</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Mon, 29 Dec 2025 23:07:03 +0000</pubDate>
      <link>https://dev.to/mindsdb/mariadb-mindsdb-turns-woocommerce-data-to-insights-with-real-time-ai-analytics-for-ecommerce-teams-3a9j</link>
      <guid>https://dev.to/mindsdb/mariadb-mindsdb-turns-woocommerce-data-to-insights-with-real-time-ai-analytics-for-ecommerce-teams-3a9j</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Chandre Van Der Westhuizen, Community &amp;amp; Marketing Co-ordinator at MindsDB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MariaDB is a widely used open-source relational database that powers eCommerce operations with features like high availability, scalability, and flexible storage engine options. Platforms such as WooCommerce rely on it to manage product catalogs, customer information, and transaction workflows efficiently - and this is why so many eCommerce teams choose MariaDB.&lt;/p&gt;

&lt;p&gt;eCommerce businesses run on data - customer activity, product trends, abandoned carts, shipping timelines, and returns. But too often, that data lives locked away in silos: MariaDB for transactions, spreadsheets for performance metrics, and dashboards for reporting.&lt;/p&gt;

&lt;p&gt;By the time a marketing or operations team reacts, the opportunity has passed.&lt;/p&gt;

&lt;p&gt;With MindsDB, eCommerce teams can connect directly to their &lt;a href="https://docs.mindsdb.com/integrations/data-integrations/mariadb?_gl=1*nmcc4x*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw#mariadb" rel="noopener noreferrer"&gt;MariaDB&lt;/a&gt; databases  and use AI to analyze, predict, and act on data - all in real time, using SQL or natural language, and with zero ETL pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0te5tzghlj3q67fe66kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0te5tzghlj3q67fe66kp.png" alt="mindsdb" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Too Many Tools, Too Much Latency
&lt;/h2&gt;

&lt;p&gt;WooCommerce stores backed by MariaDB collect a wealth of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orders, payments, and refunds&lt;/li&gt;
&lt;li&gt;Customer profiles and buying behavior&lt;/li&gt;
&lt;li&gt;Product inventory and pricing&lt;/li&gt;
&lt;li&gt;Shipment tracking and delivery times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditionally, turning that raw data into insights required exporting CSVs, setting up ETL pipelines, or using third-party BI dashboards. These processes add friction, delay decisions, and make it nearly impossible to deliver real-time personalization or dynamic pricing.&lt;/p&gt;

&lt;h2&gt;
  
  
  MindsDB + MariaDB: AI-Powered Insights Without Moving Your Data
&lt;/h2&gt;

&lt;p&gt;MindsDB brings the power of AI directly into the database layer - meaning your WooCommerce data becomes instantly searchable, explainable, and predictable without ever moving it.&lt;/p&gt;

&lt;p&gt;With MindsDB’s integration with &lt;a href="https://docs.mindsdb.com/integrations/data-integrations/mariadb?_gl=1*11yrca0*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw#mariadb" rel="noopener noreferrer"&gt;MariaDB&lt;/a&gt;, eCommerce teams can run real-time analytics, semantic search, and predictions directly on live orders, customers, products, and reviews - all without ETL or data duplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With MindsDB’s integration with MariaDB, you can:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query live WooCommerce data (orders, customers, products, reviews) directly in the database using SQL or natural language - no ETL required&lt;/li&gt;
&lt;li&gt;Perform hybrid search that blends structured and unstructured data (e.g., review-text + order history) to surface insights like shipping issues, churn risk, or product defects&lt;/li&gt;
&lt;li&gt;Predict outcomes (reorders, stockouts, customer lifetime value) in real time, since everything operates inside MariaDB&lt;/li&gt;
&lt;li&gt;Preserve data governance and security: data stays in MariaDB’s environment, enabling easier auditability and compliance&lt;/li&gt;
&lt;li&gt;Empower cross-functional teams (CX, marketing, operations, finance) with on-demand analytics from the same system powering your eCommerce&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MindsDB + MariaDB turns your WooCommerce store into a real-time AI analytics engine&lt;/strong&gt; without changing your infrastructure or data stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4h87ufk1f0gbxqnzt9j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4h87ufk1f0gbxqnzt9j5.png" alt="mindsdb" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The MindsDB Solution: AI-Native, Zero-ETL Analytics Inside MariaDB
&lt;/h2&gt;

&lt;p&gt;MindsDB’s Federated Query Engine lets you connect to MariaDB through a single SQL interface, unify your data with &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview?_gl=1*1742whc*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Knowledge Bases&lt;/a&gt;, and query it with &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/hybrid_search?_gl=1*1742whc*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Hybrid Search&lt;/a&gt;, either in SQL or in natural language via &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent?_gl=1*1742whc*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Agents&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To showcase how you can turn your raw data into valuable insights, we will be using a sample dataset for WooCommerce stored in MariaDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Access MindsDB’s GUI via &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker?_gl=1*7oloxq*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; locally or &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker-desktop?_gl=1*1mhs5kr*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;MindsDB’s extension on Docker Desktop.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Configure your default models in the MindsDB GUI by navigating to Settings → Models.&lt;/li&gt;
&lt;li&gt;Navigate to Manage Integrations in Settings and install the dependencies for MariaDB.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you have installed the dependencies in the GUI, you can connect to MariaDB using the SQL Editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;mariadb&lt;/span&gt;  &lt;span class="c1"&gt;--- display name for database.&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'mariadb'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;--- name of the mindsdb handler&lt;/span&gt;
&lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"demo_user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;--- Your database user.&lt;/span&gt;
   &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"demo_password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;--- Your password.&lt;/span&gt;
   &lt;span class="nv"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"samples.mindsdb.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;--- host, it can be an ip or an url.&lt;/span&gt;
   &lt;span class="nv"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"3307"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;--- common port is 3306.&lt;/span&gt;
   &lt;span class="nv"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"test_data"&lt;/span&gt;           &lt;span class="c1"&gt;--- The name of your database *optional.&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data we will access is in the WooCommerce Orders, Products, Reviews, and Customers tables.&lt;/p&gt;
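
&lt;p&gt;Before building on the connection, it can help to confirm that MindsDB can read from it. A minimal check, assuming the connection name &lt;code&gt;mariadb&lt;/code&gt; created above and the &lt;code&gt;woocommerce_products&lt;/code&gt; table used later in this walkthrough:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Preview a few rows through the MindsDB connection
SELECT *
FROM mariadb.woocommerce_products
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;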

&lt;h3&gt;
  
  
  Unifying MariaDB’s WooCommerce Data using MindsDB’s Knowledge Bases
&lt;/h3&gt;

&lt;p&gt;A MindsDB &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview?_gl=1*1p34k49*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Knowledge Base&lt;/a&gt; is an AI-enhanced table that organizes information by meaning instead of keywords, using embeddings, rerankers, and vector storage to understand context. This allows it to perform semantic reasoning across data points, delivering deeper insights and highly accurate, context-aware answers.&lt;/p&gt;

&lt;p&gt;Knowledge Bases will be created for the WooCommerce Orders, Products, Reviews, and Customers tables.&lt;/p&gt;

&lt;p&gt;The first table we will use is the Products table. To create a Knowledge Base, the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/create?_gl=1*5wlrx8*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;CREATE KNOWLEDGE_BASE&lt;/a&gt; statement will be used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
&lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;heroku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_kb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'price'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'stock'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'rating'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'product_id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the parameters provided: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;products_kb: The name of the knowledge base.&lt;/li&gt;
&lt;li&gt;storage: The table where the knowledge base’s embeddings are stored. Here, heroku is the name of a PGVector database connection that must already exist in MindsDB, and product_kb is the name of the table that will be created in it for storage.&lt;/li&gt;
&lt;li&gt;metadata_columns: The columns stored as metadata, which can be used for metadata filtering.&lt;/li&gt;
&lt;li&gt;content_columns: The columns whose text is embedded and used for semantic search.&lt;/li&gt;
&lt;li&gt;id_column: The column that uniquely identifies each source data row in the knowledge base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data from MariaDB can be inserted into this Knowledge Base using the INSERT INTO statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mariadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;woocommerce_products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can select the Knowledge Base to query the data that has been inserted using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/query?_gl=1*b2t07e*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;SELECT&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6ty5ay3zem4svcbgvnv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6ty5ay3zem4svcbgvnv.png" alt="mindsdb" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the same steps as above, knowledge bases have been created for the remaining tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;customers_kb: Created with the WooCommerce Customers table in MariaDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3ubndhwqo2z89imgam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3ubndhwqo2z89imgam.png" alt="mindsdb" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reviews_kb: Created with the WooCommerce Reviews table in MariaDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uk7l5jkizoidmexte0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uk7l5jkizoidmexte0f.png" alt="mindsdb" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orders_kb: Created with the WooCommerce Orders table in MariaDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6p7b3vy35hwtekwr1zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6p7b3vy35hwtekwr1zu.png" alt="mindsdb" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
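
&lt;p&gt;For reference, the same two steps for the reviews table might look like the sketch below. Only &lt;code&gt;reviews_kb&lt;/code&gt;, the &lt;code&gt;heroku&lt;/code&gt; storage connection, and the &lt;code&gt;rating&lt;/code&gt; column appear in this walkthrough; the source table name &lt;code&gt;woocommerce_reviews&lt;/code&gt;, the storage table name, and the &lt;code&gt;review_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, and &lt;code&gt;review_text&lt;/code&gt; columns are hypothetical, so substitute the names from your own schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical column and table names; adjust to your schema.
CREATE KNOWLEDGE_BASE reviews_kb
USING
storage = heroku.reviews_kb,
metadata_columns = ['product_id', 'rating'],
content_columns = ['review_text'],
id_column = 'review_id';

-- Load the source rows from MariaDB into the knowledge base
INSERT INTO reviews_kb
SELECT review_id, product_id, rating, review_text
FROM mariadb.woocommerce_reviews;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;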

&lt;h3&gt;
  
  
  Performing Keyword and Semantic Search using MindsDB’s Hybrid Search
&lt;/h3&gt;

&lt;p&gt;Knowledge bases offer both semantic search and keyword-based search, each suited for different types of queries. &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/hybrid_search?_gl=1*1aelu4k*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Hybrid search&lt;/a&gt; combines them, ensuring users get results that match meaning as well as exact terms, covering scenarios where embeddings miss specific keywords or identifiers.&lt;/p&gt;
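
&lt;p&gt;The balance between the two is controlled by &lt;code&gt;hybrid_search_alpha&lt;/code&gt;. As described in the hybrid search documentation, values closer to 0 weight keyword relevance more heavily and values closer to 1 weight semantic relevance more heavily, with 0.5 as the default. The sketch below assumes the &lt;code&gt;products_kb&lt;/code&gt; created earlier; the search phrase is hypothetical, and the &lt;code&gt;hybrid_search = true&lt;/code&gt; flag follows the documentation’s examples and may not be required in every version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Lean toward exact keyword matches (e.g. product names or identifiers)
SELECT *
FROM products_kb
WHERE content = 'wireless headphones'
AND hybrid_search = true
AND hybrid_search_alpha = 0.2;

-- Lean toward meaning-based matches
SELECT *
FROM products_kb
WHERE content = 'wireless headphones'
AND hybrid_search = true
AND hybrid_search_alpha = 0.8;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;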

&lt;p&gt;eCommerce teams can understand which products in key categories (like Electronics and Fitness) are generating positive customer sentiment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Correlate positive reviews with products&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;  &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'would buy again'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Electronics'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Fitness'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By correlating favorable reviews with specific products using semantic intent (“would buy again”), teams can identify high-performing items, refine marketing strategies, improve product recommendations, and prioritize inventory for products that foster strong customer loyalty.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghd8j7689rgull71bjnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghd8j7689rgull71bjnn.png" alt="mindsdb" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's identify products that consistently generate negative sentiment, specifically complaints about quality and value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Correlate products with negative reviews&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;  &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'Not worth the price.'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Quality could be better.'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;reviews_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;products_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By correlating these reviews directly with the affected products, eCommerce teams can detect potential quality issues, supplier problems, misleading product descriptions, or customer-experience gaps early, allowing them to take corrective action before negative feedback impacts sales, returns, or brand reputation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtnjmept65o9b1f9otqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtnjmept65o9b1f9otqf.png" alt="mindsdb" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;eCommerce teams can quickly identify which customers have experienced refunded orders, a strong signal of friction, dissatisfaction, or potential operational issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Identify customers with refunded orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Refunded'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;customers_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By linking refunds directly to customer profiles, teams can investigate root causes, prevent churn among high-value customers, and improve support, logistics, or product quality before these problems escalate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24lok87erpo05zg6mzab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24lok87erpo05zg6mzab.png" alt="mindsdb" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Teams can highlight high-priority operational risks by surfacing pending orders belonging to Platinum (top-tier) customers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;--Track orders that are pending for Platinum customers&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;customers_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Platinum'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Pending'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;customers_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relevance&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;customers_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These customers contribute disproportionately to revenue and loyalty, so delays or issues in their orders can directly impact retention and brand trust. By proactively identifying pending orders for Platinum customers, support and operations teams can intervene faster, reduce dissatisfaction, and ensure premium customers receive the service level they expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrachp9kow10wv2i0yu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrachp9kow10wv2i0yu7.png" alt="mindsdb" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Together, these AI-powered Hybrid Search queries demonstrate how MindsDB and MariaDB transform raw WooCommerce data into actionable intelligence, helping teams anticipate issues, understand customers more deeply, and make smarter, faster decisions. By combining semantic and keyword search with structured analytics, businesses gain a real-time, 360° view of product performance, customer sentiment, and operational health, ultimately enabling a more resilient, data-driven eCommerce strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a MindsDB Agent That Understands Your MariaDB’s WooCommerce Data
&lt;/h3&gt;

&lt;p&gt;MindsDB’s &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent?_gl=1*eixsgi*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Agents&lt;/a&gt; make it possible to interact conversationally with your data, both structured and unstructured, through MindsDB. Here we will create an AI Agent that uses the Knowledge Bases we previously created from the MariaDB data.&lt;/p&gt;

&lt;p&gt;Use the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent_syntax?_gl=1*eixsgi*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw#create-agent-syntax" rel="noopener noreferrer"&gt;CREATE AGENT&lt;/a&gt; statement to create the AI Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;AGENT&lt;/span&gt; &lt;span class="n"&gt;mariadb_ecommerce_agent&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;"knowledge_bases"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"orders_kb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"customers_kb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"reviews_kb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"products_kb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="p"&gt;},&lt;/span&gt;
   &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'You are an AI assistant working with WooCommerce eCommerce data stored across several Knowledge Bases:

orders_kb - contains order-level information such as customer_id, order_date, total_amount, payment_method, and order_status.

customers_kb - contains customer profiles including first_name, last_name, email, country, loyalty_tier, signup_date, and total_spent.

reviews_kb - contains product reviews including review_text, rating, review_date, and the customer who wrote the review.

products_kb - contains product catalog information such as product name, category, price, stock, and rating.

Use these Knowledge Bases to answer questions about WooCommerce performance, customer behavior, product insights, order patterns, and review sentiment.
Always provide grounded, data-backed answers using the available knowledge.'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a breakdown of the parameters provided to the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mariadb_ecommerce_agent: the name given to the agent.&lt;/li&gt;
&lt;li&gt;data: the data connected to the agent, which can include knowledge bases and data sources connected to MindsDB.&lt;/li&gt;
&lt;li&gt;knowledge_bases: the list of knowledge bases the agent can use.&lt;/li&gt;
&lt;li&gt;prompt_template: the instructions for the agent. It is recommended to describe the data held in each knowledge base listed in the knowledge_bases parameter, which helps the agent locate relevant data for answering questions.&lt;/li&gt;
&lt;/ul&gt;
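
&lt;p&gt;Once created, the agent can also be queried directly from SQL. The following is a minimal sketch, assuming your MindsDB version supports the SELECT answer ... WHERE question = ... form described in the Agents documentation; the question text is only an illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Ask the agent a natural-language question from SQL.
-- Assumption: agents can be queried like a table, with the
-- question passed in the WHERE clause and the reply returned
-- in the answer column.
SELECT answer
FROM mariadb_ecommerce_agent
WHERE question = 'Which products have low stock levels?';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;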

&lt;p&gt;MindsDB offers a &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent_gui?_gl=1*eixsgi*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Chat Interface&lt;/a&gt; in the GUI that allows you to chat with your AI Agent in natural language. Let's ask a few questions to gain insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 1: Which products have low stock levels?&lt;/strong&gt;&lt;br&gt;
By identifying products that are running low, eCommerce teams can prevent stockouts, avoid lost revenue, maintain accurate delivery estimates, and plan timely replenishment—ensuring a smoother shopping experience and better inventory management overall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4aws3d4n4nqm4ehm010m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4aws3d4n4nqm4ehm010m.png" alt="mindsdb" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can expand the table and scroll to see the full list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fto4cv87rks7gmqj7f5is.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fto4cv87rks7gmqj7f5is.png" alt="mindsdb" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 2: Which categories have the highest average rating?&lt;/strong&gt;&lt;br&gt;
Knowing which product categories have the highest average rating helps eCommerce teams understand where customers are most satisfied. It guides decisions around merchandising, marketing focus, supplier relationships, and future product investments—allowing the business to double down on categories that consistently deliver strong customer experiences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5i4cly3ci9xxgm3qzyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5i4cly3ci9xxgm3qzyt.png" alt="mindsdb" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 3: Is there any meaningful correlation between review sentiment and repeat purchasing?&lt;/strong&gt;&lt;br&gt;
This helps reveal whether customer satisfaction directly influences loyalty and repeat purchases. Understanding this correlation allows eCommerce teams to prioritize experience improvements that have the greatest impact on long-term revenue and customer retention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpdudu4pjl3jqfucoxc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpdudu4pjl3jqfucoxc3.png" alt="mindsdb" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 4: Which customers have made the most purchases this year?&lt;/strong&gt;&lt;br&gt;
This identifies your highest-engagement customers—the ones driving the most transactions and revenue. Knowing who they are allows teams to target rewards, personalized marketing, and retention strategies toward their most valuable buyers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2wtykaji1z117v182ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2wtykaji1z117v182ng.png" alt="mindsdb" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 5: Give me a summary of customer sentiment for electronics products.&lt;/strong&gt;&lt;br&gt;
This provides a clear snapshot of how customers feel about a key category, helping teams quickly identify strengths, weaknesses, and emerging issues. Understanding sentiment for electronics products guides product improvements, marketing decisions, and support priorities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focljx1mjcn06f8m42lxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focljx1mjcn06f8m42lxh.png" alt="mindsdb" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 6: What is the sum of the total amount per order status?&lt;/strong&gt;&lt;br&gt;
This shows how revenue is distributed across different order statuses, such as Completed, Pending, or Refunded. This helps teams understand fulfillment efficiency, identify bottlenecks, and quantify the financial impact of delays or cancellations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjfhkd28fn4o1q7o0k6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjfhkd28fn4o1q7o0k6b.png" alt="mindsdb" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;
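
&lt;p&gt;For comparison, the same aggregate can be expressed as plain SQL against the connected MariaDB source. This is only a sketch: mariadb_datasource and orders are hypothetical names for the MariaDB connection and the orders table behind orders_kb, while order_status and total_amount are the column names mentioned in the agent prompt above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sum of order totals per order status.
-- Hypothetical names: mariadb_datasource (the MariaDB connection
-- in MindsDB) and orders (the WooCommerce orders table).
SELECT order_status,
       SUM(total_amount) AS revenue
FROM mariadb_datasource.orders
GROUP BY order_status
ORDER BY revenue DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;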

&lt;p&gt;These questions empower eCommerce teams to turn raw operational data into clear, actionable insights. By understanding customer behavior, product performance, sentiment trends, and revenue patterns, businesses can make smarter decisions that improve customer experience, streamline operations, and drive sustainable growth. With MindsDB and MariaDB working together, these insights become real-time, AI-driven, and accessible directly from the data source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases for Teams Using MariaDB with MindsDB
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Customer Retention &amp;amp; Re-Order Predictions:&lt;/strong&gt; Identify who will buy again in the next 30 days, and trigger automated win-back campaigns.&lt;br&gt;
&lt;strong&gt;2. Smarter Product Recommendations:&lt;/strong&gt; Blend product metadata, purchase history, and review sentiment to power AI personalization.&lt;br&gt;
&lt;strong&gt;3. Real-Time Inventory Forecasting:&lt;/strong&gt; Predict stockouts or slow-moving items and optimize replenishment.&lt;br&gt;
&lt;strong&gt;4. Operational Intelligence:&lt;/strong&gt; Understand why refunds spike, what customers complain about, and where shipment delays occur.&lt;br&gt;
&lt;strong&gt;5. Executive Dashboarding Without Limits:&lt;/strong&gt; Query everything directly from SQL or natural language - without waiting for ETL jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Teams Using MariaDB
&lt;/h2&gt;

&lt;p&gt;With MindsDB and MariaDB, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A single unified view of your entire operation:&lt;/strong&gt; Orders, reviews, returns, and inventory - all searchable and analyzable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant insights without exporting data:&lt;/strong&gt; Skip ETL. Skip spreadsheets. See what’s happening now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-native intelligence across every workflow:&lt;/strong&gt; Forecasting, summarization, classification, hybrid search - all inside SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in trust and traceability:&lt;/strong&gt; Every answer is grounded in your actual database rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the future of your business: AI that understands your operation because it sits directly on top of your real data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MariaDB and MindsDB together unlock a new era of real-time, AI-powered intelligence, moving teams beyond traditional dashboards, slow ETL pipelines, and disconnected tools. By unifying structured operational data with unstructured review text and layering semantic understanding on top, MindsDB transforms MariaDB into a live decision engine that answers complex questions, reveals hidden patterns, and predicts what happens next.&lt;/p&gt;

&lt;p&gt;Whether you're optimizing inventory, improving product quality, reducing refunds, understanding customer sentiment, or driving retention, MindsDB makes these insights available instantly and directly where your data already lives. With hybrid search, knowledge bases, and intelligent agents, teams can finally interact with their MariaDB data the way they think: conversationally, contextually, and without friction.&lt;/p&gt;

&lt;p&gt;The result is faster decisions, more resilient operations, and a smarter, data-driven business powered by AI that sits right inside your database. If you are using MariaDB and would like to supercharge your data with AI, &lt;a href="https://mindsdb.com/contact" rel="noopener noreferrer"&gt;contact our team to get started&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mariadb</category>
      <category>sql</category>
      <category>ai</category>
      <category>analytics</category>
    </item>
    <item>
      <title>MindsDB Supercharges Google's MCP Toolbox with Unstructured Data Support</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Mon, 29 Dec 2025 20:46:39 +0000</pubDate>
      <link>https://dev.to/mindsdb/mindsdb-supercharges-googles-mcp-toolbox-with-unstructured-data-support-4cch</link>
      <guid>https://dev.to/mindsdb/mindsdb-supercharges-googles-mcp-toolbox-with-unstructured-data-support-4cch</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Erik Bovee, Head of Business Development at MindsDB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We’re happy to announce that we’ve integrated &lt;a href="https://mindsdb.com/" rel="noopener noreferrer"&gt;MindsDB&lt;/a&gt; with Google's open-source project, &lt;a href="https://github.com/googleapis/genai-toolbox" rel="noopener noreferrer"&gt;MCP (Model Context Protocol) Toolbox&lt;/a&gt;. The integration gives your AI applications access to far more of your data and expands the Toolbox's reach, especially for organizations grappling with lots of siloed data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MindsDB integration at a glance
&lt;/h2&gt;

&lt;p&gt;At its core, MindsDB is a federated query engine designed specifically for AI applications. It acts as a universal translator, enabling you to query hundreds of data sources (structured, semi-structured, unstructured) using familiar SQL. We’ve contributed this capability as a new connector to MCP Toolbox, allowing developers and AI agents to interact with a broader spectrum of enterprise data through MindsDB.&lt;/p&gt;

&lt;p&gt;Now, with MindsDB, MCP Toolbox can connect to &lt;a href="https://docs.mindsdb.com/integrations/support?_gl=1*7mnv5s*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;hundreds of data sources&lt;/a&gt;, including popular business applications like Salesforce, Jira, and GitHub, and even unstructured data sources like Gmail and Slack. This means you can break down data silos and connect all your data to your AI applications through a single API. Popular use cases include AI-powered search, analytics, and the ability to provide real-time data to agentic applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1w2xt6mj3dom23ab5f15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1w2xt6mj3dom23ab5f15.png" alt="mindsdb+google_mcp_toolbox" width="800" height="713"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features: Bridging structured and unstructured worlds
&lt;/h2&gt;

&lt;p&gt;MindsDB brings several essential features to the MCP Toolbox:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Datasource expansion:&lt;/strong&gt; The most immediate and impactful benefit is the sheer volume of new data sources accessible. Imagine querying Salesforce opportunities alongside GitHub activity, or analyzing email patterns with Slack conversations—all through a unified interface. This greatly expands the Toolbox's utility for enterprise users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL interface for any data:&lt;/strong&gt; This is where MindsDB shines for developers. It allows you to write standard SQL queries that automatically translate to various API protocols, including REST APIs, GraphQL, and native protocols. This reduces the complexity and learning curve associated with accessing diverse data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-datasource AI analytics:&lt;/strong&gt; MindsDB's SQL capability makes it possible to perform joins and analytics across different data sources. For instance, you can correlate sales data from Salesforce with development activity from GitHub for a more holistic view of your business operations. To make this possible, MindsDB treats each data source, structured or unstructured, as a virtual table, so SQL operations can span all sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access to unstructured data:&lt;/strong&gt; MindsDB provides over 200 data connectors, and its knowledge bases can index text data so that it can then be queried using SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Bases for unstructured data:&lt;/strong&gt; MindsDB allows you to create Knowledge Bases, which are essentially autonomous Retrieval-Augmented Generation (RAG) systems. You can ingest unstructured data like emails (Gmail/Outlook), messages (Slack, Microsoft Teams, Discord), and files (S3, filesystems) into these knowledge bases. Once ingested, this unstructured data becomes queryable by an AI application or model. Querying across data sources is supported by an auto-generated data catalog that contains metadata and a relational model across all data sources. MindsDB also supports hybrid search, combining vector similarity with keyword search to surface the most relevant results for AI search and analytics use cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Technical implementation updates
&lt;/h3&gt;

&lt;p&gt;This integration brings a suite of powerful updates and benefits for developers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New MindsDB Source implementation: The Toolbox now includes a new MindsDB source implementation, leveraging the MySQL wire protocol for robust connectivity.&lt;/li&gt;
&lt;li&gt;Comprehensive test coverage: Extensive unit and integration tests ensure reliability and backward compatibility of the new MindsDB tools with existing SQL features.&lt;/li&gt;
&lt;li&gt;Dedicated MindsDB tools: New mindsdb-execute-sql for direct SQL execution and mindsdb-sql for parameterized queries offer enhanced flexibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quickstart Guide
&lt;/h3&gt;

&lt;p&gt;To get started with MindsDB and the MCP Toolbox, follow these general steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Set up the MCP Toolbox:&lt;/strong&gt; Ensure you have the MCP Toolbox service running. You can find detailed instructions in the &lt;a href="https://github.com/googleapis/genai-toolbox/blob/main/README.md" rel="noopener noreferrer"&gt;official documentation.&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;2. Install MindsDB:&lt;/strong&gt; The fastest way to get MindsDB up and running is via Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run --name mindsdb_container \
  -e MINDSDB_APIS=http,mysql \
  -p 47334:47334 -p 47335:47335 \
  mindsdb/mindsdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more installation options (e.g., PyPI), refer to the MindsDB &lt;a href="https://docs.mindsdb.com/contribute/install?_gl=1*1gdf1ha*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Connect your data sources in MindsDB:&lt;/strong&gt; Within MindsDB, you'll need to create "databases" that connect to your external data sources (e.g., &lt;a href="https://docs.mindsdb.com/integrations/app-integrations/salesforce?_gl=1*1gdf1ha*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Salesforce&lt;/a&gt;, &lt;a href="https://docs.mindsdb.com/integrations/app-integrations/github?_gl=1*1gdf1ha*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://docs.mindsdb.com/integrations/app-integrations/gmail?_gl=1*1gdf1ha*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Gmail&lt;/a&gt;). This typically involves CREATE DATABASE statements. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- connect to salesforce&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;salesforce_datasource&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt;
    &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'salesforce'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"your-username@email.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"your-password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;"client_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"your-client-id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;"client_secret"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"your-client-secret"&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;-- connect to postgres&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;postgresql_datasource&lt;/span&gt; 
&lt;span class="k"&gt;WITH&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'postgres'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"postgres.sample.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"password"&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Refer to the MindsDB documentation for &lt;a href="https://docs.mindsdb.com/mindsdb-connect?_gl=1*1as5xy4*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;specific connector details.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Use MindsDB Tools in MCP Toolbox:&lt;/strong&gt; The MindsDB integration within MCP Toolbox allows you to execute SQL queries across your connected MindsDB data sources. You can use tools like mindsdb-execute-sql for direct querying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- run federated queries&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;salesforce_datasource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;postgresql_datasource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_int&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`Id`&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`AccountId`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Unify data into the Knowledge Base:&lt;/strong&gt; You can then load data into a MindsDB ‘Knowledge Base’, which is particularly useful for things like semantic search over large, unstructured data sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;my_kb&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;"model_name"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"text-embedding-3-large"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;"api_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"sk-..."&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;reranking_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;"model_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nv"&gt;"api_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"sk-..."&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;my_vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'AccountId'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Created_At'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;
    &lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Description'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Notes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;
    &lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;my_kb&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;salesforce_datasource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;postgresql_datasource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_int&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`Id`&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`AccountId`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
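
&lt;p&gt;Once the data is loaded, the Knowledge Base can be searched by meaning rather than by exact match. The following is a minimal sketch, assuming the &lt;code&gt;content&lt;/code&gt; and &lt;code&gt;relevance&lt;/code&gt; filters described in the MindsDB Knowledge Base documentation; the search phrase and the 0.5 threshold are hypothetical values chosen for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- semantic search over my_kb (phrase and threshold are illustrative assumptions)
SELECT *
FROM my_kb
WHERE content = 'accounts reporting onboarding problems'
  AND relevance &amp;gt;= 0.5
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;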



&lt;p&gt;Learn more about &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview?_gl=1*1qh300n*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;knowledge bases here.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;MCP Toolbox GitHub repo: &lt;a href="https://github.com/googleapis/genai-toolbox" rel="noopener noreferrer"&gt;https://github.com/googleapis/genai-toolbox&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;MindsDB in MCP Toolbox: &lt;a href="https://github.com/googleapis/genai-toolbox/blob/main/docs/en/resources/tools/mindsdb/_index.md" rel="noopener noreferrer"&gt;https://github.com/googleapis/genai-toolbox/blob/main/docs/en/resources/tools/mindsdb/_index.md&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;MindsDB Documentation: &lt;a href="https://docs.mindsdb.com/" rel="noopener noreferrer"&gt;https://docs.mindsdb.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MindsDB GitHub: &lt;a href="https://github.com/mindsdb/mindsdb" rel="noopener noreferrer"&gt;https://github.com/mindsdb/mindsdb&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>google</category>
      <category>mindsdb</category>
      <category>ai</category>
    </item>
    <item>
      <title>Streamline Financial Analysis with MindsDB’s Knowledge Bases and Hybrid Search</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Mon, 29 Dec 2025 20:29:37 +0000</pubDate>
      <link>https://dev.to/mindsdb/streamline-financial-analysis-with-mindsdbs-knowledge-bases-and-hybrid-search-22nb</link>
      <guid>https://dev.to/mindsdb/streamline-financial-analysis-with-mindsdbs-knowledge-bases-and-hybrid-search-22nb</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Chandre Van Der Westhuizen, Community &amp;amp; Marketing Co-ordinator at MindsDB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finance teams are expected to deliver real-time, audit-ready insights — but their data still lives across disconnected systems like Salesforce, ERP, and spreadsheets. This creates delays, compliance risks, and manual effort in reporting and reconciliation. MindsDB solves this by unifying all financial data in place — no ETL, no copying — through AI-native Knowledge Bases, Hybrid Search, and Agents. The result: a single, explainable layer for querying invoices, orders, opportunities, and financial reports in real time. Finance leaders can now ask natural questions, trace every answer to source data, and ensure decisions are both accurate and compliant — all from one secure, intelligent platform.&lt;/p&gt;

&lt;p&gt;Finance teams at publicly traded B2B companies juggle dozens of systems—from financial reports and CRM dashboards to invoices, shipments, and contracts.&lt;br&gt;
Each tells part of the story, but the real insight lies across them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does revenue growth relate to delayed shipments or unpaid invoices?&lt;/li&gt;
&lt;li&gt;Which large enterprise accounts contribute most to quarterly performance?&lt;/li&gt;
&lt;li&gt;What risks could impact the next earnings call?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Until now, answering these questions has meant ETL pipelines, manual dashboards, and delayed reporting. With MindsDB, you can build AI-native, zero-ETL analytics that query this data directly, powered by Knowledge Bases, Hybrid Search, and Agents.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;In many organizations, financial data lives across disconnected systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sales and deal data in Salesforce Opportunities&lt;/li&gt;
&lt;li&gt;Contract and delivery data in Orders and Shipments&lt;/li&gt;
&lt;li&gt;Billing and payments in Invoices&lt;/li&gt;
&lt;li&gt;Consolidated revenue in Financial Statements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This fragmentation creates revenue recognition risk, audit gaps, and delayed reporting, especially when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orders are fulfilled before contracts (POs) are signed.&lt;/li&gt;
&lt;li&gt;Invoices are partially paid or overdue.&lt;/li&gt;
&lt;li&gt;Revenue appears in financial reports without matching delivery evidence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional ETL workflows are slow and error-prone, which makes real-time reporting and continuous compliance checks hard to achieve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9hj69hhfx9ano889r7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9hj69hhfx9ano889r7o.png" alt="ETL" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The MindsDB Advantage: Real-Time, Explainable AI for Finance Teams
&lt;/h2&gt;

&lt;p&gt;MindsDB transforms how finance teams access, analyze, and trust their data — eliminating the need for manual aggregation, spreadsheets, and disconnected systems.&lt;/p&gt;
&lt;h3&gt;
  
  
  Unified, Zero-ETL Access to All Financial Data:
&lt;/h3&gt;

&lt;p&gt;MindsDB connects directly to your existing systems — Salesforce, ERP, CRM, spreadsheets, and financial databases — without the need for ETL. This means your Opportunities, Orders, Invoices, Shipments, and Financial Statements can be analyzed together, privately and securely, without moving data out of its source. All unified by MindsDB’s Knowledge Bases.&lt;/p&gt;
&lt;h3&gt;
  
  
  Ask Complex Questions in SQL or Plain English — Privately and Securely:
&lt;/h3&gt;

&lt;p&gt;Ask complex financial questions in SQL and natural language and get accurate, evidence-backed answers in seconds. Instead of manually combining spreadsheets or exporting CSVs, MindsDB’s Hybrid Search and AI Agents query and reason across your data — ensuring every response is grounded in real numbers and systems you already trust.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Result: Faster, Smarter, and More Trustworthy Financial Operations
&lt;/h3&gt;

&lt;p&gt;With MindsDB, finance teams gain a single, AI-native layer that unifies data access, reasoning, and compliance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run financial reporting and audit checks across systems&lt;/li&gt;
&lt;li&gt;Eliminate ETL and manual spreadsheet aggregation&lt;/li&gt;
&lt;li&gt;Achieve real-time, explainable insights grounded in verifiable data&lt;/li&gt;
&lt;li&gt;Empower teams to focus on decision-making, not data wrangling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, for finance teams operating within public B2B companies, MindsDB offers a data-native, AI-first approach: bringing intelligence to your data rather than moving your data to intelligence. The result is faster, more reliable, conversational analytics that tie together finance, CRM, logistics, and operations into one cohesive view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkxo555msvxv1kql29e7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkxo555msvxv1kql29e7.png" alt="MindsDB" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Use Case: Real-Time Revenue Validation Across Salesforce and Financial Systems
&lt;/h3&gt;

&lt;p&gt;MindsDB enables finance teams to query, analyze, and validate financial reporting accuracy by connecting operational Salesforce data (CRM, Orders, Invoices, Shipments) directly with official financial statements, without manual reconciliation or ETL.&lt;/p&gt;

&lt;p&gt;For this use case, we will explore gaining insights into Salesforce CRM Data and Financial Reports for the fiscal year 2025 from Q1-Q3. We will connect this data to MindsDB, unify it using MindsDB &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview?_gl=1*40tpc4*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Knowledge Bases&lt;/a&gt;, query it using &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/hybrid_search?_gl=1*1j8xsxt*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Hybrid Search&lt;/a&gt; and &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent?_gl=1*1j8xsxt*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Agents&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Access MindsDB’s GUI via Docker locally or MindsDB’s extension on Docker Desktop.&lt;/li&gt;
&lt;li&gt;Configure your default models in the MindsDB GUI by navigating to Settings → Models.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;MindsDB’s Federated query engine allows you to connect directly to &lt;a href="https://docs.mindsdb.com/integrations/app-integrations/salesforce?_gl=1*jx7qz3*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Salesforce&lt;/a&gt; using SQL. The &lt;a href="https://docs.mindsdb.com/mindsdb_sql/sql/create/database?_gl=1*jx7qz3*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;CREATE DATABASE&lt;/a&gt; statement will be used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;salesforce_datasource&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt;
   &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'salesforce'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="nv"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"chandre-bsbv@force.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"xxxx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="nv"&gt;"client_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"3MVG9SiMaxxxxxx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="nv"&gt;"client_secret"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"047CE0DB7AB8834FBxxxxxx"&lt;/span&gt;
   &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A database connection to &lt;a href="https://docs.mindsdb.com/integrations/vector-db-integrations/pgvector?_gl=1*jx7qz3*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;PGVector&lt;/a&gt; will be created to use as storage for the Knowledge Base embeddings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;heroku&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt;
    &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pgvector'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"127.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"distance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"cosine"&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
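
&lt;p&gt;Before building on these connections, it can help to confirm that they are registered and readable. This is a minimal sketch, assuming both &lt;code&gt;CREATE DATABASE&lt;/code&gt; statements above succeeded and that the Salesforce connection exposes the &lt;code&gt;orders&lt;/code&gt; table used later in this tutorial; the row limit is arbitrary.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- list connected data sources; salesforce_datasource and heroku should appear
SHOW DATABASES;

-- preview a few source rows to check that the Salesforce connection returns data
SELECT *
FROM salesforce_datasource.orders
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;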



&lt;p&gt;For the sake of this tutorial, we have uploaded the Financial Report 2025 dataset as a file. You can check out how to &lt;a href="https://docs.mindsdb.com/integrations/files/csv-xlsx-xls?_gl=1*lg4yml*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Upload Files in our GUI here.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Unifying Your Data By Building Knowledge Bases
&lt;/h3&gt;

&lt;p&gt;You can unify your data using MindsDB &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview?_gl=1*lg4yml*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Knowledge Bases&lt;/a&gt;. A Knowledge Base in MindsDB is an AI-powered table that understands data by meaning, not just keywords — combining embeddings, reranking models, and vector stores for context-aware retrieval.&lt;/p&gt;

&lt;p&gt;It enables semantic reasoning across multiple data sources, providing deeper, more accurate insights for intelligent data access. Here we will create Knowledge Bases for our Shipments, Invoices, Orders and Opportunities Salesforce CRM tables, as well as the Financial Report 2025 Spreadsheet.&lt;/p&gt;

&lt;p&gt;Start with creating the knowledge base using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/create?_gl=1*lg4yml*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;CREATE KNOWLEDGE_BASE&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
&lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;heroku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'account number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'activated by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'activated date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'company authorized by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'company authorized date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'contract end date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'contract number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'contract name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'contract end date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'billingaddress'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'created by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'currency'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'customer authorized by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'customer authorized date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'last modified by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'opportunity'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order amount'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order end date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order number'&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order record type'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'order reference number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order start date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order type'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'owner'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'PO date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'PO number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'quote'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'reduction order'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shipping city'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shipping country'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shipping street'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ship to contact'&lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'description'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'status'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'account name'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the parameters provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orders_kb: The name of the knowledge base. &lt;/li&gt;
&lt;li&gt;storage: The storage table where the knowledge base embeddings are stored. Here we use the pgvector connection created earlier and name the storage table orders.&lt;/li&gt;
&lt;li&gt;metadata_columns: The columns made available for metadata filtering.&lt;/li&gt;
&lt;li&gt;content_columns: The columns whose text is used for semantic search.&lt;/li&gt;
&lt;li&gt;id_column: This uniquely identifies each source data row in the knowledge base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can insert the data using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/insert_data?_gl=1*10o9ibs*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;INSERT INTO&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;`account number`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'account number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`activated by`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'activated by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`activated date`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'activated date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`company authorized by`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'company authorized by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`company authorized date`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'company authorized date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`contract end date`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'contract end date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`contract number`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'contract number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`contract name`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'contract name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`contract end date`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'contract end date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`created by`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'created by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`customer authorized by`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'customer authorized by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`customer authorized date`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'customer authorized date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`last modified by`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'last modified by'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opportunity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`order amount`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'order amount'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`order end date`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'order end date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`order name`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'order name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`order number`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'order number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`order record type`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'order record type'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`order reference number`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'order reference number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`order start date`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'order start date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`order type`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'order type'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`PO date`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'PO date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`PO number`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'PO number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`reduction order`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'reduction order'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`shipping city`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'shipping city'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`shipping country`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'shipping country'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;`shipping street`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'shipping street'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`ship to contact`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'ship to contact'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`account name`&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="s1"&gt;'account name'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;salesforce_datasource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can query the Knowledge Base using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/query?_gl=1*1o0yy3p*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;SELECT&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x9lfcf0ib68rzfyx2df.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x9lfcf0ib68rzfyx2df.png" alt="mindsdb" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;
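
&lt;p&gt;Selecting every row is a useful sanity check, but a Knowledge Base is normally searched by meaning. As a minimal sketch based on the Knowledge Base query syntax in the MindsDB docs, the query below filters on the &lt;code&gt;content&lt;/code&gt; column and keeps only reasonably relevant matches. The search phrase, the 0.5 relevance threshold, and the row limit are illustrative assumptions, not values from this walkthrough:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Semantic search sketch: the phrase, threshold, and limit are placeholders
SELECT id, chunk_content, relevance
FROM orders_kb
WHERE content = 'orders awaiting activation'
  AND relevance &amp;gt;= 0.5
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;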

&lt;p&gt;The same can be done for the Financial Report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create Knowledge Base&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;financial_report2025&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
&lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;heroku&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;financial_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'net_revenue_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'operating_income_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'net_income_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'operating_margin_percent'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'net_margin_percent'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="s1"&gt;'earnings_per_share'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'expenses_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'r_and_d_expenses_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sales_and_marketing_expenses_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'general_and_admin_expenses_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="s1"&gt;'cash_flow_operations_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'total_assets_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'total_liabilities_millions'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shareholders_equity_millions'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'month_name'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'quarter'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="c1"&gt;-- Insert Into Knowledge Base&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;financial_report2025&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;net_revenue_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operating_income_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;net_income_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operating_margin_percent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;net_margin_percent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;earnings_per_share&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expenses_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_and_d_expenses_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales_and_marketing_expenses_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;general_and_admin_expenses_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cash_flow_operations_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_assets_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_liabilities_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;shareholders_equity_millions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;month_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;financial_report_2025&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="c1"&gt;-- Select Knowledge Base&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;financial_report2025&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfoxpjo4ce906erk6xkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfoxpjo4ce906erk6xkm.png" alt="mindsdb" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following Knowledge Bases were also created using the above steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shipments_kb: contains shipment data hosted in Salesforce.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfrx1pyeanuxyh568t4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfrx1pyeanuxyh568t4l.png" alt="mindsdb" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;invoices_kb: contains invoice details hosted in Salesforce.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvx87r78yijwlu1rawmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvx87r78yijwlu1rawmu.png" alt="mindsdb" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;opportunities_kb: contains opportunities data hosted in Salesforce.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41t0tcbbm2s4b99wto8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41t0tcbbm2s4b99wto8k.png" alt="mindsdb" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;
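
&lt;p&gt;Before querying across them, it can help to confirm that all of the Knowledge Bases exist and are configured as expected. A small sketch, assuming the &lt;code&gt;SHOW KNOWLEDGE_BASES&lt;/code&gt; and &lt;code&gt;DESCRIBE KNOWLEDGE_BASE&lt;/code&gt; statements described in the MindsDB docs are available in your MindsDB version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List every Knowledge Base in the current project
SHOW KNOWLEDGE_BASES;

-- Inspect the configuration of a single Knowledge Base
DESCRIBE KNOWLEDGE_BASE invoices_kb;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;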

&lt;h3&gt;
  
  
  Hybrid Search for Finance: Bridging Operational Data and Financial Reporting with AI
&lt;/h3&gt;

&lt;p&gt;MindsDB’s &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/hybrid_search?_gl=1*u2iao3*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Hybrid Search&lt;/a&gt; lets you combine semantic and structured filters for deeper insights.&lt;/p&gt;

&lt;p&gt;Let's identify overdue invoices tied to key accounts that might affect cash flow or require write-offs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Detect Invoices with Long Outstanding Balances&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
 &lt;span class="n"&gt;hybrid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'payment delays or pending collection'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;`days outstanding`&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helps finance teams flag overdue receivables, especially important for SOX Section 404 compliance and aging report validation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbrwwp94ixv319wqjkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbrwwp94ixv319wqjkw.png" alt="mindsdb" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;
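
&lt;p&gt;According to the Hybrid Search docs, &lt;code&gt;hybrid_search_alpha&lt;/code&gt; ranges from 0 (more weight on keyword relevance) to 1 (more weight on semantic relevance) and defaults to 0.5. As a hedged sketch, the same invoice check can lean further toward semantic similarity by setting the value explicitly. The 0.8 used here is an illustrative assumption and should be tuned against your own data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Variation of the overdue-invoice query; alpha = 0.8 is an assumed, untuned value
SELECT *
FROM invoices_kb
WHERE
 hybrid_search = TRUE
 AND hybrid_search_alpha = 0.8
 AND content = 'payment delays or pending collection'
 AND `days outstanding` &amp;gt; 45;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;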

&lt;p&gt;Next, let's correlate orders and invoices for revenue reconciliation: ensure every billed order is represented by a recorded invoice, and verify consistency between &lt;code&gt;order_amount&lt;/code&gt; and &lt;code&gt;impact_amount&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Correlate Orders and Invoices for Revenue Reconciliation&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
 &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'order fulfillment completed'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'invoice posted and awaiting payment'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that customers like Acme Corp 12 and Finserve partners have activated, fulfilled orders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F199evj2iagmpdq3h00i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F199evj2iagmpdq3h00i1.png" alt="mindsdb" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyonwbu2mcsnneekaov7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyonwbu2mcsnneekaov7z.png" alt="mindsdb" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Their invoices are either partially paid or locked because of a payment method mismatch that requires action. This supports audit traceability from “Order → Invoice → Payment,” ensuring recognized revenue matches delivery obligations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff70nto6ppz2urtqr23q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff70nto6ppz2urtqr23q6.png" alt="mindsdb" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can validate whether forecasted opportunities have corresponding orders within the same quarter, which is key for forecast reliability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Join Opportunities and Orders for Forecast Accuracy&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
 &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'enterprise expansion opportunity'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active or pending fulfillment'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forecast_category&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Commit'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Best Case'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helps financial controllers confirm that pipeline forecasts are grounded in real order creation, improving rolling forecast accuracy and budget variance analysis.&lt;/p&gt;

&lt;p&gt;We can see that customers like Smart Corp and Acme Corp have an Enterprise-tier expansion opportunity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bh52vmp1iev4ogc18rx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bh52vmp1iev4ogc18rx.png" alt="mindsdb" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And they have active orders that are partially fulfilled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekduc762ofqxemi02mp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekduc762ofqxemi02mp4.png" alt="mindsdb" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can create a full end-to-end trace from CRM opportunity → booked order → issued invoice and identify accounts where the revenue chain is incomplete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Compliance Audit Trail Across All Three: Opportunities, Orders &amp;amp; Invoices&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;
 &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;
 &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
 &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hybrid_search_alpha&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;opportunities_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PO awaiting signature'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;orders_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'activated fulfilled, partially fulfilled'&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;invoices_kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'partially paid'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some orders have been only partially fulfilled. A control gap exists when fulfillment starts before contract or payment completion, creating risks of premature revenue recognition and misstated receivables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwnhkm2wfirs9l3hwibz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwnhkm2wfirs9l3hwibz.png" alt="mindsdb" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The order is also only partially invoiced. An overstated pipeline and delayed payments distort forecasts and increase collection risk, potentially requiring reserve adjustments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj8f3unwm2q6ziarams3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj8f3unwm2q6ziarams3.png" alt="mindsdb" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Obtain Insights From CRM Data and Financial Statements In Natural Language Using MindsDB Agents
&lt;/h3&gt;

&lt;p&gt;By creating a MindsDB &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent?_gl=1*mulju*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Agent&lt;/a&gt; connected to multiple Knowledge Bases, finance teams gain a single, intelligent interface to ask natural language questions about the entire revenue cycle.&lt;/p&gt;

&lt;p&gt;The agent can be created by using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent_syntax?_gl=1*4zrkaq*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;CREATE AGENT&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;AGENT&lt;/span&gt; &lt;span class="n"&gt;financial_reporting_agent&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nv"&gt;"knowledge_bases"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'shipments_kb'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'orders_kb'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'invoices_kb'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'opportunities_kb'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'financial_report2025'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="nv"&gt;"
You are a Financial Data Analysis Agent connected to Salesforce and finance datasets stored in MindsDB Knowledge Bases.


Your purpose is to help finance, compliance, and audit teams analyze and interpret data across:
- `shipments_kb` → shipment and logistics data
- `orders_kb` → Salesforce order and fulfillment data
- `invoices_kb` → Salesforce invoices and payment data
- `opportunities_kb` → Salesforce CRM pipeline and deal data
- `financial_report2025` → official financial statements for fiscal year 2025 (Q1 till Q3)


### Your Reasoning Context


1. **Date and Quarter Understanding**
  - All datasets use date formats in `YYYY-MM-DD` (e.g., 2025-09-29).
  - You can parse and interpret these dates to determine the correct month and quarter.
  - Map months to quarters as follows:
    - Q1 = January (01), February (02), March (03)
    - Q2 = April (04), May (05), June (06)
    - Q3 = July (07), August (08), September (09)
    - Q4 = October (10), November (11), December (12)
  - Fiscal Year 2025 = January 2025 to December 2025.
  - Understand temporal phrases such as:
    - “This quarter” → the current quarter in 2025 based on the date field.
    - “Last quarter” → the quarter immediately before the current quarter.
    - “Year-to-date (YTD)” → all months from January up to the current month in 2025.
    - “Previous month” → the month before the most recent date in the dataset.


2. **Revenue recognition context**
  - Revenue can only be recognized once a contract is enforceable (PO signed) and performance obligations are fulfilled (order delivered or service rendered).
  - Closed-won opportunities indicate potential revenue, but only fulfilled and invoiced orders should be recognized in financial statements.
  - Partially fulfilled or partially paid orders indicate deferred or unrecognized revenue.


3. **Compliance and audit understanding**
  - Identify mismatches between CRM, billing, and financial statements.
  - Highlight potential risks like premature revenue, unsigned contracts, or overdue invoices.


4. **Analytical objectives**
  - Reconcile Salesforce opportunities, orders, and invoices with financial statements.
  - Summarize revenue performance, outstanding balances, and fulfillment progress per quarter.
  - Detect control exceptions, data gaps, and compliance risks.
  - Respond to natural language queries about contracts, invoices, orders, payments, shipments, or quarterly financial performance.
  - When users ask about a specific quarter or month, filter data by the date fields.
    - Example: For Q3 2025, include all records from 2025-07-01 to 2025-09-30.
  - When comparing across time, use these date ranges to calculate quarterly or monthly aggregates.
  - If users ask “Compare Q2 and Q3 2025 revenue,” calculate totals based on these date filters.


### Output Style
  - Clearly reference time periods (month, quarter, or fiscal year) when summarizing data.
  - Provide both narrative and tabular summaries if multiple time periods are compared.
  - Always ground insights in date-based logic derived from the dataset fields.


Always ground responses in data retrieved from the provided Knowledge Bases and interpret time periods accurately based on quarterly definitions.
"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the breakdown of the parameters provided to the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;financial_reporting_agent: the name given to the agent.&lt;/li&gt;
&lt;li&gt;data: this parameter stores the data connected to the agent, including knowledge bases and data sources connected to MindsDB.&lt;/li&gt;
&lt;li&gt;knowledge_bases: stores the list of knowledge bases to be used by the agent.&lt;/li&gt;
&lt;li&gt;prompt_template: this parameter stores instructions for the agent. It is recommended to describe the data sources listed in the knowledge_bases parameter to help the agent locate relevant data for answering questions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MindsDB’s GUI offers a &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent_gui?_gl=1*1kz3wo1*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1LjExNzUwMDAzNDAuMTc2NzAzNjUyNi4xNzY3MDM2NzIw" rel="noopener noreferrer"&gt;Chat Interface&lt;/a&gt; that allows you to chat with your agent and query your data using natural language. To start, make sure the correct agent is selected in the Agent’s tab. &lt;/p&gt;
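
&lt;p&gt;The agent can also be queried directly from the SQL editor. The sketch below follows the agent query pattern from the MindsDB docs (selecting &lt;code&gt;answer&lt;/code&gt; and passing the prompt through a &lt;code&gt;question&lt;/code&gt; filter). Treat it as an illustration, since the exact output depends on your data and the model behind the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Ask the agent a question in SQL instead of the chat interface
SELECT answer
FROM financial_reporting_agent
WHERE question = 'Which accounts have partially paid invoices in Q3 2025?';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;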

&lt;p&gt;Let's ask the agent multi-layered, compliance-aware finance questions in plain English that blend operational and financial data.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Revenue Recognition &amp;amp; Contract Compliance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question 1:&lt;/strong&gt; What revenue in Q3 2025 could be overstated due to uncollected or unsigned contracts?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9m0yonebk3rs1n6i9an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9m0yonebk3rs1n6i9an.png" alt="mindsdb" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Forecasting, Cash Flow &amp;amp; Risk Analysis
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Question 1:&lt;/strong&gt; Forecast expected cash inflow from all partially paid invoices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flh9cpsh0z8m7ggl2ugd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flh9cpsh0z8m7ggl2ugd5.png" alt="mindsdb" width="800" height="633"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 2:&lt;/strong&gt; Which accounts pose the highest risk to quarterly revenue due to incomplete payments or unsigned POs?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9f394sf6jcn5y3k9pyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9f394sf6jcn5y3k9pyd.png" alt="mindsdb" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Insight / Executive Summary Prompts
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Question 1:&lt;/strong&gt; Summarize the overall health of our revenue recognition process for FY2025.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwtsndd8je1sym04oudu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwtsndd8je1sym04oudu.png" alt="mindsdb" width="800" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent further provides key observations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feestygemlvu28gu6h5c4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feestygemlvu28gu6h5c4.png" alt="mindsdb" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 2:&lt;/strong&gt; Provide a summary of Q3 2025 revenue risks and outstanding invoices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3nlf51uxbjqfcrpcsyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3nlf51uxbjqfcrpcsyc.png" alt="mindsdb" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the Summary Table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5qef2tpvtd067s4pou6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5qef2tpvtd067s4pou6.png" alt="mindsdb" width="800" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Financial Statement Reconciliation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Question 1:&lt;/strong&gt; Reconcile the total ‘Closed Won’ opportunity amounts with the net revenue reported in Q2 2025.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvijtm3s2tufy5ulq9lz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvijtm3s2tufy5ulq9lz.png" alt="mindsdb" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 2:&lt;/strong&gt; Which business units have the most incomplete revenue cycles?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhiwmzvvgpd1n8qwxkrnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhiwmzvvgpd1n8qwxkrnw.png" alt="mindsdb" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 3:&lt;/strong&gt; How do the balances of unpaid invoices compare with the receivables in the financial report for Q3 2025?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxc03aw3qd90kcszb75r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxc03aw3qd90kcszb75r.png" alt="mindsdb" width="800" height="282"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jvcg98864cl06avhipm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jvcg98864cl06avhipm.png" alt="mindsdb" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Further analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvv7w2nniki915cbtwq11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvv7w2nniki915cbtwq11.png" alt="mindsdb" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 4:&lt;/strong&gt; List discrepancies between revenue recognized in the financial report and billed amounts in invoices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7i7pxm930kt3ocd6asp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7i7pxm930kt3ocd6asp.png" alt="mindsdb" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22y55204kzihfq0vefrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22y55204kzihfq0vefrx.png" alt="mindsdb" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not only did the agent provide an analysis of discrepancies, it also identified a possible data gap or delay in invoicing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Outcome: Audit-Ready Intelligence in Real Time
&lt;/h2&gt;

&lt;p&gt;With MindsDB, finance teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Perform financial reporting, reconciliation, and variance analysis&lt;/li&gt;
&lt;li&gt;Gain end-to-end visibility across CRM → Orders → Invoices → Reports&lt;/li&gt;
&lt;li&gt;Detect compliance gaps before audits&lt;/li&gt;
&lt;li&gt;Deliver trustworthy, explainable insights without ETL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MindsDB turns enterprise finance data into a live, conversational, and compliant analytics layer — where every answer is backed by data, traceable, and ready for the boardroom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;Finance leaders are under growing pressure to deliver accurate, real-time insights while staying compliant with ASC 606, IFRS 15, and SOX requirements. Yet, fragmented systems, manual reconciliations, and spreadsheet-driven workflows slow everything down — increasing audit risk and delaying decisions.&lt;/p&gt;

&lt;p&gt;MindsDB changes that.&lt;/p&gt;

&lt;p&gt;By unifying financial data from Salesforce, ERP, and reporting systems into a zero-ETL, AI-native layer, MindsDB allows teams to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validate revenue recognition before audits catch discrepancies.&lt;/li&gt;
&lt;li&gt;Trace every financial insight to source data — improving audit readiness.&lt;/li&gt;
&lt;li&gt;Detect control gaps early, such as uncollected invoices or unsigned contracts.&lt;/li&gt;
&lt;li&gt;Deliver faster board reporting with transparent, explainable analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, this matters because finance no longer has to choose between speed, accuracy, and compliance. MindsDB gives teams all three — in real time, with full transparency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql283gy1ekpx1ar6zxln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql283gy1ekpx1ar6zxln.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Financial data moves faster than traditional systems can handle. MindsDB gives finance teams the power to stay ahead by unifying fragmented data into a single, intelligent layer for real-time analysis. By connecting Salesforce, ERP, and reporting systems without ETL, MindsDB turns complexity into clarity, helping teams detect risks, validate revenue, and build confidence in every number they report.&lt;/p&gt;

&lt;p&gt;MindsDB offers the Minds Enterprise solution, which makes enterprise data intelligent and responsive with AI. Powered by the Cognitive Engine, Knowledge Base, and Federated Query Engine, it transforms raw data into actionable insights. Build AI Search, Analytics, and Agents from one solution. &lt;a href="https://mindsdb.com/contact" rel="noopener noreferrer"&gt;Contact our team to see Minds in action.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whether it’s preparing for the next audit, presenting to the board, or ensuring compliance, MindsDB equips finance teams with real-time, explainable, and trustworthy insights — all backed by their own data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the future of financial intelligence — with MindsDB.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sql</category>
      <category>agents</category>
    </item>
    <item>
      <title>Fast‑Track Knowledge Bases: How to Build Semantic AI Search by Andriy Burkov</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Mon, 10 Nov 2025 11:33:43 +0000</pubDate>
      <link>https://dev.to/mindsdb/fast-track-knowledge-bases-how-to-build-semantic-ai-search-by-andriy-burkov-13pm</link>
      <guid>https://dev.to/mindsdb/fast-track-knowledge-bases-how-to-build-semantic-ai-search-by-andriy-burkov-13pm</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Andriy Burkov, Ph.D. &amp;amp; Author, MindsDB Advisor&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Knowledge Base with MindsDB
&lt;/h2&gt;

&lt;p&gt;Traditional keyword-based search falls short when users don’t know the exact terms in your data or when they ask questions in natural language. Imagine trying to find “movies about an orphaned boy wizard” when the database only contains the word “magic” – a standard SQL query would miss the connection.&lt;/p&gt;
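
&lt;p&gt;To make the limitation concrete, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The table, its two rows, and the helper &lt;code&gt;keyword_search&lt;/code&gt; are made up for illustration and are not part of the tutorial's dataset. A &lt;code&gt;LIKE&lt;/code&gt; filter only finds the literal term it is given.&lt;/p&gt;

```python
import sqlite3

# Toy in-memory table; both rows are invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [
        ("Movie A", "A young boy discovers he has a gift for magic."),
        ("Movie B", "Two detectives chase a jewel thief across Paris."),
    ],
)


def keyword_search(term):
    """Return titles whose description contains the literal `term`."""
    rows = conn.execute(
        "SELECT title FROM movies WHERE description LIKE ?", (f"%{term}%",)
    ).fetchall()
    return [title for (title,) in rows]


print(keyword_search("wizard"))  # [] because no description contains "wizard"
print(keyword_search("magic"))   # ['Movie A']
```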

&lt;p&gt;This is where knowledge bases with semantic search shine. By understanding the meaning behind queries rather than just matching keywords, they enable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Natural language queries:&lt;/strong&gt; Users can ask questions the way they naturally think (“Show me heartwarming family movies with elements of comedy”) instead of constructing complex keyword searches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual understanding:&lt;/strong&gt; Finding related content even when exact terms don’t match – searching for “artificial intelligence gone wrong” can surface movies about “rogue robots” or “sentient computers”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata-aware filtering:&lt;/strong&gt; Combine semantic understanding with structured filters (genre, ratings, dates) for precise, relevant results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tutorial walks through creating a semantic search knowledge base using MindsDB, an open-source platform that brings machine learning capabilities directly to your data layer. MindsDB simplifies the integration of AI models with databases, making it easy to add semantic search, predictions, and other AI features without complex infrastructure.&lt;/p&gt;

&lt;p&gt;We’ll use the IMDB Movies Dataset to learn how to upload data to MindsDB, create a knowledge base with embedding models, and perform both semantic and metadata-filtered searches. By the end, you’ll have a working system that can answer questions like “What movie has a boy defending his home on Christmas?”.&lt;/p&gt;

&lt;p&gt;To follow along with the tutorial, &lt;a href="https://43906340.fs1.hubspotusercontent-na1.net/hubfs/43906340/Webinar%20Slides/Fast-Track-Knowledge-Bases/MindsDB%20Knowledge%20Base%20Slides%20and%20Tutorial.zip" rel="noopener noreferrer"&gt;download the Jupyter Notebook with the code and materials here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Introduction to Knowledge Bases in MindsDB
&lt;/h2&gt;

&lt;p&gt;Knowledge bases in MindsDB provide advanced semantic search capabilities, allowing you to find information based on meaning rather than just keywords. They use embedding models to convert text into vector representations and store them in vector databases for efficient similarity searches.&lt;/p&gt;
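
&lt;p&gt;The idea of a similarity search over vectors can be sketched in a few lines of plain Python. This is an illustration only, not how MindsDB implements it: the three-dimensional vectors below are hand-made stand-ins for real embeddings, and cosine similarity is one common choice of similarity measure.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings"; a real embedding model would produce
# vectors with many more dimensions.
documents = {
    "a boy learns magic at a hidden school": [0.9, 0.1, 0.0],
    "detectives chase a jewel thief": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for the query "orphaned boy wizard"

# Rank documents from most to least similar to the query vector.
ranked = sorted(
    documents,
    key=lambda text: cosine_similarity(query_vector, documents[text]),
    reverse=True,
)
print(ranked[0])
```

&lt;p&gt;The first document ranks highest even though it shares no keyword with the query, because its vector points in nearly the same direction as the query vector.&lt;/p&gt;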

&lt;p&gt;Let’s begin by setting up our environment and understanding the components of a MindsDB knowledge base.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;mindsdb&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="n"&gt;yaspin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the packages are installed, you will see output similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;llvmlite&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.44&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;dev0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;burkov&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;3.10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;packages &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;numba&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;hierarchicalforecast&lt;/span&gt;&lt;span class="o"&gt;~=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mindsdb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span 
class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.44&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;et&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;xmlfile&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;burkov&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;3.10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;packages &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openpyxl&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mindsdb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;psycopg&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;3.2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;burkov&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;3.10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;packages &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;psycopg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mindsdb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;3.2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mpmath&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;1.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Users&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;burkov&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;3.10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;packages &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sympy&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;onnxruntime&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.14&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="o"&gt;~=&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span 
class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mindsdb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;



&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;notice&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;25.1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;25.2&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;notice&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;To&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Dataset Selection and Download
&lt;/h2&gt;

&lt;p&gt;We’ll use the IMDB Movies Dataset from Hugging Face (a popular platform for sharing ML datasets and models). It contains movie information from IMDB, a large online movie and TV database, including descriptions, genres, ratings, and other metadata, which makes it well suited for demonstrating both semantic search and metadata filtering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Download IMDB Movies Dataset from Hugging Face
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Load the dataset
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloading IMDB Movies dataset...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jquigl/imdb-genres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Preview the dataset
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this cell prints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Downloading&lt;/span&gt; &lt;span class="n"&gt;IMDB&lt;/span&gt; &lt;span class="n"&gt;Movies&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;Dataset&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;238256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53ikl4udx4skbxt60wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy53ikl4udx4skbxt60wv.png" alt="MindsDB dataset" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the dataset contains 238,256 movies with descriptive text spanning multiple decades and genres, though some entries have missing ratings (NaN values) that will need to be addressed during data preparation.&lt;/p&gt;

&lt;p&gt;Let’s prepare our dataset for MindsDB by cleaning it up and making sure we have a unique ID column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Clean up the data and ensure we have a unique ID
# The 'movie title - year' column can serve as a unique identifier
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movie title - year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movie_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expanded-genres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expanded_genres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Clean movie IDs to remove problematic characters
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_movie_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;movie_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown_movie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;movie_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\"\!\?\(\)\[\]\/\\*]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown_movie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Apply the cleaning function to movie_id column
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movie_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movie_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_movie_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Remove duplicates based on cleaned movie_id, keeping the first occurrence
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original dataset size: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movie_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After removing duplicates: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Make sure there are no NaN values
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movie_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown_movie&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;genre&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expanded_genres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Save the prepared dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imdb_movies_prepared.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset prepared and saved to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imdb_movies_prepared.csv&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should receive this output once done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Original&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;238256&lt;/span&gt;
&lt;span class="n"&gt;After&lt;/span&gt; &lt;span class="n"&gt;removing&lt;/span&gt; &lt;span class="n"&gt;duplicates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;161765&lt;/span&gt;
&lt;span class="n"&gt;Dataset&lt;/span&gt; &lt;span class="n"&gt;prepared&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;saved&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imdb_movies_prepared.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcpr3k3dg6lzmi2pfsmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcpr3k3dg6lzmi2pfsmb.png" alt="MindsDB data" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What we’ve accomplished:
&lt;/h4&gt;

&lt;p&gt;We’ve cleaned and prepared our dataset for MindsDB by standardizing column names, sanitizing movie IDs to remove problematic characters (quotes, special symbols, etc.), and handling missing values. Most importantly, we’ve removed duplicate entries - reducing the dataset from 238,256 to 161,765 unique movies. This deduplication is crucial because knowledge bases require unique identifiers for each entry. The cleaned data is now saved as imdb_movies_prepared.csv with properly formatted movie IDs, filled NaN values (rating defaults to 0.0), and consistent column names ready for upload to MindsDB.&lt;/p&gt;
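&lt;p&gt;One reason the deduplication step runs after the ID cleaning rather than before it is that stripping characters can make two different raw titles collapse into the same ID. The sketch below illustrates this with a simplified cleaner that applies only the punctuation-stripping and whitespace rules from the tutorial’s function. The name &lt;code&gt;simple_clean&lt;/code&gt; and the sample titles are made up for illustration and are not taken from the dataset.&lt;/p&gt;

```python
import re

def simple_clean(movie_id):
    """Simplified, hypothetical variant of clean_movie_id (punctuation + whitespace only)."""
    cleaned = re.sub(r"['\"!?()\[\]/\\*]", "", str(movie_id))
    return re.sub(r"\s+", " ", cleaned).strip()

# Made-up raw IDs: the first two differ only by punctuation
raw_ids = ["What's Up? - 1999", "Whats Up - 1999", "(Untitled) - 2010"]

cleaned_ids = [simple_clean(t) for t in raw_ids]

# dict.fromkeys keeps the first occurrence of each ID, like keep='first'
unique_ids = list(dict.fromkeys(cleaned_ids))

print(cleaned_ids)
print(unique_ids)
```

&lt;p&gt;Deduplicating on the raw titles would have kept both of the first two rows, and they would then have clashed as identical IDs after cleaning.&lt;/p&gt;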

&lt;p&gt;With our dataset cleaned and prepared, we’re ready to connect to MindsDB and upload the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Uploading the Dataset to MindsDB
&lt;/h2&gt;

&lt;p&gt;Now let’s connect to our local MindsDB instance running in a Docker container and upload the dataset. If you don’t have the MindsDB Docker container installed, follow the official installation tutorial first. We’ll establish a connection to the MindsDB server on localhost, verify it by listing the available databases, and then upload our prepared CSV file to MindsDB’s built-in files database, where it will be accessible for creating the knowledge base.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to the MindsDB server
# For local Docker installation, use the default URL
&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:47334&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connected to MindsDB server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List available databases to confirm connection
&lt;/span&gt;&lt;span class="n"&gt;databases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;databases&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available databases:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;databases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your console you will see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Connected&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;MindsDB&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;
&lt;span class="n"&gt;Available&lt;/span&gt; &lt;span class="n"&gt;databases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;movies_kb_chromadb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What we’ve accomplished:
&lt;/h4&gt;

&lt;p&gt;We’ve successfully connected to our local MindsDB instance and listed the available databases. Notice the files database - this is a special built-in database in MindsDB specifically designed for uploading and storing datasets (CSV, JSON, Excel files, etc.). We need to use this files database because it acts as a staging area for our data before we can reference it in knowledge base operations.&lt;/p&gt;

&lt;p&gt;Now let’s upload our prepared CSV to the files database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;

&lt;span class="c1"&gt;# connect
&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mindsdb_sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:47334&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# load (or generate) the DataFrame
&lt;/span&gt;&lt;span class="n"&gt;csv_path&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imdb_movies_prepared.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_movies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# upload to the built‑in  `files`  database
&lt;/span&gt;&lt;span class="n"&gt;files_db&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# &amp;lt;- must be this name
&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;movies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# delete the whole file‑table if it's there
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;files_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dropped &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="n"&gt;files_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_movies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created table files.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT movie_id, genre, rating FROM files.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; LIMIT 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT count(movie_id) FROM files.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; where rating &amp;gt;= 7.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will receive this output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dropped&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;
&lt;span class="n"&gt;Created&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;movies&lt;/span&gt;
                  &lt;span class="n"&gt;movie_id&lt;/span&gt;      &lt;span class="n"&gt;genre&lt;/span&gt;  &lt;span class="n"&gt;rating&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="n"&gt;Flaming&lt;/span&gt; &lt;span class="n"&gt;Ears&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1992&lt;/span&gt;    &lt;span class="n"&gt;Fantasy&lt;/span&gt;     &lt;span class="mf"&gt;6.0&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="n"&gt;Jeg&lt;/span&gt; &lt;span class="n"&gt;elsker&lt;/span&gt; &lt;span class="n"&gt;dig&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1957&lt;/span&gt;    &lt;span class="n"&gt;Romance&lt;/span&gt;     &lt;span class="mf"&gt;5.8&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;        &lt;span class="n"&gt;Povjerenje&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt;   &lt;span class="n"&gt;Thriller&lt;/span&gt;     &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="n"&gt;Gulliver&lt;/span&gt; &lt;span class="n"&gt;Returns&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt;    &lt;span class="n"&gt;Fantasy&lt;/span&gt;     &lt;span class="mf"&gt;4.4&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="n"&gt;Prithvi&lt;/span&gt; &lt;span class="n"&gt;Vallabh&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1924&lt;/span&gt;  &lt;span class="n"&gt;Biography&lt;/span&gt;     &lt;span class="mf"&gt;0.0&lt;/span&gt;
   &lt;span class="n"&gt;count_0&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="mi"&gt;10152&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What we’ve accomplished:
&lt;/h4&gt;

&lt;p&gt;We’ve successfully uploaded our prepared dataset to MindsDB’s files database as a table named movies. The code first drops any existing movies table (ensuring a clean slate for re-runs), then creates a new table from our DataFrame. The sample query confirms our data is accessible - we can see the first 5 movies with their genres and ratings. The count query reveals we have 10,152 movies with ratings of 7.5 or higher.&lt;/p&gt;
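&lt;p&gt;One side effect of the earlier &lt;code&gt;fillna&lt;/code&gt; step is worth spelling out: movies with a missing rating now carry 0.0, so any &lt;code&gt;rating &amp;gt;= 7.5&lt;/code&gt; filter leaves them out. The minimal sketch below shows that behaviour in plain Python. The rows and the helper name &lt;code&gt;fill_missing_rating&lt;/code&gt; are hypothetical and only mimic the shape of the prepared data.&lt;/p&gt;

```python
# Hypothetical rows; titles and ratings are made up for illustration
rows = [
    {"movie_id": "Example A - 2001", "rating": 8.1},
    {"movie_id": "Example B - 1999", "rating": None},
    {"movie_id": "Example C - 2015", "rating": 7.5},
    {"movie_id": "Example D - 1980", "rating": 6.9},
]

def fill_missing_rating(row, default=0.0):
    """Mirror the fillna step: a missing rating becomes the default value."""
    rating = row["rating"] if row["rating"] is not None else default
    return {**row, "rating": rating}

prepared = [fill_missing_rating(r) for r in rows]

# Same threshold as the count query above
high_rated = [r["movie_id"] for r in prepared if r["rating"] >= 7.5]
unrated = [r["movie_id"] for r in prepared if r["rating"] == 0.0]

print(high_rated)
print(unrated)
```

&lt;p&gt;If unrated movies should be searchable too, they would need a different filter or a separate default, since 0.0 is indistinguishable from a genuinely low score.&lt;/p&gt;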

&lt;h2&gt;
  
  
  4. Creating a Knowledge Base
&lt;/h2&gt;

&lt;p&gt;Now, let’s create a knowledge base using our IMDB movies data. To enable semantic search, we need to convert our movie descriptions from plain text into numerical vector representations (embeddings) that capture their semantic meaning. This is where embedding models come in - they transform text into high-dimensional vectors where semantically similar content is positioned closer together in vector space. For example, “a boy wizard learning magic” and “young sorcerer at school” would produce similar vectors even though they share no common words.&lt;/p&gt;
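&lt;p&gt;To make “closer together in vector space” concrete, here is a minimal sketch that uses cosine similarity, one common way to compare embeddings. The three-dimensional vectors are invented stand-ins: real embedding models return vectors with hundreds or thousands of dimensions, and this sketch makes no claim about which metric MindsDB uses internally.&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (values near 1.0 mean similar direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for real embeddings
boy_wizard = [0.9, 0.8, 0.1]      # "a boy wizard learning magic"
young_sorcerer = [0.8, 0.9, 0.2]  # "young sorcerer at school"
car_chase = [0.1, 0.2, 0.9]       # an unrelated description

sim_related = cosine_similarity(boy_wizard, young_sorcerer)
sim_unrelated = cosine_similarity(boy_wizard, car_chase)

print(round(sim_related, 3), round(sim_unrelated, 3))
```

&lt;p&gt;The two wizard-themed vectors score much higher against each other than against the unrelated one, which is the property semantic search relies on when ranking results.&lt;/p&gt;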

&lt;p&gt;We’ll use OpenAI’s &lt;code&gt;text-embedding-3-large&lt;/code&gt; model for this task. OpenAI’s embedding models are widely used for capturing nuanced semantic relationships, well documented, and supported directly by MindsDB. Open-source alternatives exist, but OpenAI offers a good balance of performance, reliability, and ease of use for production applications.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The code below assumes the OpenAI API key has been set as an environment variable in the MindsDB UI settings. To set it up, open the settings at &lt;a href="http://localhost:47334/" rel="noopener noreferrer"&gt;http://localhost:47334/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Alternatively, you can set it manually when starting the container: $ docker run --name mindsdb_container -e OPENAI_API_KEY='your_key_here' -p 47334:47334 -p 47335:47335 mindsdb/mindsdb&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you don’t have an OpenAI API key, you should create one by following &lt;a href="https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key" rel="noopener noreferrer"&gt;these steps.&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -- drop the KB if it exists ----------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP KNOWLEDGE_BASE IF EXISTS movies_kb;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Knowledge Base creation using mindsdb_sdk
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# This assumes the OpenAI key was set as a envronment variable in tee MindsDB UI settings
&lt;/span&gt;    &lt;span class="c1"&gt;# Go to the setting to set it up http://localhost:47334/
&lt;/span&gt;    &lt;span class="c1"&gt;# Alterntively, you can set it up manaually when starting the container:
&lt;/span&gt;    &lt;span class="c1"&gt;# $ docker run --name mindsdb_container \
&lt;/span&gt;    &lt;span class="c1"&gt;# -e OPENAI_API_KEY='your_key_here' -p 47334:47334 -p 47335:47335
&lt;/span&gt;    &lt;span class="n"&gt;kb_creation_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE KNOWLEDGE_BASE movies_kb
    USING
        embedding_model = {{
           &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
           &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
        }},
        metadata_columns = [&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;genre&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expanded_genres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;],
        content_columns = [&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;],
        id_column = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movie_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;kb_creation_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created knowledge base &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movies_kb&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Knowledge base creation error or already exists: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should receive the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Created&lt;/span&gt; &lt;span class="n"&gt;knowledge&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;movies_kb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s insert our movie data into this knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;yaspin&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaspin&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;yaspin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inserting data into knowledge base...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;insert_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            INSERT INTO movies_kb
            SELECT movie_id,
                   genre,
                   expanded_genres,
                   rating,
                   content
            FROM   files.movies
            WHERE rating &amp;gt;= 7.5
            USING
                track_column = movie_id
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Data inserted successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Insert error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should receive the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;✅&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="n"&gt;inserted&lt;/span&gt; &lt;span class="n"&gt;successfully&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What’s happening here:
&lt;/h4&gt;

&lt;p&gt;This is where the magic happens: as we insert data into the knowledge base, MindsDB automatically generates embeddings for each movie’s content using the OpenAI model we specified earlier. We filter for movies rated 7.5 or higher to focus on high-quality films. The &lt;code&gt;track_column = movie_id&lt;/code&gt; parameter tells MindsDB to use &lt;code&gt;movie_id&lt;/code&gt; as the unique identifier for tracking and updating entries.&lt;/p&gt;

&lt;p&gt;This operation may take a few minutes because it makes API calls to OpenAI to generate embeddings for thousands of movie descriptions.&lt;/p&gt;

&lt;p&gt;We verify the upload by counting the entries in our knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;row_count_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT COUNT(*) AS cnt
    FROM   (SELECT id FROM movies_kb) AS t;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;row_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_count_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cnt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ movies_kb now contains &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row_count&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your output should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;✅&lt;/span&gt; &lt;span class="n"&gt;movies_kb&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;152&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 10,152 rows confirm that the highly rated movies (rating ≥ 7.5) have been embedded and stored. Our knowledge base is now ready for semantic search queries!&lt;/p&gt;

&lt;p&gt;Let’s run a first semantic search to see some data in the knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;search_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM movies_kb WHERE content=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Christmas&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; ORDER BY relevance DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgudv35a39urwf6oj9jsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgudv35a39urwf6oj9jsw.png" alt="MindsDB Select" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s inspect what a cell in the metadata column contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Query to get full metadata content
&lt;/span&gt;&lt;span class="n"&gt;metadata_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT id, metadata 
    FROM movies_kb 
    WHERE content=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Christmas&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; 
    ORDER BY relevance DESC 
    LIMIT 5
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Display full metadata without truncation
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display.max_colwidth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or print metadata for each row to see the complete JSON structure
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Detailed metadata for top results:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metadata_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcae6o8rb04c7oyqo8wq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcae6o8rb04c7oyqo8wq9.png" alt="MindsDB Select" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will receive the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Detailed&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="n"&gt;Pixi&lt;/span&gt; &lt;span class="n"&gt;Post&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Gift&lt;/span&gt; &lt;span class="n"&gt;Bringers&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2016&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_content_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_end_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;51&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pixi Post and the Gift Bringers - 2016&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_row_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3912&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TextChunkingPreprocessor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_start_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-10-14 18:17:11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expanded_genres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Animation, Adventure, Fantasy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;genre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fantasy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;7.9&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Jingle&lt;/span&gt; &lt;span class="n"&gt;Vingle&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Movie&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_content_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_end_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;161&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jingle Vingle the Movie - 2022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_row_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9700&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TextChunkingPreprocessor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_start_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-10-14 18:17:11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expanded_genres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Family&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;genre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Family&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.7&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;For&lt;/span&gt; &lt;span class="n"&gt;Unto&lt;/span&gt; &lt;span class="n"&gt;Us&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_content_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_end_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;96&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;For Unto Us - 2021&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_row_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4370&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TextChunkingPreprocessor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_start_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-10-14 18:17:11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expanded_genres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Family&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;genre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Family&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Joyeux&lt;/span&gt; &lt;span class="n"&gt;Noel&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2005&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_content_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_end_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;174&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Joyeux Noel - 2005&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_row_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5021&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TextChunkingPreprocessor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_start_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span 
class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-10-14 18:17:11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expanded_genres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Drama, History, Music&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;genre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Romance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;7.7&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Christmas&lt;/span&gt; &lt;span class="n"&gt;Snow&lt;/span&gt; &lt;span class="n"&gt;Angels&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2011&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_content_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_end_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;160&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Understanding the metadata structure:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Pixi&lt;/span&gt; &lt;span class="n"&gt;Post&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Gift&lt;/span&gt; &lt;span class="n"&gt;Bringers&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2016&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_content_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_end_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;51&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pixi Post and the Gift Bringers - 2016&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_original_row_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3912&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TextChunkingPreprocessor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_start_char&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-10-14 18:17:11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expanded_genres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Animation, Adventure, Fantasy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;genre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fantasy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;7.9&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A chunk is a segment of text created when MindsDB breaks down longer content into smaller, searchable pieces to fit within the embedding model’s input limits. This chunking ensures searches can pinpoint specific passages within larger documents rather than only matching entire documents.&lt;/p&gt;
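
&lt;p&gt;To make the character offsets concrete, here is a minimal Python sketch of chunking with position metadata. It assumes fixed-size character windows with no overlap, which is a simplification for illustration and not necessarily how MindsDB’s TextChunkingPreprocessor splits text. The function name &lt;code&gt;chunk_text&lt;/code&gt; and its parameters are hypothetical; only the metadata key names are borrowed from the knowledge base output.&lt;/p&gt;

```python
def chunk_text(text, chunk_size=50, doc_id="doc-1"):
    """Split text into fixed-size character chunks and record their offsets.

    Illustrative only: fixed-size windows without overlap are an assumption,
    not MindsDB's actual chunking strategy.
    """
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "content": text[start:end],
            "metadata": {
                "_chunk_index": index,        # position of the chunk in the document
                "_start_char": start,         # offset where the chunk begins
                "_end_char": end,             # offset where the chunk ends
                "_original_doc_id": doc_id,   # which document the chunk came from
            },
        })
    return chunks


# A 120-character placeholder text split into windows of 50 characters.
chunks = chunk_text("A" * 120, chunk_size=50, doc_id="example")
for chunk in chunks:
    print(chunk["metadata"])
```

&lt;p&gt;A text shorter than the window, like the 51-character description in the sample output, would produce a single chunk with index 0 under this scheme.&lt;/p&gt;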

&lt;p&gt;The metadata field contains two types of information:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. System-generated fields&lt;/strong&gt; (prefixed with underscores) that MindsDB automatically adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;_chunk_index: The sequential position of this chunk (0 means it’s the first/only chunk)&lt;/li&gt;
&lt;li&gt;_content_column: Which source column contained the text (“content” in our case)&lt;/li&gt;
&lt;li&gt;_start_char and _end_char: Character positions showing where this chunk begins and ends in the original text (0 to 51 means a 51-character description)&lt;/li&gt;
&lt;li&gt;_original_doc_id: The identifier of the source document this chunk came from (here it matches the entry’s id)&lt;/li&gt;
&lt;li&gt;_original_row_index: The row number from the original dataset (row 3912)&lt;/li&gt;
&lt;li&gt;_source: The preprocessor used for chunking (“TextChunkingPreprocessor”)&lt;/li&gt;
&lt;li&gt;_updated_at: Timestamp of when this entry was inserted or updated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. User-defined metadata columns&lt;/strong&gt; that we specified during knowledge base creation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;genre: “Fantasy” - the primary genre we defined as metadata&lt;/li&gt;
&lt;li&gt;expanded_genres: “Animation, Adventure, Fantasy” - the full genre list&lt;/li&gt;
&lt;li&gt;rating: 7.9 - the movie’s IMDb rating&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This combined metadata enables hybrid search: you can run semantic searches on content while filtering on structured metadata fields such as genre or rating. For example, you could search for “Christmas adventure stories” and keep only results in the Animation genre with ratings above 7.5.&lt;/p&gt;
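
&lt;p&gt;The idea can be sketched in plain Python, independent of MindsDB: filter on metadata first, then rank the remaining entries by vector similarity. The entries, their three-dimensional vectors, and the &lt;code&gt;hybrid_search&lt;/code&gt; helper are all made up for illustration; real embeddings have far more dimensions, and in this tutorial the knowledge base does this work for us.&lt;/p&gt;

```python
import math
import operator


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy entries: ids, vectors, and metadata values are invented for this sketch.
entries = [
    {"id": "movie-a", "vector": [0.9, 0.1, 0.0],
     "metadata": {"genre": "Animation", "rating": 8.1}},
    {"id": "movie-b", "vector": [0.8, 0.2, 0.1],
     "metadata": {"genre": "Drama", "rating": 9.0}},
    {"id": "movie-c", "vector": [0.1, 0.9, 0.2],
     "metadata": {"genre": "Animation", "rating": 7.9}},
]


def hybrid_search(query_vector, genre, min_rating, top_k=5):
    # Step 1: structured filter on metadata (genre match, rating above threshold).
    candidates = [
        entry for entry in entries
        if entry["metadata"]["genre"] == genre
        and operator.gt(entry["metadata"]["rating"], min_rating)
    ]
    # Step 2: semantic ranking of what is left, most similar first.
    ranked = sorted(
        candidates,
        key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
        reverse=True,
    )
    return [entry["id"] for entry in ranked[:top_k]]


results = hybrid_search([1.0, 0.0, 0.0], genre="Animation", min_rating=7.5)
print(results)
```

&lt;p&gt;Filtering before ranking keeps the similarity computation limited to entries that can actually appear in the result.&lt;/p&gt;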

&lt;h2&gt;
  
  
  5. Performing Semantic Searches with RAG
&lt;/h2&gt;

&lt;p&gt;Now that our knowledge base is populated and indexed, let’s implement a complete Retrieval-Augmented Generation (RAG) workflow. RAG combines semantic search with large language models to answer questions based on your specific data - in our case, movie descriptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is RAG?
&lt;/h3&gt;

&lt;p&gt;RAG is a technique that enhances LLM responses by grounding them in retrieved, relevant documents from your knowledge base. Instead of relying solely on the model’s training data, RAG retrieves the most relevant chunks from your knowledge base and uses them as context for generating answers. This makes responses far more likely to be factually accurate and based on your actual data.&lt;/p&gt;
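&lt;p&gt;To make the retrieval step more concrete, here is a toy sketch of ranking chunks by cosine similarity between embedding vectors. It is not MindsDB’s implementation: the three-dimensional vectors, the chunk names, and the &lt;code&gt;rank_chunks&lt;/code&gt; helper are all made up for illustration, and the relevance score a knowledge base reports may be computed differently.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors, top_k=2):
    """Return the top_k (chunk_id, score) pairs, highest score first."""
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional "embeddings"; real embedding models use far more dimensions.
toy_chunks = {
    "home_alone": [0.9, 0.1, 0.2],
    "star_wars_iii": [0.1, 0.9, 0.3],
    "frozen": [0.6, 0.2, 0.7],
}
toy_query = [0.8, 0.0, 0.3]

top = rank_chunks(toy_query, toy_chunks)
print(top)
```

&lt;p&gt;The chunks whose vectors point in nearly the same direction as the query vector come first, which is how a search can surface a movie the question never names.&lt;/p&gt;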

&lt;h3&gt;
  
  
  The RAG workflow:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai_api_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer_question_with_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Use the existing search_kb function to get the most relevant chunks.
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Searching knowledge base for: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;relevant_chunks_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_kb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found the following relevant chunks:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant_chunks_df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Concatenate the 'chunk_content' to form a single context string.
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant_chunks_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Create the prompt for the gpt-4o model.
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a movie expert assistant. Based *only* on the following movie summaries (context),
    answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question. If the context doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t contain the answer,
    state that you cannot answer based on the provided information.

    CONTEXT:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    QUESTION:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Call the OpenAI API to get the answer.
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Sending request to GPT-4o to generate a definitive answer...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that answers questions about movies using only the provided context.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;  &lt;span class="c1"&gt;# We want a factual answer based on the text
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred while calling the OpenAI API: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;user_question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Who a boy must defend his home against on Christmas eve?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;final_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;answer_question_with_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- Generated Answer ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should receive the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Searching&lt;/span&gt; &lt;span class="n"&gt;knowledge&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Who a boy must defend his home against on Christmas eve?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


    &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
    &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies_kb&lt;/span&gt;
    &lt;span class="n"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Who a boy must defend his home against on Christmas eve?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
     &lt;span class="n"&gt;ORDER&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;relevance&lt;/span&gt; &lt;span class="n"&gt;DESC&lt;/span&gt; &lt;span class="n"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;Found&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;following&lt;/span&gt; &lt;span class="n"&gt;relevant&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpmpzjaj6v7i3ohgldlv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpmpzjaj6v7i3ohgldlv.png" alt=" " width="800" height="393"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;

&lt;span class="n"&gt;Sending&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;GPT&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;definitive&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="n"&gt;Generated&lt;/span&gt; &lt;span class="n"&gt;Answer&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;
&lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;boy&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;defend&lt;/span&gt; &lt;span class="n"&gt;his&lt;/span&gt; &lt;span class="n"&gt;home&lt;/span&gt; &lt;span class="n"&gt;against&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;burglars&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;Christmas&lt;/span&gt; &lt;span class="n"&gt;Eve&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What happened here:
&lt;/h4&gt;

&lt;p&gt;This function implements a complete RAG pipeline in four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Search:&lt;/strong&gt; The knowledge base is queried with the natural language question. MindsDB’s semantic search retrieves the top 100 most relevant movie chunks, ranked by relevance score. Notice how the search found “Home Alone” (relevance: 0.687) even though our question didn’t mention the movie title - semantic search understood the meaning of “boy defending home on Christmas.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Assembly:&lt;/strong&gt; All retrieved chunk contents are concatenated into a single context string, separated by dividers. This context now contains relevant information from multiple movies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; We construct a carefully crafted prompt that instructs GPT-4o to act as a movie expert and answer only based on the provided context. This grounding reduces the chance that the model will hallucinate or use information outside our knowledge base.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Generation:&lt;/strong&gt; The OpenAI API processes the prompt with temperature=0.0, which makes the output as repeatable as possible and discourages the model from straying from the context, and generates an answer by synthesizing information from the retrieved chunks.&lt;/li&gt;
&lt;/ol&gt;
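&lt;p&gt;The function above passes all 100 retrieved chunks to the model. One possible refinement of the context assembly step, sketched below, is to keep only chunks above a relevance threshold and cap how many are joined. &lt;code&gt;build_context&lt;/code&gt;, its default values, and the sample rows are assumptions for illustration and are not part of the original notebook.&lt;/p&gt;

```python
def build_context(chunks, min_relevance=0.5, max_chunks=10, separator="\n---\n"):
    """Join the most relevant chunk texts into a single context string.

    `chunks` is a list of dicts with 'chunk_content' and 'relevance' keys.
    """
    # Drop weak matches, then order the rest from most to least relevant.
    kept = [c for c in chunks if c["relevance"] >= min_relevance]
    kept.sort(key=lambda c: c["relevance"], reverse=True)
    return separator.join(c["chunk_content"] for c in kept[:max_chunks])


# Made-up rows standing in for knowledge base search results.
sample_chunks = [
    {"id": "a", "chunk_content": "A boy defends his home from burglars.", "relevance": 0.69},
    {"id": "b", "chunk_content": "A snowman dreams of summer.", "relevance": 0.41},
    {"id": "c", "chunk_content": "A family gets snowed in at Christmas.", "relevance": 0.58},
]

context = build_context(sample_chunks)
print(context)
```

&lt;p&gt;With a pandas DataFrame such as &lt;code&gt;relevant_chunks_df&lt;/code&gt;, the rows could be passed in as &lt;code&gt;relevant_chunks_df.to_dict("records")&lt;/code&gt;. A shorter context costs fewer tokens, but a threshold that is too strict can drop the chunk that holds the answer, so the cut-off is worth tuning on your own data.&lt;/p&gt;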

&lt;h3&gt;
  
  
  The power of RAG:
&lt;/h3&gt;

&lt;p&gt;The final answer - “The boy must defend his home against a pair of burglars on Christmas Eve” - demonstrates RAG’s strength. The LLM successfully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identified the most relevant movie (Home Alone) from the semantic search results&lt;/li&gt;
&lt;li&gt;Extracted the key information about burglars from the movie description&lt;/li&gt;
&lt;li&gt;Synthesized a clear, concise answer grounded in our actual data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This RAG approach keeps answers grounded in your knowledge base rather than the model’s general training data in nearly all cases, which makes it a good fit for domain-specific applications like customer support, internal documentation systems, or specialized research assistants.&lt;/p&gt;

&lt;p&gt;Let’s wrap up this tutorial with one more query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What Anakin was lured into by Chancellor Palpatine?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;final_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;answer_question_with_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- Generated Answer ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Searching&lt;/span&gt; &lt;span class="n"&gt;knowledge&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What Anakin was lured into by Chancellor Palpatine?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
    &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;movies_kb&lt;/span&gt;
    &lt;span class="n"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What Anakin was lured into by Chancellor Palpatine?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
     &lt;span class="n"&gt;ORDER&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;relevance&lt;/span&gt; &lt;span class="n"&gt;DESC&lt;/span&gt; &lt;span class="n"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;Found&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;following&lt;/span&gt; &lt;span class="n"&gt;relevant&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxipxf8kqy8t7sl472ijx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxipxf8kqy8t7sl472ijx.png" alt="MindsDB" width="800" height="393"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;


&lt;span class="n"&gt;Sending&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;GPT&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;definitive&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="o"&gt;---&lt;/span&gt; &lt;span class="n"&gt;Generated&lt;/span&gt; &lt;span class="n"&gt;Answer&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;
&lt;span class="n"&gt;Anakin&lt;/span&gt; &lt;span class="n"&gt;was&lt;/span&gt; &lt;span class="n"&gt;lured&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Chancellor&lt;/span&gt; &lt;span class="n"&gt;Palpatine&lt;/span&gt; &lt;span class="n"&gt;into&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;sinister&lt;/span&gt; &lt;span class="n"&gt;plot&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;galaxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Congratulations! You’ve successfully built a semantic search knowledge base with MindsDB that can understand and answer natural language questions about movies. Let’s recap what we’ve accomplished:&lt;/p&gt;

&lt;h3&gt;
  
  
  What you’ve built:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A knowledge base containing 10,152 high-quality movies with semantic embeddings&lt;/li&gt;
&lt;li&gt;A complete RAG (Retrieval-Augmented Generation) pipeline that combines semantic search with LLM-powered question answering&lt;/li&gt;
&lt;li&gt;A system that understands meaning, not just keywords - finding “Home Alone” when asked about “a boy defending his home on Christmas” and “Star Wars Episode III” when queried about “Anakin and Chancellor Palpatine”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key takeaways:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic search transcends keywords:&lt;/strong&gt; MindsDB’s knowledge bases use embeddings to understand the meaning behind queries, enabling more intuitive and natural search experiences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search combines the best of both worlds:&lt;/strong&gt; By integrating semantic understanding with metadata filtering (genre, ratings, etc.), you can create powerful, precise queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG grounds AI in your data:&lt;/strong&gt; Instead of relying on potentially outdated or hallucinated information, RAG bases answers on your actual knowledge base. LLMs can still hallucinate, but grounding them in your own factual data greatly reduces how often they do&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where to go from here:
&lt;/h3&gt;

&lt;p&gt;This tutorial demonstrated the fundamentals, but knowledge bases can power much more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build chatbots&lt;/strong&gt; that answer questions about your company’s documentation, policies, or products&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create recommendation systems&lt;/strong&gt; that understand user preferences semantically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Develop research assistants&lt;/strong&gt; for academic papers, legal documents, or technical manuals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to production&lt;/strong&gt; by connecting MindsDB to your existing databases, APIs, or data warehouses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftif41tzy8r04fpm4f2gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftif41tzy8r04fpm4f2gi.png" alt="MindsDB" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The beauty of MindsDB is that you control it entirely through SQL, a familiar interface that makes advanced AI capabilities accessible without complex infrastructure, new APIs, or domain-specific programming languages. Whether you’re working with customer support tickets, research papers, code repositories, or movie databases, the same principles apply.&lt;/p&gt;

&lt;p&gt;Now it’s your turn to build something amazing with your own data!&lt;/p&gt;

&lt;p&gt;You can also check out the full webinar, &lt;a href="https://mindsdb.com/events/webinar-fast-track-knowledge-bases" rel="noopener noreferrer"&gt;Fast‑Track Knowledge Bases: How to Build Semantic AI Search&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Future of Data-Native &amp; Grounded AI with MindsDB: Moving Data to Models No Longer Makes Sense</title>
      <dc:creator>MindsDB Team</dc:creator>
      <pubDate>Mon, 20 Oct 2025 09:49:55 +0000</pubDate>
      <link>https://dev.to/mindsdb/the-future-of-data-native-grounded-ai-with-mindsdb-moving-data-to-models-no-longer-makes-sense-2o0b</link>
      <guid>https://dev.to/mindsdb/the-future-of-data-native-grounded-ai-with-mindsdb-moving-data-to-models-no-longer-makes-sense-2o0b</guid>
      <description>&lt;p&gt;&lt;em&gt;Written by Chandre Van Der Westhuizen, Community &amp;amp; Marketing Co-ordinator at MindsDB&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What do complex ETL and slow decision-making have in common? Without a proper AI implementation, both result in delayed access to critical insights, leading to missed opportunities, reduced agility, and lost revenue.&lt;/p&gt;

&lt;p&gt;The default strategy for AI and analytics has been to move data to the model—centralizing it in warehouses, cleaning it through complex ETL pipelines, and then running models in isolation. But in an era where data is distributed, compliance is strict, and real-time insights are a necessity, this old approach is breaking down. &lt;/p&gt;

&lt;p&gt;Enter data-native AI: a new paradigm where models go to the data instead. Instead of duplicating, transforming, and moving data across systems, data-native AI enables models to interact with information directly where it lives.&lt;/p&gt;

&lt;p&gt;As AI has become vital for decision-making, a fundamental question keeps coming up: can we trust the answers our AI gives us?&lt;/p&gt;

&lt;p&gt;The answer hinges on a critical principle: grounding. &lt;/p&gt;

&lt;p&gt;Grounded AI refers to models and agents that base their outputs on real, verifiable, and up-to-date data. Without this foundation, even the most advanced language models can hallucinate, mislead, or provide contextually irrelevant responses. And in high-stakes environments like finance, retail, or enterprise operations, that’s simply unacceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case Against Moving Data
&lt;/h2&gt;

&lt;p&gt;Moving data might seem like a necessary step in modern data workflows—but it’s often more harmful than helpful. Before you spin up another ETL pipeline or copy data into yet another system, consider the real costs. Here’s why relocating data can do more damage than good:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency and Freshness:&lt;/strong&gt; By the time data is moved, cleaned, and transformed, it’s often stale. Real-time decision-making suffers as a result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost and Complexity:&lt;/strong&gt; ETL pipelines, data lakes, and duplicated storage come with steep infrastructure and maintenance costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and Compliance:&lt;/strong&gt; Copying data increases the risk of exposure and makes it harder to enforce data governance policies. With regulations like GDPR and HIPAA, moving sensitive data unnecessarily is a liability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Siloed Context:&lt;/strong&gt; Context is often lost when data is stripped from its original environment and schema. This weakens model accuracy and relevance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Trust Gap in Traditional AI
&lt;/h2&gt;

&lt;p&gt;Large language models are powerful, but they weren’t designed to know your business. They don’t understand the nuances of your financial reports, CRM notes, or policy documentation—unless you give them that context.&lt;/p&gt;

&lt;p&gt;Most traditional AI workflows rely on static snapshots of data, embeddings, or manually curated inputs. This introduces a lag between what your AI "knows" and what’s actually happening. The result?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outdated answers&lt;/li&gt;
&lt;li&gt;Missing critical changes&lt;/li&gt;
&lt;li&gt;Reduced trust in AI recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7slsw5r07jms8cv6p0h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7slsw5r07jms8cv6p0h9.png" alt="Complex ETL Pipeline" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Data-Native AI?
&lt;/h2&gt;

&lt;p&gt;Data-native AI flips the model by allowing AI to interact with data directly at its source—whether that's a SQL database, an API, a document repository, or an enterprise SaaS system. Instead of forcing data into a model's format, the model adapts to the native environment of the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Federated query engines&lt;/strong&gt; that enable real-time data access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-based architectures&lt;/strong&gt; that query across systems&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security boundaries respected by default&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced need for data replication or transformation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Real-Time Access Matters
&lt;/h2&gt;

&lt;p&gt;Real-time data access ensures that AI decisions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Contextual&lt;/strong&gt;: Based on the latest state of your business&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verifiable&lt;/strong&gt;: Easy to trace back to the exact source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accurate&lt;/strong&gt;: Reflecting current customer behavior, market changes, or system states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In regulated industries, it also means being able to show auditors where the AI got its answers and why they were reasonable at that time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzum33cdpshpi83p1s7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzum33cdpshpi83p1s7n.png" alt="Data Native" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How MindsDB Delivers Grounded AI and the Data-Native AI Advantage
&lt;/h2&gt;

&lt;p&gt;MindsDB is built from the ground up to support a data-native approach. It enables grounded AI by allowing agents and LLMs to query your live structured and unstructured data sources, from SQL and NoSQL databases to PDFs, APIs, and SaaS tools. It doesn’t stop there: you can also &lt;a href="https://docs.mindsdb.com/mindsdb-unify?_gl=1*7qh5k*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;unify&lt;/a&gt; your data with MindsDB’s &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overview?_gl=1*7qh5k*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;Knowledge Bases&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With MindsDB, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seamlessly query multiple live data sources in real time&lt;/li&gt;
&lt;li&gt;Ensure LLM outputs are grounded in current, explainable information&lt;/li&gt;
&lt;li&gt;Eliminate the overhead of ETL processes and avoid redundant data storage&lt;/li&gt;
&lt;li&gt;Uphold security and compliance by keeping data within trusted environments&lt;/li&gt;
&lt;li&gt;Enable federated, real-time access to both structured and unstructured data&lt;/li&gt;
&lt;li&gt;Leverage configurable agent tools to define accessible data and operational constraints&lt;/li&gt;
&lt;li&gt;Deliver transparent, auditable outputs with citations, source documents, and metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of relying on a static knowledge base, MindsDB-powered agents interact directly with the source of truth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ehio1ebuapbt3bvp0ay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ehio1ebuapbt3bvp0ay.png" alt="MindsDB Advantage" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case: Solving the Problem for E-commerce Stores with Siloed Data
&lt;/h2&gt;

&lt;p&gt;Let’s take the use case of an e-commerce store with siloed data. The goal is to unify the online store’s customer, sales, order, and product data to gain insights that inform business decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Moving, transforming, and duplicating data for AI creates high infrastructure and maintenance costs. Traditional pipelines rely on periodic data syncs, leading to stale insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: MindsDB enables in-place querying, eliminating the need for complex ETL processes and redundant storage. Knowledge Bases provide real-time access to structured and unstructured data—ensuring decisions are based on the latest available information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Access MindsDB’s GUI via &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker?_gl=1*cc6b5s*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; locally or &lt;a href="https://docs.mindsdb.com/setup/self-hosted/docker-desktop?_gl=1*cc6b5s*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;MindsDB’s extension&lt;/a&gt; on Docker Desktop.&lt;/li&gt;
&lt;li&gt;Configure your default models in the MindsDB GUI by navigating to &lt;strong&gt;Settings → Models.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add your data to MindsDB by &lt;a href="https://docs.mindsdb.com/integrations/data-overview?_gl=1*cc6b5s*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;creating a database connection&lt;/a&gt; or &lt;a href="https://docs.mindsdb.com/integrations/files/csv-xlsx-xls?_gl=1*18y57mq*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;uploading your files.&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, connect the store’s Postgres database, where the data is stored in separate tables, to MindsDB using the &lt;a href="https://docs.mindsdb.com/integrations/data-integrations/postgresql?_gl=1*18y57mq*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;Postgres Integration&lt;/a&gt; and the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/sql/create/database?_gl=1*18y57mq*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;CREATE DATABASE&lt;/a&gt; statement. This gives you real-time access to your data from MindsDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;postgresql_conn&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'postgres'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"demo_user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"demo_password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"samples.mindsdb.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"5432"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"demo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"sample_data"&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
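
&lt;p&gt;To confirm the connection works, you can preview a few rows from one of the connected tables. This is a minimal sketch that assumes the &lt;code&gt;websales_sales&lt;/code&gt; table used later in this walkthrough is visible through &lt;code&gt;postgresql_conn&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Preview five rows to verify that MindsDB can read the source table in place
SELECT *
FROM postgresql_conn.websales_sales
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;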



&lt;p&gt;Next, create a connection between &lt;a href="https://docs.mindsdb.com/integrations/vector-db-integrations/pgvector?_gl=1*k4pbvx*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;PGVector&lt;/a&gt; and MindsDB. It will be used as storage for our Knowledge Base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;pvec&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt;
    &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pgvector'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"127.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"distance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"cosine"&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s create a Knowledge Base using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/create?_gl=1*k4pbvx*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;CREATE KNOWLEDGE_BASE&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE&lt;/span&gt; &lt;span class="n"&gt;sales_kb&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
  &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pvec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;metadata_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'product_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'order_date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ship_date'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;content_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sub_category'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'product_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ship_mode'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'customer_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'segment'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'country'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'city'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'region'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sales'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'discount'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'profit'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a breakdown of the parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sales_kb: The name of the Knowledge Base.&lt;/li&gt;
&lt;li&gt;storage: The storage table where the embeddings of the Knowledge Base are stored.&lt;/li&gt;
&lt;li&gt;metadata_columns: The columns stored as metadata, which can be used for metadata filtering.&lt;/li&gt;
&lt;li&gt;content_columns: The columns whose content is used for semantic search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The id_column has not been provided, so it will be generated from the hash of the content columns.&lt;/p&gt;

&lt;p&gt;Now that the Knowledge Base is created, we can insert data into it. The goal is to unify multiple tables in one Knowledge Base. To do so, we will join the tables in the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/insert_data?_gl=1*1inzx5b*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;INSERT INTO&lt;/a&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sales_kb&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ship_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sub_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ship_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;discount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;profit&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;postgresql_conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;websales_sales&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;postgresql_conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;websales_customers&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;postgresql_conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;websales_orders&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;RIGHT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;postgresql_conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;websales_products&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
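
&lt;p&gt;Before adding an agent, you can check the inserted data by querying the Knowledge Base directly. The sketch below assumes the semantic search syntax described in the MindsDB Knowledge Base documentation, where a condition on &lt;code&gt;content&lt;/code&gt; runs a semantic search. The search phrase is only an illustrative value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Semantic search over the content columns; the phrase is a hypothetical example
SELECT *
FROM sales_kb
WHERE content = 'office furniture sold to corporate customers'
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A condition on one of the metadata columns, such as &lt;code&gt;order_date&lt;/code&gt;, can typically be added to the &lt;code&gt;WHERE&lt;/code&gt; clause to narrow the results.&lt;/p&gt;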



&lt;p&gt;Next, create a &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent?_gl=1*11zqkmg*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;MindsDB Agent&lt;/a&gt; on top of the Knowledge Base using the &lt;a href="https://docs.mindsdb.com/mindsdb_sql/agents/agent_syntax?_gl=1*11zqkmg*_gcl_au*MjAwNzUyNzYzNy4xNzYwNzI3NDQ1" rel="noopener noreferrer"&gt;CREATE AGENT&lt;/a&gt; statement, so that you can query it to gain insights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;AGENT&lt;/span&gt; &lt;span class="n"&gt;sales_agent&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt;
   &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nv"&gt;"knowledge_bases"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"mindsdb.sales_kb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="p"&gt;},&lt;/span&gt;
   &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'
       mindsdb.sales_kb stores data about sales, products sold via e-commerce, customer information and orders placed in the online store'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a breakdown of the parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sales_agent: The name provided to the agent&lt;/li&gt;
&lt;li&gt;data: This parameter holds the data linked to the agent, including knowledge bases and data sources integrated with MindsDB.

&lt;ul&gt;
&lt;li&gt;knowledge_bases: Stores the list of knowledge bases&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;prompt_template: This parameter stores instructions for the agent and a description of the data. It is recommended to describe the data sources listed in the &lt;code&gt;knowledge_bases&lt;/code&gt; and &lt;code&gt;tables&lt;/code&gt; parameters to help the agent locate relevant data for answering questions.&lt;/li&gt;

&lt;/ul&gt;
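
&lt;p&gt;The agent above only uses &lt;code&gt;knowledge_bases&lt;/code&gt;. As a hedged sketch of how the &lt;code&gt;tables&lt;/code&gt; parameter mentioned above could be combined with it, the variant below also gives the agent access to one source table. The agent name &lt;code&gt;sales_agent_with_tables&lt;/code&gt; and the choice of table are assumptions made for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical variant: an agent with both a Knowledge Base and a source table
CREATE AGENT sales_agent_with_tables
USING
   data = {
        "knowledge_bases": ["mindsdb.sales_kb"],
        "tables": ["postgresql_conn.websales_sales"]
   },
   prompt_template = '
       mindsdb.sales_kb stores data about sales, products, customers and orders from the online store.
       postgresql_conn.websales_sales stores the raw sales records of the online store';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Describing each listed source in the &lt;code&gt;prompt_template&lt;/code&gt; follows the recommendation above and helps the agent decide where to look.&lt;/p&gt;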

&lt;p&gt;The agent is ready to be queried, and you can gain insights into shipping, customers, products, sales, and profit.&lt;/p&gt;

&lt;p&gt;Let’s start by asking the agent how many customers belong to a specific segment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_agent&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'How many customers belong to a specific segment?'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows the size and value of each customer group, helping businesses tailor strategies, allocate resources, and measure growth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu616gopo3bbzz9edjhpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu616gopo3bbzz9edjhpa.png" alt="MindsDB Agent" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;
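
&lt;p&gt;Because the source tables stay queryable in place, an agent answer like this can be cross-checked with plain SQL. The sketch below assumes that the &lt;code&gt;segment&lt;/code&gt; column lives in the &lt;code&gt;websales_customers&lt;/code&gt; table, which the original schema does not confirm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Count distinct customers per segment directly in the source database
-- Assumption: segment is a column of websales_customers
SELECT segment, COUNT(DISTINCT customer_id) AS customer_count
FROM postgresql_conn.websales_customers
GROUP BY segment;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;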

&lt;p&gt;You can ask which orders drive the most profit, enabling sales teams and executives to focus on replicating high-value transactions and refining pricing or strategy for maximum impact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_agent&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Can you show me the top 5 most profitable transactions'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7us5zmiza372ox4skmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7us5zmiza372ox4skmy.png" alt="MindsDB Agent" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;
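
&lt;p&gt;Ranking questions depend on exact ordering, so it can be worth comparing the agent’s answer with a direct query. This is a sketch that assumes &lt;code&gt;profit&lt;/code&gt; is stored in the &lt;code&gt;websales_sales&lt;/code&gt; table alongside the order and product keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Top 5 transactions by profit, read straight from the source table
-- Assumption: profit is a column of websales_sales
SELECT order_id, product_id, profit
FROM postgresql_conn.websales_sales
ORDER BY profit DESC
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;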

&lt;p&gt;You can query the agent to reveal which products deliver the greatest profitability, guiding smarter decisions on pricing, promotions, and resource prioritization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_agent&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Which products have the highest profit margins?'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dmf5rouu1psbhl7om6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dmf5rouu1psbhl7om6q.png" alt="MindsDB Agent" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent can also help track logistics efficiency, uncover customer preferences, and manage shipping costs for better supply chain decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_agent&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'How many orders were shipped using a specific shipping mode'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw42a0gje704qre3wkdal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw42a0gje704qre3wkdal.png" alt="MindsDB Agent" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s ask about sales trends to highlight overall performance patterns, helping businesses understand growth, seasonality, and shifts in customer demand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_agent&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'What are the sales trends over the past year?'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15fpziz44efkef08nsm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15fpziz44efkef08nsm8.png" alt="MindsDB Agent" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s ask how discounts affect results to reveal the trade-off between higher sales volume and reduced profit margins, guiding smarter discounting and pricing strategies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_agent&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'How does the discount offered affect the sales and profit?'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F526pgubz4i62uiiapmmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F526pgubz4i62uiiapmmp.png" alt="MindsDB Agent" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Impact
&lt;/h2&gt;

&lt;p&gt;In industries like finance, energy, retail, and enterprise software, MindsDB unlocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster insights&lt;/strong&gt;: Real-time portfolio analysis or fraud detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smarter automation:&lt;/strong&gt; AI that reacts to operational data without lag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More trust&lt;/strong&gt;: Decisions backed by up-to-date, in-context data&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support&lt;/strong&gt;: Agents reference current policy documents and CRM history to give personalized, accurate answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial modeling:&lt;/strong&gt; AI uses real-time transaction data and risk metrics to suggest actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Agents flag violations based on up-to-date regulatory documents and behavior logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;It used to make sense to move data to models—back when computing was scarce and data was centralized. But today, data is everywhere and speed matters. The future belongs to AI systems that go to the data, enabling real-time access without the delays of duplication or transfer. To stay competitive, AI must adapt to the data—not the other way around.&lt;/p&gt;

&lt;p&gt;Data-native and grounded AI is faster, more secure, and better suited for the dynamic, distributed environments of modern enterprises.&lt;/p&gt;

&lt;p&gt;With MindsDB, that future is already here. &lt;a href="https://mindsdb.com/contact" rel="noopener noreferrer"&gt;Contact our team&lt;/a&gt; to see our solution in action.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>datascience</category>
      <category>ai</category>
      <category>development</category>
    </item>
  </channel>
</rss>
