DEV Community

MindsDB Team for MindsDB

Posted on • Originally published at mindsdb.com

Blend Hybrid Retrieval with Structured Data using MindsDB Knowledge Bases

Written by Andriy Burkov, Ph.D. & Author, MindsDB Advisor

This tutorial is a follow-up to our previous tutorial, where we took the first steps in creating and using MindsDB Knowledge Bases. In this follow-up project, we will walk through building a semantic search knowledge base using the famous Enron Emails Dataset. While in the previous tutorial we simply used an existing dataset, in this one we'll preprocess the original dataset by extracting structured attributes (also known as metadata) from it using Named Entity Recognition (NER). We will then create a knowledge base and perform both semantic and metadata-filtered searches.

Before we get our hands dirty, let's refresh some basics. Download the webinar code and materials here to follow along with the tutorial.


1. Introduction to Knowledge Bases in MindsDB

Knowledge Bases (KB) in MindsDB provide advanced semantic search capabilities, allowing you to find information based on meaning rather than just keywords. They use embedding models to convert text into vector representations and store them in vector databases for efficient similarity searches.

In addition to searching for knowledge nuggets using semantic similarity (soft search criteria), MindsDB KBs allow the user to combine soft search criteria with hard ones called "metadata," which can be thought of as regular relational database table columns.

In this tutorial, we assume that the user has a free open-source MindsDB instance running in their local environment. Please follow these steps to set it up.

To demonstrate both soft and hard searches in a MindsDB KB, we'll use the Enron Corpus - one of the largest publicly available collections of corporate emails, containing over 500,000 emails from Enron executives during the years leading up to the company's collapse in 2001. This dataset is particularly interesting because it contains real business communications, including scandal-related content, making it perfect for demonstrating knowledge base search capabilities.

Named Entity Recognition is the technique we'll use to automatically extract those structured attributes, such as people, organizations, dates, and locations, from the raw email text. These extracted entities will become the metadata columns in our knowledge base, allowing us not only to search semantically by meaning, but also to filter results using precise, structured criteria like sender, company, or time period.
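To make the idea concrete, here is a minimal sketch of how lists of extracted entities become flat metadata columns. The `entities` dict and the `entities_to_metadata` helper are invented for illustration; the actual extraction with spaCy happens later in the tutorial.

```python
def entities_to_metadata(entities):
    """Flatten lists of extracted entities into comma-joined string columns."""
    return {key: ", ".join(sorted(set(values))) for key, values in entities.items()}

# Hypothetical NER output for a single email
entities = {
    "persons": ["Kenneth Lay", "Jeff Skilling", "Kenneth Lay"],
    "organizations": ["Enron", "Arthur Andersen"],
    "dates": ["December 2001"],
}

print(entities_to_metadata(entities))
```

Each resulting string can be stored as a regular column next to the email text, which is exactly what makes hard filtering possible later.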

2. Setting Things Up

2.1 Dependencies Installation

First, let's install the dependencies and set up NER. We will use spaCy for this, since its pretrained models can automatically extract entities like people, organizations, dates, and locations from the raw email text. Those extracted entities will then be transformed into structured metadata columns, which we'll store alongside the email content and later use to power rich, metadata-aware queries in our MindsDB knowledge base.

!pip install mindsdb mindsdb_sdk pandas requests datasets yaspin spacy

# Download spaCy English model for Named Entity Recognition
!python -m spacy download en_core_web_sm

print("✅ Dependencies installed successfully!")

2.2 Dataset Selection and Download

We'll download the Enron emails dataset from Hugging Face: a large collection of real-world corporate messages from the Enron corpus, paired with their original subject lines and cleaned body text. Each entry includes the email's metadata (such as sender, recipients, and timestamp) along with the full message content, organized into standard train/validation/test splits so the dataset can be used for tasks like summarization, classification, or downstream NLP experiments.

For our tutorial purposes, we will only use the train split of the dataset.

# Download Enron Emails Dataset from Hugging Face
from datasets import load_dataset
import pandas as pd
import re
from datetime import datetime
import json

# Load the Enron dataset (536k emails)
print("Downloading Enron emails dataset...")
dataset = load_dataset("snoop2head/enron_aeslc_emails", split="train")
df = pd.DataFrame(dataset)

print(f"Dataset shape: {df.shape}")
print("Dataset columns:", df.columns.tolist())

def parse_email_text(email_text):
    """Parse raw email text to extract subject, body, and metadata"""
    if pd.isna(email_text) or email_text == '':
        return {'subject': '', 'body': '', 'from': '', 'to': '', 'date': ''}

    email_text = str(email_text)

    # Initialize result dictionary
    parsed = {'subject': '', 'body': '', 'from': '', 'to': '', 'date': ''}

    # Extract subject
    subject_match = re.search(r'Subject:\s*(.*?)(?:\n|$)', email_text, re.IGNORECASE)
    if subject_match:
        parsed['subject'] = subject_match.group(1).strip()

    # Extract from
    from_match = re.search(r'From:\s*(.*?)(?:\n|$)', email_text, re.IGNORECASE)
    if from_match:
        parsed['from'] = from_match.group(1).strip()

    # Extract to
    to_match = re.search(r'To:\s*(.*?)(?:\n|$)', email_text, re.IGNORECASE)
    if to_match:
        parsed['to'] = to_match.group(1).strip()

    # Extract date
    date_match = re.search(r'Date:\s*(.*?)(?:\n|$)', email_text, re.IGNORECASE)
    if date_match:
        parsed['date'] = date_match.group(1).strip()

    # Extract body (everything after the headers)
    # Look for the end of headers (usually marked by double newline or start of actual content)
    header_end = re.search(r'\n\s*\n', email_text)
    if header_end:
        parsed['body'] = email_text[header_end.end():].strip()
    else:
        # Fallback: try to find content after common header patterns
        body_start = re.search(r'(?:Subject:.*?\n.*?\n|X-.*?\n)', email_text, re.DOTALL)
        if body_start:
            parsed['body'] = email_text[body_start.end():].strip()
        else:
            parsed['body'] = email_text

    # Clean up body text
    parsed['body'] = re.sub(r'\n+', ' ', parsed['body'])
    parsed['body'] = re.sub(r'\s+', ' ', parsed['body'])

    return parsed

# Parse first few emails to understand structure
print("Parsing email structure...")
sample_emails = df.head(10)

parsed_samples = []
for idx, row in sample_emails.iterrows():
    # The dataset might have different column names, let's check
    email_content = None
    for col in df.columns:
        if row[col] and len(str(row[col])) > 100:  # Find the column with email content
            email_content = row[col]
            break

    if email_content:
        parsed = parse_email_text(email_content)
        parsed['email_id'] = f"email_{idx:06d}"
        parsed_samples.append(parsed)
    Downloading Enron emails dataset...
    Dataset shape: (535703, 1)
    Dataset columns: ['text']
    Parsing email structure...
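If you want to see the header regexes from `parse_email_text` in isolation, here is a self-contained miniature of the same parsing idea; the toy email string is invented for illustration:

```python
import re

# A toy email in the same shape as the Enron records parsed above
sample = "From: phillip.allen@enron.com\nTo: tim.belden@enron.com\nSubject: Forecast\n\nHere is our forecast"

headers = {}
for field in ("From", "To", "Subject"):
    match = re.search(rf"{field}:\s*(.*?)(?:\n|$)", sample, re.IGNORECASE)
    headers[field.lower()] = match.group(1).strip() if match else ""

# The body is everything after the first blank line, as in parse_email_text
body = re.split(r"\n\s*\n", sample, maxsplit=1)[-1].strip()

print(headers)
print(body)
```

The non-greedy `(.*?)` stops at the first newline, so each header captures exactly one line, and the blank-line split separates headers from the body.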

Now let's print some records to see what's inside:

df_parsed_sample = pd.DataFrame(parsed_samples)
print("\nSample of parsed emails:")
print("="*100)

for idx, row in df_parsed_sample.head(5).iterrows():
    print(f"\nEmail #{idx+1}")
    print(f"ID: {row['email_id']}")
    print(f"From: {row['from'][:80]}{'...' if len(row['from']) > 80 else ''}")
    print(f"To: {row['to'][:80]}{'...' if len(row['to']) > 80 else ''}")
    print(f"Date: {row['date']}")
    print(f"Subject: {row['subject']}")
    print(f"Body Preview: {row['body'][:200]}{'...' if len(row['body']) > 200 else ''}")
    print("-" * 80)

print(f"\nSuccessfully parsed email structure!")
print(f"Columns extracted: {df_parsed_sample.columns.tolist()}")
Sample of parsed emails:
====================================================================================================

Email #1
ID: email_000000
From: phillip.allen@enron.com
To: tim.belden@enron.com
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
Subject: Body:
Body Preview: Here is our forecast
--------------------------------------------------------------------------------

Email #2
ID: email_000001
From: phillip.allen@enron.com
To: john.lavorato@enron.com
Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)
Subject: Re:
Body Preview: Traveling to have a business meeting takes the fun out of the trip. Especially if you have to prepare a presentation. I would suggest holding the business plan meetings here then take a trip without a...
--------------------------------------------------------------------------------

Email #3
ID: email_000002
From: phillip.allen@enron.com
To: leah.arsdall@enron.com
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
Subject: Re: test
Body Preview: test successful. way to go!!!
--------------------------------------------------------------------------------

Email #4
ID: email_000003
From: phillip.allen@enron.com
To: randall.gay@enron.com
Date: Mon, 23 Oct 2000 06:13:00 -0700 (PDT)
Subject: Body:
Body Preview: Randy, Can you send me a schedule of the salary and level of everyone in the scheduling group. Plus your thoughts on any changes that need to be made. (Patti S for example) Phillip
--------------------------------------------------------------------------------

Email #5
ID: email_000004
From: phillip.allen@enron.com
To: greg.piper@enron.com
Date: Thu, 31 Aug 2000 05:07:00 -0700 (PDT)
Subject: Re: Hello
Body Preview: Let's shoot for Tuesday at 11:45.
--------------------------------------------------------------------------------

Successfully parsed email structure!
Columns extracted: ['subject', 'body', 'from', 'to', 'date', 'email_id']

The data looks good, so now let's load spaCy and its pretrained English NLP model:

import spacy

# Load the spaCy English model for NER; fail loudly if it isn't installed
print("Loading spaCy model for Named Entity Recognition...")
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model 'en_core_web_sm' loaded successfully!")
except OSError as e:
    print("Failed to load spaCy model:", e)

Loading spaCy model for Named Entity Recognition...
spaCy model 'en_core_web_sm' loaded successfully!
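For reference, these are the spaCy entity labels we will rely on in the next section and the metadata columns they will map to. The dict itself is just illustrative; the real mapping is done inline in the extraction code later.

```python
# spaCy NER labels used in this tutorial, mapped to our metadata columns
LABEL_TO_COLUMN = {
    "PERSON": "persons",
    "ORG": "organizations",
    "GPE": "locations",   # geopolitical entities: countries, cities, states
    "LOC": "locations",   # non-GPE locations
    "MONEY": "money_amounts",
    "DATE": "dates_mentioned",
    "EVENT": "events",
    "PRODUCT": "products",
}

print(LABEL_TO_COLUMN["ORG"])
```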

2.3 Preparing the Dataset for the Knowledge Base

So far, we have a raw collection of email messages, but we need a structured dataset to create a knowledge base from. In this dataset, we want natural language texts for soft semantic search and named attributes for hard filtering of data rows.

The first step in preparing a dataset for a KB is cleaning it up and making sure we have a unique ID column. The second step is extracting the named entities from the cleaned records. We will perform both steps in the cell below:

MAX_EMAILS_TO_PROCESS = 500_000
MIN_BODY_SIZE_CHARS = 50

import json
from tqdm import tqdm

def clean_email_content(text):
    """Clean and prepare email text for processing"""
    if pd.isna(text) or text == '':
        return ""

    # Convert to string and clean
    text = str(text)
    # Remove excessive whitespace and newlines
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    # Remove common email artifacts
    text = re.sub(r'-----Original Message-----.*$', '', text, flags=re.MULTILINE | re.DOTALL)
    text = re.sub(r'________________________________.*$', '', text, flags=re.MULTILINE | re.DOTALL)

    return text.strip()

def extract_entities_with_ner(text, nlp_model):
    """Extract named entities using spaCy NER"""
    if not text or len(text.strip()) == 0:
        return {}

    # Limit text length to avoid memory issues
    text = text[:5000] if len(text) > 5000 else text

    try:
        doc = nlp_model(text)
        entities = {
            'persons': [],
            'organizations': [],
            'locations': [],
            'money': [],
            'dates': [],
            'events': [],
            'products': []
        }

        for ent in doc.ents:
            entity_text = ent.text.strip()
            if len(entity_text) < 2:  # Skip very short entities
                continue

            if ent.label_ == "PERSON":
                entities['persons'].append(entity_text)
            elif ent.label_ == "ORG":
                entities['organizations'].append(entity_text)
            elif ent.label_ in ["GPE", "LOC"]:  # Geopolitical entities, locations
                entities['locations'].append(entity_text)
            elif ent.label_ == "MONEY":
                entities['money'].append(entity_text)
            elif ent.label_ == "DATE":
                entities['dates'].append(entity_text)
            elif ent.label_ == "EVENT":
                entities['events'].append(entity_text)
            elif ent.label_ == "PRODUCT":
                entities['products'].append(entity_text)

        # Remove duplicates and limit to top entities
        for key in entities:
            entities[key] = list(set(entities[key]))[:5]  # Max 5 entities per type

        return entities
    except Exception as e:
        print(f"Error processing text: {e}")
        return {}

# Use the real Enron dataset that was loaded earlier
print(f"📧 Working with real Enron dataset: {df.shape[0]} emails")
print(f"Dataset columns: {df.columns.tolist()}")

# Sample a reasonable subset for this tutorial (full dataset is very large)
print("Sampling real Enron emails for processing...")
df_sample = df.sample(n=MAX_EMAILS_TO_PROCESS, random_state=42).reset_index(drop=True)

# Process the real emails and extract entities
print("Processing real Enron emails and extracting entities...")
processed_emails = []

for idx, row in tqdm(df_sample.iterrows(), total=len(df_sample), desc="Processing emails"):
    # Get the raw email content from 'text' column
    email_content = row['text']

    if not email_content or len(str(email_content)) < 100:
        continue

    # Parse the email using the function from Cell 3
    parsed = parse_email_text(email_content)

    # Clean the content
    subject = clean_email_content(parsed['subject'])
    body = clean_email_content(parsed['body'])

    if len(body) < MIN_BODY_SIZE_CHARS:  # Skip very short emails
        continue

    # Extract entities from both subject and content
    full_text = f"{subject} {body}"
    entities = extract_entities_with_ner(full_text, nlp)

    # Create email ID
    email_id = f"email_{idx:06d}"

    processed_email = {
        'email_id': email_id,
        'from_address': parsed['from'][:100] if parsed['from'] else '',  # Limit length
        'to_address': parsed['to'][:100] if parsed['to'] else '',        # Limit length  
        'date_sent': parsed['date'][:50] if parsed['date'] else '',      # Limit length
        'subject': subject,
        'content': body,
        'persons': ', '.join(entities['persons']) if entities['persons'] else '',
        'organizations': ', '.join(entities['organizations']) if entities['organizations'] else '',
        'locations': ', '.join(entities['locations']) if entities['locations'] else '',
        'money_amounts': ', '.join(entities['money']) if entities['money'] else '',
        'dates_mentioned': ', '.join(entities['dates']) if entities['dates'] else '',
        'events': ', '.join(entities['events']) if entities['events'] else '',
        'products': ', '.join(entities['products']) if entities['products'] else '',
        'content_length': len(body),
        'entity_count': sum(len(v) for v in entities.values())
    }

    processed_emails.append(processed_email)

# Convert to DataFrame
df_processed = pd.DataFrame(processed_emails)

print(f"\n✅ Processed {len(df_processed)} real Enron emails with entities extracted")
📧 Working with real Enron dataset: 535703 emails
Dataset columns: ['text']
Sampling real Enron emails for processing...
Processing real Enron emails and extracting entities...

Processing emails: 100%|███████████████████████████████████████| 500000/500000 [2:55:44<00:00, 47.42it/s]

✅ Processed 453905 real Enron emails with entities extracted

We now have a Pandas dataframe containing, for each email, its text and the extracted attributes. Let's look at some of them and some stats:

# Show a sample of processed data
print(f"\n📊 Sample of Enron emails processed:")
print("="*120)

for idx, row in df_processed.head(5).iterrows():
    print(f"\n📧 Real Email #{idx+1}")
    print(f"🆔 ID: {row['email_id']}")
    print(f"👤 From: {row['from_address']}")
    print(f"👤 To: {row['to_address']}")
    print(f"📅 Date: {row['date_sent']}")
    print(f"📝 Subject: {row['subject']}")
    print(f"👥 Persons: {row['persons'] if row['persons'] else 'None detected'}")
    print(f"🏢 Organizations: {row['organizations'] if row['organizations'] else 'None detected'}")
    print(f"📍 Locations: {row['locations'] if row['locations'] else 'None detected'}")
    print(f"💰 Money: {row['money_amounts'] if row['money_amounts'] else 'None detected'}")
    print(f"💬 Content Preview: {row['content'][:200]}{'...' if len(row['content']) > 200 else ''}")
    print("-" * 100)

# Show statistics on real data
print(f"\n📈 Real Data Processing Statistics:")
print(f"• Total real emails processed: {len(df_processed)}")
print(f"• Average content length: {df_processed['content_length'].mean():.0f} characters")
print(f"• Average entities per email: {df_processed['entity_count'].mean():.1f}")
print(f"• Emails with persons mentioned: {len(df_processed[df_processed['persons'] != ''])}")
print(f"• Emails with organizations mentioned: {len(df_processed[df_processed['organizations'] != ''])}")
print(f"• Emails with money amounts: {len(df_processed[df_processed['money_amounts'] != ''])}")

# Show some interesting real examples
print(f"\n🔍 Most interesting real emails (by entity count):")
top_emails = df_processed.nlargest(3, 'entity_count')
for idx, row in top_emails.iterrows():
    print(f"\n📧 High-entity email from {row['from_address']}")
    print(f"📝 Subject: {row['subject']}")
    print(f"👥 Persons: {row['persons']}")
    print(f"🏢 Organizations: {row['organizations']}")
    print(f"💰 Money: {row['money_amounts']}")
📊 Sample of Enron emails processed:
========================================================================================================================

📧 Real Email #1
🆔 ID: email_000000
👤 From: daren.farmer@enron.com
👤 To: susan.trevino@enron.com
📅 Date: Fri, 10 Dec 1999 08:33:00 -0800 (PST)
📝 Subject: Re: Meter 5892 - UA4 1996 and 1997 Logistics Issues
👥 Persons: Daren J Farmer/HOU, Susan, Meter 5892 - UA4 1996, Mary M Smith/HOU, Susan D Trevino
🏢 Organizations: Volume Management
📍 Locations: UA4
💰 Money: None detected
💬 Content Preview: Susan, I need you to do the research on this meter. You will need to review the various scheduling systems to see how this was handled prior to 2/96. You can also check with Volume Management to see i...
----------------------------------------------------------------------------------------------------

📧 Real Email #2
🆔 ID: email_000001
👤 From: eric.bass@enron.com
👤 To: jason.bass2@compaq.com, phillip.love@enron.com, bryan.hull@enron.com,
📅 Date: Fri, 18 Aug 2000 05:03:00 -0700 (PDT)
📝 Subject: DRAFT
👥 Persons: Bcc
🏢 Organizations: None detected
📍 Locations: Rice Village
💰 Money: None detected
💬 Content Preview: Cc: timothy.blanchard@enron.com Bcc: timothy.blanchard@enron.com Remember, the draft is this Sunday at 11:45 am at BW-3 in Rice Village. Please try to be there on time so we can start promptly. -Eric
----------------------------------------------------------------------------------------------------

📧 Real Email #3
🆔 ID: email_000003
👤 From: larry.campbell@enron.com
👤 To: pdrumm@csc.com
📅 Date: Mon, 31 Jul 2000 09:53:00 -0700 (PDT)
📝 Subject: More July CED-PGE
👥 Persons: Susan Fick, Patty
🏢 Organizations: None detected
📍 Locations: None detected
💰 Money: None detected
💬 Content Preview: Patty Could you please forward this to Susan Fick. I don't have her e-mail. LC
----------------------------------------------------------------------------------------------------

📧 Real Email #4
🆔 ID: email_000004
👤 From: phillip.allen@enron.com
👤 To: christi.nicolay@enron.com, james.steffes@enron.com, jeff.dasovich@enron.com,
📅 Date: Wed, 13 Dec 2000 07:04:00 -0800 (PST)
📝 Subject: Body:
👥 Persons: None detected
🏢 Organizations: None detected
📍 Locations: None detected
💰 Money: None detected
💬 Content Preview: Attached are two files that illustrate the following: As prices rose, supply increased and demand decreased. Now prices are beginning to fall in response these market responses.
----------------------------------------------------------------------------------------------------

📧 Real Email #5
🆔 ID: email_000005
👤 From: kurt.lindahl@elpaso.com
👤 To: atsm@chewon.com, aarmstrong@sempratrading.com, neilaj@texaco.com,
📅 Date: Tue, 31 Jul 2001 08:28:00 -0700 (PDT)
📝 Subject: El Paso
👥 Persons: Origination El Paso, Tx 77252-2511, Kurt Lindahl Sr., Rob Bryngelson
🏢 Organizations: the ElPaso Corporation, El Paso, Global LNG Division, El Paso Merchant Energy, Business Development
📍 Locations: Houston
💰 Money: None detected
💬 Content Preview: Dear Friends and Colleagues, This note is to inform you that I have joined El Paso Merchant Energy in their Global LNG Division reporting to Rob Bryngelson, Managing Director, Business Development. Pl...
----------------------------------------------------------------------------------------------------

📈 Real Data Processing Statistics:
• Total real emails processed: 453905
• Average content length: 1474 characters
• Average entities per email: 8.1
• Emails with persons mentioned: 383312
• Emails with organizations mentioned: 363549
• Emails with money amounts: 63303

🔍 Most interesting real emails (by entity count):

📧 High-entity email from tradersummary@syncrasy.com
📝 Subject: Syncrasy Daily Trader Summary for Wed, Jan 16, 2002
👥 Persons: Data, NC ERCOT(SP, Max, Aquila, Andy Weingarten
🏢 Organizations: Trader Summary, ERCOT(SP, SPP(= SP, Average-Daily Maximum Temperature', MAPP(HP
💰 Money: 37 -1 MAIN(CTR, 50,000, 43 -1 MAIN(CTR, 36 -1 MAIN(CTR, 40 -1 WSCC(RK

📧 High-entity email from lucky@icelandair.is
📝 Subject: Iceland Food Festival
👥 Persons: Hotel Klopp, Rich, Mar 1 - National Beer Day, David Rosengarten, Subject
🏢 Organizations: Reykjav?k/K?pavogur, Party, Party Gourmet Dinner, BWI, SCENIC SIGHTSEEING Blue Lagoon
💰 Money: 65, 66, 69, 50, 55

📧 High-entity email from truorange@aol.com
📝 Subject: True Orange, November 27, Part 2
👥 Persons: Sooners, Jody Conradt, ESPN, Harris, Northwestern
🏢 Organizations: Oregon State, K-State, Texas A&M, SEC, Big East
💰 Money: $1.1 million, $1.9 million, $2.5 million, $1.2 million, 750,000

The data looks good, so let's now save it into a CSV file that we will then load into our knowledge base:

# Save processed real data
df_processed.to_csv('enron_emails_processed_real.csv', index=False)
print(f"\n✅ Real Enron emails saved to 'enron_emails_processed_real.csv'")

✅ Real Enron emails saved to 'enron_emails_processed_real.csv'

2.4 Connecting to the Vector Store

When the user creates a MindsDB Knowledge Base, MindsDB chunks all the text fragments into pieces (chunks) and uses an external text embedding model to convert each chunk into an embedding vector. Embedding vectors are numerical arrays that have the following property: if two texts are similar semantically, then their embedding vectors are close to each other in the vector space. This allows us to compare two texts semantically by applying a mathematical operation (like cosine similarity) to two vectors to see how close they are in the vector space.
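The similarity computation itself is simple. Here is a minimal sketch of cosine similarity with made-up three-dimensional vectors; real embedding models like OpenAI's produce vectors with thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for three short texts
v_forecast = [0.9, 0.1, 0.2]    # "Here is our forecast"
v_prediction = [0.8, 0.2, 0.3]  # "Attached is our prediction"
v_lunch = [0.1, 0.9, 0.1]       # "Lunch at 11:45?"

print(cosine_similarity(v_forecast, v_prediction))  # close to 1: similar meaning
print(cosine_similarity(v_forecast, v_lunch))       # much lower: unrelated
```

A semantic search simply ranks stored chunks by this score against the embedded query and returns the closest ones.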

These embedding vectors need to be stored somewhere. There are various vector databases, including several open-source ones. MindsDB supports ChromaDB by default. However, ChromaDB doesn't support the "LIKE" operation, which is a standard operation in relational database SELECT queries. We will use LIKE in our tutorial; therefore, we will use a different open-source vector store, PGVector, which is part of the Postgres ecosystem.
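To preview why LIKE matters: once the knowledge base exists, we will want to combine a semantic relevance condition with a hard metadata filter in a single query, roughly like the hypothetical query below (the knowledge base and the exact column names are only created later in the tutorial):

```sql
-- Sketch of a hybrid query: semantic condition plus a LIKE metadata filter
SELECT *
FROM enron_kb
WHERE content = 'concerns about financial reporting'  -- soft, semantic condition
  AND from_address LIKE '%@enron.com%'                -- hard filter that needs LIKE
LIMIT 10;
```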

For this tutorial, we provisioned a PGVector instance on AWS. You can install it locally too. Here's how you can do it.

Let's create a vector database enron_kb_pgvector, which will store the knowledge base's embedding vectors:

# Note: this cell uses the `server` connection created in the next section;
# run the MindsDB connection cell first.

# Drop an existing pgvector database if it exists
try:
    print("🗑️  Dropping existing pgvector database...")
    server.query("DROP DATABASE IF EXISTS enron_kb_pgvector;").fetch()
    print("✅ Dropped existing database")
except Exception as e:
    print(f"⚠️  Drop error: {e}")

# Create fresh pgvector database connection
try:
    server.query("""
        CREATE DATABASE enron_kb_pgvector
        WITH ENGINE = 'pgvector',
        PARAMETERS = {
            "host": "c3hsmn51hjafhh.cluster-czrs8kj4isg7.us-east-1.rds.amazonaws.com",
            "port": 5432,
            "database": "df1f3i5s2jrksf",
            "user": "u36kd0g64092pk",
            "password": "pc08df7cb724a4ad6b1a8288c3666fa087f1a89c1ba5d1a555b40a8ba863672e4"
        };
    """).fetch()
    print("✅ Created pgvector database connection 'enron_kb_pgvector'")
except Exception as e:
    print(f"❌ Database connection error: {e}")
    raise

🗑️  Dropping existing pgvector database...
✅ Dropped existing database
✅ Created pgvector database connection 'enron_kb_pgvector'

2.5 Uploading the Dataset to MindsDB

Now let's connect to our local MindsDB instance and upload the dataset:

Remember that in this tutorial, we assume the user has a free open-source MindsDB instance running in their local environment. Please follow these steps to set it up.

import mindsdb_sdk
import re

# Connect to the MindsDB server
server = mindsdb_sdk.connect('http://127.0.0.1:47334')
print("Connected to MindsDB server")

# List available databases to confirm connection
databases = server.databases.list()
print("Available databases:")
for db in databases:
    print(f"- {db.name}")

# First drop any knowledge bases
try:
    print("🗑️  Dropping knowledge bases...")
    server.query("DROP KNOWLEDGE_BASE IF EXISTS enron_kb;").fetch()
    print("✅ Dropped knowledge base enron_kb")
except Exception as e:
    print(f"⚠️  KB drop error: {e}")

# Check if df_processed exists and has real data
try:
    print(f"\n📊 Checking real processed Enron data...")
    print(f"Shape: {df_processed.shape}")
    print(f"Columns: {df_processed.columns.tolist()}")

    if len(df_processed) == 0:
        print("❌ df_processed is empty. Please run the cell that creates it first.")
        raise ValueError("No processed data available")

    df_upload = df_processed.copy()
    print(f"✅ Using {len(df_upload)} real processed Enron emails")

except NameError:
    print("❌ Error: df_processed not found. Please run the cell that creates it first.")
    print("Cannot continue without real processed data.")
    raise

def clean_for_upload(text):
    """Clean text data for safe upload to MindsDB"""
    if pd.isna(text) or text == '':
        return ''

    text = str(text)
    # Remove problematic characters that might cause encoding issues
    text = re.sub(r'[^\w\s\-\.\@\,\;\:\!\?\(\)\[\]\/]', ' ', text)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    # Limit length to prevent upload issues
    if len(text) > 2000:
        text = text[:1997] + '...'

    return text.strip()

# Clean the real data for upload
print("Cleaning real Enron data for upload...")

# Clean text fields
text_columns = ['from_address', 'to_address', 'date_sent', 'subject', 'content', 
                'persons', 'organizations', 'locations', 'money_amounts', 
                'dates_mentioned', 'events', 'products']

for col in text_columns:
    if col in df_upload.columns:
        df_upload[col] = df_upload[col].apply(clean_for_upload)

# Ensure numeric columns are properly typed
for col in ['content_length', 'entity_count']:
    if col in df_upload.columns:
        df_upload[col] = pd.to_numeric(df_upload[col], errors='coerce').fillna(0).astype(int)

print(f"\n📋 Final real Enron dataset for upload:")
print(f"Shape: {df_upload.shape}")
print(f"Sample from addresses: {df_upload['from_address'].head(3).tolist()}")
print(f"Sample subjects: {df_upload['subject'].head(3).tolist()}")

# Upload to MindsDB
files_db = server.get_database("files")
table_name = "enron_emails"

# Delete existing table if it exists
try:
    files_db.tables.drop(table_name)
    print(f"Dropped existing table {table_name}")
except Exception:
    pass

# Upload real Enron data
try:
    print("Uploading real Enron emails to MindsDB...")
    files_db.create_table(table_name, df_upload)
    print(f"✅ Created table files.{table_name} with real Enron data")

    # Verify upload with real data
    sample_data = server.query(f"SELECT email_id, subject, persons, organizations FROM files.{table_name} LIMIT 5").fetch()
    print("\n✅ Sample real Enron data uploaded:")
    for idx, row in sample_data.iterrows():
        print(f"📧 {row['email_id']}: {row['subject'][:60]}...")
        print(f"   👥 Persons: {row['persons']}")
        print(f"   🏢 Orgs: {row['organizations']}")

    # Check total count
    count_result = server.query(f"SELECT COUNT(*) as total FROM files.{table_name}").fetch()
    total_emails = count_result.iloc[0]['total']
    print(f"\n📊 Total real Enron emails uploaded: {total_emails}")

except Exception as e:
    print(f"❌ Upload failed: {e}")

print("\n✅ Real Enron data upload process completed!")
Connected to MindsDB server
Available databases:
- files
- movies_kb_chromadb
πŸ—‘οΈ  Dropping knowledge bases...
βœ… Dropped knowledge base enron_kb

πŸ“Š Checking real processed Enron data...
Shape: (453905, 15)
Columns: ['email_id', 'from_address', 'to_address', 'date_sent', 'subject', 'content', 'persons', 'organizations', 'locations', 'money_amounts', 'dates_mentioned', 'events', 'products', 'content_length', 'entity_count']
βœ… Using 453905 real processed Enron emails
Cleaning real Enron data for upload...

πŸ“‹ Final real Enron dataset for upload:
Shape: (453905, 15)
Sample from addresses: ['daren.farmer@enron.com', 'eric.bass@enron.com', 'larry.campbell@enron.com']
Sample subjects: ['Re: Meter 5892 - UA4 1996 and 1997 Logistics Issues', 'DRAFT', 'More July CED-PGE']
Dropped existing table enron_emails
Uploading real Enron emails to MindsDB...
βœ… Created table files.enron_emails with real Enron data

βœ… Sample real Enron data uploaded:
πŸ“§ email_000000: Re: Meter 5892 - UA4 1996 and 1997 Logistics Issues...
   πŸ‘₯ Persons: Daren J Farmer/HOU, Susan, Meter 5892 - UA4 1996, Mary M Smith/HOU, Susan D Trevino
   🏒 Orgs: Volume Management
πŸ“§ email_000001: DRAFT...
   πŸ‘₯ Persons: Bcc
   🏒 Orgs: None
πŸ“§ email_000003: More July CED-PGE...
   πŸ‘₯ Persons: Susan Fick, Patty
   🏒 Orgs: None
πŸ“§ email_000004: Body:...
   πŸ‘₯ Persons: None
   🏒 Orgs: None
πŸ“§ email_000005: El Paso...
   πŸ‘₯ Persons: Origination El Paso, Tx 77252-2511, Kurt Lindahl Sr., Rob Bryngelson
   🏒 Orgs: the ElPaso Corporation, El Paso, Global LNG Division, El Paso Merchant Energy, Business Development

πŸ“Š Total real Enron emails uploaded: 453905

βœ… Real Enron data upload process completed!

2.4 Creating a Knowledge Base

Now, let's create a knowledge base enron_kb using our email data. We'll use OpenAI's embedding model to convert the text into vectors. Note the storage = enron_kb_pgvector.enron_vectors parameter, which tells MindsDB to use our PGVector vector store. If we omit this parameter, the default ChromaDB vector store will be used.

# Drop existing knowledge base if it exists
server.query("DROP KNOWLEDGE_BASE IF EXISTS enron_kb;").fetch()

# Create knowledge base with pgvector storage
try:
    kb_creation_query = server.query(f"""
    CREATE KNOWLEDGE_BASE enron_kb
    USING
        storage = enron_kb_pgvector.enron_vectors,
        embedding_model = {{
           "provider": "openai",
           "model_name": "text-embedding-3-large"
        }},
        metadata_columns = [
            'subject', 'persons', 'organizations', 'locations', 
            'money_amounts', 'dates_mentioned', 'events', 'products',
            'content_length', 'entity_count',
            'from_address', 'to_address', 'date_sent'
        ],
        content_columns = ['content'],
        id_column = 'email_id';
    """)
    kb_creation_query.fetch()
    print("βœ… Created knowledge base 'enron_kb' with email address and date filtering support")
except Exception as e:
    print(f"❌ Knowledge base creation error: {e}")
βœ… Created knowledge base 'enron_kb' with email address and date filtering support

Now let's insert our email data into the knowledge base:

# Insert the email data into the knowledge base (including the new metadata columns)
from yaspin import yaspin

try:
    with yaspin(text="Inserting emails into updated knowledge base..."):
        insert_query = server.query(f"""
            INSERT INTO enron_kb
            SELECT email_id,
                   subject,
                   persons,
                   organizations,
                   locations,
                   money_amounts,
                   dates_mentioned,
                   events,
                   products,
                   content_length,
                   entity_count,
                   from_address,
                   to_address,
                   date_sent,
                   content
            FROM   files.{table_name}
            USING
                batch_size = 200,
                threads = 10,
                error = 'skip',
                track_column = email_id;
            """).fetch() 
    print("βœ… Emails inserted successfully into updated knowledge base!")
except Exception as e:
    print(f"❌ Insert error: {e}")
βœ… Emails inserted successfully into updated knowledge base!

Let's see what the data in the KB looks like:

search_query = server.query("SELECT * FROM enron_kb;")
display(search_query.fetch())

search_query = server.query("SELECT count(*) FROM enron_kb;")
display(search_query.fetch())
Enter fullscreen mode Exit fullscreen mode
id chunk_id chunk_content metadata relevance distance
0 email_024653 email_024653:content:3of3:1993to2382 Palmer of Caminus Corp on European Markets! ht... {'events': None, '_source': 'TextChunkingPrepr... None None
1 email_024585 email_024585:content:1of1:0to429 Most of you already know, but the move is taki... {'events': None, '_source': 'TextChunkingPrepr... None None
2 email_024902 email_024902:content:1of2:0to997 To facilitate these changes, you received an O... {'events': None, '_source': 'TextChunkingPrepr... None None
3 email_025043 email_025043:content:1of1:0to801 According to our system records, you have not ... {'events': None, '_source': 'TextChunkingPrepr... None None
4 email_025044 email_025044:content:1of3:0to996 The attached preliminary comments were finaliz... {'events': None, '_source': 'TextChunkingPrepr... None None
... ... ... ... ... ... ...
676450 email_024824 email_024824:content:1of1:0to585 Cc: m..love@enron.com, scott.palmer@enron.com ... {'events': None, '_source': 'TextChunkingPrepr... None None
676451 email_024829 email_024829:content:1of3:0to998 20 [IMAGE] CO.O.L. Travel Specials [IMAGE] Wed... {'events': 'Love Field', '_source': 'TextChunk... None None
676452 email_024829 email_024829:content:2of3:999to1998 on either Monday, May 21 or Tuesday, May 22, 2... {'events': 'Love Field', '_source': 'TextChunk... None None
676453 email_024829 email_024829:content:3of3:1999to2391 IN 139 - Louisville, KY return to top Featured... {'events': 'Love Field', '_source': 'TextChunk... None None
676454 email_024804 email_024804:content:1of1:0to354 FYI -- David Leboe in Investor Relations autho... {'events': None, '_source': 'TextChunkingPrepr... None None

676455 rows Γ— 6 columns

count_0
0 676455

As we can see, the data looks very much like a regular relational table. However, because this is a knowledge base rather than a regular database connection, we can use a special syntax that mixes semantic similarity with regular SQL "WHERE" constructs.

You can also notice that the knowledge base contains chunks rather than the original texts of the email messages. Each chunk has its own embedding vector, which allows finding more granular pieces of content similar to the user's question.
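The chunk IDs in the results above (e.g. email_024829:content:2of3:999to1998) encode the source column, the chunk index, and the character offsets. The splitting can be sketched roughly as follows; this is a simplified illustration with a hypothetical chunk_text helper, not MindsDB's actual TextChunkingPreprocessor, which may differ in details such as overlap and boundary handling:

```python
def chunk_text(doc_id: str, column: str, text: str, size: int = 999):
    """Split text into fixed-size chunks and label each one in the
    id:column:NofM:startTOend style seen in the chunk_id column.
    Simplified illustration only."""
    spans = [(i, min(i + size, len(text))) for i in range(0, len(text), size)]
    total = len(spans)
    return [
        # end offset is inclusive, hence `end - 1`
        (f"{doc_id}:{column}:{n + 1}of{total}:{start}to{end - 1}", text[start:end])
        for n, (start, end) in enumerate(spans)
    ]

# A 2392-character email body splits into three labeled chunks:
for chunk_id, chunk in chunk_text("email_024829", "content", "x" * 2392):
    print(chunk_id, len(chunk))
```

Because each chunk is embedded separately, a long email can match a query through any one of its pieces, and the offsets in the chunk ID let you locate the match in the original message.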

3. Performing Semantic Searches

Now that our knowledge base is ready (or being populated), let's do some Q&A. For convenience, we will set up a utility function answer_question_about_enron that takes as input a question about the data and the attributes the matching emails are expected to contain, such as person names, organizations, locations, etc.: the attributes that the NER step was supposed to have extracted. The utility function combines these inputs into a SELECT query using the MindsDB syntax. For example, if our question is "I need to see emails mentioning fraud." and we only want to see emails from "John Smith", the SELECT query constructed by answer_question_about_enron would look like this:

SELECT 
    id, 
    chunk_content, 
    relevance,
    metadata
FROM enron_kb
WHERE content = 'I need to see emails mentioning fraud.' AND persons LIKE '%John Smith%'
ORDER BY relevance DESC
LIMIT 100;

The above query will return emails related to fraud even if the word "fraud" itself isn't used in the email text; this is the soft search. Of those, only emails whose "persons" attribute contains "John Smith" will be returned; this is the hard search.
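One caveat: the utility function below interpolates the question directly into the SQL string, which is why the example questions double their apostrophes (''). A small helper (hypothetical, not part of the tutorial code) can perform this escaping automatically:

```python
def sql_quote(value: str) -> str:
    """Escape single quotes so a value can be embedded in a SQL string
    literal. A minimal sketch; for untrusted input, prefer parameterized
    queries over string interpolation."""
    return "'" + value.replace("'", "''") + "'"

print(sql_quote("What were Enron's accounting practices?"))
# -> 'What were Enron''s accounting practices?'
```

With such a helper, questions can be written with ordinary apostrophes and quoted once at query-construction time.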

import openai
import time
import json
from IPython.display import display

# Set up OpenAI client (replace with your API key)
client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

def answer_question_about_enron(question: str, persons=None, organizations=None, locations=None, money_amounts=None, subjects=None):
    """
    Answer questions about Enron using the knowledge base with metadata filtering

    Args:
        question (str): The question to ask
        persons (str or list): Person name(s) to filter by
        organizations (str or list): Organization name(s) to filter by  
        locations (str or list): Location name(s) to filter by
        money_amounts (str or list): Money amount(s) to filter by
        subjects (str or list): Subject keyword(s) to filter by

    Returns:
        str: Generated answer based on relevant emails
    """

    def create_like_conditions(column_name, values):
        """Helper function to create LIKE conditions for single values or lists"""
        if values is None:
            return []

        # Convert single value to list for uniform processing
        if isinstance(values, str):
            values = [values]
        elif not isinstance(values, (list, tuple)):
            values = [str(values)]

        conditions = []
        for value in values:
            conditions.append(f"{column_name} LIKE '%{value}%'")

        return conditions

    print(f"πŸ€” Question: {question}")

    # Build WHERE clause with optional filters
    where_conditions = [f"content = '{question}'"]

    # Handle persons filter
    if persons:
        person_conditions = create_like_conditions("persons", persons)
        if person_conditions:
            where_conditions.extend(person_conditions)
            if isinstance(persons, list):
                print(f"πŸ” Filtering by persons: {', '.join(persons)}")
            else:
                print(f"πŸ” Filtering by persons: {persons}")

    # Handle organizations filter
    if organizations:
        org_conditions = create_like_conditions("organizations", organizations)
        if org_conditions:
            where_conditions.extend(org_conditions)
            if isinstance(organizations, list):
                print(f"πŸ” Filtering by organizations: {', '.join(organizations)}")
            else:
                print(f"πŸ” Filtering by organizations: {organizations}")

    # Handle locations filter
    if locations:
        loc_conditions = create_like_conditions("locations", locations)
        if loc_conditions:
            where_conditions.extend(loc_conditions)
            if isinstance(locations, list):
                print(f"πŸ” Filtering by locations: {', '.join(locations)}")
            else:
                print(f"πŸ” Filtering by locations: {locations}")

    # Handle money_amounts filter
    if money_amounts:
        money_conditions = create_like_conditions("money_amounts", money_amounts)
        if money_conditions:
            where_conditions.extend(money_conditions)
            if isinstance(money_amounts, list):
                print(f"πŸ” Filtering by money amounts: {', '.join(money_amounts)}")
            else:
                print(f"πŸ” Filtering by money amounts: {money_amounts}")

    # Handle subjects filter
    if subjects:
        subject_conditions = create_like_conditions("subject", subjects)
        if subject_conditions:
            where_conditions.extend(subject_conditions)
            if isinstance(subjects, list):
                print(f"πŸ” Filtering by subjects: {', '.join(subjects)}")
            else:
                print(f"πŸ” Filtering by subjects: {subjects}")

    where_clause = " AND ".join(where_conditions)

    try:
        search_query = f"""
        SELECT 
            id, 
            chunk_content, 
            relevance,
            metadata
        FROM enron_kb
        WHERE {where_clause}
        ORDER BY relevance DESC
        LIMIT 100;
        """

        print(f"πŸ” SQL Query: {search_query}")
        search_results = server.query(search_query).fetch()

        if len(search_results) == 0:
            return "❌ No relevant emails found matching your criteria."

        print(f"βœ… Found {len(search_results)} results")

        # Show sample results
        for idx, row in search_results.head(3).iterrows():
            try:
                metadata = json.loads(str(row['metadata'])) if row['metadata'] and str(row['metadata']) != 'nan' else {}
                print(f"\nπŸ“§ Result #{idx+1}")
                print(f"πŸ†” Email ID: {row['id']}")
                print(f"πŸ“Š Relevance: {row['relevance']:.4f}")
                print(f"πŸ“ Subject: {metadata.get('subject', 'No subject')}")
                print(f"πŸ‘₯ Persons: {metadata.get('persons', 'None')}")
                print(f"🏒 Organizations: {metadata.get('organizations', 'None')}")
                print(f"πŸ“ Locations: {metadata.get('locations', 'None')}")
                content = str(row['chunk_content'])
                preview = content[:200] + "..." if len(content) > 200 else content
                print(f"πŸ’¬ Content: {preview}")
                print("-" * 60)
            except Exception as e:
                print(f"Error processing result: {e}")

        # Prepare context for GPT
        context_parts = []
        for idx, row in search_results.iterrows():
            try:
                metadata = json.loads(str(row['metadata'])) if row['metadata'] and str(row['metadata']) != 'nan' else {}
                email_context = f"""
Email ID: {row['id']}
Subject: {metadata.get('subject', 'No subject')}
Persons mentioned: {metadata.get('persons', 'None')}
Organizations mentioned: {metadata.get('organizations', 'None')}
Content: {row['chunk_content'][:800]}...
"""
                context_parts.append(email_context)
            except:
                continue

        context = "\n---\n".join(context_parts[:8])

        # Create prompt for GPT
        prompt = f"""
        You are an expert analyst studying the Enron corporate emails dataset. Based ONLY on the following 
        email excerpts from the Enron corpus, answer the user's question.

        EMAIL EXCERPTS FROM ENRON CORPUS:
        {context}

        QUESTION: {question}

        Instructions:
        - Provide a factual answer based only on the email content provided above
        - If the emails mention specific people, organizations, or amounts, include those details
        - If the emails don't contain enough information to answer the question, state that clearly
        - Reference specific email IDs when making claims
        """

        print("πŸ€– Generating answer using GPT-4...")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful analyst answering questions about Enron emails. Use only the provided email content and be specific about sources."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1
        )

        answer = response.choices[0].message.content
        print(f"\nπŸ’‘ ANSWER:\n{answer}")
        return answer

    except Exception as e:
        print(f"❌ Error during search: {e}")
        return f"Error during search: {e}"


# Process the three original questions with metadata filtering
print("=== ENRON EMAIL ANALYSIS WITH METADATA FILTERING ===\n")

print("πŸ“‹ Question 1:")
answer1 = answer_question_about_enron(
    "What concerns did Sherron Watkins express to Ken Lay in her email about Enron''s accounting practices?",
    persons=["Watkins", "Lay"]
)

print("\n" + "="*100 + "\n")

print("πŸ“‹ Question 2:")

answer = answer_question_about_enron(
    "How did David Delainey justify inflating Mariner''s valuation from $250M to $600M in his email to Ken Lay?",
    persons=["Delainey"]
)

print("πŸ“‹ Question 3:")

answer = answer_question_about_enron(
    "How did Tim DeSpain coach Ken Lay on what to tell credit rating agencies about Enron''s financial condition?",
    organizations=["Moody"]
)
=== ENRON EMAIL ANALYSIS WITH METADATA FILTERING ===

πŸ“‹ Question 1:
πŸ€” Question: What concerns did Sherron Watkins express to Ken Lay in her email about Enron''s accounting practices?
πŸ” Filtering by persons: Watkins, Lay
πŸ” SQL Query:
SELECT
id,
chunk_content,
relevance,
metadata
FROM enron_kb
WHERE content = 'What concerns did Sherron Watkins express to Ken Lay in her email about Enron''s accounting practices?' AND persons LIKE '%Watkins%' AND persons LIKE '%Lay%'
ORDER BY relevance DESC
LIMIT 100;

βœ… Found 16 results

πŸ“§ Result #1
πŸ†” Email ID: email_048101
πŸ“Š Relevance: 0.7053
πŸ“ Subject: The key questions I asked Lay on Aug 22
πŸ‘₯ Persons: Sherron S. Watkins, Lay
🏒 Organizations: Enron Corp.
πŸ“ Locations: None

πŸ’¬ Content: Sherron S. Watkins Vice President, Enron Corp. 713-345-8799 office 713-416-0620 cell

πŸ“§ Result #2
πŸ†” Email ID: email_335299
πŸ“Š Relevance: 0.6832
πŸ“ Subject: TEAM 4 - HR ENERGY COMMERCE SUBPOENA (1) (01/14/02) AND (2) (12/10/01)
πŸ‘₯ Persons: Ken Lay, Sherron Watkins, JEDI
🏒 Organizations: BLUE DOG, LJM2, 09 09The, the RAP TEAM, Enron
πŸ“ Locations: V E, electr

πŸ’¬ Content: Please search your files and collect all records and documents covered by o r relevant to the following requests: (1) 09All records relating to any investigations/review of the allegations raised by S...

πŸ“§ Result #3
πŸ†” Email ID: email_178480
πŸ“Š Relevance: 0.6727
πŸ“ Subject: TEAM 4 - HR ENERGY COMMERCE SUBPOENA (1) (01/14/02) AND (2)
πŸ‘₯ Persons: Ken Lay, Sherron Watkins, JEDI, Bcc
🏒 Organizations: BLUE DOG, LJM2, k..heathman@enron.com, minutes , 09 09The
πŸ“ Locations: V E, electr

πŸ’¬ Content: (12/10/01) Cc: k..heathman@enron.com, team.response@enron.com Bcc: k..heathman@enron.com, team.response@enron.com We remain in the process of gathering information sought by various governm ental agen...

πŸ€– Generating answer using GPT-4...

πŸ’‘ ANSWER:
Sherron Watkins expressed concerns to Ken Lay about Enron's accounting practices in her email, stating that she was "incredibly nervous that we will implode in a wave of accounting scandals." This concern was highlighted in an email discussing the broader context of Enron's financial issues, where it was noted that Andersen, the government, and Enron itself had access to financial data indicating the company's potential collapse (Email ID: email_446563). Additionally, her concerns were significant enough to prompt investigations and reviews of the allegations she raised in her August memo to Ken Lay, as mentioned in emails discussing subpoenas and document requests (Email IDs: email_335299 and email_178480).

====================================================================================================

πŸ“‹ Question 2:
πŸ€” Question: How did David Delainey justify inflating Mariner''s valuation from $250M to $600M in his email to Ken Lay?
πŸ” Filtering by persons: Delainey
πŸ” SQL Query:
SELECT
id,
chunk_content,
relevance,
metadata
FROM enron_kb
WHERE content = 'How did David Delainey justify inflating Mariner''s valuation from $250M to $600M in his email to Ken Lay?' AND persons LIKE '%Delainey%'
ORDER BY relevance DESC
LIMIT 100;

βœ… Found 100 results

πŸ“§ Result #1
πŸ†” Email ID: email_006062
πŸ“Š Relevance: 0.7015
πŸ“ Subject: Mariner
πŸ‘₯ Persons: Delainey, Ken, Kase Lawal, Bcc
🏒 Organizations: un, IPO, Mariner, E P
πŸ“ Locations: None

πŸ’¬ Content: Cc: jeff.donahue@enron.com, raymond.bowen@enron.com Bcc: jeff.donahue@enron.com, raymond.bowen@enron.com Ken, in response to your note, I am not aware of any official dialogue with Mr. Kase Lawal abou...

πŸ“§ Result #2
πŸ†” Email ID: email_400597
πŸ“Š Relevance: 0.7015
πŸ“ Subject: Mariner
πŸ‘₯ Persons: Delainey, Ken, Kase Lawal, Bcc
🏒 Organizations: un, IPO, Mariner, E P
πŸ“ Locations: None

πŸ’¬ Content: Cc: jeff.donahue@enron.com, raymond.bowen@enron.com Bcc: jeff.donahue@enron.com, raymond.bowen@enron.com Ken, in response to your note, I am not aware of any official dialogue with Mr. Kase Lawal abou...

πŸ“§ Result #3
πŸ†” Email ID: email_275215
πŸ“Š Relevance: 0.7015
πŸ“ Subject: Mariner
πŸ‘₯ Persons: Delainey, Ken, Kase Lawal, Bcc
🏒 Organizations: un, IPO, Mariner, E P
πŸ“ Locations: None

πŸ’¬ Content: Cc: jeff.donahue@enron.com, raymond.bowen@enron.com Bcc: jeff.donahue@enron.com, raymond.bowen@enron.com Ken, in response to your note, I am not aware of any official dialogue with Mr. Kase Lawal abou...

πŸ€– Generating answer using GPT-4...

πŸ’‘ ANSWER:
David Delainey justified inflating Mariner's valuation from $250M to $600M based on several factors mentioned in the emails. According to the content of multiple emails (email IDs: email_006062, email_400597, email_275215, email_372317, email_280761, email_372732, email_332901), the justification included:

  1. Successful Wells: Mariner had enjoyed a series of successful wells that were expected to be booked in reserve reports by the following March.
  2. Increases in Gas and Oil Prices: There were significant increases in gas and oil prices, which contributed to the higher valuation.
  3. Reserve Growth: The reserve growth was a key factor in the increased valuation.
  4. Current Energy Prices: The current energy prices at the time supported the higher valuation.
  5. Future Goals: The goal was to demonstrate three to four quarters of increasing operating cash flow and reserves growth before attempting further actions.

These factors collectively contributed to the stretch target valuation of $600M, which Delainey noted was not incredibly out of line given the circumstances.
πŸ“‹ Question 3:
πŸ€” Question: How did Tim DeSpain coach Ken Lay on what to tell credit rating agencies about Enron''s financial condition?
πŸ” Filtering by organizations: Moody
πŸ” SQL Query:
SELECT
id,
chunk_content,
relevance,
metadata
FROM enron_kb
WHERE content = 'How did Tim DeSpain coach Ken Lay on what to tell credit rating agencies about Enron''s financial condition?' AND organizations LIKE '%Moody%'
ORDER BY relevance DESC
LIMIT 100;

βœ… Found 100 results

πŸ“§ Result #1
πŸ†” Email ID: email_028284
πŸ“Š Relevance: 0.6674
πŸ“ Subject: Yesterday s Call: Feedback
πŸ‘₯ Persons: Good Luck, Jeff P.S., Cal Ed
🏒 Organizations: LJM, ENE, Moody s, Fastow, SEC
πŸ“ Locations: Citi, Skilling

πŸ’¬ Content: Ken, Thanks for having the call yesterday. I am a believer in Enron and we are buying your debt. Here s short feedback on the call. I give the call a B-/C grade. If you want a good example of a compan...

πŸ“§ Result #2
πŸ†” Email ID: email_268144
πŸ“Š Relevance: 0.6636
πŸ“ Subject: Moody s Annual Review Meeting
πŸ‘₯ Persons: Jeff McMahon, Stephen Moore - Relationship, Foley, Tim, Ben
🏒 Organizations: Sierra Pacific, EBS, International Asset Sales, Wholesale Services, Moody s
πŸ“ Locations: California

πŸ’¬ Content: Director, and Stephen Moore - Relationship Manager (our analyst). Diaz and Moore are very familiar with the Enron credit profile. Foley is their boss. He apparently is the leader of their ratings comm...

πŸ“§ Result #3
πŸ†” Email ID: email_152456
πŸ“Š Relevance: 0.6528
πŸ“ Subject: Moody s and Standard Poor s
πŸ‘₯ Persons: John Diaz, Ben, Bcc, Andy, Tim DeSpain
🏒 Organizations: Credit Ratings - emphasize, Moody s Call:, Standard Poor s, EBS, Dhabol
πŸ“ Locations: None

πŸ’¬ Content: Cc: ben.glisan@enron.com, andrew.fastow@enron.com Bcc: ben.glisan@enron.com, andrew.fastow@enron.com Two conference calls have been tenatively scheduled to allow you to directly discuss Enron s commit...

πŸ€– Generating answer using GPT-4...

πŸ’‘ ANSWER:
Tim DeSpain, along with Andy and Ben, coached Ken Lay on what to tell credit rating agencies about Enron's financial condition by emphasizing several key assurances. According to email ID: email_152456 and email ID: email_372468, they advised Ken Lay to stress the following points:

  1. Commitment to Maintaining Credit Ratings: They emphasized that maintaining credit ratings was critical to Enron's fundamental businesses, particularly gas and power marketing. They noted that both counterparties and creditors placed significant importance on Enron's consistent rating profile.

  2. Strength of Core Businesses: They highlighted that Enron's core businesses were strong, positioning Enron as the leading franchise in energy marketing. They anticipated continued strength in financial performance from the commodity groups.

These points were intended to assure the credit rating agencies of Enron's financial stability and commitment to its credit ratings.





Conclusion

In this tutorial, we've successfully built a sophisticated question-answering system over the Enron email corpus by combining MindsDB's Knowledge Base capabilities with Named Entity Recognition. This demonstrates how modern AI tools can transform unstructured text into a queryable, intelligent knowledge base.

Key Achievements

  1. Automated metadata extraction: By leveraging SpaCy's NER models, we automatically extracted structured entities (people, organizations, locations) from raw email text, converting unstructured data into a hybrid storage system that supports both semantic and structured queries.

  2. Hybrid search capabilities: The knowledge base enables both soft search criteria (semantic similarity through embeddings) and hard search criteria (metadata filtering), allowing for precise and flexible information retrieval. This combination significantly enhances search accuracy and reduces irrelevant results.

  3. Simplified query interface: MindsDB abstracts away the complexity of vector databases, embedding models, and similarity calculations behind a familiar SQL interface. The addition of a simple content attribute in SQL SELECT statements makes semantic search accessible to anyone familiar with SQL.

  4. Practical RAG implementation: By integrating the knowledge base with a chat LLM, we've created a Retrieval-Augmented Generation (RAG) system that can answer complex questions by first retrieving relevant context and then generating informed answers, significantly reducing hallucinations.

Real-World Applications

The techniques demonstrated in this tutorial have broad applications beyond the Enron dataset:

  • Corporate knowledge management: Search through internal documents, emails, and reports using both semantic queries and metadata filters
  • Legal discovery: Find relevant communications filtered by sender, recipient, date range, or mentioned entities
  • Customer support: Build intelligent support systems that can search through product documentation and past support tickets
  • Research analysis: Query academic papers, research notes, or experimental data with combined semantic and structured filtering

Next Steps

To extend this project, consider:

  • Expanding entity types: Extract additional metadata such as monetary amounts, dates, or custom domain-specific entities
  • Finetuning embeddings: Use domain-specific embedding models for improved semantic matching in specialized fields
  • Multi-modal knowledge bases: Incorporate documents, images, and other file types into your knowledge base
  • Advanced filtering: Implement complex boolean logic and date-range queries for more sophisticated searches
  • Production deployment: Scale the system to handle larger datasets and concurrent users

For more information on MindsDB Knowledge Bases and advanced features, visit the official documentation.

Watch the playback of the live webinar on YouTube.
