vaasav kumar

πŸš€ From Agents to Data Intelligence: Load Files, Scrape Web & Analyze with LangChain

In my previous blog, I covered:
πŸ‘‰ From LLMs to Agents: Build Smart AI Systems with Tools in LangChain

We learned how to:

  • build custom tools
  • create AI agents
  • fetch real-world data

πŸ”₯ What’s Next?

Now let’s take it further.

πŸ‘‰ Instead of just querying tools, we will make AI work with real data sources:

In this blog, we will learn:

  1. πŸ“„ Load and analyze text files
  2. πŸ“Š Process CSV data
  3. 🌐 Fetch and analyze web URLs (web scraping)
  4. ⚑ Optimize using semantic search (vector DB)

πŸ“„ 1. Load Text File Using TextLoader

We can directly load a *.txt file into LangChain:

from langchain_community.document_loaders import TextLoader

loader = TextLoader("tata_motors.txt")
docs = loader.load()
docs

Output

πŸ‘‰ This converts your text file into structured documents.
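For intuition, each loaded item is a Document carrying the file's text in page_content plus metadata such as the source path. A plain-dict stand-in to show the shape (the real class is langchain_core.documents.Document, and the text here is made up):

```python
# Plain-dict stand-in for what TextLoader.load() returns: a list of
# Document-like records holding the file's text and its source path.
docs = [
    {
        "page_content": "Tata Motors reported strong quarterly results...",
        "metadata": {"source": "tata_motors.txt"},
    }
]

print(len(docs))                      # one document for the whole file
print(docs[0]["metadata"]["source"])  # the source path travels with the text
```

Because the source path travels in metadata, you can later tell which file (or URL) an answer came from.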

Add a query to fetch results from the .txt file

from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini")

loader = TextLoader("tata_motors.txt", encoding="utf-8")
docs = loader.load()

# Combine all texts into one single string
context = "\n\n".join(doc.page_content for doc in docs)

# Ask Questions
query = """
How much has Tata Motors provided on behalf of its Singapore holding company?
"""

prompt = ChatPromptTemplate.from_template("""
You are a stock research assistant.

Use only the context below.
Do not invent missing values.

User query:
{query}

Context:
{context}
""")

chain = prompt | llm
response = chain.invoke({
    "query": query,
    "context": context
})

print(response.content)

Output

πŸ“Š 2. Load CSV Data Using CSVLoader

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("cars.csv")
data = loader.load()
data

Output

πŸ‘‰ cars.csv file contents
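Unlike TextLoader, CSVLoader emits one Document per row, with the row rendered as `column: value` lines. A rough pure-Python sketch of that conversion (the column names and values here are made up, not the real cars.csv):

```python
import csv
import io

def rows_to_page_content(csv_text: str) -> list[str]:
    """Roughly mimic how CSVLoader turns each CSV row into a
    'column: value' text block -- one block (Document) per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

sample = "model,price_lakh\nTata Tiago,6.3\nTata Nexon,8.1\n"
for block in rows_to_page_content(sample):
    print(block)
    print("---")
# First block prints:
# model: Tata Tiago
# price_lakh: 6.3
```

This row-per-Document layout is why CSVLoader works well with retrieval: each row can be matched and returned independently.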

You can also use Pandas for better control:

pip install -U pandas

πŸ‘‰ This allows LLMs to behave like a data analyst on your CSV.

import pandas as pd
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

df = pd.read_csv("cars.csv")
question = "List the cars within a 10 lakh budget."

csv_text = df.to_string(index=False)

prompt = f"""
You are answering questions from this CSV data.

CSV data:
{csv_text}

Question:
{question}

Answer clearly using only the CSV data.
"""

response = llm.invoke(prompt)

print(response.content)

Output

🌐 3. Load URLs & Perform Web Scraping

Now comes the powerful part.

pip install -U unstructured

πŸ‘‰ LLM will read web content and generate structured analysis.

from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

urls = [
    "https://www.tickertape.in/stocks/tata-motors-TMC",
    "https://groww.in/stocks/tata-motors-ltd",
]

loader = UnstructuredURLLoader(urls=urls)
documents = loader.load()

# Combine the scraped pages into one context string for the prompt
context = "\n\n".join(doc.page_content for doc in documents)

query = """
Analyze valuation, profitability, entry point, red flags,
and overall whether Tata Motors stock looks attractive.
"""

prompt = f"""
You are a stock research assistant.

Use only the context below. Do not invent missing values.

Context:
{context}

User query:
{query}

Return the answer in this exact format:

# Tata Motors Stock Analysis

## 1. Quick View
- Overall view:
- Reason:

## 2. Key Metrics Found
| Metric | Value | Interpretation |
|---|---:|---|
| Market Cap | | |
| PE Ratio | | |
| PB Ratio | | |
| Dividend Yield | | |
| Risk / Volatility | | |
| Red Flags | | |

## 3. Valuation

## 4. Profitability / Quality

## 5. Entry Point

## 6. Red Flags / Risks

## 7. Final Tentative View
"""

response = llm.invoke(prompt)

print(response.content)

Output

⚠️ Problem: Slow Performance

If you load many URLs:

  • ⏳ Processing becomes slow
  • πŸ“‰ Context becomes too large
  • πŸ’Έ Cost increases

⚑ Solution: Semantic Search (Vector DB)

Instead of passing all data, we:

  • Split content into chunks
  • Convert into embeddings
  • Store in vector DB
  • Retrieve only relevant data
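At its core, the retrieval step is nearest-neighbour search over embedding vectors. A toy sketch with hand-made 3-dimensional vectors (real embeddings come from a model like text-embedding-3-small and have ~1536 dimensions; the chunk texts and numbers are invented):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three chunks (hand-made; real ones come from an embedding model)
chunk_vectors = {
    "PE ratio is 9.8":           [0.9, 0.1, 0.0],
    "Factory tour schedule":     [0.0, 0.2, 0.9],
    "Market cap is 3.1 lakh Cr": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "valuation metrics"

# Keep only the k most similar chunks -- what as_retriever(search_kwargs={"k": ...}) does
k = 2
top = sorted(chunk_vectors,
             key=lambda c: cosine(chunk_vectors[c], query_vec),
             reverse=True)[:k]
print(top)
```

Only the top-k chunks go into the prompt, which is why the vector DB approach keeps context small and cost low no matter how many URLs you load.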

⚑ 4. Optimize using semantic search (vector DB)

from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_chroma import Chroma

llm = ChatOpenAI(model="gpt-4o-mini")

urls = [
    "https://www.tickertape.in/stocks/tata-motors-TMC",
    "https://groww.in/stocks/tata-motors-ltd",
]

loader = UnstructuredURLLoader(urls=urls)
documents = loader.load()

# Step 1: Split into Chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

# Step 2: Create Embeddings + Store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_url_db"
)

# Step 3: Retrieve Relevant Data
query = """
Analyze valuation, profitability, entry point, red flags,
and overall whether Tata Motors stock looks attractive.
"""

retriever = vector_db.as_retriever(search_kwargs={"k": 4})
retrieved_docs = retriever.invoke(query)

context = "\n\n".join(
    doc.page_content for doc in retrieved_docs
)

prompt = ChatPromptTemplate.from_template("""
You are a stock research assistant.

Use only the context below. Do not invent missing values.

User query:
{query}

Context:
{context}

Return the answer in this exact format:

# Tata Motors Stock Analysis

## 1. Quick View
- Overall view:
- Reason:

## 2. Key Metrics Found
| Metric | Value | Interpretation |
|---|---:|---|
| Market Cap | | |
| PE Ratio | | |
| PB Ratio | | |
| Dividend Yield | | |
| Risk / Volatility | | |
| Red Flags | | |

## 3. Valuation

## 4. Profitability / Quality

## 5. Entry Point

## 6. Red Flags / Risks

## 7. Final Tentative View
""")

# Step 4: Final Analysis
chain = prompt | llm
response = chain.invoke({
    "query": query,
    "context": context,
})

print(response.content)

Output

πŸš€ What You Learned

In this blog, we moved from:
πŸ‘‰ AI Agents β†’ AI + Data Intelligence

You learned how to:

  • Load text and CSV data
  • Scrape and analyze web content
  • Handle large data efficiently
  • Use vector databases for semantic search
