Fazil Hasanov

Posted on Jun 19

Building a Custom AI Agent for Stock Market Sentiment Analysis using Python and Natural Language Processing

#python #nlp #stockmarket #machinelearning

Introduction

The stock market is as much about psychology as it is about fundamentals. Investors and traders constantly analyze news, earnings calls, social media, and financial reports to gauge market sentiment—whether the collective mood is bullish, bearish, or neutral. But manually sifting through vast amounts of unstructured text is time-consuming and error-prone.

Enter Natural Language Processing (NLP)—a branch of AI that enables machines to understand, interpret, and generate human language. By building a custom AI agent that performs stock market sentiment analysis, we can automate the extraction of sentiment from financial news, tweets, and reports, providing real-time insights that can inform trading strategies.

In this article, we’ll build a production-ready AI agent using Python, leveraging NLP techniques and modern libraries to analyze sentiment around specific stocks. We’ll cover data collection, preprocessing, model training, deployment, and actionable insights—all with practical, reproducible code.

Why Build a Custom AI Agent?

While off-the-shelf sentiment analysis tools exist (e.g., VADER, TextBlob), they are often generic and not fine-tuned for financial language. Terms like "bullish," "short squeeze," or "FOMO" carry domain-specific meanings that general models may misinterpret.

A custom AI agent offers:

Domain-specific accuracy: Trained on financial text, it understands jargon and context.
Scalability: Can be scaled to process large volumes of articles or tweets, far more than a person could read.
Real-time insights: Integrates with APIs to fetch and analyze live data.
Customization: You control the data sources, models, and output format.

Step 1: Setting Up the Environment

We’ll use Python 3.10+ and several key libraries:

pip install pandas numpy requests beautifulsoup4 nltk spacy textblob transformers torch scikit-learn matplotlib seaborn
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt wordnet stopwords vader_lexicon

💡 Pro Tip: Use a virtual environment (python -m venv venv) to avoid dependency conflicts.

Step 2: Data Collection – Gathering Financial Text

Our AI agent needs data. We’ll collect two types:

News Articles (e.g., Reuters, Bloomberg)
Social Media (e.g., Twitter/X)

Option A: Scraping News Articles (Example: Yahoo Finance)

We’ll use requests and BeautifulSoup to scrape headlines.

import requests
from bs4 import BeautifulSoup

def scrape_yahoo_finance_news(ticker="AAPL"):
    url = f"https://finance.yahoo.com/quote/{ticker}/news"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
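    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')

    headlines = []
    # This CSS class is tied to Yahoo's page markup and may change; update it if no headlines come back
    for item in soup.find_all('h3', class_='Mb(5px)'):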
        headline = item.get_text(strip=True)
        headlines.append(headline)

    return headlines[:10]  # Return top 10 headlines

# Example usage
headlines = scrape_yahoo_finance_news("TSLA")
print(headlines)

⚠️ Note: Web scraping may violate terms of service. Always check robots.txt and consider using official APIs (e.g., NewsAPI, Alpha Vantage) for production.
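The robots.txt check mentioned in the note can be automated with the standard library. The following is a minimal sketch, not a complete compliance check: the helper name is_allowed and the sample rules are hypothetical, and the rules are inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/quote/TSLA/news"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

Against a live site, RobotFileParser can load the real file with set_url() followed by read() instead of parse(). Passing this check does not by itself mean the site's terms of service allow scraping.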

Option B: Using Twitter API (X API v2)

Twitter is a goldmine for real-time sentiment. We’ll use the tweepy library.

pip install tweepy

import tweepy
import pandas as pd

# Replace with your API keys
BEARER_TOKEN = "your_bearer_token_here"

def fetch_tweets(query="#AAPL", max_results=50):
    client = tweepy.Client(bearer_token=BEARER_TOKEN)

    tweets = client.search_recent_tweets(
        query=query,
        max_results=max_results,
        tweet_fields=["created_at", "text", "lang"],
        user_fields=["username"],
    )

    data = []
    # tweets.data is None when the search returns no results
    for tweet in tweets.data or []:
        if tweet.lang == "en":
            data.append({
                "text": tweet.text,
                "created_at": tweet.created_at
            })

    return pd.DataFrame(data)

# Example usage
tweets_df = fetch_tweets("#TSLA -is:retweet lang:en", 100)
print(tweets_df.head())
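The query string in the example usage combines a hashtag with the -is:retweet and lang:en operators. A small helper can keep those pieces consistent across tickers. This is only a sketch: build_query and its parameters are hypothetical names, not part of tweepy.

```python
def build_query(ticker: str, lang: str = "en", include_retweets: bool = False) -> str:
    """Assemble a search query like the one used in the example above."""
    parts = [f"#{ticker.upper()}"]
    if not include_retweets:
        parts.append("-is:retweet")  # Exclude retweets to reduce duplicate text
    parts.append(f"lang:{lang}")     # Restrict results to one language
    return " ".join(parts)

print(build_query("tsla"))
```

The result can be passed as the query argument of fetch_tweets.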

🔐 Security Tip: Never hardcode API keys. Use environment variables (os.getenv('TWITTER_BEARER_TOKEN')).
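One way to apply this tip is to read the token once at startup and fail early when it is missing. The sketch below assumes the variable is called TWITTER_BEARER_TOKEN, as in the tip; load_bearer_token and its environ parameter are hypothetical names chosen for illustration.

```python
import os

def load_bearer_token(env_var: str = "TWITTER_BEARER_TOKEN", environ=None) -> str:
    """Read the API token from the environment and fail early if it is missing."""
    environ = os.environ if environ is None else environ
    token = environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return token

# Demo with an explicit mapping so the example runs without a real token.
demo_env = {"TWITTER_BEARER_TOKEN": "dummy-token-for-demo"}
print(load_bearer_token(environ=demo_env))
```

In the script itself, BEARER_TOKEN = load_bearer_token() would replace the hardcoded placeholder.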

Step 3: Preprocessing Text for NLP

Raw text is noisy. We must clean it before analysis.

import re
import spacy

# spaCy handles tokenization, lemmatization, and stop-word detection in this step
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove mentions and hashtags
    text = re.sub(r'@\w+|\#\w+', '', text)
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize and lemmatize
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return ' '.join(tokens)

# Apply to DataFrame
tweets_df['cleaned_text'] = tweets_df['text'].apply(preprocess_text)
print(tweets_df[['text', 'cleaned_text']].head())

🧠 Why Lemmatization? It reduces words to their base form (e.g., "running" → "run"), which shrinks the vocabulary and can help lexicon-based and bag-of-words models, though the benefit varies by model.
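To see what the regex steps in preprocess_text actually remove, it can help to run them on their own, without spaCy. The sketch below reproduces only those steps with the standard library; strip_noise is a hypothetical name, the sample tweet is made up, and the final whitespace collapse is an addition for readable output.

```python
import re

def strip_noise(text: str) -> str:
    """Apply only the regex cleaning steps, without lemmatization."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"@\w+|#\w+", "", text)               # Mentions and hashtags
    text = re.sub(r"[^a-z\s]", "", text)                # Punctuation and numbers
    return " ".join(text.split())                       # Collapse leftover whitespace

sample = "$TSLA up 5% today! @some_user https://t.co/abc #bullish"
print(strip_noise(sample))  # tsla up today
```

Note that both "5%" and "#bullish" disappear. If hashtags or figures carry sentiment in your data, you may want to keep the hashtag word and drop only the "#" symbol.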

Step 4: Sentiment Analysis with NLP Models

We’ll explore three approaches:

A. Rule-Based: VADER (Valence Aware Dictionary and sEntiment Reasoner)

VADER is fast and works well with social media text.

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def get_vader_sentiment(text):
    scores = sia.polarity_scores(text)
    compound = scores['compound']
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    else:
        return "neutral"

tweets_df['vader_sentiment'] = tweets_df['text'].apply(get_vader_sentiment)
print(tweets_df['vader_sentiment'].value_counts())
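The value counts can be condensed into a single number for tracking over time. One simple option, shown here as a sketch rather than an established metric, is the share of positive labels minus the share of negative labels. The name net_sentiment and the sample labels are hypothetical.

```python
from collections import Counter

def net_sentiment(labels) -> float:
    """Return (positive - negative) / total, a value between -1 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # No data: treat as neutral
    return (counts["positive"] - counts["negative"]) / total

labels = ["positive", "positive", "negative", "neutral"]
print(net_sentiment(labels))  # 0.25
```

With the DataFrame above, net_sentiment(tweets_df['vader_sentiment']) would give the same kind of summary for the fetched tweets.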

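B. Lexicon-Based: TextBlob

TextBlob provides a simple API for sentiment polarity. Its default analyzer is lexicon-based rather than a trained classifier, and it returns a polarity score between -1 and 1.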

from textblob import TextBlob

def get_textblob_sentiment(text):
    analysis = TextBlob(text)
    polarity = analysis.sentiment.polarity
    if polarity > 0:
        return "positive"
    elif polarity < 0:
        return "negative"
    else:
        return "neutral"

tweets_df['textblob_sentiment'] = tweets_df['text'].apply(get_textblob_sentiment)
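The VADER function uses a ±0.05 neutral band, while the TextBlob function treats any nonzero polarity as positive or negative. Since both scores fall between -1 and 1, a shared helper can make the two sets of labels easier to compare. This is a sketch: label_from_score is a hypothetical name, and the 0.05 default is borrowed from the VADER code above, not a tuned value.

```python
def label_from_score(score: float, threshold: float = 0.05) -> str:
    """Map a score in [-1, 1] to a label, with a neutral band around zero."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

print(label_from_score(0.02))   # neutral
print(label_from_score(0.05))   # positive
print(label_from_score(-0.30))  # negative
```

Both get_vader_sentiment and get_textblob_sentiment could call this helper with their respective scores.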

C. Deep Learning: FinBERT (Financial BERT)

FinBERT is a BERT model fine-tuned on financial text, so it tends to handle financial phrasing better than general-purpose tools such as VADER or TextBlob.

pip install transformers torch

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load FinBERT model and tokenizer
model_name = "yiyanghkust/finbert-tone"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

def get_finbert_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

    # Map class index to label using the mapping shipped with the model
    # (finbert-tone orders its classes neutral, positive, negative)
    return model.config.id2label[predicted_class].lower()

# Apply to the raw text: BERT tokenizers expect natural sentences, not lemmatized text with stop words removed
tweets_df['finbert_sentiment'] = tweets_df['text'].apply(get_finbert_sentiment)
print(tweets_df['finbert_sentiment'].value_counts())
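Taking the argmax of the logits yields a label but discards how confident the model was. Converting logits to probabilities with softmax keeps that information. The sketch below uses plain Python and made-up logits so it runs without downloading the model; the softmax helper is written here for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # Subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 0.5, -1.0]  # Made-up values, not real model output
probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # Index of the top class
print(best, round(probs[best], 3))
```

Inside get_finbert_sentiment, torch.softmax(logits, dim=1) performs the same conversion directly on the tensor, which makes it possible to drop or flag low-confidence predictions.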

🚀 Why FinBERT? It
