DEV Community: Ozhaya

Building a Real-Time Financial Sentiment API: Handling Noise and LLM Hallucinations

Ozhaya — Sat, 30 May 2026 23:25:14 +0000

Financial markets move faster than human cognition. A geopolitical headline can trigger automated oil liquidations within milliseconds. A single earnings report can wipe out a company’s valuation before a retail trader finishes reading the first paragraph.

I set out to build a production-grade system that could automatically ingest unstructured global financial news feeds, parse the entities affected, determine the sentiment polarity, and expose the results as machine-readable market signals.

This post details the technical architecture of the Market Sentiment API, the data engineering pipeline, and how I solved critical edge cases like LLM cost optimisation and ticker hallucinations.

1. Program Overview

The program processes incoming data through a pipeline designed to minimise LLM token overhead and optimise latency.
The core data pipeline consists of four distinct phases:

Getting information: Every 5 minutes, news is pulled from RSS feeds (Bloomberg, Reuters, Financial Times, CNBC, BBC, Al Jazeera).
Filtering news: An article is kept only if it falls into one of 6 sections (company, war, policy, commodity, tech, disaster), determined by keywords commonly found in each section.
Sentiment extraction: An LLM extracts tickers, sentiment, and a contextual summary.
State Aggregation & Momentum Tracking: Relevant articles are grouped together, and an LLM produces an overall sentiment, a momentum direction, and a confidence rating.
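
As a rough sketch of the first phase, assuming each source publishes standard RSS 2.0, the items can be extracted with Python's standard library alone. The function name parse_rss_items and the sample feed are illustrative and not taken from the actual service, which would also need to fetch the XML over HTTP on its 5-minute schedule.

```python
import xml.etree.ElementTree as ET

def parse_rss_items(xml_text: str) -> list[dict]:
    """Return title/description/link for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the tag is missing, hence the "or" fallback
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items

# Hypothetical feed content used only for this example.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Oil prices surge after Iran conflict escalates</title>
<description>Markets fear supply disruptions in the Middle East</description>
<link>https://example.com/a</link></item>
</channel></rss>"""

articles = parse_rss_items(SAMPLE_FEED)
print(articles[0]["title"])
```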

2. Article Filtering

Passing every raw RSS headline directly to an LLM creates astronomical token costs and introduces latency. More than 70% of standard business news lacks immediate market-moving impact.

To solve this at zero token cost, the ingestion engine passes incoming headlines through a local keyword matcher with word-boundary checks before the data ever touches an LLM.

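The program dynamically loads domain-specific keywords from external text asset files (companies.txt, war.txt, policy.txt, etc.) into memory as Python sets, which also removes duplicate entries. It then checks each keyword against the title using strict regex word boundaries (\b) to prevent false-positive partial matches (e.g., the keyword "gas" matches "natural gas" but not "gasoline"). Note that this check runs one regex search per keyword, so its cost grows with the size of the keyword list; the set's O(1) lookup only applies to exact-token membership tests.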

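import re

def get_set(path):
    # Assumed loader (the original helper is not shown): one lowercase keyword per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

companies_set = get_set("companies.txt")

def match_set(title, keyword_set):
    # Lowercase the title once so it compares against lowercase keywords.
    title = title.lower()
    for k in keyword_set:
        # re.escape protects keywords that contain regex metacharacters.
        if re.search(rf"\b{re.escape(k)}\b", title):
            return True

    return False

def company_news(title: str) -> bool:
    return match_set(title, companies_set)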
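
The loop above runs one regex search per keyword. One possible refinement, sketched here under assumptions, is to compile each section's keywords into a single alternation pattern so that a title is scanned once per section. The keyword lists and the names compile_category and categories_for are hypothetical; the real service loads its keywords from the text files mentioned earlier.

```python
import re

# Hypothetical keyword lists standing in for the contents of the text files.
KEYWORDS = {
    "commodity": {"oil", "gas", "gold"},
    "war": {"conflict", "missile"},
    "policy": {"interest rate", "tariff"},
}

def compile_category(keywords: set[str]) -> re.Pattern:
    """Build one case-insensitive pattern that matches any whole keyword."""
    # Longest first so multi-word keywords are tried before shorter ones.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

PATTERNS = {name: compile_category(words) for name, words in KEYWORDS.items()}

def categories_for(title: str) -> list[str]:
    """Return the sorted names of every section whose keywords appear in the title."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(title))

print(categories_for("Oil prices surge after Iran conflict escalates"))  # ['commodity', 'war']
print(categories_for("Gasoline demand steady"))                          # []
```

The second call returns an empty list because the word boundary after "gas" fails inside "Gasoline", which is the partial-match protection described above.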

3. Sentiment Extraction

Once an article passes the initial keyword filter, it reaches the first LLM layer. The goal here is to take the raw headline and description and transform them into structured financial output.

However, using a standard, unconstrained text prompt introduces a major failure mode: ticker hallucination. Out-of-the-box models frequently look at context clues and deduce tickers that are not explicitly mentioned (such as adding NVDA to a generic article about semiconductor logistics) or map companies to completely wrong asset symbols.

To eliminate variable outputs, the following example is added to the LLM instructions:

Input:
Title: Oil prices surge after Iran conflict escalates
Description: Markets fear supply disruptions in the Middle East

Output:
{
  "signals": [
    {"asset": "CL=F", "signal_score": 0.9},
    {"asset": "XLE", "signal_score": 0.7},
    {"asset": "SPY", "signal_score": -0.3}
  ],
  "summary": "Escalating Middle East tensions boosted oil prices and energy stocks while pressuring broader equities."
}

I advise including multiple example responses in the prompt to enforce a consistent output format.
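
Prompt examples reduce hallucinated tickers but cannot rule them out, so a post-processing check is a natural complement. The sketch below is one way to do it, not the service's actual code: it parses the model's reply and drops any signal whose asset is not on a whitelist. ALLOWED_ASSETS, parse_signals, and the assumed score range of -1 to 1 are all illustrative assumptions.

```python
import json

# Hypothetical whitelist; a real one would hold every symbol the API supports.
ALLOWED_ASSETS = {"CL=F", "XLE", "SPY", "NVDA"}

def parse_signals(raw: str, allowed: set[str] = ALLOWED_ASSETS) -> dict:
    """Parse the model's JSON reply and keep only well-formed, whitelisted signals."""
    data = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    clean = []
    for sig in data.get("signals", []):
        asset = sig.get("asset")
        score = sig.get("signal_score")
        if asset not in allowed:
            continue  # unknown or hallucinated ticker
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            continue  # score missing or not numeric
        if not -1.0 <= score <= 1.0:
            continue  # outside the assumed score range
        clean.append({"asset": asset, "signal_score": float(score)})
    return {"signals": clean, "summary": str(data.get("summary", ""))}

# Made-up reply containing one valid signal, one unknown ticker, one bad score.
raw_reply = """{"signals": [
  {"asset": "CL=F", "signal_score": 0.9},
  {"asset": "FAKE", "signal_score": 0.4},
  {"asset": "SPY", "signal_score": 7}
], "summary": "Oil up."}"""

result = parse_signals(raw_reply)
print(result)
```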

4. State Aggregation & Momentum Tracking

A single asset can appear across multiple news sources within the same extraction window, often yielding conflicting sentiment readings. If BBC prints a mildly bearish note on a ticker while the Financial Times breaks a highly bullish exclusive twenty minutes later, looking at individual articles in isolation provides an incomplete picture.

To resolve this, the system pulls all historical data captured over a rolling window and groups it by ticker symbol. These pooled source inputs are then passed through a second LLM state-aggregation layer.

Instead of taking a simple mathematical average, the LLM is instructed to use each article's sentiment and hours since publication to produce three outputs: overall sentiment, confidence, and momentum.
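
The grouping step that feeds this second LLM layer can be sketched as follows. This is a minimal illustration under assumptions: the function name group_by_ticker, the 24-hour default window, and the input shape are mine, chosen to mirror the fields that appear in the API response rather than the service's internal code.

```python
from collections import defaultdict
from datetime import datetime, timezone

def group_by_ticker(articles: list[dict], now: datetime, window_hours: float = 24.0) -> dict:
    """Bucket recent article signals by asset, tagging each with its age in hours."""
    grouped = defaultdict(list)
    for art in articles:
        # Normalise the trailing "Z" so fromisoformat accepts it on older Pythons.
        published = datetime.fromisoformat(art["published_at"].replace("Z", "+00:00"))
        age_hr = (now - published).total_seconds() / 3600
        if age_hr < 0 or age_hr > window_hours:
            continue  # outside the rolling window
        for sig in art["signals"]:
            grouped[sig["asset"]].append({
                "title": art["title"],
                "signal_score": sig["signal_score"],
                "since_published_hr": round(age_hr, 2),
            })
    return dict(grouped)

now = datetime(2026, 5, 30, 23, 0, tzinfo=timezone.utc)
articles = [
    {"title": "Paramount Is Pulling Every Lever to Sell LBO Debt",
     "published_at": "2026-05-30T19:00:00Z",
     "signals": [{"asset": "WBD", "signal_score": 0.6},
                 {"asset": "PARA", "signal_score": 0.5}]},
    # Hypothetical stale article that should fall outside the window.
    {"title": "Old story", "published_at": "2026-05-28T19:00:00Z",
     "signals": [{"asset": "PARA", "signal_score": -0.2}]},
]

grouped = group_by_ticker(articles, now)
print(grouped["PARA"])
```

Each per-ticker list can then be serialised into the aggregation prompt, giving the LLM both the score and the age of every article.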

The final response wraps the top-level aggregated metrics alongside an array containing the exact source articles that built the consensus:

{
  "ticker": "para",
  "overall_sentiment": "neutral",
  "overall_sentiment_score": 0.5,
  "overall_confidence": 0.65,
  "sentiment_momentum": "neutral",
  "articles_analysed": 1,
  "summary": "Recent positive sentiment from a buyout attempt indicates potential, but overall confidence remains low due to limited coverage.",
  "signals": [
    {
      "title": "Paramount Is Pulling Every Lever to Sell LBO Debt",
      "summary": "Paramount's aggressive leveraged buyout attempt for Warner Bros. Discovery generated positive sentiment for both companies, suggesting potential growth and strategic consolidation in the media sector.",
      "signals": [
        {
          "asset": "WBD",
          "signal_score": 0.6
        },
        {
          "asset": "PARA",
          "signal_score": 0.5
        }
      ],
      "description": "Paramount Skydance Corp. stretched, then stretched, then stretched again in its audacious $110 billion takeover bid for Warner Bros. Discovery Inc.",
      "published_at": "2026-05-30T19:00:00Z",
      "since_published_hr": 4.220252061944445,
      "source": "Bloomberg News",
      "url": "https://www.bloomberg.com/news/articles/2026-05-30/paramount-is-pulling-every-lever-to-sell-lbo-debt-credit-weekly"
    }
  ]
}
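
On the consuming side, a client might gate on confidence and coverage before acting on a signal. The sketch below works on a trimmed copy of the response above; is_actionable, per_asset_scores, and both thresholds are hypothetical choices for illustration, not recommendations from the API.

```python
import json

# Trimmed copy of the response above; only the fields used here are kept.
payload = json.loads("""
{
  "ticker": "para",
  "overall_sentiment_score": 0.5,
  "overall_confidence": 0.65,
  "articles_analysed": 1,
  "signals": [
    {"title": "Paramount Is Pulling Every Lever to Sell LBO Debt",
     "signals": [{"asset": "WBD", "signal_score": 0.6},
                 {"asset": "PARA", "signal_score": 0.5}]}
  ]
}
""")

def is_actionable(resp: dict, min_confidence: float = 0.7, min_articles: int = 3) -> bool:
    """Hypothetical gate: require enough confidence and enough coverage."""
    return (resp["overall_confidence"] >= min_confidence
            and resp["articles_analysed"] >= min_articles)

def per_asset_scores(resp: dict) -> dict[str, list[float]]:
    """Collect every article-level score, keyed by asset symbol."""
    scores: dict[str, list[float]] = {}
    for article in resp["signals"]:
        for sig in article["signals"]:
            scores.setdefault(sig["asset"], []).append(sig["signal_score"])
    return scores

print(is_actionable(payload))     # False
print(per_asset_scores(payload))  # {'WBD': [0.6], 'PARA': [0.5]}
```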

Useful Links

Explore the Contract Shapes: Check out our interactive Swagger UI Documentation to run mock requests and map out the exact JSON payloads.

Integrate via RapidAPI: Grab a free tier developer token on RapidAPI to begin injecting live macro-sentiment triggers directly into your automated algorithmic models, quantitative trading bots, or custom terminal dashboards.

I Built an ML-Powered Email Validation API

Ozhaya — Tue, 19 May 2026 10:17:47 +0000

I built an ML model using XGBoost to catch auto-generated disposable emails when blacklists can't keep up. Most validators rely on MX records, SMTP checks, or blacklists - disposable emails have real mailboxes so MX and SMTP return valid. That's why I added an ML model to determine the risk of accepting an email based on the username and domain.

Compare these two emails:

john.doe@gmail.com

and

r9lo6tngee825@yzcalo.com

You can immediately tell the second one is fake, but why exactly? Is it the numbers, the consonant/vowel ratio, the length? We don't need to know the exact rules. We can train an XGBoost model on labelled data to figure it out, using features like digit count, length, and consonant/vowel ratio to predict whether a username or domain is legitimate.
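
To make those features concrete, here is a minimal sketch of how they could be computed for the part of the address before the @. The function name username_features and the exact feature definitions are my assumptions for illustration; the trained model's real feature set may differ.

```python
VOWELS = set("aeiou")

def username_features(username: str) -> dict:
    """Compute simple character-level features for an email username."""
    text = username.lower()
    letters = [c for c in text if c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    digits = sum(1 for c in text if c.isdigit())
    length = len(text)
    return {
        "length": length,
        "digit_count": digits,
        "digit_ratio": digits / length if length else 0.0,
        "vowel_count": vowels,
        "consonant_count": consonants,
        # +1 in the denominator avoids division by zero for vowel-free usernames.
        "consonant_vowel_ratio": consonants / (vowels + 1),
    }

print(username_features("john.doe"))
print(username_features("r9lo6tngee825"))
```

Rows of such dictionaries can be loaded into a pandas DataFrame and passed to the classifier as its feature matrix.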

Under the hood, the API combines MX records, blacklist checking, role detection, syntax validation, and new domain detection for basic coverage. On top of that, ML-powered scoring handles what static methods miss. It also supports batch validation of up to 30 emails per request. SMTP validation is intentionally excluded as disposable emails have real mailboxes so it returns valid anyway, and it adds significant latency for no benefit in this use case.
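
Because a batch request accepts at most 30 emails, a longer list has to be split on the client side. A minimal helper for that, with the hypothetical name chunked, could look like this:

```python
from typing import Iterator

MAX_BATCH = 30  # per-request limit stated above

def chunked(emails: list[str], size: int = MAX_BATCH) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` emails."""
    for start in range(0, len(emails), size):
        yield emails[start:start + size]

emails = [f"user{i}@example.com" for i in range(65)]
batches = list(chunked(emails))
print([len(b) for b in batches])  # [30, 30, 5]
```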

For the ML side, I used pandas for feature engineering, an 80/20 train-test split via sklearn's train_test_split, and XGBoost as the classifier. Features include pairs of vowels, consonants, and digits, and ratios for each. One feature that greatly increased the accuracy of the username score was determining whether the username contained a name. For example,

john.doe@gmail.com

contains john and doe, which makes it more likely to be safe than

r9lo6tngee825@yzcalo.com

even if we disregarded the domains.
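
One simple way to build such a feature, shown here as a sketch rather than the method the model actually uses, is to split the username on non-letter characters and look each token up in a set of known names. KNOWN_NAMES is a tiny illustrative list; a real feature would need a large first-name and surname dataset.

```python
import re

# Tiny illustrative list; not the dataset used to train the model.
KNOWN_NAMES = {"john", "doe", "maria", "smith"}

def contains_name(username: str, names: set[str] = KNOWN_NAMES) -> bool:
    """True if any token of the username, split on non-letters, is a known name."""
    tokens = re.split(r"[^a-z]+", username.lower())
    return any(tok in names for tok in tokens if tok)

print(contains_name("john.doe"))       # True
print(contains_name("r9lo6tngee825"))  # False
```

This token-based version misses names written without a separator, such as johndoe, so a production feature would probably also need substring matching.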

Here's the API response for the fake email:

{
  "email": "r9lo6tngee825@homvela.com",
  "valid_email_structure": true,
  "is_role": false,
  "mx_records": true,
  "not_disposable": true,
  "new_domain": false,
  "domain_risk": 0.23987430334091187,
  "name_risk": 0.9953031539916992,
  "valid_email": true
}

This response shows that even though the email is disposable, the traditional methods failed to detect it. However, name_risk is extremely high, which marks the email as very risky to accept. Interestingly, despite the domain being fake, domain_risk remains low. This is because the model is trained on patterns in the domain name itself, and domain names don't follow the same conventional patterns as usernames.

Contrast this with a real email:

{
  "email": "john.doe@gmail.com",
  "valid_email_structure": true,
  "is_role": false,
  "mx_records": true,
  "not_disposable": true,
  "new_domain": false,
  "domain_risk": 0.221174955368042,
  "name_risk": 0.00679133040830493,
  "valid_email": true
}

This shows that the email is safe to accept, as both the traditional and ML methods consider it safe. It is also worth noting that the domain_risk for gmail is similar to that of the fake domain, which shows that domain_risk alone isn't reliable except in special cases of unconventionally named domains.

This is why combining all fields, rather than relying on any single one, gives you the most accurate result.
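
A combined decision could be sketched like this. The function name assess, the three outcome labels, and the rule ordering are my assumptions; the 0.7 threshold mirrors the one used in the library quick start later in this post.

```python
def assess(resp: dict, risk_threshold: float = 0.7) -> str:
    """Collapse the response fields into 'reject', 'review', or 'accept'."""
    # Hard failures from the rule-based checks.
    if not resp["valid_email_structure"] or not resp["mx_records"]:
        return "reject"
    if not resp["not_disposable"]:
        return "reject"
    # Soft signals: ML scores and weaker heuristics ask for a manual look.
    if resp["name_risk"] > risk_threshold or resp["domain_risk"] > risk_threshold:
        return "review"
    if resp["new_domain"] or resp["is_role"]:
        return "review"
    return "accept"

# The two responses shown above, with scores rounded for brevity.
fake = {"valid_email_structure": True, "is_role": False, "mx_records": True,
        "not_disposable": True, "new_domain": False,
        "domain_risk": 0.24, "name_risk": 0.995}
real = {"valid_email_structure": True, "is_role": False, "mx_records": True,
        "not_disposable": True, "new_domain": False,
        "domain_risk": 0.22, "name_risk": 0.007}

print(assess(fake))  # review
print(assess(real))  # accept
```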

Limitations
Although I trained a model for domain_risk, its usefulness is limited because domain names contain few distinguishing patterns. Unlike usernames, which often follow recognisable human patterns like names or words, auto-generated domains can look surprisingly normal compared to genuine ones, making it much harder for the model to distinguish between real and fake. A prime example of this is visible in the responses above, where the fake domain's score differs only negligibly from the real one's.

Another takeaway was that blacklists are less effective than initially expected. While they work well for well-known disposable providers, the sheer volume of auto-generated domains makes it nearly impossible to maintain a comprehensive and up-to-date list. I still found them useful, as both the effort needed to implement a blacklist and the added latency are negligible.

The reason I kept traditional methods is that ML is still just prediction; it can produce false positives (real email flagged as risky) and false negatives (disposable email missed). Combining both methods reduces the impact of either failure mode.

Try It Yourself
The API is available on RapidAPI with a free tier of 100 requests per month. If you're building a signup form, cleaning an email list, or trying to prevent fraud, give it a try and let me know what you think.

rapidapi.com

Note: the first request may be slow due to a cold start on the free server - subsequent requests will be faster.

Check out my interactive Swagger UI documentation. You can see every endpoint, all parameters, and example responses in one place.

Python Library
Installation:

pip install identify-fake-email

Quick Start:

from identify_fake_email.client import EmailValidator

client = EmailValidator("YOUR_RAPIDAPI_KEY")

result = client.validate("user@example.com")

if result.name_risk > 0.7 or not result.valid_email:
    print("Suspicious - review")
else:
    print("Safe to accept")

Bulk Validation:

emails = ["user1@gmail.com", "user2@gmail.com"]
results = client.validate_bulk(emails)

for result in results:
    if result.name_risk > 0.7 or not result.valid_email:
        print("Suspicious - review")
    else:
        print("Safe to accept")

I also covered this on Medium with a broader overview.