DEV Community

StockHark

Inside StockHark: How Our Reddit-Based Financial Sentiment Engine Works (Now in Beta, Paid Version Coming Soon)

Financial sentiment is one of the strongest early indicators of market direction, yet most platforms treat it as a shallow metric. At StockHark, we set out to build something more rigorous: a transparent, mathematically grounded sentiment engine that ingests large volumes of Reddit discussions and converts them into confidence-weighted sentiment signals. The full whitepaper is available on our site, but this article gives developers a closer look at the architecture and the reasoning behind each design choice. StockHark is currently available in free beta at https://www.stockhark.com and will transition to a paid model once the platform stabilizes.

The system begins with data collection. Reddit posts are fetched at thirty-minute intervals and validated against a library of more than four thousand approved stock symbols. This avoids noise from ticker collisions, where everyday words like “FREE” or “RUN” accidentally match tradable equities. Every post is stored with metadata such as its timestamp, source subreddit, and author-validity signals. From there, the text is passed into our sentiment analysis layer, which uses FinBERT as the primary model. FinBERT is trained specifically on financial text and gives us a probability distribution over positive, negative, and neutral sentiment. We convert this distribution into a single numerical sentiment value: the positive probability contributes a positive score, the negative probability contributes a negative score, and the neutral probability contributes zero. A rule-based layer then refines the result with domain-specific lexicons and multi-word phrase boosters, and intensifier words can increase the magnitude of a signal. All raw scores are clipped to the range from −1 to +1 for stability.
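The validation and scoring steps can be sketched roughly as follows. This is a minimal illustration, not StockHark's actual code: the tiny symbol set, the cashtag rule for ambiguous words, and the choice of P(positive) − P(negative) as the score are all assumptions made for the example, and the probabilities are passed in as a plain dictionary rather than coming from a real FinBERT call.

```python
import re

# Hypothetical allow-list; the article describes more than four thousand symbols.
APPROVED_SYMBOLS = {"AAPL", "NVDA", "RUN", "TSLA"}
# Assumed rule: symbols that are also everyday words count only as cashtags ($RUN).
AMBIGUOUS_SYMBOLS = {"RUN"}

# One to five capital letters, optionally preceded by a dollar sign.
TOKEN_RE = re.compile(r"\$?\b[A-Z]{1,5}\b")


def extract_tickers(text):
    """Return approved tickers mentioned in text, in order of first appearance."""
    tickers = []
    for token in TOKEN_RE.findall(text):
        symbol = token.lstrip("$")
        if symbol not in APPROVED_SYMBOLS:
            continue  # not a known symbol, e.g. "FREE" in this toy list
        if symbol in AMBIGUOUS_SYMBOLS and not token.startswith("$"):
            continue  # everyday word without a cashtag: treat as noise
        if symbol not in tickers:
            tickers.append(symbol)
    return tickers


def sentiment_score(probs):
    """Collapse a positive/negative/neutral distribution into one clipped score.

    Assumed mapping: positive counts as +1, negative as -1, neutral as 0.
    """
    raw = probs["positive"] - probs["negative"]
    return max(-1.0, min(1.0, raw))
```

With this sketch, `extract_tickers("FREE money, RUN to AAPL and $RUN")` keeps `AAPL` and the cashtag `$RUN` while ignoring the bare words. A production rule for ambiguous symbols would likely use more context than a dollar sign.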

Once a raw sentiment score exists, StockHark applies a time-decay weight so that recent mentions matter more than older ones. The decay is exponential: wₜ = exp(−λ × Δt), where Δt is the age of the post in hours and λ is set to 0.1 in most production runs. A twenty-four-hour-old post therefore carries a weight of exp(−2.4), or only about nine percent of the weight of a fresh post. This allows the platform to react quickly to emerging sentiment waves rather than being pulled backward by historical chatter.
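As a quick check of the decay arithmetic, here is a small sketch. It assumes the post age is measured in hours, which is the unit that reproduces the nine percent figure; the function and constant names are illustrative.

```python
import math

DECAY_LAMBDA = 0.1  # per hour; the value quoted for most production runs


def time_decay_weight(age_hours, lam=DECAY_LAMBDA):
    """Exponential decay weight w = exp(-lambda * age) for a post of the given age."""
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    return math.exp(-lam * age_hours)


def half_life_hours(lam=DECAY_LAMBDA):
    """Age at which a post's weight drops to one half."""
    return math.log(2) / lam
```

Under these assumptions a fresh post has weight 1.0, a day-old post has weight of roughly 0.09, and the half-life works out to a little under seven hours.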

Each source also has its own weight. High-noise channels like WallStreetBets are weighted at 0.8, while other finance subreddits sit around 0.6. News has a weight of 1.0 due to its reliability. The system further adjusts for post volume using a logarithmic multiplier, which increases the influence of symbols receiving broad attention without allowing spam to dominate: because the multiplier grows with the logarithm of the mention count, each additional mention adds less than the one before. The goal is proportional influence, not brute frequency.
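A rough sketch of how these two weights might look in code is shown below. The source weights are the ones quoted above; the exact form of the logarithmic multiplier is not given in this article, so `1 + log10(1 + n)` is purely an assumed shape chosen to illustrate the diminishing returns.

```python
import math

# Weights quoted in the article; the dictionary keys are illustrative names.
SOURCE_WEIGHTS = {"wallstreetbets": 0.8, "news": 1.0}
DEFAULT_SUBREDDIT_WEIGHT = 0.6  # "other finance subreddits sit around 0.6"


def source_weight(source):
    """Look up the weight for a source, falling back to the generic subreddit weight."""
    return SOURCE_WEIGHTS.get(source.lower(), DEFAULT_SUBREDDIT_WEIGHT)


def volume_multiplier(mention_count):
    """Assumed logarithmic boost: grows with volume, but ever more slowly."""
    if mention_count < 0:
        raise ValueError("mention_count must be non-negative")
    return 1.0 + math.log10(1 + mention_count)
```

With this assumed form, going from 9 to 999 mentions raises the multiplier only from 2.0 to 4.0, so a flood of repeated posts cannot scale a symbol's influence linearly.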

Aggregation occurs by combining raw sentiment with all associated weights and computing the weighted average. The final sentiment is again clamped to the −1 to +1 range to ensure comparability across stocks. To express reliability, we compute a confidence score that blends weight strength, consensus stability, and sample size. When sentiment is consistent across many independent posts, confidence increases. When opinions conflict heavily, confidence drops even if the average sentiment is strongly positive or negative.
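To make the aggregation concrete, here is a minimal sketch. The weighted average follows the description above directly. The confidence formula does not: this article does not spell it out, so the three components and their equal blend below are hypothetical stand-ins for "weight strength, consensus stability, and sample size".

```python
import math


def aggregate_sentiment(scores, weights):
    """Weighted average of per-post scores, clamped to [-1, 1]."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    mean = sum(s * w for s, w in zip(scores, weights)) / total
    return max(-1.0, min(1.0, mean))


def confidence(scores, weights):
    """Hypothetical confidence in [0, 1]; every component here is an assumption."""
    total = sum(weights)
    n = len(scores)
    if n == 0 or total <= 0:
        return 0.0
    mean = sum(s * w for s, w in zip(scores, weights)) / total
    variance = sum(w * (s - mean) ** 2 for s, w in zip(scores, weights)) / total
    # Consensus: 1 when all posts agree, 0 when they split between -1 and +1.
    consensus = 1.0 - min(1.0, math.sqrt(variance))
    # Sample size: saturates as more posts arrive (scale of 10 is arbitrary).
    sample = 1.0 - math.exp(-n / 10.0)
    # Weight strength: saturates as total weight grows.
    strength = total / (total + 1.0)
    return (consensus + sample + strength) / 3.0
```

In this sketch, twenty posts that all score 0.8 yield higher confidence than twenty posts split evenly between −1 and +1, even though both sets have the same sample size and total weight.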

Because Reddit contains bots, spam networks, and duplicated content, StockHark includes a filtering layer before sentiment analysis. Posts from accounts with extremely low karma, very young profiles, suspicious posting rates, or bot-like usernames are skipped. Exact duplicates are identified by hashing content with SHA-256 and comparing it to cached values. Near-duplicates are detected using SimHash and Hamming distance thresholds. These steps dramatically reduce noise and prevent sentiment loops caused by repeated posts. All filtering decisions are logged so that system behavior remains transparent.
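The two duplicate checks can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the production filter: the whitespace and case normalization before hashing, the 64-bit fingerprint width, the whitespace tokenizer, and the distance threshold of 3 are all choices made for the example.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; taken from the first 8 bytes of each token hash


def content_hash(text):
    """SHA-256 of the post text after light normalization (an assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def simhash(text):
    """64-bit SimHash: each token votes +1 or -1 on every bit position."""
    counts = [0] * FINGERPRINT_BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_bits = int.from_bytes(digest[:8], "big")
        for i in range(FINGERPRINT_BITS):
            counts[i] += 1 if (token_bits >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    """Treat two posts as near-duplicates if their fingerprints are close."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

An exact-duplicate check would then compare `content_hash(post)` against cached hashes, while `is_near_duplicate` catches posts that differ by a few words. The right threshold depends on post length and would need tuning against real data.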

The technical stack supporting the platform includes FinBERT, spaCy for ticker entity recognition, SQLite for storage, and Redis for caching hashes, rate limits, and bot heuristics. Processing each post takes less than fifty milliseconds, so analysis adds very little latency on top of the thirty-minute collection cycle and StockHark can operate in near-real-time. Redundancy, health checks, and strict token limits keep operation stable during peak activity hours.

The sentiment engine maps its outputs into an interpretable scale ranging from strongly bearish to strongly bullish. These categories are used to rank stocks, trigger alerts, and help users identify meaningful shifts in discussion. Though the architecture is mathematically complex, the goal is clarity for end users: a score that reflects not only emotional direction but also the reliability of the underlying data.
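The mapping from score to category might look like the sketch below. Only the two end labels come from this article; the intermediate labels and every threshold value are hypothetical and serve only to show the shape of the lookup.

```python
# (upper bound, label) pairs in ascending order; all thresholds are assumed values.
LABEL_THRESHOLDS = [
    (-0.6, "strongly bearish"),
    (-0.2, "bearish"),
    (0.2, "neutral"),
    (0.6, "bullish"),
]


def sentiment_label(score):
    """Map a score in [-1, 1] to a human-readable category."""
    for upper_bound, label in LABEL_THRESHOLDS:
        if score < upper_bound:
            return label
    return "strongly bullish"
```

A ranking or alerting layer could then combine this label with the confidence score, for example by surfacing a "strongly bullish" reading only when confidence is also high.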

StockHark is now in open beta and completely free during this phase. This allows us to refine the models, expand the dataset, and improve weighting logic. As the system matures and stabilizes, StockHark will move to a paid model to support infrastructure scaling and additional data sources beyond Reddit.

If you would like to share thoughts, request features, or discuss integrations, you can write to us anytime at contact@stockhark.com.
