<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eric P Green</title>
    <description>The latest articles on DEV Community by Eric P Green (@ericpgreen).</description>
    <link>https://dev.to/ericpgreen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F521875%2F4e7db276-225e-47fd-8b18-a0d62c300b34.PNG</url>
      <title>DEV Community: Eric P Green</title>
      <link>https://dev.to/ericpgreen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ericpgreen"/>
    <language>en</language>
    <item>
      <title>Analyzing the r/wallstreetbets hivemind — August 2021</title>
      <dc:creator>Eric P Green</dc:creator>
      <pubDate>Wed, 08 Sep 2021 10:06:03 +0000</pubDate>
      <link>https://dev.to/ericpgreen/analyzing-the-r-wallstreetbets-hivemind-august-2021-3gjm</link>
      <guid>https://dev.to/ericpgreen/analyzing-the-r-wallstreetbets-hivemind-august-2021-3gjm</guid>
      <description>&lt;p&gt;The activity in the Reddit r/wallstreetbets community is staggering. Each day, there are around 800 posts and 50,000 comments debating approximately 280 different stocks. But by just browsing Reddit, between the memes and degenerate gamblers, it can be hard to understand the full nature of the discussion.&lt;/p&gt;

&lt;p&gt;In this post, I’ve turned to a bit of SQL and Python to explore what’s happening in the wallstreetbets hivemind. I’ve analyzed &lt;strong&gt;stock popularity&lt;/strong&gt;, &lt;strong&gt;sophisticated but overlooked discussions&lt;/strong&gt;, and &lt;strong&gt;community influencers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you’re interested, here’s the &lt;a href="https://beneath.dev/examples/reddit"&gt;raw Reddit data&lt;/a&gt;, my &lt;a href="https://github.com/beneath-hq/beneath/blob/master/examples/wallstreetbets-analytics/stock-mentions/find_mentions_pipeline.py"&gt;data pipeline&lt;/a&gt;, the &lt;a href="https://beneath.dev/examples/wallstreetbets-analytics/"&gt;derived data&lt;/a&gt;, and my &lt;a href="https://github.com/beneath-hq/beneath/blob/master/examples/wallstreetbets-analytics/explore/explore.ipynb"&gt;Jupyter notebook&lt;/a&gt;. I’m using &lt;a href="https://about.beneath.dev/"&gt;Beneath&lt;/a&gt;, an open data platform I’m building, to stream and save the data.&lt;/p&gt;

&lt;p&gt;Btw, this isn’t investment advice… DYOR.&lt;/p&gt;

&lt;h2&gt;The meme stock rankings&lt;/h2&gt;

&lt;p&gt;Let’s start with the basics. What are the most discussed stocks and how have they changed over time?&lt;/p&gt;

&lt;p&gt;The stocks on wallstreetbets can be broadly bucketed into two categories: long-standing community interests and stocks tied to current events. We can see both categories in this line graph of mention share over time (for a selection of stocks):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9Ai8Y82V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/42y236j2l32kgn0dozbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9Ai8Y82V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/42y236j2l32kgn0dozbi.png" alt="ebbs-and-flows"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://chart-studio.plotly.com/~ericpgreen/8.embed"&gt;Click here to view interactive chart&lt;/a&gt;





&lt;p&gt;The long-standing wallstreetbets interests jump out: these are the lines that occupy a significant percentage of mentions over the whole time period. These include staples like Gamestop (GME) and AMC, but the community has also long been tracking Clover Health (CLOV) and AMD, the semiconductor manufacturer. It doesn’t look like wallstreetbets will lose interest in these anytime soon.&lt;/p&gt;

&lt;p&gt;On the other hand, we see stocks that spike suddenly due to specific events, such as Robinhood (HOOD) and Microvast (MVST), a lithium-ion battery manufacturer. Both went public at the end of July and received bursts of attention from the community, but the interest hasn’t lasted: as of September 1st, both stocks have a near-zero share of daily mentions.&lt;/p&gt;
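&lt;p&gt;&lt;em&gt;In case you want to reproduce the metric: here’s a minimal pandas sketch of the “share of daily mentions” calculation. It assumes a &lt;code&gt;mentions&lt;/code&gt; DataFrame with one row per day and symbol, which is an illustrative schema rather than my pipeline’s exact one.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# mentions: DataFrame with columns day, symbol, num_mentions (illustrative schema)

# Total mentions per day across all symbols
totals = mentions.groupby("day")["num_mentions"].transform("sum")

# Each stock's share of that day's discussion
share = mentions.assign(share=mentions["num_mentions"] / totals)

# One line per symbol over time, as in the chart above
share.pivot(index="day", columns="symbol", values="share").plot()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
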

&lt;p&gt;In this next chart, we zoom in on the most discussed stocks in the month of August.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fjqyMWYX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ewmfe7ycqhb9s2a0qct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fjqyMWYX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ewmfe7ycqhb9s2a0qct.png" alt="top10"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://chart-studio.plotly.com/~ericpgreen/6.embed"&gt;Click here to view interactive chart&lt;/a&gt;





&lt;p&gt;GME and AMC have long been community favorites, and even in August, they remain the most mentioned stocks. I’ve been collecting data from wallstreetbets since March, and the two companies have been the most discussed for 4 of those 6 months. But the distribution isn’t a power law: contending stocks get significant discussion, too.&lt;/p&gt;

&lt;h2&gt;Discussions of the next big thing&lt;/h2&gt;

&lt;p&gt;The NASDAQ includes over 4000 public equities, and the NYSE over 3000, so how does the community come to rally around certain stocks? One of my hypotheses is that some initial post triggers a deep and unique discussion that ultimately leads to community-wide attention. So, let’s try to find some interesting conversations.&lt;/p&gt;

&lt;p&gt;I’ve tried to uncover some under-appreciated discussions by filtering for posts with at least 15 comments and 25 upvotes, and sorting those posts by highest average words per comment. Here are the top 10 for the month of August:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jt5zhdhp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74b9htfi6a8qijrlrkv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jt5zhdhp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74b9htfi6a8qijrlrkv8.png" alt="sophisticated"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://chart-studio.plotly.com/~ericpgreen/12.embed"&gt;Click here to view interactive chart&lt;/a&gt;





&lt;p&gt;&lt;em&gt;If you’d like to read the discussions, the &lt;a href="https://chart-studio.plotly.com/~ericpgreen/12.embed"&gt;interactive chart&lt;/a&gt; includes a link to each post’s page on reddit.com. Note that the numbers might not reflect what you see on reddit.com because comments can be edited and deleted after the fact, and scores are continually changing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The posts above reveal fairly educated discussions about storylines that, for the most part, haven’t yet hit the wallstreetbets front page. Stocks like Lordstown Motors (RIDE), Ford (FORD), and Proterra (PRTA) have garnered little attention so far, but, in light of these deep discussions, they could be worth keeping an eye on.&lt;/p&gt;
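&lt;p&gt;&lt;em&gt;For the curious, the filter above translates to a few lines of pandas. This is a hedged sketch; the column names are assumptions, not my pipeline’s actual schema.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# posts: DataFrame with one row per post (columns are illustrative)

# Posts with enough engagement, ranked by average words per comment
filtered = posts[(posts["num_comments"] &gt;= 15) &amp; (posts["upvotes"] &gt;= 25)].copy()
filtered["words_per_comment"] = (
    filtered["total_comment_words"] / filtered["num_comments"]
)
top10 = filtered.sort_values("words_per_comment", ascending=False).head(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
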

&lt;p&gt;Another hypothesis I wanted to test is that the share of rocket emojis in a discussion could signal a stock’s momentum within the community. Here’s a ranking of the August posts with the highest percentage of commenters including a rocket emoji (filtered for posts with at least 25 comments):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MQ_q_yhg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3o0ktx9u5b09aknl69js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MQ_q_yhg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3o0ktx9u5b09aknl69js.png" alt="rockets"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://chart-studio.plotly.com/~ericpgreen/4.embed"&gt;Click here to view interactive chart&lt;/a&gt;





&lt;p&gt;Unsurprisingly, these posts reveal a number of meme stocks that have already made it to the front page, like CLOV and WISH. But there are also companies like Hut 8 Mining (HUT) and Bitfarms (BITF) that haven’t made it there (yet?). They’re definitely worth watching.&lt;/p&gt;
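&lt;p&gt;&lt;em&gt;The rocket-emoji signal boils down to a couple of groupbys. A sketch, assuming a &lt;code&gt;comments&lt;/code&gt; DataFrame with &lt;code&gt;post_id&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; columns (an illustrative schema):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Flag each comment that contains a rocket emoji
flagged = comments.assign(has_rocket=comments["body"].str.contains("🚀"))

# A commenter counts once per post, no matter how many rockets they posted
per_commenter = flagged.groupby(["post_id", "author"])["has_rocket"].any()

# Share of rocket-posting commenters per post, for posts with 25+ comments
rocket_share = per_commenter.groupby("post_id").mean()
n_comments = comments.groupby("post_id").size()
ranking = rocket_share[n_comments &gt;= 25].sort_values(ascending=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
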

&lt;p&gt;Behind every post and comment is a member of the wallstreetbets community. Let’s find out which authors are leading the discussion.&lt;/p&gt;

&lt;h2&gt;The influencers of wallstreetbets&lt;/h2&gt;

&lt;p&gt;To identify influencers, I wanted to find the active authors who get the most upvotes on substantial, forward-looking posts. To that end, I’ve applied a couple of criteria. First, I’ve excluded posts labeled as a “Meme,” “Gain,” or “Loss,” which are mostly retrospective. Second, I’ve filtered for authors who have posted at least once since July 1st. One of the most popular Redditors of all time was u/DeepF***ingValue, but his last post was on April 15th, and I want this analysis to be current.&lt;/p&gt;
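&lt;p&gt;&lt;em&gt;In pandas, those criteria look roughly like this (the flair labels and column names are illustrative assumptions):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Authors who have posted at least once since July 1st
recent_authors = posts.loc[posts["created_on"] &gt;= "2021-07-01", "author"].unique()

# Drop retrospective flairs, keep active authors, rank by average upvotes
eligible = posts[
    ~posts["flair"].isin(["Meme", "Gain", "Loss"])
    &amp; posts["author"].isin(recent_authors)
]
top_authors = (
    eligible.groupby("author")["upvotes"]
    .agg(posts="count", avg_upvotes="mean")
    .sort_values("avg_upvotes", ascending=False)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
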

&lt;p&gt;Here are the top authors since I started collecting data in March:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Uav16iZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39t4n8jq2l6iulwcejz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Uav16iZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39t4n8jq2l6iulwcejz7.png" alt="influencers"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://chart-studio.plotly.com/~ericpgreen/1.embed"&gt;Click here to view interactive chart&lt;/a&gt;





&lt;p&gt;The influencers that I found can be split into two categories: the analysts and the hype men.&lt;/p&gt;

&lt;p&gt;The analysts, like the two top authors u/quantkim and u/nobjos, contribute breaking news, technical analysis, and quantitative reports. For example, u/quantkim shares articles about GameStop’s corporate turnaround, like &lt;a href="https://www.reddit.com/r/wallstreetbets/comments/oesywr/gamestop_continues_expansion_of_fulfillment/"&gt;this one&lt;/a&gt;, and has averaged 11,332 upvotes over 15 posts.&lt;/p&gt;

&lt;p&gt;Conversely, the hype men typically talk up their big positions in popular stocks. Here’s one from u/dumbledoreRothIRA about &lt;a href="https://www.reddit.com/r/wallstreetbets/comments/osaz9c/600k_yolo_on_clov_bullish_as_ever_still_diamond/"&gt;a $600k position in $CLOV&lt;/a&gt;, and one from u/lookshee laying out his &lt;a href="https://www.reddit.com/r/wallstreetbets/comments/loio0b/i_am_going_to_buy_gamestop_all_of_it_dont_upvote/"&gt;intention to buy the entirety of the GameStop company&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;All the authors above clearly influence the community, so, to jump ahead of the crowd, it’d be smart to set up notifications for whenever they post.&lt;/p&gt;

&lt;h2&gt;What’s next&lt;/h2&gt;

&lt;p&gt;My analyses in this post really just scratch the surface of what you can infer from wallstreetbets data — there’s much more to do. To extend this work, I’m currently considering factoring in price movements, doing sentiment analysis, and creating a bot that mines for insights in real-time.&lt;/p&gt;

&lt;p&gt;Last week, &lt;a href="https://www.wsj.com/articles/wall-street-is-looking-to-reddit-for-investment-advice-11630056648"&gt;a Wall Street Journal article detailed that forward-thinking hedge funds are diving into the r/wallstreetbets data&lt;/a&gt;. By making this data public and queryable on Beneath, I hope I’ve made it more accessible to the everyday person!&lt;/p&gt;

&lt;p&gt;If you’re interested in any of this, come hang out in the Beneath &lt;a href="https://discord.gg/f5yvx7YWau"&gt;Discord community&lt;/a&gt;, follow me on Twitter &lt;a href="https://twitter.com/ericpgreen2"&gt;@ericpgreen2&lt;/a&gt;, or &lt;a href="https://beneath.dev/examples/wallstreetbets-analytics/"&gt;jump right into the data&lt;/a&gt; yourself 🚀🚀🚀&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Evolution of a data system</title>
      <dc:creator>Eric P Green</dc:creator>
      <pubDate>Mon, 05 Jul 2021 18:16:00 +0000</pubDate>
      <link>https://dev.to/ericpgreen/evolution-of-a-data-system-oei</link>
      <guid>https://dev.to/ericpgreen/evolution-of-a-data-system-oei</guid>
      <description>&lt;p&gt;The holy grail of data work is putting data science into production. But without an extensive data engineering background, you might not know how to build a production data system. In this post, I'll show how you can turn a machine learning model into a production data app by laying out the high-level system design of a simple Reddit analytics tool.&lt;/p&gt;

&lt;h2&gt;Let’s analyze the seriousness of Reddit posts&lt;/h2&gt;

&lt;p&gt;Reddit is a serious place for serious people, but sometimes subreddits become corrupted by miscreants who spread useless banter. To avoid such unpleasantries, we want to build a web app that can advise us of the seriousness of different subreddits.&lt;/p&gt;

&lt;p&gt;For our project, we’ll use machine learning to score the seriousness of every individual Reddit post. We’ll aggregate the scores by subreddit and time, and we’ll expose the insights via an API that we can integrate with a frontend. We want our insights to update in near real-time so we’re reasonably up-to-date with the latest posts.&lt;/p&gt;

&lt;p&gt;So that we’re clear on what the system should do, here’s the API interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/subreddit/[name]&lt;/code&gt;: Returns a) a subreddit’s posts and their seriousness scores, b) an all-time seriousness score, and c) hourly seriousness scores for the last week&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/subreddits&lt;/code&gt;: Returns all subreddits we track and the all-time seriousness score for each&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;

&lt;h2&gt;Phase 1: building the data ingestion engine&lt;/h2&gt;

&lt;p&gt;To start, we want to extract posts from Reddit and write them into our own storage system. Our storage system will have two components: a message queue and a database.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message queue&lt;/strong&gt;: We’ll use a message queue to both store and enrich data in real-time. We’ll use RabbitMQ to keep things simple.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: We’ll use a database to permanently store and serve the data. Our API server will get its data from here. We’ll use Postgres, the do-it-all relational data store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With our storage system in place (in theory), let’s write the first scripts of our data pipeline (a minimal sketch of both follows the list).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reddit scraper&lt;/strong&gt;: This script polls the Reddit API every second and writes new posts to a &lt;code&gt;posts&lt;/code&gt; topic in our message queue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;posts&lt;/code&gt; consumer&lt;/strong&gt;: This script reads data from the &lt;code&gt;posts&lt;/code&gt; topic and inserts it into our Postgres database.&lt;/li&gt;
&lt;/ul&gt;
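&lt;p&gt;Here’s a minimal sketch of those two scripts using the &lt;code&gt;pika&lt;/code&gt; RabbitMQ client and &lt;code&gt;psycopg2&lt;/code&gt;. The Reddit client wrapper, the table schema and the connection details are stand-ins, not production code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import time

import pika      # RabbitMQ client
import psycopg2  # Postgres client

def scraper(reddit, channel):
    """Poll Reddit every second and publish new posts to the `posts` topic."""
    while True:
        for post in reddit.fetch_new_posts():  # hypothetical Reddit API wrapper
            channel.basic_publish(exchange="", routing_key="posts",
                                  body=json.dumps(post))
        time.sleep(1)

def consumer(conn, channel):
    """Read the `posts` topic and insert each post into Postgres."""
    def on_message(ch, method, properties, body):
        post = json.loads(body)
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO posts (id, subreddit, title, created_on) "
                "VALUES (%s, %s, %s, %s) ON CONFLICT DO NOTHING",
                (post["id"], post["subreddit"], post["title"], post["created_on"]),
            )
        conn.commit()
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="posts", on_message_callback=on_message)
    channel.start_consuming()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
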

&lt;p&gt;We need a way to deploy and run our code in production. We like to do that with a CI/CD pipeline and a Kubernetes cluster.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD pipeline&lt;/strong&gt;: On every git commit, we’ll build our code as a Docker container, push it to a container registry, and deploy it to Kubernetes. GitHub Actions makes this easy to set up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes cluster&lt;/strong&gt;: Kubernetes is a platform for running containerized code. Kubernetes can also store our database and Reddit credentials, and inject them into our containers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll use a cloud provider to provision the message queue, database and Kubernetes cluster. We prefer managed services when they’re available, so we won’t deploy the message queue or database directly on Kubernetes.&lt;/p&gt;

&lt;p&gt;Here’s a diagram of what our system looks like so far:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---nYzNDIW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p06zvae1rmc6jfqesdp8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---nYzNDIW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p06zvae1rmc6jfqesdp8.png" alt="Phase 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once all this is up and running, we need to validate that the data is flowing. An easy way to do that is to connect to our Postgres database and check that new posts are continually added. A quick query (assuming the &lt;code&gt;posts&lt;/code&gt; table above with its &lt;code&gt;created_on&lt;/code&gt; timestamp) might look like:&lt;/p&gt;
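
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

conn = psycopg2.connect("postgresql://...")  # your managed Postgres instance

with conn.cursor() as cur:
    cur.execute(
        "SELECT count(*) FROM posts "
        "WHERE created_on &gt; now() - interval '5 minutes'"
    )
    print(cur.fetchone()[0], "posts ingested in the last 5 minutes")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When everything looks good, we’re ready to move on.&lt;/p&gt;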

&lt;h2&gt;Phase 2: training the machine learning model&lt;/h2&gt;

&lt;p&gt;Now that we have the raw data in Postgres, we’re ready to develop our moneymaker, the &lt;em&gt;seriousness&lt;/em&gt; scoring model. For this example, we’ll keep things simple and use a Jupyter notebook that pulls historical posts from the Postgres database.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jupyter notebook&lt;/strong&gt;: Inside the notebook, we label some training data, train and assess our model, and save the model to a file. Then our production code will be able to load the file to make inferences (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
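&lt;p&gt;The core of that notebook might look like this scikit-learn sketch. The TF-IDF features, the hand-labeled examples and the file name are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import joblib

# titles: list of post titles; labels: hand-labeled 0/1 seriousness
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(titles, labels)

# Persist the trained pipeline so the production workers can load it
joblib.dump(model, "seriousness-v1.joblib")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
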

&lt;p&gt;Note that there are other ways to train a machine learning model. Fancy “MLaaS” and “MLOps” tools can help you continuously train, monitor and deploy models. If you want to integrate with one of these tools, you’ll likely connect your database to enable training, and you’ll ping an API to make an inference.&lt;/p&gt;

&lt;p&gt;Here’s our system augmented with our ML development environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8c-i0rOg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f13oe9uos0zei9dk0fle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8c-i0rOg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f13oe9uos0zei9dk0fle.png" alt="Phase 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Phase 3: applying the model and aggregating the scores&lt;/h2&gt;

&lt;p&gt;Now it’s time to build the workers that will apply the model to new posts and write out the resulting seriousness scores. That’s two different scripts (the enrichment worker is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;posts&lt;/code&gt; enrichment&lt;/strong&gt;. This script consumes the Reddit &lt;code&gt;posts&lt;/code&gt; topic, applies the predictive model, and writes the data back to another topic &lt;code&gt;posts-scores&lt;/code&gt;, which will contain post IDs and seriousness scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;posts-scores&lt;/code&gt; consumer&lt;/strong&gt;. This script reads data from the &lt;code&gt;posts-scores&lt;/code&gt; topic and inserts the records into (a separate table in) our Postgres database.&lt;/li&gt;
&lt;/ul&gt;
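&lt;p&gt;A sketch of the enrichment worker, reusing the queue wiring from Phase 1 and the model file from Phase 2:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import joblib

model = joblib.load("seriousness-v1.joblib")

def on_post(ch, method, properties, body):
    """Score one post and publish the result to the posts-scores topic."""
    post = json.loads(body)
    score = float(model.predict_proba([post["title"]])[0][1])
    ch.basic_publish(exchange="", routing_key="posts-scores",
                     body=json.dumps({"id": post["id"], "score": score}))
    ch.basic_ack(delivery_tag=method.delivery_tag)

def run(channel):  # channel: a pika channel, set up as in Phase 1
    channel.basic_consume(queue="posts", on_message_callback=on_post)
    channel.start_consuming()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
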

&lt;p&gt;Next up, we want to aggregate our results by subreddit and time. We’ll use dbt, which allows us to schedule periodic SQL queries. We’ll schedule two aggregating queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Roll up new scores&lt;/strong&gt;: We’ll run this query every five minutes. On every run, it’ll calculate the mean scores of new posts and save the results to a table &lt;code&gt;subreddit-scores-5min&lt;/code&gt; in Postgres.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute total score&lt;/strong&gt;: This is a heavier query, so we’ll only run it once a day. It will compute each subreddit’s total seriousness score (across all time) and save the results to a table &lt;code&gt;subreddit-scores-total&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With that, we have all the data that we want for our app available in Postgres. Here’s what the system looks like now:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6yu-sfTH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1wis8nvudot1r8woqxn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6yu-sfTH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1wis8nvudot1r8woqxn7.png" alt="Phase 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Phase 4: completing a web app&lt;/h2&gt;

&lt;p&gt;Our last step is creating the interfaces for accessing our Reddit insights. We need to set up a backend API server and write our frontend code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API server&lt;/strong&gt;. The API server will fetch the insights from Postgres and serve the results to the frontend. It’ll implement the routes we specified in the introduction. We’ll build the API server in Python using the FastAPI framework (a route sketch follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frontend client&lt;/strong&gt;. The frontend will contain tables and charts for viewing and searching the insights. We’ll implement it with React and use a fancy charting library like Recharts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
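&lt;p&gt;A hedged sketch of the two routes; the &lt;code&gt;fetch_*&lt;/code&gt; helpers are hypothetical stand-ins for the actual Postgres queries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastapi import FastAPI

app = FastAPI()

@app.get("/subreddits")
async def list_subreddits():
    # All tracked subreddits with their all-time scores (subreddit-scores-total)
    return await fetch_total_scores()

@app.get("/subreddit/{name}")
async def get_subreddit(name: str):
    return {
        "posts": await fetch_scored_posts(name),    # posts joined with scores
        "all_time": await fetch_total_score(name),  # from subreddit-scores-total
        "hourly": await fetch_hourly_scores(name),  # rolled up from the 5-min table
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
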

&lt;p&gt;Deploy the API server and frontend code to Kubernetes, and we have ourselves a full stack analytics application! Here’s what the final design looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kZDIqg_R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yn9fgil1180htbnen4vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kZDIqg_R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yn9fgil1180htbnen4vp.png" alt="Phase 4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Reviewing the stack&lt;/h2&gt;

&lt;p&gt;Our Reddit analytics app is now ready to share with the world (at least on paper). We’ve set up a full stack that spans data ingest, model training, real-time predictions and aggregations, and a frontend to explore the results. It’s also a reasonably future-proof setup. We can do more real-time enrichment thanks to the message queue, and we can do more aggregations thanks to dbt.&lt;/p&gt;

&lt;p&gt;But the system does have its limitations. For scalability, we’re limited by the throughput of Postgres and RabbitMQ. For latency, we’re limited by the batched nature of dbt. To improve, we could add BigQuery as a data warehouse, use Kafka as our message queue, and add Flink as a real-time stream processor, but these powerful systems also come at the cost of greater complexity.&lt;/p&gt;

&lt;p&gt;While there are always different tools you can use for the same job, this data system design is fairly standard. I hope it gives you perspective on what it takes to build a live analytics-centric web application.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>datascience</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Turn a Pandas DataFrame into an API</title>
      <dc:creator>Eric P Green</dc:creator>
      <pubDate>Thu, 10 Jun 2021 13:20:22 +0000</pubDate>
      <link>https://dev.to/ericpgreen/turn-a-pandas-dataframe-into-an-api-57pk</link>
      <guid>https://dev.to/ericpgreen/turn-a-pandas-dataframe-into-an-api-57pk</guid>
      <description>&lt;p&gt;Pandas DataFrames are my favorite way to manipulate data in Python. In fact, the end product of many of my small analytics projects is just a data frame containing my results.&lt;/p&gt;

&lt;p&gt;I used to dump my dataframes to CSV files and save them to GitHub. But recently, I've been using &lt;a href="https://about.beneath.dev" rel="noopener noreferrer"&gt;Beneath&lt;/a&gt;, a data sharing service I'm building, to save my dataframes and simultaneously turn them into a full-blown API with a website. It's great when I need to hand off a dataset to clients or integrate the data into a frontend.&lt;/p&gt;

&lt;p&gt;In this post, I'll show you how that works! I'm going to fetch GitHub commits, analyze them, and use Beneath to turn the result into an API.&lt;/p&gt;

&lt;h2&gt;Set up Beneath&lt;/h2&gt;

&lt;p&gt;To get started, you need to install the Beneath &lt;code&gt;pip&lt;/code&gt; module and log in with a free Beneath account. It's pretty easy and the docs already cover it. Just follow &lt;a href="https://about.beneath.dev/docs/quick-starts/install-sdk/" rel="noopener noreferrer"&gt;these steps&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Make sure to remember your username as you'll need it in a minute!&lt;/p&gt;

&lt;h2&gt;Let's analyze some data&lt;/h2&gt;

&lt;p&gt;I think GitHub activity is a fascinating, underexplored data source. Let's scratch the surface and look at commits to... Pandas! Here's a quick script to fetch the &lt;code&gt;pandas&lt;/code&gt; source code and aggregate some daily stats on the number of commits and contributors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="c1"&gt;# Get all Pandas commit timestamps
&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas-dev/pandas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    if [ -d &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; ]; then rm -Rf &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; fi;
    git clone https://github.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.git repo;
    cd repo;
    echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp,contributor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;
    git log --pretty=format:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%ad,%ae&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; --date=iso
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Group by day and count number of commits and contributors
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;date_parser&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contributor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contributors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nunique&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename_axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the &lt;code&gt;df&lt;/code&gt; variable contains our insights. If you're following along, you can change the &lt;code&gt;repo&lt;/code&gt; variable to scrape another GitHub project. Just beware that some major repos can take a long time to analyze (I'm looking at you, &lt;a href="https://github.com/torvalds/linux" rel="noopener noreferrer"&gt;torvalds/linux&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;Save the DataFrame to Beneath&lt;/h2&gt;

&lt;p&gt;First, we'll create a new project to store our results. I'll do that from the command-line, but you can also use the web &lt;a href="https://beneath.dev/-/create/project" rel="noopener noreferrer"&gt;console&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;beneath project create USERNAME/github-fun
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just replace &lt;code&gt;USERNAME&lt;/code&gt; with your own username.&lt;/p&gt;

&lt;p&gt;Now, we're ready to publish the dataframe. We do it with a simple one-liner directly in Python (well, I split it over multiple lines, but it's still just one call):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;beneath&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;beneath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_full&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;table_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USERNAME/github-fun/pandas-commits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily commits to https://github.com/pandas-dev/pandas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few things going on here. Let's go through them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;table_path&lt;/code&gt; gives the full path for the output table, including our username and project.&lt;/li&gt;
&lt;li&gt;We use the &lt;code&gt;records&lt;/code&gt; parameter to pass our DataFrame.&lt;/li&gt;
&lt;li&gt;We provide a &lt;code&gt;key&lt;/code&gt; for the data. The auto-generated API uses the key to &lt;a href="https://about.beneath.dev/docs/reading-writing-data/index-filters/" rel="noopener noreferrer"&gt;index the data&lt;/a&gt; so we can quickly filter records. By default, Beneath will use our DataFrame's index as the key, but I prefer setting it manually.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;description&lt;/code&gt; parameter adds some documentation to the dataset that will be shown at the top of the table's page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's it! Now let's explore the results.&lt;/p&gt;

&lt;h2&gt;Explore your data&lt;/h2&gt;

&lt;p&gt;You can now head over to the &lt;a href="https://beneath.dev/?noredirect=1" rel="noopener noreferrer"&gt;web console&lt;/a&gt; and browse the data and its API docs. Mine's at &lt;a href="https://beneath.dev/epg/github-fun/table:pandas-commits" rel="noopener noreferrer"&gt;https://beneath.dev/epg/github-fun/table:pandas-commits&lt;/a&gt; (if you used the same project and table names, you can just replace my username &lt;code&gt;epg&lt;/code&gt; with your own).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6wmu2lv3cp5ccpa2bac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6wmu2lv3cp5ccpa2bac.png" alt="explore"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also share or publish the data. Permissions are managed at the project level, so just head over to the project page and add members or flip the project settings to &lt;code&gt;public&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;Use the API&lt;/h2&gt;

&lt;p&gt;Now that the data is in Beneath, anyone with access can use the API. On the "API" tab of the table page, we get auto-generated code snippets for integrating the dataset.&lt;/p&gt;

&lt;p&gt;For example, we can load the dataframe back into Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;beneath&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;beneath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_full&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USERNAME/github-fun/pandas-commits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or we can query the REST API and get the commit info for every day in May 2021:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://data.beneath.dev/v1/USERNAME/github-fun/pandas-commits &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;index &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nv"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"day":{"_gte":"2021-05-01","_lt":"2021-06-01"}}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the React hook to read data directly into the frontend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useRecords&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;beneath-react&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;App&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useRecords&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;USERNAME/github-fun/pandas-commits&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;index&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{"day":{"_gte":"2021-05-01","_lt":"2021-06-01"}}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check out the &lt;a href="https://beneath.dev/epg/github-fun/table:pandas-commits/-/api" rel="noopener noreferrer"&gt;API tab&lt;/a&gt; of my dataframe in the Beneath console to see all the ways to use the data.&lt;/p&gt;

&lt;h2&gt;That's it&lt;/h2&gt;

&lt;p&gt;That's it! We used Beneath to turn a Pandas DataFrame into an API. If you have any questions, I'm online most of the time in Beneath's &lt;a href="https://discord.gg/f5yvx7YWau" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; (I love to chat about data science, so you're also welcome to just say hi 👋). And let me know if you publish a cool dataset that I can spotlight in the featured projects!&lt;/p&gt;

</description>
      <category>python</category>
      <category>serverless</category>
      <category>datascience</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
