Background
So I was looking through the bits and pieces you can access within your DEV account and I noticed a rather interesting tab in my dashboard for analytics. While I'm not a data scientist by any stretch of the imagination, it's still pretty cool to see how you're doing on DEV.to. Taking a look, there are some pretty cool features showing things like readers, new followers and reactions. For reference, this is what mine looks like as I'm writing this article:
You can also get similar data out of an individual article.
From looking at this I can see that I clearly did well at the beginning of this month, and it's tapered off pretty quickly (maybe I need to look into writing something a bit more popular?). I can also see that I've got 2 spikes in my reader graph, which relate to an article I wrote around my setup for posting to DEV.to and another on how to convert your local git repository to another remote programmatically. Interestingly, my second popular article didn't have any corresponding engagement on DEV.to, which means it was probably popular via another website, rather than directly from DEV.to. Looking at the traffic summary for that second article, I can see this is true and that it's most popular on Twitter, for some reason?
Anyway, I'm getting a bit off track here, but after the first few minutes of looking at the analytics, there really isn't that much information you can get out, and it's also pretty general, i.e. some of the things I'm interested in finding out are probably pretty niche. So I spent some time thinking about whether there was a way to get out some more of the information I wanted to know. Turns out, there is!
Forem API
The first step to figuring out if there's anything I can do to get more information is to find out if there are any data sources accessible to me as a user. Fortunately for me, DEV is built on an open source software platform known as Forem. Luckily, it seems somebody else has also wanted to programmatically grab data out of DEV, and there's already an API I can use. Being honest, if this wasn't here I likely would have stopped at this point: while it would be possible to scrape data directly from DEV, it would have been a pain I didn't want to deal with, so this API makes everything possible.
While the API itself is incredibly useful, at the time of writing there are currently 2 versions of the API, which contain different endpoints:
This basically means I'm going to have to do some extra work later, as both versions contain useful stuff and I currently can't use the v1 API for everything. However, as I'll talk about later, I've got some ways to make this really easy to implement.
OpenSearch
Now that I know I've got some data I could use, I need to find a platform I can use to analyse the data coming from the Forem API. I did consider some other pieces of software, such as Google BigQuery (with Looker Studio) and ElasticSearch (with Kibana), but I ultimately went with OpenSearch, which is essentially a fork of ElasticSearch maintained by AWS. The main reason is that I can host it locally for free (unlike BigQuery). I do have some prior experience with both Elastic (back when it was called ELK) and OpenSearch, but my work with OpenSearch was far more recent, so I decided to go with that.
OpenSearch also provides code libraries that allow you to interact with the OpenSearch database directly from code. Given I'm writing something new that's specifically supposed to sit between just OpenSearch and an API, that makes it much more straightforward to implement than going down the more 'traditional' route of analysing log files (which is the 'L' in ELK).
OpenSearch really consists of 2 parts: the NoSQL database OpenSearch, and the data visualization tool known as OpenSearch Dashboards. As a NoSQL database, OpenSearch doesn't really have the concept of "tables" but uses indexes instead, which work pretty similarly. As with a fair number of other NoSQL databases, the schema for the data is figured out from the data itself, rather than set up beforehand. This does mean you can get into a mess if your data doesn't use standard data formats. Fortunately, as Forem is using OpenAPI, this shouldn't be an issue.
Installation
OpenSearch has an AWS managed service (of course), but the way I use it is via Docker. If you're using Windows (like me), you can use Docker Desktop to run the containers. OpenSearch provides 3 different ways to run it in Docker: a single node run directly via docker, plus a pair of docker-compose files. I decided against single node mode as I also wanted to run OpenSearch Dashboards, which both docker-compose files spin up automatically. Of the two compose files, there is a production version and a dev version, the difference being that the dev version has the security plugin disabled. Given I'm just running locally and the security plugin takes some extra setup, I went with the dev docker-compose file.
The docker-compose itself will create a pair of OpenSearch nodes in a cluster and an OpenSearch Dashboards instance to view the data:
version: '3'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
      - discovery.seed_hosts=opensearch-node1,opensearch-node2 # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2 # Nodes eligible to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - "DISABLE_INSTALL_DEMO_CONFIG=true" # Prevents execution of bundled demo script which installs demo certificates and security configurations to OpenSearch
      - "DISABLE_SECURITY_PLUGIN=true" # Disables security plugin
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network
  opensearch-node2:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node2 # Name the node that will run in this container
      - discovery.seed_hosts=opensearch-node1,opensearch-node2 # Nodes to look for when discovering the cluster
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2 # Nodes eligible to serve as cluster manager
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - "DISABLE_INSTALL_DEMO_CONFIG=true" # Prevents execution of bundled demo script which installs demo certificates and security configurations to OpenSearch
      - "DISABLE_SECURITY_PLUGIN=true" # Disables security plugin
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data # Creates volume called opensearch-data2 and mounts it to the container
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:latest
    container_name: opensearch-dashboards
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      - 'OPENSEARCH_HOSTS=["http://opensearch-node1:9200","http://opensearch-node2:9200"]'
      - "DISABLE_SECURITY_DASHBOARDS_PLUGIN=true" # disables security dashboards plugin in OpenSearch Dashboards
    networks:
      - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:
NOTE: As I said earlier, this is NOT a production setup; it's the dev compose file, so it should not be used outside a dev environment, where you need the extra security.
If this is the first time you've run an OpenSearch cluster in Docker Desktop, you might notice that the containers crash out and complain about something like vm.max_map_count is less than 262144. This is because the cluster needs more resources than Docker receives by default. While you can just run a command inside the WSL subsystem to fix it each time, I like to set up the config file so I don't get annoyed by having to do some command line setup after every restart. To fix this, you need to create a file called .wslconfig in C:\Users\<user running docker>. The full file path would be something like C:\Users\jack.lewis\.wslconfig, and you then need to add the following to it:
[wsl2]
memory=4GB # increases the memory from 2GB to 4GB
processors=4 # increases the number of processors to 4
kernelCommandLine="sysctl.vm.max_map_count=262144" # increases the max map count to the minimum required by OpenSearch
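Note that the .wslconfig file is only read when WSL starts, so you'll need to run wsl --shutdown and restart Docker Desktop before the new settings take effect.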
At this point, if you used the compose file shown above, the containers should start and look like the following:
From this, what you need to know is that http://localhost:9200 is the address you need to use to push data into OpenSearch, and http://localhost:5601 can be opened in the browser to look at OpenSearch Dashboards.
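As a quick sanity check (and to give an idea of what the REST API looks like), here's a minimal C# sketch that hits the cluster's root and health endpoints. This isn't part of my project - just a way to confirm the cluster is reachable before pushing any data:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ClusterCheck
{
    public static async Task Main()
    {
        using var client = new HttpClient();

        // The root endpoint returns basic cluster info as JSON, while
        // _cluster/health reports the cluster's green/yellow/red status
        Console.WriteLine(await client.GetStringAsync("http://localhost:9200"));
        Console.WriteLine(await client.GetStringAsync("http://localhost:9200/_cluster/health"));
    }
}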
If you just want to explore the things OpenSearch can do, at this point you can open up Dashboards, press the home button on the left-hand side, then press add sample data, and it should give you a choice of sample data sets to play around with:
Finding data to analyse
Immediately, I was very happy to note that Forem provides an OpenAPI spec document for both versions of the API, which is going to make it really easy to explore.
All you need to do is download the spec, and then drag the file into Postman and you instantly have a quick way to explore all the endpoints in the Forem API:
NOTE: if you want to actually return data, you need to set an API key at the top-level folder (where it says DEV API (beta) for V0 and Forem API V1 for V1), under variables. You can get a DEV API key from here, at the point where it talks about DEV Community API Keys.
I think I might have also had to change the V1 base URL from https://dev.to/api to https://dev.to.
I should really say, though, that the most common work I do day to day is building and integrating with APIs. From this, I know how difficult it is to get an OpenAPI spec right, and while the one from Forem does have some minor issues, on the whole I've had to do less messing around with it than with pretty much anything else I've integrated previously - which is great!
At this point, I can start exploring for data that would be useful. When I'm looking to integrate something into OpenSearch, I'm looking for 2 things:
- WHAT happened
- WHEN it happened
These, to me, are the most important things, as OpenSearch is a time-based analytics dashboard (when it happened) and the most common thing I do in it is count how often something occurs (what happened).
From this, I'm drawn to the user's published articles endpoint in V1 and the followers endpoint in V0. There's a lot more than this that could be useful, but at this point I'm pretty much just building a proof of concept, so these were the 2 endpoints that were most immediately useful to me (and honestly, the easiest to pull some useful stuff out of).
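To give an idea of what these endpoints look like raw, before any generated client gets involved, here's a rough C# sketch of calling both of them. The api-key header and the vendor Accept header for V1 match what the spec describes, but treat the details as indicative rather than definitive:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ForemEndpointCheck
{
    public static async Task Main()
    {
        var apiKey = Environment.GetEnvironmentVariable("FOREM_API_KEY") ?? "";

        using var client = new HttpClient { BaseAddress = new Uri("https://dev.to") };
        client.DefaultRequestHeaders.Add("api-key", apiKey);

        // Followers only exists on V0, which is what you get with no version header
        var followers = await client.GetStringAsync("/api/followers/users");

        // V1 endpoints are selected via a vendor-specific Accept header
        using var request = new HttpRequestMessage(HttpMethod.Get, "/api/articles/me/published");
        request.Headers.Add("Accept", "application/vnd.forem.api-v1+json");
        using var response = await client.SendAsync(request);
        var articles = await response.Content.ReadAsStringAsync();

        Console.WriteLine(followers);
        Console.WriteLine(articles);
    }
}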
Integration
At this point, I need to glue everything together. To do this, I need a program which can pull down data from Forem and then push it into the OpenSearch cluster.
If you just want to take a look at the final solution, it lives here:
jlewis92 / ForemAnalyticsGatherer
Retrieves analytics for forem
Programming language choice
It really wasn't much of a choice in this case, as C# is my main language and I do enjoy working with it. Additionally, while .Net 7 is out, after being burnt by .Net Core 3.0's end-of-life being announced only a few days before it happened, I pretty much always work with the long term support version if I can help it, so I went with .Net 6.
Forem integration
While I could have written out a full client to talk to the Forem API by hand, this would have taken a long time, and I had an OpenAPI spec document, so I used OpenAPI Generator, which can pretty much instantly generate a full API integration from just the OpenAPI spec. While I'm using C# and .Net 6, this tool has a pretty large list of supported languages, so if you wanted to code something similar, you could absolutely use your language of choice rather than C#.
While there are a lot of different ways to install OpenAPI Generator, the easiest is probably NPM, where you just need to run npm install @openapitools/openapi-generator-cli -g and you should be able to start using the generator.
For me, all I needed to do was run the generator twice, once for V0 and again for V1 using the following command:
openapi-generator-cli generate -i <location of the OpenAPI file on disc> -g csharp-netcore -o <output location> --additional-properties=targetFramework=net6.0,apiName=ForemVersion<Zero or One>,packageName=ForemVersion<Zero or One>
Breaking this down:

openapi-generator-cli generate # generate an API
-i # The location of the Forem API OpenAPI yaml
-g csharp-netcore # The name of the generator - .Net 5+ lives in this one as well
-o # Where you want to save the output
--additional-properties= # Properties that are generator specific
targetFramework=net6.0 # This is where I set .Net 6
apiName=ForemVersion<Zero or One> # The name of the default API class that gets generated - it would normally be DefaultApi
packageName=ForemVersion<Zero or One> # The name of the C# project that will be generated
After you run this command, go to the output directory, then into src, and you should see 2 C# projects:
As you can see, it generates 2 C# projects that you can import into a C# solution. Rather helpfully, the generator also generates a load of tests for the generated code. All I needed to do at this point was import the generated projects into my own via the solution explorer in Visual Studio, and I had a ready-built integration into Forem:
For those who use C#, the generator uses the RestSharp library by default, as opposed to HttpClient, but it's possible to change this if you really want to.
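Using the generated client then looks something like the sketch below. I should stress that the names here are illustrative: the API class name comes from the apiName property set above, the method names come from the spec's operationIds, and the api-key setup follows the pattern in the generated README, so check what actually lands in your output folder:

using System;
using ForemVersionOne.Api;
using ForemVersionOne.Client;

// The generated Configuration holds the base path and the api-key header
var config = new Configuration { BasePath = "https://dev.to" };
config.ApiKey.Add("api-key", Environment.GetEnvironmentVariable("FOREM_API_KEY"));

// Class and method names are placeholders - they depend on the
// apiName property and the spec's operationIds
var api = new ForemVersionOneApi(config);
var articles = await api.GetUserPublishedArticlesAsync(page: 1, perPage: 30);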
There were some bits and pieces I did need to update in the integration though, as code built this way will do exactly what the swagger definition tells it to do. For example, if a value on a Request or Response is marked as Required and you either don't set the value, or the value comes back as null, the API will throw an error. When I was doing this, there were a few values in the Articles response that are marked as required but which I didn't get back, so I did need to modify the generated code very slightly to remove the IsRequired attribute from a few of the Articles models.
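For illustration, the change is just dropping IsRequired from the generated DataMember attribute (the field name here is hypothetical - check which ones actually fail for you):

// Before: deserialization throws if the API omits the field
[DataMember(Name = "cover_image", IsRequired = true, EmitDefaultValue = true)]
public string CoverImage { get; set; }

// After: the field is now allowed to be missing or null
[DataMember(Name = "cover_image", EmitDefaultValue = true)]
public string CoverImage { get; set; }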
OpenSearch integration
Note: The next few bits are heavily C# based, so if you're not that interested in C# you can probably skip to the point where I start talking about OpenSearch again.
While ElasticSearch was originally built to ingest log files, you can also use a package to directly integrate OpenSearch into the code itself, which is what I did with this project. If you're interested in trying this out, you can find a list of supported clients here.
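For reference, getting a client up and running against the local cluster is only a couple of lines. This is a minimal sketch assuming the high-level OpenSearch.Client .NET library and the compose file from earlier (which only publishes node1's port 9200 to the host):

using System;
using OpenSearch.Client;

// Only node1 publishes port 9200 to the host in the compose file above,
// so a single-node connection is enough for local use
var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultIndex("articles"); // used when a call doesn't name an index

var client = new OpenSearchClient(settings);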
Integrating OpenSearch into an object-oriented language is extremely simple because all you need to do is pass the objects you want to index into OpenSearch, and then the rest is pretty much handled for you. For reference, here is pretty much all the code I use to push articles into OpenSearch (after setup):
/// <summary>
/// Indexes articles into OpenSearch
/// </summary>
/// <param name="articles">The articles to index</param>
/// <returns>A response based on whether the upload succeeded</returns>
public async Task<BulkResponse> IndexArticleData(List<ForemVersionOne.Model.ArticleIndex> articles)
{
    var response = await _openSearchClient.IndexManyAsync(articles, "articles");
    return response;
}
While I could spend some time adding additional error handling etc., this code is a proof of concept, so I'm not too bothered. It should also be noted that the object I'm passing into this method comes from the generated API code discussed above.
Breaking down what I'm doing in OpenSearch:

await // I'm using async methods to push the data
_openSearchClient // calling an OpenSearch client I set up in this class
.IndexManyAsync // I'm using IndexMany instead of Index, as I just want to push in a load of articles and then forget about it
(articles, // The list of articles I want to push into OpenSearch
"articles"); // The index in OpenSearch I'm pushing the data to
Re-indexing articles that I've previously indexed does not cause a new version of the article to be indexed, but instead updates the old article. While I've not dug into the OpenSearch code, I'm assuming this means there's something tracking identifiers attached to the data being pushed in (which this endpoint's data does have).
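Assuming the .NET client behaves like the NEST library it was forked from, it infers the document _id from a property named Id on whatever you pass in, which would explain the overwrite behaviour. A hypothetical sketch of indexing a single article with an explicit id:

// The client infers _id from the Id property by default, so pushing an
// article with the same Id overwrites the stored copy (bumping _version)
// rather than creating a duplicate document
var response = await _openSearchClient.IndexAsync(article, i => i
    .Index("articles")
    .Id(article.Id)); // explicit here, but inference would pick this up anyway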
Pulling everything together
Now that I've got code that can handle both pulling data down from Forem and pushing that data into OpenSearch, I just need to pull everything together and provide an interface that's easy to use.
For pulling everything together, it's not that special: I'm just using a standard C# library project that takes in an AppSettings object for settings. I did decide that I wanted the ability to toggle the collection of data from each endpoint, so I split out the code along those lines. Also, given the data is paginated, I loop through until I've got all the data. I understand this is not the most "efficient" way of doing this, as I'm retrieving data I've previously indexed, but that might be something I look at in the future. If you're interested in how this looks, the code for the articles endpoint is here.
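The loop itself isn't anything clever - roughly the shape below, though this is a sketch rather than the repository's exact code, and GetUserPublishedArticlesAsync stands in for whatever the generated client actually calls it:

private async Task<List<ForemVersionOne.Model.ArticleIndex>> GetAllPublishedArticles()
{
    var articles = new List<ForemVersionOne.Model.ArticleIndex>();
    const int perPage = 100;

    // Keep asking for the next page until a short page comes back,
    // which means there's nothing left to fetch
    for (var page = 1; ; page++)
    {
        var batch = await _foremApi.GetUserPublishedArticlesAsync(page: page, perPage: perPage);
        articles.AddRange(batch);

        if (batch.Count < perPage)
        {
            return articles;
        }
    }
}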
Settings
In terms of access, I decided the easiest thing to do was to create a console app project that links into the analytics library easily and is very flexible. For example, as it's just an executable, it can be run directly, via a service, or via a scheduled task with minimal setup. This is all facilitated via a timer event fired off after a set amount of time (defaulting to once a day), so you don't need to keep running the data collection task. I also added the ability to run as a Linux docker container, because Visual Studio makes this only a couple of button presses to add:
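Conceptually, the timer loop is as simple as the sketch below, using .Net 6's PeriodicTimer, with GatherData standing in for the calls into the gatherer library (the repository's actual implementation may differ):

// Gather once on start-up, then again on every tick (default: daily)
using var timer = new PeriodicTimer(TimeSpan.FromDays(1));

do
{
    await GatherData();
}
while (await timer.WaitForNextTickAsync());

async Task GatherData()
{
    // the calls into the ForemGatherer library would live here
}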
In order to work, the application needs some settings passed in; most notably, there is an AppSettings file built into the project. This controls settings for the ForemGatherer library itself, and by default it looks like this:
{
  "AppSettings": {
    "BasePath": "https://dev.to",
    "NodeList": [
      "http://localhost:9200/"
    ],
    "ApiKey": "" // While the API key can be set in this file, I don't recommend it
  }
}
The project is set up to pull settings in via several methods:
- directly into the file - support is also there for using NETCORE_ENVIRONMENT, i.e. appsettings.dev.json
- environment variables
- user secrets
- command line arguments
  - these override all other methods (see the sketch below)
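In .Net, this layering falls out of ConfigurationBuilder, where sources added later override earlier ones. Roughly like this sketch, which is not necessarily the project's exact code (AppSettings here is the settings object described above):

using System;
using Microsoft.Extensions.Configuration;

var environment = Environment.GetEnvironmentVariable("NETCORE_ENVIRONMENT") ?? "dev";

// Sources added later win, which is what lets command line
// arguments override everything else
var configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: false)
    .AddJsonFile($"appsettings.{environment}.json", optional: true)
    .AddEnvironmentVariables()
    .AddUserSecrets<Program>(optional: true)
    .AddCommandLine(args)
    .Build();

var settings = configuration.GetSection("AppSettings").Get<AppSettings>();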
The command line arguments supported are as follows:
-a, --articleGatherer (Default: true) Whether the article gatherer is enabled
-f, --followerGatherer (Default: true) Whether the follower gatherer is enabled
-f, --GatherData (Default: once per day) How often to gather data in TimeSpan format
-k, --ApiKey The API key used to connect to the Forem API
-n, --NodeList A list of opensearch nodes
-b, --BasePath (Default: https://dev.to) The base path of the Forem site
--help Display this help screen.
--version Display version information.
As I said, I can also run this as a Docker container, and I've got some vague ideas to link it into a docker compose so that I can run everything together:
All you need to do is run the code and verify that you're getting some data in OpenSearch. You can do this by going to http://localhost:9200/_cat/indices?v in a web browser, and you should see something like the following:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .opensearch-observability fnli1mGtSJCzRUsMXXQJYg 1 1 0 0 416b 208b
green open followers 2bSUsOgGSBOi3rJyuq3NPw 1 1 52 0 73.6kb 36.8kb
green open .opendistro-reports-definitions SqwXfm_BRMmyTk305ktGaQ 1 1 0 0 416b 208b
green open articles 2LNsx-bOS6GizJvoyRAfLQ 1 1 8 0 112.8kb 56.4kb
green open .opendistro-reports-instances 40Wf6ofGStSqqJW7Z3-jew 1 1 0 0 416b 208b
green open .kibana_1 bc9P7hL8SJ68jDutyPajFg 1 1 23 9 95.4kb 47.7kb
OpenSearch Dashboards
Once the data is in OpenSearch, I need to make it available to OpenSearch Dashboards. To do this, open the hamburger menu on the left and go to Management > Stack Management > Index Patterns, and this should give you the following page:
Index patterns are used to tell OpenSearch Dashboards what data you're going to use. They can incorporate as many or as few indexes as you want, but given you also need to set a time field, which is different between followers and articles, I just created 2 index patterns. For articles, you get a few choices of time field on the next screen, but I'm most interested in published articles, so that's the field I chose. This has the knock-on impact of essentially removing my draft articles, which is also what I wanted.
Now that I've got data into OpenSearch Dashboards, I need to analyse it. To start with, I used the Discover tool to see if there was anything interesting, which I did find pretty quickly, such as DEV using Cloudinary for image storage and that I've been fairly consistent in releasing articles:
One of the nicest things about OpenSearch is that I can set exactly how long a date range I want to view:
This is pretty helpful, as it lets me set the exact length of time I'm interested in, as opposed to the stats page, where the only option after monthly is infinite, which makes it pretty difficult to see what's going on each day now that I've been on the site (slightly) more than a month.
I'm not going to go through how I made every individual visualization, but if you're interested I've dropped a copy of the dashboard I built in the repository for this project. Here are some pictures that I've taken of the dashboard:
It definitely bears repeating that I'm not a data scientist, so I'm sure you could figure out some better things to look at (as well as not constantly changing case in the names of the visualizations). I do think this gives a pretty good idea of the things I can gather to help me see how I'm doing on DEV.
Next steps
It's pretty clear the API is still heavily under development by the Forem team, evidenced by the "Tags" endpoint disappearing while I was writing this article (to be fair, it wasn't THAT useful anyway) and by a fairly new pull request for removing V0 endpoints, but I think it's likely that access to the OpenAPI docs (and followers and articles) will stay around. I'm thinking I could pull out some more data based on the endpoints I do have, but I need to do some thinking about it. Also, given the API is under active development, and there's a fair amount of data DEV has access to in the analytics console that I can't get via the API, it would be nice to extend my dashboard to include that data if the API does get updated.