DEV Community

OSS Insight
OSS Insight

Posted on

2 2

Data Preparation for Analytics

All the data we use here on this website sources from GH Archive, a non-profit project that records and archives all GitHub events data since 2011. The total data volume archived by GH Archive can be up to 4 billion rows. We download the json file on GH Archive and convert it into csv format via Script, and finally load it into the TiDB cluster in parallel through TiDB-Lightning.

In this post, we will explain step by step how we conduct this process.

Prepare the data in csv format for TiDB Lighting.
├── gharchive_dev.github_events.000000000000.csv
├── gharchive_dev.github_events.000000000001.csv
├── gharchive_dev.github_events.000000000002.csv
├── gharchive_dev.github_events.000000000003.csv
├── gharchive_dev.github_events.000000000004.csv
├── gharchive_dev.github_events.000000000005.csv
├── gharchive_dev.github_events.000000000006.csv
├── gharchive_dev.github_events.000000000007.csv
├── gharchive_dev.github_events.000000000008.csv
├── gharchive_dev.github_events.000000000009.csv
├── gharchive_dev.github_events.000000000010.csv
├── gharchive_dev.github_events.000000000011.csv
├── gharchive_dev.github_events.000000000012.csv
├── gharchive_dev.github_events.000000000013.csv
Configure the TiDB Lightning as follows.
cat tidb-lightning.toml
[mydumper.csv]
separator = ','
delimiter = '"'
header = true
not-null = false
backslash-escape = true
trim-last-separator = false

[tikv-importer]
backend = "local"
sorted-kv-dir = "/kvdir/"

disk-quota = "1.5TiB"

[mydumper]
data-source-dir = "/csv_dir/"
strict-format = false
no-schema = true

[tidb]
host = "xxx"
port = 3306
user = "github_events"
password = "******"

[lightning]
check-requirements = false
region-concurrency = 32
meta-schema-name = "gharchive_meta"
Load the data into the TiDB cluster.
nohup tidb-lightning -config ./tidb-lightning.toml > nohup.out
Convert the unstructured json file provided by GH Archive into structured data.
gharchive_dev> desc github_events;
+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id | bigint(20) | YES | MUL | | |
| type | varchar(255) | YES | MUL | | |
| created_at | datetime | YES | MUL | | |
| repo_id | bigint(20) | YES | MUL | | |
| repo_name | varchar(255) | YES | MUL | | |
| actor_id | bigint(20) | YES | MUL | | |
| actor_login | varchar(255) | YES | MUL | | |
| actor_location | varchar(255) | YES | | | |
| language | varchar(255) | YES | MUL | | |
| additions | bigint(20) | YES | MUL | | |
| deletions | bigint(20) | YES | MUL | | |
| action | varchar(255) | YES | MUL | | |
| number | int(11) | YES | | | |
| commit_id | varchar(255) | YES | MUL | | |
| comment_id | bigint(20) | YES | MUL | | |
| org_login | varchar(255) | YES | MUL | | |
| org_id | bigint(20) | YES | MUL | | |
| state | varchar(255) | YES | | | |
| closed_at | datetime | YES | MUL | | |
| comments | int(11) | YES | MUL | | |
| pr_merged_at | datetime | YES | MUL | | |
| pr_merged | tinyint(1) | YES | | | |
| pr_changed_files | int(11) | YES | MUL | | |
| pr_review_comments | int(11) | YES | MUL | | |
| pr_or_issue_id | bigint(20) | YES | MUL | | |
| event_day | date | YES | MUL | | |
| event_month | date | YES | MUL | | |
| author_association | varchar(255) | YES | | | |
| event_year | int(11) | YES | MUL | | |
| push_size | int(11) | YES | | | |
| push_distinct_size | int(11) | YES | | | |
+--------------------+--------------+------+-----+---------+-------+
With structured data at hand, we can start to make further analysis with TiDB Cloud. Execute SQL commands to generate analytical results. For example, you can execute SQL commands below to output the top 10 most starred JavaScript framework repos in 2021.
SELECT js.name, count(*) as stars
FROM github_events
JOIN js_framework_repos js ON js.id = github_events.repo_id
WHERE type = 'WatchEvent' and event_year = 2021
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
+-------------------+-------+
| name | stars |
+-------------------+-------+
| facebook/react | 22830 |
| sveltejs/svelte | 18573 |
| vuejs/vue | 18015 |
| angular/angular | 11037 |
| alpinejs/alpine | 6993 |
| preactjs/preact | 2965 |
| hotwired/stimulus | 1355 |
| marko-js/marko | 1006 |
| neomjs/neo | 826 |
| tastejs/todomvc | 813 |
+-------------------+-------+
We have analyzed all the GitHub projects regarding databases, JavaScripe frameworks, programming languages, web frameworks, and low-code development tools, and provided valuable insights in 2021, in real time, and custom insights. If the repository you care about is not included here, you're welcome to submit your PR here. If you want to gain more insights into other areas, you can try TiDB Cloud by yourselves with this 10-minute tutorial.

Below are the areas of GitHub projects we have analyzed.

gharchive_dev> show tables;
+-----------------------------+
| Tables_in_gharchive_dev |
+-----------------------------+
| cn_repos |
| css_framework_repos |
| db_repos |
| github_events |
| js_framework_repos |
| nocode_repos |
| programming_language_repos |
| static_site_generator_repos |
| web_framework_repos |
+-----------------------------+

https://pingcap-ossinsight-build-pr-271.surge.sh/blog/how-it-works

API Trace View

How I Cut 22.3 Seconds Off an API Call with Sentry 🕒

Struggling with slow API calls? Dan Mindru walks through how he used Sentry's new Trace View feature to shave off 22.3 seconds from an API call.

Get a practical walkthrough of how to identify bottlenecks, split tasks into multiple parallel tasks, identify slow AI model calls, and more.

Read more →

Top comments (0)

Billboard image

The Next Generation Developer Platform

Coherence is the first Platform-as-a-Service you can control. Unlike "black-box" platforms that are opinionated about the infra you can deploy, Coherence is powered by CNC, the open-source IaC framework, which offers limitless customization.

Learn more

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay