Data Preparation for Analytics

#github #opensource #database #discuss

All the data we use here on this website sources from GH Archive, a non-profit project that records and archives all GitHub events data since 2011. The total data volume archived by GH Archive can be up to 4 billion rows. We download the json file on GH Archive and convert it into csv format via Script, and finally load it into the TiDB cluster in parallel through TiDB-Lightning.

In this post, we will explain step by step how we conduct this process.

Prepare the data in csv format for TiDB Lighting.
├── gharchive_dev.github_events.000000000000.csv
├── gharchive_dev.github_events.000000000001.csv
├── gharchive_dev.github_events.000000000002.csv
├── gharchive_dev.github_events.000000000003.csv
├── gharchive_dev.github_events.000000000004.csv
├── gharchive_dev.github_events.000000000005.csv
├── gharchive_dev.github_events.000000000006.csv
├── gharchive_dev.github_events.000000000007.csv
├── gharchive_dev.github_events.000000000008.csv
├── gharchive_dev.github_events.000000000009.csv
├── gharchive_dev.github_events.000000000010.csv
├── gharchive_dev.github_events.000000000011.csv
├── gharchive_dev.github_events.000000000012.csv
├── gharchive_dev.github_events.000000000013.csv
Configure the TiDB Lightning as follows.
cat tidb-lightning.toml
[mydumper.csv]
separator = ','
delimiter = '"'
header = true
not-null = false
backslash-escape = true
trim-last-separator = false

[tikv-importer]
backend = "local"
sorted-kv-dir = "/kvdir/"

disk-quota = "1.5TiB"

[mydumper]
data-source-dir = "/csv_dir/"
strict-format = false
no-schema = true

[tidb]
host = "xxx"
port = 3306
user = "github_events"
password = "******"

[lightning]
check-requirements = false
region-concurrency = 32
meta-schema-name = "gharchive_meta"
Load the data into the TiDB cluster.
nohup tidb-lightning -config ./tidb-lightning.toml > nohup.out
Convert the unstructured json file provided by GH Archive into structured data.
gharchive_dev> desc github_events;
+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id | bigint(20) | YES | MUL | | |
| type | varchar(255) | YES | MUL | | |
| created_at | datetime | YES | MUL | | |
| repo_id | bigint(20) | YES | MUL | | |
| repo_name | varchar(255) | YES | MUL | | |
| actor_id | bigint(20) | YES | MUL | | |
| actor_login | varchar(255) | YES | MUL | | |
| actor_location | varchar(255) | YES | | | |
| language | varchar(255) | YES | MUL | | |
| additions | bigint(20) | YES | MUL | | |
| deletions | bigint(20) | YES | MUL | | |
| action | varchar(255) | YES | MUL | | |
| number | int(11) | YES | | | |
| commit_id | varchar(255) | YES | MUL | | |
| comment_id | bigint(20) | YES | MUL | | |
| org_login | varchar(255) | YES | MUL | | |
| org_id | bigint(20) | YES | MUL | | |
| state | varchar(255) | YES | | | |
| closed_at | datetime | YES | MUL | | |
| comments | int(11) | YES | MUL | | |
| pr_merged_at | datetime | YES | MUL | | |
| pr_merged | tinyint(1) | YES | | | |
| pr_changed_files | int(11) | YES | MUL | | |
| pr_review_comments | int(11) | YES | MUL | | |
| pr_or_issue_id | bigint(20) | YES | MUL | | |
| event_day | date | YES | MUL | | |
| event_month | date | YES | MUL | | |
| author_association | varchar(255) | YES | | | |
| event_year | int(11) | YES | MUL | | |
| push_size | int(11) | YES | | | |
| push_distinct_size | int(11) | YES | | | |
+--------------------+--------------+------+-----+---------+-------+
With structured data at hand, we can start to make further analysis with TiDB Cloud. Execute SQL commands to generate analytical results. For example, you can execute SQL commands below to output the top 10 most starred JavaScript framework repos in 2021.
SELECT js.name, count(*) as stars
FROM github_events
JOIN js_framework_repos js ON js.id = github_events.repo_id
WHERE type = 'WatchEvent' and event_year = 2021
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
+-------------------+-------+
| name | stars |
+-------------------+-------+
| facebook/react | 22830 |
| sveltejs/svelte | 18573 |
| vuejs/vue | 18015 |
| angular/angular | 11037 |
| alpinejs/alpine | 6993 |
| preactjs/preact | 2965 |
| hotwired/stimulus | 1355 |
| marko-js/marko | 1006 |
| neomjs/neo | 826 |
| tastejs/todomvc | 813 |
+-------------------+-------+
We have analyzed all the GitHub projects regarding databases, JavaScripe frameworks, programming languages, web frameworks, and low-code development tools, and provided valuable insights in 2021, in real time, and custom insights. If the repository you care about is not included here, you're welcome to submit your PR here. If you want to gain more insights into other areas, you can try TiDB Cloud by yourselves with this 10-minute tutorial.

Below are the areas of GitHub projects we have analyzed.

https://pingcap-ossinsight-build-pr-271.surge.sh/blog/how-it-works

DEV Community

Data Preparation for Analytics

Top comments (0)