DEV Community: Michael Li

TL;DR CTF Writeups: TryHackMe(Vulnversity)

Michael Li — Tue, 18 May 2021 00:23:05 +0000

Series Brief

Cybersecurity has always been something I wanted to get into, or at least apply my data science skills to. Not because of the 'Mr. Robot'-style Hollywood hacking scenes, though. Rather, I was drawn to it because it embodies curiosity, hunger for knowledge, problem-solving, and the mentality of always tinkering with things.

Enter TryHackMe, the popular online platform that lets you learn cybersecurity through many short, gamified labs (VMs spun up with a specific purpose and configuration). It is quite beginner-friendly too. You don't have to build your own home lab and deal with virtual machine configuration. The machines are ready-made and ready to be exploited. It also gives you access to other people's wisdom (CTF rooms developed by experts or peers), which you can't get if you build your own home lab. This series of writeups records how I felt about the challenges and what I learned from them, in a TL;DR kind of style.

What to Expect for TLDR CTF Writeups

Might Be Helpful if You:

✔️ Want a 20,000 ft view of what this room is about
✔️ Want all the leads to help you through the journey but still want to tackle it on your own
✔️ Finished the room but still want to see some good takeaways

Might Want to Check Other Writeups if you:

🔶 Want a detailed step-by-step writeup to guide you through the challenge
For this purpose, I recommend this writeup

Overall Feelings

Onto the room that we'll be discussing today: TryHackMe: Vulnversity. It is the first real room if you choose the Offensive Pentesting path (the earlier Getting Started and Tutorial rooms are too easy to count). Overall, I felt this room is quite well designed. It requires some Linux command-line knowledge but nothing too fancy. It breaks the whole task into bite-size pieces, so no single sub-task is too daunting to tackle. It condenses the typical hacking process of recon, exploit, and post-exploit into these simple tasks, introduces you to the tools, and guides you through the thought process. Note that you still need to do some research on your own. The room won't spoon-feed you everything. But that's exactly where the fun is, right?

Reasons to Try This Room

Learn the basic hacking process (Recon -> Exploit -> Post-Exploit)
Get familiar with some essential tools (nmap, GoBuster, BurpSuite, systemctl, etc.)
Get a taste of privilege escalation and reverse shells.

Strategy/Tactics Used

➡️ Use nmap to gain knowledge of open ports and services -> Find the web server listening on port 3333
➡️ Use GoBuster to search for folders on the web server -> Find an upload form in the internal folder
➡️ Use BurpSuite to intercept the request and find out what upload format is supported -> php : No, phtml: YES!
➡️ Upload reverse-shell payload with .phtml extension to bypass filtering: Gain a reverse shell
➡️ Within the reverse shell, use find to search for SUID files that can be used for privilege escalation -> Find /bin/systemctl
➡️ Create a SystemD service file and use /bin/systemctl to enable and run it, gain root access! WIN!

Tools/Command Used

nmap

nmap is so essential in the recon process. This task is just scratching the surface. Some options used:
-sV: Attempts to determine the version of the services running
-p xxx or -p-: Portscan for port xxx or scan all ports
-A: Enables OS and version detection, executes built-in scripts for further enumeration

GoBuster

GoBuster is a URI/DNS subdomain brute-force tool. It's written in Go; it enumerates the host you give it and spits out the directories and files it finds. It works from a word list, and a pre-built one is the easiest choice. The command used in the lab is:

    gobuster dir -u http://<ip>:3333 -w <word list location>

Word lists can be found under /usr/share/wordlists if you use Kali Linux. Some mentioned that dirsearch is a viable alternative here.
Using the tool, you can find a folder with an upload form that lets you upload files to the web server.

BurpSuite

Try uploading some files, and you'll find that most extensions are blocked. How to proceed? Enter BurpSuite. According to the Arch Wiki, it is an 'integrated platform for performing security testing of web applications.' Well, we'll just use it to intercept some web requests, try different file extensions, and see which one is not blocked.
The gist of it is to try to upload something, intercept the request using BurpSuite, then swap the file extension for each entry in a pre-loaded wordlist of file extensions (e.g. php, php5, phtml, etc.) to test which one can bypass the web server's filter.
Once found (phtml), just upload a PHP reverse shell script to the web server and load it in the browser; your listening nc -lvnp 1234 session will then receive the reverse shell.

systemctl

Once we have the reverse shell, the last step is privilege escalation. One way of doing this is to search for executables with the SUID bit set. We do that with find:

find / -user root -perm -4000 -exec ls -ldb {} \;

Out of the results returned, /bin/systemctl stands out. We can create a service file (e.g. root.service) and let systemctl start it to gain the privilege:

TF="root.service"
echo '[Service]
Type=oneshot
ExecStart=/bin/sh -c "cat /root/root.txt > /tmp/flag.txt"
[Install]
WantedBy=multi-user.target' > $TF

Note that full-screen text editors like vim or nano don't work well in a bare reverse shell, so just echo the content into the file. Not optimal, but it gets the job done. Once you have the root.service file, run systemctl to enable and start it. The path below assumes the file was created in /tmp/output; adjust it to wherever you wrote yours:

/bin/systemctl enable /tmp/output/root.service
/bin/systemctl start root.service

After the service has started properly, a simple cat /tmp/flag.txt will give us the flag we need to pass the room.


How I Migrate My Data Science Blog from Pelican to Hugo

Michael Li — Sun, 02 May 2021 05:18:22 +0000

Motivation

Issues of Pelican

I've been using Pelican as the framework for my data science blog for a while now. It has worked for me, though there were always some minor glitches that made me feel not settled. It never feels complete and satisfying to me. Here are some of the big ones:

Small community and niche position

Pelican has a much smaller community than Hugo. It has 10.4k stars on GitHub vs. Hugo's 51.4k. In the static site generator community, Pelican is a niche player. People who already know Python might want to try it out (like me!), but others with broader programming skills might prefer other options.
A smaller community means fewer themes, fewer plugins, and less support if you run into some weird issues. This is precisely my experience.

Lack of satisfying themes

It was pretty hard to find my current theme, Elegant, which has both a good look and feel and useful utilities. There are not many options to begin with.

Small glitches that are hard to tackle

It took me quite some effort to get everything to work: Google Analytics, Disqus, Jupyter Notebook support, table of contents, read time, etc. During the process, I barely got any help since there simply weren't many people using it. So I had to dig deep into the source code to fix even a minor issue. It's not that the process wasn't worth the time (it was, and it was very challenging and educational for me as a programmer), but why should I have to dig out the rocks when I could be tending the flowers?

Speed, Speed, Speed

When it comes to speed as a programming language, Python sits at the 'slowest' end while Go is (almost) at the top. What I can tell you is this: it definitely shows in site generation speed. Pelican takes a couple of seconds to render all my articles (20+), while Hugo takes a couple of milliseconds. Another big perk of Hugo is that it updates the site in real time, while Pelican lags behind a bit. This is more obvious when you make a small change and need to regenerate the whole site to see the updated version. Our time is too precious to be wasted; even a couple of seconds will add up.

Why Hugo

Concurrency and speed

Hugo bills itself as "The world’s fastest framework for building websites", and I can definitely see why. Go was developed at Google to solve their code efficiency problems and is known for its great concurrency prowess. This transfers to Hugo pretty well. The standard build time for a Hugo site is on the order of 100 ms, while other static site generators can take around 10 seconds. If speed is your concern, then you'll definitely love Hugo.

Good Community Support

Hugo's open source project on GitHub is currently showing 54.4k Stars. That is a pretty big number. Many people use Hugo as their framework of choice for personal/business blogs. This means it's easier to search for similar questions when in doubt.
Also, the response time on Hugo's official forum is relatively short, provided that you frame your question precisely.
Hugo also has excellent documentation, where you can easily find what you need when implementing a new feature.

Exposure to Go

Golang as a server-side language has been gaining prevalence among back-end developers over the past couple of years. It's a language worth putting some time into. Working with Hugo will unavoidably expose you to Go, and you might learn a thing or two while building your site, which can get you started with Go.

Themes, a lot of themes

Look no further than the official Hugo theme site. These are the free ones. There are also some sites that offer paid premium themes, and you can decide whether they're worth it. Free or paid, the Hugo theme community is very vibrant and active, resulting in many options to choose from.

Smooth Learning Curve

Some static site generators, like Gatsby, require a solid understanding of React. For Hugo, you don't really have to learn Go first, though knowing some Go will make things even smoother.

My First Hugo Site

Enough of the theory-crafting. Let's get down to the details. I'll organize this part in chronological order to show the flow of how it's usually done and some issues I ran into, and how I tackled them.

Start from Quick Start

The simplest and best (to me at least) way to start your migration is to build a new site from scratch following the official Quick Start. It's relatively easy to follow and doesn't have many steps. It helps if you know a bit of the command line and Git, but that's not required. Hugo comes with a powerful and intuitive CLI, and even if you don't know much about the command line, you can finish the tutorial without breaking a sweat. For example, building the site takes only:

hugo

The Quick Start will pick a theme for you (ananke). You can easily change it to your own choice later. The final site will look something like this:

Pick a Theme

Picking a theme is mostly subjective. Choose anything you want. Something that looks appealing to you and meets all your utility needs is a good start. Just don't spend too much time nailing down your 'perfect one'; with so many choices, you are likely to switch multiple times before settling on one that you really feel comfortable with. Mine is Stack. For your first site, make sure to also have a look at the theme documentation, because you'll definitely need to read it multiple times to adjust the theme to your liking.

Configure and Adjust

Now comes the fun part, the tinkering! Tweaking a theme to make it work for you is daunting for some people, but for me, it's both daunting and exciting. It feels like puzzle solving. You get leads from the theme documentation, the Hugo documentation, YouTube videos, and Stack Overflow, and put all the pieces together. When it's done, you'll feel excellent about yourself!

Clone, Submodule, and Config

First things first, git clone the theme to your local drive:

git clone https://github.com/CaiJimmy/hugo-theme-stack/ themes/hugo-theme-stack

It helps to add the theme as a submodule. It's easier to manage with Git that way. And you'll need that for future deployment if you want to put your site on Netlify.
Once the theme folder sits safely on your local drive, you just need to make some minor tweaks to the config file to make them work. You can do this in two ways. One is simply to change your current config.toml file:

echo theme = \"hugo-theme-stack\" >> config.toml

But if you read the theme documentation, what's suggested is to simply copy the config.yaml file from the theme example site over, since there are other parts of the config you need to get right, and it's easier to start from the theme default config files.
Once done, your simple site will start to look like this:

Avatar

Now to the little details you need to iron out to make the theme work for you. The first thing that grabbed my attention is the glaring placeholder '150x150' avatar:

Gotta get rid of it first! According to the documentation, the avatar needs to be put somewhere under the assets folder in the site root directory (I put it in an img subfolder). Then change config.yaml to tell Hugo where to find it:

    sidebar:
        emoji: 🍥
        subtitle: Data Science for the Rest of Us 
        avatar:
            local: true 
            src: img/avatar.png

The site automatically reloaded, and the avatar got updated to my not-so-pretty photo:

favicon

A favicon is that one little thing that, when you have it, you never notice. But if it's not there, its absence will nag you forever. Let's get that straight.
I don't have a favicon for my site yet, so I somehow need to generate one. A quick way of doing this is to use favicon.io. It lets you generate your favicon out of an image, a couple of characters, or an emoji of your choice. For simplicity's sake, I decided to go with characters from my first name. You can always change them later if not satisfied. The UI looks something like the below:

With the favicon resource files downloaded, the next step is to figure out where to put them. Looking at the theme documentation, there's no mention of which folder they should go in. What is the best way to find information if the official documentation is insufficient? GitHub, of course! Usually, people will complain about missing information on an open-source project's GitHub and submit issues. Let's see if we can find any clue there. After some searching, the theme's GitHub page can be found here. Search for 'favicon' within the repo and, aha, we have 12 issues related to it:

The circled issue (though in Chinese) is the one we need, and it pointed us to the /static folder to put the favicon. I put it under /static/img/. Then update the config.yaml :

params:
    mainSections:
        - post
    featuredImageField: image
    rssFullContent: true
    favicon: img/favicon-32x32.png

Reload, it works!

Front Matter

Front matter is the metadata for your posts. It contains various predefined variables you can use, or you can define your own if you prefer. It's all very flexible. Among them, the title, date, description, categories, tags, and image are the most important. The categories and tags also decide how your content will be organized.

Content organizing and feature image

The theme allows for two ways to organize your content: categories and tags. To do so, just include them in your front matter, like so:

---
image: 9-things-i-learned-from-blogging-on-medium-for-the-first-month.jpeg
title: "9 Things I Learned from Blogging on Medium for the First Month "
description: "Why Medium is a good platform to exchange ideas"
slug: 9-things-i-learned-from-blogging-on-medium-for-the-first-month
date: 2019-10-04T20:56:10.704Z
categories: Machine Learning
tags: 
- "Machine Learning"
- "Blogging"
- "Medium"
---

The theme will collect all the categories/tags defined in all your posts and put them together on the respective 'categories' and 'tags' pages. You can also give each category or tag a feature image. Just create categories and tags folders under /content/, and in each folder, create a subfolder for each category or tag, in which you put an _index.md file and an image (say ML.jpg). Within the _index.md file, add a front matter variable image that points to the image ML.jpg. Like below:
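As a rough sketch of that folder layout, the helper below creates the `_index.md` file for one taxonomy term. The function name `add_taxonomy_image` and the term folder name `machine-learning` are assumptions made for illustration (they are not from the theme documentation), and you would still need to copy the image file itself into the same folder.

```python
from pathlib import Path
import tempfile


def add_taxonomy_image(content_dir, taxonomy, term, image_name):
    """Create <content_dir>/<taxonomy>/<term>/_index.md pointing at a feature image."""
    term_dir = Path(content_dir) / taxonomy / term
    term_dir.mkdir(parents=True, exist_ok=True)
    index_file = term_dir / "_index.md"
    # Front matter with a single 'image' variable, as described above.
    index_file.write_text(f"---\nimage: {image_name}\n---\n", encoding="utf-8")
    return index_file


if __name__ == "__main__":
    # Demo in a throwaway directory; in practice content_dir would be
    # your Hugo site's content/ folder.
    with tempfile.TemporaryDirectory() as tmp:
        created = add_taxonomy_image(tmp, "categories", "machine-learning", "ML.jpg")
        print(created.read_text(encoding="utf-8"))
```

Writing the file from a script is only worth it if you have many categories or tags; for a handful, creating the folders by hand is just as quick.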

Once configured, it should look like this:

Shortcodes - Image Captions

Usually, images in Markdown files should be like this:

![image](url)

But unfortunately, this doesn't work well with image captions. After some trial and error, I found that the Hugo shortcode figure works pretty well:

{{< figure caption="It ain’t much …" src="https://cdn-images-1.medium.com/max/2000/0*e5CJeyB0_LVFRe4a.jpg" >}}

It looks like this:

Now that the essential pieces are down, time to write a script to transfer my Pelican-based Markdown files to Hugo-based ones.

Writing the Pelican to Hugo migration script

Having figured out all the details of making the theme work, it's now time to convert my posts from their Pelican-tailored form into something Hugo-ready. This can be done easily with some Python scripting. I used code from this GitHub repo as a base and adapted it to my needs. The code is pretty self-explanatory. It reads every line of the old Markdown file, uses regex to find the phrases that need updating, and modifies each line accordingly, mostly front matter, image, and video links.

#!/usr/bin/env python3
#
# Pelican to Hugo v20180603
#
# Convert Markdown files using the pseudo YAML frontmatter syntax
# from Pelican to files using the strict YAML frontmatter syntax
# that Hugo and other static engines expect.
#
# Anthony Nelzin-Santos
# https://anthony.nelzin.fr
# anthony@nelzin.fr
#
# European Union Public Licence v1.2
# https://joinup.ec.europa.eu/collection/eupl/eupl-text-11-12

import os, os.path, re
import subprocess
from shutil import rmtree, copytree

# Add the paths to your files below
outpath = 'path/to/your/hugo/content/folder'
inpath = 'path/to/your/Pelican/content/folder'

def pre_process():

    # Clear files in outpath
    for files in os.listdir(outpath):
        path = os.path.join(outpath, files)
        try:
            rmtree(path)
        except OSError:
            os.remove(path)

    # copy all Markdown files over
    cp_cmd = f'cp {inpath}/*.md {outpath}/'
    os.system(cp_cmd)  # os.system runs the string through the shell, so the *.md glob is expanded

def pelicantohugo():
    files = os.listdir(outpath)

    for file in files:
        first_img = True
        file_name, file_extension = os.path.splitext(file)
        # Input files will be left in place,
        # output files will be suffixed with "_hugo".
        regexed_file = file_name + '_hugo' + file_extension

        # Only convert visible Markdown files.
        # This precaution is useless… until it is useful,
        # mainly on .DS_Store-ridden macOS folders.
        # Compare with ==: ('.md') is a plain string, so "in" would do a substring check
        if not file_name.startswith('.') and file_extension == '.md':
            input_file = os.path.join(outpath, file)
            output_file = os.path.join(outpath, regexed_file)

            # The files will be edited line by line using regex.
            # The conversion of a thousand files only takes a few seconds.
            # Plain 'r' mode: the old 'U' flag was removed in Python 3.11
            with open(input_file, 'r') as fi, open(output_file, 'w') as fo:
                for line in fi:
                    # Frontmost handling
                    line = re.sub(r'(Title:)', r'title:', line)
                    line = re.sub(r'(Tags:)', r'tags:', line)
                    line = re.sub(r'(Date:)', r'date:', line)
                    line = re.sub(r'(Category:)', r'categories:', line)
                    line = re.sub(r'(Slug:)', r'slug:', line)
                    line = re.sub(r'(Summary:.*$)', r'', line)
                    line = re.sub(r'(author:.*$)', r'', line)
                    line = re.sub(r'(Subtitle:)', r'description:', line)
                    # Add closing frontmatter delimiter after the tags.
                    line = re.sub(r'(\[TOC\].*$)', r'---', line)

                    # Add opening frontmatter delimiter before the title.
                    line = re.sub(r'(title:)', r'---\n\1', line)
                    # Enclose the title in quotes.
                    line = re.sub(r'title: (.*)', r'title: "\1"', line)
                    line = re.sub(r'description: (.*)', r'description: "\1"', line)
                    # Change date formatting.
                    line = re.sub(r'(date: \d{4}-\d{2}-\d{2}) (\d{2}:\d{2})', r'\1T\2:00Z', line)
                    # Slow but insightful way to edit the tags.
                    if re.match(r'tags: (.*)', line):
                        # Split the comma separated list of tags.
                        tag_split = re.sub(r'(.*)', r'\1', line).split(', ')
                        # Output the new list of tags.
                        tag_plist = '\n- '.join(tag_split)
                        # Insert a newline before the list.
                        tag_list = re.sub(r'tags: (.*)', r'tags: \n- \1', tag_plist)
                        # And enclose the tags in quotes.
                        line = re.sub(r'- (.*)', r'- "\1"', tag_list)
                    # get proper slug
                    if re.match(r'slug: (.*)', line):
                        slug_match = re.search(r'slug: (.*)', line)
                        slug = slug_match.group(1)
                        os.system(f'mkdir {outpath}/{slug}')  # create subfolder using slug for feature image
                    if re.search(r'\(https://cdn.*?\)', line): 
                        img = re.search(r'!\[(.*?)\]\((https://cdn.*?)\)', line)
                        img_url = img.group(2)
                        img_caption = img.group(1)
                        if first_img:   # for first image which is the feature image, need special handling
                            first_img = False
                            if re.search(r'\.((?:jpg|png|jpeg|gif|svg))', img_url): 
                                img_e = re.search(r'\.((?:jpg|png|jpeg|gif|svg))', img_url)
                                img_ext = img_e.group(1)
                            else:
                                img_ext = 'jpeg'
                            # download image from Medium and put into the created subfolder
                            os.system(f'wget -O {outpath}/{slug}/{slug}.{img_ext} {img_url}')
                            line = ''
                        else:
                            # all other images just extract the image url and put into 'figure' shortcode
                            line = f'{{{{< figure caption="{img_caption}" src="{img_url}" >}}}}'

                    # YouTube shortcode
                    if re.search(r'src="https://www.youtube.com/embed/(.*?)"', line):
                        video = re.search(r'src="https://www.youtube.com/embed/(.*?)"', line)
                        video_code = video.group(1)
                        line = f'{{{{< youtube {video_code} >}}}}'
                    fo.write(line)
                # Print a little something about the conversion.
                #print(file_name + ' converted.')
            os.remove(input_file)

            # when all is ready, set the 'image:' front matter correctly so feature image could work
            with open(output_file, 'r') as fi:
                data = fi.readlines ()
            with open(output_file, 'w') as fo:
                image_meta_added = False
                for line in data:
                    # Add opening frontmatter delimiter before the title.
                    if not image_meta_added and not first_img:
                        line = re.sub(r'(---)', f'---\nimage: {slug}.{img_ext}', line)
                        image_meta_added = True
                    fo.write(line)
        if not first_img: os.system(f'mv {output_file} {outpath}/{slug}/index.md')

pre_process()
pelicantohugo()

Some points worth noticing:

The script will remove everything in the output folder(Hugo content folder) and regenerate them all from the source folder.
Front matter: Heavily use regex to replace meta-data. e.g. from 'Title' to 'title', 'Subtitle' to 'description', date format, etc.
Image: Extract the Medium CDN URL, download the image, and put it under the post subfolder so that the feature image can work. Other images are put into a {{figure}} Hugo shortcode for better captioning.
YouTube: extract the video ID and put it into {{youtube}} Hugo shortcode. It works like a charm.
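To make the front matter step more concrete, here is a minimal, simplified sketch of the same idea applied to single lines. It is not the script above: the function name `convert_front_matter_line` is made up for illustration, it only handles Title, Date, and Tags, and it anchors each pattern at the start of the line, which the original script does not do.

```python
import re


def convert_front_matter_line(line):
    """Convert one Pelican metadata line to Hugo-style YAML (simplified sketch)."""
    # Lowercase the keys Hugo expects.
    line = re.sub(r'^Title:', 'title:', line)
    line = re.sub(r'^Date:', 'date:', line)
    line = re.sub(r'^Tags:', 'tags:', line)
    # Enclose the title in quotes.
    line = re.sub(r'^title: (.*)', r'title: "\1"', line)
    # "2019-10-04 20:56" becomes "2019-10-04T20:56:00Z".
    line = re.sub(r'^(date: \d{4}-\d{2}-\d{2}) (\d{2}:\d{2})', r'\1T\2:00Z', line)
    # Turn a comma-separated tag list into a quoted YAML list.
    match = re.match(r'^tags: (.*)', line)
    if match:
        tags = [t.strip() for t in match.group(1).split(',') if t.strip()]
        line = 'tags:\n' + ''.join(f'- "{t}"\n' for t in tags)
    return line


if __name__ == "__main__":
    sample = [
        "Title: My Post\n",
        "Date: 2019-10-04 20:56\n",
        "Tags: Machine Learning, Blogging\n",
    ]
    for raw in sample:
        print(convert_front_matter_line(raw), end="")
```

Keeping the conversion in a small pure function like this makes it easy to try on a few sample lines before letting the full script loose on the content folder, which matters because the full script deletes and regenerates everything.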

Transfer my new Medium posts into Hugo markdown

I have some Medium posts that have not yet been transferred to my Pelican blog, so another script is needed. No need to write it myself: I used a Python script from a GitHub repo. You need to use Medium's export service to get all your posts into a zip file and then use the script to turn them into Hugo markdown. Since there weren't many posts, I manually created the subfolders for each article so the feature image could work.

Switch repo on Netlify

Finally, we have everything we need. Now is the time to flip the switch on Netlify from Pelican to Hugo. Exciting!
First of all, I created a repo for the Hugo site.
Then I logged into my Netlify account, went to my site, and hit Site Settings:

Choose Build & Deploy tab, then hit Edit Settings, like so:

Choose Link to a different repository, and a wizard screen will show up where you can pick your Hugo repository. Do that, and fill in some basic build settings. Among them, notice the build command is simply hugo. One thing worth noticing is the Hugo version. The default Hugo version on Netlify was not high enough to properly build my site, and I ran into many weird errors. The solution I found was to add a netlify.toml file to my site's root directory and specify the Hugo version inside it. You can find the reference guide here.
Once all settings are done, a new build generates the new site.

Overall Feeling

Wow, this is a long post. I appreciate whoever made it this far. I hope this long article helps a bit on your journey into Hugo land. Finally, I want to share my overall feelings about the whole process:

It's not hard, but it requires ironing out quite a few wrinkles. (Also the fun part?)
Adopting a theme takes the longest time. The documentation helps but is often incomplete. GitHub issues help tremendously.
There are a lot of very kind people who wrote scripts to automate the migration. Use them, but don't hesitate to modify them to your needs. A little programming goes a long way, especially when you have many articles.

Bonus
YouTube series I use to learn Hugo basics, all in bite-size.

How to Extract Knowledge from Wikipedia, Data Science Style

Michael Li — Sun, 09 Feb 2020 06:07:41 +0000

People tend to think that what data scientists do is develop and experiment with sophisticated algorithms and produce state-of-the-art results. This is largely true. It is what a data scientist is most proud of, and the most innovative and rewarding part. But what people usually don’t see is the sweat they go through to gather, process, and massage the data that leads to the great results. That’s why you see SQL appear in most data scientist job requirements.

What is SPARQL?

There is another query language that could prove very useful in acquiring data from multiple sources and databases, Wikipedia the biggest among them. The query language is called SPARQL. According to Wikipedia:

SPARQL (pronounced “sparkle”, a recursive acronym [2] for SPARQL Protocol and RDF Query Language) is an RDF query language — that is, a semantic query language for databases — able to retrieve and manipulate data stored in Resource Description Framework (RDF) format

Well, this is not a very good definition. It hardly tells you what it can do. To translate it into human-readable language:

SPARQL is a query language similar to SQL in syntax but works on a knowledge graph database like Wikipedia, that allows you to extract knowledge and information by defining a series of filters and constraints.

If this is still too abstract to you, look at the image below:

Awarded Chemistry Nobel Prizes

It is a timeline of awarded chemistry Nobel prizes, generated by the WikiData Query Service website, using the code below:

    #Awarded Chemistry Nobel Prizes
    #defaultView:Timeline
    SELECT DISTINCT ?item ?itemLabel ?when (YEAR(?when) as ?date) ?pic
    WHERE {
      ?item p:P166 ?awardStat . # … with an awarded(P166) statement
      ?awardStat ps:P166 wd:Q44585 . # … that has the value Nobel Prize in Chemistry (Q44585)
      ?awardStat pq:P585 ?when . # when did they receive the Nobel prize

    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    OPTIONAL { ?item wdt:P18 ?pic }
    }

Anyone familiar with SQL will find the above code quite intuitive. I’ll use another example to explain the basics of formulating similar queries to get the results you are interested in.

Starting Point: Wikipedia Page

SPARQL works on multiple knowledge graph databases. To understand what a knowledge graph is, let’s start with something everyone is familiar with: Wikipedia. Wikipedia is the go-to place for most people when they want to research a topic or subject. If you go to Python creator Guido van Rossum’s page, you’ll see a detailed page with all kinds of good information.

Organized Page: WikiData

The problem with this page is that it’s not organized. You can search for keywords, but you cannot easily find out the relationships between the information nodes. That’s where the knowledge graph comes into play. The link in the red rectangle on the above page reads “Wikidata item”; clicking it brings you to the knowledge graph view of the same page:

Start the Query

Here we can see that all the information about Guido is well organized into categories, and each category has multiple items. Using SPARQL, you can easily query this information. To do this, Wikipedia provides another page, a user-friendly query service called Wikidata Query Service:

This is where we can experiment with SPARQL. On the WikiData page, we observed that Guido is a programmer (obviously!). Now what if we want to know which other programmers have an entry on Wikipedia? Let’s see the SPARQL code:

SELECT ?person

WHERE {

?person wdt:P106 wd:Q5482740 .

}

Here we define ?person as the subject of interest; this is also what will appear as a column in our query results. Then we specify some constraints with WHERE. The constraint is that wdt:P106 needs to be wd:Q5482740. What? you say. Let me explain in more detail. wdt is the prefix for a ‘predicate’ or ‘attribute’ of the subject, while wd is the prefix for a value of the attribute (an object in SPARQL terms, but that’s not important). wdt: means I am going to specify an attribute of the subject here, and wd: means I will specify what the value of this attribute is. So what are P106 and Q5482740? These are just codes for the specific attribute and value. P106 stands for ‘occupation’ and Q5482740 stands for ‘programmer’. This line of code means: I want the ?person subject to have the attribute ‘occupation’ with the value ‘programmer’. Not that scary anymore, right? You can find these codes easily on the WikiData page mentioned above.
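The query service web page is the easiest place to experiment, but the same query can also be sent from a script. Below is a minimal sketch using only the Python standard library. It assumes the public endpoint at https://query.wikidata.org/sparql with the `query` and `format=json` parameters; the helper names, the `LIMIT 5` clause, and the User-Agent string are additions made for this illustration.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Same constraint as above, capped at 5 rows to keep the demo small.
QUERY = """
SELECT ?person
WHERE {
  ?person wdt:P106 wd:Q5482740 .
}
LIMIT 5
"""


def build_request(query):
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        # Placeholder User-Agent; replace with something identifying your script.
        headers={"User-Agent": "sparql-example/0.1 (hypothetical contact)"},
    )


def run_query(query):
    """Send the query and return the ?person URIs from the result bindings."""
    with urllib.request.urlopen(build_request(query), timeout=30) as resp:
        data = json.load(resp)
    return [row["person"]["value"] for row in data["results"]["bindings"]]


if __name__ == "__main__":
    for uri in run_query(QUERY):
        print(uri)
```

Each returned value is an entity URI ending in a Q-code, which is the same kind of code discussed in the next section.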

Run the query and you’ll get the following results:

From Code to Name

We got a bunch of person items with different wd: values. If you look closer at the values, they are actually the codes for different people. For example, the first one, wd:Q80, is Tim Berners-Lee, the inventor of the WWW. This is not intuitive; we want to be able to see the names directly. To do that, we ask WikiData for each item’s label, which translates the code into a name, like so:

SELECT ?person ?personLabel

WHERE {

  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) ) 

}

The syntax is similar: we want the person to have a ‘label’ attribute, and we define a ?personLabel variable to hold these values so we can display them in the query results. We also add ?personLabel to our SELECT clause so it will be displayed. Please note that I also added a FILTER to display only the French-language label; otherwise the query returns labels in multiple languages for each person, which is not what we want:

Narrowing Down

From the above results, we can see that we have some 790 results. This is way too many. Let’s narrow them down to the ones that are ‘somebody’ in the industry. Someone that has an attribute of ‘notable work’:

SELECT ?person ?personLabel ?notableworkLabel

WHERE {

  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) ) 

  ?person wdt:P800 ?notablework .
  ?notablework rdfs:label ?notableworkLabel .
  FILTER ( LANGMATCHES ( LANG ( ?notableworkLabel ), "fr" ) ) 

}

Again, wdt:P800 means ‘notable work’ attribute, everything else is similar. We then get the following results:

Group Multiple Labels

Now we have only 175 results, with each person’s notable work also listed. But wait, why do we have five Richard Stallmans? It turns out Richard has more than one notable work, so he is listed multiple times in the results. Let’s fix this by grouping all the notable works into one attribute:

SELECT ?person ?personLabel ( GROUP_CONCAT ( DISTINCT ?notableworkLabel; separator="; " ) AS ?works )

WHERE {

  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) ) 

  ?person wdt:P800 ?notablework .
  ?notablework rdfs:label ?notableworkLabel .
  FILTER ( LANGMATCHES ( LANG ( ?notableworkLabel ), "fr" ) ) 

}

GROUP BY ?person ?personLabel

Here ‘GROUP BY’ is used. Also, GROUP_CONCAT function is used to concatenate multiple notableworkLabel into a new column works (I will not explain how these functions work, just want to quickly show you what SPARQL can do. Please feel free to Google if you want to know more, there are plenty of tutorial articles and videos out there):

Faces

Now we have a list of 90 results, with all the ‘who’s who’ of the software engineering world. But SPARQL can do more. Let’s add some faces to the names:

SELECT ?person ?personLabel ( GROUP_CONCAT ( DISTINCT ?notableworkLabel; separator="; " ) AS ?works ) ?image

WHERE {

  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) ) 

  ?person wdt:P800 ?notablework .
  ?notablework rdfs:label ?notableworkLabel .
  FILTER ( LANGMATCHES ( LANG ( ?notableworkLabel ), "fr" ) ) 

  OPTIONAL {?person wdt:P18 ?image}

}

GROUP BY ?person ?personLabel ?image

It is the same pattern; we just wrapped the image constraint in an OPTIONAL block, since we don’t want to exclude someone who doesn’t have an image in their profile. We also switch the view to ‘Image Grid’:

Where are they?

Wow! This is much better. I see quite a few familiar faces! Maybe you wonder where these people are located? Let’s find out:

#defaultView:ImageGrid

SELECT ?person ?personLabel ( GROUP_CONCAT ( DISTINCT ?notableworkLabel; separator="; " ) AS ?works ) ?image ?countryLabel ?cood

WHERE {
  ?person wdt:P106 wd:Q5482740 .
  ?person rdfs:label ?personLabel .
  FILTER ( LANGMATCHES ( LANG ( ?personLabel ), "fr" ) ) 

  ?person wdt:P800 ?notablework .
  ?notablework rdfs:label ?notableworkLabel .
  FILTER ( LANGMATCHES ( LANG ( ?notableworkLabel ), "fr" ) ) 


  OPTIONAL {?person wdt:P18 ?image}

  OPTIONAL {?person wdt:P19 ?country .
           ?country rdfs:label ?countryLabel .
            ?country wdt:P625 ?cood .
            FILTER ( LANGMATCHES ( LANG ( ?countryLabel ), "fr" ) )
           }

}

GROUP BY ?person ?personLabel ?image ?countryLabel ?cood

You can probably decipher the code above yourself. It basically says: take this person’s place of birth (wdt:P19), put it into a variable named country, then find the coordinates of that place and put them into a variable named cood. With the coordinates, we can activate the ‘map’ view:

We can see we have a lot of them in the US, some in Europe and others scattered around other parts of the world.

More Examples

With a few lines of code, we figured out the big influencers in the software industry, what they are known for, where they are, and what they look like. As you can see, the potential here is limitless.

You can click the ‘Example’ button on the WikiData page to find out more fun and interesting examples you can do with it.

As an assignment for this article, can you figure out how to add the ‘date of birth’ attribute and generate a timeline graph like the one at the beginning of this article?

Conclusion

In this article, we used WikiData as a knowledge graph example to introduce SPARQL query language. There are other knowledge graphs out there like DBpedia, etc. This article is by no means a comprehensive tutorial. I just want to introduce the language to more people, so knowledge and information extraction can be done a bit more efficiently.

5 Things I Learned from Google’s New ML-Powered Recorder App

Michael Li — Thu, 06 Feb 2020 15:50:14 +0000

There are tons of audio recording apps in the app store, but you know things will be a bit different when Google develops a brand new one. Google recently released a new ‘Recorder’ app powered by its state-of-the-art Machine Learning algorithms that can transcribe what it hears with impressive precision in real time. This is not the first time Google has tried to bless its products with some AI ‘superpower’. Some of its prior attempts failed (I’m talking to you, Google Clips!) and some had quite formidable success, for example, Google’s Pixel phone camera app. With camera hardware specs a little below the industry mainstream, Google’s Pixel flagship phone managed to become one of the best smartphone cameras on the market thanks to its Machine Learning algorithms for image post-processing. The ‘Recorder’ app is yet another attempt by Google to spice up the competition using AI, this time on audio.

After digging deeper into what the app does differently and how AI plays a core role in it, I found some very interesting insights into how Google handles apps, AI, and user experience that could shed some light on future app development in the AI era.

What is Google Recorder?

You can refer to the following short YouTube video on what Recorder does. In short, you can use it to do real-time transcription, search recorded audio by keywords, automatically generate tags or segment the audio into different categories like music, speech, etc.

I’ve been using it for more than a week and found it to be useful, slick and pleasant to use. Recording audio is not a complicated task, but the AI part makes it even easier. I can see this little app makes a big difference for students and people attending meetings regularly.

#1 Adoption of Edge-First Model Design

Image from WishDesk

We’ve all heard of the term ‘Mobile-First Design’. When companies develop their applications, they design and optimize the app for the mobile experience first, then extend it to other platforms like desktop or web. I think the same idea can also be applied to AI-powered application design, hence ‘Edge-First Design’.

Usually, Machine Learning based applications run in the cloud because of the heavy computation most state-of-the-art ML models require. For enterprise applications, this approach is fine since hardware is hardly a real issue. But if a company wants to build impactful AI-based apps for consumers, a cloud-based system often won’t cut it. Running AI-based apps from the cloud is not only slow but also raises serious privacy concerns. Normal consumers are also used to the snappiness modern mobile apps offer. They couldn’t care less whether your app is based on some SOTA model if the experience cannot match the high standard they have gotten used to after many years of modern smartphone hardware/software development. So putting the AI on the ‘Edge’, i.e. the user’s phone, tablet, or smart home devices, will be a better way to succeed.

Image from Lann

Google’s Recorder app did a great job on this. It uses a new model called ‘**RNN transducer (RNN-T)**’ that is compact enough to reside on the phone while powerful enough to do real-time transcription. Instead of the traditional ‘pipeline’ approach, the RNN-T model uses a single, end-to-end neural network, an approach that is increasingly popular for solving complicated problems. Until recently, we’ve seen a lot of research progress made by using bigger and bigger models to increase prediction performance, yet the opposite direction is equally important: using as compact a model as possible to achieve similar performance so the model can be put on the edge. I expect more research to be done in this area as machine learning matures in the coming years.

#2 Use Different Technology Stack for Performance

Another interesting development is the introduction of Swift for TensorFlow, created by Chris Lattner, the creator of the Swift programming language. It combines the open-source Swift language with TensorFlow and promises both fast development time like Python and high performance like C++. Fast.ai has a great introductory course on it. With ML moving more and more from research labs to commercial applications, the performance of ML models will play a much bigger role, and Swift for TensorFlow has great potential there. According to the founder of fast.ai, Jeremy Howard:

“Swift can match the performance of hand-tuned assembly code from numerical library vendors. Swift for TensorFlow is the first serious effort I’ve seen to incorporate differentiable programming deep into the heart of a widely used language that is designed from the ground up for performance”

What's New in Machine Learning - WWDC 2019 - Videos - Apple Developer
*Core ML 3 has been greatly expanded to enable even more amazing, on-device machine learning capabilities in your app…* developer.apple.com

#3 Privacy Matters

Photo by Matthew Henry on Unsplash

One of the biggest concerns for AI applications is privacy. For AI to really show value, it has to know a lot about the user, often including details of their personal life that people don’t feel comfortable sharing. Take audio recording as an example: you might want to record your family meeting discussing your next Christmas plan, but you don’t want it to be transferred to the cloud and then have 10 Christmas travel agents calling you to sell their products. This gives ‘offline’ ML apps an advantage. Since the model is deployed locally on the edge and no data needs to be transferred to the cloud, users can feel assured that their privacy is protected. The Recorder app runs all its models on-device, which makes people a bit less reluctant to adopt it.

#4 User Experience Design is Still the Key

The Recorder app has a very slick and elegant UI. It’s a simple app with minimal clutter. You can easily start/pause your recording, toggle between ‘Audio’ and ‘Transcript’ mode to check your recorded content, and get tag suggestions based on the content recorded. It all works without friction.

During recording, the app will automatically categorize sections of audio as ‘Speech’, ‘Music’, ‘Whistling’ … etc. and color code them accordingly.

When you play back your recorded audio in transcript mode, you can see each word highlighted as it is spoken, and you can search through the transcripts using the keywords you want. Very intuitive.

What I’m trying to say is: User experience design will make or break a great AI model. Only when working seamlessly with other parts of the app can an AI feature deliver its value to the end-user. A model that can address the user’s pain point with high performance is only a start, not the end. AI should serve silently behind the scenes rather than take center stage.

#5 Responsiveness Comes with a Price

In the mobile world, companies strive to offer more responsiveness. Consumers nowadays are very impatient and the last thing they want is to wait. A snappy experience means the user can focus on the content they want or the task at hand. But responsiveness on mobile devices is not easy to come by. Computing power, screen size, and system resources are all very limited compared to desktop or cloud. To achieve the best responsiveness, more thought and research need to be put into the design and development of the app. This includes making better use of the CPU/GPU, optimizing memory, choosing fast programming languages for the implementation, and reducing dependence on back-end servers. The Machine Learning industry has made great research progress over the past few years, yet to have more impact on people’s day-to-day lives, more investment and work have to go into the engineering side. And a shift from research to engineering is a sign of maturity for a new technology.

The Right Way of Developing AI Applications?

Image from Overwatch

People have had this fantasy of scary AI taking over humanity for many years. Movies, novels, and TV shows have all painted a very dramatic future of AI for mankind. To counter this public (biased?) impression of AI, special care needs to be taken. It’s beneficial to adopt an ‘AI exists as a tool to help humans’ mentality instead of an ‘AI vs Mankind’ one. AI can do a lot of things, but rather than developing AI apps that ‘replace’ humans, it’s better to have AI that helps humans perform their tasks more easily and quickly: the Recorder app to help with taking notes, image recognition systems to help doctors diagnose better, augmented reality apps to help people better navigate the neighborhood, etc.

A quiet, friendly, yet powerful AI diligently working behind the scenes to help people do whatever they do better is so much more comfortable and approachable for people, compared to a robotic killing machine in a SciFi movie.

Two Sides of the Same Coin: Jeremy Howard's fast.ai vs Andrew Ng's deeplearning.ai

Michael Li — Tue, 04 Feb 2020 02:04:15 +0000

Which One to Take? WHY NOT BOTH!

Data science and artificial intelligence might be the hottest topics in tech right now, and rightfully so. There are tremendous breakthroughs both in applications and in research. This is a blessing and a curse, at least for students and enthusiasts who want to break into this area. There are too many algorithms to learn, too many coding/engineering skills to hone, and way too many new papers to keep up with, even if you feel you’ve mastered the art.

The journey is long, the learning curve is steep, the strife is real, yet the potential is so great that people still flock to it. The good thing is that we also have great educators and instructors working on mitigating the pain and making the process a little less harsh and a bit more fun. We’ll explore two of the greatest among them and share a potentially effective approach to help you swim through the sea of Data Science a bit more happily.

AI Learning ‘Burn-out’

Photo by Toa Heftiba on Unsplash

If you list what one needs to learn to become an ‘OK’ data scientist or machine learning engineer, it could be scarily long:

Math: Linear Algebra, Calculus, Statistics, Algorithms, …
Coding: Python, R, SQL/NoSQL, Hadoop, Spark, TensorFlow/PyTorch, Keras, NumPy, Pandas, OpenCV, Data Visualization…
Algorithms: Linear Regression, Logistic Regression, Support Vector Machine, PCA, Anomaly Detection, Collaborative Filtering, Neural Network, CNN, RNN, K-Means, NLP, Deep Learning, Reinforcement Learning, AutoML, …
Engineering: Command Line, Cloud platform (AWS, GCP, Azure), DevOps, Deployment, NGINX/Apache, Docker…

The list can easily make the head spin for a person just entering the field. Yet, it is still just scratching the surface. Some people make an ambitious plan (Siraj Raval’s plan is great btw) and dive right into it. Some lose momentum and feel totally underwater, with the exit nowhere to be seen. What went wrong?

You ‘Overfit’ Yourself

From Andrew Ng’s Machine Learning course

Overfitting is a very familiar idea for anyone who knows a bit of Machine Learning. It basically means your algorithm learned ‘too much’ from the data, burying itself in the little details of the data set and missing the big picture. Come to think of it, sometimes when we learn something, we dive so deep that we forget why we were learning it and how it fits into the big picture. It’s something I’d like to call ‘overfitting’ your own learning. This happens especially often to people coming from an academic background. A math Ph.D. tends to make sure each theorem is fully understood before proceeding to the next one. This is great for learning math. Having a profound understanding of the theory will give you great intuition and confidence. It will enable you to see patterns and issues that people without the training cannot see easily, yet Data Science demands more.

Theory aside, there is also a practical part to it. A properly applied algorithm coupled with efficient code, carefully tuned hyper-parameters, and a well-designed pipeline will usually achieve decent results; the algorithm alone will not. Delve too deep into theory, and you risk missing the practical side of the learning. It’s equally important to accumulate experience in implementing what you learned and handling real-life complexities. How to address this? Enter the deeplearning.ai and fast.ai courses.

Deeplearning.ai and Fast.ai

A lot of courses have been developed to help navigate people through the learning process. Among them, Deeplearning.ai and fast.ai are two unique ones that have their own approaches and can give us some insights into a potentially effective way of learning Data Science.

deeplearning.ai

deeplearning.ai is a paid course developed by Andrew Ng. Like his other courses, it is known for its well-designed learning curve, calm and smooth teaching style, and challenging yet fun assignments. It is widely accepted as the Deep Learning course one cannot go wrong with. It starts from the fundamental theories and works its way up to putting all the pieces together to solve real-life problems. It’s also called a ‘Bottom-up’ approach.

fast.ai

fast.ai was introduced by Jeremy Howard and Rachel Thomas as a free course to teach state-of-the-art deep learning techniques to people with basic coding experience. Without much explanation of the underlying theory, and with very few lines of code, fast.ai students can achieve astoundingly good results in their own domains early in the lessons. (I built a Chinese Calligraphy Style Classifier that reaches a 96% accuracy rate and deployed it on the cloud after finishing lesson 1 of the fast.ai course.) It teaches you how to tackle the real-world problem first, then digs deeper and deeper into how and why things work. It’s also called a ‘Top-down’ approach.

Which One is the Best Approach? Both!

So ‘Bottom-up’ and ‘Top-down’, which one is better? Which one should we take? The answer is Both!

Photo by Maarten Deckers on Unsplash

See, these two courses complement each other. Say you start with Andrew Ng’s deeplearning.ai course. You bury yourself in endless formulas and theories and gain a lot of intuition, but after weeks of learning, you still have nothing to show your friends and are not quite sure when you can apply your newly gained knowledge. Your study of the machine learning fundamentals is yielding diminishing returns. Your brain slows down and you start to feel bored. Now is the perfect time to take a lesson or two of the fast.ai course. With the help of the powerful fast.ai library and a few lines of code, you’ll be able to build impressive models that solve real-life problems and even beat some state-of-the-art papers and Kaggle competitions. This will give your brain a totally different kind of stimulation and your heart more confidence and passion to delve deeper into why everything works. Once you have built a couple of projects and ‘wow’ed your friends, you will be more motivated to learn more about the fundamentals; then you can go back to the deeplearning.ai course and continue your study there. These two courses push each other forward, and you can just rinse and repeat till you finish both.

This forms a perfect learning circle.

Photo by Dan Freeman on Unsplash

The best thing about taking both courses this way is that once you finish both, you’ll be fully prepared. You’ll have tons of projects built along the way from the fast.ai course to showcase to potential employers, and you’ll also have deep knowledge of how everything works, or may even have published a paper or two to show your findings. You are now a well-rounded Data Scientist. How cool is that?

“This is CS50”: A Pleasant Way to Kick Off Your Data Science Education

Michael Li — Tue, 04 Feb 2020 01:49:48 +0000

CS50 professor David Malan teaches over 800 students in CS50 (from YouTube)

So You Want to Get Into Data Science

Congratulations! Data Science is one of the hottest, hardest, most challenging, and most rewarding careers, full of top-notch minds. Your journey is bound to be full of fun, challenges, enlightenment, and achievements (big or small). New papers are published daily or even hourly. New techniques and experiments are developed regularly. New ways of thinking become the new norm. And what seemed magical before is proven feasible.

But You Don’t Know Where to Start

Photo by Ben White on Unsplash

But getting into Data Science is not easy. Far from it. The learning curve is brutal. There is so much to learn: Linear Algebra, Calculus, Statistics, Python, SQL, Machine Learning, Algorithms, Optimization, Data Wrangling, Data Visualization, Software Engineering, DevOps, … The list goes on and on.

Some people may have some background in math or statistics, which will definitely help. Yet you still need a solid foundation in software engineering to be efficient and successful in your career. But this is not a problem, you say. After all, we live in an era of booming online education. There are plenty of courses, paid and free, to choose from. True, but this is precisely where the problem is. The biggest challenge for self-education these days is not a lack of educational resources, but the difficulty of finding the best or most relevant ones.

Enter CS50. If You are Only Allowed to Take One CS Course, Take CS50.

What is CS50? It is the introductory course on computer science taught at Harvard University by Professor David J. Malan. It is the largest class at Harvard with 800 students, 102 staff, and a professional production team. It offers both an on-campus and an online course. I’ve only taken the online one, but it’s already **THE** best computer science course I’ve come across, period. Let me tell you why:

The learning curve is so well designed, and it’s like watching a great suspense movie

The CS50 staff has the capability of knowing precisely what you do and do not know before each lecture (in that they have zero expert blindness). So the lectures will not mention anything you are not familiar with. The course smoothly guides you through key concepts of computer science and makes them seem obvious. It raises questions from time to time and later addresses them with a more in-depth explanation of the concepts. You’ll have plenty of ‘a-ha’ moments, and it almost feels like watching a suspense movie.

Covers core and essential fundamentals of computer science, and leaves plenty of room for you to dig deeper

The course covers most of the critical computer science elements: C, Python, Data Structures, Algorithms, Software Engineering, Resource Management, Web Development, etc. It delves deep enough that you can understand all the essential concepts while also knowing where to look if you want to dig deeper.

Orchestrates a variety of ways to teach you challenging/boring concepts, so it never feels boring

What is an array? Let’s find out! — thecrimson.com

CS50 has many ways to teach and keep you engaged. You’ll play a game to understand different sorting algorithms, receive a rubber duck to experience the famous Rubber Duck Debugging, watch experiments with an ‘array of lights 🚥’ to learn data structures, and even eat a delicious breakfast 🍞 while exploring the idea of pseudo-code. (One of my favorites is where David J. Malan uses a Yellow Pages phone book to explain binary search, tearing the book in half and throwing one half away. A defining moment in CS50 indeed.)

Interactive, fun and engaging, the time just flies by, and you’ll be amazed what you can do once the course is over

The learning experience is so fun you’ll feel the time fly by without noticing it. Some of the problem sets are quite challenging, yet not impossible, and you’ll feel so proud of yourself once you crack them. You’ll probably fall in love with the joy of problem-solving. If you are stuck, there is an online community on almost every social network platform (Twitter, Reddit, Stack Exchange, Facebook, etc.) where you can get help.

Out-of-class activities get you familiar with the ‘developer culture,’ which is essential for your future career.

Puzzle days, office hours, CS50 Fairs, the final project ‘All-nighter’ hackathon (free breakfast at IHOP if you stay up all night): lots of activities are designed to get you familiar with the ‘developer culture’ and better prepare you for the software engineering world.

State-of-the-art course software to get you started

How great is a computer science course if they don’t use the software tools they developed themselves? Over the years, CS50’s staff has developed a series of tools/software to help the students write code, submit homework, check their code quality/syntax, tidy up code styles, and even generate color-coded code documentation in PDF form! These are all neat and useful ‘training-wheels’ as David J. Malan puts it and will help you get up to speed.

But please don’t just take my word for it; see what YouTube CEO Susan D. Wojcicki said about her experience:

And It Is Great for Data Science Too

Besides being a great course, CS50 is also very **relevant** to Data Science. It helps you lay a solid foundation of software engineering for your future career:

It teaches you C. More importantly, through C, you understand the fundamentals of computing, like how memory works, what a pointer is, data structures, etc.
If you can write C, then you can quickly learn to write in C++. C and C++ are the low-level, high-performance languages behind much of the data science stack (the core of NumPy, for example, is written in C).
It teaches Python, which is the primary high-level language for Machine Learning and Data Science.
It teaches SQL, the standard language for querying relational databases and one of the most widely used languages in Data Science.
It also teaches web programming, useful when you try to deploy your model to production.

So essentially everything taught in the course will be useful to you, and the foundation it helps you build will go a long way.

CS50 and Beyond!

Once you finish the course, you’ll be more knowledgeable and confident to continue your Data Science journey, and I’ll point you to a couple of possible directions from here:

CS50’s Web Programming with Python and JavaScript

Teaches you the most relevant and progressive web programming tools like CSS, JavaScript, React, and Flask/Django, taught by the talented TF Brian Yu. Link here.

Jeremy Howard’s Fast.ai course to Start a ‘Top-down’ Approach for ML

Fast.ai is fantastic and unique. It enables you to build state-of-the-art deep learning models within the first lesson with less than ten lines of code. Then it delves deeper and deeper into the how and why. The only prerequisite is one year of coding experience, which CS50 will have already given you.

Andrew Ng’s Machine Learning Course at Coursera

Another great Machine Learning course, but a ‘Bottom-up’ style. It smoothly explains the math fundamentals first and gradually builds up the knowledge to piece together complicated machine learning models from scratch. I have an article that explains the difference between Andrew Ng and Jeremy Howard’s different approaches to machine learning education and recommend a potentially efficient way to learn.

Corey Schafer’s YouTube channel, Python and OOP Tutorials

As good as it is, CS50 only covers the generic and basic concepts of Python. You’ll need more in-depth knowledge to code efficiently for your data science projects. For this, I recommend Corey Schafer’s YouTube channel. He is one of the best Python educators I came across to explain complicated ideas in a crystal clear way. Not one second of his videos is wasted. The content is concise, to the point, and highly condensed. He has playlists for basic Python, SQL, Matplotlib, Git, and Object-Oriented Programming.

Conclusion

Learning Data Science is never a breeze, and I hope this article helps a little in alleviating the pain and making your journey a bit more efficient and fun. If you know other courses and resources that are also great, please feel free to leave a response so others can see them too. Thanks!

Any feedback or constructive criticism is welcomed. You can either find me on Twitter @lymenlee or my blog site wayofnumbers.com.

How to Port Your Medium Articles to Personal Blog with a Simple Bash Script

Michael Li — Mon, 03 Feb 2020 23:04:59 +0000

Photo by Annie Spratt on Unsplash

Medium is a great publication platform. It has good exposure, quality content, readers that really appreciate good articles and a neat and easy to use UI. It’s especially great for writers that just start their journey.

As good as it is, having your own blog outside of Medium is still not a bad idea. It gives you another channel that you fully own to communicate with your readers. And who knows? No company lasts forever. If Medium gets acquired by some other company, or something even worse happens, you can still sleep well at night knowing you won’t lose all your articles.

I built my own using Pelican, a Python-based static site generator. I wrote an article explaining the whole process. For every Medium article, I need to copy the URL, run a command to convert it into a Markdown file, then generate the blog site using Pelican. It is simple, but not as simple as I’d like it to be. So this is a great opportunity for a quick and dirty Bash script to come to the rescue. Let’s see what we can do.

Structure the Script

Before we start writing the script, it helps to lay out what we want to accomplish; this makes it easier to write quality code. Basically, we need to:

Put all article URLs into one text file manually (I plan to automate this part too in the future, maybe using some scraping framework)
Read every line of the file, and for each line:
Extract the title and subtitle
Use the title and subtitle to create meta-data needed for Pelican to turn the Markdown file into a post.
Run Pelican command to generate the static site.
Push the site to GitHub and trigger Netlify’s auto-build
Profit.

Let’s Write the Code

Photo by Shahadat Rahman on Unsplash

First of all, define our variables:

#!/bin/bash 
# Define variables
filename='articles.txt'
n=1

Then structure the loop to read every line of the text file:

    # Read the file and process each URL
    while IFS= read -r line; do
        n=$((n+1))  # count the lines processed so far
        slug=$(echo "$line" | sed 's|https://towardsdatascience\.com/||')  # get slug from URL
        FILE="$HOME/wayofnumbers.github.io/content/$slug.md"   # generate Markdown file name from slug
        mediumexporter "$line" > "$FILE"   # convert Medium article to Markdown file
        # some processing ...
    done < "$filename"

We used the sed command to remove the first part of the URL, https://towardsdatascience.com/, so the rest can be used as our slug. For example, https://towardsdatascience.com/9-things-i-learned-from-blogging-on-medium-for-the-first-month-2bace214b814 turns into 9-things-i-learned-from-blogging-on-medium-for-the-first-month-2bace214b814, perfect for a slug. Here we also use the slug to create the filename for the Markdown file. Then we use mediumexporter to convert the article at that URL into the Markdown file. You can find out more about mediumexporter here.
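As a side note, the same prefix stripping can be done without sed, using the shell's own parameter expansion. This is only a minimal sketch: slug_from_url is a hypothetical helper name I made up for illustration, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): turn an article URL into a slug.
slug_from_url() {
    local url="$1"
    local prefix='https://towardsdatascience.com/'
    # ${url#"$prefix"} removes the prefix only when the URL starts with it;
    # quoting the prefix makes the shell treat it literally, not as a pattern.
    printf '%s\n' "${url#"$prefix"}"
}

slug_from_url 'https://towardsdatascience.com/9-things-i-learned-from-blogging-on-medium-for-the-first-month-2bace214b814'
# prints: 9-things-i-learned-from-blogging-on-medium-for-the-first-month-2bace214b814
```

A URL that does not start with the prefix is printed unchanged, so you may want to check for that case before using the result as a file name.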

Now that we have the Markdown file, let’s fill in the processing code we want:

        # Processing the markdown file
        tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"  # remove the first line
        fl=$(head -n 1 "$FILE") # put first line (title) into fl
        firstline=$(echo "$fl" | sed 's/# //') # remove '# '
        tail -n +3 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"  # remove the first two lines (the title and the line after it)
        subtitle=$(head -n 1 "$FILE") # put first line (subtitle) into subtitle
        tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"  # remove the first line (the subtitle)
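
To see what these head and tail calls do, here is a minimal sketch that runs the same steps on a throwaway file. The sample layout (one leading line, a '# ' title, a blank line, then the subtitle) is only an assumption inferred from the commands above, not confirmed mediumexporter output, and the names sample, title, and remaining are made up for the demo.

```shell
#!/bin/bash
# Hypothetical sample file; the layout is an assumption, see the note above.
sample=$(mktemp)
printf '%s\n' 'leading line' '# My Sample Title' '' 'A sample subtitle' '' 'Body text.' > "$sample"

tail -n +2 "$sample" > "$sample.tmp" && mv "$sample.tmp" "$sample"  # drop line 1
title=$(head -n 1 "$sample" | sed 's/^# //')                        # title without the '# ' marker
tail -n +3 "$sample" > "$sample.tmp" && mv "$sample.tmp" "$sample"  # drop the title and the line after it
subtitle=$(head -n 1 "$sample")                                     # subtitle is now the first line
tail -n +2 "$sample" > "$sample.tmp" && mv "$sample.tmp" "$sample"  # drop the subtitle
remaining=$(cat "$sample")                                          # what is left is the article body
rm -f "$sample"

echo "Title: $title"
echo "Subtitle: $subtitle"
```

Note that tail -n +K prints from line K onward, so +2 drops one line and +3 drops two.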

These lines are fairly self-explanatory. Now that we have the title in the firstline variable and the subtitle in the subtitle variable, we are ready to construct the Markdown file's meta-data for Pelican:

    # handle metadata for Pelican  
    meta="
    Title: $firstline
    Slug: $slug
    Subtitle: $subtitle
    Date: $(date)
    Category: Machine Learning
    Tags: Machine Learning, Artificial Intelligence
    author: Michael Li
    Summary: $firstline
    [TOC]
    "

You can refer to Pelican’s documentation here for more information about the meta-data format. Simply put, the Markdown body doesn’t need to contain the title and subtitle itself. As long as we specify the title and subtitle fields in the meta-data, Pelican will generate them in the post for us, styled according to the theme we chose.
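
One possible way to keep the header tidy is to wrap it in a small function with a here-document. This is only a sketch: build_meta and the sample arguments are hypothetical names, it covers just the key: value lines, and the date format (YYYY-MM-DD HH:MM) follows the style used in Pelican's documentation examples rather than the plain date output used in the script.

```shell
#!/bin/bash
# build_meta is a hypothetical helper, not part of the original script.
# Arguments: $1 = title, $2 = slug, $3 = subtitle
build_meta() {
    cat <<EOF
Title: $1
Slug: $2
Subtitle: $3
Date: $(date '+%Y-%m-%d %H:%M')
Category: Machine Learning
Tags: Machine Learning, Artificial Intelligence
Author: Michael Li
Summary: $1
EOF
}

# Sample values made up for the demo
header=$(build_meta 'Hello World' 'hello-world-123' 'A short subtitle')
printf '%s\n' "$header"
```

A function like this makes it easy to change a field in one place, and the here-document keeps the header free of stray leading or trailing blank lines.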

With the correct meta-data in place, we can finally update the Markdown file and get it ready for site generation:

        { echo -n "$meta"; cat "$FILE"; } > "$FILE.new" # stitch meta-data and article content together
        mv "$FILE.new" "$FILE"
        head -n -8 "$FILE" > "$FILE.new" # remove Medium's recommended articles (the last 8 lines)
        mv "$FILE.new" "$FILE"
    done < "$filename"  # don't forget to close the loop
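
A portability note: negative counts such as head -n -8 are accepted by GNU head, but POSIX does not specify them, so they may fail on other systems. The sketch below shows one alternative that counts the lines first. The 12-line sample file and the names drop, total, keep, and result are made up for the demo.

```shell
#!/bin/bash
# Hypothetical 12-line sample file standing in for an exported article.
sample=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$sample"

drop=8                                 # number of trailing lines to remove
total=$(( $(wc -l < "$sample") ))      # arithmetic expansion also trims padding from wc output
keep=$((total - drop))

if [ "$keep" -gt 0 ]; then
    head -n "$keep" "$sample" > "$sample.new" && mv "$sample.new" "$sample"
fi

result=$(cat "$sample")
rm -f "$sample"
printf '%s\n' "$result"
```

The guard on keep leaves files with 8 lines or fewer untouched instead of emptying them.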

All my Medium articles end with several recommendations for further reading. I removed those for my blog (the head -n -8 line above). Now that the Markdown file is ready, it's time to generate the site and push it to the server:

    # push to server
    cd "$HOME/wayofnumbers.github.io" || exit 1   # stop if the repo directory is missing
    pelican content -s publishconf.py
    git add .
    git commit -m "fix"
    git push origin dev

Conclusion

So there you go. This script only works with the Pelican static site generator, but the gist of it applies to whatever blogging platform you use. I hope you learned a thing or two. Happy blogging/coding!