<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sean Lane</title>
    <description>The latest articles on DEV Community by Sean Lane (@seanlane).</description>
    <link>https://dev.to/seanlane</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F157625%2Fb7e8fcac-4a7e-47d4-859b-c31882a00816.jpeg</url>
      <title>DEV Community: Sean Lane</title>
      <link>https://dev.to/seanlane</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seanlane"/>
    <language>en</language>
    <item>
      <title>Running the "Real Time Voice Cloning" project in Docker</title>
      <dc:creator>Sean Lane</dc:creator>
      <pubDate>Sat, 06 Jul 2019 03:02:11 +0000</pubDate>
      <link>https://dev.to/seanlane/running-the-real-time-voice-cloning-project-in-docker-5h7</link>
      <guid>https://dev.to/seanlane/running-the-real-time-voice-cloning-project-in-docker-5h7</guid>
      <description>&lt;p&gt;I came across this awesome project called &lt;a href="https://github.com/CorentinJ/Real-Time-Voice-Cloning"&gt;Real Time Voice Cloning&lt;/a&gt; by &lt;a href="https://github.com/CorentinJ"&gt;Corentin Jemine&lt;/a&gt; and I wanted to give it a shot. I’m currently working on a Mac laptop, but I have access to a remote server with some GPUs that could easily run the toolbox, but I wanted an easy way to get everything setup. Docker would do the trick as far as getting it setup, and then through forwarding the X Window System via SSH, I could view and control the program locally as it ran remotely. Note that these steps should be more or less compatible with Linux or macOS, but maybe on Windows with the WSL. I’m not really sure, as I haven’t tested the following on anything except Linux and macOS.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 0: You should probably have access to a machine with a CUDA-compatible GPU
&lt;/h4&gt;

&lt;p&gt;Some variant of these instructions may allow the project to be run with just a CPU, but I haven’t investigated that path, so you’re on your own there.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Install &lt;code&gt;nvidia-docker&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Follow the instructions here: &lt;a href="https://github.com/NVIDIA/nvidia-docker"&gt;https://github.com/NVIDIA/nvidia-docker&lt;/a&gt;. Note that you’ll need to have the NVIDIA driver and Docker installed as well.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Clone the &lt;code&gt;Real-Time-Voice-Cloning&lt;/code&gt; project and download pretrained models
&lt;/h4&gt;

&lt;p&gt;I’ll assume that you’re working from your home directory, and we’ll make a directory called &lt;code&gt;voice&lt;/code&gt; for our project to sit in and clone the GitHub repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ~
mkdir voice &amp;amp;&amp;amp; cd voice
git clone https://github.com/CorentinJ/Real-Time-Voice-Cloning.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Next, download the pretrained models as described here: &lt;a href="https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models"&gt;https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models&lt;/a&gt;. Note that you’re expected to merge the contents with the project root directory.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Copy the Dockerfile
&lt;/h4&gt;

&lt;p&gt;Create a new file called &lt;code&gt;Dockerfile&lt;/code&gt; and insert the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM pytorch/pytorch

WORKDIR "/workspace"
RUN apt-get clean \
        &amp;amp;&amp;amp; apt-get update \
        &amp;amp;&amp;amp; apt-get install -y ffmpeg libportaudio2 openssh-server python3-pyqt5 xauth \
        &amp;amp;&amp;amp; apt-get -y autoremove \
        &amp;amp;&amp;amp; mkdir /var/run/sshd \
        &amp;amp;&amp;amp; mkdir /root/.ssh \
        &amp;amp;&amp;amp; chmod 700 /root/.ssh \
        &amp;amp;&amp;amp; ssh-keygen -A \
        &amp;amp;&amp;amp; sed -i "s/^.*PasswordAuthentication.*$/PasswordAuthentication no/" /etc/ssh/sshd_config \
        &amp;amp;&amp;amp; sed -i "s/^.*X11Forwarding.*$/X11Forwarding yes/" /etc/ssh/sshd_config \
        &amp;amp;&amp;amp; sed -i "s/^.*X11UseLocalhost.*$/X11UseLocalhost no/" /etc/ssh/sshd_config \
        &amp;amp;&amp;amp; grep "^X11UseLocalhost" /etc/ssh/sshd_config || echo "X11UseLocalhost no" &amp;gt;&amp;gt; /etc/ssh/sshd_config
ADD Real-Time-Voice-Cloning/requirements.txt /workspace/requirements.txt
RUN pip install -r /workspace/requirements.txt
RUN echo "&amp;lt;REPLACE THIS SENTENCE (INCLUDING ARROWS) WITH YOUR SSH PUBLIC KEY ON THE DOCKER HOST" \ 
    &amp;gt;&amp;gt; /root/.ssh/authorized_keys
RUN echo "export PATH=/opt/conda/bin:$PATH" &amp;gt;&amp;gt; /root/.profile
ENTRYPOINT ["sh", "-c", "/usr/sbin/sshd &amp;amp;&amp;amp; bash"]
CMD ["bash"]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;A rough summary of the above is that we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;a href="https://hub.docker.com/r/pytorch/pytorch/"&gt;pytorch docker image&lt;/a&gt; as our base image&lt;/li&gt;
&lt;li&gt;Update the image repos&lt;/li&gt;
&lt;li&gt;Install some dependencies

&lt;ul&gt;
&lt;li&gt;ffmpeg as a backend for PortAudio&lt;/li&gt;
&lt;li&gt;libportaudio2 for audio manipulation (?)&lt;/li&gt;
&lt;li&gt;openssh-server to SSH into the container&lt;/li&gt;
&lt;li&gt;python3-pyqt5 for the QT bindings (installing via pip didn’t seem to work for me)&lt;/li&gt;
&lt;li&gt;xauth for X forwarding&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Set up the container to allow you to SSH in. This may not be secure, so I don’t advise using it on any sort of public-facing machine. Use at your discretion.&lt;/li&gt;
&lt;li&gt;Allow X forwarding with the SSH server within the container&lt;/li&gt;
&lt;li&gt;Add the repo’s &lt;code&gt;requirements.txt&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Install those requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action Required!!!:&lt;/strong&gt; Insert &lt;strong&gt;your&lt;/strong&gt; &lt;code&gt;SSH public key&lt;/code&gt; so you can SSH into the container&lt;/li&gt;
&lt;li&gt;Add the right Python interpreter to the root user’s PATH&lt;/li&gt;
&lt;li&gt;Make sure the SSH server is running when the container starts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that if you plan on SSH’ing into the Docker host too (as I did from my laptop), you need to set &lt;code&gt;X11Forwarding&lt;/code&gt; to &lt;code&gt;yes&lt;/code&gt; in &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; on the docker host as well, then reload and restart the SSH daemon (on Ubuntu this was &lt;code&gt;systemctl daemon-reload &amp;amp;&amp;amp; systemctl restart sshd&lt;/code&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Modify your SSH config
&lt;/h4&gt;

&lt;p&gt;Add the following to your SSH config at &lt;code&gt;~/.ssh/config&lt;/code&gt; on the docker host (or create the file if it doesn’t exist):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host voice 
    Hostname localhost 
    Port 2150 
    User root 
    ForwardX11 yes 
    ForwardX11Trusted yes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 5: Build the container
&lt;/h4&gt;

&lt;p&gt;Run the following command to build the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t voice-base .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You should be able to run the following to test (the &lt;code&gt;docker run&lt;/code&gt; command drops you into a shell inside the container, and the remaining commands are run from that shell):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it --rm --init --runtime=nvidia \
    --ipc=host --volume="$PWD:/workspace" \
    -e NVIDIA_VISIBLE_DEVICES=0 -p 2150:22 \
    --device /dev/snd voice-base
nvidia-smi
cd /workspace/Real-Time-Voice-Cloning
python demo_cli.py
exit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 6: Start the container
&lt;/h4&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it --rm --init --runtime=nvidia \
    --ipc=host --volume="$PWD:/workspace" \
    -e NVIDIA_VISIBLE_DEVICES=0 -p 2150:22 \
    --device /dev/snd voice-base
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The option &lt;code&gt;--device /dev/snd&lt;/code&gt; should allow the container to pass sound to the docker host, though I wasn’t able to get sound working going from &lt;code&gt;laptop-&amp;gt;docker_host-&amp;gt;container&lt;/code&gt;. I modified the &lt;code&gt;Real-Time-Voice-Cloning&lt;/code&gt; project to save the output audio as a WAV file instead of playing within the application, and then copied the file locally to listen to the results.&lt;/p&gt;
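
&lt;p&gt;For reference, my change amounted to roughly the following sketch. The variable names here are placeholders rather than the project’s exact identifiers, and &lt;code&gt;soundfile&lt;/code&gt; is assumed to be available through the project’s dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import soundfile as sf  # assumed to be pulled in by the project's requirements

# Stand-ins for what the demo produces: in the real script these come from
# the vocoder output and the synthesizer's sample rate.
generated_wav = np.zeros(16000, dtype=np.float32)
sample_rate = 16000

# Instead of playing the audio through the sound device, write it to a WAV
# file that can be copied off the container and listened to locally.
sf.write("demo_output.wav", generated_wav, sample_rate)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;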

&lt;p&gt;At this point, the container should be running and occupying that terminal, so open up a new terminal shell.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 7: SSH into the container
&lt;/h4&gt;

&lt;p&gt;From the docker host, this is done with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -X voice
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;voice&lt;/code&gt; refers to the name of the host we configured in Step 4.&lt;/p&gt;

&lt;p&gt;To connect from a macOS machine to the docker host, follow these steps from &lt;a href="https://uisapp2.iu.edu/confluence-prd/pages/viewpage.action?pageId=280461906"&gt;Indiana University&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install &lt;a href="http://xquartz.macosforge.org/"&gt;XQuartz&lt;/a&gt; on your Mac, which is the official X server software for Mac&lt;/li&gt;
&lt;li&gt;Run Applications &amp;gt; Utilities &amp;gt; XQuartz.app&lt;/li&gt;
&lt;li&gt;Right click on the XQuartz icon in the dock and select Applications &amp;gt; Terminal. This should bring up a new xterm terminal window.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From there, you will SSH into the docker host…:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -X username@my.docker.host.tld
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;…and then SSH into the docker container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -X voice
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 8: Run and play with the toolbox
&lt;/h4&gt;

&lt;p&gt;Now that we have a terminal session that has X11 forwarding, we can navigate to the project directory and run the toolbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /workspace/Real-Time-Voice-Cloning
python demo_toolbox.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that you’ll need to provide audio in the form of the datasets discussed in the &lt;a href="https://github.com/CorentinJ/Real-Time-Voice-Cloning#datasets"&gt;README of the project&lt;/a&gt;, or upload your own audio samples to the container and then browse to them within the toolbox application. This should be straightforward, since the project directory on the docker host is mounted within the container.&lt;/p&gt;

&lt;p&gt;I realize that some of the methods used here probably aren’t best practice, but they worked for playing around with this great project over a holiday weekend and I hope they prove helpful to someone.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>docker</category>
    </item>
    <item>
      <title>Extracting Entries from jrnl.com</title>
      <dc:creator>Sean Lane</dc:creator>
      <pubDate>Mon, 13 Aug 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/seanlane/extracting-entries-from-jrnl-com-2j79</link>
      <guid>https://dev.to/seanlane/extracting-entries-from-jrnl-com-2j79</guid>
      <description>&lt;p&gt;A number of years ago, my wife began journaling her thoughts in an online service called LDSJournal.com (at least I believe that was the name). About 2 years ago, this service was acquired by a new site called &lt;a href="https://jrnl.com"&gt;jrnl.com&lt;/a&gt;. It seems to be a fairly neat service, but one thing we were concerned with is preserving the data should the account ever disappear.&lt;/p&gt;

&lt;p&gt;Unfortunately, the only export option with jrnl.com seems to be the ability to download a PDF file that is created when you pay the service to have your journal entries printed physically.&lt;sup id="fnref:1"&gt;1&lt;/sup&gt; According to that same source, there will allegedly be an option to back up the journal entries without having to purchase a physical copy, but it has now been over a year since that helpdesk article promised the feature would be completed before the end of 2017. Aside from that, there could be loads of potential issues extracting my wife’s writings from the PDF file they produce, depending on how it’s put together. With that in mind, I used the following steps to retrieve her content.&lt;/p&gt;

&lt;p&gt;In a similar manner to this article: &lt;a href="https://ianlondon.github.io/blog/web-scraping-discovering-hidden-apis/"&gt;Ian London: Web Scraping - Discovering Hidden APIs&lt;/a&gt;, I used the outgoing connections from the jrnl.com web application to identify their hidden API with which to access the entries. After logging into the service and navigating to the journal entries, you can view the request headers that your browser sends to jrnl.com to retrieve the entries and other content. The API key is one of these headers, with the first portion visible in the image below under &lt;code&gt;Authorization&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Getting the API Key
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kW1eN6SE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sean.lane.sh/images/2018/08/jrnl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kW1eN6SE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sean.lane.sh/images/2018/08/jrnl1.png" alt="Getting the API Key"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some more poking around showed that the base url for the API is &lt;code&gt;https://jrnl.com/api/v1/&lt;/code&gt;, and the API endpoint for the entries is, unsurprisingly, &lt;code&gt;https://jrnl.com/api/v1/entry&lt;/code&gt;. Using a REST API tool called &lt;a href="https://dev.to/scottw/insomnia-rest-client-578d-temp-slug-9682618"&gt;Insomnia&lt;/a&gt;, we can plug in the API key and use the endpoint with the limit option set to allow more entries returned: &lt;code&gt;https://jrnl.com/api/v1/entry?limit=250&lt;/code&gt;.&lt;/p&gt;
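
&lt;p&gt;For the curious, the same request can also be made with a few lines of Python instead of Insomnia. This is a rough sketch: the &lt;code&gt;Authorization&lt;/code&gt; value is a placeholder you copy verbatim from your browser’s request headers, and the response body may wrap the entries, so you may still need to isolate the entry array by hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python3
# Rough sketch of fetching the entries with the requests library.
import json
import requests

API_BASE = 'https://jrnl.com/api/v1'
headers = {'Authorization': 'PASTE THE AUTHORIZATION HEADER FROM YOUR BROWSER HERE'}

resp = requests.get(API_BASE + '/entry', params={'limit': 250}, headers=headers)
resp.raise_for_status()
body = resp.json()

# The conversion script below expects a bare JSON array of entries in
# posts.json, so pull that array out of the response body if it is nested.
with open('posts.json', 'w') as f:
    json.dump(body, f, indent=2)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;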

&lt;p&gt;Then, using something like the following Python script, you can convert the posts into a format for import elsewhere. This script is one I used to prepare the entries for import into the &lt;a href="https://ghost.org"&gt;Ghost CMS platform&lt;/a&gt;, which I set up following the instructions in my previous post. There is a little more post-processing needed to get everything into Ghost, but if you made it this far, following the Ghost documentation will get you the rest of the way. The script assumes that the entries from jrnl.com are isolated and saved as a JSON array in a file called &lt;code&gt;posts.json&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#! /usr/bin/env python3
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;millis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fromisoformat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'posts.json'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;new_posts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s"&gt;'slug'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s"&gt;'html'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'content'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s"&gt;'image'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'featured'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'page'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'status'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'published'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'language'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'en_US'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'meta_title'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'meta_description'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'author_id'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;millis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'created'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="s"&gt;'created_by'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'updated_at'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;millis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'modified'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="s"&gt;'updated_by'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'published_at'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;millis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'entry_date'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="s"&gt;'published_by'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;new_posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'new_posts.json'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'w'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_posts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;






&lt;ol&gt;
&lt;li&gt;&lt;a href="http://helpdesk.jrnl.com/kb/article/150-can-i-backup-my-jrnl/"&gt;http://helpdesk.jrnl.com/kb/article/150-can-i-backup-my-jrnl/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Example for LaTeX Funeral or Memorial Program</title>
      <dc:creator>Sean Lane</dc:creator>
      <pubDate>Wed, 06 Jun 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/seanlane/example-for-latex-funeral-or-memorial-program-53hj</link>
      <guid>https://dev.to/seanlane/example-for-latex-funeral-or-memorial-program-53hj</guid>
      <description>&lt;p&gt;Another quick post, but something that I hope might be useful to others. Within the past couple weeks, my grandmother of 83 years passed away and my family held a memorial service in her honor. I was asked if I could help out in creating a program booklet or pamphlet that could be given out to the attendees, something that would describe the service itself as well as share a piece of my grandmother’s life with them as we gathered to remember her. Grandma Arlene was a classy lady and I wanted to help her leave a lasting impression on all of those who could make it out to honor her life. As a graduate student in Computer Science, I felt like using LaTeX would be a great way to do so, though my Internet searches fell somewhat short of what I was looking for. We wanted a simple layout consisting of four “pages”, two of which would be printed on a single side of standard sized US letter paper, and then folded into a four page pamphlet after printing. This could also serve well for someone looking for a LaTeX template for religious or other services where a 4 page booklet is desired.&lt;/p&gt;

&lt;p&gt;The files for this project can be found on GitHub here: &lt;a href="https://github.com/seanlane/LaTeX-Funeral-Program"&gt;Example for LaTeX Funeral or Memorial Program&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project makes use of standard LaTeX components as well as the &lt;code&gt;pgfornament&lt;/code&gt; package to add some style to the program. The general process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Edit the &lt;code&gt;main.tex&lt;/code&gt; file as needed

&lt;ul&gt;
&lt;li&gt;Note that you will need to adjust the paper size to fit onto half of the sheet size you intend to use. As it currently stands, it’s set to be printed on an 8.5 by 5.5 inch section of 8.5 by 11 inch US letter paper&lt;/li&gt;
&lt;li&gt;Also modify your images as needed. My grandmother was a prodigious quilter, and we wanted to have one of her quilts serve as the background for the third page where the actual service is described.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;Makefile&lt;/code&gt; default command to make the main file, which will produce 4 pages on the size of paper specified in &lt;code&gt;main.tex&lt;/code&gt; and then take those four pages and place them on both sides of US letter paper as described by &lt;code&gt;booklet.tex&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;I found that the pages of the output &lt;code&gt;booklet.pdf&lt;/code&gt; were still in portrait orientation, so I used Apple Preview to rotate the pages. There is likely a programmatic way to address that issue, but I never bothered to resolve it (a rough sketch of one option follows this list).&lt;/li&gt;
&lt;/ol&gt;
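
&lt;p&gt;If you do want to handle the rotation programmatically, something along these lines should work. This is an untested sketch using the &lt;code&gt;pypdf&lt;/code&gt; library, which is not part of the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Untested sketch: rotate every page of booklet.pdf into landscape with pypdf.
from pypdf import PdfReader, PdfWriter

reader = PdfReader('booklet.pdf')
writer = PdfWriter()
for page in reader.pages:
    page.rotate(90)       # rotate clockwise by 90 degrees
    writer.add_page(page)

with open('booklet-rotated.pdf', 'wb') as f:
    writer.write(f)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;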

&lt;p&gt;Below are screenshots of the output.&lt;/p&gt;




&lt;p&gt;First page&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YhqrdWeY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sean.lane.sh/images/2018/06/latex_funeral_program_pg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YhqrdWeY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sean.lane.sh/images/2018/06/latex_funeral_program_pg1.png" alt="First page of the final PDF output"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Second page&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3b2aEksx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sean.lane.sh/images/2018/06/latex_funeral_program_pg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3b2aEksx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sean.lane.sh/images/2018/06/latex_funeral_program_pg2.png" alt="Second page of the final PDF output"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>latex</category>
    </item>
    <item>
      <title>Dynamically updating Matplotlib figures in Jupyter notebooks</title>
      <dc:creator>Sean Lane</dc:creator>
      <pubDate>Sat, 24 Feb 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/seanlane/dynamically-updating-matplotlib-figures-in-jupyter-notebooks-2ie8</link>
      <guid>https://dev.to/seanlane/dynamically-updating-matplotlib-figures-in-jupyter-notebooks-2ie8</guid>
      <description>&lt;p&gt;Updating matplotlib figures dynamically seems to be a bit of a hassle, but the code below seems to do the trick. This is an example that outputs a figure with multiple subplots, each with multiple plots. Oddly enough, at the time of writing the image will be smaller than the figure until the Jupyter cells stops running, but this can be fixed but generating the figure in one cell, and then updating the image in a subsequent cell &lt;sup id="fnref:1"&gt;1&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This code is run with the assumption that the following data file can be found in the working directory named &lt;code&gt;data.txt&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sean.lane.sh/files/2018/sample-tcl-data.txt"&gt;Sample TCL data&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;notebook&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;genfromtxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'data.txt'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;Q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;Q2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;T1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;T2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Q1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Q2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Q1s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Q2s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T1s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T2s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;r'$T_1$ measured'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;r'$T_2$ measured'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# [r'$T_1 set point$', r'$T_2 set point$'],
&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;r'$Q_1$'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;r'$Q_2$'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;colors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'r:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b-'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'r:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'bx'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;x_labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;x_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num_subplots&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;y_labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;y_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num_subplots&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dpi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots_adjust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;axes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_subplots&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_subplots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;y_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 
        &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])):&lt;/span&gt;
            &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plot_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;num_subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;x_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Time (sec)'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;y_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Temps (C)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Heaters'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T1s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;T2s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Q1s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Q2s&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;plot_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m_time&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
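
&lt;p&gt;To illustrate the two-cell workaround from the footnote, here is a self-contained toy sketch that is independent of the TCL data above: build the figure in one cell and let that cell finish, then drive the updates from the next cell so the figure renders at full size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# --- Cell 1: build the (empty) figure and let this cell finish ---
%matplotlib notebook
import time
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
ax.grid()

# --- Cell 2: drive the updates from a separate cell ---
x = np.linspace(0, 10, 200)
for i in range(2, len(x) + 1):
    ax.plot(x[:i], np.sin(x[:i]), 'b-')
    fig.canvas.draw()
    time.sleep(0.05)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;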






&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://stackoverflow.com/questions/45384072/jupyter-notebook-matplotlib-figures-show-up-small-until-cell-is-completed"&gt;Stack Overflow: Jupyter notebook matplotlib figures show up small until cell is completed&lt;/a&gt;&lt;sup&gt;[return]&lt;/sup&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>python</category>
      <category>jupyter</category>
      <category>matplotlib</category>
    </item>
    <item>
      <title>Setting up a new Python virtual environment for Jupyter notebooks</title>
      <dc:creator>Sean Lane</dc:creator>
      <pubDate>Fri, 23 Feb 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/seanlane/setting-up-a-new-python-virtual-environment-for-jupyter-notebooks-4aca</link>
      <guid>https://dev.to/seanlane/setting-up-a-new-python-virtual-environment-for-jupyter-notebooks-4aca</guid>
      <description>&lt;p&gt;A lot of my lab work and course work involved the use of Jupyter notebooks, though the Python dependencies needed conflict with other areas. I’ve been using &lt;a href="https://virtualenvwrapper.readthedocs.io/en/latest/"&gt;virtualenvwrapper&lt;/a&gt; to isolate these, and other project, environments from each other. This post goes through the process of installing everything needed to get up and running with a clean Python environment for Jupyter notebooks with separate kernels for each environment, including the installation of &lt;code&gt;jupyter_contrib_nbextensions&lt;/code&gt; which adds community developed features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial setup
&lt;/h2&gt;

&lt;p&gt;This only needs to be done once on your machine/user account, in order to get the building blocks in place for creating an indefinite number of Python virtual environments. First, you should install a suitable copy of Python on your machine. For macOS, I recommend using the &lt;a href="https://brew.sh"&gt;Homebrew&lt;/a&gt; package manager (installation instructions at the link), then installing Python through it. Note that I’m using Python 3 since Python 2 reaches end of life in 2020, but if you’re on macOS, consider installing Python 2 via Homebrew as well, since the system copy seems to be antiquated. Anyway, to install on macOS via Homebrew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;python3 &lt;span class="c"&gt;# Follow any instructions given here from the output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In Ubuntu/Debian based systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;python3 python3-pip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;On Arch Linux based systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pacman &lt;span class="nt"&gt;-S&lt;/span&gt; python-virtualenvwrapper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, assuming Python 3 and &lt;code&gt;pip&lt;/code&gt; are both installed, install &lt;code&gt;virtualenvwrapper&lt;/code&gt; and modify your shell start up file according to these instructions: &lt;a href="https://virtualenvwrapper.readthedocs.io/en/latest/install.html"&gt;Install &lt;code&gt;virtualenvwrapper&lt;/code&gt;&lt;/a&gt;. I do the following for my system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;virtualenvwrapper
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"export WORKON_HOME=&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.virtualenvs"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/.profile
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"source /usr/local/bin/virtualenvwrapper.sh"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;" &amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.profile
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;source ~/.profile
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating new virtual environments
&lt;/h2&gt;

&lt;p&gt;Now every time you need to create a new environment, use the following as an example. My example virtualenv will be named &lt;code&gt;example&lt;/code&gt;, we’ll install Jupyter and any other dependencies, and we’ll add a line to &lt;code&gt;$VIRTUAL_ENV/bin/postactivate&lt;/code&gt; so that when activating the environment, our current working directory will be switched to our project directory &lt;code&gt;~/path/to/example/code&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mkvirtualenv example &lt;span class="nt"&gt;-p&lt;/span&gt; python3 &lt;span class="c"&gt;# Note we specify which interpreter to use&lt;/span&gt;
&lt;span class="gp"&gt;(example) $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"cd &lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/path/to/example/code"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$VIRTUAL_ENV&lt;/span&gt;/bin/postactivate
&lt;span class="gp"&gt;(example) $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ipykernel
&lt;span class="gp"&gt;(example) $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jupyter_contrib_nbextensions
&lt;span class="gp"&gt;(example) $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;jupyter contrib nbextension &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--sys-prefix&lt;/span&gt; &lt;span class="c"&gt;# Kinda important&lt;/span&gt;
&lt;span class="gp"&gt;(example) $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jupyter_nbextensions_configurator
&lt;span class="gp"&gt;(example) $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;jupyter nbextensions_configurator &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--sys-prefix&lt;/span&gt;
&lt;span class="gp"&gt;(example) $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; ipykernel &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;example
&lt;span class="gp"&gt;(example) $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &amp;lt;anything-else-you-want&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Note that after creating the virtualenv &lt;code&gt;example&lt;/code&gt;, the environment is automatically activated (which you can tell by the &lt;code&gt;(example)&lt;/code&gt; prefix in your terminal, as well as by running &lt;code&gt;which python&lt;/code&gt;, which should output a path to the Python interpreter belonging to the environment). When the environment is activated, any calls to &lt;code&gt;python&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt; use the environment’s own interpreter, which is why we didn’t have to call &lt;code&gt;pip3&lt;/code&gt; instead of &lt;code&gt;pip&lt;/code&gt;. Note that for installing &lt;code&gt;jupyter contrib nbextension&lt;/code&gt; and &lt;code&gt;jupyter nbextensions_configurator&lt;/code&gt;, we used the &lt;code&gt;--sys-prefix&lt;/code&gt; option, which configures these extensions for use in the virtual environment rather than the global system environment, which is what we’re trying to isolate ourselves from.&lt;/p&gt;

</description>
      <category>python</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>Send a fax from the command line with Python and Phaxio</title>
      <dc:creator>Sean Lane</dc:creator>
      <pubDate>Wed, 30 Aug 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/seanlane/send-a-fax-from-the-command-line-with-python-and-phaxio-2hl8</link>
      <guid>https://dev.to/seanlane/send-a-fax-from-the-command-line-with-python-and-phaxio-2hl8</guid>
      <description>&lt;p&gt;&lt;em&gt;Note in 2019: I've created a small side project website that let's you quickly send off a fax with no hassle in case you don't want to mess with a Python script: &lt;a href="https://faxasap.com"&gt;FaxASAP.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you want to send a simple fax quickly, cheaply, and painlessly, Phaxio and Python make a nice combo. Below is a little script that I wrote, based on this &lt;a href="https://www.petekeen.net/command-line-faxing"&gt;Ruby script by Pete Keen&lt;/a&gt;, which is slightly out of date. Phaxio has Python libraries, but I ran into a couple of issues, and this seems to be the most brain-dead simple solution. Pros: no external dependencies. Cons: it uses the &lt;code&gt;shell=True&lt;/code&gt; parameter for &lt;code&gt;subprocess.call&lt;/code&gt;, but that shouldn’t be an issue since you’re only using this to send a quick fax at 2 AM and you don’t want to pay UPS/FedEx/whomever too much money for that privilege tomorrow, right?&lt;/p&gt;

&lt;p&gt;Note that I’m not affiliated with Phaxio in any way; it just happens to be late, I needed to send a fax, and they checked all the right boxes when I stumbled on them. For someone who only wants to send a quick fax once every year or so, it’s great. Pricing is about $0.07 a page (and I received $1.00 account credit just for signing up at the time of writing this), so it’s perfect for my use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for an account with &lt;a href="https://www.phaxio.com/"&gt;Phaxio: https://www.phaxio.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Get your API Keys: &lt;a href="https://console.phaxio.com/api_credentials"&gt;Phaxio API Credentials&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Put them into the script below (you can also use the test keys to make sure this works before sending a real fax)&lt;/li&gt;
&lt;li&gt;Run the script, for example, if I saved the script to &lt;code&gt;fax.py&lt;/code&gt;, I’m sending to Tommy Tutone, and my file to send is &lt;code&gt;letter.pdf&lt;/code&gt;, I would use the following: &lt;code&gt;./fax.py +15558675309 /path/to/letter.pdf&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;subprocess&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Usage: send_fax NUMBER FILENAME..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   
&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'put_api_key_here'&lt;/span&gt;
&lt;span class="n"&gt;api_secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'put_api_secret_here'&lt;/span&gt;

&lt;span class="n"&gt;command_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s"&gt;"curl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"https://api.phaxio.com/v2/faxes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"-u '{}:{}'"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_secret&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="s"&gt;"-F 'to={}'"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:]:&lt;/span&gt;
    &lt;span class="n"&gt;command_args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"-F 'file=@{}'"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command_args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The script can be found in this GitHub Gist here: &lt;a href="https://gist.github.com/seanlane/67504bf39696de8c0bc88ad89844f9df"&gt;https://gist.github.com/seanlane/67504bf39696de8c0bc88ad89844f9df&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to fork it and suggest improvements.&lt;/p&gt;
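
&lt;p&gt;As an aside, if the &lt;code&gt;shell=True&lt;/code&gt; caveat above bothers you, the same request can be made without shelling out to &lt;code&gt;curl&lt;/code&gt; at all. Here’s a rough, untested sketch that uses the &lt;code&gt;requests&lt;/code&gt; library (which gives up the “no external dependencies” pro), hitting the same endpoint and form fields as the curl command above:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
# Rough, untested sketch: same Phaxio request as above, made with the
# requests library instead of shelling out to curl (so no shell=True).
import sys
import requests

if len(sys.argv) &amp;lt;= 2:
    print("Usage: send_fax NUMBER FILENAME...")
    sys.exit(1)

api_key = 'put_api_key_here'
api_secret = 'put_api_secret_here'

# One ('file', open file) pair per attachment, matching the -F 'file=@...' flags
files = [('file', open(path, 'rb')) for path in sys.argv[2:]]

response = requests.post(
    'https://api.phaxio.com/v2/faxes',
    auth=(api_key, api_secret),   # HTTP basic auth, same as curl's -u flag
    data={'to': sys.argv[1]},     # destination number
    files=files,
)
print(response.status_code, response.text)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;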

</description>
      <category>python</category>
    </item>
    <item>
      <title>PySpark and Latent Dirichlet Allocation</title>
      <dc:creator>Sean Lane</dc:creator>
      <pubDate>Tue, 10 May 2016 00:00:00 +0000</pubDate>
      <link>https://dev.to/seanlane/pyspark-and-latent-dirichlet-allocation-m9p</link>
      <guid>https://dev.to/seanlane/pyspark-and-latent-dirichlet-allocation-m9p</guid>
      <description>&lt;p&gt;This past semester (Spring of 2016), I had the chance to take two courses: Statistical Machine Learning from a Probabilistic Perspective (it’s a bit of a mouthful) and Big Data Science &amp;amp; Capstone. In the former, we had the chance to study the breadth of various statistical machine learning algorithms and processes that have flourished in recent years. This included a number of different topics ranging from Gaussian Mixture Models to Latent Dirichlet Allocation. In the latter, our class divided into groups to work on a capstone project with one of a number of great companies or organizations. It was only a 3 credit-hour course, so it was a less intensive project than a traditional capstone course that is a student’s sole focus for an entire semester, but it was a great experience nonetheless. The Big Data science course taught us some fundamentals with big data science and normal data analysis (ETL, MapReduce, Hadoop, Weka, etc.) and then released us off into the wild blue yonder to see what we could accomplish with our various projects.&lt;/p&gt;

&lt;p&gt;For the Big Data course, my team was actually assigned two projects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Attempting to track illness and outbreaks using social media&lt;/li&gt;
&lt;li&gt;Creating a module for Apache PySpark to conduct Sensitivity Analysis of &lt;code&gt;pyspark.ml&lt;/code&gt; models&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both of these projects involved Apache PySpark, and as a result I became familiar with it at a basic level. For a final project in the Statistical Machine Learning class, I considered how I could bring the two experiences together, and thought of using the LDA capabilities of PySpark to model some of the social media data that my Big Data group had already gathered. My idea was that if we could cluster the social media content, then we could find further patterns or filter out bad data, for example. When my class implemented LDA models by hand, they took a considerable amount of time to run, but I felt that using PySpark on a cluster of computers would let me work with a respectable amount of the social media data we had gathered. I came across a few tutorials and examples of using LDA within Spark, but all of the ones I found were written in Scala. It is not a very difficult leap from the Scala API to PySpark, but I felt that a PySpark version would be useful to some.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary explanation of Latent Dirichlet Allocation
&lt;/h2&gt;

&lt;p&gt;The article that I mostly referenced when completing my own analysis can be found here: &lt;a href="https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html"&gt;Topic modeling with LDA: MLlib meets GraphX&lt;/a&gt;. There, Joseph Bradley gives an apt description of what topic modeling is, how LDA approaches it, and what it can be used for. I’ll briefly summarize his remarks and refer you to the Databricks blog and other resources for deeper coverage. Topic modeling takes “documents”, whether they are actual documents, sentences, tweets, or something else, and infers the topics of those documents. LDA does so by interpreting topics as unseen, or latent, distributions over all of the possible words (the vocabulary) across all of the documents (the corpus). It was originally developed for text analysis, but is now used in a number of different fields.&lt;/p&gt;
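
&lt;p&gt;To make the idea of topics as latent distributions a little more concrete, here is a toy sketch of the generative story that LDA assumes. The vocabulary, document length, and Dirichlet hyperparameters below are made up purely for illustration:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# Made-up vocabulary and hyperparameters, purely for illustration
vocabulary = ['game', 'pitcher', 'windows', 'driver', 'faith', 'scripture']
num_topics = 3
alpha, beta = 0.5, 0.5

np.random.seed(0)

# Each topic is a latent distribution over the entire vocabulary
topics = np.random.dirichlet([beta] * len(vocabulary), size=num_topics)

def generate_document(num_words=8):
    # Each document gets its own mixture over topics...
    topic_mixture = np.random.dirichlet([alpha] * num_topics)
    words = []
    for _ in range(num_words):
        # ...then every word is drawn by picking a topic, then a word from that topic
        z = np.random.choice(num_topics, p=topic_mixture)
        words.append(np.random.choice(vocabulary, p=topics[z]))
    return words

print(generate_document())
# LDA inference works backwards: given only the observed words, it estimates
# the topic distributions and per-document mixtures that could have produced them.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;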

&lt;h2&gt;
  
  
  Example in PySpark
&lt;/h2&gt;

&lt;p&gt;This example will follow the LDA example given in the Databricks blog post, but it should be fairly trivial to extend to whatever corpus you may be working with. In this example, we will take articles from 3 newsgroups, process them using the LDA functionality of &lt;code&gt;pyspark.mllib&lt;/code&gt;, and see if we can validate the process by recognizing 3 distinct topics.&lt;/p&gt;

&lt;p&gt;The first step is to gather your corpus. As I previously mentioned, we’ll use the discussions from 3 newsgroups. The entire set can be found here: &lt;a href="http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html"&gt;20 Newsgroups&lt;/a&gt;. For this example, I picked 3 groups whose topics seem to be fairly distinct from each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;comp.os.ms-windows.misc&lt;/li&gt;
&lt;li&gt;rec.sport.baseball&lt;/li&gt;
&lt;li&gt;talk.religion.misc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I extracted the collection of discussions, and then put all of the discussions into one directory to form my corpus. Then we can point the PySpark script to this directory to pull the documents in. The entirety of the code used in this example can be found at the bottom of this post.&lt;/p&gt;
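
&lt;p&gt;For reference, that flattening step might look something like the sketch below. It is not part of the Spark script itself, and the directory names are just assumptions about how the archive was extracted (one folder per newsgroup); the output directory matches the &lt;code&gt;newsgroup/files&lt;/code&gt; path used by the script later on:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import glob
import os
import shutil

# Assumed layout after extracting the archive: one folder per newsgroup.
# Adjust the paths to wherever you unpacked it.
groups = ['comp.os.ms-windows.misc', 'rec.sport.baseball', 'talk.religion.misc']
corpus_dir = 'newsgroup/files'

os.makedirs(corpus_dir, exist_ok=True)

for group in groups:
    for path in glob.glob(os.path.join(group, '*')):
        # Prefix with the group name so file names stay unique in the flat directory
        dest = os.path.join(corpus_dir, group + '_' + os.path.basename(path))
        shutil.copy(path, dest)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;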

&lt;p&gt;The first actual bit of code will initialize our SparkContext:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.mllib.linalg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Vectors&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.mllib.clustering&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LDAModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLContext&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;num_of_stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="c1"&gt;# Number of most common words to remove, trying to eliminate stop words
&lt;/span&gt;&lt;span class="n"&gt;num_topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="c1"&gt;# Number of topics we are looking for
&lt;/span&gt;&lt;span class="n"&gt;num_words_per_topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="c1"&gt;# Number of words to display for each topic
&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt; &lt;span class="c1"&gt;# Max number of times to iterate before finishing
&lt;/span&gt;
&lt;span class="c1"&gt;# Initialize
&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'local'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'PySPARK LDA Example'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sql_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then we’ll pull in the data and tokenize it to form our global vocabulary:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wholeTextFiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'newsgroup/files/*'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;s;,#]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isalpha&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here we process the corpus by doing the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load each file as an individual document&lt;/li&gt;
&lt;li&gt;Strip any leading or trailing whitespace&lt;/li&gt;
&lt;li&gt;Convert all characters into lowercase where applicable&lt;/li&gt;
&lt;li&gt;Split each document into words, separated by whitespace, semi-colons, commas, and octothorpes&lt;/li&gt;
&lt;li&gt;Only keep the words that are all alphabetical characters&lt;/li&gt;
&lt;li&gt;Only keep words larger than 3 characters&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This then leaves us with each document represented as a list of words that are hopefully more insightful than words like “the”, “and”, and other small words that we suspect are inconsequential to the topics we are hoping to find. The next step is to then generate our global vocabulary:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get our vocabulary
# 1. Flat map the tokens -&amp;gt; Put all the words in one giant list instead of a list per document
# 2. Map each word to a tuple containing the word, and the number 1, signifying a count of 1 for that word
# 3. Reduce the tuples by key, i.e.: Merge all the tuples together by the word, summing up the counts
# 4. Reverse the tuple so that the count is first...
# 5. ...which will allow us to sort by the word count
&lt;/span&gt;
&lt;span class="n"&gt;termCounts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduceByKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sortByKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The above code performs the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Flattens the corpus, aggregating all of the words into one giant list of words&lt;/li&gt;
&lt;li&gt;Maps each word to a tuple with the number &lt;code&gt;1&lt;/code&gt;, indicating a count of one for that word&lt;/li&gt;
&lt;li&gt;Reduces the tuples by key, finding all of the instances of any given word and adding up their respective counts&lt;/li&gt;
&lt;li&gt;Inverts each tuple, so that the word count precedes each word…&lt;/li&gt;
&lt;li&gt;…which then allows us to sort by the count for each word.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We now have a list of tuples, sorted in descending order by the number of times each word appears in the corpus. We can use this to remove the most common words, which are likely to be stop words (like “the”, “and”, “from”) that are not distinctive to any given topic and are about equally likely to appear in all of them. We identify which words to remove by deciding to drop the top &lt;code&gt;k&lt;/code&gt; words, finding the count of the word that sits &lt;code&gt;k&lt;/code&gt; deep in the list, and then removing any word that occurs that many times or more. After this, we index each remaining word, giving each one a unique id, and collect them into a map.&lt;/p&gt;
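
&lt;p&gt;As a tiny toy illustration of that threshold logic, using plain Python lists in place of the RDD and made-up counts:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy stand-in for termCounts: (count, word) pairs, already sorted descending
term_counts = [(120, 'the'), (95, 'and'), (60, 'from'),
               (12, 'baseball'), (9, 'windows'), (7, 'faith')]

num_of_stop_words = 3

# The count of the word sitting num_of_stop_words deep in the list
threshold_value = term_counts[num_of_stop_words - 1][0]   # 60

# Keep only words that occur strictly less often than that threshold,
# then give each surviving word a unique id
kept_words = [word for (count, word) in term_counts if count &amp;lt; threshold_value]
vocabulary = {word: index for index, word in enumerate(kept_words)}
# vocabulary == {'baseball': 0, 'windows': 1, 'faith': 2}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here is the Spark version of the same logic:&lt;/p&gt;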



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Identify a threshold to remove the top words, in an effort to remove stop words
&lt;/span&gt;&lt;span class="n"&gt;threshold_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;termCounts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_of_stop_words&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;num_of_stop_words&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Only keep words with a count less than the threshold identified above, 
# and then index each one and collect them into a map
&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;termCounts&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zipWithIndex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collectAsMap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This leaves us with a vocabulary that maps each remaining word to a unique id, with the most common words removed. The next step is to represent each document as a vector of word counts. What this means is that instead of each document being a sequence of words, we will have a vector the size of the global vocabulary, where the value of each cell is the count of the word whose id is the index of that cell.&lt;/p&gt;
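
&lt;p&gt;As a small, made-up illustration of that representation:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.mllib.linalg import Vectors

# Made-up vocabulary: word to unique id, as produced by collectAsMap() above
vocabulary = {'baseball': 0, 'pitcher': 1, 'windows': 2, 'driver': 3, 'faith': 4}

# Made-up tokenized document
doc_tokens = ['baseball', 'pitcher', 'baseball', 'driver']

counts = {}
for token in doc_tokens:
    if token in vocabulary:
        word_id = vocabulary[token]
        counts[word_id] = counts.get(word_id, 0) + 1

indices = sorted(counts.keys())            # [0, 1, 3]
values = [counts[i] for i in indices]      # [2, 1, 1]

# A sparse vector of length 5: word 0 appears twice, words 1 and 3 once,
# and everything else is implicitly zero
vector = Vectors.sparse(len(vocabulary), indices, values)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;document_vector&lt;/code&gt; function below builds exactly this kind of sparse vector for each document in the corpus:&lt;/p&gt;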



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Convert the given document into a vector of word counts
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;document_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;token_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Vectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Process all of the documents into word vectors using the 
# `document_vector` function defined previously
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zipWithIndex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The final thing to do before actually running the model is to invert our vocabulary so that we can look up each word by its id. This will allow us to see which words correlate strongly with which topics:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get an inverted vocabulary, so we can look up the word by it's index value
&lt;/span&gt;&lt;span class="n"&gt;inv_voc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now we open an output file and train our model on the corpus with the desired number of topics and maximum number of iterations:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Open an output file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'w'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;lda_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_topics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxIterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;topic_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lda_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describeTopics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxTermsPerTopic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_words_per_topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Print topics, showing the top-weighted 10 terms for each topic
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_indices&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Topic #{0}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])):&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{0}&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;{1}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv_voc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;topic_indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; \
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;topic_indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;


    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{0} topics distributed over {1} documents and {2} unique words&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; \
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_topics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Obviously, you can take the output and do with it what you will, but here we get an output file called &lt;code&gt;output.txt&lt;/code&gt; that lists each of the three topics we are hoping to see. You can play around with &lt;code&gt;num_topics&lt;/code&gt; to see how the model reacts, but since we know the discussions center around three distinct topics, we would expect that asking for 3 topics would reflect this, with each topic clustering around words that align with one of the newsgroups.&lt;/p&gt;

&lt;p&gt;The natural continuation of this is to gather truly unlabeled data (to the extent that the newsgroup data can be called labeled) and to use LDA to perform topic modeling on that new corpus. I’m still learning how to go about that, but hopefully this has been of some help to anyone looking to get started with LDA in PySpark.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Here’s the complete script
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.mllib.linalg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Vectors&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.mllib.clustering&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LDAModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLContext&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;num_of_stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="c1"&gt;# Number of most common words to remove, trying to eliminate stop words
&lt;/span&gt;&lt;span class="n"&gt;num_topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="c1"&gt;# Number of topics we are looking for
&lt;/span&gt;&lt;span class="n"&gt;num_words_per_topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="c1"&gt;# Number of words to display for each topic
&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt; &lt;span class="c1"&gt;# Max number of times to iterate before finishing
&lt;/span&gt;
&lt;span class="c1"&gt;# Initialize
&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'local'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'PySPARK LDA Example'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sql_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Process the corpus:
# 1. Load each file as an individual document
# 2. Strip any leading or trailing whitespace
# 3. Convert all characters into lowercase where applicable
# 4. Split each document into words, separated by whitespace, semi-colons, commas, and octothorpes
# 5. Only keep the words that are all alphabetical characters
# 6. Only keep words larger than 3 characters
&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wholeTextFiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'newsgroup/files/*'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;s;,#]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isalpha&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get our vocabulary
# 1. Flat map the tokens -&amp;gt; Put all the words in one giant list instead of a list per document
# 2. Map each word to a tuple containing the word, and the number 1, signifying a count of 1 for that word
# 3. Reduce the tuples by key, i.e.: Merge all the tuples together by the word, summing up the counts
# 4. Reverse the tuple so that the count is first...
# 5. ...which will allow us to sort by the word count
&lt;/span&gt;
&lt;span class="n"&gt;termCounts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduceByKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sortByKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Identify a threshold to remove the top words, in an effort to remove stop words
&lt;/span&gt;&lt;span class="n"&gt;threshold_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;termCounts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_of_stop_words&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;num_of_stop_words&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Only keep words with a count less than the threshold identified above, 
# and then index each one and collect them into a map
&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;termCounts&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zipWithIndex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collectAsMap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Convert the given document into a vector of word counts
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;document_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;token_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Vectors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Process all of the documents into word vectors using the 
# `document_vector` function defined previously
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zipWithIndex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get an inverted vocabulary, so we can look up the word by it's index value
&lt;/span&gt;&lt;span class="n"&gt;inv_voc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;span class="c1"&gt;# Open an output file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'w'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;lda_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LDA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_topics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxIterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;topic_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lda_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describeTopics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxTermsPerTopic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_words_per_topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Print topics, showing the top-weighted 10 terms for each topic
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_indices&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Topic #{0}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])):&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{0}&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;{1}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv_voc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;topic_indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; \
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;topic_indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;


    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{0} topics distributed over {1} documents and {2} unique words&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; \
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_topics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>pyspark</category>
    </item>
    <item>
      <title>Hosting comments within issues on Github Pages</title>
      <dc:creator>Sean Lane</dc:creator>
      <pubDate>Tue, 26 Jan 2016 00:00:00 +0000</pubDate>
      <link>https://dev.to/seanlane/hosting-comments-within-issues-on-github-pages-3k0f</link>
      <guid>https://dev.to/seanlane/hosting-comments-within-issues-on-github-pages-3k0f</guid>
      <description>&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; As of February 2018, the repo for this website is public, so I moved the comments to the same repo instead of using a separate project for them.&lt;/p&gt;

&lt;p&gt;When I created a blog to have a place to write and document things, as well as complete a class requirement for &lt;a href="https://cs.byu.edu/course/cs-404" rel="noopener noreferrer"&gt;CS 404&lt;/a&gt;, there were a few properties I wanted it to have. I wanted it to be simple, to be hosted on a reputable platform, to be under my control, and to perform well. By using Github Pages to host and run a Jekyll static site, I was pretty much able to get everything in one fell swoop.&lt;/p&gt;

&lt;p&gt;However, one thing I found lacking was comments, or any way for readers to respond to a given blog post. I looked around at a few different solutions. One option that many turn to is &lt;a href="https://disqus.com/" rel="noopener noreferrer"&gt;Disqus&lt;/a&gt; comments. It is easy to implement, requiring only a simple snippet of Javascript on any post where you would like comments, but several drawbacks of Disqus quickly became apparent. As another Javascript component, it adds requests that can quickly bog down what was once a quick, simple website. &lt;sup id="fnref:1"&gt;1&lt;/sup&gt; Issues of privacy and security have also come up with Disqus, and any third-party service you trust is just another liability for your website. &lt;sup id="fnref:2"&gt;2&lt;/sup&gt; Other products I looked at were &lt;a href="https://www.discourse.org/" rel="noopener noreferrer"&gt;Discourse&lt;/a&gt; and &lt;a href="http://pooleapp.com/" rel="noopener noreferrer"&gt;Poole&lt;/a&gt;, but I really wanted to avoid making the site any more complicated and having to rely on a third party.&lt;/p&gt;

&lt;p&gt;I found a blog post by &lt;a href="http://ivanzuzak.info/" rel="noopener noreferrer"&gt;Ivan Zuzak&lt;/a&gt; that detailed how you can utilize Github’s Issue Tracking system to host the comments for a Github Pages site. &lt;sup id="fnref:3"&gt;3&lt;/sup&gt; It is a really nifty hack that adds comments hosted via the same platform the site is hosted on, with only a couple of steps added to my workflow when posting.&lt;/p&gt;

&lt;p&gt;I followed Ivan’s steps, with a small change to his instructions for my own situation. The reason for the change is that I host my site in a private Github repository, and I didn’t want to make it public. The fix was to simply use a second, public repo for the comments while I continue to keep the website in the private repo. Aside from that, everything worked perfectly. The following steps (which are further explained in Ivan’s blog post) set the system in place:&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding the foundation to your site
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;(Optional) Create a public repository where you can create issues to host the comments. If the repo where your site is hosted is private, then the issues will be private as well. Even if the website has authorization to pull the comments for public viewing, no one would be able to submit new comments via Github without being explicitly granted access to at least view the repo. My workaround was to create a second public repo to store my comments. If your Github Pages site is already in a public repo, then you can simply use that repo’s issues for comments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/settings/applications/new" rel="noopener noreferrer"&gt;Register a New OAuth Application with GitHub.&lt;/a&gt; Give it a name that you will remember (it doesn’t really matter for our purposes). The Homepage and Authorization callback URLs should both be the URL of your blog. For example, mine are set to &lt;code&gt;http://seanlane.net&lt;/code&gt;, as can be seen in the following image:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsean.lane.sh%2Fimages%2F2016%2F01%2Fcomments_on_github%2Foauth_app.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsean.lane.sh%2Fimages%2F2016%2F01%2Fcomments_on_github%2Foauth_app.png"&gt;&lt;/a&gt;&lt;br&gt;
            &lt;h4&gt;Adding a new OAUTH Application in Github&lt;/h4&gt;
&lt;br&gt;
         &lt;/p&gt;

&lt;p&gt;This authorizes the site to bypass the &lt;a href="https://en.wikipedia.org/wiki/Same-origin_policy" rel="noopener noreferrer"&gt;Same-Origin policy&lt;/a&gt;, which is further explained in Ivan’s piece &lt;sup id="fnref:4"&gt;4&lt;/sup&gt;.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;I added the following code to the Jekyll template for each post:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% if page.commentIssueId %}
  &amp;lt;div id="comments"&amp;gt;
    &amp;lt;h2&amp;gt;Comments&amp;lt;/h2&amp;gt;
    &amp;lt;div id="header"&amp;gt;
        Want to leave a comment? Visit &amp;lt;a href="https://github.com/seanlane/seanlane-comments/issues/{{page.commentIssueId}}"&amp;gt; 
        this post's issue page on GitHub&amp;lt;/a&amp;gt; (you'll need a GitHub account. What? Like you already don't have one? :).
    &amp;lt;/div&amp;gt;
  &amp;lt;/div&amp;gt;
  &amp;lt;script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js"&amp;gt;&amp;lt;/script&amp;gt;
  &amp;lt;script type="text/javascript" src="http://datejs.googlecode.com/svn/trunk/build/date-en-US.js"&amp;gt;&amp;lt;/script&amp;gt;
  &amp;lt;script type="text/javascript"&amp;gt;

      function loadComments(data) {
          for (var i=0; i &amp;lt; data.length; i++) {
              var cuser = data[i].user.login;
              var cuserlink = "https://www.github.com/" + data[i].user.login;
              var clink = "https://github.com/seanlane/seanlane-comments/issues/{{page.commentIssueId}}#issuecomment-" + 
                  data[i].url.substring(data[i].url.lastIndexOf("/") + 1);
              var cbody = data[i].body_html;
              var cavatarlink = data[i].user.avatar_url;
              var cdate = Date.parse(data[i].created_at).toString("yyyy-MM-dd HH:mm:ss");

              $("#comments").append("&amp;lt;div class='comment'&amp;gt;&amp;lt;div class='commentheader'&amp;gt;&amp;lt;div class='commentgravatar'&amp;gt;" 
                  + '&amp;lt;img src="' + cavatarlink + '" alt="" width="20" height="20"&amp;gt;' 
                  + "&amp;lt;/div&amp;gt;&amp;lt;a class='commentuser' href=\"" + cuserlink + "\"&amp;gt;" 
                  + cuser + "&amp;lt;/a&amp;gt;&amp;lt;a class='commentdate' href=\"" + clink 
                  + "\"&amp;gt;" + cdate + "&amp;lt;/a&amp;gt;&amp;lt;/div&amp;gt;&amp;lt;div class='commentbody'&amp;gt;" + cbody + "&amp;lt;/div&amp;gt;&amp;lt;/div&amp;gt;");
          }
      }

      $.ajax("https://api.github.com/repos/seanlane/seanlane-comments/issues/{{page.commentIssueId}}/comments", {
          headers: {Accept: "application/vnd.github.full+json"},
          success: function(msg){
              loadComments(msg);
          }
      });
  &amp;lt;/script&amp;gt;
{% endif %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This checks to see if the post has an Issue ID (which will be set in a following step) from which to gather comments, and then populates the bottom of the page with them.&lt;/p&gt;
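&lt;p&gt;If the comments ever fail to appear, it helps to check the API response outside of the template. The snippet below is only a sketch of that check, run from the browser console with the Fetch API instead of jQuery and Date.js; the repo name and issue number are the ones from my example and should be swapped for your own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sketch: fetch the comments for one issue straight from the GitHub API.
// "seanlane/seanlane-comments" and issue number 1 are placeholders from my example.
fetch("https://api.github.com/repos/seanlane/seanlane-comments/issues/1/comments", {
    headers: { Accept: "application/vnd.github.full+json" } // ask for rendered body_html
})
    .then(function (response) { return response.json(); })
    .then(function (comments) {
        comments.forEach(function (comment) {
            // toISOString avoids the Date.js dependency used by the template above
            var when = new Date(comment.created_at).toISOString();
            console.log(comment.user.login + " (" + when + "): " + comment.body_html);
        });
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;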

&lt;ol start="4"&gt;
&lt;li&gt;To make the comments a little easier on the eyes, I added some CSS to my template’s main CSS file as well:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* ********************************
*
* COMMENTS
*
******************************** */

.comment {
    background-color: transparent;
    border-color: #CACACA;
    border-style: solid;
    border-width: 1px;
    color: black;
    display: block;
    margin-bottom: 10px;
    margin-top: 10px;
    padding: 0px;
    width: 100%;
  }

.comment .commentheader {
  border-bottom-color: #CACACA;
  border-bottom-style: solid;
  border-bottom-width: 1px;
  color: black;
  background-image: -webkit-linear-gradient(#F8F8F8,#E1E1E1);
  background-image: -moz-linear-gradient(#F8F8F8,#E1E1E1);
  display: block;
  float: left;
  font-family: helvetica, arial, freesans, clean, sans-serif;
  font-size: 12px;
  font-style: normal;
  font-variant: normal;
  font-weight: normal;
  height: 33px;
  line-height: 33px;
  margin: 0px;
  overflow-x: hidden;
  overflow-y: hidden;
  padding: 0px;
  text-overflow: ellipsis;
  text-shadow: rgba(255, 255, 255, 0.699219) 1px 1px 0px;
  white-space: nowrap;
  width: 100%;
}

.comment .commentheader .commentgravatar {
  background-attachment: scroll;
  background-clip: border-box;
  background-color: white;
  background-image: none;
  background-origin: padding-box;
  border-color: #C8C8C8;
  border-style: solid;
  border-width: 1px;
  color: black;
  display: inline-block;
  float: none;
  font-family: helvetica, arial, freesans, clean, sans-serif;
  font-size: 1px;
  font-style: normal;
  font-variant: normal;
  font-weight: normal;
  height: 20px;
  line-height: 1px;
  margin-left: 5px;
  margin-right: 3px;
  margin-top: -2px;
  overflow-x: visible;
  overflow-y: visible;
  padding: 1px;
  text-overflow: clip;
  text-shadow: rgba(255, 255, 255, 0.699219) 1px 1px 0px;
  vertical-align: middle;
  white-space: nowrap;
  width: 20px;
}

.comment .commentheader a:link {
  text-decoration: none;
}

.comment .commentheader a:hover {
  border-bottom:1px solid;
}

.comment .commentheader .commentuser {
  background-color: transparent;
  color: black;
  display: inline;
  float: none;
  font-family: helvetica, arial, freesans, clean, sans-serif;
  font-size: 12px;
  font-style: normal;
  font-variant: normal;
  font-weight: bold;
  height: 0px;
  line-height: 16px;
  margin-left: 5px;
  margin-right: 10px;
  overflow-x: visible;
  overflow-y: visible;
  padding: 0px;
  text-overflow: clip;
  text-shadow: rgba(255, 255, 255, 0.699219) 1px 1px 0px;
  white-space: nowrap;
  width: 0px;
}

.comment .commentheader .commentdate {
  background-color: transparent;
  color: #777;
  display: inline;
  float: none;
  font-family: helvetica, arial, freesans, clean, sans-serif;
  font-size: 11px;
  font-style: normal;
  font-variant: normal;
  font-weight: normal;
  height: 0px;
  line-height: 33px;
  margin: 0px;
  overflow-x: visible;
  overflow-y: visible;
  padding: 0px;
  text-overflow: clip;
  text-shadow: rgba(255, 255, 255, 0.699219) 1px 1px 0px;
  white-space: nowrap;
  width: 20em;
}

.comment .commentbody {
  background-attachment: scroll;
  background-clip: border-box;
  background-color: transparent;
  background-image: none;
  background-origin: padding-box;
  color: #333;
  display: block;
  margin-bottom: 1em;
  margin-left: 1em;
  margin-right: 1em;
  margin-top: 40px;
  overflow-x: visible;
  overflow-y: visible;
  padding: 0em;
  position: static;
  width: 96%;
  word-wrap: break-word;
}

.comment .commentbody p {
  margin-bottom: 0.5em;
  margin-top: 0.5em;
  margin-left: 0em;
  margin-right: 0em;
}

.comment .commentbody pre {
  border: 0px solid #ddd;
  background-color: #eef;
  padding: 0 .4em;
}

.comment .commentbody pre code {
  border: 0px solid #ddd;
}

.comment .commentbody code {
  border: 1px solid #ddd;
  background-color: #eef;
  font-size: 85%;
  padding: 0 .2em;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With those four steps, I can introduce comments on any given blog post (or any page with the support code in place). Again, note that most of the code came from Ivan Zuzak’s post, with some small modifications on my part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding comments to a post
&lt;/h2&gt;

&lt;p&gt;Now all that is left is to perform the following two steps for each post that you want to have comments:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new issue in your designated repo to host the comments for that particular post (a sketch for scripting this step follows the screenshot below). Every issue follows a base URL: &lt;code&gt;https://github.com/{GITHUB USERNAME}/{REPO NAME}/issues/{ISSUE ID #}&lt;/code&gt;. Each issue is given a unique ID that is visible in the URL of the issue after it has been created. For example, &lt;a href="https://github.com/seanlane/seanlane-comments/issues/1" rel="noopener noreferrer"&gt;https://github.com/seanlane/seanlane-comments/issues/1&lt;/a&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsean.lane.sh%2Fimages%2F2016%2F01%2Fcomments_on_github%2Fissue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsean.lane.sh%2Fimages%2F2016%2F01%2Fcomments_on_github%2Fissue.png"&gt;&lt;/a&gt;&lt;br&gt;
            &lt;h4&gt;Screenshot of the Github issues page for this post's comments&lt;/h4&gt;
&lt;br&gt;
         &lt;/p&gt;
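&lt;p&gt;Creating the issue through the web interface works fine, but if you post often this step could also be scripted. The sketch below is not part of the original workflow: it assumes a personal access token in a &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; environment variable and uses the GitHub REST API from Node (18 or later, which ships &lt;code&gt;fetch&lt;/code&gt;) to open the issue and print its ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sketch: open a comments issue for a new post. Repo name, token variable,
// and title are placeholders, not part of the original setup.
var title = process.argv[2] || "Comments: untitled post";

fetch("https://api.github.com/repos/seanlane/seanlane-comments/issues", {
    method: "POST",
    headers: {
        Accept: "application/vnd.github+json",
        Authorization: "token " + process.env.GITHUB_TOKEN
    },
    body: JSON.stringify({ title: title })
})
    .then(function (response) { return response.json(); })
    .then(function (issue) {
        // issue.number is the value to copy into commentIssueId in the front matter below
        console.log("commentIssueId: " + issue.number);
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;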

&lt;ol start="2"&gt;
&lt;li&gt;Take the ID of the issue that will serve as the comment thread for that post, and add it as a property in the page’s YAML front matter:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- 
layout: post 
title: Comments on Github 
commentIssueId: 1 
---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When properly set up, we will then see the comments section below our blog post:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsean.lane.sh%2Fimages%2F2016%2F01%2Fcomments_on_github%2Fcomments_example.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsean.lane.sh%2Fimages%2F2016%2F01%2Fcomments_on_github%2Fcomments_example.png"&gt;&lt;/a&gt;&lt;br&gt;
            &lt;h4&gt;Screenshot of the Comments in action&lt;/h4&gt;
&lt;br&gt;
         &lt;/p&gt;

&lt;p&gt;It might be a slight hack, but now I have an easy way to pull comments into my static website without involving a third-party platform or forcing users to download yet another Javascript tracking widget. Hopefully, my example is of use to someone, and I appreciate Ivan’s post for leading the way.&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;
&lt;a href="http://chrislema.com/killed-disqus-commenting/" rel="noopener noreferrer"&gt;http://chrislema.com/killed-disqus-commenting/&lt;/a&gt;
&lt;sup&gt;[return]&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Disqus#Criticism_and_privacy_concerns" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Disqus#Criticism_and_privacy_concerns&lt;/a&gt;
&lt;sup&gt;[return]&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://ivanzuzak.info/2011/02/18/github-hosted-comments-for-github-hosted-blogs.html" rel="noopener noreferrer"&gt;http://ivanzuzak.info/2011/02/18/github-hosted-comments-for-github-hosted-blogs.html&lt;/a&gt;
&lt;sup&gt;[return]&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://ivanzuzak.info/2011/02/18/github-hosted-comments-for-github-hosted-blogs.html#par11" rel="noopener noreferrer"&gt;http://ivanzuzak.info/2011/02/18/github-hosted-comments-for-github-hosted-blogs.html#par11&lt;/a&gt;
&lt;sup&gt;[return]&lt;/sup&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
  </channel>
</rss>
