DEV Community: Julio

Human Friendly Data Science Interviews

Julio — Sun, 22 Nov 2020 00:00:00 +0000

TL;DR. We focused on a holistic view of our candidates (technical and interpersonal skills) while trying to be fair with everyone’s time and life experiences. This let us identify the outstanding candidates and those who weren’t a good fit, and we have had a great experience working with our hires!

After reading The software industry’s greatest sin: hiring by Neil Sainsbury and Take-home vs. whiteboard coding: The problem is bad interviews by Andrew Rondeau, several critical points about interviewing software developers stood out to me:

Software developers are usually assessed on technical aspects alone, ignoring their personal and organizational qualities. This might produce technically correct software with good performance that is still far from fulfilling users’ needs.
Someone can be technically excellent but lack the skills to understand and interact with your users and the rest of the team.
Someone can be technically excellent but keep using technologies that they find interesting but that are not aligned with the company’s goals.
There are tradeoffs between whiteboard and take-home questions: time invested by both parties, different development environment/conditions, visibility of the candidate’s technical and personal qualities, and the feedback loop between the examiner and the applicant.
A key aspect is to plan a good interview with coding assignments that consider the company’s needs and are fair for everyone involved.
Presenting existing code is briefly discussed by Andrew Rondeau as an alternative to whiteboard and take-home questions.

I am a postdoctoral researcher at a group that explores mobile data’s role in monitoring or supporting people with different health conditions. Broadly speaking, we collect smartphone and wearable sensor data, process it, and use it to create statistical and machine learning models that provide relevant behavioral or clinical insights. This is possible thanks to our team’s multi-disciplinary nature with expertise in psychology, statistics, computer science, software engineering, and data science.

Recently, we needed to hire a couple of data science interns from the local master’s program, and I was in charge of leading the technical part of the interviews. This was an excellent opportunity for me to pilot the type of technical interview that I’d like to experience based on the points I summarized above and the lab’s needs.

I divided our interviews into two 30-minute stages, one to talk about one of the candidate’s past projects and the other to find out how they would approach a data science problem that represents the kind of work we do.

For the first stage, we asked applicants to submit in advance a past data science project that they would like to discuss with us. I want to clarify that we accepted any industry, school, or hobby code repository and did not judge its purpose or complexity. We don’t expect that everyone will have the time to work on side-projects in their free time or be able to disclose code from a previous employer. However, their sample project allowed us to understand some of the person’s experience with data science and software engineering practices like data cleaning, modeling, documentation, version control, variable and function naming, code comments, code formatting, and code refactoring. If any of these aspects were missing or seemed unsatisfactory, we made a note and asked about them during the interview.

When the time came for the first part of our face-to-face chat (where we talked about their chosen project), we focused on their hard skills (technical expertise, domain knowledge, and problem-solving abilities), soft skills (communication, multi-disciplinary collaboration, feedback reception), and traits like proactiveness, enthusiasm, motivation, clarity of thought, independence, and attention to detail. This is our take on what Neil refers to as a candidate’s “holistic” view. Crucially, having technical and non-technical members from our team present made it easier to discuss and evaluate our candidates. More specifically, we asked people about their role in previous teams (if any), their approach to learning, and their thought process for choosing the best tool for the job. We also prompted them to explain complex non-technical concepts to everyone in the interview panel and to talk more about their experience interacting with past “clients” (teachers, fellow students, or any other stakeholders for those with experience in industry). One of the advantages of this setup was that it allowed everyone to interact in a work environment very similar to what we experience every day while planning, implementing, executing, analyzing, and publishing a health intervention or monitoring study.

In the second part of the interview, we asked participants the following question: how would you implement a sleep classifier based on smartphone and Fitbit data? Even if this problem appears simple at first sight, numerous decisions and considerations can be taken into account along the way. For example, we can talk about missing data, feature engineering, data resampling, data imputation, class imbalance, type of model (population or individual), hyper-parameter tuning, model choice, baselines, cross-validation, evaluation metrics, etc. To foster the discussion, we always offered clues, clarifications, and follow-up questions.

We did not expect our interviewees to reach a comprehensive solution or write any code (it took us weeks to finish a publishable solution, and the first part of the interview would already have given us an idea of their programming skills). Instead, we wanted to know more about their thinking process. We paid particular attention to the candidate’s understanding of the problem (do they ask relevant questions?), creativity (how do they suggest tackling this problem?), experience (are they leveraging solutions to past problems?), technical expertise (what programming language, libraries, or methods would they like to use?), and communication skills (can they engage the whole team in the discussion?).

This process fits well within our workflow and our team’s characteristics, and we hope that by sharing it, you can adapt it to your needs and provide a better experience for your candidates.

Setting up Travis CI to test R and Python scripts in MacOS and Ubuntu

Julio — Thu, 28 May 2020 00:00:00 +0000

A colleague and I configured Travis CI to run the tests of a project that relies on R and Python scripts. This project supports both macOS and Linux, so it was essential to test it in both environments. After some trial and error, we got this working with the .travis.yml file below. Our deployment manages the following project requirements: MySQL, Python 3.7, miniconda with a virtual environment, R and a cached virtual environment with renv, and Slack notifications.

For Linux (Ubuntu 16.04) we do:

Install brew, linuxbrew-wrapper and linuxbrew/xorg
Install R using brew
Install miniconda using their provided script installer
Restore our conda virtual env
Cache renv’s library. We use renv to keep a reproducible R environment with 161 packages; however, renv had to build them from source in Ubuntu 16.04 and our travis build was timing out. The solution we implemented was building the renv library (and therefore travis’ cache) in three steps: first committing a small renv.lock with 40 packages, then one with 100, then one with all 161.

For macOS (10.14.4) we do:

Set up the OS image with osx_image: xcode11.3 and language: generic
Install MySQL, R and miniconda using brew (brew and Python 3.7 are already installed)
Restore our conda virtual env
Cache renv’s library. We faced and solved the same problem we had in Linux with renv’s building times timing out our travis build (see above). In addition, in macOS, renv’s library path contains a space (~/Library/Application Support/renv), which was causing issues with travis’ cache mechanism. As a quick fix, we disabled renv’s global cache with R -e 'renv::settings$use.cache(FALSE)' for both Linux and macOS (for consistency) and cached our project’s renv library folder, $TRAVIS_BUILD_DIR/renv/library, as it now contains the actual packages instead of symbolic links.

services:
  - mysql
language: python

jobs:
  include:
    - name: "Python 3.7 on Xenial Linux"
      os: linux
      language: python
      python: 3.7
      before_install:
        - /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
        - export PATH=/home/linuxbrew/.linuxbrew/bin:$PATH
        - source ~/.bashrc
        - sudo apt-get install linuxbrew-wrapper
        - brew tap --shallow linuxbrew/xorg
        - brew install r
        - R --version
        - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
        - bash miniconda.sh -b -p $HOME/miniconda
        - source "$HOME/miniconda/etc/profile.d/conda.sh"
        - hash -r
        - conda config --set always_yes yes --set changeps1 no
      cache:
        directories:
          - /home/travis/.linuxbrew
          - $HOME/.local/share/renv # global renv cache in linux (not used)
          - $TRAVIS_BUILD_DIR/renv/library # local renv cache

    - name: "Python 3.7 on macOS"
      os: osx
      osx_image: xcode11.3  # Python 3.7 running on macOS 10.14.4
      language: generic
      before_install:
        - brew install mysql
        - brew services start mysql
        - brew install r
        - R --version
        - brew cask install miniconda
        - eval "$(/usr/local/bin/conda shell.bash hook)"
      env:
        - RENV_PATHS_ROOT="$HOME/renv/cache"
      cache:
        directories:
          - /usr/local/lib/R
          - $RENV_PATHS_ROOT # global renv cache in MacOS (not used)
          - $TRAVIS_BUILD_DIR/renv/library # local renv cache

install:
  - conda init
  - conda update -q --all --yes conda
  - conda env create -q -n test-environment python=$TRAVIS_PYTHON_VERSION --file environment.yml
  - conda activate test-environment
  - snakemake renv_install
  - R -e 'renv::settings$use.cache(FALSE)'
  - snakemake renv_restore 

script:
  - python -m unittest discover tests/scripts/ -v

notifications:
  email: false
  slack:
    secure: SLACK_SECURE_KEY
    on_success: always
    template:
      - "Repo `%{repository_slug}` *%{result}* build (<%{build_url}|#%{build_number}>) for commit (<%{compare_url}|%{commit}>) on branch `%{branch}`."
      - "Execution time: *%{duration}*"
      - "Message: %{message}"
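
The staged renv.lock commits described above could be prepared with a small helper. This is only a sketch of the idea, not the script we used: it assumes the lockfile is JSON with a top-level "Packages" object, the function name staged_lockfile is made up for this example, and it keeps packages in file order without checking their dependencies.

```python
import json
from pathlib import Path


def staged_lockfile(lock: dict, n_packages: int) -> dict:
    """Return a copy of an renv.lock dict that keeps only the first n packages.

    Packages are taken in the order they appear in the lockfile; dependency
    relationships between packages are NOT considered (an assumption of this sketch).
    """
    packages = lock.get("Packages", {})
    kept = dict(list(packages.items())[:n_packages])
    # Every other top-level key (for example the R version) is copied unchanged.
    return {**lock, "Packages": kept}


# Hypothetical usage: write a 40-package lockfile next to the full one.
# full = json.loads(Path("renv.lock").read_text())
# Path("renv.40.lock").write_text(json.dumps(staged_lockfile(full, 40), indent=2))
```

Committing each intermediate lockfile in turn lets every build add to the cached library before the timeout, which is the effect the three-step approach relies on.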

Quickly exploring CSV files with wc and awk

Julio — Sat, 18 Apr 2020 00:00:00 +0000

This week I was extracting high-intensity activity episodes from the Fitbit data of 150 people. The first thing I wanted to know after processing all participants was how many people had at least 1 episode. I am using RAPIDS to process the data, which means that the activity episodes for each participant are stored one per line in CSV files in individual folders. As I was looking for a quick and short solution, I went for Bash instead of Python or R.

Then, the problem is reduced to three steps: list all files in a subfolder with names that match a pattern, count the lines in each file, and keep those files with at least 2 lines (all files have at least the header row). For that, we can use wc and awk.

wc $(find . -name 'p*_fitbit_mvpa_episodes.csv') | awk '{if (($1 > 1) && ($4 ~ /^\.\/data/)) { print }}' | wc -l

The first part wc $(find . -name 'p*_fitbit_mvpa_episodes.csv') executes the wc command on the output of the find command, which retrieves all files in the current directory and any subdirectories with a name that matches the glob pattern between quotes. The default output of the wc command has four columns for each file: line count, word count, byte count, and its path. These are piped into awk '{if (($1 > 1) && ($4 ~ /^\.\/data/)) { print }}' which filters and prints those lines where the value of the first column $1 (line count) is bigger than one and the fourth column $4 (file path) starts with ./data. The first part of the filter gets all the files with at least one activity episode (header + episode line), and the second part excludes the total count that wc appends. Finally, to obtain the number of files with at least one activity episode, I piped the previous list to wc with the -l flag that counts the number of lines (files) that awk printed. It turns out that out of 150 participants, only 20 have high-intensity activity episodes (this led us to discover a problem with the data I was working with that is a matter for another post).

As an extra bit of information useful for our collaborators, I wanted to know the average number of episodes across all participants. For this I followed a similar process but instead of the second wc -l, I piped the output to awk where it is possible to keep a counter and sum of the values of each line, obtaining the average for the first column (line count) as follows:

wc $(find . -name 'p*_fitbit_mvpa_episodes.csv') | awk '{if (($1 > 1) && ($4 ~ /^\.\/data/)) { print $1}}' | awk '{ total += $1; count++ } END { print total/count }'

We have an average of 17.3 episodes across 20 people.
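
For comparison, the same two questions can be answered in Python. This is a rough sketch rather than what I ran: the function names are invented for the example, and unlike the awk pipeline (which averages the raw line count in the first column) it subtracts the header row before averaging.

```python
from pathlib import Path


def count_episodes(root: str, pattern: str = "p*_fitbit_mvpa_episodes.csv") -> dict[str, int]:
    """Map each matching CSV file under root to its number of data rows."""
    counts = {}
    for path in sorted(Path(root).rglob(pattern)):
        with path.open() as handle:
            n_lines = sum(1 for _ in handle)
        # Every file has a header row, so episodes = lines - 1 (never below zero).
        counts[str(path)] = max(n_lines - 1, 0)
    return counts


def summarize(counts: dict[str, int]) -> tuple[int, float]:
    """Return (files with at least one episode, mean episodes across those files)."""
    with_episodes = [n for n in counts.values() if n > 0]
    if not with_episodes:
        return 0, 0.0
    return len(with_episodes), sum(with_episodes) / len(with_episodes)
```

Because Path.rglob returns only real files, there is no total row to filter out, so the ./data prefix check from the awk version is not needed here.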

Polishing my blog's appearance and performance

Julio — Wed, 08 Apr 2020 00:00:00 +0000

When I was setting up this blog with Hugo and Netlify, I found five themes that I liked for their simplicity and aesthetics. I chose Cupper because it focuses on content, is accessible, groups posts by tags and not categories, supports multiple shortcodes (like notes, warnings, and code, among others), includes minimal JavaScript, and provides a dark theme.

As good as the theme is, I tweaked some things, and this post is a compilation of them for my future reference and for other people that might find them useful. A neat tip is that you don’t need to modify the original theme to make changes; instead, you can add it as a submodule in the themes folder and create the files with the modifications you need in the folders at the top level of your blog, and Hugo will use them first.

The TL;DR list of changes is below, but you can keep reading for more details:

Support for more highlighted languages in code snippets by Prism JS
CSS changes for text readability
Make the blog’s post list its homepage
Support for static comments using Staticman
Support for web analytics using GoatCounter
RSS feed with full posts instead of a short description
Image, CSS, and JavaScript optimization in Netlify

First, I updated the code highlighting languages supported by Prism JS, the highlighting library used by Cupper. This is the URL of my configuration that includes Markup, CSS, JS, Bash, Git, Java, JSON, LaTeX, Markdown, Python, R, SQL, YAML, and TOML. Don’t forget that you need to update both Prism’s JS and CSS files.

I swapped the original logo for a text title, and made some small CSS adjustments to make the text more readable: decreased the contrast of the dark theme by 15%, changed the general line height, letter spacing of titles, and top margin of paragraphs.

/* The dark theme settings are a separate style tag in the header */
.intro-and-nav, .main-and-footer { filter: invert(85%) }
* { background-color: inherit }
img:not([src*=".svg"]), .colors, iframe, .demo-container { filter: invert(85%) }
/* This is the site's main CSS */
html {
  font-size: calc(1em + 0.33vw);
  font-family: Arial, Helvetica Neue, sans-serif;
  line-height: 1.8;
  color: #111;
  background-color: #fefefe;
}
h1,
h2,
h3,
h4 {
  font-family: Miriam Libre, serif;
  line-height: 1.125;
  letter-spacing: 2px;
}
p + p {
  margin-top: 1rem;
}
.logo {
  border: 0;
  font-size: 1.5rem;
}

I replaced the original homepage with the blog’s post list, as I want posts to be the focus of visitors, and made the about section an external link to my personal website. For the first change, I had to move the theme’s post template from post/single.html to _default/single.html, so all .md files in the content folder are rendered with it, and moved _default/list.html to layouts/index.html, so the original list of posts is now the homepage. For the second change, I added a conditional to the navigation links rendering to make the About link open in a new tab:

<a href="{{ .URL }}" {{ if $active }}aria-current="page"{{ end }} {{ if eq .Name "About" }}target="_blank"{{ end }}>

I added support for Staticman comments instead of Disqus to have a git-backed, lightweight, ethical comment provider. You can read more about the whole process here.

I added support for GoatCounter for web analytics instead of Google Analytics for these reasons; this post by David Papandrew made the search for an alternative easier. GoatCounter gives me just the right level of detail to know which posts are the most visited, along with referrals and visitors’ platforms. It is GDPR friendly as it does not rely on cookies, weighs just around 1.5 KB, is open-source, and is free for under 100k pageviews a month. You can support them in GitHub Sponsors and Patreon. Installing GoatCounter was really easy: all I had to do was create an account there and add the following script:

<script data-goatcounter="https://MY_SITE.goatcounter.com/count" async src="//gc.zgo.at/count.js"></script>

I updated the RSS template to include full posts instead of just descriptions to make them work better with readers like Feedly, which I use a lot. I added this file to layouts/rss.xml.

Finally, I activated the asset optimization in Netlify to compress images and minify and bundle CSS and JS files using my netlify.toml file. For the latter to work, all their links need to be relative, so I modified the following lines in my templates:

<link rel="stylesheet" href="{{ "css/prism.css" | relURL }}" media="none" onload="this.media='all';">
<link rel="stylesheet" type="text/css" href="{{ $styles.RelPermalink }}">
<script src="{{ "js/prism.js" | relURL }}"></script>
<script src="{{ "js/dom-scripts.js" | relURL }}"></script>

# Append this to your netlify.toml
[build.processing]
skip_processing = false
[build.processing.css]
bundle = true
minify = true
[build.processing.js]
bundle = true
minify = true
[build.processing.images]
compress = true

All these changes gave the blog a PageSpeed score of 100% and a YSlow score of 94%, with a Fully Loaded Time of 1.7s and a Total Page Size of 109 KB.

Configuring Staticman Comments with Hugo

Julio — Sun, 05 Apr 2020 00:00:00 +0000

I wanted to add comments to my blog, and Disqus seemed like a good option as the theme I’m using supports it out of the box. However, as things stand, I am happy with a solution that doesn’t require storing people’s data in third party databases and doesn’t add ads and unnecessary tracking scripts that could make the reading experience slower or cluttered.

After searching for open-source/ethical comment suppliers, I found out about Staticman, and I am giving it a try since it integrates with Hugo blogs, uses a git repository to store and triage comments, has been around since 2015, and has good documentation. I just had to work around some constraints: you need to deploy your own instance of Staticman to Heroku because the official Staticman API hits its quota frequently (Heroku’s free tier is enough, though), I wanted to keep this blog’s comments in a separate repository, and I am using Staticman API V2 since everything is hosted on GitHub (V3 supports other providers like GitLab).

This post by Arne Petersen was of great help to put everything together. After some tweaks, my deployment works like this:

I’m only collecting people’s names and comments
I’m using reCaptcha 2 to avoid spam
I only load reCaptcha’s JS script when you click the “Show Comments” button
I accept/reject comments using pull requests.
After accepting a comment, my blog is rebuilt and published automatically in Netlify using webhooks
After someone submits a comment, they get redirected to the original blog post with a message explaining their comment will go live after approval. No AJAX or popups are required, and you can try it by leaving a comment!

And the instructions:

Create a repository for your comments in your main GitHub account; we will call it blog_comments
Create a secondary GitHub account; we will call it account2. As Arne pointed out, this is for security reasons: you are creating a Personal Access Token and keeping it in your Heroku instance, and anyone who gets hold of that token could get full access to the GitHub account that issued it.
Create a Personal Access Token in account2 at https://github.com/settings/tokens. Save it because you can only see it once, and you will need it in a bit.
Invite account2 as a collaborator to blog_comments going to https://github.com/YOUR_MAIN_GITHUB_ACCOUNT/blog_comments/settings/access
Deploy Staticman to Heroku using the purple button in the project’s README (make sure it’s in the master branch)
Create a private key for Staticman (you can do this in your Heroku instance going to “More” -> “Run console”): openssl genrsa -out key.pem
Add the following three Config vars to your Heroku instance in https://dashboard.heroku.com/apps/YOUR_INSTANCE_NAME/settings:
```
NODE_ENV         production
GITHUB_TOKEN     YOUR PERSONAL ACCESS TOKEN
RSA_PRIVATE_KEY  CONTENT OF key.pem
```
If you want to use reCaptcha to avoid spam, do the following:
1. Register your blog here. You can add localhost to the domain list to be able to test everything in your local machine. Save your siteKey and secret
2. Encrypt your reCaptcha secret obtained before by querying your Heroku instance in this URL: https://YOUR_HEROKU_APP_NAME.herokuapp.com/v2/encrypt/YOUR_UNENCRYPTED_RECAPTCHA_SECRET
Add this partial to your blog and call {{ partial "staticman.html" . }} where you want to load your comments.
Add the CSS styles from this line onwards to your blog.
Add this JS script to your partials

Add the following lines to the params list in your Hugo blog’s config.yaml

staticman:
    api: https://<YOUR_HEROKU_APP_NAME>.herokuapp.com/v2/entry/YOUR_MAIN_GITHUB_ACCOUNT/blog_comments/master/comments
    recaptcha:
        sitekey: "YOUR RECAPTCHA KEY"
        secret: "YOUR ENCRYPTED RECAPTCHA SECRET"

Add your Staticman configuration file, staticman.yaml, to the root of blog_comments. You can use the one below or this other one as a reference if you want to collect more data like emails or personal websites.

comments:
  allowedFields: ["name", "comment"]
  branch: "master"
  commitMessage: "New comment in {options.slug}"
  filename: "comment-{@timestamp}"
  format: "yaml"
  generatedFields:
    date:
      type: date
      options:
        format: "iso8601"
  moderation: true
  name: "YOUR SITES NAME"
  path: "{options.slug}"
  requiredFields: ["name", "comment"]
  transforms:
    email: md5
  # Delete this if you do not want to use reCaptcha
  reCaptcha:
    enabled: true
    siteKey: "YOUR RECAPTCHA KEY"
    secret: "YOUR ENCRYPTED RECAPTCHA SECRET"
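
To make the link between this configuration and the comment form more concrete, here is a hedged sketch of the request body a form would send. It assumes Staticman’s bracketed field naming (fields[...] for entries in allowedFields and options[slug] for the value used in path); the function name is invented for the example, and the extra fields needed when reCaptcha is enabled are left out.

```python
from urllib.parse import urlencode


def build_comment_form(name: str, comment: str, slug: str) -> str:
    """Build a URL-encoded body for a Staticman comment submission.

    "fields[...]" keys correspond to allowedFields in staticman.yaml and
    "options[slug]" feeds the {options.slug} placeholder used in path.
    """
    form = {
        "fields[name]": name,
        "fields[comment]": comment,
        "options[slug]": slug,
    }
    return urlencode(form)
```

The resulting string would be POSTed to the api URL configured in config.yaml with a Content-Type of application/x-www-form-urlencoded, which is what a plain HTML form does by default.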

Add your blog_comments repo as a submodule to your main repo in the data/comments folder: git submodule add https://github.com/YOUR_MAIN_GITHUB_ACCOUNT/blog_comments.git data/comments
I use Netlify to publish my blog, so I had to modify my Netlify build command to pull the latest version of blog_comments to render any new comments. You can do this using Netlify’s website or by adding a netlify.toml file to the root of your blog repo with the following lines:
```
[build]
publish = "public"
command = "git submodule update --remote data/comments && hugo --gc --minify"

[context.production.environment]
HUGO_VERSION = "v0.68.3"
HUGO_ENV = "production"
HUGO_ENABLEGITINFO = "true"
```
Every time someone comments on a blog post, Staticman creates a new branch and a Pull Request (PR) in blog_comments which you can accept or reject to publish it or not. Branches will start to pile up, so, for those PRs you reject, you have to delete their branches using GitHub’s UI manually. Still, for those PRs you accept, GitHub can automatically delete them by activating this feature.
At this point, you can submit your first comment from your computer, or commit everything to GitHub and do it online.
Optional. If you want to avoid triggering a new Netlify build manually every time you accept a comment, you can automate it by using Integromat’s webhooks. You could also use Zapier, but you have to switch to their paid tier.
1. Go to Netlify’s Build hooks section in https://app.netlify.com/sites/YOUR_NETLIFY_DOMAIN/settings/deploys#build-hooks and click on Add build hook. Save the generated URL
2. Create a new Scenario in Integromat
3. Add a Custom Webhook trigger. Inside, add a new Webhook and copy its URL. Click on Determine data structure
4. Go to https://github.com/YOUR_MAIN_GITHUB_ACCOUNT/blog_comments/settings/hooks. Click on Add webhook, in Payload URL add the URL of the Integromat Custom Webhook trigger, in Content type select application/json, and under Let me select individual events check Pull requests. Click on Add Webhook.
5. Submit a comment in your blog, so Staticman creates a new Pull Request in blog_comments, and Integromat infers its content. You should see a confirmation message in the Custom Webhook trigger.
6. Add a HTTP action in Integromat. Connect this to the Custom Webhook trigger
7. In the connection between the HTTP action and the Custom Webhook trigger, add two conditions joined by an AND operator: action = closed and pull_request: merged = true. They should be autocompleted if Integromat was able to infer the PR’s content
8. Click in the HTTP action, add the Netlify hook’s URL you got earlier to the action’s URL field, and change its Method to POST
9. Turn the scenario ON using the switch at the bottom left and set Schedule setting to Immediately
10. From now on, the scenario should trigger a Netlify build every time you accept a Staticman Pull Request
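
If you would rather not depend on a third-party automation service, the filter and the HTTP action from steps 7 and 8 boil down to very little logic. The sketch below is an illustration and not part of my setup: the function names are invented, and it assumes the GitHub pull request webhook payload exposes action and pull_request.merged as described above.

```python
import urllib.request


def should_trigger_build(payload: dict) -> bool:
    """Mirror the two conditions: action = closed AND pull_request: merged = true."""
    pull_request = payload.get("pull_request") or {}
    return payload.get("action") == "closed" and pull_request.get("merged") is True


def trigger_build(build_hook_url: str) -> int:
    """Send an empty POST request to a Netlify build hook and return the HTTP status."""
    request = urllib.request.Request(build_hook_url, data=b"", method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status
```

A rejected comment also closes its Pull Request, which is why the merged check matters: without it, every rejection would trigger a pointless build.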

Feel free to leave a comment if you have issues or questions!

Organizing ideas with the Zettelkasten method

Julio — Fri, 03 Apr 2020 00:00:00 +0000

I came across the Zettelkasten (ZK) method as a flexible way of organizing knowledge. I have read different descriptions, and most of them describe it as a second brain, a single text-based repository where you can dump all your ideas and link them to not only store but also generate knowledge. This is what resonated with me the most, as I know that I am the most creative once I know a topic in depth and can make connections between its different components.

According to this website, keeping a ZK repository has multiple benefits like improving your thinking, writing, memory, and learning. I do think that keeping one will help me resurface ideas that I have after reading papers or technical content and retain concepts for longer, as you are supposed to add notes in your own words instead of copying and pasting. That said, I need to check if the recommendation of writing long pieces in a ZK works for me. The idea is that you outline your text in a note and then add subsections in other notes linked to the original. I usually follow this iterative approach to writing; the difference is that I like to have a quick overview of what I have written to expand and reorganize content, so I have to see if I can get used to not having this.

For the actual implementation of my ZK repo, I chose to type all my notes in markdown files stored in a single directory backed up in GitHub, and to edit them in Sublime with the ZK plugin in Mac OS (it should be cross-platform).

Sublime as editor

I set up my ZK repository in Mac OS using Sublime and a few extra plugins as suggested here. Most of the steps below are taken from that project’s README, but I replicate some of them here for future reference (I find the official docs a bit overwhelming).

Install the ZK plugin:

In Sublime’s command palette (cmd + shift + P) run Install Package Control
In Sublime’s command palette run Package Control: Add Repository and paste this URL when prompted https://github.com/renerocksai/sublime_zk
In Sublime’s command palette run Package Control: Install Package and search for sublime_zk when prompted

Install the Silver Searcher plugin using: brew install the_silver_searcher
Install Pandoc using: brew install pandoc
Re-start Sublime

Initializing my ZK folder and creating my first note

In Sublime’s command palette run ZK: New Zettel Note and type a name for your new note
When prompted choose or create a new folder as your ZK repository (I added mine to git source control)
In Sublime’s command palette run ZK: Enter in Zettelkasten Mode
Now you can create a new note by typing Shift + Enter

Configuring my ZK folder

I am using the default file extension (.md), link notation ([[link]]) and ID precision in minutes (YYYYMMDDHHMM)

I configured ZK to insert the title of a note next to its ID when they are linked from other notes. In addition, I changed the color scheme, set the bib citation format to pandoc’s, and modified the template for new notes to insert the note ID, title, date, and tags at the beginning of the file. To do all this, go to Sublime’s Preferences > Package Settings > Sublime ZK > Settings User and add the following code:

{
    // Insert note titles next to links when linking a note from another
    "insert_links_with_titles": true,
    // Template for new notes
    "new_note_template":
        "---\nnote-id: {id}\ntitle: {title}\ndate: {timestamp: %Y-%m-%d}\ntags: \n---\n",
    "color_scheme": "Packages/sublime_zk/Monokai Extended-zk.tmTheme",
    "citations-mmd-style": false,
}

When you create a new note with the config above, its header will look like this:

---
note-id: 202003281428
title: My second note
date: 2020-03-28
tags:
---
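
To show how the minute-precision ID and the template fit together, here is a small Python sketch that reproduces the same header outside Sublime. It is not how the plugin is implemented; the function names are made up for the example.

```python
from datetime import datetime


def note_id(moment: datetime) -> str:
    """Format a timestamp as a YYYYMMDDHHMM note ID (minute precision)."""
    return moment.strftime("%Y%m%d%H%M")


def note_header(title: str, moment: datetime) -> str:
    """Render the header produced by the new_note_template setting above."""
    return (
        "---\n"
        f"note-id: {note_id(moment)}\n"
        f"title: {title}\n"
        f"date: {moment:%Y-%m-%d}\n"
        "tags: \n"
        "---\n"
    )
```

Calling note_header("My second note", datetime(2020, 3, 28, 14, 28)) yields the header shown above. One consequence of minute precision is that two notes created within the same minute would get the same ID.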

Taking notes in my ZK

I create notes with the following principles in mind:

Each note is atomic and self-contained. This means that a note is related to a single idea, and I don’t need anything else to remember what a note means.
A link is a stronger connection than a tag. In other words, a single idea is developed throughout different notes connected by links, and multiple ideas related to the same broad topic are grouped with tags.
If a new note is related to an existing note, I link the parent note in the child note ([[parent ID]]).
I try to use specific tags, and before adding a new one, I list all existing tags to make sure I am not duplicating any. Using the # + ? shortcut helps me avoid typos.

Finally, I found the following shortcuts the most useful:

Create a new note: shift + enter
Open the note pointed by a link: opt + double-click on link, or cursor on link + ctrl + enter
Insert a link to a note: [ + [
Find all notes referencing another note: cursor on link + opt + enter
View all tags: # + !
View all notes: [ + !
Autocomplete tag: # + ?
Find all notes tagged by a tag: cursor on tag + ctrl + enter
Expand note link inline: ctrl + .
Expand tag inline (all referencing notes): ctrl + .
Expand citekey inline (all referencing notes): ctrl + .
Insert pandoc citation: [ + @
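
Because the repository is just a folder of markdown files, lookups like "find all notes referencing another note" are easy to reproduce outside the editor. The following is a minimal sketch under a few assumptions: links use the [[ID]] notation with 12-digit IDs, all notes sit in one directory with the .md extension, and the function names are invented for this example.

```python
import re
from pathlib import Path

# Matches links such as [[202003281428]] (12-digit, minute-precision IDs).
LINK_PATTERN = re.compile(r"\[\[(\d{12})\]\]")


def linked_ids(text: str) -> set[str]:
    """Return the set of note IDs linked from a piece of note text."""
    return set(LINK_PATTERN.findall(text))


def backlinks(zk_dir: str, target_id: str) -> list[str]:
    """List the file names of notes in zk_dir that link to target_id."""
    hits = []
    for path in sorted(Path(zk_dir).glob("*.md")):
        if target_id in linked_ids(path.read_text(encoding="utf-8")):
            hits.append(path.name)
    return hits
```

This keeps the notes usable even without the plugin, which is part of the appeal of a plain-text repository.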

I will update my experience using ZK in a few months.

PaperStream: collecting data from multiple-answer questions documents

Julio — Mon, 27 Aug 2018 14:43:22 +0000

Previously published at the Software Sustainability Institute's blog

As part of my PhD, where we are researching whether smartphone data can be used to monitor the progression of Parkinson’s Disease, we found out that we had to go “Back to Analogue”, as a paper diary was the best tool for patients to self-report their symptoms. This was excellent for the study, but it gave us another thing to worry about: I would have to manually transcribe participants’ answers from paper into electronic files. We were aiming for ten participants that needed to complete a diary with 365 pages over a year; if it had taken 45 seconds (being very optimistic) to transcribe each page, encoding all ten diaries would have taken ~114 hours, or ~19 days of work!

Being a computer scientist and wanting to save a good 114 hours for when I have to write my dissertation, I searched the Internet for a tool that would allow me to create and encode paper diaries automatically (maybe some sort of software to mark multiple-answer exams?). To my dismay, the few options available were software projects that were no longer supported, not well documented, not free or open source, or not very user friendly. I decided then to take this as an opportunity to contribute back to the community because many of the tools we use for data analysis in our lab are freely available thanks to the work of others. This is how PaperStream was born.

PaperStream is software that researchers and academics can use to create paper diaries, surveys, quizzes or any other document with multiple-answer questions that people can answer using pen and paper and that can then be encoded automatically into a CSV file. PaperStream is free, open source, available for Windows, macOS, and Linux, and fully documented, as it was designed to be used by anyone without a technical background.

For the sake of this post, I will write about PaperStream using diaries and surveys as two examples to showcase its features. If you are working with quizzes or other types of documents, what I describe here shouldn't be too different from what you'd need to do. So, in our case, a diary is a questionnaire that one or more people must answer every day for several days, weeks, or months. Similarly, a survey is a questionnaire that one or more people have to answer only once.

For both a diary and a survey, you only need a one-page PDF document that will work as a template for every page. For diaries, PaperStream will label each page with a unique date, like the next figure, and for surveys, PaperStream will enumerate them with a unique ID. Once PaperStream has processed your templates, you will get a zip file with your diaries or surveys in both A4 and A5 size ready to print and bind.
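The labelling step described above can be sketched in a few lines of Python. This is only an illustration: the function names and the label formats (ISO dates, zero-padded IDs) are assumptions, not PaperStream's actual code.

```python
from datetime import date, timedelta


def diary_labels(start, days):
    """Return one date label per diary page, starting at `start`.

    The ISO format (YYYY-MM-DD) is an assumption made for this sketch.
    """
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def survey_labels(copies, width=4):
    """Return one zero-padded sequential ID per survey copy (hypothetical format)."""
    return [f"{i:0{width}d}" for i in range(1, copies + 1)]


if __name__ == "__main__":
    # A one-year diary starting on 1 January 2018, and three survey copies.
    print(diary_labels(date(2018, 1, 1), 365)[:3])
    print(survey_labels(3))
```

Each label would then be stamped onto a copy of the one-page template before the pages are assembled into a printable booklet.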

After your participants have answered these printouts using a pen, you need to scan them as a multi-page TIF image, or as single PNG images compressed in a ZIP file. Once this is ready, you need to tell PaperStream where to look and what answers to look for through a marking rubric. A marking rubric is nothing more than a group of circles that indicate which areas of a page participants can mark with a pen and what those marks mean, for example, the hour of the day or a point on a Likert scale. Since you used a single template to create your diaries or surveys, you only need to design a marking rubric once, and that’s it! When the rubric is ready, PaperStream will give you a zip file with a CSV file containing all the answers of each diary or survey that you wanted to process. PaperStream can also detect duplicates and missing data, and it is very forgiving: it will detect an answer when at least 15% of the answer area is filled in, and it has no problems when the pen goes outside it. This means that your participants don’t have to worry about how to answer the questions; it is as easy as using pen and paper.
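To make the rubric idea concrete, here is a minimal sketch of how a circle-based rubric and the 15% rule could work. It is not PaperStream's actual implementation: the `AnswerCircle` fields, the function names, and the representation of a page as a grid of 0 (paper) and 1 (ink) values are all assumptions chosen to keep the example dependency-free.

```python
from dataclasses import dataclass

FILL_THRESHOLD = 0.15  # an answer counts once 15% of its area is inked (from the post)


@dataclass(frozen=True)
class AnswerCircle:
    """One circle of a marking rubric (hypothetical field names)."""
    question: str  # which question the circle belongs to
    value: str     # what a mark in this circle means, e.g. an hour or a scale point
    cx: int        # centre column, in pixels
    cy: int        # centre row, in pixels
    radius: int


def fill_ratio(page, circle):
    """Fraction of the pixels inside `circle` that are inked on `page`."""
    inside = dark = 0
    for y in range(circle.cy - circle.radius, circle.cy + circle.radius + 1):
        for x in range(circle.cx - circle.radius, circle.cx + circle.radius + 1):
            if (x - circle.cx) ** 2 + (y - circle.cy) ** 2 > circle.radius ** 2:
                continue  # outside the circle: stray ink here is ignored
            if 0 <= y < len(page) and 0 <= x < len(page[0]):
                inside += 1
                dark += page[y][x]
    return dark / inside if inside else 0.0


def encode_page(page, rubric, threshold=FILL_THRESHOLD):
    """Map each question to the list of values whose circles are marked.

    An empty list means no answer was detected for that question.
    """
    answers = {}
    for circle in rubric:
        marked = answers.setdefault(circle.question, [])
        if fill_ratio(page, circle) >= threshold:
            marked.append(circle.value)
    return answers
```

Because only pixels inside each circle are counted, ink that strays outside an answer area does not affect the result, which matches the forgiving behaviour described above.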

From a technical perspective, Python was the best language to develop PaperStream in. It has multi-platform support, and it can process PDF files thanks to the PyPDF2 library and images via OpenCV. The first prototype of PaperStream was a script that converted a PDF template into a PDF booklet, based on the booklet-maker project of Luke Plant. The second prototype, which needed to encode the answers from paper to an electronic file, was slightly more complicated. I wanted to maintain Python’s multi-platform capabilities while at the same time giving users a graphical interface that would not take too much time to develop. There are many options to create a GUI in Python. First, I tried Tkinter, but canvas support and the manipulation of geometric shapes (like drag and drop) were not straightforward. For this reason, I decided to think of PaperStream as a desktop web app, meaning that the GUI would be HTML/CSS/JavaScript based, taking advantage of HTML, SVG and the rich JavaScript ecosystem, while relying on a local web server to route all calls from the web GUI to the processing scripts. I chose Falcon for the web server due to its light weight, extensive documentation, and simple implementation; Fabric.js for the geometric manipulation; Async.js for asynchronous calls; and Noty.js for notifications. Then, for the actual encoding logic, I adapted the work done by Raphael Baron that used OpenCV to extract parts of a page framed by markers, complementing it with the answer extraction functionality, which works by comparing black pixels between two paper sheets. All this open source software made PaperStream development faster and easier.
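The "desktop web app" pattern can be sketched as a small local server that maps requests from the browser GUI onto Python functions. The example below deliberately uses the standard library's `wsgiref` instead of Falcon so that it runs without extra dependencies; the `/encode` route, the payload shape, and `encode_scans` are hypothetical stand-ins, not PaperStream's real endpoints.

```python
import json
from wsgiref.simple_server import make_server


def encode_scans(payload):
    """Hypothetical stand-in for the processing script the GUI would call."""
    return {"pages": len(payload.get("pages", [])), "status": "encoded"}


# Routing table: (HTTP method, path) -> processing function.
ROUTES = {("POST", "/encode"): encode_scans}


def app(environ, start_response):
    """Minimal WSGI app: decode the JSON body, call the handler, return JSON."""
    handler = ROUTES.get((environ["REQUEST_METHOD"], environ["PATH_INFO"]))
    if handler is None:
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length) if length else b"{}"
    result = handler(json.loads(body))
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(result).encode("utf-8")]


def serve(port=8000):
    """Serve on localhost only; the browser GUI would send its requests here."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()
```

Binding to `127.0.0.1` keeps the server reachable only from the user's own machine, which suits a desktop tool. A framework such as Falcon replaces the hand-written routing table with resource classes, but the division of labour between GUI, server, and processing scripts stays the same.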

Finally, the last sprint for the first version of PaperStream covered testing, distribution and documentation. I implemented a few unit tests using Python’s unittest library for the core functionality of the scripts that create and encode documents. Then, I considered Docker and other similar options to make PaperStream available, but in the end I went with PyInstaller, which allows developers to distribute a Python project (including firing up a Falcon server) as a single executable or as a single zip file that works on all major operating systems. I also published PaperStream as a pip package, so it can be installed with a single line by developers and other technical users. For the documentation, I decided to give Hugo a try for the first time; writing it in Markdown was simple, and automatically publishing the static website to Netlify with every GitHub commit was super convenient.

I learnt a lot during the development of this project and I’d love it if the community finds PaperStream useful and takes its development forward. Future cool functionality could include detecting different pen colours, shape marks, and even handwritten text. In the meantime, you can get PaperStream and its source code for free on GitHub, and how-to guides to create and encode documents are available on Netlify. Oh, and in case you were wondering, with PaperStream I encoded all ~3650 pages in about 5 minutes; compared with the ~114 hours I had estimated for manual transcription, that is a whopping 1,300 times faster.