<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ong Chin Hwee</title>
    <description>The latest articles on DEV Community by Ong Chin Hwee (@hweecat).</description>
    <link>https://dev.to/hweecat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F252437%2F3fe96488-c882-4be4-94ed-a1f7730535d2.jpg</url>
      <title>DEV Community: Ong Chin Hwee</title>
      <link>https://dev.to/hweecat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hweecat"/>
    <language>en</language>
    <item>
      <title>Functional "Control Flow" - Writing Programs without Loops</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Sun, 04 Jul 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/hweecat/functional-control-flow-writing-programs-without-loops-3cod</link>
      <guid>https://dev.to/hweecat/functional-control-flow-writing-programs-without-loops-3cod</guid>
      <description>

&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;In my previous post on &lt;a href="https://hweecat.github.io/learning-scala-functional-programming-principles"&gt;key principles of functional programming&lt;/a&gt;, I explained how the functional programming paradigm differs from imperative programming, and discussed how the concepts of idempotency and avoidance of side effects are linked to the property of referential transparency that enables equational reasoning in functional programming.&lt;/p&gt;

&lt;p&gt;Before we dive into some of the features of functional programming, let’s start with a personal anecdote during my first 3 months of writing Scala code.&lt;/p&gt;

&lt;h2&gt;
  
  
  There is no “If-Else” in Functional Code
&lt;/h2&gt;

&lt;p&gt;I was writing a pure Scala function for a custom Spark UDF which computes revenue adjustments based on a custom tiered adjustment expressed as a JSON string. While attempting to express the business logic in pure functional code (since that is the team’s coding style), I got pretty frustrated with my perceived drop in productivity, to the point where I introduced “if-else” logic into my code in a bid to “get the job done”.&lt;/p&gt;

&lt;p&gt;Let’s just say that I learnt a pretty tough lesson during code review for that particular merge request.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“No if-else in functional code, this is not imperative programming… &lt;strong&gt;No ifs, no elses.&lt;/strong&gt; ”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Without “if-else”, how do we write “control flow” in functional programming?&lt;/p&gt;

&lt;p&gt;The short answer: &lt;strong&gt;function composition&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The long answer: A combination of function composition and functional data structures.&lt;/p&gt;

&lt;p&gt;As a deep-dive on each functional design pattern can be pretty lengthy, the focus of this post is to provide an overview of function composition and how it enables a more intuitive approach to designing data pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Brief Intro to Function Composition
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---HLjfldj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://hweecat.github.io/images/function_composition.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---HLjfldj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://hweecat.github.io/images/function_composition.jpg" alt="illustration of function composition" title="illustration of function composition"&gt;&lt;/a&gt;Function composition&lt;/p&gt;

&lt;p&gt;In mathematics, &lt;strong&gt;function composition&lt;/strong&gt; is an operation that takes two functions &lt;em&gt;f&lt;/em&gt; and &lt;em&gt;g&lt;/em&gt; and forms a composite function &lt;em&gt;h&lt;/em&gt; such that &lt;em&gt;h(x) = g(f(x))&lt;/em&gt; - function &lt;em&gt;g&lt;/em&gt; is applied to the result of applying the function &lt;em&gt;f&lt;/em&gt; to a generic input &lt;em&gt;x&lt;/em&gt;. Mathematically, this operation can be expressed as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DOCJCGiS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Ff:X%255Crightarrow%26space%3BY%2C%26space%3Bg:Y%255Crightarrow%26space%3BZ%255CRightarrow%26space%3Bg%26space%3B%255Ccirc%26space%3Bf:X%255Crightarrow%26space%3BZ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DOCJCGiS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Ff:X%255Crightarrow%26space%3BY%2C%26space%3Bg:Y%255Crightarrow%26space%3BZ%255CRightarrow%26space%3Bg%26space%3B%255Ccirc%26space%3Bf:X%255Crightarrow%26space%3BZ" alt="[f:X\rightarrow Y, g:Y\rightarrow Z\Rightarrow g \circ f:X\rightarrow Z]"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kqj7KNBe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Fg%26space%3B%255Ccirc%26space%3Bf" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kqj7KNBe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Fg%26space%3B%255Ccirc%26space%3Bf" alt="g \circ f"&gt;&lt;/a&gt; is a composite function.&lt;/p&gt;

&lt;p&gt;Intuitively, the composite function maps each &lt;em&gt;x&lt;/em&gt; in the domain &lt;em&gt;X&lt;/em&gt; to &lt;em&gt;g(f(x))&lt;/em&gt; in the codomain &lt;em&gt;Z&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A useful analogy to illustrate the concept of function composition is making butter toast in an oven with a slice of bread and cold butter. There are two possible operations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Toasting in the oven (operation f)&lt;/li&gt;
&lt;li&gt;Spreading butter over the widest surface (operation g)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we toast the bread in the oven first and spread cold butter over the widest surface of what comes out of the oven, we get a slice of toasted bread with &lt;em&gt;cold butter spread&lt;/em&gt;. &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kqj7KNBe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Fg%26space%3B%255Ccirc%26space%3Bf" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kqj7KNBe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Fg%26space%3B%255Ccirc%26space%3Bf" alt="g \circ f"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we spread cold butter over the widest surface of the bread first and toast the bread with cold butter spread in the oven, we get a slice of toasted bread with &lt;em&gt;warm butter spread&lt;/em&gt;. &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ce9bkcpR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Ff%26space%3B%255Ccirc%26space%3Bg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ce9bkcpR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Ff%26space%3B%255Ccirc%26space%3Bg" alt="f \circ g"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we know that &lt;em&gt;“cold butter spread” != “warm butter spread”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From these examples, we can intuitively infer that the order of function application matters in function composition. &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HyEkpVh8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Fg%26space%3B%255Ccirc%26space%3Bf%26space%3B%255Cneq%26space%3Bf%26space%3B%255Ccirc%26space%3Bg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HyEkpVh8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Fg%26space%3B%255Ccirc%26space%3Bf%26space%3B%255Cneq%26space%3Bf%26space%3B%255Ccirc%26space%3Bg" alt="[g \circ f \neq f \circ g]"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly in designing data pipelines, we often write data transformations by applying functions to results of other functions. The ability to compose functions encourages &lt;strong&gt;refactoring&lt;/strong&gt; of repeated code segments into functions for &lt;strong&gt;maintainability&lt;/strong&gt; and &lt;strong&gt;reusability&lt;/strong&gt;.&lt;/p&gt;
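&lt;p&gt;This order-sensitivity is easy to demonstrate in code. Below is a minimal sketch in Python, using a hypothetical &lt;code&gt;compose&lt;/code&gt; helper (not a built-in) and two toy functions:&lt;/p&gt;

```python
def compose(g, f):
    # build the composite h such that h(x) == g(f(x))
    def h(x):
        return g(f(x))
    return h

def increment(x):
    return x + 1

def double(x):
    return x * 2

double_then_increment = compose(increment, double)  # x -> 2x + 1
increment_then_double = compose(double, increment)  # x -> 2(x + 1)

# order of application matters: g . f != f . g
print(double_then_increment(3))  # 7
print(increment_then_double(3))  # 8
```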

&lt;h2&gt;
  
  
  Functions as First-Class Objects
&lt;/h2&gt;

&lt;p&gt;The core idea in functional programming is: &lt;strong&gt;functions are values&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This feature implies that a function can be [2,3]:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;assigned to a variable&lt;/li&gt;
&lt;li&gt;passed as a parameter to other functions&lt;/li&gt;
&lt;li&gt;returned as a value from other functions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this to work, functions must be first-class objects in the runtime environment - able to be stored in variables and data structures, just like numbers, strings and arrays. First-class functions are supported in all functional languages including Scala, as well as some interpreted languages such as Python.&lt;/p&gt;
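&lt;p&gt;All three properties can be sketched in a few lines of Python (the function names here are purely illustrative):&lt;/p&gt;

```python
def square(x):
    return x * x

# 1. assigned to a variable
f = square
print(f(4))  # 16

# 2. passed as a parameter to another function
def apply_twice(func, x):
    return func(func(x))

print(apply_twice(square, 2))  # square(square(2)) = 16

# 3. returned as a value from another function
def make_multiplier(n):
    def multiply(x):
        return x * n
    return multiply

triple = make_multiplier(3)
print(triple(5))  # 15
```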

&lt;h2&gt;
  
  
  Higher-Order Functions
&lt;/h2&gt;

&lt;p&gt;A key implication resulting from the concept of functions as first-class objects is that function composition can be naturally expressed as a &lt;strong&gt;higher-order function&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A higher-order function has at least one of the following properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Accepts functions as parameters&lt;/li&gt;
&lt;li&gt;Returns a function as a value&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An example of a higher-order function is &lt;code&gt;map&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When we look at the documentation for the Python built-in function &lt;code&gt;map&lt;/code&gt;, it is stated that the &lt;code&gt;map&lt;/code&gt; function takes in another function and an iterable as input parameters and returns an iterator that yields the results [4].&lt;/p&gt;

&lt;p&gt;In Scala, each of the collection classes in the package &lt;code&gt;scala.collection&lt;/code&gt; and its subpackages defines a &lt;code&gt;map&lt;/code&gt; method with the following function signatures on ScalaDoc [5]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def map[B](f: (A) =&amp;gt; B): Iterable[B] // for collection classes
def map[B](f: (A) =&amp;gt; B): Iterator[B] // for iterators that access elements of a collection

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These function signatures mean that &lt;code&gt;map&lt;/code&gt; takes a function parameter &lt;code&gt;f&lt;/code&gt;, where &lt;code&gt;f&lt;/code&gt; transforms a generic input of type &lt;code&gt;A&lt;/code&gt; into a resulting value of type &lt;code&gt;B&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To square each value in a collection of integers, the &lt;strong&gt;iterative approach&lt;/strong&gt; is to traverse each element in the collection, square the element and append the result to a collection of results that expands in length with each iteration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; In Python:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  def square(x):
      return x * x

  def main(args):

      collection = [1,2,3,4,5]
      # initialize list to hold results
      squared_collection = []
      # loop till the end of the collection
      for num in collection:
          # square the current number 
          squared = square(num)
          # add the result to list
          squared_collection.append(squared) 

      print(squared_collection)   

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the iterative approach, two state changes occur at each iteration within the loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;squared&lt;/code&gt; variable holding the result returned from the &lt;code&gt;square&lt;/code&gt; function; and&lt;/li&gt;
&lt;li&gt;The collection holding the results of the square function.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To perform the same operation using a &lt;strong&gt;functional approach&lt;/strong&gt; (i.e. without using mutable variables), the &lt;code&gt;map&lt;/code&gt; function can be used to “map” each element in the collection to a new collection with the same number of elements as the input collection - by applying the square operation to each element and collecting the results into the new collection.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Python:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  def square(x):
      return x * x

  def main(args):

      collection = [1,2,3,4,5]
      squared = list(map(square, collection))
      print(squared)   

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;In Scala:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  object MapSquare {

      def square(x: Int): Int = {
          x * x
      }

      def main(args: Array[String]): Unit = {

          val collection = List(1, 2, 3, 4, 5)
          val squared = collection.map(square)
          println(squared)
      }
  }    

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In both implementations, the &lt;code&gt;map&lt;/code&gt; function accepts an input function that is applied to each element in a collection of values and returns a new collection containing the results. As &lt;code&gt;map&lt;/code&gt; has the property of accepting another function as a parameter, it is a higher-order function.&lt;/p&gt;

&lt;p&gt;A few quick side-notes on differences between Python and Scala implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python &lt;code&gt;map&lt;/code&gt; vs Scala &lt;code&gt;map&lt;/code&gt;: A conversion such as the &lt;code&gt;list&lt;/code&gt; constructor is needed to turn the iterator returned by the Python &lt;code&gt;map&lt;/code&gt; function into a concrete collection. In Scala, there is no need for explicit conversion of the result from the &lt;code&gt;map&lt;/code&gt; function, as all methods in the &lt;code&gt;Iterable&lt;/code&gt; trait are defined in terms of an abstract method, &lt;code&gt;iterator&lt;/code&gt;, which returns an instance of the &lt;code&gt;Iterator&lt;/code&gt; trait that yields the collection’s elements one by one [6].&lt;/li&gt;
&lt;li&gt;How values are returned from a function: While the &lt;code&gt;return&lt;/code&gt; keyword is used in Python to return a function result, the &lt;code&gt;return&lt;/code&gt; keyword is rarely used in Scala. Instead, the last expression in a function body is evaluated and its value is returned. In fact, using the &lt;code&gt;return&lt;/code&gt; keyword in Scala is considered bad practice in functional programming, as it abandons the current computation and is not referentially transparent [7-8].&lt;/li&gt;
&lt;/ul&gt;
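&lt;p&gt;A quick sketch of the first point: Python’s &lt;code&gt;map&lt;/code&gt; returns a lazy iterator that is exhausted once consumed, so an explicit conversion such as &lt;code&gt;list&lt;/code&gt; is needed to obtain a reusable collection:&lt;/p&gt;

```python
def square(x):
    return x * x

iterator = map(square, [1, 2, 3])
print(list(iterator))  # [1, 4, 9]
print(list(iterator))  # [] - the iterator is already exhausted

# converting up front gives a reusable list
squares = list(map(square, [1, 2, 3]))
print(squares)  # [1, 4, 9]
print(squares)  # [1, 4, 9]
```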

&lt;h3&gt;
  
  
  Anonymous Functions
&lt;/h3&gt;

&lt;p&gt;When using higher-order functions, it is often convenient to be able to call input function parameters with function literals or &lt;strong&gt;anonymous functions&lt;/strong&gt; without having to define them as named function objects before they can be used within the higher-order function.&lt;/p&gt;

&lt;p&gt;In Python, anonymous functions are also known as &lt;strong&gt;lambda expressions&lt;/strong&gt; due to their roots in lambda calculus. An anonymous function is created with the &lt;code&gt;lambda&lt;/code&gt; keyword and wraps a single expression without using &lt;code&gt;def&lt;/code&gt; or &lt;code&gt;return&lt;/code&gt; keywords. For example, the &lt;code&gt;square&lt;/code&gt; function in the previous example in Python can be expressed as an anonymous function in the &lt;code&gt;map&lt;/code&gt; function, where the lambda expression &lt;code&gt;lambda x: x * x&lt;/code&gt; is used as a function input parameter to &lt;code&gt;map&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def main(args):

    collection = [1,2,3,4,5]
    squared = list(map(lambda x: x * x, collection))
    print(squared)   

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Scala, an anonymous function is defined in-line with the &lt;code&gt;=&amp;gt;&lt;/code&gt; notation - where the function arguments are defined to the left of the &lt;code&gt;=&amp;gt;&lt;/code&gt; arrow and the function expression is defined to the right of the &lt;code&gt;=&amp;gt;&lt;/code&gt; arrow. For example, the &lt;code&gt;square&lt;/code&gt; function in the previous example in Scala can be expressed as an anonymous function with the &lt;code&gt;(x: Int) =&amp;gt; x * x&lt;/code&gt; syntax and used as a function input parameter to &lt;code&gt;map&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;object MapSquareAnonymous {

    def main(args: Array[String]): Unit = {
        val collection = List(1, 2, 3, 4, 5)
        val squared = collection.map((x: Int) =&amp;gt; x * x)
        println(squared) 
    }
}    

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A key benefit of using anonymous functions in higher-order functions is that single-use, single-expression functions need not be wrapped explicitly in a named function definition, hence &lt;strong&gt;reducing lines of code&lt;/strong&gt; and &lt;strong&gt;improving code maintainability&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recursion as a form of “functional iteration”
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recursion&lt;/strong&gt; is a form of self-referential &lt;strong&gt;function composition&lt;/strong&gt; - a recursive function takes the results of (smaller instances of) itself and uses them as inputs to another instance of itself. To prevent an infinite loop of recursive calls, a &lt;em&gt;base case&lt;/em&gt; is required as a terminating condition to return a result without using recursion.&lt;/p&gt;

&lt;p&gt;A classic example of recursion is the factorial function, which is defined for a positive integer &lt;em&gt;n&lt;/em&gt; as the product of all positive integers less than or equal to &lt;em&gt;n&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9m_h0NFC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Fn%21%3Dn%255Ccdot%26space%3B%28n-1%29%255Ccdot%26space%3B%28n-2%29%255Ccdots%26space%3B3%255Ccdot%26space%3B2%255Ccdot%26space%3B1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9m_h0NFC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3Fn%21%3Dn%255Ccdot%26space%3B%28n-1%29%255Ccdot%26space%3B%28n-2%29%255Ccdots%26space%3B3%255Ccdot%26space%3B2%255Ccdot%26space%3B1" alt="[n!=n\cdot (n-1)\cdot (n-2)\cdots 3\cdot 2\cdot 1]"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two possible iterative approaches to implementing a factorial function: using a &lt;code&gt;for&lt;/code&gt; loop, and using a &lt;code&gt;while&lt;/code&gt; loop.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Python:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  def factorial_for(n):
      # initialize variable to hold factorial
      fact = 1
      # loop from n to 1 in decrements of 1
      for num in range(n, 1, -1):
          # multiply current number with the current product
          fact = fact * num
      return fact

  def factorial_while(n):
      # initialize variable to hold factorial
      fact = 1
      # loop till n reaches 1
      while n &amp;gt;= 1:
          # multiply current number with the current product
          fact = fact * n
          # subtract the number by 1
          n = n - 1
      return fact

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In both iterative implementations of the factorial function, two state changes occur at each iteration within the loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The factorial variable storing the current product; and&lt;/li&gt;
&lt;li&gt;The number being multiplied.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To implement the factorial function using a &lt;strong&gt;functional approach&lt;/strong&gt;, recursion is useful in dividing the problem into subproblems of the same type - in this case, the product of &lt;em&gt;n&lt;/em&gt; and &lt;em&gt;(n-1)!&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The basic recursive approach for the factorial function looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Python:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  def factorial(n):
      # base case to return value
      if n &amp;lt;= 0: return 1
      # recursive function call with another set of inputs
      return n * factorial(n-1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;In Scala:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  def factorial(n: Int): Long = {
      if (n &amp;lt;= 0) 1 else n * factorial(n-1)
  }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the basic recursive approach, the factorial of 5 is evaluated in the following manner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;factorial(5)
if (5 &amp;lt;= 0) 1 else 5 * factorial(5 - 1)
5 * factorial(4) // factorial(5) is added to call stack
5 * (4 * factorial(3)) // factorial(4) is added to call stack
5 * (4 * (3 * factorial(2))) // factorial(3) is added to call stack
5 * (4 * (3 * (2 * factorial(1)))) // factorial(2) is added to call stack
5 * (4 * (3 * (2 * (1 * factorial(0))))) // factorial(1) is added to call stack
5 * (4 * (3 * (2 * (1 * 1)))) // factorial(0) returns 1 to factorial(1)
5 * (4 * (3 * (2 * 1))) // factorial(1) returns 1 * factorial(0) = 1 to factorial(2)
5 * (4 * (3 * 2)) // factorial(2) returns 2 * factorial(1) = 2 to factorial(3)
5 * (4 * 6) // factorial(3) returns 3 * factorial(2) = 6 to factorial(4)
5 * 24 // factorial(4) returns 4 * factorial(3) = 24 to factorial(5)
120 // factorial(5) returns 5 * factorial(4) = 120 to global execution context

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For &lt;em&gt;n = 5&lt;/em&gt;, the evaluation of the factorial function involves 6 calls to the factorial function in total, including the base case.&lt;/p&gt;

&lt;p&gt;While the basic recursive approach mirrors the mathematical definition of the factorial function more closely (and more naturally) than the iterative approach, it also uses more memory, as each function call is pushed onto the call stack as a stack frame and popped off only when the call returns a value.&lt;/p&gt;

&lt;p&gt;For larger values of &lt;em&gt;n&lt;/em&gt;, the recursion gets deeper with more function calls to itself and more space has to be allocated to the call stack. When the space needed to store the function calls exceeds the capacity for the call stack, a &lt;strong&gt;stack overflow&lt;/strong&gt; occurs!&lt;/p&gt;

&lt;h3&gt;
  
  
  Tail Recursion and Tail-call Optimization
&lt;/h3&gt;

&lt;p&gt;To prevent deep recursion from causing a stack overflow and crashing the program, the recursive function has to be optimized to reduce its consumption of stack frames in the call stack. One possible approach is to rewrite it as a &lt;strong&gt;tail recursive&lt;/strong&gt; function.&lt;/p&gt;

&lt;p&gt;A tail recursive function calls itself recursively and performs no further computation after the recursive call returns. A function call is a &lt;strong&gt;tail call&lt;/strong&gt; when the caller does nothing other than return the value of that call.&lt;/p&gt;

&lt;p&gt;In functional programming languages such as Scala, the compiler typically performs &lt;strong&gt;tail-call optimization&lt;/strong&gt;: it identifies tail calls and compiles the recursion into an iterative loop that does not consume a stack frame per iteration. In fact, the same stack frame can be reused for the recursive function and the function it calls in tail position [1].&lt;/p&gt;

&lt;p&gt;With this optimization, the space complexity of the recursive function is reduced from &lt;em&gt;O(N)&lt;/em&gt; to &lt;em&gt;O(1)&lt;/em&gt; - from one stack frame per call to one stack frame for all calls [8]. In a way, a tail recursive function is a form of “functional iteration” with performance comparable to a loop.&lt;/p&gt;

&lt;p&gt;For example, the factorial function can be expressed in the form of a tail recursion in Scala:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import scala.annotation.tailrec

def factorialTailRec(n: Int): Long = {
    // @tailrec makes the compiler verify that the call is in tail position
    @tailrec
    def fact(n: Int, product: Long): Long = {
        if (n &lt;= 0) product
        else fact(n-1, n * product)
    }

    fact(n, 1)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
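&lt;p&gt;In contrast with the basic recursive trace above, the tail-recursive version carries the running product along as an accumulator, so each call returns the result of the next call directly and the evaluation stays “flat”. A sketch of how &lt;code&gt;factorialTailRec(5)&lt;/code&gt; unfolds:&lt;/p&gt;

```
factorialTailRec(5)
fact(5, 1)
fact(4, 5)    // 5 * 1 = 5; no pending multiplication is kept on the stack
fact(3, 20)   // 4 * 5 = 20
fact(2, 60)   // 3 * 20 = 60
fact(1, 120)  // 2 * 60 = 120
fact(0, 120)  // 1 * 120 = 120
120           // base case returns the accumulated product directly
```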



&lt;p&gt;While tail-call optimization is performed automatically during compilation in Scala, this is not the case in Python. Moreover, Python imposes a recursion limit (1000 by default) as a safeguard against overflowing the C call stack in the CPython implementation.&lt;/p&gt;
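&lt;p&gt;A short Python sketch of both points: the recursion limit can be inspected via &lt;code&gt;sys.getrecursionlimit&lt;/code&gt;, and since CPython does not eliminate tail calls, a tail-recursive accumulator version (illustrative names below) is often rewritten by hand as the equivalent loop:&lt;/p&gt;

```python
import sys

print(sys.getrecursionlimit())  # commonly 1000 by default in CPython

def factorial_tail_rec(n, product=1):
    # tail-recursive form; still consumes one stack frame per call in CPython
    if n == 0:
        return product
    return factorial_tail_rec(n - 1, n * product)

def factorial_loop(n):
    # the same accumulator logic, hand-converted to a loop (O(1) stack space)
    product = 1
    while n != 0:
        product = n * product
        n = n - 1
    return product

print(factorial_tail_rec(5))  # 120
print(factorial_loop(5))      # 120
```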

&lt;h2&gt;
  
  
  What’s next: Higher-Order Functions
&lt;/h2&gt;

&lt;p&gt;In this post, we learn about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Function composition&lt;/li&gt;
&lt;li&gt;Higher-Order Functions as a key implication of functional programming&lt;/li&gt;
&lt;li&gt;Recursion as a form of “functional iteration”&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Have we found a replacement for “if-else” yet? Not entirely, but we now know how to write “loops” in functional programming using Higher-Order Functions and tail recursion.&lt;/p&gt;

&lt;p&gt;In the next post, I will explore more on Higher-Order Functions and how they can be used in designing functional data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want more behind-the-scenes articles on my learning journey as a data professional? Check out my website at &lt;a href="https://ongchinhwee.me"&gt;https://ongchinhwee.me&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Functional-Programming-Scala-Paul-Chiusano/dp/1617290653"&gt;Functional Programming in Scala by Paul Chiusano and Rúnar Bjarnason&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fpsimplified.com/"&gt;Functional Programming Simplified by Alvin Alexander&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Functional-Python-Programming-programming-built-dp-1788627067/dp/1788627067/ref=dp_ob_title_bk"&gt;Functional Python Programming by Steven F. Lott, 2nd Edition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/library/functions.html#map"&gt;Built-in Functions - Python 3.9.6 Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.scala-lang.org/api/2.13.6/scala/collection/Iterable.html"&gt;Scala Standard Library 2.13.6 - scala.collections.Iterable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.scala-lang.org/overviews/collections-2.13/trait-iterable.html"&gt;Trait Iterable | Collections | Scala Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tpolecat.github.io/2014/05/09/return.html"&gt;tpolecat - The Point of No Return&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://users.scala-lang.org/t/dont-use-return-in-scala/3688/42"&gt;Don’t Use Return in Scala? - Question - Scala Users&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.cornell.edu/courses/cs3110/2019sp/textbook/data/tail_recursion.html"&gt;Tail Recursion - Functional Programming in OCaml&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>python</category>
      <category>scala</category>
      <category>functional</category>
    </item>
    <item>
      <title>3 Key Principles of Functional Programming for Data Engineering</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Sun, 09 May 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/hweecat/3-key-principles-of-functional-programming-for-data-engineering-io4</link>
      <guid>https://dev.to/hweecat/3-key-principles-of-functional-programming-for-data-engineering-io4</guid>
      <description>

&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;In my previous post on &lt;a href="https://hweecat.github.io/learning-scala-motivations"&gt;my motivations for learning Scala&lt;/a&gt;, I stated that one of my key reasons for learning Scala for data engineering is due to the programming language being primarily designed for functional programming.&lt;/p&gt;

&lt;p&gt;Before we dive into the details of writing functional programs, it is important for us to understand the key principles of functional programming and how these programming principles are useful when designing reproducible data pipelines at scale.&lt;/p&gt;

&lt;p&gt;In this post, I introduce:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is Functional Programming&lt;/li&gt;
&lt;li&gt;Key principles of Functional Programming and their implications on data pipeline design&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What is Functional Programming
&lt;/h2&gt;

&lt;p&gt;Functional programming is a &lt;strong&gt;declarative&lt;/strong&gt; style of programming that emphasizes writing software using &lt;strong&gt;only&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pure functions; and&lt;/li&gt;
&lt;li&gt;Immutable values.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To put it simply, functional programmers see their code as mathematical functions - and combinations of functions as equations with defined inputs and outputs.&lt;/p&gt;

&lt;p&gt;The concept of pure functions is the &lt;em&gt;core&lt;/em&gt; of Functional Programming, and has important implications on how functional design principles could be used in designing data applications at scale. For now, here’s a simplified definition of “pure function”:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The output of a pure function depends only on its &lt;strong&gt;input parameters&lt;/strong&gt; and its &lt;strong&gt;internal algorithm&lt;/strong&gt; (i.e. the “black box” where the input parameters are fed into).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A pure function has &lt;strong&gt;no side effects&lt;/strong&gt;; it does not have any read/write interactions with the outside world.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As a consequence of the above two statements, if a pure function is called with an input parameter &lt;em&gt;x&lt;/em&gt; an infinite number of times, &lt;strong&gt;it will always return the same result &lt;em&gt;y&lt;/em&gt;&lt;/strong&gt; - regardless of any state change of an internal or external process.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
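&lt;p&gt;A minimal Python sketch of the contrast between a pure function and an impure one (the names are purely illustrative):&lt;/p&gt;

```python
# pure: the result depends only on the input parameter
def square(x):
    return x * x

# impure: the result depends on external mutable state, and each call
# mutates that state (a side effect)
counter = 0

def square_and_count(x):
    global counter
    counter = counter + 1   # side effect: writes to the outside world
    return x * x + counter  # depends on external state

print(square(3))            # always 9, no matter how often it is called
print(square_and_count(3))  # 10 on the first call...
print(square_and_count(3))  # ...11 on the second call, for the same input
```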

&lt;h3&gt;
  
  
  Declarative vs Imperative Programming
&lt;/h3&gt;

&lt;p&gt;In the &lt;strong&gt;imperative programming&lt;/strong&gt; paradigm, code is viewed as statements that change a program’s state. An imperative program consists of sequences of statements written as explicit instructions to the computer on how the program operates to change its state.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Procedural and object-oriented programming&lt;/em&gt; paradigms are extensions of imperative programming to improve maintainability of imperative programs by separating programs into smaller components. Procedural programming focuses on breaking down a program into procedures (also known as subroutines or functions), while object-oriented programming focuses on breaking down a program into objects with state (data) and behavior (code).&lt;/p&gt;

&lt;p&gt;While procedural and object-oriented programming allow programs to be expressed in procedures that are easier for a programmer to understand without necessarily looking into the details, the complete program is still imperative since the order of execution for the statements (also known as &lt;strong&gt;control flow&lt;/strong&gt;) affects how the program state is being changed.&lt;/p&gt;

&lt;p&gt;In contrast with imperative programming, the &lt;strong&gt;declarative programming&lt;/strong&gt; paradigm expresses the computation logic of a program without explicitly describing the sequence of steps to achieve it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Functional programming&lt;/em&gt; is characterised by a declarative programming style: computations are performed through the evaluation of expressions via function application, with state mutation encapsulated rather than exposed as control flow. This programming paradigm enables the programmer to write self-contained, reusable and testable programs without additional mock objects and interfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Principles of Functional Programming
&lt;/h2&gt;

&lt;p&gt;The key principles of functional programming are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pure functions and avoidance of side effects&lt;/li&gt;
&lt;li&gt;Immutability&lt;/li&gt;
&lt;li&gt;Referential transparency&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Pure functions and avoidance of side effects
&lt;/h3&gt;

&lt;p&gt;When we look at a mathematical function &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P76HbUhK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3F%255Cinline%26space%3By%3Df%28x%29" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P76HbUhK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://latex.codecogs.com/svg.latex%3F%255Cinline%26space%3By%3Df%28x%29" alt="[y=f(x)]"&gt;&lt;/a&gt;, we expect the function &lt;em&gt;f&lt;/em&gt; to do nothing else other than computing the result &lt;em&gt;y&lt;/em&gt; given its input &lt;em&gt;x&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In other words, a &lt;strong&gt;pure&lt;/strong&gt; function has no observable effect on the program execution besides returning a result (which is its main effect).&lt;/p&gt;

&lt;p&gt;A function with &lt;strong&gt;side effects&lt;/strong&gt; changes state outside the local function scope. Examples of side effects include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;modifying a variable or data structure in place&lt;/li&gt;
&lt;li&gt;modifying a global state&lt;/li&gt;
&lt;li&gt;performing any I/O operation (reading from or writing to a file/database, printing to console or reading user input etc.)&lt;/li&gt;
&lt;li&gt;throwing an exception with an error&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To illustrate the concept of a pure function and its key implications, let’s use an oven as an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jRRY4AQ---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://hweecat.github.io/images/pizza_pure_function.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jRRY4AQ---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://hweecat.github.io/images/pizza_pure_function.jpg" alt="illustration of pure function using oven and pizza" title="illustration of pure function using oven and pizza"&gt;&lt;/a&gt;Pure Function - illustrated using oven and pizza&lt;/p&gt;

&lt;p&gt;To bake a thin-crust Hawaiian pizza (sorry, pizza purists), we need pizza crust and toppings, with the oven temperature set at 160 degrees Celsius for 10 minutes. The inputs to the oven-baking function are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pizza crust type (thin-crust)&lt;/li&gt;
&lt;li&gt;list of toppings (cheese, tomato, ham, pineapple chutney)&lt;/li&gt;
&lt;li&gt;oven temperature (in degrees Celsius)&lt;/li&gt;
&lt;li&gt;baking time (in minutes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we assume that the oven-baking operation is a pure function, then the output of the operation depends &lt;em&gt;only&lt;/em&gt; on the inputs and the internal algorithm of the oven-baking operation. We do not expect any side effects, such as the oven-baking operation burning down the kitchen.&lt;/p&gt;

&lt;p&gt;Consequently, we expect the oven to return a perfectly-baked thin-crust Hawaiian pizza &lt;em&gt;every single time, regardless of how many times we perform the operation&lt;/em&gt; given the same inputs, without changing any state outside the oven. We do not expect the oven to return a cream-based pizza or a burnt pizza given the function input.&lt;/p&gt;

&lt;p&gt;In more formal terminology, we expect a &lt;strong&gt;pure function&lt;/strong&gt; (the oven-baking operation) to be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;deterministic&lt;/em&gt; and &lt;em&gt;idempotent&lt;/em&gt;, and&lt;/li&gt;
&lt;li&gt;without &lt;em&gt;side effects&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;
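&lt;p&gt;The idealised pure oven can be sketched in Python as follows - a hypothetical model of my own, not the author’s implementation:&lt;/p&gt;

```python
from typing import NamedTuple

class Pizza(NamedTuple):
    crust: str
    toppings: tuple

def bake(crust: str, toppings: tuple, temp_c: int, minutes: int) -> Pizza:
    # The result is fully determined by the inputs and this internal algorithm.
    done = "baked" if temp_c >= 160 and minutes >= 10 else "underbaked"
    return Pizza(crust=done + " " + crust, toppings=toppings)

# Same inputs, same pizza - no matter how many times we bake.
first = bake("thin-crust", ("cheese", "tomato", "ham", "pineapple"), 160, 10)
again = bake("thin-crust", ("cheese", "tomato", "ham", "pineapple"), 160, 10)
assert first == again
```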

&lt;p&gt;In reality, we might sometimes open the oven door to check on the oven-baking operation. (I/O operation)&lt;/p&gt;

&lt;p&gt;We might decide to shorten the baking time by adjusting the oven’s timer, or add more cheese to the pizza toppings. (modifying a variable in place)&lt;/p&gt;

&lt;p&gt;The oven might heat up its surroundings, increasing the temperature of its external environment. (modifying a global state)&lt;/p&gt;

&lt;p&gt;The oven might either get too hot or suffer a short circuit, affecting the successful completion of the oven-baking operation. (throwing an exception with an error)&lt;/p&gt;

&lt;p&gt;These effects resulting from the oven-baking operation cause changes in state outside the oven besides the thin-crust Hawaiian pizza, hence making the oven-baking operation an &lt;strong&gt;impure&lt;/strong&gt; function with side effects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Immutability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Immutability&lt;/strong&gt; means that once a value is assigned to a variable, the state of the variable &lt;em&gt;cannot be changed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The concept of immutability is important in Functional Programming, as it ensures that the function has a disciplined state and does not change other variables outside the function scope. Instead of modifying the value of a variable in place, state changes are managed by creating another instance without affecting the state of the original variable.&lt;/p&gt;
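&lt;p&gt;Here is a minimal Python sketch (variable names are my own) of the two styles of state change:&lt;/p&gt;

```python
# Imperative style: modify the value in place (a side effect).
toppings = ["cheese", "tomato"]
toppings.append("ham")          # the original list has been mutated

# Functional style: create a new instance; the original stays unchanged.
base = ("cheese", "tomato")     # tuples are immutable in Python
extended = base + ("ham",)      # a new value, built from the old one

assert base == ("cheese", "tomato")
assert extended == ("cheese", "tomato", "ham")
```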

&lt;p&gt;The use of immutable variables also ensures that the function is &lt;strong&gt;pure&lt;/strong&gt;, as it prevents the side effect of state change after a value is assigned to an immutable variable.&lt;/p&gt;

&lt;p&gt;A key implication of immutability is the &lt;strong&gt;ease of writing parallel/concurrent programs&lt;/strong&gt; in Functional Programming.&lt;/p&gt;

&lt;p&gt;In imperative programming, mutability of states often complicates reasoning about distributed states and concurrent execution, as it is immensely difficult to keep track of shared state changes across threads, cores and processors without running into race conditions. In concurrent operations, a data race can arise when two threads perform conflicting operations (at least one of them being a write) on the same memory location at the same time.&lt;/p&gt;

&lt;p&gt;As Python is primarily designed as an object-oriented programming language, its imperative design patterns lead to complications in managing concurrent access to shared variables that are mutable by default - hence CPython’s Global Interpreter Lock (GIL), which allows only one thread to execute bytecode at a time to protect the interpreter’s internal state.&lt;/p&gt;

&lt;p&gt;Immutability in functional programming simplifies the implementation of concurrency and provides powerful ways of building consistent and concurrent programs, as the use of immutable shared states leads to elimination of race conditions - making concurrent programming less problematic compared with the imperative approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Referential transparency
&lt;/h3&gt;

&lt;p&gt;An important property resulting from the use of pure functions is &lt;strong&gt;referential transparency&lt;/strong&gt;, which is intricately linked to the ability to perform &lt;strong&gt;equational reasoning&lt;/strong&gt; about programs.&lt;/p&gt;

&lt;p&gt;In the book &lt;em&gt;Functional Programming in Scala&lt;/em&gt; by Paul Chiusano and Rúnar Bjarnason, referential transparency is formally defined as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An expression &lt;code&gt;e&lt;/code&gt; is &lt;em&gt;referentially transparent&lt;/em&gt; if, for all programs &lt;code&gt;p&lt;/code&gt;, all occurrences of &lt;code&gt;e&lt;/code&gt; in &lt;code&gt;p&lt;/code&gt; can be replaced by the result of evaluating &lt;code&gt;e&lt;/code&gt; without affecting the meaning of &lt;code&gt;p&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, referential transparency is a property of expressions (not just functions) such that an expression can be substituted by its equivalent result without affecting the program logic for all programs.&lt;/p&gt;

&lt;p&gt;The absence of side effects is a necessary but not sufficient condition for referential transparency. The expression also has to be &lt;strong&gt;deterministic&lt;/strong&gt; and &lt;strong&gt;idempotent&lt;/strong&gt; to ensure the equivalence between the expression and its evaluated result.&lt;/p&gt;

&lt;p&gt;A function is &lt;strong&gt;deterministic&lt;/strong&gt; if it will always return the same output given the same input.&lt;/p&gt;

&lt;p&gt;A function is &lt;strong&gt;idempotent&lt;/strong&gt; if it can be applied multiple times without changing the result beyond its initial application. Examples of idempotent functions are the identity function, absolute value function and constant functions.&lt;/p&gt;
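&lt;p&gt;Both properties can be checked in a few lines of Python (toy functions of my own, using the absolute value example above):&lt;/p&gt;

```python
import random

def absolute(x: float) -> float:
    return abs(x)               # deterministic AND idempotent

def noisy(x: float) -> float:
    return x + random.random()  # not deterministic: varies per call

# Deterministic: the same input always gives the same output.
assert absolute(-3.0) == absolute(-3.0)

# Idempotent: applying it again does not change the result further.
assert absolute(absolute(-3.0)) == absolute(-3.0)
```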

&lt;p&gt;The sufficient conditions for referential transparency can be illustrated by the following analogy:&lt;/p&gt;

&lt;p&gt;What if the oven breaks down over time even without external interference, causing the pizza to not be as well baked as before? There might not be observable side effects, but the output returned from the oven-baking operation is no longer the same as previous outputs given the same input. This makes the oven-baking operation non-deterministic since the result depends on when the operation is evaluated, breaking the property of referential transparency.&lt;/p&gt;

&lt;p&gt;A key consequence of the property of referential transparency is that it enables &lt;strong&gt;equational reasoning&lt;/strong&gt; of programs. The expression can be replaced with its equivalent result, and computation can be performed by substituting &lt;em&gt;“equals for equals”&lt;/em&gt; without worrying about evaluation order or program state - similar to evaluating an algebraic expression in mathematics.&lt;/p&gt;

&lt;p&gt;This mode of reasoning about program evaluation, called the &lt;em&gt;substitution model&lt;/em&gt;, is simpler to work with since the effects of evaluation are purely local and do not require sequential reasoning about state updates to understand the code. When bugs do surface during development, this ease of reasoning makes debugging easier in functional programming than in imperative programming.&lt;/p&gt;
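&lt;p&gt;The substitution model can be seen directly in a toy Python example of my own:&lt;/p&gt;

```python
def square(x: int) -> int:
    return x * x

# square(3) is referentially transparent, so every occurrence of it
# can be substituted by its value, 9, without changing the program.
a = square(3) + square(3)
b = 9 + 9  # "equals for equals"
assert a == b == 18
```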

&lt;p&gt;When designing reproducible data pipelines at scale, having referential transparency in the code provides the following benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Idempotency of functions assures the programmer that the data transformation functions in the program are &lt;strong&gt;reproducible&lt;/strong&gt; beyond the initial application.&lt;/li&gt;
&lt;li&gt;It enables the programmer to express code in more concise and readable functions and values, improving &lt;strong&gt;readability&lt;/strong&gt; when coding.&lt;/li&gt;
&lt;li&gt;It allows the programmer to focus on debugging within the function scope without worrying about state changes outside the function scope, improving &lt;strong&gt;maintainability&lt;/strong&gt; of core transformations within a data pipeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What’s next: Functional Programming for Data Pipeline Design
&lt;/h2&gt;

&lt;p&gt;In this post, we learn about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Functional programming and how it differs from imperative programming&lt;/li&gt;
&lt;li&gt;Concept of pure functions&lt;/li&gt;
&lt;li&gt;Key principles of functional programming and their implications on data pipeline design&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my upcoming post, I will dive into some features of functional programming and how to implement them in designing functional data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want more behind-the-scenes articles on my learning journey as a data professional? Check out my website at &lt;a href="https://ongchinhwee.me"&gt;https://ongchinhwee.me&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Functional-Programming-Scala-Paul-Chiusano/dp/1617290653"&gt;Functional Programming in Scala by Paul Chiusano and Rúnar Bjarnason&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fpsimplified.com/"&gt;Functional Programming Simplified by Alvin Alexander&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.cornell.edu/courses/cs3110/2021sp/textbook/intro/mutability.html"&gt;Mutability - Functional Programming in OCaml&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://softwareengineering.stackexchange.com/questions/40297/what-is-a-side-effect"&gt;programming languages - What is a side effect? - Software Engineering Stack Exchange&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cs.stackexchange.com/questions/19297/is-equational-reasoning-an-application-of-referential-transparency"&gt;functional programming - Is Equational Reasoning an application of Referential Transparency? - Computer Science Stack Exchange&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>python</category>
      <category>scala</category>
      <category>functional</category>
    </item>
    <item>
      <title>I Started Learning Scala as a Python Programmer. Here’s Why.</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Sun, 18 Apr 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/hweecat/learning-scala-as-a-python-programmer-motivations-4hg4</link>
      <guid>https://dev.to/hweecat/learning-scala-as-a-python-programmer-motivations-4hg4</guid>
      <description>

&lt;h2&gt;
  
  
  Motivations for learning Scala
&lt;/h2&gt;

&lt;p&gt;One of my tech goals in 2021 is to learn Scala. My key reason for learning Scala is to learn Functional Programming for data engineering.&lt;/p&gt;

&lt;p&gt;The question is: Why go through the trouble of learning Scala if Functional Programming is supported in Python?&lt;/p&gt;

&lt;h3&gt;
  
  
  How I learn different programming paradigms
&lt;/h3&gt;

&lt;p&gt;As a language purist who believes in using tools for their intended purposes, I believe that the best way to learn a new programming paradigm (whether it is Object-Oriented Programming or Functional Programming) in depth is to learn a suitable programming language that is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Designed primarily for the target programming paradigm (not as an afterthought)&lt;/li&gt;
&lt;li&gt;Similar in syntax and code patterns to a programming language that you are already familiar with&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Point no. 2 matters because it keeps the focus on learning the programming paradigm itself, rather than fretting over unfamiliar syntax and code patterns.&lt;/p&gt;

&lt;p&gt;For example, I learnt Object-Oriented Programming by re-acquainting myself with C and gradually transitioning to C++ through learning object-oriented concepts such as classes, inheritance and polymorphism. This language transition helped speed up the process of learning Python, which is designed to be primarily an object-oriented programming language although the language supports multiple programming paradigms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Python is not ideal for learning functional data engineering
&lt;/h3&gt;

&lt;p&gt;While exploring parallel programming in Python for data science (and I highly recommend you watch &lt;a href="https://youtu.be/E9sv2B3Bb20"&gt;my PyData Global 2020 talk&lt;/a&gt; for a vivid explanation of parallelism), I learnt about functional programming in Python through &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;itertools&lt;/code&gt;. It surprised me how intuitively I could relate functional programming to mathematical functions, and how I was already subconsciously using some functional programming concepts in visualizing my data flows.&lt;/p&gt;

&lt;p&gt;Since then, I started diving deeper into different programming paradigms, but felt that I wasn’t diving deep enough into the functional programming paradigm with Python.&lt;/p&gt;
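&lt;p&gt;For readers unfamiliar with this corner of the standard library, here is a small sketch of the functional style available through &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;itertools&lt;/code&gt; (the variable names are my own):&lt;/p&gt;

```python
import operator
from itertools import accumulate

quantities = [3, 4, 2]

# map applies a pure function to every element - no explicit loop needed
doubled = list(map(lambda q: q * 2, quantities))         # [6, 8, 4]

# accumulate expresses a running total without mutating a counter
running_total = list(accumulate(doubled, operator.add))  # [6, 14, 18]
```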

&lt;p&gt;&lt;strong&gt;1. Python is primarily an object-oriented programming language&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While functions are first-class objects in Python, the code patterns of the builtins and the Python Standard Library are built around classes and objects. This makes Python primarily designed to encourage object-oriented and imperative design patterns rather than the functional paradigm, even though it is a multi-paradigm programming language that provides developers with some level of flexibility in using their preferred coding pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Parallel programming is not truly concurrent in Python due to Python internals design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The object-oriented design of Python and its builtins also complicates parallel programming in Python: breaking a sequential program into parallel chunks means managing concurrency to prevent conflicting access to shared variables (hence the need for a Global Interpreter Lock in CPython).&lt;/p&gt;

&lt;p&gt;As a data professional who deals with data volumes that require at least some level of parallelism, having to spend precious developer hours figuring out which parts of a sequential program can be refactored for parallelism and consistency is a major source of frustration and angst. Having the intuition to break programs down into smaller independent functions (I/O and non-I/O) does not make the process of refactoring for optimal parallelism any easier.&lt;/p&gt;

&lt;p&gt;I still love Python for its comprehensive data ecosystem and the friends I made from the community, and I am still actively using Python in my personal and work projects.&lt;/p&gt;

&lt;p&gt;However, the scalability and reproducibility of data processing pipelines at larger scales require heavy lifting by parallel or even distributed processing, as well as a programming paradigm that supports that heavy lifting while allowing data professionals to be productive in designing data solutions they are confident will produce the same result for the same data input.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why learn Scala for data engineering
&lt;/h3&gt;

&lt;p&gt;There might be various reasons why developers choose to learn Scala. For some, it could be due to web programming. In recent years, a popular reason for learning Scala is for Big Data.&lt;/p&gt;

&lt;p&gt;Here are my reasons for learning Scala for data engineering:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Scala is primarily designed for functional programming and object-oriented programming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scala is designed to seamlessly integrate the object-oriented programming paradigm of Java with the functional programming paradigm, and aims to address criticisms of Java.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The syntax for Scala shares similarities with C/C++ and Python&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scala uses curly-brace syntax similar to C/C++, and also encourages indentation for nested logical blocks (e.g. a function within a class). The slight difference is that Scala conventionally uses 2 spaces of indentation instead of the 4 spaces or tabs common in Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Scala is a strong statically-typed language with type inference, which saves trivial keystrokes while maintaining strong typing for function definitions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Static typing allows bugs to be caught earlier in the development process, helping to avoid defects in complex applications. As a strong, statically-typed language, Scala checks code for type safety at compile time rather than at runtime. Like the optional type hints in Python, Scala lets you omit many annotations - the Scala compiler can often infer the types from other elements of the source code.&lt;/p&gt;

&lt;p&gt;Where Scala differs from Python on typing is that type annotations are required for the arguments of function definitions (they are optional in Python), which prevents breaking changes to function internals and helps optimize compile times. While I try to use type annotations as much as possible in Python, they do not affect execution on the Python interpreter. Making type annotations explicit by default has the potential to save precious developer time on debugging applications that rely on multiple functions, while ensuring that the desired data type is returned from a function.&lt;/p&gt;
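&lt;p&gt;The point about the Python interpreter ignoring annotations can be seen directly (a small sketch with a hypothetical function of my own):&lt;/p&gt;

```python
def total(price: float, qty: int) -> float:
    return price * qty

# The annotations are hints: CPython records them but never enforces them.
result = total("ha", 3)   # a type-incorrect call still runs
assert result == "hahaha"
assert total.__annotations__["return"] is float
```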

&lt;h3&gt;
  
  
  What’s next: my progress in learning Scala as a Python programmer
&lt;/h3&gt;

&lt;p&gt;It has been close to 3 months since I started learning Scala, and it has been an incredibly steep learning curve. While I’m glad that I could rely on my prior experiences with learning programming languages to start becoming productive with Scala, I do understand that the learning curve could have been even steeper for someone who is starting fresh.&lt;/p&gt;

&lt;p&gt;I do have to admit that functional programming takes a lot of getting used to, especially when one has gotten used to writing code in procedural and object-oriented programming paradigms. I like to think of functional programming as math with defined inputs and outputs, while procedural programming is a step-by-step recipe and object-oriented programming is objects with properties. They are different ways of writing code that works, but the thinking behind the process of writing code differs.&lt;/p&gt;

&lt;p&gt;It does get very tempting to fall back on comfortable code patterns when the learning gets tough, especially in a multi-paradigm programming language like Python. Hence, I do find myself feeling terribly unproductive at times when getting used to (and being “steered towards”) writing code in the functional paradigm. I do love using type annotations in Scala though - they do make debugging much easier during the development process!&lt;/p&gt;

&lt;p&gt;In my upcoming post, I will introduce the basic principles of functional programming and relate these principles to reproducible data pipeline designs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want more behind-the-scenes articles on my learning journey as a data professional? Check out my website at &lt;a href="https://ongchinhwee.me"&gt;https://ongchinhwee.me&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>scala</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Year 2020 in review - when tech conferences go virtual</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Thu, 31 Dec 2020 00:00:00 +0000</pubDate>
      <link>https://dev.to/hweecat/year-2020-in-review-when-tech-conferences-go-virtual-398f</link>
      <guid>https://dev.to/hweecat/year-2020-in-review-when-tech-conferences-go-virtual-398f</guid>
      <description>

&lt;p&gt;At the start of 2020, I set a goal to speak at 4 tech conferences including one in Europe. COVID-19 disrupted my plans completely, and I was forced to adapt to the new reality of virtual conferences. Here’s my journey from regional to international speaker in the midst of a pandemic, and lessons learnt along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recap of Year 2020: Speaking
&lt;/h2&gt;

&lt;p&gt;In year 2020, I gave a total of &lt;strong&gt;10 talks&lt;/strong&gt;, including an &lt;strong&gt;exclusive episode&lt;/strong&gt; and &lt;strong&gt;my first invited keynote&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1 Invited Keynote&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;12 December 2020: “Is Rainfall Getting Heavier? Building a Weather Forecasting Pipeline with Singapore Weather Station Data” at &lt;a href="https://pycode-conference.org/"&gt;PyCode Conference 2020&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;6 Conferences&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;20 March 2020: &lt;a href="https://hweecat.github.io/talk_fossasia-parallel-async-python"&gt;Speed Up Your Data Processing: Parallel and Asynchronous Programming in Python&lt;/a&gt; at &lt;a href="https://summit.fossasia.org/"&gt;FOSSASIA Summit 2020&lt;/a&gt; — Recording &lt;a href="https://youtu.be/aB6f5KicM2Y"&gt;courtesy of FOSSASIA&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;25 April 2020: &lt;a href="https://hweecat.github.io/talk_pypizza-jit-with-numba"&gt;Just-in-Time with Numba&lt;/a&gt; at &lt;a href="https://remote.python.pizza/"&gt;Remote Python Pizza&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;23 July 2020: &lt;a href="https://hweecat.github.io/talk_europython-parallel-async-ds"&gt;Speed Up Your Data Processing&lt;/a&gt; at &lt;a href="https://ep2020.europython.eu/"&gt;EuroPython 2020&lt;/a&gt; — Recording &lt;a href="https://youtu.be/PB7_5BQp1SU"&gt;courtesy of EuroPython&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;6 September 2020: &lt;a href="https://hweecat.github.io/talk_pycontw-parallel-async-ds"&gt;Speed Up Your Data Processing: Parallel and Asynchronous Programming in Data Science&lt;/a&gt; at &lt;a href="https://tw.pycon.org/2020/"&gt;PyCon Taiwan 2020&lt;/a&gt; — Recording &lt;a href="https://youtu.be/w2eUdxPQQ78"&gt;courtesy of PyCon Taiwan&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;15 November 2020: “Speed Up Your Data Processing: Parallel and Asynchronous Programming in Data Science” at &lt;a href="https://global.pydata.org/"&gt;PyData Global&lt;/a&gt; — &lt;a href="https://youtu.be/E9sv2B3Bb20"&gt;Pre-recorded talk, post-processed by PyData Global&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;6 December 2020: “Is Rainfall Getting Heavier? Building a Weather Forecasting Pipeline with Singapore Weather Station Data” at &lt;a href="https://pyjamas.live"&gt;PyJamas 2020&lt;/a&gt; — Livestream recording &lt;a href="https://youtu.be/lrb3f8jtZqg?t=1800"&gt;courtesy of PyJamas 2020&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;6 December 2020: “Seeing Data in Multiple Dimensions: Hierarchical Indexing in Pandas and How to Visualize Them” at &lt;a href="https://pyjamas.live"&gt;PyJamas 2020&lt;/a&gt; - Murphy’s Law struck down this one. ::sadface::&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;2 Meetups&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;14 January 2020: &lt;a href="https://hweecat.github.io/talk_juniordevsg_exploring_seasonal_insights_from_sg_weather_data"&gt;Exploring Seasonal Insights from Singapore Weather Station Data&lt;/a&gt; at &lt;a href="https://www.meetup.com/Junior-Developers-Singapore/events/267507133/"&gt;JuniorDevSG Code and Tell&lt;/a&gt; — Recording &lt;a href="https://youtu.be/gNF8D8AkCgY"&gt;courtesy of Engineers.SG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;28 March 2020: &lt;a href="https://hweecat.github.io/talk_pyladies-jit-with-numba"&gt;Just-in-Time with Numba&lt;/a&gt; at &lt;a href="https://pyladies.com/"&gt;PyLadies International Women’s Month Lightning Talks&lt;/a&gt; &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;1 Exclusive Collaboration&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;11 November 2020: &lt;a href="https://www.wanted.jobs/events/spotlight_breaking_data_ep3"&gt;Wanted Spotlight - Breaking Data Episode 3: “How to Build Your Data Career Like A Business”&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Facing Impostor Syndrome as a Tech Speaker
&lt;/h2&gt;

&lt;p&gt;Before I got started with speaking at tech conferences and meetups, the question I asked myself was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What would it take for me to be up on this stage as a speaker?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Throughout this year, the question I repeatedly asked myself was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Am I good enough to become a successful speaker?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I struggle a lot with impostor syndrome as a developer and a tech speaker who has been doing most of my speaking for free.&lt;/p&gt;

&lt;p&gt;Even after speaking at a higher-profile conference such as EuroPython, I still wonder if I am good enough to develop a good reputation as a developer and tech speaker.&lt;/p&gt;

&lt;p&gt;In short, I’m not a natural when it comes to speaking in front of an audience. I still get nervous about what could go wrong, and whether giving a bad talk would hurt the return on investment of speaking at a conference.&lt;/p&gt;

&lt;p&gt;When I reflect upon my reasons for speaking at tech events:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build my personal brand and raise my profile in the local tech community&lt;/li&gt;
&lt;li&gt;Gain experience in public speaking&lt;/li&gt;
&lt;li&gt;Pay it forward to the local tech community&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I wonder if these reasons still hold true amidst a global pandemic when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The local tech community is getting a bit more muted due to the absence of in-person meetups and networking opportunities&lt;/li&gt;
&lt;li&gt;I end up having to either speak one-way to a screen or pre-record my talk during a virtual conference, without much audience interaction&lt;/li&gt;
&lt;li&gt;I find myself struggling to keep myself relevant in the local tech community while juggling my speaking and career goals, thinking of how to prepare for upcoming Call for Proposals and what to speak about in my next speaking opportunity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For year 2021, I will be focusing more on quality over quantity of conference talks - in order to dedicate more time towards my personal learning and career goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speaking in-person at a Conference amidst a Pandemic
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://summit.fossasia.org/"&gt;FOSSASIA Summit 2020&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When: 20 March 2020&lt;/p&gt;

&lt;p&gt;Talk: &lt;a href="https://hweecat.github.io/talk_fossasia-parallel-async-python"&gt;Speed Up Your Data Processing: Parallel and Asynchronous Programming in Python&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I had been looking forward to attending FOSSASIA Summit since 2019, as it is one of the largest gatherings of open-source enthusiasts - it could even be considered a headliner event for the open-source community. In fact, I was really looking forward to meeting Eriol Fox again and bringing them around Singapore for meatless food.&lt;/p&gt;

&lt;p&gt;Alas, travel restrictions due to COVID-19 prevented most of the overseas speakers and attendees from attending in person. Even the FOSSASIA organizers, who would have flown in from Europe, could not make it to the venue due to COVID-19 restrictions.&lt;/p&gt;

&lt;p&gt;Despite the challenges posed by COVID-19, the organizers decided to proceed with the event with a mix of offline and online talks, with live streaming and chats. I could still proceed with giving my talk, and I ended up speaking to a room of fewer than 10 people and a livestream audience of I-have-no-idea-how-many.&lt;/p&gt;

&lt;p&gt;It was a strange experience attending a conference with so few people. I originally designed this talk to be interactive and audience-driven, with the pace set by casual “coffee shop” banter with the audience. While having fewer than 10 people in the audience does make for a more “intimate” experience, a part of me wished that I could interact with more people as usual.&lt;/p&gt;

&lt;p&gt;Then again, is anything normal in a pandemic?&lt;/p&gt;

&lt;p&gt;As it turns out, this was my first and last in-person conference talk of 2020.&lt;/p&gt;

&lt;h2&gt;
  
  
  My First Time Speaking at a Conference outside of Asia
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://ep2020.europython.eu/"&gt;EuroPython 2020&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Talk: &lt;a href="https://hweecat.github.io/talk_europython-parallel-async-ds"&gt;Speed Up Your Data Processing: Parallel and Asynchronous Programming in Data Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of my speaking goals for 2020 was to speak at a tech conference outside of Asia, and this talk was intended as my European speaking debut.&lt;/p&gt;

&lt;p&gt;COVID-19 made it impossible for me to travel to speak. However, it also led to conferences moving online, and I was able to speak at a European conference “virtually” for the first time.&lt;/p&gt;

&lt;p&gt;I came in without many expectations about audience interest, since my talk wasn’t on the list of talks selected by popular vote - in fact, I was just happy to be given a chance to speak at EuroPython alongside an impressive lineup of high-profile Pythonistas! Hence, I was pleasantly surprised that my talk managed to make a pretty good impression - and might have opened a few more doors for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speaking virtually at an in-person Conference amidst a Pandemic
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://tw.pycon.org/2020/"&gt;PyCon Taiwan 2020&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When: 6 September 2020&lt;/p&gt;

&lt;p&gt;Talk: &lt;a href="https://hweecat.github.io/talk_pycontw-parallel-async-ds"&gt;Speed Up Your Data Processing: Parallel and Asynchronous Programming in Data Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Speaking at PyCon Taiwan had been one of my key priorities for 2020 even before the pandemic, as Taiwan was where I made my international debut on the conference-speaking circuit.&lt;/p&gt;

&lt;p&gt;Alas, I couldn’t fly to Tainan for the in-person conference due to COVID-19.&lt;/p&gt;

&lt;p&gt;Seeing live audience responses from the other end of the remote call while I presented my talk gave me hope about the future of tech conferences though - that one day we may be able to meet in person again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-recording talk with “live” elements at my first PyData Conference
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://global.pydata.org/"&gt;PyData Global 2020&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When: 15 November 2020&lt;/p&gt;

&lt;p&gt;Talk: Speed Up Your Data Processing: Parallel and Asynchronous Programming in Data Science&lt;/p&gt;

&lt;p&gt;By this time, I was getting a bit “sick” of writing reflection posts for virtual conferences. Moreover, this was a pre-recorded talk - no comments about audience response or speaking performance, since the talk was not given live anyway.&lt;/p&gt;

&lt;p&gt;Special mention to the folks who designed the virtual conference space on Gather to replicate the “in-person” conference experience as much as possible - it made PyData Global a slightly more engaging experience even when I was getting a bit “fatigued” by virtual conferences and Zoom calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Murphy’s Law Struck Down The Internet and Mobile Network (and One of My Talks)
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://pyjamas.live/"&gt;PyJamas 2020&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When: 6 December 2020&lt;/p&gt;

&lt;p&gt;Talk #1: Is Rainfall Getting Heavier? Building a Weather Forecasting Pipeline with Singapore Weather Station Data&lt;/p&gt;

&lt;p&gt;Talk #2: Seeing Data in Multiple Dimensions - Hierarchical Indexing and How to Visualize Them&lt;/p&gt;

&lt;p&gt;For some reason, Murphy’s Law loves to strike when I unexpectedly get an opportunity to give conference talks on the first week of December.&lt;/p&gt;

&lt;p&gt;The good news: Getting both talk proposals accepted via anonymous voting.&lt;/p&gt;

&lt;p&gt;The bad news: Having to prepare two new talks, and having my Internet connection throw tantrums during my ambitious attempt to prepare both in less than 24 hours.&lt;/p&gt;

&lt;p&gt;Unfortunately for me, Murphy’s Law won and struck down &lt;strong&gt;both&lt;/strong&gt; my fibre broadband &lt;strong&gt;and&lt;/strong&gt; my mobile network by the time my second talk came around. Hence, Talk #1 lived and Talk #2 died.&lt;/p&gt;

&lt;p&gt;I became so grumpy about my Internet woes that I ended up sleeping off my frustrations for the rest of the conference. Thank goodness for the YouTube replay.&lt;/p&gt;

&lt;p&gt;I guess there’s a reason why I think giving two talks at one conference is not a good idea - and I resolve to refrain from doing that from now on. Quality over quantity. Sorry, community folks.&lt;/p&gt;

&lt;h2&gt;
  
  
  My first Keynote
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://pycode-conference.org/"&gt;PyCode Conference 2020&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When: 12 December 2020&lt;/p&gt;

&lt;p&gt;Talk: Is Rainfall Getting Heavier? Building a Weather Forecasting Pipeline with Singapore Weather Station Data&lt;/p&gt;

&lt;p&gt;I wasn’t really expecting to give any more talks after PyJamas 2020, so this keynote invitation from the PyCode Conference organizing team came as a surprise.&lt;/p&gt;

&lt;p&gt;My initial thought was: Why did they invite me to be their keynote speaker, when there are more well-known speakers in the Python community? Why me?&lt;/p&gt;

&lt;p&gt;As I carefully considered whether to accept the keynote invitation (I was asked to choose between Web and Data for the theme), I thought:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do I make use of this opportunity to tell my story while setting the tone of the event?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that is how I decided to make my uniquely Singaporean experience of “ponding” the centrepiece of my keynote speech - with tropical rainfall forecasting serving as the medium for exploring the challenges faced by various experts in predicting rainfall against the backdrop of climate change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Epilogue
&lt;/h2&gt;

&lt;p&gt;After giving 7 talks within one year, most of them virtual, I am starting to experience burnout and fatigue in some form.&lt;/p&gt;

&lt;p&gt;Can I just emphasize once again that it feels kinda weird speaking to a webcam without knowing how many people are actually watching your talk? Imagine that happening for most of the year.&lt;/p&gt;

&lt;p&gt;With the pandemic not showing any signs of abating, virtual conferences seem to be the “new normal” for aspiring conference speakers. That means I have to commit to keeping my speaking-related tech in good condition and tailoring my speaking style to a virtual “on-demand” audience with little interaction.&lt;/p&gt;

&lt;p&gt;While giving as many talks as possible without the constraints of travel seems tempting, I found myself in a delicate balancing act in 2020, with my speaking-related prep, my career, and my personal life all muddled into one.&lt;/p&gt;

&lt;p&gt;Hence, my goal for 2021 is not to speak at even more conferences than in 2020.&lt;/p&gt;

&lt;p&gt;My goal for 2021 is to focus on quality and deliver talks at 4 conferences, locally and internationally - and I am starting 2021 with a tally of one.&lt;/p&gt;

</description>
      <category>reflection</category>
    </item>
    <item>
      <title>#Shitoberfest: How free T-shirts ruined #Hacktoberfest2020</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Sat, 03 Oct 2020 00:00:00 +0000</pubDate>
      <link>https://dev.to/hweecat/shitoberfest-how-free-t-shirts-ruined-hacktoberfest2020-142l</link>
      <guid>https://dev.to/hweecat/shitoberfest-how-free-t-shirts-ruined-hacktoberfest2020-142l</guid>
      <description>

&lt;h2&gt;
  
  
  About Hacktoberfest
&lt;/h2&gt;

&lt;p&gt;Hacktoberfest is an annual event organized by &lt;a href="https://digitalocean.com"&gt;DigitalOcean&lt;/a&gt; that celebrates open-source contributions. Occurring every October, the goal of Hacktoberfest is to encourage developers (of all backgrounds and skill levels) and companies to make positive contributions to the open-source community.&lt;/p&gt;

&lt;p&gt;To encourage developers to make more open-source contributions, the first 70,000 participants who successfully make 4 pull requests (PRs) between 1 - 31 October (in any time zone) to &lt;em&gt;any&lt;/em&gt; public repository on GitHub are eligible to receive a prize in the form of a limited-edition T-shirt. Yes, any public repository - no limits.&lt;/p&gt;

&lt;p&gt;Alternatively, for 2020, participants can choose to plant a tree instead of getting a free T-shirt - as a show of support for #sustainability.&lt;/p&gt;

&lt;p&gt;All this sounds like an initiative with good intentions on paper - incentivise developers to contribute to open-source projects. Unfortunately, the organizers underestimated what people are willing to do for the sake of getting free T-shirts (or freebies in general).&lt;/p&gt;

&lt;h2&gt;
  
  
  How low-quality PRs turned #Hacktoberfest2020 into #Shitoberfest
&lt;/h2&gt;

&lt;p&gt;On October 1st after work, I was just minding my own business, scrolling Twitter and searching GitHub for interesting #hacktoberfest issues to work on - and then I saw this tweet from a hilariously-named Twitter account called &lt;a href="https://twitter.com/shitoberfest"&gt;@shitoberfest&lt;/a&gt; that seemed to be dedicated to curating spam PRs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hi, I'm &lt;a href="https://twitter.com/shitoberfest?ref_src=twsrc%5Etfw"&gt;@shitoberfest&lt;/a&gt;. Do you maintain an opensource project?   &lt;/p&gt;

&lt;p&gt;Send a screenshot of bullshit drive by pull-requests caused by &lt;a href="https://twitter.com/hashtag/hacktoberfest?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#hacktoberfest&lt;/a&gt; and tag &lt;a href="https://twitter.com/shitoberfest?ref_src=twsrc%5Etfw"&gt;@shitoberfest&lt;/a&gt; for curation and amplification. &lt;a href="https://t.co/50wuPISbYb"&gt;pic.twitter.com/50wuPISbYb&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— #shitoberfest (@shitoberfest) &lt;a href="https://twitter.com/shitoberfest/status/1311646233128181760?ref_src=twsrc%5Etfw"&gt;October 1, 2020&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;“What on earth is happening?” I wondered while looking through the tweets in the account and realizing that the PRs seemed to follow similar patterns - with “Awesome Project” and other nonsensical edits to READMEs.&lt;/p&gt;

&lt;p&gt;Next, I saw this tweet by &lt;a href="https://twitter.com/GaelVaroquaux"&gt;Gael Varoquaux (@GaelVaroquaux)&lt;/a&gt;, a co-founder of the &lt;a href="https://scikit-learn.org"&gt;scikit-learn project&lt;/a&gt;. And it looked like the spam PR situation was MASSIVE.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A difficulty in a popular open-source project: people submit contributions to gain credit, but not always useful (below: no content + invalid markup).  &lt;/p&gt;

&lt;p&gt;scikit-learn had 10000 pull requests, 732 still open.&lt;br&gt;&lt;br&gt;
Reviewing them is costly qualified labor.  &lt;/p&gt;

&lt;p&gt;Rapid closing harms openness 🤷 &lt;a href="https://t.co/6DKcB2aHmO"&gt;pic.twitter.com/6DKcB2aHmO&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Gael Varoquaux (@GaelVaroquaux) &lt;a href="https://twitter.com/GaelVaroquaux/status/1311582305559998465?ref_src=twsrc%5Etfw"&gt;October 1, 2020&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;I highly recommend checking out &lt;a href="https://twitter.com/shitoberfest"&gt;@shitoberfest&lt;/a&gt; for a good laugh at some of the spam PRs. It’s so bad that it’s hilarious to casual observers who don’t care too much about free T-shirts.&lt;/p&gt;

&lt;p&gt;It’s not so funny for &lt;a href="https://blog.domenic.me/hacktoberfest/"&gt;open-source maintainers&lt;/a&gt; though - they had to do extra work cleaning up spam PRs and tagging them as “invalid” or “spam” so that those spam PRs do not count towards the Hacktoberfest tally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cause
&lt;/h2&gt;

&lt;p&gt;What caused #Shitoberfest? It wasn’t that massive a problem last year, even though the problem of “spam PRs” has always been there - so it could not have been caused solely by the incentive of a free T-shirt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://joel.net/how-one-guy-ruined-hacktoberfest2020-drama"&gt;This excellent narrative post by Joel Thoms&lt;/a&gt; explains in detail what caused the #Shitoberfest drama (thanks &lt;a href="https://twitter.com/eugeneyan"&gt;Eugene Yan&lt;/a&gt; for sharing that gem!): a YouTuber called CodeWithHarry demonstrated how easy it is to make a pull request to a repo in order to win free stuff - by creating a low-quality PR.&lt;/p&gt;

&lt;p&gt;This led to his viewers doing &lt;strong&gt;exactly&lt;/strong&gt; what he did, resulting in this &lt;strong&gt;#Shitoberfest&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Without alluding to any particular nationality, race, or ethnicity, I will just note that he was speaking a non-English language in that demonstration video. The video has since been removed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Psychology and Aftermath
&lt;/h2&gt;

&lt;p&gt;The next question is: What is the psychology behind those people who created spam PRs just for the sake of getting a free T-shirt?&lt;/p&gt;

&lt;p&gt;There are two key factors that might have influenced such behaviour:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://thedecisionlab.com/insights/consumer-insights/impact-free-consumer-decision-making/"&gt;The Psychology of Free&lt;/a&gt; - in this case, the psychology of trying to get free stuff with as low an effort or opportunity cost as possible.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://online.king.edu/news/psychology-of-fomo/"&gt;The Psychology of FOMO (Fear of Missing Out)&lt;/a&gt; - in this case, the fear of missing out on a “limited edition” T-shirt that spells “bragging rights” after watching a YouTuber getting one from making PRs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s human nature to love free stuff, especially if there’s a way to get them without spending too much effort and time - it’s like getting an undeserved gift. It’s also human nature to be subject to peer pressure and FOMO when seeing others getting stuff that we wish we had but do not have.&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;it’s not fair to exploit human nature in a way that causes masses of people to generate spam that leads to wastage of other people’s limited resources&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is also &lt;strong&gt;not fair to other developers who genuinely want to learn through contributing to open-source projects that they care about, be it their own projects or projects by other developers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The problem is that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DigitalOcean created the limited T-shirt incentive to attract more people to contribute to any public GitHub repositories as part of Hacktoberfest.&lt;/li&gt;
&lt;li&gt;CodeWithHarry wanted to show his audience ways to get free stuff by making PRs for Hacktoberfest, and demonstrated with a low-quality PR for convenience and speed - without explicitly mentioning in the video that the audience should not copy exactly what he did, as it was a low-quality PR &lt;em&gt;just for demonstration&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Some of those spammers were probably new to open source and/or Hacktoberfest, did not know how to contribute meaningfully, but really wanted that free Hacktoberfest T-shirt anyway.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And all these culminated in a ruined Hacktoberfest and a change of rules.&lt;/p&gt;

</description>
      <category>reflection</category>
    </item>
    <item>
      <title>I feel like an impostor in tech. I'm still here.</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Sun, 08 Mar 2020 15:06:06 +0000</pubDate>
      <link>https://dev.to/hweecat/i-feel-like-an-impostor-in-tech-i-m-still-here-2i91</link>
      <guid>https://dev.to/hweecat/i-feel-like-an-impostor-in-tech-i-m-still-here-2i91</guid>
      <description>&lt;p&gt;I started my career in tech in 2014 as a research assistant at an aerospace corporate research laboratory, developing proof-of-concept experiments for advanced manufacturing techniques for additive-manufactured aircraft engine components. While I fretted over getting the required metallic test pieces to conduct experiments and was getting increasingly frustrated at the budgeting and lead-time, I became envious of the research associates and research fellows working on computational projects whose research progress depended less on materials lead-time and could work on other aspects of their projects while waiting for their simulations to run.&lt;/p&gt;

&lt;p&gt;During that low point of my career, I thought to myself:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I switch from conducting physical experiments to conducting simulated experiments as a career?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dissatisfied with the state of my career, I resigned from my job towards the end of my 2-year contract with no clear path forward, knowing that my journey in pursuing a Masters in computational science and modeling would be far from smooth sailing. I may have learnt to code in C and MATLAB during my undergraduate studies, and I did pretty well in writing numerical code for my projects (with plenty of help from googling and reading sample code). However, it had been more than 2 years since I wrote a line of code - how would I brush up on my coding skills to survive a computational Masters?&lt;/p&gt;

&lt;p&gt;It certainly did not help that I was also going through a very low point in my personal life, escaping from an abusive relationship and struggling with physical pain. Nevertheless, I decided that I had to move forward with my long-term goal of becoming a computational researcher, and I had to start somewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning to be human through coding and tech
&lt;/h2&gt;

&lt;p&gt;While waiting for the first semester of my Masters program to commence, I did a refresher and started writing my first lines of code for the first time in 2 years. My coding skills had gone rusty from lack of use, but I felt free for the first time in 2 years - free to experiment, and free to make mistakes in the process. The pressures of trying to upkeep a front of perfection and invulnerability in the name of trying to be extraordinary slipped away as I encountered error messages and fixed the bugs in my code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's okay to make mistakes in the process - you're not going to destroy the computer with your syntax errors.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once I was reminded that syntax errors are part of the learning process and do not necessarily make me an impostor, I stopped beating myself up for making mistakes.&lt;/p&gt;

&lt;p&gt;It has been more than 3 years since I came back to coding through computational science, and I have been working as a data engineer at a government-linked corporation for the past 1.5 years.&lt;/p&gt;

&lt;p&gt;I'm very thankful that I was given a chance despite having learnt Python on my own just a few months before, because they saw the value in the computational and research skills that I gained through my Masters project and research stint. I spoke at 2 regional tech conferences in Asia last year, and I am excited to be making my European speaking debut at DragonPy in Slovenia this year (assuming COVID-19 does not derail plans, fingers crossed). I have also made my very first contribution to the documentation for the pandas 1.0 release, and got onto the waitlist for OSCON this year (phew, not rejected yet I guess).&lt;/p&gt;

&lt;p&gt;I still feel very much like an impostor in tech, because I don't see enough of people like me in tech events and conferences. I still fare badly at technical interviews especially for data structures and algorithms, and struggle to understand all the big data tools that are highly sought after and talked about in the data ecosystem. I still feel a great sense of fear and shiftiness when I attempt to write technical posts in my developer blogs or deliver talks on stage, worrying that I would be exposed as a fraud if I make mistakes in my writing or mess up my talk.&lt;/p&gt;

&lt;p&gt;Nevertheless, I'm still here in tech and I'm still coding.&lt;/p&gt;

&lt;p&gt;Despite my own insecurities in my tech capabilities and having to contend with implicit discrimination in male-dominated workplaces with entrenched biases, I learnt the value of community in tech.&lt;/p&gt;

&lt;p&gt;Even though I initially felt drained when attending tech events and conferences alone, the act of showing up and being radically honest about my intentions helped me feel part of the tech community as people started noticing and reaching out to me.&lt;/p&gt;

&lt;p&gt;Even with COVID-19 leading to a string of cancellations of major tech events, the local tech community came together to move the meetups online so that members of the tech community could continue to share their experiences and connect with each other while staying safe.&lt;/p&gt;

&lt;p&gt;Most importantly, it is through getting involved in the tech community and staying in tech that I learnt that &lt;strong&gt;it is okay to be imperfect and vulnerable&lt;/strong&gt;, even as a person from an underrepresented community. I write code and do tech, but I am also human and I am still learning how to improve myself every day. And that has given me permission to make mistakes, learn from them, and grow into a better person in my career and personal life.&lt;/p&gt;

</description>
      <category>wecoded</category>
      <category>theycoded</category>
    </item>
    <item>
      <title>Year 2019 in review - getting started with speaking at a tech conference</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Tue, 31 Dec 2019 00:00:00 +0000</pubDate>
      <link>https://dev.to/hweecat/year-2019-in-review-getting-started-with-speaking-at-a-tech-conference-op7</link>
      <guid>https://dev.to/hweecat/year-2019-in-review-getting-started-with-speaking-at-a-tech-conference-op7</guid>
      <description>

&lt;p&gt;At the start of 2019, I set a goal to speak at a tech event. By the end of 2019, I’ve spoken at 2 meetups and 2 conferences. Here’s my journey from wide-eyed event attendee to tech conference speaker, and lessons learnt along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recap of Year 2019: Speaking
&lt;/h2&gt;

&lt;p&gt;In 2019, I gave a total of &lt;strong&gt;4 talks&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2 Conferences&lt;/strong&gt;:

&lt;ol&gt;
&lt;li&gt;31 August 2019: &lt;a href="https://hweecat.github.io/talk_how-to-make-your-data-processing-faster"&gt;How to Make Your Data Processing Faster: Parallel Processing and JIT in Data Science&lt;/a&gt; at &lt;a href="https://asia.womenwhocode.dev/"&gt;Women Who Code CONNECT Asia 2019&lt;/a&gt; — Recording &lt;a href="https://youtu.be/RX5rlt3jAt0"&gt;courtesy of Engineers.SG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;1 December 2019: &lt;a href="https://hweecat.github.io/talk_extracting_seasonal_insights_from_sg_weather_station_data/"&gt;Making Open Weather Data More Accessible: Extracting Seasonal Insights from Singapore Weather Station Data&lt;/a&gt; at &lt;a href="https://www.openup.global/"&gt;OpenUP Global Summit 2019&lt;/a&gt; — Recording &lt;a href="https://www.youtube.com/watch?v=x8CtEtn0vsc"&gt;courtesy of Open UP Summit&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 Meetups&lt;/strong&gt;:

&lt;ol&gt;
&lt;li&gt;27 August 2019: &lt;a href="https://hweecat.github.io/talk_parallel-programming-python"&gt;Parallel Processing in Python&lt;/a&gt; at &lt;a href="https://www.meetup.com/Singapore-Python-User-Group/events/263765155/"&gt;Python User Group Singapore Meetup&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;28 November 2019: &lt;a href="https://hweecat.github.io/talk_contributing-pandas-docs-first-time"&gt;Contributing to pandas documentation for the first time - lessons from open source&lt;/a&gt; at &lt;a href="https://www.meetup.com/Women-Who-Code-Singapore/events/266037585/"&gt;Women Who Code Singapore TalksDev #5&lt;/a&gt; — Recording &lt;a href="https://youtu.be/qGPaRTG17ts"&gt;courtesy of Engineers.SG&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Speaking at Tech Events: Why?
&lt;/h2&gt;

&lt;p&gt;I’ve attended tech events and conferences since 2018, and I’ve learnt a lot from attending the sessions. As I sat among the audience watching in awe at the speakers and panelists on stage delivering their talks and speeches with confidence and style, I thought:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What would it take for me to be up on this stage as a speaker?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m more of an introvert by nature, and extensive networking at tech events can get very draining for me, even though I do love interacting with new people. Having to find various ways to start a conversation only to have it fall flat also adds a lot of pressure, and consecutive occurrences can drain me of energy so severely that I need to take breaks from attending tech events. Moreover, I tend to get really nervous when speaking in front of an audience - unlike musical performance, where it is natural not to maintain eye contact with the audience while immersing your whole self (body + mind) in the music, public speaking requires a lot more attention to eye contact and body language.&lt;/p&gt;

&lt;p&gt;In short, I’m not a natural when it comes to speaking in front of an audience. Still, I want to speak at tech events for the following reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build my personal brand and raise my profile in the local tech community&lt;/li&gt;
&lt;li&gt;Gain experience in public speaking&lt;/li&gt;
&lt;li&gt;Pay it forward to the local tech community&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;At the end of January, I attended a panel discussion on “Getting Started with Public Speaking” by Women Who Code Singapore. It was at this event that I met three of the panelists who would turn out to be pivotal in my speaking journey - Renu Yadav (Women Who Code Singapore), &lt;a href="https://twitter.com/hj_chen"&gt;Chen Hui Jing&lt;/a&gt; (organizer of &lt;a href="https://twitter.com/singaporecss"&gt;SingaporeCSS&lt;/a&gt;) and &lt;a href="https://twitter.com/coderkungfu"&gt;Michael Cheng&lt;/a&gt; (JuniorDevSG, EngineersSG).&lt;/p&gt;

&lt;p&gt;As it turns out, Hui Jing also organizes Global CFP Diversity Day in Singapore and was promoting the event (as well as her meetup SingaporeCSS, which always needs speakers). I originally intended to attend another tech event on the same day, but woke up late and decided at the last minute to attend Global CFP Diversity Day - just to see what it was about.&lt;/p&gt;

&lt;p&gt;During the Global CFP Diversity Day workshop, we had to write a speaker profile and work through a whole list of questions on crafting a CFP submission. Uh okay, what can a data engineer who is barely 5 months into the role write or speak about that would impress the CFP panel? Wait, what - I have to tell Hui Jing about myself?&lt;/p&gt;

&lt;p&gt;(insert more never-ending questions that sprouted after Global CFP Diversity Day)&lt;/p&gt;

&lt;p&gt;As it turns out, role-related pains/complaints/angst can be a great source of inspiration for talk proposals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source of Inspirations for Talk Proposals
&lt;/h2&gt;

&lt;p&gt;The talks I have proposed to meetups and conferences so far are mainly derived from pains/complaints/angst that I faced while working on data analytics projects - both at work and while working on side projects during my free time. I tend to joke with fellow developers I met during tech events that I write talk proposals and give talks to “make up for the time spent on wrangling with all those dev issues and angsting/bitching about it” and to “make full use of the opportunity to speak to air my grievances while building my personal brand”.&lt;/p&gt;

&lt;p&gt;For example, my talk on How to Make Your Data Processing Faster started out as a by-product from the Shopee Data Science Challenge which two of my colleagues and I participated in. As it was our first data science challenge, we faced loads of challenges processing our images and using them as our model inputs. To milk the most out of the “suffering” and precious time lost in trying to process the images, I wrote a Medium post about it and used whatever I wrote as a CFP idea with the intention to submit to multiple conferences. It turned out that data processing is a common bottleneck and a major pain point among data and software engineers, and I was overwhelmed yet honoured by how positive the reception was when I started speaking about parallel processing in data science.&lt;/p&gt;

&lt;p&gt;In short, if you faced a challenge/problem/issue at work and managed to solve it after expending loads of time and effort, why not milk your pain’s worth by giving a talk and sharing your experiences so that fellow developers can learn from them? After all, we can learn from each other’s experiences with a technology or a problem to solve, and that is independent of the number of years of experience in the industry, given how fast technology is changing.&lt;/p&gt;

&lt;h2&gt;
  
  
  My First Time Speaking at a Tech Event
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://www.meetup.com/Singapore-Python-User-Group/events/263765155/"&gt;Python User Group Singapore Meetup&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When: 27 August 2019&lt;/p&gt;

&lt;p&gt;Talk: &lt;a href="https://hweecat.github.io/talk_parallel-programming-python"&gt;Parallel Processing in Python&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For my first-ever public tech talk, I initially thought of starting small by speaking at relatively more “beginner-friendly” meetups such as JuniorDevSG Code and Tell. While I submitted my first CFP to a conference, I was prepared to submit CFPs to multiple conferences and meetups until I received a talk acceptance.&lt;/p&gt;

&lt;p&gt;Surprisingly, my first CFP submission, on How to Make Your Data Processing Faster, was accepted by the end of June, before I had a meetup talk scheduled. I needed to practice my conference talk at a meetup where I could get relevant feedback fast, and Ka Ho informed me that there would be a long waiting list for speaker slots at JuniorDevSG Code and Tell. Due to the urgency of the situation, I booked a speaking slot with Martin at the Python User Group Singapore August 2019 meetup to practice a key portion of my conference talk. I was told that the meetup would be held at the new Zendesk office at Marina One and that the typical turnout would be around 50-60 people. The day before I was due to speak, the RSVP count on Meetup.com went up to 200 people. Realising that I would have to deliver my first-ever meetup talk in front of more than 100 people even if 50% of the RSVPs dropped out, I took time off work to refine and practice my talk thoroughly.&lt;/p&gt;

&lt;p&gt;On the day itself, the actual turnout was over 100 people - way more than I expected, especially since I had been relatively low-key about the talk, even removing any references to my gender in the speaker profile. There were technical issues, my “tech-y” jokes and Spark references kinda fell flat on the audience, and the talk overran slightly. Surprisingly, more than half the audience were still seated despite the overrun. I thought I had kinda messed up my talk, so I was pleasantly surprised when a few people from the audience approached me off-stage (even though it was already pretty late) to thank me and give positive feedback on the talk.&lt;/p&gt;

&lt;p&gt;No talk recordings here; frankly speaking, I don’t even dare to watch my own recording on my phone in full. Speaking in front of more than 100 people for a meetup is already quite a feat, and I’m glad I survived better than I thought I could despite having a nervous start.&lt;/p&gt;

&lt;h2&gt;
  
  
  My First Time Speaking at a Conference
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://asia.womenwhocode.dev/"&gt;Women Who Code CONNECT Asia 2019&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When: 31 August 2019&lt;/p&gt;

&lt;p&gt;Talk: &lt;a href="https://hweecat.github.io/talk_how-to-make-your-data-processing-faster"&gt;How to Make Your Data Processing Faster: Parallel Processing and JIT in Data Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the first conference that took a chance on me to deliver a full-length talk at the conference stage. I enjoyed the speakers’ dinner, got to meet international speakers such as the really awesome &lt;a href="https://twitter.com/jiaqicodes"&gt;Jiaqi Liu&lt;/a&gt;, &lt;a href="https://twitter.com/milhauschan"&gt;Millie Chan&lt;/a&gt; and &lt;a href="https://twitter.com/kaatloo"&gt;Kat Liu&lt;/a&gt;, and watched Hui Jing put up a highly-entertaining talk on &lt;a href="https://www.youtube.com/watch?v=SXwBxro6y40"&gt;Creating Art with CSS&lt;/a&gt; despite being massively jetlagged from her conference travels.&lt;/p&gt;

&lt;p&gt;I gave a practice talk on the Parallel Processing portion at the Python User Group Singapore August 2019 meetup. Based on the questions and feedback collated from the audience, I made some improvements to the slides and even sought the help of the Javascript/Node.JS folks on Twitter for ideas on how to explain async to a general audience. Having the experience of speaking on stage behind a podium in front of more than 100 people, and doing the open pose as suggested by the WWCode Taipei folks, also helped significantly with the nerves before the talk.&lt;/p&gt;

&lt;p&gt;Feedback from the audience was pretty good. Quite a number of people approached me off-stage, throughout the conference (including lunchtime) and/or on LinkedIn to thank me for the talk and give positive feedback on how much they learnt from it. I also received a couple of really nice mentions on Twitter. Having someone come up to you and say “hey I attended your talk and it was really interesting” makes all the preparation work for a conference talk worth it.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/RX5rlt3jAt0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  My First Time Speaking at a Conference outside of Singapore
&lt;/h2&gt;

&lt;p&gt;Where: &lt;a href="https://www.openup.global/"&gt;Open UP Global Summit 2019&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Talk: &lt;a href="https://hweecat.github.io/talk_extracting_seasonal_insights_from_sg_weather_station_data/"&gt;Making Open Weather Data More Accessible: Extracting Seasonal Insights from Singapore Weather Station Data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the first time I travelled out of Singapore to speak at a conference, and my first time delivering a talk with a demo segment. In the spirit of #opendata, the core objective of the talk was to show how we could make open weather data more accessible to anyone - developers and non-developers included - hence the title.&lt;/p&gt;

&lt;p&gt;A large reason why I got the chance to speak at Open UP Global Summit was because the main organizer was in attendance at Women Who Code CONNECT Asia and enjoyed my talk. I feel incredibly honoured to be trusted with the conference stage, and blessed to be given the opportunity to travel to Taipei and deliver a full-length talk based on my weather station data API scraping project.&lt;/p&gt;

&lt;p&gt;I refactored the code for my weather station data API scraping project (which was a product of a random weekend coding exploration) and prepared the Jupyter notebook for the time series visualizations in advance. As it was my first time delivering the talk in a demo-driven format, I prepared the presentation slides first and rehearsed the non-demo parts incessantly, while conceiving possible plan Bs in case the API scraping demo did not run smoothly. As my speaking slot was on 1st December and I wanted to analyse weather data up to end November, I ran both the API scraping and time series visualization notebooks to obtain the latest data and visualizations, and tested both demo segments until 3am to ensure that I could showcase the demo with the latest weather readings on stage.&lt;/p&gt;

&lt;p&gt;On the day of the talk itself, Murphy’s Law of Demos struck, with a Wifi connection that kept dropping every 10 minutes (I was using the venue wifi and had to keep re-doing the login) and issues with my Jupyter server. I made a snap decision to truncate my demo, explain a bit about the scraping code and showcase an offline Jupyter notebook of the time series visualization with all the code executed beforehand.&lt;/p&gt;

&lt;p&gt;Once again, Murphy’s Law of Demos struck. While attempting to showcase the offline Jupyter notebook on Visual Studio Code, the scrolling turned wonky and I had difficulties scrolling to the visualization that I wanted to show the audience! I had not thought of a plan B for that situation, and with only 3 minutes left, I wrestled with the scrolling and finally managed to showcase the time series box-and-whisker plots for the scraped weather data at the last minute. Not too sure if the audience picked up that I was kinda “panicking” with my demo, though at least two of the speakers couldn’t tell that it was only my first year of speaking or conducting a demo on stage.&lt;/p&gt;

&lt;p&gt;Another blunder, in hindsight: I did not show my last slide with my social media and GitHub repo before ending my talk, so that was a lost opportunity for self-promotion.&lt;/p&gt;

&lt;p&gt;After the talk, I was feeling fairly negative about what happened to the demo segments on stage and felt that I had messed up big time, especially since the audience didn’t seem too responsive to my attempts to engage them through questions. The feeling of “I think I messed up big time” intensified when not many people tweeted about my talk (except for my fellow speaker and tech community Korean sister &lt;a href="https://twitter.com/sujinleeme"&gt;Sujin Lee&lt;/a&gt;, who delivered a highly informative talk on &lt;a href="https://www.youtube.com/watch?v=SRit-fr7bgo"&gt;data-driven design for visualization&lt;/a&gt;) or approached me off-stage throughout the conference, but the organizers and a few of my fellow speakers assured me not to worry too much and said that I did pretty well for my talk.&lt;/p&gt;

&lt;p&gt;I still felt that I could have done much better for this talk especially in the demo segments, but it was a great international speaking experience nevertheless. I also learnt to be less harsh on myself when things do not go according to plan on stage, as it is likely that the audience may not even notice your mistakes or stumbles if you keep your composure and continue with your performance.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/x8CtEtn0vsc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Epilogue
&lt;/h2&gt;

&lt;p&gt;Fate seems to work in mysterious ways, and this time it seems to go full circle.&lt;/p&gt;

&lt;p&gt;Remember that I said I was initially thinking of starting small by speaking at JuniorDevSG Code and Tell?&lt;/p&gt;

&lt;p&gt;After my first conference talk at Women Who Code CONNECT Asia and during my speaking break, I was approached by Michael Cheng to give a talk at JuniorDevSG Code and Tell which would be held on 14 January 2020. It looks like my first talk at JuniorDevSG Code and Tell will indeed be my first talk - for the year 2020.&lt;/p&gt;

&lt;p&gt;And this time, I would like to deliver a better performance for my Singapore Weather Station talk.&lt;/p&gt;

</description>
      <category>yearinreview</category>
      <category>techtalks</category>
      <category>speaking</category>
      <category>career</category>
    </item>
    <item>
      <title>Understanding Python Dependency Management using pipdeptree</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Fri, 25 Oct 2019 01:44:56 +0000</pubDate>
      <link>https://dev.to/hweecat/understanding-python-dependency-management-using-pideptree-52fd</link>
      <guid>https://dev.to/hweecat/understanding-python-dependency-management-using-pideptree-52fd</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Learning about tree-based dependency management for a team project developed in Python using pipdeptree&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Dependency management is important, as packages depend on specific versions of other core packages in order to run as intended. Typically in a Python project, dependencies are installed from a requirements.txt file, which lists the packages and their versions as a flat list. While the package versions are included in the requirements.txt file, the dependency relationships are not explicitly stated. Determining the dependency relationships between packages using requirements.txt often requires "reverse engineering" in the form of tracing the dependencies of each installed package and figuring out why pip installed certain packages (since pip does the work of resolving package dependencies when installing packages).&lt;/p&gt;

&lt;p&gt;While searching for ways to reconcile the multiple requirements.txt files from my colleagues within a team project, I stumbled across &lt;code&gt;pipdeptree&lt;/code&gt;, a command-line utility for displaying installed Python packages in the form of a dependency tree. The output is displayed in a tree-based format instead of a flat list, showing the dependency relationships between installed Python packages and their associated dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using pipdeptree for dependency management
&lt;/h2&gt;

&lt;p&gt;To install pipdeptree, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pipdeptree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or, if you prefer to use conda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conda &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; conda-forge pipdeptree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to be able to manage dependencies within virtual environments, &lt;code&gt;pipdeptree&lt;/code&gt; has to be installed within each individual virtual environment. If you are starting a new virtual environment for a project and would like to use &lt;code&gt;pipdeptree&lt;/code&gt; for dependency management, you would have to install &lt;code&gt;pipdeptree&lt;/code&gt; in that new virtual environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dependency tree output
&lt;/h3&gt;

&lt;p&gt;To view the dependency tree of every installed package within the virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipdeptree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view the dependency tree of a particular package e.g. pandas, the flag &lt;code&gt;-p&lt;/code&gt; or &lt;code&gt;--packages&lt;/code&gt; is used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipdeptree &lt;span class="nt"&gt;-p&lt;/span&gt; pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view the reverse dependency tree - the packages that are dependent on every installed package within the virtual environment, the flag &lt;code&gt;-r&lt;/code&gt; or &lt;code&gt;--reverse&lt;/code&gt; is used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipdeptree &lt;span class="nt"&gt;-r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sometimes we may prefer to have the dependency tree displayed as a JSON representation to be used as input to other external tools. In this case, the flag &lt;code&gt;-j&lt;/code&gt; or &lt;code&gt;--json&lt;/code&gt; outputs a flat list of all packages with their immediate dependencies, while the flag &lt;code&gt;--json-tree&lt;/code&gt; outputs a nested JSON structure representing the dependency relationships between packages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipdeptree &lt;span class="nt"&gt;--json&lt;/span&gt;       &lt;span class="c"&gt;# for immediate dependencies&lt;/span&gt;

pipdeptree &lt;span class="nt"&gt;--json-tree&lt;/span&gt;  &lt;span class="c"&gt;# for nested dependencies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
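&lt;p&gt;As a quick sketch of feeding that JSON output into another tool: each entry in the &lt;code&gt;--json&lt;/code&gt; list pairs a "package" object with its immediate "dependencies". The sample below is illustrative only - the version numbers are made up, and the field names are based on pipdeptree's JSON format, which may vary between versions:&lt;/p&gt;

```python
import json

# Illustrative sample of `pipdeptree --json` output (versions are made up)
sample = '''
[
  {
    "package": {"key": "pandas", "package_name": "pandas", "installed_version": "0.25.2"},
    "dependencies": [
      {"key": "numpy", "package_name": "numpy", "installed_version": "1.17.3", "required_version": ">=1.13.3"},
      {"key": "pytz", "package_name": "pytz", "installed_version": "2019.3", "required_version": ">=2017.2"}
    ]
  }
]
'''

# Build a simple {package: [immediate dependency keys]} mapping
deps = {
    entry["package"]["key"]: [d["key"] for d in entry["dependencies"]]
    for entry in json.loads(sample)
}
print(deps)  # {'pandas': ['numpy', 'pytz']}
```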



&lt;p&gt;To lay out the dependency graph, Graphviz is required both as a system installation (for the command-line tools) and as a Python package in the virtual environment. The available output formats are dot, jpeg, pdf, png and svg. For example, to output the dependency graph as a PDF, I use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipdeptree &lt;span class="nt"&gt;--graph-output&lt;/span&gt; pdf &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; dependencies.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Installing Graphviz in virtual environment
&lt;/h4&gt;

&lt;p&gt;First, I installed Graphviz on Ubuntu 18.04 LTS on Windows Subsystem for Linux (WSL) using &lt;code&gt;apt-get install&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;graphviz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I installed graphviz for Python using conda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conda &lt;span class="nb"&gt;install &lt;/span&gt;graphviz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As of now, I have yet to get Graphviz running successfully on Windows + Anaconda, but I will try setting up Graphviz on Windows to work with pipdeptree when I have the time (still figuring out what went wrong). Nevertheless, Graphviz works smoothly on Ubuntu 18.04 LTS WSL, so my current dependency management workflow is pip + venv + pipdeptree - and it works pretty smoothly, without the additional packages that conda installs in environments!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/pipdeptree/"&gt;pipdeptree . PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://anaconda.org/conda-forge/pipdeptree"&gt;pipdeptree :: Anaconda Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.columbia.edu/~njn2118/journal/2016/3/24.html"&gt;Using pipdeptree in a virtualenv&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>todayilearned</category>
      <category>todayisearched</category>
      <category>python</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Accelerating Batch Processing of Images in Python — with gsutil, numba and concurrent.futures</title>
      <dc:creator>Ong Chin Hwee</dc:creator>
      <pubDate>Sun, 26 May 2019 18:56:47 +0000</pubDate>
      <link>https://dev.to/hweecat/accelerating-batch-processing-of-images-in-python-with-gsutil-numba-and-concurrent-futures-59ah</link>
      <guid>https://dev.to/hweecat/accelerating-batch-processing-of-images-in-python-with-gsutil-numba-and-concurrent-futures-59ah</guid>
      <description>&lt;h4&gt;
  
  
  How to accelerate batch processing of almost a million images from &lt;strong&gt;several months&lt;/strong&gt; to just around &lt;strong&gt;a few days&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In a data science project, one of the biggest bottlenecks (in terms of time) is the constant wait for the data processing code to finish executing. Slow code, as well as intermittent connections to web and remote instances, affects every step of a typical data science pipeline — data collection, data pre-processing/parsing, feature engineering, etc. Sometimes, the gigantic execution times even end up making the project infeasible, and often force a data scientist to work with only a subset of the entire dataset, depriving the data scientist of insights and performance improvements that could be obtained with a larger dataset.&lt;/p&gt;

&lt;p&gt;In fact, time bottlenecks resulting from long execution times are even more accentuated for batch processing of image data, which are often read as numpy arrays of large dimensions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem Description
&lt;/h3&gt;

&lt;p&gt;About 2 months ago, my colleagues and I took part in the Advanced Category of the &lt;a href="https://careers.shopee.sg/ndsc/"&gt;Shopee National Data Science Challenge 2019&lt;/a&gt;. The competition involved extracting product attributes from product titles, and we were given three main categories of items: mobile, fashion and beauty products. In addition, we were also given the image path of each item and the associated image file — all 77.6 GB of it!&lt;/p&gt;

&lt;p&gt;It was our first ever competition on Kaggle, and we started out feeling confident with the hybrid text-and-image model we had in mind — only to face bottlenecks in downloading and processing the large image datasets into Numpy arrays in order to feed them as inputs for our model. Here are some of my notes on the approach I attempted to resolve these issues, with particular focus on how I used numba and concurrent.futures to accelerate batch processing of almost a million images from &lt;strong&gt;several months&lt;/strong&gt; to just around &lt;strong&gt;a few days&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Processing Workflow
&lt;/h3&gt;

&lt;p&gt;To start off, here are the &lt;strong&gt;general steps&lt;/strong&gt; in our data processing workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download large image datasets from source using the &lt;em&gt;wget&lt;/em&gt; command&lt;/li&gt;
&lt;li&gt;Upload large volume of image files to Google Cloud Storage using &lt;em&gt;gsutil&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Import each image file from Cloud Storage to Colab&lt;/li&gt;
&lt;li&gt;Convert each image to a standardized &lt;em&gt;numpy&lt;/em&gt; array&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Step 1: Using wget command to download large image datasets
&lt;/h4&gt;

&lt;p&gt;The first bottleneck we faced was downloading the image files from the Dropbox links provided by Shopee. Due to data leaks for the fashion category, we had to download the updated CSV files and image files again. One of the archive files containing images for the training set in the fashion category (including the original test set that had its attribute labels leaked) amounted to 35.2 GB, and our multiple attempts throughout the week at using Google Chrome to download the .tar.gz archive files containing the image files failed due to “Network error”.&lt;/p&gt;

&lt;p&gt;This bottleneck was resolved using the &lt;strong&gt;&lt;em&gt;wget&lt;/em&gt; command&lt;/strong&gt; on Ubuntu in Windows 10 WSL (Windows Subsystem for Linux). The best part about using the &lt;em&gt;wget&lt;/em&gt; command for downloading large files is that it works exceedingly well for poor or unstable connections, as &lt;em&gt;wget&lt;/em&gt; will keep retrying until the whole file has been retrieved and is also smart enough to continue the download from where it left off.&lt;/p&gt;

&lt;p&gt;I opened 2 instances of Ubuntu for WSL and ran &lt;em&gt;wget&lt;/em&gt; commands on each instance to download the &lt;em&gt;.tar.gz&lt;/em&gt; archive files containing the images for the three categories. All four archive files were downloaded successfully after 16 hours, surviving poor connection and network errors. Extracting the image files from the archive files using the &lt;strong&gt;tar -xvzf&lt;/strong&gt; command took another 12 hours in total.&lt;/p&gt;

&lt;p&gt;Tip: Working on command line is usually faster than working on GUI — so it pays to know a bit of command line as a speed hack.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Upload image files to Google Cloud Storage using gsutil cp
&lt;/h4&gt;

&lt;p&gt;My team uses Google Colab to share our Jupyter notebooks and train our models. Colab is great for developing deep learning applications using popular frameworks such as TensorFlow and Keras — and it provides free GPU and TPU.&lt;/p&gt;

&lt;p&gt;However, there are some limits that we face while using Colab:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory limit:~12 GB RAM available after startup&lt;/li&gt;
&lt;li&gt;Timeout: You are disconnected from your kernel after 90 minutes of inactivity — that means we can’t just take a nap while letting our processes run and waiting for our files to be uploaded (Constant vigilance!).&lt;/li&gt;
&lt;li&gt;Reset runtimes: Kernels are reset after 12 hours of execution time — that means all files and variables will be erased, and we would have to re-upload our files onto Colab. Continuously having to re-upload tens of thousands of image files onto Colab while on constant standby is too tedious and slow.&lt;/li&gt;
&lt;li&gt;Google Drive: Technically, we could mount our Google Drive onto Colab to access the files in Google Drive. As none of us pay for additional storage space on Google Drive, we do not have enough storage space for 77.6 GB of image files. Accessing the files through a portable drive is also not a plausible option, as that would add additional consideration in terms of data connectivity between the desktop/laptop and the portable drive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, we needed a solution whereby we could store and access our files whenever needed, while at the same time paying only for what we use rather than fork out additional money just to pay for more storage on Google Drive.&lt;/p&gt;

&lt;p&gt;In the end, we decided on using &lt;strong&gt;Google Cloud Storage (GCS)&lt;/strong&gt; to store and access our image files. GCS is a RESTful online file storage web service on the Google Cloud Platform (GCP) which allows worldwide storage and retrieval of any amount of data at any time. Google provides 12 months and US$300 of GCP credits as a free tier user, which is perfect for our case since the credits would last for at least a month if we use them wisely.&lt;/p&gt;

&lt;p&gt;First, I created a GCS bucket by using &lt;strong&gt;gsutil mb&lt;/strong&gt; on &lt;strong&gt;Cloud SDK&lt;/strong&gt; (the instructions on installing and setting up Cloud SDK can be found &lt;a href="https://cloud.google.com/sdk/install"&gt;here&lt;/a&gt; and &lt;a href="https://cloud.google.com/sdk/docs/initializing"&gt;here&lt;/a&gt; respectively — I used &lt;em&gt;apt-get&lt;/em&gt; to install Cloud SDK on my Ubuntu image, while Cloud SDK is available in Colab).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace 'my-bucket' with your own unique bucket name
! gsutil mb gs://my-bucket
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let’s say I decide to call my storage bucket ‘shopee-cindyandfriends’:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;! gsutil mb gs://shopee-cindyandfriends
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Next, I proceeded to upload all my image files from each folder directory to my storage bucket using &lt;strong&gt;gsutil cp&lt;/strong&gt;. Since I had a large number of files to transfer, I performed a parallel copy using the &lt;strong&gt;gsutil -m&lt;/strong&gt; option. The syntax is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace 'dir' with directory to copy from
! gsutil -m cp -r dir gs://my-bucket
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let’s say I’m uploading all the image files from the fashion_image directory to my storage bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;! gsutil -m cp -r fashion_image gs://shopee-cindyandfriends
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, wait patiently and go about your usual day (maybe take a nap or grab some coffee to recharge) while &lt;em&gt;gsutil&lt;/em&gt; uploads your files. Don’t worry too much about poor or unstable connections as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;gsutil&lt;/em&gt; does retry handling — the &lt;em&gt;gsutil cp&lt;/em&gt; command will retry when failures occur.&lt;/li&gt;
&lt;li&gt;If your upload is interrupted or if any failures were not successfully retried at the end of the &lt;em&gt;gsutil cp&lt;/em&gt; run, you can restart the upload by running the same &lt;em&gt;gsutil cp&lt;/em&gt; command that you ran to start the upload.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Uploading the image files to the GCS bucket using &lt;em&gt;gsutil cp&lt;/em&gt; command took around 12–15 hours in total, surviving poor connection and network disruptions. Do not try uploading large amounts of files using the GCP web console — your browser will crash!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A0odNfFP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A2vmUe8ClFeJL7dbt6XUU6Q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A0odNfFP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A2vmUe8ClFeJL7dbt6XUU6Q.jpeg" alt=""&gt;&lt;/a&gt;All 4 folders in our storage bucket — success!&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Import each image file from Cloud Storage to Colab
&lt;/h4&gt;

&lt;p&gt;Now that we have our complete set of image files uploaded on Cloud Storage, we need to be able to access these files on Colab via the image path of each item in the dataset. The image path of each item is extracted from the dataframe which in turn was extracted from the CSV file of the corresponding dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def define_imagepath(index):
    '''Function to define image paths for each index'''
    imagepath = fashion_train.at[index, 'image_path']
    return imagepath
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
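&lt;p&gt;For context, &lt;code&gt;fashion_train&lt;/code&gt; is the pandas dataframe loaded from the competition CSV. A self-contained sketch of the same lookup on a toy dataframe (the paths below are made up; only the &lt;code&gt;image_path&lt;/code&gt; column name comes from the snippet above):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the competition dataframe (paths are made up)
fashion_train = pd.DataFrame({'image_path': ['fashion_image/abc123.jpg',
                                             'fashion_image/def456.jpg']})

def define_imagepath(index):
    '''Return the image path stored at a given dataframe index'''
    return fashion_train.at[index, 'image_path']

print(define_imagepath(1))  # fashion_image/def456.jpg
```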



&lt;p&gt;Remember the problem of poor connection? To ensure that retry handling is also performed during import operations from GCP, I used the &lt;strong&gt;retrying&lt;/strong&gt; package as a simplified way to add retry behavior to the Google API Client function. Here’s the Python code I used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from retrying import retry
from google.colab import auth

@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000)
def gcp_imageimport(index):
    '''Import image from GCP using image path'''
    from googleapiclient.discovery import build
    from apiclient.http import MediaIoBaseDownload

    # Create the service client.
    gcs_service = build('storage', 'v1')

    colab_imagepath = '/content/' + define_imagepath(index)

    with open(colab_imagepath, 'wb') as f:
        request = gcs_service.objects().get_media(bucket=bucket_name, object=define_imagepath(index))
        media = MediaIoBaseDownload(f, request)

        done = False
        while not done:
            _, done = media.next_chunk()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Okay, let’s proceed to define our functions for pre-processing the image into numpy arrays.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Convert each image to standardized &lt;em&gt;numpy array&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;It is observed that the images in the dataset are of different formats (some are RGB while others are RGBA with an additional alpha channel) and different dimensions. As machine learning models usually require inputs of equal dimensions, pre-processing is required to convert each image in the dataset to a standardized format and resize the images into equal dimensions. Here’s the Python function for RGB conversion, resizing and numpy array conversion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from PIL import Image

def image_resize(index):
    '''Convert + resize image'''
    im = Image.open(define_imagepath(index))
    im = im.convert("RGB")
    im_resized = np.array(im.resize((64,64)))
    return im_resized
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
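&lt;p&gt;To sanity-check this standardization step without the actual dataset, the same convert-and-resize logic can be run on a synthetic RGBA image (this sketch is self-contained; only the 64x64 target size comes from the function above):&lt;/p&gt;

```python
import numpy as np
from PIL import Image

# Synthetic 100x80 RGBA image standing in for a downloaded product photo
im = Image.new("RGBA", (100, 80), (255, 0, 0, 128))

im = im.convert("RGB")                  # drop the alpha channel
im_resized = np.array(im.resize((64, 64)))

print(im_resized.shape)  # (64, 64, 3)
```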



&lt;p&gt;Seems easy to follow so far? Okay, let’s put all the above steps together and attempt to write the processing code for the entire image dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def image_proc(image, start, end):
    '''Download, convert and resize one image, with rudimentary progress output'''
    gcp_imageimport(image)
    # download_blob('shopee-cindyandfriends', image)

    im_resized = image_resize(image)

    if (image + 1) % 100 == 0 or (image == N - 1):
        sys.stdout.write('{0:.3f}% completed. '.format((image - start + 1)*100.0/(end - start))
                         + 'CPU Time elapsed: {} seconds. '.format(time.clock() - start_cpu_time)
                         + 'Wall Time elapsed: {} seconds.\n'.format(time.time() - start_wall_time))
        time.sleep(1)

    return im_resized

def arraypartition_calc(start, batch_size):
    '''Process one batch of images into a list of numpy arrays'''
    end = start + batch_size
    if end &amp;gt; N:
        end = N
    partition_list = [image_proc(image, start, end) for image in range(start, end)]
    return partition_list

###### Main Code for Preprocessing of Image Dataset ######
import sys
import time

N = len(fashion_train['image_path'])
start = 0
batch_size = 1000
partition = int(np.ceil(N/batch_size))
partition_count = 0

imagearray_list = [None] * partition

start_cpu_time = time.clock()
start_wall_time = time.time()

while start &amp;lt; N:
    imagearray_list[partition_count] = arraypartition_calc(start, batch_size)
    start += batch_size
    partition_count += 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For the code sample above, I attempted to process the ~300,000 images in the image dataset sequentially in batches of 1,000 and kept track of progress using a rudimentary output indicator within the image processing function. List comprehension was used to create a new list of numpy arrays for each processing batch.&lt;/p&gt;

&lt;p&gt;After more than 7 hours of leaving the code running on a CPU cluster overnight, barely around 1.1% (~3300) of the images were processed — and that’s just for one dataset. If we were to process almost 1 million images sequentially using this approach, it’ll take &lt;strong&gt;almost 3 months&lt;/strong&gt; to finish processing all the images — and that is practically infeasible! Besides switching to a GPU cluster, are there any other ways to speed up this batch processing code so that we could pre-process the images more efficiently?&lt;/p&gt;
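&lt;p&gt;The back-of-envelope arithmetic behind that estimate, using the rough numbers above:&lt;/p&gt;

```python
images_done = 3300        # ~1.1% of one dataset, processed in ~7 hours
hours_elapsed = 7
rate = images_done / hours_elapsed       # roughly 470 images per hour

total_images = 1_000_000                 # almost a million images overall
est_days = total_images / rate / 24      # roughly 88 days, i.e. almost 3 months
print(round(est_days))
```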

&lt;h3&gt;
  
  
  Speed Up with &lt;em&gt;numba and concurrent.futures&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;In this section, I introduce two Python modules that help speed up computationally intensive code such as loops — &lt;em&gt;numba&lt;/em&gt; and &lt;em&gt;concurrent.futures&lt;/em&gt;. I will also document the thought process behind my code implementation.&lt;/p&gt;

&lt;h4&gt;
  
  
  JIT compilation with numba
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Numba&lt;/strong&gt; is a Just-in-Time (JIT) compiler for Python that converts Python functions into machine code at runtime using the LLVM compiler library. It is sponsored by Anaconda Inc., with support by several organizations including Intel, Nvidia and AMD.&lt;/p&gt;

&lt;p&gt;Numba provides the ability to speed up computationally heavy code (such as for loops, which Python is notoriously slow at) to close to C/C++ speeds, simply by applying a decorator (a wrapper) to a Python function that does numerical computations. You don’t have to change your Python code at all to get the basic speedup that similar compilers such as Cython and PyPy could give you, which is great if you just want to speed up simple numerical code without the hassle of manually adding type definitions.&lt;/p&gt;

&lt;p&gt;Here’s the Python function for image conversion and resizing, wrapped with &lt;strong&gt;jit&lt;/strong&gt; to create an efficient, compiled version of the function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from numba import jit  # JIT compilation of numpy-based processing
from PIL import Image

@jit
def image_resize(index):
    '''Convert + resize image'''

    im = Image.open(define_imagepath(index))
    im = im.convert("RGB")
    im_resized = np.array(im.resize((64, 64)))

    return im_resized
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I tried using &lt;em&gt;njit&lt;/em&gt; (the accelerated &lt;strong&gt;no-Python mode&lt;/strong&gt; of JIT compilation) and &lt;em&gt;numba&lt;/em&gt; parallelization (&lt;em&gt;parallel=True&lt;/em&gt;) to achieve the best possible improvement in performance; however, compilation of the above function fails in no-Python mode. Hence, I had to fall back on the &lt;em&gt;jit&lt;/em&gt; decorator, which operates in both no-Python mode and &lt;strong&gt;object mode&lt;/strong&gt; (in which &lt;em&gt;numba&lt;/em&gt; compiles the loops it can into machine code, while running the rest of the code in the Python interpreter). The likely reason &lt;em&gt;njit&lt;/em&gt; fails is that &lt;em&gt;numba&lt;/em&gt; is unable to compile PIL code into machine code; nevertheless, &lt;em&gt;numba&lt;/em&gt; is able to compile the numpy code within the function, and a slight improvement in speed was observed with &lt;em&gt;jit&lt;/em&gt;.&lt;/p&gt;
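
&lt;p&gt;To illustrate the kind of function that no-Python mode &lt;em&gt;can&lt;/em&gt; handle, here is a minimal sketch (not from the competition code) of a pure-numpy loop compiled with &lt;em&gt;njit&lt;/em&gt;; the import fallback is only there so the snippet still runs where &lt;em&gt;numba&lt;/em&gt; is not installed:&lt;/p&gt;

```python
import numpy as np

try:
    from numba import njit  # compile in no-Python mode
except ImportError:
    def njit(func):  # fallback: run as plain Python if numba is absent
        return func

@njit
def normalize_batch(batch):
    # Only numpy arrays and scalars here -- exactly what no-Python mode
    # can compile, unlike calls into PIL, which force object mode.
    out = np.empty_like(batch)
    for i in range(batch.shape[0]):
        out[i] = batch[i] / 255.0
    return out

batch = np.arange(8.0).reshape(2, 4)
result = normalize_batch(batch)  # first call triggers compilation
```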

&lt;p&gt;Since &lt;em&gt;numba&lt;/em&gt; parallelization can only be used in conjunction with no-Python JIT, I needed to find another way to speed up my code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Parallel processing and concurrent.futures
&lt;/h4&gt;

&lt;p&gt;To understand how to process objects in parallel using Python, it is useful to think intuitively about the concept of parallel processing.&lt;/p&gt;

&lt;p&gt;Imagine that we have to perform the same task of toasting bread slices through a single-slice toaster and our job is to toast 100 slices of bread. If we say that each slice of bread takes 30 seconds to toast, then it takes 3000 seconds (= 50 minutes) for a single toaster to finish toasting all the bread slices. However, if we have 4 toasters, we would divide the pile of bread slices into 4 equal stacks and each toaster will be in charge of toasting one stack of bread slices. With this approach, it will take just 750 seconds (= 12.5 minutes) to finish the same job!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l7H3Yevj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AxdfCfJgR8ttCyXMnnFXhjQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l7H3Yevj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AxdfCfJgR8ttCyXMnnFXhjQ.png" alt=""&gt;&lt;/a&gt;Sequential vs Parallel Processing — illustrated using toasts&lt;/p&gt;

&lt;p&gt;The above logic of parallel processing can also be executed in Python for processing the ~300,000 images in each image dataset:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split the list of .jpg image files into &lt;em&gt;n&lt;/em&gt; smaller groups, where &lt;em&gt;n&lt;/em&gt; is a positive integer.&lt;/li&gt;
&lt;li&gt;Run &lt;em&gt;n&lt;/em&gt; separate instances of the Python interpreter / Colab notebook instances.&lt;/li&gt;
&lt;li&gt;Have each instance process one of the &lt;em&gt;n&lt;/em&gt; smaller groups of data.&lt;/li&gt;
&lt;li&gt;Combine the results from the &lt;em&gt;n&lt;/em&gt; processes to get the final list of results.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What is great about executing parallel processing tasks in Python is that there is a high-level API in the standard library for launching asynchronous parallel tasks: the &lt;strong&gt;concurrent.futures&lt;/strong&gt; module. All I needed to do was change my code slightly so that the function I would like to apply (i.e. the task to be performed on each image) is mapped to every image in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#N = len(beauty_train['image_path']) # for final partition
N = 35000
start = 0
batch_size = 1000
partition, mod = divmod(N, batch_size)

if mod:
    partition_index = [i * batch_size for i in range(start // batch_size, partition + 1)]
else:
    partition_index = [i * batch_size for i in range(start // batch_size, partition)]

import sys
import time
from concurrent.futures import ProcessPoolExecutor

start_cpu_time = time.clock()
start_wall_time = time.time()

with ProcessPoolExecutor() as executor:
    future = executor.map(arraypartition_calc, partition_index)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;From the above code, this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with ProcessPoolExecutor() as executor:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;boots up as many worker processes as there are CPU cores available on the connected instance (by default, &lt;strong&gt;ProcessPoolExecutor&lt;/strong&gt; spawns one worker per processor; in my case, the CPU cores of the Colab runtime made available during the session).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;executor.map()&lt;/strong&gt; takes as input:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The function that you would like to run, and&lt;/li&gt;
&lt;li&gt;A list (iterable) where each element of the list is a single input to that function;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;and returns an iterator that yields the results of the function being applied to every element of the list.&lt;/p&gt;

&lt;p&gt;Since Python 3.5, executor.map() also allows us to chop lists into chunks by specifying the (approximate) size of these chunks via the &lt;em&gt;chunksize&lt;/em&gt; argument. Since the number of images in each dataset is generally not a round number (i.e. not a multiple of 10s) and the order of the image arrays is important in this case (I have to map the processed images back to the entries in the corresponding CSV dataset), I manually partitioned the dataset to account for the final partition, which contains the tail-end remainder of the dataset.&lt;/p&gt;

&lt;p&gt;To store the pre-processed data into a numpy array for easy “pickling” in Python, I used the following line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;imgarray_np = np.array([x for x in future])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The end result was a numpy array containing lists of numpy arrays representing each pre-processed image, with the lists corresponding to the partitions which form the dataset.&lt;/p&gt;
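
&lt;p&gt;A numpy array assembled this way can then be persisted to disk and reloaded in one line each; a generic sketch, with a made-up filename:&lt;/p&gt;

```python
import numpy as np

imgarray_np = np.array([[1, 2], [3, 4]])  # stand-in for the processed image data
np.save("images.npy", imgarray_np)        # serialise to a .npy file on disk
loaded = np.load("images.npy")            # reload for the next stage
```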

&lt;p&gt;With these changes to my code and switching to the GPU cluster in Colab, I was able to pre-process 35,000 images within 3.6 hours. Coupled with running 4–5 Colab notebooks concurrently and segmenting the entire image dataset into subsets of the dataset, I was able to finish pre-processing (extracting, converting and resizing) almost 1 million images within &lt;strong&gt;20–24 hours&lt;/strong&gt;! Not too bad a speed-up on Colab, considering that we initially expected the image pre-processing to take impracticably long amounts of time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some Reflections and Takeaways
&lt;/h3&gt;

&lt;p&gt;It was our first time taking part in a data science competition, and definitely my first time working on real-life datasets of such a large scale compared with the clean-and-curated datasets I worked with for my academic assignments. The spotlight in a data science project tends to fall on state-of-the-art algorithms; however, I’ve also learnt through this experience that data processing can make or break a project if it becomes a bottleneck in terms of processing time.&lt;/p&gt;

&lt;p&gt;In hindsight, I could have created a virtual machine on Google Cloud Platform and run the code there, instead of relying solely on Colab and having to keep track of code execution in case I hit the 12-hour time limit for the GPU runtime.&lt;/p&gt;

&lt;p&gt;In conclusion, here are my takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The command-line interface is typically faster than a GUI; if you can, work on the command line as much as possible.&lt;/li&gt;
&lt;li&gt;Facing poor connection issues? Retry handling can save you the heartache of having to re-upload or re-download your files.&lt;/li&gt;
&lt;li&gt;Numba and concurrent.futures are useful when you are looking for a hassle-free way to speed up pre-processing of large datasets without manually adding type definitions or delving into the details of parallel processing.&lt;/li&gt;
&lt;/ol&gt;
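
&lt;p&gt;For the second takeaway, retry handling can be as simple as a small wrapper; a generic sketch, not tied to any particular upload or download API:&lt;/p&gt;

```python
import time

def with_retries(func, max_attempts=3, delay_seconds=1.0):
    """Call func(), retrying on any exception with a fixed delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds)

# usage (upload_file is hypothetical):
# with_retries(lambda: upload_file("images.npy"))
```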

&lt;p&gt;For reference, the code accompanying this write-up can be found &lt;a href="https://github.com/hweecat/numba-image-processing/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  References:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/storage/docs/gsutil"&gt;Official Google Cloud Storage documentation on gsutil&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://colab.research.google.com/notebooks/io.ipynb"&gt;Colab Notebook with recipes for external data handling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://numba.pydata.org/numba-doc/dev/user/5minguide.html"&gt;A ~5 minute guide to Numba by Anaconda Inc.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/library/concurrent.futures.html"&gt;Official Python documentation on concurrent.futures&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;




</description>
      <category>datascience</category>
      <category>programming</category>
      <category>parallelprocessing</category>
      <category>pythonprogramming</category>
    </item>
  </channel>
</rss>
