DEV Community: Anna Zubova

SOLID Programming (Part 2): Open/Closed Principle

Anna Zubova — Fri, 27 Sep 2019 14:22:00 +0000

‘O’ in SOLID stands for Open/Closed Principle. It is closely related to the Single Responsibility Principle as code written with Single Responsibility in mind tends to comply with the Open/Closed Principle too.

Open/Closed Principle says that the code should be open for extension but closed for modification. In other words, the code should be organized in such a way that new modules can be added without modifying the existing code.

Let’s look at an example of a function that counts the occurrences of a word in a text read from a local file:

def count_word_occurrences(word, localfile):
    content = open(localfile, "r").read()
    counter = 0
    for e in content.split():
        if word.lower() == e.lower():
            counter += 1
    return counter

But what if we wanted to extract content from another source, for example a URL? With the function above, implementing the new functionality would mean modifying the existing function, which would go against the Open/Closed Principle.

The better way of organizing the code would be the following:

def read_localfile(file):
    '''Read file'''
    return open(file, "r").read()


def count_word_occurrences(word, content):
    '''Count number of word occurrences in a file'''

    counter = 0
    for e in content.split():
        if word.lower() == e.lower():
            counter += 1
    return counter

This way, if we wanted to add the ability to read from a URL, we would just add a function that extracts text from a URL:

from bs4 import BeautifulSoup
from urllib.request import urlopen

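def get_text_from_url(url):
    '''Extract the visible text as a string from the page at the given url'''
    page = urlopen(url).read()
    # name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(page, "html.parser")
    text = soup.get_text()
    return text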

Next we would just call the already existing function count_word_occurrences() with the content extracted using get_text_from_url() function:

content = get_text_from_url('https://en.wikipedia.org/wiki/Main_Page')
count_word_occurrences('and', content)

So we added new functionality to the code without needing to modify the existing code.
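One way to make this extension point explicit is to pass the reader in as an argument. The sketch below is only an illustration: count_in_source and read_string are hypothetical names that do not come from the repository, and the in-memory reader stands in for any new source.

```python
def read_localfile(file):
    '''Read a local text file and return its content'''
    with open(file, "r") as f:
        return f.read()


def count_word_occurrences(word, content):
    '''Count case-insensitive occurrences of a word in a text'''
    counter = 0
    for e in content.split():
        if word.lower() == e.lower():
            counter += 1
    return counter


def count_in_source(word, source, reader):
    '''Hypothetical helper: reader is any function that turns a source into text'''
    return count_word_occurrences(word, reader(source))


# A new source is supported by writing a new reader; nothing above changes.
# This stand-in reader simply treats the source as the text itself.
def read_string(text):
    return text


print(count_in_source("and", "salt and pepper AND more", read_string))  # 2
```

A call such as count_in_source('and', 'some_file.txt', read_localfile) would reuse the same counting code for a local file.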

Code in GitHub

You can find the code for this article in this GitHub repository:
https://github.com/AnnaLara/SOLID_blogposts

SOLID Programming (Part 1): Single Responsibility Principle

Anna Zubova — Sat, 21 Sep 2019 10:43:34 +0000

SOLID principles are among the most valuable in Software Engineering. They make it possible to write code that is clean, scalable and easy to extend. In this series of posts I will explain what each of the principles is and why it is important to apply it.

Some people believe that SOLID is only applicable to OOP, while in reality most of its principles can be used in any paradigm.

‘S’ in SOLID stands for single responsibility. A common mistake among novice programmers is to write complex functions and classes that do a lot of things. However, according to the Single Responsibility Principle, a module, a class or a function should do only one thing. In other words, it should have only one responsibility. This makes the code more robust and easier to debug, read and reuse.

Let’s look at this function that takes a word and a file path as parameters and returns the ratio of the number of the word's occurrences in the text to the total number of words:

def percentage_of_word(search, file):
    search = search.lower()
    content = open(file, "r").read()
    words = content.split()
    number_of_words = len(words)
    occurrences = 0
    for word in words:
        if word.lower() == search:
            occurrences += 1
    return occurrences/number_of_words

The code does many things in one function: it reads the file, counts the total number of words, counts the word's occurrences, and then returns the ratio.

If we want to follow the Single Responsibility Principle, we can substitute it with this code:

def read_localfile(file):
    '''Read file'''

    return open(file, "r").read()


def number_of_words(content):
    '''Count number of words in a file'''

    return len(content.split())


def count_word_occurrences(word, content):
    '''Count number of word occurrences in a file'''

    counter = 0
    for e in content.split():
        if word.lower() == e.lower():
            counter += 1
    return counter


def percentage_of_word(word, content):
    '''Calculate ratio of number of word occurrences to number of all words in a text'''

    total_words = number_of_words(content)
    word_occurrences = count_word_occurrences(word, content)
    return word_occurrences/total_words


def percentage_of_word_in_localfile(word, file):
    '''Calculate ratio of number of word occurrences to number
       of all words in a text file'''

    content = read_localfile(file)
    return percentage_of_word(word, content)

Now each function does only one thing. The first one reads the file. The second one calculates the total number of words. There is a function that calculates the number of occurrences of a word in a text. Another function calculates the ratio of the word's occurrences to the total number of words. And if we prefer to pass a file path instead of the text to get this ratio, there is a function specifically for that.

So what do we gain by restructuring the code this way?

The functions are easily reusable and can be mixed depending on the task, thus making the code easily extendable. For example, if we wanted to calculate the frequency of a word in a text stored in an AWS S3 bucket instead of a local file, we would just need to write a new function read_s3; the rest of the code would work without modification.
The code is DRY. No code is repeated, so if we need to make a modification in one of the functions, we would only need to do it in one place.
The code is clean, organized and very easy to read and understand.
We can write tests for each function separately, so it is easier to debug the code. You can check out tests for these functions here.
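To illustrate the last point, here is a minimal sketch of what per-function tests could look like. These test functions are hypothetical examples written for this explanation, not the tests from the repository, and the three functions are repeated only to keep the snippet self-contained.

```python
def number_of_words(content):
    '''Count number of words in a text'''
    return len(content.split())


def count_word_occurrences(word, content):
    '''Count number of word occurrences in a text'''
    counter = 0
    for e in content.split():
        if word.lower() == e.lower():
            counter += 1
    return counter


def percentage_of_word(word, content):
    '''Calculate ratio of word occurrences to number of all words'''
    return count_word_occurrences(word, content) / number_of_words(content)


# Each function gets its own small test, so a failure points at one place
def test_number_of_words():
    assert number_of_words("one two three") == 3


def test_count_word_occurrences():
    assert count_word_occurrences("the", "The cat and the dog") == 2


def test_percentage_of_word():
    assert percentage_of_word("the", "The cat and the dog") == 0.4


# Run the tests directly; a test runner such as pytest could collect them too
for test in (test_number_of_words, test_count_word_occurrences, test_percentage_of_word):
    test()
print("all tests passed")
```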

Code in GitHub

The code and tests from this article are available in GitHub:
https://github.com/AnnaLara/SOLID_blogposts

Introduction to Web Scraping with Selenium And Python

Anna Zubova — Thu, 12 Sep 2019 14:57:36 +0000

Web scraping is a fast, affordable and reliable way to get data when you need it. What is even better, the data is usually up-to-date. Now, bear in mind that when scraping a website, you might be violating its usage policy and can get kicked out of it. While scraping is mostly legal, there might be some exceptions depending on how you are going to use the data. So make sure you do your research before starting. For a simple personal or open-source project, however, you should be ok.

There are many ways to scrape data, but the one I prefer the most is to use Selenium. It is primarily used for testing as what it basically does is browser automation. In simple language, it creates a robot browser that does things for you: it can get HTML data, scroll, click buttons, etc. The great advantage is that we can tell specifically what HTML data we want so we can organize and store it appropriately.

Selenium is compatible with many programming languages, but this tutorial is going to focus on Python. Check this link to read Selenium (with Python) documentation.

First Steps

To install Selenium, run this simple command in your command line:

pip install selenium

If you are working in a Jupyter Notebook, you can do it right there instead of the command line. Just add an exclamation mark in the beginning:

!pip install selenium

After that all you need to do is import the necessary modules:

from selenium.webdriver import Chrome, Firefox

Other browsers are also supported but these two are the most commonly used.

Two simple commands are needed to get started:

browser = Firefox()
(or browser = Chrome() depending on your preference)

This creates an instance of a Firefox WebDriver that will allow us to access all its useful methods and attributes. We assigned it to the variable browser but you are free to choose your own name. A new blank window of the Firefox browser will be automatically opened.

Next get the URL that you want to scrape:

browser.get('https://en.wikipedia.org/wiki/Main_Page')

The get() method will open the URL in the browser and will wait until it is fully loaded.

Now you can get all the HTML information you want from this URL.

Locating Elements

There are different ways to locate elements with Selenium. Which one is best depends on the HTML structure of the page you are scraping. It can be tricky to figure out the most efficient way to access the element you want, so take your time and inspect the HTML carefully.

You can either access a single element with a chosen search parameter (you will get the first element that corresponds to your search parameter) or all the elements that match the search parameter. To get a single one use these methods:

find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()

To locate multiple elements just substitute element with elements in the above methods. You will get a list of WebElement objects located by this method.
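A note for readers on newer releases: the find_element_by_* helpers were deprecated in Selenium 4 and later removed, in favor of find_element(by, value) and find_elements(by, value). The sketch below maps the old method suffixes to the locator strategy strings, which are the values behind the constants of selenium.webdriver.common.by.By. The find helper and the LOCATOR_STRATEGIES name are assumptions made for illustration and are not part of Selenium; driver is assumed to be a WebDriver instance such as the browser variable.

```python
# Old method suffix -> locator strategy string used by Selenium's By class
LOCATOR_STRATEGIES = {
    "id": "id",                                # By.ID
    "name": "name",                            # By.NAME
    "xpath": "xpath",                          # By.XPATH
    "link_text": "link text",                  # By.LINK_TEXT
    "partial_link_text": "partial link text",  # By.PARTIAL_LINK_TEXT
    "tag_name": "tag name",                    # By.TAG_NAME
    "class_name": "class name",                # By.CLASS_NAME
    "css_selector": "css selector",            # By.CSS_SELECTOR
}


def find(driver, suffix, value, many=False):
    '''Hypothetical helper: find(browser, "id", "x") plays the role of
    browser.find_element_by_id("x")'''
    by = LOCATOR_STRATEGIES[suffix]
    if many:
        return driver.find_elements(by, value)  # list of all matches
    return driver.find_element(by, value)       # first match only
```

With such a helper, find(browser, 'css_selector', 'a', many=True) would return every link on the page, which is what find_elements_by_css_selector('a') does in the older API.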

Scraping Wikipedia

So let’s see how it works with the already mentioned Wikipedia page https://en.wikipedia.org/wiki/Main_Page

We have already created the browser variable containing an instance of the WebDriver and loaded the main Wikipedia page.

Let’s say we want to access the list of languages that this page can be translated to and store all the links to them.

After some inspection we can see that all elements have a similar structure: they are <li> elements of class 'interlanguage-link' that contain <a> with a URL and text:

<li class="interlanguage-link interwiki-bg">

   <a href="https://bg.wikipedia.org/wiki/" title="Bulgarian"
   lang="bg" hreflang="bg" class="interlanguage-link-target">

       Български

   </a>

</li>

So let’s first access all <li> elements. We can isolate them using class name:

languages = browser.find_elements_by_class_name('interlanguage-link')

languages is a list of WebElement objects. If we print the first element of it with:

print(languages[0])

It will print something like this:

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="73e70f48-851a-764d-8533-66f738d2bcf6", element="2a579b98-1a03-b04f-afe3-5d3da8aa9ec1")>

So to actually see what’s inside, we will need to write a for loop to access each element from the list, then access its <a> child element and get the <a>'s text and 'href' attribute.

To get the text we can use text attribute. To get the 'href' use get_attribute('attribute_name') method. So the code will look like this:

language_names = [language.find_element_by_css_selector('a').text 
                 for language in languages]

links = [language.find_element_by_css_selector('a').get_attribute('href') 
        for language in languages]

You can print out language_names and links to see that it worked.

Scrolling

Sometimes the whole page is not loaded from the start. In this case we can make the browser scroll down to get the HTML from the rest of the page. It is quite easy with the execute_script() method, which takes JavaScript code as a parameter:

scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
browser.execute_script(scroll_down)

scrollTo(x-coord, y-coord) is a JavaScript method that scrolls to the given coordinates. In our case we are using document.body.scrollHeight which returns the height of the element (in this case body).

As you might have guessed, you can make the browser execute all kinds of scripts with the execute_script() method. So if you have experience with JavaScript, you have a lot of room to experiment.

Clicking

Clicking is as easy as selecting an element and applying the click() method to it. In some cases, if you know the URLs that you need to go to, you can make the browser load those pages directly by URL. Again, see which is more efficient.

To give an example of the click() method, let’s click on the 'Contents' link from the menu on the left.

The HTML of this link is the following:

<li id="n-contents">
   <a href="/wiki/Portal:Contents" title="Guides to browsing Wikipedia">

        Contents

   </a>
</li>

We have to find the <li> element with the unique id 'n-contents' first and then access its <a> child:

content_element = browser.find_element_by_id('n-contents') \
                         .find_element_by_css_selector('a')

content_element.click()

You can see now that the browser loaded the 'Contents' page.

Downloading Images

Now what if we decide to download images from the page? For this we will use the urllib library and a UUID generator. We will first locate all images with the CSS selector 'img', then access each image's 'src' attribute, and then, creating a unique id for each image, download the images with the urlretrieve('url', 'folder/name.jpg') method. This method takes 2 parameters: the URL of the image and the name we want to give it, together with the folder we want to download it to (if applicable).

import os
from urllib.request import urlretrieve
from uuid import uuid4

# get the main page again
browser.get('https://en.wikipedia.org/wiki/Main_Page')

# locate image elements
images = browser.find_elements_by_css_selector('img')

# access src attribute of the images
src_list = [img.get_attribute('src') for img in images]

# make sure the target folder exists before downloading into it
os.makedirs("wiki_images", exist_ok=True)

for src in src_list:
    # create a unique name for each image by using UUID generator
    uuid = uuid4()

    # retrieve images using the URLs
    urlretrieve(src, f"wiki_images/{uuid}.jpg")
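The loop above saves every file with a .jpg extension, while a page may also serve other formats such as PNG or SVG. As a minimal sketch, the file name could keep whatever extension appears in the URL. The helper name unique_image_path and its default_ext parameter are hypothetical, and example.org is just a placeholder address.

```python
import os
from urllib.parse import urlparse
from uuid import uuid4


def unique_image_path(src, folder="wiki_images", default_ext=".jpg"):
    '''Hypothetical helper: build folder/<uuid><ext>, keeping the extension from the URL'''
    path = urlparse(src).path  # drop the query string and fragment
    ext = os.path.splitext(path)[1].lower() or default_ext
    return os.path.join(folder, f"{uuid4()}{ext}")


# The printed name is random, but it ends with the extension found in the URL
print(unique_image_path("https://example.org/static/logo.png?v=2"))
```

The result could then be passed as the second argument of urlretrieve() instead of the hard-coded f-string.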

Adding Waiting Time Between Actions

And lastly, sometimes it is necessary to introduce some waiting time between actions in the browser, for example when loading a lot of pages one after another. It can be done with the time module.

Let’s load 3 URLs from our links list and make the browser wait for 3 seconds after loading each page, using the time.sleep() function.

import time

urls = links[0:3]

for url in urls:
    browser.get(url)
    # stop for 3 seconds before going for the next page
    time.sleep(3)
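A small variation on the same idea is to pause only between pages and to randomize the pause a little. This is a sketch under stated assumptions, not part of the original code: visit_with_pauses is a hypothetical helper, visit is any function that loads a URL (for example browser.get), and the sleep parameter exists only so the pause can be swapped out.

```python
import random
import time


def visit_with_pauses(urls, visit, min_wait=2.0, max_wait=4.0, sleep=time.sleep):
    '''Hypothetical helper: call visit(url) for each URL, pausing a random time in between'''
    waits = []
    for i, url in enumerate(urls):
        if i > 0:
            wait = random.uniform(min_wait, max_wait)
            sleep(wait)  # pause only between pages, not before the first one
            waits.append(wait)
        visit(url)
    return waits


# Possible usage with the objects from this tutorial:
# visit_with_pauses(links[0:3], browser.get)
```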

Closing the WebDriver

And finally we can close our robot browser’s window with:

browser.close()

Don’t forget that browser is a variable that contains an instance of the Firefox WebDriver class (see the beginning of the tutorial).

Code in GitHub

The code from this article is available in GitHub:
https://github.com/AnnaLara/scraping_with_selenium_basics

Dockerize It!

Anna Zubova — Wed, 15 May 2019 04:40:11 +0000

Solving problems with code is a lot of fun. But when your creative process gets interrupted by a dependency issue where you have to dig into the terminal and check versions fearing that one wrong move can break what you have been building for weeks, it is definitely a frustrating setback.

On my path to learn Data Science, I have struggled a lot with creating the right environment for my project and making sure that all my packages are installed and are not creating any issues. But what happens when I need to run my application on a server where I don’t have my hand-crafted development environment? Luckily, Docker saves the day.

Docker is an open source platform for developers and sysadmins to develop, deploy, and run applications with containers (from Docker documentation). Here we are talking about Linux containers, or in other words, applications that let developers wrap a project into one package that contains all the libraries and dependencies along with the project code itself. A container can be compared with a virtual machine, but it is much more lightweight since it uses only the right amount of resources from the host machine, rather than creating a full operating system inside the host machine.

In this tutorial I will introduce you to some key Docker concepts and components to be able to start using them in your development process.

Installation

Download Docker: https://docs.docker.com/docker-for-mac/install/

There are 2 Docker editions available: Docker CE (Community Edition) and EE (Enterprise Edition). The documentation recommends CE for learning purposes and small team projects. Docker can be run on AWS or downloaded to run on your local machine. In this tutorial I am going to download Docker for MacOS. If you don’t have a Docker account, you will need to create one to be able to download the installation file.

Run Docker.dmg installation file and move the application to your Applications folder
Open Docker from your Applications folder. You will see an icon appearing in the upper right corner of your screen.
Sign in with your Docker ID

In Terminal type the following to see if it is working correctly:

docker run hello-world

The output should look like this:



latest: Pulling from library/hello-world
1b930d010525: Pull complete 
Digest: sha256:5f179596a7335398b805f036f7e8561b6f0e32cd30a32f5e19d17a3cda6cc33d
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

Docker Images and Dockerfiles

There are 2 key components of Dockerizing your project: Docker images and Dockerfiles.

An image can be described as a set of tools and instructions that we need to execute the project's code: system tools, libraries, dependencies, etc. Conveniently, this set of tools can be reused in a very easy way, so you would not need to define a new project environment, but can rather reuse an existing one.

A Dockerfile is a text file with a set of instructions to assemble an image. Each instruction in the Dockerfile is treated as a layer, which can be cached and reused later.

There is also a cloud service called Docker Hub where Docker users can share Docker images. This service is similar to what GitHub does for git.

Running an existing image

To test how Docker can run a Jupyter server, I followed this tutorial

I started with running this command in my terminal:

docker run ubuntu:16.04

This command will run an image called [ubuntu] with image version [16.04]. If Docker doesn’t find the image on a local machine, it will then look in the Docker Hub to download the image.

There are some extra options for the run command that can be found in the Docker Official Documentation.

As I mentioned before, it is very convenient to use existing images. Let’s run an image already created by Jupyter development community that has just Python and Jupyter installed: https://hub.docker.com/r/jupyter/minimal-notebook

docker run -p 8880:8888 jupyter/minimal-notebook

In the above line, -p <host_port>:<container_port> is the part that tells Docker to open connection between the Docker container and host machine, so interaction with the running container is possible.

jupyter/minimal-notebook is the image that we want to run.

After running this command, you will see this type of output in your terminal:



To access the notebook, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
        http://(2f0da4326d97 or [my IP address]):8888/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

From the last URL we have to take the token number, which I represented with 'x' since it is a secret key. With the port number 8880 that I used to run the image, I was able to access the notebook:
http://localhost:8880/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The next step is to make it possible to create and edit a Jupyter notebook while the container is running. There is a special option in the run command for that:

-v <host_directory>:<container_directory>

Host directory specifies where to store the notebook we are going to create, and container directory should be specified in the container documentation (if using Docker Hub container). In our case the container directory from the documentation was /home/jovyan

So the final code to run an image with the option to access it and create notebook is:

docker run -p 8880:8888 -v ~/docker_tests:/home/jovyan jupyter/minimal-notebook

Going to http://localhost:8880/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx in your browser will allow you to access your Jupyter server and create notebooks.
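To keep the order of the two halves of each option straight (host first, container second), it can help to see the command assembled piece by piece. This is only a sketch in Python: docker_run_command is a hypothetical helper that builds the argument list and does not run Docker itself.

```python
import os
import shlex


def docker_run_command(image, host_port, container_port, host_dir, container_dir):
    '''Hypothetical helper: assemble the docker run arguments described above'''
    return [
        "docker", "run",
        "-p", f"{host_port}:{container_port}",                    # host port first
        "-v", f"{os.path.expanduser(host_dir)}:{container_dir}",  # host directory first
        image,
    ]


cmd = docker_run_command("jupyter/minimal-notebook", 8880, 8888,
                         "~/docker_tests", "/home/jovyan")
print(shlex.join(cmd))  # the command as it would be typed in a terminal
```

The resulting list could be handed to subprocess.run(cmd) if the command needs to be launched from a script.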

Dockerizing your Python project

To Dockerize your project, you will first need to create a Dockerfile containing instructions on what image you are going to be using, what packages you need to install, and what your project’s directory is.

Imagine that you need to Dockerize your python script called add_numbers.py that would require installation of the scikit-learn library.

First, create a file called Dockerfile (don't add an extension to it!). Add these lines to the file in any text editor:



FROM python:3

ADD add_numbers.py /

RUN pip install -U scikit-learn

CMD [ "python", "add_numbers.py" ]

The FROM command specifies which image you are using as a base.

ADD tells Docker to copy a file, here our script, into the image. This command takes 2 parameters: source and destination:

ADD <source> <destination>

RUN executes a command while the image is being built; here it installs scikit-learn, so the library is available before the script runs.

CMD provides the default command that will be executed when a container is started from the image, unless it is overridden by another command.

My python file add_numbers.py has just the following code:



def add_numbers(a, b):
    return a + b

c = 3
d = 4

print(add_numbers(c, d))

After creating my script and my Dockerfile that has the instructions on how to create an image, I can run this command to build a new image based on newly created Dockerfile:

docker build -t add_numbers .

This command creates an image called ‘add_numbers’ based on a Dockerfile from the same directory we are running the command from.

Finally, we can run the image with this command:

docker run add_numbers

As my python script contained a function that would add two numbers and print out the result, and also had one function call to add 3 + 4, I got 7 as an output in my terminal.

To learn more, please refer to the official Docker documentation:
https://docs.docker.com/

Useful tutorials on starting working with Docker:
https://runnable.com/docker/python/dockerize-your-python-application
https://www.dataquest.io/blog/docker-data-science/

Deconstructing the Box and Whisker Plot

Anna Zubova — Tue, 30 Apr 2019 03:57:37 +0000

When trying to understand what a set of data looks like, there are plenty of options as to how to visualize it. It is important to pick the ones that serve the specific question we want to ask.

A histogram is usually the first choice when visualizing data and making a preliminary analysis of a distribution. A box and whisker plot (often referred to as box plot), however, can be used on its own or as an additional tool in data analysis.

A box plot uses 5 important descriptive statistics of a distribution: median value, lower quartile, upper quartile, and maximum and minimum values. It quickly gives us a sense of what the data looks like and makes it possible to compare different groups of data in one simple plot.

Here is an example of a basic box plot:

Limitations

It is important to understand that these 5 statistics cannot be the only measure of spread used to describe a distribution, being inferior to metrics like mean and standard deviation. However, in case the distribution is highly skewed or if there are outliers, it can be a very useful tool to check shape, spread and variability of data.

Box plots are great at showing whether the data is symmetric, but they will not show the type of symmetry. For example, two sets of data can look exactly the same as box plots, while one has significant variability of frequencies and the other is uniformly distributed. A box plot wouldn’t be the right tool to check for those features. For that reason, box plots are best used in combination with other visualization methods such as a histogram.

Visualizing outliers with box plots

One of the main purposes of the box plot is to quickly visualize outliers to see if it is necessary to remove them for further analysis. But to actually understand what is considered an outlier, let’s look at the following representation of the box plot and PDF of a normal distribution.

Source: https://commons.wikimedia.org/wiki/File:Boxplot_vs_PDF.svg

The whiskers actually represent values beyond which the data will be considered outliers. To determine the lower limit, interquartile range times 1.5 is subtracted from the 1st quartile value. To determine the upper limit, we should add 1.5 times the interquartile range to the 3rd quartile value.
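As a worked example of this rule, here is a minimal sketch using only the standard library. The function name outlier_limits and the sample data are made up for illustration; the quartiles are computed with linear interpolation (method="inclusive"). Note that Matplotlib draws each whisker at the most extreme data point that still lies inside these limits.

```python
from statistics import quantiles


def outlier_limits(data, k=1.5):
    '''Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR'''
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up sample with one extreme value
lower, upper = outlier_limits(data)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)  # -3.5 14.5 [100]
```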

In a normal distribution the box plot with whiskers represents 99.3% of all the data; that is, the outliers are only 0.7% of data.
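This figure can be checked with a few lines of Python. The sketch below applies the same 1.5 times IQR rule to the theoretical quartiles of a standard normal distribution, using statistics.NormalDist from the standard library.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# share of the distribution that falls between the two limits
coverage = z.cdf(upper) - z.cdf(lower)
print(round(upper, 3), round(coverage, 3))
```

The limits land at roughly 2.7 standard deviations on either side of the mean, and the share of data between them comes out at about 99.3%.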

Comparing data

Another very important utility of box plots is to compare data from different groups. Plotting several box plots next to each other gives us a perfect sense of whether the groups are similar.

The things we have to look for are:

If boxes overlap.
If there is no overlapping, it is quite clear that the groups are different.
If each median lies within the range of the box it is compared with. If not, it is likely that the groups are different.
Ranges of the boxes.
It is helpful to evaluate the comparative range of the boxes to see how much difference there is in the spread of the data.
Skewness.
As skewness is easily observed from the box plots, it can be useful to compare this parameter between two plots.

This preliminary visual analysis can help understand if two groups we are looking at are similar and if we need to apply some other techniques to further measure how different they are.

Let’s look at the data from the World Happiness Report from Kaggle. First, let’s look at the happiness scores from 2017 and 2016.

The groups are clearly very similar since the medians are located at the same level. The spread of the second plot is slightly wider than the first one.

However, if we compare health and freedom scores, the box plots will show more differences.

We can actually extract the values of the statistics calculated by the box plot. The object that is returned after creating a plot has all the values stored in it. To see what keys it has, we can run bp.keys(). For example, to extract the median, we can use the following code:

#get values for the medians
#bp is a box plot object

medians = []
for i in bp['medians']:
    medians.append(i.get_data()[1][0])

medians now is equal to [0.6060415506362921, 0.43745428323745705]

To get the lower and upper levels of the boxes, we can use the code below. It accesses the second element returned by get_data() for each line in bp['boxes'], which holds the y-axis values of the box outline, and then selects the first and fourth of those values: the lower and upper y-axis values of the box.

#get values for boxes' lower and upper values
boxes = []
for i in bp['boxes']:
    boxes.append(i.get_data()[1][0])
    boxes.append(i.get_data()[1][3])

boxes now contains the list [0.36986629664897896, 0.723007529973984, 0.3036771714687345, 0.5165613889694209]

So, the range of the first box (where 50% of the data is located) lies between 0.37 and approximately 0.72, with 0.61 as median value. The second box plot has a range of 0.30 to 0.52 with median value at 0.44.

Notched box plot

One interesting feature of box plots that is often overlooked is the notch parameter, which makes it possible to compare confidence intervals for the median value. By default, the confidence level is 95%. This option is especially useful for comparing groups measured on the same scale: we would look for visual overlapping of the notches, which indicates similarities or differences in median values.

Notched box plots can be used together with another parameter in Matplotlib’s box plot, bootstrap. By default it is set equal to None. If set to an integer, that would indicate how many times bootstrapping should be performed in order to determine confidence intervals.
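When bootstrap is left as None, the notch limits are, as far as I understand Matplotlib's behavior, based on a Gaussian approximation: the median plus or minus 1.57 times the IQR divided by the square root of the number of observations. The sketch below reproduces that formula with the standard library only. The function name notch_interval and the sample data are made up for illustration, and the quartiles use linear interpolation.

```python
from math import sqrt
from statistics import median, quantiles


def notch_interval(data):
    '''Approximate 95% confidence interval for the median:
    median +/- 1.57 * IQR / sqrt(n)'''
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / sqrt(len(data))
    m = median(data)
    return m - half_width, m + half_width


low, high = notch_interval([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(round(low, 2), round(high, 2))
```

Two groups whose intervals do not overlap are likely to have different medians, which is what the visual comparison of notches is checking.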

Other useful options

There are some other parameters that can be useful when creating a box plot with Matplotlib library.

sym: determines the look of flier points. Setting it equal to empty string will tell Matplotlib that we don’t want to show outliers.

whis: changes the reach of the whiskers. By default this parameter is equal to 1.5, so the lower and upper limits of the whiskers are Q1 - 1.5*IQR and Q3 + 1.5*IQR respectively. If whis is set to the string 'range', the whiskers reach to the minimum and maximum values (newer Matplotlib releases deprecate this string in favor of whis=(0, 100)).

vert: accepts a boolean value. By default it is set to True, but if set to False, the box plot will appear horizontally.

positions: accepts an array-like parameter. By default it is range(1, N+1) where N is the number of box plots. If set to (1,1), 2 box plots will overlap.

widths : sets the width of each box.

labels : sets labels for each box plot.

References and further reading

Visualizations from this blogpost can be found in my GitHub profile.

Why I Decided to Become a Data Scientist

Anna Zubova — Mon, 22 Apr 2019 15:20:37 +0000

I did not always dream of becoming a data scientist. It was rather a thought-through decision that I made after trying different things, learning what I can be good at, and getting to know myself better.

The first time I heard about Data Science was when I read a story about how Target was able to figure out a girl was pregnant before her parents knew, using data they collected from their customers. At that time I thought that those people had some kind of wicked superpowers. It wasn’t until I started to learn programming that I found out that there is this fascinating field called Data Science, which I have recently chosen as my new career trajectory. At the moment I am a Flatiron School student going through the Immersive Data Science Bootcamp. Here are the most important reasons why I am confident I have chosen the right path.

1. I am curious to find hidden information insights

The world has never been so confusing and is changing so rapidly. There is more information than people can possibly deal with, but with a lot of insights hidden in the data. I would love to be able to discover those insights and make use of a huge stream of information that is out there thanks to the new technologies. Besides, it makes me feel confident that I can find out the truth for myself instead of relying on other people telling me what is true and what is not. The skills I am learning help me make independent conclusions and decisions, so I feel empowered and incredibly optimistic about the future.

2. I love coding

It was a big surprise to me when I realized that I can code and that I actually love doing that. When I was growing up, everybody thought that programming was boring and one had to have a certain math-oriented and analytical personality profile to do that. But later I discovered that coding actually has a great deal of creativity in it. Creating a program can be compared to building a physical object like a house or a piece of furniture. You need to design it, think of what features it will have, and how it can be useful. This is my favorite part about coding: the tools it gives to create something incredibly useful out of just our imagination. In Data Science coding skills are essential to process data unbelievably fast, which lets you focus on the most interesting things like interpreting the results and drawing conclusions.

3. Data Science promotes social responsibility

I have a special interest in Open Data and using Data Science for the benefit of my community. It seems that there is so much for people to gain from twisting the facts and manipulating public opinion, so I almost feel it is my responsibility as a data scientist to contribute to providing society with reliable analyses of information.

4. Data Science is the career of the future

My choice of data science as my career path was also quite pragmatic. I feel that it will be one of the most useful skills that businesses will want to benefit from. Data is a very powerful asset for all kinds of business processes: sales, marketing, operations, human resources, etc. It is relatively easy to collect a lot of information these days, but being able to analyze it and use it to boost a company's growth is a relatively new idea that is going to spread very fast. The demand for these kinds of skills is growing, and I wanted to make sure I jump on this train too.

5. I can make use of my background and experience

To be a good data scientist, it is not enough to just know how to code and do statistics. One of the most important things is to be able to ask the right questions that you want to find answers to. So it is very helpful to have a broad understanding of different industries to be able to help different kinds of companies. I have a degree in Economics, Management, as well as work experience in Tourism and hopefully some life experience. I wanted all these skills to actually be useful for me. It all came together when I realized that in Data Science every experience counts, because I will need to make sense of the things that don’t seem to make sense, and for that every additional skill/experience is helpful.

BONUS: Data Science is fun!

Maybe it is not the most obvious reason one can think of, but the further I am into the Data Science Bootcamp, the more I realize this. There is always a way to creatively approach the problem, turn it around and come up with a completely new insight!

King County House Sales dataset: log-transformations and interpreting results

Anna Zubova — Fri, 19 Apr 2019 16:03:43 +0000

Log transformations turn out to be very useful when working with linear regression models. They help correct skewness in the data and make the distribution closer to normal.

Alexander Bailey and I worked on a dataset of housing prices in King County in order to develop a linear regression model that explains price variations.

We tried log transformations on several continuous variables, but we had the best results applying them to the set of distances from major central locations: downtown Seattle, downtown Bellevue, and South Lake Union. The distance to each of these locations was expressed in miles.

Let’s look at one of the variables: distance in miles from downtown Seattle. This is how the distribution looked before and after the log transformation:

We can see that the log transformation corrected the skewness and made the distribution more normal, which was beneficial for linear regression model performance. By taking the natural logarithm of the parameter values, we were able to improve our model’s metrics: R squared went from 0.627 to 0.686 and MAE from 134016.02 to 131342.48 in one of the model-fitting iterations.
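The effect on skewness can be reproduced on synthetic data. The sketch below does not use the King County dataset: the `distances` values are hypothetical log-normal draws standing in for a right-skewed distance variable, and `skewness` is a small helper written for this illustration with the standard library only.

```python
import math
import random


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


random.seed(0)

# Hypothetical right-skewed "distance in miles" values (not the real dataset).
distances = [random.lognormvariate(2.0, 0.6) for _ in range(5000)]

# Natural log of every value, as done for the distance variables in the model.
log_distances = [math.log(d) for d in distances]

skew_raw = skewness(distances)
skew_log = skewness(log_distances)
print(f"skewness before: {skew_raw:.2f}, after: {skew_log:.2f}")
```

A skewness close to zero after the transformation indicates a roughly symmetric distribution, which is what the histograms show visually.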

In case of the distance from downtown Bellevue, the distribution was also significantly improved by taking the natural logarithm of the variable values:

Change in distribution after log-transformation of the distance from South Lake Union parameter:

Log-transformed distance parameters in the linear regression model

Our final model for the housing prices focuses on making predictions for the prices up to approximately $1.1 million. Distance from downtown Seattle, downtown Bellevue, and South Lake Union are among the strongest predictors of the price.

Here are the coefficients for the distance values in our model:

Coefficient name	Value
'mi_from_downtown_log'	210431.83668516215
'mi_from_bellevue_log'	-109855.7636174455
'mi_from_south_lake_log'	-294919.7693134867

The general interpretation of a coefficient is that an increase of one unit in the 'mi_from_downtown_log' variable, assuming the other variables in the model remain constant, would result in an increase in the housing price of $210,431.84 on average. However, the difficulty is that the explanatory variable was log-transformed, so one unit of it is not one mile.
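To see why a one-unit increase is hard to read here, it helps to translate it back into miles. The short sketch below (the starting distance of 2 miles is an arbitrary, hypothetical value) shows that adding 1 to the natural log of a distance multiplies the distance itself by e, roughly 2.718, whatever the starting point.

```python
import math

# Arbitrary starting distance in miles, chosen only for illustration.
miles = 2.0

# Add one unit on the log scale, then convert back to miles.
log_miles = math.log(miles)
miles_after = math.exp(log_miles + 1)

# The ratio is e regardless of the starting distance.
factor = miles_after / miles
print(f"{miles} miles -> {miles_after:.2f} miles (factor {factor:.3f})")
```

So the coefficient describes the price change associated with multiplying the distance by about 2.7, not with adding one mile.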

Distance from downtown parameter

Let’s simulate how the increase of the distance in miles from downtown Seattle would affect housing prices, assuming that other variables are constant.

First, let’s generate an array of numbers from 1 to 100, representing an increase in distance from downtown Seattle. The next step is to take the natural log of these numbers, which results in an array ranging from 0 to about 4.6. We then calculate y values (that is, our predicted price) for each of the values of log(x) using the coefficient of 210,431.84 from our linear regression model.

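import numpy as np

# Distances from 1 to 100 miles and their natural logarithms
miles_range = np.arange(1, 101)
miles_range_log = np.log(miles_range)
# Price contribution of the distance term alone
y = 210431.83668516215 * miles_range_log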

Here is the plot representing the relationship between y and log(x) compared to relationship between y and x:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 1, constrained_layout=True)

# Top panel: price against the log-transformed distance
axs[0].plot(miles_range_log, y)
axs[0].set_title('Log transformed variable')
axs[0].set_xlabel('log(mi_from_downtown)')
axs[0].set_ylabel('price')

# Bottom panel: price against the raw distance in miles
axs[1].plot(miles_range, y)
axs[1].set_title('Raw data')
axs[1].set_xlabel('mi_from_downtown')
axs[1].set_ylabel('price');

We can see that log-transforming this variable converted the relationship from logarithmic to linear, which serves the purpose of improving the metrics of the linear regression.

Interestingly, in the case of the distance from downtown Seattle, the coefficient is positive, which is the opposite of what we had expected. However, the coefficients for the distance from downtown Bellevue and from South Lake Union are negative. The explanation might be that people prefer to be further away from downtown Seattle in favor of proximity to other points.

Distance from downtown Bellevue parameter

According to our model the coefficient is -109855.76.

Let’s build an array of y values:

y_bellevue = -109855.76 * miles_range_log

Plotting the log(x) and x vs y:

fig, axs = plt.subplots(2, 1, constrained_layout=True)

axs[0].plot(miles_range_log, y_bellevue)
axs[0].set_title('Log transformed variable')
axs[0].set_xlabel('log(mi_from_bellevue)')
axs[0].set_ylabel('price')

axs[1].plot(miles_range, y_bellevue)
axs[1].set_title('Raw data')
axs[1].set_xlabel('mi_from_bellevue')
axs[1].set_ylabel('price');

Here the relationship is more intuitive: the further away a house is from Bellevue, the lower its price.

Distance from South Lake Union parameter

The coefficient for this parameter is -294919.77, which is the largest in magnitude of the three distance parameters.

Calculating an array with y values:

y_s_lake_union = -294919.7693134867 * miles_range_log

Visualization of the x and log(x) vs y:

Interpreting the results

When log transformations are done on a dataset, it can be difficult to explain to the non-technical audience how the model works. Here is how it can be done using our model example.

Let’s look at the distance from South Lake Union, as it has the largest coefficient of the three distance parameters. The coefficient for this variable is -294919.76, which for a variable that was not log-transformed would mean that an increase in distance from South Lake Union of 1 mile is associated with a $294,919.76 decrease in price on average.

However, the values of the distance are log-transformed, so we can’t use this 1-unit-increase technique. In the case of a log-transformed explanatory variable, we talk about percentage changes in that variable instead.

To find out what the increase in target price would be, let’s look at the equation:

price(x1) - price(x0) = coef * log(x1) - coef*log(x0) = coef * (log(x1) - log(x0)) = coef * log(x1/x0)

So in our case, to find out what the change in price would be for a 10% increase in distance, we have to calculate the following:

change_in_price = -294,919.76 * log(1.1)

change_in_price = -294,919.76 * 0.09531 ≈ -28,108.85

Based on the above calculation, with every 10% increase in distance from South Lake Union (for example, from 1 mile to 1.1 miles, which is about 530 ft), the price will decrease by $28,108.85 on average.

If we want to know the effect of doubling the distance (for example, going from 1 mile to 2 miles), we would need to do the following calculation:

change_in_price = -294,919.76 * log(2) = -204,422.8
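The two calculations above can be wrapped in a small helper. This is only a sketch: `price_change` and `COEF_SOUTH_LAKE` are names introduced here for illustration, and the coefficient is the unrounded value from the coefficient table.

```python
import math

# Coefficient of the log-transformed distance from South Lake Union.
COEF_SOUTH_LAKE = -294919.7693134867


def price_change(coef, ratio):
    """Change in predicted price when the distance is multiplied by `ratio`.

    Follows price(x1) - price(x0) = coef * log(x1 / x0), other variables fixed.
    """
    return coef * math.log(ratio)


change_10_percent = price_change(COEF_SOUTH_LAKE, 1.10)  # distance up 10%
change_doubling = price_change(COEF_SOUTH_LAKE, 2.0)     # distance doubled

print(f"10% further away: {change_10_percent:,.2f}")
print(f"twice as far:     {change_doubling:,.2f}")
```

Because only the ratio x1/x0 enters the formula, moving from 1 mile to 2 miles and moving from 5 miles to 10 miles produce the same predicted change in price.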

As a side note, it is helpful to know that for small percentage changes in x (up to about 5%), the change in log(x) is almost equal to the percentage itself. For example, increasing x by 5% is almost equivalent to adding 0.05 to log(x), and increasing x by 2% is almost equivalent to adding 0.02 to log(x).
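A quick numerical check of this rule of thumb, written as a minimal sketch (`log_increase` is a helper name introduced here), shows how the approximation degrades as the percentage grows:

```python
import math


def log_increase(pct):
    """Exact increase in log(x) when x grows by the fraction `pct`."""
    return math.log(1 + pct)


for pct in (0.01, 0.02, 0.05, 0.10, 0.50):
    exact = log_increase(pct)
    error = pct - exact
    print(f"x up {pct:.0%}: log(x) rises by {exact:.4f}, "
          f"approximation {pct:.2f}, error {error:.4f}")
```

For changes of a few percent the error is negligible, while for a 50% increase the approximation is clearly off, so larger changes should go through the exact coef * log(x1/x0) formula.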