<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yulia Zvyagelskaya</title>
    <description>The latest articles on DEV Community by Yulia Zvyagelskaya (@yzvyagelskaya).</description>
    <link>https://dev.to/yzvyagelskaya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F93933%2F073781bc-e97a-4286-92b4-7fd1d04cf6d3.jpg</url>
      <title>DEV Community: Yulia Zvyagelskaya</title>
      <link>https://dev.to/yzvyagelskaya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yzvyagelskaya"/>
    <language>en</language>
    <item>
      <title>Your First Job In The Cloud / 3</title>
      <dc:creator>Yulia Zvyagelskaya</dc:creator>
      <pubDate>Sat, 25 Aug 2018 16:21:55 +0000</pubDate>
      <link>https://dev.to/yzvyagelskaya/your-first-job-in-the-cloud--3-28ka</link>
      <guid>https://dev.to/yzvyagelskaya/your-first-job-in-the-cloud--3-28ka</guid>
      <description>&lt;h2&gt;
  
  
  Chapter 1. The First Job In The Cloud
&lt;/h2&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/yzvyagelskaya" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uoBFnZmP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/practicaldev/image/fetch/s--X3Lunxj9--/c_fill%2Cf_auto%2Cfl_progressive%2Ch_150%2Cq_auto%2Cw_150/https://dev-to-uploads.s3.amazonaws.com/uploads/user/profile_image/93933/073781bc-e97a-4286-92b4-7fd1d04cf6d3.jpg" alt="yzvyagelskaya image"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/yzvyagelskaya/your-first-job-in-the-cloud-4lcj" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Your First Job In The Cloud&lt;/h2&gt;
      &lt;h3&gt;Yulia Zvyagelskaya ・ Aug 18 '18 ・ 5 min read&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cloudcomputing&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Chapter 2. Getting The Job Done
&lt;/h2&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/yzvyagelskaya" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uoBFnZmP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/practicaldev/image/fetch/s--X3Lunxj9--/c_fill%2Cf_auto%2Cfl_progressive%2Ch_150%2Cq_auto%2Cw_150/https://dev-to-uploads.s3.amazonaws.com/uploads/user/profile_image/93933/073781bc-e97a-4286-92b4-7fd1d04cf6d3.jpg" alt="yzvyagelskaya image"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/yzvyagelskaya/your-first-job-in-the-cloud-2-4ob8" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Your First Job In The Cloud /2&lt;/h2&gt;
      &lt;h3&gt;Yulia Zvyagelskaya ・ Aug 19 '18 ・ 5 min read&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cloudcomputing&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Chapter 3. Let My Cloud Work
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;And I wrote my happy songs &lt;br&gt;
Every child may joy to hear&lt;br&gt;
— William Blake&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Splitting The Job Into Pieces
&lt;/h3&gt;

&lt;p&gt;In the previous chapter we prepared an environment to launch our lovely job. The only thing left before we let the ship sail freely is to split the whole job into logical pieces. Why?&lt;/p&gt;

&lt;p&gt;Well, things might go wrong. One might want to tune something on the last step. Cloud computing saves our laptop uptime, but it costs money. In most cases, while playing with our model parameters we do not need to start from scratch, redoing the whole bunch of preparation steps, like cleaning up text, splitting the input into train and test datasets, and the like.&lt;/p&gt;

&lt;p&gt;I would suggest separating the following steps for text processing (for images or anything else the steps might differ, but the whole approach would still work.)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;drop all the unnecessary data and leave only what we are to process&lt;/li&gt;
&lt;li&gt;clean the input with regular expressions&lt;/li&gt;
&lt;li&gt;split the input into train and test datasets&lt;/li&gt;
&lt;li&gt;train the model on the train dataset&lt;/li&gt;
&lt;li&gt;check the model on the test dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  One By One
&lt;/h3&gt;

&lt;p&gt;My advice would be to split the script into several classes, each playing its own role. The files would also contain a &lt;code&gt;__main__&lt;/code&gt; section allowing them to be executed as standalone scripts, accepting the parameters relevant to that particular step.&lt;/p&gt;
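&lt;p&gt;As a sketch of that layout, one standalone step might look like the snippet below. The class name &lt;code&gt;Janitor&lt;/code&gt; mirrors the final snippet of this post; the arguments, defaults and file names are made up for illustration.&lt;/p&gt;

```python
# A minimal sketch of one pipeline step as a standalone script.
# The class name Janitor mirrors the final snippet of this post;
# the arguments and defaults are illustrative, not the real ones.
import argparse


class Janitor:
    def __init__(self, input_file='raw.pkl', output_file='clean.pkl'):
        self.input_file = input_file
        self.output_file = output_file

    def cleanup(self, remote_dir):
        # The real cleanup work is elided; this sketch only shows
        # where the step would read from and write to.
        return '%s/%s' % (remote_dir, self.output_file)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Cleanup step')
    parser.add_argument('--remote-dir', default='gs://my-bucket/data')
    parser.add_argument('--input-file', default='raw.pkl')
    args, _ = parser.parse_known_args()
    Janitor(input_file=args.input_file).cleanup(args.remote_dir)
```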

&lt;h4&gt;
  
  
  Smaller Is Better
&lt;/h4&gt;

&lt;p&gt;The first step should be done &lt;em&gt;locally&lt;/em&gt; to decrease the size of the file we are going to upload. That saves us both the upload time &lt;em&gt;and&lt;/em&gt; the space taken by the input data in our bucket.&lt;/p&gt;

&lt;p&gt;For that, one should simply load the CSV, get rid of all the unnecessary columns and save it back. That is it. The result should be copied to the bucket, and all subsequent steps will be done in the cloud.&lt;/p&gt;
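&lt;p&gt;A sketch of that local shrinking step, using only the standard &lt;code&gt;csv&lt;/code&gt; module; the column names are made up, a real dataset would have its own.&lt;/p&gt;

```python
# Keep only the columns we are going to process and drop the rest.
# Column names are made up for this example.
import csv
import io

KEEP = ['text', 'label']

def shrink_csv(src, dst, keep):
    # Stream the CSV row by row, writing out only the kept columns.
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=keep)
    writer.writeheader()
    for row in reader:
        writer.writerow({k: row[k] for k in keep})

# In real code src/dst would be files on disk; StringIO keeps the
# sketch self-contained.
src = io.StringIO('id,text,label,junk\n1,hello,pos,a\n2,bye,neg,b\n')
dst = io.StringIO()
shrink_csv(src, dst, KEEP)
```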

&lt;h4&gt;
  
  
  Janitor To The Rescue
&lt;/h4&gt;

&lt;p&gt;In most articles, books and talks on the subject, people recommend getting rid of all the punctuation and expanding contractions (“I’d’ve been” ⇒ “I would have been”) to help the trainer recognize identical words. Most if not all propose obsolete regular expressions to do that.&lt;/p&gt;

&lt;p&gt;Texts on the Internet differ from those produced by typewriters. They all use &lt;em&gt;Unicode&lt;/em&gt; now. Nowadays the &lt;a href="https://en.wikipedia.org/wiki/Apostrophe"&gt;apostrophe&lt;/a&gt; might be both “'” typed by the lazy post author &lt;em&gt;and&lt;/em&gt; “’” if the creator of the text has a sense of typographical beauty. The same goes for quotes &lt;em&gt;' " ’ ”&lt;/em&gt;, dashes &lt;em&gt;- -- – —&lt;/em&gt;, numbers &lt;em&gt;1 2 ¹ ² ½&lt;/em&gt; and even ʟᴇᴛᴛᴇʀs 𝒾𝓃 𝓉𝒽𝑒 𝖆𝖗𝖙𝖎𝖈𝖑𝖊. And legacy regular expressions would not recognize all that zoo.&lt;/p&gt;

&lt;p&gt;Luckily enough, modern regular expressions have matchers that match exactly what we need semantically. Those are called &lt;a href="https://en.wikipedia.org/wiki/Regular_expression#Character_classes"&gt;&lt;em&gt;character classes&lt;/em&gt;&lt;/a&gt; and might be used to match e.g. all the punctuation, or all the letters &lt;em&gt;and&lt;/em&gt; digits.&lt;/p&gt;

&lt;p&gt;For text processing I would suggest preserving only alphanumerics (letters and digits) and the punctuation. The latter should be unified (all double quotes converted to typewriter double quotes, and the same for single quotes and dashes.)&lt;/p&gt;
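&lt;p&gt;A sketch of such unification and cleanup with Python’s built-in &lt;code&gt;re&lt;/code&gt; module, whose &lt;code&gt;\w&lt;/code&gt; class is already Unicode-aware; the translation table below is a small illustrative subset, not a complete list.&lt;/p&gt;

```python
import re

# Map typographic quotes and dashes to typewriter ones; a real table
# would cover many more characters than this illustrative subset.
UNIFY = str.maketrans({
    '\u2018': "'", '\u2019': "'",   # curly single quotes
    '\u201c': '"', '\u201d': '"',   # curly double quotes
    '\u2013': '-', '\u2014': '-',   # en and em dashes
})

def janitize(text):
    text = text.translate(UNIFY)
    # \w and \s are Unicode-aware in Python 3, so letters and digits
    # of any script survive; everything except the listed punctuation
    # is collapsed to a space.
    return re.sub(r"[^\w\s'\"!?.,;:-]+", ' ', text)

cleaned = janitize('\u201cDon\u2019t\u201d \u2014 he said\u2026')
```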

&lt;p&gt;After this is done, a list of shortened forms might be used to expand all the &lt;em&gt;“I’m”&lt;/em&gt;s to &lt;em&gt;“I am”&lt;/em&gt;s, etc. That is basically it. &lt;em&gt;Dump&lt;/em&gt; the result of this step to the bucket. Unless the input data changes, this step might be skipped in subsequent executions of the model training process. I personally use &lt;em&gt;pickles&lt;/em&gt; for that, but the format does not matter at all.&lt;/p&gt;
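&lt;p&gt;A sketch of such an expansion pass plus the dump; the contraction table is a tiny illustrative subset, and &lt;em&gt;pickle&lt;/em&gt; here is just one possible format.&lt;/p&gt;

```python
import pickle
import re

# A tiny illustrative subset of the shortened-forms table.
CONTRACTIONS = {"i'm": 'i am', "don't": 'do not', "i'd've": 'i would have'}

# Longest forms first, so "i'd've" wins over a shorter prefix match.
PATTERN = re.compile(
    '|'.join(re.escape(k)
             for k in sorted(CONTRACTIONS, key=len, reverse=True)),
    re.IGNORECASE,
)

def expand(text):
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand("I'm sure I'd've known")
# Dump the result so later runs can skip this step entirely.
blob = pickle.dumps(expanded)
```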

&lt;h4&gt;
  
  
  Sibling Sets: Train And Test
&lt;/h4&gt;

&lt;p&gt;The common approach would be to split the dataset into two parts, one to train the model and another to test it. Split &lt;em&gt;and&lt;/em&gt; save the result to the bucket. This process usually takes a few parameters, like &lt;em&gt;padding&lt;/em&gt; and &lt;em&gt;max-words&lt;/em&gt;, which you would rarely tweak.&lt;/p&gt;

&lt;p&gt;Save the result.&lt;/p&gt;
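&lt;p&gt;The split itself is a one-liner with &lt;code&gt;scikit-learn&lt;/code&gt;’s &lt;code&gt;train_test_split&lt;/code&gt;; a dependency-free sketch of the same idea:&lt;/p&gt;

```python
import random

def split_dataset(rows, test_ratio=0.2, seed=42):
    # Shuffle deterministically, then slice; sklearn's
    # train_test_split offers the same service with more options.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

train, test = split_dataset(range(100))
```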

&lt;h4&gt;
  
  
  Train The Model
&lt;/h4&gt;

&lt;p&gt;Train the model. Make sure you log everything &lt;em&gt;to the logger&lt;/em&gt;, not to standard output, to preserve the logs of the process. I usually use &lt;code&gt;logging.info&lt;/code&gt; for debugging messages and &lt;code&gt;logging.warning&lt;/code&gt; for important ones I want to be reported. The Google ML Log Viewer allows filtering log messages by severity, so later one might glance at the &lt;em&gt;important&lt;/em&gt; stuff only.&lt;/p&gt;
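&lt;p&gt;A sketch of that logging discipline; the in-memory sink only keeps the example self-contained, while in the actual job you would keep the default handlers and let the cloud collect the output.&lt;/p&gt;

```python
import io
import logging

# Filter at WARNING level to mimic looking at "important" messages
# only; the StringIO sink just keeps this sketch self-contained.
sink = io.StringIO()
logger = logging.getLogger('trainer')
logger.addHandler(logging.StreamHandler(sink))
logger.setLevel(logging.WARNING)

logger.info('epoch 3: loss 0.0271')      # debugging detail, filtered out
logger.warning('test accuracy: 0.91')    # important, reported
```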

&lt;p&gt;Save the model.&lt;/p&gt;

&lt;h4&gt;
  
  
  Check It
&lt;/h4&gt;

&lt;p&gt;I do the check in the cloud as well. Later on, one might download the model and use the test set saved in step 3 to test it locally, but I am fine with examining logs in the cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Summing Up
&lt;/h3&gt;

&lt;p&gt;If you are like me and put the steps into classes, the resulting &lt;code&gt;__main__&lt;/code&gt; of the package that will be run in the cloud would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# parse all the arguments
&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'janitize'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;Janitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;cleanup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remote_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'split'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;Splitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keep_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'keep_n'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remote_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'train'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Modeller&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remote_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'check'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;Modeller&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;check_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remote_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that is all I wanted to share for now. Happy clouding!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Your First Job In The Cloud /2</title>
      <dc:creator>Yulia Zvyagelskaya</dc:creator>
      <pubDate>Sun, 19 Aug 2018 09:00:02 +0000</pubDate>
      <link>https://dev.to/yzvyagelskaya/your-first-job-in-the-cloud-2-4ob8</link>
      <guid>https://dev.to/yzvyagelskaya/your-first-job-in-the-cloud-2-4ob8</guid>
      <description>&lt;h2&gt;
  
  
  Chapter 1. The First Job In The Cloud
&lt;/h2&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/yzvyagelskaya" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F93933%2F073781bc-e97a-4286-92b4-7fd1d04cf6d3.jpg" alt="yzvyagelskaya"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/yzvyagelskaya/your-first-job-in-the-cloud-4lcj" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Your First Job In The Cloud&lt;/h2&gt;
      &lt;h3&gt;Yulia Zvyagelskaya ・ Aug 18 '18&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cloudcomputing&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Chapter 2. Getting The Job Done
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Hear the voice of the Bard!&lt;br&gt;
— William Blake&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Setting Up Packages
&lt;/h3&gt;

&lt;p&gt;At the end of &lt;a href="https://dev.to/yzvyagelskaya/your-first-job-in-the-cloud-4lcj"&gt;Chapter 1&lt;/a&gt; we were able to run a job in the cloud. It completed successfully (if not, please blame Google, not me.) We’ve seen this fascinating green light icon next to it. Shall we now try the real job‽&lt;/p&gt;

&lt;p&gt;Not yet. We are not mature enough to enter a cage with lions. Let’s do it step by step.&lt;/p&gt;

&lt;p&gt;I assume you have the code that actually trains a model on hand. Unlike other tutorials, this one won’t provide model-training code for you. We are talking about deployment to &lt;em&gt;ML Engine&lt;/em&gt;. Sorry for that.&lt;/p&gt;

&lt;p&gt;This code probably has a bundle of &lt;code&gt;import&lt;/code&gt;s. It probably requires &lt;code&gt;Tensorflow&lt;/code&gt;, &lt;code&gt;Keras&lt;/code&gt; and some other packages that are not included in the Python standard library. If all you imported was &lt;code&gt;Tensorflow&lt;/code&gt; and you wrote the whole rest yourself, you barely need this tutorial. Such a brave person should wade through everything on their own.&lt;/p&gt;

&lt;p&gt;So, yeah. Packages. Copy all the &lt;code&gt;import&lt;/code&gt; lines from all your files to the top of the job scaffold we created in &lt;em&gt;Chapter 1&lt;/em&gt; (the “Our First Job” section; I struggled to find out how to link to a subtitle inside a post here.)&lt;/p&gt;

&lt;p&gt;Submit a job to the cloud. Go check the logs in approximately 8 minutes to see that the task has failed. If it has not, I envy you: you seem to use only the standard packages included in the &lt;em&gt;ML Engine&lt;/em&gt; setup by default. I was not as lucky. Google includes a very limited set of packages (probably to reduce the container launch time.)&lt;/p&gt;

&lt;p&gt;So, we need to declare the additional packages we need explicitly. Unfortunately, AFAICT, there is no way to retrieve a diff between what &lt;em&gt;ML Engine&lt;/em&gt; provides out of the box and what we need, so we are obliged to add them one by one: submit the job, check the logs, cry, yell at the sky, repeat. To add packages, open the file &lt;code&gt;setup.py&lt;/code&gt; we created in &lt;em&gt;Chapter 1&lt;/em&gt; and add the following lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt; from setuptools import find_packages
 from setuptools import setup
&lt;span class="err"&gt;
&lt;/span&gt; setup(
     name='test1',
     version='0.1',
&lt;span class="gi"&gt;+    install_requires=['scikit-learn&amp;gt;=0.18','annoy&amp;gt;=1.12','nltk&amp;gt;=3.2'],
+    packages=find_packages(),
+    include_package_data=True,
&lt;/span&gt;     description='My First Job'
 )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Keep adding entries to &lt;code&gt;install_requires&lt;/code&gt; until &lt;em&gt;ML Engine&lt;/em&gt; is satisfied and the job goes back to processing successfully. It might happen that some package you need was designed for a Python newer than &lt;code&gt;3.5&lt;/code&gt; (in my case it was a package using the fancy new string formatting &lt;code&gt;f'Hi, {name}'&lt;/code&gt;, introduced in &lt;code&gt;3.6&lt;/code&gt;.) I found no better solution than to download this package locally, backport it to &lt;code&gt;3.5&lt;/code&gt; and re-package it myself. Build the package and put it both into your bucket (I created a subfolder &lt;code&gt;packages&lt;/code&gt; there for that purpose) &lt;em&gt;and&lt;/em&gt; into the same folder as the &lt;code&gt;setup.py&lt;/code&gt; script. Now we have to tell &lt;em&gt;ML Engine&lt;/em&gt; to use our version of it. Update your shell script with:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;     --module-name test1.test1 \
     --package-path ./test1 \
&lt;span class="gi"&gt;+    --packages my_package-0.1.2.tar.gz \
&lt;/span&gt;     --config=test1/test1.yaml \
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The same should be done with all the packages of your own that you need to include in the distribution to run the job. The &lt;code&gt;install_requires&lt;/code&gt; parameter in the call to &lt;code&gt;setup&lt;/code&gt; in &lt;code&gt;setup.py&lt;/code&gt; has to be updated accordingly. Also, update your cloud config with the &lt;code&gt;packageUris&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;trainingInput&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUSTOM&lt;/span&gt;
  &lt;span class="na"&gt;masterType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;complex_model_m_gpu&lt;/span&gt;
  &lt;span class="na"&gt;pythonVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.5"&lt;/span&gt;
  &lt;span class="na"&gt;runtimeVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.9"&lt;/span&gt;
  &lt;span class="na"&gt;packageUris&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gs://foo-bar-baz-your-bucket-name/packages/my_package-0.1.2.tar.gz'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Submit the job and check that it is now green.&lt;/p&gt;
&lt;h3&gt;
  
  
  Doing Their Job
&lt;/h3&gt;

&lt;p&gt;Not all 3rd-party packages are ready to be used in the cloud. It’s the very same machine as our own laptop, only virtual, so everything should be the same, right? Well, yes and no. Everything is the same, but this VM is shut down as soon as the job finishes. Meaning if we need some results besides logs (like a trained model file, you know,) we have to &lt;em&gt;store&lt;/em&gt; them somewhere outside of the container where the job was run, otherwise they’ll die together with the container.&lt;/p&gt;

&lt;p&gt;Python standard &lt;code&gt;open(file_name, mode)&lt;/code&gt; does not work with buckets (&lt;code&gt;gs://...../file_name&lt;/code&gt;). One needs to &lt;code&gt;from tensorflow.python.lib.io import file_io&lt;/code&gt; and change all calls to &lt;code&gt;open(file_name, mode)&lt;/code&gt; to &lt;code&gt;file_io.FileIO(file_name, mode=mode)&lt;/code&gt; (note the named mode parameter.) The interface of the opened handle is the same.&lt;/p&gt;

&lt;p&gt;Some 3rd-party packages have an explicit &lt;code&gt;save&lt;/code&gt; method accepting a file name instead of a &lt;code&gt;FileIO&lt;/code&gt;, and the only possibility would be to &lt;em&gt;copy&lt;/em&gt; the file to the bucket afterwards. I did not manage to use the &lt;em&gt;Google SDK&lt;/em&gt; for that, since it is absent in the containers, so here is a tiny snippet copying the file from the container to the bucket.&lt;/p&gt;

&lt;p&gt;The code below is quite inefficient, because it loads the whole model at once and then dumps it to the bucket, but it worked for me for relatively small models:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;file_io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FileIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;i_f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;file_io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FileIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;o_f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;o_f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i_f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The mode must be set to binary for both reading and writing. When the file is relatively big, it makes sense to read and write it in chunks to decrease memory consumption, but chunking binary IO reads is a bit out of scope for this tutorial. I recommend wrapping this into a function and calling that function as needed.&lt;/p&gt;
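&lt;p&gt;For reference, a chunked version of that copy might look like the sketch below; plain &lt;code&gt;BytesIO&lt;/code&gt; objects stand in for &lt;code&gt;file_io.FileIO&lt;/code&gt; handles, which expose the same read/write interface.&lt;/p&gt;

```python
import io

CHUNK = 1024 * 1024  # 1 MiB per read keeps memory consumption flat

def copy_in_chunks(i_f, o_f, chunk_size=CHUNK):
    # Read until read() returns an empty bytes object (EOF).
    while True:
        chunk = i_f.read(chunk_size)
        if not chunk:
            break
        o_f.write(chunk)

# BytesIO stands in for file_io.FileIO handles here, which expose
# the same read/write interface.
src = io.BytesIO(b'x' * 3000)
dst = io.BytesIO()
copy_in_chunks(src, dst, chunk_size=1024)
```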

&lt;p&gt;Now that we have all the needed packages installed and the function to store files in the bucket on hand, there is nothing preventing us from trying our model in the cloud with the whole load of data. Right?&lt;/p&gt;

&lt;p&gt;Nope. I strongly suggest splitting the job into smaller steps, like “cleaning up,” “splitting into training and testing sets,” “preprocessing,” “training,” etc., and &lt;em&gt;dumping all the intermediate results to the bucket&lt;/em&gt;. Since you needed the training process to run in the cloud, it is presumably very time-consuming. Having the intermediate results on hand might save you a lot of time in the future. For instance, adjusting the parameters given to the training process itself does not require all the preparation steps to be re-run. In such a case we might just start with &lt;em&gt;Step N&lt;/em&gt;, using the previously dumped output of the preceding step as the input.&lt;/p&gt;

&lt;p&gt;I will show how to do that in the next chapter. Happy clouding!&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/yzvyagelskaya" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F93933%2F073781bc-e97a-4286-92b4-7fd1d04cf6d3.jpg" alt="yzvyagelskaya"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/yzvyagelskaya/your-first-job-in-the-cloud--3-28ka" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Your First Job In The Cloud / 3&lt;/h2&gt;
      &lt;h3&gt;Yulia Zvyagelskaya ・ Aug 25 '18&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cloudcomputing&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;



</description>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Your First Job In The Cloud</title>
      <dc:creator>Yulia Zvyagelskaya</dc:creator>
      <pubDate>Sat, 18 Aug 2018 16:34:51 +0000</pubDate>
      <link>https://dev.to/yzvyagelskaya/your-first-job-in-the-cloud-4lcj</link>
      <guid>https://dev.to/yzvyagelskaya/your-first-job-in-the-cloud-4lcj</guid>
      <description>&lt;h2&gt;
  
  
  Chapter 1. The First Job In The Cloud
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Intro
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;On a cloud I saw a child,&lt;br&gt;
And he laughing said to me: “...&lt;br&gt;
— William Blake&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nowadays one should live in a cave on an uninhabited island lost in the Arctic Ocean to have never heard of “Artificial Intelligence,” “Machine Learning,” “NLP” and that family of buzzwords. Having a master’s in Data Science, I feel a bit less excited about tomorrow’s AI revolution. That does not mean &lt;em&gt;DS&lt;/em&gt; is boring or undue; rather, it requires a lot of effort to be put in, and I really like that feeling of doing stuff on the bleeding edge.&lt;/p&gt;

&lt;p&gt;As a relatively new industry, ML has not set up its &lt;em&gt;processes&lt;/em&gt; yet. I have heard the opposite about Google and Facebook, but in small businesses we are still considered nerds, the role developers used to play twenty years ago. It’s great to see more and more people getting into ML, whether excited by Google slides at the latest conference, or just curious whether neural nets can indeed distinguish between cats and dogs in a photo.&lt;/p&gt;

&lt;p&gt;Big corps prepare and &lt;em&gt;share&lt;/em&gt; (thank God it’s the XXI century) huge datasets, trained models and everything a junior data scientist might use to play in the sandbox. After we have made sure that models trained on Google or Facebook data somehow work and even might predict things (in some cases under some very eccentric circumstances, but it’s still &lt;em&gt;so&lt;/em&gt; thrilling,) we usually want to train our own model ourselves. It takes hours on our own laptop, even though the dataset is limited to tweets from our forty-two friends for the last year alone. The results usually look &lt;em&gt;promising&lt;/em&gt;, but unsatisfactory. There is no way the laptop could process &lt;em&gt;the whole tweet feed for the last decade&lt;/em&gt; without exploding the SSD and blowing up.&lt;/p&gt;

&lt;p&gt;That is when we get to the magic words: cloud computing, or whatever you call it. Let Google’s servers explode instead of our lovely laptops, right? Right. Our next job will be in the cloud. Pun intended.&lt;/p&gt;

&lt;p&gt;There are not that many resources explaining how one might stop staring at the laptop monitor, waiting for the model to be built, and start reaping the benefits of living in 2018 AD. There are Google ML, Amazon SageMaker and Azure Machine Learning Studio, but the documentation everywhere was written by developers for gray-bearded geeks. There is an enormous threshold to executing the very first job in the cloud. This writing is supposed to bridge that gap.&lt;/p&gt;

&lt;p&gt;It is not rocket science, and there is nothing really complex about it. Just a few steps to take and several things to take into consideration. It’s a breathtaking journey, and once it is done, the subsequent trips will seem a cakewalk. Let’s go.&lt;/p&gt;

&lt;p&gt;All the below is written for Google ML Engine, but it might be applied to any cloud computing system almost as is. I will try not to go deeply into details, concentrating more on &lt;em&gt;whats&lt;/em&gt; rather than on &lt;em&gt;hows&lt;/em&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Before We Start
&lt;/h3&gt;

&lt;p&gt;First of all, I want to reference the post that helped me a lot in moving my job into the cloud. The &lt;a href="https://liufuyang.github.io/2017/04/02/just-another-tensorflow-beginner-guide-4.html" rel="noopener noreferrer"&gt;Tensorflow beginner guide&lt;/a&gt; from Fuyang Liu’s blog is almost perfect, save that it does not cover pitfalls or suggest shortcuts where they could have made sense.&lt;/p&gt;

&lt;p&gt;Google also has &lt;a href="https://cloud.google.com/ml-engine/docs/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; on ML Engine; I wish I were smart enough to use it &lt;em&gt;as a guide&lt;/em&gt;. We still need it, though, to quickly look up this and that.&lt;/p&gt;

&lt;p&gt;First we need to &lt;a href="https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup" rel="noopener noreferrer"&gt;set up our cloud environment&lt;/a&gt;. I refer to the Google guide here because things tend to change over time, and I hope they will keep this info up to date.&lt;/p&gt;

&lt;p&gt;After we have the account enabled for ML, we should set up our local environment. I strongly advise using Linux; macOS is more or less robust, while Windows will make you cry. Since we are about to run jobs in the cloud, I assume you have Python installed and configured. What we need to install is the &lt;a href="https://cloud.google.com/sdk/docs/" rel="noopener noreferrer"&gt;Google SDK&lt;/a&gt;. It’s pretty straightforward: download it from the linked page and install it.&lt;/p&gt;

&lt;p&gt;Now we need to set up our credentials. &lt;a href="https://cloud.google.com/sdk/docs/initializing" rel="noopener noreferrer"&gt;&lt;code&gt;gcloud init&lt;/code&gt;&lt;/a&gt; should do.&lt;/p&gt;

&lt;p&gt;Let’s check that it works as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;gcloud ml-engine models list
Listed 0 items.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Wow. We are all set.&lt;/p&gt;
&lt;h3&gt;
  
  
  Our First Job
&lt;/h3&gt;

&lt;p&gt;This is the important part. Don’t try to upload and run your latest fancy project: it will fail and you will get frustrated. Let’s enter the cold water slowly. Let’s make your first job complete successfully, showing a fascinating green icon when you check your jobs’ status.&lt;/p&gt;

&lt;p&gt;The cloud expects a Python package to be uploaded, along with the name of the main module to execute. So let’s go with a pretty simple Python package: a module named &lt;code&gt;test1.py&lt;/code&gt; residing in a directory named &lt;code&gt;test1&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# coding: utf-8
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--job-dir&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GCS job directory (required by GoogleML)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--arg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Test argument&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;arguments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;__dict__&lt;/span&gt;
  &lt;span class="n"&gt;job_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;job_dir&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;arg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hey, ML Engine, you are not scary!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
  &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Argument received: {}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We use &lt;code&gt;logging&lt;/code&gt; because, unlike plain &lt;code&gt;stdout&lt;/code&gt; output, log records are available through the web interface.&lt;/p&gt;

&lt;p&gt;You’ll also need a cloud configuration file locally. It can be placed anywhere; I prefer to have one config file per project. Put &lt;code&gt;test1.yaml&lt;/code&gt; in the same directory:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;trainingInput&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUSTOM&lt;/span&gt;
  &lt;span class="c1"&gt;# 1 GPU&lt;/span&gt;
  &lt;span class="na"&gt;masterType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard_gpu&lt;/span&gt;
  &lt;span class="c1"&gt;# 4 GPUs&lt;/span&gt;
  &lt;span class="c1"&gt;# complex_model_m_gpu&lt;/span&gt;
  &lt;span class="na"&gt;runtimeVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.9"&lt;/span&gt;
  &lt;span class="na"&gt;pythonVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I am not sure who made that decision, but the default &lt;code&gt;python&lt;/code&gt; version &lt;em&gt;for ML Engine&lt;/em&gt; is &lt;code&gt;2.7&lt;/code&gt;, which is why the last two lines are mandatory. &lt;/p&gt;
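&lt;p&gt;By the way, the &lt;code&gt;CUSTOM&lt;/code&gt; tier with a GPU is overkill for a hello-world job. A cheaper sketch using a predefined tier (assuming a single worker with no accelerator is enough for you) would be:&lt;/p&gt;

```yaml
trainingInput:
  # BASIC: a single worker, no accelerator; the cheapest predefined tier
  scaleTier: BASIC
  runtimeVersion: "1.9"
  pythonVersion: "3.5"
```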

&lt;p&gt;You will also need to create a &lt;code&gt;setup.py&lt;/code&gt; file containing the description of our project; it will be processed by the Google Cloud SDK.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;setuptools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;find_packages&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;setuptools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;setup&lt;/span&gt;

&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;My First Job&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
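&lt;p&gt;One thing worth knowing in advance: if your trainer imports packages that the runtime image does not ship, they are declared in this same file via &lt;code&gt;install_requires&lt;/code&gt;, and ML Engine installs them on the workers before the job starts. A hypothetical sketch of the arguments (the dependency names are invented examples; a real &lt;code&gt;setup.py&lt;/code&gt; would simply end with &lt;code&gt;setup(**setup_args)&lt;/code&gt;):&lt;/p&gt;

```python
from setuptools import find_packages

# Arguments a dependency-aware setup.py would pass to setuptools.setup().
# The install_requires entries are invented examples, not needed for test1.
setup_args = dict(
    name='test1',
    version='0.1',
    description='My First Job',
    packages=find_packages(),
    install_requires=['h5py', 'nltk'],
)

print(setup_args['install_requires'])
```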


&lt;p&gt;Well, that is it. Let’s try it (this file, &lt;code&gt;test1.sh&lt;/code&gt;, should live on the same level as the package folder):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BUCKET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;foo-bar-baz-your-bucket-name
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JOB_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"test1_&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JOB_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;/&lt;span class="nv"&gt;$JOB_NAME&lt;/span&gt;

gcloud ml-engine &lt;span class="nb"&gt;jobs &lt;/span&gt;submit training &lt;span class="nv"&gt;$JOB_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--staging-bucket&lt;/span&gt; gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--job-dir&lt;/span&gt; gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;/&lt;span class="nv"&gt;$JOB_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--runtime-version&lt;/span&gt; 1.9 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--module-name&lt;/span&gt; test1.test1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--package-path&lt;/span&gt; ./test1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;test1/test1.yaml &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--arg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
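&lt;p&gt;To recap the moving parts, here is a sketch that recreates the expected layout in a scratch directory. Note the empty &lt;code&gt;__init__.py&lt;/code&gt;, which I believe &lt;code&gt;find_packages&lt;/code&gt; needs to recognize &lt;code&gt;test1&lt;/code&gt; as a package; the file names must match &lt;code&gt;--module-name&lt;/code&gt;, &lt;code&gt;--package-path&lt;/code&gt; and &lt;code&gt;--config&lt;/code&gt; in the submission script:&lt;/p&gt;

```shell
# Recreate the expected project layout under a scratch directory.
mkdir -p demo/test1
touch demo/setup.py demo/test1.sh                # packaging info and submit script
touch demo/test1/__init__.py                     # marks test1 as a Python package
touch demo/test1/test1.py demo/test1/test1.yaml  # entry point and cloud config
find demo -type f | sort
```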


&lt;p&gt;&lt;strong&gt;NB!&lt;/strong&gt; You have to specify your own bucket name, &lt;em&gt;and&lt;/em&gt; you might need to change the region as well.&lt;/p&gt;

&lt;p&gt;I strongly advise creating a shell script to run (schedule/queue) a job from the very beginning. It’s much easier to deal with when it comes to modifications.&lt;/p&gt;

&lt;p&gt;There are three ‘subsections’ of arguments there: the first four are generic and remain unchanged from job to job; the second contains job-specific settings; and the third (everything after &lt;code&gt;--&lt;/code&gt;) contains the arguments that will be passed to the &lt;code&gt;__main__&lt;/code&gt; block of your package.&lt;/p&gt;
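&lt;p&gt;The bare &lt;code&gt;--&lt;/code&gt; tells &lt;code&gt;gcloud&lt;/code&gt; to stop interpreting flags and forward the rest verbatim, so inside the job they show up as ordinary &lt;code&gt;argv&lt;/code&gt; entries. A tiny sketch of the receiving side, with &lt;code&gt;python3 -c&lt;/code&gt; standing in for the trainer module:&lt;/p&gt;

```shell
# Flags forwarded past the separator arrive untouched in sys.argv:
python3 -c 'import sys; print(sys.argv[1:])' --arg=42
# prints: ['--arg=42']
```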

&lt;p&gt;Go try it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./test1.sh

Job &lt;span class="o"&gt;[&lt;/span&gt;test1_20180818_085812] submitted successfully.
Your job is still active. You may view the status of your job with the &lt;span class="nb"&gt;command&lt;/span&gt;

  &lt;span class="nv"&gt;$ &lt;/span&gt;gcloud ml-engine &lt;span class="nb"&gt;jobs &lt;/span&gt;describe test1_20180818_085812

or &lt;span class="k"&gt;continue &lt;/span&gt;streaming the logs with the &lt;span class="nb"&gt;command&lt;/span&gt;

  &lt;span class="nv"&gt;$ &lt;/span&gt;gcloud ml-engine &lt;span class="nb"&gt;jobs &lt;/span&gt;stream-logs test1_20180818_085812
jobId: test1_20180818_085812
state: QUEUED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now you can execute &lt;code&gt;gcloud ml-engine jobs describe ...&lt;/code&gt; as suggested. It will spit out another portion of text. Copy the last link and paste it into your browser’s address bar. You should see...&lt;/p&gt;

&lt;p&gt;What you should see there, I will describe in the next chapter. Happy clouding!&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/yzvyagelskaya" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F93933%2F073781bc-e97a-4286-92b4-7fd1d04cf6d3.jpg" alt="yzvyagelskaya"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/yzvyagelskaya/your-first-job-in-the-cloud-2-4ob8" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Your First Job In The Cloud /2&lt;/h2&gt;
      &lt;h3&gt;Yulia Zvyagelskaya ・ Aug 19 '18&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cloudcomputing&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;



&lt;div class="ltag__link"&gt;
  &lt;a href="/yzvyagelskaya" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F93933%2F073781bc-e97a-4286-92b4-7fd1d04cf6d3.jpg" alt="yzvyagelskaya"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/yzvyagelskaya/your-first-job-in-the-cloud--3-28ka" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Your First Job In The Cloud / 3&lt;/h2&gt;
      &lt;h3&gt;Yulia Zvyagelskaya ・ Aug 25 '18&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cloudcomputing&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;



</description>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>cloudcomputing</category>
    </item>
  </channel>
</rss>
