Zan Markan

Posted on May 21, 2020

What I learned in 6 months at an AI company

#ai #career #development #machinelearning

Back in January I joined a company called DataRobot as a Developer Advocate. My job is to inspire, enable, and represent developers using our machine learning (ML) products.
If you're not familiar with DataRobot, we provide artificial intelligence (AI) products and solutions to some of the world's largest and best-known enterprises and governmental organizations.

Before joining, I had no meaningful experience in ML, data science, or AI. The closest I got to ML was building an AI-powered chatbot for a product I used to work with. I just knew that ML was hard 😅.
Since then I have built a few AI powered apps of my own, so I feel like I can talk a bit more about it. This article describes some of the things I learned in my first half year in the AI and ML industry - going from complete newbie to someone who can leverage ML in apps.

AI is everywhere! 🗺

I used to dismiss ML as just another hype, in the same bucket as blockchain or virtual reality. I thought it was mostly confined to Big Tech, academia, and hobbyists, definitely not something you'd interact with on a daily basis.

This was mostly true... about a decade ago. Boy, was I wrong.

The reality is, many organizations you interact with on a daily basis use ML in some way in their business, without you even realising. Your bank might use ML for fraud detection, your supermarket might use it to forecast how much flour to stock, and a factory might use it to detect defects in its products before they get to you.

Where ML especially shines is with economies of scale. When you have huge amounts of data to run through (imagine the number of transactions a bank or a large retailer might process), even savings in the low single-digit percentages add up dramatically.
These are only a few examples, but ML is often a good fit in areas where you want to automate and augment tasks that would otherwise be hard or tedious for human workers... Now where have I heard this one before? 🤔

There's a reason why we call the current rise and rise of AI the Fourth Industrial Revolution.

Machine Learning is super hard... 😱

As an industry, ML is evolving rapidly. It's a moving target on all fronts, from algorithms, frameworks, and hardware to use cases ranging from academic projects like AlphaGo to self-driving cars to medical and health research. There's lots of hard science at work pushing the boundaries of what's possible. All of this progress is incredibly difficult to keep track of, which makes it hard to be sure you're really following best practices.

Beyond all the technical difficulties of the bleeding edge, you have some great operational challenges as well. For example, once you've trained a model, how do you ensure that it remains accurate over time? This is crucial, especially when the data you're using starts to change in ways you haven't foreseen and your models no longer reflect the real world.

Last but not least, there are also challenges with governance and AI bias. Most AIs are seen as black boxes where you can't know why a certain answer was given. If a laboratory wants to use AI to help produce a diagnosis, they can't risk relying on a black box when human lives are at stake.

...but using ML doesn't have to be a struggle 🦸‍♀️

Luckily, today's bleeding edge is more accessible than ever. We've solved many hard things.
Just as you don't need a degree in electronics to use a computer, or a networking degree to be a web developer, you no longer need a data science degree to build and deploy AI. The whole AI industry is moving towards democratization, enabling everyone to work with, and benefit from, AI.
There are awesome tools out there that let you build and leverage models, and you don't even need to touch frameworks like TensorFlow or PyTorch yourself.

Take Automated Machine Learning (AutoML) for example. AutoML is DataRobot's product and an associated technique that lets you train dozens of ML models using a variety of algorithms on a single dataset, pitting them against each other in order to determine which model is the best for your data. What used to take days can now be accomplished in mere minutes. New and refined algorithms are constantly being added to AutoML tools, making the models they produce better and better. All you need to do is press a button.
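To make the "pitting models against each other" idea concrete, here is a toy sketch in plain Python. It is not DataRobot's API or how any real AutoML product works internally: the synthetic dataset, the three tiny model classes, and the `leaderboard` helper are all made up for illustration. The only point is the shape of the process: fit several candidates on the same training data, score them on held-out data, and rank them.

```python
import random
import statistics


def make_dataset(n=200, seed=0):
    """Synthetic data (hypothetical): y = 3x + 2 plus a little noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3 * x + 2 + rng.gauss(0, 1)))
    return rows


class MeanModel:
    """Baseline: always predicts the mean of the training targets."""

    def fit(self, rows):
        self.mean = statistics.fmean(y for _, y in rows)
        return self

    def predict(self, x):
        return self.mean


class LinearModel:
    """Ordinary least squares with a single feature."""

    def fit(self, rows):
        xs = [x for x, _ in rows]
        ys = [y for _, y in rows]
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in rows)
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx
        return self

    def predict(self, x):
        return self.slope * x + self.intercept


class NearestNeighbourModel:
    """Predicts the target of the closest training point."""

    def fit(self, rows):
        self.rows = list(rows)
        return self

    def predict(self, x):
        return min(self.rows, key=lambda row: abs(row[0] - x))[1]


def rmse(model, rows):
    """Root mean squared error of a fitted model on the given rows."""
    return (sum((model.predict(x) - y) ** 2 for x, y in rows) / len(rows)) ** 0.5


def leaderboard(candidates, train, holdout):
    """Fit every candidate on train, score on holdout, best (lowest RMSE) first."""
    scores = [(rmse(cls().fit(train), holdout), cls.__name__) for cls in candidates]
    return sorted(scores)


data = make_dataset()
train, holdout = data[:150], data[150:]
board = leaderboard([MeanModel, LinearModel, NearestNeighbourModel], train, holdout)
for score, name in board:
    print(f"{name:<22} RMSE = {score:.3f}")
```

On this straight-line data the mean-only baseline should land at the bottom of the board. Real AutoML tools add a lot on top of this loop (feature engineering, hyperparameter search, cross-validation), but the leaderboard idea is the same.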

We have also made some serious progress towards overcoming the operational and governance challenges I mentioned earlier. We have tools available that let you monitor your ML models in the wild and that let you know if a model is at risk of becoming inaccurate due to changes in the underlying data (we call that data drift).
Our models are also not black boxes anymore, and we can explain with increasing confidence why a given model has returned a certain prediction. Unsurprisingly, this field is called Explainable AI.
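For a feel of what a data drift check can look like, here is a minimal sketch using the Population Stability Index (PSI), one common way to compare the distribution of a feature at training time with what the model sees later. This is my own illustration rather than how DataRobot's monitoring works; the `psi` function, the bin count, and the sample data are all assumptions.

```python
import math
import random


def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(42)
training = [rng.gauss(50, 10) for _ in range(5000)]  # what the model was trained on
same = [rng.gauss(50, 10) for _ in range(5000)]      # live data, same distribution
shifted = [rng.gauss(58, 10) for _ in range(5000)]   # live data after the world changed

print(f"no drift: {psi(training, same):.3f}")
print(f"drifted:  {psi(training, shifted):.3f}")
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but the right thresholds depend on your data, so take those numbers as a starting point only.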

Difficulties with Data Dealing 🐘

For all the problems we've solved already, one thing remains clear - we need data for machine learning, and obtaining that data for ML is difficult.

Datasets need to be representative of your problem domain. In very few cases are you able to apply a dataset from the web that works well for your own use-case. Face and object detection is a good example of where that works - that's why many companies are in the business of providing AI-powered image and speech recognition software and services.

Often you will also need to combine multiple data sources from various sections of the business into a single learning dataset (unless you're lucky and have a data warehouse at hand already).
For this, SQL rules the world. Sometimes you need to revisit your SQL knowledge because you have obviously forgotten everything apart from the most basic commands in the last 8-ish years. Speaking for a friend, of course. 😇
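As a refresher for that "friend", here is a small sketch of combining several sources into one training table. It uses Python's built-in `sqlite3` module with an in-memory database; the three tables, their columns, and the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables standing in for different parts of the business.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE support_tickets (ticket_id INTEGER PRIMARY KEY, customer_id INTEGER);

    INSERT INTO customers VALUES (1, 'UK'), (2, 'SI'), (3, 'US');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
    INSERT INTO support_tickets VALUES (100, 1), (101, 3), (102, 3);
""")

# One row per customer: aggregate each source first, then LEFT JOIN so that
# customers with no orders or no tickets are kept (with zeros) rather than dropped.
training_rows = conn.execute("""
    SELECT c.customer_id,
           c.country,
           COALESCE(o.order_count, 0)  AS order_count,
           COALESCE(o.total_spent, 0)  AS total_spent,
           COALESCE(t.ticket_count, 0) AS ticket_count
    FROM customers AS c
    LEFT JOIN (SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
               FROM orders GROUP BY customer_id) AS o
           ON o.customer_id = c.customer_id
    LEFT JOIN (SELECT customer_id, COUNT(*) AS ticket_count
               FROM support_tickets GROUP BY customer_id) AS t
           ON t.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

for row in training_rows:
    print(row)
```

Aggregating inside the subqueries before joining matters: joining `orders` and `support_tickets` directly to `customers` would multiply rows for any customer who has several of each, and inflate the counts.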

Finally, and this might be pretty shocking for developers, constructing a good training dataset sometimes also requires lots of human effort at some point in the process. Whether you're manually classifying a dataset you're preparing, or validating the output of a script - it's eyes before AI. 👀

The Pervasive Popularity of Python 🐍

Python is one of the most popular programming languages. I knew that, yet it still surprised me. This shouldn't be strange as it's the one language that caters to both data scientists and developers (as opposed to, say, Java, R, or Matlab). Python lets you both build apps and manipulate large sets of data with the same syntax.

One of Python's strongest suits is its readability. Python reads like English - it's probably the most English-like out of all programming languages. This makes it incredibly accessible for learning. It also has a mature set of libraries for a wide range of use-cases, from web development to scientific computing, giving it a broad appeal.
Yet, not everything about Python is perfect. The language is almost 30 years old, some 5 years senior to Java, JavaScript, and PHP, and its age shows. One of my pet peeves is the situation with dependency management. Sure, it comes with a package manager (pip) and a package index (PyPI), but the only way to ensure you're isolating your environment properly is by using an archaic combination of console commands to create and initialise a Virtualenv. I find Virtualenv so confusing that I have a gist I can refer to each time I deal with a new project 🙈.
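For what it's worth, here is a rough sketch of the environment-creation step done from Python itself. Note that it swaps in the standard library's `venv` module rather than the Virtualenv tool mentioned above, and it is not the contents of my gist; the `create_env` helper and the temporary project folder are made up for the example. On the command line, the equivalent is `python3 -m venv .venv`, followed by `source .venv/bin/activate` on macOS or Linux.

```python
import os
import tempfile
import venv


def create_env(env_dir):
    """Create an isolated environment, like `python3 -m venv <env_dir>`."""
    # with_pip=False keeps this quick; pass with_pip=True to bootstrap pip as well.
    venv.create(env_dir, with_pip=False)
    # Executables land in Scripts\ on Windows and bin/ elsewhere.
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    return os.path.join(env_dir, bin_dir)


project_dir = tempfile.mkdtemp()  # stand-in for your project folder
env_bin = create_env(os.path.join(project_dir, ".venv"))
print("Environment executables live in:", env_bin)
```

Anything you install with the environment's own pip stays inside `.venv`, which is the whole point: each project gets its own set of dependencies.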

Finally, the ways developers and data scientists might use Python are vastly different. Sure, the syntax is the same, but the ecosystems vary wildly, from the libraries and frameworks you might use for either development or data science, all the way to the tools used and best practices followed.

For example, data scientists love their Jupyter notebooks 📓. They are a sort of dynamic, executable documentation, where you write part markdown and part code, and then execute it alongside the documentation - rendering spreadsheets and graphs side by side! They are the preferred way to interact with Kaggle competitions and datasets. This concept is great for experimentation and therefore ideal for data scientists, yet not at all suited for building and maintaining production software - something I cherish more as a developer. What's more - that fancy dynamic approach generates some extremely hard-to-review JSON, if you ever decide to host it on GitHub 😬.
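To show why notebook diffs get noisy, here is a small sketch of what sits inside an `.ipynb` file and one common mitigation: clearing outputs and execution counts before committing. The notebook dictionary is a hand-written, minimal example in the version 4 notebook format, and `strip_outputs` is a hypothetical helper, not part of Jupyter.

```python
import json

# A minimal notebook: one markdown cell and one executed code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Sales analysis"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
    ],
}


def strip_outputs(nb):
    """Return a copy of the notebook with outputs and execution counts removed."""
    cleaned = json.loads(json.dumps(nb))  # cheap deep copy of JSON-compatible data
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


cleaned = strip_outputs(notebook)
print(json.dumps(cleaned, indent=1, sort_keys=True))
```

Every re-run changes the execution counts and the embedded outputs (which can include entire images encoded as text), so stripping them leaves only the markdown and code for a reviewer to look at.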

What about the Rise of the Robots? 🤖

You've probably encountered films, books, or games about the rise and revolt of intelligent machines - The Matrix, Terminator, 2001: A Space Odyssey, and many others. It's easy for media to paint a bleak picture.

Yet, from what I've seen so far, we have little to fear from the AI that we currently possess. Indeed, AI is developing at a rapid pace - some of the brightest people I know are working hard on pushing the boundaries.

At the same time, it's become more and more accessible for you and me to build - and benefit from - AI, and we have some amazing tools available that make it easy to approach.

So, if you'd like to see what the bleeding edge looks like and what it can do, we recently opened a trial of DataRobot for developers. If you'd like to try out the latest and greatest of ML tools, please reach out to me for access to the DataRobot trial at zan@datarobot.com, quoting dev.to in the subject.

Those are just a few thoughts from my perspective after 6 months in the AI industry. There has been a lot of learning and a lot of surprises, and I'm more enthusiastic than ever about the future and how dramatically AI will change our lives.
What about you? What are your thoughts and perspectives on AI and the tools that are available? Let me know in the comments!

Top comments (5)

Jonathan Wheat • May 22 '20

Great article, thanks for sharing that.

I'm in a similar situation, I started 9 months ago and am now heavily embedded in the NLP / chatbot world. True story, I actually despise using chatbots but I'm fascinated with the technology and advances in NLP and find building them is quite enjoyable. I like solving puzzles, so when I try something with my bot and it doesn't work quite how it should, I really get into the research involved in fixing the deficiency. The big problem here is now I can spot holes in other commercial bots I use, and can tell right off what their approach is, so it makes them even more annoying and I want to fix them :)

Definitely going to check out the things you all are doing at Data Robot, it's always nice to meet other people in the industry and see how they approach things.

Zan Markan • May 22 '20

That's a great question!

On one hand using a common language for multiple tasks is very appealing - look at the JS ecosystem with browser/Node (and now Deno), and other newer efforts like Kotlin for multiplatform.

In the short term, I think it will be super difficult to unseat Python for data science, given how prevalent it seems to be. I've heard people say good things about Julia as a superior computational language but not sure where it is going.
Developers will be quicker to change and are more likely to diverge into other areas.

In the long term, we're moving towards greater abstraction everywhere (both data science and dev) - and with this there'll be more emphasis on specific and specialised tools, glued together with scripting. (At least for the most common use-cases)
The programming language itself will become less relevant as it becomes just a glue technology.

Jonathan Wheat • May 22 '20

We're kind of split language I guess.

Developers here can help prototype things out in python if they have an understanding of the data and are working closely with a data scientist on a problem. I think all of our data scientists are also C/C++ guys, and rewrite and compile things down for production for speed and performance gains.

Do you use python on production? If so, do you see any hits on speed / performance?

John Doe • May 22 '20

Thanks for sharing!