Label Insight

The Pitfalls of Starting a Data Science Journey

To someone unfamiliar with data science techniques, they can look like magic, but they aren't! It might not seem like it at first, but how you approach a data science problem really isn't very different from how you approach a traditional engineering problem. And it might not be far-fetched for you to start a data science journey at your own company.

But be warned: getting started with data science can be a slippery slope, with pitfalls waiting when you push a proof of concept through to a production-level model.

Starting out and treating data science differently

One of the first things we realized was that project scope should be a huge concern when building a solution. When we were starting out, we often tried to build data science solutions that could shoot the moon. We quickly learned that this is NOT the way to do things. We made life difficult for our data scientists, the systems were hard to debug, and truthfully, the solutions were not very effective.

The problem was that when we'd start planning a system for solving a complex problem, we'd fail to start at a modular level, the way we'd apply DRY or the Single Responsibility Principle to a traditional engineering problem.

The result was a vicious cycle. It would start with a small amount of engineering work to serve the model and do some data manipulation, followed by a large model that tried to solve the entire complex problem. Our solution would work great for a very specific input, but eventually we'd encounter a use case where the model wasn't effective, sending the data scientist back to the drawing board. After a couple of weeks, we'd have a new model that solved the new use case but was less effective on the original one.

To illustrate, let's say we're trying to predict the color of images, so we gather our data and start training a Convolutional Neural Network (CNN). But then we realize we can also predict shape! We add that prediction and retrain the model.

This one little change potentially causes two problems:

  1. If the shape prediction is struggling and we make a change to fix it, we retrain both the shape and color predictions, since they live in the same model. This can be the beginning of a game of whack-a-mole.
  2. We increased the complexity of our model by up to 50%. Every prediction we add to a model adds complexity to the solution, adversely impacting the speed and ease of working on it (see the sketch after this list).
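
To make the coupling concrete, here is a minimal sketch of the single-model approach, assuming Keras; the layer sizes and class counts are hypothetical:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# One shared backbone feeding two heads: both predictions depend on
# the same convolutional weights.
inputs = tf.keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(16, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

color = layers.Dense(8, activation="softmax", name="color")(x)  # hypothetical: 8 colors
shape = layers.Dense(5, activation="softmax", name="shape")(x)  # hypothetical: 5 shapes

multitask = Model(inputs, [color, shape])
multitask.compile(
    optimizer="adam",
    loss={"color": "categorical_crossentropy", "shape": "categorical_crossentropy"},
)
# Any fix aimed at the shape head retrains the shared layers too,
# so the color predictions can drift as a side effect.
```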

Starting again with a new approach

At this point, we tried applying some trusty old software design principles, ultimately breaking our solution into multiple models that could feed into each other (or not, if we chose), with each tailored to one piece of the problem. This meant a little more engineering and data science work upfront, but it was worth it.

Now, when we encounter a use case the model handles poorly, our data scientist can take a couple of days to fix it while the rest of the system stays in place. Updates like adding new use cases became simpler and less stressful, testing got easier, and pieces of the system became reusable.
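
Continuing the earlier sketch, here is what the same problem looks like split along responsibility lines (again assuming Keras, with the same hypothetical class counts):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_classifier(num_classes: int) -> Model:
    # One small CNN per responsibility, trained and redeployed on its own.
    inputs = tf.keras.Input(shape=(64, 64, 3))
    x = layers.Conv2D(16, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

color_model = build_classifier(8)  # hypothetical: 8 colors
shape_model = build_classifier(5)  # hypothetical: 5 shapes

def predict(image_batch):
    # The pipeline composes the models; retraining shape_model leaves
    # color_model, and everything already tested against it, untouched.
    return color_model.predict(image_batch), shape_model.predict(image_batch)
```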

The importance of having the right data

You need data!!! Our company has a lot of it: we build solutions around consumer packaged goods, and thus we have a lot of data on these products.

Having data and having the right data are very different—this is something we learned very painfully.

In our case, the data captured for our existing solutions didn't quite match the format or the specificity of what was needed for the data science solutions we wanted to build.

There are techniques to get around not having enough of the right data, and most of them involve some sort of synthetic data generation. It's important to note, however, that solutions built with synthetic data and solutions built with real-world data end up performing very differently, because of the assumptions inevitably baked into the synthetic data. For these solutions, the variety of the real world can wreak havoc on accuracy.
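
As a small illustration of where those assumptions creep in, here is a minimal synthetic-data sketch using Pillow, reusing the color-and-shape example from earlier; the palette and shape set are hypothetical:

```python
import random
from PIL import Image, ImageDraw

# Every choice here (the palette, the white background, the centered
# geometry) is an opinion baked into the training data.
COLORS = {"red": (220, 40, 40), "green": (40, 180, 60), "blue": (40, 80, 220)}
SHAPES = ["circle", "square"]

def make_sample(size: int = 64):
    """Draw one labeled synthetic image: a random shape in a random color."""
    color_name = random.choice(list(COLORS))
    shape = random.choice(SHAPES)
    img = Image.new("RGB", (size, size), (255, 255, 255))
    draw = ImageDraw.Draw(img)
    box = [size // 4, size // 4, 3 * size // 4, 3 * size // 4]
    if shape == "circle":
        draw.ellipse(box, fill=COLORS[color_name])
    else:
        draw.rectangle(box, fill=COLORS[color_name])
    return img, color_name, shape

# Real product photos have glare, clutter, and off-palette colors that a
# model trained only on images like these has never seen.
```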

Nothing is a “silver bullet”—not even data science

Going back to our first comment:

To someone unfamiliar with data science techniques, they can look like magic

Kafka, GraphQL, graph databases, serverless computing: when someone first sees them, they look shiny and new, and it's tempting to think they'll solve everything. This is never the case; no single solution is perfect. Data science, at its core, comes down to statistics and probability, and it's really good at solving problems that are probabilistic at their core as well.

Take something like Optical Character Recognition (OCR), a widely accepted technique for pulling text off of an image. The idea dates to the 1920s and has seen almost 100 years of iteration since, with the first real solutions arriving in the 1970s and more widespread cloud-based offerings in the late 1990s.

Even with all that iteration, there are still very common issues. Depending on its training data, OCR can struggle to tell “g” from “9” or to distinguish “1”, “i”, and “l”, and certain colors or fonts can trip it up too. All data science solutions have limitations, and those limitations have to be understood and accounted for when developing a solution! One common mitigation is sketched below.
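
One way to account for a known limitation like this is a context-aware cleanup pass on the OCR output. A minimal sketch, assuming we know a given field should contain only digits (the confusion map is illustrative, not exhaustive):

```python
import re

# Glyphs OCR commonly confuses with digits, mapped back to the digit
# they most often stand in for. Applied only where digits are expected.
CONFUSIONS = {"g": "9", "q": "9", "l": "1", "i": "1", "I": "1", "o": "0", "O": "0"}

def normalize_numeric_field(raw: str) -> str:
    """Repair a numeric field, or flag it for human review."""
    fixed = "".join(CONFUSIONS.get(ch, ch) for ch in raw)
    if not re.fullmatch(r"\d+(\.\d+)?", fixed):
        # Still not a clean number: don't guess, escalate instead.
        raise ValueError(f"unrecoverable OCR output: {raw!r}")
    return fixed

print(normalize_numeric_field("1g.5"))  # -> "19.5"
```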

Managing Expectations

It’s very easy to get excited about the potential of data science applications, especially outside of the engineering department. When someone who interacts with customers regularly, whether on a sales or customer success team, hears about a potential data science solution, their excitement can turn into pressure to deliver. And if that person isn't involved in planning, developing, training, and maintaining the solution, that pressure leads to rushed timelines and, ultimately, to poor implementations.

On the other end of the spectrum, people can be skeptical about what can really be done, especially if these problems have been difficult to solve in the past, so you may have to build hype and excitement where it's lacking.

We failed at managing expectations in the beginning but have learned how to do this better over time, and data science is now at the forefront of some of our initiatives and trusted by our entire organization. We accomplished this by keeping the scope of our solutions small and being deliberate and disciplined about metrics and sharing learnings.

Truthfully, data science builds hype for itself if the solutions work. As data science matures in an organization, it becomes more trusted, and the space for building new solutions grows.

Final Thoughts

Data science is awesome. It makes some difficult things very easy, it's fun to build with, and the results can be downright awe-inspiring. But there are plenty of pitfalls as you begin introducing data science into your organization. The pitfalls and tribulations will be different for every organization; patience is key, and visibility and communication will make the entire journey easier.
