Julia Silge is a data scientist at Stack Overflow. She studied physics and astronomy, finishing her PhD in 2005. She worked in academia (teaching and doing research) and ed tech before moving into data science and discovering R. Her O'Reilly book Text Mining with R, written with coauthor David Robinson, is now available.
The data I get to work with at Stack Overflow is amazing! Stack Overflow is where people who code come to learn, share knowledge with each other, and build their careers, so our data is all about that. It's data about who developers are, how they build communities together, how technologies interact and are changing, and how technologies are impacting the world we live in.
We use this rich data in a lot of different ways at Stack Overflow. My team works on machine learning to match developers with relevant jobs, text mining to understand what makes a developer more likely to respond to a company, and of course understanding the technology ecosystems themselves.
One specific issue I've worked on understanding is the geographic distribution of our users worldwide and comparing that to the geographic distribution of the jobs available on Stack Overflow. This project did not involve particularly fancy AI or anything, but it did involve integrating diverse, messy datasets and building an internally facing tool that stakeholders from executives to salespeople can engage with and use to make decisions.
My own work as a data scientist is influenced by data journalists, and communication and storytelling in general. This doesn't mean I am less technical, but it does mean that I care a lot about what someone takes away from the analysis I did, or the model I built.
Andrew Flowers gave a talk at rstudio::conf last year about how to find and tell stories, and it resonated so much with how I approach my work. I pay attention to what's going on in data journalism, and I find that when something is really compelling, whether it's The Pudding's work on hip hop or bestselling books or the Upshot’s "you draw it" graphs, I consider why and how those principles can be integrated into my own work. I communicate with people like software developers, product managers, and salespeople in my job, and I want them to both understand and be delighted with the work that I do.
I wrote lots of code during my research years, using programming as a tool for scientific computing and analyzing real-world complex data. After grad school, I worked in research as a postdoc and then in education as a professor. After that, I took several years away from the paid workforce when my children were tiny and was home full-time with them.
In 2012, I transitioned back into the workforce with a job at an ed tech start-up where I developed interactive content for higher ed STEM courses, but through a series of circumstances (including a layoff) I decided the time was right for me professionally and personally to move toward a more technical, analytical role. I hadn’t been coding full-time since my postdoc days, so I jumped into a whole slew of opportunities for learning, from MOOCs to books to eventually getting involved in open source.
I discovered the statistical programming language R, which I have taken to like a fish to water, and worked to update and develop my skills. The open-source R community provided me with amazing opportunities to improve as an R developer and to build relationships with people who have helped me along the way. I eventually started applying for data science jobs with a portfolio of analysis projects demonstrating my skills on my blog. My first job as a data scientist was at an amazing statistics/data science consulting firm called Datassist that does important, interesting work. I work now as a data scientist at Stack Overflow; it's my second job where my title is data scientist.
I usually see people from two backgrounds who are interested in moving into data science. The first set are people from really academic backgrounds in physics or ecology or the code-heavy sides of the social sciences, with PhDs like me, who have strong quantitative skills. Usually what these people need to move into data science is to adopt some important practices from software engineering like version control, unit testing, and continuous integration. Basically, they need to become more fluent coders.
The second set are software engineers who are already great at writing code and have data-oriented mindsets, but less statistical training. Usually what this group needs to move into data science is to hone their modeling and machine learning chops.
Another cohort of people I am super interested to follow are the students who are right now in the new academic programs training people to do data science. These are students getting master's or bachelor's degrees with explicit training in data science as a major or minor. I've gone to speak at some of these programs here where I live in Salt Lake City and I am very interested to see how this space evolves in the near future.
I predict that in 5 or 7 years there will be more data scientists who have less education (fewer PhDs, more bachelor's degrees) but more specialized education (fewer astronomers and biostatisticians, more people with master's degrees in data science). I don't think the field will be worse off, but it will be different!