DEV Community: Walker Harrison

Using Data Science for EdTech and Public Policy, with Alicia Powers

Walker Harrison — Mon, 31 Jul 2017 18:07:22 +0000

Alicia Powers is the Senior Vice President of Research at the New York City Economic Development Corporation. She has a degree in Statistics and Cognitive Science from Rice University and a PhD in Public Policy and Management from Carnegie Mellon University.

What impresses you most about modern artificial intelligence, which is often modeled after cognitive processes, your field of study?

There have been great strides in the area of AI since I first studied it a few decades ago, when it was mostly a set of algorithms in a textbook. That being said, I think we may not be as far along as we think we are when it comes to artificial intelligence. There’s a level of potentiality and plasticity to the brain, an ability to react to fresh information, that I think computers are yet to replicate.

Sometimes I find myself in a scenario where I’m impressed with one form of machine learning but underwhelmed by another. I was listening to music the other day on a streaming service and thought that the music recommendations were curated wisely, but that the targeted ads in between songs had nothing to do with me.

So there’s room for improvement in AI. I’d say I’m most impressed by the advances in sensory and spatial awareness in machines. It wasn’t so long ago that we were unable to build a robot that could take a few steps without falling over, but they’ve since gotten much more sophisticated.

What software does your team use for its analysis?

We use R. But most of my colleagues come in with degrees in economics or policy and not computer science, so they often don’t have much programming experience. So we utilize resources like DataCamp, and the team also holds R learning sessions. I teach base R but also tidyverse packages like dplyr — I find using the %>% pipe in your syntax makes it easier to follow the analysis.

I’ve been using R since I was a grad student in the early 2000s. But for personal projects I prefer Python—Jupyter notebooks have made it much easier to use libraries like numpy and pandas for data analysis. Plus it makes me feel more like a software engineer! I can tell my programmer friends that I’ve been coding in Python.

After getting your doctorate, what was your first role in tech?

I was at an edtech hackathon when an acquaintance introduced me to Noodle Education, founded by John Katzman of the Princeton Review, which was a young company trying to serve as a knowledge base for individuals trying to get into school at all levels. A few months later I joined their team.

I worked mostly on what could be considered a search engine for education. We wanted you to be able to search for schools and courses and degrees from kindergarten all the way through to the university level. We had moms searching for local preschools, 8th graders in New York City trying to find the right high school, and students in China trying to inform their college apps.

Much of my job was combining disparate data sources, often in the form of Excel spreadsheets with pretty complex schemas, into a single database. This role skewed more toward aggregation than analysis, so it was more of a data engineering position than a data science one. We went with MongoDB for our database needs, which at the time was just getting started.

As someone who’s experienced in edtech and worked in academia, would you recommend traditional education or modern techniques, like bootcamps, to aspiring data scientists?

I’m still in favor of the more traditional route if you can find a way to afford it. Everyone thought that MOOCs were going to completely disrupt academia and change the world…but no one ever finishes them! So I still think that the structure and network that in-person learning at a university can provide is worthwhile.

Working as a Data Scientist at Stack Overflow, with Julia Silge

Walker Harrison — Wed, 26 Jul 2017 16:15:52 +0000

Julia Silge is a data scientist at Stack Overflow. She studied physics and astronomy, finishing her PhD in 2005. She worked in academia (teaching and doing research) and ed tech before moving into data science and discovering R. Her O'Reilly book Text Mining With R, written with coauthor David Robinson, is now available.

What kind of data problems do you encounter at Stack Overflow?

The data I get to work with at Stack Overflow is amazing! Stack Overflow is where people who code come to learn, share knowledge with each other, and build their careers, so our data is all about that. It's data about who developers are, how they build communities together, how technologies interact and are changing, and how technologies are impacting the world we live in.

We use this rich data in a lot of different ways at Stack Overflow. My team works on machine learning to match developers with relevant jobs, text mining to understand what makes a developer more likely to respond to a company, and of course understanding the technology ecosystems themselves.

One specific issue I've worked on understanding is the geographic distribution of our users worldwide and comparing that to the geographic distribution of the jobs available on Stack Overflow. This project did not involve particularly fancy AI or anything, but it did involve integrating diverse, messy datasets and building an internally facing tool that stakeholders from executives to salespeople can engage with and use to make decisions.

What lessons can data scientists take from data journalism?

My own work as a data scientist is influenced by data journalists, and communication and storytelling in general. This doesn't mean I am less technical, but it does mean that I care a lot about what someone takes away from the analysis I did, or the model I built.

Andrew Flowers gave a talk at rstudio::conf last year about how to find and tell stories, and it resonated so much with how I approach my work. I pay attention to what's going on in data journalism, and I find that when something is really compelling, whether it's The Pudding's work on hip hop or bestselling books or the Upshot’s "you draw it" graphs, I consider why and how those principles can be integrated into my own work. I communicate with people like software developers, product managers, and salespeople in my job, and I want them to both understand and be delighted with the work that I do.

How did you transition from academia to data science?

I wrote lots of code during my research years, using programming as a tool for scientific computing and analyzing real-world complex data. After grad school, I worked in research as a postdoc and then in education as a professor. After that, I took several years away from the paid workforce when my children were tiny and was home full-time with them.

In 2012, I transitioned back into the workforce with a job at an ed tech start-up where I developed interactive content for higher ed STEM courses, but through a series of circumstances (including a layoff) I decided the time was right for me professionally and personally to move toward a more technical, analytical role. I hadn’t been coding full-time since my postdoc days, so I jumped into a whole slew of opportunities for learning, from MOOCs to books to eventually getting involved in open source.

I discovered the statistical programming language R, which I have taken to like a fish to water, and worked to update and develop my skills. The open-source R community provided me with amazing opportunities to improve as an R developer and to build relationships with people who have helped me along the way. I eventually started applying for data science jobs with a portfolio of analysis projects demonstrating my skills on my blog. My first job as a data scientist was at an amazing statistics/data science consulting firm called Datassist that does important, interesting work. I work now as a data scientist at Stack Overflow; it's my second job where my title is data scientist.

What are the common career paths that lead to data science?

I usually see people from two backgrounds who are interested in moving into data science. The first set are people from really academic backgrounds in physics or ecology or the code-heavy sides of the social sciences, with PhDs like me, who have strong quantitative skills. Usually what these people need to move into data science is to adopt some important practices from software engineering like version control, unit testing, and continuous integration. Basically, they need to become more fluent coders.

The second set are software engineers who are already great at writing code and have data-oriented mindsets, but less statistical training. Usually what this group needs to move into data science is to hone their modeling and machine learning chops.

Another cohort of people I am super interested to follow are the students who are right now in the new academic programs training people to do data science. These are students getting master's or bachelor's degrees with explicit training in data science as a major or minor. I've gone to speak at some of these programs here where I live in Salt Lake City and I am very interested to see how this space evolves in the near future.

I predict that in 5 or 7 years there will be more data scientists who have less education (fewer PhDs, more bachelors') but more specialized education (fewer astronomers and biostatisticians, more people with masters' degrees in data science). I don't think the field will be worse off, but it will be different!

Flash is dead 💀 (kind of)

Walker Harrison — Tue, 25 Jul 2017 19:21:11 +0000

Earlier today, Adobe published a blog post announcing that they would be gradually sunsetting their well-known plugin, Flash. The company will continue "issuing regular security patches, maintaining OS and browser compatibility and adding features and capabilities as needed," according to the post, but the plan is to stop updating and distributing Flash by the end of 2020.

Nostalgic users might shed a tear for Flash's death—who amongst us didn't rely on Flash to serve us an early viral video or whimsical browser game?—but the general consensus is that Flash's wind-down is overdue if anything. With near universal agreement, web developers consider Flash a potential source of security vulnerabilities, sluggish performance, and wasted battery. Major browsers like Chrome and Safari automatically block the plugin—you may recall being asked to allow it to run on various websites.

Originally rolled out in the 1990s, Flash was the standard for online audio and video playback for over a decade. But eventually superior technologies like HTML5 supplanted the plugin, and Flash's persistent bugginess and poor security attracted scores of critics, most notably Steve Jobs who penned a long essay denouncing Flash in 2010.

But Flash's erstwhile prominence means that it's still all over the web, which is why Adobe is relying on its powerful partners, Apple, Facebook, Google, Microsoft, and Mozilla, to ease the transition by encouraging developers to migrate current Flash content to modern formats. The three years until the official end date make such conversions easier, but Flash was instrumental in the digital emergence of industries like education and gaming, so it's likely that this change will break more than a few old and untended websites.

Many dev.to users surely have experiences, both positive and negative, working with Flash over the years. Feel free to share them in the comments below!

Turning a side interest in programming into a data engineering career, with Josh Laurito

Walker Harrison — Thu, 13 Jul 2017 14:37:02 +0000

Josh Laurito is the head of data engineering at Fusion Media Group, the publisher of the web’s most widely read media brands, reaching over 90MM unique visitors a month. His team occasionally blogs at fmgdata.kinja.com. He also runs a newsletter for the NYC data community that you can subscribe to here.

What's a recent data problem that your team had to solve?

Most of our work is dedicated to our publishing platform, Kinja. Kinja used to only allow homepages organized like blogs, with stories listed newest-to-oldest. About a year ago, we started supporting stories being 'pinned' to the top of the homepage. This was a great change that was applauded throughout the company, giving our editors a chance to highlight each publication's best work.

The flip side of manual curation, though, was that we as an organization hadn’t ever really picked out stories for special coverage this way before. While it was generally pretty clear to the editors what stories should be pinned, we had no experience with how long they should stay pinned. Is a big story from 8 hours ago more engaging to our audience than a smaller story that’s just breaking now?

At first my team did some analysis, thinking we could automatically promote stories to the top of the page, or at least recommend what stories should go there.

However, it was pretty clear that the editors needed to have a lot of control here. Editorial organizations have their own sensibilities that are really difficult to articulate in a model. Additionally, there was lots of planned event coverage, like the Apple WWDC for Gizmodo or E3 for Kotaku, where we’d need to have posts queued up to override anything automated.

So how did you find the right level of automation?

In the end, we built a lightweight alerting system. The math is pretty straightforward: we calculate expected click rates for stories on each site that we support, then see how stories are performing against that. We integrated the alerts into each publication’s Slack room, and added a glyph system so editors knew exactly what the numbers meant. Here’s an example from Lifehacker:

So instead of building something that was completely automated around pinning stories, we used the data to educate and support our editorial team. It’s been successful in that it drives changes to the homepage, but doesn’t dictate terms: if editors want to keep up a post despite lower objective performance, maybe because the reporting is excellent, or they just think a piece is really fun, they have that option.
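To make the idea concrete, here is a minimal sketch of that kind of check: compare each story's observed click rate with an expected rate and attach a label an editor could read at a glance. Everything in it is an assumption for illustration (the StoryStats class, the performance_alerts function, the 25% tolerance band, and the sample numbers); it is not Fusion Media Group's actual system.

```python
from dataclasses import dataclass


@dataclass
class StoryStats:
    """Hypothetical per-story counters; the field names are illustrative."""
    title: str
    impressions: int  # times the story was shown on the homepage
    clicks: int       # times it was clicked


def click_rate(story):
    """Observed click rate; 0.0 when the story has not been shown yet."""
    return story.clicks / story.impressions if story.impressions else 0.0


def performance_alerts(stories, expected_rate, tolerance=0.25):
    """Label each story relative to the site's expected click rate.

    tolerance is an assumed band: within +/-25% counts as "on track".
    """
    alerts = {}
    for story in stories:
        ratio = click_rate(story) / expected_rate
        if ratio >= 1 + tolerance:
            alerts[story.title] = "above expected"
        elif ratio <= 1 - tolerance:
            alerts[story.title] = "below expected"
        else:
            alerts[story.title] = "on track"
    return alerts


# Made-up numbers, purely for demonstration.
stories = [
    StoryStats("Big story from 8 hours ago", impressions=10000, clicks=300),
    StoryStats("Smaller story breaking now", impressions=2000, clicks=160),
    StoryStats("Planned event coverage", impressions=5000, clicks=240),
]
alerts = performance_alerts(stories, expected_rate=0.05)
for title, label in alerts.items():
    print(f"{title}: {label}")
```

The labels only inform; nothing here moves a story, which mirrors the choice to leave the pinning decision with the editors.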

Beyond curation, does the data team also make suggestions about the actual headlines or content of articles?

We've talked about this a lot, and we've shied away from it so far. Occasionally for sponsored content or the sales team we'll help research a strategy, but not for editorial so far.

There are two main data-oriented reasons we've stayed away (our editorial staff could probably give you a few more). One is that we're worried that we'll dilute what makes our publications special, which is their voice. When lots of people are testing headlines or content, there's going to be convergent evolution, which means sites will be harder to differentiate between. While I have no doubt that leads to short-term improvements in metrics, I am certain it will dilute what makes us a destination for our readers.

The second reason is that the implicit assumption behind testing is that the audience you're training on (today's traffic) looks a lot like your test set (future traffic). But the fact is that most of our traffic is filtered through someone else's algorithm first, be it Facebook, Google, Twitter, or someone else. I don't have a lot of confidence in the stability of those algorithms, and I think over-optimizing for them could make us susceptible to changes. I'd rather let the editorial and audience teams try what they want and have a more diverse set of headlines and content.

When and how did you learn to program?

I actually graduated from college with a degree in chemistry, and moved into finance, figuring that even if I didn't know what I wanted to do, at least I could make some money and meet some smart people. It was right during the housing boom, and I had a friend who worked in a mortgage derivatives group at a big bank who helped me get a job as a structurer, which is basically someone who does scenario modeling.

The work was interesting and challenging, but the workload was overwhelming if you couldn't manage it. I learned how to program in order to keep up with all the requests for my time. Most of these lessons happened thanks to the guy who sat next to me, who had a master's in computer science and took pity on me manually entering scenario parameters over and over again at 1AM.

The bottom fell out of the market a few years later and all of us got laid off. It was a tough lesson in the limits of what our models could do. I actually worked with a few people who ended up as characters in The Big Short.

How did you turn your side interest in programming into a data engineering job?

I ended up at an insurance company and one day, the CEO of the company walked by my desk and complained to me how he had no way to match his data with government data. So if he wanted to see how many of his insurance policies were written in areas that have, for example, high unemployment rates, he was stuck.

It sounded like a really interesting problem, so between Christmas and New Year's when nobody else was around I prototyped an app that took information from internal and government sources, and put together choropleth maps.

When people came back in January, I showed off my app and everyone was pretty excited about it. We built a small team around the idea. What I didn't know at the time was that the insurance company had invested too much in mortgages, just like the bank, and was going to be effectively going bankrupt soon.

Fortunately, two of the executives I worked with were interested in spinning off a startup based on my app. That became my first startup, Lumesis. The two founders wanted to move the company up to Connecticut, which didn't really appeal to me, but I was excited about the tech world after getting a taste, and started looking for other jobs in tech that used math and statistics to solve problems.

I bounced around a few startups before a friend of mine who I had worked with at CrowdTwist, a Techstars company in Flatiron, recruited me over to Gawker. I started working there in 2014, and have stayed here through the bankruptcy and acquisition by Univision.

How has your job changed since you were first brought on?

Oh wow, it's like night and day. When I started, I was effectively the only data scientist/analyst (we had a data engineer though). I spent almost all my time coding, testing, and writing up results.

Now that the team is bigger and we work for a large corporation, I spend a lot of time managing, doing strategy work, and hiring. Speaking of which, we're hiring for product, engineering, and data roles, as well as elsewhere in the organization. Your readers should take a look!

What do you look for in a new hire and what advice would you give to aspiring data scientists?

The advice I give to everyone is to complete a project and make it public. When I taught data visualization, I forced my students to make their final projects public. It doesn’t really matter what it is: tied to work or not, a visualization or an open-source library, whether it’s polished or not, you need to get it out there.

All of us who work with numbers or code for a living know that most projects include some ugly parts, whether you think of them as kludges or technical debt or hand-waving through assumptions/theory. Just being able to ship something that delivers what it promises puts you ahead of most people who would like to work in data science.

When I got started I did all sorts of dumb fun projects around things I was interested in, just to learn. I made a Mouse Speedometer, a map of the US banking system, a comparison of European Languages, a bunch of stuff. None of these are going to win any awards, but they helped me build a portfolio, and generally were a lot of fun to play around with as I was learning new tech.

When I’m hiring, I love seeing what people have done before, either at work or on their own. I think the number one complaint about data-oriented people in industry is that we aren’t always good at shipping products out the door, so knowing that someone actually gets stuff done means a lot to me.

Pseudo-Random Numbers in Python: From Arithmetic to Probability Distributions

Walker Harrison — Fri, 23 Jun 2017 18:21:25 +0000

Randomness is something that we tend to take for granted in our daily lives. "That's so random!" we'll say, when someone does something abnormal or unexpected, even though there's evidence that humans can't consciously achieve randomness. For truly stochastic processes, we turn to nature: the growth of bacteria, the timing of radioactive decay, and the thermal noise created by electric currents all lay claim to a true random sequence.

There's not always some Plutonium-239 lying around every time we need to use true randomness though, so computer scientists developed close approximations called pseudo-random number generators, or PRNGs. Today these algorithms are ubiquitous in software development. As Wikipedia notes, "PRNGs are central in applications such as simulations (e.g. for the Monte Carlo method), electronic games (e.g. for procedural generation), and cryptography."

Radioactive decay is considered a truly random process.

Today I'm going to walk through a relatively simple kind of PRNG called a Linear Congruential Generator (LCG). In modern implementations, most random numbers rely on more sophisticated PRNGs, but the LCG gives us a solid foundation and will be sufficiently random for our modest intentions.

As a pre-req, you should be familiar with modular arithmetic, which is essentially just representing numbers as remainders once divided by a "modulus." In other words, numbers can increase until they reach a set modulus and start wrapping around: 5 modulo 3 is equal to 2, 25 modulo 2 is equal to 1. Most clocks are on a modulo-12 system.
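If it helps to see that in code, Python's % operator does modular arithmetic directly. The clock_add helper below is a made-up name for illustration, and it reports 12 o'clock as 0, which is how a modulo-12 system naturally counts.

```python
def clock_add(hour, delta, modulus=12):
    """Advance a clock reading by delta hours, wrapping around at the modulus."""
    return (hour + delta) % modulus


print(5 % 3)             # 2
print(25 % 2)            # 1
print(clock_add(10, 5))  # 3: five hours after 10 o'clock is 3 o'clock
```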

For an LCG we need a modulus, m, an initial value (or "seed"), X_0, a multiplier, a, and an increment, c. From there, to get from one number X_n to the next X_{n+1}, you simply execute the following algorithm:

X_{n+1} = (a X_n + c) mod m

Let's write this as a function in Python and then assign some simple values: a modulus of 10, a multiplier of 3, an increment of 1, and a seed value of 5.

def lcg(n, m, a, c, seed):
    sequence = []
    Xn = seed
    for i in range(n):
        Xn = (a*Xn + c) % m
        sequence.append(Xn)
    return(sequence)

lcg(10, 10, 3, 1, 5)
# => [6, 9, 8, 5, 6, 9, 8, 5, 6, 9]

As you can see, we asked for ten random numbers and were given the sequence [6, 9, 8, 5, 6, 9, 8, 5, 6, 9]. Now while this passes the requirements of an LCG, this is a truly horrible PRNG. First of all, in modulo 10 there are only ten possible values a number can take on, and furthermore there is an unbreakable 6-9-8-5 loop in our case that shrinks our state space to just four numbers.
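One way to measure how bad a loop like this is: keep stepping the generator until it lands on a value it has already produced, then count how long the repeating part is. The lcg_period function below is a sketch written for this post's toy parameters, and the second call uses a small hypothetical generator (m = 16, a = 5, c = 1) picked only for contrast. It stores every state it visits, so it is not meant for a modulus as large as the ones real generators use.

```python
def lcg_period(m, a, c, seed):
    """Return the length of the cycle an LCG falls into from a given seed."""
    first_seen = {}  # state -> step at which it first appeared
    Xn = seed
    step = 0
    while Xn not in first_seen:
        first_seen[Xn] = step
        Xn = (a * Xn + c) % m
        step += 1
    # Subtract any lead-in values that come before the loop starts.
    return step - first_seen[Xn]


print(lcg_period(10, 3, 1, 5))  # 4: the 6-9-8-5 loop
print(lcg_period(16, 5, 1, 0))  # 16: every value from 0 to 15 appears
```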

What we need is a much larger modulus and the assurance that every possible number will be produced once by any seed value that's input, or what's known as having "full period." Enter advanced number theory. By the Hull and Dobell Theorem, an LCG will have full period if:

c and m are relatively prime (i.e., the only positive integer that divides both c and m is 1).
If q is any prime number that divides m, then q also divides a – 1.
If 4 divides m, then 4 also divides a – 1.
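Those three conditions are mechanical enough to check in a few lines. The sketch below is one straightforward way to do it, not an optimized one; prime_factors and has_full_period are names made up for this example, and the trial-division factoring is only meant for moduli that factor easily, such as powers of two.

```python
from math import gcd


def prime_factors(n):
    """Return the set of prime factors of n using simple trial division."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def has_full_period(m, a, c):
    """Check the three Hull and Dobell conditions listed above."""
    c_and_m_coprime = gcd(c, m) == 1
    primes_divide = all((a - 1) % q == 0 for q in prime_factors(m))
    four_divides = (m % 4 != 0) or ((a - 1) % 4 == 0)
    return c_and_m_coprime and primes_divide and four_divides


print(has_full_period(10, 3, 1))  # False: 5 divides 10 but not a - 1 = 2
print(has_full_period(16, 5, 1))  # True
```

The first call explains the short loop from earlier: with m = 10 and a = 3, the second condition fails.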

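You can read the original 1962 paper proving this theorem here. Instead of trying to come up with numbers from scratch, we'll start from the ones that are actually used for random number generation in some languages, including C++ (m = 2³², a = 22695477, c = 1). The code below keeps that modulus and increment but uses the multiplier a = 11695477, which also satisfies all three conditions.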

But before trying out these new inputs, we need to address the deterministic nature of our LCG: plug in the same seed and you will get the same output, which sort of invalidates any sense of randomness that we'd claimed. This is a fundamental flaw in PRNGs in general.
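To see that determinism directly, here is a small check that reuses the toy parameters from the first example. The lcg function is repeated so the snippet runs on its own, and the second seed value of 2 is arbitrary.

```python
def lcg(n, m, a, c, seed):
    """Generate n values from a linear congruential generator."""
    sequence = []
    Xn = seed
    for _ in range(n):
        Xn = (a * Xn + c) % m
        sequence.append(Xn)
    return sequence


run_1 = lcg(5, 10, 3, 1, seed=5)
run_2 = lcg(5, 10, 3, 1, seed=5)
run_3 = lcg(5, 10, 3, 1, seed=2)

print(run_1)           # [6, 9, 8, 5, 6]
print(run_1 == run_2)  # True: same seed, same sequence
print(run_3)           # [7, 2, 7, 2, 7]
```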

However, there is an upside to their predictability — namely that by setting and sharing a seed, one can create reproducible results, which are critical in the sciences — and also a few ways around it. By tying the seed to something unseen and ever-changing, like the current number of microseconds past the second, we can reintroduce some faux-randomness. So along with our new inputs, we'll use the built-in Python datetime library for our seed:

from datetime import datetime
lcg(10, 2**32, 11695477, 1, datetime.now().microsecond)
# => [1090430687, 498347756, 2363706845, 1780651778, 1345777131,
#     826090344, 275756681, 2092763550, 396794679, 3763540772]

That looks a lot better. Usually, however, we're not so much interested in a random number between 0 and 2³², but rather a random number between 0 and 1 — the standard uniform distribution U(0,1). Since this distribution is bounded at 0 and 1, it can be multiplied by larger numbers to create larger uniform distributions and also maps smoothly to probability spaces, which will come in handy later.
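As a quick illustration of why U(0,1) is such a convenient target, the sketch below stretches standard uniform values onto other ranges. It uses Python's built-in random() as a stand-in source, and scale_uniform and roll_die are names invented for this example. Both helpers assume the input is strictly less than 1, which holds for random().

```python
from random import random, seed


def scale_uniform(u, low, high):
    """Stretch a value u from [0, 1) onto the interval from low to high."""
    return low + (high - low) * u


def roll_die(u, sides=6):
    """Map a value u from [0, 1) to an integer face from 1 to sides."""
    return int(u * sides) + 1


seed(42)  # fixed seed so the run is repeatable
draws = [random() for _ in range(1000)]

temps = [scale_uniform(u, 20.0, 30.0) for u in draws]
rolls = [roll_die(u) for u in draws]

print(min(temps), max(temps))  # both fall between 20 and 30
print(sorted(set(rolls)))      # the faces that came up
```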

By dividing our output by the modulus (or, to be precise, the modulus minus 1), we can transfer our random numbers onto the line between 0 and 1. Technically, only a finite group of numbers can be outputted so we haven't achieved a true uniform distribution, where every conceivable real number between 0 and 1 is possible and there's no point mass anywhere. But this distinction is negligible with a large number like 2³². Let's rewrite our LCG to spit out numbers between 0 and 1, and then graph a sequence of a thousand random numbers compared to a thousand random numbers generated by Python's built-in random() function:

from random import random
import matplotlib.pyplot as plt

def uni(n, m, a, c, seed):
    sequence = []
    Xn = seed
    for i in range(n):
        Xn = ((a*Xn + c) % m)
        sequence.append(Xn/float(m-1))
    return(sequence)

x = range(1000)
y_1 = uni(1000, 2**32, 11695477, 1, datetime.now().microsecond)
y_2 = [random() for i in range(1000)]

plt.plot(x, y_1, "o", color="blue")
plt.show()

plt.plot(x, y_2, "o", color="red")
plt.show()

Two pseudo-random outputs: our LCG on the left and Python's built-in PRNG on the right.

On the left are the thousand random numbers graphed in the sequence we produced them, and on the right are the thousand emitted by Python's built-in random() function, which, for the record, relies on the Mersenne Twister, a relatively modern algorithm that today is the gold standard for PRNGs. While any computer scientist would tell you the second algorithm is much more robust, to the naked eye they appear comparable.

Moreover, the standard uniform distribution acts as a gateway to all sorts of other, more sophisticated distributions because it operates on the probability space from 0 to 1 (everything has somewhere between a 0% and 100% probability of happening). So as long as we can find a way to map our events to the uniform distribution, we can randomly sample from them.

As a simple example, we can simulate flipping a coin from the uniform distribution by calling any random number greater than 0.5 a heads and anything less a tails. Here's a function that takes a number of flips and returns how many heads turned up (formally, a binomial distribution with n trials and p of 0.5). We'll run a thousand trials of a hundred flips each and build a histogram with the results:

def coin_flips(n):
    flips = uni(n, 2**32, 11695477, 1, datetime.now().microsecond)
    heads = sum([i > 0.5 for i in flips])
    return heads

trials = [coin_flips(100) for i in range(1000)]

plt.hist(trials, bins=range(min(trials), max(trials) + 2))
plt.show()

A histogram of a thousand trials of our PRNG flipping a coin a hundred times and counting the heads.

As you can see, our trial data is unimodal and nearly symmetrical, which we'd expect from a binomial random variable.
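The same trick works for an unfair coin: call a flip heads whenever the uniform number falls below some probability p. The sketch below is one possible variation rather than part of the original experiment. It assumes a fixed seed of 2017 and p = 0.3, draws all the flips from a single LCG stream, and compares the sample mean and variance with what binomial theory predicts. The biased_flips name is made up for the example, and uni is repeated so the snippet runs on its own.

```python
def uni(n, m, a, c, seed):
    """Generate n LCG values scaled onto the interval from 0 to 1."""
    sequence = []
    Xn = seed
    for _ in range(n):
        Xn = (a * Xn + c) % m
        sequence.append(Xn / float(m - 1))
    return sequence


def biased_flips(n_trials, n_flips, p, seed):
    """Count heads in each trial, where a flip is heads with probability p."""
    # Same modulus, multiplier, and increment as the calls above.
    stream = uni(n_trials * n_flips, 2**32, 11695477, 1, seed)
    return [
        sum(u < p for u in stream[i:i + n_flips])
        for i in range(0, len(stream), n_flips)
    ]


trials = biased_flips(1000, 100, p=0.3, seed=2017)
mean = sum(trials) / len(trials)
variance = sum((t - mean) ** 2 for t in trials) / len(trials)

# Theory: mean = n*p = 30 and variance = n*p*(1-p) = 21 for n = 100, p = 0.3.
print(mean, variance)
```

If the generator is behaving, the printed values should land close to 30 and 21, though they will not match exactly.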

Now, as we've mentioned, these PRNGs are indisputably flawed. They are deterministic and not truly uniform. Famed mathematician and computer scientist John von Neumann once said that "anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."

Still though, I find it pretty remarkable that with addition and multiplication, the most basic mathematical building blocks, we can approximate a sample from a random variable like the binomial distribution.

GitHub releases results from its 2017 Open Source Survey

Walker Harrison — Fri, 02 Jun 2017 20:09:30 +0000

If you're reading this page on Firefox or even using a Linux operating system, you're the beneficiary of open source software. On a broader level, open source tools likely facilitate all sorts of everyday actions of yours, which makes GitHub's yearly open source software survey, released today, an important assessment of a major concept within the technical universe.

By surveying over 5,000 respondents from open source repos on their site, GitHub has put together an "open data set about the attitudes, experiences, and backgrounds of those who use, build, and maintain open source software."

There's a lot to mine here (and GitHub provides the raw data for those who want to analyze it themselves), so we'll be releasing a series of posts over the coming weeks that examine individual topics. Some of the insights are less than encouraging, as the issues that plague the tech world in general, such as harassment and a lack of gender diversity, are present if not exaggerated in open source projects.

For today though, we'll focus on a positive finding: the ever-growing pervasiveness and acceptance of open source software. As indicated by the below graph, the vast majority of respondents reported that they use open source software at work. Moreover, "most report that their employers accept or encourage use of open source applications (82%) and dependencies in their code base (84%)."

These figures speak to the journey open source software has taken in the past two decades from a questionable standard to something that's widely tolerated if not preferred today. As the New York Times reports, "in 2000, the open-source operating system Linux was viewed askance in many corporations as an oddball creation and even legally risky to use, since the open-source ethos prefers sharing ideas rather than owning them."

The collective corporate tune has changed though. These days, "where free, collaborative software projects were once the flags flown by indie developers bucking corporate computing, today even companies like Exxon Mobil, Wal-Mart, and Wells Fargo are releasing their own open source tools," via Wired.

There's much more to discuss from the survey's results, so keep an eye out for more posts on dev.to, or better yet, write them yourselves. I'll be diving into the data under the data and ossurvey tags. Let's keep the conversation going.

What's the best software for creating flowcharts and other visuals to document application logic?

Walker Harrison — Wed, 24 May 2017 21:54:37 +0000

I'm looking to create some flowcharts, but I'm only aware of the obvious solutions like PowerPoint or Keynote. Does anyone have any suggestions?

Here's an example of a decently complex flowchart, via Slack:

Google announces support for Kotlin on Android, now what?

Walker Harrison — Wed, 17 May 2017 19:50:55 +0000

At their annual I/O conference in the Bay Area, Google announced this afternoon that Android would now support the Kotlin language. Kotlin, released by the software company JetBrains in 2011, is a relatively new language with growing popularity among developers.

Supporting Kotlin was viewed as a logical next step for Google by many in the tech space. The language is interoperable with Java, allowing developers to mix the languages when writing code, and already has legions of loyal users who appreciate its modern features and clean syntax. A quick peek at the rising incidence of "Kotlin" tags on Stack Overflow provides further evidence of the language's upswing:

Of course, Java still dwarfs Kotlin in absolute popularity.

On their own blog, JetBrains announced the news as "a chance to use a modern and powerful language, helping solve common headaches such as runtime exceptions and source code verbosity."

The announcement has been well received so far. Commenters in the Hacker News thread anticipated "a good Kotlin native experience" and expressed excitement over the opportunity to use "Java's ecosystem without actually having to write Java."

The move to support Kotlin natively has a lot of parallels with Apple's release of the Swift programming language to give the Apple development experience a boost. While the move has different repercussions because Kotlin was not developed by Google, it should nonetheless provide a big breath of fresh air to their developer community.

One has to wonder whether Google's legal battle with Oracle over its use of certain Java APIs played a part in the decision.

K. 👓

@k4y1s

@ThePracticalDev lol, is this because oft the beef with Oracle? :D

18:40 PM - 17 May 2017

So what now?

There are a lot of great reasons to make Kotlin the next programming language you learn, but where to start? Naturally, this isn't the first dev.to has heard of Kotlin. Graham Cox vouched for the language in the beginning of the year:

Why I prefer Kotlin

Graham Cox ・ Jan 2 '17 ・ 2 min read

#100daysofcode #kotlin #java

Later, Daniele Bottillo took us through some basic examples:

Kotlin by examples: Class and Properties

Daniele Bottillo ・ Feb 21 '17 ・ 3 min read

#android #kotlin #java

Hadi Hariri appeared on a great episode of Software Engineering Daily to discuss Kotlin. I highly recommend it for a deep dive into the origins of Kotlin and why a company would develop a new language in the first place.

Kotlin with Hadi Hariri

Software Engineering Daily

Any devs out there with experience in Kotlin? Leave your thoughts in the comments!

Stack Overflow released a new mobile app

Walker Harrison — Tue, 16 May 2017 15:58:45 +0000

Stack Overflow released a mobile app today for iOS and Android:

Stack Overflow

@stackoverflow

Q&A on the go with the new Stack Overflow mobile app. 🍎 iOS → buff.ly/2qo5cma 🤖 Android →… twitter.com/i/web/status/8…

15:07 PM - 16 May 2017

The website's blogpost about the release notes that Stack Exchange has had an app available for several years that included access to Stack Overflow, but this app is designed solely for access to the company's flagship site.

My first impulse would be to question the utility of a Stack Overflow app, since I'm confident that the vast majority of programmers who rely on the question-answer website do so while they write code on a personal computer and not their phones. Plus, asking a question on Stack Overflow requires users to adhere to a somewhat strict protocol and often to use code blocks, special characters, or other non-traditional text that doesn't seem easy to thumb from a phone.

That being said, there are plenty of less complicated actions important to the Stack Overflow experience. "Meta" posts about the actual product are popular and often prominently featured—in fact the site devoted one to the app's release this morning. Plus there's plenty of back-and-forth in a question's commentary that could be done easily from a phone.

It's also important to remember that the lifeblood of Stack Overflow is the class of power users who spend hours a day assisting communities, and not people like me who ask the occasional question. A mobile app might be more useful for such users, and anything that expands the experience for them is likely worth it for Stack Overflow.

What do you guys think?

Stack Overflow just released a tool called "Trends" that tracks the popularity of programming terms over time.

Walker Harrison — Thu, 11 May 2017 15:47:17 +0000

It can be frustrating to post a question on Stack Overflow while worrying if the library or language you're asking about is popular enough to get the quality response you're looking for. And it can be reassuring to put up a question about widely used tools that is likely to get answered.

You can now capture those sentiments graphically with the Stack Overflow Trends tool, which, given a set of tags, will whip up a chart that shows their popularity over time going back to 2009. Here's one of their suggested searches, which shows the trends for JavaScript frameworks:

As you can see, Angular and React have been increasingly asked about in Stack Overflow questions since about 2015, while jQuery's relevance has steadily declined in the same time period.

Another pre-made chart shows the emergence of technologies in the data science realm:

The statistical programming language R has become much more prevalent on Stack Overflow, as has the Python package pandas, while MATLAB has struggled to stay above water. More recently, questions surrounding the machine learning library TensorFlow have heated up.

Of course, it's more fun to come up with your own queries. Here's one showing every iOS release since iOS 7 along with a general iOS tag:

Each iteration has a personal peak around its release in the fall, only to be replaced in the next calendar year. Meanwhile, the general iOS tag is relatively stable.

Keeping an eye on emerging technologies is an important part of being a developer, no matter what language you're working in. It's also a worthwhile intellectual exercise to wonder why certain tools decline and others replace them. So give the Stack Overflow Trends tool a try and let us know what you can dig up—I'm sure there are findings worth blogging about!

The 500-foot Cab Ride: Using BigQuery to find out how dirty (sinister?) NYC's cab ride data is

Walker Harrison — Fri, 28 Apr 2017 18:53:47 +0000

I may be a few years late to this, but Google's BigQuery is freakin' awesome. The data warehouse is the home to a bunch of really neat datasets which users can traverse shockingly quickly (and for free, up to the first terabyte). In the past few weeks, I've explored the GitHub dataset to find popular Python packages and the frequency of cussing programmers.

Today, I'm going to recreate a blogpost I wrote last year about cab rides in New York City. When I realized the Taxi & Limousine Commission published their ride data, including starting and ending geographic coordinates, I figured there were some interesting patterns waiting to be discovered.

In particular, I wondered how often people took cab rides for ludicrously short distances (less than a tenth of a mile) and how those rides might differ from more conventional rides. I mostly expected geographic and timing discrepancies: surely only drunken late-night patrons or residents of wealthy neighborhoods could justify taking a cab a few hundred feet, right?

While there wasn't much evidence of that hypothesis, I did find something else that was peculiar. Let's split every yellow cab ride since 2009 into those that were longer than a tenth of a mile ("normal") and those that weren't ("micro-rides"), and also into those that have complete geolocational info (pick-up and drop-off both present) and those that don't.

Within those four cohorts, we'll count how often the ride is paid for in cash, how often the fare is $50 or more, and how often that fare is negotiated.

(For the record, this command queried over a billion rows in under six seconds.)

SELECT 
  CASE WHEN trip_distance > 0.1 THEN "normal" ELSE "micro-ride" END AS type,
  CASE WHEN pickup_longitude = 0 OR pickup_latitude = 0 
        OR dropoff_longitude = 0 OR dropoff_latitude = 0 
        THEN 'yes' ELSE 'no' END AS missing_geo,
  COUNT(*) AS count,
  STRING(ROUND(SUM(CASE WHEN payment_type = 'CSH' THEN 1 ELSE 0 END)/count*100, 0)) + '%' AS cash,
  STRING(ROUND(SUM(CASE WHEN total_amount  >= 50 THEN 1 ELSE 0 END)/count*100, 2)) + '%' AS fifty_plus,
  STRING(ROUND(SUM(CASE WHEN rate_code = '5' THEN 1 ELSE 0 END)/count*100,2)) + '%' AS negotiated
FROM [nyc-tlc:yellow.trips]
GROUP BY type, missing_geo
ORDER BY type DESC, missing_geo

type	missing_geo	count	cash	fifty_plus	negotiated
normal	no	1,076,851,430	49.00%	2.11%	0.10%
normal	yes	20,417,733	48.00%	2.21%	0.19%
micro-ride	no	9,828,272	56.00%	15.07%	8.32%
micro-ride	yes	1,682,028	55.00%	25.53%	26.63%

Some of these trends are predictable. There are about 100 normal rides for every micro-ride, and cash is used about half the time in every cohort. But some of them are a bit odd. Why are there so many instances of micro-rides costing more than $50 when they by definition should be much cheaper? And why is that trend exaggerated for micro-rides missing their geo-locational data? And why does missing geo data increase negotiation frequency so much more for micro-rides than for normal rides?
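The "about 100" figure can be checked directly against the table. The sketch below is a minimal Python calculation; the counts are copied from the results above, while the helper names (`total`, `missing_geo_share`) are made up for illustration.

```python
# Counts copied from the query results table (type, missing_geo) -> count.
counts = {
    ("normal", "no"): 1_076_851_430,
    ("normal", "yes"): 20_417_733,
    ("micro-ride", "no"): 9_828_272,
    ("micro-ride", "yes"): 1_682_028,
}


def total(ride_type):
    """Total rides of one type, with or without complete coordinates."""
    return counts[(ride_type, "no")] + counts[(ride_type, "yes")]


def missing_geo_share(ride_type):
    """Fraction of rides of one type that lack complete coordinates."""
    return counts[(ride_type, "yes")] / total(ride_type)


# How many normal rides there are for each micro-ride.
normal_per_micro = total("normal") / total("micro-ride")

print(f"normal rides per micro-ride: {normal_per_micro:.1f}")
for ride_type in ("normal", "micro-ride"):
    print(f"{ride_type} rides missing geo: {missing_geo_share(ride_type):.2%}")
```

On these counts, micro-rides turn up without coordinates several times more often than normal rides do, which is worth keeping in mind when weighing the questions above.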

There might be reasonable explanations for all this. But at first sight, I convinced myself that something sinister was behind this data: In the dark and complex world of New York City, there is a certain class of trips where either the driver or the passenger does not want their exact location logged, and these are the same rides that are paid for in cash at a negotiated price. Feel free to fill in the details with tales of crooked cabbies, brick-shaped packages, and loaded firearms.

Or maybe the rows with missing coordinates or unlikely distances have just been corrupted, in which case our question becomes: is the data dirty in a conventional sense or in a criminal manner?
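One way to start answering that is to pull a sample of rows and inspect them locally. Here is a rough Python sketch of the same bucketing the query performs, assuming each ride arrives as a dict keyed by the query's column names. The `cohort` and `summarize` helpers and the three sample rows are hypothetical and made up for illustration; they are not taken from the TLC data.

```python
from collections import defaultdict


def cohort(ride):
    """Return (type, missing_geo) using the same rules as the query's CASE expressions.

    Assumes trip_distance is a number (no None values).
    """
    ride_type = "normal" if ride["trip_distance"] > 0.1 else "micro-ride"
    coords = (
        ride["pickup_longitude"], ride["pickup_latitude"],
        ride["dropoff_longitude"], ride["dropoff_latitude"],
    )
    # A coordinate of exactly 0 is treated as missing, as in the query.
    missing_geo = "yes" if any(c == 0 for c in coords) else "no"
    return ride_type, missing_geo


def summarize(rides):
    """Per cohort: ride count plus the share paid in cash, costing $50+, and negotiated."""
    tallies = defaultdict(
        lambda: {"count": 0, "cash": 0, "fifty_plus": 0, "negotiated": 0}
    )
    for ride in rides:
        t = tallies[cohort(ride)]
        t["count"] += 1
        t["cash"] += ride["payment_type"] == "CSH"
        t["fifty_plus"] += ride["total_amount"] >= 50
        t["negotiated"] += ride["rate_code"] == "5"
    return {
        key: {
            "count": t["count"],
            **{k: t[k] / t["count"] for k in ("cash", "fifty_plus", "negotiated")},
        }
        for key, t in tallies.items()
    }


# Made-up rows for illustration only.
sample_rides = [
    {"trip_distance": 2.4, "pickup_longitude": -73.99, "pickup_latitude": 40.73,
     "dropoff_longitude": -73.97, "dropoff_latitude": 40.76,
     "payment_type": "CRD", "total_amount": 14.5, "rate_code": "1"},
    {"trip_distance": 0.0, "pickup_longitude": 0, "pickup_latitude": 0,
     "dropoff_longitude": 0, "dropoff_latitude": 0,
     "payment_type": "CSH", "total_amount": 60.0, "rate_code": "5"},
    {"trip_distance": 0.05, "pickup_longitude": -73.98, "pickup_latitude": 40.75,
     "dropoff_longitude": -73.98, "dropoff_latitude": 40.75,
     "payment_type": "CSH", "total_amount": 4.0, "rate_code": "1"},
]

for key, stats in summarize(sample_rides).items():
    print(key, stats)
```

Keeping the cohort rule in one small function makes it easy to tighten the definition of "missing" later, for example by also flagging coordinates that fall well outside New York City.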

Working Smart: What performance metrics do developers value and when do they feel most productive?

Walker Harrison — Fri, 14 Apr 2017 18:24:20 +0000

If you are reading this article, you probably care about becoming a better developer and producing at a more efficient rate. It is not always clear, however, what the best means of measuring performance in this field are. Unfortunately, the number of dev.to articles consulted hasn't (yet) gained much traction as a reliable performance metric, but there are some existing measurements that can be explored.

As part of their 2017 Developer Survey, Stack Overflow asked users to pick the most important ways to evaluate a project:

While these results aren't groundbreaking, there are still some neat trends. Developers serve the customer (72%) and the product (41%) before sales numbers (28%), perhaps with the understanding that perfectly good products can underperform financially and vice versa.

They also value efficiency, and for that reason questionable counting statistics like "Lines of code" (6%), "Commit frequency" (9%), and "Hours worked" (16%) aren't that highly valued. In terms of impressing the right people, devs trust their immediate peers (55%) more than their superiors (36%) or even themselves (35%).

So how do you make sure to hire people that will execute on these priorities? The survey also asked how employers should recruit and evaluate potential hires:

Interestingly, the top two answers, "Communication skills" (4.10 on a five-point scale) and "Track record of getting things done" (4.09), aren't usually explicitly quantifiable criteria. They're also things you can get across before even landing an interview, through a strong cover letter and resume, respectively. Of course, hard skills are also very important, as we see knowledge of algorithms, data, and frameworks filling out the next two top spots.

Once you've picked the right people, you need to ensure they're collaborating effectively, which is why Stack Overflow also asked about favored development practices:

Of the three questions, there's the least parity in this graph. The majority of programmers appreciate the speed and flexibility offered by agile methods (77%) and scrums (65%) over older school approaches like waterfall (27%) that can feel slow-footed and unresponsive in today's accelerated environment.

In total, this section of the survey serves as a drive-by blueprint for getting things done. Hire programmers that communicate well and have documented successes with the right tools; allow them to iterate often and adapt quickly; and worship the customer or user's experience while understanding time and budget constraints.

It's almost as simple as...