DEV Community: Peter Miller

Mapbox's Sheet Mapper with D3

Peter Miller — Wed, 13 May 2020 15:58:32 +0000

As Killed By Google dramatically illustrates, Google frequently creates and then shuts down products and APIs. This culture of rapid evolution can leave folks behind if they are not paying attention. Today, I'm taking a look at how this could happen for a nice tool from Mapbox and how to quickly avoid it.

In the spring of 2020, Mapbox introduced Sheet Mapper, a tool that displays points of interest (POI) on a map. That covers the mapper part; for the sheet part, the tool uses Tabletop.js to read POI data from a Google Sheet. All told, a quick and dirty visualization tool that requires little programming background to get started.

Subsequently, Mapbox revised Sheet Mapper with Sheet Mapper Advanced, adding caching with S3 and Lambda. In Sheet Mapper Advanced, the app reads POIs from CSV files on S3 using the D3.js library. A more scalable and robust solution for sure.

However, the original Sheet Mapper is still a fun tool that unfortunately is going to break as of September 2020 when Google turns off the version of the Sheets API that Tabletop.js uses. The creator of Tabletop.js has a short section in the readme with a workaround using the Papa Parse library.

I created a repo on GitHub, replacing Tabletop.js with D3.js to follow Sheet Mapper Advanced. The changes were minimal and copied from Sheet Mapper Advanced: publishing the Sheet as CSV and using d3.csv() to load the data.
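As a rough sketch of that change (not the repo's actual code): once the Sheet is published as CSV, d3.csv() resolves to an array of row objects keyed by the header row, and those rows can be turned into GeoJSON points for the map. The column names Name, Latitude and Longitude and the csvUrl placeholder are assumptions for illustration.

```javascript
// Convert parsed CSV rows (objects keyed by column header) into a GeoJSON
// FeatureCollection. The column names used here are hypothetical.
function rowsToGeoJSON(rows) {
  return {
    type: "FeatureCollection",
    features: rows
      .map(function (row) {
        return {
          row: row,
          lng: parseFloat(row.Longitude),
          lat: parseFloat(row.Latitude)
        };
      })
      // Skip rows whose coordinates did not parse as numbers.
      .filter(function (p) {
        return Number.isFinite(p.lng) && Number.isFinite(p.lat);
      })
      .map(function (p) {
        return {
          type: "Feature",
          // GeoJSON orders coordinates as [longitude, latitude].
          geometry: { type: "Point", coordinates: [p.lng, p.lat] },
          properties: { name: p.row.Name }
        };
      })
  };
}

// In the browser, with D3 v5 or later loaded (csvUrl is a placeholder for
// the published-CSV link of your Sheet):
// d3.csv(csvUrl).then(function (rows) {
//   var geojson = rowsToGeoJSON(rows);
//   // hand geojson to the map here
// });
```

Keeping the conversion in a plain function means it can be checked without a browser or a live Sheet.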

You can see the result online at https://phm200.github.io/sheetmapper-d3/ and check out the repo for more details.

phm200 / sheetmapper-d3

Mapbox's Sheet Mapper using D3.js instead of Tabletop.js

Mapbox's Sheet Mapper impact tool is a quick way to:

Create a live-updating map that displays the locations of all of your POIs or events, powered by a simple spreadsheet.

Mapbox's code template (as of publishing this repo on May 13th, 2020) uses Tabletop.js to read data from a Google Sheet.

As of September 2020, Google is retiring v3 of the Sheets API and Tabletop.js will no longer work. The creator recommends using the Papa Parse library instead.

Given that Mapbox uses D3.js for the Sheet Mapper Advanced impact tool, I updated Sheet Mapper to follow suit and published this code as reference.

See a live preview at: https://phm200.github.io/sheetmapper-d3/

NOTE: With the change to D3.js, this sample cannot be run directly from an HTML file on your local disk. You need to run it off a server, local or online.

Range and T-Shaped Things

Peter Miller — Wed, 08 Apr 2020 16:17:03 +0000

One of the fascinating points in his wonderful book, Range: Why Generalists Triumph In A Specialized World, is when David Epstein explains why a "T-shaped" person with a broad base of knowledge across many subjects is often more innovative and makes better decisions within their area of specialty.

Epstein finds that a breadth of knowledge encourages a person to make connections, think laterally and avoid the group-think of other experts in their field. This agile thinking is particularly important for solving what Epstein calls "wicked" problems: problems that are unique, complex and not rule-bound. This describes a lot of the most important problems out there and is in contrast to "kind" problems, which have well-defined rules and are repeatable.

For example, a golf swing is kind (despite what many golfers may think) because the general parameters remain the same every time. You aren't often given a shovel to use in the back 9. You can practice and perfect the relevant techniques.

Examples of wicked problems include geo-political conflicts and the famous examples of decision making gone wrong like the Challenger space shuttle disaster. Think novel situations with imperfect information and clouded motivations.

T-Shaped Software Design

For (much, much) lower stakes, the concepts of T-shaped and having broad range helped me articulate an advantage I thought my previous company had while working on contact center projects. We were brought in as subject matter experts on implementing contact center software like Twilio. Clients expected us to implement a contact center that was cost-effective in a timely fashion.

What ended up frequently happening was that our teams would dig into the problem as presented and surprise the client by suggesting non-contact center solutions. For example, we'd suggest ways to improve an app experience or streamline a process such that there was less need for customers to even use the contact center.

Our advantage here was having teams with broad consulting backgrounds who didn't live and breathe just contact center all the time. Having developed lots of types of apps for lots of business scenarios, we had the range and the T-shaped knowledge to connect our current problem to past ones and propose novel solutions. And it was really the combination of deep knowledge about the contact center platform and broad knowledge of other contexts that gave us the edge.

Range has a lot more going for it that I didn't cover here. Give it a spin and I think you'll learn even more about how diversity of thought, experience and background can give you and your teams an edge as well.

As always, happy reading!

Getting To Yes: A Classic with a Caveat

Peter Miller — Fri, 20 Mar 2020 18:53:54 +0000

Roger Fisher and William Ury's book, Getting to Yes: Negotiating Agreement Without Giving In, is a classic study of how to conduct successful negotiations. It has gone through several revisions over the past 30 years and continues to be a best seller.

As the authors address in the foreword, folks doing creative work sometimes dismiss negotiation as a specialized activity and not part of our everyday life. Instead, the authors point out how we are constantly negotiating, not just on a contract, but on work-life balance, how to divvy up work for a project, how to co-exist in a relationship.

Having established the prevalence of negotiation, the authors address another misperception, which is that negotiation is all about tips and tricks to push the other side into accepting your demands. This is the stereotype of the shady salesperson who uses psychological tactics to pressure you into moving off your price to theirs. Rather than this type of behavior being central to successful negotiation, the authors identify bargaining over positions as the problem with most negotiation. Instead, they recommend focusing on the underlying interests behind given positions, focusing on the problem, not the people, looking for mutual gains and using objective criteria to evaluate options.

If you've done any software consulting or product development, you've probably done a lot of negotiation and I'll illustrate a few of these tips with some anecdotes from my past experiences.

Focus on Interests, Not Positions

While working on a contact center implementation for a customer, I was tasked with gathering requirements and implementing a solution for displaying real-time metrics. The customer's opening ask was 30 to 40 metrics, each of which "had to be" updated every few seconds. This was going to be a large effort and a natural way to proceed would have been to haggle down the number of metrics between the two sides.

My team dug deeper into why those metrics were needed in that time frame. Turns out the initial ask was a wishlist from several groups who thought they were listing just what their wildest dreams would be. That got translated into a hard requirement along the way. Once we got past that, we could focus on the customer's interest, around managing the agent population against activity surges and populating performance reports. From there we prioritized and shrank the work ahead of us and could move forward. Instead of trying to win by getting the list down to 20 instead of 40, we exited that bad game and moved into a better one.

Disentangle the People from the Problem

Consulting projects can be tough. Sometimes you end up with a "difficult" client, where interactions are fraught and the project life is stressful. In these projects, it is hard to get through negotiations. The relationship is not set up for a natural give and take. I was in such a project a while ago and even the most basic discussions of ongoing work were difficult.

Personal insults were flung around and I was tempted to throw up my hands and write the project off, but I got some good advice from a colleague along the lines of Fisher and Ury's book: focus on the problem, not the people. In this case, we paused to take an inventory of how we were running the project vs how the client typically worked. We figured out that a major sticking point was that what we perceived as being open about our development process seemed confusing and unstructured to them. This led to a lack of trust in our work.

We started providing extremely clear and structured updates on what we were doing; we got way stricter on how meetings were run and on only discussing the work items in flight. These adjustments helped rebuild the client relationship. Our acting more "rigid" was actually a comfort to the client. The same people were no longer as difficult and the project could move forward.

The Value of Walking Away

The only critique I have of Fisher and Ury's book is that it does not focus enough on the value of walking away. It is a book focused on solutions, on making negotiations work and has a hopeful tone throughout. The authors acknowledge that it can be difficult to negotiate this way if the other party is not willing, but only glance off the possibility of the best outcome being a failed negotiation with the idea of a BATNA or best alternative to a negotiated agreement.

A BATNA lets you evaluate how far you are willing to go in a negotiation, and how good the outcome really is, with the implication that you should walk away if the BATNA is better. However, the BATNA is situated in the context of two parties where one party is more powerful or unwilling to negotiate.

From my past experience, even a negotiation in good faith, with parties looking for the win-win, can end up the worse for all sides. It can be rushed, circumstances can change midway through, or any number of other things can go wrong. I'd love to see more from these authors on these types of scenarios.

As always, happy reading!

Software Lessons from Scarcity

Peter Miller — Fri, 13 Mar 2020 14:57:15 +0000

Sendhil Mullainathan and Eldar Shafir's book Scarcity: The New Science of Having Less and How it Defines Our Lives is a wonderful achievement and a great read for anyone with an interest in psychology and behavioral economics. Mullainathan and Shafir present a novel frame for the common problem of scarcity, of not having enough money, time or other resource.

Their insight is that scarcity taxes our attention, what they call a bandwidth tax, and causes us to narrowly focus our (compromised) attention on the most immediate problem ahead, what they call tunneling. The result is consistent and predictably poor decision making by those facing scarcity. Poor decision making hinders getting more of the scarce resource and so in the end, scarcity systematically creates more scarcity.

Mullainathan and Shafir use descriptive anecdotes and experimental data to support their thesis. Again, worth your time to check out if you liked books like Thinking, Fast and Slow by Daniel Kahneman or Freakonomics by Steven D. Levitt and Stephen J. Dubner.

The idea of scarcity is an interesting lens to apply to software development as well. Here are a few thoughts that came to mind, some conventional wisdom, some not, all informed by scarcity.

Organizations and teams are right to cautiously adopt new technologies

As a long-time software consultant, I'm used to hearing complaints, and complaining myself, about a client that seems stuck in the mud, intent on using what seems like a Stone Age tech stack. How can they not realize how much better, cooler, faster new technology X is?

From a scarcity perspective, sticking with a known solution can be a smart strategy. In the context of a consulting project, it is often the case that the team's bandwidth is limited by time or money. Operating under this bandwidth tax, the team doesn't have enough capacity to fairly evaluate a new tech approach, in addition to implementing the specific deliverable. In contrast, if those tech decisions are already made, the team can focus their limited attention on the business value.

This does not mean orgs and teams are always right to be cautious. Creative organizations will find a way to give the right people enough time and support to make a well reasoned evaluation of new technology. Organizations that want to innovate can also build more slack into their timelines. Slack is a critical way to mitigate the mistakes that come about in a scarce environment. A team that has enough time to make a mistake is often one that can learn from it.

One person cannot shape and implement at the same time

I've long been a huge fan of Basecamp, formerly known as 37signals. Recently, Basecamp released guidance on their software development lifecycle, Shape Up. A key facet of their process is that a small group shapes (defines) the parameters of a small cycle of work and then another group implements that work. Once the pitch for the work is complete and approved, the implementing team has freedom within the pitch definition to implement it.

Having these two tracks, of shaping and building, makes perfect sense from a scarcity perspective. When leading a team to build an application, I'm focused (tunneled) into the implementation. If I'm also trying to figure out what we're building, one is going to get short-changed. Our brains are good at focusing and we can be incredibly productive in the tunnel, but at the expense of items outside it. Whatever track we are on, our brain wants to get back to it and will shortchange the other track.

This insight seems mundane on the surface, but in my experience it is quite common for technical leads on projects to be in charge of both implementing the current phase and planning the next one. While they may have the skill to do both, the expectation that those different tracks can occur in parallel without a loss of quality in one or the other is misleading.

Even LeBron James needs rest days

If NBA teams were run more like software projects, then star players like LeBron James would never be given a rest day, or limited minutes. Why would you take your best performer off the court? What seems obvious in a physical undertaking like basketball, that overwork, a scarcity of rest, leads to injury or poor performance, is just as true for mental work. From Scarcity:

...our effects [of scarcity] correspond to between 13 and 14 IQ points... losing 13 points can take you from "average" to a category labeled "borderline deficient"

More to the point of high performers, the (temporary) loss of IQ is also enough to take someone from "superior" to "average". This effect has nothing to do with that person's inherent grit or toughness. Put the same person in a better, more abundant situation and they can perform to their potential.

To put it another way, when a team is told to consistently put in extra hours, the implicit message is that we are no longer concerned about the quality of the work, we just hope to get the work done at any quality level in a given calendar timeframe. For software consultancies that differentiate on quality of work, this doesn't sound too appealing.

There's a lot more in Scarcity that I didn't cover here, and I'm sure there are other sources that provide contrary lenses on these points. Keep reading and learning, and as always leave me comments and questions below. Thanks!

More on Geohashing: Covering and an updated DynamoDB library

Peter Miller — Wed, 05 Feb 2020 20:54:06 +0000

This is the third part in a series of posts. See the links above to jump to the prior posts.

S2 vs. H3 Covering

In my last post I talked through how we can use S2 cells to fill or cover a shape on the map, in this case a circle within which we want to search for items of interest. I also mentioned that Uber came up with H3 cells, a technique for covering a shape on the map with hexagons.

The images below show S2 and H3 coverings of the same area in Washington, DC.

S2 Covering (shown in blue)

H3 Covering (shown in red)

The S2 covering is composed of cells of different sizes or levels. Some are larger cells that fit mostly into the circle, others are smaller cells that help cover the edges. This means fewer cells overall are needed to cover the space vs the H3 example, albeit with the downside of a more uneven covering compared to what we see in H3.

With that somewhat uneven covering, we can see how important that distance calculation is in the final step of our S2 geo-search from the prior post. It excludes those points of interest that came from the outer edges of covering cells.

These images also illustrate how S2 cells can nest. Any given S2 cell fits completely inside its parent, the larger cell one step up the hierarchy. Nesting allows us to calculate coverings with diverse cell sizes. It would also allow us to roll up location-based data into larger aggregates. If data is stored at a small (high level) cell, then we can roll that data up to a neighborhood, city or region level by summing the data from all the child cells in a bigger cell that covers the target area.

In contrast, while H3 provides a more precise covering (at least at a high resolution), hexagons do not nest. As we scale up levels, the hexagons overlap incompletely. So we cannot have a meaningful H3 covering with diverse hexagon sizes. Without nesting, aggregation is not as automatic. While we can still roll up data from the smaller hexagons, we cannot map it directly to larger ones.

Dash-Labs dynamodb-geo

While watching a brilliant discussion on Twitch between DynamoDB gurus Rick Houlihan and Alex DeBrie, at around an hour in, Rick dropped a quick reference to an AWS customer who had taken the DynamoDB Geo library I looked at last post and improved upon it. Thanks to Twitch user switch184 in the comments for pointing me to the repo at: https://github.com/Dash-Labs/dynamodb-geo

Dash-Labs / dynamodb-geo

Geo Library for Amazon DynamoDB

This library was forked from the AWS geo library.

Following limitations with the aws geo library were the main reasons that necessitated this fork:

Usage required a table’s hash and range key to be replaced by geo data. This approach is not feasible as it cannot be used when performing user-centric queries, where the hash and range key have to be domain model specific attributes.
Developed prior to GSI, hence only used LSI
No solution for composite queries. For e.g. “Find something within X miles of lat/lng AND category=‘restaurants’;
The solution executed the queries and returned the final result. It did not provide the client with any control over the query execution.

What methods are available for geo-querying in this library?

Query for a given lat/long
Radius query
Box/Rectangle query

All of the above queries can be run as composite queries, depending on their…

While this fork of the original dynamodb-geo is in Java, not JavaScript, it is worth your time to take a look. There are many improvements, which are summarized in the repo's readme.

One bit of documentation that stuck out to me was more discussion on hash key length, including the number of queries the library produces for a radius search with a given hash key. Something that tripped me up at first is that the geohash key they are talking about here is the first X digits of an S2 cell id, and the length of that key does not match 1:1 to the level of the cell. As the Dash-Labs repo suggests, a 5 or 6 digit long geohash key is well suited for near proximity searches. I still struggle with understanding the math behind that assertion, but the sample query results are convincing.

Thanks for reading and happy geohashing!

The Problem of Nearness: Part 2 - A Solution with S2

Peter Miller — Fri, 24 Jan 2020 19:32:00 +0000

This is part 2 of a series. For the full context, see the first post.

We covered a lot of theory and math around calculating distance last time. For today's post, we will focus on some of the details of an implementation I referenced.

Caffeinate-Me by @jbesw

Location-based search results with DynamoDB and Geohash [Medium], Caffeinate me! Build a serverless app to find the nearest Starbucks [Medium] (https://medium.com/swlh/caffeinate-me-build-a-serverless-app-to-find-the-nearest-starbucks-54512124e639), and, if you are interested in how it scales, Will it scale? Let’s load test geohashing on DynamoDB [Medium]
Using geospatial searches with DynamoDB [YouTube] and Caffeinate me! Using VueJS to query your API Gateway [YouTube]
Caffeinate-Me backend API repo [Gitlab] and front-end Vue JS repo [Gitlab]

James Beswick is a developer advocate at AWS for Serverless. He wrote three fantastic posts about using DynamoDB to implement location-based searches, accompanied by explanatory videos and the implementing Git repos. Please take a moment to check out his work; it is excellent. I learned a lot playing around with it and also found a few interesting items that I highlight below.

"Geohash" and Google's S2 Geometry

My prior post talked about geohashing, specifically the canonical implementation of GeoHash by Gustavo Niemeyer using alphanumeric hashes to address a grid of nested squares and rectangles that covers the Earth.

The DynamoDB geo library that James' Starbucks locator uses does not use that geohash algorithm. Instead, it uses Google's S2 Geometry for addressing locations. I promised less math, so the big takeaway to focus on here is that points of interest in S2 are placed in cells that, like our squares and rectangles from geohash, are nested and cover the Earth's surface. S2 cells are addressed by 64-bit integers (not alphanumeric strings) and certain distance and covering calculations are much faster than with GeoHash.

For a more detailed look at how S2 works, including a fun animated gif of the Hilbert Curve, check out Christian Perone's post on S2.

An important concept in S2 is "covering" geographic shapes. This means identifying the neighboring cells that, when tiled together, fill (or nearly fill) the specified shape. For (math) reasons, generating a range of covering cells can be done very quickly with S2. To nerd out a bit more on S2 vs. geohash, check out this post from Fabrice Aneche. Google's presentation on S2 has a nice visual of what covering (referred to as approximating regions here) looks like in practice.

When I was trying to reach the end of the internet while writing this post, I came across an additional geohashing or spatial library. As any old school war-gamer knows, hexagonal grids are way cooler than square grids, and sure enough, Uber created H3, a "hexagonal hierarchical geospatial indexing system". I'll leave the details for you to look into if you are interested, but Uber states it is a good fit for use cases like analysis of "locations of cars in a city". If that's your thing.

Now that we know we are working with S2, let's check out some of the details of James' Caffeinate-Me app.

Finding Items within a Circle

When using the Caffeinate-Me app, you click around a map and are shown all the Starbucks that are within a circle centered on your click. The code to get the Starbucks within that circle is shown below (from query.js):



myGeoTableManager
  .queryRadius({
    RadiusInMeter: 1000,
    CenterPoint: {
      latitude: 40.7769099,
      longitude: -73.9822532
    }
  })

The queryRadius method on the GeoDataManager.js shows how the dynamodb-geo package breaks down this request:



    /**
     * @param queryRadiusInput
     *    Container for the necessary parameters to execute radius query request.
     *
     * @return Result of radius query request.
     * */
    GeoDataManager.prototype.queryRadius = function (queryRadiusInput) {
        var _this = this;
        var latLngRect = S2Util_1.S2Util.getBoundingLatLngRectFromQueryRadiusInput(queryRadiusInput);
        var covering = new Covering_1.Covering(new this.config.S2RegionCoverer().getCoveringCells(latLngRect));
        return this.dispatchQueries(covering, queryRadiusInput)
            .then(function (results) { return _this.filterByRadius(results, queryRadiusInput); });
    };

In pseudo-code:

(Line 9) Get a rectangle that defines the min and max latitude and longitudes of a bounding box that encloses a circle of the specified RadiusInMeter from the center point
(Line 10) Get a collection of S2 cell addresses (hashes) that cover this rectangle of space
(Line 11-12) Query DynamoDB to retrieve the Starbucks within the specified S2 cells and then drop the Starbucks that were part of the covering rectangle, but beyond the radius of the circle
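To make that last "drop what is beyond the radius" step concrete, here is a minimal sketch of a radius filter. It is not the library's filterByRadius implementation: it swaps in the haversine great-circle formula with an approximate mean Earth radius, and the function names and the two sample points are made up for illustration.

```javascript
// Approximate mean Earth radius in meters (an assumption of this sketch).
var EARTH_RADIUS_METERS = 6371000;

function toRadians(degrees) {
  return degrees * Math.PI / 180;
}

// Great-circle distance in meters between two { latitude, longitude } points,
// using the haversine formula.
function haversineMeters(a, b) {
  var dLat = toRadians(b.latitude - a.latitude);
  var dLng = toRadians(b.longitude - a.longitude);
  var h = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
    Math.cos(toRadians(a.latitude)) * Math.cos(toRadians(b.latitude)) *
    Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(h));
}

// Keep only the points that fall inside the circle around the center.
function filterByRadiusSketch(points, center, radiusInMeter) {
  return points.filter(function (point) {
    return haversineMeters(center, point) <= radiusInMeter;
  });
}

var centerPoint = { latitude: 40.7769099, longitude: -73.9822532 };
var candidatePoints = [
  { name: "near", latitude: 40.7819099, longitude: -73.9822532 }, // about 0.005 degrees north
  { name: "far", latitude: 40.7969099, longitude: -73.9822532 }   // about 0.02 degrees north
];
console.log(filterByRadiusSketch(candidatePoints, centerPoint, 1000));
```

The covering cells only approximate the circle, so a final pass like this is what removes points that sit in a covering cell but outside the search radius.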

The S2 library handily takes care of the details of #1 and #2. More specifically, to go back to questions from my first post, that getCoveringCells method is figuring out the neighboring geo-bins (cells). Like with GeoHash, S2 cells have different levels, from 0 (huge) to 30 (1cm squared).

By default, the S2 library will attempt to return 8 S2 cells (possibly at different levels) to cover the given shape. This creates some work for the dispatchQueries method, which has to generate one or more DynamoDB queries per covering cell:



GeoDataManager.prototype.dispatchQueries = function (covering, geoQueryInput) {
        var _this = this;
        var promises = covering.getGeoHashRanges(this.config.hashKeyLength).map(function (range) {
            var hashKey = S2Manager_1.S2Manager.generateHashKey(range.rangeMin, _this.config.hashKeyLength);
            return _this.dynamoDBManager.queryGeohash(geoQueryInput.QueryInput, hashKey, range);
        });
        return Promise.all(promises).then(function (results) {
            var mergedResults = [];
            console.log(results);
            results.forEach(function (queryOutputs) { return queryOutputs.forEach(function (queryOutput) { return mergedResults.push.apply(mergedResults, queryOutput.Items); }); });
            return mergedResults;
        });
    };

In pseudo-code:

(Line 3) Get a collection of geohashes of the table's hash key length that encompasses the covering S2 cells
(Line 4-5) Set up a DynamoDB query that uses the hash key as the partition key and the covering S2 cell address range as the range key condition, to get all the Starbucks in that part of the covering region
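The merge at the end of dispatchQueries is a nested loop over an array of arrays: one entry per hash range, each holding one or more query outputs with an Items list. As a small sketch of the same flattening (not the library's code, and assuming a runtime that has Array.prototype.flat and flatMap), with made-up sample data:

```javascript
// results: one array per hash range, each containing query outputs that
// carry an Items array. Returns all Items in a single flat array.
function mergeQueryOutputs(results) {
  return results
    .flat() // array of arrays of outputs -> array of outputs
    .flatMap(function (queryOutput) { return queryOutput.Items; });
}

// Made-up data in the same shape as the Promise.all result above.
var sampleResults = [
  [{ Items: ["store-1", "store-2"] }],
  [{ Items: ["store-3"] }, { Items: [] }]
];
console.log(mergeQueryOutputs(sampleResults));
```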

That's a lot to take in. For a concrete example, I added some logging to the library, and from a center point in New York City, I got 8 covering S2 cells, which mapped 1:1 to 8 DynamoDB queries. For example, one query used hash key -85201 with a sort key of S2 cell ids between -8520150788008312831 and -8520141991915290625. Another used hash key -85199 with a sort key of S2 cell ids between -8519982196314865663 and -8519982196180647937.
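The hash key in these queries is the leading digits of the cell id at the start of the range. A minimal sketch of that truncation, using BigInt because the ids exceed JavaScript's safe integer range; hashKeyPrefix is a name invented for this example and is not the library's generateHashKey:

```javascript
// Keep the first hashKeyLength decimal digits of a 64-bit cell id,
// preserving the sign. Cell ids are passed as BigInt values.
function hashKeyPrefix(cellId, hashKeyLength) {
  var magnitude = cellId < 0n ? -cellId : cellId;
  var digitCount = magnitude.toString().length;
  if (digitCount <= hashKeyLength) {
    return cellId;
  }
  // BigInt division truncates toward zero, which drops the trailing digits.
  return cellId / 10n ** BigInt(digitCount - hashKeyLength);
}

console.log(hashKeyPrefix(-8519982196314865663n, 5)); // -85199n
```

A longer key spreads a search across more, smaller partitions, while a shorter key sends more of the search to fewer partitions.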

This is where selecting the length of the hash key becomes important. As James explains in his post, based on the radius of the circle you are searching in and length of that key, the number of queries against DynamoDB and how much you are hammering individual partitions (hash keys) can vary dramatically.

The dynamodb-geo library defaults to a 6 digit hash key. James uses a 5 digit hash key in his example. A rule of thumb for these mostly local types of searches seems to be 5 to 7 digits.

There's a lot more to dig into on S2, the dynamodb-geo library and spatial searches, but for now, let's call it a day. Please reach out with any comments or questions. Happy geo searching!

The Problem of Nearness: Part 1 - Geohash

Peter Miller — Mon, 20 Jan 2020 21:02:12 +0000

updated 2020-01-24
-noted that the example in part 2 will use S2, not Geohash
-removed "relational" term when discussing spatial data types

Google Maps, Yelp, and Meetup all can help us answer a variation on the same question of "what X is nearest to Y". What subway stop is closest to my friend's house, where's the nearest sushi place, who else is interested in hiking in my town, etc. These sites use geographic data about points of interest, and our current location or a location of our choosing to calculate what's close.

In this post, I'll discuss how we can make these types of calculations in our own apps. Because of the number of points in these datasets and the number of visitors querying the datasets, we will be looking at not only how to make these calculations, but also how to make them faster and more efficient.

We'll start by looking at points of interest on a line, then move our way up to 2d planes and then a globe. Along the way we'll encounter geohashing, an elegant solution to the problem of nearness.

Staying Close in Lineland

Let's imagine a world with one dimension. All points are described by a single value, the relative position on that single great line of existence. To avoid any unfortunate incidents, we can stay in our 3d world and just observe this lineland from afar.

We see 5 points of interest on the line, with the following positions:



Place of Interest: Position on Line
"The Red Fox Tavern": -5
"Charlie's Chicken Shack": -3
"Smoothie Town": 0
"The Wilted Cauliflower": 2
"Eggplant Paradise": 4

With only one dimension, the distance between any two points in lineland is simply the absolute value of the difference in their positions. If we want a lineland Yelp clone to answer the question of what restaurants are within 5 units of my house at position 1, we can use a brute-force approach by calculating the distance from every point to the center point:



Distances from Points of Interest to 1
"The Red Fox Tavern": -5 => |1 - (-5)| => 6
"Charlie's Chicken Shack": -3 => |1 - (-3)| => 4
"Smoothie Town": 0 => |1 - 0| => 1
"The Wilted Cauliflower": 2 => |1 - 2| => 1
"Eggplant Paradise": 4 => |1 - 4| => 3

We find four restaurants in the specified range; only The Red Fox Tavern, at a distance of 6, is too far away.
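A brute-force version of this search might look like the sketch below. It takes the absolute value of the difference so that a distance is never negative, whichever side of the center a point sits on. The function and variable names are made up for this example.

```javascript
// Points of interest in lineland: a name and a position on the line.
var linelandPlaces = [
  { name: "The Red Fox Tavern", position: -5 },
  { name: "Charlie's Chicken Shack", position: -3 },
  { name: "Smoothie Town", position: 0 },
  { name: "The Wilted Cauliflower", position: 2 },
  { name: "Eggplant Paradise", position: 4 }
];

// Brute force: measure every point against the center, then keep the
// ones that are close enough.
function withinDistance1d(places, center, maxDistance) {
  return places
    .map(function (place) {
      return { name: place.name, distance: Math.abs(center - place.position) };
    })
    .filter(function (place) {
      return place.distance <= maxDistance;
    });
}

console.log(withinDistance1d(linelandPlaces, 1, 5));
```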

As our dataset of points of interest grows larger, calculating the distance from every point to the center point is more and more work and our algorithm gets slower (or we parallelize the calculation and spend more computation in exchange for time).

To keep the size of the dataset manageable in our example, we can modify our algorithm to return all restaurants in the position range of -4 through 6, taking the farthest allowable distance to the left through the farthest allowable distance to the right.
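One way to sketch that range lookup, assuming the points are kept sorted by position (as an index in a data store would keep them): a binary search finds the left edge of the range, and a short scan collects everything up to the right edge. The helper names are invented for this example.

```javascript
// Places sorted ascending by position; the binary search depends on this.
var sortedLinelandPlaces = [
  { name: "The Red Fox Tavern", position: -5 },
  { name: "Charlie's Chicken Shack", position: -3 },
  { name: "Smoothie Town", position: 0 },
  { name: "The Wilted Cauliflower", position: 2 },
  { name: "Eggplant Paradise", position: 4 }
];

// Index of the first place whose position is >= target.
function lowerBound(places, target) {
  var lo = 0;
  var hi = places.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (places[mid].position < target) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

// All places with center - maxDistance <= position <= center + maxDistance.
function placesInRange(places, center, maxDistance) {
  var result = [];
  var start = lowerBound(places, center - maxDistance);
  for (var i = start; i < places.length && places[i].position <= center + maxDistance; i++) {
    result.push(places[i]);
  }
  return result;
}

console.log(placesInRange(sortedLinelandPlaces, 1, 5));
```

Only the points inside the range are ever touched, so the cost grows with the size of the answer rather than the size of the whole dataset.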

Despite its range of culinary delights, there's not much going on in lineland, so let's bump up another dimension.

Infinite Planes and the Distance Formula

We are now in the world of two dimensions, with the familiar X and Y coordinate system of geometry and mathematics. We plot points of interest on an infinite plane, with a horizontal X position and vertical Y position.

In our plane world, we see the same 5 restaurants from lineland, but with positions composed of X and Y pairs:



Place of Interest: (X, Y) coordinates
"The Red Fox Tavern": (-5,8)
"Charlie's Chicken Shack": (-3,-1)
"Smoothie Town": (0,-4)
"The Wilted Cauliflower": (2,7)
"Eggplant Paradise": (4,-6)

With two dimensions, the distance between any two points is calculated using a derivation of the Pythagorean theorem, the Distance formula, which is the square root of the sum of the squares of the differences:

distance = sqrt((x2 - x1)*(x2 - x1) + (y2 - y1)*(y2 - y1))

In our 2d Yelp clone, we want to find restaurants within 5 units of my house at position (1,2). Again, we start with brute-force and do the calculation for every point to the center point:



Distances to (1,2)
"The Red Fox Tavern": (-5,8) => sqrt((-5 - 1)*(-5 - 1) + (8 - 2)*(8 - 2)) => sqrt(36+36) => 8.49
"Charlie's Chicken Shack": (-3,-1) => sqrt(16+9) => 5
"Smoothie Town": (0,-4) => sqrt(1+36) => 6.08
"The Wilted Cauliflower": (2,7) => sqrt(1+25) => 5.10
"Eggplant Paradise": (4,-6) => sqrt(9+64) => 8.54

Doing this calculation for 5 points is no big deal, but as the dataset grows larger, our algorithm needs more and more compute capacity and/or time. As with lineland, we address scalability by limiting the size of the dataset. The fewer points to feed into the distance formula, the faster.
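As a minimal sketch, the brute-force pass over the plane-world restaurants could look like the Python below. The names POINTS, distance and brute_force_distances are mine, made up for illustration, and the center point (1, 2) is the house from the example.

```python
import math

# Points of interest from the plane-world table: name -> (x, y).
POINTS = {
    "The Red Fox Tavern": (-5, 8),
    "Charlie's Chicken Shack": (-3, -1),
    "Smoothie Town": (0, -4),
    "The Wilted Cauliflower": (2, 7),
    "Eggplant Paradise": (4, -6),
}


def distance(a, b):
    """Distance formula: square root of the sum of the squared differences."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def brute_force_distances(center, points):
    """Distance from the center to every single point, rounded to 2 places."""
    return {name: round(distance(center, xy), 2) for name, xy in points.items()}


if __name__ == "__main__":
    for name, d in brute_force_distances((1, 2), POINTS).items():
        print(f"{name}: {d}")
```

Every point goes through the distance formula here, which is exactly the cost we want to cut down as the dataset grows.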

To limit the size of the dataset, we use a minimum bounding rectangle, or bounding box:

The minimum bounding rectangle (MBR), also known as bounding box (BBOX) or envelope, is an expression of the maximum extents of a 2-dimensional object (e.g. point, line, polygon) or set of objects within its (or their) 2-D (x, y) coordinate system, in other words min(x), max(x), min(y), max(y). The MBR is a 2-dimensional case of the minimum bounding box.

As the diagram below illustrates, we can create a bounding box that is guaranteed to include at least all the points within r units of the center point. Given a center point at (x, y), the four corners of the box will be at:



Upper Right Corner: ((x + r), (y + r))
Lower Right Corner: ((x + r), (y - r))
Lower Left Corner: ((x - r), (y - r))
Upper Left Corner: ((x - r), (y + r))

With these outer bounds defined, we can restrict the set of points we calculate the distance for to be only points between these bounds.

In pseudocode, our approach becomes something like:

Pick a center point and how far out we are going to look (the distance)
Calculate a bounding box around the center point, based on the distance
Find the subset of the points of interest with an X between the min and max X of our bounding box and a Y between the min and max Y of our bounding box
For each point in the bounding box, run the Distance formula to calculate distance from the center point
Return points whose distance to the center point is within the desired distance
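The steps above can be sketched in Python roughly as follows. This is one possible reading of the pseudocode, not a definitive implementation: the function names are hypothetical, and the simple loop over a dictionary stands in for whatever range query your data store would actually run.

```python
import math

# Points of interest from the plane-world table: name -> (x, y).
POINTS = {
    "The Red Fox Tavern": (-5, 8),
    "Charlie's Chicken Shack": (-3, -1),
    "Smoothie Town": (0, -4),
    "The Wilted Cauliflower": (2, 7),
    "Eggplant Paradise": (4, -6),
}


def bounding_box(center, r):
    """Return (min_x, max_x, min_y, max_y) for a box r units around center."""
    cx, cy = center
    return (cx - r, cx + r, cy - r, cy + r)


def find_nearby(center, r, points):
    """Names of points within r units of center, using a box pre-filter."""
    min_x, max_x, min_y, max_y = bounding_box(center, r)

    # Cheap comparisons first: keep only points inside the bounding box.
    candidates = {
        name: (x, y)
        for name, (x, y) in points.items()
        if min_x <= x <= max_x and min_y <= y <= max_y
    }

    # Distance formula only for the candidates that survived the box.
    cx, cy = center
    return sorted(
        name
        for name, (x, y) in candidates.items()
        if math.hypot(x - cx, y - cy) <= r
    )


if __name__ == "__main__":
    print(find_nearby((1, 2), 5, POINTS))
```

For the house at (1, 2) and a distance of 5, the box spans X from -4 to 6 and Y from -3 to 7, so only two of the five restaurants ever reach the distance formula.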

Without getting into the technical details here, most data stores have indices that support range queries to make finding the subset of points within the bounding box efficient.

With that, we're ready to enter the third dimension and think about distances over a spherical object, like say planet Earth.

Spheres, Latitude, Longitude and the Haversine Formula

While we use street addresses to locate places in our everyday lives, under the covers we are all using latitude and longitude coordinates:

The combination of these two components specifies the position of any location on the surface of Earth, without consideration of altitude or depth.

From GISGeography's site: Latitude coordinates are essentially Y-values between -90 and 90 degrees with 0 at the Equator. Longitude values are essentially X-values between -180 and 180 degrees with 0 at the Prime Meridian.

For example, the latitude and longitude of the Washington Monument are 38.8895°, -77.0353°.

Before we move on, it is important to note that calculating the exact distance between points on Earth is far more complex than for points on a two-dimensional plane. Fortunately, for the scenarios we are concerned with, say locating the nearest taco stand, we are OK with simplifying our distance calculations at the expense of precision by making these assumptions:

The Earth is a perfect sphere (it's actually an oblate spheroid)
All points of interest are directly on the surface of the Earth, i.e. we ignore elevation changes

With these assumptions in place, let's take a look at the 5 points of interest from our prior examples, now specified by latitude and longitude coordinates:



Place of Interest: (Latitude, Longitude) coordinates
"The Red Fox Tavern": 32.7549°, -117.1425°
"Charlie's Chicken Shack": 32.7524°, -117.1427°
"Smoothie Town": 32.7376°, -117.1714°
"The Wilted Cauliflower": 32.5229°, -117.1165°
"Eggplant Paradise": 32.5073°, -117.0855°

Me drawing those points on a sphere isn't going to help anyone, but if you are curious, these are all just locations selected around San Diego and Tijuana.

To calculate the distance between points on Earth with latitude and longitude, many software packages use the Haversine formula, a type of great-circle distance calculation. The idea of great-circle distance is:

the shortest distance between two points on the surface of a sphere, measured along the surface of the sphere (as opposed to a straight line through the sphere's interior). The distance between two points in Euclidean space is the length of a straight line between them, but on the sphere there are no straight lines. In spaces with curvature, straight lines are replaced by geodesics. Geodesics on the sphere are circles on the sphere whose centers coincide with the center of the sphere, and are called great circles

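The Haversine formula itself is described in great detail on the Wikipedia page if you want to flex your geometry muscles. Lots of sines and cosines. The phi variables are latitude, the lambda variables are longitude, and r is the radius of the sphere:

d = 2 * r * asin(sqrt(sin^2((phi2 - phi1) / 2) + cos(phi1) * cos(phi2) * sin^2((lambda2 - lambda1) / 2)))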

In our 3d Yelp clone, we want to find restaurants within 5 miles of my house at 32.7584°, -117.1402° (not actually my house). Again, we start with brute-force and do the calculation for every point to the center point (in this case using an online calculator to avoid doing the math myself):



Distances to 32.7584°, -117.1402°
"The Red Fox Tavern": 32.7549°, -117.1425° => 0.276 miles
"Charlie's Chicken Shack": 32.7524°, -117.1427° => 0.44 miles
"Smoothie Town": 32.7376°, -117.1714° => 2.315 miles
"The Wilted Cauliflower": 32.5229°, -117.1165° => 16.339 miles
"Eggplant Paradise": 32.5073°, -117.0855° => 17.649 miles

Many modern data stores have support for storing latitude and longitude pairs, as well as built-in functions for calculating the distance between these points. Typically these features or packages are called Spatial Data support or GIS (Geographic information system) support.
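If you don't have such a data store handy, the Haversine calculation itself is short. Here is a minimal Python sketch under the same perfect-sphere assumption. The Earth radius of 3958.8 miles is my assumption (a commonly used mean radius), so results can differ from a given online calculator by a hundredth of a mile or so.

```python
import math

# Assumed mean Earth radius in miles; other calculators may use another value.
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


if __name__ == "__main__":
    house = (32.7584, -117.1402)
    red_fox = (32.7549, -117.1425)
    print(round(haversine_miles(*house, *red_fox), 3))
```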

Even if it is built into the data store, this type of calculation can get more and more expensive as we increase the size of our dataset. Again, our approach to limiting compute and time is to limit the dataset we operate on.

Conceptually, like with planes, we want to only calculate the distance between points that are within a bounding box around the center point. One method for calculating the edges of such a bounding box using latitude and longitude is described in this paper. The linked paper suggests that this method could be used with a SQL-compliant data store to leverage latitude and longitude indices to select the subset of points in the bounding box and only then calculate the distance. A similar approach, using MySQL specifically, is detailed in this post. I came across many more similar techniques while writing this post, but I haven't tested them, so I can't give a concrete recommendation of one over the other.
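To make the idea concrete, here is a rough Python sketch of one way to compute such a box. I'm not claiming it is the exact method from the paper or post mentioned above. It assumes a perfect sphere with a radius of 3958.8 miles, and it ignores the special cases near the poles and the 180° meridian, so treat it as an illustration only. The function name is hypothetical.

```python
import math

# Assumed mean Earth radius in miles.
EARTH_RADIUS_MILES = 3958.8


def lat_lon_bounding_box(lat, lon, distance_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) in degrees.

    Not valid near the poles or across the 180 degree meridian.
    """
    # Distance expressed as an angle (in radians) at the Earth's center.
    angular = distance_miles / EARTH_RADIUS_MILES

    # Lines of longitude are evenly spaced going north/south ...
    d_lat = math.degrees(angular)
    # ... but lines of latitude get shorter away from the Equator,
    # so the longitude span has to widen with latitude.
    d_lon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))

    return (lat - d_lat, lat + d_lat, lon - d_lon, lon + d_lon)


def in_box(box, lat, lon):
    """True if the point falls inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


if __name__ == "__main__":
    box = lat_lon_bounding_box(32.7584, -117.1402, 5)
    print(box)
    print(in_box(box, 32.7376, -117.1714))  # Smoothie Town
```

As with the plane version, the box only narrows down the candidates. Points near the corners of the box can still be farther away than the search distance, so a Haversine check on the survivors is still needed.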

The solution I want to talk about is called geohashing.

Geohashing

Geohash is a public domain geocode system invented in 2008 by Gustavo Niemeyer[1] and (similar work in 1966) G.M. Morton[2], which encodes a geographic location into a short string of letters and digits. It is a hierarchical spatial data structure which subdivides space into buckets of grid shape, which is one of the many applications of what is known as a Z-order curve, and generally space-filling curves.

from https://en.wikipedia.org/wiki/Geohash

In other words, through geohashing you divide the world into a grid of squares and rectangles, addressed by hash. Any point on the earth (latitude and longitude) has a corresponding hash that fits into one of these grid cells or bins. To see a visual explanation of geohashing, check out this video or this page.

The length of the geohash we use determines how large the bin is. For example, a geohash length of 5 will give us an approximately 5 km by 5 km bin.

Geohashes were originally used as part of a URL-shortening service, but across the industry are now used for spatial indexing and searches as well.

To return to our Yelp-like example scenario of finding nearby restaurants, let's add 5 character geohashes to our points of interest:



Place of Interest: (Latitude, Longitude) coordinates | Geohash
"The Red Fox Tavern": 32.7549°, -117.1425° | 9mudq
"Charlie's Chicken Shack": 32.7524°, -117.1427° | 9mudq
"Smoothie Town": 32.7376°, -117.1714° | 9mudj
"The Wilted Cauliflower": 32.5229°, -117.1165° | 9mu9n
"Eggplant Paradise": 32.5073°, -117.0855° | 9mu8z

Again, my fake house is at 32.7584°, -117.1402°, or 9mudq.
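If you are curious how a latitude and longitude pair turns into a string like 9mudq, here is a small Python sketch of the standard geohash encoding: repeatedly halve the longitude and latitude ranges, record one bit per halving (longitude first), and turn every 5 bits into one base-32 character. The function name geohash_encode is mine; in a real application you would more likely reach for an existing geohash library.

```python
# The geohash base-32 alphabet (digits plus letters, without a, i, l and o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, length=5):
    """Encode a latitude/longitude pair (degrees) as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude

    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1  # upper half: record a 1
            rng[0] = mid
        else:
            bits = bits << 1  # lower half: record a 0
            rng[1] = mid
        use_lon = not use_lon

        bit_count += 1
        if bit_count == 5:  # every 5 bits become one character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(32.7584, -117.1402))  # my fake house
```

Asking for a longer hash just keeps halving the ranges, which is why longer geohashes describe smaller bins and why nearby points tend to share a prefix.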

A naive algorithm to find nearby points of interest would be to return points with the same geohash as the center point. This gets us some of the nearby points of interest, but depending on the distance we want to search, likely leaves out nearby points of interest that are in neighboring geohash bins.

To get a more comprehensive set of points of interest, we can get all the points in the center geohash bin and its 8 immediate neighbors. From there, assuming we want to be precise, we calculate the distance between all the points and the center, dropping the points that are too far away.

Like a bounding box algorithm, a geohash-based algorithm can handle larger datasets by first chopping those datasets into manageable chunks for our distance calculations. Also, while the latitude- and longitude-based bounding boxes described above depend on the data store having support for spatial data, we can manipulate geohashes in any data store, as they are just strings.
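Here is a minimal Python sketch of that idea, with a plain dictionary keyed by geohash string standing in for the data store. The list of bins to search is passed in by hand, since working out the neighboring bins is a topic of its own. The function names and the 3958.8 mile Earth radius are my assumptions.

```python
import math
from collections import defaultdict

# Assumed mean Earth radius in miles.
EARTH_RADIUS_MILES = 3958.8

# (name, latitude, longitude, 5 character geohash) from the table above.
POINTS = [
    ("The Red Fox Tavern", 32.7549, -117.1425, "9mudq"),
    ("Charlie's Chicken Shack", 32.7524, -117.1427, "9mudq"),
    ("Smoothie Town", 32.7376, -117.1714, "9mudj"),
    ("The Wilted Cauliflower", 32.5229, -117.1165, "9mu9n"),
    ("Eggplant Paradise", 32.5073, -117.0855, "9mu8z"),
]


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def build_index(points):
    """Group points by geohash, like an index on a plain string column."""
    index = defaultdict(list)
    for name, lat, lon, geohash in points:
        index[geohash].append((name, lat, lon))
    return index


def find_nearby(index, center, bins, max_miles):
    """Look up only the given bins, then filter by exact distance."""
    results = []
    for geohash in bins:
        for name, lat, lon in index.get(geohash, []):
            if haversine_miles(center[0], center[1], lat, lon) <= max_miles:
                results.append(name)
    return sorted(results)


if __name__ == "__main__":
    index = build_index(POINTS)
    house = (32.7584, -117.1402)
    # Center bin plus one neighboring bin, supplied by hand for this example.
    print(find_nearby(index, house, ["9mudq", "9mudj"], 1.0))
```

Only the points stored under the requested bins ever reach the Haversine calculation, and the lookup itself needs nothing more than string equality.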

The ability to use geohashes without custom spatial data types will come in handy in my next post, where I dig into a solution I studied that uses DynamoDB as a data store, along with some JavaScript functions to implement a simple Starbucks locator. This solution actually uses Google's S2 as a different type of geohash, which I'll explain. In that post, I'll also get into two issues of some importance that I glided past here: first, deciding how big a geohash to use in your application and second, how to find the neighboring geohash bins.

See you then!