DEV Community

Help Me Understand This Vectorized Logic

Ryan Palo on July 26, 2019

I'd like to get some help understanding vectorized operations on multi-dimensional arrays. Specifically, I've got a problem and some code that I t...

Evan Oman • Jul 26 '19

Without getting into the weeds, here are my high-level thoughts:

Given the input data and approach, are you sure that you should be able to get accuracy in the 0.94 range? Does the exercise indicate you should be able to get results like this using the algorithm approach you tried?

This ☝️ point brings up a second important point: it would be useful for you to decompose the problem into its key components and check each of those components separately. Are you sure the loading and broadcasting steps are working the way you think they should? Have you tried your distance function on some simple data and gotten the result you expected? Finally, if all of the above are working as expected, could you put together some data which should receive 100% accuracy, and then test that out?

By looking at each piece individually (ideally with a light-weight unit test) you can start to narrow down the list of potential causes.
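As a rough illustration of the checks described above, here is a minimal sketch of two light-weight tests. The `pairwise_distance` helper is hypothetical (it is not the code from the original post) and assumes each row is one point; the idea is only to show a hand-checkable distance case and a data set that should classify with 100% accuracy.

```python
import numpy as np


def pairwise_distance(points, references):
    """Euclidean distance from every row of `points` to every row of `references`."""
    # (n, 1, m) - (1, k, m) broadcasts to (n, k, m); summing the last axis gives (n, k)
    diff = points[:, np.newaxis, :] - references[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def test_known_distances():
    # Distances chosen so they can be verified by hand (3-4-5 triangle)
    points = np.array([[0.0, 0.0], [3.0, 4.0]])
    references = np.array([[0.0, 0.0], [3.0, 0.0]])
    expected = np.array([[0.0, 3.0], [5.0, 4.0]])
    assert np.allclose(pairwise_distance(points, references), expected)


def test_self_classification_is_perfect():
    # Classifying the training points against themselves must return their own labels
    data = np.array([[0.0, 0.0], [10.0, 10.0], [-5.0, 2.0]])
    labels = np.array([0, 1, 2])
    predictions = labels[pairwise_distance(data, data).argmin(axis=1)]
    assert np.array_equal(predictions, labels)
```

If either test fails, the problem is in the distance or broadcasting step rather than in the data loading.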

Ryan Palo • Aug 1 '19

Thanks for your help. I had been putting off tests, but that was the next step in the plan.

Actually, running through a much simpler case in a REPL ended up doing it for me. See my comment about my solution.

But your comments about going back to debugging basics and slowly and methodically validating one piece of logic at a time were what put me back on the right track, so thanks!

Evan Oman • Aug 1 '19

Glad you were able to figure it out, nice work!

Ryan Palo • Aug 1 '19

I've got an article in the works that walks through this in more detail, but I wanted to post my solution in case somebody ran into the same or similar issue.

The main problem I was having was that the reshape method was the wrong tool for the job. It would give me the right dimensions, but it jumbled up all of the individual numbers and didn't keep the "records" together.

After doing some experimenting in a REPL with simpler cases, I discovered that what I really wanted was swapaxes. This keeps the numbers in the correct order, but allows you to pivot an array into other dimensions (e.g. roll your 2D array into a plane in the third dimension).
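To make the difference concrete, here is a small sketch of the kind of REPL experiment described above. The tiny `neighbors` array and the target shape (1, 2, 3) are assumptions chosen for illustration, not the data from the original problem.

```python
import numpy as np

neighbors = np.array([[1, 2],
                      [3, 4],
                      [5, 6]])  # 3 records, 2 features each

# reshape keeps the flat (row-major) element order and only re-cuts it,
# so values from different records end up grouped together.
reshaped = neighbors.reshape((1, 2, 3))

# swapaxes reorders the axes themselves, so each record stays intact.
swapped = np.swapaxes(neighbors[:, :, np.newaxis], 0, 2)

record_0_reshaped = reshaped[0, :, 0]
record_0_swapped = swapped[0, :, 0]

print(reshaped.shape, swapped.shape)  # (1, 2, 3) (1, 2, 3)
print(record_0_reshaped)              # [1 4]  (mixes records 0 and 1)
print(record_0_swapped)               # [1 2]  (record 0, as intended)
```

Both results have the same shape, which is why the bug is easy to miss: only the contents differ.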

So what I ended up with is:

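import numpy as np


def vectorized_distance(vec1, vec2):
    # Euclidean distance: sum the squared differences over the feature axis (axis 1)
    return np.sqrt(np.sum((vec2 - vec1)**2, axis=1))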


def nearest_neighbor_classify(data, neighbors, labels):
    # Reshape data so that broadcasting works right
    data_rows, data_cols = data.shape
    data = data.reshape((data_rows, data_cols, 1))
    neighbor_rows, neighbor_cols = neighbors.shape
    flipped_neighbors = np.swapaxes(
        neighbors.reshape((neighbor_rows, neighbor_cols, 1)),
        0, 2)

    # Now data should be (n x m x 1) and flipped_neighbors (1 x m x k),
    # where n and k are the row counts of data and neighbors.
    # Broadcasting should produce an (n x m x k) array, but `np.sum` will
    # squash axis 1 so we get an (n x k) point-to-neighbor distance matrix

    distances = vectorized_distance(data, flipped_neighbors)

    # The index of the smallest value for each row is the index of the prediction
    closest_neighbor_indices = distances.argmin(axis=1)

    return labels[closest_neighbor_indices]
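For anyone following along, the shapes at each step can be checked with a short sketch like the one below. The sizes (4 points, 3 neighbors, 2 features) and the filler values are made up for illustration; using different row counts makes it easier to see which axis is which.

```python
import numpy as np

n, k, m = 4, 3, 2  # 4 points to classify, 3 labeled neighbors, 2 features
data = np.arange(n * m, dtype=float).reshape((n, m))
neighbors = np.arange(k * m, dtype=float).reshape((k, m)) + 0.5

data_3d = data.reshape((n, m, 1))                           # shape (4, 2, 1)
flipped = np.swapaxes(neighbors.reshape((k, m, 1)), 0, 2)   # shape (1, 2, 3)

diff = flipped - data_3d                                    # broadcasts to (4, 2, 3)
distances = np.sqrt(np.sum(diff ** 2, axis=1))              # shape (4, 3)

# Cross-check one entry against a plain, non-vectorized calculation:
# the distance from data row 1 to neighbor row 2
manual = np.sqrt(np.sum((neighbors[2] - data[1]) ** 2))
assert np.isclose(distances[1, 2], manual)
```

Entry `[i, j]` of `distances` is the distance from point `i` to neighbor `j`, so `argmin(axis=1)` picks the closest neighbor for each point.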

Evan Oman • Aug 1 '19

Ah, is this a column-major vs. row-major ordering issue?

Ryan Palo • Aug 1 '19

Yeah, or maybe the multidimensional version of that, although I tried numpy’s different ordering strategies and none seemed to work quite right.
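A small sketch of why the ordering options may not help here, assuming the goal is to turn a (3 x 2) array of records into a (1 x 2 x 3) array. The example array is made up; `order="C"` and `order="F"` are the row-major and column-major options of `reshape`.

```python
import numpy as np

neighbors = np.array([[1, 2],
                      [3, 4],
                      [5, 6]])  # 3 records, 2 features each

# What the broadcasting step needs: each record along the last axis
wanted = np.swapaxes(neighbors[:, :, np.newaxis], 0, 2)

row_major = neighbors.reshape((1, 2, 3), order="C")
col_major = neighbors.reshape((1, 2, 3), order="F")

# Neither ordering reproduces the axis swap for this shape
print(np.array_equal(row_major, wanted))  # False
print(np.array_equal(col_major, wanted))  # False

# A transpose plus a new leading axis is equivalent to the swapaxes call
print(np.array_equal(neighbors.T[np.newaxis], wanted))  # True
```

In this case the fix is a change of axes rather than a change of memory order, which matches the experience described above.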