DEV Community: Dizzyspiral

"Beating" Checksums

Dizzyspiral — Wed, 26 Oct 2022 23:49:06 +0000

Checksums. The often-ignored string of hex characters that accompany most software downloads. Checksums are meant to help you, the end user downloading and installing software from the wild wild internet, verify that what you got is actually what the developers released.

Being the very responsible and security-conscious person you are, I'm sure you verify the checksum on everything you download. Do you actually diff the checksums? Or do you, like me, just eyeball it sometimes?

I bet you eyeball it. It's faster. We're lazy. And checksums are designed to be wildly different with just small changes in the input. So if, say, the beginning and the end of the checksums match, chances are the whole thing matches. Right?

Right?

That's the subject of today's blog post! Beating checksums is a thing security researchers often have to do, though typically they're not packaging their stuff to match a particular checksum if they can help it. That's hard to do, and usually it's easier to find a way to defeat the check - like stuffing your code/data/whatever into something that doesn't get checksum'd at all. But what if you have to beat the checksum? And what if the checksum you're beating is checked by a human?

I hypothesize (from my very rigorous sample size of me and my coworkers) that lots of people only check the beginning and end of a checksum, and maybe a few stand-out patterns in the middle if they're feeling particularly paranoid. Can we build something that'll pack arbitrary data to match the beginning and end of a checksum? Let's call the beginning/end about 5 characters each, in keeping with the researched limits of short-term memory.
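To make that hypothesis concrete, here's a minimal sketch of what "eyeballing" amounts to next to a real comparison. The function names and the n=5 default are made up for illustration (5 matches the guess above), and the two checksums are fake strings, not real hashes.

```python
import hmac

def eyeball_match(expected: str, actual: str, n: int = 5) -> bool:
    """Compare only the first and last n hex characters, like a lazy human."""
    expected, actual = expected.lower(), actual.lower()
    same_start = expected[:n] == actual[:n]
    # Slice from len - n rather than -n so that n = 0 compares empty strings
    same_end = expected[len(expected) - n:] == actual[len(actual) - n:]
    return same_start and same_end

def full_match(expected: str, actual: str) -> bool:
    """Compare every character of the two checksums."""
    return hmac.compare_digest(expected.lower(), actual.lower())

# Two fake 64-character "checksums" that agree only at the ends
published = "aaaaa" + "0" * 54 + "bbbbb"
forged = "aaaaa" + "f" * 54 + "bbbbb"

print(eyeball_match(published, forged))  # True
print(full_match(published, forged))     # False
```

If the eyeball check is all that stands between a forged file and an install, the forger only has to control 10 characters of the hash, not all 64.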

I know nothing about checksums, and today isn't the day I'm going to get all hot and bothered for math, so I'm going to do a little light research on manipulating checksums and see what techniques are generally used.

Research

Okay so first off, I found this paper that explained something I knew to be true - CRC checksums aren't secure (they're built to catch accidental corruption, not tampering, and are largely not used for verifying downloads anymore because of this). I mostly see sha256 these days, so that's what we'll try to "beat."

Second... it turned out there's no way to get a handle on this problem without learning some math (ugh) so here we go on SHA-256:

By kockmeyer - Own work, CC BY-SA 3.0

SHA-256 is a hashing algorithm that pads your input and splits it into 512-bit chunks. Each chunk is run through 64 rounds of math, with each round working on the output of the previous one, and the result is folded into a running 256-bit state. Once the last chunk has been processed, that state is the hash. If you want a deep dive into the math on this, here's a video of doing a round out by hand. If you want something a bit more digestible, this video is an excellent visual breakdown.
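If you'd rather poke at it than watch it, Python's hashlib exposes SHA-256 directly. Here's a small sketch (the helper name sha256_hex is just for illustration) that hashes two inputs differing in a single character and counts how many of the 64 hex characters changed:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()

first = sha256_hex(b"abc")
second = sha256_hex(b"abd")  # one character different from the first input

# Count the positions where the two digests disagree
differing = sum(1 for x, y in zip(first, second) if x != y)

print(first)
print(second)
print(f"{differing} of {len(first)} hex characters differ")
```

Most of the characters should come out different, which is exactly the property that makes eyeballing feel safe.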

Third, there's lots of bitcoin bullshit videos out there saying that hashes are always unique. This isn't true. 256 bits is a lot of playground to work with - 115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936 unique values, to be exact. And while that sounds like enough to give a unique hash to every bit of unique data that could ever be created, things are just not quite that simple: there are more possible inputs than there are 256-bit outputs, so some inputs have to share a hash. Hash collisions (where the same hash is generated for two completely different inputs) exist. They're just hard to find.
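Finding a full 256-bit collision is out of reach, but the same pigeonhole argument is easy to watch on a truncated hash. This is a sketch under that assumption: it only looks at the first 4 hex characters (65,536 possible values), and find_truncated_collision is a made-up helper, not a real attack.

```python
import hashlib

# Number of distinct 256-bit values
print(2 ** 256)

def find_truncated_collision(hex_chars: int = 4):
    """Find two different inputs whose SHA-256 digests share their first hex_chars characters."""
    seen = {}  # truncated digest -> input that produced it
    counter = 0
    while True:
        data = str(counter).encode()
        key = hashlib.sha256(data).hexdigest()[:hex_chars]
        if key in seen:
            return seen[key], data
        seen[key] = data
        counter += 1

first_input, second_input = find_truncated_collision()
print(first_input, hashlib.sha256(first_input).hexdigest())
print(second_input, hashlib.sha256(second_input).hexdigest())
```

With only 65,536 buckets, a repeat is guaranteed by the 65,537th input and usually shows up after a few hundred.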

Fourth, hey look bitcoin miners are basically already doing this. I knew that, but I guess I forgot. Proof-of-work crypto schemes take some transaction data and vary a nonce alongside it in order to achieve a particular prefix (in Bitcoin's case, a run of leading zeros) on the resultant sha256 hash of data + nonce. This seems like a great place to start this research.

Implementation

Okay so what we want to do is manipulate arbitrary data so that the first and last five characters of its sha256-sum match an arbitrary sha256-sum. We think we can take the same thing crypto miners are doing - fuzzing an appended nonce - to find a hash that has a given prefix and suffix.

While ostensibly I could take some code already written for this by the crypto bros, as a practical matter that's all going to have coin-specific code and blockchain network blah blah in it, and anyway I didn't find anything suitable in my 10 seconds of looking.

Instead, I'm going to try rolling my own. I'm itching to code - let's go!

The code

This problem is actually easy enough that we could probably get a prototype going using bash. But since I'm confident from my research that the approach is going to work, and I'm really not a fan of debugging bash, I'm going to throw this together in python.

Quick aside, python is really not the right choice for a problem like this, since it's an obviously intense and parallelizable problem, and python is shit at parallel processing. However, in the long run I want to experiment with this by modifying known data types (e.g. ELF, PDF) in non-destructive ways other than just appending a value. Python has a lot of great tools to do that quickly, so that's why python.

Here's our initial implementation:

import argparse
import hashlib

NONCE_SIZE = 1  # 1 byte
MAX_NONCE = (2 ** (NONCE_SIZE * 8)) - 1

def parse_args():
    parser = argparse.ArgumentParser(description="SHA256 prefix and suffix mutator")
    parser.add_argument('filename', metavar='filename', type=str, help='file to hash')
    parser.add_argument('--prefix', dest='prefix', action='store', default="", help='prefix to match')
    parser.add_argument('--suffix', dest='suffix', action='store', default="", help='suffix to match')

    return parser.parse_args()

def mutate(data, prefix, suffix):
    message = hashlib.sha256()
    message.update(data)
    digest = message.hexdigest()

    print(f"Initial SHA256 digest: {digest}")
    print(f"MAX_NONCE = {MAX_NONCE}")

    nonce = -1
    # Precompute for speed. A SHA-256 hex digest is 64 chars, so 64 - suffix_len also works for an empty suffix
    prefix_len = len(prefix) if prefix else 0
    suffix_len = len(suffix) if suffix else 0

    while (digest[:prefix_len] != prefix or digest[64 - suffix_len:] != suffix) and nonce <= MAX_NONCE:
        nonce += 1
        message = hashlib.sha256()
        message.update(data)
        message.update(nonce.to_bytes(NONCE_SIZE, "big"))
        digest = message.hexdigest()

    print(f"Got hash {digest} using nonce {nonce}")

if __name__ == '__main__':
    args = parse_args()

    with open(args.filename, 'rb') as f:
        data = f.read()

    mutate(data, args.prefix, args.suffix)

Running that, we get:

┌─[dizzyspiral@akatsuki:~/repos/github/sha-bam]
└─╼ python3 main.py main.py --prefix a --suffix b
Initial SHA256 digest: 35064e6ca1abd7b5f51f353e5d558aa947e389480ca6bdadc5660a32e615f88d
MAX_NONCE = 255
Traceback (most recent call last):
  File "main.py", line 47, in <module>
    mutate(data, args.prefix, args.suffix)
  File "main.py", line 32, in mutate
    message.update(nonce.to_bytes(NONCE_SIZE, "big"))
OverflowError: int too big to convert

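None of the 256 one-byte nonces produced a match, and since the loop bumps the nonce before hashing, it ran one past MAX_NONCE and tried to squeeze 256 into a single byte. So we up our nonce size: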

NONCE_SIZE = 4

And now:

┌─[dizzyspiral@akatsuki:~/repos/github/sha-bam]
└─╼ python3 main.py main.py --prefix a --suffix b
Initial SHA256 digest: 9482419d29e5a2579d7ce71373d3aa5fc4a12a45a2378629c8e34bdd95174015
MAX_NONCE = 4294967295
Got hash aca28c372935e0b10acacbb81bf3b427ca5d64ae91b80870be3d71a4b0a9023b using nonce 28

You'll note that this modified code found a match at nonce 28, well inside the 0-255 range that came up empty a minute ago. That's because we're running this with our own code as input (because lazy), so editing NONCE_SIZE changed the file, and with it every hash. This isn't very repeatable as we iterate on the code.

Let's use some static sample data.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

And let's add some profiling

import cProfile
...
cProfile.run("mutate(data, args.prefix, args.suffix)")

┌─[dizzyspiral@akatsuki:~/repos/github/sha-bam]
└─╼ python3 main.py lorem.txt --prefix a --suffix b
Initial SHA256 digest: 56293a80e0394d252e995f2debccea8223e4b5b2b150bee212729b3b39ac4d46
MAX_NONCE = 4294967295
Got hash ae7ee87838542a984c3b82c9884f75231c870776500d6dfed3777566c5cdf07b using nonce 269
         1362 function calls in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 main.py:16(mutate)
      271    0.000    0.000    0.000    0.000 {built-in method _hashlib.openssl_sha256}
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      271    0.000    0.000    0.000    0.000 {method 'hexdigest' of '_hashlib.HASH' objects}
      270    0.000    0.000    0.000    0.000 {method 'to_bytes' of 'int' objects}
      541    0.000    0.000    0.000    0.000 {method 'update' of '_hashlib.HASH' objects}

The profiling will be nice when we start upping the prefix and suffix lengths... as we'll do now.

┌─[dizzyspiral@akatsuki:~/repos/github/sha-bam]
└─╼ python3 main.py lorem.txt --prefix aa --suffix bb
Initial SHA256 digest: 56293a80e0394d252e995f2debccea8223e4b5b2b150bee212729b3b39ac4d46
MAX_NONCE = 4294967295
Got hash aa08146063c11f7b153146027543b0d0de02207048ce024d074b6010f86d6fbb using nonce 111950
         559767 function calls in 0.258 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.258    0.258 <string>:1(<module>)
        1    0.076    0.076    0.258    0.258 main.py:16(mutate)
   111952    0.013    0.000    0.013    0.000 {built-in method _hashlib.openssl_sha256}
        1    0.000    0.000    0.258    0.258 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   111952    0.039    0.000    0.039    0.000 {method 'hexdigest' of '_hashlib.HASH' objects}
   111951    0.013    0.000    0.013    0.000 {method 'to_bytes' of 'int' objects}
   223903    0.117    0.000    0.117    0.000 {method 'update' of '_hashlib.HASH' objects}

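Adding two more hex characters (one to the prefix, one to the suffix) has drastically increased our running time. Each hex character is 4 bits, so every extra character multiplies the expected number of tries by 16. Add two more, and: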

┌─[dizzyspiral@akatsuki:~/repos/github/sha-bam]
└─╼ python3 main.py lorem.txt --prefix aaa --suffix bbb
Initial SHA256 digest: 56293a80e0394d252e995f2debccea8223e4b5b2b150bee212729b3b39ac4d46
MAX_NONCE = 4294967295
Got hash aaaace2fd053c08a86ed9aa2d6b9bb4cd2ee98935f197fc9f28c7020fe6debbb using nonce 5950739
         29753712 function calls in 13.329 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   13.329   13.329 <string>:1(<module>)
        1    3.800    3.800   13.329   13.329 main.py:16(mutate)
  5950741    0.648    0.000    0.648    0.000 {built-in method _hashlib.openssl_sha256}
        1    0.000    0.000   13.329   13.329 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  5950741    2.035    0.000    2.035    0.000 {method 'hexdigest' of '_hashlib.HASH' objects}
  5950740    0.734    0.000    0.734    0.000 {method 'to_bytes' of 'int' objects}
 11901481    6.112    0.000    6.112    0.000 {method 'update' of '_hashlib.HASH' objects}

It is very obvious where the time is spent. Let's graph those data points, for science.
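Since update dominates the profile and the file contents never change between tries, one possible tweak is to hash the data once and copy that hash state for each nonce, using the copy() method that hashlib objects provide. This is a sketch that hasn't been benchmarked here; mutate_with_copy and its max_tries parameter are made up for the example, and it uses startswith/endswith instead of slicing.

```python
import hashlib

NONCE_SIZE = 4  # bytes, same as the script above

def mutate_with_copy(data: bytes, prefix: str, suffix: str, max_tries: int = 2 ** 32):
    """Search for a nonce whose appended bytes give a digest with the given prefix and suffix."""
    base = hashlib.sha256()
    base.update(data)  # hash the unchanging data once

    for nonce in range(max_tries):
        message = base.copy()  # resume from the saved state instead of rehashing data
        message.update(nonce.to_bytes(NONCE_SIZE, "big"))
        digest = message.hexdigest()
        if digest.startswith(prefix) and digest.endswith(suffix):
            return nonce, digest
    return None  # no match in the range searched

result = mutate_with_copy(b"some sample data", "a", "b")
print(result)
```

How much this saves depends on the size of the input: the bigger the file, the more hashing work gets skipped on every try.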

We want to get it to match 5 hex characters from the prefix and suffix each, so we're really aiming to pin down 10 characters of the hash. That's 40 bits (5 bytes), or roughly 1.1 trillion expected tries. Initial findings are... not good, on the runtime scale. Which is not a surprise.
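For a rough feel of the scale, here's a back-of-the-envelope sketch. It assumes each try matches with probability 1/16 per hex character (so the expected number of tries is 16 to the power of the character count) and reuses the hash rate from the three-and-three run above, profiler overhead and all. The names are made up for the example.

```python
def expected_tries(hex_chars: int) -> int:
    """Expected number of hashes needed to match hex_chars specific hex characters."""
    return 16 ** hex_chars

# Measured above: 5,950,741 hashes in 13.329 seconds (profiler overhead included)
HASHES_PER_SECOND = 5_950_741 / 13.329

for hex_chars in (2, 4, 6, 10):
    tries = expected_tries(hex_chars)
    seconds = tries / HASHES_PER_SECOND
    print(f"{hex_chars:>2} hex chars: ~{tries:,} tries, ~{seconds:,.0f} s (~{seconds / 86400:.1f} days)")
```

At that rate the full 10 characters work out to weeks on a single core rather than seconds, which is why the rest of this post is about going faster.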

We can get some extra performance by multi-threading, but as mentioned earlier, python is not the best language for this. Still, let's give it a go and see how we do.

import argparse
import hashlib
import threading
import math
import cProfile

NONCE_SIZE = 4  # 4 bytes
MAX_NONCE = (2 ** (NONCE_SIZE * 8)) - 1  # Largest int you can make with the number of nonce bytes
NUM_THREADS = 8

done = False

def parse_args():
    parser = argparse.ArgumentParser(description="SHA256 prefix and suffix mutator")
    parser.add_argument('filename', metavar='filename', type=str, help='file to hash')
    parser.add_argument('--prefix', dest='prefix', action='store', default="", help='prefix to match')
    parser.add_argument('--suffix', dest='suffix', action='store', default="", help='suffix to match')
    parser.add_argument('--multithread', dest='multithread', action='store_const', default=False, const=True, help='enable multithreading')

    return parser.parse_args()

def mutate(data, prefix, suffix, nonces):
    global done
    message = hashlib.sha256()
    message.update(data)

    # Precompute for speed
    prefix_len = len(prefix) if prefix else 0
    suffix_len = len(suffix) if suffix else 0

    for nonce in nonces:
        # If another thread got the answer, quit
        if done:
            break

        message = hashlib.sha256()
        message.update(data)
        message.update(nonce.to_bytes(NONCE_SIZE, "little"))  # Change to little, because intel
        digest = message.hexdigest()

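        # If we got the answer, set the done flag and quit
        # (64 - suffix_len instead of -suffix_len so an empty suffix matches; the hex digest is 64 chars)
        if digest[:prefix_len] == prefix and digest[64 - suffix_len:] == suffix: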
            done = True
            print(f"Got hash {digest} using nonce {nonce}")
            break

def start_threads(data, prefix, suffix):
    threads = []

    for i in range(1, NUM_THREADS + 1):
        start_nonce = math.floor(MAX_NONCE / NUM_THREADS) * (i - 1)
        end_nonce = math.ceil((MAX_NONCE / NUM_THREADS) * i)
        print(f"Starting thread with nonce range {start_nonce} to {end_nonce}")
        threads.append(threading.Thread(target=mutate, args=(data, prefix, suffix, range(start_nonce, end_nonce))))
        threads[-1].start()

    for thread in threads:
        thread.join()

if __name__ == '__main__':
    args = parse_args()

    with open(args.filename, 'rb') as f:
        data = f.read()

    if args.multithread:
        cProfile.run("start_threads(data, args.prefix, args.suffix)")
    else:
        cProfile.run("mutate(data, args.prefix, args.suffix, range(MAX_NONCE))")

Since I changed the byte ordering to little endian, and we changed some other stuff about how the algorithm works, let's re-benchmark our single-threaded execution:

┌─[dizzyspiral@akatsuki:~/repos/github/sha-bam]
└─╼ python3 main.py lorem.txt --prefix aaa --suffix bbb
Got hash aaa07ea3b6faf37e8ad580ad7b60bf5b58fa1d30cf898c22819fff562261abbb using nonce 35489965
         177449839 function calls in 77.409 seconds

And now, multithreaded:

┌─[dizzyspiral@akatsuki:~/repos/github/sha-bam]
└─╼ python3 main.py lorem.txt --prefix aaa --suffix bbb --multithread
Starting thread with nonce range 0 to 536870912
Starting thread with nonce range 536870911 to 1073741824
Starting thread with nonce range 1073741822 to 1610612736
Starting thread with nonce range 1610612733 to 2147483648
Starting thread with nonce range 2147483644 to 2684354560
Starting thread with nonce range 2684354555 to 3221225472
Starting thread with nonce range 3221225466 to 3758096384
Starting thread with nonce range 3758096377 to 4294967295
Got hash aaae5c17d3d40008c490879d1d227c63dda58d9fc4e9244025965f191df5fbbb using nonce 1076940590
         356 function calls in 36.090 seconds
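
The ranges printed above overlap by a few nonces at each boundary (and the very last nonce never gets tried), which is harmless here but wastes a little work. Here's a sketch of one way to carve the nonce space into contiguous chunks with no overlap; split_nonce_space is a hypothetical helper, not part of the script above.

```python
def split_nonce_space(total: int, parts: int):
    """Split range(total) into `parts` contiguous, non-overlapping ranges."""
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one extra nonce so nothing is left over
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

chunks = split_nonce_space(2 ** 32, 8)
for chunk in chunks:
    print(f"nonce range {chunk.start} to {chunk.stop - 1}")
```

Each chunk's start is exactly the previous chunk's stop, so every 4-byte nonce lands in exactly one thread's range.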

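So we've cut execution time in about half, based on this one datapoint. It's not an apples-to-apples comparison, though: the threaded run found a different nonce (one near the start of the third thread's range), and cProfile only instruments the main thread, so the worker threads skipped the profiling overhead. I guess let's let it rip on just one more character, in the prefix: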

┌─[dizzyspiral@akatsuki:~/repos/github/sha-bam]
└─╼ python3 main.py lorem.txt --prefix aaaa --suffix bbb --multithread
Starting thread with nonce range 0 to 536870912
Starting thread with nonce range 536870911 to 1073741824
Starting thread with nonce range 1073741822 to 1610612736
Starting thread with nonce range 1610612733 to 2147483648
Starting thread with nonce range 2147483644 to 2684354560
Starting thread with nonce range 2684354555 to 3221225472
Starting thread with nonce range 3221225466 to 3758096384
Starting thread with nonce range 3758096377 to 4294967295
^C         274 function calls in 534.320 seconds

It ran for a while, but I got impatient and killed it.

So now we're faced with a fundamental question. Do we throw more compute at this, or do we try to do something clever to reduce the compute required? The former option isn't very interesting; crypto-mining has shown us that we can solve this problem with beefy GPUs. But that's not super attainable for the average dudette. Is there a way we can give ourselves a better chance of affecting the prefix and suffix of the hash?

First thing that comes to mind is changing up where we're inserting our nonce. If we put it at the beginning of the data, is it more likely to affect the prefix? We can do some benchmarking on randomly generated data to find out.

For what it's worth, I do doubt that will make a difference, because it would indicate a flaw in the checksum that could allow you to learn something about the original data. And sha256 has been in use for so long that such a trivial attack would be well known by now if it existed. But I take nothing for granted, so I'll definitely try it. In the next blog post ;)

Reversing the asciiPad SG-6 serial link (part 2)

Dizzyspiral — Sun, 24 Jul 2022 23:42:45 +0000

This is a writeup I've moved here from my personal website. It was originally published on 4/13/19

So remember how I said you could probably just hook your controller up to a bench power supply? Yeah, don’t do that. It’s a no-go. Apparently the Sega controller receives a clock signal from the Sega that drives some of the pin output. The controllers for the Sega Genesis don’t behave at all like the pinout I showed you yesterday. This is what a good trace of a stock Sega Genesis controller looks like:

Let me explain how I learned this.

This morning I woke up energized and ready to go (masala chai tea for the win!), so I tackled the issue of getting the Sega to actually output video to a TV. While I was twiddling the video cable I learned something exciting.

My Sega’s video output now comes in two pieces! Lovely. I futzed with this a bit until it fit snugly back into its housing. This resulted in a sort of okay video output to the TV. I could at least see that the game was booting up and enjoy visual feedback when I pressed controller buttons. I decided it was time to eliminate some variables and go back to the baseline — I hooked up the stock controller.

Without pressing any buttons, we get three lines with an even clock signal. This was way different from what I saw yesterday messing with the asciiPad. It could be a difference in the controller, but it could also be a result of having a game loaded. I rebooted the Sega and took a trace of it at startup to see when the clocked signals appeared. If they appeared as soon as the controller got power, it would be likely that the controller was generating the clock and sending it to the Sega. But if there was a delay between when the lines went high and when they gained a clock, it was more likely that the clock was a result of the game being loaded, and therefore generated by the Sega and sent to the controller.

Sure enough, we see the latter case. For further sanity checking and information gathering, I replaced the OEM Sega controller with the asciiPad after bootup and took a trace while pressing buttons.

This is a logic trace with clock, and a number of button presses in order of: up, down, left, right, z, c, x, y, a, b, start, mode.

We still see the clocked signals, which continues to support the idea that they’re coming from the Sega. But more than that, check out what happens when you press buttons! With the clock, we can actually see many of the buttons that were missing yesterday. You’ll note that the standard buttons exactly match the trace with the stock controller, but we get one additional button here that we didn’t have there — we have Z, which apparently drives all of the clock lines (and possibly just all of the lines in general) high. This is interesting because without a clock signal, other button presses like Start should not be possible.

X and Y are still missing from the trace. It doesn’t make sense to me that the AsciiWare folks would put non-functional buttons on the controller unless they used this controller interface for other systems too. In that case, it’d be a cost savings for them to only manufacture one controller and program the chip inside with whatever logic it needed to interface with the game system it’s marketed for. I could do some research to try and figure this out, but I had another theory I wanted to poke at first.

Recall the trace of the lines at bootup. There’s a little blip at the beginning that’s separated from the rest of the uniform clock signal. This could be the Sega telling the controller “hi, I’m here.” It could just be noise. But it’s definitely different from the rest of the trace. I zoomed in on it, and saw something weird.

The clock signal after this “blip” has pairs of short pulses instead of the usual single pulse. I thought maybe it was an anomaly in the reading, but when I took another, I saw the same thing. The double pulses continue for quite a while, about 8 seconds, before they switch to the single pulse.

This got me thinking, do different games have different bootup behavior with respect to the clock? I’ve got a plethora of old Sega games, so I went to town plugging and unplugging them, taking traces of their behavior at bootup (only cursing occasionally as I blew into cartridges and enjoyed all of that old-timey dust-and-electronics smell). Here are a few traces just for your viewing pleasure.

A Sonic 2 bootup trace:

NFL 95 bootup trace:

Lion King bootup trace:

NHL Hockey bootup trace:

Beyond each game having their own unique clock bootup shenanigans, they also clock different lines, and clock them inconsistently. Check out this zoomed-in view of the NFL 95 game. This is one clock “pulse” as it appears on several different lines.

Since NFL 95 drove clock signals over more lines, I decided to see if I could find my missing buttons in them. I didn’t, but I did find a few other interesting things. First, when running NFL 95 it seems (under most circumstances) pressing a button related to a clocked line doesn’t remove the clock signal from the line. This is different from other games, where pressing a button makes a clocked line steady at 5v or ground. The exception to this is when you press Start. In this case, we see a single pulse on pin 9, and then all of the lines are driven high for about a second, including pin 9. The duration of the silence on the lines is not dependent on how long you hold the start button.

It’s hard to tell whether the game is intentionally removing the clock from the lines in order to prevent button presses from registering (if the controller doesn’t have a clock signal, it can’t send most button signals, as we saw yesterday) or if it’s a side effect of some processing going on in the Sega. Either way, it seems that the way the Sega handles the controller comms is completely dependent on the game. This is cool because it means game developers can ship their own unique equipment to enhance a game if they want to, without being tied to the template provided by Sega. But it’s also annoying for anyone who wants to manufacture a controller for the game, because it means there is no guarantee that it will actually work with all games.

The cartridge internals are about what I expected. There’s a package that is probably just storage that the CPU on the Sega reads in order to get the game’s program. The only thing that surprised me was that the game name is printed right on the chip, which suggests a custom run of these chips just for that cartridge.

The RetroShock project has a ton of information about individual cartridges, including which chips are used in which games, and the pinouts used by the different cartridges. The two photos I took are of cartridges that use the same pinout, but there are several different schemes, with some using all of the pins available on the 64-pin header and some using only a few.

Reversing the asciiPad SG-6 serial link (Part 1)

Dizzyspiral — Sun, 24 Jul 2022 23:41:00 +0000

This is a writeup I've moved here from my personal website. It was originally published on 4/13/19

The asciiPad SG-6 is an aftermarket Sega Genesis controller with extra buttons and macro features. As a kid, I had no idea what any of the weird switches did. I just sort of flicked them up and down until the controller worked as I expected. As an adult, I started wondering how the controller is able to support more buttons than the stock controller, let alone the frame rate slowing capability I remembered as a child. It’s a mystery I couldn’t just leave alone. To the batcave!

I’m mostly just getting started as a hardware reverser, so my tools are nothing fancy. For this, I used a bunch of jumper wires, male and female DB-9 breakouts, and a super budget logic analyzer. All told, materials cost around $20. Here’s the setup.

I wanted to capture the signals coming from the controller when different buttons were pressed and different modes were set. The controller is powered over the serial connection, though it uses a non-standard pinout. Ground is pin 8, VCC at 5v is on pin 5. The other pins float high at 4v9. I checked all of this with a multimeter (mine is nothing fancy) before hooking anything up. Measure twice, cut once, as the saying goes.

The controller should be completely standalone, so if you have a bench power supply, you can just set it to 5v and hook the + rail to pin 5 and the – rail to pin 8. I didn’t do this. Instead I opted to use a female DB-9 breakout connected to a male DB-9 breakout connected to the controller in order to power the controller and provide access to the pin signals. I did this because it’s cheap and it works. My little logic analyzer is connected to the male breakout with some jumper wires so it can read the pin logic levels. I collected traces for pins 1–4, 6, 7, and 9 (everything but the ground and VCC pin).

From there, it was time to have some fun. I started with powering on the Sega and taking a trace to see if there were any bootup signals sent over the line by the controller or the Sega. There aren’t — the lines all just go high. Then I started taking traces with individual buttons. Which quickly became… odd.

For all of the pins, a logic level of 0 (the pin being high at 4v9) indicates that the button associated with that pin is not being pressed. A logic level of 1 (the pin being driven to ground) indicates that the button is being pressed. The standard controller pinout generally associates one button with each pin, however there are a couple of cases where the pins are multiplexed and two buttons are capable of driving the pin. In the latter case, a special Select pin differentiates the two buttons by being either high or low. For example, the Start button and the C button are multiplexed on pin 9. If the Select pin is low, and pin 9 is low, then this is interpreted by the Sega as the Start button being pressed. If the Select pin is high and pin 9 is low, it is interpreted as the C button being pressed.
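As a sanity check on that description, here's a minimal Python sketch of the pin 9 decoding. It only covers the Start/C case spelled out above; the function names are made up, and the inputs are plain booleans for whether each pin reads high.

```python
from typing import Optional

def logic_level(pin_is_high: bool) -> int:
    """This writeup's convention: high (about 4.9 V) is logic 0, pulled to ground is logic 1."""
    return 0 if pin_is_high else 1

def decode_pin9(select_is_high: bool, pin9_is_high: bool) -> Optional[str]:
    """Return which button pin 9 is reporting, or None if it isn't being driven low."""
    if logic_level(pin9_is_high) == 0:
        return None  # pin 9 is floating high, so nothing is pressed on it
    # Pin 9 is low, so the Select pin decides which button it means
    return "C" if select_is_high else "Start"

print(decode_pin9(select_is_high=False, pin9_is_high=False))  # Start
print(decode_pin9(select_is_high=True, pin9_is_high=False))   # C
print(decode_pin9(select_is_high=True, pin9_is_high=True))    # None
```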

When I started pressing buttons on the asciiPad, only some of them affected the logic level on the pins. Really important, standard buttons, like A and Start, didn’t seem to have any effect on the pin outputs. But, you know, the asciiPad has a bunch of toggle switches and, who knows, some of them might be futzing up the operation of the buttons. It also has this nice “mode” button which, hell if I know what it does. It seemed like the next logical step was to hook the Sega up to a TV and get some intuition about what the odd buttons on the controller actually do, instead of relying on my decades old memories of how it worked from playing video games with my brother.

While I kept all of the gear that came with the Sega, including the TV tuner, unfortunately the RF interface is and always has been super finicky. I wasn’t able to get it working in an amount of time that my patience would allow for. I did however take apart the Sega and attempt to diagnose any apparent connector issues (I didn’t find any). Here’s a few photos (note that this is not the “Mega Drive” described in other teardowns):

I went back to taking logic traces of the controller pins and, by some amount of trial and error, figured out what the “Auto/Turbo/Off” (ATO) switches on the controller do. First, visually it’s apparent that one switch is associated with each button on the controller other than the “Start” and “Mode” buttons. I switched each of the ATO switches to off and captured the signals pictured below.

In order of activation, the buttons pressed during the trace were: Up, Down, Left, Right, Z, C, X, Y, A, B, Mode, Start. You’ll notice that we don’t have nearly enough pin activations in the trace to account for all of those buttons. What shows up on the trace are the D-pad presses and the B and C buttons, which correspond directly to the pinout described for the stock Sega controller. Clearly, something is up with the missing buttons. But we ignore that for now, and move on to setting all of the ATO switches to Turbo.

Turbo appears to quickly toggle a button as long as it is held down. This might be a great way to get in some quick punches in a fighting game, or float in a game with flying, or any number of other things.

The “Auto” mode of the asciiPad behaves like Turbo except it’s active as long as the button isn’t pressed. Once a button is pressed, the signal for that button is driven high, effectively telling the Sega that the button is currently not being pressed.
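Here's a toy Python model of the three switch positions as described above. It's a sketch only: the function name is made up, time is counted in arbitrary ticks, and the toggle period is an assumption since the real toggle rate wasn't measured.

```python
def reports_pressed(mode: str, held: bool, tick: int, period: int = 2) -> bool:
    """Return True if the controller would report the button as pressed on this tick."""
    # Crude square wave standing in for the rapid toggling; the real rate is unknown
    pulsing = (tick % period) < (period // 2)
    if mode == "off":
        return held                   # plain button: pressed while held
    if mode == "turbo":
        return held and pulsing       # toggles rapidly while held
    if mode == "auto":
        return (not held) and pulsing # toggles while released, goes quiet when held
    raise ValueError(f"unknown mode: {mode}")

for mode in ("off", "turbo", "auto"):
    held_trace = [int(reports_pressed(mode, True, t)) for t in range(6)]
    released_trace = [int(reports_pressed(mode, False, t)) for t in range(6)]
    print(f"{mode:>5}: held={held_trace} released={released_trace}")
```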

The great mystery in all of these traces is why the X, Y, Z, A, Start, and Mode buttons don’t appear to affect the output on the pins at all. I could believe that Mode is perhaps a button that affects the controller operation and does not directly send anything out over the serial line. Similar arguments could be made for the other buttons, e.g. maybe they’re deactivated since a standard Sega controller only supports the A, B, and C buttons. But the A button is standard, and so is Start, and nowhere do we see either appearing in the output. Time to take the controller apart.

Inside the controller, the first thing that’s revealed is the chip that’s doing all of the signals processing for the controller. Beyond that, there really isn’t much to be seen — there’s some resistors and capacitors, but very few, and they’re probably just doing some stuff to the signals coming in from the buttons to the chip. Obviously there is also the header with the colorful wires coming out of it, which gets wrapped up into the black cord and becomes the serial connection to the Sega. The signals routed to that header come directly from the chip.

On the other side, we have the board components that interface with the buttons. For the pressure buttons, when they’re pressed they complete a circuit, which changes the signal that the chip on the other side of the board receives. For the toggle switches, they complete different circuits as they slide into one of three different positions. In the “Off” position, they complete no circuit; in the “Turbo” position they connect with the middle small grey piece, and in the “Auto” position they connect with the outside small grey piece. The exception to this is the “Fast/Slow” toggle, which only has two modes (this switch is located next to the Start and Mode buttons).

When my patience returns, I’ll probably begin again by attempting to get the Sega hooked back up to a TV so I can functionally diagnose any issues with the controller. It’s possible that the buttons that don’t appear to send signals truly aren’t working. It’s an old piece of technology, and it was well used when it was new.

Next up: part 2

Teaching A Machine To Identify Vulnerabilities (Part 3)

Dizzyspiral — Sun, 24 Jul 2022 23:36:00 +0000

After blindly attempting to train the classifier (as seen in Part 2), I had a good long hard think about all the variables and all the possible combinations of things I could try and tweak to get results out of my naive Bayes classifier. I also had a good long hard think about how I would measure those results. What are good results? This is an important thing to consider in machine learning. You may think that accuracy is all that matters — but that’s not the case.

In this experiment, we’re attempting to classify observations into one of 34 different bins. If we look only at accuracy, we see only part of the picture. We can’t answer questions such as “is one category often misclassified as another category?” or “how many different categories does the classifier bin things into?” These questions have answers that are helpful in “debugging” our classifier, i.e. determining why it’s classifying things the way it is, and how we can improve our data and strategy in order to improve its accuracy. This is why confusion matrices are useful tools for visualizing the result of classification. Confusion matrices show how the errors in classification are distributed among the possible classes, shedding light on the overall behavior of the classifier.
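For a concrete picture, here's a minimal sketch of a confusion matrix built with nothing but the standard library. The labels are made up for illustration, and this is not the code used in the experiment. It shows how the same predictions give both a single accuracy number and a per-class breakdown of where the errors land.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; pairs that never occur count as zero."""
    return Counter(zip(true_labels, predicted_labels))

# Hypothetical labels, for illustration only.
y_true = ["CWE-121", "CWE-121", "CWE-122", "CWE-416", "CWE-122"]
y_pred = ["CWE-121", "CWE-121", "CWE-121", "CWE-416", "CWE-121"]

matrix = confusion_matrix(y_true, y_pred)
classes = sorted(set(y_true) | set(y_pred))

# Accuracy only looks at the diagonal of the matrix.
accuracy = sum(matrix[(c, c)] for c in classes) / len(y_true)
print(f"accuracy: {accuracy:.2f}")

# Rows are true classes, columns are predicted classes.
for true_class in classes:
    row = [matrix[(true_class, predicted)] for predicted in classes]
    print(true_class, row)
```

The accuracy here is 0.60, which hides the fact that every CWE-122 sample was guessed as CWE-121. The matrix makes that visible.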

If you’ll recall, our dataset class distribution looks like this:

The labels are heavily skewed toward CWE-121. A classifier could achieve high accuracy in this dataset by simply guessing CWE-121 for every observation. This is a type of overfitting. If the data does not have a lot of descriptive power, i.e. the features I’ve chosen to extract from the binaries are not related to the CWE classes, I would expect the classifier to have this behavior.

To validate this, I ran a trial where I trained the classifier on the dataset with the labels randomized. This trial provides a baseline: comparing subsequent, properly trained trials against it lets us tell their results apart from random noise.
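A rough sketch of what the randomized-label idea looks like in code, assuming the labels live in a plain Python list (the class counts below are invented, not the real dataset). Shuffling breaks the link between features and labels but leaves the class balance alone, so a classifier that always guesses the majority class still scores well.

```python
import random
from collections import Counter

def shuffled_labels(labels, seed=0):
    """Return a copy of the labels in random order, detached from their features."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def majority_class_accuracy(labels):
    """Accuracy of a classifier that always guesses the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical skewed labels, not the real dataset.
labels = ["CWE-121"] * 70 + ["CWE-122"] * 20 + ["CWE-416"] * 10
randomized = shuffled_labels(labels)

# The class balance survives the shuffle, so this is still 0.7.
print(majority_class_accuracy(randomized))
```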

With this in mind, I also identified a few different parameters to tweak in order to potentially improve classifier performance. We can perform classification at the basic block level or at the binary level. We can also choose to use a threshold to discard “uninformative” basic blocks, as described in the previous post. In addition, there is more than one naive Bayes classifier variant. Both Bernoulli naive Bayes and multinomial naive Bayes are applicable classifiers for our data. Finally, we can represent our features in two different ways: we can use a boolean representation for each opcode in each basic block, i.e. whether or not that opcode was present, or we can use an integer, i.e. how many times that opcode was present in the block. I attempted to exercise a reasonable combination of these parameters when training and evaluating the classifier.
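To make the two feature representations concrete, here's a small sketch. The opcode vocabulary and the example block are hypothetical, chosen only to show the difference between "was the opcode present" and "how many times was it present."

```python
from collections import Counter

# Hypothetical opcode vocabulary and basic block, for illustration only.
VOCABULARY = ["mov", "add", "cmp", "jmp", "call"]
block = ["mov", "mov", "add", "cmp", "mov"]

def count_features(opcodes, vocabulary):
    """Integer ("event") features: how many times each opcode appears."""
    counts = Counter(opcodes)
    return [counts[op] for op in vocabulary]

def boolean_features(opcodes, vocabulary):
    """Boolean features: 1 if the opcode appears at all, else 0."""
    present = set(opcodes)
    return [int(op in present) for op in vocabulary]

print(count_features(block, VOCABULARY))    # [3, 1, 1, 0, 0]
print(boolean_features(block, VOCABULARY))  # [1, 1, 1, 0, 0]
```

The boolean form is the natural input for Bernoulli naive Bayes, while the count form suits the multinomial variant.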

Throughout the rest of this post, you will see results for three separate classifiers: a Bernoulli naive Bayes classifier trained on the binary-featured dataset, a multinomial naive Bayes classifier trained on the binary-featured dataset, and a multinomial naive Bayes classifier trained on the integer-featured (also referred to as “event-featured”) dataset.

Tuning The Threshold

As a refresher, in the previous post I talked about a clever way we could use the probability estimates returned by naive Bayes to identify basic blocks that are common across many CWE classes vs. those that are unique to a CWE class. The idea is that if the classifier thinks each of the classes is equally likely for a block, then that block is not informative for classification and can be discarded. Conversely, if the classifier is very confident that a block belongs to a particular class, then it is informative and should be kept.

If we’re going to use this idea to discard basic blocks, we first need a way to measure this difference in probabilities between the classes. I chose to use the difference between the highest probability class and lowest probability class returned by the classifier. If the difference is small, the distribution is relatively even across the classes — if the difference is large, it is likely to be uneven. Semi-formally:

m = x_highest - x_lowest

Now that we have a metric, we need to know at what value blocks become “informative” vs. “uninformative.” In other words, we need to know what our threshold is. A good way to determine this is by training the classifier and seeing how classification performs with respect to different threshold values. This is a type of classifier tuning similar to that done by K-nearest Neighbors when choosing a value for K. This type of tuning requires a validation data set.
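Here's a hedged sketch of what that tuning loop might look like. The validation predictions, class names, and candidate thresholds are all invented for illustration. Each entry is (class probabilities, predicted label, true label), and a prediction is kept only if its highest-minus-lowest probability meets the threshold.

```python
def spread(probabilities):
    """Difference between the highest and lowest class probability."""
    return max(probabilities) - min(probabilities)

def accuracy_at_threshold(predictions, threshold):
    """Accuracy over predictions whose spread meets the threshold (None if none do)."""
    kept = [(pred, true) for probs, pred, true in predictions
            if spread(probs) >= threshold]
    if not kept:
        return None
    return sum(pred == true for pred, true in kept) / len(kept)

# Hypothetical validation-set predictions: (probabilities, predicted, true).
validation = [
    ([0.40, 0.35, 0.25], "A", "B"),
    ([0.70, 0.20, 0.10], "A", "A"),
    ([0.10, 0.85, 0.05], "B", "B"),
    ([0.50, 0.30, 0.20], "A", "C"),
    ([0.05, 0.05, 0.90], "C", "A"),
]

candidates = [0.0, 0.25, 0.5, 0.75]
scores = {t: accuracy_at_threshold(validation, t) for t in candidates}

# Pick the threshold with the best validation accuracy.
best = max((t for t in candidates if scores[t] is not None),
           key=lambda t: scores[t])
print(scores)
print("best threshold:", best)
```

With this toy data the best threshold comes out to 0.5, but that is an artifact of the made-up numbers rather than a reproduction of the real result.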

Typically, when training a classifier, you need at least two datasets: a training dataset, with which to train your classifier, and a testing dataset, on which to evaluate your classifier’s accuracy. But when you want to test your trained classifier under different conditions and pick the best condition as a parameter of your classifier, you need an additional dataset called a validation set in order to maintain the independence of the testing data from the process of training. Otherwise, you risk overfitting your classifier to your data, and will likely produce results in your experiment that translate poorly to unseen data.
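A minimal sketch of a three-way split, assuming the observations fit in a list and that equal thirds are wanted. The function name and the seed parameter are my own choices for illustration. Changing the seed gives a different random split for each trial.

```python
import random

def three_way_split(observations, seed):
    """Shuffle, then cut into training, validation, and testing thirds."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    third = len(shuffled) // 3
    training = shuffled[:third]
    validation = shuffled[third:2 * third]
    testing = shuffled[2 * third:]  # any remainder lands here
    return training, validation, testing

# Stand-in observations; real ones would be feature rows with labels.
observations = list(range(30))
training, validation, testing = three_way_split(observations, seed=1)
print(len(training), len(validation), len(testing))  # 10 10 10
```

Thresholds get tuned against the validation third only, so the testing third stays untouched until the final evaluation.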

To tune the threshold, I trained each of the three classifiers on their respective data and plotted their accuracy with respect to different threshold values. I created two plots — one where the accuracy was calculated with respect to basic block level classification, and one where the accuracy was calculated by having the “informative” basic blocks “vote” on what they believed the binary they came from should be labeled as.

Accuracy curve with respect to threshold for each naive Bayes classifier applied to basic blocks

Accuracy curve with respect to threshold for each naive Bayes classifier applied to binaries

The accuracy curve per basic block looks kind of like garbage: each of the classifiers performs fairly differently. However, when voting is applied to classify whole binaries, we see each classifier’s accuracy peaks at a threshold value of about 0.5. Therefore, we choose this as our threshold value for each classifier.
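The voting step can be sketched roughly like this. I'm assuming a simple plurality vote and representing each block's prediction as a dict of class probabilities. Both are illustrative choices, and the numbers are made up.

```python
from collections import Counter

def vote_on_binary(block_probabilities, threshold=0.5):
    """Let informative blocks vote; return the winning label, or None if no block votes."""
    votes = Counter()
    for probabilities in block_probabilities:
        spread = max(probabilities.values()) - min(probabilities.values())
        if spread >= threshold:
            # An informative block votes for its most probable class.
            votes[max(probabilities, key=probabilities.get)] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical per-block predictions for a single binary.
blocks = [
    {"CWE-121": 0.34, "CWE-122": 0.33, "CWE-416": 0.33},  # uninformative, discarded
    {"CWE-121": 0.05, "CWE-122": 0.90, "CWE-416": 0.05},
    {"CWE-121": 0.80, "CWE-122": 0.15, "CWE-416": 0.05},
    {"CWE-121": 0.10, "CWE-122": 0.85, "CWE-416": 0.05},
]
print(vote_on_binary(blocks))  # CWE-122
```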

Also, for fun, I plotted the distribution of the threshold metric. If I’m right about some basic blocks being common to all the CWE classes and others being unique, we should see a bimodal distribution of this metric.

Threshold metric distribution for Bernoulli naive Bayes on binary-featured dataset

Threshold metric distribution for multinomial naive Bayes on binary-featured dataset

Threshold metric distribution for multinomial naive Bayes on integer-featured dataset

The bimodal distribution hypothesis holds, if only barely. It’s not very pronounced, but it can be seen in each of the three distributions. This is useful information to have, because we could have chosen other metrics by which to create a threshold. It’s possible that another metric would plot a more emphatic bimodal distribution. Likely such a metric would perform better as a threshold.

Calculating The Baseline

The next thing to do is calculate the accuracy and confusion matrices for the baseline trial with the labels in the dataset randomized. I calculated this for each of the three classifiers, and each returned a very similar confusion matrix. One is shown below.

An example confusion matrix produced from random noise baseline testing

You’ll notice that the classifier is classifying nearly everything as CWE-121, which is exactly as expected. We’ll try to improve on this overfitting with our different training strategies.

The accuracy of each classifier under different conditions was also pretty similar between classifiers in the random trials. To calculate these accuracies, twenty trials were run. The data was randomized differently for each trial and split into training, validation, and testing datasets containing 1/3 of the data each.

Baseline accuracies for the different classifiers, trained on randomly labeled data

Unsurprisingly, using a threshold on the randomly labeled data reduces the accuracy by an enormous amount. And while it’s not shown here, it’s also worth noting that using a threshold on this data reduces the dataset by a significant amount. We expect more of the data to be preserved during normal trials.

Training The Classifiers

Finally we’re ready to evaluate our classifiers on real data with some different parameters. I evaluated each classifier under three different conditions — first, training and testing for basic block classification without any threshold value, second, basic block classification with a threshold value of 0.5, and third, whole binary classification with a threshold value of 0.5 using the basic blocks to vote on a label for their binary.

Each of these conditions was tested with twenty trials of randomized data, with a training/validation/testing split of 1/3 of the dataset for each. The results for each of the three classifiers are shown below.

Bernoulli naive Bayes accuracy on binary-featured dataset

Multinomial naive Bayes accuracy on binary-featured dataset

Multinomial naive Bayes accuracy on integer-featured dataset

The first thing you’ll notice is that the accuracy of each of the classifiers at basic block classification is lower than the random noise. This seems like a bad sign, but it is also odd. Any sort of significant deviation from the random baseline implies that the classifier is picking some kind of pattern out of the data. We turn to the confusion matrices to try to diagnose the difference between the random baseline and the actual run.

Confusion matrix for Bernoulli naive Bayes applied to binary-featured dataset for basic block classification

While the random baseline classifier overfit to CWE-121, there is some evidence here that our properly-trained naive Bayes classifier does not overfit as strongly. In particular, CWE-119 and CWE-416 are guessed quite often as well. In addition, we are able to correctly classify a significant number of blocks from CWE-416 and CWE-122, in addition to CWE-121. Unfortunately, this also causes many CWE-121 basic blocks to be incorrectly guessed as other classes. Realizing that this is likely due to the poor labeling of the dataset, it seems we can say that there is some predictive value in the opcode features extracted from the basic blocks, though there’s too much noise for the classifier to produce an acceptable accuracy.

The other unfortunate observation about the classifier accuracies is that applying a threshold does not increase the accuracy of the classifier. In the best case, it only reduces it by about 0.01 — which is a marked improvement over the baseline, but not terribly helpful. Voting even further decreases the accuracy, debunking the theory that we can use the probability outputs from naive Bayes for whittling down the list of informative basic blocks.

The takeaways

As with all good science, just because something doesn’t work out the way you want or expect it to, doesn’t mean there isn’t a reason. I looked through the documentation for the naive Bayes classifier implementation I used from scikit-learn, attempting to find something to help me gain some insight into the probability outputs, and ran across this lovely gem:

That's a note from scikit-learn.org about naive Bayes classifiers.

…so the idea of using the probabilities is fundamentally flawed, and in order to implement this, I need to use another classifier. Back to the drawing board. However, this experiment hasn’t been a waste. I’ve learned valuable information about my data composition, and eliminated a possible classifier from the list of candidates. Better, I learned that the data does have some informative value! Next up is identifying an appropriate replacement classifier and continuing the research.

Teaching A Machine To Identify Vulnerabilities (Part 2)

Dizzyspiral — Sun, 24 Jul 2022 23:33:00 +0000

This is a writeup I've moved here from my personal website. It was originally published on 4/13/19

In my previous post, I talked about the data processing needed to turn a bunch of binaries into a dataset for use with a machine learning classifier. While I said that talking through machine learning basics is out of scope for this series, I do want to talk through a bit about how the Naive Bayes classifier works, why I’ve chosen it, and how I plan to exploit the particular way in which it handles my data to do some cool things.

If you recall, we have a dataset whose rows look something like this:

Naive Bayes is an inference classification algorithm. Inference relies on logical properties in order to provide a guess as to how likely it is that a given observation should belong to a particular class. In our case, Naive Bayes will be telling us how likely it is that a particular basic block belongs to one of our CWE classes. It will be doing this by looking at the opcodes present in that basic block, comparing that to basic blocks that it saw when it was trained, and returning a probability for each possible CWE class indicating how likely it is to be that CWE. We can use this to do something clever.
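To show where those per-class probabilities come from, here's a toy Bernoulli naive Bayes written from scratch. This is a sketch for intuition only, not the classifier used in this series. The three feature columns (say, presence of "mov", "call", and "cmp"), the training rows, and the Laplace smoothing constants are all assumptions made for the example.

```python
import math
from collections import defaultdict

def train(rows, labels):
    """Estimate class priors and per-class opcode presence probabilities."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    priors = {c: len(members) / len(rows) for c, members in by_class.items()}
    # Laplace smoothing (+1 / +2) keeps every probability away from 0 and 1.
    likelihoods = {
        c: [(sum(column) + 1) / (len(members) + 2) for column in zip(*members)]
        for c, members in by_class.items()
    }
    return priors, likelihoods

def predict_proba(row, priors, likelihoods):
    """Return a probability for each class given one row of 0/1 features."""
    log_scores = {}
    for c in priors:
        score = math.log(priors[c])
        for present, p in zip(row, likelihoods[c]):
            score += math.log(p if present else 1 - p)
        log_scores[c] = score
    # Normalize so the class probabilities sum to 1.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: value / total for c, value in unnormalized.items()}

# Hypothetical training data: each row marks which opcodes a block contains.
rows = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1]]
labels = ["CWE-121", "CWE-121", "CWE-416", "CWE-416"]

priors, likelihoods = train(rows, labels)
probabilities = predict_proba([1, 1, 0], priors, likelihoods)
print(probabilities)
```

A block matching the CWE-121 training rows comes back at roughly 93% CWE-121 and 7% CWE-416. The "naive" part is the assumption that each opcode's presence is independent of the others given the class.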

There is something fundamentally wrong with the data. If you haven’t spotted it already, don’t worry, it’s subtle. When I created the dataset, I chose to create many observations from a single binary. Namely, I created one observation for each basic block in that binary. Vulnerabilities are generally localized to a small set of the basic blocks within a binary — however, I have labeled entire binaries with a CWE class. This means that there are many basic blocks, i.e. observations, which are incorrectly labeled, as they do not actually contain a vulnerability. That poor labeling is going to affect the training of the classifier, and in turn, adversely affect accuracy.

Knowing what I do about binaries, though, I can use our classifier to cheat a little. What I know is that binaries often have some similar code, e.g. setup and teardown code, string manipulation code, console or network code, etc. Binaries which serve similar purposes often use similar code and therefore share more similar blocks, and conversely binaries which do drastically different things have fewer blocks in common. I make the hypothesis that blocks that are common amongst binaries from many different CWE classes are not indicative of a vulnerability, whereas blocks which are more unique to a CWE class are good indicators of that vulnerability. Because naive Bayes is an inference algorithm, and it returns for us a probability for each CWE class when classifying a basic block, we can inspect this probability distribution at classification time and determine if that block skews heavily toward a single class or if the classifier thinks it’s seen it before amongst many different classes. For example…

If we have five different CWE classes we are classifying across, a block which has been seen before in all five CWE classes might return a probability from naive Bayes something like this:

Class 1: 20%
Class 2: 23%
Class 3: 21%
Class 4: 19%
Class 5: 17%

On the other hand, a basic block that has only been seen before in one category might return a distribution like this:

Class 1: 2%
Class 2: 3%
Class 3: 93%
Class 4: 1%
Class 5: 1%
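One simple way to put a number on "skews heavily toward a single class" is the gap between the highest and lowest probability in the distribution. That choice of measure is an assumption for this sketch. Applied to the two example distributions above:

```python
# The two example distributions from above, as fractions.
common_block = [0.20, 0.23, 0.21, 0.19, 0.17]
unique_block = [0.02, 0.03, 0.93, 0.01, 0.01]

def spread(probabilities):
    """Gap between the most and least likely class."""
    return max(probabilities) - min(probabilities)

print(round(spread(common_block), 2))  # 0.06
print(round(spread(unique_block), 2))  # 0.92
```

A small gap means the classifier sees the block as about equally at home in every class, and a large gap means it strongly favors one.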

It’s worth noting that while I say “a basic block that has been seen…”, in reality any basic block that is similar-ish to a basic block that’s been seen before, i.e. differs in a handful of features, will be classified in the same way. That’s the power of machine learning: we train our classifier to recognize things it’s never seen before, using data that is hopefully representative of the problem space.

So now that we understand how the classifier is being used, let’s give it a go!

Trial 1: Start Simple

For the first attempt, I kept everything as simple as possible. I ignored all of the clever stuff I mentioned about using probabilities to classify blocks as significant or not. While I had intuition that those things would be important, it’s important to challenge and validate your assumptions before acting on them. So for this trial, I randomized over the entire dataset, not bothering to keep samples from the same binary together, and I simply had my classifier predict the CWEs it thought each block had.

This did not go well.

I mean it could have gone worse. My classifier achieved a ~25% accuracy, predicting across 27 CWE classes. I plotted a confusion matrix, so I could see if it was doing anything shady, like only classifying samples as class 121 or 122 (because if you recall, those are the classes that the data was skewed toward).

The errors are all over the place. Most classes do get misclassified as 121 more often than others, but not overly so. The fact that the errors are so spread out across the different classes is encouraging, because it may indicate that my approach has credence. It’s possible that the basic blocks that are getting misclassified are pretty common across most classes, and therefore get misclassified as any of them.

To figure out if this was true, I took a look at the probabilities that were generated during classification. I wanted to get a general sense of what the probability spread looked like, so I calculated some basic stats.

Highest probability ever seen: 0.9560759838758044
Lowest probabilities ever seen: 2.4691830828330134e-09
Average difference between in-sample highest and lowest probabilities: 0.32504693947187246
Standard deviation of in-sample probability differences: 0.06998244753425269

Hm. It seems like on average, the classifier is pretty confident about what it’s classifying. But the average could be skewed by it being confident classifying the blocks with vulnerabilities, and far less confident in the others. If that were the case, I’d expect to see a bi-modal distribution of the in-sample probability differences. So, let’s plot it.

That could somewhat be considered a bi-modal distribution. Not quite the separation that I would have hoped between the two modes, but there is certainly a second spike in the graph. You’ll notice that our average sits right in between the two spikes, at around 0.3.

At this point, I still have no idea if the data points that have a higher probability difference actually classify better than the ones that do not. There’s one easy way to find out, though — try it. I wrote some code to drop classified blocks that had an in-sample probability difference of less than 0.4, and then looked at the results.

Trial 2: Probability Inspection

Correctly identified 90/347
Accuracy: 0.259365994236

Not terribly convincing. Let’s try upping the threshold to 0.45.

Correctly identified 42/137
Accuracy: 0.306569343066

Accuracy is increasing, but not by as much as I’d like. And the confusion matrices still have the same general spread. Honestly, it doesn’t look like the difference between the highest and lowest probability in a prediction is correlated with the accuracy of the classifier at all. I gave it another run at 0.5.

Correctly identified 17/67
Accuracy: 0.253731343284

…and the accuracy went down. I’m willing to believe that it’s just not going to have an effect.

It’s time to take stock of where we are and rethink the approach. Using the predicted probabilities to weed out uninformative basic blocks may not work. However, I would still like to try keeping each binary’s blocks together rather than splitting them across training and testing data, to eliminate the possibility that the split is preventing the classifier from learning what common basic blocks are. It seems unlikely, but it’s possible. It’s also possible that our data needs better labeling to be useful; that will take time, which I don’t believe I have for this project. We can also re-featurize the data to include counts of the number of times an instruction occurred in the basic block, rather than just whether or not it occurred. This will give the classifier more information to work with, which may improve its accuracy.

Next up: part 3

Teaching A Machine To Identify Vulnerabilities (Part 1)

Dizzyspiral — Sun, 24 Jul 2022 23:32:00 +0000

This is a writeup I've moved here from my personal website. It was originally published on 4/13/19

For the past couple of months, I’ve been heavily immersed in a graduate level machine learning course as part of my degree program. I haven’t been able to post about the cool work I’ve been doing because that would enable others to cheat (tsk tsk), but now I’m working on my final project and I would like to informally share my experience training a machine learning classifier to identify vulnerabilities in binaries.

Before we get too deep into this, please be aware that this is an ongoing project and I do not currently know if this technique will work. This series is meant to provide some insight into the practical application of machine learning, regardless of whether or not the results are positive (that’s science!).

The machine learning techniques we covered in the course are considered classical machine learning. We covered supervised and unsupervised learning and implemented several well-known algorithms from scratch. I’m not going to teach you the basics of machine learning in this post — if you’d like to learn how to choose a classifier or how machine learning algorithms work, I recommend perusing this curated list of tutorials. Because the course focused on classical techniques, I chose to also focus my attention there and ignore the more shiny neural networks and deep learning options. My project sets out to show that a straightforward inference technique, Naive Bayes, can be used to provide valuable information for a real problem. That problem is identifying vulnerabilities in binary code.

Binary code is the 0’s and 1’s that a program is made up of. If you’re a software developer, it’s the thing that your source code gets compiled into. If you’re a regular user, it’s the thing that you double click to launch an application. Binary code is made up of machine instructions that your CPU executes. These instructions essentially have two parts: the opcode, and any arguments.

All of these instructions and opcodes are packed into your binary file one right after the next. Some instructions take up more space than others, and their arguments can also be of varying length. If you open up a binary in a text editor, it looks something like this:

Because the size of instructions and their arguments can vary, we need a special tool called a disassembler to decode that binary file for us so that we can view the opcodes. Disassemblers do something else that’s cool, too, which is show you how logic flows through your program by decomposing it into blocks called “basic blocks” and drawing arrows between those blocks where there are jumps or conditionals that link them.

Some basic blocks from the “yolodex” CGC binary, as disassembled by BinaryNinja

My approach to discovering vulnerabilities relies on this block-level decomposition. I featurize binaries into basic blocks, enumerate the opcodes present in those blocks, and use that block-level data as one observation. I believe this may work because it is the series of instructions that a program executes that makes it vulnerable. Specific sequences, and their associated arguments, can leave a program vulnerable to buffer overflows or use-after-free or many other vulnerabilities. At the block level, we have a fairly decent picture of which instructions are associated with one another in a sequence, and thus may be able to draw some conclusions about whether the program is vulnerable there.

The different kinds of vulnerabilities that can crop up in programs are well enumerated by the Common Weakness Enumeration (CWE). My approach attempts to classify which of these CWEs a given binary may manifest. Of course, to train a machine learning classifier to recognize vulnerabilities, I have to have some kind of training data set. I am using DARPA’s Cyber Grand Challenge set of Challenge Binaries as the basis for my training data. These binaries were written to contain vulnerabilities, and have readme files that describe the vulnerabilities in detail, including which CWEs they fall under. The first step on the road to creating my classifier is featurizing the binaries as I described above.

As I said, a regular text editor won’t do for reading a binary file, so I needed to choose a disassembler to break the challenge binaries out into their basic blocks. I chose to use Binary Ninja because it has a very easy-to-use Python API, and it’s hobbyist-level cheap (for comparison, the industry-standard disassembler is IDA Pro, which they will sell to you for roughly an arm, and continue to pick off your fingers and toes with renewal fees). I began by writing a quick script to go through a single binary and print out the opcodes it encountered in each block, just to validate that I was able to acquire the data I wanted.

The output of a Python script using Binary Ninja’s API to print basic blocks

Awesome. Step 1 complete. But none of this data is labeled. I want to label my basic blocks with the CWE that the binary contains. The CWEs are all described in a bunch of README.md files with no standard format. I briefly considered writing a script to pull the CWE labels out of the READMEs, but decided that the amount of time I would spend debugging edge cases and validating that the script actually grabbed the correct CWE numbers would be at least as much as just doing it by hand. So away I went, plugging CWE numbers into a spreadsheet to create a CSV mapping binaries to their CWEs.

Spreadsheet of CGC binaries labeled with their CWEs

This took me some hours, most of which I spent watching YouTube (bigclive’s teardowns are always a good way to spend some time). Rote tasks are good opportunities to learn a thing or two. Or to watch a show, pick your poison.

The first thing I noticed going through my labels is that some binaries are labeled with more than one CWE. This makes sense, but wasn’t something I had considered in my original approach. To simplify the problem, I made the (somewhat bad) decision to discard samples that are labeled with more than one CWE. There are better ways to handle this problem, but I only have three weekends to train, evaluate, and present my classifier, so I unfortunately do need to cut some corners. I calculated some basic statistics on my dataset to gain some insight into its composition.

Number of binaries: 108
Number of unique CWEs: 27

Distribution of binaries with respect to CWEs

Most of my samples are for CWEs 121 and 122. It’s important to note that this does not necessarily mean that my dataset is unrealistically biased. In fact, CWEs 121 and 122 are “Stack-based Buffer Overflow” and “Heap-based Buffer Overflow,” which are two very common vulnerabilities. That said, it’s worth being mindful of this skewing, as it will impact how our classifier trains.
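For illustration, here's roughly how the multi-label filtering and the distribution count could be done once the spreadsheet is loaded. The binary names and labels below are invented, and the dict-of-lists layout is an assumption about how the CSV might be represented in memory.

```python
from collections import Counter

# Hypothetical mapping of binary name -> CWE labels; names and labels are made up.
labels_by_binary = {
    "binary_a": ["CWE-121"],
    "binary_b": ["CWE-122"],
    "binary_c": ["CWE-121", "CWE-416"],  # more than one CWE, so it gets discarded
    "binary_d": ["CWE-121"],
}

# Keep only binaries labeled with exactly one CWE.
single_label = {name: cwes[0]
                for name, cwes in labels_by_binary.items()
                if len(cwes) == 1}

distribution = Counter(single_label.values())

print("Number of binaries:", len(single_label))      # 3
print("Number of unique CWEs:", len(distribution))   # 2
print(distribution.most_common())  # [('CWE-121', 2), ('CWE-122', 1)]
```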

Putting it all together, we now have a dataset where each row describes the instructions in one basic block, labeled with the CWE represented by the binary that the basic block was taken from. We’re ready to try to train a classifier.

Sample rows from the featurized data CSV

Next up: part 2

The argument for terminal multiplexers

Dizzyspiral — Sat, 16 Jul 2022 15:12:27 +0000

I went years without using a terminal multiplexer. I thought they were for kids who thought they were l33t hax0rz. My approach to tooling has always been minimalistic - I'm all for trying new things out, but if there's no significant improvement in my experience after a couple of days, I drop the tool.

Despite the learning curve, I saw immediate improvements in my quality of life when I started using tmux. It was definitely a "why didn't I do this sooner" moment.

Embarrassing though it is in hindsight, I'll share how I used to work before learning tmux. I ssh into a dev VM for my work. I needed multiple windows, so I'd open up about six terminal instances and ssh each one into the dev VM. I'd use my window manager to snap the terminal windows how I liked them, and click between them when I wanted to work in a different window. If I needed to suspend or reboot my machine, I'd close all of the ssh sessions and then re-establish them next time, meaning I had to re-navigate to the files I was working on and redo any other working state in order to pick up my work.

^ What tmux looks like for me - here's a post on my customizations

I cannot tell you how painful that sounds now.

Because tmux (or screen, or any other terminal multiplexer) runs on the machine you're ssh'd into, you can detach, close your ssh session, then re-connect later and pick up exactly where you left off. This nicety alone is worth it, and is why I decided to use tmux in the first place. But you also get:

The ability to switch windows using the keyboard (in a much more intelligent and seamless way than ctrl+tab).
Effortless spawning and closing of terminal sessions on the machine you're ssh'd into, making it possible to open a disposable terminal to run a single command and then close it.
A sensible starting directory: each new terminal pane starts in the directory where you established the tmux session, so if you cd into your project before starting tmux, every new window has a base directory there.

But wait, there's even more! Say you have multiple inter-dependent projects. You can open a session for each one, and switch between those sessions easily. It's like the equivalent of swapping workspaces on your desktop. Super handy.

Have I convinced you to try tmux? Check out what I did to make tmux work better for me, because out of the box it has a pretty terrible user experience.

Cheatsheet

Here are all the tmux commands I use to do the stuff I just talked about.

Switch sessions

<prefix> + s
arrow keys up and down to highlight session
enter to select session

New session

tmux new -s <session name>

New pane

Split the current pane horizontally in half (one pane on top, one on the bottom):

<prefix> + "

Split the current pane vertically in half (one pane on the left, one on the right):

<prefix> + %

Detach from session

<prefix> + d

Attach to session

tmux a

Making tmux suck less

Dizzyspiral — Sat, 16 Jul 2022 15:12:20 +0000

tmux is a great tool. But it's basically unusable out of the box. Try it - fire it up in a bare environment without a tmux config file. You'll have no colors. You'll have weird keybinding issues. The vim clipboard will be broken. Copying text with your mouse will be broken (unless you really did want to copy random bits of text from all of your panes, and not just the active one). In short, you'll have a terrible user experience.

I foolishly thought tmux would inherit all of the niceties of my base terminal emulator. As it turns out, tmux behaves a lot more like a standalone terminal than an add-on to your existing one. I can respect a tool that assumes nothing about its use, but it does mean you have to do some configuring to make it work well for you. Here's what I did:

1. Enabling colors

Add this to your .tmux.conf

set -g default-terminal "xterm-256color"

2. Fixing weird keybind issues

For me, adding this to my config solved it:

set-window-option -g xterm-keys on

It's been a while, so I'm not totally sure what problem it was that it solved, but I think it was this

3. Highlighting and copying text with the mouse

Credit to this SO post

Add this to your .tmux.conf:

set -g mouse on
bind -n WheelUpPane if-shell -F -t = "#{mouse_any_flag}" "send-keys -M" "if -Ft= '#{pane_in_mode}' 'send-keys -M' 'select-pane -t=; copy-mode -e; send-keys -M'"
bind -n WheelDownPane select-pane -t= \; send-keys -M
bind -n C-WheelUpPane select-pane -t= \; copy-mode -e \; send-keys -M
bind -T copy-mode-vi    C-WheelUpPane   send-keys -X halfpage-up
bind -T copy-mode-vi    C-WheelDownPane send-keys -X halfpage-down
bind -T copy-mode-emacs C-WheelUpPane   send-keys -X halfpage-up
bind -T copy-mode-emacs C-WheelDownPane send-keys -X halfpage-down

and install xclip

sudo apt install xclip

Turning mouse mode on will also enable you to scroll each pane with the mouse wheel, and select panes with the mouse.

4. Set prefix to `

I prefer the prefix to be a single key. I thought using ` as the prefix and enabling entering it when pressing the key twice quickly was an elegant solution.

unbind C-b
set -g prefix `
bind-key ` send-prefix

5. Add a theme

While not strictly necessary, adding a theme to tmux really does make it look way cooler. On a functional note, though, some themes update what information is shown in the bottom status bar by default, which can really improve the user experience.

I went with a simple powerline theme from the tmux-themepack.

Putting it all together

Here's my .tmux.conf:

set-window-option -g xterm-keys on
set -g default-terminal "xterm-256color"
unbind C-b
set -g prefix `
bind-key ` send-prefix

source-file "${HOME}/repos/tmux-themepack/powerline/default/purple.tmuxtheme"

set -g mouse on
bind -n WheelUpPane if-shell -F -t = "#{mouse_any_flag}" "send-keys -M" "if -Ft= '#{pane_in_mode}' 'send-keys -M' 'select-pane -t=; copy-mode -e; send-keys -M'"
bind -n WheelDownPane select-pane -t= \; send-keys -M
bind -n C-WheelUpPane select-pane -t= \; copy-mode -e \; send-keys -M
bind -T copy-mode-vi    C-WheelUpPane   send-keys -X halfpage-up
bind -T copy-mode-vi    C-WheelDownPane send-keys -X halfpage-down
bind -T copy-mode-emacs C-WheelUpPane   send-keys -X halfpage-up
bind -T copy-mode-emacs C-WheelDownPane send-keys -X halfpage-down

# To copy, left click and drag to highlight text in yellow.
# Once you release left click, the yellow highlight will disappear and the text will automatically be available in the clipboard.
# Use vim keybindings in copy mode
setw -g mode-keys vi
# Update default binding of `Enter` to also use copy-pipe
unbind -T copy-mode-vi Enter
bind-key -T copy-mode-vi Enter send-keys -X copy-pipe-and-cancel "xclip -selection c"
bind-key -T copy-mode-vi MouseDragEnd1Pane send-keys -X copy-pipe-and-cancel "xclip -in -selection clipboard"

Cmocka Quick Start

Dizzyspiral — Tue, 07 Jun 2022 20:39:35 +0000

Today I have occasion to learn a C unit test framework again. This comes about in my career every 3 years or so, so each time, I do a quick re-survey of what's available and pick something that looks good. This time, I've picked cmocka.

The main page links to the API documentation and a slideshow. Neither of these, in my quick perusal, actually give you an example of building a project with Cmocka. I really appreciate it when there's some kind of quickstart skeleton project you can just roll with, so I've taken their first example from the API and done just that.

The first thing you'll want to do to follow along here is download cmocka. I grabbed the source release of 1.1.5.

Untar the archive:

tar -xzvf cmocka-1.1.5.tar.gz

Build the cmocka library. For my project, I want the cmocka library to be statically compiled, so I configure cmocka with cmake accordingly:

cd cmocka-1.1.5
mkdir build
cd build
cmake -DWITH_STATIC_LIB=ON ..
make

cmocka will have built the libcmocka-static.a library to the build/src directory. This is what you'll need to link your tests against to get them to build. Using the static cmocka library means that the library will be copied into the binary that's created. This makes it so that the dynamic linker doesn't have to find libcmocka at runtime. This is handy if you don't want to install the libcmocka library on your system, or if you're going to be running your tests in an environment other than your development environment. Be aware though that this does make your binary larger and has other portability concerns attached to it, so in lots of (production) cases, static libraries are not the way to go.

Next, make yourself a test.c file with the following contents:

#include <stdarg.h>
#include <stddef.h>
#include <setjmp.h>
#include <cmocka.h>
/* A test case that does nothing and succeeds. */
static void null_test_success(void **state) {
    (void) state; /* unused */
}
int main(void) {
    const struct CMUnitTest tests[] = {
        cmocka_unit_test(null_test_success),
    };
    return cmocka_run_group_tests(tests, NULL, NULL);
}

and build your test file:

gcc -I cmocka-1.1.5/include/ -L cmocka-1.1.5/build/src/ test.c -l:libcmocka-static.a

This will have produced an a.out binary. Run it!

$ ./a.out 
[==========] Running 1 test(s).
[ RUN      ] null_test_success
[       OK ] null_test_success
[==========] 1 test(s) run.
[  PASSED  ] 1 test(s).

And there you go. If you want to write unit tests for real C code, the only difference is you need to #include your headers for the source you're testing, and compile the source files you're testing along with the tests.

Creating A Spotify to YouTube Music Playlist Converter

Dizzyspiral — Sun, 12 Sep 2021 02:55:02 +0000

Photo by Kaboompics.com from Pexels

Just want the code? Got ya covered. GitHub

Right. So I've been using Spotify for some years now. I have a collection of carefully curated playlists and liked music. I'm loath to move off the platform and start over with another platform. But, YouTube music comes free with my YouTube premium subscription... so, it's a bit silly for me to keep shelling out $$ to Spotify each month. What a conundrum.

Doing a quick search, there are plenty of third-party apps that purport to transfer your playlists for you. You just log into your Spotify and YouTube accounts through their webapp and away you go. I find this kind of service... sketchy, though - you're essentially giving that site your logins/auth tokens for your music accounts. The service is free, so I'd put money on them scraping your accounts for metadata they can sell. And if you've read any of my other posts, you probably know how much this irks me. I value my privacy. And perhaps, you value yours.

The good news is these webapp services aren't doing anything that we can't do ourselves. They're accessing the Spotify and YouTube developer APIs to grab data from one account and shove it into the other. Building a tool to do this from scratch wouldn't take too long, probably. But this seems like something a lot of people would find useful - which makes it super likely someone in the open source community has already done something like it. Let's check github.

There's this project, exportify, that exports playlists from Spotify into CSV format. This sounds great. CSVs are easy to work with, so we should be able to take this data and import it into YouTube music, even if we have to do a little massaging to make it the right format.

We could download their source and try to make it run, but it looks like the developer hosts the webapp here. While this is basically the same setup as the third party web apps I detest, I have a lot more trust in the open source community than I do random results from Google. I'll use it.

Okay, after logging into my Spotify account via exportify, I get a list of all my playlists. I grabbed my liked songs and a couple other playlists. The CSVs are full of all sorts of spotify-specific data that we probably don't need.

Let's take a look at what formats YouTube Music accepts for import. Oh, ugh. YouTube Music doesn't appear to offer any sort of convenient import function. What's more, it doesn't seem to have any API of its own - the YouTube API is basically also its API, maybe. That seems about right for Google, which is making all the same mistakes that Microsoft did back in the 90s.

Anyway, I digress. There exists this unofficial API, which we might be able to use. Let's give it a try.

The API is python (yessss), so we'll make a virtual environment for playing with the API. This is always good practice, as it keeps one project's dependencies from interfering with another. If you install all of your python packages to your global python install, after a while you can end up with conflicts that break things in weird and unexpected ways. Just better to compartmentalize.

$ virtualenv -p python3 venv

That creates a virtual environment for python3 in a new directory, venv. I'm on Linux, so that's what the terminal command looks like. I haven't used Windows seriously in a very long time - if you're a Windows or Mac user, you're gonna have to find your own way (on Mac it's probably the same thing).

source venv/bin/activate

will activate the virtual environment. Now we're playing in the sandbox with a fresh python3 install. Let's grab the unofficial YouTube Music API package.

pip install ytmusicapi

Follow the rest of the setup instructions to get an authenticated session. You can press ctrl+shift+c in Chrome to get to the developer console, and click on the Network tab to see the network traffic. You may have to click on some content or search for some music before you'll have the appropriate POST request to snag.

Their instructions were a bit unclear to me - copy-pasting all the header data from accept onward didn't really work... probably because without some tweaking, the copy-paste content is an improperly formed JSON file. All you actually need for auth is the cookie, so I used the "manual file creation" method and pasted in my own cookie, and that worked fine.

Now that we're auth'd, we can play around with the API. I tried a quick song search to verify that the API seemed to work as expected, using a few lines of code:

from ytmusicapi import YTMusic
ytmusic = YTMusic("auth_file.json")
search_results = ytmusic.search('Oasis Wonderwall')
print(search_results)

It barfed back some results in JSON format. Lovely.

Wandering through the reference for the API, I'm particularly interested in create_playlist and add_playlist_items. I'll try creating a test playlist, and see if it shows up in my account.

from ytmusicapi import YTMusic
ytmusic = YTMusic("/home/dizzyspiral/projects/spotify-to-youtube/auth_file.json")
playlist_id = ytmusic.create_playlist(title="Test Playlist", description="Testing 1, 2, 3")
print(playlist_id)

It worked. Excellent. Now, about add_playlist_items... It requires a YouTube video ID in order to add a song, so it looks like we'll have to first search for a song, grab the corresponding music video ID, and add it to the playlist using that. But when we searched for Wonderwall, we got a bunch of results. How do we know which one is the "right" one...

Nothing for it - we don't. But the search results are ordered from most to least relevant, with the "top result" labeled in the JSON and appearing first. I'll just trust that that's the right one.

Okay, now we come back around to our exported Spotify playlists. We need the song title and artist in order to search for the song using ytmusicapi. Each row of the exported playlist CSV is a song, and the track name and artist name are in the second and fourth fields, or indices 1 and 3, respectively. Python has a csv module for handling CSVs. We'll write a little code to import and read the CSV.

import csv
from ytmusicapi import YTMusic

def load_spotify_playlist(playlist_file):
    songs = []

    with open(playlist_file) as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')

        for row in reader:
            # Separate title and artist with a space so the search string reads naturally
            songs.append("{} {}".format(row[1], row[3]))

    return songs
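
If the exported CSV starts with a header row, that row would end up in the list of search strings too. Here's a minimal sketch of a variant that skips it and joins title and artist with a space, assuming (as above) that the track name and artist sit at indices 1 and 3. The name load_songs, the has_header flag, and the sample rows are made up for illustration:

```python
import csv
import io

def load_songs(csvfile, has_header=True):
    """Return "track artist" search strings from an open CSV file."""
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    if has_header:
        next(reader, None)  # discard the header row, if there is one
    songs = []
    for row in reader:
        if len(row) > 3:  # ignore blank or short rows
            songs.append("{} {}".format(row[1], row[3]))
    return songs

# Made-up sample data in the assumed column order
sample = io.StringIO(
    'URI,Track Name,Album Name,Artist Name\n'
    'x1,Wonderwall,"(What\'s the Story) Morning Glory?",Oasis\n'
)
print(load_songs(sample))
```

Taking an open file object rather than a path makes it easy to try out on an in-memory string before pointing it at a real export.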

As we noted earlier, the top result is returned first by the search API call. But, the very first result is some kind of "top result" result, which actually doesn't contain a video ID. And we need the video ID. The second result should have it. (My guess here is that the "top result" thing is just metadata for displaying the top result separately from the rest of the results, since these "API" calls are really only intended to be used by the actual YouTube Music app(s).)

def search_for_songs(ytmusic, search_strings):
    video_ids = []

    for s in search_strings:
        result = ytmusic.search(s)
        video_ids.append(result[1]['videoId'])

    return video_ids
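
Relying on a fixed index is a bit brittle if the shape of the results ever changes. A hedged alternative is to take the first result that actually carries a videoId. This sketch assumes the results are a list of dicts, as the indexing above implies. first_video_id is a hypothetical helper and the sample data is made up:

```python
def first_video_id(results):
    """Return the videoId of the first result that has one, else None."""
    for item in results:
        video_id = item.get('videoId')
        if video_id:
            return video_id
    return None

# Made-up results: the first entry has no videoId, the second does
fake_results = [
    {'category': 'Top result', 'title': 'Wonderwall'},
    {'title': 'Wonderwall', 'videoId': 'abc123'},
]
print(first_video_id(fake_results))  # abc123
```

In search_for_songs, this could stand in for result[1]['videoId'], with any song that comes back as None skipped or logged instead of crashing the run.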

Finally, to put it all together, we'll make a main function:

if __name__ == '__main__':
    print("Connecting to YouTube Music...")
    ytmusic = YTMusic("auth_file.json")
    playlist_file = "exported_playlist.csv"
    print("Loading exported Spotify playlist...")
    songs = load_spotify_playlist(playlist_file)
    print("Searching for songs on YouTube...")
    video_ids = search_for_songs(ytmusic, songs)
    print("Creating playlist...")

    # See if the playlist exists already before creating it
    try:
        res = ytmusic.search("Some Playlist", filter="playlists", scope="library")
        playlist_id = res[0]['browseId']
    except:
        playlist_id = ytmusic.create_playlist("Some Playlist", "Some description")

    print("Adding songs...")

    for video_id in video_ids:
        ytmusic.add_playlist_items(playlist_id, [video_id])

    print("Done!")

While the API claims that add_playlist_items takes a list of strings as an argument, for whatever reason, that doesn't actually work. It silently fails to add any songs. So, we just iterate through our list of strings, artificially making them a list of one for each API call. It takes a bit longer, but it gets the job done.

And that's it. It's not pretty, but it was a one-day hack. I'll probably toss the code on github after cleaning it up a bit.

Cheers.

Update: Code is here.

Making a COVID warning light

Dizzyspiral — Sun, 20 Dec 2020 17:07:55 +0000

This post is designed for beginners at Linux and circuits. If you've done a project like this before, maybe just skip to the code.

With COVID-19 on the upswing again, I thought it would be nice to build a small appliance that could tell me how bad it is outside, without having to look up the local statistics. Here's what I used:

1x Raspberry Pi Zero W
1x 8GB micro SD card
1x red LED
1x yellow LED
2x 200 ohm resistor
1x breadboard

The general idea is to create a device that scrapes the local COVID infection data and turns on a yellow light if the situation deserves a warning, and red if the situation is dire. We'll define what those situations mean in terms of data later.

Because web scraping is full of sadness, we'll be using the reliable covidtracking.com API. This will give us data at state granularity. As a bonus, if you live in the US, you can easily follow along and make one of these to track the pandemic in your state.

Let's start by building out the Pi. Grab a copy of Raspberry Pi OS Lite and Balena Etcher. Decompress the Raspberry Pi OS image. Insert your SD card in your computer, run Balena, and flash the decompressed image to your SD card. You should end up with an SD card that now has boot and rootfs partitions. Congratulations, you've installed Raspberry Pi OS for your Raspberry Pi Zero W.

That doesn't do us much good if we can't log into the pi once it's booted, so before we eject the SD card from our computer, let's set up networking. Navigate to the boot partition of the SD card. Create an empty file called "ssh" in the top-level of the boot partition. If you're on linux, you can do this in the terminal with:

touch ssh

This will let us log into the pi using an SSH client after it's booted. Next, let's give the pi the network configuration for our local wireless network. Create and open the file "wpa_supplicant.conf" in the top-level of the boot partition on the SD card. Copy and paste in the following contents, replacing the SSID and PSK with that of your wireless network:

country=US
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
    ssid="NETWORK-NAME"
    psk="NETWORK-PASSWORD"
}

Now go ahead and eject your SD card, and install it in the Pi Zero W. Power it up, and it should come online. If your local network has local DNS, you can probably wait a few minutes and then log in using pi@raspberrypi as the address. But most of us will need an IP address. How you determine the IP address of your Pi Zero W will vary depending on your local router, but generally you can go to 192.168.1.1 in your web browser, and it will bring you to a log-in page for your router. Log in (if you don't know what the credentials are, try looking on the bottom of the router, or searching for "default username and password [router brand]". Often, it's admin:admin or admin:password). Look for a page that shows the active DHCP leases. If you see one for a device called raspberrypi, that's probably it! If you don't see any hostnames, grab whatever the most recently leased IP is. Try that with the instructions below, and if it doesn't work, grab the next most recent, and so on. If you still can't get it, power down the Pi and pop the SD card back in your computer, and make sure you got the wireless network credentials right.

Once you have your Pi's IP address, log into it using an SSH client. If you're on Windows, you might choose Putty. If you're on Linux, just open up the terminal and run the following, replacing raspberrypi with your Pi's IP:

ssh pi@raspberrypi

The default password for the pi user is "raspberry". You should change that once you've logged in. Do that with:

passwd

and change the password as prompted, to whatever you want, so long as you'll remember it.

Since our Pi Zero W is a headless unit, meaning it only provides a terminal and not a desktop environment, we're going to need to do all of our development work in the SSH session, without a GUI. Since I've already written all of the code, you can skip this if you just want to clone the repo and aren't going to try hacking it with your own changes. But if you are going to make some tweaks, grab yourself a terminal editor. Nano is easy to get started with, but personally I'm a big fan of vim:

sudo apt install vim

Warning: vim is for advanced users, and it takes some getting used to. But it's really powerful, and once you learn it, you'll have a great tool in your arsenal that works on almost any PC.

We'll start by writing the code to get the current COVID data. To get an idea what we're working with, let's grab the COVID data for New York State (if you want to follow along using a different state, just replace "ny" with your state's code in the URL):

wget https://api.covidtracking.com/v1/states/ny/daily.json

This downloads a file called "daily.json" to the current directory. Let's open it and see what we're working with:

Okay. How to turn this into an indicator? I chose to use the positivity rate with respect to testing because it was easy to calculate and, for my area, testing has stabilized, i.e. if you want a COVID test, you can get one. This means that the positivity rate should roughly track any upticks or downswings in COVID infections at large. There are more robust ways to track the outbreak, such as inspecting the death rate, hospitalizations, etc., but they are more complicated and require tracking the data over time. For now, I have opted not to do that.

Our positivity rate can be calculated with:

positivity rate = new positive tests / total new tests

Arbitrarily, I've decided that a positivity rate of 2% will turn on the yellow light, and a positivity rate of 5% will turn on the red one.
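
To make the arithmetic concrete, here's a small sketch of the calculation and the two thresholds. The function names and sample numbers are made up for illustration, and returning 0.0 when no tests were reported is my own assumption about how to avoid dividing by zero:

```python
WARNING_THRESHOLD = 0.02  # yellow light at 2%
DANGER_THRESHOLD = 0.05   # red light at 5%

def positivity_rate(new_positives, new_tests):
    """New positive tests divided by total new tests (0.0 if no tests reported)."""
    if new_tests <= 0:
        return 0.0
    return new_positives / new_tests

def light_for(rate):
    """Map a positivity rate to 'red', 'yellow', or 'off'."""
    if rate > DANGER_THRESHOLD:
        return "red"
    if rate > WARNING_THRESHOLD:
        return "yellow"
    return "off"

# Made-up day: 300 positives out of 10,000 tests is a 3% positivity rate
rate = positivity_rate(300, 10000)
print(rate, light_for(rate))
```

With those sample numbers the rate lands between the two thresholds, so the yellow light would be the one to turn on.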

Let's write the code.

We're gonna use Python because we don't hate ourselves. Python's great for tiny little projects like this. It's quick to develop in and requires very little of the developer - almost everything you need to do can be done with an imported library. In this case, we're going to use the gpiozero package to control the LEDs, requests to fetch the web data, and json to parse it. gpiozero requires some dependencies. Let's install those:

sudo apt install pigpiod python-pip
pip install gpiozero

And now we can start working on our code. Like I said, I use vim for editing, but please use whatever you're comfortable with.

vim covidwarn.py

We'll start by importing the modules we'll need. Put this at the top of the file:

from gpiozero import LED
from time import sleep
import requests
import json

Next, write the function to get the data from the covidtracking API and calculate the positivity rate.

def get_data():
    r = requests.get('https://api.covidtracking.com/v1/states/ny/current.json', allow_redirects=True)
    j = json.loads(r.content)

    positivity_rate = float(j["positiveIncrease"]) / float(j["totalTestResultsIncrease"])

    return positivity_rate

Here, the requests.get(...) call does the same thing that wget did - it grabs the JSON-formatted COVID tracking data from the internet and stores the response in variable r. The body of that response (r.content) is just raw text, which isn't very convenient for parsing. So we load it as JSON, which turns it into a dictionary (a map), so that we can access different parts of the data by providing the key associated with it. A key is just a string that comes before a value in the JSON text - in this case, we're using the keys "positiveIncrease" to grab the number of new positive tests, and "totalTestResultsIncrease" to grab the total number of new tests. If you stare at the original download from wget, you'll see these strings in there, and you'll see some numbers come after them.
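
If the key/value idea is new to you, here's a tiny standalone sketch of what json.loads does. The text below is a made-up snippet that reuses the two keys from the function, and the numbers are illustrative, not real data:

```python
import json

# Made-up snippet in the shape of the API response (numbers are illustrative)
raw = '{"state": "NY", "positiveIncrease": 300, "totalTestResultsIncrease": 10000}'

data = json.loads(raw)           # raw text -> Python dict
print(type(data).__name__)       # dict
print(data["positiveIncrease"])  # 300
```

Once the text is a dict, looking up a value is just a matter of indexing with its key.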

Now we have a function that grabs data from the web, calculates the positivity rate, and returns it. Next is to create something that will use that to light the LEDs.

Let's start by initializing our pi to a known state and setting up some constant variables. We want to make sure both of our LEDs are off to start with, and that we have thresholds set for when to turn them on.

if __name__ == '__main__':
    yellow_led = LED(2)
    red_led = LED(3)

    warning_threshold = 0.02
    danger_threshold = 0.05
    seconds_in_day = 86400

    yellow_led.off()
    red_led.off()

We set the LEDs up on GPIO pins 2 and 3 and set our yellow and red light thresholds to 0.02 and 0.05. We've also set up the variable seconds_in_day - we're going to use this as our polling rate. In other words, we're going to check the COVID data status once every day. This is the most sensible thing to do, since the COVID data from the COVID tracking API updates once every day at 4PM EST. Let's make our polling loop:

while True:
    p_rate = get_data()

    print("Positivity rate: %f" % p_rate)

    sleep(seconds_in_day)

This will execute forever outputting the COVID positivity rate once per day. Now let's make the LEDs operate on that data. Put this in between the print statement and the sleep:

if p_rate > danger_threshold:
    yellow_led.off()
    red_led.on()
elif p_rate > warning_threshold:
    red_led.off()
    yellow_led.on()
else:
    red_led.off()
    yellow_led.off()

This turns the yellow LED on if we've passed the warning threshold and turns the red LED on if we've passed the danger threshold. Only one of the LEDs will be lit at a time. If we're beneath both thresholds, neither LED will be lit.

Now all that's left is to wire up the LEDs. Let's set up the circuits.

LEDs are basic electronic components called diodes that only pass current in one direction. LED stands for Light Emitting Diode - they just happen to be a type of diode that makes visible light when they pass current. In order to hook up your diode in the correct direction (i.e., with the correct polarity) so that it actually makes light, you need to know which side of it is positive and which side is negative. For a through-hole LED, the long leg of the LED is the positive side, and the short leg is the negative (they're called through-hole because when they're installed in a circuit board, the legs go through the holes in the board).

We'll install the LEDs into a breadboard for convenience. If you decide to make this project permanent, you can think about soldering the circuit together and encasing it all in an enclosure (maybe in something that will diffuse the light! LEDs can light up a pretty big space if they're properly diffused). Set up your breadboard like this:

If you're new to breadboards, a key thing to understand is that each row of the breadboard is continuous - you can think of it as a wire that you get to plug electronic components into. These rows end just before the two vertical rails on either side, which are notionally for power (red) and ground (blue). The vertical rails are separate but also each continuous.

What we're doing is taking power from the positive rail of the breadboard, feeding it through a resistor, then to the LED, and then out to the negative rail of the breadboard. Note that each LED is powered by a separate power rail - this is important later! We need them to be separate so that we can control them each separately. The resistor acts as a current limiter for the LED, providing some assurance that the LED won't receive so much current that it burns up. In practice, a Pi's GPIO pins are (I believe) internally limited, so this isn't really required. But it's good practice any time you use LEDs or sensitive electronics. If you're curious how to choose the right resistor for the job, check out Ohm's law.
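
As a rough worked example of that resistor math: the current through the LED is the voltage left over after the LED's forward drop, divided by the resistance. The sketch below assumes a 3.3 V GPIO output and a forward voltage of about 2.0 V, which is a typical ballpark for a red LED. Check the datasheet for yours, since those numbers are assumptions and not measurements from this build:

```python
def led_current_ma(supply_v, forward_v, resistor_ohms):
    """Ohm's law across the resistor: I = (V_supply - V_forward) / R, in mA."""
    return (supply_v - forward_v) / resistor_ohms * 1000

def resistor_for(supply_v, forward_v, target_ma):
    """Resistance in ohms that gives roughly target_ma through the LED."""
    return (supply_v - forward_v) / (target_ma / 1000)

# Assumed values: 3.3 V GPIO output, ~2.0 V forward drop
print(round(led_current_ma(3.3, 2.0, 200), 1))  # current in mA with the 200 ohm resistor
print(round(resistor_for(3.3, 2.0, 10)))        # ohms needed for about 10 mA
```

Under those assumptions, the 200 ohm resistors from the parts list keep the current to a few milliamps, which is plenty for an indicator light.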

Now that the breadboard is set up, we need some way to power it. Since we want to be able to control when the LEDs get power, we're going to hook the positive rails of the breadboard up to a couple of the Pi's GPIO pins. These GPIO pins can be turned on and off in software, making them ideal for this and many other electronics hacking projects. We're going to use pins 2 and 3, since that's what we wrote in our python code.

Credit to pinout.xyz

Since the Pi pins aren't labelled, you should look at pinout.xyz to help you get oriented and find the right pins. Once you do, connect pin 2 to the positive rail feeding the yellow LED, and pin 3 to the positive rail feeding the red LED. Connect a ground pin to the ground you're using on the breadboard.

Now run the code! If you've done everything right, your LEDs should turn off when the code first starts, and once the COVID positivity rate has been calculated, you should see the appropriate LED light up (if any). If you'd like, you can connect a third, green LED, that will indicate that the positivity rate is below 2%. That's left as an exercise to the reader.

OpenHAB Home Automation

Dizzyspiral — Sun, 26 Apr 2020 00:54:18 +0000

This is part 1 in a series of posts documenting what worked and what did not while building a privacy and security focused smart home

IoT and the smart home realm of electronics is largely a joke in the security industry. Smart locks and garage door openers provide little to no physical security against a hacker or someone with one of their tools [1] [2] [3]. Smart bulbs and other "innocuous" IoT devices can be used for infection and propagation of malware, and en masse can pose a risk as botnets [4]. Cameras spy on citizens and provide intel to foreign (or sometimes domestic) governments [5]. The point is, IoT devices are either wilfully or negligently full of security flaws that punch holes in your network and let attackers in or leak your personal data out.

...So why on Earth am I building myself a smart home? Because I want this:

"Hey Mycroft, turn on my kitchen lights."
"Right away."
Four light fixtures turn on at once

How freaking cool? We're living in the future, folks.

But how do I do this without compromising my privacy, or my network security? I've done a lot of research and I think I've found a decent compromise. Leaning heavily on open source software, I'm building a home automation network with OpenHAB, Z-wave, and Mycroft. Let me explain why this is a good solution.

Z-wave is a mesh network protocol that creates a "LAN" of IoT devices that are, themselves, never connected to the internet. These mesh IoT devices connect to a Z-wave "hub" that can be connected to the internet, if you want it to (e.g. if you want to control your smart home from oot and aboot). The hub sends commands to the devices, updates the devices, and overall coordinates the smart home show. There are plenty of Z-wave hubs already out there (cough SmartThings cough), but I wanted something less opaque that would be amenable to the addition of some non-z-wave devices, such as the Wyze cam. Something more... open source.

OpenHAB is a powerful smart home automation suite that frankly has impressed the socks off of me. This thing was clearly architected by people who knew what they were doing from a design standpoint, because their concepts (bindings, things, channels, items) generalize so well that once you understand the OpenHAB ecosystem, it's a joy to add and control a new device. It just works. I love open source success stories. OpenHAB has a binding for Z-wave that allows it to act as a hub as long as there's a Z-wave serial device present on the system. It also has bindings for all sorts of other things you might want to integrate with your smart home solution, including for example Spotify. The beauty of a single-stop solution like OpenHAB is that you can gather all your state information in one place, and use it to create complex interactions between your home devices.

While OpenHAB even provides a means for creating custom UIs for controlling your devices, sometimes talking to your house is just more natural (or more awesome). But voice assistants are all cloud-based services because the hard work of lexing human speech and deriving meaning is the work of an AI, and sadly, they seem to require more horsepower than a raspberry pi. Even more sad, though, is that these cloud-based services snarf up your data and sell it to advertisers. There's no guarantee that they aren't always listening, and indeed there's some evidence that they are, and that our personal moments are being recorded and transcribed by humans in order to improve the AI's speech recognition (I read this article, but I can't find it now - guess you will have to take my word for it).

Mycroft is a cloud-based AI service, but it is open source, to the degree that it is possible. It requires an online account in order to link your Mycroft instance to its cloud service, and their privacy policy does contain language that enables them to sell your data to advertisers. They swear that they don't - being privacy conscious and respectful of their users is their biggest selling point and something that they tout on their website - but they absolutely could. It's up to the end user if they believe them. I decided they were the lesser of all the evils, and I'm excited about the potential to write my own skills for Mycroft, as its skills engine is python-based. It also integrates with OpenHAB, giving it access to all of the items exposed through OpenHAB's configuration. In theory, one can create Mycroft commands to control literally any of the smart home things.

Here's the shopping list so far:

$100 Raspberry Pi 4 Kit from CanaKit
$39 Zooz Z-wave Plus 2 USB Stick
$30 Zooz Z-wave Toggle Smart Switch
$9 PS3 Eye Microphone
$35 Raspberry Pi 3 B+
$25 x3 Raspberry Pi 3 A+
$25 Wyze Cam V2

That all runs a bit over $300. Expensive, sure, but what's life without a hobby? What all this buys me so far are three sets of networked and synchronized speakers (hooked up to Pi 3 A+'s running Max2Play), a voice assistant (the PS3 eye attached to the Pi 3 B+ running Mycroft.ai, using their picroft image), a solitary smart light switch, a camera to watch my front door, and a home automation hub (the Pi 4 with the Zooz Z-wave USB stick). So... I can turn on a single light, play some music, detect when someone's coming into the house, and ask the internet some questions. It's not bad but it's not exactly like the house is alive. All I've bought so far is proof of concept toys.

I'll continue this series with posts about each individual node in the smart home network, including any relevant configuration, wiring, networking, travails, and triumphs. For now, I wait for the pieces to come in.