<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Mior</title>
    <description>The latest articles on DEV Community by Michael Mior (@michaelmior).</description>
    <link>https://dev.to/michaelmior</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F35330%2F962e2d49-0e61-4713-a9ce-582e2d78ce35.png</url>
      <title>DEV Community: Michael Mior</title>
      <link>https://dev.to/michaelmior</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michaelmior"/>
    <language>en</language>
    <item>
      <title>Game of Firsts</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Wed, 07 Aug 2024 16:17:07 +0000</pubDate>
      <link>https://dev.to/michaelmior/game-of-firsts-nge</link>
      <guid>https://dev.to/michaelmior/game-of-firsts-nge</guid>
      <description>&lt;p&gt;A lot of things seem to serve to unintentionally &lt;a href="https://xkcd.com/356/" rel="noopener noreferrer"&gt;nerd snipe&lt;/a&gt; me. Over the weekend, a friend introduced me to &lt;a href="https://www.initialsgame.com/" rel="noopener noreferrer"&gt;The Initials Game&lt;/a&gt;. The gist of it is that you are given two letters and a series of clues. The goal is to guess a two-word phrase that starts with the given letters. It's a pretty entertaining game and I wondered how easy it would be to generate such puzzles.&lt;/p&gt;

&lt;p&gt;The next day, I was able to hack something together in a couple of hours that actually worked reasonably well. The entire thing is based around &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; since I wanted to be able to run everything locally. I started off with a system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a puzzle designer who is going to help create a word puzzle. Any pieces of the puzzle you generate should be short and simple and easily understandable to the average English-speaking adult anywhere in the world.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I split the remainder into two prompts. Given two letters, the first prompt asks for a phrase to be used in the puzzle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I will give you two letters and then you will think of a very simple two word phrase that starts with those two letters. For example, if I give you the letters C and F, you might pick Correctional Facility. For the letters P and T, you might select Party Trick. Respond with only the two word phrase and nothing else. The letters are {letter1} and {letter2}.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second prompt asks for clues given this phrase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I am going to give you a two word phrase and I would like you to devise five very short clues (three or four words maximum) that will help someone guess the phrase. The first clue should be very vague and subsequent clues should get increasingly specific. Do not use any of the words from the phrase anywhere in any of the clues or elsewhere in your response. Output should be a simple numbered list in Markdown format. The two word phrase is "{phrase}".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
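&lt;p&gt;With those two prompts in hand, the core loop is just a couple of chat calls. Here's a rough sketch of how this might be wired up with the Ollama Python client; the model name, the helper functions, and the abbreviated prompt text are my own illustration, not the project's exact code.&lt;/p&gt;

```python
# Sketch of wiring the system prompt and phrase prompt together with
# the ollama Python client. The model name ("llama3") and helper
# names are assumptions; the prompts are abbreviated from the post.
SYSTEM = (
    "You are a puzzle designer who is going to help create a word puzzle. "
    "Any pieces of the puzzle you generate should be short and simple and "
    "easily understandable to the average English-speaking adult."
)

PHRASE_PROMPT = (
    "I will give you two letters and then you will think of a very simple "
    "two word phrase that starts with those two letters. Respond with only "
    "the two word phrase and nothing else. "
    "The letters are {letter1} and {letter2}."
)

def build_messages(letter1, letter2):
    """Assemble the chat messages for the phrase-generation step."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user",
         "content": PHRASE_PROMPT.format(letter1=letter1, letter2=letter2)},
    ]

def generate_phrase(letter1, letter2, model="llama3"):
    """Ask a locally running Ollama server for a candidate phrase."""
    import ollama  # requires the ollama package and a running server
    response = ollama.chat(model=model, messages=build_messages(letter1, letter2))
    return response["message"]["content"].strip()
```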



&lt;p&gt;I later made a few further tweaks to these prompts, but this is the gist of it. At this point, I was sometimes getting reasonable phrases, but sometimes the letters in the phrase didn't match the letters that were asked for. The biggest problem was that the generated phrases were often nonsensical, for example, "Bird Age." To solve this, I ended up using &lt;a href="https://ngrams.dev/" rel="noopener noreferrer"&gt;NGRAMS&lt;/a&gt;, a REST API for querying the Google Books n-gram dataset. This makes it easy to check how common a phrase is. For example, "Bird Age" appears only 269 times, while the much more reasonable phrase "Business Association" appears 200,334 times. The solution was to generate multiple phrases until one meets an arbitrarily chosen popularity threshold.&lt;/p&gt;
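&lt;p&gt;As a sketch, the popularity check might look like the following. The endpoint shape and field names are assumptions based on the public NGRAMS documentation, and the threshold is a placeholder, not the project's actual value.&lt;/p&gt;

```python
# Sketch of a phrase-popularity check against the NGRAMS API
# (https://ngrams.dev/). The URL shape and the "absTotalMatchCount"
# field are assumptions from the public docs, not the project's code.
import json
import urllib.parse
import urllib.request

def phrase_count(phrase):
    """Total occurrences of a phrase in the Google Books n-gram corpus."""
    url = ("https://api.ngrams.dev/eng/search?query="
           + urllib.parse.quote(phrase))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return sum(ng["absTotalMatchCount"] for ng in data.get("ngrams", []))

def pick_phrase(generate, count=phrase_count, threshold=10_000):
    """Keep generating candidate phrases until one is common enough."""
    while True:
        phrase = generate()
        if count(phrase) >= threshold:
            return phrase
```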

&lt;p&gt;Up until this point, I was just printing all the puzzles out as text. But it's a bit more fun to have them read aloud. I ended up using &lt;a href="https://github.com/synesthesiam/opentts" rel="noopener noreferrer"&gt;OpenTTS&lt;/a&gt; to generate audio from the puzzles. One unexpected problem is that coqui-tts, the actual speech synthesis engine used, seems to have a real problem pronouncing letters written out individually. For example, &lt;a href="https://michael.mior.ca/blog/game-of-firsts/ae.mp3" rel="noopener noreferrer"&gt;here's what I got&lt;/a&gt; when I tried to have it say "The letters are E and A."&lt;/p&gt;

&lt;p&gt;To solve this, I wrote out phonetic spellings of each letter, tweaking until each one sounded right. If I instead generate audio for "The letters are Eeh and Ae," I get &lt;a href="https://michael.mior.ca/blog/game-of-firsts/" rel="noopener noreferrer"&gt;much more reasonable output&lt;/a&gt;.&lt;/p&gt;
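&lt;p&gt;A minimal sketch of that workaround follows. The two respellings for E and A are taken from the post; the rest of the table is my own illustration, since the project's spellings were hand-tuned.&lt;/p&gt;

```python
# Hypothetical phonetic respellings to coax the TTS engine into
# pronouncing single letters correctly. "Eeh" and "Ae" come from the
# post; the other entries are illustrative assumptions.
PHONETIC = {
    "A": "Ae", "B": "Bee", "C": "See", "D": "Dee", "E": "Eeh",
    "F": "Eff", "G": "Gee", "H": "Aitch", "I": "Eye", "J": "Jay",
    # ...and so on for the rest of the alphabet
}

def spell_out(letter1, letter2):
    """Build the sentence handed to the TTS engine."""
    return "The letters are {} and {}.".format(
        PHONETIC.get(letter1, letter1), PHONETIC.get(letter2, letter2))
```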

&lt;p&gt;I started out by requiring the user to specify which letters to use. When I switched to picking letters randomly, a uniform distribution didn't work well: the first letters of English words aren't uniformly distributed, so it makes sense to match their actual frequency. I therefore pick letters randomly, weighted by the frequency of the first letters of English words. To avoid picking a lot of double letters, I also halve the probability of the first letter when generating the second.&lt;/p&gt;
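&lt;p&gt;A minimal sketch of that weighted selection, assuming made-up frequency weights rather than real first-letter statistics:&lt;/p&gt;

```python
# Frequency-weighted letter choice with the repeat penalty described
# above. The weights below are illustrative placeholders, not real
# first-letter frequencies of English words.
import random

WEIGHTS = {"S": 11.0, "C": 9.4, "P": 7.7, "A": 7.1, "T": 6.7,
           "B": 6.0, "M": 5.6, "D": 5.5, "R": 5.0, "F": 4.7}

def pick_letters(rng=random):
    letters = list(WEIGHTS)
    first = rng.choices(letters, weights=[WEIGHTS[l] for l in letters])[0]
    # Halve the first letter's weight so double letters are less likely.
    adjusted = [WEIGHTS[l] / 2 if l == first else WEIGHTS[l] for l in letters]
    second = rng.choices(letters, weights=adjusted)[0]
    return first, second
```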

&lt;p&gt;Finally, &lt;a href="https://michael.mior.ca/blog/game-of-firsts/" rel="noopener noreferrer"&gt;here's an example of a complete puzzle&lt;/a&gt;. The code is available &lt;a href="https://github.com/michaelmior/game-of-firsts" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;. I'm not sure whether I'll keep working on this project, but it's pretty impressive what you can quickly accomplish these days with a bit of help from AI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gamedev</category>
      <category>audio</category>
    </item>
    <item>
      <title>LLMs for Schema Augmentation</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Tue, 18 Jul 2023 16:55:43 +0000</pubDate>
      <link>https://dev.to/michaelmior/llms-for-schema-augmentation-1geg</link>
      <guid>https://dev.to/michaelmior/llms-for-schema-augmentation-1geg</guid>
      <description>&lt;p&gt;I have recently been experimenting with the use of large language models (LLMs) to augment JSON Schemas with useful features. While ChatGPT gets most of the press, there are many other LLMs that are specifically designed to work with code. The idea is that these LLMs can be used to augment incomplete schemas with additional useful information.&lt;/p&gt;

&lt;p&gt;Consider an example schema such as the one below. This is a basic schema which might be created from an automated schema mining process. For such a small schema, this is probably sufficient information to tell you useful things about the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, as schemas grow in size and complexity, additional metadata can be useful. For example, JSON Schemas can contain a &lt;code&gt;description&lt;/code&gt; attribute which provides a natural language description of each property. To generate a value for such a property, we can prompt the LLM with a prefix of the schema such as the following.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here, we just need to continue generating tokens until we get to a closing quote. This approach was borrowed from &lt;a href="https://github.com/1rgs/jsonformer" rel="noopener noreferrer"&gt;Jsonformer&lt;/a&gt;, which uses a similar technique to induce LLMs to generate structured output. Continuing to do so for each property using &lt;a href="https://huggingface.co/replit/replit-code-v1-3b" rel="noopener noreferrer"&gt;Replit's code LLM&lt;/a&gt; gives the following output:&lt;br&gt;
&lt;/p&gt;
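&lt;p&gt;In code, that generation loop might look roughly like this. Here &lt;code&gt;next_token&lt;/code&gt; stands in for a real model call (e.g. greedy decoding with a code LLM) and is an assumption, not the actual implementation.&lt;/p&gt;

```python
# Sketch of generating a description by continuing the schema prefix
# until an unescaped closing quote, in the spirit of Jsonformer.
# `next_token` is a stand-in for a real LLM decoding step.
def complete_string(prefix, next_token, max_tokens=64):
    """Append tokens to `prefix` until a closing quote appears."""
    generated = ""
    for _ in range(max_tokens):
        token = next_token(prefix + generated)
        for i, ch in enumerate(token):
            if ch == '"':  # stop at the quote, keep what came before it
                return generated + token[:i]
        generated += token
    return generated
```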

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The name of the item"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The address of the person"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While it's not perfect, and not obviously useful for such a small schema, I think the results are promising. I've also tried several other schema formats such as &lt;a href="https://github.com/colinhacks/zod" rel="noopener noreferrer"&gt;Zod&lt;/a&gt;, &lt;a href="https://www.typescriptlang.org/docs/handbook/2/objects.html" rel="noopener noreferrer"&gt;TypeScript object types&lt;/a&gt;, and &lt;a href="https://docs.pydantic.dev/latest/" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt;. A lot more experimentation is required with these different formats and various LLMs to see which works best. So far, I'm pretty pleased with the results. The code for this post &lt;a href="https://github.com/michaelmior/annotate-schema" rel="noopener noreferrer"&gt;is available on GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>json</category>
    </item>
    <item>
      <title>Zotero reMarkable sync</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Fri, 23 Mar 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/zotero-remarkable-sync-8ph</link>
      <guid>https://dev.to/michaelmior/zotero-remarkable-sync-8ph</guid>
      <description>&lt;p&gt;I’ve really been enjoying my &lt;a href="https://remarkable.com/" rel="noopener noreferrer"&gt;reMarkable&lt;/a&gt; tablet the past several months. (I wrote &lt;a href="https://dev.to/michaelmior/remarkable-review-apc"&gt;a short review&lt;/a&gt; last year if you care to see a few more details). One of the biggest gripes I have about the device is that it can be a pain to get documents on it. There’s currently no web app so the only choice is desktop apps for Windows and macOS or a mobile app for Android. Once I saw someone had released a &lt;a href="https://github.com/splitbrain/ReMarkableAPI" rel="noopener noreferrer"&gt;reMarkable API&lt;/a&gt; on GitHub, I knew I would have to find some way to ease my pain.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="https://www.zotero.org/" rel="noopener noreferrer"&gt;Zotero&lt;/a&gt; to manage my paper references and unsurprisingly, reading papers is one of the primary uses of my reMarkable. I decided to figure out a way I could have new papers I wanted to read in Zotero show up on my reMarkable. Using the reMarkable and Zotero APIs, this proved to be a fairly straightforward weekend project. You can find the result &lt;a href="https://github.com/michaelmior/zotero-remarkable" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To get started, you simply set a few environment variables, which are detailed in the README. Once the script runs, it looks for items in a Zotero collection, downloads their attachments, and uploads them via the reMarkable API. Putting this script into a cron job means that my papers are now synchronized regularly. I’m currently hosting this for myself for free on Heroku using the &lt;a href="https://devcenter.heroku.com/articles/scheduler" rel="noopener noreferrer"&gt;scheduler add-on&lt;/a&gt; to run the job. I disabled any web process (&lt;code&gt;heroku ps:scale web=0&lt;/code&gt;) so the only thing that runs is the cron job. Since the job is quite quick to run, it falls well within the free usage limits of Heroku.&lt;/p&gt;

&lt;p&gt;I may decide to add a &lt;a href="https://devcenter.heroku.com/articles/heroku-button" rel="noopener noreferrer"&gt;Heroku button&lt;/a&gt; to the repository in the future, but it’s fairly straightforward to configure manually. Just set the required environment variables, disable the web dyno, and set up the cron job with the scheduler. Hope this ends up being useful to someone else!&lt;/p&gt;

</description>
      <category>zotero</category>
      <category>remarkable</category>
      <category>heroku</category>
    </item>
    <item>
      <title>LaTeX Skeleton</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Fri, 02 Mar 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/latex-skeleton--2b35</link>
      <guid>https://dev.to/michaelmior/latex-skeleton--2b35</guid>
      <description>&lt;p&gt;A repeated task I run into when I start working on a new paper is the laying out the initial structure of the repository to store the paper text. I recently pushed &lt;a href="https://github.com/michaelmior/latex-skeleton" rel="noopener noreferrer"&gt;a simple skeleton&lt;/a&gt; that I use a starting point to GitHub. There’s nothing really fancy here, but it’s a good starting point. I use &lt;a href="https://mg.readthedocs.io/latexmk.html" rel="noopener noreferrer"&gt;latexmk&lt;/a&gt; to build all my documents since it takes care of running BibTeX among other things. One of the nice other things is that it will automatically try to use &lt;code&gt;make&lt;/code&gt; to build any missing files. The repository basically consists of a &lt;code&gt;Makefile&lt;/code&gt; that generates the paper along with a simple LaTeX skeleton and an empty BibTeX file. Hope it might be helpful to someone else!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Reinforcement learning for Las Vegas</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Sun, 04 Feb 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/reinforcement-learning-for-las-vegas-1523</link>
      <guid>https://dev.to/michaelmior/reinforcement-learning-for-las-vegas-1523</guid>
      <description>&lt;p&gt;During a department board games night, we were playing &lt;a href="https://boardgamegeek.com/boardgame/117959/las-vegas" rel="noopener noreferrer"&gt;Las Vegas&lt;/a&gt; when a fellow player remarked that he wondered how an AI for the game would perform. Since I had been meaning to spend some time learning to implement neural network techniques, this seemed like a great opportunity. One of the first things that came to mind was a &lt;a href="https://arxiv.org/abs/1312.5602" rel="noopener noreferrer"&gt;paper&lt;/a&gt; from the DeepMind team on using deep neural networks to implement a variant &lt;a href="https://en.wikipedia.org/wiki/Q-learning" rel="noopener noreferrer"&gt;Q-learning&lt;/a&gt;. The gist behind classical Q-learning is maintaining a table with the expected utility of a particular action in a given state. This table is updated while the game is played based on the observed rewards.&lt;/p&gt;

&lt;p&gt;The idea behind deep Q-learning is to use a neural network to replace the table which is traditionally used. One of the big advantages is that it’s possible to handle much larger state-action spaces using a neural network. The first step was to decide how to represent the game state. For anyone not familiar with Las Vegas, &lt;a href="http://www.yucata.de/en/Rules/LasVegas" rel="noopener noreferrer"&gt;Yucata&lt;/a&gt; has a good overview of the rules and a mechanism for playing online. The short version is that players take turns rolling dice and placing them on differently numbered casinos in an attempt to get the highest cash reward.&lt;/p&gt;
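&lt;p&gt;The classical tabular update described above can be written in a few lines; the learning rate and discount values here are illustrative placeholders.&lt;/p&gt;

```python
# Tabular Q-learning: nudge the stored utility of (state, action)
# toward the observed reward plus the discounted best value of the
# next state. alpha is the learning rate, gamma the discount factor.
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

Q = defaultdict(float)  # unseen (state, action) pairs start at 0
```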

&lt;p&gt;I first built a simple class structure for the game in Python to represent all the game state and stubbed out a couple of functions to implement the game logic. The next step was to decide how the state was going to be fed into the network. In the original deep Q-learning paper, the authors used a &lt;a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" rel="noopener noreferrer"&gt;convolutional neural network&lt;/a&gt; to feed in frames from gameplay video. I instead chose to explicitly represent the state using the following values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of players in the game&lt;/li&gt;
&lt;li&gt;Current game round number&lt;/li&gt;
&lt;li&gt;Cash currently held by each player&lt;/li&gt;
&lt;li&gt;Number of dice currently on each casino&lt;/li&gt;
&lt;li&gt;Money available at each casino&lt;/li&gt;
&lt;li&gt;Number of dice of each value in the current roll&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Explicitly representing the state also resulted in a different structure for the network itself. The input layer received a normalized vector of the state values above. The second fully-connected layer was half the size of the first. Both of the first two layers used a &lt;a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)" rel="noopener noreferrer"&gt;ReLU&lt;/a&gt; activation function. The final output layer was also fully connected, but with a linear activation function and a size of six to represent the choice of each possible die value. After training the AI against four random opponents, it was able to win around 50% of games, which I was pretty happy with given the inherent randomness of the game. However, the evaluation was by no means robust and is something that definitely needs to be improved upon.&lt;/p&gt;
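&lt;p&gt;A minimal NumPy sketch of that shape: a state vector in, a second fully-connected layer half the size of the first, and six linear outputs, one per die value. The hidden layer size and the random weights are assumptions; only the output size of six comes from the description above.&lt;/p&gt;

```python
# Forward pass matching the architecture described above: two
# fully-connected ReLU layers (the second half the size of the first)
# and a linear output of size six. Weights are random placeholders.
import numpy as np

def build_network(state_size, hidden=64, rng=None):
    rng = rng or np.random.default_rng(0)
    sizes = [state_size, hidden, hidden // 2, 6]
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, state):
    x = np.asarray(state, dtype=float)
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)  # ReLU on the hidden layers
    W, b = layers[-1]
    return x @ W + b                    # one linear Q-value per die choice
```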

&lt;p&gt;I later implemented hyperparameter optimization using the &lt;a href="https://github.com/hyperopt/hyperopt" rel="noopener noreferrer"&gt;Hyperopt&lt;/a&gt; library. After much more training, I tried to optimize the reward values, &lt;a href="https://en.wikipedia.org/wiki/Q-learning#Discount_factor" rel="noopener noreferrer"&gt;discount factor&lt;/a&gt;, and other parameters specific to deep Q-learning. This led me to change the kernel initializer to &lt;a href="https://keras.io/initializers/#lecun_uniform" rel="noopener noreferrer"&gt;LeCun uniform&lt;/a&gt;, the activation function of the first two layers to a &lt;a href="https://keras.io/activations/#sigmoid" rel="noopener noreferrer"&gt;sigmoid function&lt;/a&gt;, and the optimization algorithm from &lt;a href="https://keras.io/optimizers/#rmsprop" rel="noopener noreferrer"&gt;RMSprop&lt;/a&gt; to &lt;a href="https://keras.io/optimizers/#adam" rel="noopener noreferrer"&gt;Adam&lt;/a&gt;. This was mostly for a bit more fun, although it did seem to provide about a 20% performance improvement on some simple evaluations I tried.&lt;/p&gt;

&lt;p&gt;Since this is just a fun side project, one of the next things on my agenda is to implement a UI using &lt;a href="http://boardgame.io/" rel="noopener noreferrer"&gt;boardgame.io&lt;/a&gt; so I can get a sense of how the AI “feels.” All in all, this was a pretty fun project. The source code is available &lt;a href="https://github.com/michaelmior/lasvegas" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt; for anyone who wants to play with it.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>reMarkable review</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Tue, 31 Oct 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/remarkable-review-apc</link>
      <guid>https://dev.to/michaelmior/remarkable-review-apc</guid>
      <description>&lt;p&gt;As an academic, I spend a lot of time reading papers. I generally hate the idea of printing out a paper for a one-time read since it feels wasteful, but I also don’t enjoy the reading experience of viewing papers on a desktop screen. I’ve always liked the look of the &lt;a href="https://www.sony.com/electronics/digital-paper-notepads/dpt-rp1" rel="noopener noreferrer"&gt;Sony Digital Paper&lt;/a&gt; that my advisor uses but it’s rather expensive and not easy to find. I used a &lt;a href="http://www.samsung.com/consumer/mobile-devices/tablets/others/GT-P5100ZWAXEF/" rel="noopener noreferrer"&gt;Samsung Galaxy Tab 3&lt;/a&gt; for several years which freed me from my desk and made note-taking easier, but still wasn’t as nice as paper.&lt;/p&gt;

&lt;p&gt;This is why I was excited to hear about the &lt;a href="https://remarkable.com/" rel="noopener noreferrer"&gt;reMarkable&lt;/a&gt;. The stated goal of reMarkable is to be a tablet for paper lovers. The cost is still fairly significant for a grad student, and I actually cancelled my first preorder before changing my mind and committing. While I did experience the shipping delays that seem inevitable with crowdfunded projects, I finally got my device this past Thursday.&lt;/p&gt;

&lt;p&gt;So far, I’m pretty impressed by the hardware and the writing experience. There’s definitely no perceivable latency when writing, and most other actions happen quite quickly as well. The software still needs a lot of work (especially the &lt;a href="https://play.google.com/store/apps/details?id=com.remarkable.mobile" rel="noopener noreferrer"&gt;Android app&lt;/a&gt;) and I’ve had some issues with syncing not working correctly. My biggest disappointment so far is that the Wi-Fi on the device doesn’t currently support networks which use usernames and passwords for authentication, which means I can’t use wireless connectivity in the office. This is a pretty big drawback, but there’s already a promise to fix this in a future software update. Overall, I’m looking forward to seeing what happens.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Benchmarking ScyllaDB</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Sat, 11 Mar 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/benchmarking-scylladb-5m1</link>
      <guid>https://dev.to/michaelmior/benchmarking-scylladb-5m1</guid>
      <description>&lt;p&gt;&lt;a href="http://www.scylladb.com/" rel="noopener noreferrer"&gt;ScyllaDB&lt;/a&gt; is an alternative to &lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt; which claims to have 10x higher throughput than Cassandra while remaining the same positive properties of scalability and ease of use. Scylla functions as a drop-in replacement for Cassandra applications using &lt;a href="https://docs.datastax.com/en/cql/3.1/index.html" rel="noopener noreferrer"&gt;CQL&lt;/a&gt; (currently with some &lt;a href="http://www.scylladb.com/technology/status/" rel="noopener noreferrer"&gt;disparity in features&lt;/a&gt;). Fortunately, my &lt;a href="https://michael.mior.ca/projects/NoSE/" rel="noopener noreferrer"&gt;past work&lt;/a&gt; in NoSQL schema design for Cassandra only makes use of features supported by Scylla.&lt;/p&gt;

&lt;p&gt;I’ve been meaning to do my own tests on Scylla for a while, but recently &lt;a href="https://cs.uwaterloo.ca/~jimmylin/" rel="noopener noreferrer"&gt;a faculty member&lt;/a&gt; mentioned Scylla on our group’s Slack channel and suggested I share some results. I decided to run the same set of experiments I used to evaluate a schema I designed for the &lt;a href="http://rubis.ow2.org/" rel="noopener noreferrer"&gt;RUBiS&lt;/a&gt; online auction benchmark. First, several column families are created and loaded with data. Next, the different types of transactions in the RUBiS benchmark are executed while measuring the response time. Before I continue, a disclaimer: the methodology behind these results is not as rigorous as it could be, but they still leave me skeptical of some of the claims made by Scylla.&lt;/p&gt;

&lt;p&gt;These experiments were run using single-node installations of Cassandra 3.0.9 and Scylla 1.6.1. While this is not a typical setup, many of the reasons for performance improvements claimed by Scylla (e.g. lock-free data structures and improved memory management) should still manifest on a single node. One of the nice things about Scylla is the &lt;code&gt;scylla_setup&lt;/code&gt; command that attempts to configure the OS for optimal performance, including benchmarking the disk storing the data directory. This configuration was used for both Scylla and Cassandra; otherwise, the default settings were used for both systems.&lt;/p&gt;

&lt;p&gt;The first striking difference is that the on-disk size of the data for Scylla (9.8 GB) is nearly twice that of Cassandra (5.2 GB). Despite this, there was not a large difference in load times with Cassandra taking 3 hours 43 minutes and Scylla taking 3 hours 59 minutes. Below is a graph with the write throughput of the SSD storing the data files in each case. Scylla seems to push the drive much harder but it’s able to keep up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt; : After looking at the number of keys in each table for both Scylla and Cassandra, it seems as though Scylla was storing significantly more data. Stay tuned for updates on resolving this issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhjoblpnjvcmuucqxcxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhjoblpnjvcmuucqxcxg.png" alt="Write throughput while loading" width="605" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, the results of running the actual benchmark. RUBiS consists of a number of “interactions” which correspond to user requests for web pages. The graph below shows the top eight interactions by frequency and the average response times for both Cassandra and Scylla. I won’t go into a detailed analysis here, but the performance claims made by Scylla don’t seem to play out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgivsm35mj9k0ahcmnmn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgivsm35mj9k0ahcmnmn9.png" alt="RUBiS benchmark comparison" width="500" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>benchmarking</category>
      <category>database</category>
    </item>
    <item>
      <title>A Calcite adapter for Apache Cassandra</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Sat, 20 Feb 2016 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/a-calcite-adapter-for-apache-cassandra-30f</link>
      <guid>https://dev.to/michaelmior/a-calcite-adapter-for-apache-cassandra-30f</guid>
      <description>&lt;p&gt;For those not familiar, &lt;a href="https://calcite.apache.org/" rel="noopener noreferrer"&gt;Apache Calcite&lt;/a&gt; is a generic SQL query optimizer which can execute SQL queries over multiple backend data sources. This is a powerful concept because it allows complex queries to be executed over sources which provide much simpler interfaces from &lt;a href="https://calcite.apache.org/apidocs/org/apache/calcite/adapter/csv/package-summary.html" rel="noopener noreferrer"&gt;CSV files&lt;/a&gt; to &lt;a href="https://calcite.apache.org/apidocs/org/apache/calcite/adapter/mongodb/package-summary.html" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt;. Calcite is also leveraged as the cost-based-optimizer framework for the &lt;a href="https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive" rel="noopener noreferrer"&gt;Apache Hive&lt;/a&gt; data warehouse.&lt;/p&gt;

&lt;p&gt;Much of my PhD research has revolved around generating optimized schemas for NoSQL databases such as &lt;a href="https://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt;. (For a proof-of-concept tool, check out the &lt;a href="https://github.com/michaelmior/NoSE" rel="noopener noreferrer"&gt;NoSQL Schema Evaluator&lt;/a&gt;.) On discovering Calcite, it seemed like a good fit for my work. One of the challenges with using NoSQL databases for complex queries is the necessity of working within the restrictions set by the query language. In previous work, I built a very simple query executor on top of Cassandra designed to run a predefined set of query plans. Leveraging Calcite, it is possible to execute a very complete &lt;a href="https://calcite.apache.org/docs/reference.html" rel="noopener noreferrer"&gt;dialect of SQL&lt;/a&gt; on top of any defined data source (which Calcite calls “adapters”).&lt;/p&gt;

&lt;p&gt;Unfortunately, Calcite did not already have an adapter for Cassandra. Fortunately, writing an adapter is a fairly straightforward process, so I decided to take this on. The simplest possible implementation of an adapter provides a set of tables along with a scan operator to retrieve all the rows in the tables. While this is sufficient to enable Calcite to perform query execution, scanning a table in Cassandra is &lt;a href="http://www.myhowto.org/bigdata/2013/11/04/scanning-the-entire-cassandra-column-family-with-cql/" rel="noopener noreferrer"&gt;very inefficient&lt;/a&gt;. This is a result of the fact that partitions in a Cassandra table are commonly distributed across nodes via hash partitioning. While it is possible to retrieve all rows, they will be produced in a random order and the query will need to contact all nodes in the database. Assuming that the query the user wants to issue does not need to touch all rows in a table, it is possible to use filtering in the Cassandra Query Language (&lt;a href="http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlReferenceTOC.html" rel="noopener noreferrer"&gt;CQL&lt;/a&gt;) to push filtering down to Cassandra.&lt;/p&gt;
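&lt;p&gt;To make the distinction concrete, here is a hypothetical CQL sketch (the &lt;code&gt;users&lt;/code&gt; table and its partition key are assumptions for illustration, not something defined by the adapter). The first query forces a cluster-wide scan, while the second is routed only to the replicas owning that partition:&lt;/p&gt;

```
-- Full scan: contacts every node and returns rows in token order
SELECT * FROM users;

-- Filter on the partition key: served only by the replicas owning
-- that partition, so this is the form we want pushed down to CQL
SELECT * FROM users WHERE user_id = 42;
```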

&lt;p&gt;The current version of the adapter also supports exploiting the native sort order of Cassandra tables by &lt;a href="https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_compound_keys_c.html" rel="noopener noreferrer"&gt;clustering key&lt;/a&gt;. There is still a lot of work to be done, but an initial version of this adapter should be shipped in Calcite 1.7.0. Until the release, you’ll have to compile &lt;a href="https://github.com/apache/calcite/" rel="noopener noreferrer"&gt;from source&lt;/a&gt;. A quick set of commands to get things running is below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install

# You will need to create a JSON document which provides connection information
# An example can be found in ./cassandra/src/test/resources/model.json
$ ./sqlline
sqlline&amp;gt; !connect jdbc:calcite:model=path/to/cassandra/model.json admin admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point you can write SQL queries which reference your Cassandra tables. Note that table names need to be quoted and there will likely be some failures with certain query patterns. You can view the proposed plan for a query by prefixing it with &lt;code&gt;EXPLAIN PLAN FOR&lt;/code&gt; in the &lt;code&gt;sqlline&lt;/code&gt; shell. This will show whether the query is able to exploit filtering or sorting directly in CQL. This is a long way from making Cassandra a viable data warehouse, but it may be helpful for performing occasional analytical queries without needing to write a significant amount of code.&lt;/p&gt;
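&lt;p&gt;For example (the table and column names here are hypothetical), a session might look like the following. If pushdown succeeded, the plan should show the adapter’s own operators (such as a Cassandra-specific filter node) rather than a full table scan followed by a generic filter:&lt;/p&gt;

```
sqlline> EXPLAIN PLAN FOR SELECT * FROM "users" WHERE "user_id" = 42;
sqlline> SELECT * FROM "users" WHERE "user_id" = 42;
```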

&lt;h2&gt;Update: March 27, 2016&lt;/h2&gt;

&lt;p&gt;Calcite 1.7.0 has now been &lt;a href="https://calcite.apache.org/docs/history.html#v1-7-0" rel="noopener noreferrer"&gt;released&lt;/a&gt;, which includes the Cassandra adapter. In addition to what was discussed above, the adapter now automatically recognizes &lt;a href="https://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views" rel="noopener noreferrer"&gt;materialized views&lt;/a&gt;. &lt;a href="https://calcite.apache.org/docs/cassandra_adapter.html" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt; is available on the Calcite website.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Automated Testing of Dotfiles</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Thu, 24 Sep 2015 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/automated-testing-of-dotfiles-bom</link>
      <guid>https://dev.to/michaelmior/automated-testing-of-dotfiles-bom</guid>
      <description>&lt;p&gt;Several years ago I started managing my dotfiles based on Zach Holman’s &lt;a href="https://github.com/holman/dotfiles" rel="noopener noreferrer"&gt;dotfiles repo&lt;/a&gt;. His setup is quite nice and I found it relatively easy to adapt to my own purposes. My workflow generally consisted of making a bunch of local changes until I was happy and then pushing to my own &lt;a href="https://github.com/michaelmior/dotfiles" rel="noopener noreferrer"&gt;GitHub fork&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The big problem I eventually found is that I wasn’t fully capturing the correct steps to reproduce my environment. Every time that I tried to install my dotfiles on a new machine, I would be met with several errors that I would eventually resolve. The fix would not always result in something which was reproducible on another machine. I wanted a solution that would let me automatically test that my dotfiles would cleanly install every time I pushed to GitHub, so I turned to &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Traditional CI services would have been a bit of a pain to use with all the packages that needed to be installed. &lt;a href="https://hub.docker.com/r/michaelmior/dotfiles/" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt; made things nice and easy. My &lt;a href="https://github.com/michaelmior/dotfiles/blob/a9eae90d466958948a53b3b583d69eba844ed8f7/Dockerfile" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt; simply installs the necessary OS packages, adds a new user, and then tries to run my install script. I currently don’t have any testing other than ensuring that the script exits without error, but this has already saved me a lot of trouble.&lt;/p&gt;
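&lt;p&gt;A minimal sketch of what such a &lt;code&gt;Dockerfile&lt;/code&gt; can look like (the base image, package list, user name, and install script path are assumptions for illustration, not the contents of the actual file):&lt;/p&gt;

```
# Hypothetical sketch; package names and script path are assumptions.
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y git zsh curl

# Run the install as an unprivileged user, as on a fresh machine
RUN useradd -m tester
COPY . /home/tester/dotfiles
RUN chown -R tester /home/tester/dotfiles
USER tester
WORKDIR /home/tester/dotfiles

# The build fails (and CI reports it) if the script exits non-zero
RUN ./script/bootstrap
```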

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; I have since switched to using &lt;a href="https://travis-ci.org/michaelmior/dotfiles" rel="noopener noreferrer"&gt;Travis CI&lt;/a&gt; as I do with my other projects. It turns out this is easier than I expected. I still haven’t explicitly added any tests, but even being able to confirm that the installation steps succeed helps ensure nothing breaks.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apache Cassandra benchmarking</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Thu, 21 Aug 2014 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/apache-cassandra-benchmarking-a70</link>
      <guid>https://dev.to/michaelmior/apache-cassandra-benchmarking-a70</guid>
      <description>&lt;p&gt;I was recently trying to run some benchmarks against &lt;a href="http://cassandra.apache.org/" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt; on &lt;a href="http://aws.amazon.com/ec2/" rel="noopener noreferrer"&gt;EC2&lt;/a&gt; since unfortunately the servers I had in our machine room were destroyed in a fire. For all my local testing, I used a single instance running on my desktop machine, but I wanted to ramp things up for the real benchmarks and use three nodes. Since my workload is read-only and the dataset is fairly small, I also wanted a replication factor of three so each node would have a copy of all the data.&lt;/p&gt;

&lt;p&gt;My first attempt to load all this data was to follow &lt;a href="http://www.datastax.com/documentation/cql/3.0/cql/cql_using/update_ks_rf_t.html" rel="noopener noreferrer"&gt;some documentation&lt;/a&gt; provided by DataStax. Their suggestion was to use &lt;code&gt;ALTER KEYSPACE&lt;/code&gt; in CQL to change the replication factor, and then simply run &lt;code&gt;nodetool repair&lt;/code&gt; on each node. However, I found that running repair on just one node took several hours for a modest amount of data (~2GB). This was a pretty big time sink as I wanted to be able to quickly spin up and tear down a cluster for testing.&lt;/p&gt;
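&lt;p&gt;For reference, that suggested approach boils down to two steps (the keyspace name here is hypothetical):&lt;/p&gt;

```
-- In cqlsh, raise the replication factor on the keyspace
ALTER KEYSPACE rubis
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

# Then, in a shell on each node, stream the missing replicas
$ nodetool repair
```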

&lt;p&gt;Next I tried changing the configured replication factor locally before exporting the data. I then simply copied the data to all nodes in the cluster and tried to start them as normal. This created some weird conflicts as nodes seemed to be confused about who owned what portion of data.&lt;/p&gt;

&lt;p&gt;Finally, I simply loaded up the data set on a single node and configured a replication factor of three. I then started each node in sequence and the auto bootstrapping process took care of copying the entire dataset to each node in the cluster. This whole process was complete in less than half an hour. This approach wouldn’t really work in a production setting since it assumes the node has no existing data (although if you can afford to bring a node offline for a while, I suppose that it might work). In any case, this solution worked great for me and hopefully someone else finds this useful.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Node.js skeleton project</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Sat, 15 Sep 2012 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/nodejs-skeleton-project-ao9</link>
      <guid>https://dev.to/michaelmior/nodejs-skeleton-project-ao9</guid>
      <description>&lt;p&gt;Unfortunately, it’s obviously been a long time since this blog has been updated. Since the last post, I’ve been hard at work rewriting our Web app in Django (finally got rid of our old PHP) and picking up iPhone app development. Keep an eye out for some cool stuff coming up in the near future.&lt;/p&gt;

&lt;p&gt;In my spare time, I’ve been playing around with Node.js development. I decided to release the sample project I’ve been working on. You may find it useful if you’re looking to get up and running quickly. It’s still a work in progress, but it’s coming along nicely.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://github.com/michaelmior/node-loco-skeleton" rel="noopener noreferrer"&gt;node-loco-skeleton&lt;/a&gt; on GitHub.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Unit testing Django model mixins</title>
      <dc:creator>Michael Mior</dc:creator>
      <pubDate>Sat, 14 Jan 2012 00:00:00 +0000</pubDate>
      <link>https://dev.to/michaelmior/unit-testing-django-model-mixins-34m</link>
      <guid>https://dev.to/michaelmior/unit-testing-django-model-mixins-34m</guid>
      <description>&lt;p&gt;I recently found myself having to unit test some model mixins and I thought I would share the technique I used in case anyone else finds it useful. You could just pick a model which uses the mixin and run the test on instances of that model. But the goal of a mixin is to provide reusable functionality independent of any model. Instead, we create a dummy model we can use for testing.&lt;/p&gt;

&lt;p&gt;The model shouldn’t reside in &lt;code&gt;models.py&lt;/code&gt; since we don’t want it in our database. Instead, we create the model dynamically. However, I wanted to test some functionality which requires saving the model to the database. Fortunately, Django can construct the necessary SQL to create and destroy the database table. We simply override &lt;code&gt;setUp&lt;/code&gt; and &lt;code&gt;tearDown&lt;/code&gt; to do the heavy lifting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from django.test import TestCase
from django.db import connection
from django.core.management.color import no_style
from django.db.models.base import ModelBase

class ModelMixinTestCase(TestCase):
    """
    Base class for tests of model mixins. To use, subclass and specify
    the mixin class variable. A model using the mixin will be made
    available in self.model.
    """

    def setUp(self):
        # Create a dummy model which extends the mixin
        self.model = ModelBase('__TestModel__' + self.mixin.__name__, (self.mixin,),
            {'__module__': self.mixin.__module__})

        # Create the schema for our test model
        self._style = no_style()
        sql, _ = connection.creation.sql_create_model(self.model, self._style)

        self._cursor = connection.cursor()
        for statement in sql:
            self._cursor.execute(statement)

    def tearDown(self):
        # Delete the schema for the test model
        sql = connection.creation.sql_destroy_model(self.model, (), self._style)
        for statement in sql:
            self._cursor.execute(statement)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make use of this code, just subclass from &lt;code&gt;ModelMixinTestCase&lt;/code&gt; and set the mixin class variable to the model mixin class you wish to test. You’ll then have access to a fully functioning model which uses this mixin via &lt;code&gt;self.model&lt;/code&gt;. Happy testing!&lt;/p&gt;
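&lt;p&gt;The same dynamic-class trick works outside Django: calling a metaclass (here plain &lt;code&gt;type&lt;/code&gt;, which &lt;code&gt;ModelBase&lt;/code&gt; extends) with a name, a tuple of bases, and an attribute dict builds a class at runtime. A minimal sketch of the pattern &lt;code&gt;setUp&lt;/code&gt; relies on, with a made-up mixin for illustration:&lt;/p&gt;

```python
# A trivial stand-in for a model mixin
class GreetingMixin:
    def greet(self):
        return "Hello from " + type(self).__name__

# Equivalent to `class __TestGreetingMixin(GreetingMixin): pass`,
# mirroring the ModelBase(...) call in setUp above
TestModel = type(
    "__Test" + GreetingMixin.__name__,   # generated class name
    (GreetingMixin,),                    # bases: the mixin under test
    {"__module__": GreetingMixin.__module__},
)

obj = TestModel()
print(obj.greet())  # -> Hello from __TestGreetingMixin
```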

</description>
    </item>
  </channel>
</rss>
