DEV Community: vincent d warmerdam

Introducing dearme.email

vincent d warmerdam — Tue, 27 May 2025 21:32:15 +0000

This is a submission for the Postmark Challenge: Inbox Innovators.

What I Built

I made https://dearme.email/. It's a service that lets you send an email to yourself, which you receive after 30 days.

What could you use this for? You can send yourself reminders, check in on your goals, and maybe benefit from the perspective of listening to your past self. In an age where bots are everywhere, it's nice to be able to have a human connection with yourself. Not to mention the fact that you can actually do a bit of time travelling, which is awesome.

Demo

You can send an email to hello@dearme.email and it will be sent back to you in 30 days. We encrypt the email on our end, decrypt it when it's time to send it back, and then delete the message from our systems. We do require that you verify your email address. If you don't do that within 30 days, we delete the message.

How I Built It

It's a fairly basic Python/postgres setup with Flask. Postmark really takes care of most of the hard bits here. There's a webhook and a cronjob too, but nothing too fancy.
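
To make the moving parts concrete, here is a minimal sketch of the decision the cronjob might make for each stored message. It is not the code behind dearme.email: the `StoredMessage` class, its fields, and the `next_action` function are hypothetical names, and the only facts taken from the post are the 30-day delay and the rule that unverified messages get deleted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Delivery delay and verification window, both 30 days per the post.
DELAY = timedelta(days=30)


@dataclass
class StoredMessage:
    """Hypothetical row in the messages table."""
    sender: str
    received_at: datetime
    sender_verified: bool


def next_action(msg: StoredMessage, now: datetime) -> str:
    """Decide what the periodic job should do with one stored message."""
    if now - msg.received_at < DELAY:
        return "wait"  # not due yet
    if msg.sender_verified:
        return "send_and_delete"  # decrypt, mail it back, then remove it
    return "delete"  # the sender never verified within 30 days


received = datetime(2025, 5, 1, tzinfo=timezone.utc)
msg = StoredMessage("me@example.com", received, sender_verified=True)
print(next_action(msg, received + timedelta(days=10)))  # wait
print(next_action(msg, received + timedelta(days=30)))  # send_and_delete
```

Keeping the decision in a pure function like this makes it easy to test without touching the database or the mail provider.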

I might add payments later to help cover the postgres costs, which I'll probably build with lemonsqueezy. But assuming there's not a huge demand for this app overnight I should be fine. Postmark has reasonable pricing and they also do some spam detection (which is a concern of mine given that anyone can email).

@koaning

My New Home Setup

vincent d warmerdam — Mon, 16 Aug 2021 07:24:44 +0000

A while ago I switched my home setup for development and it's been such a success that I decided to write a short blog post about it.

The setup involves three machines.

I own an Intel NUC. This machine runs popOS (a variant of ubuntu) and is meant to run heavyweight code related things. It typically runs docker, databases and machine learning scripts that take a while. It also serves as a data server where all of the larger datasets that I use are stored.
I also bought the cheapest M1 mac mini that has 16GB of RAM. This machine runs all of my creative software. It works with my Wacom One drawing screen as well as all of my microphones and screen-recording software. This machine is also a user-interface of sorts for my NUC. I mainly run visual studio code via ssh, the code itself runs on my NUC but I can still use my big screen on my Mac as well as all of the snippet tools that I am used to.
I also own a laptop from my employer. This is an Intel Mac (the one with the terrible keyboard). It is able to run all of the aforementioned software when I'm on the road but it is also able to SSH into my NUC. This laptop is also the only machine that runs social media and slack. The idea is that this laptop is my distraction machine that I can physically turn off by shutting down the lid.

Things that work out very well

There are many benefits to this approach.

I can run a heavy machine learning algorithm while recording. Since the machine that is running the recording software isn't the same machine that is running the machine learning script I suddenly don't need to worry about any lag when I'm making educational content.
In general, the NUC is a much better deal in terms of hardware. This is particularly true for the disk when you compare it with the apple hardware. It's much cheaper to invest in a second lightweight linux machine than it is to upgrade the specs on an apple device.
Upgrading an apple device is near impossible (boo apple!). But upgrading the NUC is a breeze. I got 32GB of RAM but nothing prevents me from upgrading it to 64GB in the future.
Visual Studio Code has really nailed the SSH feature. It's a native experience, really. But the killer feature here is that all the things that currently don't run on the M1 chip (tensorflow, docker) all run extremely well on an Intel machine running linux.
I certainly need to use slack and I also need to check into social media from time to time. But the nice thing about limiting that to just my laptop is that I'm literally able to "shut the lid on distractions". It's proven to be a huge reducer of stress.
I could have chosen not to go with the Intel NUC in favor of something that's heavier. But, with 12 threads, the NUC is pretty darn good for the small form factor. It's also pretty light on energy consumption since the hardware is more like a laptop than a server. I also think it'd be a bad idea to go for heavier hardware. If I had bought a big machine with GPUs then I'd have huge fans spinning on my desk and I'd be writing my software on a machine that doesn't represent the compute available to the average person.
If I wanted to, I could totally bring the M1 Mac and the NUC with me! Sure, there'd be some fumbling with cables involved but both totally fit inside of my backpack. It'd be very easy for me to move my entire setup if I wanted to work from my parents' house for a week.
Theoretically, you could also replace my intel NUC with a VM in the cloud. My setup, however, doesn't require the internet. It only needs the network in my home.
Should any of my machines break, I can still be productive enough with the other machines in my house.

Final Remarks

Especially with the advent of working from home, a lot of us lucky developers can rethink what our ideal setup might be like. I don't like the idea of investing in big apple machines that aren't upgradeable and this NUC setup really strikes a balance. I imagine at some point in the future, when laptop manufacturers finally get their keyboards in order, this setup might also work for windows machines. The laptop can become much more of a user interface, while an upgradeable linux machine runs all the development software you'll need.

Proper Name Detection

vincent d warmerdam — Tue, 09 Mar 2021 08:54:01 +0000

Detecting names in a user message is a common challenge when designing a virtual assistant. It's a task many Rasa users face, which is why you can find many questions on the topic in the Rasa forum. It’s also an issue that is more complicated than many people initially think.

The Problem

Suppose you want to detect names in French—would you consider this to be a hard problem?

You should remember, French is spoken in a lot of places. It’s an official language in over 30 countries, many of which also have Arabic as an official language.

That means that when you are detecting names in French, you might be trying to detect names with an Arabic origin in a French body of text. This gives the problem a whole new dimension.

Suppose that you were thinking about using a pre-trained language model for French. Would it be able to find all the names? After all, it’s plausible that a pre-trained French model can overfit on names from France. Such a model might be good at detecting names like "Jacques" and "Véronique." But what about names like "Aaadil" or "Heer"? Can we still consider pre-trained French models outside of France?

This problem isn’t just limited to regions outside of France. Inside of France, you would expect names with an Arabic origin too. In fact, this issue should be seen as a worldwide phenomenon! Whether your users are ex-pats or have a non-traditional name, you should still be able to detect their name. Detecting a person's name is a hard problem because it’s not limited to any country or language.

In this blog post, we’re going to discuss three approaches to detecting names in utterances. None of these solutions will be perfect, but they should offer reasonable places to start when you’re building a virtual assistant.

Approach 1: Pre-Trained Language Models

Pre-trained models should not be considered to be 100% perfect. That said, they can still be very useful. We can expect them to find some of the names so we shouldn’t ignore pre-trained models altogether.

A popular component for this task is the SpacyEntityExtractor. It can be configured to detect many kinds of entities and it supports many languages. In particular, we can configure spaCy to detect “PERSON” entities in a Rasa pipeline. You can see an example configuration for English below.

pipeline:
- name: SpacyNLP
  model: "en_core_web_md"
- name: SpacyTokenizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: SpacyFeaturizer
  pooling: mean
- name: SpacyEntityExtractor
  dimensions: ["PERSON"]
- name: DIETClassifier
  epochs: 100

With SpacyEntityExtractor configured, Rasa will detect any entities that spaCy detects. As said before, it won’t be perfect, but it is a start. A common issue is that a lot of names of people are commonly confused with the names of organizations. Another issue is that spaCy, at least partially, trains its models on Wikipedia, which does not represent text that your users might type. If you’d like to understand these issues in more detail you should watch our Algorithm Whiteboard video on the topic.

Approach 2: NameLists

If you're interested in detecting names of people, then you might wonder if we need machine learning. After all, there are many datasets available containing baby names from around the world, and you can try to apply basic string matching against these lists. To help, we've started collecting some of these lists over at the rasa-nlu-examples repository (contributions welcome!). These names can be combined with our RegexEntityExtractor to find names in utterances.

To get this to work, you’ll first need to add a lookup table to your NLU data. The lookup table is added to your project as a separate YAML file whose contents might look something like this:

nlu:
- lookup: PERSON
  examples: |
    - aafrae
    - aasmae
    - abad
    - ...
    - zoumourrouda
    - zounnoun

Once the lookup table has been defined, you can add the RegexEntityExtractor to the pipeline that references it. The pipeline below demonstrates an example that uses both spaCy and a name list.

pipeline:
- name: SpacyNLP
  model: "en_core_web_md"
- name: SpacyTokenizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: SpacyFeaturizer
  pooling: mean
- name: SpacyEntityExtractor
  dimensions: ["PERSON"]
- name: RegexEntityExtractor
  case_sensitive: False
  use_lookup_tables: True
- name: DIETClassifier
  epochs: 100

Note that we’ve set our RegexEntityExtractor to be case insensitive. The benefit of this is that we’re robust against users who don’t apply capitalization properly, but again, it won’t be perfect. The word “Mark” could be the name of a person, but “mark” is also an ordinary verb. A misspelled name is another case that the lookup table will not catch.
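
To see both failure modes in isolation, here is a small sketch of case-insensitive lookup matching written with Python's standard `re` module. It is only an illustration of the idea, not how RegexEntityExtractor is implemented; the `NAMES` list and the `find_names` helper are made up for this example, and "mark" has been added to the list to show the ambiguity.

```python
import re
from typing import List

# A tiny stand-in for a lookup table; real lists hold thousands of names.
NAMES = ["aafrae", "aasmae", "abad", "mark"]

# Word boundaries keep "mark" from matching inside "marketing".
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in NAMES) + r")\b",
    flags=re.IGNORECASE,
)


def find_names(text: str) -> List[str]:
    """Return every lookup-table hit in the text, lowercased."""
    return [match.group(0).lower() for match in PATTERN.finditer(text)]


print(find_names("hi, i am MARK"))             # ['mark']
print(find_names("please mark this as done"))  # ['mark'], a false positive
print(find_names("my name is Aafare"))         # [], the typo is missed
print(find_names("the marketing team"))        # []
```

The second and third calls are the two problems described above: a common word that happens to be a name, and a name that is one transposition away from the list.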

Approach 3: UI

Instead of detecting the name of a user, we can also just ask the user directly. If you must accurately capture the full name, it would be better just to give the user a form to fill in.

You can read more about forms in our documentation. When you create a form, you can define rules that determine how the required information is retrieved from the conversation. For example, you can add a step in your conversation that confirms if the name is spelled correctly. It might be an extra action in the conversation, but it will be more accurate than any machine learning model out there.

If you're interested in an example of how to build this yourself, be sure to check out our YouTube tutorial.

The main downside of using a form this way is that the user may need to take more time to repeat utterances. It’s a valid trade-off though since you’ll have more control over the quality of your data.

Conclusion: It’s still an unsolved problem.

We don't think there is a one-size-fits-all solution when it comes to digital assistants, just like there is no one-size-fits-all approach to finding names in text. There are so many different languages, norms, and users out there that a custom approach is often necessary to find the best solution.

This doesn't mean that your custom solution needs to be very complicated. If you're detecting names, a small but thoughtfully defined form might be the best solution, versus relying on machine learning. In this post, we’ve outlined a few of the most common approaches to extracting names from text, but remember, in many cases, it’s a simple solution that works best.

A `for`-loop to stop writing.

vincent d warmerdam — Wed, 03 Mar 2021 16:49:44 +0000

Let's make life a whole lot simpler.

When you're working in a notebook you've probably written a for-loop like the one below.

import numpy as np 

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# This is the for-loop everybody keeps writing! 
for size in [10, 15, 20, 25, 30]:
    # At every turn in the loop we add a number to the list.
    data.append(birthday_experiment(class_size=size, n_sim=10_000))

We're doing a simulation here, but it might as well be a grid-search. We're looping over settings in order to collect data in our list.
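
As a sanity check on the simulated numbers, the same probability can be computed exactly. The sketch below assumes, like the simulation, that all 365 birthdays are equally likely; the `birthday_exact` function is a helper introduced here and is not part of the original code.

```python
def birthday_exact(class_size: int, days: int = 365) -> float:
    """Exact probability that at least two people share a birthday."""
    p_all_different = 1.0
    for k in range(class_size):
        # Person k must avoid the k birthdays that are already taken.
        p_all_different *= (days - k) / days
    return 1.0 - p_all_different


for size in [10, 15, 20, 25, 30]:
    print(size, round(birthday_exact(size), 4))
```

The simulated estimates should land close to these values, and closer still as n_sim grows.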

Pandas

We can expand this loop to get our data into a pandas dataframe.

import numpy as np
import pandas as pd

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

sizes = [10, 15, 20, 25, 30]
for size in sizes:
    data.append(birthday_experiment(class_size=size, n_sim=10_000))

# At the end we put everything in a DataFrame. Neeto! 
pd.DataFrame({"size": sizes, "result": data})

So far, so good. But will this pattern work for larger grids?

It gets bad.

Let's see what happens when we add more elements we'd like to loop over.

import numpy as np 

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# We're now looping over a larger grid!
for size in [10, 15, 20, 25, 30]:
    for n_sim in [1_000, 10_000, 100_000]:
        data.append(birthday_experiment(class_size=size, n_sim=n_sim))

We now need to write two loops, and this has a consequence. How can we link up the size parameter with the n_sim parameter when we cast this into a dataframe? You could do something like this:

import numpy as np
import pandas as pd

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# We're now looping over a larger grid!
for size in [10, 15, 20, 25, 30]:
    for n_sim in [1_000, 10_000, 100_000]:
        result = birthday_experiment(class_size=size, n_sim=n_sim)
        row = [size, n_sim, result]
        data.append(row)

# More manual labor. Kind of error prone.
df = pd.DataFrame(data, columns=["size", "n_sim", "result"])

But suddenly we're spending a lot of effort in maintaining a for-loop.

Been here before?

I've noticed that this for-loop keeps getting re-written in a lot of notebooks. You find it in simulations, but also in lots of grid-searches. It's a source of complexity, especially when our nested loops increase in size. So I figured I'd write a small package that can make all this easier.

Decorators

Let's make three minor changes to the code.

import numpy as np 
from memo import memlist

data = []

@memlist(data=data)
def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return {"est_proba": proba}

for size in [5, 10, 20, 30]:
    for n_sim in [1000, 1_000_000]:
        birthday_experiment(class_size=size, n_sim=n_sim)

We've changed three things.

We've added a memlist decorator from the memo package to our original function. This allows us to configure a place to relay our stats into. Note that the decorator receives an empty list as input. It's this data list that will receive new data.
We've changed our function to output a dictionary. This way we can attach names to our output and we're able to support functions that output more than one number.
The for-loops now only run the function and don't handle any state any more.
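
If you're curious how a decorator can collect both the inputs and the outputs, the sketch below shows the general idea in plain Python. It is not memo's implementation: `record_to` is a hypothetical stand-in for memlist, and this version assumes the function is always called with keyword arguments.

```python
from functools import wraps


def record_to(data):
    """Hypothetical stand-in for a memlist-style decorator (not memo's code)."""
    def decorator(func):
        @wraps(func)
        def wrapper(**kwargs):
            result = func(**kwargs)
            # Merge the keyword arguments with the returned dictionary
            # so that every row carries its own settings.
            data.append({**kwargs, **result})
            return result
        return wrapper
    return decorator


rows = []


@record_to(rows)
def add(a, b):
    return {"total": a + b}


for a in [1, 2]:
    for b in [10, 20]:
        add(a=a, b=b)

print(rows[0])  # {'a': 1, 'b': 10, 'total': 11}
```

Because the wrapper sees the keyword arguments by name, the column names come for free and the loop body no longer has to build rows by hand.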

If you were to run this code, the data variable would now contain the following information:

[
    {"class_size": 5, "n_sim": 1000, "est_proba": 0.024},
    {"class_size": 5, "n_sim": 1000000, "est_proba": 0.027178},
    {"class_size": 10, "n_sim": 1000, "est_proba": 0.104},
    {"class_size": 10, "n_sim": 1000000, "est_proba": 0.117062},
    {"class_size": 20, "n_sim": 1000, "est_proba": 0.415},
    {"class_size": 20, "n_sim": 1000000, "est_proba": 0.411571},
    {"class_size": 30, "n_sim": 1000, "est_proba": 0.703},
    {"class_size": 30, "n_sim": 1000000, "est_proba": 0.706033},
]

What's nice about a list of dictionaries is that pandas can parse it directly without the need for you to worry about column names.

pd.DataFrame(data)

Let's do more.

This pattern is nice, but we're still dealing with for-loops. So let's fix that and add some extra features.

import numpy as np 
from memo import memlist, memfile, grid, time_taken

data = []

@memfile(filepath="results.jsonl")
@memlist(data=data)
@time_taken()
def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return {"est_proba": proba}

setting_grid = grid(class_size=[5, 10, 20, 30], n_sim=[1000, 1_000_000])
for settings in setting_grid:
    birthday_experiment(**settings)

Pay attention to the following changes.

We've got two mem-decorators now. One decorator is passing the stats to a list while the other one appends the results to a file ("results.jsonl" to be exact).
We've also added a decorator called time_taken which will make sure that we also log how long the function took to complete.
We've used a grid function to generate a grid of settings on our behalf. It represents a generator of settings that can be passed directly to our function. This way, we need one (and only one) for-loop, even if we are working on large grids. You can even configure it to show a neat little progress bar!
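
The idea behind such a grid can be sketched with `itertools.product` from the standard library. This is an illustration only: `simple_grid` is a made-up name, and the real grid in memo offers more than this (the progress bar, for one).

```python
from itertools import product


def simple_grid(**settings):
    """Yield one dictionary per combination of the given setting values."""
    names = list(settings)
    for combo in product(*settings.values()):
        yield dict(zip(names, combo))


combos = list(simple_grid(class_size=[5, 10], n_sim=[1000, 1_000_000]))
print(len(combos))  # 4
print(combos[0])    # {'class_size': 5, 'n_sim': 1000}
```

Each yielded dictionary can be unpacked with `**` straight into a function that takes keyword arguments, which is why a single loop is enough no matter how many settings the grid has.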

If you were to inspect the "results.jsonl" file it would look like this:

{"class_size": 5, "n_sim": 1000, "est_proba": 0.024, "time_taken": 0.0004899501800537109}
{"class_size": 5, "n_sim": 1000000, "est_proba": 0.027178, "time_taken": 0.19407916069030762}
{"class_size": 10, "n_sim": 1000, "est_proba": 0.104, "time_taken": 0.000598907470703125}
{"class_size": 10, "n_sim": 1000000, "est_proba": 0.117062, "time_taken": 0.3751380443572998}
{"class_size": 20, "n_sim": 1000, "est_proba": 0.415, "time_taken": 0.0009679794311523438}
{"class_size": 20, "n_sim": 1000000, "est_proba": 0.411571, "time_taken": 0.7928380966186523}
{"class_size": 30, "n_sim": 1000, "est_proba": 0.703, "time_taken": 0.0018239021301269531}
{"class_size": 30, "n_sim": 1000000, "est_proba": 0.706033, "time_taken": 1.1375510692596436}
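
Because every line in this file is a JSON object of its own, reading it back takes only a few lines. The sketch below uses just the standard library and writes a throwaway file with two rows so that it can run on its own; `read_jsonl` is a helper name introduced here. With pandas installed, `pd.read_json(path, lines=True)` should give you a dataframe directly.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a JSON-lines file into a list of dictionaries."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with a throwaway file holding two rows in the same shape as above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.jsonl")
    with open(path, "w") as f:
        f.write('{"class_size": 5, "n_sim": 1000, "est_proba": 0.024}\n')
        f.write('{"class_size": 10, "n_sim": 1000, "est_proba": 0.104}\n')
    rows = read_jsonl(path)

print(rows[1]["est_proba"])  # 0.104
```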

When is this useful?

The goal for memo is to make it easier to stop worrying about that one for-loop that we all write. We should just collect stats instead. Note that you can use the decorators from this package to send information to files and lists, but also to callback functions or as a post-request payload to a central logging service.

I've found it useful in many projects. The main example for me is running benchmarks with scikit-learn. I do a lot of NLP and a lot of my components are not serializable with a python pickle which means that I cannot use the standard GridSearch from scikit-learn. You can even combine it with ray to gather statistics from compute happening in parallel. It also plays very nicely with hiplot if you're interested in visualising your statistics.

If this tool sounds useful, feel free to install it via:

pip install memo

If you'd like to understand more about the details, check out the github repo and the documentation page. There's also a full tutorial on calmcode.io in case you're interested.