DEV Community: davide lettieri

Lox as a Racket language module

davide lettieri — Mon, 06 Apr 2026 12:36:37 +0000

For a long time, I wanted to implement a Racket language module with a non-lispy surface syntax. Lox, from Crafting Interpreters, was an obvious candidate: I didn't want to invent a new syntax or new semantics, and I had already ported the project to C#. My main objective was to leverage Racket's language-building facilities while learning Racket, Scheme, and macros.

I attempted this a few years ago, with little success. This time I dropped the yacc and lex libraries and instead followed the approach from the book more closely, along with the C# version I had written earlier. The result is not especially functional in style: the scanner and parser are fairly imperative and rely on mutation, mainly because that made the code easier to port from the earlier implementations. Another big help came from LLMs: I used GitHub Copilot, which helped me fill some gaps in my knowledge and troubleshoot issues that I honestly didn't have the expertise to solve.

I do not use GitHub Copilot autocomplete because it removes all the fun from coding, but I "chatted" with it extensively and also asked it to generate parts I was not particularly interested in, such as the colorer¹.

The code is available on GitHub here. In this post I'll go through the implementation, highlighting the parts that I consider interesting or helpful.

Implementation strategy

The objective of the project is to have a Lox implementation as a Racket language module. For me, that means passing all tests from the Crafting Interpreters repo up to Chapter 13. The original repo provides a Dart script that runs the test suite against any interpreter:

dart tool/bin/test.dart chap13_inheritance --interpreter racket

To get this working, I added #lang racket-lox at the top of each test file and increased the expected line numbers by 1 to account for the extra line. This approach is only effective once a "working" language module is already in place. For this reason, the first few steps of the implementation were done without validation: I wrote a stub of the scanner, the parser and the language expansion. Once I was able to run the tests, the development loop was pretty nice. I also added a few unit tests to confirm some behaviors and iterate more quickly on parts of the implementation.

The implementation is validated at multiple levels:

it passes the chap13_inheritance test suite from the book
the major parts of the implementation (scanner, parser and resolver) have unit tests
there are a few runtime unit tests

The final implementation passes all the relevant Lox tests and all the unit tests added to the repo. However, this doesn't mean that every edge case is necessarily handled correctly. For example, a Gemini review surfaced an issue in the evaluation of -(-0.0) that neither the original Lox tests nor my own tests caught.

Defining the racket-lox language

As a language-oriented programming language, Racket provides all the facilities needed to build custom programming languages. This means having to build 2 different pieces:

An expander module
A reader module

The expander module is the first module to be imported, and it contains all the bindings that will be available. In particular, there is an implicit form, #%module-begin, that must be provided, and a few others, such as #%top or #%datum, that can optionally be provided. The reader module is responsible for reading the program text and converting it into Racket code. If the surface syntax is lisp-like, there are additional facilities to help with the definition of the language.

To compare the pieces of the Lox implementation with the parts of racket-lox:

scanner: both Lox and racket-lox have a scanner. The behavior is almost the same; the racket-lox scanner returns a list of tokens (plus errors, if any).
parser: both Lox and racket-lox have a parser. The behavior is different: the racket-lox parser returns Racket syntax objects. There is no pre-defined AST with classes.
interpreter: racket-lox does not have an interpreter. The language is not interpreted; instead, a lox.rkt file contains macros and functions that replicate Lox behavior in Racket.
resolver: both Lox and racket-lox have a resolver. The behavior is different: the Lox resolver is executed at runtime, before the AST is passed to the interpreter, while in racket-lox the resolver is executed at compile time. Its responsibilities are to forbid:
- invalid top-level return
- returning a value from init
- invalid this usage
- invalid super usage
- class inheriting from itself
- reading a local variable in its own initializer
- duplicate local declarations in the same scope
resolve-redefinitions: this exists only in racket-lox. I used it to support variable re-definition in the top-level scope.
all the "infrastructure" for supporting Racket language modules, such as the reader and the colorer, is obviously only in racket-lox. The reader is necessary

How to verify that the resolver is executed at compile (expansion) time

Execute the following code:

#lang racket-lox
print "before";
this;

and the output will be:

[line 3] Error at 'this': Can't use 'this' outside of a class.

Since "before" is never printed, we know the resolver runs before any code from the source file is executed.

What is resolve-redefinitions

Lox supports variable re-definition in the top-level scope. To support that, I defined a function, resolve-redefinitions, that is executed at expansion time. To make it available at expansion time, I wrapped its definition in a begin-for-syntax. The function goes through all the top-level statements received from the parser and:

it keeps track of defined variables
it replaces a lox-var-declaration with a lox-assign whenever we are re-defining an existing variable. Please note that the function does not need to be recursive because we are interested in rewriting only the top-level statements.
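
The two steps above can be illustrated with a small sketch. This is not the code from the repo: it works on plain S-expressions instead of syntax objects, the function name rewrite-redefinitions is made up for the example, and it assumes that every declaration has the shape (lox-var-declaration name init).

```racket
#lang racket/base
(require racket/match)

;; Hypothetical sketch on plain S-expressions (the real function runs at
;; expansion time on syntax objects). Assumes every declaration looks like
;; (lox-var-declaration name init).
(define (rewrite-redefinitions statements)
  ;; Names already declared at the top level.
  (define seen (make-hasheq))
  ;; A single, non-recursive pass: only top-level statements are rewritten.
  (for/list ([stmt (in-list statements)])
    (match stmt
      [(list 'lox-var-declaration name init)
       (cond
         ;; Second (or later) declaration: turn it into an assignment.
         [(hash-ref seen name #f) (list 'lox-assign name init)]
         ;; First declaration: remember the name and keep it unchanged.
         [else (hash-set! seen name #t) stmt])]
      [_ stmt])))

(define rewritten
  (rewrite-redefinitions
   '((lox-var-declaration v (lox-literal "Hello World!"))
     (lox-var-declaration v (lox-literal "a"))
     (lox-print (lox-variable v)))))
```

In this sketch the second declaration of v becomes (lox-assign v (lox-literal "a")), while the first declaration and the print statement are left untouched.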

The custom #%module-begin form

The racket-lox language uses a custom #%module-begin form for multiple reasons:

we want to execute resolve-statements to enforce Lox language scoping rules.
we want to execute resolve-redefinitions to allow top-level variable re-declaration.
we want to use #%plain-module-begin because the default #%module-begin prints out expression values to the default output port.

For resolve-statements to work, we need to pass it the un-expanded syntax tree produced by the reader. However, Racket might pre-expand some forms before passing them to the language's custom #%module-begin. To avoid that, we wrap the list of statements produced by the reader in a lox-module-wrapper form. The wrapper does nothing by itself: it simply wraps everything in a (begin ...), and the unwrap-forms function removes it when we receive it un-expanded. If Racket decides to "pre-expand" something before passing it to our module, it will only expand the wrapper and not the inner forms. In this way the resolver encounters the expected forms and works as intended.

(define-syntax custom-module-begin
  (syntax-parser
    [(_ form ...)
     (define raw-forms (unwrap-forms #'(form ...)))
     (resolve-statements raw-forms)
     (with-syntax ([(fixed-forms ...) (resolve-redefinitions raw-forms)])
       #'(#%plain-module-begin
          fixed-forms ...))]))

The reader

The reader is a required part of a Racket custom language. Its job is to implement read-syntax and read; both functions return a Racket module. Quoting from the Racket documentation:

The #lang at the start of a module file begins a shorthand for a module form, much like ' is a shorthand for a quote form.

and

The longhand form of a module declaration, which works in a REPL as well as a file, is
(module name-id initial-module-path
 decl ...)

This is exactly what our reader is doing:

(define (read-syntax src in)
  (define source (or src (object-name in)))
  (define tokens (scan-tokens in))
  (define ast (parse tokens))
  (define module-datum
    `(module anonymous-module racket-lox
       (lox-module-wrapper ,@ast)))
  (datum->syntax #f module-datum (list source #f #f #f #f)))

(define (read in)
  (read-syntax #f in))

Let's briefly look at two details:

The module definition uses racket-lox as its initial module. This means that all exported definitions in the package are available to the final module. We need it because we want to use the macros and functions defined in lox.rkt.
We introduce the lox-module-wrapper here; as discussed previously, we need it so that our resolver works correctly.

Scanner

The scanner mainly exposes the scan-tokens function and a few custom structs. scan-tokens takes an input port and walks through the source code. Its result is a scanner-output value containing both the tokens and an error flag. In the book’s implementation, scanner errors are reported through a static method on the interpreter. Since I do not have an interpreter here, I return that information explicitly instead.

Given the imperative style of the implementation, I also defined a while macro to use in loops and to closely mimic the book's implementation. This macro is also used in the language expansion. Nothing fancy; there are plenty of examples of this online.
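
For readers who have not seen one, a while macro can be sketched in a few lines. This is a minimal version written for this post's explanation, not necessarily identical to the one in the repo:

```racket
#lang racket/base

;; A minimal `while` loop: re-check `condition` before every iteration and
;; run `body ...` for its side effects until the condition becomes false.
(define-syntax-rule (while condition body ...)
  (let loop ()
    (when condition
      body ...
      (loop))))

;; Usage: sum the numbers 1..5 with an explicit counter and mutation.
(define i 1)
(define total 0)
(while (<= i 5)
  (set! total (+ total i))
  (set! i (add1 i)))
(displayln total) ; prints 15
```

Because the body ends up inside a when form, it can also contain internal definitions, which is convenient when porting imperative code.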

Something that resembles functional programming, or is at least more in line with Racket style, is the use of for/list in conjunction with in-producer. At the beginning I was using my while macro or a loop, building up lists of objects with cons and then reversing the list to get the correct order. This was ugly as hell, and doing the reverse at the end was painful.

The for/list form has options to stop the collection of items, skip items, etc. The in-producer function creates a lazily evaluated, possibly infinite, sequence of items provided by a producer function.
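
As a rough illustration of the pattern, here is a self-contained sketch. The token source next-token! and the token names are invented for the example; the real scanner reads from an input port instead.

```racket
#lang racket/base

;; Hypothetical token source: each call returns the next token, and 'EOF
;; once the input is exhausted.
(define remaining (list 'VAR 'COMMENT 'IDENTIFIER 'EQUAL 'NUMBER 'SEMICOLON))
(define (next-token!)
  (cond
    [(null? remaining) 'EOF]
    [else
     (define tok (car remaining))
     (set! remaining (cdr remaining))
     tok]))

;; in-producer keeps calling next-token! until it returns the stop value
;; 'EOF; the #:unless clause skips tokens we do not want to collect.
;; The resulting list is already in source order, so no reverse is needed.
(define tokens
  (for/list ([tok (in-producer next-token! 'EOF)]
             #:unless (eq? tok 'COMMENT))
    tok))
```

Here tokens ends up as '(VAR IDENTIFIER EQUAL NUMBER SEMICOLON).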

Overall the implementation is quite straightforward.

Parser

Like the scanner, the parser resembles the source implementation. I didn't define a class to hold the state, so I had two options:

Pass the state around as a parameter.
Use nested function definitions and capture the state from the outer context.

I chose the latter. It behaves almost like having an object instance whose methods access private fields, and it keeps the function signatures simpler because I do not have to pass the state around everywhere.

I don't have a proper syntax tree with pre-defined types; the parser produces a "lisped" version of Lox syntax. The style is mixed: I'm using for/list and in-producer like in the scanner, but also more imperative constructs such as my while macro and hand-written loops.

Thanks to the macro support, I was able to extract some repetitive logic that appears when parsing expressions:

logic_or       → logic_and ( "or" logic_and )* ;
logic_and      → equality ( "and" equality )* ;
equality       → comparison ( ( "!=" | "==" ) comparison )* ;
comparison     → term ( ( ">" | ">=" | "<" | "<=" ) term )* ;
term           → factor ( ( "-" | "+" ) factor )* ;
factor         → unary ( ( "/" | "*" ) unary )* ;

All these productions have a similar form; they are all binary expressions:

They depend on another production
They have a set of tokens for separating inner productions

With that in mind, I defined the following macro and used it to define all of these productions:

Macro for parsing binary ops

(define-syntax-rule (iterative-production name production . token-types)
  (define (name)
    (define expr (production))
    (while (match . token-types)
      (define op (previous))
      (define right (production))
      (define op-type (token-type op))
      (set! expr (datum->syntax #f `(lox-binary ,expr ,op-type ,right) (token->src op))))
    expr))
(iterative-production factor unary 'SLASH 'STAR)
(iterative-production term factor 'MINUS 'PLUS)
(iterative-production or-syntax and-syntax 'OR)
(iterative-production and-syntax equality 'AND)
(iterative-production equality comparison 'BANG_EQUAL 'EQUAL_EQUAL)
(iterative-production comparison term 'GREATER 'GREATER_EQUAL 'LESS 'LESS_EQUAL)

As in other parts of the code, there are a lot of imperative pieces, such as the while loop and the set! that "accumulates" the result into the expr variable.

Another interesting part is the assignment production. Here is my C# version:

Assignment parsing in my C# implementation

private IExpr Assignment()
{
    var expr = Or();

    if (Match(EQUAL))
    {
        var equals = Previous();
        var value = Assignment();

        switch (expr)
        {
            case Variable v:
                return new Assign(v.Name, value);
            case Get g:
                return new Set(g.Obj, g.Name, value);
            default:
                Error(equals, "Invalid assignment target.");
                break;
        }
    }

    return expr;
}

In C# and in Java we can check whether an object is of a specific type and cast it to that type. Since I don't have any types, only Racket expressions produced by the parser, I can't follow that approach. However, macros are essentially functions with signature syntax -> syntax bound to a given name, so we can write a plain syntax -> syntax function that inspects what we have and reacts accordingly. The type check on the expr variable is replaced with a syntax-parse that recognizes the shape of the syntax we are expecting and produces the desired output syntax (as opposed to a new AST object in the original Lox implementation). The #:datum-literals (lox-variable lox-get) option tells syntax-parse that those elements must be matched literally: they are not pattern variables like name:expr or obj:expr:

(define (assignment)
  (define expression (or-syntax))
  (when (match 'EQUAL)
    (define equal (previous))
    (define value (assignment))
    (with-syntax ([value value])
      (set! expression
            (syntax-parse expression
              #:datum-literals (lox-variable lox-get)
              [(lox-variable name:expr)
               ;; reuse expression's source location
               (syntax/loc expression
                 (lox-assign name value))]
              [(lox-get obj:expr name:expr)
               (syntax/loc expression
                 (lox-set obj name value))]
              [_ (parse-error equal "Invalid assignment target.")]))))
  expression)

Lox

The lox.rkt file takes the place of the interpreter from the book. Since this is a Racket language module, the goal is not to interpret Lox directly but to translate it into Racket. The expanded syntax is therefore responsible for reproducing the runtime errors that the interpreter would have raised.

The syntax produced by the parser is something like:

(lox-var-declaration v (lox-literal "Hello World!"))
(lox-print (lox-variable v))

We need to translate this into standard Racket forms and functions, with some care, because there are semantic differences between Racket and Lox that we need to handle on our own. For example, Lox allows the following code (re-declaring a variable in the top-level scope):

var v = "Hello World!";
var v = "a";

Racket does not behave the same:

(define v "Hello World!")
(define v "a")

It fails with

module: identifier already defined in: v

This is an essential difference, but there are others as well: for example, "truthiness" and the nil object, or the lack of a return statement to be used inside function bodies in Racket. I will go through some parts of the Lox expansion module to explain how it works.

Nil and truthiness

Lox nil is represented by the lox-nil binding, whose value is the symbol nil. To support truthiness, I defined a helper function that is used by the other parts of the syntax that interact with boolean values.

(define (lox-truthy? v)
  (not (or (eq? v #f) (eq? v lox-nil))))

(define-syntax (lox-or stx)
  (syntax-parse stx
    [(_ left:expr right:expr) #'(let ([l-val left]) (if (lox-truthy? l-val) l-val right))]))

(define-syntax (lox-and stx)
  (syntax-parse stx
    [(_ left:expr right:expr) #'(let ([l-val left]) (if (lox-truthy? l-val) right l-val))]))

(define-syntax (lox-while stx)
  (syntax-parse stx
    [(_ cond:expr body:expr ...) #'(while (lox-truthy? cond) body ...)]))

(define-syntax (lox-if stx)
  (syntax-parse stx
    [(_ cond then)
     #'(when (lox-truthy? cond)
         then)]
    [(_ cond then else) #'(if (lox-truthy? cond) then else)]))

(define-syntax (lox-unary stx)
  (with-syntax ([line (syntax-line stx)])
    (syntax-parse stx
      #:datum-literals (BANG MINUS)
      [(_ BANG v:expr) #'(not (lox-truthy? v))]
      [(_ MINUS v:expr) #'(lox-negate-impl v line)])))

Binary operations

I already described the macro used during the parsing phase of the language module. During expansion, too, I used macros to remove some code duplication, in addition to a helper function that validates arguments.

A rather big macro performs a "dispatch" to the function defining each binary operation. When possible, numeric binary operations are defined using the helper lox-number-binary-with-validation; some others, like lox-add-impl, are defined ad hoc.

(define-syntax (lox-binary stx)
  (with-syntax ([line (syntax-line stx)])
    (syntax-parse stx
      #:datum-literals
      (PLUS MINUS GREATER GREATER_EQUAL LESS LESS_EQUAL SLASH STAR BANG_EQUAL EQUAL_EQUAL AND OR)
      [(_ left:expr PLUS right:expr) #'(lox-add-impl left right line)]
      [(_ left:expr MINUS right:expr) #'(lox-number-binary-with-validation - left right line)]
      [(_ left:expr GREATER right:expr) #'(lox-number-binary-with-validation > left right line)]
      [(_ left:expr GREATER_EQUAL right:expr)
       #'(lox-number-binary-with-validation >= left right line)]
      [(_ left:expr LESS right:expr) #'(lox-number-binary-with-validation < left right line)]
      [(_ left:expr LESS_EQUAL right:expr) #'(lox-number-binary-with-validation <= left right line)]
      [(_ left:expr SLASH right:expr) #'(lox-divide-impl left right line)]
      [(_ left:expr STAR right:expr) #'(lox-number-binary-with-validation * left right line)]
      [(_ left:expr BANG_EQUAL right:expr) #'(not (lox-eqv? left right))]
      [(_ left:expr EQUAL_EQUAL right:expr) #'(lox-eqv? left right)]
      [(_ left:expr AND right:expr) #'(lox-and left right)]
      [(_ left:expr OR right:expr) #'(lox-or left right)])))
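
The validation helper itself is not shown above. A minimal sketch of the idea could look like the following. Both definitions are assumptions made for the example: runtime-error is a stand-in for the real lox-runtime-error (whose definition and message format are not shown in this post), and number-binary-with-validation is written as a plain function, which may differ from the actual helper. The "Operands must be numbers." message is the one used in Crafting Interpreters.

```racket
#lang racket/base

;; Stand-in for lox-runtime-error; the message layout is an assumption.
(define (runtime-error message line)
  (error (format "~a [line ~a]" message line)))

;; Hypothetical sketch: apply `op` only when both operands are numbers,
;; otherwise report a Lox-style runtime error for the given source line.
(define (number-binary-with-validation op left right line)
  (if (and (number? left) (number? right))
      (op left right)
      (runtime-error "Operands must be numbers." line)))

;; Usage: valid operands are passed straight to the Racket operator.
(define difference (number-binary-with-validation - 5 3 1))

;; Invalid operands raise an exception instead of reaching the operator.
(define failure-message
  (with-handlers ([exn:fail? exn-message])
    (number-binary-with-validation * "a" 3 7)))
```

Passing the operator as an argument is what lets a single helper cover subtraction, multiplication and all the comparisons.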

Undefined variables

Lox's undefined variable behavior, a runtime error with a specific error code and message, is implemented partially here and partially in the main module. In Racket, any unbound identifier is expanded through a #%top form, which I'm overriding to get the desired behavior. So I defined a lox-top macro:

(define-syntax (lox-top stx)
  (syntax-parse stx
    [(_ . id:id)
     (with-syntax ([line (or (syntax-line #'id) (syntax-line stx) 0)]
                   [str-id (symbol->string (syntax->datum #'id))])
       #'(lox-runtime-error (format "Undefined variable '~a'." str-id) line))]))

This macro is used in the main module to override the default #%top form.
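
To show the mechanism in isolation, here is a small self-contained sketch that uses submodules instead of separate files. The names tiny-lang, tiny-top, client and read-missing are made up for the example, and the error is raised with Racket's plain error function rather than lox-runtime-error. The key part is the provide form, which hides racket/base's #%top and exports the custom macro under that name.

```racket
#lang racket/base

;; A tiny module language: everything from racket/base, except that #%top
;; is replaced by our own macro.
(module tiny-lang racket/base
  (require (for-syntax racket/base syntax/parse))
  (provide (except-out (all-from-out racket/base) #%top)
           (rename-out [tiny-top #%top]))
  ;; Unbound identifiers expand to (#%top . id), so this macro decides
  ;; what a reference to an undefined variable means.
  (define-syntax (tiny-top stx)
    (syntax-parse stx
      [(_ . id:id)
       (with-syntax ([name (symbol->string (syntax-e #'id))])
         #'(error (format "Undefined variable '~a'." name)))])))

;; A module written in the tiny language. The unbound identifier is no
;; longer a compile-time error; it fails only when the code runs.
(module client (submod ".." tiny-lang)
  (provide read-missing)
  (define (read-missing)
    (with-handlers ([exn:fail? exn-message])
      not-defined-anywhere)))

(require 'client)
(displayln (read-missing))
```

With racket/base's own #%top, the client module would not even compile; with the override, the failure is deferred to runtime, which is the behavior Lox expects.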

Return statement

Racket does not have a built-in return statement for function bodies, but it provides the machinery needed to implement one. The core runtime piece is let/ec, which gives us an escape continuation. The macro-level piece is a syntax parameter: while expanding a function body, I rename return-param to that function’s escape continuation. This means that each function body expands with its own target for lox-return, so returns inside nested functions correctly exit the inner function rather than the outer one.

(define (foo)
  (let/ec k
    (displayln "inside function")
    (k 1)
    (displayln "post early return")
    2))

; sample execution
(define v (foo))
(displayln v)

; output
; inside function
; 1

The Racket documentation for let/ec is pretty scarce, and I don't have much to add besides that the approach works and is simple to use.

We can see that the k in let/ec k is a new binding that's being introduced because (let/ec k body ...+) is equivalent to (call/ec (lambda (k) body ...)). We would like to return whenever we encounter a lox-return either alone or with a following value/expression. Using (let/ec lox-return ...) wouldn't work because we are defining a new binding and not using our parsed syntax object.

That is why I use a syntax parameter together with syntax-parameterize and make-rename-transformer: inside that expansion context, lox-return is rewritten to the escape continuation k. Nested function bodies are expanded under their own parameterization, so a return always targets the correct function.

(define-syntax-parameter return-param
  (lambda (stx) (raise-syntax-error #f "return used outside of function" stx)))

(define-syntax (lox-return stx)
  (syntax-parse stx
    [(_ val) #'(return-param val)]))

; lox-run-callable-body is used multiple times
; otherwise it could have been included in the lox-function definition
(define-syntax-rule (lox-run-callable-body ((param binding) ...) stmt ...)
  (let/ec k
    (syntax-parameterize ([return-param (make-rename-transformer #'k)]
                          [param binding] ...)
      (lox-block stmt ...))))

(define-syntax (lox-function stx)
  (syntax-parse stx
    [(_ name:id (arg:id ...) (stmt ...))
     #'(define (name arg ...)
         (lox-run-callable-body () stmt ...))]))

The lox-run-callable-body indirection is useful to re-use the same code in functions and in class methods. The [param binding] ... part is used in class methods to implement this and super; both are defined as syntax parameters.

Classes

Class definition is by far the most complex part of the project. As just mentioned, we need to support this and super, but also methods, fields, and the printing of class definitions, class instances, instance methods, etc. It's really a lot of functionality for a single language construct.

My first attempt at defining a Lox class used Racket's class system, since it already supports fields, methods, this, and related features. Possibly not everything has the same semantics and functionality required by Lox, but it was worth a try.

I tried; however, the expansion of even a very simple class is extremely complex. The following Racket code

(define foo% (class object%))

expands to

(define-values
 (foo%)
 (#%app
  compose-class
  'foo%
  object%
  (#%app list)
  (#%app current-inspector)
  '#f
  '#f
  '0
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  '()
  'normal
  (lambda (local-accessor local-mutator)
    (let-values ()
      (let-values ()
        (let-values ()
          (let-values ()
            (let-values ()
              (let-values ()
                (let-values ()
                  (let-values ()
                    (letrec-values ()
                      (#%app
                       values
                       (#%app list)
                       (#%app list)
                       (#%app list)
                       (lambda (self561
                                super-go
                                si_c
                                si_inited?
                                si_leftovers
                                init-args)
                         (let-values ()
                           (let-values ()
                             (let-values ()
                               (let-values ()
                                 (let-values ()
                                   (#%app void)
                                   '(declare-field-use-start))))))))))))))))))
  '#f
  '#f))

Troubleshooting my scanner, parser and macros on top of this expansion was nearly impossible. I decided to follow an approach similar to the one used in the Crafting Interpreters book. I defined two types:

(struct lox-class-constructor (base name method-table superclass)
  #:property prop:procedure
  (struct-field-index base))

(struct lox-class-instance (class fields))

The first one, lox-class-constructor, is similar to LoxClass in Crafting Interpreters. The #:property prop:procedure makes the struct "callable": it is a procedure and, as the name suggests, calling an instance of lox-class-constructor returns an instance of the class it defines. The base field holds the procedure that is invoked when we call an instance of lox-class-constructor. The lox-class-instance struct represents an instance of a given class.
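
If prop:procedure is new to you, the following standalone sketch shows the same technique on a made-up struct (greeter, proc and hello are names invented for this example):

```racket
#lang racket/base

;; Hypothetical struct: the `proc` field holds the procedure that runs
;; whenever a `greeter` value is applied like a function.
(struct greeter (proc name)
  #:property prop:procedure
  (struct-field-index proc))

(define hello
  (greeter (lambda (who) (string-append "Hello, " who "!"))
           "hello"))

;; The struct instance can be called directly...
(displayln (hello "Lox"))        ; prints Hello, Lox!
;; ...while still carrying ordinary fields next to the procedure.
(displayln (greeter-name hello)) ; prints hello
(displayln (procedure? hello))   ; prints #t
```

This is what lets a class value act both as a constructor that can be called and as a record that carries the class name, the method table and the superclass.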

Besides these custom types, the implementation needs several additional functions and macros. We want a lox-class-constructor to return a lox-class-instance. In practice, lox-class-instance is the runtime type representing an instance of a Lox class, and each instance must keep a reference to the lox-class-constructor that created it through its class field. Since these values refer to each other, their definitions are mutually recursive, so we use:

(define (make-lox-class-constructor class-name-str superclass-value method-table)
  (letrec ([class (lox-class-constructor
                   (lambda ctor-args
                     (define fields (make-hash))
                     (define self (lox-class-instance class fields))
                     (define maybe-init (lox-class-bind-method class 'init self))
                     (when maybe-init
                       (lox-call-impl maybe-init ctor-args (current-call-line)))
                     (when (and (not maybe-init) (not (null? ctor-args)))
                       (lox-runtime-error (format "Expected 0 arguments but got ~a."
                                                  (length ctor-args))
                                          (current-call-line)))
                     self)
                   class-name-str
                   method-table
                   superclass-value)])
    class))

Let's go line by line:

In line 1 we define a helper function, make-lox-class-constructor, that builds a lox-class-constructor instance for a given class.
In line 2, to define a lox-class-constructor we need a function that returns a lox-class-instance holding a reference to the lox-class-constructor itself. We use letrec so that class can be used inside its own definition.
In line 3 there is not much to say: it is the syntax to define the lambda.
In line 4 we create a hash table to hold the fields of the instance.
In line 5 we define the instance. Notice the class value passed into the struct constructor.
In line 6 we look for an init method using our helper lox-class-bind-method.
In lines 7-8, if there is an init method, we call it (line 8).
In lines 9-12, if there is no init method but we received arguments, we raise an error.
In line 13 we use self as the return value of the lambda. This completes the lambda definition, which is the first argument to lox-class-constructor.
In lines 14-16 we pass the remaining arguments.
In line 17 we use class as the return value of make-lox-class-constructor.

So we now have a helper function that creates a class constructor, but what are the superclass-value and the method-table? Let's look at the actual lox-class syntax object and its expansion:

(define-syntax (lox-class stx)
  (syntax-parse stx
    #:datum-literals (lox-function)
    [(_ class-name:id superclass:expr ((lox-function m-name:id (m-arg:id ...) (m-body:expr ...)) ...))
     (with-syntax ([class-line (or (syntax-line #'class-name) (syntax-line stx) 0)])
       #'(define class-name
           (let ([superclass-value superclass])
             (lox-validate-superclass superclass-value class-line)
             (define method-table
               (make-hasheq
                (list (lox-make-method-entry m-name superclass-value (m-arg ...) m-body ...) ...)))
             (make-lox-class-constructor (symbol->string 'class-name)
                                         superclass-value
                                         method-table))))]))

First we notice that lox-class expands to a (define class-name ...), binding class-name to the constructor of the class. The superclass-value, when available, will be the constructor of the superclass. The method table is built using some helper functions and macros.

Keywords `this` and `super` and instance methods definition

In our Lox classes the methods are stored as factories in a method table on the class definition, not on the instance. The first helper we encounter to support this implementation is lox-make-method-entry, a macro that returns a pair of values: the method name and a method factory. The method factory does a lot of work:

it uses procedure-rename so that when we print a method we get the desired name.
it binds this to receiver. The this value is only known at runtime, which is why we pass it to the factory, so that it points to the correct instance of the class.
it binds super to superclass-value. The superclass is known at compile time, and indeed it is an argument of the macro itself.
it defines the body of the method: result is set to the value returned by lox-run-callable-body.
it passes two new syntax parameter bindings so that lox-this and lox-super are correctly rewritten in the method body.
it forces the return of this if the method is init.

(define-syntax-rule (lox-make-method-entry m-name superclass-value (m-arg ...) m-body ...)
  (cons 'm-name
        (lambda (receiver)
          (procedure-rename
           (lambda (m-arg ...)
             (let ([this receiver]
                   [super superclass-value])
               (define result
                 (lox-run-callable-body ((this-param (make-rename-transformer #'this))
                                         (super-param (make-rename-transformer #'super)))
                                        m-body ...))
               (if (eq? 'm-name 'init) this result)))
           'm-name))))

Instance method executions

Calling a method on an instance is another complex part of the implementation. Let's start with a simple class definition with an empty method:

class A {
  method() {}
}
var a = A();
a.method();

Let's look at the expansion of this simple code:

(lox-class A #f ((lox-function method () ()))) 
(lox-var-declaration a (lox-call (lox-variable A))) 
(lox-call (lox-get (lox-variable a) "method"))

We are interested in the last line, where we have an interaction between lox-call and lox-get. I already said that classes in racket-lox hold a method factory table and that we need to pass the receiver at runtime to get the actual function to be called. All of this is done by the lox-get macro and the more interesting lox-get-impl function:

(define-syntax (lox-get stx)
  (syntax-parse stx
    [(_ obj method:str)
     (with-syntax ([method-sym (string->symbol (syntax->datum #'method))]
                   [line (or (syntax-line #'method) (syntax-line stx) 0)])
       #'(lox-get-impl obj 'method-sym line))]))

(define (lox-get-impl o symbol-name line)
  (cond
    [(lox-class-instance? o)
     (hash-ref (lox-class-instance-fields o)
               symbol-name
               (lambda ()
                 (define maybe-method
                   (lox-class-bind-method (lox-class-instance-class o) symbol-name o))
                 (if maybe-method
                     maybe-method
                     (lox-runtime-error (format "Undefined property '~a'." symbol-name) line))))]
    [else (lox-runtime-error "Only instances have properties." line)]))

The macro only extracts the original line in the source code and the method name, and passes both to the function along with the instance of the class. The function checks whether the instance contains a field named symbol-name and returns it if that's the case. Otherwise it tries to bind the method to the current instance of the class. The lox-class-bind-method function receives the class (lox-class-instance-class o) as the holder of the method factory table, the method name symbol-name, and the instance o, so that it can bind the method to the correct receiver. To work properly, lox-class-bind-method needs to walk recursively up the inheritance chain whenever it doesn't find the method on the current class.
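
The recursive lookup can be sketched as follows. The two struct definitions are the ones shown earlier; everything else is an assumption made for the example: bind-method is a simplified stand-in for lox-class-bind-method, a missing superclass is assumed to be represented by #f, and the classes are built by hand instead of through the lox-class macro.

```racket
#lang racket/base

;; Struct definitions as shown earlier in the post.
(struct lox-class-constructor (base name method-table superclass)
  #:property prop:procedure
  (struct-field-index base))
(struct lox-class-instance (class fields))

;; Hypothetical sketch of the lookup: find the method factory on `class`,
;; or on the nearest ancestor that defines it, and apply it to `receiver`.
;; Returns #f when no class in the chain defines the method.
(define (bind-method class method-name receiver)
  (cond
    [(not (lox-class-constructor? class)) #f] ; ran out of superclasses
    [(hash-ref (lox-class-constructor-method-table class) method-name #f)
     => (lambda (factory) (factory receiver))]
    [else
     (bind-method (lox-class-constructor-superclass class) method-name receiver)]))

;; Hand-built classes: B inherits from A and defines no methods itself.
(define (no-constructor . args)
  (error "constructor body omitted in this sketch"))
(define class-a
  (lox-class-constructor
   no-constructor
   "A"
   (make-hasheq
    (list (cons 'describe (lambda (receiver) (lambda () "describe from A")))))
   #f))
(define class-b (lox-class-constructor no-constructor "B" (make-hasheq) class-a))
(define b (lox-class-instance class-b (make-hash)))

(define inherited ((bind-method class-b 'describe b))) ; found on A
(define missing (bind-method class-b 'missing b))      ; not found anywhere
```

In this sketch inherited is the string returned by the method defined on A, and missing is #f, which is the case where lox-get-impl reports an undefined property.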

Property lookup

There is not much to add here. We already discussed how lox-get-impl searches for the symbol it receives inside the hashmap of the instance fields.

Super

With all we have seen so far, super implementation is pretty straightforward. Let's start with the expansion:

class A {
  methodA()  {print "inside A";}
}

class B < A {
  method() { super.methodA();}
}

(lox-class A #f ((lox-function methodA () ())))
(lox-class B A ((lox-function method () ((lox-call (lox-super "methodA"))))))

The implementation is made of a macro and a function:

(define-syntax (lox-super stx)
  (syntax-parse stx
    [(_ method:str)
     (with-syntax ([method-sym (string->symbol (syntax->datum #'method))]
                   [line (or (syntax-line #'method) (syntax-line stx) 0)])
       #'(lox-super-impl super-param this-param 'method-sym line))]))

(define (lox-super-impl superclass receiver method-sym line)
  (if (lox-class-constructor? superclass)
      (let ([method (lox-class-bind-method superclass method-sym receiver)])
        (if method
            method
            (lox-runtime-error (format "Undefined property '~a'." method-sym) line)))
      (lox-runtime-error "Superclass must be a class." line)))

The macro (again) is not doing much: it extracts the line number and the method name and passes them to lox-super-impl along with the this and super syntax parameters. The function lox-super-impl passes the values to lox-class-bind-method, which we already discussed, starting the recursive lookup from the parent class instead of the current class, as the regular method lookup flow does.
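To see why the starting point of the lookup matters, here is a small Python sketch of the same idea. It is an assumption-laden illustration, not the racket-lox implementation: LoxClass, LoxInstance, find_method and lox_super are hypothetical names, and methods are modeled as plain functions taking the receiver as their first argument.

```python
from functools import partial


class LoxClass:
    def __init__(self, name, superclass=None, methods=None):
        self.name = name
        self.superclass = superclass
        self.methods = methods or {}

    def find_method(self, name):
        # Recursive lookup up the inheritance chain, written as a loop.
        klass = self
        while klass is not None:
            if name in klass.methods:
                return klass.methods[name]
            klass = klass.superclass
        return None


class LoxInstance:
    def __init__(self, klass):
        self.klass = klass


def lox_super(superclass, receiver, name):
    if not isinstance(superclass, LoxClass):
        raise RuntimeError("Superclass must be a class.")
    # The lookup starts at the superclass of the class that contains the
    # `super` expression, not at the class of the receiver.
    method = superclass.find_method(name)
    if method is None:
        raise RuntimeError(f"Undefined property '{name}'.")
    return partial(method, receiver)


class_a = LoxClass("A", methods={"describe": lambda this: "A"})
class_b = LoxClass(
    "B",
    class_a,
    {"describe": lambda this: "B > " + lox_super(class_a, this, "describe")()},
)
class_c = LoxClass("C", class_b)  # C inherits describe from B

instance = LoxInstance(class_c)
result = partial(class_c.find_method("describe"), instance)()
```

Even though the receiver is an instance of C, the super call inside B resolves describe on A. Starting the lookup from the receiver's class would find B's method again and recurse forever.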

Additional compatibility gaps between Lox and Racket

Racket semantics differs from Lox semantics on a few additional points:

Printing

The lox-print implementation feels a bit "hacky", but it works correctly. A series of runtime checks tailors the printed string to the Lox requirements. The lox-class-constructor and lox-class-instance custom structs help with the printing as well. Native types, such as booleans and numbers, have their own helpers to support Lox-style printing.

(define (lox-print value)
  (cond
    [(boolean? value) (print-bool value)]
    [(eqv? value 'nil) (displayln "nil")]
    [(number? value) (displayln (lox-number->string value))]
    [(lox-class-constructor? value) (displayln (lox-class-constructor-name value))]
    [(lox-class-instance? value)
     (displayln (format "~a instance" (lox-class-constructor-name (lox-class-instance-class value))))]
    [(procedure? value)
     (let ([function-name (object-name value)])
       (if (eqv? function-name 'clock)
           (displayln "<native fn>")
           (displayln (format "<fn ~a>" function-name))))]
    [else (displayln value)]))

(define (lox-number->string value)
  (cond
    ;; Preserve negative zero so `print -0;` matches Crafting Interpreters output.
    [(and (real? value) (inexact? value) (eqv? value -0.0)) "-0"]
    ;; Lox prints whole-valued numbers without a trailing ".0".
    [(and (real? value) (integer? value)) (number->string (inexact->exact value))]
    [else (number->string value)]))
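The two number-printing rules handled by lox-number->string (negative zero and whole-valued numbers) are easy to check in isolation. The following Python sketch mirrors only those two rules; the function name lox_number_to_string is hypothetical, and the fallback branch simply uses Python's own float formatting, which is an assumption and may differ from what Lox prints for values such as NaN or infinity.

```python
import math


def lox_number_to_string(value):
    value = float(value)
    # Negative zero compares equal to zero, so the sign is checked separately.
    if value == 0.0 and math.copysign(1.0, value) < 0.0:
        return "-0"
    # Whole-valued numbers are printed without a trailing ".0".
    if math.isfinite(value) and value.is_integer():
        return str(int(value))
    # Assumption: everything else falls back to the host formatting.
    return repr(value)


examples = [lox_number_to_string(v) for v in (-0.0, 0.0, 1.0, 2.5)]
```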

Numbers

There are a few differences between how Racket handles and prints numbers and how Lox does it. The previous section shows the snippet that prints -0.0 as -0 and 1.0 as 1. Division in Lox is implemented using Java double division, so we similarly use inexact numbers. Equality requires some extra handling for numbers as well.

(define (lox-divide-impl av bv line)
  (if (and (number? av) (number? bv))
      (/ (exact->inexact av) (exact->inexact bv))
      (lox-runtime-error "Operands must be numbers." line)))

(define (lox-eqv? a b)
  (cond
    [(and (real? a) (nan? a)) #f]
    [(and (real? b) (nan? b)) #f]
    [(and (number? a) (number? b)) (= a b)]
    [else (eqv? a b)]))

The clock function

The clock function is defined and exported in the main.rkt file. That makes it available to racket-lox programs.

DrRacket not supported properly

In order to implement Lox exit codes on failure, racket-lox has dedicated errors and error handlers and doesn't raise the syntax errors that would make syntax error highlighting work in DrRacket. The colorer, fully AI generated, gives racket-lox a decent look in DrRacket without being perfect. Indentation is wrong, for example. I didn't invest much time in this; I just wanted code and comments to be colored correctly.

Footnotes

¹ More details on syntax coloring here

Extensible Visitor Pattern in C#

davide lettieri — Fri, 23 Jan 2026 07:12:16 +0000

Recently, I stumbled upon this paper Synthesizing Object-Oriented and Functional Design to Promote Re-Use. The paper wants to provide a solution to the expression problem. The authors suggest an improved version of the visitor pattern that they call "extensible visitor pattern" which is essentially a combination of the visitor pattern with the factory method pattern.

While the paper and the expression problem statement don't explicitly mention SOLID principles, it looks like what they are really doing is exploring how to evolve a code base while respecting the Open/Closed Principle:

Software entities (like classes, methods, and functions) should be open for extension, but closed for modification.

The idea that software entities should be open for extension is intuitive: we can add new code, inherit types, compose, etc. But what does "closed for modification" mean? Should we never change code that has been deployed to production? Is that what the principle is saying?

In my opinion, changing a piece of code to fix a bug doesn't go against this principle. I believe the principle suggests that we should be able to add functionality without changing existing code.

Honestly, if I think about this principle and what I saw in my career in software development, I can safely affirm that I never saw this principle applied faithfully. Classes, methods and all software constructs are modified all the time.

The authors work through several examples all based on the same scenario: we have a set of shapes and a set of tools, essentially functions, over these shapes and we want either to add a new shape or to add a new tool. How can we update our code without changing the existing code? And what impact does our approach have on clients using our code?

They present 4 different implementations:

functional without any pattern
object oriented using the interpreter pattern
object oriented using the visitor pattern
object oriented using the extensible visitor pattern that is subject of the paper

The code samples are written in Java, a language called "Pizza" which is "a parametrically polymorphic extension of Java", and SML for the functional approach example. Given that I'm a big fan of the visitor pattern, I wanted to go through the paper and reimplement everything using F# for the functional approach and C# for everything else. The code will not be an exact port of the original, mostly because I know neither Pizza nor Java, but also because they don't show all the code and I want to show a bit more than they did.

The key point to observe in the presented problem is that the types are recursive; in other words, shapes are defined using other shapes, for example the translation of a shape or the union of two shapes. This case cannot be ignored when we build a tree of types to represent some kind of domain and problem.

❕ Info
The authors used an abstract class as base type for the object oriented approaches, I used an interface. I'll try to motivate my choice later on.

All the code is available here https://github.com/davidelettieri/extensible-visitor.

Functional approach

The functional approach is the simplest and shortest of all: it is super easy to add a new tool, and it is impossible to add a new data type without changing existing code. Let's examine why it's impossible.

Functional approach - F# implementation

type Point = { X: float; Y: float }

type Shape =
    | Circle of radius: float
    | Square of length: float
    | Translated of shape: Shape * offset: Point

let rec containsPoint shape point =
    match shape with
    | Circle radius -> point.X * point.X + point.Y * point.Y <= radius * radius
    | Square length -> point.X >= 0.0 && point.X <= length && point.Y >= 0.0 && point.Y <= length
    | Translated(shape, offset) ->
        let translatedPoint =
            { X = point.X - offset.X
              Y = point.Y - offset.Y }
        containsPoint shape translatedPoint

If we want to add a shape we need to modify the Shape definition, there is no way around that.

❗ Note
To be honest, I'm not sure this is totally bad. The code is so short and clean and adding a new type will trigger a compilation error. It's exactly what we want, we know where we have to fix the code so that the new shape is supported everywhere. But doubting the OCP is beyond the scope of this post.

As a bonus I tried a functional approach in C# and while the result is more verbose, we are able to define a new function ContainsPointV2() that supports a new shape and we don't need to modify any existing code.

Functional approach - C# Pattern Matching

public static class Tools
{
    public static bool ContainsPoint(Point point, IShape shape) =>
        shape switch
        {
            Square s => point.X >= 0 && point.X <= s.Length &&
                        point.Y >= 0 && point.Y <= s.Length,
            Circle c => point.X * point.X + point.Y * point.Y <= c.Radius * c.Radius,
            TranslatedShape ts => ContainsPoint(
                new Point(point.X - ts.Point.X, point.Y - ts.Point.Y),
                ts.Shape),
            _ => throw new NotSupportedException($"Shape of type {shape.GetType().Name} is not supported")
        };

    // Adding a new tool Shrink is easy - just add a new function
    public static IShape Shrink(double num, IShape shape) =>
        shape switch
        {
            Square s => new Square(s.Length / num),
            // Divide, like the Square case, so both shapes shrink by the same factor
            Circle c => new Circle(c.Radius / num),
            _ => throw new NotSupportedException($"Shape of type {shape.GetType().Name} is not supported")
        };

    // New ContainsPoint that supports UnionShape
    public static bool ContainsPointV2(Point point, IShape shape) =>
        shape switch
        {
            Square s => ContainsPoint(point, s),
            Circle c => ContainsPoint(point, c),
            TranslatedShape ts => ContainsPointV2(
                new Point(point.X - ts.Point.X, point.Y - ts.Point.Y),
                ts.Shape),
            UnionShape s => ContainsPointV2(point, s.Shape1) || ContainsPointV2(point, s.Shape2),
            _ => throw new NotSupportedException($"Shape of type {shape.GetType().Name} is not supported")
        };
}

The TranslatedShape case of ContainsPointV2() is key to correctly supporting the new shape. Since the TranslatedShape type is recursive, when we define a new tool to support a new shape, any instance of TranslatedShape could contain an instance of UnionShape. This means that the recursive call needs to go through the new tool definition, in this case ContainsPointV2(). This recursion is the key to understanding the approach of the paper, and the extensible visitor pattern implements this exact behavior.

Of course, with this approach we accept that we might get runtime exceptions, for example if someone passes a UnionShape instance to ContainsPoint(). That is not exactly safe, and while we are not changing our own code, clients using it need to switch to ContainsPointV2() in order to handle the new UnionShape type correctly.
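The same recursion issue can be reproduced in a few lines of Python. This is a sketch with hypothetical names (contains_point, contains_point_v2 and the dataclasses are mine, mirroring the C# listing), and it only aims to show why the Translated case has to recurse through the new function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Square:
    length: float


@dataclass(frozen=True)
class Translated:
    shape: object
    offset: Point


@dataclass(frozen=True)
class UnionShape:
    shape1: object
    shape2: object


def contains_point(point, shape):
    """Original tool: knows nothing about UnionShape."""
    if isinstance(shape, Square):
        return 0 <= point.x <= shape.length and 0 <= point.y <= shape.length
    if isinstance(shape, Translated):
        moved = Point(point.x - shape.offset.x, point.y - shape.offset.y)
        return contains_point(moved, shape.shape)
    raise NotImplementedError(type(shape).__name__)


def contains_point_v2(point, shape):
    """New tool: reuses the old one for leaves, but recurses through itself."""
    if isinstance(shape, UnionShape):
        return contains_point_v2(point, shape.shape1) or contains_point_v2(point, shape.shape2)
    if isinstance(shape, Translated):
        # The inner shape may be a UnionShape, so the V2 function must be used here.
        moved = Point(point.x - shape.offset.x, point.y - shape.offset.y)
        return contains_point_v2(moved, shape.shape)
    return contains_point(point, shape)


nested = Translated(UnionShape(Square(1.0), Square(2.0)), Point(5.0, 5.0))
inside = contains_point_v2(Point(6.5, 6.5), nested)
```

Calling contains_point on nested raises NotImplementedError as soon as the recursion reaches the union, which is the runtime failure mentioned above.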

Object oriented with the interpreter pattern

Using the interpreter pattern means that each tool is a method on the data type. This pattern is usually explained with grammars or expressions, but it applies more generally: whenever we have a family of types that expose the same behavior through a method, we are using the interpreter pattern.

The approach that is proposed is the following:

we start with a set of types all extending a base abstract class with a method representing the initial tool supported by the types.
we want to add a new shape. We define a new type that inherits the base abstract class and implement the method.
we want to add a new tool. We cannot add a new method to the existing types because we don't want to modify existing code. Instead, we define new types that extend the original one implementing the new tool.

Unfortunately existing clients of our code need to change the types they are using in order to leverage the new tool.

Remarks on the interpreter pattern and the provided implementation

In the original GoF definition and in the code samples provided by the authors in the article the base type is an abstract class but there are some unclear points:

the base abstract class declares the shrink method, but all the initial shapes implement containsPt; the authors probably wanted the base abstract class to declare containsPt
the new union shape implements containsPt
the new types that extend the original shapes with the shrink method don't share a common base type, which is required to handle shapes polymorphically.

The last point is why I decided to use interfaces for the C# code. Imagine we define a new base abstract class for the shrinkable shapes: to allow code reuse, we would then have to inherit from multiple base types, for example:

Interpreter Pattern - Abstract class based inheritance problem

abstract class Shape {...}
abstract class ShrinkableShape {...}
class Square : Shape {...}
class ShrinkableSquare : Square, ShrinkableShape // Impossible!

As noted in the code listing, and as probably every reader knows, we cannot inherit from multiple classes; however, we can implement multiple interfaces.

Fig 4,5,6 from the article with some comments

So, omitting most of the code, a shrinkable shape using the interpreter pattern in C# with interfaces instead of abstract base classes would look like this:

Interpreter Pattern - Shrinkable Shape Implementation

public interface IShrinkableShape : IShape
{ 
    IShrinkableShape Shrink(double num);
}

public record ShrinkableSquare(double Length) : Square(Length), IShrinkableShape
{ 
    public IShrinkableShape Shrink(double num) => new ShrinkableSquare(Length / num);
}

Object oriented with the visitor pattern

The visitor pattern is very much the same approach as the functional one. Adding a tool is the easy part because it only entails defining a new visitor type. The following is how we would approach it in C#.

Visitor Pattern - Core Implementation

public interface IShape
{
    T Process<T>(IShapeProcessor<T> processor);
}

public interface IShapeProcessor<T>
{
    T ForSquare(Square square);
    T ForCircle(Circle circle);
    T ForTranslatedShape(TranslatedShape translatedShape);
}

public sealed record Square(double Length) : IShape
{
    public T Process<T>(IShapeProcessor<T> processor) => processor.ForSquare(this);
}

// more shapes

public class ContainsPoint(Point point) : IShapeProcessor<bool>
{
    public bool ForSquare(Square square) =>
        point.X >= 0 && point.X <= square.Length &&
        point.Y >= 0 && point.Y <= square.Length;

    public bool ForCircle(Circle circle) =>
        point.X * point.X + point.Y * point.Y <= circle.Radius * circle.Radius;

    public bool ForTranslatedShape(TranslatedShape translatedShape) =>
        translatedShape.Shape.Process(new ContainsPoint(
            new Point(point.X - translatedShape.Point.X, point.Y - translatedShape.Point.Y)));
}

public sealed class Shrink(double num) : IShapeProcessor<IShape>
{
    // omitted
}

Adding a new shape without modifying existing code is more challenging. As for the interpreter pattern we proceed by adding new code: the new shape type, a new visitor interface that inherits from the existing one and is able to process also the new shape and, lastly, the implementation of our tools.

Visitor Pattern - Adding Union Shape Support

public interface IUnionShapeProcessor<T> : IShapeProcessor<T>
{
    T ForUnionShape(UnionShape unionShape);
}

public sealed record UnionShape(IShape Shape1, IShape Shape2) : IShape
{
    public T Process<T>(IShapeProcessor<T> processor)
    {
        if (processor is IUnionShapeProcessor<T> unionProcessor)
        {
            return unionProcessor.ForUnionShape(this);
        }

        throw new NotSupportedException($"Processor of type {processor.GetType().Name} does not support UnionShape");
    }
}

public class UnionContainsPoint(Point point) : ContainsPoint(point), IUnionShapeProcessor<bool>
{
    public bool ForUnionShape(UnionShape unionShape) =>
        unionShape.Shape1.Process(this) || unionShape.Shape2.Process(this);
}

The obvious difference is that our original implementation is type safe, while the new one relies on runtime checks to validate that the visitor instance is able to handle the new shape. However, as the authors point out, there is a less obvious, critical flaw: this implementation does not work for recursive types. The issue is with the recursive type TranslatedShape: the UnionContainsPoint visitor reuses the base implementation of ContainsPoint. This means that it is executing the following code:

Visitor Pattern - Recursive Type Limitation

public bool ForTranslatedShape(TranslatedShape translatedShape) =>
    translatedShape.Shape.Process(new ContainsPoint(
        new Point(point.X - translatedShape.Point.X, point.Y - translatedShape.Point.Y)));

This calls Process on the inner shape of the original object and passes it a new instance of ContainsPoint (and not UnionContainsPoint!), so we lose support for the union shape. A unit test can confirm the expected behavior:

Visitor Pattern - Test for Nested Translated Shape

[Fact]
public void TestNestedTranslatedShapes()
{
    // Arrange
    var circle = new Circle(10);
    var square = new Square(10);
    var t1 = new UnionShape(square, circle);
    var t2 = new TranslatedShape(t1, new Point(5, 5));

    // Act & Assert
    Assert.Throws<NotSupportedException>(() => t2.Process(new UnionContainsPoint(new Point(0, 0))));
}

The power of the visitor pattern comes from the fact that with a single call t2.Process(...) we are calling 2 methods on 2 different objects, obtaining double dispatch at runtime. However, when the ContainsPoint visitor creates a new instance of itself and that behavior is inherited by new visitor types, we break the double dispatch: ForTranslatedShape passes a plain ContainsPoint instance to Process, not the extended visitor.

Object oriented with the extensible visitor pattern

The key part to understand from the visitor pattern approach is that a tool implementation, a visitor, will break if it creates new instances of itself or of other tools. Today, at least in the C# world, we are quite used to not creating instances of types directly and to relying on dependency injection to get our instances and to plug in different ones when needed.

A visitor that directly creates an instance of its own type, or of another one, is clearly coupling itself to a specific implementation. The solution is to abstract the creation away so that we can use different instances, with the same interfaces, when we need to update a visitor to handle a new type. The following code shows how to implement it using a virtual method on the processor.

Visitor with virtual method to create instances of itself

public class ContainsPoint(Point point) : IShapeProcessor<bool>
{
    protected virtual ContainsPoint MakeContainsPoint(Point p) => new(p);

    public bool ForSquare(Square square) =>
        point.X >= 0 && point.X <= square.Length &&
        point.Y >= 0 && point.Y <= square.Length;

    public bool ForCircle(Circle circle) =>
        point.X * point.X + point.Y * point.Y <= circle.Radius * circle.Radius;

    public bool ForTranslatedShape(TranslatedShape translatedShape) =>
        translatedShape.Shape.Process(MakeContainsPoint(
            new Point(point.X - translatedShape.Point.X, point.Y - translatedShape.Point.Y)));
}

public class UnionContainsPoint(Point point) : ContainsPoint(point), IUnionShapeProcessor<bool>
{
    protected override ContainsPoint MakeContainsPoint(Point p) 
        => new UnionContainsPoint(p);

    public bool ForUnionShape(UnionShape unionShape) =>
        unionShape.Shape1.Process(this) || unionShape.Shape2.Process(this);
}

By overriding the MakeContainsPoint virtual method, the UnionContainsPoint type updates the behavior of the base class so that it recognizes the new shape. This is very similar to the ContainsPointV2 function calling itself recursively.
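For readers who want to experiment with the factory-method idea outside C#, here is a compact Python sketch of the same structure. All names are hypothetical translations of the C# listing (make_contains_point plays the role of MakeContainsPoint), and translations are stored as plain dx/dy offsets to keep the example short.

```python
class Square:
    def __init__(self, length):
        self.length = length

    def process(self, visitor):
        return visitor.for_square(self)


class Translated:
    def __init__(self, shape, dx, dy):
        self.shape, self.dx, self.dy = shape, dx, dy

    def process(self, visitor):
        return visitor.for_translated(self)


class UnionShape:
    def __init__(self, shape1, shape2):
        self.shape1, self.shape2 = shape1, shape2

    def process(self, visitor):
        # Runtime check, like the `is IUnionShapeProcessor<T>` test in C#.
        handler = getattr(visitor, "for_union", None)
        if handler is None:
            raise NotImplementedError(f"{type(visitor).__name__} does not support UnionShape")
        return handler(self)


class ContainsPoint:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def make_contains_point(self, x, y):
        # Factory method: subclasses override it to keep their own type.
        return ContainsPoint(x, y)

    def for_square(self, square):
        return 0 <= self.x <= square.length and 0 <= self.y <= square.length

    def for_translated(self, translated):
        inner = self.make_contains_point(self.x - translated.dx, self.y - translated.dy)
        return translated.shape.process(inner)


class UnionContainsPoint(ContainsPoint):
    def make_contains_point(self, x, y):
        return UnionContainsPoint(x, y)

    def for_union(self, union):
        return union.shape1.process(self) or union.shape2.process(self)


nested = Translated(UnionShape(Square(1.0), Square(2.0)), 5.0, 5.0)
inside = nested.process(UnionContainsPoint(6.5, 6.5))
outside = nested.process(UnionContainsPoint(8.0, 8.0))
```

If the override of make_contains_point is removed, for_translated hands a plain ContainsPoint to the union and the NotImplementedError comes back, which is the failure shown by the earlier unit test.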

Sutton & Barto Gridworld example in C#

davide lettieri — Tue, 06 Jan 2026 14:22:00 +0000

Lately, I've been exploring various examples from Sutton and Barto's "Reinforcement Learning: An Introduction" book using C# and I already shared a few of them on this blog.

Today I'll be focusing on the gridworld example from chapter 3 of the book. The code is available in the existing repo as a new project. Gridworld is a simple example used to illustrate the Bellman equations and iterative policy evaluation. An excerpt from the book describes the environment:

The cells of the grid correspond to the states of the environment. At
each cell, four actions are possible: north, south, east, and west, which deterministically
cause the agent to move one cell in the respective direction on the grid. Actions that
would take the agent off the grid leave its location unchanged, but also result in a reward
of -1. Other actions result in a reward of 0, except those that move the agent out of the
special states A and B. From state A, all four actions yield a reward of +10 and take the
agent to A'. From state B, all actions yield a reward of +5 and take the agent to B'.

— Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed., Chapter 3.

The value function for each state is updated using the Bellman expectation equation for policy evaluation:

v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma v_{\pi}(s')], \quad \forall s \in S

The components of the equation are:

$v_{\pi}(s)$: the value of state $s$ under policy $\pi$; this is what we want to compute.
$\pi(a|s)$: the probability of taking action $a$ in state $s$. This is called the policy.
$p(s',r|s,a)$: the probability of transitioning to state $s'$ and receiving reward $r$ after taking action $a$ in state $s$.
$\gamma$: the discount rate, which determines the importance of future rewards and is a value between 0 and 1. In our case it is set to 0.9.

Now the example proceeds by giving us the policy: the agent selects each action with equal probability, $\pi(a|s) = \frac{1}{4}$, so we can simplify the equation:

v_{\pi}(s) = \frac{1}{4} \sum_{a} \sum_{s',r} p(s',r|s,a) [r + \gamma v_{\pi}(s')], \quad \forall s \in S

Because the environment is deterministic, for each state-action pair there is exactly one next state $s'$ and reward (probability 1). Therefore the update simplifies to:

v_{\pi}(s) = \frac{1}{4} \sum_{a} [r + \gamma v_{\pi}(s')]

Using this formula we iteratively update the value function for each state until convergence up to a certain tolerance.
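As a cross-check of the simplified formula, the whole evaluation fits in a short Python sketch. It is not a port of the C# or Lisp code: the names step and evaluate are hypothetical, states are assumed to be numbered row by row (state = row * 5 + col, so A = 1, A' = 21, B = 3, B' = 13), and the stopping tolerance of 1e-6 is an arbitrary choice.

```python
GAMMA = 0.9
SIZE = 5
STATE_A, STATE_A_PRIME = 1, 21
STATE_B, STATE_B_PRIME = 3, 13
# (row delta, column delta) for north, south, east, west.
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]


def step(state, move):
    """Return (next_state, reward) for a deterministic move."""
    if state == STATE_A:
        return STATE_A_PRIME, 10.0
    if state == STATE_B:
        return STATE_B_PRIME, 5.0
    row, col = divmod(state, SIZE)
    new_row, new_col = row + move[0], col + move[1]
    if not (0 <= new_row < SIZE and 0 <= new_col < SIZE):
        return state, -1.0  # off-grid: stay in place and pay -1
    return new_row * SIZE + new_col, 0.0


def evaluate(tolerance=1e-6):
    """Iterative policy evaluation for the equiprobable random policy."""
    values = [0.0] * (SIZE * SIZE)
    while True:
        updated = []
        for state in range(SIZE * SIZE):
            backups = [reward + GAMMA * values[nxt]
                       for nxt, reward in (step(state, move) for move in MOVES)]
            updated.append(sum(backups) / len(MOVES))
        delta = max(abs(new - old) for new, old in zip(updated, values))
        values = updated
        if delta < tolerance:
            return values


values = evaluate()
for row in range(SIZE):
    print(" ".join(f"{values[row * SIZE + col]:5.1f}" for col in range(SIZE)))
```

The printed grid can be compared with Figure 3.2 of the book, where state A has a value of about 8.8 and state B of about 5.3.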

The implementation

Regarding the implementation, I mostly followed the sample Lisp code provided by the authors at http://incompleteideas.net/book/code/gridworld5x5.lisp. However, I used clearer variable names, an enum for the actions, improved next-state and full-backup functions, and other minor improvements. The original full-backup also computes the next state for a subset of cases; I decided to handle all cases in my NextState method and use FullBackup only to compute the value of a given state-action pair.

Some of the Lisp code's complexity — which I preserved in the C# port — is the mapping between state indices (0–24) and grid coordinates (row and column). It's not clear why the original maps states to indices this way; I kept the mapping for fidelity to the original implementation.

As a side note, I executed the lisp code to validate the results and the methods I ported to C# using SBCL and apparently a function was missing so I added it and provided an updated lisp version in my repo here.

I decided to use a GridWorld class to hold the global state and the required functions.

Looking at the simplified Bellman equation we can see that we need to compute $s'$ given a starting state and an action. This is implemented in the NextState method of the GridWorld class:

int NextState(int state, Action action)
{
    if (state == _specialStateA)
    {
        return _specialStateAPrime;
    }

    if (state == _specialStateB)
    {
        return _specialStateBPrime;
    }

    // OffGrid returns true if the action would take the agent off the grid
    if (OffGrid(state, action))
    {
        return state;
    }

    var (row, col) = CoordinatesFromState(state);
    return action switch
    {
        Action.East => StateFromCoordinates(row, col + 1),
        Action.South => StateFromCoordinates(row + 1, col),
        Action.West => StateFromCoordinates(row, col - 1),
        Action.North => StateFromCoordinates(row - 1, col),
        _ => throw new ArgumentOutOfRangeException(nameof(action), "Invalid action"),
    };
}

The sum formula is adding one element for each action, the single element for a given action and state is computed in the FullBackup method:

double FullBackup(int state, Action a)
{
    var nextState = NextState(state, a);
    double reward = state switch
    {
        _ when state == _specialStateA => 10,
        _ when state == _specialStateB => 5,
        // implicitly handles off-grid moves
        _ when nextState == state => -1,
        _ => 0
    };

    return reward + (_gamma * _v[nextState]);
}

The implementation of the value function follows. Since there are 4 actions, Average() divides the sum by 4, which matches the $\frac{1}{4}$ factor in the formula:

private double ValueFunction(int state)
    => Enum.GetValues<Action>()
        .Select(a => FullBackup(state, a))
        .Average();

The rest of the implementation is almost a 1-1 mapping from the lisp code to C#. The value function is updated in a loop until convergence.

How to run the sample

Clone and run the sutton-barto-reinforcement-learning repository:
- git clone https://github.com/davidelettieri/sutton-barto-reinforcement-learning.git
- cd sutton-barto-reinforcement-learning/gridworld
- dotnet run -c Release
The app prints the value function after convergence; compare it with the book’s Figure 3.2.

Grid diagram and state mapping

To make the indexing clear, here's the 5x5 grid used in the example (rows increase downward, columns increase to the right). Special states A and B and their primes A' and B' are shown in the grid where applicable.

	Col 0	Col 1	Col 2	Col 3	Col 4
Row 0	0	1 (A)	2	3 (B)	4
Row 1	5	6	7	8	9
Row 2	10	11	12	13 (B')	14
Row 3	15	16	17	18	19
Row 4	20	21 (A')	22	23	24

Multi armed bandit exercise 2.5 with C#

davide lettieri — Tue, 06 Jan 2026 14:17:45 +0000

Recently I tried to code the 10 armed testbed example from chapter 2 of Sutton and Barto's book Reinforcement Learning: An Introduction.

The chapter continues introducing new theory elements and strategies to improve the approach shown in the 10 armed example. In particular, one of the points is about non-stationary problems.

The 10 armed testbed was a stationary problem: the probability distributions of the different actions don't change over time. If you remember the sample, at the beginning of the round we computed 10 random values, which are then used as the means of the normal distributions from which we pick the rewards at each step. The constant part is that these normal distributions don't change from one step to another; they stay the same for the whole round.

The focus of the exercise is to understand how the estimated reward computation impacts the performance of the $\epsilon$-greedy strategy. In the 10 armed testbed, the estimated reward was computed by averaging the rewards obtained from each action when selected. Note that this approach gives every reward the same weight. In a non-stationary problem, where probability distributions change over time, we would like to give more weight to recent rewards because they better represent the distribution the reward is currently generated from.

The text of the exercise is

Design and conduct an experiment to demonstrate the
difficulties that sample-average methods have for nonstationary problems. Use a modified
version of the 10-armed testbed in which all the $q_*(a)$ start out equal and then take
independent random walks (say by adding a normally distributed increment with mean 0
and standard deviation 0.01 to all the $q_*(a)$ on each step). Prepare plots like Figure 2.2
for an action-value method using sample averages, incrementally computed, and another
action-value method using a constant step-size parameter, $\alpha = 0.1$. Use $\epsilon = 0.1$ and
longer runs, say of 10,000 steps.

Figure 2.2 refers to the average reward graph and the best arm selection rate graph, the same graphs I produced in the previous post. The $\epsilon = 0.1$ refers to the $\epsilon$-greedy strategy to be used, both with sample averages and with the constant step-size parameter.

The reward estimation formula

The sample average estimate can be naively computed by keeping track of all the rewards for an action and computing the average. However, as the book describes clearly, we can compute the average using only the current reward, the previous estimate and the number of times the action has been selected. This approach has a computational advantage: we don't need to store the rewards for all steps, and we need just a few operations to compute the new estimate.

If we denote the $i$th estimate as $Q_i$ and the $i$th reward as $R_i$, then the formula is:

Q_{n+1} = Q_n + \frac{1}{n}[R_n-Q_n]

Once we have this we can also see how the estimate is computed with a constant step-size parameter $\alpha$:

Q_{n+1} = Q_n + \alpha[R_n-Q_n]

The two formulas are almost the same: the difference $R_n-Q_n$ is multiplied by $1/n$ in one case and by a constant value $\alpha$ in the other.
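Unrolling the constant step-size update makes it visible why it favors recent rewards. The derivation below is the standard expansion (it only assumes an initial estimate $Q_1$ and $0 < \alpha \le 1$) and is consistent with the update formula above:

```latex
\begin{aligned}
Q_{n+1} &= Q_n + \alpha[R_n - Q_n] \\
        &= \alpha R_n + (1-\alpha) Q_n \\
        &= \alpha R_n + (1-\alpha)\left[\alpha R_{n-1} + (1-\alpha) Q_{n-1}\right] \\
        &= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1} \\
        &= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i
\end{aligned}
```

The weight of reward $R_i$ is $\alpha(1-\alpha)^{n-i}$, which shrinks geometrically as the reward gets older, whereas the sample average gives every reward the same weight $1/n$.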

The implementation

If you checked the previous example I implemented you'll see that the code is mostly the same. I have to support:

non-stationary rewards distributions
different strategies to compute the estimate reward

I don't want to make the previous example too complex by adding abstractions to plug in different implementations just to avoid duplicating code. For both examples I want an easy-to-follow, low-abstraction implementation that I can understand a year from now, when all the context I have is lost. However, I need to be able to plug in two different strategies for the estimate computation, so in that case only I'll use some kind of abstraction:

delegate double UpdateEstimatedReward(double currentEstimatedReward, double reward, int armSelectedCount);

A delegate captures the function signature of the update formula. The armSelectedCount parameter (I couldn't think of a better name) corresponds to the $n$ in the formulas above.

And the two formulas translate to

UpdateEstimatedReward sampleAverage = (currentEstimatedReward, reward, armSelectedCount) =>
    currentEstimatedReward + (reward - currentEstimatedReward) / armSelectedCount;

UpdateEstimatedReward constantStepSize(double alpha)
    => (currentEstimatedReward, reward, _) =>
        currentEstimatedReward + alpha * (reward - currentEstimatedReward);

The sampleAverage is just an instance of the delegate while the constantStepSize is a function producing an instance of the delegate. That's because we have the $α\alpha$ parameter that needs to be fixed in order to have a concrete update formula. Note also that the third parameter, armSelectedCount, is ignored in the constantStepSize definition.

Regarding the non-stationary part, the exercise text says that we start with equal $q_*(a)$ and then all of them take independent random walks. In the ten armed testbed example, the reward for each action was drawn from a distribution; should it be the same here? Do we have a distribution with a varying mean, or should we just pick $q_*(a)$ as the reward at each step, given that it is already changing? To be honest, I do not know what the expected approach is here; my decision was to go with distributions with a changing mean value.

In order to correctly implement the best arm selection rate, we have to notice that the best arm is not defined once at the beginning: at each step we could end up with a different best arm, so we need to compute again which one is best.
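The choice described above (normal reward distributions whose means drift) can be sketched as follows. This is an illustration under stated assumptions, not the code from the repository: the RandomWalkArms class and its members are hypothetical names, the Box-Muller transform is just one way to draw normal samples, and the walk standard deviation of 0.01 is the value I recall from the book's exercise text, so treat it as an assumption.

```csharp
using System;
using System.Linq;

public sealed class RandomWalkArms
{
    private readonly Random _random;
    private readonly double _walkStdDev;

    // Current true mean q*(a) of each arm; all start equal (0 here).
    public double[] Means { get; }

    public RandomWalkArms(int armCount, double walkStdDev, Random random)
    {
        Means = new double[armCount];
        _walkStdDev = walkStdDev;
        _random = random;
    }

    // Box-Muller transform: one normally distributed sample.
    private double NextGaussian(double mean, double stdDev)
    {
        var u1 = 1.0 - _random.NextDouble(); // in (0, 1], so Log never sees 0
        var u2 = _random.NextDouble();
        var standardNormal = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
        return mean + stdDev * standardNormal;
    }

    // Reward for an arm: a sample around its current mean, with variance 1.
    public double Pull(int arm) => NextGaussian(Means[arm], 1.0);

    // Each mean takes an independent random-walk increment.
    public void Walk()
    {
        for (var arm = 0; arm < Means.Length; arm++)
        {
            Means[arm] += NextGaussian(0.0, _walkStdDev);
        }
    }

    // The best arm must be recomputed at each step because the means keep moving.
    public int BestArm() => Array.IndexOf(Means, Means.Max());

    public static void Main()
    {
        var arms = new RandomWalkArms(10, 0.01, new Random(42));
        var totalReward = 0.0;
        for (var step = 0; step < 1000; step++)
        {
            var best = arms.BestArm();      // best arm for this step only
            totalReward += arms.Pull(best); // a real agent would pick via its strategy
            arms.Walk();                    // means drift before the next step
        }
        Console.WriteLine(totalReward / 1000);
    }
}
```

Calling BestArm() inside the loop, rather than once before it, is the point made above about the best arm selection rate.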

The average reward per step

The best arm selection rate per step

Ten armed testbed for the Bandit problem with C#

davide lettieri — Sun, 27 Apr 2025 00:00:00 +0000

I'm continuing my attempt to reproduce examples from Reinforcement Learning: An Introduction book using C#.

In a previous post I reproduced the tic-tac-toe example with some improvements and clarification with respect to the original text. I think it's worth taking a look at it.

Today I'm reproducing the ten armed testbed for the Bandit problem, in particular I want to reproduce the two graphs showing the average reward improvements and the selection rate of the best arm.

The problem, as stated in the book, is the following:

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

10 armed testbed

One key point to understand in order to follow the ten armed (k=10) testbed is that the total reward is never computed nor analyzed in the example itself. The underlying intuition is that we want to find the best action available and select that action as much as possible. Ideally every time. By being able to select the best action, we will automatically optimize the total reward.

Having chosen k=10, we have to define the probability distribution for each action. We start from a normal distribution with mean=0 and variance=1 and pick 10 samples from it; each sample becomes the mean of a normal distribution with variance=1. We end up with an array of 10 normal distributions: each time we select action i, we pick the i-th distribution and draw a sample from it.

Given that we don't know the probability distribution assigned to each action, we have to base our strategies on estimated values. Each time we select an action, we get an actual reward value and we can improve our estimate for that action. We iterate this process trying to improve our estimates. We start with an estimate of zero for all actions. Each selection followed by an update of the estimates is called a step. A round comprises multiple steps. In the example provided in the book, there are 2000 rounds of 1000 steps each.

To produce the graphs presented in the book we need to compute:

for each step, the average reward over the different rounds. For each round, we keep track of the reward of step i. At the end we sum all rewards for that step and divide by the number of rounds.
for each step, the best arm selection rate. For each step of each round, we keep track of whether we selected the best action. At the end we sum all values for that step and divide by the number of rounds. Please remember that we know which arm is best, because we are creating the distributions for each arm. This information cannot be used during the learning process but we can use it to evaluate performance on the testbed.
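Both quantities are the same computation applied to different per-step values, which a small sketch can make explicit. The RoundStatistics class and the AveragePerStep method are hypothetical names for this illustration; the assumption is that each round stores one value per step, either the reward or 1.0/0.0 for "best arm selected" / "not selected".

```csharp
using System;

public static class RoundStatistics
{
    // valuesByRound[r][s] is the value recorded at step s of round r.
    public static double[] AveragePerStep(double[][] valuesByRound)
    {
        var stepCount = valuesByRound[0].Length;
        var averages = new double[stepCount];
        foreach (var round in valuesByRound)
        {
            for (var step = 0; step < stepCount; step++)
            {
                averages[step] += round[step]; // sum over rounds, step by step
            }
        }
        for (var step = 0; step < stepCount; step++)
        {
            averages[step] /= valuesByRound.Length; // divide by the number of rounds
        }
        return averages;
    }

    public static void Main()
    {
        // Two rounds of three steps each.
        double[][] rewards =
        {
            new[] { 1.0, 2.0, 3.0 },
            new[] { 3.0, 4.0, 5.0 },
        };
        // Per-step averages are 2, 3 and 4.
        Console.WriteLine(string.Join(" ", AveragePerStep(rewards)));
    }
}
```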

One point that remains to be defined is how we select the action. It is evident that we would like to try all of them multiple times so that we can build a reasonable estimate of the value of each action. For example, we could select each action in a round-robin fashion and repeat until we complete all steps. This would give us a uniform approach for updating the estimated values; however, the best action would be selected only 1/k of the time, in our case 10% of the time. That is not very good for the overall performance of the round.

The book suggests three different strategies:

greedy: we always select the action with the best estimate
ϵ-greedy: in a fraction ϵ of the cases we select a random action, in the remaining ones we select the action with the best estimate:
- 10% of the selections is random (ϵ=0.1)
- 1% of the selections is random (ϵ=0.01)

Ties are resolved by picking any of the actions with the same estimated reward. We already noted that on each step we perform a selection and an update of the estimates. This is true regardless of the strategy. In the tic-tac-toe example we learned only when the action was selected based on the value table and not when it was selected randomly. That is not the case here, where learning from random selections is the very basis for actual improvement.
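The three strategies differ only in the value of ϵ, so a single selection function can cover all of them. The following is a minimal sketch, not the repository's code: ActionSelection and Select are hypothetical names, and breaking ties uniformly at random is one possible reading of "picking any of the actions".

```csharp
using System;
using System.Linq;

public static class ActionSelection
{
    // epsilon = 0 gives the pure greedy strategy.
    public static int Select(double[] estimates, double epsilon, Random random)
    {
        if (random.NextDouble() < epsilon)
        {
            // Exploration: any action, uniformly at random.
            return random.Next(estimates.Length);
        }

        // Exploitation: best estimate, ties broken uniformly at random.
        var best = estimates.Max();
        var candidates = Enumerable.Range(0, estimates.Length)
            .Where(i => estimates[i] == best)
            .ToArray();
        return candidates[random.Next(candidates.Length)];
    }

    public static void Main()
    {
        var random = new Random();
        double[] estimates = { 0.2, 1.5, 1.5, -0.3 };
        // Most of the time this prints 1 or 2, occasionally any index from 0 to 3.
        Console.WriteLine(Select(estimates, 0.1, random));
    }
}
```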

Greedy strategy

Let's consider for a moment how the greedy strategy behaves in some cases.

We are at step 0, all estimates are 0 so we need to pick a random action between 0-9. Let's say we pick 0:

if the actual reward is negative, the estimate for 0 will become negative and we won't select 0 in the next step
if the actual reward is positive, the estimate for 0 will be positive and we will select 0 in the next step. Actually we will continue to select 0 until its estimate becomes 0 or less. If it is exactly 0, depending on how the tie is resolved we might continue with 0.

From this it's clear that if 0 is not optimal we might end up stuck with 0 for several steps until we are able to evaluate another action.

An observation that we could make is that we could force the greedy strategy to test each action at least once by starting with a default estimate of double.MaxValue. We know that the reward for all actions will be lower than double.MaxValue, so in the first 10 steps all actions will be tested; afterwards we will continue with the action that performed best in those first 10 steps.

ϵ-greedy strategy

With the ϵ-greedy strategy we are never sure if we are picking the best action, according to our current knowledge, or a random one. However, the random one will give us opportunities to improve our estimates.

An improvement over this strategy could be to make ϵ smaller as we progress further into our round. The more steps we perform, the better our estimates become and the less we need to explore.

Some additional comments

The book notes (bold is mine):

The advantage of ϵ-greedy over greedy methods depends on the task. For example, suppose the reward variance had been larger, say 10 instead of 1. With noisier rewards it takes more exploration to find the optimal action, and ϵ-greedy methods should fare even better relative to the greedy method. On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once. In this case the greedy method might actually perform best because it would soon find the optimal action and then never explore. But even in the deterministic case there is a large advantage to exploring if we weaken some of the other assumptions.

I want to comment on the bold section, because I don't fully agree with what is said there. Let's consider for a moment a ten armed testbed whose actions have stationary rewards [0.05 0.1 0.2 0.3 0.4 0.5 0.7 0.8 0.9] and let's think about how the greedy strategy performs.

Given that:

estimate won't change once updated
we start with all estimates equal to 0
we always select the best

the first action selected will change its estimate to a value greater than 0 and it will always be selected without any further exploration. This is a rather artificial example; however, what remains true is that the greedy strategy will always select an action with positive reward once it has found one.

So when will the greedy strategy indeed find the best arm in the stationary case? One case is when the best arm is the only one with a positive reward. Another one is when all arms have negative rewards, because it will pick all of them one by one in the first steps. Since in case of ties there is some randomization at play, the greedy strategy can select the best arm in other cases too.

A different result can be obtained if we change the initial estimate of the actions. Again, selecting double.MaxValue as the default instead of 0 would force the greedy strategy to select every arm at least once, after which it knows exactly which one is the best and continues with that.
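The zero-variance argument above can be checked with a few lines of code. This is a sketch under explicit assumptions: DeterministicGreedy and LastChoice are hypothetical names, ties are resolved to the lowest index (so the run is reproducible, unlike the random tie-breaking discussed earlier), and the reward list is the one given in the text.

```csharp
using System;
using System.Linq;

public static class DeterministicGreedy
{
    // Returns the arm the greedy strategy is selecting at the last step.
    public static int LastChoice(double[] rewards, double initialEstimate, int steps)
    {
        var estimates = Enumerable.Repeat(initialEstimate, rewards.Length).ToArray();
        var chosen = -1;
        for (var step = 0; step < steps; step++)
        {
            // Greedy choice; IndexOf resolves ties to the lowest index.
            chosen = Array.IndexOf(estimates, estimates.Max());
            // Zero variance: one observation is the true value, so assign it directly.
            estimates[chosen] = rewards[chosen];
        }
        return chosen;
    }

    public static void Main()
    {
        double[] rewards = { 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9 };
        Console.WriteLine(LastChoice(rewards, 0.0, 100));             // 0: stuck on the first arm tried
        Console.WriteLine(LastChoice(rewards, double.MaxValue, 100)); // 8: the arm with reward 0.9
    }
}
```

The estimate is assigned directly instead of going through the incremental formula because, in floating point, `double.MaxValue + (reward - double.MaxValue)` evaluates to 0 rather than to the reward: the subtraction rounds back to `-double.MaxValue`. A large but finite optimistic value avoids that rounding problem if the incremental update is kept.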

Code

I implemented all of this in the https://github.com/davidelettieri/sutton-barto-reinforcement-learning repo. I think I left enough comments to allow an easy read of the code; if that's not the case please contact me, I'll be happy to add more or explain better if needed.

The graphs are implemented using https://scottplot.net/. You might need to install additional packages, so please follow the documentation for more details. On my Fedora 42, at the time of writing, the code works as it is and produces the following graphs:

The average reward per step, per strategy

The best arm selection rate per step, per strategy

Initializing default estimates to double.MaxValue

NOTE: When I wrote all of this, I hadn't read section 2.6 yet and I thought that the dependence on the default estimate should have been discussed already. The authors of the book disagree and discuss it right afterwards, in 2.6 Optimistic Initial Values. I, of course, recommend reading the book, which is available for free here.

As noted above a couple of times, setting the default estimate to double.MaxValue forces exploration in the initial steps for all strategies. The average reward improves for all strategies, except possibly during the initial exploration, which is quite visible on the graph as a brief (10 steps), almost horizontal segment on the three lines. It is also noticeable that the three strategies perform almost the same on the average reward. The graph for the best arm selection rate is rather different from the previous case, again showing more similarities between the three strategies.

The average reward per step, per strategy using double.MaxValue as default estimate

The best arm selection rate per step, per strategy using double.MaxValue as default estimate

Tic-tac-toe reinforcement learning with C#

davide lettieri — Sun, 16 Mar 2025 00:00:00 +0000

A couple of weeks ago I wanted to take a look at reinforcement learning and possibly work on a very simple sample in C#. In search for a book to learn some basics I found Reinforcement Learning: An Introduction suggested in multiple places. The book is available for free as a PDF on the linked website, so I thought it would be a good starting point.

The book offers, in the very first chapter, a tic-tac-toe example where an algorithm is described, albeit without much detail. I decided to try to implement a C# version of it. Plenty of implementations are available online and the authors offer a lisp version on their website, so there is a wide range of options to explore and evaluate.

I will report part of the text of the example here to comment on it and provide some additional details that I think would have helped me in understanding the example. In short, the objective of the exercise is to implement a player, QPlayer in my case, that plays the X symbol and decides what move to play based on a function that assigns a value to each state of the game. This value function will be improved iteratively by making the QPlayer play against another automated player. The state of the game is a possible configuration of X and O on the 3x3 grid. The value function is a function that assigns a double value to each state of the game.

From the book:

Here is how the tic-tac-toe problem would be approached with a method making use of a value function. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state’s value, and the whole table is the learned value function. State A has higher value than state B, or is considered “better” than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are “filled up,” the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning. We play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see. While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning. To do this, we “back up” the value of the state after each greedy move to the state before the move [...]. More precisely, the current value of the earlier state is adjusted to be closer to the value of the later state. 
This can be done by moving the earlier state’s value a fraction of the way toward the value of the later state. If we let s denote the state before the greedy move, and s' the state after the move, then the update to the estimated value of s, denoted V (s), can be written as V(s) = V(s) + α [V(s') − V(s)], where α is a small positive fraction called the step-size parameter, which influences the rate of learning. This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference, V(s') − V(s), between estimates at two different times.

Move selection and backup from the book

Blocks represent state of the game. Arrows represent moves taken by either player.

Comments on the example

Given the book's description of the algorithm, we need to build a value function V that can change at execution time, as our player learns more about the game. To do this we implement the value function as a dictionary lookup. We give initial values to all states as described by the text above:

value 1: winning position for learning player.
value 0: losing position for learning player.
value 0.5: all other positions.
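A dictionary-backed value function with the update rule from the book can be sketched in a few lines. This is an illustration, not the code from the repository: the ValueTable class, its members, and the encoding of a board as a 9-character string are all hypothetical choices made for the example.

```csharp
using System;
using System.Collections.Generic;

public sealed class ValueTable
{
    private readonly Dictionary<string, double> _values = new();

    // States not stored yet have the default value 0.5.
    public double Get(string state) => _values.TryGetValue(state, out var value) ? value : 0.5;

    public void Set(string state, double value) => _values[state] = value;

    // V(s) = V(s) + alpha * (V(s') - V(s))
    public void BackUp(string earlier, string later, double alpha)
    {
        var current = Get(earlier);
        _values[earlier] = current + alpha * (Get(later) - current);
    }

    public static void Main()
    {
        var table = new ValueTable();
        table.Set("XXXOO....", 1.0); // three Xs in the top row: X has won

        // Both states are positions right after a move by X.
        table.BackUp("XX.O.....", "XXXOO....", 0.1);

        // The earlier state moved a tenth of the way from 0.5 toward 1.
        Console.WriteLine(table.Get("XX.O....."));
    }
}
```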

The first observation to be made is that the value table depends on the player we are training: changing the player from X to O switches winning and losing positions, and consequently the values of those positions.

Second point is that, while we give each position a value following a general rule, the state space can be divided into two disjoint sets:

the set of states that can be updated and used by the algorithm (but will not be updated necessarily, there is a bit of randomization at play)
the set of states that will never be updated or used by the algorithm

Again, these two sets are fixed once we fix the player we want to train, in our case X. Let's think about the update rule V(s) = V(s) + α [V(s') − V(s)]: we have two states s and s', and in both states X has just moved. This means that the numbers of X and O on the 3x3 grid are not equal: X always moves first, so we have one X more than O. So we already understand that states with an equal number of X and O will never be updated. Also, when we evaluate the next positions to choose the best move to make, we evaluate only positions where X has just moved, so again we only use states where X has one more mark than O on the grid. This point is not really important, unless you want to observe the value function changing over time and you notice that some values are never updated.

While realizing this, I was also wondering: if we learn only when X moves, why do we have values for states where X loses? We will never reach a losing state for X from an X move, and a losing state for X is a grid with an equal number of X and O. Right? In my opinion that is not correct, so I amended the algorithm to learn from each non-random move of X and to learn, or "back up" as the text puts it, even when the X player has lost after a move from O. This change allows us to use the zero-valued states of the value function.

So after these considerations and my proposed change to the "back-up" algorithm, we know that the only states used in learning are the states where X has just moved or where X loses. All the other states won't impact the learning process. If we train the O player, the sets of course change accordingly.
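The counting argument can be turned into a small check, which is handy when inspecting the value table for entries that never change. As before, this is only a sketch: BoardCounts and IsAfterXMove are hypothetical names, and the board is assumed to be a 9-character string of 'X', 'O' and '.'.

```csharp
using System;
using System.Linq;

public static class BoardCounts
{
    // True when X has exactly one more mark than O, i.e. X has just moved.
    public static bool IsAfterXMove(string board)
    {
        var xs = board.Count(c => c == 'X');
        var os = board.Count(c => c == 'O');
        return xs == os + 1;
    }

    public static void Main()
    {
        Console.WriteLine(IsAfterXMove("X........")); // True: X opened the game
        Console.WriteLine(IsAfterXMove("X...O....")); // False: O has just replied
    }
}
```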

Proposed move selection and backup

Blocks represent states of the game. Arrows represent moves taken by either player. With respect to the original approach, we only add a final back-up if the ending position is a losing position for our player.

An additional observation is that during training the O player is implemented just like the X player, only with a higher probability of choosing a random move. According to the text the O player should play randomly; with my implementation this means passing 1 as the exploration rate in the QPlayer constructor when instantiating the O player. My code can be found on github.

Exercise 1.1 about the tic-tac-toe example

Last point is that I'm a bit confused about the first exercise of the Chapter:

Exercise 1.1: Self-Play Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself. What do you think would happen in this case? Would it learn a different way of playing?

I think there might be two different interpretations of this exercise:

We make the reinforcement learning player play against itself, interpreted as the same instance of the player. In this case we are saying that we are going to play both X and O on a value function built for X winning. Should we make the instance learn or back up values for both sides' moves? I feel like this doesn't make much sense.
We make two different instances of our reinforcement learning player, with appropriate value functions for X and O, and we make them play against each other. This is what I implemented: the QPlayer class accepts a parameter explorationRate that controls the randomization of the instance. With 1 it is fully random and won't learn anything; with 0 it always chooses the best move based on the value function (but makes no exploratory moves!). I played a bit with the randomization of the O player and I didn't notice much difference in the outcomes.
