A while back, I explored the possibility of simplifying 1 PyMC4's model specification API by manipulating the Python abstract syntax tree (AST) of the model code. The PyMC developers didn't end up pursuing those API changes any further, but not until I had the chance to learn a lot about Python ASTs.
Enough curious people have asked me about my experience tinkering with ASTs that I figure I'd write a short post about the details of my project, in the hope that someone else will find it useful.
You should read this blog post as a quick overview of my experience with Python ASTs, or an annotated list of links, and not a comprehensive tutorial on model specification APIs or Python ASTs. For a full paper trail of my adventures with Python ASTs, check out my notebooks on GitHub.
Originally, PyMC4's proposed model specification API looked something like this:
The main drawback to this API was that the
yield keyword was confusing. Many users don’t really understand Python generators, and those who do might only understand
yield as a drop-in replacement for
return (that is, they might understand what it means for a function to end in
yield foo, but would be uncomfortable with
bar = yield foo).
yield keyword introduces a leaky abstraction2: users don’t care about whether model is a function or a generator, and they shouldn't need to. More generally, users shouldn't have to know anything about how PyMC works in order to use it: ideally, the only thing users would need to think about would be their data and their model. Having to graft several
yield keywords into their code is a fairly big intrusion in that respect.
Finally, this model specification API is essentially moving the problem off of our plates and onto our users. The entire point of the PyMC project is to provide a friendly and easy-to-use interface for Bayesian modelling.
To enumerate the problem further, we wanted to:
- Hide the
yieldkeyword from the user-facing model specification API.
- Obtain the user-defined model as a generator.
The main difficulty with the first goal is that as soon as we remove
yield from the model function, it is no longer a generator. However, the PyMC inference engine needs the model as a generator, since this allows us to interrupt the control flow of the model at various points to do certain things:
- Manage random variable names.
- Perform sampling.
- Other arbitrary PyMC magic that I'm truthfully not familiar with.
In short, the user writes their model as a function, but we require the model as a generator.
I opine on why this problem is challenging a lot more here.
First, I wrote a
FunctionToGenerator does) is the recommended way of modifying ASTs. The functionality of
FunctionToGenerator is pretty well described by the docstring: the
visit_Assign method adds the
yield keyword to all assignments by wrapping the visited
Assign node within a
Yield node. The
visit_FunctionDef method removes the decorator and renames the function to
_pm_compiled_model_generator. All told, after the
NodeTransformer is done with the AST, we have one function,
_pm_compiled_model_generator, which is a modified version of the user-defined function.
This class isn't meant to be instantiated: rather, it's meant to be used as a Python decorator. Essentially, it "uncompiles" the function to get the Python source code of the function. This source code is then passed to the
parse_snippet3 function, which returns the AST for the function. We then modify this AST with the
FunctionToGenerator class that we defined above. Finally, we recompile this AST and execute it. Recall that executing this recompiled AST defines a new function called
_pm_compiled_model_generator. This new function, accessed via the
locals variable4, is then bound to the class's
self.model_generator, which explains the confusing-looking line 25.
Finally, the user facing API looks like this:
As you can see, the users need not write
yield while specifying their models, and the PyMC inference engine can now simply call the
model_generator method of
linear_regression to produce a generator called
_pm_compiled_model_generator, as desired. Success!
Again, PyMC4's model specification API will not be incorporating these changes: the PyMC developers have since decided that the
yield keyword is the most elegant (but not necessarily the easiest) way for users to specify statistical models. This post is just meant to summarize the lessons learnt while pursuing this line of inquiry.
Reading and parsing the AST is perfectly safe: that's basically just a form of code introspection, which is totally a valid thing to do! It's when you want to modify or even rewrite the AST that things start getting
janky dangerous (especially if you want to execute the modified AST instead of the written code, as I was trying to do!).
If you want to programmatically modify the AST (e.g. "insert a
yield keyword in front of every assignment of a TensorFlow Distribution", as in our case), stop and consider if you're attempting to modify the semantics of the written code, and if you're sure that that's a good idea (e.g. the
yield keywords in the code mean something, and remove those keywords changes the apparent semantics of the code).
I've only given a high-level overview of this project here, and a lot of the technical details were glossed over. If you're hungry for more, check out the following resources:
- Notebooks and more extensive documentation on this project are on GitHub. In particular, it might be helpful to peruse the links and references at the end of the READMEs.
- For those looking to programmatically inspect/modify Python ASTs the same way I did here, you might find this Twitter thread helpful.
- And for those wondering how PyMC4's model specification API ended up, some very smart people gave their feedback on this work on Twitter.
Or should I say, complicating? At any rate, changing! ↩