DEV Community

Petr Muller


pyff: Python Diff

This post was originally published on 2018-04-06 on my old blog. I do not want to lose the content, so I migrated it here. I worked on pyff during my 2018 sabbatical, which I ended sooner than expected after being hired by Red Hat. I would still like to revive this work sometime.


GitHub: petr-muller / pyff

Python Diff

The idea of a syntax- and semantics-aware diff tool has been in my head ever since we needed something similar for a project we were working on in Red Hat Lab together with the VeriFIT research group. We wanted to connect code differences (git commits or PRs) with test results and build a “riskiness classifier”. The rough idea was something like “whenever people change I/O code in method M of class C, test T tends to break”. We were missing an analyzer that would easily tell us, in a machine-readable format, what actually changed in the code, beyond the changed lines that a simple diff tool can give you. We somehow managed to build something ad hoc for C code differences and moved on, but ever since I have thought that a smart diff could be an interesting project.

Comparing abstract syntax trees

I decided to start in a simple way: take two versions of a Python file as input and work over their ASTs to detect differences. There is an ast module in the Python standard library that can parse Python code easily, but I remembered a talk on Pylint which described Astroid as an improved module with more functionality (built for use in Pylint). I wanted to use it but failed to find a current documentation link; for some reason I kept discovering www.astroid.org, which has been dead for some time (I discovered the current documentation later).

So I decided to go with the “vanilla” ast module for a while. I discovered the very helpful documentation for it, Green Tree Snakes - the missing Python AST docs, and from there the first steps were quite simple. I chose to drive the development by examples: I selected a git commit from a different project, looked at the diff and asked myself “what changed in that code?”, then implemented the code needed to detect it.
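To illustrate what working over the AST buys compared to a line diff, here is a minimal sketch using only the standard ast module. It is not pyff's actual code; the `same_ast` helper and the sample sources are made up for this illustration. It parses two versions of a file and compares their dumped trees, so edits that only touch comments or blank lines are not reported as differences.

```python
import ast

# Two hypothetical versions of the same file: NEW only adds a comment
# and a blank line, which a line-based diff would report as a change.
OLD = '''
def greet(name):
    return "Hello, " + name
'''

NEW = '''
def greet(name):
    # comments and blank lines do not appear in the AST

    return "Hello, " + name
'''


def same_ast(old_source: str, new_source: str) -> bool:
    """Return True if both sources parse to structurally equal ASTs."""
    # ast.dump() leaves out line and column attributes by default,
    # so layout-only edits produce identical dumps.
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


if __name__ == "__main__":
    print(same_ast(OLD, NEW))
```

Comparing whole dumps only answers "did anything change?"; saying what changed requires walking the two trees, which is where the per-entity detection comes in.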

I started with detecting added and removed imports, classes and top-level functions in the module, followed by detecting simple changes to these entities, such as added or removed methods and changed implementations. The entities are currently identified by name, which means renaming is not properly detected (it is reported as one class or method removed and another added). At the moment, the only supported output is a natural-language summary of the changes.
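Name-based detection of this kind can be sketched in a few lines. The following is an assumption-laden illustration, not pyff's implementation: `top_level_names`, `diff_names` and the sample sources are hypothetical. It collects the names of module-level imports, classes and functions in each version and compares them as sets, which also shows why a rename surfaces as one removal plus one addition.

```python
import ast


def top_level_names(source: str) -> dict:
    """Collect names of imports, classes and functions at module level."""
    names = {"imports": set(), "classes": set(), "functions": set()}
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names["imports"].update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as "from . import x" have module=None.
            module = node.module or ""
            names["imports"].update(f"{module}.{alias.name}" for alias in node.names)
        elif isinstance(node, ast.ClassDef):
            names["classes"].add(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names["functions"].add(node.name)
    return names


def diff_names(old_source: str, new_source: str) -> dict:
    """Report added and removed names per entity kind, identified by name only."""
    old, new = top_level_names(old_source), top_level_names(new_source)
    return {
        kind: {
            "added": sorted(new[kind] - old[kind]),
            "removed": sorted(old[kind] - new[kind]),
        }
        for kind in old
    }


# Hypothetical example: NEW adds an import and renames load() to read_file().
OLD = "import os\n\nclass Parser:\n    pass\n\ndef load(path):\n    return open(path).read()\n"
NEW = "import os\nimport json\n\nclass Parser:\n    pass\n\ndef read_file(path):\n    return open(path).read()\n"

if __name__ == "__main__":
    for kind, changes in diff_names(OLD, NEW).items():
        print(kind, changes)
```

Because `load` and `read_file` are matched purely by name, the rename shows up as `load` removed and `read_file` added, even though the body is unchanged.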

After I had this MVP version of pyff ready I went on to set up some necessary project infrastructure: README, tests and some helper code.

Further steps

I would like to implement a programmatic API and a machine-readable output format (probably JSON), then follow with detection of further change types. I will probably continue with the example-driven approach, but I would like to implement some “smart” detection soon: something like recognizing that the program was not semantically changed (for example, by a simple variable rename) and not reporting an implementation change in that case.
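One possible shape of such “smart” detection, sketched under strong simplifying assumptions: replace every assigned variable name with a placeholder numbered by first appearance, then compare the dumped ASTs. The `canonical_dump` and `implementation_change` helpers and the JSON keys are hypothetical, not an existing pyff API. The sketch treats the whole snippet as one scope and ignores nested functions, `global` statements and attribute names.

```python
import ast
import json


def canonical_dump(source: str) -> str:
    """Dump the AST with assigned variable names replaced by placeholders."""
    tree = ast.parse(source)
    # Only names that are assigned somewhere are treated as renameable;
    # parameters, globals and builtins such as `len` keep their names.
    assigned = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    placeholders = {}

    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            if node.id in assigned:
                # Placeholders are numbered in order of first appearance.
                node.id = placeholders.setdefault(node.id, f"_v{len(placeholders)}")
            return node

    return ast.dump(Renamer().visit(tree))


def implementation_change(name: str, old_source: str, new_source: str) -> dict:
    """Build a JSON-serializable record for one function (hypothetical format)."""
    changed = canonical_dump(old_source) != canonical_dump(new_source)
    return {"function": name, "implementation_changed": changed}


# Hypothetical inputs: RENAMED only renames local variables, CHANGED alters a value.
OLD = "def total(items):\n    result = 0\n    for item in items:\n        result += item\n    return result\n"
RENAMED = "def total(items):\n    acc = 0\n    for x in items:\n        acc += x\n    return acc\n"
CHANGED = "def total(items):\n    acc = 1\n    for x in items:\n        acc += x\n    return acc\n"

if __name__ == "__main__":
    print(json.dumps(implementation_change("total", OLD, RENAMED)))
    print(json.dumps(implementation_change("total", OLD, CHANGED)))
```

Parameters are deliberately left alone here, since renaming a parameter can break callers that pass it by keyword and so is not a purely cosmetic change.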
