I asked this as a question on StackOverflow and then answered it myself with my own implementation.
I have some text (bytes; actually gzipped in a file on disk) which can be parsed via ast.literal_eval.
(It consists of a list of dicts, where the dict keys are strings and the values are strings, ints or floats. But maybe this question could be generic for any string which can be parsed via ast.literal_eval.)
It is large: ~22MB uncompressed.
What is the fastest way to parse it?
Surely I can use ast.literal_eval, but this seems quite slow. Standard eval is slightly faster (interestingly, but probably as expected, depending on how well you know Python; see the implementation of ast.literal_eval) but still slow.
In comparison, when I serialize the same data as JSON and then load the JSON (json.loads), this is way faster (>10x). So this shows that, in principle, it should be possible to parse the Python literal just as fast.
Gunzip + read time: 0.15111494064331055
Size: 22035943
compile: 3.1023156170000004
parse: 3.3381092380000004
eval: 3.0252232049999996
ast.literal_eval: 3.765798232
json.loads: 0.2657175249999994
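For readers who want to reproduce the comparison on a small scale, here is a minimal sketch of such a benchmark. It is not the original benchmark script: the dummy data, the `candidates` table, and the mapping of the "compile" and "parse" labels to `compile()` and `ast.parse()` are my assumptions.

```python
import ast
import json
import timeit

# Hypothetical dummy data in the shape described above:
# a list of dicts with string keys and str/int/float values.
rows = [{"id": i, "name": f"item{i}", "score": i * 0.5} for i in range(2000)]
py_text = repr(rows)          # Python-literal serialization
json_text = json.dumps(rows)  # JSON serialization of the same data

# Assumption: "compile" and "parse" in the numbers above refer to the
# built-in compile() and to ast.parse(); the other names are the obvious calls.
candidates = {
    "compile": lambda: compile(py_text, "<data>", "eval"),
    "parse": lambda: ast.parse(py_text, mode="eval"),
    "eval": lambda: eval(py_text),  # unsafe on untrusted input
    "ast.literal_eval": lambda: ast.literal_eval(py_text),
    "json.loads": lambda: json.loads(json_text),
}

# Best of three single runs for each candidate.
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}")
```

The absolute numbers depend on the machine and the Python version, so only the ratios between the rows are meaningful.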
This benchmark script, and also a script to generate such a dummy text file, can be found here.
(Maybe the answer is: "this needs a faster C implementation; no-one has implemented that yet")
After posting this, I found some related questions. I did not find them via Google though (maybe my search term "faster literal_eval" was bad).
- Why is json.loads an order of magnitude faster than ast.literal_eval?
- python ast vs json for str to dict translation
This partly answers the question of why ast.literal_eval is slow.
It also tells you that if you are considering Python code (e.g. via repr) as a human-readable serialization format, you are better off using JSON instead.
So, to the best of my knowledge, there was no existing implementation faster than ast.literal_eval (eval itself is a bit faster, but unsafe).
So I wrote my own simple implementation, which converts the literal Python code into equivalent binary Pickle data.
For some bytes data, instead of ast.literal_eval(data.decode("utf8")), you would use pickle.loads(py_to_pickle(data)) and get a speedup of about 5.5x.
The repo is here.
This is a quite straightforward implementation in C++, and you can use it directly with ctypes (there is an example in the repo).
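To make the idea of translating literal text into pickle opcodes more concrete, here is a minimal pure-Python sketch. It is not the C++ code from the repo and it is not fast. The name `py_to_pickle_sketch` is hypothetical. It only handles lists, dicts, ints, floats and quoted strings without backslash escapes, and it does not validate the syntax.

```python
import ast
import pickle
import re
import struct

# Tokens of the restricted subset: punctuation, escape-free strings, numbers.
_TOKEN = re.compile(
    rb"""\s*(?:
        (?P<punct>[\[\]{},:])
      | (?P<str>'[^'\\]*'|"[^"\\]*")
      | (?P<num>-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)
    )""",
    re.VERBOSE,
)


def py_to_pickle_sketch(data: bytes) -> bytes:
    """Translate a restricted Python literal (UTF-8 bytes) into pickle bytes."""
    out = []
    pos = 0
    while True:
        m = _TOKEN.match(data, pos)
        if m is None:
            if data[pos:].strip():
                raise ValueError(f"unsupported syntax at byte {pos}")
            break
        pos = m.end()
        kind = m.lastgroup
        tok = m.group(kind)
        if kind == "punct":
            if tok in b"[{":
                out.append(b"(")  # MARK: start of a container
            elif tok == b"]":
                out.append(b"l")  # LIST: build list from items since MARK
            elif tok == b"}":
                out.append(b"d")  # DICT: build dict from key/value pairs since MARK
            # "," and ":" carry no information for the pickle stack machine
        elif kind == "str":
            body = tok[1:-1]
            # BINUNICODE: 4-byte little-endian length followed by UTF-8 bytes
            out.append(b"X" + struct.pack("<I", len(body)) + body)
        elif any(c in tok for c in (b".", b"e", b"E")):
            out.append(b"F" + tok + b"\n")  # FLOAT: decimal text argument
        else:
            # INT: decimal text argument; normalized to avoid leading zeros
            out.append(b"I%d\n" % int(tok))
    out.append(b".")  # STOP
    return b"".join(out)


text = b'[{"a": 1, "b": 2.5, "name": "caf\xc3\xa9"}, {"x": -3, "y": 1e-3}]'
assert pickle.loads(py_to_pickle_sketch(text)) == ast.literal_eval(text.decode("utf8"))
```

The point of the design is that the converter never builds Python objects itself. It emits a flat byte stream in one pass and leaves object construction to the already fast pickle.loads.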
Gunzip + read time: 0.1663219928741455
Size: 22540270
py_to_pickle: 0.539439306
pickle.loads+py_to_pickle: 0.7234611099999999
compile: 3.3440755870000003
parse: 3.6302585899999995
eval: 3.306765757000001
ast.literal_eval: 4.056752016000003
json.loads: 0.3230752619999997
pickle.loads: 0.1351051709999993
marshal.loads: 0.10351717500000035