Some useful tips and libraries for manipulating JSON in your data science projects.
The standard json module
Python has a standard module called json that lets you quickly manipulate JSON files.
Loading
import json
with open("data/example.json", "r") as f:
    data = json.load(f)
data
# [{'id': 0, 'content': [0.0, 0.0, 1.0]}, {'id': 1, 'content': [0.0, 1.0, 0.0]}]
Saving
First tip: when working with textual data, the ensure_ascii=False option is very useful for preserving, among other things, accented characters when saving:
with open("data/example.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)
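To make the effect of ensure_ascii concrete, here is a minimal sketch using json.dumps on an in-memory record. The record and its values are hypothetical, chosen only because they contain accents.

```python
import json

# Hypothetical record containing accented characters.
record = {"name": "Zoé", "city": "Orléans"}

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)   # {"name": "Zo\u00e9", "city": "Orl\u00e9ans"}
print(readable)  # {"name": "Zoé", "city": "Orléans"}

# Both strings decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == record
```

Both forms are valid JSON; the second is simply easier to read and diff. When writing such output to a file, it is probably worth passing encoding="utf-8" to open, since the default file encoding depends on the platform.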
Second tip: the indent option of the dump function indents the data in the saved file:
with open("data/example.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
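As a quick illustration of what indent changes, the sketch below serializes the same small list with and without it. The sample data is a shortened, hypothetical version of the example file.

```python
import json

# Shortened, hypothetical version of the example data.
data = [{"id": 0, "content": [0.0, 0.0, 1.0]}]

compact = json.dumps(data)           # everything on a single line
pretty = json.dumps(data, indent=2)  # one value per line, 2-space indentation

print(compact)
print(pretty)

# The indentation is purely cosmetic: both strings parse to the same object.
assert json.loads(compact) == json.loads(pretty) == data
```

The indented form is convenient for files that people read or version, at the cost of a larger file; for large machine-only dumps the compact form is usually enough.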
Issues related to numpy
Data science projects often use numpy. However, numpy objects are not JSON-serializable and therefore require conversion to standard Python objects in order to be saved:
import numpy as np
data = np.array([[0., 0., 1.], [0., 1., 0.]])
with open("data/numpy.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[12], line 6
3 data = np.array([[0., 0., 1.], [0., 1., 0.]])
5 with open("data/numpy.json", "w") as f:
----> 6 json.dump(data, f, ensure_ascii=False)
TypeError: Object of type ndarray is not JSON serializable
By converting the array into a list, the data object can be saved:
with open("data/numpy.json", "w") as f:
    json.dump(data.tolist(), f, ensure_ascii=False)
But that is not very practical, since every array has to be converted by hand before saving.
One solution is to create a custom JSONEncoder that converts each numpy.ndarray using its tolist method at save time:
class NumpyJSONEncoder(json.JSONEncoder):
    """JSONEncoder to store Python dicts or lists containing numpy arrays."""

    def default(self, obj):
        """Transform numpy arrays into JSON-serializable objects such as lists.

        See: https://docs.python.org/3/library/json.html#json.JSONEncoder.default
        """
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        # Defer to the base class, which raises TypeError for unsupported types.
        return super().default(obj)

with open("data/numpy.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, cls=NumpyJSONEncoder)
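The encoder above only handles arrays. As a hedged extension that is not part of the original encoder, numpy scalars such as np.int64 or np.float32 can be converted with their item() method. The class name NumpyScalarJSONEncoder and the sample record below are hypothetical; the sketch also shows one way to get an array back after loading.

```python
import json

import numpy as np


class NumpyScalarJSONEncoder(json.JSONEncoder):
    """Hypothetical variant of the encoder that also handles numpy scalars."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        # Assumption: numpy scalars (np.int64, np.float32, ...) should become
        # plain Python numbers; np.generic is their common base class.
        if isinstance(obj, np.generic):
            return obj.item()
        return super().default(obj)


# Hypothetical record mixing numpy scalars and an array.
record = {"id": np.int64(3), "score": np.float32(0.5), "vector": np.array([0.0, 1.0])}

text = json.dumps(record, cls=NumpyScalarJSONEncoder)
restored = json.loads(text)

# JSON has no array type of its own, so the vector comes back as a list
# and has to be converted explicitly.
vector = np.asarray(restored["vector"])
print(text)
```

Note that the round trip is lossy with respect to types: the dtype and the distinction between a list and an array are not stored in the file, so the loading code has to know which fields to convert.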
The orjson library
orjson presents itself as the fastest JSON library available for Python. It natively serializes dataclass, datetime, numpy and UUID objects.
A few things to remember when working with orjson:
- There is no load or dump function; you have to use loads and dumps instead, and dumps returns bytes rather than str.
- Some features must be enabled with option flags, such as orjson.OPT_SERIALIZE_NUMPY to serialize numpy objects.
import orjson
with open("data/example.json", "rb") as f:
    data = orjson.loads(f.read())
with open("data/example.json", "wb") as f:
    f.write(orjson.dumps(data, option=orjson.OPT_SERIALIZE_NUMPY))
Note that the JSON file is opened in binary mode (hence rb and wb), because orjson.dumps returns bytes.
Performance
orjson claims to serialize numpy.ndarray 4 to 12 times faster than the standard library. This can be checked by comparing the two methods described above:
data = {i: np.random.randn(100) for i in range(100)}
def save_json():
    with open("data/fast.json", "w") as f:
        json.dump(data, f, ensure_ascii=False, cls=NumpyJSONEncoder)

def save_orjson():
    with open("data/orfast.json", "wb") as f:
        f.write(orjson.dumps(data, option=orjson.OPT_SERIALIZE_NUMPY | orjson.OPT_NON_STR_KEYS))
%timeit save_json()
%timeit save_orjson()
# 15.5 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In this example, orjson is more than 10 times faster. Note that the OPT_NON_STR_KEYS option is used to allow orjson to serialize non-string keys (here, the integer keys of the dictionary).
(Tests run with Python 3.11)
FastAPI and orjson
The FastAPI documentation contains a [guide to using orjson](https://fastapi.tiangolo.com/advanced/custom-response/?h=orjson#use-orjsonresponse) to serialize JSON responses.
This is particularly useful for APIs that expose machine learning models, whose outputs are often numpy.ndarray objects:
import numpy as np
from fastapi import FastAPI
from fastapi.responses import ORJSONResponse
app = FastAPI()
@app.get("/random-vector", response_class=ORJSONResponse)
async def get_random_vector():
    return ORJSONResponse(np.random.randn(100))
JSON Lines
When working with the JSON format, it's not uncommon to manipulate collections of objects:
[
{"id": 0, "name": "toto"},
{"id": 1, "name": "titi"}
]
To be valid, the objects must be contained in a JSON array, hence the square brackets around the collection. However, this is impractical for reading large volumes of data, as you have to parse the entire file and load everything into memory.
This can be remedied by using the [JSON Lines](https://jsonlines.org/) format. It simply consists of placing one JSON object per line, so that you can iterate over the objects without having to parse the entire collection at once.
{"id": 0, "name": "toto"}
{"id": 1, "name": "titi"}
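Because each line is a complete JSON document, the format can be handled with the standard library alone. Here is a minimal sketch; the helper names write_jsonl and read_jsonl and the temporary file path are hypothetical, introduced only for illustration.

```python
import json
import tempfile
from pathlib import Path

records = [{"id": 0, "name": "toto"}, {"id": 1, "name": "titi"}]


def write_jsonl(path, items):
    """Write one compact JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield objects one at a time, so only one line is parsed at once."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# A temporary directory keeps the sketch self-contained.
path = Path(tempfile.mkdtemp()) / "many_examples.jsonl"
write_jsonl(path, records)

loaded = list(read_jsonl(path))
print(loaded)
```

read_jsonl is a generator, so a loop over it never holds more than one object in memory; list() is used here only to display the result. Objects must be written without indent, since an indented object would span several lines.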
The [jsonlines](https://jsonlines.readthedocs.io/en/latest/index.html) library is very useful for manipulating such files. It can also be combined with orjson.
import jsonlines
import orjson
with jsonlines.open("data/many_examples.jsonl", "r", loads=orjson.loads) as reader:
    for obj in reader:
        print(obj)
# {'id': 0, 'name': 'toto'}
# {'id': 1, 'name': 'titi'}
TL;DR
A brief summary of the tips seen here:
- Use ensure_ascii=False when working with the standard json library.
- Consider the orjson library for the performance and functionality it offers.
- Consider the JSON Lines format for collections of JSON objects.