Dealing with huge files has always been a challenging task. The memory consumption that this kind of processing requires is something to take into account regardless of the language we are using, and Node.js is no exception.
Let's see how Node.js streams can make this task bearable even for a process with minimal memory available. Specifically, we'll take advantage of streams to run a process that converts a Google Takeout Location History JSON into a GeoJSON.
The problem
Our input is an array of locations that is not expressed according to any Geographic Information System standard, so we want to convert it into one that is.
Google Takeout Location History input example:
{
  "locations": [
    {
      "timestampMs": "1507330772000",
      "latitudeE7": 419058658,
      "longitudeE7": 125218684,
      "accuracy": 16,
      "velocity": 0,
      "altitude": 66
    }
  ]
}
GeoJSON output example:
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [ 12.5218684, 41.9058658 ]
      },
      "properties": {
        "timestamp": "2017-10-06T22:59:32.000Z",
        "accuracy": 16,
        "velocity": 0,
        "altitude": 66
      }
    }
  ]
}
The transformation that we want to perform is quite straightforward: we would like to apply the following function to the entries of the locations array.
const toGeoJson = googleTakeoutLocation => ({
  type: 'Feature',
  geometry: {
    type: 'Point',
    coordinates: [
      // the E7 fields store degrees multiplied by 10^7
      googleTakeoutLocation.longitudeE7 / 10000000,
      googleTakeoutLocation.latitudeE7 / 10000000,
    ]
  },
  properties: {
    // the Date will be serialized as an ISO string once stringified
    timestamp: new Date(Number(googleTakeoutLocation.timestampMs)),
    accuracy: googleTakeoutLocation.accuracy,
    velocity: googleTakeoutLocation.velocity,
    altitude: googleTakeoutLocation.altitude
  }
})
This could be achieved with a simple Array.map(). However, if we try to load a 2GB Google Takeout Location History JSON into memory in order to map() over its locations array, we are going to face the following outcome (the snippet after this list shows roughly how that happens):
- Error message: Cannot create a string longer than 0x3fffffe7 characters
- Error code: ERR_STRING_TOO_LONG
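For reference, this is roughly what the naive, in-memory version looks like; the input file name is just a placeholder. Node's string limit is 0x3fffffe7 characters (about 1GB), so a 2GB export cannot even be read into a single string.

const fs = require('fs');

// Naive approach: read and parse the whole export in one go.
const raw = fs.readFileSync('./location-history.json', 'utf8'); // throws ERR_STRING_TOO_LONG
const { locations } = JSON.parse(raw);
const features = locations.map(toGeoJson);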
The solution
The only way of dealing with files this big is a divide and conquer approach. Instead of loading them into memory all at once, we are going to create a stream of data that flows from the input file to the output one. This technique allows us to manipulate small bits of data at a time, resulting in slow but reliable processing that is not going to eat up all our memory.
Node.js streams are the best tool to implement this technique. They allow us to create different pipes through which our data will flow, and through which it can be steered and manipulated according to our needs.
There are four stream (pipe) types, briefly illustrated in the sketch after this list:
- Readable: data emitters, a given data source becomes a stream of data.
- Writable: data receivers, a given stream of data ends up into a data destination.
- Transform: data transformers, a given data stream is mutated into a new one.
- Duplex: data emitters and receivers at the same time.
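A minimal, self-contained sketch of the first three types (Duplex is left out, and nothing here is specific to our Takeout conversion yet):

const { Readable, Writable, Transform } = require('stream');

const source = Readable.from(['a', 'b', 'c']); // Readable: emits data
const upperCaser = new Transform({             // Transform: mutates data in transit
  transform(chunk, _, done) {
    done(null, chunk.toString().toUpperCase());
  }
});
const sink = new Writable({                    // Writable: receives data
  write(chunk, _, done) {
    console.log(chunk.toString());             // prints A, B, C
    done();
  }
});

source.pipe(upperCaser).pipe(sink);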
In order to accomplish our goal, we will rely on:
- One readable stream (pipe) in order to get the data out of the Google Takeout Locations JSON.
- A set of different transform streams (pipes) in order to modify our locations.
- One writable stream (pipe) in order to store mutated locations into a GeoJSON output file.
Let's see how the different pipes of our stream-processing approach chain together, why we need so many of them and what role each one plays (a sketch of how the first four might be built follows the list):
- [Read] fileToStream → Input file to stream.
- [Transform] streamParser → Consumes text and produces a stream of data items corresponding to high-level JSON tokens.
- [Transform] streamPicker → A token filter: it selects the objects we are interested in (the locations field in our case), ignores the rest and produces a stream of those objects.
- [Transform] streamArrayer → Assumes the incoming token stream represents an array of objects and streams out its entries as assembled JavaScript objects (the locations array entries in our case).
- [Transform] streamGeoJsoner → Transforms Google Takeout locations into GeoJSON locations.
- [Transform] streamStringer → Stringifies GeoJSON locations.
- [Write] streamToFile → Stream to Output file.
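The first four pipes are not spelled out in this post. One possible way to build them is a plain fs read stream combined with the stream-json package, whose streamArray() emits the same { key, value } objects our custom transform below expects; the input file name is a placeholder, and the full project linked at the end may wire things up differently.

const fs = require('fs');
const { parser } = require('stream-json');
const { pick } = require('stream-json/filters/Pick');
const { streamArray } = require('stream-json/streamers/StreamArray');

const fileToStream = fs.createReadStream('./location-history.json'); // [Read]
const streamParser = parser();                      // text -> JSON tokens
const streamPicker = pick({ filter: 'locations' }); // keep only the locations field
const streamArrayer = streamArray();                // tokens -> { key, value } objects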
The actual transformation to GeoJSON happens in the fifth pipe (streamGeoJsoner) and it looks like this:
const { Transform } = require('stream');

let count = 0; // running index of the emitted features

const streamGeoJsoner = new Transform({
  objectMode: true, // we receive and emit JavaScript objects, not raw bytes
  transform({ key, value }, _, done) {
    const geoJsonLocation = toGeoJson(value); // value is a single Takeout location entry
    count++;
    done(null, { key: count, value: geoJsonLocation });
  }
});
As you can see, we are implementing our own transform pipe in order to deal with the objects coming from the fourth pipe (streamArrayer) and to apply to them the transformation defined earlier in the article (toGeoJson).
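The last two pipes are not shown here either. A minimal sketch of them could be another Transform that stringifies each feature and wraps the whole output in a FeatureCollection, plus a plain writable file stream; the output file name is a placeholder, and the linked project may handle the wrapping differently.

let first = true;

const streamStringer = new Transform({
  writableObjectMode: true,  // receives { key, value } objects...
  readableObjectMode: false, // ...and emits plain text
  transform({ value }, _, done) {
    const prefix = first ? '{"type":"FeatureCollection","features":[' : ',';
    first = false;
    done(null, prefix + JSON.stringify(value));
  },
  flush(done) {
    // close the array and the wrapper once the input is exhausted
    done(null, first ? '{"type":"FeatureCollection","features":[]}' : ']}');
  }
});

const streamToFile = fs.createWriteStream('./locations.geojson'); // [Write]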
Now that we have all the pieces (pipes) in our hands, it is time to connect them and make our data flow through them. We are going to do that using the pipeline utility from the stream module, as follows:
const { pipeline } = require('stream');

pipeline(
  fileToStream,
  streamParser,
  streamPicker,
  streamArrayer,
  streamGeoJsoner,
  streamStringer,
  streamToFile,
  // pipeline wants a final callback to report success or failure
  err => {
    if (err) console.error('Pipeline failed', err);
    else console.log('GeoJSON file written');
  }
);
Running the above pipeline is what gets us to our goal: any Google Takeout Location History JSON, no matter how big it is, can be translated into a GeoJSON while avoiding huge memory consumption.
If you are interested in the whole code you can find it here. What follows is the outcome of the described solution over 5 different input files - check out the logs and have a look at file size and processing time.