Have you ever thought that you could process huge files in the browser? And, even better, without freezing the screen? Browsers are traditionally known only for showing things on the screen, while all the heavy processing is the server's responsibility. But nowadays browsers are much more powerful, and great new APIs have been implemented that open up many possibilities to build and improve our applications.
Application: we're going to create a pure Javascript application that loads several big CSV files, processes each one in a separate thread, transforms the CSV rows into Javascript objects, and sends mocked requests for all of those objects.
Main Principles:
Streams: an interface to load and process chunks of a file instead of the entire buffer.
Web Workers: an interface to deal with multithreading in the browser.
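Before diving in, here is a minimal sketch of the two APIs working together (the file name worker.js and the handlers are illustrative, not the final code of this tutorial):

// main.js (illustrative): spawn a worker for a File picked by the user
document.querySelector('input[type=file]').addEventListener('change', e => {
  const worker = new Worker('worker.js')
  worker.postMessage({ file: e.target.files[0] }) // File objects are structured-cloneable
})

// worker.js (illustrative): read the file as a stream, chunk by chunk
addEventListener('message', event => {
  event.data.file.stream().pipeTo(new WritableStream({
    write(chunk) { console.log('got', chunk.byteLength, 'bytes') }
  }))
})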
If you want to know more about multithreading and Streams in Node.js, you can access my tutorials through the links below:
Steps to Reproduce:
Create a thread worker to handle all the file processing.
Inside the thread worker, process the file using Streams, handling each chunk of the file as it arrives.
At the end, build the interface to load the files and show the outcomes.
Requirements
I am using version 111.0.5563.65 of Google Chrome to test the application. You will need to serve the application from localhost, because Chrome blocks the creation of workers for pages opened from a local path. I am using the VS Code Live Server plugin to do that. The CSV file with 13 million rows you can download here
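If you don't use VS Code, any static file server works. For example, a minimal sketch using Node's built-in http module (the port and the file handling are illustrative, not production-ready; real servers should also set Content-Type headers):

// serve.js — minimal static server sketch (run: node serve.js, then open http://localhost:8080)
const http = require('http')
const fs = require('fs')
const path = require('path')

http.createServer((req, res) => {
  const file = path.join(__dirname, req.url === '/' ? 'index.html' : req.url)
  fs.readFile(file, (err, data) => {
    if (err) { res.writeHead(404); res.end(); return }
    res.end(data)
  })
}).listen(8080)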
Thread Worker
The thread worker is responsible for processing the file: extracting the rows as Javascript objects and, at the end, simulating the requests that send all those objects to a server.
To measure the file loading progress and to transform the CSV rows into Javascript objects, just like we did in my first tutorial, we are going to use Transform Streams, plus a Writable Stream to store the resulting objects.
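As a quick reminder, these are the hooks the Streams API lets us implement (all hooks are optional; this is just the contract, not the final code):

new TransformStream({
  start(controller) {},            // called once, before the first chunk
  transform(chunk, controller) {}, // called per chunk; controller.enqueue() forwards data downstream
  flush(controller) {},            // called after the last chunk has been transformed
})

new WritableStream({
  write(chunk) {},  // called for each chunk that reaches the end of the pipeline
  close() {},       // called when the stream finishes successfully
  abort(err) {},    // called if the stream errors
})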
threadWorker.js
// (1)
let readableStream = null
let fileIndex = null
let bytesLoaded = 0
let linesSent = 0
const objectsToSend = []
let fileCompletelyLoaded = false
const readyEvent = new Event('ready')

// (2)
const ObjectTransform = {
  headerLine: true,
  keys: [],
  tailChunk: '',
  start() {
    this.decoder = new TextDecoder('utf-8')
  },
  transform(chunk, controller) {
    const stringChunks = this.decoder.decode(chunk, { stream: true })
    const lines = stringChunks.split('\n')
    for (const line of lines) {
      const lineString = this.tailChunk + line
      const values = lineString.split(',')
      if (this.headerLine) {
        // the first line holds the column names, which become the object keys
        this.keys = values
        this.headerLine = false
        continue
      }
      if (values.length !== this.keys.length || lineString[lineString.length - 1] === ',') {
        // incomplete row: keep the accumulated partial line until the next chunk completes it
        this.tailChunk = lineString
      } else {
        const chunkObject = {}
        this.keys.forEach((element, index) => {
          chunkObject[element] = values[index]
        })
        this.tailChunk = ''
        controller.enqueue(`${JSON.stringify(chunkObject)}`)
      }
    }
  },
}

// (3)
const ProgressTransform = {
  transform(chunk, controller) {
    bytesLoaded += chunk.length
    controller.enqueue(chunk)
    postMessage({ progressLoaded: bytesLoaded, progressSent: linesSent, index: fileIndex, totalToSend: 0 })
  },
  flush() {
    fileCompletelyLoaded = true
  },
}

// (4)
const MyWritable = {
  write(chunk) {
    objectsToSend.push(postRequest(JSON.parse(chunk)))
  },
  close() {
    if (fileCompletelyLoaded) {
      postMessage({ totalToSend: objectsToSend.length, index: fileIndex, progressLoaded: bytesLoaded, progressSent: linesSent })
      dispatchEvent(readyEvent)
    }
  },
  abort(err) {
    console.log('Sink error:', err)
  },
}

// (5)
const postRequest = async data => {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      linesSent++
      postMessage({ totalToSend: objectsToSend.length, progressSent: linesSent, progressLoaded: bytesLoaded, index: fileIndex })
      resolve(data)
    }, 3000)
  })
}

// (6)
addEventListener('ready', async () => {
  await Promise.all(objectsToSend)
})

// (7)
addEventListener('message', event => {
  fileIndex = event.data?.index
  readableStream = event.data?.file?.stream()
  readableStream
    .pipeThrough(new TransformStream(ProgressTransform))
    .pipeThrough(new TransformStream(ObjectTransform))
    .pipeTo(new WritableStream(MyWritable))
})
From the code sections above:

1. Controller variables:
   - readableStream: stores the stream of the file passed by the main thread.
   - fileIndex: identifies which file is being processed.
   - bytesLoaded: accumulates the number of bytes processed.
   - linesSent: counts the objects that have already been sent.
   - objectsToSend: array that stores all the extracted objects.
   - fileCompletelyLoaded: flag that indicates whether the file has finished loading.
   - readyEvent: event used to trigger the requests to the server.
2. Transform Stream that converts the rows into Javascript objects. There is a section in one of my previous tutorials where I explain in more detail how I do that: here. A worked example of the conversion is sketched right after this list.
3. Transform Stream responsible for controlling the progress: it increments the bytesLoaded variable and sets the fileCompletelyLoaded flag when the file is fully loaded. The enqueue function passes the chunk on to the rest of the Stream pipeline.
4. Writable Stream responsible for storing all our converted objects and dispatching the ready event to start the requests.
5. Function used to simulate the behavior of a request, with a 3-second delay.
6. Listener that waits for the ready event to be dispatched and then awaits all the mocked requests.
7. Another listener, waiting for messages from the main thread containing the file index and the File object.

OBS: The postMessage function is used to send messages to the main thread that created the worker.
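To make the conversion in (2) concrete, here is a minimal sketch that pipes a made-up two-row CSV through the same ObjectTransform (the data and the inline ReadableStream are illustrative; run it as a standalone experiment in the worker context, where ObjectTransform is defined):

const demoCsv = new TextEncoder().encode('id,name\n1,Ana\n2,Bob\n')
new ReadableStream({
  start(controller) {
    controller.enqueue(demoCsv) // a single chunk here; real files arrive in many chunks
    controller.close()
  }
})
  .pipeThrough(new TransformStream(ObjectTransform))
  .pipeTo(new WritableStream({ write: row => console.log(row) }))
// logs: {"id":"1","name":"Ana"} and then {"id":"2","name":"Bob"}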
UI behavior
Our focus here is not the interface, so we have a simple input that accepts multiple files and a div to show the progress and information related to the files.
main.js
// (1)
const input = document.getElementById('files')
const progress = document.getElementById('progress')

// (2)
const formatBytes = (bytes, decimals = 2) => {
  if (!+bytes) return '0 Bytes'
  const k = 1024
  const dm = decimals < 0 ? 0 : decimals
  const sizes = ['Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']
  const i = Math.floor(Math.log(bytes) / Math.log(k))
  return `${parseFloat((bytes / Math.pow(k, i)).toFixed(dm))} ${sizes[i]}`
}

// (3)
input.addEventListener('change', async (e) => {
  const files = e.target.files
  const workersInfo = []
  for (const i in files) {
    // for...in also yields FileList prototype members (length, item), so keep only real File entries
    if (files[i] instanceof File) {
      const worker = new Worker("threadWorker.js")
      worker.postMessage({ index: i, name: files[i].name, file: files[i] })
      worker.addEventListener("message", event => {
        if (event.data) {
          workersInfo[i] = {
            progressSent: event.data.progressSent,
            progressLoaded: event.data.progressLoaded,
            index: event.data.index,
            totalToSend: event.data.totalToSend,
            fileSize: files[i].size,
            fileName: files[i].name,
          }
        }
        progress.innerHTML = `
          <table align="center" border cellspacing="1">
            <thead>
              <tr>
                <th>File</th>
                <th>File Size</th>
                <th>Loaded</th>
                <th></th>
                <th>Total Rows</th>
                <th>Rows Sent</th>
                <th></th>
              </tr>
            </thead>
            <tbody>
              ${workersInfo.map(info =>
                `<tr>
                  <td>${info.fileName}</td>
                  <td>${formatBytes(info.fileSize)}</td>
                  <td>${formatBytes(info.progressLoaded)}</td>
                  <td><progress value="${Math.ceil(info.progressLoaded / info.fileSize * 100)}" max="100"></progress></td>
                  <td>${info.totalToSend}</td>
                  <td>${info.progressSent}</td>
                  <td><progress value="${Math.ceil(info.progressSent / info.totalToSend * 100)}" max="100"></progress></td>
                </tr>`
              ).join('')}
            </tbody>
          </table>
        `
      })
    }
  }
})
From the code sections above:

1. Getting the references to manipulate the DOM.
2. Function I found on Stack Overflow that formats a byte count in a human-readable way (see the examples after this list).
3. Listener that fires when the user selects the files to process. When this event is dispatched, we create one thread per file and use the postMessage function to pass each file to its thread. We also listen for the messages each worker sends back, to update the results on the screen.
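For instance, the formatBytes helper produces output like this (values are illustrative):

formatBytes(13000000)  // "12.4 MiB"
formatBytes(1536, 1)   // "1.5 KiB"
formatBytes(0)         // "0 Bytes"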
Joining Everything
Now we can reference the scripts we created in an HTML file and add the necessary tags. Note that threadWorker.js does not need its own script tag: it is loaded by the new Worker call inside main.js.
index.html
<!DOCTYPE html>
<html>
  <head>
    <meta charset='utf-8'>
    <meta http-equiv='X-UA-Compatible' content='IE=edge'>
    <title>Multithreading browser</title>
    <meta name='viewport' content='width=device-width, initial-scale=1'>
    <script src='main.js' async></script>
  </head>
  <body>
    <input type="file" multiple id="files" accept=".csv" /><br><br>
    <div id="progress"></div>
  </body>
</html>
Takeaways
Modern browsers expose awesome functionality that lets us improve performance and build great things.
We can handle heavy jobs in the browser, like manipulating files, and take some responsibilities off the server.
You can take a look at the entire code here
Top comments (7)
I think you wrote a good article. But your project doesn't work locally.
First, browsers don't let us load a worker file from a local path. In order to do it, you need to use createObjectURL.
Second, you pass the File object to the worker, but it doesn't work. I think the data passed to a worker as a message should be serializable data.
So, in a word, unfortunately, readers who want to try to run your project can't run it on their local machine.
Try running it with the "live server" plugin from VS Code; I mentioned it at the beginning of the tutorial to solve exactly that.
Then, does the git project work on your local machine without any updates?
Yes, you need an HTTP server to be able to run it locally.
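For reference, the createObjectURL approach mentioned above looks roughly like this (a sketch; the worker body here is illustrative):

// build the worker from an in-memory script instead of fetching threadWorker.js
const workerSource = `addEventListener('message', e => postMessage(e.data))`
const blobURL = URL.createObjectURL(new Blob([workerSource], { type: 'application/javascript' }))
const worker = new Worker(blobURL)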
Excellent work. But there's an error in your use of for...in over the FileList. That iterates the properties of the FileList itself (item, length, etc.) and therefore throws an error. You want to use for...of, and then each iterated value is a File object.
thanks for the comment! I don't remember very well, but there was a reason to use for...in instead of for...of
Hmm, can't imagine what that was, gotta use ...of to work with the Files, pass 'em to the worker, etc.