DEV Community

Wesley Miranda


Processing 13 million rows from a CSV file in the Browser (Without freezing the screen)

Have you ever thought about processing huge files in the browser, without freezing the screen? Browsers have a reputation for only displaying things, with all the heavy processing left to the server. But modern browsers are much more powerful, and newer APIs open up many possibilities for implementing and improving our applications.

Application: We're going to create a pure JavaScript application that loads several big CSV files and processes each one in its own thread, transforming the CSV rows into JavaScript objects and sending a mocked request for each object.

Main Principles:

  • Streams: Interface to load and process file chunks instead of the entire buffer.

  • Web Workers: Interface to deal with multithreading in the browser.
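One consequence of working chunk by chunk is that a chunk boundary can fall in the middle of a multi-byte character. The sketch below is only an illustration with a made-up sample string (not data from the tutorial's CSV); it shows why the worker later decodes with the `{ stream: true }` option of TextDecoder.

```javascript
// Sketch: decoding UTF-8 bytes that arrive in arbitrary chunks.
// The sample text is an assumption made for this illustration.
const bytes = new TextEncoder().encode('preço,café\n')

// Split the buffer in the middle of the two-byte character "ç" (0xC3 0xA7).
const splitAt = bytes.indexOf(0xc3) + 1
const chunks = [bytes.slice(0, splitAt), bytes.slice(splitAt)]

// Decoding each chunk in isolation breaks the split character:
// both halves turn into U+FFFD replacement characters.
const naive = chunks
    .map(chunk => new TextDecoder('utf-8').decode(chunk))
    .join('')

// With { stream: true } a single decoder keeps the incomplete bytes
// and finishes the character when the next chunk arrives.
const decoder = new TextDecoder('utf-8')
const streamed = chunks
    .map(chunk => decoder.decode(chunk, { stream: true }))
    .join('') + decoder.decode() // final call flushes anything still buffered

console.log(JSON.stringify(naive))
console.log(JSON.stringify(streamed))
```

The same decoder instance has to be reused for every chunk of a file, which is why the worker creates it once in the transformer's start method.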

If you want to know about multithreading and Streams in NodeJS, you can access my tutorials through the links below:

Steps to Reproduce:

  • Create a thread worker to handle all the file processing.

  • Inside the worker, process the file with Streams, handling it chunk by chunk.

  • Finally, build the interface to load the files and show the results.

Requirements

  • I am using version 111.0.5563.65 of Google Chrome to test the application.

  • You will need to serve the application from localhost, because Chrome does not allow a page opened from a local file path (file://) to create workers. I am using the Live Server extension for VS Code to do that.

  • You can download the CSV file with 13 million rows here


Thread Worker

The thread worker is responsible for processing the file, extracting the rows as JavaScript objects and, at the end, simulating a request to a server for each object.

To measure the file loading progress and transform CSV rows into JavaScript objects, just like we did in my first tutorial, we are going to use Transform Streams, plus a Writable Stream to store the objects.
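Before reading the full worker, it may help to see the shape of such a pipeline on its own. This is a minimal sketch, not the tutorial's code: runPipeline and its hand-made source are hypothetical stand-ins for file.stream(), and it assumes an environment where ReadableStream, TransformStream and WritableStream are globals (modern browsers, or Node.js 18 and later).

```javascript
// Sketch: source -> TransformStream -> WritableStream.
// runPipeline is a hypothetical helper written for this illustration.
async function runPipeline(parts) {
    // Hand-made source standing in for file.stream()
    const source = new ReadableStream({
        start(controller) {
            for (const part of parts) controller.enqueue(part)
            controller.close()
        }
    })

    // Transform stage: observe each chunk, then pass it on unchanged
    let bytesSeen = 0
    const countBytes = new TransformStream({
        transform(chunk, controller) {
            bytesSeen += chunk.byteLength
            controller.enqueue(chunk)
        }
    })

    // Sink: the end of the pipeline, here it just collects the chunks
    const collected = []
    const sink = new WritableStream({
        write(chunk) {
            collected.push(chunk)
        }
    })

    // pipeTo resolves once the source is exhausted and the sink is closed
    await source.pipeThrough(countBytes).pipeTo(sink)
    return { bytesSeen, chunkCount: collected.length }
}

runPipeline([new Uint8Array(3), new Uint8Array(5)])
    .then(result => console.log(result))
```

A transform stage that forgets to call controller.enqueue silently drops the chunk, so nothing reaches the stages after it.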

threadWorker.js


// (1)
let readableStream = null
let fileIndex = null
let bytesLoaded = 0
let linesSent = 0
const objectsToSend = []
let fileCompletelyLoaded = false
const readyEvent = new Event('ready')


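// (2)
const ObjectTranform = {
    headerLine: true,
    keys: [],
    tailChunk: '',

    start() {
        this.decoder = new TextDecoder('utf-8');
    },

    // Build an object from one complete CSV line and pass it down the pipeline
    enqueueLine(line, controller) {
        // Drop a trailing carriage return so files with \r\n line endings also work
        const values = line.replace(/\r$/, '').split(',')

        if (this.headerLine) {
            this.keys = values
            this.headerLine = false
            return
        }

        const chunkObject = {}

        this.keys.forEach((element, index) => {
            chunkObject[element] = values[index]
        })

        controller.enqueue(JSON.stringify(chunkObject))
    },

    transform(chunk, controller) {
        // Prepend the unfinished line left over from the previous chunk
        const text = this.tailChunk + this.decoder.decode(chunk, { stream: true })
        const lines = text.split('\n')

        // The last element is '' when the chunk ends on a line break;
        // otherwise it is a partial row that the next chunk will complete
        this.tailChunk = lines.pop()

        for (const line of lines) {
            if (line.trim() !== '') this.enqueueLine(line, controller)
        }
    },

    flush(controller) {
        // Emit the last row when the file does not end with a line break
        if (this.tailChunk.trim() !== '') this.enqueueLine(this.tailChunk, controller)
    },
}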

// (3)
const ProgressTransform = {
    transform(chunk, controller) {
        bytesLoaded += chunk.length
        controller.enqueue(chunk)
        postMessage({ progressLoaded: bytesLoaded, progressSent: linesSent, index: fileIndex, totalToSend: 0 })
    },
    flush() {
        fileCompletelyLoaded = true
    }
}

// (4)
const MyWritable = {
    write(chunk) {
        objectsToSend.push(postRequest(JSON.parse(chunk)))
    },
    close() {
        if (fileCompletelyLoaded) {
            postMessage({ totalToSend: objectsToSend.length, index: fileIndex, progressLoaded: bytesLoaded, progressSent: linesSent })
            dispatchEvent(readyEvent)
        }
    },
    abort(err) {
        console.log("Sink error:", err);
    },
}

// (5)
const postRequest = async data => {
    return new Promise((resolve, reject) => {
        setTimeout(() => {
            linesSent++
            postMessage({ totalToSend: objectsToSend.length, progressSent: linesSent, progressLoaded: bytesLoaded, index: fileIndex })
            resolve(data)
        }, 3000)
    })
}

// (6)
addEventListener('ready', async () => {
    await Promise.all(objectsToSend)
})

// (7)
addEventListener("message", event => {
    fileIndex = event.data?.index
    readableStream = event.data?.file?.stream()
    readableStream
        .pipeThrough(new TransformStream(ProgressTransform))
        .pipeThrough(new TransformStream(ObjectTranform))
        .pipeTo(new WritableStream(MyWritable))
})



From the code sections above:

  1. Controller variables:

    • readableStream: the stream of the file passed by the main thread.
    • fileIndex: identifies which file is being processed.
    • bytesLoaded: the number of bytes processed so far.
    • linesSent: the number of objects sent so far.
    • objectsToSend: array that stores the pending request for each extracted object.
    • fileCompletelyLoaded: flag that tells whether the whole file was loaded.
    • readyEvent: event dispatched once every row has been read and queued.
  2. Transform Stream that converts the rows into JavaScript objects. This is explained in more detail in a section of one of my previous tutorials, here

  3. Transform Stream that tracks the progress: it increments bytesLoaded and sets the fileCompletelyLoaded flag when the file is fully loaded. We call enqueue to pass each chunk on to the rest of the Stream pipeline.

  4. Writable Stream that stores the converted objects. Note that write calls postRequest right away, so each mocked request starts as soon as its row is parsed; when the stream closes, it reports the total and dispatches the ready event.

  5. Function that simulates a request with a 3-second delay.

  6. Listener for the ready event, which waits for all the mocked requests to finish.

  7. Another listener, which waits for the message from the main thread carrying the file index and the file object, then starts the pipeline.

Note: The postMessage function sends messages to the main thread that created the worker.
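The row handling from step 2 can be tried outside of a stream. The sketch below is a simplified stand-in (createRowParser is a hypothetical helper, not part of threadWorker.js) showing how an unfinished line is carried over from one text chunk to the next. Like the tutorial's code, it assumes plain comma-separated values with no quoted fields.

```javascript
// Sketch: turn arbitrarily split text chunks into row objects.
// createRowParser is a hypothetical name used only in this example.
function createRowParser() {
    let tail = ''   // unfinished line carried over to the next chunk
    let keys = null // column names, taken from the header line

    const toRow = line => {
        const values = line.replace(/\r$/, '').split(',')
        if (keys === null) {
            keys = values // first line is the header, it yields no row
            return null
        }
        return Object.fromEntries(keys.map((key, i) => [key, values[i]]))
    }

    return {
        push(text) {
            const lines = (tail + text).split('\n')
            // '' if the chunk ended on a line break, otherwise a partial line
            tail = lines.pop()
            return lines
                .filter(line => line !== '')
                .map(toRow)
                .filter(row => row !== null)
        },
        flush() {
            // Whatever is left when the input ends is the last row
            const rest = tail
            tail = ''
            return rest === '' ? [] : [toRow(rest)].filter(row => row !== null)
        }
    }
}

// Chunks deliberately cut in the middle of rows
const parser = createRowParser()
const rows = [
    ...parser.push('id,name\n1,An'),
    ...parser.push('na\n2,Bo'),
    ...parser.push('b'),
    ...parser.flush()
]

console.log(rows)
```

Always treating the last piece after a split as unfinished avoids guessing whether a line is complete from its number of columns, which can go wrong when a chunk ends inside the last field of a row.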


UI behavior

Our focus is not the interface here, so we have a simple input that accepts multiple files and a div to show the progress and information related to the files.

main.js


// (1)
const input = document.getElementById('files')
const progress = document.getElementById('progress')


// (2)
const formatBytes = (bytes, decimals = 2) => {
    if (!+bytes) return '0 Bytes'

    const k = 1024
    const dm = decimals < 0 ? 0 : decimals
    const sizes = ['Bytes', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']

    const i = Math.floor(Math.log(bytes) / Math.log(k))

    return `${parseFloat((bytes / Math.pow(k, i)).toFixed(dm))} ${sizes[i]}`
}

// (3)
input.addEventListener('change', (e) => {
    const workersInfo = []

    // FileList is iterable, so for...of visits only the File objects
    // (for...in would also visit the "length" and "item" properties)
    for (const [i, file] of Array.from(e.target.files).entries()) {
        const worker = new Worker("threadWorker.js")

        // Listen before posting so no progress message can be missed
        worker.addEventListener("message", event => {
            if (event.data) {
                workersInfo[i] = {
                    progressSent: event.data.progressSent, progressLoaded: event.data.progressLoaded, index: event.data.index, totalToSend: event.data.totalToSend, fileSize: file.size, fileName: file.name
                }
            }

            // join('') prevents the commas an interpolated array would add;
            // "|| 0" keeps the bar at 0 while the total is still unknown (NaN)
            progress.innerHTML = `
                <table align="center" border cellspacing="1">
                    <thead>
                        <tr>
                            <th>File</th>
                            <th>File Size</th>
                            <th>Loaded</th>
                            <th></th>
                            <th>Total Rows</th>
                            <th>Rows Sent</th>
                            <th></th>
                        </tr>
                    </thead>
                    <tbody>
                        ${workersInfo.map(info =>
                `<tr>
                                <td>${info.fileName}</td>
                                <td>${formatBytes(info.fileSize)}</td>
                                <td>${formatBytes(info.progressLoaded)}</td>
                                <td><progress value="${Math.ceil(info.progressLoaded / info.fileSize * 100) || 0}" max="100"></progress></td>
                                <td>${info.totalToSend}</td>
                                <td>${info.progressSent}</td>
                                <td><progress value="${Math.ceil(info.progressSent / info.totalToSend * 100) || 0}" max="100"></progress></td>
                            </tr>`
            ).join('')}
                    </tbody>
                </table>
                `
        })

        worker.postMessage({ index: i, name: file.name, file })
    }
})


From the code sections above:

  1. Getting the references to manipulate the DOM.

  2. A function I found on Stack Overflow that formats byte counts in a readable way.

  3. The listener runs when the user selects the files to process. When this event is dispatched, we create one worker thread per file and use the postMessage function to pass the file to it. The main thread also listens for messages from each worker to update the results on the screen.
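Two details of the table rendering are easy to get wrong, so here is a small sketch of them in isolation. The helpers percent and renderRows are hypothetical and the sample objects are made up; they only mirror the shape of the entries stored in workersInfo.

```javascript
// Percentage helper that stays at 0 until the total is known,
// which avoids 0 / 0 = NaN ending up in the value attribute.
const percent = (done, total) =>
    total > 0 ? Math.min(100, Math.ceil(done / total * 100)) : 0

// Build all rows as one string. join('') matters: an array interpolated
// directly into a template literal is joined with commas.
const renderRows = infos => infos
    .map(info => `<tr>
        <td>${info.fileName}</td>
        <td><progress value="${percent(info.progressLoaded, info.fileSize)}" max="100"></progress></td>
        <td><progress value="${percent(info.progressSent, info.totalToSend)}" max="100"></progress></td>
    </tr>`)
    .join('')

// Hypothetical sample data in the shape main.js keeps per file
const html = renderRows([
    { fileName: 'a.csv', fileSize: 200, progressLoaded: 50, totalToSend: 0, progressSent: 0 },
    { fileName: 'b.csv', fileSize: 100, progressLoaded: 100, totalToSend: 3, progressSent: 1 }
])

console.log(html)
```

Rebuilding the whole table on every message is fine for a demo; with many files it may be worth updating only the row that changed.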


Joining Everything

Now we can put everything together in an HTML file with the necessary tags.

index.html

<!DOCTYPE html>
<html>

<head>
    <meta charset='utf-8'>
    <meta http-equiv='X-UA-Compatible' content='IE=edge'>
    <title>Multithreading browser</title>
    <meta name='viewport' content='width=device-width, initial-scale=1'>
    <!-- threadWorker.js is not included here: main.js loads it with new Worker() -->
    <script src='main.js' defer></script>
</head>

<body>
    <input type="file" multiple id="files" accept=".csv" /><br><br>
    <div id="progress"></div>
</body>

</html>


Takeaways

  • Modern browsers offer powerful features for improving performance and building nice things.

  • We can handle heavy jobs in the browser, such as processing files, and take some responsibilities off the server.

You can take a look at the entire code here


Top comments (7)

crazy man

I think you wrote a good article.
But your project doesn't work locally.
First, browsers don't let us load a worker file from a local path.
To do that, you need to use createObjectURL.
Second, you pass the File object to the worker, but it doesn't work.
I think the data passed to a worker as a message should be serializable.

So, in a word, unfortunately, readers who want to try your project can't run it locally.

Wesley Miranda

Try running it with the "Live Server" plugin for VS Code; I mentioned it at the beginning of the tutorial as the way to solve this.

crazy man

Then does the git project work locally for you without any changes?

Wesley Miranda

Yes, you need an HTTP server to be able to run it locally

swhiteman

Excellent work. But there's an error in your use of the for...in over the FileList. That's iterating the properties of the FileList itself (item, length, etc.) and therefore throws an error.

You want to use for...of and then each iterator value is a File object.

Wesley Miranda

thanks for the comment!

I don't remember very well but there was a reason to use for...in instead of for...of

swhiteman

Hmm, can't imagine what that was, gotta use ...of to work with the Files, pass 'em to the worker, etc.