DEV Community

Storing a Java collection in a file storage

Alex Lunkov on February 15, 2024

If you deal with large datasets stored in a Java collection, let's say java.util.List, sooner or later you will encounter a situation that a...

Nils

Thanks for sharing this library. I really like the idea behind it: storing large amounts of data on disk instead of in memory, in a simple way, during processing.

I've been playing around for a little while with the usage example program and the code in general and, if you don't mind, I'd like to share some detailed feedback. There are some speed optimizations that might be useful, and I also have some thoughts on Java Collections integration.
As I don't want to seem rude by dropping several code snippets into the comments out of the blue, I'd kindly ask beforehand whether it's okay to do so.

Alex Lunkov

It is true that the library needs optimization for speed; writing to file storage is definitely always slower than storing in memory.

I welcome improvements and thoughts, so please share your opinion :) It is really interesting to me!

Nils

The execution time of the example program dropped significantly on my machine when adding the following changes:

  • Implementing FOutputStream.write(byte[]) to forward the given data directly to raf.write(byte[])
  • Implementing FOutputStream.write(byte[], int, int) to forward the given data to raf.write(byte[], int, int)
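
The reason this helps is that the default java.io.OutputStream.write(byte[], int, int) loops over the array and calls write(int) once per byte. A minimal sketch of the idea follows; BulkOutputStream and writeAndMeasure are hypothetical names used only for illustration, and the library's real FOutputStream may be structured differently:

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for FOutputStream; not the library's actual class. */
class BulkOutputStream extends OutputStream {
    private final RandomAccessFile raf;

    BulkOutputStream(RandomAccessFile raf) {
        this.raf = raf;
    }

    @Override
    public void write(int b) throws IOException {
        raf.write(b); // one file call per byte
    }

    @Override
    public void write(byte[] b) throws IOException {
        raf.write(b); // whole array in a single file call
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        raf.write(b, off, len); // requested slice in a single file call
    }

    /** Demo helper: writes the text to a temp file and returns the resulting file length. */
    static long writeAndMeasure(String text) {
        try {
            File file = File.createTempFile("bulk-demo", ".bin");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
                 OutputStream out = new BulkOutputStream(raf)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
                return raf.length();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrappers such as ObjectOutputStream or BufferedOutputStream hand their data over in arrays, so with these overrides each flush becomes one call into the file instead of one call per byte.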

A similar approach works for FInputStream by implementing read(byte[], int, int), but here a little more logic is needed to match what you've implemented in read():

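    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        long remaining = endOfBlock - pos;
        if (remaining <= 0) {
            return -1; // end of the current block
        }
        raf.seek(pos);
        // Fill b starting at off (not 0) and never read past the end of the block.
        int readCount = raf.read(b, off, (int) Math.min(remaining, len));
        if (readCount > 0) {
            pos += readCount; // do not move pos backwards when raf reports -1
        }
        return readCount;
    }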


And of course I've also looked into options for providing a (more or less) compatible Java Collections integration.
Similar to the unmodifiable collections that Java itself offers, one could implement only the "read" methods of Collection and reject every write call, either silently or by throwing an exception. The only missing "read" methods are size() and isEmpty(). size() could be implemented by seeking through the raf in FFileStorage, jumping from header to header (ignoring the real data in between) and counting the jumps. And isEmpty() might be something like return (raf.length() == 0L);.
And because FCollection (and the underlying FFileCollection) already offers a Java Iterator implementation, a "real" Stream could easily be derived:

    public Stream<T> asJavaStream() {
        return StreamSupport.stream(Spliterators.spliteratorUnknownSize(this.iterator(), Spliterator.ORDERED), false);
    }

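The read-only Collection idea could be sketched roughly like this. ReadOnlyView is a hypothetical name, the Iterable passed in merely stands in for the file-backed collection, and size() here simply iterates and counts instead of jumping between headers, since that depends on the library's file layout:

```java
import java.util.AbstractCollection;
import java.util.Collection;
import java.util.Iterator;

/** Hypothetical read-only Collection view; not part of the library. */
class ReadOnlyView<T> extends AbstractCollection<T> {
    private final Iterable<T> source; // stands in for a file-backed collection

    ReadOnlyView(Iterable<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        Iterator<T> it = source.iterator();
        return new Iterator<T>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public T next() { return it.next(); }
            // remove() is not overridden, so the default throws UnsupportedOperationException
        };
    }

    @Override
    public int size() {
        long count = 0;
        for (T ignored : source) {
            count++; // full scan; a header-jumping scan would avoid reading the payloads
        }
        return (int) Math.min(count, Integer.MAX_VALUE); // Collection.size() caps at MAX_VALUE
    }

    @Override
    public boolean isEmpty() {
        return !source.iterator().hasNext(); // no need to count everything
    }

    /** Demo helper: returns true if add() is rejected with UnsupportedOperationException. */
    static boolean rejectsWrites(Collection<String> c) {
        try {
            c.add("x");
            return false;
        } catch (UnsupportedOperationException expected) {
            return true;
        }
    }
}
```

AbstractCollection already throws UnsupportedOperationException from add(), and its contains(), toArray() and stream() work on top of iterator(), so only the two abstract methods need to be supplied.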

Also, I had some trouble starting the program under MS Windows because the default storage path /tmp resolves to C:\tmp, which usually does not exist, so the file cannot be created and a manual configuration must be used. Maybe the logic of java.io.File.createTempFile(...) could somehow be incorporated for the default case to handle the selection of the temporary directory.
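
For illustration, a default along those lines might look like the following sketch. DefaultStorage, the "fcollection-" prefix and the ".dat" suffix are made-up names, not something the library defines:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; the class name, prefix and suffix are illustrative only. */
class DefaultStorage {
    /** Creates an empty storage file in the platform's default temporary directory. */
    static Path createStorageFile() {
        try {
            // Uses the directory named by the java.io.tmpdir system property,
            // so no hard-coded /tmp is needed on Windows.
            Path file = Files.createTempFile("fcollection-", ".dat");
            file.toFile().deleteOnExit(); // best-effort cleanup when the JVM exits
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Files.createTempFile(prefix, suffix) is the NIO counterpart of java.io.File.createTempFile(...) and picks the same default directory.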

Alex Lunkov

Hi Nils
Thank you for your suggestions,

That is a really good catch on modifying the output/input stream classes; it improves performance. I will also replace the "/tmp" default with the system-defined temporary directory.

It might seem that implementing the "Collection" interface in "FCollection" should be simple. Indeed, there are not many methods to implement, but I would avoid using the default implementation for creating a stream out of an iterator, as it will lead to accumulating all data in memory again, especially for sorting. When I find a way to properly implement the Stream interface for FStream, I believe I will implement the Collection interface.
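
To make the memory concern concrete, here is a small experiment with the standard java.util.stream classes. LazyStreamDemo and its methods are hypothetical names for this sketch. It counts how many elements a stream pulls from its source iterator: as far as the JDK implementation goes, operations such as limit() pull elements one at a time, while sorted() has to read the whole source before it can emit anything:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Hypothetical demo class; it does not use the library at all. */
class LazyStreamDemo {
    /** Iterator over 0..total-1 that counts how many elements were actually pulled. */
    static Iterator<Integer> countingIterator(int total, AtomicInteger pulled) {
        return new Iterator<Integer>() {
            private int next = 0;
            @Override public boolean hasNext() { return next < total; }
            @Override public Integer next() {
                if (next >= total) throw new NoSuchElementException();
                pulled.incrementAndGet();
                return next++;
            }
        };
    }

    static Stream<Integer> stream(Iterator<Integer> it) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false);
    }

    /** Number of source elements pulled when only the first few are consumed. */
    static int pulledWithLimit(int total, int limit) {
        AtomicInteger pulled = new AtomicInteger();
        stream(countingIterator(total, pulled)).limit(limit).forEach(i -> { });
        return pulled.get();
    }

    /** Number of source elements pulled when the stream is sorted first. */
    static int pulledWithSorted(int total) {
        AtomicInteger pulled = new AtomicInteger();
        stream(countingIterator(total, pulled)).sorted().findFirst();
        return pulled.get();
    }
}
```

So the iterator-based stream itself stays lazy; it is stateful operations like sorted() that buffer everything in memory, which is the case a file-backed FStream would need to handle differently.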

Phil Ashby

Nice fix for a difficult problem Alex 👏

Was there a reason you didn't extend the existing Java collection classes to provide file-backed storage? I ask since the impact of publishing a new API might be too high for lots of legacy systems (too many references across a codebase to sanely refactor).

Alex Lunkov

The main reason is the complexity of adding such functionality to the existing Java Collections. We would need to override not only the basic collection operations add, remove and iterate (all of which would need to support storing data in and retrieving data from a file system), but also do very complicated work to make all Java "streams" methods read and write data in file storage instead of keeping it in RAM. That would be a really huge effort :) So far FStream supports only a limited number of Java stream methods, just what we needed for solving our tasks, but we will probably extend the list of supported operations in the future and eventually implement the Collection and Stream interfaces in our library.

Arnaud Dagnelies

This reminds me of github.com/dagnelies/FileMap, which I wrote long ago. It's focused on maps rather than lists, though.