If you work with large datasets stored in a Java collection, say a java.util.List, sooner or later you will hit a situation where the allocated memory is not enough to hold all your data. The collection may even contain only a small number of elements, with many of them taking up a lot of memory. Sometimes a process must accumulate a large dataset, and there is no way to process the data in chunks that fit the allocated memory. There are multiple ways to resolve the issue, from storing the data in a file to temporarily storing it in Redis.
We encountered a similar issue in a system that has been in development for many years, where it is not simple to change how data is accumulated in memory and used. A Java collection is full of large objects, and there are multiple threads in the same app for that specific activity. Of course, there are a few instances of the service that processes the data; sometimes a dataset is small, sometimes it is large, and that is really frustrating because it is not simple to choose a scaling model. In such cases I usually say it is simpler, faster, and more reliable to rewrite rather than "fine-tune", but this time we do fine-tuning :) Let's leave behind the curtain why a system operates on data that does not fit the allocated memory; it happens. Sometimes we can change an implementation in a more robust way, but sometimes we need to find a compromise with the existing solution.
We implemented a small library that stores a Java collection's data in the file system instead of RAM and provides a convenient, Java Stream-like interface for operating on that data.
Okay, the library: FStream. Its central class is FCollection, which is similar to java.util.List but with a reduced number of methods. With FCollection you can add new items, sort them with a comparator, iterate over elements, and create an instance of FStream, which is likewise a reduced version of Java Stream. With FStream you can apply sequential operations to the elements of a collection.
Example
In the following code snippet, all data is stored in a file located in a temporary directory. Data is written to the file system immediately as it is added, but it is still possible to operate on the items through FStream.
FCollection<SomeClassName> collection = FCollection.create();

// add elements to a collection
collection.add(instance);

// iterate elements of a collection
Iterator<SomeClassName> i = collection.iterator();
while (i.hasNext()) {
    consumer.accept(i.next());
}

// also iterates over all elements in a collection
collection.forEach(this::consumer);

// create a new collection
FCollection<AnotherClassName> collection2 = collection.stream()
    .filter(o -> o.isActive())
    .map(this::convert)
    .sort((o1, o2) -> o1.compareTo(o2))
    .collect();

// destroy the collections' data in file storage
collection.close();
collection2.close();
How it works
Create a collection
When a collection is created, for instance with the create method, a new file is created in the /tmp directory, or in a custom directory if one is specified.
FCollection<SomeClassName> collection = FCollection.create();
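As a rough illustration of this step, and not FStream's actual code, a backing file can be created with the JDK's Files.createTempFile. The helper name BackingFiles, the file prefix, and the fallback to the platform's default temporary directory (instead of a hard-coded /tmp) are assumptions made for this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of FStream: decides where a collection's backing file lives.
final class BackingFiles {
    private BackingFiles() {}

    // Creates an empty backing file in customDir, or in the platform's
    // default temporary directory when customDir is null.
    static Path create(Path customDir) {
        try {
            if (customDir == null) {
                return Files.createTempFile("collection-", ".bin");
            }
            Files.createDirectories(customDir);
            return Files.createTempFile(customDir, "collection-", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = create(null);
        System.out.println("Backing file: " + file);
        file.toFile().deleteOnExit();
    }
}
```

Letting the JDK pick the directory keeps the default working on systems where /tmp does not exist.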
Add items in a collection
Adding a new item to a collection consists of serializing the item and writing it to the collection's file in file storage. By default, serialization is done with FJdkSerializer, but it is possible to use a custom serializer. Customization is described below.
// add elements to a collection
collection.add(instance);
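To make the two steps (serialize, then write) concrete, here is a minimal, self-contained sketch of the same idea using only the JDK. It is not FStream's implementation: the class name DiskBackedLog, the 4-byte length header in front of each record, and the use of RandomAccessFile are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.RandomAccessFile;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical sketch: each item is stored as a 4-byte length header
// followed by its JDK-serialized bytes.
final class DiskBackedLog<T extends Serializable> implements AutoCloseable {
    private final Path file;
    private final RandomAccessFile raf;

    DiskBackedLog() {
        try {
            file = Files.createTempFile("disk-log-", ".bin");
            raf = new RandomAccessFile(file.toFile(), "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serializes the item and appends it to the end of the file;
    // the item itself is not retained on the heap.
    void add(T item) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(item);
            }
            byte[] bytes = buffer.toByteArray();
            raf.seek(raf.length());
            raf.writeInt(bytes.length);
            raf.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the records back one at a time, in insertion order.
    void forEach(Consumer<? super T> action) {
        try {
            long position = 0;
            long end = raf.length();
            while (position < end) {
                raf.seek(position);
                byte[] bytes = new byte[raf.readInt()];
                raf.readFully(bytes);
                try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                    @SuppressWarnings("unchecked")
                    T item = (T) in.readObject();
                    action.accept(item);
                }
                position += 4 + bytes.length;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Closes the file handle and deletes the backing file.
    @Override
    public void close() {
        try {
            raf.close();
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (DiskBackedLog<String> log = new DiskBackedLog<>()) {
            log.add("first");
            log.add("second");
            log.forEach(System.out::println);
        }
    }
}
```

Only one record is deserialized at a time while reading, so heap usage depends on the largest single item rather than on the size of the whole collection.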
Apply operations on a collection
The approach here is the same as with Java Stream: a developer specifies, in a functional way, the operations applied to each element of a collection. As a result, a new collection is created and stored in file storage.
FCollection<AnotherClassName> collection2 = collection.stream()
    .filter(o -> o.isActive())
    .map(this::convert)
    .sort((o1, o2) -> o1.compareTo(o2))
    .collect();
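The per-element part of such a pipeline (filter and map) can be evaluated lazily, so that only one element is in flight at a time and each result can be written to the new file as soon as it is produced. The sketch below shows that idea with plain JDK iterators. LazyPipeline is a hypothetical name and this is not how FStream is necessarily implemented; sorting is deliberately left out, because it cannot be done in a single pass.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: a pull-based pipeline that holds at most one
// pending element, whatever the size of the source.
final class LazyPipeline<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<? super T> condition;
    private T pending;
    private boolean ready;

    private LazyPipeline(Iterator<T> source, Predicate<? super T> condition) {
        this.source = source;
        this.condition = condition;
    }

    static <T> LazyPipeline<T> of(Iterator<T> source) {
        return new LazyPipeline<>(source, x -> true);
    }

    // Wraps this pipeline in a stage that skips non-matching elements.
    LazyPipeline<T> filter(Predicate<? super T> predicate) {
        return new LazyPipeline<>(this, predicate);
    }

    // Wraps this pipeline in a stage that converts each element on demand.
    <R> LazyPipeline<R> map(Function<? super T, ? extends R> mapper) {
        Iterator<T> self = this;
        Iterator<R> mapped = new Iterator<R>() {
            @Override public boolean hasNext() { return self.hasNext(); }
            @Override public R next() { return mapper.apply(self.next()); }
        };
        return new LazyPipeline<>(mapped, x -> true);
    }

    @Override
    public boolean hasNext() {
        // Pull from the source until one element passes the condition.
        while (!ready && source.hasNext()) {
            T candidate = source.next();
            if (condition.test(candidate)) {
                pending = candidate;
                ready = true;
            }
        }
        return ready;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = pending;
        pending = null;
        ready = false;
        return result;
    }

    public static void main(String[] args) {
        Iterator<Integer> lengths = LazyPipeline.of(Arrays.asList("a", "bbb", "cc").iterator())
                .filter(s -> s.length() > 1)
                .map(String::length);
        while (lengths.hasNext()) {
            System.out.println(lengths.next());
        }
    }
}
```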
Customization
So far it is possible to specify where a collection's temporary data is stored and to assign a custom serializer to a collection. A serializer must implement the FSerializer interface. A collection can then be created with a builder.
FCollection<String> c =
    FCollection.builder()
        .serializer(new CustomSerializer())
        .storageLocation("/your/location")
        .build();
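The methods of FSerializer are not shown in this post, so the following sketch uses a hypothetical stand-in interface, ItemSerializer, only to illustrate what a custom serializer for String items might look like. The real FSerializer contract may well differ; check the repository before adapting this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical contract for illustration only; FSerializer's real methods may differ.
interface ItemSerializer<T> {
    void write(T item, OutputStream out);
    T read(InputStream in);
}

// Stores each string as a 4-byte length followed by its UTF-8 bytes,
// which is typically more compact than default JDK object serialization.
final class Utf8StringSerializer implements ItemSerializer<String> {
    @Override
    public void write(String item, OutputStream out) {
        try {
            byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
            DataOutputStream data = new DataOutputStream(out);
            data.writeInt(bytes.length);
            data.write(bytes);
            data.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String read(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            byte[] bytes = new byte[data.readInt()];
            data.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ItemSerializer<String> serializer = new Utf8StringSerializer();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.write("hello", out);
        System.out.println(serializer.read(new ByteArrayInputStream(out.toByteArray())));
    }
}
```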
Want to try it out?
Visit the project's GitHub repository: https://github.com/alex-53-8/fstream
Top comments (7)
Thanks for sharing this library, I really like the idea behind it to store big amounts of data on disk instead of in-memory in a simple way during processing.
I've been playing around for a little while with the usage example program and the code in general and - if you don't mind - I'd like to share some detailed feedback. There are some speed optimizations that might be useful and I also came across some thoughts on Java Collection integration.
As I don't want to look rude by just dropping several snippets of code here in the comments out of the blue, I'd kindly ask beforehand if it's okay to do so.
It is true that the library needs optimization for speed; writing to file storage is definitely always slower than storing in memory.
I welcome improvements and thoughts, please share your opinion :) It is really interesting to me!
The execution time of the example program dropped significantly on my machine when adding the following changes:
FOutputStream.write(byte[]) to forward the given data directly to raf.write(byte[])
FOutputStream.write(byte[], int, int) to forward the given data to .write(byte[], int, int)
A similar approach works for FInputStream by implementing read(byte[], int, int), but here a little more logic needs to be added to correspond with what you've implemented in read().
And of course I've also looked into options for providing a (more or less) compatible Java Collections integration.
Similar to the unmodifiable Collections that Java itself offers, one could implement only the "read" methods of Collection and discard every write call silently or even with an exception. The only missing "read" methods, size() and isEmpty(), could be implemented by seeking through the raf in FFileStorage, counting the jumps and just jumping from header to header (and ignoring the real data in between). And isEmpty() might be something like return (raf.length() == 0L);.
And because FCollection (and the underlying FFileCollection) even offer a Java Iterator implementation, a "real" Stream could be easily derived.
Also, I had some trouble starting the program under MS Windows because the default storage path /tmp resolves to C:\tmp, which mostly does not exist, so the file cannot be created and a manual configuration must be used. Maybe the logic of java.io.File.createTempFile(...) could be somehow incorporated for the default case to handle the selection of the temporary directory.
Hi Nils
Thank you for your suggestions.
Modifying the output/input stream classes is a really good catch that increases performance. I will also replace the "/tmp" default with the system-defined temporary directory.
It could seem that implementing the "Collection" interface in "FCollection" should be simple; indeed, there are not so many methods to implement. But I would avoid using the default implementation for creating a stream out of an iterator, as it would lead to accumulating all data in memory again, especially for sorting. When I find a way to properly implement the Stream interface for FStream, I believe I will implement the Collection interface as well.
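For readers following the thread, the standard JDK route under discussion looks roughly like the sketch below (IteratorStreams is a hypothetical helper name, not something from FStream). With this approach filter and map stay lazy, but sorted() is a stateful operation that has to buffer every element before it can emit the first one, which is the memory concern raised in the reply above.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper: derives a sequential java.util.stream.Stream from any Iterator.
final class IteratorStreams {
    private IteratorStreams() {}

    static <T> Stream<T> streamOf(Iterator<T> iterator) {
        Spliterator<T> spliterator =
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
        return StreamSupport.stream(spliterator, false);
    }

    public static void main(String[] args) {
        List<Integer> result = streamOf(Arrays.asList(3, 1, 2).iterator())
                .filter(n -> n > 1)   // lazy: one element at a time
                .sorted()             // stateful: buffers all remaining elements in memory
                .collect(Collectors.toList());
        System.out.println(result);
    }
}
```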
Nice fix for a difficult problem, Alex
Was there a reason you didn't extend the existing Java collection classes to provide file-backed storage? I ask since the impact of publishing a new API might be too high for lots of legacy systems (too many references across a codebase to sanely refactor).
The main reason is the complexity of adding such functionality to the existing Java Collections. We would need not only to override the basic collection operations (add, remove, iterate), all of which would have to support storing and retrieving data in a file system, but also to do very complicated work to make all Java "streams" methods read and write data in file storage instead of RAM. That would be a really huge effort :) So far FStream supports a limited number of Java streams methods, only what we needed for our tasks, but we will probably extend the list of supported operations in the future and eventually implement the Collection and Stream interfaces in our library.
Reminds me of github.com/dagnelies/FileMap I wrote long ago. It's focused on maps rather than lists though.