If you deal with large datasets stored in a Java collection, let's say java.util.List, sooner or later you will encounter a situation that a...
Thanks for sharing this library, I really like the idea behind it to store big amounts of data on disk instead of in-memory in a simple way during processing.
I've been playing around for a little while with the usage example program and the code in general and, if you don't mind, I'd like to share some detailed feedback. There are some speed optimizations that might be useful, and I also have some thoughts on Java Collection integration.
As I don't want to look rude by just dropping several code snippets here in the comments out of the blue, I'd kindly ask beforehand if it's okay to do so.
It is true that the library needs optimization for speed; writing to file storage is definitely always slower than storing in memory.
I welcome improvements and thoughts, please share your opinion :) It is really interesting to me!
The execution time of the example program dropped significantly on my machine when adding the following changes:
- `FOutputStream.write(byte[])` to forward the given data directly to `raf.write(byte[])`
- `FOutputStream.write(byte[], int, int)` to forward the given data to the corresponding `write(byte[], int, int)` call

A similar approach works for `FInputStream` by implementing `read(byte[], int, int)`, but here a little more logic needs to be added to correspond with what you've implemented in `read()`.
And of course I've also looked into options for providing a (more or less) compatible Java Collections integration.
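To illustrate the stream changes suggested above, here is a minimal sketch of bulk forwarding to a `RandomAccessFile`. It is not the library's code: `RafOutputStream` and `RafInputStream` are hypothetical stand-ins for `FOutputStream` and `FInputStream`, and the `remaining` byte limit is only an assumed example of the "extra logic" an input stream might need to stay consistent with its single-byte `read()`.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RafStreams {

    /** Hypothetical stand-in for FOutputStream: bulk writes go to the file in one call. */
    public static final class RafOutputStream extends OutputStream {
        private final RandomAccessFile raf;

        public RafOutputStream(RandomAccessFile raf) { this.raf = raf; }

        @Override
        public void write(int b) throws IOException {
            raf.write(b); // single-byte path
        }

        @Override
        public void write(byte[] b) throws IOException {
            raf.write(b); // whole array at once instead of a per-byte loop
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            raf.write(b, off, len);
        }
    }

    /** Hypothetical stand-in for FInputStream: exposes at most `length` bytes of the file. */
    public static final class RafInputStream extends InputStream {
        private final RandomAccessFile raf;
        private long remaining; // assumed limit, mirrored in both read methods

        public RafInputStream(RandomAccessFile raf, long length) {
            this.raf = raf;
            this.remaining = length;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;
            int value = raf.read();
            if (value >= 0) remaining--;
            return value;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (len == 0) return 0;
            if (remaining <= 0) return -1;
            // Never read past the limit that read() also honors.
            int n = raf.read(b, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("raf-streams", ".bin");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            OutputStream out = new RafOutputStream(raf);
            out.write("hello world".getBytes(StandardCharsets.UTF_8));
            raf.seek(0);
            InputStream in = new RafInputStream(raf, 5); // expose only the first 5 bytes
            byte[] buffer = new byte[16];
            int n = in.read(buffer, 0, buffer.length);
            System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
        }
    }
}
```

Without these overrides, the inherited `OutputStream.write(byte[], int, int)` and `InputStream.read(byte[], int, int)` fall back to one single-byte call per element, which is where most of the time tends to go.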
Similar to the unmodifiable Collections that Java itself offers, one could implement only the "read" methods of `Collection` and discard every write call silently or even with an exception. The only missing "read" methods `size()` and `isEmpty()` could be implemented by seeking through the `raf` in `FFileStorage`, counting the jumps and just jumping from header to header (and ignoring the real data in between). And `isEmpty()` might be something like `return (raf.length() == 0L);`.
And because `FCollection` (and the underlying `FFileCollection`) even offer a Java `Iterator` implementation, a "real" `Stream` could be easily derived.
Also, I had some trouble starting the program under MS Windows because the default storage path `/tmp` resolves to `C:\tmp`, which mostly does not exist, so the file cannot be created and a manual configuration must be used. Maybe the logic of `java.io.File.createTempFile(...)` could be somehow incorporated for the default case to handle the selection of the temporary directory.
Hi Nils
Thank you for your suggestions,
It is a really good catch to modify the output/input stream classes; that increases performance. I will also replace the "/tmp" default with the system-defined temporary directory.
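One possible way to pick a platform-independent default, sketched with standard JDK calls only. The class name `DefaultStorage` and the file prefix and suffix are made up for illustration and are not part of the library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DefaultStorage {

    /** Directory the JVM reports as the platform's temporary location. */
    public static Path systemTempDir() {
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    /** Creates a uniquely named storage file in the default temporary directory. */
    public static Path newStorageFile() throws IOException {
        // Prefix and suffix are hypothetical; any naming scheme works.
        Path file = Files.createTempFile("fcollection-", ".dat");
        file.toFile().deleteOnExit(); // best-effort cleanup when the JVM exits
        return file;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("temp dir:     " + systemTempDir());
        System.out.println("storage file: " + newStorageFile());
    }
}
```

On Windows the `java.io.tmpdir` property typically points into the user profile rather than `C:\tmp`, so the directory already exists and is writable.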
It may seem that implementing the "Collection" interface in "FCollection" should be simple; indeed, there are not so many methods to implement. But I would avoid using the default implementation for creating a stream out of an iterator, as it will lead to accumulating all data in memory again, especially for sorting. When I find a way to properly implement the Stream interface for FStream, I believe I will implement the Collection interface.
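For reference, a small sketch of how the JDK derives a stream from an iterator, and which operations actually buffer. `countingIterator` is a hypothetical stand-in for a file-backed iterator; it only counts how many elements the stream pulls.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LazyIteratorStream {

    /** Wraps an Iterator in a sequential Stream without copying its elements. */
    public static <T> Stream<T> streamOf(Iterator<T> iterator) {
        Spliterator<T> split =
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
        return StreamSupport.stream(split, false);
    }

    /** Stand-in for a file-backed iterator; records how many elements were pulled. */
    public static Iterator<Integer> countingIterator(int size, AtomicInteger pulled) {
        return new Iterator<Integer>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < size;
            }

            @Override
            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                pulled.incrementAndGet();
                return next++;
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger pulled = new AtomicInteger();
        List<Integer> firstEven = streamOf(countingIterator(1_000_000, pulled))
                .filter(n -> n % 2 == 0) // stateless: handled element by element
                .limit(3)                // short-circuits once three elements passed
                .collect(Collectors.toList());
        System.out.println(firstEven + " after pulling " + pulled.get() + " elements");
    }
}
```

Stateless operations such as `filter` and `map` pull elements one at a time, so they do not load the whole source. Stateful operations such as `sorted()` do have to see every element before emitting anything, so the concern about sorting holds; that case would need something like an on-disk sort rather than the built-in one.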
Nice fix for a difficult problem Alex 👏
Was there a reason you didn't extend the existing Java collection classes to provide file-backed storage? I ask since the impact of publishing a new API might be too high for lots of legacy systems (too many references across a codebase to sanely refactor).
The main reason is the complexity of adding such functionality to the existing Java Collections. We would need to override not only the basic collection operations (add, remove, iterate), all of which would need to support storing and retrieving data in/from a file system, but also do very complicated work to support all Java "streams" methods so that they read and write data in file storage instead of keeping it in RAM. That would be a really huge effort :) So far we have a limited number of supported Java stream methods in FStream, only what we needed for solving our tasks, but we will probably extend the list of supported operations in the future and finally implement the Collection and Stream interfaces in our library.
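As a middle ground, a read-only view can be built on `java.util.AbstractCollection`, which only requires `iterator()` and `size()` and rejects `add` with an `UnsupportedOperationException` by default. The sketch below assumes a made-up record layout (a 4-byte length header followed by UTF-8 payload); the library's real file format is not known here, and `ReadOnlyFileCollection` is a hypothetical name.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.AbstractCollection;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Read-only view over records laid out as [int length][UTF-8 bytes] (assumed format). */
public class ReadOnlyFileCollection extends AbstractCollection<String> {
    private final RandomAccessFile raf;

    public ReadOnlyFileCollection(RandomAccessFile raf) { this.raf = raf; }

    @Override
    public int size() {
        try {
            long end = raf.length();
            long pos = 0;
            int count = 0;
            while (pos < end) {
                raf.seek(pos);
                pos += 4 + raf.readInt(); // hop to the next header, skipping the payload
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean isEmpty() {
        try {
            return raf.length() == 0L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private long pos = 0; // own position, so size() calls cannot disturb iteration

            @Override
            public boolean hasNext() {
                try {
                    return pos < raf.length();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                try {
                    raf.seek(pos);
                    byte[] payload = new byte[raf.readInt()];
                    raf.readFully(payload);
                    pos = raf.getFilePointer();
                    return new String(payload, StandardCharsets.UTF_8);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("records", ".bin");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            for (String s : new String[] {"alpha", "beta", "gamma"}) {
                byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
                raf.writeInt(bytes.length);
                raf.write(bytes);
            }
            ReadOnlyFileCollection view = new ReadOnlyFileCollection(raf);
            System.out.println(view.size() + " records: " + view);
        }
    }
}
```

Legacy code that only reads from a `Collection` could accept such a view unchanged, while every mutating call fails fast instead of silently touching the file. `size()` is still a full pass over the headers, so callers that need it often would want it cached.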
Reminds me of github.com/dagnelies/FileMap I wrote long ago. It's focused on maps rather than lists though.