I’ve been working on an open-source Java library designed for scalable, multi-stage image comparison. It allows you to mix and match strategies (like CRC32 checksum and perceptual hashing) to de-duplicate massive collections efficiently.
The core design is modular, so you can implement your own strategies for both grouping and comparison. For example:
- Combine `CRC32Grouper` + `PHash` + `PixelByPixel` to identify duplicates.
- Use a metadata-based `Grouper` + `PerceptualHash` to identify similar images.
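To make that concrete, here's a rough sketch of what the two extension points could look like. The names below (`GroupingStrategy`, `ComparisonStrategy`, `DimensionGrouper`) are illustrative only, not the library's exact API:

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative interfaces only; the real names and signatures may differ.
interface GroupingStrategy {
    // Cheaply partition the collection into candidate groups (e.g. by checksum),
    // so expensive comparisons only run within each group.
    List<List<BufferedImage>> group(Collection<BufferedImage> images);
}

interface ComparisonStrategy {
    // Decide whether two candidates from the same group count as duplicates.
    boolean matches(BufferedImage a, BufferedImage b);
}

// Example custom grouper: bucket images by their dimensions before any
// per-pixel or hashing work happens.
class DimensionGrouper implements GroupingStrategy {
    @Override
    public List<List<BufferedImage>> group(Collection<BufferedImage> images) {
        return new ArrayList<>(images.stream()
                .collect(Collectors.groupingBy(
                        img -> img.getWidth() + "x" + img.getHeight()))
                .values());
    }
}
```

The idea is that cheap groupers shrink the candidate set first, so the expensive comparison stages (perceptual hashing, pixel-by-pixel) only ever run within small groups.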
I’d love to hear your feedback:
- Does this approach make sense for large-scale scenarios?
- What could I improve to make it more extensible?
Here’s the repository: LINK.
If you have ideas for new features or want to contribute, feel free to open an issue or submit a PR. Any thoughts appreciated!