DEV Community: Rooted

Macros Explained for Java Developers

Rooted — Sun, 29 Jun 2025 14:45:45 +0000

If you’re a Java dev, you’ve probably used or heard of Project Lombok, Jakarta Bean Validation (JSR 380), AutoValue, MapStruct, or Immutables. They all help reduce boilerplate and add declarative magic to your code.
And I’m sure you’ve come across the term “macro”, usually explained in some academic or cryptic way. But here’s the thing: these libraries are simulating macro-like behavior — just without true macro support.

What Are Macros Anyway?

In languages like Lisp or Clojure, macros are compile-time programs that transform your code before it runs. They let you:

Rewrite or generate code
Build new control structures
Create entire domain-specific languages

They're basically code that writes code, giving you a hook into the compilation step itself.

Java’s “Macro” Workarounds

Java doesn’t support macros. Instead, it relies on annotation processors, code generation tools, and annotation-driven frameworks:

Lombok’s @Data → generates getters, setters, a constructor, toString(), and equals()/hashCode()
Jakarta Bean Validation (@Min, @NotBlank) → declarative validation
AutoValue → immutable value types
MapStruct → type-safe mappers (my personal favorite)
Immutables → generates immutable types with builders
Spring Validation → framework-driven validation

These are powerful tools — but they can’t create new syntax or change how Java works at its core. They're still working within the language, not extending it.
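To make the boilerplate concrete, here is a hand-written sketch of roughly what Lombok's @Data stands in for on a small class. The class name UserManual and its fields are hypothetical, and Lombok's real output differs in details (for instance, its equals() also goes through a canEqual() helper).

```java
import java.util.Objects;

// Hypothetical class: approximately what you would write by hand without @Data.
final class UserManual {
    private final String name; // final field -> becomes a constructor argument
    private int age;           // non-final field -> gets a setter

    UserManual(String name) {
        this.name = name;
    }

    String getName() { return name; }
    int getAge() { return age; }
    void setAge(int age) { this.age = age; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserManual)) return false;
        UserManual other = (UserManual) o;
        return age == other.age && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age);
    }

    @Override
    public String toString() {
        return "UserManual(name=" + name + ", age=" + age + ")";
    }
}
```

Every line here is mechanical, which is exactly why it is a good target for generation.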

What Real Macros Look Like

In Clojure, you can define a new data structure and its validator in a single macro:

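(defmacro defvalidated
  [name fields validations]
  ;; One explicit gensym, shared by the outer and the nested syntax-quote
  (let [errors (gensym "errors")]
    `(do
       (defrecord ~name ~fields)
       (defn ~(symbol (str "validate-" name)) [~'x]
         (let [~errors (atom [])]
           ~@(for [[field rule] validations]
               ;; Look the field up by keyword: (:name x) rather than (name x)
               `(when-not (~rule (~(keyword field) ~'x))
                  (swap! ~errors conj ~(str field " failed validation"))))
           (deref ~errors))))))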

Usage:

(defvalidated User
  [name age]
  {name not-empty
   age #(>= % 18)})

(validate-User (->User "" 15))
;; => ["name failed validation" "age failed validation"]

No annotations. No libraries. No ceremony.
Just your own language feature, built with a macro.
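For contrast, here is a rough Java counterpart of the defvalidated example. Without macros, the record-plus-validator pair has to be written out by hand and the rules are evaluated at runtime. The names ValidatedUser, RULES, and validate are hypothetical and only serve this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical hand-written equivalent of (defvalidated User ...).
final class ValidatedUser {
    final String name;
    final int age;

    ValidatedUser(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Field name -> rule; LinkedHashMap keeps the errors in declaration order.
    private static final Map<String, Predicate<ValidatedUser>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("name", u -> u.name != null && !u.name.isEmpty());
        RULES.put("age", u -> u.age >= 18);
    }

    // Collects one message per failed rule.
    static List<String> validate(ValidatedUser user) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, Predicate<ValidatedUser>> rule : RULES.entrySet()) {
            if (!rule.getValue().test(user)) {
                errors.add(rule.getKey() + " failed validation");
            }
        }
        return errors;
    }
}
```

The behavior is the same, but each new validated type needs its own class and rule map, whereas the macro generates both from one declaration.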

TL;DR

Java’s toolchain simulates macro-like behavior through annotations and codegen. But if you want to invent language, write less boilerplate, and build smarter abstractions — macros in languages like Clojure or Racket offer the real deal.

Java gives you a powerful toolkit. Macros give you the power to build your own.

Inspired by Paul Graham's essay collection "Hackers & Painters"

Simple Browser Tracking

Rooted — Sat, 03 May 2025 13:00:33 +0000

⚠️ Only the technical part is explained. If you care about legality, you're on your own. At the time of writing, this method works in Firefox and Chrome.

Browser Tracking with Fingerprints

Tracking users is a touchy topic. Should you rely on screen size? Favicon loading hacks (like this one)? Or something more exotic?

Honestly, it depends. What level of accuracy do you need? How much time are you willing to sink into it? There's a sea of FOSS libraries and SaaS platforms out there, but sometimes you don’t want the whole enterprise-grade circus—just a quick way to know if "User A" today is the same "User A" from last week. Ideally, it should also be low-maintenance and not break every time a browser sneezes.

So here's a dead-simple way to track users using browser fingerprinting. It’s not perfect, but it’s light, easy to implement, and does the job for a lot of use cases.

We're using Broprint.js — a tiny browser fingerprinting library that gives you a unique(ish) identifier based on a bunch of properties like canvas fingerprinting, user agent, timezone, etc.

🖥️ Client Side

Add this snippet to your frontend to grab a fingerprint and send it off somewhere. We’re using a CORS proxy and a GET request for simplicity, mostly because a cross-origin POST with a JSON body triggers a CORS preflight request, which is extra setup we'd rather avoid.

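const { getCurrentBrowserFingerPrint } = require('@rajesh896/broprint.js');

async function sendFingerprintToServer() {
    try {
        const fingerprint = await getCurrentBrowserFingerPrint();
        const userAgent = window.navigator.userAgent;
        const proxyUrl = 'https://corsproxy.io/?';
        const apiUrl = 'https://script.google.com/macros/s/CHANGE_ME/exec';

        // Encode each parameter on its own so spaces and ';' in the user agent survive
        const params = new URLSearchParams({ id: fingerprint, userAgent: userAgent });
        // Encode the whole target URL once more, because it is passed to the proxy as a query value
        const encodedUrl = encodeURIComponent(`${apiUrl}?${params}`);
        const fullUrl = `${proxyUrl}${encodedUrl}`;

        // Fire-and-forget request; the response body is not needed
        const xhr = new XMLHttpRequest();
        xhr.open('GET', fullUrl, true);
        xhr.send(null);
    } catch (error) {
        console.error('Error sending fingerprint to server:', error);
    }
}

// Send the fingerprint once per page load
sendFingerprintToServer();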

You could expand this to send fingerprints on every click, scroll, or form submit if you're feeling fancy. Right now, we’re just capturing the browser fingerprint and user agent. If you start adding resolution, timezone, device memory, etc., you'll get better accuracy—especially if you later combine this with some ML to group behavior.

🗃️ Server Side (Google Sheets)

Google Sheets — the poor man’s database that actually works. Here’s how to catch that fingerprint data and dump it into a spreadsheet using Google Apps Script:

var ssID = 'CHANGE_ME';
var sheetLog = 'Log';

function doGet(e) {
  try {
    // Append the data to the spreadsheet
    SpreadsheetApp.openById(ssID).getSheetByName(sheetLog).appendRow([new Date(), e.parameter.id, e.parameter.userAgent]);
    // Return a simple response
    return ContentService.createTextOutput("Success");
  } catch (err) {
    // Log error details to the spreadsheet
    SpreadsheetApp.openById(ssID).getSheetByName(sheetLog).appendRow([new Date(), err.stack || 'No stack', err.message || 'No message']);
    // Return an error response
    return ContentService.createTextOutput("Error");
  }
}

When a browser visits your site, you’ll log the timestamp, fingerprint ID, and user agent. If something goes wrong, the error gets logged too. Good enough for debugging.

📈 Basic Analysis

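Later we can analyze the logs in many different ways. One simple analysis is the number of visits per month.
For that, we create another sheet and use the following formula to map each timestamp from the Log sheet to just its month and year. The semicolons are the argument separator in locales that use a decimal comma; in other locales, use commas instead.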

=IF(Log!A1<>"";TEXT(Log!A1; "MM/YYYY"))
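If the log is ever exported from the sheet, the same per-month count can be sketched in a few lines of Java. The class name VisitsPerMonth is hypothetical, and the sketch assumes the timestamps have already been parsed into LocalDateTime values.

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts logged visits per calendar month.
final class VisitsPerMonth {
    static Map<YearMonth, Long> count(List<LocalDateTime> timestamps) {
        // TreeMap keeps the months in chronological order.
        Map<YearMonth, Long> counts = new TreeMap<>();
        for (LocalDateTime ts : timestamps) {
            counts.merge(YearMonth.from(ts), 1L, Long::sum);
        }
        return counts;
    }
}
```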

🤔 Final Thoughts

This isn’t foolproof tracking. If someone switches browsers, disables JavaScript, or uses a hardened privacy setup (such as Brave), they’ll slip through. But for casual, low-friction tracking, it’s surprisingly effective, and you can set it up in under an hour without deploying a single server.

Full-Text Search with Hibernate Search

Rooted — Sat, 12 Apr 2025 17:58:14 +0000

This article demonstrates full-text search integration using Hibernate Search in a Java 8+ application with Hibernate ORM for relational database storage.

The first section (Basics) gives a high-level overview, and the second section (DEMO) provides an example project and explains its crucial parts. The project covers a more complex use case, with custom analyzers, edgeNgram, and larger projections. Simple examples are omitted, as they can be found in the official Hibernate documentation.

⚠️ This is not a one-size-fits-all solution for every full-text search requirement. Hibernate Search is optimized for handling large datasets and high-throughput applications, and it requires running an extra, resource-hungry search engine. For other use cases, alternatives like client-side search or PostgreSQL's full-text search capabilities might be more suitable. However, these approaches are beyond the scope of this article.

Basics

This guide is based on Hibernate Search 6.1 documentation. For additional details, refer to the official documentation.

Why Hibernate Search?

Implementing full-text search in an application can be challenging, but Hibernate Search simplifies the process by offering a built-in solution that requires minimal configuration. It seamlessly integrates with powerful search engines like Elasticsearch and Lucene, enabling efficient and scalable search capabilities.

In the diagram below, the blue section represents a typical application that uses Hibernate ORM to interact with a relational database. The red section highlights the additional infrastructure required to enable full-text search with Hibernate Search.

However, introducing a search engine also means dealing with data synchronization between the database and the search index.

Synchronization Challenges & Solutions

Since Hibernate Search maintains a separate index in Elasticsearch, data must be kept synchronized. The default solution is automatic synchronization, which replicates entity changes made through Hibernate ORM to the search index in near real time. However, for some use cases, automatic synchronization may not be optimal. Instead, batch synchronization (e.g., reindexing once a day) can be more efficient.

Querying the Data

Once the data is indexed and synchronized, Hibernate Search provides two primary ways to execute search queries:

Query Elasticsearch to retrieve only the matching entity IDs, then fetch the corresponding data from the database.
Query Elasticsearch to retrieve data directly, without additional database queries (using projections).

For a more detailed explanation, check out the following presentation:

DEMO - Spring Boot Application with Full-Text Search

Netz00 / hibernate-search-6-example

Simple Spring Boot application demonstrating Hibernate Search 6 advanced usage

Hibernate Search 6 Example

Simple Spring Boot application demonstrating Hibernate Search 6 usage with Elasticsearch.

Hibernate Search 6.1.7.Final: Reference Documentation

Example app

ER diagram:

Full text search is available for Freelancer and Project entities.

Indexing Entities

3 search examples:

searchProjectsEntities demonstrates basic full text search of projects by
- project name
searchProjects demonstrates previous example with projections usage
searchFreelancers demonstrates full text search of freelancers (with projections) by
- username
- first name
- last name
- categories (M:N relationship)

Creating custom edgeNgram analyser

Running the Application

Before starting the Spring Boot application, ensure that the necessary Docker containers are running.

docker compose -f deployment/docker-compose-dev.yaml up -d

Running Tests:

newman run ./backend/src/test/postman/Hibernate-search-6-example.postman_collection.json -e ./backend/src/test/postman/Test\ Environment.postman_environment.json --reporters cli,json --reporter-json-export ./backend/src/test/postman/output/outputfile.json

Import postman collection from here

Elasticsearch browser extension: https://elasticvue.com/

This example shows how Hibernate Search fits into a Spring Boot architecture, covering everything from controllers to the search engine and back, including real-life scenarios. The following sections explain its crucial parts.

Adding Full-Text Search

For development, we configure Elasticsearch as a single-node cluster running on the same server as the application, with a single-backend configuration. RAM usage is limited to prevent excessive memory consumption. You can use ElasticVue to explore your data. It is also good practice to secure Elasticsearch by enabling security and setting a password for the default user “elastic”. Advanced security options are not included in the free Security functionality.

ℹ️ The Hibernate container configuration, Maven dependencies, and Hibernate configuration can be found in the repository.

Indexing Entities

Which data should be indexed? Hibernate Search offers annotations that allow developers to control this behavior.

To index an entity, annotate the class with @Indexed(index = "index_name"). The following annotation creates an empty index named idx_comment in Elasticsearch.

@Entity
@NoArgsConstructor
@AllArgsConstructor
@ToString
@Getter
@Setter
@Table(name = "comment")
@Indexed(index="idx_comment")
public class Comment {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "sequenceGenerator")
    @SequenceGenerator(name = "sequenceGenerator")
    @Column(name = "id")
    private Long id;

    ...
}

To map entity properties to index fields, the properties must be annotated as well. Multiple annotations on the same property are allowed. The following property annotations are explained below:

@FullTextField – For analyzed text fields (supports tokenization and filtering).
@KeywordField – For exact match searches and sorting.
@GenericField – For other data types like Long or Date.
@IndexedEmbedded – For nested objects (e.g., searching Students by Course name).

@FullTextField

Works only with String and configures the field as text. The text is analyzed before indexing and before searching. An analyzer consists of a tokenizer and filters: the tokenizer splits the string into tokens, which are then processed by the filters. For example, before indexing, the string "Thinking in Java" is tokenized to ["Thinking", "in", "Java"], and then filters can be applied, such as lowercasing all characters or removing stop words.

At search time, the same steps are applied to the query. It is also possible to configure different analyzers for indexing and for searching. If a user searches for "Learning Java", the query is tokenized to ["Learning", "Java"], and "Java" matches the stored "Java" token from "Thinking in Java", so “Thinking in Java” is returned as a result.

Text fields can’t be sorted on, but the @KeywordField annotation described next solves that problem.
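The analysis steps can be illustrated with a toy analyzer in plain Java. This is only a sketch of the idea, not how Elasticsearch or Lucene implement analysis: the class ToyAnalyzer, its stop-word list, and the "any shared token counts as a match" rule are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: whitespace tokenizer + lowercase filter + stop-word filter.
final class ToyAnalyzer {
    // Hypothetical stop-word list, kept tiny on purpose.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("in", "the", "a", "of"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {       // tokenizer
            String token = raw.toLowerCase(Locale.ROOT);      // lowercase filter
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) { // stop-word filter
                tokens.add(token);
            }
        }
        return tokens;
    }

    // The query matches when at least one analyzed query token equals an indexed token.
    static boolean matches(String indexedText, String query) {
        Set<String> indexed = new HashSet<>(analyze(indexedText));
        for (String token : analyze(query)) {
            if (indexed.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, "Thinking in Java" is indexed as the tokens thinking and java, so the query "Learning Java" matches through the shared java token.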

It is possible to build custom analyzers by combining a specific tokenizer with filters. Besides the whitespace tokenizer and lowercase filter, many others are available here.
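The demo project's custom analyzer relies on an edge n-gram filter. Conceptually, such a filter emits the prefixes of each token, so that a partial input like "jav" can match "java". The following plain-Java sketch shows only that idea; the class EdgeNgrams and its parameter names are hypothetical and this is not the search engine's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative edge n-gram generator.
final class EdgeNgrams {
    // Returns the prefixes of token with lengths minGram..maxGram,
    // capped at the token length.
    static List<String> of(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }
}
```

Indexing every prefix makes the index larger, which is the usual trade-off for fast search-as-you-type matching.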

@KeywordField

Works only with String and configures the field as a keyword. Only normalizers can be applied to keyword fields (no analyzers). Normalizers are similar to analyzers but do not tokenize.
That means before indexing, the string “Thinking in Java” can only be normalized and is stored as a single keyword. The search term is normalized in the same way, so the previous "Learning Java" query wouldn’t match. This type is useful for sorting. A property can also be mapped as both a keyword field and a full-text field.

@GenericField

A good default choice that will work for every property type with built-in support.
In the example it is used for Date and Long (primary key).

@IndexedEmbedded

The @IndexedEmbedded annotation is used to include fields from associated entities in the search index of the owning entity. This enables searching across nested object fields.

For example, consider an entity Student with a @ManyToMany association to a Course entity. By using @IndexedEmbedded on the courses field, you can perform a search for Student entities based on the name of the associated Course.

This annotation works with various types of associations, including @OneToOne, @OneToMany, and @ManyToMany.

It is not necessary to annotate the associated (nested) entity with @Indexed, unless you also want to index it independently. The following example demonstrates this usage:

@Indexed(index = "idx_student")
public class Student {
    ...
    @ManyToMany(cascade = CascadeType.ALL, fetch = FetchType.LAZY)
    @JoinTable(
    name = "student_courses", 
    joinColumns = @JoinColumn(name = "student_id"), 
    inverseJoinColumns = @JoinColumn(name = "course_id"))
    @IndexedEmbedded(name = "courses", includePaths = {"name"})
    private Set<Course> courses = new HashSet<>();
    ...
}

public class Course {
    ...
    @Column(name = "name")
    @KeywordField(name = "name", normalizer = "lowercase", projectable = Projectable.YES)
    private String name;
    ...
}

Hibernate Search tracks changes at the field level to decide whether an entity needs reindexing. For example, updating non-indexed fields does not trigger reindexing, which optimizes performance.

More annotations and explanations can be found here.

Handling Sync Issues with MassIndexer

In edge cases, such as I/O failures after data is stored in the database, the database and search index may go out of sync. One solution is to use MassIndexer, which reindexes all data.

In the example project, this process is automated via a scheduled job, ensuring that data remains in sync.

Searching with Hibernate Search

There are two main ways to fetch search results:

1. Fetching Data Directly from Elasticsearch (Used in DEMO)

This approach uses projections and skips the database, retrieving only indexed data. It requires adding the projectable = Projectable.YES property to the annotated fields.

Pros:

Faster search results.
Reduces database load.

Cons:

Data stored in Elasticsearch must be structured properly.
More complex implementation (requires extra mappings and domain objects).

2. Fetching IDs First, Then Retrieving Data from DB

In this method, only the entity IDs are retrieved from Elasticsearch, and the actual data is fetched from the database.

Pros:

Ensures database integrity.
Simpler to implement.
Indexing only the required fields and letting the database handle the rest can improve performance (search engines are optimized for searching, not for updating).

Cons:

Requires an additional database round-trip.

Thank you for reading! 🚀

Web scraping: Silent and Maintainable

Rooted — Fri, 28 Mar 2025 19:39:23 +0000

With size, complexity emerges.

Silent Scraping

While writing the scraper, we will first hide behind a VPN or proxy. Then we are going to scrape the target a significant number of times until we are satisfied with the results. But in the meantime, we’ll get blocked, then try another IP—which doesn't work... Then some sunbeam will hit the Lava Lamp in Cloudflare, and we’ll start receiving captchas... Solving a problem that maybe doesn’t even need solving. Why? Because during development, we’ll mostly abuse the targeted site, while in production, scraping might only run peacefully once a day. Also, our free proxy or VPN will throttle, causing delays for each execution.

This issue can be easily solved by caching the site during development. A scraping framework such as Scrapy already includes caching out of the box (its HTTP cache middleware, enabled with HTTPCACHE_ENABLED = True). Otherwise, we could use Nginx to cache our requests.

This is a straightforward way to develop your scraper without raising red flags with suspicious requests while adjusting headers to circumvent anti-scraping measures. Also, the site data is cached locally—no more network issues or delays.
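The caching idea itself is independent of any framework. As a minimal sketch in plain Java (not Scrapy code), a development-time cache can map each URL to a local file and only hit the network on a miss. The class DevPageCache, the fileNameFor mapping, and the injected fetcher function are hypothetical names for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

// Hypothetical development-time page cache; not part of Scrapy or any library.
final class DevPageCache {
    private final Path cacheDir;
    private final Function<String, String> fetcher; // performs the real request

    DevPageCache(Path cacheDir, Function<String, String> fetcher) {
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
    }

    // Maps a URL to a flat file name. Distinct URLs can collide,
    // which is acceptable for a sketch but not for production use.
    static String fileNameFor(String url) {
        return url.replaceAll("[^A-Za-z0-9]+", "_") + ".html";
    }

    // Returns the cached body if present; otherwise fetches once and stores it.
    String get(String url) {
        Path file = cacheDir.resolve(fileNameFor(url));
        try {
            if (Files.exists(file)) {
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            }
            String body = fetcher.apply(url);
            Files.createDirectories(cacheDir);
            Files.write(file, body.getBytes(StandardCharsets.UTF_8));
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the fetcher is injected, the target site is contacted once per URL during development, and every later run reads from disk.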

Maintainable scraper

Regression Tests

Using a cached version brings another benefit: the site data becomes immutable, and it’s a lot easier to hit a static target. If it moves, we can patch the code in the next version, but during development it won’t change and won’t cause more bugs than we already have.

This approach can be extended to the testing level. Let's store the current version of the scraper along with the site data it works with in VCS.

That way, maintaining the scraper when the target site changes becomes easier. We simply diff (automated or manual) the stored version of the site against the live one. From the diff, we know what changed—and where to fix the scraper. Finally, we store the expected results, and we’ve got ourselves a regression test suite.
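A regression check of this kind can be sketched in plain Java: run the extraction logic against the stored snapshot and compare the output with the stored expectations. The class TitleScraper, the regex "selector", and the h2 markup are hypothetical; a real scraper would use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scraper logic plus a snapshot-based regression check.
final class TitleScraper {
    // Assumed markup: titles live in <h2 class="title"> elements.
    private static final Pattern TITLE =
            Pattern.compile("<h2 class=\"title\">(.*?)</h2>");

    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    // The scraper, run against the stored snapshot, must reproduce the stored results.
    static boolean passes(String storedHtml, List<String> expectedTitles) {
        return extractTitles(storedHtml).equals(expectedTitles);
    }
}
```

When the live site changes its markup, rerunning the same check against a fresh snapshot fails, which points directly at the selector that needs fixing.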

Of course, this increases the complexity of the scraper and requires extra effort upfront. In the case of Scrapy, this can be done in such an elegant way that the added complexity is manageable. But ultimately, it depends on the context and the answers to key questions—such as the estimated lifetime of the scraper, number of targets, scraping frequency, and how often the targeted site changes.

Mitigating pre-rendering JS

However, in real life, we’re mostly dealing with fat clients using client-side rendering. Pre-rendered sites are either ancient relics or cutting-edge setups optimized for SEO (and scraping).

The fail-safe—but also most expensive—approach is using headless browsers. But rendering that mess is slow, resource-hungry, and most importantly, often avoidable.

We can often skip full JS rendering by simply fetching only the data we need. A basic analysis of the requests the site makes will quickly reveal the ones we're interested in scraping.

It might take a few steps to get there—for example, we first extract IDs from URI_1, then generate a list of endpoints like [URI_2_ID_1, URI_2_ID_2, ...] to fetch the actual data.
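The two-step flow can be sketched as follows. The sketch assumes the first response is JSON-like text containing entries such as "id": 123, and that detail endpoints follow a base-URL-plus-ID pattern; both are assumptions, as are the names EndpointPlanner, extractIds, and detailUrls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical planner: IDs from a listing response -> list of detail endpoints.
final class EndpointPlanner {
    // Assumes numeric IDs that appear as "id": 123 in the listing body.
    private static final Pattern ID = Pattern.compile("\"id\"\\s*:\\s*(\\d+)");

    static List<String> extractIds(String listingBody) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(listingBody);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    // Builds one detail URL per ID, in the same order as the IDs.
    static List<String> detailUrls(String baseUrl, List<String> ids) {
        List<String> urls = new ArrayList<>();
        for (String id : ids) {
            urls.add(baseUrl + "/" + id);
        }
        return urls;
    }
}
```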

Some may argue this is more fragile than rendering the site and scraping the DOM. But I don’t see a strong reason why API endpoints would change more frequently than the HTML selectors in the rendered case. We're also closer to the actual data source, which means fewer moving parts and less that could break the scraper.

Source - Mitigating pre-rendering JS.

Scrapy Implementation Example

One solution is to use a downloader middleware. This way, our spiders don’t need to be aware of it. The spider requests https://dev.to, and inside the downloader middleware, we simply map that URL to a local file where the site is stored and forward the request.

This setup can be extended by using an env variable for dev / prod modes, allowing us to include the middleware conditionally in the settings.

Storing the site can be as simple as using CTRL + S, or handled through an extra mode like init, which scrapes the sites and saves them with filenames mapped from their URLs.

Now, one could simply extend this with tests and the init mode if necessary, but explaining that wouldn't add much value at this point. Stopping here.

Downloader Middleware example

Mapper function example