DEV Community: Dmitry Protsenko

Spring Data JPA Best Practices: Repositories Design Guide

Dmitry Protsenko — Mon, 17 Nov 2025 14:23:26 +0000

In this series of articles, I'm sharing my view on refactoring a large legacy codebase that employed many poor practices. To address these issues and help my former colleagues develop better Spring Data JPA repositories, I wrote this guide to promote good development practices. The guide has been updated and completely rewritten to use the latest features of Spring Data JPA.

Some of the examples may seem obvious, but that's only true from an experienced perspective. They're real examples from a production codebase.

Keep in mind that this series describes the latest version of Spring Data JPA, so some details may differ in older versions; I'll highlight such nuances along the way.

1 Designing Spring Data JPA repositories
2 Working with queries in repositories
3 Spring Data JPA projections
4 Using repository methods effectively
5 Stored procedures in repositories
6 Spring Data JPA Repositories Cheatsheet
- 6.1 What Spring Data JPA repositories to choose?
- 6.2 How to query data with Spring Data JPA?
- 6.3 Best ways to use Spring Data JPA Projections
- 6.4 Short notes on how to use queries effectively
- 6.5 Calling stored procedures from Spring Data JPA
At the end

1 Designing Spring Data JPA repositories

Spring Data JPA provides several repository interfaces with predefined methods for fetching data. I'll mention only the interesting ones:

The `Repository<T, ID>` interface, the father of Spring Data interfaces, is a marker interface for discovery. It has no methods. When using it, you define only what you want.
The CrudRepository interface adds basic CRUD methods for faster development, and its twin ListCrudRepository does the same but returns List instead of Iterable.
The PagingAndSortingRepository adds pagination and sorting only, and it has a twin that returns List. Guess what it's called? You're right: ListPagingAndSortingRepository.
The JpaRepository is my favorite: it extends the List-returning variants of the previous interfaces. Most of the time, it's the only interface I use.

When should you use Repository, JpaRepository, or something in between? I believe that if you need a strict API for other developers, you can extend Repository and declare only the necessary operations rather than exposing the entire set of CRUD operations, which could compromise your logic. Use JpaRepository when you don't need such access limitations and want faster development.

As an example of API limitations, you may sometimes need to work with logic stored in the database: numerous stored procedures, nuances in logic, and more. In that situation, manipulating table entities directly could lead to unpredictable behavior. So you design only the JPA entities and an otherwise empty repository interface that declares just the required query methods. This approach signals to other developers that they should add the specific methods they need rather than manipulate raw entities.
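A restricted repository of this kind might look like the sketch below. It assumes the Company entity used elsewhere in this article; the interface name CompanyLookupRepository and the name attribute are hypothetical and only serve as an illustration.

```java
import java.util.List;
import java.util.Optional;

import org.springframework.data.repository.Repository;

// Hypothetical read-only repository: extending the marker interface exposes
// nothing by default, so callers get only the methods declared here.
public interface CompanyLookupRepository extends Repository<Company, Long> {

    // Same signature as CrudRepository.findById, so Spring Data can route it
    // to the default implementation.
    Optional<Company> findById(Long id);

    // Derived query; assumes Company has a 'name' attribute.
    List<Company> findByNameContainingIgnoreCase(String fragment);
}
```

Because save() and delete() are not declared, other developers cannot modify Company rows through this repository without consciously adding a method for it.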

Here is actually one more interesting thing that comes from Spring Data JPA repositories. Methods you inherit from CrudRepository/JpaRepository are transactional by default: reads run with @Transactional(readOnly = true), writes with a regular @Transactional.

You usually don't need Spring Framework's @Repository annotation (don't confuse it with Spring Data's Repository interface) on the interface, because discovery is automatic. For reusable base interfaces, annotate the base interface with @NoRepositoryBean.

Extending one of these interfaces informs Spring Data JPA that it should generate an implementation for your interface. For example:

public interface CompanyRepository extends JpaRepository<Company, Long> {
    // custom methods will be added here
}

2 Working with queries in repositories

There are two primary ways to query data using Spring Data JPA repositories. There are actually more, but let's focus on the most popular ones (IMO).

Derive the query from the method name. Spring parses the method name and generates the appropriate JPQL. This speeds up development and is intuitive for simple conditions.
Write the query explicitly with the @Query annotation. This approach is more flexible, allowing you to use JPQL or native SQL. In the latest versions of Spring Data JPA, you can use the @NativeQuery annotation instead of passing nativeQuery = true.

For data-modifying queries (UPDATE/DELETE), add @Modifying and make sure there is a transactional boundary: either annotate the repository method or interface with @Transactional, or call it from a @Transactional service.

Example methods using both approaches:

// Derived query
List<Employee> findByDepartmentIdAndActiveTrue(Long departmentId);

// Explicit JPQL query
@Query("SELECT e FROM Employee e WHERE e.department.id = :deptId AND e.active = true")
List<Employee> findActiveEmployees(@Param("deptId") Long departmentId);

// Native SQL query
@Modifying
@Transactional
@NativeQuery(value = "UPDATE employee SET active = false WHERE id = :id")
void deactivateEmployee(@Param("id") Long id);

In the example above, the first two methods are select queries. The last one is an update (deactivation), serving a different purpose than the selects.

The first approach shortens the time required to develop queries and is intuitive. The second approach provides additional capabilities, allowing you to write queries in either JPQL or native SQL.

Inherited data-modifying methods are marked @Transactional by default, as mentioned before. For custom modifying queries, annotate with @Modifying and ensure a transactional boundary is present (on the method or class, or at the service layer).

3 Spring Data JPA projections

Exposing raw database entities to users may be impractical or unsafe. It may be acceptable to retrieve the full entity and work with it inside the application, but it's better to adapt your queries to return only the necessary information.

To address this, use Spring Data JPA projections, which let you define how data from the database is presented. A projection returns only the selected attributes needed by the caller.

Spring Data JPA provides the following types of projections:

Projections defined via interfaces, also known as interface-based projections.
Projections to DTO objects. Read the guide about developing DTOs in a series of articles about Spring Data JPA.
Dynamic projections.

Interface-based projection allows you to create read-only projections for safely presenting data from the database. This approach is typically used when there is no need to manipulate the created object, and it is required only for displaying data. Note that accessing nested properties can result in joins and additional queries, so projections are not always faster than fetching entities. Always check the generated SQL to ensure optimal performance.

For example, an interface-based Spring Data JPA projection:

public interface EmployeeView {
    String getFirstName();
    String getLastName();
    BigDecimal getSalary();
}

List<EmployeeView> findBySalaryGreaterThan(BigDecimal amount);

DTO-based projection enables projecting onto Java classes, allowing you to work with concrete DTO objects rather than interfaces. For derived query methods, Spring can map results to a DTO via its constructor, while a JPQL @Query requires a constructor expression. Class-based projections require a single all-args constructor; if there are multiple constructors, annotate the intended one with @PersistenceCreator.

public class EmployeeDto {
    private final String firstName;
    private final String lastName;
    private final BigDecimal salary;
    public String getFirstName() { return firstName; }
    public String getLastName() { return lastName; }
    public BigDecimal getSalary() { return salary; }

    public EmployeeDto(String firstName, String lastName, BigDecimal salary) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.salary = salary;
    }
}

@Query("SELECT new com.example.EmployeeDto(e.firstName, e.lastName, e.salary) FROM Employee e WHERE e.salary > :amount")
List<EmployeeDto> findHighEarningEmployees(@Param("amount") BigDecimal amount);

You can use dynamic projections with repositories to expose a generic method, allowing the caller to choose the projection type at runtime. The Class parameter selects the projection type. If you need to pass a Class into the query itself, use a different parameter so it is not consumed as the projection selector.

When using DTO classes with dynamic projections, ensure the query supplies the constructor arguments (for example, via a JPQL constructor expression); otherwise, the call will fail at runtime.

<T> List<T> findBySalaryGreaterThan(BigDecimal amount, Class<T> type);

// Usage:

repo.findBySalaryGreaterThan(new BigDecimal("1000"), EmployeeView.class); // interface projection

repo.findBySalaryGreaterThan(new BigDecimal("1000"), EmployeeDto.class); // DTO class (with suitable query)

4 Using repository methods effectively

Repository CRUD methods run in a transaction by default (readOnly = true for reads, regular for writes), as mentioned before. The next point about transactions is to avoid opening them manually at call sites.

When performing operations on multiple entities, prefer batch methods such as saveAll() instead of calling save() in a loop. Grouping the writes this way is the first step toward fewer database round-trips.

Note, however, that saveAll() does not issue a single SQL statement by itself. To actually reduce round-trips, enable JDBC batching (for example, spring.jpa.properties.hibernate.jdbc.batch_size=50 and often hibernate.order_inserts=true/hibernate.order_updates=true). Avoid GenerationType.IDENTITY if you need insert batching, and for very large batches, call flush()/clear() periodically.
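Put together, the batching settings mentioned above could look like this in application.properties. The batch size of 50 is only the example value from the text, not a recommendation; tune it for your workload.

```properties
# Send up to 50 statements per JDBC batch (example value).
spring.jpa.properties.hibernate.jdbc.batch_size=50
# Group statements by entity type so they can share a batch.
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
```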

Whenever possible, combine logic into a single query rather than performing multiple queries in Java. In some cases, it is more efficient to offload part of an algorithm to the database using SQL.

For large result sets, use pagination. Page<T> returns content plus totals and triggers a count query (for custom @Query supply a countQuery), Slice<T> returns content and whether there is a next slice without a count query, and a List<T> with a Pageable parameter applies limit/offset but gives no metadata.

// 1) Derived query with Page and sorting
interface UserRepository extends JpaRepository<User, Long> {
    Page<User> findByActive(boolean active, Pageable pageable);
}

// Usage:
Pageable pageable = PageRequest.of(0, 20, Sort.by("createdAt").descending());
Page<User> page = userRepository.findByActive(true, pageable);
List<User> users = page.getContent();
long total = page.getTotalElements();
boolean last = page.isLast();

// 2) Derived query with Slice for infinite scroll (no count query)
interface UserRepository extends JpaRepository<User, Long> {
    Slice<User> findByActive(boolean active, Pageable pageable);
}
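To see why a Slice can skip the count query, here is a plain-Java sketch of the idea: read one row more than the requested size and use the extra row only as a "has next" signal. To my understanding this is the idea behind Slice, but the code is an in-memory illustration, not Spring Data's implementation; SliceSketch and SimpleSlice are hypothetical names.

```java
import java.util.List;

final class SliceSketch {

    // Hypothetical stand-in for a slice: content plus a "has next" flag, no totals.
    record SimpleSlice<T>(List<T> content, boolean hasNext) {}

    // Reads up to size + 1 rows; the extra row only signals that another slice exists.
    static <T> SimpleSlice<T> slice(List<T> rows, int page, int size) {
        int from = Math.min(page * size, rows.size());
        int to = Math.min(from + size + 1, rows.size()); // comparable to "LIMIT size + 1"
        List<T> fetched = rows.subList(from, to);
        boolean hasNext = fetched.size() > size;
        List<T> content = hasNext ? fetched.subList(0, size) : fetched;
        return new SimpleSlice<>(List.copyOf(content), hasNext);
    }

    public static void main(String[] args) {
        List<Integer> rows = List.of(1, 2, 3, 4, 5);
        System.out.println(slice(rows, 0, 2)); // first slice: two items, more to come
        System.out.println(slice(rows, 2, 2)); // last slice: one item, nothing after it
    }
}
```

A Page, by contrast, also needs the total number of rows, which is why it costs an additional count query.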

5 Stored procedures in repositories

While developing a database-oriented app, you could use Spring Data JPA to call stored procedures defined in your database. There are different ways to do it.

The first method is to use NamedStoredProcedureQuery:

Declare it on an entity with @NamedStoredProcedureQuery, specifying:
- name – the identifier used by JPA,
- procedureName – the actual name of the procedure in the database,
- parameters – an array of @StoredProcedureParameter objects defining each parameter’s mode (IN/OUT), name and Java type.
Add a method to your repository annotated with @Procedure and referencing the declared name.

For multiple OUT parameters, Spring Data JPA can return a Map<String,Object> when the call is backed by a @NamedStoredProcedureQuery. For a single OUT parameter, you can return that value directly. There's also outputParameterName on @Procedure for targeting a specific OUT parameter.

Example declaration on an entity:

@NamedStoredProcedureQuery(
    name = "Employee.raiseSalary",
    procedureName = "raise_employee_salary",
    parameters = {
        @StoredProcedureParameter(mode = ParameterMode.IN,  name = "in_employee_id", type = Long.class),
        @StoredProcedureParameter(mode = ParameterMode.IN,  name = "in_increase",    type = BigDecimal.class),
        @StoredProcedureParameter(mode = ParameterMode.OUT, name = "out_new_salary", type = BigDecimal.class)
    }
)
@Entity
public class Employee { … }

Repository method:

@Procedure(name = "Employee.raiseSalary")
BigDecimal raiseSalary(@Param("in_employee_id") Long id,
                       @Param("in_increase")    BigDecimal increase);

The second method is to call a procedure without defining JPA metadata by using @Procedure(procedureName = "…") directly on the repository method, or even call it via @Query(value = "CALL proc(:arg…)", nativeQuery = true).
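A sketch of the direct variant, reusing the raise_employee_salary procedure from the example above. Treat it as an assumption-laden illustration: the repository name is hypothetical, and whether parameters are bound by name or by position can depend on your database and JDBC driver.

```java
import java.math.BigDecimal;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.query.Procedure;
import org.springframework.data.repository.query.Param;

// Hypothetical repository; no @NamedStoredProcedureQuery is declared on the entity.
public interface EmployeeProcedureRepository extends JpaRepository<Employee, Long> {

    // procedureName points at the database procedure directly;
    // outputParameterName selects which OUT parameter becomes the return value.
    @Procedure(procedureName = "raise_employee_salary", outputParameterName = "out_new_salary")
    BigDecimal raiseSalary(@Param("in_employee_id") Long id,
                           @Param("in_increase") BigDecimal increase);
}
```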

There is actually one more, less canonical method: using the entity manager to call stored procedures. This article doesn't cover it, because it will be the subject of the next and last article in this series.

6 Spring Data JPA Repositories Cheatsheet

To briefly sum up this design guide, you can use the following cheatsheet.

6.1 What Spring Data JPA repositories to choose?

Interfaces to extend

Repository<T, ID> — marker only; you define every method yourself.
CrudRepository<T, ID> — basic CRUD; collections return Iterable.
ListCrudRepository<T, ID> — like CrudRepository, but collections return List.
PagingAndSortingRepository<T, ID> — adds paging & sorting.
ListPagingAndSortingRepository<T, ID> — list-returning twin.
JpaRepository<T, ID> — all of the above + JPA niceties (flush, batch deletes, etc.). Default choice in most apps.

When to pick what

Need a strict, minimal API? Extend Repository (or a slim base) and expose only allowed methods.
Need speed of development? Extend JpaRepository.

Discovery & base config

@Repository is not required on repository interfaces; Spring detects them by type.
For reusable base interfaces, annotate with @NoRepositoryBean.
Default implementation is backed by SimpleJpaRepository.

Transactions (defaults)

Defaults apply to inherited CRUD methods: reads use @Transactional(readOnly = true), writes use regular @Transactional.
Your own query methods (derived names or @Query) are not transactional by default; annotate them or invoke from a transactional service.

6.2 How to query data with Spring Data JPA?

Two core approaches

Derived queries (by method name) for simple conditions.
Explicit queries with @Query (JPQL) or native queries via either @Query(..., nativeQuery = true) or @NativeQuery (modern shortcut; supports extras like sqlResultSetMapping).

Modifying queries

Add @Modifying and ensure a transactional boundary (@Transactional on method/class or call from a transactional service).

Paging with custom queries

With Page<T> and complex JPQL/native queries, supply an explicit countQuery (or countProjection) to avoid brittle auto-counts.

6.3 Best ways to use Spring Data JPA Projections

Types

Interface-based projections — read-only views for safe data presentation.
DTO/class-based projections — map to a class with a single all-args constructor (use @PersistenceCreator if multiple constructors exist).
Dynamic projections — expose a generic method and let callers pass Class<T> to choose the projection type at runtime.

Notes

Accessing nested properties in projections can trigger joins. Projections aren’t automatically faster than entities. Inspect the SQL and returned columns, and measure query performance.
When using DTOs with dynamic projections, ensure the query provides constructor args (e.g., via a JPQL constructor expression).

6.4 Short notes on how to use queries effectively

Batching & round-trips

Prefer saveAll(...) over repeated save(...).
Avoid GenerationType.IDENTITY if you need insert batching. Prefer sequences/pooled optimizers.
For very large batches, periodically call flush()/clear().

Let the DB work

Push set-based logic into single queries instead of multi-step Java loops where possible.

Paging options

Page<T> — content + totals (triggers count query).
Slice<T> — content + “has next” (no count query, good for infinite scroll).
List<T> with a Pageable parameter — applies limit/offset, no metadata.

6.5 Calling stored procedures from Spring Data JPA

Approaches

Named stored procedure: declare on an entity with @NamedStoredProcedureQuery, then call via repository method annotated with @Procedure(name = "...").
Direct call (no entity metadata): use @Procedure(procedureName = "...") on the repository method, or call using @Query(value = "CALL ...", nativeQuery = true).

Outputs

Multiple OUT params (with a named stored procedure) can be returned as Map<String,Object>.
Single OUT can be returned directly, or target a specific one using outputParameterName on @Procedure.

At the end

I hope you find this article helpful. The next articles in the series will be published soon, so connect with me on LinkedIn to stay informed about them. If you missed the previous article, read my “Spring Data JPA Best Practices: Entity Design Guide”.

Bye!

Spring Data JPA Best Practices: Entity Design Guide

Dmitry Protsenko — Wed, 05 Nov 2025 16:36:46 +0000

You can enjoy the original version of this article on my website, protsenko.dev.

This series of articles was written as part of my review of a large legacy code base that employed many poor practices. To address these issues, I created this guide to promote Spring Data JPA Best Practices in designing entities among my former colleagues.

It's time to dust off the guide, update it, and publish it for a broader audience. The guide is extensive, so I decided to break it down into three separate articles.

1 Diving deep into Spring Data JPA

For convenient and rapid development of database‑driven software, it is recommended to use the following libraries and frameworks:

Spring Boot — simplifies building web applications on top of the Spring Framework by providing auto-configuration, starter dependencies, and opinionated defaults (e.g., embedded server, Actuator). It leverages Spring’s existing dependency-injection model rather than introducing a new one.
Spring Data JPA – saves time when creating repositories for database operations. It provides ready‑made interfaces for CRUD operations, transaction management, and query definition via annotations or method names. Another advantage is its integration with the Spring context, along with the corresponding benefits of dependency injection.
Lombok – reduces boilerplate by generating getters, setters, and other repetitive code.

Entities represent rows of a database table. They are plain Java objects annotated with @Entity and other JPA annotations. DTOs (Data Transfer Objects) are plain Java objects used for presenting data in a limited or transformed form compared to the underlying entity.

In a Spring application, a repository is a special interface that provides access to the database/data. Such repositories are typically annotated with @Repository, but you don't have to add it when you extend JpaRepository, CrudRepository, or another Spring Data JPA repository interface. If you don’t extend a Spring Data base interface, you can use @RepositoryDefinition. Also, use @NoRepositoryBean on shared base interfaces.

A service is a special class that encapsulates business logic and functionality. Controllers are the endpoints of your application; users interact with controllers, which in turn inject services rather than repositories.

For clarity, your project should be organized into packages, for example by responsibility. Code organization is a broad topic, and the right layout always depends on your service, team conventions, and so on. The example below represents a microservice with a single business domain:

entity – database entities,
repository – data access repositories,
service – services, including wrappers for stored procedures,
controller – application endpoints,
dtos – DTO classes.

Connection to the database is auto-configured when a Spring Boot application starts, based on application.properties/application.yml. Common properties include:

spring.datasource.url – database connection URL
spring.datasource.driver-class-name – database driver class; Spring Boot can often infer it from the JDBC URL, so set it only if inference fails.
spring.jpa.database-platform – SQL dialect to use
spring.jpa.hibernate.ddl-auto – how Hibernate should create a database schema, available values: none|validate|update|create|create-drop

2 Developing entities with Spring Data JPA

When designing software that interacts with a database, simple Java objects, properly used with Java Persistence API (JPA) annotations, play a crucial role. Such objects typically contain fields that map to table columns and are referred to as entities. Not every field maps one-to-one: relationships, embedded value objects, and @Transient fields are common.

At a minimum, an entity class must be annotated with @Entity to mark the class as a database entity and must declare a primary key with @Id or @EmbeddedId. JPA also requires a no-arg constructor (public or protected).

When using the @Entity annotation, prefer to set the name attribute, because this name is used in JPQL queries. If you omit it, JPQL uses the simple class name; setting it explicitly decouples queries from class renames.

Another useful annotation is @Table. It is optional, but it is good practice to include it to define the target table explicitly, and you need it when the table name differs from what the naming strategy would produce.

The following examples demonstrate bad and good usage:

@Entity
@Table(name = "COMPANY")
public class CompanyEntity {
    // fields omitted
}

// Later:
Query q = entityManager.createQuery("FROM " + CompanyEntity.class.getSimpleName() + " c");

Here, the name attribute is missing on @Entity, so the class name is used in queries. This can lead to fragile code when refactoring. There is another problem: the code uses the entityManager instead of a preconfigured Spring Data JPA repository. The entityManager provides more flexibility, but it also makes it easy to create a mess in the codebase instead of using the preferred ways to fetch data.

Did you catch one more bad practice here? It's the concatenation of strings to build a query. In this case it wouldn't lead to SQL injection, but it's best to avoid this approach, especially if you pass user input into a query like this.
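When a value does have to reach the query, a named parameter keeps it out of the query string. The sketch below assumes the jakarta.persistence API and a hypothetical name attribute on CompanyEntity; the CompanyQueries helper is also made up for illustration.

```java
import java.util.List;

import jakarta.persistence.EntityManager;

// Hypothetical helper showing parameter binding instead of string concatenation.
final class CompanyQueries {

    static List<CompanyEntity> findByName(EntityManager entityManager, String name) {
        return entityManager
                // The query text is constant; user input never becomes part of it.
                .createQuery("SELECT c FROM CompanyEntity c WHERE c.name = :name", CompanyEntity.class)
                .setParameter("name", name) // bound as a value, not parsed as JPQL
                .getResultList();
    }
}
```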

@Entity(name = "Company")
@Table(name = "COMPANY")
public class CompanyEntity {
    // fields omitted
}

// Later:
Query q = entityManager.createQuery("FROM Company c");

In the improved version, the entity name is explicitly specified, so JPQL queries can refer to the entity by name instead of relying on the class name.

Note: the JPQL entity name and the physical table name in @Table are independent concepts.

3 Avoiding magic numbers/literals

Choose the type of your fields wisely:

If a field represents a numeric enumeration, use Integer or an appropriately small numeric type.
Base the type on the domain range and nullability (use wrapper types, such as Integer, if the column can be null), and remember that smaller numeric types rarely yield real benefits in JPA.
If a value is monetary or requires precision, use BigDecimal with appropriate precision/scale.
Enums are covered in detail later in this section.
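The BigDecimal advice is easy to check in plain Java. The sketch below compares adding 0.10 and 0.20 as double and as BigDecimal; the class and method names are made up for the demonstration.

```java
import java.math.BigDecimal;

final class MoneyPrecisionDemo {

    // Adds the amounts using binary floating point.
    static double sumWithDouble() {
        return 0.10 + 0.20;
    }

    // Adds the same amounts as exact decimals (built from strings, not doubles).
    static BigDecimal sumWithBigDecimal() {
        return new BigDecimal("0.10").add(new BigDecimal("0.20"));
    }

    public static void main(String[] args) {
        System.out.println(sumWithDouble() == 0.30);                            // false
        System.out.println(sumWithBigDecimal().equals(new BigDecimal("0.30"))); // true
    }
}
```

The double result is off by a tiny rounding error, which is exactly the kind of drift you don't want in a salary or price column.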

For example, suppose a field statusCode represents the status of a company. Using a numeric type and documenting the meaning of each value in comments leads to code that is hard to read and error-prone:

// Company status:
// 1 – active
// 2 – suspended
// 3 – dissolved
// 4 – merged
@Column(name = "STATUS_CODE")
private Long statusCode;

Instead, create an enumeration and use it as the type of the field. This makes the code self-documenting and reduces the chance of mistakes. When persisting an enum with Spring Data JPA, it is good practice to specify how it's stored. Prefer @Enumerated(EnumType.STRING) so the DB contains readable names, and you’re safe against reordering constants. Also, make sure the column type/length fits the enum names (set length or columnDefinition if needed).

// Stored as readable names; ensure the column can hold them (e.g., length = 32).
@Column(name = "STATUS", length = 32)
@Enumerated(EnumType.STRING)
private CompanyStatus status;

public enum CompanyStatus {
    /** Active company */           ACTIVE,
    /** Temporarily suspended */    SUSPENDED,
    /** Officially dissolved */     DISSOLVED,
    /** Merged into another org */  MERGED;
}

If your existing column stores numeric codes (e.g., 1–4) and must stay numeric, don’t use EnumType.ORDINAL (it writes 0-based ordinals and will not match 1–4). Use an AttributeConverter<CompanyStatus, Integer> to map explicit codes to enum values:

@Converter(autoApply = false)
public class CompanyStatusConverter implements AttributeConverter<CompanyStatus, Integer> {
    @Override
    public Integer convertToDatabaseColumn(CompanyStatus v) {
        if (v == null) return null;
        return switch (v) {
            case ACTIVE    -> 1;
            case SUSPENDED -> 2;
            case DISSOLVED -> 3;
            case MERGED    -> 4;
        };
    }

    @Override
    public CompanyStatus convertToEntityAttribute(Integer db) {
        if (db == null) return null;
        return switch (db) {
            case 1 -> CompanyStatus.ACTIVE;
            case 2 -> CompanyStatus.SUSPENDED;
            case 3 -> CompanyStatus.DISSOLVED;
            case 4 -> CompanyStatus.MERGED;
            default -> throw new IllegalArgumentException("Unknown STATUS_CODE: " + db);
        };
    }
}

// Keeps numeric 1..4 in the column while exposing a typesafe enum in Java.
@Column(name = "STATUS_CODE")
@Convert(converter = CompanyStatusConverter.class)
private CompanyStatus status;
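To make the ordinal mismatch concrete, here is a small plain-Java sketch with no JPA involved. It mimics ordinal mapping with values()[n], which is a simplification for illustration; OrdinalMismatchDemo and readAsOrdinal are hypothetical names.

```java
final class OrdinalMismatchDemo {

    // Legacy code stored in the column for an active company (1 = active).
    static final int LEGACY_ACTIVE_CODE = 1;

    // Simplified model of ordinal mapping: the stored number is a 0-based index.
    static CompanyStatus readAsOrdinal(int stored) {
        return CompanyStatus.values()[stored];
    }

    public static void main(String[] args) {
        System.out.println(CompanyStatus.ACTIVE.ordinal());    // 0, not the legacy code 1
        System.out.println(readAsOrdinal(LEGACY_ACTIVE_CODE)); // SUSPENDED, not ACTIVE
    }
}

// Same constants, in the same order, as the enum in this article.
enum CompanyStatus { ACTIVE, SUSPENDED, DISSOLVED, MERGED }
```

With the converter, the code 1 maps to ACTIVE explicitly, and reordering the enum constants cannot silently change what is stored.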

4 Consistent use of types

If a field is used in multiple entities, ensure it has the same type everywhere. Using different types for a conceptually identical field leads to ambiguous business logic. For example, the following bad usage shows the same flag column mapped with different types in two entities:

// Bad choice of types for logically identical fields
// A – automatic, M – manual

// In one entity:
@Column(name = "WAY_FLG")
private String wayFlg;

// In another entity:
@Column(name = "WAY_FLG")
private Boolean wayFlg;

A better option is to use a Boolean or, if you need more than two values or the two values are domain-labeled (e.g., Automatic/Manual), use an enum for both fields. If it’s truly binary yes/no, Boolean (wrapper for nullable columns) is fine. Otherwise, prefer an enum for clarity and future-proofing. Below are consistent mappings without converters:

// Two labeled states: prefer an enum for clarity
public enum WayMode { A, M } // or AUTOMATIC, MANUAL

// Use the same mapping in every entity touching WAY_FLG
@Column(name = "WAY_FLG", length = 1) // ensure length fits enum names
@Enumerated(EnumType.STRING)
private WayMode wayFlg;

// Truly binary case (e.g., active/inactive):
@Column(name = "IS_ACTIVE")
private Boolean active; // use wrapper if the column can be NULL

There is an intentional omission of the part about relations between tables in Spring Data JPA, as it is a broad subject that warrants a separate article on best practices.

5 Lombok usage

To reduce the amount of boilerplate source code, it is recommended to use Lombok for code generation — but it should be used wisely. Generating getters and setters is an optimal choice. It’s best to stick to this practice and override getters and setters only if some pre-processing is required.

For JPA, ensure a no-arg constructor exists. With Lombok, you can add @NoArgsConstructor(access = AccessLevel.PROTECTED) to satisfy the spec cleanly.

Warning note: Avoid @Data on entities because its generated equals/hashCode/toString can be problematic with JPA (lazy relations, mutable identifiers). Prefer targeted annotations (@Getter, @Setter, @NoArgsConstructor) and, if needed, explicit equality with @EqualsAndHashCode(onlyExplicitlyIncluded = true) and excludes for associations. The next section covers this in more detail.

Lombok supports many other commonly used annotations. You can find the full list on the website: https://projectlombok.org/

6 Overriding equals and hashCode

Overriding equals and hashCode in database entities raises many questions. For example, is it necessary at all, given that many applications work fine with the standard methods inherited from Object?

Context: Within a single persistence context, Spring Data JPA/Hibernate already ensures identity semantics (same DB row -> same Java instance). You typically need custom equals/hashCode only if you rely on value semantics across contexts or use hashed collections.

A database entity typically represents a real-world object, and you might choose to override in different ways:

Based on the entity’s primary key (it is immutable). Nuance: if the ID is DB-generated, it’s null before persist/flush. Handle transient state so you don’t change the hash while in a hashed collection.
Based on a business key (e.g., an employee’s tax ID/INN), since it isn’t tied to the database implementation. Nuance: works well if the key is unique, immutable, and always available; avoid mutable fields/associations.
Based on all fields. Unsafe: mutable data, potential lazy loads, recursion through associations, and performance costs make this fragile for JPA entities.

When should you override equals and hashCode?

When the object is used as a key in a Map. Nuance: don’t mutate fields used by hashCode while the object is inside a hashed structure.
When using structures that store only unique objects (e.g., Set). Nuance: same caution; mutating equality-significant fields breaks collection invariants.
When you need to compare database entities. Nuance: often comparing identifiers is sufficient; overriding is not mandatory if identity comparison fits your use case.
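The warning about mutating hashed fields can be reproduced without any JPA code. In this sketch, Label is a hypothetical class whose equals/hashCode depend on a mutable field.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class MutableKeyDemo {

    // Hypothetical class: equality and hash code depend on a mutable field.
    static final class Label {
        String name;

        Label(String name) { this.name = name; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;
            return Objects.equals(name, ((Label) o).name);
        }

        @Override
        public int hashCode() { return Objects.hashCode(name); }
    }

    // Returns whether the set still finds the element after its hashed field changed.
    static boolean foundAfterMutation() {
        Set<Label> labels = new HashSet<>();
        Label label = new Label("old");
        labels.add(label);
        label.name = "new"; // the hash code changes while the object sits in the set
        return labels.contains(label);
    }

    public static void main(String[] args) {
        System.out.println(foundAfterMutation()); // false: the set lost track of its own element
    }
}
```

The object is still physically in the set, but lookups use the new hash and miss it. An entity whose equality is based on all fields behaves the same way after any setter call.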

From the above, it follows that you should use Lombok’s @EqualsAndHashCode and @Data with caution, because Lombok generates these methods for all fields unless configured otherwise.

In more detail: prefer @EqualsAndHashCode(onlyExplicitlyIncluded = true) and mark only stable identifiers/business keys; avoid @Data on entities (its generated equals/hashCode/toString can interact badly with lazy relations). You can also exclude relations from equality or toString with @EqualsAndHashCode.Exclude / @ToString.Exclude.

Inheritance nuance: if you define equality in a mapped superclass, ensure the rule is consistent for all subclasses and matches how identity is defined for the whole hierarchy.

A) Business-key equality (safe when the key is unique & immutable)

public class Employee {
    private String taxId; // natural key: unique & immutable

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false; // keep simple here
        Employee other = (Employee) o;
        return taxId != null && taxId.equals(other.taxId);
    }

    @Override
    public int hashCode() {
        return (taxId == null) ? 0 : taxId.hashCode();
    }
}

B) ID-based equality (handles transient state; avoids hash changes)

public class Order {
    private Long id; // DB-generated

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Order other = (Order) o;
        // transient entities (id == null) are never equal to anything but themselves
        return id != null && id.equals(other.id);
    }

    @Override
    public int hashCode() {
        // constant, avoids rehash after the ID is assigned later
        return getClass().hashCode();
    }
}

C) Lombok pattern (explicit includes; avoid all-fields default)

@Getter
@Setter
@EqualsAndHashCode(onlyExplicitlyIncluded = true)
public class Customer {
    @EqualsAndHashCode.Include
    private String externalId; // stable business key

    // exclude associations and mutable details
    // @EqualsAndHashCode.Exclude private List<Order> orders;
}
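
To see why pattern B returns a constant hashCode, here is a minimal, self-contained sketch. The class names are hypothetical and no JPA is involved: assigning the id by hand stands in for the database generating it on persist. It compares an entity whose hash is derived from the id with one that uses a constant hash, both stored in a HashSet before the id exists.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class HashStabilityDemo {

    /** Hypothetical entity: hashCode is derived from the generated id, so it changes after persist. */
    static class IdHashedOrder {
        Long id;

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;
            IdHashedOrder other = (IdHashedOrder) o;
            return id != null && id.equals(other.id);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(id); // 0 while id is null, different once the id is set
        }
    }

    /** Same equality rule, but with the constant hashCode from pattern B. */
    static class ConstantHashedOrder {
        Long id;

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;
            ConstantHashedOrder other = (ConstantHashedOrder) o;
            return id != null && id.equals(other.id);
        }

        @Override
        public int hashCode() {
            return getClass().hashCode(); // does not depend on the id
        }
    }

    static boolean idHashedFoundAfterPersist() {
        Set<IdHashedOrder> orders = new HashSet<>();
        IdHashedOrder order = new IdHashedOrder();
        orders.add(order);   // stored under the hash of a null id
        order.id = 42L;      // stands in for the id assigned by the database
        return orders.contains(order);
    }

    static boolean constantHashedFoundAfterPersist() {
        Set<ConstantHashedOrder> orders = new HashSet<>();
        ConstantHashedOrder order = new ConstantHashedOrder();
        orders.add(order);
        order.id = 42L;
        return orders.contains(order);
    }

    public static void main(String[] args) {
        System.out.println("id-based hash found after persist: " + idHashedFoundAfterPersist());
        System.out.println("constant hash found after persist: " + constantHashedFoundAfterPersist());
    }
}
```

The id-hashed entity can no longer be found in the set once its id is assigned, while the constant-hash entity can. The price of the constant hash is that all instances share one bucket, which is usually acceptable for the small collections entities live in.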

7 Developing DTOs

DTOs (Data Transfer Objects) are specialized objects designed to present data to a client, as sending raw database entities directly to the client is considered a bad practice. Some teams do pass entities across internal boundaries, but for public/client-facing APIs, DTOs are preferred to avoid leakage of persistence details.

Creating various DTOs increases development and maintenance time. If libraries like ModelMapper are used, there is also a memory overhead for object mapping.

Another feature of DTOs is reducing the amount of data transmitted over the network and lowering the load on the DBMS by requesting fewer fields. The most important thing is that you actually reduce the database load only if you select fewer columns (using constructor expressions, Spring Data JPA projections, or native queries that return only the required fields). Fetching full entities and then mapping will not reduce the number of selected columns, your captain.

There are different ways to design DTOs:

Using classes (objects). Classes or Java records are typically clearer for external APIs (serialization, validation, documentation).
Using interfaces. Interfaces fit Spring Data interface-based projections (read-only, getter-only views), not write models.

There are different ways to convert entity objects to DTOs:

The optimal approach is to project data from the database directly into the required DTO. This both avoids extra mapping work and ensures fewer columns are selected.
You can also use a library like ModelMapper. Prefer MapStruct instead (compile-time code generation, faster at runtime, explicit mappings).
You can also write your own object converter. Handwritten mappers provide full control but increase maintenance requirements.

Good practices for developing DTOs:

Prefer purpose-specific DTOs per use case (e.g., Summary/Detail/ListItem; CreateRequest vs Response).
Avoid one mega-DTO tied to an entity, which causes over-fetching and tight coupling.
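
As a rough illustration of purpose-specific DTOs, the sketch below uses Java records (Java 16+) with handwritten mapping. Everything in it is hypothetical: the Employee class is a plain stand-in for an entity, and the field names are assumptions made for the example. Keep in mind that mapping an already loaded entity like this does not reduce the selected columns; only a projection at query level does.

```java
import java.math.BigDecimal;

public class EmployeeDtoDemo {

    /** Hypothetical stand-in for a JPA entity (plain class, no annotations). */
    static class Employee {
        final Long id;
        final String fullName;
        final String taxId;
        final BigDecimal salary;

        Employee(Long id, String fullName, String taxId, BigDecimal salary) {
            this.id = id;
            this.fullName = fullName;
            this.taxId = taxId;
            this.salary = salary;
        }
    }

    /** List view: only what a table row needs. */
    record EmployeeListItem(Long id, String fullName) {
        static EmployeeListItem from(Employee e) {
            return new EmployeeListItem(e.id, e.fullName);
        }
    }

    /** Detail view: more fields, but still no sensitive taxId and no persistence internals. */
    record EmployeeDetail(Long id, String fullName, BigDecimal salary) {
        static EmployeeDetail from(Employee e) {
            return new EmployeeDetail(e.id, e.fullName, e.salary);
        }
    }

    public static void main(String[] args) {
        Employee employee = new Employee(1L, "Ada Lovelace", "TAX-001", new BigDecimal("1000.00"));
        System.out.println(EmployeeListItem.from(employee));
        System.out.println(EmployeeDetail.from(employee));
    }
}
```

Two small records keep each endpoint's contract explicit, so adding a field to the detail view never leaks into the list view.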

8 Spring Data JPA Summary Best Practices

Developing entities with JPA annotations

Entities map fields to columns; relationships, embeddables, and @Transient are common (not always 1:1).
Minimum: @Entity + primary key (@Id / @EmbeddedId) + no-args ctor (public/protected).
Use @Table only to override defaults (table, schema, constraints).
Prefer explicit @Entity(name="…") to decouple JPQL from class names so JPQL stays stable across class renames.
Avoid string concatenation in JPQL and use parameters.
JPQL entity name (@Entity(name)) and physical table name (@Table(name)) are independent.

Avoiding magic numbers/literals

Choose types by domain range and nullability; use wrapper types (Integer, Boolean) if the column can be NULL.
Money/precision -> BigDecimal with proper precision/scale.
Replace numeric codes with enums. Persist with @Enumerated(EnumType.STRING) and ensure column length fits names.
Legacy numeric code columns: use an AttributeConverter<Enum, Integer>. Don’t use EnumType.ORDINAL.

Consistent use of types

Use the same Java type for the same conceptual column everywhere.
Binary flags -> Boolean (wrapper). Domain-labeled or future-expandable flags -> enum consistently.
Map enums uniformly (@Enumerated(EnumType.STRING), @Column(length=…)); avoid mixing String/Boolean/enum for the same column.

Lombok usage

Use Lombok for boilerplate: @Getter, @Setter, @NoArgsConstructor(access = PROTECTED) for JPA.
Avoid @Data on entities (generated equals/hashCode/toString can conflict with lazy relations and identifiers).
Override accessors only when pre-/post-processing is needed.

Overriding equals and hashCode

Override only if you need value semantics across contexts or in hashed collections.
Business-key strategy: compare a unique, immutable key.
ID-based strategy: treat transient (id == null) entities as unequal; use a stable/constant hashCode() to avoid rehash after persist.
Avoid all-fields equality; exclude associations to prevent lazy loads/recursion.
With Lombok, prefer @EqualsAndHashCode(onlyExplicitlyIncluded = true) and explicitly include stable identifiers; use @EqualsAndHashCode.Exclude / @ToString.Exclude for relations.
Maintain consistency in equality rules across hierarchies (mapped superclasses vs subclasses).

Developing DTOs

Don’t expose entities to clients, even if you return them with annotation @JsonIgnore; design purpose-specific DTOs (Summary/Detail/ListItem; Create/Update/Response).
Reduce database load by selecting fewer columns: project directly to DTOs (using constructor expressions), utilize interface-based projections, or use native queries that return only the necessary fields.
Mapping full entities doesn’t reduce selected columns.
Prefer MapStruct (compile-time, fast, explicit) over ModelMapper; handwritten mappers give control at a higher maintenance cost.

At the end

I hope you find this article helpful. The next articles in the series will be published soon, so connect with me on LinkedIn to stay informed. If you're curious about Spring Data JPA, read the next article: "Spring Data JPA Best Practices: Repositories Design Guide".

Bye!

Docker Best Practices to Secure and Optimize Your Containers

Dmitry Protsenko — Mon, 29 Sep 2025 17:00:00 +0000

You can enjoy the original version of this article on my website, protsenko.dev.

Hi! In this article, I’m sharing 29 Docker best practices I've collected to make your images better, more secure, and faster. They cover security, maintainability, and reproducibility. This guide is based on my experience creating the Docker Scanner IntelliJ IDEA plugin, and the scanner covers almost all of these practices. It also includes Kubernetes Security Scanner features.

If this project or article has been helpful to you, please consider giving it a ⭐ on GitHub to help others discover it.

Docker image size & maintainability

1 Always pin an image version

When you do not specify a version tag, Docker uses the latest tag by default. This makes builds unpredictable: you do not know which image version you will download. If an attacker compromises the image author's account, they can push a harmful image under the same tag. New image updates may also break your application. This is one of the most important Docker Best Practices and a Security Best Practice.

That was the first inspection that I implemented in the Docker Scanner plugin. You can try it and see how flawlessly it works.

# problem
ARG version=latest
FROM ubuntu as u1
FROM ubuntu:latest as u2
FROM ubuntu:$version as u3
FROM u3
USER nobody
# fix
FROM ubuntu:noble as u1
FROM ubuntu@sha256:72297848456d5d37d1262630108ab308d3e9ec7ed1c3286a32fe09856619a782 as u2
FROM u2
USER nobody
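
To make the rule concrete, here is a simplified Java sketch of how such a check could work. It is not the Docker Scanner implementation, and the class and method names are made up for illustration. As a simplification, it treats any tag that comes from a variable as unpinned, and it skips FROM lines that refer to an earlier stage alias.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PinnedImageCheck {

    /** Returns the FROM lines whose base image is not pinned to a fixed tag or digest. */
    static List<String> unpinnedFromLines(List<String> dockerfileLines) {
        Set<String> stageAliases = new HashSet<>();
        List<String> unpinned = new ArrayList<>();
        for (String raw : dockerfileLines) {
            String[] tokens = raw.trim().split("\\s+");
            if (tokens.length < 2 || !tokens[0].equalsIgnoreCase("FROM")) continue;
            int idx = 1;
            while (idx < tokens.length && tokens[idx].startsWith("--")) idx++; // skip flags like --platform
            if (idx >= tokens.length) continue;
            String image = tokens[idx];
            boolean isStageReference = stageAliases.contains(image.toLowerCase(Locale.ROOT));
            if (!isStageReference && !isPinned(image)) unpinned.add(raw.trim());
            // remember "AS alias" so later "FROM alias" lines are not reported
            if (idx + 2 < tokens.length && tokens[idx + 1].equalsIgnoreCase("AS")) {
                stageAliases.add(tokens[idx + 2].toLowerCase(Locale.ROOT));
            }
        }
        return unpinned;
    }

    static boolean isPinned(String image) {
        if (image.equals("scratch")) return true;        // empty base image, nothing to pin
        if (image.contains("@sha256:")) return true;     // digest pin
        String lastSegment = image.substring(image.lastIndexOf('/') + 1); // ignore registry host:port
        int colon = lastSegment.indexOf(':');
        if (colon < 0) return false;                     // no tag means "latest"
        String tag = lastSegment.substring(colon + 1);
        return !tag.isEmpty() && !tag.equals("latest") && !tag.startsWith("$");
    }

    public static void main(String[] args) {
        List<String> problem = List.of(
                "FROM ubuntu as u1",
                "FROM ubuntu:latest as u2",
                "FROM ubuntu:$version as u3",
                "FROM u3");
        unpinnedFromLines(problem).forEach(System.out::println);
    }
}
```

Run against the problem example above, the sketch reports the first three FROM lines and skips FROM u3, because u3 is a stage alias rather than an image.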

2 Avoid using the dist-upgrade in package management

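Using dist-upgrade can install new packages and remove existing ones to resolve changed dependencies, so a single command may change far more of the image than you intended. This behavior may break your Dockerfile by introducing unexpected changes. Dockerfiles should use controlled updates to maintain stability.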

# problem
FROM ubuntu:20.04
RUN apt-get dist-upgrade
# fix: use a newer base image instead of running dist-upgrade

3 Use multi-stage builds to reduce image size

Multi-stage builds allow you to use multiple FROM statements in your Dockerfile. You can copy artifacts from one stage to another, leaving behind everything you don’t want in the final image. Reducing the image size is a broad area where many of these Docker best practices apply, and the next few practices cover it.

This approach dramatically reduces the final image size by excluding build tools, source code, and intermediate files that are only needed during the build process. It also improves Docker security by not shipping development dependencies and build tools in production images.

# problem
FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["node", "dist/index.js"]

# fix
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

FROM node:18-alpine AS production
WORKDIR /app
COPY package*.json ./
RUN npm install --only=production && npm cache clean --force

COPY --from=builder /app/dist ./dist
USER nobody
EXPOSE 3000
CMD ["node", "dist/index.js"]

4 Consolidate multiple RUN instructions

Each RUN instruction creates a new image layer, which increases the final image size and build time. Consolidating these commands into a single RUN instruction improves efficiency and simplifies maintenance.

# problem
FROM ubuntu:20.04
RUN apt-get -y --no-install-recommends install netcat
RUN apt-get clean
# fix
FROM ubuntu:20.04
RUN apt-get -y --no-install-recommends install netcat && apt-get clean

5 Always clean the package manager's cache

There are plenty of package managers, such as apk, dnf, yum, zypper, and pip. All of them cache data, which keeps them fast on a regular machine but is redundant in a container. Cached data stays in Docker image layers, increasing the image size.

You could easily spot these issues with the Docker Scanner plugin.

# dnf
## problem
FROM fedora:version
RUN dnf install -y httpd
## fix
RUN dnf install -y httpd && \
    dnf clean all && \
    rm -rf /var/cache/dnf

# yum
## problem
FROM centos:7
RUN yum install -y httpd
## fix
FROM centos:7
RUN yum install -y httpd && yum clean all

# zypper
## problem
FROM opensuse/leap:15.3
RUN zypper install -y httpd
## fix
FROM opensuse/leap:15.3
RUN zypper install -y httpd && zypper clean
USER nobody

# apk
## problem
FROM alpine:latest
RUN apk update && \
    apk add curl
## fix
FROM alpine:latest
RUN apk add --no-cache curl

# pip
## problem
FROM python:3.9
RUN pip install django
RUN pip install -r requirements.txt
USER nobody
## fix
FROM python:3.9
RUN pip install --no-cache-dir django
RUN pip install --no-cache-dir -r requirements.txt
USER nobody

6 Always combine the package manager update command with the install

You may notice that this is almost the same as consolidating RUN commands, and you're right, but this practice matters even when the number of layers is not a concern.

Running the package manager update command alone updates the package list in a separate layer. Docker caches that layer, so when the installation occurs in another RUN statement it may use a stale package list. Combining the update and install commands in one RUN ensures that the package installation uses the latest package data.

This practice applies to the following package managers: apt-get, apt, yum, apk, dnf, and zypper.

# problem
FROM ubuntu:20.04
RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential
# fix
FROM ubuntu:20.04
RUN apt-get update && apt-get install --no-install-recommends -y build-essential

7 Always use `--no-install-recommends` with `apt-get`

When you run apt-get install without --no-install-recommends, apt-get installs extra packages by default. This increases the image size and may add unwanted dependencies. Using --no-install-recommends installs only the essential packages, reducing the image size and potential security risks.

# problem
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y build-essential
# fix
FROM ubuntu:20.04
RUN apt-get update && apt-get install --no-install-recommends -y build-essential

8 Always use -l with useradd to avoid high-UID bloat

When the UID is large (tens or hundreds of thousands), useradd without -l writes a record to /var/log/lastlog and /var/log/faillog at a file offset proportional to the UID. Because those databases are indexed by UID, that large offset leaves a large sparse hole and immediately inflates the files' logical (apparent) size. In layered container builds, those sparse blocks often get materialized as real bytes: farewell, tens of MB for nothing. Avoid touching those databases by using -l (--no-log-init).

# problem
FROM ubuntu:20.04
RUN useradd -u 198401 nordcoderd
USER nordcoderd
# fix
FROM ubuntu:20.04
RUN useradd -l -u 198401 nordcoderd
USER nordcoderd
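
A quick back-of-the-envelope calculation shows where the "tens of MB" come from. The sketch below assumes that one lastlog record takes 292 bytes (a 4-byte timestamp, a 32-byte terminal name, and a 256-byte host name), which is the common glibc layout on x86-64; the record size on other platforms may differ, so treat the numbers as an estimate.

```java
public class LastlogSizeEstimate {

    // Assumption: sizeof(struct lastlog) == 292 bytes (4 + 32 + 256), common on x86-64 glibc.
    static final long LASTLOG_RECORD_BYTES = 292;

    /** Apparent size of /var/log/lastlog after a record for the given UID has been written. */
    static long apparentSizeBytes(long uid) {
        // The record for a UID starts at uid * recordSize, so the file must extend one record past it.
        return (uid + 1) * LASTLOG_RECORD_BYTES;
    }

    public static void main(String[] args) {
        long uid = 198401; // the UID from the example above
        long bytes = apparentSizeBytes(uid);
        System.out.println("UID " + uid + " -> " + bytes + " bytes (~" + bytes / (1024 * 1024) + " MiB)");
    }
}
```

Under that assumption, UID 198401 from the example gives an apparent lastlog size of 57,933,384 bytes, roughly 55 MiB, for a file that holds a single meaningful record.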

Best Practices to Keep Docker Files Clean

9 Always choose what to use: wget or curl

Both wget and curl fetch remote files. Using both tools adds unnecessary redundancy and may lead to installing extra packages. Standardize on one tool to keep the Dockerfile clear and efficient. Keeping things clear is yet another Docker best practice.

# problem
FROM ubuntu:20.04
RUN wget http://example.com/script.sh -O script.sh && curl -sSL http://example.com/script.sh -o script2.sh
# fix
FROM ubuntu:20.04
RUN curl -sSL http://example.com/script.sh -o script.sh

10 Use absolute paths for WORKDIR

For clarity and reliability, you should always use absolute paths for your WORKDIR. Relative paths may cause unexpected behavior during builds.

# problem
FROM ubuntu:20.04
WORKDIR app
# fix
FROM ubuntu:20.04
USER nobody
WORKDIR /app

11 Exclude unnecessary files with .dockerignore

When building Docker images, everything in the build context (the directory you run docker build from) is sent to the Docker daemon. If you don’t exclude unnecessary files (like .git, logs, node_modules, or temporary build artifacts), the build context becomes bloated, slowing down image builds and potentially leaking sensitive data into the image. The .dockerignore file works like .gitignore, preventing unwanted files from being included in the context and final image layers.

Among the Docker Security Best Practices, one key guideline is to ensure that no sensitive data, such as .env files, is copied into the image. Unfortunately, this rule isn't bundled in the Docker Scanner plugin.

# problem
FROM node:18
WORKDIR /app
COPY . .
RUN npm install && npm run build
CMD ["node", "dist/index.js"]
## In this case, large local 'node_modules' or '.git' folder could be copied unnecessarily.
# fix

## Add a `.dockerignore` file in the project root:
.dockerignore
.git
*.log
node_modules
Dockerfile
.dockerignore

## Then use your Dockerfile cleanly:
FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm install --only=production
COPY . .
CMD ["node", "dist/index.js"]

12 Use WORKDIR instead of the cd command

Using multiple RUN cd ... && ... instructions creates clutter. These commands are difficult to troubleshoot and maintain. Instead, use the WORKDIR instruction to set the working directory consistently. This approach simplifies the Dockerfile and improves readability, which is fundamental for Docker best practices.

# problem
FROM ubuntu:20.04
RUN cd /app && make build && make test
# fix
FROM ubuntu:20.04
WORKDIR /app
RUN make build && make test

13 Use JSON notation for CMD and ENTRYPOINT

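Docker supports two formats for CMD and ENTRYPOINT: shell form and JSON array (exec) form. Shell form wraps the command in /bin/sh -c, so arguments go through shell parsing and the application does not run as PID 1, which means it may not receive signals such as SIGTERM. JSON notation passes each argument exactly as written and ensures consistent behavior.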

# problem
FROM ubuntu:20.04
CMD echo "Hello World"
ENTRYPOINT /usr/local/bin/start-app
# fix
FROM ubuntu:20.04
CMD ["echo", "Hello World"]
ENTRYPOINT ["/usr/local/bin/start-app"]

14 Use apt-get or apt-cache instead of apt

apt is designed for interactive use: its command-line interface and output are not guaranteed to stay stable, which makes it unsuitable for automated Docker builds. Using apt may lead to warnings or unexpected output. Instead, use apt-get or apt-cache for non-interactive package management.

# problem
FROM ubuntu:20.04
RUN apt update && apt install -y build-essential
# fix
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y --no-install-recommends build-essential

15 Use the package manager's auto-confirm flag -y

Package managers like apt-get, yum, dnf, and zypper prompt for confirmation if the -y flag is missing. Without this flag, installations may pause for manual input and block automated builds. You should care about the reproducibility of the image, which is very valuable in Docker best practices.

# problem
FROM ubuntu:20.04
RUN apt-get update && apt-get install --no-install-recommends build-essential
# fix
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y --no-install-recommends build-essential

Docker Security Best Practices

16 Avoid default, root, or dynamic user

Running containers with an undefined user or with the root user increases the attack surface. Dynamic assignment can override the intended user and make the container vulnerable. Note: this check only considers the final user specified in the Dockerfile. It is acceptable to run build operations as root if you later switch to a dedicated non-root user.

These best practices apply to Docker security and Kubernetes Security. The Docker Scanner plugin fully covers the following examples.

# problem
## Example 1: Implicitly using the default user (can be overridden)
FROM ubuntu:20.04
RUN whoami
## Example 2: Explicitly setting the user to root
FROM ubuntu:20.04
USER root
RUN whoami
## Example 3: Dynamic user assignment via a build argument (risky if overridden)
FROM ubuntu:20.04
ARG APP_USER
USER $APP_USER
RUN whoami
# fix
FROM ubuntu:20.04
## Create a dedicated non-root user and group
RUN groupadd --system app && useradd --system --create-home --gid app app
## Switch to the non-root user
USER app
RUN whoami

17 Avoid exposing the SSH port

Exposing port 22 in your Dockerfile may allow unauthorized SSH access to the container, which usually does not need SSH access at all. Removing this exposure reduces the attack surface. This practice also applies to Kubernetes security: exposing something sensitive is a bad idea everywhere.

Note: EXPOSE doesn't publish the port by itself; it only documents which port is intended to be published. However, the port could still be reachable by other containers on the internal Docker network.

# problem
FROM ubuntu:20.04
EXPOSE 22
# fix
FROM ubuntu:20.04
# EXPOSE 22 removed to prevent unauthorized SSH access.

18 Avoid overriding ARG variables in RUN commands

ARG variables are set at build time. Users can override them with the --build-arg flag. Critical commands may change if the ARG is altered. This behavior can introduce security risks or unexpected outcomes, which is why it is a Docker security best practice.

# problem
FROM ubuntu:20.04
ARG INSTALL_PACKAGE=build-essential
RUN apt-get update && apt-get install -y $INSTALL_PACKAGE
# fix
FROM ubuntu:20.04
RUN apt-get update && apt-get install --no-install-recommends -y build-essential
USER nobody

19 Avoid pipe curl to bash (curl|bash)

Using curl or wget with a pipe (|) or redirection (>) to execute scripts directly poses a security risk. This practice, often referred to as curl | bash, should be avoided: executing something downloaded from a third-party website without validation is a bad practice that leads to Docker security issues.

# problem
FROM ubuntu:20.04
RUN curl -sSL http://example.com/script.sh | sh
# fix
FROM ubuntu:20.04
RUN curl -sSL http://example.com/script.sh -o script.sh
# run only if downloaded script was verified
# or better to save file to build folder to not retrieve remote file each time

20 Avoid storing secrets in ENV keys

Storing secrets in ENV keys puts them directly into the image metadata and layers. Attackers can extract these secrets if they gain access to the image. Do not hardcode sensitive data in your Dockerfile. An earlier rule complements this one: use .dockerignore so that sensitive files do not end up in the Docker image.

# problem
FROM ubuntu:20.04
ENV PASSWORD=supersecret123
# fix
FROM ubuntu:20.04
# Remove sensitive data from the Dockerfile.
# Inject PASSWORD at runtime using secure methods.

21 Always prefer to use COPY instead of ADD

ADD has extra features, such as extracting tar files and handling remote URLs. These features may cause unintended effects if you only need to copy files. Use COPY to ensure simple and predictable behavior. ADD can also lead to non-reproducible builds if you depend on remote scripts or archives and, in the worst case, to security issues. That is why you should prefer COPY in your Dockerfile.

# problem
FROM ubuntu:20.04
ADD ./app /app
# fix
FROM ubuntu:20.04
COPY ./app /app

22 Always use ADD with checksum verification when downloading a file

I know, sometimes you have to use ADD. In that case, consider using --checksum to verify the checksum of the remote source. It makes your Dockerfile more secure and your builds more reproducible. With checksum verification, you can be sure the image was built with a verified file, which reduces the attack surface if the remote file is changed.

# problem
FROM ubuntu:20.04
ADD https://mirrors.edge.kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz /
# fix
FROM ubuntu:20.04
ADD --checksum=sha256:24454f830cdb571e2c4ad15481119c43b3cafd48dd869a9b2945d1036d1dc68d https://mirrors.edge.kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz /
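
Conceptually, checksum verification boils down to hashing the downloaded bytes and comparing the result with a value you recorded in advance. The Java sketch below illustrates that idea only; it is not how Docker implements --checksum, and the class and method names are hypothetical. The same idea applies when you verify a downloaded script before running it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

public class ChecksumCheck {

    /** Returns the lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    /** Accepts the expected value with or without the "sha256:" prefix used in Dockerfiles. */
    static boolean matches(byte[] data, String expected) {
        String expectedHex = expected.startsWith("sha256:") ? expected.substring(7) : expected;
        byte[] actual = sha256Hex(data).getBytes(StandardCharsets.US_ASCII);
        byte[] wanted = expectedHex.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.US_ASCII);
        return MessageDigest.isEqual(actual, wanted);
    }

    public static void main(String[] args) {
        byte[] downloaded = "hello".getBytes(StandardCharsets.UTF_8); // stands in for a downloaded file
        String pinned = "sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824";
        System.out.println(matches(downloaded, pinned) ? "checksum ok" : "checksum mismatch, do not use the file");
    }
}
```

If the remote file changes by even one byte, the digest no longer matches and the build should stop instead of shipping unverified content.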

23 Avoid using RUN with sudo

Docker executes RUN commands as the user specified by the USER directive, which is root by default. Including sudo in RUN commands is redundant and may cause unexpected outcomes. Avoid using sudo to keep the builds clean and predictable.

# problem
FROM ubuntu:20.04
RUN sudo apt-get update && sudo apt-get install -y --no-install-recommends build-essential
# fix
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y --no-install-recommends build-essential

General Docker Best Practices

24 Avoid the deprecated MAINTAINER instruction

MAINTAINER has been deprecated since Docker 1.13.0. Use LABEL to provide maintainer information. This change improves clarity and maintainability.

# problem
FROM ubuntu:20.04
MAINTAINER John Doe <john@protsenko.dev>
# fix
FROM ubuntu:20.04
LABEL org.opencontainers.image.authors="John Doe <john@protsenko.dev>"

25 Ensure trailing slash for COPY commands with multiple arguments

When you copy multiple files, the destination must be a directory. If the destination does not end with a slash, Docker may misinterpret it. This mistake leads to build errors or unexpected file placements.

# problem
FROM ubuntu:20.04
COPY file1.txt file2.txt dir
# fix
FROM ubuntu:20.04
COPY file1.txt file2.txt dir/

26 Avoid duplicate aliases in FROM instructions

Docker requires each FROM instruction alias to be unique. Duplicate aliases lead to conflicts that stop the build process. They also reduce the clarity and maintainability of the Dockerfile. You would catch this problem at build time, but it's better to catch it in the IDE via my open-source Docker Scanner plugin.

# problem
FROM ubuntu:20 as builder
RUN apt-get update && apt-get install --no-install-recommends -y build-essential
FROM node:14 as builder
RUN npm install
# fix
FROM ubuntu:20 as builder-ubuntu
RUN apt-get update && apt-get install --no-install-recommends -y build-essential
FROM node:14 as builder-node
RUN npm install

27 Avoid self-referencing COPY --from instructions

The COPY instruction with the --from flag must refer to a previous build stage. Referencing the current stage alias is invalid because you cannot copy from the image you are currently building.

# problem
FROM ubuntu:20 as builder
RUN apt-get update && apt-get install --no-install-recommends -y build-essential

## Incorrect: referencing the current build stage "builder"
COPY --from=builder /app /app

# fix
FROM ubuntu:20 as builder
RUN apt-get update && apt-get install --no-install-recommends -y build-essential

FROM ubuntu:20
## Correct: referencing the previous build stage "builder"
COPY --from=builder /app /app

28 Avoid multiple CMD, ENTRYPOINT, or HEALTHCHECK instructions

Defining multiple CMD, ENTRYPOINT, or HEALTHCHECK instructions creates confusion. Only the last instruction of each kind takes effect. This can lead to unintended commands running and make the Dockerfile harder to maintain. Maintainability is another aspect of Docker best practices.

# problem
FROM ubuntu:20.04
## Multiple CMD instructions: only the last one is used.
CMD ["echo", "Hello"]
CMD ["echo", "World"]

## Multiple ENTRYPOINT instructions: only the last one is used.
ENTRYPOINT ["run-app"]
ENTRYPOINT ["start-app"]

# fix
FROM ubuntu:20.04
# Single CMD instruction ensures the correct command runs.
CMD ["echo", "Hello World"]
# Single ENTRYPOINT instruction ensures the correct entrypoint runs.
ENTRYPOINT ["start-app"]

## Multiple HEALTHCHECK instructions;
# problem
FROM ubuntu:20.04
HEALTHCHECK --interval=30s CMD curl -f http://localhost/ || exit 1
HEALTHCHECK --interval=30s CMD wget -q -O- http://localhost/ || exit 1
# fix
FROM ubuntu:20.04
HEALTHCHECK --interval=30s CMD curl -f http://localhost/ || exit 1

29 Avoid exposing ports outside the allowed range

This check flags a Dockerfile that declares an exposed port outside the valid port range of 0-65535. Using an invalid port number is a misconfiguration that may cause errors during build or runtime.

# problem
FROM ubuntu:20.04
EXPOSE 70000
# fix
FROM ubuntu:20.04
EXPOSE 8080
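
A check like this is easy to sketch. The Java snippet below is an illustration under simplifying assumptions, not the plugin's actual code: it looks at a single EXPOSE instruction, strips an optional /tcp or /udp suffix, and skips tokens that are not plain numbers (variables or port ranges, for example).

```java
import java.util.ArrayList;
import java.util.List;

public class ExposePortCheck {

    /** Returns the port tokens of one EXPOSE instruction that fall outside 0-65535. */
    static List<String> invalidPorts(String instruction) {
        List<String> invalid = new ArrayList<>();
        String[] tokens = instruction.trim().split("\\s+");
        if (!tokens[0].equalsIgnoreCase("EXPOSE")) {
            return invalid; // not an EXPOSE instruction, nothing to check
        }
        for (int i = 1; i < tokens.length; i++) {
            String port = tokens[i].split("/", 2)[0]; // drop "/tcp" or "/udp"
            try {
                long value = Long.parseLong(port);
                if (value < 0 || value > 65535) {
                    invalid.add(tokens[i]);
                }
            } catch (NumberFormatException e) {
                // tokens that do not parse as a number (e.g. $PORT) are skipped in this sketch
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        System.out.println(invalidPorts("EXPOSE 70000"));
        System.out.println(invalidPorts("EXPOSE 8080 443/tcp"));
    }
}
```

For the problem example above, the sketch reports 70000, and it returns an empty list for the fixed EXPOSE 8080.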

Final Words after Docker Best Practices

It’s a good idea to integrate these checks into your development and CI/CD process. For the best IDE integration, give the Docker Scanner plugin for JetBrains IDEs a try. It comes bundled with rules that target Docker Best Practices to find security and maintainability problems faster. It is written in pure Kotlin and utilizes the features of the IntelliJ platform. All of this makes shift-left happen thanks to on-the-fly checks. If you are interested in protecting a Kubernetes cluster, read my article about "Kubernetes Security: Best Practices to Protect Your Cluster".

Don’t miss my new articles—follow me on LinkedIn!

Kubernetes Security: Best Practices to Protect Your Cluster

Dmitry Protsenko — Tue, 16 Sep 2025 19:33:09 +0000

You can find the original version of this article on my website, protsenko.dev.

Hi! In this article, I'm sharing 12 collected Kubernetes security best practices for making your cluster secure by writing secure deployments/services, etc. This article is based on my experience creating the Kubernetes Security IDEA plugin and all of the practices covered by the plugin.

If this project has been helpful to you, please consider giving it a ⭐ on GitHub to help others discover it.

12 Kubernetes Hardening Best Practices

1. Using Non-Root Containers

Always try to run containers as a non-root user. By default, containers execute as the root user (UID 0) inside the container, unless the image or Kubernetes securityContext specifies otherwise. Running a container as root might seem harmless, but if an attacker breaks out of the container, they will have root on the host. Even without a full breakout, a process running as root with certain misconfigurations (like extra capabilities or host mounts) could do more damage to the host. There’s a lack of certain preventive security controls when running as root, which increases the risk of container escape.

The better practice is to run as an unprivileged user. You can create a user in your container image (many official images have a user like node, nginx, etc.). Then either set that as the default in the Dockerfile or use Kubernetes to request it. The Kubernetes securityContext has the fields runAsUser and runAsNonRoot, which help enforce this. For example:

securityContext:
  runAsUser: 1000      # UID 1000 (non-root user)
  runAsNonRoot: true   # Ensure the container will not start as root

By specifying runAsNonRoot: true, the kubelet will actually refuse to start the container if it would run as UID 0. This is a guardrail in case someone tries to deploy an image that runs as root – it won’t run unless you explicitly allow root.

If you have an image that must run as root (some older software might assume it, or it needs privileged access), think carefully – can you modify the image or use a different solution? Running as root should be the exception, not the norm.

Tip: Many base images provide a non-root user, but don’t activate it by default. For example, the official Node.js image has a user “node” (UID 1000). You can use that in Kubernetes by doing runAsUser: 1000. For images that lack a user, consider rebuilding the image to add one or switching to an image that supports non-root operation.

Running as non-root adds an extra layer to Kubernetes security. Even if an attacker gets code execution in the container, they hit a lower-privileged user boundary, and it’s harder to escalate from there. Combine this with not running privileged and dropping caps, and your container is much less attractive to attackers.

2. Using Privileged Containers

Don’t run containers in privileged mode unless absolutely necessary (and practically, it’s almost never necessary for typical apps). A privileged container (securityContext.privileged: true) has nearly all the same access to the host as processes on the host do. It lifts most of the restrictions that containers normally have. When you run a container privileged, it can access devices on the host, and it can become almost indistinguishable from a host process. Privileged mode grants all Linux capabilities and disables many other controls (seccomp and AppArmor confinement, device restrictions). Sharing the host’s namespaces (IPC, PID, network) is a separate setting (hostIPC, hostPID, hostNetwork), and combining it with privileged mode removes what isolation is left. In essence, a privileged container is “just a process on the host with root privileges,” which negates the security benefits of using containers.

Someone could use privileged mode for low-level system tasks (for example, a container that needs to manipulate network interfaces or administer the host). But even in those cases, modern Kubernetes has alternatives (like using specific capabilities, or running as a daemon on the host outside of Kubernetes). Granting full privilege is like handing the keys to your kingdom to that container. If compromised, the attacker will trivially root the node and possibly move laterally in the cluster.

Kubernetes’s baseline policy forbids privileged containers for general workloads. Tools like admission controllers or Pod Security Policies (in the past) would prevent you from deploying privileged pods in most namespaces, and for good reason.

Example to avoid:

securityContext:
  privileged: true

If you see that in a manifest, think twice. Why do you need these rights? Can you instead give the container just the specific capability it needs? Or run it differently? Privileged mode is meant for cluster infrastructure components, not for user applications. And running something with elevated privileges is a bad idea in general, not only for Kubernetes security.

If you absolutely must run something privileged (say, a CSI driver or a networking plugin that must manipulate the host network stack), isolate it to its own namespace and prevent untrusted users from deploying containers there. By avoiding privileged mode, you retain the isolation mechanisms (like cgroups, seccomp, AppArmor, namespaces, capabilities restrictions) that make containers a secure way to deploy applications.

3. Do not use hostPath Volumes

Avoid hostPath volumes in your Pods whenever possible. A hostPath volume mounts a file or directory from the host node’s filesystem directly into a pod. This is essentially giving the container direct access to part of the host’s file system. The security implications are significant: if an attacker compromises the container, they could read or modify critical files on the host through the hostPath mount. Even if the container isn’t running as root, an attacker can combine hostPath with other escalations (like running privileged or a sticky bit attack) to tamper with the node.

HostPath volumes “present security risks that could lead to container escape.” They break the isolation between your application and the host OS. For example, consider if you mount /var/run/docker.sock from the host (a common but extremely risky practice) – the container can then control the Docker daemon and effectively gain root on the host. Even mounting something innocuous, like /var/log could allow a malicious container to poison logs or consume disk space. Writing to any hostPath with system files could potentially crash the node or alter its state.

Kubernetes acknowledges this risk: PodSecurity Restricted forbids hostPath volumes entirely. If you must use a hostPath (for example, some daemon needs to read a host file), consider making it read-only and limiting the path as much as possible. Also, run that pod with the least privileges (non-root user, no extra caps, not privileged).

Example to avoid:

volumes:
- name: host-files
  hostPath:
    path: /etc
    type: Directory

The above would give the container access to the host’s /etc directory – clearly a bad idea, as it could read passwords or modify config. If your workload needs to read host info, see if there’s an API or Kubernetes mechanism (like Downward API for some metadata) instead.

There are a few legitimate use cases for hostPath (like a log collection agent reading /var/log/ or a storage plugin writing to a host directory), but those should be deployed with tight controls and usually in dedicated namespaces. For most apps, you shouldn’t need hostPath at all. Use ConfigMap/Secret for config, EmptyDir for scratch space, and PVC for persistent storage. By avoiding hostPath, you keep the container fully sandboxed from the host’s filesystem, making Kubernetes security better.
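The advice above can be checked mechanically. Below is a minimal Python sketch, assuming the manifest has already been parsed into a dict (for example, by a YAML library). The function name find_hostpath_volumes is hypothetical and not part of any Kubernetes tooling.

```python
def find_hostpath_volumes(pod_spec):
    """Return (volume name, host path, mounted read-only everywhere) per hostPath volume."""
    findings = []
    for volume in pod_spec.get("volumes") or []:
        host_path = volume.get("hostPath")
        if host_path is None:
            continue
        name = volume.get("name")
        # Collect every mount of this volume across all containers.
        mounts = [
            mount
            for container in pod_spec.get("containers") or []
            for mount in container.get("volumeMounts") or []
            if mount.get("name") == name
        ]
        # Only report read-only when the volume is mounted and no mount is writable.
        read_only = bool(mounts) and all(m.get("readOnly", False) for m in mounts)
        findings.append((name, host_path.get("path"), read_only))
    return findings


spec = {
    "containers": [
        {"name": "app", "volumeMounts": [{"name": "host-files", "mountPath": "/host-etc"}]}
    ],
    "volumes": [{"name": "host-files", "hostPath": {"path": "/etc", "type": "Directory"}}],
}
print(find_hostpath_volumes(spec))
```

A finding with a writable mount (the last element is False) deserves the most attention. A read-only mount of a narrow path is the less risky case described above.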

4. Do not use hostPort: It Opens the Node’s Port

Be cautious with the hostPort setting on Pods. When you specify a hostPort for a container, that port on the Kubernetes node (host machine) is opened and mapped to your pod. This can be risky because it exposes the host’s network interface to the container. If an attacker compromises the container, they could potentially intercept traffic on that host port or exploit it to gain deeper access. Exposing a host port to a container can open network pathways into your cluster, allowing the container to intercept traffic to a host service or bypass network policies. It also constrains scheduling because only one Pod per node can use each host port, and it can lead to port conflicts.

In general, you should avoid using hostPort unless absolutely necessary. Kubernetes Services (NodePort or LoadBalancer types) or Ingress resources better serve most use cases, such as exposing a service externally, because they handle traffic routing more securely without binding directly to the host’s network ports. Reserve hostPort for low-level system pods or networking tools that require a specific port on every node, and even then, use it sparingly.

Example to avoid:

apiVersion: v1
kind: Pod
metadata:
  name: hostport-pod
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80
      hostPort: 80
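To spot manifests like the one above automatically, a check could look like the following Python sketch. It assumes the Pod spec is already parsed into a dict, and find_host_ports is a hypothetical helper name.

```python
def find_host_ports(pod_spec):
    """Return (container name, hostPort) for every port that binds a node port."""
    findings = []
    for kind in ("initContainers", "containers"):
        for container in pod_spec.get(kind) or []:
            for port in container.get("ports") or []:
                # A missing (or zero) hostPort means nothing is bound on the node.
                if port.get("hostPort"):
                    findings.append((container.get("name"), port["hostPort"]))
    return findings


pod_spec = {
    "containers": [
        {
            "name": "nginx",
            "image": "nginx:latest",
            "ports": [{"containerPort": 80, "hostPort": 80}],
        }
    ]
}
print(find_host_ports(pod_spec))
```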

5. Do not share the Host Namespace

Pods can request to share certain namespaces with the host (node) – namely, the network, PID (process), and IPC namespaces. When a container shares the host’s namespace, it essentially breaks the isolation between the container and the host for that aspect. You should avoid it for most workloads. For example:

If a pod sets hostNetwork: true, it means the pod is using the host machine’s network interface directly. The pod can see all host network interfaces and even potentially sniff traffic. This breaks the default network isolation between pods and the host.
If hostPID: true, the container shares the host’s process ID space. That means it can see (and potentially interact with) processes running on the host (or other pods on the host). An attacker might leverage this to tamper with host processes or simply gather sensitive info.
If hostIPC: true, the pod shares the host’s inter-process communication namespace (things like shared memory segments). That could allow a malicious container to read/write shared memory used by something on the host.

In short, sharing host namespaces can lead to a container escape, compromising Kubernetes security. Pods that share namespaces with the host can communicate with host processes and glean information about the host, which is why baseline security policies disallow it.

Unless you are running a system-level daemon that needs this (e.g., a monitoring agent that needs to see all host processes, or a network plugin that needs host networking), you should leave these fields false (which is the default).

Example to avoid:

spec:
  hostNetwork: true
  hostPID: true
  hostIPC: true

Each of those should normally be false or not set at all. If you need one of them for a specific reason (say, hostNetwork for a networking pod), isolate that to a dedicated namespace or node and tightly control it. And never run general application pods with any of those enabled.

Kubernetes Pod Security Standards (Baseline and Restricted) disallow sharing host namespaces for exactly these reasons. Adhere to that: pods should live in their own namespaces, not the host’s.
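As an illustration, the three fields can be checked with a few lines of Python. This is only a sketch over an already parsed Pod spec, and the helper name shared_host_namespaces is made up for this example.

```python
HOST_NAMESPACE_FIELDS = ("hostNetwork", "hostPID", "hostIPC")


def shared_host_namespaces(pod_spec):
    """Return the host namespace fields that are explicitly set to true."""
    return [field for field in HOST_NAMESPACE_FIELDS if pod_spec.get(field) is True]


print(shared_host_namespaces({"hostNetwork": True, "hostPID": True, "hostIPC": True}))
print(shared_host_namespaces({"containers": [{"name": "app"}]}))
```

An empty result means the Pod keeps its own network, PID, and IPC namespaces, which is the default.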

6. Do not use Insecure Capabilities

Drop unnecessary Linux capabilities from your containers. By default, containers run with a limited set of Linux capabilities – these are like fine-grained permissions that the root user inside the container can have. Granting additional or “non-default” capabilities can be dangerous. Certain powerful capabilities (for example, SYS_ADMIN or NET_ADMIN) can allow a process in a container to perform actions that might lead to container escapes or privilege escalations on the node. As Google’s GKE security guidance notes: giving a container extra capabilities could allow it to break out of the container sandbox.

If you don’t explicitly drop capabilities, a container still has a small set of default capabilities. For better security, it’s best practice to drop all capabilities and only add back what you truly need. This adheres to the principle of least privilege. Kubernetes lets you specify this in the pod or container security context. For example:

securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]

In the above snippet, we drop everything and then only add NET_BIND_SERVICE (which allows binding to ports below 1024) as an example of a minimally required capability. The Kubernetes Restricted policy profile actually expects that containers drop ALL capabilities and, at most, add only a very limited set, like NET_BIND_SERVICE. Many common containers (especially web apps) do not require any special Linux capabilities to function.

Example to avoid:

securityContext:
  capabilities:
    add: ["NET_RAW", "SYS_ADMIN"]

Here, NET_RAW allows the container to create raw sockets (which could be abused for packet spoofing or sniffing), and SYS_ADMIN is an extremely privileged capability that [among other things] allows mounting file systems, configuring network interfaces, etc. An attacker could use these to escape the container and damage Kubernetes security.

If your application truly needs a specific capability, add only that one and carefully audit the implications. In general, try to run with as few capabilities as possible.
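The "drop ALL, add back only what you need" rule is easy to verify in code. The sketch below is a hypothetical check over one parsed container definition. The allowed set contains only NET_BIND_SERVICE, mirroring the Restricted profile as described above; adjust it to your own policy.

```python
# Assumption: only NET_BIND_SERVICE may be added back, as described for the Restricted profile.
ALLOWED_ADD = {"NET_BIND_SERVICE"}


def capability_findings(container):
    """Return human-readable problems with a container's capabilities settings."""
    capabilities = (container.get("securityContext") or {}).get("capabilities") or {}
    dropped = {cap.upper() for cap in capabilities.get("drop") or []}
    added = {cap.upper() for cap in capabilities.get("add") or []}
    findings = []
    if "ALL" not in dropped:
        findings.append("does not drop ALL capabilities")
    extra = sorted(added - ALLOWED_ADD)
    if extra:
        findings.append("adds capabilities: " + ", ".join(extra))
    return findings


good = {"securityContext": {"capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]}}}
bad = {"securityContext": {"capabilities": {"add": ["NET_RAW", "SYS_ADMIN"]}}}
print(capability_findings(good))
print(capability_findings(bad))
```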

7. Do not Disable or Override the AppArmor Profile

Avoid disabling or overriding AppArmor profiles for your containers. AppArmor is a Linux kernel security module that can confine what a container can do at the system level. On AppArmor-supported hosts, Kubernetes applies a default profile (runtime/default) to containers, which restricts certain actions. If you run a pod with an unconfined AppArmor profile, you are effectively turning off these protections. Running a container with an unconfined AppArmor profile is considered a bad security practice: AppArmor doesn't restrict the container at all, which increases the potential damage if an attacker compromises that container.

By default, if you don’t specify an AppArmor profile, the container runtime’s default policy is used (which is typically a reasonably safe profile that provides essential Kubernetes Security protection). You should not explicitly set the profile to unconfined (which disables AppArmor). Instead, allow the default or use a tailored profile if you have one. Kubernetes Pod Security policies recommend using either the runtime default or specific allowed profiles, and preventing any override to an unconfined state.

Example to avoid:

metadata:
  name: insecure-pod
  annotations:
    container.apparmor.security.beta.kubernetes.io/my-container: unconfined
spec:
  containers:
  - name: my-container
    image: alpine

In the above snippet, the annotation forces the container’s AppArmor profile to unconfined, disabling AppArmor confinement. Instead, you should either omit the annotation (to use the default profile) or set it to runtime/default (to explicitly use the default). This ensures AppArmor is enforcing some restrictions on what the container can access on the host.

8. Do not use a Non-Default /proc Mount

Ensure your containers use the default /proc mount behavior. In Linux, /proc is a virtual filesystem that exposes process and kernel information. Container runtimes normally mask or hide certain paths in /proc to prevent containers from seeing sensitive host information. Kubernetes has a procMount setting in the security context that can be either Default (the normal, masked behavior) or Unmasked. Do not use Unmasked for procMount unless you have a very good reason. An “unmasked” /proc means the container can see a lot more system info, which can lead to information leakage or even assist in a container escape.

Deployments with an unsafe /proc mount (procMount=Unmasked) bypass the default kernel protections. An unmasked /proc can potentially expose host information to the container, resulting in information leaks or providing an avenue for attackers to escalate privileges. For example, an Unmasked /proc might reveal details of processes running on the host or allow access to /proc/kcore (which could be dangerous). Unless you’re doing low-level debugging or monitoring that explicitly requires this (which is rare and usually better handled another way), you should not change the procMount from its default to maintain strong Kubernetes Security.

The best practice is simple: leave procMount as Default, which is also the Kubernetes default behavior if you don’t specify it. The Pod Security Restricted standards require that the /proc mask remain the default for all containers.

Example to avoid:

securityContext:
  procMount: Unmasked

In summary, do not unmask /proc. Keep the default masks that Kubernetes and the container runtime provide—they reduce the container’s visibility into the host’s processes and kernel.

9. Restrict Volume Types

Not all types of volumes are equal when it comes to security. Kubernetes supports many volume types (ConfigMaps, Secrets, persistent volumes, hostPath, NFS, emptyDir, etc.), but some can expose your Pod to risk. The Pod Security “Restricted” standard defines an allow-list of safe volume types that a pod can use. The Restricted policy allows only the following volume types: ConfigMap, CSI, DownwardAPI, emptyDir, Ephemeral (inline CSI volumes), PersistentVolumeClaim, Projected, and Secret. In practice, these are volumes that do not directly mount the host’s filesystem in an unsafe way.

The volume types not on that list (for example, hostPath, NFS, awsElasticBlockStore, and some others) are either inherently risky or are better handled via PersistentVolumeClaims. For instance, a hostPath volume mounts a directory from the node’s filesystem into your container, which can easily lead to container escapes or tampering with host files (hostPath is discussed in detail in section 3). NFS and other network storage volumes present a lower privilege escalation risk, but without proper management, they can enable denial-of-service or data tampering. To prevent this, manage them through the PersistentVolume subsystem instead of directly within a Pod spec.

Best practice: Use only the necessary volume types and prefer higher-level abstractions. If you need to mount storage, use PersistentVolumeClaim (with a proper StorageClass) instead of directly using a hostPath or other host-dependent volume. This way, the cluster can enforce storage isolation, and you avoid giving the container direct access to the host. You should pass most config data via ConfigMap or Secret volumes rather than baking it into images or using host paths.

To enforce this, consider enabling the Kubernetes Pod Security Admission controller in restricted mode for your namespaces. It will automatically forbid Pods that use disallowed volume types. In short, limit volume types to the safe set: ephemeral volumes (emptyDir, etc.), config/secret volumes, and PVCs backed by external storage. This reduces the risk of a container directly accessing the host or other unintended data sources, thereby strengthening Kubernetes Security.
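For a quick local check before the admission controller ever sees a manifest, the allow-list can be expressed directly in code. The following Python sketch assumes a parsed Pod spec and uses the Pod-spec field names for the volume types listed above. The helper name disallowed_volumes is hypothetical.

```python
# Field names of the volume sources that the Restricted profile allows.
ALLOWED_VOLUME_TYPES = {
    "configMap",
    "csi",
    "downwardAPI",
    "emptyDir",
    "ephemeral",
    "persistentVolumeClaim",
    "projected",
    "secret",
}


def disallowed_volumes(pod_spec):
    """Return (volume name, volume type) for volumes outside the allow-list."""
    findings = []
    for volume in pod_spec.get("volumes") or []:
        # Every key except "name" identifies the volume source (normally exactly one).
        for source in volume:
            if source != "name" and source not in ALLOWED_VOLUME_TYPES:
                findings.append((volume.get("name"), source))
    return findings


pod_spec = {
    "volumes": [
        {"name": "config", "configMap": {"name": "app-config"}},
        {"name": "host-files", "hostPath": {"path": "/etc"}},
        {"name": "share", "nfs": {"server": "nfs.example.internal", "path": "/data"}},
    ]
}
print(disallowed_volumes(pod_spec))
```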

10. Do not set Custom SELinux Options

Avoid specifying custom SELinux options for your pods unless you really know what you are doing. SELinux is another Linux kernel security mechanism (a Mandatory Access Control system) that labels resources and defines which processes can access which resources. Kubernetes, by default, will let the container runtime apply a default SELinux context to your container (usually a confined type like container_t). You have the option to override the SELinux context via the pod or container securityContext (seLinuxOptions field), but changing these labels can weaken isolation if done incorrectly.

The Snyk security blog warns: altering the SELinux labels of a container process could potentially allow that process to escape its container and access the host filesystem. In simpler terms, the default SELinux policy on a host prevents containers from seeing or modifying host files. If you override the SELinux type or role to something more privileged (or turn SELinux to permissive mode on the host), a compromised container might break out and read/write host files it shouldn’t.

Kubernetes’ Pod Security Standards reflect this by restricting SELinux options. The Restricted profile forbids setting a custom SELinux user or role, and only allows specific SELinux types (the standard container types like container_t or container_init_t). Unless you have a specific need (for example, integrating with a host that uses SELinux extensively and has custom policies), you typically won’t set these at all. Just let the container runtime apply the default confinement.

Example to avoid:

securityContext:
  seLinuxOptions:
    user: system_u
    role: system_r
    type: spc_t      # “spc_t” is a special type for super-privileged containers

The above would label the container in a very permissive way (depending on host policy, spc_t might allow broad access). This is not recommended unless absolutely required by your security team’s policy. In most cases, you should omit seLinuxOptions entirely. If you do need it, stick to the container types provided by your distribution (for example, on Red Hat-based systems, container_t is the confined type for containers).

In summary, do not override SELinux labels to something less restrictive. The defaults are there to keep your container constrained and maintain Kubernetes Security. Manage SELinux at the cluster/node level rather than per Pod unless you’re an SELinux expert with a clear goal. And if SELinux is too much overhead, consider using AppArmor or seccomp to add security, but never make the container more privileged than the defaults.

11. Leave the Seccomp Profile at Its Default

Enable a seccomp profile for your containers (or use the default one); do not run containers with seccomp turned off (“unconfined”). Seccomp (secure computing mode) is a Linux kernel feature that can filter system calls that a process is allowed to make. Kubernetes lets you specify a seccomp profile for pods/containers. If you don’t specify anything, historically many runtimes would run the container as unconfined (no filtering), but newer Kubernetes versions and runtimes often apply a default seccomp profile (e.g., Docker’s default seccomp profile) automatically. Regardless, you want seccomp filtering in place.

Running a container with seccomp=Unconfined places no restrictions on syscalls: the container can make any system call it wants, which broadens the attack surface. In contrast, the default seccomp profile blocks dozens of dangerous syscalls that containers typically never need (like manipulating kernel modules, system clocks, etc.). These blocked calls are often those that could be used to break out or do harm to the host.

Best practice: use the RuntimeDefault seccomp profile (Kubernetes’ way of saying “use the container runtime’s default seccomp policy”) or a specific custom profile if you have one. The Kubernetes Restricted policy requires that seccomp be explicitly set to either RuntimeDefault or a named profile, and not left as Unconfined, to maintain strong Kubernetes Security.

This ensures the container is running under seccomp confinement. If you wanted to use a custom profile, you’d set type: Localhost and provide the profile file, but that’s an advanced scenario. The main thing is – don’t set seccompProfile type to Unconfined. For example:

securityContext:
  seccompProfile:
    type: Unconfined

The above would explicitly disable syscall filtering, exposing you to exploits that leverage obscure syscalls. There have been real-world container breakouts that depended on having seccomp off, so turning it on is a simple way to mitigate whole classes of kernel vulnerabilities.

Unless you have a specific container that is failing due to seccomp (in which case, consider adjusting the profile rather than removing it entirely), you should always use seccomp. It’s a transparent layer of defense with little to no performance cost in typical applications.
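Because seccompProfile can be set both on the Pod and on individual containers, a check has to resolve the effective value first. The Python sketch below assumes parsed manifests and relies on the Kubernetes rule that a container-level setting overrides the pod-level one. The function names are hypothetical.

```python
def effective_seccomp_type(pod_spec, container):
    """Return the seccomp profile type that applies to the container, or None if unset."""
    # The container-level securityContext takes precedence over the pod-level one.
    for context in (container.get("securityContext") or {}, pod_spec.get("securityContext") or {}):
        profile = context.get("seccompProfile") or {}
        if profile.get("type"):
            return profile["type"]
    return None  # nothing set explicitly: behavior depends on the runtime and kubelet


def seccomp_ok(pod_spec, container):
    """True only when seccomp is explicitly set to RuntimeDefault or Localhost."""
    return effective_seccomp_type(pod_spec, container) in ("RuntimeDefault", "Localhost")


pod_spec = {
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [
        {"name": "app"},
        {"name": "debug", "securityContext": {"seccompProfile": {"type": "Unconfined"}}},
    ],
}
for container in pod_spec["containers"]:
    print(container["name"], seccomp_ok(pod_spec, container))
```

In this example the app container inherits RuntimeDefault from the Pod, while the debug container overrides it with Unconfined and fails the check.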

12. Beware of using Insecure Sysctls

Do not enable unsafe sysctls in your Pods. Sysctls (system controls) are kernel parameters that can tweak networking, memory, and other settings. Kubernetes classifies sysctls into safe and unsafe. Safe sysctls are those that are namespaced to the container or pod (meaning their effects are limited to that pod) and isolated from the host. Unsafe sysctls are those that apply to the entire host kernel and could affect all pods or even compromise security. Examples of unsafe sysctls might include things like kernel.shmmax (which affects kernel shared memory limits globally) or net.ipv4.ip_forward (which could change node-level networking behavior).

Enabling unsafe sysctls can disable important security mechanisms or negatively impact the node’s stability. They might allow a pod to consume resources beyond its limits or interfere with other pods. In the worst case, a bad actor could use an unsafe sysctl to panic the kernel or elevate privileges.

Kubernetes by default will prevent pods from using unsafe sysctls unless the cluster admin has explicitly allowed them (through the kubelet’s --allowed-unsafe-sysctls flag). The best practice is to stick to the safe sysctls. According to the Pod Security Standards, you should disallow all but an allowed safe subset of sysctls. Safe sysctls include a handful of names like net.ipv4.ip_local_port_range, net.ipv4.tcp_syncookies, net.ipv4.ping_group_range, etc., which are known not to break isolation.

If you find yourself needing to set a kernel parameter for your application to run, double-check if it’s truly namespaced. For example, increasing kernel.shmmax for a database – rather use a proper mechanism or ensure it’s allowed, because that setting affects the host kernel’s shared memory allowance for all processes.

Example: The following Pod securityContext shows setting a sysctl:

securityContext:
  sysctls:
  - name: kernel.shmmax
    value: "16777216"

This particular sysctl (kernel.shmmax) is not namespaced per pod – it would raise the shared memory segment size limit on the host kernel itself. This could impact other pods or processes on the host. Such a sysctl is considered unsafe and would be rejected by Kubernetes unless the cluster is configured to allow it (and it generally shouldn’t be). In contrast, a “safe” sysctl like net.ipv4.tcp_syncookies could be set in a pod’s spec if needed, because it’s isolated to the pod’s network namespace.

In summary, avoid using sysctls that are not explicitly documented as safe for Kubernetes. If you absolutely require an unsafe sysctl for a specialized application, you’ll need a waiver in the cluster, and you should isolate that workload as much as possible. For everyone else – stick to defaults; don’t turn your pods into mini kernel tweakers. The default kernel settings are usually fine, and if not, they should be tuned on the host by admins, not on a per-pod basis by application owners.
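A simple way to apply this rule is to compare requested sysctls against an allow-list. The Python sketch below is an illustration only: the set contains the three safe names mentioned above plus kernel.shm_rmid_forced and net.ipv4.ip_unprivileged_port_start from the Kubernetes documentation. The exact safe set depends on the Kubernetes version, so treat it as an assumption to verify against your cluster.

```python
# Assumption: a version-dependent subset of the sysctls Kubernetes documents as safe.
SAFE_SYSCTLS = {
    "kernel.shm_rmid_forced",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.ip_unprivileged_port_start",
    "net.ipv4.ping_group_range",
    "net.ipv4.tcp_syncookies",
}


def unsafe_sysctls(pod_spec):
    """Return names of sysctls in the pod securityContext that are not in the safe set."""
    sysctls = (pod_spec.get("securityContext") or {}).get("sysctls") or []
    return [s.get("name") for s in sysctls if s.get("name") not in SAFE_SYSCTLS]


pod_spec = {
    "securityContext": {
        "sysctls": [
            {"name": "kernel.shmmax", "value": "16777216"},
            {"name": "net.ipv4.tcp_syncookies", "value": "1"},
        ]
    }
}
print(unsafe_sysctls(pod_spec))
```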

Your Kubernetes Security Action Plan

By adhering to these 12 Kubernetes security best practices, you significantly harden your Kubernetes cluster’s security. Many of these boil down to least privilege – giving your pods only the access they truly need and nothing more.

It’s a good idea to integrate these checks into your development and CI/CD process. For the best IDE integration, give the Kubernetes Security plugin for JetBrains IDEs a try. It is written in pure Kotlin and utilizes the features of the IntelliJ platform. Its on-the-fly checks make shift-left happen. If you are interested in IDE plugin development, read my article "How I made Docker linter for IntelliJ IDEA".

Don’t miss my new articles—follow me on LinkedIn!

Using SBOMs to detect possible Dependency Confusion

Dmitry Protsenko — Fri, 15 Aug 2025 20:08:20 +0000

Software supply chains have become a focal point for attackers, as modern applications rely heavily on third-party and open-source dependencies. Organizations are adopting Software Bill of Materials (SBOM) documents to gain visibility into their software components. In this article, we explore SBOMs, why the CycloneDX format is recommended (and how it compares to SPDX), and how an SBOM can be leveraged to detect and prevent dependency confusion in your projects.

What is SBOM?

A Software Bill of Materials (SBOM) is a detailed, formally structured inventory of all components that make up a software product. It lists all libraries, packages, and modules — open-source and proprietary — and includes metadata such as versions, licenses, hashes, and dependency relationships. An SBOM enables:

Vulnerability management — quickly identify components with known security flaws
License compliance — track and verify open-source license obligations
Supply chain transparency — improve auditability and trust in software

Several standards exist for representing SBOMs, the most common being SPDX and CycloneDX.

Why Use CycloneDX for SBOMs?

CycloneDX and SPDX are both industry-approved SBOM formats with distinct strengths:

Primary focus:

CycloneDX: Application security, vulnerability workflows, and supply-chain risk analysis
SPDX: License compliance, provenance, and legal auditing

Metadata depth:

CycloneDX: Detailed component relationships, Vulnerability Exploitability eXchange (VEX) data, hashes, and custom metadata
SPDX: Comprehensive license expressions, file checksums, and package provenance

Schema and tool support:

CycloneDX: Lightweight JSON/XML, optimized for security tooling (Trivy, Syft, Dependency-Track)
SPDX: RDF/JSON, widely adopted in open-source communities and legal workflows

Extensibility:

CycloneDX: VEX (Vulnerability Exploitability eXchange), attestations, and custom data fields
SPDX: Provenance tracking, license exception clauses, and rich license metadata

Ecosystem adoption:

CycloneDX: Rapid growth across DevSecOps pipelines, strong support for automated security scans
SPDX: Established in open-source governance, favored for compliance and audit use cases

Why prefer CycloneDX?

Built with security use cases in mind: supports vulnerability (VEX) and attestation data.
Simplified data model: easier integration into CI/CD and parsing by security tools.
Rapidly growing ecosystem: many scanners and platforms generate or consume CycloneDX.

However, SPDX remains valuable where license compliance is paramount.

What is Dependency Confusion?

Dependency confusion is a supply chain attack that delivers malicious code into the development environment. Developers use package managers to download packages (dependencies) and build projects: they add packages to the build file and run an install. By default, packages are downloaded from public repositories, most commonly npmjs.com and PyPI.org, where anyone can register and publish packages. If an attacker publishes a public package under the same name as an internal one, the package manager may fetch the attacker's package instead of the internal one.

For more details, see my earlier article: Dependency Confusion Detection & Mitigation.

Using SBOM to Detect Dependency Confusion

An SBOM enables a systematic check for name collisions between internal dependencies and public registries:

Inventory: Generate an SBOM listing all dependencies (including private/internal ones).
Identify internals: Determine which components should be internal based on naming conventions or repository metadata.
Cross-check: Query public package registries (e.g., PyPI, npm) for each internal package name.
Flag collisions: Any internal name found publicly is a potential dependency confusion risk.
Review and remediate: Confirm the legitimacy of public packages, tighten registry configurations, reserve names, or rename internal packages.

SBOM Components and PURLs (Package URLs)

CycloneDX defines components (dependencies) and uses Package URLs (purls) — a standardized identifier — to describe each component. A purl has the format:

pkg:<type>/<namespace>/<name>@<version>

For example:

pkg:npm/lodash@4.17.21

PURLs enable automated registry lookups by indicating the ecosystem (npm, pypi, etc.), package name, and version. Parsing purls is the first step in programmatically checking registry existence.
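As a rough illustration of that first step, here is a small Python parser for the common parts of a purl. It is a simplified sketch, not a full implementation of the purl specification: qualifiers and subpaths are dropped, and the names Purl and parse_purl are made up for this example.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Purl(NamedTuple):
    ecosystem: str            # e.g. "npm" or "pypi"
    namespace: Optional[str]  # e.g. "@myorg" for scoped npm packages
    name: str
    version: Optional[str]


def parse_purl(purl):
    """Parse pkg:<type>/<namespace>/<name>@<version>, ignoring qualifiers and subpath."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a purl: {purl!r}")
    rest = purl[len("pkg:"):].lstrip("/")
    rest = rest.split("#", 1)[0]  # drop the subpath
    rest = rest.split("?", 1)[0]  # drop the qualifiers
    # An "@" after the last "/" separates the version from the name.
    at, last_slash = rest.rfind("@"), rest.rfind("/")
    if at > last_slash:
        path, version = rest[:at], unquote(rest[at + 1:])
    else:
        path, version = rest, None
    segments = path.split("/")
    if len(segments) < 2 or not segments[-1]:
        raise ValueError(f"purl has no package name: {purl!r}")
    namespace = unquote("/".join(segments[1:-1])) or None
    return Purl(segments[0].lower(), namespace, unquote(segments[-1]), version)


print(parse_purl("pkg:npm/lodash@4.17.21"))
print(parse_purl("pkg:npm/%40myorg/internal-lib@1.0.0"))
```

For scoped npm packages the registry name is the namespace and the name joined with a slash, for example @myorg/internal-lib.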

Generating SBOMs with Trivy

Trivy is a widely adopted vulnerability scanner that also generates CycloneDX SBOMs:

Scan filesystem/project: trivy fs --format cyclonedx -o sbom.json ./
Scan container image: trivy image --format cyclonedx -o sbom.json alpine:latest
Scan an existing SBOM for vulnerabilities (recent versions): trivy sbom sbom.json

Integrate Trivy into your CI/CD to produce up-to-date SBOMs for every build.

Detecting Dependency Confusion with SBOM Analysis

Combine analysis and automation under one workflow:

Parse SBOM JSON: Extract the components array.
Extract purls: For each component, parse the purl to get the ecosystem and name.
Query public APIs:

PyPI: GET https://pypi.org/pypi/<name>/json → 404 means not found.
npm: GET https://registry.npmjs.org/<name> → 404 means not found.

Classify results:

Not found: internal package name unclaimed publicly (reserve name).
Found: check if it’s expected (open-source) or unexpected (collision).

Automate: Use a script (e.g., Python) to iterate purls, perform HTTP checks, and generate a report.

This combined approach ensures repeatable, scalable, and CI-integrated detection. The fully automated solution is available in the dedicated GitHub repository.
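To make the workflow concrete, here is a minimal Python sketch of the steps above. It is not the code from the repository mentioned here. The function names, the report keys, and the way internal names are supplied (a plain set) are assumptions for illustration. The registry lookup is passed in as a parameter, so the example can run with a fake lookup instead of real HTTP calls.

```python
import urllib.error
import urllib.request
from urllib.parse import quote, unquote

REGISTRY_URLS = {
    "pypi": "https://pypi.org/pypi/{name}/json",
    "npm": "https://registry.npmjs.org/{name}",
}


def split_purl(purl):
    """Return (ecosystem, full package name) from a purl, ignoring the version."""
    body = purl[len("pkg:"):].split("#")[0].split("?")[0]
    ecosystem, _, rest = body.partition("/")
    if rest.rfind("@") > rest.rfind("/"):  # an "@" after the last "/" starts the version
        rest = rest[: rest.rfind("@")]
    return ecosystem.lower(), unquote(rest)


def exists_publicly(ecosystem, name):
    """Real HTTP check: 200 means found, 404 means not found."""
    url = REGISTRY_URLS[ecosystem].format(name=quote(name, safe="@"))
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # rate limits or outages must not be read as "not found"


def classify(sbom, internal_names, exists=exists_publicly):
    """Sort SBOM components into report buckets based on public registry presence."""
    report = {"collision": [], "reserve_name": [], "expected_public": [], "not_found": []}
    seen = set()
    for component in sbom.get("components", []):
        purl = component.get("purl", "")
        if not purl.startswith("pkg:"):
            continue
        ecosystem, name = split_purl(purl)
        if ecosystem not in REGISTRY_URLS or (ecosystem, name) in seen:
            continue
        seen.add((ecosystem, name))
        found = exists(ecosystem, name)
        if name in internal_names:
            report["collision" if found else "reserve_name"].append(name)
        else:
            report["expected_public" if found else "not_found"].append(name)
    return report


sbom = {"components": [
    {"name": "requests", "purl": "pkg:pypi/requests@2.31.0"},
    {"name": "acme-utils", "purl": "pkg:pypi/acme-utils@1.0.0"},
    {"name": "acme-core", "purl": "pkg:pypi/acme-core@0.3.0"},
    {"name": "@acme/ui", "purl": "pkg:npm/%40acme/ui@1.2.3"},
]}
internal = {"acme-utils", "acme-core", "@acme/ui"}
public = {"requests", "acme-utils"}  # stand-in for the public registries
report = classify(sbom, internal, exists=lambda ecosystem, name: name in public)
print(report)
```

Calling classify without the exists argument uses the real registry endpoints listed above. In a CI job, a non-empty collision list would be a reason to fail the build or raise an alert.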

Best Practices for Preventing Dependency Confusion

Drawing on strategies from my Dependency Confusion Detection & Mitigation (April 2025) article:

Reserve internal package names on public registries by publishing a placeholder or dummy package (prevents attackers from squatting).
Use scoped or uniquely namespaced packages (e.g., @myorg/internal-lib on npm) to avoid generic collisions.
Enforce registry authentication and pinning in build pipelines — limit installs to authenticated private registries when fetching internal dependencies.
Implement periodic SBOM-based audits to detect new name collisions and dependency anomalies.
Monitor package version inflation: if a public package suddenly jumps to a high version, treat it as suspicious.

These practices, linked with SBOM detection workflows, create a robust defense-in-depth against dependency confusion.

At the end

SBOMs offer critical transparency into software dependencies. By generating SBOMs in CycloneDX format (e.g., via Trivy) and performing automated analysis of public registry presence, teams can effectively detect and mitigate dependency confusion risks. Expanding SBOM-based checks into CI/CD and adopting best practices — reserving names and enforcing private registry pinning — ensures proactive defense against evolving supply chain attacks. Stay vigilant and integrate SBOM analysis into your security processes to keep your codebase safe.

Dev Diary #2: Cloud Security plugin for JetBrains IDE

Dmitry Protsenko — Sat, 02 Aug 2025 20:49:35 +0000

Almost a year has passed since I began developing the plugin to improve the security of Infrastructure as Code files. Many rules were implemented, especially for Docker and Dockerfiles, and many lessons were learned. This week, I found new energy to begin delivering the next milestone in the plugin’s lifecycle. I started implementing Kubernetes rules to align with the NSA Kubernetes Hardening Guide.

I have always postponed implementing rules to analyze YAML files because it was painfully boring. I thought there wasn’t an API to implement it easily, just raw PSI analysis. Then I found useful classes and methods in the YAML plugin and wrote a simple YAML-path engine to find elements in the text more comfortably. This approach helped me rewrite some smelly parts of the Docker Compose analysis and start diving deeper into Kubernetes.

Everything started from another event in my life: I needed to learn more about Kubernetes Security, so I started research on this subject and found Kubescape. This tool contains a lot of rules to analyze the security of Kubernetes objects, and it has an implementation for VS Code but not for JetBrains IDE. I thought, “hm, okay, I could implement these rules in my plugin.”

Kubescape uses a library of rules that are implemented with Open Policy Agent in Rego rules. Rego rules could be compiled to WASM, and here was a field for experiments. I tried to use opa-java-wasm to integrate Kubescape rules as is. While I experimented with this approach, I found a bug that was fixed by the maintainer. I really appreciate how fast it was fixed.

There were core problems with that approach. To analyze YAML files, they must be converted to JSON first, and while the user is typing, the YAML document is often not parsable, which raises serialization errors. This killed the approach, because all implemented inspections should work on the fly without those problems. The next problem is that specific elements in the YAML must be highlighted. With text-based output, it is hard to find which elements should be highlighted, and there were corner cases. The code with these experiments is stored in a dedicated branch.

After that, I started looking for features that could improve my developer productivity, and I finally successfully implemented the first Kubernetes rule.

Alongside the new YAML-based rules, I did a rebranding. At the start, the plugin was called Infrastructure Security, one of its first names. Now, it is called Cloud (IaC) Security. This name is shorter, more laconic, and more modern. I think it should work better for users who find the plugin by searching the marketplace.

A new logo was designed for the plugin. It is now the plugin’s mascot: a dog named “Jessica”. She was my dog for many years, but then our ways parted. The logo is based on her pictures, combined into a cyber style with the help of AI and redrawn by an artist in vector format. This step is a way to keep her memory.

My goal is to ship at least one new Kubernetes rule or quick fix every week, so development stays steady and predictable. If you haven’t installed the plugin yet, give it a try—I’d love your feedback. And if you’d like a heads-up whenever a new blog post drops, follow me on LinkedIn: https://www.linkedin.com/in/protsenkodev/

Revival Hijacking: How Deleted PyPI Packages Become Threats

Dmitry Protsenko — Sat, 02 Aug 2025 20:09:10 +0000

You can find the original version of this article on my website, protsenko.dev.

Hi guys, recently I continued researching malicious dependencies and dependency confusion.

This article was inspired by an article from 2024 and again raises concerns about the safety of downloading packages from public repositories without additional controls.

PyPI.org allows authors to remove their packages, after which the package names become available for registration by anyone. By registering the names of formerly popular packages, attackers could gain many downloads of malicious libraries.

This technique is called revival hijacking. In this article, I look for popular removed packages and exploit this vector by publishing stub packages under their names to see how often they get downloaded.

Information about deleted and revived packages can be found in the dedicated GitHub project.

It is updated on a daily schedule to provide fresh lists of these packages and is a result of work on this article.

Harvesting the data

How many packages were deleted from PyPI.org? To answer that question, we need to find out what package names are available currently and what was removed, analyze how many deleted packages were downloaded, and identify who depends on them.

You need to use the official index API to harvest currently available package names. This API produces all the package names in the format you want. I consumed it in JSON for further processing. This method doesn’t provide 100 percent data validity, but all the “removed” packages could be checked for presence in the public registry.

The next step is harvesting every package name that has ever appeared on PyPI, including removed ones. For this, I used the ClickPy service, which stores downloads and other valuable analytics about Python packages.

ClickPy allows you to run custom SQL queries on their ClickHouse instance. With that possibility, we can dump a table with PyPI downloads to get information about every package name and even its download count. The download count is valuable information because it lets us decide whether a package was popular.

By default, ClickHouse is limited to querying 100k rows per query, and you need to use pagination to retrieve them all. It’s around eight queries to dump the whole table. You’ve got to install ClickHouse CLI or use another connection method to save results locally.
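The pagination described above could be sketched like this. It only builds the SQL text for each page; the table name pypi_downloads_per_package and the columns project and count are hypothetical placeholders, so check the real ClickPy schema before running anything. The total row count is also an assumed input.

```python
def build_page_queries(total_rows: int, page_size: int = 100_000) -> list[str]:
    """Build one LIMIT/OFFSET query per page.

    Table and column names are hypothetical; adjust them to the real schema.
    """
    queries = []
    for offset in range(0, total_rows, page_size):
        queries.append(
            "SELECT project, sum(count) AS downloads "
            "FROM pypi_downloads_per_package "  # hypothetical table name
            "GROUP BY project "
            "ORDER BY project "  # a stable order keeps pages from overlapping
            f"LIMIT {page_size} OFFSET {offset}"
        )
    return queries


if __name__ == "__main__":
    # Assumed total of 800k rows, which gives eight pages of 100k rows each.
    for query in build_page_queries(800_000):
        print(query)
```

Each generated query can then be passed to the ClickHouse CLI or any other client, with the results appended to a local file.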

After a closer look at the data quality, I found that some package names were invalid or trimmed. This means that my results are not 100 percent clean. To get better data, you could retrieve package names via BigQuery tables from deps.dev.

Analyzing collected data

We have all the necessary data for this step. Let’s figure out how many packages were removed and could be squatted.

The formula to find it is easy: deleted packages = packages from ClickPy - packages from PyPI. The answer to that question is 91752. That is a really large number of packages that were removed and could be squatted. Let’s aim for only the popular ones.

It’s better to skip packages that have fewer than a thousand downloads, leaving only 15467 packages with 233 million downloads. This download count is huge, but it’s worth mentioning that this download counter could be faked.
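The formula and the popularity filter above boil down to a set difference plus a threshold. Here is a minimal sketch, assuming the ClickPy dump is already loaded as a name-to-downloads dictionary and the PyPI index as a set of names; the function name and the sample data are made up for illustration.

```python
def find_deleted(
    seen_downloads: dict[str, int],
    live_names: set[str],
    min_downloads: int = 1000,
) -> dict[str, int]:
    """Return packages that were seen in download stats but are gone from the index."""
    deleted = {
        name: downloads
        for name, downloads in seen_downloads.items()
        if name not in live_names
    }
    # Skip packages with fewer downloads than the threshold.
    return {name: n for name, n in deleted.items() if n >= min_downloads}


if __name__ == "__main__":
    # Made-up sample data, not real PyPI statistics.
    seen = {"alive-pkg": 50_000, "gone-popular": 12_000, "gone-tiny": 40}
    live = {"alive-pkg"}
    print(find_deleted(seen, live))
```

Keep in mind that the result is only as clean as the inputs: trimmed or invalid names in the dump end up in the deleted set too.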

During data analysis, I had an idea to build graphs of removed packages and their dependents to find something interesting. Jumping ahead: I did find such packages. Some of them were valid cases, some of them were not.

You could collect all the necessary data using the following ideas:

Find information on who is dependent on removed packages using clickpy.
Find information about every published package and its dependencies to find removed dependencies in it. For that, you could use deps.dev or the PyPI API.

At first, I tried using ClickPy, as it provided data about packages that depend on the removed ones. However, after dumping the necessary information, I found that it was not fully reliable, as it contained data about non-existent dependencies, so the next numbers will be a bit dirty.

Based on ClickPy data, only 520 packages have 1405 dependents, which gives us 228 billion downloads. That means that some existing packages already have a transitive dependency that could be hijacked to deliver malicious code. Some of these removed packages contain prohibited keywords or refer to a Python built-in function and couldn’t be registered.

Exploitation

My next stage in the research was publishing stub packages on PyPI. PyPI.org doesn’t allow security research on its platform. During my research, my accounts, which I used for publishing, were disabled, but packages were still accessible. This gap between disabling and removing packages gave me time to collect analytics to present to the public.

Please do not repeat this activity after me, as it burdens support with detecting and removing stub packages. As a conscientious security researcher, I contacted support about the future of my accounts and asked them to let me remove all the published packages myself. I promised them not to repeat this activity.

Okay, let’s go back to the exploitation. I took the most downloaded deleted packages. Some of them I validated manually via GitHub search, and I even found dependency confusion risks in popular projects, which were fixed after I messaged the maintainers. Some of them I took only by their download count without validation.

You could publish only 20 packages in an hour, so with this limit, I was able to publish 168 packages from two accounts. Those squatted packages had 45 million downloads in the past; that is a huge number, so someone will probably download these squatted packages.

Publishing was iterative and not consistent. Research continued for one week after the first batch of packages was published and ended with the removal of packages. During the research, squatted packages were downloaded 32,036 times.

There were leaders in downloads:

febolt — 3056
flatpack — 2456
chiquito — 1097
spl-transpiler — 750
chalk-harness — 605

Every published package raised the following error on installation:

Installation terminated!

This is a stub package intended to mitigate the risks of dependency confusion.

It holds a once-popular package name removed by its author (or for other reasons, such as security).

This is package not intended to be installed and highlight problems in your setup.

Read more: https://protsenko.dev/dependency-confusion

The last link refers to the article that describes dependency confusion attacks and ways to mitigate them. This page received eight times more views than before. Real users came to the page to read more about this problem.

The number of downloads was insane. Due to a lack of moderation and the possibility of squatting removed packages, users who installed a stub package could have been infected if it had carried malicious code.

Mitigating the problem

How can you avoid this? Dependencies should be proactively checked for presence in public repositories. If your projects use internal package repositories with mirrored packages, those packages could be validated against PyPI.org or deps.dev for existence.

If you find a package that no longer exists publicly, you should deal with it or verify that it resolves from your internal repositories, to prevent downloading a higher version from the public registry.
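As a rough illustration of such a check, the sketch below asks the public PyPI JSON API (https://pypi.org/pypi/NAME/json) whether each dependency name still exists; a 404 response means the name is free and could be squatted. The helper names are mine, and the existence check is injectable so you can swap in deps.dev or an internal mirror.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, Iterable


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def exists_on_pypi(name: str) -> bool:
    """Return True if PyPI knows the project, False if it answers 404."""
    url = f"https://pypi.org/pypi/{normalize(name)}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # other errors (rate limits, outages) should not be read as "missing"


def find_missing(
    names: Iterable[str],
    exists: Callable[[str], bool] = exists_on_pypi,
) -> list[str]:
    """Return normalized names that the registry does not know."""
    return sorted({normalize(name) for name in names if not exists(name)})


if __name__ == "__main__":
    # Performs real HTTP requests; the dependency names are placeholders.
    print(find_missing(["requests", "some-internal-only-package"]))
```

A name that shows up in the result is not a problem by itself, but it means anyone could register it tomorrow, so it is worth pinning that dependency to your internal repository.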

As a source of information about deleted or revived packages, you can use the package name lists from the Deleted & Revived PyPI Package Indexes.

Packages removed from PyPI.org are a critical problem, as their names could be used to deliver malicious code directly into a closed environment.

At the end

It was a pleasant journey to play with data from ClickHouse, BigQuery, and PyPI, along with the deps.dev API. There are more attack vectors that should be investigated, but for now, the article is ending.

Follow me on LinkedIn to learn about new articles. Links are in the footer of the page.

Dev Diary #1 – Google Agent Development Kit: Lessons I Learned

Dmitry Protsenko — Mon, 05 May 2025 16:00:00 +0000

This diary-tutorial hybrid tracks my first months with Google’s Agent Development Kit (ADK) — pure experience, insight sharing, and nothing else. The ADK is Google’s open-source framework for designing, chaining, and shipping autonomous AI agents. If you are unfamiliar with this framework, read my beginner guide to build AI agents quickly.

Lessons I learned

You don’t need a separate AI agent for each task. An agent can call as many tools as it needs. The AI independently decides which tool to use and when. Write a precise prompt and correctly declare which tools the AI can access and when they should be used.

You should care about what you save in the session state. AI agents can save their output to the session state, and other agents in your AI pipeline can access it. While developing, open the web UI in the browser to check that the AI agent correctly saved information in the context. If the context is garbage, every later step is garbage.

You should work with the session state correctly. Defining the output key for an agent saves only the agent’s response in the session state under this key. Even if you write instructions in the prompt like “save A to state B”, it won’t be saved. Another problem with the session state: if you instruct the AI agent to use data from the session state, but the data is missing, the AI will make something up instead of raising an error, hiding the problem somewhere deep.

For a basic example, one agent used a tool to generate numbers in the session context, and another returned a generated number from the session context. What could go wrong?

import random

from google.adk.agents import LlmAgent, SequentialAgent

GEMINI_MODEL = "gemini-2.5-flash-preview-04-17"

def get_random_number():
    return random.randint(0, 100)

agent_A = LlmAgent(
    name="random_number_provider_agent",
    model=GEMINI_MODEL,
    output_key="random_number",
    instruction="Return random number between 0 and 100 using get_random_number.",
    tools=[get_random_number]
)

agent_B = LlmAgent(
    name="random_number_consumer_agent",
    instruction="Return random number between 0 and 100 from session state under key random_number.",
    model=GEMINI_MODEL,
)

root_agent = SequentialAgent(
    name="sequential_agent",
    sub_agents=[agent_A, agent_B],
)

As you can see, the agent works as expected. The number was generated and returned. Let’s see what happens if we remove the first agent that generated the number. Reminder: Agent B expects random_number in state. As Agent A was removed, there is no data in the session.

The AI agent returned a number and pretended the data was in the state, but it wasn’t. This is a fundamental problem: AI agents can hallucinate and lie to us, silently invalidating the output of the whole pipeline from the very first steps. You could add validation to the prompt, but you would have to do that each time you use the session state to share data with other agents.
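One way to catch this earlier is to validate the state in plain code instead of trusting the prompt. The sketch below models the session state as an ordinary dictionary and fails loudly when an expected key is missing; the helper name is hypothetical, and wiring it into an ADK pipeline is left out on purpose.

```python
def require_state_keys(state: dict, required_keys: list[str]) -> None:
    """Raise an error if any expected key is missing or empty in the state."""
    missing = [
        key
        for key in required_keys
        if key not in state or state[key] in (None, "")
    ]
    if missing:
        raise KeyError(f"Missing session state keys: {', '.join(missing)}")


if __name__ == "__main__":
    # State after Agent A has run: the check passes silently.
    require_state_keys({"random_number": 42}, ["random_number"])

    # State when Agent A was removed: the check fails instead of letting
    # Agent B invent a number.
    try:
        require_state_keys({}, ["random_number"])
    except KeyError as error:
        print(error)
```

A deterministic check like this makes the missing-data case visible immediately instead of hiding it behind a plausible-looking answer.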

Writing prompts is challenging. Sometimes, I spend many hours on simple tasks that I could code in Kotlin in 10 minutes. The output of AI agents can be unpredictable: classic programming tools behave deterministically, but AI agents do not. You should carefully instruct AI agents on how to act in different situations, including corner cases, as you do with classic tools.

You could ask what the problem is: prompting is just one more skill you should have. Yes, I know, but there are no automatic tools to validate your prompts the way syntax checkers do in an IDE. It is just plain English text and nothing else: no modern linters, no static analysis, no validation, and no easy way to cover each prompt with unit tests to make results predictable.

At the end

I’m not an AI agent expert. I’m just a software engineer passionate about technology. I find AI agent development exciting, but sometimes I struggle. This is my first step in this field, and I hope to share more insights with you.

Please share your comments and feedback; it means a lot to me.

Building AI Agents to Prioritize CVEs — A Google ADK Guide

Dmitry Protsenko — Wed, 23 Apr 2025 18:19:54 +0000

In this story, we will create our first AI agents using Agent Development Kit. AI agents will be integrated with Google OSV, MITRE, KEV, and a bit of Google search. AI agents will enrich data about given vulnerabilities with public data from different sources to help prioritize (triage) problems.

Will these AI agents help prioritize vulnerabilities? In their current state, they probably do not, but they provide summarized information about vulnerabilities that could help you understand how critical they are.

This story was written for educational purposes and to enjoy learning something new. I assume you have some basic programming skills in Python and are familiar with AI.

Preparing Environment

Download PyCharm to save your nerves with dependency management and save time with built-in local LLM bundled with a paid license.
Create a Python project with UV support. This helps lock the project’s dependencies, making your build reproducible. You could use anything that you find comfortable.
Install the required dependencies for Google ADK via the terminal or the user interface in PyCharm. I used the latter.

# installation
pip install google-adk
# verify
pip show google-adk

Defining Project Structure

The project structure is important for Google ADK. The root directory will contain an environment file (.env) that stores API keys and a folder for each agent, where the folder name is the agent name.

Your project structure should be in the following format:

project_folder/
    vulnerability_agent/
        __init__.py # just initializer
        agent.py # your agent logic will be here
    .env # place for storing API key

You need an API key to use Google Models. What you should do:

Get a key from Google AI Studio — it is free.
Open the .env file located inside project_folder and paste the following code.

GOOGLE_GENAI_USE_VERTEXAI=FALSE
GOOGLE_API_KEY=PASTE_YOUR_ACTUAL_API_KEY_HERE # <- paste your API key here

Building Agent

Let’s start with foundational knowledge.

An Agent in Google ADK is an instance of a class with a name, instructions, and additional tools that it can use. To create an agent, create an instance of the class (LlmAgent), name it root_agent, and start the web console with a simple command.

adk web # for web ui (i prefer this)
adk run vulnerability_agent # for cli

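# agent.py
from google.adk.agents import LlmAgent

# Assumption: any Gemini model name available to your API key works here.
GEMINI_MODEL = "gemini-2.5-flash-preview-04-17"
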
root_agent = LlmAgent(
    name="i_am_just_agent_smith",
    model=GEMINI_MODEL,
    instruction="""
        You are a lazy AI agent.
        Just do nothing.
    """
)
# __init__.py
from . import agent

Defining Agents Requirements

Ok, this was so easy. Let’s define our requirements and the AI agent’s logic:

We should collect and save information from three different APIs in the session.
We should find some information on Google.
We should summarize all the given information and provide feedback.

Using Tools

Agents can use tools. Tools are Python functions that agents can call. There are different kinds of tools, but I will use Function and Built-in tools in this guide.

Function tools are just simple functions defined in your code, with no hidden definitions. Do you want to know what Built-in tools are? They are already included in ADK!

Creating Function Tools

You could use tool output in your Agent instructions to save, summarize, or do whatever you want; in my case, I will save it in session context.

My tools will retrieve data from specified endpoints to enrich data about vulnerabilities; the code below is just an example of integrating Google OSV. It is nothing special; retrieving data from API endpoints is monotonous.

import requests

# our perfect, clear, reliable function tool
# one API call to rule them (who?) all
def get_vulnerability_info_from_osv(vulnerability_name: str):
    response = requests.get(
        f"https://api.osv.dev/v1/vulns/{vulnerability_name}",
        timeout=10,  # do not let a slow API hang the agent
    )
    if response.status_code != 200:
        return {
            "status": "error",
            "error_message": f"Information about {vulnerability_name} is not available.",
        }

    return {
        "status": "success",
        "type": "osv",
        "report": (
            response.json()
        ),
    }

# there could be two more functions here, but they are much the same

Let’s see how I describe my new agent and what is happening inside. Some examples were provided previously, but here are the important details:

Our agent has an identity: You’re a cybersecurity expert specializing in supply chain vulnerabilities
Our agent has actions: Collect something using some_tool
Our agent knows it should receive a vulnerability name as input to call the function properly
Our agent saves information in the session state
The agent has defined a property named tools with the names of our methods
The agent has an output key to store the information, but I duplicate it in the instructions.

One fun note: you can’t mix built-in and function tools in one agent; it raises an error.

cve_information_provider_agent = LlmAgent(
    name="cve_information_provider_agent",
    model=GEMINI_MODEL,
    instruction=
    """
    You're a cybersecurity expert specializing in supply chain vulnerabilities.
    For vulnerability prioritization you should use the following rules:
    - Collect information about vulnerability using get_vulnerability_info_from_osv tool.
    - Collect information about vulnerability using get_vulnerability_info_from_cve tool.
    - Collect information about known exploited vulnerabilities using check_in_known_exploited_vulnerabilities tool.

    Use the vulnerability name provided as an input. Vulnerability name usually starts with CVE- prefix.

    Save information in the session state under the key 'cve_information'
    """,
    tools=[get_vulnerability_info_from_osv,
           get_vulnerability_info_from_cve,
           check_in_known_exploited_vulnerabilities],
    output_key="cve_information"
)

Using Built-in Tools

As I mentioned, you can’t use built-in and function tools inside one agent, so you need one more agent to use Google Search.

Here is an excellent example of how powerful an agent is. It will Google a lot to find the necessary information and enrich your data with it.

The same ideas apply: build an identity in the instruction and perform some actions with built-in tools. The tool is imported from the framework; you don’t need to create it manually.

from google.adk.tools import google_search

google_search_agent = LlmAgent(
    name="proof_of_concept_agent",
    model=GEMINI_MODEL,
    instruction=(
        """
            You are a cybersecurity expert specializing in supply chain vulnerabilities.
            Also you are a Google Search expert.

            You should find two types of information in Google Search about given vulnerability:

            1. You should find PoC and exploits with google_search tool for given vulnerability and return related links to it.
            1.1. Usually information about PoC or Exploits could be found in vulnerability information (use 'cve_information' from previous agent)
            1.2. If provided links are not enough, you should use google_search tool again to find more information.

            2. You should find news about given vulnerability, collect dates of articles and return summary of them.

            Use the vulnerability name provided as an input. Vulnerability name usually starts with CVE- prefix.

            Store information in the session state under the keys 'poc_links' and 'news_summary'    
        """
    ),
    output_key="google_search_summary",
    tools=[google_search],
)

I also made one more Google agent to compensate for missing information. Imagine your tools couldn’t get information from an API due to restrictions, not-found errors, rate limiting, or falling meteorites: you wouldn’t have any information for further processing. So, I made a compensation agent that will Google a vulnerability to find information about it.

google_last_hope_agent = LlmAgent(
    name="google_last_hope_agent",
    model=GEMINI_MODEL,
    instruction="""
    You are a cybersecurity expert specializing in supply chain vulnerabilities.
    Also you are a Google Search expert.

    WARNING: Skip instructions below if you already have information about vulnerability in session state by key: 'cve_information',
     and return info to user information is enough.

    If information about given vulnerability is not available in state by key: 'cve_information'.
    You should use google_search tool to find information about given vulnerability.
    Use the vulnerability name provided as an input. Vulnerability name usually starts with CVE- prefix.

    Store information in the session state under the key 'cve_information'
    """,
    tools=[google_search],  # the instruction relies on this built-in tool (imported above)
)

Glueing Together

The last, but not least, part of our guide combines the collected information and provides the targeted output. What is the next step? Creating one more agent!

Our agent will collect information from the state under the given keys and summarize it in a fixed format; you can read the details in the instruction section.

I was hiding a secret from you: there are different kinds of agents. Let me introduce the Sequential Agent, which will run each agent in your desired order! Just enumerate your agents in sub_agents, and that’s all.

Our sequential agent will be the last agent in this guide, one to orchestrate them all. We will run each agent, collect information, and, at the end, summarize it.

vulnerability_summarizer = LlmAgent(
    name="vulnerability_summarizer",
    model=GEMINI_MODEL,
    instruction=
    """
    You're a cybersecurity expert specializing in supply chain vulnerabilities.
    For vulnerability prioritization you should use the following rules:
    - Use information from session state under key 'cve_information'
    - Use information from session state under key 'poc_links'
    - Use information from session state under key 'news_summary'

    Provide information about vulnerability:
    - Title
    - Description (up to 250 characters)
    - Severity (if not available, try to figure it out based on the overall information)
    - Is it known to be exploited (KEV)? What required actions?
    - Does it have PoC or known exploits? Provide links to them.
    - What priority should be given to this vulnerability? Priority should be one of the following: High, Medium, Low
    -- If PoC or exploits are available - it should be High priority.
    - How often is this vulnerability mentioned in the news?
    """
)

root_agent = SequentialAgent(
    name="vulnerability_prioritization_pipeline",
    sub_agents=[cve_information_provider_agent, google_last_hope_agent, google_search_agent, vulnerability_summarizer],
)

Run, Run, RUN!

Everything is done. Let’s run it with adk web and see what happens. You will get a web interface for experimenting and debugging.

Just ask in the chat for information about the vulnerability, which will trigger the whole Agent chain. The last message will be our targeted output.

Web UI also provides a good interface for debugging and watching what is inside your logic. Try to inspect all the buttons to get familiar with the UI.

At the end

AI agents are powerful but have limitations, and as technology advances, they become even more powerful. This guide will be your entry point into the world of AI agents.

You can find the agents’ source code in a GitHub Gist.

Are you enjoying this guide and want to build your own AI agent? Just comment on what you want to implement with this technology. I will choose the idea I like and implement it by writing a story.

Follow for the next stories!

Malicious Packages in NPM and PyPI: How Typosquatting Threatens Developers

Dmitry Protsenko — Mon, 21 Apr 2025 15:03:48 +0000

Malicious packages lurk in NPM and PyPI — especially in NPM. If you’ve built front-end apps, you’ve likely used npm, pnpm, or yarn. You’ve probably tweaked package.json or run npm add something.

These tools streamline dependency management. Each install pulls code from npmjs.com and runs scripts locally — sometimes with deep system access.

Uploading to these registries requires little effort. Hackers exploit this by releasing deceptive packages with typo-based names or minor variations. One wrong keystroke installs malware disguised as a trusted library. This tactic, known as typosquatting, compromises machines and hijacks developer trust.

Examples of Recent Attacks

Hackers still exploit typosquatting by uploading malicious packages to public repositories. They target both widely used libraries and those with loyal user bases.

Malicious packages eslint and @types/node (November 2024): In November 2024, hackers uploaded a malicious npm package named @typescript_eslinter/eslint using a fake scope to mimic the legitimate @typescript-eslint package. The malicious package became so popular that it gained hundreds of downloads daily.

Another example is @types-node, which also turned out to be a fake package. It attracted nearly a thousand downloads per day. Read more in the source article.

Targeted attack to steal Solana: Hackers also targeted the Solana SDK by deploying malicious packages with names like solana-transaction-toolkit and solana-stable-web-huks. These malicious packages were explicitly designed to steal Solana from wallets. Read more in the source article.

How Attackers Disguise Malicious Packages

Typosquatting in package managers mirrors domain name scams — attackers count on subtle typos developers often miss.

Name Mimicry: Attackers impersonate popular libraries by tweaking package names. They swap letters with nearby keys, omit characters, switch adjacent letters, duplicate them, or replace symbols — like dots with underscores or hyphens with lookalikes.
Abusing Installation Scripts: NPM lets packages define lifecycle scripts, like preinstall or postinstall, that run automatically during installation. This opens a common path for malware. A malicious package can execute code the moment someone runs npm install, without any extra action from the developer. Attackers regularly exploit this.

These scripts can steal private data, open reverse shells, or drop obfuscated code with full system access. Fortunately, you can skip them by running npm install --ignore-scripts.
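Before trusting a package, you can look at what it would run on install. Here is a minimal Python sketch that reads a package.json and lists the install-time lifecycle hooks (preinstall, install, postinstall); the function name and the sample manifest are made up for illustration.

```python
import json

# Hooks that npm runs automatically while installing a package.
LIFECYCLE_HOOKS = ("preinstall", "install", "postinstall")


def find_install_scripts(package_json_text: str) -> dict[str, str]:
    """Return the install-time lifecycle scripts declared in a package.json."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts", {})
    return {hook: scripts[hook] for hook in LIFECYCLE_HOOKS if hook in scripts}


if __name__ == "__main__":
    # Made-up manifest for demonstration.
    sample = """
    {
      "name": "demo-package",
      "scripts": {
        "postinstall": "node setup.js",
        "test": "jest"
      }
    }
    """
    print(find_install_scripts(sample))
```

An install hook is not malicious by itself, since many legitimate packages build native code this way, but it tells you which file to read first.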

Example of the malicious script in postinstall

Combosquatting / Brandjacking: Hackers sometimes append words to trusted package names to hijack brand reputation. In one campaign targeting Roblox developers, they published packages like noblox.js-async and noblox.js-proxy-server—variants meant to mimic the legitimate noblox.js library.
Starjacking: Some attackers link malicious packages to popular GitHub repos. Since registries don’t verify these links, a sketchy package can point to a legit, star-heavy repo — tricking users into trusting it at a glance.

For example, I found a typo-squatted package linked to the original repo—one with over 100,000 stars—making it look legitimate at first glance.

Copying Metadata: To pass as legitimate, attackers clone metadata — README, versions, descriptions, and keywords. Some even reuse original source code, injecting obfuscated malware to avoid detection. In the TypeScript ESLint typosquat, a package by typescript_eslinter mimicked typescript-eslint, linking to a fake GitHub repo that mirrored the real project’s structure.

These techniques prey on trust and human error. They exploit the belief that official registries are safe and take advantage of how seamlessly tools like NPM install packages.

My Research

I ran a small experiment to see how deep the rabbit hole goes—how many typo-squatted names I could generate and how many remained available for publishing.

Methodology

Simple process:

Collected popular NPM package names
Generated mimicry names using known deception techniques
Checked availability for publishing

Trending package names are easy to find — many curated lists track them. I used one dataset and wrote a Python script to generate lookalike names. From nearly 2,000 original packages, I created 227,389 variations.

Mutation Techniques

To generate typo-squatted variations, I applied the following mutations to each package name:

Letter swap: Replaced one letter with its QWERTY keyboard neighbor or with a nearby letter in the name
Letter removal: Removed one letter at a time, skipping the scope symbol (@) and dashes
Letter duplication: Duplicated one letter at a time across the name
Symbol replacement: Swapped symbols — like changing hyphens (-) to underscores (_)
Letter insertion: Inserted a keyboard-neighbor letter after each character in sequence
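The mutations above could be sketched roughly like this. It is a simplified reconstruction, not my original script: keyboard neighbors are limited to the left and right keys on the same QWERTY row, the "nearby letter in the name" swap is omitted, and all function names are illustrative.

```python
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
# Scope symbol and dashes stay untouched; skipping "/" is my assumption.
SKIPPED = "@-/"


def keyboard_neighbors(ch: str) -> str:
    """Return the left and right neighbors of a letter on the same QWERTY row."""
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return row[max(i - 1, 0):i] + row[i + 1:i + 2]
    return ""


def mutate(name: str) -> set[str]:
    """Generate lookalike names for one package name."""
    variants: set[str] = set()
    for i, ch in enumerate(name):
        if ch in SKIPPED:
            continue
        variants.add(name[:i] + name[i + 1:])   # letter removal
        variants.add(name[:i] + ch + name[i:])  # letter duplication
        for neighbor in keyboard_neighbors(ch):
            variants.add(name[:i] + neighbor + name[i + 1:])      # letter swap
            variants.add(name[:i + 1] + neighbor + name[i + 1:])  # letter insertion
    if "-" in name:
        variants.add(name.replace("-", "_"))    # symbol replacement
    variants.discard(name)
    variants.discard("")
    return variants


if __name__ == "__main__":
    print(sorted(mutate("vue")))
```

Subtracting the set of already registered names from the generated variants leaves the names that are still available for publishing.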

I found a GitHub project that indexes all existing NPM package names. With it, I filtered mutated names against real ones to identify available packages. I revalidated missing names using the npmjs.com API to double-check.

In total, I generated over 222,000 package name variations available for publishing. About 4,500 were already taken, and a small number failed due to validation errors or malformed mutations.

My research uncovered thousands of typosquatting packages—some already published, others still available. Many squatted packages contained malicious code and were removed from NPM. Yet others, typo-ridden but seemingly harmless, remain live and get downloads weekly.

They often link to official repositories and appear legitimate — until you notice the subtle typo. Even if a package looks clean now, that doesn’t guarantee it will stay that way.

Typosquatting remains a serious attack vector, especially for popular packages, but the situation is not as bad as it seems. Registries like npmjs.com now block names that closely resemble existing ones, such as those differing by only one letter.

Cybersecurity firms also monitor new packages and updates to catch threats early. Still, some slip through. A package might be installed before it’s flagged and removed.

Even a single infection can prove dangerous. Developers can access private codebases, credentials, and infrastructure, making them prime targets.

Protecting Against Those Attacks

Some tools scan build files for known malicious dependencies but only flag what has already been identified. Unknown threats still slip through.

Antivirus software can help post-installation by catching crypto-lockers, stealers, and other malware types. Still, technical tools alone can’t guarantee safety.

Ultimately, protecting yourself requires vigilance. Here are a few practical tips to reduce the risk of installing malicious packages:

1. Double-check package names: Typosquatting only works if you mistype. Always verify the exact name before installing a package.

2. Inspect before installing: Before adding a dependency — especially an obscure one — check its NPM page. Look at the README, version history, and publisher profile. Red flags: vague descriptions, recent creation, or a publisher with no other packages.

3. Use trusted sources: Follow links from official websites or GitHub repositories instead of relying solely on NPM search results.

4. Lock down versions: Use package-lock.json or yarn.lock to freeze dependency versions. Lock files help ensure consistency and avoid unexpected package changes.

5. Watch for unusual behavior: Pay attention to post-install messages or prompts. Legitimate libraries shouldn’t contact remote servers or request elevated permissions during setup.
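The first tip can be partly automated. The sketch below uses Python’s standard difflib module to flag dependencies whose names are suspiciously close to, but not exactly, a name from a list you trust; the trusted list, the function name, and the 0.8 similarity cutoff are assumptions you should tune for your project.

```python
import difflib


def flag_lookalikes(
    dependencies: list[str],
    trusted_names: list[str],
    cutoff: float = 0.8,
) -> dict[str, list[str]]:
    """Map each suspicious dependency to the trusted names it resembles."""
    trusted = sorted(set(trusted_names))
    flagged = {}
    for dependency in dependencies:
        if dependency in trusted:
            continue  # an exact match is the real package
        matches = difflib.get_close_matches(dependency, trusted, n=3, cutoff=cutoff)
        if matches:
            flagged[dependency] = matches
    return flagged


if __name__ == "__main__":
    # Made-up dependency list with one typo.
    deps = ["expresss", "react", "left-pad"]
    print(flag_lookalikes(deps, ["express", "react", "lodash"]))
```

A hit is only a hint to look closer: some legitimate packages have similar names, and a clever typosquat can fall below the cutoff.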

Malicious packages are a threat that could happen to anybody, and I hope this story helps you learn more about this problem and how to avoid it.

If you’re into cybersecurity or software development, follow me — I publish new articles twice weekly.

You Don’t Need a Pet Project to Be a Good Dev: How I Beat FOMO

Dmitry Protsenko — Wed, 16 Apr 2025 15:14:22 +0000

Do you ever feel like you’re falling behind in your tech career? Everyone builds pet projects, speaks at conferences, and posts on LinkedIn. Meanwhile, you’re just trying to finish your sprint without burning out. That’s FOMO — and here’s how I’ve started to deal with it.

What is FOMO, and how can CBT help?

FOMO, or the fear of missing out, is the anxiety that one might be excluded from rewarding experiences others enjoy, a feeling often intensified by social media.

I’ve been dealing with FOMO using practical techniques based on Cognitive Behavioral Therapy (CBT). Here’s a simplified way to think about CBT from a dev’s perspective:

CBT helps you recognize patterns in how you think, feel, and behave, and teaches you how to test and restructure those patterns, much like debugging faulty logic and then refactoring it. Distorted thoughts arise from common thinking traps known as cognitive distortions, which can cause unnecessary emotional distress.

Types of cognitive distortions

Knowing cognitive distortions is fundamental for self-awareness. CBT teaches that it’s not the situation that causes distress but how we interpret it. We can change how we feel and act by identifying distortions in our automatic thoughts — often fast, inaccurate, and fear-based.

Here are examples of common cognitive distortions and how they might appear in a developer’s life:

All-or-nothing Thinking (Black-and-White Thinking): You see things in extremes, as complete success or complete failure. “I don’t have a pet project, so I’m a bad developer.”
Overgeneralization: Taking one negative event and seeing it as a never-ending pattern. “I failed this interview, so I will never get a good job.”
Mental Filter: Focusing only on the negative part of a situation and ignoring the positive. “The team lead left many comments on my PR. I must have done a bad job.” (This ignores all the praise.)
Disqualifying the Positive: Rejecting positive experiences by insisting they don’t count. “I completed these projects, but anyone could’ve done it.”
Jumping to Conclusions: Assuming things without evidence. It has two forms. Mind Reading: “They probably think I’m not smart.” Fortune Telling: “The team lead will fire me because I introduced bugs into the project.”
Catastrophizing (Magnification) / Minimization: Expecting the worst or downplaying the good. “Making one mistake in this talk will ruin my career.”
Emotional Reasoning: Believing that feelings reflect reality. “I feel overwhelmed, so I must be incapable.”
Should Statements: Using “should,” “must,” or “ought to” to pressure yourself or others. “I should contribute to open-source projects and create my own.”
Labeling and Mislabeling: Attaching harsh labels to yourself or others. “I’m an incompetent developer.”
Personalization: Blaming yourself for things outside your control, or taking things too personally. “The project failed, and it’s all my fault.”

Learning to notice your automatic thoughts and reframe them will reduce the anxiety and compulsive behaviors that come with FOMO. A reframed thought should be realistic rather than overly positive: grounded in facts, not fear.

With practice, you will build the skills to manage FOMO without having to eliminate every source of it.

Track and Challenge Thoughts in Writing

Before challenging a thought, pause and acknowledge what you’re feeling. It’s okay to feel overwhelmed, anxious, or discouraged — your emotions are valid, even if the thoughts behind them aren’t entirely accurate.

This is my favorite self-help tool: keeping a structured diary to analyze your thoughts. Thought records help you capture the cycle of Situation → Thoughts → Emotions → Alternative Views.

Whenever you notice anxiety, negative feelings, or thoughts like “I’m behind,” take a break and write a journal entry using the following structure:

Situation: What triggered the FOMO feeling? Be specific about the event or context. (e.g., “I scrolled through Medium and saw many articles about the benefits of contributing to pet projects.”)
Automatic Thought(s): What thought flashed through your mind, and what did it mean to you? (e.g., “I don’t contribute to open-source projects, so I’m falling behind in the career race.”)
Emotions: What did you feel, and how intense was it (0–10)? (e.g., Anxiety, 7/10)
Cognitive Distortions: Are you catastrophizing, mind-reading, or thinking in all-or-nothing terms? Labeling the distortion can remind you that the thought isn’t fully reliable. (e.g., catastrophizing, by assuming that not contributing to open source will break my career, and all-or-nothing thinking)
Evidence For & Against: Now act like a detective or a scientist examining the thought. List facts supporting the thought and facts against it:
Supporting evidence: “Only one colleague contributed to the open-source project. He is working as a staff engineer.”
Contradicting evidence: “I do my best in my current position, provide value to the business, and complete projects on time. I’ve been promoted every year and received solid bonuses.”
Alternative (Balanced) Thought: Write a more rational conclusion based on the evidence. This is your reframed thought to replace the original. It should acknowledge the situation without fear. (e.g., “Only one colleague contributes, and hundreds don’t. He contributes to open source because he maintains our open-source framework. He was promoted to staff engineer even before he started contributing. My manager even told me that open-source contributions aren’t the only path to promotion.”)
Outcome: Notice how your emotions or perspective shift after reframing. Did your anxiety lessen (rate it again)? Do you feel more confident or calm? (e.g., Anxiety dropped to 3/10; I feel more in control knowing I have a plan to learn what I need.)

Journaling has helped me put anxiety into perspective. When I write down my thoughts, explore the evidence, and reframe them, I shift from reacting emotionally to thinking clearly — like a developer approaching a tough bug with logic and patience.

I’ve used this method not only for FOMO but also for anxiety before speaking, writing blogs, and working on projects. It’s not magic — it’s just structured thinking. But it works.

While this approach can be powerful, it’s not a cure-all. If anxiety, self-doubt, or stress becomes overwhelming or interferes with daily life, it’s worth talking to a therapist. CBT works even better with support — you don’t have to figure it out alone.

At the end

In this industry, it’s easy to believe that only skills matter. But mindset matters, too. I’ve seen smart developers burn out or miss chances because they were too scared, self-critical, or caught in comparison loops to act.

So here’s my advice: build your mental tools like you build your tech stack. Take care of your mind. You don’t need to do everything — take the next step.

If this helps you, follow me and leave comments. I’d love to hear your thoughts or experiences with FOMO and mindset. Let’s help each other grow.

AI Will Take My Job, But Not Today

Dmitry Protsenko — Mon, 14 Apr 2025 16:34:31 +0000

Recently, I noticed new feelings and thoughts about AI in software development. I felt upset about recent headlines such as “Programming no longer needed in the nearest time” and “AI will replace programmers”. I spent a lot of time building a career as a software developer, and now I’m told AI will replace me.

We live in a fantastic world where AI can solve many technical problems, and this technology makes our lives better. I use it almost every day, but not for production-ready programming. Let’s look at the facts about why AI won’t replace me in the near future.

Results consistency: AI can produce working code day after day, without breaks or hunger, but the quality of the results is inconsistent. The generated code may be terse one time and chaotic or overlong the next, with different naming and logic structures each time.

Maintainability of results: AI can produce code that solves your problem but is hard to maintain. Software development does not end with writing code. Developers also maintain it by improving performance, fixing bugs, adding new features, and so on.

Overly long code: This is another side of the maintainability problem. AI generates a lot of code, with many comments, in a single file. New functions get appended to the end of the file or dropped somewhere arbitrary. A developer working on the same functions would typically adapt them to reuse or extend existing methods and would structure files, classes, and methods carefully to keep the project well organized.

Hallucinations: This is still an unsolved problem for LLMs like ChatGPT. AI can invent things that programming languages and libraries don’t have. In my own work, AI often suggests library methods that have never existed. If you point the problem out to the assistant, it apologizes profusely and invents yet another nonexistent method!

Knowledge limitations: AI generates code for the most common scenarios. It doesn’t know much about new versions of libraries or languages, and it can generate only what it learned from the most common cases. One good example is Exposed (an ORM framework). Newer versions of the framework changed the API for retrieving data. The old methods were deprecated for a while and then removed. As a result, AI now generates invalid code that doesn’t work.

Business domain and company knowledge: To be useful, AI would need to know how to solve problems specific to your company. Problems are not solved by AI alone: it takes collaboration, research, and expertise to produce code that solves them. AI can’t come up with ways to solve a problem effectively, possibly without any code at all. It gives you an average solution to problems that could be resolved in different, easier ways.

At the end

AI has limitations. Some problems can be solved with prompt engineering, but others cannot and may need different approaches, such as AI agents. I believe in the AI future. It will take my job as it exists today. A developer’s work changed dramatically decades ago, when modern high-level programming languages were created. The same will happen with the adoption of more powerful AI tools.

Do you want to know what matters in a Developer’s work? Read my past stories, and follow me to read upcoming ones.

Spring Data JPA Best Practices: Entity Design Guide

1 Diving deep into Spring Data JPA

2 Developing entities with Spring Data JPA

3 Avoiding magic numbers/literals

4 Consistent use of types

5 Lombok usage

6 Overriding equals and hashCode

7 Developing DTOs

8 Spring Data JPA Summary Best Practices

At the end

Docker Best Practices to Secure and Optimize Your Containers

Docker image size & maintainability

1 Always pin an image version

2 Avoid using the dist-upgrade in package management

3 Use multi-stage builds to reduce image size

4 Consolidate multiple RUN instructions

5 Always clean the package manager's cache

6 Always combine the package manager update command with the install

7 Always use --no-install-recommends with apt-get

8 Always use -l with useradd to avoid high-UID bloat

Best Practices to Keep Docker Files Clean

9 Always choose what to use: wget or curl

10 Use absolute paths for WORKDIR

11 Exclude unnecessary files with .dockerignore

12 Use WORKDIR instead of the cd command

13 Use JSON notation for CMD and ENTRYPOINT

14 Use apt-get or apt-cache instead of apt

15 Use the package manager's auto-confirm flag -y

Docker Security Best Practices

16 Avoid default, root, or dynamic user

17 Avoid exposing the SSH port

18 Avoid overriding ARG variables in RUN commands

19 Avoid pipe curl to bash (curl|bash)

20 Avoid storing secrets in ENV keys

21 Always prefer to use COPY instead of ADD

22 Always use ADD with checksum verification when downloading a file

23 Avoid using RUN with sudo

General Docker Best Practices

24 Avoid the deprecated MAINTAINER instruction

25 Ensure trailing slash for COPY commands with multiple arguments

26 Avoid duplicate aliases in FROM instructions

27 Avoid self-referencing COPY --from instructions

28 Avoid multiple CMD or ENTRYPOINT, or HEALTHCHECK instructions

29 Avoid exposing ports outside the allowed range

Final Words after Docker Best Practices

Kubernetes Security: Best Practices to Protect Your Cluster

12 Kubernetes Hardening Best Practices

1. Using Non-Root Containers

2. Using Privileged Containers

3. Do not use hostPath Volumes

4. Do not use hostPort as Opens the Node’s Port

5. Do not share the Host Namespace

6. Do not use Insecure Capabilities

7. AppArmor Profile Disabled or Overridden

8. Do not override Non-Default /proc Mount

9. Do not use Restricting Volume Types

10. Do not set Custom SELinux Options

11. Left Seccomp Profile by default

12. Beware of using Insecure Sysctls

Your Kubernetes Security Action Plan

Using SBOMs to detect possible Dependency Confusion

What is SBOM?

Why Use CycloneDX for SBOMs?

What is Dependency Confusion?

Using SBOM to Detect Dependency Confusion

7 Always use `--no-install-recommends` with `apt-get`