Jing for AREX Test

Posted on Oct 9, 2023 • Edited on Dec 11, 2023

Production Traffic Replication with Java Agent

#opensource #java #testing

Production traffic replication is a technique for recording traffic in a production environment and replaying it in another environment. It is often regarded as one of the most effective approaches to regression testing.

Record and playback testing tools

Depending on the location where the recording takes place, all the recording tools can be categorized as: web server-based recording, application layer-based recording, and network protocol stack-based recording.

Web server-based recording

Web server-based recording captures and replicates the HTTP requests and responses exchanged between the web server and its clients.

Advantages: Supports a diverse range of request formats and protocols.

Disadvantages: High maintenance cost, consumes a significant amount of online resources.

Network protocol stack-based recording

This approach listens directly on network ports and records traffic by duplicating data packets.

Advantages: Low impact on the application

Disadvantages: More low-level implementation, higher cost. In addition, it cannot be used to test non-idempotent interfaces, because traffic replay can result in the generation of invalid or dirty data, potentially impacting the correctness of the business operations.

Representative tools: goReplay, tcpCopy, tcpReplay

Application layer-based recording

Since Spring Boot is widely used as a Java backend framework, we can leverage AOP (Aspect-Oriented Programming) to intercept the Controller layer and record traffic.

Advantages: Non-intrusive to the code, relatively quick and simple implementation. In addition to validating the returned responses from server, it also supports validation of data written to the service, validation of database, message queue, Redis data, and even validation of runtime in-memory data.

Disadvantages: May consume some online resources, potential impact on business

Representative tools: ngx_http_mirror_module, Java sandbox, AREX

AREX is an automated regression testing platform based on traffic recording and playback. It uses Java Agent and bytecode enhancement technology to capture, in the production environment, the entry points of real request chains together with the request and response data of their third-party dependencies. Then, in the testing environment, it simulates these requests by replaying them and verifies the correctness of the entire invocation chain logic one by one.

How does AREX Java Agent work?

AOP (Aspect-Oriented Programming) is a programming paradigm that provides a way to modularize cross-cutting concerns, such as logging, security, and transaction management. It achieves this by separating these concerns from the core business logic, allowing you to add extra functionality to existing modules without directly modifying their code.

Java Agent is a mechanism in Java that allows you to dynamically modify the bytecode of classes at runtime. It provides a way to instrument and manipulate the behavior of Java applications, including adding aspects or intercepting method calls.

In the context of AOP, a Java Agent can be used to apply AOP principles by intercepting method calls and weaving in additional behavior. It provides the necessary infrastructure to implement AOP in Java applications.
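As a rough illustration of the mechanism described above, the sketch below registers a ClassFileTransformer from a premain method using only the JDK's java.lang.instrument API. The names SketchAgent and LoggingTransformer and the com/example/ prefix are hypothetical and not taken from AREX. The transformer only notes which classes it would enhance and leaves their bytecode unchanged; a real agent would return rewritten bytecode at that point.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical agent skeleton; not AREX's actual implementation.
class SketchAgent {
    // Internal (slash-separated) names of classes the transformer matched.
    static final Set<String> MATCHED = ConcurrentHashMap.newKeySet();

    static class LoggingTransformer implements ClassFileTransformer {
        private final String prefix;

        LoggingTransformer(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null && className.startsWith(prefix)) {
                // A real agent would rewrite classfileBuffer here.
                MATCHED.add(className);
            }
            return null; // null tells the JVM to keep the original bytecode
        }
    }

    // Called by the JVM before main() when started with -javaagent.
    public static void premain(String agentArgs, Instrumentation inst) {
        String prefix = (agentArgs == null) ? "com/example/" : agentArgs;
        inst.addTransformer(new LoggingTransformer(prefix));
    }
}
```

To run as an agent, the class would be packaged in a JAR whose manifest declares Premain-Class, and the application would be started with the -javaagent option pointing at that JAR.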

As shown in the figure below, a request typically has a chain of calls consisting of an entry point and dependencies that are either synchronous or asynchronous.

The recording process links the entry call and its dependency calls through a RecordId to form a complete test case. AREX Agent enhances the bytecode of the entry and dependency calls, intercepts each call as the code executes, records its input parameters, return value, and exceptions, and sends them to the storage service.

During playback in the test environment, the AREX Agent simulates requests using the real data recorded in the production environment. The AREX Agent determines whether playback is required by identifying the playback flag. If playback is required, the actual method invocation is not performed. Instead, the stored response data from the storage service is retrieved and returned as the response.



public Integer parseIp(String ip) {
    if (needReplay()) {
        //  In the playback scenario, the collected data is used as the return result, which is also known as Mocking.
        return DataService.query("parseIp", ip);
    }

    int result = 0;
    if (checkFormat(ip)) {
        String[] ipArray = ip.split("\\.");
        for (int i = 0; i < ipArray.length; i++) {
            result = result << 8;
            result += Integer.parseInt(ipArray[i]);
        }
    }

    if (needRecord()) {
        // In the recording scenario, the parameters and execution results are saved into the database.
        DataService.save("parseIp", ip, result);
    }
    return result;
}

Taking the above function as an example:

At the beginning of the function, a decision is made whether to replay. If replay is required, the collected data is used as the return result, which is commonly referred to as mocking.

At the end of the function, a decision is made whether to record or not. If recording is necessary, the intermediate data that the application needs to store is saved to the AREX database.
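For readers who want to try the pattern, here is a minimal self-contained sketch of the same record/replay flow. It is an assumption-laden stand-in, not AREX code: the storage service is replaced by a hypothetical in-memory map, the mode is set by hand instead of by a playback flag, and the checkFormat step is omitted. A counter shows that the real logic is skipped during replay.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; names and storage are illustrative only.
class IpParserDemo {
    enum Mode { NORMAL, RECORD, REPLAY }

    // In-memory stand-in for the storage service.
    private final Map<String, Integer> store = new HashMap<>();
    private Mode mode = Mode.NORMAL;
    int realInvocations = 0; // how often the real logic actually ran

    void setMode(Mode mode) {
        this.mode = mode;
    }

    Integer parseIp(String ip) {
        String key = "parseIp:" + ip;
        if (mode == Mode.REPLAY) {
            // Replay: return the recorded result instead of executing the logic.
            return store.get(key);
        }

        realInvocations++;
        int result = 0;
        for (String part : ip.split("\\.")) {
            result = (result << 8) + Integer.parseInt(part);
        }

        if (mode == Mode.RECORD) {
            // Record: save the parameter and the result under the same key.
            store.put(key, result);
        }
        return result;
    }
}
```

Note that the key used when saving must match the key used when querying; otherwise replay cannot find the recorded value.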

Technical challenges

The process of recording and replay is very complex. Next, we will dive into the challenges and technical details of AREX Agent.

ClassLoader Isolation

To ensure the AREX Agent code and dependencies do not have a conflict with the application code, the AREX Agent and application code are isolated by different class loaders. As shown in the figure below, AREX Agent overrides the findClass method by customizing AgentClassLoader to ensure that the classes used by AREX Agent will only be loaded by AgentClassLoader, so as to avoid conflicts with the application ClassLoader.

Meanwhile, so that the application ClassLoader can recognize the recording and playback code of AREX Agent, AREX Agent injects the bytecode needed for recording and playback into the application ClassLoader through the ByteBuddy ClassInjector. This ensures that no ClassNotFoundException or NoClassDefFoundError occurs during recording and replay.
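The isolation idea can be sketched with a small custom class loader. This is only an illustration under stated assumptions: IsolatingClassLoader, IsolationDemo, and AgentOnlyHelper are hypothetical names, and the class bytes are read from the parent's resources for convenience, whereas a real agent class loader would typically read them from the agent's own JAR. Classes whose names start with the isolated prefix are defined by this loader itself; everything else is delegated to the parent as usual. The snippet uses InputStream.readAllBytes, so it assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical loader: defines "isolated" classes itself, delegates the rest.
class IsolatingClassLoader extends ClassLoader {
    private final String isolatedPrefix;

    IsolatingClassLoader(ClassLoader parent, String isolatedPrefix) {
        super(parent);
        this.isolatedPrefix = isolatedPrefix;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            if (name.startsWith(isolatedPrefix)) {
                // Skip parent delegation for isolated classes.
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    c = findClass(name);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
            return super.loadClass(name, resolve);
        }
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        String path = name.replace('.', '/') + ".class";
        try (InputStream in = getParent().getResourceAsStream(path)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            byte[] bytes = in.readAllBytes();
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}

// Stand-in for a class that only the agent should see.
class AgentOnlyHelper { }

class IsolationDemo {
    // Loads a second, isolated copy of the given class.
    static Class<?> loadIsolated(Class<?> original) {
        IsolatingClassLoader loader =
                new IsolatingClassLoader(original.getClassLoader(), original.getName());
        try {
            return loader.loadClass(original.getName());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The class returned by loadIsolated has the same name as the original but is a distinct Class object owned by the isolating loader, which is what keeps two copies of the same dependency from colliding.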

Tracing

When the data is recorded and replayed, the entry point of a request and the calls of each dependency will be linked together by a RecordId. When dealing with multi-threading and various asynchronous frameworks, there are significant challenges in maintaining data continuity. AREX Agent addresses the issue of passing RecordId across threads by enhancing the thread behavior, ensuring seamless transfer of RecordId between threads. The supported threads and thread pools are as follows:

Thread
ThreadPoolExecutor
ForkJoinTask
FutureTask
FutureCallback
Reactor Framework
……

Here's a simple example for better understanding of implementation.

When java.util.concurrent.ThreadPoolExecutor#execute(Runnable runnable) is invoked, the Runnable argument is wrapped in an AgentRunnableWrapper (named RunnableWrapper in the snippet below). When the wrapper is constructed, the context of the current (submitting) thread is captured. In the run method, the child thread's context is replaced with the captured one, and after execution the child thread's original context is restored. Here is an example code snippet:



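// Call sites in the application
executors.execute(runnable);
executors.submit(callable);

// ThreadPoolExecutor#execute after enhancement: the task is wrapped on entry
public void execute(Runnable var1) {
    var1 = RunnableWrapper.wrap(var1);
    // ... the original execute logic continues with the wrapped task
}

public class RunnableWrapper implements Runnable {
    private final Runnable runnable;
    private final TraceTransmitter traceTransmitter;

    private RunnableWrapper(Runnable runnable) {
        this.runnable = runnable;
        // Capture the context of the current (submitting) thread
        this.traceTransmitter = TraceTransmitter.create();
    }

    // Factory used by the enhanced execute method above
    public static Runnable wrap(Runnable runnable) {
        return new RunnableWrapper(runnable);
    }

    @Override
    public void run() {
        // Replace the child thread context with the captured one
        try (TraceTransmitter tm = traceTransmitter.transmit()) {
            runnable.run();
        }
        // When the try block closes, the child thread's original context is restored
    }
}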

...
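The capture, replace, and restore steps can be tried out with a plain ThreadLocal. The sketch below is a simplified stand-in for the snippet above, assuming the context is just a RecordId string; TraceContext, ContextRunnable, and TracingDemo are hypothetical names, and the wrapping is done by hand at the call site rather than by bytecode enhancement.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical per-thread holder for the RecordId.
class TraceContext {
    private static final ThreadLocal<String> RECORD_ID = new ThreadLocal<>();

    static String get() {
        return RECORD_ID.get();
    }

    static void set(String id) {
        if (id == null) {
            RECORD_ID.remove();
        } else {
            RECORD_ID.set(id);
        }
    }
}

class ContextRunnable implements Runnable {
    private final Runnable delegate;
    private final String capturedId;

    private ContextRunnable(Runnable delegate) {
        this.delegate = delegate;
        this.capturedId = TraceContext.get(); // captured on the submitting thread
    }

    static Runnable wrap(Runnable delegate) {
        return new ContextRunnable(delegate);
    }

    @Override
    public void run() {
        String previous = TraceContext.get();
        TraceContext.set(capturedId);      // replace the worker thread's context
        try {
            delegate.run();
        } finally {
            TraceContext.set(previous);    // restore the worker thread's context
        }
    }
}

class TracingDemo {
    // Runs a task on a worker thread and returns the RecordId it observed.
    static String seenInWorker(boolean wrapped) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicReference<String> seen = new AtomicReference<>();
        Runnable task = () -> seen.set(TraceContext.get());
        try {
            pool.execute(wrapped ? ContextRunnable.wrap(task) : task);
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }
}
```

With a RecordId set on the submitting thread, the wrapped task observes it on the worker thread, while an unwrapped task sees no RecordId at all. That gap is what the agent closes by wrapping tasks automatically.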

Component Version Compatibility

The components introduced by an application may have multiple versions, and different versions of the same component may be incompatible, such as changes in package, addition or removal of methods, etc. In order to support multiple versions of components, AREX Agent needs to identify the correct component version for bytecode enhancement, to avoid duplicate enhancement or enhancement of the wrong version.

AREX Agent identifies the Name and Version in the META-INF/MANIFEST.MF of the component JAR file, and matches the version during class loading to ensure that the correct version is used for bytecode enhancement.
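A minimal sketch of this kind of version check, using only java.util.jar.Manifest, might look as follows. It rests on assumptions: the attribute names Implementation-Title and Implementation-Version are one common convention and may differ from what AREX actually reads, the class ComponentVersionMatcher is hypothetical, and the version comparison handles only dot-separated numeric versions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper; attribute names are an assumption.
class ComponentVersionMatcher {

    // Returns {name, version} read from the manifest's main attributes.
    static String[] readNameAndVersion(String manifestText) {
        try {
            Manifest manifest = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
            Attributes main = manifest.getMainAttributes();
            return new String[] {
                main.getValue("Implementation-Title"),
                main.getValue("Implementation-Version")
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if fromInclusive <= version < toExclusive.
    static boolean matches(String version, String fromInclusive, String toExclusive) {
        return compareVersions(version, fromInclusive) >= 0
                && compareVersions(version, toExclusive) < 0;
    }

    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? leadingInt(x[i]) : 0;
            int yi = i < y.length ? leadingInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    // Parses the leading digits of a segment, e.g. "13-beta" -> 13.
    private static int leadingInt(String segment) {
        int n = 0;
        for (int i = 0; i < segment.length() && Character.isDigit(segment.charAt(i)); i++) {
            n = n * 10 + (segment.charAt(i) - '0');
        }
        return n;
    }
}
```

An enhancement module could then declare the version range it supports and be skipped when the loaded JAR falls outside that range.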

Mock Local Cache



Object value = localCache.get(key);
// The cache holds the data during recording but not during playback, in which case the code must query the database (db.query()).
if (value != null) {
    return value;
} else {
    return db.query();
}

As shown above, during recording, the code first attempts to retrieve the value associated with the given key from the local cache (localCache.get(key)). If the value is not null, it means that the corresponding data is available in the cache during recording, and it is directly returned.

During playback, however, the cache in the test environment does not hold the same data. The lookup returns null, so the code has to query the database (db.query()) to retrieve the data, a path that was not taken during recording.

In short, the execution flow of a replayed request often differs from the recorded one because the local cache data is inconsistent with the recording, resulting in a low pass rate for replay testing. Solving this problem involves several challenges:

It is challenging to achieve real-time synchronization between production and test cache data due to the isolation between them.
Local caches are implemented in many different ways, and it is impractical to detect and handle each implementation individually.
Local memory data is typically fundamental data and can have a large volume. Recording this data can lead to significant performance overhead.

Currently, AREX Agent records only the cache data used in the current request chain. The application configures dynamic classes to identify what should be recorded, and during playback in the test environment the cache data is automatically replaced with the recorded data, keeping the in-memory data consistent between recording and playback. A solution for recording large volumes of cache data is still being researched.
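The idea of recording only the cache reads a request actually makes can be sketched with a JDK dynamic proxy. This is a stand-in technique chosen to keep the example self-contained, not how AREX does it (AREX enhances bytecode of the configured classes). CityLookup and RecordingProxy are hypothetical names, and the store is a plain map rather than a storage service.

```java
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.Map;

// Hypothetical cache-backed lookup that the application marks for recording.
interface CityLookup {
    String cityName(int id);
}

class RecordingProxy {
    enum Mode { RECORD, REPLAY }

    // Wraps target so that its calls are recorded into, or replayed from, store.
    static <T> T wrap(Class<T> type, T target, Mode mode, Map<String, Object> store) {
        Object proxy = Proxy.newProxyInstance(
                type.getClassLoader(),
                new Class<?>[] { type },
                (self, method, args) -> {
                    String key = method.getName() + Arrays.deepToString(args);
                    if (mode == Mode.REPLAY && store.containsKey(key)) {
                        // Replay: serve the value seen during recording.
                        return store.get(key);
                    }
                    Object result = method.invoke(target, args);
                    if (mode == Mode.RECORD) {
                        // Record only what this request actually read.
                        store.put(key, result);
                    }
                    return result;
                });
        return type.cast(proxy);
    }
}
```

During recording the proxy stores each (method, arguments) pair together with its result; during replay the same call returns the recorded value even if the test environment's cache is empty, so the code follows the same branch as it did in production.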

Time Mock

Many business systems are time-sensitive, where accessing them at different times can result in different outcomes. If the recording and playback times are inconsistent, it can lead to playback failures. Additionally, modifying the machine time on the test server is not suitable as playback requests are concurrent, and many servers do not allow modification of the current time. Therefore, we need to implement mocking of the current time at the code level to address this issue.

The currently supported time types are as follows:

java.time.Instant
java.time.LocalDate
java.time.LocalTime
java.time.LocalDateTime
java.util.Date
java.util.Calendar
org.joda.time.DateTimeUtils
java.time.ZonedDateTime

public static native long currentTimeMillis() is an intrinsic function. When the JVM applies inline optimization to an intrinsic function, the JIT compiler replaces the call with its own internal code instead of executing the existing bytecode, which makes the code enhanced by AREX Agent ineffective. The JDK inlines System.currentTimeMillis() and System.nanoTime() as follows:



// https://hg.openjdk.org/jdk8u/jdk8u/hotspot/file/dae2d83e0ec2/src/share/vm/classfile/vmSymbols.hpp#l631
//------------------------inline_native_time_funcs--------------
// inline code for System.currentTimeMillis() and System.nanoTime()
// these have the same type and signature
bool LibraryCallKit::inline_native_time_funcs(address funcAddr, const char* funcName) {
  const TypeFunc* tf = OptoRuntime::void_long_Type();
  const TypePtr* no_memory_effects = NULL;
  Node* time = make_runtime_call(RC_LEAF, tf, funcAddr, funcName, no_memory_effects);
  Node* value = _gvn.transform(new ProjNode(time, TypeFunc::Parms+0));
#ifdef ASSERT
  Node* value_top = _gvn.transform(new ProjNode(time, TypeFunc::Parms+1));
  assert(value_top == top(), "second value must be top");
#endif
  set_result(value);
  return true;
}

AREX Agent handles this case specially: driven by the application configuration, it replaces code that calls System.currentTimeMillis() with a call to AREX Agent's own method for obtaining the time, which avoids the inline optimization.
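The replacement target can be pictured as a small clock facade. The sketch below is an assumption, not AREX's actual class: MockClock is a hypothetical name, and the recorded time is kept in a ThreadLocal so that concurrent playback requests can each see their own recorded time.

```java
// Hypothetical clock facade that enhanced code would call
// instead of System.currentTimeMillis().
class MockClock {
    // Recorded time for the request being replayed on this thread, if any.
    private static final ThreadLocal<Long> REPLAY_MILLIS = new ThreadLocal<>();

    static void startReplay(long recordedMillis) {
        REPLAY_MILLIS.set(recordedMillis);
    }

    static void endReplay() {
        REPLAY_MILLIS.remove();
    }

    static long currentTimeMillis() {
        Long mocked = REPLAY_MILLIS.get();
        // Outside playback, fall back to the real system clock.
        return mocked != null ? mocked : System.currentTimeMillis();
    }
}
```

Because the application code ends up calling an ordinary Java method rather than the intrinsic, the returned value stays under the agent's control regardless of how the JIT treats System.currentTimeMillis().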
