<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LEO Qin</title>
    <description>The latest articles on DEV Community by LEO Qin (@ppsrap).</description>
    <link>https://dev.to/ppsrap</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1005375%2Fbb3eb76e-a67d-447c-826a-d3f34ff543dd.jpg</url>
      <title>DEV Community: LEO Qin</title>
      <link>https://dev.to/ppsrap</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ppsrap"/>
    <language>en</language>
    <item>
      <title>Four Deployment Strategies for Microservices</title>
      <dc:creator>LEO Qin</dc:creator>
      <pubDate>Sun, 25 Jun 2023 08:08:25 +0000</pubDate>
      <link>https://dev.to/ppsrap/four-deployment-strategies-under-microservices-1533</link>
      <guid>https://dev.to/ppsrap/four-deployment-strategies-under-microservices-1533</guid>
      <description>&lt;p&gt;Blue and green release&lt;/p&gt;

&lt;p&gt;Blue and Green Release Features&lt;/p&gt;

&lt;p&gt;Precautions for blue-green release&lt;/p&gt;

&lt;p&gt;Rolling Publishing&lt;/p&gt;

&lt;p&gt;Rolling Release Features&lt;/p&gt;

&lt;p&gt;Rollover Release Notes&lt;/p&gt;

&lt;p&gt;Grayscale publishing&lt;/p&gt;

&lt;p&gt;A/B testing &lt;/p&gt;




&lt;p&gt;Suppose you are Twitter's project manager. The new version differs significantly from the old one: its service architecture, front-end UI, and more have changed. Testing has found no functional obstacles. How do you now switch users over to the new version?&lt;/p&gt;

&lt;h2&gt;
  
  
  Blue-Green Release
&lt;/h2&gt;

&lt;p&gt;Obviously, this question does not arise with an application's first release; thinking about how to release only becomes necessary in subsequent version iterations.&lt;/p&gt;

&lt;p&gt;In a blue-green deployment there are two complete systems: the one currently serving users (the old version mentioned above), marked "green", and the one prepared for release, marked "blue". Both systems are fully functional and running; they differ only in version and in whether they serve external traffic. The old system still serving users is the green system, and the newly deployed system is the blue system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kFqCE_Cj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/64b05k22v76nxwry8er5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kFqCE_Cj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/64b05k22v76nxwry8er5.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the blue system does not serve external traffic, what is it for?&lt;br&gt;
It is used for pre-release testing. Any issue found during testing can be fixed directly on the blue system without disturbing the system users are on.&lt;br&gt;
After repeated testing, modification, and verification, once the blue system is judged to meet the release standard, users are switched directly over to it. For some time after the switch the blue and green systems coexist, but users are already on the blue system. During this period, observe how the blue (new) system behaves; if problems appear, switch straight back to the green system.&lt;br&gt;
Once the externally serving blue system is confirmed to be working properly and the idle green system is no longer needed, the blue system officially becomes the new green system. The original green system can be destroyed, freeing resources for deploying the next blue system.&lt;/p&gt;
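&lt;p&gt;The cutover step itself can be as small as re-pointing a routing reference. The following is a minimal, illustrative Java sketch (class and pool names are invented for this example, not taken from any real load balancer): switching and rolling back are each a single atomic swap.&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative blue-green cutover: the router holds a reference to the pool
// currently serving traffic, so cutover and rollback are single atomic swaps.
class BlueGreenRouter {
    private final AtomicReference<String> activePool;

    BlueGreenRouter(String initialPool) {
        this.activePool = new AtomicReference<>(initialPool);
    }

    // Every request goes to whichever pool is active at that instant.
    String route() {
        return activePool.get();
    }

    // Cutover to the new pool; calling this again with the old name is the rollback.
    void switchTo(String pool) {
        activePool.set(pool);
    }
}
```

&lt;p&gt;A router created on "green" serves green traffic until switchTo("blue") is called; if the blue system misbehaves during the observation window, switchTo("green") restores the old behavior immediately.&lt;/p&gt;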

&lt;h2&gt;
  
  
  Blue-Green Release Features
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The purpose of blue-green deployment is to minimize downtime during a release and to allow a fast rollback.&lt;/li&gt;
&lt;li&gt;Only when the two systems are fully decoupled can non-interference be guaranteed 100%.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Blue-Green Release Precautions
&lt;/h2&gt;

&lt;p&gt;Blue-green deployment is just one release strategy; it is not a universal solution for every situation. It is simple and fast to carry out only when the target system is highly cohesive. If the target system is complex, think carefully about how to perform the switch and whether the data of the two systems needs to be synchronized.&lt;br&gt;
When you switch to the blue environment, in-flight business and new business must be handled properly; if your database backend cannot cope with this, it becomes a troublesome problem.&lt;br&gt;
You may need to handle "microservice-architecture applications" and "traditional-architecture applications" at the same time; if the two are not well coordinated during the blue-green switch, the service may still go down.&lt;br&gt;
Synchronized migration and rollback between the database and the application deployment must be planned in advance.&lt;br&gt;
Blue-green deployment requires infrastructure support.&lt;br&gt;
Performing blue-green deployment on non-isolated infrastructure (VMs, Docker, etc.) risks damaging both the blue and green environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rolling Release
&lt;/h2&gt;

&lt;p&gt;Typically, one or more servers are taken out of service, updated, and put back into use, cycling through until every instance in the cluster has been updated to the new version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aiRQ5bzg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46pxkebubmcb864enctu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aiRQ5bzg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/46pxkebubmcb864enctu.png" alt="Image description" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Release process:
&lt;/h2&gt;

&lt;p&gt;Compared with blue-green release, which needs a complete second set of machines, a rolling release only requires one spare machine (one for the sake of explanation; in practice it may be several). We deploy the new version on this machine and then swap it for a running machine, as shown in the figure above: the updated functionality is deployed on Server 1, which then replaces the running Server 1; the machine that was swapped out can then be used to deploy the new version of Server 2, which replaces the working Server 2, and so on, until every server has been replaced and the update is complete.&lt;/p&gt;
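&lt;p&gt;The batch-by-batch replacement described above can be sketched as a toy loop (the server names, batch size, and version tag here are all illustrative):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative rolling update: upgrade one batch of servers at a time until the
// whole cluster runs the new version. While a batch is being upgraded it would
// be out of the load balancer; the remaining servers keep serving the old version.
class RollingUpdate {
    static List<String> run(List<String> cluster, int batchSize, String newVersion) {
        List<String> servers = new ArrayList<>(cluster);
        for (int start = 0; start < servers.size(); start += batchSize) {
            int end = Math.min(start + batchSize, servers.size());
            for (int i = start; i < end; i++) {
                // "Upgrade" this server by tagging it with the new version.
                servers.set(i, servers.get(i) + ":" + newVersion);
            }
        }
        return servers;
    }
}
```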

&lt;h2&gt;
  
  
  Rolling Release Features
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;This deployment method is more resource-efficient than blue-green deployment: it does not require running two clusters or twice the number of instances.&lt;/li&gt;
&lt;li&gt;We can deploy partially, for example upgrading only 20% of the cluster at a time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Rolling Release Precautions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;A rolling release has no environment that is confirmed to work. With blue-green deployment we know for certain that the old version works; with a rolling release we cannot be sure.&lt;/li&gt;
&lt;li&gt;It modifies the existing environment.&lt;/li&gt;
&lt;li&gt;Rolling back is difficult. For example, suppose a release must update 100 instances, 10 at a time, with each batch taking 5 minutes. If a problem is discovered at the 80th instance and a rollback is needed, the process is painful and lengthy.&lt;/li&gt;
&lt;li&gt;Sometimes we also scale the system dynamically. If the system automatically scales out or in during a deployment, we must additionally work out which node is running which code. Even with automated operations tooling, this is daunting.&lt;/li&gt;
&lt;li&gt;Because the update is gradual, old and new versions briefly coexist while the code rolls out. For demanding online scenarios, we must consider how to guarantee compatibility between them.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Gray (Canary) Release
&lt;/h2&gt;

&lt;p&gt;A gray release, also known as a canary release, is a release style that transitions smoothly from the old version to the new one: some users continue using A while others start using B. If users have no objections to B, its scope is gradually expanded until all users have migrated to B. A gray release protects the overall stability of the system: problems are detected and corrected during the early gray stage, limiting their impact. What we commonly call a canary deployment is one form of gray release.&lt;/p&gt;

&lt;p&gt;Concretely, on the server side much more control is possible in practice: for example, give the first 10 updated servers a low weight, limiting how many requests reach them, then gradually raise the weight and the request share. This smooth-transition control is called "traffic segmentation".&lt;/p&gt;
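&lt;p&gt;As an illustrative sketch of such weight-based traffic segmentation (the group names are invented, and the random source is seeded only to keep the example reproducible), a router can send roughly the configured percentage of requests to the updated servers:&lt;/p&gt;

```java
import java.util.Random;

// Illustrative traffic segmentation: the updated servers start with a small
// weight, so only a controlled share of requests reaches the new version.
class WeightedSplitter {
    private final Random random;
    private final int newVersionWeight; // percentage of traffic, 0..100

    WeightedSplitter(int newVersionWeight, long seed) {
        this.newVersionWeight = newVersionWeight;
        this.random = new Random(seed);
    }

    // Decide which server group handles this request.
    String route() {
        return random.nextInt(100) < newVersionWeight ? "new" : "old";
    }
}
```

&lt;p&gt;Raising the weight step by step, say 10, then 30, then 100, is exactly the gradual migration a gray release aims for.&lt;/p&gt;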

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fOWYnYU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5yytdnd7krofuf1bb52g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fOWYnYU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5yytdnd7krofuf1bb52g.png" alt="Image description" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Prepare the artifacts for each deployment stage: build artifacts, test scripts, configuration files, and the deployment manifest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploy the "canary" server into the cluster for testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remove the "canary" server from the load-balancing list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upgrade the "canary" application (drain its existing traffic, then deploy).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run automated tests against the application.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add the "canary" server back to the load-balancing list (after connectivity and health checks).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the "canary" passes live testing, upgrade the remaining servers; otherwise, roll back.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  A/B testing
&lt;/h2&gt;

&lt;p&gt;A/B testing is a completely different thing from blue-green, rolling, and canary releases.&lt;br&gt;
Blue-green, rolling, and canary releases are release strategies: their goal is to keep a newly launched system stable, and they focus on the new system's bugs and hidden risks.&lt;br&gt;
A/B testing is an effectiveness test: multiple versions of a service are available to users at the same time. All of them have been sufficiently tested and meet the release standard; they differ from one another, but there is no "old" and "new" among them (each may well have been rolled out blue-green when it launched).&lt;br&gt;
A/B testing focuses on the real-world performance of the different versions, such as conversion rates and order volumes.&lt;br&gt;
During an A/B test, multiple versions run online simultaneously, usually with some differences in experience, such as page styles, colors, and operational flows. The team then selects the most effective version by analyzing how each one actually performs.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How does the anti-corruption layer work in DDD?</title>
      <dc:creator>LEO Qin</dc:creator>
      <pubDate>Tue, 11 Apr 2023 08:26:37 +0000</pubDate>
      <link>https://dev.to/ppsrap/how-does-the-anti-corrosion-layer-work-in-ddd-5ff8</link>
      <guid>https://dev.to/ppsrap/how-does-the-anti-corrosion-layer-work-in-ddd-5ff8</guid>
      <description>&lt;p&gt;&lt;strong&gt;Gateway&lt;/strong&gt;&lt;br&gt;
In domain-driven design , there are several recommended architectures . But they all have a common feature: the outermost layer is a gateway (some are also called adapters).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Southbound and Northbound&lt;/strong&gt;&lt;br&gt;
A gateway is actually divided into southbound and northbound sides. In this scheme the northbound gateway corresponds to input and the southbound gateway to output. For example, a service's controller endpoints, message-queue listeners, and RPC interfaces are all northbound gateways, used to accept incoming requests; the service's calls to the downstream database, its MQ producers, and the HTTP or RPC interfaces of other services are all southbound gateways.&lt;/p&gt;

&lt;h2&gt;
  
  
  The role of the anti-corruption layer
&lt;/h2&gt;

&lt;p&gt;The anti-corruption layer as a service&lt;br&gt;
The concept of the anti-corruption layer originated in DDD and was later adopted in microservice architecture. As the name suggests, its main job is to keep the architecture from rotting. The principle of increasing entropy applies in software too: with continuous iteration, code and architecture always tend toward chaos. It therefore pays to anticipate likely future change and add a protective layer in advance.&lt;/p&gt;

&lt;p&gt;In microservices, the anti-corruption layer often refers to a dedicated service extracted for this purpose, mainly used in scenarios such as system migration. In DDD, the anti-corruption layer can also be called an "adaptation layer", used to isolate upstream and downstream dependencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v1Af6EUS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tkyqra1cg3698ivinknh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v1Af6EUS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tkyqra1cg3698ivinknh.png" alt="Image description" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Anti-corruption layer in microservices&lt;/em&gt;&lt;br&gt;
Anti-corruption-layer code&lt;br&gt;
Besides existing as a separate service, each microservice should also have its own anti-corruption-layer code.&lt;/p&gt;

&lt;p&gt;In large teams, many microservices get split out. For reasons such as business growth or architecture upgrades, it is very likely that some services and interfaces will be upgraded at some point. The usual approach is to find the upstream and downstream dependencies and notify them to use the new interface.&lt;/p&gt;

&lt;p&gt;But this raises a problem: if an upstream service calls the interface in many places in its code, the cost of the change becomes very high, and this kind of cross-system change is hard to test and carries high risk. If, however, the upstream service has its own adaptation layer in front of the call, it only needs to change the adaptation-layer code.&lt;/p&gt;

&lt;p&gt;And if you cannot push upstream services to change, you can also adapt within your own service. This amounts to a first layer of adaptation inside your own northbound gateway, but it is not recommended, as it hinders your own later iterations.&lt;/p&gt;

&lt;p&gt;The recommended practice is: adapt all southbound gateways as much as possible, especially calls to external interfaces; keep all northbound gateways as simple as possible, without adaptation, and let the upstream adapt on its own side.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T2Xkxyas--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3wuu1m8da8y4363fb0gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T2Xkxyas--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3wuu1m8da8y4363fb0gk.png" alt="Image description" width="203" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to write an anti-corruption layer
&lt;/h2&gt;

&lt;p&gt;Anti-corruption-layer code generally applies the "adapter pattern". The interfaces exposed by other microservices are usually provided as an SDK, with classes typically named xxService or xxClient. The anti-corruption-layer code is a wrapper on top of these, typically named xxWrapper or xxAdapter; the Request and Response types are wrapped as well. Since redefining every class yourself is a lot of work, the original structures are generally reused directly through inheritance or composition, and when a field must be overridden, a new field is defined on that basis and the old accessor is delegated to it.&lt;/p&gt;

&lt;p&gt;For example, suppose the original response has a field called id. One day the upstream drops this field and replaces it with projectId, whose business meaning matches the original id. Without an anti-corruption layer, every place that uses this response must change its code. With one, you simply override the getId method in the anti-corruption response to return the parent class's projectId, and the business code does not change at all.&lt;/p&gt;
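&lt;p&gt;A minimal sketch of that example, with invented class names standing in for the real SDK types:&lt;/p&gt;

```java
// Stand-in for the upstream SDK response: the old id field was dropped in
// favor of projectId, which carries the same business meaning.
class ProjectResponse {
    private String projectId;

    public String getId() { return null; }   // old accessor, no longer populated
    public String getProjectId() { return projectId; }
    public void setProjectId(String projectId) { this.projectId = projectId; }
}

// Anti-corruption wrapper: overrides the old accessor to delegate to the new
// field, so business code written against getId() does not change at all.
class ProjectResponseWrapper extends ProjectResponse {
    @Override
    public String getId() {
        return getProjectId();
    }
}
```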

&lt;p&gt;Alternatively, you can define the structures you need yourself and use a tool such as MapStruct to convert the data automatically.&lt;/p&gt;

&lt;p&gt;Pros and cons of the anti-corruption layer&lt;br&gt;
Writing an anti-corruption layer comes at a price, the biggest being additional development cost. So if your upstream and downstream dependencies are few and relatively stable, you can in fact do without one.&lt;/p&gt;

&lt;p&gt;In large teams, paying this extra development cost is worthwhile, because their upstream and downstream relationships are very complicated: the parties may not be on the same team, and they iterate and upgrade frequently. In my own experience, interface changes happen quite often.&lt;/p&gt;

&lt;p&gt;Although we advocate not modifying an existing interface, its field names, its method names, and so on, occasionally a past design turns out to be unreasonable and hard to maintain long-term. If we design a v2 interface around the new model while keeping the original v1 interface, the maintenance cost rises. So callers of v1 are generally pushed to switch to v2, after which v1 is taken offline. At that moment it matters a great deal that the caller has an anti-corruption layer, which greatly reduces the cost of the change.&lt;/p&gt;

</description>
      <category>ddd</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Troubleshooting a JVM GC Long Pause</title>
      <dc:creator>LEO Qin</dc:creator>
      <pubDate>Thu, 06 Apr 2023 05:30:30 +0000</pubDate>
      <link>https://dev.to/ppsrap/troubleshooting-a-jvm-gc-long-pause-10c8</link>
      <guid>https://dev.to/ppsrap/troubleshooting-a-jvm-gc-long-pause-10c8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;It started with abnormal garbage collection in an online application: some of its instances had exceptionally long Full GC pauses, around 15-30 seconds, occurring on average once every two weeks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvzfbdp27hp7ofh9r628.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvzfbdp27hp7ofh9r628.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7ajgacsrk5lxzcjrxlu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7ajgacsrk5lxzcjrxlu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;JVM parameter configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-Xms2048M –Xmx2048M –Xmn1024M –XX:MaxPermSize=512M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvzd0uep7s3cpruwje36.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvzd0uep7s3cpruwje36.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Analyze the GC logs.
The GC log records the execution time and result of every collection. By analyzing it, you can tune the heap and GC settings, or improve the application's object-allocation patterns.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this case, the reported cause of the Full GC is Ergonomics: UseAdaptiveSizePolicy is enabled, so the JVM was adapting and resizing itself, which triggered the Full GC.&lt;/p&gt;
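&lt;p&gt;If the adaptive resizing itself is the trigger and the generation sizes are already pinned explicitly (as with the -Xmn setting above), one option, which should be validated in a test environment before production, is to turn the policy off:&lt;/p&gt;

```
-XX:-UseAdaptiveSizePolicy
```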

&lt;p&gt;This log mainly reflects the state before and after each GC, but it does not yet reveal what is causing the issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1dy0qhviqc0htvicsv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1dy0qhviqc0htvicsv0.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To enable GC logs, the following JVM startup parameters need to be added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/export/log/risk_pillar/gc.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The meanings of common Young GC and Full GC logs are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fveof98dioez15ijcoior.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fveof98dioez15ijcoior.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Investigate server performance metrics further.
With the GC execution times in hand, use the monitoring platform to find metrics with abnormal values at those moments. It turned out that around 5:06 (the time of the GC), CPU usage rose sharply, while swap showed resources being released and memory usage showed an inflection point in its growth.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpbh9bkirbf948j2x24h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpbh9bkirbf948j2x24h.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Did the JVM use swap?&lt;br&gt;
Were the sudden rise in CPU usage and the release of swap space back to memory caused by GC?&lt;br&gt;
To verify whether the JVM used swap, we checked per-process memory usage under the /proc directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for i in (cd/proc;ls∣grep"[0−9]"∣awk′0 &amp;gt;100');
do awk '/Swap:/{a=a+2}END{print '"i"',a/1024"M"}' /proc/$i/smaps 2&amp;gt;/dev/null;
done | sort -k2nr | head -10 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"head -10" means taking the ten processes with the highest swap usage. In the output, the first column is the process ID and the second is the size of that process's swap usage. We can see that one process is indeed using 305MB of swap space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4x4r2jp2hlbd9ij3rmb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4x4r2jp2hlbd9ij3rmb5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a brief introduction to what swap is:&lt;/p&gt;

&lt;p&gt;Swap refers to a swap partition or file, mainly used to trigger memory reclamation when memory is under pressure. Some data in memory may then be swapped out to the swap space so that the system does not run out of memory and hit an OOM or other fatal condition.&lt;/p&gt;

&lt;p&gt;When a process asks the OS for memory and there is not enough, the OS swaps temporarily unused data out of memory into the swap partition, a process called "swap out". When the process needs that data again and the OS finds free physical memory, it swaps the data back from the swap partition into physical memory, a process called "swap in".&lt;/p&gt;

&lt;p&gt;To verify that there is a real relationship between GC time and swap activity, I examined more than a dozen machines, focusing on GC logs with long pauses, and confirmed that the timestamps of GC and swap operations do indeed coincide.&lt;/p&gt;

&lt;p&gt;Furthermore, checking the swappiness parameter on each virtual-machine instance revealed a common pattern: instances with longer Full GCs were configured with vm.swappiness = 30 (a larger value means a greater tendency to use swap), while instances with relatively normal GC times were configured with vm.swappiness = 0 (minimizing the use of swap).&lt;/p&gt;

&lt;p&gt;Swappiness is a Linux kernel parameter that can be set to a value between 0 and 100; it controls the relative weight given to swapping pages out during memory reclamation.&lt;/p&gt;

&lt;p&gt;swappiness=0: use physical memory as much as possible before touching swap space&lt;br&gt;
swappiness=100: use the swap partition actively, promptly moving data from memory into swap&lt;/p&gt;
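&lt;p&gt;Operationally, the parameter can be inspected and changed with sysctl (changing it requires root, and the right value, whether 0, 1, or something higher, depends on the distribution and workload):&lt;/p&gt;

```
sysctl vm.swappiness          # show the current value
sysctl -w vm.swappiness=0     # lower the tendency to swap, until the next reboot
# To persist across reboots, set "vm.swappiness = 0" in /etc/sysctl.conf.
```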

&lt;p&gt;The corresponding physical memory usage rate and swap usage are shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjo9p4bgeu073pgxuqif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjo9p4bgeu073pgxuqif.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbrajmrfhe3phr2y6cwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbrajmrfhe3phr2y6cwd.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Problem analysis:&lt;br&gt;
When memory usage reaches the watermark (governed by vm.swappiness), Linux moves some temporarily unused memory data to the disk swap area to free up available memory; when data in the swap area is needed again, it is moved back into memory. During garbage collection, the JVM must traverse the used memory of the relevant heap region. If part of the heap has been swapped out, every swapped page the traversal touches must first be swapped back in; because this requires disk access, it is far slower than accessing physical memory, and the GC pause becomes very long. The swapping itself is also very CPU- and IO-intensive, so the system falls further behind on reclamation. For high-concurrency, high-QPS services this stall (an extended stop-the-world) can be fatal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Questions:&lt;br&gt;
Does a JVM with swap enabled always take longer to perform GC?&lt;br&gt;
If the JVM dislikes swap so much, why not prohibit its use outright?&lt;br&gt;
How does swap actually work? This server has 8GB of physical memory and is using swap, which suggests physical memory ran short; yet according to the free command, actual physical memory usage does not look that high, while swap occupies nearly 1GB.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zw5twluiaacgi0x3zjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zw5twluiaacgi0x3zjt.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;free: the amount of unused memory, excluding buff/cache.&lt;/p&gt;

&lt;p&gt;shared: shared memory.&lt;/p&gt;

&lt;p&gt;buff/cache: memory used for buffers and the page cache (grows when programs access files frequently).&lt;/p&gt;

&lt;p&gt;available: the amount of memory actually available to applications.&lt;/p&gt;

&lt;p&gt;5. Further thoughts&lt;br&gt;
One might be tempted to simply disable swap altogether.&lt;/p&gt;

&lt;p&gt;In fact, there is no need to be so radical. The world is rarely binary; most choices fall somewhere in between, some leaning towards 0 and others towards 1.&lt;/p&gt;

&lt;p&gt;Clearly, with regard to swap, the JVM can choose to minimize its usage and thereby reduce its impact. Understanding how Linux memory reclaim works helps dispel most of the concerns.&lt;/p&gt;

&lt;p&gt;Let's first take a look at how swap is triggered.&lt;/p&gt;

&lt;p&gt;Linux triggers memory reclaim in two scenarios: directly, when there is not enough free memory to satisfy an allocation; and in the background, when the kswapd daemon periodically checks system memory and starts reclaiming once free memory drops below a watermark.&lt;/p&gt;
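
&lt;p&gt;As a quick check of the kernel's swap tendency, a minimal sketch (Linux-only; the procfs path is an assumption about the target system) can read vm.swappiness directly:&lt;/p&gt;

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class Swappiness {
    // Returns vm.swappiness (0-100) on Linux, or -1 where procfs is absent.
    static int read() throws IOException {
        Path p = Path.of("/proc/sys/vm/swappiness");
        return Files.exists(p) ? Integer.parseInt(Files.readString(p).trim()) : -1;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("vm.swappiness = " + read());
    }
}
```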

&lt;p&gt;6.Speculation&lt;br&gt;
Due to the short intervals of GC in real-time services, the things in memory have no chance to be swapped to swap and are immediately recovered during GC. When GC is performed, data from the swap partition does not need to be swapped back to physical memory, but is calculated entirely based on memory, which makes it much faster. The selection strategy for which memory data to swap into the swap partition is likely to be similar to the LRU algorithm (least recently used).&lt;/p&gt;

&lt;p&gt;Lowering the heap size appropriately can also solve the problem.&lt;/p&gt;
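
&lt;p&gt;For illustration only (the values below are placeholders, not recommendations): the heap is capped on the java command line, and the kernel's swap eagerness via sysctl:&lt;/p&gt;

```shell
# Cap the heap below physical RAM so the OS is not pushed into swap
java -Xms4g -Xmx4g -jar app.jar

# Reduce the kernel's tendency to swap (0-100; lower = less eager)
sysctl -w vm.swappiness=10
```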

&lt;p&gt;This also shows that when deploying Java services on Linux, memory should not simply be allocated as "the bigger the better"; instead, consider the JVM's actual requirements in each scenario: the Java permanent generation, the Java heap (young and old generations), thread stacks, and Java NIO direct buffers.&lt;/p&gt;

&lt;p&gt;7. Conclusion&lt;br&gt;
In conclusion, when swapping and GC coincide, GC times become very long, causing serious JVM stalls and, in extreme cases, service crashes.&lt;/p&gt;

&lt;p&gt;The root cause is that when the JVM performs GC, it must traverse the used memory of the heap partition being collected. If part of the heap has been swapped out at GC time, it must be paged back into memory during the traversal. In the more extreme case where another part of the heap must be swapped out due to insufficient memory, the heap partition is effectively written to swap piece by piece as it is traversed, making GC take far too long. In production, the size of the swap area should be limited, and a high swap usage ratio should be investigated and resolved; where appropriate, lower the heap size or add physical memory.&lt;/p&gt;

&lt;p&gt;Therefore, when deploying Java services on Linux systems, it is important to be cautious about memory allocation.&lt;/p&gt;

</description>
      <category>jvm</category>
      <category>java</category>
      <category>springcloud</category>
    </item>
    <item>
      <title>A Bloody Case Caused by a Thread Pool Rejection Strategy</title>
      <dc:creator>LEO Qin</dc:creator>
      <pubDate>Tue, 10 Jan 2023 15:16:35 +0000</pubDate>
      <link>https://dev.to/ppsrap/a-bloody-case-caused-by-a-thread-pool-rejection-strategy-15j5</link>
      <guid>https://dev.to/ppsrap/a-bloody-case-caused-by-a-thread-pool-rejection-strategy-15j5</guid>
      <description>&lt;h2&gt;
  
  
  An in-depth analysis of a full GC problem in the inventory center's online business, combining the fix and the root-cause analysis into a reusable approach to this class of problems
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Event review&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Starting on July 27, the main site, distributors, and other business parties began to report that orders occasionally timed out. While analyzing and troubleshooting the cause, we were alarmed to find that full GC was occasionally occurring online, as shown in the figures below. Left alone, it would inevitably affect the core order-placing link and user experience of Yanxuan trading, resulting in lost transactions. The inventory center's developers responded quickly, investigated proactively, and resolved the problem while it was still in the bud, avoiding financial loss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ma0qcvbr79dssuwzhhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ma0qcvbr79dssuwzhhl.png" alt="Image description" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feiqznrwil8sx5eej8nmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feiqznrwil8sx5eej8nmt.png" alt="Image description" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88200hhuhdrwu4kseq31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88200hhuhdrwu4kseq31.png" alt="Image description" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Emergency hemostasis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For frequent full GC, based on experience, we made a bold guess: some interface was probably generating large objects and being called frequently. In an emergency, the priority is to keep the system's core functions unaffected first and troubleshoot afterwards. There are generally three options:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Expansion&lt;br&gt;
There are two ways to expand capacity: increase the heap size, or add application machines. Either way, the essence is to delay the onset and reduce the frequency of full GC, protect the core business, and buy time to troubleshoot.&lt;br&gt;
Rate limiting&lt;br&gt;
Rate limiting can be regarded as a form of service degradation: it constrains the system's inbound and outbound traffic to protect the system, and can generally be applied at the proxy layer or the application layer.&lt;br&gt;
Restart&lt;br&gt;
A relatively brute-force method; a moment of carelessness can leave data inconsistent. Not recommended unless necessary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our rate limiting operates at the application-interface level. Since we did not yet know the specific cause and the problem was still in its early stages, we did not apply rate limiting; instead we expanded capacity directly and restarted in passing. As an emergency measure we increased the heap size of some machines from 6 GB to 22 GB and restarted the application for the new parameters to take effect.&lt;br&gt;
After the emergency expansion of machines 73 and 74 on July 27, no full GC occurred for two days after the expansion, giving us breathing room for further investigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p6u9iyyjpygfrdu2hyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p6u9iyyjpygfrdu2hyq.png" alt="Image description" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Problem analysis&lt;/strong&gt;&lt;br&gt;
3.1 Challenges&lt;/p&gt;

&lt;p&gt;Since there was no OOM, there was no on-site memory snapshot, making the cause hard to pin down. The main inventory service involves a great deal of logic (the core business logic alone exceeds 100,000 lines of code, all exercised daily), the business logic is complex, request volume is large, and there was a small number of slow requests, all of which made troubleshooting harder. Lacking a complete infrastructure, we had no global call-monitoring platform to observe what the application was doing before and after a full GC; we could only uncover the truth by analyzing call chains on the problem machines.&lt;/p&gt;

&lt;p&gt;3.2 Surface cause&lt;br&gt;
Essentially, we need to look at what the application was doing when full GC occurred; in other words, what was the last straw that broke the camel's back?&lt;br&gt;
We analyzed the application logs around the time points just before each full GC, combined with slow-SQL analysis. Whenever the business frequently performed [internal and external procurement and outbound] operations for a period of time, the system would trigger a full GC, and the time points matched closely, so our preliminary judgment was that these operations were the trigger. Analysis of the business code showed that, after intervention and interception, an inventory change would load 100,000 rows of data into memory, about 300 MB in total.&lt;br&gt;
Accordingly, on July 28 we urgently contacted the DBAs to migrate part of the business data to another database to avoid further impact on the business, deferring business-process optimization to later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyom4qdywlkq3y15jr1sq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyom4qdywlkq3y15jr1sq.png" alt="Image description" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4hdz1j2a8eitpczjklk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4hdz1j2a8eitpczjklk.png" alt="Image description" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the migration, there was no full GC that day and no business reports of interface timeouts. On July 29 we found that machine 73 (with the upgraded configuration) had no full GC, while machine 154 continued to have full GC. Observing each GC, the amount of memory reclaimed was small, meaning memory was not being released in time: there might be a leak!&lt;/p&gt;

&lt;p&gt;3.3 Root cause&lt;br&gt;
We dumped memory snapshots many times without finding anything similar. Fortunately, machine 155 was upgraded last (a backup machine, mainly used for scheduled tasks, deliberately kept for reference and comparison), and it brought us closer to the root cause.&lt;br&gt;
To dig further, we analyzed the heap memory snapshot of machine 155 and found an interesting phenomenon: a large number of threads were blocked, waiting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5vxizvsgugo7izionnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5vxizvsgugo7izionnl.png" alt="Image description" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsslkui1ilb77s09afub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsslkui1ilb77s09afub.png" alt="Image description" width="800" height="476"&gt;&lt;/a&gt;&lt;br&gt;
Each blocked thread held about 14 MB of memory. These threads were what caused the memory leak. At this point we had finally found the cause and verified our guess: a memory leak had indeed occurred!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.4 Cause analysis&lt;/strong&gt;&lt;br&gt;
3.4.1 Business description&lt;br&gt;
From the analysis above we located the problem code. To make this part of the business easier to follow, here is an overview: the job pulls SKU quantity information from the database, groups every 500 SKUs into a SyncTask, and caches the results in Redis for use by other business parties; it runs every 5 minutes.&lt;/p&gt;

&lt;p&gt;3.4.2 Business code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Override
public String sync(String tableName) {
    // Generate the data version number (yyyy: calendar year; "YYYY" would be week-year)
    DateFormat dateFormat = new SimpleDateFormat("yyyyMMdd_HHmmss_SSS");
    // Start the Leader thread to complete execution and monitoring
    String threadName = "SyncCache-Leader-" + dateFormat.format(new Date());
    Runnable wrapper = ThreadHelperUtil.wrap(new PrimaryRunnable(cacheVersion, tableName,syncCachePool));
    Thread core = new Thread(wrapper, threadName); // Create new thread
    core.start();
    return cacheVersion;
}
private static class PrimaryRunnable implements Runnable {
    private String cacheVersion;
    private String tableName;
    private ExecutorService syncCachePool;
    public PrimaryRunnable(String cacheVersion, String tableName,ExecutorService syncCachePool) {
        this.cacheVersion = cacheVersion;
        this.tableName = tableName;
       this.syncCachePool = syncCachePool;
    }
    @Override
    public void run() {
       ....
        try {
            exec();
            CacheLogger.doFinishLog(cacheVersion, System.currentTimeMillis() - leaderStart);
        } catch (Throwable t) {
            CacheLogger.doExecErrorLog(cacheVersion, System.currentTimeMillis() - leaderStart, t);
        }
    }
    public void exec() {
        // Query data and build synchronization task
        List&amp;lt;SyncTask&amp;gt; syncTasks = buildSyncTask(cacheVersion, tableName);
        // Synchronize task submission thread pool
        Map&amp;lt;SyncTask, Future&amp;gt; futureMap = Maps.newHashMap();
        for (SyncTask task: syncTasks) {
            futureMap.put(task, syncCachePool.submit(new Runnable() {
                @Override
                public void run() {
                    task.run();
                }
            }));
        }

        for (Map.Entry&amp;lt;SyncTask, Future&amp;gt; futureEntry: futureMap.entrySet()) {
            try {
                futureEntry.getValue().get(); // Block getting synchronization task results
            } catch (Throwable t) {
                CacheLogger.doFutureFailedLog(cacheVersion, futureEntry.getKey());
                throw new RuntimeException(t);
            }
        }
    }
}
/**
 * Deny Policy Class
 */
private static class RejectedPolicy implements RejectedExecutionHandler {
    static RejectedPolicy singleton = new RejectedPolicy();
    private RejectedPolicy() {
    }
    @Override
    public void rejectedExecution(Runnable runnable, ThreadPoolExecutor executor) {
        if (runnable instanceof SyncTask) {
            SyncTask task = (SyncTask) runnable;
            CacheLogger.doRejectLog(task);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The queue size is 1000 and the maximum number of threads is 20, so the pool can hold at most (20 + 1000) × 500 = 510,000 SKUs at once, while the current SKU count is about 540,000. If tasks run long, all remaining tasks pile up in the queue, and the queue can run out of space. A full queue triggers the rejection policy, and the rejection policy currently in our project behaves like DiscardPolicy: a newly submitted task is silently discarded without any notification.&lt;/p&gt;
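
&lt;p&gt;The capacity arithmetic above can be made explicit (a minimal sketch; the figures come from the text, the method name is illustrative):&lt;/p&gt;

```java
public class PoolCapacity {
    // Upper bound on SKUs the pool can hold at once:
    // running tasks (maxThreads) + queued tasks, each carrying skusPerTask SKUs.
    static long maxSkusInFlight(int maxThreads, int queueSize, int skusPerTask) {
        return (long) (maxThreads + queueSize) * skusPerTask;
    }

    public static void main(String[] args) {
        long capacity = maxSkusInFlight(20, 1000, 500);
        // 510,000 capacity vs. ~540,000 current SKUs: rejections are possible
        System.out.println("max SKUs in flight: " + capacity);
    }
}
```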

&lt;p&gt;From the analysis here, we summarize the causes of the problem as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;First, when a task submitted to the thread pool triggers the rejection policy, its FutureTask remains in the NEW state; calling get() then reaches LockSupport.park(this), blocking the calling thread forever and causing a memory leak;&lt;/li&gt;
&lt;li&gt;The underlying cause is improper use of the thread pool, with two main problems: the choice of rejection policy is wrong (a silently discarding policy combined with a blocking get()), and the submission method is wrong — the project never needs the task results, so there is no need to use the submit method at all.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
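
&lt;p&gt;A minimal, self-contained sketch of this failure mode (scaled down to 1 thread and a queue of 1; class and method names are illustrative, and JDK's own DiscardPolicy stands in for the project's handler): the third task is silently dropped, its FutureTask stays in the NEW state, and a get() without a timeout would park the caller forever:&lt;/p&gt;

```java
import java.util.concurrent.*;

public class DiscardedFutureDemo {
    // Returns true if a discarded task's Future never completes.
    static boolean wasDiscarded() throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.DiscardPolicy()); // silent drop
        CountDownLatch gate = new CountDownLatch(1);
        pool.execute(() -> { try { gate.await(); } catch (InterruptedException ignored) {} }); // occupies the worker
        pool.execute(() -> {});              // fills the queue
        Future<?> f = pool.submit(() -> {}); // rejected: FutureTask stays NEW
        try {
            // f.get() with no timeout would call LockSupport.park and hang forever;
            // a bounded wait surfaces the problem instead of leaking the thread.
            f.get(200, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } finally {
            gate.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("discarded and blocked: " + wasDiscarded());
    }
}
```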

&lt;p&gt;3.4.4 Getting to the bottom of it&lt;br&gt;
After analyzing this point, we can say that we have found the cause of the problem, that is to say, when FutureTask gets the execution result, it calls LockSupport.park(this) and blocks the main thread. When will the current thread be woken up? Let's move on to the code.&lt;br&gt;
That is, when the task currently assigned by the existing worker thread Worker is executed, it will call the getTask() method of the Worker class to get the task from the blocking queue, and execute the run() method of the task. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Problem solving&lt;/strong&gt;&lt;br&gt;
A combination of measures ensured stable task execution: optimizing the thread pool configuration and the business process, i.e. increasing the thread pool queue size, fixing the rejection policy, optimizing the business process to avoid large objects, and running tasks at off-peak times.&lt;br&gt;
4.1 Thread pool configuration optimization&lt;br&gt;
Increase the thread pool queue size and fix the rejection policy.&lt;/p&gt;

&lt;p&gt;4.1.1 Fix the rejection policy&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;The custom rejection policy in the project was mainly meant to log the information of a rejected task, such as its skuId, so the data could be updated manually to prevent abnormal inventory data from being served to other systems;&lt;/li&gt;
&lt;li&gt;As shown earlier, the runnable passed to the handler is a FutureTask, so the instanceof check never holds. This custom rejection policy therefore behaves just like the thread pool's silent-discard policy: nothing is logged and no notice is given, which carries real risk, because we do not know at submission time that the task will be discarded, and data may be lost;&lt;/li&gt;
&lt;li&gt;After the fix, when the queue is full the rejection policy triggers immediately and throws an exception, so the parent thread is no longer blocked forever waiting for the FutureTask's result.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;PS: the Runnable is currently wrapped by the project's own classes. If you use the native classes, you can obtain the rejected task inside the rejection policy via reflection; this only matters if you need the rejected task's information, and can otherwise be ignored.&lt;/p&gt;
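
&lt;p&gt;A sketch of the corrected direction (class and method names are hypothetical, not the project's actual code): log what is known about the rejected task, then throw, so that submit() fails immediately instead of returning a Future that never completes:&lt;/p&gt;

```java
import java.util.concurrent.*;

public class RejectionDemo {
    // Hypothetical fixed handler: record the rejection, then fail fast.
    static class LoggingAbortPolicy implements RejectedExecutionHandler {
        @Override
        public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
            System.err.println("rejected: " + r); // real code would log task details here
            throw new RejectedExecutionException("queue full, task rejected");
        }
    }

    // Returns true if submitting to a saturated pool now throws instead of blocking later.
    static boolean failsFast() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1), new LoggingAbortPolicy());
        CountDownLatch gate = new CountDownLatch(1);
        pool.execute(() -> { try { gate.await(); } catch (InterruptedException ignored) {} });
        pool.execute(() -> {}); // fills the queue
        try {
            pool.submit(() -> {}); // rejection now surfaces at submission time
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        } finally {
            gate.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("fails fast: " + failsFast());
    }
}
```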

&lt;p&gt;4.1.2 Increase the queue size&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;The maximum number of threads is 20 and the queue size is 1000; the current SKU count is 540,000 with 500 skuIds per task, so if each task runs even slightly long, at most 510,000 SKUs can be handled. Since three scheduled tasks share this common thread pool, we set the queue size to 3000;&lt;/li&gt;
&lt;li&gt;After the queue adjustment, it prevents some SKUs from failing to synchronize their inventory data to the cache in time.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;4.2 Business process optimization&lt;br&gt;
Optimize the large objects that appear in internal and external procurement to avoid loading a 300 MB object on each request, and stagger the execution times of the three scheduled tasks in the common thread pool so they do not interfere with each other as the SKU count grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Summary and takeaways&lt;/strong&gt;&lt;br&gt;
5.1 Summary of full GC solutions&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;What should we do when encountering frequent full GC online? Deal with the emergency first, then analyze the cause. There are three emergency options: restart, rate limiting, and capacity expansion.&lt;/li&gt;
&lt;li&gt;Next, narrow the direction. Generally there are two causes of full GC: resource-allocation problems and program problems. On the resource side, check whether the JVM parameters are configured reasonably; but most full GC is caused by program problems, mainly either large objects or a memory leak;&lt;/li&gt;
&lt;li&gt;Most importantly, analyze the dump file, making sure to capture the memory snapshot at the time of the incident; MAT and VisualVM are suitable analysis tools. For the problem we encountered, jstack could also have been used to capture all threads of the process for analysis;&lt;/li&gt;
&lt;li&gt;A full GC should raise a timely alert, so that development does not lag behind the business in responding. In practice, JVM parameters should also be set reasonably to avoid full GC as far as possible. During this troubleshooting we also adjusted the JVM parameters; a corresponding article will follow.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;5.2 Notes on using thread pools&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;If you do not need task results synchronously, prefer the execute method for submitting tasks, and handle exceptions inside the task to prevent worker threads from being destroyed and recreated repeatedly;&lt;/li&gt;
&lt;li&gt;If you must use the submit method, obtain the result with a timeout, to avoid the memory leak caused by blocking indefinitely;&lt;/li&gt;
&lt;li&gt;Choose rejection policies carefully, and know which combinations of rejection policy and submission method are dangerous; for example, DiscardPolicy combined with submit can leave callers blocked waiting for results forever;&lt;/li&gt;
&lt;li&gt;Thread pool threads must be identifiable, i.e. follow naming conventions, to make troubleshooting easier.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
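
&lt;p&gt;The first two points above can be sketched as follows (illustrative only; the thread name and values are made up):&lt;/p&gt;

```java
import java.util.concurrent.*;

public class SubmitPatterns {
    static int compute() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                2, r -> new Thread(r, "sync-cache-worker")); // recognizable thread names
        // No result needed: use execute, and catch inside the task so an escaping
        // Throwable cannot kill (and force recreation of) the worker thread.
        pool.execute(() -> {
            try {
                // business logic here
            } catch (Throwable t) {
                System.err.println("task failed: " + t);
            }
        });
        // Result needed: use submit, but always bound the wait with a timeout.
        Future<Integer> f = pool.submit(() -> 42);
        try {
            return f.get(1, TimeUnit.SECONDS);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("result = " + compute());
    }
}
```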

</description>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
