Troubleshooting a concurrent mode failure: the investigation and the fix

2024-10-21 11:23:02

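Background: a scheduled background script runs a batch job at 05:30 every morning that scans the database and applies business logic to the rows.

GC error log: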

2017-07-05T05:30:54.408+0800: 518534.458: [CMS-concurrent-mark-start]
2017-07-05T05:30:55.279+0800: 518535.329: [GC 518535.329: [ParNew: 838848K->838848K(1118464K), 0.0000270 secs]
[CMS-concurrent-mark: 1.564/1.576 secs] [Times: user=10.88 sys=0.31, real=1.57 secs]
 (concurrent mode failure): 2720535K->2719116K(2796224K), 13.3742340 secs] 
 3559383K->2719116K(3914688K), 
 [CMS Perm : 38833K->38824K(524288K)], 13.3748020 secs] [Times: user=16.19 sys=0.00, real=13.37 secs]
2017-07-05T05:31:08.659+0800: 518548.710: [GC [1 CMS-initial-mark: 2719116K(2796224K)] 2733442K(3914688K), 0.0065150 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2017-07-05T05:31:08.666+0800: 518548.716: [CMS-concurrent-mark-start]
2017-07-05T05:31:09.528+0800: 518549.578: 
[GC 518549.578: [ParNew: 838848K->19737K(1118464K), 0.0055800 secs] 
3557964K->2738853K(3914688K), 0.0060390 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
[CMS-concurrent-mark: 1.644/1.659 secs] [Times: user=14.15 sys=0.84, real=1.66 secs]
2017-07-05T05:31:10.326+0800: 518550.376: [CMS-concurrent-preclean-start]
2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-preclean: 0.015/0.015 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-abortable-preclean-start]

Reference: understanding-cms-gc-logs

According to that article, the cause of the concurrent mode failure here is: there was not enough space in the CMS generation to promote the worst case surviving young generation objects. We name this failure as “full promotion guarantee failure”
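The quoted explanation can be checked against the numbers in the log above. The following is a minimal sketch, not the JVM's actual algorithm (the real check also takes promotion statistics into account); the class and method names are made up for illustration. It compares the occupied young generation with the free space left in the CMS generation at the moment of the failure.

```java
// Simplified view of the "full promotion guarantee" check, using the
// figures from the ParNew / concurrent mode failure entry in the log.
// PromotionGuarantee and worstCasePromotionFits are hypothetical names.
final class PromotionGuarantee {

    // Worst case: every object currently in the young generation survives
    // and has to be promoted into the old (CMS) generation.
    static boolean worstCasePromotionFits(long youngUsedKb, long oldUsedKb, long oldCapacityKb) {
        long oldFreeKb = oldCapacityKb - oldUsedKb;
        return youngUsedKb <= oldFreeKb;
    }

    public static void main(String[] args) {
        long youngUsedKb = 838_848L;     // ParNew: 838848K->838848K
        long oldUsedKb = 2_720_535L;     // CMS generation before: 2720535K
        long oldCapacityKb = 2_796_224L; // CMS generation capacity: 2796224K

        System.out.println("old free KB = " + (oldCapacityKb - oldUsedKb)); // 75689
        System.out.println("fits = "
                + worstCasePromotionFits(youngUsedKb, oldUsedKb, oldCapacityKb)); // false
    }
}
```

With roughly 74 MB free in the old generation and about 819 MB occupied in the young generation, the worst case cannot fit, which matches the ParNew entry that finishes in 0.0000270 secs without reclaiming anything.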

The suggested fixes are: The concurrent mode failure can either be avoided increasing the tenured generation size or initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value and setting UseCMSInitiatingOccupancyOnly to true.

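The second option needs to be weighed carefully, because setting CMSInitiatingOccupancyFraction too low can trigger frequent CMS cycles and hurt performance. [See also the advice against configuring CMS for heaps under 3 GB: why no cms under 3G]

Investigation: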

1. JVM parameters: -Xmx4096m -Xms2048m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSCompactAtFullCollection -XX:MaxTenuringThreshold=10 -XX:-UseAdaptiveSizePolicy -XX:PermSize=512M -XX:MaxPermSize=1024M -XX:SurvivorRatio=3 -XX:NewRatio=2 -XX:+PrintGCDateStamps -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails. Nothing looks seriously wrong here, although -XX:+UseCMSInitiatingOccupancyOnly, which the quoted advice mentions, is not set.
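To relate these flags to the sizes printed in the GC log, here is a rough sketch of the generation arithmetic implied by -Xmx4096m, -XX:NewRatio=2 and -XX:SurvivorRatio=3. It is an approximation: the JVM aligns generation sizes internally, so the real values differ slightly, and the class and method names are invented for this example.

```java
// Approximate heap layout derived from -Xmx, NewRatio and SurvivorRatio.
// HeapLayout is a hypothetical helper; the JVM's own sizing also applies
// alignment, so expect small differences from the values in the GC log.
final class HeapLayout {

    // NewRatio = old / young, so young is 1 / (NewRatio + 1) of the heap.
    static long youngKb(long heapKb, int newRatio) {
        return heapKb / (newRatio + 1);
    }

    // SurvivorRatio = eden / one survivor space; there are two survivor spaces.
    static long survivorKb(long youngKb, int survivorRatio) {
        return youngKb / (survivorRatio + 2);
    }

    static long edenKb(long youngKb, int survivorRatio) {
        return youngKb - 2 * survivorKb(youngKb, survivorRatio);
    }

    public static void main(String[] args) {
        long heapKb = 4096L * 1024;               // -Xmx4096m
        long young = youngKb(heapKb, 2);          // -XX:NewRatio=2
        long survivor = survivorKb(young, 3);     // -XX:SurvivorRatio=3
        long eden = edenKb(young, 3);

        System.out.println("eden KB            = " + eden);
        System.out.println("eden + survivor KB = " + (eden + survivor));
        System.out.println("old KB             = " + (heapKb - young));
    }
}
```

The results land within a few dozen KB of the log's figures: 838848K for eden, 1118464K for the ParNew capacity (eden plus one survivor space) and 2796224K for the CMS generation. Read this way, "ParNew: 838848K->838848K" suggests that eden was completely full when the failure happened.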

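2. The alert fires once a day at 05:30, so the scheduled job is the likely culprit.

This was easy to track down. The service is a script service with almost no online business logic, so finding the scheduled job that runs at that time was enough to analyze the problem.

Business code: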

        int batchNumber = 1;
        int realCount = 0;
        int offset = 0;
        int limit = 999;
        int totalCount = 0;
        // Create a fixed-size pool of 20 threads
        ExecutorService service = Executors.newFixedThreadPool(20);
        while (true) {
            LogUtils.info(logger, "{0},{1}->{2}", batchNumber, offset, (offset + limit));
            try {
                // Paged query
                Set<String> result = query(offset, limit);
                realCount = result.size();
                // Hand the queried rows to the thread pool for processing
                service.execute(new AAAAAAA(result, batchNumber));
            } catch (Exception e) {
                LogUtils.error(logger, e, "exception,batch:{0},offset:{1},count:{2}", batchNumber, offset, limit);
                break;
            }
            totalCount += realCount;
            if (realCount < limit) {
                break;
            }
            batchNumber++;
            offset += limit;
        }
        service.shutdown();

The code uses a fixed pool of 20 threads and loops, pulling 999 rows from the database on each iteration and submitting them to the pool.

Analysis

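newFixedThreadPool is backed by an unbounded LinkedBlockingQueue, and my table holds more than 20 million rows. The loop keeps fetching pages and pushing them onto the queue regardless of how fast the 20 workers drain it, so every pending task and its result set stays on the heap. I suppose I was lucky this did not exhaust memory outright.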
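The unbounded queue is easy to confirm from code. The sketch below casts the result of Executors.newFixedThreadPool to ThreadPoolExecutor, which holds for the standard JDK implementation but is not promised by the declared return type, and then asks the work queue how much room it has left. The class name is made up for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Inspects the work queue behind Executors.newFixedThreadPool.
// FixedPoolQueueCheck is a hypothetical name used for illustration.
final class FixedPoolQueueCheck {

    static int queueCapacityOfFixedPool(int threads) {
        ExecutorService service = Executors.newFixedThreadPool(threads);
        try {
            // Assumption: the standard JDK returns a ThreadPoolExecutor here.
            ThreadPoolExecutor pool = (ThreadPoolExecutor) service;
            // For an empty queue, remaining capacity equals total capacity.
            return pool.getQueue().remainingCapacity();
        } finally {
            service.shutdown();
        }
    }

    public static void main(String[] args) {
        int capacity = queueCapacityOfFixedPool(20);
        System.out.println("queue capacity = " + capacity);
        System.out.println("unbounded in practice = " + (capacity == Integer.MAX_VALUE));
    }
}
```

A capacity of Integer.MAX_VALUE means execute never rejects or blocks the submitting loop, so nothing slows the producer down before the heap fills up.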

In the end I switched to:

BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(20);
ThreadPoolExecutor service = new ThreadPoolExecutor(20, 20, 1, TimeUnit.HOURS, queue, new ThreadPoolExecutor.CallerRunsPolicy());

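This uses a bounded queue with CallerRunsPolicy as the rejection policy. When no worker is free and the queue is full, the main thread runs the task's run method itself. While it is busy doing that it cannot fetch or submit more work, so the multi-threaded pipeline can temporarily degrade toward single-threaded execution and lose some throughput, but the number of pending tasks held in memory stays bounded.

I'll check tomorrow to see how it works out.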
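The throttling effect of a bounded queue combined with CallerRunsPolicy can be observed with a small experiment. This is a sketch under assumed values: the task count, pool size, queue size and the 2 ms sleep standing in for real work are arbitrary, and the class name is invented for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Demonstrates backpressure from a bounded queue plus CallerRunsPolicy.
// BackpressureDemo and its parameters are hypothetical, for illustration only.
final class BackpressureDemo {

    // Returns { completed tasks, largest queue size seen, tasks run by the caller }.
    static int[] run(int tasks, int threads, int queueSize) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 1, TimeUnit.HOURS,
                new ArrayBlockingQueue<Runnable>(queueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
        final Thread producer = Thread.currentThread();
        final AtomicInteger completed = new AtomicInteger();
        final AtomicInteger ranOnCaller = new AtomicInteger();
        int maxQueued = 0;

        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                if (Thread.currentThread() == producer) {
                    ranOnCaller.incrementAndGet(); // rejected task, run by the submitter
                }
                try {
                    Thread.sleep(2); // stand-in for processing one batch
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
            maxQueued = Math.max(maxQueued, pool.getQueue().size());
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { completed.get(), maxQueued, ranOnCaller.get() };
    }

    public static void main(String[] args) {
        int[] r = run(200, 2, 4);
        System.out.println("completed=" + r[0] + " maxQueued=" + r[1] + " ranOnCaller=" + r[2]);
    }
}
```

No task is dropped and the queue never grows past its configured size. Whenever the submitting thread ends up running a task itself, it stops producing for that long, which is what keeps memory use flat in exchange for some throughput.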
