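Background
If you are not familiar with Linux load, you may want to read this first.
Recently one of our production systems (a Java application) ran into a very strange problem: while CPU usage was completely steady, the load showed periodic fluctuations (see the figure above). Stopping the JVM brought everything back to normal. We spent quite a lot of time investigating and ruled out scheduled jobs and I/O, as well as the JVM itself. In the end we suspected the kernel, so we set out to write a test program to verify it.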
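Test program

The first question is how to write a test that consumes a specified percentage of CPU. Consuming CPU means consuming CPU time, so the basic idea is to eat up that percentage of each unit of time. For example, to consume 30% of the CPU with a time unit of 100 ms, run flat out for 30 ms and then sleep for 70 ms. The code is as follows: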
```java
public class CpuUtil_Single {

    // Length of one busy/sleep cycle in milliseconds
    public static final int TIME_UNIT_MS = 100;
    // Written by the busy loop so the JIT cannot discard the computation as dead code
    public static double tmp;

    public static void main(String[] args) throws Exception {
        int cpuUtil = 30;
        if (args.length != 0) {
            cpuUtil = Integer.parseInt(args[0]);
        }
        if (cpuUtil < 5 || cpuUtil > 80) {
            throw new IllegalArgumentException("cpuUtil must be between 5 and 80");
        }
        // e.g. 30% of 100 ms: run for 30 ms, then sleep for 70 ms
        int runTime = TIME_UNIT_MS * cpuUtil / 100;
        int sleepTime = TIME_UNIT_MS - runTime;
        System.out.println("runTime(ms):" + runTime + ",sleepTime(ms):" + sleepTime);
        while (true) {
            eatCpuTime(runTime);
            Thread.sleep(sleepTime);
        }
    }

    // Busy-spins on the current thread for useTimeMs milliseconds of wall-clock time
    public static void eatCpuTime(long useTimeMs) {
        long start = System.nanoTime();
        long useTimeNano = useTimeMs * 1000000;
        while (true) {
            for (int i = 0; i < 10000; i++) {
                tmp = Math.sqrt(i) * Math.sqrt(i);
            }
            if ((System.nanoTime() - start) >= useTimeNano) {
                break;
            }
        }
    }
}
```
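Let's run it:
```shell
javac CpuUtil_Single.java
# The argument is the target utilization: run the CPU at 40%
java CpuUtil_Single 40
```
If your machine has a single-core CPU, the code above works fine and is fairly accurate. On a multi-core machine, however, it falls short and needs some changes. You can watch the difference with top: the previous program only keeps one core busy, while the one below keeps every core busy.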
```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CpuUtil_Multi {

    public static final int TIME_UNIT_MS = 100;
    public static double tmp;
    // One worker thread per logical CPU so that every core gets loaded
    public static final int CpuCoreNum = Runtime.getRuntime().availableProcessors();
    public static ExecutorService MyExecutor = Executors.newFixedThreadPool(CpuCoreNum);

    public static void main(String[] args) throws Exception {
        int cpuUtil = 30;
        if (args.length != 0) {
            cpuUtil = Integer.parseInt(args[0]);
        }
        if (cpuUtil < 5 || cpuUtil > 80) {
            throw new IllegalArgumentException("cpuUtil must be between 5 and 80");
        }
        int runTime = TIME_UNIT_MS * cpuUtil / 100;
        int sleepTime = TIME_UNIT_MS - runTime;
        System.out.println("runTime(ms):" + runTime + ",sleepTime(ms):" + sleepTime
                + ",CpuCoreNum:" + CpuCoreNum);
        // Every worker runs the same busy/sleep cycle as the single-threaded version
        for (int i = 0; i < CpuCoreNum; i++) {
            MyExecutor.execute(() -> {
                while (true) {
                    try {
                        eatCpuTime(runTime);
                        Thread.sleep(sleepTime);
                    } catch (InterruptedException e) {
                        // Restore the interrupt flag and let the worker exit
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
    }

    // Busy-spins on the current thread for useTimeMs milliseconds of wall-clock time
    public static void eatCpuTime(long useTimeMs) {
        long start = System.nanoTime();
        long useTimeNano = useTimeMs * 1000000;
        while (true) {
            for (int i = 0; i < 10000; i++) {
                tmp = Math.sqrt(i) * Math.sqrt(i);
            }
            if ((System.nanoTime() - start) >= useTimeNano) {
                break;
            }
        }
    }
}
```
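Besides eyeballing top, one way to check that a worker thread really gets roughly the requested share of a core is to compare the thread's CPU time with the elapsed wall-clock time. The sketch below does that with the standard `ThreadMXBean.getCurrentThreadCpuTime()`; the class and method names (`DutyCycleCheck`, `spin`, `measure`) are hypothetical, and it assumes a JVM that supports per-thread CPU time (HotSpot on Linux does). A result somewhat below the target is plausible, since `Thread.sleep` only guarantees a minimum sleep time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DutyCycleCheck {
    // Sink for the busy loop so the JIT cannot remove it
    static volatile double sink;

    // Busy-spins for roughly runMs milliseconds of wall-clock time
    static void spin(long runMs) {
        long end = System.nanoTime() + runMs * 1_000_000L;
        while (System.nanoTime() - end < 0) {
            sink += Math.sqrt(sink + 1);
        }
    }

    // Runs `units` busy/sleep cycles on the calling thread and returns
    // (CPU time used by this thread) / (wall-clock time elapsed)
    static double measure(int runMs, int sleepMs, int units) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isCurrentThreadCpuTimeSupported()) {
            throw new UnsupportedOperationException("thread CPU time not supported by this JVM");
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        long cpu0 = mx.getCurrentThreadCpuTime();
        long wall0 = System.nanoTime();
        for (int i = 0; i < units; i++) {
            spin(runMs);
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        long cpuNs = mx.getCurrentThreadCpuTime() - cpu0;
        long wallNs = System.nanoTime() - wall0;
        return (double) cpuNs / wallNs;
    }

    public static void main(String[] args) {
        // 40% target: 40 ms busy + 60 ms asleep, repeated 20 times (about 2 s)
        double ratio = measure(40, 60, 20);
        System.out.printf("measured utilization of this thread: %.1f%%%n", ratio * 100);
    }
}
```

Measuring the calling thread keeps the sketch simple; the same comparison could be made inside each pool worker of `CpuUtil_Multi`.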
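Reproducing the problem
Linux load is essentially computed by sampling the length of the run queue (on Linux, the number of tasks in the R and D states) roughly every 5 seconds. If the system's load fluctuates with a very regular period, this sampling method goes wrong. To reproduce the problem, the code above needs one more change: every 5 seconds, all cores are kept busy at the target utilization for about 1 second and then left idle.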
```java
package name.chengchao.mylinuxtest.cpu;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CpuUtil_Multi_Interval {

    public static final int TIME_UNIT_MS = 100;
    // A burst of CPU load starts every 5 seconds
    public static final int CPU_WAVE_INTERVAL_SECOND = 5;
    public static double tmp;
    public static final int CpuCoreNum = Runtime.getRuntime().availableProcessors();
    public static ScheduledExecutorService MyScheduledExecutor = Executors.newScheduledThreadPool(1);
    public static ExecutorService MyExecutor = Executors.newFixedThreadPool(CpuCoreNum);

    public static void main(String[] args) throws Exception {
        int cpuUtil = 30;
        if (args.length != 0) {
            cpuUtil = Integer.parseInt(args[0]);
        }
        if (cpuUtil < 5 || cpuUtil > 80) {
            throw new IllegalArgumentException("cpuUtil must be between 5 and 80");
        }
        int runTime = TIME_UNIT_MS * cpuUtil / 100;
        int sleepTime = TIME_UNIT_MS - runTime;
        System.out.println("runTime(ms):" + runTime + ",sleepTime(ms):" + sleepTime
                + ",CpuCoreNum:" + CpuCoreNum);
        // Each burst lasts loopNum cycles: 10 * 100 ms, i.e. about 1 second
        int loopNum = 1000 / TIME_UNIT_MS;
        MyScheduledExecutor.scheduleAtFixedRate(() -> {
            for (int i = 0; i < CpuCoreNum; i++) {
                MyExecutor.execute(() -> {
                    for (int j = 0; j < loopNum; j++) {
                        try {
                            eatCpuTime(runTime);
                            Thread.sleep(sleepTime);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                    }
                });
            }
        }, 0, CPU_WAVE_INTERVAL_SECOND, TimeUnit.SECONDS);
    }

    // Busy-spins on the current thread for useTimeMs milliseconds of wall-clock time
    public static void eatCpuTime(long useTimeMs) {
        long start = System.nanoTime();
        long useTimeNano = useTimeMs * 1000000;
        while (true) {
            for (int i = 0; i < 10000; i++) {
                tmp = Math.sqrt(i) * Math.sqrt(i);
            }
            if ((System.nanoTime() - start) >= useTimeNano) {
                break;
            }
        }
    }
}
```
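To see why the sampled load drifts against this workload without waiting a whole day, the interaction can be simulated. The sketch below is a simplified model, not the kernel's code: in the kernel sources the sampling interval `LOAD_FREQ` is defined as `5*HZ+1` ticks, so assuming HZ=1000 a sample is taken every 5001 ms, one tick longer than the 5 s workload period, and each sample lands 1 ms later in the burst cycle than the previous one. It uses the kernel's fixed-point constants (`FSHIFT = 11`, `EXP_1 = 1884`) with the older form of `calc_load()` (newer kernels also round up while load is rising). The 8 cores, the 40% target, the start phase of 0 and the rule that "runnable threads = all cores during a busy slice" are all assumptions; other tasks and the per-CPU details of real sampling are ignored.

```java
public class LoadAvgAliasingSim {
    // Fixed-point constants as defined in the Linux kernel's load-average code
    static final int FSHIFT = 11;
    static final long FIXED_1 = 1L << FSHIFT; // 1.0 in fixed point
    static final long EXP_1 = 1884;           // 1/exp(5s/1min) in fixed point

    // Older form of the kernel's calc_load(): an exponential moving average
    static long calcLoad(long load, long exp, long active) {
        load *= exp;
        load += active * (FIXED_1 - exp);
        return load >> FSHIFT;
    }

    // Hypothetical model of CpuUtil_Multi_Interval: at the start of every waveMs
    // period, `cores` threads run 100 ms cycles (busy for runMs) for 1000 ms and
    // then stay idle. Returns how many threads are runnable at time tMs.
    static int activeAt(long tMs, int cores, int runMs, long waveMs) {
        long phase = tMs % waveMs;
        if (phase >= 1000) {
            return 0;
        }
        return (phase % 100) < runMs ? cores : 0;
    }

    // Samples the model every sampleMs and returns load1 after each sample
    static double[] simulate(int cores, int runMs, long waveMs, long sampleMs, int samples) {
        double[] load1 = new double[samples];
        long load = 0;
        for (int k = 0; k < samples; k++) {
            long active = activeAt(k * sampleMs, cores, runMs, waveMs) * FIXED_1;
            load = calcLoad(load, EXP_1, active);
            load1[k] = (double) load / FIXED_1;
        }
        return load1;
    }

    public static void main(String[] args) {
        long sampleMs = 5001; // assumption: HZ=1000, so LOAD_FREQ = 5*HZ+1 ticks
        int samplesPerDay = (int) (24L * 3600 * 1000 / sampleMs);
        double[] load1 = simulate(8, 40, 5000, sampleMs, samplesPerDay);
        // Print one value roughly every 30 minutes (360 samples of ~5 s)
        for (int k = 0; k < load1.length; k += 360) {
            System.out.printf("t=%5.2fh load1=%.2f%n", k * sampleMs / 3_600_000.0, load1[k]);
        }
    }
}
```

In this model the simulated load1 stays near 0 for hours, then swings high while the samples walk through the 1 s burst, even though average CPU usage never changes. If `sampleMs` is set to exactly 5000, every sample hits the same instant of the burst and load1 settles near 8 instead. With HZ=250 the interval would be 5004 ms and the drift correspondingly faster.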
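Run this code for about a day and you should see load that looks like the figure above. The reason is that the system's load sampling period and the period of the CPU fluctuation slowly drift apart and line up again, over and over, and that is what produces this pattern. You can also change the scheduled task's period to 3s and try it: the result is the same, only the period of the fluctuation changes.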
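One way to quantify how the period changes is simple arithmetic on the sampling phase. If samples are taken every S ms and the workload repeats every W ms, the position of each sample within the workload cycle (sample time mod W) repeats exactly after W / gcd(S, W) samples. The sketch below assumes S = 5001 ms (HZ=1000, as above); the class and method names are hypothetical. It only computes when the sampling pattern repeats exactly; the visible shape of the load curve can also contain shorter swings within that period.

```java
public class AliasPeriod {
    // Greatest common divisor (Euclid's algorithm)
    static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Number of samples before (sample time mod waveMs) repeats exactly
    static long repeatSamples(long sampleMs, long waveMs) {
        return waveMs / gcd(sampleMs, waveMs);
    }

    public static void main(String[] args) {
        long sampleMs = 5001; // assumption: HZ=1000, LOAD_FREQ = 5*HZ+1 ticks
        for (long waveMs : new long[] {5000, 3000}) {
            long n = repeatSamples(sampleMs, waveMs);
            System.out.printf("wave=%ds -> repeats every %d samples (~%.2f h)%n",
                    waveMs / 1000, n, n * sampleMs / 3_600_000.0);
        }
    }
}
```

Under this assumption the 5 s workload repeats every 5000 samples (about 6.95 h) and the 3 s workload every 1000 samples (about 1.39 h), so the pattern's period changes while the underlying drift mechanism stays the same.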