//2019/7/31 18:41:14
掐指一算应该resore完了呀,是不是天热想罢工?不过已经差不多30个小时了
无意间一查 tail -500f /var/log/messages 发现有些“more than 120 seconds|hung_task_timeout_secs”,还有写kernel乱七八糟的,有点蒙圈。。。先分析再干活
咋一看2个帖子分析的非常有道理,那就干吧,粗体是要执行查看的
问题原因:
默认情况下,Linux会最多使用40%的可用内存作为文件系统缓存。当超过这个阈值后,文件系统会把将缓存中的内存全部写入磁盘,导致后续的IO请求都是同步的。
将缓存写入磁盘时,有一个默认120秒的超时时间。出现上面的问题的原因是IO子系统的处理速度不够快,不能在120秒将缓存中的数据全部写入磁盘。
IO系统响应缓慢,导致越来越多的请求堆积,最终系统内存全部被占用,导致系统失去响应。
系统默认
[root@dinpaydg_cc ~]# sysctl -a | grep ‘vm.dirty‘
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
[root@dinpaydg_cc ~]# 
解决方法:
根据应用程序情况,对vm.dirty_ratio,vm.dirty_background_ratio两个参数进行调优设置
root执行下看看
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
sysctl -p
[root@dinpaydg_cc ~]# sysctl -a | grep ‘vm.dirty‘
vm.dirty_background_ratio = 5	#调小一倍
vm.dirty_background_bytes = 0
vm.dirty_ratio = 10	#调小一倍
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
[root@dinpaydg_cc ~]# 
当然我只用了上免得方法,生产暂时没法重启。
如果系统永久生效,修改/etc/sysctl.conf文件。加入如下两行:
#vi /etc/sysctl.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
重启系统生效。
系统日志分析
grep ‘Jul 31‘ messages | grep -E ‘more than 120 seconds|hung_task_timeout_secs‘
--发现机器restore了30个小时
[root@dinpaydg_cc log]# grep ‘Jul 31‘ messages | grep -E ‘more than 120 seconds|hung_task_timeout_secs‘
Jul 31 15:11:57 dinpaydg_cc kernel: INFO: task oracle:4146 blocked for more than 120 seconds.
Jul 31 15:11:57 dinpaydg_cc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 16:17:57 dinpaydg_cc kernel: INFO: task oracle:4147 blocked for more than 120 seconds.
Jul 31 16:17:57 dinpaydg_cc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 16:23:57 dinpaydg_cc kernel: INFO: task oracle:4144 blocked for more than 120 seconds.
Jul 31 16:23:57 dinpaydg_cc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 16:23:57 dinpaydg_cc kernel: INFO: task oracle:4146 blocked for more than 120 seconds.
Jul 31 16:23:57 dinpaydg_cc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 16:25:57 dinpaydg_cc kernel: INFO: task oracle:4144 blocked for more than 120 seconds.
Jul 31 16:25:57 dinpaydg_cc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 17:15:57 dinpaydg_cc kernel: INFO: task oracle:4147 blocked for more than 120 seconds.
Jul 31 17:15:57 dinpaydg_cc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[root@dinpaydg_cc log]# 
资源使用分析
sar -r 查看发现这一段时间的内存使用情况,已使用内存的百分数(%memused)非常高,已使用交换空间的百分数(%swapused)比例也在个位数到40%直接,服务器在10:31重启,10:00 到10:30之间的数据没有,不清楚什么原因导致。从这些数据也可以看出内存资源非常紧张。
另外也用sar -u查看了CPU,CPU使用率不高。基本正常,当时忘记查看I/O和传送速率的统计信息(现在去查看,已经看不到昨天的数据了),另外在message日志里面看到有"ata2.00: status: { DRDY ERR }", "ata2.00: qc timeout(cmd 0xa0)"; 之类的错误信息
【内存有瓶颈】一直99%以上
[root@dinpaydg_cc log]# sar -r
Linux 2.6.32-573.12.1.el6.x86_64 (dinpaydg_cc)  07/31/2019      _x86_64_        (16 CPU)
10:50:02 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
11:00:01 AM    355288  65617760     99.46    150040  63419688   2689968      3.27
11:10:01 AM    354528  65618520     99.46    150088  63423884   2690020      3.27
11:20:01 AM    367640  65605408     99.44    150136  63412012   2689880      3.27
11:30:01 AM    362712  65610336     99.45    150088  63419560   2689976      3.27
11:40:01 AM    368524  65604524     99.44    150084  63411952   2690108      3.27
11:50:01 AM    347320  65625728     99.47    149712  63434068   2689940      3.27
12:00:01 PM    351676  65621372     99.47    149792  63426180   2689956      3.27
12:10:01 PM    356232  65616816     99.46    149924  63415948   2689944      3.27
12:20:01 PM    355240  65617808     99.46    150068  63416388   2689804      3.27
12:30:01 PM    368880  65604168     99.44    150048  63401996   2690048      3.27
12:40:01 PM    359596  65613452     99.45    150104  63413688   2690104      3.27
12:50:01 PM    350048  65623000     99.47    150172  63427012   2689940      3.27
01:00:01 PM    349784  65623264     99.47    150080  63417960   2690072      3.27
01:10:01 PM    362572  65610476     99.45    149996  63401240   2690092      3.27
01:20:01 PM    351272  65621776     99.47    150068  63414564   2689928      3.27
01:30:01 PM    350668  65622380     99.47    150272  63422528   2690012      3.27
01:40:01 PM    352900  65620148     99.47    150252  63412152   2690120      3.27
01:50:01 PM    366284  65606764     99.44    150116  63384588   2690016      3.27
02:00:01 PM    363068  65609980     99.45    150276  63397584   2690048      3.27
02:10:01 PM    368648  65604400     99.44    149992  63376084   2690120      3.27
02:20:01 PM    369268  65603780     99.44    149928  63373840   2690008      3.27
02:30:01 PM    362060  65610988     99.45    149820  63376720   2690012      3.27
02:40:01 PM    367160  65605888     99.44    149712  63383124   2690120      3.27
02:50:01 PM    360448  65612600     99.45    149276  63370696   2689952      3.27
03:00:01 PM    358000  65615048     99.46    149408  63372396   2690012      3.27
03:10:01 PM    359588  65613460     99.45    147528  63255224   2689944      3.27
03:20:01 PM    351660  65621388     99.47    147348  63241720   2690072      3.27
03:30:01 PM    355504  65617544     99.46    146716  63225352   2689952      3.27
03:40:01 PM    369560  65603488     99.44    147140  63211492   2682996      3.26
03:50:02 PM    365856  65607192     99.45    147276  63221384   2682936      3.26
04:00:01 PM    350200  65622848     99.47    147108  63235804   2685272      3.26
04:10:02 PM    359236  65613812     99.46    147284  63215036   2682936      3.26
04:20:01 PM    348992  65624056     99.47    147220  63184804   2683064      3.26
04:30:01 PM    348712  65624336     99.47    147492  63182200   2682868      3.26
04:40:01 PM    363872  65609176     99.45    147200  63189104   2683104      3.26
04:50:01 PM    355796  65617252     99.46    147840  63249792   2682868      3.26
05:00:01 PM    351828  65621220     99.47    147484  63243968   2682996      3.26
05:10:01 PM    368956  65604092     99.44    147996  63236072   2682984      3.26
05:20:01 PM    364492  65608556     99.45    147844  63237672   2683060      3.26
05:30:01 PM    350464  65622584     99.47    147632  63239828   2682912      3.26
05:40:01 PM    360896  65612152     99.45    148100  63253944   2683084      3.26
05:50:01 PM    366972  65606076     99.44    149312  63276300   2683028      3.26
06:00:01 PM    361516  65611532     99.45    150500  63310344   2682912      3.26
06:10:01 PM    355424  65617624     99.46    150628  63336856   2682980      3.26
06:20:01 PM    348836  65624212     99.47    150708  63356368   2682984      3.26
06:30:01 PM    351704  65621344     99.47    150700  63358596   2682912      3.26
06:40:01 PM    357428  65615620     99.46    151420  63355612   2687860      3.26
06:50:01 PM    350732  65622316     99.47    151608  63366176   2687856      3.26
Average:       358674  65614374     99.46    143847  63371864   2688699      3.26
[root@dinpaydg_cc log]# 
参考帖子,感谢
Linux服务器宕机案例第二则
https://www.cnblogs.com/kerrycode/p/5653586.html
故障报错之/proc/sys/kernel/hung_task_timeout_secs
https://blog.csdn.net/weixin_43279032/article/details/87718804
more than 120 seconds|hung_task_timeout_secs 什么鬼?
原文:https://www.cnblogs.com/ritchy/p/11284491.html