Tuesday, March 24, 2009

Troubleshooting excessive linux iowait time

I hate when a unix system gets really slow but is doing almost nothing. Typically what happens is the load sky rockets, idle time goes to zero, and cpu system and user time usage is very low.

The vmstat command is great for monitoring the situation.

vmstat 30

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
6 33 1131712 9400 5740 103116 2017 5978 3022 5995 657 855 4 6 0 90
0 31 1172848 106668 2664 41288 3792 4598 4496 5061 726 1094 25 8 0 67
1 31 1146468 8984 1088 19060 2559 674 3034 882 559 642 19 9 0 71
1 30 1238520 15116 2008 22084 2362 4689 2517 4707 712 878 3 6 0 91
1 26 1345528 12440 4756 55188 3861 6525 4744 6729 775 1141 5 9 0 87
1 27 1439400 20520 4952 61852 2641 6133 2892 6889 734 914 6 7 0 87
2 22 1505548 291744 988 17412 3973 5662 4295 5684 802 1194 16 9 0 75



The vmstat output above shows a system spending most of its time in iowaits with a little more than half of the activity being swapping shared memory.

The machine has 1GB ram and up to 1.5GB swapped in above output. Swapping might be eliminated by adding another gig of memory to the machine. This might alleviate the problem.

You can check which processes are using the most memory with the ps command.


ps aux|sort -n +3|tail -10
apache 27288 0.6 2.5 198840 25684 ? D 08:28 0:03 /usr/sbin/httpd
apache 27923 1.7 4.0 197796 41512 ? S 08:34 0:02 /usr/sbin/httpd
apache 27874 1.1 5.3 214640 54096 ? S 08:33 0:02 /usr/sbin/httpd
apache 27849 0.8 5.6 214648 57504 ? S 08:33 0:01 /usr/sbin/httpd
apache 27852 1.3 5.6 214892 57712 ? S 08:33 0:02 /usr/sbin/httpd
apache 27887 4.1 6.2 273356 63160 ? D 08:33 0:09 /usr/sbin/httpd
apache 27927 7.1 10.9 273356 111236 ? D 08:34 0:09 /usr/sbin/httpd
apache 27890 3.3 11.4 273608 116184 ? S 08:33 0:07 /usr/sbin/httpd
apache 27929 4.0 13.0 290040 133308 ? R 08:34 0:05 /usr/sbin/httpd
apache 27334 2.2 16.7 328072 170808 ? D 08:28 0:11 /usr/sbin/httpd


The above command sorts processes by percent of memory usage (4th col) and shows the ten biggest memory pigs. In this case, the apache httpd processes listed account for about 70% of available memory.

If you still experience high iowait time when the system is not swapping heavily you need to investigate which processes are doing the io. The following scrip puts together vmstat, block_dump, and ps, to show system activity and the largest io consumers during that activity.

bin/whowait
/etc/init.d/syslog stop
echo 1 > /proc/sys/vm/block_dump
uptime
vmstat 5 6
uptime
dmesg | egrep "READ|WRITE|dirtied" | egrep -o '([a-zA-Z]*)' | sort | uniq -c | sort -rn | head
echo 0 > /proc/sys/vm/block_dump
/etc/init.d/syslog start
echo Top memory hogs:
ps aux|sort -n +3|tail

Outputting:

Shutting down kernel logger: [ OK ]
Shutting down system logger: [ OK ]
09:41:06 up 5 days, 20:48, 1 user, load average: 3.90, 3.58, 2.72
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 2 815844 201800 2052 72560 632 831 1141 1802 64 42 18 5 45 32
0 1 815196 277960 2424 72784 1221 0 1298 396 452 624 6 5 1 89
0 0 815196 275816 2560 74004 3788 0 3931 5306 589 753 16 10 6 67
0 0 815196 275816 2560 74004 177 0 442 114 304 175 5 1 72 22
0 0 815196 275716 2564 74004 19 0 19 77 272 98 0 0 98 1
0 0 815196 275344 2608 74136 18 0 25 33 267 114 0 0 96 4
0 0 815196 275344 2616 74136 32 0 32 7 259 93 0 0 99 0
09:41:32 up 5 days, 20:48, 1 user, load average: 2.74, 3.33, 2.66
1074 httpd
782 named
567 kjournald
177 pdflush
130 qmail
130 mysqld
45 java
9 miniserv
6 executable
4 bash
Starting system logger: [ OK ]
Starting kernel logger: [ OK ]
Top memory hogs:
apache 31245 1.9 2.7 185932 28100 ? S 09:22 0:21 /usr/sbin/httpd
apache 32073 0.4 3.1 190620 32364 ? S 09:36 0:01 /usr/sbin/httpd
apache 32106 1.3 3.1 190608 32176 ? S 09:37 0:03 /usr/sbin/httpd
apache 30959 2.4 3.2 190644 32816 ? S 09:20 0:31 /usr/sbin/httpd
apache 31225 3.0 3.2 191024 33144 ? S 09:22 0:34 /usr/sbin/httpd
apache 30955 1.8 4.0 198840 40964 ? S 09:20 0:24 /usr/sbin/httpd
apache 31982 4.0 5.1 210604 52704 ? S 09:35 0:13 /usr/sbin/httpd
apache 31983 3.1 5.3 212300 54220 ? S 09:35 0:10 /usr/sbin/httpd
apache 31338 1.7 10.8 269196 110512 ? S 09:23 0:19 /usr/sbin/httpd
apache 31924 1.9 10.9 269176 111220 ? S 09:32 0:10 /usr/sbin/httpd

Here the load is not too high and a sample during higher load would be more meaningful.

We may also want to see exactly what files are opened by blocked processes as follows.

for i in `ps -eo pid,user,wchan=WIDE-WCHAN-COLUMN -o s,cmd | awk '$4 ~ /D/ {print $1}'`; do echo "----------"$i"----------"; lsof -p $i; done|more

In the above system the issue to be resolved turn out to be the longevity and memory use of apache processes. This can be achieved examining the apache logs. If you want to see what is accessed recently and have many separate log files to check, the following command is useful substituting the wildcard path to all the access log or error logs on your server.

tail -1 `ls -tr /home/httpd/vhosts/*/statistics/logs/access_log|tail -5`

==> /home/httpd/vhosts/ippsa.org/statistics/logs/access_log <== 68.254.73.3 - - [24/Mar/2009:13:07:57 -0400] "GET /html/index.php?name=News&catid=&topic=12 HTTP/1.1" 200 11618 "http://www.ippsa.org/html/modules.php?op=modload&name=Topics&file=index" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
==> /home/httpd/vhosts/eies.org/statistics/logs/access_log <== 74.247.231.3 - - [24/Mar/2009:13:10:21 -0400] "GET /cgi-bin/ip/herrmann.eies.org HTTP/1.1" 200 184 "-" "curl/7.12.1 (i686-redhat-linux-gnu) libcurl/7.12.1 OpenSSL/0.9.7a zlib/1.2.1.2 libidn/0.5.6"
==> /home/httpd/vhosts/wikiworld.com/statistics/logs/access_log <== 68.151.117.181 - - [24/Mar/2009:13:18:25 -0400] "GET /wiki/index.php/ClassicalLogic HTTP/1.1" 200 3497 "http://www.google.ca/search?hl=en&client=firefox-a&rls=org.mozilla%3Aen-GB%3Aofficial&hs=2X&q=platonic+reasoning&btnG=Search&meta=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)"
==> /home/httpd/vhosts/waldorfwithoutwalls.com/statistics/logs/access_log <== 24.5.48.111 - - [24/Mar/2009:13:18:26 -0400] "GET /i/h1articles.gif HTTP/1.1" 200 3085 "http://www.waldorfwithoutwalls.com/articles/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
==> /home/httpd/vhosts/whitescarver.com/statistics/logs/access_log <== 64.66.192.62 - - [24/Mar/2009:13:18:23 -0400] "POST /gallery/Family-Photo-Phun/IMG_1486?full=0 HTTP/1.1" 200 708383 "http://www.whitescarver.com/gallery/Family-Photo-Phun/IMG_1486" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"


This shows the last entry in the five most recently updated logs and can easily be adjusted to show more or less. After the fact you can examine each log for the time the problem ocurred to see what was accessed.

One problem that needed to be corrected was spiders (web robots) accessing time consuming functions that need not be indexed
and should not be spidered. Entries were added to the robots.txt file in the document root of site to exclude most of the robots from these pages.

It may not be clear which log entries are problematical. To find log entries for specific processes the process id must be recorded in the log. You will most likely need to add %P to the LogFormat in your apache configuration, /etc/httpd/conf/httpd.conf etc.

LogFormat "%h %l %u %t \"%r\" %>s %b %P" common

To be continued...

1 comment: