An Adventure in strace

Troubleshooting bugs is not always as simple as reading an error log. Today I ended up chasing a relatively simple bug end to end through the stack, and, like all bugs, it was obvious and trivial to fix once the logic behind the issue was understood.

I started off my day thinking I'd deploy some code to a staging environment. Nothing too exciting about that; after all, yesterday I had deployed some code, left a long-running process going overnight, and gone home. In the morning I checked that my import had finished and got to work.

After updating the source code and deploying to the environment I ran into a wall. Specifically, a 504 Gateway Timeout. Confused about the timeout, I SSHed to the server and checked the log files. This particular server ran nginx proxying back to an Apache instance. Looking through the log files I saw a single error message:

client closed connection while waiting for request, client: xx.xxx.xxx.xx:8080, server: 0.0.0.0:80

Mildly confused, I tested the Apache server behind the proxy via curl and found it returning the expected 401. Then I authenticated with the --user flag, and everything hung. No timeout (even after 5 minutes), nothing in the log files for Apache. Nothing.

Still puzzled, I started troubleshooting my nginx configuration, fiddling with parts of it to make sure that each server block was proxying back to the configured instance or application. Every virtual host in Apache and every server block in nginx worked except for the one I was testing.

After using httpd -S to make sure that my default virtual host wasn't misconfigured, I quickly wrote a test script.

<?php echo "Yeah, I'm working..."; ?>

And on testing it I saw that the server itself was working:

curl --user user:password localhost:8080/test.php
#Yeah, I'm working...

"So the error must be in the application," I thought. I bumped up the logging in curl via the --verbose flag to see if there was an infinite redirect going on, but that proved fruitless. Checking the application's error logs showed nothing, so I figured there might be an infinite loop or something of that nature. So I dug out the trusty tool strace.

strace allows you to view the system calls of a running application by attaching to it. To use it, you typically run ps or pidof to determine the process ID, then pass that ID to the -p flag. When it comes to debugging a daemonized application such as httpd or mysqld, you'll also need the -f flag to follow any forked child processes.
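If you'd rather not copy PIDs by hand, the list of -p flags can be built from pidof. Here is a minimal sketch (the helper name is mine; the actual strace invocation is left commented out, since attaching requires root and a running apache2):

```shell
# Build "-f -p PID -p PID ..." strace arguments from a list of PIDs.
strace_args() {
    args="-f"
    for pid in "$@"; do
        args="$args -p $pid"
    done
    printf '%s\n' "$args"
}

# On a live box, as root: strace $(strace_args $(pidof apache2))
strace_args 2742 16117 16118   # -> -f -p 2742 -p 16117 -p 16118
```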

So if the output of ps aux | grep apache looks like this:

root      2742  0.0  0.1 420428  9508 ?        Ss   Apr21   0:12 /usr/sbin/apache2 -k start
www-data 16117  0.0  0.0 420516  7288 ?        S    Apr22   0:00 /usr/sbin/apache2 -k start
www-data 16118  0.0  0.0 420516  7288 ?        S    Apr22   0:00 /usr/sbin/apache2 -k start
www-data 16119  0.0  0.0 420484  7052 ?        S    Apr22   0:00 /usr/sbin/apache2 -k start
www-data 16120  0.0  0.0 420484  7052 ?        S    Apr22   0:00 /usr/sbin/apache2 -k start
www-data 16121  0.0  0.0 420484  7052 ?        S    Apr22   0:00 /usr/sbin/apache2 -k start

Then you would use strace -f -p 2742 to monitor the parent and any additional processes it creates. Or, to be sure, you could take advantage of the fact that the -p flag can be given multiple times and attach to the already-forked workers like so:

strace -f -p 2742 -p 16117 -p 16118 -p 16119 -p 16120 -p 16121

And you'll see an output something like this:

Process 2742 attached - interrupt to quit
Process 16117 attached - interrupt to quit
Process 16118 attached - interrupt to quit
Process 16119 attached - interrupt to quit
Process 16120 attached - interrupt to quit
Process 16121 attached - interrupt to quit
[pid  2742] select(0, NULL, NULL, NULL, {0, 574545} <unfinished ...>
[pid 16120] semop(131072, {{0, -1, SEM_UNDO}}, 1 <unfinished ...>
[pid 16121] semop(131072, {{0, -1, SEM_UNDO}}, 1 <unfinished ...>
[pid 16119] semop(131072, {{0, -1, SEM_UNDO}}, 1 <unfinished ...>
[pid 16118] semop(131072, {{0, -1, SEM_UNDO}}, 1 <unfinished ...>
[pid 16117] epoll_wait(47,  <unfinished ...>
[pid  2742] <... select resumed> )      = 0 (Timeout)
[pid  2742] wait4(-1, 0x7fff8200f2ac, WNOHANG|WSTOPPED, NULL) = 0
[pid  2742] select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
[pid  2742] wait4(-1, 0x7fff8200f2ac, WNOHANG|WSTOPPED, NULL) = 0
[pid  2742] select(0, NULL, NULL, NULL, {1, 0}^C <unfinished ...>
Process 2742 detached
Process 16117 detached
Process 16118 detached
Process 16119 detached
Process 16120 detached
Process 16121 detached

Of course, this doesn't get more interesting than your regular event-polling loop until the server responds to a request or is asked to execute application code. In my case, the first breadcrumb was a connection to an IP address timing out:

[pid   904] connect(32, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("192.168.100.194")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid   904] fcntl(32, F_SETFL, O_RDWR)  = 0
[pid   904] poll([{fd=32, events=POLLIN|POLLPRI}], 1, 60000 <unfinished ...>
...
[pid   904] setsockopt(32, SOL_SOCKET, SO_SNDTIMEO, "\2003\341\1\0\0\0\0\0\0\0\0\0\0\0\0", 16 <unfinished ...>
...
[pid   904] <... read resumed> 0x7f2bfe23c8d0, 16384) = -1 ETIMEDOUT (Connection timed out)

Anyone who's ever dealt with the C networking APIs knows that the inet_addr function converts a human-readable IP into its machine-readable form, which the connect call then uses to actually open the connection. We're lucky in this case, as we can very easily see the IP address the application is attempting to connect to.
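As a rough illustration of that "machine-readable form" (this is just the same arithmetic, not the real C call): inet_addr packs the four octets into a single 32-bit value.

```shell
# Pack a dotted-quad IP into the 32-bit integer that inet_addr() would
# produce, shown here as the big-endian value. Illustrative only.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1           # split "a.b.c.d" into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

ip_to_int 192.168.100.194   # -> 3232261314
```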

Any IP address that looks like 192.168.x.x is a private (RFC 1918) address, which means it's internal to the local network. In this case, the address couldn't be reached. Why? Simple. The IP address belonged to the production environment and not the staging one.
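For reference, the RFC 1918 private ranges are 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. A quick (and deliberately loose) shell check, good enough for eyeballing strace output:

```shell
# Print "private" for an RFC 1918 address, "public" otherwise.
# A loose pattern match, not a full validator.
ip_scope() {
    case "$1" in
        10.*)                                   echo private ;;
        192.168.*)                              echo private ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*)  echo private ;;
        *)                                      echo public ;;
    esac
}

ip_scope 192.168.100.194   # -> private
ip_scope 8.8.8.8           # -> public
```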

Of course, I only realized this after I had run off to look at another system first. If you note that the call to htons is using port 3306, it should be clear which service was being contacted. If you're not used to web development involving MySQL, that's OK: 3306 is the standard MySQL port.

While strace-ing the MySQL process I saw a flurry of activity when I loaded the application (a Drupal 6 installation), and as the browser kept spinning, I saw an insert statement fly by. Noting that it was a logging query from the watchdog function, I checked out the table in the database:

SELECT * FROM watchdog ORDER BY timestamp DESC LIMIT 20\G

*************************** 1. row ***************************
      wid: 188148200
      uid: 0
     type: php
  message: %message in %file on line %line.
variables: a:4:{s:6:"%error";s:7:"warning";s:8:"%message";s:67:"mysql_close(): supplied argument is not a valid MySQL-Link resource";s:5:"%file";s:67:"/var/www/sites/all/modules/cdn/cdn.advanced.inc";s:5:"%line";i:82;}
 severity: 3
     link:
 location: http://staging.example.com/
  referer:
 hostname: 127.0.0.1
timestamp: 1429902803
...

Aha! As strace had indicated, the code was failing to connect to a MySQL database somewhere. And according to the watchdog error, the failure was coming from the cdn module. After grepping the code base, I found the variables used to configure the module and noted that the same mysterious IP address appeared there as well.

At that point the gears clicked: that IP belonged to the file conveyor process that migrates images uploaded by the site admins over to the CDN. That process doesn't run on the staging environment I was troubleshooting; its configuration had slipped in with the database import I had left running overnight.

So the fix was pretty simple: disable the CDN module. However, due to... circumstances, drush wasn't available to me in the environment I was in. This meant I had to toggle the module the old-fashioned way, via SQL.

update system set status = 0 where name = 'cdn';

This was enough to get most of the application up and running, and I was able to see a header appear; however, an error kept showing up in my logs until I reset some of the CDN's values in the cache:

update cache set data = replace(data, '"cdn_status";s:1:"1";','"cdn_status";s:1:"0";') where data like '%cdn_status%';
update variable set value = 's:1:"0";' where name like 'cdn_status';
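The odd-looking s:1:"0"; tokens are PHP serialization: serialize() encodes a string as s:<length>:"<value>";, and Drupal stores variable values in that form both in the variable table and inside the cached blob, which is why the SQL above does string surgery on those exact tokens. A sketch that builds the token (the helper name is mine):

```shell
# Build the PHP-serialized form of a string: s:<len>:"<value>";
# This is the encoding Drupal uses for stored variable values.
php_serialize_str() {
    printf 's:%d:"%s";' "${#1}" "$1"
}

php_serialize_str 0    # -> s:1:"0";
```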

After that the application loaded, and, after a few hours of banging my head against server configurations, system logs, and network scans, I was able to verify that the minute changes I had deployed were working.

The takeaway?

When your regular application logs fail, tools like strace or sysdig can really come in handy.