Problem
I've been dealing with a problem in one of my projects for quite some time. I think this is the worst kind of problem: it only happens in production, it occurs randomly, and it doesn't leave any traces in either PHP logs or server logs. When the problem occurs, there is a terribly long page load that ends in a timeout.
The project is quite old with a large technology debt. No framework was used, so it's very difficult to find common problems or get community advice.
The client won't invest a lot of money in this project, but it's still profitable, so he is trying to get every task done at a low cost.
Therefore, there is no opportunity to purchase a New Relic license or a similar paid profiling service. So, how can we trace low-level problems with the tools that we already have?
Strace to the rescue
Strace is the Linux tool for tracing system processes. It can be used to debug low-level interactions between them.
The basic usage command is:
$ strace -p 26380
strace: Process 26380 attached
...
When the failure occurs again, the only thing left to do is execute a command that will connect the Strace to every php-fpm sub-process:
sudo strace -f $(pgrep php-fpm | sed 's/\([0-9]*\)/\-p \1/g') -o ~/stracelogs/log5 -t
-f trace child processes
-o save result to the file
-t add timestamps to each line
As you can see, we use a combination of phrep and sed to find all child php-fpm processes to attach to.
Logs may appear to be complex, but we can sometimes find interesting information in them. I have found this:
...
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
connect(6, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("xxx.xxx.xx.xx")}, 16) = 0
poll([{fd=6, events=POLLOUT}], 1, 0) = 1 ([{fd=6, revents=POLLOUT}])
sendmmsg(6, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="&\5\1\0\0\1\0\0\0\0\0\0\6portal\7interia\2pl\0\0"..., iov_len=35}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=35}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\36\r\1\0\0\1\0\0\0\0\0\0\6portal\7interia\2pl\0\0"..., iov_len=35}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=35}], 2, MSG_NOSIGNAL) = 2
poll([{fd=6, events=POLLIN}], 1, 5000) = 1 ([{fd=6, revents=POLLIN}])
ioctl(6, FIONREAD, [156]) = 0
recvfrom(6, "&\5\201\200\0\1\0\1\0\3\0\3\6portal\7interia\2pl\0\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("xxx.xxx.xx.xx")}, [28->16]) = 156
poll([{fd=6, events=POLLIN}], 1, 4999) = 1 ([{fd=6, revents=POLLIN}])
ioctl(6, FIONREAD, [93]) = 0
recvfrom(6, "\36\r\201\200\0\1\0\0\0\1\0\0\6portal\7interia\2pl\0\0"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("xxx.xxx.xx.xx")}, [28->16]) = 93
close(6) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
fcntl(6, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("xxx.xx.xx.xxx")}, 16) = -1 EINPROGRESS (Die Operation ist jetzt in Bearbeitung)
poll([{fd=6, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 60000
...
And there, the process got stuck.
Additional it was unable to connect to the ip xxx.xx.xx.xxx by CURL.
The page went online again when this IP responded.
After a little workaround, I found an issue in the code. The reason for the crush was the file_get_contents() function that tried to load the URL directly and put it in HTML's header.
Conclusion
Strace is a powerful tool. Sometimes, even basic knowledge of it is enough to find the cause of a problem. This is especially true when we don't have the budget for specialized, dedicated debugging software.
Top comments (0)