Troubleshooting Stuck Processes on Linux

By Douglas Leonard – Sr. Software Engineer – ADP Cobalt Inventory Services

If you have a batch job or process on Linux that doesn’t seem to be doing anything, it can be difficult to figure out where exactly it is stuck. You can run strace on the process to figure out if it is hung on a system call but that may not tell you as much as you would like. Lots of processes have filehandles, network sockets, pipes, or other connections open so you don’t always know what the bottleneck or blockage is. Here is an example of a process that was stuck doing nothing for a long period of time and how we found out what it was stuck on. Originally the problem was found because the logfile hadn’t been written to in 16+ hours. But what was wrong? Had it died? Was it stuck? Was it just failing to write out the log?

Get the process ID by querying the process table

ps -eo pid,args | grep garbage

Gives you the PID and the command run (filtered by grep to only get back what we care about.) Note: this will include the grep command sometimes. Excluded here due to irrelevance.

21129 perl /web/imageserver/bin/photo_garbage_collector.pl --mode=actor --noclean --throttle=233 --logfile=/web/imageserver/log/garbagecollector.type2.part2.log

The process is running. The way this process works it reads from a file, writes to a log file, does filesystem deletes against an NFS mounted filer, has a database connection, and is a child process of another job. Any one of those things could potentially be causing this job to run inordinately slowly.

Use strace on the process to see if it is doing any work.

strace -p 21129
Process 21129 attached - interrupt to quit
read(6,
Process 21129 detached

Nothing ever happened on this so I used Ctrl+c to break out of it. The process is definitely doing a whole lot of nothing from a system perspective.

Use netstat to see if the process has an open network socket.

netstat -p | grep 21129
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 batch_machine.env.company:45950 db.env.company:ncube-lm ESTABLISHED 21129/perl

This machine does have a network connection established on port 45950 to db.env.company… I could have gotten more information but I didn’t include enough commandline switches to netstat. Look up some more info with different switches.

netstat -an | grep 45950
tcp 0 0 11.63.7.63:45950 11.63.21.32:1521 ESTABLISHED

That’s a database connection to the oracle database. But is it the stuck read(6 we saw via strace? At this point we don’t know. In fact we don’t even know what else it has open that it could be reading.

Get all of the file descriptors for the process

ls -l /proc/21129/fd
total 0
lr-x------ 1 inv users 64 Sep 13 09:16 0 -> pipe:[4230561338]
l-wx------ 1 inv users 64 Sep 13 09:16 1 -> /web/imageserver/log/garbagecollector.stdout
l-wx------ 1 inv users 64 Sep 13 09:16 2 -> /web/imageserver/log/garbagecollector.stdout
lr-x------ 1 inv users 64 Sep 13 09:16 3 -> /web/imageserver/data/garbagecollector.1.txt
lrwx------ 1 inv users 64 Sep 13 09:16 5 -> socket:[4231677600]
lrwx------ 1 inv users 64 Sep 13 09:16 6 -> socket:[4231677602]

This tells us that it has 6 different file descriptors open and that the 6 it is reading from in the strace command is a socket. But is it the database connection socket or something else?

Convert the port number to hex so it can be looked up and correlated with the socket id.

perl -e 'printf("%X\n", 45950)'
B37E

Look up the corresponding tcp entry by that port in hex

awk '$2~/:B37E/ {print $0}' < /proc/net/tcp
548: 3D08400A:B37E 1E16400A:05F1 01 00000000:00000000 00:00000000 00000000 432 0 4231677602 1 ffff8103162e1980 231 40 13 3 -1

Notice the 4231677602 in that line? That’s the socket id. That tells us that the process which is waiting on read(6 is waiting on a read from the database. That gives us enough information to go talk to the database administrators and figure out why the database call is taking forever.

About collectivegenius
Everyone has a voice and great ideas come from anyone. At Cobalt, we call it the collective genius. When technical depth and passion meets market opportunity, the collective genius is bringing it’s best to the table and our customers win.

3 Responses to Troubleshooting Stuck Processes on Linux

  1. Joseph Huckaby says:

    Awesome info! The only other utility I can think of which may help further is “lsof” which lists open files and sockets for a process, but you actually covered that with netstat and /proc/PID/fd.

    I have a trick that is specific to Perl, but can be very useful… if you insert this at the top of your Perl script:

    use Enbugger::OnError ‘USR1’;

    You can jump into the Perl debugger at runtime, after the script has started, by sending the process a USR1 signal (i.e. kill -USR1 PID). It will then show you what line it is stuck at, you can generate a stack trace, etc.

    (Note: Enbugger isn’t part of the Perl core, but you can easily install it with “cpanm”).

    Great article!

    Like

  2. Philippe Lantin says:

    Great tips to get deeper insight when it’s sorely needed! So, why was the db call taking forever?

    Like

    • Douglas Leonard says:

      The DB call was a single row delete on a large table using the primary key of that table in the where clause. This query had been running just fine for 6+ years. The cost via explain plan was about as cheap as possible. However, a new table had been added recently that had a foreign key constraint to the large table’s primary key. This new table was also quite large and did not have an index on the column with the foreign key constraint. This meant every deletion on the original table caused a full table scan on the new table.

      Two lessons were learned from this;

      1: Adding something new can have unintended consequences to existing processes even though neither the original table or software had changed at all.
      2: Oracle’s explain plan is insufficient for calculating the actual cost of a delete when other tables have foreign key constraints against the table you are deleting data from.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: