/ Java Thread deadlock – Case Study ~ Java EE Support Patterns

3.20.2012

Java Thread deadlock – Case Study

This article will describe the complete root cause analysis of a recent Java deadlock problem observed from a Weblogic 11g production system running on the IBM JVM 1.6.

This case study will also demonstrate the importance of mastering Thread Dump analysis skills; including for the IBM JVM Thread Dump format.

Environment specifications

-        Java EE server: Oracle Weblogic Server 11g & Spring 2.0.5
-        OS: AIX 5.3
-        Java VM: IBM JRE 1.6.0
-        Platform type: Portal & ordering application

Monitoring and troubleshooting tools

-        JVM Thread Dump (IBM JVM format)
-        Compuware Server Vantage (Weblogic JMX monitoring & alerting)

Problem overview

A major stuck Threads problem was observed & reported from Compuware Server Vantage and affecting 2 of our Weblogic 11g production managed servers causing application impact and timeout conditions from our end users.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derived other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? MEDIUM (only 2 managed servers / JVM affected out of 16)
·        Recent change of the affected platform? Yes (new JMS related asynchronous component)
·        Any recent traffic increase to the affected platform? No
·        How does this problem manifest itself?  A sudden increase of Threads was observed leading to rapid Thread depletion
·        Did a Weblogic managed server restart resolve the problem? Yes, but problem is returning after few hours (unpredictable & intermittent pattern)

-        Conclusion #1: The problem is related to an intermittent stuck Threads behaviour affecting only a few Weblogic managed servers at the time
-        Conclusion #2: Since problem is intermittent, a global root cause such as a non-responsive downstream system is not likely

Thread Dump analysis – first pass

The first thing to do when dealing with stuck Thread problems is to generate a JVM Thread Dump. This is a golden rule regardless of your environment specifications & problem context. A JVM Thread Dump snapshot provides you with crucial information about the active Threads and what type of processing / tasks they are performing at that time.

Now back to our case study, an IBM JVM Thread Dump (javacore.xyz format) was generated which did reveal the following Java Thread deadlock condition below:

1LKDEADLOCK    Deadlock detected !!!
NULL           ---------------------
NULL          
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CC08B00)
3LKDEADLOCKWTR    is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x0000000126171DF8 infl_mon_t: 0x0000000126171E38:
4LKDEADLOCKOBJ      weblogic/jms/frontend/FESession@0x07000000198048C0/0x07000000198048D8:
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012E560500)
3LKDEADLOCKWTR    which is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x000000012884CD60 infl_mon_t: 0x000000012884CDA0:
4LKDEADLOCKOBJ      weblogic/jms/frontend/FEConnection@0x0700000019822F08/0x0700000019822F20:
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CC08B00)

This deadlock situation can be translated as per below:

-        Weblogic Thread #8 is waiting to acquire an Object monitor lock owned by Weblogic Thread #10
-        Weblogic Thread #10 is waiting to acquire an Object monitor lock owned by Weblogic Thread #8

Conclusion: both Weblogic Threads #8 & #10 are waiting on each other; forever!

Now before going any deeper in this root cause analysis, let me provide you a high level overview on Java Thread deadlocks.

Java Thread deadlock overview

Most of you are probably familiar with Java Thread deadlock principles but did you really experience a true deadlock problem?

From my experience, true Java deadlocks are rare and I have only seen ~5 occurrences over the last 10 years. The reason is that most stuck Threads related problems are due to Thread hanging conditions (waiting on remote IO call etc.) but not involved in a true deadlock condition with other Thread(s).

A Java Thread deadlock is a situation for example where Thread A is waiting to acquire an Object monitor lock held by Thread B which is itself waiting to acquire an Object monitor lock held by Thread A. Both these Threads will wait for each other forever. This situation can be visualized as per below diagram:


Thread deadlock is confirmed…now what can you do?

Once the deadlock is confirmed (most JVM Thread Dump implementations will highlight it for you), the next step is to perform a deeper dive analysis by reviewing each Thread involved in the deadlock situation along with their current task & wait condition.

Find below the partial Thread Stack Trace from our problem case for each Thread involved in the deadlock condition:

** Please note that the real application Java package name was renamed for confidentiality purposes **

Weblogic Thread #8

"[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000012CC08B00, j9thread_t:0x00000001299E5100, java/lang/Thread:0x070000001D72EE00, state:B, prio=1
(native thread ID:0x111200F, native priority:0x1, native policy:UNKNOWN)
Java callstack:
       at weblogic/jms/frontend/FEConnection.stop(FEConnection.java:671(Compiled Code))
       at weblogic/jms/frontend/FEConnection.invoke(FEConnection.java:1685(Compiled Code))
       at weblogic/messaging/dispatcher/Request.wrappedFiniteStateMachine(Request.java:961(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.syncRequest(DispatcherImpl.java:184(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.dispatchSync(DispatcherImpl.java:212(Compiled Code))
       at weblogic/jms/dispatcher/DispatcherAdapter.dispatchSync(DispatcherAdapter.java:43(Compiled Code))
       at weblogic/jms/client/JMSConnection.stop(JMSConnection.java:863(Compiled Code))
       at weblogic/jms/client/WLConnectionImpl.stop(WLConnectionImpl.java:843)
       at org/springframework/jms/connection/SingleConnectionFactory.closeConnection(SingleConnectionFactory.java:342)
       at org/springframework/jms/connection/SingleConnectionFactory.resetConnection(SingleConnectionFactory.java:296)
       at org/app/JMSReceiver.receive()
……………………………………………………………………

Weblogic Thread #10

"[STUCK] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000012E560500, j9thread_t:0x000000012E35BCE0, java/lang/Thread:0x070000001ECA9200, state:B, prio=1
 (native thread ID:0x4FA027, native priority:0x1, native policy:UNKNOWN)
Java callstack:
       at weblogic/jms/frontend/FEConnection.getPeerVersion(FEConnection.java:1381(Compiled Code))
       at weblogic/jms/frontend/FESession.setUpBackEndSession(FESession.java:755(Compiled Code))
       at weblogic/jms/frontend/FESession.consumerCreate(FESession.java:1025(Compiled Code))
       at weblogic/jms/frontend/FESession.invoke(FESession.java:2995(Compiled Code))
       at weblogic/messaging/dispatcher/Request.wrappedFiniteStateMachine(Request.java:961(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.syncRequest(DispatcherImpl.java:184(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.dispatchSync(DispatcherImpl.java:212(Compiled Code))
       at weblogic/jms/dispatcher/DispatcherAdapter.dispatchSync(DispatcherAdapter.java:43(Compiled Code))
       at weblogic/jms/client/JMSSession.consumerCreate(JMSSession.java:2982(Compiled Code))
       at weblogic/jms/client/JMSSession.setupConsumer(JMSSession.java:2749(Compiled Code))
       at weblogic/jms/client/JMSSession.createConsumer(JMSSession.java:2691(Compiled Code))
       at weblogic/jms/client/JMSSession.createReceiver(JMSSession.java:2596(Compiled Code))
       at weblogic/jms/client/WLSessionImpl.createReceiver(WLSessionImpl.java:991(Compiled Code))
       at org/springframework/jms/core/JmsTemplate102.createConsumer(JmsTemplate102.java:204(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.doReceive(JmsTemplate.java:676(Compiled Code))
       at org/springframework/jms/core/JmsTemplate$10.doInJms(JmsTemplate.java:652(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.execute(JmsTemplate.java:412(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.receiveSelected(JmsTemplate.java:650(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.receiveSelected(JmsTemplate.java:641(Compiled Code))
       at org/app/JMSReceiver.receive()
……………………………………………………………

As you can see in the above Thread Strack Traces, such deadlock did originate from our application code which is using the Spring framework API for the JMS consumer implementation (very useful when not using MDB’s). The Stack Traces are quite interesting and revealing that both Threads are in a race condition against the same Weblogic JMS consumer session / connection and leading to a deadlock situation:

-        Weblogic Thread #8 is attempting to reset and close the current JMS connection
-        Weblogic Thread #10 is attempting to use the same JMS Connection / Session in order to create a new JMS consumer
-        Thread deadlock is triggered!

Root cause: non Thread safe Spring JMS SingleConnectionFactory implementation

A code review and a quick research from Spring JIRA bug database did reveal the following Thread safe defect below with a perfect correlation with the above analysis:

# SingleConnectionFactory's resetConnection is causing deadlocks with underlying OracleAQ's JMS connection

A patch for Spring SingleConnectionFactory was released back in 2009 which did involve adding proper synchronized{} block in order to prevent Thread deadlock in the event of a JMS Connection reset operation:

synchronized (connectionMonitor) {
  //if condition added to avoid possible deadlocks when trying to reset the target connection
  if (!started) {
    this.target.start();
    started = true;
  }
}

Solution

Our team is currently planning to integrate this Spring patch in to our production environment shortly. The initial tests performed in our test environment are positive.

Conclusion

I hope this case study has helped understand a real-life Java Thread deadlock problem and how proper Thread Dump analysis skills can allow you to quickly pinpoint the root cause of stuck Thread related problems at the code level. Please don’t hesitate to post any comment or question.

0 comments:

Post a Comment