Friday, September 21, 2007

VM errors handling.

Current post is dedicated to behavior of different implementations in extreme cases and particularly in case of stack overflow. JVM spec says (see http://java.sun.com/docs/books/jvms/second_edition/html/Concepts.doc.html#24870):

“The class Error and its standard subclasses are exceptions from which ordinary programs are not ordinarily expected to recover.”

but what VM behavior we can rely on that is the question. First of all let’s understand, that this is actually important. Everybody who writes some resource allocation or synchronization code could face with this problem. Let’s imagine that you have some code like this:




resource.acquire (); //Acquire some resources
//e.g. file handler or lock
Do some work here (which could include calls etc))
resource.release(); //Release resources



this code actually is not so good as might be. If some exception were thrown in main block the resources wouldn’t release which could lead to resource leakage or deadlock. To solve this let’s put resource releasing into the finally block.





try {
resource.acquire();
Do some work here...
} finally
resource.release();
}



Of course that doesn’t guarantee that resources are released, but it will be noticed in this case. To investigate the problem I’ve done some experiments on different implementations of Java 1.5 (Sun JRE 1.5 and JRockit JRE 1.5) and Apache Harmony (which is still haven’t got JCK to pass)

I started experiments with simple example:




public class SOETest1 {
public int i = 0;

public static void main (String[] args) {
SOETest1 test = new SOETest1();
try {
test.foo();
} finally {
System.out.println("The final i value is: " + test.i);
}
}

public void foo() {
try {
i++;
foo();
} finally {
i--;
}
}
}


In this example we have infinite recursion, which apparently will produce StackOverflowError at some step. It was expected that variable i is equal to 0 at the end of the test, but the situation is actually differs for different VMs. In fact SUN JRE 1.5 and Harmony behave as expected whereas JRockit JRE 1.5 value of i was 1025. It seems that in case of stack overflow JRockit silently unwinds a stack frame and thus skips execution for number of finally blocks. Definitely this is not the best way of doing things since one can’t rely on even simplest actions in finally blocks are done.

Then I've tried more complex example with method call in finally block:





import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SOETest2 {
public static ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
public static void main (String[] args) {
try {
Thread[] t = new Thread[2];
for (int i = 0; i < t.length; i++) {
t[i] = new SOEThread();
t[i].start();
t[i].join();
}
} catch (Exception e) {
} finally {
System.out.println("The end!");
}
}

static class SOEThread extends Thread {
public int i = 0;
public int j = 0;
public void run() {
System.out.println("Thread: " + this + " started");
try {
foo();
} finally {
System.out.println("Thread: " + this + " finished, final [i,j] values are: [" + i + "," + j + "]");
}
}

public void foo() {
try {
i++;
SOETest2.rwl.writeLock().lock();
j++;
foo();
} finally {
i--;
SOETest2.rwl.writeLock().unlock();
j--;
}
}
}
}




In this example we have 2 threads which acquire lock on the same object, lock is released in finally block. There are two counters i and j; i allows to check that finally block was executed for each foo()and j shows how many times unlock() was successfully called. If lock is released fewer times than obtained then the test gets into the deadlock.



The results for this test are quite expected:

  • SUN JRE 1.5 returns [i, j] = [0, 1]. j = 1 is o.k., because VM just got another SOE while trying to call unlock(). Second thread is failed to finish due to deadlock (since unlock() was called fewer times than lock()).


  • For JRockit JRE 1.5 [i, j] = [507, 507], this behavior conforms to the first case, JRockit again skips number of finally blocks (which produces deadlock situation) and then successfully executes method unlock().


  • Harmony returns [i, j] = [0, 53], throwing new SOE for the deepest unlock() calls. IMO this is the most correct behavior since finally block was executed exactly once for each foo() and new SOE was thrown for every unsuccessful call of unlock(). Anyway test still got into deadlock.


Deeper investigation showed that Harmony’s behavior could be improved. Such a huge number of failures during unlock() execution is due to the method compilation (method is compiled only at the first call). If compilation is accomplished earlier (it could be done by adding preliminary calls of unlock()) then Harmony works just fine. It returns [i, j] = [0, 0] and both threads are finished successfully. SUN’s and JRockit’s behavior hasn’t changed after such change.

This simple investigation shows that handling VM errors is not so straightforward. Different VMs have different and not always predictable behavior. JRockit JRE 1.5 in case of StackOverflowError could skip number of finally blocks which leads to unexpected results. SUN JRE 1.5 and Harmony showed predictable behavior, but still it doesn’t guarantee that calls in finally blocks will be successful, such cases must be handled in specific way.

Wednesday, September 5, 2007

Apache Harmony scalability analysis

I’ve spend a while experimenting with Harmony scalability. There was a discussion on the dev mail list about Harmony’s Thread Manager quality and I’ve decided to make a simple analysis of current TM behavior. The idea was to compare Harmony behavior on multithreaded workloads with different contention level with other Java 1.5 implementations (Sun JRE 1.5 and JRockit JRE 1.5).

I started with a simple benchmark where several threads operate (trying to get or update generated random sequence of elements) on a single HashMap object. As standard HashMap implementation doesn’t provide synchronization mechanism I used Collections.synchronizedMap(Map m) to make synchronized HashMap. Of course this is not the best way of doing things in parallel, but such approach emulates behavior with very high contention level (actually only single thread could operated with the object at the moment independently on operation type). For second benchmark I used ConcurrentHashMap class which has internal synchronizations mechanisms. In this case we could vary contention level by changing operations type (read or update). In case of reads there are no conflicts and scalability should be ideal, in case of update contention level depends on class implementation (hashing quality, number of groups used, etc) and consequence of elements to be updated.

The following environment was used for the benchmarking:

Hardware: Dual processor Quad Core Xeon® 5355 (8 Cores total) 2.67 GHz, 4 Gb of RAM

Software: Windows Server 2003 OS, Harmony r571439, SUN JRE build 1.5.0_06-b05, JRockit JRE build R27.1.0-109-73164-1.5.0_08-20061129-1428-windows-ia32

Java execution options: all the benchmarks were run with the -server -Xms900m -Xmx900m options.

Synchronized HashMap

Results of benchmark for synchronized version of HashMap showed that SUN JRE has large overhead on thread management (see Chart 1 and Chart 2). But Harmony showed very good result. As you can see it has bigger initial overhead on thread management than JRockit (i.e. overhead on switching form single threaded execution to 2 threads), but after that it performs just fine, with almost no additional overhead, which is very good in case of using of large number of threads. Chart 1 and Chart 2 show similar pictures as actually there is no big difference in operation used due to synchronization model as I previously mentioned.

Chart 1: Synchronized HashMap: 100% reads

Chart 2: Synchronized HashMap: 100% updates

To check that the difference observed is due to VM implementation I’ve made additional experiment for SUN and JRockit with Harmony’s java.util package implementation (except Vector class since it introduces Internal Error in SUN’s VM). Charts 3 and 4 shows that usage of different implementation of the classes doesn’t affect the situation for synchronized HashMap with read operations (for updates there is exactly the same situation).

Chart 3: Synchronized HashMap: 100% reads

Chart 4: Synchronized HashMap: 100% reads

To find out hotspots of different implementation I’ve performed VTune sampling for synchronized HashMap get operations in 16 threads. Looking at the Table 1 and Table 2 we could make some assumptions about the behaviors. At first we could notice the large difference in number of instructions retired for different implementation, so we could guess that SUN JRE spend much more time waiting than JRockit and Harmony (an this was confirmed by CPU usage level, which was much lower for SUN). Another interesting observation is that JRockit spends more than half of time (and retires more than half of instruction) in Other32 module which is actually JITed code. So our second assumption is that JRockit has some synchronization mechanisms inside the JITed code which could be actually area for deeper analysis in Harmony.

Table 1: Synchronized HashMap clockticks brakedown

Time (%%)
Module
SUN JRE 1.5
JRockit JRE 1.5
Harmony
ntkrnlpa.exe
60.85
31.38
18.11
hal.dll
11.72
5.64
3.56
jvm.dll
8.31
1.67

intelppm.sys
7.22
3.22
1.98
Other32
6.98
53.92
4.07
KERNEL32.dll
2.50
1.67

ntdll.dll
2.42
2.51
55.56
harmonyvm.dll


4.29
HYTHR.dll


12.43

Table 2: Synchronized HashMap instructions retired brakedown

Instructions retired
Module
SUN JRE 1.5
JRockit JRE 1.5
Harmony
ntkrnlpa.exe
5.56E+09
5.05E+09
4.17E+09
hal.dll
1.43E+09
9.44E+08
1.23E+09
jvm.dll
2.50E+09
5.15E+08

intelppm.sys
4.99E+08
3.49E+08
3.95E+08
Other32
8.64E+08
1.96E+10
4.73E+09
KERNEL32.dll
5.12E+08
6.69E+08

ntdll.dll
5.12E+08
9.28E+08
1.40E+10
harmonyvm.dll


2.71E+09
HYTHR.dll


6.56E+09
Total
1.19E+10
2.81E+10
3.38E+10

ConcurrentHashMap

Experiments with ConcurrentHashMap showed that in case of reads Harmony works 20-30% slower than SUN and JRockit (see Chart 5). Replacement of java.util.concurrent (note that java.util.concurrent.locks was not replaced, cause it needs VM support) package in SUN and JRockit showed that classlib implementation is not the case (see Chart 6, Chart 7), usage of Harmony’s classes gave 10% benefit for SUN JRE and the same results for JRockit.

Chart 5: ConcurrentHashMap: 100% reads

Chart 6: ConcurrentHashMap: 100% reads

Chart 7: ConcurrentHashMap: 100% reads

Collecting sampling data for execution in 16 threads showed that 100% of time (and instructions retired) is spent in JITed code, thus we could guess that Harmony CG generates not optimal code in this case. This could be caused by lack of loop versioning optimization in Harmony JIT which is currently under implementation and this is definitely area for deeper investigation.

On Chart 8 for ConcurrentHashMap with updates operations you can see that situation is pretty similar. Harmony works 20% slower than SUN JRE and 10% slower than JRockit. Replacement of java.util.concurrent package showed that it could be due to ConcurrentHashMap implementation (see Charts 9 and 10).

Chart 8: ConcurrentHashMap: 100% updates Chart 9: ConcurrentHashMap: 100% updatesChart 10: ConcurrentHashMap: 100% updates

Deeper analysis using VTune sampling (for 16 threads) also showed that 100% of time (and instructions retired) was spent in JITed code, there is no additional synchronization overhead in VM.

Conclusion

The conducted experiments proved that Harmony synchronizations mechanism is mature enough, for high contention case it outperforms such well-known JREs as SUN and JRockit. In case of low contention there are some performance issues concerned to implementation of ConcurrentHashMap and JIT optimizations that need deeper investigation.