Wednesday, September 5, 2007

Apache Harmony scalability analysis

I’ve spent a while experimenting with Harmony scalability. There was a discussion on the dev mailing list about the quality of Harmony’s Thread Manager (TM), so I decided to do a simple analysis of current TM behavior. The idea was to compare Harmony with other Java 1.5 implementations (Sun JRE 1.5 and JRockit JRE 1.5) on multithreaded workloads with different contention levels.

I started with a simple benchmark in which several threads operate on a single HashMap object, each getting or updating elements from a generated random sequence. Since the standard HashMap implementation provides no synchronization, I used Collections.synchronizedMap(Map m) to make a synchronized HashMap. Of course this is not the best way of doing things in parallel, but this approach emulates a very high contention level: only a single thread can operate on the object at any moment, regardless of the operation type. For the second benchmark I used the ConcurrentHashMap class, which has internal synchronization mechanisms. In this case the contention level can be varied by changing the operation type (read or update). For reads there are no conflicts and scalability should be ideal; for updates the contention level depends on the class implementation (hashing quality, number of groups used, etc.) and on the sequence of elements being updated.
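The benchmark source is not included in this post, so the following is only a minimal sketch of the kind of workload described above. The class name, key space size, operation counts and thread counts are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the workload; all names and sizes are assumptions.
class MapContentionSketch {
    static final int KEY_SPACE = 1 << 16; // assumed number of distinct keys

    // Pre-populates the map so that reads hit existing entries.
    static void prefill(Map<Integer, Integer> map) {
        for (int k = 0; k < KEY_SPACE; k++) {
            map.put(k, k);
        }
    }

    // Runs opsPerThread gets (or puts, if update is true) in each thread
    // and returns the elapsed wall-clock time in nanoseconds.
    static long runWorkload(final Map<Integer, Integer> map, int threads,
                            final int opsPerThread, final boolean update) {
        final CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final long seed = t; // each thread gets its own random key sequence
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    Random rnd = new Random(seed);
                    try {
                        start.await(); // release all workers at the same moment
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    for (int i = 0; i < opsPerThread; i++) {
                        Integer key = rnd.nextInt(KEY_SPACE);
                        if (update) {
                            map.put(key, i);
                        } else {
                            map.get(key);
                        }
                    }
                }
            });
            workers[t].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> syncMap =
                Collections.synchronizedMap(new HashMap<Integer, Integer>());
        Map<Integer, Integer> concMap = new ConcurrentHashMap<Integer, Integer>();
        prefill(syncMap);
        prefill(concMap);
        for (int threads = 1; threads <= 16; threads *= 2) {
            long syncNs = runWorkload(syncMap, threads, 1000000, false);
            long concNs = runWorkload(concMap, threads, 1000000, false);
            System.out.println(threads + " threads: synchronizedMap "
                    + syncNs / 1000000 + " ms, ConcurrentHashMap "
                    + concNs / 1000000 + " ms");
        }
    }
}
```

In this sketch every thread performs the same number of operations, so total work grows with the thread count and the interesting figure is how the elapsed time changes as threads are added. Passing true as the last argument of runWorkload switches the workload from reads to updates.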

The following environment was used for the benchmarking:

Hardware: Dual processor Quad Core Xeon® 5355 (8 cores total) 2.67 GHz, 4 GB of RAM

Software: Windows Server 2003 OS, Harmony r571439, SUN JRE build 1.5.0_06-b05, JRockit JRE build R27.1.0-109-73164-1.5.0_08-20061129-1428-windows-ia32

Java execution options: all the benchmarks were run with the -server -Xms900m -Xmx900m options.

Synchronized HashMap

The results for the synchronized HashMap showed that SUN JRE has a large thread management overhead (see Chart 1 and Chart 2), while Harmony showed a very good result. As you can see, Harmony has a bigger initial thread management overhead than JRockit (i.e. the overhead of switching from single-threaded execution to 2 threads), but after that it performs just fine, with almost no additional overhead, which is very good when a large number of threads is used. Chart 1 and Chart 2 look similar because, as mentioned above, with this synchronization model the operation type makes no big difference.

Chart 1: Synchronized HashMap: 100% reads

Chart 2: Synchronized HashMap: 100% updates

To check that the observed difference is due to the VM implementation, I ran an additional experiment for SUN and JRockit with Harmony’s java.util package implementation (except the Vector class, since it causes an Internal Error in SUN’s VM). Charts 3 and 4 show that using a different implementation of the classes doesn’t change the picture for the synchronized HashMap with read operations (for updates the situation is exactly the same).

Chart 3: Synchronized HashMap: 100% reads

Chart 4: Synchronized HashMap: 100% reads

To find the hotspots of the different implementations I performed VTune sampling of synchronized HashMap get operations in 16 threads. Looking at Table 1 and Table 2 we can make some assumptions about the behavior. First, there is a large difference in the number of instructions retired across implementations, so we can guess that SUN JRE spends much more time waiting than JRockit and Harmony (and this was confirmed by the CPU usage level, which was much lower for SUN). Another interesting observation is that JRockit spends more than half of its time (and retires more than half of its instructions) in the Other32 module, which is actually JITed code. So our second assumption is that JRockit has some synchronization mechanisms inside the JITed code, which could be an area for deeper analysis in Harmony.

Table 1: Synchronized HashMap clockticks breakdown

Time (%)

Module          SUN JRE 1.5   JRockit JRE 1.5   Harmony
ntkrnlpa.exe    60.85         31.38             18.11
hal.dll         11.72         5.64              3.56
jvm.dll         8.31          1.67              -
intelppm.sys    7.22          3.22              1.98
Other32         6.98          53.92             4.07
KERNEL32.dll    2.50          1.67              -
ntdll.dll       2.42          2.51              55.56
harmonyvm.dll   -             -                 4.29
HYTHR.dll       -             -                 12.43

Table 2: Synchronized HashMap instructions retired breakdown

Instructions retired

Module          SUN JRE 1.5   JRockit JRE 1.5   Harmony
ntkrnlpa.exe    5.56E+09      5.05E+09          4.17E+09
hal.dll         1.43E+09      9.44E+08          1.23E+09
jvm.dll         2.50E+09      5.15E+08          -
intelppm.sys    4.99E+08      3.49E+08          3.95E+08
Other32         8.64E+08      1.96E+10          4.73E+09
KERNEL32.dll    5.12E+08      6.69E+08          -
ntdll.dll       5.12E+08      9.28E+08          1.40E+10
harmonyvm.dll   -             -                 2.71E+09
HYTHR.dll       -             -                 6.56E+09
Total           1.19E+10      2.81E+10          3.38E+10
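
Table 2 gives absolute counts, so comparing modules across JREs is easier after converting them to a share of each JRE's total. The small helper below does that for a few rows; the class and method names are made up for illustration, and the inputs are copied from Table 2.

```java
// Illustrative helper (hypothetical names); inputs are taken from Table 2.
class InstructionShare {
    // Percentage of a JRE's retired instructions attributed to one module.
    static double sharePercent(double moduleInstructions, double totalInstructions) {
        return 100.0 * moduleInstructions / totalInstructions;
    }

    public static void main(String[] args) {
        System.out.printf("JRockit Other32:   %.1f%%%n", sharePercent(1.96e10, 2.81e10));
        System.out.printf("Harmony ntdll.dll: %.1f%%%n", sharePercent(1.40e10, 3.38e10));
        System.out.printf("Harmony HYTHR.dll: %.1f%%%n", sharePercent(6.56e9, 3.38e10));
        System.out.printf("SUN ntkrnlpa.exe:  %.1f%%%n", sharePercent(5.56e9, 1.19e10));
    }
}
```

For JRockit the Other32 share comes out at roughly 70%, which is consistent with the observation that it retires more than half of its instructions in JITed code.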

ConcurrentHashMap

Experiments with ConcurrentHashMap showed that for reads Harmony is 20-30% slower than SUN and JRockit (see Chart 5). Replacing the java.util.concurrent package in SUN and JRockit (note that java.util.concurrent.locks was not replaced, because it needs VM support) showed that the classlib implementation is not the cause (see Chart 6 and Chart 7): using Harmony’s classes gave a 10% benefit for SUN JRE and the same results for JRockit.

Chart 5: ConcurrentHashMap: 100% reads

Chart 6: ConcurrentHashMap: 100% reads

Chart 7: ConcurrentHashMap: 100% reads

Sampling data collected for execution in 16 threads showed that 100% of the time (and of the instructions retired) is spent in JITed code, so we can guess that Harmony’s code generator (CG) produces suboptimal code in this case. This could be caused by the lack of a loop versioning optimization in the Harmony JIT, which is currently being implemented; this is definitely an area for deeper investigation.
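
For readers unfamiliar with the term, loop versioning means emitting two copies of a loop behind a single up-front guard: a fast copy that can omit per-iteration checks when the guard proves them redundant, and a slow copy that keeps them. The sketch below shows the idea by hand at source level. It is only an illustration with made-up names and is not taken from Harmony's JIT, which would apply the transformation to compiled code.

```java
// Hand-written illustration only; a JIT performs this transformation on
// compiled code, and nothing here is taken from Harmony's implementation.
class LoopVersioningSketch {
    // Baseline: each a[i] access carries an implicit bounds check.
    static long sum(int[] a, int from, int to) {
        long total = 0;
        for (int i = from; i < to; i++) {
            total += a[i];
        }
        return total;
    }

    // Versioned form: one guard before the loop selects between two copies.
    static long sumVersioned(int[] a, int from, int to) {
        long total = 0;
        if (from >= 0 && to <= a.length) {
            // Fast version: the guard proves every index is in range, so a
            // compiler could emit this copy without per-iteration checks.
            for (int i = from; i < to; i++) {
                total += a[i];
            }
        } else {
            // Slow version: keeps the checks and throws exactly where the
            // original loop would.
            for (int i = from; i < to; i++) {
                total += a[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5};
        System.out.println(sumVersioned(data, 1, 4)); // sums 2 + 3 + 4
    }
}
```

At Java source level the two copies look identical; the difference only exists in the machine code a compiler is allowed to generate for each branch.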

Chart 8 shows that for ConcurrentHashMap with update operations the situation is quite similar: Harmony is 20% slower than SUN JRE and 10% slower than JRockit. Replacing the java.util.concurrent package showed that this could be due to the ConcurrentHashMap implementation (see Charts 9 and 10).

Chart 8: ConcurrentHashMap: 100% updates

Chart 9: ConcurrentHashMap: 100% updates

Chart 10: ConcurrentHashMap: 100% updates

Deeper analysis using VTune sampling (for 16 threads) also showed that 100% of the time (and of the instructions retired) was spent in JITed code; there is no additional synchronization overhead in the VM.

Conclusion

The experiments showed that Harmony’s synchronization mechanism is mature enough: in the high contention case it outperforms such well-known JREs as SUN and JRockit. In the low contention case there are some performance issues related to the ConcurrentHashMap implementation and JIT optimizations that need deeper investigation.


Comments:

Egor Pasko said...

Did you experiment with SUN JRE 1.6?

Yuri Dolgov said...

Not yet. I just wanted to compare against Java 5.0 implementations, but I think it would be quite interesting to find out whether the situation has changed in SUN JRE 1.6.

Maksim Ananjev said...

Isn't it a bit premature to do this comparison? Maybe Harmony runs faster in certain cases because it does not fully implement the spec?

Yuri Dolgov said...

Thanks for the question. As you might know, we have an issue with acquiring the JCK from SUN for the Harmony project, so I can't guarantee that we are 100% compliant, but we did our best to implement according to the spec, and I hope the comparison does make sense. We are looking forward to SUN letting us use the JCK, so that we can be sure our implementation is 100% Java 1.5 compliant.
