I’ve spent a while experimenting with Harmony scalability. There was a discussion on the dev mailing list about the quality of Harmony’s Thread Manager (TM), so I decided to do a simple analysis of current TM behavior. The idea was to compare Harmony with other Java 1.5 implementations (Sun JRE 1.5 and JRockit JRE 1.5) on multithreaded workloads with different contention levels.
I started with a simple benchmark in which several threads operate on a single HashMap object, each getting or updating elements chosen by a generated random sequence. Since the standard HashMap implementation provides no synchronization, I used Collections.synchronizedMap(Map m) to make a synchronized HashMap. Of course this is not the best way of doing things in parallel, but it emulates a very high contention level: only a single thread can operate on the object at any moment, regardless of the operation type. For the second benchmark I used the ConcurrentHashMap class, which has internal synchronization mechanisms. Here the contention level can be varied by changing the operation type (read or update). For reads there are no conflicts and scalability should be ideal; for updates the contention level depends on the class implementation (hashing quality, number of groups used, etc.) and on the sequence of elements being updated.
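The original benchmark source is not shown here, so the following is only a minimal sketch of the kind of harness described above. The class name, key space and operation counts are assumptions made for illustration. Each worker thread walks a pre-generated random key sequence and either reads or updates one shared map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Illustrative harness only: class name, key space and operation counts are assumptions.
class MapContentionSketch {
    static final int KEY_SPACE = 100000; // assumed number of distinct keys

    // Fills the map so that every later read hits an existing entry.
    static Map<Integer, Integer> fill(Map<Integer, Integer> map) {
        for (int i = 0; i < KEY_SPACE; i++) {
            map.put(i, i);
        }
        return map;
    }

    // Starts `threads` workers that each perform `opsPerThread` operations on the
    // shared map (put when `update` is true, get otherwise).
    // Returns the elapsed wall-clock time in nanoseconds.
    static long measure(final Map<Integer, Integer> map, int threads,
                        final int opsPerThread, final boolean update) {
        final CountDownLatch start = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            // Generate the key sequence up front so Random is not part of the timed loop.
            final int[] keys = new int[opsPerThread];
            Random rnd = new Random(t);
            for (int i = 0; i < opsPerThread; i++) {
                keys[i] = rnd.nextInt(KEY_SPACE);
            }
            new Thread(new Runnable() {
                public void run() {
                    try {
                        start.await(); // all workers begin together
                        int sink = 0;
                        for (int i = 0; i < keys.length; i++) {
                            if (update) {
                                map.put(keys[i], i);
                            } else {
                                sink += map.get(keys[i]);
                            }
                        }
                        if (sink == 42) System.out.print(""); // use the result of the reads
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                }
            }).start();
        }
        long begin = System.nanoTime();
        start.countDown();
        try {
            done.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        for (int threads = 1; threads <= 16; threads *= 2) {
            Map<Integer, Integer> sync = fill(Collections.synchronizedMap(new HashMap<Integer, Integer>()));
            Map<Integer, Integer> conc = fill(new ConcurrentHashMap<Integer, Integer>());
            System.out.println(threads + " threads: synchronizedMap reads "
                    + measure(sync, threads, 1000000, false) / 1000000 + " ms, ConcurrentHashMap reads "
                    + measure(conc, threads, 1000000, false) / 1000000 + " ms");
        }
    }
}
```

Passing `true` as the last argument of `measure` switches the same harness to the 100% updates workload. A real measurement would also need warm-up runs and repeated iterations, which are left out here.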
The following environment was used for the benchmarking:
Hardware: Dual processor Quad Core Xeon® 5355 (8 Cores total) 2.67 GHz, 4 GB of RAM
Software: Windows Server 2003 OS, Harmony r571439, SUN JRE build 1.5.0_06-b05, JRockit JRE build R27.1.0-109-73164-1.5.0_08-20061129-1428-windows-ia32
Java execution options: all the benchmarks were run with the -server -Xms900m -Xmx900m options.
Synchronized HashMap
The results for the synchronized HashMap show that the SUN JRE has a large thread management overhead (see Chart 1 and Chart 2), while Harmony shows a very good result. Harmony has a bigger initial overhead than JRockit (i.e. the overhead of switching from single-threaded execution to 2 threads), but after that it performs just fine, with almost no additional overhead, which is very good when a large number of threads is used. Chart 1 and Chart 2 show similar pictures because, as mentioned above, the synchronization model makes the operation type almost irrelevant.
Chart 1: Synchronized HashMap: 100% reads
Chart 2: Synchronized HashMap: 100% updates
To check that the observed difference is due to the VM implementation, I ran an additional experiment for SUN and JRockit with Harmony’s java.util package implementation (except the Vector class, since it causes an Internal Error in SUN’s VM). Charts 3 and 4 show that using a different implementation of the classes doesn’t change the picture for the synchronized HashMap with read operations (for updates the situation is exactly the same).
Chart 3: Synchronized HashMap: 100% reads
Chart 4: Synchronized HashMap: 100% reads
To find the hotspots of the different implementations I performed VTune sampling for synchronized HashMap get operations in 16 threads. Looking at Table 1 and Table 2 we can make some assumptions about the behavior. First, there is a large difference in the number of instructions retired between the implementations, so we can guess that the SUN JRE spends much more time waiting than JRockit and Harmony (this was confirmed by the CPU usage level, which was much lower for SUN). Another interesting observation is that JRockit spends more than half of its time (and retires more than half of its instructions) in the Other32 module, which is actually JITed code. So the second assumption is that JRockit has some synchronization mechanisms inside the JITed code, which could be an area for deeper analysis in Harmony.
Table 1: Synchronized HashMap clockticks breakdown
| Time (%) |
Module | SUN JRE 1.5 | JRockit JRE 1.5 | Harmony |
ntkrnlpa.exe | 60.85 | 31.38 | 18.11 |
hal.dll | 11.72 | 5.64 | 3.56 |
jvm.dll | 8.31 | 1.67 | - |
intelppm.sys | 7.22 | 3.22 | 1.98 |
Other32 | 6.98 | 53.92 | 4.07 |
KERNEL32.dll | 2.50 | 1.67 | - |
ntdll.dll | 2.42 | 2.51 | 55.56 |
harmonyvm.dll | - | - | 4.29 |
HYTHR.dll | - | - | 12.43 |
Table 2: Synchronized HashMap instructions retired breakdown
| Instructions retired |
Module | SUN JRE 1.5 | JRockit JRE 1.5 | Harmony |
ntkrnlpa.exe | 5.56E+09 | 5.05E+09 | 4.17E+09 |
hal.dll | 1.43E+09 | 9.44E+08 | 1.23E+09 |
jvm.dll | 2.50E+09 | 5.15E+08 | - |
intelppm.sys | 4.99E+08 | 3.49E+08 | 3.95E+08 |
Other32 | 8.64E+08 | 1.96E+10 | 4.73E+09 |
KERNEL32.dll | 5.12E+08 | 6.69E+08 | - |
ntdll.dll | 5.12E+08 | 9.28E+08 | 1.40E+10 |
harmonyvm.dll | - | - | 2.71E+09 |
HYTHR.dll | - | - | 6.56E+09 |
Total | 1.19E+10 | 2.81E+10 | 3.38E+10 |
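Table 2 gives absolute counts, so the per-module shares behind the observations above have to be derived. A small sketch of that arithmetic follows. The class and method names are hypothetical, and the numbers are copied from Table 2.

```java
// Illustrative helper: the class and method names are not part of any benchmark code.
class RetiredShare {
    // Share of one module in a VM's total instructions retired, as a percentage.
    static double sharePercent(double module, double total) {
        return 100.0 * module / total;
    }

    public static void main(String[] args) {
        // Values copied from Table 2 (instructions retired, 16 threads).
        System.out.printf("JRockit Other32:  %.1f%%%n", sharePercent(1.96e10, 2.81e10));
        System.out.printf("Harmony ntdll:    %.1f%%%n", sharePercent(1.40e10, 3.38e10));
        System.out.printf("Harmony HYTHR:    %.1f%%%n", sharePercent(6.56e9, 3.38e10));
        System.out.printf("SUN ntkrnlpa.exe: %.1f%%%n", sharePercent(5.56e9, 1.19e10));
    }
}
```

For JRockit this gives roughly 70% of the instructions retired in Other32, which is consistent with the "more than half" remark.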
ConcurrentHashMap
Experiments with ConcurrentHashMap showed that for reads Harmony works 20-30% slower than SUN and JRockit (see Chart 5). Replacing the java.util.concurrent package in SUN and JRockit with Harmony’s (note that java.util.concurrent.locks was not replaced, because it needs VM support) showed that the classlib implementation is not the cause (see Chart 6 and Chart 7): using Harmony’s classes gave a 10% benefit for the SUN JRE and the same results for JRockit.
Chart 5: ConcurrentHashMap: 100% reads
Chart 6: ConcurrentHashMap: 100% reads
Chart 7: ConcurrentHashMap: 100% reads
Sampling data collected for execution in 16 threads showed that 100% of the time (and of the instructions retired) is spent in JITed code, so we can guess that the Harmony CG generates suboptimal code in this case. This could be caused by the lack of the loop versioning optimization in the Harmony JIT, which is currently being implemented, and it is definitely an area for deeper investigation.
Chart 8 shows that for ConcurrentHashMap with update operations the situation is pretty similar: Harmony works 20% slower than the SUN JRE and 10% slower than JRockit. Replacing the java.util.concurrent package showed that this could be due to the ConcurrentHashMap implementation (see Charts 9 and 10).
Chart 8: ConcurrentHashMap: 100% updates
Chart 9: ConcurrentHashMap: 100% updates
Chart 10: ConcurrentHashMap: 100% updates
Deeper analysis using VTune sampling (for 16 threads) also showed that 100% of the time (and of the instructions retired) was spent in JITed code, so there is no additional synchronization overhead in the VM.
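One possible way to probe how much the update results depend on the "number of groups" inside ConcurrentHashMap is its three-argument constructor, whose last parameter is the concurrencyLevel. The sketch below only shows how such maps could be created. The class name, capacity and chosen levels are assumptions, and this was not part of the experiments above. In Java 5 era implementations the level determines the number of internally locked segments, while newer JDKs treat it only as a sizing hint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: class name, capacity and key range are assumptions.
class ConcurrencyLevelSketch {
    // Creates a map with an explicit concurrencyLevel (third constructor argument).
    static Map<Integer, Integer> newMap(int concurrencyLevel) {
        return new ConcurrentHashMap<Integer, Integer>(1 << 16, 0.75f, concurrencyLevel);
    }

    public static void main(String[] args) {
        int[] levels = {1, 16, 64};
        for (int level : levels) {
            Map<Integer, Integer> map = newMap(level);
            for (int i = 0; i < 1000; i++) {
                map.put(i, i);
            }
            // The map contents do not depend on the level; only internal locking may.
            System.out.println("concurrencyLevel=" + level + " size=" + map.size());
        }
    }
}
```

Running the update workload against maps built with different levels would indicate whether lock striping in the class library or the JITed code dominates the difference.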
Conclusion
The experiments showed that Harmony’s synchronization mechanism is mature enough: in the high-contention case it outperforms such well-known JREs as SUN and JRockit. In the low-contention case there are some performance issues related to the implementation of ConcurrentHashMap and to JIT optimizations that need deeper investigation.