Parallel Sorting in Java 8

Posted by & filed under , , .

java codeI started playing with Java 8 relatively recently (booo! 🙂 ) — and going through the new features this JDK brought, I keep discovering things which I feel deserve more attention than they have been given. Sure, everyone is raving about the lambda expressions, try-with-resources and so on, but occasionally there are small additions that actually prove to be hidden gems and are an easy win for developers out there.

One such hidden gem I feel is the new parallelSort method — which as per the JavaDoc “breaks the array into sub-arrays that are themselves sorted then merged. […] The ForkJoin common pool is used to execute any parallel tasks.” Great! So we apply a divide-and-conquer and performing multiple sorts in parallel — that to me reads like an immediate speed improvement. So I put it to the test 🙂

As per usual, there’s a GitHub repo for this here: https://github.com/liviutudor/Java8ParallelSort . And you can look at the sources for this as well as the data I’ve used. Feel free to also generate other test files (I’ve used my other project from this blog post to generate the files but anything else will do) and look at the results.

For the purpose of aggregating the data I’ve used the Apache Commons Math library, which provides simple ways to compute basic statistics. I’ve only looked at the average of the operations, standard deviation and the 95% percentile — but again, Apache Math library offers much more if you want to dig further into this. For the purpose of this, I just wanted to get a simple comparison in between the standard Arrays.sort and the new Arrays.parallelSort and figure out an order of magnitude comparison (if there is any).

The way this test is run is per the following: I’ve built a few test files (which get read into arrays) with 100, 1,000, 10,000, 100,000 and 1,000,000 random integers respectively. These arrays get sorted 1,000 times each first with the standard sort then with the parallel sort and the timings for each sort is then aggregated and few stats are computed.

Running this on my Mac here’s a sample of the output:

standard -- 100.txt
average     =   32884.00 ns     =       0.03 ms
95 perc     =    2000.00 ns     =       0.00 ms
stddev      =  252204.87 ns     =       0.25 ms

standard -- 1000.txt
average     =   51187.00 ns     =       0.05 ms
95 perc     =   34000.00 ns     =       0.03 ms
stddev      =   13396.85 ns     =       0.01 ms

standard -- 10000.txt
average     =  611770.00 ns     =       0.61 ms
95 perc     =  540000.00 ns     =       0.54 ms
stddev      =  114533.55 ns     =       0.11 ms

standard -- 100000.txt
average     = 7635874.00 ns     =       7.64 ms
95 perc     = 6968509.50 ns     =       6.97 ms
stddev      =  949718.02 ns     =       0.95 ms

standard -- 1000000.txt
average     = 89002092.00 ns    =      89.00 ms
95 perc     = 83446019.00 ns    =      83.45 ms
stddev      = 6321875.10 ns     =       6.32 ms

parallel -- 100.txt
average     =    4067.00 ns     =       0.00 ms
95 perc     =    2000.00 ns     =       0.00 ms
stddev      =   19583.05 ns     =       0.02 ms

parallel -- 1000.txt
average     =   47775.00 ns     =       0.05 ms
95 perc     =   33000.00 ns     =       0.03 ms
stddev      =   14184.91 ns     =       0.01 ms

parallel -- 10000.txt
average     =  530962.00 ns     =       0.53 ms
95 perc     =  343019.00 ns     =       0.34 ms
stddev      = 3654827.07 ns     =       3.65 ms

parallel -- 100000.txt
average     = 4205239.00 ns     =       4.21 ms
95 perc     = 3582509.50 ns     =       3.58 ms
stddev      =  681404.40 ns     =       0.68 ms

parallel -- 1000000.txt
average     = 43857086.00 ns    =      43.86 ms
95 perc     = 41049076.00 ns    =      41.05 ms
stddev      = 3027405.87 ns     =       3.03 ms

Wow! So even though my code is not optimized to take into account garbage collection and other things happening inside the JVM (remember, I’m only interested in the order of magnitude not the precise timings here!), I can spot right away with the naked eye that the parallel sort is about twice as fast as the standard one!

The only places where they seem to have a similar timing is the case of 100 and 1,000 items. Looking back at the javadoc, the following bit explains why: “When the sub-array length reaches a minimum granularity, the sub-array is sorted using the appropriate Arrays.sort method. If the length of the specified array is less than the minimum granularity, then it is sorted using the appropriate Arrays.sort method.” In other words, it seems that 1,000 (and therefore 100 too) is within the “minimum granularity” and as such, the code still applies the standard Arrays.sort method. The difference starts becoming apparent in the case of 10,000 items — where w are looking at ~0.54 ms  for the standard sort versus about ~0.34 ms for parallel sorting. And once we look at 100,000 and then 1,000,000 items the difference in timing keeps growing.

I didn’t have the patience to run this on 10mil items or more but I would bet good money on the fact that the time difference increase. And all of this at the cost of simply replacing a call to Arrays.sort() with a call to Arrays.parallelSort(). Sweet!!!! 🙂