Friday, 3 June 2016

Don't just assume arrays will be faster: how we implemented object pooling

Update (29/06/2016): As Mike Barker kindly pointed out, the array backed object pool was not pre-populating the array with objects so the benchmark was going through some extra logic. I've since updated the results and graphs with the array pre-populated, which hasn't greatly changed the results.

To ensure customers receive the best service possible at TransFICC we need to make sure that our systems have robust latency and throughput figures. A key part of avoiding latency spikes (in Java anyway) is to attempt to keep the Garbage Collector at bay. One way we attempt to do this is with the use of object pooling pattern to allow us to re-use objects.

We weren't sure what data structure would be best to back our object pool implementation so we decided to write a few benchmarks that utilise various data structures: stack, dequeue based stack, queue, and an array. The goal of these benchmarks was to see if any of the data structures allocate (e.g. if using a LinkedList, a node would be created everytime you added an object to the pool), and to see what the latency and throughput values would be.

The benchmark results for measuring throughput show that both of the deque backed object pools perform at a similar level for both of the scenarios we are testing (removing and adding a single object, removing and adding several objects), with our stack implementation having similar throughput for the removing and adding several objects scenario, taking into account the error.

What is more interesting is the graphing of the latency figures.







Both object pools that are backed by a deque have very similar latency profiles, which I was surprised at as I expected the the stack implementation to outperform on the remove one add one strategy as it would only be using a single object (the queue implementation will always be taking a different object from the front of the queue). The stack object pool does perform slightly better in the remove several add several scenario at one data point, but the difference isn't significant compared to the queue implementation.

What I found most surprising is the high variance in the array backed object pool. I assumed using a simple array would yield the best throughput and latency numbers, however this is not the case. I'm not sure if it's because of my implementation of the array object pool so will try another version if I have time.

In the end we opted to use an object pool using a stack backed by a deque based on our findings. I should point out that all our object pools are accessed from a single thread so are not required to be thread safe.

The benchmarks were run on a Fedora 23 box with 6 core Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz and 32GB RAM.

Benchmark code can be found here.

Throughput


Benchmark                                                              Mode  Cnt   Score    Error   Units
StackObjectPoolUsingDequeBenchmark.removeAndAddOne                          thrpt   10  96.785 ±  0.076  ops/us
StackObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.alloc.rate           thrpt   10   0.002 ±  0.006  MB/sec
StackObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.alloc.rate.norm      thrpt   10  ≈ 10⁻⁵             B/op
StackObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.count                thrpt   10     ≈ 0           counts
QueueObjectPoolUsingDequeBenchmark.removeAndAddOne                          thrpt   10  96.558 ±  0.614  ops/us
QueueObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.alloc.rate           thrpt   10   0.002 ±  0.006  MB/sec
QueueObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.alloc.rate.norm      thrpt   10  ≈ 10⁻⁵             B/op
QueueObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.count                thrpt   10     ≈ 0           counts
StackObjectPoolUsingArrayBenchmark.removeAndAddOne                          thrpt   10  88.930 ±  0.065  ops/us
StackObjectPoolUsingArrayBenchmark.removeAndAddOne:·gc.alloc.rate           thrpt   10   0.002 ±  0.006  MB/sec
StackObjectPoolUsingArrayBenchmark.removeAndAddOne:·gc.alloc.rate.norm      thrpt   10  ≈ 10⁻⁵             B/op
StackObjectPoolUsingArrayBenchmark.removeAndAddOne:·gc.count                thrpt   10     ≈ 0           counts
StackObjectPoolBenchmark.removeAndAddOne                                  thrpt   10  16.276 ±  0.004  ops/us
StackObjectPoolBenchmark.removeAndAddOne:·gc.alloc.rate                     thrpt   10   0.002 ±  0.006  MB/sec
StackObjectPoolBenchmark.removeAndAddOne:·gc.alloc.rate.norm                thrpt   10  ≈ 10⁻⁴             B/op
StackObjectPoolBenchmark.removeAndAddOne:·gc.count                          thrpt   10     ≈ 0           counts


StackObjectPoolUsingDequeBenchmark.removeAndAddSeveral                      thrpt   10  38.524 ±  0.012  ops/us
StackObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.alloc.rate       thrpt   10   0.002 ±  0.006  MB/sec
StackObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.alloc.rate.norm  thrpt   10  ≈ 10⁻⁴             B/op
StackObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.count            thrpt   10     ≈ 0           counts
QueueObjectPoolUsingDequeBenchmark.removeAndAddSeveral                      thrpt   10  38.108 ±  0.255  ops/us
QueueObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.alloc.rate       thrpt   10   0.002 ±  0.006  MB/sec
QueueObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.alloc.rate.norm  thrpt   10  ≈ 10⁻⁴             B/op
QueueObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.count            thrpt   10     ≈ 0           counts
StackObjectPoolUsingArrayBenchmark.removeAndAddSeveral                      thrpt   10  39.241 ±  1.179  ops/us
StackObjectPoolUsingArrayBenchmark.removeAndAddSeveral:·gc.alloc.rate       thrpt   10   0.002 ±  0.006  MB/sec
StackObjectPoolUsingArrayBenchmark.removeAndAddSeveral:·gc.alloc.rate.norm  thrpt   10  ≈ 10⁻⁴             B/op
StackObjectPoolUsingArrayBenchmark.removeAndAddSeveral:·gc.count            thrpt   10     ≈ 0           counts
StackObjectPoolBenchmark.removeAndAddSeveral                                thrpt   10   5.479 ±  0.036  ops/us
StackObjectPoolBenchmark.removeAndAddSeveral:·gc.alloc.rate                 thrpt   10   0.002 ±  0.006  MB/sec
StackObjectPoolBenchmark.removeAndAddSeveral:·gc.alloc.rate.norm            thrpt   10  ≈ 10⁻³             B/op
StackObjectPoolBenchmark.removeAndAddSeveral:·gc.count                      thrpt   10     ≈ 0           counts


Latency

Remove One Add One Scenario

Percentile   StackObjectPoolUsingDequeBenchmark   QueueObjectPoolUsingDequeBenchmark   StackObjectPoolUsingArrayBenchmark   StackObjectPoolBenchmark

0                      0.558                                0.558                                   0.558                             0.558
50                     0.559                                0.559                                   0.559                             0.629
90                     0.629                                0.629                                   0.629                             0.629
95                     0.629                                0.629                                   0.629                             0.698
99                     0.629                                0.629                                   0.629                             0.699
99.9                   1.118                                1.118                                   1.118                             1.118
99.99                  10.88                                6.842                                   10.752                            10.964
99.999                 11.872                               11.792                                  11.872                            12.057
99.9999                17.991                               16.591                                  51.3                              43.027
100                    18.144                               17.92                                   58.56                              44.416
            
Mean                   0.588                                0.59                                    0.586                             0.628
Error (us/op)          0.001                                0.001                                   0.001                             0.001

Remove Several Add Several Scenario

Percentile    StackObjectPoolUsingDequeBenchmark    QueueObjectPoolUsingDequeBenchmark    StackObjectPoolUsingArrayBenchmark    StackObjectPoolBenchmark
0            0.558                                0.558                                   0.558                              0.698
50         0.628                                0.628                                   0.628                              0.768
90                         0.629                                0.629                                   0.629                              0.769
95                         0.629                                0.629                                   0.629                              0.769
99                         0.629                                0.629                                   0.629                              0.769
99.9                       1.118                                1.118                                   1.118                              1.258
99.99                      10.816                               6.6                                     10.816                             11.232
99.999                     12.006                               11.872                                  11.99                              12.312
99.9999                    14.389                               19.716                                  29.362                             67.966
100                        14.512                               22.4                                    35.712                             69.376
                
Mean                       0.603                                0.605                                   0.604                              0.746
Error (us/op)              0.001                                0.001                                   0.001                              0.001