To ensure customers receive the best service possible at TransFICC we need to make sure that our systems have robust latency and throughput figures. A key part of avoiding latency spikes (in Java anyway) is to attempt to keep the Garbage Collector at bay. One way we attempt to do this is with the use of object pooling pattern to allow us to re-use objects.
We weren't sure what data structure would be best to back our object pool implementation so we decided to write a few benchmarks that utilise various data structures: stack, dequeue based stack, queue, and an array. The goal of these benchmarks was to see if any of the data structures allocate (e.g. if using a LinkedList, a node would be created everytime you added an object to the pool), and to see what the latency and throughput values would be.
The benchmark results for measuring throughput show that both of the deque backed object pools perform at a similar level for both of the scenarios we are testing (removing and adding a single object, removing and adding several objects), with our stack implementation having similar throughput for the removing and adding several objects scenario, taking into account the error.
What is more interesting is the graphing of the latency figures.
Both object pools that are backed by a deque have very similar latency profiles, which I was surprised at as I expected the the stack implementation to outperform on the remove one add one strategy as it would only be using a single object (the queue implementation will always be taking a different object from the front of the queue). The stack object pool does perform slightly better in the remove several add several scenario at one data point, but the difference isn't significant compared to the queue implementation.
What I found most surprising is the high variance in the array backed object pool. I assumed using a simple array would yield the best throughput and latency numbers, however this is not the case. I'm not sure if it's because of my implementation of the array object pool so will try another version if I have time.
In the end we opted to use an object pool using a stack backed by a deque based on our findings. I should point out that all our object pools are accessed from a single thread so are not required to be thread safe.
The benchmarks were run on a Fedora 23 box with 6 core Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz and 32GB RAM.
Benchmark code can be found here.
Throughput
Benchmark Mode Cnt Score Error Units StackObjectPoolUsingDequeBenchmark.removeAndAddOne thrpt 10 96.785 ± 0.076 ops/us StackObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.alloc.rate thrpt 10 0.002 ± 0.006 MB/sec StackObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.alloc.rate.norm thrpt 10 ≈ 10⁻⁵ B/op StackObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.count thrpt 10 ≈ 0 counts QueueObjectPoolUsingDequeBenchmark.removeAndAddOne thrpt 10 96.558 ± 0.614 ops/us QueueObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.alloc.rate thrpt 10 0.002 ± 0.006 MB/sec QueueObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.alloc.rate.norm thrpt 10 ≈ 10⁻⁵ B/op QueueObjectPoolUsingDequeBenchmark.removeAndAddOne:·gc.count thrpt 10 ≈ 0 counts StackObjectPoolUsingArrayBenchmark.removeAndAddOne thrpt 10 88.930 ± 0.065 ops/us StackObjectPoolUsingArrayBenchmark.removeAndAddOne:·gc.alloc.rate thrpt 10 0.002 ± 0.006 MB/sec StackObjectPoolUsingArrayBenchmark.removeAndAddOne:·gc.alloc.rate.norm thrpt 10 ≈ 10⁻⁵ B/op StackObjectPoolUsingArrayBenchmark.removeAndAddOne:·gc.count thrpt 10 ≈ 0 counts StackObjectPoolBenchmark.removeAndAddOne thrpt 10 16.276 ± 0.004 ops/us StackObjectPoolBenchmark.removeAndAddOne:·gc.alloc.rate thrpt 10 0.002 ± 0.006 MB/sec StackObjectPoolBenchmark.removeAndAddOne:·gc.alloc.rate.norm thrpt 10 ≈ 10⁻⁴ B/op StackObjectPoolBenchmark.removeAndAddOne:·gc.count thrpt 10 ≈ 0 counts StackObjectPoolUsingDequeBenchmark.removeAndAddSeveral thrpt 10 38.524 ± 0.012 ops/us StackObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.alloc.rate thrpt 10 0.002 ± 0.006 MB/sec StackObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.alloc.rate.norm thrpt 10 ≈ 10⁻⁴ B/op StackObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.count thrpt 10 ≈ 0 counts QueueObjectPoolUsingDequeBenchmark.removeAndAddSeveral thrpt 10 38.108 ± 0.255 ops/us QueueObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.alloc.rate thrpt 10 0.002 ± 0.006 MB/sec QueueObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.alloc.rate.norm thrpt 10 ≈ 10⁻⁴ B/op QueueObjectPoolUsingDequeBenchmark.removeAndAddSeveral:·gc.count thrpt 10 ≈ 0 counts StackObjectPoolUsingArrayBenchmark.removeAndAddSeveral thrpt 10 39.241 ± 1.179 ops/us StackObjectPoolUsingArrayBenchmark.removeAndAddSeveral:·gc.alloc.rate thrpt 10 0.002 ± 0.006 MB/sec StackObjectPoolUsingArrayBenchmark.removeAndAddSeveral:·gc.alloc.rate.norm thrpt 10 ≈ 10⁻⁴ B/op StackObjectPoolUsingArrayBenchmark.removeAndAddSeveral:·gc.count thrpt 10 ≈ 0 counts StackObjectPoolBenchmark.removeAndAddSeveral thrpt 10 5.479 ± 0.036 ops/us StackObjectPoolBenchmark.removeAndAddSeveral:·gc.alloc.rate thrpt 10 0.002 ± 0.006 MB/sec StackObjectPoolBenchmark.removeAndAddSeveral:·gc.alloc.rate.norm thrpt 10 ≈ 10⁻³ B/op StackObjectPoolBenchmark.removeAndAddSeveral:·gc.count thrpt 10 ≈ 0 counts
Latency
Remove One Add One Scenario
Percentile StackObjectPoolUsingDequeBenchmark QueueObjectPoolUsingDequeBenchmark StackObjectPoolUsingArrayBenchmark StackObjectPoolBenchmark 0 0.558 0.558 0.558 0.558 50 0.559 0.559 0.559 0.629 90 0.629 0.629 0.629 0.629 95 0.629 0.629 0.629 0.698 99 0.629 0.629 0.629 0.699 99.9 1.118 1.118 1.118 1.118 99.99 10.88 6.842 10.752 10.964 99.999 11.872 11.792 11.872 12.057 99.9999 17.991 16.591 51.3 43.027 100 18.144 17.92 58.56 44.416 Mean 0.588 0.59 0.586 0.628 Error (us/op) 0.001 0.001 0.001 0.001
Remove Several Add Several Scenario
Percentile StackObjectPoolUsingDequeBenchmark QueueObjectPoolUsingDequeBenchmark StackObjectPoolUsingArrayBenchmark StackObjectPoolBenchmark 0 0.558 0.558 0.558 0.698 50 0.628 0.628 0.628 0.768 90 0.629 0.629 0.629 0.769 95 0.629 0.629 0.629 0.769 99 0.629 0.629 0.629 0.769 99.9 1.118 1.118 1.118 1.258 99.99 10.816 6.6 10.816 11.232 99.999 12.006 11.872 11.99 12.312 99.9999 14.389 19.716 29.362 67.966 100 14.512 22.4 35.712 69.376 Mean 0.603 0.605 0.604 0.746 Error (us/op) 0.001 0.001 0.001 0.001
I noticed that the array based pool doesn't pre-populates where as the others do. Could this be influencing the results?
ReplyDeleteOh dear. I believe there is a very high chance you're correct. I'll re-run with the benchmarks (and probably re-write the post) with the amendment. It will be interesting to see what the difference is (at least it isn't allocating). Good spot
Delete