Apache Geode™ is an in-memory data grid that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures. Data is distributed throughout the cluster, consisting of servers, locators, and clients. Servers are responsible for hosting data, while locators are responsible for directing the requests from the clients. Due to the distributed and real-time nature of Geode deployments, performance is important - so what is Geode’s performance?
Geode’s performance is shown in figure 1, with average throughput in operations per second on the vertical axis and four different performance tests on the horizontal axis. The graph shows that PartitionedGetBenchmark, when testing Geode 1.9.0, had an average throughput of 200,000 operations per second.
But what does that mean? Is 200,000 a good throughput for this test deployment? Is the performance consistent and accurate? Has it improved or regressed since the previous version? And perhaps most importantly, can it be better?
To answer those questions, we must first ask: how is Geode benchmarked? When we started replacing the previous bare-metal performance testing of Geode, we had several goals for the project, many of which have already been accomplished. You can run performance tests on demand against any revision of Geode (released or in development), on an AWS cluster, or on any development machine. You can also run benchmarks from Concourse CI pipelines. We enabled running with a profiler attached for use in debugging performance bottlenecks. And finally, you can compare any two runs of benchmarks for changes in performance.
While the benchmarks currently enable users to understand the performance of a cluster, there are still a few goals in progress. The first goal is to increase community engagement with benchmarks. Community members like you can run benchmarks to understand the performance impact of your configurations (as an operator) or code changes (as a developer). Community members can also create your own benchmark tests, helping to increase coverage over Geode. Increasing this test coverage is another ongoing effort. Finally, data visualization of benchmark data is still in progress.
Our current portfolio of tests, paired with various configuration options in Geode, cover many of the common workloads that are used in deployments. This post focuses on the results of four of the benchmarks that cover the simplest and most common operations in Geode:
Current configuration options are:
The benchmarks are driven from the client side, performing operations against the cluster as fast as possible. Latency and throughput are measured throughout the test and then analyzed to calculate the average, standard deviation, and 99th percentile. The tests, by default, are run for five minutes, with a one-minute warm-up period. The cluster is comprised of one locator, two servers, and the client. The servers have a region (either partitioned or replicated depending on the test), which is pre-populated with data before starting the test phase.
The first step towards fixing performance issues is finding them. Using a profiler, we can look for several different kinds of hotspots. Monitor locks, thread park/unpark reentrant locks, and excessive allocations/garbage collection (GC) are all good places to start looking for bottlenecks. Other things to look for include:
Let’s focus on a specific example of a performance refactor, starting with the reason that we were even looking for a bottleneck. When the benchmarks were run, the stats showed us that none of the hardware was being saturated, but when running with a profiler, no hot spots were showing up. This seemed immediately suspect, since an absence of bottlenecks meant we should be using all of our CPU to do operations. Eventually, we found the secret profiler option that shows the zero-time reentrant locks. This exposed Thread.park() as a hotspot, with callers of reentrant lock and the connection pool. Further investigations showed that the connection pool was holding a reentrant lock in a hot path while using a deque.
Through further benchmarking of Geode 1.9.0 with different numbers of client threads performing get operations on the cluster, we found that the throughput only scaled to 32 threads before decreasing for higher thread counts (figure 2). Since this was run on a 36 vCPU AWS instance, we did not expect to see decreasing performance at such low thread counts. This issue is caused by the connection pool’s lack of support for a sufficient number of concurrent operations.
The profiler shows where in the code the bottleneck was occurring (figure 3). Every operation that is executed on the server results in one call to ConnectionManagerImpl.borrowConnection and one call to ConnectionManagerImpl.returnConnection (highlighted in green), both of which get a reentrant lock (highlighted in blue). This lock is responsible for almost half of the time spent in these two methods. This is the cause of the taper in performance shown in figure 2. As the thread count increases, contention for the lock increases as operations borrow and return connections concurrently, resulting in a bottleneck.
The profiler points us to a specific location in the code, shown in figure 4. This is a pared-down version of the ConnectionManagerImpl, which implements the ConnectionManager. In this class, available connections are being stored in a deque (the first arrow in figure 4). Because the deque is not a thread safe structure, a reentrant lock (the second arrow in figure 4) is used when accessing the deque. The places where this lock is used, in borrowConnection and returnConnection, are the bottlenecks we were seeing in the profiler.
Figure 5 shows one of the two implementations of borrowConnection. One of these implementations returns an available connection to any server, and the other returns an available connection to a specific server. Both implementations of borrowConnection, as well as the implementation of returnConnection have the same issue with locking, so this post will focus on the server-specific implementation of borrowConnection.
In this implementation, the reentrant lock is held for a significant portion of the method, while the deque is being traversed. The arrows on the left of the image show that the lock is held for a long time. In the middle of holding that lock, there is an await (the arrow on the right of the image). The await causes the thread to be paused until the condition has been met and a signal is received. During this time, the lock is returned. This means that it must reacquire the lock before the thread can continue, further delaying the return of a connection to the caller. In the worst case, this delay is the duration of the timeout provided to the await, plus the time it takes to reacquire the lock, with contention.
Between the profiler and the code, we can be very confident that this area of the code is a significant bottleneck, but how can we fix it? The first part of the solution is to replace the deque in the connection manager with a more appropriate structure. Since we also want to test this code, let’s introduce some modularity into the code and extract all of the behavior related to the available connections to be moved into another class called the AvailableConnectionManager. The implementation of this class also allows us to get rid of the lock in the connection manager, resulting in the implementation of ConnectionManagerImpl shown in figure 6.
Figure 7 shows the implementation of the AvailableConnectionManager, extracted from the ConnectionManagerImpl. In this implementation, the deque has been replaced with a concurrent linked deque. The linked nature of the deque does cause some performance hits due to the need to allocate and garbage collect the nodes. However, this structure relies on compare and swap for a lock-free implementation, making the ConcurrentLinkedDeque the ideal choice for this implementation. The benefit gained from being lock-free far outweighs the slowdown from allocations and GC.
With those changes in mind, we can now take a look at the refactored implementation of borrowConnection. Figure 8 shows that borrowConnection is now lock-free, instead calling the useFirst method on the AvailableConnectionManager. This keeps all of the logic for manipulating the list of available connections in the AvailableConnectionManager.
If we look at the logic in useFirst, shown in figure 9, we can see that there are still no reentrant locks used. Instead, ConcurrentLinkedDeque.removeFirstOccurence() is called. Since this method is thread-safe and lock-free, a sufficiently large pool of connections should result in continued scaling of throughput well past 32 threads on the client.
The new implementations of ConnectionManagerImpl and AvailableConnectionManagerImpl have been thoroughly tested at every level – beyond the familiar unit and integration testing. The use of a profiler is instrumental for finding hotspots. Using the same process as used in finding the bottleneck, a profiler can be attached while running benchmarks to see if any of the old hot spots are still there, or if any new ones have appeared. Figure 10 shows that the borrowConnection and returnConnection methods are no longer hot spots and no longer call ReentrantLock methods. This gives us good confidence that no new hot spots have been introduced by this refactor.
The next type of test run against this refactor is distributed testing. Distributed tests show how the connection manager behaves in a real Geode cluster. A cluster is spun up in several VMs and operations are run, causing connections to be created and destroyed, borrowed and returned. The results of these operations are compared against the expected results. This ensures that this refactor behaves as expected when running with the rest of the system.
In addition to testing the system as a whole, we wanted to ensure that there were no concurrency issues within the refactored ConnectionManagerImpl, as well as with the extracted AvailableConnectionManager. To that end, concurrency tests were added for these classes. In concurrency testing, an executor is given multiple threads to run in parallel, applying pressure to the connection manager to test that certain timings do not result in concurrency issues. This also allows us to test how the code behaves with contention for connections.
The final type of testing that was done on this refactor is performance testing. The results of the performance tests are shown in figure 11. This graph compares the throughput of the commit prior to the refactor, with the committed refactored code. It shows a 239% increase in performance due to this single commit. Additionally, we could see by looking at the Geode stats that the CPU of the client was saturated when running with the refactored connection manager. Finally, figure 12 shows that the scaling of throughput now continues well past the 32 threads at which it tapered off using the old connection pool. With the new connection pool, it takes until 512 threads for scaling to slow at all. This shows that the new implementation not only provides better throughput than the original at moderate thread counts, but also behaves much better under high contention.
Figure 13 shows a comparison of throughputs for the two versions, with 1.9.0 in dark purple and 1.10.0 in light purple. There is a significant improvement in throughput in Geode 1.10.0.
Additionally, figure 14 shows that there was also a significant reduction in latency between 1.9.0 and 1.10.0. These improvements are largely due to the change in connection pool, but version 1.10.0 includes several other small performance refactors that, together, make a big difference in the performance of Geode.