We were in the process of migrating a legacy java app to PCF. We were replatforming it to Spring Boot from a TIBCO AMX base. The issue we were seeing is that XML service calls ran
2x slower when running in PCF as compared to when the service calls were made from a standalone java jar. The exact same app displayed different performance characteristics when running in different scenarios:
ok! so given this situation how do you go about resolving it? First, It wasn't an obvious situation for doing method-level profiling. So, we had to rule-out a few things that came to mind first:
Network-related differences including Latency, bandwidth and route differences. We found out that network performance was faster from PCF than from the Laptop.
JDBC differences including driver versions, pooling configuration, and caching config. We looked at Database call latency in the different environments and found that JDBC performance was not changing enough between scenarios to cause the difference we were seeing.
Class-loading differences especially around the XML parsing libraries used. The major portion of latency was coming from XML processing of the Database resultsets. Were different classes being used in the different scenarios? We did some detailed Class-Loading analysis of the different scenarios and did find a few differences, but
no smoking gun here either!
What else could we do to understand the differences across these scenarios. We turned to Profiling.
What are the options that come to mind when a service is performing poorly and not satisfying the latency service level indicators ?
1. Poor man's profiler. Take three or four javacores spread 20 seconds apart to see if threads are hanging in particular java operation or code paths. Javacores are the poor man's profiler to debug performance issues. Take a whole bunch of thread dumps spaced 30s apart to determine which threads are stuck and the exact processing that takes place. There are some tools that automatically trigger a JVM thread dump. thread-dump-grapher. On PCF to trigger a threaddump you can either ssh into the container and issue
kill -3 <PID> or Use Spring Boot
/dump actuator endpoint use VisualVM. More on this later….Back to the actual problem. In our case the service call execution was in the 50-200s range therefore there simply was not enough time to trigger multiple thread-dumps aka java cores.
2. The second option that came to mind was to instrument the application with metrics and benchmark it using a microprofiler like JMH. We considered getting inspired by David Syer's work with spring boot like the spring-boot-startup-bench. naah … we are lazy and impatient developers - this would be too much work.
3. We briefly considered JVM garbage collection was a likely culprit. We suspected that the app may be getting pressured due to low memory and GC stop the world activity may be causing the slowdown. Since we could change the size of the JVM and not affect performance, we moved on.
4. However, if we wanted to validate this hypothesis in more detail we would have needed to compare the verbose GC logs from the jvm inside the container and the app running standalone outside. We would need to push the app with
java_opts: -Xloggc:$PWD/beacon_gc.log -verbose:gc and the pull the log out from the container or Splunk etc.,. Again this felt too much of a chore with not enough immediate ROI. If you are suspecting issues with the native memory then you will need to do more invasive surgery with
JAVA_OPTS: -Djava.security.egd=file:///dev/urandom -XX:NativeMemoryTracking=summary -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
5. The hunt was now on for a light weight CPU sampling profiler that could be invoked without much setup or prep within the app. oh wait … there is this thing called the
hprof JVMTI profiler baked within the JVM. Unfortunately HROF is not reliable and has problems. Moreover the hprof tool was sunset in JDK8 further adding to our misgiving to our tool.
Finally through the process of elimination we landed on VisualVM for profiling our app. VisualVM on your laptop can connect remotely to the app running in PCF. See this excellent KnowledgeBase article on How to remotely monitor Java applications deployed on PCF® via JMX. Please note that The instructions are only useful in Diego-based containers with SSH access enabled.
We decided to use VisualVM to profile several of the scenarios to see if the profile results were consistent and could help explain the performance differences.
The three scenarios were tested with VisualVM's Sampling option.
“VisualVM’s profiler works by “instrumenting” all of the methods of your code. This adds extra bytecode to your methods for recording when they’re called, and how long they take to execute each time they are. VisualVM’s sampler, however, takes a dump of all of the threads of execution on a fairly regular basis, and uses this to work out how roughly how much CPU time each method spends” (reference from : https://blog.idrsolutions.com/2014/04/profiling-vs-sampling-java-visualvm/ ).
To get relevant profile information we had to configure the VisualVM sampling configuration
We ran each test with the same load test using SOAPUI. The results are below:
We still aren't 100% sure why the Buildpack execution was slower than
mvn spring-boot:run. Our best guess is that the method in question is loading classes and the classpath order can be very different between execution methods. Even if the same class is found, the position on the classpath can delay the loading of the class.
To summarize our finding, here is the method that caused the slowdown we were seeing in PolicyNoteBoot:
The performance penalty of
DatatypeFactory.newInstance().newXMLGregorianCalendar(c); was confirmed from the data above with a workaround of caching the Factory outside the loop. Java Performance Pitfalls - DatatypeFactory
Moving the factory object out of the loop made performance better and consistent.
The final SoapUI results were: