We've been having a problem in our application running in WebLogic Server where seemingly randomly the Log API just goes through the roof in terms of CPU usage. Time between occurrences can be as short as a few hours and as long as three weeks without any change to the software or environment. When we look at the method hotspots the CPU spikes, it looks like Hibernate tracing has been enabled. However, we don't see any trace-level logging in our logs – it's as though Hibernate is going through all the CPU-intensive work to log trace-level information but then doesn't actually do it.
We applied a method sensor to java.util.logging.Logger.setLevel() and we actually located the needle in a haystack - the PurePath that appears to change the logging level to Trace. Great! But... why is it happening? Unfortunately, the setLevel() method is called many times (770) in this PurePath and in others, nearly always with 'null' and occasionally with 'INFO'. Near the bottom of the PurePath of interest, we see setLevel() called with a parameter of 'ALL'. That's the needle. But we don't understand why it's happening in order to understand it well enough to prevent in the future.
When I look at the Methods dashlet, the Methods Hotspots dashlet for the calling stack to setLevel(), or in the PurePath Tree, the call stack presented is incomplete. Dynatrace appears to be telling me that setLevel() is being called directly by weblogic.wsee.server.servlet.BaseWSServlet.service(). When I do a Source Lookup > Decompile Source Code, the resulting code does not show any direct calls from BaseWSServlet.service() to setLevel().
So my question is: how do I get a complete stack trace to setLevel(), but only when the value of the parameter passed is "ALL"? There are far too many calls to the method otherwise. Is there some kind of action I can configure as part of a Business Transaction that's akin to "dump the current thread call stack when the parameter passed to setLevel() is 'ALL' "?
Depending on whether you suspect weblogic or your own code, you could try shotgunning a package. If it's weblogic that you suspect, you could try shotgunning the weblogic.wsee.server.servlet package.
If you suspect one of your own, you'll have to shotgun your own business packages.
If you do this, note that you're walking a thin line between killing your server with overinstrumentation and actually finding the bug, choose the packages that you instrument like this carefully. You may want to try it in a test setup to see how much additional instrumentation this would mean.
You could set up a "Configuration" at the system profile level to be able to quickly switch this extra instrumentation off and on.
The hope is that by shotgunning these packages you're able to restrict the area of code that you have to look through.
All this shotgunning is necessary because the needle is too small to be hit by the auto sensor information (at least that's what I assume).
Shotgunning was used in the olden days before auto-sensor information was available.
You'd define a sensor rule for a package and instrument all methods in it by putting an asterisk as the method rule.
This would give you a very exact picture of your program's logic in exchange for a relatively high instrumentation overhead penalty.
It still works, and when you're looking for the call stack of a method that's 0.41ms once per week it's probably the only reliable approach.
Thanks for the feedback, Peter. I'm assuming "shotgunning" means instrumenting all methods in a package. We'll keep digging before going that route by first trying to reproduce the problem in a lower environment. I was hoping I was unaware of a feature but can see that this would be a big ask in terms of potential overhead.
The System Profile Configurations has kind of blown past me before. In case anyone else comes across this posting and also is unaware of that feature, here's a link: https://www.dynatrace.com/support/doc/appmon/admin...
Mike, Another approach to consider:
When viewing a purepath which contains an explicit call to setLevel(), try displaying "Show All Nodes". This should give you the complete call tree above setLevel() so you can really see who's calling setLevel(). "Show All Nodes" is a link in the upper right corner of the lower panel of the Purepath dashlet.
Let us know if this works, as there's additional alternative ideas as well before you go to Shotgun instrumentation.
Then the next option is to use the CPU Sampler and run it for enough time to capture the necessary features. Since you don't know when it's going to happen, try running the CPU Sampler at a reasonable rate, but run it for many hours/days. Yes it will consume some CPU itself, but will likely be cheaper than running wildcard instrumentation.
Understood about the CPU sampling, but I just don't see that working. The method in question's execution time is recorded by Dynatrace at 0.41ms and occurs very infrequently (~once a week). I just don't think the possibility of hitting in this small window of time will be worth the additional overhead of the sampling.
I do appreciate the lifeline(s). I'm cautiously optimistic that we lucked out by looking in the right cubbyhole in the system to have a theory as to how this is happening.
Rajesh: Don't think I'd be able to do that through the forums; too much confidential info in there.
As an update, we have located the use case that causes our problem so we've staunched the bleeding. We don't understand exactly how that use case causes the trace-level logging but it's a use case that's under our control. So I'm going to pull the ripcord on further digging - I've got lost time to make up. 😉 I do have a lingering suspicion that this would be a good product feature -- the ability to have an action to dump the current threads' stacks when a precise condition occurs -- but I hope never to be in need of it again. Lightning never strikes twice, eh?
I do appreciate your interest in following up.
Hmm, yes, that might be a useful feature in some cases. The main problem is that getting the stack trace from inside the agent is very time consuming (all the string handling and so on), so you'd have to be very careful with overhead, but in your case it would have been the perfect tool.