Solved: Is it possible to train the baselining engine in a performance environment and port it to production so the AI is ready for monitoring on Day 1?

thanes_bala · ‎19 Aug 2019

We are trying to make full use of the baselining engine from Day 1 in production. It seems to take ~3 weeks of DT monitoring for the AI to reach optimal baselines. We were wondering if it's possible to train the AI engine in a performance environment and move it to Prod, so that the AI engine is optimal from day 1 of the deployment in Prod. Is this possible?

skrystosik · ‎19 Aug 2019

In general this does not make sense. First, load tests do not behave the same way as real users. Even if you simulate weeks of traffic and then install production agents in the same Dynatrace environment, it will have to learn again, and you will get many alerts about traffic being too high or too low.

Dynatrace starts creating baselines from day one, but I agree that for some time there are some wrong alerts. This is something we have to live with for now.

If you are sure that you can build an exact representation of production traffic, with random think times and varied paths (not a single path as robots follow, but random ones as well), and you think this massive amount of work is worth it, you can always try such an experiment. But I would never recommend it to anyone 🙂

Regards, Sebastian

thanes_bala · ‎20 Aug 2019

@sebastian k.

I need advice on migrating from an existing monitoring tool to DT. My client currently uses another tool to monitor the Prod env, and the affected application owners will not approve the DT migration unless we can achieve incident parity with the existing tool from Day 1 in production. As DT takes some time to learn, is there a way to achieve this?

@Andreas G.

@Joseph M. H.

@Julius L.

skrystosik · ‎20 Aug 2019

In general, proper alerting should work almost instantly. What Dynatrace learns over time is the exact baselines, which takes around a week. Until then some alerts are too sensitive (in my experience), but the system is fully operational. An issue can arise, for example, if a service has a normal state of a 50% failure rate. From the start DT will alert you about an issue there, but those failures may be caused by 401 errors, which may be normal. After a few days DT will learn that an XX% failure rate is normal and will stop alerting you as long as the level stays inside the baseline. Any violation of the baseline will still be alerted. So it's not that DT does not work from day 1.
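The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the general idea of an adaptive baseline, not Dynatrace's actual algorithm; the function name `is_anomalous` and the parameters `min_samples`, `tolerance` and `bootstrap_limit` are made up for the example.

```python
from statistics import mean, pstdev

def is_anomalous(history, observed, min_samples=24, tolerance=3.0,
                 bootstrap_limit=0.05):
    """Return True if the observed failure rate should raise an alert.

    history  -- past failure rates (0.0 to 1.0), one per interval
    observed -- failure rate of the current interval
    All thresholds here are hypothetical illustration values.
    """
    if len(history) < min_samples:
        # Too little data to know what is normal: use a strict fixed limit.
        return observed > bootstrap_limit
    baseline = mean(history)
    # Keep a minimum spread so a very stable history does not alert on noise.
    spread = max(pstdev(history), 0.01)
    return abs(observed - baseline) > tolerance * spread

# Day 1: no history, so a 50% failure rate looks like a problem.
print(is_anomalous([], 0.5))            # True

# After some days of steady ~50% failures, 50% is treated as normal,
# while a jump to 90% still violates the learned baseline.
learned = [0.50, 0.49, 0.51] * 10
print(is_anomalous(learned, 0.5))       # False
print(is_anomalous(learned, 0.9))       # True
```

The point of the sketch is that alerting works from the first interval; what changes over time is only how well the limit matches what is normal for that service.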

Regards, Sebastian

Joe_Hoffman · ‎20 Aug 2019

Thanes, perhaps you can let us know which other tool your client is currently using, as this could influence your best approach. For example, can both tools be run in parallel? If so, this would allow you to run DT and get the baseline established before removing the old tool.
Another option is to set manual thresholds in DT that match what you've found in the old tool. I suspect this would give you alerting no different from what they already have with the old tool.

thanes_bala · ‎20 Aug 2019

Thanks @Joseph M. H.

The other tool is Introscope. Can Introscope and DT co-exist in a critical Prod environment? Or could we deploy DT on a subset of Prod nodes (say, 2 nodes out of an 8-node cluster) for 3 weeks to let the AI engine baseline the system, and then enable the remaining nodes? Will this strategy work?

If we set manual thresholds to recreate the existing alerts:

Will the AI engine continue to learn while the static thresholds are set?
What happens to new alert types in DT that are not configured in Introscope? Will those get triggered? If yes, is there any way to suppress them until DT has optimally adapted the baseline? This is to avoid false alerts and eliminate noise.
After the 3-week period, if all static thresholds are removed, will DT optimally identify incidents?

Thanks

skrystosik · ‎20 Aug 2019

Setting static thresholds doesn't make sense; after a few hours DT will set them automatically based on the traffic it has seen so far. Time is needed to build baselines on a weekly basis, which cover variations in traffic by day of the week and time of day. Davis will work properly from day one, but as I said, some alerts may be too sensitive at the start. You can override such settings manually while observing DT after deployment.
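To illustrate why a weekly baseline needs at least a full week of data, here is a small sketch. It assumes, purely for the example, that the baseline is kept per (day of week, hour of day) slot; that model and the function name `weekly_slots_covered` are assumptions, not a description of how Dynatrace stores baselines.

```python
from datetime import datetime, timedelta

def weekly_slots_covered(start, hours_observed):
    """Count distinct (weekday, hour) slots seen after monitoring for
    `hours_observed` consecutive hours starting at `start`."""
    seen = set()
    for i in range(hours_observed):
        t = start + timedelta(hours=i)
        seen.add((t.weekday(), t.hour))
    return len(seen)

start = datetime(2019, 8, 19)  # a Monday, chosen only as an example
print(weekly_slots_covered(start, 24))       # 24  (of 168 slots, after one day)
print(weekly_slots_covered(start, 7 * 24))   # 168 (every slot seen once)
```

After a few hours there is already enough data for a usable threshold, but a "Saturday 3 am" slot cannot have a learned value until a Saturday 3 am has actually been observed.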

Regards, Sebastian

thanes_bala · ‎20 Aug 2019

@sebastian k.

These are critical applications. The business owners are not keen on enabling a new monitoring tool that may or may not fire the alerts Ops is familiar with. Before giving their approval, they want assurance in the form of demonstrated parity in incident identification. That's why I was asking for advice on overcoming this hurdle. These specific questions are meant to identify a potential workaround.

Joe_Hoffman · ‎20 Aug 2019

Introscope doesn't have dynamic baselining (at least as far as I last knew). So everything they're currently alerted on is based on static thresholds, which likely produce lots of errors and false positives. They're trying to hold onto a 'truth' that's really not accurate and certainly not as business relevant as the alerting they'll get from DT. As Sebastian mentioned, within a few hours DT will already be baselining application behavior and adding more value than they currently get from Introscope.

It is not supported to run Introscope and DT at the same time on the same JVM, but you can run OneAgent on the same host as the Introscope Java agent, as long as you're not injecting into the JVM with both Java agents. For example, you could put OneAgent on all hosts but disable deep Java injection on selected JVMs to leave Introscope in there. However, this would cause problems with PurePath continuity and limit what you're seeing, so it's not ideal. Given the effort and confusion of this hybrid approach, it would be much quicker and simpler to just put OneAgent in there. You'll start to see Davis output very quickly. I also suggest you consider engaging the DT1 team or the services team to look at alternative approaches to your transition issue.

thanes_bala · ‎20 Aug 2019

Thanks @Joseph M. H.

How about enabling DT on a subset of nodes and letting it train for 3 weeks, until all baselined thresholds reach an optimal level, and then deploying to all nodes? Will this approach work?

Thanks, and sorry if I'm asking too many questions.

Joe_Hoffman · ‎20 Aug 2019

Putting OneAgent on a subset of nodes might work, but there are also a few problems with that approach.

1) If you have a service that runs only on hosts where you did NOT put OneAgent, then OneAgent will never know about that service and never baseline it until you put OneAgent on all hosts. If one of your existing Introscope alerts is based on metrics coming from this service, then this would not seem to satisfy the business owners' demands.

2) PurePaths will likely break, as DT cannot follow all transactions through the architecture.

3) Downstream services (which are not entry points, such as WebRequests) will not be reported if they're called from non-instrumented entry point services. This would further lead to a patchwork of coverage.

So as you can see, there are numerous problems with the hybrid approach, although if it satisfies the business owners, it might be worth trying.
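Point 1 above can be checked on paper before rollout. The sketch below is a plain set computation, not a Dynatrace API; the service names, node names and the function `visible_services` are all hypothetical and stand in for your own inventory.

```python
def visible_services(service_hosts, instrumented):
    """Return the services that run on at least one instrumented host.

    service_hosts -- dict mapping service name to the set of hosts it runs on
    instrumented  -- set of hosts where the agent is installed
    """
    return {svc for svc, hosts in service_hosts.items() if hosts & instrumented}

# Hypothetical 8-node cluster with the agent on 2 nodes.
service_hosts = {
    "checkout":  {"node1", "node2", "node3"},
    "search":    {"node2", "node4"},
    "reporting": {"node7", "node8"},
}
instrumented = {"node1", "node2"}

seen = visible_services(service_hosts, instrumented)
print(sorted(seen))                        # ['checkout', 'search']
print(sorted(set(service_hosts) - seen))   # ['reporting']
```

Any service in the second list would get no baseline during the trial period, so an existing alert that depends on it could not be matched until the remaining nodes are instrumented.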

thanes_bala · ‎21 Aug 2019

Thanks all!