Dynatrace AppMon 6.5 Documentation



This page describes how AppMon handles availability and the measures you can take to increase the availability of a deployment. It also describes high-availability strategies that protect AppMon against data loss caused by hardware or software faults.

Availability is an important consideration in application deployment, especially in production environments. The options for increasing availability range from simple and inexpensive backup and disaster recovery strategies to sophisticated use of Collector fail-over, fail-over clusters, or both.

Note


Implementing a high-availability AppMon deployment is not required for all customers. In a typical QA environment and in most production environments, it is acceptable to lose monitoring data in the rare case of a hardware fault. In addition, AppMon has a built-in watchdog mechanism that automatically brings the system up again after a software fault, and Agents can fail over to a redundant Collector in a designated group.

Availability of System under Diagnosis Versus AppMon Availability

The availability of the AppMon solution and the availability of the System Under Diagnosis (SUD) are not intrinsically related, which means that an AppMon failure does not typically cause the SUD to fail and vice versa. This is an important benefit of loosely coupled AppMon Agents.

Availability of System Components

The AppMon system architecture implements loose coupling between the components (see diagram above). This guarantees that a failing AppMon Agent will not take down the AppMon Collector or the AppMon Server. Similarly, a failing AppMon Client will not cause an AppMon Server failure. All components automatically reconnect after a failure. Therefore, it is important to focus on the availability of individual AppMon components.

SUD and AppMon Agent

Because the AppMon Agent runs in the same process as the SUD, it is possible that an Agent failure can take down the SUD. Note that this is the only AppMon component that can take the SUD down.

To minimize this risk as much as possible, the following measures are taken:

  • The Agent is a very thin software layer. As much work as possible is done by the Collector and Server.
  • The Agent usually fails gracefully. For example, if the connection to the Collector / Server fails, the Agent simply skips application events or at worst fails to instrument the SUD. In either case, this will not cause failure of the SUD.
  • Significant testing is done by the AppMon QA Team to ensure the reliability of this component.
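
The graceful-failure behavior described above can be illustrated with a minimal sketch. This is not AppMon Agent code: the `SafeAgent` class, its `record` method, and the `broken_transport` function are hypothetical names that only show the general pattern of dropping monitoring events instead of letting an error reach the monitored application.

```python
from typing import Callable


class SafeAgent:
    """Hypothetical agent wrapper: monitoring errors never reach the application."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send   # hypothetical transport to a Collector
        self.skipped = 0    # number of events dropped because sending failed

    def record(self, event: dict) -> None:
        try:
            self._send(event)
        except Exception:
            # Skip the event instead of propagating the error into the SUD.
            self.skipped += 1


def broken_transport(event: dict) -> None:
    """Simulates a lost connection to the Collector."""
    raise ConnectionError("collector unreachable")


agent = SafeAgent(broken_transport)
agent.record({"name": "checkout", "duration_ms": 12})  # does not raise
```

In this sketch the application call returns normally even though the transport is down. The only visible effect is the incremented `skipped` counter, which corresponds to missing monitoring data.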

If you need high availability (HA), such as fail-over support (FO) for the SUD, you can take any measure supported by the SUD to ensure its health (beyond the scope of this page).

Summary: SUD and Agent Availability

| Risk | Consequences | AppMon Bonus (/ Malus?) |
|---|---|---|
| Hardware Failure | SUD will be unavailable | AppMon monitoring will alert you |
| Software Failure of SUD | SUD will be unavailable; Agent could be unavailable | At least missing AppMon data will alert you |
| Software Failure of Agent | Agent will be unavailable; SUD could be unavailable | Even in this one case that AppMon has an adverse effect on the SUD, missing AppMon data will alert you |

AppMon Collector

The Collector collects data such as measurements and PurePath-related events from the Agents and sends this data to the Server. If a Collector fails due to a hardware or software failure, the Agents buffer data for a period ranging from a couple of seconds to a minute, depending on load. As a result, no data is lost if the Collector is restarted within this time.

In a production environment, you should use more than one Collector for Agents of the same type (Agent Group / tier) and configure Collector groups.
If the Collector comes back up within a minute, the Agents automatically reconnect to the Collector, and the Collector reconnects to the Server.
If it does not, the Agents can fail over to a different Collector in the Collector group.
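
The combination of Agent-side buffering and fail-over within a Collector group can be sketched as follows. This is an illustrative model only, not the AppMon protocol: the `Collector` and `Agent` classes, the `buffer_size` parameter, and the event strings are all assumptions made for the example.

```python
from __future__ import annotations

from collections import deque


class Collector:
    """Hypothetical stand-in for a Collector endpoint."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.received: list[str] = []

    def send(self, event: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.received.append(event)


class Agent:
    """Hypothetical agent that buffers events and fails over within a group."""

    def __init__(self, group: list[Collector], buffer_size: int = 1000) -> None:
        self.group = group
        # Bounded buffer: once it is full, the oldest events are dropped.
        self.buffer: deque[str] = deque(maxlen=buffer_size)

    def emit(self, event: str) -> None:
        self.buffer.append(event)
        self.flush()

    def flush(self) -> None:
        for collector in self.group:
            try:
                while self.buffer:
                    collector.send(self.buffer[0])
                    self.buffer.popleft()  # remove only after a successful send
                return
            except ConnectionError:
                continue  # try the next Collector in the group


primary = Collector("collector-a", up=False)
backup = Collector("collector-b")
agent = Agent([primary, backup])
agent.emit("purepath-1")  # delivered to collector-b because collector-a is down
```

If every Collector in the group is down, the events stay in the bounded buffer until a later `flush` succeeds or until newer events push them out, which mirrors the limited buffering window described above.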

The Collector uses self-monitoring with an integrated software watchdog to detect fatal software problems. For issues such as out-of-memory conditions or hanging threads, the Collector process is restarted automatically. This feature increases availability whether or not you plan to use clustering techniques for high availability, because cluster software has only a very limited chance of detecting such problems.

The time necessary for the watchdog to detect software problems ranges from immediate (for example, out-of-memory) to a couple of minutes (hanging threads).
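
The difference between immediate and delayed detection can be shown with a small sketch of a watchdog decision. It is a simplified model rather than the actual AppMon watchdog: the `ProcessState` fields, the `needs_restart` function, and the 120-second `hang_timeout` default are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessState:
    exit_code: Optional[int]  # None while the process is still running
    last_heartbeat: float     # time (in seconds) of the last liveness signal


def needs_restart(state: ProcessState, now: float, hang_timeout: float = 120.0) -> bool:
    """Return True if the monitored process should be restarted."""
    if state.exit_code is not None:
        # The process has terminated (for example after an out-of-memory
        # error), so the problem is detected immediately.
        return True
    # A hang only becomes visible once heartbeats have been missing
    # for longer than the timeout.
    return now - state.last_heartbeat > hang_timeout
```

A terminated process triggers a restart on the next check, whereas a hanging process is only restarted after the timeout has elapsed, which is why hang detection takes longer.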

Summary: Collector Availability

| Risk | Consequences | Precautions |
|---|---|---|
| Hardware Failure, Software Failure of the Collector | PurePaths and measurements of connected Agents and of Collector plugins will be missing for the period in which the Agents try to reconnect to the Collector and cannot buffer the data. | Make the Collector highly available by creating Collector groups and providing headroom and redundancy in Collectors |

AppMon Server

If the Server fails due to a hardware or software failure, the Collectors buffer data for a period of time, ranging from 30 seconds to a couple of minutes depending on load and Collector heap configuration. As a result, no PurePaths or measurements are lost if the Server is started again within this period.

The Server uses self-monitoring with an integrated software watchdog to detect fatal software problems. For issues such as out-of-memory conditions or hanging threads, the Server process is restarted automatically. This feature increases availability whether or not you plan to use clustering techniques for high availability, because cluster software has only a very limited chance of detecting such problems.

The time necessary for the watchdog to detect software problems ranges from immediate (for example, out-of-memory) to a couple of minutes (hanging threads).

Summary: Server Availability

| Risk | Consequences | Precautions |
|---|---|---|
| Hardware Failure, Software Failure of the Server | PurePaths and measurements of connected Collectors that could not be buffered will be missing. | Make the Server highly available |

AppMon Frontend Server

The Frontend Server does not require special availability precautions, but it plays an important role among the components. It frees the Server from having to provide analysis data to the Clients, giving the Server more headroom for PurePath correlation and protecting it from potentially harmful queries.

AppMon Client

The AppMon Client is a much less critical component than the Agent, Collector and Server. It is not necessary to implement high availability for the Client.

Performance Warehouse

All Measures and Incidents are stored in the Performance Warehouse RDBMS. Therefore, if high availability of the AppMon solution is a priority, it is extremely important to plan for the availability of this database and to define a backup and disaster recovery strategy for it.

Note


The AppMon Server buffers data for up to one hour (memory permitting) if the Performance Warehouse is not available. After this period of time has elapsed, the data is removed from the buffer.
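
The time-limited buffering described in this note can be modeled with a short sketch. It does not reflect how the AppMon Server is implemented: the `ExpiringBuffer` class and its methods are hypothetical, the one-hour default is taken from the note above, and the "memory permitting" limit is not modeled.

```python
from collections import deque


class ExpiringBuffer:
    """Hypothetical buffer that discards records older than max_age_s seconds."""

    def __init__(self, max_age_s: float = 3600.0) -> None:
        self.max_age_s = max_age_s
        self._items = deque()  # (timestamp, record) pairs, oldest first

    def add(self, record, now: float) -> None:
        self._items.append((now, record))
        self._expire(now)

    def drain(self, now: float) -> list:
        """Return all records still inside the retention window and clear the buffer."""
        self._expire(now)
        records = [record for _, record in self._items]
        self._items.clear()
        return records

    def _expire(self, now: float) -> None:
        # Drop records from the front while they are older than the window.
        while self._items and now - self._items[0][0] > self.max_age_s:
            self._items.popleft()


buf = ExpiringBuffer()
buf.add("measure-1", now=0.0)
buf.add("measure-2", now=1800.0)
recovered = buf.drain(now=3700.0)  # the database becomes available again
```

In this example the first record is older than one hour when the database returns, so only the second record is written. This is the kind of loss a database outage longer than the buffering window would cause.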