Thank you for the interest in Dynatrace continuous performance management for hybris.
We will do our best to answer all your questions!
 

 


Performance and User Experience Management for SAP Hybris

Since 2011, Dynatrace has provided the only certified application performance management solution for hybris-based eCommerce environments. The partnership between Dynatrace and SAP Hybris has a long and successful history, spanning development, professional services, support, and SAP Hybris' own cloud & managed services offering. Together we defined best practices for continuous performance management. With the knowledge and expertise of Hybris and Dynatrace experts, we've created the easiest-to-use yet most comprehensive end-to-end performance management tool for Hybris eCommerce environments!

No other APM solution provides as comprehensive an insight into all important aspects of your platform as Dynatrace does.

Key benefits:

 


Included Analysis and Monitoring Dashboards

These dashboards come out of the box with the latest hybris monitoring fastpack. Of course you can create your own dashboards and customize these to your needs.

Now 16 out of the 23 included dashboards are also available as Web dashboards.

Page Class Performance

This dashboard shows the auto-detected pages of your hybris ECP environment. On the left side you will see the average server-side response time of individual pages. This is the time it takes the server to process a page and deliver it to the end user. It does not include any client-side processing or loading of additional resources. Ideally this is very fast (below 1 second), especially for the most frequently called pages. The number of times each page is accessed is shown on the right. The table below shows the same metrics for different timeframes, plus error rates for these pages. To go into the details of a specific page type, right-click on a line in the table and drill down to the various analysis functions.

 How to Analyse Page Performance ...

This dashboard is the ideal entry point for dedicated analysis of specific pages. Focus on pages based on these criteria:

  • slowest average response time
  • high count of invocation
  • important landing pages (HomePage, ProductPage, CheckoutPage, ...)

If you are working on a production environment, make sure you are looking at a representative timeframe (e.g. the last 6 hours during peak times). If you ran a load test, adjust the dashboard timeframe setting to the time of the load test.

Step 1: set the timeframe

Use the link at the top of the dashboard to bring up the timeframe configuration, then select a predefined timeframe or use custom to enter your own setting.

Step 2: response time hotspot analysis

In the lower table of the dashboard, select the page you are interested in (e.g. CheckoutStepPage) and bring up the context menu by right-clicking. The best analysis to start from here is the 'Response Time Hotspot' analysis: select 'Drill Down' and choose 'Response Time Hotspot'. This will start an analysis of all page transactions in that timeframe and break them down by API, application server and type of resource usage (CPU, Wait, Sync, Suspension (GC) and I/O Time).

Once the analysis is finished you might see a result like the one below. On the left side you see that the hotspots are not really related to one specific application server. All servers show a similar distribution: a high GC part (yellow) and a high Wait part (medium blue). On the right side you see a breakdown by application layer (API) with an obvious hotspot in the 'Datacash' layer. With the background knowledge that 'Datacash' is a payment provider service that is called remotely, we know that the page is slow because it is talking to that service. We can get more details by clicking on that hotspot: we will see the exact stack where the webservice is being called and can even drill down into individual transactions (PurePaths).

Database Impact by Page

One of the most common performance problem patterns is the 'Too many Database Calls' issue. Often the DB admin is blamed when in reality the problem lies in bad software design. This dashboard lets you easily identify whether your hybris environment is affected by this problem pattern. The chart is split into 3 areas which display the maximum, average and minimum number of database statements per page type. Hover over the big bars in the charts to identify problematic pages with a high number of statements. The different aggregations let you identify whether the high number of statements is constant or just an occasional deviation.

A high number of statements per page could also indicate:

  • potentially cache invalidations due to cron jobs
  • data driven problems
 How to Identify Pages with High Database Impact ...

The dashboard can get hard to read, especially if lots of different pages are detected. To focus your analysis you can apply filters to show metrics only for certain pages. For example, you may want to show only the most frequent pages that you have identified using the Page Class Performance dashboard.

Step 1: filter metrics by specific pages

To change the filter click on the Business Transaction filter link shown on the dashboard. This will bring up a dialog to edit the existing filter or add a new one.

Select the 'Page Types - DB Usage' business transaction and then add/remove the available splittings (autodetected) to the selection. You can also define a simple pattern match to include more than one splitting.

Apply the filter and the dashboard will change and only show metrics for the specific pages you selected.

Step 2: analyze the minimum, average and maximum statements per transaction

With the filtered dashboard it's now very easy to identify if there are certain times when pages have too many DB calls, or if their database usage is generally too high. Hover over the charts to see which page at which time uses database resources.

Especially look out for:

  • Constantly high average number of statements per page. This usually indicates an architectural problem that needs further investigation. As a rule of thumb: 100 statements on average are OK; if you are in the several hundreds or even thousands, you have a problem.
  • Points in time where the number of statements per transaction changes significantly from normal behaviour. This is usually related to other actions such as cron jobs or cache invalidations. It could also mean that an application server has been restarted and/or caches have been cleared.
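
The two patterns above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not something the fastpack provides: the function names, the sample data and the cut-off of 300 statements for "several hundreds" are assumptions.

```python
from statistics import mean

# Thresholds derived from the rule of thumb in the text.
# OK_AVG comes from the text; PROBLEM_LEVEL is an assumed reading of "several hundreds".
OK_AVG = 100
PROBLEM_LEVEL = 300


def summarize(statement_counts):
    """Return (min, avg, max) of statements per transaction for one page type."""
    return min(statement_counts), mean(statement_counts), max(statement_counts)


def classify_page(statement_counts):
    """Classify a page type by its database statements per transaction."""
    _, avg, highest = summarize(statement_counts)
    if avg >= PROBLEM_LEVEL:
        # Constantly high average: likely an architectural problem.
        return "constantly high"
    if avg <= OK_AVG and highest >= PROBLEM_LEVEL:
        # Normal average with rare outliers: check cron jobs,
        # cache invalidations or server restarts at those times.
        return "occasional spikes"
    if avg <= OK_AVG:
        return "ok"
    return "elevated"


if __name__ == "__main__":
    # Hypothetical per-transaction statement counts for three page types.
    samples = {
        "HomePage": [40, 60, 80],
        "ProductPage": [500, 700, 900],
        "CategoryPage": [50] * 19 + [1000],
    }
    for page, counts in samples.items():
        print(page, classify_page(counts))
```

Comparing the average against the maximum mirrors the way the dashboard separates its minimum, average and maximum charts.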

 

 

Request Balancing

This dashboard provides a visual representation of load balancing between application servers and webservers. In a well-balanced environment the application server balancing chart (top left) will show an equally distributed percentage of requests between application servers. Note that an unequal distribution does not always mean a wrong balancing configuration on the webservers. The chart also makes it easy to detect a large number of requests coming from the same source (e.g. an attacker or a scripted test that is routed to one server). Also shown are the number of threads per application server and the CPU usage.

The bottom part of the dashboard shows balancing metrics for the webservers plus their busy threads and transfer rate.
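
As a rough sketch of what "equally distributed" means for the application server balancing chart, the following Python snippet computes each server's share of requests and flags servers that deviate from an even split. The server names, request counts and the 10 percentage point tolerance are assumptions for illustration; they are not values defined by Dynatrace or hybris.

```python
def request_shares(requests_per_server):
    """Return each server's fraction of the total request count."""
    total = sum(requests_per_server.values())
    if total == 0:
        return {name: 0.0 for name in requests_per_server}
    return {name: count / total for name, count in requests_per_server.items()}


def unbalanced_servers(requests_per_server, tolerance=0.10):
    """List servers whose share differs from an even split by more than tolerance."""
    if not requests_per_server or sum(requests_per_server.values()) == 0:
        return []
    shares = request_shares(requests_per_server)
    expected = 1 / len(shares)
    return sorted(
        name for name, share in shares.items() if abs(share - expected) > tolerance
    )


if __name__ == "__main__":
    # Hypothetical request counts per application server.
    counts = {"app1": 800, "app2": 100, "app3": 100}
    print(request_shares(counts))
    print(unbalanced_servers(counts))
```

A flagged server is only a starting point: as noted above, a skewed distribution can also come from a single source sending many requests.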

Session Counters

This dashboard tracks the overall Real-User Sessions and HTTP sessions per application server. This is displayed in the upper left chart. Note that the session duration measured by Dynatrace should be aligned with the session timeout setting in hybris. The other charts show JALO sessions and sessions for various backend applications. These can be configured depending on individual needs.

Cron Jobs Execution Dashboard

The Job Invocations chart (top chart) shows how often different jobs are executed. Hover the mouse over an area of the chart and the context menu will show the name of a job and how often it has been executed in a specific timeframe. As cron jobs can have a big impact on the database, it's important to know how heavily specific cron jobs affect the DB. The 'DB Calls per Job' chart (middle) tells you how many database statements are being executed by a specific job. Hover over an area and you will see the number of statements and which job is executing them. In the table below you will see which jobs executed how often and whether there were errors. To analyze a specific execution of a job, right-click on a line and drill down to further analysis functions.

 How to Analyse Cron Job Execution ...

Once you know which cron jobs are running, the first thing to check is where they are running and how often they are executed. Usually there is a dedicated batch or admin server that executes these jobs and does not serve frontend load.

Step 1: confirm jobs are executed on a dedicated node

By default the dashboard doesn't show where jobs are executed, because this is a very basic setting that is usually done right and rarely leads to problems. But just to be sure, check the job execution server. To do so, change the splitting visualisation on the 'Job Invocations' chart and select 'Split by Agent Host'. This will change the chart visualisation to identify the executing node. Once that is confirmed, change back to 'No Splitting'.

Step 2: check the frequency and number of job executions

As jobs require resources and - even more important - can affect end-user request handling their execution frequency should be kept to a minimum. Of course this has to be aligned with other (non-technical) requirements.
The best way to visualise the execution frequency of certain jobs is to edit the filter of the dashboard. This removes all the 'noise' of less-relevant jobs. To do so edit the Business transaction filter of the dashboard and only select the jobs that are of interest.

Now the dashboard will only display metrics of the selected jobs, making it easy to identify when exactly and at what interval the jobs are executed. To simply get the number of job executions look at the table at the bottom of the chart. This will show the number of executions for today, yesterday and this week. These numbers might already tell you if some jobs are running too often.

 

Also note the chart in the middle, which shows the number of database statements executed by individual jobs. Make sure that you are not running database-heavy jobs while your frontend application requires high database usage, as this could easily create a bottleneck on database resources.

Step 3: diagnosing a job execution

Sometimes you need to diagnose the execution of a single job, for example to find out whether there are any errors, which exact database statements a job is executing, or whether a job is communicating with an external service.
From the table at the bottom of the dashboard, right click on a job code and select 'Drill Down' and 'PurePaths' (or any other specific analysis function). This will bring up the individual transactions of job invocations with details on execution for further analysis.

A single execution of a job might look like this and reveal slow database statements, external service calls, exceptions, logs and much more!

 

 

Orders and Sales Volume

This dashboard will automatically pick up all orders and the sales volume of these orders. If you are running an international/multi-currency eCommerce store, it will also split orders and sales volume by currency.

If your Dynatrace setup has User Experience Management enabled you will also get conversion rate and bounce rate charts populated.

On the bottom charts you will see order and sales volume metrics of the last 7 days.

External Webservices

Requests to external web services (like search providers, payment services, ...) are an important aspect to keep an eye on when monitoring a hybris environment. This Dashboard does some basic analysis to measure the impact of those services. Dynatrace will automatically detect the used web services.
On the left side of the dashboard the chart shows the time spent on web services per transaction (min, avg, max, 90th percentile). On the right side you will see the number of invocations (min, avg, max) per transaction.
Those two metrics in combination allow you to identify if the web service is generally slow (high numbers on the left chart but low number of calls on the right side chart) or if there is an architectural issue with too many calls to the web service per transaction (high numbers on the right side chart and thus likely also high numbers on the left side chart).
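
The reasoning above (slow service versus too many calls) can be expressed as a small decision function. The following Python sketch is illustrative only: the function name and both thresholds (500 ms per call, 10 calls per transaction) are assumptions, not values taken from Dynatrace or hybris.

```python
def diagnose_web_service(avg_time_ms_per_tx, avg_calls_per_tx,
                         slow_call_ms=500.0, max_calls_per_tx=10):
    """Combine the two dashboard metrics into a rough diagnosis.

    avg_time_ms_per_tx: average time spent on the web service per transaction
    avg_calls_per_tx:   average number of invocations per transaction
    """
    if avg_calls_per_tx <= 0:
        return "not called"
    # Time per single call separates a slow service from a chatty caller.
    time_per_call_ms = avg_time_ms_per_tx / avg_calls_per_tx
    too_many_calls = avg_calls_per_tx > max_calls_per_tx
    slow_service = time_per_call_ms > slow_call_ms
    if too_many_calls and slow_service:
        return "slow service and too many calls"
    if too_many_calls:
        return "too many calls per transaction"
    if slow_service:
        return "slow service"
    return "ok"


if __name__ == "__main__":
    # Hypothetical measurements: (time per transaction in ms, calls per transaction)
    print(diagnose_web_service(1200, 1))   # one slow call
    print(diagnose_web_service(1500, 50))  # many fast calls
    print(diagnose_web_service(100, 2))    # unremarkable
```

Dividing the time per transaction by the calls per transaction is what makes the two cases distinguishable, since both can show high numbers on the left chart.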

 How to Analyse Webservice Performance ...

There are different ways to continue from this dashboard. Depending on your use case you might want to find out who is using a webservice and where, or which endpoints are being called.

Step 1: identify the endpoint(s) of the webservice

To get more details on a specific webservice, right-click anywhere on a measure area on the right chart and drill down to 'Webservices'. This will analyze all transactions which use this webservice in the dashboard's timeframe and present a breakdown of called endpoints, number of calls and timing information.

Step 2: Investigate individual transactions

From the Webservice dashlet you can drill down to further analysis dashlets. For example, if you are seeing a high number of calls to the webservice per transaction, you might want to investigate where in the transaction these calls are made. Show all transaction details by right-clicking on the webservice endpoint entry and selecting 'Drill Down' and then 'PurePaths'.

External Web Requests

Requests to external services (like search providers, payment services, ...) are an important aspect to keep an eye on when monitoring a hybris environment. This dashboard does some basic analysis to measure the impact of those services. Dynatrace will automatically detect the external services. On the left side of the dashboard the chart shows the time spent on handling external requests per transaction (min, avg, max, 90th percentile). On the right side chart you will see the number of invocations (min, avg, max) of the external service per transaction. Those two metrics in combination allow you to identify if the external service is generally slow (high numbers on the left chart but low number of calls on the right side chart) or if there is an architectural issue with too many calls to an external service per transaction (high numbers on the right side chart and thus likely also high numbers on the left side chart).

Webrequest Distribution

This dashboard provides an easy-to-understand visual representation of all requests passing through the hybris environment. These charts show the number of requests categorized by server-side response time. There are 4 different buckets with color coding:

  • green: requests with a response time faster than 1 second
  • yellow: requests between 1 and 3 seconds
  • orange: requests between 3 and 5 seconds
  • red: all requests slower than 5 seconds.

Note that the chart has a logarithmic y-axis to make the slow requests more visible. Of course you want to have as many requests as possible in the green category. This chart can be used to easily identify whether a special event has impacted general performance (e.g. compare the visual after a new release) or whether there are recurring impacts (note the yellow spikes in the chart). It also allows you to narrow the timeframe for further investigation.
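
The four buckets can be reproduced with a short Python sketch, for example to apply the same categories to response times exported from another source. The function names and sample values are made up for illustration, and the handling of values that fall exactly on a boundary (1, 3 or 5 seconds) is an assumption, since the description above does not specify it.

```python
from collections import Counter

BUCKETS = ("green", "yellow", "orange", "red")


def bucket_for(response_time_s):
    """Map a server-side response time in seconds to a color bucket.

    Boundary handling is an assumption: 1 s counts as yellow,
    3 s as orange, and exactly 5 s still as orange.
    """
    if response_time_s < 1:
        return "green"
    if response_time_s < 3:
        return "yellow"
    if response_time_s <= 5:
        return "orange"
    return "red"


def bucket_counts(response_times_s):
    """Count requests per bucket, always returning all four buckets."""
    counts = Counter(bucket_for(t) for t in response_times_s)
    return {color: counts.get(color, 0) for color in BUCKETS}


if __name__ == "__main__":
    # Hypothetical response times in seconds.
    print(bucket_counts([0.2, 0.9, 1.0, 2.5, 3.0, 5.0, 7.2]))
```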

Webrequest Performance

The Webrequest Performance dashboard displays the distribution (satisfied, tolerating, frustrated) of the User Experience Index (based on APDEX) for page loads on the top chart. Along with that distribution you will see the number of real user visits accessing the site. Note that these metrics are only available if you are using the user experience module. The middle chart displays the number of page impressions per time unit. This should help you size your environment according to hybris' best practices. The lower chart is an API breakdown which tells you in which application component most of the execution time is spent. The fastpack comes with predefined APIs for various hybris internals, but it will also detect custom API code. Hover over the big areas in this chart to identify the API and learn where most of your execution time is spent.

Java Garbage Collection

Garbage Collection is an important aspect of every hybris application server. Tuning the Garbage Collector is a craft of its own and requires considering multiple factors. While you can monitor garbage collection performance with various tools and GC logs, it's hard to tell how GC runs impact transactional performance. The top chart in this dashboard shows the impact of GC on transactions. It's a rate measure based on transaction processing time with and without garbage collection contribution.

Dynatrace knows how long a single transaction is suspended due to Garbage Collection pauses, so the GC impact rate can be calculated as (PurePath Time w/o GC time / PurePath Time w/ GC time). Ideally this would be close to 1 (or 100%), meaning that GC has no impact at all. In our experience, values between 90% and 100% are healthy, values between 75% and 90% indicate unhealthy GC behavior, and everything below 75% is critical.
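
The rate and the three health ranges can be written down as a minimal Python sketch. The function names are hypothetical, and treating exactly 90% as healthy and exactly 75% as unhealthy is an assumption, because the ranges above overlap at their boundaries.

```python
def gc_impact_rate(purepath_time_ms, gc_suspension_ms):
    """Return (time without GC) / (time with GC) for one transaction.

    purepath_time_ms: total transaction time including GC suspension
    gc_suspension_ms: part of that time the transaction was suspended by GC
    """
    if purepath_time_ms <= 0:
        raise ValueError("purepath_time_ms must be positive")
    if not 0 <= gc_suspension_ms <= purepath_time_ms:
        raise ValueError("gc_suspension_ms must be between 0 and purepath_time_ms")
    return (purepath_time_ms - gc_suspension_ms) / purepath_time_ms


def classify_gc_health(rate):
    """Apply the ranges from the text: 90-100% healthy, 75-90% unhealthy, below 75% critical."""
    if rate >= 0.90:
        return "healthy"
    if rate >= 0.75:
        return "unhealthy"
    return "critical"


if __name__ == "__main__":
    # Hypothetical transaction: 1000 ms in total, 50 ms of it suspended by GC.
    rate = gc_impact_rate(1000, 50)
    print(f"{rate:.0%}", classify_gc_health(rate))
```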

 More on Java GC analysis...

If you see a chart like the one in the screenshot here, you are probably encountering a load-dependent situation where the GC kicks in at times of increased visitor numbers. So it's always a good idea to align load metrics (like concurrent visits) with GC impact.

Java Memory Pools

Checking how your JVMs are doing memory-wise is one of the very basic monitoring steps. This dashboard shows the different memory pools of all application server JVMs to highlight if there are any bottlenecks or load related memory consumption issues.

User Experience Page Class

This dashboard shows the average perceived render time as experienced by end users on their devices. Depending on content that's being loaded from third-party resources like CDNs, this can differ significantly from the processing time on the server side. Usually the metrics are in the range of a few seconds. Also shown is the number of times these pages are accessed. Typically you will see the Category Pages, Product Pages and Home Page as the top called pages here.

Database Performance

This is a general Database overview dashboard. On the top charts you will see the overall database executions over time, split by applications. The majority of executions will come from your public (default B2C) application, but it will also show DB statements executed from other applications like backend administration. On the right side the chart displays the overall time spent on database calls.

The chart in the middle shows the number of database calls per transaction (min, avg, max), also split by application. This helps to identify if there is a general performance anti-pattern present with too many DB calls per transaction.

The table at the bottom will show you the top database statements of the last 15 minutes. To find slow statements, sort by the Exec Max (ms) column; to find the most frequently called statements, sort by the Executions column.

Database Connection Pool Dashboard

This basic dashboard retrieves the hybris connection pool metrics from the application servers. It helps you verify the connection pool sizes for your environment. Note that when you are hitting the connection pool limit, increasing the pool size is not always the right thing to do. Before increasing the pool, check the overall usage of the database to verify whether you are facing an architectural issue with too many database calls per transaction, or whether cron jobs are using the database extensively.

End-User Performance Overview

This dashboard provides an overview of how your visitors are doing and how they perceive the performance of your site. If you only focus on the performance of the backend systems (your web and application servers), you see only half of the picture, sometimes even less.
On the top left there is a chart for the user experience index. This APDEX-based number between 0 and 1 indicates good (=1) or bad (=0) end-user performance. It should never drop below 0.75.
The 'Completed Real User Visits' chart is self-explanatory: it shows the number of completed visits over time. The chart below it, 'User Action Breakdown', visualizes exactly why it's important to also monitor performance from the end-user perspective. With rich content, third-party content and CDNs, most of the time is likely spent on the client side.
On the bottom left we chart W3C timing information as reported by modern browsers. On the bottom right you can track the impact of third-party content that is loaded from other domains or sites. This is also an important measure to keep an eye on! Quite often the excessive use of tracking tags, marketing add-ons and social media integration slows down the delivery of the actual content you want to present. Capturing these metrics also helps you choose or verify your CDN usage.
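
For readers unfamiliar with APDEX, the following Python sketch shows the standard Apdex formula: satisfied samples count fully, tolerating samples count half, and frustrated samples count zero. It is an assumption that the Dynatrace user experience index follows this formula exactly; the text only states that it is APDEX-based. The function name and sample counts are illustrative.

```python
def apdex(satisfied, tolerating, frustrated):
    """Standard Apdex score: (satisfied + tolerating / 2) / total samples."""
    total = satisfied + tolerating + frustrated
    if total == 0:
        raise ValueError("at least one sample is required")
    return (satisfied + tolerating / 2) / total


if __name__ == "__main__":
    # Hypothetical visit counts per category.
    for counts in [(80, 10, 10), (50, 20, 30)]:
        score = apdex(*counts)
        status = "ok" if score >= 0.75 else "below the 0.75 threshold"
        print(counts, round(score, 2), status)
```

With 80 satisfied, 10 tolerating and 10 frustrated visits the score is 0.85; with 50, 20 and 30 it is 0.60, which is below the 0.75 mark mentioned above.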

 How do we know how users are doing ...

On the world map you see where your end users are coming from and what their experience is when visiting your site. The size of each dot indicates how many users are accessing the site from a specific location; the color indicates how they perceive performance (green=satisfied, yellow=tolerating, red=frustrated).
Dynatrace measures this data by injecting a JavaScript library into the content that is delivered to the end user's device. This is done without any need to change code or add tags manually. When monitoring the backend servers, this user experience monitoring can be turned on or off on the fly.

Page Class Percentiles

To measure the performance of individual pages we should not only look at average response times. Percentiles are a more accurate measure, because a few very slow requests can hide behind a good average. The dashboard presents the distribution of response times over the number of requests. The upper charts are percentile charts split by page type. Hover over the lines to see the percentage of requests that are faster than a specific response time.

The charts in the middle are meant to be read top-down, with the vertical axis being time and the horizontal axis being response time. On these charts you will see three percentiles (50th, 90th, 99th) over time. Ideally all percentile lines are close to each other, which means that all requests have a similar response time.
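
To illustrate why percentiles say more than the average, here is a minimal Python sketch using the nearest-rank method. This is one common way to compute percentiles and not necessarily the method Dynatrace uses; the function name and sample data are assumptions.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical response times in ms: 98 fast requests and 2 very slow ones.
    response_times = [100] * 98 + [5000] * 2
    print("average:", mean(response_times))
    for pct in (50, 90, 99):
        print(f"{pct}th percentile:", percentile(response_times, pct))
```

In this sample the average (198 ms) looks harmless, while the 99th percentile (5000 ms) exposes the slow requests. When the 50th, 90th and 99th percentile lines are far apart like this, response times are unevenly distributed.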

The lower chart displays the number of requests per page type also in a top-down chart.
