13 Feb 2025 07:37 PM - last edited on 27 Feb 2025 11:38 AM by AgataWlodarczyk
With Valentine's Day just around the corner, Dynatrace is happy to bring to you: the Valentine's Scavenger Hunt!
This blog post will guide you through a hands-on root cause analysis exercise. Each assignment introduces a goal and the steps to follow to reach it. You can solve it on your own, follow the steps, or simply look at the solved investigation available in Security Investigator.
In the bustling city of Byteville, Valentine's Day was just around the corner. The air was filled with the scent of roses, and love was in the air—or at least, it was supposed to be. At the heart of Byteville was a quirky little company called "HeartSync," known for its innovative dating app that matched people based on their unique heartbeats.
One fateful morning, the HeartSync office was in chaos. Backend services analyzing the heartbeats were not working properly, customers were complaining, and the Dynatrace monitoring reported failures in the services running on the company's Kubernetes cluster.
Without hesitation, Cupid McHeartFace, head of the HeartSync Engineering Department, called you, the best log forensics person, in the hope of assistance: maybe you can help them analyze the logs and traces to figure out what actually happened and what caused this madness on a day filled with hopes and dreams!
The service we’re about to investigate is running as a container called “heartbeat-matcher-service” in the pod “heartbeat-matcher-service-78f6c784c9-2g77v”, which runs in the Kubernetes cluster “prod.cupid.cluster”. Inbound connections to the service are managed by Istio, which runs in a container called “istio-proxy” in the same pod. All logs from the infrastructure are ingested into Dynatrace.
All logs have the Smartscape metadata fields attached to them. All logs have the cluster name “prod.cupid.cluster” set in the k8s.cluster.name field, and pod names are in the k8s.pod.name field. Logs sent from the containers’ stdout have the container name in the k8s.container.name field. Kubernetes console output logs can be found using the dt.process.name value “Linux System”.
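To make these fields concrete, here is a minimal sketch of how they can be combined to scope a query to the application container described above (all values are taken from this description; the walkthrough builds its own queries step by step, so this is only for orientation):
fetch logs
// scope to the cluster, pod, and container described in the background
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.pod.name == "heartbeat-matcher-service-78f6c784c9-2g77v"
| filter k8s.container.name == "heartbeat-matcher-service"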
Examples of the logs we’ll be using during this hands-on are as follows:
The content of a sample Istio request log looks like this:
{
  "bytes_received": 1499,
  "hostname": "heartbeat-matcher-service-78f6c784c9-2g77v",
  "method": "GET",
  "upstream_transport_failure_reason": null,
  "response_code": 200,
  "upstream_remote_port": 8080,
  "path": "/users/henry/",
  "downstream_local_port": 8080,
  "time": "2025-02-14T05:31:11.033Z",
  "log_type": "accesslog",
  "user_agent": "heartbeat.dating.app",
  "downstream_protocol": "HTTP/1.1",
  "upsteam_protocol": "HTTP/1.1",
  "upstream_local_port": 50840,
  "downstream_tls_version": "TLSv1.3",
  "duration": 33,
  "response_flags": "-",
  "upstream_remote_address": "100.124.40.190",
  "downstream_remote_address": "10.178.165.253",
  "start_time": "2025-02-14T05:31:11.000Z",
  "bytes_sent": 0
}
A GET request is logged by the application with the following log lines:
2025-02-14T06:13:49.000Z INFO HB-CTX c.d.p.p.i.http.ApiLoggingInterceptor : Request@inbound: Method=GET RequestURI=/users/noah/ AuthHdrLen=1024
2025-02-14T06:13:49.030Z INFO HB-CTX c.d.p.p.i.http.ApiLoggingInterceptor : Response@inbound: Status=200
A POST request is logged by the application with the following log lines:
2025-02-14T06:13:50.000Z INFO HB-CTX c.d.p.p.i.http.ApiLoggingInterceptor : Request@inbound: Method=POST RequestURI=/users/mia/beats/ AuthHdrLen=1024
2025-02-14T06:13:50.071Z INFO HB-CTX c.d.p.s.d.s.i.HeartbeatAnalyzer : Analysing heartrate batch. Frames: 11, beats: 22
2025-02-14T06:13:50.071Z DEBUG HB-CTX c.d.p.s.d.s.i.HeartbeatAnalyzer : Heartbeat Fragment: 1010000010100000101000001010000010100000101000001010000010100000101000001010000010100000
2025-02-14T06:13:50.439Z INFO HB-CTX c.d.p.s.d.s.i.HeartbeatAnalyzer : Processing done, adding to matching queue, workflow id: a6402a01-0866-4272-8b30-03de86268781
2025-02-14T06:13:50.442Z INFO HB-CTX c.d.p.p.i.http.ApiLoggingInterceptor : Response@inbound: Status=200
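As a side note (not needed for the walkthrough itself): if you ever want the method, URI, or auth header length from these application log lines as separate fields, a DPL pattern along the following lines should extract them by anchoring on the literal labels in the message. The field names method, request_uri, and auth_hdr_len are just illustrative choices:
| parse content, "LD 'Method=' LD:method ' RequestURI=' LD:request_uri ' AuthHdrLen=' INT:auth_hdr_len"
Each LD matcher consumes the text up to the next literal, the same way the Heartbeat Fragment pattern later in this exercise anchors on a literal prefix.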
Mr. McHeartFace informs you that Dynatrace has created a ticket for them about a Problem that Davis® AI has detected. It seems that the failure rate for web requests in the heartbeat-matcher-service peaked somewhere between 5 AM and 6 AM (UTC). He asked if you could help them get to the bottom of this.
From the Problem description, you see that a bunch of web requests got an error 503 as a response for some reason, which triggered the problem creation. Can you find out why this has happened?
You can use the Dynatrace Security Investigator in Dynatrace Playground to conduct the log analysis.
With no time to waste, let’s start with the investigation to find the root cause of these weird events that are disrupting our service!
Let's see if we can find any logs from the cluster prod.cupid.cluster around the problem's timeframe, between 5 AM and 6 AM UTC. To do that, follow these steps:
Go to the Dynatrace Playground environment and start the Security Investigator.
Choose the + Case to start a new Investigation.
Give your case a fancy name by clicking on the current case name “Untitled case” in the top area.
Create a custom timeframe:
Open the Timeframe editor.
Choose a timeframe from 5 AM to 7 AM (UTC) yesterday morning.
Click Save.
Name the timeframe “incident timeframe”.
PS! Yes, I actually mean “yesterday.” So, if you’re doing the walk-through on 15.02.2025, set the timeframe from 2025-02-14T05:00:00Z to 2025-02-14T05:59:59Z.
PPS! Keep in mind that it’s UTC time! If you are using local time in Security Investigator, then either change the timezone in settings or translate the UTC timestamp to your local time.
Choose the “incident timeframe” you’ve just created from the custom timeframes section and click Apply to activate the custom timeframe. You will see the timeframe selected in the timeframe selector component, as shown in the image.
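If you prefer to pin the timeframe inside the query itself (in addition to, or instead of, the timeframe selector), a pair of timestamp filters should achieve the same scoping. A minimal sketch using the example timestamps from the PS above:
fetch logs
// keep only records within the incident timeframe (UTC)
| filter timestamp >= toTimestamp("2025-02-14T05:00:00Z") and timestamp <= toTimestamp("2025-02-14T05:59:59Z")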
Fetch logs for the k8s cluster called prod.cupid.cluster. You can do it by using the following DQL query:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
Nice! You got a bunch of logs as a result! Let's aggregate the results to see if all the relevant logs described in the “Background” section are available from Grail.
To do that, add the following summarize command to the query and click Run.
| summarize count(), by: { k8s.cluster.name, k8s.container.name, dt.process.name }
You should see the following result:
Right-click on the query node you’ve just created and rename the node to “Log sources”.
Cool! Logs are visible in Grail, and we can access them, so let's kick off the investigation! Let's continue and fetch istio container logs. To do that, follow these steps:
Remove the summarize command from the end of your DQL query.
Right-click on the “istio-proxy” value in the results table and choose Filter. A filter command will be added to your DQL query automatically.
Click Run to execute the constructed query.
OK, there are a lot of them, and it would be really time-consuming to go through them all manually. So, instead, let's visualize the data by viewing the response codes as metrics. This enables us to quickly understand how the response errors are distributed over time and how they relate to one another.
To do that, follow these steps:
Extract the response_code value from the Istio JSON-formatted log record by adding the following parse command to the query:
| parse content, "json{int:response_code}(flat=true)"
The command converts the log record to a JSON object and extracts the response_code value from it as an integer. The extracted value is displayed as a new column in the results table.
Create a metric from the field values by using the makeTimeseries command. Use count() as the metric, split the series by response_code, and use a 1-minute interval. The following DQL snippet can be used for this:
| makeTimeseries count(default: 0), by: response=toString(response_code), interval: 1m
The final DQL query you’ve just constructed looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.container.name == "istio-proxy"
| parse content, "json{int:response_code}(flat=true)"
| makeTimeseries count(default: 0), by: response=toString(response_code), interval: 1m
The output looks like this:
Right-click on the query node in the query tree and rename the node to “Response code distribution”.
Let's read a bit about the error itself. As described in RFC 7231, the response code 503 (Service Unavailable) indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay. In other words, a request that results in a 503 response is not the cause of the problem but only an indication that something has already broken the service.
To find out what caused the service problems, let's examine the events that occurred before the first request that received a 503 response.
To do that, follow these steps:
Remove the last line from the query that contains the makeTimeseries command.
Add a filter command to retrieve only the requests that resulted in response code 503. You can do this by adding the following command to your query:
| filter response_code == 503
Find the earliest timestamp value from all of the results. You can do this by adding the following command to your query:
| summarize timestamp = min(timestamp)
The query now returns a single min(timestamp) result: the timestamp = in front of the function gives the field a new name.
Click Run to execute the query.
Now that we have the timestamp for the first 503 response, let's construct a DQL query to fetch events that happened before that. To do this, follow these steps:
Remove the last two lines from the query that contain the filter for the 503 response code and the summarize command.
Right-click on the timestamp value in the results table and choose Timestamp filters → Earlier than. The DQL query now looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.container.name == "istio-proxy"
| parse content, "json{int:response_code}(flat=true)"
| filter timestamp < toTimestamp("2025-02-14T05:32:01.000000000Z")
Click Run to execute the DQL query.
Yeah, well. OK, we still have a lot of requests to go through, but it seems that many of them result in either 200 (OK) or 401 (Unauthorized), which might not be that interesting for us at this point.
Let's change the color of this node to blue for future reference and filter the 200 and 401 events out to see if we have anything interesting left.
To set the node color, right-click on the active node, choose Color, and select the blue color.
To create the described filter, follow these steps:
Hold down the Ctrl key (Cmd for MacBook users) on your keyboard.
Choose the values 401 and 200 from the results table.
Right-click on either of them and choose Filter.
The following filter command will be automatically added to the query:
| filter in(response_code, {200, 401})
Add a not condition to the filter you created in the last step to exclude the records containing these response codes from your results. The filtering statement should look like this:
| filter not in(response_code, {200, 401})
Click Run to execute the query.
No results … Okay, let's set the node color to orange for future reference. Right-click on the node and choose the orange color from the color menu.
OK, but maybe it is related to the requests that resulted in either 200 or 401. Let's analyze the response latency to see if something suspicious stands out by generating average and maximum response time metrics from these log records.
To do that, follow these steps:
Navigate back to the blue node you created before.
Change the parse command to extract the duration from the Istio logs as well. You can do so by using the following DPL pattern instead of the previous one:
json{string:response_code, int:duration}(flat=true)
To extract the metrics from the logs, use the following makeTimeseries command:
| makeTimeseries {avg(duration), max(duration)}, interval:1m
The final DQL query looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.container.name == "istio-proxy"
| parse content, "json{string:response_code, int:duration}(flat=true)"
| filter timestamp < toTimestamp("2025-02-14T05:32:01.000000000Z")
| makeTimeseries {avg(duration), max(duration)}, interval:1m
The result visualized as a chart looks something like this:
Right-click on the node and rename it to “response latency chart”.
Looking at the charts, nothing stands out. Everything seems to be within normal deviance thresholds. But what might have caused the incident?
When looking at the Istio log sample in the “Background” section, we see two interesting fields in the logged event: duration and start_time. Since the log records contain the response and the duration for the request, we can assume that the events are generated and logged at the end of the transaction, after the response has been received. But what time does the timestamp represent, then?
Looking at the event, it seems that the timestamp field we used earlier represents the end of the transaction (when a request has received its response). In the sample record above, for instance, start_time is 05:31:11.000Z, the duration is 33 ms, and the logged time is 05:31:11.033Z. This means that our suspicious request might have taken quite a long time to get its response and could appear in the logs after the first 503 event. To verify this theory, let’s look at the start_time field instead of timestamp and search for events based on that.
To do that, follow these steps:
Navigate back to the first orange node we created earlier.
Modify the DPL pattern in the parse command so that it extracts the start_time value as a new field, using the following DPL pattern:
json{int:response_code, timestamp('yyyy-MM-ddTHH:mm:ss.SZ'):start_time}(flat=true)
This pattern extracts the response_code and start_time values from the JSON object with integer and timestamp types, respectively.
In the filter command after the parse command, change the field name from timestamp to start_time. The final query should look like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.container.name == "istio-proxy"
| parse content, "json{int:response_code, timestamp('yyyy-MM-ddTHH:mm:ss.SZ'):start_time}(flat=true)"
| filter start_time < toTimestamp("2025-02-14T05:32:01.000000000Z")
| filter not in(response_code, {200, 401})
OK, cool, we got something! This time, the result is not empty! For some reason, there is a single event that resulted in a status 500 (Server Error)! You can take a closer look at the JSON payload by right-clicking on the content field and choosing View field details.
By now, your query tree should look something like this:
Let's try to find out what happened in our systems during this request. The easiest way to do that is to use Dynatrace's Distributed Tracing, which creates spans and traces for every request it monitors. Luckily, we have trace_ids attached to our log events.
First, let's save the trace_id from the suspicious error 500 event as evidence. You can do so by right-clicking on the trace_id value in the results table and choosing Add to evidence list → Create a new list. Name the list “errors trace”. Now we can use this trace_id to create filters on any query node.
To start searching for the log records matching this trace_id, follow these steps:
Navigate to the first node at the top of the query tree where we had all the logs from our k8s cluster.
Add a filter command to the DQL query by clicking on the “errors trace” evidence list menu, choosing Filter for, and then Filter within field “trace_id”.
The DQL query you’ve just constructed looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter matchesValue(trace_id, "*e0f28a67b2854b1fa8442d9df9f3deeb*")
A new branch is created from the first node, containing all the events from this trace, so we can dig deeper. Set the node color to purple so we can reference it later.
Enable multiline mode for the content column from the Column menu at the top of the results table to make the logs easier to read. This gives a better overview of multiline logs, such as stack traces and other longer, structured logs.
Another good option for viewing multiline logs is to open the Field details view from the field's context menu and navigate between the log records using the arrow keys on your keyboard.
Rename the node to “Failing request“ and continue analyzing the results.
When we look at what the logs for a POST request should look like in the “Background” section, they aren’t quite the same as what we see here. Especially interesting is the stack trace, which is visible in the trace. It seems we have encountered some heartbeat irregularities, which have thrown our system into an unexpected state.
We should look beyond this request and see what else our application did afterward. To see the following events in our application, let's modify the query following these steps:
Remove the last filter command from the query, the one with the trace_id in it.
Add a filter command to fetch logs only from the same pod by right-clicking on the k8s.pod.name value and choosing Filter for.
Add another filter command to see only the relevant logs from the same container by right-clicking on the k8s.container.name value and choosing Filter for.
To see only the events following this event, right-click on the timestamp field value of the first log record, choose Timestamp filters, and choose Later than.
Add a sort command to order the results by timestamp in ascending order.
The DQL query you’ve just constructed looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.pod.name == "heartbeat-matcher-service-78f6c784c9-2g77v"
| filter k8s.container.name == "heartbeat-matcher-service"
| filter timestamp > toTimestamp("2025-02-14T05:32:00.000000000Z")
| sort timestamp
The results provide some interesting information. It seems that our service received a shutdown command (graceful, I might add), and the application restarted itself.
Commencing graceful shutdown. Waiting for active requests to complete
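If you want to jump straight to the shutdown messages within this node, an optional contains() filter on the quoted text above narrows the results (a small sketch, not required for the walkthrough):
| filter contains(content, "graceful shutdown")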
Let's see what our Kubernetes cluster did at that time. Fortunately, we’re collecting those logs as well. To see the k8s console logs, follow these steps:
Copy the last two lines containing the filter and sort commands from the current query.
Go to the second gray node in the query tree, where we aggregated all the log sources and which we renamed “Log sources”.
Remove the last line that contains the summarize command.
Right-click on the “Linux System” value in the dt.process.name column and choose Filter for to fetch only the Kubernetes console logs.
Paste the filter and sort commands at the very end of the query.
The DQL query looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter dt.process.name == "Linux System"
| filter timestamp > toTimestamp("2025-02-14T05:32:00.000000000Z")
| sort timestamp
The query tree by now should look something like this:
OK, it seems we’re putting the puzzle together here. From the logs, we gathered that the heartbeat monitor/health checks started to fail after the request. The failing health checks resulted in the K8S control plane restarting the container gracefully, since it had been marked unhealthy:
Killing container with a grace period
Seems that the logical sequence of events might have been the following:
A request came in, which resulted in an error and caused problems in the service.
Because of the error, the load on the server increased, causing health check endpoints to fail as well.
Due to failed heartbeats, K8S decided to restart the container and sent a “kill container” signal to the container.
The sent signal restarted the service gracefully.
The requests that came in during the restart resulted in a response code 503 (Service Unavailable).
The restart sequence is an interesting reference point that we may want to use later. See if you can find the first successful event after the restart that indicates the end of the restart sequence. Right-click on its timestamp and choose Timestamp filters → Earlier than. The following filter will be added to your query:
| filter timestamp < toTimestamp("2025-02-14T05:35:01.000000000Z")
But what was the request that started it all? Let's go back to the first purple node and analyze the heartbeat fragment that was sent. It seems that this might have been the source of the problem.
Let's try to analyze and visualize the Heartbeat Fragment to understand what’s in it. To analyze the inner workings of the fragment, follow these steps:
Create a filter to see only the log record containing the Heartbeat Fragment. To achieve this, select the “Heartbeat Fragment” portion from the content field, right-click on the selection, and choose Filter for.
The following filter command will be added to your query:
| filter contains(content, "Heartbeat Fragment")
Add the following parse command to your query:
| parse content, "LD 'Heartbeat Fragment: ' array{ LD{1}:a }{,100}:binary"
This DPL pattern will extract up to 100 one-character-long pieces of line data (the LD matcher) into an array field called binary.
To visualize the data better, let's generate a key for each binary value. Use the iIndex() function to do it, as in the following DQL snippet:
| fields heartbeats = record(value = binary[], key = concat("position_", iIndex()))
Use the expand command to expand all array elements into separate log records and the fieldsFlatten command to spread the elements of the record object into separate fields.
Finally, add a fieldsRemove command to remove the heartbeats field and leave only the key and value columns.
The final DQL query you just constructed looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter contains(trace_id, "e0f28a67b2854b1fa8442d9df9f3deeb")
| filter contains(content, "Heartbeat Fragment")
| parse content, "LD 'Heartbeat Fragment: ' array{ LD{1}:a }{,100}:binary"
| fields heartbeats = record(value = binary[], key = concat("position_",iIndex()))
| expand heartbeats
| fieldsFlatten heartbeats
| fieldsRemove heartbeats
The final query tree looks like this:
Security Investigator enables you to turn your query nodes into a Notebook document so you can build a comprehensive report using Dynatrace Notebooks. You can select the relevant nodes for your report and generate a Notebook document from them.
To create a Notebooks document for our investigation, follow these steps:
Select the following nodes:
the second gray node to show the log sources we analyzed
the fourth gray node to show the response code distribution over time
the second blue node to show the response latency chart
the first purple node to show the request that caused the error
the neon node for the restart of the application
the golden node to show the bitmap
Right-click on the selected nodes and choose Download nodes as → Notebooks document.
Check the Results checkbox to include the results in the Notebooks document.
Click Download, and you’re done!
Now, open the Notebooks app, click on the Upload button, and select the document you just downloaded from the Security Investigator.
Follow these steps to illustrate your report:
Resize the “log sources” section’s results table so that the content fits.
Change the visualization options for the “response code distribution” and “response latency chart” sections to a line chart and Hide the input.
Resize all sections so that they look nice in the report and, if needed, add markdown sections in between to fill in the gaps in your investigation.
But for now, let's focus on the last section, which contains our Heartbeat Fragment. Let's see if we can find a suitable visualization for the fragment.
Open Heartbeats segments Options.
Choose Honeycomb visualization.
From the color palette, choose Fireplace.
Your request that made your service skip a heartbeat (or a few) looks like this:
Congratulations! You have successfully finished the Valentine's Day scavenger hunt!
A shared Security Investigator case of the full walkthrough can be found here.
The final Notebook document report can be found here.
Did you reach a different solution? Do you have a different query tree? Share your tree and describe your experience in the comments!
➡️ Author of this hands-on: @Tiit_Hallas
25 Feb 2025 01:27 PM
Finally got around to this. Such fun!
I did end up with a slightly different query tree, but it all worked out in the end.
Ignoring the multiple extra nodes (I ran frequent queries with small iterations), my node for the Restart Cycle (cyan?) is on the left side of the tree instead of underneath one of the purple nodes. I think it ended up fine though!
26 Feb 2025 07:54 AM
Thank you for sharing your experience with Security Investigator, and I'm glad to hear you had fun doing this walkthrough! 🙂 Just a little hint that may come in handy in some cases (especially with the small iterations you mentioned): Security Investigator also has the option to execute the query in the same node.
You can either choose this option from the "Run" menu (the three-dot menu next to the Run button) or use the keyboard shortcut Ctrl + Shift + Enter (Cmd + Shift + Enter for MacBook users).
27 Feb 2025 09:19 PM
oh awesome, @Tiit_Hallas, thanks!