13 Feb 2025 07:37 PM - last edited on 27 Feb 2025 11:38 AM by AgataWlodarczyk
With Valentine's Day just around the corner, Dynatrace is happy to bring to you: the Valentine's Scavenger Hunt!
This blog post will guide you through a hands-on root cause analysis exercise. Each assignment introduces a goal and the steps to follow to reach it. You can solve it on your own, follow the steps, or simply look at the solved investigation available in Security Investigator.
In the bustling city of Byteville, Valentine's Day was just around the corner. The air was filled with the scent of roses, and love was in the air—or at least, it was supposed to be. At the heart of Byteville was a quirky little company called "HeartSync," known for its innovative dating app that matched people based on their unique heartbeats.
One fateful morning, the HeartSync office was in chaos. Backend services analyzing the heartbeats were not working properly, customers were complaining, and the Dynatrace monitoring reported failures in the services running on the company's Kubernetes cluster.
Without hesitation, Cupid McHeartFace, head of the HeartSync Engineering Department, called you, the best log forensics person, in the hope of assistance: maybe you can help them analyze the logs and traces to figure out what actually happened and what caused this madness on a day filled with hopes and dreams!
The service we’re about to investigate is running as a container called “heartbeat-matcher-service” in the pod “heartbeat-matcher-service-78f6c784c9-2g77v”, which runs in the Kubernetes cluster “prod.cupid.cluster”. Inbound connections to the service are managed by Istio, which runs in a container called “istio-proxy” in the same pod. All logs from the infrastructure are ingested into Dynatrace.
All logs have the Smartscape metadata fields attached to them. All logs have the cluster name “prod.cupid.cluster” set in the k8s.cluster.name field, and pod names are in the k8s.pod.name field. Logs sent from the containers’ stdout have the container name in the k8s.container.name field. Kubernetes console output logs can be found using the dt.process.name value “Linux System”.
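To make these fields concrete, here is a minimal sketch of how they can be combined to scope a query to the application container described above (all values are taken from this description; the walkthrough builds its own queries step by step, so this is only for orientation):
fetch logs
// scope to the cluster, pod, and container described in the background
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.pod.name == "heartbeat-matcher-service-78f6c784c9-2g77v"
| filter k8s.container.name == "heartbeat-matcher-service"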
Examples of the logs we’ll be using during this hands-on are as follows:
The content of a sample Istio request log looks like this:
{
  "bytes_received": 1499,
  "hostname": "heartbeat-matcher-service-78f6c784c9-2g77v",
  "method": "GET",
  "upstream_transport_failure_reason": null,
  "response_code": 200,
  "upstream_remote_port": 8080,
  "path": "/users/henry/",
  "downstream_local_port": 8080,
  "time": "2025-02-14T05:31:11.033Z",
  "log_type": "accesslog",
  "user_agent": "heartbeat.dating.app",
  "downstream_protocol": "HTTP/1.1",
  "upsteam_protocol": "HTTP/1.1",
  "upstream_local_port": 50840,
  "downstream_tls_version": "TLSv1.3",
  "duration": 33,
  "response_flags": "-",
  "upstream_remote_address": "100.124.40.190",
  "downstream_remote_address": "10.178.165.253",
  "start_time": "2025-02-14T05:31:11.000Z",
  "bytes_sent": 0
}
A GET request is logged by the application with the following log lines:
2025-02-14T06:13:49.000Z INFO HB-CTX c.d.p.p.i.http.ApiLoggingInterceptor : Request@inbound: Method=GET RequestURI=/users/noah/ AuthHdrLen=1024
2025-02-14T06:13:49.030Z INFO HB-CTX c.d.p.p.i.http.ApiLoggingInterceptor : Response@inbound: Status=200
A POST request is logged by the application with the following log lines:
2025-02-14T06:13:50.000Z INFO HB-CTX c.d.p.p.i.http.ApiLoggingInterceptor : Request@inbound: Method=POST RequestURI=/users/mia/beats/ AuthHdrLen=1024
2025-02-14T06:13:50.071Z INFO HB-CTX c.d.p.s.d.s.i.HeartbeatAnalyzer : Analysing heartrate batch. Frames: 11, beats: 22
2025-02-14T06:13:50.071Z DEBUG HB-CTX c.d.p.s.d.s.i.HeartbeatAnalyzer : Heartbeat Fragment: 1010000010100000101000001010000010100000101000001010000010100000101000001010000010100000
2025-02-14T06:13:50.439Z INFO HB-CTX c.d.p.s.d.s.i.HeartbeatAnalyzer : Processing done, adding to matching queue, workflow id: a6402a01-0866-4272-8b30-03de86268781
2025-02-14T06:13:50.442Z INFO HB-CTX c.d.p.p.i.http.ApiLoggingInterceptor : Response@inbound: Status=200
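As a side note (not needed for the walkthrough itself): if you ever want the method, URI, or auth header length from these application log lines as separate fields, a DPL pattern along the following lines should extract them by anchoring on the literal labels in the message. The field names method, request_uri, and auth_hdr_len are just illustrative choices:
| parse content, "LD 'Method=' LD:method ' RequestURI=' LD:request_uri ' AuthHdrLen=' INT:auth_hdr_len"
Each LD matcher consumes the text up to the next literal, the same way the Heartbeat Fragment pattern later in this exercise anchors on a literal prefix.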
Mr. McHeartFace informs you that Dynatrace has created a ticket for them about a Problem that Davis® AI has detected. It seems that the failure rate for web requests in the heartbeat-matcher-service peaked somewhere between 5 AM and 6 AM (UTC). He asked if you could help them get to the bottom of this.
From the Problem description, you see that a bunch of web requests got an error 503 as a response for some reason, which triggered the problem creation. Can you find out why this has happened?
You can use the Dynatrace Security Investigator in Dynatrace Playground to conduct the log analysis.
With no time to waste, let’s start with the investigation to find the root cause of these weird events that are disrupting our service!
Let's see if we can find any logs from the cluster prod.cupid.cluster around the problem's timeframe, between 5 AM and 6 AM UTC. To do that, follow these steps:
Go to the Dynatrace Playground environment and start the Security Investigator.
Choose the + Case to start a new Investigation.
Give your case a fancy name by clicking on the current case name “Untitled case” in the top area.
Create a custom timeframe:
Open the Timeframe editor.
Choose a timeframe from 5 AM to 7 AM (UTC) yesterday morning.
Click Save.
Name the timeframe “incident timeframe”.
PS! Yes, I actually mean “yesterday.” So, if you’re doing the walk-through on 15.02.2025, set the timeframe from 2025-02-14T05:00:00Z to 2025-02-14T05:59:59Z.
PPS! Keep in mind that it’s UTC time! If you are using local time in Security Investigator, then either change the timezone in settings or translate the UTC timestamp to your local time.
Choose the “incident timeframe” you’ve just created from the custom timeframes section and click Apply to activate the custom timeframe. You will see the timeframe selected in the timeframe selector component, as shown in the image.
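If you prefer to pin the timeframe inside the query itself (in addition to, or instead of, the timeframe selector), a pair of timestamp filters should achieve the same scoping. A minimal sketch using the example timestamps from the PS above:
fetch logs
// keep only records within the incident timeframe (UTC)
| filter timestamp >= toTimestamp("2025-02-14T05:00:00Z") and timestamp <= toTimestamp("2025-02-14T05:59:59Z")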
Fetch logs for the k8s cluster called prod.cupid.cluster. You can do it by using the following DQL query:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
Nice! You got a bunch of logs as a result! Let's aggregate the results to see if all the relevant logs described in the “Background” section are available from Grail.
To do that, add the following summarize command to the query and click Run.
| summarize count(), by: { k8s.cluster.name, k8s.container.name, dt.process.name }
You should see the following result:
Right-click on the query node you’ve just created and rename the node to “Log sources”.
Cool! Logs are visible in Grail, and we can access them, so let's kick off the investigation! Let's continue and fetch istio container logs. To do that, follow these steps:
Remove the summarize command from the end of your DQL query.
Right-click on the “istio-proxy” value in the results table and choose Filter. A filter command will be added to your DQL query automatically.
Click Run to execute the constructed query.
OK, there are a lot of them, and it would be really time-consuming to go through them all manually. So, instead, let's visualize the data by viewing the response codes as metrics. This enables us to quickly understand how the response errors are distributed over time and how they relate to one another.
To do that, follow these steps:
Extract the response_code value from the Istio JSON-formatted log record by adding the following parse command to the query:
| parse content, "json{int:response_code}(flat=true)"
The command converts the log record to a JSON object and extracts the response_code value from it as an integer. The extracted value is displayed as a new column in the results table.
Create a metric from the field values by using the makeTimeseries command. Use count() as the metric, split the series by response_code, and use a 1-minute interval. The following DQL snippet can be used for this:
| makeTimeseries count(default: 0), by: response=toString(response_code), interval: 1m
The final DQL query you’ve just constructed looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.container.name == "istio-proxy"
| parse content, "json{int:response_code}(flat=true)"
| makeTimeseries count(default: 0), by: response=toString(response_code), interval: 1m
The output looks like this:
Right-click on the query node in the query tree and rename the node to “Response code distribution”.
Let's read a bit about the error itself. As described in RFC 7231, the response code 503 (Service Unavailable) indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay. In other words, a request that results in a 503 response is not the cause of the problem but only an indication that something has already broken the service.
To find out what caused the service problems, let's examine the events that occurred before the first request that received a 503 response.
To do that, follow these steps:
Remove the last line from the query that contains the makeTimeseries command.
Add a filter command to retrieve only the requests that resulted in response code 503. You can do this by adding the following command to your query:
| filter response_code == 503
Find the earliest timestamp value from all of the results. You can do this by adding the following command to your query:
| summarize timestamp = min(timestamp)
The query now returns a single min(timestamp) result: the timestamp = in front of the function gives the field a new name.
Click Run to execute the query.
Now that we have the timestamp for the first 503 response, let's construct a DQL query to fetch events that happened before that. To do this, follow these steps:
Remove the last two lines from the query that contain the filter for the 503 response code and the summarize command.
Right-click on the timestamp value in the results table and choose Timestamp filters → Earlier than. The DQL query now looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.container.name == "istio-proxy"
| parse content, "json{int:response_code}(flat=true)"
| filter timestamp < toTimestamp("2025-02-14T05:32:01.000000000Z")
Click Run to execute the DQL query.
Yeah, well. OK, we still have a lot of requests to go through, but it seems that many of them result in either 200 (OK) or 401 (Unauthorized), which might not be that interesting for us at this point.
Let's change the color of this node to blue for future reference and filter the 200 and 401 events out to see if we have anything interesting left.
To set the node color, right-click on the active node, choose Color, and select the blue color.
To create the described filter, follow these steps:
Hold down the Ctrl key (Cmd for MacBook users) on your keyboard.
Choose the values 401 and 200 from the results table.
Right-click on either of them and choose Filter.
The following filter command will be automatically added to the query:
| filter in(response_code, {200, 401})
Add a not condition to the filter you created in the last step to exclude the records containing these response codes from your results. The filtering statement should look like this:
| filter not in(response_code, {200, 401})
Click Run to execute the query.
No results … Okay, let's set the node color to orange for future reference. Right-click on the node and choose the orange color from the color menu.
OK, but maybe it is related to the requests that resulted in either 200 or 401. Let's analyze the response latency to see if something suspicious stands out by generating average and maximum response time metrics from these log records.
To do that, follow these steps:
Navigate back to the blue node you created before.
Change the parse command to extract the duration from the Istio logs as well. You can do so by using the following DPL pattern instead of the previous one:
json{string:response_code, int:duration}(flat=true)
To extract the metrics from the logs, use the following makeTimeseries command:
| makeTimeseries {avg(duration), max(duration)}, interval:1m
The final DQL query looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.container.name == "istio-proxy"
| parse content, "json{string:response_code, int:duration}(flat=true)"
| filter timestamp < toTimestamp("2025-02-14T05:32:01.000000000Z")
| makeTimeseries {avg(duration), max(duration)}, interval:1m
The result visualized as a chart looks something like this:
Right-click on the node and rename it to “response latency chart”.
Looking at the charts, nothing stands out. Everything seems to be within normal deviance thresholds. But what might have caused the incident?
When looking at the Istio log sample in the “Background” section, we see two interesting fields in the logged event: duration and start_time. Since the log records contain the response and the duration for the request, we can assume that the events are generated and logged at the end of the transaction, after the response has been received. But what time does the timestamp represent, then?
Looking at the event, it seems that the timestamp field we used earlier represents the end of the transaction (when a request has received its response). In the sample record above, for instance, start_time is 05:31:11.000Z, the duration is 33 ms, and the logged time is 05:31:11.033Z. This means that our suspicious request might have taken quite a long time to get its response and could appear in the logs after the first 503 event. To verify this theory, let’s look at the start_time field instead of timestamp and search for events based on that.
To do that, follow these steps:
Navigate back to the first orange node we created earlier.
Modify the DPL pattern in the parse command so that it extracts the start_time value as a new field, using the following DPL pattern:
json{int:response_code, timestamp('yyyy-MM-ddTHH:mm:ss.SZ'):start_time}(flat=true)
This pattern extracts the response_code and start_time values from the JSON object with integer and timestamp types, respectively.
In the filter command after the parse command, change the field name from timestamp to start_time. The final query should look like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.container.name == "istio-proxy"
| parse content, "json{int:response_code, timestamp('yyyy-MM-ddTHH:mm:ss.SZ'):start_time}(flat=true)"
| filter start_time < toTimestamp("2025-02-14T05:32:01.000000000Z")
| filter not in(response_code, {200, 401})
OK, cool, we got something! This time, the result is not empty! For some reason, there is a single event that resulted in a status 500 (Server Error)! You can take a closer look at the JSON payload by right-clicking on the content field and choosing View field details.
By now, your query tree should look something like this:
Let's try to find out what happened in our systems during this request. The easiest way to do that is to use Dynatrace's Distributed Tracing, which creates spans and traces for every request it monitors. Luckily, we have trace_ids attached to our log events.
First, let's save the trace_id from the suspicious error 500 event as evidence. You can do so by right-clicking on the trace_id value in the results table and choosing Add to evidence list → Create a new list. Name the list “errors trace”. Now we can use this trace_id to create filters on any query node.
To start searching for the log records matching this trace_id, follow these steps:
Navigate to the first node at the top of the query tree where we had all the logs from our k8s cluster.
Add a filter command to the DQL query by clicking on the “errors trace” evidence list menu, choosing Filter for, and then Filter within field “trace_id”.
The DQL query you’ve just constructed looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter matchesValue(trace_id, "*e0f28a67b2854b1fa8442d9df9f3deeb*")
A new branch is created from the first node, containing all the events from this trace, so we can dig deeper. Set the node color to purple so we can reference it later.
Enable multiline mode for the content column from the Column menu at the top of the results table to make the logs easier to read. This gives a better overview of multiline logs, such as stack traces and other longer, structured logs.
Another good option for viewing multiline logs is to open the Field details view from the field's context menu and navigate between the log records using the arrow keys on your keyboard.
Rename the node to “Failing request“ and continue analyzing the results.
When we look at what the logs for a POST request should look like in the “Background” section, they aren’t quite the same as what we see here. Especially interesting is the stack trace, which is visible in the trace. It seems we have encountered some heartbeat irregularities, which have thrown our system into an unexpected state.
We should look beyond this request and see what else our application did afterward. To see the following events in our application, let's modify the query following these steps:
Remove the last filter command from the query, the one with the trace_id in it.
Add a filter command to fetch logs only from the same pod by right-clicking on the k8s.pod.name value and choosing Filter for.
Add another filter command to see only the relevant logs from the same container by right-clicking on the k8s.container.name value and choosing Filter for.
To see only the events following this event, right-click on the timestamp field value of the first log record, choose Timestamp filters, and choose Later than.
Add a sort command to order the results by timestamp in ascending order.
The DQL query you’ve just constructed looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter k8s.pod.name == "heartbeat-matcher-service-78f6c784c9-2g77v"
| filter k8s.container.name == "heartbeat-matcher-service"
| filter timestamp > toTimestamp("2025-02-14T05:32:00.000000000Z")
| sort timestamp
The results provide some interesting information. It seems that our service received a shutdown command (graceful, I might add), and the application restarted itself.
Commencing graceful shutdown. Waiting for active requests to complete
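If you want to jump straight to the shutdown messages within this node, an optional contains() filter on the quoted text above narrows the results (a small sketch, not required for the walkthrough):
| filter contains(content, "graceful shutdown")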
Let's see what our Kubernetes cluster did at that time. Fortunately, we’re collecting those logs as well. To see the k8s console logs, follow these steps:
Copy the last two lines containing the filter and sort commands from the current query.
Go to the second gray node in the query tree, where we aggregated all the log sources and which we renamed “Log sources”.
Remove the last line that contains the summarize command.
Right-click on the “Linux System” value in the dt.process.name column and choose Filter for to fetch only the Kubernetes console logs.
Paste the filter and sort commands at the very end of the query.
The DQL query looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter dt.process.name == "Linux System"
| filter timestamp > toTimestamp("2025-02-14T05:32:00.000000000Z")
| sort timestamp
The query tree by now should look something like this:
OK, it seems we’re putting the puzzle together here. From the logs, we gathered that the heartbeat monitor/health checks started to fail after the request. The failing health checks resulted in the K8S control plane restarting the container gracefully, since it had been marked unhealthy:
Killing container with a grace period
Seems that the logical sequence of events might have been the following:
A request came in, which resulted in an error and caused problems in the service.
Because of the error, the load on the server increased, causing health check endpoints to fail as well.
Due to failed heartbeats, K8S decided to restart the container and sent a “kill container” signal to the container.
The sent signal restarted the service gracefully.
The requests that came in during the restart resulted in a response code 503 (Service Unavailable).
The restart sequence is an interesting reference point that we may want to use later. See if you can find the first successful event after the restart that indicates the end of the restart sequence. Right-click on its timestamp and choose Timestamp filters → Earlier than. The following filter will be added to your query:
| filter timestamp < toTimestamp("2025-02-14T05:35:01.000000000Z")
But what was the request that started it all? Let's go back to the first purple node and analyze the heartbeat fragment that was sent. It seems that this might have been the source of the problem.
Let's try to analyze and visualize the Heartbeat Fragment to understand what’s in it. To analyze the inner workings of the fragment, follow these steps:
Create a filter to see only the log record containing the Heartbeat Fragment. To achieve this, select the “Heartbeat Fragment” portion from the content field, right-click on the selection, and choose Filter for.
The following filter command will be added to your query:
| filter contains(content, "Heartbeat Fragment")
Add the following parse command to your query:
| parse content, "LD 'Heartbeat Fragment: ' array{ LD{1}:a }{,100}:binary"
This DPL pattern will extract up to 100 one-character-long pieces of line data (the LD matcher) into an array field called binary.
To visualize the data better, let's generate a key for each binary value. Use the iIndex() function to do it, as in the following DQL snippet:
| fields heartbeats = record(value = binary[], key = concat("position_", iIndex()))
Use the expand command to expand all array elements into separate log records and the fieldsFlatten command to spread the elements of the record object into separate fields.
Finally, add a fieldsRemove command to remove the heartbeats field and leave only the key and value columns.
The final DQL query you just constructed looks like this:
fetch logs
| filter k8s.cluster.name == "prod.cupid.cluster"
| filter contains(trace_id, "e0f28a67b2854b1fa8442d9df9f3deeb")
| filter contains(content, "Heartbeat Fragment")
| parse content, "LD 'Heartbeat Fragment: ' array{ LD{1}:a }{,100}:binary"
| fields heartbeats = record(value = binary[], key = concat("position_",iIndex()))
| expand heartbeats
| fieldsFlatten heartbeats
| fieldsRemove heartbeats
The final query tree looks like this:
Security Investigator enables you to turn your query nodes into a Notebook document so you can build a comprehensive report using Dynatrace Notebooks. You can select the relevant nodes for your report and generate a Notebook document from them.
To create a Notebooks document for our investigation, follow these steps:
Select the following nodes:
the second gray node to show the log sources we analyzed
the fourth gray node to show the response code distribution over time
the second blue node to show the response latency chart
the first purple node to show the request that caused the error
the neon node for the restart of the application
the golden node to show the bitmap
Right-click on the selected nodes and choose Download nodes as → Notebooks document.
Check the Results checkbox to include the results in the Notebooks document.
Click Download, and you’re done!
Now, open the Notebooks app, click on the Upload button, and select the document you just downloaded from the Security Investigator.
Follow these steps to illustrate your report:
Resize the “log sources” section’s results table so that the content fits.
Change the visualization options for the “response code distribution” and “response latency chart” sections to a line chart and Hide the input.
Resize all sections so that they look nice in the report and, if needed, add markdown sections in between to fill in the gaps in your investigation.
But for now, let's focus on the last section, which contains our Heartbeat Fragment. Let's see if we can find a suitable visualization for the fragment.
Open Heartbeats segments Options.
Choose Honeycomb visualization.
From the color palette, choose Fireplace.
Your request that made your service skip a heartbeat (or a few) looks like this:
Congratulations! You have successfully finished the Valentine's Day scavenger hunt!
A shared Security Investigator case of the full walkthrough can be found here.
The final Notebook document report can be found here.
Did you reach a different solution? Do you have a different query tree? Share your tree and describe your experience in the comments!
➡️ Author of this hands-on: @Tiit_Hallas
25 Feb 2025 01:27 PM
Finally got around to this. Such fun!
I did end up with a slightly different query tree, but it all worked out in the end.
Ignoring the multiple extra nodes (I ran frequent queries with small iterations), my node for the Restart Cycle (cyan?) is on the left side of the tree instead of underneath one of the purple nodes. I think it ended up fine though!
26 Feb 2025 07:54 AM
Thank you for sharing your experience with Security Investigator, and I'm glad to hear you had fun doing this walkthrough! 🙂 Just a little hint that may come in handy in some cases (especially with the small iterations you mentioned): Security Investigator also has the option to execute the query in the same node.
You can either choose this option from the "Run" menu (the three-dot menu next to the Run button) or use the keyboard shortcut Ctrl + Shift + Enter (Cmd + Shift + Enter for MacBook users).
27 Feb 2025 09:19 PM
oh awesome, @Tiit_Hallas, thanks!