Solved: Re: The execution time of Extension 2.0 created with Python is a little longer than 60 seconds.

tasuzuki · ‎19 Jun 2024

I developed an extension that runs locally using dt-extensions-sdk.
I have it set up so that a START log is issued at the beginning of the query function.
After running the extension for several days in a row, I noticed that sometimes a minute's worth of data was missing.
When I looked at the START log in python3.log, I found that the time was as follows:

2024-06-16 22:28:59,931
2024-06-16 22:29:59,952
2024-06-16 22:30:59,953
2024-06-16 22:31:59,983
2024-06-16 22:33:00,012
2024-06-16 22:34:00,032
2024-06-16 22:35:00,031

There is no log that started at 22:32.
Looking at other start logs, I found that the execution interval was between 60.00 and 60.05 seconds.
It looks like 0.05 seconds piled up to a minute where the plugin was not executed. I think it's strange that only the metrics from the extension are missing, even though the items that OneAgent retrieves as standard (ex: builtin:host.cpu.usage) are not missing.

I don't know if there's a problem with the implementation, or with OneAgent or the SDK, so please help me.

Thank you.
Takayuki.

david_lopes · ‎19 Jun 2024

Can you please share the code?

the sdk has several mechanisms where this doesn’t happen, otherwise every single dynatrace created extension would produce gaps

methods are never scheduled to run exactly on the 00:00.00 of a minute to avoid this scenario
even if they happen to be (very unlikely), metrics reported on scheduled methods have their timestamp recalculated to compensate for any delays

without checking what your code is doing it will not be possible to help

tasuzuki · ‎20 Jun 2024

I created a simple extension to verify the operation.

This extension mainly does the following two things.

- Outputs the START log at the beginning of the query function.
- Registers the current time in seconds in the metric.

I check the time of the START log output every minute and determine that the plugin call time is delayed by a few milliseconds.

I check the log in /var/log/dynatrace/oneagent/extensions/datasources/custom_{plugin name}/python3.log.

In 0.0.1, when query is called, it first registers the metric and then ends the process. The timing of the start of execution at that time was delayed by about 4 milliseconds per minute.

In 0.0.2, time.sleep(2) was added before registering the metric and time.sleep(1) was added after registering the metric.

The timing of the start of execution at that time was delayed by about 8 milliseconds per minute.

If it is delayed by about 8 milliseconds per minute, data loss may occur approximately once every 5 days and 4 hours.

I have attached the source code and execution log.

florent_duchate · ‎20 Jun 2024

Hi

metric resolution within Dynatrace is 1 minute minimum so the value will be a snapshot anyway when it is taken (once per minute) so the concept of data loss makes no sense here. The time will wobble around the second but won't creep forward slowly.

What you are pointing out is a problem when ingesting 1 minute worth of data from a log for instance. If you rely on the execution time of the query method, you could indeed lose data. This is why you want to handle the start and end time in your code and not rely on the execution time. Typically, we read older data to ensure it is all there. So if you execute at 13:45:43.05, you need to read the data from 13:44:00 and 13:45:00 for instance this way you won't miss anything. The next minute, you might execute at 13:46:43.01 but you will read 13:45:00 to 13:46:00 etc...

tasuzuki · ‎20 Jun 2024

Yes, the minimum resolution of the metric is 1 minute. But because the query method is called every 60.0x seconds, there is a minute where data cannot be collected.

To make it easier to understand, if the query method is called every 60.1 seconds, it will look like this.

Time when the query method was called

- 2024-06-17 08:25:59.876

- 2024-06-17 08:26:59.976

- 2024-06-17 08:28:00.076

- 2024-06-17 08:29:00.176

Looking at python3.log, it looks like this, and the data for 08:27 was actually missing.

Is it wrong to retrieve monitoring items from the OS in the query method and register metrics?

david_lopes · ‎20 Jun 2024

Your extension is manually reporting the timestamp for the metric, and that is where your issue comes from, but lets assume folks always wanted to manually report the timestamp for their metrics (not the case), instead of letting the SDK calculate it.

This scenario you describe never happened in the logs you shared, like I mentioned before extensions are not scheduled to run at 00:00.00 of the minute.

The real logs look like:

2024-06-20 15:14:30,168 [INFO] dynatrace_extension.sdk.extension (ThreadPoolExecutor-0_0): query START mysample:0.0.2.
2024-06-20 15:15:30,173 [INFO] dynatrace_extension.sdk.extension (ThreadPoolExecutor-0_0): query START mysample:0.0.2.
2024-06-20 15:16:30,184 [INFO] dynatrace_extension.sdk.extension (ThreadPoolExecutor-0_0): query START mysample:0.0.2.

So there is a "cushion" of several seconds before that scenario could ever happen, 30 seconds in this case.

For that scenario to happen at 4ms per minute the extension would need to run for 7500 cycles, so because you are manually setting the timestamp of your metric, you are responsible for reporting it every minute, you denied the SDK the option to adjust the metric timestamp.

Even in that worst case scenario, you will lose 1 datapoint every 7500 minutes, which is acceptable because you took the risk of setting your own timestamps.

If you didn't set the timestamp manually, the SDK calculates the metric timestamp based on the first time the method was scheduled, which guarantees that no datapoint will be lost, you can check the code here and here

In summary, remove the timestamp argument from your report_metric method and you will never lose a datapoint

The code comments even describe the same issue you have.

tasuzuki · ‎21 Jun 2024

Thank you. I didn't know this mechanism.

> In summary, remove the timestamp argument from your report_metric method and you will never lose a datapoint

Following this advice, I modified the code as follows and ran it for about 7 hours.

    def query(self):
        # ### Some parts omitted ###
        now = datetime.datetime.now()
        self.report_metric(
            key=metric_key,
            value=now.second,
            dimensions=dimensions,
            # timestamp=now,    # this line was commented out
        )

The graph for this metric now looks like this:

However, if as you say, there is no data loss with this implementation, I will need to wait a few more hours until the start second changes from 59 to 00.

I am attaching the result of grepping this version of python3.log with START. If there are any other files that would be useful in investigating this, please let me know.

tasuzuki · ‎21 Jun 2024

Hmm, I'm in trouble.

As David_lopes said, I removed the timestamp from the report_metirc argument and checked to see if any metrics were missing.

    def query(self):
        # ### Some parts omitted ###
        now = datetime.datetime.now()
        self.report_metric(
            key=metric_key,
            value=now.second,
            dimensions=dimensions,
            # timestamp=now,    # this line was commented out
        )

Finally, the moment when the millisecond changes from 9xx to 0xx has arrived, so take a look.

2024-06-21 17:50:59,902 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 17:51:59,926 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 17:52:59,956 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 17:53:59,985 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 17:55:00,015 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 17:56:00,023 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 17:57:00,053 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 17:58:00,081 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 17:59:00,111 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.
2024-06-21 18:00:00,121 [INFO] dynatra ... utor-0_0): query START mysmp2:0.0.6.

As you can see from the logs, the plugin did not start at 17:54.

And when I checked DataExplorer, the data at 17:55 was missing.

If you didn't set the timestamp manually, the SDK calculates the metric timestamp based on the first time the method was scheduled, which guarantees that no datapoint will be lost.

Indeed, the metric timestamp appears to be calculated automatically, but ultimately one point of data was missing.

Is it inevitable that extensions developed using the SDK will have occasional defects? Or is there something wrong with the way I implement it?

david_lopes · ‎21 Jun 2024

Are you using the latest version of the SDK? 1.2.2 ? If so please open a github issue and we will investigate.

tasuzuki · ‎21 Jun 2024

Yes, I tried it with SDK 1.2.2 and the result was the same.
I'll create an issue on GitHub.

https://github.com/dynatrace-extensions/dt-extensions-python-sdk/issues/69

david_lopes · ‎23 Jun 2024

Thanks, we've published 1.2.3 which solves the accumulating timedrift

The execution time of extension 2.0 created with Python longer than 60 seconds