27 Oct 2025 08:47 PM - edited 28 Oct 2025 12:27 AM
FUN!
"The original order of the records is not preserved. Therefore, by default the sequence of records that are chosen during deduplication is random. If you want to pick a particular record out of the duplicates, you can use the sort parameter." (docs)
- sorting first fixes the order in which records enter the deduplication algorithm, which should make the outcome more predictable
- deduplicating first, on a randomized order, seems less predictable to me
How does this compare internally to using the sort option included in the dedup function?
| dedup {fieldA, fieldB}, sort: {fieldA asc, fieldB desc}
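For comparison, here are the two variants side by side (a sketch, not a definitive answer; fieldA/fieldB and the logs source are placeholders):

// variant 1: explicit sort step, then dedup
fetch logs
| sort fieldA asc, fieldB desc
| dedup {fieldA, fieldB}

// variant 2: sort handled inside dedup; per the docs quoted above, the sort:
// parameter controls which record is kept within each duplicate group
fetch logs
| dedup {fieldA, fieldB}, sort: {fieldA asc, fieldB desc}

Whether variant 1 guarantees the same record selection as variant 2 is exactly the open question here.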
28 Oct 2025 02:25 AM
@henk_stobbe I think the count returned should be the same for both queries, unless they are executed at different times. Since these are logs that are continuously ingested, even a difference of less than a second can give you a different count.
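One way to rule out ingestion timing would be to pin both queries to the same closed timeframe (a sketch; the interval value is a placeholder, assuming the timeframe: parameter of fetch):

// scan exactly the same window in both query variants
fetch logs, timeframe: "2025-10-27T00:00:00Z/2025-10-27T12:00:00Z"
| dedup {fieldA, fieldB}
| summarize cnt = count()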
29 Oct 2025 01:17 PM
Hello,
The problem (at my end) is that every part of the "pipeline" seems to have its own limits (-; so you can lose data at every step. I am not sure how to prevent this when using multiple steps.
So starting with 9999 log lines, after the sort you can end up with 8888 (as an example).
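One way to see where the records drop would be to count at each stage (a sketch; the sort field is a placeholder, and both queries should run over the same timeframe):

// count what the fetch returns on its own ...
fetch logs
| summarize totalFetched = count()

// ... then add the suspect step and compare the counts
fetch logs
| sort timestamp desc
| summarize totalAfterSort = count()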
KR Henk
29 Oct 2025 02:22 PM
This is true in multiple languages, not just DQL.
The limits on each step in the query are applied from the configuration in the tile/segment.
This would be different from a "| limit {n}" applied at the end of the processing.
For your riddle, however, I think it's good to remember how the dedup command works in DQL: its input is not sorted by default, and on a large query, sorting before dedup can improve performance quite a lot.
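Putting both points together, an explicit cut-off at the end of the pipeline might look like this (a sketch; 5000 and the field names are arbitrary placeholders):

fetch logs
// sorting first, as noted above, can help dedup on large inputs
| sort timestamp desc
| dedup {fieldA, fieldB}
// explicit limit at the end of processing, instead of relying on tile defaults
| limit 5000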