We had a network issue yesterday that was caused by an MS SQL Server memory problem. Today I went back to look at some specific queries to see whether a change in their behavior (req/s, response time, size) could have been a root cause of the issue. What I found was that some of the queries showed relevant data up to the point that our issue occurred (3:04pm), and then my graphs go blank until after the issue was resolved. The query I was looking at is executed more than 500 times a minute, so I have a hard time believing that it just stopped being executed during this time. What I think happened is that it was moved into the "All other operations" bucket, but I can't explain why.
Has anyone else seen this type of behavior and is there a way to avoid it?
A possible explanation is that if MSSQL was having memory issues, it would have been unable to handle incoming network connections from the application servers, resulting in connectivity issues (such as timeouts or dropped connections). Servers can't send queries over the network to a database without a working connection, so even though the application servers were trying to run the queries (from the point of view of the app server), the actual queries never made it onto the network (and so DCRUM can't see them).
In DCRUM, you can use the Network explorer to see if there were lots of TCP errors for the MSSQL server that you're monitoring; those would have affected the Availability metrics. If so, the only place we can report those errors is under "All other operations": we can see that there was an attempt to connect (so we count that as an operation request), but we can't possibly know which DB query (or queries) would have been sent, because the connection attempt failed.
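To make the bucketing idea concrete, here is a toy Python model of it. This is only a sketch of the reasoning above, not DCRUM's actual logic: the `Attempt` class, the `bucket_for` and `summarize` functions, and the sample query text are all hypothetical names made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

ALL_OTHER = "All other operations"


@dataclass
class Attempt:
    """One connection attempt as a passive network probe might see it (hypothetical model)."""
    connected: bool              # did the connection get established?
    query: Optional[str] = None  # query text, only visible if it reached the wire


def bucket_for(attempt: Attempt) -> str:
    # A failed connection carries no query text, so it cannot be attributed
    # to any named query and falls into the catch-all bucket.
    if not attempt.connected or attempt.query is None:
        return ALL_OTHER
    return attempt.query


def summarize(attempts) -> Counter:
    # Count attempts per reporting bucket.
    return Counter(bucket_for(a) for a in attempts)


# Hypothetical traffic: two healthy executions, then three failed connects.
QUERY = "SELECT * FROM orders WHERE id = ?"
before = [Attempt(True, QUERY)] * 2
during = [Attempt(False)] * 3

print(summarize(before + during))
```

In this model the named query simply has no data points during the outage window, while the catch-all count goes up, which would look like a blank stretch on a per-query graph.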