27 Mar 2023 01:38 PM - edited 28 Mar 2023 11:03 AM
In the previous article, we talked about logs. Now we're ready to discuss the common issues that you might encounter in your z/OS monitoring and, of course, how to troubleshoot them.
Just so you know, this article doesn't cover extensions. If you want an article regarding this, please let me know in the comments. 😊
zRemote and zDC are not connected
There are many possible reasons why this happens. The most common root cause is the network connection.
The Watchdog Launcher - The first thing that I check is the watchdog. It is normally named OneagentzwatchdogLauncher. In this file, you can see when and why the OneAgent stopped: the stop reason, how it was stopped, and whether it restarted automatically.
For example:
[oneagentz] [STOPPING PROCESS] PID: 999
[oneagentz] [STOPPING PROCESS] Memory usage: 999 MiB
[oneagentz] [STOPPING PROCESS] Stop reason: NEXT_STATE_QUEUED
[oneagentz] [STOPPING PROCESS] Stop timeout: 9m 9s
[oneagentz] [STOPPING PROCESS] Next state: SHUTDOWN
[oneagentz] [STOPPING PROCESS] Core dump reason: none
[oneagentz] [STOPPING PROCESS] Kill process tree: false
[oneagentz] [STOPPING PROCESS] Stop with: Ctrl-Break to a different process group
[oneagentz] [process] Stopping process with CTRL_BREAK
[oneagentz] [process] Sent CTRL+BREAK event to process 999
[oneagentz] [process] Waiting 9m 9s for graceful shutdown
[oneagentz] Delay reset threshold reached 99m 9s, setting restart delay to: 9s
[oneagentz] [PROCESS TERMINATED] PID: 999
[oneagentz] [PROCESS TERMINATED] Duration: 9h 99m termination: 9.9s
[oneagentz] [PROCESS TERMINATED] Exit code: EC_OK, exitcode: 0x0, code expected: true
[oneagentz] [PROCESS TERMINATED] Exit reason: PROCESS_STOPPED
[oneagentz] [PROCESS TERMINATED] Process restart: disabled, never restart proccess on launcher state change
It doesn't give you the full information, but it is a good start. After that, I check the other logs.
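If you'd rather not read the watchdog file by eye, here is a minimal Python sketch that pulls the key/value pairs out of the STOPPING PROCESS and PROCESS TERMINATED blocks. The line pattern is inferred only from the sample excerpt above, not from a documented log format, and `summarize_watchdog` is a hypothetical helper name, so treat it as a starting point.

```python
import re

# Matches lines such as "[oneagentz] [PROCESS TERMINATED] Exit reason: PROCESS_STOPPED".
# Assumption: the pattern is inferred from the sample excerpt, not from a documented format.
LINE_RE = re.compile(r"\[(STOPPING PROCESS|PROCESS TERMINATED)\]\s+([^:]+):\s*(.*)")


def summarize_watchdog(lines):
    """Collect the key/value pairs of the stop and termination blocks."""
    summary = {"STOPPING PROCESS": {}, "PROCESS TERMINATED": {}}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            block, key, value = match.groups()
            summary[block][key.strip()] = value.strip()
    return summary


sample = [
    "[oneagentz] [STOPPING PROCESS] PID: 999",
    "[oneagentz] [STOPPING PROCESS] Stop reason: NEXT_STATE_QUEUED",
    "[oneagentz] [STOPPING PROCESS] Next state: SHUTDOWN",
    "[oneagentz] [PROCESS TERMINATED] Exit code: EC_OK, exitcode: 0x0, code expected: true",
    "[oneagentz] [PROCESS TERMINATED] Exit reason: PROCESS_STOPPED",
]
result = summarize_watchdog(sample)
print(result["STOPPING PROCESS"]["Stop reason"])
print(result["PROCESS TERMINATED"]["Exit reason"])
```

With a real file you would pass the open file object instead of `sample`. If the same block appears several times, this sketch keeps only the last value for each key.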
zRemote Inactive
Your license could affect the connection between zDC and zRemote. Search for this line:
[native] ============================== Agent inactivated ============================== (no more data will be sent by this agent, except heartbeats and config request)
This indicates that you've exhausted your host units and you would need to contact your CSM or Product Specialist for help.
Additionally, you can check the following OneAgent details:
1. If it is installed on the zRemote host, it must be in full-stack mode.
2. It is enabled.
3. It uses the same tenant as the zRemote.
Latency
The location of your zRemote matters. Search for this keyword:
[native] There is severe latency of over 10 seconds between the zRemote and zLocal. Please check your network connection or the priority of the zDC.
This indicates a slow connection between your zLocal and zRemote. Please make sure that your zRemote is installed in the same datacenter as your zLocal.
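The two log searches described so far (the inactivation banner and the latency warning) can be automated with a few lines of Python. This is only a sketch: the matched text comes from the messages quoted in this article, while `scan_zremote_log` and the sample lines are made up for illustration.

```python
import re

# Text fragments taken from the two messages quoted in this article.
INACTIVATED = "Agent inactivated"
LATENCY_RE = re.compile(r"severe latency of over (\d+) seconds")


def scan_zremote_log(lines):
    """Return (line_number, finding) tuples for the two known messages."""
    findings = []
    for number, line in enumerate(lines, start=1):
        if INACTIVATED in line:
            findings.append((number, "agent inactivated"))
        match = LATENCY_RE.search(line)
        if match:
            findings.append((number, f"latency over {match.group(1)}s"))
    return findings


# Hypothetical sample input; a real run would iterate over an open log file.
sample = [
    "[native] some unrelated line",
    "[native] ===== Agent inactivated ===== (no more data will be sent by this agent, except heartbeats and config request)",
    "[native] There is severe latency of over 10 seconds between the zRemote and zLocal. Please check your network connection or the priority of the zDC.",
]
findings = scan_zremote_log(sample)
for number, finding in findings:
    print(number, finding)
```

The line numbers make it easy to jump to the surrounding context in the log, which usually matters more than the match itself.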
DTZDCNM Job
To get more information about your connection, I recommend running this job. It is the network contention diagnostic job, and its output can help with the investigation.
We are not seeing any mainframe data in the Dynatrace UI
Again, there are a lot of reasons why this happens. It could be related to network issues. If it is not, you can troubleshoot it by checking the following:
Contention
zDC needs a high priority to process your data. Look for this line:
[native] ASID[999], smfID[9999], sysid[xxxx], jobName[xxxxxxx ] - ZDC666W - List of possible zDC CPU contenders jobnameA.
This indicates that zDC doesn't have a high priority and is being blocked by other jobs. Please make sure that your zDC has a priority higher than or equal to that of your CICS/IMS jobs.
If you have RMF, you can also check the TSO RMFWDM workflow delay reports to look for issues.
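To collect the contender job names from ZDC666W lines, something like the following Python sketch could work. The pattern is based only on the single sample line above. How several contenders are separated is my assumption (commas and/or spaces), and `parse_contention` is a hypothetical helper.

```python
import re

# Assumption: the layout is inferred from the one ZDC666W sample line in this article.
CONTENTION_RE = re.compile(
    r"ASID\[(?P<asid>[^\]]*)\].*?jobName\[(?P<job>[^\]]*)\]"
    r".*?ZDC666W - List of possible zDC CPU contenders\s+(?P<contenders>.*)"
)


def parse_contention(line):
    """Return the ASID, job name and contender list, or None if the line does not match."""
    match = CONTENTION_RE.search(line)
    if match is None:
        return None
    # Assumption: multiple contenders are separated by commas and/or spaces.
    raw = match.group("contenders").rstrip(". ")
    contenders = [name for name in re.split(r"[,\s]+", raw) if name]
    return {
        "asid": match.group("asid").strip(),
        "job_name": match.group("job").strip(),
        "contenders": contenders,
    }


line = (
    "[native] ASID[999], smfID[9999], sysid[xxxx], jobName[xxxxxxx ] - "
    "ZDC666W - List of possible zDC CPU contenders jobnameA."
)
parsed = parse_contention(line)
print(parsed)
```

Counting how often each contender shows up over a day gives you a short list of jobs whose priority is worth comparing with the zDC's.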
Missing CICS regions
If zRemote and zDC are running fine but you are missing a specific CICS region, check whether the DTAX transaction is enabled. DTAX communicates with the zDC. If it is enabled, it sends an INIT to the zDC, and once it gets an OK response, DTAX starts sending data.
Transaction Buffer
Search for this line:
TXB No buffers available, PPs being lost!
This means that you do not have enough transaction buffers and PurePath data is being lost. To fix this, increase the DTMSG_TRANBUFSIZE parameter. This parameter offers better performance for CICS and IMS agents by placing event messages related to each PurePath into dedicated buffers. The maximum size that you can give is around (126,4) or (248,2). For more information about this parameter, please check the comment section in the ZDCSYSIN.
2022-10-20 14:04:18.573 UTC [00000cec] warning [native] ASID[999], smfID[xxxx], sysid[xxxx], jobName[xxxxxxxx]: active transactions=99, timeouts=99, corrupt paths=99, true timeouts=9
< >[999.9999]=sequence error
< >[999.9999]=sequence error
< >[999.9999]=sequence error
< >[999.9999]=sequence error
< >[999.9999]=sequence error
If you see lines like the ones above, this means that you have corrupted data. The root cause could be connection problems, zDC priority issues, or sizing problems. Check the issues described earlier in this article before troubleshooting this one. Most of the time, this is a secondary problem caused by the main issue. If you fix the root cause, this issue might go away.
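To see whether the corruption gets better or worse after a fix, you can track the counters from the warning line and count the sequence errors. Here is a minimal Python sketch, assuming the log lines look like the excerpt above; `summarize_corruption` is a hypothetical helper, not part of any Dynatrace tooling.

```python
import re

# Counter names exactly as they appear in the sample warning line.
STAT_RE = re.compile(r"(active transactions|timeouts|corrupt paths|true timeouts)=(\d+)")


def summarize_corruption(lines):
    """Return the last seen counters and the number of sequence-error lines."""
    stats = {}
    sequence_errors = 0
    for line in lines:
        for name, value in STAT_RE.findall(line):
            stats[name] = int(value)
        if line.rstrip().endswith("=sequence error"):
            sequence_errors += 1
    return stats, sequence_errors


sample = [
    "2022-10-20 14:04:18.573 UTC [00000cec] warning [native] ASID[999], smfID[xxxx], "
    "sysid[xxxx], jobName[xxxxxxxx]: active transactions=99, timeouts=99, "
    "corrupt paths=99, true timeouts=9",
    "< >[999.9999]=sequence error",
    "< >[999.9999]=sequence error",
]
stats, sequence_errors = summarize_corruption(sample)
print(stats, sequence_errors)
```

If the file contains several warning lines, this sketch keeps only the most recent counters, which is usually what you want when checking whether a change helped.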
If you are still unsure and the error persists, please open a support ticket or talk to us via chat and we would be happy to assist you.
This - is - amazing!!!!
Another common reason is version compatibility. I'm not sure exactly in which version, but after we updated the AG version of our zRemotes, we also had to update the OneAgent version of all our zDCs, since the zRemotes were not accepting the connection.
Sadly I do not have the error message from logs.
*edit* - I just saw your other post, which mentions exactly this!!! https://community.dynatrace.com/t5/Troubleshooting/Troubleshooting-your-z-OS-monitoring-Covering-the...
Just for visibility:
As a best practice, I recommend that zDC, zLocal and zRemote should be on the same version OR at least 3 versions lower. If zDC is on a higher version than the zRemote, then a problem might occur.
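If you keep an inventory of your versions, the rule of thumb above can be checked with a short Python sketch. Two things here are my assumptions, not confirmed by the post: that versions are dotted numbers compared on their first two components, and that "3 versions lower" means the zDC may lag the zRemote by up to 3 on the second component. The version numbers and `check_zdc_version` are made up for illustration.

```python
def parse_version(text):
    """Split a dotted version such as '1.261.5' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def check_zdc_version(zdc, zremote, max_lag=3):
    """Compare zDC and zRemote versions on their first two components."""
    zdc_v, zremote_v = parse_version(zdc), parse_version(zremote)
    # Stated in the post: a zDC newer than the zRemote might cause problems.
    if zdc_v[:2] > zremote_v[:2]:
        return "zDC is newer than zRemote: problems may occur"
    # Assumption: the allowed lag is counted on the second version component.
    if zdc_v[0] != zremote_v[0] or zremote_v[1] - zdc_v[1] > max_lag:
        return "zDC is more than %d versions behind zRemote" % max_lag
    return "ok"


# Hypothetical version numbers, for illustration only.
print(check_zdc_version("1.260", "1.261"))
print(check_zdc_version("1.262", "1.261"))
print(check_zdc_version("1.250", "1.261"))
```

The same comparison can be repeated for zLocal against zRemote if you track that version separately.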