Zabbix Issues

From UVOO Tech Wiki
Revision as of 12:58, 14 July 2020 by Busk

Source: https://www.zabbix.com/forum/zabbix-troubleshooting-and-problems/400962-solving-the-alert-zabbix-unreachable-poller-processes-more-than-75-busy

Solving the alert: Zabbix unreachable poller processes more than 75% busy
11-05-2020, 14:30
Hello,

We were plagued for a long time by the alert: Zabbix unreachable poller processes more than 75% busy

There is a lot of info about this message on the net, but none of it really helped me. My main problem was finding out what exactly the unreachable pollers were doing. So I thought I'd share what I've discovered, even if it might not be 100% correct. I am still a Zabbix newbie, so feel free to correct me where necessary or to suggest a better methodology.

STEP 1: Cleaning up unreachable items
Go to Configuration > Hosts and click on the 'Items' link of any host.
Open the filter and reset all fields to empty/all/... IMPORTANT: This includes the 'Host' field that was just filled in.
Change State from 'all' to 'Not supported'. This will cause Status to change to 'Enabled'.
Applying the filter produces a report of all items that cannot be polled. Unfortunately, it also includes items from disabled hosts. I disabled every item that had no chance of becoming available.
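
The same list can probably also be pulled from the Zabbix API, which may help with the disabled-hosts noise. The following is only a sketch, assuming a Zabbix version from around 2020 where the token is passed in the "auth" field of the request body; the function name build_item_query and the variables TOKEN and ZABBIX_URL are made up for illustration. It asks item.get for items in state 1 (not supported) and sets "monitored" so that only enabled items on monitored hosts are returned.

```shell
#!/bin/sh
# Build a JSON-RPC request for item.get that returns unsupported items
# (state 1) belonging to enabled items on monitored hosts only.
# $1: API auth token (placeholder, obtained beforehand via user.login)
build_item_query() {
    printf '{"jsonrpc":"2.0","method":"item.get","params":{"output":["itemid","name","key_","error"],"selectHosts":["host"],"filter":{"state":1},"monitored":true},"auth":"%s","id":1}\n' "$1"
}

# Hypothetical usage (TOKEN and ZABBIX_URL are assumptions, not from the post):
#   curl -s -H 'Content-Type: application/json-rpc' \
#        -d "$(build_item_query "$TOKEN")" "$ZABBIX_URL/api_jsonrpc.php"
```

The "error" field in the output should contain the reason why each item is not supported, which makes it easier to decide what can be disabled.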

STEP 2: Cleaning up unreachable hosts.
Go again to Configuration > Hosts
Look at the 'Availability' column with the red/green LEDs for ZBX|SNMP|JMX|IPMI.
Every host with a red LED takes up capacity from an unreachable poller.
Again, I disabled any host that would never come up again.

STEP 3: Finding out what the unreachable pollers are doing.

This is what led me to discover step 2.
Open a Linux terminal and do something like ps axu | grep -i unreachable
Look for unreachable pollers that are slow. E.g. I had some reporting 1 item in 60 seconds. Note the PID (of that individual poller, not of the whole Zabbix process).
Use strace to find out what that poller is doing, e.g. strace -p 1234
I got some IO on an IP address (bingo) and a select on fd 0 with a timeout of 30 seconds.
To see what an fd number refers to, do something like ls -hal /proc/1234/fd/0 (here PID 1234 and fd 0). You can now see which file/socket/... is causing the slowdown.
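
The steps above can be roughly scripted. This is a minimal sketch, not a tested tool: it assumes the poller process titles look like "... unreachable poller #1 [got 1 values in 60.002134 sec, ...]", which may differ between Zabbix versions. The function names slow_unreachable_pollers and fd_target are made up for illustration.

```shell
#!/bin/sh
# Print "PID SECONDS" for every unreachable poller whose last cycle took
# longer than the given threshold (default 10 seconds).
# Reads "PID COMMAND..." lines on stdin, e.g. from: ps -eo pid=,args=
slow_unreachable_pollers() {
    threshold="${1:-10}"
    awk -v limit="$threshold" '
        /unreachable poller/ {
            # Assumed title format: "[got N values in S sec, ...]"
            if (match($0, /in [0-9.]+ sec/)) {
                # Strip the leading "in " and the trailing " sec"
                secs = substr($0, RSTART + 3, RLENGTH - 7)
                if (secs + 0 > limit) print $1, secs
            }
        }'
}

# Show what a file descriptor of a process points to (file, socket, ...).
# $1: PID, $2: fd number
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# Example (run as root or as the zabbix user):
#   ps -eo pid=,args= | slow_unreachable_pollers 10
#   strace -p 1234        # watch what the slow poller is waiting on
#   fd_target 1234 0      # resolve the fd seen in the strace output
```

The threshold of 10 seconds is arbitrary; pick whatever separates the healthy pollers from the stuck ones on your system.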
This also yielded an interesting fact: