Microceph Troubleshooting

Change OSD status from down to up

# Enter the MicroCeph LXD instance that hosts the down OSD
lxc shell microceph<X>
# Restart the snap-managed OSD service, then confirm the OSD shows as up
systemctl restart snap.microceph.osd.service
ceph osd tree

If a Ceph OSD (Object Storage Daemon) is reported as down while the cluster node itself is still up, there are several possible causes. Here are some common troubleshooting steps:

Check OSD Status: Use the ceph osd tree command to check the status of OSDs in your cluster. Look for the OSD that is reported as down and note its ID.
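
A quick way to spot the down OSD and its ID (the IDs, weights, and hostnames below are illustrative):

ceph osd tree
# ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
# -1         0.02939  root default
# -3         0.00980      host microceph1
#  0    ssd  0.00980          osd.0            up   1.00000  1.00000
# -5         0.00980      host microceph2
#  2    ssd  0.00980          osd.2          down   1.00000  1.00000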

Review OSD Logs: Check the logs for the OSD that is down. On a package-based Ceph install these are typically in /var/log/ceph/ceph-osd.{osd_id}.log on the node where the OSD runs; MicroCeph deployments usually write them under the snap's data directory instead (/var/snap/microceph/common/logs/). Look for any error messages or warnings that might indicate why the OSD went down.
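
A minimal sketch, assuming OSD ID 2 is the one that is down (adjust the ID and paths for your deployment):

# Stock Ceph log location
tail -n 100 /var/log/ceph/ceph-osd.2.log
# MicroCeph snap deployments typically log here instead
tail -n 100 /var/snap/microceph/common/logs/ceph-osd.2.log
# The systemd journal for the snap service also captures daemon output
journalctl -u snap.microceph.osd.service --since "1 hour ago"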

Check OSD Processes: Verify that the ceph-osd process is actually running on its node. You can use tools like ps or systemctl to check the status of the OSD process.
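
For example (the service name matches the MicroCeph snap unit used at the top of this page):

# Look for a running ceph-osd process (the [c] trick excludes the grep itself)
ps aux | grep '[c]eph-osd'
# Check the snap-managed OSD service state
systemctl status snap.microceph.osd.service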

Storage Issues: Ensure that the storage device or partition where the OSD's data resides is accessible and functioning properly. Check for any hardware failures or storage-related issues that might be affecting the OSD.
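
Some generic checks (smartctl comes from the smartmontools package; /dev/sdX is a placeholder for the OSD's backing device):

# Confirm the backing device/partition is visible and mounted as expected
lsblk -f
# Scan recent kernel messages for I/O errors
dmesg -T | grep -iE 'error|fail'
# Query drive health (replace sdX with the actual device)
smartctl -a /dev/sdX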

Network Connectivity: Verify that there are no network connectivity issues between the OSD node and other nodes in the cluster. Ensure that the OSD node can communicate with the Ceph monitors and other OSD nodes.
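
A rough sketch (mon_host and osd_host are placeholders; 3300 and 6789 are the default monitor ports, and OSDs listen in the 6800-7300 range by default):

# List monitor addresses known to the cluster
ceph mon dump
# Basic reachability
ping -c 3 <mon_host>
# Port checks (nc is from netcat)
nc -zv <mon_host> 3300
nc -zv <mon_host> 6789
nc -zv <osd_host> 6800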

Firewall and Security: Check firewall rules and security settings to ensure that they are not blocking communication between OSD nodes or between OSD nodes and other components of the Ceph cluster.
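
If the hosts use ufw, for example (the ports are Ceph's defaults, as above):

sudo ufw status verbose
# Allow Ceph monitor and OSD traffic between cluster nodes
sudo ufw allow 3300/tcp
sudo ufw allow 6789/tcp
sudo ufw allow 6800:7300/tcp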

Recovery: If the OSD went down due to a transient issue, try restarting the OSD process. On a package-based Ceph install that is systemctl restart ceph-osd@{osd_id}; under MicroCeph, restart the snap service instead (systemctl restart snap.microceph.osd.service, as at the top of this page). Perform a manual recovery if necessary.
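
For example, assuming osd.2 was also marked out after going down (the ID is illustrative):

# Bring the OSD back into data placement
ceph osd in 2
# Watch the cluster recover in real time
ceph -w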

Cluster Health: Overall cluster health can also impact individual OSDs. Check the overall health of the Ceph cluster using ceph -s and address any issues reported there.
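
For example:

# One-shot summary of cluster state
ceph -s
# Expanded explanation of any HEALTH_WARN/HEALTH_ERR conditions
ceph health detail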

If these steps don't resolve the issue, dig deeper into your cluster's logs and configuration to identify the root cause, and consult the Ceph documentation or community resources for further help.

A kernel log entry like the following points at a failing backing device (dev zd64 is a ZFS zvol):

Apr 10 17:25:44 lxd2 kernel: Buffer I/O error on dev zd64, logical block 2, lost async page write

If the OSD sits on a ZFS zvol, check and scrub the underlying pool on the host:

# Report pool health and any known data errors
zpool status
# Start a scrub of the pool backing the zvol (pool name is a placeholder)
zpool scrub <pool>