Ceph 15.2.8: problems and workarounds when starting an ordinary OSD node
In a test environment I built a Ceph cluster with ceph version 15.2.8 on three servers. The setup hit plenty of pitfalls along the way, but most of them could be resolved with some searching, and the environment eventually came up fine.
This ceph version 15.2.8 cluster is a containerized installation, but day-to-day administration is still done with the plain ceph command, so in practice it feels much like earlier, package-based installations.
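Since this is a cephadm-managed containerized cluster, the same commands can also be run from inside the cephadm shell. A minimal sketch of standard cephadm usage (nothing here is specific to this cluster):

# Open a shell in a temporary container with the cluster config and keyring mounted
cephadm shell

# Or run a single command without an interactive shell
cephadm shell -- ceph versions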
Here is the cluster status from ceph -s (the clock-skew warning can be ignored for now):

  cluster:
    id:     cf8ecbc0-6559-11eb-84de-fa163eaa0aa6
    health: HEALTH_WARN
            clock skew detected on mon.UAT-CEPH-MASTER03, mon.UAT-CEPH-MASTER02

  services:
    mon: 3 daemons, quorum UAT-CEPH-MASTER01,UAT-CEPH-MASTER03,UAT-CEPH-MASTER02 (age 11m)
    mgr: UAT-CEPH-MASTER02.kgpbkj(active, since 41m), standbys: UAT-CEPH-MASTER01.fivcwu
    mds: uat_cephfs:1 {0=uat_cephfs.UAT-CEPH-MASTER03.eixgkh=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 11m), 9 in (since 36m)

  data:
    pools:   3 pools, 49 pgs
    objects: 33 objects, 3.3 KiB
    usage:   9.8 GiB used, 590 GiB / 600 GiB avail
    pgs:     49 active+clean
Below are the Ceph processes running on one of the nodes. Note osd.8, an ordinary OSD:
[root@UAT-CEPH-MASTER01 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
773239687d8f ceph/ceph:v15 "/usr/bin/ceph-osd -…" 7 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.8
7f6a57630ee1 ceph/ceph:v15 "/usr/bin/ceph-osd -…" 7 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.4
3d437ca305e2 ceph/ceph:v15 "/usr/bin/ceph-osd -…" 7 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.1
daa4d4f43184 ceph/ceph:v15 "/usr/bin/ceph-mds -…" 8 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mds.uat_cephfs.UAT-CEPH-MASTER01.mixizk
1630dc65c900 prom/node-exporter:v0.18.1 "/bin/node_exporter …" 8 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-node-exporter.UAT-CEPH-MASTER01
28e9b12cac7d ceph/ceph-grafana:6.6.2 "/bin/sh -c 'grafana…" 8 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-grafana.UAT-CEPH-MASTER01
ae74428f5a94 ceph/ceph:v15 "/usr/bin/ceph-mgr -…" 8 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mgr.UAT-CEPH-MASTER01.fivcwu
aded21660f45 ceph/ceph:v15 "/usr/bin/ceph-crash…" 8 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-crash.UAT-CEPH-MASTER01
4502d3f07648 prom/alertmanager:v0.20.0 "/bin/alertmanager -…" 8 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-alertmanager.UAT-CEPH-MASTER01
7f8347395108 prom/prometheus:v2.18.1 "/bin/prometheus --c…" 8 minutes ago Up 8 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-prometheus.UAT-CEPH-MASTER01
bccca1eaec17 ceph/ceph:v15 "/usr/bin/ceph-mon -…" 8 minutes ago Up 7 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mon.UAT-CEPH-MASTER01
Many commands worked without issue, e.g. ceph osd in/dump/out/tree/down/stop and so on; a few of them are shown below.
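For reference, some of those everyday OSD commands, using osd.8 purely as an example:

# Mark an OSD out (data rebalances away from it) and back in
ceph osd out osd.8
ceph osd in osd.8

# Mark an OSD down; a daemon that is still running will normally be marked up again shortly
ceph osd down osd.8

# Show the CRUSH tree with each OSD's up/down status and weights
ceph osd tree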
During testing, however, I ran into a seemingly simple problem that bothered me for a long time, and for which I could only find a temporary workaround.
Stopping the OSD with ceph osd stop osd.8:
[root@UAT-CEPH-MASTER01 ~]# ceph osd stop osd.8
stop osd.8.
[root@UAT-CEPH-MASTER01 ~]# ceph -s
  cluster:
    id:     cf8ecbc0-6559-11eb-84de-fa163eaa0aa6
    health: HEALTH_WARN
            clock skew detected on mon.UAT-CEPH-MASTER03, mon.UAT-CEPH-MASTER02
            1 osds down
            Reduced data availability: 6 pgs inactive, 23 pgs peering

  services:
    mon: 3 daemons, quorum UAT-CEPH-MASTER01,UAT-CEPH-MASTER03,UAT-CEPH-MASTER02 (age 19m)
    mgr: UAT-CEPH-MASTER02.kgpbkj(active, since 49m), standbys: UAT-CEPH-MASTER01.fivcwu
    mds: uat_cephfs:1 {0=uat_cephfs.UAT-CEPH-MASTER03.eixgkh=up:active} 2 up:standby
    osd: 9 osds: 8 up (since 2s), 9 in (since 45m)

  data:
    pools:   3 pools, 49 pgs
    objects: 33 objects, 3.3 KiB
    usage:   9.8 GiB used, 590 GiB / 600 GiB avail
    pgs:     46.939% pgs not active
             26 active+clean
             23 peering

[root@UAT-CEPH-MASTER01 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                    STATUS  REWEIGHT  PRI-AFF
-1         0.58585  root default
-3         0.19528      host UAT-CEPH-MASTER01
 1    hdd  0.04880          osd.1                    up   1.00000  1.00000
 4    hdd  0.04880          osd.4                    up   1.00000  1.00000
 8    hdd  0.09769          osd.8                  down   1.00000  1.00000
-5         0.19528      host UAT-CEPH-MASTER02
 0    hdd  0.04880          osd.0                    up   1.00000  1.00000
 3    hdd  0.04880          osd.3                    up   1.00000  1.00000
 6    hdd  0.09769          osd.6                    up   1.00000  1.00000
-7         0.19528      host UAT-CEPH-MASTER03
 2    hdd  0.04880          osd.2                    up   1.00000  1.00000
 5    hdd  0.04880          osd.5                    up   1.00000  1.00000
 7    hdd  0.09769          osd.7                    up   1.00000  1.00000
[root@UAT-CEPH-MASTER01 ~]#
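While the OSD is down, the standard health commands show exactly which PGs are affected; for instance:

# Expand HEALTH_WARN into per-item detail, including the downed OSD
ceph health detail

# List PGs stuck in an inactive state
ceph pg dump_stuck inactive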
Then I tried all sorts of ways to bring the manually stopped OSD back up:
[root@UAT-CEPH-MASTER01 ~]# ceph osd start osd.8
no valid command found; 10 closest matches:
osd perf
osd df [plain|tree] [class|name] [<filter>]
osd blocked-by
osd pool stats [<pool_name>]
osd pool scrub <who>...
osd pool deep-scrub <who>...
osd pool repair <who>...
osd pool force-recovery <who>...
osd pool force-backfill <who>...
osd pool cancel-force-recovery <who>...
Error EINVAL: invalid command
[root@UAT-CEPH-MASTER01 ~]# ceph osd up osd.8
no valid command found; 10 closest matches:
osd perf
osd df [plain|tree] [class|name] [<filter>]
osd blocked-by
osd pool stats [<pool_name>]
osd pool scrub <who>...
osd pool deep-scrub <who>...
osd pool repair <who>...
osd pool force-recovery <who>...
osd pool force-backfill <who>...
osd pool cancel-force-recovery <who>...
Error EINVAL: invalid command
[root@UAT-CEPH-MASTER01 ~]# /etc/init.d/
multi-queue-ecloud netconsole network
[root@UAT-CEPH-MASTER01 ~]# /etc/init.d/init-ceph start osd.8
-bash: /etc/init.d/init-ceph: No such file or directory
[root@UAT-CEPH-MASTER01 ~]# systemctl start ceph-osd@8
Failed to start ceph-osd@8.service: Unit not found.
[root@UAT-CEPH-MASTER01 ~]# systemctl start ceph-osd@osd.8
Failed to start ceph-osd@osd.8.service: Unit not found.
[root@UAT-CEPH-MASTER01 ~]# systemctl start ceph osd.8
Failed to start ceph.service: Unit not found.
Failed to start osd.8.service: Unit not found.
[root@UAT-CEPH-MASTER01 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a43a580084aa ceph/ceph:v15 "/usr/bin/ceph-mgr -…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mgr.UAT-CEPH-MASTER01.fivcwu
22fd9c0c50d2 ceph/ceph:v15 "/usr/bin/ceph-mds -…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mds.uat_cephfs.UAT-CEPH-MASTER01.mixizk
53f71890ba19 ceph/ceph:v15 "/usr/bin/ceph-osd -…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.4
505867e8cac2 ceph/ceph:v15 "/usr/bin/ceph-osd -…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.1
6dcb14c66b69 ceph/ceph:v15 "/usr/bin/ceph-crash…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-crash.UAT-CEPH-MASTER01
c38fbdbcc6f2 prom/prometheus:v2.18.1 "/bin/prometheus --c…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-prometheus.UAT-CEPH-MASTER01
aeb1c59a9f8b prom/alertmanager:v0.20.0 "/bin/alertmanager -…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-alertmanager.UAT-CEPH-MASTER01
ca788ced4096 ceph/ceph:v15 "/usr/bin/ceph-mon -…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mon.UAT-CEPH-MASTER01
a3c39095f006 prom/node-exporter:v0.18.1 "/bin/node_exporter …" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-node-exporter.UAT-CEPH-MASTER01
ce53f5523294 ceph/ceph-grafana:6.6.2 "/bin/sh -c 'grafana…" 22 minutes ago Up 22 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-grafana.UAT-CEPH-MASTER01
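Two things stand out above: the osd.8 container has disappeared from docker ps entirely, and none of the classic ceph-osd@<id> unit names exist on this host. The unit names that do exist can be listed directly:

# List every Ceph-related systemd unit actually present on the node
systemctl list-units 'ceph*'

# Confirm which OSD containers are still running
docker ps --filter name=osd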
Below are the commands the official Ceph documentation gives for bringing an OSD back up, but they all failed as well:
https://docs.ceph.com/en/latest/rados/operations/monitoring-osd-pg/
[root@UAT-CEPH-MASTER01 ~]# systemctl start ceph-osd@8.services
Failed to start ceph-osd@8.services.service: Unit not found.
[root@UAT-CEPH-MASTER01 ~]# systemctl start ceph-osd@8.service
Failed to start ceph-osd@8.service: Unit not found.
[root@UAT-CEPH-MASTER01 ~]# sudo systemctl start ceph-osd@1
Failed to start ceph-osd@1.service: Unit not found.
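Those documented unit names apply to package-based installs. In a cephadm/containerized deployment the unit name embeds the cluster fsid (ceph-<fsid>@<daemon>.service), so, assuming that convention holds here, the unit for osd.8 would presumably be the one below; verify the exact name with systemctl list-units 'ceph*' first.

# Hypothetical unit name built from this cluster's fsid
systemctl start ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6@osd.8.service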
In the end I did get the OSD back up. The first approach was rather extreme: reboot the server that hosts osd.8; after the reboot, the OSD came back up normally.
Later I looked for another method that avoids rebooting the OS. The one below is only slightly better, and I am still not fully satisfied with it.
From the commands below we can see that all the Ceph services on osd.8's server hang off the systemd target ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.target, so restarting that whole target also brings the downed OSD back up.

[root@UAT-CEPH-MASTER01 ~]# systemctl status ceph-*.target
● ceph-c955f00a-5e40-11eb-aec4-fa163eaa0aa6.target - Ceph cluster c955f00a-5e40-11eb-aec4-fa163eaa0aa6
   Loaded: loaded (/etc/systemd/system/ceph-c955f00a-5e40-11eb-aec4-fa163eaa0aa6.target; enabled; vendor preset: disabled)
   Active: active since Thu 2021-02-04 22:09:11 CST; 59min ago
Feb 04 22:09:11 UAT-CEPH-MASTER01 systemd[1]: Reached target Ceph cluster c955f00a-5e40-11eb-aec4-fa163eaa0aa6.
● ceph-e52185c2-6491-11eb-aac0-fa163eaa0aa6.target - Ceph cluster e52185c2-6491-11eb-aac0-fa163eaa0aa6
   Loaded: loaded (/etc/systemd/system/ceph-e52185c2-6491-11eb-aac0-fa163eaa0aa6.target; enabled; vendor preset: disabled)
   Active: active since Thu 2021-02-04 22:39:10 CST; 29min ago
Feb 04 22:39:10 UAT-CEPH-MASTER01 systemd[1]: Reached target Ceph cluster e52185c2-6491-11eb-aac0-fa163eaa0aa6.
● ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.target - Ceph cluster cf8ecbc0-6559-11eb-84de-fa163eaa0aa6
   Loaded: loaded (/etc/systemd/system/ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.target; enabled; vendor preset: disabled)
   Active: active since Thu 2021-02-04 22:40:06 CST; 28min ago
Feb 04 22:40:06 UAT-CEPH-MASTER01 systemd[1]: Reached target Ceph cluster cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.
[root@UAT-CEPH-MASTER01 ~]# ll /var/lib/ceph/cf8ecbc0-6559-11eb-84de-fa163eaa0aa6/
total 48
drwx------ 3 nfsnobody nfsnobody 4096 Feb  2 21:24 alertmanager.UAT-CEPH-MASTER01
drwx------ 3 ceph      ceph      4096 Feb  2 21:23 crash
drwx------ 2 ceph      ceph      4096 Feb  2 21:24 crash.UAT-CEPH-MASTER01
drwx------ 4 chrony    upatch    4096 Feb  2 21:24 grafana.UAT-CEPH-MASTER01
drwx------ 2 ceph      ceph      4096 Feb  2 21:41 mds.uat_cephfs.UAT-CEPH-MASTER01.mixizk
drwx------ 2 ceph      ceph      4096 Feb  2 21:23 mgr.UAT-CEPH-MASTER01.fivcwu
drwx------ 3 ceph      ceph      4096 Feb  2 21:23 mon.UAT-CEPH-MASTER01
drwx------ 2 nfsnobody nfsnobody 4096 Feb  2 21:24 node-exporter.UAT-CEPH-MASTER01
drwx------ 2 ceph      ceph      4096 Feb  4 22:39 osd.1
drwx------ 2 ceph      ceph      4096 Feb  4 22:39 osd.4
drwx------ 2 ceph      ceph      4096 Feb  4 22:39 osd.8
drwx------ 4 nfsnobody nfsnobody 4096 Feb  2 21:24 prometheus.UAT-CEPH-MASTER01
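Before restarting the whole target, its member units can be enumerated; this also reveals any per-daemon unit that might be restartable on its own:

# Show every unit pulled in by the cluster target
systemctl list-dependencies ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.target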
[root@UAT-CEPH-MASTER01 ~]# ceph -s
  cluster:
    id:     cf8ecbc0-6559-11eb-84de-fa163eaa0aa6
    health: HEALTH_WARN
            clock skew detected on mon.UAT-CEPH-MASTER03, mon.UAT-CEPH-MASTER02

  services:
    mon: 3 daemons, quorum UAT-CEPH-MASTER01,UAT-CEPH-MASTER03,UAT-CEPH-MASTER02 (age 31m)
    mgr: UAT-CEPH-MASTER02.kgpbkj(active, since 62m), standbys: UAT-CEPH-MASTER01.fivcwu
    mds: uat_cephfs:1 {0=uat_cephfs.UAT-CEPH-MASTER03.eixgkh=up:active} 2 up:standby
    osd: 9 osds: 8 up (since 12m), 8 in (since 2m)

  data:
    pools:   3 pools, 49 pgs
    objects: 33 objects, 3.3 KiB
    usage:   8.7 GiB used, 491 GiB / 500 GiB avail
    pgs:     49 active+clean

[root@UAT-CEPH-MASTER01 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a43a580084aa ceph/ceph:v15 "/usr/bin/ceph-mgr -…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mgr.UAT-CEPH-MASTER01.fivcwu
22fd9c0c50d2 ceph/ceph:v15 "/usr/bin/ceph-mds -…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mds.uat_cephfs.UAT-CEPH-MASTER01.mixizk
53f71890ba19 ceph/ceph:v15 "/usr/bin/ceph-osd -…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.4
505867e8cac2 ceph/ceph:v15 "/usr/bin/ceph-osd -…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.1
6dcb14c66b69 ceph/ceph:v15 "/usr/bin/ceph-crash…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-crash.UAT-CEPH-MASTER01
c38fbdbcc6f2 prom/prometheus:v2.18.1 "/bin/prometheus --c…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-prometheus.UAT-CEPH-MASTER01
aeb1c59a9f8b prom/alertmanager:v0.20.0 "/bin/alertmanager -…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-alertmanager.UAT-CEPH-MASTER01
ca788ced4096 ceph/ceph:v15 "/usr/bin/ceph-mon -…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mon.UAT-CEPH-MASTER01
a3c39095f006 prom/node-exporter:v0.18.1 "/bin/node_exporter …" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-node-exporter.UAT-CEPH-MASTER01
ce53f5523294 ceph/ceph-grafana:6.6.2 "/bin/sh -c 'grafana…" 31 minutes ago Up 31 minutes ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-grafana.UAT-CEPH-MASTER01
[root@UAT-CEPH-MASTER01 ~]# systemctl status ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.target
● ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.target - Ceph cluster cf8ecbc0-6559-11eb-84de-fa163eaa0aa6
   Loaded: loaded (/etc/systemd/system/ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.target; enabled; vendor preset: disabled)
   Active: active since Thu 2021-02-04 22:40:06 CST; 32min ago
Feb 04 22:40:06 UAT-CEPH-MASTER01 systemd[1]: Reached target Ceph cluster cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.
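Note that by this point osd.8 is not only down but also out ("8 in (since 2m)"): by default the monitors mark a down OSD out after mon_osd_down_out_interval (600 seconds), which matches the timeline above. The current value can be checked with:

# Default is 600 seconds (10 minutes)
ceph config get mon mon_osd_down_out_interval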
[root@UAT-CEPH-MASTER01 ~]# systemctl restart ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6.target
[root@UAT-CEPH-MASTER01 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
73714477b474 ceph/ceph:v15 "/usr/bin/ceph-mgr -…" 11 seconds ago Up 10 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mgr.UAT-CEPH-MASTER01.fivcwu
a77a70a5f0c5 ceph/ceph:v15 "/usr/bin/ceph-osd -…" 18 seconds ago Up 17 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.4
887021eebf6b ceph/ceph:v15 "/usr/bin/ceph-osd -…" 18 seconds ago Up 17 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.1
7a31f6948f13 ceph/ceph:v15 "/usr/bin/ceph-osd -…" 20 seconds ago Up 18 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-osd.8
d730eef29dca ceph/ceph:v15 "/usr/bin/ceph-mds -…" 20 seconds ago Up 19 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mds.uat_cephfs.UAT-CEPH-MASTER01.mixizk
4da43f0ae2c1 ceph/ceph:v15 "/usr/bin/ceph-crash…" 20 seconds ago Up 19 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-crash.UAT-CEPH-MASTER01
d624a1bda027 ceph/ceph:v15 "/usr/bin/ceph-mon -…" 21 seconds ago Up 19 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-mon.UAT-CEPH-MASTER01
ff92845c1d48 prom/alertmanager:v0.20.0 "/bin/alertmanager -…" 21 seconds ago Up 19 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-alertmanager.UAT-CEPH-MASTER01
65c9d687b70b prom/prometheus:v2.18.1 "/bin/prometheus --c…" 21 seconds ago Up 19 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-prometheus.UAT-CEPH-MASTER01
d158f6a7a5ef ceph/ceph-grafana:6.6.2 "/bin/sh -c 'grafana…" 21 seconds ago Up 19 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-grafana.UAT-CEPH-MASTER01
1a4ee30df46d prom/node-exporter:v0.18.1 "/bin/node_exporter …" 21 seconds ago Up 19 seconds ceph-cf8ecbc0-6559-11eb-84de-fa163eaa0aa6-node-exporter.UAT-CEPH-MASTER01
[root@UAT-CEPH-MASTER01 ~]# ceph -s
  cluster:
    id:     cf8ecbc0-6559-11eb-84de-fa163eaa0aa6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum UAT-CEPH-MASTER01,UAT-CEPH-MASTER03,UAT-CEPH-MASTER02 (age 18s)
    mgr: UAT-CEPH-MASTER02.kgpbkj(active, since 63m), standbys: UAT-CEPH-MASTER01.fivcwu
    mds: uat_cephfs:1 {0=uat_cephfs.UAT-CEPH-MASTER03.eixgkh=up:active} 2 up:standby
    osd: 9 osds: 9 up (since 17s), 9 in (since 17s)

  data:
    pools:   3 pools, 49 pgs
    objects: 33 objects, 3.3 KiB
    usage:   9.8 GiB used, 590 GiB / 600 GiB avail
    pgs:     49 active+clean

[root@UAT-CEPH-MASTER01 ~]#
This only solves the problem for the moment; a simpler method still needs to be found.
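One candidate worth trying (untested in this environment) is cephadm's orchestrator interface, which in Octopus can manage individual containerized daemons; if it works as documented, it would avoid touching the other services under the target:

# Untested here: start or restart a single daemon via the orchestrator
ceph orch daemon start osd.8
ceph orch daemon restart osd.8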