Hadoop Dropped-Disk Alerting | Welcome to the Little World of Falling Petals

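On one node of a Hadoop cluster at one of our sites, the disk device names got scrambled. After rebooting into the LSI RAID card interface, one 4 TB disk showed an abnormal state (a red JBOD instead of a green Online). An ops colleague restored the disk to its normal RAID state and rebooted.
Once the system was back up, the disk's original partitions were gone, so it was repartitioned, formatted as ext4, and mounted by hand (sdk = data10), after which everything was OK.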
Install the MegaCLI tool:
# yum localinstall MegaCli-8.07.10-1.noarch.rpm
# /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL |grep "DISK GROUP"

# /opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0 |grep "Firmware state: Online" |wc -l

# /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL |grep "RAID Level" |tail -1 |awk -F: '{print $1":"$2}'
# /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL |grep -c "Non Coerced Size"
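
Counting the Online lines tells you that a disk is missing, but not which one. As a rough sketch, the same -Pdlist output can be filtered so that it prints the slot of every drive that is not Online. The function names `list_bad_slots` and `sample_pdlist` are made up for this example, and the sample text is illustrative rather than captured from a real controller; it assumes each drive's "Slot Number" line appears before its "Firmware state" line.

```shell
#!/bin/bash
# Print "slot: state" for every physical drive whose firmware state
# is not Online. Reads MegaCli -Pdlist output on stdin.
# (list_bad_slots is a hypothetical helper name.)
list_bad_slots() {
    awk -F': *' '
        /^Slot Number/    { slot = $2 }
        /^Firmware state/ { if ($2 !~ /^Online/) print slot ": " $2 }
    '
}

# Illustrative sample only, trimmed to the lines the filter looks at.
sample_pdlist() {
    cat <<'EOF'
Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online, Spun Up
Enclosure Device ID: 32
Slot Number: 1
Firmware state: JBOD
Enclosure Device ID: 32
Slot Number: 2
Firmware state: Online, Spun Up
EOF
}

sample_pdlist | list_bad_slots
```

On a real node the input would presumably come from `/opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0 | list_bad_slots` instead of the sample.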

mutt + msmtp alert script:

#!/bin/bash
# Check every node listed in ./hadoop.ip for offline disks and mail an alert via mutt.

online=13                       # expected number of Online disks per node
MAIL_TO_ARR[1]=xxxxx@163.com    # alert recipients, indexed from 1
MAIL_TO_NUM=1                   # number of recipients

for ips in $(cat ./hadoop.ip); do
    # Count the physical drives the RAID controller reports as Online.
    diskonline=$(ssh -p 2233 root@"${ips}" '/opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0 |grep "Firmware state: Online" |wc -l')
    if [ "$diskonline" = "$online" ]; then
        echo -e "\033[32m ${ips} \033[0m disk is all \033[32m online \033[0m "
    else
        # An empty result (for example, ssh failed) is treated as 0 disks online.
        num=$(( online - ${diskonline:-0} ))
        J=1
        while [ "$J" -le "$MAIL_TO_NUM" ]; do
            echo "${ips} : ${num} disk is maybe offline, please check!!" | mutt -s "${ips} : ${num} disk is maybe offline" "${MAIL_TO_ARR[$J]}"
            J=$(( J + 1 ))
        done
    fi
done
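
One weak spot in a check like this is the case where ssh returns nothing, because the subtraction then has no number to work with. Below is a minimal sketch of separating "N disks missing" from "could not get a count". The function name `missing_disks` is hypothetical and not part of the script above.

```shell
#!/bin/bash
# missing_disks EXPECTED ACTUAL
# Prints how many disks are missing. Prints "unknown" and returns 1
# when ACTUAL is empty or not a plain number (for example, ssh failed).
missing_disks() {
    local expected=$1 actual=$2
    case $actual in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    echo $(( expected - actual ))
}

missing_disks 13 13          # prints 0
missing_disks 13 11          # prints 2
missing_disks 13 "" || true  # prints unknown
```

In the alert loop, an "unknown" result could be mailed with a different subject, so that an unreachable node is not reported as 13 missing disks.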
