Hadoop dropped-disk alerting

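On a node in a Hadoop cluster at one of our sites, the disk device names got out of order. After a reboot into the LSI RAID card interface, one 4T disk showed an abnormal state (a red JBOD instead of the green Online). An ops colleague restored the disk to its normal RAID state and rebooted.
Once the system was up, the disk no longer had its original partition. It was re-partitioned, formatted as ext4, and mounted by hand (sdk = data10), after which everything was OK.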
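
As a rough sketch of the re-partition, format, and mount step, the function below only prints the commands rather than running them, since they destroy data on the target disk. The device path /dev/sdk and mount point /data10 are assumptions read from "sdk = data10", the function name plan_disk_rebuild is hypothetical, and the GPT label is assumed because the disk is 4T. The post does not say which partitioning tool was actually used.

```shell
#!/bin/bash
# Print (do not run) one plausible command sequence for rebuilding a data disk.
# Arguments: whole-disk device (assumed /dev/sdk) and mount point (assumed /data10).
plan_disk_rebuild() {
    local disk=$1 mount_point=$2
    # A 4T disk is too large for an MBR label, so a GPT label is assumed here.
    echo "parted -s ${disk} mklabel gpt"
    # One partition covering the whole disk.
    echo "parted -s ${disk} mkpart primary ext4 0% 100%"
    # Create the ext4 filesystem on the new partition.
    echo "mkfs.ext4 ${disk}1"
    # Mount it at the data directory.
    echo "mount ${disk}1 ${mount_point}"
}

plan_disk_rebuild /dev/sdk /data10
```

Review the printed commands against the real device name before running any of them by hand.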
Install the MegaCLI tool:
# yum localinstall MegaCli-8.07.10-1.noarch.rpm
# /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL |grep "DISK GROUP"

# /opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0 |grep "Firmware state: Online" |wc -l

## /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL |grep "RAID Level" |tail -1 |awk -F: '{print $1":"$2}'
# /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL |grep -c "Non Coerced Size"
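
The online-disk count above is just a line count over the physical-disk listing. A minimal sketch of that filter, written to read from stdin so it can be tried on a machine without a RAID card, might look like this. The function names count_online_disks and sample_pdlist are hypothetical, and the sample text is illustrative rather than captured from a real controller.

```shell
#!/bin/bash
# Count lines reporting an Online firmware state in -Pdlist style text on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit status clean.
count_online_disks() {
    grep -c "Firmware state: Online" || true
}

# Illustrative sample only: two online disks and one in JBOD state.
sample_pdlist() {
    cat <<'EOF'
Slot Number: 0
Firmware state: Online, Spun Up
Slot Number: 1
Firmware state: Online, Spun Up
Slot Number: 2
Firmware state: JBOD
EOF
}

sample_pdlist | count_online_disks
```

On a real node the same function would be fed by piping the MegaCli64 -Pdlist -a0 output into it.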

mutt + msmtp alert script:

#!/bin/bash
# ssh="ssh -p 2233"
online=13    # expected number of online disks on each node
MAIL_TO_ARR[1]=xxxxx@163.com
MAIL_TO_NUM=1

for ips in $(cat ./hadoop.ip); do
    # echo "${ips}"
    # Count the disks the RAID controller on the remote node reports as Online.
    diskonline=$(ssh -p 2233 "root@${ips}" '/opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0 |grep "Firmware state: Online" |wc -l')
    # echo "${ips}--diskonline--${diskonline}"
    if [ "$diskonline" = "$online" ]; then
        echo -e "\033[32m ${ips} \033[0m disk is all \033[32m online \033[0m "
    else
        # Number of disks that are missing or not Online.
        num=$((online - diskonline))
        # echo $num
        # echo -e "\033[31m ${ips} \033[0m ${num} disk is maybe \033[31m offline \033[0m "
        J=1
        # Send one mail to each address in MAIL_TO_ARR.
        while [ $J -le $MAIL_TO_NUM ]
        do
            # echo "${ips} : ${num} disk is maybe offline, please check!!" | mutt -s "${ips} : ${num} disk is maybe offline" xxxxx@xxxxxxx.com.cn
            echo "${ips} : ${num} disk is maybe offline, please check!!" | mutt -s "${ips} : ${num} disk is maybe offline" "${MAIL_TO_ARR[$J]}"
            let J++
        done
    fi
done
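
One gap in the script is that a failed ssh leaves diskonline empty, which breaks the subtraction. A possible way to guard against that, shown here only as a sketch, is to build the alert text in a small function that checks the count is numeric first. The function name disk_alert_message and the "could not read disk state" wording are hypothetical additions, not part of the original script.

```shell
#!/bin/bash
# Print the alert text for one node, or nothing when all expected disks are online.
# Arguments: host, expected online count, observed count (may be empty if ssh failed).
disk_alert_message() {
    local host=$1 expected=$2 observed=$3
    case $observed in
        ''|*[!0-9]*)
            # Empty or non-numeric count: report the read failure instead of doing arithmetic.
            echo "${host} : could not read disk state, please check!!"
            return 0
            ;;
    esac
    if [ "$observed" -lt "$expected" ]; then
        echo "${host} : $((expected - observed)) disk is maybe offline, please check!!"
    fi
}

disk_alert_message 10.0.0.1 13 12
```

The output could then be piped to mutt only when it is non-empty, keeping the mail loop unchanged.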
