Linux Clusters
LB: Load Balancing
concurrent processing capacity
HA: High Availability
availability = uptime / (uptime + failure-handling time)
99%, 99.9%, 99.99%
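Over one year (8760 hours), 99% availability allows roughly 87.6 hours of downtime, 99.9% roughly 8.76 hours, and 99.99% roughly 53 minutes.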
HP (HPC): High Performance Computing
parallel processing clusters
Distributed storage: distributed file systems
a mechanism that splits a large task into small tasks and processes them separately
rsync+inotify: file synchronization
sersync: file synchronization
fencing: isolation
node-level fencing: STONITH
resource-level fencing:
LVS
Linux Virtual Server
Hardware
F5 BIG-IP
Citrix NetScaler
A10
Software
Layer 4
LVS
Layer 7: reverse proxy
Nginx
http, smtp, pop3, imap
haproxy
http, tcp (mysql, smtp)
LVS
ipvsadm: command-line tool for managing cluster services
ipvs: kernel module
CIP --> VIP --> DIP --> RIP (client IP, virtual IP, director IP, real server IP)
Three LVS types
NAT: network address translation
cluster nodes and the director must be on the same IP network
RIPs are usually private addresses, used only for communication among cluster nodes
the director sits between clients and real servers and handles all inbound and outbound traffic
real servers must use the DIP as their gateway
port mapping is supported
real servers may run any operating system
in large deployments the director easily becomes the bottleneck
DR: direct routing
cluster nodes and the director must be on the same physical network
RIPs may be public addresses, which makes remote management and monitoring convenient
the director handles only inbound requests; response packets are sent by the real servers directly to the clients
real servers must not use the DIP as their gateway
port mapping is not supported
TUN: tunneling
cluster nodes may span the Internet
RIPs must be public addresses
the director handles only inbound requests; response packets are sent by the real servers directly to the clients
real servers must not use the director as their gateway
only operating systems that support tunneling can be used as real servers
port mapping is not supported
Static scheduling
rr: round robin
wrr: weighted round robin
sh: source hashing
dh: destination hashing
Dynamic scheduling
lc: least connections
overhead = active*256 + inactive
the server with the smallest overhead is chosen
wlc: weighted least connections
overhead = (active*256 + inactive)/weight
sed: shortest expected delay (see the worked example after this list)
overhead = (active+1)*256/weight
nq:never queue
lblc: locality-based least connections (a dynamic version of dh)
lblcr: locality-based least connections with replication
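A worked example of the wlc and sed formulas, with hypothetical numbers: RS-A has weight 1, active 0, inactive 0; RS-B has weight 5, active 2, inactive 0.
wlc: A = (0*256+0)/1 = 0, B = (2*256+0)/5 = 102.4 --> wlc picks A
sed: A = (0+1)*256/1 = 256, B = (2+1)*256/5 = 153.6 --> sed picks B
sed ignores inactive connections and favors the higher-weight server even when a low-weight server is idle.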
ipvsadm:
Managing cluster services
Add: -A -t|u|f service-address [-s scheduler]
-t: TCP cluster service
-u: UDP cluster service
-f: fwm, firewall mark
service-address: a mark number
#ipvsadm -A -t 172.16.100.1:80 -s rr
Modify: -E
Delete: -D -t|u|f service-address
Managing RSs in a cluster service
Add: -a -t|u|f service-address -r server-address [-g|i|m] [-w weight] [-x upper] [-y lower]
-t|u|f service-address: a previously defined cluster service
-r server-address: the address of an RS; in the NAT model, IP:port can be used to implement port mapping
[-g|i|m]: LVS type
-g:DR
-i:TUN
-m:NAT
[-w weight]: server weight
Modify: -e
Delete: -d -t|u|f service-address -r server-address
# ipvsadm -a -t 172.16.100.1:80 -r 192.168.10.8 -m
#ipvsadm -a -t 172.16.100.1:80 -r 192.168.10.9 -m
View:
-L|-l
-n: show host addresses and ports in numeric form
--stats: statistics
--rate: rates
--timeout: show session timeout values for tcp, tcpfin and udp
--sort: sort the output
-c: show current IPVS connections
Delete all cluster services
-C: flush all ipvs rules
Save rules
-S
#ipvsadm -S > /path/to/somefile
Reload previously saved rules:
-R
# ipvsadm -R < /path/to/somefile
Clock skew between the nodes should not exceed one second
NTP:Network Time Protocol
ntpdate timeserver (synchronize with a time server; the server must be running the ntp service)
NAT configuration example
director: VIP 172.16.100.1, DIP 192.168.10.7
# yum install -y ipvsadm
# ipvsadm -A -t 172.16.100.1:80 -s rr
# ipvsadm -a -t 172.16.100.1:80 -r 192.168.10.8 -m
# ipvsadm -a -t 172.16.100.1:80 -r 192.168.10.9 -m
# service ipvsadm save
# echo 1 >/proc/sys/net/ipv4/ip_forward
Change the scheduler to wrr
# ipvsadm -E -t 172.16.100.1:80 -s wrr
# ipvsadm -e -t 172.16.100.1:80 -r 192.168.10.8 -m -w 3
# ipvsadm -e -t 172.16.100.1:80 -r 192.168.10.9 -m -w 1
realserver1: 192.168.10.8, gateway 192.168.10.7
# yum install httpd -y
# ntpdate 192.168.10.7
# echo "my test one" >/var/httpd/html/index.html
# service httpd restart
realserver2: 192.168.10.9, gateway 192.168.10.7
# yum install httpd -y
# ntpdate 192.168.10.7
# echo "my test two" >/var/httpd/html/index.html
# service httpd restart
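A quick check from a client machine, assuming the two test pages above: repeated requests to the VIP should return both pages, alternating under rr and in a roughly 3:1 ratio under the wrr weights set earlier.
# for i in $(seq 1 8); do curl -s http://172.16.100.1; done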
DR:
arptables:
kernel parameter
arp_ignore: defines how to respond to received ARP requests
0: respond as long as any local interface holds the requested address
1: respond only if the target address is configured on the interface the request arrived on
arp_announce: defines how the host announces its own addresses
0: announce any local address on any interface
1: try to avoid announcing addresses that do not match the outgoing network
2: announce only addresses that match the network of the outgoing interface
DR configuration example
Director:
eth0, DIP: 172.16.100.2
eth0:0, VIP: 172.16.100.1
#ifconfig eth0:0 172.16.100.1/16
#route add -host 172.16.100.1 dev eth0:0
#ipvsadm -A -t 172.16.100.1:80 -s wlc
#ipvsadm -a -t 172.16.100.1:80 -r 172.16.100.7 -g -w 2
#ipvsadm -a -t 172.16.100.1:80 -r 172.16.100.8 -g -w 1
RS1:
eth0, RIP1: 172.16.100.7
lo:0, VIP: 172.16.100.1
#sysctl -w net.ipv4.conf.eth0.arp_announce=2
#sysctl -w net.ipv4.conf.all.arp_announce=2
#sysctl -w net.ipv4.conf.eth0.arp_ignore=1
#sysctl -w net.ipv4.conf.all.arp_ignore=1
#ifconfig lo:0 172.16.100.1 broadcast 172.16.100.1 netmask 255.255.255.255
#route add -host 172.16.100.1 dev lo:0
RS2:
eth0, RIP2: 172.16.100.8
lo:0, VIP: 172.16.100.1
#sysctl -w net.ipv4.conf.eth0.arp_announce=2
#sysctl -w net.ipv4.conf.all.arp_announce=2
#sysctl -w net.ipv4.conf.eth0.arp_ignore=1
#sysctl -w net.ipv4.conf.all.arp_ignore=1
#ifconfig lo:0 172.16.100.1 broadcast 172.16.100.1 netmask 255.255.255.255
#route add -host 172.16.100.1 dev lo:0
Example configuration scripts for the Director and real servers in DR mode:
Director script:
#!/bin/bash
#
# LVS script for VS/DR
# chkconfig: - 90 10
#
. /etc/rc.d/init.d/functions
#
VIP=172.16.100.1
DIP=172.16.100.2
RIP1=172.16.100.7
RIP2=172.16.100.8
PORT=80
RSWEIGHT1=2
RSWEIGHT2=5
#
case "$1" in
start)
/sbin/ifconfig eth0:0 $VIP broadcast $VIP netmask 255.255.255.255 up
/sbin/route add -host $VIP dev eth0:0
# Since this is the Director we must be able to forward packets
echo 1 > /proc/sys/net/ipv4/ip_forward
# Clear all iptables rules.
/sbin/iptables -F
# Reset iptables counters.
/sbin/iptables -Z
# Clear all ipvsadm rules/services.
/sbin/ipvsadm -C
# Add an IP virtual service for the VIP on port 80.
# This recipe uses the weighted least-connection (wlc) scheduling method.
/sbin/ipvsadm -A -t $VIP:80 -s wlc
# Now direct packets for this VIP to
# the real server IP (RIP) inside the cluster
/sbin/ipvsadm -a -t $VIP:80 -r $RIP1 -g -w $RSWEIGHT1
/sbin/ipvsadm -a -t $VIP:80 -r $RIP2 -g -w $RSWEIGHT2
/bin/touch /var/lock/subsys/ipvsadm &> /dev/null
;;
stop)
# Stop forwarding packets
echo 0 > /proc/sys/net/ipv4/ip_forward
# Reset ipvsadm
/sbin/ipvsadm -C
# Bring down the VIP interface
/sbin/ifconfig eth0:0 down
/sbin/route del -host $VIP
/bin/rm -f /var/lock/subsys/ipvsadm
echo "ipvs is stopped..."
;;
status)
if [ ! -e /var/lock/subsys/ipvsadm ]; then
echo "ipvsadm is stopped ..."
else
echo "ipvs is running ..."
ipvsadm -L -n
fi
;;
*)
echo "Usage: $0 {start|stop|status}"
;;
esac
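One way to deploy this script, assuming it is saved as lvs-director (a hypothetical name): install it as a SysV service on the Director; the chkconfig header above makes chkconfig --add work.
# cp lvs-director /etc/init.d/
# chmod +x /etc/init.d/lvs-director
# chkconfig --add lvs-director
# service lvs-director start
# service lvs-director status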
Real server script:
#!/bin/bash
#
# Script to start LVS DR real server.
# chkconfig: - 90 10
# description: LVS DR real server
#
. /etc/rc.d/init.d/functions
VIP=172.16.100.1
host=`/bin/hostname`
case "$1" in
start)
# Start LVS-DR real server on this machine.
/sbin/ifconfig lo down
/sbin/ifconfig lo up
echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
/sbin/ifconfig lo:0 $VIP broadcast $VIP netmask 255.255.255.255 up
/sbin/route add -host $VIP dev lo:0
;;
stop)
# Stop LVS-DR real server loopback device(s).
/sbin/ifconfig lo:0 down
echo 0 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 0 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
;;
status)
# Status of LVS-DR real server.
islothere=`/sbin/ifconfig lo:0 | grep $VIP`
isrothere=`netstat -rn | grep "lo:0" | grep $VIP`
if [ ! "$islothere" -o ! "isrothere" ];then
# Either the route or the lo:0 device
# not found.
echo "LVS-DR real server Stopped."
else
echo "LVS-DR real server Running."
fi
;;
*)
# Invalid entry.
echo "$0: Usage: $0 {start|status|stop}"
exit 1
;;
esac
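The real server script deploys the same way, assuming a hypothetical file name lvs-rs; run it on every RS before the Director starts dispatching.
# cp lvs-rs /etc/init.d/ && chmod +x /etc/init.d/lvs-rs
# chkconfig --add lvs-rs && service lvs-rs start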
RS health-check script, version 1:
#!/bin/bash
#
VIP=192.168.10.3
CPORT=80                            # cluster service port
FAIL_BACK=127.0.0.1                 # fallback server (unused in this version)
FBSTATUS=0
RS=("192.168.10.7" "192.168.10.8")  # real servers
RSTATUS=("1" "1")                   # per-RS state: 1=online, 0=offline
RW=("2" "1")                        # per-RS weight
RPORT=80                            # real server port
TYPE=g                              # LVS type: g=DR
add() {
ipvsadm -a -t $VIP:$CPORT -r $1:$RPORT -$TYPE -w $2
[ $? -eq 0 ] && return 0 || return 1
}
del() {
ipvsadm -d -t $VIP:$CPORT -r $1:$RPORT
[ $? -eq 0 ] && return 0 || return 1
}
while :; do
let COUNT=0
for I in ${RS[*]}; do
if curl --connect-timeout 1 http://$I &> /dev/null; then
if [ ${RSTATUS[$COUNT]} -eq 0 ]; then
add $I ${RW[$COUNT]}
[ $? -eq 0 ] && RSTATUS[$COUNT]=1
fi
else
if [ ${RSTATUS[$COUNT]} -eq 1 ]; then
del $I
[ $? -eq 0 ] && RSTATUS[$COUNT]=0
fi
fi
let COUNT++
done
sleep 5
done
RS health-check script, version 2:
#!/bin/bash
#
VIP=192.168.10.3
CPORT=80
FAIL_BACK=127.0.0.1
RS=("192.168.10.7" "192.168.10.8")
declare -a RSSTATUS
RW=("2" "1")
RPORT=80
TYPE=g
CHKLOOP=3
LOG=/var/log/ipvsmonitor.log
addrs() {
ipvsadm -a -t $VIP:$CPORT -r $1:$RPORT -$TYPE -w $2
[ $? -eq 0 ] && return 0 || return 1
}
delrs() {
ipvsadm -d -t $VIP:$CPORT -r $1:$RPORT
[ $? -eq 0 ] && return 0 || return 1
}
checkrs() {
local I=1
while [ $I -le $CHKLOOP ]; do
if curl --connect-timeout 1 http://$1 &> /dev/null; then
return 0
fi
let I++
done
return 1
}
initstatus() {
local I
local COUNT=0;
for I in ${RS[*]}; do
if ipvsadm -L -n | grep "$I:$RPORT" &> /dev/null; then
RSSTATUS[$COUNT]=1
else
RSSTATUS[$COUNT]=0
fi
let COUNT++
done
}
initstatus
while :; do
let COUNT=0
for I in ${RS[*]}; do
if checkrs $I; then
if [ ${RSSTATUS[$COUNT]} -eq 0 ]; then
addrs $I ${RW[$COUNT]}
[ $? -eq 0 ] && RSSTATUS[$COUNT]=1 && echo "`date +'%F %H:%M:%S'`, $I is back." >> $LOG
fi
else
if [ ${RSSTATUS[$COUNT]} -eq 1 ]; then
delrs $I
[ $? -eq 0 ] && RSSTATUS[$COUNT]=0 && echo "`date +'%F %H:%M:%S'`, $I is gone." >> $LOG
fi
fi
let COUNT++
done
sleep 5
done
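A minimal way to run the monitor on the Director, assuming it is saved as /usr/local/sbin/rscheck.sh (hypothetical path): keep it in the background and watch its log.
# nohup /usr/local/sbin/rscheck.sh &
# tail -f /var/log/ipvsmonitor.log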
LVS persistent connections:
Regardless of the scheduling algorithm, LVS persistence dispatches requests from the same client to the previously selected RS for a configurable period of time.
Persistence template (an in-memory buffer):
maps each client to the RS assigned to it;
ipvsadm -A|E ... -p timeout:
timeout: persistence duration in seconds; default 300;
SSL-based services need persistent connections, as in the example below;
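For example, a sketch of a persistent HTTPS service with a one-hour timeout (addresses reuse the firewall-mark example below):
# ipvsadm -A -t 192.168.10.3:443 -s wlc -p 3600
# ipvsadm -a -t 192.168.10.3:443 -r 192.168.10.7 -g -w 2
# ipvsadm -a -t 192.168.10.3:443 -r 192.168.10.8 -g -w 5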
PPC (persistent port connections): requests from the same client to the same cluster service always go to the previously selected RS
PCC (persistent client connections): requests from the same client to all ports always go to the previously selected RS
PNMPP: persistent firewall-mark connections
PREROUTING
port 80 --> mark 8
port 23 --> mark 8
#iptables -t mangle -A PREROUTING -d 192.168.10.3 -i eth0 -p tcp --dport 80 -j MARK --set-mark 8
#iptables -t mangle -A PREROUTING -d 192.168.10.3 -i eth0 -p tcp --dport 23 -j MARK --set-mark 8
#ipvsadm -A -f 8 -s rr
#ipvsadm -a -f 8 -r 192.168.10.7 -g -w 2
#ipvsadm -a -f 8 -r 192.168.10.8 -g -w 5
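To confirm the marks and the FWM service are in place, standard queries suffice:
# iptables -t mangle -L PREROUTING -n
# ipvsadm -L -n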
High-Availability Clustering in Depth
Failover: moving resources off a failed node
Resource stickiness / location:
which node a resource prefers to run on, defined via a score
CRM: Cluster Resource Manager (crmd, port 5560)
LRM: Local Resource Manager
RA: Resource Agent (scripts)
Resource constraints (Constraint):
colocation constraints:
whether resources may run on the same node
score:
positive: may run together
negative: must not run together
location constraints: score
positive: prefer this node
negative: prefer to leave this node
order constraints:
define the order in which resources start and stop
example: vip, ipvs
ipvs --> vip
-inf: negative infinity
inf: positive infinity
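A minimal crm shell sketch of the three constraint types, assuming pacemaker's crmsh and two hypothetical resources WebIP and WebServer:
# crm configure primitive WebIP ocf:heartbeat:IPaddr params ip=172.16.100.1
# crm configure primitive WebServer lsb:httpd
# crm configure colocation web-with-ip inf: WebServer WebIP
# crm configure order ip-before-web inf: WebIP WebServer
# crm configure location prefer-node1 WebServer 100: node1.name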
Fencing:
node level:
STONITH
resource level:
e.g. an FC SAN switch can deny a node access at the storage-resource level
split-brain: occurs when cluster nodes cannot reliably obtain each other's status
one consequence: contention for shared storage
active/active: high availability
IDE (ATA): 133MB/s
SATA: 600MB/s
7200 rpm
IOPS: ~100
SCSI: 320MB/s
SAS:
15000 rpm
IOPS: ~200
USB 3.0: ~400MB/s
Mechanical (HDD):
random read/write
sequential read/write
Solid-state (SSD):
IDE, SCSI: parallel interfaces
SATA, SAS, USB: serial interfaces
DAS:
Direct Attached Storage
attached directly to the mainboard bus
block-level access
NAS:
Network Attached Storage
file server: file-level access
SAN:
Storage Area Network
FC SAN
IP SAN: iSCSI
SCSI: Small Computer System Interface
no-quorum-policy (what to do when quorum is lost):
freeze: keep already-running resources, but accept no new ones
stop: stop all resources
ignore: continue managing resources as if quorum were held
Messaging Layer
heartbeat (v1, v2, v3): UDP port 694
heartbeat v1: built-in resource manager
haresources
heartbeat v2: built-in resource managers
haresources
crm
heartbeat v3: the crm resource manager became an independent project, pacemaker
heartbeat,pacemaker,cluster-glue
corosync
cman
keepalived
ultramonkey
CRM
haresources, crm (heartbeat v1/v2)
pacemaker (heartbeat v3 / corosync)
rgmanager(cman)
Resource:
primitive(native)
clone
STONITH
Cluster Filesystem
dlm: Distributed Lock Manager
group
master/slave
RA: Resource Agent
RA Classes:
Legacy heartbeat v1 RA
LSB (/etc/rc.d/init.d)
OCF (Open Cluster Framework)
pacemaker
linbit (drbd)
stonith
Fencing levels:
node level
STONITH
resource level
FC SAN Switch
STONITH devices
1. Power Distribution Units (PDU)
Power Distribution Units are an essential element in managing power capacity and functionality for critical network, server and data center equipment. They can provide remote load monitoring of connected equipment and individual outlet control for remote power cycling.
2. Uninterruptible Power Supplies (UPS)
A UPS provides emergency power to connected equipment by supplying power from a separate source in the event of a utility power failure.
3. Blade Power Control Devices
If you are running a cluster on a set of blades, then the power control device in the blade enclosure is the only candidate for fencing. Of course, this device must be capable of managing single blade computers.
4. Lights-out Devices
Lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming increasingly popular and may even become standard in off-the-shelf computers. However, they are inferior to UPS devices, because they share a power supply with their host (a cluster node). If a node stays without power, the device supposed to control it would be just as useless. In that case, the CRM would continue its attempts to fence the node indefinitely while all other resource operations would wait for the fencing/STONITH operation to complete.
5. Testing Devices
Testing devices are used exclusively for testing purposes. They are usually more gentle on the hardware. Once the cluster goes into production, they must be replaced with real fencing devices.
ssh 172.16.100.1 'reboot'
meatware
How STONITH is implemented:
stonithd
stonithd is a daemon which can be accessed by local processes or over the network. It accepts the commands which correspond to fencing operations: reset, power-off, and power-on. It can also check the status of the fencing device.
The stonithd daemon runs on every node in the CRM HA cluster. The stonithd instance running on the DC node receives a fencing request from the CRM. It is up to this and other stonithd programs to carry out the desired fencing operation.
STONITH Plug-ins
For every supported fencing device there is a STONITH plug-in which is capable of controlling said device. A STONITH plug-in is the interface to the fencing device.
On each node, all STONITH plug-ins reside in /usr/lib/stonith/plugins (or in /usr/lib64/stonith/plugins for 64-bit architectures). All STONITH plug-ins look the same to stonithd, but are quite different on the other side reflecting the nature of the fencing device.
Some plug-ins support more than one device. A typical example is ipmilan (or external/ipmi) which implements the IPMI protocol and can control any device which supports this protocol.
CIB: Cluster Information Base (XML format)
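Roughly, the CIB's top-level XML layout (a skeleton, not a complete document):
<cib>
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>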
Installing and configuring heartbeat
heartbeat v2
goal: a highly available web service
node1, node2
Prerequisites: node names in /etc/hosts
node names must match the output of uname -n
passwordless SSH between the nodes
time synchronization
1. Configure IP addresses
node1 172.16.100.6
node2 172.16.100.7
2. Configure hostnames
#hostname node1.name
#vim /etc/sysconfig/network
#hostname node2.name
#vim /etc/sysconfig/network
3. Set up passwordless SSH
#ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
#ssh-copy-id -i .ssh/id_rsa.pub root@172.16.100.7
#ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
#ssh-copy-id -i .ssh/id_rsa.pub root@172.16.100.6
4. Configure /etc/hosts
#vim /etc/hosts
172.16.100.6 node1.name node1
172.16.100.7 node2.name node2
#scp /etc/hosts node2:/etc
5. Time synchronization
#ntpdate server_ip
#crontab -e
*/5 * * * * /sbin/ntpdate server_ip &>/dev/null
#scp /var/spool/cron/root node2:/var/spool/cron/
6. Install heartbeat
heartbeat - Heartbeat subsystem for High-Availability Linux (core package)
heartbeat-devel - Heartbeat development package
heartbeat-gui - provides a GUI to manage heartbeat clusters
heartbeat-ldirectord - monitor daemon for maintaining high-availability resources; auto-generates ipvs rules and health-checks backend real servers
heartbeat-pils - provides a general plugin and interface loading library
heartbeat-stonith - Provides an interface to Shoot The Other Node In The Head
7. Configure heartbeat
Three configuration files:
1. key file, mode 600: authkeys
2. heartbeat service configuration: ha.cf
3. resource management configuration:
haresources
#cp -p /usr/share/doc/heartbeat-*/{authkeys,ha.cf,haresources} /etc/ha.d/
#dd if=/dev/urandom count=1 bs=512 | md5sum // generate a random string
#vim /etc/ha.d/authkeys
auth 1
1 md5 <random string>
#vim /etc/ha.d/ha.cf
#node ken3
#node kathy
node node1.name
node node2.name
bcast eth0 // uncomment to enable broadcast heartbeats
#chkconfig httpd off
#vim /etc/ha.d/haresources
node1.name IPaddr::172.16.100.1/16/eth0 httpd
#scp -p authkeys haresources ha.cf node2:/etc/ha.d/
#service heartbeat start
#ssh node2 'service heartbeat start'
#/usr/lib/heartbeat/hb_standby // switch this node to standby
8. Shared storage (172.16.100.10)
#vim /etc/exports
/web/htdocs 172.16.0.0/255.255.0.0(ro)
node1~#ssh node2 '/etc/init.d/heartbeat stop'
node1~#service heartbeat stop
node1~#vim /etc/ha.d/haresources
node1.name IPaddr::172.16.100.1/16/eth0 Filesystem::172.16.100.10:/web/htdocs::/var/www/html::nfs httpd
node1~#scp /etc/ha.d/haresources node2:/etc/ha.d/
node1~#service heartbeat start
node1~#ssh node2 'service heartbeat start'
Resource management with crm
node1~#ssh node2 '/etc/init.d/heartbeat stop'
node1~#service heartbeat stop
node1~#vim /etc/ha.d/ha.cf
mcast eth0 225.0.0.15 694 1 0 // uncomment to enable multicast
crm respawn
node1~#/usr/lib/heartbeat/ha_propagate
node1~#service heartbeat start
node1~#ssh node2 'service heartbeat start'
A MySQL HA cluster based on heartbeat v2 with crm
nfs, samba, iscsi
NFS: MySQL binaries and data
/etc/my.cnf --> /etc/mysql/mysql.cnf
$MYSQL_BASE
--defaults-extra-file=
Shared storage
#pvcreate /dev/sdb2
#vgcreate myvg /dev/sdb2
#lvcreate -L 10G -n mydata myvg
#mke2fs -j /dev/myvg/mydata
#groupadd -g 3306 mysql
#useradd -u 3306 -g mysql -s /sbin/nologin -M mysql
#vim /etc/fstab
/dev/myvg/mydata /mydata ext3 defaults 0 0
#mkdir /mydata/data
#chown -R mysql.mysql /mydata/data/
#vim /etc/exports
/mydata 172.16.0.0/255.255.0.0(no_root_squash,rw)
#exportfs -arv
node1~#ssh node2 'service heartbeat stop'
node1~#service heartbeat stop
node1~#groupadd -g 3306 mysql
node1~#useradd -g 3306 -s /sbin/nologin -M mysql
node1~#mount -t nfs 172.16.100.10:/mydata /mydata // after mounting, verify that the mysql user can write
node1~#umount /mydata
node2~#groupadd -g 3306 mysql
node2~#useradd -g 3306 -s /sbin/nologin -M mysql
node2~#mount -t nfs 172.16.100.10:/mydata /mydata // after mounting, verify that the mysql user can write
node2~#umount /mydata
Install MySQL on node1 and node2, with the data directory set to /mydata/data
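A minimal sketch of the relevant my.cnf setting, identical on both nodes, so that both point at the shared data directory:
[mysqld]
datadir = /mydata/data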
node1~#service heartbeat start
node1~#ssh node2 'service heartbeat start'
RHEL 6.x RHCS: corosync
RHEL 5.x RHCS: openais, cman, rgmanager
corosync: Messaging Layer
openais: AIS
corosync --> pacemaker
SUSE Linux Enterprise Server: Hawk, WebGUI
LCMC: Linux Cluster Management Console
RHCS: Conga(luci/ricci)
webGUI
keepalived: VRRP, 2 nodes
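For contrast with heartbeat, a minimal hypothetical keepalived VRRP instance; the second node would use state BACKUP and a lower priority:
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        172.16.100.1
    }
}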