Linux Clusters
LB: Load Balancing
concurrent processing capacity
HA: High Availability
availability = uptime / (uptime + failure-handling time)
99%, 99.9%, 99.99%
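Over one year (8760 hours), 99% availability allows roughly 87.6 hours of downtime, 99.9% roughly 8.76 hours, and 99.99% roughly 53 minutes.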
HP (HPC): High Performance Computing
parallel processing clusters
Distributed storage: distributed file systems
a mechanism that splits a large task into small tasks and processes them separately
rsync+inotify: file synchronization
sersync: file synchronization
fencing: isolation
node-level fencing: STONITH
resource-level fencing:
LVS
Linux Virtual Server
Hardware
F5 BIG-IP
Citrix NetScaler
A10
Software
Layer 4
LVS
Layer 7: reverse proxy
Nginx
http, smtp, pop3, imap
haproxy
http, tcp (mysql, smtp)
LVS
ipvsadm: command-line tool for managing cluster services
ipvs: kernel module
CIP --> VIP --> DIP --> RIP (client IP, virtual IP, director IP, real server IP)
Three LVS types
NAT: network address translation
cluster nodes and the director must be on the same IP network
RIPs are usually private addresses, used only for communication among cluster nodes
the director sits between clients and real servers and handles all inbound and outbound traffic
real servers must use the DIP as their gateway
port mapping is supported
real servers may run any operating system
in large deployments the director easily becomes the bottleneck
DR: direct routing
cluster nodes and the director must be on the same physical network
RIPs may be public addresses, which makes remote management and monitoring convenient
the director handles only inbound requests; response packets are sent by the real servers directly to the clients
real servers must not use the DIP as their gateway
port mapping is not supported
TUN: tunneling
cluster nodes may span the Internet
RIPs must be public addresses
the director handles only inbound requests; response packets are sent by the real servers directly to the clients
real servers must not use the director as their gateway
only operating systems that support tunneling can be used as real servers
port mapping is not supported
Static scheduling
rr: round robin
wrr: weighted round robin
sh: source hashing
dh: destination hashing
Dynamic scheduling
lc: least connections
overhead = active*256 + inactive
the server with the smallest overhead is chosen
wlc: weighted least connections
overhead = (active*256 + inactive)/weight
sed: shortest expected delay (see the worked example after this list)
overhead = (active+1)*256/weight
nq:never queue
lblc: locality-based least connections (a dynamic version of dh)
lblcr: locality-based least connections with replication
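A worked example of the wlc and sed formulas, with hypothetical numbers: RS-A has weight 1, active 0, inactive 0; RS-B has weight 5, active 2, inactive 0.
wlc: A = (0*256+0)/1 = 0, B = (2*256+0)/5 = 102.4 --> wlc picks A
sed: A = (0+1)*256/1 = 256, B = (2+1)*256/5 = 153.6 --> sed picks B
sed ignores inactive connections and favors the higher-weight server even when a low-weight server is idle.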
ipvsadm:
Managing cluster services
Add: -A -t|u|f service-address [-s scheduler]
-t: TCP cluster service
-u: UDP cluster service
-f: fwm, firewall mark
service-address: a mark number
#ipvsadm -A -t 172.16.100.1:80 -s rr
Modify: -E
Delete: -D -t|u|f service-address
Managing RSs in a cluster service
Add: -a -t|u|f service-address -r server-address [-g|i|m] [-w weight] [-x upper] [-y lower]
-t|u|f service-address: a previously defined cluster service
-r server-address: the address of an RS; in the NAT model, IP:port can be used to implement port mapping
[-g|i|m]: LVS type
-g:DR
-i:TUN
-m:NAT
[-w weight]: server weight
Modify: -e
Delete: -d -t|u|f service-address -r server-address
# ipvsadm -a -t 172.16.100.1:80 -r 192.168.10.8 -m
#ipvsadm -a -t 172.16.100.1:80 -r 192.168.10.9 -m
View:
-L|-l
-n: show host addresses and ports in numeric form
--stats: statistics
--rate: rates
--timeout: show session timeout values for tcp, tcpfin and udp
--sort: sort the output
-c: show current IPVS connections
Delete all cluster services
-C: flush all ipvs rules
Save rules
-S
#ipvsadm -S > /path/to/somefile
Reload previously saved rules:
-R
# ipvsadm -R < /path/to/somefile
Clock skew between the nodes should not exceed one second
NTP:Network Time Protocol
ntpdate timeserver (synchronize with a time server; the server must be running the ntp service)
NAT configuration example
director: VIP 172.16.100.1, DIP 192.168.10.7
# yum install -y ipvsadm
# ipvsadm -A -t 172.16.100.1:80 -s rr
# ipvsadm -a -t 172.16.100.1:80 -r 192.168.10.8 -m
# ipvsadm -a -t 172.16.100.1:80 -r 192.168.10.9 -m
# service ipvsadm save
# echo 1 >/proc/sys/net/ipv4/ip_forward
Change the scheduler to wrr
# ipvsadm -E -t 172.16.100.1:80 -s wrr
# ipvsadm -e -t 172.16.100.1:80 -r 192.168.10.8 -m -w 3
# ipvsadm -e -t 172.16.100.1:80 -r 192.168.10.9 -m -w 1
realserver1: 192.168.10.8, gateway 192.168.10.7
# yum install httpd -y
# ntpdate 192.168.10.7
# echo "my test one" >/var/httpd/html/index.html
# service httpd restart
realserver2: 192.168.10.9, gateway 192.168.10.7
# yum install httpd -y
# ntpdate 192.168.10.7
# echo "my test two" >/var/httpd/html/index.html
# service httpd restart
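A quick check from a client machine, assuming the two test pages above: repeated requests to the VIP should return both pages, alternating under rr and in a roughly 3:1 ratio under the wrr weights set earlier.
# for i in $(seq 1 8); do curl -s http://172.16.100.1; done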
DR:
arptables:
kernel parameter
arp_ignore: defines how to respond to received ARP requests
0: respond as long as any local interface holds the requested address
1: respond only if the target address is configured on the interface the request arrived on
arp_announce: defines how the host announces its own addresses
0: announce any local address on any interface
1: try to avoid announcing addresses that do not match the outgoing network
2: announce only addresses that match the network of the outgoing interface
DR configuration example
Director:
eth0, DIP: 172.16.100.2
eth0:0, VIP: 172.16.100.1
#ifconfig eth0:0 172.16.100.1/16
#route add -host 172.16.100.1 dev eth0:0
#ipvsadm -A -t 172.16.100.1:80 -s wlc
#ipvsadm -a -t 172.16.100.1:80 -r 172.16.100.7 -g -w 2
#ipvsadm -a -t 172.16.100.1:80 -r 172.16.100.8 -g -w 1
RS1:
eth0, RIP1: 172.16.100.7
lo:0, VIP: 172.16.100.1
#sysctl -w net.ipv4.conf.eth0.arp_announce=2
#sysctl -w net.ipv4.conf.all.arp_announce=2
#sysctl -w net.ipv4.conf.eth0.arp_ignore=1
#sysctl -w net.ipv4.conf.all.arp_ignore=1
#ifconfig lo:0 172.16.100.1 broadcast 172.16.100.1 netmask 255.255.255.255
#route add -host 172.16.100.1 dev lo:0
RS2:
eth0, RIP2: 172.16.100.8
lo:0, VIP: 172.16.100.1
#sysctl -w net.ipv4.conf.eth0.arp_announce=2
#sysctl -w net.ipv4.conf.all.arp_announce=2
#sysctl -w net.ipv4.conf.eth0.arp_ignore=1
#sysctl -w net.ipv4.conf.all.arp_ignore=1
#ifconfig lo:0 172.16.100.1 broadcast 172.16.100.1 netmask 255.255.255.255
#route add -host 172.16.100.1 dev lo:0
Example configuration scripts for the Director and real servers in DR mode:
Director script:
#!/bin/bash
#
# LVS script for VS/DR
# chkconfig: - 90 10
#
. /etc/rc.d/init.d/functions
#
VIP=172.16.100.1
DIP=172.16.100.2
RIP1=172.16.100.7
RIP2=172.16.100.8
PORT=80
RSWEIGHT1=2
RSWEIGHT2=5
#
case "$1" in
start)
/sbin/ifconfig eth0:0 $VIP broadcast $VIP netmask 255.255.255.255 up
/sbin/route add -host $VIP dev eth0:0
# Since this is the Director we must be able to forward packets
echo 1 > /proc/sys/net/ipv4/ip_forward
# Clear all iptables rules.
/sbin/iptables -F
# Reset iptables counters.
/sbin/iptables -Z
# Clear all ipvsadm rules/services.
/sbin/ipvsadm -C
# Add an IP virtual service for the VIP on port 80.
# This recipe uses the weighted least-connection (wlc) scheduling method.
/sbin/ipvsadm -A -t $VIP:80 -s wlc
# Now direct packets for this VIP to
# the real server IP (RIP) inside the cluster
/sbin/ipvsadm -a -t $VIP:80 -r $RIP1 -g -w $RSWEIGHT1
/sbin/ipvsadm -a -t $VIP:80 -r $RIP2 -g -w $RSWEIGHT2
/bin/touch /var/lock/subsys/ipvsadm &> /dev/null
;;
stop)
# Stop forwarding packets
echo 0 > /proc/sys/net/ipv4/ip_forward
# Reset ipvsadm
/sbin/ipvsadm -C
# Bring down the VIP interface
/sbin/ifconfig eth0:0 down
/sbin/route del -host $VIP
/bin/rm -f /var/lock/subsys/ipvsadm
echo "ipvs is stopped..."
;;
status)
if [ ! -e /var/lock/subsys/ipvsadm ]; then
echo "ipvsadm is stopped ..."
else
echo "ipvs is running ..."
ipvsadm -L -n
fi
;;
*)
echo "Usage: $0 {start|stop|status}"
;;
esac
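One way to deploy this script, assuming it is saved as lvs-director (a hypothetical name): install it as a SysV service on the Director; the chkconfig header above makes chkconfig --add work.
# cp lvs-director /etc/init.d/
# chmod +x /etc/init.d/lvs-director
# chkconfig --add lvs-director
# service lvs-director start
# service lvs-director status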
Real server script:
#!/bin/bash
#
# Script to start LVS DR real server.
# chkconfig: - 90 10
# description: LVS DR real server
#
. /etc/rc.d/init.d/functions
VIP=172.16.100.1
host=`/bin/hostname`
case "$1" in
start)
# Start LVS-DR real server on this machine.
/sbin/ifconfig lo down
/sbin/ifconfig lo up
echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
/sbin/ifconfig lo:0 $VIP broadcast $VIP netmask 255.255.255.255 up
/sbin/route add -host $VIP dev lo:0
;;
stop)
# Stop LVS-DR real server loopback device(s).
/sbin/ifconfig lo:0 down
echo 0 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 0 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
;;
status)
# Status of LVS-DR real server.
islothere=`/sbin/ifconfig lo:0 | grep $VIP`
isrothere=`netstat -rn | grep "lo:0" | grep $VIP`
if [ ! "$islothere" -o ! "isrothere" ];then
# Either the route or the lo:0 device
# not found.
echo "LVS-DR real server Stopped."
else
echo "LVS-DR real server Running."
fi
;;
*)
# Invalid entry.
echo "$0: Usage: $0 {start|status|stop}"
exit 1
;;
esac
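The real server script deploys the same way, assuming a hypothetical file name lvs-rs; run it on every RS before the Director starts dispatching.
# cp lvs-rs /etc/init.d/ && chmod +x /etc/init.d/lvs-rs
# chkconfig --add lvs-rs && service lvs-rs start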
RS health-check script, version 1:
#!/bin/bash
#
VIP=192.168.10.3
CPORT=80                            # cluster service port
FAIL_BACK=127.0.0.1                 # fallback server (unused in this version)
FBSTATUS=0
RS=("192.168.10.7" "192.168.10.8")  # real servers
RSTATUS=("1" "1")                   # per-RS state: 1=online, 0=offline
RW=("2" "1")                        # per-RS weight
RPORT=80                            # real server port
TYPE=g                              # LVS type: g=DR
add() {
ipvsadm -a -t $VIP:$CPORT -r $1:$RPORT -$TYPE -w $2
[ $? -eq 0 ] && return 0 || return 1
}
del() {
ipvsadm -d -t $VIP:$CPORT -r $1:$RPORT
[ $? -eq 0 ] && return 0 || return 1
}
while :; do
let COUNT=0
for I in ${RS[*]}; do
if curl --connect-timeout 1 http://$I &> /dev/null; then
if [ ${RSTATUS[$COUNT]} -eq 0 ]; then
add $I ${RW[$COUNT]}
[ $? -eq 0 ] && RSTATUS[$COUNT]=1
fi
else
if [ ${RSTATUS[$COUNT]} -eq 1 ]; then
del $I
[ $? -eq 0 ] && RSTATUS[$COUNT]=0
fi
fi
let COUNT++
done
sleep 5
done
RS health-check script, version 2:
#!/bin/bash
#
VIP=192.168.10.3
CPORT=80
FAIL_BACK=127.0.0.1
RS=("192.168.10.7" "192.168.10.8")
declare -a RSSTATUS
RW=("2" "1")
RPORT=80
TYPE=g
CHKLOOP=3
LOG=/var/log/ipvsmonitor.log
addrs() {
ipvsadm -a -t $VIP:$CPORT -r $1:$RPORT -$TYPE -w $2
[ $? -eq 0 ] && return 0 || return 1
}
delrs() {
ipvsadm -d -t $VIP:$CPORT -r $1:$RPORT
[ $? -eq 0 ] && return 0 || return 1
}
checkrs() {
local I=1
while [ $I -le $CHKLOOP ]; do
if curl --connect-timeout 1 http://$1 &> /dev/null; then
return 0
fi
let I++
done
return 1
}
initstatus() {
local I
local COUNT=0;
for I in ${RS[*]}; do
if ipvsadm -L -n | grep "$I:$RPORT" &> /dev/null; then
RSSTATUS[$COUNT]=1
else
RSSTATUS[$COUNT]=0
fi
let COUNT++
done
}
initstatus
while :; do
let COUNT=0
for I in ${RS[*]}; do
if checkrs $I; then
if [ ${RSSTATUS[$COUNT]} -eq 0 ]; then
addrs $I ${RW[$COUNT]}
[ $? -eq 0 ] && RSSTATUS[$COUNT]=1 && echo "`date +'%F %H:%M:%S'`, $I is back." >> $LOG
fi
else
if [ ${RSSTATUS[$COUNT]} -eq 1 ]; then
delrs $I
[ $? -eq 0 ] && RSSTATUS[$COUNT]=0 && echo "`date +'%F %H:%M:%S'`, $I is gone." >> $LOG
fi
fi
let COUNT++
done
sleep 5
done
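A minimal way to run the monitor on the Director, assuming it is saved as /usr/local/sbin/rscheck.sh (hypothetical path): keep it in the background and watch its log.
# nohup /usr/local/sbin/rscheck.sh &
# tail -f /var/log/ipvsmonitor.log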
LVS persistent connections:
Regardless of the scheduling algorithm, LVS persistence dispatches requests from the same client to the previously selected RS for a configurable period of time.
Persistence template (an in-memory buffer):
maps each client to the RS assigned to it;
ipvsadm -A|E ... -p timeout:
timeout: persistence duration in seconds; default 300;
SSL-based services need persistent connections, as in the example below;
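For example, a sketch of a persistent HTTPS service with a one-hour timeout (addresses reuse the firewall-mark example below):
# ipvsadm -A -t 192.168.10.3:443 -s wlc -p 3600
# ipvsadm -a -t 192.168.10.3:443 -r 192.168.10.7 -g -w 2
# ipvsadm -a -t 192.168.10.3:443 -r 192.168.10.8 -g -w 5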
PPC (persistent port connections): requests from the same client to the same cluster service always go to the previously selected RS
PCC (persistent client connections): requests from the same client to all ports always go to the previously selected RS
PNMPP: persistent firewall-mark connections
PREROUTING
port 80 --> mark 8
port 23 --> mark 8
#iptables -t mangle -A PREROUTING -d 192.168.10.3 -i eth0 -p tcp --dport 80 -j MARK --set-mark 8
#iptables -t mangle -A PREROUTING -d 192.168.10.3 -i eth0 -p tcp --dport 23 -j MARK --set-mark 8
#ipvsadm -A -f 8 -s rr
#ipvsadm -a -f 8 -r 192.168.10.7 -g -w 2
#ipvsadm -a -f 8 -r 192.168.10.8 -g -w 5
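To confirm the marks and the FWM service are in place, standard queries suffice:
# iptables -t mangle -L PREROUTING -n
# ipvsadm -L -n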
High-Availability Clustering in Depth
Failover: moving resources off a failed node
Resource stickiness / location:
which node a resource prefers to run on, defined via a score
CRM: Cluster Resource Manager (crmd, port 5560)
LRM: Local Resource Manager
RA: Resource Agent (scripts)
Resource constraints (Constraint):
colocation constraints:
whether resources may run on the same node
score:
positive: may run together
negative: must not run together
location constraints: score
positive: prefer this node
negative: prefer to leave this node
order constraints:
define the order in which resources start and stop
example: vip, ipvs
ipvs --> vip
-inf: negative infinity
inf: positive infinity
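A minimal crm shell sketch of the three constraint types, assuming pacemaker's crmsh and two hypothetical resources WebIP and WebServer:
# crm configure primitive WebIP ocf:heartbeat:IPaddr params ip=172.16.100.1
# crm configure primitive WebServer lsb:httpd
# crm configure colocation web-with-ip inf: WebServer WebIP
# crm configure order ip-before-web inf: WebIP WebServer
# crm configure location prefer-node1 WebServer 100: node1.name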
Fencing:
node level:
STONITH
resource level:
e.g. an FC SAN switch can deny a node access at the storage-resource level
split-brain: occurs when cluster nodes cannot reliably obtain each other's status
one consequence: contention for shared storage
active/active: high availability
IDE (ATA): 133MB/s
SATA: 600MB/s
7200 rpm
IOPS: ~100
SCSI: 320MB/s
SAS:
15000 rpm
IOPS: ~200
USB 3.0: ~400MB/s
Mechanical (HDD):
random read/write
sequential read/write
Solid-state (SSD):
IDE, SCSI: parallel interfaces
SATA, SAS, USB: serial interfaces
DAS:
Direct Attached Storage
attached directly to the mainboard bus
block-level access
NAS:
Network Attached Storage
file server: file-level access
SAN:
Storage Area Network
FC SAN
IP SAN: iSCSI
SCSI: Small Computer System Interface
no-quorum-policy (what to do when quorum is lost):
freeze: keep already-running resources, but accept no new ones
stop: stop all resources
ignore: continue managing resources as if quorum were held
Messaging Layer
heartbeat (v1, v2, v3): UDP port 694
heartbeat v1: built-in resource manager
haresources
heartbeat v2: built-in resource managers
haresources
crm
heartbeat v3: the crm resource manager became an independent project, pacemaker
heartbeat,pacemaker,cluster-glue
corosync
cman
keepalived
ultramonkey
CRM
haresources, crm (heartbeat v1/v2)
pacemaker (heartbeat v3 / corosync)
rgmanager(cman)
Resource:
primitive(native)
clone
STONITH
Cluster Filesystem
dlm: Distributed Lock Manager
group
master/slave
RA: Resource Agent
RA Classes:
Legacy heartbeat v1 RA
LSB (/etc/rc.d/init.d)
OCF (Open Cluster Framework)
pacemaker
linbit (drbd)
stonith
Fencing levels:
node level
STONITH
resource level
FC SAN Switch
STONITH devices
1. Power Distribution Units (PDU)
Power Distribution Units are an essential element in managing power capacity and functionality for critical network, server and data center equipment. They can provide remote load monitoring of connected equipment and individual outlet control for remote power cycling.
2. Uninterruptible Power Supplies (UPS)
A UPS provides emergency power to connected equipment by supplying power from a separate source in the event of a utility power failure.
3. Blade Power Control Devices
If you are running a cluster on a set of blades, then the power control device in the blade enclosure is the only candidate for fencing. Of course, this device must be capable of managing single blade computers.
4. Lights-out Devices
Lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming increasingly popular and may even become standard in off-the-shelf computers. However, they are inferior to UPS devices, because they share a power supply with their host (a cluster node). If a node stays without power, the device supposed to control it would be just as useless. In that case, the CRM would continue its attempts to fence the node indefinitely while all other resource operations would wait for the fencing/STONITH operation to complete.
5. Testing Devices
Testing devices are used exclusively for testing purposes. They are usually more gentle on the hardware. Once the cluster goes into production, they must be replaced with real fencing devices.
ssh 172.16.100.1 'reboot'
meatware
How STONITH is implemented:
stonithd
stonithd is a daemon which can be accessed by local processes or over the network. It accepts the commands which correspond to fencing operations: reset, power-off, and power-on. It can also check the status of the fencing device.
The stonithd daemon runs on every node in the CRM HA cluster. The stonithd instance running on the DC node receives a fencing request from the CRM. It is up to this and other stonithd programs to carry out the desired fencing operation.
STONITH Plug-ins
For every supported fencing device there is a STONITH plug-in which is capable of controlling said device. A STONITH plug-in is the interface to the fencing device.
On each node, all STONITH plug-ins reside in /usr/lib/stonith/plugins (or in /usr/lib64/stonith/plugins for 64-bit architectures). All STONITH plug-ins look the same to stonithd, but are quite different on the other side reflecting the nature of the fencing device.
Some plug-ins support more than one device. A typical example is ipmilan (or external/ipmi) which implements the IPMI protocol and can control any device which supports this protocol.
CIB: Cluster Information Base (XML format)
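Roughly, the CIB's top-level XML layout (a skeleton, not a complete document):
<cib>
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>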
Installing and configuring heartbeat
heartbeat v2
goal: a highly available web service
node1, node2
Prerequisites: node names in /etc/hosts
node names must match the output of uname -n
passwordless SSH between the nodes
time synchronization
1. Configure IP addresses
node1 172.16.100.6
node2 172.16.100.7
2. Configure hostnames
#hostname node1.name
#vim /etc/sysconfig/network
#hostname node2.name
#vim /etc/sysconfig/network
3. Set up passwordless SSH
#ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
#ssh-copy-id -i .ssh/id_rsa.pub root@172.16.100.7
#ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
#ssh-copy-id -i .ssh/id_rsa.pub root@172.16.100.6
4. Configure /etc/hosts
#vim /etc/hosts
172.16.100.6 node1.name node1
172.16.100.7 node2.name node2
#scp /etc/hosts node2:/etc
5. Time synchronization
#ntpdate server_ip
#crontab -e
*/5 * * * * /sbin/ntpdate server_ip &>/dev/null
#scp /var/spool/cron/root node2:/var/spool/cron/
6. Install heartbeat
heartbeat - Heartbeat subsystem for High-Availability Linux (core package)
heartbeat-devel - Heartbeat development package
heartbeat-gui - provides a GUI to manage heartbeat clusters
heartbeat-ldirectord - monitor daemon for maintaining high-availability resources; auto-generates ipvs rules and health-checks backend real servers
heartbeat-pils - provides a general plugin and interface loading library
heartbeat-stonith - Provides an interface to Shoot The Other Node In The Head
7. Configure heartbeat
Three configuration files:
1. key file, mode 600: authkeys
2. heartbeat service configuration: ha.cf
3. resource management configuration:
haresources
#cp -p /usr/share/doc/heartbeat-*/{authkeys,ha.cf,haresources} /etc/ha.d/
#dd if=/dev/urandom count=1 bs=512 | md5sum // generate a random string
#vim /etc/ha.d/authkeys
auth 1
1 md5 <random string>
#vim /etc/ha.d/ha.cf
#node ken3
#node kathy
node node1.name
node node2.name
bcast eth0 // uncomment to enable broadcast heartbeats
#chkconfig httpd off
#vim /etc/ha.d/haresources
node1.name IPaddr::172.16.100.1/16/eth0 httpd
#scp -p authkeys haresources ha.cf node2:/etc/ha.d/
#service heartbeat start
#ssh node2 'service heartbeat start'
#/usr/lib/heartbeat/hb_standby // switch this node to standby
8. Shared storage (172.16.100.10)
#vim /etc/exports
/web/htdocs 172.16.0.0/255.255.0.0(ro)
node1~#ssh node2 '/etc/init.d/heartbeat stop'
node1~#service heartbeat stop
node1~#vim /etc/ha.d/haresources
node1.name IPaddr::172.16.100.1/16/eth0 Filesystem::172.16.100.10:/web/htdocs::/var/www/html::nfs httpd
node1~#scp /etc/ha.d/haresources node2:/etc/ha.d/
node1~#service heartbeat start
node1~#ssh node2 'service heartbeat start'
Resource management with crm
node1~#ssh node2 '/etc/init.d/heartbeat stop'
node1~#service heartbeat stop
node1~#vim /etc/ha.d/ha.cf
mcast eth0 225.0.0.15 694 1 0 // uncomment to enable multicast
crm respawn
node1~#/usr/lib/heartbeat/ha_propagate
node1~#service heartbeat start
node1~#ssh node2 'service heartbeat start'
A MySQL HA cluster based on heartbeat v2 with crm
nfs, samba, iscsi
NFS: MySQL binaries and data
/etc/my.cnf --> /etc/mysql/mysql.cnf
$MYSQL_BASE
--defaults-extra-file=
Shared storage
#pvcreate /dev/sdb2
#vgcreate myvg /dev/sdb2
#lvcreate -L 10G -n mydata myvg
#mke2fs -j /dev/myvg/mydata
#groupadd -g 3306 mysql
#useradd -u 3306 -g mysql -s /sbin/nologin -M mysql
#vim /etc/fstab
/dev/myvg/mydata /mydata ext3 defaults 0 0
#mkdir /mydata/data
#chown -R mysql.mysql /mydata/data/
#vim /etc/exports
/mydata 172.16.0.0/255.255.0.0(no_root_squash,rw)
#exportfs -arv
node1~#ssh node2 'service heartbeat stop'
node1~#service heartbeat stop
node1~#groupadd -g 3306 mysql
node1~#useradd -g 3306 -s /sbin/nologin -M mysql
node1~#mount -t nfs 172.16.100.10:/mydata /mydata // after mounting, verify that the mysql user can write
node1~#umount /mydata
node2~#groupadd -g 3306 mysql
node2~#useradd -g 3306 -s /sbin/nologin -M mysql
node2~#mount -t nfs 172.16.100.10:/mydata /mydata // after mounting, verify that the mysql user can write
node2~#umount /mydata
Install MySQL on node1 and node2, with the data directory set to /mydata/data
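A minimal sketch of the relevant my.cnf setting, identical on both nodes, so that both point at the shared data directory:
[mysqld]
datadir = /mydata/data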
node1~#service heartbeat start
node1~#ssh node2 'service heartbeat start'
RHEL 6.x RHCS: corosync
RHEL 5.x RHCS: openais, cman, rgmanager
corosync: Messaging Layer
openais: AIS
corosync --> pacemaker
SUSE Linux Enterprise Server: Hawk, WebGUI
LCMC: Linux Cluster Management Console
RHCS: Conga(luci/ricci)
webGUI
keepalived: VRRP, 2 nodes
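For contrast with heartbeat, a minimal hypothetical keepalived VRRP instance; the second node would use state BACKUP and a lower priority:
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        172.16.100.1
    }
}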