Configuring RHCS and GFS timeouts in a multipath environment

Applies to: Cluster or GFS on RHEL4 and later

Symptom: the log reports the following error

openais[3345]: [CMAN ] lost contact with quorum device

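These days, whenever a customer has shared storage, configuring a quorum disk is recommended when deploying Cluster and GFS, so the error above will be familiar to most readers. It usually appears because the qdisk process has gone too long without communicating with cman/ais, exceeding the qdisk poll time, and the node is then disconnected. It is especially common during link-failover tests in environments running multipath software such as multipath or rdac: because failover can take a long time, the quorum disk is declared lost before the path switch completes and the node is rebooted outright, which is certainly not the result we want. So how can this be solved?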

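First, a few basic concepts:

① For the cluster to consider a node healthy, three conditions must hold:

· CMAN considers the node online

· The node can keep reading and writing the quorum disk without sustained failures

· The node's heuristics yield a sufficient score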

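② qdisk consists of two main threads: the main thread runs the loop and performs the I/O; the second thread handles the heuristics.

Another job of the main thread is to tell cman/ais at regular intervals that it is still alive. If qdisk goes longer than quorum_dev_poll without communicating with cman/ais, cman declares that the node has lost its connection to the quorum disk, and the log shows the error above. The default in cman.h is: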

#define DEFAULT_QUORUMDEV_POLL 10000

The unit is ms, so the default is 10 seconds. To change quorum_dev_poll, edit the cman tag in cluster.conf:

<cman quorum_dev_poll="50000"></cman>
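To check which poll value a given configuration would use, the lookup can be sketched in a few lines of Python. This is only an illustration: the function name `quorum_dev_poll_ms` and the sample `<cluster>` wrapper (including its `name` and `config_version` values) are assumptions made for the example, and the fallback simply mirrors the cman.h default quoted above.

```python
import xml.etree.ElementTree as ET

# Default taken from the cman.h line quoted in the article (milliseconds).
DEFAULT_QUORUMDEV_POLL_MS = 10000


def quorum_dev_poll_ms(cluster_conf_xml: str) -> int:
    """Return quorum_dev_poll in ms, or the default if it is not set.

    Hypothetical helper for illustration; it is not part of cman.
    """
    root = ET.fromstring(cluster_conf_xml)
    # Accept either a bare <cman> element or a <cluster> that contains one.
    cman = root if root.tag == "cman" else root.find("cman")
    if cman is None:
        return DEFAULT_QUORUMDEV_POLL_MS
    return int(cman.get("quorum_dev_poll", DEFAULT_QUORUMDEV_POLL_MS))


# Assumed sample configuration, trimmed to the part that matters here.
SAMPLE = (
    '<cluster name="demo" config_version="1">'
    '<cman quorum_dev_poll="50000"></cman>'
    "</cluster>"
)

print(quorum_dev_poll_ms(SAMPLE))  # 50000 ms, i.e. 50 seconds
```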

③ What we usually call the qdisk timeout is a continuous period during which every read and write to the quorum disk fails. Suppose cluster.conf contains:

<quorumd device="/dev/sdb1" interval="3" min_score="2" tko="13" votes="2">

where

interval=3

This is the frequency of read/write cycles against the quorum disk, in seconds.

tko=13

This is the number of consecutive cycles a node must miss in order to be declared dead.

qdisk_timeout = interval × tko
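As a worked example of this formula, the following Python sketch reads interval and tko from a quorumd element and multiplies them. The helper name `qdisk_timeout_seconds` is an assumption for illustration, and the sample element is written as self-closing only so that it parses on its own.

```python
import xml.etree.ElementTree as ET


def qdisk_timeout_seconds(quorumd_xml: str) -> int:
    """Compute qdisk_timeout = interval x tko from a <quorumd> element.

    Hypothetical helper for illustration; it is not part of qdisk.
    """
    quorumd = ET.fromstring(quorumd_xml)
    interval = int(quorumd.get("interval"))  # seconds between read/write cycles
    tko = int(quorumd.get("tko"))            # missed cycles before "dead"
    return interval * tko


# The values from the article's example, as a standalone element.
SAMPLE = (
    '<quorumd device="/dev/sdb1" interval="3" '
    'min_score="2" tko="13" votes="2"/>'
)

print(qdisk_timeout_seconds(SAMPLE))  # 3 x 13 = 39 seconds
```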

④ Now let's look at how the cman timeout is configured on RHEL5:

token

This timeout specifies in milliseconds until a token loss is declared after not receiving a token. This is the time spent detecting a failure of a processor in the current configuration. Reforming a new configuration takes about 50 milliseconds in addition to this timeout. The default is 1000 milliseconds. In short, this is how long a node may go without receiving a token before the token is declared lost; the default is 1 second.

retransmits_before_loss

This value identifies how many token retransmits should be attempted before forming a new configuration. If this value is set, retransmit and hold will be automatically calculated from retransmits_before_loss and token. The default is 4 retransmissions. In short, this is the number of consecutive token losses before a new cluster configuration is formed (that is, before the node that lost the token is ejected from the cluster).

token_retransmit

This timeout specifies in milliseconds after how long before receiving a token the token is retransmitted. This will be automatically calculated if token is modified. It is not recommended to alter this value without guidance from the openais community. The default is 238 milliseconds. In short, this is the interval at which the token is retransmitted. It is calculated automatically from token and retransmits_before_loss above; with the defaults, and allowing 50 ms for forming a new configuration, roughly (1000-50)/4≈238 ms.

When the heartbeat token is lost as described above, the log shows the following error:

openais[3345]: [TOTEM] The token was lost in the OPERATIONAL state.

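Note that the unit is milliseconds. Alternatively, the timeout can be changed through the cman tag.

Note: RHEL4 does not use the openais architecture, so there the timeout can only be changed through deadnode_timeout.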

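With this groundwork in place, it is not hard to see that the various timeout values, written T(*), should satisfy the following relationship: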

T(MPIO) < T(qdisk) < T(cman)

Red Hat officially recommends:

T(qdisk) = T(MPIO) × 1.3

T(cman) = T(MPIO) × 2.7
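To see how these two ratios turn into concrete settings, here is a minimal Python sketch. It assumes the multipath failover time has already been measured; the function name `plan_timeouts`, the dictionary keys, and the 30-second failover figure are all hypothetical values chosen for illustration, not recommendations.

```python
def plan_timeouts(mpio_failover_ms: int, qdisk_interval_s: int) -> dict:
    """Derive qdisk and cman timeouts from a measured multipath failover time.

    Hypothetical helper; it only applies the 1.3x and 2.7x ratios quoted above.
    """
    # Integer arithmetic in milliseconds avoids floating-point rounding.
    qdisk_ms = mpio_failover_ms * 13 // 10  # T(qdisk) = T(MPIO) x 1.3
    cman_ms = mpio_failover_ms * 27 // 10   # T(cman)  = T(MPIO) x 2.7
    interval_ms = qdisk_interval_s * 1000
    # Smallest tko with interval x tko >= T(qdisk), via ceiling division.
    tko = -(-qdisk_ms // interval_ms)
    return {
        "qdisk_timeout_ms": qdisk_ms,
        "tko": tko,
        "cman_timeout_ms": cman_ms,
    }


# Assumed example: failover measured at 30 s, qdisk interval of 3 s.
plan = plan_timeouts(mpio_failover_ms=30000, qdisk_interval_s=3)
print(plan)
```

With the assumed 30-second failover this yields a 39-second qdisk timeout, which happens to match interval=3 and tko=13 from the earlier example, and an 81-second cman timeout, which would be written in milliseconds when set through token.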

References: Red Hat Knowledgebase, and the man pages qdisk(5) and openais.conf(5)
