
Monitoring MySQL/Redis/MongoDB Status with the Open-Falcon Monitoring System

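Background:

Open-Falcon is an enterprise-grade monitoring solution for internet companies, open-sourced by Xiaomi's operations team. For installation and usage instructions, see the official site: http://open-falcon.org/. It is a fairly complete monitoring system and exposes a range of APIs: push data in the required format and it produces graphs. It also supports alerting, clustering, and more.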
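As a rough illustration of "data in the required format", the sketch below builds a single data point and serializes it. The field names (Metric, Endpoint, Timestamp, Step, Value, CounterType, TAGS) are the ones used by the scripts in this article; the helper names build_metric and build_payload are hypothetical and exist only for this example.

```python
import json
import time


def build_metric(endpoint, metric, value, counter_type,
                 step=60, tags="", timestamp=None):
    """Build one data point shaped like the ones the scripts in this article push."""
    if counter_type not in ("GAUGE", "COUNTER"):
        raise ValueError("counter_type must be GAUGE or COUNTER")
    return {
        "Metric": metric,
        "Endpoint": endpoint,
        "Timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "Step": step,
        "Value": value,
        "CounterType": counter_type,
        "TAGS": tags,
    }


def build_payload(points):
    """Serialize a list of data points into the JSON body of one push request."""
    return json.dumps(points, sort_keys=True)


points = [build_metric("mysql-21:3306", "Slow_queries", 12, "COUNTER")]
print(build_payload(points))
```

To actually send the payload, the scripts in this article post it to the agent, for example requests.post("http://192.168.200.86:1988/v1/push", data=build_payload(points)).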

Monitoring:

1) MySQL collection script (mysql_monitor.py)

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Collects MySQL status counters and pushes them to the Open-Falcon agent.
from __future__ import division, print_function

import datetime
import fileinput
import json
import re
import time

import MySQLdb
import requests


class MySQLMonitorInfo(object):
    def __init__(self, host, port, user, password):
        self.host = host
        self.port = port
        self.user = user
        self.password = password

    def stat_info(self):
        """Return SHOW GLOBAL STATUS as a dict, or an empty dict on failure."""
        try:
            m = MySQLdb.connect(host=self.host, user=self.user, passwd=self.password,
                                port=self.port, charset='utf8')
            cursor = m.cursor()
            cursor.execute("SHOW GLOBAL STATUS")
            status_dict = {}
            for key, value in cursor.fetchall():
                status_dict[key] = value
            cursor.close()
            m.close()
            return status_dict
        except Exception as e:
            print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
            print(e)
            return {}

    def engine_info(self):
        """Return the InnoDB history list length parsed from SHOW ENGINE INNODB STATUS."""
        try:
            m = MySQLdb.connect(host=self.host, user=self.user, passwd=self.password,
                                port=self.port, charset='utf8')
            _engine_regex = re.compile(r'(History list length) ([0-9]+\.?[0-9]*)\n')
            cursor = m.cursor()
            cursor.execute("SHOW ENGINE INNODB STATUS")
            # The row is (Type, Name, Status); only the status text is needed.
            _type, _name, status_text = cursor.fetchone()
            cursor.close()
            m.close()
            return dict(_engine_regex.findall(status_text))
        except Exception as e:
            print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
            print(e)
            return dict(History_list_length=0)


if __name__ == '__main__':
    open_falcon_api = 'http://192.168.200.86:1988/v1/push'

    # One "host,port,user,password,endpoint" entry per line; blank lines are skipped.
    db_list = [line.strip() for line in fileinput.input() if line.strip()]

    for db_info in db_list:
        host, port, user, password, endpoint = db_info.split(',')
        timestamp = int(time.time())
        step = 60
        # tags = "port=%s" % port
        tags = ""

        conn = MySQLMonitorInfo(host, int(port), user, password)
        stat_info = conn.stat_info()
        engine_info = conn.engine_info()

        mysql_stat_list = []
        monitor_keys = [
            ('Com_select', 'COUNTER'),
            ('Qcache_hits', 'COUNTER'),
            ('Com_insert', 'COUNTER'),
            ('Com_update', 'COUNTER'),
            ('Com_delete', 'COUNTER'),
            ('Com_replace', 'COUNTER'),
            ('MySQL_QPS', 'COUNTER'),
            ('MySQL_TPS', 'COUNTER'),
            ('ReadWrite_ratio', 'GAUGE'),
            ('Innodb_buffer_pool_read_requests', 'COUNTER'),
            ('Innodb_buffer_pool_reads', 'COUNTER'),
            ('Innodb_buffer_read_hit_ratio', 'GAUGE'),
            ('Innodb_buffer_pool_pages_flushed', 'COUNTER'),
            ('Innodb_buffer_pool_pages_free', 'GAUGE'),
            ('Innodb_buffer_pool_pages_dirty', 'GAUGE'),
            ('Innodb_buffer_pool_pages_data', 'GAUGE'),
            ('Bytes_received', 'COUNTER'),
            ('Bytes_sent', 'COUNTER'),
            ('Innodb_rows_deleted', 'COUNTER'),
            ('Innodb_rows_inserted', 'COUNTER'),
            ('Innodb_rows_read', 'COUNTER'),
            ('Innodb_rows_updated', 'COUNTER'),
            ('Innodb_os_log_fsyncs', 'COUNTER'),
            ('Innodb_os_log_written', 'COUNTER'),
            ('Created_tmp_disk_tables', 'COUNTER'),
            ('Created_tmp_tables', 'COUNTER'),
            ('Connections', 'COUNTER'),
            ('Innodb_log_waits', 'COUNTER'),
            ('Slow_queries', 'COUNTER'),
            ('Binlog_cache_disk_use', 'COUNTER'),
        ]

        # Reads and writes are used by several derived metrics below.
        reads = int(stat_info.get('Com_select', 0)) + int(stat_info.get('Qcache_hits', 0))
        writes = (int(stat_info.get('Com_insert', 0)) + int(stat_info.get('Com_update', 0)) +
                  int(stat_info.get('Com_delete', 0)) + int(stat_info.get('Com_replace', 0)))

        for _key, falcon_type in monitor_keys:
            if _key == 'MySQL_QPS':
                _value = reads
            elif _key == 'MySQL_TPS':
                _value = writes
            elif _key == 'Innodb_buffer_read_hit_ratio':
                try:
                    read_requests = int(stat_info.get('Innodb_buffer_pool_read_requests', 0))
                    disk_reads = int(stat_info.get('Innodb_buffer_pool_reads', 0))
                    _value = round((read_requests - disk_reads) / read_requests * 100, 3)
                except ZeroDivisionError:
                    _value = 0
            elif _key == 'ReadWrite_ratio':
                try:
                    _value = round(reads / writes, 2)
                except ZeroDivisionError:
                    _value = 0
            else:
                _value = int(stat_info.get(_key, 0))

            falcon_format = {
                'Metric': '%s' % (_key),
                'Endpoint': endpoint,
                'Timestamp': timestamp,
                'Step': step,
                'Value': _value,
                'CounterType': falcon_type,
                'TAGS': tags
            }
            mysql_stat_list.append(falcon_format)

        # engine_info holds a single entry: the history list length.
        for _key, _value in engine_info.items():
            _key = "Undo_Log_Length"
            falcon_format = {
                'Metric': '%s' % (_key),
                'Endpoint': endpoint,
                'Timestamp': timestamp,
                'Step': step,
                'Value': int(_value),
                'CounterType': "GAUGE",
                'TAGS': tags
            }
            mysql_stat_list.append(falcon_format)

        print(json.dumps(mysql_stat_list, sort_keys=True, indent=4))
        requests.post(open_falcon_api, data=json.dumps(mysql_stat_list))

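Metric notes: among the collected metrics, COUNTER means the value is a cumulative count that is displayed as a per-second rate, and GAUGE means the value is reported as is.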
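The COUNTER behaviour can be pictured as the difference between two successive samples divided by the elapsed time. The sketch below is only an illustration of that arithmetic, not Open-Falcon's implementation; the function name counter_rate and the choice to return None when the counter goes backwards (for example after a server restart) are assumptions made for this example.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate between two samples of a cumulative counter.

    Returns None when the counter decreased (assumed here to mean a reset),
    because no meaningful rate can be derived from that pair of samples.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        return None
    return delta / elapsed


# Com_select read 1000 at t=0 and 1600 one step (60 s) later.
print(counter_rate(1000, 0, 1600, 60))  # 10.0 selects per second
```

A GAUGE such as Innodb_buffer_pool_pages_free needs no such step: the sampled value is stored and graphed directly.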

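Metric                              Type      Description
Undo_Log_Length                     GAUGE     Number of unpurged undo transactions (InnoDB history list length)
Com_select                          COUNTER   SELECTs per second (= QPS)
Com_insert                          COUNTER   INSERTs per second
Com_update                          COUNTER   UPDATEs per second
Com_delete                          COUNTER   DELETEs per second
Com_replace                         COUNTER   REPLACEs per second
MySQL_QPS                           COUNTER   QPS (Com_select + Qcache_hits)
MySQL_TPS                           COUNTER   TPS (Com_insert + Com_update + Com_delete + Com_replace)
ReadWrite_ratio                     GAUGE     Read/write ratio
Innodb_buffer_pool_read_requests    COUNTER   InnoDB buffer pool read requests per second
Innodb_buffer_pool_reads            COUNTER   Disk reads per second
Innodb_buffer_read_hit_ratio        GAUGE     InnoDB buffer pool hit ratio (%)
Innodb_buffer_pool_pages_flushed    COUNTER   InnoDB buffer pool pages flushed to disk per second
Innodb_buffer_pool_pages_free       GAUGE     Number of free pages in the InnoDB buffer pool
Innodb_buffer_pool_pages_dirty      GAUGE     Number of dirty pages in the InnoDB buffer pool
Innodb_buffer_pool_pages_data       GAUGE     Number of data pages in the InnoDB buffer pool
Bytes_received                      COUNTER   Bytes received per second
Bytes_sent                          COUNTER   Bytes sent per second
Innodb_rows_deleted                 COUNTER   Rows deleted from InnoDB tables per second
Innodb_rows_inserted                COUNTER   Rows inserted into InnoDB tables per second
Innodb_rows_read                    COUNTER   Rows read from InnoDB tables per second
Innodb_rows_updated                 COUNTER   Rows updated in InnoDB tables per second
Innodb_os_log_fsyncs                COUNTER   Redo log fsync calls per second
Innodb_os_log_written               COUNTER   Bytes written to the redo log per second
Created_tmp_disk_tables             COUNTER   On-disk temporary tables created per second
Created_tmp_tables                  COUNTER   In-memory temporary tables created per second
Connections                         COUNTER   Connections per second
Innodb_log_waits                    COUNTER   Waits per second caused by an undersized InnoDB log buffer
Slow_queries                        COUNTER   Slow queries per second
Binlog_cache_disk_use               COUNTER   Times per second the binlog cache was too small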
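The two derived GAUGE metrics in the table, Innodb_buffer_read_hit_ratio and ReadWrite_ratio, are computed from raw status counters. A minimal sketch of the arithmetic follows, assuming the status values arrive as strings keyed by variable name (as a SHOW GLOBAL STATUS result would); the function names are hypothetical.

```python
def innodb_hit_ratio(status):
    """Percentage of buffer pool read requests served without a disk read."""
    read_requests = int(status.get("Innodb_buffer_pool_read_requests", 0))
    disk_reads = int(status.get("Innodb_buffer_pool_reads", 0))
    if read_requests == 0:
        return 0
    return round((read_requests - disk_reads) / read_requests * 100, 3)


def read_write_ratio(status):
    """Reads (selects + query cache hits) divided by writes."""
    reads = int(status.get("Com_select", 0)) + int(status.get("Qcache_hits", 0))
    writes = sum(int(status.get(key, 0))
                 for key in ("Com_insert", "Com_update", "Com_delete", "Com_replace"))
    if writes == 0:
        return 0
    return round(reads / writes, 2)


sample = {
    "Innodb_buffer_pool_read_requests": "1000",
    "Innodb_buffer_pool_reads": "25",
    "Com_select": "900", "Qcache_hits": "100",
    "Com_insert": "100", "Com_update": "50",
    "Com_delete": "30", "Com_replace": "20",
}
print(innodb_hit_ratio(sample))   # 97.5
print(read_write_ratio(sample))   # 5.0
```

Both values are computed from counters accumulated since server start, so they describe the lifetime average rather than the last collection interval.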

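Usage: the script reads the list of databases to monitor from a configuration file and collects metrics from each one. The file (mysqldb_list.txt) has one database per line, in the following format: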

 IP,Port,User,Password,endpoint

192.168.2.21,3306,root,123,mysql-21:3306
192.168.2.88,3306,root,123,mysql-88:3306
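Because each line is split on commas into exactly five fields, a malformed line would make the collector fail with an unpacking error. The sketch below shows one way to validate the file up front; the names MySQLTarget and parse_targets are hypothetical, and skipping lines that start with "#" is an assumption that the original script does not make.

```python
from collections import namedtuple

MySQLTarget = namedtuple("MySQLTarget", "host port user password endpoint")


def parse_targets(lines):
    """Parse 'IP,Port,User,Password,endpoint' lines into MySQLTarget tuples."""
    targets = []
    for lineno, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (comments are an assumption)
        fields = line.split(",")
        if len(fields) != 5:
            raise ValueError("line %d: expected 5 comma-separated fields, got %d"
                             % (lineno, len(fields)))
        host, port, user, password, endpoint = fields
        targets.append(MySQLTarget(host, int(port), user, password, endpoint))
    return targets


config = [
    "192.168.2.21,3306,root,123,mysql-21:3306",
    "",
    "192.168.2.88,3306,root,123,mysql-88:3306",
]
for target in parse_targets(config):
    print(target.endpoint, target.host, target.port)
```

Note that this simple format cannot represent a password that itself contains a comma.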

Then run:

python mysql_monitor.py mysqldb_list.txt 

2) Redis collection script (redis_monitor.py)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Collects Redis INFO statistics and pushes them to the Open-Falcon agent.
from __future__ import print_function

import datetime
import fileinput
import json
import time

import redis
import requests


class RedisMonitorInfo(object):
    def __init__(self, host, port, password):
        self.host = host
        self.port = port
        self.password = password

    def stat_info(self):
        """Return the default INFO output as a dict, or an empty dict on failure."""
        try:
            r = redis.Redis(host=self.host, port=self.port, password=self.password)
            return r.info()
        except Exception as e:
            print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
            print(e)
            return dict()

    def cmdstat_info(self):
        """Return the INFO Commandstats section as a dict, or an empty dict on failure."""
        try:
            r = redis.Redis(host=self.host, port=self.port, password=self.password)
            return r.info('Commandstats')
        except Exception as e:
            print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
            print(e)
            return dict()


if __name__ == '__main__':
    open_falcon_api = 'http://192.168.200.86:1988/v1/push'

    # One "host,port,password,endpoint" entry per line; blank lines are skipped.
    db_list = [line.strip() for line in fileinput.input() if line.strip()]

    for db_info in db_list:
        host, port, password, endpoint = db_info.split(',')
        timestamp = int(time.time())
        step = 60
        falcon_type = 'COUNTER'
        # tags = "port=%s" % port
        tags = ""

        conn = RedisMonitorInfo(host, int(port), password)

        # Number of calls per command, reported as per-second rates.
        redis_cmdstat_dict = {}
        redis_cmdstat_list = []
        cmdstat_info = conn.cmdstat_info()
        for cmdkey in cmdstat_info:
            redis_cmdstat_dict[cmdkey] = cmdstat_info[cmdkey]['calls']
        for _key, _value in redis_cmdstat_dict.items():
            falcon_format = {
                'Metric': '%s' % (_key),
                'Endpoint': endpoint,
                'Timestamp': timestamp,
                'Step': step,
                'Value': int(_value),
                'CounterType': falcon_type,
                'TAGS': tags
            }
            redis_cmdstat_list.append(falcon_format)

        # General Redis status. Add or remove items as needed; string values
        # have to be converted to int.
        redis_stat_list = []
        monitor_keys = [
            ('connected_clients', 'GAUGE'),
            ('blocked_clients', 'GAUGE'),
            ('used_memory', 'GAUGE'),
            ('used_memory_rss', 'GAUGE'),
            ('mem_fragmentation_ratio', 'GAUGE'),
            ('total_commands_processed', 'COUNTER'),
            ('rejected_connections', 'COUNTER'),
            ('expired_keys', 'COUNTER'),
            ('evicted_keys', 'COUNTER'),
            ('keyspace_hits', 'COUNTER'),
            ('keyspace_misses', 'COUNTER'),
            ('keyspace_hit_ratio', 'GAUGE'),
            ('keys_num', 'GAUGE'),
        ]
        stat_info = conn.stat_info()
        for _key, falcon_type in monitor_keys:
            # Hit ratio, as a percentage.
            if _key == 'keyspace_hit_ratio':
                try:
                    _value = round(float(stat_info.get('keyspace_hits', 0)) /
                                   (int(stat_info.get('keyspace_hits', 0)) +
                                    int(stat_info.get('keyspace_misses', 0))), 4) * 100
                except ZeroDivisionError:
                    _value = 0
            # The fragmentation ratio is a float.
            elif _key == 'mem_fragmentation_ratio':
                _value = float(stat_info.get(_key, 0))
            # Total number of keys across db0..db15.
            elif _key == 'keys_num':
                _value = 0
                for i in range(16):
                    _num = stat_info.get('db' + str(i))
                    if _num:
                        _value += int(_num.get('keys'))
            # Everything else is collected as an int; skip missing or non-numeric items.
            else:
                try:
                    _value = int(stat_info[_key])
                except (KeyError, ValueError, TypeError):
                    continue
            falcon_format = {
                'Metric': '%s' % (_key),
                'Endpoint': endpoint,
                'Timestamp': timestamp,
                'Step': step,
                'Value': _value,
                'CounterType': falcon_type,
                'TAGS': tags
            }
            redis_stat_list.append(falcon_format)

        load_data = redis_stat_list + redis_cmdstat_list
        print(json.dumps(load_data, sort_keys=True, indent=4))
        requests.post(open_falcon_api, data=json.dumps(load_data))

Metric notes: as with MySQL, COUNTER means the value is displayed as a per-second rate, and GAUGE means the value is reported as is.

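Metric                      Type      Description
connected_clients           GAUGE     Number of connected clients
blocked_clients             GAUGE     Number of blocked clients
used_memory                 GAUGE     Total memory allocated by Redis
used_memory_rss             GAUGE     Total memory allocated by the OS
mem_fragmentation_ratio     GAUGE     Memory fragmentation ratio, used_memory_rss / used_memory
total_commands_processed    COUNTER   Commands executed per second; a fairly accurate QPS
rejected_connections        COUNTER   Rejected connections per second
expired_keys                COUNTER   Expired keys per second
evicted_keys                COUNTER   Evicted keys per second
keyspace_hits               COUNTER   Key hits per second
keyspace_misses             COUNTER   Key misses per second
keyspace_hit_ratio          GAUGE     Key hit ratio (%)
keys_num                    GAUGE     Number of keys
cmdstat_*                   COUNTER   Executions per second of each command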
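keyspace_hit_ratio and keys_num are not reported by Redis directly; they are derived from the INFO output. The sketch below shows the arithmetic on a plain dict shaped the way redis-py returns INFO (numeric fields at the top level, one nested dict per dbN entry). The function names and the default of 16 databases are assumptions for illustration.

```python
def keyspace_hit_ratio(info):
    """Hits as a percentage of all key lookups (hits + misses)."""
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    if total == 0:
        return 0
    return round(hits / total * 100, 2)


def total_keys(info, databases=16):
    """Sum the 'keys' field of every dbN entry present in INFO."""
    total = 0
    for index in range(databases):
        db = info.get("db%d" % index)
        if db:
            total += int(db.get("keys", 0))
    return total


sample_info = {
    "keyspace_hits": 300,
    "keyspace_misses": 100,
    "db0": {"keys": 10, "expires": 2},
    "db3": {"keys": 5, "expires": 0},
}
print(keyspace_hit_ratio(sample_info))  # 75.0
print(total_keys(sample_info))          # 15
```

Databases that hold no keys simply do not appear in INFO, which is why missing dbN entries are skipped rather than treated as errors.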

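Usage: the script reads the list of Redis instances to monitor from a configuration file and collects metrics from each one. The file (redisdb_list.txt) has one instance per line, in the following format: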

 IP,Port,Password,endpoint

192.168.1.56,7021,zhoujy,redis-56:7021
192.168.1.55,7021,zhoujy,redis-55:7021

Then run:

 python redis_monitor.py redisdb_list.txt

3) MongoDB collection script (mongodb_monitor.py)

... to be added later

 

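4) Other related monitoring (requires the agent to be installed), for example the following metrics:

Alert item               Trigger condition     Notes
load.1min                all(#3)>10            Redis server is overloaded; processing capacity drops
cpu.idle                 all(#3)<10            CPU idle is too low; processing capacity drops
df.bytes.free.percent    all(#3)<20            Free disk space is below 20%; affects RDB and AOF persistence on replicas
mem.memfree.percent      all(#3)<15            Free memory is below 15%; Redis risks the OOM killer and swap usage
mem.swapfree.percent     all(#3)<80            More than 20% of swap is in use; Redis performance drops or OOM becomes a risk
net.if.out.bytes         all(#3)>94371840      Outbound network traffic exceeds 90 MB; affects Redis response time
net.if.in.bytes          all(#3)>94371840      Inbound network traffic exceeds 90 MB; affects Redis response time
disk.io.util             all(#3)>90            Disk I/O may be under heavy load; affects replica persistence and blocks writes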
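As I understand Open-Falcon's strategy syntax, a condition such as all(#3)>10 fires only when each of the three most recent data points satisfies the comparison. The sketch below mimics that rule on a plain list of values to make the table easier to read; it is not Open-Falcon's own code, and the function name all_last_n is hypothetical. It also checks that the network threshold 94371840 is 90 MB expressed in bytes.

```python
import operator

# Comparison operators that may appear in a trigger condition.
_OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def all_last_n(points, n, op, threshold):
    """True when the latest n points all satisfy `value <op> threshold`."""
    if len(points) < n:
        return False  # not enough history yet to decide
    compare = _OPS[op]
    return all(compare(value, threshold) for value in points[-n:])


load_1min = [4, 12, 11, 15]
print(all_last_n(load_1min, 3, ">", 10))      # True: 12, 11 and 15 all exceed 10
print(all_last_n([12, 9, 15], 3, ">", 10))    # False: 9 breaks the streak
print(90 * 1024 * 1024)                       # 94371840
```

Requiring several consecutive points keeps a single spike from raising an alarm, at the cost of a delay of a few collection intervals.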

 

Related documentation:

https://github.com/iambocai/falcon-monit-scripts (redis monitor)

https://github.com/ZhuoRoger/redismon (redis monitor)

 
