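I recently migrated our Prometheus monitoring into a Kubernetes cluster, following the deployment guide "Using Prometheus Operator in Kubernetes to Auto-discover and Monitor SpringBoot". Metric collection and the Grafana dashboards all tested fine, so the next step was to migrate and test alerting. Alertmanager has no native DingTalk support, so alerts have to go through a webhook. Fortunately, someone has already open-sourced a webhook for Prometheus DingTalk alerts (project: https://github.com/timonwong/prometheus-webhook-dingtalk), so we only need to configure and use it.
Creating a DingTalk robot is simple and is not covered here. Once the robot exists, the next step is to deploy the webhook, which receives alerts from Alertmanager, formats them, and forwards them to the DingTalk robot. Deploying outside a Kubernetes cluster is also simple: write a docker-compose file and run it.
1. In a Kubernetes cluster, pods talk to each other through a Service, so first write a Kubernetes YAML file, dingtalk-webhook.yaml, that defines both a Deployment and a Service.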
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
        - '--web.enable-ui'
        - '--web.enable-lifecycle'
        - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - mountPath: "/config"
          name: dingtalk-volume
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-volume
        persistentVolumeClaim:
          claimName: dingding-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
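1.1. The first option uses persistent storage: the configuration file config.yaml and the alert template live on shared storage, so the webhook can read them no matter which node it is scheduled on. For NFS-backed persistence, see the article "Dynamically Provisioning NFS PVs with a StorageClass in Kubernetes".
dingding-pvc.yaml: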
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dingding-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "atang-nfs"
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
Configuration file config.yaml (replace your_dingding_token with the token of your own DingTalk robot):
templates:
  - /config/template.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token
Alert template template.tmpl:
{{ define "ding.link.title" }}[Monitoring Alert]{{ end }}
{{ define "ding.link.content" -}}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range $i, $alert := .Alerts.Firing }}
[Alert]: {{ index $alert.Labels "alertname" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Severity]: {{ index $alert.Labels "severity" }}
[Value]: {{ index $alert.Annotations "value" }}
[Details]: {{ index $alert.Annotations "description" }}
[Triggered at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range $i, $alert := .Alerts.Resolved }}
[Alert]: {{ index $alert.Labels "alertname" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Status]: Resolved
[Started at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
[Resolved at]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{- end }}
{{- end }}
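Adjust the template to your own taste. The expression ".EndsAt.Add 28800e9" adds 8 hours to the UTC time (28800e9 nanoseconds is 28,800 seconds), because Prometheus and Alertmanager both use UTC by default. Also set the owner and group of both files to 65534; otherwise the webhook container has no permission to read them.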
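The arithmetic behind the offset can be checked outside the template. The sketch below is Python rather than Go and is purely illustrative: it treats 28800e9 as a nanosecond count (the unit of Go's time.Duration), converts it to hours, and shifts a UTC timestamp the same way the template does. The sample timestamp and the helper name to_utc8_string are assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Go's time.Duration counts nanoseconds, so 28800e9 is 28,800 seconds.
OFFSET_NS = int(28800e9)
offset = timedelta(microseconds=OFFSET_NS // 1000)


def to_utc8_string(utc_time: datetime) -> str:
    """Shift a UTC time by the offset and format it like Go's "2006-01-02 15:04:05"."""
    return (utc_time + offset).strftime("%Y-%m-%d %H:%M:%S")


# Hypothetical alert start time in UTC, for illustration only.
starts_at = datetime(2021, 3, 1, 18, 30, 0, tzinfo=timezone.utc)

print(offset.total_seconds() / 3600)  # 8.0
print(to_utc8_string(starts_at))      # 2021-03-02 02:30:00
```

A fixed offset like this is simple, but it hard-codes UTC+8 into the template; readers in another time zone would need a different constant.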
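1.2. The second option (recommended) mounts the configuration file and the template from a ConfigMap. This requires modifying the original dingtalk-webhook.yaml so that the mounted volume is a ConfigMap.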
apiVersion: v1
kind: ConfigMap
metadata:
  name: dingtalk-config
  namespace: monitoring
data:
  config.yaml: |
    templates:
      - /config/template.tmpl
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token
  template.tmpl: |
    {{ define "ding.link.title" }}[Monitoring Alert]{{ end }}
    {{ define "ding.link.content" -}}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range $i, $alert := .Alerts.Firing }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Severity]: {{ index $alert.Labels "severity" }}
    [Value]: {{ index $alert.Annotations "value" }}
    [Details]: {{ index $alert.Annotations "description" }}
    [Triggered at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range $i, $alert := .Alerts.Resolved }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Status]: Resolved
    [Started at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    [Resolved at]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{- end }}
    {{- end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v1.4.0
        imagePullPolicy: IfNotPresent
        args:
        - '--web.enable-ui'
        - '--web.enable-lifecycle'
        - '--config.file=/config/config.yaml'
        ports:
        - containerPort: 8060
          protocol: TCP
        volumeMounts:
        - name: dingtalk-config
          mountPath: "/config"
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: dingtalk-config
        configMap:
          name: dingtalk-config
---
apiVersion: v1
kind: Service
metadata:
  # The Service name must match the host in Alertmanager's webhook URL.
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
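2. Modify Alertmanager's default configuration to add webhook_configs. Edit kube-prometheus-master/manifests/alertmanager-secret.yaml directly so that it reads as follows. In the webhook URL, the host webhook-dingtalk is the name of the Service and webhook1 is the target name from config.yaml: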
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "www.amd5.cn"
    #- "name": "Watchdog"
    #- "name": "Critical"
    #- "name": "webhook"
      "webhook_configs":
      - "url": "http://webhook-dingtalk/dingtalk/webhook1/send"
        "send_resolved": true
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "www.amd5.cn"
      "repeat_interval": "12h"
      #"routes":
      #- "match":
      #    "alertname": "Watchdog"
      #  "receiver": "Watchdog"
      #- "match":
      #    "severity": "critical"
      #  "receiver": "Critical"
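The URL in webhook_configs is assembled from pieces defined earlier: the Service name, the Service port, and the target name in config.yaml. The helper below is a hypothetical Python sketch (it is not part of Alertmanager or the webhook project) that builds the URL from those pieces, assuming the path follows the /dingtalk/<target>/send pattern seen in the configuration above. The short host name is enough here because Alertmanager and the webhook Service both live in the monitoring namespace; the namespaced form is shown only for comparison.

```python
def webhook_url(service: str, target: str, port: int = 80, namespace: str = "") -> str:
    """Build the URL Alertmanager posts to (hypothetical helper for illustration)."""
    # With a namespace, use the cluster DNS form <service>.<namespace>.svc.
    host = f"{service}.{namespace}.svc" if namespace else service
    # Port 80 is the HTTP default, so it can be left out of the URL.
    authority = host if port == 80 else f"{host}:{port}"
    return f"http://{authority}/dingtalk/{target}/send"


print(webhook_url("webhook-dingtalk", "webhook1"))
print(webhook_url("webhook-dingtalk", "webhook1", namespace="monitoring"))
```

If the Service is renamed or its port is changed away from 80, the URL in alertmanager-secret.yaml has to change with it.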
Once all the YAML files are ready, apply them (dingding-pvc.yaml is only needed for the first option):
kubectl apply -f dingding-pvc.yaml
kubectl apply -f dingtalk-webhook.yaml
kubectl apply -f alertmanager-secret.yaml
Check the result:

Then open the Alertmanager status page (replace alertmanager.amd5.cn with your own address) to confirm that webhook_configs has taken effect: http://alertmanager.amd5.cn/#/status.
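To exercise the webhook without waiting for a real alert, a hand-built payload can be posted to it. The sketch below is Python using only the standard library. The payload follows Alertmanager's webhook JSON format (version "4"); the label and annotation values, the function names build_payload and post_alert, and the URL in the usage comment are made up for illustration. Nothing is sent until post_alert is called, for example from a pod inside the cluster or through a kubectl port-forward.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_payload(alertname: str, instance: str, severity: str = "warning") -> dict:
    """Build a minimal firing notification in Alertmanager's webhook format."""
    labels = {"alertname": alertname, "instance": instance, "severity": severity}
    # "value" and "description" are the annotations the template reads.
    annotations = {"description": f"{instance}: test alert", "value": "141"}
    alert = {
        "status": "firing",
        "labels": labels,
        "annotations": annotations,
        "startsAt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "endsAt": "0001-01-01T00:00:00Z",  # zero time: not resolved yet
        "generatorURL": "",
    }
    return {
        "version": "4",
        "groupKey": "{}:{}",
        "status": "firing",
        "receiver": "www.amd5.cn",
        "groupLabels": {},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "",
        "alerts": [alert],
    }


def post_alert(url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_payload("DingTalkAlertTest", "10.0.0.1:8080")
print(json.dumps(payload, indent=2))
# Example call (the URL is an assumption; adjust it to how the Service is reached):
# post_alert("http://webhook-dingtalk/dingtalk/webhook1/send", payload)
```

A successful call should make a message appear in the DingTalk group, which separates problems in the webhook and template from problems in the Alertmanager routing.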
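3. Once the configuration is in effect, add an alert rule and wait for its threshold to be crossed to test alerting.
Edit kube-prometheus-master/manifests/prometheus-rules.yaml directly and append the following at the end, taking care with the indentation.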
  - name: prometheus-operator
    rules:
    - alert: PrometheusOperatorReconcileErrors
      annotations:
        message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace
          }} Namespace.
      expr: |
        rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorNodeLookupErrors
      annotations:
        message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
  # The test alert rule added for this walkthrough starts here
  - name: www.amd5.cn
    rules:
    - alert: 'DingTalkAlertTest'
      expr: |
        jvm_threads_live > 140
      for: 1m
      labels:
        severity: 'warning'
      annotations:
        summary: "{{ $labels.instance }}: DingTalk alert test"
        description: "{{ $labels.instance }}: DingTalk alert test"
        custom: "DingTalk alert test"
        value: "{{$value}}"
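Then run the following command to update the rules:
kubectl apply -f prometheus-rules.yaml
Then open Prometheus at http://prometheus.amd5.cn/alerts to check that the rule has taken effect, as shown in the screenshot below: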

Once the condition has persisted for the duration set in the rule, DingTalk receives the alert:



