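I recently migrated our Prometheus monitoring into a Kubernetes cluster, following the deployment guide "Using Prometheus Operator in Kubernetes to Auto-Discover and Monitor Spring Boot". Data collection for all the monitored items and the Grafana dashboards tested fine, so the next step was to migrate and test alerting. Alertmanager has no native support for DingTalk alerts, so the only route is a webhook. Fortunately, someone has already open-sourced a Prometheus-to-DingTalk alert webhook (project: https://github.com/timonwong/prometheus-webhook-dingtalk), so we can simply configure and use it.
Creating a DingTalk robot is simple, so I won't cover it here. Once the robot exists, the next step is to deploy the webhook, which receives alerts from Alertmanager, formats them, and forwards them to the DingTalk robot. Deploying outside a Kubernetes cluster is also simple: write a docker-compose file and run it.
1. In a Kubernetes cluster, pods talk to each other through a Service, so first write a Kubernetes manifest, dingtalk-webhook.yaml, that contains both a Deployment and a Service.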
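For the non-Kubernetes case mentioned above, a docker-compose file might look roughly like the following. This is only a sketch: the service name, the ./config host directory (assumed to hold the config.yaml and template.tmpl shown later in this post), and the published port are assumptions, not part of the original setup.

```yaml
# docker-compose.yml: a minimal sketch for running the webhook outside Kubernetes.
# Assumption: ./config on the host contains config.yaml and template.tmpl.
version: "3"
services:
  webhook-dingtalk:                # hypothetical service name
    image: timonwong/prometheus-webhook-dingtalk:v1.4.0
    restart: always
    command:                       # same flags as the Kubernetes Deployment in step 1
      - "--web.enable-ui"
      - "--web.enable-lifecycle"
      - "--config.file=/config/config.yaml"
    ports:
      - "8060:8060"                # the webhook listens on 8060 in the container
    volumes:
      - ./config:/config:ro        # mount the config directory read-only
```

With this layout, Alertmanager would post to port 8060 on the Docker host instead of to a Kubernetes Service.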
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
        - name: dingtalk
          image: timonwong/prometheus-webhook-dingtalk:v1.4.0
          imagePullPolicy: IfNotPresent
          args:
            - '--web.enable-ui'
            - '--web.enable-lifecycle'
            - '--config.file=/config/config.yaml'
          ports:
            - containerPort: 8060
              protocol: TCP
          volumeMounts:
            - mountPath: "/config"
              name: dingtalk-volume
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
      volumes:
        - name: dingtalk-volume
          persistentVolumeClaim:
            claimName: dingding-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
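1.1. The first approach uses persistent storage: the configuration file config.yaml and the alert template live on shared storage, so the webhook can read them no matter which node it is scheduled on. For how to persist data over NFS, see "Dynamically Provisioning NFS PVs in Kubernetes with a StorageClass".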
dingding-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dingding-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "atang-nfs"
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Mi

Configuration file config.yaml:

# Replace your_dingding_token with the token of your own DingTalk robot.
templates:
  - /config/template.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token

Alert template template.tmpl:

{{ define "ding.link.title" }}[Monitoring Alert]{{ end }}

{{ define "ding.link.content" -}}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range $i, $alert := .Alerts.Firing }}
[Alert]: {{ index $alert.Labels "alertname" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Severity]: {{ index $alert.Labels "severity" }}
[Value]: {{ index $alert.Annotations "value" }}
[Details]: {{ index $alert.Annotations "description" }}
[Fired at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range $i, $alert := .Alerts.Resolved }}
[Alert]: {{ index $alert.Labels "alertname" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Status]: Resolved
[Started]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
[Resolved]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{ end }}{{- end }}
{{- end }}
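The target name in config.yaml becomes part of the webhook path: the target webhook1 is the one Alertmanager later reaches at /dingtalk/webhook1/send. As a rough sketch of how a second robot could be added, assuming a hypothetical target named webhook2 and assuming that robot uses DingTalk's signature ("sign") security setting, the file might look like this. The tokens and the secret are dummy placeholders.

```yaml
# Sketch of a config.yaml with two targets.
# "webhook2", both tokens, and the secret are placeholders for illustration.
templates:
  - /config/template.tmpl
targets:
  webhook1:                        # reachable at /dingtalk/webhook1/send
    url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token
  webhook2:                        # reachable at /dingtalk/webhook2/send
    url: https://oapi.dingtalk.com/robot/send?access_token=another_dingding_token
    # Assumption: this robot was created with the "sign" security option
    secret: SEC000000000000000000000
```

Each target can then be referenced from a different Alertmanager receiver, which is one way to send different alerts to different DingTalk groups.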
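Feel free to adapt the template to your own taste. ".EndsAt.Add 28800e9" adds 8 hours to the UTC time (28800 seconds, written in nanoseconds), because both Prometheus and Alertmanager use UTC by default. In addition, the owner and group of these two files must be set to 65534; otherwise the webhook container has no permission to read them.
1.2. The second approach (recommended) mounts the configuration file and the template from a ConfigMap. This requires modifying the original dingtalk-webhook.yaml so that the volume is backed by a ConfigMap.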
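One possible way to handle the ownership requirement from inside the cluster is an init container that runs chown before the webhook starts. The fragment below is a sketch only: the container name and the busybox image tag are assumptions, and it only works if the NFS export lets root inside the container change ownership (no root_squash). Otherwise, running chown directly on the NFS server is the simpler option.

```yaml
# Fragment for spec.template.spec of the Deployment in dingtalk-webhook.yaml.
# Assumption: the NFS export allows root in the container to change file ownership.
initContainers:
  - name: fix-config-owner         # hypothetical name
    image: busybox:1.36            # assumed image and tag
    # Hand the config directory to UID/GID 65534, the user the webhook runs as
    command: ["sh", "-c", "chown -R 65534:65534 /config"]
    volumeMounts:
      - name: dingtalk-volume      # same volume the webhook container mounts
        mountPath: /config
```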
apiVersion: v1
kind: ConfigMap
metadata:
  name: dingtalk-config
  namespace: monitoring
data:
  config.yaml: |
    templates:
      - /config/template.tmpl
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=your_dingding_token
  template.tmpl: |
    {{ define "ding.link.title" }}[Monitoring Alert]{{ end }}

    {{ define "ding.link.content" -}}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range $i, $alert := .Alerts.Firing }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Severity]: {{ index $alert.Labels "severity" }}
    [Value]: {{ index $alert.Annotations "value" }}
    [Details]: {{ index $alert.Annotations "description" }}
    [Fired at]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{- end }}

    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range $i, $alert := .Alerts.Resolved }}
    [Alert]: {{ index $alert.Labels "alertname" }}
    [Instance]: {{ index $alert.Labels "instance" }}
    [Status]: Resolved
    [Started]: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    [Resolved]: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{- end }}
    {{- end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  # Same name as in the first approach, so that the Alertmanager URL
  # http://webhook-dingtalk/dingtalk/webhook1/send used in step 2 still resolves.
  name: webhook-dingtalk
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dingtalk
  replicas: 1
  template:
    metadata:
      labels:
        app: dingtalk
    spec:
      restartPolicy: Always
      containers:
        - name: dingtalk
          image: timonwong/prometheus-webhook-dingtalk:v1.4.0
          imagePullPolicy: IfNotPresent
          args:
            - '--web.enable-ui'
            - '--web.enable-lifecycle'
            - '--config.file=/config/config.yaml'
          ports:
            - containerPort: 8060
              protocol: TCP
          volumeMounts:
            - name: dingtalk-config
              mountPath: "/config"
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
      volumes:
        - name: dingtalk-config
          configMap:
            name: dingtalk-config
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8060
  selector:
    app: dingtalk
  sessionAffinity: None
2. Modify Alertmanager's default configuration to add webhook_configs. Edit kube-prometheus-master/manifests/alertmanager-secret.yaml so that it reads as follows:
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "www.amd5.cn"
    #- "name": "Watchdog"
    #- "name": "Critical"
    #- "name": "webhook"
      "webhook_configs":
      - "url": "http://webhook-dingtalk/dingtalk/webhook1/send"
        "send_resolved": true
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "www.amd5.cn"
      "repeat_interval": "12h"
      #"routes":
      #- "match":
      #    "alertname": "Watchdog"
      #  "receiver": "Watchdog"
      #- "match":
      #    "severity": "critical"
      #  "receiver": "Critical"
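With the configuration above, every alert goes to the single DingTalk receiver, including the always-firing Watchdog alert that kube-prometheus ships. If that is unwanted, a child route can divert it. The fragment below is a sketch, not part of the original setup: the receiver named "null" is a hypothetical receiver with no integrations, so alerts routed to it are simply dropped.

```yaml
# Fragment of alertmanager.yaml: only the receivers and route sections.
"receivers":
- "name": "www.amd5.cn"
  "webhook_configs":
  - "url": "http://webhook-dingtalk/dingtalk/webhook1/send"
    "send_resolved": true
- "name": "null"                   # hypothetical receiver without any integration
"route":
  "group_by":
  - "namespace"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "www.amd5.cn"        # default: everything else still goes to DingTalk
  "repeat_interval": "12h"
  "routes":
  - "match":
      "alertname": "Watchdog"      # divert only the Watchdog alert
    "receiver": "null"
```

The name is quoted because an unquoted null would be parsed as a YAML null value rather than a string.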
Once all the YAML files are ready, run:
kubectl apply -f dingding-pvc.yaml
kubectl apply -f dingtalk-webhook.yaml
kubectl apply -f alertmanager-secret.yaml
Check the output of these commands.
Then open the Alertmanager status page (replace alertmanager.amd5.cn with your own address) to check whether the webhook_configs setting has taken effect: http://alertmanager.amd5.cn/#/status.
3. Once the configuration is live, add an alerting rule and wait for its threshold to be crossed in order to test the alert.
Edit kube-prometheus-master/manifests/prometheus-rules.yaml and append the following at the end, minding the indentation.
  - name: prometheus-operator
    rules:
    - alert: PrometheusOperatorReconcileErrors
      annotations:
        message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
    - alert: PrometheusOperatorNodeLookupErrors
      annotations:
        message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
  # The test alerting rule added for this walkthrough starts here
  - name: www.amd5.cn
    rules:
    - alert: 'DingTalk alert test'
      expr: |
        jvm_threads_live > 140
      for: 1m
      labels:
        severity: 'warning'
      annotations:
        summary: "{{ $labels.instance }}: DingTalk alert test"
        description: "{{ $labels.instance }}: DingTalk alert test"
        custom: "DingTalk alert test"
        value: "{{$value}}"
Then apply the updated rules:
kubectl apply -f prometheus-rules.yaml
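As an alternative to appending to the bundled prometheus-rules.yaml, the test rule could live in its own PrometheusRule object. The following is a sketch under assumptions: the object name dingtalk-test-rules is hypothetical, and the two labels assume that the Prometheus resource selects rules by prometheus: k8s and role: alert-rules, which should be checked against the ruleSelector of your own Prometheus resource.

```yaml
# Sketch of a standalone PrometheusRule holding only the test rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dingtalk-test-rules        # hypothetical name
  namespace: monitoring
  labels:
    prometheus: k8s                # assumed to match the Prometheus ruleSelector
    role: alert-rules              # assumed to match the Prometheus ruleSelector
spec:
  groups:
  - name: www.amd5.cn
    rules:
    - alert: 'DingTalk alert test'
      expr: |
        jvm_threads_live > 140
      for: 1m                      # condition must hold for 1 minute before firing
      labels:
        severity: 'warning'
      annotations:
        description: "{{ $labels.instance }}: DingTalk alert test"
        value: "{{$value}}"        # read by the template's value line
```

Keeping the test rule separate makes it easy to delete once the DingTalk integration has been verified.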
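Then open Prometheus at http://prometheus.amd5.cn/alerts to check that the rule has been loaded, as shown in the figure below:
Once the condition has held for the duration set in the rule, DingTalk receives the alert: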