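I. Prometheus monitoring and alerting:
1. Monitoring alerts:
Alerting is handled by a separate module, Alertmanager, which has to be configured on its own.
The mailbox address must be declared, and the configuration is supplied through a ConfigMap.
Alertmanager is also deployed as a pod inside the K8S cluster.
2. Create the alert file and configure how alerts are sent
vim alert-cfg.yaml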
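apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      # Time after which an alert is declared resolved if it has not been updated (Alertmanager's default is 5m)
      smtp_smarthost: 'smtp.qq.com:25'
      smtp_from: '1654129473@qq.com'
      smtp_auth_username: '1654129473@qq.com'
      smtp_auth_password: 'sjhdrdrcrwinbjjj'
      smtp_require_tls: false
    route:
      # Routing policy for alerts
      group_by: [alertname]
      # Grouping key: alerts that share the same alertname label are placed in one group
      group_wait: 10s
      # Wait 10 seconds before the first notification for a new group, so other alerts arriving in the same group are sent together
      group_interval: 10s
      # Wait 10 seconds before notifying about new alerts added to a group that has already been notified
      receiver: default-receiver
      # Who receives the alert
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '1654129473@qq.com'
        # Recipient address for alert emails
        send_resolved: true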
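The notes do not show the ConfigMap being loaded into the cluster, so the following is a minimal sketch of that step. It assumes kubectl is on PATH and already points at the target cluster. The `KUBECTL` variable and the `apply_alert_config` function are names introduced here for illustration only.

```shell
#!/bin/sh
# Assumption: kubectl is installed and configured for the target cluster.
# KUBECTL is a hypothetical override hook, handy for previewing the commands.
KUBECTL="${KUBECTL:-kubectl}"

apply_alert_config() {
    # Create or update the ConfigMap from the manifest written above.
    $KUBECTL apply -f alert-cfg.yaml
    # Print the stored object to confirm what Alertmanager will read.
    $KUBECTL -n monitor-sa get configmap alertmanager -o yaml
}
```

Calling `apply_alert_config` runs both commands in order; the second one should echo back the alertmanager.yml content defined in the ConfigMap.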
3. Configure Alertmanager
Upload prometheus-alertmanager-cfg.yaml
The changes start a little past line 100 of the file
kubectl -n monitor-sa create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/server.key --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/ca.crt
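#Creates a generic Secret named etcd-certs in the given namespace. This Secret stores the etcd certificate files
#etcd-certs: a user-chosen name
#generic: sets the Secret type to generic (Opaque)
#--from-file=/etc/kubernetes/pki/etcd/server.key: the file's content becomes one entry in the "etcd-certs" Secret
#These three files (private key, server certificate, CA certificate) are all used for secure communication with etcd.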
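As a quick check on the Secret created above, the sketch below lists the key names the Secret should contain and shows one way to inspect it. With `--from-file=<path>`, kubectl stores each file under its base name. The function names and the `KUBECTL` variable are hypothetical, and a configured kubectl is assumed.

```shell
#!/bin/sh
# KUBECTL is a hypothetical override hook; by default the real kubectl is used.
KUBECTL="${KUBECTL:-kubectl}"

# Each --from-file=<path> entry is stored under the file's base name.
expected_secret_keys() {
    for path in "$@"; do
        basename "$path"
    done
}

# "describe" lists the key names and their sizes without printing the values.
show_etcd_secret() {
    $KUBECTL -n monitor-sa describe secret etcd-certs
}

expected_secret_keys \
    /etc/kubernetes/pki/etcd/server.key \
    /etc/kubernetes/pki/etcd/server.crt \
    /etc/kubernetes/pki/etcd/ca.crt
```

The three names printed here (server.key, server.crt, ca.crt) are the ones that `show_etcd_secret` should report on a live cluster.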
4. Update the resource manifest YAML file and install Prometheus and Alertmanager
vim prometheus-alertmanager-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      serviceAccountName: monitor
      initContainers:
      - name: init-chmod
        image: busybox:latest
        command: ['sh','-c','chmod -R 777 /prometheus;chmod -R 777 /etc']
        volumeMounts:
        - mountPath: /prometheus
          name: prometheus-storage-volume
        - mountPath: /etc/localtime
          name: timezone
      containers:
      - name: prometheus
        image: prom/prometheus:v2.45.0
        command:
        - prometheus
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention=720h
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus/
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: timezone
          mountPath: /etc/localtime
        - name: k8s-certs
          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
      - name: alertmanager
        image: prom/alertmanager:v0.20.0
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
          defaultMode: 0777
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory
      - name: k8s-certs
        secret:
          secretName: etcd-certs
      - name: timezone
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
      - name: alertmanager-config
        configMap:
          name: alertmanager
      - name: alertmanager-storage
        hostPath:
          path: /data/alertmanager
          type: DirectoryOrCreate
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
kubectl get pods -n kube-system | grep kube-proxy |awk '{print $1}' | xargs kubectl delete pods -n kube-system
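#Deletes every Pod in the "kube-system" namespace whose name contains "kube-proxy"
#kubectl get pods -n kube-system: lists all pods in the kube-system namespace
#grep kube-proxy: keeps only the lines for kube-proxy Pods.
#awk '{print $1}': extracts the first field of each line, which is the pod name.
#xargs kubectl delete pods -n kube-system: passes the extracted pod names as arguments to kubectl delete pods -n kube-system, which deletes those pods
#This command quickly deletes all kube-proxy related Pods in the given namespace. It is useful for forcing the Pods to be recreated, or for debugging and maintenance work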
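Because the pipeline above deletes whatever the filter matches, it can help to preview the filter stages first. The sketch below runs the same grep and awk stages on sample text and leaves out the delete step. The pod names in `sample_pods` are made up for illustration, and both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample of `kubectl get pods -n kube-system` output.
sample_pods() {
    cat <<'EOF'
NAME                       READY   STATUS    RESTARTS   AGE
coredns-7f6cbbb7b8-abcde   1/1     Running   0          3d
kube-proxy-4x7kq           1/1     Running   0          3d
kube-proxy-z9m2p           1/1     Running   0          3d
EOF
}

# Same filter stages as the command above, without the xargs delete step.
select_kube_proxy_pods() {
    grep kube-proxy | awk '{print $1}'
}

# Prints only the two kube-proxy pod names from the sample.
sample_pods | select_kube_proxy_pods
```

On a live cluster, `kubectl get pods -n kube-system | select_kube_proxy_pods` would print the names that the full command is about to delete.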
5. Create the Service (svc) for Alertmanager
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: alertmanager
  namespace: monitor-sa
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    targetPort: 9093
  selector:
    app: prometheus
  type: NodePort
6. Access test
http://20.0.0.61:30066/#/alerts
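Besides opening the page in a browser, the NodePort can be probed from a shell. The sketch below assumes curl is installed and that the node address from the URL above is reachable. It calls Alertmanager's `/-/healthy` and `/-/ready` management endpoints. The `ALERTMANAGER_URL` and `CURL` variables and the `check_alertmanager` function are names introduced here for illustration.

```shell
#!/bin/sh
# Base URL taken from the access test above; override it for other nodes.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://20.0.0.61:30066}"
# CURL is a hypothetical override hook; by default the real curl is used.
CURL="${CURL:-curl}"

check_alertmanager() {
    # -f makes curl return a non-zero status on HTTP errors,
    # so the function fails if either endpoint is not OK.
    $CURL -fsS "$ALERTMANAGER_URL/-/healthy" &&
        $CURL -fsS "$ALERTMANAGER_URL/-/ready"
}
```

Running `check_alertmanager && echo reachable` gives a quick yes/no answer before moving on to the alert states.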
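inactive: the alert rule is loaded and being evaluated, but its condition is not currently met
pending: the alert threshold has been crossed, and the alert is waiting before a notification email is sent
firing: the alert has fired and is being delivered through the configured channel (email, SMS, phone call, DingTalk)
Stress test:
After the stress test is stopped, the alert returns to the inactive state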
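The three states can be summarised as a small decision function. This is only a sketch of the usual Prometheus behaviour, not the rules used in these notes, which are not shown. It assumes a hypothetical rule whose `for` clause is 60 seconds, and the `alert_state` function and its arguments are illustrative names.

```shell
#!/bin/sh
# alert_state <condition_true: 0|1> <seconds_condition_has_held> <for_seconds>
# Assumption: the rule has a `for` duration; the values below are made up.
alert_state() {
    condition_true="$1"
    held="$2"
    for_seconds="$3"
    if [ "$condition_true" -ne 1 ]; then
        echo inactive    # condition not met
    elif [ "$held" -lt "$for_seconds" ]; then
        echo pending     # condition met, but not yet for long enough
    else
        echo firing      # condition has held for the whole `for` duration
    fi
}

alert_state 0 0 60     # prints: inactive
alert_state 1 20 60    # prints: pending
alert_state 1 90 60    # prints: firing
```

Stopping the stress test corresponds to the first call: once the condition is no longer true, the state is inactive again regardless of how long it held before.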