Basic Structure of a Kubernetes Monitoring Stack



Lab Environment

Three-node Kubernetes 1.17.2 cluster

  • 192.168.220.110 kmaster
  • 192.168.220.111 knode1
  • 192.168.220.112 knode2

Components

  • Prometheus: collects, queries, analyzes, and stores monitoring data and fires alerts; deployed as a Deployment
  • Grafana: visualizes the monitoring data; deployed as a Deployment
  • Alertmanager: sends alert notifications; deployed as a Deployment
  • node-exporter: collects node-level metrics; deployed as a DaemonSet
  • blackbox-exporter: black-box monitoring of Kubernetes services; deployed as a Deployment

I deployed all of these components in my own lab. Each component's ConfigMap, Deployment, and Service live in a single YAML file so they can be started and stopped together; to avoid pulling images repeatedly I pinned the pods to fixed nodes, and storage is plain host-local storage. The ConfigMaps for Prometheus, Alertmanager, and blackbox-exporter are better deployed separately, because then a configuration change only needs a POST to the /-/reload endpoint to take effect instead of a restart of the Deployment.
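
A note on /-/reload: Prometheus 2.x only serves that endpoint when it is started with --web.enable-lifecycle (the Deployment below does not add the flag, so it would have to be appended to args), while Alertmanager accepts POST /-/reload out of the box; blackbox-exporter has a similar reload endpoint. Also keep in mind that an updated ConfigMap takes a short while to sync into the mounted volume before a reload picks it up. A minimal reload sketch, assuming the NodePorts used later in this post:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Assumed addresses: any node IP plus the NodePorts defined below
	// (30000 for Prometheus, 30002 for Alertmanager).
	endpoints := []string{
		"http://192.168.220.112:30000/-/reload", // requires --web.enable-lifecycle
		"http://192.168.220.112:30002/-/reload",
	}
	for _, u := range endpoints {
		resp, err := http.Post(u, "", nil)
		if err != nil {
			log.Printf("reload %s failed: %v", u, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("reload %s -> %s\n", u, resp.Status)
	}
}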

The purpose of deploying two Services per component: the -inner one is used for communication between components inside the cluster, and the NodePort one exposes the component's web UI for access from outside.

All components run in the monitoring namespace.

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    name: monitoring

Prometheus

This configures the alerting rules, the scrape settings, and the Alertmanager address. Besides the usual Kubernetes targets, I use blackbox-exporter to probe a web service and a logstash service that are not deployed on Kubernetes; the alerting rules cover the nodes' basic resources plus those two blackbox targets. On the web UI's Targets page everything is up except kube-state-metrics, which I'll look into later.
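
Before wiring external targets into the web_status / port_status jobs below, it can help to hit the blackbox-exporter's /probe endpoint by hand and look at probe_success (1 means the probe passed, 0 means it failed). A minimal sketch, assuming the exporter is reachable on 127.0.0.1:9115 (for example via a kubectl port-forward to the blackbox-exporter pod); the target URL is just one of the probed addresses from the config:

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Assumed address: blackbox-exporter reached through a local port-forward.
	base := "http://127.0.0.1:9115/probe"
	target := "http://192.168.220.112:5000/health"

	probeURL := base + "?module=http_2xx&target=" + url.QueryEscape(target)
	resp, err := http.Get(probeURL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	// Print only the probe_success line from the exporter's metrics output.
	for _, line := range strings.Split(string(body), "\n") {
		if strings.HasPrefix(line, "probe_success") {
			fmt.Println(line)
		}
	}
}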

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.rules: |-
    groups:
    - name: web_probe
      rules:
      - alert: ProbeFailing
        expr: probe_success == 0
        # expr: up{job="blackbox"} == 0 or probe_success{job="blackbox"} == 0
        for: 10s
        labels:
          severity: critical
          team: blackbox
        annotations:
          summary: "Site/port {{ $labels.instance }} is unreachable, probe_success = {{ $value }}."
          description: "Site/port {{ $labels.instance }} in job {{ $labels.job }} has been unreachable for more than 10 seconds."
    - name: node_alert
      rules:
      - alert: HighNodeMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
        # expr: (node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes)/node_memory_MemTotal_bytes*100 > 90 
        for: 10s
        labels:
          severity: critical
          team: node
        annotations:
          summary: "Instance {{ $labels.instance }} 节点内存使用率超过95%"
          description: "{{ $labels.instance }} of job {{$labels.job}}内存使用率超过95%,当前使用率[{{ $value }}]."
      - alert: HighNodeCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: node
        annotations:
          summary: "(instance {{ $labels.instance }}) 节点CPU 5分钟负载大于80%"
          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: OutOfRootDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"}  * 100) / node_filesystem_size_bytes{mountpoint="/"} < 10
        for: 5m
        labels:
          severity: warning
          team: node
        annotations:
          summary: "(instance {{ $labels.instance }}) 节点根目录剩余空间小于10%"
          description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: OutOfHomeDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/home"}  * 100) / node_filesystem_size_bytes{mountpoint="/home"} < 10
        for: 5m
        labels:
          severity: warning
          team: node
        annotations:
          summary: "(instance {{ $labels.instance }}) 节点家目录剩余空间小于10%"
          description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: OutOfRootInodes
        expr: node_filesystem_files_free{mountpoint="/"} / node_filesystem_files{mountpoint="/"} * 100 < 10
        for: 5m
        labels:
          severity: warning
          team: node
        annotations:
          summary: "(instance {{ $labels.instance }}) 节点根目录剩余inode小于10%"
          description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: OutOfHomeInodes
        expr: node_filesystem_files_free{mountpoint="/home"} / node_filesystem_files{mountpoint="/home"} * 100 < 10
        for: 5m
        labels:
          severity: warning
          team: node
        annotations:
          summary: "(instance {{ $labels.instance }}) 节点家目录剩余inode小于10%"
          description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
    rule_files:
      - /etc/prometheus/prometheus.rules
    alerting:
      alertmanagers:
      - scheme: http
        static_configs:
        - targets: 
          - alertmanager-inner.monitoring.svc:9093
          # - 192.168.220.112:30002
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
      - job_name: 'kubernetes-nodes'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics
      
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
      
      - job_name: 'kube-state-metrics'
        static_configs:
          - targets: ['kube-state-metrics.kube-system.svc.cluster.local:9090']

      - job_name: 'kubernetes-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name

      - job_name: 'web_status'
        scrape_interval: 5s
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
            - targets:
              - https://www.baidu.com/
              - http://192.168.220.112:5000/health
        relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target   
            - source_labels: [__param_target]
              target_label: instance         
            - target_label: __address__
              replacement: blackbox-exporter-inner.monitoring.svc:9115

      - job_name: "port_status"
        scrape_interval: 5s
        metrics_path: /probe
        params:
          module: [tcp_connect]
        static_configs:
            - targets: [ '192.168.220.111:30102' ]
        relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target
            - source_labels: [__param_target]
              target_label: instance
            - target_label: __address__
              replacement: blackbox-exporter-inner.monitoring.svc:9115
---        
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      nodeSelector:
        kubernetes.io/hostname: knode2
      containers:
        - name: prometheus
          image: prom/prometheus:v2.16.0
          imagePullPolicy: IfNotPresent
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/              
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-storage-volume
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9090'
spec:
  selector: 
    app: prometheus-server
  type: NodePort  
  ports:
    - port: 8080
      targetPort: 9090 
      nodePort: 30000
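
To check target health (for example the down kube-state-metrics target) without opening the web UI, the Prometheus HTTP API can be queried directly. A minimal sketch, assuming any node IP plus the NodePort 30000 from the Service above; /api/v1/targets is a standard Prometheus 2.x endpoint:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Only the fields we need from the /api/v1/targets response.
type targetsResponse struct {
	Data struct {
		ActiveTargets []struct {
			Labels    map[string]string `json:"labels"`
			Health    string            `json:"health"`
			LastError string            `json:"lastError"`
		} `json:"activeTargets"`
	} `json:"data"`
}

func main() {
	// Assumed address: node IP + prometheus-service NodePort.
	resp, err := http.Get("http://192.168.220.112:30000/api/v1/targets")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var tr targetsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
		log.Fatal(err)
	}
	for _, t := range tr.Data.ActiveTargets {
		fmt.Printf("%-30s %-8s %s\n", t.Labels["job"], t.Health, t.LastError)
	}
}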

Grafana

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      name: grafana
      annotations:
        prometheus.io/scrape: "true"
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana
          imagePullPolicy: IfNotPresent
          ports:
          - name: grafana
            containerPort: 3000
          env:
            # The following env variables set up basic auth with the default admin user and admin password.
            - name: GF_AUTH_BASIC_ENABLED
              value: "true"
            - name: GF_AUTH_ANONYMOUS_ENABLED
              value: "false"
            # - name: GF_AUTH_ANONYMOUS_ORG_ROLE
            #   value: Admin
            # does not really work, because of template variables in exported dashboards:
            # - name: GF_DASHBOARDS_JSON_ENABLED
            #   value: "true"
          readinessProbe:
            httpGet:
              path: /login
              port: 3000
            # initialDelaySeconds: 30
            # timeoutSeconds: 1
          volumeMounts:
            - mountPath: /var/lib/grafana
              name: grafana-storage
      volumes:
        - name: grafana-storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30001

After it starts, add Prometheus as a data source in Grafana.

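The data source can also be created through Grafana's HTTP API instead of the UI. A minimal sketch, assuming the default admin/admin credentials and the NodePort 30001 from the Service above; the in-cluster URL uses the prometheus-service port 8080, which forwards to 9090:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// "access: proxy" makes Grafana reach Prometheus server-side,
	// so the in-cluster Service address works.
	ds := map[string]interface{}{
		"name":      "prometheus",
		"type":      "prometheus",
		"url":       "http://prometheus-service.monitoring.svc:8080",
		"access":    "proxy",
		"isDefault": true,
	}
	body, _ := json.Marshal(ds)

	// Assumed address and credentials: NodePort 30001, default admin/admin.
	req, _ := http.NewRequest("POST", "http://192.168.220.112:30001/api/datasources", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth("admin", "admin")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("create datasource:", resp.Status)
}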

Node-exporter

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: node-exporter
  name: node-exporter-inner
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: scrape
    port: 9100
    protocol: TCP
  selector:
    app: node-exporter
  type: ClusterIP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      hostIPC: true
      securityContext:
        runAsUser: 0
      containers:
      - image: prom/node-exporter
        imagePullPolicy: IfNotPresent 
        name: node-exporter
        volumeMounts:
          - mountPath: /run/systemd/private
            name: systemd-socket
            readOnly: true
        args:
          - "--collector.systemd"
          - "--collector.systemd.unit-whitelist=(docker|ssh|rsyslog|kubelet).service"
        ports:
          - containerPort: 9100
            hostPort: 9100
            name: scrape
        livenessProbe:
          httpGet:
            path: /metrics
            port: 9100
          initialDelaySeconds: 30
          timeoutSeconds: 10
          periodSeconds: 1
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: 9100
          initialDelaySeconds: 10
          timeoutSeconds: 10
          periodSeconds: 2
      volumes:
        - hostPath:  
            path: /run/systemd/private
          name: systemd-socket

Alertmanager

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-conf
  namespace: monitoring
data:
  config.yml: |-
    global:
      # Declare an alert resolved if it has not been updated for this long
      resolve_timeout: 1m
      # SMTP settings for email notifications
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'sadfds@163.com'
      smtp_auth_username: 'sadfds@163.com'
      smtp_auth_password: 'sadfds'
      # smtp_hello: '163.com'
      smtp_require_tls: false
    # The root route that every incoming alert enters; it defines how alerts are dispatched
    route:
      # Incoming alerts are regrouped by these labels; for example, alerts that all
      # carry cluster=A and alertname=LatencyHigh are batched into a single group.
      group_by: ['alertname']

      # After the first alert of a group fires, wait at least group_wait before sending,
      # so alerts of the same group raised in that window go out in one notification.
      group_wait: 2s

      # Once the first notification for a group has been sent, wait group_interval before notifying about new alerts in that group.
      group_interval: 1s

      # If a notification has already been sent, wait repeat_interval before
      # sending it again, to cut down on duplicates.
      repeat_interval: 10s

      # Default receiver: alerts not matched by any sub-route are sent here
      receiver: 'sms'

      # All of the properties above are inherited by the sub-routes.
      routes:
      - receiver: 'sms'
        group_wait: 1s
        match:
          team: node
    receivers:
    - name: 'sms'
      webhook_configs:
      - url: 'http://192.168.220.112:5000/'
        send_resolved: false
    # - name: 'email'
    #   email_configs:
    #   - to: '123@qq.com'
    #     send_resolved: true
---        
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-server
  template:
    metadata:
      labels:
        app: alertmanager-server
    spec:
      nodeSelector:
        kubernetes.io/hostname: knode2
      containers:        
        - name: alertmanager
          image: prom/alertmanager:v0.20.0
          imagePullPolicy: IfNotPresent
          args:
          - "--config.file=/etc/alertmanager/config.yml"
          - "--storage.path=/alertmanager/data"
          ports:
          - containerPort: 9093
            name: http
          volumeMounts:
          - mountPath: "/etc/alertmanager"
            name: alertmanager-config-volume
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 100m
              memory: 256Mi
      volumes:
        - name: alertmanager-config-volume
          configMap:
            name: alertmanager-conf
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-service
  namespace: monitoring
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9093'
spec:
  selector: 
    app: alertmanager-server
  type: NodePort  
  ports:
    - port: 9093
      targetPort: 9093 
      nodePort: 30002
---          
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-inner
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 9093
  selector:
    app: alertmanager-server
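
To exercise the route and the webhook receiver without waiting for a real alert, a hand-crafted alert can be POSTed to Alertmanager. A minimal sketch, assuming any node IP plus the NodePort 30002 from the Service above; /api/v2/alerts is part of Alertmanager's standard API in v0.20:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// A fake alert whose labels match the "team: node" sub-route above,
	// so it should be delivered to the sms webhook receiver.
	alerts := []map[string]interface{}{
		{
			"labels": map[string]string{
				"alertname": "ManualTest",
				"team":      "node",
				"severity":  "warning",
				"instance":  "test-instance",
			},
			"annotations": map[string]string{
				"summary": "manually injected test alert",
			},
			"startsAt": time.Now().Format(time.RFC3339),
		},
	}
	body, _ := json.Marshal(alerts)

	// Assumed address: node IP + alertmanager-service NodePort.
	resp, err := http.Post("http://192.168.220.112:30002/api/v2/alerts",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("post alert:", resp.Status)
}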

Notes

  • The labels defined in prometheus.rules (for example team: node) are what Alertmanager's routes match on; the annotations describe the alert and usually make up the body of the notification.
  • An alert on Prometheus's Alerts page is in one of three states:
    • inactive: the alert condition is not met
    • pending: the expression is true, but the for duration has not yet elapsed
    • firing: the alert has been sent to Alertmanager and stays in this state until it resolves
  • I originally read send_resolved as "whether to send a notification when an alert recovers", but in practice it controls whether notifications are still sent after the alert has recovered, so it should be set to false here; otherwise notifications keep arriving after recovery.
  • A webhook receiver lets you customize the delivery channel: Alertmanager POSTs the notification to the url configured under webhook_configs, and a small service of your own receives it and forwards it by WeChat, email, DingTalk, SMS, or anything else.
  • The silence feature built into the Alertmanager web UI is fairly limited: it can only silence by label match for a fixed period. A custom receiver allows more flexible silencing, for example treating a certain time slot every day as a deployment window; see the sketch after the example receiver below.

I wrote a simple notification receiver in Go that just prints out the notifications it receives:

package main

import (
	"encoding/json"
	"fmt"
	"github.com/julienschmidt/httprouter"
	"io/ioutil"
	"log"
	"net/http"
)

type Msg struct {
	Tel []string `json:"tel"`
	Msg string   `json:"msg"`
}

type Alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    string            `json:"startsAt"`
	EndsAt      string            `json:"endsAt"`
}

type Alerts struct {
	Version string  `json:"version"`
	Status  string  `json:"status"`
	Alerts  []Alert `json:"alerts"`
}

func homePage(w http.ResponseWriter, r *http.Request, _ httprouter.Params) {
	reqBody, _ := ioutil.ReadAll(r.Body)
	var alerts Alerts
	if err := json.Unmarshal(reqBody, &alerts); err != nil {
		// Don't kill the whole receiver on a malformed payload; reject the request instead.
		log.Println("unmarshal request error:", err)
		http.Error(w, "bad request body", http.StatusBadRequest)
		return
	}

	log.Println(alerts)
	_, _ = fmt.Fprintln(w, alerts)
}

func healthCheck(w http.ResponseWriter, _ *http.Request, _ httprouter.Params) {

	health := map[string]string{
		"status": "up",
	}
	_ = json.NewEncoder(w).Encode(health)
}

func handleRequests() {
	router := httprouter.New()

	router.POST("/", homePage)
	router.GET("/health", healthCheck)
	log.Fatal(http.ListenAndServe(":5000", router))
}

func main() {
	handleRequests()
}
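
Building on the receiver above, the flexible silence mentioned in the notes can live in the handler itself: simply drop notifications that arrive inside a deployment window. A minimal sketch with a made-up 02:00-04:00 local-time window and plain net/http; the window and the forwarding stub are illustrative only:

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// silenced reports whether t falls inside the example deployment window (02:00-04:00 local time).
func silenced(t time.Time) bool {
	h := t.Hour()
	return h >= 2 && h < 4
}

func alertHandler(w http.ResponseWriter, r *http.Request) {
	var payload struct {
		Alerts []struct {
			Labels map[string]string `json:"labels"`
		} `json:"alerts"`
	}
	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	if silenced(time.Now()) {
		log.Printf("silenced %d alert(s) during deployment window", len(payload.Alerts))
		return
	}
	for _, a := range payload.Alerts {
		// Replace this log line with the real SMS / WeChat / DingTalk call.
		log.Printf("notify: %s on %s", a.Labels["alertname"], a.Labels["instance"])
	}
}

func main() {
	http.HandleFunc("/", alertHandler)
	log.Fatal(http.ListenAndServe(":5000", nil))
}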

blackboxexporter_674">blackbox-exporter

apiVersion: apps/v1
kind: Deployment
metadata:
  name: blackbox-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      nodeSelector:
        kubernetes.io/hostname: knode2
      containers:
        - name: blackbox-exporter
          image: prom/blackbox-exporter:v0.16.0
          imagePullPolicy: IfNotPresent
          ports:
          - containerPort: 9115
            name: http
---
apiVersion: v1
kind: Service
metadata:
  name: blackbox-exporter-inner
  namespace: monitoring
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9115'
spec:
  selector:
    app: blackbox-exporter
  clusterIP: None
  ports:
  - name: http
    port: 9115 
