K8s Monitoring with Prometheus

When monitoring Kubernetes (K8s) with Prometheus, there are a number of key metrics and parameters that usually need to be watched to keep the cluster healthy and performant.

1. API Server

Metric  Meaning
apiserver_request_duration_seconds  Latency of API requests
apiserver_request_total  Total number of API requests, broken down by success and failure
apiserver_dropped_requests_total  Number of dropped requests
apiserver_long_running_queries  Long-running queries

2. Controller Manager

Metric  Meaning
controller_manager_runs_total  Number of controller manager runs
controller_manager_duration_seconds  Time taken by controller manager runs

3. Scheduler

Metric  Meaning
scheduler_e2e_duration_seconds  End-to-end scheduling latency
scheduler_queue_latency_seconds  Scheduling queue latency

4. Node Metrics

Metric  Meaning
node_cpu_seconds_total  Node CPU time consumed
node_memory_bytes  Node memory usage
node_disk_io_time_seconds_total  Node disk I/O time
node_network_receive_bytes_total  Bytes received over the network
node_network_transmit_bytes_total  Bytes transmitted over the network

5. Kubelet

Metric  Meaning
kubelet_running_pods  Number of pods currently running
kubelet_network_receive_bytes_total/kubelet_network_transmit_bytes_total  Kubelet network traffic
kubelet_cpu_usage_seconds_total  CPU time used by the kubelet
kubelet_memory_usage_bytes  Memory used by the kubelet

6. Kube Proxy

Metric  Meaning
kubeproxy_connections_open  Number of currently open connections
kubeproxy_errors_total  Total number of proxy errors

7. Pod

Metric  Meaning
container_cpu_usage_seconds_total  Container CPU time consumed
container_memory_usage_bytes  Container memory usage
container_memory_rss  Container resident set size
container_network_receive_bytes_total/container_network_transmit_bytes_total  Bytes received or transmitted by the container
container_fs_usage_bytes  Filesystem space used by the container
container_fs_inodes_usage  Filesystem inodes used by the container
kube_pod_container_status_restarts_total  Total number of container restarts in a Pod

8. Controller

Metric  Meaning
kube_deployment_status_replicas  Current number of replicas of a Deployment
kube_replica_set_status_replicas  Current number of replicas of a ReplicaSet
kube_pod_status_ready  Whether a Pod is in the Ready state

9. SVC/Node

Metric  Meaning
apiserver_request_total  Total number of requests made to the Kubernetes API server
kube_endpoint_address_available  Whether the endpoints of a Service are available
service_request_duration_seconds  Service request latency
kube_node_status_allocatable_cpu_cores  Allocatable CPU cores on a node
kube_node_status_allocatable_memory_bytes  Allocatable memory on a node
kube_pod_container_resource_requests_cpu_cores  CPU requested by a Pod
kube_pod_container_resource_requests_memory_bytes  Memory requested by a Pod
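
For illustration, a few typical PromQL expressions built on the metrics above (example queries added by the editor; label names assume the standard node_exporter and kube-state-metrics metric layouts):

# Average per-node CPU utilisation over the last 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Containers that restarted within the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

# 99th-percentile API server request latency, split by verb
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))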

Introduction to Prometheus

Prometheus is an open-source monitoring and alerting system designed for monitoring large-scale distributed systems. It was originally developed at SoundCloud and later became part of the CNCF (Cloud Native Computing Foundation). Prometheus is mainly used to collect, store, query and visualise metrics, and it can trigger alerts through user-defined rules. It is particularly popular in containerised environments and microservice architectures, and is widely used to monitor Kubernetes, Docker and cloud-native applications.

Core features of Prometheus

1. Data collection: Prometheus periodically pulls metrics over HTTP from the monitored applications or services. The data is stored as time series; each time series has a unique identifier consisting of a metric name and a set of labels.

2. Time series storage: Prometheus uses its own time series database (TSDB) to store the monitoring data, supporting efficient storage and querying of time series. Data is organised by timestamp with second-level precision and can be retained long term.

3. A powerful query language (PromQL): Prometheus provides PromQL (Prometheus Query Language), which lets users query the stored time series flexibly. PromQL supports aggregation, filtering, computation, slicing and more.

4. Service discovery: Prometheus supports automatic service discovery, so it can find and scrape targets on its own. Many discovery mechanisms are supported, such as Kubernetes, Consul, DNS and EC2. With Kubernetes, Prometheus automatically discovers Pods and Services, which makes it easy to integrate in containerised environments.

5. Alerting: users can define alerting rules based on PromQL expressions. When a metric exceeds the specified threshold, Prometheus sends the alert to Alertmanager (a dedicated alert-handling component), which delivers notifications via email, Slack, PagerDuty and other channels.

6. Grafana integration: the time series data produced by Prometheus can be visualised with Grafana, which offers a rich set of dashboard templates to display the collected data.
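
As a minimal sketch of the pull model described above (job name and target are hypothetical), a scrape job in prometheus.yml looks roughly like this:

scrape_configs:
  - job_name: 'my-app'                        # hypothetical job name
    scrape_interval: 15s                      # how often to pull /metrics
    metrics_path: /metrics                    # default metrics path
    static_configs:
      - targets: ['my-app.default.svc:8080']  # hypothetical in-cluster target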

Prometheus architecture and workflow

Prometheus architecture

Core components

1. Prometheus Server: the most important component of the Prometheus ecosystem; it scrapes and stores time series data and provides querying as well as configuration management for alerting rules.

2. Exporters: collect the monitoring data. Host metrics, for example, can be collected with node_exporter, and MySQL metrics with mysqld_exporter. The exporter exposes an endpoint such as /metrics from which Prometheus scrapes the data. Common exporters include the Node Exporter (host-level metrics such as CPU, memory, disk and network), kube-state-metrics (state metrics for Kubernetes resources such as Pods, nodes and Deployments) and the Blackbox Exporter (availability checks for HTTP, HTTPS, DNS and other services).

3. Alertmanager: receives alerts from Prometheus and takes the corresponding actions (sending email, Slack notifications, webhooks and so on). It is responsible for alert inhibition, grouping and routing.

4. Service Discovery: Prometheus can discover the targets to monitor automatically through service discovery, which is especially important in dynamic environments such as Kubernetes, Docker containers and the cloud.

5. Push Gateway: Prometheus itself pulls data, but some metrics are short-lived and could be gone before the next scrape. The Push Gateway solves this: clients push their data to it, and Prometheus then pulls that data from the Push Gateway.

6. PromQL: strictly speaking not a component but the language used to query Prometheus data, just as SQL is used for databases and LogQL for Loki.

7. Grafana: an open-source data visualisation tool, usually paired with Prometheus to display the collected time series data in rich charts and dashboards.

Prometheus workflow

1. Data collection: Prometheus polls targets (Kubernetes Pods, nodes, applications) over HTTP to pull their metrics. Monitored services usually expose an HTTP endpoint such as /metrics, which Prometheus visits periodically.

2. Storage: the collected metrics are stored in the local time series database, indexed by timestamp. Each metric is kept as a time series distinguished by labels such as pod, service and namespace.

3. Querying: PromQL allows flexible queries over the stored time series, for example CPU usage or memory consumption, with aggregation and further computation.

4. Alerting: based on PromQL expressions, users define alerting rules (for example, CPU usage above 80%). When a rule condition is met, Prometheus sends the alert to Alertmanager; a sample rule is sketched after this list.

5. Visualisation: Prometheus has some basic graphing built in, but it is usually combined with Grafana, which provides far more powerful and user-friendly visualisation, with charts, dashboards and reports built on the Prometheus data source.
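
As mentioned in step 4, alerting rules are just PromQL expressions with a threshold and duration. A minimal sketch of a rule file (group name and threshold are illustrative, not from this stack's defaults):

groups:
  - name: node-alerts                 # hypothetical rule group
    rules:
      - alert: HighNodeCPU
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 5m                       # must stay above the threshold for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"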

Installing Prometheus

Prometheus can be installed in several ways: from binaries, as containers, with Helm, with the Prometheus Operator, or with the kube-prometheus stack. This document installs it with the kube-prometheus stack.
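
For comparison, the Helm route would look roughly like the following, using the prometheus-community chart (not used in this document):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring --create-namespace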

Visit the kube-prometheus project page and check which stack release matches your Kubernetes cluster; the author's cluster runs Kubernetes 1.30.x.

https://github.com/prometheus-operator/kube-prometheus/

compatibility

Clone the git branch for the matching release (note: for the network problems here and the image pull problems later, configure a proxy as described in the author's log collection or cloud-native storage chapters).

[root@master-01 ~]# git clone -b release-0.14 https://github.com/prometheus-operator/kube-prometheus.git
Cloning into 'kube-prometheus'...
remote: Enumerating objects: 20499, done.
remote: Counting objects: 100% (3921/3921), done.
remote: Compressing objects: 100% (285/285), done.
remote: Total 20499 (delta 3738), reused 3714 (delta 3617), pack-reused 16578 (from 1)
Receiving objects: 100% (20499/20499), 12.43 MiB | 1.20 MiB/s, done.
Resolving deltas: 100% (14112/14112), done.

Switch to the manifests directory, which contains all the resources of the Prometheus stack; the setup subdirectory holds the Prometheus CRD definitions. Create the resources under setup first; these CRDs allow Kubernetes to handle the various monitoring and alerting resources.

[root@master-01 manifests]# kubectl create -f setup/
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheusagents.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/scrapeconfigs.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com created
namespace/monitoring created

The following approach also works: create the namespace and CRDs, wait until they become available, and only then create the remaining resources.

[root@master-01 manifests]# kubectl wait \
> --for condition=Established \
> --all CustomResourceDefinition \
> --namespace=monitoring
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/bgpfilters.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/caliconodestatuses.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/ipreservations.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/kubecontrollersconfigurations.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org condition met
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/prometheusagents.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/redisclusters.cache.tongdun.net condition met
customresourcedefinition.apiextensions.k8s.io/redisstandbies.cache.tongdun.net condition met
customresourcedefinition.apiextensions.k8s.io/scrapeconfigs.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com condition met

Once the CRDs are established, create the remaining resources in the manifests directory (a large number of resources are created here; with newer releases the Prometheus Operator is also created in this step, whereas older releases created it in the previous one). Before creating them, readers can configure persistent storage, change Service types, or adjust the replica counts of the corresponding workloads.

[root@master-01 kube-prometheus]# kubectl apply -f manifests/
alertmanager.monitoring.coreos.com/main created
networkpolicy.networking.k8s.io/alertmanager-main created
poddisruptionbudget.policy/alertmanager-main created
prometheusrule.monitoring.coreos.com/alertmanager-main-rules created
secret/alertmanager-main created
service/alertmanager-main created
serviceaccount/alertmanager-main created
servicemonitor.monitoring.coreos.com/alertmanager-main created
clusterrole.rbac.authorization.k8s.io/blackbox-exporter created
clusterrolebinding.rbac.authorization.k8s.io/blackbox-exporter created
configmap/blackbox-exporter-configuration created
deployment.apps/blackbox-exporter created
networkpolicy.networking.k8s.io/blackbox-exporter created
service/blackbox-exporter created
serviceaccount/blackbox-exporter created
servicemonitor.monitoring.coreos.com/blackbox-exporter created
secret/grafana-config created
secret/grafana-datasources created
configmap/grafana-dashboard-alertmanager-overview created
configmap/grafana-dashboard-apiserver created
configmap/grafana-dashboard-cluster-total created
configmap/grafana-dashboard-controller-manager created
configmap/grafana-dashboard-grafana-overview created
configmap/grafana-dashboard-k8s-resources-cluster created
configmap/grafana-dashboard-k8s-resources-multicluster created
configmap/grafana-dashboard-k8s-resources-namespace created
configmap/grafana-dashboard-k8s-resources-node created
configmap/grafana-dashboard-k8s-resources-pod created
configmap/grafana-dashboard-k8s-resources-workload created
configmap/grafana-dashboard-k8s-resources-workloads-namespace created
configmap/grafana-dashboard-kubelet created
configmap/grafana-dashboard-namespace-by-pod created
configmap/grafana-dashboard-namespace-by-workload created
configmap/grafana-dashboard-node-cluster-rsrc-use created
configmap/grafana-dashboard-node-rsrc-use created
configmap/grafana-dashboard-nodes-darwin created
configmap/grafana-dashboard-nodes created
configmap/grafana-dashboard-persistentvolumesusage created
configmap/grafana-dashboard-pod-total created
configmap/grafana-dashboard-prometheus-remote-write created
configmap/grafana-dashboard-prometheus created
configmap/grafana-dashboard-proxy created
configmap/grafana-dashboard-scheduler created
configmap/grafana-dashboard-workload-total created
configmap/grafana-dashboards created
deployment.apps/grafana created
networkpolicy.networking.k8s.io/grafana created
prometheusrule.monitoring.coreos.com/grafana-rules created
service/grafana created
serviceaccount/grafana created
servicemonitor.monitoring.coreos.com/grafana created
prometheusrule.monitoring.coreos.com/kube-prometheus-rules created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
networkpolicy.networking.k8s.io/kube-state-metrics created
prometheusrule.monitoring.coreos.com/kube-state-metrics-rules created
service/kube-state-metrics created
serviceaccount/kube-state-metrics created
servicemonitor.monitoring.coreos.com/kube-state-metrics created
prometheusrule.monitoring.coreos.com/kubernetes-monitoring-rules created
servicemonitor.monitoring.coreos.com/kube-apiserver created
servicemonitor.monitoring.coreos.com/coredns created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
servicemonitor.monitoring.coreos.com/kube-scheduler created
servicemonitor.monitoring.coreos.com/kubelet created
clusterrole.rbac.authorization.k8s.io/node-exporter created
clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
daemonset.apps/node-exporter created
networkpolicy.networking.k8s.io/node-exporter created
prometheusrule.monitoring.coreos.com/node-exporter-rules created
service/node-exporter created
serviceaccount/node-exporter created
servicemonitor.monitoring.coreos.com/node-exporter created
clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
networkpolicy.networking.k8s.io/prometheus-k8s created
poddisruptionbudget.policy/prometheus-k8s created
prometheus.monitoring.coreos.com/k8s created
prometheusrule.monitoring.coreos.com/prometheus-k8s-prometheus-rules created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s-config created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-k8s created
Warning: resource apiservices/v1beta1.metrics.k8s.io is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io configured
clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
Warning: resource clusterroles/system:aggregated-metrics-reader is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader configured
clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
configmap/adapter-config created
deployment.apps/prometheus-adapter created
networkpolicy.networking.k8s.io/prometheus-adapter created
poddisruptionbudget.policy/prometheus-adapter created
rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
service/prometheus-adapter created
serviceaccount/prometheus-adapter created
servicemonitor.monitoring.coreos.com/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/prometheus-operator created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
deployment.apps/prometheus-operator created
networkpolicy.networking.k8s.io/prometheus-operator created
prometheusrule.monitoring.coreos.com/prometheus-operator-rules created
service/prometheus-operator created
serviceaccount/prometheus-operator created
servicemonitor.monitoring.coreos.com/prometheus-operator created

Check the Pod status and make sure every container is running properly.

[root@master-01 manifests]# kubectl get pods -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   0          14h
alertmanager-main-1                    2/2     Running   0          14h
alertmanager-main-2                    2/2     Running   0          14h
blackbox-exporter-74465f5fcb-z7q8z     3/3     Running   0          14h
grafana-b4bcd98cc-t6vnm                1/1     Running   0          14h
kube-state-metrics-59dcf5dbb-645v6     3/3     Running   0          14h
node-exporter-4vsd9                    2/2     Running   0          14h
node-exporter-cfng2                    2/2     Running   0          14h
node-exporter-m2lp5                    2/2     Running   0          14h
node-exporter-pd4v9                    2/2     Running   0          14h
node-exporter-x9sqm                    2/2     Running   0          14h
prometheus-adapter-5794d7d9f5-bdrh9    1/1     Running   0          14h
prometheus-adapter-5794d7d9f5-tkwz2    1/1     Running   0          14h
prometheus-k8s-0                       2/2     Running   0          14h
prometheus-k8s-1                       2/2     Running   0          14h
prometheus-operator-6f948f56f8-tft4h   2/2     Running   0          14h

Change the Grafana Service type to NodePort.

[root@master-01 manifests]# kubectl edit -n monitoring service grafana
service/grafana edited
[root@master-01 manifests]# kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-main       ClusterIP   10.96.85.152    <none>        9093/TCP,8080/TCP            14h
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   14h
blackbox-exporter       ClusterIP   10.96.200.93    <none>        9115/TCP,19115/TCP           14h
grafana                 NodePort    10.96.214.185   <none>        3000:30584/TCP               14h
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP            14h
node-exporter           ClusterIP   None            <none>        9100/TCP                     14h
prometheus-adapter      ClusterIP   10.96.28.210    <none>        443/TCP                      14h
prometheus-k8s          ClusterIP   10.96.204.197   <none>        9090/TCP,8080/TCP            14h
prometheus-operated     ClusterIP   None            <none>        9090/TCP                     14h
prometheus-operator     ClusterIP   None            <none>        8443/TCP                     14h

The default Grafana credentials are admin/admin. Change the Prometheus Service to NodePort in the same way.

[root@master-01 manifests]# kubectl edit -n monitoring service prometheus-k8s
service/prometheus-k8s edited
[root@master-01 manifests]# kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
alertmanager-main       ClusterIP   10.96.85.152    <none>        9093/TCP,8080/TCP               14h
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP      14h
blackbox-exporter       ClusterIP   10.96.200.93    <none>        9115/TCP,19115/TCP              14h
grafana                 NodePort    10.96.214.185   <none>        3000:30584/TCP                  14h
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP               14h
node-exporter           ClusterIP   None            <none>        9100/TCP                        14h
prometheus-adapter      ClusterIP   10.96.28.210    <none>        443/TCP                         14h
prometheus-k8s          NodePort    10.96.204.197   <none>        9090:32675/TCP,8080:32446/TCP   14h
prometheus-operated     ClusterIP   None            <none>        9090/TCP                        14h
prometheus-operator     ClusterIP   None            <none>        8443/TCP                        14h
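
For readers who prefer a non-interactive change, the same result can be achieved with kubectl patch instead of kubectl edit:

kubectl patch svc grafana -n monitoring -p '{"spec":{"type":"NodePort"}}'
kubectl patch svc prometheus-k8s -n monitoring -p '{"spec":{"type":"NodePort"}}'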

Access the UIs from a browser at NodeIP:NodePort (note: for security the upstream manifests configure NetworkPolicies for these Pods, so direct access will not work until the network policies are removed).

[root@master-01 manifests]# kubectl delete networkpolicy --all -n monitoring
networkpolicy.networking.k8s.io "alertmanager-main" deleted
networkpolicy.networking.k8s.io "blackbox-exporter" deleted
networkpolicy.networking.k8s.io "grafana" deleted
networkpolicy.networking.k8s.io "kube-state-metrics" deleted
networkpolicy.networking.k8s.io "node-exporter" deleted
networkpolicy.networking.k8s.io "prometheus-adapter" deleted
networkpolicy.networking.k8s.io "prometheus-k8s" deleted
networkpolicy.networking.k8s.io "prometheus-operator" deleted
[root@master-01 manifests]# telnet 192.168.132.236 30277
Trying 192.168.132.236...
Connected to 192.168.132.236.
Escape character is '^]'.

The Grafana UI asks for a password change after the first login. From the main page, click Home → Explore → Metrics to browse the cluster resource charts.

grafana

main UI

The Prometheus UI is reached the same way (http://192.168.132.236:32675/). Note: some alerts fire right after installation and can be ignored for now.

Where Prometheus metrics come from

Non-cloud-native applications are generally monitored through an exporter, while cloud-native applications usually expose their own /metrics endpoint for Prometheus to pull.

For example, port 9100, on which node-exporter listens, is exactly where the scraped data comes from.

[root@master-01 manifests]# ps -aux | grep node_exporter
nfsnobo+  16629  0.8  0.3 1242204 18672 ?       Ssl  Nov24  13:14 /bin/node_exporter --web.listen-address=127.0.0.1:9100 --path.sysfs=/host/sys --path.rootfs=/host/root --path.udev.data=/host/root/run/udev/data --no-collector.wifi --no-collector.hwmon --no-collector.btrfs --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/k3s/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/) --collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15})$ --collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15})$
root      95083  0.0  0.0 112828  2304 pts/0    S+   23:24   0:00 grep --color=auto node_exporter
[root@master-01 manifests]# ss -lntp | grep 9100
LISTEN     0      16384  192.168.132.169:9100                     *:*                   users:(("kube-rbac-proxy",pid=17188,fd=3))
LISTEN     0      16384  127.0.0.1:9100                     *:*                   users:(("node_exporter",pid=16629,fd=3))
[root@master-01 manifests]# curl 127.0.0.1:9100/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.0034e-05
go_gc_duration_seconds{quantile="0.25"} 4.7561e-05
go_gc_duration_seconds{quantile="0.5"} 5.7244e-05
go_gc_duration_seconds{quantile="0.75"} 6.7569e-05
go_gc_duration_seconds{quantile="1"} 0.078837684
go_gc_duration_seconds_sum 4.79348581
go_gc_duration_seconds_count 28174
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8

After adding the data source in Grafana and importing a dashboard template, these metrics can be visualised.

Grafana data display

Commonly used exporters:

Type  Exporters
Databases  MySQL Exporter, Redis Exporter, MongoDB Exporter, MSSQL Exporter
Hardware  Apcupsd Exporter, IoT Edison Exporter, IPMI Exporter, Node Exporter
Message queues  Beanstalkd Exporter, Kafka Exporter, NSQ Exporter, RabbitMQ Exporter
Storage  Ceph Exporter, Gluster Exporter, HDFS Exporter, ScaleIO Exporter
HTTP services  Apache Exporter, HAProxy Exporter, Nginx Exporter
API services  AWS ECS Exporter, Docker Cloud Exporter, Docker Hub Exporter, GitHub Exporter
Logging  Fluentd Exporter, Grok Exporter
Monitoring systems  Collectd Exporter, Graphite Exporter, InfluxDB Exporter, Nagios Exporter, SNMP Exporter
Others  Blackbox Exporter, JIRA Exporter, Jenkins Exporter, Confluence Exporter

Cloud-native etcd monitoring

ServiceMonitor is a Custom Resource (CR) provided by the Prometheus Operator that defines how Prometheus scrapes the metrics exposed by a Kubernetes Service. Its main job is to discover and scrape the metrics exposed behind a Service.

A ServiceMonitor selects the target Service and port through spec.selector and spec.endpoints, and supports further settings such as interval (scrape frequency), path (metrics endpoint path) and scheme (scrape protocol). Once a ServiceMonitor is in place, the Prometheus Operator periodically queries the Kubernetes API and automatically discovers and scrapes the matching Services according to the rules it defines.

Test access to the etcd metrics endpoint

[root@master-01 etcd]# curl -s --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://192.168.132.169:2379/metrics -k | tail -10
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 2
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

Create a Service and Endpoints object for etcd.

[root@master-01 ~]# vim etcd-svc.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.132.169
  - ip: 192.168.132.170
  - ip: 192.168.132.171
  ports:
  - name: https-metrics
    port: 2379 # etcd port
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
    targetPort: 2379
  type: ClusterIP

Create the resources and check the resulting ClusterIP (note: this Service lives in the kube-system namespace; if you have a proxy configured, exclude the ClusterIP from it, otherwise the monitor cannot scrape the data).

[root@master-01 ~]# kubectl create -f etcd-svc.yaml
endpoints/etcd-prom created
service/etcd-prom created
[root@master-01 manifests]# kubectl get svc -n kube-system
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                        AGE
etcd-prom        ClusterIP   10.96.116.30   <none>        2379/TCP                       5m48s
[root@master-01 manifests]# kubectl get endpoints -n kube-system
NAME             ENDPOINTS                                                                        AGE
etcd-prom        192.168.132.169:2379,192.168.132.170:2379,192.168.132.171:2379                   6m2s

Test access through the Service ClusterIP (note: only continue once this step succeeds, otherwise no data will be collected).

[root@master-01 ~]# curl -s --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://10.96.116.30:2379/metrics -k | tail -2
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

Create the etcd Secret (note: use the certificate paths that apply to your environment).

[root@master-01 ~]# kubectl create secret generic etcd-ssl --from-file=/etc/kubernetes/pki/etcd/ca.crt --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/server.key -n monitoring
secret/etcd-ssl created

Mount the Secret into the Prometheus containers.

[root@master-01 ~]# kubectl edit prometheus -n monitoring k8s
spec:
...output omitted...
  secrets:
  - etcd-ssl
...output omitted...
prometheus.monitoring.coreos.com/k8s edited

Once the Secret is mounted, the prometheus Pods can be seen restarting.

[root@master-01 ~]# kubectl get pods -n monitoring
NAME                                   READY   STATUS     RESTARTS   AGE
alertmanager-main-0                    2/2     Running    0          38h
alertmanager-main-1                    2/2     Running    0          38h
alertmanager-main-2                    2/2     Running    0          38h
blackbox-exporter-74465f5fcb-z7q8z     3/3     Running    0          38h
grafana-b4bcd98cc-t6vnm                1/1     Running    0          38h
kube-state-metrics-59dcf5dbb-645v6     3/3     Running    0          38h
node-exporter-4vsd9                    2/2     Running    0          38h
node-exporter-cfng2                    2/2     Running    0          38h
node-exporter-m2lp5                    2/2     Running    0          38h
node-exporter-pd4v9                    2/2     Running    0          38h
node-exporter-x9sqm                    2/2     Running    0          38h
prometheus-adapter-5794d7d9f5-bdrh9    1/1     Running    0          38h
prometheus-adapter-5794d7d9f5-tkwz2    1/1     Running    0          38h
prometheus-k8s-0                       2/2     Running    0          38h
prometheus-k8s-1                       0/2     Init:0/1   0          103s
prometheus-operator-6f948f56f8-tft4h   2/2     Running    0          38h

Check that the certificates are mounted inside the container.

[root@master-01 ~]# kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/secrets/etcd-ssl/
ca.crt
server.crt
server.key

Create the etcd ServiceMonitor.

[root@master-01 ~]# cat servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd
  namespace: monitoring
  labels:
    app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - interval: 30s  # scrape interval
    port: https-metrics  # must match Service.spec.ports.name
    scheme: https  # scrape scheme
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-ssl/ca.crt  # CA certificate path
      certFile: /etc/prometheus/secrets/etcd-ssl/server.crt   # client certificate path
      keyFile: /etc/prometheus/secrets/etcd-ssl/server.key  # client key path
      insecureSkipVerify: true  # skip certificate verification
  selector:
    matchLabels:
      app: etcd-prom  # must match the Service labels
  namespaceSelector:
    matchNames:
    - kube-system  # look for the Service in the kube-system namespace

Create the resource and check its status.

[root@master-01 ~]# kubectl create -f servicemonitor.yaml
servicemonitor.monitoring.coreos.com/etcd created
[root@master-01 ~]# kubectl get servicemonitors.monitoring.coreos.com -n monitoring etcd
NAME   AGE
etcd   5m1s

Log in to the Grafana UI, click Dashboards → New → New dashboard → Import dashboard, and enter the dashboard you want on the import page. The official dashboard catalogue is linked below.

https://grafana.com/grafana/dashboards/

import dashboard

After setting the dashboard name and choosing the Prometheus data source, click Import and, after a short wait, the etcd monitoring data appears.

import dashboard-2

Monitoring data

Monitoring non-cloud-native applications with an exporter

MySQL is used as the test case to demonstrate how to monitor a non-cloud-native application with an exporter.

Deploy the MySQL service, set a root password for it, and expose port 3306.

[root@master-01 ~]# kubectl create deploy mysql --image=registry.cn-beijing.aliyuncs.com/dotbalo/mysql:5.7.23
deployment.apps/mysql created
[root@master-01 ~]# kubectl set env deploy/mysql MYSQL_ROOT_PASSWORD=mysql
deployment.apps/mysql env updated
[root@master-01 ~]# kubectl expose deploy mysql --port 3306
service/mysql exposed

Check that the Pod and Service are healthy.

[root@master-01 ~]# kubectl get po -l app=mysql
NAME                     READY   STATUS    RESTARTS   AGE
mysql-7d6ff9c689-m5smn   1/1     Running   0          32s
[root@master-01 ~]# kubectl get svc -l app=mysql
NAME    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
mysql   ClusterIP   10.96.175.244   <none>        3306/TCP   22s

Log in to MySQL and create the user and privileges required by the exporter.

[root@master-01 ~]# kubectl exec -it mysql-7d6ff9c689-m5smn -- bash
root@mysql-7d6ff9c689-m5smn:/# mysql -uroot -pmysql
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.23 MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create user 'exporter'@'%' identified by 'exporter' with MAX_USER_CONNECTIONS 3;
Query OK, 0 rows affected (0.02 sec)
mysql> grant process,replication client,select on *.* to 'exporter'@'%';
Query OK, 0 rows affected (0.00 sec)

mysql> quit
Bye
root@mysql-7d6ff9c689-m5smn:/# exit
exit

Configure the MySQL Exporter to scrape the MySQL metrics (pay attention to DATA_SOURCE_NAME: replace exporter:exporter@(mysql.default:3306)/ with your own settings, in the form USERNAME:PASSWORD@(MYSQL_HOST_ADDRESS:MYSQL_PORT)/).

[root@master-01 ~]# vim mysql-exporter
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  template:
    metadata:
      labels:
        k8s-app: mysql-exporter
    spec:
      containers:
        - name: mysql-exporter
          image: registry.cn-beijing.aliyuncs.com/dotbalo/mysqld-exporter
          env:
            - name: DATA_SOURCE_NAME
              value: "exporter:exporter@(mysql.default:3306)/"
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 9104
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    k8s-app: mysql-exporter
spec:
  type: ClusterIP
  selector:
    k8s-app: mysql-exporter
  ports:
    - name: api
      port: 9104
      protocol: TCP
[root@master-01 ~]# kubectl create -f mysql-exporter
deployment.apps/mysql-exporter created
service/mysql-exporter created
[root@master-01 ~]# kubectl get -f mysql-exporter
NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mysql-exporter   1/1     1            1           21h

NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/mysql-exporter   ClusterIP   10.96.191.150   <none>        9104/TCP   21h

Test the exposed Service port (you can also simply curl ServiceIP:Port directly).

[root@master-01 ~]# kubectl port-forward -n monitoring svc/mysql-exporter 9104:9104
Forwarding from 127.0.0.1:9104 -> 9104
Forwarding from [::1]:9104 -> 9104
Handling connection for 9104

Open a new terminal and request 127.0.0.1:9104/metrics (the command above maps local port 9104 to port 9104 of the container).

[root@master-01 ~]# curl http://127.0.0.1:9104/metrics | tail -4
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0

Configure the MySQL ServiceMonitor and create the resource.

[root@master-01 ~]# vim mysqlmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    k8s-app: mysql-exporter
spec:
  jobLabel: k8s-app
  endpoints:
    - port: api
      interval: 30s
      scheme: http
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  namespaceSelector:
    matchNames:
      - monitoring

Once data collection has started, the Prometheus UI shows mysql-exporter as discovered (click Status → Service Discovery).

serviceMonitor/monitoring/mysql-exporter/0 (1 / 47 active targets)

Import a Grafana dashboard to visualise the monitoring data; a dashboard link is given below.

https://grafana.com/grafana/dashboards/14057-mysql/

MySQL data display

Prometheus cannot monitor kube-controller-manager and kube-scheduler

Symptom: no data is visible in the Prometheus UI, and the alerts for both components are in the Firing state.

alerts for both components

Cause: both components listen only on 127.0.0.1, so the ServiceMonitors find no matching Service for them.

The fix is as follows:

On every master node, edit the kube-controller-manager manifest and change the bind address to 0.0.0.0 (note: the paths of kubeadm and binary installations may differ; adjust to your environment).

[root@master-01 ~]# vim /etc/kubernetes/manifests/kube-controller-manager.yaml
...output omitted...
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=0.0.0.0
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-cidr=172.16.0.0/16
    - --cluster-name=kubernetes
...output omitted...

The files under /etc/kubernetes/manifests/ are the cluster's static Pod manifests; after editing them there is no need to restart anything manually, because the kubelet re-reads the files and restarts the containers automatically.

[root@master-01 kubernetes]# kubectl get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS        AGE
...output omitted...
kube-apiserver-master-03                  1/1     Running   21 (2d3h ago)   28d
kube-controller-manager-master-01         1/1     Running   0               5h23m
kube-controller-manager-master-02         1/1     Running   0               42s
kube-controller-manager-master-03         0/1     Pending   0               1s
kube-proxy-84trv                          1/1     Running   6 (2d3h ago)    28d
...output omitted...

After the change, look at the ServiceMonitor configuration: the kube-controller-manager ServiceMonitor matches the kube-system namespace, the label app.kubernetes.io/name: kube-controller-manager, and the Service port https-metrics.

[root@master-01 ~]# kubectl get servicemonitors.monitoring.coreos.com -n monitoring | egrep 'kube-controller-manager|kube-scheduler'
kube-controller-manager   2d19h
kube-scheduler            2d19h
[root@master-01 ~]# kubectl get servicemonitors.monitoring.coreos.com -n monitoring kube-controller-manager -oyaml | tail -20
      insecureSkipVerify: true
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 5s
    metricRelabelings:
    - action: drop
      regex: process_start_time_seconds
      sourceLabels:
      - __name__
    path: /metrics/slis
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-controller-manager

There is, however, no matching Service in the kube-system namespace yet.

[root@master-01 ~]# kubectl get service -n kube-system -l app.kubernetes.io/name=kube-controller-manager
No resources found in kube-system namespace.

Create the corresponding Service for kube-controller-manager; recent versions listen on port 10257 (note: the labels and port name must match the ServiceMonitor).

[root@master-01 ~]# vim controller-svc.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
subsets:
  - addresses:
      - ip: 192.168.132.169
      - ip: 192.168.132.170
      - ip: 192.168.132.171
    ports:
      - name: https-metrics
        port: 10257
        protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
spec:
  ports:
    - name: https-metrics
      port: 10257
      protocol: TCP
      targetPort: 10257
  sessionAffinity: None
  type: ClusterIP
[root@master-01 ~]# kubectl create -f controller-svc.yaml
endpoints/kube-controller-manager-prom created
service/kube-controller-manager-prom created
[root@master-01 ~]# kubectl get service -n kube-system -l app.kubernetes.io/name=kube-controller-manager
NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
kube-controller-manager-prom   ClusterIP   10.96.22.199   <none>        10257/TCP   5s

Note: older versions of kube-controller-manager served metrics over HTTP and could be tested directly, but newer versions use HTTPS, and a direct request returns a permission error or an empty response. One solution, described in the link below, is to create a ClusterRole that grants get on the metrics endpoint, bind it to a ServiceAccount with a ClusterRoleBinding, and then send the HTTPS request with that ServiceAccount's token in the Authorization header.

https://zhuanlan.zhihu.com/p/601741895
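
A minimal sketch of that workaround (all names here are illustrative and not part of the kube-prometheus manifests):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader                  # hypothetical name
rules:
  - nonResourceURLs: ["/metrics", "/metrics/slis"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics-reader
subjects:
  - kind: ServiceAccount
    name: default                       # hypothetical ServiceAccount
    namespace: kube-system

# Request the metrics endpoint with the ServiceAccount token (kubectl >= 1.24)
TOKEN=$(kubectl create token default -n kube-system)
curl -sk -H "Authorization: Bearer $TOKEN" https://192.168.132.169:10257/metrics | head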

After these steps, the controller-manager alert in Prometheus disappears and the Service Discovery page shows the controller-manager targets.

serviceMonitor/monitoring/kube-controller-manager/0 (3 / 28 active targets)
serviceMonitor/monitoring/kube-controller-manager/1 (3 / 28 active targets)

Clearing the kube-scheduler alert works much like controller-manager: on the master nodes, change the bind address in /etc/kubernetes/manifests/kube-scheduler.yaml to 0.0.0.0.

[root@master-01 ~]# vim /etc/kubernetes/manifests/kube-scheduler.yaml
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=0.0.0.0
    - --kubeconfig=/etc/kubernetes/scheduler.conf
...output omitted...

The static Pods restart automatically.

[root@master-01 ~]# kubectl get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS         AGE
kube-proxy-rjf4q                          1/1     Running   6 (2d19h ago)    29d
kube-scheduler-master-01                  1/1     Running   0                18s
kube-scheduler-master-02                  1/1     Running   0                21s
kube-scheduler-master-03                  1/1     Running   0                22s

Check the kube-scheduler ServiceMonitor and whether a matching Service exists; if not, create one (kube-scheduler listens on port 10259).

[root@master-01 ~]# ss -lntp | grep kube-scheduler
LISTEN     0      16384     [::]:10259                 [::]:*                   users:(("kube-scheduler",pid=67358,fd=3))
[root@master-01 ~]# kubectl get servicemonitors.monitoring.coreos.com -n monitoring kube-scheduler -oyaml  | tail -12
    path: /metrics/slis
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-scheduler
[root@master-01 ~]# vim scheduler-svc.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler-prom
  namespace: kube-system
subsets:
  - addresses:
      - ip: 192.168.132.169
      - ip: 192.168.132.170
      - ip: 192.168.132.171
    ports:
      - name: https-metrics
        port: 10259
        protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler-prom
  namespace: kube-system
spec:
  ports:
    - name: https-metrics
      port: 10259
      protocol: TCP
      targetPort: 10259
  sessionAffinity: None
  type: ClusterIP

Once the Service is created, the alert in Prometheus clears.

[root@master-01 ~]# kubectl create -f scheduler-svc.yaml
endpoints/kube-scheduler-prom created
service/kube-scheduler-prom created
[root@master-01 ~]# kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-scheduler
NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
kube-scheduler-prom   ClusterIP   10.96.87.125   <none>        10259/TCP   26s
serviceMonitor/monitoring/kube-scheduler/0 (3 / 31 active targets)
serviceMonitor/monitoring/kube-scheduler/1 (3 / 31 active targets)

When an application monitored through a ServiceMonitor does not show up as a target, the rough troubleshooting steps are:

1. Confirm the ServiceMonitor was created successfully

2. Confirm Prometheus has generated the corresponding scrape configuration

3. Confirm a Service exists that matches the ServiceMonitor selector

4. Confirm the application's metrics endpoint is reachable through the Service

5. Confirm the Service port and scheme match the ServiceMonitor
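
For reference, these checks can be run roughly as follows (resource names and label values are placeholders to adapt; in this stack the rendered configuration lives in the prometheus-k8s Secret):

# 1. Is the ServiceMonitor present?
kubectl get servicemonitors -n monitoring
# 2. Did the Operator render it into the Prometheus configuration?
kubectl get secret prometheus-k8s -n monitoring -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip | grep <job-name>
# 3. Is there a Service matching the ServiceMonitor selector?
kubectl get svc -n <namespace> -l <label>=<value>
# 4/5. Is the metrics endpoint reachable through the Service, on the expected port and scheme?
curl -s http://<cluster-ip>:<port>/metrics | head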

Blackbox monitoring

The Blackbox Exporter is an important part of the Prometheus ecosystem, typically used to monitor external services that cannot expose metrics of their own, over protocols such as HTTP, HTTPS, DNS and TCP. Blackbox monitoring does not rely on the target exporting metrics; instead it actively sends requests and judges the target's health from the responses.

Recent versions of the kube-prometheus stack install the Blackbox Exporter by default.

[root@master-01 ~]# kubectl get servicemonitors -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME                AGE
blackbox-exporter   4d2h

The Blackbox Exporter also comes with a Service, through which it can be reached and passed probe parameters.

[root@master-01 ~]# kubectl get svc -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME                TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)              AGE
blackbox-exporter   ClusterIP   10.96.200.93   <none>        9115/TCP,19115/TCP   4d2h

Probe the status of the blog.caijxlinux.work website (note: any public or internal domain will do for this test; if you use a proxy, unset the http_proxy and https_proxy variables first, otherwise the Service cannot reach the domain).

[root@master-01 ~]# curl -s "http://10.96.200.93:19115/probe?target=blog.caijxlinux.work&module=http_2xx" | tail -1
probe_tls_version_info{version="TLS 1.2"} 1

Parameter  Meaning
probe  The probe endpoint path
target  The target to probe
module  The probe module to use

Static Prometheus configuration

Some readers may have installed Prometheus the traditional way (for example from binaries) and need to update its configuration through static files, so blackbox monitoring is used here as an example of adding monitoring targets with a static configuration from inside the cluster.

Create the prometheus-additional.yaml file and turn it into a Secret, which will serve as Prometheus's additional static configuration.

[root@master-01 ~]# touch prometheus-additional.yaml
[root@master-01 ~]# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created

Edit the Prometheus custom resource and add the following.

[root@master-01 ~]# kubectl edit prometheus -n monitoring k8s
...output omitted...
spec:
  additionalScrapeConfigs:
    key: prometheus-additional.yaml
    name: additional-configs
    optional: true
...output omitted...

Once that is in place, write the scrape configuration into the prometheus-additional.yaml file.

[root@master-01 ~]# vim prometheus-additional.yaml
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx] # Look for a HTTP 200 response.
  static_configs:
    - targets:
        - https://blog.caijxlinux.work # Target to probe with http.
        - https://www.baidu.com # Target to probe with https.
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:19115 # The blackbox exporter's real hostname:port.

Hot-reload the Secret.

[root@master-01 ~]# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml --dry-run=client -oyaml | kubectl replace -f - -n monitoring
secret/additional-configs replaced

After a short wait, open the Prometheus UI and click Status → Targets to see the blackbox targets.

caijx_http

Once the probes report a healthy state, import the blackbox dashboard in the Grafana UI to visualise the results. The dashboard link is attached below.

blackbox

https://grafana.com/grafana/dashboards/13659

Monitoring Windows hosts with Prometheus

Download the corresponding exporter onto the Windows host (similar to a zabbix agent); the version installed here is windows_exporter-0.30.0-rc.0-amd64.msi.

https://github.com/prometheus-community/windows_exporter/releases/tag/v0.30.0-rc.0

After the installation, you can confirm from CMD that the exporter listens on port 9182.

netstat -anio
 TCP    0.0.0.0:9182           0.0.0.0:0              LISTENING       26632
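
Before wiring it into Prometheus, the endpoint can be verified from any Linux node that can reach the Windows host (the IP below is the one used later in this document):

curl -s http://192.168.132.1:9182/metrics | head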

Add the job to the prometheus-additional.yaml file.

- job_name: 'WindowsServerMonitor'
  static_configs:
    - targets:
        - "192.168.132.1:9182"
      labels:
        server_type: 'windows'
  relabel_configs:
    - source_labels: [__address__]
      target_label: instance

Hot-reload the Secret.

[root@master-01 ~]# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml --dry-run=client -oyaml | kubectl replace -f - -n monitoring
secret/additional-configs replaced

The corresponding metrics show up in the Prometheus UI; import the Grafana template and the data can be visualised. This is not repeated here.

windows

windows grafana

https://grafana.com/grafana/dashboards/12566

PromQL, the Prometheus query language

PromQL is a powerful tool for querying Prometheus data and for real-time analysis, computation and monitoring rules. Readers should get comfortable with it for querying and computing specific data, and PromQL is also the foundation of the alerting rules built in later chapters.

Basic building blocks

1. Metric name: identifies a particular time series.

up

up #query the up metric, commonly used to check whether targets are alive

2. Labels: filter metrics with {}, based on key-value pairs.

labels

http_requests_total{job="api-server", method="GET"} #查询 http_requests_total 指标中 job="api-server" 且 method="GET" 的数据

3. Range vectors: use [duration] to select samples over a time window.

rate(http_requests_total[5m]) #per-second request rate over the last 5 minutes

PromQL also supports the following matchers:

!= #not equal
=~ #matches the given regular expression
!~ #the opposite of =~: =~ means regex match, !~ means regex non-match
up{node!="master-01"}

Query the total disk size of each host in the Kubernetes cluster.

node_filesystem

node_filesystem_size_bytes

Query the size of a specified mount point.

node_filesystem_size_bytes{mountpoint="/"}

Query the size of partitions whose device starts with /dev/ and whose mount point is not /boot.

node_filesystem_size_bytes{device=~"/dev/.*", mountpoint!="/boot"}

Query how the available disk space on host master-01 changed over the last 5 minutes.

node_filesystem_avail_bytes{instance="master-01", mountpoint="/",device="/dev/mapper/centos-root"}[5m]

Query the available disk space 10 minutes ago, using the offset modifier.

node_filesystem_avail_bytes{instance="master-01", mountpoint="/",device="/dev/mapper/centos-root"} offset 10m

Query the 5-minute window of available disk space as it was 10 minutes ago.

node_filesystem_avail_bytes{instance="master-01", mountpoint="/",device="/dev/mapper/centos-root"}[5m] offset 10m
PromQL 操作符

Convert the queried disk space to GB.

node_filesystem_avail_bytes{instance="master-01", mountpoint="/",device="/dev/mapper/centos-root"} / 1024 / 1024 / 1024
node_filesystem_avail_bytes{instance="master-01", mountpoint="/",device="/dev/mapper/centos-root"} / (1024 ^ 3)

Run df -Th on master-01 and compare with the figure above (disk space) and the one below (disk availability ratio).

[root@master-01 ~]# df -Th
Filesystem              Type      Size  Used Avail Use% Mounted on
devtmpfs                devtmpfs  2.4G     0  2.4G   0% /dev
tmpfs                   tmpfs     2.5G     0  2.5G   0% /dev/shm
tmpfs                   tmpfs     2.5G   30M  2.4G   2% /run
tmpfs                   tmpfs     2.5G     0  2.5G   0% /sys/fs/cgroup
/dev/mapper/centos-root xfs        39G   12G   27G  30% /

Query the root-partition availability ratio on master-01.

node_filesystem_avail_bytes{instance="master-01", mountpoint="/",device="/dev/mapper/centos-root"} / node_filesystem_size_bytes{instance="master-01", mountpoint="/",device="/dev/mapper/centos-root"}

Query the root-partition availability ratio of all hosts.

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

Expressed as a percentage:

(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100

Find the hosts in the cluster whose root partition has more than 60% space available.

(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 > 60

Comparison and logical operators supported by PromQL:

Operator  Meaning
==  equal
>  greater than
<  less than
>=  greater than or equal to
<=  less than or equal to
and  intersection (both conditions)
or  union (either condition)
unless  exclusion

Hosts whose root-partition availability is above 30% and at most 60%:

30 < (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 <= 60
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 > 30 and (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 <=60
PromQL 常用函数

Use the sum function to total the remaining root-partition space across all monitored hosts.

sum(node_filesystem_free_bytes{mountpoint="/"}) / 1024^3

Aggregate http_request_total by the statuscode label.

sum(http_request_total) by (statuscode)

Aggregate further by both statuscode and handler.

sum(http_request_total) by (statuscode, handler)
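
Aggregations like these can also be fed into ranking functions; for instance (an illustrative query, assuming the cAdvisor metrics scraped via the kubelet are available):

# Top 3 pods by CPU usage over the last 5 minutes
topk(3, sum by (pod) (rate(container_cpu_usage_seconds_total{image!=""}[5m])))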

PromQL also provides topk(), bottomk(), min(), max(), avg(), ceil(), floor(), sort(), sort_desc() and other functions. A document with a detailed walkthrough of the PromQL functions is linked below; readers who need it will find it well worth a careful read.

https://blog.caijxlinux.work/promql-learning.pdf