Kubernetes Monitoring

Visualize your Kubernetes clusters and set up alerting in minutes, not days.

Get started in Grafana Cloud

Documentation

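Why use Kubernetes Monitoring in Grafana Cloud?

Faster time to value

Cut deployment, setup, and troubleshooting time with this out-of-the-box monitoring tool, which takes only a few CLI commands or small changes to a Helm chart.

Identify root causes faster

Drill into your infrastructure through the cluster navigation view to identify and resolve issues, without the hassle of switching between windows and monitoring tools.

Reduce costs

Efficiency and cost monitoring visualizations give you comprehensive insight into spend, helping you make data-driven decisions on resource allocation, scaling strategies, and technology investments.

Easy to deploy

Deploy the Helm chart on any major cloud-hosted Kubernetes service or Kubernetes distribution.

Choose the features to enable
Get Helm installation instructions tailored to your needs

Learn more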

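Cost management

Better understand your Kubernetes costs, spending trends, and potential savings with cost monitoring, which is built on the open source project OpenCost.

Usage and cost attribution at every component level
Breakdown of costs and resource allocation across cloud providers
Visualization of cost trends and projected savings
Kubernetes costs organized by resource type
Savings recommendations based on your resource usage

Learn more

Watch a demo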

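High-priority issues at a glance

Instantly identify cluster issues that exceed preset thresholds with an overall snapshot of your infrastructure components.

Cluster CPU and memory usage
Container image distribution
Firing Pod and container alerts

Learn more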

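Full visibility from Kubernetes cluster to container

Get a complete view of your Kubernetes clusters, then drill down to container-level detail.

Cost and resource usage attribution at every infrastructure level
Color-coded resource usage visualizations and icons that help you identify and resolve issues faster
Side-by-side comparison of peak vs. average resource efficiency

Learn more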

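Optimize, analyze, and forecast your resource usage

Instantly analyze CPU and memory usage trends. Correlate actual usage with limits and requests. Proactively identify issues to optimize resource management.

Detailed insights with historical trends at every infrastructure level
Drill-down into network, energy, logs, and events in dedicated tabs
Machine learning-driven resource forecasting
Automated Pod CPU outlier detection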
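
To make "correlate actual usage with limits and requests" concrete, here is a minimal Python sketch. It is not how Kubernetes Monitoring itself computes anything: the `ContainerCpu` type, the `classify` labels, and the 0.5 and 0.9 thresholds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContainerCpu:
    """Hypothetical per-container CPU sample; all values are in cores."""
    name: str
    usage: float
    request: Optional[float]  # None when no request is set
    limit: Optional[float]    # None when no limit is set


def ratio(usage: float, bound: Optional[float]) -> Optional[float]:
    """Usage as a fraction of a request or limit; None if the bound is unset or zero."""
    if not bound:
        return None
    return usage / bound


def classify(sample: ContainerCpu, low: float = 0.5, high: float = 0.9) -> str:
    """Label a container using illustrative thresholds (assumptions, not product defaults)."""
    vs_limit = ratio(sample.usage, sample.limit)
    vs_request = ratio(sample.usage, sample.request)
    if vs_limit is not None and vs_limit >= high:
        return "near-limit"
    if vs_request is not None and vs_request < low:
        return "over-provisioned"
    return "ok"


samples = [
    ContainerCpu("api", usage=0.95, request=0.5, limit=1.0),
    ContainerCpu("worker", usage=0.1, request=1.0, limit=2.0),
    ContainerCpu("sidecar", usage=0.2, request=None, limit=None),
]
labels = {s.name: classify(s) for s in samples}
```

In this sketch a container close to its limit is a throttling risk, while one far below its request is a candidate for right-sizing.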

Learn more

Watch a demo

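Network stability and performance insights

Identify when throttling causes network saturation and packet loss.

Detect bandwidth throttling
Prevent packet loss
Optimize network performance across the cluster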
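
As a rough illustration of how packet loss can be quantified, the sketch below derives a drop ratio from two readings of a pair of counters, such as `container_network_receive_packets_dropped_total` and `container_network_receive_packets_total` from the metrics list on this page. The function names are hypothetical, and handling a counter reset by taking the new value as the increase is a simplifying assumption.

```python
def counter_increase(start: float, end: float) -> float:
    """Increase of a monotonically increasing counter.

    A decrease is treated as a counter reset, in which case the new
    value is taken as the increase (a simplifying assumption).
    """
    return end if end < start else end - start


def drop_ratio(dropped_start: float, dropped_end: float,
               packets_start: float, packets_end: float) -> float:
    """Dropped packets per counted packet over one interval."""
    dropped = counter_increase(dropped_start, dropped_end)
    packets = counter_increase(packets_start, packets_end)
    if packets == 0:
        return 0.0  # no traffic in the interval, so nothing to report
    return dropped / packets


# Two readings taken one interval apart (made-up numbers).
example = drop_ratio(dropped_start=10, dropped_end=30,
                     packets_start=1000, packets_end=2000)
```

A ratio that stays above zero while bandwidth is flat can hint at saturation, though the threshold worth alerting on depends on the workload.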

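Learn more

Understand your environmental footprint and energy consumption

Monitor your Kubernetes energy consumption to optimize efficiency, reduce costs, and improve sustainability.

24-hour consumption trends for GPU, DRAM, and Packages
Energy consumption breakdown by node and namespace

Learn more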

Instant Prometheus-correlated logs

Prometheus and Grafana Loki metadata keep the same labels for your Kubernetes clusters, so accessing correlated Kubernetes metrics and logs is straightforward.

Learn more

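Kubernetes container insights

Get instant clarity on your containers with cluster-to-container navigation.

Sizing recommendations
Access to historical data to pinpoint CPU throttling and restarts

Learn more

Watch a demo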

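Curated metrics and alerts

Access the kube-state-metrics and alert rules you need to monitor your Kubernetes clusters effectively.

A hand-picked set of metrics to avoid cardinality explosion
Community-built alerting standards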
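
One way to picture how a curated metric set bounds cardinality is an allowlist filter on the metric name. The Python sketch below is an illustration only, not the product's implementation: `ALLOWLIST` and `keep_series` are hypothetical names, and the three allowed metrics are an arbitrary subset of the list on this page. The one non-hypothetical detail is that Prometheus stores the metric name in the reserved `__name__` label.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical allowlist: a small subset of the metrics listed on this page.
ALLOWLIST: FrozenSet[str] = frozenset({
    "kube_pod_info",
    "kube_node_info",
    "container_cpu_usage_seconds_total",
})


def keep_series(series: Iterable[Dict[str, str]],
                allowlist: FrozenSet[str] = ALLOWLIST) -> List[Dict[str, str]]:
    """Keep only series whose metric name (the `__name__` label) is allowlisted."""
    return [s for s in series if s.get("__name__") in allowlist]


scraped = [
    {"__name__": "kube_pod_info", "namespace": "default", "pod": "api-0"},
    {"__name__": "some_debug_metric", "request_id": "abc123"},
    {"__name__": "kube_node_info", "node": "node-1"},
]
kept = keep_series(scraped)
kept_names = [s["__name__"] for s in kept]
```

Dropping series by name before ingestion is similar in spirit to a `keep` action in Prometheus metric relabeling, although the exact mechanism used here is not described on this page.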

Learn more

Watch a demo

Getting started is easy

For complete implementation details and best practices, see the guide.

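1

Sign up

Create your free Grafana Cloud account.

2

Connect your data

Set up the default configuration for prebuilt visualizations and alert rules in just a few clicks.

3

Deploy

Data starts flowing from your cluster into Grafana Cloud.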

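The Kubernetes Monitoring integration on Grafana Cloud lets our engineers monitor natively. They no longer need to reach out to our SRE team. Instead, they click a button on the Grafana Cloud integrations tab, navigate to the out-of-the-box dashboards, and see everything they need to solve their own problems: CPU usage, logs, metrics. It's very simple, it helps us find issues quickly, and it has saved all of us a lot of custom development time.

James Wojewoda

Lead Site Reliability Engineer | Beeswax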

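Kubernetes metrics and alert rules

The Kubernetes Monitoring solution in Grafana Cloud ingests a default set of metrics at a 60-second scrape interval. The set of alert rules helps you set up and run alerting for your clusters and their workloads.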
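
To show how an alert rule can be built from the default metrics, the sketch below computes a CPU throttling fraction from the increases of `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`, both of which appear in the metrics list on this page. It is a Python approximation of the idea, not the rule shipped with the product: the function names are hypothetical and the 0.25 threshold is an assumption.

```python
def throttled_fraction(throttled_periods_increase: float,
                       periods_increase: float) -> float:
    """Fraction of CFS periods in which the container was throttled.

    Both arguments are counter increases over the same time window.
    """
    if periods_increase <= 0:
        return 0.0  # container was not scheduled in this window
    return throttled_periods_increase / periods_increase


def is_throttling_high(throttled_periods_increase: float,
                       periods_increase: float,
                       threshold: float = 0.25) -> bool:
    """True when the throttled fraction exceeds the (assumed) threshold."""
    return throttled_fraction(throttled_periods_increase, periods_increase) > threshold


# Made-up window: 100 CFS periods elapsed, 30 of them throttled.
fraction = throttled_fraction(30, 100)
firing = is_throttling_high(30, 100)
```

With a 60-second scrape interval, the window used for the increases needs to span several scrapes for the fraction to be meaningful.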

Read more about Kubernetes metrics and alert rules

Key alert rules included

KubeNodeNotReady

KubeNodeUnreachable

KubeletTooManyPods

KubeNodeReadinessFlapping

KubeletPlegDurationHigh

KubeletPodStartUpLatencyHigh

KubeletClientCertificateExpiration

KubeletServerCertificateExpiration

KubeletClientCertificateRenewalErrors

KubeletServerCertificateRenewalErrors

KubeletDown

KubeVersionMismatch

KubeClientErrors

KubeCPUOvercommit

KubeMemoryOvercommit

KubeCPUQuotaOvercommit

KubeMemoryQuotaOvercommit

KubeQuotaAlmostFull

KubeQuotaFullyUsed

KubeQuotaExceeded

CPUThrottlingHigh

KubePodCrashLooping

KubePodNotReady

KubeDeploymentGenerationMismatch

KubeDeploymentReplicasMismatch

KubeStatefulSetReplicasMismatch

KubeStatefulSetGenerationMismatch

KubeStatefulSetUpdateNotRolledOut

KubeDaemonSetRolloutStuck

KubeContainerWaiting

KubeDaemonSetNotScheduled

KubeDaemonSetMisScheduled

KubeJobCompletion

KubeJobFailed

KubeHpaReplicasMismatch

KubeHpaMaxedOut

Key metrics included

cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits

cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests

cluster:namespace:pod_memory:active:kube_pod_container_resource_limits

cluster:namespace:pod_memory:active:kube_pod_container_resource_requests

container_cpu_cfs_periods_total

container_cpu_cfs_throttled_periods_total

container_cpu_usage_seconds_total

container_fs_reads_bytes_total

container_fs_reads_total

container_fs_writes_bytes_total

container_fs_writes_total

container_memory_cache

container_memory_rss

container_memory_swap

container_memory_working_set_bytes

container_network_receive_bytes_total

container_network_receive_packets_dropped_total

container_network_receive_packets_total

container_network_transmit_bytes_total

container_network_transmit_packets_dropped_total

container_network_transmit_packets_total

go_goroutines

kube_daemonset_status_current_number_scheduled

kube_daemonset_status_desired_number_scheduled

kube_daemonset_status_number_available

kube_daemonset_status_number_misscheduled

kube_daemonset_updated_number_scheduled

kube_deployment_metadata_generation

kube_deployment_spec_replicas

kube_deployment_status_observed_generation

kube_deployment_status_replicas_available

kube_deployment_status_replicas_updated

kube_horizontalpodautoscaler_spec_max_replicas

kube_horizontalpodautoscaler_spec_min_replicas

kube_horizontalpodautoscaler_status_current_replicas

kube_horizontalpodautoscaler_status_desired_replicas

kube_job_failed

kube_job_spec_completions

kube_job_status_succeeded

kube_namespace_created

kube_node_info

kube_node_spec_taint

kube_node_status_allocatable

kube_node_status_capacity

kube_node_status_condition

kube_pod_container_resource_limits

kube_pod_container_resource_requests

kube_pod_container_status_waiting_reason

kube_pod_info

kube_pod_owner

kube_pod_status_phase

kube_replicaset_owner

kube_resourcequota

kube_statefulset_metadata_generation

kube_statefulset_replicas

kube_statefulset_status_current_revision

kube_statefulset_status_observed_generation

kube_statefulset_status_replicas

kube_statefulset_status_replicas_ready

kube_statefulset_status_replicas_updated

kube_statefulset_status_update_revision

kubelet_certificate_manager_client_expiration_renew_errors

kubelet_certificate_manager_client_ttl_seconds

kubelet_certificate_manager_server_ttl_seconds

kubelet_cgroup_manager_duration_seconds_bucket

kubelet_cgroup_manager_duration_seconds_count

kubelet_node_config_error

kubelet_node_name

kubelet_pleg_relist_duration_seconds_bucket

kubelet_pleg_relist_duration_seconds_count

kubelet_pleg_relist_interval_seconds_bucket

kubelet_pod_start_duration_seconds_count

kubelet_pod_worker_duration_seconds_bucket

kubelet_pod_worker_duration_seconds_count

kubelet_running_container_count

kubelet_running_containers

kubelet_running_pod_count

kubelet_running_pods

kubelet_runtime_operations_duration_seconds_bucket

kubelet_runtime_operations_errors_total

kubelet_runtime_operations_total

kubelet_server_expiration_renew_errors

kubelet_volume_stats_available_bytes

kubelet_volume_stats_capacity_bytes

kubelet_volume_stats_inodes

kubelet_volume_stats_inodes_used

kubernetes_build_info

machine_memory_bytes

namespace_cpu:kube_pod_container_resource_limits:sum

namespace_cpu:kube_pod_container_resource_requests:sum

namespace_memory:kube_pod_container_resource_limits:sum

namespace_memory:kube_pod_container_resource_requests:sum

namespace_workload_pod

namespace_workload_pod:kube_pod_owner:relabel

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

node_namespace_pod_container:container_memory_cache

node_namespace_pod_container:container_memory_rss

node_namespace_pod_container:container_memory_swap

node_namespace_pod_container:container_memory_working_set_bytes

node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile

process_cpu_seconds_total

process_resident_memory_bytes

rest_client_request_duration_seconds_bucket

rest_client_requests_total

storage_operation_duration_seconds_bucket

storage_operation_duration_seconds_count

storage_operation_errors_total

volume_manager_total_volumes

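Ready to get started with Kubernetes Monitoring?

You have three options for using Kubernetes Monitoring in Grafana Cloud. All plans include prebuilt visualizations as well as metrics and alert rules.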

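Cloud Free

Free forever.

Best for early-stage and small teams, with up to 3 active users per month.

Create free account

The easiest way to get started

Cloud Pro

Pay as you go

Best for growing teams that need to scale beyond 3 users and unlock 8x5 support.

Start 14-day trial

Cloud Advanced

Premium bundle

Best for teams that want to connect Enterprise plugins and unlock 24x7 support.

Contact us

View pricing details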

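Helpful resources

60 min

GrafanaLive: Improving observability of the Beeswax platform with Grafana Cloud

Blog post

Introducing Kubernetes Monitoring in Grafana Cloud

Success story

As the number of microservices running on Kubernetes surged, PayIt turned to Grafana and Prometheus for observability at cloud native scale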

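Blog post

Five tips for logging at scale with Grafana Loki in Kubernetes environments

Blog post

5 key benefits of Kubernetes Monitoring

Blog post

How to monitor Kubernetes node health and resource usage in Grafana Cloud

Blog post

Introducing instant Kubernetes logging with Kubernetes Monitoring in Grafana Cloud

Blog post

How to monitor Kubernetes clusters with the Prometheus Operator

Blog post

How to use Kubernetes events for effective alerting and monitoring

Blog post

Monitoring Kubernetes layers: key metrics to know

Blog post

Distributed tracing in Kubernetes apps: what you need to know

Blog post

A beginner's guide to Kubernetes application monitoring

Blog post

How to collect and query Kubernetes logs with Grafana Loki, Grafana, and Grafana Agent

Blog post

How to configure Kubernetes Monitoring in Grafana Cloud with Argo CD

Blog post

How to migrate existing Grafana dashboards and alerts to Kubernetes Monitoring in Grafana Cloud

Blog post

How to optimize resource utilization with Kubernetes Monitoring in Grafana Cloud

Blog post

Kubernetes alerting: Simplify anomaly detection in Kubernetes clusters with Grafana Cloud

Blog post

What Kubernetes resource limits mean: predictability vs. efficiency

June 4

Getting started with Kubernetes Monitoring in Grafana Cloud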
