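Use configuration files to provision alerting resources

Manage your alerting resources using configuration files that can be version controlled. When Grafana starts, it provisions the resources defined in your configuration files. Provisioning can create, update, or delete existing resources in your Grafana instance.

This guide outlines the steps and reference information for provisioning alerting resources using YAML files. For a practical demo, you can clone and try this example, which uses Grafana OSS and Docker Compose.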
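
As a rough illustration of the kind of Docker Compose setup mentioned above, the following sketch mounts a local directory of alerting provisioning files into a Grafana OSS container. The service name, the host path `./provisioning/alerting`, and the port mapping are assumptions made for illustration and are not taken from the example repository. The container path assumes the default provisioning directory of the official image.

```yaml
# docker-compose.yaml (hypothetical layout, for illustration only)
services:
  grafana:
    image: grafana/grafana-oss:latest
    ports:
      - '3000:3000'
    volumes:
      # YAML files in this local directory are read when Grafana starts.
      - ./provisioning/alerting:/etc/grafana/provisioning/alerting
```

With a layout like this, editing a file under `./provisioning/alerting` and restarting the container should apply the change.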

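Note

  • Provisioning Grafana with configuration files is not available in Grafana Cloud.

  • You cannot edit file-provisioned resources from within Grafana. You can only change resource properties by changing the provisioning file and then restarting Grafana or performing a hot reload. This prevents changes that would be overwritten the next time the file is provisioned or a hot reload is performed.

  • Provisioning using configuration files takes place during the initial setup of your Grafana system, but you can re-run it at any time using the Grafana Admin API.

  • Importing an existing alerting resource results in a conflict. First, if present, remove the resources you plan to import.

Details on how to set up the files and which fields are required for each object are listed below, depending on which resource you are provisioning.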

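Import alert rules

Create or delete alert rules in your Grafana instance using provisioning files.

  1. Find the alert rule group in Grafana.

  2. Export and download a provisioning file for your alert rules.

  3. Copy the contents into a YAML or JSON configuration file and add it to the provisioning/alerting directory of the Grafana instance you want to import the alerting resources to.

    Example configuration files are shown below.

  4. Restart your Grafana instance (or reload the provisioned files using the Admin API).

Here is an example of a configuration file for creating alert rules.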

yaml
# config file version
apiVersion: 1

# List of rule groups to import or update
groups:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string, required> name of the rule group
    name: my_rule_group
    # <string, required> name of the folder the rule group will be stored in
    folder: my_first_folder
    # <duration, required> interval at which the rule group should be evaluated
    interval: 60s
    # <list, required> list of rules that are part of the rule group
    rules:
      # <string, required> unique identifier for the rule. Should not exceed 40 symbols. Only letters, numbers, - (hyphen), and _ (underscore) allowed.
      - uid: my_id_1
        # <string, required> title of the rule that will be displayed in the UI
        title: my_first_rule
        # <string, required> which query should be used for the condition
        condition: A
        # <list, required> list of query objects that should be executed on each
        #                  evaluation - should be obtained through the API
        data:
          - refId: A
            datasourceUid: '__expr__'
            model:
              conditions:
                - evaluator:
                    params:
                      - 3
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - A
                  reducer:
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: '__expr__'
              expression: 1==0
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
              type: math
        # <string> UID of a dashboard that the alert rule should be linked to
        dashboardUid: my_dashboard
        # <int> ID of the panel that the alert rule should be linked to
        panelId: 123
        # <string> the state the alert rule will have when no data is returned
        #          possible values: "NoData", "Alerting", "OK", default = NoData
        noDataState: Alerting
        # <string> the state the alert rule will have when the query execution
        #          failed - possible values: "Error", "Alerting", "OK"
        #          default = Alerting
        execErrState: Alerting
        # <duration, required> how long the condition must hold before the alert fires
        for: 60s
        # <map<string, string>> a map of strings to pass around any data
        annotations:
          some_key: some_value
        # <map<string, string>> a map of strings that can be used to filter and
        #                       route alerts
        labels:
          team: sre_team_1

Here is an example of a configuration file for deleting alert rules.

yaml
# config file version
apiVersion: 1

# List of alert rule UIDs that should be deleted
deleteRules:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string, required> unique identifier for the rule
    uid: my_id_1

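Import contact points

Create or delete contact points in your Grafana instance using provisioning files.

  1. Find the contact point in Grafana.

  2. Export and download a provisioning file for your contact point.

  3. Copy the contents into a YAML or JSON configuration file and add it to the provisioning/alerting directory of the Grafana instance you want to import the alerting resources to.

    Example configuration files are shown below.

  4. Restart your Grafana instance (or reload the provisioned files using the Admin API).

Here is an example of a configuration file for creating contact points.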

yaml
# config file version
apiVersion: 1

# List of contact points to import or update
contactPoints:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string, required> name of the contact point
    name: cp_1
    receivers:
      # <string, required> unique identifier for the receiver. Should not exceed 40 symbols. Only letters, numbers, - (hyphen), and _ (underscore) allowed.
      - uid: first_uid
        # <string, required> type of the receiver
        type: prometheus-alertmanager
        # <bool, optional> Disable the additional [Incident Resolved] follow-up alert, default = false
        disableResolveMessage: false
        # <object, required> settings for the specific receiver type
        settings:
          url: http://test:9000

Here is an example of a configuration file for deleting contact points.

yaml
# config file version
apiVersion: 1

# List of receivers that should be deleted
deleteContactPoints:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string, required> unique identifier for the receiver
    uid: first_uid

Settings

Here are some examples of settings you can use for the different contact point integrations.
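
A sketch of what such settings might look like for two common integrations, email and webhook, is shown below. The contact point name, UIDs, addresses, and URL are placeholders, and the supported settings keys can differ between Grafana versions, so treat the keys as assumptions to verify against the reference for your version.

```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: cp_examples
    receivers:
      # Email integration: 'addresses' is a semicolon-separated list.
      - uid: email_uid
        type: email
        settings:
          addresses: me@example.com;you@example.com
          singleEmail: false
      # Webhook integration: alerts are sent to 'url' using 'httpMethod'.
      - uid: webhook_uid
        type: webhook
        settings:
          url: https://endpoint_url
          httpMethod: POST
```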

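Import notification template groups

Create or delete notification template groups in your Grafana instance using provisioning files.

  1. Find the notification template group in Grafana.

  2. Export the template group by copying the template content and name.

  3. Copy the contents into a YAML or JSON configuration file and add it to the provisioning/alerting directory of the Grafana instance you want to import the alerting resources to.

    Example configuration files are shown below.

  4. Restart your Grafana instance (or reload the provisioned files using the Admin API).

Here is an example of a configuration file for creating notification template groups.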

yaml
# config file version
apiVersion: 1

# List of templates to import or update
templates:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string, required> name of the template group, must be unique
    name: my_first_template
    # <string, required> content of the template group
    template: |
      {{ define "my_first_template" }}
        Custom notification message
      {{ end }}

Here is an example of a configuration file for deleting notification template groups.

yaml
# config file version
apiVersion: 1

# List of notification template groups that should be deleted
deleteTemplates:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string, required> name of the template group, must be unique
    name: my_first_template

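Import notification policies

Create or reset the notification policy tree in your Grafana instance using provisioning files.

In Grafana, the entire notification policy tree is considered a single, large resource. Add new specific policies as child policies under the root policy. Because specific policies may depend on each other, you cannot provision subsets of the policy tree; the entire tree must be defined in a single place.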
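
To make the root and child structure concrete, here is a minimal sketch of a complete tree with two child policies, one of which has its own nested child. The contact point names other than `grafana-default-email`, and the label values, are hypothetical; every referenced contact point is assumed to exist already in the target organization.

```yaml
apiVersion: 1

policies:
  - orgId: 1
    # Root policy: handles every alert not matched by a child policy.
    receiver: grafana-default-email
    group_by: ['grafana_folder', 'alertname']
    routes:
      # Child policy for one team (hypothetical contact point name).
      - receiver: sre-team-contact-point
        object_matchers:
          - ['team', '=', 'sre_team_1']
        routes:
          # Nested child policy: only critical alerts for that team.
          - receiver: sre-team-pager
            object_matchers:
              - ['severity', '=', 'critical']
      # A second child policy, evaluated if the first does not match.
      - receiver: db-team-contact-point
        object_matchers:
          - ['team', '=', 'db_team']
```

Because the whole tree is one resource, adding or removing a single child policy means editing this one definition rather than provisioning the child on its own.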

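Warning

Because the policy tree is a single resource, provisioning it overwrites all policies in the notification policy tree. However, it does not affect the internal policies created when alert rules select a contact point directly.

  1. Find the notification policy tree in Grafana.

  2. Export and download a provisioning file for your notification policy tree.

  3. Copy the contents into a YAML or JSON configuration file and add it to the provisioning/alerting directory of the Grafana instance you want to import the alerting resources to.

    Example configuration files are shown below.

  4. Restart your Grafana instance (or reload the provisioned files using the Admin API).

Here is an example of a configuration file for creating notification policies.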

yaml
# config file version
apiVersion: 1

# List of notification policies
policies:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string> name of the contact point that should be used for this route
    receiver: grafana-default-email
    # <list> The labels by which incoming alerts are grouped together. For example,
    #        multiple alerts coming in for cluster=A and alertname=LatencyHigh would
    #        be batched into a single group.
    #
    #        To aggregate by all possible labels use the special value '...' as
    #        the sole label name, for example:
    #        group_by: ['...']
    #        This effectively disables aggregation entirely, passing through all
    #        alerts as-is. This is unlikely to be what you want, unless you have
    #        a very low alert volume or your upstream notification system performs
    #        its own grouping.
    group_by: ['...']
    # <list> a list of prometheus-like matchers that an alert rule has to fulfill to match the node (allowed chars
    #        [a-zA-Z_:])
    matchers:
      - alertname = Watchdog
      - service_id_X = serviceX
      - severity =~ "warning|critical"
    # <list> a list of grafana-like matchers that an alert rule has to fulfill to match the node
    object_matchers:
      - ['alertname', '=', 'CPUUsage']
      - ['service_id-X', '=', 'serviceX']
      - ['severity', '=~', 'warning|critical']
    # <list> Times when the route should be muted. These must match the name of a
    #        mute time interval.
    #        Additionally, the root node cannot have any mute times.
    #        When a route is muted it will not send any notifications, but
    #        otherwise acts normally (including ending the route-matching process
    #        if the `continue` option is not set)
    mute_time_intervals:
      - abc
    # <duration> How long to initially wait to send a notification for a group
    #            of alerts. Allows collecting more initial alerts for the same group.
    #            (Usually ~0s to a few minutes), default = 30s
    group_wait: 30s
    # <duration> How long to wait before sending a notification about new alerts that
    #            are added to a group of alerts for which an initial notification has
    #            already been sent. (Usually ~5m or more), default = 5m
    group_interval: 5m
    # <duration>  How long to wait before sending a notification again if it has already
    #             been sent successfully for an alert. (Usually ~3h or more), default = 4h
    repeat_interval: 4h
    # <list> Zero or more child policies. The schema is the same as the root policy.
    # routes:
    #   # Another recursively nested policy...
    #   - receiver: another-receiver
    #     matchers:
    #       - ...
    #     ...

Here is an example of a configuration file for resetting the policy tree to its default value.

yaml
# config file version
apiVersion: 1

# List of orgIds that should be reset to the default policy
resetPolicies:
  - 1

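Import mute timings

Create or delete mute timings in your Grafana instance using provisioning files.

  1. Find the mute timing in Grafana.

  2. Export and download a provisioning file for your mute timing.

  3. Copy the contents into a YAML or JSON configuration file and add it to the provisioning/alerting directory of the Grafana instance you want to import the alerting resources to.

    Example configuration files are shown below.

  4. Restart your Grafana instance (or reload the provisioned files using the Admin API).

Here is an example of a configuration file for creating mute timings.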

yaml
# config file version
apiVersion: 1

# List of mute time intervals to import or update
muteTimes:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string, required> name of the mute time interval, must be unique
    name: mti_1
    # <list> time intervals that should trigger the muting
    #        refer to https://prometheus.ac.cn/docs/alerting/latest/configuration/#time_interval-0
    time_intervals:
      - times:
          - start_time: '06:00'
            end_time: '23:59'
        location: 'UTC'
        weekdays: ['monday:wednesday', 'saturday', 'sunday']
        months: ['1:3', 'may:august', 'december']
        years: ['2020:2022', '2030']
        days_of_month: ['1:5', '-3:-1']

Here is an example of a configuration file for deleting mute timings.

yaml
# config file version
apiVersion: 1

# List of mute time intervals that should be deleted
deleteMuteTimes:
  # <int> organization ID, default = 1
  - orgId: 1
    # <string, required> name of the mute time interval, must be unique
    name: mti_1
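
A mute timing only takes effect once a notification policy refers to it by name. The sketch below pairs a mute timing with a child policy that uses it; the name `weekends` and the matcher are hypothetical. It assumes that both resource types can be declared in the same provisioning file; if that does not hold for your Grafana version, the two top-level keys can be split into separate files.

```yaml
apiVersion: 1

muteTimes:
  - orgId: 1
    name: weekends
    time_intervals:
      - weekdays: ['saturday', 'sunday']
        location: 'UTC'

policies:
  - orgId: 1
    receiver: grafana-default-email
    routes:
      # The root policy cannot have mute timings, so attach it to a child.
      - receiver: grafana-default-email
        object_matchers:
          - ['severity', '=', 'warning']
        mute_time_intervals:
          - weekends
```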

Template variable interpolation

Provisioning interpolates environment variables using the $variable syntax.

yaml
contactPoints:
  - orgId: 1
    name: My Contact Email Point
    receivers:
      - uid: 1
        type: email
        settings:
          addresses: $EMAIL

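In this example, provisioning replaces $EMAIL with the value of the EMAIL environment variable, or with an empty string if the variable is not set. For more information, refer to Using environment variables in the configuration documentation.

In alerting resources, most properties support template variable interpolation, with a few exceptions:

  • Alert rule annotations: groups[].rules[].annotations
  • Alert rule time range: groups[].rules[].relativeTimeRange
  • Alert rule query model: groups[].rules[].data.model
  • Mute timing name: muteTimes[].name
  • Mute timing time intervals: muteTimes[].time_intervals[]
  • Notification template group name: templates[].name
  • Notification template group content: templates[].template

Note: For properties that support interpolation, you may unintentionally substitute template variables. To avoid this, escape $variable as $$variable.

For example, consider provisioning a subject property in a contactPoints.receivers.settings object that is meant to use the $labels variable.

  1. subject: '{{ $labels }}' is interpolated, incorrectly defining the subject as subject: '{{ }}'
  2. subject: '{{ $$labels }}' is not interpolated, correctly defining the subject as subject: '{{ $labels }}'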
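
Putting the two behaviours side by side, the following sketch shows a contact point that relies on interpolation for one setting and escapes it for another. The receiver UID is a placeholder, and the `subject` value simply reuses the `$labels` example from the list above.

```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: My Contact Email Point
    receivers:
      - uid: email_escaped
        type: email
        settings:
          # $EMAIL is replaced with the EMAIL environment variable.
          addresses: $EMAIL
          # $$ escapes the dollar sign, so the stored value is '{{ $labels }}'.
          subject: '{{ $$labels }}'
```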

More examples

For more examples of the concepts in this guide