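Configure Tempo
You can use Grafana Cloud to avoid installing, maintaining, and scaling your own instance of Grafana Tempo. Create a free account to get started, which includes free forever access to 10k metrics, 50GB logs, 50GB traces, 500VUh k6 testing, and more.
This document explains the configuration options for Tempo and the details of what they impact.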
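Tip
For instructions on configuring the Tempo data source, refer to the Grafana Cloud and Grafana documentation.
The Tempo configuration options are organized into the sections that follow.
Additionally, you can review the TLS documentation to configure the cluster components to communicate over TLS, or to receive traces over TLS.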
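Use environment variables in the configuration
You can use environment variable references in the configuration file to set values that need to be configurable during deployment. To do this, pass the -config.expand-env=true flag and use ${VAR}, where VAR is the name of the environment variable.
Each variable reference is replaced at startup by the value of the environment variable. The replacement is case-sensitive and occurs before the YAML file is parsed. References to undefined variables are replaced by empty strings unless you specify a default value or custom error text.
To specify a default value, use ${VAR:-default_value}, where default_value is the value to use if the environment variable is undefined.
You can find more information about other supported syntax here.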
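As a minimal sketch of the substitution described above, the following fragment reads the storage backend and the bucket name from the environment. The variable names TEMPO_STORAGE_BACKEND and TEMPO_GCS_BUCKET are hypothetical and chosen only for illustration; the configuration keys are the ones listed in the Storage section of this document.

```yaml
# Assumes Tempo is started with -config.expand-env=true.
# TEMPO_STORAGE_BACKEND and TEMPO_GCS_BUCKET are hypothetical variable names.
storage:
    trace:
        # Falls back to "local" when TEMPO_STORAGE_BACKEND is undefined.
        backend: ${TEMPO_STORAGE_BACKEND:-local}
        gcs:
            # No default given: an undefined variable expands to an empty string.
            bucket_name: ${TEMPO_GCS_BUCKET}
```

Because the replacement happens before the YAML is parsed, a variable that expands to an empty string leaves the key with an empty value, so supplying a default is usually the safer choice for required settings.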
Server
Tempo uses the server from dskit/server. For more information on configuration options, refer to this file.
# Optional. Setting to true enables multitenancy and requires X-Scope-OrgID header on all requests.
[multitenancy_enabled: <bool> | default = false]
# Optional. String prefix for all http api endpoints. Must include beginning slash.
[http_api_prefix: <string>]
server:
    # HTTP server listen host
    [http_listen_address: <string>]
    # HTTP server listen port
    [http_listen_port: <int> | default = 80]
    # gRPC server listen host
    [grpc_listen_address: <string>]
    # gRPC server listen port
    [grpc_listen_port: <int> | default = 9095]
    # Register instrumentation handlers (/metrics, etc.)
    [register_instrumentation: <boolean> | default = true]
    # Timeout for graceful shutdowns
    [graceful_shutdown_timeout: <duration> | default = 30s]
    # Read timeout for HTTP server
    [http_server_read_timeout: <duration> | default = 30s]
    # Write timeout for HTTP server
    [http_server_write_timeout: <duration> | default = 30s]
    # Idle timeout for HTTP server
    [http_server_idle_timeout: <duration> | default = 120s]
    # Max gRPC message size that can be received
    # This value may need to be increased if you have large traces
    [grpc_server_max_recv_msg_size: <int> | default = 16777216]
    # Max gRPC message size that can be sent
    # This value may need to be increased if you have large traces
    [grpc_server_max_send_msg_size: <int> | default = 16777216]
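The reference above can be turned into a concrete fragment along the following lines. This is only a sketch: the port number and message sizes are hypothetical values picked for illustration, not recommendations.

```yaml
# Requires the X-Scope-OrgID header on all requests.
multitenancy_enabled: true
server:
    # Hypothetical port choices; the documented defaults are 80 and 9095.
    http_listen_port: 3200
    grpc_listen_port: 9095
    # 32 MiB, double the documented default of 16777216 bytes,
    # as an example of raising the limit for large traces.
    grpc_server_max_recv_msg_size: 33554432
    grpc_server_max_send_msg_size: 33554432
```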
Distributor
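For more information on configuration options, refer to this file.
Distributors receive spans and forward them to the appropriate ingesters.
The following configuration enables all available receivers with their default configuration. For a production deployment, enable only the receivers you need. Additional documentation and more advanced configuration options are available in the receiver README.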
# Distributor config block
distributor:
# receiver configuration for different protocols
# config is passed down to opentelemetry receivers
# for a production deployment you should only enable the receivers you need!
receivers:
otlp:
protocols:
grpc:
http:
jaeger:
protocols:
thrift_http:
grpc:
thrift_binary:
thrift_compact:
zipkin:
opencensus:
kafka:
# Optional.
# Configures forwarders that asynchronously replicate ingested traces
# to specified endpoints. Forwarders work on per-tenant basis, so to
# fully enable this feature, overrides configuration must also be updated.
#
# Note: Forwarders work asynchronously and can fail or decide not to forward
# some traces. This feature works in a "best-effort" manner.
forwarders:
# Forwarder name. Must be unique within the list of forwarders.
# This name can be referenced in the overrides configuration to
# enable forwarder for a tenant.
- name: <string>
# The forwarder backend to use
# Should be "otlpgrpc".
backend: <string>
# otlpgrpc configuration. Will be used only if value of backend is "otlpgrpc".
otlpgrpc:
# List of otlpgrpc compatible endpoints.
endpoints: <list of string>
tls:
# Optional.
# Disables TLS if set to true.
[insecure: <boolean> | default = false]
# Optional.
# Path to the TLS certificate. This field must be set if insecure = false.
[cert_file: <string> | default = ""]
# Optional.
# Configures filtering in forwarder that lets you drop spans and span events using
# the OpenTelemetry Transformation Language (OTTL) syntax. For detailed overview of
# the OTTL syntax, refer to the official OpenTelemetry documentation.
filter:
traces:
span: <list of string>
spanevent: <list of string>
- (repetition of above...)
# Optional.
# Enable to log every received span to help debug ingestion or calculate span error distributions using the logs.
# This is not recommended for production environments
log_received_spans:
[enabled: <boolean> | default = false]
[include_all_attributes: <boolean> | default = false]
[filter_by_status_error: <boolean> | default = false]
# Optional.
# Enable to log every discarded span to help debug ingestion or calculate span error distributions using the logs.
log_discarded_spans:
[enabled: <boolean> | default = false]
[include_all_attributes: <boolean> | default = false]
[filter_by_status_error: <boolean> | default = false]
# Optional.
# Enable to emit a metric for every received span to help debug ingestion.
# This is not recommended for production environments
metric_received_spans:
[enabled: <boolean> | default = false]
[root_only: <boolean> | default = false]
# Optional.
# Disables write extension with inactive ingesters. Use this along with ingester.lifecycler.unregister_on_shutdown = true
# Note that setting these two config values reduces tolerance to failures on rollout because there is always one replica guaranteed to be failing.
[extend_writes: <bool>]
# Optional.
# Configures the retry-after duration returned to the client when Tempo responds with a gRPC ResourceExhausted error. This parameter
# defaults to 0 which means that by default ResourceExhausted is not retried. Set this to a duration such as `1s` to
# instruct the client how to retry.
[retry_after_on_resource_exhausted: <duration> | default = '0' ]
# Optional
# Configures the max size an attribute can be. Any key or value that exceeds this limit will be truncated before storing
# Setting this parameter to '0' would disable this check against attribute size
[max_span_attr_byte: <int> | default = '2048']
# Optional.
# Configures usage trackers in the distributor which expose metrics of ingested traffic grouped by configurable
# attributes exposed on /usage_metrics.
usage:
cost_attribution:
# Enables the "cost-attribution" usage tracker. Per-tenant attributes are configured in overrides.
[enabled: <boolean> | default = false]
# Maximum number of series per tenant.
[max_cardinality: <int> | default = 10000]
# Interval after which a series is considered stale and will be deleted from the registry.
# Once a metrics series is deleted, it won't be emitted anymore, keeping active series low.
[stale_duration: <duration> | default = 15m0s]
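For a production-style setup, the reference above might be reduced to something like the following sketch. It keeps only the OTLP receivers and adds one forwarder. The forwarder name and endpoint are hypothetical, and as the comments above state, forwarding also has to be enabled per tenant in the overrides configuration, which is not shown here.

```yaml
distributor:
    receivers:
        # Only the OTLP receivers are enabled, with their default configuration.
        otlp:
            protocols:
                grpc:
                http:
    forwarders:
        # "example-forwarder" and the endpoint below are hypothetical values.
        - name: "example-forwarder"
          backend: "otlpgrpc"
          otlpgrpc:
              endpoints:
                  - "otel-collector.example.internal:4317"
              tls:
                  # TLS disabled for illustration only; otherwise set cert_file.
                  insecure: true
    # Span logging stays off, as recommended for production environments.
    log_received_spans:
        enabled: false
```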
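Set max attribute size to help control out-of-memory errors
Tempo queriers can run out of memory when fetching traces that contain spans with very large attributes. This issue has been observed when fetching a single trace using the tracebyID endpoint. The trace might not have many spans (roughly 500), but it can still be large (approximately 250KB) because some of its spans have attributes with very large values.
To avoid these out-of-memory crashes, use max_span_attr_byte to limit the maximum allowable size of any individual attribute. Any key or value that exceeds the configured limit is truncated before storing. The default value is 2048.
Use the tempo_distributor_attributes_truncated_total metric to track how many attributes are truncated.
For more information, refer to Troubleshoot out-of-memory errors.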
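As an illustration of the limit described above, the following sketch lowers the maximum attribute size. The value 1024 is a hypothetical choice; pick a limit based on the attribute sizes you actually observe.

```yaml
distributor:
    # Truncate any attribute key or value larger than 1024 bytes.
    # Hypothetical value: the documented default is 2048, and 0 disables the check.
    max_span_attr_byte: 1024
```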
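gRPC compression
Starting with Tempo 2.7.1, gRPC compression between all components defaults to snappy. Using snappy provides a balanced approach to compression between components that works for most installations.
If you prefer a different balance of CPU/memory and bandwidth, consider disabling compression or using zstd.
Disabling compression may provide some performance gains. Benchmark testing suggested that, without compression, queriers and distributors use less CPU and memory.
However, you may notice an increase in ingester data and network traffic, especially for larger clusters. This increased data can impact billing for Grafana Cloud.
You can configure gRPC compression in the querier, ingester, and metrics_generator clients of the distributor.
To disable compression, remove snappy from the grpc_compression lines.
To re-enable compression, use snappy with the following settings: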
ingester_client:
    grpc_client_config:
        grpc_compression: "snappy"
metrics_generator_client:
    grpc_client_config:
        grpc_compression: "snappy"
querier:
    frontend_worker:
        grpc_client_config:
            grpc_compression: "snappy"
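If you want to trade more CPU for less bandwidth, the same three clients could be switched to zstd. The sketch below assumes that zstd is selected through the same grpc_compression key; the text above names zstd as an alternative but does not show its exact configuration, so verify the accepted values for your Tempo version.

```yaml
# Assumption: "zstd" is accepted by the same grpc_compression key as "snappy".
ingester_client:
    grpc_client_config:
        grpc_compression: "zstd"
metrics_generator_client:
    grpc_client_config:
        grpc_compression: "zstd"
querier:
    frontend_worker:
        grpc_client_config:
            grpc_compression: "zstd"
```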
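Ingester
For more information on configuration options, refer to this file.
The ingester is responsible for batching up traces and pushing them to TempoDB.
A live, or active, trace is a trace that has received a new batch of spans within a configured amount of time (default 10 seconds, set by ingester.trace_idle_period). After 10 seconds (or the configured amount of time), the trace is flushed to disk and appended to the WAL. When Tempo receives a new batch, a new live trace is created in memory.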
# Ingester configuration block
ingester:
# Lifecycler is responsible for managing the lifecycle of entries in the ring.
# For a complete list of config options check the lifecycler section under the ingester config at the following link -
# https://cortexmetrics.io/docs/configuration/configuration-file/#ingester_config
lifecycler:
ring:
# number of replicas of each span to make while pushing to the backend
replication_factor: 3
# set sidecar proxy port
[port: <int>]
# amount of time a trace must be idle before flushing it to the wal.
# (default: 10s)
[trace_idle_period: <duration>]
# how often to sweep all tenants and move traces from live -> wal -> completed blocks.
# (default: 10s)
[flush_check_period: <duration>]
# maximum size of a block before cutting it
# (default: 524288000 = 500MB)
[max_block_bytes: <int>]
# maximum length of time before cutting a block
# (default: 30m)
[max_block_duration: <duration>]
# duration to keep blocks in the ingester after they have been flushed
# (default: 15m)
[complete_block_timeout: <duration>]
# Flush all traces to backend when ingester is stopped
[flush_all_on_shutdown: <bool> | default = false]
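A concrete ingester fragment built from the options above might look like the following sketch. The values are illustrative: most repeat the documented defaults, and the shorter max_block_duration is a hypothetical tuning choice.

```yaml
ingester:
    lifecycler:
        ring:
            # Number of replicas of each span, as in the reference above.
            replication_factor: 3
    # A trace with no new spans for 10s is flushed to the WAL (documented default).
    trace_idle_period: 10s
    # Sweep tenants every 10s (documented default).
    flush_check_period: 10s
    # Hypothetical: cut blocks after 15m instead of the default 30m.
    max_block_duration: 15m
    # Keep flushed blocks in the ingester for 15m (documented default).
    complete_block_timeout: 15m
```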
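Metrics-generator
For more information on configuration options, refer to this file.
The metrics-generator processes spans and writes metrics using the Prometheus remote write protocol. For more information on the metrics-generator, refer to the Metrics-generator documentation.
Metrics-generator processors are disabled by default. To enable them for a specific tenant, set metrics_generator.processors in the overrides section.
Note
If you want to enable the metrics-generator for your Grafana Cloud account, refer to the Metrics-generator in Grafana Cloud documentation.
You can use metrics_ingestion_time_range_slack to restrict metrics generation to spans whose end times fall within the configured duration. In Grafana Cloud, this value defaults to 30 seconds, so any span sent to the metrics-generator with an end time more than 30 seconds in the past is discarded or rejected.
For more information about the local-blocks configuration option, refer to TraceQL metrics.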
# Metrics-generator configuration block
metrics_generator:
# Ring configuration
ring:
kvstore: <KVStore config>
[store: <string> | default = memberlist]
[prefix: <string> | default = "collectors/"]
# Period at which to heartbeat the instance
# 0 disables heartbeat altogether
[heartbeat_period: <duration> | default = 5s]
# The heartbeat timeout, after which, the instance is skipped.
# 0 disables timeout.
[heartbeat_timeout: <duration> | default = 1m]
# Our Instance ID to register as in the ring.
[instance_id: <string> | default = os.Hostname()]
# Name of the network interface to read address from.
[instance_interface_names: <list of string> | default = ["eth0", "en0"] ]
# Our advertised IP address in the ring (useful if the local IP differs from the external IP).
# Will default to the configured `instance_id` ip address,
# if unset, will fallback to ip reported by `instance_interface_names`
# (Affected by `enable_inet6`)
[instance_addr: <string> | default = auto(instance_id, instance_interface_names)]
# Our advertised port in the ring
# Defaults to the configured gRPC listening port
[instance_port: <int> | default = auto(listen_port)]
# Enables the registering of ipv6 addresses in the ring.
[enable_inet6: <bool> | default = false]
# Processor-specific configuration
processor:
service_graphs:
# Wait is the value to wait for an edge to be completed.
[wait: <duration> | default = 10s]
# MaxItems is the amount of edges that will be stored in the store.
[max_items: <int> | default = 10000]
# Workers is the amount of workers that will be used to process the edges
[workers: <int> | default = 10]
# Buckets for the latency histogram in seconds.
[histogram_buckets: <list of float> | default = 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]
# Additional dimensions to add to the metrics. Dimensions are searched for in the
# resource and span attributes and are added to the metrics if present.
[dimensions: <list of string>]
# Prefix additional dimensions with "client_" and "_server". Adds two labels
# per additional dimension instead of one.
[enable_client_server_prefix: <bool> | default = false]
# If enabled another histogram will be produced for interactions over messaging systems middlewares
# If this feature is relevant over long time ranges (high latencies) - consider increasing
# `wait` value for this processor.
[enable_messaging_system_latency_histogram: <bool> | default = false]
# Attributes that will be used to create a peer edge
# Attributes are searched in the order they are provided
# See: https://pkg.go.dev/go.opentelemetry.io/otel/semconv/v1.18.0
# Example: ["peer.service", "db.name", "db.system", "host.name"]
[peer_attributes: <list of string> | default = ["peer.service", "db.name", "db.system"] ]
# Attribute Key to multiply span metrics
# Note that the attribute name is searched for in both
# resource and span level attributes
[span_multiplier_key: <string> | default = ""]
# Enables additional labels for services and virtual nodes.
[enable_virtual_node_label: <bool> | default = false]
span_metrics:
# Buckets for the latency histogram in seconds.
[histogram_buckets: <list of float> | default = 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024, 2.048, 4.096, 8.192, 16.384]
# Configure intrinsic dimensions to add to the metrics. Intrinsic dimensions are taken
# directly from the respective resource and span properties.
intrinsic_dimensions:
# Whether to add the name of the service the span is associated with.
[service: <bool> | default = true]
# Whether to add the name of the span.
[span_name: <bool> | default = true]
# Whether to add the span kind describing the relationship between spans.
[span_kind: <bool> | default = true]
# Whether to add the span status code.
[status_code: <bool> | default = true]
# Whether to add a status message. Important note: The span status message may
# contain arbitrary strings and thus have a very high cardinality.
[status_message: <bool> | default = false]
# Additional dimensions to add to the metrics along with the intrinsic dimensions.
# Dimensions are searched for in the resource and span attributes and are added to
# the metrics if present.
[dimensions: <list of string>]
# Custom labeling mapping
dimension_mappings: <list of label mappings>
# The new label name
- [name: <string>]
# The actual attributes that will make the value of the new label
[source_labels: <list of strings>]
# The separator used to join multiple `source_labels`
[join: <string>]
# Enable traces_target_info metrics
[enable_target_info: <bool> | default = false]
# Attribute Key to multiply span metrics
# Note that the attribute name is searched for in both
# resource and span level attributes
[span_multiplier_key: <string> | default = ""]
# List of policies that will be applied to spans for inclusion or exclusion.
[filter_policies: <list of filter policies config> | default = []]
# Drop specific labels from `traces_target_info` metrics
[target_info_excluded_dimensions: <list of string>]
local_blocks:
# Block configuration
block: <Block config>
# Search configuration
search: <Search config>
# How often to run the flush loop to cut idle traces and blocks
[flush_check_period: <duration> | default = 10s]
# A trace is considered complete after this period of inactivity (no new spans received)
[trace_idle_period: <duration> | default = 10s]
# Maximum duration which the head block can be appended to, before cutting it.
[max_block_duration: <duration> | default = 1m]
# Maximum size of the head block, before cutting it
[max_block_bytes: <uint64> | default = 500000000]
# Duration to keep blocks in the ingester after they have been flushed
[complete_block_timeout: <duration> | default = 1h]
# Maximum amount of live traces
# If this value is exceeded, traces will be dropped with reason: `live_traces_exceeded`
# A value of 0 disables this limit.
[max_live_traces: <uint64>]
# Whether server spans should be filtered in or not.
# If enabled, only parent spans or spans with the SpanKind of `server` will be retained
[filter_server_spans: <bool> | default = true]
# Whether server spans should be flushed to storage.
# Setting `flush_to_storage` to `true` ensures that metrics blocks are flushed to storage so TraceQL metrics queries can run against historical data.
[flush_to_storage: <bool> | default = false]
# Number of blocks that are allowed to be processed concurrently.
[concurrent_blocks: <uint> | default = 10]
# A tuning factor that controls whether the trace-level timestamp columns are used in a metrics query.
# If a block overlaps the time window by less than this ratio, then the columns are skipped.
# A value of 1.0 will always load the columns, and 0.0 will never load any.
[time_overlap_cutoff: <float64> | default = 0.2]
# Registry configuration
registry:
# Interval to collect metrics and remote write them.
[collection_interval: <duration> | default = 15s]
# Interval after which a series is considered stale and will be deleted from the registry.
# Once a metrics series is deleted, it won't be emitted anymore, keeping active series low.
[stale_duration: <duration> | default = 15m]
# A list of labels that will be added to all generated metrics.
[external_labels: <map>]
# If set, the tenant ID will be added as a label with the given label name to all generated metrics.
[inject_tenant_id_as: <string>]
# The maximum length of label names. Label names exceeding this limit will be truncated.
[max_label_name_length: <int> | default = 1024]
# The maximum length of label values. Label values exceeding this limit will be truncated.
[max_label_value_length: <int> | default = 2048]
# Configuration block for the Write Ahead Log (WAL)
traces_storage: <WAL config>
# Path to store the WAL files.
# Must be set.
# Example: "/var/tempo/generator/traces"
[path: <string> | default = ""]
# Storage and remote write configuration
storage:
# Path to store the WAL. Each tenant will be stored in its own subdirectory.
path: <string>
# Configuration for the Prometheus Agent WAL
# https://github.com/prometheus/prometheus/blob/v2.51.2/tsdb/agent/db.go#L62-L84
wal: <prometheus agent WAL config>
# How long to wait when flushing samples on shutdown
[remote_write_flush_deadline: <duration> | default = 1m]
# Whether to add X-Scope-OrgID header in remote write requests
[remote_write_add_org_id_header: <bool> | default = true]
# A list of remote write endpoints.
# https://prometheus.ac.cn/docs/prometheus/latest/configuration/configuration/#remote_write
remote_write:
[- <Prometheus remote write config>]
# This option only allows spans with end times that occur within the configured duration to be
# considered in metrics generation.
# This is to filter out spans that are outdated.
[metrics_ingestion_time_range_slack: <duration> | default = 30s]
# Timeout for metric requests
[query_timeout: <duration> | default = 30s ]
# Overrides the key used to register the metrics-generator in the ring.
[override_ring_key: <string> | default = "metrics-generator"]
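To make the reference above more concrete, here is a minimal sketch of a metrics-generator block. The dimension name, external label, WAL path, and remote write URL are hypothetical. The per-tenant overrides setting that actually enables the processors is not shown, because the overrides section is outside this part of the document.

```yaml
metrics_generator:
    processor:
        service_graphs:
            # Wait up to 10s for an edge to complete (documented default).
            wait: 10s
            # Hypothetical extra dimension looked up in resource and span attributes.
            dimensions:
                - "k8s.namespace.name"
        span_metrics:
            intrinsic_dimensions:
                # Left off because status messages can have very high cardinality.
                status_message: false
    registry:
        collection_interval: 15s
        # Hypothetical label added to all generated metrics.
        external_labels:
            cluster: "example-cluster"
    storage:
        # Hypothetical WAL path; each tenant gets its own subdirectory.
        path: /var/tempo/generator/wal
        remote_write:
            # Hypothetical Prometheus-compatible remote write endpoint.
            - url: http://prometheus.example.internal:9090/api/v1/write
    traces_storage:
        # Example path taken from the reference above.
        path: /var/tempo/generator/traces
    # Only spans that ended within the last 30s are considered (documented default).
    metrics_ingestion_time_range_slack: 30s
```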
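Query-frontend
For more information on configuration options, refer to this file.
The query-frontend is responsible for sharding incoming requests so that the queriers can process them faster in parallel.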
# Query Frontend configuration block
query_frontend:
# number of times to retry a request sent to a querier
# (default: 2)
[max_retries: <int>]
# The number of goroutines dedicated to consuming, unmarshalling and recombining responses per request. This
# same parameter is used for all endpoints.
# (default: 10)
[response_consumers: <int>]
# Maximum number of outstanding requests per tenant per frontend; requests beyond this error with HTTP 429.
# (default: 2000)
[max_outstanding_per_tenant: <int>]
# The number of jobs to batch together in one http request to the querier. Set to 1 to
# disable.
# (default: 5)
[max_batch_size: <int>]
# Enable multi-tenant queries.
# If enabled, queries can be federated across multiple tenants.
# The tenant IDs involved need to be specified separated by a '|'
# character in the 'X-Scope-OrgID' header.
# note: this is no-op if cluster doesn't have `multitenancy_enabled: true`
# (default: true)
[multi_tenant_queries_enabled: <bool>]
# Comma-separated list of request header names to include in query logs. Applies
# to both query stats and slow queries logs.
[log_query_request_headers: <string> | default = ""]
# Set a maximum timeout for all api queries at which point the frontend will cancel queued jobs
# and return cleanly. HTTP will return a 503 and GRPC will return a context canceled error.
# This timeout impacts all http and grpc streaming queries as part of the Tempo api surface such as
# search, metrics summary, tags and tag values lookups, etc.
# Generally it is preferred to let the client cancel context. This is a failsafe to prevent a client
# from imposing more work on Tempo than desired.
# (default: 0)
[api_timeout: <duration>]
# A list of regular expressions for refusing matching requests, these will apply for every request regardless of the endpoint.
[url_deny_list: <list of strings> | default = <empty list>]
# Max allowed TraceQL expression size, in bytes. Queries bigger than this size will be rejected.
# (default: 128 KiB)
[max_query_expression_size_bytes: <int> | default = 131072]
search:
# The number of concurrent jobs to execute when searching the backend.
# (default: 1000)
[concurrent_jobs: <int>]
# The target number of bytes for each job to handle when performing a backend search.
# (default: 104857600)
[target_bytes_per_job: <int>]
# Limit used for search requests if none is set by the caller
# (default: 20)
[default_result_limit: <int>]
# The maximum allowed value of the limit parameter on search requests. If the search request limit parameter
# exceeds the value configured here the frontend will return a 400.
# The default value of 0 disables this limit.
# (default: 0)
[max_result_limit: <int>]
# The maximum allowed time range for a search.
# 0 disables this limit.
# (default: 168h)
[max_duration: <duration>]
# query_backend_after and query_ingesters_until together control where the query-frontend searches for traces.
# Time ranges before query_ingesters_until will be searched in the ingesters only.
# Time ranges after query_backend_after will be searched in the backend/object storage only.
# Time ranges from query_backend_after through query_ingesters_until will be queried from both locations.
# query_backend_after must be less than or equal to query_ingesters_until.
# (default: 15m)
[query_backend_after: <duration>]
# (default: 30m)
[query_ingesters_until: <duration>]
# If set to a non-zero value, its value will be used to decide if query is within SLO or not.
# Query is within SLO if it returned 200 within duration_slo seconds OR processed throughput_slo bytes/s data.
# NOTE: Requires `duration_slo` AND `throughput_bytes_slo` to be configured.
[duration_slo: <duration> | default = 0s ]
# If set to a non-zero value, its value will be used to decide if query is within SLO or not.
# Query is within SLO if it returned 200 within duration_slo seconds OR processed throughput_slo bytes/s data.
[throughput_bytes_slo: <float> | default = 0 ]
# The number of shards to break ingester queries into.
[ingester_shards: <int> | default = 3]
# The maximum allowed value of spans per span set. 0 disables this limit.
[max_spans_per_span_set: <int> | default = 100]
# SLO configuration for Metadata (tags and tag values) endpoints.
metadata_slo:
# If set to a non-zero value, its value will be used to decide if metadata query is within SLO or not.
# Query is within SLO if it returned 200 within duration_slo seconds OR processed throughput_slo bytes/s data.
# NOTE: Requires `duration_slo` AND `throughput_bytes_slo` to be configured.
[duration_slo: <duration> | default = 0s ]
# If set to a non-zero value, its value will be used to decide if metadata query is within SLO or not.
# Query is within SLO if it returned 200 within duration_slo seconds OR processed throughput_slo bytes/s data.
[throughput_bytes_slo: <float> | default = 0 ]
# Trace by ID lookup configuration
trace_by_id:
# The number of shards to split a trace by id query into.
# (default: 50)
[query_shards: <int>]
# The maximum number of shards to execute at once. If set to 0 query_shards is used.
# (default: 0)
[concurrent_shards: <int>]
# If set to a non-zero value, its value will be used to decide if query is within SLO or not.
# Query is within SLO if it returned 200 within duration_slo seconds.
[duration_slo: <duration> | default = 0s ]
# Metrics query configuration
metrics:
# The number of concurrent jobs to execute when querying the backend.
[concurrent_jobs: <int> | default = 1000 ]
# The target number of bytes for each job to handle when querying the backend.
[target_bytes_per_job: <int> | default = 100MiB ]
# The maximum allowed time range for a metrics query.
# 0 disables this limit.
[max_duration: <duration> | default = 3h ]
# Maximum number of exemplars per range query. Limited to 100.
[max_exemplars: <int> | default = 100 ]
# query_backend_after controls where the query-frontend searches for traces.
# Time ranges older than query_backend_after will be searched in the backend/object storage only.
# Time ranges between query_backend_after and now will be queried from the metrics-generators.
[query_backend_after: <duration> | default = 30m ]
# The target length of time for each job to handle when querying the backend.
[interval: <duration> | default = 5m ]
# If set to a non-zero value, its value will be used to decide if query is within SLO or not.
# Query is within SLO if it returned 200 within duration_slo seconds OR processed throughput_slo bytes/s data.
# NOTE: `duration_slo` and `throughput_bytes_slo` both must be configured for it to work
[duration_slo: <duration> | default = 0s ]
# If set to a non-zero value, its value will be used to decide if query is within SLO or not.
# Query is within SLO if it returned 200 within duration_slo seconds OR processed throughput_slo bytes/s data.
[throughput_bytes_slo: <float> | default = 0 ]
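The following sketch shows how a few of the options above fit together in a query-frontend block. The time windows repeat the documented defaults, while the SLO values are hypothetical.

```yaml
query_frontend:
    max_retries: 2
    search:
        # Reject searches spanning more than 7 days (documented default).
        max_duration: 168h
        # Documented defaults; query_backend_after must be less than or
        # equal to query_ingesters_until.
        query_backend_after: 15m
        query_ingesters_until: 30m
        # Hypothetical SLO: a query is within SLO if it returns 200 within 5s
        # or processes at least 1 GiB/s. Both values must be set.
        duration_slo: 5s
        throughput_bytes_slo: 1073741824
    trace_by_id:
        # Split each trace by ID lookup into 50 shards (documented default).
        query_shards: 50
        # Hypothetical SLO for trace lookups.
        duration_slo: 5s
```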
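Limit query size to improve performance and stability
Querying large trace data presents several challenges. Span sets with large numbers of spans impact query performance and stability. Likewise, excessively large query results can negatively impact query performance.
Limit the spans per span set
You can limit the maximum number of spans per span set by setting max_spans_per_span_set for the query-frontend. The default value is 100.
In Grafana or Grafana Cloud, you can use the Span Limit field in the TraceQL query editor in Grafana Explore. This field sets the maximum number of spans to return for each span set. The maximum value that you can set for Span Limit (or the spss query parameter) is controlled by max_spans_per_span_set. To disable the maximum spans per span set limit, set max_spans_per_span_set to 0. When set to 0, there is no maximum and users can enter any value in Span Limit. However, this can only be set by a Tempo administrator, not by the user.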
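As a sketch of the setting described above, the fragment below raises the cap on spans per span set. It assumes the option lives under the search block of query_frontend, which is where it appears in the reference earlier in this document. The value 500 is hypothetical.

```yaml
query_frontend:
    search:
        # Hypothetical cap: users can request up to 500 spans per span set.
        # The documented default is 100, and 0 removes the limit entirely.
        max_spans_per_span_set: 500
```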
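Cap the maximum query length
You can set the maximum length of a query using the query_frontend.max_query_expression_size_bytes configuration parameter for the query-frontend. The default value is 128 KiB (131072 bytes).
This limit protects the system's stability from potential abuse or mistakes when running large, potentially expensive queries.
You can lower or raise the value by setting it in the query_frontend configuration section, for example: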
query_frontend:
    max_query_expression_size_bytes: 10000
Querier
For more information on configuration options, refer to this file.
The querier is responsible for querying the backends/cache for the traceID.
# querier config block
querier:
# The query frontend turns both trace by id (/api/traces/<id>) and search (/api/search?<params>) requests
# into subqueries that are then pulled and serviced by the queriers.
# This value controls the overall number of simultaneous subqueries that the querier will service at once. It does
# not distinguish between the types of queries.
[max_concurrent_queries: <int> | default = 20]
# If shuffle sharding is enabled, queriers fetch in-memory traces from the minimum set of required ingesters,
# selecting only ingesters which might have received series since now - <ingester flush period>. Otherwise, the
# request is sent to all ingesters.
[shuffle_sharding_ingesters_enabled: <bool> | default = true]
# Lookback period to include ingesters that were part of the shuffle sharded subring.
[shuffle_sharding_ingesters_lookback_period: <duration> | default = 1h]
# The query frontend sends sharded requests to ingesters and querier (/api/traces/<id>)
# By default, all healthy ingesters are queried for the trace id.
# When true the querier will hash the trace id in the same way that distributors do and then
# only query those ingesters who own the trace id hash as determined by the ring.
# If this parameter is set, the number of 404s could increase during rollout or scaling of ingesters.
[query_relevant_ingesters: <bool> | default = false]
trace_by_id:
# Timeout for trace lookup requests
[query_timeout: <duration> | default = 10s]
search:
# Timeout for search requests
[query_timeout: <duration> | default = 30s]
# NOTE: The Tempo serverless feature is now deprecated and will be removed in an upcoming release.
# A list of external endpoints that the querier will use to offload backend search requests. They must
# take and return the same value as /api/search endpoint on the querier. This is intended to be
# used with serverless technologies for massive parallelization of the search path.
# The default value of "" disables this feature.
[external_endpoints: <list of strings> | default = <empty list>]
# If search_external_endpoints is set then the querier will primarily act as a proxy for whatever serverless backend
# you have configured. This setting allows the operator to have the querier prefer itself for a configurable
# number of subqueries. In the default case of 10 the querier will process up to 10 search request subqueries before starting
# to reach out to search_external_endpoints.
# Setting this to 0 will disable this feature and the querier will proxy all search subqueries to search_external_endpoints.
[prefer_self: <int> | default = 10 ]
# If set to a non-zero value a second request will be issued at the provided duration. Recommended to
# be set to p99 of external search requests to reduce long tail latency.
# (default: 8s)
[external_hedge_requests_at: <duration>]
# The maximum number of requests to execute when hedging. Requires external_hedge_requests_at to be set.
# (default: 2)
[external_hedge_requests_up_to: <int>]
# The serverless backend to use. If external_backend is set, then authorization credentials will be provided
# when querying the external endpoints. "google_cloud_run" is the only value supported at this time.
# The default value of "" omits credentials when querying the external backend.
[external_backend: <string> | default = ""]
# Google Cloud Run configuration. Will be used only if the value of external_backend is "google_cloud_run".
google_cloud_run:
# A list of external endpoints that the querier will use to offload backend search requests. They must
# take and return the same value as /api/search endpoint on the querier. This is intended to be
# used with serverless technologies for massive parallelization of the search path.
# The default value of "" disables this feature.
[external_endpoints: <list of strings> | default = <empty list>]
# config of the worker that connects to the query frontend
frontend_worker:
# the address of the query frontend to connect to, and process queries
# Example: "frontend_address: query-frontend-discovery.default.svc.cluster.local:9095"
[frontend_address: <string>]
The querier also queries compacted blocks that fall within (2 * BlocklistPoll), where the blocklist poll duration is defined in the storage section below.
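A small querier block based on the options above could look like the following sketch. The timeouts and concurrency repeat the documented defaults, and the frontend address is the example value from the reference comments; adjust it to your own service discovery name.

```yaml
querier:
    # Total number of subqueries serviced at once, regardless of query type.
    max_concurrent_queries: 20
    trace_by_id:
        query_timeout: 10s
    search:
        query_timeout: 30s
    frontend_worker:
        # Example address from the reference above; replace with your own.
        frontend_address: query-frontend-discovery.default.svc.cluster.local:9095
```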
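Compactor
For more information on configuration options, refer to this file.
Compactors stream blocks from the storage backend, combine them, and write them back. The values shown below are the defaults.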
compactor:
# Optional. Disables backend compaction. Default is false.
# Note: This should only be used in a non-production context for debugging purposes. This will allow blocks to stay in the backend for further investigation if desired.
[disabled: <bool>]
ring:
kvstore: <KVStore config>
[store: <string> | default = memberlist]
[prefix: <string> | default = "collectors/" ]
compaction:
# Optional. Duration to keep blocks. Default is 14 days (336h).
[block_retention: <duration>]
# Optional. Duration to keep blocks that have been compacted elsewhere. Default is 1h.
[compacted_block_retention: <duration>]
# Optional. Blocks in this time window will be compacted together. Default is 1h.
[compaction_window: <duration>]
# Optional. Maximum number of traces in a compacted block. Default is 6 million.
# WARNING: Deprecated. Use max_block_bytes instead.
[max_compaction_objects: <int>]
# Optional. Maximum size of a compacted block in bytes. Default is 100 GB.
[max_block_bytes: <int>]
# Optional. Number of tenants to process in parallel during retention. Default is 10.
[retention_concurrency: <int>]
# Optional. The maximum amount of time to spend compacting a single tenant before moving to the next. Default is 5m.
[max_time_per_tenant: <duration>]
# Optional. The time between compaction cycles. Default is 30s.
# Note: The default will be used if the value is set to 0.
[compaction_cycle: <duration>]
# Optional. Amount of data to buffer from input blocks. Default is 5 MiB.
[v2_in_buffer_bytes: <int>]
# Optional. Flush data to backend when buffer is this large. Default is 20 MB.
[v2_out_buffer_bytes: <int>]
# Optional. Number of traces to buffer in memory during compaction. Increasing may improve performance but will also increase memory usage. Default is 1000.
[v2_prefetch_traces_count: <int>]
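For illustration, a compactor block that shortens retention might look like the sketch below. The 7-day retention is a hypothetical value; the other settings repeat the documented defaults.

```yaml
compactor:
    compaction:
        # Hypothetical: keep blocks for 7 days instead of the default 14 days (336h).
        block_retention: 168h
        # Keep blocks that were compacted elsewhere for 1h (documented default).
        compacted_block_retention: 1h
        # Compact blocks within the same 1h window together (documented default).
        compaction_window: 1h
        # Spend at most 5m on one tenant before moving on (documented default).
        max_time_per_tenant: 5m
```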
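Storage
Tempo supports Amazon S3, GCS, Azure, and the local file system for storage. In addition, you can use Memcached or Redis to improve query performance.
For more information on configuration options, refer to this file.
Local storage recommendations
While you can use local storage, object storage is recommended for production workloads. A local backend does not correctly retrieve traces in a distributed deployment unless all components have access to the same disk. Tempo is designed for object storage more than local storage.
At Grafana Labs, we have run Tempo with SSDs when using local storage. Hard drives have not been tested.
You can estimate how much storage space you need by considering the ingested bytes and retention. For example, ingested bytes per day times retention days = stored bytes.
You cannot use both local and object storage in the same Tempo deployment.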
存储块配置示例
storage 块配置 TempoDB。以下示例显示了常用选项。有关更多特定于平台的信息,请参阅以下内容
# Storage configuration for traces
storage:
trace:
# The storage backend to use
# Should be one of "gcs", "s3", "azure" or "local" (only supported in the monolithic mode)
# CLI flag -storage.trace.backend
[backend: <string>]
# GCS configuration. Will be used only if value of backend is "gcs"
# Check the GCS doc within this folder for information on GCS specific permissions.
gcs:
# Bucket name in gcs
# Tempo requires a bucket to maintain a top-level object structure. You can use prefix option with this to nest all objects within a shared bucket.
# Example: "bucket_name: tempo"
[bucket_name: <string>]
# optional.
# Prefix name in gcs
# Tempo has this additional option to support a custom prefix to nest all the objects within a shared bucket.
[prefix: <string>]
# Buffer size for reads. Default is 10MB
# Example: "chunk_buffer_size: 5_000_000"
[chunk_buffer_size: <int>]
# Optional
# Api endpoint override
# Example: "endpoint: https://storage.googleapis.com/storage/v1/"
[endpoint: <string>]
# Optional. Default is false.
# Example: "insecure: true"
# Set to true to disable authentication and certificate checks on gcs requests
[insecure: <bool>]
# The number of list calls to make in parallel to the backend per instance.
# Adjustments here will impact the polling time, as well as the number of Go routines.
# Default is 3
[list_blocks_concurrency: <int>]
# Optional. Default is 0 (disabled)
# Example: "hedge_requests_at: 500ms"
# If set to a non-zero value a second request will be issued at the provided duration. Recommended to
# be set to p99 of GCS requests to reduce long tail latency. This setting is most impactful when
# used with queriers and has minimal to no impact on other pieces.
[hedge_requests_at: <duration>]
# Optional. Default is 2
# Example: "hedge_requests_up_to: 2"
# The maximum number of requests to execute when hedging. Requires hedge_requests_at to be set.
[hedge_requests_up_to: <int>]
# Optional
# Example: "object_cache_control: "no-cache""
# A string to specify the behavior with respect to caching of the objects stored in GCS.
# See the GCS documentation for more detail: https://cloud.google.com/storage/docs/metadata
[object_cache_control: <string>]
# Optional
# Example: "object_metadata: {'key': 'value'}"
# A map of key value strings for user metadata to store on the GCS objects.
# See the GCS documentation for more detail: https://cloud.google.com/storage/docs/metadata
[object_metadata: <map[string]string>]
# S3 configuration. Will be used only if value of backend is "s3"
# Check the S3 doc within this folder for information on s3 specific permissions.
s3:
# Bucket name in s3
# Tempo requires a bucket to maintain a top-level object structure. You can use prefix option with this to nest all objects within a shared bucket.
[bucket: <string>]
# optional.
# Prefix name in s3
# Tempo has this additional option to support a custom prefix to nest all the objects within a shared bucket.
[prefix: <string>]
# api endpoint to connect to. use AWS S3 or any S3 compatible object storage endpoint.
# Example: "endpoint: s3.dualstack.us-east-2.amazonaws.com"
[endpoint: <string>]
# The number of list calls to make in parallel to the backend per instance.
# Adjustments here will impact the polling time, as well as the number of Go routines.
# Default is 3
[list_blocks_concurrency: <int>]
# optional.
# By default the region is inferred from the endpoint,
# but is required for some S3-compatible storage engines.
# Example: "region: us-east-2"
[region: <string>]
# optional.
# access key when using static credentials.
[access_key: <string>]
# optional.
# secret key when using static credentials.
[secret_key: <string>]
# optional.
# session token when using static credentials.
[session_token: <string>]
# optional.
# enable if endpoint is http
[insecure: <bool>]
# optional.
# Path to the client certificate file.
[tls_cert_path: <string>]
# optional.
# Path to the private client key file.
[tls_key_path: <string>]
# optional.
# Path to the CA certificate file.
[tls_ca_path: <string>]
# optional.
# Override the expected name on the server certificate.
[tls_server_name: <string>]
# optional.
# Set to true to disable verification of a TLS endpoint. The default value is false.
[tls_insecure_skip_verify: <bool>]
# optional.
# Override the default cipher suite list, separated by commas.
[tls_cipher_suites: <string>]
# optional.
# Override the default minimum TLS version. The default value is VersionTLS12. Allowed values: VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13
[tls_min_version: <string>]
# optional.
# enable to use path-style requests.
[forcepathstyle: <bool>]
# Optional.
# Enable to use dualstack endpoint for DNS resolution.
# Check out the [S3 documentation on dualstack endpoints](https://docs.aws.amazon.com/AmazonS3/latest/userguide/dual-stack-endpoints.html)
[enable_dual_stack: <bool>]
# Optional. Default is 0
# Example: "bucket_lookup_type: 0"
# options: 0: BucketLookupAuto, 1: BucketLookupDNS, 2: BucketLookupPath
# See the [S3 documentation on virtual-hosted–style and path-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) for more detail.
# See the [Minio-API documentation on opts.BucketLookup](https://github.com/minio/minio-go/blob/master/docs/API.md#newendpoint-string-opts-options-client-error) for more detail.
# Note: this option is ignored if `forcepathstyle` is set to true. It exposes the bucket lookup setting of the minio SDK.
[bucket_lookup_type: <int> | default = 0]
# Optional. Default is 0 (disabled)
# Example: "hedge_requests_at: 500ms"
# If set to a non-zero value a second request will be issued at the provided duration. Recommended to
# be set to p99 of S3 requests to reduce long tail latency. This setting is most impactful when
# used with queriers and has minimal to no impact on other pieces.
[hedge_requests_at: <duration>]
# Optional. Default is 2
# Example: "hedge_requests_up_to: 2"
# The maximum number of requests to execute when hedging. Requires hedge_requests_at to be set.
[hedge_requests_up_to: <int>]
# Optional
# Example: "tags: {'key': 'value'}"
# A map of key value strings for user tags to store on the S3 objects. This helps set up filters in S3 lifecycles.
# See the [S3 documentation on object tagging](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html) for more detail.
[tags: <map[string]string>]
# azure configuration. Will be used only if value of backend is "azure"
# EXPERIMENTAL
azure:
# store traces in this container.
# Tempo requires a bucket to maintain a top-level object structure. You can use the prefix option to nest all objects within a shared bucket.
[container_name: <string>]
# optional.
# Prefix for azure.
# Tempo has this additional option to support a custom prefix to nest all the objects within a shared bucket.
[prefix: <string>]
# optional.
# Azure endpoint to use. Defaults to Azure global (core.windows.net). For other
# regions this needs to be changed, e.g. Azure China (blob.core.chinacloudapi.cn),
# Azure German (blob.core.cloudapi.de), Azure US Government (blob.core.usgovcloudapi.net).
[endpoint_suffix: <string>]
# Name of the azure storage account
[storage_account_name: <string>]
# optional.
# access key when using access key credentials.
[storage_account_key: <string>]
# optional.
# use Azure Managed Identity to access Azure storage.
[use_managed_identity: <bool>]
# optional.
# Use a Federated Token to authenticate to the Azure storage account.
# Enable if you want to use Azure Workload Identity. Expects AZURE_CLIENT_ID,
# AZURE_TENANT_ID, AZURE_AUTHORITY_HOST and AZURE_FEDERATED_TOKEN_FILE envs to be present
# (these are set automatically when using Azure Workload Identity).
[use_federated_token: <bool>]
# optional.
# The Client ID for the user-assigned Azure Managed Identity used to access Azure storage.
[user_assigned_id: <string>]
# Optional. Default is 0 (disabled)
# Example: "hedge_requests_at: 500ms"
# If set to a non-zero value a second request will be issued at the provided duration. Recommended to
# be set to p99 of Azure Blob Storage requests to reduce long tail latency. This setting is most impactful when
# used with queriers and has minimal to no impact on other pieces.
[hedge_requests_at: <duration>]
# Optional. Default is 2
# Example: "hedge_requests_up_to: 2"
# The maximum number of requests to execute when hedging. Requires hedge_requests_at to be set.
[hedge_requests_up_to: <int>]
# How often to repoll the backend for new blocks. Default is 5m
[blocklist_poll: <duration>]
# Number of blocks to process in parallel during polling. Default is 50.
[blocklist_poll_concurrency: <int>]
# By default components will pull the blocklist from the tenant index. If that fails the component can
# fallback to scanning the entire bucket. Set to false to disable this behavior. Default is true.
[blocklist_poll_fallback: <bool>]
# Maximum number of compactors that should build the tenant index. All other components will download
# the index. Default 2.
[blocklist_poll_tenant_index_builders: <int>]
# Number of tenants to poll concurrently. Default is 1.
[blocklist_poll_tenant_concurrency: <int>]
# The oldest allowable tenant index. If an index is pulled that is older than this duration,
# the polling will consider this an error. Note that `blocklist_poll_fallback` applies here.
# If fallback is true and a tenant index exceeds this duration, it will fall back to listing
# the bucket contents.
# Default 0 (disabled).
[blocklist_poll_stale_tenant_index: <duration>]
# Offsets the concurrent blocklist polling by a random amount. The maximum amount of offset
# is the provided value in milliseconds. This configuration value can be used if the polling
# cycle is overwhelming your backend with concurrent requests.
# Default 0 (disabled)
[blocklist_poll_jitter_ms: <int>]
# Polling will tolerate this many consecutive errors during the poll of
# a single tenant before marking the tenant as failed.
# This can be set to 0 which means a single error is sufficient to mark the tenant failed
# and exit early. Any previous results for the failing tenant will be kept.
# See also `blocklist_poll_tolerate_tenant_failures` below.
# Default 1
[blocklist_poll_tolerate_consecutive_errors: <int>]
# Polling will tolerate this number of tenants which have failed to poll.
# This can be set to 0 which means a single tenant failure sufficient to fail and exit
# early.
# Default 1
[blocklist_poll_tolerate_tenant_failures: <int>]
# Used to tune how quickly the poller will delete any remaining backend
# objects found in the tenant path. This functionality requires enabling
# `empty_tenant_deletion_enabled` below.
# Default: 12h
[empty_tenant_deletion_age: <duration>]
# Polling will delete the index for a tenant if no blocks are found to
# exist. If this setting is enabled, the poller will also delete any
# remaining backend objects found in the tenant path. This is used to
# clean up partial blocks which may have not been cleaned up by the
# retention.
[empty_tenant_deletion_enabled: <bool> | default = false]
# Cache type to use. Should be one of "redis", "memcached"
# Example: "cache: memcached"
# Deprecated. See [cache](#cache) section.
[cache: <string>]
# Minimum compaction level of block to qualify for bloom filter caching. Default is 0 (disabled), meaning
# that compaction level is not used to determine if the bloom filter should be cached.
# Example: "cache_min_compaction_level: 2"
[cache_min_compaction_level: <int>]
# Max block age to qualify for bloom filter caching. Default is 0 (disabled), meaning that block age is not
# used to determine if the bloom filter should be cached.
# Example: "cache_max_block_age: 48h"
[cache_max_block_age: <duration>]
# Configuration parameters that impact trace search
search: <Search config>
# Background cache configuration. Requires having a cache configured.
# Deprecated. See [cache](#cache) section.
background_cache:
# Memcached caching configuration block
# Deprecated. See [cache](#cache) section.
memcached:
# Redis configuration block
# EXPERIMENTAL
# Deprecated. See [cache](#cache) section.
redis:
# the worker pool is used primarily when finding traces by id, but is also used by other
pool:
# total number of workers pulling jobs from the queue
[max_workers: <int> | default = 30]
# length of job queue. important for querier as it queues a job for every block it has to search
[queue_depth: <int> | default = 10000 ]
# configuration block for the Write Ahead Log (WAL)
wal: <WAL config>
[path: <string> | default = "/var/tempo/wal"]
[v2_encoding: <string> | default = snappy]
[search_encoding: <string> | default = none]
[ingestion_time_range_slack: <duration> | default = 2m]
# block configuration
block: <Block config>
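To show how the pieces above might combine, here is a minimal sketch of a storage block that uses the S3 backend. The bucket name and prefix are placeholders invented for illustration, the endpoint and region reuse the example values from the comments above, and static credentials are left out on purpose. Treat the nesting as an assumption to verify against your Tempo version.

```yaml
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces        # placeholder bucket name
      prefix: tempo               # placeholder prefix inside a shared bucket
      endpoint: s3.dualstack.us-east-2.amazonaws.com
      region: us-east-2
      # Issue a second request after 500ms, with at most 2 requests in total.
      hedge_requests_at: 500ms
      hedge_requests_up_to: 2
    # Documented defaults, written out for clarity.
    blocklist_poll: 5m
    pool:
      max_workers: 30
      queue_depth: 10000
    wal:
      path: /var/tempo/wal
    block:
      version: vParquet4
```

Only the `s3` block is read here because `backend` is set to `s3`. The `gcs` and `azure` blocks are ignored unless `backend` names them.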
Memberlist
Memberlist is the default mechanism for all of the Tempo components to coordinate with each other.
memberlist:
# Name of the node in memberlist cluster. Defaults to hostname.
[node_name: <string> | default = ""]
# Add random suffix to the node name.
[randomize_node_name: <boolean> | default = true]
# The timeout for establishing a connection with a remote node, and for
# read/write operations.
[stream_timeout: <duration> | default = 10s]
# Multiplication factor used when sending out messages (factor * log(N+1)).
[retransmit_factor: <int> | default = 2]
# How often to use pull/push sync.
[pull_push_interval: <duration> | default = 30s]
# How often to gossip.
[gossip_interval: <duration> | default = 1s]
# How many nodes to gossip to.
[gossip_nodes: <int> | default = 2]
# How long to keep gossiping to dead nodes, to give them a chance to refute their
# death.
[gossip_to_dead_nodes_time: <duration> | default = 30s]
# How soon a dead node's name can be reclaimed with a new address. Defaults to 0,
# which is disabled.
[dead_node_reclaim_time: <duration> | default = 0s]
# Other cluster members to join. Can be specified multiple times. It can be an
# IP, hostname or an entry specified in the DNS Service Discovery format (see
# https://cortexmetrics.io/docs/configuration/arguments/#dns-service-discovery
# for more details).
# A "Headless" Cluster IP service in Kubernetes.
# Example:
# - gossip-ring.tracing.svc.cluster.local:7946
[join_members: <list of string> | default = ]
# Min backoff duration to join other cluster members.
[min_join_backoff: <duration> | default = 1s]
# Max backoff duration to join other cluster members.
[max_join_backoff: <duration> | default = 1m]
# Max number of retries to join other cluster members.
[max_join_retries: <int> | default = 10]
# If this node fails to join memberlist cluster, abort.
[abort_if_cluster_join_fails: <boolean> | default = true]
# If not 0, how often to rejoin the cluster. Occasional rejoin can help to fix
# the cluster split issue, and is harmless otherwise. For example when using
# only a few components as seed nodes (via -memberlist.join), then it's
# recommended to use rejoin. If -memberlist.join points to dynamic service that
# resolves to all gossiping nodes (eg. Kubernetes headless service), then rejoin
# is not needed.
[rejoin_interval: <duration> | default = 0s]
# How long to keep LEFT ingesters in the ring.
[left_ingesters_timeout: <duration> | default = 5m]
# Timeout for leaving memberlist cluster.
[leave_timeout: <duration> | default = 5s]
# IP address to listen on for gossip messages.
# Multiple addresses may be specified.
[bind_addr: <list of string> | default = ["0.0.0.0"] ]
# Port to listen on for gossip messages.
[bind_port: <int> | default = 7946]
# Timeout used when connecting to other nodes to send packet.
[packet_dial_timeout: <duration> | default = 5s]
# Timeout for writing 'packet' data.
[packet_write_timeout: <duration> | default = 5s]
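A short sketch of a memberlist block for a Kubernetes deployment might look like the following. The headless service address is the example from the comments above. The other values repeat the documented defaults and are shown only to make the shape of the block concrete.

```yaml
memberlist:
  bind_port: 7946
  # A headless service that resolves to all gossiping nodes.
  join_members:
    - gossip-ring.tracing.svc.cluster.local:7946
  abort_if_cluster_join_fails: true
  max_join_retries: 10
  # Rejoin is left disabled because the service resolves to all nodes.
  rejoin_interval: 0s
```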
Configuration blocks
Defines re-used configuration blocks.
Block config
# block format version. options: v2, vParquet2, vParquet3, vParquet4
[version: <string> | default = vParquet4]
# bloom filter false positive rate. lower values create larger filters but fewer false positives
[bloom_filter_false_positive: <float> | default = 0.01]
# maximum size of each bloom filter shard
[bloom_filter_shard_size_bytes: <int> | default = 100KiB]
# number of bytes per index record
[v2_index_downsample_bytes: <uint64> | default = 1MiB]
# block encoding/compression. options: none, gzip, lz4-64k, lz4-256k, lz4-1M, lz4, snappy, zstd, s2
[v2_encoding: <string> | default = zstd]
# search data encoding/compression. same options as block encoding.
[search_encoding: <string> | default = snappy]
# number of bytes per search page
[search_page_size_bytes: <int> | default = 1MiB]
# an estimate of the number of bytes per row group when cutting Parquet blocks. lower values will
# create larger footers but will be easier to shard when searching. It is difficult to calculate
# this field directly and it may vary based on workload. This is roughly a lower bound.
[parquet_row_group_size_bytes: <int> | default = 100MB]
# Configures attributes to be stored in dedicated columns within the parquet file, rather than in the
# generic attribute key-value list. This allows for more efficient searching of these attributes.
# Up to 10 span attributes and 10 resource attributes can be configured as dedicated columns.
# Requires vParquet3
parquet_dedicated_columns: <list of columns>
# name of the attribute
- [name: <string>]
# type of the attribute. options: string
[type: <string>]
# scope of the attribute.
# options: resource, span
[scope: <string>]
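As an illustration, a block configuration with two dedicated columns could be sketched as follows. The attribute names `customer.tier` and `deployment.region` are hypothetical and stand in for attributes that are searched often in your own traces. The placement under `storage.trace.block` follows the storage block shown earlier.

```yaml
storage:
  trace:
    block:
      version: vParquet4
      bloom_filter_false_positive: 0.01
      parquet_dedicated_columns:
        - name: customer.tier        # hypothetical span attribute
          type: string
          scope: span
        - name: deployment.region    # hypothetical resource attribute
          type: string
          scope: resource
```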
Filter policy config
Span filter config block
Filter policy
# Include filters (positive matching)
[include: <policy match>]
# Exclude filters (negative matching)
[exclude: <policy match>]
Policy match
# How to match the value of attributes
# Options: "strict", "regex"
[match_type: <string>]
# List of attributes to match
attributes: <list of policy attributes>
# Attribute key
- [key: <string>]
# Attribute value
[value: <any>]
Examples
exclude:
  match_type: "regex"
  attributes:
    - key: "resource.service.name"
      value: "unknown_service:myservice"
include:
  match_type: "strict"
  attributes:
    - key: "foo.bar"
      value: "baz"
KVStore config
The kvstore configuration block
# Set backing store to use
[store: <string> | default = "consul"]
# What prefix to use for keys
[prefix: <string> | default = "ring."]
# Store specific configs
consul:
[host: <string> | default = "localhost:8500"]
[acl_token: <secret string> | default = "" ]
[http_client_timeout: <duration> | default = 20s]
[consistent_reads: <bool> | default = false]
[watch_rate_limit: <float64> | default = 1.0]
[watch_burst_size: <int> | default = 1]
[cas_retry_delay: <duration> | default = 1s]
etcd:
[endpoints: <list of string> | default = [] ]
[dial_timeout: <duration> | default = 10s]
[max_retries: <int> | default = 10 ]
[tls_enabled: <bool> | default = false]
# TLS config
[tls_cert_path: <string> | default = ""]
[tls_key_path: <string> | default = ""]
[tls_ca_path: <string> | default = ""]
[tls_server_name: <string> | default = ""]
[tls_insecure_skip_verify: <bool> | default = false]
[tls_cipher_suites: <string> | default = ""]
[tls_min_version: <string> | default = ""]
[username: <string> | default = ""]
[password: <secret string> | default = ""]
multi:
[primary: <string> | default = ""]
[secondary: <string> | default = ""]
[mirror_enabled: <bool> | default = false]
[mirror_timeout: <duration> | default = 2s]
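For example, a ring that uses etcd as its backing store could be sketched as below. The etcd endpoint is a placeholder, and the block is placed under the compactor ring only as an assumed location for illustration.

```yaml
compactor:
  ring:
    kvstore:
      store: etcd
      prefix: collectors/
      etcd:
        endpoints:
          - etcd.example.internal:2379   # placeholder endpoint
        dial_timeout: 10s
        max_retries: 10
        tls_enabled: false
```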
Search config
# Target number of bytes per GET request while scanning blocks. Default is 1MB. Reducing
# this value could positively impact trace search performance at the cost of more requests
# to object storage.
[chunk_size_bytes: <uint32> | default = 1000000]
# Number of traces to prefetch while scanning blocks. Default is 1000. Increasing this value
# can improve trace search performance at the cost of memory.
[prefetch_trace_count: <int> | default = 1000]
# Number of read buffers used when performing search on a vparquet block. This value times the read_buffer_size_bytes
# is the total amount of bytes used for buffering when performing search on a parquet block.
[read_buffer_count: <int> | default = 32]
# Size of read buffers used when performing search on a vparquet block. This value times the read_buffer_count
# is the total amount of bytes used for buffering when performing search on a parquet block.
[read_buffer_size_bytes: <int> | default = 1048576]
# Granular cache control settings for parquet metadata objects
# Deprecated. See [Cache](#cache) section.
cache_control:
# Specifies if footer should be cached
[footer: <bool> | default = false]
# Specifies if column index should be cached
[column_index: <bool> | default = false]
# Specifies if offset index should be cached
[offset_index: <bool> | default = false]
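The following sketch writes out the documented search defaults to show how the buffer settings relate. It assumes the block sits under `storage.trace.search`, as the storage block above suggests.

```yaml
storage:
  trace:
    search:
      chunk_size_bytes: 1000000
      prefetch_trace_count: 1000
      # Total read buffering per search on a parquet block:
      # read_buffer_count * read_buffer_size_bytes = 32 * 1048576 = 33554432 bytes (32 MiB).
      read_buffer_count: 32
      read_buffer_size_bytes: 1048576
```

Raising either buffer value increases memory use proportionally, so it is worth changing them one at a time.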
WAL config
The storage WAL configuration block.
# Where to store the wal files while they are being appended to.
# Must be set.
# Example: "/var/tempo/wal"
[path: <string> | default = ""]
# WAL encoding/compression.
# options: none, gzip, lz4-64k, lz4-256k, lz4-1M, lz4, snappy, zstd, s2
[v2_encoding: <string> | default = "zstd" ]
# Defines the search data encoding/compression protocol.
# Options: none, gzip, lz4-64k, lz4-256k, lz4-1M, lz4, snappy, zstd, s2
[search_encoding: <string> | default = "snappy"]
# When a span is written to the WAL it adjusts the start and end times of the block it is written to.
# This block start and end time range is then used when choosing blocks for search.
# This is also used for querying traces by ID when the start and end parameters are specified. To prevent spans too far
# in the past or future from impacting the block start and end times we use this configuration option.
# This option only allows spans that occur within the configured duration to adjust the block start and
# end times.
# This can result in trace not being found if the trace falls outside the slack configuration value as the
# start and end times of the block will not be updated in this case.
[ingestion_time_range_slack: <duration> | default = unset]
# WAL file format version
# Options: v2, vParquet, vParquet2, vParquet3
[version: <string> | default = "vParquet3"]
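A minimal WAL sketch, assuming the block lives under `storage.trace.wal`, might look like this. The path is the example from the comments above. The encoding and slack values are illustrative choices from the listed options, not recommendations.

```yaml
storage:
  trace:
    wal:
      # Must be set. Directory where WAL files are appended.
      path: /var/tempo/wal
      v2_encoding: snappy
      # Only spans within 2 minutes of now adjust the block start and end times.
      ingestion_time_range_slack: 2m
```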
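Overrides
Tempo provides an overrides module for users to set global or per-tenant override settings.
Ingestion limits
The default limits in Tempo may not be sufficient in high-volume tracing environments. Errors including RATE_LIMITED/TRACE_TOO_LARGE/LIVE_TRACES_EXCEEDED occur when these limits are exceeded. See below for how to override these limits globally or per tenant.
Standard overrides
You can create an overrides section to configure ingestion limits that apply to all tenants of the cluster. A snippet of a config.yaml file showing how the overrides section is written is here.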
# Overrides configuration block
overrides:
# Global ingestion limits configurations
defaults:
# Ingestion related overrides
ingestion:
# Specifies whether the ingestion rate limits should be applied by each instance
# of the distributor and ingester individually, or the limits are to be shared
# across all instances. See the "override strategies" section for an example.
[rate_strategy: <global|local> | default = local]
# Burst size (bytes) used in ingestion.
# Results in errors like
# RATE_LIMITED: ingestion rate limit (20000000 bytes) exceeded while
# adding 10 bytes
[burst_size_bytes: <int> | default = 20000000 (20MB) ]
# Per-user ingestion rate limit (bytes) used in ingestion.
# Results in errors like
# RATE_LIMITED: ingestion rate limit (15000000 bytes) exceeded while
# adding 10 bytes
[rate_limit_bytes: <int> | default = 15000000 (15MB) ]
# Maximum number of active traces per user, per ingester.
# A value of 0 disables the check.
# Results in errors like
# LIVE_TRACES_EXCEEDED: max live traces per tenant exceeded:
# per-user traces limit (local: 10000 global: 0 actual local: 1) exceeded
# This override limit is used by the ingester.
[max_traces_per_user: <int> | default = 10000]
# Maximum number of active traces per user, across the cluster.
# A value of 0 disables the check.
[max_global_traces_per_user: <int> | default = 0]
# Shuffle sharding shards used for this user. A value of 0 uses all ingesters in the ring.
# Should not be lower than RF.
[tenant_shard_size: <int> | default = 0]
# Maximum bytes any attribute can be for both keys and values.
[max_attribute_bytes: <int> | default = 0]
# Read related overrides
read:
# Maximum size in bytes of a tag-values query. Tag-values query is used mainly
# to populate the autocomplete dropdown. This limit protects the system from
# tags with high cardinality or large values such as HTTP URLs or SQL queries.
# This override limit is used by the ingester and the querier.
# A value of 0 disables the limit.
[max_bytes_per_tag_values_query: <int> | default = 1000000 (1MB) ]
# Maximum number of blocks to be inspected for a tag values query. Tag-values
# query is used mainly to populate the autocomplete dropdown. This limit
# protects the system from long block lists in the ingesters.
# This override limit is used by the ingester and the querier.
# A value of 0 disables the limit.
[max_blocks_per_tag_values_query: <int> | default = 0 (disabled) ]
# Per-user max search duration. If this value is set to 0 (default), then max_duration
# in the front-end configuration is used.
[max_search_duration: <duration> | default = 0s]
# Per-user max duration for metrics queries. If this value is set to 0 (default), then metrics max_duration
# in the front-end configuration is used.
[max_metrics_duration: <duration> | default = 0s]
# Compaction related overrides
compaction:
# Per-user block retention. If this value is set to 0 (default),
# then block_retention in the compactor configuration is used.
[block_retention: <duration> | default = 0s]
# Per-user compaction window. If this value is set to 0 (default),
# then compaction_window in the compactor configuration is used.
[compaction_window: <duration> | default = 0s]
# Allow compaction to be deactivated on a per-tenant basis. Default value
# is false (compaction active). Useful to perform operations on the backend
# that require compaction to be disabled for a period of time.
[compaction_disabled: <bool> | default = false]
# Metrics-generator related overrides
metrics_generator:
# Per-user configuration of the metrics-generator ring size. If set, the tenant will use a
# ring with at most the given amount of instances. Shuffle sharding is used to spread out
# smaller rings across all instances. If the value 0 or a value larger than the total amount
# of instances is used, all instances will be included in the ring.
#
# Together with metrics_generator.max_active_series this can be used to control the total
# amount of active series. The total max active series for a specific tenant will be:
# metrics_generator.ring_size * metrics_generator.max_active_series
[ring_size: <int>]
# Per-user configuration of the metrics-generator processors. The following processors are
# supported:
# - service-graphs
# - span-metrics
# - local-blocks
[processors: <list of strings>]
# Maximum number of active series in the registry, per instance of the metrics-generator. A
# value of 0 disables this check.
# If the limit is reached, no new series will be added but existing series will still be
# updated. The amount of limited series can be observed with the metric
# tempo_metrics_generator_registry_series_limited_total
[max_active_series: <int>]
# Per-user configuration of the collection interval. A value of 0 means the global default
# set in the metrics_generator config block is used.
[collection_interval: <duration>]
# Per-user flag of the registry collection operation. If set, the registry will not be
# collected and no samples will be exported from the metrics-generator. The metrics-generator
# will still ingest spans and update its internal counters, including the amount of active
# series. To disable metrics generation entirely, clear metrics_generator.processors for this
# tenant.
#
# This setting is useful if you wish to test how many active series a tenant will generate, without
# actually writing these metrics.
[disable_collection: <bool> | default = false]
# Per-user configuration of the trace-id label name. This value will be used as name for the label to store the
# trace ID of exemplars in generated metrics. If not set, the default value "trace_id" will be used.
[trace_id_label_name: <string> | default = "trace_id"]
# This option only allows spans with end time that occur within the configured duration to be
# considered in metrics generation.
# This is to filter out spans that are outdated.
[ingestion_time_range_slack: <duration>]
# Configures the histogram implementation to use for span metrics and
# service graphs processors. If native histograms are desired, the
# receiver must be configured to ingest native histograms.
[generate_native_histograms: <classic|native|both> | default = classic]
# Distributor -> metrics-generator forwarder related overrides
forwarder:
# Spans are stored in a queue in the distributor before being sent to the metrics-generators.
# The length of the queue and the amount of workers pulling from the queue can be configured.
[queue_size: <int> | default = 100]
[workers: <int> | default = 2]
# Per processor configuration
processor:
# Configuration for the service-graphs processor
service_graphs:
[histogram_buckets: <list of float>]
[dimensions: <list of string>]
[peer_attributes: <list of string>]
[enable_client_server_prefix: <bool>]
[enable_messaging_system_latency_histogram: <bool>]
# Configuration for the span-metrics processor
span_metrics:
[histogram_buckets: <list of float>]
[dimensions: <list of string>]
# Allowed keys for intrinsic dimensions are: service, span_name, span_kind, status_code, and status_message.
[intrinsic_dimensions: <map string to bool>]
[filter_policies: [
[
include/exclude:
match_type: <string> # options: strict, regex
attributes:
- key: <string>
value: <any>
]
]]
[dimension_mappings: <list of map>]
# Enable target_info metrics
[enable_target_info: <bool>]
# Drop specific resource labels from traces_target_info
[target_info_excluded_dimensions: <list of string>]
# Configuration for the local-blocks processor
local_blocks:
[max_live_traces: <int>]
[max_block_duration: <duration>]
[max_block_bytes: <int>]
[flush_check_period: <duration>]
[trace_idle_period: <duration>]
[complete_block_timeout: <duration>]
[concurrent_blocks: <int>]
[filter_server_spans: <bool>]
# Generic forwarding configuration
# Per-user configuration of generic forwarder feature. Each forwarder in the list
# must refer by name to a forwarder defined in the distributor.forwarders configuration.
forwarders: <list of string>
# Global enforced overrides
global:
# Maximum size of a single trace in bytes. A value of 0 disables the size
# check.
# This limit is used in 3 places:
# - During search, traces will be skipped when they exceed this threshold.
# - During ingestion, traces that exceed this threshold will be refused.
# - During compaction, traces that exceed this threshold will be partially dropped.
# During ingestion, exceeding the threshold results in errors like
# TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 387 bytes
[max_bytes_per_trace: <int> | default = 5000000 (5MB) ]
# Storage enforced overrides
storage:
# Configures attributes to be stored in dedicated columns within the parquet file, rather than in the
# generic attribute key-value list. This allows for more efficient searching of these attributes.
# Up to 10 span attributes and 10 resource attributes can be configured as dedicated columns.
# Requires vParquet3
parquet_dedicated_columns:
[
name: <string>, # name of the attribute
type: <string>, # type of the attribute. options: string
scope: <string> # scope of the attribute. options: resource, span
]
# Cost attribution usage tracker configuration
cost_attribution:
# List of attributes to group ingested data by. Map value is optional. Can be used to rename and
# combine attributes.
dimensions: <map string to string>
# Tenant-specific overrides settings configuration file. The empty string (default
# value) disables using an overrides file.
[per_tenant_override_config: <string> | default = ""]
# How frequent tenant-specific overrides are read from the configuration file.
[per_tenant_override_period: <duration> | default = 10s]
# User-configurable overrides configuration
user_configurable_overrides:
# Enable the user-configurable overrides module
[enabled: <bool> | default = false]
# How often to poll the backend for new user-configurable overrides
[poll_interval: <duration> | default = 60s]
client:
# The storage backend to use
# Should be one of "gcs", "s3", "azure" or "local"
[backend: <string>]
# Backend-specific configuration, support the same configuration options as the
# trace backend configuration
local:
gcs:
s3:
azure:
# Check whether the backend supports versioning at startup. If enabled Tempo will not start if
# the backend doesn't support versioning.
[confirm_versioning: <bool> | default = true]
api:
# When enabled, Tempo will refuse requests that modify overrides that are already set in the
# runtime overrides. For more details, see the user-configurable overrides docs.
[check_for_conflicting_runtime_overrides: <bool> | default = false]
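As a rough illustration of the global and storage defaults described above, the following sketch raises the maximum trace size and moves two attributes into dedicated columns. The attribute names (http.route, k8s.namespace.name) and the 10 MB value are assumptions chosen for the example, not recommendations. The nesting under overrides.defaults follows the layout used elsewhere on this page.

```yaml
# /conf/tempo.yaml (illustrative values only)
overrides:
  defaults:
    global:
      # Assumed value: refuse traces larger than 10 MB instead of the 5 MB default.
      max_bytes_per_trace: 10000000
    storage:
      # Requires vParquet3. Up to 10 span and 10 resource attributes are allowed.
      parquet_dedicated_columns:
        - name: http.route          # hypothetical span attribute
          type: string
          scope: span
        - name: k8s.namespace.name  # hypothetical resource attribute
          type: string
          scope: resource
```

Dedicated columns tend to pay off for attributes that are searched often, because queries can read one column instead of scanning the generic key-value list.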
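Tenant-specific overrides
There are two types of tenant-specific overrides:
- Runtime overrides
- User-configurable overrides
Runtime overrides
You can set tenant-specific overrides in a separate file and point per_tenant_override_config to it. This overrides file is loaded dynamically: it can be changed at runtime, and Tempo reloads it without restarting the application. These overrides can be set per tenant.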
# /conf/tempo.yaml
# Overrides configuration block
overrides:
  per_tenant_override_config: /conf/overrides.yaml
---
# /conf/overrides.yaml
# Tenant-specific overrides configuration
overrides:
  "<tenant-id>":
    ingestion:
      [burst_size_bytes: <int>]
      [rate_limit_bytes: <int>]
      [max_traces_per_user: <int>]
    global:
      [max_bytes_per_trace: <int>]
  # A "wildcard" override can be used that will apply to all tenants if a match is not found otherwise.
  "*":
    ingestion:
      [burst_size_bytes: <int>]
      [rate_limit_bytes: <int>]
      [max_traces_per_user: <int>]
    global:
      [max_bytes_per_trace: <int>]
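For a concrete picture of the template above, here is a minimal sketch of an overrides file with one named tenant and a wildcard fallback. The tenant ID team-a and every numeric value are hypothetical. Only the keys come from the template.

```yaml
# /conf/overrides.yaml (hypothetical tenant and values)
overrides:
  # Hypothetical tenant that is given a larger ingestion budget.
  "team-a":
    ingestion:
      burst_size_bytes: 30000000
      rate_limit_bytes: 20000000
      max_traces_per_user: 50000
    global:
      max_bytes_per_trace: 10000000
  # Fallback applied to every tenant that has no explicit entry.
  "*":
    ingestion:
      burst_size_bytes: 20000000
      rate_limit_bytes: 15000000
      max_traces_per_user: 10000
    global:
      max_bytes_per_trace: 5000000
```

Because the file is reloaded at runtime, changing a value here should take effect after the next read interval (per_tenant_override_period) without a restart.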
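User-configurable overrides
These tenant-specific overrides are stored in an object store and can be modified using API requests. User-configurable overrides take priority over runtime overrides. For more details, see the user-configurable overrides documentation.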
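The following sketch shows one way the user_configurable_overrides block from the reference above might be filled in. The bucket name and endpoint are placeholders. The s3 block is assumed to accept the same bucket and endpoint options as the trace storage backend, as the reference states.

```yaml
# /conf/tempo.yaml (illustrative; bucket and endpoint are placeholders)
overrides:
  user_configurable_overrides:
    enabled: true
    # How often to poll the backend for changed overrides.
    poll_interval: 60s
    client:
      backend: s3
      s3:
        bucket: tempo-overrides                # hypothetical bucket name
        endpoint: s3.us-east-1.amazonaws.com   # hypothetical endpoint
      # Refuse to start if the backend cannot version objects.
      confirm_versioning: true
    api:
      # Reject API writes that would conflict with a runtime override.
      check_for_conflicting_runtime_overrides: true
```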
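Override strategies
By default, the trace limits specified by the various parameters are applied per distributor. For example, setting max_traces_per_user to 10000 means that each distributor in the cluster has a limit of 10000 traces per user. This is known as the local strategy, because the specified trace limits are local to each distributor.
A limit applied at the local level ensures that each distributor can independently process traces up to the limit without affecting the limits of other distributors.
However, as a cluster grows, the total number of accepted traces grows with it. An alternative is to set a global trace limit, which establishes a total budget for all traces across every distributor in the cluster. The global limit is divided evenly across all distributors by using the distributor ring.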
# /conf/tempo.yaml
overrides:
  defaults:
    ingestion:
      [rate_strategy: <global|local> | default = local]
For example, this configuration specifies that each distributor instance applies a limit of 15MB/s.
overrides:
  defaults:
    ingestion:
      rate_strategy: local
      rate_limit_bytes: 15000000
This configuration specifies that all distributor instances together apply a limit of 15MB/s. So if there are 5 instances, each instance applies a local limit of (15MB/s / 5) = 3MB/s.
overrides:
  defaults:
    ingestion:
      rate_strategy: global
      rate_limit_bytes: 15000000
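To make the division concrete, the sketch below annotates a global budget with the per-distributor limit that results for a few cluster sizes. It assumes the budget is split evenly across the distributors in the ring, as described above, and it uses the rate_strategy and rate_limit_bytes key names from the reference block. The cluster sizes are arbitrary.

```yaml
overrides:
  defaults:
    ingestion:
      rate_strategy: global
      # Cluster-wide budget of 15MB/s.
      rate_limit_bytes: 15000000
      # Per-distributor limit = rate_limit_bytes / number of distributors:
      #    3 distributors -> 5000000 bytes/s each
      #    5 distributors -> 3000000 bytes/s each
      #   10 distributors -> 1500000 bytes/s each
```

With the local strategy the same value would instead allow 15MB/s on every distributor, so the cluster-wide total would scale with the number of instances.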
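Usage report
By default, Tempo reports anonymous usage data about the shape of a deployment to Grafana Labs. This data is used to determine how common certain features are in deployments, whether a feature flag has been enabled, and which replication factors or compression levels are used.
By providing information on how people use Tempo, usage reporting helps the Tempo team decide where to focus its development and documentation efforts. No private information is collected, and all reports are completely anonymous.
Reporting is controlled by a configuration option.
The following configuration values are reported:
- Receivers enabled
- Frontend concurrency and version
- Storage cache, backend, WAL, and block encodings
- Ring replication factor and kvstore
- Feature toggles enabled
No performance data is collected.
You can disable the automatic reporting of this generic information using the following configuration: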
usage_report:
  reporting_enabled: false
If you are using a Helm chart, you can enable or disable usage reporting by changing the reportingEnabled value. This value is available in both the tempo-distributed and the tempo Helm charts.
# -- If true, Tempo will report anonymous usage data about the shape of a deployment to Grafana Labs
reportingEnabled: true
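Cache
Use this block to configure caches that are available throughout the application. You can create multiple caches and assign them roles, which determine how Tempo uses them.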
cache:
# Background cache configuration. Requires having a cache configured. These settings apply
# to all configured caches.
background:
# At what concurrency to write back to cache. Default is 10.
[writeback_goroutines: <int>]
# How many key batches to buffer for background write-back. Default is 10000.
[writeback_buffer: <int>]
caches:
# Roles determine how this cache is used in Tempo. Roles must be unique across all caches and
# every cache must have at least one role.
# Allowed values:
# bloom - Bloom filters for trace id lookup.
# parquet-footer - Parquet footer values. Useful for search and trace by id lookup.
# parquet-page - Parquet "pages". WARNING: This will attempt to cache most reads from parquet and, as a result, is very high volume.
# frontend-search - Frontend search job results.
- roles:
- <role1>
- <role2>
# Memcached caching configuration block
memcached:
# Hostname for memcached service to use. If empty and if addresses is unset, no memcached will be used.
# Example: "host: memcached"
[host: <string>]
# Optional
# SRV service used to discover memcache servers. (default: memcached)
# Example: "service: memcached-client"
[service: <string>]
# Optional
# Comma-separated list of addresses in DNS Service Discovery format. Refer to https://cortexmetrics.io/docs/configuration/arguments/#dns-service-discovery.
# (default: "")
# Example: "addresses: memcached"
[addresses: <comma separated strings>]
# Optional
# Maximum time to wait before giving up on memcached requests.
# (default: 100ms)
[timeout: <duration>]
# Optional
# Maximum number of idle connections in pool.
# (default: 16)
[max_idle_conns: <int>]
# Optional
# Period with which to poll DNS for memcache servers.
# (default: 1m)
[update_interval: <duration>]
# Optional
# Use consistent hashing to distribute keys to memcache servers.
# (default: true)
[consistent_hash: <bool>]
# Optional
# Trip circuit-breaker after this number of consecutive dial failures.
# (default: 10)
[circuit_breaker_consecutive_failures: <int>]
# Optional
# Duration circuit-breaker remains open after tripping.
# (default: 10s)
[circuit_breaker_timeout: <duration>]
# Optional
# Reset circuit-breaker counts after this long.
# (default: 10s)
[circuit_breaker_interval: <duration>]
# Enable connecting to Memcached with TLS.
# CLI flag: -<prefix>.memcached.tls-enabled
[tls_enabled: <boolean> | default = false]
# Path to the client certificate, which will be used for authenticating with
# the server. Also requires the key path to be configured.
# CLI flag: -<prefix>.memcached.tls-cert-path
[tls_cert_path: <string> | default = ""]
# Path to the key for the client certificate. Also requires the client
# certificate to be configured.
# CLI flag: -<prefix>.memcached.tls-key-path
[tls_key_path: <string> | default = ""]
# Path to the CA certificates to validate server certificate against. If not
# set, the host's root CA certificates are used.
# CLI flag: -<prefix>.memcached.tls-ca-path
[tls_ca_path: <string> | default = ""]
# Override the expected name on the server certificate.
# CLI flag: -<prefix>.memcached.tls-server-name
[tls_server_name: <string> | default = ""]
# Skip validating server certificate.
# CLI flag: -<prefix>.memcached.tls-insecure-skip-verify
[tls_insecure_skip_verify: <boolean> | default = false]
# Override the default cipher suite list (separated by commas). Allowed
# values:
#
# Secure Ciphers:
# - TLS_RSA_WITH_AES_128_CBC_SHA
# - TLS_RSA_WITH_AES_256_CBC_SHA
# - TLS_RSA_WITH_AES_128_GCM_SHA256
# - TLS_RSA_WITH_AES_256_GCM_SHA384
# - TLS_AES_128_GCM_SHA256
# - TLS_AES_256_GCM_SHA384
# - TLS_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
# - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
# - TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
# - TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
# - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
# - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
# - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
# - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
#
# Insecure Ciphers:
# - TLS_RSA_WITH_RC4_128_SHA
# - TLS_RSA_WITH_3DES_EDE_CBC_SHA
# - TLS_RSA_WITH_AES_128_CBC_SHA256
# - TLS_ECDHE_ECDSA_WITH_RC4_128_SHA
# - TLS_ECDHE_RSA_WITH_RC4_128_SHA
# - TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA
# - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
# - TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
# CLI flag: -<prefix>.memcached.tls-cipher-suites
[tls_cipher_suites: <string> | default = ""]
# Override the default minimum TLS version. Allowed values: VersionTLS10,
# VersionTLS11, VersionTLS12, VersionTLS13
# CLI flag: -<prefix>.memcached.tls-min-version
[tls_min_version: <string> | default = ""]
# Redis configuration block
# EXPERIMENTAL
redis:
# Redis endpoint to use when caching.
[endpoint: <string>]
# optional.
# Maximum time to wait before giving up on redis requests. (default 100ms)
[timeout: <duration>]
# optional.
# Redis Sentinel master name. (default "")
# Example: "master-name: redis-master"
[master-name: <string>]
# optional.
# Database index. (default 0)
[db: <int>]
# optional.
# How long keys stay in the redis. (default 0)
[expiration: <duration>]
# optional.
# Enable connecting to redis with TLS. (default false)
[tls-enabled: <bool>]
# optional.
# Skip validating server certificate. (default false)
[tls-insecure-skip-verify: <bool>]
# optional.
# Maximum number of connections in the pool. (default 0)
[pool-size: <int>]
# optional.
# Password to use when connecting to redis. (default "")
[password: <string>]
# optional.
# Close connections after remaining idle for this duration. (default 0s)
[idle-timeout: <duration>]
# optional.
# Close connections older than this duration. (default 0s)
[max-connection-age: <duration>]
# optional.
# Password to use when connecting to redis sentinel. (default "")
[sentinel_password: <string>]
Example config
cache:
  background:
    writeback_goroutines: 5
  caches:
  - roles:
    - parquet-footer
    memcached:
      host: memcached-instance
  - roles:
    - bloom
    redis:
      endpoint: redis-instance
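A slightly fuller sketch, using only keys from the reference block above, shows one cache serving two roles over TLS and a second cache for search results. The host names, file paths, and durations are placeholders. The dns+ address prefix assumes the DNS Service Discovery format linked in the reference.

```yaml
cache:
  background:
    writeback_goroutines: 5
  caches:
    # One Memcached cluster serves two roles. Roles must not repeat across caches.
    - roles:
        - bloom
        - parquet-footer
      memcached:
        addresses: dns+memcached.example.internal:11211   # hypothetical address
        timeout: 200ms
        consistent_hash: true
        tls_enabled: true
        tls_ca_path: /etc/tempo/tls/ca.crt                # hypothetical path
        tls_server_name: memcached.example.internal       # hypothetical name
        tls_min_version: VersionTLS12
    # Redis support is marked experimental in the reference above.
    - roles:
        - frontend-search
      redis:
        endpoint: redis.example.internal:6379             # hypothetical endpoint
        timeout: 500ms
        expiration: 1h
        tls-enabled: true
```

The parquet-page role is left out on purpose, since the reference warns that it is very high volume.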