Service-to-service dashboard
This page provides reference information about the Grafana dashboard configuration included in the hashicorp/consul GitHub repository. The service-to-service dashboard provides deep visibility into the traffic and interactions between services within the Consul service mesh. It focuses on key signals such as log volume, error rates, traffic patterns, and success rates, all of which help operators maintain smooth and reliable service-to-service communication.
Grafana queries overview
This dashboard provides the following information about service mesh operations.
Access logs and errors monitoring
Description: This section provides visibility into logs and errors related to service-to-service communications. It tracks and displays the number of logs generated, errors encountered, and the percentage of logs matching specific patterns.
Total logs
Description: This metric counts the total number of log lines produced by Consul dataplane containers. It provides an overview of the volume of logs being generated for a specific namespace.
sum(count_over_time(({container="consul-dataplane",namespace=~"$namespace"})[$__interval]))
Total logs containing "$searchable_pattern"
Description: This metric tracks the number of logs containing the specified pattern. It is useful for filtering and monitoring specific log events across the service mesh.
sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} |~ `(?i)$searchable_pattern` [$__interval]))
Percentage of logs containing "$searchable_pattern"
Description: This metric calculates the percentage of logs containing the specified search pattern within the total log volume. It helps gauge the proportion of specific log events.
(sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} |~ `(?i)$searchable_pattern` [$__interval])) * 100) / sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} [$__interval]))
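The arithmetic behind this panel can be illustrated outside of Loki. The following is a minimal Python sketch, not the dashboard's implementation: the function name `pattern_percentage` and the sample log lines are hypothetical, and Python's `re` module stands in for the RE2 syntax that LogQL uses, so complex patterns may behave differently.

```python
import re


def pattern_percentage(log_lines, searchable_pattern):
    """Return the percentage of log lines that match the pattern, ignoring case."""
    if not log_lines:
        # Simplification: the LogQL query would return no data for an empty window.
        return 0.0
    regex = re.compile(searchable_pattern, re.IGNORECASE)
    matching = sum(1 for line in log_lines if regex.search(line))
    # Same shape as the query: (matching lines * 100) / total lines.
    return matching * 100 / len(log_lines)


# Hypothetical log lines, for illustration only.
lines = [
    "upstream connect error",
    "request completed",
    "ERROR: connection reset",
    "request completed",
]
print(pattern_percentage(lines, "error"))
```

Two of the four sample lines match `error` regardless of case, so the sketch reports half of the log volume as matching.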
Total response code distribution
Description: This pie chart visualizes the distribution of HTTP response codes, helping identify any 4xx and 5xx error codes generated by the services.
sum by(response_code) (count_over_time({container="consul-dataplane", namespace="$namespace"} | json | response_code != "0" | __error__="" [$__range]))
Rate of logs containing "$searchable_pattern" per service
Description: This metric monitors the rate at which specific patterns appear in logs per service, helping to detect trends and anomalies in log data.
sum by(app) (rate({container="consul-dataplane", namespace=~"$namespace"} |~ `(?i)$searchable_pattern` [$__range]))
TCP metrics - service level
TCP inbound and outbound bytes
Description: This metric tracks the inbound and outbound TCP bytes transferred between services. It is essential for understanding the network traffic flow between source and destination services. The query shown below covers bytes received (rx) on downstream connections.
sum(rate(envoy_tcp_downstream_cx_rx_bytes_total{}[10m])) by (service, destination_service)
TCP inbound and outbound bytes buffered
Description: This metric monitors the amount of TCP bytes buffered for inbound and outbound traffic between services. It helps identify potential network performance bottlenecks.
sum(rate(envoy_tcp_downstream_cx_rx_bytes_buffered{}[10m])) by (service, destination_service)
TCP downstream connections
Description: This metric counts the cumulative number of TCP downstream connections from the source service to the destination service, providing visibility into the volume of connections between services.
sum(envoy_tcp_downstream_cx_total) by (service, destination_service)
Outbound traffic monitoring
Upstream traffic
Description: This metric monitors the upstream traffic from the source service to the destination service. It shows how much traffic is being sent between services.
sum(irate(envoy_cluster_upstream_rq_total{local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
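Several queries on this dashboard use `irate` over a 10-minute window. As a rough illustration of what that means, the sketch below computes an instantaneous rate from only the last two samples in a window and compares it with a plain average over the whole window. The function names and sample values are hypothetical, and `average_rate` is a deliberate simplification that ignores the extrapolation and counter-reset handling that PromQL's `rate` performs.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) counter samples."""
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last == t_prev:
        return None
    # If the counter went backwards, assume it was reset and count from zero.
    increase = v_last - v_prev if v_last >= v_prev else v_last
    return increase / (t_last - t_prev)


def average_rate(samples):
    """Simplified average rate over the window: first sample to last sample."""
    if len(samples) < 2:
        return None
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical request counter scraped every 15 seconds.
window = [(0, 100), (15, 130), (30, 190)]
print(irate(window), average_rate(window))
```

Because only the final two samples contribute, `irate` reacts quickly to bursts, while an averaged rate smooths them out across the window.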
Upstream request response timeliness
Description: This metric calculates the 95th percentile of upstream request response times between the source and destination services. It helps ensure that service communications are handled promptly.
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{local_cluster=~"$source_service",consul_destination_target!=""}[10m])) by (le, consul_destination_target))
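To make the percentile estimate more concrete, the sketch below approximates how a quantile is derived from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. It is a simplified illustration rather than Prometheus's own code: the bucket boundaries and counts are hypothetical, and edge cases such as native histograms or non-monotonic buckets are not handled.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with an upper bound of math.inf.
    """
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first finite bucket whose cumulative count reaches the rank.
    idx = next(
        (i for i, (_, count) in enumerate(buckets[:-1]) if count >= rank),
        len(buckets) - 1,
    )
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    if idx == 0 and upper <= 0:
        return upper
    lower, prev_count = (0.0, 0) if idx == 0 else buckets[idx - 1]
    in_bucket = count - prev_count
    if in_bucket == 0:
        return lower
    # Linear interpolation within the selected bucket.
    return lower + (upper - lower) * (rank - prev_count) / in_bucket


# Hypothetical request-time buckets in milliseconds.
example = [(5, 10), (10, 60), (25, 90), (50, 100), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

With these sample buckets the 95th-percentile rank lands in the bucket between 25 ms and 50 ms, so the estimate is interpolated between those two bounds. The accuracy of the result therefore depends on how finely the buckets are spaced.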
Upstream request success rate
Description: This metric tracks the rate of successful requests from the source service to the destination service. The query counts every response class except 5xx as successful, so 4xx responses are included. It helps assess the reliability of service communications.
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
Inbound traffic monitoring
Requests sent
Description: This metric tracks the number of requests sent between the source service and destination service within the service mesh.
sum(irate(envoy_cluster_upstream_rq_total{consul_destination_datacenter="dc1",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m])) by (consul_destination_service, local_cluster)
Request success rate
Description: This metric tracks the success rate of requests from the source service to the destination service, helping identify failures or bottlenecks in communication.
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m])) by (local_cluster, consul_destination_service) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$destination_service"}[10m])) by (local_cluster, consul_destination_service)
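The ratio computed by this query can be sketched in a few lines. The example below is illustrative only: `success_ratio` and the per-class request rates are hypothetical, and the input stands in for the values that the numerator and denominator would yield for one source and destination pair.

```python
def success_ratio(rates_by_class):
    """Share of requests whose response code class is not "5".

    rates_by_class maps a response code class ("2", "4", "5", ...) to a
    request rate in requests per second.
    """
    total = sum(rates_by_class.values())
    if total == 0:
        # No traffic: the PromQL division would return no usable value.
        return None
    non_5xx = sum(rate for cls, rate in rates_by_class.items() if cls != "5")
    return non_5xx / total


# Hypothetical rates for one (local_cluster, consul_destination_service) pair.
rates = {"2": 90.0, "4": 6.0, "5": 4.0}
print(success_ratio(rates))
```

Note that under this definition 4xx responses count toward success, because only the 5xx class is excluded from the numerator.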
Response success by status code
Description: This metric tracks response success by status code for requests sent by the source service to the destination service.
sum(increase(envoy_http_downstream_rq_xx{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster, envoy_response_code_class)
Request duration
Description: This metric tracks the request duration between the source and destination services, helping monitor performance and response times.
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{consul_destination_datacenter="dc1", consul_destination_service=~"$destination_service",local_cluster=~"$source_service"}[10m])) by (le, cluster, local_cluster, consul_destination_service))
Response success
Description: This metric tracks the success of responses for the source service's requests across the service mesh.
sum(increase(envoy_http_downstream_rq_total{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster)
Request response rate
Description: This metric tracks the rate at which responses are being generated by the source service, providing insight into service activity and performance.
sum(irate(envoy_http_downstream_rq_total{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster)
Customization options
The service-to-service dashboard includes a variety of customization options to help you analyze specific aspects of service-to-service communications, tailor the dashboard for more targeted monitoring, and enhance visibility into the service mesh.
Filter by source service: You can filter the dashboard to focus on traffic originating from a specific source service, allowing you to analyze interactions from the source service to all destination services.
Filter by destination service: Similarly, you can filter the dashboard by destination service to track and analyze the traffic received by specific services. This helps pinpoint communication issues or performance bottlenecks related to specific services.
Filter by namespace: The dashboard can be customized to focus on service interactions within a particular namespace. This is especially useful for isolating issues in multi-tenant environments or clusters that operate with strict namespace isolation.
Log pattern search: You can apply custom search patterns to logs to filter for specific log events of interest, such as error messages or specific HTTP status codes. This enables you to narrow down to specific log entries and identify patterns that may indicate issues.
Time range selection: The dashboard supports dynamic time range selection, allowing you to focus on service interactions over specific time intervals. This helps in analyzing traffic trends, troubleshooting incidents, and understanding the timing of service communications.
By using these customization options, you can tailor the dashboard to your specific needs and ensure that you are always monitoring the most relevant data for maintaining a healthy and performant service mesh.
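To try a panel query outside Grafana, you can substitute the dashboard variables by hand and send the result to the Prometheus HTTP API instant query endpoint (`/api/v1/query`). The sketch below only builds the request URL. The helper names, the `frontend` and `backend` service names, and the `http://localhost:9090` address are assumptions for illustration, and the simple string replacement does not reproduce how Grafana formats multi-value variables.

```python
from urllib.parse import urlencode

# Upstream traffic query from this page, with dashboard variables left in place.
TEMPLATE = (
    'sum(irate(envoy_cluster_upstream_rq_total{local_cluster=~"$source_service",'
    'consul_destination_service=~"$destination_service"}[10m]))'
)


def render_query(template, variables):
    """Replace $name placeholders with values (hypothetical helper)."""
    # Replace longer names first so one name cannot clobber a longer one.
    for name in sorted(variables, key=len, reverse=True):
        template = template.replace("$" + name, variables[name])
    return template


def instant_query_url(base_url, query):
    """Build a URL for the Prometheus instant query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


# Hypothetical service names and Prometheus address.
query = render_query(
    TEMPLATE,
    {"source_service": "frontend", "destination_service": "backend"},
)
url = instant_query_url("http://localhost:9090", query)
print(url)
```

Because both filters use the `=~` regex matcher, a value such as `.*` would select every service, which is a convenient way to check a query before narrowing it.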