|
- # How to monitor Synapse metrics using Prometheus
-
- 1. Install Prometheus:
-
- Follow instructions at
- <http://prometheus.io/docs/introduction/install/>
-
- 1. Enable Synapse metrics:
-
- In `homeserver.yaml`, make sure `enable_metrics` is
- set to `True`.
-
- 1. Enable the `/_synapse/metrics` Synapse endpoint that Prometheus uses to
- collect data:
-
- There are two methods of enabling the metrics endpoint in Synapse.
-
- The first serves the metrics as part of the usual web server and
- can be enabled by adding the `metrics` resource to the existing
- listener, as in this example:
-
- ```yaml
- listeners:
-   - port: 8008
-     tls: false
-     type: http
-     x_forwarded: true
-     bind_addresses: ['::1', '127.0.0.1']
-
-     resources:
-       # added "metrics" in this line
-       - names: [client, federation, metrics]
-         compress: false
- ```
-
- This provides a simple way of adding metrics to your Synapse
- installation, and serves them under `/_synapse/metrics`. If you do not
- want your metrics to be publicly exposed, you will need to either
- filter the endpoint out at your load balancer, or use the second method.
-
- The second method runs the metrics server on a different port, in a
- different thread to Synapse. This can make metrics retrieval more
- resilient under heavy load, and makes it easier to expose the metrics
- to internal networks only. The served metrics are available
- over HTTP only, at `/_synapse/metrics`.
-
- Add a new listener to `homeserver.yaml`, as in this example:
-
- ```yaml
- listeners:
-   - port: 8008
-     tls: false
-     type: http
-     x_forwarded: true
-     bind_addresses: ['::1', '127.0.0.1']
-
-     resources:
-       - names: [client, federation]
-         compress: false
-
-   # beginning of the new metrics listener
-   - port: 9000
-     type: metrics
-     bind_addresses: ['::1', '127.0.0.1']
- ```
-
- 1. Restart Synapse.
-
- 1. Add a Prometheus target for Synapse.
-
- The target needs `metrics_path` set to a non-default value (under
- `scrape_configs`):
-
- ```yaml
- - job_name: "synapse"
-   scrape_interval: 15s
-   metrics_path: "/_synapse/metrics"
-   static_configs:
-     - targets: ["my.server.here:port"]
- ```
-
- where `my.server.here` is the IP address of Synapse, and `port` is
- the listener port configured with the `metrics` resource.
-
- If your Prometheus is older than 1.5.2, you will need to replace
- `static_configs` in the above with `target_groups`.
-
- 1. Restart Prometheus.
-
- 1. Consider using the [Grafana dashboard](https://github.com/matrix-org/synapse/tree/master/contrib/grafana/)
- and the required [recording rules](https://github.com/matrix-org/synapse/tree/master/contrib/prometheus/).
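-
- As a minimal sketch, steps 2 and 3 (using the second method) can be
- combined in `homeserver.yaml` as follows. The port and bind addresses are
- taken from the example above and are assumptions about your deployment;
- existing listeners are omitted for brevity.
-
- ```yaml
- # homeserver.yaml (fragment, other listeners omitted)
- enable_metrics: true
-
- listeners:
-   # dedicated metrics listener, reachable from localhost only
-   - port: 9000
-     type: metrics
-     bind_addresses: ['::1', '127.0.0.1']
- ```
-
- With this fragment, a Prometheus instance on the same host would use
- `localhost:9000` as the target, with `metrics_path` set to
- `/_synapse/metrics`.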
-
- ## Monitoring workers
-
- To monitor a Synapse installation using [workers](workers.md),
- every worker needs to be monitored independently, in addition to
- the main homeserver process. This is because workers don't send
- their metrics to the main homeserver process, but expose them
- directly (if they are configured to do so).
-
- To allow collecting metrics from a worker, you need to add a
- `metrics` listener to its configuration, by adding the following
- under `worker_listeners`:
-
- ```yaml
- - type: metrics
-   bind_address: ''
-   port: 9101
- ```
-
- The `bind_address` and `port` parameters should be set so that
- the resulting listener can be reached by Prometheus, and so that they
- do not clash with the listener of another worker.
- With this example, the worker's metrics would then be available
- on `http://127.0.0.1:9101/_synapse/metrics`.
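-
- For context, a worker configuration file containing this listener might
- look like the sketch below. The `worker_name` value and the loopback bind
- address are hypothetical choices for illustration; only the
- `worker_listeners` entry is what this section describes.
-
- ```yaml
- worker_app: synapse.app.generic_worker
- worker_name: generic_worker1   # hypothetical name
-
- worker_listeners:
-   # expose this worker's metrics to a Prometheus on the same host
-   - type: metrics
-     bind_address: '127.0.0.1'
-     port: 9101                 # pick a port no other worker on this host uses
- ```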
-
- Example Prometheus target for Synapse with workers:
-
- ```yaml
- - job_name: "synapse"
-   scrape_interval: 15s
-   metrics_path: "/_synapse/metrics"
-   static_configs:
-     - targets: ["my.server.here:port"]
-       labels:
-         instance: "my.server"
-         job: "master"
-         index: 1
-     - targets: ["my.workerserver.here:port"]
-       labels:
-         instance: "my.server"
-         job: "generic_worker"
-         index: 1
-     - targets: ["my.workerserver.here:port"]
-       labels:
-         instance: "my.server"
-         job: "generic_worker"
-         index: 2
-     - targets: ["my.workerserver.here:port"]
-       labels:
-         instance: "my.server"
-         job: "media_repository"
-         index: 1
- ```
-
- Labels (`instance`, `job`, `index`) can be set to any value.
- They are used to group graphs in Grafana.
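-
- To illustrate how these labels can be used for grouping, here is a
- sketch of a Prometheus rule file that aggregates CPU usage per process.
- The group and rule names are hypothetical, and this rule is not one of
- the recording rules shipped with Synapse.
-
- ```yaml
- groups:
-   - name: synapse_label_example   # hypothetical group name
-     rules:
-       # CPU usage rate, one series per (job, index) pair
-       - record: job_index:process_cpu_seconds:rate5m   # hypothetical rule name
-         expr: sum by (job, index) (rate(process_cpu_seconds_total[5m]))
- ```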
-
- ## Renaming of metrics & deprecation of old names in 1.2
-
- Synapse 1.2 updates the Prometheus metrics to match the naming
- convention of the upstream `prometheus_client`. The old names are
- considered deprecated and will be removed in a future version of
- Synapse.
- **The old names will be disabled by default in Synapse v1.71.0 and removed
- altogether in Synapse v1.73.0.**
-
- | New Name | Old Name |
- | ---------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
- | python_gc_objects_collected_total | python_gc_objects_collected |
- | python_gc_objects_uncollectable_total | python_gc_objects_uncollectable |
- | python_gc_collections_total | python_gc_collections |
- | process_cpu_seconds_total | process_cpu_seconds |
- | synapse_federation_client_sent_transactions_total | synapse_federation_client_sent_transactions |
- | synapse_federation_client_events_processed_total | synapse_federation_client_events_processed |
- | synapse_event_processing_loop_count_total | synapse_event_processing_loop_count |
- | synapse_event_processing_loop_room_count_total | synapse_event_processing_loop_room_count |
- | synapse_util_caches_cache_hits | synapse_util_caches_cache:hits |
- | synapse_util_caches_cache_size | synapse_util_caches_cache:size |
- | synapse_util_caches_cache_evicted_size | synapse_util_caches_cache:evicted_size |
- | synapse_util_caches_cache | synapse_util_caches_cache:total |
- | synapse_util_caches_response_cache_size | synapse_util_caches_response_cache:size |
- | synapse_util_caches_response_cache_hits | synapse_util_caches_response_cache:hits |
- | synapse_util_caches_response_cache_evicted_size | synapse_util_caches_response_cache:evicted_size |
- | synapse_util_metrics_block_count_total | synapse_util_metrics_block_count |
- | synapse_util_metrics_block_time_seconds_total | synapse_util_metrics_block_time_seconds |
- | synapse_util_metrics_block_ru_utime_seconds_total | synapse_util_metrics_block_ru_utime_seconds |
- | synapse_util_metrics_block_ru_stime_seconds_total | synapse_util_metrics_block_ru_stime_seconds |
- | synapse_util_metrics_block_db_txn_count_total | synapse_util_metrics_block_db_txn_count |
- | synapse_util_metrics_block_db_txn_duration_seconds_total | synapse_util_metrics_block_db_txn_duration_seconds |
- | synapse_util_metrics_block_db_sched_duration_seconds_total | synapse_util_metrics_block_db_sched_duration_seconds |
- | synapse_background_process_start_count_total | synapse_background_process_start_count |
- | synapse_background_process_ru_utime_seconds_total | synapse_background_process_ru_utime_seconds |
- | synapse_background_process_ru_stime_seconds_total | synapse_background_process_ru_stime_seconds |
- | synapse_background_process_db_txn_count_total | synapse_background_process_db_txn_count |
- | synapse_background_process_db_txn_duration_seconds_total | synapse_background_process_db_txn_duration_seconds |
- | synapse_background_process_db_sched_duration_seconds_total | synapse_background_process_db_sched_duration_seconds |
- | synapse_storage_events_persisted_events_total | synapse_storage_events_persisted_events |
- | synapse_storage_events_persisted_events_sep_total | synapse_storage_events_persisted_events_sep |
- | synapse_storage_events_state_delta_total | synapse_storage_events_state_delta |
- | synapse_storage_events_state_delta_single_event_total | synapse_storage_events_state_delta_single_event |
- | synapse_storage_events_state_delta_reuse_delta_total | synapse_storage_events_state_delta_reuse_delta |
- | synapse_federation_server_received_pdus_total | synapse_federation_server_received_pdus |
- | synapse_federation_server_received_edus_total | synapse_federation_server_received_edus |
- | synapse_handler_presence_notified_presence_total | synapse_handler_presence_notified_presence |
- | synapse_handler_presence_federation_presence_out_total | synapse_handler_presence_federation_presence_out |
- | synapse_handler_presence_presence_updates_total | synapse_handler_presence_presence_updates |
- | synapse_handler_presence_timers_fired_total | synapse_handler_presence_timers_fired |
- | synapse_handler_presence_federation_presence_total | synapse_handler_presence_federation_presence |
- | synapse_handler_presence_bump_active_time_total | synapse_handler_presence_bump_active_time |
- | synapse_federation_client_sent_edus_total | synapse_federation_client_sent_edus |
- | synapse_federation_client_sent_pdu_destinations_count_total | synapse_federation_client_sent_pdu_destinations:count |
- | synapse_federation_client_sent_pdu_destinations_total | synapse_federation_client_sent_pdu_destinations:total |
- | synapse_handlers_appservice_events_processed_total | synapse_handlers_appservice_events_processed |
- | synapse_notifier_notified_events_total | synapse_notifier_notified_events |
- | synapse_push_bulk_push_rule_evaluator_push_rules_invalidation_counter_total | synapse_push_bulk_push_rule_evaluator_push_rules_invalidation_counter |
- | synapse_push_bulk_push_rule_evaluator_push_rules_state_size_counter_total | synapse_push_bulk_push_rule_evaluator_push_rules_state_size_counter |
- | synapse_http_httppusher_http_pushes_processed_total | synapse_http_httppusher_http_pushes_processed |
- | synapse_http_httppusher_http_pushes_failed_total | synapse_http_httppusher_http_pushes_failed |
- | synapse_http_httppusher_badge_updates_processed_total | synapse_http_httppusher_badge_updates_processed |
- | synapse_http_httppusher_badge_updates_failed_total | synapse_http_httppusher_badge_updates_failed |
- | synapse_admin_mau_current | synapse_admin_mau:current |
- | synapse_admin_mau_max | synapse_admin_mau:max |
- | synapse_admin_mau_registered_reserved_users | synapse_admin_mau:registered_reserved_users |
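-
- During a migration, a query can be written so that it works with either
- name. The sketch below uses the PromQL `or` operator in a recording rule;
- the group and rule names are hypothetical, and the same expression could
- be used directly in a dashboard instead.
-
- ```yaml
- groups:
-   - name: synapse_rename_example   # hypothetical group name
-     rules:
-       # use the new name; fall back to the old name for any label set
-       # that has no series under the new name
-       - record: synapse:admin_mau_current:compat   # hypothetical rule name
-         expr: "synapse_admin_mau_current or synapse_admin_mau:current"
- ```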
-
- Removal of deprecated metrics & time based counters becoming histograms in 0.31.0
- ---------------------------------------------------------------------------------
-
- The duplicated metrics deprecated in Synapse 0.27.0 have been removed.
-
- All duration-based metrics are now reported in seconds. This
- affects:
-
- | msec -> sec metrics |
- | -------------------------------------- |
- | python_gc_time |
- | python_twisted_reactor_tick_time |
- | synapse_storage_query_time |
- | synapse_storage_schedule_time |
- | synapse_storage_transaction_time |
-
- Several metrics have been changed to be histograms, which sort entries
- into buckets and allow better analysis. The following metrics are now
- histograms:
-
- | Altered metrics |
- | ------------------------------------------------ |
- | python_gc_time |
- | python_twisted_reactor_pending_calls |
- | python_twisted_reactor_tick_time |
- | synapse_http_server_response_time_seconds |
- | synapse_storage_query_time |
- | synapse_storage_schedule_time |
- | synapse_storage_transaction_time |
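-
- Histograms allow quantiles to be estimated from the bucket counts. As a
- sketch, assuming the standard Prometheus `_bucket` suffix and `le` label
- for histogram series, a rule estimating the 95th-percentile response time
- might look like this. The group and rule names are hypothetical.
-
- ```yaml
- groups:
-   - name: synapse_histogram_example   # hypothetical group name
-     rules:
-       # estimated 95th-percentile HTTP response time over the last 5 minutes
-       - record: synapse:http_server_response_time_seconds:p95_5m   # hypothetical rule name
-         expr: histogram_quantile(0.95, sum by (le) (rate(synapse_http_server_response_time_seconds_bucket[5m])))
- ```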
-
- Block and response metrics renamed for 0.27.0
- ---------------------------------------------
-
- Synapse 0.27.0 begins the process of rationalising the duplicate
- `*:count` metrics reported for the resource tracking for code blocks and
- HTTP requests.
-
- At the same time, the corresponding `*:total` metrics are being renamed,
- as the `:total` suffix no longer makes sense in the absence of a
- corresponding `:count` metric.
-
- To enable a graceful migration path, this release just adds new names
- for the metrics being renamed. A future release will remove the old
- ones.
-
- The following table shows the new metrics, and the old metrics which
- they are replacing.
-
- | New name | Old name |
- | ------------------------------------------------------------- | ---------------------------------------------------------- |
- | synapse_util_metrics_block_count | synapse_util_metrics_block_timer:count |
- | synapse_util_metrics_block_count | synapse_util_metrics_block_ru_utime:count |
- | synapse_util_metrics_block_count | synapse_util_metrics_block_ru_stime:count |
- | synapse_util_metrics_block_count | synapse_util_metrics_block_db_txn_count:count |
- | synapse_util_metrics_block_count | synapse_util_metrics_block_db_txn_duration:count |
- | synapse_util_metrics_block_time_seconds | synapse_util_metrics_block_timer:total |
- | synapse_util_metrics_block_ru_utime_seconds | synapse_util_metrics_block_ru_utime:total |
- | synapse_util_metrics_block_ru_stime_seconds | synapse_util_metrics_block_ru_stime:total |
- | synapse_util_metrics_block_db_txn_count | synapse_util_metrics_block_db_txn_count:total |
- | synapse_util_metrics_block_db_txn_duration_seconds | synapse_util_metrics_block_db_txn_duration:total |
- | synapse_http_server_response_count | synapse_http_server_requests |
- | synapse_http_server_response_count | synapse_http_server_response_time:count |
- | synapse_http_server_response_count | synapse_http_server_response_ru_utime:count |
- | synapse_http_server_response_count | synapse_http_server_response_ru_stime:count |
- | synapse_http_server_response_count | synapse_http_server_response_db_txn_count:count |
- | synapse_http_server_response_count | synapse_http_server_response_db_txn_duration:count |
- | synapse_http_server_response_time_seconds | synapse_http_server_response_time:total |
- | synapse_http_server_response_ru_utime_seconds | synapse_http_server_response_ru_utime:total |
- | synapse_http_server_response_ru_stime_seconds | synapse_http_server_response_ru_stime:total |
- | synapse_http_server_response_db_txn_count | synapse_http_server_response_db_txn_count:total |
- | synapse_http_server_response_db_txn_duration_seconds | synapse_http_server_response_db_txn_duration:total |
-
- Standard Metric Names
- ---------------------
-
- As of Synapse version 0.18.2, the format of the process-wide metrics has
- been changed to fit Prometheus standard naming conventions. Additionally,
- the units have been changed from milliseconds to seconds.
-
- | New name | Old name |
- | ---------------------------------------- | --------------------------------- |
- | process_cpu_user_seconds_total | process_resource_utime / 1000 |
- | process_cpu_system_seconds_total | process_resource_stime / 1000 |
- | process_open_fds (no 'type' label) | process_fds |
-
- The Python-specific counts of garbage collector performance have been
- renamed.
-
- | New name | Old name |
- | -------------------------------- | -------------------------- |
- | python_gc_time | reactor_gc_time |
- | python_gc_unreachable_total | reactor_gc_unreachable |
- | python_gc_counts | reactor_gc_counts |
-
- The Twisted-specific reactor metrics have been renamed.
-
- | New name | Old name |
- | -------------------------------------- | ----------------------- |
- | python_twisted_reactor_pending_calls | reactor_pending_calls |
- | python_twisted_reactor_tick_time | reactor_tick_time |
|