EN
Currency:
EUR – €
Choose a currency
  • Euro EUR – €
  • United States dollar USD – $
VAT:
OT 0%
Choose your country (VAT)
  • OT All others 0%
Choose a language
  • Choose a currency
    Choose you country (VAT)
    Dedicated Servers
  • Instant
  • Custom
  • Single CPU servers
  • Dual CPU servers
  • Servers with 4th Gen CPUs
  • Servers with AMD Ryzen and Intel Core i9
  • Storage Servers
  • Servers with 10Gbps ports
  • Hosting virtualization nodes
  • GPU
  • Sale
  • VPS
  • General VPS
  • Performance VPS
  • Edge VPS
  • Storage VPS
  • VDS
  • Ryzen VPS
  • GPU
  • Dedicated GPU server
  • VM with GPU
  • Tesla A100 80GB & H100 Servers
  • Sale
    Apps
    Cloud
  • VMware and RedHat's oVirt Сlusters
  • Proxmox VE
  • Colocation
  • Colocation in the Netherlands
  • Remote smart hands
  • Services
  • L3-L4 DDoS Protection
  • Network equipment
  • IPv4 and IPv6 address
  • Managed servers
  • SLA packages for technical support
  • Monitoring
  • Software
  • VLAN
  • Announcing your IP or AS (BYOIP)
  • USB flash/key/flash drive
  • Traffic
  • Hardware delivery for EU data centers
  • AI Chatbot Lite
  • About
  • Careers at HOSTKEY
  • Server Control Panel & API
  • Data Centers
  • Network
  • Speed test
  • Hot deals
  • Sales contact
  • Reseller program
  • Affiliate Program
  • Grants for winners
  • Grants for scientific projects and startups
  • News
  • Our blog
  • Payment terms and methods
  • Legal
  • Abuse
  • Looking Glass
  • The KYC Verification
  • Hot Deals

    31.10.2022

    Using Prometheus + Alertmanager + Node Exporter to monitor a company's geo-distributed infrastructure.

    server one
    HOSTKEY
    Rent dedicated and virtual servers with instant deployment in reliable TIER III class data centers in the Netherlands and the USA. Free protection against DDoS attacks included, and your server will be ready for work in as little as 15 minutes. 24/7 Customer Support.

    Author: Senior Devops. Lead Infrastructure Specialist Hostkey Nikita Zubarev

    Hostkey provides servers for rent - this imposes on us, the employees of the company, the obligation to control the quality of the equipment. One of the key elements of maintaining a large infrastructure is an effective monitoring system that allows you to quickly identify server failures. We want to share our experience in implementing and using various tools to monitor the operation of equipment.

    In this article, we will briefly review the options for installing Prometheus, Alertmanager and Node Exporter federation, and dwell on some of the features and settings. You can use the installation from the docker-compose file or deploy the system in a Kubernetes cluster. Our task is to collect metrics of the servers and services of the company's infrastructure, store them and respond to alerts. To solve these problems, a database is needed.

    We chose Prometheus for a number of reasons:

    • it is a popular tool with extensive documentation and a large community;
    • open source code written in Golang;
    • compatible with Docker etc.

    Prometheus is a system and service monitoring system. It collects metrics from configured targets at specified time intervals, evaluates rule expressions, displays results, and can trigger alerts when certain conditions are met. We use ansible-playbook to install all system components.

    One of the features of our company is a geo-distributed infrastructure: HOSTKEY has equipment in several data centers in Moscow, Amsterdam and New York. Hierarchical federation allows a Prometheus server to collect selected time series from another Prometheus server, as well as scale to environments with multiple data centers and nodes. In this case, the federation topology resembles a tree, where higher level Prometheus servers collect aggregated time series data from a larger number of downstream servers.

    Configuration example from merging metrics from one server to another:

    scrape_configs:
    - job_name: 'federate'
    	scrape_interval: 15s
    
    	honor_labels: true
    	metrics_path: '/federate'
    
    	params:
    		'match[]':
    			- '{job="prometheus"}'
    			- '{__name__=~"job:.*"}'
    
    	static_configs:
    		- targets:
    			- 'ru-<instance_address>:9090'
    			- 'nl-<instance_address>:9090'
    			- 'us-<instance_address>:9090'

    As a source of basic metrics for Prometheus, Node Exporter is used - an http application that collects operating system metrics.
    Prometheus collects data from one or more instances of the Node Exporter (we will cover how to write your own exporter in the following articles). Node Exporter must be installed on every host.

    After the launch, the metrics are available on port 9100:

    In order for the metric to get into the Prometheus database, you need to describe the configuration in the file:
    ru-federation-app01a / # vim /etc/prometheus/prometheus.yml

     - job_name: 'infra.hostkey'
    	static_configs:
    		- targets:
    			- <instance_address>:9100
    			- <instance_address>:9100
    			- <instance_address>:9100

    The file must be in .yaml format (we'll cover configuration management in future articles).

    After restarting the Prometheus service, you can open its user interface in the browser "<server_address>:9090". The "/targets" page displays the status of the systems you are getting metrics from.

    Visualization

    Grafana —is a frontend to various time series db: Graphite, InfluxDB, Prometheus — with beautiful and understandable graphs, and most importantly - with many templates suitable for 90% of possible tasks. The .json format makes it easy to customize charts. For convenience, we proxy the Grafana port through nginx and enable ldap authorization.

    Almost all Node Exporter default values ??are presented as a graph. Import dashboard:

    Alert display

    Create an alert rule configuration <instance_address>.yml.

    In fact, you can control the alerting of any parameter available in the node_exporter:

    groups:
    	- name: <instance_address>
    		rules:
    		- alert: InstanceDown
    			expr: up{job="<instance_address>"} == 0
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: "Instance {{ .instance }} down"
    				description: "{{ .instance }} of job {{ .job }} has been down for more than 5 minutes."
    	
    	#==========CPU==================================================================
    		- alert: NodeCPUUsage
    			expr: (100 - (irate(node_cpu{mode="idle", job="<instance_address>"}[5m]) * 100)) > 50
    			for: 2m
    			labels:
    				severity: page
    			annotations:
    				summary: High CPU usage detected CPU usage is above 75%
    	# ==============Memory==========================================================
    		- alert: NodeSwapUsageSwap
    			expr: (((node_memory_SwapTotal-node_memory_SwapFree{job="<instance_address>"})/node_memory_SwapTotal{job="<instance_address>"})*100) > 30
    			for: 2m 
    			labels:
    				severity: page
    			annotations:
    				summary: Swap usage detected Swap usage usage is above 50%
    		- alert: NodeMemoryUsageMemFree
    			expr: (((node_memory_MemTotal-node_memory_MemFree-node_memory_Cached{job="<instance_address>"})/(node_memory_MemTotal{job="<instance_address>"})*100)) > 30
    			for: 2m
    			labels:
    				severity: page
    			annotations:
    				summary: High memory usage detected, Memory usage is above 50%
    	#==============Load=============================================================
    	
    		- alert: nodeLoad1
    			expr: node_load1{job="<instance_address>"} > 0.7
    			for: 1m
    			labels:
    				severity: page
    			annotations:
    				summary: Load
    	#================Disk space Used ================================================
    		- alert: diskSpaceUsed
    			expr: (100.0 - 100 * (node_filesystem_avail{job="<instance_address>"} / node_filesystem_size{job="<instance_address>"})) > 80
    			for: 10m
    			labels:
    				severity: page
    			annotations:
    				summary: Disk space userd 80
    	#=============nodeContrack========================================
    		- alert: nodeContrack
    			expr: node_nf_conntrack_entries{job="<instance_address>"} > 200
    			for: 10m
    			labels:
    				severity: page
    			annotations:
    				summary: nodeContrack
    	#=============nodeCntextSwitches ========================================
    		- alert: nodeCntextSwitches
    			expr: irate(node_context_switches{job="<instance_address>"}[5m]) > 100
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: nodeCntextSwitches
    	#=============Disk Utilization per Device ========================================
    		- alert: DiskUtilizationPerDevice
    			expr: irate(node_disk_io_time_ms{job="<instance_address>"}[5m])/10 > 0.2
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: DiskUtilizationPerDevice
    	#============Disk IOs per Device ========================================
    		- alert: DiskIOsPerDeviceRead
    			expr: irate(node_disk_reads_completed{job="<instance_address>"}[5m]) >10
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: DiskIOsPerDevice
    	
    		- alert: DiskIOsPerDeviceWrite
    			expr: irate(node_disk_writes_completed{job="<instance_address>"}[5m]) > 10
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: DiskIOsPerDevice
    	#===========Disk Throughput per Device========================================
    		- alert: DiskThroughputPerDeviceReads
    			expr: irate(node_disk_sectors_read{job="<instance_address>"}[5m]) * 512 >10000000
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: DiskIOsPerDevice
    		- alert: DiskThroughputPerDeviceWrites
    			expr: irate(node_disk_sectors_written{job="<instance_address>"}[5m]) * 512 > 10000000
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: DiskIOsPerDevice
    	#===========Network Traffic========================================
    		- alert: NetworkTrafficReceive
    			expr: irate(node_network_receive_bytes{job="<instance_address>"}[5m])*8 > 5000
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: NetworkTrafficReceive
    		- alert: NetworkTrafficTransmit
    			expr: irate(node_network_transmit_bytes{job="<instance_address>"}[5m])*8 > 5000
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: NetworkTrafficTransmit
    	
    	#===========Netstat========================================
    		- alert: Netstat
    			expr: node_netstat_Tcp_CurrEstab{job="<instance_address>"} > 20
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: Netstat
    	
    	#===========UDP Stats============================
    		- alert: UDPStatsInDatagrams
    			expr: irate(node_netstat_Udp_InDatagrams{job="<instance_address>"}[5m]) > 50
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: UDPStats
    		- alert: UDPStatsInErrors
    			expr: irate(node_netstat_Udp_InErrors{job="<instance_address>"}[5m]) > 20
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: UDPStats
    		- alert: UDPStatsOutDatagrams
    			expr: irate(node_netstat_Udp_OutDatagrams{job="<instance_address>"}[5m]) > 50
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: UDPStats
    		- alert: UDPStatsNoPorts
    			expr: irate(node_netstat_Udp_NoPorts{job="<instance_address>"}[5m]) > 20
    			for: 5m
    			labels:
    				severity: page
    			annotations:
    				summary: UDPStats
    	

    We connect the alert configuration in prometheus.yml:

    rule_files:
    - "<instance_address>.yml"

    Alertmanager

    Alerts can be redirected to almost any system: email, Telegram, Slack, Rocket.Chat.

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmamager-app01a.infra.hostkey.ru:9093

    Example for Rocket.Chat:

    alertmanager-app01a ~ # vin /etc/alertmanager/alertmanager.yml
    
    				route:
    						repeat_interval: 5m
    						group_interval: 5m
    						receiver: 'rocketchat'
    				
    				receivers:
    						- name: 'rocketchat'
    							webhook_configs:
    									- send_resolved: false
    										url: 'https://chat_address/hooks/devtest'
    				
    				

    Consider the main contents of the configuration file:

    global:
    	route:
    		group_b: ['what rules to group the parameters by']
    		group_wait: "delay before sending a notification for a group"
    		group_interval: "time interval before sending second message for a group"
    		repeat_interval: "time interval before sending repeated message"
    		receive: "the name of the messaging method"
    	receivers:
    		- nam: "the name of the messaging method"
    		configuration

    All alerts received by the AlertManager must be grouped so as not to send information about the same events several times. The group_by directive specifies which fields to use during grouping. After creating a new alert group, the group_wait parameter is connected to the case. It specifies the time to wait before sending the first message. The parameter allows you to receive the entire set of group alerts in one message, rather than a separate message for each alert.

    The group_interval parameter specifies how long to wait after sending the first message for a group before sending repeated messages from the same group. However, not only does it affect the interval between messages, there is also the repeat_interval parameter, which indicates after what period of time the same alert can be sent again.

    We get:

    Prometheus AlertManager

    A visualization toolbar based on a single dashboard in our company is used to monitor the shift on duty:

    As a result, we got a working monitoring system for a geo-distributed infrastructure, which we later adapted to our specifics (we will talk about this experience in future articles). The flexibility of the settings of the described services allows you to adapt the monitoring system to the specific needs of any company and significantly reduce the labor costs of engineers and other specialists to maintain the infrastructure.

    Rent dedicated and virtual servers with instant deployment in reliable TIER III class data centers in the Netherlands and the USA. Free protection against DDoS attacks included, and your server will be ready for work in as little as 15 minutes. 24/7 Customer Support.

    Other articles

    28.11.2024

    OpenWebUI Just Got an Upgrade: What's New in Version 0.4.5?

    OpenWebUI has been updated to version 0.4.5! New features for RAG, user groups, authentication, improved performance, and more. Learn how to upgrade and maximize its potential.

    25.11.2024

    How We Replaced the IPMI Console with HTML5 for Managing Our Servers

    Tired of outdated server management tools? See how we replaced the IPMI console with an HTML5-based system, making remote server access seamless and efficient for all users.

    25.10.2024

    TS3 Manager: What Happens When You Fill in the Documentation Gaps

    Having trouble connecting to TS3 Manager after installing it on your VPS? Managing your TeamSpeak server through TS3 Manager isn't as straightforward as it might seem. Let's troubleshoot these issues together!

    16.09.2024

    10 Tips for Open WebUI to Enhance Your Work with AI

    Unleash the true power of Open WebUI and transform your AI workflow with these 10 indispensable tips.

    27.08.2024

    Comparison of SaaS solutions for online store on Wix and WordPress.com versus an on-premise solution on a VPS with WordPress and WooCommerce

    This article compares the simplicity and cost of SaaS platforms like Wix and WordPress.com versus the flexibility and control of a VPS with WordPress and WooCommerce for e-commerce businesses.

    HOSTKEY Dedicated servers and cloud solutions Pre-configured and custom dedicated servers. AMD, Intel, GPU cards, Free DDoS protection amd 1Gbps unmetered port 30
    4.3 67 67
    Upload