Author: Stepan Vakheta, DevOps specialist at the Hostkey company
At Hostkey we use oVirt as our main virtualization system. The platform has to stay reliable even as the infrastructure keeps growing to dozens and even hundreds of physical servers. In this article, we briefly describe our company's approach to monitoring oVirt certificates.
In past articles, we described how to use Prometheus + Alertmanager + Node Exporter, and how to monitor HTTP and SSL with the Prometheus Blackbox Exporter.
Today we are going to talk about monitoring the certificates stored locally on the two main components of oVirt: oVirt Engine and oVirt Node. These certificates secure the communication between the two components.
- The oVirt Engine is the central management component that controls all virtualization hosts, disk shares and virtual networks.
- oVirt Node is a component installed on each individual host that manages all the resources of that host and the virtual machines running on it.
Depending on the architecture, oVirt nodes can be combined into clusters. In this case, it is important to maintain a high level of reliability of communication between system components.
Communication between the oVirt Engine and oVirt hosts is performed over an encrypted SSL connection based on the certificates of these components. Depending on the oVirt version, the validity period of these certificates may vary: before version 4.5 it was 398 days, and from version 4.5 it has been increased to 5 years.
It is important not to miss the next certificate renewal. Once the certificates expire, the Engine and the hosts can no longer communicate, which makes it impossible to manage virtual machines and takes considerable time to recover from.
The best solution to a problem is to prevent it from occurring in the first place. To that end, we collect the necessary metrics with SSL Exporter: it accepts local files as a probe target, which is ideal for our task.
After installing and launching the exporter, it is necessary to define the target parameters (targets) for each of the system components. According to the Documentation, the certificates of interest for each of the components are located in the following paths:
- for ovirt-engine — /etc/pki/ovirt-engine;
- for ovirt-host — /etc/pki/vdsm/ and /etc/pki/libvirt/.
The exporter can match and probe multiple files at once using glob patterns (via the doublestar package), which we will use in our query.
Target parameter for the oVirt Engine:
http://<engine_address>:9219/probe?module=file&target=/etc/pki/ovirt-engine/**/**.pem
Target parameter for the oVirt Hosts:
http://<node_address>:9219/probe?module=file&target=/etc/pki/vdsm/**/**.pem
http://<node_address>:9219/probe?module=file&target=/etc/pki/libvirt/**/**.pem
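The module=file part of these URLs selects the exporter's file prober. To our understanding, this module is among the exporter's built-in defaults, so no configuration file is required. If the exporter is started with its own configuration file (the --config.file flag), a declaration along the following lines should keep the module available. Treat it as a sketch and check it against the README of the exporter version in use:

```yaml
# Hypothetical ssl_exporter configuration file, passed via --config.file
modules:
  file:
    # the file prober reads certificates from local paths (globs allowed)
    prober: file
```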
A sample of the metrics collected:
Next, we describe the scrape configuration and add it to the Prometheus configuration file. For clarity, we split it into separate jobs by job_name, which also makes the alerts easier to tell apart in the Alertmanager panel:
/etc/prometheus/prometheus.yml
- job_name: ssl_file_engine
  metrics_path: /probe
  params:
    module:
      - file
    target:
      - /etc/pki/ovirt-engine/**/**.pem
  static_configs:
    - targets:
        - engine_address:9219
        - engine_address:9219
- job_name: ssl_file_vdsm_node
  metrics_path: /probe
  params:
    module:
      - file
    target:
      - /etc/pki/vdsm/**/**.pem
  static_configs:
    - targets:
        - node_address:9219
        - node_address:9219
- job_name: ssl_file_libvirt_node
  metrics_path: /probe
  params:
    module:
      - file
    target:
      - /etc/pki/libvirt/**/**.pem
  static_configs:
    - targets:
        - node_address:9219
        - node_address:9219
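The scrape jobs only collect the metrics. For alert rules to be evaluated and delivered, the same prometheus.yml typically also lists the rule files and an Alertmanager endpoint. A minimal sketch follows; the /etc/prometheus/rules/ directory and the alertmanager_address placeholder are assumptions made for illustration, and 9093 is simply Alertmanager's default port:

```yaml
# Fragment of /etc/prometheus/prometheus.yml (top-level keys).
# The rules directory and alertmanager_address are placeholders.
rule_files:
  # rule_files accepts glob patterns, so one line can cover all rule files
  - /etc/prometheus/rules/ssl_file_*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager_address:9093
```

Running promtool check config /etc/prometheus/prometheus.yml before reloading Prometheus validates both the main file and the rule files it references.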
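Next, we describe the configuration files with the alerting rules. The metric we are interested in is the certificate expiration date.
Let's add a rule that fires when a certificate has 70 days or less left. The ssl_file_cert_not_after metric holds the expiration time as a Unix timestamp, so subtracting time() gives the number of seconds remaining, and 86400 * 70 is 70 days expressed in seconds.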
ssl_file_engine.yml
groups:
  - name: ssl_file_engine
    rules:
      - alert: SSLCertExpiringSoon
        expr: ssl_file_cert_not_after{job="ssl_file_engine"} - time() < 86400 * 70
        for: 10m
        labels:
          severity: critical
        annotations:
          description: "SSL certificate will expire in {{ $value | humanizeDuration }} (instance {{ $labels.instance }}) (file {{ $labels.file }})"
ssl_file_libvirt_node.yml
groups:
  - name: ssl_file_libvirt_node
    rules:
      - alert: SSLCertExpiringSoon
        expr: ssl_file_cert_not_after{job="ssl_file_libvirt_node"} - time() < 86400 * 70
        for: 10m
        labels:
          severity: critical
        annotations:
          description: "SSL certificate will expire in {{ $value | humanizeDuration }} (instance {{ $labels.instance }}) (file {{ $labels.file }})"
ssl_file_vdsm_node.yml
groups:
  - name: ssl_file_vdsm_node
    rules:
      - alert: SSLCertExpiringSoon
        expr: ssl_file_cert_not_after{job="ssl_file_vdsm_node"} - time() < 86400 * 70
        for: 10m
        labels:
          severity: critical
        annotations:
          description: "SSL certificate will expire in {{ $value | humanizeDuration }} (instance {{ $labels.instance }}) (file {{ $labels.file }})"
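Since the three rule files differ only in the job label, they could also be collapsed into a single rule with a regular-expression matcher. The sketch below is one possible alternative, to be used instead of the three files rather than alongside them. It assumes the job names from the scrape configuration above and adds the job label to the description. The second rule is an optional extra that relies on the exporter's ssl_probe_success metric to flag probes that fail; the file name, group name, alert name, and severity are illustrative.

```yaml
# ssl_file_all.yml (hypothetical file name): one group covering all three jobs
groups:
  - name: ssl_file_all
    rules:
      - alert: SSLCertExpiringSoon
        # job=~"ssl_file_.*" matches ssl_file_engine, ssl_file_vdsm_node
        # and ssl_file_libvirt_node
        expr: ssl_file_cert_not_after{job=~"ssl_file_.*"} - time() < 86400 * 70
        for: 10m
        labels:
          severity: critical
        annotations:
          description: "SSL certificate will expire in {{ $value | humanizeDuration }} (job {{ $labels.job }}) (instance {{ $labels.instance }}) (file {{ $labels.file }})"
      - alert: SSLFileProbeFailed
        # ssl_probe_success is 0 when the exporter could not complete the probe
        expr: ssl_probe_success{job=~"ssl_file_.*"} == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: "SSL Exporter file probe failed (job {{ $labels.job }}) (instance {{ $labels.instance }})"
```

Keeping separate files per job, as we do, makes it easy to set different thresholds per component later; the combined form trades that flexibility for less duplication.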
Once a certificate has 70 days or less left before it expires, we will get the following visualization in the Alertmanager panel:
Monitoring in this way helps prevent failures caused by late replacement of SSL certificates and keeps the virtual infrastructure running stably. With a few simple steps, you can avoid problems that would otherwise cause downtime for a large number of resources.