WhatIsCeilometer in cdent-rhat

20150130185319 cdent

Ceilometer, or OpenStack Telemetry, is the project responsible for collecting measurements of resources in an OpenStack cloud. The mission statement is:

To reliably collect measurements of the utilization of the physical and virtual resources comprising deployed clouds, persist these data for subsequent retrieval and analysis, and trigger actions when defined criteria are met.

This rather vague statement boils down to three major areas of functionality:

collecting, storing and making available samples (including statistical analysis of aggregates of samples)¹
collecting, storing and making available events²
creating and evaluating alarms based on sample thresholds and combinations thereof

Samples and events describe, or originate from, the other OpenStack services and are gathered in two ways:

polling (via the compute-agent and central-agent plus plugins)
notifications (sent over the AMQP bus and "heard" by the agent-notification service)

The System Architecture document has some pretty good images which describe the system in some detail:

[Image: overview]

Flexibility in the system is available along several dimensions:

Multiple storage engines are available and each large data type (meters, alarms, events) can use a different engine.
Polling and notification plugins can be added to gather whatever sorts of samples are needed.
A configurable pipeline is used to transform incoming samples into other samples (such as aggregate values) that can be sent on to other systems or kept and stored by Ceilometer.
Alarms, when triggered, call a webhook which can do whatever the creator of the webhook likes.
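
The pipeline idea in the list above, turning incoming samples into other samples, can be sketched as a rate-of-change transformation: a cumulative cpu meter (in ns) becomes a gauge of cpu utilization (in %). This is a minimal sketch and not Ceilometer's transformer code; the function name, the tuple layout and the sample values are assumptions made for illustration.

```python
from datetime import datetime


def rate_of_change(samples, vcpus=1):
    """Yield (timestamp, cpu percent) from (timestamp, cumulative cpu ns) pairs.

    Hypothetical helper: each output value is the cpu time consumed since
    the previous sample, divided by the wall-clock time that elapsed.
    """
    previous = None
    for timestamp, cpu_ns in samples:
        if previous is not None:
            prev_time, prev_ns = previous
            elapsed_ns = (timestamp - prev_time).total_seconds() * 1e9
            # Skip out-of-order samples and counter resets.
            if elapsed_ns > 0 and cpu_ns >= prev_ns:
                yield timestamp, 100.0 * (cpu_ns - prev_ns) / (elapsed_ns * vcpus)
        previous = (timestamp, cpu_ns)


# Made-up cumulative readings, one minute apart.
cumulative = [
    (datetime(2015, 1, 7, 17, 5, 12), 1_000_000_000),
    (datetime(2015, 1, 7, 17, 6, 12), 1_020_000_000),
    (datetime(2015, 1, 7, 17, 7, 12), 1_030_000_000),
]
derived = list(rate_of_change(cumulative))
for timestamp, percent in derived:
    print(timestamp.isoformat(), round(percent, 4))
```

Three cumulative readings yield two derived values, because the first reading only establishes a baseline.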

Scaling (and a degree of high availability) is achieved in the various services by running multiple agents, some of which use distributed group membership coordination to share work.

Ceilometer in Practice

The way to think about Ceilometer is as a service which gathers two types of data:

Samples, which are measurements of a particular thing at particular instants in time (e.g. the cpu usage % on instance X).
Events, which are statements that something happened at a particular time, in a particular context (e.g. a new instance was created by tenant X).
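
The difference between the two types of data can be made concrete with a small sketch. The class and field names below are illustrative assumptions and not Ceilometer's storage schema; the event type and trait name are likewise only examples.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Sample:
    """One measurement of one meter on one resource at one instant."""
    meter: str            # e.g. "cpu_util"
    meter_type: str       # "gauge" or "cumulative"
    unit: str             # e.g. "%"
    volume: float         # the measured value
    resource_id: str      # the thing being measured
    timestamp: datetime


@dataclass(frozen=True)
class Event:
    """A statement that something happened, with its context."""
    event_type: str       # what happened
    generated: datetime   # when it happened
    traits: dict = field(default_factory=dict)  # context, e.g. the tenant


sample = Sample(
    meter="cpu_util",
    meter_type="gauge",
    unit="%",
    volume=0.0333333333333,
    resource_id="03aae315-ed59-417a-9d33-d925a9dda9a9",
    timestamp=datetime(2015, 1, 7, 17, 7, 12),
)
event = Event(
    event_type="compute.instance.create.end",  # example event type
    generated=datetime(2015, 1, 7, 17, 0, 0),
    traits={"tenant_id": "X"},
)
print(sample)
print(event)
```

A sample always carries a numeric volume, while an event carries only a type, a time and whatever context is attached to it.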

In the usual scenario this information (and transformations thereof) is then used for two purposes:

billing (e.g. customer X has used N cpu hours across all their instances)
auto-scaling via alarms that evaluate the metering data (e.g. create a new instance of type Z when cpu load is over 80% for ten minutes)
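
The auto-scaling case above amounts to a threshold check over a time window. The following is a simplified sketch of that check, not Ceilometer's alarm evaluator: the function name and signature are assumptions, and the real service evaluates statistics over configured periods rather than raw samples. The three state names mirror the alarm states Ceilometer reports.

```python
from datetime import datetime, timedelta


def evaluate_alarm(samples, now, threshold=80.0, window=timedelta(minutes=10)):
    """Return 'alarm', 'ok' or 'insufficient data' for (timestamp, value) samples.

    Hypothetical rule: the alarm fires only if every sample inside the
    window ending at `now` is above the threshold.
    """
    recent = [value for ts, value in samples if now - window <= ts <= now]
    if not recent:
        return "insufficient data"
    return "alarm" if all(value > threshold for value in recent) else "ok"


now = datetime(2015, 1, 7, 17, 10)
# Ten made-up readings, one per minute, all at 90% cpu.
busy = [(now - timedelta(minutes=m), 90.0) for m in range(10)]
# The same readings, with the oldest one replaced by a quiet 50%.
mixed = busy[:-1] + [(now - timedelta(minutes=9), 50.0)]

print(evaluate_alarm(busy, now))
print(evaluate_alarm(mixed, now))
print(evaluate_alarm([], now))
```

Note that this rule would also fire on a single high sample in the window; a real deployment would likely require a minimum number of samples before acting.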

Caveat: A lot of the information that can be found about Ceilometer describes what is possible rather than what has been done. In part this is because there are very few public installations of Ceilometer being used at scale that can be pointed at and inspected³. As such, Ceilometer is often spoken about in terms of potential.

There is, however, quite a bit of potential. There is a huge ecosystem of polling agents and notification agents for gathering information from many systems (including information made available by IPMI and SNMP). For the most part these tools are oriented towards gathering information about the resources that an OpenStack deployment is hosting (instances hosted by Nova, hardware registered with Ironic, images in Glance, etc.). This information can then be used to get a view over those services.

For monitoring the health of those services, tools like Nagios are probably better positioned, especially if they are already deployed in the infrastructure. One way to think about it is that Nagios keeps track of what's running, while Ceilometer keeps track of what those things are doing and how much.

Ceilometer is a toolkit on which solutions can be built, but it is not, in and of itself, a solution.

Post-Deployment Monitoring

The monitoring plans list the following post-deployment situations that could need to be monitored.

Nodes per role with system load info
Compute (System load, CPU utilization)
Storage (swap and IO)
Network IO

This information, in various forms, can be gathered via the Ceilometer API. One strategy is to gather any of the information associated with a particular resource-id to see what meters are available (output truncated horizontally to save space).

$ ceilometer meter-list -q=resource_id=03aae315-ed59-417a-9d33-d925a9dda9a9
+--------------------------+------------+-----------+
| Name                     | Type       | Unit      |
+--------------------------+------------+-----------+
| cpu                      | cumulative | ns        |
| cpu_util                 | gauge      | %         |
| disk.ephemeral.size      | gauge      | GB        |
| disk.read.bytes          | cumulative | B         |
| disk.read.bytes.rate     | gauge      | B/s       |
| disk.read.requests       | cumulative | request   |
| disk.read.requests.rate  | gauge      | request/s |
| disk.root.size           | gauge      | GB        |
| disk.write.bytes         | cumulative | B         |
| disk.write.bytes.rate    | gauge      | B/s       |
| disk.write.requests      | cumulative | request   |
| disk.write.requests.rate | gauge      | request/s |
| instance                 | gauge      | instance  |
| instance:m1.tiny         | gauge      | instance  |
| memory                   | gauge      | MB        |
| vcpus                    | gauge      | vcpu      |
+--------------------------+------------+-----------+

And then get samples for a particular meter (the results can be restricted by a time window) from a particular resource:

$ ceilometer sample-list -m cpu_util -q=resource_id=03aae315-ed59-417a-9d33-d925a9dda9a9
+-----------------+------+---------------------+
| Volume          | Unit | Timestamp           |
+-----------------+------+---------------------+
| 0.0333333333333 | %    | 2015-01-07T17:07:12 |
| 0.0166666666667 | %    | 2015-01-07T17:06:12 |
| 0.0333333333333 | %    | 2015-01-07T17:05:12 |
| 0.0166666666667 | %    | 2015-01-07T17:04:12 |
| 0.0333333333333 | %    | 2015-01-07T17:03:12 |
| 0.0333333333333 | %    | 2015-01-07T17:02:12 |
| 0.133333333333  | %    | 2015-01-07T17:01:12 |
+-----------------+------+---------------------+
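
Once samples like these are in hand, simple aggregates are easy to compute. The sketch below parses the table printed above and derives count, min, max and average; the parsing helper is a hypothetical convenience for illustration and is not part of the ceilometer client.

```python
from statistics import mean

TABLE = """\
+-----------------+------+---------------------+
| Volume          | Unit | Timestamp           |
+-----------------+------+---------------------+
| 0.0333333333333 | %    | 2015-01-07T17:07:12 |
| 0.0166666666667 | %    | 2015-01-07T17:06:12 |
| 0.0333333333333 | %    | 2015-01-07T17:05:12 |
| 0.0166666666667 | %    | 2015-01-07T17:04:12 |
| 0.0333333333333 | %    | 2015-01-07T17:03:12 |
| 0.0333333333333 | %    | 2015-01-07T17:02:12 |
| 0.133333333333  | %    | 2015-01-07T17:01:12 |
+-----------------+------+---------------------+
"""


def parse_table(text):
    """Turn an ASCII table into a list of dicts keyed by the header row."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


rows = parse_table(TABLE)
volumes = [float(row["Volume"]) for row in rows]
stats = {
    "count": len(volumes),
    "min": min(volumes),
    "max": max(volumes),
    "avg": mean(volumes),
}
print(stats)
```

In practice it is likely preferable to ask the API for statistics directly rather than to scrape CLI output; the parsing here only keeps the example self-contained.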

See SnmpWithCeilometer for information on how to configure Ceilometer to monitor arbitrary hosts with SNMP.

Kilo

The largest feature that will come out of Kilo is something called Gnocchi, which provides an HTTP API for metrics: what amounts to a Time Series Database as a Service.

Gnocchi was created in recognition of the fact that traditional databases and NoSQL-style storage solutions are not oriented toward the metering use case. Beyond a relatively small number of metrics, performance begins to suffer. Gnocchi provides an architecture for storing and indexing meters in a variety of backends that separates the information about a resource which has meters from the samples which are the values of those meters. This allows more efficient indexing and other data handling strategies.
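
The separation described above can be illustrated with a toy in-memory store. This is only a sketch of the idea and not Gnocchi's API or storage format: the class, its methods and the use of ISO timestamp strings are all assumptions made for illustration.

```python
import uuid
from bisect import insort
from collections import defaultdict


class ToyMetricStore:
    """Keeps resource information apart from the measured values."""

    def __init__(self):
        # Index: resource_id -> attributes plus a map of metric name -> metric id.
        self.index = {}
        # Measures: metric id -> time-ordered list of (timestamp, value).
        self.measures = defaultdict(list)

    def add_resource(self, resource_id, **attrs):
        self.index[resource_id] = {"attrs": attrs, "metrics": {}}

    def add_measure(self, resource_id, metric, timestamp, value):
        metrics = self.index[resource_id]["metrics"]
        if metric not in metrics:
            metrics[metric] = uuid.uuid4().hex
        # insort keeps each series ordered by timestamp.
        insort(self.measures[metrics[metric]], (timestamp, value))

    def get_measures(self, resource_id, metric, start=None, stop=None):
        metric_id = self.index[resource_id]["metrics"][metric]
        return [
            (ts, value)
            for ts, value in self.measures[metric_id]
            if (start is None or ts >= start) and (stop is None or ts <= stop)
        ]


store = ToyMetricStore()
store.add_resource("instance-1", flavor="m1.tiny")
store.add_measure("instance-1", "cpu_util", "2015-01-07T17:07:12", 0.033)
store.add_measure("instance-1", "cpu_util", "2015-01-07T17:06:12", 0.017)
series = store.get_measures("instance-1", "cpu_util")
print(series)
```

Because the series contain only timestamps and values, resource attributes are stored once in the index rather than repeated on every sample.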

In practice a user of Ceilometer will use the API and the command line (which itself uses the API), both of which hide the details of the backend being used.
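
Since the command line is a thin layer over the HTTP API, a CLI query such as the cpu_util sample listing corresponds roughly to a single GET request. The sketch below only builds the URL, using the v2 API's `q.field`, `q.op` and `q.value` query parameters; the endpoint host is hypothetical, port 8777 is assumed as the conventional default, and a real request would also need a Keystone token in an `X-Auth-Token` header.

```python
from urllib.parse import urlencode


def samples_url(endpoint, meter, resource_id):
    """Build a v2 API URL for the samples of one meter on one resource."""
    query = urlencode([
        ("q.field", "resource_id"),
        ("q.op", "eq"),
        ("q.value", resource_id),
    ])
    return f"{endpoint.rstrip('/')}/v2/meters/{meter}?{query}"


url = samples_url(
    "http://controller:8777",  # hypothetical endpoint
    "cpu_util",
    "03aae315-ed59-417a-9d33-d925a9dda9a9",
)
print(url)
```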

¹ A sample is the value, at an instant in time, of some measurable property of a cloud resource, for example the current cpu utilization of a node. Multiple samples are gathered over time at regular intervals to represent the state of the measurement as a meter of that resource.
² An event represents the state of an object in an OpenStack service at a point in time when something of interest has occurred.
³ This video demonstrates a tool, called hyperglance, which uses Ceilometer data to provide overviews of instance nodes, including visualizations of nodes that have a cpu usage that is over some threshold.