TestCeiloRedisPackstack

20150319002438 cdent  

Notes on testing the redis-based coordination in ceilometer. If you just want the bad news, jump ahead to #things-dont-work.

General Setup

  • Create four Fedora 20 VMs, 10.0.0.[2-5], with updates applied.
  • Choose one as the packstack master (I chose 10.0.0.2) and ssh into it. The user may be fedora or ec2-user or something else; it depends on where the base image comes from.
  • Adjust /root/.ssh/authorized_keys on all hosts so that root can be ssh'd (see the sketch after this list). On many cloud images this will mean removing: command="<some message here>"
  • Turn off selinux: sudo setenforce 0
  • On the host chosen as master, follow steps 1 and 2 at RDO Quickstart to set up the necessary repos and install packstack:

    sudo yum install -y https://rdo.fedorapeople.org/rdo-release.rpm
    sudo yum install -y openstack-packstack
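
A quick way to confirm the root ssh access set up above actually works from the master (a minimal sketch; the host list just matches the addresses used here):

    # Confirm root ssh works from the master to every host.
    for h in 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5; do
        ssh -o BatchMode=yes root@$h hostname || echo "root ssh to $h failed"
    done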

Testing allinone

Install

  • Run packstack --allinone (Note: I did this with my modified code)

Confirm

  • Shortly after the packstack installation starts, but before it finishes, it is possible to inspect the hieradata that will be provided to the manifest files to see if it contains the expected settings. It will be in a directory that looks something like /var/tmp/packstack/20150318-175543-S_GQje/hieradata (see the sketch after this list).
  • Confirm that redis and the ceilometer central agent are running, but no redis sentinel: ps ax | egrep '(redis|ceilometer-agent-central)'
  • Confirm the coordination backend url is present and doesn't use sentinel: sudo grep backend_url /etc/ceilometer/ceilometer.conf
  • Confirm the central agent is using coordination:

    grep 'Coordination backend started successfully' /var/log/ceilometer/central.log
    grep 'Joined partitioning group central-global' /var/log/ceilometer/central.log

  • Confirm some samples have been gathered by the central pollster (this may take up to ten minutes before there are results):

    ceilometer sample-list -m storage.objects.containers
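
For the hieradata inspection mentioned above, a grep along these lines works as a spot-check (a sketch; the exact key names in the generated hieradata are an assumption, and the directory name will differ per run):

    # Look for coordination/redis-related settings in the generated hieradata.
    sudo grep -ri 'coordination\|redis' /var/tmp/packstack/*/hieradata/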

Testing Multi-Node

Install

Note: With four hosts this will take quite a while.

  • Generate an answerfile: packstack --gen-answer-file=test.pack.txt
  • Edit the answer file in a way that corresponds with this diff (an illustrative sketch follows this list).
  • Run packstack: packstack --use-answer-file=test.pack.txt. Wait. Keep waiting. If your VMs are too large or your compute host too slow, this may time out. Try again.
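
For illustration only (the real changes are in the diff linked above; these key names are taken from packstack answer files of roughly this era and may differ in your version), the edits amount to spreading the services across the four hosts, along these lines:

    # Hypothetical example -- not the actual diff. Check the key names in
    # your generated answer file before editing.
    sed -i \
        -e 's/^CONFIG_CONTROLLER_HOST=.*/CONFIG_CONTROLLER_HOST=10.0.0.2/' \
        -e 's/^CONFIG_COMPUTE_HOSTS=.*/CONFIG_COMPUTE_HOSTS=10.0.0.3,10.0.0.4,10.0.0.5/' \
        test.pack.txt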

Confirm

At this stage I ran into some issues: the agents were unable to join the coordination system. The version of tooz installed via the packstack dependency tree was 0.9.0, which does not support using redis-sentinel. The current version is 0.13.0; 0.11.0 is the minimum required to make use of the redis-sentinel support. I addressed these problems by:

  • installing the latest version of tooz
  • restarting the openstack-ceilometer-central service
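
Concretely that was something like the following (a sketch; it assumes pip is available on the controller and that the service name matches the packstack install):

    # On the controller: upgrade tooz and restart the central agent.
    sudo pip install -U tooz
    sudo systemctl restart openstack-ceilometer-central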

Once that was restarted I was able to confirm a few things.

On the controller host

  • Check the central.log as described above.
  • Check that both redis and redis-sentinel are running: ps ax | grep redis
  • Look at /etc/redis-sentinel.conf to see what information sentinel has gathered about the masters and slaves (see the sketch after this list). Assuming nothing has died yet, there should be an entry that says the master is 10.0.0.2 and that known slaves and sentinels are on 3, 4 and 5.
  • Check backend_url in /etc/ceilometer/ceilometer.conf. It should look something like backend_url=redis://10.0.0.5:26379?sentinel=mymaster&sentinel_fallback=10.0.0.4:26379&sentinel_fallback=10.0.0.3:26379&sentinel_fallback=10.0.0.2:26379.
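
The sentinel-maintained entries can be pulled out with a grep. The output below is illustrative only (sentinel rewrites this file itself, so the exact lines, values and ordering will vary):

    sudo grep '^sentinel' /etc/redis-sentinel.conf
    # Expect something along these lines:
    #   sentinel monitor mymaster 10.0.0.2 6379 2
    #   sentinel known-slave mymaster 10.0.0.3 6379
    #   sentinel known-sentinel mymaster 10.0.0.4 26379 <runid>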

On the other nodes

  • Check redis and redis-sentinel as described above.

Confirm Redis Failover

This makes sure that the sentinels are doing their jobs. You can watch the activity by looking at the logs in /var/log/redis on any host.

  • Violently kill the redis server on the master host: sudo pkill -9 redis-server
  • At this point the sentinels will notice that the service is down but will not fail over because of the relatively long delay set in redis-sentinel.conf (if you do not see the timeouts in your conf it is because sentinel has rewritten the conf and does not write out defaults). This long timeout is used because the agents generally use long polling periods [1].
  • /var/log/ceilometer/central.log will start spewing tooz errors about being unable to send a heartbeat.
  • aw crap
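
A useful way to watch the failover from outside the logs is to ask a surviving sentinel which master it currently sees (redis-cli and its sentinel subcommand are standard; mymaster matches the backend_url above):

    # Ask the sentinel on 10.0.0.3 which master it currently knows about.
    redis-cli -h 10.0.0.3 -p 26379 sentinel get-master-addr-by-name mymaster
    # Before failover this returns 10.0.0.2 6379; after a successful
    # failover it should return one of the former slaves.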

Things don't work

It's getting late, so I'm afraid this will be confusing; I will try to clarify in the morning.

As far as I can tell, ceilometer and tooz never conspire to notice that the redis server is down and that it is necessary to re-ask the sentinel server for a new master. I'm relatively certain that I had checked on this in the past, but it's looking like that is not the case.

When the master is killed, what's supposed to happen is that the coordination agent will recognize that it either cannot send a heartbeat or cannot request group membership. At that point it should ask to rejoin the group using its initial configuration.

The problems appear to be here and here or here (these are references to Kilo code, but the Juno code is much the same). In both the heartbeat and get-members cases, what that code says is "If I get an error, log it and carry on". This means we never realize we need to recreate a redis client, and it is through recreating the client that the sentinel magic works.

If I change the exception handling (in both places) to something like this:

        try:
            self._coordinator.heartbeat()
        except (AttributeError, tooz.coordination.ToozConnectionError):
            LOG.warn(_LW('Coordination heartbeat reconnection required.'))
            self._started = False
        except tooz.coordination.ToozError:
            LOG.exception(_LE('Error sending a heartbeat to coordination '
                              'backend.'))

I appear to get the desired failover. The AttributeError was thrown in because sometimes an internal variable in the tooz driver class becomes None.
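
With the patched agent restarted, a quick way to check for the new behavior (a sketch; the log message matches the warning added in the snippet above):

    # Kill the current redis master again and watch for the reconnection
    # warning from the patched code.
    sudo pkill -9 redis-server
    grep 'Coordination heartbeat reconnection required' /var/log/ceilometer/central.log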

I don't know if this is anything close to the right fix, and I'm too tired now to figure it out (it is past midnight).


  1. We may wish to investigate the default timeouts that are being used; the current settings can generate a lot of noise in the logs.