RabbitMQ tuning & monitoring for OpenStack

RabbitMQ config tuning:

  • {cluster_partition_handling, pause_minority},
  • Avoid the rabbitmq_management plugin; if you do use it, disabling message rate collection is HIGHLY recommended. This dramatically cuts down on the work RabbitMQ has to do to maintain the management database:
    {rabbitmq_management, [ {rates_mode, none} ]},
  • Also raise the statistics collection interval and the memory high watermark:
    {collect_statistics_interval, 30000}, {vm_memory_high_watermark, 0.6},
  • Use a RabbitMQ cluster with mirrored queues and the policy "ha-mode: all"; it works well in production. (+1 x7)
  • Increase the open file (file descriptor) limits: http://stackoverflow.com/a/23585144
  • [Try] RabbitMQ Autocluster with K8s or etcd v2: https://github.com/rabbitmq/rabbitmq-autocluster
  • Clients should connect to the RabbitMQ cluster nodes directly, not through HAProxy
  • [Idea] Use separate (containerized) RabbitMQ cluster(s) for each OpenStack service
  • Config used by CERN:
* rabbit hiera configuration
rabbitmq::cluster_partition_handling: 'autoheal'  
rabbitmq::config_kernel_variables:  
  inet_dist_listen_min: 41055
  inet_dist_listen_max: 41055
rabbitmq::config_variables:  
  collect_statistics_interval: 60000
  reverse_dns_lookups: true
  vm_memory_high_watermark: 0.8
rabbitmq::environment_variables:  
  SERVER_ERL_ARGS: "'+K true +A 128 +P 1048576'"
rabbitmq::tcp_keepalive: true  
rabbitmq::tcp_backlog: 4096

* package versions
erlang-kernel-18.3.4.4-1  
rabbitmq-server-3.6.5-1  

http://www.mail-archive.com/openstack-operators@lists.openstack.org/msg07299.html

Ref:

  1. https://github.com/rabbitmq/rabbitmq-server/blob/master/docs/rabbitmq.config.example
  2. https://www.rabbitmq.com/configure.html#configuration-file
  3. https://raw.githubusercontent.com/michaelklishin/openstack-summit-tokyo-2015/master/RabbitMQ%20Operations.pdf
  4. https://www.linkedin.com/pulse/13-rabbitmq-facts-i-wish-knew-from-start-gideon-arom
  5. https://www.youtube.com/watch?v=XURkQ3biF6w

RabbitMQ monitoring:

  1. Monitor the number of unacked items in the queues (unacked queue growth is a great indicator of RabbitMQ slowing down, which can indicate memory issues).
  2. Monitor queue lengths and memory consumption.
  3. We monitor how many items are in the queues and alert if any queue holds more than 20 items.
  4. We monitor orphaned queues. Generally these appear after hard service restarts or upgrades, but other times they can indicate other issues.
  5. HTTP aliveness check: http://hg.rabbitmq.com/rabbitmq-management/raw-file/rabbitmqv22_0/priv/www-api/help.html
    "Declares a test queue, then publishes and consumes a message. Does not delete the queue to avoid creating many mnesia transactions when this is repeatedly pinged. Intended for use by monitoring tools. If everything is working correctly, will return HTTP status 200 with body ...". You can use the Rabbitman client: http://rabbitman.readthedocs.io/
  6. Cluster status (rabbitmqctl cluster_status)
  7. Number of file descriptors in use
  8. policies on nodes - are they in sync?
  9. https://github.com/NYTimes/collectd-rabbitmq
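
Several of the checks above (queue depth, unacked growth, orphaned queues, file descriptors, cluster state) can be polled from the management HTTP API. What follows is a minimal Python sketch, not a finished monitoring plugin. It assumes the management plugin is enabled and that the /api/queues and /api/nodes responses carry the fields messages, messages_unacknowledged, consumers, fd_used, fd_total and partitions. The 80% file descriptor threshold, the "messages but no consumers means possibly orphaned" rule, and all function names are illustrative assumptions, not something taken from the sessions.

```python
import base64
import json
import urllib.parse
import urllib.request

MAX_QUEUE_MESSAGES = 20  # "alert if more than 20 items in any queue"
MAX_FD_RATIO = 0.8       # assumed threshold, not from the notes


def aliveness_url(base_url, vhost="/"):
    """URL of the aliveness check for one vhost ("/" is encoded as %2F)."""
    return "%s/api/aliveness-test/%s" % (
        base_url.rstrip("/"), urllib.parse.quote(vhost, safe=""))


def fetch_json(url, user, password, timeout=10):
    """GET a management API URL with HTTP basic auth and decode the JSON."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req = urllib.request.Request(
        url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def queue_alerts(queues, previous_unacked=None,
                 max_messages=MAX_QUEUE_MESSAGES):
    """Check /api/queues output; previous_unacked maps name -> last count."""
    previous_unacked = previous_unacked or {}
    alerts = []
    for q in queues:
        name = q.get("name", "?")
        total = q.get("messages", 0)
        unacked = q.get("messages_unacknowledged", 0)
        if total > max_messages:
            alerts.append("%s: %d messages queued" % (name, total))
        # Growth is measured against the previous poll, if there was one.
        if unacked > previous_unacked.get(name, unacked):
            alerts.append("%s: unacked messages growing (%d)" % (name, unacked))
        # Heuristic (assumption): messages but no consumers => maybe orphaned.
        if total > 0 and q.get("consumers", 0) == 0:
            alerts.append("%s: possibly orphaned, no consumers" % name)
    return alerts


def node_alerts(nodes, max_fd_ratio=MAX_FD_RATIO):
    """Check /api/nodes output for fd exhaustion and network partitions."""
    alerts = []
    for n in nodes:
        name = n.get("name", "?")
        fd_total = n.get("fd_total", 0)
        if fd_total and n.get("fd_used", 0) / fd_total > max_fd_ratio:
            alerts.append("%s: file descriptors nearly exhausted" % name)
        if n.get("partitions"):
            alerts.append("%s: sees partitions %s" % (name, n["partitions"]))
    return alerts
```

A poller would call fetch_json() on <base>/api/queues and <base>/api/nodes at each interval, pass the results to queue_alerts() (together with the unacked counts saved from the previous poll) and node_alerts(), and treat any non-200 response from aliveness_url() as a failure.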

(On the Nova/OpenStack services side, we also need to increase the RPC timeout: rpc_response_timeout=180. When one of the RabbitMQ hosts went away, a lot of the services held on to the old connection and lost a lot of messages --> use RabbitMQ connection heartbeats & TCP keepalive.
Use the Pika driver instead because it supports RabbitMQ connection heartbeats and more! https://docs.openstack.org/developer/oslo.messaging/pika_driver.html)
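
The service-side settings can be sketched as an ini fragment generated from Python, for example from a config management script. Only rpc_response_timeout=180 comes from the notes above. The option names heartbeat_timeout_threshold and heartbeat_rate (in the [oslo_messaging_rabbit] group of the default rabbit driver), their values, and the function name are assumptions added for illustration, so check them against your oslo.messaging release before use.

```python
import configparser
import io


def render_messaging_settings(rpc_response_timeout=180,
                              heartbeat_timeout_threshold=60,
                              heartbeat_rate=2):
    """Return an ini fragment for nova.conf or another service's config file."""
    cfg = configparser.ConfigParser()
    # From the notes: give RPC calls more time before they are abandoned.
    cfg["DEFAULT"] = {"rpc_response_timeout": str(rpc_response_timeout)}
    # Assumed option names: heartbeats let a service notice a dead broker
    # connection instead of holding on to it.
    cfg["oslo_messaging_rabbit"] = {
        "heartbeat_timeout_threshold": str(heartbeat_timeout_threshold),
        "heartbeat_rate": str(heartbeat_rate),
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(render_messaging_settings())
```

Running the script prints a [DEFAULT] section followed by an [oslo_messaging_rabbit] section that can be merged into the service's configuration file.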


Where do rabbit problems rank compared with other OpenStack stuff?

  • pretty high
  • top 5
  • How do I X OpenStack, How do I Y OpenStack, How do I keep goddamn rabbit running?
  • many bad failure modes. zombies everywhere.
  • hard to debug/monitor - eg which fanout queue belongs to which client? Tend to just restart things instead
  • sometimes rabbit failures are difficult to detect and are found after tracing back other issues
  • frequent problems - eg any time you stop rabbit on a node, or have some sort of networking glitch, have a failover - you get problems
  • "every time you do something with rabbit, something is going to get zombied"
  • most sites have a "restart everything that talks to rabbit" script

Ref:

OpenStack Ops Meetups: https://wiki.openstack.org/wiki/Operations/Meetups

  1. Proxy, MySQL, RabbitMQ tuning (2015) https://etherpad.openstack.org/p/BCN-ops-haproxy-mysql-rabbit-tuning

  2. Rabbit HA and queue issues (Let's try to have some tangible action items/outcomes from this session) (2015) https://etherpad.openstack.org/p/PHL-ops-rabbit-queue

  3. Openstack RabbitMQ in practice (2015) https://etherpad.openstack.org/p/TYO-ops-rabbitmq-in-practice
    RabbitMQ operation (Tokyo 2015 summit) https://github.com/michaelklishin/openstack-summit-tokyo-2015/raw/master/RabbitMQ%20Operations.pdf

  4. OS RabbitMQ pitfalls HA (2017) https://etherpad.openstack.org/p/MIL-ops-rabbitmq-pitfalls-ha

  5. RabbitMQ (2015) https://etherpad.openstack.org/p/YVR-ops-rabbitmq

  6. RabbitMQ (2017) https://etherpad.openstack.org/p/MIL-ops-rabbitmq-pitfalls-ha