Production Checklist
Overview
Data services such as RabbitMQ often have many tunable parameters. Some configurations or practices make a lot of sense for development but are not really suitable for production. No single configuration fits every use case. It is, therefore, important to assess system configuration and have a plan for "day two operations" activities such as upgrades before going into production.
Production systems have concerns that go beyond configuration: system observability, security, application development practices, resource usage, release support timeline, and more.
Monitoring and metrics are the foundation of a production-grade system. Besides helping detect issues, they provide the operator with data that can be used to size and configure both RabbitMQ nodes and applications.
This guide provides recommendations in a few areas:
- Storage considerations for node data directories
- Networking-related recommendations
- Recommendations related to virtual hosts, users and permissions
- Monitoring and resource usage
- Per-virtual host and per-user limits
- Security
- Clustering and multi-node deployments
- Application-level practices and considerations
and more.
Storage Considerations
Use Durable Storage
Modern RabbitMQ 3.x features, most notably quorum queues and streams, are not designed with transient storage in mind.
Data safety features of quorum queues and streams expect node data storage to be durable. Both data structures also assume reasonably stable latency of I/O operations, something that network-attached storage will not be always ready to provide in practice.
Quorum queue and stream replicas hosted on restarted nodes that use transient storage will have to perform a full sync of the entire data set on the leader replica. This can result in massive data transfers and network link overload that could have been avoided by using durable storage.
When nodes are restarted, the rest of the cluster expects them to retain the information about their cluster peers. When this is not the case, restarted nodes may be able to rejoin as new nodes but a special peer clean up mechanism would have to be enabled to remove their prior identities.
Transient entities (such as queues) and RAM node support will be removed in RabbitMQ 4.0.
Network-attached Storage (NAS)
Network-attached storage (NAS) can be used for RabbitMQ node data directories, provided that the NAS volume:
- Offers low I/O latency
- Can guarantee no significant latency spikes (for example, due to sharing with other I/O-heavy services)
Quorum queues, streams, and other RabbitMQ features will benefit from fast local SSD and NVMe storage. When possible, prefer local storage to NAS.
Storage Isolation
RabbitMQ nodes must never share their data directories. Ideally, nodes should not share their disk I/O with other services either, for the most predictable latency and throughput.
Choice of a Filesystem
RabbitMQ nodes can use most widely used local filesystems: ext4, btrfs, and so on.
Avoid using distributed filesystems for node data directories:
- RabbitMQ's storage subsystem assumes the standard local filesystem semantics for fsync(2) and other key operations. Distributed filesystems often deviate from these standard guarantees
- Distributed filesystems are usually designed for shared access to a subset of directories. Sharing a data directory between RabbitMQ nodes is an absolute no-no and is guaranteed to result in data corruption since nodes will not coordinate their writes
Virtual Hosts, Users, Permissions
It is often necessary to seed a cluster with virtual hosts, users, permissions, topologies, policies, and so on. The recommended way of doing this at deployment time is via definition import. Definitions can be imported on node boot or at any point after cluster deployment using rabbitmqadmin or the POST /api/definitions HTTP API endpoint.
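As a sketch, definitions exported from a staging cluster can be imported on node boot with the load_definitions configuration key (the file path below is an example; adjust it to your deployment):

```ini
# rabbitmq.conf: import definitions from a local JSON file on node boot
# /etc/rabbitmq/definitions.json is an example path
load_definitions = /etc/rabbitmq/definitions.json
```

The same file can also be imported into a running cluster with rabbitmqadmin import or by POSTing it to the /api/definitions HTTP API endpoint.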
Virtual Hosts
In a single-tenant environment, for example, when your RabbitMQ cluster is dedicated to powering a single system in production, using the default virtual host (/) is perfectly fine.
In multi-tenant environments, use a separate vhost for each tenant/environment, e.g. project1_development, project1_production, project2_development, project2_production, and so on.
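Per-tenant virtual hosts can be created ahead of time with rabbitmqctl against a running node; the vhost names below follow the example naming scheme above:

```shell
# Create one vhost per tenant/environment (requires a running node;
# the names are examples following the convention above)
rabbitmqctl add_vhost project1_production
rabbitmqctl add_vhost project2_production
```

Alternatively, vhosts can be included in an imported definitions file so they are created at deployment time.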
Users
For production environments, delete the default user (guest). The default user can only connect from localhost by default, because it has well-known credentials. Instead of enabling remote connections for it, consider creating a separate user with administrative permissions and a generated password.
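A sketch of this, using rabbitmqctl against a running node; the user name and password below are placeholders:

```shell
# Remove the default user with well-known credentials
rabbitmqctl delete_user guest

# Create a dedicated administrative user with a generated password
# ("admin" and the password are placeholders)
rabbitmqctl add_user admin 'S3curelyGenerat3d'
rabbitmqctl set_user_tags admin administrator
rabbitmqctl set_permissions -p / admin '.*' '.*' '.*'
```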
It is recommended to use a separate user per application. For example, if you have a mobile app, a Web app, and a data aggregation system, you'd have 3 separate users. This makes a number of things easier:
- Correlating client connections with applications
- Using fine-grained permissions
- Credentials roll-over (e.g. periodically or in case of a breach)
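For example, a per-application user whose permissions are scoped to its own resources might be provisioned like this (user name, vhost, and the permission patterns are illustrative; set_permissions takes configure, write, and read regular expressions, in that order):

```shell
# One user per application, restricted to resources whose names
# start with "mobile." in its tenant's vhost (names are examples)
rabbitmqctl add_user mobile_app 'An0therGenerat3dPassword'
rabbitmqctl set_permissions -p project1_production mobile_app \
    '^mobile\.' '^mobile\.' '^mobile\.'
```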
In case there are many instances of the same application, there's a trade-off between better security (having a set of credentials per instance) and convenience of provisioning (sharing a set of credentials between some or all instances).
For IoT applications that involve many clients performing the same or similar function and having fixed IP addresses, it may make sense to authenticate using x509 certificates or source IP address ranges.
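Certificate-based authentication can be sketched in rabbitmq.conf as follows, assuming the rabbitmq_auth_mechanism_ssl plugin is enabled and TLS listeners are already configured:

```ini
# rabbitmq.conf: accept x509 client certificates for authentication
# (requires the rabbitmq_auth_mechanism_ssl plugin)
auth_mechanisms.1 = PLAIN
auth_mechanisms.2 = EXTERNAL

# Require clients to present a verifiable certificate
ssl_options.verify = verify_peer
ssl_options.fail_if_no_peer_cert = true
```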
Monitoring and Resource Limits
RabbitMQ nodes are limited by various resources, both physical (e.g. the amount of RAM available) as well as software (e.g. max number of file handles a process can open). It is important to evaluate resource limit configurations before going into production and continuously monitor resource usage after that.
Monitoring
Monitoring several aspects of the system, from infrastructure and kernel metrics to RabbitMQ to application-level metrics is essential. While monitoring requires an upfront investment in terms of time, it is very effective at catching issues and noticing potentially problematic trends early (or at all).
Memory
RabbitMQ uses resource-driven alarms to throttle publishers when consumers do not keep up.
By default, RabbitMQ will not accept any new messages when it detects that it's using more than 40% of the available memory (as reported by the OS): vm_memory_high_watermark.relative = 0.4. This is a safe default and care should be taken when modifying this value, even when the host is a dedicated RabbitMQ node.
The OS and file system use system memory to speed up operations for all system processes. Failing to leave enough free system memory for this purpose will have an adverse effect on system performance due to OS swapping, and can even result in RabbitMQ process termination.
A few recommendations when adjusting the default vm_memory_high_watermark:
- Nodes hosting RabbitMQ should have at least 256 MiB of memory available at all times. Deployments that use quorum queues, Shovel and Federation may need more.
- The recommended vm_memory_high_watermark.relative range is 0.4 to 0.7
- Values above 0.7 should be used with care and with solid memory usage and infrastructure-level monitoring in place. The OS and file system must be left with at least 30% of the memory, otherwise performance may degrade severely due to paging.
These are some very broad-stroked guidelines. As with every tuning scenario, monitoring, benchmarking and measuring are required to find the best setting for the environment and workload.
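For instance, a dedicated RabbitMQ host with memory monitoring in place might raise the watermark within the recommended range (the value below is an example, not a universal recommendation):

```ini
# rabbitmq.conf: raise the memory watermark on a dedicated host,
# staying within the recommended 0.4-0.7 range
vm_memory_high_watermark.relative = 0.6
```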
Learn more about RabbitMQ and system memory in a separate guide.
Disk Space
The current 50MB disk_free_limit default works very well for development and tutorials. Production deployments require a much greater safety margin. Insufficient disk space will lead to node failures and may result in data loss as all disk writes will fail.

Why is the default 50MB then? Development environments sometimes use really small partitions to host /var/lib, for example, which means nodes go into resource alarm state right after booting. The very low default ensures that RabbitMQ works out of the box for everyone. As for production deployments, we recommend the following:
- The minimum recommended free disk space low watermark is about the same as the high memory watermark. For example, on a node configured with a memory watermark of 4GB, disk_free_limit.absolute = 4G would be a recommended minimum. In that case, if available disk space drops below 4GB, all publishers will be blocked and no new messages will be accepted. Queues will need to be drained by consumers before publishing will be allowed to resume.
- Continuing with the example above, disk_free_limit.absolute = 6G is a safer value. If RabbitMQ needs to flush to disk up to its high memory watermark worth of data, as can sometimes be the case during shutdown, there will be sufficient disk space available for RabbitMQ to start again in all but the most pessimistic scenarios.
- Continuing with the example above, disk_free_limit.absolute = 8G is the safest value to use. It should be enough disk space for the most pessimistic scenario, where a node first has to move up to its high memory watermark worth of data (so, about 4GB) to disk, and then perform an on-disk data operation that could temporarily nearly double the amount of disk space used.