Ceph

Ceph is a scalable storage cluster system. We are currently evaluating it in the same role that Swift fills for us today: distributed storage of media objects.

Architecture

The Ceph documentation has a Getting Started page as well as a detailed architecture page, both of which are recommended reading. The text below is intended as a simpler crash course on Ceph, specifically tailored to Wikimedia use cases.

Ceph has a low-level RADOS layer which stores named objects & key/value metadata, on top of which sit a number of different interfaces:

  • librados: Low-level access using a native interface for client programs explicitly written for Ceph
  • radosgw: an Amazon S3/OpenStack Swift compatible HTTP gateway
  • RBD: a block device layer for e.g. hosting virtual machines
  • CephFS: a POSIX-compliant distributed file system

(Note that the above interfaces are not interchangeable: you cannot, for example, store objects through radosgw and then read them over CephFS, or the other way around.)
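
To get a feel for the low-level layer, the rados CLI (built on librados) can list pools and the raw objects inside them; this is only a sketch, and the pool name below is a placeholder rather than a pool guaranteed to exist on our cluster:

rados lspools                     # list all RADOS pools in the cluster
rados df                          # per-pool object counts and space usage
rados -p .rgw.buckets ls | head   # raw objects as radosgw stores them, not the user-visible names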

A Ceph storage cluster has three essential daemon types:

  • Object Storage Device (OSD): stores data, handles data replication and reports health status for itself and its peers to the monitors. One OSD maps to one filesystem and usually to one block device (a disk or disk array), so multiple OSDs are expected to run on a server, one per disk. OSDs can also journal (spool) writes to a separate file, a feature we use to increase performance by putting the journals on SSDs.
  • Monitor (mon): the initial contact point for clients & OSDs; maintains the cluster state & topology (the "cluster map"), which it distributes to clients and updates based on events coming from OSDs ("osd 5 is down") or admin actions. A single monitor is a cluster SPOF, so it is recommended to run multiple mons in a high-availability setup (handled internally by Ceph) to increase resiliency. A quorum (majority) of monitors must agree on the cluster map, e.g. 3 out of 5 or 4 out of 6; running an odd number of monitors is recommended.
  • Metadata Server (MDS): stores metadata for CephFS. Optional; only needed if CephFS is used.
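
For a quick look at which of these daemons a cluster is running and whether they are up, the standard ceph subcommands below can be used from any host with an admin keyring:

ceph mon stat   # monitor count and quorum members
ceph osd stat   # OSDs total / up / in
ceph mds stat   # MDS state; only meaningful if CephFS is in use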

Data come from Ceph clients using any of the above interfaces and are stored as objects in RADOS. RADOS has multiple pools to store data, each with different characteristics (e.g. replica count). Objects in a pool then map to a relatively small number of partitions called placement groups (PGs), which are in turn mapped via an algorithm called CRUSH [1] to a set of OSDs (as many as the replica count). Each of those (e.g. 3) OSDs then maps the data to files in its (e.g. XFS) filesystem, using an internal hierarchy that has nothing to do with the object's name or hierarchy.
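
This mapping can be inspected directly: as a rough sketch (the pool and object names are placeholders), ceph osd map shows which PG an object hashes to and which OSDs CRUSH picks for it:

ceph osd lspools                   # numbered list of pools
ceph osd map <pool> <object-name>  # prints the PG id and the set of OSDs (e.g. [12,67,103]) that hold it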

Each RADOS client is aware of the whole cluster hierarchy (given to it by the monitor) and connects directly to the OSDs (and hence, servers) for reads & writes, thus eliminating SPOFs & centralized bottlenecks. Note, however, that for all intents and purposes, radosgw itself is such a client and hence radosgw servers are choke points for HTTP clients.

Implementation

eqiad media storage cluster

The eqiad media storage cluster, as of July 2013, has the following servers:

  • ms-be1001 - ms-be1012: 12 Dell PowerEdge R720xd servers, serving as data nodes
  • ms-fe1001 - ms-fe1004: 4 Dell PowerEdge R610 servers running as frontend nodes

Each data node has twelve 2TB disks in a JBOD configuration (actually single-disk RAID-0 arrays due to controller limitations), driven by an H710 controller with a 512MB BBU; this replaced the H310 controller after severe performance issues caused by its hardware bugs. Each node also has two 160GB Intel 320 SSDs in a RAID-0 array set up as Write-Through (to leave the controller's Write-Back cache to the 12 spindles). Each of the 12 spindles is formatted as an XFS filesystem mounted with nobarrier and driven by its own OSD. The SSD RAID is mounted as ext4 and serves as the root filesystem, while also holding a 10GB OSD journal per OSD daemon (120GB in total) to increase performance. Each node has 48GB of RAM, which mostly serves as page cache and XFS inode cache. The data nodes are distributed between multiple racks in rows A & C to increase network & power resiliency.
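
To sanity-check this layout on a data node, something like the following can be used; the journal path shown is Ceph's default and is an assumption here, as the actual location may differ on our hosts:

mount -t xfs | grep nobarrier            # the twelve spindle filesystems, one per OSD
ls -l /var/lib/ceph/osd/ceph-*/journal   # each OSD's journal, expected to point at the SSD RAID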

The frontend nodes all have a simple mostly unused 2-disk software RAID-1 array and 16GB of RAM. Each of those runs a radosgw server & Apache with mod_fastcgi, all participating in the LVS group ms-fe.eqiad.wmnet. Three of those (1001, 1003, 1004) also serve as monitors, with each of those being in separate rows (A, B, C, respectively).

Troubleshooting

Ceph's cluster operations and troubleshooting guides are more comprehensive external resources; see also Ceph Object Storage troubleshooting.

Note that "Ceph Object Storage" in the official Ceph documentation is alternative terminology for the RADOS Gateway (radosgw), which is what we use at WMF.

Pinpointing the problem

There are a number of Nagios checks to monitor high & low layers of Ceph clusters. Front ends have apache and radosgw checks; front and back ends have raid checks.

As a first step to troubleshooting, it is important to figure out which part of this layered architecture is broken. Note that the issue could be non-Ceph related altogether and troubleshooting that is outside of the scope of this document; refer to Media storage for more information about the overall architecture.

Starting from an LVS HTTP alert for the service IP, examine if there are alerts for some or all of the realservers as well (e.g. when getting an alert for ms-fe.eqiad.wmnet, check if there were preceding alerts for ms-fe100N.eqiad.wmnet). Logs from PyBal can also help here.

Each of the Apache/radosgw servers provides two monitoring URLs: there is the /monitoring/backend URL which is a deep check, fetching an object from the radosgw S3/Swift monitoring container called backend; there is also the /monitoring/frontend URL which is served directly from Apache & the local filesystem, not relying on Ceph at all. Both should contain the plain-text string "OK\n". Note that a failure of backend might have cascaded into a frontend failure as well (e.g. reaching MaxClients) so don't use these URLs as the absolute truth about cluster status.
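
Both URLs are easy to check by hand, e.g. against one of the frontends directly rather than the LVS service IP:

curl -s http://ms-fe1001.eqiad.wmnet/monitoring/frontend   # served by Apache from the local filesystem
curl -s http://ms-fe1001.eqiad.wmnet/monitoring/backend    # fetched from the "backend" monitoring container via radosgw

A failing backend check with a healthy frontend check points at radosgw/Ceph rather than Apache.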

Ceph itself can provide a health status as well as interactive tailing of its logs. Useful commands to dig deeper (run these on any working frontend):

  • ceph health, ceph health detail: the former prints a single line of HEALTH_OK, HEALTH_WARN or HEALTH_ERR, while the latter provides a detailed output in the non-OK cases.
  • ceph status: prints the cluster's status, including the numbers of mons & OSDs that are up & down, as well as the status of PGs.
  • ceph osd tree: prints the cluster tree, with all racks, hostnames & OSDs as well as their status and weight. Extremely useful to immediately pinpoint e.g. network errors.
  • ceph -w: prints the status, followed by a tail of the log as events happen (similar to running tail -f /var/log/ceph/ceph.log on a monitor).

(Note that if the above commands fail completely, this indicates a full monitor outage. Recovering from that is much more complicated and outside the scope of a basic troubleshooting guide; refer to the Ceph manual for more information.)

The following concepts may explain the output above better:

  • OSDMap: an OSD may have a status of up/down or in/out. Down means that Ceph is supposed to place data on that OSD but it can't reach it, so the PGs that are supposed to be stored there are in a "degraded" status. Ceph is self-healing and automatically "outs" a down OSD after a period of time (by default 15 minutes), i.e. it starts reallocating PGs to other OSDs in the cluster and recovering them from their replicas.
  • PGMap: a PG may have multiple states. The healthy ones are anything that contains active+clean (ignore scrubbing & scrubbing+deep, these indicate normal consistency checks run periodically). Others are also non-fatal, like recovering & remapped, but some (like incomplete) are especially bad; refer to the Ceph manual in case of doubt.
  • MONMap: a monitor may be down; in that case, ceph quorum_status will be useful in pinpointing the issue further (see the sketch below).
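
As a sketch of that last check (quorum_status prints JSON; the json.tool pipe is only there for readability):

ceph quorum_status | python -m json.tool   # lists the monitors in and out of the quorum
ceph mon stat                              # one-line summary of monitor state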

Common issues

High-level HTTP/Swift

Ceph/radosgw behaves like a normal Swift cluster. Running the Swift client (or, alternatively, speaking the Swift REST protocol using curl) on one of the radosgw servers (or the LVS service IP) can provide insight on e.g. access controls & permission problems.

To list containers/objects, and/or remove them, follow the same steps as described in Swift/How_To. Just alter the following parameters:

  • -A http://ms-fe.eqiad.wmnet/auth/v1.0
  • -U mw:media
  • -K : A different password is used for Ceph than for Swift. Both can be retrieved from the wmf-config/PrivateSettings.php file.
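
Putting these together, a minimal session looks like the following, where $PASSWORD stands for the radosgw key from PrivateSettings.php (deliberately not spelled out here):

swift -A http://ms-fe.eqiad.wmnet/auth/v1.0 -U mw:media -K "$PASSWORD" list | head   # list containers on the account
swift -A http://ms-fe.eqiad.wmnet/auth/v1.0 -U mw:media -K "$PASSWORD" stat          # account-level statistics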

radosgw failure

Upstart should restart radosgw a number of times if it dies. If, however, a radosgw daemon stays dead for some reason, logging in to the box and running start radosgw id=radosgw should be enough. The radosgw logs are located in /var/log/radosgw/radosgw.log and should be watched for errors; the default log level tends to be useless, so if needed, adjust the debug rgw setting in ceph.conf to a higher value.
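
In shell form (the config section name is the conventional radosgw client section and is an assumption about our ceph.conf rather than a verified value):

start radosgw id=radosgw               # bring the daemon back if upstart has given up on it
tail -f /var/log/radosgw/radosgw.log   # watch for errors on startup
# For more useful logging, raise the verbosity in /etc/ceph/ceph.conf under the radosgw
# client section (e.g. "debug rgw = 20") and restart the daemon afterwards.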

radosgw by itself does nothing but listen on a Unix socket to which Apache/mod_fastcgi connects. Normal Apache troubleshooting should be employed to debug that side of things.

monitor failure

Upstart should restart a monitor a number of times if it dies. If, however, a monitor stays dead for some reason, logging in to the box and running start ceph-mon id=$(hostname -s) should be enough. The monitor logs are located in /var/log/ceph/ceph-mon.$(hostname -s).log and will report any errors/crashes.
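
In shell form, on the affected host:

start ceph-mon id=$(hostname -s)                    # restart the local monitor
tail -f /var/log/ceph/ceph-mon.$(hostname -s).log   # watch it rejoin the quorum
ceph status                                         # confirm it shows up in the monmap again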

OSD failures

Upstart restarts dead OSDs by itself a number of times before giving up, and Ceph should handle OSD failures gracefully by marking an OSD as down when it is unresponsive to heartbeats. Note that in the case of I/O errors (disk failure), Ceph's behavior is for the OSD to simply die and let the cluster recover. Therefore, when an OSD is down, make sure to check dmesg and kern.log to establish whether this is a normal reaction to a bad disk.

To stop, start or restart an OSD daemon use the upstart commands: stop|start|restart ceph-osd id=N where N is the OSD's number. The logs are located in /var/log/ceph/ceph-osd.N.log.
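
For example, for OSD 12 (the number is illustrative):

dmesg | tail -n 50                       # look for I/O errors pointing at a failing disk
grep -i error /var/log/kern.log | tail   # same, in the persistent kernel log
restart ceph-osd id=12                   # only if the disk looks healthy
tail -f /var/log/ceph/ceph-osd.12.log    # watch the OSD come back up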

Slow requests, stuck PGs

While OSDs are supposed to be resilient, there have been cases where an OSD is responsive to heartbeats but has PGs stuck in non-performing-IO states, which often results in the infamous "slow request" log messages. Example log entries:

2013-06-06 21:12:57.492623 7f8f15b0d700  0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.069932 secs
2013-06-06 21:12:57.492642 7f8f15b0d700  0 log [WRN] : slow request 30.069932 seconds old, received at 2013-06-06 21:12:27.422592: pg_scan(get_digest 3.43f 0//0//3-0//0//3 e 185007/185007) v1 currently reached pg
2013-06-06 21:12:59.493068 7f8f15b0d700  0 log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 32.070428 secs
2013-06-06 21:12:59.493082 7f8f15b0d700  0 log [WRN] : slow request 30.905341 seconds old, received at 2013-06-06 21:12:28.587679: pg_scan(get_digest 3.27eb 0//0//3-0//0//3 e 185008/185008) v1 currently reached pg
2013-06-06 21:13:05.494268 7f8f15b0d700  0 log [WRN] : 3 slow requests, 1 included below; oldest blocked for > 38.071638 secs
...

In such a case, stopping/restarting an OSD may be appropriate to let the cluster recover. Another alternative is to manually mark the OSD as out by running ceph osd out NNN. To find the responsible OSD, grepping the output of ceph pg dump for the bad PG state is useful. A sample entry (split for readability):

pg_stat objects mip     degr    unf     bytes            log     disklog  state           state_stamp ...
3.3ffc  17212   0       0        0       2903725143      0       0        active+clean    2013-06-23 03:32:31.773464 ...

...   v               reported       up          acting       last_scrub    scrub_stamp                 last_deep_scrub  deep_scrub_stamp
...   188720'86316    185972'13977  [67,103,16] [67,103,16]   188720'86221  2013-06-23 03:32:31.773386  188720'86221     2013-06-23 03:32:31.773386

The first field is the PG id, while the three numbers between brackets separated by commas are the OSDs for that PG.
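
As a sketch, the lookup and the manual out can be combined like this (the PG state and the OSD number are placeholders taken from the example above):

ceph pg dump | grep incomplete   # find PGs in the offending state and their acting OSD sets
ceph osd out 67                  # mark the suspect OSD out so its PGs get re-homed elsewhere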

Be careful with PGs in a "peering" state: PGs stuck in that state halt requests, but be patient and wait a few minutes before restarting OSDs, as the result might be the opposite of what was intended: stopping, starting or restarting OSDs causes more PGs to peer and might aggravate the problem.
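
Before touching any OSD, it is worth checking how widespread the peering is and whether it is already subsiding on its own, e.g.:

ceph health detail                    # lists stuck PGs and how long they have been stuck
ceph pg dump | grep peering | wc -l   # rough count of PGs currently peering; re-run after a few minutes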

Metrics

While Ceph provides an extensive breadth of metrics, the collection of Ceph metrics across our infrastructure is currently limited. We currently use Ganglia to collect the basic system metrics as well as high-level Apache/radosgw requests per second & busy threads. Useful starting points to dig in further are:

See also