Ceph placement groups backfilling

Ceph stores objects in pools, which are divided into placement groups.

   +---------------------------- pool a ----+
   |+----- placement group 1 -------------+ |
   ||+-------+  +-------+                 | |
   |||object |  |object |                 | |
   ||+-------+  +-------+                 | |
   |+-------------------------------------+ |
   |+----- placement group 2 -------------+ |
   ||+-------+  +-------+                 | |
   |||object |  |object |   ...           | |
   ||+-------+  +-------+                 | |
   |+-------------------------------------+ |
   |               ....                     |
   |                                        |
   +----------------------------------------+

   +---------------------------- pool b ----+
   |+----- placement group 1 -------------+ |
   ||+-------+  +-------+                 | |
   |||object |  |object |                 | |
   ||+-------+  +-------+                 | |
   |+-------------------------------------+ |
   |+----- placement group 2 -------------+ |
   ||+-------+  +-------+                 | |
   |||object |  |object |   ...           | |
   ||+-------+  +-------+                 | |
   |+-------------------------------------+ |
   |               ....                     |
   |                                        |
   +----------------------------------------+

   ...

A placement group relies on OSDs to store its objects. OSDs are daemons running on machines where storage is available. For instance, objects from placement group 1 of pool a will be stored in files managed by an OSD on a designated disk. A placement group with three replicas has three OSDs at its disposal: one OSD is the primary (OSD 0) and the other two store copies (OSD 1 and OSD 2).

       +-------- placement group   1 ---------+
       |+----------------+ +----------------+ |
       || object A       | | object B       | |
       |+----------------+ +----------------+ |
       +---+-------------+-----------+--------+
           |             |           |
           |             |           |
         OSD 0         OSD 1       OSD 2
        +------+      +------+    +------+
        |+---+ |      |+---+ |    |+---+ |
        || A | |      || A | |    || A | |
        |+---+ |      |+---+ |    |+---+ |
        |+---+ |      |+---+ |    |+---+ |
        || B | |      || B | |    || B | |
        |+---+ |      |+---+ |    |+---+ |
        +------+      +------+    +------+
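
The diagram above can also be written down as a small toy model. This is only a sketch: the Osd and PlacementGroup classes and the write method are hypothetical names chosen for illustration, and treating the first OSD in the list as the primary is an assumption of the sketch, not a description of how Ceph selects it.

```python
from dataclasses import dataclass, field


@dataclass
class Osd:
    """Toy stand-in for an OSD: an id plus the objects it stores."""
    osd_id: int
    objects: dict = field(default_factory=dict)  # object name -> content


@dataclass
class PlacementGroup:
    """Toy placement group backed by a list of OSDs."""
    osds: list

    @property
    def primary(self):
        # Assumption of this sketch: the first OSD acts as the primary.
        return self.osds[0]

    def write(self, name, content):
        # Every OSD backing the placement group keeps a full copy.
        for osd in self.osds:
            osd.objects[name] = content


pg = PlacementGroup([Osd(0), Osd(1), Osd(2)])
pg.write("A", b"contents of A")
pg.write("B", b"contents of B")
for osd in pg.osds:
    print(osd.osd_id, sorted(osd.objects))
```

Each of the three OSDs ends up holding both A and B, as in the diagram.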

Whenever an OSD dies, the placement group information and the associated objects stored on that OSD are gone and need to be reconstructed on another OSD.

       +-------- placement group   1 ---------+
       |+----------------+ +----------------+ |
       || object A       | | object B       | |
       |+----------------+ +----------------+ |
       +---+----------+--------------------+--+
           |          |                    |
           |          |                    |
         OSD 0      OSD 1      OSD 2     OSD 3
        +------+   +------+   +------+  +------+
        |+---+ |   |+---+ |   |      |  |+---+ |
        || A | |   || A | |   |      |  || A | |
        |+---+ |   |+---+ |   | DEAD |  |+---+ |
        |+---+ |   |+---+ |   |      |  |+---+ >----- last_backfill
        || B | |   || B | |   |      |  || B | |
        |+---+ |   |+---+ |   |      |  |+---+ |
        +------+   +------+   +------+  +------+

The objects from the primary (OSD 0) are copied to OSD 3: this is called backfilling. It involves the primary (OSD 0) and the backfill peer (OSD 3) scanning over their content and copying, from the primary to the backfill peer, the objects that differ or are missing on the peer. Because this may take a long time, the last_backfill attribute is tracked for each local placement group copy (i.e. the placement group information that resides on OSD 3), indicating how far the local copy has been backfilled. When the copy is complete, last_backfill is hobject_t::max().

                OSD 3
         +----------------+
         |+--- object --+ |
         || name : B    | |
         || key : 2     | |
         |+-------------+ |
         |+--- object --+ >----- last_backfill
         || name : A    | |
         || key : 5     | |
         |+-------------+ |
         |                |
         |    ....        |
         +----------------+

Object names are hashed into an integer that can be used to order them. For instance, object B above has been hashed to key 2 and object A to key 5. The last_backfill attribute of the placement group marks the boundary between the objects that have already been copied from other OSDs and those that remain to be copied. Objects lower than last_backfill have been copied (object B above) and objects greater than last_backfill are going to be copied.
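
The scan and the role of last_backfill can be sketched in a few lines. This is a toy model under stated assumptions, not Ceph's code: each OSD is a plain dict mapping an integer key to (name, content), a bare integer stands in for hobject_t, MAX_KEY stands in for hobject_t::max(), and the backfill function and its budget parameter (used to simulate an interruption) are hypothetical. In this sketch last_backfill is the key of the next object to copy, as drawn in the diagram above, and objects that exist only on the peer are left untouched.

```python
import math

MIN_KEY = -math.inf  # lower than any real key: nothing copied yet
MAX_KEY = math.inf   # stand-in for hobject_t::max(): copy is complete


def backfill(primary, peer, last_backfill=MIN_KEY, budget=None):
    """Copy objects from primary to peer in key order.

    Keys lower than last_backfill are assumed to be already copied and
    are not scanned again. budget caps how many objects are scanned in
    this call. Returns the new last_backfill.
    """
    pending = sorted(k for k in primary if k >= last_backfill)
    scanned = 0
    for key in pending:
        if budget is not None and scanned >= budget:
            return key  # first key that has not been copied yet
        if peer.get(key) != primary[key]:
            peer[key] = primary[key]  # missing or different on the peer
        scanned += 1
    return MAX_KEY


osd0 = {2: ("B", b"contents of B"), 5: ("A", b"contents of A")}
osd3 = {}

mark = backfill(osd0, osd3, budget=1)  # interrupted after one object
print(mark, sorted(osd3))              # 5 [2]
mark = backfill(osd0, osd3, last_backfill=mark)
print(mark == MAX_KEY, osd3 == osd0)   # True True
```

After the first call only object B (key 2) is on OSD 3 and last_backfill sits at key 5. The second call resumes from there instead of rescanning everything, which is the point of persisting the attribute.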

Backfilling is expensive and placement groups do not rely on it exclusively to recover from failure. Placement groups log their changes, for instance deleting or modifying an object. When an OSD is unavailable for a short period of time, it may be cheaper to replay the logs.
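
To see why replaying can be cheaper, here is a minimal sketch of log replay. The LogEntry fields, the integer version numbers and the replay function are assumptions made for illustration and do not reflect Ceph's actual log format. The sketch also assumes that the log is ordered by version.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    version: int          # position of the change in the log
    op: str               # "modify" or "delete"
    name: str             # object name
    content: bytes = b""  # new content, used by "modify" only


def replay(log, objects, last_version):
    """Apply to objects every log entry newer than last_version.

    Returns the version of the last entry applied.
    """
    for entry in log:
        if entry.version <= last_version:
            continue  # the returning OSD already has this change
        if entry.op == "modify":
            objects[entry.name] = entry.content
        elif entry.op == "delete":
            objects.pop(entry.name, None)
        last_version = entry.version
    return last_version


log = [
    LogEntry(1, "modify", "A", b"v1"),
    LogEntry(2, "modify", "B", b"v1"),
    LogEntry(3, "modify", "A", b"v2"),
    LogEntry(4, "delete", "B"),
]
# OSD 2 went away after applying version 2 and missed versions 3 and 4.
osd2 = {"A": b"v1", "B": b"v1"}
version = replay(log, osd2, last_version=2)
print(version, osd2)  # 4 {'A': b'v2'}
```

The work done is proportional to the number of changes missed (two entries here), not to the number of objects in the placement group, which is what a backfill scan has to walk through.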