ceph internals : buffer lists

The ceph buffers are used to process data in memory. For instance, when a FileStore handles an OP_WRITE transaction it writes a list of buffers to disk.

                                             +---------+
                                             | +-----+ |
    list              ptr                    | |     | |
 +----------+       +-----+                  | |     | |
 | append_  >------->     >-------------------->     | |
 |  buffer  |       +-----+                  | |     | |
 +----------+                        ptr     | |     | |
 |   _len   |      list            +-----+   | |     | |
 +----------+    +------+     ,--->+     >----->     | |
 | _buffers >---->      >-----     +-----+   | +-----+ |
 +----------+    +----^-+     \      ptr     |   raw   |
 |  last_p  |        /         `-->+-----+   | +-----+ |
 +--------+-+       /              +     >----->     | |
          |       ,-          ,--->+-----+   | |     | |
          |      /        ,---               | |     | |
          |     /     ,---                   | |     | |
        +-v--+-^--+--^+-------+              | |     | |
        | bl | ls | p | p_off >--------------->|     | |
        +----+----+-----+-----+              | +-----+ |
        |               | off >------------->|   raw   |
        +---------------+-----+              |         |
              iterator                       +---------+

The actual data is stored in buffer::raw opaque objects. They are accessed through a buffer::ptr. A buffer::list is a sequential list of buffer::ptr which can be used as if it was a contiguous data area although it can be spread over many buffer::raw containers, as represented by the rectangle enclosing the two buffer::raw objects in the above drawing. The buffer::list::iterator can be used to walk each character of the buffer::list as follows:

  bufferlist bl;
  bl.append("ABC", 3);
  {
    bufferlist::iterator i(&bl);
    ++i;
    EXPECT_EQ('B', *i);
  }

documentation and unit tests

The ultimate documentation for buffer.cc and buffer.h are the unit tests that demonstrate how it actually works. This document is a short guide designed to provide an overview and does not attempt to cover everything.

buffer::ptr and buffer::raw

The buffer::raw is where the data is actually stored. It is allocated with malloc, new or reusing a pointer provided by the caller. A variant of the malloc constructor provides an area that is aligned on CEPH_PAGE_SIZE. The address of the allocated memory will be a multiple of CEPH_PAGE_SIZE, which must be a power of two and a multiple of sizeof(void *).


 +-----------+                +-----+
 |           |                |     |
 |  offset   +----------------+     |
 |           |                |     |
 |  length   +----            |     |
 |           |    \-------    |     |
 +-----------+            \---+     |
 |   ptr     |                +-----+
 +-----------+                | raw |
                              +-----+

The buffer::raw area can only be accessed through the buffer::ptr. It adresses the buffer::raw bytes in the range [offset,offset+length[. The buffer::ptr methods are very flexible and mostly designed to be used to implement buffer::lists. The constructors can allocate a buffer::raw area with ptr(unsigned l) or be assigned an existing buffer::raw with ptr(raw *r), as in buffer::list::rebuild().
Bytes can be copied in or copied out within the [offset,offset+length[ range. If the underlying buffer::raw extends beyond offset+length (as reported by unused_tail_length()), bytes can be appended.

    bufferptr ptr(2);
    ptr.set_length(0);
    ptr.append('A');
    EXPECT_EQ((unsigned)1, ptr.length());
    EXPECT_EQ('A', ptr[0]);
    ptr.append("B", (unsigned)1);
    EXPECT_EQ((unsigned)2, ptr.length());
    EXPECT_EQ('B', ptr[1]);

buffer::list and buffer::list::iterator

A buffer::list is a list of buffer::ptr, as shown below.

                                             +---------+
                                             | +-----+ |
    list                                     | |     | |
 +----------+                                | |     | |
 | append_  |                                | |     | |
 |  buffer  |                                | |     | |
 +----------+                        ptr     | |     | |
 |   _len   |      list            +-----+   | |     | |
 +----------+    +------+     ,--->+     >----->     | |
 | _buffers >---->      >-----     +-----+   | +-----+ |
 +----------+    +----^-+     \      ptr     |   raw   |
 |  last_p  |                  `-->+-----+   | +-----+ |
 +----------+                      |     >----->     | |
                                   +-----+   | |     | |
                                             | |     | |
                                             | |     | |
                                             | |     | |
                                             | |     | |
                                             | +-----+ |
                                             |   raw   |
                                             |         |
                                             +---------+

The operator[] abstracts it.

  bufferlist bl;
  bl.append('A');
  bufferlist other;
  other.append('B');
  bl.append(other);
  EXPECT_EQ((unsigned)2, bl.buffers().size());
  EXPECT_EQ('B', bl[1]);

The buffer::list::iterator class provides some of the usual iterator behavior and is used in the buffer::list::contents_equal(ceph::buffer::list& other) method.

                                             +---------+
                                             | +-----+ |
    list                                     | |     | |
 +----------+                                | |     | |
 | append_  |                                | |     | |
 |  buffer  |                                | |     | |
 +----------+                        ptr     | |     | |
 |   _len   |      list            +-----+   | |     | |
 +----------+    +------+     ,--->+     >----->     | |
 | _buffers >---->      >-----     +-----+   | +-----+ |
 +----------+    +----^-+     \      ptr     |   raw   |
 |  last_p  |        /         `-->+-----+   | +-----+ |
 +--------+-+       /              +     >----->     | |
          |       ,-          ,--->+-----+   | |     | |
          |      /        ,---               | |     | |
          |     /     ,---                   | |     | |
        +-v--+-^--+--^+-------+              | |     | |
        | bl | ls | p | p_off >--------------->|     | |
        +----+----+-----+-----+              | +-----+ |
        |               | off >------------->|   raw   |
        +---------------+-----+              |         |
              iterator                       +---------+

The bl data member points to the buffer::list. The ls data member is used to avoid dereferencing a pointer and is equivalent to bl->_buffers. The p data member is a std::list<ptr>::iterator used to walk the list of buffer::ptr. The p_off data member is the offset at which the iterator currently is, within the buffer::raw pointed by p. The off data member is the offset of the iterator, as if there was only one buffer::raw.
Although buffer::list::iterator exposes copy in and out methods, they are designed to be used as supporting methods for the corresponding copy in and out methods from the buffer::list class.

{
    bufferlist bl;
    bufferlist dest;
    const char *expected = "ABC";
    bl.append(expected);
    bl.copy(1, 2, dest);
    EXPECT_EQ(0, ::memcmp(expected + 1, dest.c_str(), 2));
}
{
    bufferlist bl;
    bl.append("XXX");
    bl.copy_in(1, 2, "AB");
    EXPECT_EQ(0, ::memcmp("XAB", bl.c_str(), 3));
}

The internal representation of the buffer::list can be rebuilt to use a single buffer::raw.

  {
    bufferlist bl;
    const std::string str(CEPH_PAGE_SIZE, 'X');
    bl.append(str.c_str(), str.size());
    bl.append(str.c_str(), str.size());
    EXPECT_EQ((unsigned)2, bl.buffers().size());
    bl.rebuild();
    EXPECT_EQ((unsigned)1, bl.buffers().size());
  }

It can also be rebuilt to only use buffer::raw that are aligned on CEPH_PAGE_SIZE.

  {
    bufferlist bl;
    {
      bufferptr ptr(CEPH_PAGE_SIZE + 1);
      ptr.set_offset(1);
      ptr.set_length(CEPH_PAGE_SIZE);
      bl.append(ptr);
    }
    EXPECT_EQ((unsigned)1, bl.buffers().size());
    EXPECT_FALSE(bl.is_page_aligned());
    bl.rebuild_page_aligned();
    EXPECT_TRUE(bl.is_page_aligned());
    EXPECT_EQ((unsigned)1, bl.buffers().size());
  }

The content of the buffer::list can be read from a file or written to a file, either with a file descriptor or a path.

{
  bufferlist bl;
  bl.append("ABC");
  EXPECT_EQ(0, bl.write_file("testfile"));
}
{
  std::string error;
  bufferlist bl;
  ::system("echo ABC > testfile");
  EXPECT_EQ(0, bl.read_file("testfile", &error));
  EXPECT_EQ((unsigned)4, bl.length());
  std::string actual(bl.c_str(), bl.length());
  EXPECT_EQ("ABC\n", actual);
}