Ceph buffers are used to process data in memory. For instance, when a FileStore handles an OP_WRITE transaction, it writes a list of buffers to disk.
                                             +---------+
                                             | +-----+ |
     list             ptr                    | |     | |
 +----------+       +-----+                  | |     | |
 | append_  >------->     >-------------------->     | |
 |  buffer  |       +-----+                  | |     | |
 +----------+                        ptr     | |     | |
 |   _len   |      list            +-----+   | |     | |
 +----------+    +------+     ,--->+     >----->     | |
 | _buffers >---->      >-----     +-----+   | +-----+ |
 +----------+    +----^-+     \      ptr     |   raw   |
 |  last_p  |        /         `-->+-----+   | +-----+ |
 +--------+-+       /              +     >----->     | |
          |       ,-          ,--->+-----+   | |     | |
          |      /        ,---               | |     | |
          |     /     ,---                   | |     | |
        +-v--+-^--+--^--+-------+            | |     | |
        | bl | ls |  p  | p_off >------------->|     | |
        +----+----+-----+-------+            | +-----+ |
        |               |  off  >----------->|   raw   |
        +---------------+-------+            |         |
                iterator                     +---------+
The actual data is stored in opaque buffer::raw objects, which are accessed through a buffer::ptr. A buffer::list is a sequential list of buffer::ptr that can be used as if it were a contiguous data area, although it may be spread over many buffer::raw containers, as represented by the rectangle enclosing the two buffer::raw objects in the above drawing. The buffer::list::iterator can be used to walk the buffer::list one character at a time, as follows:
bufferlist bl;
bl.append("ABC", 3);
{
  bufferlist::iterator i(&bl);
  ++i;  // advance from 'A' to 'B'
  EXPECT_EQ('B', *i);
}
documentation and unit tests
The ultimate documentation for buffer.cc and buffer.h is the set of unit tests, which demonstrate how the code actually works. This document is a short guide that provides an overview and does not attempt to cover everything.
buffer::ptr and buffer::raw
The buffer::raw is where the data is actually stored. It is allocated with malloc, with new, or by reusing a pointer provided by the caller. A variant of the malloc constructor provides an area that is aligned on CEPH_PAGE_SIZE: the address of the allocated memory is a multiple of CEPH_PAGE_SIZE, which must be a power of two and a multiple of sizeof(void *).
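The alignment constraints above (a power of two that is also a multiple of sizeof(void *)) are the ones posix_memalign places on its alignment argument. The following is a minimal sketch of page-aligned allocation on a POSIX system. It is not Ceph's allocation code: kPageSize, alloc_aligned and is_aligned are hypothetical names chosen for illustration, and kPageSize merely stands in for CEPH_PAGE_SIZE.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>   // posix_memalign (POSIX), free

// Hypothetical page size for illustration; stands in for CEPH_PAGE_SIZE.
constexpr std::size_t kPageSize = 4096;

// Allocate `len` bytes whose address is a multiple of `align`.
// posix_memalign requires `align` to be a power of two and a
// multiple of sizeof(void *). Returns nullptr on failure.
inline char *alloc_aligned(std::size_t len, std::size_t align = kPageSize) {
  void *p = nullptr;
  if (::posix_memalign(&p, align, len) != 0)
    return nullptr;
  return static_cast<char *>(p);
}

// True when `p` sits on an `align`-byte boundary.
inline bool is_aligned(const void *p, std::size_t align = kPageSize) {
  return reinterpret_cast<std::uintptr_t>(p) % align == 0;
}
```

Memory obtained this way is released with free(), like any malloc'ed block.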
+-----------+                +-----+
|           |                |     |
|  offset   +----------------+     |
|           |                |     |
|  length   +----            |     |
|           |    \-------    |     |
+-----------+            \---+     |
|    ptr    |                +-----+
+-----------+                | raw |
                             +-----+
The buffer::raw area can only be accessed through the buffer::ptr, which addresses the buffer::raw bytes in the half-open range [offset, offset+length). The buffer::ptr methods are very flexible and are mostly designed to be used to implement buffer::list. The constructors can allocate a buffer::raw area with ptr(unsigned l) or be assigned an existing buffer::raw with ptr(raw *r), as in buffer::list::rebuild().
Bytes can be copied in or copied out within the [offset, offset+length) range. If the underlying buffer::raw extends beyond offset+length (as reported by unused_tail_length()), bytes can be appended.
bufferptr ptr(2);
ptr.set_length(0);
ptr.append('A');
EXPECT_EQ((unsigned)1, ptr.length());
EXPECT_EQ('A', ptr[0]);
ptr.append("B", (unsigned)1);
EXPECT_EQ((unsigned)2, ptr.length());
EXPECT_EQ('B', ptr[1]);
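To make the window idea concrete, here is a toy model of a pointer that exposes an offset/length range of a shared block and can grow into the unused tail. It is only a sketch: Raw, Ptr and demo_append are hypothetical stand-ins written for this guide, not the buffer::raw and buffer::ptr classes, and the real classes do much more (reference counting details, alignment, cloning).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for buffer::raw: a fixed-size block of bytes.
struct Raw {
  std::vector<char> data;
  explicit Raw(std::size_t n) : data(n) {}
};

// Toy stand-in for buffer::ptr: a [off, off + len) window onto a shared Raw.
struct Ptr {
  std::shared_ptr<Raw> raw;
  std::size_t off = 0;
  std::size_t len = 0;

  explicit Ptr(std::size_t n) : raw(std::make_shared<Raw>(n)), len(n) {}

  // Bytes of the Raw that lie past the end of the window.
  std::size_t unused_tail_length() const {
    return raw->data.size() - (off + len);
  }

  char &operator[](std::size_t i) {
    assert(i < len);
    return raw->data[off + i];
  }

  // Grow the window by one byte; only legal while there is unused tail.
  void append(char c) {
    assert(unused_tail_length() > 0);
    raw->data[off + len] = c;
    ++len;
  }
};

// Mirrors the unit test excerpt: shrink the window, then append into the tail.
inline std::size_t demo_append() {
  Ptr p(2);
  p.len = 0;       // like set_length(0): both bytes become unused tail
  p.append('A');
  p.append('B');
  return p.unused_tail_length();  // the window covers the whole Raw again
}
```

Setting the length to zero does not free anything: the two bytes stay allocated in the Raw and simply move from the window to the unused tail, which is why the two appends succeed.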
buffer::list and buffer::list::iterator
A buffer::list is a list of buffer::ptr, as shown below.
                                             +---------+
                                             | +-----+ |
     list                                    | |     | |
 +----------+                                | |     | |
 | append_  |                                | |     | |
 |  buffer  |                                | |     | |
 +----------+                        ptr     | |     | |
 |   _len   |      list            +-----+   | |     | |
 +----------+    +------+     ,--->+     >----->     | |
 | _buffers >---->      >-----     +-----+   | +-----+ |
 +----------+    +----^-+     \      ptr     |   raw   |
 |  last_p  |                  `-->+-----+   | +-----+ |
 +----------+                      |     >----->     | |
                                   +-----+   | |     | |
                                             | |     | |
                                             | |     | |
                                             | |     | |
                                             | |     | |
                                             | +-----+ |
                                             |   raw   |
                                             |         |
                                             +---------+
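The operator[] hides this layout: bl[n] returns the byte at offset n of the buffer::list, regardless of which buffer::ptr holds it.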
bufferlist bl;
bl.append('A');
bufferlist other;
other.append('B');
bl.append(other);
EXPECT_EQ((unsigned)2, bl.buffers().size());
EXPECT_EQ('B', bl[1]);
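The offset translation behind such an index operator can be sketched with a toy segmented list. This is an illustration only: SegList is a hypothetical type in which each std::string plays the role of one buffer::ptr, and the out-of-range behavior shown here is a choice made for the sketch, not a statement about what buffer::list does.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <stdexcept>
#include <string>

// Toy stand-in for buffer::list: each std::string plays the role of one
// buffer::ptr and the bytes it exposes.
struct SegList {
  std::list<std::string> segments;

  void append(const std::string &s) { segments.push_back(s); }

  std::size_t length() const {
    std::size_t n = 0;
    for (const auto &s : segments) n += s.size();
    return n;
  }

  // Map a logical offset to (segment, offset within that segment).
  char operator[](std::size_t i) const {
    for (const auto &s : segments) {
      if (i < s.size()) return s[i];
      i -= s.size();  // skip this segment entirely
    }
    throw std::out_of_range("offset past end of list");
  }
};
```

Indexing this way is linear in the number of segments, which is one reason sequential access is better served by an iterator that remembers its position.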
The buffer::list::iterator class provides some of the usual iterator behavior and is used in the buffer::list::contents_equal(ceph::buffer::list& other) method.
                                             +---------+
                                             | +-----+ |
     list                                    | |     | |
 +----------+                                | |     | |
 | append_  |                                | |     | |
 |  buffer  |                                | |     | |
 +----------+                        ptr     | |     | |
 |   _len   |      list            +-----+   | |     | |
 +----------+    +------+     ,--->+     >----->     | |
 | _buffers >---->      >-----     +-----+   | +-----+ |
 +----------+    +----^-+     \      ptr     |   raw   |
 |  last_p  |        /         `-->+-----+   | +-----+ |
 +--------+-+       /              +     >----->     | |
          |       ,-          ,--->+-----+   | |     | |
          |      /        ,---               | |     | |
          |     /     ,---                   | |     | |
        +-v--+-^--+--^--+-------+            | |     | |
        | bl | ls |  p  | p_off >------------->|     | |
        +----+----+-----+-------+            | +-----+ |
        |               |  off  >----------->|   raw   |
        +---------------+-------+            |         |
                iterator                     +---------+
The bl data member points to the buffer::list. The ls data member is used to avoid dereferencing a pointer and is equivalent to bl->_buffers. The p data member is a std::list<ptr>::iterator used to walk the list of buffer::ptr. The p_off data member is the offset at which the iterator currently is, within the buffer::raw pointed to by p. The off data member is the offset of the iterator, as if there were only one buffer::raw.
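The interplay between p, p_off and off can be sketched with a toy iterator over a list of strings. This is an assumption-laden illustration, not the buffer::list::iterator implementation: Segments, SegIter and flatten are hypothetical names, each std::string stands in for one buffer::ptr, and the bl member is omitted because the toy list has no enclosing object.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

using Segments = std::list<std::string>;  // toy list of buffer::ptr

// Toy iterator tracking the same positions as described above.
struct SegIter {
  const Segments *ls;          // the list being walked
  Segments::const_iterator p;  // current segment
  std::size_t p_off = 0;       // offset inside *p
  std::size_t off = 0;         // offset as if the data were contiguous

  explicit SegIter(const Segments &l) : ls(&l), p(l.begin()) { settle(); }

  bool end() const { return p == ls->end(); }
  char operator*() const { return (*p)[p_off]; }

  SegIter &operator++() {
    assert(!end());
    ++p_off;
    ++off;
    settle();
    return *this;
  }

 private:
  // Move to the next segment whenever p_off runs past the current one;
  // this also steps over empty segments.
  void settle() {
    while (p != ls->end() && p_off >= p->size()) {
      p_off = 0;
      ++p;
    }
  }
};

// Collect every byte by walking the iterator to the end.
inline std::string flatten(const Segments &l) {
  std::string out;
  for (SegIter i(l); !i.end(); ++i) out += *i;
  return out;
}
```

Note how off only ever increases, while p_off resets to zero each time p moves to the next segment.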
Although buffer::list::iterator exposes copy in and out methods, they are designed to be used as supporting methods for the corresponding copy in and out methods from the buffer::list class.
{
  // copy out: the 2 bytes starting at offset 1 ("BC") land in dest
  bufferlist bl;
  bufferlist dest;
  const char *expected = "ABC";
  bl.append(expected);
  bl.copy(1, 2, dest);
  EXPECT_EQ(0, ::memcmp(expected + 1, dest.c_str(), 2));
}
{
  // copy in: overwrite the 2 bytes starting at offset 1
  bufferlist bl;
  bl.append("XXX");
  bl.copy_in(1, 2, "AB");
  EXPECT_EQ(0, ::memcmp("XAB", bl.c_str(), 3));
}
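When the requested range straddles several segments, a copy has to be split at each segment boundary. The sketch below shows that splitting on a toy list of strings; copy_out, copy_in and Segments are hypothetical helpers written for this guide, with each std::string standing in for one buffer::ptr, and they do not reproduce the signatures or error handling of the buffer::list methods.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

using Segments = std::list<std::string>;  // toy list of buffer::ptr

// Copy `len` bytes starting at logical offset `off` into a flat string.
inline std::string copy_out(const Segments &l, std::size_t off,
                            std::size_t len) {
  std::string dest;
  for (const auto &s : l) {
    if (len == 0) break;
    if (off >= s.size()) { off -= s.size(); continue; }  // not reached yet
    const std::size_t n = std::min(len, s.size() - off);
    dest.append(s, off, n);
    len -= n;
    off = 0;  // later segments are read from their start
  }
  assert(len == 0 && "range extends past the end of the list");
  return dest;
}

// Overwrite bytes in place starting at logical offset `off`.
// Segment boundaries are left unchanged.
inline void copy_in(Segments &l, std::size_t off, const std::string &src) {
  std::size_t done = 0;
  for (auto &s : l) {
    if (done == src.size()) break;
    if (off >= s.size()) { off -= s.size(); continue; }
    const std::size_t n = std::min(src.size() - done, s.size() - off);
    s.replace(off, n, src, done, n);
    done += n;
    off = 0;
  }
  assert(done == src.size() && "source does not fit in the list");
}
```

Both helpers leave the number of segments untouched; only the bytes inside them are read or overwritten.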
The internal representation of the buffer::list can be rebuilt to use a single buffer::raw.
{
  bufferlist bl;
  const std::string str(CEPH_PAGE_SIZE, 'X');
  // two page-sized appends end up in two separate buffers
  bl.append(str.c_str(), str.size());
  bl.append(str.c_str(), str.size());
  EXPECT_EQ((unsigned)2, bl.buffers().size());
  bl.rebuild();
  EXPECT_EQ((unsigned)1, bl.buffers().size());
}
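Conceptually, rebuilding means allocating one area large enough for everything, copying each segment into it in order, and replacing the list of segments with that single area. A minimal sketch on a toy list of strings follows; rebuild and Segments here are hypothetical names, each std::string stands in for one buffer::ptr, and the sketch says nothing about how buffer::list::rebuild() manages its memory.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>

using Segments = std::list<std::string>;  // toy list of buffer::ptr

// Replace all segments with a single one holding the same bytes in order.
inline void rebuild(Segments &l) {
  std::size_t total = 0;
  for (const auto &s : l) total += s.size();

  std::string flat;
  flat.reserve(total);  // one allocation for the whole content
  for (const auto &s : l) flat += s;

  l.clear();
  l.push_back(std::move(flat));
}
```

The cost is one full copy of the data, so flattening is worth doing only when a caller really needs a contiguous area.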
It can also be rebuilt to only use buffer::raw that are aligned on CEPH_PAGE_SIZE.
{
  bufferlist bl;
  {
    // an offset of 1 makes the ptr start off a page boundary
    bufferptr ptr(CEPH_PAGE_SIZE + 1);
    ptr.set_offset(1);
    ptr.set_length(CEPH_PAGE_SIZE);
    bl.append(ptr);
  }
  EXPECT_EQ((unsigned)1, bl.buffers().size());
  EXPECT_FALSE(bl.is_page_aligned());
  bl.rebuild_page_aligned();
  EXPECT_TRUE(bl.is_page_aligned());
  EXPECT_EQ((unsigned)1, bl.buffers().size());
}
The content of the buffer::list can be read from a file or written to a file, either with a file descriptor or a path.
{
  bufferlist bl;
  bl.append("ABC");
  EXPECT_EQ(0, bl.write_file("testfile"));
}
{
  std::string error;
  bufferlist bl;
  // echo appends a newline, hence the expected length of 4
  ::system("echo ABC > testfile");
  EXPECT_EQ(0, bl.read_file("testfile", &error));
  EXPECT_EQ((unsigned)4, bl.length());
  std::string actual(bl.c_str(), bl.length());
  EXPECT_EQ("ABC\n", actual);
}

