The ceph buffers are used to process data in memory. For instance, when a FileStore handles an OP_WRITE transaction it writes a list of buffers to disk.
+---------+ | +-----+ | list ptr | | | | +----------+ +-----+ | | | | | append_ >-------> >--------------------> | | | buffer | +-----+ | | | | +----------+ ptr | | | | | _len | list +-----+ | | | | +----------+ +------+ ,--->+ >-----> | | | _buffers >----> >----- +-----+ | +-----+ | +----------+ +----^-+ \ ptr | raw | | last_p | / `-->+-----+ | +-----+ | +--------+-+ / + >-----> | | | ,- ,--->+-----+ | | | | | / ,--- | | | | | / ,--- | | | | +-v--+-^--+--^+-------+ | | | | | bl | ls | p | p_off >--------------->| | | +----+----+-----+-----+ | +-----+ | | | off >------------->| raw | +---------------+-----+ | | iterator +---------+
The actual data is stored in buffer::raw opaque objects. They are accessed through a buffer::ptr. A buffer::list is a sequential list of buffer::ptr which can be used as if it was a contiguous data area although it can be spread over many buffer::raw containers, as represented by the rectangle enclosing the two buffer::raw objects in the above drawing. The buffer::list::iterator can be used to walk each character of the buffer::list as follows:
bufferlist bl; bl.append("ABC", 3); { bufferlist::iterator i(&bl); ++i; EXPECT_EQ('B', *i); }
documentation and unit tests
The ultimate documentation for buffer.cc and buffer.h are the unit tests that demonstrate how it actually works. This document is a short guide designed to provide an overview and does not attempt to cover everything.
buffer::ptr and buffer::raw
The buffer::raw is where the data is actually stored. It is allocated with malloc, new or reusing a pointer provided by the caller. A variant of the malloc constructor provides an area that is aligned on CEPH_PAGE_SIZE. The address of the allocated memory will be a multiple of CEPH_PAGE_SIZE, which must be a power of two and a multiple of sizeof(void *).
+-----------+ +-----+ | | | | | offset +----------------+ | | | | | | length +---- | | | | \------- | | +-----------+ \---+ | | ptr | +-----+ +-----------+ | raw | +-----+
The buffer::raw area can only be accessed through the buffer::ptr. It adresses the buffer::raw bytes in the range [offset,offset+length[. The buffer::ptr methods are very flexible and mostly designed to be used to implement buffer::lists. The constructors can allocate a buffer::raw area with ptr(unsigned l) or be assigned an existing buffer::raw with ptr(raw *r), as in buffer::list::rebuild().
Bytes can be copied in or copied out within the [offset,offset+length[ range. If the underlying buffer::raw extends beyond offset+length (as reported by unused_tail_length()), bytes can be appended.
bufferptr ptr(2); ptr.set_length(0); ptr.append('A'); EXPECT_EQ((unsigned)1, ptr.length()); EXPECT_EQ('A', ptr[0]); ptr.append("B", (unsigned)1); EXPECT_EQ((unsigned)2, ptr.length()); EXPECT_EQ('B', ptr[1]);
buffer::list and buffer::list::iterator
A buffer::list is a list of buffer::ptr, as shown below.
+---------+ | +-----+ | list | | | | +----------+ | | | | | append_ | | | | | | buffer | | | | | +----------+ ptr | | | | | _len | list +-----+ | | | | +----------+ +------+ ,--->+ >-----> | | | _buffers >----> >----- +-----+ | +-----+ | +----------+ +----^-+ \ ptr | raw | | last_p | `-->+-----+ | +-----+ | +----------+ | >-----> | | +-----+ | | | | | | | | | | | | | | | | | | | | | +-----+ | | raw | | | +---------+
The operator[] abstracts it.
bufferlist bl; bl.append('A'); bufferlist other; other.append('B'); bl.append(other); EXPECT_EQ((unsigned)2, bl.buffers().size()); EXPECT_EQ('B', bl[1]);
The buffer::list::iterator class provides some of the usual iterator behavior and is used in the buffer::list::contents_equal(ceph::buffer::list& other) method.
+---------+ | +-----+ | list | | | | +----------+ | | | | | append_ | | | | | | buffer | | | | | +----------+ ptr | | | | | _len | list +-----+ | | | | +----------+ +------+ ,--->+ >-----> | | | _buffers >----> >----- +-----+ | +-----+ | +----------+ +----^-+ \ ptr | raw | | last_p | / `-->+-----+ | +-----+ | +--------+-+ / + >-----> | | | ,- ,--->+-----+ | | | | | / ,--- | | | | | / ,--- | | | | +-v--+-^--+--^+-------+ | | | | | bl | ls | p | p_off >--------------->| | | +----+----+-----+-----+ | +-----+ | | | off >------------->| raw | +---------------+-----+ | | iterator +---------+
The bl data member points to the buffer::list. The ls data member is used to avoid dereferencing a pointer and is equivalent to bl->_buffers. The p data member is a std::list<ptr>::iterator used to walk the list of buffer::ptr. The p_off data member is the offset at which the iterator currently is, within the buffer::raw pointed by p. The off data member is the offset of the iterator, as if there was only one buffer::raw.
Although buffer::list::iterator exposes copy in and out methods, they are designed to be used as supporting methods for the corresponding copy in and out methods from the buffer::list class.
{ bufferlist bl; bufferlist dest; const char *expected = "ABC"; bl.append(expected); bl.copy(1, 2, dest); EXPECT_EQ(0, ::memcmp(expected + 1, dest.c_str(), 2)); } { bufferlist bl; bl.append("XXX"); bl.copy_in(1, 2, "AB"); EXPECT_EQ(0, ::memcmp("XAB", bl.c_str(), 3)); }
The internal representation of the buffer::list can be rebuilt to use a single buffer::raw.
{ bufferlist bl; const std::string str(CEPH_PAGE_SIZE, 'X'); bl.append(str.c_str(), str.size()); bl.append(str.c_str(), str.size()); EXPECT_EQ((unsigned)2, bl.buffers().size()); bl.rebuild(); EXPECT_EQ((unsigned)1, bl.buffers().size()); }
It can also be rebuilt to only use buffer::raw that are aligned on CEPH_PAGE_SIZE.
{ bufferlist bl; { bufferptr ptr(CEPH_PAGE_SIZE + 1); ptr.set_offset(1); ptr.set_length(CEPH_PAGE_SIZE); bl.append(ptr); } EXPECT_EQ((unsigned)1, bl.buffers().size()); EXPECT_FALSE(bl.is_page_aligned()); bl.rebuild_page_aligned(); EXPECT_TRUE(bl.is_page_aligned()); EXPECT_EQ((unsigned)1, bl.buffers().size()); }
The content of the buffer::list can be read from a file or written to a file, either with a file descriptor or a path.
{ bufferlist bl; bl.append("ABC"); EXPECT_EQ(0, bl.write_file("testfile")); } { std::string error; bufferlist bl; ::system("echo ABC > testfile"); EXPECT_EQ(0, bl.read_file("testfile", &error)); EXPECT_EQ((unsigned)4, bl.length()); std::string actual(bl.c_str(), bl.length()); EXPECT_EQ("ABC\n", actual); }