Ceph dialog : PGBackend

This is a dialog in which Samuel Just, a Ceph core developer, answers Loïc Dachary’s questions to explain the details of PGBackend (i.e. the placement group backend). It is a key component of the implementation of erasure coded pools.

Loic: Could you explain what the first PGBackend patch series, merged into master in mid-September, is about ?
Sam: The breakdown prior to this patch is that there were two files : PG.h and ReplicatedPG.h. The original intention was that other PG types would inherit from PG and override the functionalities that need to be different. The reason we didn’t go that way is because most of the behavior of ReplicatedPG will generalize just fine to erasure coding. In the long run the thing that is currently called ReplicatedPG will turn into more like PG2 or something (not sure what the name would be) : the part of PG that deals with I/O I suppose. And it will delegate specific tasks that are different between replicated and erasure coded PG to a PGBackend interface. Specifically that looks like recovery and client I/O. Those are the big pieces because, for both, behavior is very different. Recovery code needs to know where the pieces are located, it needs to be able to recover many pieces at once ( because otherwise you multiply that read factor even worse with an erasure coded PG ). With client I/O and a replicated PG, reads can be serviced synchronously on the primary but with an erasure coded PG it may be necessary to go to replicas to service the read. So we have to be able to abstract out that difference and that’s the point of PGBackend.
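To make the delegation idea concrete, here is a minimal C++ sketch of the split Sam describes. The class and method names below are invented for illustration (the real PGBackend interface in Ceph is much larger); the point is only that the PG core asks the backend the questions whose answers differ between replication and erasure coding, such as whether a read can be serviced locally on the primary.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical sketch, not the actual Ceph API : the PG core delegates
// strategy-specific behavior (recovery, client I/O) to a backend interface.
struct PGBackendSketch {
  virtual ~PGBackendSketch() = default;
  // A replicated PG can service reads synchronously on the primary ;
  // an erasure coded PG may need to gather shards from replicas.
  virtual bool can_read_locally() const = 0;
  virtual std::string name() const = 0;
};

struct ReplicatedBackendSketch : PGBackendSketch {
  bool can_read_locally() const override { return true; }
  std::string name() const override { return "replicated"; }
};

struct ErasureCodedBackendSketch : PGBackendSketch {
  bool can_read_locally() const override { return false; }
  std::string name() const override { return "erasure"; }
};
```

The PG core would hold a `std::unique_ptr<PGBackendSketch>` and never need to know which strategy is behind it.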
Loic: I did not realize that ( reading the current patch series ) ReplicatedPG would no longer be related to replicated pools. Instead the logic is going to ReplicatedBackend.h ?
Sam: Yes. The reason why I did not rename the files is that git is not always smart enough to tell where the changes have happened, and it would make rebasing difficult. I’ll do it later once I can get away with it.
Loic: I also noticed that you changed the method names for the implementation in place, i.e. there are ReplicatedBackend methods in ReplicatedPG.cc. What about the PGBackend::Listener ?
Sam: In an ideal world, assuming we can get all this right, it would be possible to test a PGBackend implementation without an OSD, an actual PG, any actual objects or messengers. Anything at all, right ?
Loic: Yes.
Sam: We’re nowhere near that point yet because the PGBackend has a reference to an OSDService, which is no good. It should also probably not be sending its own messages. That should go through a separate interface so that we can send them a different way if we want. It does not need an actual FileStore, though, because we have an ObjectStore interface and that is what gets used, the whole thing. The point of the Listener is that at least we don’t need to give PGBackend an actual PG. All we need is to give it something that fulfills the Listener interface and, over time, this will get smaller. We don’t necessarily need all that stuff as we improve the PGBackend interface.
Loic: I assumed the Listener was to be used by the back-end, from asynchronous operations, when it needs to dialog with the PG.
Sam: That’s correct. Which makes Listener not a great name 🙂 The main reason for the Listener is for on_local_recover_start, on_local_recover, on_global_recover, on_peer_recover, begin_peer_recover, failed_push and cancel_pull. Those are the real reasons. We need to notify the PGs of certain events.
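A small sketch may help picture the Listener. The callback names below come straight from the dialog, but the signatures and the recording test double are invented : the real Ceph interface passes richer types than an object name. The idea is that the backend notifies whoever implements the interface, so tests can substitute a stand-in for the PG.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative sketch of the Listener idea : the backend reports recovery
// events through a narrow interface instead of holding a full PG.
// Method names follow the dialog ; signatures are simplified guesses.
struct ListenerSketch {
  virtual ~ListenerSketch() = default;
  virtual void on_local_recover(const std::string &oid) = 0;
  virtual void on_global_recover(const std::string &oid) = 0;
  virtual void failed_push(const std::string &oid) = 0;
};

// A test double standing in for the PG : it just records what it was told.
struct RecordingListener : ListenerSketch {
  std::vector<std::string> events;
  void on_local_recover(const std::string &oid) override {
    events.push_back("local:" + oid);
  }
  void on_global_recover(const std::string &oid) override {
    events.push_back("global:" + oid);
  }
  void failed_push(const std::string &oid) override {
    events.push_back("failed:" + oid);
  }
};
```

This is exactly the testability Sam mentions earlier : a PGBackend wired to a `RecordingListener` needs no OSD, PG, or messenger.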
Loic: It’s a way for the back-end to dialog with whatever is feeding it.
Sam: Yes.
Loic: Could you tell me about the RecoveryHandle ?
Sam: That’s actually really simple : we made a change ( before Dumpling ) that allows us to recover a number of small objects in one message. If the primary needs to pull ten objects, and they are tiny, it will send one message containing ten pulls. The replica will then send one message containing ten pushes, which will all happen at once. I wanted to maintain that behavior, and a RecoveryHandle is an opaque thing that represents the recovery operation we’re about to start. It’s an empty interface because all you get to do is to pass it and get it back, right ?
Loic: OK.
Sam: But if you look at ReplicatedBackend, all it is is a map of the pushes we’re about to start.
Loic: There are two maps : a pull map and a push map.
Sam: Pushes and pulls actually, yes. Otherwise you’d have to push the logic for which objects need to be recovered into PGBackend, which I did not want to do. So the logic for which object needs to be recovered, and when, lives in the parent, and the logic about how to do the recovery lives in PGBackend.
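A sketch of the opaque-handle pattern described above, with invented types : the parent only ever sees the empty base class, while the concrete backend accumulates the batched pulls and pushes inside it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Sketch of the RecoveryHandle idea : opaque to the caller, concrete
// inside the backend. Names and types are illustrative, not Ceph's.
struct RecoveryHandleSketch {
  virtual ~RecoveryHandleSketch() = default;  // empty interface
};

struct ReplicatedRecoveryHandle : RecoveryHandleSketch {
  // peer osd id -> objects to pull from / push to that peer, batched so
  // one message can carry many small operations.
  std::map<int, std::vector<std::string>> pulls;
  std::map<int, std::vector<std::string>> pushes;
};
```

The parent decides *what* to recover, fills nothing in itself, and just hands the handle back to the backend, which knows *how*.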
Loic: I see that the PGBackend uses hobjects, will it eventually be ghobjects ?
Sam: The parent will speak to PGBackend in terms of hobjects. And the PGBackend will speak to the FileStore in terms of ghobjects.
Loic: So the hobjects will be cast into ghobjects where it makes sense ?
Sam: No. ghobjects are there to give the back-end some freedom about how to store objects. So when it does a write it will create a new version number and will use that as the generation number, because an erasure coded back-end needs to not overwrite the previous generations of that same object. But when you’re recovering an object, you’re not recovering a particular shard or a particular generation. You’re recovering that object. The whole thing. It’s the back-end’s problem to translate that into a sequence of FileStore operations.
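The relationship Sam describes can be sketched as a composition : the parent names objects with something like an hobject, and the backend decorates that name with a generation (and shard) when talking to the object store. The fields below are heavily simplified stand-ins for the real hobject_t / ghobject_t.

```cpp
#include <cassert>
#include <string>

// Simplified sketch : the parent speaks hobjects, the backend speaks
// ghobjects to the store. Real Ceph types carry far more fields.
struct HObjectSketch {
  std::string name;
};

struct GHObjectSketch {
  HObjectSketch obj;
  long generation;  // e.g. derived from the version number on a write,
                    // so older generations are not overwritten
  int shard;        // which erasure coded shard this piece is
};

// The backend translates "recover this object" into per-generation,
// per-shard store operations.
GHObjectSketch to_store_key(const HObjectSketch &h, long gen, int shard) {
  return GHObjectSketch{h, gen, shard};
}
```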
Loic: What did you have to reconsider compared to the approach you presented during the Emperor Design Summit in August 2013 ?
Sam: It has become more complicated but it’s roughly the same thing. There are more callbacks than I thought it needed. I forgot about pull cancellation ( that’s cancel_pull now ) and failed push ( that’s failed_push now ). But the basics are the same. There were a lot of details that were not covered in that document. For instance : PGBackend does not handle locking itself, it assumes that the parent handles the locking, which is simple. But it was not in the document. It’s that kind of thing.
Loic: I also noted there is a new work queue, what is it for ?
Sam: There is some stuff we need to do ( I/O related stuff ). When we finish pulling a chunk of an object we need to send a new pull, and we’re finished pulling that chunk when the transaction is finished, right ? We send a Context to the FileStore that fires when the transaction commits, and that’s how we know when it’s done. One option would be to send the new pull messages right then, but we don’t want to block the FileStore queue with expensive work. Instead we queue it for work in the disk thread pool. The reason for this queue is that I’m tired of introducing single purpose work queues. This one just queues contexts that fire, that’s it.
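The "queue that just fires contexts" can be sketched in a few lines. This is an illustrative single-threaded toy, not Ceph's WorkQueue : the real one is driven by the disk thread pool and is thread safe, but the shape is the same, deferred completions are queued and run later so expensive follow-up work does not block the FileStore queue.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// Toy sketch of a general-purpose context queue : callers enqueue
// completions, a worker later drains and fires them in order.
// A real implementation would add a mutex and worker threads.
class ContextQueueSketch {
  std::deque<std::function<void()>> contexts;

 public:
  void queue(std::function<void()> c) { contexts.push_back(std::move(c)); }

  // Called from the worker thread : fire everything that was queued.
  void drain() {
    while (!contexts.empty()) {
      auto c = std::move(contexts.front());
      contexts.pop_front();
      c();
    }
  }
};
```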
Loic: Why was it necessary to first deal with the temp collection ? Is it even a relevant question ? 🙂
Sam: We need to keep track of the content of the temp collection. When we go through an interval change, we need to clear the temp collection, and we would prefer not to list its content, since we can’t actually do that without flushing the FileStore work queue. So we need to track the set of objects we’ve created but not yet deleted in the temp collection : there is just a set of hobjects that does that. But the main use of the temp collection is recovery. When we recover an object, we write the partial object into the temp collection and rename it into place at the end. For that reason, recovery is what uses the temp collection, not the PG as a whole. So I moved all the temp collection handling stuff down to PGBackend. And there is a bunch of nonsense because the OSD needs to be able to delete these things, so it needs to know what’s in them, which temp collections to delete.
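The bookkeeping Sam describes, a set of in-flight temp objects so that an interval change can clear them without listing the collection, can be sketched as follows. The class and method names are invented ; in Ceph this is just a set of hobjects maintained alongside the create and delete operations.

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative sketch of temp-collection tracking : remember objects
// created but not yet deleted, so an interval change can clear exactly
// those without listing the collection (which would require flushing
// the FileStore work queue).
class TempTrackerSketch {
  std::set<std::string> temp_contents;

 public:
  void on_create(const std::string &oid) { temp_contents.insert(oid); }
  // e.g. recovery renames the finished object out of the temp collection
  void on_delete(const std::string &oid) { temp_contents.erase(oid); }
  // What an interval change would have to remove.
  const std::set<std::string> &to_clear() const { return temp_contents; }
};
```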