How does a Ceph OSD handle a read message ? (in Firefly and up)

When an OSD handles an operation, it is queued to a PG: it is added to the op_wq work queue ( or to the waiting_for_map list if the queue_op method of PG finds that it must wait for an OSDMap ) and will be dequeued asynchronously. The dequeued operation is processed by the ReplicatedPG::do_request method, which calls the do_op method because it is a CEPH_MSG_OSD_OP. An OpContext is allocated and executed.
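The queueing decision can be summarized with a small standalone sketch; the types below are simplified stand-ins for the real OSD and PG classes and the epoch comparison only approximates the actual check.

#include <deque>
#include <list>

// simplified stand-ins, not the actual Ceph classes
struct OpRequest { unsigned map_epoch; };

struct PG {
  unsigned osdmap_epoch = 0;             // epoch of the latest OSDMap known to the PG
  std::list<OpRequest*> waiting_for_map; // operations parked until a newer map arrives
  std::deque<OpRequest*> op_wq;          // stand-in for the OSD op_wq work queue

  void queue_op(OpRequest *op) {
    if (!waiting_for_map.empty() || op->map_epoch > osdmap_epoch) {
      waiting_for_map.push_back(op);     // must wait for an OSDMap
      return;
    }
    op_wq.push_back(op);                 // dequeued asynchronously by a worker thread
  }
};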

2014-02-24 09:28:34.571489 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] execute_ctx 0x7fc16c08a3b0

A transaction (which is either a RPGTransaction for a replicated backend or an ECTransaction for an erasure coded backend) is obtained from the PGBackend. The transaction is attached to an OpContext (which was allocated by do_op). Note that although the following log line shows do_op, it actually comes from the execute_ctx method.

2014-02-24 09:28:34.571563 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] do_op 847441d7/SOMETHING/head//3 [read 0~4194304] ov 26'1

The execute_ctx method calls prepare_transaction which calls do_osd_ops which prepares the CEPH_OSD_OP_READ.

2014-02-24 09:28:34.571663 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] async_read noted for 847441d7/SOMETHING/head//3

The execute_ctx method continues when prepare_transaction returns and creates the MOSDOpReply object. It then calls start_async_reads which calls objects_read_async on the backend (which is either ReplicatedBackend::objects_read_async or ECBackend::objects_read_async). When the read completes (this code path is not explored here), it calls the OnReadComplete::finish method (because the OnReadComplete object was given as an argument to objects_read_async), which calls ReplicatedPG::OpContext::finish_read each time a read completes (i.e. if reading from an erasure coded pool, on each chunk), which in turn calls ReplicatedPG::complete_read_ctx (if there are no pending reads) to send the reply to the client.
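A hypothetical standalone sketch of that completion pattern (not the actual Ceph classes): each completed read decrements a pending counter and the reply only goes out once nothing remains pending.

#include <functional>

struct OpContext {
  int pending_async_reads = 0;
  std::function<void()> complete_read_ctx; // stand-in: builds and sends the MOSDOpReply

  void start_read() { ++pending_async_reads; }

  // called once per finished read, i.e. once per chunk when reading
  // from an erasure coded pool
  void finish_read() {
    if (--pending_async_reads == 0)
      complete_read_ctx();               // no pending reads left: reply to the client
  }
};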

Ceph paxos propose interval

When a command is sent to the Ceph monitor, such as ceph osd pool create, it adds a pool to the pending changes of the maps. The modification is stashed for paxos propose interval seconds before it is used to build new maps and becomes effective. This guarantees that the mons are not updated more than once a second ( the default value of paxos propose interval ).
When running make check, lowering the paxos propose interval value to 0.01 seconds for the cephtool tests roughly halves the run time ( going from ~2.5 minutes to ~1.25 minutes real time ).

--paxos-propose-interval=0.01
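The equivalent when setting the option from ceph.conf for a throw away test cluster would be:

[mon]
paxos propose interval = 0.01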

Exploring Ceph cache pool implementation

Sage Weil and Greg Farnum's presentation during the 2013 Firefly Ceph Developer Summit is used as an introduction to the cache pool that is being implemented for the upcoming Firefly release.
The CEPH_OSD_OP_COPY_FROM etc. rados operations were introduced in Emperor and are tested by ceph_test_rados, which is used by teuthology for integration tests and runs COPY_FROM and COPY_GET at random.
After a cache pool has been defined using the osd tier commands, objects can be promoted to the cache pool ( see the corresponding test case ).
The HitSets keep track of which objects have been read or written ( using bloom filters ).
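As an illustration, a cache pool can be attached to an existing pool with commands along these lines ( the pool names are placeholders and the hit_set values are only examples ):

ceph osd tier add rbd rbd-cache                  # make rbd-cache a tier of rbd
ceph osd tier cache-mode rbd-cache writeback     # the cache pool absorbs reads and writes
ceph osd tier set-overlay rbd rbd-cache          # redirect client traffic to the cache
ceph osd pool set rbd-cache hit_set_type bloom   # track object access with bloom filter HitSets
ceph osd pool set rbd-cache hit_set_count 1
ceph osd pool set rbd-cache hit_set_period 3600  # one HitSet per hour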

How does Ceph backfilling push objects to replicas ?

When a placement group starts backfilling, it asks the OSD to queue it for recovery. It will eventually be processed and the OSD will ask it to start the recovery operations. Since it is backfilling ( this is the original reason why the recovery operation was queued ), it will attempt to reserve a backfill channel ( step 1, step 2 ). When the reservation is successful it goes back to the initial backfilling state, which re-queues the PG for recovery. When processed, the same function is run, but this time the backfill channel is reserved and it starts the backfilling operations. It scans the other OSDs to retrieve a list of objects and their associated versions and pushes missing objects to the replicas. Each object pushed is locked for read and ( after trying some snapshot based heuristics ) a push is registered and a CEPH_OSD_OP_PUSH operation is sent to the peer OSD. The receiving replica handles the message by submitting the payload to a transaction with which the OSD writes it to disk.
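The scan-and-push comparison at the heart of backfilling can be sketched as follows; the plain maps and the push callback are simplified stand-ins, the real code works on intervals of objects and registered push operations.

#include <functional>
#include <map>
#include <string>

typedef std::map<std::string, unsigned> ObjectVersions; // object name -> last version

// push is a stand-in for registering a push and sending CEPH_OSD_OP_PUSH to the replica
void backfill_replica(const ObjectVersions &on_primary,
                      const ObjectVersions &on_replica,
                      const std::function<void(const std::string&)> &push) {
  for (ObjectVersions::const_iterator i = on_primary.begin();
       i != on_primary.end(); ++i) {
    ObjectVersions::const_iterator r = on_replica.find(i->first);
    if (r == on_replica.end() || r->second < i->second)
      push(i->first); // missing or out of date on the replica: push it
  }
}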

Anatomy of ObjectContext, the Ceph in core representation of an object

An ObjectContext is created when a ReplicatedPG applies operations on an object.

read/write mutual exclusion

The C_OSD_OndiskWriteUnlock callback is registered to be called after a transaction ( a read in this case ) completes. It signals the waiting writes and reads if all writes are done.
Before adding an entry that will write an object to a transaction ( for instance when ReplicatedPG::mark_object_lost sets the object_info_t::lost data member to true ), the ObjectContext::ondisk_write_lock method is called and will Cond::Wait until all reads complete. The caller of mark_object_lost adds the ObjectContext to a list that is used to build the C_OSD_OndiskWriteUnlockList callback, which is called when the transaction completes and calls ondisk_write_unlock on each object.

The logic is similar when reverting an object to a prior version, applying a replica operation, or in ReplicatedPG::handle_pull_response.

ondisk_read_lock is called by ReplicatedPG::do_op and the lock is released after prepare_transaction returns. ReplicatedPG::recover_object_replicas and ReplicatedPG::push_backfill_object do the same.

ondisk_write_lock will wait until there are no more read operations waiting ( readers_waiting ) or being processed ( readers ). ondisk_read_lock will wait for ongoing writes to finish ( unstable_writes ) but will take the lock even if writers are waiting ( writers_waiting ), therefore taking precedence over writes. There can be any number of simultaneous writes ( unstable_writes > 1 ) as long as there are no ongoing reads ( readers < 1 ). There can be any number of simultaneous readers ( readers > 1 ) as long as there are no ongoing writes ( unstable_writes < 1 ). ondisk_read_unlock will signal waiting writers if there are no more readers ( !readers ). ondisk_write_unlock will signal waiting readers if there are no more writers ( !unstable_writes ).
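These rules can be condensed into a standalone illustration using std::mutex and std::condition_variable in place of Ceph's Mutex and Cond; the counter names mirror the ObjectContext data members but the code below is not the actual implementation.

#include <mutex>
#include <condition_variable>

struct OndiskRWState {
  std::mutex m;
  std::condition_variable cond;
  int readers = 0, readers_waiting = 0;
  int unstable_writes = 0, writers_waiting = 0;

  void ondisk_read_lock() {        // reads take precedence over waiting writers
    std::unique_lock<std::mutex> l(m);
    ++readers_waiting;
    cond.wait(l, [this] { return unstable_writes == 0; });
    --readers_waiting;
    ++readers;                     // any number of concurrent readers
  }
  void ondisk_read_unlock() {
    std::unique_lock<std::mutex> l(m);
    if (--readers == 0)
      cond.notify_all();           // wake the waiting writers
  }
  void ondisk_write_lock() {
    std::unique_lock<std::mutex> l(m);
    ++writers_waiting;
    cond.wait(l, [this] { return readers == 0 && readers_waiting == 0; });
    --writers_waiting;
    ++unstable_writes;             // any number of concurrent writes
  }
  void ondisk_write_unlock() {
    std::unique_lock<std::mutex> l(m);
    if (--unstable_writes == 0)
      cond.notify_all();           // wake the waiting readers
  }
};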

blocking and blocked_by

ObjectContext has a blocking and a blocked_by data member. When an operation on an object is made of multiple operations, all of them must be about an object by the same name but can be about different versions. If a variation of the object is degraded, the object is blocked by the degraded object and is added to the list of objects blocked by it.
Before peering, ReplicatedPG::on_change is called and, for each object in the waiting_for_degraded_object list, it loops over the objects it is blocking, removes them from the list and unblocks them. The same happens whenever an object has been pushed.
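A minimal sketch of that relationship, using plain pointers and a std::set instead of the reference counted ObjectContext used by Ceph:

#include <set>

struct ObjectContextSketch {
  std::set<ObjectContextSketch*> blocking;   // contexts blocked by this (degraded) object
  ObjectContextSketch *blocked_by = nullptr; // the degraded object blocking this context

  // what ReplicatedPG::on_change or a completed push does for a degraded object
  void unblock_all() {
    for (ObjectContextSketch *obc : blocking)
      obc->blocked_by = nullptr;             // each blocked context may now proceed
    blocking.clear();
  }
};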

How does AccessMode control read/write processing in Ceph ?

An operation ( read, write etc. ) may be added to the mode.waiting queue if the ReplicatedPG::AccessMode does not allow it yet. For instance, if an operation may_write but AccessMode::try_write finds the current state to be RMW_FLUSHING, it will return false and the operation will be added to mode.waiting. However, if it finds that the state is IDLE, it will change it to RMW and return true.
When handling a write message, eval_repop is called at the end to figure out if the operation must be sent to the other OSDs in the acting set. If mode.is_rmw_mode(), it calls apply_repop(repop), which creates a transaction to write to the ObjectStore and gives it a C_OSD_OpApplied callback that calls ReplicatedPG::op_applied when it completes. ReplicatedPG::op_applied then calls mode.write_applied() and, if there is no pending write operation, sets wake = true.
put_object_context is called immediately after mode.write_applied() to release the ObjectContext; it also checks mode.wake and will requeue the operations that were previously added to the mode.waiting list because mode.state did not allow them to be processed. The put_object_context method is called in many places in ReplicatedPG, each of them being an opportunity to requeue the operations found in mode.waiting.
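A very condensed sketch of that state machine ( illustrative only: the real AccessMode has more states, transitions and bookkeeping ):

#include <list>

struct OpRequest;

struct AccessMode {
  enum State { IDLE, RMW, RMW_FLUSHING };
  State state = IDLE;
  int num_wr = 0;                // writes applied to the ObjectStore but not yet done
  bool wake = false;             // tells put_object_context to requeue 'waiting'
  std::list<OpRequest*> waiting; // operations refused until the state allows them

  bool try_write(OpRequest *op) {
    if (state == RMW_FLUSHING) { // cannot accept the write right now
      waiting.push_back(op);
      return false;
    }
    if (state == IDLE)
      state = RMW;
    ++num_wr;
    return true;
  }

  void write_applied() {
    if (--num_wr == 0)
      wake = true;               // no pending writes left: the waiting list can be requeued
  }
};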

How does a Ceph OSD handle a write message ? (up to Emperor)

When an OSD handles an operation, it is queued to a PG: it is added to the op_wq work queue ( or to the waiting_for_map list if the queue_op method of PG finds that it must wait for an OSDMap ) and will be dequeued asynchronously. The dequeued operation is processed by the PG::do_request method which calls the do_osd_ops method because it is a CEPH_MSG_OSD_OP. The do_osd_ops method is called by prepare_transaction via the PG::do_op pure virtual method, which is implemented in ReplicatedPG::do_op and called from the aforementioned PG::do_request method.

When done, ReplicatedPG::do_op calls ReplicatedPG::issue_repop, which sends the operation to the replicas. Once all replicas have acked the operation, the ReplicatedPG::eval_repop method notifies the client.
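A sketch of that ack accounting, with hypothetical names ( the real RepGather also tracks on-disk commits separately from acks ):

#include <functional>
#include <set>

struct RepGather {
  std::set<int> waitfor_ack;            // OSD ids that have not acked yet
  std::function<void()> reply_to_client;
  bool sent_reply = false;

  void ack_from(int osd) {              // a replica acknowledged the operation
    waitfor_ack.erase(osd);
    eval();
  }

  void eval() {                         // stand-in for ReplicatedPG::eval_repop
    if (waitfor_ack.empty() && !sent_reply) {
      reply_to_client();                // every replica acked: notify the client
      sent_reply = true;
    }
  }
};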

How do Ceph placement groups use Watch ?

ReplicatedPG::prepare_transaction ( which is called when handling a message ) will call ReplicatedPG::do_osd_op_effects if the message is handled successfully.
do_osd_op_effects iterates over ReplicatedPG::watch_connects, which is set with a watcher including a cookie, a timeout ( hard coded to 30 seconds ) and an IP address built from the connection on which the CEPH_OSD_OP_WATCH message was received. The watcher is also added to the object_info_t if not already present.
A Watch is added to the ObjectContext ( i.e. the in core representation of the object ). The Watch::connect method is called and retrieves an OSD::Session from the Connection on which the message was received.
The notifications found in in_progress_notifies are handled by send_notify which creates a MWatchNotify message and asks the OSD to send it using the connection referenced by the Watch ( i.e. the conn data member ).
When an object context is created, ReplicatedPG::populate_obc_watchers iterates over the watch_info_t found in the object_info_t that was loaded from disk, rebuilds a Watch from each of them and disconnects it so that it gets a chance to re-establish the connection with the client. When all Watches have been rebuilt, ReplicatedPG::check_blacklisted_obc_watchers is called to loop over the watchers and simulate a timeout if they are associated with a blacklisted entity ( according to the OSD map ).
The watchers of all ObjectContexts are also checked for blacklisted entities when PG::handle_activate_map is called.
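A standalone sketch of the rebuild and blacklist check described above; the classes are illustrative, the real Watch also tracks in progress notifies and the connection state.

#include <cstdint>
#include <map>
#include <set>
#include <string>

struct watch_info_t { uint64_t cookie; uint32_t timeout_seconds; };

struct WatchSketch {
  watch_info_t info;
  bool connected = false;
  bool expired = false;
  void disconnect() { connected = false; } // the client must re-establish the session
  void expire() { expired = true; }        // simulate a timeout: drop the watcher
};

struct WatchedObjectContext {
  std::map<std::string, WatchSketch> watchers; // keyed by the client entity name

  // what populate_obc_watchers followed by check_blacklisted_obc_watchers does
  void rebuild_watchers(const std::map<std::string, watch_info_t> &from_disk,
                        const std::set<std::string> &blacklisted) {
    for (const auto &w : from_disk) {
      WatchSketch &watch = watchers[w.first];
      watch.info = w.second;
      watch.disconnect();                // wait for the client to reconnect
    }
    for (auto &w : watchers)
      if (blacklisted.count(w.first))
        w.second.expire();               // blacklisted entity: behave as if it timed out
  }
};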