Architecture - Scalable Pinger
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Summary
The Scalable Pinger provides peer health information to Lustre clients and servers.
See bug 12471
Requirements
The pinger is currently used for several purposes:
- Clients identify dead servers, in order to reconnect to their failover partners.
- Servers evict clients that have not been heard from, in case those clients have died.
- Servers provide clients with information about committed transactions, so that the clients can release the corresponding saved requests from memory.
- Servers provide other state information, such as lock LRU size, to clients that are not otherwise doing RPCs, in order to manage the global lock LRU size.
Currently, every client pings every target (MDT and OST) 4 times every OBD_TIMEOUT, i.e. once every 25s with the default timeout of 100s. A large system with 10,000 clients and 200 targets therefore generates 2,000,000 pings every 25s, or 80,000 pings per second. Minor tweaks to the current system (ping once per OBD_TIMEOUT, ping each server only once) would only gain us around a factor of 10 (depending on the number of targets per server), which is not sufficient.
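For concreteness, the arithmetic behind these figures is spelled out in the short standalone sketch below (an illustration only, not Lustre code). It assumes the default OBD_TIMEOUT of 100s, and the "proposed" figure anticipates the CLIENT_PING_INTERVAL and SERVER_PING_INTERVAL values given later on this page.

```c
#include <stdio.h>

/* Back-of-the-envelope ping rates for a 10,000-client, 200-target system.
 * Standalone illustration only, not Lustre code. */
int main(void)
{
        const double clients = 10000.0;
        const double targets = 200.0;

        /* Current scheme: every client pings every target 4 times per
         * OBD_TIMEOUT (100s by default), i.e. once every 25s. */
        const double ping_interval = 100.0 / 4.0;
        printf("current : %.0f pings/s\n",
               clients * targets / ping_interval);                /* 80000 */

        /* Proposed scheme (described below): everyone pings only the MGS. */
        const double client_ping_interval = 50.0;   /* CLIENT_PING_INTERVAL */
        const double server_ping_interval = 10.0;   /* SERVER_PING_INTERVAL */
        printf("proposed: %.0f pings/s\n",
               clients / client_ping_interval +
               targets / server_ping_interval);                   /* 220 */
        return 0;
}
```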
Instead, a completely new system will be needed to drastically reduce the RPC traffic related to the pinger. This might look like:
- Every client and server pings the MGS once every X seconds. (Targets should probably ping more frequently than clients, since clients need to switch to failover partners quickly when a server dies, whereas there is no rush to evict dead clients.)
- The MGS identifies dead clients or servers, and sends out broadcast messages to the surviving clients or servers as appropriate.
Setting CLIENT_PING_INTERVAL to 50s and SERVER_PING_INTERVAL to 10s gives us 220 pings per second (10,000 clients / 50s + 200 targets / 10s).
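A minimal sketch of the bookkeeping the MGS would need for this is shown below, assuming per-peer last-ping timestamps and a periodic sweep that declares a peer dead after a few missed pings. All names here (mgs_peer, mgs_record_ping, mgs_sweep, MISSED_PINGS_ALLOWED) are hypothetical and not part of any existing Lustre API.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical MGS-side liveness table; illustration only, not Lustre code. */
#define CLIENT_PING_INTERVAL  50   /* seconds */
#define SERVER_PING_INTERVAL  10   /* seconds */
#define MISSED_PINGS_ALLOWED   3   /* declare a peer dead after 3 missed pings */
#define MAX_PEERS            256

enum peer_type { PEER_CLIENT, PEER_TARGET };

struct mgs_peer {
        char           uuid[64];
        enum peer_type type;
        time_t         last_ping;
        int            alive;
};

static struct mgs_peer peers[MAX_PEERS];
static int npeers;

/* Called whenever a ping RPC arrives at the MGS. */
void mgs_record_ping(const char *uuid, enum peer_type type)
{
        int i;

        for (i = 0; i < npeers; i++) {
                if (strcmp(peers[i].uuid, uuid) == 0) {
                        peers[i].last_ping = time(NULL);
                        peers[i].alive = 1;
                        return;
                }
        }
        if (npeers < MAX_PEERS) {
                snprintf(peers[npeers].uuid, sizeof(peers[npeers].uuid),
                         "%s", uuid);
                peers[npeers].type = type;
                peers[npeers].last_ping = time(NULL);
                peers[npeers].alive = 1;
                npeers++;
        }
}

/* Periodic sweep: anyone silent for MISSED_PINGS_ALLOWED intervals is
 * declared dead; a broadcast to the surviving peers would follow. */
void mgs_sweep(void)
{
        time_t now = time(NULL);
        int i;

        for (i = 0; i < npeers; i++) {
                int interval = peers[i].type == PEER_CLIENT ?
                               CLIENT_PING_INTERVAL : SERVER_PING_INTERVAL;

                if (peers[i].alive &&
                    now - peers[i].last_ping > MISSED_PINGS_ALLOWED * interval) {
                        peers[i].alive = 0;
                        printf("peer %s is dead, broadcasting notification\n",
                               peers[i].uuid);
                }
        }
}

int main(void)
{
        mgs_record_ping("ost0001_UUID", PEER_TARGET);
        mgs_record_ping("client-a1b2_UUID", PEER_CLIENT);
        mgs_sweep();            /* nothing is dead yet */
        return 0;
}
```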
Components involved:
- Remove the ping evictor.
- Disentangle the pinger from recovery, and turn it into a simple rain-or-shine periodic RPC.
- Add a broadcast message to the MGS, perhaps using a common-reader lock callback similar to the current log update system.
- Add a notification mechanism from the MGC to the affected OBDs (maybe using name/UUID lookup and obd_notify); see the sketch after this list.
- Add a /proc file on the MGS showing the current status of all clients and targets.
- Handle interoperability issues with the current pinger.
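One way the MGC-to-OBD notification might be wired up is sketched below. The event names, the obd_device fields, and the lookup-by-UUID step are all hypothetical stand-ins, loosely modeled on the obd_notify() idea mentioned above rather than copied from the real Lustre headers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of MGC -> OBD notification; not real Lustre code. */

enum obd_notify_event {
        OBD_NOTIFY_PEER_DEAD,    /* MGS says a peer stopped pinging */
        OBD_NOTIFY_PEER_ALIVE,   /* MGS says a peer is back */
};

struct obd_device {
        char  obd_name[64];
        char  peer_uuid[64];     /* UUID of the import's peer (assumed field) */
        int (*obd_notify)(struct obd_device *obd, enum obd_notify_event ev);
};

/* Example per-OBD handler: a client OSC might trigger failover here. */
static int osc_notify(struct obd_device *obd, enum obd_notify_event ev)
{
        if (ev == OBD_NOTIFY_PEER_DEAD)
                printf("%s: peer %s dead, switching to failover partner\n",
                       obd->obd_name, obd->peer_uuid);
        return 0;
}

/* MGC side: after receiving the MGS broadcast, look up the affected OBDs
 * by peer UUID and deliver the event to each of them. */
static void mgc_process_broadcast(struct obd_device *obds, int nobds,
                                  const char *dead_uuid)
{
        int i;

        for (i = 0; i < nobds; i++)
                if (strcmp(obds[i].peer_uuid, dead_uuid) == 0)
                        obds[i].obd_notify(&obds[i], OBD_NOTIFY_PEER_DEAD);
}

int main(void)
{
        struct obd_device obds[1] = {
                { "lustre-OST0000-osc", "ost0000_UUID", osc_notify },
        };

        /* Pretend the MGS just told us that ost0000 stopped pinging. */
        mgc_process_broadcast(obds, 1, "ost0000_UUID");
        return 0;
}
```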
We could go further down the scalability road by having, for example, targets collect the identities of the clients that have pinged them recently and report those clients to the MGS inside the target->MGS ping. The MGS would then sort through the reports and figure out who has not talked to anyone. Clients would only need to ping the MGS if they had not talked to any target in a while. This would require additional plumbing and development time, and I think it probably isn't worth it.
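If that further step were ever pursued, the target-side piggybacking might look roughly like the sketch below, where a target accumulates the UUIDs of clients it has serviced since its last MGS ping and ships them in the ping body. All structures and names here are invented for illustration; this is not an existing Lustre interface.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical piggyback payload for the target -> MGS ping; illustration only. */
#define MAX_SEEN_CLIENTS 1024

struct target_ping_body {
        int  tpb_nclients;                        /* clients seen since last ping */
        char tpb_client_uuids[MAX_SEEN_CLIENTS][40];
};

static struct target_ping_body pending;

/* Called on every client RPC the target services. */
void target_note_client(const char *client_uuid)
{
        int i;

        for (i = 0; i < pending.tpb_nclients; i++)
                if (strcmp(pending.tpb_client_uuids[i], client_uuid) == 0)
                        return;                    /* already recorded */
        if (pending.tpb_nclients < MAX_SEEN_CLIENTS)
                snprintf(pending.tpb_client_uuids[pending.tpb_nclients++],
                         40, "%s", client_uuid);
}

/* Called when the target's periodic MGS ping is built: attach the list and
 * reset it, so the MGS can credit those clients without hearing from them. */
void target_fill_ping(struct target_ping_body *body)
{
        *body = pending;
        memset(&pending, 0, sizeof(pending));
}
```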