
Architecture - Adaptive Timeouts - Use Cases


Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Terminology

Adaptive Timeout (AT): Network RPC timeouts based on server and network loading.
Early reply: A response sent immediately by the server informing the client of a predicted slower-than-expected final response.
Service estimate: The expected worst-case time a request on a given portal to a given service will take. This value changes depending on server loading.

Architecture

A. Report service times: Replies are annotated with the RPC's service time and the service estimate. The service estimate is updated on every 'success' reply. Clients use the measured round-trip time and the reported service time to determine the round-trip network latency, and use the service estimate and network latency to set the timeout for future RPCs (see the sketch below).
B. Early replies: Servers compare the timeout encoded in the RPC with the current service estimate and send early replies to all queued RPCs that they expect to time out before being serviced, reporting the new service estimate. Clients receiving early replies adjust the RPC's local timeout to reflect the new service estimate. This process is repeated whenever an RPC is near its deadline.
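
As an illustration only, the client-side bookkeeping described in A and B might look roughly like the following sketch. The struct, field names, units and the exact formula are assumptions made here for clarity, not the actual Lustre implementation.

  /* Illustrative sketch of client-side AT bookkeeping; names and the
   * formula are assumptions, not actual Lustre code. */
  struct at_state {
      int service_estimate;   /* server's worst-case service time (s) */
      int net_latency;        /* observed round-trip network latency (s) */
  };

  /* Called on every successful reply; early replies similarly report a
   * new service estimate for the in-flight RPC. */
  static void at_update(struct at_state *at, int round_trip,
                        int svc_time, int svc_estimate)
  {
      int latency = round_trip - svc_time;  /* time not spent in service */

      if (latency > at->net_latency)
          at->net_latency = latency;
      at->service_estimate = svc_estimate;
  }

  /* Timeout for future RPCs: expected worst-case service time plus the
   * observed network latency. */
  static int at_timeout(const struct at_state *at)
  {
      return at->service_estimate + at->net_latency;
  }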

Use cases

Summary

id | quality attribute | summary
congested_server_new | recovery | server congestion causes client timeouts to adapt, not fail (new RPC)
congested_server_pending | recovery | server congestion causes client timeouts to adapt, not fail (pending or in-progress RPCs)
timeouts_recover | performance | timeouts decrease when load decreases
new_clients_learn | performance | new clients learn about server/network timing immediately
lock_timeouts | recovery | lock timeouts are based only on lock processing history, per target
busy_client_not_evicted | recovery | a responding client is not evicted for failing to return a lock quickly
server_down_client_start | performance | client starts while server is down
liblustre_client_joins_late | recovery | a liblustre client computes for 20 min, then discovers the server has rebooted
client_collection_timeout | recovery | heavily loaded server fails over; clients already have a long AT and so don't try to reconnect for a long time
replay_timeout | recovery | clients replaying lost requests after server failover must wait for the server's recovery client collection phase to complete before they see responses
communications_failure | availability, performance | Lustre handling of communications failures
redundant_router_failure | availability, performance | Lustre handling of redundant router failure

server_down_client_start

Scenario: New client starts or restarts while a server is down or unresponsive
Business Goals: Maximize performance
Relevant QA's: Performance
Stimulus: New client tries to connect to an unresponsive server
Stimulus source: client (re)boot
Environment: server or network failures
Artifact: Client connect time
Response: After the obd_connect RPC request timeout, a new connect is attempted, on either the same network or a different network.
Response measure: Time to successfully connect once the server/network becomes available.
Questions: None
Issues: We want obd_connect attempts on different connections (networks) to happen quickly, so that alternate routes are tried when one network fails, but we want attempts on the same network to happen at LND-timeout speed (e.g. 50s), so that we don't just pile up PUTs that are stuck. The problem is that only a single RPC timeout can be specified, and at that point it is not known which connection the next attempt will use (that is decided in import_select_connection). One possibility: if we are at the end of the imp_conn_list, set the connect RPC timeout to the slow value (i.e. max(50s, adaptive timeout)); otherwise set it to the fast value (the adaptive timeout).
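
A minimal sketch of that selection policy follows. The type, fields and helper names are invented here for illustration; only the decision logic reflects the proposal above.

  /* Hypothetical sketch of the connect-timeout policy; import_state and
   * its fields are stand-ins, not Lustre structures. */
  #include <stdbool.h>

  #define LND_TIMEOUT 50          /* seconds; example LND-level timeout */

  struct import_state {
      int  at_estimate;           /* current adaptive timeout estimate (s) */
      bool on_last_connection;    /* true when at the end of imp_conn_list */
  };

  static int connect_rpc_timeout(const struct import_state *imp)
  {
      /* Last alternative: wait at LND speed so stuck PUTs can drain. */
      if (imp->on_last_connection)
          return imp->at_estimate > LND_TIMEOUT ?
                 imp->at_estimate : LND_TIMEOUT;

      /* More connections left to try: fail over quickly. */
      return imp->at_estimate;
  }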


liblustre_client_joins_late

Scenario: A liblustre client computes for 20 min, then discovers the server has rebooted.
Business Goals: Minimize evictions
Relevant QA's: Recovery
Stimulus: Liblustre does not use the pinger to verify server availability; it attempts reconnection only when the application tries to write.
Stimulus source: Actively computing client
Environment: server failures
Artifact: recoverable state in the cluster
Response: eviction of client determined by version recovery
Response measure: Availability
Questions: None
Issues: Version recovery makes it possible to rejoin after the recovery period under some conditions; this is unrelated to AT.


client_collection_timeout

Scenario: Heavily loaded server fails over; clients have long AT already, and so don't try to reconnect for a long time.
Business Goals: Minimize evictions
Relevant QA's: Recovery
Stimulus: Client with a long AT waits a long time to try to reconnect to a rebooting server. The server has no idea how long to wait for the (first) client during the recovery client collection phase.
Stimulus source: Client reconnect attempt
Environment: slow server / network conditions, then server failover
Artifact: recoverable state in the cluster
Response: eviction of client(s) if client collection timeout is too short
Response measure: Availability
Questions: None
Issues: Because a newly rebooted server has no idea of previous network/server load, a fixed timeout must be used when waiting for the first client to reconnect during the client collection phase. Once the first client has reconnected, the server can keep track of the maximum expected AT as reported by the client in the connect RPC. This information can then be used to adjust how much longer the server will wait for client collection to complete.
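
One possible shape of the server-side bookkeeping described above (the fixed initial window, field names and helpers are illustrative assumptions, not the actual recovery code):

  /* Sketch of the recovery client collection window described above. */
  #define INITIAL_COLLECTION_TIMEOUT 300   /* fixed wait for the first client (s) */

  struct recovery_state {
      int collection_timeout;     /* how long to keep waiting for stragglers (s) */
  };

  static void recovery_start(struct recovery_state *rs)
  {
      /* A newly rebooted server knows nothing about prior load, so it
       * must start with a fixed timeout. */
      rs->collection_timeout = INITIAL_COLLECTION_TIMEOUT;
  }

  /* Called for each reconnecting client; client_at is the maximum
   * expected AT the client reported in its connect RPC. */
  static void recovery_client_connected(struct recovery_state *rs, int client_at)
  {
      /* Stretch the collection window if clients report long ATs. */
      if (client_at > rs->collection_timeout)
          rs->collection_timeout = client_at;
  }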


replay_timeout

Scenario: Clients replaying lost requests after server failover must wait for the server's recovery client collection phase to complete before they see responses.
Business Goals: Minimize recovery time
Relevant QA's: Performance, Recovery
Stimulus: Client replay request
Stimulus source: Server failover
Environment: server failure
Artifact: replay request timeout
Response: When a client tries to reconnect after failover, the replay request timeout is set to the AT expected processing time plus the fixed client collection timeout (sketched at the end of this use case).
Response measure: Performance
Questions: Is it possible for version recovery to start recovery of certain files before the client collection period has finished?
Issues: We could have c1 reconnect, then a recovery period elapse, then c2, then another recovery period, then c3; in that case c1 would time out and resend the replay request until either c3 joins or recovery is aborted because c3 never showed up.

It would of course be desirable for c1 and c2 to be able to start recovery without waiting a long time for c3 to arrive, to avoid wasting all that time.

Maybe version-based recovery can help us in this case, so that c1 and c2 know the files they are operating on have the same pre-op version and it is safe for them to begin recovery without waiting for c3. Then either c1 and c2 will have completed recovery and wait recov_period before normal recovery is closed, or c3 will join in time if there are dependencies.
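
The timeout mentioned in the Response above is simply the sum of the two terms; a trivial sketch, with the fixed window value and names assumed here for illustration:

  /* Sketch of the replay request timeout; the value of the fixed
   * collection window and the names are assumptions, not Lustre code. */
  #define CLIENT_COLLECTION_TIMEOUT 300   /* fixed recovery collection window (s) */

  static int replay_rpc_timeout(int at_processing_estimate)
  {
      /* Allow for the server sitting in its client collection phase
       * before it services replay requests. */
      return at_processing_estimate + CLIENT_COLLECTION_TIMEOUT;
  }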

communications_failure

Scenario: Lustre handling of communications failures
Business Goals: Fault isolation
Relevant QA's: Availability, performance
Stimulus: Peer node crash / hang / reboot or network failure
Stimulus source: Hardware/software failure. Sysadmin
Environment: Operation under all loads
Artifact: ptlrpc (both client and server), OST/OSC, MDC/MDS
Response: Failed RPCs are completed cleanly. The number of uncompleted RPCs to any peer is bounded.
Response measure: No effect on communications with other peers.
Questions: None
Issues:

An RPC, whether on the client or the server, is a compound communication (e.g. request, bulk, reply). Each individual communication has an associated buffer which may be active (posted with LNetPut() or LNetGet()) or passive (posted with LNetMEAttach()). An RPC is only complete when the MDs for all its successfully posted communications have been unlinked. LNET may access these buffers at any time until the "unlinked" event has been received. An RPC must be completed in all circumstances, even if it is to be abandoned because of a timeout or because a component communication failed.

Uncompleted RPCs consume resources in LNET, the relevant LND and its underlying network stack. Further RPCs may fail if the number of uncompleted RPCs is allowed to grow without limit - for example if all network resources are tied up waiting for RPCs to a failed peer, new RPCs to working, responsive peers may fail.

Calling LNetMDUnlink() on a posted MD ensures delivery of the "unlink" event at the earliest opportunity. The time to deliver this event is guaranteed to be finite, but may be determined by the underlying network stack. Note the fundamental race with normal completion: LNET handles this so that LNetMDUnlink() is safe to call at any time, and indeed any number of times; however, it returns success only on the first call, and only if the MD has not already auto-unlinked.
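
The completion rule above might be sketched roughly as follows. The structures and the request_unlink() helper are stand-ins invented for illustration; only LNetMDUnlink() is the real LNet call being modelled, and the event handling is simplified.

  /* Rough sketch of the RPC completion rule; rpc_comm and request_unlink()
   * are illustrative stand-ins, not LNet or ptlrpc structures. */
  #include <stdbool.h>

  struct rpc_comm {
      bool md_posted;   /* LNetPut()/LNetGet()/LNetMEAttach() succeeded */
      bool unlinked;    /* "unlinked" event has been received for the MD */
  };

  /* Stand-in for LNetMDUnlink() on the communication's MD; the unlink
   * event it triggers arrives asynchronously. */
  void request_unlink(struct rpc_comm *c);

  /* An RPC is complete only when every successfully posted MD has been
   * unlinked; until then LNET may still access the buffers. */
  static bool rpc_complete(const struct rpc_comm *comms, int ncomms)
  {
      for (int i = 0; i < ncomms; i++)
          if (comms[i].md_posted && !comms[i].unlinked)
              return false;
      return true;
  }

  /* On timeout or failure the RPC is abandoned, but its buffers may only
   * be freed once rpc_complete() returns true. */
  static void rpc_abandon(struct rpc_comm *comms, int ncomms)
  {
      for (int i = 0; i < ncomms; i++)
          if (comms[i].md_posted && !comms[i].unlinked)
              request_unlink(&comms[i]);
  }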

redundant_router_failure

Scenario: Lustre handling of redundant LNET router failure
Business Goals: Transparent handling of the failure
Relevant QA's: Availability, performance
Stimulus: Redundant router crashes or hangs
Stimulus source: hardware/software failure or sysadmin reboot
Environment: operation under all loads with LNET router failure detection enabled
Artifact: ptlrpc (both client and server), OST/OSC, MDC/MDS
Response: The router failure is transparent to applications using the file system, with minimal performance impact.
Response measure: No errors returned to the application. Performance impact no worse than server failover.
Questions: None
Issues:

When a router fails, many communications, either buffered in the router or already committed to be routed via it, will fail. This fails all relevant RPCs; however, further intermittent RPC failures are possible until all nodes on all paths between client and server (in both directions) have detected and avoided the failed router.

References

The parent tracking bug for Adaptive Timeouts is 3055.