Architecture - Adaptive Timeouts - Use Cases
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Terminology
Term | Definition |
---|---|
Adaptive Timeout (AT) | Network RPC timeouts based on server and network loading. |
Early reply | A response sent immediately by the server informing the client of a predicted slower-than-expected final response. |
Service estimate | The expected worst-case time a request on a given portal to a given service will take. This value changes depending on server loading. |
Architecture
A. Report service times | Replies are annotated with the RPC's service time and the current service estimate. The service estimate is updated upon every successful reply. Clients use the measured round-trip time and the reported service time to determine the round-trip network latency, and use the service estimate and network latency to set the timeout for future RPCs (see the sketch below). |
B. Early replies | Servers compare the timeout encoded in each RPC with the current service estimate and send early replies to all queued RPCs that they expect to time out before being serviced, reporting the new service estimate. Clients receiving early replies adjust the RPC's local timeout to reflect the new service estimate. This process is repeated whenever an RPC is near its deadline. |
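The following is a minimal sketch, in C, of the client-side arithmetic described in A: combining the measured round-trip time, the reported service time, and the server's service estimate to set the next RPC timeout. The names used here (struct adaptive_timeout, at_update(), at_rpc_timeout()) are illustrative assumptions, not the actual Lustre data structures.

```c
/*
 * Hypothetical sketch of the client-side bookkeeping described above.
 * Names and structures are illustrative only, not the Lustre code.
 */
#include <stdio.h>

struct adaptive_timeout {
        double service_estimate;   /* worst-case service time reported by server (s) */
        double net_latency;        /* measured round-trip network latency (s) */
};

/* Update state from a successful reply. */
static void at_update(struct adaptive_timeout *at,
                      double measured_rtt,        /* request sent -> reply received */
                      double reported_service,    /* actual service time in the reply */
                      double reported_estimate)   /* server's current service estimate */
{
        double latency = measured_rtt - reported_service;

        if (latency < 0)
                latency = 0;
        at->net_latency = latency;
        at->service_estimate = reported_estimate;
}

/* Timeout for the next RPC: expected worst-case service time plus latency. */
static double at_rpc_timeout(const struct adaptive_timeout *at)
{
        return at->service_estimate + at->net_latency;
}

int main(void)
{
        struct adaptive_timeout at = { .service_estimate = 10, .net_latency = 1 };

        /* Reply arrived after 14s; the server spent 12s servicing it and now
         * estimates 15s worst case, so future timeouts grow accordingly. */
        at_update(&at, 14.0, 12.0, 15.0);
        printf("next RPC timeout: %.1fs\n", at_rpc_timeout(&at));
        return 0;
}
```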
Use cases
Summary
id | quality attribute | summary |
---|---|---|
congested_server_new | recovery | server congestion causes client timeouts to adapt, not fail (new rpc) |
congested_server_pending | recovery | server congestion causes client timeouts to adapt, not fail (pending or in-progress rpcs) |
timeouts_recover | performance | timeouts decrease when load decreases |
new_clients_learn | performance | new clients learn about server/network timing immediately |
lock_timeouts | recovery | lock timeouts are based only on lock processing history, per target |
busy_client_not_evicted | recovery | a responding client is not evicted for failing to return a lock quickly |
server_down_client_start | performance | client starts while server is down. |
liblustre_client_joins_late | recovery | a liblustre client computes for 20 min, then discovers the server has rebooted. |
client_collection_timeout | recovery | Heavily loaded server fails over; clients already have a long AT, so they do not try to reconnect for a long time
replay_timeout | recovery | Clients replaying lost requests after server failover must wait for the server's recovery client collection phase to complete before they see responses
communications_failure | availability, performance | Lustre handling of communications failures |
redundant_router_failure | availability, performance | Lustre handling of redundant router failure |
server_down_client_start
Scenario: | New client starts or restarts while a server is down or unresponsive |
---|---|
Business Goals: | Maximize performance |
Relevant QA's: | Performance |
Stimulus: | New client tries to connect to an unresponsive server |
Stimulus source: | client (re)boot |
Environment: | server or network failures |
Artifact: | Client connect time |
Response: | After the obd_connect RPC request times out, a new connect is attempted on either the same network or a different network. |
Response measure: | Time to successfully connect once the server/network becomes available. |
Questions: | None |
Issues: | We want obd_connect attempts on different connections (networks) to happen quickly, so that alternate routes are tried when one network fails, but we want attempts on the same network to happen at LND-timeout speed (e.g. 50s), so that we do not simply pile up PUTs that are stuck. The problem is that only a single RPC timeout value can be specified, and at that point it is not known which connection the next attempt will use (this is decided in import_select_connection). One possibility, sketched below: if we are at the end of the imp_conn_list, set the connect RPC timeout to slow (i.e. max(50s, adaptive timeout)); otherwise set it to fast (adaptive timeout). |
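A hypothetical sketch of the timeout policy proposed above: use the fast (adaptive) connect timeout while alternate connections remain, and the slow timeout (the larger of the LND timeout and the adaptive timeout) on the last entry. The structure and function names are illustrative stand-ins, not the ptlrpc import code.

```c
/*
 * Illustrative sketch of the proposed connect-timeout policy; the list walk
 * and field names are simplified stand-ins, not the ptlrpc implementation.
 */
#define LND_TIMEOUT 50.0   /* example LND-level timeout, seconds */

struct conn_entry {
        struct conn_entry *next;   /* next alternate connection, NULL if last */
};

/* Pick the obd_connect RPC timeout depending on whether alternate
 * connections remain to be tried (as decided in import_select_connection). */
static double connect_rpc_timeout(const struct conn_entry *current,
                                  double adaptive_timeout)
{
        if (current->next == NULL)
                /* Last entry on the connection list: wait at LND-timeout
                 * speed so we do not pile up PUTs stuck behind a dead net. */
                return adaptive_timeout > LND_TIMEOUT ? adaptive_timeout
                                                      : LND_TIMEOUT;

        /* Alternate routes remain: fail over to them quickly. */
        return adaptive_timeout;
}
```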
liblustre_client_joins_late
Scenario: | A liblustre client computes for 20 min, then discovers the server has rebooted. |
---|---|
Business Goals: | Minimize evictions |
Relevant QA's: | Recovery |
Stimulus: | Liblustre does not use the pinger to verify server availability; it attempts reconnection only when the application tries to write. |
Stimulus source: | Actively computing client |
Environment: | server failures |
Artifact: | recoverable state in the cluster |
Response: | eviction of the client, as determined by version recovery |
Response measure: | Availability |
Questions: | None |
Issues: | Version-based recovery makes it possible to rejoin after the recovery period under some conditions; this is unrelated to AT. |
client_collection_timeout
Scenario: | Heavily loaded server fails over; clients already have a long AT, so they do not try to reconnect for a long time. |
---|---|
Business Goals: | Minimize evictions |
Relevant QA's: | Recovery |
Stimulus: | A client with a long AT waits a long time before trying to reconnect to a rebooting server. The server has no idea how long to wait for the (first) client during the recovery client collection phase. |
Stimulus source: | Client reconnect attempt |
Environment: | slow server / network conditions, then server failover |
Artifact: | recoverable state in the cluster |
Response: | eviction of client(s) if the client collection timeout is too short |
Response measure: | Availability |
Questions: | None |
Issues: | Because a newly rebooted server has no idea of previous network/server load, a fixed timeout must be used when waiting for the first client to reconnect during the client collection phase. Once the first client has reconnected, the server can keep track of the maximum expected AT as reported by each client in its connect RPC. This information can then be used to adjust how much longer the server will wait for client collection to complete (see the sketch below). |
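A minimal sketch of the server-side adjustment described above: wait a fixed time for the first client, then extend the collection window based on the maximum AT reported in connect RPCs. All names here (recovery_state, recovery_client_connected(), INITIAL_COLLECTION_TIMEOUT) are assumptions for illustration, not the Lustre recovery code.

```c
/* Illustrative only: hypothetical names, not the Lustre recovery code. */
#define INITIAL_COLLECTION_TIMEOUT 60.0  /* fixed wait for the first client (s); value is an example */

struct recovery_state {
        double collection_timeout;   /* how long to keep waiting for clients (s) */
        int    clients_connected;
};

static void recovery_start(struct recovery_state *rs)
{
        /* A freshly rebooted server knows nothing about prior load,
         * so the first client is waited for with a fixed timeout. */
        rs->collection_timeout = INITIAL_COLLECTION_TIMEOUT;
        rs->clients_connected = 0;
}

/* Called for each client that reconnects during the collection phase;
 * client_at is the maximum expected AT reported in its connect RPC. */
static void recovery_client_connected(struct recovery_state *rs, double client_at)
{
        rs->clients_connected++;
        if (client_at > rs->collection_timeout)
                rs->collection_timeout = client_at;  /* extend the wait for remaining clients */
}
```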
replay_timeout
Scenario: | Clients replaying lost requests after server failover must wait for the server's recovery client collection phase to complete before they see responses. |
---|---|
Business Goals: | Minimize recovery time |
Relevant QA's: | Performance, Recovery |
Stimulus: | Client replay request |
Stimulus source: | Server failover |
Environment: | server failure |
Artifact: | replay request timeout |
Response: | When a client tries to reconnect after failover, the replay request timeout is set to the AT expected processing time plus the fixed client collection timeout (see the sketch below). |
Response measure: | Performance |
Questions: | Is it possible for version recovery to start recovery of certain files before the client collection period has finished? |
Issues: | The sequence could be c1, recovery period, c2, recovery period, c3, in which case c1 would time out and resend its replay request until either c3 joins or recovery is aborted because c3 never showed up. It would of course be desirable for c1 and c2 to be able to start recovery without waiting a long time for c3 to arrive, to avoid wasting all that time. Version-based recovery may help in this case, so that c1 and c2 know the files they are operating on have the same pre-operation version and it is safe for them to begin recovery without waiting for c3. Then either c1 and c2 will have completed recovery and waited the recovery period before normal recovery is closed, or c3 will join in time if there are dependencies. |
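A minimal sketch of the replay timeout computation described in the Response row above; the constant value and function name are illustrative assumptions, not the ptlrpc replay code.

```c
/* Illustrative only: hypothetical names and example value. */
#define CLIENT_COLLECTION_TIMEOUT 60.0  /* fixed recovery collection window (s); value is an example */

/* Replay requests must survive the server's client collection phase, so the
 * timeout is the adaptive service estimate plus the fixed collection window. */
static double replay_req_timeout(double at_service_estimate)
{
        return at_service_estimate + CLIENT_COLLECTION_TIMEOUT;
}
```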
communications_failure
Scenario: | Lustre handling of communications failures |
---|---|
Business Goals: | Fault isolation |
Relevant QA's: | Availability, performance |
Stimulus: | Peer node crash / hang / reboot, or network failure |
Stimulus source: | Hardware/software failure; sysadmin |
Environment: | Operation under all loads |
Artifact: | ptlrpc (both client and server), OST/OSC, MDC/MDS |
Response: | Failed RPCs are completed cleanly. The number of uncompleted RPCs to any peer is bounded. |
Response measure: | No effect on communications with other peers. |
Questions: | None |

Issues:

An RPC, whether on the client or the server, is a compound communication (e.g. request, bulk, reply). Each individual communication has an associated buffer which may be active (posted with LNetPut() or LNetGet()) or passive (posted with LNetMEAttach()). An RPC is only complete when the MDs for all of its successfully posted communications have been unlinked. LNET may access these buffers at any time until the "unlinked" event has been received. RPCs must be completed in all circumstances, even if an RPC is to be abandoned because of a timeout or because a component communication failed.

Uncompleted RPCs consume resources in LNET, the relevant LND, and its underlying network stack. Further RPCs may fail if the number of uncompleted RPCs is allowed to grow without limit - for example, if all network resources are tied up waiting for RPCs to a failed peer, new RPCs to working, responsive peers may fail.

Calling LNetMDUnlink() on a posted MD ensures delivery of the "unlinked" event at the earliest opportunity. The time to deliver this event is guaranteed to be finite, but may be determined by the underlying network stack. Note the fundamental race with normal completion: LNET handles this so that LNetMDUnlink() is safe to call at any time, and indeed any number of times; however, it returns success only on the first call, and only if the MD has not already auto-unlinked. A schematic sketch of this completion rule follows.
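The sketch below models the completion rule just described: an RPC is only considered complete once the unlink event has been seen for every MD it posted, and a timed-out RPC is driven toward completion by requesting early unlink of its MDs. The types and functions here are simplified stand-ins, not the LNET or ptlrpc API.

```c
/*
 * Schematic sketch (not LNET code) of the completion rule described above.
 */
#include <stdbool.h>

enum { RPC_MAX_MDS = 3 };   /* e.g. request, bulk, reply */

struct rpc_state {
        int  mds_posted;                 /* MDs successfully posted */
        bool md_unlinked[RPC_MAX_MDS];   /* "unlinked" event seen per MD */
};

/* True only when the unlink event has arrived for every posted MD; until
 * then LNET may still touch the buffers and the RPC must not be freed. */
static bool rpc_complete(const struct rpc_state *rpc)
{
        for (int i = 0; i < rpc->mds_posted; i++)
                if (!rpc->md_unlinked[i])
                        return false;
        return true;
}

/* On timeout or communication failure, request early unlink of each MD
 * (in real code, via LNetMDUnlink(), which is safe to call at any time);
 * the RPC is still only completed once the unlink events are delivered. */
static void rpc_abandon(struct rpc_state *rpc,
                        void (*unlink_md)(int md_index))
{
        for (int i = 0; i < rpc->mds_posted; i++)
                if (!rpc->md_unlinked[i])
                        unlink_md(i);
}
```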
redundant_router_failure
Scenario: | Lustre handling of redundant LNET router failure |
---|---|
Business Goals: | Transparent handling of the failure |
Relevant QA's: | Availability, performance |
Stimulus: | Redundant router crashes or hangs |
Stimulus source: | hardware/software failure or sysadmin reboot |
Environment: | operation under all loads with LNET router failure detection enabled |
Artifact: | ptlrpc (both client and server), OST/OSC, MDC/MDS |
Response: | The router failure is transparent to applications using the file system, with minimal performance impact. |
Response measure: | No errors returned to the application. Performance impact no worse than server failover. |
Questions: | None |

Issues:

When a router fails, many communications, either buffered in the router or already committed to be routed via it, will fail. This fails all affected RPCs; however, further intermittent RPC failures are possible until all nodes on all paths between client and server have detected and routed around the failed router.
References
The parent tracking bug for Adaptive Timeouts is 3055.