Minutes for FTB conference call - 2010 Aug 25th
Attendees
- Argonne National Lab: Rinku
- Oak Ridge National Lab: Aniruddha, Hoony, Thomas, David
- Ohio State University: Raghu
- Indiana University: Abhishek
- University of Tennessee: Aurelien
- Lawrence Berkeley National Lab: Paul
Items Discussed
- ANL
- Renamed FTB-0.7 to FTB-1.0
- Rinku led the reliability discussion. Significant points/changes:
- An FTB_Disconnect/FTB_Unsubscribe message that arrives from a subscriber in the middle of the handshake between event and ack should be treated as a success, not a failure (David)
- Reliability semantic definition: The phrase "at that instant" was a little confusing, but the team did not settle on alternative wording.
- FTB API return codes: FTB_Test and other routines need special return codes to indicate whether the routine itself was unsuccessful or whether the preceding "FTB_Publish (with reliability)" routine was unsuccessful in sending the events. We also need error codes for other cases (e.g., FTB_Test called with an event that was not published reliably)
- May need to rename "reliability" to "guarantee" (David, Rinku)
- David suggested that we either continue looking at other sources for reliability or adopt a simpler, more centralized approach for FTB
- Aurelien mentioned "group membership services" with regard to agents summarizing information received from their children before forwarding it to their own parents (need to look it up)
- FTB API: FTB_Test and FTB_Wait need to be refined in some way (Hoony). We need to discuss this more.
- ORNL
- David/Aniruddha mentioned that they are working on a paper for fault tolerance in fusion applications, which will include FTB
- Hoony to start work on the MD code
- Thomas to wrap up his test scripts.
- OSU
- No significant update
- IU
- Abhishek has sent the updated document to the MPI teams. A conference call is to take place sometime this week.
- LBNL
- No significant update
- UTK
- FT HPL works. Work has started on adding FTB events to it, to report when it has been hit by a failure and how much time was lost recovering. Aurelien is considering reporting "soft benchmark errors", i.e., reporting when the benchmark is abnormally slow due to improper tuning.
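
Illustration for the ANL return-code and handshake items (not discussed on the call; a minimal sketch only). It shows one way a test routine could separate "the routine itself failed" from "the preceding reliable publish failed", report an event that was not published reliably, and count a subscriber that disconnected or unsubscribed mid-handshake as a success. Every name below (the sk_ prefix, the enum values, the struct layout) is hypothetical and is not part of the FTB API.

```c
#include <stddef.h>

/* Hypothetical result codes for a test routine; not FTB identifiers. */
typedef enum {
    SK_TEST_DELIVERED = 0,   /* test call worked; all subscribers accounted for */
    SK_TEST_PENDING,         /* test call worked; some acks still outstanding */
    SK_TEST_PUBLISH_FAILED,  /* test call worked; the earlier reliable publish failed */
    SK_TEST_NOT_RELIABLE,    /* event was not published with reliability */
    SK_TEST_ERROR            /* the test call itself failed (bad arguments) */
} sk_test_code;

/* Hypothetical per-subscriber handshake state. */
typedef enum {
    SK_SUB_WAITING,          /* event sent, ack not yet received */
    SK_SUB_ACKED,            /* ack received */
    SK_SUB_DISCONNECTED,     /* subscriber disconnected before acking */
    SK_SUB_UNSUBSCRIBED,     /* subscriber unsubscribed before acking */
    SK_SUB_FAILED            /* delivery to this subscriber failed */
} sk_sub_state;

/* Hypothetical record of one published event. */
typedef struct {
    int published_reliably;      /* nonzero if reliability was requested */
    const sk_sub_state *subs;    /* one state per subscriber */
    int nsubs;                   /* number of entries in subs */
} sk_event;

/* Classify the outcome of a previously published event. */
sk_test_code sk_test(const sk_event *ev)
{
    int pending = 0;
    int i;

    /* Failure of the routine itself, kept distinct from publish failure. */
    if (ev == NULL || ev->nsubs < 0 || (ev->nsubs > 0 && ev->subs == NULL))
        return SK_TEST_ERROR;

    /* Testing an event that was never published reliably. */
    if (!ev->published_reliably)
        return SK_TEST_NOT_RELIABLE;

    for (i = 0; i < ev->nsubs; i++) {
        switch (ev->subs[i]) {
        case SK_SUB_FAILED:
            return SK_TEST_PUBLISH_FAILED;
        case SK_SUB_WAITING:
            pending = 1;
            break;
        case SK_SUB_ACKED:
        case SK_SUB_DISCONNECTED:   /* mid-handshake disconnect counts as success */
        case SK_SUB_UNSUBSCRIBED:   /* mid-handshake unsubscribe counts as success */
            break;
        }
    }
    return pending ? SK_TEST_PENDING : SK_TEST_DELIVERED;
}
```

In this sketch a caller checks for SK_TEST_ERROR and SK_TEST_NOT_RELIABLE first, and only then interprets the remaining codes as the status of the earlier publish. Whether FTB_Test and FTB_Wait should share one code set is left open, as on the call.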