Minutes for FTB conference call - 2010 Aug 25th

From CIFTS

Attendees

  • Argonne National Lab: Rinku
  • Oak Ridge National Lab: Aniruddha, Hoony, Thomas, David
  • Ohio State University: Raghu
  • Indiana University: Abhishek
  • University of Tennessee: Aurelien
  • Lawrence Berkeley National Lab: Paul

Items Discussed

  • ANL
  1. Renamed FTB-0.7 to FTB-1.0
  2. Rinku led the reliability discussion. Significant points/changes:
    1. An FTB_Disconnect/FTB_Unsubscribe message from a subscriber in the middle of the handshake between event and ack should be treated as a success, not a failure (David)
    2. Reliability semantic definition: the phrase "at that instant" was a little confusing, but the team did not settle on an alternative wording.
    3. FTB API return codes: FTB_Test and other routines need distinct return codes to indicate whether the routine itself was unsuccessful or whether the preceding "FTB_Publish (with reliability)" routine was unsuccessful in sending the events. We also need error codes for other cases (e.g., FTB_Test called on an event that was not published reliably)
    4. May need to rename "reliability" to "guarantee" (David, Rinku)
    5. David suggested that we either continue looking at other sources for reliability or adopt a more simplified/centralized approach for FTB
    6. Aurelien mentioned "group membership services" with regard to agents summarizing information received from their children before forwarding it to their own parents (need to look it up)
    7. FTB API : FTB_Test and FTB_Wait need to be refined in some way (Hoony). We need to discuss this more.
  • ORNL
  1. David/Aniruddha mentioned that they are working on a paper for fault tolerance in fusion applications, which will include FTB
  2. Hoony to start work on the MD code
  3. Thomas to wrap up his test scripts.
  • OSU
  1. No significant update
  • IU
  1. Abhishek has sent the updated document to the MPI teams. A conference call is to take place sometime this week.
  • LBNL
  1. No significant update
  • UTK
  1. FT HPL works. Work has started on adding FTB events to it, to report when it has been hit by a failure and how much time was lost recovering. Aurelien is considering reporting "soft benchmark errors", i.e., reporting when the benchmark is abnormally slow due to improper tuning.
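
Illustration for the ANL return-code item above: a minimal C sketch of how a test routine could separate "the routine itself failed" from "the preceding reliable publish failed" and from "the event was not published reliably". Every name here (the ex_ prefix, the enum values, the struct fields) is hypothetical and chosen only for illustration; none of it is taken from the FTB API, and the team did not agree on any such design during the call.

```c
#include <stddef.h>

/* Hypothetical status codes; these are NOT FTB identifiers. */
typedef enum {
    EX_OK = 0,              /* routine ran; the event has been acknowledged    */
    EX_PENDING,             /* routine ran; acknowledgement not yet received   */
    EX_ERR_ROUTINE,         /* the test routine itself failed (bad handle)     */
    EX_ERR_PUBLISH_FAILED,  /* the earlier reliable publish could not deliver  */
    EX_ERR_NOT_RELIABLE     /* the event was not published with reliability    */
} ex_status;

/* Hypothetical per-event bookkeeping, assumed to be filled in at publish time. */
typedef struct {
    int valid;            /* nonzero if the handle refers to a published event */
    int reliable;         /* nonzero if reliability was requested              */
    int delivery_failed;  /* nonzero if the publish is known to have failed    */
    int acked;            /* nonzero once an acknowledgement has arrived       */
} ex_event;

/* Checks are ordered so that each failure cause maps to exactly one code. */
ex_status ex_test(const ex_event *ev)
{
    if (ev == NULL || !ev->valid)
        return EX_ERR_ROUTINE;
    if (!ev->reliable)
        return EX_ERR_NOT_RELIABLE;
    if (ev->delivery_failed)
        return EX_ERR_PUBLISH_FAILED;
    return ev->acked ? EX_OK : EX_PENDING;
}
```

With codes separated this way, a caller could retry on EX_PENDING, republish on EX_ERR_PUBLISH_FAILED, and treat EX_ERR_ROUTINE or EX_ERR_NOT_RELIABLE as a usage error.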