Online performance modeling and analysis of message-passing parallel

  1. MORAJKO, OLEG
Supervised by:
  1. Tomás Margalef (Supervisor)
  2. Josep Jorba Esteve (Supervisor)

Defense university: Universitat Autònoma de Barcelona

Defense date: 17 July 2008

Committee:
  1. Emilio Luque Fadón (Chair)
  2. Rosa M Badia (Secretary)
  3. Casiano Rodríguez León (Member)
  4. José C. Cunha (Member)
  5. Felix Wolf (Member)

Type: Thesis

Teseo: 176142 DIALNET

Abstract

Although hardware is improving at an incredible rate, advances in parallel software have been hampered for many reasons. Developing an efficient parallel application is still not an easy task. Applications rarely achieve good performance immediately, so careful performance analysis and optimization are crucial. These tasks are difficult to perform and require a thorough understanding of the program's behavior. Moreover, several challenges significantly complicate the performance diagnosis of parallel applications. Our thesis is that many performance problems and their causes can be quickly located and explained with automated techniques that work on [...]. In our approach, the application is automatically modeled and diagnosed during its execution. First, we introduce an online performance modeling technique that enables automated discovery of causal execution flows through communication and computational activities in message-passing parallel programs. The cornerstone of this technique is the ability to reflect the application behavior in a compact model, built by following the flow of control and intercepting communication between tasks at runtime. The model is composed of high-level application structures, such as loops and communication operations, and characterizes them with statistical execution profiles. It facilitates understanding of high-level program behavior and enables an assortment of online diagnosis techniques. Our technique can be deployed on a wide range of unmodified MPI applications with acceptable overhead and scales to thousands of processors. Second, we present a systematic approach to online performance analysis. The automated analysis uses the online model to quickly identify the most important performance problems and correlate them with the application source code. Our technique is able to discover causal dependencies between the problems, infer their root causes in some scenarios, and explain them to developers.
In this work, we focus on diagnosing scientific MPI parallel applications and their communication and computational problems, although the approach can be extended to support other classes of activities and programming models. We have evaluated our approach on a variety of scientific parallel applications. In all scenarios, our online performance modeling techniques proved effective for capturing program behavior with low overhead and facilitated performance understanding. With our automated, model-based performance analysis approach, we were able to easily identify the most severe performance problems during application execution and locate their root causes without prior knowledge of application internals.
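
The abstract describes a compact model whose elements (loops, communication operations) carry statistical execution profiles. The following is only a minimal sketch of that idea, not the thesis implementation: the names `Profile` and `TaskModel` are hypothetical, the events come from a synthetic trace rather than from intercepted MPI calls, and the statistics kept (count, total, longest, mean) are an assumed choice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Profile:
    """Running statistics for one activity (hypothetical structure)."""
    count: int = 0
    total: float = 0.0
    longest: float = 0.0

    def add(self, duration: float) -> None:
        self.count += 1
        self.total += duration
        self.longest = max(self.longest, duration)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


class TaskModel:
    """Per-task model: nodes are activities, edges are observed transitions."""

    def __init__(self) -> None:
        self.nodes = defaultdict(Profile)  # activity name -> Profile
        self.edges = defaultdict(int)      # (previous, current) -> times taken
        self._previous = None

    def record(self, activity: str, duration: float) -> None:
        # Fold the event into the node's profile instead of storing it.
        self.nodes[activity].add(duration)
        if self._previous is not None:
            self.edges[(self._previous, activity)] += 1
        self._previous = activity

    def worst(self) -> str:
        """Activity with the largest accumulated time."""
        return max(self.nodes, key=lambda name: self.nodes[name].total)


# Synthetic trace of one task running a 3-iteration loop (illustrative numbers).
model = TaskModel()
for _ in range(3):
    model.record("compute", 0.5)
    model.record("MPI_Send", 0.25)
    model.record("MPI_Recv", 1.0)

print(len(model.nodes), model.edges[("MPI_Recv", "compute")], model.worst())
```

Because each event is folded into a profile, the model size in this sketch depends on the number of distinct activities and transitions, not on the number of loop iterations; the back edge from `MPI_Recv` to `compute` is what reveals the loop.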