Research

The four research areas of our work are:

Templates

Through engagement with DOE collaborations from a wide array of science areas, we will identify and develop templates, and design interfaces for them, that capture the common workflow patterns (e.g., sequence, data-parallel, parallel-split, and synchronization) identified through previous work and our current collaborations. The research challenge is in the definition of the templates and of the interfaces for using them. The end goal is a template library that allows a scientist to script their analysis at a high level by describing the tasks and their sequencing. The end-user interface will be simple libraries or modules that users can call from the language of their choice (e.g., Python, C, or shell scripts) to manage data-intensive applications. Our initial choice for this research will be to develop the analysis templates in Python, a language commonly used today for scripting data analysis pipelines.
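As a sketch of what such a template library might look like in Python (the sequence and parallel_map helpers below are illustrative assumptions, not the Tigres interface), a scientist's script would describe the tasks and their ordering while the library handles the execution pattern:

    from concurrent.futures import ThreadPoolExecutor

    def extract(path):
        # Stand-in for a real data-extraction task.
        return ["rec-%d" % i for i in range(8)]

    def analyze(record):
        # Stand-in for a per-record analysis task.
        return record.upper()

    def merge(results):
        # Synchronization step: combine the parallel results.
        return ", ".join(results)

    def parallel_map(task, items):
        # Data-parallel template: apply one task across many inputs.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(task, items))

    def sequence(steps, value):
        # Sequence template: feed each step's output to the next.
        for step in steps:
            value = step(value)
        return value

    # The scientist's high-level script: the tasks plus their sequencing.
    report = sequence(
        [extract, lambda recs: parallel_map(analyze, recs), merge],
        "events.dat")
    print(report)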

Execution

The templates will be designed to scale to large data volumes and large numbers of parallel executions. This makes a centralized staging engine and file-based communication between the tasks of an analysis workflow difficult, if not impossible, to use. A decentralized execution mechanism will be needed to minimize bottlenecks and increase fault tolerance. The research challenge will be to develop a hybrid execution model that allows in-line hand-off of execution between steps, together with a template-level execution engine that monitors progress and manages hand-offs requiring coordination. Joins and fault-tolerance strategies for failed tasks are likely to be the hardest parts of this mechanism. In the majority of cases, decentralized execution is expected to significantly improve scalability and efficiency. Additionally, in-memory transfer of data between analyses is expected to yield large performance gains (and energy reductions), since the latency gap between disk and memory continues to grow by orders of magnitude.
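A minimal sketch of this hybrid model, assuming thread-based branches and illustrative names (Join, run_branch): each step hands its in-memory output directly to the next step, and only the synchronization point involves the template-level engine:

    import threading

    class Join:
        # Template-level synchronization point: collects in-memory
        # hand-offs from each branch and releases once all arrive.
        def __init__(self, expected):
            self.expected = expected
            self.results = []
            self.lock = threading.Lock()
            self.done = threading.Event()

        def hand_off(self, value):
            with self.lock:
                self.results.append(value)
                if len(self.results) == self.expected:
                    self.done.set()

        def wait(self):
            self.done.wait()
            return self.results

    def run_branch(steps, value, join):
        # In-line hand-off: each step passes its output directly to
        # the next in memory; no files, no central scheduler.
        for step in steps:
            value = step(value)
        join.hand_off(value)

    # Two independent branches that synchronize at a single join.
    join = Join(expected=2)
    branches = [[lambda x: x + 1, lambda x: x * 2],  # branch A
                [lambda x: x - 3]]                   # branch B
    for steps in branches:
        threading.Thread(target=run_branch, args=(steps, 10, join)).start()
    print(sorted(join.wait()))  # prints [7, 22]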

Provenance

Data provenance enables users to track the lineage of their data products and computations. The templates have the potential to significantly ease the collection of provenance and computational state information. While current provenance tools provide methods to capture, store, and query provenance, they often expect users to activate the mechanisms and to either know or adhere to an exact schema. The Tigres library will collect provenance information automatically as an artifact of template use, and it will also allow users to define and capture user-defined provenance data. What provenance to collect is well defined in existing models. However, how to let applications easily handle fault information is an open research question for workflow systems, although there are many existing models from programming languages that can be tried.
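One way this could look (the record_provenance and annotate helpers are assumptions for illustration, not the Tigres interface): tasks run through a template are wrapped so that lineage is logged as a side effect of execution, with a hook for user-defined fields:

    import functools
    import time

    PROVENANCE = []  # a real library would use a durable store

    def record_provenance(task):
        # Capture lineage automatically whenever the task runs: the
        # user never activates anything or learns a schema.
        @functools.wraps(task)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = task(*args, **kwargs)
            PROVENANCE.append({"task": task.__name__,
                               "inputs": (args, kwargs),
                               "output": result,
                               "start": start,
                               "end": time.time()})
            return result
        return wrapper

    def annotate(**fields):
        # User-defined provenance: attach extra fields to the most
        # recent record.
        PROVENANCE[-1].update(fields)

    @record_provenance
    def calibrate(raw):
        return raw * 0.98

    calibrate(5.0)
    annotate(instrument="detector-1")  # hypothetical user annotation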

Fault-Tolerance

Although faults in HPC systems are rare, they do occur. Fault-detection and recovery capabilities will make the templates robust and resilient to system and application faults. The state and provenance capture described above will provide the foundation for detecting the failures and errors that occur. The research challenge for the Tigres library will be automated, application-level handling of these errors. Interfaces that let users easily specify automated recovery actions will need to be developed.
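A sketch of one possible interface, assuming a hypothetical retry_policy decorator: the user declares the recovery behavior once, and the library applies it automatically whenever a task fails:

    import functools
    import time

    def retry_policy(retries=3, delay=0.5, on_failure=None):
        # Hypothetical recovery interface: re-run a failed task a few
        # times with a growing back-off, then fall back to an optional
        # user-supplied handler instead of crashing the workflow.
        def decorate(task):
            @functools.wraps(task)
            def wrapper(*args, **kwargs):
                for attempt in range(retries):
                    try:
                        return task(*args, **kwargs)
                    except Exception as err:
                        last_err = err
                        time.sleep(delay * (attempt + 1))
                if on_failure is not None:
                    return on_failure(last_err, *args, **kwargs)
                raise last_err
            return wrapper
        return decorate

    attempts = {"n": 0}

    @retry_policy(retries=3, delay=0.1)
    def flaky_read():
        # Fails twice, then succeeds, to exercise the retries.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise IOError("transient failure")
        return "data"

    print(flaky_read())  # prints "data" after two automatic retries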
