Software has become a key enabler and integrator for modern systems. Understanding the physical mechanics of software fault propagation is difficult for a general class of systems. Without this knowledge, software fails frequently and the system fails as a result. In this project, we studied techniques, patterns, and architectural frameworks to make software-intensive systems more resilient. We accepted that software is going to fail and developed techniques for comparing different designs for resiliency. We also studied the tradeoff between redundancy and runtime reconfiguration. Finally, we designed tools that map distributed application configuration models to reliability block diagrams and use the redundancy information to compute resilience metrics for comparing alternative deployments. More information and the tools are available here.
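As a rough illustration of how redundancy information in a reliability block diagram can yield a number for comparing deployments, the sketch below uses the textbook series and parallel formulas with independent failures. It is not the project's tool: the function names, the two deployments, and every reliability value are hypothetical.

```python
from typing import Sequence


def series(reliabilities: Sequence[float]) -> float:
    """Series blocks: the chain works only if every block works."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(reliabilities: Sequence[float]) -> float:
    """Redundant blocks: the group fails only if every replica fails."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail


# Hypothetical deployment A: the middle service runs as two redundant replicas.
deployment_a = series([0.99, parallel([0.95, 0.95]), 0.98])

# Hypothetical deployment B: the same chain with a single replica.
deployment_b = series([0.99, 0.95, 0.98])

print(f"A: {deployment_a:.4f}  B: {deployment_b:.4f}")
```

Under these assumed values the redundant deployment scores higher, which is the kind of comparison a resilience metric is meant to support. The independence assumption is a simplification; correlated software faults would need a different model.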
Acknowledgements: This project was sponsored by the Air Force Research Laboratory.
In this project we designed and implemented a Secure Information Architecture for the DARPA System F6 program.
The information architecture platform we developed is a layered stack containing a novel real-time operating
system, middleware and a component layer. This work further enabled Distributed Real-time
Embedded Managed Systems (DREMS), a special class of distributed embedded computing
systems that are remotely controlled and managed but operate in and are integrated into
a local physical environment. The complete software platform and a model-driven software
development toolchain that can be used to design, implement, and operate DREMS can be
White Paper | Demonstration Scenario | Demonstration Video
Acknowledgements: This work was supported by the DARPA System F6 Program under contract NNA11AC08C through NASA ARC.
Software complexity has grown to the point that it is necessary to apply the concepts and techniques of system health management. Today’s advanced software development technologies, especially model-based software development techniques, are excellent candidates for bringing together model-based fault diagnosis and software engineering. The project developed model-based techniques for software health management, with a special focus on the use of models at run time. This work is based on previous efforts in the area of model-based software development tools, as well as model-based fault diagnostics. During this project we developed an emulator for ARINC-653, the state-of-the-art standard for implementing
Integrated Modular Avionics in the aerospace domain. This emulator was then
extended to build a component middleware and software health management framework for
ARINC-653 systems. This approach resulted in a novel two-level
health management architecture that can be applied in the context of a model-based software
development process. The emulator and the design environment can be downloaded from
Acknowledgements: This project was sponsored by the NASA Aviation Safety Program.
Large computing clusters used for scientific processing suffer from systemic failures when they execute workflows continuously over long periods. Diagnosing job problems and the faults that lead to eventual failures in this complex environment is difficult, especially when a single job failure can affect the success of an entire workflow. In this project, we developed a model-based hierarchical reflex and healing framework that was used to monitor and execute workflows reliably.
As participants belonging to a workflow were mapped onto machines and executed, periodic and on-demand monitoring of vital health parameters on the allocated nodes was enabled according to pre-specified rules. These rules specified conditions that must hold before, during, and after execution. Monitoring information for each participant was propagated upwards through the reflex and healing architecture, which consisted of a hierarchical network of decentralized fault management entities called reflex engines. They were instantiated as state machines or timed automata that change state and initiate reflexive mitigation actions when certain faults occur.
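The reflex engine idea described above can be sketched as a small state machine that checks rules against monitored health parameters, tries a local mitigation first, and reports upwards when the fault persists. This is a minimal sketch and not the framework's actual design: the class, the state names, the retry policy, and the `disk_free_gb` metric are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]


class State(Enum):
    NOMINAL = auto()
    MITIGATING = auto()
    FAILED = auto()


@dataclass
class ReflexEngine:
    name: str
    # Rule name -> predicate over metrics; True means the condition holds.
    rules: Dict[str, Callable[[Metrics], bool]]
    parent: Optional["ReflexEngine"] = None
    max_retries: int = 1  # assumed policy: one local attempt, then escalate
    state: State = State.NOMINAL
    retries: int = 0
    log: List[str] = field(default_factory=list)

    def observe(self, metrics: Metrics) -> None:
        """Evaluate all rules and change state if any rule is violated."""
        violated = [n for n, holds in self.rules.items() if not holds(metrics)]
        if not violated:
            self.state, self.retries = State.NOMINAL, 0
        elif self.retries < self.max_retries:
            self.retries += 1
            self.state = State.MITIGATING
            self.log.append(f"{self.name}: mitigating {violated}")
        else:
            self.state = State.FAILED
            self.log.append(f"{self.name}: escalating {violated}")
            if self.parent is not None:
                self.parent.notify(self.name, violated)

    def notify(self, child: str, violated: List[str]) -> None:
        """Receive a fault report propagated up from a child engine."""
        self.log.append(f"{self.name}: child {child} reported {violated}")


# Hypothetical two-level hierarchy: a node-level engine under a regional one.
regional = ReflexEngine("regional", rules={})
node = ReflexEngine(
    "node-1",
    rules={"disk_free": lambda m: m["disk_free_gb"] > 5.0},
    parent=regional,
)
node.observe({"disk_free_gb": 2.0})  # first violation: local mitigation
node.observe({"disk_free_gb": 2.0})  # still violated: escalate to parent
print(node.state, regional.log)
```

A timed automaton would add clocks and deadlines to the transitions; this sketch leaves timing out to keep the state logic visible.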
Acknowledgements: This project was funded by a grant from the Department of Energy under the SciDAC-II program.
In this project, we developed techniques based on Bayesian regression and queuing models for automatically learning the structure
of multi-tier enterprise applications online. These models allowed us to identify the
bottleneck resource inside an application server, predict the average response time and predict
the power consumption of each physical machine in the enterprise.
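To make the queuing-model part concrete, the sketch below treats each resource as an independent M/M/1 queue in a tandem chain. That is a textbook simplification chosen for illustration, not the learned models from the project, and the Bayesian structure learning is not shown. The function name, tier names, and all rates are hypothetical.

```python
from typing import Dict, Tuple


def analyze(arrival_rate: float, service_rates: Dict[str, float]) -> Tuple[str, float]:
    """Return (bottleneck resource, mean end-to-end response time in seconds).

    Assumes every request visits each resource once and that each resource
    behaves as an M/M/1 queue with the given service rate (requests/second).
    """
    utilization = {name: arrival_rate / mu for name, mu in service_rates.items()}
    # The bottleneck is the resource with the highest utilization.
    bottleneck = max(utilization, key=utilization.get)
    if utilization[bottleneck] >= 1.0:
        raise ValueError(f"{bottleneck} is saturated; the queue is unstable")
    # Mean M/M/1 response time per resource is 1 / (mu - lambda).
    response_time = sum(1.0 / (mu - arrival_rate) for mu in service_rates.values())
    return bottleneck, response_time


# Hypothetical three-resource application at 100 requests/second.
tiers = {"web": 200.0, "app_cpu": 120.0, "database": 150.0}
bottleneck, response_time = analyze(100.0, tiers)
print(bottleneck, round(response_time, 3))
```

With these assumed rates the application server CPU has the highest utilization, so it dominates the predicted response time. Raising its service rate would reduce the total more than the same change to any other tier.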
Acknowledgements: This project was funded in part by a grant from the National Science Foundation.