January 18 – 23 , 2015, Dagstuhl Seminar 15041
Model-driven Algorithms and Architectures for Self-Aware Computing Systems
Self-aware computing systems are best understood as a subclass of autonomic computing systems. The term "autonomic computing" was first introduced by IBM in 2001. Expressing a concern that the ever-growing size and complexity of IT systems would soon become too difficult for human administrators to manage, IBM proposed a biologically inspired solution. An analogy was drawn between the autonomic nervous system, which continually adjusts the heart and respiratory rates, pupil dilation, and other lower-level biological functions in response to conscious decisions made by individuals, and autonomic computing systems, which are envisioned as managing themselves in accordance with high-level objectives from humans.
To enlist the academic community in a worldwide effort to meet this grand challenge, Kephart and Chess laid out a vision of autonomic computing in an IEEE Computer article in 2003 [1]. The article postulated a multi-agent architecture for autonomic computing systems consisting of interacting software agents (called autonomic elements) that consume computational resources and deliver services to humans and to other autonomic elements, and used that architecture as a structure against which a diverse set of research challenges was defined. One of the major challenges from a scientific perspective was the definition of appropriate abstractions and models for understanding, controlling, and designing emergent behavior in autonomic systems. Many different components of IT systems could be autonomic elements: database management systems, load balancers, provisioning systems, anomaly detection systems, etc. In addition to managing their own behavior in accordance with policies established by humans or other autonomic elements, they also manage their relationships with other autonomic elements.
The self-managing properties of autonomic computing systems, including self-optimization, self-configuration, self-healing and self-protection, are expected to arise not just from the intrinsic self-managing capabilities of the individual elements, but even more so from the interactions among those elements, in a manner akin to the social intelligence of ant colonies. Understanding the mapping from local behavior to global behavior, as well as the inverse relationship, was identified as a key condition for controlling and designing autonomic systems. One proposed approach was the coupling of advanced search and optimization techniques with parameterized models of the local-to-global relationship and the likely set of environmental influences to which the system will be subjected.
In the ensuing decade, there has been much research activity in the field of autonomic computing. At least 8000 papers have been written on the topic, and explicit solicitations for papers on autonomic computing can be found in the calls for papers of at least 200 conferences and workshops annually, including the International Conference on Autonomic Computing (ICAC), now in its tenth year. The European Union has funded autonomic computing research projects worth several million euros via the FP6 and FP7 programs, and the US government has funded research in this field as well.
In a retrospective keynote at ICAC 2011, Kephart assessed the state of the field, finding through bibliometric analysis that progress in the field has been good but uneven [2]. While there has been a strong emphasis on self-optimization in its many forms, there have been considerably fewer works on other key autonomic properties such as self-configuration, self-protection and self-healing. An apparent reason for this imbalance is that benchmarks that quantify these properties and allow them to be compared across different systems and methods are still largely lacking. Another finding was that much work remains to be done at the system level. In particular, while there has been considerable success in using machine learning and feedback control techniques to create adaptive autonomic elements, few authors have successfully built autonomic computing systems containing a variety of interacting adaptive elements. Several authors have observed that interactions among multiple machine learners or feedback loops can produce interesting, unanticipated, and sometimes destructive emergent behaviors; such phenomena are well known in the multi-agent systems realm as well, but insufficiently understood from a theoretical and practical perspective.
It is worth noting that there is a substantial sub-community within the autonomic computing field that applies feedback control to computing systems. FeBID (Feedback Control Implementation and Design in Computing Systems and Networks), a key workshop in this space, began in 2006 as a forum for describing advances in the application of control theory to computing systems and networks. In 2012, FeBID acquired a new name (Feedback Computing) to reflect a much broader and more colloquial interpretation of "feedback", in which the goals are no longer merely set points, and system models are not merely used to help transform or transduce signals, but may themselves be adapted through learning. The evolution of this sub-community of autonomic computing reflects a growing acceptance of the idea that, for an autonomic computing element or system to manage itself competently, it needs to exploit (and often learn) models of how actions it might take would affect its own state and the state of the part of the world with which it interacts.
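To make the classical set-point interpretation of feedback concrete, here is a minimal Python sketch of an integral controller that steers the utilization of a hypothetical service toward a target by adjusting its number of workers. The plant model (`utilization`), the gain, the per-worker rate, and all names are illustrative assumptions and are not taken from FeBID or any cited work.

```python
def utilization(demand: float, workers: float, rate_per_worker: float = 100.0) -> float:
    """Hypothetical plant model: fraction of the total capacity in use."""
    return demand / (workers * rate_per_worker)


def integral_control(demand: float, target: float = 0.6, gain: float = 5.0,
                     workers: float = 1.0, steps: int = 50):
    """Repeatedly measure utilization and correct the worker count.

    The correction is proportional to the error (measured minus target),
    and the corrections accumulate in `workers`, which is what makes this
    an integral controller.
    """
    history = []
    for _ in range(steps):
        error = utilization(demand, workers) - target
        # Too busy (positive error) adds workers; too idle removes them.
        workers = max(0.1, workers + gain * error)
        history.append(workers)
    return workers, history


if __name__ == "__main__":
    final_workers, _ = integral_control(demand=300.0)
    print(round(final_workers, 3), round(utilization(300.0, final_workers), 3))
```

In this sketch the goal is a fixed set point and the model is fixed as well. The broader interpretation described above would additionally allow the goal to be richer than a set point and the plant model itself to be learned at runtime.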
Self-Aware Computing Systems
To understand how self-aware computing systems fit within the broader context of autonomic and feedback computing, we started with the following definition [4,5] at the beginning of the seminar:
Definition 1. A computing system is considered to be “self-aware” if it possesses, and/or is able to acquire at runtime, the following three properties, ideally to an increasing degree the longer the system is in operation:
- Self-reflective: Aware of its software architecture, execution environment, and hardware infrastructure on which it is running as well as of its operational goals (e.g., quality-of-service requirements, cost- and energy-efficiency targets),
- Self-predictive: Able to predict the effect of dynamic changes (e.g., changing service workloads) as well as predict the effect of possible adaptation actions (e.g., changing system configuration, adding/removing resources),
- Self-adaptive: Proactively adapting as the environment evolves in order to ensure that its operational goals are continuously met.
The three properties in the above definition are not binary, and different systems may satisfy them to different degrees; however, in order to speak of "self-awareness", all three properties must apply to the considered system.
To realize the vision of "self-aware" computing systems, as defined above, we advocated a holistic model-based approach where systems are designed from the ground up with built-in self-reflective and self-predictive capabilities, encapsulated in the form of online system architecture models. The latter are assumed to capture the relevant influences (with respect to the system's operational goals) of the system's software architecture, its configuration, its usage profile, and its execution environment (e.g., physical hardware, virtualization, and middleware). The online models are also assumed to explicitly capture the system's operational goals and policies (e.g., quality-of-service requirements, service level agreements, efficiency targets) as well as the system's adaptation space, adaptation strategies and processes.
Figure 1 presents our vision of a self-aware system adaptation loop based on the MAPE-K control loop [3] in combination with the online system architecture models used to guide the system adaptation at runtime. In the following, we briefly describe the four phases of the adaptation loop.
Phase 1 (Observe/Reflect): In this phase, the managed system is observed and monitoring data is collected and used to extract, refine, calibrate, and continuously update the online system models, reflecting the relevant influences that need to be captured in order to realize the self-predictive property with respect to the system's operational goals. In the context of this phase, expertise from software engineering, systems modeling and analysis, as well as machine learning, is required for the automatic extraction, refinement and calibration of the online models based on observations of the system at runtime.
Phase 2 (Detect/Predict): In this phase, the monitoring data and online models are used to analyze the current state of the system in order to detect or predict problems such as SLA violations, inefficient resource usage, system failures, network attacks, and so on. Workload forecasting combined with performance prediction and anomaly detection techniques can be used to predict the impact of changes in the environment (e.g., varying system workloads) and anticipate problems before they have actually occurred. In the context of this phase, expertise from systems modeling, simulation, and analysis, as well as autonomic computing and artificial intelligence, is required to detect and predict problems at different time scales during operation.
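As a rough illustration of how workload forecasting and performance prediction might be combined in this phase, the following Python sketch forecasts the next arrival rate with simple exponential smoothing and feeds it into an M/M/1 queueing formula to anticipate an SLA violation. The choice of these two techniques, the function names, and all numbers are assumptions made for illustration; the seminar does not prescribe them.

```python
import math
from typing import Sequence, Tuple


def forecast_next(observations: Sequence[float], alpha: float = 0.5) -> float:
    """One-step-ahead forecast using simple exponential smoothing."""
    level = observations[0]
    for value in observations[1:]:
        level = alpha * value + (1.0 - alpha) * level
    return level


def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time of an M/M/1 queue; infinite if the queue is unstable."""
    if arrival_rate >= service_rate:
        return math.inf
    return 1.0 / (service_rate - arrival_rate)


def predict_sla_violation(observations: Sequence[float], service_rate: float,
                          sla_seconds: float, alpha: float = 0.5) -> Tuple[bool, float, float]:
    """Return (violation expected?, forecast arrival rate, predicted response time)."""
    rate = forecast_next(observations, alpha)
    predicted = mm1_response_time(rate, service_rate)
    return predicted > sla_seconds, rate, predicted


if __name__ == "__main__":
    # Hypothetical monitoring data: requests per second over four intervals.
    print(predict_sla_violation([60.0, 70.0, 80.0, 90.0], service_rate=100.0, sla_seconds=0.05))
```

Simple exponential smoothing lags behind a steady trend, so a real system would likely use a forecasting method chosen to match the observed workload pattern.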
Phase 3 (Plan/Decide): In this phase, the online system models are used to find an adequate solution to a detected or predicted problem by adapting the system at runtime. Two steps are executed iteratively in this phase: i) generation of an adaptation plan, and ii) prediction of the adaptation effects. In the first step, a candidate adaptation plan is generated based on the online models that capture the system adaptation strategies, taking into account the urgency of the problem that needs to be resolved. In the second step, the effects of the considered possible adaptation plan are predicted, again by means of the online system architecture models. The two steps are repeated until an adequate adaptation plan is found that would successfully resolve the detected or predicted problem. In the context of this phase, expertise from systems modeling, simulation, and analysis, as well as autonomic computing, artificial intelligence, and data center resource management, is required to implement predictable adaptation processes.
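The two iterated steps described above (generate a candidate plan, then predict its effect) can be sketched in Python as follows. The plan representation (a number of servers), the predictor (an M/M/1 queue per server with the load split evenly), and every name here are hypothetical; they stand in for whatever online models a real system would use.

```python
import math
from typing import Callable, Iterable, Optional, TypeVar

Plan = TypeVar("Plan")


def plan_adaptation(candidates: Iterable[Plan],
                    predict: Callable[[Plan], float],
                    is_adequate: Callable[[float], bool]) -> Optional[Plan]:
    """Return the first candidate whose predicted effect is adequate, else None."""
    for plan in candidates:          # step i: generate a candidate plan
        outcome = predict(plan)      # step ii: predict its effect with a model
        if is_adequate(outcome):
            return plan
    return None


def predicted_response_time(servers: int, arrival_rate: float = 250.0,
                            service_rate: float = 100.0) -> float:
    """Hypothetical predictor: one M/M/1 queue per server, load split evenly."""
    per_server = arrival_rate / servers
    if per_server >= service_rate:
        return math.inf
    return 1.0 / (service_rate - per_server)


if __name__ == "__main__":
    # Candidates are ordered from cheapest to most expensive.
    chosen = plan_adaptation(range(1, 11), predicted_response_time,
                             lambda seconds: seconds <= 0.05)
    print(chosen)
```

Ordering the candidates by cost means the first adequate plan is also the cheapest one under the model. An urgent problem could be handled by capping how many candidates are examined.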
Phase 4 (Act/Adapt): In this phase, the selected adaptation plan is applied on the real system at runtime. The actuators provided by the system are used to execute the individual adaptation actions captured in the adaptation plan. In the context of this phase, expertise from data center resource management (virtualization, cluster, grid and cloud computing), distributed systems, and autonomic computing, is required to execute adaptation processes in an efficient and timely manner.
Broader Notion of Self-aware Computing
As a result of the working group "Defining Self-aware Computing Systems", a broader notion of self-aware computing was formulated:
Definition 2. Self-aware computing systems are computing systems that:
- learn models capturing knowledge about themselves and their environment (such as their structure, design, state, possible actions, and run-time behavior) on an ongoing basis, and
- reason using the models (for example, predict, analyze, consider, plan), enabling them to act based on their knowledge and reasoning (for example, explore, explain, report, suggest, self-adapt, or impact their environment).
For a detailed discussion of the interpretation of this definition, we refer the reader to Section 4.1.
The envisioned novel algorithms and architectures for self-aware computing systems are of high relevance to the real-world problems faced by software developers and practitioners in the IT industry. Even though many of the specific problems have been researched within the aforementioned disciplines and communities, we believed the time was right to adopt a broader, integrated, and interdisciplinary approach and to exploit synergies in the existing modeling and management approaches. The demand and the urgency for providing practical model-driven solutions to the described problems have never been higher, for the following reasons:
Large-scale, on-demand infrastructure: Although the cloud computing concept has been around for a long time, it was not until the last few years that cloud computing platforms became widely available and widely adopted. Such platforms provide infrastructure on demand to business-critical applications and high-performance computing workloads. Such highly dynamic, demand-driven environments make many existing automation schemes in computing systems inadequate, because they are mostly rule-based or heuristics-driven and cannot self-adapt to changes in both the infrastructure and the workloads.
Applications and workloads: The ever-increasing variety and complexity of modern applications and their workloads are placing more stress on computing systems and making many traditional management approaches obsolete. This is exacerbated by the extensive use of mobile devices and applications by an increasing population that produces new usage patterns and resource requirements.
Sensors and data: The numbers and types of sensors deployed in computing systems have never been greater, which leads to an explosion of runtime monitoring data that accurately capture the operating conditions of systems and software. Such data significantly enhance the chances for computing systems to Observe/Reflect (Phase 1) and to extract/refine/calibrate online system models that were difficult to learn otherwise, making a model-driven approach more feasible and reliable.
Need for automation: The IT industry is calling ever more loudly for automation technologies to help deal with the above challenges. Automation also helps reduce manual labor costs in management and administration and addresses the widening gap between the number of skilled IT professionals and industrial demand. A growing number of startup companies aim to develop automation solutions for capacity planning, provisioning and deployment, service level assurance, anomaly detection, failure/performance diagnosis, high availability, disaster recovery, and security enforcement. More research on model-driven algorithms and architectures for self-aware computing can feed directly into this new wave of innovation.
Organization of the Seminar
Inspired by the vision described above and the approach towards its realization, we believed that the design of self-aware computing systems calls for an integrated interdisciplinary approach building on results from multiple areas of computer science and engineering including: i) software and systems engineering; ii) systems modeling, simulation and analysis; iii) autonomic and organic computing, machine learning and artificial intelligence; iv) data center resource management including virtualization, cluster, grid and cloud computing. This was the motivation for the research seminar. The list of invitees was carefully composed to provide a balance among these fields, including both theoretical and applied research, with participation from both academia and industry. We note that, in reality, each of the four mentioned communities is composed of multiple separate sub-communities, although they have some overlap in their membership. While they can be seen as separate research communities, we consider them related in terms of their goals, with the difference being mostly in the specific focus of each sub-community and the employed scientific methods. The final participants of the seminar included representatives from each sub-community so that the different relevant focus areas and scientific methodologies were covered.
Achievements of the Seminar
This seminar has achieved its original goal of bringing together scientists, researchers, and practitioners from four different communities, including Software Engineering, Modeling and Analysis, Autonomic Computing, and Resource Management, in a balanced manner. The seminar program provided a basis for exchange of ideas and experiences from these different communities, offered a forum for deliberation and collaboration, and helped identify the technical challenges and open questions around self-aware computing systems. In summary, its achievements are mainly in the following two areas.
Identification of Synergies and Research Questions
By bringing together researchers from the above research fields and their respective communities, we sought to avoid duplication of effort and to exploit synergies between related research efforts.
During the seminar, we identified the following research questions and challenges that are of common interest to multiple communities:
- Design of abstractions for modeling quality-of-service (QoS) relevant aspects of systems and services deployed in dynamic virtualized environments. The abstractions should make it possible to capture information at different levels of detail and granularity, so that the individual layers of the system architecture and execution environment, context dependencies, and dynamic system parameters can be modeled explicitly.
- Automatic model extraction, maintenance, refinement, and calibration during operation. Models should be tightly coupled with the system components they represent while at the same time they should abstract information in a platform-neutral manner.
- Efficient resolution of context dependencies including dependencies between the service deployment context and input parameters passed upon invocation, on the one hand, and resource demands, invoked third-party services, and control flow of underlying software components, on the other hand.
- Automatic generation of predictive models on-the-fly for online QoS prediction. The models should be tailored to answering specific online QoS queries. The model type, level of abstraction and granularity, as well as the model solution technique, should be determined based on: i) the type of the query (e.g., metrics that must be predicted, size of the relevant parts of the system), ii) the required accuracy of the results, iii) the time constraints, iv) the amount of information available about the system components and services involved.
- Efficient heuristics exploiting the online QoS prediction techniques for dynamic system adaptation and utility-based optimization.
- Novel techniques for self-aware QoS management guaranteeing service-level agreements (SLAs) while maximizing resource efficiency or minimizing energy cost.
- Standard metrics and benchmarking methodologies for quantifying the QoS- and efficiency-related aspects (e.g., platform elasticity) of systems running on virtualized infrastructures.
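The item above on generating predictive models tailored to a specific online QoS query names required accuracy and time constraints as selection criteria. The Python sketch below shows one possible way to choose a model solution technique from those two criteria. The catalog of techniques, their error and solve-time figures, and the selection rule are purely hypothetical placeholders, not results reported at the seminar.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Technique:
    name: str
    expected_error: float   # hypothetical relative prediction error
    solve_seconds: float    # hypothetical time needed to solve the model


@dataclass(frozen=True)
class Query:
    max_error: float            # required accuracy of the results
    time_budget_seconds: float  # time constraint for answering the query


def select_technique(query: Query, techniques: Sequence[Technique]) -> Optional[Technique]:
    """Pick the fastest technique that is accurate enough within the budget.

    If nothing within the budget is accurate enough, fall back to the most
    accurate technique that still fits the budget. Return None if nothing fits.
    """
    feasible = [t for t in techniques if t.solve_seconds <= query.time_budget_seconds]
    if not feasible:
        return None
    accurate = [t for t in feasible if t.expected_error <= query.max_error]
    if accurate:
        return min(accurate, key=lambda t: t.solve_seconds)
    return min(feasible, key=lambda t: t.expected_error)


# Hypothetical catalog; the figures are placeholders for illustration only.
CATALOG = (
    Technique("bounds analysis", expected_error=0.30, solve_seconds=0.01),
    Technique("analytical queueing model", expected_error=0.10, solve_seconds=0.5),
    Technique("simulation", expected_error=0.02, solve_seconds=60.0),
)

if __name__ == "__main__":
    print(select_technique(Query(max_error=0.15, time_budget_seconds=5.0), CATALOG))
```

A fuller version would also take the type of query and the amount of information available about the involved components into account, as the list item states.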
The above research questions and challenges were considered in the context of our holistic model-based approach and the self-aware system adaptation loop presented in the previous section. Answering these questions can help determine what system aspects should be modeled, how they should be modeled, how model instances should be constructed and maintained at runtime, and how they should be leveraged for online QoS prediction and proactive self-adaptation.
The online system models play a central role in implementing the four phases of the described system adaptation loop. The term "model" in this context is understood in a broad sense since models can be used to capture a range of different system aspects and modeling techniques of different type and nature can be employed (e.g., an analytical queueing model for online performance prediction, a machine learning model for managing resource allocations, a statistical regression model capturing the relationship between two different system parameters, a descriptive model defining an adaptation policy applied under certain conditions). At the seminar, we advocated a model-based approach that does not prescribe specific types of models to be employed; instead, we used the term "online system models" to refer to all information and knowledge about the system available for use at runtime as part of the system adaptation loop. This includes both descriptive and predictive models.
Descriptive models describe a certain aspect of the system such as the system's operational goals and policies (quality-of-service requirements and resource efficiency targets), the system's software architecture and hardware infrastructure, or the system's adaptation space and adaptation processes. Such models may, for example, be described using the Meta-Object-Facility (MOF) standard for model-driven engineering, heavily used in the software engineering community.
Predictive models are typically applied in three different contexts: i) to predict dynamic changes in the environment, e.g., varying and evolving system workloads, ii) to predict the impact of such changes on system metrics of interest, iii) to predict the impact of possible adaptation actions at runtime, e.g., application deployment and configuration changes. A range of different predictive modeling techniques have been developed in the systems modeling, simulation and analysis community, which can be used in the "detect/predict" phase of our adaptation loop, e.g., analytical or simulative stochastic performance models, workload forecasting models based on time-series analysis, reliability and availability models based on Markov chains, black-box models based on statistical regression techniques. Finally, models from the autonomic computing and machine learning communities can be used as a basis for implementing the "plan/decide" phase of our adaptation loop. Examples of such models are machine learning models based on reinforcement learning or analytical models based on control theory.
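As a small example of one model family mentioned above, an availability model based on a Markov chain, the following Python sketch computes the long-run availability of a component that alternates between an "up" and a "down" state. The discrete-time formulation, the per-step failure and repair probabilities, and the power-iteration solver are assumptions chosen to keep the example self-contained.

```python
from typing import List, Sequence


def steady_state(transition: Sequence[Sequence[float]],
                 iterations: int = 10_000, tol: float = 1e-12) -> List[float]:
    """Approximate the stationary distribution of a discrete-time Markov chain.

    Uses power iteration: start from the uniform distribution and multiply by
    the row-stochastic transition matrix until the distribution stops changing.
    """
    n = len(transition)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist


def availability(p_fail: float, p_repair: float) -> float:
    """Long-run fraction of time spent in the 'up' state (state 0)."""
    chain = [
        [1.0 - p_fail, p_fail],      # up   -> up / down
        [p_repair, 1.0 - p_repair],  # down -> up / down
    ]
    return steady_state(chain)[0]


if __name__ == "__main__":
    # Hypothetical per-step probabilities of failure and of completed repair.
    print(availability(p_fail=0.01, p_repair=0.2))
```

For this two-state chain the result can be checked against the closed form p_repair / (p_fail + p_repair).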
Two important goals of the seminar were to discuss the applicability of the various types of models mentioned above in the context of self-aware computing systems, and to evaluate the tradeoffs in the use of different modeling techniques and how these techniques can be effectively combined and tailored to the specific scenario. As discussed above, in each phase of the self-aware adaptation loop, multiple modeling techniques can be employed. Depending on the characteristics of the specific scenario, different techniques provide different tradeoffs between modeling accuracy and overhead. Approaches to leverage these tradeoffs at runtime in order to provide increased flexibility were discussed and analyzed.
Finally, the practical feasibility and associated costs of developing system architecture models were also extensively discussed. We also identified a major target of future research in the area of self-aware computing, which is to automate the construction of online system models and to defer as much as possible of the model building process to system runtime (e.g., the selection of a suitable model to use in a given online scenario, the derivation of adequate model structure by dynamically composing existing template models of the involved system components and layers, the parameterization of the model, and finally, the iterative validation and calibration of the model). Such an approach has the potential not only to help reduce the costs of building system architecture models, but also to bring models closer to the real systems and applications by composing and calibrating them at runtime based on monitoring of the real observed system behavior in the target production environment when executing real-life operational workloads.
Impact on the Research Community
By bringing together the aforementioned four communities, the research seminar allowed for cross-fertilization between research in the respective areas. It has raised awareness of the relevant research efforts in the respective research communities as well as of existing synergies that can be exploited to advance the state of the art in the field of self-aware computing systems. The seminar has led to this Dagstuhl Report, which provides an up-to-date point of reference to the related work, currently active researchers, and open research challenges in this new field. Given that a significant proportion of the participants were from industry, the seminar also fostered the transfer of knowledge and experiences in the respective areas between industry and academia.
In addition to producing this joint report, we also found enough support and interest among the seminar participants to continue the collaboration through the following venues: i) writing a joint book to be published by Springer with chapter contributions from the seminar participants, ii) establishing a new annual workshop on self-aware computing to provide a forum for exchanging ideas and experiences in the areas targeted by the seminar.
Overall, the seminar opened up new and exciting research opportunities in each of the related research areas contributing to the emergence of a new research area at their intersection.
- [1] Jeffrey O. Kephart and David M. Chess, “The Vision of Autonomic Computing”, in IEEE Computer, 36(1):41–50, 2003. DOI: 10.1109/MC.2003.1160055
- [2] Jeffrey O. Kephart, “Autonomic Computing: The First Decade”, in Proc. of the 8th ACM Int’l Conf. on Autonomic Computing (ICAC’11), pp. 1–2, ACM, 2011. DOI: 10.1145/1998582.1998584
- [3] IBM Corporation, “An Architectural Blueprint for Autonomic Computing”, IBM White Paper, 4th Edition, 2006.
- [4] Samuel Kounev, “Engineering of Self-Aware IT Systems and Services: State-of-the-Art and Research Challenges”, in Proc. of the 8th European Performance Engineering Workshop (EPEW’11), LNCS, Vol. 6977, pp. 10–13, Springer, 2011. DOI: 10.1007/978-3-642-24749-1_2
- [5] Samuel Kounev, Fabian Brosig, and Nikolaus Huber, “Self-Aware QoS Management in Virtualized Infrastructures”, in Proc. of the 8th ACM Int’l Conf. on Autonomic Computing (ICAC’11), pp. 175–176, ACM, 2011. DOI: 10.1145/1998582.1998615
Creative Commons BY 3.0 Unported license
Jeffrey O. Kephart, Samuel Kounev, Marta Kwiatkowska, and Xiaoyun Zhu
- Artificial Intelligence / Robotics
- Modelling / Simulation
- Software Engineering
- Autonomic systems
- Systems management
- Machine learning
- Feedback-based design