Monday, November 2, 2020

Reliability and Availability Modeling in Practice 

High reliability and availability are requirements for most technical systems. This tutorial addresses reliability and availability assurance methods based on probabilistic models. Non-state-space solution methods are often used to solve models based on reliability block diagrams, fault trees and reliability graphs. Relatively efficient algorithms are known that handle systems with hundreds of components, and they have been implemented in many software packages. Nevertheless, many practical problems cannot be handled by such algorithms. In such cases bounding algorithms are used, as was done for a major subsystem of the Boeing 787. Non-state-space methods derive their efficiency from the assumption that components are independent, an assumption that is often violated in practice. State-space methods based on Markov chains, stochastic Petri nets, semi-Markov and Markov regenerative processes can be used to model various kinds of dependencies among system components. However, the resulting state-space explosion severely restricts the size of the problem that can be solved. Hierarchical and fixed-point iterative methods provide a scalable alternative that combines the strengths of state-space and non-state-space methods, and they have been used extensively to solve real-life problems. We will take a journey through these model types via interesting real-world examples that the tutorial presenter has personally worked on. Examples include the availability model of the IBM BladeCenter and the high-availability implementation of SIP (Session Initiation Protocol) on IBM WebSphere. The Cisco case study concerns an availability model of one of the company's routers, while the Sun Microsystems case study is the availability model of the company's high-availability platform. The Boeing example shows the reliability analysis of the Current Return Network subsystem that was used for the FAA certification of the Boeing 787. All the techniques and case studies are drawn from a recent book by the tutorial presenter.


Prof. Kishor Trivedi holds the Fitzgerald Hudson Chair in ECE at Duke University. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications. He has also published two other books: Performance and Reliability Analysis of Computer Systems, published by Kluwer Academic Publishers, and Queueing Networks and Markov Chains, published by John Wiley. His latest book, Reliability and Availability Engineering, was published by Cambridge University Press in 2017. He is an IEEE Life Fellow. He has published over 600 articles and has supervised 48 Ph.D. dissertations. He is the recipient of the IEEE Computer Society’s Technical Achievement Award for his research on Software Aging and Rejuvenation. His h-index is 102. He has worked closely with industry in carrying out reliability and availability analysis, in providing short courses on reliability, availability, performability and survivability, and in the development and dissemination of the software packages HARP, SHARPE and SPNP. He is a co-designer of IBM’s SAVE and Boeing’s IRAP modeling packages.
Prof. Andrea Bobbio is an honorary Professor of Computer Science at Università del Piemonte Orientale in Italy and a Senior Member of the IEEE. His academic and professional activity has been mainly in the area of reliability engineering and system reliability. He has contributed to the study of heterogeneous modeling techniques for dependable systems, ranging from non-state-space techniques and Bayesian belief networks to state-space-based techniques and fluid models. He has visited several important institutions and is the author of 200 papers in international journals, conferences and workshops. He is co-author, with Kishor Trivedi, of the book Reliability and Availability Engineering, published by Cambridge University Press in 2017.

Load balancing, redundancy, and multi-type job and server systems 

Load balancing is a fundamental problem in many application domains, such as clusters of web-server nodes, database systems, grid computing and inventory routing. An emerging approach that has received a lot of attention is redundancy, which consists of sending several copies of the same job to multiple servers. We will cover known results and present a token-based framework that subsumes existing frameworks, including redundancy systems and Order Independent queues. Under appropriate assumptions, the steady-state distribution has a product form, and performance metrics can be analyzed. We will also discuss open questions and future research directions.


Prof. Urtzi Ayesta is currently a CNRS researcher working at IRIT, Toulouse, France. He also holds an adjunct lecturer position (a part-time appointment funded by Ikerbasque) in the Computer Science Faculty at the University of the Basque Country, Spain.


AI4NETS – AI/ML for data communication Networks 

The popularity of Artificial Intelligence (AI), and of Machine Learning (ML) as an approach to AI, has increased dramatically in the last few years, due to its outstanding performance in various domains, notably in image, audio, and natural language processing. In these domains, AI success stories are boosting the applied field. When it comes to AI/ML for data communication Networks (AI4NETS), and despite the many attempts to turn networks into learning agents, the successful application of AI/ML in networking has been limited. There is strong resistance to AI/ML-based solutions, and a striking gap between the extensive academic research and the actual deployment of such AI/ML-based systems in operational environments. The truth is that there are still many unsolved, complex challenges associated with the analysis of networking data through AI/ML, which hinder its acceptance and adoption in practice.

In this tutorial, I elaborate on the most important show-stoppers in AI4NETS, and present a broad overview of the application of AI/ML to network data analysis problems, including network security, performance monitoring, anomaly detection, and Quality of Experience (QoE). The tutorial provides an introduction to the basics of ML, going from more traditional applications to newer paradigms, including Deep Learning, learning in adversarial scenarios, generative models, explainable AI, autoML, and more. The ultimate goal of the tutorial is to motivate future research in the (re-)emerging field of AI4NETS among newcomers, and to strengthen it among those already in the field.


Dr. Pedro Casas is a Senior Scientist at the AIT Austrian Institute of Technology, within the Data Science and Artificial Intelligence (DSAI) competence unit. He is an expert and lead researcher in AI4NETS (AI/ML for Networks), leading multiple national and international projects in network measurement and data analytics. Before joining AIT, he was a Senior Researcher at the FTW Telecommunications Research Center Vienna, leading research activities in network traffic monitoring and analysis. He holds a PhD in computer science from Télécom Bretagne (France), and a PhD in electrical engineering from Universidad de la República (Uruguay). He has published more than 160 networking research papers in major international conferences, journals, and workshops, has received 14 best paper, best demo, best student work, and best workshop awards, and regularly serves as chair of conferences and workshops in network measurement and analysis. His main research areas include machine-learning and data-mining-based approaches for networking, big data analytics and platforms, Internet network measurements, network security and anomaly detection, as well as QoE modeling, assessment, and monitoring.