10,16,2021

MLSys: The New Frontier of Machine Learning Systems (2019-12-01)
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
Strategies and Principles of Distributed Machine Learning on Big Data (2015-12-31)
The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics thereupon. In order to run ML algorithms at such scales, on a distributed cluster with 10s to 1000s of machines, it is often the case that significant engineering efforts are required, and one might fairly ask if such engineering truly falls within the domain of ML research or not. Taking the view that Big ML systems can benefit greatly from ML-rooted statistical and algorithmic insights, and that ML researchers should therefore not shy away from such systems design, we discuss a series of principles and strategies distilled from our recent efforts on industrial-scale ML solutions.
These principles and strategies span a continuum from application, to engineering, and to theoretical research and development of Big ML systems and architectures, with the goal of understanding how to make them efficient, generally-applicable, and supported with convergence and scaling guarantees. They concern four key questions which traditionally receive little attention in ML research: How to distribute an ML program over a cluster? How to bridge ML computation with inter-machine communication? How to perform such communication? What should be communicated between machines? By exposing underlying statistical and algorithmic characteristics unique to ML programs but not typically seen in traditional computer programs, and by dissecting successful cases to reveal how we have harnessed these principles to design and develop both high-performance distributed ML software as well as general-purpose ML frameworks, we present opportunities for ML researchers and practitioners to further shape and grow the area that lies between ML and systems.
Learning by Design: Structuring and Documenting the Human Choices in Machine Learning Development (2021-05-03)
The influence of machine learning (ML) is quickly spreading, and a number of recent technological innovations have applied ML as a central technology. However, ML development still requires a substantial amount of human expertise to be successful. The deliberation and expert judgment applied during ML development cannot be revisited or scrutinized if not properly documented, and this hinders the further adoption of ML technologies, especially in safety-critical situations. In this paper, we present a method consisting of eight design questions that outline the deliberation and normative choices going into creating an ML model.
Our method affords several benefits, such as supporting critical assessment through methodological transparency, aiding in model debugging, and anchoring model explanations by committing to a pre hoc expectation of the model's behavior. We believe that our method can help ML practitioners structure and justify their choices and assumptions when developing ML models, and that it can help bridge a gap between those inside and outside the ML field in understanding how and why ML models are designed and developed the way they are.
MLPerf Inference Benchmark (2020-05-09)
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.
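As a rough illustration of the kind of measurement an inference benchmark reports, the sketch below times a stand-in model on a fixed list of queries and summarizes throughput and tail latency. It is not MLPerf's harness or its rules: `benchmark`, `toy_model`, the warm-up count, and the nearest-rank percentile are all assumptions chosen for illustration.

```python
import math
import time


def benchmark(infer, queries, warmup=5):
    """Time `infer` on every query; return throughput and latency percentiles."""
    for q in queries[:warmup]:
        infer(q)  # warm-up calls are executed but not timed

    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        infer(q)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()

    def percentile(p):
        # Nearest-rank percentile over the sorted per-query latencies.
        rank = max(1, math.ceil(p / 100 * len(latencies)))
        return latencies[rank - 1]

    return {
        "queries": len(latencies),
        "throughput_qps": len(latencies) / elapsed,
        "p50_s": percentile(50),
        "p90_s": percentile(90),
        "p99_s": percentile(99),
    }


def toy_model(x):
    """Hypothetical stand-in for a real inference call."""
    return sum(v * v for v in x)


report = benchmark(toy_model, [[float(i)] * 100 for i in range(200)])
print(report)
```

Reporting tail percentiles alongside throughput matters because two systems with the same average speed can differ widely in worst-case response time.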
Visual Machine Learning: Insight through Eigenvectors, Chladni patterns and community detection in 2D particulate structures (2020-01-02)
Machine learning (ML) is quickly emerging as a powerful tool with diverse applications across an extremely broad spectrum of disciplines and commercial endeavors. Typically, ML is used as a black box that provides little illuminating rationalization of its output. In the current work, we aim to better understand the generic intuition underlying unsupervised ML with a focus on physical systems. The systems that are studied here as test cases comprise six different 2-dimensional (2-D) particulate systems of different complexities. It is noted that the findings of this study are generic to any unsupervised ML problem and are not restricted to materials systems alone. Three rudimentary unsupervised ML techniques are employed on the adjacency (connectivity) matrix of the six studied systems: (i) using principal eigenvalue and eigenvectors of the adjacency matrix, (ii) spectral decomposition, and (iii) a Potts model based community detection technique in which a modularity function is maximized. We demonstrate that, while solving a completely classical problem, the ML technique produces features that are distinctly connected to quantum mechanical solutions. Dissecting these features helps us to understand the deep connection between the classical non-linear world and the quantum mechanical linear world through the kaleidoscope of the ML technique, which might have far-reaching consequences both in the arena of physical sciences and ML.
Challenges and Pitfalls of Machine Learning Evaluation and Benchmarking (2019-06-25)
An increasingly complex and diverse collection of Machine Learning (ML) models as well as hardware/software stacks, collectively referred to as "ML artifacts", are being proposed, leading to a diverse landscape of ML.
These proposed ML innovations have outpaced researchers' ability to analyze, study and adapt them. This is exacerbated by the complicated and sometimes non-reproducible procedures for ML evaluation. A common practice of sharing ML artifacts is through repositories where artifact authors post ad-hoc code and some documentation, but often fail to reveal critical information for others to reproduce their results. This results in users' inability to compare with artifact authors' claims or adapt the model to their own use. This paper discusses common challenges and pitfalls of ML evaluation and benchmarking, which can be used as a guideline for ML model authors when sharing ML artifacts, and for system developers when benchmarking or designing ML systems.
Hidden Technical Debts for Fair Machine Learning in Financial Services (2021-03-18)
The recent advancements in machine learning (ML) have demonstrated the potential for providing a powerful solution to build complex prediction systems in a short time. However, in highly regulated industries, such as financial technology (Fintech), people have raised concerns about the risk of ML systems discriminating against specific protected groups or individuals. To address these concerns, researchers have introduced various mathematical fairness metrics and bias mitigation algorithms. This paper discusses hidden technical debts and challenges of building fair ML systems in a production environment for Fintech. We explore various stages that require attention for fairness in the ML system development and deployment life cycle. To identify hidden technical debts that exist in building a fair ML system for Fintech, we focus on key pipeline stages including data preparation, model development, system monitoring and integration in production.
Our analysis shows that enforcing fairness for production-ready ML systems in Fintech requires specific engineering commitments at different stages of the ML system life cycle. We also propose several starting points to mitigate these technical debts for deploying fair ML systems in production.
Towards a Robust and Trustworthy Machine Learning System Development (2021-01-08)
Machine Learning (ML) technologies have been widely adopted in many mission-critical fields, such as cyber security, autonomous vehicle control, healthcare, etc. to support intelligent decision-making. While ML has demonstrated impressive performance over conventional methods in these applications, concerns have arisen with respect to system resilience against ML-specific security attacks and privacy breaches as well as the trust that users have in these systems. In this article, firstly we present our recent systematic and comprehensive survey on the state-of-the-art ML robustness and trustworthiness technologies from a security engineering perspective, which covers all aspects of secure ML system development including threat modeling, common offensive and defensive technologies, privacy-preserving machine learning, user trust in the context of machine learning, and empirical evaluation for ML model robustness. Secondly, we go beyond a survey by describing a metamodel we created that represents the body of knowledge in a standard and visualized way for ML practitioners. We further illustrate how to leverage the metamodel to guide a systematic threat analysis and security design process in the context of generic ML system development, which extends and scales up the classic process. Thirdly, we propose future research directions motivated by our findings to advance the development of robust and trustworthy ML systems.
Our work differs from existing surveys in this area in that, to the best of our knowledge, it is the first engineering effort of its kind to (i) explore the fundamental principles and best practices to support robust and trustworthy ML system development; and (ii) study the interplay of robustness and user trust in the context of ML systems.
Declarative Machine Learning - A Classification of Basic Properties and Types (2016-05-19)
Declarative machine learning (ML) aims at the high-level specification of ML tasks or algorithms, and automatic generation of optimized execution plans from these specifications. The fundamental goal is to simplify the usage and/or development of ML algorithms, which is especially important in the context of large-scale computations. However, ML systems at different abstraction levels have emerged over time and accordingly there has been a controversy about the meaning of this general definition of declarative ML. Specification alternatives range from ML algorithms expressed in domain-specific languages (DSLs) with optimization for performance, to ML task (learning problem) specifications with optimization for performance and accuracy. We argue that these different types of declarative ML complement each other as they address different users (data scientists and end users). This paper makes an attempt to create a taxonomy for declarative ML, including a definition of essential basic properties and types of declarative ML. Along the way, we provide insights into implications of these properties. We also use this taxonomy to classify existing systems. Finally, we draw conclusions on defining appropriate benchmarks and specification languages for declarative ML.
Machine Learning for Intelligent Optical Networks: A Comprehensive Survey (2020-03-11)
With the rapid development of Internet and communication systems, both in services and technologies, communication networks have suffered from increasing complexity. It is imperative to improve intelligence in communication networks, and several aspects have been integrated with Artificial Intelligence (AI) and Machine Learning (ML). Optical networks, which play an important role in both core and access networks, also face great challenges of system complexity and the requirement of manual operations. To overcome the current limitations and address the issues of future optical networks, it is essential to deploy more intelligence capability to enable autonomous and flexible network operations. ML techniques have proven superior at solving complex problems, and thus ML techniques have recently been used for many optical network applications. In this paper, a detailed survey of existing applications of ML for intelligent optical networks is presented. The applications of ML are classified in terms of their use cases, which are categorized into optical network control and resource management, and optical network monitoring and survivability. The use cases are analyzed and compared according to the used ML techniques. Besides, a tutorial for ML applications is provided from the aspects of the introduction of common ML algorithms, paradigms of ML, and motivations of applying ML. Lastly, challenges and possible solutions of ML application in optical networks are also discussed, with the aim of inspiring future innovations in leveraging ML to build intelligent optical networks.
Petuum: A New Platform for Distributed Machine Learning on Big Data (2015-05-14)
What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, allowing ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters.
User-centric Composable Services: A New Generation of Personal Data Analytics (2017-11-26)
Machine Learning (ML) techniques, such as neural networks, are widely used in today's applications. However, there is still a big gap between the current ML systems and users' requirements. ML systems focus on improving the performance of models in training, while individual users care more about response time and expressiveness of the tool. Much existing research and many products have begun to move computation towards edge devices.
Based on the numerical computing system Owl, we propose to build the Zoo system to support construction, composition, and deployment of ML models on edge and local devices.
Insights into Performance Fitness and Error Metrics for Machine Learning (2020-05-17)
Machine learning (ML) is the field of training machines to achieve a high level of cognition and perform human-like analysis. Since ML is a data-driven approach, it seemingly fits into our daily lives and operations as well as complex and interdisciplinary fields. With the rise of commercial, open-source and user-catered ML tools, a key question often arises whenever ML is applied to explore a phenomenon or a scenario: what constitutes a good ML model? Keeping in mind that a proper answer to this question depends on a variety of factors, this work presumes that a good ML model is one that optimally performs and best describes the phenomenon at hand. From this perspective, identifying proper assessment metrics to evaluate performance of ML models is not only necessary but is also warranted. As such, this paper examines a number of the most commonly-used performance fitness and error metrics for regression and classification algorithms, with emphasis on engineering applications.
Towards the Science of Security and Privacy in Machine Learning (2016-11-11)
Advances in machine learning (ML) in recent years have enabled a dizzying array of applications such as data analytics, autonomous systems, and security diagnostics. ML is now pervasive: new systems and models are being deployed in every domain imaginable, leading to rapid and widespread deployment of software-based inference and decision making. There is growing recognition that ML exposes new vulnerabilities in software systems, yet the technical community's understanding of the nature and extent of these vulnerabilities remains limited.
We systematize recent findings on ML security and privacy, focusing on attacks identified on these systems and defenses crafted to date. We articulate a comprehensive threat model for ML, and categorize attacks and defenses within an adversarial framework. Key insights resulting from works both in the ML and security communities are identified and the effectiveness of approaches is related to structural elements of ML algorithms and the data used to train them. We conclude by formally exploring the opposing relationship between model accuracy and resilience to adversarial manipulation. Through these explorations, we show that there are (possibly unavoidable) tensions between model complexity, accuracy, and resilience that must be calibrated for the environments in which they will be used.
The Adversarial Machine Learning Conundrum: Can The Insecurity of ML Become The Achilles' Heel of Cognitive Networks? (2019-06-03)
The holy grail of networking is to create cognitive networks that organize, manage, and drive themselves. Such a vision now seems attainable thanks in large part to the progress in the field of machine learning (ML), which has now already disrupted a number of industries and revolutionized practically all fields of research. But are the ML models foolproof and robust to security attacks to be in charge of managing the network? Unfortunately, many modern ML models are easily misled by simple and easily-crafted adversarial perturbations, which does not bode well for the future of ML-based cognitive networks unless ML vulnerabilities for the cognitive networking environment are identified, addressed, and fixed. The purpose of this article is to highlight the problem of insecure ML and to sensitize the readers to the danger of adversarial ML by showing how an easily-crafted adversarial ML example can compromise the operations of the cognitive self-driving network.
In this paper, we demonstrate adversarial attacks on two simple yet representative cognitive networking applications (namely, intrusion detection and network traffic classification). We also provide some guidelines to design secure ML models for cognitive networks that are robust to adversarial attacks on the ML pipeline of cognitive networks.
Application of Machine Learning in Wireless Networks: Key Techniques and Open Issues (2019-02-28)
As a key technique for enabling artificial intelligence, machine learning (ML) is capable of solving complex problems without explicit programming. Motivated by its successful applications to many practical tasks like image recognition, both industry and the research community have advocated the applications of ML in wireless communication. This paper comprehensively surveys the recent advances of the applications of ML in wireless communication, which are classified as: resource management in the MAC layer, networking and mobility management in the network layer, and localization in the application layer. The applications in resource management further include power control, spectrum management, backhaul management, cache management, beamformer design and computation resource management, while ML-based networking focuses on the applications in clustering, base station switching control, user association and routing. Moreover, the literature in each aspect is organized according to the adopted ML techniques. In addition, several conditions for applying ML to wireless communication are identified to help readers decide whether to use ML and which kind of ML techniques to use, and traditional approaches are also summarized together with their performance comparison with ML-based approaches, based on which the motivations of the surveyed literature for adopting ML are clarified.
Given the extensiveness of the research area, challenges and unresolved issues are presented to facilitate future studies, where ML-based network slicing, infrastructure update to support ML-based paradigms, open data sets and platforms for researchers, theoretical guidance for ML implementation and so on are discussed.
Counterfactual Explanations for Machine Learning on Multivariate Time Series Data (2020-08-24)
Applying machine learning (ML) to multivariate time series data is increasingly popular in many application domains, including in computer system management. For example, recent high performance computing (HPC) research proposes a variety of ML frameworks that use system telemetry data in the form of multivariate time series so as to detect performance variations, perform intelligent scheduling or node allocation, and improve system security. Common barriers to adoption of these ML frameworks include the lack of user trust and the difficulty of debugging. These barriers need to be overcome to enable the widespread adoption of ML frameworks in production systems. To address this challenge, this paper proposes a novel explainability technique for providing counterfactual explanations for supervised ML frameworks that use multivariate time series data. The proposed method outperforms state-of-the-art explainability methods on several different ML frameworks and data sets in metrics such as faithfulness and robustness. The paper also demonstrates how the proposed method can be used to debug ML frameworks and gain a better understanding of HPC system telemetry data.
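To make the idea of a counterfactual explanation on multivariate time series concrete, here is a minimal generic sketch. It is not the method proposed in the paper, whose details the abstract does not give. It assumes a hypothetical `score` function returning the classifier's confidence in the desired class, and a "distractor" sample already in that class, and it greedily copies whole variables from the distractor until the score crosses a threshold. All names and the toy telemetry values are illustrative assumptions.

```python
def counterfactual_by_substitution(score, sample, distractor, threshold=0.5):
    """Greedily replace whole variables (rows) of `sample` with the matching
    rows of `distractor` until score(...) >= threshold.

    Returns (counterfactual or None, list of swapped variable indices)."""
    current = [list(row) for row in sample]
    remaining = set(range(len(sample)))
    swapped = []
    while score(current) < threshold and remaining:
        # Try each remaining variable; keep the swap that raises the score most.
        best_i, best_score = None, None
        for i in sorted(remaining):
            trial = [list(row) for row in current]
            trial[i] = list(distractor[i])
            s = score(trial)
            if best_score is None or s > best_score:
                best_i, best_score = i, s
        current[best_i] = list(distractor[best_i])
        remaining.remove(best_i)
        swapped.append(best_i)
    return (current if score(current) >= threshold else None), swapped


def healthy_score(series):
    """Toy classifier (assumption): a node is 'healthy' when the mean of
    variable 1 (say, memory usage) stays at or below 0.8."""
    mem = series[1]
    return 1.0 if sum(mem) / len(mem) <= 0.8 else 0.0


# Rows are variables (cpu, memory, network); columns are time steps.
anomalous = [[0.2, 0.3, 0.2, 0.3], [0.9, 0.95, 0.9, 0.92], [0.1, 0.1, 0.2, 0.1]]
healthy = [[0.25, 0.3, 0.2, 0.3], [0.4, 0.5, 0.4, 0.5], [0.1, 0.2, 0.1, 0.1]]

cf, swapped = counterfactual_by_substitution(healthy_score, anomalous, healthy)
print(swapped)
```

The list of swapped variables is the explanation: it names the few time series that, had they looked like the healthy run, would have changed the classification.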
A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments (2020-09-08)
Machine learning (ML) offers a collection of powerful approaches for detecting and modeling associations, often applied to data having a large number of features and/or complex associations. Currently, there are many tools to facilitate implementing custom ML analyses (e.g. scikit-learn). Interest is also increasing in automated ML packages, which can make it easier for non-experts to apply ML and have the potential to improve model performance. ML permeates most subfields of biomedical research with varying levels of rigor and correct usage. Tremendous opportunities offered by ML are frequently offset by the challenge of assembling comprehensive analysis pipelines, and the ease of ML misuse. In this work we have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification (i.e. case/control prediction), and applied this pipeline to both simulated and real world data. At a high level, this 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms, each with hyperparameter optimization, and e) thorough evaluation, including appropriate metrics, statistical analyses, and novel visualizations. This pipeline organizes the many subtle complexities of ML pipeline assembly to illustrate best practices to avoid bias and ensure reproducibility. Additionally, this pipeline is the first to compare established ML algorithms to 'ExSTraCS', a rule-based ML algorithm with the unique capability of interpretably modeling heterogeneous patterns of association.
While designed to be widely applicable, we apply this pipeline to an epidemiological investigation of established and newly identified risk factors for pancreatic cancer to evaluate how different sources of bias might be handled by ML algorithms.
Towards Game Design via Creative Machine Learning (GDCML) (2020-07-25)
In recent years, machine learning (ML) systems have been increasingly applied for performing creative tasks. Such creative ML approaches have seen wide use in the domains of visual art and music for applications such as image and music generation and style transfer. However, similar creative ML techniques have not been as widely adopted in the domain of game design despite the emergence of ML-based methods for generating game content. In this paper, we argue for leveraging and repurposing such creative techniques for designing content for games, referring to these as approaches for Game Design via Creative ML (GDCML). We highlight existing systems that enable GDCML and illustrate how creative ML can inform new systems via example applications and a proposed system.
Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX) (2020-09-28)
Software Engineering, as a discipline, has matured over the past 5+ decades. The modern world heavily depends on it, so the increased maturity of Software Engineering was an eventuality. Practices like testing and reliable technologies help make Software Engineering reliable enough to build industries upon. Meanwhile, Machine Learning (ML) has also grown over the past 2+ decades. ML is used more and more for research, experimentation and production workloads. ML now commonly powers widely-used products integral to our lives. But ML Engineering, as a discipline, has not matured as much as its Software Engineering ancestor.
Can we take what we have learned and help the nascent field of applied ML evolve into ML Engineering the way Programming evolved into Software Engineering [1]? In this article we will give a whirlwind tour of Sibyl [2] and TensorFlow Extended (TFX) [3], two successive end-to-end (E2E) ML platforms at Alphabet. We will share the lessons learned from over a decade of applied ML built on these platforms, explain both their similarities and their differences, and expand on the shifts (both mental and technical) that helped us on our journey. In addition, we will highlight some of the capabilities of TFX that help realize several aspects of ML Engineering. We argue that in order to unlock the gains ML can bring, organizations should advance the maturity of their ML teams by investing in robust ML infrastructure and promoting ML Engineering education. We also recommend that before focusing on cutting-edge ML modeling techniques, product leaders should invest more time in adopting interoperable ML platforms for their organizations. In closing, we will also share a glimpse into the future of TFX.