Designing Machine Learning Systems - Notes
- "Designing Machine Learning Systems" by Chip Huyen is reviewed positively by various experts as a comprehensive guide for building, deploying, and scaling ML models.
- The book emphasizes a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive, considering various components and stakeholders.
- Chip Huyen is a co-founder of Claypot AI, has worked at NVIDIA, Netflix, and Snorkel AI, and has taught at Stanford University.
- The book covers engineering data, automating model development, detecting issues, and building responsible ML systems.
- Machine learning systems are complex, consisting of different components and stakeholders, and are unique because of data dependence.
- The book presents an iterative framework with case studies and ample references, focusing on practical aspects rather than ML theories.
- Praised as essential for ML engineers, data leaders, and practitioners for its focus on first principles and principled view on end-to-end ML.
- Aims to give nuanced answers to questions about ML systems, appealing to engineers, data scientists, and technical leaders.
- The book isn't an introduction to ML; it assumes basic understanding of ML models/techniques, metrics, statistical concepts, and common ML tasks.
- Focuses on providing a framework to develop solutions regardless of the specific algorithm and includes discussions around trade-offs, pros and cons, and concrete examples.
- The book takes into account different components of the system and the objectives of different stakeholders involved.
- Mentions a GitHub repository that accompanies the book and a Discord server on MLOps for discussions about the book.
- Expresses appreciation to course staff, reviewers, readers, and the O’Reilly team for their contributions to the book.
- Chapter 1 provides an overview of Machine Learning Systems, emphasizing the importance of asking when, and when not, to use ML before starting an ML project.
- ML is defined as an approach to (1) learn (2) complex patterns from (3) existing data to (4) make predictions on (5) unseen data.
- ML can help increase profits either directly (e.g., increasing sales/conversion rates, cutting costs) or indirectly (e.g., improving customer satisfaction), but businesses first need to ask whether ML is necessary or cost-effective.
- The chapter highlights problems that ML algorithms can solve very well and those for which they shouldn't be used.
- Zero-shot learning enables ML systems to make predictions without task-specific training data, provided they were previously trained on other related tasks.
- The absence of both data and continual learning leads companies to adopt the "fake-it-til-you-make-it" approach.
- Explains how machine learning is useful in object detection and speech recognition tasks with complex patterns.
- It is discussed how ML solutions shine when a problem has these additional characteristics:
  - Repetitive patterns
  - Cost of incorrect predictions is low
  - Patterns are constantly changing, etc.
- Lists use cases for ML in enterprise, serving internal needs (reducing costs, automation) and external needs (improving customer experience).
- Discusses fraud detection, price optimization, and demand forecasting as enterprise ML applications.
- User acquisition and churn prediction are ML applications to reduce customer costs.
- Automated support ticket classification and brand monitoring are enterprise ML applications to improve customer experience.
- Highlights ML use cases in health care, specifically those that focus on accuracy and privacy requirements, where ML assists doctors in providing diagnosis.
- ML in production differs significantly from ML in research, involving different stakeholders, requirements, computational priorities, and data properties.
- Notes that production prioritizes fast inference and low latency, whereas research prioritizes fast training and high throughput.
- Defines latency (time from query to result) and throughput (queries processed per time unit), and explains the trade-offs.
- Stresses the importance of considering latency distribution using percentiles (p50, p90, p95, p99), especially for valuable users.
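The point about percentiles can be made concrete with a minimal sketch (the latency values and the nearest-rank percentile below are illustrative stand-ins, not production monitoring code): a handful of slow requests barely moves the median but dominates p99.

```python
import random

def percentile(samples, p):
    """Return the p-th percentile (0-100) of latency samples via nearest rank."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

# Simulated request latencies in milliseconds: mostly fast, a few slow outliers.
random.seed(0)
latencies = [random.gauss(100, 10) for _ in range(990)] + [500.0] * 10

for p in (50, 90, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.1f} ms")
```

The p50 stays near 100 ms while p99 surfaces the outliers, which is why high percentiles are the right target for latency requirements.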
- Discusses how data in production is messier and constantly shifting compared to clean, static research datasets, and how production systems must also consider fairness and interpretability. Mentions data systems, data formats, data movement, and data-processing engines.
- Also mentions the need for testing and versioning data.
- Highlights that ML algorithms encode the past, which can perpetuate biases present in the data.
- Notes that while many traditional SWE tools can be used to develop and deploy ML applications, many challenges are unique to ML.
- Underscores that indiscriminately accepting all available data might hurt performance and make it susceptible to data poisoning.
- Mentions increased accessibility of ML research and off-the-shelf models and the demand for ML in production.
- Acknowledges the engineering challenges of getting large models into production, especially on edge devices, and ensuring they run fast enough to be useful.
- Chapter 2 introduces Machine Learning Systems Design, emphasizing a holistic approach where business objectives are translated into ML objectives.
- To design systems that meet a specified goal, one needs to consider:
  - Adaptability
  - Scalability
  - Maintainability
  - Reliability
- The iterative process: lay out goals/constraints, engineer data, develop and evaluate models, deploy models, review the outcome, then analyze business goals to generate new insights.
- Discusses the need to frame ML problems appropriately by defining the inputs, outputs, and objective functions.
- Describes common ML tasks, including classification (binary, multiclass, multilabel) and regression, highlighting the importance of framing problems effectively.
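The three classification framings can be illustrated with a small sketch (the article-tagging classes and labels here are hypothetical, chosen only to show how the label shape differs between framings):

```python
# Hypothetical article-tagging setup illustrating the three classification framings.
CLASSES = ["tech", "finance", "sports", "politics"]

# Binary: one label with exactly two possible values (e.g., spam vs. not spam).
binary_label = 1

# Multiclass: exactly one class out of K (e.g., the article's main topic).
multiclass_label = "tech"

# Multilabel: any subset of the K classes, encoded as a multi-hot vector.
def multi_hot(tags, classes):
    """Encode a set of tags as a 0/1 vector over the class vocabulary."""
    return [1 if c in tags else 0 for c in classes]

print(multi_hot({"tech", "finance"}, CLASSES))  # [1, 1, 0, 0]
```

How a problem is framed (e.g., multilabel vs. several binary classifiers) changes the label encoding, the loss, and the model head, which is why framing matters.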
- Discusses the importance of choosing objective functions and of decoupling objectives, noting that utility is in the eye of the beholder.
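Decoupling objectives can be sketched as follows: instead of training a single model on one combined loss, train one model per objective and combine their scores at serving time, so the trade-off weights can be retuned without retraining. The items, scores, and weights below are hypothetical.

```python
def rank(items, quality_scores, engagement_scores, alpha=0.7, beta=0.3):
    """Rank items by a weighted combination of per-objective scores.

    alpha and beta trade off quality against engagement; changing them
    requires no retraining of either underlying model.
    """
    combined = {
        item: alpha * quality_scores[item] + beta * engagement_scores[item]
        for item in items
    }
    return sorted(items, key=lambda it: combined[it], reverse=True)

items = ["post_a", "post_b", "post_c"]
quality = {"post_a": 0.9, "post_b": 0.4, "post_c": 0.6}
engagement = {"post_a": 0.2, "post_b": 0.95, "post_c": 0.5}

print(rank(items, quality, engagement))            # quality-heavy weighting
print(rank(items, quality, engagement, 0.2, 0.8))  # engagement-heavy weighting
```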
- Emphasizes the fundamental role of data in ML systems and touches on the debate about the importance of data versus intelligent algorithms.
- In Chapter 3, data engineering fundamentals, data sources are discussed, emphasizing the importance of understanding data origins for efficient use.
- Explains several types of data sources, such as user input data, system-generated data, data from internal databases, and third-party data.
- Introduces various data serialization formats like JSON, CSV, Parquet, Avro, Protobuf, and Pickle, along with their characteristics (human-readability, text vs. binary format) and example use cases.
- Discusses row-major (e.g., CSV) versus column-major (e.g., Parquet) data formats and their implications for read/write operations and library efficiency (e.g., pandas).
- Explains text vs. binary formats: binary files are more compact, and the Parquet format is up to 2x faster to unload and consumes up to 6x less storage compared to text formats.
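A minimal standard-library illustration of the text-vs-binary point (JSON and pickle here stand in for the book's CSV-vs-Parquet comparison, and the records are made up; exact byte counts will vary):

```python
import json
import pickle

# The same records serialized as human-readable text (JSON) and as a
# binary format (pickle); the binary encoding is typically more compact.
records = [{"user_id": i, "clicks": i * 3, "country": "US"} for i in range(1000)]

as_json = json.dumps(records).encode("utf-8")
as_pickle = pickle.dumps(records)

print(f"JSON:   {len(as_json)} bytes (text, inspectable in any editor)")
print(f"Pickle: {len(as_pickle)} bytes (binary, opaque without a decoder)")
print(f"Round-trip intact: {pickle.loads(as_pickle) == records}")
```

The trade-off is the one the chapter names: text formats cost space but can be read and debugged by eye; binary formats save space and parsing time but need tooling.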
- Data models describe how the data in a given data format should be structured.
- Discusses two major types of data models, relational and NoSQL, which seem opposite to each other but are actually converging.
  - Data in relational models is organized into relations (tables), each of which is a set of tuples (rows).
  - Databases should be normalized to reduce data redundancy, but also optimized depending on how the data is accessed.
  - SQL is declarative: queries specify what data is wanted, not how to retrieve it.
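What "declarative" means in practice can be sketched with Python's standard-library sqlite3 module (the table and rows are hypothetical): the query states only the condition and ordering, and the engine's query planner decides how to fetch the rows.

```python
import sqlite3

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("ada", "UK"), ("grace", "US"), ("alan", "UK")],
)

# No loop, no access path: just *what* we want. The engine picks *how*.
rows = conn.execute(
    "SELECT name FROM users WHERE country = ? ORDER BY name", ("UK",)
).fetchall()
print(rows)  # [('ada',), ('alan',)]
conn.close()
```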
  - Talks about different types of NoSQL models: document models and graph models.
  - The document model centers on self-contained documents, each stored under a unique key.
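A toy sketch of the document model, using a plain Python dict as a stand-in for a document store (the keys and documents are hypothetical): each document lives under a unique key, and documents need not share a schema.

```python
# Minimal document-store sketch: unique key -> self-contained document.
store = {}

store["order:1001"] = {"user": "ada", "items": [{"sku": "A1", "qty": 2}]}
# A different document can have different fields: no schema is enforced.
store["order:1002"] = {"user": "alan", "coupon": "WELCOME"}

# Retrieval is by key; all the document's data comes back in one lookup.
print(store["order:1001"]["items"][0]["qty"])  # 2
```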
  - Talks about structured vs. unstructured data; structured data is easier to analyze.
  - Introduces the data lake as a repository for raw, unstructured data that doesn't adhere to a predefined schema.
- The majority of the chapter focuses on data storage engines and processing: transactional processing (OLTP) vs. analytical processing (OLAP).
- The chapter touches on ETL (extract, transform, load) and introduces modes of dataflow, the first being data passing through databases.
- The chapter ends by noting the importance of knowing how to collect, process, store, and retrieve data.