
Introduction to Data Science Flashcards

Introduction to Data Science

Module Director

  • Prof. Dr. Thomas Zöller teaches data science at IU International University of Applied Sciences.
  • His focus is on advanced analytics, artificial intelligence, and their role in digital transformation.
  • He studied computer science with a minor in mathematics at the University of Bonn and holds a doctorate in machine learning in image processing.
  • He has application-oriented research experience, including work at the Fraunhofer Society.
  • His professional career includes roles in business intelligence, advanced analytics, analytics strategy, and artificial intelligence, with experience in defense technology, logistics, trade, finance, and automotive.

Introduction

  • The course book contains core content; additional materials are on the learning platform.
  • The book is divided into units and sections, each focusing on one key concept for efficient learning.
  • Self-check questions are at the end of each section to help check understanding.
  • For modules with a final exam, complete the knowledge tests on the learning platform.
  • Pass the knowledge test for each unit by answering at least 80% of the questions correctly.
  • Complete course evaluation before registering for the assessment.

Basic Reading

  • Akerkar, R., & Sajja, P. S. (2016). Intelligent techniques for data science. New York, NY: Springer International Publishing.
  • Hodeghatta, U. R., & Nayak, U. (2017). Business analytics using R—A practical approach. New York, NY: Apress Publishing.
  • Runkler, T. A. (2012). Data analytics: Models and algorithms for intelligent data analysis. New York, NY: Springer.
  • Skiena, S. S. (2017). The data science design manual. New York, NY: Springer International Publishing.

Further Reading

  • Unit 1: Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90, 70—76. Horvitz, E., & Mitchell, T. (2010). From data to knowledge to action: A global enabler for the 21st century. Washington, WA: Computing Community Consortium.
  • Unit 2: Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165—1188. Cleveland, W. (2001). Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21—26.
  • Unit 3: Dorard, L. (2017). The machine learning canvas [website]. Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(1), 25—42.
  • Unit 4: Mailund, T. (2017). Beginning data science in R, 125—204. New York, NY: Apress Publishing. Efron, B., & Hastie, T. (2016). Computer age statistical inference: Algorithms, evidence, and data science. Cambridge: Cambridge University Press.
  • Unit 5: Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press. Shalev-Shwartz, S. (2015). Understanding machine learning: From theory to algorithms. Cambridge: Cambridge University Press.

Learning Objectives

  • Learn how and why data scientists extract important information from data.
  • Understand the definition of data science and its benefits when applied to various situations.
  • Learn ways of labeling different sources and outline the main activities of data science.
  • Understand the concepts of descriptive analytics and probability theory.
  • Learn how to identify a data science use case in diverse organizations and obtain the value proposition for every use case.
  • Learn how to analyze the value of the developed prediction model through evaluation metrics.
  • Study the key performance indicators (KPIs) needed to determine whether a use case's implementation in the business has been successful.
  • Learn about the quality issues that routinely degrade the data and the traditional methods used when dealing with missing values, irrelevant features, and data duplications.
  • Become aware of the different paradigms of machine learning and how a prediction model is developed.
  • Understand how the model’s outputs can be effectively presented to the relevant business managers as a complete framework of the underlying data.
  • Understand how each of the model’s parameters influences current and future performance, so that better decisions can be made and improved actions can be taken.

Unit 1 Introduction to Data Science

Study Goals:

  • Understand data science definition
  • Understand why data science is needed
  • Understand the main terms and definitions related to data science
  • Understand the role of a data scientist
  • Understand the typical activities within data science

1.1 “Data Science” Definition

  • Data Science definition:
    • Combines business, analytical, and programming skills to extract meaningful insights from raw data.
    • Focuses on ways to understand and use data for benefit.
    • Unlocks real values and insights by identifying complex behaviors, trends, and inferences.
    • Enables smarter business decisions by analyzing data from customers, sensors, or social media.
    • Can provide solutions to business problems and allows innovative analysis of longstanding issues.
  • Google Example:
    • Scanning and uploading physical copies of books published in the last 200 years.
    • Data improves search results and allows observation of language change through Google Ngrams.
    • Enables answering questions like:
      • How has language changed over time?
      • How and why do new words become popular?
  • Key Terms:
    • Deep Learning: Application of computational networks to learning tasks.
    • Artificial Intelligence: Approaches to enable computers to emulate cognitive processes through learning from data.

Why Data Science?

  • Top Five Traded Companies (2001–2016):
    • Shift from companies like General Electric and Exxon to technology and online trade companies like Apple, Alphabet, Microsoft, Amazon, and Facebook.
    • Data is the key resource and product for these top companies.
  • Implementation of data science:
    • Not only implemented in tech companies but in any organization with data to be analyzed.
    • A company can manage and analyze user data, gain insights, and extract useful information.
  • Examples:
    • Predicting a human’s age, blood pressure, and smoking status by analyzing images of their retina using deep learning.
    • Canadian government research program to predict suicide rates using data science on anonymized social media accounts.
    • Developing a tool to recognize artistic features and apply them to other images, producing images in the styles of famous painters.
    • Classifying skin cancer, the most common human malignancy, by applying machine learning techniques to clinical images.
  • Benefits of Data Science:
    • Improves decision-making, amplifies profitability, and enhances operational efficiency.
    • Recognizes and informs companies of their target audiences.
    • Assists the automated aspect of HR recruitment and improves its accuracy.
    • Optimizes transportation routes and delivery times for shipment companies.
    • Optimizes fraud detection processes for banking institutions.

1.2 Data Science's Related Fields

  • Data science is viewed as the intersection of statistics, computer science, and business management.
    • Computer science is the platform for data generation and sharing.
    • Statistics uses data and applies numerical techniques to organize and model it.
    • Both fields are employed together to maximize insights from business management data.
  • Overlapping subjects:
    • Machine learning
    • Database storage and data processing
    • Statistics
    • Neuro-computing
    • Knowledge discovery (KDD)
    • Data mining
    • Pattern recognition
    • Data visualization
  • Key Terms:
    • Data Mining: Process of discovering patterns in large datasets.
    • Business Intelligence (BI): Routines used to analyze and deliver business performance metrics; focuses on descriptive analysis of historical performance.

Data Science Terms

  • Table 2: Data Handling Terms
    • Training Set: Dataset used by machine learning model to learn a task.
    • Testing Set: Data used to measure performance of the developed machine learning model.
    • Outlier: Exceptional data record outside normal input data distribution.
    • Data Cleansing: Process of removing redundant data, handling missing entries, and addressing data quality issues.
  • Table 3: Data Features Terms
    • Feature: Observable measure of the data (e.g., height, length, width).
    • Dimensionality Reduction: Process of reducing the dataset into fewer dimensions while retaining similar information.
    • Feature Selection: Process of selecting relevant features of the provided dataset.
  • Table 4: Artificial Intelligence Terms
    • Machine Learning: Algorithms or mathematical models that use information extracted from data to achieve a desired task or function.
    • Supervised Learning: Subset of Machine Learning based on labeled data.
    • Unsupervised Learning: Subset of Machine Learning based on un-labeled data (e.g., clustering and dimensionality reduction).
    • Deep Learning: Application of networks of computational units with cascading layers of information processing used to learn through tasks.
  • Table 5: Model Development Terms
    • Decision Model: A model that assesses the relationships between the elements of provided data to recommend a possible decision for a given situation.
    • Regression: Forecasting technique to estimate the functional dependence between input and output variables.
    • Cluster Analysis: A type of unsupervised learning used to partition a set of data records into clusters.
    • Classification: A machine learning approach to categorize entities into predefined classes.
  • Table 6: Model Performance Terms
    • Probability: Quantification of how likely it is that a certain event occurs, or the degree of belief in a given proposition.
    • Standard Deviation: A measure of how spread out the data values are.
    • Type I Error: False positive output, meaning that it was actually negative but has been predicted as positive.
    • Type II Error: False negative output, meaning that it was actually positive but has been predicted as negative.

1.3 Data Science's Activities

  • Data scientist role:
    • Starts with data exploration and becomes a detective when faced with data-related questions.
    • Analyzes data to recognize patterns, applying quantitative techniques like machine learning.
    • Provides strategic support to guide business managers in decision-making.
    • Manages data science projects, stores and cleans data, explores datasets, builds predictive models, and presents findings to decision-makers.
  • Major Activities:
    • Exist simultaneously in three dimensions: data flow, data curation, and data analytics.
    • Each dimension represents challenges, solution methodologies, and numerical techniques.
  • Actions of a data scientist:
    1. Understand the problem.
    2. Collect enough data.
    3. Process the raw data.
    4. Explore the data.
    5. Analyze the data.
    6. Communicate the results.
  • Data science is a multidisciplinary field that derives information from data and applies it to diverse purposes, such as making predictions relevant to decision making in an organization.

Unit 2 Data

Study Goals

  • Understand data and information.
  • Understand data types and shapes.
  • Understand typical sources of data.
  • Understand the 5Vs of big data.
  • Understand the issues concerning data quality.
  • Understand the challenges associated with the data engineering process.

2.1 Data Types & Sources

  • Human DNA Example:
    • 3 x 10^9 base pairs (A, T, C, G).
    • Complete sequence can be stored in 750MB.
    • Storage and computational capabilities allow researchers to analyze DNA sequences for chronic diseases and adapt medications.
  • Data Scientists' Time Allocation (CrowdFlower survey, 2016):
    • 60% spent cleaning and organizing data.
    • 19% spent collecting data sets.
  • Key Term: The facts, observations, assumptions, or incidences of any business practice are defined as the associated “data” of the underlying process.
  • Data Collection Methods: statistical populations, research experiments, sample surveys, and byproduct operations.
  • Data Types: quantitative and qualitative.
  • Table 7: Qualitative Vs Quantitative Data
    • Qualitative Data:
      • Describes qualities or characteristics.
      • Cannot be counted.
      • Data type: words, objects, pictures, observations, and symbols.
      • Answers: What characteristic or property is present?
      • Purpose: Identify important themes and conceptual framework.
      • Examples: happiness rating, gender, categories of plants, descriptive temperature of coffee (e.g., warm).
    • Quantitative Data:
      • Expressed as a number or can be quantified.
      • Can be counted.
      • Data type: numbers and statistics.
      • Answers: “How much?” and “How often?”
      • Purpose: Test hypotheses, develop predictions, check cause and effect.
      • Examples: height of a student, duration of green light, distance to planets, temperature of coffee (e.g., 30°C).
  • Shapes of Data: structured, unstructured, and streaming.
  • Table 8: Structured Vs Unstructured Data
    • Structured Data:
      • Characteristics: predefined data models, usually text or numerical, easy to search.
      • Applications: inventory control, airline reservation systems.
      • Examples: phone numbers, customer names, transaction information.
    • Unstructured Data:
      • Characteristics: no predefined data models, may be text, images, or other formats, difficult to search.
      • Applications: word processing, tools for editing media.
      • Examples: reports, surveillance imagery, email messages.
  • Semi-structured data involves both structured and unstructured shapes.
  • Streaming data is continuously generated, processed incrementally, and allows immediate access to content.

Sources of Data

  • Organizational and trademarked data sources: Google and Facebook provide bulk downloads of public data. Internal systems also record business activities.
  • Government data sources: Federal governments release demographic and economic data for risk estimation.
  • Academic data sources: Academic research creates large datasets that are made available to other researchers.
  • Webpage data sources: Webpages provide valuable numerical and text data (e.g., Twitter for sentiment analysis).
  • Media data sources: Media includes video, audio, and podcasts that provide quantitative and qualitative insights concerning user interaction.

2.2 The 5Vs of Big Data

  • Big data: high-volume, -velocity, and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
  • Challenges: analysis, storage, visualization, and information retrieval.
  • 5Vs of Big Data: volume, variety, veracity, validity, and velocity.
    • Volume: Amount and scale of the data (e.g., airplane sensors generate 10GB of data per second).
      Approximate yearly data creation is around 50 zettabytes and growing.
    • Variety: Considerable variety of data, from structured to unstructured.
    • Velocity: Speed at which data is created, stored, analyzed, and visualized.
    • Veracity: Quality of data; data may contain noise and cannot be guaranteed to be correct or precise.
    • Validity: Data may be correct but outdated or unsuitable for the question at hand.

2.3 Data Quality

  • Collected data commonly suffers from quality issues due to imperfect data sources or issues in the data collection process.
  • Problematic data includes noisy, inaccurate, incomplete, inconsistent, missing, duplicate, or outlier values.
  • Key Term: An outlier is a data record which is seen as exceptional and incompatible with the overall pattern of the data.
  • Methods for Handling Missing Values and Outliers:
    1. Removal of data records containing missing values and/or outliers:
      • Recommended for large datasets.
    2. Replacement of the missing value or outlier with an interpolated value from neighboring records:
      • Example: Linear interpolation for temperature data.
        x = \frac{22.5 + 20}{2} = 21.25 °C
    3. Replacement of the missing value or outlier with the average value of its variable across all data records.
    4. Replacement of the missing value or outlier with the most-often observed value for its variable across all data records.
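  • A minimal pandas sketch of these handling strategies is shown below; the column names and values are hypothetical and chosen so that linear interpolation reproduces the 21.25 °C example above.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing temperature and one missing city
df = pd.DataFrame({
    "hour": [1, 2, 3, 4],
    "temp": [20.0, np.nan, 22.5, 23.0],
    "city": ["Bonn", "Bonn", None, "Bonn"],
})

# 1. Removal: drop records containing missing values (suited to large datasets)
removed = df.dropna()

# 2. Interpolation: fill the gap from neighboring records; linear interpolation
#    here yields (20.0 + 22.5) / 2 = 21.25
interpolated = df.assign(temp=df["temp"].interpolate(method="linear"))

# 3. Mean imputation: replace with the variable's average across all records
mean_imputed = df.assign(temp=df["temp"].fillna(df["temp"].mean()))

# 4. Mode imputation: replace with the most often observed value
mode_imputed = df.assign(city=df["city"].fillna(df["city"].mode()[0]))

print(interpolated)
```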
  • Duplicate Records: Are removed to reduce computing time and prevent distortion of analytics outcome.
  • Redundancy: Identified and resolved by applying correlation analysis to each pair of variables.
  • Formula: Correlation coefficient (ρ) between two data variables x and y:
    • \rho_{x, y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}
  • Correlation Coefficient Interpretation:
    • If \rho = 1, the two variables are fully correlated.
    • If \rho = 0, there is no correlation, or the variables are independent.
  • Dimensionality reduction aims to simplify the data by removing data properties that are non-informative in relation to the analytical question at hand. Common approaches include feature selection and the removal of redundant (highly correlated) variables, as sketched below.
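  • The sketch below applies the correlation check with numpy; the two variables are hypothetical and chosen to be nearly linearly related.

```python
import numpy as np

# Hypothetical variables: y is almost a linear function of x, i.e., redundant
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

# Correlation coefficient computed directly from the formula above
rho = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# Cross-check against numpy's built-in estimator
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])

# A coefficient close to 1 flags y as redundant; dropping it is a simple
# form of dimensionality reduction
print(f"rho = {rho:.4f}")
```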

2.4 Data Engineering

  • Data Engineering: Focuses on practical applications of data collection and analysis.
  • Data Transformation: The main transformation methods are variable scaling, decomposition, and aggregation.
  • Table 9: Data Transformation Methods
    • Variable Scaling: Variables may have mixed scales, models work on scaled values.
    • Variable Decomposition: a time variable may be decomposed into hour and minute variables.
    • Variable Aggregation: “gross income” and “paid tax” variables may be aggregated into one variable, “net income.”
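  • A brief pandas sketch of the three transformation methods from Table 9; the column names and values are assumed for illustration.

```python
import pandas as pd

# Hypothetical records with a timestamp, gross income, and paid tax
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 09:15", "2024-01-01 17:40"]),
    "gross_income": [4000.0, 5200.0],
    "paid_tax": [900.0, 1300.0],
})

# Variable scaling: min-max scale gross income to the range [0, 1]
col = df["gross_income"]
df["gross_income_scaled"] = (col - col.min()) / (col.max() - col.min())

# Variable decomposition: split the time variable into hour and minute
df["hour"] = df["timestamp"].dt.hour
df["minute"] = df["timestamp"].dt.minute

# Variable aggregation: combine gross income and paid tax into net income
df["net_income"] = df["gross_income"] - df["paid_tax"]

print(df)
```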
  • Benefits of data processing:
    • improved analysis and demonstration of the organization’s data,
    • reduction of data so that only the most meaningful information is present,
    • easier storage and distribution of data,
    • simplified report creation,
    • enhanced productivity and increased profits, and
    • further accurate decision-making.
  • Real Life Examples
    • Case study: Online merchants can gain a 360-degree view of how customers are utilizing their web services.
      Another example is the Internet of Things (IoT), which connects devices and sensors on a common platform.
  • Industrial processes data applications
    • Main goal is to automate and optimize them, and to improve the competitive situation of the company.
  • Business data applications
    • Main goal is to better understand, motivate, and drive the business processes.
  • Text data applications
    • Main goal when applying data science to text data is to filter, search, extract, and structure information.
  • Image data applications
    • Main goal of applying data science to image data is to find and recognize objects, analyze and classify scenes, and relate image data to other information sources.
  • Medical data applications
    • Main goal of applying data science to medical data is to analyze, understand, and annotate the influences and side effects of medication in order to detect and predict different levels of certain diseases.

Unit 3 Data Science in Business

Study Goals

  • Understand a data science use case.
  • Understand the machine learning canvas.
  • Understand the model-centric performance evaluation.
  • Understand the role of KPIs in operational decisions.
  • Understand the influence of the cognitive biases.

3.1 Identification of Use Cases

  • Businesses become more valuable when they take an in-depth look at their data and identify the suitable data science use cases (DSUC) for their business objectives.
  • Prediction techniques are applied through DSUCs to extract valuable information from collected data.
  • The DSUC in any business can be identified through three main points: effort, risk, and achieved value.
  • An organization should focus their analysis on reducing effort and increasing gain.
  • Important questions have to be answered:
    • What is the value of the knowledge gained by applying data science tools to that dataset?
    • What will be discovered about the input dataset and its hypothesis?
    • What value will be added to the organization through applying data science techniques?
    • What will the organization’s decision be if the data science produces disappointing results?

Data Handling and Analysis

  • Data could be sourced from internal or external databases, web scraping, or sensor data.
  • Data collection is often a tedious and costly task as it may require human intervention.
  • Humans involved in the data collection phase study the data, label it, add valuable comments, correct errors, and even observe some data anomalies.
  • Once these tasks have been completed, the preprocessing techniques are applied to the data to correct any kind of error or noise, then redundant or missing values/records are scanned.
  • Employees who carry out the data scrubbing should have a significant knowledge of the domain of this data in order to make efficient decisions concerning the way to deal with the monitored data errors.

Machine Learning Canvas

  • Developed by Dorard (2017), the machine learning canvas is a tool that is used to identify data science use cases and provide a visual user procedure.
  • For example, the canvas can be used in the domain of real estate.
  • It is useful when investigating risky investments and comparing the real estate’s price predictions with the actual prices to determine the best deals.

3.2 Performance Evaluation

  • Two approaches:
    • Evaluate the model by comparing its outputs against a list of well-established numerical metrics.
    • Evaluate the ways that the model influenced the business by helping it to improve and achieve its goals.

Model-Centric Evaluation: Performance Metrics

  • Evaluation metrics for a classification model
    • For a DSUC designed with only two possible outputs {“yes”, “no”}, the decision of the output is dependent on a threshold assigned to the model.
    • When the model is applied to a data record, there are only four possible outcomes.
      These are true positive, true negative, false positive and false negative.
      These four possible results are usually presented in a matrix form called the confusion matrix.
Table 10: The Confusion Matrix
  • Predicted positive / actually positive: true positive (TP)
  • Predicted positive / actually negative: false positive (FP)
  • Predicted negative / actually positive: false negative (FN)
  • Predicted negative / actually negative: true negative (TN)
  • Precision Formula: \frac{\text{number of TP}}{\text{number of TP} + \text{number of FP}}
  • Accuracy Formula: \frac{\text{number of TP} + \text{number of TN}}{\text{number of TP} + \text{number of TN} + \text{number of FP} + \text{number of FN}}
  • Recall Formula: \frac{\text{number of TP}}{\text{number of TP} + \text{number of FN}}
  • The receiver operator characteristic (ROC) curve shows how altering the cutoff value could change the true positive and false positive rates.
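  • A small Python sketch computing these metrics from hypothetical confusion-matrix counts; the counts are assumed for illustration.

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, tn, fp, fn = 80, 90, 10, 20

precision = tp / (tp + fp)                  # 80 / 90  ~ 0.889
recall = tp / (tp + fn)                     # 80 / 100 = 0.800 (true positive rate)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 170 / 200 = 0.850

# The false positive rate; together with recall it gives one point
# on the ROC curve for the chosen threshold
fpr = fp / (fp + tn)                        # 10 / 100 = 0.100

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} fpr={fpr:.3f}")
```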

Evaluation metrics for a regression model

  • The objective is to measure how close a regression model’s output (y) is to the desired output (d).
  • There are standard metrics that evaluate the accuracy and performance of the model: absolute error, relative error, mean absolute percentage error, square error, mean square error, mean absolute error, and root mean square error, as given in the following equations.
  • Absolute error Formula: \varepsilon = d - y
  • Relative error Formula: \varepsilon^* = \frac{d-y}{d} \cdot 100\%
  • Mean absolute percentage error Formula: MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|d_i - y_i|}{d_i} \cdot 100\%
  • Square error Formula: \varepsilon^2 = (d - y)^2
  • Mean square error Formula: MSE = \frac{1}{n} \sum_{i=1}^{n} (d_i - y_i)^2
  • Mean absolute error Formula: MAE = \frac{1}{n} \sum_{i=1}^{n} |d_i - y_i|
  • Root mean square error Formula: RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (d_i - y_i)^2}
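  • A numpy sketch evaluating these regression metrics on a hypothetical pair of desired outputs (d) and model outputs (y).

```python
import numpy as np

# Hypothetical desired outputs (d) and regression model outputs (y)
d = np.array([10.0, 12.0, 15.0, 20.0])
y = np.array([11.0, 11.5, 16.0, 18.0])

mae = np.mean(np.abs(d - y))             # mean absolute error
mse = np.mean((d - y) ** 2)              # mean square error
rmse = np.sqrt(mse)                      # root mean square error
mape = np.mean(np.abs(d - y) / d) * 100  # mean absolute percentage error

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} MAPE={mape:.2f}%")
```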

3.3 Data-Driven Operational Decisions

  • Analytics results are communicated and made available in such a way that they are useful for the relevant decision makers inside an organization.
  • Quantification of the model’s merit is achieved by defining so-called Key Performance Indicators (KPIs).
  • These are measurements that express to what extent the business goals have been met or not.
  • Most KPIs focus on increased efficiency, reduced costs, improved revenue, and enhanced customer satisfaction.

Characteristics of effective KPIs

  1. easy to comprehend and simple to measure,
  2. assists the splitting of the overall objective into the daily operations of the staff responsible for it,
  3. visible across the entire organization,
  4. able to indicate positive/negative variations from the business objective,
  5. has a defined length of time including start and end dates of its measuring, and
  6. achievable through the available resources (e.g. machines, staff, etc.).

3.4 Cognitive Biases

  • Montibeller & Winterfeldt (2015, p. 1230) reported that “behavioral decision research has demonstrated that judgments and decisions of ordinary people and experts are subject to numerous biases.”
  • From an evolutionary perspective, DSUCs are subject to cognitive biases which can highly influence the judgment of the business’ performance and/or settings.
Table 11: The Common Cognitive Biases and Their De-biasing Techniques
  • Anchoring:
    • Occurs when the estimation of a numerical value is based on an initial value (anchor).
    • De-biasing Technique: Remove anchors, have numerous and counter anchors, use various experts using specific anchors.
  • Confirmation:
    • Occurs when there is a desire to confirm one's belief, leading to unconscious selectivity in the acquisition and use of evidence.
    • De-biasing Technique: Use multiple experts for assumptions, counterfactual challenging probability assessments, use sample evidence for alternative assumptions.
  • Desirability:
    • Favoring alternative options due to a bias that leads to underestimating or overestimating consequences.
    • De-biasing Technique: Use multi-stakeholder studies of different perspectives, use multiple experts with different views, use appropriate transparency rates.
  • Insensitivity:
    • Sample sizes are ignored and extremes are considered equally in small and large samples.
    • De-biasing Technique: Use statistics to determine the likelihood of extreme results in different samples, use the sample data to prove the logical reason behind extreme statistics.

Unit 4 Statistics

Study Goals

  • Understand the importance of statistics in data science.
  • Understand probability and its relation to the prediction model’s outputs.
  • Understand conditional probability and the probability density function.
  • Understand the different probability distributions.
  • Understand Bayesian statistics.

4.1 Importance of Statistics in Data Science

  • Two separate fields of statistics:
    • Descriptive Statistics
    • Probability theory
  • Key Term: Standard Deviation is a measure of how spread out the data values are, which is typically applied to normally distributed data.
  • Before modern information processing and storage technology, data analysts sought to summarize sample data quantitatively in the form of a compact set of measures.
  • These statistical parameters include: mean, maximum, minimum, median, and standard deviation.
    Mean = \frac{2 + 3 + 4 + 5 + 6 + 7 + 9}{7} \approx 5.14
  • Key formula: the median of a sorted list of elements is the value in the middle.
  • For the sorted elements (2, 3, 4, 5, 6, 7, 9), the median is 5, while the mean is approximately 5.14.
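  • These summary statistics can be checked with Python's standard library, using the same sorted values.

```python
import statistics

values = [2, 3, 4, 5, 6, 7, 9]

print(statistics.mean(values))    # 5.142857..., i.e., approx. 5.14
print(statistics.median(values))  # 5, the middle value of the sorted list
print(statistics.stdev(values))   # sample standard deviation, approx. 2.41
print(min(values), max(values))   # 2 9
```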

Probability Theory

  • Probability theory is the core theory for many data science techniques.
  • Key Term: Written as (P), probability is simply defined as the chance of an event happening.
  • If occurrence of an event is impossible, its probability is P = 0.
  • If an event is certain, then its probability is P = 1
  • The probability of any event is a number between 0 and 1.
    Two contradicting events cannot happen to the same object at the same time.
  • Mutually exclusive events: Opposite events such as these are defined as mutually exclusive events.
  • Two mutually independent events can happen simultaneously without affecting one another.
  • Example: A company can make profit and have legal issues at the same time because these two events do not impact each other.
    P(A \text{ and } B) = P(A \cap B) = P(A) \cdot P(B)
    P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) - P(A \cap B)
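  • Hypothetical worked example (the probabilities are assumed for illustration): if P(A) = 0.6 (profit) and P(B) = 0.1 (legal issues) and the two events are independent, then
    P(A \cap B) = 0.6 \cdot 0.1 = 0.06
    P(A \cup B) = 0.6 + 0.1 - 0.06 = 0.64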

Conditional probability

  • When two events are correlated, the conditional probability p(A|B) is defined as the probability of an event A, given that event B has already occurred.
    • Conditional probability formula:
      • p(A|B) = \frac{p(A \cap B)}{p(B)}
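  • Hypothetical worked example (the numbers are assumed for illustration): if p(A \cap B) = 0.2 and p(B) = 0.5, then
    p(A|B) = \frac{0.2}{0.5} = 0.4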

Probability distribution function

  • Consider a random variable that can take on a given set of values. The occurrence of each of these values has a certain probability.
  • The function that maps outcomes with their respective probability is called a probability distribution function.

4.2 Important Statistical Concepts

Probability Distributions
Examples that regularly occur are discussed.

Normal distribution
  • Arguably one of the most common distributions, the normal distribution has a bell-shaped curve.
  • It arises frequently because, in many naturally occurring scenarios, attributes are distributed symmetrically around their mean value.
  • The normal distribution has about 68 percent of the possible values within one standard deviation from the mean, while two standard deviations cover 95 percent of the values. Finally, the interval of ±3 standard deviations contains 99.7 percent of the values.
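  • The 68-95-99.7 rule can be verified with Python's standard library; the sketch below uses the standard normal distribution.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Probability mass within ±k standard deviations of the mean
    coverage = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within ±{k} standard deviations: {coverage:.1%}")

# Prints approximately 68.3%, 95.4%, and 99.7%
```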
Binomial distribution
  • The binomial distribution is the probability distribution of the number of successes in a sequence of independent trials that each can be described by a binary random variable.
  • If a coin is tossed twice, what is the probability of “heads” occurring once? What is the probability of “heads” occurring twice? The table below represents the possible outcomes when a coin is tossed twice.
Table 12: Possible Outcomes of Tossing a Coin Twice: HH, HT, TH, TT

P(\text{two heads}) = \frac{1}{4} = 0.25

  • However, the probability of “heads” occurring only once in two throws is recorded twice in the table out of four possible outcomes. This is half out of the total.
    P(\text{one head}) = \frac{2}{4} = 0.5 .
  • The probability of the results being just “tails” for both throws is recorded only once out of four possible outcomes:
    P(\text{no heads}) = \frac{1}{4} = 0.25 .
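  • A short sketch reproducing these coin-toss probabilities from the binomial probability function, using only Python's standard library.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent binary trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 2, 0.5  # two tosses of a fair coin
print(binom_pmf(0, n, p))  # 0.25 -> no heads (tails twice)
print(binom_pmf(1, n, p))  # 0.5  -> "heads" exactly once
print(binom_pmf(2, n, p))  # 0.25 -> "heads" twice
```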
Poisson distribution
  • The Poisson distribution quantifies the probability of a given number of independent events occurring in a fixed time interval. If an average of 10 calls per day are sent to a call center, what is the probability that the call center will receive exactly seven calls on a given day? The Poisson probability function is p(x) = \frac{e^{-\mu} \mu^x}{x!}
    • P(7)= e^{-10} \frac{10^7}{7!} \approx 0.09 .
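  • The same call-center calculation as a short Python sketch using the Poisson probability function above.

```python
from math import exp, factorial

def poisson_pmf(x: int, mu: float) -> float:
    """Probability of exactly x independent events in a fixed interval."""
    return exp(-mu) * mu**x / factorial(x)

# An average of 10 calls per day; probability of exactly 7 calls on a day
print(round(poisson_pmf(7, 10), 3))  # approx. 0.09
```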
Bayesian Statistics
  • Bayesian statistics is a unique branch of statistics that does not interpret probabilities as frequencies of occurrences, but rather as an expectation of belief.
  • Formula:
    • p(A|B) = \frac{p(B|A)p(A)}{p(B)}
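  • A minimal sketch of Bayes' rule in Python; the prior and likelihood values below follow a diagnostic-test style example and are assumed purely for illustration.

```python
# Hypothetical values: prior P(A), likelihoods P(B|A) and P(B|not A)
p_a = 0.01              # prior belief, e.g., P(condition)
p_b_given_a = 0.95      # P(positive test | condition)
p_b_given_not_a = 0.05  # P(positive test | no condition)

# Total probability of the evidence, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior belief P(A|B) via Bayes' rule
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(A|B) = {p_a_given_b:.3f}")  # approx. 0.161
```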

Unit 5 Machine Learning

Study Goals

  • Understand what is meant by machine learning.
  • Understand the different applications of machine learning.
  • Understand the concepts of classification and regression.
  • Understand the difference between each machine learning paradigm.
  • Understand the basic machine learning approaches.

5.1 Role of Machine Learning in Data Science

  • According to Samuel (1959), machine learning is a “field of study that gives computers the ability to learn without being explicitly programmed.”
  • Machine learning employs descriptive statistics to summarize salient properties of the data and predictive analytical techniques to derive insights from training data that are useful in subsequent applications.

Important Terms

  • The developed model is called a machine learning model. Developed models are applied in a variety of different settings such as vision/language processing, forecasting, pattern recognition, games, data mining, expert systems, and robotics.
  • For continuous outputs, machine learning builds a prediction model called a regression model.
  • For discrete outputs, the prediction model is called a classification model.
  • Important Paradigms in Machine Learning:
    • Supervised learning: both the data inputs and the desired outputs are provided, and includes classification and regression approaches.
    • Unsupervised learning: discovers patterns in the data inputs alone, which includes cluster analysis.
    • Semi-supervised learning: covers tasks that involve partially labelled data sets.

5.2 Overview of ML Approaches: Supervised Learning

  • Objective: develop a mathematical model (f) that relates the output to the inputs and can predict the output for future inputs, as clarified in the following equation: y = f(x_i), \quad i = 1, \dots, n
    • Where n: total number of variables
  • In classification, the output (y) belongs to the predicted class or classes.
  • In regression, the output (y) belongs to a range of infinite continuous values that define the numerical outcome(s).
  • Updating process is governed by a specific loss function, and the objective is to adjust the parameters so that this loss function is minimized.
  • Regression problems: the loss function can be the mean squared error (MSE)
  • Classification problems: the loss function can be the number of wrongly classified instances.
Supervised learning techniques:
  • Decision tree based methods
  • K-nearest neighbors method
  • Naïve Bayes method
  • Support Vector Machines (SVM) method
  • Linear regression method:
    y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m (see the sketch after this list)
  • Logistic regression method
  • Artificial neural networks (ANN) method - deep learning
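  • A compact numpy sketch of a linear regression model fitted by minimizing the MSE loss through ordinary least squares; the training data are hypothetical.

```python
import numpy as np

# Hypothetical training data: two input variables and a continuous output
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([8.1, 6.9, 15.2, 13.8, 20.1])

# Prepend a column of ones so the intercept w0 is learned as well
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Ordinary least squares: the weights that minimize the MSE loss
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predictions y_hat = w0 + w1*x1 + w2*x2 and the resulting training MSE
y_hat = X_design @ w
mse = np.mean((y - y_hat) ** 2)

print("weights:", w.round(3), "MSE:", round(mse, 4))
```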

Overview of ML Approaches: Unsupervised Learning

  • Consists of inputs (independent variables, x_i) only, without any corresponding desired outputs.