FDA

0.0(0)

Studied by 0 people

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Card Sorting

1/54

Earn XP

Description and Tags

Ch 2-6; Last exam 5/5!!

Study Analytics

Name	Mastery	Learn	Test	Matching	Spaced

No study sessions yet.

55 Terms

New cards

2: Data

A collection of facts such as numbers, words, measurements, observations or description of things.
Technically, values of qualitative or quantitative variables belonging to a set of items.

New cards

Quantitative

Values representing counts or measurements based on some quantitative trait. Typically summarized using averages / means / statistical metrics.

New cards

Quantitative Types:

Continuous - any value within a given interval, can be whole number
Discrete - only particular values, whole and half numbers

New cards

Quantitative Levels of measurement:

Interval - differences (intervals) are meaningful but ratios are not, no true zero point.
Ratio - both intervals and ratios are meaningful, has a true zero point

New cards

Qualitative

Values that can be placed in non-numerical categories, grouped into categories based on some qualitative trait. Commonly summarized using “percentages / proportions”.

New cards

Qualitative levels of measurement:

Nominal - Data that cannot be ranked/ordered, names, labels or categories. e.g eye colour
Ordinal - Data that can be arranged in order/ranking. e.g ratings

New cards

Data Characterization

Structured data - High degree of organization, strict format, easily processed. e.g relational database
Semi-structured data - Data collected in ad-hoc (when necessary/for purpose) manner, certain structure not all, mixed with data values.
Unstructured data - limited indication of type of data/format. e.g raw audio

New cards

Data Science (DS)

An inter-disciplinary field that uses computer science, statistics and machine learning to collect, clean, integrate, analyze, visualize, and interact with big data to create data products.

New cards

Difference between BI and DS

New cards

Types of Data Science Questions

Causal Analysis
Descriptive Analysis
Exploratory Analysis
Inferential Analysis
Mechanistic Analysis
Predictive Analysis

New cards

Causal Analysis

The gold standard for data analysis, it’s to find out the what happens to one variable when another variable changes.
Known as cause-and-effect relationship between variables.

New cards

Descriptive Analysis

Summarizes a characteristic of a dataset and is commonly applied to census data.
Example is a census report that summarizes the population distribution.

New cards

Exploratory Analysis

This can find relationships/trends in the data which is also known as hypothesis-generating analyses.
It is useful for defining future studies but it is not used for predicting nor the final say.
Example is a company analyzes customers’ purchases to see what they’re most likely to purchase.

New cards

Inferential Analysis

Uses a relatively small sample of data for a bigger population and can make a statement about something outside the data.

New cards

Mechanistic Analysis

Helps to understand the exact changes in variables that lead to changes in other variables for individual objects but incredibly hard to infer.

New cards

Predictive Analysis

Uses the data on some objects to accurately predict values for another object, the goal is to predict but not explain the reasons.

New cards

A/B Testing

Heavily used to compare if certain actions/variants lead to better response/outputs. Example: A/B testing on a website.

New cards

3: Descriptive Analytics

To report summary descriptions on population and sample, summarize and get a general sense of data at hand.

New cards

Histogram

Shows frequency of quantitative data (ordinal, interval, ratio)

To show numerical record of the frequency items in the dataset.

New cards

Central Tendency:

Mean
Median
Mode

New cards

Mean

The average of the dataset or the balance point but can be badly affected by extreme values (outliers) in the dataset.
It is used to find the central value of data.

New cards

Median

The middle value that divides a distribution into 2 equal halves.
The middle value of a ranked dataset.
It is not affected by outliers and is used to find the central point if data is skewed or has extreme outliers.

New cards

Mode

The most common data point.
The most frequent data in a dataset.
It is used to identify the most common values in the dataset.

New cards

Measures of dispersion:

Range
Interquartile Range
Variance
Deviation
Sum of Squared Deviations
Standard Deviation

New cards

Range

The spread or the distance between the lowest and highest values of a variable. Not robust, sensitive to extreme values.

New cards

Deviation

The distance away from the mean of a case’s score.

New cards

Standard Deviation

Measures the spread around the mean; how much data deviates from the mean. It is the square root of variance.

New cards

Variance

A measure of the spread of the recorded values on a variable
Measures the dispersion of the dataset, squared deviations from the mean.
It doesn’t use the original data since it is squared.

New cards

Interquartile Range

The distance or range between the 25^th percentile and the 75^th percentile (Q3-Q1).
To understand the spread and variability of the data and it focuses on the middle 50% of the data
It can also be used to detect outliers.

New cards

Index of Qualitative Variation (IQV)

A measure of variability for nominal variables.
It ranges from 0 to 1 (only 1 group i.e least homogeneous to maximum variability i.e most heterogenous)

New cards

Population vs Sample

Population is a large data set while sample is a small data set taken from population.
Population is the entire group of interest that you want to study and sample of a population is a subset of the population.

New cards

4: Data Visualization

It is an effective tool to perform exploratory data analysis.
The use of visual representations to explore the data, make sense of it, and communicate insights about it.

New cards

Histogram

Shows frequency of quantitative data, shows 4 main aspects: Shape, Center, Spread, Outliers.
Determining the “bins” matters. Ex: 5-10.
Small data sets – can be misleading.
Large data sets – can be quite effective.
Effectively only work with 1 variable at a time.

New cards

Boxplots

Portrays most descriptive statistics info, including IQR, Median, Range, Outliers, Variability, Skewness/Symmetry.
However, the negatives:
Over-plotting (multiple boxplots, close together)
Hides some details about distribution shape (unimodal/bimodal?)
No standard implementation in software (different software, different way)

New cards

Calculate Boxplot’s Outliers:

Below (Q1 – 1.5 X IQR)
Above (Q3 + 1.5 X IQR)

New cards

Bar chart

Comparisons across categories of data

New cards

Pie charts

Shows percentages or proportions of a total

New cards

Scatter plot

Displays relationship between 2 quantitative variables

New cards

Line chart

Displays trends over 2 quantitative axes, one of them represents continuity

New cards

Maps

Display spatial data on map

New cards

5: Data Preparation

Taking the data from its raw format, extracting relevant data and into a tidy format.

New cards

Web Scraping

Programmatically scraping information, automating the extraction and navigation of data from multiple web pages.
Used when data is not available via more direct methods such as direct download and to automatically keep track of regularly changing data.

New cards

2 Main Steps for Web Scraping

Retrieve page
Extract desired information

New cards

6: Data Mining Applications

Spam Filtering, Handwriting Character Recognition, Customer Attrition
Predicting sales amount of a product/wind velocities
Credit card fraud detection, Network Intrusion Detection
Market segmentation, Products recommendations, Shelf management

New cards

Machine Learning Methods:

Supervised - machine learns a mapping from known data of both X’s and Y’s that allows to predict an unknown value of a target variable.

Unsupervised - no particular target variable and tries to find useful structures or relationships in the data

New cards

Supervised:

Regression
Classification

New cards

Unsupervised:

Clustering
Association Rule Discovery
Sequential Pattern Discovery
Anomaly Detection

New cards

Types of Regression Models

Single (Simple) → Linear & Non-linear

Multiple (Multivariate) → Linear & Non-linear

New cards

Univariate Regression

Used to predict the numeric label value using one variable value. Can be linear / non-linear, for example: logarithmic, exponential.

New cards

Multivariate Regression

Used to predict the numeric label value using 2 or more variable values. Can also be linear / non-linear.

New cards

Linear Regression

Basic regression analysis model that allows you to identify the relationship between a dependent variable and a single independent variable.

New cards

Non-linear Regression

Non-linear relationship between a dependent and independent variable. Can capture complex relationships between the dependent and independent variables.

Types of non-linear regression: Quadratic regression, Cubic regression, Exponential regression, Logarithmic regression, Polynomial regression

New cards

Root Mean Squared Error (RMSE)

The lower the RMSE, the better the model.

Penalizes larger errors more than small errors.

New cards

Mean Absolute Error (MAE)

Less sensitive to outliers compared to RMSE.

New cards

R squared

Statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model.

The value is between 0 and 1. Closer to 1, the better the model.