FDA


flashcard set

Description and Tags

Ch 2-6; Last exam 5/5!!

55 Terms

1
New cards

2: Data

  • A collection of facts such as numbers, words, measurements, observations or descriptions of things.

  • Technically, values of qualitative or quantitative variables belonging to a set of items.

2
New cards

Quantitative

Values representing counts or measurements based on some quantitative trait. Typically summarized using averages / means / statistical metrics.

3
New cards

Quantitative Types:

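  • Continuous - can take any value within a given interval, including whole numbers and fractions (e.g. height)

  • Discrete - can take only particular, separate values, such as whole or half numbers (e.g. a count of items, shoe sizes)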

4
New cards

Quantitative Levels of measurement:

  • Interval - differences (intervals) are meaningful but ratios are not, no true zero point.

  • Ratio - both intervals and ratios are meaningful, has a true zero point

5
New cards

Qualitative

Values that can be placed in non-numerical categories, grouped into categories based on some qualitative trait. Commonly summarized using “percentages / proportions”.

6
New cards

Qualitative levels of measurement:

  • Nominal - Data that cannot be ranked or ordered: names, labels or categories, e.g. eye colour

  • Ordinal - Data that can be arranged in an order or ranking, e.g. ratings

7
New cards

Data Characterization

  • Structured data - High degree of organization, strict format, easily processed, e.g. a relational database

  • Semi-structured data - Data collected in an ad-hoc manner (as needed, for a specific purpose); it has some structure but not a complete one, and the structural markers are mixed in with the data values, e.g. JSON or XML

  • Unstructured data - Limited indication of the data's type or format, e.g. raw audio

8
New cards

Data Science (DS)

An inter-disciplinary field that uses computer science, statistics and machine learning to collect, clean, integrate, analyze, visualize, and interact with big data to create data products.

9
New cards

Difference between BI and DS

10
New cards

Types of Data Science Questions

  • Causal Analysis

  • Descriptive Analysis

  • Exploratory Analysis

  • Inferential Analysis

  • Mechanistic Analysis

  • Predictive Analysis

11
New cards

Causal Analysis

  • The gold standard for data analysis: it finds out what happens to one variable when another variable changes.

  • Known as cause-and-effect relationship between variables.

12
New cards

Descriptive Analysis

  • Summarizes a characteristic of a dataset and is commonly applied to census data.

  • Example: a census report that summarizes the population distribution.

13
New cards

Exploratory Analysis

  • Finds relationships and trends in the data; also known as hypothesis-generating analysis.

  • It is useful for defining future studies, but it should not be used for prediction or treated as the final say.

  • Example: a company analyzes customers’ purchases to see what they are most likely to purchase.

14
New cards

Inferential Analysis

Uses a relatively small sample of data to draw conclusions about a bigger population, so it can make a statement about something outside the data.

15
New cards

Mechanistic Analysis

Helps to understand the exact changes in variables that lead to changes in other variables for individual objects, but it is incredibly hard to infer.

16
New cards

Predictive Analysis

Uses the data on some objects to accurately predict values for another object. The goal is to predict, not to explain the reasons.

17
New cards

A/B Testing

Heavily used to compare whether certain actions or variants lead to a better response or output. Example: A/B testing on a website.
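
A minimal sketch of the comparison behind an A/B test, using made-up visitor and conversion counts (all numbers and variable names here are hypothetical). It only compares the two observed rates; a real A/B test would also check whether the difference is statistically significant.

```python
# Hypothetical results for two variants of a web page.
visitors_a, conversions_a = 1000, 50   # variant A (control)
visitors_b, conversions_b = 1000, 65   # variant B (new design)

# Conversion rate = conversions / visitors for each variant.
rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Relative lift of B over A.
lift = (rate_b - rate_a) / rate_a

print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  lift: {lift:.0%}")
```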

18
New cards

3: Descriptive Analytics

Reports summary descriptions of a population or sample, to summarize the data at hand and get a general sense of it.

19
New cards

Histogram

Shows the frequency of quantitative data (interval, ratio) and is sometimes also applied to ordinal data.

Gives a numerical record of how often values occur in the dataset.

20
New cards

Central Tendency:

  • Mean

  • Median

  • Mode

21
New cards

Mean

  • The average (balance point) of the dataset; it can be badly affected by extreme values (outliers) in the dataset.

  • It is used to find the central value of data.

22
New cards

Median

  • The middle value that divides a distribution into 2 equal halves.

  • The middle value of a ranked dataset.

  • It is not affected by outliers and is used to find the central point if data is skewed or has extreme outliers.

23
New cards

Mode

  • The most common data point.

  • The most frequent data in a dataset.

  • It is used to identify the most common values in the dataset.
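
The three measures of central tendency can be compared on a small example. This is a sketch using Python's standard `statistics` module; the `scores` list is made-up data with one deliberate outlier, to show how the mean shifts while the median and mode do not.

```python
from statistics import mean, median, mode

# Hypothetical data; 40 is an outlier.
scores = [2, 3, 3, 4, 5, 6, 40]

avg = mean(scores)      # pulled upward by the outlier
mid = median(scores)    # middle value of the ranked data
common = mode(scores)   # most frequent value

print(avg, mid, common)
```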

24
New cards

Measures of dispersion:

  • Range

  • Interquartile Range

  • Variance

  • Deviation

  • Sum of Squared Deviations

  • Standard Deviation

25
New cards

Range

The spread or the distance between the lowest and highest values of a variable. Not robust, sensitive to extreme values.

26
New cards

Deviation

The distance of a case’s score from the mean.

27
New cards

Standard Deviation

Measures the spread around the mean; how much data deviates from the mean. It is the square root of variance.

28
New cards

Variance

  • A measure of the spread of the recorded values on a variable.

  • Measures the dispersion of the dataset as the average of the squared deviations from the mean.

  • Because the deviations are squared, it is expressed in squared units rather than the original units of the data.
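
The dispersion measures above build on each other, which a short calculation makes visible. This sketch uses made-up `values` and assumes the population form of the variance (dividing by n); the sample form divides by n - 1 instead.

```python
from math import sqrt

values = [4, 8, 6, 2, 10]  # hypothetical data
n = len(values)
mean_value = sum(values) / n

# Range: distance between the lowest and highest values.
value_range = max(values) - min(values)

# Deviation: distance of each value from the mean.
deviations = [v - mean_value for v in values]

# Sum of squared deviations.
ssd = sum(d ** 2 for d in deviations)

# Variance (population form, an assumption here) and standard deviation.
variance = ssd / n
std_dev = sqrt(variance)

print(value_range, ssd, variance, round(std_dev, 3))
```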

29
New cards

Interquartile Range

  • The distance or range between the 25th percentile and the 75th percentile (Q3-Q1).

  • Describes the spread and variability of the data, focusing on the middle 50% of the data.

  • It can also be used to detect outliers.

30
New cards

Index of Qualitative Variation (IQV)

  • A measure of variability for nominal variables.

  • It ranges from 0 to 1: 0 means all cases fall in a single category (most homogeneous, no variation) and 1 means cases are spread evenly across the categories (most heterogeneous, maximum variability)

    Formula
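
The formula itself is missing from this card. One commonly used form, given here as an assumption rather than as the course's version, is IQV = K(1 - Σp²) / (K - 1), where K is the number of categories and p is the proportion of cases in each category. A sketch in Python, with a hypothetical helper name `iqv`:

```python
def iqv(counts):
    """Index of Qualitative Variation from a list of category counts.

    Assumes the common form K * (1 - sum(p^2)) / (K - 1).
    """
    k = len(counts)
    if k < 2:
        raise ValueError("IQV needs at least two categories")
    total = sum(counts)
    sum_sq = sum((c / total) ** 2 for c in counts)
    return k * (1 - sum_sq) / (k - 1)

print(iqv([10, 0, 0]))  # all cases in one category
print(iqv([5, 5, 5]))   # cases spread evenly
```

The first call gives the minimum (no variation) and the second gives the maximum, up to floating-point rounding.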

31
New cards

Population vs Sample

  • A population is a large data set, while a sample is a smaller data set taken from the population.

  • Population is the entire group of interest that you want to study and sample of a population is a subset of the population.

32
New cards

4: Data Visualization

  • It is an effective tool to perform exploratory data analysis.

  • The use of visual representations to explore the data, make sense of it, and communicate insights about it.

33
New cards

Histogram

  • Shows frequency of quantitative data, shows 4 main aspects: Shape, Center, Spread, Outliers.

  • Determining the “bins” matters. Ex: 5-10.

  • Small data sets – can be misleading.

  • Large data sets – can be quite effective.

  • Effectively only works with 1 variable at a time.
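
To illustrate why the choice of bins matters, here is a small sketch that groups made-up `ages` into fixed-width bins and prints a text histogram. The data and the `bin_width` of 10 are assumptions for illustration; changing the width changes the shape the histogram appears to have.

```python
from collections import Counter

ages = [21, 22, 25, 29, 31, 34, 35, 38, 41, 58]  # hypothetical data
bin_width = 10  # try 5 or 20 to see the shape change

# Map each value to the lower edge of its bin, then count per bin.
counts = Counter((age // bin_width) * bin_width for age in ages)

for start in sorted(counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```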

34
New cards

Boxplots

  • Portrays most descriptive statistics info, including IQR, Median, Range, Outliers, Variability, Skewness/Symmetry.

  • Drawbacks:

      • Over-plotting (multiple boxplots, close together)

      • Hides some details about distribution shape (e.g. unimodal vs. bimodal)

      • No standard implementation in software (different software draws them in different ways)

35
New cards

Calculate Boxplot’s Outliers:

  • Below (Q1 - 1.5 × IQR)

  • Above (Q3 + 1.5 × IQR)
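
The outlier rule can be sketched with Python's standard `statistics.quantiles`. The `data` list is made up, and the `"inclusive"` quartile method is an assumption: different software computes quartiles slightly differently, so the fences may not match another tool exactly.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data

# Quartiles: cut points that split the ranked data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences from the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)
```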

36
New cards

Bar chart

Comparisons across categories of data

37
New cards

Pie charts

Shows percentages or proportions of a total

38
New cards

Scatter plot

Displays relationship between 2 quantitative variables

39
New cards

Line chart

Displays trends over 2 quantitative axes, one of which represents continuity (e.g. time)

40
New cards

Maps

Display spatial data on map

41
New cards

5: Data Preparation

Taking the data from its raw format, extracting the relevant data and converting it into a tidy format.

42
New cards

Web Scraping

  • Programmatically scraping information, automating the extraction and navigation of data from multiple web pages.

  • Used when data is not available via more direct methods such as direct download and to automatically keep track of regularly changing data.

43
New cards

2 Main Steps for Web Scraping

  1. Retrieve page

  2. Extract desired information
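
The two steps can be sketched with Python's standard library only. The function and class names (`retrieve_page`, `LinkExtractor`, `extract_links`) are hypothetical, and extracting links is just one example of "desired information". The extraction step is run on an inline HTML string so the example works without a network connection.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def retrieve_page(url):
    """Step 1: download the raw HTML of a page."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkExtractor(HTMLParser):
    """Step 2: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Offline demonstration of step 2 on a hypothetical page fragment.
sample_html = '<p><a href="/page1">One</a> <a href="/page2">Two</a></p>'
links = extract_links(sample_html)
print(links)
```

In practice, `retrieve_page(url)` would supply the HTML passed to `extract_links`.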

44
New cards

6: Data Mining Applications

  • Spam Filtering, Handwriting Character Recognition, Customer Attrition

  • Predicting sales amount of a product/wind velocities

  • Credit card fraud detection, Network Intrusion Detection

  • Market segmentation, Products recommendations, Shelf management

45
New cards

Machine Learning Methods:

Supervised - the machine learns a mapping from known data containing both X’s and Y’s, which allows it to predict an unknown value of a target variable.

Unsupervised - no particular target variable and tries to find useful structures or relationships in the data

46
New cards

Supervised:

  • Regression

  • Classification

47
New cards

Unsupervised:

  • Clustering

  • Association Rule Discovery

  • Sequential Pattern Discovery

  • Anomaly Detection

48
New cards

Types of Regression Models

Single (Simple) → Linear & Non-linear

Multiple (Multivariate) → Linear & Non-linear

49
New cards

Univariate Regression

Used to predict the numeric label value using one variable value. Can be linear / non-linear, for example: logarithmic, exponential.

50
New cards

Multivariate Regression

Used to predict the numeric label value using 2 or more variable values. Can also be linear / non-linear.

51
New cards

Linear Regression

Basic regression analysis model that allows you to identify the relationship between a dependent variable and, in its simplest form, a single independent variable.
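
A minimal sketch of fitting a line y = slope × x + intercept with ordinary least squares, which is one standard way to fit a linear regression (the card does not name a fitting method, so this is an assumption). The data points are made up and lie exactly on a line so the result is easy to check.

```python
xs = [1, 2, 3, 4, 5]     # independent variable (hypothetical)
ys = [3, 5, 7, 9, 11]    # dependent variable, here exactly 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates of slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

print(slope, intercept)

# Predict the label for a new x value.
predicted = slope * 6 + intercept
```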

52
New cards

Non-linear Regression

Non-linear relationship between a dependent and independent variable. Can capture complex relationships between the dependent and independent variables.

Types of non-linear regression: Quadratic regression, Cubic regression, Exponential regression, Logarithmic regression, Polynomial regression

53
New cards

Root Mean Squared Error (RMSE)

The lower the RMSE, the better the model.

Penalizes larger errors more than small errors.

54
New cards

Mean Absolute Error (MAE)

Less sensitive to outliers compared to RMSE.

55
New cards

R squared

Statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model.

The value typically lies between 0 and 1. The closer to 1, the better the model fits.
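
The three evaluation metrics can be computed by hand on a small example. This sketch uses made-up `actual` and `predicted` values and the standard definitions of each metric; one prediction is deliberately far off, so RMSE comes out larger than MAE.

```python
from math import sqrt

actual = [3, 5, 7, 9]        # hypothetical true values
predicted = [2, 5, 8, 13]    # hypothetical model outputs
n = len(actual)

errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average of the absolute errors.
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the average squared error.
rmse = sqrt(sum(e ** 2 for e in errors) / n)

# R squared: 1 - (residual sum of squares / total sum of squares).
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot

print(mae, round(rmse, 3), round(r_squared, 3))
```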