Statistical analysis

44 Terms

1
New cards

What is a sample?

  • used to draw conclusions about one or more characteristics of a population

  • representativeness is important

2
New cards

What is the null hypothesis?

  • a statement about the mean in the population

  • denoted H0

3
New cards

What is the T statistic?

  • used to determine whether the null hypothesis is rejected or not

4
New cards

What is the standard error?

  • the uncertainty around an estimate
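The two cards above can be sketched in a few lines of Python: the standard error is the sample standard deviation over √n, and the t statistic compares the sample mean to the hypothesised mean. The sample data and `mu0` below are hypothetical.

```python
import math

def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - mu0) / standard error."""
    n = len(sample)
    mean = sum(sample) / n
    # sample variance uses n - 1 in the denominator
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    se = math.sqrt(var / n)  # standard error = uncertainty around the estimate
    return (mean - mu0) / se

# hypothetical measurements, testing against a hypothesised mean of 5.0
t = t_statistic([4.8, 5.1, 5.0, 4.9, 5.2], mu0=5.0)
```

A t value far from 0 is evidence against the null hypothesis.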

5
New cards

Scatter plot patterns

(image)
6
New cards

Correlation patterns

(image)
7
New cards

Correlation coefficient

  • closer to -1: more strongly the variables move in the opposite direction

  • closer to 1: more strongly the variables move in the same direction
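A minimal sketch of computing the (Pearson) correlation coefficient in pure Python; the data points are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient: covariance scaled to the range [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# variables moving in the same direction give r close to 1
r = pearson_r([1, 2, 3], [2, 4, 6])
```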

8
New cards

Regression analysis

  • What is the effect of the IV on the DV?

  • e.g. what is the relationship between the sales of my product and the properties of my locations?

  • DV/Y is predicted by IV/X

    • the DV is the variable to be explained

    • the IV is the explanatory variable

9
New cards

Ordinary Least Squares/OLS

  • the line of best fit

  • minimises the prediction errors

  • looks at how close the points are to the regression line (the residuals) → gives you the line with the least amount of prediction error possible for a straight line

10
New cards

Linear regression model

  • make predictions within data or outside of it

  • can summarise how predictions or average values of an outcome vary across observations defined by a set of predictors

  • y = value based on fitted line + distance from fitted line

    • y = β0 + β1x + ε

    • We say that y is related to an intercept (constant), a variable x and an error term

β0 and β1 are the parameters

  • These need to be estimated

  • β1 provides information about how y and x are related to each other → estimating this relationship is what regression models do
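Estimating β0 and β1 can be sketched with the closed-form OLS formulas: the slope is the covariance of x and y over the variance of x, and the line passes through the point of means. The data below are hypothetical.

```python
def ols_fit(xs, ys):
    """Estimate intercept (beta0) and slope (beta1) by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # beta1 = covariance of x and y divided by the variance of x
    beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    beta0 = my - beta1 * mx  # the fitted line passes through (mean x, mean y)
    return beta0, beta1

# hypothetical data generated by y = 1 + 2x with no noise
beta0, beta1 = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

With noiseless data the estimates recover the true parameters exactly.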

11
New cards

Interpreting linear regressions

  • Small p-value: evidence for a significant relationship between x & y → interpret the coefficient

  • Large p-value: no evidence for a significant relationship between x & y → don’t interpret the coefficient

12
New cards

Binary variables

  • variable that takes either of two values

Including in regression analysis

  • Process: Transform into a “dummy” or “indicator” variable where one category = 0 and the other = 1 (if not already 0/1 scaled) and include that variable in your model

    • The 2 categories have numerical values of 0 or 1
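The transformation into a dummy variable is one line of Python; the category names here are made up.

```python
def to_dummy(values, one_category):
    """Code a two-category variable as a 0/1 dummy for use in a regression."""
    return [1 if v == one_category else 0 for v in values]

# hypothetical example: "urban" becomes 1, "rural" becomes 0
dummies = to_dummy(["urban", "rural", "urban"], one_category="urban")
```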

13
New cards

R2/ R squared

  • Percentage of the variation in y that is explained by x

  • Between 0% and 100% (e.g. 28%) / a number between 0 and 1 (e.g. 0.28)

  • a goodness-of-fit measure

  • tells us how much the fit of the model is improved

  • indicates the % of variance in the DV that the IVs can explain collectively

  • Higher R² = the better the model fits the observations

    • more variance explained = data points are closer to the line

    • the variation around the line is reduced

    • x helps to predict y
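R² can be computed directly from the residuals: one minus the residual sum of squares over the total sum of squares. A sketch, assuming we already have fitted values:

```python
def r_squared(ys, fitted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)                  # total variation in y
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained part
    return 1 - ss_res / ss_tot

# perfect predictions explain 100% of the variation
r2 = r_squared([1, 2, 3], [1, 2, 3])
```

Predicting the mean for every observation gives R² = 0, the baseline the model must beat.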

14
New cards

Causality

  • how does a change in X affect Y?

  • if one changes, the other changes as a result

  • Correlation is not causation

15
New cards

Why is the difference between correlation and causality so important?

  • it is useful to establish causality to know what the causal effect of a variable is

    • e.g. in the medical world

    • policy: to know the impact of a policy change

16
New cards

Ways to establish causality

  • experiments: a treatment and a control group; measure a certain outcome and compare it between the groups

17
New cards

Causal claim

  • a direction

  • X is doing something to Y

    • stronger than saying X & Y are merely associated/correlated/related

18
New cards

Non causal claims

  • no direction or specification which of X & Y goes first

  • describes how the variables move together

19
New cards

Reverse causality

  • the causal direction runs from Y to X (instead of X to Y)

20
New cards

Omitted variable

  • a third variable Z is related to both X & Y and plays a big role in the observed association

21
New cards

Logistic regression

  • used to predict a categorical outcome: 1 vs 0

  • better than linear regression here because probabilities are predicted instead of values that can fall outside 0 and 1

  • the S-shaped curve makes sure that we have predictions between 0 and 1

22
New cards

Interpreting logistic regression

  • 1 DV and multiple IVS

  • Positive: we expect Y=1 to become more likely as X increases

  • Negative: we expect Y=1 to become less likely as X increases
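The S-shaped (logistic) curve behind these interpretations can be sketched as follows; the coefficient values are hypothetical.

```python
import math

def predicted_probability(beta0, beta1, x):
    """Logistic regression prediction: always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * x)))

# with a positive coefficient, Y=1 becomes more likely as x increases
p_low = predicted_probability(0.0, 1.5, x=1)
p_high = predicted_probability(0.0, 1.5, x=2)
```

At β0 + β1x = 0 the predicted probability is exactly 0.5, the usual classification cut-off.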

23
New cards

K nearest neighbours

  • take an individual A, look in the database for similar individuals (the neighbours), and copy the neighbours' outcome to A

  • the x and y axes could represent any characteristics

  • based on those characteristics, compare the individual to everybody else in the database and place them accordingly → find the k persons most similar to A (e.g. the 3 closest; it is not always 3)

  • make the prediction for A from the neighbours: for the k neighbours, determine the percentage of points that belong to category 1 (e.g., eviction); at least 50%? → classify as category 1

  • the neighbourhood can be pictured as a circle around A

  • k is the basis of the comparison; the right k depends on the domain

  • the new example is ALWAYS placed after the existing data
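The procedure above, sketched in pure Python; the points, labels and k = 3 are made-up example values.

```python
import math

def knn_predict(points, labels, query, k=3):
    """Classify query as category 1 if at least 50% of its k nearest neighbours are 1."""
    nearest = sorted(range(len(points)),
                     key=lambda i: math.dist(points[i], query))[:k]
    share_of_ones = sum(labels[i] for i in nearest) / k
    return 1 if share_of_ones >= 0.5 else 0

# made-up data: three points per category, in two clusters
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(points, labels, query=(5.5, 5.5), k=3)
```

A query near the (5, 5) cluster is classified as 1; one near the origin as 0.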

24
New cards

how large K should be?

  • depends on the prediction performance of your model

  • does the prediction change a lot?

    • try different k values and see what outcomes you get

  • too small a k has limited predictive power and can fit noise: overfitting

  • too big a k can miss the trend

    • it is a trade-off; you need to find the balance

25
New cards

SVM?

  • support vector machines

  • search for the hyperplane (line) that separates the 2 groups as well as possible, by looking for the border cases closest to the line on both sides (the support vectors)

  • goal: make that distance (the margin) as large as possible so the groups are separated properly → the further apart, the better

  • domain specific

  • depends on the characteristics of the data set; the groups need to be separable

26
New cards

Decision tree

The goal of a decision tree is to go from very impure data to minimum impurity, splitting until arriving at the final tree with 1/0 leaves.

27
New cards

Impurity

  • a measure of heterogeneity: how mixed the data are at each step

  • Impurity has its maximum value (= 0.50) when the observations are evenly distributed among the categories (50% in category 1; 50% in category 0)

  • Impurity is 0 when all observations belong to 1 category (100% in either category 1 or category 0)

  • splitting decreases it from impure to pure

  • picture pluses and circles (the two categories): a mixture of both is impure

Very impure

  • 50 of each

  • maximum impurity

  • maximum value of .5
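The measure described above, for two categories, matches the Gini impurity 2p(1−p): 0 when pure, 0.50 at a 50/50 mix. A one-line sketch:

```python
def impurity(labels):
    """Gini impurity for a 0/1 variable: 0 when pure, 0.50 at a 50/50 mix."""
    p = sum(labels) / len(labels)  # share of observations in category 1
    return 2 * p * (1 - p)

# 50 observations of each category -> maximum impurity of 0.5
half_and_half = impurity([1] * 50 + [0] * 50)
```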

28
New cards

Training and testing

  • prediction is probably the most interesting phase

  • the testing set is used for prediction purposes

29
New cards

K fold cross validation

  • way to evaluate the model’s performance

30
New cards

K fold cross validation process

  • take all available data for training the model & split the set into k parts

    • e.g. divide the dataset into 5 parts

    • take one part out and train the model using the remaining k−1 parts (with 5 parts, the 4 training parts are 80% of the data; the withheld part is used to predict)

    • the trained model's predictions are compared with the actual labels of the withheld part

    • this is repeated until each of the k parts has been withheld once

  • makes full use of the data for prediction

  • cross validation is also used to

    • determine the optimal parameters of the system
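The splitting step can be sketched as below; for simplicity the sketch assumes n is divisible by k, as in the 5 × 20% example.

```python
def k_fold_splits(n, k=5):
    """Yield (train_indices, test_indices) once per fold; each part is withheld exactly once."""
    fold_size = n // k
    for fold in range(k):
        test = list(range(fold * fold_size, (fold + 1) * fold_size))
        train = [i for i in range(n) if i not in test]
        yield train, test

# with n=10 and k=5, every fold trains on 8 observations and tests on 2
splits = list(k_fold_splits(10, k=5))
```

Across the k folds, every observation appears in a test set exactly once, so all the data is used for prediction.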

31
New cards

Accuracy

  • percentage of observations that are true positives or true negatives

32
New cards

Sensitivity

  • percentage of true positives within category 1

33
New cards

Specificity

  • percentage of true negatives within category 0

34
New cards

Model evaluation: confusion matrix

  • it shows the correct and incorrect classifications

  • looks at the no. of observations in each cell

35
New cards

In practice for confusion matrix

  • compare the actual 1/0 variable to the predicted values

  • false negative → predicted 0 when it is supposed to be 1

  • false positive → predicted 1 when it is supposed to be 0; the prediction does not match reality

  • true negatives and positives are the good predictions we care about; add them together and divide by the total no. of observations → accuracy

  • are we always interested in high accuracy? if so, why or why not?

  • always use true positives and negatives together and compare them to the total
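Counting the four cells of the confusion matrix, and the accuracy, sensitivity and specificity defined on the earlier cards, can be sketched as follows; the label lists are made up.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) from two 0/1 label lists."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

def accuracy(actual, predicted):
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all observations

def sensitivity(actual, predicted):
    tp, _, _, fn = confusion_counts(actual, predicted)
    return tp / (tp + fn)  # true positives within category 1

def specificity(actual, predicted):
    _, tn, fp, _ = confusion_counts(actual, predicted)
    return tn / (tn + fp)  # true negatives within category 0
```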

36
New cards

recall/precision/f1score

  • checks whether the 1s are predicted well

37
New cards

recall formula

  • true positives / (true positives + false negatives)

38
New cards

Precision formula

  • true positives / (true positives + false positives)

39
New cards

F1 score

  • used to evaluate performance

  • a harmonic mean between precision and recall
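Precision, recall and their F1 harmonic mean from the confusion-matrix counts; the count values below are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = tp/(tp+fp); recall = tp/(tp+fn); F1 = their harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts: 8 true positives, 2 false positives, 2 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

As a harmonic mean, F1 is only high when precision and recall are both high.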

40
New cards

F1 formula

F1 = 2 × (precision × recall) / (precision + recall)
41
New cards

Precision and F1

  • Precision: percentage of true positives within the category of predicted 1s

  • the percentage of cases with a specific outcome classified correctly by the system

42
New cards

Process of precision

  • predicted and actual case outcomes are compared

  • True positive: the actual value says I need it and the model agrees → MATCHES

  • True negative: the actual value says I don't need it and the model agrees

  • False negative: the actual value tells me I need it, but the model says I don't

  • False positive: what the model predicts is not real

43
New cards

Accuracy

  • general measure of prediction quality

  • counts the number of correctly predicted observations divided by the total number of observations

Formula

  • (true positives + true negatives) / all observations in the table

44
New cards

Overfitting

  • the model explains the data very well but does not predict well
