data science

67 Terms

1

population

the set of all possible observations of interest to the problem at hand

2

sample

subset of population containing objects or outcomes that have actually been observed

3

parameter

describes a population (mean, standard deviation)

4

statistic

describes a sample

5

probability sampling

random selection

6

non-probability sampling

selection is not random, e.g. based on convenience

7

sampling with replacement

each data unit in the population can reappear in the sample

8

sampling without replacement

each data unit in the population can appear at most once in the sample
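
A minimal sketch of the two sampling schemes, using Python's standard `random` module. The population of ten labelled units and the seed are made up for illustration.

```python
import random

population = list(range(1, 11))  # hypothetical population of 10 labelled units
rng = random.Random(0)           # fixed seed so the sketch is repeatable

# With replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: each unit can be drawn at most once.
without_replacement = rng.sample(population, k=5)
```

`choices` may return repeated units, while `sample` never does.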

9

simple random sampling

every unit in the population has an equal chance of being selected

10

stratified sampling

divide the population into homogeneous groups (strata), then sample from each group

11

cluster sampling

divide the population into heterogeneous clusters, then randomly sample whole clusters
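
A rough sketch contrasting stratified and cluster sampling. The helper names `stratified_sample` and `cluster_sample`, the toy units, and the group labels are all hypothetical. The cluster version keeps every unit of each chosen cluster, which is one common variant.

```python
import random
from collections import defaultdict

def group_by(units, key):
    """Collect units into lists keyed by key(unit)."""
    groups = defaultdict(list)
    for u in units:
        groups[key(u)].append(u)
    return groups

def stratified_sample(units, stratum_of, per_stratum, seed=0):
    """Draw up to per_stratum units at random from every stratum."""
    rng = random.Random(seed)
    sample = []
    for members in group_by(units, stratum_of).values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

def cluster_sample(units, cluster_of, n_clusters, seed=0):
    """Pick n_clusters clusters at random and keep all of their units."""
    rng = random.Random(seed)
    clusters = group_by(units, cluster_of)
    chosen = rng.sample(list(clusters), n_clusters)
    return [u for c in chosen for u in clusters[c]]

# Toy data: (id, group label); 6 units in group "A", 4 in group "B".
units = [(i, "A" if i < 6 else "B") for i in range(10)]
stratified = stratified_sample(units, lambda u: u[1], per_stratum=2)
clustered = cluster_sample(units, lambda u: u[1], n_clusters=1)
```

Stratified sampling takes some units from every group, while cluster sampling takes every unit from only some groups.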

12

why sampling is important

lets results generalise to the population

prevent bias

reduce computation cost

13

cross validation

a sampling technique used during assessment phase to assess how well the model generalises to unseen data

14

how a cross validation works

split into k fold

train on k-1 folds

validate on the remaining fold

repeat k times

average performance
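
The steps above can be sketched in plain Python. This is only an illustration: `fit` and `score` are hypothetical callables supplied by the caller, and the folds are taken contiguously without shuffling.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of nearly equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, validate on the held-out fold, average the k scores."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return sum(scores) / k

# Toy model (an assumption for the demo): always predict the training mean of y.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def mse(model, val_x, val_y):
    return sum((y - model) ** 2 for y in val_y) / len(val_y)

xs = list(range(10))
ys = [3.0] * 10  # constant target, so the mean predictor is exact
average_score = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse)
```

In practice the data are usually shuffled before the split.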

15

aspects of data quality

accuracy, completeness, consistency, timeliness, validity, uniqueness

16

variance

{(x1-mean)² + (x2-mean)² + … + (xn-mean)²} / (n-1)

17

standard deviation

square root of variance
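
A small worked check of the sample variance (divisor n-1) and standard deviation. The data values are made up. The hand calculation is compared against the standard `statistics` module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(data)
mean = sum(data) / n

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum((x - mean) ** 2 for x in data) / (n - 1)
sample_std = math.sqrt(sample_variance)

# The standard library uses the same n - 1 divisor for variance() and stdev().
library_variance = statistics.variance(data)
library_std = statistics.stdev(data)
```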

18

histogram

displays the frequency with which values occur in data

19

Lq

(n+3)/4 if n is odd

(n+2)/4 if n is even

20

Q1

X(Lq) if Lq is an integer

{X(Lq-0.5) + X(Lq+0.5)} / 2 if Lq is not an integer

21

Q3

If Lq is an integer: X(n+1-Lq)

If Lq is not an integer:

{X(n+1-Lq-0.5) + X(n+1-Lq+0.5)} / 2

22

fences

step = 1.5(Q3-Q1)

UIF = Q3+step

LIF = Q1- step
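
A sketch that strings the Lq, Q1, Q3 and fence cards together, assuming X(i) means the i-th smallest value (1-based) and reading the Lq rule as (n+3)/4 for odd n and (n+2)/4 for even n. The function names and the data are hypothetical.

```python
def quartiles(values):
    """Return (Q1, Q3) using the Lq position rule on the sorted data."""
    x = sorted(values)
    n = len(x)
    lq = (n + 3) / 4 if n % 2 == 1 else (n + 2) / 4

    def at(pos):
        # pos is a 1-based position that is either an integer or ends in .5
        if pos == int(pos):
            return x[int(pos) - 1]
        lower, upper = int(pos - 0.5), int(pos + 0.5)
        return (x[lower - 1] + x[upper - 1]) / 2

    return at(lq), at(n + 1 - lq)

def inner_fences(values):
    """Return (LIF, UIF) with step = 1.5 * (Q3 - Q1)."""
    q1, q3 = quartiles(values)
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data with one extreme value
lif, uif = inner_fences(data)
outliers = [v for v in data if v < lif or v > uif]
```

Values outside the inner fences are the ones usually flagged as outliers.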

23

Outlier formula

any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1

24

what statistics to use on symmetrical data with no outliers

standard deviation, mean

25

what to use on skewed data

median, quartiles

26

sample space

set of all possible outcomes of an experiment

27

event

sub set of the sample space

28

product rule

P(A and C) = P(A|C) × P(C), so P(A|C) = P(A and C)/P(C)

29

addition rule

P(A or C)=P(A)+P(C)-P(AandC)
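
A quick check of the addition rule, and of the conditional form P(A|C) = P(A and C)/P(C), by enumerating a fair six-sided die. The events A (even roll) and C (roll above 3) are chosen only for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # fair die, all outcomes equally likely
A = {2, 4, 6}                      # event: roll is even
C = {4, 5, 6}                      # event: roll is greater than 3

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

p_a_or_c = prob(A | C)             # direct count of the union
p_a_and_c = prob(A & C)            # direct count of the intersection
addition_rule = prob(A) + prob(C) - p_a_and_c
p_a_given_c = p_a_and_c / prob(C)  # conditional probability
```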

30

system failure (mutually exclusive)

P(System fail)= P(b1)+P(b2)…

31

parallel systems (all components must fail)

P(system failure) = P(b1)*P(b2)… (assuming components fail independently)
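
A small numeric sketch of the two failure rules. The component failure probabilities are made up. The sum assumes mutually exclusive failure events and the product assumes independent components.

```python
import math

component_failure = [0.01, 0.02, 0.03]  # hypothetical P(b1), P(b2), P(b3)

# System fails if any one component fails, and the failure events are
# mutually exclusive: add the probabilities.
p_fail_exclusive = sum(component_failure)

# Parallel system: it fails only if every component fails
# (independence assumed): multiply the probabilities.
p_fail_parallel = math.prod(component_failure)
```

Adding redundant components in parallel drives the system failure probability down, since each extra factor is below 1.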

32

probability density function(PDF)

shows where the variable is more or less likely to fall

33

cumulative distribution function

probability that the random variable is less than or equal to a certain value

34

binomial distribution

fixed number of trials

each trial has two possible outcomes

probability of success is equal for each trial

you are counting number of successes

35

uniform distribution

all outcomes are equally likely

can be discrete or continuous

e.g. rolling a fair die

36

poisson distribution

measuring the number of events that occur within a fixed interval of time, space or area

events occur independently

average rate of occurrence is constant

37

exponential distribution

measuring the time or distance between events

events occur continuously and independently at a constant average rate

always positive values

38

binomial distribution mean and variance

mean= n*p

variance=n*p(1-p)

n=number of trials

p=probability of success in each trial
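
A numeric check of mean = n*p and variance = n*p(1-p), computed directly from the binomial probabilities. The values n = 10 and p = 0.3 are arbitrary.

```python
import math

n, p = 10, 0.3  # hypothetical number of trials and success probability

# P(X = k) = C(n, k) * p^k * (1 - p)^(n - k) for k = 0..n
pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
```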

39

uniform distribution mean and variance

mean=(a+b)/2

discrete (integers from a to b): variance = ((b-a+1)² - 1)/12

continuous (x uniformly distributed between a and b): variance = (b-a)²/12
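
A worked check of the discrete case on a fair die (a = 1, b = 6), using exact fractions. The discrete variance formula used here is ((b-a+1)² - 1)/12.

```python
from fractions import Fraction

a, b = 1, 6                # fair die: integers 1..6, all equally likely
values = range(a, b + 1)
count = b - a + 1

mean = Fraction(sum(values), count)
variance = sum((Fraction(v) - mean) ** 2 for v in values) / count

formula_mean = Fraction(a + b, 2)
formula_variance = Fraction(count**2 - 1, 12)
```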

40

Poisson mean and variance

mean and variance = lambda

lambda = average number of events per interval
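
A numeric sanity check that the Poisson mean and variance both equal lambda. The rate lambda = 2 is arbitrary, and the infinite sum is cut off at k = 60, where the remaining probability is negligible.

```python
import math

lam = 2.0  # hypothetical average number of events per interval

def poisson_pmf(k, lam):
    """P(X = k) = exp(-lam) * lam^k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

ks = range(61)  # truncated support; the tail beyond 60 is negligible here
mean = sum(k * poisson_pmf(k, lam) for k in ks)
second_moment = sum(k * k * poisson_pmf(k, lam) for k in ks)
variance = second_moment - mean**2
```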

41

exponential mean and variance

mean=1/lambda

variance=1/lambda²

lambda = events per unit time

42

normal distribution

continuous data

bell shaped curve

values cluster around the mean

43

bayes theorem

P(A|B) = P(B|A)P(A)/P(B)
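
A worked example of Bayes' theorem with made-up numbers for a diagnostic test: 1% prevalence, 95% sensitivity, 5% false-positive rate. P(B) is expanded with the law of total probability.

```python
import math

p_disease = 0.01            # hypothetical P(A): prior probability of disease
p_pos_given_disease = 0.95  # hypothetical P(B|A): test positive if diseased
p_pos_given_healthy = 0.05  # hypothetical P(B|not A): false-positive rate

# P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

With these numbers the posterior is far below the 95% sensitivity, because the condition is rare to begin with.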

44

probability plots

graphical tool to determine if a set of empirical observations comes from a population with a given theoretical distribution

compare CDF calculated for sampled values with the theoretical CDF

scatter plot is made to compare values

45

estimate the CDF from the data with n observations

F(xi) = P(X ≤ xi) = (number of observations ≤ xi) / n
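
A minimal sketch of the empirical CDF: the fraction of the n observations that are less than or equal to a given value. The function name `ecdf` and the data are illustrative.

```python
def ecdf(data, x):
    """Fraction of observations that are less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)

data = [3, 1, 4, 1, 5]  # hypothetical observations, n = 5
estimates = {x: ecdf(data, x) for x in sorted(set(data))}
```

Plotting these estimates against the theoretical CDF at the same points gives the comparison used in a probability plot.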

46

sign test

non parametric

test for median of a random sample or median of the difference of two random samples
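
A sketch of a two-sided sign test for a hypothesised median m0, under the usual assumptions: observations equal to m0 are dropped, and under the null hypothesis the number of values above m0 follows a Binomial(n, 0.5) distribution. The function name and the data are hypothetical.

```python
from math import comb

def sign_test_p_value(data, m0):
    """Two-sided sign test p-value for H0: median = m0."""
    plus = sum(1 for v in data if v > m0)   # observations above m0
    minus = sum(1 for v in data if v < m0)  # observations below m0
    n = plus + minus                        # ties with m0 are discarded
    if n == 0:
        return 1.0
    k = min(plus, minus)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

data = [12, 15, 11, 18, 14, 16, 13, 17, 19, 20]  # hypothetical sample
p_far = sign_test_p_value(data, m0=10)      # every value lies above 10
p_centre = sign_test_p_value(data, m0=15.5) # five above, five below
```

For the difference of two paired samples, the same function can be applied to the differences with m0 = 0.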


47

null hypothesis

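the default assumption (e.g. no effect or no difference), taken to be correct unless the data give enough evidence against it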

48

p<=0.05

reject the null hypothesis

49

p>0.05

fail to reject the null hypothesis

50

issues with a small sample

the central limit theorem cannot be relied on to make the sample mean approximately normal

standard deviation of the sample is not a good approximation of population standard deviation

51

solution to small sample size (parametric analysis)

if population is approximately normal, calculation of confidence intervals can be performed using student t distribution

52

student t distribution

t(n-1) = √n(x̄ - μ)/s, where x̄ = sample mean, μ = population mean, s = sample standard deviation

53

for a 2 sided confidence interval the confidence interval length L=

L = 2 × t(n-1, α/2) × s/√n
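
A worked sketch of a two-sided 95% confidence interval for the mean of a small sample. The data are made up. The critical value 2.776 for 4 degrees of freedom is hard-coded from a standard t table rather than computed.

```python
import math
import statistics

sample = [9.8, 10.2, 10.4, 9.9, 10.2]  # hypothetical sample, n = 5
n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)           # sample standard deviation (n - 1 divisor)

# Assumption: t(n-1, alpha/2) for n - 1 = 4 and alpha = 0.05,
# taken from a printed t table.
t_crit = 2.776

half_width = t_crit * s / math.sqrt(n)
lower, upper = mean - half_width, mean + half_width
length = 2 * t_crit * s / math.sqrt(n)  # L = 2 * t * s / sqrt(n)
```

The interval is only trustworthy if the population is approximately normal.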

54

simple regression model

yi = B0 + B1xi + Ei

yi = observations of the dependent variable

xi = observations of the independent variable

Ei = error term

B0 = y-intercept and B1 = slope of the true model

55

how to find the line of best fit

using calculus to minimise the sum of squared residuals, the solution is

b1=SSxy/SSxx

b0=ȳ-b1x̄

x̄ and ȳ are sample means
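
A small worked example of these formulas, taking SSxx as the sum of (xi - x̄)² and SSxy as the sum of (xi - x̄)(yi - ȳ), which are the standard definitions. The five data points are made up.

```python
import math

xs = [1, 2, 3, 4, 5]  # hypothetical independent-variable observations
ys = [2, 4, 5, 4, 5]  # hypothetical dependent-variable observations

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = ss_xy / ss_xx       # estimated slope
b0 = y_bar - b1 * x_bar  # estimated intercept

def predict(x):
    """Fitted value on the least-squares line."""
    return b0 + b1 * x
```

The fitted line always passes through the point (x̄, ȳ).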

56

SSxx and SSxy

SSxx = Σ(xi - x̄)²

SSxy = Σ(xi - x̄)(yi - ȳ)

57

how to measure error

residual error: ei = yi - ŷi (observed value minus fitted value)

58

mean squared error MSres

SSres/(n-k)

n = number of pairs of observations

k = number of estimated parameters (2 for simple regression)

n-k = error degrees of freedom

59

SStot (total sum of squares)

SStot = Σ(yi - ȳ)²

60

SSreg (Regression sum of squares)

SSreg = Σ(ŷi - ȳ)²

61

SSres (sum of squared residuals)

SSres = Σ(yi - ŷi)²

62

R²(Proportion of total variation in y accounted for by regression line)

R² = SSreg/SStot = 1 - SSres/SStot

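
A sketch computing the sums of squares and R² for a small least-squares fit, using the standard definitions: SStot from (yi - ȳ), SSreg from (ŷi - ȳ), SSres from (yi - ŷi). The data are made up, and the mean squared error line assumes k = 2 estimated parameters.

```python
import math

xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

ss_tot = sum((y - y_bar) ** 2 for y in ys)                # total variation in y
ss_reg = sum((f - y_bar) ** 2 for f in fitted)            # explained by the line
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # left in the residuals

r_squared = ss_reg / ss_tot
ms_res = ss_res / (n - 2)  # assumption: k = 2 parameters (intercept and slope)
```

For a least-squares line with an intercept, SStot = SSreg + SSres, so R² can equally be written as 1 - SSres/SStot.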
63

Goodness of fit statistics-adjusted coefficient of determination

knowt flashcard image
64

Distributions of parameters

provided the residuals are normally distributed N(0,sigma²) the standard deviations are

65

confidence intervals of parameters B0 and B1

knowt flashcard image
66

distributions of parameters B0 B1

knowt flashcard image
67

confidence intervals of regression line

knowt flashcard image