Sociology 113: Midterm Review


90 Terms

1

Nature of Statistics

Statistics is the science of methods for interpreting empirical data

The study of variation

2

RStudio

R is the engine and RStudio is the interface

3

Loading and Processing Data

R is an object-oriented language

Functions are verbs and objects are nouns

4

Language and conventions

Case-sensitive

5

Functions, Objects, and operators

<- assigns a value to an object

* is multiplication

== tests whether two values are equal

!= tests whether x is not equal to y (! on its own negates a logical value)
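
A minimal sketch of these operators in use. The object names and values are made up for illustration.

```r
# Hypothetical values chosen only for illustration
x <- 4                  # <- assigns the value 4 to the object x
y <- 5
product <- x * y        # * multiplies
is_equal <- x == y      # == asks whether the two values are equal
not_equal <- x != y     # != asks whether the two values differ
negated <- !is_equal    # ! on its own flips TRUE to FALSE and vice versa
print(product)
print(not_equal)
```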

6

Using packages

Packages are like apps on a smartphone: install once with install.packages(), then load with library() in each session

7

Data Structures

Scalar, Vector, Data Frame, Matrix, List

8

Scalar

A scalar holds a single value (numeric, integer, character, or logical); it cannot hold multiple values of the same or different types

A scalar variable can represent a single number, a single text string, a single logical value (TRUE or FALSE), or a single integer value

9

Vector

c() stands for combine: it combines values into a vector, e.g. c(2, 9, 9, 3)

c() can also combine vectors that already exist, e.g. c(v, v)

v[2] gives the second element

v[1:5] gives elements 1 through 5

Logical vectors are one-dimensional structures whose values are either TRUE or FALSE
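
A short sketch of building and indexing a vector. The names v and w and the extra values 7 and 1 are illustrative assumptions.

```r
v <- c(2, 9, 9, 3)       # c() combines values into a vector
w <- c(v, c(7, 1))       # c() can also combine existing vectors
second <- w[2]           # second element
first_five <- w[1:5]     # elements 1 through 5
big <- w > 5             # logical vector: TRUE where the value exceeds 5
print(big)
```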

10

Data Frame

Like an Excel table

Combines several vectors of equal length into the columns of one table

Access a column with the $ sign, e.g. df$age

Column names can be changed, e.g. with names()
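
A small sketch of those steps. The students data frame and its columns are invented for the example.

```r
# Hypothetical data: three students and their ages
name <- c("Ana", "Ben", "Cy")
age <- c(19, 21, 20)
students <- data.frame(name, age)   # each vector becomes a column
ages <- students$age                # $ pulls out one column as a vector
names(students)[2] <- "age_years"   # rename the second column
print(students)
```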

11

Matrix

A table of rows and columns

A matrix holds values of a single type, unlike a data frame, whose columns can differ in type

Used to organize and work with structured data

12

list

a list is a versatile data structure that can hold elements of different data types such as numeric, character, logical, vectors, matrices, data frames, and even other lists. Lists are similar to vectors, but unlike vectors, the elements of a list can be of different types
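
To make the contrast concrete, here is a sketch with a matrix (one type throughout) stored inside a list (mixed types). All names and values are illustrative.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows x 3 columns, every cell is an integer
# A list can hold a number, a string, a logical vector, and the matrix itself
info <- list(id = 7, label = "a", flags = c(TRUE, FALSE), grid = m)
str(info)
```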

13

Descriptive statistics

Statistics is the study of variation and the analysis of patterns; a variable is so called because its values vary

14

Numerical

Continuous, discrete, ordinal

15

Continuous

Can take any value within a range, e.g. how much you would pay for a slice of pizza (the possible values continue indefinitely)

16

Discrete

Whole numbers only (counts)

17

Ordinal

Ordinal data has a clear sequence or hierarchy

Examples: education level, economic status, agreement scales (agree, strongly agree, etc.), pain scales

18

Center

Mean (average): a single value that models the entire variable

Median: the middle value when the data are sorted

Salaries are better summarized by the median than the mean, because a few very large values pull the mean upward
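
A quick illustration of the salary point, using made-up salaries (in thousands) with one very large value:

```r
# Hypothetical salaries in thousands; the last one is an extreme value
salaries <- c(40, 45, 50, 55, 60, 1000)
salary_mean <- mean(salaries)       # pulled upward by the 1000
salary_median <- median(salaries)   # stays near the typical salary
print(c(mean = salary_mean, median = salary_median))
```

The median stays close to what most people earn, while the mean is far above every salary except the extreme one.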

19

Spread

range, standard deviation, deviation, variance

20

Range

max - min (not the best measure because it is sensitive to outliers)

Different from the IQR

A simple measure of variability that is affected by extreme values in the dataset. It does not provide information about the distribution of values within the dataset or the central tendency.

21

Standard deviation

The square root of the variance

A measure of the dispersion or spread of a dataset. It quantifies the amount of variation in a set of values, in the same units as the data

22

Variance

The average squared deviation of the values from the mean (the mean is the sum of all values divided by the number of values)

Another measure of the spread or dispersion of a dataset. It is closely related to the standard deviation (the standard deviation is its square root) and provides information about how much the values in a dataset deviate from the mean.
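
A sketch of the calculation by hand, checked against R's built-in var() and sd(), which divide by n - 1 (the sample formulas). The data are made up.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)        # hypothetical data
deviations <- x - mean(x)             # how far each value is from the mean
sample_var <- sum(deviations^2) / (length(x) - 1)   # average squared deviation (n - 1)
sample_sd <- sqrt(sample_var)         # standard deviation = square root of variance
print(c(by_hand = sample_var, built_in = var(x)))
print(c(by_hand = sample_sd, built_in = sd(x)))
```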

23

Five number summary

Lowest value, lower quartile, median, upper quartile, highest value
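
A sketch of getting these five numbers, plus the range and IQR, in base R. The data are illustrative. Note that fivenum() uses hinges, which can differ slightly from the quartiles returned by quantile() on some datasets.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)   # hypothetical data
five <- fivenum(x)                  # min, lower hinge, median, upper hinge, max
data_range <- max(x) - min(x)       # range = max - min
data_iqr <- IQR(x)                  # interquartile range = Q3 - Q1
print(five)
```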

24

Categorical

binomial/dichotomous, nominal, ordinal

25

Binomial/Dichotomous

Two categories only, e.g. yes or no

26

Nominal

Categories with no inherent order, e.g. which option best describes your race or ethnicity

27

Ordinal

Ordered categories, e.g. highest level of education completed

28

Relative frequencies and proportions

Relative frequency is the proportion of times a particular value occurs in a dataset relative to the total number of observations in the dataset

Proportions are ratios that compare a part to the whole, expressing how much of a dataset belongs to a specific category relative to the total dataset

In short: a relative frequency is a count divided by the total number of observations, and a proportion compares a part to the whole
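
A sketch of computing counts and relative frequencies with table() and prop.table(). The survey answers are invented.

```r
answers <- c("yes", "no", "yes", "yes", "no")   # hypothetical survey answers
counts <- table(answers)          # how many times each value occurs
rel_freq <- prop.table(counts)    # each count divided by the total
print(rel_freq)
```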

29

Two-way tables

Two-way tables are useful for analyzing relationships between categorical variables and identifying patterns or associations in the data.

table() can also cross more than two variables to create multi-way tables
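
A minimal two-way table sketch. The variables year and works are hypothetical.

```r
# Hypothetical answers from five students
year <- c("first", "first", "second", "second", "second")
works <- c("yes", "no", "yes", "yes", "no")
tab <- table(year, works)                   # counts for each combination
row_props <- prop.table(tab, margin = 1)    # proportions within each row
print(tab)
print(row_props)
```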

30

Different types of distributions

Normal, Skewed, Exponential, Uniform

31

Normal

A normal distribution is like a symmetrical bell, with most data clustered in the middle and fewer data points as you move away from the center. It's smooth, balanced, and described by its mean and standard deviation.

32

Skewed

The majority of data points cluster towards one side, causing the curve to be asymmetrical

Right skewed: mode < median < mean (the long tail is on the right)

Left skewed: mean < median < mode (the long tail is on the left)

33

Exponential

Describes the waiting time between events that happen randomly over time, with a rapid decline in probability as time progresses

34

Uniform

A uniform distribution is a probability distribution where all values within a given range are equally likely to occur

35

Principles of effective visualizations

The data need to be in a .csv file: studata <- data.frame(read.csv("studata.csv"))
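
A self-contained sketch of loading a .csv file. So that it runs anywhere, it first writes a small made-up file to a temporary folder; the file contents are assumptions for illustration.

```r
# Hypothetical file contents, written to a temporary folder
csv_path <- file.path(tempdir(), "studata.csv")
write.csv(data.frame(sleephrs = c(6, 7, 8), yrsincoll = c(1, 2, 3)),
          csv_path, row.names = FALSE)

studata <- read.csv(csv_path)   # read.csv() already returns a data frame
str(studata)                    # check the columns and their types
```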

36

Histogram

a histogram is a graphical representation of the distribution of numerical data. It divides the data into intervals called "bins" and counts the number of observations that fall into each bin. The height of each bar in the histogram represents the frequency or relative frequency of observations in that bin.

37

Histogram (hist())

Code: hist(yrsincoll), with the variable inside the parentheses

R picks the axes that seem best, but you can change them

Tells you how many times each value shows up in the data

Breaking down the code: hist(yrsincoll, main = "Number of years my closest friends have spent in college", xlab = "Number of years in college", ylab = "Number of friends", xlim = c(0,5), col = "pink")

main: main title; xlab: x-axis label; ylab: y-axis label; xlim: limits of the x-axis; col: color (must be in quotes)
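
A runnable version of that call, with made-up values for yrsincoll. hist() also returns an object whose counts can be inspected.

```r
yrsincoll <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)   # hypothetical data
h <- hist(yrsincoll,
          main = "Number of years my closest friends have spent in college",
          xlab = "Number of years in college",
          ylab = "Number of friends",
          xlim = c(0, 5),
          col = "pink")
print(h$counts)   # number of observations that fell into each bin
```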

38

Histogram ggplots

Create a data frame from the variable

ggplot(data = testdata, aes(x = yrsincoll)) + geom_histogram()

Give the data frame and the variable you want to view, then the type of graph at the end; "+" adds layers

Layers can also be added to a saved plot object

39

Box plots

A box plot (also known as a box-and-whisker plot) is a graphical summary of the distribution of numerical data through five key summary statistics: the minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum.

Shows the median and how the data are spread out

Displays the five number summary

40

Box Plots code

Code: myboxplot <- ggplot(data = studata, aes(x = sleephrs)) + geom_boxplot()

geom_boxplot() adds the box plot layer; a box plot layer can also be added on top of a histogram

41

Pie Charts

Pie charts are useful for visualizing the relative sizes of different categories or proportions within a data set.

Not always as useful as other charts

Base R: pie()

The plotly and dplyr packages must be loaded

42

Bar charts

In R Studio, a bar chart is a graphical representation of categorical data that uses rectangular bars to represent the frequencies or proportions of different categories

ggplot2: geom_bar()

43

Scatter plots

Scatter plots are useful for visualizing the relationship between two numerical variables, identifying patterns, trends, and outliers in the data, and assessing the strength and direction of the relationship between the variables

44

Normal distributions and the central limit theorem (Gaussian)

Bell-curve shaped; this pattern happens all the time

Arises when a value is influenced by many tiny factors

With a large sample size, the distribution of sample means will be approximately normal, regardless of the original distribution's shape

Described by the mean (μ), which represents the central tendency, and the standard deviation (σ), which represents the spread or variability of the distribution

45

Identifying outliers

Values that fall outside the main distribution

Outliers are data points that significantly differ from the majority of the other data points in a dataset.

They warrant special attention in statistical analysis and interpretation.

46

Interquartile range (IQR)

Different from the range

Q3 - Q1 (the span that covers the middle half of the values)

Outliers: values more than 1.5*IQR below Q1 or more than 1.5*IQR above Q3

Q1 <- quantile(happyday$happiness, 0.25) and Q3 <- quantile(happyday$happiness, 0.75)

Code: lower_bound <- Q1 - 1.5 * IQR and upper_bound <- Q3 + 1.5 * IQR

outliers <- happyworld[(happyworld$Cantrilscore > upper_bound | happyworld$Cantrilscore < lower_bound), ]
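
The steps above can be put together as one runnable sketch. The data frame happy and its score column are hypothetical stand-ins for the course datasets.

```r
# Hypothetical scores; 30 is far from the rest
happy <- data.frame(score = c(2, 3, 4, 5, 5, 6, 7, 8, 30))

q1 <- quantile(happy$score, 0.25)    # lower quartile
q3 <- quantile(happy$score, 0.75)    # upper quartile
iqr <- q3 - q1                       # interquartile range
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Keep only the rows that fall outside the bounds
outliers <- happy[happy$score > upper_bound | happy$score < lower_bound, , drop = FALSE]
print(outliers)
```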

47

Z-scores

How many standard deviations a data point is away from the mean of a dataset. It indicates how far a data point is from the mean, in standard deviation units

To find how many standard deviations a value is from the mean, subtract the mean from the value and divide by the standard deviation; a value with a z-score above 3 or below -3 is treated as an outlier

z = (x - mean) / sd

Used for identifying outliers, comparing data points from different datasets, and standardizing data for statistical analysis

48

68-95-99.7 rule

empirical rule; statistical guideline that describes the approximate percentage of data within certain standard deviations from the mean in a normal distribution

68.27% of data points fall within 1 SD above or below the mean

95.45% of data points fall within 2 SDs above or below the mean

99.73% of data points fall within 3 SDs above or below the mean

  • Only about 0.3% of all data are more than 3 SDs away from the mean
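
These percentages can be checked with pnorm(), which gives the area under the standard normal curve to the left of a z-score. The helper name within_k_sd is made up for this sketch.

```r
# Share of a normal distribution within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

shares <- round(c(within_k_sd(1), within_k_sd(2), within_k_sd(3)), 4)
print(shares)   # 0.6827 0.9545 0.9973
```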

49

Standardizing variables

technique used to rescale variables to have a mean of 0 and a standard deviation of 1

Converting a variable to a z score

scale() converts every value into a z-score

Outliers can then be checked with z-scores (beyond ±3) instead of the IQR rule

sum(happyday$stdhappiness > 3 | happyday$stdhappiness < -3)
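
A runnable sketch of standardizing a variable and counting z-score outliers. The happiness values are invented, and scale() returns a one-column matrix, so as.numeric() is used here to turn it back into a plain vector.

```r
# Hypothetical data: thirty ordinary scores and one extreme score
happiness <- c(rep(c(4, 5, 6), times = 10), 30)

stdhappiness <- as.numeric(scale(happiness))   # (x - mean) / sd for every value
n_outliers <- sum(stdhappiness > 3 | stdhappiness < -3)
print(n_outliers)
```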

50

Range of acceptable values

refers to the acceptable boundaries or limits within which a variable or measurement is considered valid

X = (Z * SD) + Mean

Use Z = 3 and Z = -3 to get the upper and lower limits
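
A worked example of the formula, assuming a mean of 4 and a standard deviation of 1.5 (illustrative numbers):

```r
mean_x <- 4     # assumed mean
sd_x <- 1.5     # assumed standard deviation
upper_limit <- (3 * sd_x) + mean_x     # Z = 3
lower_limit <- (-3 * sd_x) + mean_x    # Z = -3
print(c(lower = lower_limit, upper = upper_limit))
```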

51

Percentiles

Percentiles are often used to understand the distribution of data and identify specific values that are typical or extreme within a dataset

pnorm(z) gives the proportion of values at or below z; 1 - pnorm(z) gives the proportion above

It returns the probability that a standard normal random variable is less than or equal to z
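
A short sketch of pnorm() at a few z-scores chosen for illustration:

```r
below_mean <- pnorm(0)        # half of the distribution lies below the mean
below_196 <- pnorm(1.96)      # about 97.5% lies below z = 1.96
above_1 <- 1 - pnorm(1)       # share of values more than 1 SD above the mean
print(round(c(below_mean, below_196, above_1), 3))
```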

52

Conceptual significance of outliers (and how to handle them)

Get rid of errors while preserving natural variation

53

Skewness

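Skewness measures whether the distribution leans off to one side; it is calculated using skewness()

Negatively skewed (left): skewness < 0; the mean is lower than the median

Symmetric (e.g. normal): skewness = 0

Positively skewed (right): skewness > 0; the mode peaks first, followed by the median and the mean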

54

Kurtosis

Kurtosis should fall between -2 and 2 for the tails to be acceptable; it is calculated with kurtosis() after loading the moments library

Indicates whether the distribution is more peaked and has heavier tails than a normal distribution (positive kurtosis), less peaked and has lighter tails than a normal distribution (negative kurtosis), or has similar peakedness and tail behavior as a normal distribution (kurtosis close to zero). These values refer to excess kurtosis; some functions report raw kurtosis instead, which is about 3 for a normal distribution.

55

Skewness and kurtosis

skewness describes the symmetry of the distribution, while kurtosis describes the shape of the distribution's tails.
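
To show what these two numbers measure, here is a base-R sketch using the standard moment-based formulas. The helper names skew_by_hand and excess_kurtosis_by_hand and the data are assumptions for illustration, not the course's functions.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skew_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second, minus 3
excess_kurtosis_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric <- c(1, 2, 3, 4, 5)         # hypothetical symmetric data
right_tail <- c(1, 1, 2, 2, 3, 10)    # hypothetical data with a long right tail

print(skew_by_hand(symmetric))        # 0 for perfectly symmetric data
print(skew_by_hand(right_tail))       # positive for a right skew
print(excess_kurtosis_by_hand(symmetric))
```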

56

Sampling and the central limit theorem

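No matter how the variable itself is distributed, the distribution of sample means is approximately normal when samples are large enough (therefore you can make predictions from sampling)

Example: flipping a coin

set.seed() fixes the random number generator so that random samples are reproducible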
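
A simulation sketch of the coin-flip example. The seed, the number of flips per sample, and the number of samples are arbitrary choices.

```r
set.seed(113)        # arbitrary seed so the simulation is reproducible
n_flips <- 100       # flips per sample (assumed)

# Each sample mean is the share of heads in 100 fair coin flips
sample_means <- replicate(5000, mean(rbinom(n_flips, size = 1, prob = 0.5)))

print(mean(sample_means))   # close to the true probability, 0.5
print(sd(sample_means))     # close to 0.5 / sqrt(100) = 0.05
hist(sample_means)          # roughly bell-shaped, although each flip is just 0 or 1
```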

57

Population vs Sample

The CLT lets us make inferences about a population from just one sample

Population: N (size), μ (mean), σ (standard deviation)

Sample: n (size), x̄ (mean), s (standard deviation)

58

Degrees of freedom

For a sample, the formula is n - 1

You lose a degree of freedom for every parameter that you estimate

If you divide by n instead of n - 1, you underestimate the variance

The number of values in the final calculation of a statistic that are free to vary. It's a concept that's used in various statistical tests and calculations

59

Z-score

How far a data point is from the mean, in standard deviation units

60

Z-test  (comparing sample to population)

Where the sample fits relative to the population

"Teaching demos"

Apply to a sample to check if the sample is different from the population

z.test(sample_mean, mu = 4, SD = 1.5)

The population standard deviation is divided by the square root of n

By comparing the calculated z-value to the critical values from the standard normal distribution, you can determine whether to reject the null hypothesis
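
z.test() is not part of base R, so this sketch computes the same kind of test by hand. The sample values are made up; mu = 4 and a population SD of 1.5 are taken from the call above.

```r
ratings <- c(5, 6, 4, 5, 7, 6, 5, 6, 4)   # hypothetical sample
mu <- 4                                   # population mean
sigma <- 1.5                              # population standard deviation

se <- sigma / sqrt(length(ratings))       # SD divided by the square root of n
z <- (mean(ratings) - mu) / se            # how many SEs the sample mean is from mu
p_value <- 2 * pnorm(-abs(z))             # two-sided p-value from the normal curve
print(c(z = z, p = p_value))
```

A p-value below 0.05 would lead to rejecting the null hypothesis that the sample comes from a population with mean 4.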

61

Standard error

SE = SD / √n

Describes spread, but it is about how precisely the sample represents the entire population

The average of the sample should be about the average of the population

How confident you can be about your assumptions

"Error" means uncertainty: the numbers differ from what was expected because of variability that isn't being captured

A measure of the variability or uncertainty in an estimate, particularly in the context of statistical inference. It quantifies the precision of an estimate by indicating how much it might vary from the true population parameter on average
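
A minimal sketch of the standard error of a sample mean, using made-up data:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)        # hypothetical sample
se <- sd(x) / sqrt(length(x))         # standard deviation divided by sqrt(n)
print(se)
```

Because n is in the denominator, larger samples give a smaller standard error and a more precise estimate of the population mean.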

62

T-test

Tests whether a sample mean differs from a population value or from another sample's mean (the equation is similar to the z-test, but it relies on a t-distribution rather than the normal distribution)

63

One sample T-test (comparing sample to population)

The tails are bigger than in the normal distribution, which makes it a more conservative approach to hypothesis testing

t.test(studata$approx_drinks, mu = 6.5)

The t value (e.g. t = 11) shows how different the sample is; the p-value must be less than 0.05 to be confident that it is different

Commonly used when you have collected a sample and want to assess whether it is representative of the population from which it was drawn
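
A runnable sketch of a one-sample t-test, with the t statistic also computed by hand. The drinks vector is invented; mu = 6.5 follows the call above.

```r
drinks <- c(8, 9, 7, 10, 9, 8, 11, 9)    # hypothetical sample
result <- t.test(drinks, mu = 6.5)       # compare the sample mean to 6.5

# Same t statistic by hand: (sample mean - mu) / standard error
t_by_hand <- (mean(drinks) - 6.5) / (sd(drinks) / sqrt(length(drinks)))

print(result$statistic)   # t value
print(result$p.value)     # compare with 0.05
```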

64

Two sample T-test

Compares the means of two groups of people; the groups can be different sizes

The groups have to be independent; with a small sample size (< 30 people) the data should be normally distributed

The more people, the more confident we can be

Compares the means of two groups to assess whether there is evidence of a difference between them

65

T-statistic 

Subtract the means and divide by the combined standard errors (the variability we can account for); this gives a sense of how confident we can be that the difference between two samples is statistically significant

Used in hypothesis testing to assess whether the observed difference between the sample mean and the population mean is statistically significant

66

Unpaired T-test

A statistical test used to compare the means of two independent groups to determine if they are significantly different from each other.

The null hypothesis (𝐻0) for an unpaired t-test typically states that there is no difference between the means of the two groups. The alternative hypothesis (𝐻𝑎) suggests that there is a significant difference between the two means.
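
A sketch of an unpaired two-sample test on made-up groups of different sizes. By default t.test() runs the Welch version, which does not assume equal variances; the hand calculation below uses the matching formula.

```r
group1 <- c(5, 6, 7, 8, 9)        # hypothetical group of 5
group2 <- c(1, 2, 3, 4, 5, 6)     # hypothetical group of 6

result <- t.test(group1, group2)  # unpaired (Welch) two-sample t-test

# t by hand: difference in means over the combined standard error
combined_se <- sqrt(var(group1) / length(group1) + var(group2) / length(group2))
t_by_hand <- (mean(group1) - mean(group2)) / combined_se

print(result$statistic)
print(result$p.value)   # below 0.05 means rejecting H0 of equal means
```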

67

Paired

Every observation in the first sample is paired with one in the second

A statistical test used to compare the means of two related groups to determine if they are significantly different from each other. It's commonly used when you have paired or matched observations and want to assess whether there is evidence of a difference in their means.
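
A sketch with invented before and after scores for the same five people. A paired t-test is equivalent to a one-sample t-test on the pairwise differences, which the second call shows.

```r
before <- c(10, 12, 9, 11, 13)    # hypothetical first measurement
after <- c(12, 14, 10, 13, 16)    # hypothetical second measurement, same people

paired_result <- t.test(after, before, paired = TRUE)
diff_result <- t.test(after - before, mu = 0)   # same test on the differences

print(paired_result$statistic)
print(paired_result$p.value)
```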

68

T-distribution

Used with small sample sizes

A probability distribution that arises in hypothesis testing when the population standard deviation is unknown and must be estimated from the sample data.

69

Effect sizes (difference between means)

Larger sample sizes can detect smaller differences; the shape of the distribution matters too, with more confidence when variability is low

The study must have enough power to detect an effect; power depends on how far apart the means are and how spread out the data are

More power lets us detect smaller differences between sample means and be more confident in our results.

  • Small effect size: d = 0.2

  • Medium effect size: d = 0.5

  • Large effect size: d = 0.8

The effect size expresses the difference between means relative to how spread out the data are
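
A sketch of one common effect size, assuming that d on this card refers to Cohen's d with a pooled standard deviation. The two groups are made up.

```r
group1 <- c(2, 4, 6)    # hypothetical group
group2 <- c(4, 6, 8)    # hypothetical group
n1 <- length(group1)
n2 <- length(group2)

# Pooled SD: the two variances weighted by their degrees of freedom
pooled_sd <- sqrt(((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2))

# Difference between means, in units of the pooled SD
d <- (mean(group2) - mean(group1)) / pooled_sd
print(d)
```

Here the means differ by one full pooled standard deviation, which counts as a large effect by the benchmarks above.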

70

Conceptual foundation of test statistics

The framework of hypothesis testing, a fundamental concept in statistics used to make inferences about population parameters based on sample data.

Null hypothesis, alternative hypothesis, test statistic, sampling distribution under the null hypothesis

71

Null Hypothesis

Default assumption about the population parameter(s)

no effect, no difference, or no association between variables.

72

Alternative hypothesis

It asserts what you hypothesize to be true about the population parameter(s) being tested. It can be one-sided (e.g., greater than, less than) or two-sided (e.g., not equal to)

73

Test statistic

numerical summary of sample data that measures the degree of compatibility between the observed data and the null hypothesis. It quantifies how far the observed data deviates from what would be expected under the null hypothesis

74

Sampling distribution under the null hypothesis

Represents the distribution of test statistic values that would be obtained if the null hypothesis were true and helps assess the probability of observing the data given the null hypothesis

75

p-value

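The probability of observing data at least as extreme as the sample if the null hypothesis were true; it is compared with a threshold (the significance level) to judge the strength of evidence against the null hypothesis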

The decision to reject or fail to reject the null hypothesis is based on whether the observed test statistic falls beyond the critical value or whether the p-value is smaller than a predefined significance level (e.g., 0.05).

76

Experimental Considerations

careful consideration of these experimental factors is essential for producing reliable, valid, and ethical research findings that contribute to the advancement of knowledge in the field.

77

ANOVAs Vs the F-test

ANOVA is a technique used to compare means across multiple groups, while the F-test is a statistical test used to assess the overall significance of the ANOVA model by comparing variances.

The F-test is an integral part of ANOVA and helps determine whether the observed differences between group means are statistically significant

78

Anovas

ANOVA (Analysis of Variance)

Compares the means of three or more groups to determine if there are statistically significant differences between them

Works with any number of groups and tells you whether at least one group mean differs from the others; on its own it doesn't tell you which mean differs or which is greater

79

F-test

The F-test checks whether the variance within each group is small relative to the variance across groups

If something is happening, follow up with a post hoc test

If the F-test is statistically significant, it suggests that there are significant differences between the group means, and further investigation (e.g., post hoc tests) may be warranted.
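
A base-R sketch of a one-way ANOVA followed by a post hoc test. The scores and groups are made up, and Tukey's HSD is used here as one example of a post hoc test; the course may use a different one.

```r
# Hypothetical scores for three groups of three people
scores <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 3)),
  score = c(4, 5, 6, 7, 8, 9, 10, 11, 12)
)

fit <- aov(score ~ group, data = scores)    # one-way ANOVA
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]      # between-group vs within-group variance
p_value <- anova_table[["Pr(>F)"]][1]

print(c(F = f_value, p = p_value))
print(TukeyHSD(fit))                        # which pairs of groups differ
```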

80

Descriptive stats = mean,sd

Descriptive statistics are numerical summaries or measures that provide insights into the central tendency, variability, and distribution of a dataset.

81

Inference: conclusion of broader population based on sample

Hypothesis testing

Null and alternative hypotheses

H0 or null = nothing changes

HA or alternative = something happens

  • Keep the null hypothesis if you do not have enough evidence to reject it (p > 0.05)

