Unit 1 (STATS - 1000)

New cards

Statistics

Set of methods for obtaining, organizing, summarizing, presenting and analyzing data

New cards

Data

Comes from characteristics measured on individuals, or units

New cards

Individuals/ Units

Nearly anything: people, animals, places, things, etc

New cards

Observations

collected data values

New cards

Population

Totality of individuals about which we want information

New cards

Sample

Subset of the individuals in a population that we actually examine in order to gather information

New cards

Good sample

Representative of the population

New cards

Identifying the population that a sample represents

Replace the sample size in the description of the sample with “all”

New cards

Variable

characteristic or property of an individual.

New cards

Examples of possible variables

Lifespan of a light bulb, The number of heads in five tosses of a quarter, Hair colour

New cards

Classifications of data

categorical and quantitative

New cards

Categorical data

Values of categorical (qualitative) variables.
These are variables that place individuals into one of several groups or categories.

New cards

Categorical variables (examples)

Eye colour
Favourite singer
Reason for taking STAT 1000

New cards

Categorical and ordinal

There is a meaningful, logical ordering to the values of the categorical variable.

New cards

Categorical and nominal

There is no meaningful, logical ordering to the values of the categorical variable.

New cards

Quantitative data

Values of quantitative variables

New cards

Quantitative variables

Variables that take numerical values for which arithmetic operations (such as subtracting, averaging, etc.) make sense (i.e. their results are meaningful).

New cards

Quantitative variables (examples)

Height
Volume of air in a balloon
Exam score
Time

New cards

Data distribution tells us:

What values a variable takes, and How often it takes these values

New cards

Bar Charts

Display variable values on one axis, and frequencies on the other.

Bars don’t touch (not continuous)

New cards

Pie charts

visual representation of the relative frequency/proportion of the observed values for a categorical variable

New cards

Frequency distribution

count of how many of our data values fall into various predetermined classes or intervals

New cards

Frequency Distribution Example

New cards

Relative frequency distributions

Dividing the number of data values in each class by the total number of data values, we get the relative frequency, or proportion of individuals in each class

New cards

Proportions (relative frequency distributions)

Values between 0 and 1 that are decimal representations of fractions. You can convert proportions to percentages by multiplying by 100.
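
As a small illustration of the two cards above, the sketch below turns counts into relative frequencies and percentages. The eye-colour observations and all variable names are made up for this example.

```python
from collections import Counter

# Hypothetical eye-colour observations (made up for illustration).
eye_colours = ["brown", "blue", "brown", "green", "brown",
               "blue", "brown", "blue", "green", "brown"]

frequencies = Counter(eye_colours)   # frequency: count of values in each class
n = len(eye_colours)                 # total number of data values

# Relative frequency (proportion) = class count / total count.
relative_frequencies = {c: count / n for c, count in frequencies.items()}

# Percentage = proportion * 100.
percentages = {c: 100 * p for c, p in relative_frequencies.items()}

for colour in frequencies:
    print(colour, frequencies[colour],
          round(relative_frequencies[colour], 2),
          f"{percentages[colour]:.0f}%")
```

The proportions always add up to 1 (and the percentages to 100), which is a quick way to check a relative frequency table.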

New cards

Relative frequency distribution Example

New cards

Frequency distribution (intervals)

We choose the intervals ourselves

New cards

Frequency distribution (interval rules)

Our first interval must include the lowest data value (called the minimum)
Our last interval must contain the highest data value (called the maximum)
All intervals should be of equal length
Each interval includes the left endpoint, but not the right
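
The four rules above can be sketched in code. This is only an illustration: the exam scores, the starting point of 50 and the interval width of 10 are assumptions chosen for the example.

```python
# Hypothetical exam scores (made up for illustration).
scores = [52.5, 61, 67, 70, 74.5, 79.5, 80, 83, 88, 95]

start = 50        # first interval starts at or below the minimum (52.5)
width = 10        # all intervals have equal length
n_intervals = 5   # [50,60), [60,70), ..., [90,100); the last contains the maximum (95)

counts = [0] * n_intervals
for x in scores:
    # Floor division puts x in the interval whose left endpoint is <= x
    # and whose right endpoint is > x (left included, right excluded).
    index = int((x - start) // width)
    counts[index] += 1

for i, count in enumerate(counts):
    left = start + i * width
    print(f"[{left}, {left + width}): {count}")
```

Note that 79.5 lands in [70, 80) and 80 lands in [80, 90), so every value belongs to exactly one interval.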

New cards

Choosing the intervals (frequency distribution)

Make “nice choices” that summarize our data well. We’d typically use around 5 to 10 intervals in total.

New cards

Why can’t we just use intervals with gaps between them (e.g. 70-79, 80-89)?

Because continuous variables can take decimal values.

With intervals such as 70-79 and 80-89, where would 79.5 go?

New cards

Continuous variables

These are quantitative variables that can take any value within a given range.

New cards

Continuous variables (examples)

Test scores, age, height, distance

New cards

Discrete variables

These are quantitative variables that can only take a “countable” number of values, i.e. they can only take specific, distinct values.

New cards

discrete variables (examples)

The number of children in a family
The number of days of rain in a month
The number of books a person has read in their life

New cards

Histograms

More useful and commonly used display of continuous data

Graphical displays of the frequency (or relative frequency) of data values falling into each of several intervals.
Histograms are especially useful when we’re dealing with large data sets.

New cards

What type of data is used for a histogram

continuous, quantitative data

New cards

Why are there no spaces between the bars in a histogram

Because the data are continuous, so the intervals have no gaps between them

New cards

What does the base of a histogram represent

The length of the interval (all intervals have equal length)

New cards

What does the height of a histogram represent

the frequency of the data in each interval

New cards

Distribution shape (histogram)

A histogram can be used to characterize the shape of the data distribution

Symmetric
Skewed to the right
Skewed to the left

New cards

Symmetric shape (histogram)

center divides it into two approximate mirror images

New cards

Skewed to the right (Histogram)

longer tail on the right side

most of the data values are concentrated on the left

New cards

Skewed to the left (Histogram)

longer tail on the left side

most of the data values are concentrated on the right.

New cards

Distribution shape (!!WARNING!!)

Be careful interpreting the shape of a histogram if it’s displayed vertically!!

x-axis has to start at 0 (when flipped horizontal)

New cards

Time series data

Values of a variable measured over time

New cards

How can you visually display time series data

time plots

New cards

What constitutes a Time Plot

Time is plotted on the x-axis, and variable values are plotted on the y-axis

New cards

How is data represented on a Time Plot

Data values are represented by points. We connect these points to better visualize the pattern/trend.

New cards

Seasonal variation (time plot) {example}

fluctuations in data values that occur at regular intervals due to seasonal factors, showing predictable changes at specific times of the year.

New cards

Numerical Summaries of Data

Two important features of a data set that we describe using numbers are its location and variability

New cards

Measures of Location

The location of our data is determined by where the center of our data falls. The measures of location are:

mean
median
mode

New cards

Mode

Most frequently observed data value

New cards

Can you have more than one mode

Yes, a data set can have more than one mode
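
A minimal sketch of a data set with two modes, using `multimode` from Python's standard `statistics` module (available from Python 3.8). The data values are made up for illustration.

```python
from statistics import multimode

# Hypothetical data set in which 3 and 7 are both observed twice.
data = [2, 3, 3, 5, 7, 7, 9]

# multimode returns every value tied for the highest frequency.
modes = multimode(data)
print(modes)
```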

New cards

Median value

“middle value” in an ordered data set.

Half of the data values are less than or equal to the median, and the other
half of the data values are greater than or equal to the median.

New cards

What is the first step to make sure the median is accurate

Ensure the data set is ordered.

New cards

What must you do if n is odd (median)

You locate the middle value directly.

New cards

What must you do if n is even (median)

take the average of the two middle values.
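
The three median cards above (order the data, then handle odd or even n) can be sketched as one small function. The function name and the sample values are assumptions made for this example.

```python
def median(values):
    ordered = sorted(values)   # step 1: make sure the data set is ordered
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n is odd: the middle value is the median.
        return ordered[mid]
    # n is even: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median([7, 3, 9, 1, 5])   # ordered: 1, 3, 5, 7, 9
even_median = median([7, 3, 9, 1])     # ordered: 1, 3, 7, 9
print(odd_median, even_median)
```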

New cards

Mean

The average of a data set, calculated by adding all values together and dividing by the number of values.

New cards

What is an extreme value (or outlier)

An extreme value, or outlier, is a data point that significantly differs from other observations in a data set, potentially skewing the results.

New cards

Which is resistant to outliers

The median

New cards

Which is not resistant to outliers

The mean

New cards

Is resistance to outliers a good thing?

Yes

A resistant measure gives a more accurate representation of the central tendency of the data, making analyses less sensitive to extreme values.
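
To see resistance in action, the sketch below adds one extreme value to a small data set and compares how much the mean and the median move. The income figures are made up for illustration.

```python
from statistics import mean, median

# Hypothetical incomes, in thousands of dollars.
incomes = [40, 45, 50, 55, 60]
with_outlier = incomes + [500]   # add one extreme value

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

In this example the mean jumps from 50 to 125, while the median only moves from 50 to 52.5.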

New cards

what is the advantage of the mean

It takes all data points into account, providing a measure of central tendency that reflects the overall dataset.

New cards

Median as a Measure of Center

It is simple to visualize how the median measures the center of the data: it divides the data set in half

New cards

Mean as a Measure of Center

The “center of mass” or “balance point” of the data.

New cards

How do the mean and the median for a given data set compare?

It depends on the shape of the distribution:
In a symmetric distribution
In a skewed distribution

New cards

Symmetric distribution (given data set)

The mean and median are (approximately) equal

New cards

Skewed distribution (given data set)

The mean follows the tail

Right skewed
Left skewed

New cards

Right skewed (skewed distribution)

The mean is greater than the median

New cards

Left skewed (skewed distribution)

The mean is less than the median

New cards

Weighted mean

Sometimes when we’re calculating the mean, some data values are given more weight than others

This may be because some values are observed more frequently, or because some values are more “important” than others
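
One common way to compute a weighted mean is to multiply each value by its weight, add the products, and divide by the total weight. The sketch below assumes a hypothetical course with three graded components; the scores and weights are made up.

```python
# Hypothetical course components: assignment, midterm, final exam.
scores = [80, 70, 90]
weights = [0.2, 0.3, 0.5]   # more "important" components get more weight

# Weighted mean = sum(weight * value) / sum(weights).
weighted_mean = sum(w * x for x, w in zip(scores, weights)) / sum(weights)

# For comparison, the ordinary mean weights every value equally.
ordinary_mean = sum(scores) / len(scores)

print(weighted_mean, ordinary_mean)
```

The same formula covers the "observed more frequently" case if the weights are the frequencies of each value.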

New cards

Variability

A numerical description of how much the values of quantitative data vary (how “spread out” they are)

New cards

Difference

Both are approximately symmetric
The centers of the distributions are approximately equal
The variability/spread is different:
- The distribution on top has higher variability than the distribution on the bottom (the data is more “spread out”)

New cards

Measures of variability

Range
Interquartile Range

New cards

Range

This is the difference between the greatest observation and the least observation

New cards

Range formula

R = maximum - minimum

New cards

is range affected by extreme values?

Yes, range is sensitive to extreme values.
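
A short sketch of the range formula R = maximum - minimum, and of how a single extreme value changes it. The data values are made up for illustration.

```python
# Hypothetical observations.
data = [12, 15, 11, 18, 14]
data_range = max(data) - min(data)          # R = maximum - minimum

# Adding one extreme value changes the maximum, and so the range.
with_outlier = data + [60]
range_with_outlier = max(with_outlier) - min(with_outlier)

print(data_range, range_with_outlier)
```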

New cards

How can outliers occur

Errors in measurement
Legitimate observations, but we might not be interested in including these extreme values in our numerical summary of the data

New cards

Interquartile range

measures the length of the interval that covers the middle 50% of the ordered observations.

New cards

Does the interquartile range exclude outliers?

Yes, it excludes outliers because it focuses only on the central 50% of data.

New cards

What is the first and third quartile

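The endpoints of the interval that covers the middle 50% of the ordered observations (Q1 is the lower endpoint, Q3 is the upper endpoint)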

New cards

first quartile

Value where at least 25% of our observations are less than or equal to Q1, and
at least 75% of our observations are greater than or equal to Q1

New cards

Third quartile

Value where at least 75% of our observations are less than or equal to Q3, and
at least 25% of our observations are greater than or equal to Q3

New cards

How to find Q1 (first quartile)

Take the median of all the data values lower than the (data’s) median

Don’t include the median itself when counting

New cards

How to find Q3 (third quartile)

Take the median of all the data values higher than the (data’s) median

You can count in from the maximum of the data set

New cards

how to solve for the interquartile range

Subtract Q1 from Q3 to find the IQR
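
The Q1, Q3 and IQR cards above can be sketched together. This follows the "median of each half" method described on the cards; the function name and data are made up, and for an even number of observations the sketch assumes the data set is simply split into its lower and upper halves. It needs at least two values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]   # values below the median's position
    # For odd n, skip the median itself; for even n the upper half starts at `half`.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]   # hypothetical observations
q1, med, q3 = quartiles(data)
iqr = q3 - q1                            # IQR = Q3 - Q1
print(q1, med, q3, iqr)
```

Other textbooks and software packages use slightly different quartile rules, so their results can differ a little from this method.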

New cards

Percentiles

Percentiles are values that divide a dataset into 100 equal parts, indicating the percentage of scores that fall below a particular data point.

New cards

Percentile (class)

The p-th percentile of a data set is a value such that at least p% of observations are less than or equal to it, and at least (100-p)% of observations are greater than or equal to it
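
The class definition can be checked directly by counting. The sketch below is an illustration only: the function name is hypothetical, the data are made up, and it assumes both conditions are read as "at least".

```python
def satisfies_percentile_definition(value, data, p):
    """Check whether `value` meets the p-th percentile definition for `data`."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= value)
    at_or_above = sum(1 for x in data if x >= value)
    # Compare counts with integer arithmetic: at least p% at or below,
    # and at least (100 - p)% at or above.
    return 100 * at_or_below >= p * n and 100 * at_or_above >= (100 - p) * n

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical observations

check_median = satisfies_percentile_definition(5, data, 50)   # 5 at or below, 6 at or above
check_q1 = satisfies_percentile_definition(3, data, 25)       # 3 at or below, 8 at or above
check_wrong = satisfies_percentile_definition(9, data, 50)    # only 2 at or above
print(check_median, check_q1, check_wrong)
```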

New cards

What is the five-number summary

The five-number summary consists of five descriptive statistics that provide a quick overview of a dataset:

The minimum
first quartile (Q1)
median
third quartile (Q3)
maximum

New cards

What does the five number summary divide the data into

The five-number summary divides the data into four intervals,

each containing about 25% of the observations

New cards

What does the Five number summary describe?

The center/location of our data
The spread/variability of our data
The shape of our data

New cards

Quantile boxplot

A graph that uses the five-number summary to give a “picture” of our data

New cards

What does a quantile boxplot consist of

A number line at the bottom, drawn horizontally
A vertical line at the median
A box around the median that covers the IQR
Lines (called “whiskers”) that extend from the box out to the minimum and maximum

New cards

How do you know if the boxplot is skewed to the left

If the left whisker is longer than the right, it indicates left skew.

New cards

How do you know if the boxplot is skewed to the right

If the right whisker is longer than the left, it indicates right skew.
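
The whisker comparison on the last two cards can be written out with a five-number summary. The numbers below are a hypothetical five-number summary chosen for illustration, and the rule used is only the whisker rule from these cards.

```python
# Hypothetical five-number summary: minimum, Q1, median, Q3, maximum.
minimum, q1, med, q3, maximum = 2, 5, 7, 10, 25

left_whisker = q1 - minimum     # length of the whisker from the box to the minimum
right_whisker = maximum - q3    # length of the whisker from the box to the maximum

if right_whisker > left_whisker:
    shape = "skewed to the right"
elif left_whisker > right_whisker:
    shape = "skewed to the left"
else:
    shape = "no skew indicated by the whiskers"

print(left_whisker, right_whisker, shape)
```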

New cards

Vertical boxplot

New cards

How do you know if the boxplot is skewed to the left (vertical)

The lower whisker is longer.

New cards

How do you know if the boxplot is skewed to the right (vertical)

The upper whisker is longer.

New cards

side-by-side boxplots

Comparative visual display of two or more boxplots to analyze differences in distributions.

New cards

The side-by-side boxplots below compare the height distributions for Toronto Blue Jays pitchers and players in other fielding positions: (example)

The median heights for pitchers and fielders are equal
The distribution for pitchers is skewed to the right and the distribution for fielders is skewed to the left
The IQR for pitchers and fielders are equal, but the range for fielders is greater