STATS 101/108 - Module 1

Description and Tags

Chapters 1-4

Statistics

85 Terms

What are the three main reasons for generating and analysing data?

Description, prediction, explanation.

Description

Describing the features of a dataset or population of interest. This involves calculating estimates based on groups and identifying clusters.

Prediction

Predicting what will happen in a new instance or at a future time. This involves making forecasts at the individual level.

Explanation

Explaining why things have happened, often so that we can change how they happen in the future. This involves discovering and investigating cases.

Entity/case

Each entry in a dataset.

Variable/attribute

Features or properties of an entity.

What are the two kinds of variable?

Numerical and categorical.

Numerical variable

Variables describing a measurable characteristic.

Categorical variable

Variables describing the different groups an entity belongs to.

Rectangular data set

A form of storing data where each row corresponds to an entity and each column corresponds to a variable.

Classification

Identifying and grouping entities into predetermined levels.

Cross-classification

Creating groups based on combinations of levels from two categorical variables.

Two-way table of counts

A summary table that counts the entities in each combination of levels of two categorical variables, which allows us to calculate proportions in the data.
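
As a rough illustration of cross-classification and a two-way table of counts, the sketch below counts entities in each combination of levels. The data and the names `entities`, `two_way_counts`, and `prop_student_yes` are made up for this example.

```python
from collections import Counter

# Hypothetical rectangular data: each tuple is one entity (row) with
# two categorical variables (columns): role and answer.
entities = [
    ("student", "yes"), ("student", "no"), ("student", "yes"),
    ("staff", "no"), ("staff", "no"), ("staff", "yes"),
    ("student", "yes"), ("staff", "no"),
]

# Cross-classification: count the entities in each combination of levels.
two_way_counts = Counter(entities)

# A proportion from the table: students who answered "yes", out of everyone.
total = len(entities)
prop_student_yes = two_way_counts[("student", "yes")] / total

for (role, answer), count in sorted(two_way_counts.items()):
    print(role, answer, count)
print("Proportion student & yes:", prop_student_yes)
```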

Classification model

A model that predicts the level/group for a target categorical variable.

Training data

The data used to create a prediction model.

Binary classification model

A classification model that predicts which of two groups an entity is in.

Proportion

The fraction of the total that possesses a certain attribute.

How can proportions be expressed?

As a fraction, decimal, or percentage.
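
A small sketch of the three ways to express the same proportion. The counts are invented for illustration.

```python
from fractions import Fraction

# Hypothetical counts: 12 of 48 entities possess the attribute.
count_with_attribute = 12
total = 48

as_fraction = Fraction(count_with_attribute, total)  # reduced fraction
as_decimal = float(as_fraction)                      # decimal form
as_percentage = f"{as_decimal:.0%}"                  # percentage form

print(as_fraction, as_decimal, as_percentage)
```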

Baseline model

A no-information model that always predicts the level with the highest proportion in the training data.
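
One possible way to sketch a baseline model, assuming the training data is just a list of levels. The function name `fit_baseline` and the data are hypothetical.

```python
from collections import Counter

# Hypothetical training data: the level of the target variable for each entity.
training_levels = ["no", "yes", "no", "no", "yes", "no"]

def fit_baseline(levels):
    """Return the level with the highest proportion in the training data."""
    return Counter(levels).most_common(1)[0][0]

# The baseline ignores all other variables and always predicts this level.
baseline_prediction = fit_baseline(training_levels)

# On the training data, its PCC is simply the proportion of that level.
baseline_pcc = training_levels.count(baseline_prediction) / len(training_levels)

print(baseline_prediction, round(baseline_pcc, 3))
```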

Confusion matrix

A type of two-way table where the two categorical variables relate to the success of the classification model.

What are the two categorical variables in a confusion matrix?

Actual value and predicted value.

Percentage correctly classified (PCC)

The proportion of predictions where the actual and predicted values match.

How is the PCC calculated?

By adding the values in the main diagonal and dividing by the total.

Conditional proportions

Used to calculate the PCC for different levels of the data.

Why are conditional proportions used?

To find out if a model is better or worse at identifying a particular level.
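
The confusion matrix, PCC, and conditional proportions can be sketched together as below. The actual and predicted values are invented, and the names `confusion`, `pcc`, and `pcc_by_level` are illustrative only.

```python
from collections import Counter

# Hypothetical actual and predicted levels for eight entities.
actual    = ["yes", "yes", "no", "no", "no",  "yes", "no", "no"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "no", "no"]

# Confusion matrix as a two-way table of counts keyed by (actual, predicted).
confusion = Counter(zip(actual, predicted))
levels = sorted(set(actual))

# PCC: add the counts on the main diagonal and divide by the total.
correct = sum(confusion[(level, level)] for level in levels)
pcc = correct / len(actual)

# Conditional proportions: PCC within each actual level.
pcc_by_level = {
    level: confusion[(level, level)] / actual.count(level) for level in levels
}

print("PCC:", pcc)
print("PCC by actual level:", pcc_by_level)
```

In this made-up example the model is correct more often for "no" than for "yes", which is the kind of difference the conditional proportions are meant to reveal.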

What common displays are used for numeric data?

Dot plot and box plot.

How are numeric displays ordered?

Low-to-high along the x-axis.

Dot plot

Display that shows each value as a dot and stacks similar values on top of each other.

Box plot

Display that only shows the minimum, lower quartile, median, upper quartile, and maximum of the values.
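
A minimal sketch of the five values a box plot shows, using Python's `statistics.quantiles`. Quartile conventions differ between textbooks and software; the `"inclusive"` method is an assumption here, and the data is made up.

```python
from statistics import quantiles

# Hypothetical numeric values.
values = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# n=4 splits the data into quarters, returning the three cut points.
lower_q, median_value, upper_q = quantiles(values, n=4, method="inclusive")

five_number = (min(values), lower_q, median_value, upper_q, max(values))
print(five_number)
```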

What are the two measures of centre?

Median and mean.

Median

The middle value of the distribution.

Mean

The average value of the distribution.

Distribution

The overall pattern of a variable's values, seen as the shape made by the dot plot.

Positively skewed

Most data is low, but some extreme values create a long tail and pull the mean up.

Negatively skewed

Most data is high, but some extreme values create a long tail and pull the mean down.

In a positively skewed distribution, the mean is ____ than the median.

higher

In a negatively skewed distribution, the mean is ____ than the median.

lower
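
To see how a long tail pulls the mean away from the median, here is a small sketch with invented data. The second list is just a mirror image of the first.

```python
from statistics import mean, median

# Hypothetical positively skewed data: mostly low, one extreme high value.
positively_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

# Mirror the values to get a negatively skewed version.
negatively_skewed = [21 - v for v in positively_skewed]

print(mean(positively_skewed), median(positively_skewed))
print(mean(negatively_skewed), median(negatively_skewed))
```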

Symmetric

The data is evenly distributed on either side of the centre.

Unimodal

There is one peak in the distribution.

Bimodal

There are two peaks in the distribution.

Variation

How close together (or spread out) the values are.

Interquartile range (IQR)

The difference between the upper and lower quartiles. It spans the middle 50% of the distribution.

Standard deviation

How close, on average, values are to the mean.

A larger standard deviation indicates that the values are ____ from the mean on average, so there is ____ variation.

further away, more

A smaller standard deviation indicates that the values are ____ to the mean on average, so there is ____ variation.

closer, less
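
The two measures of variation can be compared on two invented datasets with the same mean. This sketch assumes the sample standard deviation (`statistics.stdev`, which divides by n - 1) and the `"inclusive"` quartile method; a course may use slightly different conventions. The helper `iqr` is hypothetical.

```python
from statistics import mean, quantiles, stdev

def iqr(values):
    """Interquartile range: upper quartile minus lower quartile."""
    lower_q, _, upper_q = quantiles(values, n=4, method="inclusive")
    return upper_q - lower_q

# Hypothetical data: same centre, different variation.
tight = [9, 10, 10, 10, 11]
spread = [2, 6, 10, 14, 18]

print(mean(tight), stdev(tight), iqr(tight))
print(mean(spread), stdev(spread), iqr(spread))
```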

Testing data

Data used to test a classification model.

Algorithm

A set of instructions for using input data to predict which level an entity belongs to.

Decision rule

A type of algorithm used in classification models.

Algorithmic bias

The ways that the data and assumptions in the development of an algorithm can result in unfair or inaccurate outcomes.

Dynamic data

Data that is updated periodically as new data becomes available.

Target/response variable

The variable we are trying to predict.

Prediction error

The difference between the actual value and the predicted value.

Positive prediction error occurs when the actual value is ____ than the predicted value.

higher

Negative prediction error occurs when the actual value is ____ than the predicted value.

lower

No-information model for a numerical variable

Always predicts the mean.
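
Prediction error and the no-information model for a numerical variable can be sketched as follows. The training values and the helper `prediction_error` are made up for illustration.

```python
from statistics import mean

# Hypothetical training values of the target variable.
training_values = [12, 15, 11, 18, 14]

# The no-information model always predicts the mean of the training data.
prediction = mean(training_values)

def prediction_error(actual, predicted):
    """Prediction error: actual value minus predicted value."""
    return actual - predicted

# Positive when the actual value is higher than the prediction,
# negative when it is lower.
print(prediction_error(18, prediction))
print(prediction_error(11, prediction))
```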

Prediction interval

Gives a range of values for the prediction, between an upper and lower limit.

How much of the training data is the prediction interval based on?

The middle 95% of the data.
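
A rough sketch of a prediction interval built from the middle 95% of the training data, cutting off 2.5% in each tail. The data is invented, and using `statistics.quantiles` with the `"inclusive"` method is an assumption about how the limits are found.

```python
from statistics import quantiles

# Hypothetical training values of the target variable.
training_values = list(range(1, 101))

# n=40 gives cut points every 2.5%; the first is the 2.5% point and
# the last is the 97.5% point.
cuts = quantiles(training_values, n=40, method="inclusive")
lower_limit, upper_limit = cuts[0], cuts[-1]

# Check how much of the training data falls inside the interval.
inside = [v for v in training_values if lower_limit <= v <= upper_limit]
coverage = len(inside) / len(training_values)

print(lower_limit, upper_limit, coverage)
```

With a small dataset the coverage is only approximately 95%, because the limits fall between observed values.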

What two features need to be balanced in a prediction interval?

Accuracy and precision.

Accuracy

How often a model gets the prediction correct.

Precision

How close the predictions are to the actual values.

Scatter plot

A plot with two numerical variables used to visualise the association between the variables.

What features should be checked on a scatter plot?

Association, pattern/trend, scatter/variation.

Explanatory variable

The independent variable.

What axis is the explanatory variable plotted on?

The x-axis.

What axis is the response variable plotted on?

The y-axis.

Response variable

The dependent variable - what we are trying to predict.

Correlation

A measure of association strength between two variables.

Rank correlation

Measure of how well the variables match up if ordered from smallest to largest.
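
The cards do not say which correlation measures are meant. The sketch below assumes Pearson's coefficient for correlation and Spearman's approach (Pearson on the ranks) for rank correlation, with made-up data and hypothetical helper names. The `ranks` helper assumes there are no tied values.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank each value from smallest (1) to largest; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def rank_correlation(xs, ys):
    """Correlation of the ranks rather than the raw values."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y always increases with x, but along a curve.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

print(round(pearson(xs, ys), 3), rank_correlation(xs, ys))
```

Here the orderings match perfectly, so the rank correlation is 1 even though the curved pattern keeps the ordinary correlation slightly below 1.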

Linear model

A straight line that goes through the centre of the data - a line of best fit - that is used to make a point prediction.

Residual

The prediction error used in a linear model.
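
A minimal sketch of a linear model and its residuals. The cards only say "line of best fit"; fitting by least squares is an assumption here, and the data and the name `fit_line` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical explanatory (x) and response (y) values.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(xs, ys)

# Point predictions from the line, and residuals (actual minus predicted).
predictions = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predictions)]

print(round(intercept, 2), round(slope, 2))
print([round(r, 2) for r in residuals])
```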

What is used as the residual in a linear model?

The prediction error for each entity, with the range containing the middle 95% of these errors used for the prediction interval.

Why is the middle 95% of the distribution used?

It shows us where the likely/usual values for the distribution are.

Tail proportion

The proportion of values that are above or below a value of interest.

If the tail proportion is less than 2.5%, the value of interest can be considered ____ for the distribution.

unusual

If the tail proportion is more than 2.5%, the value of interest can be considered ____ for the distribution.

usual
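
A sketch of a tail proportion and the 2.5% rule. The function names and data are hypothetical. Counting values equal to the value of interest as part of the tail, and treating a tail of exactly 2.5% as usual, are both assumptions, since the cards do not cover those cases.

```python
def tail_proportion(values, value_of_interest, direction="above"):
    """Proportion of values at or beyond the value of interest."""
    if direction == "above":
        in_tail = sum(v >= value_of_interest for v in values)
    elif direction == "below":
        in_tail = sum(v <= value_of_interest for v in values)
    else:
        raise ValueError("direction must be 'above' or 'below'")
    return in_tail / len(values)

def describe(tail):
    """Apply the 2.5% rule to a tail proportion."""
    return "unusual" if tail < 0.025 else "usual"

# Hypothetical distribution of 200 values.
values = list(range(1, 201))

print(tail_proportion(values, 198), describe(tail_proportion(values, 198)))
print(tail_proportion(values, 150), describe(tail_proportion(values, 150)))
```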

Null model

A baseline model used to account for any uncertainty in the process that may have led to the observed result.

Null hypothesis

The “just chance” explanation for a particular situation.

Chance variation

The explanation for why, even if the underlying proportion is a certain value, we may not see exactly this proportion when generating data from the model.

Experiment

A study in which the researcher will control, manipulate, or change the conditions the experimental units experience.

Random variation

Differences in group summaries that are due to chance and nothing else, e.g. due to random allocation to groups.

Reference distribution

A distribution showing the random variation.

Random allocation

Experimental units are allocated to treatments such that each treatment is equally likely to be applied to each unit.

Randomisation test

A test producing a reference distribution of what could be explained by ‘just chance’.

If the tail proportion of the observed result is less than 2.5% in the reference distribution, it is ____ with the null model and ____ a result of chance.

not compatible, not likely

If the tail proportion of the observed result is more than 2.5% in the reference distribution, it is ____ with the null model and ____ a result of chance.

compatible, could be
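
One way a randomisation test could be sketched, assuming the group summary is the difference in means and the tail of interest is the upper one. The data, the function name `randomisation_test`, the number of shuffles, and the seed are all invented for this example.

```python
import random
from statistics import mean

def randomisation_test(group_a, group_b, n_shuffles=2000, seed=1):
    """Return the observed difference in means and its upper tail proportion."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    # Reference distribution: differences produced by random allocation alone.
    reference = []
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        reference.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))

    # Tail proportion: re-allocations at least as large as the observed result.
    tail = sum(d >= observed for d in reference) / n_shuffles
    return observed, tail

# Hypothetical results from an experiment with two treatment groups.
treatment = [24, 27, 29, 31, 33, 35]
control = [18, 20, 21, 23, 25, 26]

observed, tail = randomisation_test(treatment, control)
print(round(observed, 2), tail)
print("not compatible with the null model" if tail < 0.025
      else "compatible with the null model")
```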