What are the three main reasons for generating and analysing data?
Description, prediction, explanation.
Description
Describing the features of a dataset or population of interest. This involves calculating estimates based on groups and identifying clusters.
Prediction
Predict what will happen in a new instance or a future time. This involves making forecasts at the individual level.
Explanation
Explain why things have happened, often so that we can change how they happen in the future. This involves discovering and investigating cases.
Entity/case
Each entry in a dataset.
Variable/attribute
Features or properties of an entity.
What are the two kinds of variable?
Numerical and categorical.
Numerical variable
Variables describing a measurable characteristic.
Categorical variable
Variables describing the different groups an entity belongs to.
Rectangular data set
A form of storing data where each row corresponds to an entity and each column corresponds to a variable.
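As a rough sketch, a rectangular data set can be held as a list of rows. The pets and column names below are made up for illustration.

```python
# Hypothetical data: each dict is one entity (row), each key is a variable (column).
pets = [
    {"name": "Milo", "species": "cat", "weight_kg": 4.2},
    {"name": "Bella", "species": "dog", "weight_kg": 18.5},
    {"name": "Coco", "species": "cat", "weight_kg": 3.8},
]

n_entities = len(pets)            # number of rows
variables = list(pets[0].keys())  # column names

# "species" is a categorical variable; "weight_kg" is a numerical variable.
weights = [row["weight_kg"] for row in pets]
```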
Classification
Identifying and grouping entities into predetermined levels.
Cross-classification
Creating groups based on combinations of levels from two categorical variables.
Two-way table of counts
A type of summary table that allows us to calculate the proportions in the data.
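A minimal sketch of cross-classifying two categorical variables and reading a proportion off the counts. The records are invented, and `Counter` is just one convenient way to hold the table.

```python
from collections import Counter

# Hypothetical entities described by two categorical variables: (species, housing).
records = [
    ("cat", "indoor"), ("cat", "indoor"), ("cat", "outdoor"),
    ("dog", "outdoor"), ("dog", "outdoor"), ("dog", "indoor"),
    ("dog", "outdoor"), ("cat", "indoor"),
]

# Cross-classification: count each combination of levels.
counts = Counter(records)
total = sum(counts.values())

# Proportion of all entities that are indoor cats.
prop_indoor_cats = counts[("cat", "indoor")] / total
```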
Classification model
A model that predicts the level/group for a target categorical variable.
Training data
The data used to create a prediction model.
Binary classification model
A classification model that predicts which of two groups an entity is in.
Proportion
The fraction of the total that possesses a certain attribute.
How can proportions be expressed?
As a fraction, decimal, or percentage.
Baseline model
A no-information model that always predicts the level with the highest proportion in the training data.
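One way a baseline model could be sketched, assuming the training data is simply a list of labels. The function name `fit_baseline` and the labels are hypothetical.

```python
from collections import Counter

# Hypothetical training labels for a binary target variable.
training_labels = ["spam", "ham", "ham", "ham", "spam", "ham"]

def fit_baseline(labels):
    """Return the level with the highest proportion in the training data."""
    return Counter(labels).most_common(1)[0][0]

majority = fit_baseline(training_labels)

# The baseline ignores all input information and predicts the same level every time.
predictions = [majority for _ in range(3)]
```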
Confusion matrix
A type of two-way table where the two categorical variables relate to the success of the classification model.
What are the two categorical variables in a confusion matrix?
Actual value and predicted value.
Percentage correctly classified (PCC)
The proportion of predictions where the actual and predicted values match.
How is the PCC calculated?
By adding the values in the main diagonal and dividing by the total.
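The calculation can be sketched as follows, with made-up actual and predicted values. The nested dictionary is only one possible layout for a confusion matrix.

```python
# Hypothetical actual and predicted levels for eight entities.
actual    = ["yes", "yes", "no", "no", "yes", "no",  "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "no", "yes"]

levels = ["yes", "no"]

# Confusion matrix: matrix[actual_level][predicted_level] holds a count.
matrix = {a: {p: 0 for p in levels} for a in levels}
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

# The main diagonal holds the cases where actual and predicted match.
correct = sum(matrix[level][level] for level in levels)
pcc = correct / len(actual)
```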
Conditional proportions
Used to calculate the PCC for different levels of the data.
Why are conditional proportions used?
To find out if a model is better or worse at identifying a particular level of the target variable.
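A small sketch of a conditional proportion: the PCC restricted to entities whose actual value is a given level. The data and the helper name `conditional_pcc` are assumptions for illustration.

```python
# Hypothetical actual and predicted levels.
actual    = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
predicted = ["yes", "no",  "no",  "no",  "no", "no", "no", "no"]

def conditional_pcc(actual, predicted, level):
    """Proportion correctly classified among entities whose actual value is `level`."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a == level]
    return sum(a == p for a, p in pairs) / len(pairs)

pcc_yes = conditional_pcc(actual, predicted, "yes")
pcc_no = conditional_pcc(actual, predicted, "no")
overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```

Here the overall PCC hides the fact that the model is much worse at identifying the "yes" level than the "no" level.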
What common displays are used for numeric data?
Dot plot and box plot.
How are numeric displays ordered?
Low-to-high along the x-axis.
Dot plot
Display that shows each value as a dot and stacks similar values on top of each other.
Box plot
Display that only shows the minimum, lower quartile, median, upper quartile, and maximum of the values.
What are the two measures of centre?
Median and mean.
Median
The middle value of the distribution.
Mean
The average value of the distribution.
Distribution
The shape the values make when plotted, e.g. on a dot plot.
Positively skewed
Most data is low, but some extreme values create a long tail and pull the mean up.
Negatively skewed
Most data is high, but some extreme values create a long tail and pull the mean down.
In a positively skewed distribution, the mean is ____ than the median.
higher
In a negatively skewed distribution, the mean is ____ than the median.
lower
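A quick check of how skew separates the two measures of centre, using Python's `statistics` module on invented values.

```python
from statistics import mean, median

# Hypothetical positively skewed data: mostly low values with one extreme high value.
positive_skew = [1, 2, 2, 3, 3, 3, 4, 20]
pos_mean = mean(positive_skew)      # pulled up by the long right tail
pos_median = median(positive_skew)  # stays in the middle of the bulk

# Hypothetical negatively skewed data: mostly high values with one extreme low value.
negative_skew = [1, 17, 18, 18, 18, 19, 19, 20]
neg_mean = mean(negative_skew)      # pulled down by the long left tail
neg_median = median(negative_skew)
```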
Symmetric
The data is evenly distributed on either side of the centre.
Unimodal
There is one peak in the distribution.
Bimodal
There are two peaks in the distribution.
Variation
How close together or spread out the values are.
Interquartile range (IQR)
The difference between the upper and lower quartiles. The middle 50% of the distribution is here.
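A sketch of the IQR using `statistics.quantiles`. Quartile conventions differ between textbooks and tools; the `"inclusive"` method used here is an assumption, not necessarily the one the course uses.

```python
from statistics import quantiles

# Hypothetical values.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points.
lower_quartile, middle, upper_quartile = quantiles(values, n=4, method="inclusive")

iqr = upper_quartile - lower_quartile
```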
Standard deviation
How close, on average, values are to the mean.
A larger standard deviation indicates that the values are ____ from the mean on average, so there is ____ variation.
further away, more
A smaller standard deviation indicates that the values are ____ to the mean on average, so there is ____ variation.
closer, less
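To illustrate, two invented data sets with the same mean but different spread. `pstdev` (the population form) is used here as an assumption; `statistics.stdev` gives the sample form.

```python
from statistics import mean, pstdev

# Hypothetical data sets, both with mean 10.
tight = [9, 10, 10, 11]
spread = [2, 8, 12, 18]

tight_sd = pstdev(tight)    # values sit close to the mean: less variation
spread_sd = pstdev(spread)  # values sit further from the mean: more variation
```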
Testing data
Data used to test a classification model.
Algorithm
A set of instructions for using input data to predict which level or group an entity belongs to.
Decision rule
A type of algorithm used in classification models.
Algorithmic bias
The ways that the data and assumptions in the development of an algorithm can result in unfair or inaccurate outcomes.
Dynamic data
Data that is updated periodically as new data becomes available.
Target/response variable
The variable we are trying to predict.
Prediction error
The difference between the actual value and the predicted value (actual minus predicted).
Positive prediction error occurs when the actual value is ____ than the predicted value.
higher
Negative prediction error occurs when the actual value is ____ than the predicted value.
lower
No-information model for a numerical variable
Always predicts the mean.
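A minimal sketch of the no-information model and its prediction errors, with made-up numbers. The error is taken as actual minus predicted, which matches the positive and negative cases above.

```python
from statistics import mean

# Hypothetical training values of the target variable.
training_values = [10, 12, 14, 16]

# The no-information model predicts the training mean for every new entity.
prediction = mean(training_values)

# Hypothetical actual values observed later.
new_actual = [15, 11]
errors = [actual - prediction for actual in new_actual]
# A positive error means the actual value was higher than predicted.
```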
Prediction interval
Gives a range of values for the prediction, between an upper and lower limit.
How much of the training data is the prediction interval based on?
The middle 95% of the data.
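One way the middle 95% could be found is by taking the 2.5% and 97.5% cut points. The data and the `"inclusive"` quantile method below are assumptions for illustration.

```python
from statistics import quantiles

# Hypothetical training values: the whole numbers 0 to 100.
training_values = list(range(101))

# n=40 gives cut points every 2.5%, so the first and last bound the middle 95%.
cuts = quantiles(training_values, n=40, method="inclusive")
lower_limit, upper_limit = cuts[0], cuts[-1]
```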
What two features need to be balanced in a prediction interval?
Accuracy and precision.
Accuracy
How often a model gets the prediction correct.
Precision
How close the predictions are to the actual values.
Scatter plot
A plot with two numerical variables used to visualise the association between the variables.
What features should be checked on a scatter plot?
Association, pattern/trend, scatter/variation.
Explanatory variable
The independent variable.
What axis is the explanatory variable plotted on?
The x-axis.
What axis is the response variable plotted on?
The y-axis.
Response variable
The dependent variable - what we are trying to predict.
Correlation
A measure of association strength between two variables.
Rank correlation
Measure of how well the variables match up if ordered from smallest to largest.
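The cards do not name a specific rank correlation, so this sketch assumes Spearman's version: the ordinary (Pearson) correlation applied to ranks. The helper names and data are hypothetical, and tied values are not handled.

```python
def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)

# Hypothetical data: y always increases with x, but not in a straight line.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 8, 16, 32]

linear_r = pearson(hours, score)
rank_r = pearson(ranks(hours), ranks(score))
```

Because the ordering of the two variables matches perfectly, the rank correlation is 1 even though the ordinary correlation is lower.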
Linear model
A straight line that goes through the centre of the data - a line of best fit - that is used to make a point prediction.
Residual
The prediction error used in a linear model.
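A sketch of fitting a line and computing residuals. The cards only say "line of best fit", so the least-squares method here is an assumption, and the data is invented.

```python
# Hypothetical explanatory (x) and response (y) values.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept (assumed fitting method).
slope = (
    sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    / sum((a - mean_x) ** 2 for a in x)
)
intercept = mean_y - slope * mean_x

def predict(value):
    """Point prediction from the fitted line."""
    return intercept + slope * value

# Residual = actual value minus predicted value.
residuals = [b - predict(a) for a, b in zip(x, y)]
```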
What is used as the residual in a linear model?
A prediction error that contains 95% of the data.
Why is the middle 95% of the distribution used?
It shows us where the likely/usual values for the distribution are.
Tail proportion
The proportion of values that are above or below a value of interest.
If the tail proportion is less than 2.5%, it can be considered ____ for the distribution.
unusual
If the tail proportion is more than 2.5%, it can be considered ____ for the distribution.
usual
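A rough sketch of a tail proportion check. The function name is hypothetical, and counting values "at or beyond" the value of interest (rather than strictly beyond) is an assumption.

```python
def tail_proportion(values, observed, tail="upper"):
    """Proportion of values at or beyond `observed` in the chosen tail."""
    if tail == "upper":
        hits = sum(v >= observed for v in values)
    else:
        hits = sum(v <= observed for v in values)
    return hits / len(values)

# Hypothetical distribution: the whole numbers 1 to 200.
values = list(range(1, 201))

extreme = tail_proportion(values, 198)  # few values this high
typical = tail_proportion(values, 150)  # many values this high

extreme_is_unusual = extreme < 0.025
typical_is_unusual = typical < 0.025
```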
Null model
A baseline model used to account for any uncertainty in the process that may have led to the observed result.
Null hypothesis
The “just chance” explanation for a particular situation.
Chance variation
Explanation for why even if the underlying proportion is a certain value, we may not see this proportion when generating data from the model.
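Chance variation can be seen by simulating from a model with a known proportion. The underlying proportion of 0.5, the sample size of 20, and the seed are all arbitrary choices for this sketch.

```python
import random

rng = random.Random(42)  # fixed seed so the simulation is repeatable

def simulate_proportion(p, n, rng):
    """Generate n outcomes with underlying proportion p; return the observed proportion."""
    return sum(rng.random() < p for _ in range(n)) / n

# 1000 simulated samples of size 20 from a model where the true proportion is 0.5.
simulated = [simulate_proportion(0.5, 20, rng) for _ in range(1000)]

lowest, highest = min(simulated), max(simulated)
```

The simulated proportions differ from sample to sample even though the underlying proportion never changes.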
Experiment
A study in which the researcher will control, manipulate, or change the conditions the experimental units experience.
Random variation
Differences in group summaries that are due to chance and nothing else, e.g. due to random allocation to groups.
Reference distribution
A distribution showing the random variation.
Random allocation
Experimental units are allocated to treatments such that it is equally likely that each treatment is applied to each unit.
Randomisation test
A test producing a reference distribution of what could be explained by ‘just chance’.
If the tail proportion of the observed result is less than 2.5% in the reference distribution, it is ____ with the null model and ____ a result of chance.
not compatible, not likely
If the tail proportion of the observed result is more than 2.5% in the reference distribution, it is ____ with the null model and ____ a result of chance.
compatible, could be
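A minimal sketch of a randomisation test, assuming the group summary is the difference in means and that only the upper tail is of interest. The data, function name, number of shuffles, and seed are all hypothetical.

```python
import random

def randomisation_test(group_a, group_b, n_shuffles=2000, seed=1):
    """Return the observed difference in means and its upper tail proportion."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

    pooled = group_a + group_b  # a new list, so the inputs are not modified
    reference = []
    for _ in range(n_shuffles):
        # Re-allocate units to groups at random, as the null model assumes.
        rng.shuffle(pooled)
        new_a = pooled[:len(group_a)]
        new_b = pooled[len(group_a):]
        reference.append(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))

    tail = sum(d >= observed for d in reference) / n_shuffles
    return observed, tail

# Hypothetical experimental results.
treatment = [24, 27, 29, 31, 33, 35]
control = [10, 12, 13, 15, 16, 18]

observed, tail = randomisation_test(treatment, control)
compatible_with_null = tail >= 0.025
```

With groups this clearly separated, very few random re-allocations produce a difference as large as the observed one, so the result is not compatible with the null model.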