Design, Data and Decisions Lecture 1 Flashcards
33116 Design, Data & Decisions Module 1, Session 1, Lecture 1

Module 1 – Descriptive Statistics
Lecture One

In this lecture we will think about:
• Graphical summaries of a single variable
  o Categorical and Continuous
• Numerical summaries of a single variable
  o Where is the location of the data?
  o What is the spread of the data?

Structure of our data
Data comes in many forms nowadays
• Transactional – records interactions between and within systems
• Images – you now get your MRI as a file to keep…
• Ongoing – we are generating data constantly…
But to start analysing and describing our data we need structure

Obs   Attribute 1   Attribute 2   Attribute 3   …   Attribute p
1     22.3          Aus           4             …   Low
2     41.7          Overseas      7             …   High
…     …             …             …             …   …
n     7.23          Aus           3             …   Middle

Types of Variables – describing the attributes of our data
Categorical (Qualitative) (divides the units into a distinct set of groups/categories)
• Nominal (no order): nationality, gender
• Ordinal (natural ordering): age group, level of education, BMI group, month of birth, season
Quantitative (measures a numerical quantity for each unit)
• Discrete (takes whole number values): age (in years), number of birds in a tree, number of damaged cells, number of damaged bytes
• Continuous (takes any real number): time since birth, height, temperature, data transfer time, BMI
Recognising the type of variable a particular attribute is will help us identify the correct analysis and appropriate Statistics

Categorical Variables
All we can do is count how many observations are in a category and report the frequencies or counts
• Might report the relative frequencies as percentages
  o $\frac{\text{frequency for a category}}{\text{total frequency}} \times 100$

Self-reported activity level for 92 individuals in an experiment

Activity Level   Frequency   %
Slight           10          10.9%
Moderate         61          66.3%
High             21          22.8%
Total            92          100.0%

Categorical Variables – Bar Chart / Pie Chart
As the names suggest, a Bar Chart uses a bar for each category
• Frequency or Relative Frequency (%s) in each category
A Pie Chart shares out the pie with slices for each category
• The angle of the slice is proportional to the relative frequency
  o 10% is 36°, 25% is 90°, 50% is 180°…

Bar Chart or Pie Chart?
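As a reference point for this comparison, the relative-frequency formula and the pie-slice angles from the Categorical Variables slides can be checked with a short sketch. The Python below is only an illustration and not part of the lecture: the function names `relative_frequencies` and `pie_angles` are assumptions made for this example, and the counts are the ones in the activity-level table.

```python
# Counts copied from the activity-level table (92 individuals in total).
counts = {"Slight": 10, "Moderate": 61, "High": 21}


def relative_frequencies(counts):
    """Return each category's share of the total as a percentage."""
    total = sum(counts.values())
    return {category: 100 * freq / total for category, freq in counts.items()}


def pie_angles(counts):
    """Return the pie-slice angle in degrees for each category.

    The angle is proportional to the relative frequency, so 100% maps to 360 degrees.
    """
    percentages = relative_frequencies(counts)
    return {category: 360 * pct / 100 for category, pct in percentages.items()}


percentages = relative_frequencies(counts)
angles = pie_angles(counts)

# Print one row per category: frequency, percentage, slice angle.
for category in counts:
    print(
        f"{category:<10}{counts[category]:>5}"
        f"{percentages[category]:>8.1f}%"
        f"{angles[category]:>8.1f} deg"
    )
```

Both chart types display exactly these numbers: a bar's height is the frequency or percentage, while a pie slice's angle is that percentage of 360°.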
Bar Charts are suitable for:
• Ordinal Variables – shows order of categories
• Nominal Variables – order by the category frequencies
• Comparing distributions of two categorical variables – must use relative frequencies
Pie is best left on the plate BUT if you want to use a Pie Chart then it is only suitable for a single Nominal Variable without too many categories
• Don’t use 3D – it changes the relative sizes of the slices
[Figure: example pie charts with slices A, B and C]

Quantitative Variables – Graphical Summaries
Let us look at the heights for the same experimental data

height in cm Stem-and-Leaf Plot

Frequency   Stem & Leaf
 1.00       15 . 4
 6.00       15 . 677779
 6.00       16 . 000022
13.00       16 . 5555677777777
17.00       17 . 00000002222222222
17.00       17 . 55555555556777777
15.00       18 . 000000122222222
14.00       18 . 55555556677777
 3.00       19 . 000

Stem width: 10
Each leaf: 1 case(s)

Quantitative Variables – what is the Shape?
We like ‘distributions’ that are symmetric or bell-shaped
[Figures: Age of COVID deaths (adults); Age of COVID cases (adults)]

Quantitative Variables – where is the centre Located?
With Quantitative Variables we consider three ‘Sample Statistics’ to measure the Location of our distribution
• MODE – most common value
  o Only makes sense for Discrete (also useful for Ordinal – and the only suitable measure)
• MEDIAN – the middle value when the observations are ordered
• (Arithmetic) MEAN – the value we get if the total of the observations is shared evenly
  o You would also know this as the ‘average’

Quantitative Variables – Median
Let’s consider a sample of 11 of us with the variable age.
• 17, 19, 18, 18, 18, 20, 19, 18, 21, 23, 52 (me)
Order that data from lowest to highest
• 17, 18, 18, 18, 18, 19, 19, 20, 21, 23, 52
Find the value of the middle observation
• If it is 11 observations then that is the 6th (in order)
• 17, 18, 18, 18, 18, 19, 19, 20, 21, 23, 52, so the median is 19 years (52 is an OUTLIER)
• What about 10 observations (excluding me)?
  o 17, 18, 18, 18, 18, 19, 19, 20, 21, 23: the middle sits between the 5th (18) and 6th (19) values, so 18.5 years

Quantitative Variables – Median
General approach to calculating the median in our sample of n observations
1. Order the data from lowest to highest
2. The median is the $\frac{n+1}{2}$th observation in the ordered list
   a. When n is an odd number that is a specific observation
   b. When n is an even number it sits between the middle two values (halfway between them, as in the 18.5 above)

Quantitative Variables – Mean
Let’s consider a sample of 11 of us with the variable age.
• 17, 19, 18, 18, 18, 20, 19, 18, 21, 23, 52 (me, the OUTLIER)
Add up all the values
• 17+19+18+18+18+20+19+18+21+23+52 = 243
Divide the total evenly amongst the observations
• $\frac{243}{11} \approx 22.1$ years
• If it’s the 10 observations excluding me then
  o 17+19+18+18+18+20+19+18+21+23 = 191
  o $\frac{191}{10} = 19.1$ years
  o Compare the median: 19 (or 18.5 excluding me) – outliers impact the mean…

Quantitative Variables – Mean
For variable $x$ with observations $x_i$, $i = 1, \dots, n$, from a sample of $n$ observations, the sample mean $\bar{x}$ is

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Mean pulled down relative to median
• Mean age at death lower than Median
Mean pulled up relative to median
• Mean income higher than Median

Quantitative Variables – how Spread is the distribution?
With Quantitative Variables we consider three ‘Sample Statistics’ to measure the Spread of our distribution
• RANGE – just the difference between the smallest and largest values
  o So 52-17 = 35 years (my outlier age has a BIG impact)
• INTER-QUARTILE RANGE – the range of the middle 50% of the data when the observations are ordered
  o Spread around the MEDIAN
• STANDARD DEVIATION – a measure of how spread-out the observations are around the MEAN
  o How far I might expect an observation to be from the MEAN

Quantitative Variables – Inter-Quartile Range
Let’s consider a sample of 11 of us with the variable age.
• 17, 19, 18, 18, 18, 20, 19, 18, 21, 23, 52 (me, the OUTLIER)
Order that data from lowest to highest
• 17, 18, 18, 18, 18, 19, 19, 20, 21, 23, 52
Split the data into four equal groups (quartiles)
• If it is 11 observations then the quartiles are the 3rd, 6th and 9th observations (18, 19 and 21)
• 17, 18, 18, 18, 18, 19, 19, 20, 21, 23, 52
  o Range of ‘middle’ 50% is 21-18 = 3 years
• What about 10 observations (excluding me)?
  o 17, 18, 18, 18, 18, 19, 19, 20, 21, 23; the quartiles are 18 and 20.25, so 20.25-18 = 2.25 years

Quantitative Variables – Inter-Quartile Range
General approach to calculating the inter-quartile range in our sample of n observations
1. Order the data from lowest to highes