Introduction to Probability and Statistics: The Nature of Data and Basic Terms

Course Overview and Learning Outcomes

The course MF004, titled Introduction to Probability and Statistics, begins by establishing the foundational nature of the field. Upon completing the first chapter, students are expected to demonstrate proficiency in five specific areas. First, students must demonstrate a clear understanding of six basic terms fundamental to statistical analysis. Second, they must be able to distinguish between the two primary branches of statistics. Third, students should be able to identify and categorize various types of data. Fourth, they must determine the appropriate measurement levels for different variables. Finally, they should be able to identify and apply the four basic methods of sampling.

Introduction to Statistics

Statistics as a field of knowledge is formally defined as the science of collecting, organizing, and analyzing data. This definition is sourced from Upton, G. J. G., and Cook, I. (2014) in the book "A Dictionary of Statistics" (3rd ed.) published by Oxford University Press. It serves as the intellectual framework for interpreting information through a systematic mathematical lens.

Six Basic Terms of Statistics

There are six essential terms used to describe statistical concepts and their applications. The first term is a Variable, which is defined as a number or a description that can take more than one value. Examples of numeric values include sets such as {0, 1, 2, 3} or {1, 2, 3, 4, 5, 6}. Variables can also represent continuous data, such as a set of temperature readings like {25.0, 26.1, 25.4, 27.2}. Non-numeric variables might include descriptions like the days of the week, ranging from Sun(0) to Sat(6), or dates such as 1st Jan through 31st Dec.

The second term is a Random Variable. This is a specific type of variable whose values cannot be determined or predicted exactly. Examples include the outcome of rolling a die or the result of a lottery draw. The third term is Population, which refers to all the subjects of interest for a random variable. In many practical scenarios, a population may be too large to handle or analyze in its entirety.
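
The die example can be sketched in a few lines of Python. This is only an illustration, not part of the course material: the helper name `roll_die` and the choice of the standard `random` module are assumptions made for the sketch.

```python
import random


def roll_die(rng: random.Random) -> int:
    """Return one outcome of a fair six-sided die (hypothetical helper)."""
    return rng.randint(1, 6)  # randint includes both endpoints


rng = random.Random()  # unseeded, so the outcomes differ from run to run
outcomes = [roll_die(rng) for _ in range(10)]

# Each value lies between 1 and 6, but none can be predicted before the roll.
print(outcomes)
```

The set of possible values {1, 2, 3, 4, 5, 6} is known in advance; the value that actually occurs on a given roll is not, which is what makes the outcome a random variable.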

The fourth term is a Sample, which is defined as a group of subjects selected from the population for the purpose of investigation. Samples are used specifically in cases where the population is too large to manage. For instance, in a study of university students, a sample might be taken from the larger student body. The fifth term is a Parameter. A parameter is a number that summarizes data for a population. Because populations are often too large to handle, the exact value of a parameter may never be known with certainty. The final term is a Statistic, which is a number that summarizes data for a sample. These numbers are used to estimate the parameters of the corresponding population when obtaining the actual parameter is impossible.
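
The relationship between population, sample, parameter, and statistic can be illustrated with a short Python sketch. The population below is made up for the illustration (1,000 hypothetical student heights), and the sample size of 50 and the fixed seed are likewise assumptions, not values from the course.

```python
import random
import statistics

# Hypothetical population: heights in cm of 1,000 students (made-up values).
population = [150 + (i % 40) for i in range(1000)]

# Parameter: a number that summarizes the whole population.
parameter_mean = statistics.mean(population)

# Sample: 50 subjects selected at random from the population.
rng = random.Random(42)  # seed fixed only so the sketch is repeatable
sample = rng.sample(population, 50)

# Statistic: a number that summarizes the sample; it estimates the parameter.
statistic_mean = statistics.mean(sample)

print(f"parameter (population mean): {parameter_mean}")
print(f"statistic (sample mean):     {statistic_mean}")
```

Here the parameter can be computed only because the toy population is small enough to hold in memory. In practice the population is usually out of reach, and the statistic is the only number available.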

Branches of Statistics: Descriptive vs. Inferential

Statistics is divided into two main branches: Descriptive Statistics and Inferential Statistics. Descriptive Statistics consists of collecting data, organizing it, presenting it through tools such as graphs, charts, and tables, and summarizing it using measures like the mean, median, and variance. Examples of descriptive statistics include stating that last semester 417 students registered for MF004, or finding that the mean height of a specific class is 179.3 cm and the median height is 176.1 cm. Another example is noting that among a group of 200 students at UCSI, 74 of them wear glasses.
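
The summary measures named above can be computed with Python's standard `statistics` module. In this sketch the five heights are made-up values for illustration and do not reproduce the class figures quoted in the text; only the glasses counts (74 of 200) come from the text.

```python
import statistics

# Hypothetical heights (cm) for a small class; the values are made up.
heights_cm = [168.0, 171.5, 176.1, 180.2, 190.7]

mean_height = statistics.mean(heights_cm)
median_height = statistics.median(heights_cm)
variance_height = statistics.variance(heights_cm)  # sample variance (n - 1 divisor)

# Counts from the text: 74 of 200 students wear glasses.
wearing_glasses = 74
group_size = 200
proportion = wearing_glasses / group_size

print(f"mean height:   {mean_height:.1f} cm")
print(f"median height: {median_height:.1f} cm")
print(f"variance:      {variance_height:.1f} cm^2")
print(f"proportion wearing glasses: {proportion:.2f}")
```

Every number printed describes only the data actually collected. No claim is made about students outside the group, which is what keeps the analysis descriptive.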

Inferential Statistics focuses on broader applications, such as making conclusions about a population based on a sample, estimating unknown values, finding the probability of an event, uncovering relationships among variables, and making predictions. For instance, based on a single class, one might estimate that the mean height of all university students in Malaysia is 179 cm. Other inferential examples include predicting that 500 students will take MF004 next semester, or predicting that in 10 years, the mean student height in MF004 will reach 185 cm. Calculating the probability of a student wearing glasses as 0.37, or concluding that eating burgers causes obesity based on a sample of 100 students, are also inferential processes. Similarly, estimating a population of 2.1 million citizens with 740,000 licensed drivers based on 40 interviews is inferential, whereas the government record of 2,134,864 citizens and 743,198 licenses is a descriptive population data set.
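
The licensed-driver example can be worked through numerically to show how an inferential estimate compares with the descriptive record. The figures below are taken from the text; the variable names are assumptions made for this sketch.

```python
# Inferential figures from the text: estimates based on 40 interviews.
estimated_citizens = 2_100_000
estimated_licenses = 740_000

# Descriptive figures from the text: the government record of the population.
recorded_citizens = 2_134_864
recorded_licenses = 743_198

estimated_share = estimated_licenses / estimated_citizens
recorded_share = recorded_licenses / recorded_citizens
license_error = estimated_licenses - recorded_licenses

print(f"estimated share of licensed drivers: {estimated_share:.3f}")
print(f"recorded share of licensed drivers:  {recorded_share:.3f}")
print(f"estimate minus record (licenses):    {license_error}")
```

The recorded share is a parameter, because it is computed from the entire population. The estimated share is a statistic-based estimate, and the small gap between the two is the kind of error that inferential methods try to quantify.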

Types of Variables and Classification

Variables can be classified through several systems. One method is by the Q.Q.D.C. framework, which stands for Qualitative or Quantitative, and Discrete or Continuous. Qualitative variables are non-numeric and involve categories. Examples include gender (male, female), marital status (single, married, divorced), blood types (O, A, B, AB), and letter grades (A, A-, B+, etc.).

Quantitative variables are numeric. These are sub-divided into Discrete and Continuous variables. Discrete variables consist of countable values, such as the numbers 0, 1, 2, 3, ... or marks in a range like 0, 1, 2, ..., 100. Continuous variables consist of uncountable values often derived from measurements. Examples include height within a range such as 50 cm < x < 250 cm, weight such as 30 kg < x < 200 kg, volume V > 0 ml, and time t > 9 s.
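
As a rough illustration of the Q.Q.D.C. framework, the sketch below classifies a list of observations by the Python type of its values. The function name `classify_variable` and the type-based rule are assumptions made for this example. The rule is only a heuristic: a qualitative variable stored as numeric codes, such as days of the week coded 0 to 6, would be misclassified as discrete.

```python
def classify_variable(values):
    """Heuristic Q.Q.D.C. classification based on Python value types."""
    if not values:
        return "unclassified"
    if all(isinstance(v, str) for v in values):
        return "qualitative"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "quantitative, discrete"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "quantitative, continuous"
    return "unclassified"


blood_types = ["O", "A", "B", "AB"]   # categories
marks = [0, 55, 72, 100]              # countable values
heights_cm = [162.4, 175.0, 181.3]    # measurements

print(classify_variable(blood_types))
print(classify_variable(marks))
print(classify_variable(heights_cm))
```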

Levels of Measurement

Another way to classify types of variables is by their Levels of Measurement, which progress through four levels: Nominal, Ordinal, Interval, and Ratio. The Nominal level involves classification only, where no inherent order exists. Examples include types of diseases (diabetes, cancer, kidney failure, hypertension), gender (male, female), and colors (red, green, blue). The Ordinal level involves classification and a meaningful order, but the differences between levels are not quantifiable. Examples include shirt sizes (S, M, L, XL) and academic grades (A, B, C, D).

The Interval level includes classification and order, and the differences between values are meaningful; however, there is no true or absolute zero point. Examples include coordinate measurements such as longitude, which ranges over -180° < x < 180° (latitude spans -90° to 90°), and IQ scores. The highest level is the Ratio level, which includes classification, order, meaningful differences, and a true zero point, meaning ratios between values are significant. Examples include height, weight, age, and money. In the ratio level, values are countable (discrete) or measurable (continuous).
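
Because each level adds one property to the level before it, the four levels can be encoded as an ordered scale. The sketch below is one possible encoding, not a standard library feature: the `Level` enum and the three helper functions are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four levels of measurement, ordered from lowest to highest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def has_order(level: Level) -> bool:
    return level >= Level.ORDINAL


def has_meaningful_differences(level: Level) -> bool:
    return level >= Level.INTERVAL


def has_true_zero(level: Level) -> bool:
    return level >= Level.RATIO


# Example variables taken from the text, mapped to their levels.
examples = {
    "color": Level.NOMINAL,
    "shirt size": Level.ORDINAL,
    "IQ score": Level.INTERVAL,
    "height": Level.RATIO,
}

for name, level in examples.items():
    print(
        f"{name}: {level.name.lower()}, order={has_order(level)}, "
        f"differences={has_meaningful_differences(level)}, "
        f"true zero={has_true_zero(level)}"
    )
```

Under this encoding a statement such as "180 cm is 1.5 times 120 cm" is justified only for a ratio-level variable like height, while for an interval-level variable like IQ only differences between scores are meaningful.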