JM

CH 1

1.1

  • Data is information, especially facts or numbers, usually collected or computed for purposes of analysis.

  • A zettabyte is one sextillion (10^21) bytes.

Common sources of data

  • social networks → human generated data

    • Facebook, Twitter, blogs, YouTube

  • Traditional business systems → data produced by public agencies or businesses

    • medical records

    • commercial transactions, credit cards, etc (businesses)

  • internet of things → data from sensors

    • fixed sensors

      • home automation

      • weather/pollution sensors

    • mobile sensors (tracking)

      • GPS

      • cars

    • data from computer sensors

      • logs, web logs

Three types of data analytics exist:

  • Descriptive data analytics seeks to describe data, providing insight and knowledge. Ex: Based on collected data, the world population in 2015 was about 7 billion.

  • Predictive data analytics seeks to make predictions from data. Ex: Using models based on birth rates, death rates, medical care improvements, and other data, the United Nations predicts the world population will reach 11.2 billion in 2100.

  • Prescriptive data analytics seeks to make decisions (prescriptions) based on data. Ex: Population predictions for specific countries help the United Nations decide where to focus agricultural development efforts.
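As a rough illustration of how the three types differ, the sketch below applies each one to the same small series. The population figures, the `capacity` threshold, and the helper `fit_line` are all made up for illustration and do not come from any real data set.

```python
# Hypothetical yearly population counts (in thousands) for an imaginary town.
years = [2000, 2005, 2010, 2015]
population = [100, 110, 120, 130]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Descriptive: summarize what the collected data says.
latest = population[-1]

# Predictive: extend a fitted trend to a year that has not been observed.
slope, intercept = fit_line(years, population)
predicted_2020 = slope * 2020 + intercept

# Prescriptive: turn the prediction into a decision (assumed threshold).
capacity = 135
decision = "expand services" if predicted_2020 > capacity else "no change"

print(latest, round(predicted_2020, 1), decision)
```

A straight-line trend is only one possible model; real predictive analytics would choose a model suited to the data.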

Types of data

Variables

Data is typically represented using variables. A variable is an item that can have different ("varying") values. Ex: A person's age is a variable and can have the value 10, 33, 99, or other values. Variables are often considered as being of two possible types:

  • A quantitative variable can take on a numeric value (quantitative data) that can be measured and ordered. Ex: A person's age, the outside temperature, and a meal's price are quantitative variables. Example numeric values are an age of 33 or 99 years, a temperature of 40 or 45 degrees, and a price of 12 or 15 dollars.

  • A categorical variable can take on the value (usually a label) of one of several categories. Ex: A person's blood type, seasons, and U.S. companies are categorical variables. Blood type can be A, B, AB, or O, seasons can be fall, winter, spring, or summer, and U.S. companies can be Wal-Mart, McDonald's, UPS, etc. A categorical variable is often called a qualitative variable (known by qualities, rather than quantities).
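The distinction between the two variable types can be sketched in code. The records and the helper `variable_type` below are hypothetical, and the rule used (numbers are quantitative, everything else is categorical) is a simplifying assumption.

```python
# Hypothetical records; the field names and values are made up for illustration.
people = [
    {"age": 33, "blood_type": "A"},
    {"age": 99, "blood_type": "O"},
    {"age": 10, "blood_type": "AB"},
]

def variable_type(values):
    """Label a variable quantitative if every value is a number, else categorical."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "quantitative"
    return "categorical"

ages = [p["age"] for p in people]
blood_types = [p["blood_type"] for p in people]

age_kind = variable_type(ages)            # numeric values can be measured and ordered
blood_kind = variable_type(blood_types)   # labels can only be grouped and counted
sorted_ages = sorted(ages)

print(age_kind, blood_kind, sorted_ages)
```

This rule is not reliable in general: numeric codes such as ZIP codes are stored as numbers but behave as categories, so the type should be decided by what the values mean.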

1.2

Descriptive statistics

Descriptive statistics are methods to summarize and describe a variable's important characteristics. Ex: The median price for homes in an area is a meaningful way to numerically summarize the price variable from data about homes.
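A minimal sketch of the home-price example, using Python's standard `statistics` module. The prices are invented for illustration.

```python
import statistics

# Hypothetical home prices in dollars; one home is far more expensive than the rest.
home_prices = [250_000, 310_000, 275_000, 1_200_000, 290_000]

median_price = statistics.median(home_prices)  # middle value once sorted
mean_price = statistics.mean(home_prices)      # pulled upward by the expensive home

print(median_price, mean_price)
```

With these numbers the median is 290,000 while the mean is 465,000, which suggests why the median is often preferred for summarizing home prices.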

An effective way to begin understanding a variable is to examine the variable's distribution. The distribution of a variable is the possible values the variable can take on and a measure of how often each value occurs. Visualizing a variable's distribution with a graph gives insights into the distribution's shape. For quantitative variables:

  • A cluster is a distinct group of neighboring values in a distribution that occur noticeably more often than the values on either side of the group.

  • The tails of a distribution are the end values of the distribution. The left tail refers to the lowest values of the distribution, and the right tail refers to the highest values of the distribution.
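One way to look at a distribution without a plotting library is to count how often each value occurs and print a text histogram. The ages below are hypothetical.

```python
from collections import Counter

# Hypothetical ages of ten people in a class.
ages = [21, 22, 22, 23, 22, 21, 35, 22, 23, 60]

# The distribution: each value paired with how often it occurs.
distribution = Counter(ages)
for value in sorted(distribution):
    print(f"{value:>3} {'#' * distribution[value]}")

# The values 21 to 23 form a cluster; 22 occurs most often.
most_common_value, most_common_count = distribution.most_common(1)[0]

# The ends of the distribution: lowest value (left tail) and highest (right tail).
left_end = min(ages)
right_end = max(ages)
```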

Inferential statistics

The questions and objectives of statistics projects often need conclusions to be made beyond the scope of the collected sample data to a broader population. Inferential statistics are methods that result in conclusions and estimates about the population based on a sample. Testing claims about the population and estimating quantities of the population are the two primary inference methods. A numerical quantity of the population, such as the population mean or population proportion, is called a parameter. Population parameters are usually unknown, and inferential statistics allow for generalizations to be made about the population based on the observed sample.
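Estimating a population quantity from a sample can be sketched as follows. The population here is simulated (heights drawn from an assumed normal distribution), and the 95% interval uses the normal approximation; both are assumptions made for the sake of the example.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Simulated population of heights in cm; a real population is rarely fully observed.
population = [random.gauss(170, 10) for _ in range(100_000)]
population_mean = statistics.mean(population)  # the parameter (normally unknown)

# Draw a sample and compute an estimate of the population mean from it.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% interval for the population mean (normal approximation).
low = sample_mean - 1.96 * standard_error
high = sample_mean + 1.96 * standard_error

print(f"estimate {sample_mean:.1f} cm, interval ({low:.1f}, {high:.1f})")
```

Only the 200 sampled values are used to build the estimate and interval; `population_mean` is computed here only because the simulation makes it available for comparison.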

1.3

Data can be collected in two major ways:

  • In an observational study, data is collected by measuring/recording variables of interest without any direct intervention.

  • In an experiment, some of the variables are controlled when the data is collected due to direct intervention by the researchers conducting the experiment.

A lurking variable is a variable that was not recorded or accounted for in the data collection process that impacts a variable of interest.

Causation exists when a cause-and-effect (causal) relationship exists between two variables. The saying "correlation does not imply causation" means that two variables being associated, correlated, or otherwise dependent upon each other is not enough to conclude the two variables have a causal relationship. A spurious relationship (or spurious correlation) occurs when two unrelated variables falsely appear to have a cause-and-effect relationship.

A spurious relationship can occur by random chance, but is often the result of lurking variables influencing both the supposed "cause" and "effect" variables. A confounding variable is a variable that influences two other variables into a relationship, obscuring what is actually occurring. Ex: When considering alcohol and heart disease, age is a confounding variable since older people are more at risk of heart disease and are also less likely to drink regularly.
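A small simulation can show how a lurking variable produces a correlation between two variables that do not affect each other. Everything below is synthetic: the variables `x` and `y`, the noise levels, and the helper `pearson` are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (spread_x * spread_y)

n = 5000
lurking = [random.gauss(0, 1) for _ in range(n)]

# Both variables depend on the lurking variable but not on each other.
x = [z + random.gauss(0, 1) for z in lurking]
y = [z + random.gauss(0, 1) for z in lurking]
observed = pearson(x, y)

# Subtract the lurking variable's contribution; the association should vanish.
x_adjusted = [a - z for a, z in zip(x, lurking)]
y_adjusted = [b - z for b, z in zip(y, lurking)]
adjusted = pearson(x_adjusted, y_adjusted)

print(f"observed r = {observed:.2f}, after adjusting r = {adjusted:.2f}")
```

The adjustment is only possible here because the lurking variable was recorded in the simulation; in a real observational study it is, by definition, not in the data.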

1.4

Surveys are conducted to allow statisticians to make generalizations about a population.

A population is any collection of objects, people, or things about which statistical inferences are made.

A sampling unit is an individual in the population on which a measurement can be taken.

The sampling frame is the subset of the population from which a sample is drawn.

The sample is composed of the sampling units that provide data to be collected.

A parameter of a population is a numerical characteristic of a population, such as mean, median, or standard deviation. A statistic is a numerical characteristic of a sample, rather than the population.
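The survey terms above can be lined up in a short sketch. The population of exam scores, the choice of which students belong to the sampling frame, and the sample size are all hypothetical.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: exam scores for 1,000 students, keyed by student ID.
# Each student is a sampling unit.
population = {student_id: rng.randint(40, 100) for student_id in range(1000)}

# Sampling frame: the part of the population the sample is drawn from
# (assumed here to be the first 800 student IDs).
sampling_frame = list(range(800))

# Sample: the sampling units that actually provide data.
sample_ids = rng.sample(sampling_frame, 50)
sample_scores = [population[i] for i in sample_ids]

parameter = statistics.mean(list(population.values()))  # describes the population
statistic = statistics.mean(sample_scores)              # describes the sample

print(f"parameter = {parameter:.1f}, statistic = {statistic:.1f}")
```

Because the frame leaves out 200 students, the statistic can differ from the parameter both by chance and because the frame does not cover the whole population.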