Week 1: Exploring and Collecting Data (Ch 1-3)

A comprehensive set of Q&A flashcards covering data, variables, distributions, and descriptive statistics from Week 1 notes (Ch 1-3).

43 Terms

1

What is data?

Recorded values (numbers or labels) together with their context, captured and stored.

2

What is a data warehouse?

A vast digital repository where data are stored.

3

What is Big Data?

Data sets so large and complex that collecting, managing, storing, and curating them poses special challenges.

4

What must you know before turning data into information?

The decision you want to make or the question data can answer, and how to communicate the answer.

5

What is data mining?

The process of extracting actionable information from data, often to guide decisions about future performance.

6

What is predictive analytics?

Analysis focusing on predicting future performance.

7

What is business analytics?

The use of data and statistical analysis to inform business decisions.

8

What does it mean that all data have a context?

Data values are information about a subject and are interpreted within context; data are organized into a data table.

9

What are the rows of a data table called?

Cases (individuals or units) about whom we record characteristics.

10

What are the columns of a data table called?

Variables (the characteristics recorded).

11

What is a categorical (qualitative) variable?

A variable that names categories and answers questions about how cases fall into those categories.

12

What is a quantitative variable?

A variable with numerical values and units that tells how much of something was measured.

13

What subtypes exist for categorical variables?

Ordinal, nominal, and binary.

14

What is an ordinal categorical variable?

Values with intrinsic order (e.g., Dissatisfied, Neutral, Satisfied).

15

What is a nominal categorical variable?

Values without intrinsic order (e.g., locations like South Australia, Victoria, etc.).

16

What is a binary categorical variable?

A categorical variable with only two possible values (e.g., gender).

17

Why do quantitative variables have units?

To indicate how values are measured, their scale, and magnitude.

18

What are the two types of quantitative variables?

Continuous and discrete.

19

Can a variable be both categorical and quantitative?

Yes, depending on the purpose. For example, age can be quantitative (measured in years) or categorical (child, teen, adult).
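As a rough illustration of the same variable serving both roles, the sketch below maps ages in years to category labels. The function name `age_group` and the cutoffs (13 and 20) are assumptions made for this example, not values from the notes.

```python
def age_group(age_years):
    """Map a quantitative age (in years) to a category label.

    The cutoffs 13 and 20 are illustrative assumptions only.
    """
    if age_years < 13:
        return "child"
    if age_years < 20:
        return "teen"
    return "adult"


ages = [8, 15, 34, 19, 52]                # quantitative: measured in years
groups = [age_group(a) for a in ages]     # categorical: labels only
print(groups)
```

Averaging `ages` is meaningful; for `groups`, only counts per category make sense.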

20

What is a histogram?

A graph of a quantitative variable that divides the range of values into bins and shows the number of cases in each bin as a bar.

21

What is a relative frequency histogram?

A histogram showing the percentage of cases in each bin instead of counts.
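The two histogram cards can be sketched in a few lines: count the cases in each bin, then convert the counts to percentages. The helper names, the bin width of 10, and the data are all made up for illustration.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count the values falling in each bin [left, left + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def relative_frequencies(counts):
    """Convert bin counts to percentages of all cases."""
    total = sum(counts.values())
    return {left: 100 * c / total for left, c in counts.items()}


values = [12, 15, 21, 22, 28, 31, 35, 38, 39, 47]   # made-up data
counts = histogram_counts(values, bin_width=10)      # keys are left bin edges
percents = relative_frequencies(counts)

for left in counts:
    print(f"{left}-{left + 10}: {counts[left]} cases, {percents[left]:.0f}%")
```

Both versions have the same shape; only the vertical scale changes from counts to percentages.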

22

What are the three things to describe when looking at a distribution?

Shape, center, and spread.

23

What is a mode?

A peak (hump) in a distribution. A distribution with one mode is unimodal, with two modes bimodal, and with three or more multimodal.

24

What does symmetry mean in a distribution?

Halves on either side of the center look like mirror images.

25

What are tails in a distribution?

The thinner ends of the distribution.

26

What is skewness?

A distribution is skewed when one tail stretches farther than the other; it is said to be skewed toward the side of the longer tail.

27

What is an outlier?

A value that stands away from the body of the distribution; can affect methods and may indicate errors; should be discussed in conclusions.

28

How do you calculate the mean?

Sum of all values divided by the number of data values (n).

29

When should you use the median?

When a distribution is skewed, has gaps, or contains outliers; the median is resistant to outliers.
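A small made-up example may help show what "resistant" means here: adding one extreme value pulls the mean a long way but barely moves the median. The salary figures are invented for illustration.

```python
from statistics import mean, median

salaries = [42, 45, 47, 50, 51]      # made-up values, in $1000s
with_outlier = salaries + [400]      # same data plus one extreme value

print("without outlier:", mean(salaries), median(salaries))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

The mean more than doubles once the outlier is included, while the median moves only from 47 to 48.5.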

30

How do the mean and median compare in a symmetric distribution?

If the distribution is roughly symmetric, the mean and median are close.

31

What is the range?

Max minus min; a simple measure of spread; not resistant to outliers.

32

What are quartiles and the IQR?

Q1 and Q3 frame the middle 50% of data; IQR = Q3 − Q1 (a robust spread measure).
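One way to compute quartiles and the IQR is with Python's standard `statistics.quantiles`. This is only a sketch on made-up data: textbooks and software use slightly different quartile conventions, so results on small data sets can differ from hand calculations.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 20]   # made-up values

# n=4 splits the data into quarters and returns the three cut points.
# The default method is 'exclusive'; other conventions may give
# slightly different quartiles.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```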

33

What are the variance and the standard deviation?

The variance (s^2) is the sum of squared deviations from the mean divided by n − 1 (roughly the average squared deviation); the standard deviation (s) is the square root of the variance.
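The s^2 notation conventionally denotes the sample variance, which divides the sum of squared deviations by n − 1 rather than n. The sketch below computes it step by step on made-up data and compares the result with the standard library's `statistics.variance` and `statistics.stdev`.

```python
import math
from statistics import mean, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]    # made-up values

m = mean(data)                                  # the mean
squared_devs = [(x - m) ** 2 for x in data]     # squared deviations from the mean
s2 = sum(squared_devs) / (len(data) - 1)        # sample variance: divide by n - 1
s = math.sqrt(s2)                               # standard deviation

print("variance:", s2, "library:", variance(data))
print("std dev: ", s, "library:", stdev(data))
```

Because the variance is in squared units, the standard deviation is usually the one reported: it is in the same units as the data.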

34

When is standard deviation appropriate?

For roughly symmetric distributions without outliers, reported alongside the mean; like the mean, it is sensitive to outliers.

35

What is a five-number summary?

Minimum, Q1, median, Q3, and maximum (listed in order).
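A minimal sketch of a five-number summary, assuming the quartile convention of Python's `statistics.quantiles` (other tools may give slightly different quartiles). The function name `five_number_summary` and the data are hypothetical.

```python
from statistics import median, quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)   # default 'exclusive' method
    return (min(values), q1, median(values), q3, max(values))


data = [3, 7, 8, 5, 12, 14, 21]          # made-up values, unsorted on purpose
summary = five_number_summary(data)
print(summary)
```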

36

What is a boxplot?

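A plot of the five-number summary: the central box spans Q1 to Q3 (the middle 50% of the data, with length equal to the IQR) and is marked at the median; the whiskers extend to the most extreme values that are not outliers, and their relative lengths hint at skewness; outliers are plotted separately.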

37

What is a z-score?

A standardized value: (value − mean) / standard deviation; tells how many standard deviations a value is from the mean.
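Written as a formula, with y for the data value, \bar{y} for the mean, and s for the standard deviation (this notation is an assumption; the notes give the formula only in words):

```latex
z = \frac{y - \bar{y}}{s}
```

A positive z means the value lies above the mean, a negative z means it lies below, and z has no units because the units of the numerator and denominator cancel.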

38

How do you determine which data point is more unusual using z-scores?

Compare the absolute values of the z-scores; the larger absolute value is more unusual.
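The comparison can be sketched as follows. The helper names and all the numbers (heights, weights, means, standard deviations) are made up for illustration; the point is that the sign of z is ignored when judging how unusual a value is.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


def more_unusual(z_a, z_b):
    """Return 'a', 'b', or 'tie' for whichever z-score is farther from 0."""
    if abs(z_a) > abs(z_b):
        return "a"
    if abs(z_b) > abs(z_a):
        return "b"
    return "tie"


# Hypothetical values on two different scales.
z_height = z_score(190, mean=170, sd=8)    # above the mean
z_weight = z_score(50, mean=70, sd=10)     # below the mean

print(z_height, z_weight, more_unusual(z_height, z_weight))
```

Standardizing first is what makes the two measurements comparable even though they are on different scales.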

39

In the real estate example, which is more unusual: a $340,000 house or a 5000 sq ft house?

The 5000 sq ft house (z ≈ 4.46) is more unusual than the $340,000 house (z = 3.0).

40

What does a z-score of 2 indicate?

Two standard deviations above the mean.

41

What is the purpose of the boxplot’s box and whiskers?

Box shows the middle 50% (IQR); whiskers indicate spread and potential skewness; outliers shown separately.

42

What is the five-number summary used for in boxplots?

To describe a distribution and provide input for a boxplot visualization.

43

Why should outliers be noted in conclusions?

They can be the most informative part of the data and may indicate data quality issues or true extremes.