Econ Data Analytics Midterm Spring 2025

Economics

81 Terms

1

What does each row of a rectangular dataset represent?

An Observation

2

What does each column of a rectangular dataset represent?

A Variable (for observations)

3

What is Personally Identifying Information (PII)?

Any information from a data set that could be used to individually identify a person

4

Give an example of PII:

address, first and last name, SSN, birthday etc

5

What is the difference between POPULATION and SAMPLE?

Population refers to the entire group of individuals or items being studied, while a sample is a subset of that population selected for analysis

6

Some studies collect and utilize qualitative data. Name one type of qualitative data.

An interview transcript

7

Name two steps you may need to take to prepare your data for analysis.

  1. Transpose records (e.g. from horizontal to vertical or vice versa) to get the right units.

  2. Collapse records so that smaller, more specific records are combined or summed together, making the data easier to draw observations from.

8

What are imputations? Give methods of imputation.

  • methods of filling in gaps (missing values) in data

  • Methods:

    • use a value from related information → e.g. someone in the same household

    • set to the mean → overall or subgroup

    • predict from a regression or multiple regression
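
As a rough illustration of the mean method above, here is a minimal Python sketch. The income figures and the `impute_mean` helper are hypothetical, and missing values are assumed to be stored as `None`.

```python
from statistics import mean

# Hypothetical data: household incomes with gaps recorded as None.
incomes = [50000, None, 40000, 60000, None]

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

imputed = impute_mean(incomes)
```

A subgroup version would apply the same function separately to each subgroup's values.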

9

Name the three measures of central tendency

  • mean

  • median

  • mode

10

Name the three measures of variability (spread)

  • range

  • standard deviation

  • variance
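
The measures of central tendency and spread named on these cards can be computed with Python's standard `statistics` module. This is only a sketch on made-up scores; the population (divide by N) versions are used here as an assumption.

```python
import statistics

# Hypothetical scores, used only for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

center = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
}
spread = {
    "range": max(scores) - min(scores),
    "variance": statistics.pvariance(scores),  # population version (divide by N)
    "std_dev": statistics.pstdev(scores),      # square root of the variance
}
```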

11

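Which measure(s) of variability are expressed in the same unit as the variable? Why is that important?

  • range → it is just the difference between the maximum and minimum

  • standard deviation → the square root of the variance, so it returns to the variable’s own units

  • sharing the variable’s unit makes a measure easier to interpret and use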

12

Formula for Standard Deviation: (STDEV.S or STDEV.P)
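
For reference, the two Excel functions named on this card correspond to the standard sample (divide by N - 1) and population (divide by N) formulas, sketched here in LaTeX. The symbols are assumed notation: x_i is each value, \bar{x} and \mu are the sample and population means, and N is the number of values.

```latex
% STDEV.S: sample standard deviation
s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}

% STDEV.P: population standard deviation
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
```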

13

Formula for Variance: (VAR.S or VAR.P)

  • equal to the standard deviation squared

14

Formula for Range: (=max-min)

  • can also use “H - L + 1” (highest minus lowest plus one) to make it inclusive

15

What does the correlation coefficient measure? What can’t you say based on the correlation coefficient?

  • how strongly two variables are linearly related, from -1 (perfect negative relationship) to +1 (perfect positive relationship)

  • 0 means no linear relationship

  • canNOT say that one causes the other

16

Is 0.5 or -0.9 a stronger correlation coefficient?

-0.9 is the stronger correlation. Strength is the magnitude (absolute value) of the coefficient, and 0.9 is larger than 0.5. The negative sign only means the variables move in opposite directions; it does not mean they are unrelated.

17

Formula for Correlation Coefficient:
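
One common way to write the correlation coefficient is sketched below in LaTeX. This assumes the card means Pearson's r; the symbols (x_i and y_i for paired values, \bar{x} and \bar{y} for their means) are assumed notation.

```latex
r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}}
```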

18

What is the difference between validity and reliability?

  • Validity: does it measure what it’s supposed to? measure of accuracy

  • Reliability: does it work every time? measure of consistency

19

What are the types of numeric variables?

  • categorical

  • ordinal

  • continuous

  • discrete

  • binary

20

What is a categorical (nominal) variable?

two or more categories, where the numbers are only labels with no quantitative meaning or order.
example: hair color: 1 = brunette, 2 = blonde, 3 = red, 4 = grey

21

What is an ordinal variable?

two or more categories that have a meaningful order (levels).

example: level of education: 1 = elementary, 2 = secondary

22

What is a continuous variable?

can take any value between two points (like any point along a line on a graph)

23

What is a discrete variable?

takes only countable, separate values (usually whole numbers), e.g. number of children in household, cars in garage, trees in yard etc

24

What is a binary variable?

  • value of 1 or 0

  • example: female (0=no, 1= yes)

  • example: did you attend? (0=no, 1=yes)

25

What is time series data?

  • collected over time —> think Dad’s time-lapse of pond puddle

  • regular equal intervals

  • usually collected for same interval

  • example: ocean tides, quarterly revenue

26

What is cross-sectional data?

  • collected on different individuals

  • collected at one time or same period of time

  • Example: opinion polls, census

27

What is pooled data?

  • mixture of time series and cross-sectional

  • same piece of info for multiple people

  • example: annual GDP for multiple countries

28

What is panel data/longitudinal data?

  • info for the same cross-sectional sample is collected repeatedly

  • some variables are collected only once because they are constant

    • Gender

    • DOB

    • Race

  • others are collected over time because they change

    • Edu level

    • Earnings

    • Marital Status

29

What is extant data?

  • already available from organizations

  • was not collected FOR analysis but could be useful

    • HW and projects and art from schools

30

What is client data?

  • data that firms collect about themselves

  • sales, revenue, etc

  • usually proprietary so only in-house

31

What are Public Use Data Files (PUF)?

  • end of some studies —> data made public

  • stripped of all Personally Identifying Info (PII)

  • other data-masking techniques

32

What is Personally Identifying Information (PII)?

  • anything that could attach data to a person

    • DOB

    • Social Security Number

    • First and Last Names

    • Address

33

What are data-masking techniques?

  • dropping sensitive variables entirely

  • collapsing categorical variables with small cell sizes

34

What is a Restricted-Use Data File? (RUF)

  • most PII is stripped but other data is not masked

  • higher disclosure risk, so you may need

    • Data Use Agreement (DUA)

    • Memorandum of Understanding (MOU)

    • may need to work on computer in locked room etc

35

What may you find in Data Codebooks and Documentation?

  • list of variables

  • lots of time-saving info

    • sample definitions

    • description of data collection —> annotated survey

36

What are four methods of analysis?

  • Experimental

  • Quasi-experimental

  • Correlational

  • Descriptive

37

What are examples of quantitative data?

  • mean, median, mode

  • distributions, frequencies

38

Ways to collect qualitative Data?

  • interviews

  • observations

  • focus groups

  • survey write-in responses

39

How can you combine data from multiple files if they have observations under the same variables?

append/stack together the files one after the other
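
A minimal sketch of appending, assuming each file has already been read into a list of row dictionaries. The file names and values are hypothetical.

```python
# Hypothetical survey files that share the same variables (id, income).
survey_2023 = [{"id": 1, "income": 40000}, {"id": 2, "income": 52000}]
survey_2024 = [{"id": 3, "income": 45000}]

# Appending only makes sense when both files carry the same variables.
same_variables = survey_2023[0].keys() == survey_2024[0].keys()

# Stack the observations one after the other.
stacked = survey_2023 + survey_2024
```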

40

What are the three ways to merge files when the variables are split up between them?

  • one-to-one

  • one-to-many

  • many-to-many
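
The first two merge types can be sketched with plain Python dictionaries. All names and values below are hypothetical, and a real project would more likely use a statistics package or spreadsheet lookup.

```python
# One-to-one: each person id appears exactly once in both files.
demographics = {1: {"age": 34}, 2: {"age": 51}}
earnings = {1: {"income": 40000}, 2: {"income": 52000}}
one_to_one = {
    pid: {**demographics[pid], **earnings[pid]}
    for pid in demographics
    if pid in earnings
}

# One-to-many: one household record matches several person records.
households = {"A": {"region": "West"}, "B": {"region": "South"}}
persons = [
    {"pid": 1, "hh": "A"},
    {"pid": 2, "hh": "A"},
    {"pid": 3, "hh": "B"},
]
one_to_many = [{**person, **households[person["hh"]]} for person in persons]
```

In the one-to-many result the household's region is repeated on every person in that household, which is the expected behavior.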

41

What is a one-to-one file merger?

  • Take two files that each hold part of the needed variables, with each observation appearing once in each file

  • combine them on a shared ID into one new file with all variables

42

What is a one-to-many file merger?

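  • each record in one file (e.g. a household) is matched to several records in the other file (e.g. the people in that household)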

43

What is a many-to-many file merger?

  • very tricky!

  • to be avoided

44

What to do with extra data?

  1. Leave extra variables/observations and filter with “if/when” statements

  2. Create a new file and delete extras — keep raw file just in case

45

How to spot poor data quality?

  • will need a Data Dictionary

  • does variable take on expected values?

    • look for outliers

  • How much data is missing?

    • could use different notations: “missing” “.” “9999”
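
The checks above can be sketched in a few lines of Python. The raw values, the set of missing codes, and the 0 to 120 age range are assumptions chosen for illustration; a real data dictionary would supply them.

```python
# Hypothetical notations for "missing", as a data dictionary might list them.
MISSING_CODES = {"missing", ".", "9999", ""}

# Hypothetical raw age column, read in as text.
raw_ages = ["34", ".", "51", "9999", "missing", "430"]

# Recode every missing notation to None and convert the rest to integers.
cleaned = [
    None if value.strip().lower() in MISSING_CODES else int(value)
    for value in raw_ages
]

# How much data is missing?
share_missing = cleaned.count(None) / len(cleaned)

# Does the variable take on expected values? (assumed valid range: 0 to 120)
out_of_range = [age for age in cleaned if age is not None and not 0 <= age <= 120]
```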

46

How to fix poor data?

  • Conditional formatting in Excel

    • create a rule to flag out-of-range values

    • “top” and “bottom” rules to see outliers

  • filters —> view only certain values

  • visualization methods

    • histograms and box plots

47

What is a business rules document and what would be found on it?

  • a file that lists all the analytical decisions you made

    • explain to others what you did

    • show WHY you did it

    • lets someone else replicate your process

  • any dropped or constructed variables

  • any other imputations

48

What are the three quartiles based on the median?

  • Lower (QL) or First (Q1) —> 25% of data below

  • Median or Second (Q2) —> 50% of data below

  • Upper (QU) or Third (Q3) —> 75% of data below
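
A short sketch of the three quartiles using `statistics.quantiles` (Python 3.8 or later). The data are made up, and the `"inclusive"` method is an assumption; different software uses slightly different quartile conventions, so results can differ a little between tools.

```python
import statistics

# Hypothetical, already sorted values.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 splits the data into four equal parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The second quartile is the median.
median = statistics.median(values)
```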

49

Why would you use median instead of mean?

  • insensitive to extreme values

  • if data has outliers, median better reflects central tendency

  • depends on distribution

    • normal distribution - mean

    • skewed data - median

50

Where is the “mode” a useful measure of central tendency?

  • measure of non-numeric variables

    • most common hair color

    • party affiliation

    • college majors

51

Why use “N-1” for variation measures?

  • observations are typically closer to the sample mean than to the population mean

  • variance and std dev are calculated from the sample mean

    • dividing by N would underestimate the population values, dividing by N-1 corrects for that

  • N-1 matters more when N is small → less correction needed for a large sample
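
To see the N versus N-1 difference on a small sample, here is a sketch with made-up numbers. `statistics.pvariance` divides by N and `statistics.variance` divides by N-1.

```python
import statistics

# Hypothetical small sample; the correction matters most when N is small.
sample = [4, 8, 6, 2]

divide_by_n = statistics.pvariance(sample)         # sum of squares / N
divide_by_n_minus_1 = statistics.variance(sample)  # sum of squares / (N - 1)

# Ratio between the two estimates is N / (N - 1), which shrinks toward 1 as N grows.
ratio = divide_by_n_minus_1 / divide_by_n
```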

52

How to deal with outliers in data?

  • top-code (cap extreme high values at a ceiling) or bottom-code (raise extreme low values to a floor)

  • set outliers to missing
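
Top- and bottom-coding can be sketched as a simple clamp. The `top_bottom_code` helper, the incomes, and the floor and ceiling values are all hypothetical.

```python
def top_bottom_code(values, floor, ceiling):
    """Raise values below the floor to the floor and cap values above the ceiling."""
    return [min(max(value, floor), ceiling) for value in values]

# Hypothetical incomes with one implausibly low and one extremely high value.
incomes = [-500, 30000, 45000, 2500000]
coded = top_bottom_code(incomes, floor=0, ceiling=250000)
```

Whatever thresholds are chosen would normally be recorded in the business rules document so the step can be replicated.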

53

What are the four ways a distribution can vary?

  • average value (shift left or right)

  • variability (change shape of curve)

  • skewness

  • kurtosis

54

What is skewness? What are the two directions?

  • measure of lack of symmetry

  • positive skewness —> mean is greater than median

  • negative skewness —> median is greater than mean

55

What is formula for skewness?

  • xbar is mean

  • s is std dev

  • M is median
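
The symbols listed here match Pearson's median-based skewness coefficient, so a plausible reading of this card, offered as an assumption, is the following LaTeX sketch. The name Sk for the coefficient is assumed notation.

```latex
Sk = \frac{3\,(\bar{x} - M)}{s}
```

Under this form the coefficient is positive when the mean is greater than the median and negative when the median is greater than the mean.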

56

What is Kurtosis?

a measure of how flat or peaked the distribution is

57

What are the three forms of Kurtosis?

  1. Mesokurtic - bell-shaped (red)

  2. Platykurtic - flattish with thin tails (green)

  3. Leptokurtic - peaked with fat tails (purple)

58

What are two visual ways to represent interval grouping of data?

histograms —> each bar is one interval

  • covers the whole set of data

Cumulative Frequency Distribution

  • shows intervals and their frequency + total frequency
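
Both displays start from the same interval counts. The sketch below builds those counts and the running totals for made-up test scores, assuming intervals of width 10.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical test scores.
scores = [52, 58, 61, 67, 68, 73, 75, 79, 84, 91]
width = 10  # assumed interval width

# Label each score by the lower bound of its interval (50, 60, 70, ...).
counts = Counter((score // width) * width for score in scores)

intervals = sorted(counts)
frequencies = [counts[start] for start in intervals]  # histogram bar heights
cumulative = list(accumulate(frequencies))            # running total per interval
```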

59

What are dashboards and what are they used for?

  • visual presentations

  • used to track

    • historic and real-time data

    • Key Performance Indicators (KPI)

60

What is the correlation coefficient (r value) and how does it work?

  • measure of how two variables relate to each other

  • ranges from -1 to 1, with the magnitude being the strength

  • 0 means no correlation
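
As a sketch of how the r value behaves at its extremes, the function below computes Pearson's r by hand. The function name and the data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]
rises_with_hours = [2, 4, 6, 8, 10]   # moves in the same direction
falls_with_hours = [10, 8, 6, 4, 2]   # moves in the opposite direction

r_up = pearson_r(hours, rises_with_hours)
r_down = pearson_r(hours, falls_with_hours)
```

Both relationships are equally strong; only the sign differs.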

61

Rate the strength of several intervals of correlation coefficient

  • 0.8 to 1.0 —> very strong

  • 0.6 to 0.8 —> strong

  • 0.4 to 0.6 —> moderate

  • 0.2 to 0.4 —> weakish

  • 0.0 to 0.2 —> weak
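
The bands above can be turned into a small lookup, sketched below. Two assumptions are made: the bands apply to the absolute value of r, and a value exactly on a boundary (such as 0.6) goes to the higher band, since the card's intervals overlap at their endpoints.

```python
def describe_strength(r):
    """Label a correlation coefficient using the bands on this card."""
    size = abs(r)  # strength ignores the sign
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weakish"
    return "weak"

labels = [describe_strength(r) for r in (-0.9, 0.5, 0.1)]
```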

62

What is the formula for correlation coefficient?

63

What is a correlation matrix used for?

Comparing several variables all to each other

64

What is measurement?

assignment of values to outcomes following a set of rules

65

What are the four scales of measurement?

  • Nominal —> least precise

  • Ordinal

  • Interval

  • Ratio → includes absolute zero

66

What is the nominal level of measurement?

  • named categories - least precise

  • outcome only fits in one category

  • we know categories are different

  • DONT know how they relate

    • blonde/brunette/red/grey

67

What is the Ordinal level of measurement?

  • “ord” means order

  • categories are ordered

  • we know they’re different

  • we know how they rank

  • we DONT know how different the rankings are

    • job applications

68

What is the interval level of measurement?

  • intervals are ordered along a scale of equal positions

  • we know they’re different, how they rank, and the size of the difference between categories

    • tests - 10 questions right is twice 5 right

69

What is the ratio level of measurement?

  • most precise, includes absolute zero

  • only works in some disciplines:

    • physics —>no light, no molecular movement

    • BAD for knowledge tests —> zero on spelling test does NOT mean no spelling ability

70

What is the difference between observed and true score?

  • observed —> score they were given “i got 55!”

  • true —> what they actually know

    • can never really be tested perfectly

71

What is the error score and where can error come from?

  • difference between observed and true score

    • Observed = True + Error

  • goal is to minimize error score

  • outside factors that cause error

    • room too hot, too loud, i was sick, etc

    • measurement problems

72

What are the four forms of reliability?

  • test-retest

  • Parallel forms

  • Internal consistency —> within one test

  • Interrater

73

What is test retest reliability?

  • is it good over time?

  • same test, same ppl, two diff times

    • good test gives similar/same answer

    • calculate correlation between two sets of scores

74

What is parallel forms Reliability?

  • make sure two diff forms of a test are the same

    • “version A” (Blu) and “Version B” (Gre)

  • ensure that same ideas are tested

  • calculate correlation between two sets

75

What is internal consistency Reliability?

  • used to check consistency within a test

  • how well do diff measures for same concept yield the same result?

    • would a certain concept do better with multiple-choice or true-false?

  • Calculate Cronbach’s Alpha
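
A minimal sketch of Cronbach's alpha, using the usual form k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The `cronbach_alpha` helper and the scores are hypothetical, and sample (N-1) variances are assumed.

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per test item, holding every respondent's score."""
    k = len(item_scores)
    sum_item_variances = sum(statistics.variance(item) for item in item_scores)
    # Total score per respondent: add up that respondent's score on each item.
    totals = [sum(per_respondent) for per_respondent in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return k / (k - 1) * (1 - sum_item_variances / total_variance)

# Hypothetical extreme case: three items that rank four respondents identically.
items = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
]
alpha = cronbach_alpha(items)
```

Perfectly consistent items like these give the maximum alpha; real tests fall below it.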

76

What is Interrater Reliability?

  • see if different judges score the same way

    • judges at Olympics expected to give same score

  • whenever humans are used there is error

    • # of agreements / # of possible agreements
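
The agreement ratio above can be sketched for two raters as follows. The ratings are made up, and each rated item is assumed to count as one possible agreement.

```python
# Hypothetical pass/fail ratings from two judges on the same five performances.
judge_a = ["pass", "fail", "pass", "pass", "fail"]
judge_b = ["pass", "fail", "fail", "pass", "fail"]

# Count the items on which both judges gave the same rating.
agreements = sum(a == b for a, b in zip(judge_a, judge_b))
possible_agreements = len(judge_a)

interrater_reliability = agreements / possible_agreements
```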

77

What are the main goals for reliability coefficients?

  • need to be positive/direct

  • should be as large as possible

    • -0.7 is really bad, 0.3 still isn’t great

78

What are the three types of validity?

  • content

  • criterion

  • construct

79

What is content validity?

  • does the sampled content really represent the population

  • use on achievement tests

    • ask experts to make judgement that the items represent the universe of possible items on the same topic

80

What is criterion validity?

  • are scores systematically linked to other variables (criteria) that show the testee understands the material?

  • Concurrent validity → is the new measure similar to tried-and-true ones?

    • correlate new scores with proven ones

  • Predictive validity → ability of the test to predict future outcomes

81

What is Construct validity?

  • the test measures a psychological construct

    • correlate test scores with theorized outcome that reflect the construct you’re testing

    • example of measuring aggression from correlation with fights and suspensions