AC547 Midterm

106 Terms

1

selection bias

occurs when data are chosen in a way that does not reflect the real-world distribution of the data

2

survivorship bias

the tendency to draw conclusions based on things that have survived some selection process and to ignore things that did not survive. It is a cognitive bias and a form of selection bias.

3

mean

average; also sometimes called the "expected value"

4

median

the "middle" or the "50th percentile" of the data

5

mode

the most commonly occurring value

6

percentile

a measure of relative position: the value below which a given percentage of observations fall

7

What is an example of percentile?

the median is the 50th percentile

8

interquartile range

difference between the 75th percentile and the 25th percentile (the middle 50% of the data)
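
A minimal sketch of percentiles and the interquartile range in base R. The vector `x` is a made-up example, and `quantile()` is used with its default interpolation method, which is only one of several conventions for computing percentiles.

```r
# Hypothetical data, chosen only for illustration.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# 25th, 50th, and 75th percentiles (default interpolation method).
q <- quantile(x, probs = c(0.25, 0.50, 0.75))

# The 50th percentile is the median.
median_matches <- unname(q["50%"]) == median(x)

# Interquartile range: 75th percentile minus 25th percentile.
iqr_manual  <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(x)  # built-in shortcut for the same quantity
```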

9

variance

the average of the squared differences from the mean (denoted as σ^2)

10

standard deviation

the square root of the variance (denoted as σ)
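
A short base R sketch of the location and spread statistics above, using a made-up vector `x`. One caveat worth remembering: R's `var()` and `sd()` divide by n - 1 (the sample versions), so they will not exactly match the "average of squared differences" definition, which divides by n.

```r
# Hypothetical sample values, chosen only for illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

x_mean   <- mean(x)
x_median <- median(x)

# Base R has no built-in statistical mode, so tabulate the values
# and take the most frequent one.
x_mode <- as.numeric(names(which.max(table(x))))

# Population variance and standard deviation (divide by n),
# matching the card definitions.
pop_var <- mean((x - x_mean)^2)
pop_sd  <- sqrt(pop_var)

# Sample versions (divide by n - 1), which is what var() and sd() return.
sample_var <- var(x)
sample_sd  <- sd(x)
```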

11

purpose of sampling

1. don't have the resources to collect data from the entire population

2. don't have access to the entire population

3. when evaluated properly, the sample can be used to make inferences about the entire population

12

central limit theorem

tells us that the means of sufficiently large samples are approximately normally distributed, even when the population itself is not normal

13

law of large numbers

tells us that if we repeat something many times (or take a large sample) then we will get closer to the true average outcome

14

central limit theorem and the law of large numbers together

allow us to infer that our sample tells us something about the population
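
Both ideas can be seen in a small simulation. This is only a sketch under assumptions: the population is taken to be exponential with true mean 1 (a clearly non-normal choice), and the seed, sample sizes, and number of replications are arbitrary.

```r
set.seed(547)  # arbitrary seed so the sketch is reproducible

# Law of large numbers: the mean of many draws settles near the true mean (1).
draws      <- rexp(100000, rate = 1)
mean_small <- mean(draws[1:10])  # noisy estimate from only 10 draws
mean_large <- mean(draws)        # much closer to 1

# Central limit theorem: means of many samples of size 50 cluster around 1
# in a roughly bell-shaped way, with spread close to 1 / sqrt(50).
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))
center <- mean(sample_means)
spread <- sd(sample_means)

# A histogram of the sample means should look approximately normal.
hist(sample_means, breaks = 40)
```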

15

structured data

generic term for data that is organized for some purpose. relatively easy to work with.

16

What are some examples of structured data?

1. calendar

2. grade book

3. diamonds dataset

17

"tidy" data

structured data that is ready to analyze

18

columns

variables

19

what are examples of columns?

net income, assets, revenues

20

rows

observations or units

21

what are some examples of rows?

companies

22

data "tidying" or "wrangling" often takes up ____________ of the time spent on a data analysis project

most

23

unstructured data

data that is not organized or well curated for analysis

24

cross-sectional data

data for many subjects at a single point in time

25

what is an example of cross-sectional data?

financial data from all public companies in 2022

26

time-series data

data for one subject over time

27

what are some examples of time-series data?

Tesla's financial data from 2010-2020

28

panel data

data for many subjects over time

29

what are some examples of panel data?

financial data from all public companies for 2010-2020

30

factor or categorical variable

take a limited number of values

31

what are the two types of categorical variables?

1. ordinal

2. nominal

32

what are some examples of ordinal variables?

education level, grades, bond ratings

33

what are some examples of nominal variables?

geographic location, industry classification, colors
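
One way the ordinal/nominal distinction shows up in R is through ordered versus unordered factors. A minimal sketch, with made-up bond ratings and industries; the level ordering BBB < A < AAA is an assumption supplied for the example.

```r
# Ordinal: the levels have a meaningful order (here, lowest to highest rating).
rating <- factor(c("BBB", "AAA", "A"),
                 levels  = c("BBB", "A", "AAA"),
                 ordered = TRUE)
bbb_below_aaa <- rating[1] < rating[2]  # comparisons are allowed on ordered factors

# Nominal: categories with no inherent order.
industry <- factor(c("tech", "retail", "tech"))
n_industries <- nlevels(industry)
```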

34

discrete number variables

take specific values like whole numbers or integers

35

what are some examples of discrete number variables?

inventory counts and population

36

indicator, Boolean, and logical variables

take binary values

37

what are some examples of indicator, Boolean and logical variables?

True/False, Win/Lose, Heads/Tails

38

continuous numeric variables

take (in theory) an infinite number of values (often stored as numeric)

39

what are some examples of continuous numeric variables?

earnings per share, distances, time

40

character variables

capture text or "strings"

41

what are some examples of character variables?

names, phrases, sentences

42

date variable

capture dates and times

43

snake case

all characters in lower case and underscores represent spaces

44

camel case

characters are in lower case except the first letter of each new word

45

filter()

extract a "subset" of observations or rows in a dataset

46

select()

extract variables or columns in a dataset

47

arrange()

sort data

48

mutate()

create or replace variables

49

group_by()

work with data in "groups"

50

ungroup()

to undo group_by()

51

summarize()

aggregate and summarize data

52

distinct()

identify distinct observations and remove duplicate observations

53

duplicated()

identify duplicated observations
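
The dplyr verbs above are usually chained together. The sketch below assumes the dplyr package is installed; the `firms` table, its column names, and its numbers are all made up for illustration.

```r
library(dplyr)

# Hypothetical firm-year data; the last row repeats the C/2022 observation.
firms <- tibble(
  company    = c("A", "A", "B", "B", "C", "C", "C"),
  year       = c(2021, 2022, 2021, 2022, 2021, 2022, 2022),
  industry   = c("tech", "tech", "retail", "retail", "tech", "tech", "tech"),
  revenue    = c(100, 120, 80, 90, 50, 60, 60),
  net_income = c(10, 18, 4, 9, -5, 6, 6)
)

n_dupes     <- sum(duplicated(firms))  # base R: flags repeated rows
firms_clean <- distinct(firms)         # dplyr: keeps one copy of each row

# Rows for 2022 only, a few columns, a new variable, sorted high to low.
margins_2022 <- firms_clean %>%
  filter(year == 2022) %>%
  select(company, industry, revenue, net_income) %>%
  mutate(profit_margin = net_income / revenue) %>%
  arrange(desc(profit_margin))

# One summary row per industry.
by_industry <- firms_clean %>%
  group_by(industry) %>%
  summarize(avg_revenue = mean(revenue), n_obs = n()) %>%
  ungroup()
```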

54

lubridate

working with dates and times

55

in lubridate y means

year

56

in lubridate m means

month

57

in lubridate d means

day

58

in lubridate h means

hour

59

in lubridate m means

minute

60

in lubridate s means

second
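
The letters combine into parser names, which is why "m" can mean month (in the date part) or minute (in the time part). A small sketch, assuming the lubridate package is installed; the date strings are made up.

```r
library(lubridate)

d  <- ymd("2022-03-15")               # year, month, day
d2 <- mdy("03/15/2022")               # same date written month, day, year
dt <- ymd_hms("2022-03-15 14:30:05")  # date plus hour, minute, second

# Pull the individual components back out.
parts <- c(year(d), month(d), day(d), hour(dt), minute(dt), second(dt))
same_date <- d == d2
```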

61

tidyr

organizing and reshaping data

62

unite()

combine columns

63

separate()

separate columns

64

pivot_longer()

reshape datasets to convert columns into rows (makes the data longer)

65

pivot_wider()

reshape datasets to convert rows into columns (makes the data wider)
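
A minimal reshaping sketch, assuming the tidyr package is installed. `pivot_longer()` stacks several columns into name/value rows (more rows, fewer columns) and `pivot_wider()` spreads them back out. The tables and column names are made up for illustration.

```r
library(tidyr)

# Hypothetical wide table: one row per company, one column per year.
wide <- data.frame(
  company  = c("A", "B"),
  rev_2021 = c(100, 80),
  rev_2022 = c(120, 90)
)

# Longer: the two revenue columns become rows (one row per company-year).
long <- pivot_longer(wide,
                     cols         = c(rev_2021, rev_2022),
                     names_to     = "year",
                     names_prefix = "rev_",
                     values_to    = "revenue")

# Wider: the year rows become columns again.
wide_again <- pivot_wider(long,
                          names_from   = year,
                          values_from  = revenue,
                          names_prefix = "rev_")

# unite() pastes columns together; separate() splits them apart.
dates       <- data.frame(year = c(2021, 2022), month = c(3, 11))
united      <- unite(dates, "year_month", year, month, sep = "-")
split_again <- separate(united, year_month, into = c("year", "month"), sep = "-")
```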

66

str_remove()

remove certain characters from a string

67

str_replace()

replace certain characters in strings

68

str_to_lower()

convert all characters to lower case

69

str_to_upper()

convert all characters to upper case
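
A short string-cleaning sketch, assuming the stringr package is installed. In stringr the case helpers are spelled `str_to_lower()` and `str_to_upper()`, and `str_remove()` / `str_replace()` act only on the first match (the `_all` variants handle every match). The input strings are made up.

```r
library(stringr)

# Turn a text amount into a number.
amount_text <- "$1,200"
no_dollar   <- str_remove(amount_text, "\\$")   # drop the dollar sign
no_comma    <- str_replace(no_dollar, ",", "")  # replace the comma with nothing
amount      <- as.numeric(no_comma)

# Turn a column label into snake case, or into upper case.
col_name <- "Net Income"
snake    <- str_replace(str_to_lower(col_name), " ", "_")
shout    <- str_to_upper(col_name)
```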

70

left_join()

joins the left (L) and right (R) datasets keeping all rows in the L dataset

71

right_join()

joins the right (R) and left (L) datasets keeping all rows in the R dataset

72

inner_join()

joins the left (L) and right (R) datasets keeping only rows that match

73

full_join()

joins the left (L) and right (R) datasets keeping all rows from both datasets

74

semi_join()

retain rows in the left (L) dataset that match to the right (R) dataset (similar to an inner join but without adding the variables from R).

75

anti_join()

retain rows in the left (L) dataset that do not match to the right (R) dataset
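
The six joins are easiest to compare on two tiny tables. A sketch assuming the dplyr package is installed; `L`, `R`, and their contents are made up so that companies B and C match, A appears only in L, and D appears only in R.

```r
library(dplyr)

L <- tibble(company = c("A", "B", "C"), revenue  = c(100, 80, 50))
R <- tibble(company = c("B", "C", "D"), industry = c("retail", "tech", "energy"))

left  <- left_join(L, R, by = "company")   # A, B, C (industry is NA for A)
right <- right_join(L, R, by = "company")  # B, C, D (revenue is NA for D)
inner <- inner_join(L, R, by = "company")  # B, C only
full  <- full_join(L, R, by = "company")   # A, B, C, D

semi <- semi_join(L, R, by = "company")    # B, C, with L's columns only
anti <- anti_join(L, R, by = "company")    # A, with L's columns only
```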

76

what are the 7 components of the grammar of graphics?

1. Data

2. Aesthetics

3. Geometries

4. Facets

5. Statistics

6. Coordinates

7. Themes

77

aesthetics

x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, line type

78

geometries

point, line, histogram, bar, box plot

79

facets

columns, rows

80

statistics

binning, smoothing, descriptive, inferential

81

coordinates

cartesian, fixed, polar, limits

82

themes

non-data ink

83

what are the 5 common visualizations?

1. Bar/Pie Charts

2. Histograms

3. Line Charts

4. Scatterplots

5. Boxplots

84

pie chart

allows us to visualize relative values (means, counts, sums) by categories or groups

85

bar charts

allow us to visualize relative values (means, counts, sums) by categories or groups; generally more useful than pie charts

86

histograms

allows us to visualize the distribution of a variable using the frequency of observations inside bins

87

box plots

allow us to visualize the distribution of the data by displaying the median, interquartile range, and extreme values

88

line charts

depict trends in variables. great for time-series analysis

89

scatterplots

visualize the relationship between variables using data points. observations are plotted on the Y and X axes; expand beyond two dimensions with shapes, colors, sizes, and other features

90

overplotting

use 'jitter', 'transparency', and sampling to alleviate overplotting
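
A sketch of how the grammar-of-graphics layers stack in ggplot2, using the `diamonds` dataset that ships with the package. It assumes ggplot2 is installed; the seed, sample size, jitter width, and alpha value are arbitrary choices, and the plot also applies the three overplotting remedies (sampling, jitter, transparency).

```r
library(ggplot2)

set.seed(547)  # arbitrary seed so the sample is reproducible

# Sampling: plot 2,000 random diamonds instead of all of them.
diamonds_sample <- diamonds[sample(nrow(diamonds), 2000), ]

p <- ggplot(diamonds_sample,                               # data
            aes(x = carat, y = price, color = cut)) +      # aesthetics
  geom_point(alpha = 0.3,                                  # geometry, with transparency
             position = position_jitter(width = 0.01, height = 0)) +  # and jitter
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistics (smoothing)
  facet_wrap(~ cut) +                                      # facets
  coord_cartesian(xlim = c(0, 3)) +                        # coordinates and limits
  theme_minimal()                                          # theme (non-data ink)

# Printing p (or calling print(p)) draws the chart.
```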

91

rm(list=ls())

clear the environment

92

install.packages()

install packages

93

library()

load installed packages (libraries)

94

setwd()

set your working directory

95

getwd()

check your working directory

96

str()

check the structure of an object

97

summary()

summarize data

98

unique() OR distinct()

list the unique values of a variable

99

table()

count of observations for each unique value of a variable
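
A quick base R sketch of the inspection functions above, run on a made-up data frame `df` (the names and values are illustrative only).

```r
# Hypothetical data frame for illustration.
df <- data.frame(
  company  = c("A", "B", "C", "D"),
  industry = c("tech", "retail", "tech", "energy"),
  revenue  = c(100, 80, 50, 200)
)

str(df)                               # structure: column types and a preview
rev_summary <- summary(df$revenue)    # min, quartiles, mean, max
industries  <- unique(df$industry)    # each distinct value once
counts      <- table(df$industry)     # number of rows per distinct value

wd <- getwd()                         # current working directory
```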

100

what are the 4 statistics of location?

1. mean

2. median

3. mode

4. percentile