AC547 Midterm


106 Terms

1

selection bias

occurs when data is chosen in a way that does not reflect the real-world data distribution

2

survivorship bias

the tendency to draw conclusions based on things that have survived some selection process and to ignore things that did not survive. it is a cognitive bias and a form of selection bias.

3

mean

average; also sometimes called the "expected value"

4

median

the "middle" or the "50th percentile" of the data

5

mode

the most commonly occurring value
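
A minimal sketch of the three measures above in R, using made-up numbers. Base R has no built-in function for the statistical mode, so the counting approach below is one possible workaround, not a standard API.

```r
# Hypothetical sample values, chosen only for illustration
x <- c(2, 3, 3, 5, 7, 10)

mean_x   <- mean(x)    # sum of the values divided by how many there are
median_x <- median(x)  # middle value (average of the two middle values here)

# Base R's mode() reports an object's storage type, not the statistical mode,
# so count each value and keep the most frequent one(s).
counts <- table(x)
mode_x <- as.numeric(names(counts)[counts == max(counts)])
```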

6

percentile

a measure of relative position: the value at or below which a given percentage of observations fall

7

What is an example of percentile?

the median is the 50th percentile

8

interquartile range

difference between the 75th percentile and the 25th percentile (the middle 50% of the data)
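
A short sketch of percentiles and the interquartile range in R, on invented data. It assumes R's default quantile method; other methods can give slightly different cut points.

```r
# Hypothetical data, chosen only for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# 25th, 50th, and 75th percentiles (default quantile method)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))

# Interquartile range: 75th percentile minus 25th percentile
iqr_manual  <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(x)  # built-in equivalent
```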

9

variance

the average of the squared differences from the mean (denoted as σ^2)

10

standard deviation

the square root of the variance (denoted as σ)
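
A small worked example of variance and standard deviation as defined above, on invented numbers. One caveat worth knowing: R's built-in var() and sd() divide by n - 1 (the sample versions), so they will not match the "average of squared differences" exactly.

```r
# Hypothetical data, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Variance as defined on the card: average squared difference from the mean
pop_var <- mean((x - mean(x))^2)
pop_sd  <- sqrt(pop_var)  # standard deviation is the square root of the variance

# R's var() divides by (n - 1) instead of n, so it is slightly larger
sample_var <- var(x)
```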

11

purpose of sampling

1. don't have the resources to collect data from the entire population

2. don't have access to the entire population

3. when evaluated properly, the sample can be used to make inferences about the entire population

12

central limit theorem

tells us that the means of repeated, sufficiently large samples are approximately normally distributed, even when the underlying data are not

13

law of large numbers

tells us that if we repeat something many times (or take a large sample) then we will get closer to the true average outcome

14

central limit theorem and the law of large numbers together

allow us to infer that our sample tells us something about the population
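
A rough simulation of both ideas in R. The population (exponential, so clearly skewed), the seed, and the sample sizes are all arbitrary assumptions made for illustration.

```r
set.seed(547)  # arbitrary seed so the sketch is reproducible

# Hypothetical skewed "population" with a true mean of 1
population <- rexp(100000, rate = 1)

# Law of large numbers: the running average settles near the true mean
running_mean <- cumsum(population) / seq_along(population)

# Central limit theorem: means of many samples of size 50 pile up
# symmetrically around the population mean, even though the data are skewed
sample_means <- replicate(2000, mean(sample(population, size = 50)))

# hist(population) looks skewed; hist(sample_means) looks roughly bell-shaped
```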

15

structured data

generic term for data that is organized for some purpose. relatively easy to work with.

16

What are some examples of structured data?

1. calendar

2. grade book

3. diamonds dataset

17

"tidy" data

structured data that is ready to analyze

18

columns

variables

19

what are examples of columns?

net income, assets, revenues

20

rows

observations or units

21

what are some examples of rows?

companies

22

data "tidying" or "wrangling" often takes up ____________ of the time spent on a data analysis project

most

23

unstructured data

data that is not organized or well curated for analysis

24

cross-sectional data

data for many subjects at a single point in time

25

what is an example of cross-sectional data?

financial data from all public companies in 2022

26

time-series data

data for one subject over time

27

what are some examples of time-series data?

Tesla's financial data from 2010-2020

28

panel data

data for many subjects over time

29

what are some examples of panel data?

financial data from all public companies for 2010-2020

30

factor or categorical variable

take a limited number of values

31

what are the two types of categorical variables?

1. ordinal

2. nominal

32

what are some examples of ordinal variables?

education level, grades, bond ratings

33

what are some examples of nominal variables?

geographic location, industry classification, colors

34

discrete number variables

take specific values like whole numbers or integers

35

what are some examples of discrete number variables?

inventory counts and population

36

indicator, Boolean, and logical variables

take binary values

37

what are some examples of indicator, Boolean and logical variables?

True/False, Win/Lose, Heads/Tails

38

continuous numeric variables

take (in theory) an infinite number of values (often stored as numeric)

39

what are some examples of continuous numeric variables?

earnings per share, distances, time

40

character variables

capture text or "strings"

41

what are some examples of character variables?

names, phrases, sentences

42

date variable

capture dates and times

43

snake case

all characters in lower case and underscores represent spaces

44

camel case

characters are in lower case except the first letter of each new word

45

filter()

extract a "subset" of observations or rows in a dataset

46

select()

extract variables or columns in a dataset

47

arrange()

sort data

48

mutate()

create or replace variables

49

group_by()

work with data in "groups"

50

ungroup()

to undo group_by()

51

summarize()

aggregate and summarize data
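
The dplyr verbs on the preceding cards are usually chained into one pipeline. A minimal sketch, assuming the dplyr package is installed; the firms table and its numbers are invented for illustration.

```r
library(dplyr)

# Hypothetical company data, invented for illustration
firms <- tibble(
  company    = c("A", "B", "C", "D"),
  industry   = c("tech", "tech", "retail", "retail"),
  revenue    = c(100, 250, 80, 120),
  net_income = c(10, 50, -4, 12)
)

result <- firms %>%
  filter(revenue > 90) %>%                              # keep a subset of rows
  select(company, industry, revenue, net_income) %>%    # keep chosen columns
  mutate(margin = net_income / revenue) %>%             # create a new variable
  arrange(desc(margin)) %>%                             # sort, largest margin first
  group_by(industry) %>%                                # work within each industry
  summarize(avg_margin = mean(margin), n_firms = n()) %>%  # one row per group
  ungroup()                                             # drop any remaining grouping
```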

52

distinct()

identify distinct observations and remove duplicate observations

53

duplicated()

identify duplicated observations
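
A quick sketch contrasting the two functions, assuming dplyr is installed; the trades table is made up. Note that duplicated() comes from base R and only flags repeats, while dplyr's distinct() returns the data with repeats dropped.

```r
library(dplyr)

# Hypothetical data with one repeated row (rows 1 and 3 are identical)
trades <- tibble(
  ticker = c("AAA", "BBB", "AAA", "CCC"),
  year   = c(2020, 2020, 2020, 2021)
)

is_dup  <- duplicated(trades)  # base R: TRUE for rows that repeat an earlier row
deduped <- distinct(trades)    # dplyr: keep only the first copy of each row
```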

54

lubridate

working with dates and times

55

in lubridate y means

year

56

in lubridate m means (in date functions such as ymd())

month

57

in lubridate d means

day

58

in lubridate h means

hour

59

in lubridate m means (in time functions such as hms())

minute

60

in lubridate s means

second
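
The letters on the lubridate cards appear in the names of its parsing functions, in the order the parts occur in the text being parsed. A minimal sketch, assuming lubridate is installed; the timestamp is an arbitrary example.

```r
library(lubridate)

# y, m, d, then h, m, s: year-month-day followed by hour-minute-second
when <- ymd_hms("2022-03-15 14:30:45")

yr  <- year(when)    # y: year
mo  <- month(when)   # m: month (date part)
dy  <- day(when)     # d: day
hr  <- hour(when)    # h: hour
mi  <- minute(when)  # m: minute (time part)
sec <- second(when)  # s: second

# Same date written month-day-year, so the function name becomes mdy()
us_style <- mdy("03/15/2022")
```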

61

tidyr

organizing and reshaping data

62

unite()

combine columns

63

separate()

separate columns
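
A small sketch of unite() and separate() as a round trip, assuming tidyr is installed; the data frame is invented. After separate(), the pieces come back as text, so year is a character column in this example.

```r
library(tidyr)

# Hypothetical data, invented for illustration
df <- data.frame(year = c(2021, 2022), quarter = c("Q1", "Q2"))

# Combine two columns into one, joined by an underscore
united <- unite(df, "period", year, quarter, sep = "_")

# Split the combined column back into two columns
separated <- separate(united, period, into = c("year", "quarter"), sep = "_")
```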

64

pivot_longer()

reshape datasets from wide to long by converting columns into rows

65

pivot_wider()

reshape datasets from long to wide by converting rows into columns
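
A minimal round-trip sketch of the two pivot functions, assuming tidyr is installed; the revenue table is invented. In tidyr, pivot_longer() stacks columns into rows (wide to long) and pivot_wider() spreads rows back out into columns (long to wide).

```r
library(tidyr)

# Hypothetical wide data: one revenue column per year
wide <- data.frame(
  company  = c("A", "B"),
  rev_2021 = c(10, 20),
  rev_2022 = c(12, 25)
)

# Wide to long: the two revenue columns become rows
long <- pivot_longer(
  wide,
  cols      = c(rev_2021, rev_2022),
  names_to  = "year",
  values_to = "revenue"
)

# Long to wide: the rows become columns again
back <- pivot_wider(long, names_from = year, values_from = revenue)
```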

66

str_remove()

remove certain characters from a string

67

str_replace()

replace certain characters in strings

68

str_to_lower()

convert all characters to lower case

69

str_to_upper()

convert all characters to upper case
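
A short sketch of the string helpers above, assuming the stringr package is installed; the labels are made up. In stringr the case functions are named str_to_lower() and str_to_upper().

```r
library(stringr)

# Hypothetical messy column labels
raw <- c("Net Income ($)", "Total Assets ($)")

no_unit <- str_remove(raw, fixed(" ($)"))   # drop the literal text " ($)"
joined  <- str_replace(no_unit, " ", "_")   # replace the first space with "_"
lower   <- str_to_lower(joined)             # all lower case (snake case here)
upper   <- str_to_upper(joined)             # all upper case
```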

70

left_join()

joins the left (L) and right (R) datasets keeping all rows in the L dataset

71

right_join()

joins the right (R) and left (L) datasets keeping all rows in the R dataset

72

inner_join()

joins the left (L) and right (R) datasets keeping only rows that match

73

full_join()

joins the left (L) and right (R) datasets keeping all rows from both datasets

74

semi_join()

retain rows in the left (L) dataset that match the right (R) dataset (similar to an inner join but without adding the variables from R).

75

anti_join()

retain rows in the left (L) dataset that do not match to the right (R) dataset
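
The six joins can be compared side by side on two tiny tables. A minimal sketch, assuming dplyr is installed; L and R are invented so that ids 2 and 3 match, id 1 appears only in L, and id 4 appears only in R.

```r
library(dplyr)

# Hypothetical left and right datasets
L <- tibble(id = c(1, 2, 3), name = c("A", "B", "C"))
R <- tibble(id = c(2, 3, 4), revenue = c(20, 30, 40))

left  <- left_join(L, R, by = "id")   # all rows of L (ids 1, 2, 3)
right <- right_join(L, R, by = "id")  # all rows of R (ids 2, 3, 4)
inner <- inner_join(L, R, by = "id")  # only matching rows (ids 2, 3)
full  <- full_join(L, R, by = "id")   # all rows from both (ids 1, 2, 3, 4)
semi  <- semi_join(L, R, by = "id")   # rows of L with a match; no columns from R
anti  <- anti_join(L, R, by = "id")   # rows of L without a match (id 1)
```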

76

what are the 7 components of the grammar of graphics?

1. Data

2. Aesthetics

3. Geometries

4. Facets

5. Statistics

6. Coordinates

7. Themes

77

aesthetics

x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, line type

78

geometries

point, line, histogram, bar, box plot

79

facets

columns, rows

80

statistics

binning, smoothing, descriptive, inferential

81

coordinates

cartesian, fixed, polar, limits

82

themes

non-data ink
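
One way to see the seven components together is a single ggplot2 call with one line per component. This is only a sketch, assuming ggplot2 is installed; it uses the diamonds dataset that ships with ggplot2, and the particular geometry, statistic, and theme are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(diamonds,                                # 1. data
            aes(x = carat, y = price, color = cut)) +  # 2. aesthetics
  geom_point(alpha = 0.2) +                          # 3. geometry (transparent points)
  facet_wrap(~ cut) +                                # 4. facets: one panel per cut
  stat_smooth(method = "lm", se = FALSE) +           # 5. statistics: fitted line
  coord_cartesian(xlim = c(0, 3)) +                  # 6. coordinates: zoom the x-axis
  theme_minimal()                                    # 7. theme: non-data ink

# print(p) draws the plot
```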

83

what are the 5 common visualizations?

1. Bar/Pie Charts

2. Histograms

3. Line Charts

4. Scatterplots

5. Boxplots

84

pie chart

allows us to visualize relative values (means, counts, sum) by categories or groups

85

bar charts

allow us to visualize relative values (means, counts, sums) by categories or groups; generally more useful than pie charts

86

histograms

allow us to visualize the distribution of a variable using the frequency of observations inside bins

87

box plots

allow us to visualize the distribution of the data by displaying the median, interquartile range, and extreme values

88

line charts

depict trends in variables. great for time-series analysis

89

scatterplots

visualize the relationship between variables using data points. observations are plotted on the Y and X axes; expand beyond two dimensions with shapes, colors, sizes, and other features

90

overplotting

use 'jitter', 'transparency', and sampling to alleviate overplotting

91

rm(list=ls())

clear the environment

92

install.packages()

install packages

93

library()

load installed packages

94

setwd()

set your working directory

95

getwd()

check your working directory

96

str()

check the structure of an object

97

summary()

summarize data

98

unique() OR distinct()

list unique values

99

table()

count of observations for each unique value
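
A brief base R sketch of the inspection functions on the last few cards, using a made-up vector. The commented calls print to the console, so they are left for interactive use.

```r
# Hypothetical categorical data
industry <- c("tech", "retail", "tech", "energy", "tech")

unique_vals <- unique(industry)  # each value once, in order of first appearance
counts      <- table(industry)   # number of observations per unique value

# str(industry)      # compact look at the object's type and first values
# summary(industry)  # short summary (length and class for text data)
```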

100

what are the 4 statistics of location?

1. mean

2. median

3. mode

4. percentile

