AC547 Midterm

106 Terms

New cards

selection bias

occurs when data are chosen in a way that does not reflect the real-world data distribution

New cards

survivorship bias

the tendency to draw conclusions based on things that have survived some selection process and to ignore things that did not survive. it is a cognitive bias and a form of selection bias.

New cards

mean

average; also sometimes called the "expected value"

New cards

median

the "middle" or the "50th percentile" of the data

New cards

mode

the most commonly occurring value

New cards

percentile

a measure of relative position: the value at or below which a given percentage of observations fall

New cards

What is an example of percentile?

the median is the 50th percentile

New cards

interquartile range

difference between the 75th percentile and the 25th percentile (the middle 50% of the data)

New cards

variance

the average of the squared differences from the mean (denoted as σ^2)

New cards

standard deviation

the square root of the variance (denoted as σ)
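
A minimal base R sketch of the summary statistics on the cards above. The `eps` vector is made-up illustration data, and `mode_eps` is a hand-rolled calculation because base R has no built-in statistical mode. Note that `var()` and `sd()` divide by n - 1 (the sample versions), so the population variance as worded on the card is computed by hand here. `quantile()` uses R's default interpolation rule; other percentile conventions can give slightly different values.

```r
# Hypothetical data: earnings per share for seven firms
eps <- c(2, 4, 4, 5, 7, 9, 11)

mean_eps   <- mean(eps)     # average
median_eps <- median(eps)   # 50th percentile

# Mode: tabulate the values and take the most frequent one
mode_eps <- as.numeric(names(which.max(table(eps))))

# 25th, 50th, and 75th percentiles, and the interquartile range
q       <- quantile(eps, probs = c(0.25, 0.50, 0.75))
iqr_eps <- IQR(eps)         # 75th percentile minus 25th percentile

# Population variance and standard deviation (divide by n)
pop_var <- mean((eps - mean(eps))^2)
pop_sd  <- sqrt(pop_var)

# Sample variance and standard deviation (divide by n - 1)
samp_var <- var(eps)
samp_sd  <- sd(eps)
```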

New cards

purpose of sampling

1. don't have the resources to collect data from the entire population

2. don't have access to the entire population

3. when evaluated properly, the sample can be used to make inferences about the entire population

New cards

central limit theorem

tells us that the means of repeated, sufficiently large samples are approximately normally distributed, even when the population itself is not normal

New cards

law of large numbers

tells us that if we repeat something many times (or take a large sample) then we will get closer to the true average outcome

New cards

central limit theorem and the law of large numbers together

allows us to infer that our sample tells us something about the population
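
A small simulation can make both ideas concrete. This is only a sketch under stated assumptions: the skewed exponential `population`, the seed, and the sample sizes are all arbitrary choices for illustration.

```r
set.seed(547)  # arbitrary seed so the simulation is repeatable

# Hypothetical skewed population with a true mean of about 1
population <- rexp(100000, rate = 1)

# Law of large numbers: a larger sample gives a mean closer to the population mean
small_mean <- mean(sample(population, 10))
large_mean <- mean(sample(population, 10000))

# Central limit theorem: the means of many samples of size 50
# are roughly bell-shaped even though the population is skewed
sample_means <- replicate(2000, mean(sample(population, 50)))

print(c(population = mean(population), small = small_mean, large = large_mean))
hist(sample_means, main = "Means of 2,000 samples (n = 50)")
```

The histogram of `sample_means` should look roughly symmetric and centered near the population mean, which is what lets a single sample say something about the population.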

New cards

structured data

generic term for data that is organized for some purpose. relatively easy to work with.

New cards

What are some examples of structured data?

1. calendar

2. grade book

3. diamonds dataset

New cards

"tidy" data

structured data that is ready to analyze

New cards

columns

variables

New cards

what are examples of columns?

net income, assets, revenues

New cards

rows

observations or units

New cards

what are some examples of rows?

companies

New cards

data "tidying" or "wrangling" often takes up ____________ of the time spent on a data analysis project

most

New cards

unstructured data

data that is not organized or well curated for analysis

New cards

cross-sectional data

data for many subjects at a single point in time

New cards

what is an example of cross-sectional data?

financial data from all public companies in 2022

New cards

time-series data

data for one subject over time

New cards

what are some examples of time-series data?

Tesla's financial data from 2010-2020

New cards

panel data

data for many subjects over time

New cards

what are some examples of panel data?

financial data from all public companies for 2010-2020
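
The three data shapes can be seen in one small table. A base R sketch, where the `panel` data frame and its numbers are invented for illustration:

```r
# Hypothetical panel: three companies observed over three years
panel <- data.frame(
  company = rep(c("A", "B", "C"), each = 3),
  year    = rep(2018:2020, times = 3),
  revenue = c(10, 12, 15, 7, 8, 8, 20, 19, 23)
)

# Cross-sectional: many subjects at one point in time
cross_section <- panel[panel$year == 2020, ]

# Time-series: one subject over time
time_series <- panel[panel$company == "A", ]
```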

New cards

factor or categorical variable

take a limited number of values

New cards

what are the two types of categorical variables?

1. ordinal

2. nominal

New cards

what are some examples of ordinal variables?

education level, grades, bond ratings

New cards

what are some examples of nominal variables?

geographic location, industry classification, colors

New cards

discrete number variables

take specific values like whole numbers or integers

New cards

what are some examples of discrete number variables?

inventory counts and population

New cards

indicator, Boolean, and logical variables

take binary values

New cards

what are some examples of indicator, Boolean and logical variables?

True/False, Win/Lose, Heads/Tails

New cards

continuous numeric variables

take (in theory) an infinite number of values (often stored as numeric)

New cards

what are some examples of continuous numeric variables?

earnings per share, distances, time

New cards

character variables

capture text or "strings"

New cards

what are some examples of character variables?

names, phrases, sentences

New cards

date variable

capture dates and times
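
One way these variable types might be stored in R, as a sketch. All values and object names below are hypothetical examples, not data from the course.

```r
# Ordinal categorical: levels have a meaningful order
bond_rating <- factor(c("AA", "BBB", "AAA"),
                      levels = c("BBB", "AA", "AAA"), ordered = TRUE)

# Nominal categorical: levels have no order
industry <- factor(c("tech", "retail", "tech"))

inventory  <- c(12L, 0L, 7L)          # discrete number (integer)
profitable <- c(TRUE, FALSE, TRUE)    # indicator / Boolean / logical
eps        <- c(1.25, -0.40, 3.10)    # continuous numeric
firm       <- c("Alpha Co", "Beta Inc", "Gamma LLC")  # character
fiscal_end <- as.Date(c("2022-12-31", "2022-06-30", "2022-09-30"))  # date

# Comparisons only make sense for ordered factors
bond_rating[3] > bond_rating[1]
```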

New cards

snake case

all characters in lower case and underscores represent spaces

New cards

camel case

characters are in lower case except the first letter of each new word, which is capitalized (no spaces)

New cards

filter()

extract a "subset" of observations or rows in a dataset

New cards

select()

extract variables or columns in a dataset

New cards

arrange()

sort data

New cards

mutate()

create or replace variables

New cards

group_by()

work with data in "groups"

New cards

ungroup()

to undo group_by()

New cards

summarize()

aggregate and summarize data

New cards

distinct()

identify distinct observations and remove duplicate observations

New cards

duplicated()

identify duplicated observations
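
The verbs above are usually chained together. A minimal sketch, assuming the dplyr package is installed; the `firms` table and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical data; company D appears twice by mistake
firms <- tibble(
  company    = c("A", "B", "C", "D", "D"),
  industry   = c("tech", "tech", "retail", "retail", "retail"),
  revenue    = c(100, 250, 80, 120, 120),
  net_income = c(10, 50, -4, 12, 12)
)

# duplicated() flags repeated rows; distinct() drops them
n_dupes <- sum(duplicated(firms))

industry_summary <- firms %>%
  distinct() %>%                                 # remove the duplicate row
  filter(revenue > 90) %>%                       # keep rows (observations)
  select(company, industry, revenue, net_income) %>%  # keep columns (variables)
  mutate(margin = net_income / revenue) %>%      # create a new variable
  group_by(industry) %>%                         # work within each industry
  summarize(n_firms = n(), avg_margin = mean(margin)) %>%
  ungroup() %>%                                  # undo any remaining grouping
  arrange(desc(avg_margin))                      # sort, highest margin first
```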

New cards

lubridate

working with dates and times

New cards

in lubridate y means

year

New cards

in lubridate m means (in date functions such as ymd)

month

New cards

in lubridate d means

day

New cards

in lubridate h means

hour

New cards

in lubridate m means (in time functions such as hms, after h)

minute

New cards

in lubridate s means

second
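
The letters combine into parsing functions whose names give the order of the pieces in the input. A short sketch, assuming the lubridate package is installed; the dates are arbitrary examples.

```r
library(lubridate)

# The function name matches the order of the components in the text
d1 <- ymd("2022-03-15")    # year, month, day
d2 <- mdy("03/15/2022")    # month, day, year
d3 <- dmy("15/03/2022")    # day, month, year

# Date plus time: the m after h is minutes
dt <- ymd_hms("2022-03-15 14:30:05")

# Accessor functions pull out individual components
parts <- c(year(dt), month(dt), day(dt), hour(dt), minute(dt), second(dt))
```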

New cards

tidyr

organizing and reshaping data

New cards

unite()

combine columns

New cards

separate()

separate columns

New cards

pivot_longer()

reshape a dataset from wide to long by converting columns into rows

New cards

pivot_wider()

reshape a dataset from long to wide by converting rows into columns
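
A sketch of the tidyr reshaping functions, assuming the tidyr package is installed. The `wide` and `periods` tables and their column names are invented for illustration. Going longer stacks several columns into rows; going wider spreads rows back out into columns.

```r
library(tidyr)

# Hypothetical wide table: one revenue column per year
wide <- data.frame(
  company  = c("A", "B"),
  rev_2019 = c(10, 7),
  rev_2020 = c(12, 8)
)

# Wide to long: the two revenue columns become rows
long <- pivot_longer(wide,
                     cols = c(rev_2019, rev_2020),
                     names_to = "year",
                     names_prefix = "rev_",
                     values_to = "revenue")

# Long to wide: the year rows become columns again
wide_again <- pivot_wider(long,
                          names_from = year,
                          values_from = revenue,
                          names_prefix = "rev_")

# separate() splits one column into several; unite() combines them back
periods  <- data.frame(fiscal_period = c("2020-Q1", "2020-Q2"))
parts    <- separate(periods, col = fiscal_period,
                     into = c("year", "quarter"), sep = "-")
reunited <- unite(parts, col = "fiscal_period", year, quarter, sep = "-")
```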

New cards

str_remove()

remove the first match of a pattern (certain characters) from a string

New cards

str_replace()

replace the first match of a pattern (certain characters) in a string

New cards

str_to_lower()

convert all characters to lower case

New cards

str_to_upper()

convert all characters to upper case
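
A sketch of the string helpers cleaning up column labels, assuming the stringr package is installed (its case helpers are named `str_to_lower()` and `str_to_upper()`). The `raw_names` vector is a made-up example. `str_remove()` and `str_replace()` act on the first match only; stringr also provides `str_remove_all()` and `str_replace_all()` for every match.

```r
library(stringr)

# Hypothetical messy labels
raw_names <- c("Net Income ($)", "Total Assets ($)")

# fixed() treats the pattern as literal text rather than a regular expression
no_units <- str_remove(raw_names, fixed(" ($)"))

# Lower case, then swap the single space for an underscore (snake case)
snake <- str_replace(str_to_lower(no_units), " ", "_")

shouting <- str_to_upper(snake)
```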

New cards

left_join()

joins the left (L) and right (R) datasets keeping all rows in the L dataset

New cards

right_join()

joins the right (R) and left (L) datasets keeping all rows in the R dataset

New cards

inner_join()

joins the left (L) and right (R) datasets keeping only rows that match

New cards

full_join()

join the left (L) and right (R) datasets keeping all rows from both datasets

New cards

semi_join()

retain rows in the left (L) dataset that have a match in the right (R) dataset (similar to an inner join, but without adding the variables from R).

New cards

anti_join()

retain rows in the left (L) dataset that do not match to the right (R) dataset
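
The six joins can be compared side by side on two tiny tables. A sketch assuming the dplyr package is installed; the tables `L` and `R` and the `ticker` key are hypothetical.

```r
library(dplyr)

# Hypothetical left and right tables sharing the key column "ticker"
L <- tibble(ticker = c("AAA", "BBB", "CCC"), revenue   = c(10, 20, 30))
R <- tibble(ticker = c("BBB", "CCC", "DDD"), employees = c(200, 300, 400))

left  <- left_join(L, R, by = "ticker")   # all rows of L; employees is NA for AAA
right <- right_join(L, R, by = "ticker")  # all rows of R; revenue is NA for DDD
inner <- inner_join(L, R, by = "ticker")  # only BBB and CCC
full  <- full_join(L, R, by = "ticker")   # AAA, BBB, CCC, and DDD

# Filtering joins: keep or drop rows of L, never add columns from R
semi <- semi_join(L, R, by = "ticker")    # BBB and CCC, columns of L only
anti <- anti_join(L, R, by = "ticker")    # AAA only
```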

New cards

what are the 7 components of the grammar of graphics?

1. Data

2. Aesthetics

3. Geometries

4. Facets

5. Statistics

6. Coordinates

7. Themes

New cards

aesthetics

x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, line type

New cards

geometries

point, line, histogram, bar, box plot

New cards

facets

columns, rows

New cards

statistics

binning, smoothing, descriptive, inferential

New cards

coordinates

cartesian, fixed, polar, limits

New cards

themes

non-data ink

New cards

what are the 5 common visualizations?

1. Bar/Pie Charts

2. Histograms

3. Line Charts

4. Scatterplots

5. Boxplots

New cards

pie chart

allows us to visualize relative values (means, counts, sums) by categories or groups

New cards

bar charts

allow us to visualize relative values (means, counts, sums) by categories or groups. generally more useful than pie charts

New cards

histograms

allows us to visualize the distribution of a variable using the frequency of observations inside bins

New cards

box plots

allow us to visualize the distribution of the data by displaying the median, interquartile range, and extreme values

New cards

line charts

depict trends in variables. great for time-series analysis

New cards

scatterplots

visualize the relation between variables using data points. observations are plotted on the Y and X axes; expand beyond two dimensions with shapes, colors, sizes, and other features

New cards

overplotting

occurs when many data points overlap and hide one another. use 'jitter', 'transparency', and sampling to alleviate overplotting
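
The grammar of graphics components map fairly directly onto ggplot2 code. A sketch assuming the ggplot2 package is installed, using its built-in `diamonds` dataset; the seed, sample size, and styling choices are arbitrary. It also applies the three overplotting remedies: sampling, jitter, and transparency.

```r
library(ggplot2)

set.seed(547)  # arbitrary seed so the sample is repeatable

# Sampling: plot 2,000 of the roughly 54,000 diamonds
sampled <- diamonds[sample(nrow(diamonds), 2000), ]

p <- ggplot(data = sampled,                                   # data
            mapping = aes(x = carat, y = price, color = cut)) +  # aesthetics
  geom_point(alpha = 0.3, position = "jitter") +   # geometry, with transparency and jitter
  geom_smooth(method = "lm", se = FALSE) +         # statistics: fitted line per cut
  facet_wrap(~ clarity) +                          # facets: one panel per clarity grade
  coord_cartesian(xlim = c(0, 3)) +                # coordinates and limits
  theme_minimal()                                  # theme: non-data ink

# print(p) draws the plot
```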

New cards

rm(list=ls())

clear the environment

New cards

install.packages()

install packages

New cards

library()

load installed packages (libraries) into the current session