putting in a dataset
data<-c(81,85,93,93,99,76,75,84,78,84,81,82,89,81,96,82,74,70,84,86,80,70,131,75,88,102,115,89,82,79,106)
length(data)
tells you how many entries are in your vector
sort(data)
puts data smallest to largest
summary(data)
returns the five number summary and the sample mean
mean(data)
sample mean
sd(data)
sample standard deviation
sum(data)
adds all the elements of the vector
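A short sketch that runs the commands above on the dataset entered at the top of these notes. The variable names `n`, `total`, `xbar`, and `s` are illustrative choices, not part of the original notes.

```r
# Same dataset as entered at the top of the notes
data <- c(81,85,93,93,99,76,75,84,78,84,81,82,89,81,96,82,
          74,70,84,86,80,70,131,75,88,102,115,89,82,79,106)

n     <- length(data)   # number of entries in the vector
total <- sum(data)      # adds all the elements
xbar  <- mean(data)     # sample mean, the same as total / n
s     <- sd(data)       # sample standard deviation (divides by n - 1)

print(sort(data))       # smallest to largest
print(summary(data))    # five-number summary plus the mean
```

The sample standard deviation here uses the n - 1 divisor, which is what `sd()` computes.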
seq(1,100,1)
tells R to create the sequence of numbers from 1 to 100 by 1 (1,2,3,...,100)
1:100 does the same thing
seq(2,100,2)
the sequence from 2 to 100 by 2 (2,4,6,...,100)
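A minimal check of the two sequences above; the names `ones`, `evens`, and `same` are made up for this sketch.

```r
ones  <- seq(1, 100, 1)      # 1, 2, 3, ..., 100
evens <- seq(2, 100, 2)      # 2, 4, 6, ..., 100
same  <- all(ones == 1:100)  # TRUE when 1:100 gives the same values

length(evens)                # how many even numbers were generated
```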
data<-read.table(file.choose(),header=TRUE)
read.csv
reads the dataset into R (read.csv does the same for comma-separated files); you can name it whatever you want
getwd()
returns the current working directory (the folder R reads files from by default); it does not load any data
dim(data)
checks dimensions of data
data
type the name of the data to see it
data[1:5,]
shows the first five rows of the dataset (note the comma; without it, data[1:5] selects the first five columns), useful if you are working with a large dataset
head(data)
shows the first few rows
data[,1:2]
all rows for columns 1 and 2
data[1:5,1:2]
first five rows and first two columns
str(data)
structure of the object
View(data)
(capital V) puts your data in a viewable popup window
data$shoes
calls in the column called shoes
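The indexing commands above can be tried on a small data frame. The one below is hypothetical: the columns `height` and `year` and all the values are invented for illustration, and only the column name `shoes` comes from the notes.

```r
# Hypothetical data frame; values are made up
data <- data.frame(
  shoes  = c(3, 10, 5, 7, 2, 12),
  height = c(64, 70, 66, 68, 61, 72),
  year   = c(1, 2, 2, 3, 4, 1)
)

dims       <- dim(data)       # number of rows, then number of columns
first_five <- data[1:5, ]     # first five rows, all columns
two_cols   <- data[, 1:2]     # all rows, columns 1 and 2
block      <- data[1:5, 1:2]  # first five rows of the first two columns
shoes      <- data$shoes      # the column called shoes, as a plain vector

str(data)                     # prints the structure of the object
print(head(data))             # prints the first few rows
```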
attach(data)
attaches data so you can refer to its columns by name without typing data$ each time
(if you attach several datasets that have the same column names, R will be confused, so use detach(data) before attaching another)
data<-subset(data,Type=="WT")
keeps only the rows where Type is "WT"; you can combine more == conditions with & (and) or | (or) to be more specific, and use select= to pick columns
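A small sketch of `subset()` with one and two conditions. The data frame `plants` and its values are hypothetical; only the column names `Type` and `Oil` and the label "WT" appear in the notes.

```r
# Hypothetical data frame; values are made up
plants <- data.frame(
  Type = c("WT", "WT", "Mut", "Mut", "WT"),
  Oil  = c(2.1, 2.5, 3.0, 3.4, 1.9)
)

wt       <- subset(plants, Type == "WT")                # rows where Type is "WT"
wt_high  <- subset(plants, Type == "WT" & Oil > 2)      # two conditions joined with &
oil_only <- subset(plants, Type == "WT", select = Oil)  # matching rows, Oil column only
```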
hist(data)
gives you a histogram
hist(data,breaks=c(60,70,80,90))
sets the bin boundaries (break points) of the histogram; the breaks must cover the full range of the data
xlab="percents of ..."
x axis name
ylab="Frequency"
y axis name
main="..."
title of the graph
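Putting the histogram options together on the dataset from the top of the notes. The axis labels and title are placeholder text. The breaks run from 60 to 140 because the data range from 70 to 131, and `hist()` stops with an error if any value falls outside the breaks.

```r
data <- c(81,85,93,93,99,76,75,84,78,84,81,82,89,81,96,82,
          74,70,84,86,80,70,131,75,88,102,115,89,82,79,106)

# Bins of width 10 from 60 to 140 cover every observation
h <- hist(data,
          breaks = seq(60, 140, 10),
          xlab   = "Value",              # x axis name (placeholder)
          ylab   = "Frequency",          # y axis name
          main   = "Histogram of data")  # title (placeholder)

h$counts   # number of observations in each bin
```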
boxplot(data)
gives a box plot
boxplot(...~...)
gives side by side boxplots
boxplot(Oil~Type,names=c("Wild Type","Mutated"))
creates your own label for the boxplot
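A sketch of side-by-side boxplots with custom labels, using a hypothetical data frame `plants` (values invented). One thing to watch: R orders the groups by the sorted values of the grouping column, so the labels in `names=` have to be listed in that same order. With the made-up codes "Mut" and "WT" below, "Mut" comes first.

```r
# Hypothetical data frame; values are made up
plants <- data.frame(
  Type = c("WT", "WT", "Mut", "Mut", "WT"),
  Oil  = c(2.1, 2.5, 3.0, 3.4, 1.9)
)

# Groups are drawn in sorted order: "Mut" first, then "WT"
b <- boxplot(Oil ~ Type, data = plants,
             names = c("Mutated", "Wild Type"),
             ylab = "Oil")

b$n   # number of observations in each box, in plotting order
```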
plot(X,Y)
create a scatterplot
X1 <- c(8,5,14,13,29)
X2 <- c(13,8,6,18,4)
X1 is first line with the numbers you want
X2...
dbinom(j,n,p)
gives P(Y = j), the binomial probability of exactly j successes in n trials
pbinom(J,n,p)
gives P(Y <= J) = P(Y = 0) + P(Y = 1) + ... + P(Y = J), the cumulative binomial probability
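A quick check that `pbinom` is the running sum of `dbinom`. The values n = 10, p = 0.3, and J = 3 are arbitrary illustrative choices.

```r
n <- 10; p <- 0.3; J <- 3           # illustrative values

exact  <- dbinom(J, n, p)           # P(Y = 3)
cumul  <- pbinom(J, n, p)           # P(Y <= 3)
by_sum <- sum(dbinom(0:J, n, p))    # P(Y = 0) + ... + P(Y = 3)
upper  <- 1 - pbinom(J, n, p)       # P(Y > 3), by the complement rule
```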
dpois(j,lambda)
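gives P(X = j), the Poisson probability of exactly j events
ppois(J,lambda)
gives P(X <= J), the cumulative Poisson probability
pexp(x,lambda)
gives P(Y <= x) for the exponential distribution with rate lambda
qexp(p, lambda)
gives the pth percentile of the exponential distribution (p entered as a proportion between 0 and 1)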
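A minimal sketch of the Poisson and exponential functions above, with arbitrary illustrative values for `lambda`, `J`, and `x`. In `pexp` and `qexp` the second argument is the rate.

```r
lambda <- 2; J <- 3; x <- 1.5          # illustrative values

pois_cum <- ppois(J, lambda)           # P(X <= 3)
pois_sum <- sum(dpois(0:J, lambda))    # same probability, added term by term

exp_cdf     <- pexp(x, lambda)         # P(Y <= 1.5)
exp_formula <- 1 - exp(-lambda * x)    # closed form of the exponential CDF

med <- qexp(0.5, lambda)               # median: half the area lies to its left
```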
pnorm(x,mu,sigma)
Pr{X < x} for X~N(mu,sigma), so 1-pnorm(x,mu,sigma) gives Pr{X > x}
qnorm(p,mu,sigma)
gives the value in the normal distribution (with mean mu and sd sigma) that has p to the left of it
norm prob
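A short sketch of `pnorm` and `qnorm` for areas below, above, and between values. The mean of 100 and standard deviation of 15 are arbitrary illustrative numbers.

```r
mu <- 100; sigma <- 15                       # illustrative values

below   <- pnorm(115, mu, sigma)             # Pr{X < 115}
above   <- 1 - pnorm(115, mu, sigma)         # Pr{X > 115}
between <- pnorm(115, mu, sigma) -
           pnorm(85, mu, sigma)              # Pr{85 < X < 115}, within 1 sd of the mean
cutoff  <- qnorm(0.975, mu, sigma)           # value with 0.975 of the area to its left
```

Because `qnorm` undoes `pnorm`, feeding `cutoff` back into `pnorm` returns 0.975.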
pt(t,df)
Pr{T < t} for T~t(df)
t distribution
qt(p,df)
gives the value of the t distribution with df=df that has p to the left of it
qqnorm(data)
normal QQplot of the data.
t.test(data,conf.level=0.95)
runs a one-sample t test; the output includes a 95% confidence interval for the population mean
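A sketch showing where the interval from `t.test` comes from: the sample mean plus or minus a `qt` critical value times the standard error. It uses the dataset from the top of the notes; the names `ci`, `tcrit`, and `manual` are illustrative.

```r
data <- c(81,85,93,93,99,76,75,84,78,84,81,82,89,81,96,82,
          74,70,84,86,80,70,131,75,88,102,115,89,82,79,106)

ci <- t.test(data, conf.level = 0.95)$conf.int   # interval reported by t.test

n      <- length(data)
tcrit  <- qt(0.975, df = n - 1)                  # leaves 2.5% in each tail
manual <- mean(data) + c(-1, 1) * tcrit * sd(data) / sqrt(n)

print(ci)
print(manual)
```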
descriptive statistics
collecting, organizing, and presenting the data.
inferential statistics
drawing conclusions about a population based on sample data from that population.
statistic
is a number calculated from a sample and is used to estimate the parameter.
Parameter
is a number used to describe a population.
time series data
A variable that is measured at regular intervals over time
cross-sectional data
When a characteristic is measured on many subjects at the same time point (or same time frame)
data warehouse
These data are recorded and stored electronically, in vast digital repositories
big data
describe data sets so large that traditional methods of storage and analysis are inadequate.
types of variables
quantitative, qualitative
categorical
arise from descriptive responses to questions like "What kind of advertising do you use?"
may have only two possible values (like "yes" or "no")
may be a number like a zip code
Qualitative
don't have a meaningful numerical value
aka categorical variables
nominal
Categorical variables used only to name categories that don't have order (grocery, clothing, hardware)
ordinal
When data values can be ordered (freshman, sophomore, junior, senior)
quantitative
have a numerical value that works like a number
discrete
there are jumps between the possible values
continuous
there is another possible value between any two values
identifier variables
a unique identifier assigned to each individual or item in a group (social security number, student ID number)
pattern of a distribution
skewed left is when the thin tail stretches to the left and most of the data is on the right
qualities of a good graphical display(and things to avoid)
Avoid 3-D
Good:
good title
sample size
units
Frequency
Relative Frequency
frequency is the number of times a value or category occurs; relative frequency is that count divided by the total number of observations
Pie Chart
Categorical (qualitative) data
Displays parts of a whole
Not good when there are too many categories
Don't ever make it 3-D or "tilted"!
Bar Graph (frequency and relative frequency)
Categorical (qualitative) data
Can be horizontal or vertical
Can display parts of a whole or separate values
For nominal data, put bars in ascending or descending order
For ordinal data, put bars in order of categories
Pictograms
A pictorial symbol or sign representing an object or concept. Used by many non-alphabetic written scripts. (Can be misleading when the pictures differ in size, like people icons where one is bigger and one is smaller.)
Line Graph
Displays quantitative data changing over time
Time should go on the horizontal axis
Variable should go on the vertical axis
Use different lines to denote separate categories or groups
Boxplot
A graph of the five-number summary.
good at comparing two datasets next to each other
Histogram (frequency and relative frequency)
Medium to large quantitative datasets
Bins touch
Choice of number of bins can distort features of the shape of the distribution
(Notice: a boxplot displays the same data as the histogram, but the histogram shows more details about the shape.)
Scatterplot
is used to depict two potentially related quantitative variables.
- Each point is a pairing: (x1,y1), (x2,y2), etc.
- Linear, curvilinear, or no relationships
- Positive vs. negative relationships
Five Number Summary
Min, Q1, median, Q3, max
Boxplot: an efficient way to communicate the measure of center and variation all at once
IQR
Q3-Q1
Range
Max-Min
Sample Mean (x bar)
average of the sample
balancing point
Sample Standard Deviation (s)(just its properties not how to compute)
the square root of the sample variance: s = sqrt(s^2)
Mode
Most frequently occurring value
According to the shape of the distribution know when to use:
Mean vs. Median
Sample Standard Deviation vs. Quartiles (IQR and Range)
Coefficient of variation and What it is and why it's used
Standard deviation expressed as a percent of the mean
Compare variation in datasets with different units or means
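A minimal sketch of the coefficient of variation. The helper `cv` and the height values are made up for illustration. Converting inches to centimeters changes the standard deviation but leaves the CV unchanged, which is why it is handy for comparing variation across different units.

```r
# Hypothetical helper: standard deviation as a percent of the mean
cv <- function(x) sd(x) / mean(x) * 100

heights_in <- c(61, 64, 66, 68, 70, 72)   # made-up heights in inches
heights_cm <- heights_in * 2.54           # same heights in centimeters

sd(heights_in)    # differs from sd(heights_cm) because the units differ
sd(heights_cm)
cv_in <- cv(heights_in)
cv_cm <- cv(heights_cm)                   # same as cv_in: the units cancel
```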
Estimating percent of observations within certain standard deviations
Chebychev's Inequality
Empirical Rule
k = 2
1 - 1/2^2
= 1 - 1/4
= 0.75, so at least 75% of observations lie within 2 standard deviations of the mean
Chebychev's Inequality
For any population with mean μ and standard deviation σ, the percent of observations that lie within k standard deviations of the mean is at least (1 − 1/k^2) × 100, for k > 1
Empirical Rule
For unimodal distributions that are roughly normal, approximately
68% of observations are within 1 sd of the mean
95% of observations are within 2 sd of the mean
99.7% of observations are within 3 sd of the mean
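A sketch comparing the Chebychev lower bound with what actually happens in the dataset from the top of the notes. The helper names `chebyshev` and `within_k` are made up for this example. Chebychev's bound is a guaranteed minimum, so the observed proportion should never fall below it; the Empirical Rule percentages are only approximate and assume a roughly normal shape.

```r
data <- c(81,85,93,93,99,76,75,84,78,84,81,82,89,81,96,82,
          74,70,84,86,80,70,131,75,88,102,115,89,82,79,106)

# Hypothetical helpers
chebyshev <- function(k) 1 - 1 / k^2                          # minimum proportion within k sd
within_k  <- function(x, k) mean(abs(x - mean(x)) <= k * sd(x))  # observed proportion within k sd

bound2 <- chebyshev(2)         # 1 - 1/4
obs1   <- within_k(data, 1)    # compare with the Empirical Rule's 68%
obs2   <- within_k(data, 2)    # compare with 95% and with bound2
obs3   <- within_k(data, 3)    # compare with 99.7%
```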
Z-scores - calculation, interpretation, properties and when to use
A Z-score represents the number of standard deviations an observation is above or below the mean
Beware of using Z-scores from skewed distributions (the Z-scores have the same distribution shape as the original observations)
Cannot compare a Z-score from a skewed distribution to a Z-score from a symmetric distribution
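A minimal sketch of computing Z-scores for the dataset from the top of the notes, using the sample mean and sample standard deviation. Standardizing shifts and rescales the values, so the Z-scores always have mean 0 and standard deviation 1, but the shape of the distribution stays the same.

```r
data <- c(81,85,93,93,99,76,75,84,78,84,81,82,89,81,96,82,
          74,70,84,86,80,70,131,75,88,102,115,89,82,79,106)

z <- (data - mean(data)) / sd(data)   # standard deviations above (+) or below (-) the mean

z_mean <- mean(z)                     # 0, up to rounding error
z_sd   <- sd(z)                       # 1
z_131  <- z[data == 131]              # Z-score of the largest observation
```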
2 basic rules (probability is on scale from 0 to 1; sum of probability of all (disjoint) events in sample space = 1)
P(A) = 0 → Event A will not occur
P(A) = 1 → Event A will surely occur
P(A) = ½ → Event A will happen 50% of the time
Compute probabilities for complements, unions, intersections, conditional events
Complement:
The probability of the complement of an event, P(A^c), is equal to one minus the probability of the event: P(A^c) = 1 − P(A)
Union:
The probability that event A or B occurs (at least one of the events happens) is: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Intersection:
General intersection rule for two events both occurring (always works): P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
Disjoint events
Two events, A and B, are said to be disjoint (or mutually exclusive) if they share no outcomes in common. Disjoint events have no intersection: P(A ∩ B) = 0
independent events
Two events are independent if the occurrence of one event does not affect the probability of the occurrence of the other event. Examples: two flips of a coin, two rolls of a die, two spins of a roulette wheel
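The rules above can be checked on one roll of a fair die. The events are chosen just for illustration: A is "even" and B is "greater than 3". Since every outcome is equally likely, the mean of a TRUE/FALSE vector gives the probability.

```r
outcomes <- 1:6                  # one roll of a fair die, all outcomes equally likely
A <- outcomes %% 2 == 0          # event A: even number (2, 4, 6)
B <- outcomes > 3                # event B: greater than 3 (4, 5, 6)

pA     <- mean(A)                # P(A)
pB     <- mean(B)                # P(B)
pAc    <- 1 - pA                 # complement rule
pAandB <- mean(A & B)            # intersection: outcomes 4 and 6
pAorB  <- mean(A | B)            # union: outcomes 2, 4, 5, 6
pBgivA <- pAandB / pA            # conditional probability P(B|A)

# Independence would require P(A and B) = P(A) P(B); here it fails
independent <- isTRUE(all.equal(pAandB, pA * pB))
```

Here P(B|A) differs from P(B), so knowing the roll is even changes the chance that it is greater than 3, and the two events are not independent.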
tree diagram (how to use and determine independence)
Bayes' Theorem
go over some examples in the notes
a theorem describing how the conditional probability of each of a set of possible causes for a given observed outcome can be computed from knowledge of the probability of each cause and the conditional probability of the outcome of each cause.
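A worked sketch of Bayes' Theorem with one possible cause (a condition D) and one observed outcome (a positive test). All three input numbers are hypothetical, chosen only to show the arithmetic: P(D | +) = P(+ | D) P(D) / [P(+ | D) P(D) + P(+ | not D) P(not D)].

```r
# Hypothetical inputs
prior     <- 0.01    # P(D): probability of the condition
sens      <- 0.95    # P(+ | D): positive test given the condition
false_pos <- 0.05    # P(+ | not D): positive test without the condition

# Total probability of a positive test (both branches of the tree diagram)
p_pos <- sens * prior + false_pos * (1 - prior)

# Bayes' Theorem: probability of the condition given a positive test
posterior <- sens * prior / p_pos
```

With these made-up inputs the posterior is far smaller than 0.95, because the condition is rare and most positives come from the much larger group without it.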
mean for a discrete random variable
The mean or expected value of a discrete random variable X is μ_X = E(X) = Σ x_i P(x_i)
variance(sd) for a discrete random variable
The variance of a discrete random variable X is σ_X^2 = Var(X) = Σ (x_i − μ_X)^2 P(x_i); the standard deviation σ_X is its square root
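A small sketch applying the two formulas above to a probability table. The values of X and their probabilities are invented for illustration.

```r
x <- c(0, 1, 2, 3)            # possible values of X (made up)
p <- c(0.1, 0.3, 0.4, 0.2)    # P(x_i); must add to 1

mu     <- sum(x * p)              # E(X) = sum of x_i P(x_i)
sigma2 <- sum((x - mu)^2 * p)     # Var(X) = sum of (x_i - mu)^2 P(x_i)
sigma  <- sqrt(sigma2)            # standard deviation
```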
Binomial (when to use, mean and variance(sd))
A fixed number, n, of trials take place, where:
1. Each trial has only two possible outcomes ("Success" or "Failure")
2. Probability of "Success" is a constant p for every trial
3. Trials are identical and independent
These are called Bernoulli trials.
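For a binomial random variable the mean is np and the variance is np(1 − p), so the standard deviation is sqrt(np(1 − p)). The sketch below checks these shortcuts against the general discrete formulas, using arbitrary illustrative values n = 10 and p = 0.3.

```r
n <- 10; p <- 0.3                 # illustrative values

y     <- 0:n                      # possible numbers of successes
probs <- dbinom(y, n, p)          # P(Y = y) for each y

mean_y <- sum(y * probs)                 # general formula for E(Y)
var_y  <- sum((y - mean_y)^2 * probs)    # general formula for Var(Y)

shortcut_mean <- n * p                   # binomial shortcut
shortcut_var  <- n * p * (1 - p)         # binomial shortcut
```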
Poisson (When to use)
Lambda (λ) and its relation to the exponential distribution
1) The event cannot occur twice at exactly the same time.
2) No occurrence of the event being analyzed affects the probability of the event re-occurring (events occur independently).
3) The expected number of occurrences of the event during any such interval is a constant.
Note: The Poisson Distribution is only designed to be applied to events that occur relatively rarely.
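One way to see the link between λ and the exponential distribution: if events occur at a constant rate of λ per unit of time, the count in an interval of length t is Poisson with mean λt, and the waiting time until the next event is exponential with rate λ. "No events in the interval" and "waiting longer than t" are the same thing, so the two probabilities match. The values of `lambda` and `t` below are arbitrary.

```r
lambda <- 3      # illustrative rate: expected events per unit of time
t      <- 0.5    # illustrative interval length

none_poisson <- dpois(0, lambda * t)        # P(zero events in an interval of length t)
wait_longer  <- 1 - pexp(t, rate = lambda)  # P(waiting time exceeds t)

mean_wait <- 1 / lambda                     # expected waiting time between events
```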
Normal distribution (Properties, Standard normal distribution, z-scores and their interpretation)
Properties:
Lengths and weights of newborn babies
Scores on SAT
Cumulative debt in college students
Advertising expenditure of firms
Standard Normal Distribution
has a mean of 0 and a standard deviation of 1
(a normal variable can be converted to standard normal by z = (x − μ)/σ)
z-scores
is the number of standard deviations from the mean a data point is. More technically, it's a measure of how many standard deviations below or above the population mean a raw score is
μ is the population mean and σ is the population standard deviation
Parameter (ch. 7)
number used to describe a population
(usually do not know the value of a parameter; it is a fixed number)
Statistic (ch. 7)
is a number calculated from a sample and is used to estimate the parameter
(we know the value; it will change from sample to sample)
sample survey
when the respondents in a survey provide their own data
census
when a survey attempts to use the entire population as the sample
Central Limit Theorem
the sampling distribution of a sample mean, sum, or percentage will become approximately normal as the sample size gets larger, even when the population itself is not normal
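A simulation sketch of the Central Limit Theorem. It draws many samples from a strongly right-skewed population (exponential with rate 1, which has mean 1 and standard deviation 1) and looks at the sample means. The sample size of 40, the 5000 repetitions, and the seed are arbitrary choices for this illustration.

```r
set.seed(1)                  # arbitrary seed so the run is repeatable
n    <- 40                   # size of each sample
reps <- 5000                 # number of samples drawn

# One sample mean per repetition, from a skewed exponential population
means <- replicate(reps, mean(rexp(n, rate = 1)))

center <- mean(means)        # should be close to the population mean, 1
spread <- sd(means)          # should be close to sigma / sqrt(n) = 1 / sqrt(40)

hist(means, main = "Sample means (n = 40)", xlab = "Sample mean")
qqnorm(means)                # points near a straight line suggest approximate normality
```

Rerunning with a smaller `n` (say 5) should give a histogram that is still visibly skewed, which shows the role of the sample size.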