Uses of data science
Exploration
Inference
Prediction
Association
Any relation or link between two variables. An association does not imply causality.
How to identify causality
Usually requires an experiment with both a treatment (or experimental) group and a control group (no treatment, or a placebo). If the groups are similar apart from the treatment, differences between their outcomes can be attributed to the treatment.
Confounding
A systematic difference between the groups other than the treatment. It leads researchers astray by creating a false sense that there is a causal relationship.
Benefits of random assignment
The groups are likely to be similar apart from the treatment. It also lets you account for the variability that comes from the assignment itself.
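A minimal sketch of random assignment, using only Python's standard library. The function name `randomly_assign`, the subject labels, and the seed are made up for illustration; shuffling and splitting in half is one simple way to do it, not the only one.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two groups."""
    rng = random.Random(seed)      # a seed makes the shuffle repeatable
    shuffled = list(subjects)      # copy, so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

subjects = ["s" + str(i) for i in range(10)]   # hypothetical subject labels
treatment, control = randomly_assign(subjects, seed=0)
print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in each group, no trait of the subjects can systematically line up with the treatment.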
Observational Study
Researchers observe subjects without assigning treatments or otherwise intervening.
Assignment Statement
Changes the meaning of the name to the left of the "=" symbol. The name is bound to the value of the expression on the right.
ex: A_New_Name = 5 * 10
Functions
Code that performs a task. Organized and reusable. Breaks a complex problem into smaller parts.
Functions can be built into Python or defined by you.
What happens to a past calculation if you change the variable
It does not change. You must run the calculation again for the result to reflect the new value.
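A short sketch of this behavior. The names `price`, `total`, and `stale_total` are made up for illustration.

```python
price = 5 * 10          # price is bound to 50
total = price + 1       # total is computed once, from the current value of price

price = 100             # rebinding price does not touch total
stale_total = total     # still based on the old price

total = price + 1       # recompute to pick up the new value

print(stale_total)      # 51
print(total)            # 101
```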
How a function works
Call function
Input values
Interpreter
Function Output
Data Structures
Vector - 1 dimensional
Table - 2 dimensional
Table
Sequence of labeled columns.
Each row: one individual and all of its data
Each column: the values of one variable across all individuals
Table Operations
t.drop(label) - makes table where chosen columns are omitted
t.sort(label, descending=True) - makes table where the rows are sorted by a specific column. Sorts in increasing order by default; descending=True reverses the order.
t.where(label, condition) - makes table with just rows matching condition
t.select(label) - makes table that only has selected columns
Condition arguments
are.equal_to(value)
are.below(value)
are.above(value)
are.above_or_equal_to(value)
are.below_or_equal_to(value)
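To make the results of these operations concrete, here is a rough stand-in in plain Python, with a table modeled as a list of row dictionaries. This is not the datascience library and not how it is implemented; the sample data and variable names are invented. Each comment names the table call that the line imitates.

```python
# Plain-Python stand-in: a "table" as a list of row dictionaries.
rows = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Grace", "age": 45, "city": "New York"},
    {"name": "Alan", "age": 41, "city": "London"},
]

# Like t.select("name", "age"): keep only the chosen columns.
selected = [{k: r[k] for k in ("name", "age")} for r in rows]

# Like t.drop("city"): keep every column except the chosen ones.
dropped = [{k: v for k, v in r.items() if k != "city"} for r in rows]

# Like t.where("age", are.above(40)): keep rows matching the condition.
over_40 = [r for r in rows if r["age"] > 40]

# Like t.sort("age", descending=True): order rows by one column.
by_age_desc = sorted(rows, key=lambda r: r["age"], reverse=True)

print([r["name"] for r in over_40])       # ['Grace', 'Alan']
print([r["name"] for r in by_age_desc])   # ['Grace', 'Alan', 'Ada']
```

As with the table methods, each step builds a new collection and leaves the original rows unchanged.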
Data Types
All data values have a type.
int
float
str
Table
Less common: builtin_function_or_method (the type of a built-in function such as abs)
type function
gives the type of a value. Based on value, not appearance. ex: type(5+5) returns int and type(5.5) returns float.
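A quick check of a few types. The list of sample values is arbitrary; `type(v).__name__` is used only to print the bare type name.

```python
values = [5 + 5, 5.5, 10 / 2, "5", abs]

# type() looks at the value itself: 10 / 2 is a float even though it equals 5,
# and "5" is a str even though it looks like a number.
type_names = [type(v).__name__ for v in values]

print(type_names)
# ['int', 'float', 'float', 'str', 'builtin_function_or_method']
```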
Column/Array Index
The index of the first value is always zero.
Float
A number that can have a fractional (decimal) part. Written with a decimal point, even when the fractional part is zero (ex: 5.0).
int
integer of any size, never has a decimal
Float Limitations
Final decimal places can be wrong after arithmetic
Limited precision of 15-16 significant digits
Limited size, though the limit is very large (about 1.8 x 10^308)
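A small illustration of these limits in plain Python. `math.isclose` is one common way to compare floats without relying on the last decimal places.

```python
import math

total = 0.1 + 0.2
print(total)                       # 0.30000000000000004
print(total == 0.3)                # False: the last digits are off
print(math.isclose(total, 0.3))    # True: equal within a small tolerance

too_big = 1e308 * 10               # past the largest float, the result is inf
print(math.isinf(too_big))         # True
```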
Operators
Addition: +
Subtraction: -
Multiplication: *
Division: /
Remainder: %
Exponent: **
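Each operator applied to the same pair of numbers. The dictionary is only a convenient way to label the results.

```python
results = {
    "addition": 7 + 2,          # 9
    "subtraction": 7 - 2,       # 5
    "multiplication": 7 * 2,    # 14
    "division": 7 / 2,          # 3.5 (division always gives a float)
    "remainder": 7 % 2,         # 1
    "exponent": 7 ** 2,         # 49
}

for name, value in results.items():
    print(name, value)
```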
String
Text of any length. Can be written with double quotes ("...") or single quotes ('...').
Converting from strings
int("15")
float("1.9")
The string must represent a valid number of that type, otherwise Python raises a ValueError.
Converting to strings
Any value can be converted to a string with str().
Converting between types
Floats and integers can be converted to each other, but converting a float to an int drops the fractional part, so information is lost.
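The conversions above, sketched in plain Python. The variable names are illustrative only.

```python
as_int = int("15")        # 15
as_float = float("1.9")   # 1.9
as_text = str(3.5)        # "3.5"

truncated = int(1.9)      # 1: the fractional part is dropped, not rounded
widened = float(3)        # 3.0

# A string that is not a valid integer cannot be converted.
try:
    int("fifteen")
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(as_int, as_float, as_text, truncated, widened, conversion_failed)
```

Note that int("1.9") also raises a ValueError; convert with float first if the text may contain a decimal point.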
Array
A sequence of values that all have the same type. A table column is an array. Arithmetic is applied to each element individually, and two arrays of the same length can be combined element by element (ex: added together).
Array functions
t.column(label) - returns a column as an array. The column's index can be used instead of its label.
a.item(index) - value at a particular index
to aggregate array values, either:
np.mean(), np.sum(), np.max(), np.min()
a.mean(), a.sum(), a.max(), a.min()
make_array(values) - makes an array from the given values
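A sketch of elementwise arithmetic and aggregation. It assumes numpy is installed and uses np.array to build the arrays in place of make_array; the sample numbers are arbitrary.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

doubled = a * 2       # each element is multiplied: 2, 4, 6
summed = a + b        # same-length arrays add element by element: 11, 22, 33
first = a.item(0)     # value at index 0

# The function form and the method form give the same answer.
mean_fn = np.mean(b)
mean_method = b.mean()

print(doubled, summed, first, mean_fn, mean_method)
```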
Repeating a string
Multiplying a string by an integer gives a longer, repeated string. ex: "hi" * 2 gives "hihi"
np.arange
Makes an array of evenly spaced numbers.
np.arange(end) - increasing integers from 0 up to end
np.arange(start, end) - increasing integers from start up to end
np.arange(start, end, step) - values from start up to end, with step between consecutive values
The end value is never included in the range.
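The three forms side by side, assuming numpy is installed. The variable names are illustrative.

```python
import numpy as np

up_to_five = np.arange(5)            # 0, 1, 2, 3, 4
two_to_six = np.arange(2, 6)         # 2, 3, 4, 5
by_threes = np.arange(0, 10, 3)      # 0, 3, 6, 9

print(up_to_five)
print(two_to_six)
print(by_threes)
```

In every case the end value itself (5, 6, and 10 here) is left out.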
Creating tables
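Table.read_table(filename) - makes a table from a data file, such as a CSV
Table() - makes an empty table
Table().with_column(label, array) - adds one labeled array to a table as a column
Table().with_columns(label1, array1, label2, array2) - adds multiple labeled arrays to a table as columns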
num_rows and num_columns
t.num_rows and t.num_columns give the size of a table: its number of rows and its number of columns.
Row-related functions
t.take(row_numbers) - keeps the rows at the given row indices
t.where(label, condition) - keeps the rows where the column value matches the condition
Lists
A sequence of values that can have different types.
Numerical attributes
each value is from a numerical scale
categorical attributes
Each value comes from a fixed set of categories.
Ways to plot 2 numerical variables
Line graph - t.plot(x, y). Better for sequential data and trends over time.
Scatter plot - t.scatter(x, y). Better for showing associations.
Bins
defined by their lower and upper bounds. Upper bound is the lower bound of the next bin.
Histogram
Shows the distribution of a numerical variable
One bar for each bin.
Area of a bar is the percent of individuals in the bin.
Height of Bar = (% in bin) / (bin width)
height measures the percent of data in the bin relative to space in the bin. Height measures density.
Histogram area
Area measures percent.
Area of bar = % in bin = Height x width of bin
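A worked example of the height and area formulas in plain Python. The data values and bin edges are made up, and each bin is treated as including its lower bound but not its upper bound.

```python
values = [1, 2, 2, 3, 5, 6, 7, 12, 15, 18]
bin_edges = [0, 5, 10, 20]     # three bins: [0, 5), [5, 10), [10, 20)

bars = []
for lower, upper in zip(bin_edges, bin_edges[1:]):
    count = sum(1 for v in values if lower <= v < upper)
    percent = 100 * count / len(values)    # area of the bar
    width = upper - lower
    height = percent / width               # density: percent per unit
    bars.append({"bin": (lower, upper), "percent": percent, "height": height})

for bar in bars:
    print(bar)
```

The last two bins each hold 30% of the data, but the wider one is half as tall (3 percent per unit versus 6) because the same percent is spread over twice the width.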
Understanding histograms
Looking for the percent of individuals in a bin → use area
Looking for how crowded or dense a bin is → use height
Bar Chart vs histogram
Bar chart focuses on distribution of a categorical variable while a histogram focuses on distribution of a numerical variable.
In a bar chart, bars have arbitrary (but equal) widths and spacing and can appear in any order, while a histogram has a numerical horizontal axis that is drawn to scale.
Bar chart - height and area of bars are both proportional to the percent of individuals. Histogram - area of bars is proportional to the percent of individuals, and height measures density.
Building functions
def name(argument names):
    return expression
Made up of def, name, argument parameters, and a return expression.
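A small function that follows this pattern. The name `percent_change` and its arguments are invented for illustration.

```python
def percent_change(old, new):
    """Return the percent change from old to new."""
    return (new - old) / old * 100

growth = percent_change(50, 75)
print(growth)    # 50.0
```

Here def starts the definition, percent_change is the name, old and new are the argument names, and the return line gives the expression whose value the call produces.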
Apply Function
t.apply(function, argument1, argument2, etc)
first argument - the function to apply
remaining arguments - labels of the input columns, one per function argument
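What apply does can be imitated in plain Python: call the function once per row, feeding it that row's values from the input columns, and collect the results. The sketch below is a stand-in and not the datascience implementation; `total_cost` and the two lists are hypothetical.

```python
def total_cost(price, quantity):
    """Cost of one order line."""
    return price * quantity

# Two "columns" of the same length, one value per row.
prices = [2, 5, 10]
quantities = [3, 1, 4]

# Row by row: total_cost(2, 3), total_cost(5, 1), total_cost(10, 4)
costs = [total_cost(p, q) for p, q in zip(prices, quantities)]
print(costs)    # [6, 5, 40]
```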
table.with_column(label, array)
table.with_row(list)
Column stack and row stack. Each returns a table with the new column or row added.
len()
Returns the length (number of items) of a sequence such as a string, list, or array.