Scatter plot
numerical vs numerical
line plot
sequential numerical (time) vs numerical
bar chart
categorical vs numerical
histogram
distribution of numerical
What is the distribution of a variable?
“How often does a variable take on a certain value?”
Both categorical and numerical variables have this.
Categorical variables
Bar charts show you the distribution of a categorical variable.
You can do .plot(kind="barh", y="Distance") ← y declares the numerical column
You can also include .plot(kind=, y=, legend=False, xlabel="Count", title="Distribution of Exoplanet Types")
When you don't specify an x, the index is used.
legend=False hides the legend, which would otherwise appear in the top right and may or may not be meaningful
xlabel= is the label of the x-axis (for barh, the horizontal axis that holds the counts, as in xlabel="Count")
title= is the title
figsize=(3, 10) takes a sequence of two values: the first is how wide you want the plot to be, the second is how tall (in inches)
Keep in mind that barh draws the first row at the bottom, so if you sort with ascending=False, the bars will actually appear in ascending order when read from top to bottom!
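The arguments above can be put together in one call. This is a minimal sketch, assuming plain pandas (not necessarily the library used in the course) and a made-up `planets` table; the column values are hypothetical. It sets the x-axis label on the returned axes with `set_xlabel` rather than through `xlabel=`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import pandas as pd

# Hypothetical data: one row per exoplanet (values are made up for illustration).
planets = pd.DataFrame({
    "Type": ["Gas Giant", "Gas Giant", "Gas Giant",
             "Super Earth", "Super Earth", "Terrestrial"],
    "Distance": [120.0, 340.5, 98.2, 45.1, 610.0, 12.4],
})

# Count planets per type; after .count(), the Distance column holds group sizes.
type_counts = (
    planets.groupby("Type")
    .count()
    .sort_values(by="Distance", ascending=False)
)

# No x is given, so the index (Type) supplies the categories.
ax = type_counts.plot(
    kind="barh",
    y="Distance",
    legend=False,  # hide the legend
    title="Distribution of Exoplanet Types",
    figsize=(10, 3),  # (width, height)
)
ax.set_xlabel("Count")

# Bar lengths in drawing order; the first row is drawn at the bottom.
bar_lengths = [patch.get_width() for patch in ax.patches]
```

Because the table is sorted in descending order and the first row is drawn at the bottom, the longest bar ends up at the bottom of the chart.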
What does .describe() do? (Series method)
Outputs a Series of summary statistics: count, mean, std, min, quartiles, and max.
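A quick sketch of `.describe()` on a numerical Series, assuming plain pandas and a hypothetical `radii` Series with made-up values:

```python
import pandas as pd

# Hypothetical radii, made up for illustration.
radii = pd.Series([1.0, 2.0, 3.0, 4.0], name="Radius")

# The result is itself a Series, indexed by the name of each statistic.
summary = radii.describe()
print(summary)

# Individual statistics can be pulled out by label.
mean_radius = summary["mean"]
```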
Can you represent the radius of exoplanets in a bar chart?
NO. Radius is numerical, so the horizontal axis should be a number line, not a set of categories. On a number line, there would be more space between certain bars than others.
ex. on a bar chart, a planet that is 80% larger than another can look no different from one that is only slightly larger, because the bars are evenly spaced
Instead, use density histograms
What is a density histogram (e.g. for radius)?
Looks like a bar chart, but the x-axis is a number line (rather than a set of categories!)
Also, the y-axis label says "Frequency", but the height is not a count: a height of 2.8 does not mean there are 2.8 planets within that range.
What is binning?
Groups nearby values into one bin. A bin [a, b) includes a but not b.
This is the binning convention: greater than or equal to the left endpoint and less than the right endpoint.
It doesn't distinguish between the individual values in a bin; it just counts them together.
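The [a, b) convention can be checked directly. This sketch uses `np.histogram` (an assumption; the notes only mention plotting) on made-up values to show that a value sitting exactly on an interior endpoint goes into the bin to its right:

```python
import numpy as np

# Hypothetical values; 2.0 sits exactly on the edge between the two bins.
values = np.array([1.0, 1.5, 1.9, 2.0])
edges = [1, 2, 3]  # two bins: [1, 2) and [2, 3]

# counts[i] is the number of values that fall in bin i.
counts, _ = np.histogram(values, bins=edges)
print(counts.tolist())
```

Here 1.0, 1.5, and 1.9 land in [1, 2), while 2.0 is counted in the next bin.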
Plotting density histograms
df.plot(
    kind="hist",
    y=column_name,
    density=True
)
ec="w" adds a thin white EDGE COLOR to the bars
A histogram requires ONLY ONE column (the numerical y); there is no x
By default, 10 bins of equal width are chosen, some of which may be empty
You can also choose different bins by including the argument bins=#
What does the bins=20 argument do?
Creates 20 bins of equal width for your histogram.
You can specify specific starting and ending points. How?
Set bins= to a sequence, such as a list, of all the endpoints you want to use.
bins=[]
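Both ways of setting `bins` can be sketched side by side. This assumes plain pandas and a hypothetical `planets` table with made-up radii; the endpoints in `custom_edges` are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import pandas as pd

# Hypothetical radii, made up for illustration.
planets = pd.DataFrame(
    {"Radius": [0.5, 0.8, 1.1, 1.3, 1.9, 2.4, 2.6, 3.7, 4.2, 9.5]}
)

# An integer asks for that many equal-width bins.
ax_equal = planets.plot(kind="hist", y="Radius", density=True, bins=20, ec="w")

# A sequence gives every endpoint explicitly; the bins may have unequal widths.
custom_edges = [0, 1, 2, 4, 10]
ax_custom = planets.plot(
    kind="hist", y="Radius", density=True, bins=custom_edges, ec="w"
)

# Each bin is drawn as one rectangle, including empty bins.
n_equal = len(ax_equal.patches)
n_custom = len(ax_custom.patches)
```

Five endpoints produce four bins, so `n_custom` is one less than the length of `custom_edges`.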
What are the y-axis values for density histograms?
The proportion of values in that bar's bin divided by the bar's WIDTH (proportion per unit on the x-axis)
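As a worked check of that definition, this sketch computes bar heights by hand and compares them with `np.histogram(..., density=True)`. The values and endpoints are made up, and using NumPy here is an assumption rather than something the notes prescribe.

```python
import numpy as np

# Hypothetical radii and bin endpoints, made up for illustration.
values = np.array([0.5, 0.8, 1.1, 1.3, 1.9, 2.4, 2.6, 3.7, 4.2, 9.5])
edges = np.array([0, 1, 2, 4, 10])

counts, _ = np.histogram(values, bins=edges)
proportions = counts / counts.sum()  # share of all values in each bin
widths = np.diff(edges)              # right endpoint minus left endpoint
heights = proportions / widths       # proportion per unit on the x-axis

# NumPy's own density calculation, for comparison.
density_heights, _ = np.histogram(values, bins=edges, density=True)
```

The third bin holds 30% of the values but is 2 units wide, so its height is 0.15, lower than the second bin, which also holds 30% but is only 1 unit wide.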
Normally histogram bins include [a, b) but what about the last section?
[a, b]
Do bins cut off value?
Yes, if you don’t include all values in your range then it will get cut off
What does bins=np.arange(4) do?
This works! np.arange(4) is [0, 1, 2, 3], so it creates the bins [0, 1), [1, 2), [2, 3]
Values above 3, such as 4, are CUT OFF!
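The cut-off can be seen by counting how many values actually land in a bin. A small sketch with made-up values, again using `np.histogram` as a stand-in for the plotting call:

```python
import numpy as np

values = np.array([0, 1, 2, 3, 4])  # hypothetical data

# np.arange(4) stops before 4, so the endpoints are 0, 1, 2, 3.
narrow_edges = np.arange(4)
narrow_counts, _ = np.histogram(values, bins=narrow_edges)

# np.arange(5) adds the endpoint 4, so the last bin becomes [3, 4].
wide_edges = np.arange(5)
wide_counts, _ = np.histogram(values, bins=wide_edges)

values_kept_narrow = int(narrow_counts.sum())
values_kept_wide = int(wide_counts.sum())
```

With the narrow endpoints only four of the five values are counted; the value 4 falls outside every bin.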
Also, what is the total area of a histogram?
The total area of all the bars is 1, which explains the weird y-axis. This is a DENSITY histogram, so it makes sense!
Proportion vs percentage
proportion = 0-1, percentage 0%-100%
How do you find the area of a bar?
Calculate the width, then multiply it by the height; the area is the proportion of values in that bin.
It won't be an exact match because you're estimating the height visually.
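The arithmetic is simple enough to sketch in plain Python. The endpoints and heights below are hypothetical, as if read off a density histogram:

```python
# Hypothetical bin endpoints and bar heights read off a density histogram.
edges = [0, 1, 2, 4, 8]
heights = [0.2, 0.3, 0.15, 0.05]

# Area of each bar = width * height = proportion of values in that bin.
areas = []
for left, right, height in zip(edges[:-1], edges[1:], heights):
    width = right - left
    areas.append(width * height)

# Proportions run from 0 to 1; multiply by 100 to get percentages.
percentages = [round(area * 100, 1) for area in areas]

# The areas of all bars together should come out to (approximately) 1.
total_area = sum(areas)
```

The third bar is only 0.15 tall but 2 units wide, so it holds about 30% of the values.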
The y-axis always says "Frequency", but for a density histogram that's wrong. How to fix?
You can use ylabel to fix it, but you usually don't; you just recognize that it's a density histogram and know how to interpret it.
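One way to relabel the axis, sketched with plain pandas and a made-up `planets` table. It sets the label on the returned axes with `set_ylabel`, which is an alternative to passing a `ylabel` argument to the plot call:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import pandas as pd

# Hypothetical radii, made up for illustration.
planets = pd.DataFrame(
    {"Radius": [0.5, 0.8, 1.1, 1.3, 1.9, 2.4, 2.6, 3.7, 4.2, 9.5]}
)

ax = planets.plot(kind="hist", y="Radius", density=True, ec="w")
default_label = ax.get_ylabel()  # the label pandas puts there by default

# Replace the misleading default with a label that matches density=True.
ax.set_ylabel("Density")
```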
How to make multiple plots on the same axes
You can .get([]) multiple columns and plot them at the same time!
Alternatively, if you omit y, ALL the columns are plotted at the same time (if they can be, i.e. they are numerical!)
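A sketch of overlaying two histograms, assuming plain pandas and a hypothetical `planets` table with made-up `Radius` and `Mass` columns. The `alpha=0.6` argument is an addition not in the notes; it makes the bars semi-transparent so overlapping ones stay visible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import pandas as pd

# Hypothetical data, made up for illustration.
planets = pd.DataFrame({
    "Type": ["Gas Giant", "Super Earth", "Terrestrial", "Gas Giant", "Super Earth"],
    "Radius": [9.5, 2.4, 1.1, 8.7, 1.9],
    "Mass": [7.2, 3.1, 0.9, 6.5, 2.2],
})

# Keep only the columns to compare, then omit y so both are plotted.
numeric = planets.get(["Radius", "Mass"])
ax = numeric.plot(kind="hist", density=True, alpha=0.6, ec="w")

# One legend entry per plotted column.
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```

Both columns share the same bins here, so the two sets of bars line up on one x-axis.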