What is EDA about?
Understanding the data by:
Understanding the data dictionary
How to deal with missing data: dataframe.info() shows the non-null count for each column
Finding outliers
dataframe.describe()
Which features to keep and discard
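A minimal sketch of the missing-data and outlier checks listed above. The toy values are invented for illustration (only the column names echo the heart disease data), and the 1.5 * IQR rule is one common rule of thumb for flagging outliers, not something these notes prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are invented, not taken from the real dataset.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 57],
    "chol": [233.0, 250.0, np.nan, 236.0, 354.0, 192.0, 199.0, 564.0],
    "target": [1, 1, 1, 1, 0, 0, 1, 0],
})

df.info()                  # prints dtype and non-null count for each column
missing = df.isna().sum()  # number of missing values per column
summary = df.describe()    # count, mean, std, min, quartiles, max

# Assumed rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["chol"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["chol"] < q1 - 1.5 * iqr) | (df["chol"] > q3 + 1.5 * iqr)]

print(missing)
print(outliers)
```

Whether a flagged row is a data error or a genuine extreme value is a question for the data dictionary or the SMEs, which in turn informs which features to keep.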
Positive vs. Negative correlation
1.00 is a perfect positive correlation: a variable compared with itself (age vs. age) is 100% correlated. -1.00 is a perfect negative correlation. 0.00 means there is no linear correlation.
Comparing cp to target gives +0.43, so cp has a potential positive correlation with the target
As cp goes up, the target value also tends to go up: as cp increases, the target (has heart disease) increases
Comparing exang to target gives -0.44, so exang has a potential negative correlation with the target
As exang goes down, the target value tends to go up; as exang goes up, the target (has heart disease) goes down
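A small sketch of how the sign of a correlation can be checked for one pair of columns. The data below is invented so the signs are easy to see; it will not reproduce the +0.43 and -0.44 quoted above, which come from the real dataset.

```python
import pandas as pd

# Hypothetical toy data: invented values, chosen only to show the signs.
df = pd.DataFrame({
    "cp":     [0, 0, 1, 2, 2, 3, 0, 3],
    "exang":  [1, 1, 1, 0, 0, 0, 1, 0],
    "target": [0, 0, 1, 1, 1, 1, 0, 0],
})

# Series.corr computes the Pearson correlation by default.
r_cp = df["cp"].corr(df["target"])        # positive: higher cp goes with target = 1
r_exang = df["exang"].corr(df["target"])  # negative: higher exang goes with target = 0
r_self = df["cp"].corr(df["cp"])          # a column against itself is a perfect correlation

print(r_cp, r_exang, r_self)
```

A correlation only describes how two columns move together in this sample; it does not show that one causes the other.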
Explain ways to use EDA
dataframe.describe() …count, mean, stdDev, min, max
dataframe['target'].value_counts() will show you whether the data is balanced for the dependent variable
Finding questions to ask the SMEs
Using a correlation matrix dataframe.corr() to see how each variable is correlated to every other variable
Visualizing the correlation matrix as a heat map to see how variables relate to each other
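The checks above can be sketched together as follows. The toy data is invented for illustration, and sorting the target column of the matrix is just one convenient way to read it, not a required step.

```python
import pandas as pd

# Hypothetical toy data: invented values with the column names used in these notes.
df = pd.DataFrame({
    "cp":     [0, 0, 1, 2, 2, 3, 0, 3],
    "exang":  [1, 1, 1, 0, 0, 0, 1, 0],
    "target": [0, 0, 1, 1, 1, 1, 0, 1],
})

counts = df["target"].value_counts()                # rows per class of the dependent variable
shares = df["target"].value_counts(normalize=True)  # same, as proportions

corr = df.corr()  # every column against every other column

# Correlation of each feature with the target, most negative first.
target_corr = corr["target"].drop("target").sort_values()

print(counts)
print(shares)
print(target_corr)
```

For the heat map, one common option is seaborn's heatmap function called on corr with annot=True, which draws the matrix with the value printed in each cell.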
Explain what a crosstab is used for
To compare a feature variable against the target variable
A crosstab compares two variables and puts them in a matrix
pd.crosstab(dataframe['target'], dataframe.sex)
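A short sketch of the crosstab call on invented data. The 0/1 codes for sex and target are assumed for illustration; check the data dictionary for what they mean in the real dataset.

```python
import pandas as pd

# Hypothetical toy data: invented values, 0/1 codes assumed.
df = pd.DataFrame({
    "sex":    [1, 1, 0, 1, 0, 0, 1, 1],
    "target": [1, 0, 1, 0, 1, 1, 0, 1],
})

# Rows are target values, columns are sex values, cells are row counts.
table = pd.crosstab(df["target"], df["sex"])

# normalize="index" turns each row into proportions that sum to 1.
row_share = pd.crosstab(df["target"], df["sex"], normalize="index")

print(table)
print(row_share)
```

The raw counts show how many rows fall into each combination, while the normalized version makes groups of different sizes easier to compare.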