What are the six main steps of data wrangling?
Discovering
Structuring (uniform formats/units)
Cleaning (remove/replace missing/outlier values)
Enriching (adding or expanding)
Validating (verify data is correct)
Publishing (share dataset with others)
How can you return a range of characters from a string?
string[start:end]
Returns the characters at indexes start through end - 1 (start is inclusive, end is exclusive).
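A small sketch of the slicing rule, using a made-up string (the variable names are illustrative only):

```python
text = "data wrangling"

first_word = text[0:4]   # characters at indexes 0, 1, 2, 3
second_word = text[5:]   # omitting end runs to the end of the string
last_three = text[-3:]   # negative indexes count back from the end
```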
How can you return a dataframe transformed to a certain datatype?
df.astype()
How should the mean in a mean imputation be calculated?
Missing and outlier data are excluded from the computation of
the mean.
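One way this could look in plain Python. The sample values, the use of None for missing entries, and the 0 to 100 plausibility range used to flag outliers are all assumptions made for illustration:

```python
from statistics import mean

# Hypothetical sample: None marks a missing value, 999.0 is an obvious outlier
mpg = [21.0, None, 19.0, 999.0, 23.0]

def is_usable(value, low=0.0, high=100.0):
    """True when the value is present and inside the assumed plausible range."""
    return value is not None and low <= value <= high

# Compute the mean from usable values only
fill_value = mean(v for v in mpg if is_usable(v))

# Replace missing and outlier entries with that mean
imputed = [v if is_usable(v) else fill_value for v in mpg]
```

Including 999.0 in the mean would have pulled the fill value far away from the typical observations, which is why it is left out of the computation.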
How is Sqlite3 setup in Python?
import sqlite3
db = sqlite3.connect(<database>)
cursor = db.cursor()
How can you do an update query with Sqlite3?
query = "update playlists set Name ='MyMovie' where playlistID = 100;"
cursor.execute(query)
db.commit()
How can you read SQL into a pandas dataframe (after getting a db connection)?
df = pd.read_sql("SELECT * FROM playlists", con=db)
How can you do an insert query with Sqlite3?
cursor.execute("insert into playlists values (100, 'myMovie')")
db.commit()
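The setup, insert, and update cards can be tried end to end against an in-memory database. This is only a sketch: the two-column layout of playlists is an assumption, since the cards do not show the table definition.

```python
import sqlite3

# ":memory:" creates a throwaway database, so no file is needed
db = sqlite3.connect(":memory:")
cursor = db.cursor()

# Assumed schema for illustration
cursor.execute("create table playlists (PlaylistId integer primary key, Name text)")

# Insert a row, then commit so the change is saved
cursor.execute("insert into playlists values (100, 'myMovie')")
db.commit()

# Update the same row and commit again
cursor.execute("update playlists set Name = 'MyMovie' where PlaylistId = 100")
db.commit()

rows = cursor.execute("select PlaylistId, Name from playlists").fetchall()
db.close()
```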
When is standardization useful?
When outliers are present. Each value is expressed relative to the mean and standard deviation, so the result is not forced into a fixed range defined by the most extreme values.
Explain the difference between ETL and ELT processes.
Both are data integration methods. In Extract, Transform, Load (ETL), data is transformed before it is saved to the target database. In Extract, Load, Transform (ELT), the raw data is saved first and transformed afterward.
Describe the purpose of Feature Scaling and differentiate between Standardization and Normalization.
Algorithms can be sensitive to the range of data, so features are rescaled to comparable ranges. Standardization converts each feature to z-scores, with mean 0 and standard deviation 1. Normalization rescales each feature to the range [0, 1].
What is an overloaded feature and how should it be handled during data wrangling?
An overloaded feature stores several distinct values in a single column. Keep the original column, and place each individual value in its own new feature.
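A minimal sketch of unpacking an overloaded feature. The "location" column holding both city and state is a made-up example:

```python
# Hypothetical rows where "location" packs two values into one field
rows = [
    {"name": "Dana", "location": "Austin, TX"},
    {"name": "Lee", "location": "Boston, MA"},
]

unpacked = []
for row in rows:
    # Split the overloaded value into its parts
    city, state = [part.strip() for part in row["location"].split(",")]
    # Keep the original column and add one new feature per part
    unpacked.append({**row, "city": city, "state": state})
```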
What are the types of dirty data?
Outlier
Duplicate
Missing
What is the difference between Hot-deck and Cold-deck imputation?
Hot deck is replacement from the same dataset. Cold deck is imputation from a different dataset.
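The contrast might be sketched as follows. The records, the external lookup table, and the rule of picking the first donor from the same city are all assumptions for illustration; real hot-deck methods choose donors in various ways.

```python
# Hypothetical survey records; None marks a missing income
records = [
    {"id": 1, "city": "Austin", "income": 52000},
    {"id": 2, "city": "Austin", "income": None},
    {"id": 3, "city": "Boston", "income": 61000},
]

# Hypothetical external source (a different dataset), keyed by city
external_income = {"Austin": 50000, "Boston": 60000}

def hot_deck(rows):
    """Fill missing income from a donor row in the same dataset."""
    filled = []
    for row in rows:
        value = row["income"]
        if value is None:
            donors = [d["income"] for d in rows
                      if d["city"] == row["city"] and d["income"] is not None]
            value = donors[0] if donors else None
        filled.append({**row, "income": value})
    return filled

def cold_deck(rows, lookup):
    """Fill missing income from a different dataset."""
    return [
        {**row, "income": lookup.get(row["city"]) if row["income"] is None else row["income"]}
        for row in rows
    ]

hot = hot_deck(records)
cold = cold_deck(records, external_income)
```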
What type of data is best represented by a bar chart, and how does a relative frequency bar chart differ from a standard bar chart?
Categorical data. A standard bar chart shows the count in each category; a relative frequency bar chart shows each category's proportion (or percentage) of the total.
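The numbers behind the two chart types can be computed like this (the color list is invented sample data):

```python
from collections import Counter

# Hypothetical categorical observations
colors = ["red", "blue", "red", "green", "red", "blue", "red", "green"]

# Heights of a standard bar chart: raw counts per category
counts = Counter(colors)

# Heights of a relative frequency bar chart: share of the total per category
total = len(colors)
relative = {category: count / total for category, count in counts.items()}
```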
What is a high leverage point?
A point that is far from the others in the x y plane, and may or may not be an influential point.
What is an influential point?
A point that significantly affects the fitted model (and therefore its predictions).
What is a primary key?
A unique identifier in a SQL table.
What is a foreign key?
A column in a SQL table that refers to the primary key of another table. Its values don't have to be unique, and it is used to link rows in different tables.
What is a heatmap and when is it used?
A data visualization technique that uses color variations to represent the magnitude of values in a dataset.
What is a violin plot, when is it used, and how is it created in python?
A violin plot is a smoother version of a boxplot. It shows the distribution of the data.
sns.violinplot(x="Column", data=df)
What is MCAR?
Missing Completely at Random. The reason a value is missing is completely random and not influenced by any specific characteristic of the data.
What is MAR?
Missing at Random. The data is missing because of something relating to that data, but not specifically the same feature. The reason for the data being missing can be explained by other available information, making it possible to predict the missing values based on what's observed. For example, a survey where individuals who are not programmers are more likely to skip questions about programming languages.
What is MNAR?
Missing Not At Random. The reason for the missingness is related to the unobserved data. For example, a survey where people with more severe depression are less likely to respond to questions about their depression severity.
What is dodging?
Separating visualization elements based on a categorical value within a larger group. Usually done by changing a color hue.
What is EDA?
Exploratory Data Analysis means using visualizations to find patterns and relationships in the data.
What is faceting?
Dividing data into disjoint subsets, most often by the levels of a categorical variable, and plotting each subset separately.
Why is using query parameters with the execute() function helpful?
With query parameters, execute() binds user input as data values instead of inserting it into the SQL text, which prevents SQL injection.
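A short sketch comparing a parameterized query with string formatting. The table contents and the malicious input are made up for the demonstration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("create table playlists (PlaylistId integer primary key, Name text)")
cursor.executemany("insert into playlists values (?, ?)", [(1, "Music"), (2, "Movies")])
db.commit()

# Hypothetical hostile input that tries to widen the WHERE clause
user_input = "Music' OR '1'='1"

# Parameterized: the ? placeholder treats the whole input as one string value
safe_rows = cursor.execute(
    "select * from playlists where Name = ?", (user_input,)
).fetchall()

# String formatting: the input becomes part of the SQL and matches every row
unsafe_query = f"select * from playlists where Name = '{user_input}'"
unsafe_rows = cursor.execute(unsafe_query).fetchall()

db.close()
```

The parameterized query finds no playlist with that odd name, while the formatted query returns the whole table.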
How is seaborn usually imported?
import seaborn as sns
What is a Kernel Density Plot and how is it created?
Visually represents the distribution of data using a smooth curve, like a continuous histogram.
sns.kdeplot(x="Column", data=df, hue=<column>)
What is the formula for normalizing data?
(value - min) / (max - min)
What is the formula for standardizing data?
Z = (value - mean) / sd
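Both formulas can be applied by hand to a small list. The sample values are invented, and using the population standard deviation (pstdev) rather than the sample standard deviation is an assumption made here:

```python
from statistics import mean, pstdev

# Hypothetical feature values
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Normalization: (value - min) / (max - min), giving results in [0, 1]
low, high = min(values), max(values)
normalized = [(v - low) / (high - low) for v in values]

# Standardization: (value - mean) / sd, giving mean 0 and sd 1
mu = mean(values)
sd = pstdev(values)  # population standard deviation (an assumption)
standardized = [(v - mu) / sd for v in values]
```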
What are Tukey’s Fences?
Values more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1 are outliers (IQR = Q3 - Q1).
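A sketch of the fences on invented data. Quartiles can be computed in several ways; the "inclusive" method of statistics.quantiles is one choice assumed here, and other methods may shift the fences slightly.

```python
from statistics import quantiles

# Hypothetical sample with one suspiciously large value
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# Quartiles (Q1, median, Q3); the method is an assumption
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's fences sit 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
```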
What code standardizes dataset “mpg”?
preprocessing.scale(mpg)
What code normalizes dataset “mpg”?
preprocessing.MinMaxScaler().fit_transform(mpg)
What code removes duplicate rows from the “cars” dataset?
cars.drop_duplicates()
What code removes the “MPG” feature from the “cars” dataset?
cars.drop(axis=1, labels='MPG')
What code replaces null values with 20 in the “cars” dataset?
cars.fillna(value=20)
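The three cleaning calls can be tried on a tiny made-up "cars" dataframe (the columns and values are assumptions). Note that each call returns a new dataframe and leaves the original unchanged unless the result is assigned:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are duplicates, row 2 has missing values
cars = pd.DataFrame({
    "model": ["A", "A", "B"],
    "MPG": [30.0, 30.0, None],
    "hp": [100.0, 100.0, None],
})

deduped = cars.drop_duplicates()               # removes the repeated row
no_mpg = deduped.drop(axis=1, labels="MPG")    # axis=1 means drop a column
filled = no_mpg.fillna(value=20)               # nulls become 20
```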
What does cleaning data entail?
Discarding and imputing/replacing missing values
What does enriching data entail?
Deriving new features from existing features and appending data from external datasets.
What does structuring data entail?
Converting features to a uniform format. Also, scaling and unpacking overloaded features into new, simple features.
What does an inner join do?
Creates a new table containing only the rows that share a key value in the two base tables.
How can dataframe “cars1” be inner joined with “cars2” without using SQL?
cars1.merge(cars2, how='inner')
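A small sketch of the merge on invented dataframes. The "id" key column is an assumption; when on is omitted, merge joins on the columns the two dataframes have in common.

```python
import pandas as pd

# Hypothetical tables that share the key column "id"
cars1 = pd.DataFrame({"id": [1, 2, 3], "model": ["A", "B", "C"]})
cars2 = pd.DataFrame({"id": [2, 3, 4], "mpg": [30.0, 25.0, 40.0]})

# Inner join keeps only ids present in both dataframes
joined = cars1.merge(cars2, how="inner", on="id")
```

Here ids 1 and 4 are dropped because each appears in only one of the two dataframes.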
What is the difference between a bar chart and histogram?
A histogram splits continuous data into chunks, and shows the number of instances in each. A bar chart shows the number of instances of different categorical groups.
What code creates a bar chart?
sns.countplot(data=<df>, x=<column>)
What code creates a histogram?
sns.histplot(data=<df>, x=<column>)
What is a violin plot?
The density plot for the numerical feature is plotted for each of the categorical feature's categories. Each density plot is mirrored and plotted offset from the others. Useful for large datasets but does not display differences in the number of instances in each category.
Which pandas method should be used to detect unusual values of numerical features in a dataframe?
df.boxplot()
Which pandas method should be used to explore the shape of numerical features in a dataframe?
df.hist()
Which pandas method can be used to detect missing values in a dataframe?
df.info()
An accident in processing blood samples at the lab leads to missing data for those patients. What type of missing data is this?
MCAR. Since the accident does not select which patients lack those blood samples, this data is MCAR.
Customer satisfaction survey responses mostly fall in extremely positive or extremely negative categories. What type of missing data is this?
MNAR. Any analysis of the data will lack an understanding of customers who did not have an extreme opinion. Most customers likely belong in the silent middle ground category.
A subject missed an appointment due to illness for a study that includes information about the subject's general health. What type of missing data is this?
MAR. The data that would have been collected during the appointment about the subject's general health is related to the reason the data is missing. The illness is unlikely to have removed that single subject in a systematic fashion.
What are the main queries of SQL?
Insert, Update, Select, Delete
How many clauses are in the statement?
INSERT INTO Student
VALUES (888, 'Smith', 'Dana', 3.0);
2