2410 Final

56 Terms

New cards

What are the six main steps of data wrangling?

Discovering

Structuring (uniform formats/units)

Cleaning (remove/replace missing/outlier values)

Enriching (adding or expanding)

Validating (verify data is correct)

Publishing (share dataset with others)

New cards

How can you return a range of characters from a string?

string[start:end]

indexes from start inclusive to end - 1
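A minimal sketch of the slicing rule above. The string `text` and the variable names are made up for illustration.

```python
# Slice a string: start is inclusive, end is exclusive.
text = "data wrangling"  # hypothetical example string

first_word = text[0:4]   # characters at indexes 0, 1, 2, 3
rest = text[5:]          # omitting end runs to the end of the string
last_three = text[-3:]   # negative indexes count from the end

print(first_word)  # data
print(rest)        # wrangling
print(last_three)  # ing
```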

New cards

How can you return a dataframe transformed to a certain datatype?

df.astype()
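A short sketch of how astype might be used, assuming a small made-up dataframe whose numeric columns arrived as strings. The column names and target types are illustrative only.

```python
import pandas as pd

# Hypothetical dataframe whose numeric columns were read in as strings.
df = pd.DataFrame({"mpg": ["21.0", "22.8"], "cylinders": ["6", "4"]})

# astype returns a new dataframe; df itself is left unchanged.
# A dict maps each column name to the type it should be converted to.
converted = df.astype({"mpg": "float64", "cylinders": "int64"})

print(converted.dtypes)
```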

New cards

How should the mean in a mean imputation be calculated?

Missing and outlier data are excluded from the computation of the mean.
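One way this could look in plain Python. The data, the `None` marker for missing values, and the outlier rule (anything above 100) are assumptions made for this sketch, not part of the card.

```python
from statistics import mean

# Hypothetical readings: None marks a missing value, 400.0 is an assumed outlier.
mpg = [21.0, 22.8, None, 18.7, 400.0, 24.4]


def is_outlier(value):
    # Assumed rule for this sketch only: no value above 100 is plausible.
    return value > 100


# Compute the mean from valid values only (missing and outlier data excluded).
valid = [v for v in mpg if v is not None and not is_outlier(v)]
fill = mean(valid)

# Replace missing values and outliers with that mean.
imputed = [fill if v is None or is_outlier(v) else v for v in mpg]

print(fill)
print(imputed)
```

Including 400.0 in the mean would pull the fill value far above every valid reading, which is why it is excluded first.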

New cards

How is Sqlite3 set up in Python?

import sqlite3

db = sqlite3.connect(<database>)

cursor = db.cursor()

New cards

How can you do an update query with Sqlite3?

query = "update playlists set Name ='MyMovie' where playlistID = 100;"

cursor.execute(query)

db.commit()
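A self-contained sketch of the same update, assuming an in-memory database and a two-column playlists table. The schema is invented here so the example can run on its own.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
db = sqlite3.connect(":memory:")
cursor = db.cursor()

# Assumed two-column schema for the playlists table.
cursor.execute("create table playlists (playlistID integer primary key, Name text)")
cursor.execute("insert into playlists values (100, 'myMovie')")

cursor.execute("update playlists set Name = 'MyMovie' where playlistID = 100;")
updated = cursor.rowcount  # number of rows changed by the update
db.commit()                # commit makes the change permanent

rows = cursor.execute("select * from playlists").fetchall()
print(updated)
print(rows)
```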

New cards

How can you read SQL into a pandas dataframe (after getting a db connection)?

df = pd.read_sql("SELECT * FROM playlists", con=db)

New cards

How can you do an insert query with Sqlite3?

cursor.execute("insert into playlists values (100, 'myMovie')")

db.commit()

New cards

When is standardization useful?

When outliers are present. Standardized values are positioned relative to the mean and standard deviation rather than squeezed into a fixed range set by the minimum and maximum.

New cards

Explain the difference between ETL and ELT processes.

Both are data integration methods. Extract Transform Load means data is changed before saving to a database. Extract Load Transform means raw data is saved first.

New cards

Describe the purpose of Feature Scaling and differentiate between Standardization and Normalization.

Many algorithms are sensitive to the range of feature values. Standardization rescales a feature to mean 0 and standard deviation 1 (z-scores). Normalization rescales a feature to the range [0, 1].

New cards

What is an overloaded feature and how should it be handled during data wrangling?

Overloaded means multiple different values in one column. This original column should be kept, but individual values should be placed in new features.

New cards

What are the types of dirty data?

Outlier

Duplicate

Missing

New cards

What is the difference between Hot-deck and Cold-deck imputation?

Hot deck is replacement from the same dataset. Cold deck is imputation from a different dataset.

New cards

What type of data is best represented by a bar chart, and how does a relative frequency bar chart differ from a standard bar chart?

Categorical data. A relative frequency bar chart shows each category's share of the total (a proportion or percentage) instead of its raw count.

New cards

What is a high leverage point?

A point that is far from the others in the x-y plane, and may or may not be an influential point.

New cards

What is an influential point?

A point that significantly affects the fitted model (or its predictions).

New cards

What is a primary key?

A unique identifier in a SQL table.

New cards

What is a foreign key?

A column in a SQL table that refers to the primary key of another table. Its values don't have to be unique, and it is used to link rows in different tables.

New cards

What is a heatmap and when is it used?

A data visualization technique that uses color variations to represent the magnitude of values in a dataset.

New cards

What is a violin plot, when is it used, and how is it created in python?

A violin plot is a smoother version of a boxplot. It shows the distribution of the data.

sns.violinplot(x="Column", data=df)

New cards

What is MCAR?

Missing Completely at Random. The reason a value is missing is completely random and not influenced by any specific characteristic of the data.

New cards

What is MAR?

Missing at Random. Whether a value is missing depends on other observed features, not on the missing value itself. Because the missingness can be explained by available information, the missing values can be predicted from what is observed. For example, in a survey, individuals who are not programmers are more likely to skip questions about programming languages.

New cards

What is MNAR?

Missing Not At Random. The reason for the missingness is related to the unobserved data. For example, a survey where people with more severe depression are less likely to respond to questions about their depression severity.

New cards

What is dodging?

Separating visualization elements based on a categorical value within a larger group. Usually done by changing a color hue.

New cards

What is EDA?

Exploratory Data Analysis means using visualizations to find patterns and relationships in the data.

New cards

What is faceting?

Dividing data into disjoint subsets, most often by different levels of a categorical variable.

New cards

Why is using query parameters with the execute() function helpful?

Values passed as query parameters to execute() are bound as data rather than spliced into the SQL text, which prevents SQL injection.
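A small sketch contrasting string formatting with a `?` placeholder. The table contents and the `user_input` string are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("create table playlists (playlistID integer primary key, Name text)")
cursor.executemany(
    "insert into playlists values (?, ?)", [(1, "Music"), (2, "Movies")]
)

# Hypothetical malicious input that tries to change the meaning of the query.
user_input = "Music' OR '1'='1"

# Unsafe: the input becomes part of the SQL text, so the OR clause is executed.
unsafe_sql = f"select * from playlists where Name = '{user_input}'"
unsafe_rows = cursor.execute(unsafe_sql).fetchall()

# Safer: the input is passed as a parameter and compared as a plain string.
safe_rows = cursor.execute(
    "select * from playlists where Name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows))  # every row matches
print(safe_rows)         # no playlist has that literal name
```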

New cards

How is seaborn usually imported?

import seaborn as sns

New cards

What is a Kernel Density Plot and how is it created?

Visually represents the distribution of data using a smooth curve, like a continuous histogram.

sns.kdeplot(x="Column", data=df, hue=<column>)

New cards

What is the formula for normalizing data?

(value - min) / (max - min)
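The formula above applied to a made-up list of numbers, as a plain-Python sketch.

```python
# Hypothetical feature values.
values = [10.0, 20.0, 25.0, 50.0]

lo, hi = min(values), max(values)

# (value - min) / (max - min) maps the minimum to 0 and the maximum to 1.
# Note: this divides by zero if every value is identical.
normalized = [(v - lo) / (hi - lo) for v in values]

print(normalized)  # [0.0, 0.25, 0.375, 1.0]
```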

New cards

What is the formula for standardizing data?

Z = (value - mean) / sd
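The same formula as a plain-Python sketch on made-up numbers. Using the population standard deviation (`pstdev`) rather than the sample version is an assumption here; the card does not say which one is meant.

```python
from statistics import mean, pstdev

# Hypothetical feature values.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = mean(values)    # 5.0
sd = pstdev(values)  # population standard deviation (assumed choice)

# Z = (value - mean) / sd
z_scores = [(v - mu) / sd for v in values]

print(z_scores)
```

After standardizing, the z-scores have mean 0 and standard deviation 1.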

New cards

What are Tukey’s Fences?

Values more than 1.5*IQR above Q3 or below Q1 are outliers.
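A sketch of the rule using only the standard library. The data is made up, and the quartile method ("inclusive") is an assumption; other quartile conventions give slightly different fences.

```python
from statistics import quantiles

# Hypothetical data with one obviously extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(lower_fence, upper_fence)
print(outliers)
```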

New cards

What code standardizes dataset “mpg”?

preprocessing.scale(mpg)

New cards

What code normalizes dataset “mpg”?

preprocessing.MinMaxScaler().fit_transform(mpg)

New cards

What code removes duplicate rows from the “cars” dataset?

cars.drop_duplicates()

New cards

What code removes the “MPG” feature from the “cars” dataset?

cars.drop(axis=1, labels='MPG')

New cards

What code replaces null values with 20 in the “cars” dataset?

cars.fillna(value=20)
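The three calls from the cards above, chained on a small made-up `cars` dataframe. The columns and values are hypothetical. Each method returns a new dataframe, so the result has to be assigned to keep it.

```python
import pandas as pd

# Hypothetical dataset with one duplicate row and two missing values.
cars = pd.DataFrame({
    "Model": ["A", "A", "B", "C"],
    "MPG": [21.0, 21.0, None, 30.0],
    "Weight": [2500.0, 2500.0, 3000.0, None],
})

deduped = cars.drop_duplicates()              # drops the repeated "A" row
no_mpg = deduped.drop(axis=1, labels="MPG")   # axis=1 means drop a column
filled = no_mpg.fillna(value=20)              # remaining nulls become 20

print(cars.shape)  # the original dataframe is unchanged
print(filled)
```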

New cards

What does cleaning data entail?

Discarding and imputing/replacing missing values

New cards

What does enriching data entail?

Deriving new features from existing features and appending data from external datasets.

New cards

What does structuring data entail?

Converting features to a uniform format. Also, scaling and unpacking overloaded features into new, simple features.

New cards

What does an inner join do?

Creates a new table with all rows that share a key value in the 2 base tables.
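A small sqlite3 sketch of an inner join. The tables `cars1` and `cars2`, their columns, and their rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
cursor = db.cursor()

# Two hypothetical tables that share the key column "id".
cursor.execute("create table cars1 (id integer primary key, model text)")
cursor.execute("create table cars2 (id integer primary key, mpg real)")
cursor.executemany("insert into cars1 values (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
cursor.executemany("insert into cars2 values (?, ?)", [(2, 22.8), (3, 18.7), (4, 30.0)])

# Only ids present in both tables (2 and 3) survive the inner join.
rows = cursor.execute(
    "select cars1.id, model, mpg "
    "from cars1 inner join cars2 on cars1.id = cars2.id "
    "order by cars1.id"
).fetchall()

print(rows)  # [(2, 'B', 22.8), (3, 'C', 18.7)]
```

Ids 1 and 4 each appear in only one table, so they are left out of the result.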

New cards

How can dataframe “cars1” be inner joined with “cars2” without using SQL?

cars1.merge(cars2, how='inner')

New cards

What is the difference between a bar chart and histogram?

A histogram splits continuous data into chunks, and shows the number of instances in each. A bar chart shows the number of instances of different categorical groups.

New cards

What code creates a bar chart?

sns.countplot(data=<df>, x=<column>)

New cards

What code creates a histogram?

sns.histplot(data=<df>, x=<column>)

New cards

What is a violin plot?

The density plot for the numerical feature is plotted for each of the categorical feature's categories. Each density plot is mirrored and plotted offset from the others. Useful for large datasets but does not display differences in the number of instances in each category.

New cards

Which pandas method should be used to detect unusual values of numerical features in a dataframe?

df.boxplot()

New cards

Which pandas method should be used to explore the shape of numerical features in a dataframe?

df.hist()

New cards

Which pandas method can be used to detect missing values in a dataframe?

df.info()

New cards

An accident in processing blood samples at the lab leads to missing data for those patients. What type of missing data is this?

MCAR. Since the accident does not select which patients lack those blood samples, this data is MCAR.

New cards

Customer satisfaction survey responses mostly fall in extremely positive or extremely negative categories. What type of missing data is this?

MNAR. Any analysis of the data will lack an understanding of customers who did not have an extreme opinion. Most customers likely belong in the silent middle ground category.

New cards

A subject missed an appointment due to illness for a study that includes information about the subject's general health. What type of missing data is this?

MAR. The data that would have been collected during the appointment about the subject's general health is related to the reason the data is missing. The illness is unlikely to have removed that single subject in a systematic fashion.

New cards

What are the main queries of SQL?

Insert, Update, Select, Delete

New cards

How many clauses are in the statement?

INSERT INTO Student

VALUES (888, 'Smith', 'Dana', 3.0);

New cards