Flashcards covering key concepts from data analysis lecture notes.
What are the criteria to assess knowledge in data analysis?
Correctness, generality, usefulness, comprehensibility, novelty
What is the goal of descriptive analytics?
Increased understanding of the data we have
What is the goal of predictive analytics?
Making predictions for unknown objects or future data
What does machine learning mean?
Teaching a computer to learn from data without being given exact instructions
What does modeling mean in the context of a modeling framework?
Build the best model possible from prior knowledge and data
What is supervised learning?
Learn from labeled data to predict an outcome based on past examples
What are the two main types of supervised learning?
Classification and regression
What does classification do?
Predict a category (discrete result)
What does regression do?
Predict a number (continuous result)
What is unsupervised learning?
Learn from unlabeled data to find patterns or structures
What are some common techniques in unsupervised learning?
Clustering analysis, association analysis, dimensionality reduction
What does dimensionality reduction do?
Simplify data (make fewer features) while keeping important information
What is reinforcement learning?
Learn by trial and error, obtaining rewards or penalties for actions
What is overfitting?
Model learns random noise, not real patterns
What is a key warning about causality?
Correlation does not equal causation
What does CRISP-DM stand for?
Cross Industry Standard Process for Data Mining
What are the main phases of the CRISP-DM model?
Project understanding, data understanding, data preparation, modeling, evaluation, deployment
What is the goal of the project understanding phase?
To understand the real-world problem and business/research needs
What is the goal of the data understanding phase?
To get to know the data better and check if it’s good enough to solve the problem
What is the goal of the data preparation phase?
Transform the raw data into the final, clean dataset for analysis
What is the goal of the modeling phase?
Choose and apply the right modeling techniques to solve the problem
What is the goal of the evaluation phase?
Check if the models meet the objectives and are ready for real-life situations
What is the goal of the deployment phase?
Put the model to use in the real world
What is the purpose of determining the project objective?
To define the aim of the project and criteria to measure success
What are the components of cognitive maps?
Nodes (variables) and arrows (direction and type of influence)
What elements are involved in assessing the situation of a data analytics project?
Requirements & constraints, assumptions (representativeness, data quality, external factors)
What is the main purpose when determining the analysis goal?
Take the overall project goal and turn it into specific measurable tasks
What model requirements should be considered when defining success during a data analysis project?
Accuracy, flexibility, interpretability, runtime, expert knowledge
Analysis goal
Objective, technical tasks, model requirements
What are the goals of data understanding?
Gain general insight on the data and check assumptions
What preprocessing steps are needed before building a data matrix?
Converting non-numerical data into numbers and handling missing values
What are the types of attributes in a data matrix?
Categorical (nominal), ordinal, numerical (discrete, continuous)
What are the scales for numerical attributes?
Interval scale, ratio scale, absolute scale
How should ordinal attributes be encoded?
Using ranking numbers
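A minimal sketch of rank encoding for an ordinal attribute. The attribute (`SIZE_ORDER`) and its labels are hypothetical, chosen only for illustration.

```python
# Hypothetical ordinal attribute; the order of the labels is an assumption.
SIZE_ORDER = ["small", "medium", "large"]

def encode_ordinal(values, order):
    """Map each ordinal label to its rank (1 = lowest) in `order`."""
    rank = {label: i + 1 for i, label in enumerate(order)}
    return [rank[v] for v in values]

encoded = encode_ordinal(["medium", "small", "large", "medium"], SIZE_ORDER)
print(encoded)  # [2, 1, 3, 2]
```

The ranks preserve the order of the labels, but the distances between ranks carry no meaning.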
Why does data quality matter?
Poor data quality leads to bad analysis results
What are the two types of data quality checks?
Syntactic and semantic accuracy
What is syntactic accuracy?
About making sure the format of the data is correct
What is semantic accuracy?
About whether the data makes sense in its context
When is completeness violated in data quality?
An entry is missing
What are types of missing values?
MCAR, MAR, Nonignorable
What does MCAR stand for?
Missing completely at random
What does MAR stand for?
Missing at random
What is the purpose of data visualization?
To understand the data and spot quality issues early
What is an outlier?
A value or data object that is very different from the rest
Data Preparation
Select attributes, reduce dimensions, select records, handle missing values, handle outliers, integrate and transform, improve data quality
Feature extraction
Creating new, smarter features from raw data to make learning easier
What is automatic feature extraction?
PCA and other dimensionality reduction methods
What are the reasons to remove features in feature selection?
Irrelevance, too many missing or bad values, the same value in every row, redundancy
What are reasons to select only useful rows of data using record selection?
Time-lines, representativeness, rare events
What is data cleansing?
Fix or remove bad data, also called data scrubbing
What are the three ways to handle missing values?
Ignore/delete, imputation, explicit value
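The three options can be sketched on a single column as follows. The `ages` data and the `"MISSING"` marker are hypothetical, and mean imputation is only one of several possible imputation choices.

```python
from statistics import mean

# Hypothetical column; None marks a missing entry.
ages = [23, None, 31, 27, None]

# 1. Ignore/delete: drop the records with a missing entry.
deleted = [a for a in ages if a is not None]

# 2. Imputation: replace missing entries with the mean of the observed values
#    (an assumed choice; median or model-based imputation are alternatives).
observed_mean = mean(deleted)
imputed = [a if a is not None else observed_mean for a in ages]

# 3. Explicit value: keep the record and mark the gap with a dedicated value.
explicit = [a if a is not None else "MISSING" for a in ages]

print(deleted)
print(imputed)
print(explicit)
```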
What are types of numeric discretization?
Equi-width, equi-frequency, V-optimal, minimal entropy
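A rough sketch of the first two discretization types, assuming a numeric column with at least two distinct values. The function names are illustrative, not taken from any library.

```python
def equi_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum would land in bin k, so it is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equi_frequency_bins(values, k):
    """Assign bins so that each holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 100]
width_bins = equi_width_bins(values, 3)
freq_bins = equi_frequency_bins(values, 3)
print(width_bins)  # [0, 0, 0, 0, 0, 2]
print(freq_bins)   # [0, 0, 1, 1, 2, 2]
```

With a skewed column like this one, equi-width leaves the middle bin empty, while equi-frequency keeps the bins balanced.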
What are common data integration issues?
Unifying structure, missing values in one source but not in others, duplicates, joins, map overlay
What is min-max normalization?
Scale values to the range [0,1]
What is z-score standardization?
Center around 0 with standard deviation 1
What is involved in centering the data matrix?
Subtract the mean of each attribute from all its values
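The three scaling cards above (min-max normalization, z-score standardization, centering) can be sketched for a single attribute as follows. The `heights` data is hypothetical, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption.

```python
from statistics import mean, pstdev

def min_max(values):
    """Scale values linearly to the range [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def center(values):
    """Subtract the attribute mean from every value."""
    m = mean(values)
    return [v - m for v in values]

def z_score(values):
    """Center around 0 and rescale to standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

heights = [150, 160, 170, 180, 190]  # hypothetical attribute
scaled = min_max(heights)
centered = center(heights)
standardized = z_score(heights)
print(scaled)    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(centered)  # [-20, -10, 0, 10, 20]
print(standardized)
```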
What is another method to visualize high-dimensional data (other than 2D/3D)?
Parallel coordinates
What is a specific type of linear mapping dimensionality reduction method?
Principal component analysis
Why is normalization needed before PCA?
PCA is sensitive to the units (scale) of the attributes
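PCA looks for directions of maximal variance, so an attribute measured in large units can dominate. The sketch below does not run PCA itself; it only compares each attribute's share of the total variance before and after z-score standardization. The data and the choice of units are hypothetical.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical attributes in very different units (metres vs. grams).
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weight_g = [50000, 62000, 58000, 80000, 75000]

def variance_share(columns):
    """Fraction of the total variance contributed by each attribute."""
    variances = [pvariance(c) for c in columns]
    total = sum(variances)
    return [v / total for v in variances]

def standardize(column):
    """Z-score standardization of one attribute."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]

raw_share = variance_share([height_m, weight_g])
std_share = variance_share([standardize(height_m), standardize(weight_g)])
print(raw_share)  # weight in grams accounts for almost all of the variance
print(std_share)  # after standardization both attributes contribute equally
```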
Checklist for Data Understanding
Define goals, assess data quality, understand distributions, inspect relationships, check temporal or group differences, check representativeness
data mining
Extracting new information from medical records
When is it appropriate to learn an input -> output mapping from data?
When a pattern exists but is too complex to write as a formula, so we use data to learn it
Unsupervised learning
Find patterns or structure in unlabeled data
What do validation tests do?
Check whether the model also works on new data; it might work great on training data but fail on new data
What does overfitting mean?
The model is too complex: it learns the training data too well, even memorizing random noise, and fails on new data
What's one of the first steps in finding the best model in machine learning?
Figure out what kind of problem you are solving
K-nearest neighbor
One of the simplest machine learning methods; it makes predictions by looking at the most similar data points in the training set
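A minimal sketch of a k-nearest-neighbor classifier, assuming Euclidean distance and majority voting. The data points and the function name `knn_predict` are illustrative only.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to `query`."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),  # Euclidean distance
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, (1.1, 1.0), k=3)
print(prediction)  # a
```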
Methods to determine best 'k' value
Cross validation to determine/tune how many neighbours to evaluate
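One way to tune k is sketched below, assuming leave-one-out cross-validation (a special case of cross-validation) and a tiny hypothetical dataset. The helper names are illustrative.

```python
from collections import Counter
from math import dist

def knn_predict(points, labels, query, k):
    """Majority label among the k points closest to `query`."""
    nearest = sorted(range(len(points)), key=lambda i: dist(points[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, points[i], k) == labels[i]:
            hits += 1
    return hits / len(points)

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

scores = {k: loo_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On this toy data, k = 5 fails because every held-out point is outvoted by the other group, which illustrates why k has to be tuned rather than guessed.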
Ridge regression
Solves the overfitting problem in linear regression; especially useful when you have too many features (high-dimensional data), your features are correlated, or you have more features than data points
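To illustrate the shrinkage effect, here is a sketch of ridge regression in its simplest setting: one feature and no intercept, where the penalized least-squares slope has the closed form sum(x*y) / (sum(x*x) + lambda). The data and the penalty value are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - w*x)^2) + lam * w^2 for a single feature."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

ols_slope = ridge_slope(xs, ys, lam=0.0)       # lam = 0 is ordinary least squares
shrunk_slope = ridge_slope(xs, ys, lam=10.0)   # the penalty pulls the slope toward 0
print(ols_slope, shrunk_slope)
```

A larger penalty gives a smaller coefficient, which is how ridge regression limits model complexity.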
support vector machine - SVM
Maximizing margin between classes in binary classification tasks
Cluster
Group similar data points based on features
What does hierarchical clustering do?
Builds a tree of clusters
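A small sketch of how such a tree can be built bottom-up, assuming agglomerative clustering with single linkage on one-dimensional data. The card does not specify a linkage or direction, so both are assumptions. The list of merges is the information a dendrogram would draw.

```python
def single_linkage_merges(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = single_linkage_merges([1.0, 1.5, 5.0, 5.2, 9.0])
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 2))
```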
DNA Microarrays
A grid or chip with thousands of DNA spots; each spot represents a gene, showing how active each gene is under different conditions
What is anomaly detection?
Spotting things that are weird or different in a dataset
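One simple rule, sketched here as an assumption rather than the method from the notes, flags values whose z-score exceeds a threshold. The readings and the threshold of 2 are hypothetical.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 50]  # hypothetical measurements
outliers = z_score_outliers(readings)
print(outliers)  # [50]
```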
Association rule analysis
Finds interesting relationships or patterns between items in large datasets, focusing on associations between items or features that frequently occur together
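"Frequently occur together" is usually quantified with support and confidence, two standard measures that the card itself does not name. A minimal sketch on hypothetical shopping baskets:

```python
# Hypothetical transactions (shopping baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
print(rule_support)     # 0.5
print(rule_confidence)  # about 0.67 for the rule bread -> milk
```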