Data Analysis

Description and Tags

Flashcards covering key concepts from data analysis lecture notes.

75 Terms

1

What are the criteria to assess knowledge in data analysis?

Correctness, generality, usefulness, comprehensibility, novelty

2

What is the goal of descriptive analytics?

Increased understanding of the data we have

3

What is the goal of predictive analytics?

Making predictions for unknown objects or future data

4

What does machine learning mean?

Teaching a computer to learn from data without being given exact instructions

5

What does modeling mean in the context of a modeling framework?

Build the best model possible from prior knowledge and data

6

What is supervised learning?

Learn from labeled data to predict an outcome based on past examples

7

What are the two main types of supervised learning?

Classification and regression

8

What does classification do?

Predict a category (discrete result)

9

What does regression do?

Predict a number (continuous result)

10

What is unsupervised learning?

Learn from unlabeled data to find patterns or structures

11

What are some common techniques in unsupervised learning?

Clustering analysis, association analysis, dimensionality reduction

12

What does dimensionality reduction do?

Simplify data (make fewer features) while keeping important information

13

What is reinforcement learning?

Learn by trial and error, obtaining rewards or penalties for actions

14

What is overfitting?

Model learns random noise, not real patterns

15

What is a key warning about causality?

Correlation does not equal causation

16

What does CRISP-DM stand for?

Cross Industry Standard Process for Data Mining

17

What are the main phases of the CRISP-DM model?

Project understanding, data understanding, data preparation, modeling, evaluation, deployment

18

What is the goal of the project understanding phase?

To understand the real-world problem and business/research needs

19

What is the goal of the data understanding phase?

To get to know the data better and check if it’s good enough to solve the problem

20

What is the goal of the data preparation phase?

Transform the raw data into the final, clean dataset for analysis

21

What is the goal of the modeling phase?

Choose and apply the right modeling techniques to solve the problem

22

What is the goal of the evaluation phase?

Check if the models meet the objectives and are ready for real-life situations

23

What is the goal of the deployment phase?

Put the model to use in the real world

24

What is the purpose of determining the project objective?

To define the aim of the project and criteria to measure success

25

What are the components of cognitive maps?

Nodes (variables) and arrows (direction and type of influence)

26

What elements are involved in assessing the situation of a data analytics project?

Requirements & constraints, assumptions (representativeness, data quality, external factors)

27

What is the main purpose when determining the analysis goal?

Take the overall project goal and turn it into specific measurable tasks

28

What model requirements should be considered when defining success during a data analysis project?

Accuracy, flexibility, interpretability, runtime, expert knowledge

29

Analysis goal

Objective, technical tasks, model requirements

30

What are the goals of data understanding?

Gain general insight on the data and check assumptions

31

What preprocessing steps are needed before building a data matrix?

Converting non-numerical data into numbers and handling missing values

32

What are the types of attributes in a data matrix?

Categorical (nominal), ordinal, numerical (discrete, continuous)

33

What are the scales for numerical attributes?

Interval scale, ratio scale, absolute scale

34

How should ordinal attributes be encoded?

Using ranking numbers
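
As a minimal sketch of ranking-number encoding: the attribute levels and their order below are hypothetical, chosen only for illustration. Each label is replaced by its position in the ordered list of levels.

```python
# Hypothetical ordinal attribute: the level names and their order are assumed.
LEVELS = ["low", "medium", "high"]

# Rank each level by its position in the ordered list (1 = lowest).
RANK = {level: i + 1 for i, level in enumerate(LEVELS)}


def encode_ordinal(values):
    """Replace each ordinal label with its ranking number."""
    return [RANK[v] for v in values]


encoded = encode_ordinal(["medium", "low", "high", "medium"])
print(encoded)
```

The ranks preserve order only; the gaps between ranks carry no meaning, which is what separates ordinal from numerical attributes.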

35

Why does data quality matter?

Poor data quality leads to bad analysis results

36

What are the two types of data quality checks?

Syntactic and semantic accuracy

37

What is syntactic accuracy?

Whether the format of the data is correct

38

What is semantic accuracy?

Whether the data makes sense in its context

39

When is completeness violated in data quality?

An entry is missing

40

What are types of missing values?

MCAR, MAR, Nonignorable

41

What does MCAR stand for?

Missing completely at random

42

What does MAR stand for?

Missing at random

43

What is the purpose of data visualization?

To understand the data and spot quality issues early

44

What is an outlier?

A value or data object that is very different from the rest
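
One common way to make "very different from the rest" concrete is the 1.5 x IQR rule. The card does not name a specific rule, so the rule and the sample values below are assumptions for illustration.

```python
from statistics import quantiles

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 95]

# Quartiles of the data; quantiles(n=4) returns [Q1, median, Q3].
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed rule: flag values more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < lower or v > upper]
print(outliers)
```

Whether a flagged value is an error or a genuine rare event still has to be judged in context.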

45

Data Preparation

Select attributes, reduce dimensions, select records, handle missing values, handle outliers, integrate and transform, improve data quality

46

Feature extraction

Creating new, smarter features from raw data to make learning easier

47

What is automatic feature extraction?

Deriving new features algorithmically, for example with PCA and other dimensionality reduction methods

48

What are the reasons to remove features in feature selection?

Irrelevance, too many missing/bad values, same value for all rows, redundancy

49

What are reasons to select only useful rows of data using record selection?

Timeliness, representativeness, rare events

50

What is data cleansing?

Fix or remove bad data, also called data scrubbing

51

What are the three ways to handle missing values?

Ignore/delete, imputation, explicit value
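
The three options can be sketched on a single column. The data is hypothetical, and mean imputation is only one possible imputation choice, assumed here for simplicity.

```python
from statistics import mean

# Hypothetical column of ages; None marks a missing entry.
ages = [23, None, 31, 40, None, 26]

# 1) Ignore/delete: drop the records with a missing value.
deleted = [a for a in ages if a is not None]

# 2) Imputation: fill gaps with an estimate, here the mean of observed values.
fill = mean(deleted)
imputed = [a if a is not None else fill for a in ages]

# 3) Explicit value: keep the gap but mark it with a dedicated placeholder.
explicit = [a if a is not None else "MISSING" for a in ages]

print(deleted, imputed, explicit, sep="\n")
```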

52

What are types of numeric discretization?

Equi-width, equi-frequency, V-optimal, minimal entropy
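
A rough sketch of the first two schemes, with hypothetical function names and data. Equi-width splits the value range into intervals of equal length; equi-frequency puts roughly the same number of values into each bin. The sketch assumes the values are not all identical and ignores ties.

```python
def equi_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equi_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equally many values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


data = [1, 2, 3, 4, 5, 100]
print(equi_width_bins(data, 3))
print(equi_frequency_bins(data, 3))
```

With a skewed sample like this one, equi-width crowds almost everything into the first bin, while equi-frequency keeps the bins balanced.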

53

What are common data integration issues?

Unifying structure, missing values in one source but not in others, duplicates, joins, map overlay

54

What is min-max normalization?

Scale values to the range [0,1]
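
A minimal sketch using the usual formula x' = (x - min) / (max - min); the function name and sample values are illustrative.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; a constant attribute would divide by zero.
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([10, 20, 30, 50])
print(scaled)
```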

55

What is z-score standardization?

Center around 0 with standard deviation 1
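
A minimal sketch using z = (x - mean) / standard deviation. The population standard deviation is assumed here; the sample standard deviation is an equally common choice.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]


z = z_score([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```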

56

What is involved in centering the data matrix?

Subtract the mean of each attribute from all its values
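
A minimal sketch, assuming the data matrix is stored as a list of rows (records) with one column per attribute; the values are hypothetical.

```python
def center_columns(matrix):
    """Subtract each column's mean from every entry in that column."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


# Hypothetical data matrix: 3 records, 2 attributes.
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 60.0]]

centered = center_columns(X)
print(centered)
```

After centering, every column sums to zero.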

57

What is another method to visualize high-dimensional data (other than 2D/3D plots)?

Parallel coordinates

58

What is a specific type of linear mapping dimensionality reduction method?

Principal component analysis

59

Why is normalization needed for PCA?

PCA is sensitive to the units (scales) of the attributes
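
The unit sensitivity can be illustrated with two attributes, where the first principal component has a closed form. The data and function names are hypothetical. On raw values the component points almost entirely along the attribute with the larger numbers; after standardization both attributes contribute equally.

```python
import math
from statistics import mean, pstdev


def first_pc_2d(xs, ys):
    """Direction of the first principal component for two attributes."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    cxx = sum((x - mx) ** 2 for x in xs) / n
    cyy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Closed form for the leading eigenvector of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return (math.cos(theta), math.sin(theta))


def standardize(values):
    """Z-score standardization with the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical data: height in metres, weight in grams.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_g = [55000, 72000, 68000, 80000, 90000]

raw_pc = first_pc_2d(height_m, weight_g)
std_pc = first_pc_2d(standardize(height_m), standardize(weight_g))
print(raw_pc, std_pc)
```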

60

Checklist for Data Understanding

Define goals, assess data quality, understand distributions, inspect relationships, check temporal or group differences, check representativeness

61

Data mining

Extracting new information from large datasets, for example from medical records

62

When do we learn an input -> output mapping from data?

When a pattern exists but is too complex to write as a formula, so we use data to learn it

63

Unsupervised learning

Find patterns or structure from unlabeled data

64

What do validation tests do?

Check whether a model that works well on training data also works on new data, since it might fail there

65

What does overfitting mean?

The model is too complex: it learns the training data too well, even memorizing random noise, and fails on new data

66

What's one of the first steps in finding the best model in machine learning?

Figure out what kind of problem you are solving

67

K-nearest neighbor

One of the simplest machine learning methods; it makes predictions by looking at the most similar data points in the training set
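
A minimal k-nearest-neighbor classifier, sketched under two assumptions the card does not state: similarity is measured with Euclidean distance, and the prediction is a majority vote. The training set is a hypothetical toy example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set with two classes.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

prediction = knn_predict(train, (2, 2), k=3)
print(prediction)
```

There is no training step: the method stores the data and does all its work at prediction time.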

68

Methods to determine best 'k' value

Cross-validation, used to tune how many neighbours to evaluate
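
As a sketch of tuning k, the following uses leave-one-out cross-validation, one simple variant among several. The toy data, candidate values of k, and helper names are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out cross-validation: predict each point from all the others."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)


# Hypothetical toy dataset with two well-separated classes.
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
        ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"), ((7, 7), "B")]

# Score each candidate k and keep one with the highest accuracy.
scores = {k: loo_accuracy(data, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Here k = 7 fails because the held-out point is outvoted by the other class, showing that too large a k can hurt as much as too small a one.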

69

Ridge regression

Addresses the overfitting problem in linear regression; especially useful when you have many features (high-dimensional data), correlated features, or more features than data points
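
Ridge regression adds a penalty on the squared size of the coefficients to the least-squares objective. As a minimal sketch, assume a single feature and no intercept; minimizing sum((y - w*x)^2) + lam * w^2 then gives w = sum(x*y) / (sum(x^2) + lam). The function name and data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge coefficient for one feature and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

ols = ridge_slope(xs, ys, lam=0.0)     # lam = 0 is ordinary least squares
ridge = ridge_slope(xs, ys, lam=14.0)  # a larger lam shrinks the coefficient
print(ols, ridge)
```

The penalty strength lam is a tuning parameter, typically chosen by cross-validation.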

70

Support vector machine (SVM)

A classifier that maximizes the margin between classes in binary classification tasks

71

Cluster

A group of similar data points, based on their features

72

What does hierarchical clustering do?

Builds a tree of clusters
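
One way to build such a tree is bottom-up (agglomerative) clustering: start with every point as its own cluster and repeatedly merge the two closest clusters. The sketch below assumes single linkage (distance between the closest members) on one-dimensional points; the card does not specify either choice.

```python
def single_linkage(points):
    """Agglomerative clustering on 1-D points; returns the merges in order."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


merges = single_linkage([1, 2, 6, 7, 20])
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

The sequence of merges, read from first to last, is the tree (dendrogram) from the leaves up to the root.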

73

DNA Microarrays

A grid or chip with thousands of DNA spots; each spot represents a gene, showing how active each gene is under different conditions

74

What is anomaly detection?

Spotting things that are weird or different in a dataset

75

Association rule analysis

Finding interesting relationships or patterns between items in large datasets, focusing on items or features that frequently occur together
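
"Frequently occur together" is usually quantified with support and confidence. These two measures are standard but are not named on the card, and the transactions below are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the share that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)


# Rule "bread -> milk".
print(support({"bread", "milk"}))
print(confidence({"bread"}, {"milk"}))
```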