engineering data analysis

34 Terms

1

Identify and create the appropriate dataset

Perform computation to learn the required rules, patterns, and relations

Output the decision

What is machine learning?

2

Supervised learning

In _____, we need a labelled training dataset.

3

Supervised learning

Given a labelled dataset, the task is to devise a function that takes the dataset and a new sample and produces an output value.

4

Classification

If the possible output values of the function are predefined and discrete/categorical, it is called _________

5

Predefined classes

_______ _______ means that the function will produce output only from the labels defined in the dataset. For example, even if we input a bus, it will output either CAR or BIKE.
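
To make the idea of predefined classes concrete, here is a minimal sketch that uses a one-nearest-neighbour rule as the learned function. The features (number of wheels, number of seats), the tiny dataset, and the function name `classify` are illustrative assumptions and do not come from the cards.

```python
import math

# Hypothetical labelled training dataset: (wheels, seats) -> label
training_data = [
    ((4, 5), "CAR"),
    ((4, 4), "CAR"),
    ((2, 1), "BIKE"),
    ((2, 2), "BIKE"),
]

def classify(dataset, sample):
    """Return the label of the training example closest to `sample`."""
    _, nearest_label = min(
        dataset, key=lambda item: math.dist(item[0], sample)
    )
    return nearest_label

# A bus is not one of the predefined classes, so the function can only
# answer with a label that already exists in the dataset.
bus = (6, 40)
prediction = classify(training_data, bus)
```

Whatever sample is passed in, `prediction` can only ever be one of the labels present in `training_data`.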

6

Regression

If the possible output values of the function are continuous real values, then it is called _______
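
As a contrast with classification, the sketch below fits a straight line by ordinary least squares, so the output is a continuous real value. The dataset (hours studied against exam score) and the name `fit_line` are assumptions made for illustration only.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labelled dataset: hours studied -> exam score
data = [(1.0, 52.0), (2.0, 54.0), (3.0, 56.0), (4.0, 58.0)]
slope, intercept = fit_line(data)

# The prediction for a new sample is a real number, not a class label.
predicted_score = slope * 2.5 + intercept
```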

7

Supervised

Classification and regression problems are both _______ learning problems.

8

Experience

The characteristics of the ground truth labels or values present in the dataset; this is what the learner learns from.

9

Unsupervised learning

In _____, we do not need to know the labels or ground truth values.

10

Reinforcement Learning

Learning from trial and error

11

Classification

is supervised learning from examples.

12

Unsupervised learning

Class labels of the data are not given or unknown

13

Unsupervised learning

Given a set of data, the task is to establish the existence of classes or clusters in the data
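
One way to look for clusters without labels is k-means. The cards do not name a specific algorithm, so treat the following as an illustrative sketch: the function name `k_means_1d`, the one-dimensional data, and the starting centres are all assumptions.

```python
def k_means_1d(values, centres, iterations=10):
    """Very small k-means for 1-D data with caller-supplied initial centres."""
    centres = list(centres)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            idx = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[idx].append(v)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres, clusters

# Unlabelled data with two visible groups
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centres, clusters = k_means_1d(values, centres=[0.0, 5.0])
```

No class labels are supplied anywhere; the two groups emerge from the distances between the values alone.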

14

Decision tree learning

is one of the most widely used techniques for classification.

15

Ross Quinlan

C4.5 by ______ ______ is perhaps the best-known system. It can be downloaded from the Web.

16

Information theory

provides a mathematical basis for measuring the information content.

17
[Image: entropy formula]

The entropy formula
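
The formula on this card is an image. Assuming the standard definition used in decision tree learning, where C is the set of classes and Pr(c_j) is the proportion of examples in data set D that belong to class c_j (these symbols are assumptions, since the image itself is not available), the entropy can be written as:

```latex
\mathrm{entropy}(D) = -\sum_{j=1}^{|C|} \Pr(c_j)\,\log_2 \Pr(c_j),
\qquad \sum_{j=1}^{|C|} \Pr(c_j) = 1
```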

18

entropy

We use ________ as a measure of the impurity, disorder, or uncertainty of a data set D (or as a measure of information in a tree).
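
A short sketch of how this impurity measure behaves, assuming the standard base-2 definition over class proportions. The function name `entropy` and the example labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class distribution in a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A pure data set (one class only) has no uncertainty.
pure = entropy(["yes"] * 6)

# An evenly mixed two-class data set is maximally uncertain (1 bit).
mixed = entropy(["yes"] * 3 + ["no"] * 3)
```

The purer the data set, the closer the value is to zero, which is why a decision tree learner prefers splits that reduce it.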

19

Overfitting

A tree may ______ the training data.

Good accuracy on training data but poor accuracy on test data.

Symptoms: the tree is too deep and has too many branches, some of which may reflect anomalies due to noise or outliers.

20

Pre-pruning

Halt tree construction early

Difficult to decide because we do not know what may happen subsequently if we keep growing the tree.

21

Post-pruning

Remove branches or sub-trees from a “fully grown” tree.

This method is commonly used. C4.5 uses a statistical method to estimate the errors at each node for pruning.

A validation set may be used for pruning as well.

22

Predictive accuracy

Accuracy = Number of correct classifications / total number of test cases
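
The ratio above can be computed directly from two lists of labels. This is a minimal sketch; the function name `accuracy` and the example labels are assumptions.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test cases whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

true_labels = ["CAR", "BIKE", "CAR", "CAR"]
predicted_labels = ["CAR", "BIKE", "BIKE", "CAR"]
score = accuracy(true_labels, predicted_labels)  # 3 correct out of 4
```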

23

Holdout set

The available data set D is divided into two disjoint subsets: a training set and a test set.

24

Holdout set

This method is used when the data set D is large.
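
A holdout split can be sketched as a shuffle followed by a cut. The 30% test fraction, the fixed seed, and the name `holdout_split` are assumptions chosen for the example, not values given in the cards.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle the data and cut it into disjoint training and test subsets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

train, test = holdout_split(range(10))
```

Every example ends up in exactly one of the two subsets, so the test set is never seen during training.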

25

n-fold cross-validation

The available data is partitioned into n equal-size disjoint subsets.
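
The partitioning can be sketched as follows, assuming the usual procedure in which each subset serves once as the test set while the remaining n - 1 subsets form the training set. The name `n_fold_splits` is hypothetical, and shuffling is left out to keep the sketch short.

```python
def n_fold_splits(data, n):
    """Yield (training, test) pairs; each fold is used once as the test set."""
    items = list(data)
    if len(items) % n != 0:
        raise ValueError("data size must be divisible by n for equal-size folds")
    fold_size = len(items) // n
    folds = [items[i * fold_size:(i + 1) * fold_size] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(n_fold_splits(range(6), n=3))
```

The final accuracy estimate is then typically the average of the accuracies measured on the n test folds.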

26

Validation set

In many cases, the available data is divided into three subsets: a training set, a validation set, and a test set.

27

Validation Set

is used frequently for estimating parameters in learning algorithms.

The parameter values that give the best accuracy on the validation set are used as the final parameter values
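
As a sketch of this selection step, assume a toy classifier with a single parameter, a decision threshold, and pick the candidate value that scores best on the validation set. The classifier, the candidate values, and the data are all hypothetical.

```python
def threshold_accuracy(threshold, examples):
    """Accuracy of the rule 'predict positive when value >= threshold'."""
    correct = sum(
        (value >= threshold) == is_positive for value, is_positive in examples
    )
    return correct / len(examples)

# Hypothetical validation set: (feature value, is the example positive?)
validation_set = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
candidate_thresholds = [0.1, 0.5, 0.8]

# Keep the parameter value with the best validation accuracy.
best_threshold = max(
    candidate_thresholds,
    key=lambda t: threshold_accuracy(t, validation_set),
)
```

The test set plays no part in this choice, so it can still give an unbiased estimate of the final model's accuracy.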

28

Positive Class

The class of interest is commonly called the _______ _______.

29

Precision and recall measures

Used in information retrieval and text classification.

30

Precision p

is the number of correctly classified positive examples divided by the total number of examples that are classified as positive.

31

Recall r

is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
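
Both measures follow from counting true positives (TP), false positives (FP), and false negatives (FN). In the sketch below, the function name, the label strings, and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def precision_recall(true_labels, predicted_labels, positive="pos"):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Assumed convention: report 0.0 when nothing was predicted or nothing is positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

true_labels = ["pos", "pos", "pos", "pos", "neg", "neg"]
predicted_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]
p, r = precision_recall(true_labels, predicted_labels)  # TP=2, FP=1, FN=2
```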

32

ROC Curve

It is a plot of the true positive rate (TPR) against the false positive rate (FPR).
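
The points of such a plot can be produced by sorting the test examples by classifier score and lowering the decision threshold one example at a time. This sketch assumes binary labels (1 for positive, 0 for negative), no tied scores, and at least one example of each class; the name `roc_points` is hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

points = roc_points(scores=[0.9, 0.8, 0.6, 0.3], labels=[1, 0, 1, 0])
```

Plotting FPR on the horizontal axis and TPR on the vertical axis gives the curve, which always runs from (0, 0) to (1, 1).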

33

Sensitivity

Same as TPR (or recall)

34

Specificity

Also called True Negative Rate (TNR) (negative recall)
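
Sensitivity and specificity can both be read off the four confusion-matrix counts. The counts below are made-up numbers used only to illustrate the arithmetic.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=40, fn=10, tn=45, fp=5)

# The false positive rate used on the ROC curve is 1 - specificity.
false_positive_rate = 1 - spec
```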