Flashcards covering data exploration concepts (summary statistics, visualization) and discretization methods from Lecture 3.
Summary statistics
A set of measures that summarize data, e.g., frequency and mean.
Frequency
The percentage of times a value occurs in the data set.
Mode
The most frequent attribute value.
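As a rough illustration of the frequency and mode cards, here is a minimal sketch using only the standard library. The helper names `frequencies` and `mode` and the sample list are made up for this example.

```python
from collections import Counter

def frequencies(values):
    """Return {value: fraction of the data set equal to that value}."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def mode(values):
    """Return the most frequent value (on ties, the one seen first)."""
    return Counter(values).most_common(1)[0][0]

colors = ["red", "blue", "red", "green", "red", "blue"]  # made-up sample
freq = frequencies(colors)       # fractions; multiply by 100 for percentages
most_common_color = mode(colors)
```

Here "red" occurs in 3 of 6 entries, so its frequency is 0.5 (50%) and it is also the mode.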
Mean
Arithmetic average; a location measure that is sensitive to outliers.
Median
The middle value of the sorted data (the average of the two middle values when the count is even); a measure of central tendency that is less sensitive to outliers than the mean.
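A small sketch of why the median is often preferred when outliers are present. The numbers are invented for illustration; `statistics.mean` and `statistics.median` are from the Python standard library.

```python
import statistics

values = [30, 32, 35, 38, 40]        # made-up sample
with_outlier = values + [1000]       # same sample plus one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(values)
median_after = statistics.median(with_outlier)
```

Adding the single extreme value moves the mean from 35 to roughly 196, while the median only moves from 35 to 36.5.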
Range
Difference between the maximum and minimum values.
Variance
A measure of the spread of a data set, computed as the average squared deviation of the values from the mean; a common dispersion metric.
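The range and variance cards can be checked on a tiny made-up sample. The cards do not say whether the variance divides by n or by n - 1, so this sketch shows both standard-library variants; which one the lecture intends is an assumption left open here.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up sample, mean is 5

value_range = max(data) - min(data)  # max minus min

# Population variance: sum of squared deviations divided by n.
population_variance = statistics.pvariance(data)

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = statistics.variance(data)
```

For this sample the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7.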
Visualization
Conversion of data into visual representations to reveal patterns, relationships, and outliers.
Scatter plot
A two-dimensional plot showing relationships between two numeric attributes; can use size, shape, and color to encode extra attributes.
Histogram
A chart showing the distribution of a single variable by binning values into intervals.
Discretization
Turning a numeric (continuous) attribute into a categorical attribute by dividing its range into sub-ranges (bins).
Bin (bucket)
A sub-range of values used in discretization.
Equal-width discretization
Divides the value range into N equal-sized subranges; bin width = (max – min) / N.
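A minimal sketch of equal-width binning using the width formula above. The function name `equal_width_bins` and the sample values are hypothetical; placing the maximum value in the last bin is a design choice of this sketch, not something the card specifies.

```python
def equal_width_bins(values, n_bins):
    """Return a bin index (0 .. n_bins - 1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins       # bin width = (max - min) / N
    if width == 0:
        return [0] * len(values)     # all values identical: one bin
    # Clamp so the maximum value lands in the last bin instead of index n_bins.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]    # made-up sample
width_bins = equal_width_bins(ages, 3)
```

With a range of 28 and N = 3 the width is about 9.33, so the bins hold 2, 4, and 3 values: equal widths do not imply equal counts.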
Equal-frequency discretization
Divides the range into N bins so each bin holds roughly the same number of instances.
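One simple way to sketch equal-frequency binning is to rank the values and cut the ranking into N equal parts. The function name `equal_frequency_bins` is hypothetical, and this simplified version may place tied values in different bins, which library implementations usually avoid.

```python
def equal_frequency_bins(values, n_bins):
    """Return a bin index per value so bins hold roughly equal counts."""
    n = len(values)
    # Indices of the values in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n  # rank 0..n-1 mapped to bin 0..n_bins-1
    return bins

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]    # made-up sample
freq_bins = equal_frequency_bins(ages, 3)
```

Each of the three bins receives three of the nine values, even though the resulting bin boundaries are unevenly spaced.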
Unsupervised discretization
Discretization methods that do not use class values when creating bins (e.g., equal-width, equal-frequency).
Supervised discretization
Discretization methods that consider class values to choose bin boundaries.
Entropy-based discretization
A supervised method using information entropy to select bin boundaries for better class separation.
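The core step of entropy-based discretization can be sketched as a search for the single boundary that minimizes the weighted class entropy of the two resulting bins. The function names and the sample values below are invented for illustration (they are not actual Iris measurements), and full methods typically repeat this step recursively with a stopping rule that this sketch omits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best[1]:
            # Candidate boundary: midpoint between adjacent distinct values.
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, weighted)
    return best

lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]     # made-up values
classes = ["setosa", "setosa", "setosa",
           "versicolor", "versicolor", "versicolor"]
threshold, score = best_split(lengths, classes)
```

In this toy sample the boundary at 3.0 separates the two classes perfectly, so the weighted entropy of the split is zero.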
Iris dataset
A classic data set with three flower classes (Setosa, Versicolor, Virginica) and four numeric attributes (sepal length, sepal width, petal length, petal width).