A2 Notes
In Assignment 1, only numerical features were used, since sklearn estimators do not accept nominal features directly; this limited how well the data could represent the diverse feature types found in real-world problems.
For Assignment 2, a dataset containing nominal features will be used, allowing for a broader analysis and potentially better predictive performance.
To incorporate nominal features in sklearn, they need to be converted to numerical values using One-Hot Encoding, which transforms categorical variables into a format that can be provided to machine learning algorithms.
Understanding One-Hot Encoding
OneHotEncoder class: Available in sklearn's preprocessing library, this class efficiently handles nominal features by transforming them into a binary array, where each unique value is represented as a separate feature.
from sklearn.preprocessing import OneHotEncoder
Creating the Encoder Object:
ohe = OneHotEncoder(sparse_output=False)
Setting sparse_output=False ensures the output is a dense NumPy array rather than a sparse matrix, which is convenient when the result feeds directly into further analysis or modeling steps.
Implementing One-Hot Encoding
Data Example:
a = [["red","medium","circle"],["blue","large","square"],["green","small","triangle"]]
Fitting the Encoder: The fit method learns the unique values of each nominal feature in the data, creating one new binary feature per category:
ohe.fit(a)
Check categories of original features:
ohe.categories_
# This returns arrays of unique values for each feature.
Getting Feature Names: The get_feature_names_out method shows the names of the features produced by the encoding:
ohe.get_feature_names_out()
Transforming Data with One-Hot Encoding
Transforming New Data: When applying the encoder on a new set of data, it changes categorical entries to their corresponding binary representations:
b = [["green","small","circle"],["blue","medium","square"]]
new_b = ohe.transform(b)
The transformed data will be a numerical representation, demonstrating how categorical input translates to machine-readable formats:
array([[0., 1., 0., 0., 0., 1., 1., 0., 0.],
[1., 0., 0., 0., 1., 0., 0., 1., 0.]])
Inverse Transformation: Should one need to revert to the original categorical values after processing, inverse_transform provides this functionality:
old_b = ohe.inverse_transform(new_b)
Combining Fit and Transform
Single Step Execution: The fit_transform method performs fitting and transforming in one streamlined step, which is convenient when first working with a new dataset:
new_a = ohe.fit_transform(a)
The output is again a numerical array, with one binary column per learned category:
array([[0., 0., 1., 0., 1., 0., 1., 0., 0.], ...])
Handling Categorical Features in Datasets
Loading Data: To work with a dataset that includes categorical features, one can source data from OpenML. For example, using the prnn_viruses dataset:
from sklearn import datasets
vir = datasets.fetch_openml(data_id=480)
Inspecting the DataFrame: Check the structure of the dataset with info(), which reveals details about the columns, data types, and null values:
vir.data.info()
Example of Nominal Features in Dataset
Unique Values for Nominal Features: Identifying unique entries in categorical columns can provide insights into the variety of data present:
vir.data["col_8"].unique() # Results: ['0', '1', '2', '3']
vir.data["col_10"].unique() # Results: ['0', '2', '3', '1', '4', '7']
Transforming Nominal Features with ColumnTransformer
Using ColumnTransformer: This class applies different preprocessing steps to different columns, here applying OneHotEncoder only to the specified columns while passing the rest through unchanged. This gives flexibility when working with datasets of mixed feature types:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("encoder", OneHotEncoder(sparse_output=False), [7,9,10,11,12,13,14,17])], remainder="passthrough")
Finalizing the Transformation
Executing the Transformation: After defining which features to process, the actual transformation can be performed, leading to a new dataset ready for use:
new_data = ct.fit_transform(vir.data)
Retrieve new feature names to ensure clarity on the structure of the transformed dataset:
ct.get_feature_names_out()
Converted Data Type: Ensure the newly formatted data is in the correct type for processing:
type(new_data) # Should return <class 'numpy.ndarray'>
Converting numpy Array to DataFrame: To leverage the functionalities of pandas, convert the numpy array back to a DataFrame:
import pandas as pd
vir_new_data = pd.DataFrame(new_data, columns=ct.get_feature_names_out(), index=vir.data.index)
This allows further operations in DataFrame format, maintaining the context of the original data and adding more usability.
Model Evaluation Techniques
Cross-Validation with Naïve Bayes: Evaluate model performance by employing cross-validation, a technique to assess how the results of a statistical analysis will generalize to an independent dataset:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
from sklearn.model_selection import cross_validate
scores = cross_validate(nb, vir_new_data, vir.target, cv=10, scoring="accuracy")
The mean test accuracy gives a single summary score for gauging model reliability:
scores["test_score"].mean() # Example output: 0.919
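A self-contained sketch of the same evaluation pattern, run on synthetic binary features standing in for the encoded dataset (the features and label rule are invented, so the score will differ from the 0.919 above):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the one-hot-encoded data: binary features, two classes
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 9)).astype(float)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label correlated with first two features

scores = cross_validate(MultinomialNB(), X, y, cv=10, scoring="accuracy")
print(scores["test_score"].mean())  # mean accuracy over the 10 folds
```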
Using Ensemble Methods
Bagging Classifier: To enhance predictive performance through ensemble methods, implement a bagged decision tree classifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
bagged_dt = BaggingClassifier(estimator=DecisionTreeClassifier())
scores = cross_validate(bagged_dt, vir_new_data, vir.target, cv=10, scoring="accuracy")
Boosted Classifier: AdaBoost builds decision trees sequentially, with each tree weighted toward the examples its predecessors misclassified, which often improves accuracy over a single tree:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
boosted_dt = AdaBoostClassifier(estimator=DecisionTreeClassifier())
scores = cross_validate(boosted_dt, vir_new_data, vir.target, cv=10, scoring="accuracy")
Voting Classifier: Create a heterogeneous ensemble of various classifiers to improve overall model performance:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
vc = VotingClassifier([("lr", LogisticRegression()), ("nb", MultinomialNB())])
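A usage sketch for the voting ensemble, again on synthetic data (the features and label rule are invented; the features are kept non-negative because MultinomialNB requires non-negative inputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_validate

# Invented non-negative binary data, mimicking a one-hot-encoded dataset
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 9)).astype(float)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Heterogeneous ensemble: majority vote over two different model families
vc = VotingClassifier([("lr", LogisticRegression(max_iter=1000)),
                       ("nb", MultinomialNB())])
scores = cross_validate(vc, X, y, cv=10, scoring="accuracy")
print(scores["test_score"].mean())
```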