A2 Notes

  • In Assignment 1, only numerical features were used, since sklearn estimators do not accept nominal (string-valued) features directly; this limited how well the data represented the diverse feature types found in real-world problems.

  • For Assignment 2, a dataset containing nominal features will be used, allowing for a broader analysis and potentially better predictive performance.

  • To incorporate nominal features in sklearn, they need to be converted to numerical values using One-Hot Encoding, which transforms categorical variables into a format that can be provided to machine learning algorithms.


Understanding One-Hot Encoding
  • OneHotEncoder class: Available in sklearn's preprocessing library, this class efficiently handles nominal features by transforming them into a binary array, where each unique value is represented as a separate feature.

from sklearn.preprocessing import OneHotEncoder
  • Creating the Encoder Object:

ohe = OneHotEncoder(sparse_output=False)
  • Setting sparse_output=False makes the encoder return a dense NumPy array instead of a SciPy sparse matrix, which is convenient when the result feeds into further analysis or modeling steps.


Implementing One-Hot Encoding
  • Data Example:

a = [["red","medium","circle"],["blue","large","square"],["green","small","triangle"]]
  • Fitting the Encoder: The fit method learns the unique categories of each nominal feature in the data, creating one new binary feature per category:

ohe.fit(a)
  • Check categories of original features:

ohe.categories_
# Returns one array of unique values per input feature.
  • Getting Feature Names: This method allows users to see the transformed names of the features after encoding:

ohe.get_feature_names_out()

Transforming Data with One-Hot Encoding
  • Transforming New Data: When applying the encoder on a new set of data, it changes categorical entries to their corresponding binary representations:

b = [["green","small","circle"],["blue","medium","square"]]
new_b = ohe.transform(b)
  • The transformed data will be a numerical representation, demonstrating how categorical input translates to machine-readable formats:

array([[0., 1., 0., 0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 1., 0.]])
  • Inverse Transformation: To recover the original categorical values from an encoded array, inverse_transform provides this functionality:

old_b = ohe.inverse_transform(new_b)

Combining Fit and Transform
  • Single Step Execution: The fit_transform method fits the encoder and transforms the data in one step, which is convenient when first working with a new dataset:

new_a = ohe.fit_transform(a)
  • The output is again a numerical array, with one binary column per learned category:

array([[0., 0., 1., 0., 1., 0., 1., 0., 0.], ...])

Handling Categorical Features in Datasets
  • Loading Data: To work with a dataset that includes categorical features, one can source data from OpenML. For example, using the prnn_viruses dataset:

from sklearn import datasets
vir = datasets.fetch_openml(data_id=480)
  • Inspecting the DataFrame: Understand the structure of the dataset by calling info(), which reveals the columns, their data types, and non-null counts:

vir.data.info()

Example of Nominal Features in Dataset
  • Unique Values for Nominal Features: Identifying unique entries in categorical columns can provide insights into the variety of data present:

vir.data["col_8"].unique()   # Results: ['0', '1', '2', '3']
vir.data["col_10"].unique()  # Results: ['0', '2', '3', '1', '4', '7']

Transforming Nominal Features with ColumnTransformer
  • Using ColumnTransformer: This class applies different preprocessing steps to different columns. Here, one-hot encoding is applied only to the specified columns while the rest are passed through unchanged, which adds flexibility when working with datasets of mixed feature types:

from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [("encoder", OneHotEncoder(sparse_output=False), [7,9,10,11,12,13,14,17])],
    remainder="passthrough")

Finalizing the Transformation
  • Executing the Transformation: After defining which features to process, the actual transformation can be performed, leading to a new dataset ready for use:

new_data = ct.fit_transform(vir.data)
  • Retrieve new feature names to ensure clarity on the structure of the transformed dataset:

ct.get_feature_names_out()
  • Converted Data Type: Verify the type of the transformed data; ColumnTransformer returns a plain numpy array rather than a DataFrame:

type(new_data)  # Should return <class 'numpy.ndarray'>
  • Converting numpy Array to DataFrame: To leverage the functionalities of pandas, convert the numpy array back to a DataFrame:

import pandas as pd
vir_new_data = pd.DataFrame(new_data, columns=ct.get_feature_names_out(), index=vir.data.index)
  • This allows further operations in DataFrame format, maintaining the context of the original data and adding more usability.


Model Evaluation Techniques
  • Cross-Validation with Naïve Bayes: Evaluate model performance by employing cross-validation, a technique to assess how the results of a statistical analysis will generalize to an independent dataset:

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate
nb = MultinomialNB()
scores = cross_validate(nb, vir_new_data, vir.target, cv=10, scoring="accuracy")
  • Averaging the per-fold accuracy scores gives an overall estimate of model performance:

scores["test_score"].mean()  # Example output: 0.919
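The same pattern runs offline against one of sklearn's built-in datasets; iris stands in here for vir_new_data purely so the snippet is self-contained (MultinomialNB requires non-negative features, which iris satisfies):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: fit on 9 folds, score on the held-out fold,
# repeated so every fold serves as the test set once
scores = cross_validate(MultinomialNB(), X, y, cv=10, scoring="accuracy")
print(scores["test_score"].mean())  # mean accuracy across the 10 folds
```

cross_validate returns a dict; besides "test_score" it also records "fit_time" and "score_time" per fold.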

Using Ensemble Methods
  • Bagging Classifier: To enhance predictive performance through ensemble methods, implement a bagged decision tree classifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bagged_dt = BaggingClassifier(estimator=DecisionTreeClassifier())
scores = cross_validate(bagged_dt, vir_new_data, vir.target, cv=10, scoring="accuracy")
  • Boosted Classifier: AdaBoost fits decision trees sequentially, reweighting misclassified samples so that later trees focus on the harder cases:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
boosted_dt = AdaBoostClassifier(estimator=DecisionTreeClassifier())
scores = cross_validate(boosted_dt, vir_new_data, vir.target, cv=10, scoring="accuracy")
  • Voting Classifier: Create a heterogeneous ensemble of various classifiers to improve overall model performance:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
vc = VotingClassifier([("lr", LogisticRegression()), ("nb", MultinomialNB())])
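A complete, runnable sketch of evaluating such a voting ensemble; iris stands in for the course dataset so the snippet works offline, and max_iter=1000 is an added setting to avoid convergence warnings:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

# heterogeneous ensemble: each named estimator votes on the final class
vc = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", MultinomialNB()),
])
scores = cross_validate(vc, X, y, cv=10, scoring="accuracy")
print(scores["test_score"].mean())
```

By default VotingClassifier uses hard (majority) voting; passing voting="soft" averages predicted probabilities instead, which often works better when the base estimators are well calibrated.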