Dimensionality Reduction

In Machine Learning, dimensionality is the number of input variables (features) in a dataset.
The more input variables there are, the harder it becomes for a model to make reliable predictions. This phenomenon is called the curse of dimensionality.

Hence, the number of features should be reduced using Dimensionality Reduction to overcome this problem. In this article, we will go through how Dimensionality Reduction works in detail.

Curse of dimensionality:

  • The chances of overfitting are high for a model with many degrees of freedom.
  • An overfitted model becomes increasingly dependent on its training data, which degrades its performance on real-world data.

Just as a fan with too many blades generates less wind, a model with too many features can end up performing worse, as the sketch below illustrates.
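A minimal sketch of this effect (an assumed illustration, not part of the original example): with scikit-learn's make_classification and a k-nearest-neighbours classifier, test accuracy tends to drop as more uninformative features are added while the number of samples stays fixed.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Keep 5 informative features and pad the rest with noise
for n_features in (5, 50, 500):
    X, y = make_classification(n_samples=200, n_features=n_features,
                               n_informative=5, n_redundant=0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier().fit(X_train, y_train)
    print(n_features, "features -> test accuracy:", knn.score(X_test, y_test))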

Ways to reduce the dimensionality:

  • Identify variables that are highly correlated with each other and thus redundant (a quick correlation check is sketched below).
  • Apply Feature Selection and Feature Extraction methods.
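One quick way to spot redundant variables (shown here on the same three-column dataset used later in this article) is the pairwise correlation matrix; pairs with correlation close to ±1 carry largely overlapping information.

import pandas as pd

df = pd.DataFrame({"X": [1, 3, 5, 7, 9, 10, 13, 15, 17, 19],
                   "Y": [2, 4, 1, 4, 5, 6, 7, 7, 6, 8],
                   "Z": [3, 3, 4, 3, 3, 3, 4, 3, 3, 4]})
# Pairwise Pearson correlations between the columns
print(df.corr())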

Feature Extraction:

In a dataset, not every feature is equally important. Feature Extraction reduces the number of variables while retaining most of the variance.

In the following scenario, the given three-dimensional dataset is reduced to two dimensions by generating new features.

Consider the table below as the dataset:

 X    Y    Z
 1    2    3
 3    4    3
 5    1    4
 7    4    3
 9    5    3
10    6    3
13    7    4
15    7    3
17    6    3
19    8    4
Dataset

We use a commonly used Feature Extraction technique called Principal Component Analysis (PCA) to reduce the number of features.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
from sklearn.decomposition import PCA

# The dataset from the table above as a 10 x 3 array of (X, Y, Z) values
x = np.array([[1, 2, 3], [3, 4, 3], [5, 1, 4], [7, 4, 3], [9, 5, 3],
              [10, 6, 3], [13, 7, 4], [15, 7, 3], [17, 6, 3], [19, 8, 4]])

# Project the three original features onto two principal components
pca = PCA(n_components=2)
pca_x = pca.fit_transform(x)
print("Number of features before Feature Extraction:", x.shape[1])
print("Number of features after Feature Extraction:", pca_x.shape[1])
Number of features before Feature Extraction: 3 
Number of features after Feature Extraction: 2
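Since Feature Extraction aims to retain most of the variance, it is worth checking how much the two principal components actually keep; explained_variance_ratio_ is a standard attribute of a fitted scikit-learn PCA object.

# Fraction of the original variance captured by the two retained components
print("Variance retained:", pca.explained_variance_ratio_.sum())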
# 3-D view of the original X, Y, Z features
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(x[:, 0], x[:, 1], x[:, 2], color='Green')
ax.plot3D(x[:, 0], x[:, 1], x[:, 2], color='Green')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
Before Feature Extraction
# 2-D view of the two extracted features
fig = plt.figure()
ax = plt.axes()
ax.scatter(pca_x[:, 0], pca_x[:, 1], color='Green')
ax.plot(pca_x[:, 0], pca_x[:, 1], color='Green')
ax.set_xlabel('x1')
ax.set_ylabel('x2')
plt.show()
After Feature Extraction

Feature Selection:

Feature Selection techniques select highly informative features and drop less useful ones.
In the example below, we use a Feature Selection method called variance thresholding to remove variables with low variance.

from sklearn.feature_selection import VarianceThreshold

# Drop every feature whose variance is below 0.5 (here, the Z column)
VarThreshold = VarianceThreshold(threshold=0.5)
features_high_variance = VarThreshold.fit_transform(x)
print("Number of features before feature selection:", x.shape[1])
print("Number of features after feature selection:", features_high_variance.shape[1])
Number of features before feature selection: 3
Number of features after feature selection: 2
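To see why one column was dropped, we can inspect the per-feature variances and the mask of retained columns; variances_ and get_support() are standard members of a fitted VarianceThreshold selector.

# Per-feature variances and the columns that cleared the 0.5 threshold
print("Feature variances:", VarThreshold.variances_)
print("Columns kept:", VarThreshold.get_support())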

Advantages:

  • Reduces computation time.
  • Can improve model accuracy by reducing overfitting.
  • Compresses the data, reducing storage requirements.

Disadvantages:

  • Some information may be lost in the process.

To deal with a 14-dimensional space, visualize a 3-D space and say ‘fourteen’ to yourself very loudly. Everyone does it.

Geoffrey Hinton
