In Machine Learning, dimensionality is defined as the number of input variables in a dataset.
The more input variables there are, the more challenging predictive modelling becomes. This phenomenon is called the curse of dimensionality.
Hence, the number of features should be reduced using Dimensionality Reduction to overcome this drawback. In this article, we will go through the functionality of Dimensionality Reduction in detail.
Curse of dimensionality:
- The chances of overfitting are high for a model with many degrees of freedom.
- Such a model becomes increasingly dependent on the training data and, due to overfitting, performs poorly on unseen data, as the sketch below illustrates.
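A minimal sketch of this effect, using a synthetic dataset and scikit-learn's LinearRegression (both chosen here purely for illustration): with more features than training samples, the model fits the training data almost perfectly but generalizes poorly.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 40 samples, 30 features, but only the first feature matters
X = rng.normal(size=(40, 30))
y = X[:, 0] + rng.normal(scale=0.5, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# With 30 features and only 20 training samples, least squares interpolates the training set
model = LinearRegression().fit(X_train, y_train)
print("Training R^2:", round(model.score(X_train, y_train), 3))   # close to 1.0
print("Test R^2:", round(model.score(X_test, y_test), 3))         # far lower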
Ways to reduce dimensionality:
- Some of the variables may be correlated with each other and therefore redundant; such pairs can be identified and one of them dropped, as shown in the sketch after this list.
- Feature Selection and Feature Extraction methods can be applied.
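A minimal sketch of spotting redundant, correlated variables with a pandas correlation matrix; the column names and values here are made up purely for illustration.
import pandas as pd

# Illustrative data: 'height_cm' and 'height_in' carry the same information
df = pd.DataFrame({'height_cm': [150, 160, 170, 180, 190],
                   'height_in': [59.1, 63.0, 66.9, 70.9, 74.8],
                   'weight_kg': [55, 62, 70, 80, 95]})

# Pairwise correlations close to 1 (or -1) flag redundant columns
print(df.corr())

# Drop one column of a highly correlated pair
df_reduced = df.drop(columns=['height_in'])
print(df_reduced.columns.tolist())   # ['height_cm', 'weight_kg']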
Feature Extraction:
In a dataset, not every feature is equally important. Using Feature Extraction, the number of variables can be reduced while retaining as much of the original variance as possible.
In the following example, new features are generated in two dimensions by reducing the given three-dimensional dataset.
Consider the table below as the dataset:
X | Y | Z |
1 | 2 | 3 |
3 | 4 | 3 |
5 | 1 | 4 |
7 | 4 | 3 |
9 | 5 | 3 |
10 | 6 | 3 |
13 | 7 | 4 |
15 | 7 | 3 |
17 | 6 | 3 |
19 | 8 | 4 |
We use one of the most commonly used Feature Extraction techniques, Principal Component Analysis (PCA), to reduce the number of features.
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
from sklearn.decomposition import PCA

# Dataset from the table above (columns X, Y and Z)
df = pd.DataFrame({'X': [1, 3, 5, 7, 9, 10, 13, 15, 17, 19],
                   'Y': [2, 4, 1, 4, 5, 6, 7, 7, 6, 8],
                   'Z': [3, 3, 4, 3, 3, 3, 4, 3, 3, 4]})
x = df.values
# Project the three original features onto two principal components
pca = PCA(n_components=2)
pca_x = pca.fit_transform(x)
print("Number of features before Feature Extraction:", x.shape[1])
print("Number of features after Feature Extraction:", pca_x.shape[1])
Number of features before Feature Extraction: 3
Number of features after Feature Extraction: 2
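Since the goal is to retain as much variance as possible, it is worth checking how much the two components keep. A short follow-up using the fitted pca object from above:
# Fraction of the original variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())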
# Visualize the original three-dimensional data
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(x[:,0], x[:,1], x[:,2], color='Green')
ax.plot3D(x[:,0], x[:,1], x[:,2], color='Green')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
# Visualize the two new features produced by PCA
fig = plt.figure()
ax = plt.axes()
ax.scatter(pca_x[:,0], pca_x[:,1], color='Green')
ax.plot(pca_x[:,0], pca_x[:,1], color='Green')
ax.set_xlabel('x1')
ax.set_ylabel('x2')
plt.show()
Feature Selection:
Feature Selection techniques select the most informative features and drop the less useful ones.
In the example below, we use one of the Feature Selection methods, variance thresholding, to remove variables with low variance.
from sklearn.feature_selection import VarianceThreshold

# Keep only the features whose variance exceeds 0.5
var_threshold = VarianceThreshold(threshold=.5)
features_high_variance = var_threshold.fit_transform(x)
print("Number of features before feature selection:", x.shape[1])
print("Number of features after feature selection:", features_high_variance.shape[1])
Number of features before feature selection: 3
Number of features after feature selection: 2
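To see which feature was dropped and why, one can inspect the per-feature variances and the selector's mask. A short follow-up using the x and var_threshold objects from above:
import numpy as np

# Per-feature variances of X, Y and Z; only Z falls below the 0.5 threshold
print("Feature variances:", np.var(x, axis=0))

# Boolean mask of the columns that were kept
print("Columns kept:", var_threshold.get_support())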
Advantages:
- Reduces computation time.
- Can improve model accuracy.
- Enables data compression, since fewer features require less storage.
Disadvantages:
- Some information may be lost when features are dropped or compressed; the sketch below quantifies this for the PCA example.
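One way to see this loss with the PCA example above is to map the two components back into three dimensions and compare the result with the original data. A minimal sketch using the pca, pca_x and x objects defined earlier:
import numpy as np

# Map the 2-D representation back into the original 3-D space
x_reconstructed = pca.inverse_transform(pca_x)

# The reconstruction is close to, but not exactly, the original data;
# the gap is the information discarded by dropping one component
print("Mean squared reconstruction error:", np.mean((x - x_reconstructed) ** 2))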