Data cleaning can be a monotonous part of any machine learning project. The steps and techniques for data cleaning vary from dataset to dataset. Without clean data, it is hard to spot the genuinely important patterns during exploratory data analysis. We are going to see some basic data cleaning operations that should be performed on every ML project.
Consider the student dataset below and let us clean it with the following techniques.
import pandas as pd
import numpy as np

data = pd.DataFrame([['abc', 10, 20, 10, 5], ['efg', 20, 18, 20, 5],
                     ['hij', 30, 19, 30, 5], ['klm', 40, 20, 40, 5],
                     ['klm', 40, 19, 40, 5], ['opq', 50, 20, 50, 5],
                     ['rst', 60, np.nan, 60, 5]])
data.columns = ['name', 'score', 'age', 'mark', 'subject #']
Duplicate data:
The first step of data cleaning is to remove unwanted features and observations from the dataset. Duplicate rows and columns are a common by-product of data collection.
data.drop_duplicates(inplace=True) #remove row duplicates
data = data.loc[:,~data.T.duplicated()] #remove column duplicates
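Putting the two steps together on the sample student dataset, a minimal sketch (here the 'mark' column duplicates 'score', so it is dropped; the two 'klm' rows differ in age, so neither is removed):

```python
import pandas as pd
import numpy as np

# Sample student dataset from above
data = pd.DataFrame([['abc', 10, 20, 10, 5], ['efg', 20, 18, 20, 5],
                     ['hij', 30, 19, 30, 5], ['klm', 40, 20, 40, 5],
                     ['klm', 40, 19, 40, 5], ['opq', 50, 20, 50, 5],
                     ['rst', 60, np.nan, 60, 5]])
data.columns = ['name', 'score', 'age', 'mark', 'subject #']

data.drop_duplicates(inplace=True)          # drop fully duplicated rows
data = data.loc[:, ~data.T.duplicated()]    # drop duplicated columns

print(data.columns.tolist())  # 'mark' is gone, 'score' is kept
```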
Features with a single value:
Some features repeat the same information across all observations. A column that holds a single value for every row carries no useful information for analysis, so it can be dropped.
no_of_unique = data.nunique()
single_value=no_of_unique[no_of_unique==1]
data.drop(single_value.index, axis=1, inplace=True)
Null Handling:
Missing data is a tricky problem, and there is no single best way to handle it. Missing values can either be dropped or imputed with some meaningful value.
a) Removing missing value
data.isnull().sum()  # count missing values per column
data = data.dropna(axis=0, how='any')  # drop rows containing any missing value
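A minimal sketch of dropping rows with missing values, on a hypothetical two-row frame (note that `dropna` returns a new frame, so the result must be assigned back):

```python
import pandas as pd
import numpy as np

# Hypothetical frame: 'rst' has no recorded age
df = pd.DataFrame({'name': ['abc', 'rst'],
                   'age': [20.0, np.nan]})

print(df.isnull().sum())           # missing-value count per column
df = df.dropna(axis=0, how='any')  # drop any row containing a NaN

print(df.shape)  # only the complete row remains
```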
b) Imputing with related value
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing
data.iloc[:, 1:4] = SimpleImputer(strategy='median').fit_transform(data.iloc[:, 1:4])
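A self-contained sketch of median imputation with scikit-learn's `SimpleImputer` (the old `sklearn.preprocessing.Imputer` class was removed in recent versions), on a small hypothetical frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame: one missing age among [20, 19]
df = pd.DataFrame({'score': [10, 20, 30],
                   'age': [20.0, np.nan, 19.0]})

# Fill NaNs with the column median; median of [20.0, 19.0] is 19.5
imputer = SimpleImputer(strategy='median')
df[['score', 'age']] = imputer.fit_transform(df[['score', 'age']])

print(df['age'].tolist())  # the NaN is replaced by 19.5
```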