Basic Data Cleaning Techniques

Data cleaning can be a monotonous part of any machine learning project, and the steps and techniques required differ from dataset to dataset. Without clean data, it is hard to spot what actually matters during exploratory data analysis. Here we will walk through some basic data cleaning operations that apply to almost every ML project.

Consider the student dataset below, and let us clean it using the following techniques.

import numpy as np
import pandas as pd

data = pd.DataFrame([['abc', 10, 20, 10, 5], ['efg', 20, 18, 20, 5],
                     ['hij', 30, 19, 30, 5], ['klm', 40, 20, 40, 5],
                     ['klm', 40, 19, 40, 5], ['opq', 50, 20, 50, 5],
                     ['rst', 60, np.nan, 60, 5]])
data.columns = ['name', 'score', 'age', 'mark', 'subject #']

Duplicate data:

The first step of data cleaning is to remove unwanted features and observations from the dataset. Duplicate rows and columns commonly creep in during data collection.

data.drop_duplicates(inplace=True) #remove row duplicates
data = data.loc[:,~data.T.duplicated()] #remove column duplicates
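To make the effect concrete, here is a minimal sketch that re-runs both checks on the sample dataset. Note that rows 3 and 4 differ only in age, so an exact-match check keeps both; a subset-based check on the key column (here `name`, an assumption about what identifies a student) would treat them as duplicates:

```python
import numpy as np
import pandas as pd

# Rebuild the student dataset from above
data = pd.DataFrame([['abc', 10, 20, 10, 5], ['efg', 20, 18, 20, 5],
                     ['hij', 30, 19, 30, 5], ['klm', 40, 20, 40, 5],
                     ['klm', 40, 19, 40, 5], ['opq', 50, 20, 50, 5],
                     ['rst', 60, np.nan, 60, 5]],
                    columns=['name', 'score', 'age', 'mark', 'subject #'])

# Exact-match row check: the two 'klm' rows differ in 'age', so both survive
deduped = data.drop_duplicates()

# Subset-based check: treats rows with the same 'name' as duplicates,
# keeping only the first 'klm' row
by_name = data.drop_duplicates(subset=['name'])

# Column check: 'mark' repeats 'score' value-for-value, so it is dropped
deduped = deduped.loc[:, ~deduped.T.duplicated()]
```

Which variant is right depends on what counts as "the same" observation in your data, so inspect near-duplicates before dropping them.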

Features with a single value:

In some features, there will be a repetition of the same information in all observations. Columns that have a single value for all rows do not contain any useful information for analysis.

no_of_unique = data.nunique()
single_value=no_of_unique[no_of_unique==1]
data.drop(single_value.index, axis=1, inplace=True)
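Running those three lines on the sample dataset flags the `subject #` column, since every row holds the value 5 (note that `nunique` ignores NaN by default, so the `age` column, which has a missing value, is not affected):

```python
import numpy as np
import pandas as pd

# Rebuild the student dataset from above
data = pd.DataFrame([['abc', 10, 20, 10, 5], ['efg', 20, 18, 20, 5],
                     ['hij', 30, 19, 30, 5], ['klm', 40, 20, 40, 5],
                     ['klm', 40, 19, 40, 5], ['opq', 50, 20, 50, 5],
                     ['rst', 60, np.nan, 60, 5]],
                    columns=['name', 'score', 'age', 'mark', 'subject #'])

no_of_unique = data.nunique()                    # NaN is excluded from the count
single_value = no_of_unique[no_of_unique == 1]   # only 'subject #' has one value
data = data.drop(single_value.index, axis=1)
```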

Null Handling:

Missing data is a tricky problem. Missing values can either be dropped or imputed with meaningful estimates.

a) Removing missing values

data.isnull() #boolean mask showing where values are missing
data.dropna(axis=0, how='any') #returns a copy with NaN rows removed; pass inplace=True to modify data itself
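On the sample dataset, dropping rows this way removes only the `rst` row, whose `age` is NaN. A quick sketch, assuming the dataset built above:

```python
import numpy as np
import pandas as pd

# Rebuild the student dataset from above
data = pd.DataFrame([['abc', 10, 20, 10, 5], ['efg', 20, 18, 20, 5],
                     ['hij', 30, 19, 30, 5], ['klm', 40, 20, 40, 5],
                     ['klm', 40, 19, 40, 5], ['opq', 50, 20, 50, 5],
                     ['rst', 60, np.nan, 60, 5]],
                    columns=['name', 'score', 'age', 'mark', 'subject #'])

# Drops any row containing at least one NaN; only 'rst' qualifies here
cleaned = data.dropna(axis=0, how='any')
```

Dropping is safe when only a few rows are affected; with many missing values it throws away data you may want to keep, which is where imputation comes in.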

b) Imputing with a related value

from sklearn.impute import SimpleImputer #Imputer was removed from sklearn.preprocessing
data.iloc[:,1:4] = SimpleImputer(strategy='median').fit_transform(data.iloc[:,1:4])
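If pulling in scikit-learn for a single column feels heavy, the same median imputation can be sketched with plain pandas, assuming the dataset built above. The six observed ages have a median of 19.5, which fills the one NaN:

```python
import numpy as np
import pandas as pd

# Rebuild the student dataset from above
data = pd.DataFrame([['abc', 10, 20, 10, 5], ['efg', 20, 18, 20, 5],
                     ['hij', 30, 19, 30, 5], ['klm', 40, 20, 40, 5],
                     ['klm', 40, 19, 40, 5], ['opq', 50, 20, 50, 5],
                     ['rst', 60, np.nan, 60, 5]],
                    columns=['name', 'score', 'age', 'mark', 'subject #'])

# Series.median() skips NaN; the missing 'age' becomes 19.5
data['age'] = data['age'].fillna(data['age'].median())
```

The median is a reasonable default because it is robust to outliers; the mean or a group-wise statistic may suit other datasets better.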

"I kind of have to be a master of cleaning, extracting and trusting my data before I do anything with it." — Scott Nicholson
