Data is rarely clean. Most of the time, some of it is missing or incomplete. Training data therefore needs to be pre-processed or cleaned, because most Machine Learning models cannot handle missing values directly. We can either drop the rows that have missing values or impute those values with a computed substitute.
Let’s look at sample data. It contains some missing values, which are marked as NaN. We need to look at different ways of imputing these missing values.
We can check for missing values in a DataFrame with the built-in pandas.DataFrame.isnull() method or its alias pandas.DataFrame.isna(); both return a Boolean mask that is True wherever a value is missing.
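As an illustration, the check might look like the sketch below. The DataFrame is a small hypothetical stand-in for the sample data (its values are invented for this example), with missing entries in Age, Position, Experience and Salary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dina", "Eli"],
    "Age": [25, np.nan, 31, 42, np.nan],
    "Position": ["Analyst", "Engineer", None, "Manager", None],
    "Experience": [2, 5, np.nan, 15, np.nan],
    "Salary": [40000, 65000, 58000, np.nan, np.nan],
})

# Boolean mask: True wherever a value is missing.
missing_mask = df.isnull()

# Number of missing values in each column.
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

Summing the mask column by column is a quick way to see which features need attention before choosing a strategy.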
The features Age, Position, Experience and Salary have missing values.
The missing data can be handled in the following ways:
- Dropping rows with missing data.
- Replacing NaN with data.
Dropping Rows With Missing Data
Ignoring the rows with missing data can lead to inconsistent results, because the removed rows may be crucial for later calculations and may contain important observations.
Although it is not generally recommended, removing rows can be acceptable when the dataset is large, since the rows with missing values are then likely to have only a small impact.
We can consider the following scenarios for dropping rows with NaNs:
- drop all rows that have any NaN (missing) values
- drop only if entire row has NaN (missing) values
- drop only if a row has more than 2 NaN (missing) values
- drop NaN (missing) in a specific column
Case 1: Drop all rows that have NaN
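A minimal sketch of this case, assuming a small hypothetical DataFrame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1-3 have 1, 3 and 4 missing values.
df = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan],
    "Position": ["Analyst", "Engineer", None, None],
    "Experience": [2, 5, np.nan, np.nan],
    "Salary": [40000, 65000, np.nan, np.nan],
})

# Default behaviour (how="any"): drop every row with at least one missing value.
complete_rows = df.dropna()
print(complete_rows)
```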
Calling dropna() with no arguments removes every row that contains at least one NaN or NA value, which excludes those rows from further analysis.
Case 2: Drop only if entire row has NaN values
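One way to sketch this is with the how="all" option of dropna(). The DataFrame below is hypothetical, with values invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the last row is missing in every column.
df = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan],
    "Position": ["Analyst", "Engineer", None, None],
    "Experience": [2, 5, np.nan, np.nan],
    "Salary": [40000, 65000, np.nan, np.nan],
})

# how="all": drop a row only when all of its values are missing.
not_empty_rows = df.dropna(how="all")
print(not_empty_rows)
```

Rows that are only partly missing are kept, so they still need to be handled afterwards.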
Case 3: Drop only if a row has more than 2 NaN values
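A possible sketch uses the thresh parameter of dropna(), which is the minimum number of non-missing values a row needs in order to be kept. To drop rows with more than 2 missing values, thresh can be set to the number of columns minus 2. The DataFrame is hypothetical, with values invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0-4 have 0, 1, 2, 3 and 4 missing values respectively.
df = pd.DataFrame({
    "Age": [25, np.nan, 40, 31, np.nan],
    "Position": ["Analyst", "Engineer", "Manager", None, None],
    "Experience": [2, 5, np.nan, np.nan, np.nan],
    "Salary": [40000, 65000, np.nan, np.nan, np.nan],
})

max_missing = 2
# thresh counts non-missing values, so convert the limit on missing values.
min_present = df.shape[1] - max_missing

mostly_complete = df.dropna(thresh=min_present)
print(mostly_complete)
```

The row with exactly 2 missing values survives, because the condition is "more than 2".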
Case 4: Drop NaN in a specific column
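This case can be sketched with the subset parameter of dropna(). The choice of Salary as the column to check is an assumption made for this example, and the data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Salary is missing in rows 2 and 3.
df = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan],
    "Position": ["Analyst", "Engineer", None, None],
    "Experience": [2, 5, np.nan, np.nan],
    "Salary": [40000, 65000, np.nan, np.nan],
})

# subset: only look at the listed columns when deciding which rows to drop.
has_salary = df.dropna(subset=["Salary"])
print(has_salary)
```

Missing values in the other columns are ignored here, so row 1 is kept even though its Age is NaN.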
Although dropping rows is the easier option, it discards information, so it should be used sparingly. Let’s look at an alternative that keeps the dataset more consistent.
Replacing NaN With Data
Imputation is another approach to resolve the problem of missing data. The missing column values are substituted by another computed value. There might be scenarios where the dataset is small or where each row of the dataset represents a critical value. In those cases, we cannot remove the row from the dataset. The missing values can be imputed.
There are different strategies for choosing the substitute for a missing value:
- Substitute with the mean value of the other values in the same column of the training dataset.
- Substitute with the median value of the other values in the same column of the training dataset.
- Substitute with the mode value (most frequent) in the training dataset.
- Substitute with a constant value.
To achieve the required substitution, we can use scikit-learn’s Imputer class. Note that sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and later removed; its replacement is sklearn.impute.SimpleImputer. The constructor takes the following parameters as input:
missing_values: The placeholder that marks a value as missing and therefore in need of replacement. It is usually NaN, but it can also be an integer or a string that the dataset uses to flag missing entries.
strategy: The method used to compute the replacement value. The available strategies are mean, median, most_frequent and constant (constant is available in SimpleImputer only).
axis: This parameter takes either 0 or 1 and decides whether the strategy is applied along columns or along rows. For 0, the replacement is computed from all the values in the column; for 1, it is computed from all the values in the row. This parameter belongs to the legacy Imputer class; SimpleImputer has no axis parameter and always imputes column by column.
verbose: This controls the verbosity of the imputer. The default value is 0.
copy: This decides whether a copy of the original object is made or the original data is transformed in place. By default, it is set to True, which means that a copy of the original object is created.
We will look at how to replace the missing values with the various strategies.
1. Imputing with mean
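A minimal sketch of mean imputation, assuming a recent scikit-learn release where SimpleImputer replaces the older Imputer class. The DataFrame is hypothetical, with values invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value per column.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, 40.0],
    "Salary": [40000.0, 65000.0, np.nan, 45000.0],
})

# Replace each NaN with the mean of the observed values in its column.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
mean_imputed = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)
print(mean_imputed)
```

The mean is sensitive to outliers, so this strategy suits columns whose values are roughly symmetric.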
2. Imputing with median
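Median imputation can be sketched the same way, again assuming SimpleImputer from a recent scikit-learn release and a hypothetical DataFrame with invented values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value per column.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, 40.0],
    "Salary": [40000.0, 65000.0, np.nan, 45000.0],
})

# Replace each NaN with the median of the observed values in its column.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
median_imputed = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)
print(median_imputed)
```

The median is less affected by extreme values than the mean, which can make it a safer default for skewed columns such as salaries.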
3. Imputing with mode
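Mode imputation also works for categorical columns. The sketch below assumes SimpleImputer from a recent scikit-learn release and a hypothetical Position column with invented values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical data; "Engineer" is the most frequent value.
df = pd.DataFrame({
    "Position": pd.Series(
        ["Analyst", "Engineer", np.nan, "Engineer"], dtype=object
    ),
})

# Replace each NaN with the most frequent observed value in its column.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = mode_imputer.fit_transform(df[["Position"]])

# fit_transform returns a 2-D array; flatten it back into the column.
df["Position"] = filled.ravel()
print(df)
```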
4. Imputing with constant
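For a constant substitute, SimpleImputer accepts a fill_value argument alongside strategy="constant". The choice of 0 as the fill value is an assumption made for this example, and the data is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value.
df = pd.DataFrame({
    "Experience": [2.0, 5.0, np.nan, 15.0],
})

# Replace each NaN with a fixed value chosen by the analyst (0 here).
constant_imputer = SimpleImputer(
    missing_values=np.nan, strategy="constant", fill_value=0
)
constant_imputed = pd.DataFrame(
    constant_imputer.fit_transform(df), columns=df.columns
)
print(constant_imputed)
```

A constant is most useful when the fact that a value was missing carries meaning of its own, so a clearly recognisable placeholder is preferable to a computed estimate.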
Missing data arise in almost all serious data analyses. The approach to deal with missing values is heavily dependent on the nature of such data. Thanks for reading.