Lesson

Data Cleaning in Pandas

Learn Data Cleaning in Pandas in SQLPad's Python Pandas Mastery course with practical examples and guided lessons.

Welcome to the Data Cleaning in Pandas lesson, part of the Python Pandas Mastery: An Interactive and Practical Guide to Data Analysis course. In this lesson, we'll focus on various techniques that allow you to clean and preprocess your data using the Pandas library. Data cleaning is a crucial step in any data analysis project, as it ensures that your data is accurate, complete, and well-structured before you start the analysis process. We'll cover topics such as handling missing values, correcting data inconsistencies, and transforming data to a suitable format. This lesson will provide you with practical examples that will help you build a solid foundation in data cleaning with Pandas.

Welcome to the Data Cleaning in Pandas lesson, part of the Python Pandas Mastery: An Interactive and Practical Guide to Data Analysis course. In this lesson, we'll focus on various techniques that allow you to clean and preprocess your data using the Pandas library. Data cleaning is a crucial step in any data analysis project, as it ensures that your data is accurate, complete, and well-structured before you start the analysis process. We'll cover topics such as handling missing values, correcting data inconsistencies, and transforming data to a suitable format. This lesson will provide you with practical examples that will help you build a solid foundation in data cleaning with Pandas.

Removing Duplicates

To demonstrate removing duplicates, we will create a sample dataset using pandas.

import pandas as pd

# Create a sample dataset
data = {'Name': ['John', 'Alice', 'Bob', 'John', 'Alice', 'Charlie'],
        'Age': [28, 24, 22, 28, 24, 30],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'San Francisco', 'Los Angeles']}

# Create DataFrame
df = pd.DataFrame(data)

# Print the original DataFrame
print(df)

Now, let's remove duplicate rows from the DataFrame.

# Remove duplicate rows
df_unique = df.drop_duplicates()

# Print the updated DataFrame
print(df_unique)

If you want to remove duplicates based on a specific column, you can use the subset parameter.

# Remove duplicates based on the 'Name' column
df_unique_name = df.drop_duplicates(subset='Name')

# Print the updated DataFrame
print(df_unique_name)

Finally, you can also use the keep parameter to determine which duplicate to keep. By default, it is set to 'first', which means the first occurrence of a duplicate is kept, and others are removed. You can change it to 'last' to keep the last occurrence or use False to remove all duplicates.

# Remove all occurrences of duplicates
df_no_duplicates = df.drop_duplicates(subset='Name', keep=False)

# Print the updated DataFrame
print(df_no_duplicates)

Renaming Columns

Import Required Libraries

import pandas as pd

Create Sample DataFrame

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 28, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)

Renaming Columns Using rename Method

df = df.rename(columns={'Name': 'Full Name', 'Age': 'Age in Years'})
print(df)

Renaming Columns Using columns Attribute

df.columns = ['Full Name', 'Age in Years', 'City']
print(df)

Replacing Values

First, let's create a sample DataFrame that we will use for demonstrating the process of replacing values in Pandas.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago'],
    'Gender': ['F', 'M', 'M', 'M']
}

df = pd.DataFrame(data)
print(df)

Now let's replace the 'Gender' column values with full descriptions, i.e., 'F' to 'Female' and 'M' to 'Male'.

df['Gender'] = df['Gender'].replace({'F': 'Female', 'M': 'Male'})
print(df)

We can also replace specific values in the entire DataFrame, for example, let's replace 'New York' with 'NYC'.

df = df.replace('New York', 'NYC')
print(df)

Lastly, let's replace the 'Age' column values greater than 30 with the mean age.

mean_age = df['Age'].mean()
df['Age'] = df['Age'].replace([a for a in df['Age'] if a > 30], mean_age)
print(df)

Converting Data Types

First, let's import the necessary libraries and create a sample dataframe.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Cathy'],
        'Age': ['25', '30', '35'],
        'Score': [80.5, 90.0, 78.5]}

df = pd.DataFrame(data)
print(df)

Now, let's check the data types of each column using dtypes.

print(df.dtypes)

We can see that the 'Age' column has a data type of object (string), while we want it to be of type int. Let's convert the data type of the 'Age' column to int using the astype() method.

df['Age'] = df['Age'].astype(int)
print(df.dtypes)

The data type of the 'Age' column has been successfully converted to int. Now, let's convert the 'Score' column to int as well, using the same method.

df['Score'] = df['Score'].astype(int)
print(df)
print(df.dtypes)

The 'Score' column has been successfully converted to int, and our final DataFrame looks like this:

print(df)

In this example, we demonstrated how to convert data types in a pandas DataFrame using the astype() method.

Handling Outliers

In this example, we will handle outliers in a DataFrame using Pandas. We will create a sample dataset with some outliers and demonstrate how to identify and remove them using the Interquartile Range (IQR) method.

First, let's create a sample DataFrame.

import pandas as pd

data = {'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100], 'B': [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 200]}
df = pd.DataFrame(data)
print(df)

Now, let's calculate the IQR for each column.

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Next, we will identify and remove the outliers from the DataFrame.

outlier_condition = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
df_no_outliers = df[~outlier_condition.any(axis=1)]
print(df_no_outliers)

The final output will be a DataFrame with the outliers removed.

Regular Expressions for Data Cleaning

Import Libraries and Load Data

import pandas as pd
import numpy as np

# Create a sample dataframe
data = {'ID': ['A001', 'A002', 'A003', 'A004'],
        'Name': ['John Doe', 'Jane Smith', 'Alex Johnson', 'Layla Brown'],
        'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
        'Salary': ['$60,000', '$70,000', '$80,000', '$90,000']}

df = pd.DataFrame(data)
print(df)

Replace Dollar Sign and Comma in Salary Column

df["Salary"] = df["Salary"].replace('[\$,]', '', regex=True).astype(float)
print(df)

Extract Domain from Email Column

df['Email_Domain'] = df['Email'].str.extract('(@[a-zA-Z0-9]+\.[a-zA-Z0-9]+)', expand=True)
print(df)

Split Name Column into First and Last Name

df[['First_Name', 'Last_Name']] = df['Name'].str.split(' ', expand=True)
print(df)

Remove Numbers from ID Column

df['ID'] = df['ID'].replace('[0-9]', '', regex=True)
print(df)

Exercises

1. Data Cleaning Exercise: Handling Missing Values

Instruction

In this exercise, you will practice handling missing values in a dataset using Pandas. We will use a sample dataset containing information about students and their test scores. Your task is to clean the dataset by filling in missing values using appropriate techniques.

Follow these steps:

  1. Create a sample dataset containing the following data:
Name Age Test Score
Alice 20 85
Bob 21 NaN
Carol 19 90
David NaN 78
Eve 22 88
Frank NaN NaN
  1. Import the necessary libraries and create a DataFrame from the dataset.

  2. Calculate the mean of the Age and Test Score columns.

  3. Fill in the missing values in the Age column with the mean age, and in the Test Score column with the mean test score. Be sure to round the values to the nearest whole number.

  4. Print the cleaned dataset.

My Solution

# Your solution goes here

Hint

Use the fillna() method to fill in missing values in the DataFrame. Calculate the mean using the mean() method on the respective columns, and pass the rounded mean values to the fillna() method.

Solution

import pandas as pd

# Create the dataset
data = {'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank'],
        'Age': [20, 21, 19, None, 22, None],
        'Test Score': [85, None, 90, 78, 88, None]}

# Create the DataFrame
df = pd.DataFrame(data)

# Calculate the mean of Age and Test Score columns
mean_age = df['Age'].mean()
mean_test_score = df['Test Score'].mean()

# Fill in the missing values
df['Age'].fillna(round(mean_age), inplace=True)
df['Test Score'].fillna(round(mean_test_score), inplace=True)

# Display the cleaned dataset
print(df)