Data Structures in Pandas
Learn Data Structures in Pandas in SQLPad's Python Pandas Mastery course with practical examples and guided lessons.
Welcome to the lesson on Data Structures in Pandas, part of the Python Pandas Mastery: An Interactive and Practical Guide to Data Analysis course. In this lesson, we will explore the two main data structures provided by Pandas: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table. You will learn their key features and how to create and manipulate them. We will also discuss why these structures matter for data analysis: both carry labels alongside the data, so selection, arithmetic, and missing-data handling work by label rather than by manual bookkeeping.
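As a quick orientation before the detailed examples, the sketch below builds one Series and one DataFrame and shows that selecting a single DataFrame column yields a Series. The names temperatures and readings are illustrative and not part of the lesson.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels
temperatures = pd.Series([21.5, 19.0, 23.2], index=["Mon", "Tue", "Wed"])

# A DataFrame: a table whose columns share one row index
readings = pd.DataFrame(
    {"temperature": [21.5, 19.0, 23.2], "humidity": [40, 55, 38]},
    index=["Mon", "Tue", "Wed"],
)

# Selecting one column of a DataFrame returns a Series
column = readings["temperature"]
print(type(column).__name__)
print(readings.shape)
```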
Creating and Modifying Series
In this code example, we will demonstrate how to create a Pandas Series and perform basic modifications.
# Import Pandas library
import pandas as pd
# Create a Pandas Series
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
# Add an index to the Series
index = ['one', 'two', 'three', 'four', 'five']
series = pd.Series(data, index=index)
print(series)
# Access an element using the index
print(series['three'])
# Modify an element in the Series
series['three'] = 33
print(series)
# Add a new element to the Series
series['six'] = 6
print(series)
# Remove an element from the Series
series = series.drop('six')
print(series)
# Perform element-wise operations on the Series
series_squared = series ** 2
print(series_squared)
Now you have learned how to create and modify a Pandas Series. Practice these concepts to become more familiar with the basics of Pandas data structures.
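One detail of the Series example is worth a second look: drop() returns a new Series, which is why the result is assigned back to series. The sketch below illustrates this with a hypothetical prices Series built from a dictionary (the name and values are assumptions made for illustration).

```python
import pandas as pd

# Dictionary keys become the index labels (names are illustrative)
prices = pd.Series({"apple": 3, "banana": 1, "cherry": 7})

# drop() returns a new Series; prices itself is unchanged
without_banana = prices.drop("banana")

print(list(prices.index))
print(list(without_banana.index))

# Element-wise operations keep the labels
doubled = prices * 2
print(doubled["cherry"])
```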
Creating and Modifying DataFrames
In this code example, we will learn how to create and modify DataFrames using the Pandas library.
First, let's import the Pandas library.
import pandas as pd
Now, let's create a simple DataFrame manually.
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
We can also build a DataFrame from an existing dataset. Let's load the famous "Iris" dataset, which ships with scikit-learn rather than with Pandas.
from sklearn import datasets
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
print(iris_df.head())
Now, let's modify the DataFrame by adding a new column.
# Adding a new column
df['Country'] = ['USA', 'USA', 'USA', 'USA']
print(df)
We can also modify the DataFrame by updating the values of an existing column.
# Updating the values of an existing column
df['Age'] = df['Age'] + 1
print(df)
Finally, let's remove a column from the DataFrame.
# Removing a column
df = df.drop(columns=['City'])
print(df)
In this code example, we have learned how to create a DataFrame, add a column, update a column, and remove a column using the Pandas library.
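As with Series, drop() on a DataFrame returns a new object instead of changing the original, and a new column can be derived from an existing one. The following is a minimal sketch of both points; the AgeInMonths column and the variable trimmed are hypothetical names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# A new column can be computed from an existing one
df["AgeInMonths"] = df["Age"] * 12

# drop() without reassignment leaves df untouched
trimmed = df.drop(columns=["Age"])

print(list(df.columns))
print(list(trimmed.columns))
```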
Accessing and Selecting Data in Series and DataFrames
In this example, we will learn how to access and select data in Pandas Series and DataFrames.
First, let's import the necessary libraries and create a Series and a DataFrame.
import pandas as pd
# Creating a Series
data_series = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}
data_df = pd.DataFrame(data)
Now, let's access and select data in the Series.
# Accessing data using index
print(data_series['A'])
# Accessing data using index position (iloc makes positional access explicit)
print(data_series.iloc[0])
# Slicing data by position
print(data_series.iloc[1:4])
Next, let's access and select data in the DataFrame.
# Accessing a column
print(data_df['Name'])
# Accessing a row using iloc
print(data_df.iloc[0])
# Accessing a row using loc
print(data_df.loc[0])
# Accessing a specific element using iloc
print(data_df.iloc[0, 0])
# Accessing a specific element using loc
print(data_df.loc[0, 'Name'])
# Slicing data using iloc
print(data_df.iloc[1:3, 1:3])
# Slicing data using loc
print(data_df.loc[1:2, ['Age', 'City']])
In this example, we learned how to access and select data in Pandas Series and DataFrames using index, position, iloc, loc, and slicing.
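A point that is easy to miss is that loc and iloc treat slice endpoints differently: iloc excludes the end position, while loc includes the end label. The sketch below, which reuses a shortened version of the lesson's DataFrame, shows the difference and also what happens once the index is no longer the default 0, 1, 2, ... range. The variable names by_position, by_label, and indexed are illustrative.

```python
import pandas as pd

data_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
})

# iloc slices by position and excludes the end position
by_position = data_df.iloc[1:3]
# loc slices by label and includes the end label
by_label = data_df.loc[1:3]

print(len(by_position))
print(len(by_label))

# With a non-default index, labels and positions no longer coincide
indexed = data_df.set_index('Name')
print(indexed.loc['Bob', 'Age'])
print(indexed.iloc[1, 0])
```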
Performing Arithmetic Operations with Series and DataFrames
In this code example, we will demonstrate how to perform arithmetic operations with Series and DataFrames in Pandas. We will create two DataFrames and a Series, and then perform addition, subtraction, multiplication, and division operations.
import pandas as pd
# Creating two DataFrames
data1 = {'A': [1, 2, 3], 'B': [4, 5, 6]}
data2 = {'A': [4, 5, 6], 'B': [7, 8, 9]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Creating a Series
s = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
# Performing arithmetic operations
addition = df1 + df2
subtraction = df1 - df2
multiplication = df1 * df2
division = df1 / df2
# Performing arithmetic operations with DataFrame and Series
addition_with_series = df1 + s
subtraction_with_series = df1 - s
multiplication_with_series = df1 * s
division_with_series = df1 / s
print("Addition:\n", addition)
print("\nSubtraction:\n", subtraction)
print("\nMultiplication:\n", multiplication)
print("\nDivision:\n", division)
print("\nAddition with Series:\n", addition_with_series)
print("\nSubtraction with Series:\n", subtraction_with_series)
print("\nMultiplication with Series:\n", multiplication_with_series)
print("\nDivision with Series:\n", division_with_series)
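In this example, we first created two DataFrames, df1 and df2, and a Series, s. We then performed addition, subtraction, multiplication, and division between the two DataFrames and between df1 and s. DataFrame-to-DataFrame operations are applied element by element on matching row and column labels. When a DataFrame is combined with a Series, Pandas matches the Series index against the DataFrame's column labels, so the value for 'A' is applied to column A and the value for 'B' to column B. Because s also has a label 'C' that df1 lacks, each DataFrame-and-Series result contains an extra column C filled with NaN.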
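The label alignment between a DataFrame and a Series can be checked directly. The sketch below repeats the lesson's df1 and s, then adds a hypothetical row_offsets Series (not part of the lesson) to show how the add() method with axis=0 aligns on row labels instead of column labels.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
s = pd.Series([10, 20, 30], index=['A', 'B', 'C'])

# Default behavior: the Series index is matched against column labels.
# 'C' has no matching column in df1, so the result gains a NaN column.
result = df1 + s
print(result)

# To apply one value per row instead, align on the row index with axis=0
row_offsets = pd.Series([100, 200, 300])  # index 0, 1, 2 matches df1's rows
by_row = df1.add(row_offsets, axis=0)
print(by_row)
```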
Handling Missing Data in Series and DataFrames
In this code example, we will demonstrate how to handle missing data in Pandas Series and DataFrames. We will use the iris dataset that ships with plotly.express for this example.
First, let's import Pandas and plotly.express, then load the iris dataset.
import pandas as pd
import plotly.express as px
# Load the iris dataset
iris = px.data.iris()
# Display the first 5 rows of the dataset
print(iris.head())
Now, let's create a DataFrame with some missing values.
# Introduce missing values to the iris dataset
import numpy as np
iris_with_missing_values = iris.copy()
iris_with_missing_values.loc[[2, 6, 25, 36, 55], "sepal_length"] = np.nan
iris_with_missing_values.loc[[10, 16, 45, 75, 90], "sepal_width"] = np.nan
# Display the first 10 rows of the dataset with missing values
print(iris_with_missing_values.head(10))
To handle missing data, we can use the following methods:
- Drop missing values using dropna()
- Fill missing values using fillna()
- Interpolate missing values using interpolate()
# 1. Drop missing values
dropped_missing_values = iris_with_missing_values.dropna()
print("Dropped missing values:")
print(dropped_missing_values.head())
# 2. Fill missing values with a constant value (e.g., 0)
filled_missing_values = iris_with_missing_values.fillna(0)
print("\nFilled missing values with 0:")
print(filled_missing_values.head(10))
# 3. Fill missing values by carrying the last valid value forward (forward fill)
# ffill() replaces fillna(method="ffill"), which is deprecated in recent Pandas versions
forward_filled_missing_values = iris_with_missing_values.ffill()
print("\nForward filled missing values:")
print(forward_filled_missing_values.head(10))
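# 4. Interpolate missing values (linear interpolation by default)
# Only numeric columns are interpolated; the text column "species" is left as-is
interpolated_missing_values = iris_with_missing_values.copy()
numeric_columns = interpolated_missing_values.select_dtypes(include="number").columns
interpolated_missing_values[numeric_columns] = interpolated_missing_values[numeric_columns].interpolate()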
print("\nInterpolated missing values:")
print(interpolated_missing_values.head(10))
These are some of the ways to handle missing data in Pandas Series and DataFrames.
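Before choosing one of these strategies, it usually helps to count how many values are missing in each column. The sketch below does this on a small hand-made DataFrame (the four rows are invented for illustration and are not taken from the iris dataset), then applies the same three strategies so the results are easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Tiny illustrative table with one gap per column
df = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 4.6],
    "sepal_width": [3.5, 3.0, np.nan, 3.1],
})

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop rows that contain any missing value
dropped = df.dropna()

# Carry the previous valid value forward
forward_filled = df.ffill()

# Linear interpolation between the neighboring valid values
interpolated = df.interpolate()

print(dropped)
print(forward_filled)
print(interpolated)
```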
Exercises
1. Creating a Pandas DataFrame from a Built-in Plotly Dataset Gapminder
Instruction
In this exercise, you will create a Pandas DataFrame from a built-in dataset provided by the plotly library and display the first few rows of the dataset. Follow the steps below:
- Import the necessary libraries: Pandas and plotly.express.
- Load the built-in plotly.express dataset 'gapminder'.
- Display the first few rows of the DataFrame using the head() function.
- Write the code in the provided code block below.
- (Bonus) Can you also display the last 10 rows? Can you guess which function you will use?
My Solution
# Your solution goes here
Hint
- To import Pandas, use import pandas as pd, and to import plotly.express, use import plotly.express as px.
- To load the built-in 'gapminder' dataset, use px.data.gapminder().
- px.data.gapminder() already returns a Pandas DataFrame, so no separate pd.DataFrame(data) conversion is needed.
- To display the first few rows of the DataFrame, use the head() function on the DataFrame object.
- To display the last few rows, use the tail() function.
Solution
# Import libraries
import pandas as pd
import plotly.express as px
# Load the data
df = px.data.gapminder()
# Display the first few rows of the DataFrame
df.head()
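For the bonus question, the behavior of head() and tail() can be sketched on a small stand-in DataFrame, so it runs without plotly installed. The numbers DataFrame below is hypothetical; the same calls apply to the gapminder DataFrame.

```python
import pandas as pd

# Stand-in data: 20 rows numbered 0 to 19
numbers = pd.DataFrame({"n": range(20)})

# head() returns the first 5 rows by default
first_rows = numbers.head()

# tail(10) returns the last 10 rows
last_rows = numbers.tail(10)

print(first_rows)
print(last_rows)
```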
2. Creating a DataFrame from a Dictionary
Instruction
In this exercise, you will create a DataFrame from a dictionary and display its contents. Follow these steps:
- Import the necessary libraries.
- Create a dictionary with keys as column names and values as lists containing the corresponding data.
- Convert the dictionary into a DataFrame using the pd.DataFrame() function.
- Display the contents of the DataFrame using the print() function.
Sample Dictionary Data
| Name | Age | City |
|---|---|---|
| Alice | 30 | Paris |
| Bob | 25 | London |
| Carol | 35 | Rome |
Expected Result
    Name  Age    City
0  Alice   30   Paris
1    Bob   25  London
2  Carol   35    Rome
My Solution
# Your solution goes here
Hint
Remember to import the pandas library and use the pd.DataFrame() function to convert the dictionary into a DataFrame. Make sure your dictionary keys match the column names, and the lists inside the dictionary contain the corresponding data.
Solution
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Carol'],
'Age': [30, 25, 35],
'City': ['Paris', 'London', 'Rome']
}
df = pd.DataFrame(data)
print(df)
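The same table can also be described row by row. As a sketch of an alternative that the exercise does not require, pd.DataFrame() accepts a list of dictionaries in which each dictionary is one row; the variable names below are illustrative.

```python
import pandas as pd

# Column-oriented input: one list per column (as in the solution)
df_from_dict = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [30, 25, 35],
    'City': ['Paris', 'London', 'Rome'],
})

# Row-oriented input: one dictionary per row
records = [
    {'Name': 'Alice', 'Age': 30, 'City': 'Paris'},
    {'Name': 'Bob', 'Age': 25, 'City': 'London'},
    {'Name': 'Carol', 'Age': 35, 'City': 'Rome'},
]
df_from_records = pd.DataFrame(records)

print(df_from_records)
```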