Data Structures in Pandas

Learn Data Structures in Pandas in SQLPad's Python Pandas Mastery course with practical examples and guided lessons.

Welcome to the lesson on Data Structures in Pandas, part of the Python Pandas Mastery: An Interactive and Practical Guide to Data Analysis course. In this lesson, we explore the two main data structures provided by Pandas: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table. You will learn the key features of each structure and how to create and manipulate them. We will also discuss why these structures matter for data analysis and how they help you work more efficiently.
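
As a first orientation, the sketch below shows how the two structures relate: selecting one column of a DataFrame gives back a Series. The names ages and people and their values are made up for illustration and are not part of the course material.

```python
import pandas as pd

# A Series is one-dimensional: a sequence of values plus an index of labels.
ages = pd.Series([25, 30, 35], index=["alice", "bob", "carol"], name="age")

# A DataFrame is two-dimensional: several columns that share one row index.
people = pd.DataFrame(
    {"age": [25, 30, 35], "city": ["Paris", "London", "Rome"]},
    index=["alice", "bob", "carol"],
)

# Selecting a single column of a DataFrame returns a Series.
age_column = people["age"]

print(type(ages).__name__, ages.shape)
print(type(people).__name__, people.shape)
print(type(age_column).__name__, age_column.shape)
```

The shape attribute makes the difference visible: a Series reports one dimension, a DataFrame reports rows and columns.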

Creating and Modifying Series

In this code example, we will demonstrate how to create a Pandas Series and perform basic modifications.

Import the Pandas library

import pandas as pd

Create a Pandas Series

data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)

Add an index to the Series

index = ['one', 'two', 'three', 'four', 'five']
series = pd.Series(data, index=index)
print(series)

Access an element using the index

print(series['three'])

Modify an element in the Series

series['three'] = 33
print(series)

Add a new element to the Series

series['six'] = 6
print(series)

Remove an element from the Series

series = series.drop('six')
print(series)

Perform element-wise operations on the Series

series_squared = series ** 2
print(series_squared)
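
The element-wise operation above involves a single Series. When two Series are combined, Pandas matches values by index label rather than by position. The following is a minimal sketch of that behavior; the prices and quantities Series are hypothetical examples.

```python
import pandas as pd

# Two Series whose labels overlap only partly and appear in a different order.
prices = pd.Series([1.5, 2.0, 4.0], index=["apple", "pear", "kiwi"])
quantities = pd.Series([10, 3, 5], index=["kiwi", "apple", "plum"])

# Values are paired by label, not by position.
totals = prices * quantities
print(totals)

# Labels found in only one of the two Series produce NaN; dropna() removes them.
print(totals.dropna())
```

Here only apple and kiwi exist in both Series, so only those two labels get a numeric result.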

Now you have learned how to create and modify a Pandas Series. Practice these concepts to become more familiar with the basics of Pandas data structures.

Creating and Modifying DataFrames

In this code example, we will learn how to create and modify DataFrames using the Pandas library.

First, let's import the Pandas library.

import pandas as pd

Now, let's create a simple DataFrame manually.

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

We can also build a DataFrame from an existing dataset. Let's load the famous "Iris" dataset that ships with scikit-learn (the scikit-learn package must be installed).

from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

print(iris_df.head())

Now, let's modify the DataFrame by adding a new column.

# Adding a new column
df['Country'] = ['USA', 'USA', 'USA', 'USA']
print(df)

We can also modify the DataFrame by updating the values of an existing column.

# Updating the values of an existing column
df['Age'] = df['Age'] + 1
print(df)

Finally, let's remove a column from the DataFrame.

# Removing a column
df = df.drop(columns=['City'])
print(df)
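
Note that the example above assigns the result of drop() back to df. The sketch below, using a small hypothetical people DataFrame, illustrates why: drop() returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "City": ["New York", "Chicago"],
})

# drop() returns a new DataFrame; `people` still has all three columns.
without_city = people.drop(columns=["City"])
print(list(people.columns))
print(list(without_city.columns))

# Rows are dropped the same way, by index label.
without_first_row = people.drop(index=[0])
print(without_first_row)
```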

In this code example, we have learned how to create and modify DataFrames using the Pandas library.

Accessing and Selecting Data in Series and DataFrames

In this example, we will learn how to access and select data in Pandas Series and DataFrames.

First, let's import the necessary libraries and create a Series and a DataFrame.

import pandas as pd

# Creating a Series
data_series = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}
data_df = pd.DataFrame(data)

Now, let's access and select data in the Series.

# Accessing data using index
print(data_series['A'])
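# Accessing data by integer position (use iloc; plain [0] on a labeled Series is deprecated)
print(data_series.iloc[0])
# Slicing data by position (the end position 4 is excluded)
print(data_series.iloc[1:4])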

Next, let's access and select data in the DataFrame.

# Accessing a column
print(data_df['Name'])
# Accessing a row using iloc
print(data_df.iloc[0])
# Accessing a row using loc
print(data_df.loc[0])
# Accessing a specific element using iloc
print(data_df.iloc[0, 0])
# Accessing a specific element using loc
print(data_df.loc[0, 'Name'])
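# Slicing data using iloc (rows and columns at positions 1 and 2; the end position 3 is excluded)
print(data_df.iloc[1:3, 1:3])
# Slicing data using loc (rows labeled 1 through 2; the end label is included)
print(data_df.loc[1:2, ['Age', 'City']])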

In this example, we learned how to access and select data in Pandas Series and DataFrames using index, position, iloc, loc, and slicing.
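
With the default integer index, loc and iloc can look interchangeable. The sketch below uses a hypothetical DataFrame with string row labels to make the difference visible: iloc counts positions and excludes the end of a slice, while loc uses labels and includes the end.

```python
import pandas as pd

# Hypothetical data with string row labels instead of the default 0, 1, 2, ...
letters = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# iloc slices by position: positions 1 and 2 (position 3 is excluded).
by_position = letters.iloc[1:3]

# loc slices by label: labels "b" through "c" (the end label is included).
by_label = letters.loc["b":"c"]

print(by_position)
print(by_label)

# Both selections contain the same rows.
print(by_position.equals(by_label))
```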

Performing Arithmetic Operations with Series and DataFrames

In this code example, we will demonstrate how to perform arithmetic operations with Series and DataFrames in Pandas. We will create two DataFrames and a Series, and then perform addition, subtraction, multiplication, and division operations.

import pandas as pd

# Creating two DataFrames
data1 = {'A': [1, 2, 3], 'B': [4, 5, 6]}
data2 = {'A': [4, 5, 6], 'B': [7, 8, 9]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Creating a Series
s = pd.Series([10, 20, 30], index=['A', 'B', 'C'])

# Performing arithmetic operations
addition = df1 + df2
subtraction = df1 - df2
multiplication = df1 * df2
division = df1 / df2

# Performing arithmetic operations with DataFrame and Series
addition_with_series = df1 + s
subtraction_with_series = df1 - s
multiplication_with_series = df1 * s
division_with_series = df1 / s
print("Addition:\n", addition)
print("\nSubtraction:\n", subtraction)
print("\nMultiplication:\n", multiplication)
print("\nDivision:\n", division)
print("\nAddition with Series:\n", addition_with_series)
print("\nSubtraction with Series:\n", subtraction_with_series)
print("\nMultiplication with Series:\n", multiplication_with_series)
print("\nDivision with Series:\n", division_with_series)

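In this example, we first created two DataFrames, df1 and df2, and a Series s. We then performed addition, subtraction, multiplication, and division between the two DataFrames, which pairs values on matching row and column labels, and between the DataFrame and the Series. In the DataFrame and Series case, Pandas matches the Series index (A, B, C) against the DataFrame's column labels (A, B). Because df1 has no column C, each of those results contains an extra column C filled with NaN.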
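
The sketch below reuses df1 and s from the example to show how a Series is matched against a DataFrame, and two ways to control it. The row_offsets Series and its values are hypothetical additions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
s = pd.Series([10, 20, 30], index=["A", "B", "C"])

# Default behavior: the Series index is matched against the column labels.
# "C" has no counterpart in df1, so the result gains an all-NaN column "C".
by_columns = df1 + s
print(by_columns)

# Restricting the Series to df1's columns avoids the extra NaN column.
aligned = df1 + s.reindex(df1.columns)
print(aligned)

# To match a Series against the row labels instead, use add() with axis="index".
row_offsets = pd.Series([100, 200, 300], index=[0, 1, 2])  # hypothetical values
by_rows = df1.add(row_offsets, axis="index")
print(by_rows)
```

The methods sub(), mul(), and div() accept the same axis argument for the other three operations.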

Handling Missing Data in Series and DataFrames

In this code example, we will demonstrate how to handle missing data in Pandas Series and DataFrames. We will use the iris dataset that ships with plotly.express for this example.

First, let's import Pandas and plotly.express, and load the iris dataset.

import pandas as pd
import plotly.express as px

# Load the iris dataset
iris = px.data.iris()

# Display the first 5 rows of the dataset
iris.head()

Now, let's create a DataFrame with some missing values.

# Introduce missing values to the iris dataset
import numpy as np
iris_with_missing_values = iris.copy()
iris_with_missing_values.loc[[2, 6, 25, 36, 55], "sepal_length"] = np.nan
iris_with_missing_values.loc[[10, 16, 45, 75, 90], "sepal_width"] = np.nan

# Display the first 10 rows of the dataset with missing values
iris_with_missing_values.head(10)

To handle missing data, we can use the following methods:

  1. Drop missing values using dropna()
  2. Fill missing values with a constant using fillna()
  3. Fill missing values by carrying the previous value forward (forward fill)
  4. Interpolate missing values using interpolate()

# 1. Drop missing values
dropped_missing_values = iris_with_missing_values.dropna()
print("Dropped missing values:")
print(dropped_missing_values.head())
# 2. Fill missing values with a constant value (e.g., 0)
filled_missing_values = iris_with_missing_values.fillna(0)
print("\nFilled missing values with 0:")
print(filled_missing_values.head(10))
# 3. Forward fill: replace each missing value with the previous valid value
# (fillna(method="ffill") is deprecated in recent Pandas versions; ffill() is the replacement)
forward_filled_missing_values = iris_with_missing_values.ffill()
print("\nForward filled missing values:")
print(forward_filled_missing_values.head(10))
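# 4. Interpolate missing values (linear interpolation, applied to the numeric columns only
# because the text column "species" cannot be interpolated)
numeric_columns = iris_with_missing_values.select_dtypes(include="number").columns
interpolated_missing_values = iris_with_missing_values.copy()
interpolated_missing_values[numeric_columns] = interpolated_missing_values[numeric_columns].interpolate()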
print("\nInterpolated missing values:")
print(interpolated_missing_values.head(10))

These are some of the ways to handle missing data in Pandas Series and DataFrames.
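
Before choosing one of these methods, it often helps to count how many values are missing in each column. The sketch below does that on a small hypothetical measurements DataFrame, then fills each gap with its column mean, which is one more common option alongside the ones shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
measurements = pd.DataFrame({
    "length": [5.0, np.nan, 7.0, 9.0],
    "width": [3.0, 4.0, np.nan, np.nan],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = measurements.isna().sum()
print(missing_per_column)

# mean() skips NaN by default; fillna() then fills each column with its own mean.
filled_with_mean = measurements.fillna(measurements.mean())
print(filled_with_mean)
```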

Exercises

1. Creating a Pandas DataFrame from a Built-in Plotly Dataset Gapminder

Instruction

In this exercise, you will create a Pandas DataFrame from a built-in dataset provided by the plotly library and display the first few rows of the dataset. Follow the steps below:

  1. Import the necessary libraries: Pandas and plotly.express.
  2. Load the built-in plotly.express dataset 'gapminder'.
  3. Display the first few rows of the DataFrame using the head() function.
  4. Write the code in the provided code block below.
  5. (Bonus) Can you also display the last 10 rows? Which function would you use?

My Solution

# Your solution goes here

Hint

  1. To import Pandas, use import pandas as pd; to import plotly.express, use import plotly.express as px.
  2. To load the built-in 'gapminder' dataset, use px.data.gapminder().
  3. px.data.gapminder() already returns a Pandas DataFrame, so no conversion with pd.DataFrame() is needed.
  4. To display the first few rows of the DataFrame, use the head() function on the DataFrame object.
  5. To display the last few rows, use the tail() function, for example tail(10).

Solution


# Import libraries
import pandas as pd
import plotly.express as px

# Load the data
df = px.data.gapminder()

# Display the first few rows of the DataFrame
df.head()

2. Creating a DataFrame from a Dictionary

Instruction

In this exercise, you will create a DataFrame from a dictionary and display its contents. Follow these steps:

  1. Import the necessary libraries.
  2. Create a dictionary with keys as column names and values as lists containing the corresponding data.
  3. Convert the dictionary into a DataFrame using the pd.DataFrame() function.
  4. Display the contents of the DataFrame using the print() function.

Sample Dictionary Data

Name Age City
Alice 30 Paris
Bob 25 London
Carol 35 Rome

Expected Result

    Name  Age    City
0  Alice   30   Paris
1    Bob   25  London
2  Carol   35    Rome

My Solution

# Your solution goes here

Hint

Remember to import the pandas library and use the pd.DataFrame() function to convert the dictionary into a DataFrame. Make sure your dictionary keys match the column names, and the lists inside the dictionary contain the corresponding data.

Solution

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [30, 25, 35],
    'City': ['Paris', 'London', 'Rome']
}

df = pd.DataFrame(data)
print(df)
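
The same table can also be built row by row. As a sketch of an alternative to the solution above, the code below passes a list of dictionaries, one per row, to pd.DataFrame() and checks that both approaches produce the same DataFrame. The variable names are chosen for illustration only.

```python
import pandas as pd

# One dictionary per row; the keys become the column names.
records = [
    {"Name": "Alice", "Age": 30, "City": "Paris"},
    {"Name": "Bob", "Age": 25, "City": "London"},
    {"Name": "Carol", "Age": 35, "City": "Rome"},
]
df_from_records = pd.DataFrame(records)

# The column-oriented dictionary used in the solution.
df_from_columns = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [30, 25, 35],
    "City": ["Paris", "London", "Rome"],
})

print(df_from_records)
print(df_from_records.equals(df_from_columns))
```

The row-oriented form is convenient when the data arrives one record at a time, for example from a list of parsed JSON objects.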