Lesson

Introduction to Pandas

Learn Pandas fundamentals in SQLPad's Data Science in Action course with practical examples and guided lessons.

In this lesson, we will cover the basics of Pandas, a powerful and widely-used Python library for data manipulation and analysis. Pandas provides data structures and functions needed to work with structured data seamlessly. We will learn how to create and manipulate Pandas data structures called Series and DataFrames.

What is Pandas?

Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is built on top of the NumPy library and lets us perform data manipulation, cleaning, and analysis tasks with just a few lines of code.

Installing Pandas

You do not need to install Pandas for this course, as we will provide an online playground with everything ready to go.

Pandas Data Structures: Series and DataFrame

Pandas has two main data structures:

  1. Series: A one-dimensional labeled array capable of holding any data type.
  2. DataFrame: A two-dimensional labeled data structure with columns of potentially different types.

Let's start by importing the Pandas library and exploring these data structures.

import pandas as pd

Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

Let's create a simple Pandas Series:

import pandas as pd

data = [3, 5, 7, 9]
ser = pd.Series(data)

print(ser)
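
The index mentioned above does not have to be the default 0, 1, 2, ... numbering. As a minimal sketch (the string labels "a" to "d" are made up for illustration), the same values can be given explicit labels and then looked up either by label or by position:

```python
import pandas as pd

# Same values as above, but with explicit string labels instead of the default 0..3
ser = pd.Series([3, 5, 7, 9], index=["a", "b", "c", "d"])

print(ser.index.tolist())  # the labels: ['a', 'b', 'c', 'd']
print(ser["b"])            # look up by label -> 5
print(ser.iloc[0])         # look up by position -> 3
```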

DataFrame

A DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns). It can be thought of as a collection of Series sharing the same index. We can create a DataFrame from various data sources such as dictionaries, lists, or even NumPy arrays.

Let's create a simple Pandas DataFrame:

import pandas as pd

data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}

df = pd.DataFrame(data)

print(df)
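
To make the "collection of Series sharing the same index" idea concrete, here is a small sketch that reuses the same A/B/C values. The name df_from_rows is hypothetical; it only shows that a list of rows plus column names is another way to build the same table.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

col_a = df["A"]                      # a single column is a Series
print(type(col_a).__name__)          # Series
print(df.shape)                      # (3, 3): three rows, three columns
print(col_a.index.equals(df.index))  # True: the column shares the DataFrame's index

# The same table built from a list of rows plus explicit column names
df_from_rows = pd.DataFrame(
    [[1, 4, 7], [2, 5, 8], [3, 6, 9]],
    columns=["A", "B", "C"],
)
print(df.equals(df_from_rows))
```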

Load Built-in Datasets

For this course, we will use the sample datasets that ship with Plotly Express, which are returned as Pandas DataFrames. Let's load the famous "Iris" dataset using the Plotly Express library:

import plotly.express as px

iris_df = px.data.iris()
print(iris_df.head())

Now that we have loaded the Iris dataset into a DataFrame, let's explore some basic Pandas operations and functions.

Basic Operations and Functions

DataFrame Information

To get basic information about a DataFrame, such as the number of rows and columns, data types, and memory usage, we can use the info() method:

iris_df.info()

Descriptive Statistics

To get summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) of the numerical columns in a DataFrame, we can use the describe() method:

iris_df.describe()
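
If the playground is not at hand, the same inspection calls can be tried on a tiny hand-made table. This is only a sketch: toy_df and its values are made up and merely borrow two Iris column names; they are not real measurements.

```python
import pandas as pd

# Made-up toy values that borrow two Iris column names; not real measurements
toy_df = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 2.5, 3.0],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

print(toy_df.shape)   # (rows, columns)
print(toy_df.dtypes)  # data type of each column

# describe() summarizes only the numeric columns by default,
# so the text column "species" is left out of the result
stats = toy_df.describe()
print(stats.loc[["count", "mean", "min", "max"]])
```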

Selecting Columns

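To select a single column from a DataFrame, we can use either the column name in square brackets ['column_name'] or the column name as an attribute .column_name. The bracket form works for any column name; the attribute form works only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method: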

species_column = iris_df['species']
species_column
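
The two access styles can be compared side by side. The sketch below uses a made-up two-row table (toy_df is hypothetical, with borrowed Iris column names) and also shows that passing a list of names returns a DataFrame rather than a Series:

```python
import pandas as pd

# Made-up toy values; only the column names are borrowed from Iris
toy_df = pd.DataFrame({
    "sepal_length": [5.1, 6.3],
    "sepal_width": [3.5, 2.9],
    "species": ["setosa", "virginica"],
})

by_brackets = toy_df["species"]  # works for any column name
by_attribute = toy_df.species    # works only for valid Python identifiers
print(by_brackets.equals(by_attribute))

# A list of names inside the brackets returns a DataFrame, not a Series
subset = toy_df[["sepal_length", "species"]]
print(type(subset).__name__, subset.shape)
```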

Filtering Data

To filter rows based on a condition, we can use boolean indexing. For example, let's filter the Iris dataset to only include rows where the sepal_width is greater than 3:

filtered_df = iris_df[iris_df['sepal_width'] > 3]
filtered_df
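
Boolean indexing works in two steps: the comparison produces a Series of True/False values, and that Series picks the rows to keep. A minimal sketch on made-up data (toy_df is hypothetical) that also combines two conditions:

```python
import pandas as pd

# Made-up toy values; only the column names are borrowed from Iris
toy_df = pd.DataFrame({
    "sepal_width": [3.5, 2.9, 3.4, 3.2],
    "species": ["setosa", "versicolor", "setosa", "virginica"],
})

mask = toy_df["sepal_width"] > 3  # boolean Series, one value per row
print(mask.tolist())              # [True, False, True, True]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
wide_setosa = toy_df[(toy_df["sepal_width"] > 3) & (toy_df["species"] == "setosa")]
print(wide_setosa)
```

Python's plain `and` / `or` keywords do not work element-wise on Series, which is why `&` and `|` are used here.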

Grouping Data

We can group data based on the values in one or more columns using the groupby() method. For example, let's group the Iris dataset by the species column and calculate the mean of the other columns:

grouped_df = iris_df.groupby('species').mean()
grouped_df
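
Grouping can also target a single column or apply a different aggregation to each column. The following is a sketch on made-up values (toy_df, mean_width, and summary are hypothetical names), not output from the real Iris data:

```python
import pandas as pd

# Made-up toy values; only the column names are borrowed from Iris
toy_df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_width": [3.0, 4.0, 2.5, 3.5],
    "petal_length": [1.0, 2.0, 5.0, 6.0],
})

# Mean of a single column per group; the group keys become the index
mean_width = toy_df.groupby("species")["sepal_width"].mean()
print(mean_width)

# A different aggregation per column with agg()
summary = toy_df.groupby("species").agg({"sepal_width": "mean", "petal_length": "max"})
print(summary)
```

Note that after grouping, species is the index of the result rather than a regular column. If a DataFrame contains text columns other than the grouping key, pandas 2.0 and later expect `mean(numeric_only=True)` so that those columns are skipped.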

Sorting Data

To sort a DataFrame based on the values in one or more columns, we can use the sort_values() method. For example, let's sort the Iris dataset by the sepal_length column in descending order:

sorted_df = iris_df.sort_values('sepal_length', ascending=False)
sorted_df
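
Sorting by more than one column takes a list of names and, optionally, a matching list of directions. A small sketch on made-up data (toy_df and sorted_toy are hypothetical names):

```python
import pandas as pd

# Made-up toy values; only the column names are borrowed from Iris
toy_df = pd.DataFrame({
    "species": ["virginica", "setosa", "setosa"],
    "sepal_length": [6.3, 5.1, 4.9],
})

# Sort by species A-Z, then by sepal_length from largest to smallest within each species
sorted_toy = toy_df.sort_values(["species", "sepal_length"], ascending=[True, False])
print(sorted_toy)

# sort_values() returns a new DataFrame and keeps the original row labels;
# reset_index(drop=True) renumbers the rows from 0
print(sorted_toy.reset_index(drop=True))
```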

In the next lesson, we will learn how to create interactive visualizations using Plotly and Pandas together.

Exercises

1. Introduction to Pandas

Instruction

In this exercise, you will practice basic Pandas operations using the Iris dataset. Follow these steps:

  1. Import the necessary libraries:

import pandas as pd
import plotly.express as px

  2. Load the Iris dataset into a DataFrame called iris_df:

iris_df = px.data.iris()

  3. Filter the Iris dataset to only include rows where the sepal_width is greater than 3. Assign the result to a new DataFrame called filtered_df.

  4. Group the filtered dataset by the species column and calculate the mean of the other columns. Assign the result to a new DataFrame called grouped_df.

  5. Create a bar plot of the grouped_df DataFrame using Plotly Express, with species (which is the index of grouped_df after grouping) on the x-axis and the sepal_width column on the y-axis. Assign the result to a variable called fig.

  6. Finally, display the plot using fig.show().

My Solution

# Your solution goes here

Hint

Remember to use the groupby() method to group the data by the species column, and the mean() method to calculate the mean of the other columns. Use the px.bar() function to create a bar plot with species (the index of the grouped result) on the x-axis and the sepal_width column on the y-axis.

Solution

import pandas as pd
import plotly.express as px

iris_df = px.data.iris()
filtered_df = iris_df[iris_df['sepal_width'] > 3]
grouped_df = filtered_df.groupby('species').mean()
fig = px.bar(grouped_df, x=grouped_df.index, y='sepal_width')
fig.show()