Lesson
Introduction to Pandas
Learn Pandas fundamentals in SQLPad's Data Science in Action course with practical examples and guided lessons.
In this lesson, we will cover the basics of Pandas, a powerful and widely-used Python library for data manipulation and analysis. Pandas provides data structures and functions needed to work with structured data seamlessly. We will learn how to create and manipulate Pandas data structures called Series and DataFrames.
What is Pandas?
Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is built on top of the NumPy library and lets us perform data manipulation, cleaning, and analysis tasks with just a few lines of code.
Installing Pandas
You do not need to install Pandas for this course, as we will provide an online playground with everything ready to go.
Pandas Data Structures: Series and DataFrame
Pandas has two main data structures:
- Series: A one-dimensional labeled array capable of holding any data type.
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
Let's start by importing the Pandas library and exploring these data structures.
import pandas as pd
Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
Let's create a simple Pandas Series:
import pandas as pd
data = [3, 5, 7, 9]
ser = pd.Series(data)
print(ser)
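The Series above gets a default integer index (0, 1, 2, ...). As a minimal sketch of how the index works as a set of labels, the example below passes an explicit index; the labels 'a' to 'd' are made up for illustration.

```python
import pandas as pd

# Default index: integer positions 0..3
ser = pd.Series([3, 5, 7, 9])

# Custom index: the labels 'a'..'d' are invented for this example
labeled = pd.Series([3, 5, 7, 9], index=['a', 'b', 'c', 'd'])

print(list(ser.index))      # [0, 1, 2, 3]
print(list(labeled.index))  # ['a', 'b', 'c', 'd']

# Values can be looked up by label
print(labeled['c'])         # 7
```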
DataFrame
A DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns). It can be thought of as a collection of Series sharing the same index. We can create a DataFrame from various data sources such as dictionaries, lists, or even NumPy arrays.
Let's create a simple Pandas DataFrame:
import pandas as pd
data = {
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
}
df = pd.DataFrame(data)
print(df)
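The dictionary form is only one option. The sketch below builds DataFrames from a list of rows and from a NumPy array, and checks that a single column comes back as a Series sharing the DataFrame's index. The variable names and values are illustrative assumptions, not part of the course material.

```python
import numpy as np
import pandas as pd

# From a list of rows, with explicit column names
df_from_lists = pd.DataFrame(
    [[1, 4, 7], [2, 5, 8], [3, 6, 9]],
    columns=['A', 'B', 'C'],
)

# From a 2-D NumPy array: [[1, 2], [3, 4], [5, 6]]
arr = np.arange(1, 7).reshape(3, 2)
df_from_array = pd.DataFrame(arr, columns=['x', 'y'])

# Selecting one column returns a Series with the same index as the DataFrame
col = df_from_lists['A']
print(type(col).__name__)                      # Series
print(col.index.equals(df_from_lists.index))   # True
print(df_from_array.shape)                     # (3, 2)
```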
Load Built-in Datasets
For this course, we will use built-in datasets provided by Plotly, which are returned as Pandas DataFrames. Let's load the famous "Iris" dataset using the Plotly Express library:
import plotly.express as px
iris_df = px.data.iris()
print(iris_df.head())
Now that we have loaded the Iris dataset into a DataFrame, let's explore some basic Pandas operations and functions.
Basic Operations and Functions
DataFrame Information
To get basic information about a DataFrame, such as the number of rows and columns, data types, and memory usage, we can use the info() function:
iris_df.info()
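When only one piece of that summary is needed, the shape, columns, and dtypes attributes give it directly. The sketch below uses a small hand-made DataFrame as a stand-in for iris_df so that it runs without Plotly; its values are invented for illustration and are not rows from the real dataset.

```python
import pandas as pd

# Toy stand-in for iris_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'sepal_length': [5.1, 6.4, 5.9],
    'sepal_width': [3.5, 3.2, 3.0],
    'species': ['setosa', 'versicolor', 'virginica'],
})

print(toy_df.shape)          # (3, 3): 3 rows, 3 columns
print(list(toy_df.columns))  # ['sepal_length', 'sepal_width', 'species']
print(toy_df.dtypes)         # one data type per column
```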
Descriptive Statistics
To get summary statistics of the numerical columns in a DataFrame, we can use the describe() function:
iris_df.describe()
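The result of describe() is itself a DataFrame, indexed by statistic name, so individual values can be read with .loc. A minimal sketch on a toy DataFrame (values invented for illustration, not taken from the Iris data):

```python
import pandas as pd

# Toy data; the numbers are made up so the statistics are easy to check by hand
toy_df = pd.DataFrame({
    'sepal_length': [5.0, 6.0, 7.0],
    'sepal_width': [3.0, 3.5, 4.0],
    'species': ['setosa', 'versicolor', 'virginica'],
})

stats = toy_df.describe()

# Rows are the statistics, columns are the numeric columns only
print(list(stats.index))    # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(list(stats.columns))  # ['sepal_length', 'sepal_width']

# Read a single statistic by row and column label
print(stats.loc['mean', 'sepal_width'])  # 3.5
```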
Selecting Columns
To select a single column from a DataFrame, we can use either the column name in square brackets ['column_name'] or the column name as an attribute .column_name:
species_column = iris_df['species']
species_column
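The example above uses only the bracket form. The sketch below shows the attribute form next to it, plus selecting several columns at once by passing a list of names. It runs on a toy DataFrame whose values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for iris_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'sepal_length': [5.1, 6.4, 5.9],
    'sepal_width': [3.5, 3.2, 3.0],
    'species': ['setosa', 'versicolor', 'virginica'],
})

# Bracket and attribute access return the same Series
by_bracket = toy_df['sepal_width']
by_attribute = toy_df.sepal_width
print(by_bracket.equals(by_attribute))  # True

# A list of column names returns a DataFrame, not a Series
subset = toy_df[['sepal_length', 'species']]
print(type(subset).__name__)            # DataFrame
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.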
Filtering Data
To filter rows based on a condition, we can use boolean indexing. For example, let's filter the Iris dataset to only include rows where the sepal_width is greater than 3:
filtered_df = iris_df[iris_df['sepal_width'] > 3]
filtered_df
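Boolean indexing also accepts combined conditions. As a sketch, conditions are joined with & (and) or | (or), and each condition needs its own parentheses because of operator precedence. The toy data below is invented for illustration.

```python
import pandas as pd

# Toy stand-in for iris_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'sepal_width': [3.5, 2.9, 3.2, 3.0],
    'species': ['setosa', 'setosa', 'versicolor', 'virginica'],
})

# The condition itself is a boolean Series, one True/False per row
mask = toy_df['sepal_width'] > 3
print(mask.tolist())  # [True, False, True, False]

# AND: wide sepals that are also setosa
wide_setosa = toy_df[(toy_df['sepal_width'] > 3) & (toy_df['species'] == 'setosa')]
print(len(wide_setosa))  # 1

# OR: wide sepals, or any virginica row
wide_or_virginica = toy_df[(toy_df['sepal_width'] > 3) | (toy_df['species'] == 'virginica')]
print(len(wide_or_virginica))  # 3
```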
Grouping Data
We can group data based on the values in one or more columns using the groupby() function. For example, let's group the Iris dataset by the species column and calculate the mean of the other columns:
grouped_df = iris_df.groupby('species').mean()
grouped_df
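Grouping does not have to aggregate every column the same way. The sketch below computes the mean of a single column per group, then uses agg() with named aggregations to apply different functions to different columns. The toy values and the output column names mean_width and max_length are assumptions made for this example.

```python
import pandas as pd

# Toy data; the numbers are made up so the group results are easy to check
toy_df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'sepal_width': [3.0, 4.0, 2.0, 3.0],
    'sepal_length': [5.0, 5.0, 6.0, 7.0],
})

# Mean of one column per species (returns a Series indexed by species)
mean_width = toy_df.groupby('species')['sepal_width'].mean()
print(mean_width['setosa'])      # 3.5
print(mean_width['versicolor'])  # 2.5

# Different aggregations per column; the result column names are hypothetical
summary = toy_df.groupby('species').agg(
    mean_width=('sepal_width', 'mean'),
    max_length=('sepal_length', 'max'),
)
print(summary)
```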
Sorting Data
To sort a DataFrame based on the values in one or more columns, we can use the sort_values() function. For example, let's sort the Iris dataset by the sepal_length column in descending order:
sorted_df = iris_df.sort_values('sepal_length', ascending=False)
sorted_df
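sort_values() also accepts a list of columns, with one ascending flag per column. The sketch below sorts a toy DataFrame by species (A to Z) and then by sepal_length (largest first). Note that the original row labels travel with the rows; reset_index(drop=True) renumbers them if that is preferred. The values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for iris_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'species': ['virginica', 'setosa', 'setosa'],
    'sepal_length': [6.3, 5.1, 4.9],
})

# Sort by species ascending, then by sepal_length descending within each species
sorted_multi = toy_df.sort_values(['species', 'sepal_length'], ascending=[True, False])
print(sorted_multi['sepal_length'].tolist())  # [5.1, 4.9, 6.3]
print(list(sorted_multi.index))               # [1, 2, 0]

# Renumber the rows 0..n-1 after sorting
renumbered = sorted_multi.reset_index(drop=True)
print(list(renumbered.index))                 # [0, 1, 2]
```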
In the next lesson, we will learn how to create interactive visualizations using Plotly and Pandas together.
Exercises
1. Introduction to Pandas
Instruction
In this exercise, you will practice basic Pandas operations using the Iris dataset. Follow these steps:
- Import the necessary libraries:
import pandas as pd
import plotly.express as px
- Load the Iris dataset into a DataFrame called iris_df:
iris_df = px.data.iris()
- Filter the Iris dataset to only include rows where the sepal_width is greater than 3. Assign the result to a new DataFrame called filtered_df.
- Group the filtered dataset by the species column and calculate the mean of the other columns. Assign the result to a new DataFrame called grouped_df.
- Create a bar plot of the grouped_df DataFrame using Plotly Express, with the species column on the x-axis and the sepal_width column on the y-axis. Assign the result to a variable called fig.
- Finally, display the plot using fig.show().
My Solution
# Your solution goes here
Hint
Remember to use the groupby() function to group the data by the species column, and the mean() function to calculate the mean of the other columns. Use the px.bar() function to create a bar plot with the species column on the x-axis and the sepal_width column on the y-axis.
Solution
import pandas as pd
import plotly.express as px
iris_df = px.data.iris()
filtered_df = iris_df[iris_df['sepal_width'] > 3]
grouped_df = filtered_df.groupby('species').mean()
fig = px.bar(grouped_df, x=grouped_df.index, y='sepal_width')
fig.show()