Data Visualization in Python
Data visualization is essential for exploring data, identifying patterns, and communicating insights. Python offers a rich ecosystem of libraries for creating everything from basic charts to interactive dashboards.
See the companion notebook for more extensive code examples.
Why Visualization Matters
Exploration: Understand the shape and distribution of your data.
Communication: Present insights effectively to different audiences.
Validation: Check model assumptions and evaluate results.
Core Libraries
Three essential libraries for Python visualization are:
Matplotlib: The foundational plotting library.
Seaborn: Simplifies statistical plots with attractive defaults.
Plotly: For interactive and web-ready graphics.
Matplotlib
Matplotlib is the foundational plotting library in Python, supporting a wide range of static, animated, and interactive visualizations.
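As a minimal sketch of how a Matplotlib figure is typically assembled, the example below uses the object-oriented interface, in which a Figure holds one or more Axes. The sine and cosine curves and all variable names are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One Figure containing two side-by-side Axes
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Each Axes is drawn on and labeled independently
ax_left.plot(x, np.sin(x))
ax_left.set_title("sin(x)")

ax_right.plot(x, np.cos(x), color="tab:orange")
ax_right.set_title("cos(x)")

# Adjust spacing so titles and tick labels do not overlap
fig.tight_layout()
```

Call plt.show() afterwards to display the figure when running as a script.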
Seaborn
Seaborn is built on Matplotlib and provides a high-level API for creating attractive statistical graphics.
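Because Seaborn is built on Matplotlib, its axes-level functions return a regular Matplotlib Axes that can be customized further. The sketch below assumes a small hand-made DataFrame with hypothetical column names (group, value) so that it runs without downloading a dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small hand-made table (illustrative values, not a real dataset)
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "value": [1.0, 2.0, 2.5, 3.5, 4.0, 5.0],
})

# barplot aggregates `value` per group (mean by default) and returns the Axes
ax = sns.barplot(data=df, x="group", y="value")

# `ax` is a Matplotlib Axes, so the usual Matplotlib methods apply
ax.set_title("Mean value per group")
ax.set_ylabel("Mean of value")
```

Call plt.show() afterwards to display the figure when running as a script.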
Plotly
Plotly creates interactive plots ideal for dashboards and web applications.
Common Plot Types
Below are essential plots for data analysis. Each section includes data requirements, ideal use cases, examples for multiple Python libraries, and an example image.
Histogram
Histograms display the distribution of a dataset.
- Type: Univariate
- Variables:
  - Single continuous variable
Matplotlib
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.show()
Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
data = sns.load_dataset('tips')
sns.histplot(data['total_bill'], bins=30)
plt.show()
Plotly
import plotly.express as px
df = px.data.tips()
fig = px.histogram(df, x='total_bill')
fig.show()
Use cases:
- Visualize data distribution
- Identify skewness and outliers
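To illustrate how a histogram can reveal skewness, the sketch below draws a synthetic right-skewed sample and marks its mean and median. The log-normal data and its parameters are assumptions chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic right-skewed sample (log-normal), used only for illustration
data = rng.lognormal(mean=0.0, sigma=0.75, size=1000)

mean = data.mean()
median = np.median(data)

fig, ax = plt.subplots()
ax.hist(data, bins=30)

# In a right-skewed distribution the long tail pulls the mean above the median
ax.axvline(mean, color="tab:red", linestyle="--", label=f"mean = {mean:.2f}")
ax.axvline(median, color="tab:green", linestyle=":", label=f"median = {median:.2f}")
ax.legend()
```

A visible gap between the two vertical lines, together with a tail stretching to one side, is a quick visual cue for skew.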
Bar Chart
Bar charts represent categorical data with rectangular bars.
- Type: Univariate or Bivariate
- Variables:
  - x: Categorical
  - y: Continuous
- Suitable for counts or aggregated values.
Matplotlib
import matplotlib.pyplot as plt
categories = ['A','B','C']
values = [3,7,5]
plt.bar(categories, values)
plt.show()
Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
data = sns.load_dataset('tips')
sns.barplot(x='day', y='total_bill', data=data)
plt.show()
Plotly
import plotly.express as px
categories = ['A','B','C']
values = [3,7,5]
fig = px.bar(x=categories, y=values)
fig.show()
Use cases:
- Compare categories
- Show frequency or aggregated values
- Identify highest and lowest categories
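When the raw data is one label per record, the counts have to be aggregated before plotting. The following is a minimal sketch of that step, assuming a hypothetical list of observations and using collections.Counter for the tally.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Illustrative raw observations: one categorical label per record
observations = ["A", "B", "A", "C", "B", "A", "B", "A"]

# Aggregate first: a bar chart plots one value per category
counts = Counter(observations)
categories = sorted(counts)
frequencies = [counts[c] for c in categories]

fig, ax = plt.subplots()
ax.bar(categories, frequencies)
ax.set_xlabel("Category")
ax.set_ylabel("Count")

# The highest and lowest categories can also be read off programmatically
highest = max(counts, key=counts.get)
lowest = min(counts, key=counts.get)
```

Call plt.show() afterwards to display the figure when running as a script.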
Box Plot
Box plots summarize data with median, quartiles, and outliers.
- Type: Univariate or Bivariate
- Variables:
  - x: Categorical (optional)
  - y: Continuous
Matplotlib
import matplotlib.pyplot as plt
data = [7,8,5,6,4,9]
plt.boxplot(data)
plt.show()
Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
data = sns.load_dataset('tips')
sns.boxplot(x='day', y='total_bill', data=data)
plt.show()
Plotly
import plotly.express as px
df = px.data.tips()
fig = px.box(df, x='day', y='total_bill')
fig.show()
Use cases:
- Identify outliers
- Compare distributions across groups
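Box plots commonly flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. This is Matplotlib's default (whis=1.5). The sketch below computes those fences by hand and compares the result with the fliers Matplotlib draws. The data, including the deliberately extreme value 25, is an assumption for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data; 25 is a deliberately extreme value
data = np.array([7, 8, 5, 6, 4, 9, 25])

# Quartiles and interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Matplotlib applies the same rule with its default whis=1.5
fig, ax = plt.subplots()
result = ax.boxplot(data)
fliers = result["fliers"][0].get_ydata()
```

With this data the fences are 1.0 and 13.0, so only the value 25 is flagged.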
Violin Plot
Violin plots combine box plots with kernel density estimates.
- Type: Bivariate
- Variables:
  - x: Categorical
  - y: Continuous
Matplotlib
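Matplotlib provides plt.violinplot, but it works on raw arrays (one per violin) rather than DataFrame columns, so grouped data has to be split and labeled manually.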
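Matplotlib's violinplot function (available as plt.violinplot and Axes.violinplot) draws one violin per array. A minimal sketch, assuming three synthetic groups with hypothetical names A, B, and C:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic samples for three hypothetical groups
groups = {
    "A": rng.normal(loc=0.0, scale=1.0, size=200),
    "B": rng.normal(loc=1.0, scale=0.5, size=200),
    "C": rng.normal(loc=-0.5, scale=1.5, size=200),
}

fig, ax = plt.subplots()
# violinplot expects a sequence of arrays, one per violin
parts = ax.violinplot(list(groups.values()), showmedians=True)

# Violins are placed at x = 1, 2, 3 by default; category labels are not
# added automatically, so set them by hand
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("value")
```

Unlike the Seaborn and Plotly versions, no box plot is drawn inside the violin. The showmedians, showmeans, and showextrema options add summary markers instead.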
Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
data = sns.load_dataset('tips')
sns.violinplot(x='day', y='total_bill', data=data)
plt.show()
Plotly
import plotly.express as px
df = px.data.tips()
fig = px.violin(df, x='day', y='total_bill', box=True)
fig.show()
Use cases:
- Visualize distribution and density
- Compare across categories
Scatter Plot
Scatter plots show the relationship between two variables.
- Type: Bivariate
- Variables:
  - x: Continuous
  - y: Continuous
Matplotlib
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [5, 4, 6, 5, 7]
plt.scatter(x, y)
plt.show()
Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
data = sns.load_dataset('iris')
sns.scatterplot(x='sepal_length', y='petal_length', data=data)
plt.show()
Plotly
import plotly.express as px
df = px.data.tips()
fig = px.scatter(df, x='total_bill', y='tip')
fig.show()
Use cases:
- Identify correlations
- Detect clusters or outliers
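A scatter plot suggests a correlation visually, and a coefficient puts a number on it. The sketch below reuses the five points from the Matplotlib example, computes the Pearson correlation with np.corrcoef, and overlays a least-squares line from np.polyfit. Pairing these two helpers with the plot is an illustrative choice, not the only approach.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 6, 5, 7])

# Pearson correlation coefficient (off-diagonal entry of the 2x2 matrix)
r = np.corrcoef(x, y)[0, 1]

# Least-squares straight line through the points
slope, intercept = np.polyfit(x, y, deg=1)

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x, slope * x + intercept, color="tab:red", label=f"r = {r:.2f}")
ax.legend()
```

For these points r is roughly 0.69, a moderate positive linear relationship. With only five points, that value should be read as illustrative, not conclusive.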