Scatter plots

Scatter plots are the standard way to visualize the relationship between two continuous variables. At a glance they reveal correlation (positive, negative, or none), clusters, outliers, non-linear patterns, and heteroscedasticity (spread that changes across the range). That makes them essential for exploratory data analysis (EDA), regression diagnostics, feature engineering decisions, and communicating bivariate relationships.

In 2026, scatter plots remain one of the most useful tools in a data scientist's toolkit. Here's a practical guide with worked examples on synthetic data using Matplotlib (full control), Seaborn (statistical defaults that look good out of the box), and Plotly (interactive and shareable).

1. Basic Setup & Sample Data


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Generate realistic correlated data (e.g., height vs weight with some noise)
np.random.seed(42)
n = 200
height = np.random.normal(170, 10, n)           # cm
weight = 0.6 * height - 32 + np.random.normal(0, 8, n)  # kg, about 70 kg at 170 cm

df = pd.DataFrame({'Height (cm)': height, 'Weight (kg)': weight})
print(df.describe())
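
Before plotting, a quick numeric check can confirm that the synthetic data is actually correlated. The sketch below rebuilds the same kind of data and computes Pearson's r with np.corrcoef. It uses NumPy's Generator API instead of np.random.seed, which is a substitution made for this illustration, so the exact numbers will differ from the setup code above. For this recipe (slope 0.6, height spread 10, noise spread 8) r should land somewhere around 0.6.

```python
import numpy as np

# Illustrative data: same recipe as the setup code, but drawn from a
# Generator (an assumption for this sketch, not the article's seed call).
rng = np.random.default_rng(42)
n = 200
height = rng.normal(170, 10, n)                # cm
weight = 0.6 * height + rng.normal(0, 8, n)    # intercept omitted: r ignores shifts

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(height, weight)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A single coefficient only summarizes linear association, which is why the plots in the following sections are still worth drawing.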

2. Simple Scatter Plot with Matplotlib (Full Control)

Classic and highly customizable — perfect for publications or when you need precise styling.


plt.figure(figsize=(10, 6))
plt.scatter(df['Height (cm)'], df['Weight (kg)'], 
            color='royalblue', alpha=0.7, edgecolor='navy', s=60)
plt.title('Height vs Weight Relationship', fontsize=14, pad=15)
plt.xlabel('Height (cm)', fontsize=12)
plt.ylabel('Weight (kg)', fontsize=12)
plt.grid(True, alpha=0.3, linestyle='--')
plt.tight_layout()
plt.show()

3. Beautiful Scatter Plot with Seaborn (Recommended for EDA)

Seaborn gives attractive defaults, easy regression lines, hue grouping, and confidence intervals. By default, sns.regplot draws the fitted line together with a shaded 95% confidence band.


plt.figure(figsize=(10, 6))
sns.regplot(
    data=df, x='Height (cm)', y='Weight (kg)',
    scatter_kws={'alpha':0.7, 's':60, 'color':'teal'},
    line_kws={'color':'black', 'lw':2}
)
plt.title('Height vs Weight with Regression Line', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
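
The regression line in a plot like this is an ordinary least-squares fit, and the regression-diagnostics use mentioned in the introduction usually means looking at what the line leaves behind. The following is a minimal sketch, assuming a straight-line model fitted with np.polyfit on illustrative data (the variable names and the output file name are made up for this example). It plots residuals against fitted values; a funnel shape in that plot would suggest heteroscedasticity.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data with constant noise (hypothetical, not the article's df)
rng = np.random.default_rng(0)
x = rng.normal(170, 10, 200)
y = 0.6 * x - 32 + rng.normal(0, 8, 200)

# Least-squares straight line: polyfit returns [slope, intercept] for deg=1
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(fitted, residuals, alpha=0.7, s=40)
ax.axhline(0, color="black", lw=1)   # reference line: residuals should straddle it
ax.set_xlabel("Fitted weight (kg)")
ax.set_ylabel("Residual (kg)")
ax.set_title("Residuals vs fitted values")
fig.tight_layout()
fig.savefig("residuals.png", dpi=150)  # hypothetical output path
```

With constant noise, as here, the residuals should form an even horizontal band around zero.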

4. Interactive Scatter Plot with Plotly (Best for Dashboards & Sharing)

Hover tooltips, zoom, pan, trendline, color by category: a good fit for Streamlit, Dash, or stakeholder presentations. Note that trendline='ols' requires the statsmodels package to be installed.


fig = px.scatter(
    df, x='Height (cm)', y='Weight (kg)',
    title='Interactive Height vs Weight Scatter',
    labels={'Height (cm)': 'Height (cm)', 'Weight (kg)': 'Weight (kg)'},
    trendline='ols', trendline_color_override='black',
    color='Weight (kg)', color_continuous_scale='Viridis',
    hover_data={'Height (cm)':':.1f', 'Weight (kg)':':.1f'}
)
fig.update_traces(marker=dict(size=8, opacity=0.7))
fig.update_layout(
    xaxis_title='Height (cm)',
    yaxis_title='Weight (kg)',
    template='plotly_white',
    hovermode='closest'
)
fig.show()

5. Advanced: Grouping, Size, Color, & Regression (Real-World Power)

Add a third variable via color/hue/size, fit regression lines, or facet by groups.


# Add a third variable: BMI category
df['BMI'] = df['Weight (kg)'] / ((df['Height (cm)']/100) ** 2)
df['BMI Category'] = pd.cut(df['BMI'], bins=[0, 18.5, 25, 30, 100], 
                           labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=df, x='Height (cm)', y='Weight (kg)',
    hue='BMI Category', size='BMI', sizes=(20, 200),
    alpha=0.8, palette='viridis'
)
sns.regplot(data=df, x='Height (cm)', y='Weight (kg)', scatter=False, color='black')
plt.title('Height vs Weight by BMI Category', fontsize=14)
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')  # seaborn already labels the hue and size groups
plt.tight_layout()
plt.show()
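
Faceting by group, mentioned at the start of this section, is not shown in the code above. One way to sketch it is with sns.lmplot, which draws one panel per level of a column and fits a separate regression line in each. The data, group labels, and file name below are invented for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Illustrative data with a made-up three-level grouping column
rng = np.random.default_rng(7)
n = 150
demo = pd.DataFrame({
    "x": rng.normal(0, 1, n),
    "group": rng.choice(["A", "B", "C"], size=n),
})
demo["y"] = 2.0 * demo["x"] + rng.normal(0, 1, n)

# One panel per group, each with its own scatter and fitted line
g = sns.lmplot(
    data=demo, x="x", y="y",
    col="group", col_order=["A", "B", "C"],
    height=3.5, scatter_kws={"alpha": 0.7},
)
g.savefig("facets.png")  # hypothetical output path
```

Separate panels tend to read better than hue alone once groups overlap heavily, since each trend gets its own axes.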

Best Practices & Common Pitfalls (2026 Edition)

Add regression line + confidence interval when exploring correlation (sns.regplot or Plotly trendline='ols')
Use color/hue for a third categorical variable, size for a continuous fourth — but avoid overcomplicating
Always label axes, add title, and use grid — clarity first
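Watch for overplotting: lower alpha, shrink markers, or switch to a density view (hexbin or 2D histogram); add jitter only when values are discrete or rounded (sns.stripplot covers the categorical-axis case)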
Scale axes properly — do not force 0 origin if data doesn't start near zero
For huge data (>100k points), downsample, aggregate into density bins, or use WebGL rendering (px.scatter with render_mode='webgl'); drawing every point as an SVG element gets slow
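
The overplotting and large-data advice above can be sketched with plain Matplotlib. The example below uses invented data and a hypothetical file name, and shows two common options side by side: plotting a random subsample, and aggregating every point into hexagonal bins with ax.hexbin.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative large dataset (200,000 made-up points)
rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)

# Option 1: draw a random subsample without replacement
idx = rng.choice(n, size=5_000, replace=False)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 5))
ax_left.scatter(x[idx], y[idx], s=5, alpha=0.3)
ax_left.set_title("Random sample of 5,000 points")

# Option 2: count all points per hexagonal bin; mincnt=1 hides empty bins
hb = ax_right.hexbin(x, y, gridsize=60, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=ax_right, label="Points per bin")
ax_right.set_title("Hexbin density of all points")

fig.tight_layout()
fig.savefig("overplotting.png", dpi=150)  # hypothetical output path
```

Subsampling keeps individual points visible but can hide rare outliers, while binning keeps every point in the picture at the cost of individual detail.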

Conclusion

Scatter plots are your best tool for discovering bivariate relationships — correlation strength, linearity, clusters, outliers, and heteroscedasticity all jump out immediately. In 2026, start with Seaborn for beautiful, quick EDA scatters with regression, switch to Plotly when interactivity or sharing is needed, and fall back to Matplotlib for pixel-perfect control. Master hue/size grouping, regression lines, jitter, and axis scaling, and you'll turn raw pairs of variables into clear, insightful stories that guide modeling and decisions.

Next time you have two numeric variables — plot a scatter first. One good scatter can reveal more than pages of correlation tables.