Avocados

The Avocado Prices dataset is one of the most beloved, beginner-to-intermediate friendly datasets in the data science community. It contains weekly 2015–2018 retail scan data for Hass avocados across US regions — average price, total volume sold, type (conventional vs organic), PLU codes (small, large, extra large), and more. It’s perfect for practicing regression, time-series forecasting, feature engineering, regional comparisons, seasonality analysis, and even simple demand modeling.

Why it’s still popular in 2026: real-world messiness (a categorical region column that mixes cities with larger aggregates, strong seasonality, price elasticity effects, and possible gaps or duplicates worth checking for), plus it’s fun. Who doesn’t love avocados?

1. Quick Data Overview

Typical columns:

Date — weekly date
AveragePrice — target variable (USD per unit)
Total Volume — total units sold
4046, 4225, 4770 — volume by PLU/size
Total Bags, Small Bags, Large Bags, XLarge Bags
type — conventional / organic
region — city or larger area (e.g., Albany, Atlanta, California, Northeast), plus the national aggregate TotalUS
year
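
The numeric PLU column names are easy to mistype, so it can help to rename them up front. This is a minimal sketch, assuming the CSV uses the column names listed above. The helper `tidy_columns` and the new names (`plu_4046_small` and so on) are illustrative choices, not part of the dataset, and the `Unnamed: 0` index column is dropped only if your copy of the file contains it.

```python
import pandas as pd

# Hypothetical descriptive names for the PLU volume columns (not part of the dataset).
PLU_RENAMES = {
    "4046": "plu_4046_small",
    "4225": "plu_4225_large",
    "4770": "plu_4770_xlarge",
}

def tidy_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with descriptive PLU names and no leftover index column."""
    out = df.rename(columns=PLU_RENAMES)
    # Some copies of the CSV carry a saved index column; drop it only if present.
    return out.drop(columns=["Unnamed: 0"], errors="ignore")

# Tiny synthetic frame with the same column names, just to show the effect.
sample = pd.DataFrame({
    "Unnamed: 0": [0],
    "Date": pd.to_datetime(["2015-01-04"]),
    "AveragePrice": [1.22],
    "Total Volume": [1000.0],
    "4046": [400.0], "4225": [500.0], "4770": [100.0],
    "type": ["conventional"],
    "region": ["Albany"],
})
tidy = tidy_columns(sample)
print(list(tidy.columns))
```

The original frame is left untouched, so the later snippets that refer to the raw column names keep working.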

2. Loading & Quick EDA (2026 Style)


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Load (Kaggle / common public URL or local file)
url = "https://raw.githubusercontent.com/selva86/datasets/master/Avocado.csv"
df = pd.read_csv(url, parse_dates=['Date'])
df = df.sort_values('Date').reset_index(drop=True)

print(df.shape)          # ~18,249 rows × 14 cols
print(df.head(3))
print(df['region'].nunique())  # 54 regions
print(df['type'].value_counts())
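
Before plotting, it is worth checking how messy your copy of the file actually is. The sketch below counts missing cells and repeated (Date, region, type) keys. It assumes there should be one row per such combination, which is something to verify rather than a documented guarantee, and `quality_report` is a hypothetical helper name.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing values, duplicate keys, and the covered date range."""
    keys = ["Date", "region", "type"]
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_keys": int(df.duplicated(subset=keys).sum()),
        "first_date": df["Date"].min(),
        "last_date": df["Date"].max(),
    }

# Synthetic rows (not real data): one missing price and one repeated key.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"]),
    "region": ["Albany", "Albany", "Albany", "Albany"],
    "type": ["conventional", "organic", "conventional", "conventional"],
    "AveragePrice": [1.22, 1.79, None, 1.24],
})
report = quality_report(sample)
print(report)
```

Calling `quality_report(df)` on the loaded frame gives the same summary for the real data.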

3. Price Trend Over Time (National Level)


# National average price trend
national = df[df['region'] == 'TotalUS']

plt.figure(figsize=(12, 6))
sns.lineplot(data=national, x='Date', y='AveragePrice', hue='type', linewidth=2.5)
plt.title('US National Avocado Average Price (2015–2018)', fontsize=14)
plt.ylabel('Average Price ($)')
plt.grid(True, alpha=0.3)
plt.legend(title='Type')
plt.tight_layout()
plt.show()
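
Weekly prices are noisy, so a smoothed line often makes the trend easier to read. The sketch below uses a centered rolling mean; the helper name `smooth_weekly` and the window length are illustrative choices, and the numbers are synthetic.

```python
import pandas as pd

def smooth_weekly(series: pd.Series, window: int) -> pd.Series:
    """Centered rolling mean over `window` consecutive weekly observations."""
    # min_periods=1 keeps the ends of the series instead of turning them into NaN.
    return series.rolling(window=window, center=True, min_periods=1).mean()

# Synthetic weekly prices (not real data).
dates = pd.date_range("2015-01-04", periods=6, freq="7D")
prices = pd.Series([1.0, 1.2, 0.8, 1.4, 1.0, 1.2], index=dates)
smoothed = smooth_weekly(prices, window=3)
print(smoothed.round(3))
```

On the real data this could be applied to, for example, `national[national['type'] == 'conventional'].set_index('Date')['AveragePrice']`, and the result plotted on top of the raw weekly line.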

4. Interactive Regional Comparison with Plotly


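# One trace per (region, type). Without line_dash, each region's line would
# zigzag between its conventional and organic price on every date.
fig = px.line(
    df[df['region'] != 'TotalUS'],
    x='Date', y='AveragePrice', color='region', line_dash='type',
    title='Avocado Prices by Region (Interactive)',
    labels={'AveragePrice': 'Avg Price ($)', 'Date': 'Date'},
    hover_data=['type', 'Total Volume']
)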
fig.update_traces(line=dict(width=1.5))
fig.update_layout(
    legend=dict(orientation='h', y=-0.2),
    xaxis_title='Date', yaxis_title='Average Price ($)',
    template='plotly_white', hovermode='x unified'
)
fig.show()
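
With 54 regions the chart gets crowded, so one option is to plot only a handful of them. The sketch below picks the regions with the largest summed volume. `top_regions` is a hypothetical helper, and because the region column mixes cities with larger areas, the top entries on the real data will tend to be the larger areas.

```python
import pandas as pd

def top_regions(df: pd.DataFrame, n: int = 5, exclude=("TotalUS",)) -> list:
    """Names of the n regions with the largest summed Total Volume."""
    regional = df[~df["region"].isin(exclude)]
    totals = regional.groupby("region")["Total Volume"].sum()
    return totals.nlargest(n).index.tolist()

# Synthetic rows (not real data).
sample = pd.DataFrame({
    "region": ["Albany", "Atlanta", "California", "TotalUS", "Albany", "California"],
    "Total Volume": [10.0, 50.0, 300.0, 1000.0, 20.0, 200.0],
})
keep = top_regions(sample, n=2)
print(keep)
```

The returned list can then be used as a filter, for example `df[df['region'].isin(keep)]`, before passing the frame to `px.line`.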

5. Price vs Volume Scatter (with Type Hue)


plt.figure(figsize=(10, 7))
sns.scatterplot(
    data=df, x='Total Volume', y='AveragePrice',
    hue='type', size='Total Volume', sizes=(10, 200),
    alpha=0.6, palette='deep'
)
plt.xscale('log')
plt.title('Price vs Total Volume Sold (log scale)', fontsize=14)
plt.xlabel('Total Volume (log)')
plt.ylabel('Average Price ($)')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()
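
The scatter suggests a rough inverse relationship, and a log-log fit puts a number on it. This is a minimal sketch, assuming a single straight-line fit of log volume on log price is a reasonable first summary. The slope is descriptive rather than a causal elasticity, the helper name `loglog_slope` is illustrative, and the data here is synthetic with a known answer.

```python
import numpy as np
import pandas as pd

def loglog_slope(df: pd.DataFrame) -> float:
    """Slope of log(Total Volume) regressed on log(AveragePrice)."""
    # Logs are undefined for non-positive values, so drop those rows first.
    clean = df[(df["AveragePrice"] > 0) & (df["Total Volume"] > 0)]
    x = np.log(clean["AveragePrice"].to_numpy())
    y = np.log(clean["Total Volume"].to_numpy())
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)

# Synthetic data generated with a known slope of -1.5 (not real data).
price = np.array([0.8, 1.0, 1.2, 1.5, 2.0])
sample = pd.DataFrame({
    "AveragePrice": price,
    "Total Volume": 1000.0 * price ** -1.5,
})
slope = loglog_slope(sample)
print(round(slope, 3))
```

On the real data it is probably more meaningful to fit one region and one type at a time, since pooling conventional and organic rows mixes two very different price and volume levels.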

6. Popular Analyses & Modeling Ideas (2026)

Seasonality — strong yearly cycle (in this data, prices tend to peak in late summer and autumn and dip in winter; confirm with a monthly average)
Organic premium — organic usually $0.4–$0.8 more expensive
Price elasticity — inverse relationship with volume
Region clustering — Northeast/California vs South/Midwest
Forecasting — ARIMA, Prophet, LSTM on national/region time series
Regression — predict AveragePrice from volume, type, region, year, seasonality features
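
Two of the ideas above, the organic premium and the seasonal profile, can be checked with a few lines of pandas. This is a sketch under the assumption that `type` contains exactly the values `conventional` and `organic`. The helper names are hypothetical and the rows are synthetic.

```python
import pandas as pd

def organic_premium(df: pd.DataFrame) -> pd.Series:
    """Organic minus conventional price for each (Date, region) pair."""
    wide = df.pivot_table(index=["Date", "region"], columns="type",
                          values="AveragePrice", aggfunc="mean")
    # Pairs missing one of the two types produce NaN and are dropped.
    return (wide["organic"] - wide["conventional"]).dropna()

def monthly_profile(df: pd.DataFrame) -> pd.Series:
    """Mean AveragePrice per calendar month (1-12), pooled over all years."""
    return df.groupby(df["Date"].dt.month)["AveragePrice"].mean()

# Synthetic rows (not real data).
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-04", "2015-07-05", "2015-07-05"]),
    "region": ["Albany"] * 4,
    "type": ["conventional", "organic", "conventional", "organic"],
    "AveragePrice": [1.00, 1.50, 1.20, 1.80],
})
premium = organic_premium(sample)
profile = monthly_profile(sample)
print(premium.mean())
print(profile)
```

Running both helpers on the loaded `df` is a quick way to test the premium range and the direction of the seasonal cycle against your own copy of the data.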

Best Practices & Tips

Always convert Date to datetime and sort — prevents plotting errors
Log-transform Total Volume for scatter plots — extreme skew
Use hue='type' to color by category, and col='region' (Seaborn relplot) or facet_col='region' (Plotly Express) for faceting
Interactive Plotly: great for exploring 54 regions without clutter
Handle outliers — some weeks/regions have extreme volume spikes
Modern alternative: load with polars for speed on large versions
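
The outlier tip can be sketched with a simple interquartile-range rule on log volume, computed within each region and type so that large markets are not flagged just for being large. The helper `flag_volume_outliers` and the multiplier `k = 1.5` are illustrative assumptions, and the rows are synthetic.

```python
import numpy as np
import pandas as pd

def flag_volume_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Boolean mask: log10 volume outside the k * IQR fences of its (region, type) group."""
    log_vol = np.log10(df["Total Volume"])
    grouped = log_vol.groupby([df["region"], df["type"]])
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (log_vol < q1 - k * iqr) | (log_vol > q3 + k * iqr)

# Synthetic rows (not real data): five ordinary weeks and one extreme spike.
sample = pd.DataFrame({
    "region": ["Albany"] * 6,
    "type": ["conventional"] * 6,
    "Total Volume": [100.0, 110.0, 90.0, 105.0, 95.0, 10000.0],
})
mask = flag_volume_outliers(sample)
print(sample[mask])
```

Flagged rows are candidates for inspection, not automatic removal, since a spike may reflect a real promotion or holiday week.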

Conclusion

The Avocado dataset is small, clean-ish, real, seasonal, and interpretable — making it ideal for learning regression, time-series, feature engineering, and visualization. In 2026, load it fast, explore with Seaborn/Plotly, and build models that reveal price drivers, organic premiums, regional differences, and seasonality effects. It’s still one of the best “first real dataset” choices after Iris/Titanic.

Next time you want to practice EDA + forecasting — grab the avocados. They’re always in season for data science.