The Avocado Prices dataset is one of the most beloved, beginner-to-intermediate friendly datasets in the data science community. It contains weekly 2015–2018 retail scan data for Hass avocados across US regions — average price, total volume sold, type (conventional vs organic), PLU codes (small, large, extra large), and more. It’s perfect for practicing regression, time-series forecasting, feature engineering, regional comparisons, seasonality analysis, and even simple demand modeling.
Why it’s still popular in 2026: real-world messiness (categorical regions that overlap with aggregates such as TotalUS, heavily skewed volumes, strong seasonality, price elasticity effects), plus it’s fun: who doesn’t love avocados?
1. Quick Data Overview
Typical columns:
- Date: weekly date
- AveragePrice: target variable (USD per unit)
- Total Volume: total units sold
- 4046, 4225, 4770: volume by PLU/size
- Total Bags, Small Bags, Large Bags, XLarge Bags
- type: conventional / organic
- region: city/region (e.g., Albany, Atlanta, California, Northeast, and TotalUS for the national total)
- year
2. Loading & Quick EDA (2026 Style)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# Load (Kaggle / common public URL or local file)
url = "https://raw.githubusercontent.com/selva86/datasets/master/Avocado.csv"
df = pd.read_csv(url, parse_dates=['Date'])
df = df.sort_values('Date').reset_index(drop=True)
print(df.shape) # ~18,249 rows × 14 cols
print(df.head(3))
print(df['region'].nunique()) # 54 regions
print(df['type'].value_counts())
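Before plotting, a few sanity checks help confirm that the load worked as expected. The sketch below is one possible way to do it, not part of the original walkthrough: the helper name quick_checks is hypothetical, the 'Unnamed: 0' column is an assumption about how the CSV was saved, and the sample frame is made up purely so the snippet runs on its own.

```python
import pandas as pd

def quick_checks(frame: pd.DataFrame) -> dict:
    """Return a few sanity-check numbers for an avocado-style frame."""
    # Assumption: a leftover index column, if present, is named 'Unnamed: 0'.
    frame = frame.drop(columns=["Unnamed: 0"], errors="ignore")
    return {
        # Total count of missing cells across all columns
        "n_missing": int(frame.isna().sum().sum()),
        # Each (Date, region, type) combination should appear once
        "n_duplicate_keys": int(frame.duplicated(["Date", "region", "type"]).sum()),
        "date_min": frame["Date"].min(),
        "date_max": frame["Date"].max(),
    }

# Made-up sample with the same column names, for illustration only
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Date": pd.to_datetime(["2015-01-04", "2015-01-04", "2015-01-11"]),
    "AveragePrice": [1.22, 1.79, None],
    "type": ["conventional", "organic", "conventional"],
    "region": ["Albany", "Albany", "Albany"],
})
report = quick_checks(sample)
print(report)
```

On the real data, the same call would be quick_checks(df) right after loading.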
3. Price Trend Over Time (National Level)
# National average price trend
national = df[df['region'] == 'TotalUS']
plt.figure(figsize=(12, 6))
sns.lineplot(data=national, x='Date', y='AveragePrice', hue='type', linewidth=2.5)
plt.title('US National Avocado Average Price (2015–2018)', fontsize=14)
plt.ylabel('Average Price ($)')
plt.grid(True, alpha=0.3)
plt.legend(title='Type')
plt.tight_layout()
plt.show()
4. Interactive Regional Comparison with Plotly
fig = px.line(
    df[df['region'] != 'TotalUS'],
    x='Date', y='AveragePrice', color='region',
    # Each region has a conventional and an organic row per week;
    # without a per-type split the line zigzags between the two prices.
    line_dash='type',
    title='Avocado Prices by Region (Interactive)',
    labels={'AveragePrice': 'Avg Price ($)', 'Date': 'Date'},
    hover_data=['type', 'Total Volume']
)
fig.update_traces(line=dict(width=1.5))
fig.update_layout(
    legend=dict(orientation='h', y=-0.2),
    xaxis_title='Date', yaxis_title='Average Price ($)',
    template='plotly_white', hovermode='x unified'
)
fig.show()
5. Price vs Volume Scatter (with Type Hue)
plt.figure(figsize=(10, 7))
sns.scatterplot(
    data=df, x='Total Volume', y='AveragePrice',
    hue='type', size='Total Volume', sizes=(10, 200),
    alpha=0.6, palette='deep'
)
plt.xscale('log')
plt.title('Price vs Total Volume Sold (log scale)', fontsize=14)
plt.xlabel('Total Volume (log)')
plt.ylabel('Average Price ($)')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()
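The scatter suggests a downward price-volume relationship, and one rough way to put a number on it is the slope of log(volume) against log(price). The sketch below is an assumption-laden illustration rather than a proper demand model: loglog_slope is a hypothetical helper, the demo data is synthetic (built with a known exponent of -1.5), and the slope is descriptive, not causal.

```python
import numpy as np
import pandas as pd

def loglog_slope(frame, price_col="AveragePrice", volume_col="Total Volume"):
    """Least-squares slope of log(volume) on log(price)."""
    x = np.log(frame[price_col].to_numpy(dtype=float))
    y = np.log(frame[volume_col].to_numpy(dtype=float))
    # polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)

# Synthetic data: volume = 1000 * price ** -1.5, so the slope is -1.5 by construction
prices = np.array([0.8, 1.0, 1.2, 1.5, 2.0])
demo = pd.DataFrame({
    "AveragePrice": prices,
    "Total Volume": 1000 * prices ** -1.5,
})
slope = loglog_slope(demo)
print(round(slope, 3))
```

On the real data it is probably more meaningful to fit one region and one type at a time, for example loglog_slope(df[(df['region'] == 'TotalUS') & (df['type'] == 'conventional')]), since pooling regions mixes market size into the slope.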
6. Popular Analyses & Modeling Ideas (2026)
- Seasonality: strong yearly cycle (in this data, prices tend to peak in late summer and autumn and bottom out in winter)
- Organic premium — organic usually $0.4–$0.8 more expensive
- Price elasticity — inverse relationship with volume
- Region clustering — Northeast/California vs South/Midwest
- Forecasting — ARIMA, Prophet, LSTM on national/region time series
- Regression — predict AveragePrice from volume, type, region, year, seasonality features
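Two of the ideas above, seasonality and the organic premium, can be sketched with plain pivot tables. This is a minimal illustration under stated assumptions: the helper names monthly_profile and organic_premium are hypothetical, and the four demo rows are made-up numbers, not values from the dataset.

```python
import pandas as pd

def monthly_profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean AveragePrice per calendar month (rows) and type (columns)."""
    with_month = frame.assign(month=frame["Date"].dt.month)
    return with_month.pivot_table(
        index="month", columns="type", values="AveragePrice", aggfunc="mean"
    )

def organic_premium(frame: pd.DataFrame) -> pd.Series:
    """Organic minus conventional price, matched on the same Date and region."""
    wide = frame.pivot_table(
        index=["Date", "region"], columns="type", values="AveragePrice"
    )
    # Rows missing either type are dropped rather than guessed
    return (wide["organic"] - wide["conventional"]).dropna()

# Made-up rows for illustration only
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2017-02-05", "2017-02-05", "2017-09-03", "2017-09-03"]),
    "region": ["TotalUS"] * 4,
    "type": ["conventional", "organic", "conventional", "organic"],
    "AveragePrice": [1.00, 1.50, 1.60, 2.00],
})
profile = monthly_profile(demo)
premium = organic_premium(demo)
print(profile)
print(premium.mean())
```

The month column produced here can also serve as a simple seasonality feature for a regression on AveragePrice.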
Best Practices & Tips
- Always convert Date to datetime and sort: this prevents plotting errors
- Log-transform Total Volume for scatter plots: the distribution is extremely skewed
- Use hue='type' or col='region' in Seaborn (color= and facet_col= in Plotly Express) for grouping and faceting
- Interactive Plotly: great for exploring 54 regions without clutter
- Handle outliers: some weeks/regions have extreme volume spikes
- Modern alternative: load with polars for speed on large versions
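The log-transform and outlier tips can be combined into one small check. The sketch below is one possible approach, not the only one: flag_volume_outliers is a hypothetical helper, the 1.5 × IQR rule is a common convention chosen here as an assumption, and the demo volumes are invented.

```python
import numpy as np
import pandas as pd

def flag_volume_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Flag rows whose log10 volume falls outside k * IQR within their region/type group."""
    log_vol = np.log10(frame["Total Volume"])
    grouped = log_vol.groupby([frame["region"], frame["type"]])
    # transform broadcasts each group's quantile back onto its rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (log_vol < q1 - k * iqr) | (log_vol > q3 + k * iqr)

# Invented volumes: five ordinary weeks and one extreme spike
demo = pd.DataFrame({
    "region": ["Albany"] * 6,
    "type": ["conventional"] * 6,
    "Total Volume": [100, 110, 105, 95, 102, 100000],
})
flags = flag_volume_outliers(demo)
print(demo[flags])
```

Flagging per region and type matters because TotalUS volumes dwarf those of a single city, so a global threshold would mostly flag large markets rather than unusual weeks.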
Conclusion
The Avocado dataset is small, clean-ish, real, seasonal, and interpretable — making it ideal for learning regression, time-series, feature engineering, and visualization. In 2026, load it fast, explore with Seaborn/Plotly, and build models that reveal price drivers, organic premiums, regional differences, and seasonality effects. It’s still one of the best “first real dataset” choices after Iris/Titanic.
Next time you want to practice EDA + forecasting — grab the avocados. They’re always in season for data science.