Examining chunks when reading large CSV files with pd.read_csv(chunksize=...) is a powerful debugging and exploration technique: it lets you inspect the structure, data types, missing values, outliers, and sample rows of each chunk before full processing. This is especially useful for files so large that loading everything at once would exhaust memory or take too long. By examining chunks early (e.g., chunk.head(), chunk.info(), chunk.describe(), chunk.isna().sum()), you can spot inconsistent dtypes, unexpected delimiters, encoding problems, and other data quality issues, then adjust read parameters (usecols, dtype, parse_dates, converters) or preprocessing logic accordingly. In 2026, chunk examination remains essential for safe, efficient big-data workflows in pandas, and Polars lazy scanning offers an even better alternative for exploration without full materialization.
Here’s a complete, practical guide to examining chunks with pd.read_csv(chunksize=...): basic inspection loop, common checks (head, dtypes, missing, stats), adjusting read parameters, real-world patterns, and modern best practices with type hints, error handling, Polars lazy equivalents, and pandas/Polars integration.
Basic chunk examination loop — use enumerate for chunk index, inspect first few chunks to understand the file.
import pandas as pd
file_path = 'large_file.csv'
chunksize = 100_000
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunksize)):
    print(f"\n=== Chunk {i} ===")

    # Quick look at data
    print("First 5 rows:")
    print(chunk.head())

    # Data types and non-null counts
    print("\nInfo:")
    chunk.info()

    # Basic statistics
    print("\nDescribe:")
    print(chunk.describe())

    # Missing values per column
    print("\nMissing values:")
    print(chunk.isna().sum())

    # If chunk looks good, process it (e.g., append, transform, write)
    # chunk_processed = chunk.dropna()  # example
    # ... more processing ...

    # Stop after first few chunks for inspection
    if i >= 2:
        break
Common checks inside the loop — adapt based on your data.
- chunk.head(10): see sample rows and column names.
- chunk.dtypes: check inferred types (often wrong for large CSVs).
- chunk.select_dtypes(include='object').nunique(): cardinality of string columns (detect categorical vs free text).
- chunk.memory_usage(deep=True).sum() / (1024**2): estimate chunk memory footprint in MB.
- chunk.describe(include='all'): stats for numeric and categorical columns.
- chunk['column'].value_counts(dropna=False): distribution of key columns.
- chunk[chunk.duplicated()].shape[0]: count of duplicate rows.
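The checks above can be run on a single chunk without looping over the whole file, using iterator=True and get_chunk(). A minimal self-contained sketch (the in-memory CSV stands in for a real file path, which is an assumption for illustration):

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a large CSV on disk.
csv_data = io.StringIO(
    "id,category,value\n"
    "1,a,10.5\n"
    "2,b,\n"
    "2,b,\n"
    "3,free text here,7.0\n"
)

# iterator=True returns a TextFileReader; get_chunk() pulls one chunk
# without iterating over the entire file.
reader = pd.read_csv(csv_data, iterator=True)
chunk = reader.get_chunk(1000)  # first (up to) 1000 rows

print(chunk.dtypes)                                     # inferred types
print(chunk.select_dtypes(include='object').nunique())  # string cardinality
mem_mb = chunk.memory_usage(deep=True).sum() / (1024**2)
print(f"chunk memory: {mem_mb:.4f} MB")
print(chunk['category'].value_counts(dropna=False))     # key-column distribution
print("duplicate rows:", chunk[chunk.duplicated()].shape[0])
```

With a real file, pass the path instead of the StringIO object; get_chunk(1000) then reads only the first 1000 rows from disk.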
Adjusting read parameters after examination — common fixes based on chunk inspection.
# After seeing bad dtypes or missing columns
chunks = pd.read_csv(
    file_path,
    chunksize=100_000,
    usecols=['id', 'value', 'date'],            # only needed columns
    dtype={'id': 'int32', 'value': 'float32'},  # downcast numerics
    parse_dates=['date'],                       # auto date parsing
    date_format='%Y-%m-%d',                     # specify format if known
)
for chunk in chunks:
    # process with correct types
    pass
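To verify that downcasting actually pays off, you can compare memory usage with and without the dtype argument. A small sketch, using a generated in-memory CSV as a stand-in for a real file:

```python
import io
import pandas as pd

# Hypothetical CSV with 1000 rows; in practice, point read_csv at the real path.
raw = "id,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(1000))

default = pd.read_csv(io.StringIO(raw))  # pandas infers int64/float64
downcast = pd.read_csv(
    io.StringIO(raw),
    dtype={'id': 'int32', 'value': 'float32'},
)

before = default.memory_usage(deep=True).sum()
after = downcast.memory_usage(deep=True).sum()
print(f"default: {before} bytes, downcast: {after} bytes")
# int32/float32 halve the per-column footprint vs int64/float64
```

The same comparison works on a single chunk from a large file, so you can confirm the savings before committing to a full pass.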
Real-world pattern: exploratory chunk analysis before full load — inspect first 3–5 chunks, then decide read parameters or preprocessing.
def explore_csv(file_path: str, chunksize: int = 100_000, max_chunks: int = 5) -> None:
    print(f"Exploring {file_path}...")
    for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunksize)):
        print(f"\nChunk {i} shape: {chunk.shape}")
        print("Columns:", chunk.columns.tolist())
        print("Sample:")
        print(chunk.head(3))
        print("Dtypes:\n", chunk.dtypes)
        print("Missing:\n", chunk.isna().sum())
        if i >= max_chunks - 1:
            break
explore_csv('very_large.csv')
# Use output to set usecols, dtype, parse_dates, etc. in final read
Best practices for chunk examination.
- Examine only the first few chunks: break after 3–5 for quick insight.
- Modern tip: in Polars, pl.scan_csv(...).head(100_000).collect() gives a lazy scan plus fast partial read for inspection without a full load (older Polars versions exposed this as .fetch()).
- Use iterator=True with get_chunk(): explicit control over exactly how many rows you pull.
- Log chunk info: use logging.info instead of print for production.
- Tune chunksize: 10k–100k rows usually balances memory vs I/O; profile with psutil.
- Use usecols: reduce memory by reading only needed columns.
- Use dtype: downcast early (int32/float32) based on chunk inspection.
- Handle errors per chunk: try/except inside the loop for bad rows/files.
- Write processed chunks to disk: chunk.to_parquet(f'chunk_{i}.parquet').
- Use Polars lazy for full exploration: pl.scan_csv(...).head(1000).collect() or .describe() without loading everything.
- Add type hints: def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame.
- Use pd.read_csv(..., low_memory=False): avoids mixed-type inference warnings, at the cost of reading more of each chunk into memory at once.
- Combine with Dask: for distributed chunk processing if pandas chunks are still too large.
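Two of the practices above, per-chunk error handling and logging, can be combined in one loop. A minimal sketch, assuming an in-memory CSV with one malformed value (replace with a real path in practice; with pyarrow installed you could persist each cleaned chunk to Parquet instead of accumulating in memory):

```python
import io
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chunks")

# Hypothetical data: row with id=2 has a non-numeric value.
csv_data = io.StringIO("id,value\n1,10\n2,oops\n3,30\n4,40\n")

processed = []
for i, chunk in enumerate(pd.read_csv(csv_data, chunksize=2)):
    try:
        # Coerce bad values to NaN instead of failing the whole file.
        chunk['value'] = pd.to_numeric(chunk['value'], errors='coerce')
        cleaned = chunk.dropna(subset=['value'])
        # With pyarrow available, persist instead of accumulating:
        # cleaned.to_parquet(f'chunk_{i}.parquet')
        processed.append(cleaned)
        log.info("chunk %d: kept %d of %d rows", i, len(cleaned), len(chunk))
    except Exception:
        # A single bad chunk is logged and skipped, not fatal.
        log.exception("chunk %d failed; skipping", i)

result = pd.concat(processed, ignore_index=True)
```

The try/except keeps one bad chunk from aborting a multi-hour run, while the log lines give you a per-chunk audit trail of how many rows were dropped.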
Examining chunks with pd.read_csv(chunksize=...) lets you inspect large CSVs safely — check structure, dtypes, missing values, and samples before full processing. In 2026, examine first 3–5 chunks, adjust usecols/dtype, prefer Polars scan_csv().fetch() for fast lazy inspection, and log/profile memory. Master chunk examination, and you’ll handle massive CSVs efficiently, avoid OOM, and make informed preprocessing decisions.
Next time you face a large CSV — examine chunks first. It’s Python’s cleanest way to say: “Peek inside this file safely — before committing to full load.”