Chunking Strategies

AutoChunks includes a library of 7+ primary chunking strategies, ranging from traditional mechanical splitters to advanced semantic engines.

1. Fixed-Length Chunker (Baseline)

The baseline strategy. It splits text into chunks of a fixed token count, with a sliding overlap between consecutive chunks. Although simple, it serves as the control group against which the semantic lift of more complex strategies is measured.

Logic: Tokenization-based slicing with deterministic window stepping.
Parameters: base_token_size (int), overlap (int).
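
As a rough illustration of the window stepping described above, here is a minimal sketch. It assumes whitespace-separated words stand in for real tokens, and the function name fixed_length_chunks is hypothetical rather than part of the AutoChunks API; only base_token_size and overlap come from the parameter list.

```python
def fixed_length_chunks(tokens, base_token_size, overlap):
    """Slice a token list into windows of base_token_size tokens.

    Each window starts (base_token_size - overlap) tokens after the previous one,
    so consecutive chunks share `overlap` tokens.
    """
    if base_token_size <= 0:
        raise ValueError("base_token_size must be positive")
    if not 0 <= overlap < base_token_size:
        raise ValueError("overlap must satisfy 0 <= overlap < base_token_size")
    step = base_token_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + base_token_size])
        if start + base_token_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: whitespace words stand in for tokenizer output.
tokens = "the quick brown fox jumps over the lazy dog today".split()
chunks = fixed_length_chunks(tokens, base_token_size=4, overlap=1)
```

With ten tokens, a window of 4, and an overlap of 1, the sketch yields three chunks, each beginning with the last token of the previous one.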

2. Recursive Character Chunker

A hierarchical "search-down" strategy. It attempts to maintain structural integrity by respecting a priority list of separators.

Logic: Recursive split → check size → split again if > limit.
Default Priority: Double Newline (Paragraph) > Newline (Line) > Space (Word) > Character.
Parameters: base_token_size, separators (list).
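
The split, check, and re-split loop can be sketched as follows. This is an illustration under stated assumptions, not the AutoChunks implementation: tokens are approximated by whitespace words, adjacent pieces are greedily merged while they still fit (a step the description above does not mention), and the character-level fallback is left out because a single word never exceeds the limit under the word-count stand-in. The names count_tokens and recursive_chunks are hypothetical.

```python
def count_tokens(text):
    # Assumption: whitespace-delimited words approximate tokens.
    return len(text.split())


def recursive_chunks(text, base_token_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; recurse with finer ones only when needed."""
    if count_tokens(text) <= base_token_size or not separators:
        return [text] if text.strip() else []
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        candidate = current + sep + piece if current else piece
        if count_tokens(candidate) <= base_token_size:
            current = candidate  # still fits: keep merging adjacent pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if count_tokens(piece) <= base_token_size:
            current = piece
        else:
            # The piece alone is too large: split again with the next separator.
            chunks.extend(recursive_chunks(piece, base_token_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Intro line.\n\nalpha beta gamma delta epsilon zeta\nshort line"
chunks = recursive_chunks(text, base_token_size=4)
```

In this example the first paragraph fits as is, while the second is split by line and then by word, because its first line alone exceeds the four-token limit.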

3. Sentence-Aware Chunker

Ensures that sentences, the logical units of thought, are never split across chunks.

Logic: Uses boundary detection to group sentences into the largest possible contiguous blocks that fit within the token budget.
Use Case: High-precision RAG, where a sentence split across two chunks can cause retrieval to miss the relevant passage entirely.
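
A minimal sketch of the grouping logic is shown below. It assumes a simple regular expression for boundary detection (real detectors handle abbreviations and quotations far better) and whitespace words as tokens; split_sentences and sentence_chunks are hypothetical names.

```python
import re


def split_sentences(text):
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text, base_token_size):
    """Greedily pack whole sentences into chunks without exceeding the token budget."""
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > base_token_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Cats sleep a lot. Dogs bark loudly at night. Birds sing."
chunks = sentence_chunks(text, base_token_size=8)
```

Note the design choice in this sketch: a sentence that is longer than the budget on its own is emitted as an oversized chunk rather than being cut, which keeps the "never split a sentence" guarantee at the cost of a strict size limit.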

4. Semantic Chunker (Local Gradient)

Detects topic shifts by tracking how embedding similarity changes between neighboring sentences across a sliding window.

Logic:
1. Encode all sentences into embeddings.
2. Calculate a sliding window mean similarity.
3. Identify "Topic Chasms" where the similarity falls below a dynamic percentile-based threshold.
Use Case: Unstructured transcripts, long-form narratives, and fluid discussions.
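
The three steps above can be sketched as follows. Several assumptions apply: embeddings are taken as already computed (toy 2-D vectors here), "sliding window mean similarity" is read as the cosine similarity between the mean embedding of the window on each side of a gap, and the threshold uses a nearest-rank percentile. All function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gap_similarities(embeddings, window=2):
    """Similarity across each gap between sentence i and i+1, using up to `window` sentences per side."""
    sims = []
    for i in range(len(embeddings) - 1):
        left = mean_vector(embeddings[max(0, i - window + 1): i + 1])
        right = mean_vector(embeddings[i + 1: i + 1 + window])
        sims.append(cosine(left, right))
    return sims


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def topic_boundaries(embeddings, window=2, pct=20):
    """Return sentence indices at which a new chunk should start."""
    sims = gap_similarities(embeddings, window)
    if not sims:
        return []
    threshold = percentile(sims, pct)
    # Using <= means the lowest-similarity gap always counts as a boundary.
    return [i + 1 for i, s in enumerate(sims) if s <= threshold]


# Toy embeddings: three sentences on one topic, then three on another.
embeddings = [[1, 0], [0.9, 0.1], [1, 0.05], [0, 1], [0.1, 0.9], [0.05, 1]]
boundaries = topic_boundaries(embeddings, window=2, pct=20)
```

Because the threshold is derived from the document's own similarity distribution, the same percentile adapts to both tightly focused and loosely connected text; the trade-off in this sketch is that at least one boundary is always reported.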

5. Hybrid Semantic-Statistical Chunker

Balances semantic topic shifts against token-budget pressure.

Logic: Computes a boundary score as a weighted combination of semantic distance and the current chunk length relative to the target size.
Advantage: Prevents "Stray Sentences" (semantic orphans) and "Runaway Chunks" (where topics never shift enough but size becomes unmanageable).
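
One way such a boundary score could look is sketched below. The weighting scheme, the threshold of 0.5, and the names boundary_score and hybrid_chunks are assumptions made for illustration; the actual formula used by AutoChunks is not given in this section.

```python
def boundary_score(similarity, chunk_tokens, target_tokens, semantic_weight=0.5):
    """Blend semantic distance with size pressure into a single score."""
    semantic_distance = 1.0 - similarity
    size_pressure = chunk_tokens / target_tokens  # grows past 1.0 once the chunk overshoots
    return semantic_weight * semantic_distance + (1.0 - semantic_weight) * size_pressure


def hybrid_chunks(sentence_lengths, similarities, target_tokens, threshold=0.5):
    """Group sentence indices into chunks; similarities[i] links sentence i and i + 1."""
    chunks, current, current_tokens = [], [0], sentence_lengths[0]
    for i, sim in enumerate(similarities):
        if boundary_score(sim, current_tokens, target_tokens) >= threshold:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(i + 1)
        current_tokens += sentence_lengths[i + 1]
    chunks.append(current)
    return chunks


# Six sentences of 10 tokens each; topic shifts after sentences 0 and 4,
# but both shifts occur while the current chunk is still small.
lengths = [10] * 6
similarities = [0.3, 0.9, 0.9, 0.9, 0.3]
chunks = hybrid_chunks(lengths, similarities, target_tokens=40)
```

In this example the early topic shift does not trigger a split because the chunk is still far below target, so no single-sentence orphan is produced, while the split after the fourth sentence is driven by size alone even though similarity stays high.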

6. Layout-Aware Chunker (High-Fidelity)

Uses document structure as the primary boundary signal.

Logic: Parses the Markdown/HTML AST to identify H1-H3 headers, table starts, and blockquote boundaries. It weights these structural breaks higher than semantic similarity.
Use Case: Technical documentation, legal filings, and financial reports where the structure defines logical scope.
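
The sketch below illustrates header-driven boundaries only. It deliberately swaps the AST parsing described above for a line-based regular expression over Markdown, so it does not handle tables, blockquotes, HTML, or "#" characters inside code fences; layout_chunks is a hypothetical name.

```python
import re

# Matches H1-H3 Markdown headers such as "## Install"; deeper levels stay in the body.
HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def layout_chunks(markdown):
    """Start a new chunk at every H1-H3 header and keep the header text as its title."""
    chunks, title, body = [], None, []

    def flush():
        if title is not None or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = (
    "# Guide\nIntro text.\n"
    "## Install\nRun the installer.\n#### Note\nDetails.\n"
    "## Usage\nCall the API."
)
chunks = layout_chunks(doc)
```

Here the H4 "Note" heading stays inside the "Install" chunk, reflecting the H1-H3 cutoff named in the logic above.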

7. Parent-Child Chunker (Retrieval Strategy)

This is a retrieval strategy that indexes small child chunks for precise semantic search but retrieves a larger parent block to provide the LLM with sufficient context.

Logic: Bi-directional mapping between dense child vectors and sparse/semantic parent blocks.
Advantage: Solves the "Narrow Context" problem, where a small chunk contains the answer but lacks the surrounding context the LLM needs to use it.
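
A toy version of the child-to-parent lookup is sketched below. Word overlap stands in for vector similarity so the example runs without an embedding model, children are fixed-size word windows, and the names build_parent_child_index and retrieve_parent are hypothetical.

```python
def build_parent_child_index(parents, child_size):
    """Split each parent into small children and record the mapping in both directions."""
    children, child_to_parent, parent_to_children = [], {}, {}
    for parent_id, parent in enumerate(parents):
        words = parent.split()
        parent_to_children[parent_id] = []
        for start in range(0, len(words), child_size):
            child_id = len(children)
            children.append(" ".join(words[start:start + child_size]))
            child_to_parent[child_id] = parent_id
            parent_to_children[parent_id].append(child_id)
    return children, child_to_parent, parent_to_children


def retrieve_parent(query, parents, children, child_to_parent):
    """Find the best-matching child, then return its full parent block."""
    query_words = set(query.lower().split())

    def overlap(child_id):
        # Assumption: shared-word count replaces embedding similarity.
        return len(query_words & set(children[child_id].lower().split()))

    best_child = max(range(len(children)), key=overlap)
    return parents[child_to_parent[best_child]]


parents = [
    "Refunds are processed within five days. Contact support to start a refund request.",
    "Shipping takes two weeks. Tracking numbers are emailed after dispatch.",
]
children, child_to_parent, parent_to_children = build_parent_child_index(parents, child_size=4)
context = retrieve_parent("tracking numbers", parents, children, child_to_parent)
```

The query matches a four-word child, but the caller receives the whole shipping paragraph, which is the behavior the strategy is designed to provide.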

Strategy Selection Matrix

| Strategy     | Performance Overhead | Best For              | Technical Insight                    |
|--------------|----------------------|-----------------------|--------------------------------------|
| Fixed        | Minimal              | Homogeneous Text      | Deterministic slicing.               |
| Semantic     | High (GPU)           | Topic-Drifting Text   | Sliding window cosine gradient.      |
| Hybrid       | Medium               | Noisy Production Data | Weighted scalar of topic vs. size.   |
| Layout-Aware | Low                  | Structured PDF/MD     | AST/Structural boundary analysis.    |
| Recursive    | Minimal              | Code/Markdown         | Nested hierarchy splitting.          |

Next Steps

* Synthetic Ground Truth
* Optimization Goals
* Evaluation Engine