Custom chunker¶

The Chunker protocol is minimal — implement chunk(document) -> list[Chunk] and you're done.

The protocol¶

Bases: Protocol

Splits a Document into a list of Chunks.

Contract — chunk.content is the exact text that will be embedded.

Implementations that prepend contextual information (e.g. heading hierarchy in a MarkdownChunker, code-block fences in a CodeChunker) MUST include that context in chunk.content, not only in chunk.metadata. The embedding cache keys off (model_id, sha256(chunk.content)); two chunks with the same body but different context would collide and return the wrong vector.

The companion chunk.content_hash is sha256(chunk.content) and is set by the implementation. Callers must not mutate chunk.content after the chunker returns.

Source code in src/cenote/chunkers/base.py

class Chunker(Protocol):
    """Splits a Document into a list of Chunks.

    Contract — `chunk.content` is the exact text that will be embedded.

    Implementations that prepend contextual information (e.g. heading hierarchy
    in a MarkdownChunker, code-block fences in a CodeChunker) MUST include that
    context in `chunk.content`, not only in `chunk.metadata`. The embedding
    cache keys off `(model_id, sha256(chunk.content))`; two chunks with the
    same body but different context would collide and return the wrong vector.

    The companion `chunk.content_hash` is `sha256(chunk.content)` and is set
    by the implementation. Callers must not mutate `chunk.content` after the
    chunker returns.
    """

    def chunk(self, document: Document) -> list[Chunk]:
        """Return the document split into ordered Chunks."""
        ...

`chunk(document: Document) -> list[Chunk]` ¶

Return the document split into ordered Chunks.

Source code in src/cenote/chunkers/base.py

def chunk(self, document: Document) -> list[Chunk]:
    """Return the document split into ordered Chunks."""
    ...

Contract¶

The Protocol docstring lays out the load-bearing invariant: chunk.content is the exact text that will be embedded. Implementations that prepend contextual information (heading hierarchy, code-block fences) must include that context in chunk.content itself, not only in chunk.metadata. The embedding cache keys off (model_id, sha256(chunk.content)); two chunks with the same body but different prepended context would otherwise collide.

Minimal example¶

from cenote.models import Chunk, Document
import hashlib


class SentenceChunker:
    """Splits on sentence boundaries (naive — period + space)."""

    def chunk(self, document: Document) -> list[Chunk]:
        sentences = [s.strip() for s in document.content.split(". ") if s.strip()]
        return [
            Chunk(
                id=Chunk.make_id(document.id, i),
                document_id=document.id,
                content=sentence,
                position=i,
                metadata=dict(document.metadata),
                content_hash=hashlib.sha256(sentence.encode()).hexdigest(),
            )
            for i, sentence in enumerate(sentences)
        ]

For production-quality sentence splitting, consider wrapping spaCy or nltk — but mind the new runtime dependency.

Custom chunker¶

The protocol¶

chunk(document: Document) -> list[Chunk] ¶

Contract¶

Minimal example¶

`chunk(document: Document) -> list[Chunk]` ¶