Skip to content

Custom chunker

The Chunker protocol is minimal — implement chunk(document) -> list[Chunk] and you're done.

The protocol

Bases: Protocol

Splits a Document into a list of Chunks.

Contract — chunk.content is the exact text that will be embedded.

Implementations that prepend contextual information (e.g. heading hierarchy in a MarkdownChunker, code-block fences in a CodeChunker) MUST include that context in chunk.content, not only in chunk.metadata. The embedding cache keys off (model_id, sha256(chunk.content)); two chunks with the same body but different context would collide and return the wrong vector.

The companion chunk.content_hash is sha256(chunk.content) and is set by the implementation. Callers must not mutate chunk.content after the chunker returns.

Source code in src/cenote/chunkers/base.py
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
class Chunker(Protocol):
    """Splits a Document into a list of Chunks.

    Contract — `chunk.content` is the exact text that will be embedded.

    Implementations that prepend contextual information (e.g. heading hierarchy
    in a MarkdownChunker, code-block fences in a CodeChunker) MUST include that
    context in `chunk.content`, not only in `chunk.metadata`. The embedding
    cache keys off `(model_id, sha256(chunk.content))`; two chunks with the
    same body but different context would collide and return the wrong vector.

    The companion `chunk.content_hash` is `sha256(chunk.content)` and is set
    by the implementation. Callers must not mutate `chunk.content` after the
    chunker returns.
    """

    def chunk(self, document: Document) -> list[Chunk]:
        """Return the document split into ordered Chunks."""
        ...

chunk(document: Document) -> list[Chunk]

Return the document split into ordered Chunks.

Source code in src/cenote/chunkers/base.py
27
28
29
def chunk(self, document: Document) -> list[Chunk]:
    """Return the document split into ordered Chunks."""
    ...

Contract

The Protocol docstring lays out the load-bearing invariant: chunk.content is the exact text that will be embedded. Implementations that prepend contextual information (heading hierarchy, code-block fences) must include that context in chunk.content itself, not only in chunk.metadata. The embedding cache keys off (model_id, sha256(chunk.content)); two chunks with the same body but different prepended context would otherwise collide.

Minimal example

from cenote.models import Chunk, Document
import hashlib


class SentenceChunker:
    """Splits on sentence boundaries (naive — period + space)."""

    def chunk(self, document: Document) -> list[Chunk]:
        sentences = [s.strip() for s in document.content.split(". ") if s.strip()]
        return [
            Chunk(
                id=Chunk.make_id(document.id, i),
                document_id=document.id,
                content=sentence,
                position=i,
                metadata=dict(document.metadata),
                content_hash=hashlib.sha256(sentence.encode()).hexdigest(),
            )
            for i, sentence in enumerate(sentences)
        ]

For production-quality sentence splitting, consider wrapping spaCy or nltk — but mind the new runtime dependency.