Custom chunker¶
The Chunker protocol is minimal — implement chunk(document) -> list[Chunk] and you're done.
The protocol¶
Bases: Protocol
Splits a Document into a list of Chunks.
Contract — chunk.content is the exact text that will be embedded.
Implementations that prepend contextual information (e.g. heading hierarchy
in a MarkdownChunker, code-block fences in a CodeChunker) MUST include that
context in chunk.content, not only in chunk.metadata. The embedding
cache keys off (model_id, sha256(chunk.content)); two chunks with the
same body but different context would collide and return the wrong vector.
The companion chunk.content_hash is sha256(chunk.content) and is set
by the implementation. Callers must not mutate chunk.content after the
chunker returns.
Source code in src/cenote/chunkers/base.py
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | |
chunk(document: Document) -> list[Chunk]
¶
Return the document split into ordered Chunks.
Source code in src/cenote/chunkers/base.py
27 28 29 | |
Contract¶
The Protocol docstring lays out the load-bearing invariant: chunk.content is the exact text that will be embedded. Implementations that prepend contextual information (heading hierarchy, code-block fences) must include that context in chunk.content itself, not only in chunk.metadata. The embedding cache keys off (model_id, sha256(chunk.content)); two chunks with the same body but different prepended context would otherwise collide.
Minimal example¶
from cenote.models import Chunk, Document
import hashlib
class SentenceChunker:
"""Splits on sentence boundaries (naive — period + space)."""
def chunk(self, document: Document) -> list[Chunk]:
sentences = [s.strip() for s in document.content.split(". ") if s.strip()]
return [
Chunk(
id=Chunk.make_id(document.id, i),
document_id=document.id,
content=sentence,
position=i,
metadata=dict(document.metadata),
content_hash=hashlib.sha256(sentence.encode()).hexdigest(),
)
for i, sentence in enumerate(sentences)
]
For production-quality sentence splitting, consider wrapping spaCy or nltk — but mind the new runtime dependency.