How to retrieve the whole document for a chunk

When splitting documents for retrieval, there are often conflicting desires:

  1. You may want to have small documents, so that their embeddings can most accurately reflect their meaning. If a document is too long, its embedding can lose meaning.
  2. You want to have long enough documents that the context of each chunk is retained.

The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent ids for those chunks and returns those larger documents.

Note that "parent document" refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.
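The lookup described above can be sketched without any library. This is a minimal illustration, not the ParentDocumentRetriever implementation: the names `ToyParentRetriever`, `splitIntoChunks`, and `overlapScore` are hypothetical, fixed-size character windows stand in for a real text splitter, and a word-overlap count stands in for embedding similarity.

```typescript
// Hypothetical in-memory sketch; none of these names come from LangChain.
interface ParentDoc {
  id: string;
  text: string;
}

interface ChildChunk {
  parentId: string;
  text: string;
}

// Fixed-size character windows, standing in for a real text splitter.
function splitIntoChunks(text: string, chunkSize: number): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// Toy relevance: how many query words occur in the chunk.
// A real retriever would compare embedding vectors instead.
function overlapScore(query: string, chunkText: string): number {
  const haystack = chunkText.toLowerCase();
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0 && haystack.includes(word)).length;
}

class ToyParentRetriever {
  private readonly parentStore = new Map<string, ParentDoc>();
  private readonly childIndex: ChildChunk[] = [];
  private readonly childChunkSize: number;
  private readonly childK: number;
  private readonly parentK: number;

  constructor(childChunkSize: number, childK: number, parentK: number) {
    this.childChunkSize = childChunkSize;
    this.childK = childK;
    this.parentK = parentK;
  }

  // Store each parent whole, and index its small chunks with the parent id.
  addDocuments(parents: ParentDoc[]): void {
    for (const parentDoc of parents) {
      this.parentStore.set(parentDoc.id, parentDoc);
      for (const piece of splitIntoChunks(parentDoc.text, this.childChunkSize)) {
        this.childIndex.push({ parentId: parentDoc.id, text: piece });
      }
    }
  }

  retrieve(query: string): ParentDoc[] {
    // 1. Rank child chunks and keep at most childK that match at all.
    const topChildren = this.childIndex
      .map((chunk) => ({ chunk, score: overlapScore(query, chunk.text) }))
      .filter((scored) => scored.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, this.childK);

    // 2. Collect unique parent ids, preserving the ranked order.
    const parentIds: string[] = [];
    for (const { chunk } of topChildren) {
      if (!parentIds.includes(chunk.parentId)) parentIds.push(chunk.parentId);
    }

    // 3. Return the larger parent documents, capped at parentK.
    return parentIds
      .slice(0, this.parentK)
      .map((id) => this.parentStore.get(id))
      .filter((doc): doc is ParentDoc => doc !== undefined);
  }
}
```

Because several child chunks can share a parent, the number of parents returned can be smaller than the number of child chunks matched, which is why the sketch takes separate child and parent limits.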

Usage

npm install @langchain/openai
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { InMemoryStore } from "langchain/storage/in_memory";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { TextLoader } from "langchain/document_loaders/fs/text";

const vectorstore = new MemoryVectorStore(new OpenAIEmbeddings());
const docstore = new InMemoryStore();
const retriever = new ParentDocumentRetriever({
  vectorstore,
  docstore,
  // Optional, not required if you're already passing in split documents
  parentSplitter: new RecursiveCharacterTextSplitter({
    chunkOverlap: 0,
    chunkSize: 500,
  }),
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkOverlap: 0,
    chunkSize: 50,
  }),
  // Optional `k` parameter to search for more child documents in VectorStore.
  // Note that this does not exactly correspond to the number of final (parent) documents
  // retrieved, as multiple child documents can point to the same parent.
  childK: 20,
  // Optional `k` parameter to limit number of final, parent documents returned from this
  // retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
  parentK: 5,
});
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

// We must add the parent documents via the retriever's addDocuments method
await retriever.addDocuments(parentDocuments);

const retrievedDocs = await retriever.invoke("justice breyer");

// Retrieved chunks are the larger parent chunks
console.log(retrievedDocs);
/*
  [
    Document {
      pageContent: 'Tonight, I call on the Senate to pass — pass the Freedom to Vote Act. Pass the John Lewis Act — Voting Rights Act. And while you’re at it, pass the DISCLOSE Act so Americans know who is funding our elections.\n' +
        '\n' +
        'Look, tonight, I’d — I’d like to honor someone who has dedicated his life to serve this country: Justice Breyer — an Army veteran, Constitutional scholar, retiring Justice of the United States Supreme Court.',
      metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
    },
    Document {
      pageContent: 'As I did four days ago, I’ve nominated a Circuit Court of Appeals — Ketanji Brown Jackson. One of our nation’s top legal minds who will continue in just Brey- — Justice Breyer’s legacy of excellence. A former top litigator in private practice, a former federal public defender from a family of public-school educators and police officers — she’s a consensus builder.',
      metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
    },
    Document {
      pageContent: 'Justice Breyer, thank you for your service. Thank you, thank you, thank you. I mean it. Get up. Stand — let me see you. Thank you.\n' +
        '\n' +
        'And we all know — no matter what your ideology, we all know one of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.',
      metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
    }
  ]
*/

With Score Threshold

By setting the options in scoreThresholdOptions, we can force the ParentDocumentRetriever to use the ScoreThresholdRetriever under the hood. The ScoreThresholdRetriever then uses the vector store we passed when initializing ParentDocumentRetriever, and we can also set a score threshold for retrieval.

This can be helpful when you're not sure how many documents you want (or if you are sure, just set the maxK option), but you want to make sure that the documents you do get are within a certain relevancy threshold.

Note: if a retriever is passed, ParentDocumentRetriever will use it by default for retrieving small chunks and for adding documents via the addDocuments method.
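To make the effect of a threshold concrete, here is a minimal sketch of threshold-based selection. It only illustrates the idea of "keep everything above a minimum score, up to a cap"; it is not how ScoreThresholdRetriever is implemented. The `ScoredChunk` type and `selectByThreshold` function are hypothetical, while the parameter names `minSimilarityScore` and `maxK` mirror the options used on this page.

```typescript
// Hypothetical shape: a chunk with a precomputed similarity to the query.
interface ScoredChunk {
  text: string;
  similarity: number; // assumed here to be in [0, 1], higher is more similar
}

// Keep chunks at or above the threshold, best first, and never more than maxK.
function selectByThreshold(
  candidates: ScoredChunk[],
  minSimilarityScore: number,
  maxK: number
): ScoredChunk[] {
  return [...candidates] // copy so the caller's array is not reordered
    .sort((a, b) => b.similarity - a.similarity)
    .filter((candidate) => candidate.similarity >= minSimilarityScore)
    .slice(0, maxK);
}

const scoredExample: ScoredChunk[] = [
  { text: "weather report", similarity: 0.12 },
  { text: "Justice Breyer, thank you", similarity: 0.91 },
  { text: "retiring Justice", similarity: 0.78 },
];

// With a high threshold the result size depends on the data, not on a fixed k.
console.log(selectByThreshold(scoredExample, 0.5, 10).map((c) => c.text));
// With a near-zero threshold and maxK of 1, only the single best chunk is kept.
console.log(selectByThreshold(scoredExample, 0.01, 1).map((c) => c.text));
```

The second call corresponds to the settings used in the example on this page, where the threshold is effectively disabled and `maxK` alone limits the result.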

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { InMemoryStore } from "langchain/storage/in_memory";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { ScoreThresholdRetriever } from "langchain/retrievers/score_threshold";

const vectorstore = new MemoryVectorStore(new OpenAIEmbeddings());
const docstore = new InMemoryStore();
const childDocumentRetriever = ScoreThresholdRetriever.fromVectorStore(
  vectorstore,
  {
    minSimilarityScore: 0.01, // Essentially no threshold
    maxK: 1, // Only return the top result
  }
);
const retriever = new ParentDocumentRetriever({
  vectorstore,
  docstore,
  childDocumentRetriever,
  // Optional, not required if you're already passing in split documents
  parentSplitter: new RecursiveCharacterTextSplitter({
    chunkOverlap: 0,
    chunkSize: 500,
  }),
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkOverlap: 0,
    chunkSize: 50,
  }),
});
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

// We must add the parent documents via the retriever's addDocuments method
await retriever.addDocuments(parentDocuments);

const retrievedDocs = await retriever.invoke("justice breyer");

// Retrieved chunk is the larger parent chunk
console.log(retrievedDocs);
/*
  [
    Document {
      pageContent: 'Tonight, I call on the Senate to pass — pass the Freedom to Vote Act. Pass the John Lewis Act — Voting Rights Act. And while you’re at it, pass the DISCLOSE Act so Americans know who is funding our elections.\n' +
        '\n' +
        'Look, tonight, I’d — I’d like to honor someone who has dedicated his life to serve this country: Justice Breyer — an Army veteran, Constitutional scholar, retiring Justice of the United States Supreme Court.',
      metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
    },
  ]
*/

With Contextual chunk headers

Consider a scenario where you want to store a collection of documents in a vector store and perform Q&A tasks on them. Simply splitting documents with overlapping text may not provide sufficient context for LLMs to determine whether multiple chunks are referencing the same information, or how to resolve information from contradictory sources.

Tagging each document with metadata is a solution if you know what to filter against, but you may not know ahead of time exactly what kind of queries your vector store will be expected to handle. Including additional contextual information directly in each chunk in the form of headers can help deal with arbitrary queries.

This is particularly important if you have several fine-grained child chunks that need to be correctly retrieved from the vector store.
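As a rough illustration of what a chunk header does to the stored text, the sketch below prepends a header to every child chunk and marks every chunk after the first with a "(cont'd) " prefix, matching the shape of the raw vector store entries shown later in this section. The function `addChunkHeaders` is hypothetical and only mimics that observed output; the option names `chunkHeader` and `appendChunkOverlapHeader` are taken from the example on this page, and the library's real splitting logic may differ.

```typescript
// Option names mirror the ones used on this page; the function itself is hypothetical.
interface ChunkHeaderOptions {
  chunkHeader: string;
  appendChunkOverlapHeader: boolean;
}

// Prefix every chunk with the header so the embedding carries document-level context.
// Chunks after the first are additionally marked as continuations.
function addChunkHeaders(
  chunks: string[],
  options: ChunkHeaderOptions
): string[] {
  return chunks.map((chunkText, index) => {
    const continuation =
      options.appendChunkOverlapHeader && index > 0 ? "(cont'd) " : "";
    return `${options.chunkHeader}${continuation}${chunkText}`;
  });
}

const pamHeaderedChunks = addChunkHeaders(
  ["My", "favorite", "color is", "red."],
  { chunkHeader: "DOC NAME: Pam Interview\n---\n", appendChunkOverlapHeader: true }
);
console.log(pamHeaderedChunks);
```

Without the header, the tiny chunk "red." carries no hint of whose interview it came from, so a query about Pam has nothing to match against. With the header, every child chunk mentions "Pam Interview", while the parent document returned to the caller stays unmodified.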

import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { InMemoryStore } from "langchain/storage/in_memory";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1500,
  chunkOverlap: 0,
});

const jimDocs = await splitter.createDocuments([`My favorite color is blue.`]);
const jimChunkHeaderOptions = {
  chunkHeader: "DOC NAME: Jim Interview\n---\n",
  appendChunkOverlapHeader: true,
};

const pamDocs = await splitter.createDocuments([`My favorite color is red.`]);
const pamChunkHeaderOptions = {
  chunkHeader: "DOC NAME: Pam Interview\n---\n",
  appendChunkOverlapHeader: true,
};

const vectorstore = await HNSWLib.fromDocuments([], new OpenAIEmbeddings());
const docstore = new InMemoryStore();

const retriever = new ParentDocumentRetriever({
  vectorstore,
  docstore,
  // Very small chunks for demo purposes.
  // Use a bigger chunk size for serious use-cases.
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 10,
    chunkOverlap: 0,
  }),
  childK: 50,
  parentK: 5,
});

// We pass additional option `childDocChunkHeaderOptions`
// that will add the chunk header to child documents
await retriever.addDocuments(jimDocs, {
  childDocChunkHeaderOptions: jimChunkHeaderOptions,
});
await retriever.addDocuments(pamDocs, {
  childDocChunkHeaderOptions: pamChunkHeaderOptions,
});

// This will search child documents in vector store with the help of chunk header,
// returning the unmodified parent documents
const retrievedDocs = await retriever.invoke("What is Pam's favorite color?");

// Pam's favorite color is returned first!
console.log(JSON.stringify(retrievedDocs, null, 2));
/*
[
{
"pageContent": "My favorite color is red.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
}
}
},
{
"pageContent": "My favorite color is blue.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
}
}
}
]
*/

const rawDocs = await vectorstore.similaritySearch(
"What is Pam's favorite color?"
);

// Raw docs in vectorstore are short but have chunk headers
console.log(JSON.stringify(rawDocs, null, 2));

/*
[
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) color is",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) favorite",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) red.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\nMy",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
}
]
*/

With Reranking

When many documents from the vector store are passed to the LLM, the final answer sometimes draws on irrelevant chunks, making it less precise and sometimes incorrect. Passing irrelevant documents also makes each call more expensive. Reranking therefore helps in two ways: precision and cost.
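The idea can be sketched as a second scoring pass followed by a cutoff. The code below is a simplified illustration under stated assumptions, not the CohereRerank API: `Candidate`, `RelevanceScorer`, `rerankAndFilter`, and `wordOverlapScorer` are hypothetical names, and the toy scorer simply measures the fraction of query words found in the text, where a real reranker would use a trained model.

```typescript
// Hypothetical types for illustration; not part of any library.
interface Candidate {
  pageContent: string;
}

interface RerankedCandidate extends Candidate {
  relevanceScore: number;
}

// A reranker is modeled as any function scoring a (query, text) pair in [0, 1].
type RelevanceScorer = (query: string, text: string) => number;

// Score every candidate, drop those below the minimum, and return the best topN first.
function rerankAndFilter(
  query: string,
  candidates: Candidate[],
  scorer: RelevanceScorer,
  minRelevanceScore: number,
  topN: number
): RerankedCandidate[] {
  return candidates
    .map((candidate) => ({
      ...candidate,
      relevanceScore: scorer(query, candidate.pageContent),
    }))
    .filter((candidate) => candidate.relevanceScore >= minRelevanceScore)
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topN);
}

// Toy scorer: fraction of query words that occur in the text.
const wordOverlapScorer: RelevanceScorer = (query, text) => {
  const words = query
    .toLowerCase()
    .split(/\W+/)
    .filter((word) => word.length > 0);
  if (words.length === 0) return 0;
  const lowered = text.toLowerCase();
  return words.filter((word) => lowered.includes(word)).length / words.length;
};

const rerankedExample = rerankAndFilter(
  "What is Pam's favorite color?",
  [
    { pageContent: "Jim favorite color is blue." },
    { pageContent: "Pam favorite color is red." },
    { pageContent: "The office closes at five." },
  ],
  wordOverlapScorer,
  0.3,
  5
);
console.log(rerankedExample.map((doc) => doc.pageContent));
```

Dropping low-scoring candidates before they reach the LLM addresses both concerns at once: fewer irrelevant passages can mislead the answer, and fewer tokens are sent per request.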

import { OpenAIEmbeddings } from "@langchain/openai";
import { CohereRerank } from "@langchain/cohere";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { InMemoryStore } from "langchain/storage/in_memory";
import {
  ParentDocumentRetriever,
  type SubDocs,
} from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Initialize Cohere Rerank. Remember to add COHERE_API_KEY to your .env
const reranker = new CohereRerank({
  topN: 50,
  model: "rerank-multilingual-v2.0",
});

export function documentCompressorFiltering({
  relevanceScore,
}: { relevanceScore?: number } = {}) {
  return (docs: SubDocs) => {
    let outputDocs = docs;

    if (relevanceScore) {
      const docsRelevanceScoreValues = docs.map(
        (doc) => doc?.metadata?.relevanceScore
      );
      // Keep documents at or above the threshold; a missing score defaults to 1.
      // `??` (rather than `||`) ensures a score of 0 is not treated as missing.
      outputDocs = docs.filter(
        (_doc, index) =>
          (docsRelevanceScoreValues?.[index] ?? 1) >= relevanceScore
      );
    }

    return outputDocs;
  };
}

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 0,
});

const jimDocs = await splitter.createDocuments([`Jim favorite color is blue.`]);

const pamDocs = await splitter.createDocuments([`Pam favorite color is red.`]);

const vectorstore = await HNSWLib.fromDocuments([], new OpenAIEmbeddings());
const docstore = new InMemoryStore();

const retriever = new ParentDocumentRetriever({
  vectorstore,
  docstore,
  // Very small chunks for demo purposes.
  // Use a bigger chunk size for serious use-cases.
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 10,
    chunkOverlap: 0,
  }),
  childK: 50,
  parentK: 5,
  // We add Reranker
  documentCompressor: reranker,
  documentCompressorFilteringFn: documentCompressorFiltering({
    relevanceScore: 0.3,
  }),
});

const docs = jimDocs.concat(pamDocs);
await retriever.addDocuments(docs);

// This searches the vector store and returns documents that are already reranked,
// sorted, and filtered by the minimum relevance score, ready to pass to the LLM
const retrievedDocs = await retriever.invoke(
  "What is Pam's favorite color?"
);

// Pam's favorite color is returned first!
console.log(JSON.stringify(retrievedDocs, null, 2));
/*
  [
    {
      "pageContent": "Pam favorite color is red.",
      "metadata": {
        "relevanceScore": 0.9,
        "loc": {
          "lines": {
            "from": 1,
            "to": 1
          }
        }
      }
    }
  ]
*/
