[Question]: Remote Caching with IngestionPipeline? #17583
Comments
To control the batch size of the nodes being processed, you can use the _node_batcher helper on IngestionPipeline, which splits a list of nodes into a fixed number of chunks. Here's the method:

def _node_batcher(
    num_batches: int, nodes: Union[Sequence[BaseNode], List[Document]]
) -> Generator[Union[Sequence[BaseNode], List[Document]], Any, Any]:
    """Yield successive n-sized chunks from lst."""
    batch_size = max(1, int(len(nodes) / num_batches))
    for i in range(0, len(nodes), batch_size):
        yield nodes[i : i + batch_size]

By choosing the number of batches you want to process, you can divide your nodes into manageable chunks so that each batch stays small enough for the cache to handle.
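A minimal sketch of how the same chunking idea can be applied from user code, assuming you already have a configured pipeline (an IngestionPipeline with a Redis-backed IngestionCache) and a documents list; both names are placeholders and the batch size of 100 is arbitrary:

from typing import Generator, List

from llama_index.core import Document
from llama_index.core.schema import BaseNode


def batch_documents(
    documents: List[Document], batch_size: int
) -> Generator[List[Document], None, None]:
    """Yield successive batch_size-sized slices of the document list."""
    for i in range(0, len(documents), batch_size):
        yield documents[i : i + batch_size]


all_nodes: List[BaseNode] = []
for doc_batch in batch_documents(documents, batch_size=100):
    # Each run() call only writes this batch's nodes to the cache,
    # keeping individual Redis payloads small.
    all_nodes.extend(pipeline.run(documents=doc_batch))

Running the pipeline once per batch trades a few extra cache round trips for bounded write sizes, which is usually what a remote cache like Redis needs.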
@dosu this "private" method seems to be called only when using multiprocessing by setting num_workers. Is there a way to configure the batch size?
Yes, the batch size is effectively controlled by num_workers: when num_workers > 1, the nodes are split into num_workers batches. Here's a relevant excerpt from the code:

@staticmethod
def _node_batcher(
num_batches: int, nodes: Union[Sequence[BaseNode], List[Document]]
) -> Generator[Union[Sequence[BaseNode], List[Document]], Any, Any]:
"""Yield successive n-sized chunks from lst."""
batch_size = max(1, int(len(nodes) / num_batches))
for i in range(0, len(nodes), batch_size):
yield nodes[i : i + batch_size]
@dispatcher.span
async def arun(
...
num_workers: Optional[int] = None,
...
) -> Sequence[BaseNode]:
...
if num_workers and num_workers > 1:
...
with ProcessPoolExecutor(max_workers=num_workers) as p:
node_batches = self._node_batcher(
num_batches=num_workers, nodes=nodes_to_run
)
tasks = [
loop.run_in_executor(
p,
partial(
arun_transformations_wrapper,
transformations=self.transformations,
in_place=in_place,
cache=self.cache if not self.disable_cache else None,
cache_collection=cache_collection,
),
batch,
)
for batch in node_batches
]
result: Sequence[Sequence[BaseNode]] = await asyncio.gather(*tasks)
nodes: Sequence[BaseNode] = reduce(lambda x, y: x + y, result, []) # type: ignore
...

This setup allows the pipeline to handle large numbers of documents efficiently by processing them in parallel batches.
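For reference, a sketch of how that parallel path is triggered from user code, assuming an existing pipeline and documents list (both names are illustrative):

import asyncio


async def ingest():
    # num_workers > 1 takes the ProcessPoolExecutor branch above, so the
    # nodes are split into num_workers batches by _node_batcher.
    return await pipeline.arun(documents=documents, num_workers=4)


nodes = asyncio.run(ingest())

Since each worker processes one batch, increasing num_workers should also shrink the set of nodes covered by each cache write.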
Question
I am trying to implement an advanced document ingestion pipeline with a Redis cache using the examples found in the documentation.
https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/
Examining the source code for the ingestion pipeline (llama_index/llama-index-core/llama_index/core/ingestion/pipeline.py, line 135 in b020aa0), it would seem that all nodes are cached following each transformation. I am processing several thousand documents in my ingestion pipeline and am noticing that the Redis cache crashes when the number of documents processed during a single call of run or arun is too large. I believe this is because once the number of nodes gets sufficiently high, the volume of data being put into the cache at once is too big for Redis to handle. Is there a way to control the "batch" size of the nodes being processed so that I don't overflow the cache?
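For context, the documentation page linked above configures a Redis-backed cache roughly like the sketch below; the host, port, collection name, and transformations are placeholders, and exact import paths may differ across llama-index versions:

from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.storage.kvstore.redis import RedisKVStore as RedisCache

# Remote cache: each transformation's output nodes are put into this store.
ingest_cache = IngestionCache(
    cache=RedisCache.from_host_and_port(host="127.0.0.1", port=6379),
    collection="my_ingestion_cache",
)

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    cache=ingest_cache,
)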