fix: run out of memory due to indexing/embedding documents #12882
base: main
Conversation
Please link an existing issue or open an issue first.
Please run the lint checks and make sure they pass.
batch_documents = texts[i : i + max_batch_documents]
batch_contents = [document.page_content for document in batch_documents]
batch_embeddings = self._embeddings.embed_documents(batch_contents)
self._vector_processor.create(texts=batch_documents, embeddings=batch_embeddings, **kwargs)
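For context, here is a minimal sketch of the batching pattern this hunk implements; the enclosing loop is not visible in the snippet. The standalone function wrapper, the default batch size of 100, and the argument names embeddings_model / vector_processor are illustrative stand-ins for self._embeddings / self._vector_processor, not code from the repository.

```python
def create_in_batches(texts, embeddings_model, vector_processor, max_batch_documents=100, **kwargs):
    """Embed and store documents batch by batch so that only one batch of
    embedding vectors is held in memory at a time, instead of embedding the
    whole document list in a single call."""
    for i in range(0, len(texts), max_batch_documents):
        batch_documents = texts[i : i + max_batch_documents]
        batch_contents = [document.page_content for document in batch_documents]
        batch_embeddings = embeddings_model.embed_documents(batch_contents)
        # Mirrors the diff above: create() is called once per batch, which is
        # exactly what the review comment below objects to.
        vector_processor.create(texts=batch_documents, embeddings=batch_embeddings, **kwargs)
```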
It's risky and wrong to repeatedly create the collection and the underlying indexes in the vdb, which may cause inconsistency or errors.
Correct it to (see the sketch below):
- create the collection first, with an empty array
- loop over the batched documents and use add_texts to append to the existing collection
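A minimal sketch of this suggested flow, assuming add_texts accepts the same texts= and embeddings= arguments as create; the wrapper function, the default batch size, and the parameter names are illustrative, not taken from the codebase.

```python
def create_then_append(texts, embeddings_model, vector_processor, max_batch_documents=100, **kwargs):
    """Create the collection and its index exactly once, then append batches
    to the existing collection instead of calling create() per batch."""
    # 1. Create the collection/index a single time, with an empty payload.
    vector_processor.create(texts=[], embeddings=[], **kwargs)
    # 2. Append each embedded batch to the already-existing collection.
    for i in range(0, len(texts), max_batch_documents):
        batch_documents = texts[i : i + max_batch_documents]
        batch_contents = [document.page_content for document in batch_documents]
        batch_embeddings = embeddings_model.embed_documents(batch_contents)
        vector_processor.add_texts(texts=batch_documents, embeddings=batch_embeddings, **kwargs)
```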
if texts:
    embeddings = self._embeddings.embed_documents([document.page_content for document in texts])
    self._vector_processor.create(texts=texts, embeddings=embeddings, **kwargs)
add_texts() won't create the collection.
BTW, it's curious to find that self._vector_processor.create is used in both create and add_texts in vector_factory.py, which may possibly cause repeated index creation (the distributed lock in Redis prevents it), even without this PR.
Can we merge this PR first?
> BTW, it's curious to find that self._vector_processor.create is used in both create and add_texts in vector_factory.py, which may possibly cause repeated index creation (the distributed lock in Redis prevents it), even without this PR.

Should vector.add_texts call _vector_processor.add_texts instead of _vector_processor.create at line 164?
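If the answer is yes, the change being asked about would look roughly like the sketch below. This is only an illustration built from the names mentioned in this thread; the actual class layout and method signatures in vector_factory.py may differ.

```python
class Vector:
    # __init__ (not shown) is assumed to set self._embeddings and self._vector_processor.

    def add_texts(self, documents, **kwargs):
        embeddings = self._embeddings.embed_documents(
            [document.page_content for document in documents]
        )
        # Before (per the comment above): delegates to create(), re-running
        # collection/index creation on every call.
        # self._vector_processor.create(texts=documents, embeddings=embeddings, **kwargs)
        #
        # After (proposed): append to the already-existing collection instead.
        self._vector_processor.add_texts(texts=documents, embeddings=embeddings, **kwargs)
```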
Summary
Fixes #12893

Tip
Close issue syntax: Fixes #<issue number> or Resolves #<issue number>, see documentation for more details.

Screenshots
Checklist
Important
Please review the checklist below before submitting your pull request.
dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods