
fix: run out of memory due to indexing/embedding documents #12882

Open · wants to merge 11 commits into main

Conversation

@rayshaw001 (Contributor) commented Jan 20, 2025

Summary

Fixes #12893


Screenshots

Before: document pending indexing. [screenshot]
After: document available. [screenshot]


Checklist

Important

Please review the checklist below before submitting your pull request.

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. 🐞 bug Something isn't working labels Jan 20, 2025
@rayshaw001 rayshaw001 changed the title fix:run out of mem due to indexing documents fix:run out of memory due to indexing/embedding documents Jan 20, 2025
@crazywoola (Member) commented:

Please link an existing issue or open an issue first.

@rayshaw001 rayshaw001 changed the title fix:run out of memory due to indexing/embedding documents fix: run out of memory due to indexing/embedding documents Jan 21, 2025
@crazywoola (Member) commented:

Please run dev/reformat to pass the lint.

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Jan 21, 2025
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jan 21, 2025
@rayshaw001 (Contributor, Author) commented:

> Please run dev/reformat to pass the lint.

Lint check passes now.

@crazywoola crazywoola requested a review from JohnJyong January 21, 2025 04:45
# Batched path added by this PR (excerpt from inside the loop that walks `texts`
# in steps of max_batch_documents): only one batch of embeddings is in memory at a time.
batch_documents = texts[i : i + max_batch_documents]
batch_contents = [document.page_content for document in batch_documents]
batch_embeddings = self._embeddings.embed_documents(batch_contents)
self._vector_processor.create(texts=batch_documents, embeddings=batch_embeddings, **kwargs)
@bowenliang123 (Contributor) commented Jan 21, 2025:

It's risky and incorrect to repeatedly create the collection and the underlying index in the vector database; doing so may cause inconsistency or errors. Correct it to:

  1. Create the collection first, with an empty array.
  2. Loop over the batched documents and use add_texts to append to the existing collection, as in the sketch below.
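
A minimal sketch of that flow, assuming a Vector-style wrapper in which _vector_processor.create builds the collection and _vector_processor.add_texts appends to it (the helper name create_in_batches and the exact signatures are illustrative, not code from this PR):

def create_in_batches(self, texts, max_batch_documents=64, **kwargs):
    # Hypothetical helper illustrating the reviewer's suggestion.
    if not texts:
        return
    # 1. Create the collection (and its index) exactly once, with an empty payload.
    self._vector_processor.create(texts=[], embeddings=[], **kwargs)
    # 2. Embed and append batch by batch so only one batch is resident in memory.
    for i in range(0, len(texts), max_batch_documents):
        batch = texts[i : i + max_batch_documents]
        embeddings = self._embeddings.embed_documents([d.page_content for d in batch])
        self._vector_processor.add_texts(documents=batch, embeddings=embeddings, **kwargs)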

@rayshaw001 (Contributor, Author) commented Jan 21, 2025:

Is add_texts thread safe? add_texts filters duplicated documents, and there are 10 workers running concurrently.
[screenshot]

# Original create() path: embeds every document in a single call, so a huge file
# pulls all of its embeddings into memory at once.
if texts:
    embeddings = self._embeddings.embed_documents([document.page_content for document in texts])
    self._vector_processor.create(texts=texts, embeddings=embeddings, **kwargs)
A collaborator commented:

add_text() won't create the collection.

@bowenliang123 (Contributor) commented Jan 22, 2025:

BTW, it's curious to find that self._vector_processor.create is used in both create and add_texts in vector_factory.py, which could cause repeated index creation (a distributed lock in Redis prevents it), even without this PR.
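
For reference, the kind of guard described here looks roughly like the sketch below. It assumes a redis-py client and illustrative key names; the actual lock and cache keys used in Dify may differ.

import redis

redis_client = redis.Redis()  # illustrative client; Dify wires up its own shared instance

def create_collection_once(collection_name, do_create):
    # Serialize collection/index creation across concurrent workers with a
    # distributed lock, and cache the fact that the collection already exists.
    lock_name = "vector_indexing_lock_" + collection_name
    cache_key = "vector_indexing_" + collection_name
    with redis_client.lock(lock_name, timeout=20):
        if redis_client.get(cache_key):
            return  # another worker already created it; nothing to do
        do_create()
        redis_client.set(cache_key, 1, ex=3600)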

@rayshaw001 (Contributor, Author) commented:

Can we merge this PR first?

@rayshaw001 (Contributor, Author) commented Jan 22, 2025:

> BTW, it's curious to find that self._vector_processor.create is used in both create and add_texts in vector_factory.py, which could cause repeated index creation (a distributed lock in Redis prevents it), even without this PR.

Should vector.add_texts call _vector_processor.add_texts instead of _vector_processor.create at line 164?
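
If that change were made, the add_texts path would look roughly like the sketch below (assuming _vector_processor.add_texts accepts the documents together with their embeddings, as discussed above; the real signature in vector_factory.py may differ):

def add_texts(self, documents, **kwargs):
    if not documents:
        return
    embeddings = self._embeddings.embed_documents([d.page_content for d in documents])
    # Append to the already-created collection instead of calling create() again.
    self._vector_processor.add_texts(documents=documents, embeddings=embeddings, **kwargs)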

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Jan 22, 2025
Labels: 🐞 bug (Something isn't working), size:S (This PR changes 10-29 lines, ignoring generated files)
Projects: None yet
Development: Successfully merging this pull request may close these issues: huge csv file occupies huge memory while indexing document
4 participants