
fix: run out of memory due to indexing/embedding documents #12882

Open · wants to merge 11 commits into main

Conversation

@rayshaw001 (Contributor) commented Jan 20, 2025

Summary

Fixes #12893


Screenshots

Before: document pending indexing. [screenshot]
After: document available. [screenshot]


Checklist

Important

Please review the checklist below before submitting your pull request.

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. 🐞 bug Something isn't working labels Jan 20, 2025
@rayshaw001 rayshaw001 changed the title fix:run out of mem due to indexing documents fix:run out of memory due to indexing/embedding documents Jan 20, 2025
@crazywoola (Member) commented:

Please link an existing issue or open an issue first.

@rayshaw001 rayshaw001 changed the title fix:run out of memory due to indexing/embedding documents fix: run out of memory due to indexing/embedding documents Jan 21, 2025
@crazywoola (Member) commented:

Please run dev/reformat to pass the lint.

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Jan 21, 2025
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jan 21, 2025
@rayshaw001 (Contributor, Author) commented:

> Please run dev/reformat to pass the lint.

Lint check passes now.

@crazywoola crazywoola requested a review from JohnJyong January 21, 2025 04:45
# Batched path added by this PR (excerpt from inside the loop that walks `texts`
# in steps of max_batch_documents): only one batch of embeddings is in memory at a time.
batch_documents = texts[i : i + max_batch_documents]
batch_contents = [document.page_content for document in batch_documents]
batch_embeddings = self._embeddings.embed_documents(batch_contents)
self._vector_processor.create(texts=batch_documents, embeddings=batch_embeddings, **kwargs)
@bowenliang123 (Contributor) commented Jan 21, 2025:

It's risky and incorrect to repeatedly create the collection and the underlying index in the vector database; doing so may cause inconsistency or errors. Correct it to:

  1. Create the collection first, with an empty array.
  2. Loop over the batched documents and use add_texts to append to the existing collection, as in the sketch below.
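
A minimal sketch of that flow, assuming a Vector-style wrapper in which _vector_processor.create builds the collection and _vector_processor.add_texts appends to it (the helper name create_in_batches and the exact signatures are illustrative, not code from this PR):

def create_in_batches(self, texts, max_batch_documents=64, **kwargs):
    # Hypothetical helper illustrating the reviewer's suggestion.
    if not texts:
        return
    # 1. Create the collection (and its index) exactly once, with an empty payload.
    self._vector_processor.create(texts=[], embeddings=[], **kwargs)
    # 2. Embed and append batch by batch so only one batch is resident in memory.
    for i in range(0, len(texts), max_batch_documents):
        batch = texts[i : i + max_batch_documents]
        embeddings = self._embeddings.embed_documents([d.page_content for d in batch])
        self._vector_processor.add_texts(documents=batch, embeddings=embeddings, **kwargs)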

@rayshaw001 (Contributor, Author) commented Jan 21, 2025:

Is add_texts thread safe? add_texts filters duplicated documents, and there are 10 workers running concurrently.
[screenshot]

# Original create() path: embeds every document in a single call, so a huge file
# pulls all of its embeddings into memory at once.
if texts:
    embeddings = self._embeddings.embed_documents([document.page_content for document in texts])
    self._vector_processor.create(texts=texts, embeddings=embeddings, **kwargs)
A collaborator commented:

add_text() won't create the collection.

@bowenliang123 (Contributor) commented Jan 22, 2025:

BTW, it's curious to find that self._vector_processor.create is used in both create and add_texts in vector_factory.py, which could cause repeated index creation (a distributed lock in Redis prevents it), even without this PR.
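
For reference, the kind of guard described here looks roughly like the sketch below. It assumes a redis-py client and illustrative key names; the actual lock and cache keys used in Dify may differ.

import redis

redis_client = redis.Redis()  # illustrative client; Dify wires up its own shared instance

def create_collection_once(collection_name, do_create):
    # Serialize collection/index creation across concurrent workers with a
    # distributed lock, and cache the fact that the collection already exists.
    lock_name = "vector_indexing_lock_" + collection_name
    cache_key = "vector_indexing_" + collection_name
    with redis_client.lock(lock_name, timeout=20):
        if redis_client.get(cache_key):
            return  # another worker already created it; nothing to do
        do_create()
        redis_client.set(cache_key, 1, ex=3600)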

@rayshaw001 (Contributor, Author) commented:

Can we merge this PR first?

@rayshaw001 (Contributor, Author) commented Jan 22, 2025:

> BTW, it's curious to find that self._vector_processor.create is used in both create and add_texts in vector_factory.py, which could cause repeated index creation (a distributed lock in Redis prevents it), even without this PR.

Should vector.add_texts call _vector_processor.add_texts instead of _vector_processor.create at line 164?
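
If that change were made, the add_texts path would look roughly like the sketch below (assuming _vector_processor.add_texts accepts the documents together with their embeddings, as discussed above; the real signature in vector_factory.py may differ):

def add_texts(self, documents, **kwargs):
    if not documents:
        return
    embeddings = self._embeddings.embed_documents([d.page_content for d in documents])
    # Append to the already-created collection instead of calling create() again.
    self._vector_processor.add_texts(documents=documents, embeddings=embeddings, **kwargs)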

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Jan 22, 2025
Labels: 🐞 bug (Something isn't working), size:S (This PR changes 10-29 lines, ignoring generated files)
Projects: None yet
Development: Successfully merging this pull request may close these issues: huge csv file occupies huge memory while indexing document
4 participants