fix document extractor node incorrectly processing doc and ppt files #12902

AugNSo · 2025-01-21T06:59:29Z

Summary

The document extractor node currently uses the python-docx library to handle doc files, which that library cannot process. This commit uses the unstructured API to handle doc files instead.
The document extractor node currently uses partition_ppt to handle ppt files, probably a leftover from when unstructured[ppt] was still in pyproject.toml. This commit uses the unstructured API to handle ppt files instead.
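
For illustration, the change described above could be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `UnstructuredSettings` stands in for the two `dify_config` values, and `extract_via_unstructured_api` and the `partition` callable are hypothetical names for whatever client actually talks to the unstructured API. It also assumes both the URL and the key are required.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class TextExtractionError(Exception):
    """Raised when a file cannot be converted to text."""


@dataclass(frozen=True)
class UnstructuredSettings:
    # Stand-ins for dify_config.UNSTRUCTURED_API_URL and
    # dify_config.UNSTRUCTURED_API_KEY (assumed to be optional strings).
    api_url: Optional[str] = None
    api_key: Optional[str] = None


# Hypothetical client signature:
# (file_bytes, file_name, api_url, api_key) -> list of text chunks.
RemotePartition = Callable[[bytes, str, str, str], list[str]]


def extract_via_unstructured_api(
    content: bytes,
    file_name: str,
    settings: UnstructuredSettings,
    partition: RemotePartition,
) -> str:
    """Extract text from a legacy .doc or .ppt file through a remote API."""
    # Fail early with a clear message if the API is not configured.
    if not (settings.api_url and settings.api_key):
        raise TextExtractionError(
            "UNSTRUCTURED_API_URL and UNSTRUCTURED_API_KEY must be set "
            f"to process {file_name}"
        )
    try:
        chunks = partition(content, file_name, settings.api_url, settings.api_key)
    except Exception as e:
        # Wrap any client failure in the extractor's own error type.
        raise TextExtractionError(
            f"Failed to extract text from {file_name}: {e}"
        ) from e
    return "\n".join(chunks)
```

Passing the client in as a callable keeps the sketch testable without network access; a unit test can supply a fake that returns fixed chunks.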

Checklist

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.
I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

crazywoola · 2025-01-22T02:59:59Z

Please fix the lint errors.

AugNSo · 2025-01-22T03:15:31Z

Run Pytest / API Tests (3.12) (pull_request): Cancelled after 2m

Hi, I believe the error was caused not by linting but by a failing test for the _extract_text_from_doc function. The unit test calls that function to process docx files; with my new code it should call _extract_text_from_docx instead.

I could try merging these two functions under _extract_text_from_doc, but I don't really think that is a good idea...

EDIT: updated the unit test
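
To make the doc/docx distinction concrete, here is a minimal sketch of keeping the two extractors separate and choosing one by file extension. The two extractor bodies are placeholders, and `select_extractor` and `EXTRACTORS` are hypothetical names used only for this example; the actual dispatch in the node may look different.

```python
import os
from typing import Callable, Dict

Extractor = Callable[[bytes], str]


def _extract_text_from_doc(content: bytes) -> str:
    # Placeholder body: in this PR the legacy .doc path goes through
    # the unstructured API.
    return "handled by doc extractor"


def _extract_text_from_docx(content: bytes) -> str:
    # Placeholder body: .docx files stay on the python-docx path.
    return "handled by docx extractor"


# Hypothetical dispatch table keyed by lowercase file extension.
EXTRACTORS: Dict[str, Extractor] = {
    ".doc": _extract_text_from_doc,
    ".docx": _extract_text_from_docx,
}


def select_extractor(file_name: str, extractors: Dict[str, Extractor]) -> Extractor:
    """Return the extractor registered for the file's extension."""
    ext = os.path.splitext(file_name)[1].lower()
    try:
        return extractors[ext]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {ext!r}") from None
```

With this layout, a unit test for docx input exercises `_extract_text_from_docx`, and `_extract_text_from_doc` is only reached for `.doc` files.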

fix: document extractor node incorrectly handles doc and ppt files

de1aa87

dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and 🐞 bug (Something isn't working) labels Jan 21, 2025

crazywoola requested a review from JohnJyong January 22, 2025 01:49

update document extractor unit test for docx file

accd3c0

AugNSo force-pushed the dev branch from c820eda to accd3c0 Compare January 22, 2025 03:21

AugNSo added 2 commits January 22, 2025 15:03

fix error found by mypy

5b97a19

add checking for dify_config.UNSTRUCTURED_API_URL and dify_config.UNSTRUCTURED_API_KEY

55e400e
