Incorrect File Format Returned for .docx Upload via Remote URL #12551

Open
5 tasks done
yjc980121 opened this issue Jan 9, 2025 · 3 comments · May be fixed by #12693
Labels
🐞 bug Something isn't working

Comments

@yjc980121

Self Checks

  • This is only for bug reports. If you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

There’s an issue with remote URL uploads: .docx files are being incorrectly identified as .bin, while .pdf files work fine. Please fix this and verify other formats (e.g., .xlsx, .pptx, .txt) to ensure consistency.

The local file transfer method works correctly.

Affected Dify versions: 0.11.0 to 0.15.0

These are the start node request and response:

request:
{
  "files": [
    {
      "dify_model_identity": "dify__file",
      "id": null,
      "tenant_id": "cd8dce6c-0fda-4e5c-94d8-56fde48cea09",
      "type": "document",
      "transfer_method": "remote_url",
      "remote_url": "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx",
      "related_id": null,
      "filename": "8b9f011bd7eb43cdb88f49d9f811b440.docx",
      "extension": ".bin",
      "mime_type": "application/octet-stream",
      "size": 7876,
      "url": "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx"
    }
  ],
  "requestId": "32c0dee160da4e4b8ee6ed83924352a1",
  "profile": "uat",
  "sys.files": [],
  "sys.user_id": "123",
  "sys.app_id": "afeef743-78e2-40ad-a49e-c8539afaaaeb",
  "sys.workflow_id": "bfdc6dc3-cac7-4b3a-990c-e31c5e0ff5c6",
  "sys.workflow_run_id": "1b56e5bb-fb76-49b7-b6d6-115e8550ce0c"
}

response:
{
  "files": [
    {
      "dify_model_identity": "dify__file",
      "id": null,
      "tenant_id": "cd8dce6c-0fda-4e5c-94d8-56fde48cea09",
      "type": "document",
      "transfer_method": "remote_url",
      "remote_url": "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx",
      "related_id": null,
      "filename": "8b9f011bd7eb43cdb88f49d9f811b440.docx",
      "extension": ".bin",
      "mime_type": "application/octet-stream",
      "size": 7876,
      "url": "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx"
    }
  ],
  "requestId": "32c0dee160da4e4b8ee6ed83924352a1",
  "sys.files": [],
  "sys.user_id": "123",
  "sys.app_id": "afeef743-78e2-40ad-a49e-c8539afaaaeb",
  "sys.workflow_id": "bfdc6dc3-cac7-4b3a-990c-e31c5e0ff5c6",
  "sys.workflow_run_id": "1b56e5bb-fb76-49b7-b6d6-115e8550ce0c"
}

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 9, 2025

dosubot bot commented Jan 9, 2025

The issue you're experiencing with remote URL uploads in Dify, where .docx files are being incorrectly identified as .bin, is likely related to how Dify handles file type detection for remote URLs. When a file is uploaded via a remote URL, Dify sets the file extension to an empty string, which might lead to incorrect MIME type detection, defaulting to application/octet-stream [1].

To address this, you might need to ensure that the MIME type is correctly set or detected when handling remote URL uploads. This could involve checking the server-side logic that processes these uploads to ensure it correctly interprets the file type based on the content or headers of the file, rather than relying solely on the file extension.
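
As a rough illustration of detecting the type from the content rather than the extension, here is a minimal sketch that uses only the standard library. It relies on the fact that .docx, .xlsx and .pptx files are ZIP packages containing a `word/`, `xl/` or `ppt/` directory. The function name `sniff_ooxml_type` and the marker table are hypothetical and are not part of Dify.

```python
import io
import zipfile
from typing import Optional

# Hypothetical marker table: top-level directory inside the ZIP package -> MIME type.
OOXML_MARKERS = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def sniff_ooxml_type(data: bytes) -> Optional[str]:
    """Return a MIME type for ZIP-based content, or None if it is not a ZIP."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return None  # Not a ZIP container, so not an Office Open XML file.
    for prefix, mime in OOXML_MARKERS.items():
        if any(name.startswith(prefix) for name in names):
            return mime
    return "application/zip"  # A ZIP archive without a known Office layout.
```

Note that this needs the complete body, because the ZIP directory sits at the end of the file, so a short prefix of the download is not enough for this particular check.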

If the issue persists, reviewing the code paths or modules involved in processing remote URL uploads might provide further insights into where the detection is failing.


@yjc980121
Author

After reviewing the file api/controllers/common/helpers.py, I noticed that it uses mimetypes.guess_type to determine the file type. However, this approach first guesses the file type based on the file name and then relies on the Content-Type header from the response. For binary files, the initial guess often results in a binary type, i.e., .bin, which can obscure the original Content-Type. This might lead to incorrect handling of the file's actual type.
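
To make the precedence problem concrete, here is a minimal sketch of one possible ordering: trust a specific Content-Type header, treat a generic one such as application/octet-stream as "unknown", and only then fall back to the file name. The function `resolve_mime_type` and both lookup tables are assumptions for illustration and do not reflect the actual code in helpers.py.

```python
import mimetypes
import os
from typing import Optional

# Header values that carry no real type information (assumed list).
GENERIC_TYPES = {"application/octet-stream", "binary/octet-stream"}

# Hypothetical fallback table, consulted before the mimetypes module so the
# result does not depend on the platform's MIME database.
OFFICE_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def resolve_mime_type(filename: str, content_type: Optional[str]) -> str:
    # Drop parameters such as "; charset=utf-8" and normalise case.
    header = (content_type or "").split(";", 1)[0].strip().lower()
    if header and header not in GENERIC_TYPES:
        return header  # A specific header wins.
    ext = os.path.splitext(filename)[1].lower()
    if ext in OFFICE_TYPES:
        return OFFICE_TYPES[ext]
    guessed, _ = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"
```

With the values from the report (a .docx file name and an application/octet-stream header), this ordering yields the Word MIME type rather than the generic one.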

Here's a reference code to fix the issue:

import os
import re
import uuid
import magic
import mimetypes
import httpx
from urllib.parse import urlparse

class FileInfo:
    def __init__(self, filename, extension, mimetype, size):
        self.filename = filename
        self.extension = extension
        self.mimetype = mimetype
        self.size = size

def guess_file_info_from_response(response: httpx.Response):
    url = str(response.url)
    # Try to extract filename from URL
    parsed_url = urlparse(url)
    url_path = parsed_url.path
    filename = os.path.basename(url_path)

    # If filename couldn't be extracted, use Content-Disposition header
    if not filename:
        content_disposition = response.headers.get("Content-Disposition")
        if content_disposition:
            # Stop at a closing quote or ";" so the match excludes trailing quotes and parameters
            filename_match = re.search(r'filename="?([^";]+)"?', content_disposition)
            if filename_match:
                filename = filename_match.group(1)

    # If still no filename, generate a unique one
    if not filename:
        unique_name = str(uuid.uuid4())
        filename = f"{unique_name}"

    # Guess MIME type from filename first, then URL
    mimetype, _ = mimetypes.guess_type(filename)
    if mimetype is None:
        mimetype, _ = mimetypes.guess_type(url)
    if mimetype is None:
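        # If guessing fails, use Content-Type from response headers,
        # dropping parameters such as "; charset=..."
        content_type = response.headers.get("Content-Type", "application/octet-stream")
        mimetype = content_type.split(";")[0].strip() or "application/octet-stream"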

    # Use python-magic to guess MIME type if still unknown or generic
    if mimetype == "application/octet-stream":
        magic_mime = magic.Magic(mime=True)
        file_content = response.content[:2048]  # Read the first 2048 bytes
        mimetype = magic_mime.from_buffer(file_content)

    # Ensure filename has an extension
    extension = os.path.splitext(filename)[1]
    if not extension:
        extension = mimetypes.guess_extension(mimetype) or ".bin"
        if extension == ".bin" and mimetype == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            extension = ".docx"
        filename = f"{os.path.splitext(filename)[0]}{extension}"

    return FileInfo(
        filename=filename,
        extension=extension,
        mimetype=mimetype,
        size=int(response.headers.get("Content-Length", -1)),
    )

# Example request
url = "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx"
response = httpx.get(url)

if response.status_code == 200:
    file_info = guess_file_info_from_response(response)
    print(file_info.__dict__)
else:
    print(f"Request failed, status code: {response.status_code}")

This code provides a more robust way to determine the file type: it guesses from the file name and URL with mimetypes.guess_type, falls back to the Content-Type header, and uses python-magic if the result is still generic. It also ensures that the filename has a proper extension based on the MIME type.
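
The extension step can also be isolated into a small helper. The sketch below is one way to do it, assuming an explicit table for the Office formats so the result does not depend on what mimetypes.guess_extension knows on a given system. The names `ensure_extension` and `PREFERRED_EXTENSIONS` are hypothetical.

```python
import mimetypes
import os

# Hypothetical table of MIME types whose extension should never degrade to ".bin".
PREFERRED_EXTENSIONS = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
}


def ensure_extension(filename: str, mimetype: str) -> str:
    """Return filename unchanged if it has an extension, else append one."""
    root, ext = os.path.splitext(filename)
    if ext:
        return filename  # Keep whatever extension the name already carries.
    ext = (
        PREFERRED_EXTENSIONS.get(mimetype)
        or mimetypes.guess_extension(mimetype)
        or ".bin"  # Last resort when the type is unknown.
    )
    return f"{root}{ext}"
```

A name that already ends in .docx is left alone, so a generic MIME type can no longer overwrite a usable extension.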

@laipz8200
Member

LGTM! Would you like to make a PR for this change?
