Incorrect File Format Returned for .docx Upload via Remote URL #12551
Comments
The issue you're experiencing with remote URL uploads in Dify, where .docx files are incorrectly identified as .bin, points to a problem with MIME type detection. To address this, you might need to ensure that the MIME type is correctly set or detected when handling remote URL uploads. This could involve checking that the server-side logic that processes these uploads determines the file type from the content or headers of the file, rather than relying solely on the file extension. If the issue persists, reviewing the code paths or modules involved in processing remote URL uploads might show where the detection is failing.
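As a rough illustration of detecting the type from content rather than from the extension, here is a minimal standard-library sketch. It assumes the whole file body is available in memory (the ZIP central directory sits at the end of the file), and the names `OOXML_MARKERS` and `sniff_ooxml_mimetype` are hypothetical, not part of Dify.

```python
import io
import zipfile
from typing import Optional

# Entries that distinguish the common Office Open XML containers.
# This mapping is an illustrative assumption, not Dify's actual logic.
OOXML_MARKERS = {
    "word/document.xml": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def sniff_ooxml_mimetype(data: bytes) -> Optional[str]:
    """Return an OOXML MIME type if data is a ZIP archive holding a known marker entry."""
    # .docx, .xlsx and .pptx files are ZIP archives, which start with "PK\x03\x04".
    if not data.startswith(b"PK\x03\x04"):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        # Truncated or corrupt archive: give up instead of guessing.
        return None
    for marker, mimetype in OOXML_MARKERS.items():
        if marker in names:
            return mimetype
    return None
```

A caller could try this only when the header and extension checks end in application/octet-stream, so ordinary files skip the extra work.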
After reviewing the file, here's reference code to fix the issue:

import os
import re
import uuid
import magic
import mimetypes
import httpx
from urllib.parse import urlparse


class FileInfo:
    def __init__(self, filename, extension, mimetype, size):
        self.filename = filename
        self.extension = extension
        self.mimetype = mimetype
        self.size = size


def guess_file_info_from_response(response: httpx.Response):
    url = str(response.url)
    # Try to extract filename from URL
    parsed_url = urlparse(url)
    url_path = parsed_url.path
    filename = os.path.basename(url_path)
    # If filename couldn't be extracted, use Content-Disposition header
    if not filename:
        content_disposition = response.headers.get("Content-Disposition")
        if content_disposition:
            # Stop at a closing quote or ";" so the quote and any trailing parameters are not captured
            filename_match = re.search(r'filename="?([^";]+)"?', content_disposition)
            if filename_match:
                filename = filename_match.group(1)
    # If still no filename, generate a unique one
    if not filename:
        unique_name = str(uuid.uuid4())
        filename = f"{unique_name}"
    # Guess MIME type from filename first, then URL
    mimetype, _ = mimetypes.guess_type(filename)
    if mimetype is None:
        mimetype, _ = mimetypes.guess_type(url)
    if mimetype is None:
        # If guessing fails, use Content-Type from response headers
        mimetype = response.headers.get("Content-Type", "application/octet-stream")
    # Use python-magic to guess MIME type if still unknown or generic
    if mimetype == "application/octet-stream":
        magic_mime = magic.Magic(mime=True)
        file_content = response.content[:2048]  # Read the first 2048 bytes
        mimetype = magic_mime.from_buffer(file_content)
    # Ensure filename has an extension
    extension = os.path.splitext(filename)[1]
    if not extension:
        extension = mimetypes.guess_extension(mimetype) or ".bin"
        if extension == ".bin" and mimetype == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            extension = ".docx"
        filename = f"{os.path.splitext(filename)[0]}{extension}"
    return FileInfo(
        filename=filename,
        extension=extension,
        mimetype=mimetype,
        # -1 means the server did not send a Content-Length header
        size=int(response.headers.get("Content-Length", -1)),
    )


# Example request
url = "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx"
response = httpx.get(url)
if response.status_code == 200:
    file_info = guess_file_info_from_response(response)
    print(file_info.__dict__)
else:
    print(f"Request failed, status code: {response.status_code}")

This code provides a more robust way to determine the file type by first checking the filename and URL, then falling back to the Content-Type header and, if the type is still generic, to content sniffing with python-magic.
LGTM! Would you like to make a PR for this change?
Self Checks
Dify version
0.15.0
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
There’s an issue with remote URL uploads: .docx files are being incorrectly identified as .bin, while .pdf files work fine. Please fix this and verify other formats (e.g., .xlsx, .pptx, .txt) to ensure consistency.
The local file transfer method works fine.
Dify versions affected: 0.11.0 to 0.15.0
This is the start node request and response:
request:
{
  "files": [
    {
      "dify_model_identity": "dify__file",
      "id": null,
      "tenant_id": "cd8dce6c-0fda-4e5c-94d8-56fde48cea09",
      "type": "document",
      "transfer_method": "remote_url",
      "remote_url": "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx",
      "related_id": null,
      "filename": "8b9f011bd7eb43cdb88f49d9f811b440.docx",
      "extension": ".bin",
      "mime_type": "application/octet-stream",
      "size": 7876,
      "url": "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx"
    }
  ],
  "requestId": "32c0dee160da4e4b8ee6ed83924352a1",
  "profile": "uat",
  "sys.files": [],
  "sys.user_id": "123",
  "sys.app_id": "afeef743-78e2-40ad-a49e-c8539afaaaeb",
  "sys.workflow_id": "bfdc6dc3-cac7-4b3a-990c-e31c5e0ff5c6",
  "sys.workflow_run_id": "1b56e5bb-fb76-49b7-b6d6-115e8550ce0c"
}
response:
{
  "files": [
    {
      "dify_model_identity": "dify__file",
      "id": null,
      "tenant_id": "cd8dce6c-0fda-4e5c-94d8-56fde48cea09",
      "type": "document",
      "transfer_method": "remote_url",
      "remote_url": "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx",
      "related_id": null,
      "filename": "8b9f011bd7eb43cdb88f49d9f811b440.docx",
      "extension": ".bin",
      "mime_type": "application/octet-stream",
      "size": 7876,
      "url": "https://oss.xxx.com/bucket-test4/20250109/8b9f011bd7eb43cdb88f49d9f811b440.docx"
    }
  ],
  "requestId": "32c0dee160da4e4b8ee6ed83924352a1",
  "sys.files": [],
  "sys.user_id": "123",
  "sys.app_id": "afeef743-78e2-40ad-a49e-c8539afaaaeb",
  "sys.workflow_id": "bfdc6dc3-cac7-4b3a-990c-e31c5e0ff5c6",
  "sys.workflow_run_id": "1b56e5bb-fb76-49b7-b6d6-115e8550ce0c"
}
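To check the other formats mentioned in the steps above, a small script like the following could be run inside the API container. It is only a sketch: it assumes (this is not confirmed in this issue) that the misdetection comes from the Python mimetypes table lacking Office Open XML entries in that environment, and `OFFICE_TYPES` and `register_office_types` are hypothetical names.

```python
import mimetypes

# Office Open XML types that a minimal environment may not know about.
# Registering them explicitly is an assumed workaround, not Dify's actual fix.
OFFICE_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def register_office_types() -> None:
    """Add the Office extensions to the process-wide mimetypes table."""
    for extension, mimetype in OFFICE_TYPES.items():
        mimetypes.add_type(mimetype, extension)


# Print what the table knows before and after registration.
samples = ("report.docx", "sheet.xlsx", "slides.pptx", "notes.txt", "paper.pdf")
before = {name: mimetypes.guess_type(name)[0] for name in samples}
register_office_types()
for name in samples:
    print(f"{name}: before={before[name]} after={mimetypes.guess_type(name)[0]}")
```

If the "before" value for .docx is None while .pdf resolves correctly, that would match the behavior reported here.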
✔️ Expected Behavior
No response
❌ Actual Behavior
No response