Works mostly well for the pure Hindi text, but does NOT parse English words at all and misses out on a few Numbers #84

Open
ChintanDonda opened this issue Jan 22, 2025 · 0 comments


I've used the Hindi dataset.

It works mostly well for pure Hindi text, but it does not parse English words at all and drops some digits in numbers.

English words with Hindi text

Example 1:
आवेदन के नाम लेने से पहले (Registration process के पहले) समझने की बातें
===> Parsed from the PDF, using the code snippet below, as:
आवेदन के नाम लेने से पहले (२८्टा57807) 0700€55 के पहले) समझने की बातें

Example 2:
तेजस्विता जब किसी मालिकाना वस्तु पर (Possession) अथवा पद पर (Post/Position) निर्भर होते है
===> Parsed from the PDF, using the code snippet below, as:
तेजस्विता जब किसी मालिकाना वस्तु पर (?०८५७५५०) अथवा पद पर (?०5६/?०5ाधं0ा) निर्भर होते है

Example 3:
वस्तुनिष्ठ आनंद (objective happiness) यह हमेशा अपूर्ण होता है
===> Parsed from the PDF, using the code snippet below, as:
वस्तुनिष्ठ आनंद (०णुं०ता५ह 09000655) यह हमेशा अपूर्ण होता है

English words & Numbers with Hindi text

Example 1:
आवेदन लेने की प्रक्रिया (Registration process) हमें 01/06/2024 से शुरू करनी है।
===> Parsed from the PDF, using the code snippet below, as:
आवेदन लेने की प्रक्रिया (९८8्डा[507820770655) हमें 0/06/2024 से शुरू करनी है।
===> The 1 in 01 was also dropped.

How to reproduce:

from pdf2image import convert_from_path
import pytesseract

# Specify Tesseract executable location
pytesseract.pytesseract.tesseract_cmd = '/opt/homebrew/bin/tesseract'

# Load and convert PDF to images
# Use a PDF that contains Hindi text mixed with English words/phrases and/or numbers
documents = convert_from_path("path_to_pdf.pdf")

# Extract text from each image in Hindi
page_content = ""
for doc in documents:
    try:
        page_content += pytesseract.image_to_string(doc, lang='hin')
        page_content += "\n"
    except Exception as e:
        # Report the actual error and continue with the remaining pages
        print(f"Error in extracting page content: {e}")

print(page_content[0:5])

Any idea how I can also parse the Hindi text mixed with some English words/phrases and/or Numbers?
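
One thing that may be worth trying, as a sketch rather than a confirmed fix: Tesseract accepts several language codes joined with `+`, so passing `lang='hin+eng'` instead of `lang='hin'` lets it consider Latin-script words alongside Devanagari. This assumes the `eng` traineddata file is installed next to `hin`. The helper names `build_lang` and `ocr_pdf` are hypothetical, and the higher `dpi` value is only a guess at what might help with the dropped digit.

```python
from typing import Iterable, List


def build_lang(codes: Iterable[str]) -> str:
    """Join Tesseract language codes with '+', e.g. ['hin', 'eng'] -> 'hin+eng'."""
    cleaned = [c.strip() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_pdf(pdf_path: str, codes: Iterable[str] = ("hin", "eng"), dpi: int = 300) -> str:
    """OCR every page of a PDF with several Tesseract languages enabled.

    Assumption: the traineddata for every code in `codes` is installed.
    """
    # Imported lazily so build_lang() can be used without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    lang = build_lang(codes)
    # dpi=300 is an assumption; pdf2image defaults to 200.
    pages = convert_from_path(pdf_path, dpi=dpi)
    texts: List[str] = []
    for page in pages:
        texts.append(pytesseract.image_to_string(page, lang=lang))
    return "\n".join(texts)
```

Calling `ocr_pdf("path_to_pdf.pdf")` would then replace the loop in the reproduction snippet. If Tesseract is not on the PATH, `pytesseract.pytesseract.tesseract_cmd` still has to be set first, as in the snippet above.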
