Works mostly well for the pure Hindi text, but does NOT parse English words at all and misses out on a few Numbers #84

ChintanDonda · 2025-01-22T11:52:07Z

I'm using the Hindi trained data (lang='hin').

It works mostly well for pure Hindi text, but English words are not recognized at all and some digits are dropped.

English words with Hindi text

Example 1:
आवेदन के नाम लेने से पहले (Registration process के पहले) समझने की बातें
===> Parsed from the PDF using the below code snippet as:
आवेदन के नाम लेने से पहले (२८्टा57807) 0700€55 के पहले) समझने की बातें

Example 2:
तेजस्विता जब किसी मालिकाना वस्तु पर (Possession) अथवा पद पर (Post/Position) निर्भर होते है
===> Parsed from the PDF using the below code snippet as:
तेजस्विता जब किसी मालिकाना वस्तु पर (?०८५७५५०) अथवा पद पर (?०5६/?०5ाधं0ा) निर्भर होते है

Example 3:
वस्तुनिष्ठ आनंद (objective happiness) यह हमेशा अपूर्ण होता है
===> Parsed from the PDF using the below code snippet as:
वस्तुनिष्ठ आनंद (०णुं०ता५ह 09000655) यह हमेशा अपूर्ण होता है

English words & Numbers with Hindi text

Example 1:
आवेदन लेने की प्रक्रिया (Registration process)) हमें 01/06/2024 से शुरू करनी है।
===> Parsed from the PDF using the below code snippet as:
आवेदन लेने की प्रक्रिया (९८8्डा[507820770655) हमें 0/06/2024 से शुरू करनी है।
====> The 1 in 01 was also dropped (01/06/2024 became 0/06/2024).

How to reproduce:

from pdf2image import convert_from_path
import pytesseract

# Specify Tesseract executable location
pytesseract.pytesseract.tesseract_cmd = '/opt/homebrew/bin/tesseract'

# Load and convert PDF to images
documents = convert_from_path("path_to_pdf.pdf")    # Try PDF that has Hindi text mixed with some English words/phrases and/or Numbers

# Extract text from each image in Hindi
page_content = ""
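for page_number, doc in enumerate(documents, start=1):
    try:
        page_content += pytesseract.image_to_string(doc, lang='hin')
        page_content += "\n"
    except Exception as e:
        # Report the failing page and the actual error, not the image object
        print(f"Error in extracting page content for page {page_number}: {e}")

# Print the extracted text so it can be compared against the PDF
print(page_content)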

Any idea how I can also parse the Hindi text mixed with some English words/phrases and/or Numbers?
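
One option that may be worth trying (a sketch, not a confirmed fix): Tesseract can load several languages at once when their codes are joined with `+`, for example `hin+eng`. This assumes the English traineddata is installed next to the Hindi one. The helper names `build_lang` and `ocr_pages` below are hypothetical, and the `ocr` parameter exists only so the function can be exercised without Tesseract installed.

```python
from typing import Callable, Iterable, Optional


def build_lang(codes: Iterable[str]) -> str:
    """Join Tesseract language codes with '+', e.g. ['hin', 'eng'] -> 'hin+eng'."""
    cleaned = [c.strip() for c in codes if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_pages(pages, lang: str, ocr: Optional[Callable] = None) -> str:
    """Run OCR on every page image and join the results with newlines.

    `ocr` defaults to pytesseract.image_to_string; it is injectable
    (an assumption made for this sketch) so the loop can be tested
    without a Tesseract installation.
    """
    if ocr is None:
        # Imported lazily so build_lang() works without pytesseract
        import pytesseract
        ocr = pytesseract.image_to_string
    return "\n".join(ocr(page, lang=lang) for page in pages)
```

With the snippet above, the call would be `ocr_pages(documents, build_lang(["hin", "eng"]))`, where `documents` is the list returned by `convert_from_path`. Whether this also recovers the dropped digit in 01/06/2024 is not something I can confirm.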
