Attempt at establishing distinction between -dict-gt-norm- and -gt-norm fails #4

rueter · 2024-02-22T11:13:38Z

Four example words have been selected to provide the *e vs *ä distinction found in the manuscript of the monolingual Erzya dictionary by Kuzʹma Abramov.
In the lexc file we have:

пей+N:пӓй
сэдь+N:сӓдь
седей+N:сьӓдей
эрзя+N:ӓрзя

‹ӓ› has been declared in twolc.

The filter ‹remove-diaereses-enhancement.regex› looks like this:

[[ Ь | ь ] -> 0 ||  _ [ ӓ | Ӓ ] ,, ӓ -> е || [ ь | Ь ]  _ ]   
.o.
ӓ -> е || [ в | В | б | Б | г | Г | ж | Ж | к | К | м | М | п | П | ф | Ф | х | Х | ч | Ч | ш | Ш | щ | Щ ] _ 
.o.
ӓ -> э || [ д | Д | з | З | л | Л | н | Н | р | Р | с | С | т | Т | ц | Ц ] _	
.o.
ӓ -> э || [ .#. | %- ] _ ;

So, there are a number of things going on in one place.
Line 1 removes underlying soft sign preceding underlying ӓ and simultaneously replaces underlying ӓ with е. (failure)
Line 2 replaces underlying ‹ӓ› with ‹е›. (partial success)
Line 3 replaces underlying ‹ӓ› with ‹э› following specific consonants. (partial success)
Line 4 replaces underlying ‹ӓ› with ‹э› word-initially. (partial success)

The script remove-diaereses-enhancement.hfst is called in
lang-myv/src/fst/Makefile.am and lang-myv/src/fst/filters/Makefile.am

The desired result for the four words given above would be:
Analysis:

lang-myv jackrueter$ hfst-lookup src/fst/analyser-gt-norm.hfstol 
> пей
пей	пей+N+Sg+Nom+Indef	0,000000

> сэдь
сэдь	сэдь+N+Sg+Nom+Indef	0,000000

> седей
седей	седей+N+Sg+Nom+Indef	0,000000

> эрзя
эрзя	эрзя+N+Sg+Nom+Indef	0,000000

Dict-Generation:

lang-myv jackrueter$ hfst-lookup src/fst/generator-dict-gt-norm.hfst 
hfst-lookup: warning: It is not possible to perform fast lookups with foma format automata.
Using HFST basic transducer format and performing slow lookups
> пей+N+Sg+Nom+Indef
пей+N+Sg+Nom+Indef	пӓй	0,000000

> сэдь+N+Sg+Nom+Indef
сэдь+N+Sg+Nom+Indef	сӓдь	0,000000

> седей+N+Sg+Nom+Indef
седей+N+Sg+Nom+Indef	сьӓдей	0,000000

> эрзя+N+Sg+Nom+Indef
эрзя+N+Sg+Nom+Indef	ӓрзя	0,000000

Generation:

lang-myv jackrueter$ hfst-lookup src/fst/generator-gt-norm.hfstol
> пей+N+Sg+Nom+Indef
пей+N+Sg+Nom+Indef	пей	0,000000

> сэдь+N+Sg+Nom+Indef
сэдь+N+Sg+Nom+Indef	сэдь	0,000000

> седей+N+Sg+Nom+Indef
седей+N+Sg+Nom+Indef	седей	0,000000

> эрзя+N+Sg+Nom+Indef
эрзя+N+Sg+Nom+Indef	эрзя	0,000000

Instead, I get:
Analysis:

lang-myv jackrueter$ hfst-lookup src/fst/analyser-gt-norm.hfstol 
> пей
пей	пей+?	inf

> сэдь
сэдь	сэдь+?	inf

> седей
седей	седей+?	inf

> эрзя
эрзя	эрзя+?	inf

Dict-Generation:

lang-myv jackrueter$ hfst-lookup src/fst/generator-dict-gt-norm.hfst 
hfst-lookup: warning: It is not possible to perform fast lookups with foma format automata.
Using HFST basic transducer format and performing slow lookups
> пей+N+Sg+Nom+Indef
пей+N+Sg+Nom+Indef	пӓй	0,000000

> сэдь+N+Sg+Nom+Indef
сэдь+N+Sg+Nom+Indef	сӓдь	0,000000

> седей+N+Sg+Nom+Indef
седей+N+Sg+Nom+Indef	седей+N+Sg+Nom+Indef+?	inf

> эрзя+N+Sg+Nom+Indef
эрзя+N+Sg+Nom+Indef	ӓрзя	0,000000

Generation:

lang-myv jackrueter$ hfst-lookup src/fst/generator-gt-norm.hfstol 
> пей+N+Sg+Nom+Indef
пей+N+Sg+Nom+Indef	пей	0,000000

> сэдь+N+Sg+Nom+Indef
сэдь+N+Sg+Nom+Indef	сэдь	0,000000

> седей+N+Sg+Nom+Indef
седей+N+Sg+Nom+Indef	седей+N+Sg+Nom+Indef+?	inf

> эрзя+N+Sg+Nom+Indef
эрзя+N+Sg+Nom+Indef	ӓрзя	0,000000


flammie · 2024-02-22T16:55:47Z

I tried to debug the xerox script like this:

$  hfst-xfst
hfst[0]: read regex @"src/fst/filters/remove-diaereses-enhancement.hfst"
hfst[1]: apply down
apply down> эрзя
эрзя
apply down> ӓрзя
эрзя
apply down>

It seems it should work, but there is also a flag diacritic in the lexicon between .#. and э, which may be the issue. Or I am just not very good with the xfst scripting debugger.

rueter · 2024-02-22T22:34:01Z

Replace rule: ӓ -> э || [ .#. | %- ] _ ;
with

ӓ -> э || \[ в | В | б | Б | г | Г | ж | Ж | к | К | м | М | п | П | ф | Ф | х | Х | ч | Ч | ш | Ш | щ | Щ | д | Д | з | З | л | Л | н | Н | р | Р | с | С | т | Т | ц | Ц ] _

which implicitly allows for flags.
Not a good idea, but it works.
The complex one remains:

седей+N+Sg+Nom+Indef

Something like kaNpat >> kammat
сьӓдей >> седей

snomos · 2024-04-24T08:40:52Z

The present code does not work because there is a contradiction in it. What you have is basically this:

ӓ -> э || [ д | Д | … ] _  
.o.
ӓ -> э || \[ … | д | Д | … ] _

I.e., you can't tell it to make one and the same change both in the context of д | Д and NOT in the context of д | Д. What do you really want?

rueter · 2024-04-24T08:49:03Z

What I would like is:

ӓ -> э || [ д | Д | … ] _  
.o.
ӓ -> е || \[ … | д | Д | … ] _

When ‹ӓ› is word-initial or follows an alveolar, it should become ‹э›. Following a non-alveolar or a soft sign ‹ь›, it should turn into ‹е›, AND ь -> 0 in the soft-sign case.
Does this mean that there should be a separate file for removing the soft sign?
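
For reference, the intended mapping can be sketched outside HFST. This is only a Python sketch of the behaviour described above, not the xfst filter itself: the function name `remove_diaeresis` and the constant names are made up for illustration, it only handles lowercase ‹ӓ›, and it says nothing about how flag diacritics in the lexicon interact with the real rules.

```python
import re
import unicodedata

# Lowercase ‹ӓ› (CYRILLIC SMALL LETTER A WITH DIAERESIS), composed form.
A_UML = "\u04d3"

# Consonant sets copied from the contexts in the filter; names are illustrative.
NON_ALVEOLAR = "вВбБгГжЖкКмМпПфФхХчЧшШщЩ"
ALVEOLAR = "дДзЗлЛнНрРсСтТцЦ"


def remove_diaeresis(word: str) -> str:
    """Map a dictionary form containing ‹ӓ› to its normative spelling."""
    # Normalise first so that а + combining diaeresis is treated like ‹ӓ›.
    word = unicodedata.normalize("NFC", word)
    # Step 1: soft sign + ‹ӓ› -> ‹е›, dropping the soft sign.
    word = re.sub(f"[ьЬ]{A_UML}", "е", word)
    # Step 2: ‹ӓ› -> ‹е› after a non-alveolar consonant.
    word = re.sub(f"(?<=[{NON_ALVEOLAR}]){A_UML}", "е", word)
    # Step 3: ‹ӓ› -> ‹э› after an alveolar, after a hyphen, or word-initially.
    word = re.sub(f"(?:(?<=[{ALVEOLAR}-])|^){A_UML}", "э", word)
    return word


if __name__ == "__main__":
    for form in ["пӓй", "сӓдь", "сьӓдей", "ӓрзя"]:
        print(form, "->", remove_diaeresis(form))
```

The soft-sign step runs first so that the later context rules never see the ‹ьӓ› sequence; with this ordering the four example words come out as пей, сэдь, седей and эрзя.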

snomos · 2024-04-24T09:03:47Z

Don't know, but at least you should fix the regex to say what you want: ӓ -> е in the second case (now it says ӓ -> э) 🙂

rueter · 2024-04-24T12:10:49Z

It now says:

ӓ -> е || [ в | В | б | Б | г | Г | ж | Ж | к | К | м | М | п | П | ф | Ф | х | Х | ч | Ч | ш | Ш | щ | Щ ] _
.o.
ӓ -> э || [ д | Д | з | З | л | Л | н | Н | р | Р | с | С | т | Т | ц | Ц | .#. | %- ] _  
.o.
[[ Ь | ь ] -> 0 ||  _ [ ӓ | Ӓ ] ,, ӓ -> е || [ ь | Ь ]  _ ] ;

but it still does not work

snomos · 2024-04-25T10:31:13Z

I have reordered the steps as follows:

[ [ Ь | ь ] -> 0 ||  _ [ ӓ | Ӓ ] ,, ӓ -> е || [ ь | Ь ]  _ ]
.o.
[ ӓ -> е || [ в | В | б | Б | г | Г | ж | Ж | к | К | м | М | п | П | ф | Ф | х | Х | ч | Ч | ш | Ш | щ | Щ ] _ ]
.o.
[ ӓ -> э || [ д | Д | з | З | л | Л | н | Н | р | Р | с | С | т | Т | ц | Ц | .#. ( ? ) | %- ( ? ) ] _ ] ;

and after this change it works in three out of four cases:

echo пей+N+Sg+Nom+Indef | hfst-lookup -q src/fst/generator-dict-gt-norm.hfstol
пей+N+Sg+Nom+Indef	пӓй	0.000000

echo сэдь+N+Sg+Nom+Indef | hfst-lookup -q src/fst/generator-dict-gt-norm.hfstol
сэдь+N+Sg+Nom+Indef	сӓдь	0.000000

echo седей+N+Sg+Nom+Indef | hfst-lookup -q src/fst/generator-dict-gt-norm.hfstol
седей+N+Sg+Nom+Indef	седей+N+Sg+Nom+Indef+?	inf

echo эрзя+N+Sg+Nom+Indef | hfst-lookup -q src/fst/generator-dict-gt-norm.hfstol 
эрзя+N+Sg+Nom+Indef	ӓрзя	0.000000

Only the седей case is not working.

rueter · 2024-04-25T10:43:20Z

I wanted to try a different ordering as well, but got this:

Making all in .
make[3]: Entering directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
mkdir -p `dirname .generated/.stamp`
make[3]: *** No rule to make target 'filters/remove-diaereses-enhancement.%', needed by '.generated/analyser-pmatchdisamb-gt-desc.hfst'.  Stop.
make[3]: *** Waiting for unfinished jobs....
touch .generated/.stamp
make[3]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
make[2]: *** [Makefile:1257: all-recursive] Error 1
make[2]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
make[1]: *** [Makefile:450: all-recursive] Error 1
make[1]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src'
make: *** [Makefile:554: all-recursive] Error 1

snomos · 2024-04-25T11:07:21Z

Sorry about that, fixed now.

rueter · 2024-04-25T12:03:29Z

Now it dies in a different place:

Reading and minimizing rule ????v...
Reading and minimizing rule 74...
Reading lexicon... minimize(determinize(reverse(lexc(lexicon.lexc)))) read
Computing intersecting composition...
Storing result in <stdout>...
Minimizing reverse(compose(minimize(determinize(reverse(lexc(lexicon.lexc)))), intersect(morphology/.generated/phonology.rev.hfst)))...
make[3]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
make[2]: *** [Makefile:1257: all-recursive] Error 1
make[2]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
make[1]: *** [Makefile:450: all-recursive] Error 1
make[1]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src'
make: *** [Makefile:554: all-recursive] Error 1

but if I do make distclean, then it dies for lack of a rule to generate the .hfst as given above.

snomos · 2024-04-25T12:11:48Z

That seems completely unrelated, I have no idea. Wipe and reclone?
