Dictionary generation: Maskwacîs dictionary entries lack `<lc>`, and that messes up paradigm generation #120

eddieantonio · 2019-02-01T23:40:36Z

Spun off from #117.

Some entries lack a value for <lc>. This is needed to do smart things when generating the search page, and required to generate a paradigm. For example, "ayahciyiniw":

<e>
   <lg>
      <l pos="N">ayahciyiniw</l>
      <lc></lc>
      <stem>md_stem[nr]</stem>
   </lg>
   <mg>
   <tg xml:lang="eng">
       <t pos="N" sources="MD">A member of another Indian tribe. Used by the Plains Cree for a member of the Blackfoot Confederacy; or any outcast or ostracized member of an Indian
   </tg>
   </mg>
</e>
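
A minimal sketch of how such entries could be located, assuming the dictionary XML can be parsed with Python's standard `xml.etree.ElementTree`. The `<r>` root element, the trimmed sample, and the function name `entries_missing_lc` are illustrative assumptions, not part of the actual generation scripts:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample trimmed from the entries quoted in this issue;
# the <r> root element is an assumption made for illustration.
SAMPLE = """
<r>
  <e>
    <lg>
      <l pos="N">ayahciyiniw</l>
      <lc></lc>
      <stem>md_stem[nr]</stem>
    </lg>
  </e>
  <e>
    <lg>
      <l pos="N">ayahciyiniw</l>
      <lc>NA-2</lc>
      <stem>ayahciyiniw-</stem>
    </lg>
  </e>
</r>
"""


def entries_missing_lc(root):
    """Return (lemma, pos) for every <e> whose <lc> is absent or blank."""
    missing = []
    for entry in root.iter("e"):
        lemma = entry.find("lg/l")
        if lemma is None:
            continue  # skip malformed entries without a lemma
        # findtext() gives "" for an empty element and the default if absent
        lc = entry.findtext("lg/lc", default="")
        if not lc.strip():
            missing.append((lemma.text, lemma.get("pos")))
    return missing


root = ET.fromstring(SAMPLE.strip())
print(entries_missing_lc(root))
```

Running a check like this over the generated XML would give a worklist of lemmas that still need an `<lc>` value.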

Note that the same lemma also exists as a SEPARATE <e> element, in which every <t> has "Cree : Words" (CW) as its source:

<e>
   <lg>
      <l pos="N">ayahciyiniw</l>
      <lc>NA-2</lc>
      <stem>ayahciyiniw-</stem>
   </lg>
   <mg>
   <tg xml:lang="eng">
       <t pos="N" sources="CW">Blackfoot</t>
       <t pos="N" sources="CW">Slavey</t>
       <t pos="N" sources="CW">stranger</t>
       <t pos="N" sources="CW">stranger</t>
   </tg>
   </mg>
</e>
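
The split described here (one entry with an empty `<lc>`, another with a filled one) could be detected by grouping entries by lemma. This is only a sketch under assumptions: the `<r>` root, the trimmed sample, and the name `split_lemmas` are hypothetical, and the real generation scripts may work quite differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample trimmed from the two entries quoted in this issue;
# the <r> root element is an assumption made for illustration.
SAMPLE = """
<r>
  <e>
    <lg><l pos="N">ayahciyiniw</l><lc></lc><stem>md_stem[nr]</stem></lg>
  </e>
  <e>
    <lg><l pos="N">ayahciyiniw</l><lc>NA-2</lc><stem>ayahciyiniw-</stem></lg>
  </e>
</r>
"""


def split_lemmas(root):
    """Find lemmas that have both an empty and a non-empty <lc>.

    Returns a dict mapping (lemma, pos) to the sorted non-empty <lc> values.
    """
    seen = defaultdict(list)
    for entry in root.iter("e"):
        lemma = entry.find("lg/l")
        if lemma is None:
            continue  # skip malformed entries without a lemma
        lc = entry.findtext("lg/lc", default="").strip()
        seen[(lemma.text, lemma.get("pos"))].append(lc)
    return {
        key: sorted(value for value in values if value)
        for key, values in seen.items()
        if "" in values and any(values)
    }


print(split_lemmas(ET.fromstring(SAMPLE.strip())))
```

Each lemma reported this way is a candidate for merging, or for copying the known `<lc>` value onto the entry that lacks one.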

Resolving this will help resolve #117.

Possibly related to #104.


aarppe · 2019-02-02T00:07:01Z

If a lemma is solely in MD, that source has no subtype of verb or noun, nor any of their inflectional subtypes. Those can be added, but manually.

So, MD only provides info on whether something is a verb or noun, but not VTA nor NA, nor VTA-2 nor NA-4w.

What exactly does <lc> do? Does it need to be actually correct, or can we have some bonus/default value?

eddieantonio · 2019-02-02T21:01:40Z

> If a lemma is solely in MD, that source has no subtype of verb or noun, nor any of their inflectional subtypes. Those can be added, but manually.

Ideally, we'd add noun animacy to all of the lemmas unique to MD, even if it has to be done manually; however...

> So, MD only provides info on whether something is a verb or noun, but not VTA nor NA, nor VTA-2 nor NA-4w.

> What exactly does <lc> do? Does it need to be actually correct, or can we have some bonus/default value?

<lc> is the "lemma comment". Practically, it will have the most specific breakdown of the part of speech for the lemma. In CW, this is something like "NA-4w" or "VTA-2". When entries differ based on animacy (e.g., mîtas (NDA) and mîtas (NDA)), this entry is required to disambiguate the two.
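
To make the "most specific breakdown" concrete, here is a rough sketch of splitting an `<lc>` value into its parts. It assumes values look like the ones mentioned in this thread ("NA-4w", "VTA-2", "NDA"); the pattern, the `LcValue` type, and `parse_lc` are hypothetical names for illustration and are not how NDS itself interprets the field.

```python
import re
from typing import NamedTuple, Optional


class LcValue(NamedTuple):
    pos: str                            # "N" or "V"
    subtype: str                        # e.g. "A", "DA", "TA"; may be empty
    inflectional_class: Optional[str]   # e.g. "4w", "2"; None if absent


# Assumed shape: one POS letter, optional uppercase subtype letters,
# then an optional "-<class>" suffix.
LC_PATTERN = re.compile(r"^(?P<pos>[NV])(?P<subtype>[A-Z]*)(?:-(?P<cls>\w+))?$")


def parse_lc(value: str) -> Optional[LcValue]:
    """Split an <lc> string into parts; return None if it does not fit."""
    match = LC_PATTERN.match(value.strip())
    if match is None:
        return None
    return LcValue(match["pos"], match["subtype"], match["cls"])


for raw in ["NA-4w", "VTA-2", "NDA", ""]:
    print(repr(raw), "->", parse_lc(raw))
```

An empty `<lc>`, as in the MD-only entry, yields `None` here, which is exactly the case where nothing more specific than the `pos` attribute is available.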

HOWEVER, as noted in the main issue, this particular form, "ayahciyiniw", exists as two entries: one with an empty <lc> and one with the <lc> as specified in CW. This is probably a bug in dictionary generation. Also note that the CW version of the entry has "stranger" as a translation twice.

aarppe · 2019-02-02T21:37:29Z

I’ve taken <lc> to mean lexical category, but ‘lemma comment’ works just as well.

In principle, one could even have two lemmas with the same part of speech, but with different inflectional classes. Can’t come up with any good example here.

What I should be asking is the degree of specificity that is explicitly needed in how NDS is coded?

It seems to me that the preamble specs in the paradigm and layout files need to match the linguistic analysis (so, for animate nouns, N and A). The inflectional subtype associated with each lemma and listed in the preamble, like NA-1 or NA-4w, only needs to be matched by the lemma in the XML source (potentially disambiguating otherwise similar items); these inflectional subtypes do not influence paradigm generation.

The discrepancies in the XML are likely an issue of ambiguity in the comparison files and their linking with the original dictionary sources, and I think they’ll be best resolved by a single database for all sources (eventually).

eddieantonio · 2019-02-02T22:17:57Z

> What I should be asking is the degree of specificity that is explicitly needed in how NDS is coded?

It's decided by the YAML header in the paradigm files.

https://github.com/UAlbertaALTLab/itwewina/blob/development/neahtta/configs/language_specific_rules/paradigms/README.md#analyzer-conditions

So, it's as specific as it is in the paradigm YAML. This could probably be done in a more straightforward way. I don't fully understand how this works either :( I think it can be inferred from the linguistic analysis, for Plains Cree?

aarppe · 2019-02-03T13:35:24Z

Note: having 'stranger' twice is due to an error in CW source in providing that English translation twice for 'ayahciyiniw' - so it's not a matter of the script making an error. Also, that the MD and CW entries are presented separately is due to 'ayahciyiniw' missing from the comparison file. So issues with the source materials, for now, rather than the scripts.

eddieantonio · 2019-02-04T01:36:24Z

Okay! I'm going to take this issue off of "stable version", as there are a number of real TODOs that need to be addressed before we even get to finish this one!

Also, "lexical category" makes way more sense!

eddieantonio added the bug label Feb 1, 2019

eddieantonio assigned aarppe Feb 1, 2019

eddieantonio mentioned this issue Feb 1, 2019

Server error when searching for 'ayahciyiniw' #117

Closed
