This repository has been archived by the owner on May 6, 2021. It is now read-only.

Dictionary generation: Maskwacîs dictionary entries lack <lc>, and that messes up paradigm generation #120

Open
eddieantonio opened this issue Feb 1, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@eddieantonio
Member

Spun off from #117.

Some entries lack a value for <lc>. This is needed to do smart things when generating the search page, and required to generate a paradigm. For example, "ayahciyiniw":

<e>
   <lg>
      <l pos="N">ayahciyiniw</l>
      <lc></lc>
      <stem>md_stem[nr]</stem>
   </lg>
   <mg>
   <tg xml:lang="eng">
       <t pos="N" sources="MD">A member of another Indian tribe. Used by the Plains Cree for a member of the Blackfoot Confederacy; or any outcast or ostracized member of an Indian
   </tg>
   </mg>
</e>

Note that the same word also exists as a SEPARATE <e> element, in which every <t> has "Cree : Words" (CW) as its source:

<e>
   <lg>
      <l pos="N">ayahciyiniw</l>
      <lc>NA-2</lc>
      <stem>ayahciyiniw-</stem>
   </lg>
   <mg>
   <tg xml:lang="eng">
       <t pos="N" sources="CW">Blackfoot</t>
       <t pos="N" sources="CW">Slavey</t>
       <t pos="N" sources="CW">stranger</t>
       <t pos="N" sources="CW">stranger</t>
   </tg>
   </mg>
</e>

Resolving this will help resolve #117.

Possibly related to #104.
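
A quick way to list the affected entries is sketched below. It is only an illustration, not part of the dictionary generation scripts: the `<r>` root element, the `SAMPLE` string, and the function name `entries_missing_lc` are assumptions made so that the snippet runs on its own, and the sample keeps only the `<lg>` part of the two entries above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <lg> parts of the two <e> entries from this issue,
# wrapped in an assumed <r> root element so the snippet parses on its own.
SAMPLE = """
<r>
  <e>
    <lg>
      <l pos="N">ayahciyiniw</l>
      <lc></lc>
      <stem>md_stem[nr]</stem>
    </lg>
  </e>
  <e>
    <lg>
      <l pos="N">ayahciyiniw</l>
      <lc>NA-2</lc>
      <stem>ayahciyiniw-</stem>
    </lg>
  </e>
</r>
"""


def entries_missing_lc(root):
    """Return (lemma, pos) for every <e> whose <lc> is absent or blank."""
    missing = []
    for entry in root.iter("e"):
        lemma = entry.find("lg/l")
        if lemma is None:
            continue  # no lemma element, nothing to report
        # findtext gives "" both for a missing <lc> (via default) and an empty one
        lc = entry.findtext("lg/lc", default="")
        if not lc.strip():
            missing.append((lemma.text, lemma.get("pos")))
    return missing


root = ET.fromstring(SAMPLE)
print(entries_missing_lc(root))  # [('ayahciyiniw', 'N')]
```

Run against the full dictionary file, a check like this would give a list of lemmas that need an `<lc>` value before a paradigm can be generated for them.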

@aarppe
Contributor

aarppe commented Feb 2, 2019

If a lemma is solely in MD, that source has no subtype of verb or noun, nor any of their inflectional subtypes. Those can be added, but manually.

So, MD only provides info on whether smth is a verb or noun, but not VTA nor NA, nor VTA-2 nor NA-4w.

What exactly does <lc> do? Does it need to be actually correct, or can we have some bonus/default value?

@eddieantonio
Member Author

> If a lemma is solely in MD, that source has no subtype of verb or noun, nor any of their inflectional subtypes. Those can be added, but manually.

Ideally, we'd add noun animacy to all of the lemmas unique to MD, even if it has to be done manually; however...

> So, MD only provides info on whether smth is a verb or noun, but not VTA nor NA, nor VTA-2 nor NA-4w.
>
> What exactly does <lc> do? Does it need to be actually correct, or can we have some bonus/default value?

<lc> is the "lemma comment". Practically, it will have the most specific breakdown of the part of speech for the lemma. In CW, this is something like "NA-4w" or "VTA-2". When entries differ based on animacy (e.g., mîtas (NDA) and mîtas (NDA)), this entry is required to disambiguate the two.

HOWEVER, as noted in the main issue, this particular form, "ayahciyiniw", exists as two entries: one with an empty <lc> and one with the <lc> as specified in CW. This is probably a bug in dictionary generation. Also note that the CW version of the entry has "stranger" as a translation twice.
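
The two symptoms described here (one lemma split across a blank-`<lc>` entry and a filled one, and a repeated translation) could be detected with something like the sketch below. It assumes the entries have already been read into `(lemma, lc, source)` tuples; the `ROWS` data, the made-up lemma `hypothetical-lemma`, and both function names are illustrative and not taken from the actual scripts.

```python
from collections import defaultdict

# Hypothetical rows, as they might be read from the generated XML.
# Only the "ayahciyiniw" rows come from this issue; the last one is made up.
ROWS = [
    ("ayahciyiniw", "", "MD"),
    ("ayahciyiniw", "NA-2", "CW"),
    ("hypothetical-lemma", "NI-1", "CW"),
]


def split_entries(rows):
    """Map each lemma that has both a blank and a filled <lc> to its filled values."""
    by_lemma = defaultdict(set)
    for lemma, lc, _source in rows:
        by_lemma[lemma].add(lc.strip())
    return {
        lemma: sorted(lcs - {""})
        for lemma, lcs in by_lemma.items()
        if "" in lcs and len(lcs) > 1
    }


def unique_in_order(items):
    """Drop repeated translations while keeping first-seen order."""
    return list(dict.fromkeys(items))


print(split_entries(ROWS))  # {'ayahciyiniw': ['NA-2']}
print(unique_in_order(["Blackfoot", "Slavey", "stranger", "stranger"]))
# ['Blackfoot', 'Slavey', 'stranger']
```

Whether a blank entry should simply inherit the filled `<lc>` of its twin is a linguistic decision; the sketch only reports the candidates.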

@aarppe
Contributor

aarppe commented Feb 2, 2019

I’ve taken <lc> to mean lexical category, but ’lemma comment’ works just as well.

In principle, one could even have two lemmas with the same part of speech, but with different inflectional classes. Can’t come up with any good example here.

What I should be asking is the degree of specificity that is explicitly needed in how NDS is coded?

It seems to me that the preamble specs of the paradigm and layout files need to match the linguistic analysis, so for animate nouns that means N and A. The inflectional subtype associated with each lemma and listed in the preamble, like NA-1 or NA-4w, only needs to be matched by the lemma in the XML source (potentially disambiguating otherwise similar items); these inflectional subtypes do not influence paradigm generation.

The discrepancies in the XML are likely an issue of ambiguity in the comparison files and their linking with the original dictionary sources, and I think they’ll be best resolved by a single database for all sources (eventually).
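
To make the distinction between the coarse class (e.g. NA) and the inflectional subtype (e.g. 4w) concrete, here is a minimal sketch. It assumes the subtype is whatever follows the first hyphen in an `<lc>` value; `split_lc`, `matches_paradigm`, and the `paradigm_class` parameter are hypothetical names, and the real matching is whatever the paradigm files define, not this code.

```python
def split_lc(lc):
    """Split an <lc> value such as 'NA-4w' into (coarse class, subtype).

    Assumption: the inflectional subtype is everything after the first hyphen.
    A value without a hyphen has no subtype (None).
    """
    coarse, _sep, subtype = lc.strip().partition("-")
    return coarse, subtype or None


def matches_paradigm(lc, paradigm_class):
    """True if the coarse class of lc equals the class a paradigm expects.

    The subtype is deliberately ignored, mirroring the idea that it
    disambiguates lemmas but does not pick the paradigm.
    """
    return split_lc(lc)[0] == paradigm_class


for value in ["NA-4w", "NA-1", "VTA-2", "N"]:
    print(value, split_lc(value))

print(matches_paradigm("NA-4w", "NA"))  # True
print(matches_paradigm("VTA-2", "NA"))  # False
```

Under this reading, an MD-only entry tagged just N would not match an NA paradigm until its animacy is added.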

@eddieantonio
Member Author

> What I should be asking is the degree of specificity that is explicitly needed in how NDS is coded?

It's decided by the YAML header in the paradigm files.

https://github.com/UAlbertaALTLab/itwewina/blob/development/neahtta/configs/language_specific_rules/paradigms/README.md#analyzer-conditions

So, it's as specific as it is in the paradigm YAML. This could probably be done in a more straightforward way. I don't fully understand how this works either :( I think it can be inferred from the linguistic analysis, for Plains Cree?

@aarppe
Contributor

aarppe commented Feb 3, 2019

Note: having 'stranger' twice is due to an error in CW source in providing that English translation twice for 'ayahciyiniw' - so it's not a matter of the script making an error. Also, that the MD and CW entries are presented separately is due to 'ayahciyiniw' missing from the comparison file. So issues with the source materials, for now, rather than the scripts.

@eddieantonio
Member Author

Okay! I'm going to take this issue off of "stable version", as there are a number of real TODOs that need to be addressed before we even get to finish this one!

Also, "lexical category" makes way more sense!
