This repository has been archived by the owner on May 6, 2021. It is now read-only.

Check validity of dictionary when starting up #94

Open
7 tasks
eddieantonio opened this issue Dec 12, 2018 · 3 comments
Labels
enhancement New feature or request

Comments

@eddieantonio
Member

eddieantonio commented Dec 12, 2018

Ensure mistakes like #93 don't happen again.

Basically, do a few sanity checks before starting the app:

  • do all <source> elements have a <title>?
  • do all <source> elements have an ID?
  • do all <e> elements have one <lg> and at least one <mg>?
  • do all <mg> elements have at least one <tg>?
  • do all <tg> elements have at least one <t> element?
  • do all <t> elements with a sources attribute have a unique set of sources?
  • are all values in the <t sources> attribute valid <source> IDs?

EDIT: I'm pretty sure I can validate a lot of these things by creating an XML schema, and using a schema validator, but... that might be more effort than it's worth.

Print warnings on start up that are LOUDLY logged somewhere.
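A rough sketch of what those start-up checks could look like without a schema, using only Python's standard library. Everything not named in the checklist is an assumption: that a source's ID lives in an `id` attribute, that `sources` is a whitespace-separated list, that "unique set of sources" means no ID is repeated within one attribute, and the names `check_dictionary` and `warn_on_startup` are hypothetical.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("dictionary.validation")


def check_dictionary(root):
    """Return a list of human-readable problems found in the dictionary tree."""
    problems = []

    # <source> elements: each needs an ID and a <title>.
    source_ids = set()
    for source in root.iter("source"):
        source_id = source.get("id")  # assumption: the ID is an `id` attribute
        if source_id is None:
            problems.append("<source> without an ID")
        else:
            source_ids.add(source_id)
        if source.find("title") is None:
            problems.append(f"<source id={source_id!r}> has no <title>")

    # <e> elements: exactly one <lg> and at least one <mg>.
    for entry in root.iter("e"):
        if len(entry.findall("lg")) != 1:
            problems.append("<e> does not have exactly one <lg>")
        if entry.find("mg") is None:
            problems.append("<e> has no <mg>")

    # <mg> needs at least one <tg>; <tg> needs at least one <t>.
    for mg in root.iter("mg"):
        if mg.find("tg") is None:
            problems.append("<mg> has no <tg>")
    for tg in root.iter("tg"):
        if tg.find("t") is None:
            problems.append("<tg> has no <t>")

    # <t sources="..."> values: no repeats, and every value is a known source ID.
    for t in root.iter("t"):
        raw = t.get("sources")
        if raw is None:
            continue
        sources = raw.split()  # assumption: whitespace-separated IDs
        if len(sources) != len(set(sources)):
            problems.append(f"<t sources={raw!r}> repeats a source")
        for unknown in sorted(set(sources) - source_ids):
            problems.append(f"<t sources={raw!r}> names unknown source {unknown!r}")

    return problems


def warn_on_startup(path):
    """Parse the dictionary at `path` and log every problem as a warning."""
    problems = check_dictionary(ET.parse(path).getroot())
    for problem in problems:
        logger.warning("dictionary validation: %s", problem)
    return problems
```

Calling something like `warn_on_startup(...)` with the dictionary path before the app begins serving would cover the "LOUDLY logged" part, provided the `dictionary.validation` logger is routed somewhere people actually look. A schema validator would express the structural checks more declaratively, but the last two checks (cross-referencing `sources` against `<source>` IDs) are easy to write by hand like this.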

@eddieantonio eddieantonio added the enhancement New feature or request label Dec 12, 2018
@aarppe
Contributor

aarppe commented Dec 12, 2018

I had noticed this in a few cases. Currently, the reason is that some of the comparative matches/mismatches between CW and MD result from the descriptive analysis allowing for two inflected forms belonging to different parts of speech (and lemmas with different parts of speech). Typically, we have a non-base inflected-form lexical entry in MD, which matches a base-form lexical entry in CW. In such a case one would need two different lexical entries. E.g.

MD: atos
MD: Have him do something for you.
CW: atos+N+AN+Sg
CW: atos
CW: arrow
CW: atos COMP:lemma

This should be resolvable, but may require some thinking on how to produce the appropriate POS and LC info for the MD entries (which don't have CW-style LC:s, so I'd have to extract that by linking the lemma from the correct FST analysis of the MD entry with the LC in CW for that lemma).

@aarppe
Contributor

aarppe commented Dec 20, 2018

With some new scripting, dictionary entries end up being matched only if they belong to the same part of speech (by exclusion of the 'conjugation' class in MD vs. CW comparisons); the conjugated/inflected forms are now output as separate MD-only dictionary entries. So the 'atos' issue above is no longer an 'issue'.

@aarppe
Contributor

aarppe commented Dec 20, 2018

Checking the validity of the XML source and delivering warnings of bad structure to an appropriate location/email is a very desirable feature.


2 participants