Validation of importjson structure so that the import process won't fail to pass database uniqueness constraints. #126

Open
2 of 4 tasks
fbanados opened this issue Jul 22, 2024 · 0 comments

fbanados commented Jul 22, 2024

Although this is more of an issue for the private altlab repository, it still affects how the importjson has to be generated so I'm adding it here (also to reference the issue from the code).

morphodict makes some uniqueness assumptions about the structure of the importjson file:

The latter two do not appear explicitly in the importjson; morphodict computes them at import time from the semantic definition of a wordform if present, or from the definition otherwise. Usually this is not a problem. When it is, it shows up as a cryptic UNIQUE constraint failure while running the importjsondict management command, and it is hard to tell from the error what the underlying problem is.

  • Identify the problem. The problem has been clearly identified: if two importjsondict entries match the same Wordform, the import-time safeguards that avoid duplicate keywords are bypassed. This currently only happens when processing an entry that is a "formOf" another entry. Bypassing the safeguards does not immediately trigger a failure; the failure only arises when the two entries share a keyword. This had not happened yet, but there was a risk of it eventually happening in CW: one entry in CW imports without safeguards (nisôkan as formOf misôkan@ndi) but, since no keywords are shared between the senses of their definitions, the process does not fail. Now that we are merging extra entries from the other sources, I identified one entry in AECD, cîscahisîpwâkanis, that imports without safeguards AND has shared keywords, which generates an importjson that cannot be imported.
  • Make crk-db check that the generated importjson file is well formed and will not trigger this kind of failure. The process of generating the database should include safeguards that prevent import failures and let linguists debug the database and decide on the outcome when the data is inconsistent.
  • Decide whether the import process in morphodict should also be changed so that the uniqueness restrictions for keywords do not stop the import. My take is that we should keep the import process unchanged, but I'm putting this out as an option for discussion if others feel it is really necessary to do that instead.
  • Fix the AECD entries for cîscahisîpwâkanis: the senses have opposite definitions, so my guess is that one of the entries is wrong and should be removed. However, that is a linguist's decision.
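
As a rough illustration of the crk-db check proposed in the second task, here is a minimal sketch in Python. It is not the actual crk-db or morphodict code. It assumes each importjson entry is a dict with a `head`, either a `slug` or a `formOf`, and a `senses` list whose items carry a `definition` and optionally a `semanticDefinition`. The helpers `keywords`, `wordform_key`, and `find_keyword_collisions` are hypothetical names, and the tokenizer is only a stand-in for whatever keyword extraction morphodict really performs at import time.

```python
import re
from collections import defaultdict


def keywords(sense):
    """Stand-in keyword extraction (an assumption, not morphodict's own).

    Uses the semantic definition if present, otherwise the definition,
    mirroring the rule described in this issue.
    """
    text = sense.get("semanticDefinition") or sense.get("definition", "")
    return set(re.findall(r"\w+", text.lower()))


def wordform_key(entry):
    """Hypothetical key for the Wordform an entry resolves to.

    Assumes a formOf entry resolves to (head, lemma slug) and a lemma
    entry to (head, its own slug).
    """
    return (entry["head"], entry.get("formOf") or entry.get("slug"))


def find_keyword_collisions(entries):
    """Report keywords shared by different entries that map to one Wordform.

    Returns a list of (wordform_key, keyword, first_index, second_index).
    """
    groups = defaultdict(list)
    for index, entry in enumerate(entries):
        groups[wordform_key(entry)].append(index)

    collisions = []
    for key, indices in groups.items():
        seen = {}  # keyword -> index of the first entry that produced it
        for index in indices:
            entry_keywords = set()
            for sense in entries[index].get("senses", []):
                entry_keywords |= keywords(sense)
            for keyword in sorted(entry_keywords):
                if keyword in seen:
                    collisions.append((key, keyword, seen[keyword], index))
                else:
                    seen[keyword] = index
    return collisions
```

Running something like this over the generated file before calling the import command would point at the offending pair of entries directly, so a linguist can decide which one to keep, instead of surfacing later as a UNIQUE constraint failure.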
fbanados added a commit that referenced this issue Jul 22, 2024
fbanados added a commit to UAlbertaALTLab/morphodict that referenced this issue Jul 24, 2024