Ambiguous, priority-tagged keys in example sentence index strings #121
Comments
Most of the ambiguous indices came about as a result of new JMdict entries being created since the original sentence indexing was done about 20 years ago. I think the 易々/やすやす/いい is such a case. Disambiguation can be done in two ways:
Turning to the specific questions:
Probably not. They both should be indexed using KANJI(ど).
No. The sense number is really only for matching by the reporting app; not for finding the sentence pair (at least for WWWJDIC - I can't comment on other apps.)
Correct. The part in {} is only to indicate the form in which the term appears in the sentence. In WWWJDIC it's used to enable the indexed term to be highlighted in the sentence display. I also use it in a validation program to verify the integrity of the indices.
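To make the pieces of an index key concrete, here is a minimal parsing sketch. It assumes the layout seen in the examples in this thread: a headword, an optional qualifier in parentheses (a reading or an entry number), an optional form in curly braces, and an optional trailing ~. The square-bracket sense number and the field order are assumptions, and `parse_token` and `IndexToken` are hypothetical names, not part of WWWJDIC or Tatoeba.

```python
import re
from typing import NamedTuple, Optional


class IndexToken(NamedTuple):
    headword: str              # the dictionary form the key points at
    qualifier: Optional[str]   # reading or entry number given in (...)
    sense: Optional[int]       # sense number given in [...] (assumed syntax)
    form: Optional[str]        # surface form in the sentence, given in {...}
    checked: bool              # True when the key carries the trailing ~


# Assumed field order: headword (qualifier) [sense] {form} ~
TOKEN_RE = re.compile(
    r"^(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<qualifier>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
    r"(?P<checked>~)?$"
)


def parse_token(token: str) -> IndexToken:
    """Split one index key into its parts; raise ValueError if it does not fit."""
    match = TOKEN_RE.match(token)
    if match is None:
        raise ValueError(f"unrecognised index key: {token!r}")
    sense = match.group("sense")
    return IndexToken(
        headword=match.group("headword"),
        qualifier=match.group("qualifier"),
        sense=int(sense) if sense else None,
        form=match.group("form"),
        checked=match.group("checked") is not None,
    )
```

With this sketch, the two keys from the 易々 report differ only in `qualifier`: it is `None` for 易々{やすやす}~ and やすやす for 易々(やすやす){やすやす}~.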
OK, so if the index parser is simple and doesn't use all the information available to find the correct entry, then by my count there are 572 instances of keys within index strings that we can consider to be ambiguous. We can probably use the extra available information (sense numbers, readings within curly braces, "usually kana" info) to fix a couple hundred of these instances automatically. If I were to provide a list of sentence IDs, index strings, and fixed index strings, would we be able to run a bulk update (find-and-replace) on the index database? By the way, out of those 572 instances, only 8 contain readings in parentheses. We'll need to replace these readings with explicit sequence numbers since the kanji-reading pairs all belong to more than one entry. (In the case of 家・うち, the pair is in entry 1457730 as a search-only form.)
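A rough sketch of what such a bulk update could look like, assuming the index database can be viewed as a mapping from sentence ID to index string. The function name `apply_fixes`, the in-memory dict, and the sentence IDs used with it are all hypothetical; this says nothing about how Tatoeba's own global edit works.

```python
from typing import Dict, List, Tuple

# One fix row: (sentence ID, current index string, fixed index string)
Fix = Tuple[int, str, str]


def apply_fixes(index_db: Dict[int, str], fixes: List[Fix]) -> List[int]:
    """Apply each fix in place and return the sentence IDs that were skipped.

    A row is applied only when the stored index string still equals the
    "current" string in the fix list, so sentences edited in the meantime
    are left alone and reported back for manual review.
    """
    skipped = []
    for sentence_id, old, new in fixes:
        if index_db.get(sentence_id) == old:
            index_db[sentence_id] = new
        else:
            skipped.append(sentence_id)
    return skipped
```

Matching on the whole index string rather than on a substring avoids touching a sentence whose indices have changed since the fix list was generated.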
I don't think there are 572 ambiguous instances. If we look at the first line in your table:
Yes, 解ける is found in two entries (1546070;1198910) but it is only the leading kanji form in 1198910. In 1546070 the leading kanji form is 溶ける and sentences are linked to that form. You can verify this by looking up 解ける in WWWJDIC. Finding the actual ambiguous instances is tricky. I don't think the sense numbers and the written forms in curly braces are actually much use for that.
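The difference between "appears in an entry" and "is the leading kanji form of an entry" can be sketched as follows. The two entry numbers and their leading forms come from the comment above; the assumption that 解ける is the second form listed in 1546070, and the function names, are illustrative only and not taken from WWWJDIC's code.

```python
from typing import Dict, List

# Kanji forms per entry, in dictionary order (first item = leading form).
# Only the leading forms are taken from the discussion; the position of
# 解ける inside 1546070 is an assumption for illustration.
ENTRIES: Dict[int, List[str]] = {
    1546070: ["溶ける", "解ける"],
    1198910: ["解ける"],
}


def entries_containing(key: str, entries: Dict[int, List[str]]) -> List[int]:
    """Entries that list the key as any kanji form (the 'simple parser' view)."""
    return sorted(seq for seq, forms in entries.items() if key in forms)


def entries_led_by(key: str, entries: Dict[int, List[str]]) -> List[int]:
    """Entries whose leading kanji form is the key."""
    return sorted(seq for seq, forms in entries.items() if forms and forms[0] == key)
```

Under the leading-form rule a bare key is ambiguous only when `entries_led_by` returns more than one entry, which is a much smaller set than the keys for which `entries_containing` does.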
It's the same with 家(うち) and 才(さい) - there are actually no ambiguities in the sentence linking. 朱(しゅ) was ambiguous so I replaced the reading with the entry number.
This "leading kanji form" method doesn't seem to work reliably. For example, sentence #142850 has the following index string.
In the latest version of the
In the example above, the クジラ in curly braces shows that the sentence belongs to the くじら entry rather than いさな. I'm only suggesting that the info could be a useful heuristic for spotting these incorrectly indexed sentences. I've been meaning to get around to doing some more work on this issue. I might have some more progress to share soon.
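The heuristic could be sketched like this: when the form in curly braces is written entirely in kana, fold katakana to hiragana and keep only the candidate readings that match it. The function names are hypothetical, and the sketch deliberately does nothing when the form contains kanji, since the heuristic has no information in that case. Inflected forms will not match a dictionary reading either, so an empty result means "no evidence" rather than "wrong entry".

```python
from typing import List


def to_hiragana(text: str) -> str:
    """Fold katakana (ァ..ヶ) to the corresponding hiragana; leave the rest as is."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text
    )


def is_kana(text: str) -> bool:
    """True for a non-empty string made only of hiragana, katakana, or ー."""
    return bool(text) and all(
        "ぁ" <= ch <= "ゖ" or "ァ" <= ch <= "ヶ" or ch == "ー" for ch in text
    )


def readings_matching_form(form: str, candidate_readings: List[str]) -> List[str]:
    """Narrow candidate readings using the kana form from the curly braces."""
    if not is_kana(form):
        # The form contains kanji (or is empty): keep every candidate.
        return list(candidate_readings)
    target = to_hiragana(form)
    return [r for r in candidate_readings if to_hiragana(r) == target]
```

For the 鯨 case, a form of クジラ narrows the candidates くじら and いさな down to くじら, which is the signal described above.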
Yes, the arrival of that 鯨/いさな entry meant that it competes with 鯨/くじら for which one gets linked. I've now amended the 鯨 links to 鯨(くじら) which should fix it. If I get the time and energy I should check the unqualified kanji indices in the sentences for cases where there are multiple dictionary entries with the same form. Fixing them can be a problem - Tatoeba's global edit is very good, but it has a problem with cases where the index form is at the start of the sentence.
I set up a repo here on GitHub to track my edits to the Tatoeba 'sentence annotations' database. It is synchronized once a week with the https://github.com/stephenmk/jmdict-tatoeba-sentence-linking/commits/main/ My fixes to problematic index strings will be visible here as I work on this issue over time.
Occasionally I come across example sentences that are keyed to the incorrect entry because the key has been defined ambiguously. See this report about 易々 here for example. The index string contains 易々{やすやす}~ instead of 易々(やすやす){やすやす}~ and consequently the sentence ends up in the entry for いい【易々】 in the JMdict_e_examp file.
I'd like to fix all of these errors at once, so I tried to search for priority-tagged keys (i.e., keys with the ~ symbol appended) which could be considered ambiguous. Unfortunately there seem to be at least several hundred. The precise number depends upon how we define ambiguity.
Even if we assume that the key "ど" belongs to "ど[nokanji]" and also make use of the sense number information, by my count there are still 331 ambiguous sentences. I posted a CSV file with the data here. Working through this list would be quite a challenge.
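One possible definition of ambiguity for this search can be sketched as: a priority-tagged key (ending in ~) that carries no parenthesized reading or entry number, and whose headword is listed in more than one entry. This is only one of several possible definitions and is not the method used to produce the counts above. The sketch assumes keys in an index string are separated by whitespace; `ambiguous_priority_keys` and the entry labels passed to it are hypothetical.

```python
import re
from typing import Dict, List, Set


def ambiguous_priority_keys(
    index_string: str, headword_to_entries: Dict[str, Set[str]]
) -> List[str]:
    """Return the priority-tagged keys that lack a qualifier and match several entries."""
    flagged = []
    for token in index_string.split():
        if not token.endswith("~"):
            continue  # not priority-tagged
        if "(" in token:
            continue  # already qualified by a reading or entry number
        # The headword is everything before the first [, { or ~.
        headword = re.split(r"[\[{~]", token, maxsplit=1)[0]
        if len(headword_to_entries.get(headword, set())) > 1:
            flagged.append(token)
    return flagged
```

Tightening the definition, for example by applying the leading-kanji-form rule or the sense numbers, would shrink the flagged set, which is why the count depends on the definition chosen.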