Forums - incorrect kana type frequency (hiragana vs katakana)

Top > renshuu.org > Bugs / Problems



myuu3
Level: 99

I have noticed that some words are shown in the wrong kana (hiragana vs. katakana) when being quizzed.

this is separate from whether or not the kanji form is more popular than the kana form according to JMdict.

e.g. in the master schedule for beginner japanese, katakana shows as かたかな, but the カタカナ frequency is 70%, the kanji form is the remaining percent, and then a very small percent is かたかな.

i just came across another word like this, カニ for crab. on the renshuu card, it defaults to the hiragana again. the kanji is under 40% frequency, but when written using kana it is almost exclusively written as カニ rather than かに, yet it shows on the card in hiragana.

i am not sure if there is something on the backend with frequency calculations that defaults to hiragana every time for these types of words, but that might need to be looked into. for exclusively katakana foreign words that dont have kanji, they appear correctly in katakana. i am reporting cards like this, but I am seeing it quite a bit, so there might be something that needs to be looked at in the general way that cards with kanji components are processed.

0
3 months ago
マイコー
Level: 337

renshuu does not have the frequency data that jmdb has, which I presume is where you are pulling that frequency data from. So, it goes by jmdict's ordering of linked terms. For example, looking up カニ on jisho.org (which uses the same ordering from the same source), カニ is listed as an alternate (secondary) form.

I'm not sure there is much of a way to fix this without taking the time to generate the same freq data.

1
3 months ago
myuu3
Level: 99

hi Michael! im pulling the data from jpdb and jmdictdb (on edrdg dot org)

here is jpdb assessment https://jpdb.io/search?q=%E3%8...

here is jmdictdb assessment https://www.edrdg.org/jmwsgi/e...

jpdb I think uses a hybrid of jmdict comments and some proprietary frequency corpus list/ngram.

jpdb:

カニ 61%

38%

minor

for jmdict:

eij katakana
n-grams
32938
13
カニ 58605

which is very close to those same percentages
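To make the comparison concrete, here is a minimal Python sketch that turns the raw counts quoted above into percentages. The labels for the two non-katakana rows are placeholders and an assumption on my part, since the post does not show which spelling each of those two counts belongs to.

```python
# Counts quoted in the post above. Only the カニ label is given there;
# the other two labels are placeholders (assumed to be kanji spellings).
counts = {
    "カニ": 58605,
    "kanji form (assumed)": 32938,
    "rare form (assumed)": 13,
}


def shares(counts):
    """Return each spelling's share of the total count, as a percentage."""
    total = sum(counts.values())
    return {form: 100 * n / total for form, n in counts.items()}


result = shares(counts)
for form, pct in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{form}: {pct:.1f}%")
```

With these counts the katakana spelling comes out at roughly 64% and the larger of the other two at roughly 36%, which is in the same range as the jpdb split quoted earlier.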

so here we can see that there are no uses of かに in hiragana from major or known corpora

for your lookup on jisho, there is no plain hiragana usage, it defaults to the kanji (the 30s%) n-gram frequency, and the katakana is shown as alternate, but there is no single hiragana usage listed. かに should not be used in hiragana form for a card here imho because it just isnt used that way in japanese (and ive never seen it that way either, restaurant or story uses one of the kanji or katakana)

sorry if im not explaining it well, but i do think for some things like this it is important that the common kana form (if kanji is not being used) is displayed when learning, because that is what the learner will see (u will probably never see かに, so why learn it as the first exposure to the word)

i understand that this might not be a frequent occurrence of frequency lolz, but for certain cards and in the dictionary, i think they should be manually edited, such as this one, to display katakana, not hiragana. i can just submit when i see ones like this. when a user types "crab" into the dictionary on renshuu, i think it should return either the katakana or the major kanji form. currently it returns hiragana (virtually no real world usage), before then listing both kanji forms and the katakana.

this is just how i feel, its ofc fine however u choose to make it and i dont know how complicated this is to change, but for some words like this i think even if just manually edited on backend that would be good

1
3 months ago
マイコー
Level: 337

You're explaining it quite well! At the moment, though, I do not have the resources to go through and check these on a case-by-case basis. I am not exactly sure where those ngrams are coming from - that's not "jmdict's assessment", that's an unsourced comment by another user. Without a clear and legally usable resource to pull from, this is not an improvement that I believe I'll be able to make.

Please do not get me wrong - I would *like* to add this, but short a dataset that I can use and trust, it's not something I have the time to look at each time a specific word comes up that may not be ordered in an ideal way.

1
3 months ago
myuu3
Level: 99

hi Michael!! that is astute noticing it is just a comment, and not from Jim Breen himself there. i just did little searching just now, this seems like it is open source frequency https://clrd.ninjal.ac.jp/bccw...

you can skip to later where I type skip to here!!! it is also included on a list of open source and free for educational usages on this list of yomitan dictionaries that also includes some other processed n-grams and frequency lists https://github.com/Kuuuube/yom...

some of those like jpdb I am assuming are non-free, but that BCCWJ one is "It is free for use for research or educational purposes." on their website. that one looks like it was processed from this https://github.com/toasted-nut...

also found this paper, which i just skimmed now http://faculty-sgs.tama.ac.jp/...

after thinking for a bit and doing some more searching, i think that toasted script just processes surface/lexical frequency alone, and does not agglomerate it into lemma frequency...

u can SKIP TO HERE!!!!!!

did some more searching, and now find this http://faculty-sgs.tama.ac.jp/...

SUW and LUW data for lemmae! "The corpus word lists are grouped according to the two word unit definitions used within the BCCWJ project (NINJAL, 2011) and UniDic (Den, et al, 2007); namely, short unit words (SUW) and long unit words (LUW)." it includes OrthBFreq for lemmae, and you dont even need to perform the calculations because there's also OrthCover which is "Ratio of total lemma frequency covered by a particular orthographic base form"

so I think that's what to do, should this be something that you end up being interested in doing. because renshuu is a learning resource and not some technical dictionary, I think that starting even with just the SUW data (not even needing to merge it with the LUW) would be good. could even add a checkmark flag/option in renshuu's dictionary to display percentages next to surface realizations/forms. then under some experimental feature, there can be a button that says something like "reprocess schedule to orthographic ratio" which will modify cards in that user's schedule (if they are not touching the dictionary?) or are the cards still not duplicated somehow? maybe a shim layer, or it can create a shadow/duplicate copy of cards in the schedule and then rearrange forms according to ratio. that's more of a suggestion at the end, but for now i think i will do that with my exported test schedule (run it through a SUW ratio processor i will program to rearrange the terms according to freq ratio).
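A minimal sketch of what such a ratio processor could look like in Python. Everything about the input layout is an assumption: the column names `lemma`, `orth` and `orth_cover` are hypothetical stand-ins for whatever the actual word-list files use, the sample rows and their numbers are invented for illustration, and `(kanji)` is a placeholder rather than a real spelling.

```python
import csv
import io

# Invented sample rows in a hypothetical layout: lemma id, written form,
# and the share of the lemma's frequency covered by that form (OrthCover).
SAMPLE = """lemma\torth\torth_cover
crab\tカニ\t0.60
crab\tかに\t0.02
crab\t(kanji)\t0.38
"""


def load_cover(tsv_text):
    """Build {lemma: {written form: coverage ratio}} from tab-separated text."""
    cover = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        cover.setdefault(row["lemma"], {})[row["orth"]] = float(row["orth_cover"])
    return cover


def reorder_forms(lemma, forms, cover):
    """Sort a card's written forms so the most-used form comes first."""
    ratios = cover.get(lemma, {})
    # sorted() is stable, so forms without data keep their relative order.
    return sorted(forms, key=lambda form: ratios.get(form, 0.0), reverse=True)


cover = load_cover(SAMPLE)
card_forms = ["かに", "(kanji)", "カニ"]
print(reorder_forms("crab", card_forms, cover))
```

Because the sort is stable, a word with no coverage data at all keeps its existing ordering, so a pass like this would only change cards for which the word list actually has something to say.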

in summary i think that something like former part about adding percentages to dictionary is at least simple, only thing that terry joyce asks is "Respecting standard curtesies in these matters, individuals who utilize the corpus word lists within published works are politely asked to cite the Joyce, et al (2012) paper." so could say something like when u check the experimental box to "add orthographic frequency/ratio to dictionary entries" there can be a small subtext that says "utilizes Japanese lexical properties database (JLPD) processed by terry joyce, as described in Joyce, Hodošček & Nishina (2012), based upon BCCWJ corpus (licensed as free use for educational purposes)" that is the paper that i listed before (this one: http://faculty-sgs.tama.ac.jp/... )

sorry i type a lot stream of consciousness, this was fun for me too and that paper is interesting, i will read it in more detail later. maybe you also find this interesting, maybe not though hehe

0
3 months ago
マイコー
Level: 337

I also found this one, which seems to be usage/license unencumbered.

https://www.s-yata.jp/corpus/n...

1
3 months ago
myuu3
Level: 99

interesting! i wonder why those are so much larger? i cannot download to check a file of that size now, but the joyce ones have those computed ratios already included, and the files are only several MB, rather than GB... strange!

maybe because BCCWJ is textbook corpus, but google n-gram is... everything that people type online?

ok reading through that s-yata site more and it seems to me like those are large because most of the n-grams are of large N, and also web. i think to start at least use the small several MB ones from joyce because that's more of a vocabulary type listing. the s-yata ones are probably for training language models or machine translator programs? but 1000 のファイルリスト 107MB, 541MB I guess not too large... though u will have to process a 500 MB file and perhaps calculate ratios. not sure how large the dictionary here is, but for expression morpheme processing, maybe if u have the backend compute?

the SUW data is only 13 MB, and even the LUW data is 115 MB... but i just saw that is 115 MB compressed... close to the 107 MB from s-yata. maybe compare the datasets and see how they are structured. having to compute orthographic frequencies within lemmae might be difficult, and I doubt that the s-yata file even has lemmae in it, as in no normalized dict headwords like joyce, which is computed from unidic. morphological analysis segmentation on the s-yata set is probably going to require u to normalize the data somehow and will probably take some good compute time. joyce is already precomputed.

0
3 months ago
myuu3
Level: 99

ok just edited my answer lolz sry

0
3 months ago
マイコー
Level: 337

The google-based ones are not just 1-gram, and they are not tied to just dictionary entries. I've been running them through a script I wrote, and I think it's maybe using 1-2% of the actual entries in those files, so most of them are not what we'd consider to be dictionary entries (at least, not a dictionary like jmdict).
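The script mentioned above is not shown in the thread, so the following is only a guess at the general idea: keep the lines of an n-gram file whose token is a single dictionary headword and drop everything else. It assumes each line has the form `tokens<TAB>count` with tokens separated by spaces, which may not match the actual files, and the sample lines and counts are made up.

```python
def filter_unigrams(lines, headwords):
    """Keep counts only for single-token lines that match a dictionary headword."""
    kept = {}
    for line in lines:
        # Split on the last tab: everything before it is the n-gram itself.
        token, _, count = line.rstrip("\n").rpartition("\t")
        if " " in token:
            continue  # more than one token, so a 2-gram or longer
        if token in headwords:
            kept[token] = kept.get(token, 0) + int(count)
    return kept


# Invented sample lines; real files would be read from disk instead.
sample_lines = ["カニ\t5\n", "カニ を\t3\n", "ねこ\t7\n"]
dictionary_headwords = {"カニ", "かに"}
print(filter_unigrams(sample_lines, dictionary_headwords))
```

Note that a filter like this matches on surface strings only, so it cannot tell whether a matching token is really the dictionary word or just an identical sequence of characters.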

0
3 months ago
マイコー
Level: 337

These are apparently from a Google in-house 20% project. Looking at the files more carefully, they go far past 1-gram, which is why the dataset is so large.

Strangely (ignore the duplicates), the ngram data it has for the crab example we've been working with is quite different from what you've seen. These are absolute numbers, not %s, but it gives かに at 561k, カニ at 901k, and the kanji at 706k.

If you keep out the hiragana, the numbers are *almost* the same as the 61/39 you noted before. However, I'm wondering (I'm really new to dealing with ngram data) how you can exclude the n-gram for かに and say "This is used for something else, and cannot be considered as the kana form of the kanji."
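For reference, here is a small Python sketch of that comparison, using the rounded counts quoted above (in thousands). The key `kanji form` is a placeholder for the kanji spelling.

```python
# Rounded absolute counts from the post above, in thousands.
google_counts = {"かに": 561, "カニ": 901, "kanji form": 706}


def share(counts, exclude=()):
    """Percentage share of each form, after dropping any excluded forms."""
    kept = {form: n for form, n in counts.items() if form not in exclude}
    total = sum(kept.values())
    return {form: round(100 * n / total, 1) for form, n in kept.items()}


with_hiragana = share(google_counts)
without_hiragana = share(google_counts, exclude={"かに"})

print(with_hiragana)     # all three forms
print(without_hiragana)  # katakana vs. kanji only
```

With the hiragana left out, the split is about 56% katakana to 44% kanji. The code only does the arithmetic; deciding that かに should be excluded is still the open question.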

1
3 months ago
myuu3
Level: 99

i assume the numbers for the s-yata set will have different frequencies because of how the internet is structured. the internet is more imbalanced than textbook corpora, as certain types of data are magnified (or suppressed) in frequency because of the nature of what types of content get put online and how often it is "repeated" compared with a selected array of textbooks or other such written documents, and what parts of that are accessible to google. google does have an ngram viewer for books, but that dataset u linked on s-yata uses web pages that google crawled, not books. the site says: NグラムはされているのWebページでGoogleがクロールしたものからされている (roughly: the N-grams are taken from web pages that Google crawled).

the high number of かに might b becuz it is from a different lemma? is there lemma information in that dataset? there might be morphological forms of かに that imply crab (or otherwise) that are being parsed differently. かに versus カニ。

re: how you can exclude

I think this is done by whatever method is used to normalize the dataset into lemmae. in the joyce paper, it briefly describes and links how the unidic source handles it under § 5, orthographic representation, issues and data. it describes it well there i think, but also there are links to the actual unidic methodology.

are the lemmae within the google dataset normalized into semantic groupings? if not, that's probably why there is this discrepancy. i think maybe by comparing the structure of both datasets u can see why there is a difference. even though the upper set you are using is morpheme based, thats still different from semantically normalized.

0
3 months ago
myuu3
Level: 99

and if there is no semantic grouping in the google ngram set, that is because it doesnt need it because those types of ngrams in large datasets are for things like machine learning or translators, which do not "think" semantically, but are mainly just raw statistical engines. those types of datasets arent meant for humans, but are rather just meant for mass tokenization for machine ingestion.

0
3 months ago
myuu3
Level: 99

ok i thought of another example besides かに to help make it more clear maybe: なにかかにか as a surface form. this should morpheme parse into なに+か+か+に+か。 a 2-gram from these morphemes will give you かに。 but this is not crab. it's a 2-gram even after morphological parsing (not just a 2-gram by single kana parsing), but with no semantic normalization or disambiguation to crab (after all, here it is not a crab).

that's why those ngram datasets, even with morphological parsing/segmentation, arent going to really work here i dont think (unless there is normalization in it, but reading the site it doesnt seem like it).
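The なにかかにか example above can be sketched in a few lines of Python. The morpheme list is written by hand to follow the segmentation described in the post; it is not the output of a real analyzer such as MeCab.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one surface string."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hand-written segmentation of なにかかにか, following the post above.
morphemes = ["なに", "か", "か", "に", "か"]

bigrams = ngrams(morphemes, 2)
print(bigrams)          # ['なにか', 'かか', 'かに', 'にか']
print("かに" in bigrams)  # True, although nothing here means "crab"
```

The string かに shows up purely because か and に happen to sit next to each other, which is the point being made: a surface-level count cannot tell this apart from the word for crab.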


0
3 months ago
マイコー
Level: 337

These are 1-gram results for かに, though. You see it in jmdict's version of the search engine against Google's n-gram data as well: https://www.edrdg.org/~jwb/cgi...

Yea - if it's kata vs. kanji, then you can feel more comfortable about usage, but ..

0
3 months ago
myuu3
Level: 99

are you sure that page there is returning a 1-gram and not a 2-gram for かに? even if it's a morphologically parsed word-level or subword-level 1-gram, it still isn't semantically normalized though, it's just for things like collocations and other types of language modelling. those aren't semantically linked as crab, the surface realization of かに there can have any other 1-gram forms depending on how that MeCab/IPADIC tokenizes it and how aggressively it splits particles. look at the example かれはでした where you get a 1-gram for かれ (bad example sry, because かれ is a proper parsing of the morpheme). particle sequences get agglomerated without really aggressive stripping/tokenization, so かに will still be tokenized as a 1-gram in かに、どうかに、するかに、things like this. but that doesnt mean that it's a crab, because there is no semantic normalization. particle usage in japanese is really common, so particle chains like かに are going to appear a ton unless the parser is set to really aggressively strip them down.

this lines up with the data you posted. have you ever seen かに used for crab? in a restaurant, in a book, even in a google search? i dont think i have, or it has been just a few times, and certainly less than カニ. but with such a large hiragana count in that google ngram viewer of 13565484, you would think that it would be far and away the most common, right? well that's because that 1-gram of かに with such a large count is almost all instances of か+に that are not being further aggressively stripped into か and に。

this is just for fun and i am sorry if it took a bit of time away from more important things, but mbe u stil find it interesting hehe, just want to say though that u need to be really careful if u are using non-normalized ngrams for something like a dictionary/frequency list, because that type of data isnt meant to be used for a human/semantic dictionary.

0
3 months ago
myuu3
Level: 99

as for the discrepancy with:

eij katakana
n-grams
32938
13
カニ 58605

i dont know where that user got those ngrams. you can run mecab/ipadic with aggressive splitting of particles, which maybe was done on a smaller dataset (far smaller numbers here) or maybe it was taken from a semi-supervised list like joyce's based on BCCWJ+UniDic. even subword tokenizers mess this type of thing up. for a dictionary, especially a learner's type dictionary, i just want to caution against using some massive google dataset because that type of data isnt really meant for people to use. even by their own words: "Another problem arises from occasional segmentation issues. The IPADIC lexicon which was used for the task was known to have some flaws, for example the term , which is usually regarded as morpheme pair (+), was treated as a single morpheme [my note here, this shows a bias toward agglomeration]... As mentioned in the article, the structure of the data in the original corpus distribution does not cater very well for the process of looking up the frequency counts for single terms or sets of terms."

i really doubt that even with google's processing power at that time that they were able to take such a massive dataset and aggressively handle particle splitting and subword tokenization as a side project. for a learner's dictionary, u want some supervised/semi-supervised dataset that has had humans overviewing semantic differences.

0
3 months ago
マイコー
Level: 337

I believe it was 1-gram, but I am not sure if it specifies on that webpage.

As to the "real usage" of かに, that is unfortunately irrelevant. I do agree with you - I never see かに used. This is just one example that we've been using, though, and a one-by-one investigation of terms is simply not something I can do at scale, hence the hope for some kind of external data to use. If the external data, though, ultimately still needs a person to say "well, the 1-gram for かに is extremely high, but I judge that to not be referring to 'crab'", then we're back to square one - human intervention makes this project a non-starter.

0
3 months ago
myuu3
Level: 99

lists like joyce's do meet this criterion: since it is built from BCCWJ/UniDic, it is an example of a supervised (or at least semi-supervised, and semantically normalized) dataset. that's just one such example. it comes from NINJAL (national institute for japanese language and linguistics) and tama university research. there are probably other lists as well. n3way tho, just something to think about!

0
3 months ago
マイコー
Level: 337

This is all good info, and I feel like it's a good start to something we can use in the future!

1
3 months ago