HSK levels not representing every word

Ilraon · July 5, 2020, 4:57pm

Hello!
HSK5 is a level of Mandarin Chinese that is supposed to contain about 2500 words, and HSK6 about 5000 words.
It’s obvious that in Clozemaster, the HSK5 and HSK6 collections are far from containing every word of these levels, because the number of sentences within these collections is relatively low. (I think someone stated in another thread that for one of the two, close to 50% of the words were contained…)
I was curious as to why this is the case, when the “HSK?” category has such a huge number of words. I find it hard to believe that none of the missing HSK5 and HSK6 words appear in “HSK?”.

My hypothesis was that the condition of having >75% of words appearing in the sentence from that HSK level or below is too restrictive; and thus some words are missing. Of course sentences respecting that condition are ideal and very pleasant to learn; but I would rather have a sentence for each word, even if they don’t respect this condition, instead of half the words of the category missing.
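
To make the >75% condition concrete, here is a minimal Python sketch of how such a filter might behave. The function names, the toy vocabulary and the sentences are assumptions made up for illustration; this is not Clozemaster's actual code, and it presumes the sentences have already been split into words.

```python
def coverage(words, allowed):
    """Fraction of a sentence's words found in the allowed vocabulary."""
    if not words:
        return 0.0
    return sum(w in allowed for w in words) / len(words)


def passes_rule(words, allowed, threshold=0.75):
    """True if strictly more than `threshold` of the words are allowed."""
    return coverage(words, allowed) > threshold


# Toy stand-in for "HSK5 and below" (hypothetical, not the real list).
allowed = {"我", "的", "梳子", "在", "哪儿", "了"}

easy = ["我", "的", "梳子", "在", "哪儿"]      # 5 of 5 words allowed
hard = ["梳子", "掉", "进", "了", "下水道"]    # 2 of 5 words allowed

print(passes_rule(easy, allowed))
print(passes_rule(hard, allowed))
```

If the only available sentence for 梳子 looked like `hard`, the word would drop out of the collection entirely, which matches the symptom described above.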

mike · July 10, 2020, 10:10am

Hello!

My hypothesis was that the condition of having >75% of words appearing in the sentence from that HSK level or below is too restrictive; and thus some words are missing.

This is indeed the case.

Of course sentences respecting that condition are ideal and very pleasant to learn; but I would rather have a sentence for each word, even if they don’t respect this condition, instead of half the words of the category missing.

Good to know and good point. There are two things we can likely do:

1. Try to improve how we split Chinese sentences into words. I suspect some of the issue comes from not splitting the sentence correctly, causing some of the “words” not to match our HSK lists.
2. Keep the >75% but add at least one sentence for each word like you mentioned, taking the one with the most matched words / best fit.
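
The second option could be sketched roughly as follows in Python. Everything here (the `build_collection` name, the `fallback_per_word` parameter, the toy data) is a hypothetical illustration under the assumption that sentences are already split into words; it is not the actual implementation.

```python
def coverage(words, allowed):
    """Fraction of a sentence's words found in the allowed vocabulary."""
    if not words:
        return 0.0
    return sum(w in allowed for w in words) / len(words)


def build_collection(sentences, level_words, allowed,
                     threshold=0.75, fallback_per_word=1):
    """Map each target word to its sentences.

    Sentences above the threshold are kept as before. If a word has none,
    fall back to its best-covered sentence(s) instead of dropping the word.
    """
    chosen = {}
    for word in level_words:
        candidates = [s for s in sentences if word in s]
        good = [s for s in candidates if coverage(s, allowed) > threshold]
        if good:
            chosen[word] = good
        else:
            ranked = sorted(candidates,
                            key=lambda s: coverage(s, allowed),
                            reverse=True)
            chosen[word] = ranked[:fallback_per_word]
    return chosen


# Toy data (hypothetical): a tiny allowed vocabulary and four sentences.
allowed = {"我", "的", "梳子", "了", "嫩", "很"}
level_words = ["梳子", "嫩", "撕"]
sentences = [
    ["我", "的", "梳子"],                  # coverage 3/3
    ["梳子", "掉", "进", "了", "下水道"],  # coverage 2/5
    ["豆腐", "很", "嫩"],                  # coverage 2/3, below the threshold
    ["这", "块", "牛排", "嫩"],            # coverage 1/4
]

collection = build_collection(sentences, level_words, allowed)
```

Here 嫩 has no sentence above 75%, so it gets its best-fit sentence; 撕 appears in no sentence at all and stays empty, which is the case where new sentences would have to be written.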

Chinese overall could use some work - we’ll also likely want to re-clozeify sentences on the Fast Track and Most Common Word groupings once we can figure out a better way of splitting sentences into words.

Curious to hear if you have any further thoughts of course. Thanks for the feedback!

Ilraon · July 13, 2020, 4:19pm

Hi hi!

I agree with your hypothesis regarding 1. Ensuring the 75% condition means you need a perfect split of the whole sentence, which I suppose is a challenge given the current strength of AIs in Chinese. I have not done any research on how to improve that split.

It remains unclear whether “ideal” sentences have actually gone unnoticed, or whether you’ve already used the full potential of your pool of sentences under the 75% condition. It might be that even after implementing a perfect split algorithm, there would still be words missing.

Thus I think in all cases (doing 1 or not) you should go with 2 as well. You’ve mentioned using one “with the most matched words”, which seems like a good idea. Actually, I think using two would be more reasonable, since it’s quite difficult to learn a word in a single sentence. Especially if the example sentences aren’t ideal, it might help to have more than one to guess the pattern.
The advantage of 2 is that, compared to 1, it is quick to implement, or at least the ignorant me feels that way, and I guess as such it could be used as a quick improvement. (plz I’ll start the HSK5 collection in a couple of weeks)

Now, slightly unrelated notes that I just thought I’d share:

The pronunciation of the AI is sometimes wrong. The example I have come upon most is sentences including 得, which can be pronounced both ‘dei’ and ‘de’, but with a different meaning - and the AI never gets the ‘dei’ right.
There are some sentences including traditional characters in the simplified course (the one we’re talking about), most notably with 为什么 being written 为甚麽. By the way, this section being called Mandarin Chinese and the other being called Traditional Mandarin Chinese implies that simplified Mandarin Chinese is the legitimate Mandarin Chinese, which, at least culturally, isn’t obviously the case since it’s relatively new.

Thanks for your time!
Have a nice day

Leriot · October 17, 2020, 8:20pm

I also think that those words that don’t fulfil the 75% condition should still have one or two sentences, at least until there is a better idea on how to achieve better results.

Ilraon · October 28, 2020, 6:01pm

Thank god! I’m not alone!
Also, I started learning HSK5. A suitable time to bring this to your attention again @mike ^-^
(As far as I can tell the Chinese section of Clozemaster is pretty active!)

ps I keep recommending your app to people

huyuan · October 29, 2020, 11:26pm

I was wondering whether the HSK collections had all the words, guess this answers the question. Personally I’d expect a collection of sentences that’s supposed to match HSK levels to have most if not all the words (though I understand there may be limitations with using Tatoeba as the dataset), so “completing” a collection would be more representative of your actual learning progress in relation to the HSK.

As a stopgap measure, I wouldn’t mind if the sentence restrictions were loosened if it gave me a more complete exposure to HSK vocab. It may be not ideal, but difficult words can be ignored or inferred from context, especially if they’re nouns, and we have the translations to compare them to anyway.

(On a side note, I’ve also seen quite a few sentences in the Fast Track collections with more advanced vocabulary than expected (one of them included something like “palm tree” for example, while the clozed word was much more common), I assume this is also due to the difficulty of splitting words.)

Ilraon · January 4, 2021, 6:14am

~Hello and happy new year!~

I am now about two thirds through the HSK5 collection and am starting to worry that I don’t recognize the majority of the words on online vocabulary lists corresponding to that level.
(sidenote: I’ll finish this collection by 15/01 if it’s not extended)

If nothing changes I think I’ll look for alternatives on Anki, because toiling through the Most Common Words collections looks like it would mean a lot of unnecessary repetition.

mike · January 4, 2021, 11:28am

Thanks for the follow up, happy New Year to you as well, and apologies for the delay on this! We should be able to make something work. What’s your preferred list?

Ilraon · January 4, 2021, 3:53pm

Wonderful news!
I suppose you could use those handy .txt files:

Which, although they date back to 2012, are still up to date.

Ilraon · January 9, 2021, 7:20pm

As promised, getting close!..

mike · January 16, 2021, 11:56am

Nice! Just a quick update on this - we’re still working on getting more sentences added. One question we’re considering that we’re curious to get your input on: should there be any difficulty threshold for adding a sentence to a given HSK collection? In other words, if an HSK1 word only appears in a sentence that occurs in the “20,000 Most Common Words” collection (i.e. a rather difficult sentence), is it worth including in the HSK1 collection?

It seems like the consensus in this thread is to include such sentences, but interested to hear if you have any thoughts.

Ilraon · January 18, 2021, 4:01pm

Hello Mike, thanks for the reply!

In my opinion, yes they should be included.
The case you mentioned (an HSK1 cloze sentence having very rare words) is unlikely to happen, because only the HSK5 and HSK6 collections have significant amounts of missing words.
By this stage, I believe the learner is sufficiently advanced that he can handle such difficulties. (Basically because he has a good enough sense of the language to analyze new characters based on his experience).

mike · January 21, 2021, 12:20pm

Ok thanks - that’s helpful. We’re working on getting more sentences added, especially for HSK 4/5/6.

It looks like for HSK 4, Tatoeba doesn’t yet have sentences for

包子, 打印, 打折, 打针, 登机牌, 积累, 京剧, 烤鸭, 矿泉水, 垃圾桶, 凉快, 马虎, 填空, 小吃, 性别, 预习, 占线

HSK 5

岸, 报到, 本科, 本领, 比例, 鞭炮, 标点, 步骤, 操场, 差距, 成分, 诚恳, 吃亏, 尺子, 初级, 闯, 打工, 单元, 发票, 分布, 分配, 辅导, 概括, 干活儿, 格外, 个别, 公布, 乖, 怪不得, 官, 归纳, 国庆节, 海关, 合影, 何必, 何况, 后背, 糊涂, 华裔, 急诊, 纪律, 系领带, 夹子, 嘉宾, 坚决, 健身, 阶段, 结合, 结账, 近代, 经商, 桔子, 军事, 均匀, 拦, 朗读, 劳驾, 厘米, 连忙, 陆续, 名牌, 名胜古迹, 模仿, 目录, 内部, 嫩, 培训, 赔偿, 配合, 盆, 拼音, 频道, 青, 群, 人事, 弱, 色彩, 商务, 生动, 省略, 使劲儿, 梳子, 撕, 随身, 桃, 特征, 提纲, 体会, 体现, 调皮, 透明, 推广, 委屈, 武术, 勿, 吸取, 夏令营, 鲜艳, 斜, 幸亏, 性质, 休闲, 虚心, 叙述, 学历, 要不, 一律, 乙, 油炸, 运用, 阵, 振动, 证件, 执照, 转告, 紫, 自觉

and HSK 6 is especially poorly represented, with 1040 words missing out of a possible 2500.

So! We’ll work on getting the existing HSK 4/5/6 collections updated first, and we’ll see if we can’t get sentences added for the words currently missing on Tatoeba. If you’d like to help contribute to Tatoeba for these words that would of course be helpful. Will post more updates here. Work in progress!

Edit: An additional note - splitting Chinese sentences into words can be a bit tricky/inaccurate sometimes, so it may be that a word above is “missing” because we haven’t identified it in a given sentence accurately, but we’re using the most accurate method we’re aware of at the moment.
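
As an illustration of how a split can hide a word, here is a toy forward-maximum-matching segmenter in Python. This is a textbook technique used only as an example; it is an assumption for the sake of the sketch and not the method we use, and the tiny vocabulary is made up.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest vocab match.

    Falls back to a single character when nothing in `vocab` matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical mini-dictionary; 生命 ("life") is a known word here.
vocab = {"研究", "研究生", "生命", "起源"}

# Intended reading: 研究 / 生命 / 起源 ("research the origin of life").
tokens = forward_max_match("研究生命起源", vocab)
print(tokens)
print("生命" in tokens)
```

Because the greedy match grabs 研究生 first, 生命 never shows up as a token, so a lookup against a word list would report it as missing even though the sentence contains it.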

Ilraon · January 25, 2021, 10:48am

Thank you for the update Mike!
It’s especially helpful of you to have included the lists of ultimately missing words.

Some of the words do indeed exist in Tatoeba and are not found due to incorrect splits. Because of this I prefer adding sentences directly to public collections on Clozemaster. I pick the sentences from Tatoeba and translate them into English if a translation is missing, or I reformulate and simplify already simple sentences found elsewhere.
All this is naturally manual; hopefully I will make few mistakes.

I’ve done the HSK4 list you mentioned already, I’ll get on the HSK5 one slowly in the coming weeks.

Ilraon · January 31, 2021, 3:37pm

Hello!
I finished putting the missing words you pointed out in public collections, manually picking the sentences from Tatoeba or making my own with inspiration from other material (that I never used directly).

Any progress on the rest of the words, the ones that Clozemaster detects but whose sentences don’t respect the 75% rule?

Ilraon · March 2, 2021, 6:43pm

Hi! I’m bringing this up again:

Ilraon · March 25, 2021, 1:34pm

@mike Hello Mike, I hope not to bother you but simultaneously can’t help bumping this, since to my understanding adding the missing words which are already on Tatoeba, but don’t respect the 75% rule, shouldn’t take much of your time (hopefully). If I’m missing some technical aspect, let me know!

ps- had good success with Mandarin recently, I had interesting conversations with two Chinese students on my campus. Thank you for helping toward that accomplishment.

mike · March 27, 2021, 9:06pm

No bother at all - sorry for the slow progress and thanks for the bumps! This is still top of mind, will aim to get at least HSK 5/6 updated within the next few weeks.

Ilraon · April 17, 2021, 5:56pm

@mike bump, kindly bringing this up again with high hopes because I registered for the HSK5 test in roughly two months lol
Have a nice weekend

mike · April 18, 2021, 10:43am

Thanks for the bump! That’s exciting! In general - what do you think would be most useful in prepping for the test? What skill do you think you need the most practice with? And more specific to Clozemaster - what do you think would be an ideal collection for prepping for the test? For example - N sentences for each word on the list? Something else?