HSK levels not representing every word

~Hello and happy new year!~

I am now about two thirds of the way through the HSK5 collection and am starting to worry that I don’t recognize the majority of the words on online vocabulary lists corresponding to that level.
(sidenote : I’ll finish this collection by 15/01 if it’s not extended :upside_down_face:)

If nothing changes I think I’ll look for alternatives on Anki, because toiling through the most common words looks like it will mean a lot of unnecessary repetition.

Thanks for the follow up, happy New Year to you as well, and apologies for the delay on this! We should be able to make something work. What’s your preferred list?

Wonderful news!
I suppose you could use those handy .txt files:

Which, although they date back to 2012, are still up to date.

As promised, getting close!.. :stuck_out_tongue:

Nice! Just a quick update on this - we’re still working on getting more sentences added. One question we’re considering that we’re curious to get your input on: should there be any difficulty threshold for adding a sentence to a given HSK collection? In other words, if an HSK1 word only appears in a sentence that occurs in the “20,000 Most Common Words” collection (that is, a rather difficult sentence), is it worth including in the HSK1 collection?

It seems like the consensus in this thread is to include such sentences, but interested to hear if you have any thoughts.

Hello Mike, thanks for the reply!

In my opinion, yes they should be included.
The case you mentioned (an HSK1 cloze sentence containing very rare words) is unlikely to happen, because only the HSK5 and HSK6 collections have significant numbers of missing words.
By this stage, I believe the learner is sufficiently advanced to handle such difficulties (basically because he has a good enough sense of the language to analyze new characters based on his experience).

Ok thanks - that’s helpful. We’re working on getting more sentences added, especially for HSK 4/5/6.

It looks like for HSK 4, Tatoeba doesn’t yet have sentences for

包子, 打印, 打折, 打针, 登机牌, 积累, 京剧, 烤鸭, 矿泉水, 垃圾桶, 凉快, 马虎, 填空, 小吃, 性别, 预习, 占线

for HSK 5


岸, 报到, 本科, 本领, 比例, 鞭炮, 标点, 步骤, 操场, 差距, 成分, 诚恳, 吃亏, 尺子, 初级, 闯, 打工, 单元, 发票, 分布, 分配, 辅导, 概括, 干活儿, 格外, 个别, 公布, 乖, 怪不得, 官, 归纳, 国庆节, 海关, 合影, 何必, 何况, 后背, 糊涂, 华裔, 急诊, 纪律, 系领带, 夹子, 嘉宾, 坚决, 健身, 阶段, 结合, 结账, 近代, 经商, 桔子, 军事, 均匀, 拦, 朗读, 劳驾, 厘米, 连忙, 陆续, 名牌, 名胜古迹, 模仿, 目录, 内部, 嫩, 培训, 赔偿, 配合, 盆, 拼音, 频道, 青, 群, 人事, 弱, 色彩, 商务, 生动, 省略, 使劲儿, 梳子, 撕, 随身, 桃, 特征, 提纲, 体会, 体现, 调皮, 透明, 推广, 委屈, 武术, 勿, 吸取, 夏令营, 鲜艳, 斜, 幸亏, 性质, 休闲, 虚心, 叙述, 学历, 要不, 一律, 乙, 油炸, 运用, 阵, 振动, 证件, 执照, 转告, 紫, 自觉

and HSK 6 is especially poorly represented, with 1040 words missing out of a possible 2500.

So! We’ll work on getting the existing HSK 4/5/6 collections updated first, and we’ll see if we can’t get sentences added for the words currently missing on Tatoeba. If you’d like to help contribute to Tatoeba for these words that would of course be helpful. Will post more updates here. Work in progress! :slight_smile:

Edit: An additional note - splitting Chinese sentences into words can be a bit tricky/inaccurate sometimes, so it may be that a word above is “missing” because we haven’t identified it in a given sentence accurately, but we’re using the most accurate method we’re aware of at the moment.
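
To illustrate how a segmentation choice can make a word look “missing”, here is a minimal Python sketch. It assumes a toy dictionary and a greedy forward maximum-matching segmenter; this is not necessarily the method Clozemaster uses, and the names `segment`, `missing_words` and `toy_vocab` are hypothetical.

```python
def segment(sentence, vocab, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word; fall back to a single character if nothing matches."""
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def missing_words(targets, sentences, vocab):
    """Return the target words that never appear as a segmented token."""
    found = set()
    for s in sentences:
        found.update(segment(s, vocab))
    return [w for w in targets if w not in found]


# Toy dictionary and sentence, chosen only to show the effect (assumption).
toy_vocab = {"研究", "研究生", "生命", "起源"}
sentence = "他研究生命的起源"  # "He studies the origin of life"

tokens = segment(sentence, toy_vocab)
# The greedy matcher grabs 研究生 first, so 生命 is never seen as a token.
not_found = missing_words(["生命", "起源"], [sentence], toy_vocab)
```

Here the sentence clearly contains 生命, but because the segmenter splits it as 研究生 / 命, a check based on tokens reports the word as missing.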


Thank you for the update Mike!
It’s especially helpful of you to have included the lists of ultimately missing words.

Some of the words do exist in Tatoeba and are not found because of incorrect splits. For this reason I prefer adding sentences directly to public collections on Clozemaster. I pick the sentences from Tatoeba and translate them into English if a translation is missing, or I reformulate and simplify already simple sentences found elsewhere.
All of this is naturally manual; hopefully I will make few mistakes.

I’ve already done the HSK4 list you mentioned, and I’ll slowly work on the HSK5 one in the coming weeks.

I have finished putting the missing words you pointed out into public collections, manually picking the sentences from Tatoeba or writing my own with inspiration from other material (which I never used directly).

Any progress on the rest of the words, the ones that Clozemaster detects but whose sentences don’t respect the 75% rule?

Hi! I’m bringing this up again:

@mike Hello Mike, I hope not to bother you, but I can’t help bumping this, since to my understanding adding the missing words which are already on Tatoeba, but don’t respect the 75% rule, shouldn’t take much of your time (hopefully). If I’m missing some technical aspect, let me know!

PS: I had good success with Mandarin recently; I had interesting conversations with two Chinese students on my campus. Thank you for helping me toward that accomplishment.

No bother at all - sorry for the slow progress and thanks for the bumps! This is still top of mind, will aim to get at least HSK 5/6 updated within the next few weeks.


@mike bump, kindly bringing this up again with high hopes because I registered for the HSK5 test in roughly two months lol
Have a nice weekend

Thanks for the bump! That’s exciting! In general - what do you think would be most useful in prepping for the test? What skill do you think you need the most practice with? And more specific to Clozemaster - what do you think would be an ideal collection for prepping for the test? For example - N sentences for each word on the list? Something else?

@mike As I mentioned in the thread “Improve quality of Traditional Chinese”, I have done some work to improve word segmentation of Chinese. Although I have mainly focused on Traditional Chinese, these results can also be used for Simplified Chinese. You can see the results in my online dictionary, which I use to help me learn vocabulary.

For Traditional Chinese I have created a few custom collections, such as TOCFL6 and TOCFL6 missing words (TOCFL is the Taiwanese version of the HSK). You can see that for TOCFL6, more than 50% of the words do not have a corresponding sentence in the Tatoeba corpus. I expect the results for HSK to be similar.

If you’re interested, I can help you improve the Chinese tracks of Clozemaster. Perhaps this file, which contains the full list of Tatoeba sentences (extracted some time ago), segmented in both Traditional and Simplified characters, could be useful.

Thanks a lot Mike, I see you’ve updated the HSK5 collection. Much much appreciated! I’m going to try it out a bit and let you know if the new imports went smoothly. It seems to me as per a quick browse through the new sentences that there might have been lots of imports for clozewords that were already included in the initial 2000 sentences. However, it also seems all the words that were missing up to now have been added :slight_smile: yay!

I haven’t taken an HSK test yet, but I’ve been following the HSK4 course of an official institute (Confucius Institute). In my experience, doing the Clozemaster collections up to HSK4 was extremely helpful in understanding the course. Other skills I’ve had to develop were being more spontaneous orally, being able to read somewhat longer unfamiliar texts, and a couple of grammar points. I did so both by doing language tandems and by paying attention to the teacher’s explanations. There are free online resources for those grammar points as well, but I believe they are protected and couldn’t be used by Clozemaster.

Once I’m done with HSK5, in about a month and a half, I will come back to this topic and will hopefully give a more informed and thought-through answer to your question.

Have a nice day


@Ilraon, I saw on the “Learning Mandarin Chinese” thread that you only started learning simplified Mandarin in Nov 2019. I’m really impressed that you’re taking the HSK5 after such a short time.

I don’t know if I’ll stick with it long enough to get that far, but if I do, I’ll be glad that you prompted improvements to the HSK collections! I’m taking the HSK1 next month just for fun and am currently partway through the HSK2 collection.

Thank you! It’s indeed a fun detail. All in all, having fun or finding something interesting along the way matters more than speed in every case! So I don’t believe my case is necessarily more impressive than that of someone who learnt over ten years.

Hello, I was supposed to take the HSK5 today, but because I messed up my PCR test certificates, I can’t sit it.
I do have a couple of remarks on the update:

  • Some sentences are pretty long. For instance, a dozen or so sentences are excerpts from speeches by presidents of the USA. Listening to them sometimes takes about a minute each.

  • Some are duplicates; rarely with exactly the same clozeword, most often with a different clozeword in the sentence, which is still interesting.

  • Some words which were listed “not in tatoeba” in your other posts, actually appear in the update

  • a few sentences use traditional instead of simplified characters, as was already noted in other places on Clozemaster, which messes up the audio (usually those characters are skipped by the TTS altogether). This issue is really perplexing to me because the conversion from traditional to simplified characters doesn’t appear to be difficult, so I’m wondering why those mistakes exist in the first place.

  • some sentences have no pinyin for some reason

  • most sentences have a really well adapted level (they use multiple hsk5 words and are appropriate to prepare it)

  • As the update is only to HSK5, there are still potentially some missing words in HSK4, but I believe not too many. (One instance, to my knowledge, is 符合). I might check all the words of HSK4 to make sure.
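
On the traditional-character point in the list above, a rough Python sketch of how such characters could be flagged and converted is shown here. It assumes a tiny hand-written mapping (`TRAD_TO_SIMP`, hypothetical and far from complete); a real tool would need a full conversion table and some care with the few characters whose simplified form depends on context.

```python
# Hypothetical, deliberately tiny traditional -> simplified table (assumption:
# a real converter would load a complete table instead).
TRAD_TO_SIMP = {"學": "学", "國": "国", "語": "语", "說": "说"}
_TABLE = str.maketrans(TRAD_TO_SIMP)


def find_traditional(sentence):
    """Return the characters of the sentence that are known traditional forms."""
    return [ch for ch in sentence if ch in TRAD_TO_SIMP]


def to_simplified(sentence):
    """Replace every known traditional character with its simplified form."""
    return sentence.translate(_TABLE)


mixed = "我學中文"                      # one traditional character slipped in
flagged = find_traditional(mixed)      # characters that the TTS might skip
cleaned = to_simplified(mixed)
```

Running a check like `find_traditional` over a collection before import would at least list the sentences worth reviewing by hand.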

In terms of how well that prepares for the HSK5: even though I didn’t take it, I did practice for it, and my weakest areas were reading and writing. This is logical considering I always work in audio mode. There is another reason, though: HSK5 reading actually uses vocabulary beyond the scope of HSK5. I think they expect you to have a full command of the HSK5 vocabulary and to understand a bit more. Besides, Clozemaster doesn’t fully train you to write; for this it’s still better to exchange with real people who will correct you. My grammar was also not quite up to the level.
All in all, I’d say having fully mastered Clozemaster’s HSK5 collection gives a good chance of passing the test, but only if one pays attention to the overall context of the sentences, and not just the clozewords. The best approach would probably be to follow an HSK5 course in parallel with self-study on Clozemaster.

/edit : one such example

So sorry to hear you didn’t get to take the test!