Tagalog cloze selection is broken

Cloze selection for the Tagalog language is broken.
From the FAQ: “The cloze deletion to test, or the blank in the sentence, is the least common word in the sentence within the 10,000 (or as many as possible) most common words in the language.”

Yet a very large proportion of Tagalog sentences have extremely common linker words like “ng”, “ang”, and “sa” chosen as the cloze - the worst possible selection since they are very obvious.

For example, I just went through 5 sentences, and 4 of them selected one of these words as the cloze:

  1. Kung minsan nawawalan tayo ng pasensiya para {{sa}} matatanda.
  2. {{Ang}} tatay ko ay nasa ospital ngayon.
  3. Isa lang ang sinehan {{sa}} bayan.
  4. Nakarating ka na ba {{sa}} Paris?

This is really broken, and clearly a frequency list has not been used. I strongly request that a frequency list be generated and used for this language. At the very least, a manual list to exclude linker words (and most words of 3 characters or fewer) would greatly enhance the utility of Clozemaster for learning Tagalog.
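
To make the request concrete, here is a minimal sketch of the rule quoted from the FAQ, combined with the exclusion list and length filter suggested above. It is not Clozemaster's actual code. The names `RANKS`, `EXCLUDED`, and `pick_cloze` are made up for illustration, and the rank numbers are hypothetical placeholders rather than real corpus data.

```python
import re

# Hypothetical frequency ranks (1 = most common). Real ranks would come from a corpus list.
RANKS = {
    "sa": 1, "ng": 2, "ang": 4, "na": 6, "ka": 10, "ba": 50,
    "nakarating": 3500, "paris": 9000,
}

# Hypothetical manual exclusion list of very common markers and linkers.
EXCLUDED = {"ng", "ang", "sa", "ay", "ni", "na", "ka", "mo", "ba"}


def tokenize(sentence):
    """Split a sentence into runs of letters, dropping punctuation and digits."""
    return re.findall(r"[^\W\d_]+", sentence)


def pick_cloze(sentence, ranks, excluded=EXCLUDED, top_n=10_000, min_len=4):
    """Return the least common word that is still inside the top_n list.

    Excluded and very short words are only used as a fallback when
    nothing else in the sentence qualifies.
    """
    words = tokenize(sentence)
    # Keep only words that appear in the frequency list within the top_n ranks.
    in_list = [w for w in words if ranks.get(w.lower(), top_n + 1) <= top_n]
    # Prefer words that are not on the exclusion list and are long enough.
    preferred = [w for w in in_list if w.lower() not in excluded and len(w) >= min_len]
    candidates = preferred or in_list
    if not candidates:
        return None
    # The highest rank number is the least common word.
    return max(candidates, key=lambda w: ranks[w.lower()])


print(pick_cloze("Nakarating ka na ba sa Paris?", RANKS))
```

With these placeholder ranks, the fourth example sentence above would get "Paris" as its cloze instead of "sa". The fallback to short words is a design choice so that sentences like "Sa tingin mo?" still get some cloze rather than none.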


Here are some more examples.
These are from a single session (of 10 sentences):

  • {{Ang}} tanghalian ay nasa mesa.
  • Mas malamig ngayong umaga kaysa {{sa}} kahapon.
  • Sa tingin {{mo}}?
  • Tumawag ka {{ng}} pulis!
  • Nakikinig ako {{sa}} musika.

As you can see, half of the 10 used terrible cloze selections.

These are some more examples, randomly from a few other sessions:

  • Ito’{{y}} isang ospital.
  • Pinahiram ko {{ang}} aking amerikana sa isang kaibigan ng kapatid ko.
  • Kumain {{ka}} ng gulay!
  • Sinauli mo na {{ba}} ang aklat sa aklatan?
  • Lagi na lang siyang humihiram {{ng}} pera sa akin.
  • Marunong ka bang magbilang {{sa}} Italian?
  • Iniiwasan ka {{ni}} Tom.
  • May naniniwala na may parte sa utak at ito {{ay}} responsable para sa mga insulto at ito ay mas aktibo sa ibang tao. Ito ay sakit at kung minsan ay endemik sa buong rasa.

That last sentence is a perfect example.


I am an absolute beginner in Tagalog, but I can tell the Clozemaster team that the Tagalog course is agonizing. “Ang” and “sa” in Tagalog are similar to “the” in English. It seems that the Tagalog course doesn’t pick cloze-words based on word frequency as other popular courses do. Cloze-words in Tagalog are probably chosen at random because the only word frequency list that Wiktionary links to covers just the top 2K words. So I guess the Tagalog course simply ignores the top 2K list.

Do you think that the word frequency list from the University of Leipzig Corpora Collection (LCC) would be a good alternative for Tagalog? It’s downloadable here.

In my primary target language, Indonesian, the Clozemaster course uses the top 50K word frequency list based on OpenSubtitles texts (OS) because it is the only one that Wiktionary lists. But the OS ranking is quite skewed, and I find that LCC’s ranking makes much more sense for Indonesian. LCC crawls a massive number of real-world texts, both formal and casual. The OS ranking, on the other hand, is based on subtitles of (mostly) Western movies and TV programs. OS overrates words frequently used in fictional stories and underrates words tied to local culture. I presume the same is true for Tagalog.

Please note that LCC has one disadvantage: it’s case sensitive. This means that “you”, “You” and “YOU” are treated as three independent items. I contacted the admin of LCC last year, and he confirmed that this is not a bug; LCC is case sensitive on purpose. So the Clozemaster team would need to merge these duplicated items before using the LCC ranking in the cloze-word algorithm.
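
The clean-up could be as simple as lowercasing every entry and summing the counts. Here is a rough sketch, assuming the list has already been read into (word, count) pairs. The function name `merge_case_variants` and the sample counts are made up for illustration.

```python
from collections import Counter


def merge_case_variants(entries):
    """Fold case variants into one lowercase entry and sum their counts."""
    merged = Counter()
    for word, count in entries:
        merged[word.lower()] += count
    # most_common() returns (word, count) pairs sorted by count, highest first.
    return merged.most_common()


# Made-up counts, only to show the merge.
sample = [("sa", 100), ("Sa", 20), ("ang", 90), ("Ang", 40), ("ANG", 1)]
print(merge_case_variants(sample))  # [('ang', 131), ('sa', 120)]
```

One caveat: plain lowercasing also merges proper nouns with ordinary words that share the same spelling, so the merged ranking is an approximation.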


@zeiphon @MsFixer thank you both for letting us know! We’ll work on a fix and will keep you posted.


Thank you @mike as always for taking our questions and inquiries!

The Tagalog course currently offers the Random Collection (RC) only. But I would like you to offer the Most Common Words Collections (MCWC) as an add-on by using LCC’s word frequency list, for the following two reasons:

First, changing {{cloze-words}} across the entire course is a major update. Some of the current RC players may get confused and even be disappointed by losing their progress. Rather than fixing the current RC, it’s better to add the MCWC. Let each player choose between two versions.

Second, it’s effective and easy to implement the LCC list. I quickly compared LCC with the OS 2018 version and also with the list recommended by Wiktionary (top 2K by Tagalog.com). The top 2K list is so short that it won’t cover all of the 8K+ sentences that Clozemaster currently offers via the RC. OS 2018 is not practical because the most frequent word {{ang}} has only 4,035 occurrences in the Tagalog data set. By comparison, {{aku}} appears 2,051,910 times in the OS Indonesian 2018 list. OS Tagalog is too small for meaningful stats. {{sa}} is the most frequent word in LCC Tagalog with 1,120,694 occurrences, and {{ang}} ranks 4th in LCC Tagalog with 887,678 occurrences.

LCC and OS lists are basically in the same data format, with only two minor differences: 1) case sensitivity and duplicated tokens (as I explained above); and 2) the entries with IDs 1 to 100 should be ignored because these are reserved for technical uses across all languages of LCC lists.
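
For the second point, a loader only needs to skip the reserved IDs. The sketch below assumes that each line of the LCC words file is tab-separated as ID, word, count with no header row; please check that against the actual download. The function name `load_lcc_words` is hypothetical. In the sample, the two Tagalog counts are the ones quoted above, while the IDs and the reserved rows are invented.

```python
import io


def load_lcc_words(handle, reserved_max_id=100):
    """Read (word, count) pairs, skipping IDs reserved for technical use.

    Assumes tab-separated lines: ID, word, count (an assumption, not verified).
    """
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word_id, word, count = int(fields[0]), fields[1], int(fields[2])
        if word_id <= reserved_max_id:
            continue  # IDs 1 to 100 are reserved, per the note above
        rows.append((word, count))
    return rows


# Illustrative sample: IDs and the first two rows are invented.
SAMPLE = "1\t!\t500\n100\t%\t7\n101\tsa\t1120694\n104\tang\t887678\n"
print(load_lcc_words(io.StringIO(SAMPLE)))
```

For a real file, the same function could be called on `open(path, encoding="utf-8")` instead of the in-memory sample.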

Hope this helps and isn’t too overwhelming!

I definitely like the idea of having a ‘most common words collection’ but I also think the current collection really does need fixing because it’s quite broken in its current state.

This is totally up to the Clozemaster team, but I personally think that it’s important to minimize additional efforts and backlash.

First, I guess it’s easier to develop a completely new collection from scratch than to update the existing one. CM has added many courses in the past. MCWC for Tagalog doesn’t require any new procedure.

Also, I have observed that many “conservative” Duolingo users frequently complain on sites such as Reddit about major updates, even ones that are improvements. They ask Duo to roll back to the older version simply because they don’t want to lose their progress or be bothered in the middle of their learning journey. Their harsh voices have a negative impact on branding.

This is not limited to language learning apps. Many people hate major Windows updates and want to stay on the current, familiar version as long as possible.

Alternatively, Clozemaster could completely update the existing RC based on the LCC list (or any other) while keeping the learning progress of each user, and then let each user choose whether to keep their current progress or push the “reset” button to redo from scratch.

But if you and other “progressive” users are okay with redoing from scratch, you can simply switch to MCWC. RC and MCWC use the same sentence data set. You don’t need both. So it looks to me like adding MCWC and leaving RC as it is would be a simple win-win solution for everyone.