Extremely common words used in the blank

Hi! Clozemaster newbie learning Hindi. Big fan, etc.

The FAQ question on selecting blanks says: “The cloze deletion to test, or the blank in the sentence, is the least common word in the sentence within the 10,000 (or as many as possible) most common words in the language.”

I looked at 40 of the Random Hindi sentences I’ve done so far. In twenty of them, the cloze test was a Hindi form of “too” (which seems to be used all over the place, like “like” in English vernacular) or a form of the verbs be, be able, know, do (like French “faire”, used all the time). Other sentences use “I” or “my” when there are clearly less common words in the sentence.

Is it possible there’s something wrong with the algorithm? How would we check?

Thanks

It depends on which collection(s) you are playing. If you are doing the “Fluency Fast Track”, for instance, you will indeed start from e.g. the 100 most common words, then gradually (but faster than just playing the word collections themselves) progress up to more difficult words. If you play the “Most Common Words” collection in random order, you will similarly be confronted with such words.

However, the idea of Clozemaster is also to learn these clozes in the context of the whole sentence. As such, even though I already knew most of the few thousand most common words, I still started with the Fluency Fast Track from the beginning, and was surprised at how much grammar I managed to subconsciously pick up along the way, even though most of the “clozes” themselves weren’t new to me, and in fact perhaps “too” simple, in and of themselves.

In addition to this, I really appreciate that you can make Clozemaster as easy or difficult as you want. For instance, have you tried switching to “Text Input” (I guess that might be more difficult for Hindi, depending on your input keyboard method used), or to “Listening” or “Speaking”?

Since I still know fewer than 1000 Hindi words, I would have loved to do FFT if it were an option. But I believe Hindi in general is much less popular (maybe not for long!) on Clozemaster, and there doesn’t seem to be an option for Fluency Fast Track or other collections. As such, I’ve been working with the Random collection.

In any case, there are all sorts of interesting words in the sentences I’ve done so far – building, ghost, information, committee, scare. So why does CM’s Random collection keep giving me “is” and “can” in the blanks?

So why does CM’s Random collection keep giving me “is” and “can” in the blanks?

Latin is like that too. There’s just one collection of 20,000 sentences. “Est” (is) is the cloze word for a whopping 932 of them, and “et” (and) for 217.

Hm. That suggests that the Clozemaster software isn’t acting the way the FAQ describes.

Hindi sentences can be tricky to split into words. It may be that the process we use sometimes breaks the sentence into words incorrectly, leading to some of the “words” not being actual words and therefore not appearing on the frequency list, causing the algorithm to select a word that is on the list but is more common.
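The failure mode described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Clozemaster's actual code: `select_cloze` is a hypothetical helper that follows the rule quoted from the FAQ, the frequency ranks are made up, and the `\w+` regex stands in for whatever splitter is really used. It is chosen because Python's `re` module does not treat Devanagari vowel signs as word characters, so it cuts words apart at those marks.

```python
import re

def select_cloze(tokens, ranks, limit=10_000):
    """Return the least common token that is on the frequency list, or None."""
    known = [t for t in tokens if ranks.get(t, limit + 1) <= limit]
    return max(known, key=lambda t: ranks[t]) if known else None

# Made-up ranks for illustration only (1 = most common).
ranks = {"है": 1, "यह": 5, "इमारत": 3000}
sentence = "यह इमारत है"  # "This is a building."

# Splitting on spaces keeps every word intact.
good_tokens = sentence.split()

# Hypothetical faulty splitter: \w skips Devanagari vowel signs,
# so words containing them are broken into fragments.
bad_tokens = re.findall(r"\w+", sentence)

print(select_cloze(good_tokens, ranks))
print(select_cloze(bad_tokens, ranks))
```

With the whitespace split, the rarest listed word (इमारत, "building") is chosen. With the faulty splitter, the fragments of इमारत and है are not on the list, so the only candidate left is the very common यह ("this").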

We’re always looking to use the best natural language processing technology available of course, and if we come up with something better for Hindi we’ll re-clozeify the sentences. In the meantime one option is to report sentences you think should be improved and once we have a moderator for Hindi they’ll be able to update them all. Another option is to copy sentences to a custom collection where you can change the cloze word to any text you select. We may also eventually allow selecting a new cloze word for all collections, curious to hear what you all think of this option.

And actually as I’m looking at Hindi, we should be able to make some significant improvements within the next few months - hopefully improving the cloze word selection like you mentioned, as well as adding a Fast Track and Most Common Word groupings.

Hindi sentences can be tricky to split into words. It may be that the process we use sometimes breaks the sentence into words incorrectly, leading to some of the “words” not being actual words and therefore not appearing on the frequency list, causing the algorithm to select a word that is on the list but is more common.

As a programming geek, I’m curious why splitting the sentence is difficult. That is, words with dashes may be hard, but I’m not sure what else doesn’t work. Presumably all the mushed together letters in Hindi are handled by whatever parser you’re using; you can still find spaces.

Another option is to copy sentences to a custom collection where you can change the cloze word to any text you select.

I’ve started three collections - easy, advanced beginner, and hard. I’m partly doing it to know which sentences I’ll want to get back to in the future. The fact that I picked the right cloze out of four options doesn’t truly mean I understand the words and grammar of a sentence. But I’m also doing it in the hopes that (if I have the stamina to make it through a couple thousand sentences) other Hindi learners could start with the easy stuff. I was intimidated by some of the sentences I got from Random on my first day. While I’m at it, I’m changing a bunch of the cloze words in the collections I copy to. I haven’t been changing the cloze words in the Random collection, though maybe I should.

We may also eventually allow selecting a new cloze word for all collections, curious to hear what you all think of this option.

Do you mean a button to change a sentence in all collections? That’s a neat option.

And actually as I’m looking at Hindi, we should be able to make some significant improvements within the next few months - hopefully improving the cloze word selection like you mentioned, as well as adding a Fast Track and Most Common Word groupings.

That would be amazing! Seeing 9750 sentences that I have to go through in random order is a bit overwhelming. It seems that you could get common words automatically from some combination of the online lists of common words and the words in the sentence collection itself. 10K sentences is enough that a ton of words will appear more than once, right? Some combination of “number of words in the sentence” and “number of words in the sentence weighted by how uncommon they are” has to be a better metric of difficulty than completely random.
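A rough sketch of the metric suggested above, in Python. Everything here is an assumption made for illustration: `build_ranks` and `difficulty` are hypothetical helpers, the scoring formula (one point per word plus the log of its frequency rank) is just one possible way to combine sentence length and word rarity, and the tiny English corpus stands in for the real sentence collection.

```python
import math
from collections import Counter

def build_ranks(sentences):
    """Rank words by frequency within the collection itself (1 = most common)."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def difficulty(sentence, ranks, unknown_rank=10_000):
    """Each word costs 1 (length) plus log(rank) (rarity)."""
    return sum(1 + math.log(ranks.get(w, unknown_rank))
               for w in sentence.split())

# Toy stand-in for the sentence collection.
corpus = [
    "the cat is here",
    "the dog is here",
    "the committee is here",
    "the ghost scared the committee in the building",
]
ranks = build_ranks(corpus)

# Easiest sentences first.
ordered = sorted(corpus, key=lambda s: difficulty(s, ranks))
for s in ordered:
    print(round(difficulty(s, ranks), 2), s)
```

Long sentences full of rare words sort to the end, while short sentences built from frequent words come first. An external frequency list could be merged into `ranks` in the same way.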

I could put this in the suggestions channel, but I wonder whether you could have a 1-5 difficulty rating for each sentence that users could optionally select as they’re playing the sentences. It would only take an extra second per sentence. It looks like Hindi only has a few dozen people actively going through it right now, so you won’t have thousands of ratings per sentence. But even ten ratings is better than nothing, and over time you should have statistically significant ratings. (If you want to be fancy, don’t turn the ratings on until a user has leveled up a few times, since all sentences may look difficult at first.)

Thanks for your response. More broadly, big thanks to “Mike and the Team at Clozemaster” for all the work you’ve done in putting this great resource together!

Yes, big thanks to Mike & Team who regularly communicate with us; this is always appreciated.

(As a reminder, the topic is that “be” and “do” and other such words are the most common Cloze words.)

We’re always looking to use the best natural language processing technology available of course, and if we come up with something better for Hindi we’ll re-clozeify the sentences. In the meantime one option is to report sentences you think should be improved and once we have a moderator for Hindi they’ll be able to update them all. Another option is to copy sentences to a custom collection where you can change the cloze word to any text you select.

I’m up to a couple hundred sentences in my “easy” and “advanced beginner” collections. Soon they might actually be useful! (Reporting half the sentences OTOH seems like too much work.)

And actually as I’m looking at Hindi, we should be able to make some significant improvements within the next few months - hopefully improving the cloze word selection like you mentioned, as well as adding a Fast Track and Most Common Word groupings.

I would still love to see this. It’s been a few months since the post, so I’m hoping there might be an update.

Thanks,

-Amir

Hi Amir, thanks for the nudge on this! We’ve updated the sentences in the random collection to improve the missing word selection - curious to hear what you think!

In the past we’ve typically added the Fast Track collection when it becomes possible, hidden the Random Collection, and migrated progress from the Random Collection to the Fast Track collection. This hasn’t proven to be the best approach though - some progress is usually lost in the process since not all sentences in the Random Collection exist in the Fast Track collection. So at the moment I’m thinking we may start making the Random Collection available even if the Fast Track collection is available, and perhaps add a way to optionally sync progress from one collection to another. Curious to hear if you have any thoughts on this as well. There’s a bit of work to make these changes, but it should be doable.
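To make the progress-loss problem concrete, here is a minimal Python sketch. It is purely illustrative: `sync_progress`, the sentence ids, and the mastery values are all hypothetical and say nothing about how Clozemaster actually stores progress.

```python
def sync_progress(source, target_ids):
    """Copy progress for sentences present in the target; report the rest."""
    synced = {sid: p for sid, p in source.items() if sid in target_ids}
    unmatched = sorted(set(source) - set(target_ids))
    return synced, unmatched

# Hypothetical data: sentence id -> mastery in the Random Collection.
random_progress = {101: 0.5, 102: 1.0, 103: 0.25}
# Hypothetical ids of the sentences that exist in the Fast Track collection.
fast_track_ids = {101, 103, 200}

synced, unmatched = sync_progress(random_progress, fast_track_ids)
print(synced)
print(unmatched)
```

Only sentences that exist in both collections carry their progress over. Anything left in `unmatched` is the progress that would be lost if the Random Collection were hidden, which is the argument for keeping both collections available.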

We’ve updated the sentences in the random collection to improve the missing word selection - curious to hear what you think!

Thanks! The first Cloze word I got was “Tom”, but maybe that was just a bad coincidence. I’ll let you know if things look particularly bad.

In the past we’ve typically added the Fast Track collection when it becomes possible, hidden the Random Collection, and migrated progress from the Random Collection to the Fast Track collection. … some progress is usually lost in the process since not all sentences in the Random Collection exist in the Fast Track collection. So at the moment I’m thinking we may start making the Random Collection available even if the Fast Track collection is available, and perhaps add a way to optionally sync progress from one collection to another.

That’s a good point. And who knows, some people might prefer the Random collection. Like if they’re not beginners, but just looking to get some more native-style language thrown at them? Your plan makes sense, though the optional sync may be more work for you.

I’ll keep curating my easy/medium/hard collections in the hopes that I’m a teeny bit better than the FFT AI.
