Feature request: aggregate user data for clozes

I feel like it might be a good idea to keep track of the percentage of successful solves, across all users, for a given cloze. This would give good insight into which clozes are easier and which are harder, as well as what the most common errors for a given cloze are. Then we could have collections based on difficulty, or the site could offer some of the more common errors as alternative answer choices to make the clozes harder.

To be honest, I was very surprised to hear that. My success rate is probably around 98%, so I don’t need to know how many other users make a mistake on each cloze word. I’m not showing off; I’m actually very bad at memorizing new words. I start in text input mode, and then switch to multiple choice mode if I get stuck. Multiple choice mode is quite easy for me, and probably for other players in my target language too.

Besides, such stats wouldn’t motivate me to study harder. They might even be a distraction, and I would just turn them off if a personal setting allowed it.

@MsFixer I think you’re misunderstanding what I’m saying. My point isn’t that this aggregate data should be presented to the player (I agree that this is not very useful), but rather that it might be used to algorithmically generate better answer choices in multiple choice mode, or to create better collections of clozes.

So if, for instance, people often incorrectly answer “goes” instead of “go”, then perhaps the Clozemaster algorithm could more often make “goes” one of the other answer choices whenever the correct answer is “go”, and/or vice versa.

Another possibility is that we could have cloze collections consisting of only the clozes that people tend to get wrong more often.

One problem with Clozemaster is that I often run into a scenario where, in multiple choice mode, I can see that the word I need to fill in is a verb, but of the four choices, one or two are proper names and one or two are plural nouns. I end up answering by process of elimination without actually learning the meaning of the verb. At the absolute least, if the Clozemaster algorithm kept track of which words are most often confused with one another, this suggestion could mitigate that.
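To make the idea concrete, here is a rough sketch of what I mean (Python; the confusion-count structure and function names are placeholders I made up, not anything Clozemaster actually has):

```python
from collections import Counter

# Hypothetical aggregate data: confusion_counts[answer][guess] = how many
# times users picked `guess` when the correct cloze word was `answer`.
confusion_counts: dict[str, Counter] = {
    "go": Counter({"goes": 41, "went": 17, "gone": 9}),
}

def pick_distractors(answer: str, pool: list[str], k: int = 3) -> list[str]:
    """Prefer distractors that users actually confuse with the answer."""
    confused = confusion_counts.get(answer, Counter())
    ranked = sorted(set(pool) - {answer}, key=lambda w: confused[w], reverse=True)
    return ranked[:k]  # pad with random picks from the pool if this comes up short

print(pick_distractors("go", ["goes", "went", "gone", "Paris", "cats"]))
# -> ['goes', 'went', 'gone']
```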

@edderiofer
Sorry, but I don’t think your idea is practical. Suppose the correct-answer rate for Cloze A is 97.5% while that for Cloze B is 98.5%. The algorithm you propose would need to sort these two words into easy/hard groups based solely on such a subtle gap.

If the average correct-answer rate were lower than 60%, though, it might work.
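A back-of-envelope calculation shows just how subtle that gap is; with the rough criterion below (mine, not a formal power analysis), you would need thousands of recorded answers per cloze before the two rates are reliably distinguishable:

```python
import math

def answers_needed(p1: float, p2: float, z: float = 1.96) -> int:
    """Answers per cloze needed so a 95% confidence interval on each
    success rate is narrower than half the gap between the two rates,
    a rough criterion for telling the clozes apart reliably."""
    half_gap = abs(p1 - p2) / 2
    worst_variance = max(p * (1 - p) for p in (p1, p2))
    return math.ceil((z / half_gap) ** 2 * worst_variance)

print(answers_needed(0.975, 0.985))  # -> 3746 answers per cloze
```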

One of the most critical problems Clozemaster should tackle is not the algorithm but the word frequency lists. The lists Clozemaster refers to are curated by Wiktionary, and the multiple-choice options are picked from them at random. Even if a cloze word is a verb, the system sometimes gives a noun and an adjective as the other choices, which makes multiple choice mode less effective. The lists should tag each word with its word class (or other grammatical information), and also eliminate proper nouns and typos.
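To be clear, the filtering itself would be trivial once such tags existed; something like the sketch below (with made-up Indonesian entries). Producing reliable tags is the real work:

```python
# Hypothetical POS-tagged frequency list: (word, word_class, frequency_rank).
tagged_list = [
    ("makan", "verb", 12), ("minum", "verb", 34),
    ("rumah", "noun", 8), ("Jakarta", "proper-noun", 5),
]

def candidates(cloze_word_class: str) -> list[str]:
    """Distractor candidates sharing the cloze word's class; proper nouns are dropped."""
    return [word for word, word_class, _ in tagged_list
            if word_class == cloze_word_class and word_class != "proper-noun"]

print(candidates("verb"))  # -> ['makan', 'minum']
```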

In other words, what we need is not a tech-savvy guy (a data-mining guy) but linguistic experts in each language. As far as I know, Clozemaster hires error-report handlers to improve the quality of sentences. Once that’s done, they could work on improving the word frequency lists.

FYI: 70% of the top 50K words on the Indonesian word frequency list should be removed: they are either non-Indonesian words, typos, proper nouns, or duplicates caused by mishandled conjugations. The list is significantly diluted. I also took a look at the Japanese list, and it often fails at handling conjugations.

Very fair point; I suppose the word-class problem needs to be fixed first (to bring the correct-answer rate down) before my suggestion (difficulty-based cloze collections and common-error answer choices) would be practical.

Thankfully, Wiktionary already lists word classes in the definitions in its entries (though not in its word frequency lists), so it would seem this is the data-mining guy’s job again.
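For instance, Wiktionary exposes a REST definition endpoint whose entries carry a partOfSpeech field; here is a quick lookup sketch (the exact response shape may vary by entry, so treat this as an assumption to verify, not a finished tool):

```python
import requests

def wiktionary_word_classes(word: str, language: str = "Esperanto") -> list[str]:
    """Word classes the English Wiktionary lists for `word` in `language`,
    or [] if there is no entry, in which case the caller can fall back
    to the current random-choice behaviour."""
    url = f"https://en.wiktionary.org/api/rest_v1/page/definition/{word}"
    resp = requests.get(url, headers={"User-Agent": "cloze-wordclass-check/0.1"})
    if resp.status_code != 200:
        return []
    return [entry["partOfSpeech"]
            for entries in resp.json().values()
            for entry in entries
            if entry.get("language") == language]

print(wiktionary_word_classes("iri"))  # e.g. ['Verb'] if the entry exists
```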

FYI: 70% of the top 50K words on the Indonesian word frequency list should be removed: they are either non-Indonesian words, typos, proper nouns, or duplicates caused by mishandled conjugations.

Ouch. Hope we can get some response from the Clozemaster team.

@edderiofer
I like your optimism and enthusiasm, but I would recommend that you download the word frequency lists in your target languages and quickly figure out whether your proposed approach works. At the very least, it won’t work for the Indonesian list.

As I said, approx. 70% of the top 50K Indonesian words are “improper” and should be removed. This means that 70% of automatic match-ups return an “error” when you check the word frequency list against Wiktionary entries, because those words don’t exist in Wiktionary. Moreover, half of the “proper” 30% have not been registered in Wiktionary yet.

A data-mining guy with no knowledge of Indonesian cannot sort out which words are 1) proper and registered in Wiktionary (15%), 2) proper but not yet registered in Wiktionary (15%), and 3) improper (70%). After this initial filtering, he would need to further categorize Groups 1 and 2 based on conjugation rules and word families, which are totally different in each language.
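To spell out why automation alone falls short: a script can run the Wiktionary check, but groups 2 and 3 both fail that check, so separating them is unavoidably human work. A sketch with hypothetical helper names:

```python
def bucket_words(frequency_list: list[str], in_wiktionary, looks_proper) -> dict:
    """Split a frequency list into the three groups described above.
    `in_wiktionary(word)` can be automated (e.g. an API lookup);
    `looks_proper(word)` is the human judgement a script cannot make."""
    groups = {"proper_registered": [], "proper_unregistered": [], "improper": []}
    for word in frequency_list:
        if not looks_proper(word):        # requires a speaker of the language
            groups["improper"].append(word)
        elif in_wiktionary(word):         # automatable
            groups["proper_registered"].append(word)
        else:                             # fails the same check as an improper word
            groups["proper_unregistered"].append(word)
    return groups
```

Without `looks_proper`, the last two branches collapse into one bucket: a missing Wiktionary entry looks exactly the same for a genuine word as for a typo.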

Some regular forum participants discussed a similar issue (i.e. handling conjugations and lemmatization) four months ago.

Why am I so confident? I have actually checked 65% of the top 100K words in the Indonesian frequency list; it has taken me 15 months. Once it’s done, I will release the clean, lemmatized data set publicly. In short, I am not trying to be a nagging critic, but to manage your expectations and deliver a possible solution to Clozemaster.

@mike Please contact me by email if you are interested in my filtered and lemmatized data set for the Indonesian frequency list. I’m happy to give it to Clozemaster for free, with some conditions. My data set draws not only on the word frequency list recommended by Wiktionary (i.e. texts sourced from OpenSubtitles), but also on two other Indonesian corpora.

Not knowing any Indonesian, I’ll take your word that the problem is that bad.

I would recommend that you download the word frequency lists in your target languages and quickly figure out whether your proposed approach works

Well, we instantly run into a problem: Toki Pona, being a minor conlang, doesn’t even have a word frequency list on Wiktionary. :stuck_out_tongue:

In seriousness, though, I’ve just looked at the “10,000 most common words from Esperanto Wikipedia” wordlist, and while it’s nowhere near as bad as you say the Indonesian wordlist is, I do notice that some words are likely overrepresented (e.g. “Hungario”, “loĝantojn”). However, I’ve not noticed either of these words as answer choices so far, so I can only assume either that the Esperanto wordlist used by Clozemaster isn’t this list, or that I’m not paying close enough attention. Either way, it’s interesting to note that that list is at least ten years out of date.

In any case, it still doesn’t seem like too bad an idea to cross-reference Wiktionary for word classes where possible (and, where Wiktionary doesn’t list the word, default to how it currently works); this would at least improve things for the languages with better-edited Wiktionary coverage.

@edderiofer
I guess you haven’t read the previous discussion from four months ago that I linked above… You missed many technical points that you need to take into account when you look at the word frequency lists. Once you read that discussion, you will probably never again say that cross-referencing Wiktionary can be done by a data-mining guy with no knowledge of each language.

I’m not the only one who pointed out this challenge there. Italian has the same conjugation issue, according to LuciusVorenusX. According to alanf, the Hebrew Wiktionary is not so good. I also pointed out that Chinese (including Cantonese, seemingly one of your target languages) has the additional challenge of word splitting on top of lemmatization, as do Japanese and Korean, which don’t separate words with spaces.
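For readers who haven’t hit the word-splitting issue: Chinese text has no spaces, so you need a segmenter before you can even count “words”. A quick illustration using jieba, one common open-source segmenter (any comparable tool would do):

```python
import jieba  # open-source Chinese word segmenter; pip install jieba

sentence = "我爱北京天安门"  # "I love Beijing's Tiananmen"
print(list(sentence))        # naive per-character split: ['我', '爱', '北', '京', '天', '安', '门']
print(jieba.lcut(sentence))  # segmented into words: ['我', '爱', '北京', '天安门']
```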

I have more counterarguments, but I believe the discussion linked above is enough to temper your expectations.

This is the last thing I would like to say: I support Clozemaster’s basic business model of offering the service at a reasonable price by fully leveraging free external resources. What you propose has not been done by anyone else and released for free use. If Clozemaster tried to build it in-house, we might all have to pay more.