Unique cloze words - statistics that would be especially useful to me

alanf_us · December 4, 2020, 11:19pm

Statistics that would be especially useful to me would refer to unique cloze words (see end of post for what I mean by “words”):

How many unique cloze words remain in this category?
How many unique cloze words have I encountered (“played”) in this category?
How many unique cloze words have I mastered in this category?
How many unique cloze words have I encountered (“played”) overall?
How many unique cloze words have I mastered overall?
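
A minimal sketch of how the five counts above could be computed. It assumes each sentence record carries its category, its cloze word, and played/mastered flags. The `Sentence` type and `unique_cloze_stats` function are hypothetical names for illustration, not anything taken from the app, and a word is treated as "mastered" as soon as any sentence containing it is mastered, which is only one possible definition.

```python
from dataclasses import dataclass


# Hypothetical record: one sentence, its cloze word, and the player's state.
@dataclass(frozen=True)
class Sentence:
    category: str
    cloze: str
    played: bool = False
    mastered: bool = False


def unique_cloze_stats(sentences, category=None):
    """Count unique cloze words (case-insensitive word forms).

    With category=None the counts cover every category ("overall").
    Assumption: a word counts as played/mastered if at least one
    sentence using it is played/mastered.
    """
    pool = [s for s in sentences if category is None or s.category == category]
    all_words = {s.cloze.lower() for s in pool}
    played = {s.cloze.lower() for s in pool if s.played}
    mastered = {s.cloze.lower() for s in pool if s.mastered}
    return {
        "total": len(all_words),
        "played": len(played),
        "mastered": len(mastered),
        "remaining": len(all_words - played),
    }


sentences = [
    Sentence("adverbs", "здесь", played=True, mastered=True),
    Sentence("adverbs", "Здесь", played=True),
    Sentence("adverbs", "там", played=True),
    Sentence("adverbs", "куда"),
    Sentence("verbs", "идти", played=True),
]

adverb_stats = unique_cloze_stats(sentences, "adverbs")
overall_stats = unique_cloze_stats(sentences)
```

Here `adverb_stats` answers the three per-category questions and `overall_stats` the two overall ones.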

This information would be more useful to me than how many sentences I’ve played. For instance, the Russian grammatical category “Adverbs of Place and Direction” contains 506 sentences, but only about a dozen unique cloze words (здесь, тут, там, куда, туда, сюда, откуда, оттуда, отсюда, где, and maybe one or two more). After I’ve seen every word five times or so, there’s nothing more to learn. On the sentence/category construction side (meaning Mike and whatever team is working with him), if the number of unique cloze words were used to determine how many sentences should go into a category, the category sizes would make more sense. On the player side, if I knew that I had already seen all the unique words in a category, I could go on to a new category. I would also love it if there were a way to prioritize sentences with cloze words I hadn’t seen before so that they would show up sooner. (Hopefully, these would be genuine words in the language, rather than unassimilated foreign names.)
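
The prioritization idea could look something like the sketch below. It assumes the review queue is a list of `(sentence_text, cloze_word)` pairs and that the set of already-seen cloze words is available; `prioritize_unseen` is a hypothetical name, and nothing here reflects how the app actually schedules sentences.

```python
def prioritize_unseen(queue, seen_words):
    """Move the first sentence for each unseen cloze word to the front.

    queue: list of (sentence_text, cloze_word) pairs (hypothetical shape).
    seen_words: cloze words the player has already encountered.
    The original order is preserved within each group.
    """
    seen = {w.lower() for w in seen_words}
    fresh, rest = [], []
    for text, cloze in queue:
        key = cloze.lower()
        if key in seen:
            rest.append((text, cloze))
        else:
            fresh.append((text, cloze))
            seen.add(key)  # later sentences with this word are no longer new
    return fresh + rest


queue = [
    ("Я здесь.", "здесь"),
    ("Он там.", "там"),
    ("Она тоже здесь.", "здесь"),
    ("Куда ты идёшь?", "Куда"),
    ("Мы там.", "там"),
]
ordered = prioritize_unseen(queue, seen_words={"здесь"})
# The first sentences for "там" and "Куда" now come before everything else.
```

Only one sentence per new word is promoted, so a category with hundreds of sentences but a dozen cloze words would surface all of its vocabulary within the first dozen plays.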

Unique words would also be a good way to determine how much vocabulary I’ve been exposed to overall, if I’m looking for a way to quantify my progress. I’d much rather know that than how many sentences I’ve played or mastered.

As for how I’m defining “words”, it could mean “word forms” (where “walk”, “walks”, and “walking” would count separately) or “stemmed/lemmatized words” (where they would be treated the same). Of the two, stemmed/lemmatized would be more interesting to me, but I suspect much harder to count.
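
To make the two definitions concrete, here is a small sketch that counts the same list both ways. `toy_stem` is a deliberately crude English suffix stripper written only for this illustration; it is an assumption standing in for a real stemmer or lemmatizer, which would be needed for any actual language.

```python
def toy_stem(word):
    """Crude illustrative stemmer: strips a few English suffixes.

    Not a real stemming algorithm; it only exists to show how the
    two counts differ.
    """
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def count_unique(cloze_words):
    """Return (unique word forms, unique stems) for a list of cloze words."""
    forms = {w.lower() for w in cloze_words}
    stems = {toy_stem(w) for w in cloze_words}
    return len(forms), len(stems)


form_count, stem_count = count_unique(["walk", "walks", "walking", "Walk", "talked"])
# form_count == 4 (walk, walks, walking, talked); stem_count == 2 (walk, talk)
```

Counting word forms needs nothing beyond case folding, while the stemmed count is only as good as the stemmer behind it, which is where the extra difficulty comes in.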

LuciusVorenusX · December 5, 2020, 12:23am

I can see some utility in that, especially in some collection types. There are a few issues that may be encountered, though, such as:

Words that are identically spelt, but which have multiple meanings. I don’t know what Russian is like with that, but I know that some languages are worse than others in this respect. {Glares in the general direction of English.}
Context. In Italian I have some collections, both built-in and custom, of a couple of dozen unique words. Nice, short little words. However, those couple of dozen words appear in literally hundreds of my reviews every week. The words themselves I know by heart. It’s the context of the words that matters. Which words? {Deep breath, hissing voice…} Prepositionssssss. Italian is an absolute sod when it comes to the consistency (or the lack thereof) of prepositions. (Just like English, to be honest, though native English speakers don’t notice it as much, just as native Italian speakers won’t until they try to learn English.) Although I have long since mastered the unique words, what I haven’t mastered, and which Just. Won’t. Stick. (so far), is the context in which to use them. I’m aware that for the purposes of communicating nobody in Firenze is going to care if I say “alla” when I should say “sulla” or “nella”, but my objective is to NOT have a native speaker roll their eyes and think “uno straniero…” (Which is why I’ll continue doing them until I get zero errors and have the usages drilled into my memory.) In cases like that, knowing the unique word count won’t necessarily help, though it won’t do any harm either. In collections of “common” nouns or verbs it could well be very useful.

Agreed. In Italian and English, verb roots are relatively regular, if you squint and tilt your head at just the right angle. I don’t know what Russian is like in that regard. German… well, there’s sort of a pattern, maybe? Sometimes? Other languages, I don’t even want to think about.

alanf_us · December 5, 2020, 4:02pm

You make a good point: unique word count is a more useful metric for some collection types than others. But I still think that I would profit from knowing how many unique words I’ve encountered on the whole, even if that statistic doesn’t tell me everything.

I’m aware of stemmers for most European languages and a few Semitic ones. I’m not sure whether they make sense for East Asian languages.