Frequency by root and/or frequency by surface (inflected) form?

alanf_us · February 18, 2020, 11:13pm

I am wondering whether you “stem” words (condense inflected forms into a single lemma/base form) or treat all inflected forms separately when you calculate the frequency used to select sentences.

Why do I care? Because I’d like to see more distinct roots, rather than more inflected forms. For instance, in English, “walk”, “walks”, “walking”, and “walked” are all common words, and if each one were treated separately, they could drive out other words. I’d rather see a greater variety of verbs, so I would hope that all the “walk*” words get put into one bin (though when the sentences within that bin are chosen, I would hope that one form doesn’t drive out all the others).
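To make the difference concrete, here is a minimal sketch of the two ways of counting. The `LEMMAS` table and the helper names are made up for illustration; I don't know how the site actually computes its frequencies, and a real system would use a proper lemmatizer or dictionary rather than a hand-written table.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption for this example only).
LEMMAS = {
    "walks": "walk",
    "walking": "walk",
    "walked": "walk",
    "runs": "run",
}


def lemma_of(word):
    """Return the base form if we know one, else the word itself."""
    return LEMMAS.get(word, word)


def surface_counts(tokens):
    """Count every inflected form as its own entry."""
    return Counter(tokens)


def lemma_counts(tokens):
    """Count all inflected forms under their shared base form."""
    return Counter(lemma_of(t) for t in tokens)


tokens = "walk walks walking walked run runs swim".split()
by_surface = surface_counts(tokens)
by_lemma = lemma_counts(tokens)

print(len(by_surface), "surface forms;", len(by_lemma), "lemmas")
print(by_lemma.most_common())
```

With surface counting, the four “walk*” forms take up four separate slots in the list; with lemma counting they share one slot, which leaves room for other verbs.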

Your description does mention that you use frequency lists from Wiktionary, but I think those lists might be compiled in different ways, so some may count base forms and others inflected forms.

kadrian · February 19, 2020, 8:56am

Unfortunately, I think all forms are treated separately.

In French, the 20,000-50,000 and >50,000 most-common-word categories, which should contain relatively rare words, are full of very easy-to-produce words like “teachers” and “umbrellas”. Simply being plural put them in what I had hoped would be a more difficult category. It seemed to me that most of the words in those categories were just different forms of the same verbs, and there was much less variety of new nouns than I had hoped.