A frequency word list of lexemes/lemmas?

Somepony · November 16, 2022, 5:05pm

I.e., counting all conjugated forms as one word instead of counting each form separately (e.g. כתבתי and כותב count as one word, כתב/לכתוב).

Unfortunately I can only find lists of the latter kind (each form is its own word), and they’re not really useful for me (it’s not for cramming words; I want to make a list of irregularities, and using the entire dictionary is overkill). The best I can do is get the lemma of each form (I can script in Python, but I don’t know if there are tools/libraries to do that automatically) and then remove duplicates. But the result will be skewed: e.g. nouns have fewer forms than verbs, so a noun’s occurrences are split across fewer entries, each entry ranks higher, and nouns get an advantage.
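For what it’s worth, here is a rough Python sketch of that lemmatize-then-merge idea. It assumes the form-level list comes with occurrence counts, in which case summing the counts per lemma (rather than just dropping duplicates) should avoid the noun/verb skew. The `FORM_TO_LEMMA` table and the counts are made up for illustration; a real form-to-lemma lookup would have to come from a morphological analyzer or dictionary that I don’t have. It also ignores forms that could belong to more than one lemma.

```python
from collections import Counter

# Hypothetical form -> lemma lookup, hand-written for illustration only.
FORM_TO_LEMMA = {
    "כתבתי": "כתב",
    "כותב": "כתב",
    "ספר": "ספר",
    "ספרים": "ספר",
}


def lemma_frequencies(form_counts, form_to_lemma):
    """Sum per-form occurrence counts into per-lemma counts."""
    totals = Counter()
    for form, count in form_counts.items():
        # Unknown forms are kept as their own lemma instead of being dropped.
        lemma = form_to_lemma.get(form, form)
        totals[lemma] += count
    return totals


# Made-up counts standing in for a real form-level frequency list.
form_counts = {"כתבתי": 40, "כותב": 35, "ספר": 60, "ספרים": 10}

lemma_counts = lemma_frequencies(form_counts, FORM_TO_LEMMA)
top = lemma_counts.most_common(5000)  # lemmas sorted by total count
```

If the source list only has ranks and no counts, this doesn’t work as-is, and the skew described above remains.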

Ideally I’d like 5000 words with frequency data (the number of occurrences in the corpus would be perfect, but a list that is merely sorted by frequency is also fine). Of course, I’ll take whatever I can get: 2000 unsorted words is also fine.

If you have man-made lists, meaning a selection of basic words compiled by someone with no actual statistical backing, those would also be useful. Whether the words are translated doesn’t matter at all.