Number of unique “root” words?

In the Polish sentences, many of the words seem to be variations based on declension, conjugation, etc.

Does that mean we have fewer than 20,000 words? If so, how many unique “root” words are we working with?

From what I’ve been told (for Russian), the number of words refers to unique forms, not to unique roots. So indeed, if you had 20,000 unique forms, you would have fewer than 20,000 unique roots.

Answering the question of how many unique root words one is working with is difficult not only in terms of developing an implementation, but in terms of agreeing on a standard. For instance, does a noun derived from a verb (or vice versa) count as unique?

I suspect that no attempt to count unique root words has been made. But perhaps I’m wrong.

Thanks for the feedback Alan.

That makes a lot of sense, and I can see where the difficulty arises in counting.

Based on this article, it seems the linguistic term is “lemmas”, and a native English speaker uses around 5,000 of them daily (while understanding anywhere from 3 to 8 times that number).

I’m hoping that, with the 30-40k clozes in the most common words collections, we’d hit somewhere near that 5,000 mark.

I wouldn’t hope for that. Based on the alphabetical list of all Polish words, I see there are 4-7 lemmas in every 100 words at best. That would make 1,200-2,800 lemmas in 30-40k clozes, not even close to 5,000.
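For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes each cloze corresponds to one distinct word form, which is a simplification, and the function name `lemma_range` is just made up for illustration.

```python
def lemma_range(clozes_low, clozes_high, per_hundred_low, per_hundred_high):
    """Return the (low, high) lemma estimate for a range of cloze counts.

    per_hundred_* is the number of lemmas per 100 word forms.
    Integer arithmetic is used to avoid floating-point rounding noise.
    """
    low = clozes_low * per_hundred_low // 100
    high = clozes_high * per_hundred_high // 100
    return low, high


# 30-40k clozes at 4-7 lemmas per 100 words
print(lemma_range(30_000, 40_000, 4, 7))
```

The low end pairs the smallest collection with the lowest rate and the high end pairs the largest collection with the highest rate, so this is the widest range those figures allow.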

I just attempted to roughly count the number of lemmas in the Polish language based on the list provided by Słownik Języka Polskiego. There are 238,000 total entries, some of which are allowed in Scrabble and some not. If we assume that the entries allowed in the game are lemmas, then we can estimate the number of root words. I counted the allowed entries on 20 random pages of the list and summed them up. Assuming the allowed entries are spread uniformly across the pages, I estimate the number of root words to be between 172,074 and 192,661 with 95% confidence.

By that I mean all the root words found in the dictionaries: I count derived forms separately, but I count conjugated and declined forms together. E.g. karta & karciany are counted as 2 words, and karta & kartą are counted as one word.
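A sampling estimate like this can be sketched roughly as follows. This is not the exact calculation used above: the per-page fractions are invented placeholders, `estimate_total` is a hypothetical helper, and the interval uses a plain normal approximation (z = 1.96). With only 20 pages, a Student t multiplier (about 2.09) would give a slightly wider interval.

```python
import math
import statistics


def estimate_total(page_fractions, total_entries, z=1.96):
    """Estimate how many of total_entries are allowed, with a confidence interval.

    page_fractions: fraction of allowed entries observed on each sampled page.
    Returns (point_estimate, low, high) using a normal approximation.
    """
    mean = statistics.mean(page_fractions)
    # Standard error of the mean across the sampled pages
    sem = statistics.stdev(page_fractions) / math.sqrt(len(page_fractions))
    low = (mean - z * sem) * total_entries
    high = (mean + z * sem) * total_entries
    return mean * total_entries, low, high


# Placeholder data, NOT the real page counts from SJP
sample = [0.70, 0.82, 0.75, 0.79, 0.73, 0.80, 0.77, 0.76]
point, low, high = estimate_total(sample, 238_000)
print(f"{point:.0f} entries (95% CI {low:.0f} to {high:.0f})")
```

The interval only reflects page-to-page variation. It says nothing about whether "allowed in Scrabble" is a good stand-in for "lemma", which is the bigger assumption here.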

I took all the sentences from the Clozemaster Polish/English common words collections, split them into individual unique words, and got 23k words. Then I asked ChatGPT: “Identify and list only the dictionary (base) form for each of the following Polish words.”
The result was 6600 root words (28%).
https://drive.google.com/drive/folders/1iNRHvpXEYbE6udfZM-KHPm4WiFYhbbwG
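The word-splitting and counting steps can be sketched like this. The lemma lookup is a stand-in: in the post above that step was done by ChatGPT, so the `lemma_of` dictionary and both function names are hypothetical, and the example sentences are invented.

```python
import re

# Runs of Unicode letters only (no digits, punctuation or underscores)
WORD_RE = re.compile(r"[^\W\d_]+")


def unique_words(sentences):
    """Collect the set of lowercased word forms found in the sentences."""
    words = set()
    for sentence in sentences:
        words.update(w.lower() for w in WORD_RE.findall(sentence))
    return words


def lemma_ratio(words, lemma_of):
    """Map each form to its lemma; unknown forms count as their own lemma.

    Returns (number_of_lemmas, lemmas_per_word_form).
    """
    lemmas = {lemma_of.get(w, w) for w in words}
    return len(lemmas), len(lemmas) / len(words)


sentences = ["Karta leży.", "Gram kartą!"]
lemma_of = {"kartą": "karta", "leży": "leżeć", "gram": "grać"}  # toy lookup
forms = unique_words(sentences)
print(len(forms), lemma_ratio(forms, lemma_of))
```

Forms missing from the lookup are kept as-is, so a gap in the lemmatizer inflates the lemma count rather than silently dropping words.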

@siniy good job. I revised your list and tried to filter out the words that fall into one of these categories:

  • proper nouns: names, names of the cities, countries e.g. tamiza, artur, tajwan, ibm
  • English words: e.g. you
  • non-existent words, e.g. ać
  • 2 words glued together, e.g. szałszaleć, dokarmiaćdokładka (I guess it’s a ChatGPT artifact)
  • not the main form
  • negations e.g. niemało
  • artificially extended words like pra-pradziad
  • diminutives, e.g. fartuszek
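A few of these rules can be approximated with simple set lookups. This is only a rough sketch, not the actual filtering used here: `classify` and its word sets are hypothetical, the "nie" prefix check is a crude heuristic, and diminutives or non-main forms are not detected at all because they would need a proper morphological resource.

```python
def classify(lemma, dictionary, proper_nouns, english_words):
    """Return a label explaining why a candidate lemma is kept or rejected."""
    if lemma in proper_nouns:
        return "proper noun"
    if lemma in english_words:
        return "English word"
    # Crude negation heuristic: "nie" + a known dictionary word
    if lemma.startswith("nie") and lemma[3:] in dictionary:
        return "negation"
    # Catches non-existent words and two words glued together
    if lemma not in dictionary:
        return "not in dictionary"
    return "valid"


# Toy word sets for illustration only
dictionary = {"karta", "mało", "fartuszek"}
proper_nouns = {"tamiza", "artur", "tajwan", "ibm"}
english_words = {"you"}

for word in ["karta", "tamiza", "you", "niemało", "szałszaleć", "fartuszek"]:
    print(word, "->", classify(word, dictionary, proper_nouns, english_words))
```

Note that fartuszek passes as valid in this sketch, which shows the limit of plain lookups for the diminutive rule.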

I got a ratio of 2881 valid words to 2556 invalid ones from your list of root words, but it’s not worth much. I think closer to the truth would be 4000-5000 valid / 1000-2000 invalid. I have a problem with scraping the SJP, so I don’t always get clean results.

At any rate, I was wrong. I’m convinced you can learn 4000-5000 lemmas using Clozemaster. There is so much trash here though, 50% of the words are trash. I’m not even talking about pure trash like xyz or linux, or youtube but words like szczygieł, cie or powieściopisarz. Those are valid Polish words but you’re not gonna use them in real life.