Custom cloze-collections for Cantonese?

scottlawsonbc · July 20, 2018, 7:37am

I am learning Cantonese. Unfortunately, Tatoeba only has 2,886 sentences for this language.

I would like to add 3,500 of my own custom cloze sentences, but Cantonese is not supported in the custom cloze-collections beta. I can provide my custom sentences in any format.

As a workaround, I added Cantonese -> English sentences to a Danish -> English cloze-collection (Danish is supported in the beta). It seemed to work fine and could be a tolerable workaround for getting “Cantonese” custom cloze-deletions.

Downsides to this workaround:

Statistics would be a mess
Bold font used in Danish text detracts from legibility of Chinese characters
TTS must be disabled

Can Clozemaster enable support for Cantonese custom cloze-collections? Also, how technically challenging is this request? (just curious)

Interesting facts!

Cantonese (Yue) has 79 million speakers and 2,886 Tatoeba sentences (27,373 people/sentence)
Danish has 5.4 million speakers and 21,361 sentences (253 people/sentence)
Clozemaster has far more Danish learners than Cantonese learners
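
For anyone who wants to check the people-per-sentence figures, here is a quick sketch in Python. It uses only the rounded speaker counts quoted above, so the results may differ from the quoted figures by a unit or so. The variable names are just for illustration.

```python
# Rounded figures quoted in the post above.
speakers = {"Cantonese": 79_000_000, "Danish": 5_400_000}
sentences = {"Cantonese": 2_886, "Danish": 21_361}

# Speakers per available sentence for each language.
ratio = {lang: speakers[lang] / sentences[lang] for lang in speakers}

for lang, value in ratio.items():
    print(f"{lang}: {value:,.0f} people/sentence")

# How many times more speakers-per-sentence Cantonese has than Danish.
print(f"gap: {ratio['Cantonese'] / ratio['Danish']:.0f}x")
```

On these numbers the gap between the two languages is on the order of a hundred times.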

mike · July 20, 2018, 9:55am

Haha, interesting workaround! How did you add the sentences? With the cloze already defined?

We should be able to enable support for Cantonese custom cloze-collections, thanks for posting and letting us know you’re interested. It shouldn’t be challenging technically and we should be able to get support added within the next week or so. I’ll keep you posted.

Those are interesting facts! Hopefully we can get more learning material added for Cantonese in the future. Are the sentences you’re looking to add public domain? If so might you consider making your cloze-collection public once we get it enabled for Cantonese?

Also - any other feedback on the cloze-collections? Anything else we can do to make them better?

Thanks again!

scottlawsonbc · July 20, 2018, 8:46pm

Yes, I defined the cloze myself. For this workaround test, I only added a few sentences. For my other 3,500 sentences, I use Python to compute statistics on a corpus of Cantonese text that I’ve collected, and then select cloze words using that data.
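
The exact statistics are not spelled out in this post. Purely as an illustration, one frequency-based way to pick cloze words might look like the sketch below. It assumes the sentences are already segmented into words (Cantonese text has no spaces, so segmentation is a separate step), and the function names and the `min_count` threshold are hypothetical, not the method actually used.

```python
from collections import Counter


def build_frequencies(corpus):
    """Count how often each word appears across pre-segmented sentences."""
    return Counter(word for sentence in corpus for word in sentence)


def pick_cloze(sentence, freq, min_count=2):
    """Pick the least frequent word that still occurs at least min_count times.

    Very rare words are skipped on the assumption that they are poor
    study targets; ties go to the word that appears first in the sentence.
    """
    candidates = [w for w in sentence if freq[w] >= min_count]
    if not candidates:
        return None
    return min(candidates, key=lambda w: freq[w])


def make_cloze(sentence, word):
    """Blank out the first occurrence of the chosen word."""
    parts = list(sentence)
    parts[parts.index(word)] = "____"
    return "".join(parts)


# Tiny hypothetical corpus, already split into words.
corpus = [
    ["我", "鍾意", "食", "飯"],
    ["我", "唔", "鍾意", "飲", "茶"],
    ["佢", "鍾意", "飲", "茶"],
    ["我", "食", "飯"],
]
freq = build_frequencies(corpus)
target = pick_cloze(corpus[1], freq)
print(make_cloze(corpus[1], target))
```

Raising `min_count` biases the selection toward more common vocabulary; lowering it to 1 lets one-off words become cloze targets.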

Awesome! I really appreciate that! I’m going on vacation for a week starting July 27, 2018, and was hoping to be able to do lots of studying during this time. If there is any chance this change could be made before July 27, that would be so great, but I also understand if that isn’t possible.

I have a collection of 3,500 sentences. Unfortunately, 3,000 of the sentences are not in the public domain.

I plan to create at least 1,000-2,000 more sentences in the next 12 months. I would be happy to release these into the public domain.

I am part of a network of around 250 highly motivated Cantonese learners. If Clozemaster supported custom cloze-collections for Cantonese, I would very strongly recommend Clozemaster to this group. Custom cloze-collections are a killer feature for Cantonese because learning resources are scarce and tend to be plagued with errors.

Perhaps this could result in 5-10 additional Clozemaster Pro subscriptions. Additionally, this network could help to bring more public domain sentences to Cantonese.

For standard Chinese (Mandarin), there are JavaScript libraries which can generate the Pinyin pronunciation for arbitrary Chinese sentences. Automatically generating Pinyin could simplify the process for adding custom Chinese sentences, since the pronunciation field could be computed automatically. I didn’t get a chance to see if you are already doing this.

For Cantonese, it is more difficult to generate pronunciation information. Pinyin is the official romanization system for Mandarin, but Cantonese has no single standard romanization, only several competing systems. The most common ones are Jyutping and Yale. This presents some challenges:

Automatically generating pronunciation information is more difficult for Cantonese than standard Chinese. It is still possible though.
If you don’t automatically generate pronunciation information for Cantonese, users will have to enter their own.
Since there is no standard romanization for Cantonese, users will enter Jyutping or Yale, or even worse, a mixture of both! Non-standardized pronunciation information makes it harder to share custom cloze-collections.
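
As a rough illustration of the mixing problem, a shared collection could be screened with a simple heuristic like the one below. This is only a sketch under stated assumptions: it treats a syllable ending in a tone digit 1-6 as Jyutping and a syllable carrying a tone diacritic as Yale. Yale written with tone numbers, or Yale syllables whose tone has no diacritic, cannot be told apart this way. The function names are hypothetical.

```python
import re
import unicodedata

# Assumption: Jyutping syllables are ASCII letters followed by a tone digit 1-6.
JYUTPING_SYLLABLE = re.compile(r"^[a-z]+[1-6]$")


def has_tone_mark(token):
    """True if the token contains a combining accent after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", token)
    return any(unicodedata.combining(ch) for ch in decomposed)


def guess_system(token):
    """Heuristically label one syllable as 'jyutping', 'yale', or 'unknown'."""
    token = token.lower()
    if JYUTPING_SYLLABLE.match(token):
        return "jyutping"
    if has_tone_mark(token):
        return "yale"
    return "unknown"


def is_mixed(pronunciation):
    """True if a space-separated pronunciation appears to use both systems."""
    systems = {guess_system(t) for t in pronunciation.split()}
    return {"jyutping", "yale"} <= systems


print(is_mixed("nei5 hou2"))   # Jyutping only
print(is_mixed("néih hóu"))    # Yale with tone marks only
print(is_mixed("nei5 hóu"))    # one syllable from each
```

A check like this would only flag entries for review; it does not convert between the two systems.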

Ideally, I think both Jyutping and Yale would be generated automatically and the user could see both romanizations.

Panasta · November 12, 2020, 1:34am

Is it possible to share these custom collections?