It is time we shared the dataset with everyone. This is a collection of text from Tamil news articles. It has around 7 million lines of text, all cleaned up and ready to use for language modelling tasks, in case anyone wants to try. You can use the code from the git repo below to get started.

Dataset: [kaggle.com/paarulakan/tamilnew]
Code: [github.com/vanangamudi/tamil-l]

@vanangamudi Thank you, a processed ('curated'..?) article collection like this is really great & much needed.

I'll think about an idea to use it somehow.

If you have an idea for how we could make a browser add-on that bundles this (and does something with it), I'll be glad to look into that as my next free (as in swantra) software project.

cc @prashere
