TEXT CLASSIFICATION USING NLTK


This work examines how well established technologies and tools developed for English perform on Georgian, using text classification, a standard NLP task, as the test case.
The freely available 20 Newsgroups dataset was used as the data. It is commonly used to test new algorithms for text classification. The dataset exists only in English, with no Georgian translation, so for this work it was translated into Georgian with Google's online translator. The translated text was then represented in two encodings, ASCII and UTF-8. Processing the English text and the ASCII-encoded Georgian text gives good results, whereas processing the UTF-8-encoded Georgian text takes longer to reach the same result.
These results suggest that, when processing Georgian text with NLTK tools, it is better to supply the text in ASCII encoding.
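
The two representations compared above can be illustrated with a minimal Python sketch. The source does not say how the ASCII version of the Georgian text was produced, so the romanization table below (`ROMAN`, `GEORGIAN_TO_ASCII`) and the helper names `romanize` and `encoded_sizes` are assumptions made for illustration only; the table loosely follows the common Latin romanization of the 33 modern Georgian letters. The sketch shows only that the same word occupies a different number of bytes in each representation. It does not reproduce the classification experiment or explain the timing difference reported here.

```python
# Illustrative only: a hypothetical Georgian-to-ASCII romanization.
# The source does not specify its ASCII conversion; this table is an assumption.

# Romanizations of the 33 modern Georgian letters, in Unicode order
# (U+10D0 "an" through U+10F0 "hae").
ROMAN = [
    "a", "b", "g", "d", "e", "v", "z", "t", "i", "k'", "l",
    "m", "n", "o", "p'", "zh", "r", "s", "t'", "u", "p", "k",
    "gh", "q'", "sh", "ch", "ts", "dz", "ts'", "ch'", "kh", "j", "h",
]
GEORGIAN_TO_ASCII = {chr(0x10D0 + i): r for i, r in enumerate(ROMAN)}


def romanize(text: str) -> str:
    """Replace each Georgian letter with its ASCII romanization.

    Characters outside the table (Latin letters, digits, spaces) pass through.
    """
    return "".join(GEORGIAN_TO_ASCII.get(ch, ch) for ch in text)


def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, ASCII byte count after romanization).

    Raises UnicodeEncodeError if the romanized text still contains
    non-ASCII characters that the table does not cover.
    """
    utf8_bytes = text.encode("utf-8")
    ascii_bytes = romanize(text).encode("ascii")
    return len(utf8_bytes), len(ascii_bytes)


if __name__ == "__main__":
    word = "\u10e2\u10d4\u10e5\u10e1\u10e2\u10d8"  # Georgian word for "text"
    print(romanize(word))        # t'ekst'i
    print(encoded_sizes(word))   # (18, 8)
```

Each Georgian letter takes three bytes in UTF-8, while the romanized form uses one byte per ASCII character. Note that a romanization like this one introduces apostrophes, which may affect how a tokenizer splits words, so a different scheme could be preferable in practice.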