Hi Experts,
We are working on a text classification problem. We have around 80K records spread across roughly 50 classes, and the data is highly imbalanced. The dataset has two columns: one with the description and one with the class.
So far we have tried the following models and techniques:
- Data preprocessing: a. lowercase conversion, removal of numeric text and punctuation; b. removal of unimportant words and stop words; c. lemmatization
- TF-IDF transformation
- scikit-learn models: a. Linear SVC b. Linear Regression c. Logistic Regression d. Decision Trees e. Random Forest
- Hugging Face Transformers: a. Google BERT b. DistilBERT
- SMOTE sampling
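For reference, the TF-IDF plus scikit-learn part of the setup above can be sketched roughly as follows. This is an approximation and not our exact code: the toy descriptions, the label names, and the parameter values (`stop_words="english"`, `max_iter=1000`) are placeholders chosen for illustration. Lemmatization and the SMOTE step are left out, since SMOTE needs the separate imbalanced-learn package.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical); the real dataset has ~80K rows and ~50 classes.
descriptions = [
    "printer not printing pages",
    "printer paper jam error",
    "cannot login to email account",
    "email password reset needed",
    "laptop screen is flickering",
    "laptop battery drains quickly",
]
labels = ["printer", "printer", "email", "email", "laptop", "laptop"]

# TF-IDF features (lowercasing and English stop-word removal) feeding a linear classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(descriptions, labels)

# Predict classes for two unseen descriptions.
predictions = pipeline.predict(["printer jam again", "reset my email password"])
print(predictions)
```

Swapping `LogisticRegression` for `LinearSVC` or `RandomForestClassifier` in the `"clf"` step gives the other scikit-learn variants listed above.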
The best accuracy we have achieved so far is 70% (with Random Forest and Google BERT).
Is there any scope to improve the accuracy?
If yes, what other techniques or models can we use to improve it?