B4J Tutorial [NLP] Sentiment analysis

The "sentiment" feature in the example project, uses NLP document categorizer feature to find whether the text is negative or positive.
The dataset for the model contains movie reviews. This means that the domain for this specific categorizer is movie reviews.

Training a model is done with the OpenNLP command-line tools.
Important documentation: https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.training
We need to prepare a text file with the expected format.
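The doccat trainer expects one document per line: the category name first, a whitespace separator, and then the full document text on the same line. The sample lines below are made up for illustration and are not taken from the actual IMDB dataset:

```
positive This film was a wonderful surprise. Great acting and a clever script.
negative Two hours of my life I will never get back. The plot made no sense.
```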

This is done with this simple program:
B4X:
Sub Process_Globals
    Private all As List
End Sub

Sub AppStart (Args() As String)
    Dim TrainOrTest As String = "test" 'run once with "train" and once with "test"
    all.Initialize
    FillList($"C:\Users\H\Downloads\projects\aclImdb\${TrainOrTest}\neg"$, "negative")
    FillList($"C:\Users\H\Downloads\projects\aclImdb\${TrainOrTest}\pos"$, "positive")
    File.WriteList(File.DirApp, $"${TrainOrTest}.txt"$, all)
    Log("complete: " & TrainOrTest)
End Sub

Private Sub FillList(Folder As String, Category As String)
    For Each f As String In File.ListFiles(Folder)
        all.Add(Category & " " & File.ReadString(Folder, f)) 'the text is already single line.
    Next
End Sub

We create two files: train.txt and test.txt.

Now we can build the model:
B4X:
'apache-opennlp-1.9.3\bin
opennlp DoccatTrainer -lang en -params props.txt -model en-movies.bin -data C:\Users\H\Downloads\projects\Movies\Objects\train.txt
props.txt:
B4X:
Iterations=2000
Cutoff=5
You can play with these two settings: Iterations is the number of training iterations, and Cutoff is the minimum number of times a feature must appear in the data to be included in the model.

The next step is to evaluate the model using the test data:
B4X:
opennlp DoccatEvaluator -model en-movies.bin -data C:\Users\H\Downloads\projects\Movies\Objects\test.txt -misclassified true -reportOutputFile 1.txt
The output file contents:
B4X:
=== Evaluation summary ===
  Number of sentences:  25000
    Min sentence size:      4
    Max sentence size:   2278
Average sentence size: 228.52
           Tags count:      2
             Accuracy: 87.69%


=== Detailed Accuracy By Tag ===

-------------------------------------------------------------------------
|      Tag | Errors |  Count |   % Err | Precision | Recall | F-Measure |
-------------------------------------------------------------------------
| positive |   1638 |  12500 | 0.131   | 0.883     | 0.869  | 0.876     |
| negative |   1439 |  12500 | 0.115   | 0.871     | 0.885  | 0.878     |
-------------------------------------------------------------------------

Precision - 88.3% of the documents that were categorized as positive were actually positive.
Recall - 86.9% of the positive documents were marked as positive.
F-Measure - the harmonic mean of precision and recall, combining both into a single score.
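Using the numbers from the table above, the positive-tag scores can be reproduced by hand: 1638 positive documents were misclassified as negative, and 1439 negative documents were misclassified as positive.

```
TP        = 12500 - 1638 = 10862              'positives classified correctly
Precision = TP / (TP + 1439) = 10862 / 12301 ≈ 0.883
Recall    = TP / 12500       = 10862 / 12500 ≈ 0.869
F-Measure = 2 * 0.883 * 0.869 / (0.883 + 0.869) ≈ 0.876
```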

Overall accuracy is based on the scores of all tags.

Whether this accuracy is good enough depends on the use case. It can be improved by increasing the training dataset size, increasing the number of iterations, cleaning the dataset, and in other ways.

I find it to be an impressive result.

The last step is to use the model in our program.
You can see the code in the OpenNLP Example project.
1. Load the model.
2. Tokenize the text. The trainer uses a whitespace tokenizer, so it is probably best to use it as well.
3. Call nlp.Categorize.
4. Get the results from Paragraph.Categories. The highest scored category will be in Paragraph.BestCategory.
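A rough B4X sketch of these steps. Apart from nlp.Categorize, Paragraph.Categories and Paragraph.BestCategory, which are mentioned above, the names here are assumptions - check the OpenNLP Example project for the actual code:

```
'Sketch only - the model loading call is an assumed name, see the example project.
'nlp.LoadDoccatModel(File.DirAssets, "en-movies.bin")
Dim p As Paragraph = nlp.Categorize("I really enjoyed this movie. Great acting!")
Log(p.BestCategory) 'the highest scored category, e.g. "positive"
Log(p.Categories)   'all categories with their scores
```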
 

Erel

As a test, I've added another IMDB dataset to the training data (https://www.kaggle.com/volodymyrgav...cation?select=imdb_10K_sentimnets_reviews.csv). I've then tested it with the same test data.

B4X:
Accuracy: 90.83%

-------------------------------------------------------------------------
|      Tag | Errors |  Count |   % Err | Precision | Recall | F-Measure |
-------------------------------------------------------------------------
| negative |   1164 |  12500 | 0.093   | 0.909     | 0.907  | 0.908     |
| positive |   1129 |  12500 | 0.09    | 0.907     | 0.91   | 0.908     |
-------------------------------------------------------------------------
The gain is quite nice, especially as the effort to add it is minimal. This is really the power of machine learning.
 

Erel

By default, the doccat trainer uses "bag of words" as the only feature generator. Bag of words means that it simply collects all the words, without preserving the order or anything else except the word frequencies.
Each word becomes a feature and its frequency is the feature's value.
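To make this concrete, here is a minimal B4X sketch of a bag of words count (the sample sentence is made up):

```
Dim counts As Map
counts.Initialize
For Each w As String In Regex.Split(" ", "not bad , not bad at all")
    Dim c As Int = counts.GetDefault(w, 0)
    counts.Put(w, c + 1) 'each word is a feature; its count is the value
Next
'counts now holds: not=2, bad=2, ,=1, at=1, all=1
'note that the word order ("not bad") is lost - only frequencies remain
```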

OpenNLP supports another type of feature generator named NGramFeatureGenerator. This generator creates n-grams, which are simply sequences of words, where the sequence length is n. We can add this generator like this:
B4X:
opennlp DoccatTrainer -lang en -params props.txt -model en-movies.bin -featureGenerators opennlp.tools.doccat.NGramFeatureGenerator,opennlp.tools.doccat.BagOfWordsFeatureGenerator -data C:\Users\H\Downloads\projects\Movies\Objects\train.txt
Note that we kept the bag of words generator as well. The NGram generator is set to 2 (bigrams).
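To see what bigrams look like, here is a small B4X sketch that prints the 2-grams of a made-up sentence:

```
Dim words() As String = Regex.Split(" ", "the movie was not bad")
For i = 0 To words.Length - 2
    Log(words(i) & " " & words(i + 1)) 'logs: "the movie", "movie was", "was not", "not bad"
Next
```

Bigrams such as "not bad" preserve local word order that the bag of words loses, which is likely why they help with sentiment.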

Training takes a bit longer; however, it does improve the results:

B4X:
  Accuracy: 92.71%

-------------------------------------------------------------------------
|      Tag | Errors |  Count |   % Err | Precision | Recall | F-Measure |
-------------------------------------------------------------------------
| negative |    984 |  12500 | 0.079   | 0.932     | 0.921  | 0.927     |
| positive |    838 |  12500 | 0.067   | 0.922     | 0.933  | 0.928     |
-------------------------------------------------------------------------

Think about this quite amazing task: we give the computer a text, a movie review in this case, that it has never seen before, and in 93% of cases it correctly tells us whether it is a positive or negative review. Skynet is coming.
 