In this exercise, we will build a classifier that can detect the sentiment of a text. A sentiment is a view or opinion that is expressed. Consider this movie review: "that was an awesome movie". Here, the sentiment is positive.
Till now, we have used datasets provided by ML libraries. In this exercise, we will download the dataset ourselves and perform some operations to make it suitable for our model. You will notice that most of the time in this exercise is spent on data loading and preprocessing. If you are unfamiliar with machine learning algorithms, check out my practical machine learning tutorials - part-1, part-2 and part-3. GitHub Repo - Practical Machine Learning with Python
We will be using a movie review dataset. It consists of 1000 positive and 1000 negative reviews. You can download it here.
#import dependencies
import os

DATA_DIR = "./txt_sentoken"
classes = ['pos', 'neg']

#variables to store data
train_data = []
train_labels = []
test_data = []
test_labels = []

for c in classes:
    data_dir = os.path.join(DATA_DIR, c)
    for fname in os.listdir(data_dir):
        with open(os.path.join(data_dir, fname), 'r') as f:
            content = f.read()
            #files named cv9* (the last 10% of each class) go into the test set
            if fname.startswith('cv9'):
                test_data.append(content)
                test_labels.append(c)
            else:
                train_data.append(content)
                train_labels.append(c)
type(train_data)
print(len(train_data), len(test_data))
print(train_data[3])
print(train_labels[3])
In this exercise we are dealing with textual data, but note that our machine learning model only accepts numerical data. Thus, we have to convert the textual data into numerical data. This is known as feature extraction. Also, machine learning models need fixed-size numerical vectors, whereas texts usually vary in length.
Here, we will see the widely used bag-of-words representation. Let me explain this with an example. Consider two sentences. 1 - "That was an awesome movie." 2 - "I really appreciate your work in this movie."
For the bag-of-words representation, we follow this procedure:
1 - tokenize: tokenized_words = ['that','was','an','awesome','movie','i','really','appreciate','your','work','in','this']
2 - build a vocabulary: here, the twelve unique tokens above, in the order listed
3 - sparse matrix encoding - now we represent each sentence with a sparse array. 1 - [1,1,1,1,1,0,0,0,0,0,0,0] 2 - [0,0,0,0,1,1,1,1,1,1,1,1]
The length of each vector equals the size of the vocabulary. A 1 marks the presence of the corresponding vocabulary word in the sentence. For example, the 1 in the fifth position of both vectors shows that the word "movie" is present in both sentences.
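The three steps above can be sketched in plain Python. This is just an illustration; the function and variable names are my own, not part of any library:

```python
def bag_of_words(sentences):
    """Build a vocabulary and binary presence vectors for a list of sentences."""
    # 1 - tokenize: lowercase each sentence and split on whitespace
    tokenized = [s.lower().split() for s in sentences]
    # 2 - build a vocabulary: one entry per unique word, in first-seen order
    vocab = []
    for tokens in tokenized:
        for word in tokens:
            if word not in vocab:
                vocab.append(word)
    # 3 - encode: 1 if the vocabulary word occurs in the sentence, else 0
    vectors = [[1 if word in tokens else 0 for word in vocab]
               for tokens in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["That was an awesome movie",
                               "I really appreciate your work in this movie"])
```

Running this on the two sentences reproduces the vocabulary and the two presence vectors shown above.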
We use the term vectorization for the process of converting text into numerical features.
X = ["That was an awesome movie","I really appreciate your work in this movie"]
#import count vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data = vectorizer.fit_transform(X)
#get the vocabulary
vectorizer.vocabulary_ # "I" was dropped because the default tokenizer ignores single-character tokens (CountVectorizer does not remove stop words unless you ask it to)
#transform sparse matrix into an array
data.toarray()
#print feature names (get_feature_names() was removed in scikit-learn 1.2)
vectorizer.get_feature_names_out()
In a large corpus, some words (e.g. "the", "a") occur very often and hence carry little meaningful information about the contents of a document. For this reason, we use the tf-idf vectorizer. Tf stands for term frequency and idf for inverse document frequency. For more details on tf-idf and other vectorizers, please visit - Feature Extraction
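To see the down-weighting in action, here is a quick sketch on our two toy sentences. Under scikit-learn's default settings (smoothed idf, l2 normalization), a word that appears in every document receives a lower idf, and therefore a smaller weight, than a word unique to one document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["That was an awesome movie",
        "I really appreciate your work in this movie"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
vocab = tfidf.vocabulary_  # maps each word to its column index

# "movie" appears in both documents, so its idf is lower and it gets
# a smaller weight than "awesome", which occurs only in the first review
assert weights[0][vocab["movie"]] < weights[0][vocab["awesome"]]
```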
#we will use tf-idf for our sentiment analysis task
from sklearn.feature_extraction.text import TfidfVectorizer
import random
#shuffle reviews and labels together so that each review keeps its own label
train_pairs = list(zip(train_data, train_labels))
test_pairs = list(zip(test_data, test_labels))
random.shuffle(train_pairs)
random.shuffle(test_pairs)
train_data, train_labels = zip(*train_pairs)
test_data, test_labels = zip(*test_pairs)
vect = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True)
train_data_processed = vect.fit_transform(train_data)
test_data_processed = vect.transform(test_data)
We also need to convert our labels into numerical data. We have two possible labels: pos and neg.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_labels_processed = le.fit_transform(train_labels)
test_labels_processed = le.transform(test_labels)
train_labels_processed[:33]
le.classes_ #0 for neg and 1 for pos
from sklearn.svm import SVC
model = SVC(C=10, kernel="rbf")
#train
model.fit(train_data_processed, train_labels_processed)
model.score(test_data_processed, test_labels_processed)
train_data_processed
from sklearn.ensemble import RandomForestClassifier
m = RandomForestClassifier()
m.fit(train_data_processed, train_labels_processed)
m.score(test_data_processed, test_labels_processed)
x = ["That was an awesome movie"]
x_p = vect.transform(x) #perform vectorization
m.predict(x_p) #1 means positive
x2 = ["That was very bad movie"]
x2_p = vect.transform(x2)
m.predict(x2_p)
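Rather than remembering that 1 means positive, you can use `le.inverse_transform` to map predictions back to the original string labels. A self-contained sketch with made-up toy reviews (the real pipeline above uses the full dataset instead):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

reviews = ["an awesome movie", "a great awesome film",
           "a very bad movie", "a terrible bad film"]
labels = ["pos", "pos", "neg", "neg"]

vect = TfidfVectorizer()
le = LabelEncoder()
X = vect.fit_transform(reviews)
y = le.fit_transform(labels)  # LabelEncoder sorts classes: 0 = neg, 1 = pos

model = RandomForestClassifier(random_state=0).fit(X, y)
pred = model.predict(vect.transform(["what an awesome movie"]))
print(le.inverse_transform(pred))  # maps the 0/1 prediction back to 'neg'/'pos'
```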
If you see an accuracy of around 52%, barely better than random guessing, check that the reviews and labels were shuffled together: shuffling them independently destroys the pairing between each review and its label. With the pairing intact, this pipeline scores well above chance.
If you have any queries or feedback, kindly write to me at vsavan7@gmail.com. You can follow me on Twitter (@savanvisalpara7) to get updates on new blog posts.