BLOG - SAVAN VISALPARA

[ Musings on Machine Learning ]
Posted on 28 May, 2017

Sentiment Analysis

In this exercise, we will build a classifier which can detect the sentiment in a text. Sentiment can be defined as a view or an opinion that is expressed. Consider this movie review-"that was an awesome movie". Here, the sentiment is positive.

Till now, we used datasets provided by ML libraries. In this exercise, we will download the dataset and perform some operations to make it suitable for our model. You will notice that most of the time during this exercise is spent on data loading and preprocessing. If you are unfamiliar with machine learning algorithms then check out my practical machine learning tutorials - part-1, part-2 and part-3. Github Repo - Practical Machine Learning with Python

Prepare Dataset

We will be using movie review dataset. It consists of 1000 positive and 1000 negative reviews. You can download it here.

In [12]:
#import dependencies
import os
In [14]:
DATA_DIR ="./txt_sentoken"
classes = ['pos', 'neg']

#vars to store data
train_data = []
train_labels = []
test_data = []
test_labels = []

for c in classes:
    data_dir = os.path.join(DATA_DIR, c)
    for fname in os.listdir(data_dir):
        with open(os.path.join(data_dir, fname), 'r') as f:
            content = f.read()
            if fname.startswith('cv9'):
                test_data.append(content)
                test_labels.append(c)
            else:
                train_data.append(content)
                train_labels.append(c)
In [15]:
type(train_data)
Out[15]:
list
In [17]:
print(len(train_data), len(test_data))
1800 200
In [18]:
print(train_data[3])
print(train_labels[3])
 " jaws " is a rare film that grabs your attention before it shows you a single image on screen . 
the movie opens with blackness , and only distant , alien-like underwater sounds . 
then it comes , the first ominous bars of composer john williams' now infamous score . 
dah-dum . 
from there , director steven spielberg wastes no time , taking us into the water on a midnight swim with a beautiful girl that turns deadly . 
right away he lets us know how vulnerable we all are floating in the ocean , and once " jaws " has attacked , it never relinquishes its grip . 
perhaps what is most outstanding about " jaws " is how spielberg builds the movie . 
he works it like a theatrical production , with a first act and a second act . 
unlike so many modern filmmakers , he has a great deal of restraint , and refuses to show us the shark until the middle of the second act . 
until then , he merely suggests its presence with creepy , subjective underwater shots and williams' music . 
he's building the tension bit by bit , so when it comes time for the climax , the shark's arrival is truly terrifying . 
he doesn't let us get bored with the imagery . 
the first act opens with police chief martin brody ( roy scheider ) , a new york cop who has taken an easy , peaceful job running the police station on amity island , a fictitious new england resort town where there hasn't been a murder or a gun fired in 25 years . 
the island is shaken up by several vicious great white shark attacks right before the fourth of july , and the mayor , larry vaughn ( murray hamilton ) , doesn't want to shut down the beaches because the island is reliant on summer tourist money . 
brody is joined by matt hooper ( richard dreyfuss ) , a young , ambitious shark expert from the marine institute . 
hooper is as fascinated by the shark as he is determined to help brody stop it -- his knowledge about the exact workings of the shark ( " it's a perfect engine , an eating machine " ) make it that much more terrifying . 
when vaughn finally relents , hooper and brody join a crusty old shark killer named quint ( robert shaw ) on his decrepit boat , the orca , to search for the shark . 
the entire second act takes place on the orca as the three men hunt the shark , and inevitably , are hunted by it . 
 " jaws " is a thriller with a keen sense of humor and an incredible sense of pacing , tension , and horror . 
it is like ten movies all rolled into one , and it's no wonder it took america by storm in the summer of 1975 , taking in enough money to crown it the box office champ of all time ( until it was unceremoniously dethroned in 1977 by " star wars " ) . 
even today , fascination with this film is on par with hitchcock's " psycho , " and it never seems to age . 
although grand new technology exists that makes the technical sequences , including several mechanical sharks , obsolete , none of it could improve the film because it only would lead to overkill . 
the technical limitations faced by spielberg in 1975 may have actually produced a better film because it forced him to rely on traditional cinematic elements like pacing , characterization , sharp editing , and creative photography , instead of simply dousing the audience with digital shark effects . 
scheider , dreyfuss , and shaw were known actors at the time " jaws " was made , but none of them had the draw of a robert redford or paul newman . 
nevertheless , this film guaranteed them all successful careers because each gave an outstanding performance and refused to be overshadowed by the shark . 
scheider hits just the right notes as a sympathetic husband and father caught in the political quagmire of doing what's right and going against the entire town . 
 " it's your first summer here , you know , " mayor vaughn warns him . 
dreyfuss , who had previously been seen in " american graffiti " ( 1973 ) and " the apprenticeship of duddy kravitz " ( 1974 ) gives a surprisingly mature , complex performance for someone who had literally only played kids and teenagers . 
however , most outstanding is the gnarled performance by robert shaw as the movie's captain ahab , a performance sorely overlooked by the academy awards . 
bordering of parody , shaw plays quint as a grizzled old loner whose machismo borders on masochism . 
he's slightly deranged , and shaw's performance is almost a caricature . 
however , there is one scene late in the film , when he and brody and hooper are below deck on the orca comparing scars . 
quint is drawn into telling the story of his experiences aboard the u . s . s . 
indianapolis , a navy ship in world war ii that was sunk by the japanese . 
his tale of floating in the water for more than a week with over 1 , 000 other men while swarms of sharks slowly devoured them them is actually more hair-raising than anything spielberg put on screen . 
shaw delivers the story in one long take , and it is the best acting in the film . 
of course , we can't leave out the shark itself ; with its black eyes , endless rows of teeth , and insatiable urge to eat , it is basically the epitome of all mankind's fears about what is unknown and threatening in nature . 
a shark is such a perfect nemesis it is real -- having survived sinch the dinosaurs , great whites do exist , they can be as large as the shark in " jaws , " and they are a threat . 
every one of spielberg's subjective underwater shots makes us feel queasy because lets us see how we look to the shark : a bunch of writihing , dangling , completely unprotected legs just ready to be chomped into . 
the shark in " jaws " was actually a combination of actual footage and five different mechanical sharks ( all nicknamed " bruce " by the crew ) built to be shot from different angles . 
many have forgotten , but " jaws " was a sort of precursor to " waterworld " ( 1995 ) , a movie's who soggy production and cost overruns had universal studios worried about a bomb . 
but , as we can see now , spielberg overcame all the obstacles , and delivered one of the finest primal scare-thrillers ever to come out of hollywood . 

pos

Preprocess the dataset

In this exercise, we are dealing with textual data but note that our machine learning model only accepts numerical data. Thus, we have to convert this textual data into numerical data. This is known as feature extraction . Also, machine learning models need numerical data with fixed size vectors whereas textual data usually has variable length size.

Here, we will see widely used bag of words representation. Let me explain this with an example. Consider we have two sentences. 1- That was an awesome movie. 2- I really appreciate your work in this movie.

For bag of words representation, we have to follow following procedure:

1 - tokenize tokenized_words = ['that','was','an','awesome','movie','I','really','appreciate','your','work','in','this']

2 - build a vocabulary

  • first of all, we have to count the occurrences of tokens in each document.
  • arrange vocabs such that starting vocabs have more occurences than which come later in the set. vocabs = ['movie','awesome','appreciate','work','really','I','your','an','that','in','this','was']

3 - sparse matrix encoding -now we represent each sentence with a sparce array. 1 - [1,1,0,0,0,0,0,1,1,0,0,0] 2 - [1,0,1,1,1,1,1,0,0,1,1,0]

Length of each vector will be same as length of vocabs. Here 1's in first vectors represents presence of particular word in a sentence. For example. 1 in first vector shows word "movie" is present in the sentence.

We use term vectorization to represent the process of converting text into numerical features.

In [19]:
X = ["That was an awesome movie","I really appreciate your work in this movie"]
In [26]:
#import count vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data = vectorizer.fit_transform(X)
In [27]:
#get the vocabulary
vectorizer.vocabulary_ # "I" was removed because it is one of the stopwords, that is, that doesnt contain important significance
Out[27]:
{'an': 0,
 'appreciate': 1,
 'awesome': 2,
 'in': 3,
 'movie': 4,
 'really': 5,
 'that': 6,
 'this': 7,
 'was': 8,
 'work': 9,
 'your': 10}
In [30]:
#transform sparce matrix into an array
data.toarray()
Out[30]:
array([[1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
       [0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]], dtype=int64)
In [31]:
#print feature names
vectorizer.get_feature_names()
Out[31]:
['an',
 'appreciate',
 'awesome',
 'in',
 'movie',
 'really',
 'that',
 'this',
 'was',
 'work',
 'your']

In large corpus, some words (i.e "the","i") may occur lot of times and hence carrying little meaningful information about the contents of the document. For this reason, we use tf-idf vectorizer. Tf means term frequency and idf means inverse-term frequence. For more details on tf-idf and other vectorizers plese visit - Feature Extraction

In [45]:
#we will use tf-idf for our sentiment analysis task
from sklearn.feature_extraction.text import TfidfVectorizer
import random
random.shuffle(train_data)
random.shuffle(test_data)
In [61]:
vect = TfidfVectorizer(min_df=5,max_df = 0.8,sublinear_tf=True,use_idf=True)
train_data_processed = vect.fit_transform(train_data)
test_data_processed = vect.transform(test_data)

We also need to convert our labels into numerical data. We have two possible labels- pos,neg

In [62]:
from sklearn.preprocessing import LabelEncoder
random.shuffle(train_labels)
random.shuffle(test_labels)
In [63]:
le = LabelEncoder()
train_labels_processed = le.fit_transform(train_labels)
test_labels_processed = le.transform(test_labels)
In [64]:
train_labels_processed[:33] 
Out[64]:
array([1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 1, 1], dtype=int64)
In [65]:
le.classes_  #0 for neg and 1 for pos
Out[65]:
array(['neg', 'pos'], 
      dtype='<U3')

Train a model

In [77]:
from sklearn.svm import SVC
model = SVC(C=10, kernel="rbf")
#train
model.fit(train_data_processed, train_labels_processed)
Out[77]:
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [78]:
model.score(test_data_processed, test_labels_processed)
Out[78]:
0.51500000000000001
In [71]:
train_data_processed
Out[71]:
<1800x12495 sparse matrix of type '<class 'numpy.float64'>'
	with 508468 stored elements in Compressed Sparse Row format>
In [ ]:
 
In [ ]:
 
In [79]:
from sklearn.ensemble import RandomForestClassifier
In [81]:
m = RandomForestClassifier()
m.fit(train_data_processed, train_labels_processed)
Out[81]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
In [82]:
m.score(test_data_processed, test_labels_processed)
Out[82]:
0.52500000000000002
In [87]:
x = ["That was an awesome movie"]
x_p = vect.transform(x) #perform vectorization
In [88]:
m.predict(x_p) #1 means positive
Out[88]:
array([1], dtype=int64)
In [89]:
x2 = ["That was very bad movie"]
x2_p = vect.transform(x2)
In [90]:
m.predict(x2_p)
Out[90]:
array([0], dtype=int64)

Even though we got accuracy around 52, seems like it works well.

If you have any query or feedback kindly write me at vsavan7@gmail.com. You can follow me on twitter (@savanvisalpara7) to get updates on new blog posts.

Share this post on social media :