The frequency of stopwords in a Twitter corpus
In a previous post, I calculated the frequency of stopwords within the Brown corpus. The Brown corpus is considered to be fairly representative of standard English usage. As part of the project I was working on, the question arose as to whether the frequencies seen in the Brown corpus would also be representative of usage within a body of Tweets from Twitter. It’s not at all clear that we’d expect the same frequencies, so when in doubt, run the numbers.
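For reference, the Brown corpus calculation was essentially the following. This is a minimal sketch of the same idea rather than the exact code from that post; the variable names and the choice to lowercase everything are my own.

from nltk.corpus import brown, stopwords
from collections import Counter

# Build the stopword set once so the membership test below stays fast
stop_set = set(stopwords.words('english'))

# Lowercase the Brown tokens and keep only the stopwords
brown_words = [w.lower() for w in brown.words()]
stop_counts = Counter(w for w in brown_words if w in stop_set)

# Relative frequency of the ten most common stopwords
totwords = len(brown_words)
for word, count in stop_counts.most_common(10):
    print(word + ',' + str(count) + ',' + str(round(float(count) / totwords, 4)))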
The first step would be to obtain a large, hopefully representative body of Tweets. Collecting this on our own might be a possibility, but it turns out that others have already done the heavy lifting. The folks at www.sentiment140.com have collected a large set of Tweets for their project and have been kind enough to make it available as a CSV file (the zipped file can be found at http://help.sentiment140.com/for-students). The file contains 1,600,000 Tweets.
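In this file, each row carries the Tweet’s metadata (polarity, id, date, query, and user) followed by the Tweet text itself, so the text sits in column index 5; that is why the script below reads row[5]. A quick way to confirm the layout is to print the first row (this is a small check of my own, assuming the same local path used in the main script below):

import csv

# Print the first row to confirm which column holds the tweet text
fin = open("c:\\TwitterCorpus\\training_all.csv", 'rb')
try:
    reader = csv.reader(fin)
    first_row = next(reader)
    print(first_row)  # the tweet text should appear as the last field
finally:
    fin.close()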
Now we proceed much as we did with the Brown corpus, though we must code around the fact that this is not a standard NLTK corpus the way the Brown corpus is. The following code reads the input Twitter CSV and produces an output CSV with the frequencies of the top stopwords. A condensed list of the 50 most frequent stopwords follows the code listing.
from nltk.corpus import stopwords
from collections import Counter
import nltk
import csv
# This list will hold all the words from the Tweets
tw = []
fin = open("c:\\TwitterCorpus\\training_all.csv", 'rb')  # the Sentiment140 Twitter CSV (binary mode for Python 2's csv module)
fout = open("c:\\nomt\\twit_stop_words.csv", 'w')  # the output CSV of stopword frequencies
try:
    reader = csv.reader(fin)
    for row in reader:
        # Column 5 of each row holds the tweet text; non-ASCII characters are dropped before tokenizing
        tw.extend(nltk.word_tokenize(row[5].decode('ascii', 'ignore')))
finally:
    fin.close()
# Pluck out the stopwords from the Twitter words.
# Building the stopword set once keeps the membership test fast over 1.6 million Tweets.
stop_set = set(stopwords.words('english'))
stwords = [word.lower() for word in tw if word.lower() in stop_set]
# Count the stopwords and write rank, word, count, and relative frequency to the output CSV
cw = Counter(stwords)
cwmost = cw.most_common(200)
totwords = len(tw)
rnk = 1
for key, value in cwmost:
    fline = str(rnk) + ',' + str(key) + ',' + str(value) + ',' + str(round(float(value) / float(totwords), 4))
    fout.write(fline + '\n')
    rnk = rnk + 1
fout.close()
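Before loading the output into Excel, the top few rows can be checked directly in Python. This is a small convenience snippet of my own, assuming the output path used above:

# Print the first ten lines of the output CSV: rank, stopword, count, relative frequency
fcheck = open("c:\\nomt\\twit_stop_words.csv", 'r')
try:
    for i, line in enumerate(fcheck):
        if i >= 10:
            break
        print(line.strip())
finally:
    fcheck.close()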
The following condensed list of the top 50 stopwords was produced by importing the CSV into Excel and formatting it. Note that the most common stopword is “I” (versus “the” in Brown), a result not hard to believe when one considers the personal nature of tweeting. In future posts, I will compare how the frequency of stopwords differs between the Twitter corpus and the Brown corpus.