The frequency of stopwords in the Brown corpus

As part of a project I was working on, I was computing the frequency of individual stopwords within different collections of words. (I will write up more specifics of this soon.)

As I computed these stopword frequencies for one collection of words after another, I wanted to compare them to each other, as well as to some standard for the English language in general. That is, if the stopword “the” was the most frequent stopword in one of my collections, does this differ from what I would expect from the English language in general? However, it occurred to me at this point that I had no idea which stopwords were the most common or how frequently they occurred.

Thus, I wanted a baseline to compare against. I decided to compute the most common stopwords and their frequencies across all the words in the Brown corpus, which is included in the NLTK package.
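For anyone who wants to try this, here is a minimal sketch of how such a baseline could be computed. It is not necessarily the exact code I used: the function names and the `top_n` parameter are made up for illustration, matching is assumed to be case-insensitive, and the stopword list is assumed to be NLTK's English list. The Brown and stopwords data also need to be downloaded beforehand with `nltk.download`.

```python
from collections import Counter


def stopword_frequencies(words, stopwords, top_n=50):
    """Return (total_words, rows); each row is (stopword, count, percent of total)."""
    # Assumption: matching is case-insensitive, so "The" counts as "the".
    stopwords = {s.lower() for s in stopwords}
    total = 0
    counts = Counter()
    for word in words:
        total += 1  # every token counts toward the total, stopword or not
        token = word.lower()
        if token in stopwords:
            counts[token] += 1
    rows = [(w, c, 100.0 * c / total) for w, c in counts.most_common(top_n)]
    return total, rows


def brown_stopword_frequencies(top_n=50):
    """Hypothetical wrapper that applies the helper above to the Brown corpus."""
    # Imported here so the helper above works even without NLTK installed.
    from nltk.corpus import brown, stopwords

    # Assumes nltk.download("brown") and nltk.download("stopwords") were run.
    return stopword_frequencies(brown.words(), stopwords.words("english"), top_n)
```

Calling `brown_stopword_frequencies()` returns the total token count along with one row per stopword, which can then be printed as a table. Keeping the counting logic separate from the NLTK wrapper means the same helper can be reused on any other collection of words, which is the comparison I was after in the first place.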

As it turns out, there are 1,161,192 words in the Brown corpus. The following table lists the count and frequency (percentage of the total) for each of the 50 most common stopwords.
