Email Sentiment Analysis

Below is the result of a weekend spent playing around with sentiment analysis over my own corpus of personal email. Email is one of the often-overlooked goldmines of user data; previous work at SendGrid and Return Path has given me some insight into this.

First approach

First I tried the basic approach: Naive Bayes with bag-of-words features, trained on the IMDB movie review dataset. The graph below shows this analysis over the past few years of my email.
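A minimal sketch of that baseline with scikit-learn. The toy reviews below are stand-ins for the IMDB data; real training would load the labeled review corpus instead:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the IMDB reviews
train_texts = ["loved this movie wonderful acting",
               "great fun happy ending",
               "terrible awful boring plot",
               "hated it worst film ever"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Bag-of-words counts fed into Multinomial Naive Bayes
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["what a wonderful great movie"])[0])  # -> 1
```

Each email then gets a predicted label (or a class probability), which is what's being averaged per time bucket in the graph.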

Can you guess when I got married? 😄

This is a good sign. Already, and without much work, I can see clear points in the past where my email should’ve been happier. Let’s also take a look at the volume of my email over this time:

Good to know.

Looking back at this graph and the one before it, one of the first things I notice is all the noisy email in there: the marketing fluff. A simple way to remove this is to ignore any email that has a Precedence: Bulk header, a high bulk score, the word bulk or bounce in the From or Reply-To address, or something like *no*reply*@* in there. Simple filtering, but it seems to do a decent job, and it's as far as I'm willing to take it for now.
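A sketch of that filter. The exact header names and the no-reply pattern are my approximations of the rules above (the bulk-score header is left out since its name varies by provider):

```python
import re

def is_bulk(headers):
  """headers: dict of lowercased header name -> value."""
  if headers.get('precedence', '').lower() == 'bulk':
    return True
  for field in ('from', 'reply-to'):
    addr = headers.get(field, '').lower()
    if 'bulk' in addr or 'bounce' in addr:
      return True
    if re.search(r'no.*reply.*@', addr):  # matches noreply@, no-reply@, etc.
      return True
  return False

print(is_bulk({'from': 'noreply@shop.example.com'}))  # -> True
print(is_bulk({'from': 'mom@example.com'}))           # -> False
```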

Now let’s look at the volume again.

There we go. Looking through my email, this definitely correlates more strongly with times I was receiving more personal emails. It’s not perfect, but pretty close.

Now, let’s try to improve my classifier. The first thing I can do here is use tf–idf weighting instead of raw counts.

This looks better, and after giving the data a quick look over, these results are definitely more in line with the corpus. Now, let’s try using an SVM instead of Naive Bayes.
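Both upgrades amount to a two-line change in scikit-learn; a sketch, again with toy stand-in reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["loved this movie wonderful acting",
         "great fun happy ending",
         "terrible awful boring plot",
         "hated it worst film ever"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Tf-idf weighting instead of raw counts, linear SVM instead of NB
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["terrible boring film"])[0])  # -> 0
```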

Awesome, even better. Although it may be hard to tell from the graphs, the F1 scores agree so far: we’re improving. Oh, and now’s probably a good time to mention how I’m tokenizing. I’m starting off with single words, though I plan to try skip-grams at some point. The only punctuation I keep is “!” and “?”, and I try to catch anything that looks like an emoticon, :) :D ;-) etc. All this, plus using lemmas and removing stopwords.

import re
from nltk.corpus import stopwords
from textblob import TextBlob

def emoji(s):
  # Yield anything that looks like an emoticon, e.g. :) ;-) :D
  eyes, nose, mouth = [':', ';'], ['-'], ['D', ')', '(', '3']
  for i in range(len(s) - 1):
    x = s[i:i+3]
    if len(x) > 2 and x[0] in eyes and x[1] in nose and x[2] in mouth:
      yield x        # eyes + nose + mouth, e.g. :-)
    elif x[0] in eyes and x[1] in mouth:
      yield x[:2]    # eyes + mouth, e.g. :)

def punctuation(s):
  # Keep only the punctuation that carries sentiment
  meaningful_punctuation = ['!', '?']
  for x in s:
    if x in meaningful_punctuation:
      yield x

def tok(m):
  stops = set(stopwords.words("english"))
  az = re.sub("[^a-zA-Z]", " ", m).lower()  # strip everything but letters
  words = [w.lemma for w in TextBlob(az).words if w not in stops]
  return words + list(punctuation(m)) + list(emoji(m))

Second approach

Next, strongly influenced by this work coming out of Stanford’s AI group, I decided to try using unsupervised learning to build word vectors, then a supervised approach to categorize my corpus.

To build the word vectors, I used word2vec within gensim, though there are many other implementations out there. A bit more about it on the author’s blog here; look around his blog a bit while you’re there, the guy is brilliant. With word2vec I set the vector dimensionality to 700, the minimum word count to 50, the context window size to 12, and downsampled at 1e-3.

It can be an insane amount of fun to play around with the model once it’s done. I definitely spent a few hours dicking around with it. Wicked cool stuff. Also, for other great papers on this subject, see this guy, this guy, and this guy. It’s also worth checking out GloVe.

I used Wikipedia’s corpus, a roughly 60 GB XML file, and the UMBC WebBase corpus, about the same size: altogether about 7 billion words. This took a while. I also had to raise the stack size limit on my iMac, otherwise the process would be killed before it completed. For those interested, you can raise your stack size to the hard limit using:

$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       709
-n: file descriptors                2560
$ ulimit -s hard
$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             65532
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       709
-n: file descriptors                2560

Now on to the supervised part. For this, I decided to continue with SVMs, using simple k-means clustering over the word vectors to build my feature vectors. Again, starting with the IMDB dataset.
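The idea is to cluster the word vectors with k-means, then represent each document as a histogram over cluster ids (a “bag of clusters”) that the SVM consumes. A sketch, with random vectors standing in for the trained word2vec model:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random vectors standing in for the trained word2vec embeddings
rng = np.random.RandomState(0)
vocab = ["good", "great", "bad", "awful", "movie", "film"]
word_vectors = {w: rng.randn(20) for w in vocab}

k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0)
km.fit(np.array([word_vectors[w] for w in vocab]))
cluster_of = dict(zip(vocab, km.labels_))

def doc_features(tokens):
  # Histogram of cluster ids over the document's in-vocabulary words
  hist = np.zeros(k)
  for t in tokens:
    if t in cluster_of:
      hist[cluster_of[t]] += 1
  return hist

print(doc_features(["good", "movie", "good"]).sum())  # -> 3.0
```

These fixed-length histograms then replace the tf–idf vectors as SVM input.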

Also, this time, I’m going to throw in a corpus made using Twitter. The approach I used can be found here. Pretty simple idea: tweets with happy emoji in them are most likely positive; tweets with sad emoji in them are most likely negative. I take their work a step further and, if a tweet has a URL in it, get the article text using readability, continuing with their base assumption that positive emoji means good and negative emoji means bad. There must be cases where this doesn’t hold, but I haven’t seen any yet. There’s good information to glean here.
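The labeling heuristic itself is a few lines (the emoticon sets below are my own picks, and the readability URL-fetching step is omitted):

```python
POSITIVE = {':)', ':-)', ':D', ':-D', ';)'}
NEGATIVE = {':(', ':-(', ";("}

def weak_label(tweet):
  # Distant supervision: the emoticon stands in for a human label
  tokens = tweet.split()
  if any(t in POSITIVE for t in tokens):
    return 1   # likely positive
  if any(t in NEGATIVE for t in tokens):
    return 0   # likely negative
  return None  # no emoticon, leave unlabeled

print(weak_label("just got the job :D"))      # -> 1
print(weak_label("flight delayed again :("))  # -> 0
```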

Movie reviews are definitely written with a certain tone; combining these datasets will help overcome some of that by throwing in more common language, albeit some of it Twitter language.

I’d kill for Facebook’s dataset. Not only would it be a fairly good representation of how people normally talk to their friends, their users, as of recently, even have the option to label a post with its ‘Feeling’. I wonder why they added this? 😏

All this took a little while, especially without a good CUDA-capable GPU. Go grab coffee, or, while you’re waiting, read more of their paper, or any of the others I linked to here.

The results are very promising:

Looking closely at these results, and over my own email, these are by far the best. Getting married, my job at Apple, graduating college, and any other events in my life that have resulted in congratulations from friends and family members over email are clearly shown.

The ‘quantified self’ is very interesting to me. Email is a great place to start. I’ve also got all my IRC logs since the beginning of time. And now that Messages on OS X includes SMS messages, of course along with the iMessages, in a few years I could have a corpus of chats to analyze. Facebook Messenger and others also make it easy to fetch chats. I’m super interested in this. Texts likely say a whole lot more about my mood over time than email does, as the majority of my remote communication happens there. For those interested, something like the AppleScript below could be used to collect it. Along these same lines, Stephen Wolfram’s Personal Analytics is a good read, and I’d watch a few of these as well.

on write_to_file(this_data, event_description, target_file)
  set timeStr to time string of (current date)
  do shell script "echo  " & quoted form of timeStr & " >>  " & quoted form of target_file
  do shell script "echo  " & quoted form of event_description & " >>  " & quoted form of target_file
  do shell script "echo  " & quoted form of this_data & " >>  " & quoted form of target_file
end write_to_file

using terms from application "Messages"
  on message sent theMessage with eventDescription for theChat
    my write_to_file(theMessage, eventDescription, "/Users/adammenges/corpora/messages/" & theChat) -- TODO: pull name from theChat
  end message sent

  on message received theMessage from theBuddy with eventDescription for theChat
    my write_to_file(theMessage, eventDescription, "/Users/adammenges/corpora/messages/" & theChat)
  end message received
end using terms from

Ending thoughts

Next I’m going to try some 1D CNN approaches. There are also other neat things you could try to glean here. With whom do I have the strongest relationships? The weakest? What kinds of things am I interested in? From there, what’s my mean purchase amount? My salary? Where have I traveled? I wonder what forecasting you could do. Where am I most likely to travel next? What am I most likely to buy?

For those interested in seeing what can be done here, first grab all the corpora linked to here, download your own email using something like the code below, pull up IPython, import sklearn, and have fun!

def corpus(username, password, server='', mailbox="[Gmail]/All Mail"):
  import imaplib
  import os
  mail = imaplib.IMAP4_SSL(server)
  mail.login(username, password), readonly=True)  # don't mark anything as read
  result, data =, 'ALL')
  directory = 'corpus-' + username

  if not os.path.exists(directory):

  # data[0] is a space-separated list of message ids
  for num in data[0].split():
    typ, msg_data = mail.fetch(num, '(RFC822)')
    with open(os.path.join(directory, num.decode()), 'wb') as f:
      f.write(msg_data[0][1])  # the raw RFC822 message

def parse(message):
  from bs4 import BeautifulSoup
  import email
  import dateutil.parser
  msg = email.message_from_string(message)
  lhs = {}
  lhs['Date'] = dateutil.parser.parse(msg['Date'], fuzzy=True).strftime('%Y-%m-%d')

  for part in msg.walk():
    if part.get_content_type() == 'text/plain':
      lhs['Body'] = BeautifulSoup(part.get_payload(), 'html.parser').get_text()

  return lhs

I’ll spend the next few weekends playing around with this and try some other deep approaches, but I’ve nonetheless been impressed by these results. If you’d like to chat, or try out anything interesting here, be sure to get in touch. Maybe I’ll swap out the tf–idf features for a recurrent neural network, see how Dato or Indico perform, or try out NB-SVM. I’m excited to see where NLP and machine learning head next, and to be a part of it and contribute. Machine learning is definitely one of the most exciting places to be working, if not the most exciting. We’re at the cusp of a huge incoming shift. These technologies will have a profound impact on our society, with predictions ranging from utopian to apocalyptic.

“Look at you, hacker: a pathetic creature of meat and bone, panting and sweating as you run through my corridors. How can you challenge a perfect, immortal machine?” ― Ken Levine