Tuesday, October 23, 2012

Twitter Mood Predicts Stock Market Movement


Download: mood_dataset.zip

In this post, I try to predict the daily up and down movement of stock prices using Twitter mood data and machine learning algorithms. Some time ago, I read a paper called “Twitter mood predicts the stock market.” The authors claimed to be able to predict stock price movement with an accuracy of over 86%. They used 9,853,498 tweets posted by 2.7 million English-speaking users in 2008 and showed that general Twitter mood could be used to predict the DJIA.

My initial intention was to reproduce their results. However, their method would have required access to large-scale historical Twitter data, which is not free and probably not cheap. Instead, I found two companies that publish daily mood sentiment for individual stocks, with historical data going back a couple of months. You can download the anonymized dataset I used for IBM and AGN from the link above. I have anonymized the dataset for two reasons: first, I don’t know if it is legal to share this data, and second, I might decide to use this method for my own financial gain – the results are impressive!

For the first company that publishes mood data, I combine the mood data for the previous 2, 5, and 8 days to predict whether a stock price goes up, goes down, or stays flat. The accuracy is around 75% for AGN and 80% for IBM using 8 days of mood data. For the second company, the results were very impressive: around 90%-100% accuracy using just the previous day's mood data for all the stocks I tested. The results are summarized below as confusion matrices.
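To make the "previous N days" idea concrete, here is a minimal sketch of how such rows could be assembled. It follows the column layout described later in the comments (positive and negative mention counts per day, oldest day first). The function names, the sample prices, and the rule that a day is labeled by comparing its adjusted close with the previous day's are assumptions for illustration, not necessarily how the dataset was built.

```python
def label_move(prev_close, close):
    """Classify the day's move relative to the previous adjusted close (assumed rule)."""
    if close > prev_close:
        return "up"
    if close < prev_close:
        return "down"
    return "flat"


def build_rows(days, window):
    """Turn daily records into (features, label) rows.

    days:   list of (positive_mentions, negative_mentions, adj_close),
            ordered oldest first.
    window: how many previous days of mood data feed each prediction.
    """
    rows = []
    for t in range(window, len(days)):
        features = []
        # Mood counts for the `window` days before day t, oldest first.
        for pos, neg, _close in days[t - window:t]:
            features += [float(pos), float(neg)]
        rows.append((features, label_move(days[t - 1][2], days[t][2])))
    return rows


# Hypothetical sample: the mention counts echo the example given in the
# comments, the prices are made up.
days = [
    (18, 5, 100.0),
    (25, 4, 101.0),
    (17, 3, 102.5),
    (22, 3, 102.5),
    (19, 5, 101.0),
]
rows = build_rows(days, window=2)
for features, label in rows:
    print(features, label)
```

With a window of 2, each row carries four features (s0 to s3); a window of 8 would carry sixteen.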

I used a decision tree and 5-fold cross-validation for all tests.
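The evaluation loop can be sketched as follows. The post does not say which decision tree implementation was used, so this sketch swaps in a depth-1 tree (a decision stump) written in plain Python, and runs it on synthetic data. It shows how k-fold cross-validation pools the held-out predictions into one confusion matrix; every name here is hypothetical.

```python
import random
from collections import Counter

LABELS = ["down", "flat", "up"]


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y):
    """Depth-1 decision tree: pick the (feature, threshold) split with the
    fewest training errors and predict the majority class on each side."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        fallback = majority(y)
        return lambda row: fallback
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab


def cross_val_confusion(X, y, k=5, seed=0):
    """Run k-fold cross-validation; return counts[true_label][predicted_label]."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    counts = {label: Counter() for label in LABELS}
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit_stump([X[i] for i in train], [y[i] for i in train])
        for i in fold:  # each sample is predicted exactly once
            counts[y[i]][predict(X[i])] += 1
    return counts


# Synthetic data (not the mood dataset): the label depends on feature 0 only.
X = [[float(i), float((i * 7) % 5)] for i in range(40)]
y = ["up" if i >= 20 else "down" for i in range(40)]
counts = cross_val_confusion(X, y, k=5)
for true_label in LABELS:
    n = sum(counts[true_label].values())
    cells = ["%5.1f %%" % (100.0 * counts[true_label][p] / n) if n else "    N/A"
             for p in LABELS]
    print(true_label.ljust(5), *cells, n)
```

The printed rows use the same convention as the matrices in this post: each row is a true class, each column a predicted class, and the last number is the row count.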

Company (1)

$AGN Confusion Matrix with 2 days' worth of mood data

         down      flat     up        total
down     55.1 %    0.0 %    44.9 %    205
flat     33.3 %    0.0 %    66.7 %    3
up       35.9 %    0.0 %    64.1 %    234
total    198       0        244       442

Note: columns represent predictions, rows represent true classes

$AGN Confusion Matrix with 5 days' worth of mood data

         down      flat     up        total
down     71.7 %    0.0 %    28.3 %    205
flat     66.7 %    0.0 %    33.3 %    3
up       23.7 %    0.0 %    76.3 %    232
total    204       0        236       440

Note: columns represent predictions, rows represent true classes

$AGN Confusion Matrix with 8 days' worth of mood data

         down      flat     up         total
down     74.9 %    0.0 %    25.1 %     203
flat     0.0 %     0.0 %    100.0 %    3
up       23.5 %    0.4 %    76.1 %     230
total    206       1        229        436

Note: columns represent predictions, rows represent true classes

$IBM Confusion Matrix with 8 days' worth of mood data

         down      flat     up        total
down     82.8 %    0.0 %    17.2 %    215
flat     N/A       N/A      N/A       0
up       20.1 %    0.0 %    79.9 %    249
total    228       0        236       464

Note: columns represent predictions, rows represent true classes
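As a sanity check, the overall accuracy can be recovered from a matrix like these: weight each row's diagonal percentage by the row total, then divide by the grand total. The short sketch below does this for the two 8-day matrices, using the figures published above; the function name and data layout are just illustrative choices.

```python
LABELS = ("down", "flat", "up")


def overall_accuracy(matrix):
    """matrix maps each true label to (row_percentages, row_total).

    row_percentages follow the order of LABELS; rows with no samples
    (shown as N/A in the post) are skipped.
    """
    correct = 0.0
    total = 0
    for i, label in enumerate(LABELS):
        percentages, n = matrix[label]
        if n == 0:
            continue
        correct += percentages[i] / 100.0 * n  # diagonal cell as a count
        total += n
    return correct / total


# Figures copied from the 8-day confusion matrices for company (1).
agn_8 = {
    "down": ((74.9, 0.0, 25.1), 203),
    "flat": ((0.0, 0.0, 100.0), 3),
    "up": ((23.5, 0.4, 76.1), 230),
}
ibm_8 = {
    "down": ((82.8, 0.0, 17.2), 215),
    "flat": (None, 0),
    "up": ((20.1, 0.0, 79.9), 249),
}

acc_agn = overall_accuracy(agn_8)
acc_ibm = overall_accuracy(ibm_8)
print("AGN: %.1f %%" % (100 * acc_agn))
print("IBM: %.1f %%" % (100 * acc_ibm))
```

This gives roughly 75% for AGN and 81% for IBM, in line with the accuracies quoted at the top of the post.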

Company (2)

$AGN Confusion Matrix with 1 day's worth of mood data

         down       flat       up         total
down     100.0 %    0.0 %      0.0 %      460
flat     0.0 %      100.0 %    0.0 %      46
up       0.0 %      0.0 %      100.0 %    436
total    460        46         436        942

Note: columns represent predictions, rows represent true classes







8 comments:

  1. Interesting.
    Would you mind detailing your approach? Did you use a neural net as they did? Would you privately share the companies providing the data?
    Julien

  2. I read too fast, you said you used a tree, sorry.

  3. Could you please explain what each column in the mood dataset means, specifically s0, s1, s2, etc.? I am trying to understand the dataset in terms of its columns/fields, but little is said about that.

    Thanks a lot
    --
    Mamoun

  4. For example, using mood data (company 1) for the previous 2 days:

    s0    s1   s2    s3   adj_close
    c     c    c     c    d
                          class
    18.0  5.0  25.0  4.0  up

    The "s" attributes in the dataset refer to the number of positive or negative mentions in social media:

    s0 = 18: positive mentions 2 days before closing price
    s1 = 5: negative mentions 2 days before closing price
    s2 = 25: positive mentions 1 day before closing price
    s3 = 4: negative mentions 1 day before closing price

  5. Dear Dimitri,
    Thank you for your explanation of the attributes. There is only one question remaining: what does each data point (row) represent? How did you get, for example, more than 400 data points (rows) in the dataset when you use a 9-day history?

    Mamoun

  6. Each row represents the data for a particular day of the year. The data includes the closing stock price movement for that day and the sentiment data for the last x days. Each day has a date associated with it (even though it is not shown in the dataset). I could have added an extra column to the dataset to show the date. For example,

    date        s0    s1   s2    s3   adj_close
    d           c     c    c     c    d
                                      class
    01/03/2010  18.0  5.0  25.0  4.0  up
    02/03/2010  17.0  3.0  22.0  3.0  up
    03/03/2010  19.0  5.0  29.0  4.0  down
    04/03/2010  13.0  4.0  25.0  4.0  up
    ....
    ....

  7. Dear Dimitri,

    Thanks a lot for this introduction to using Weka from Python. Do you know whether it can fully create classifiers and nested classifiers using methods like weka.core.Utils.splitOptions? It supports a command like:
    weka.classifiers.meta.MultiScheme -X 0 -S 1 -B "weka.classifiers.rules.ZeroR " -B "weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 20 -W weka.classifiers.trees.DecisionStump" -B "weka.classifiers.trees.RandomForest -I 200 -K 30 -S 1 -num-slots 8" -B "weka.classifiers.meta.CostSensitiveClassifier -cost-matrix \"[0.0 1.0; 10.0 0.0]\" -S 1 -W weka.classifiers.trees.RandomForest -- -I 200 -K 0 -S 1 -num-slots 8" -B "weka.classifiers.rules.JRip -F 3 -N 3.0 -O 2 -S 1"

    Thank you,
    Xavier

    Replies
    1. Sorry, I was replying to one of your older posts.
