In this post, I try to predict the daily up and down movement of stock prices using Twitter mood data and machine learning algorithms. Some time ago, I read a paper called “Twitter mood predicts the stock market.” The authors claimed to predict stock price movement with an accuracy of over 86%. They used 9,853,498 tweets posted by 2.7 million English-speaking users in 2008 and showed that general Twitter mood could be used to predict the DJIA.
My initial intention was to reproduce their results. However, their method would have required access to large-scale historical Twitter data, which is not free and probably not cheap. Instead, I found two companies that publish daily mood sentiment for individual stocks, with historical data going back a couple of months. You can download here the anonymized dataset I used for IBM and AGN. I have anonymized the dataset for two reasons: first, I don’t know if it is legal to share this data, and second, I might decide to use this method for my own financial gain. The results are impressive!
For the first company that publishes mood data, I combine the mood data for the previous 2, 5, and 8 days to predict whether a stock price goes up, goes down, or stays flat. The accuracy is around 75% for AGN and 80% for IBM using 8 days of mood data. For the second company, the results were very impressive: around 90%-100% accuracy using just the previous day's mood data for all the stocks I tested. The results are summarized below as confusion matrices.
I used a decision tree and 5-fold cross-validation for all tests.
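As a rough illustration of that evaluation setup, the sketch below runs 5-fold cross-validation in plain Python. It is not the code used for the results in this post: the depth-one tree (`fit_stump`), the helper names, and the toy data are all assumptions made for the example, and a full decision tree learner would take the stump's place.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y):
    """Depth-one decision tree (a stand-in for a full tree learner).

    Tries every feature/threshold pair and keeps the split with the
    fewest training errors. Returns a function mapping a row to a label.
    """
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # not a real split
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        fallback = majority(y)
        return lambda row: fallback
    _, j, t, l_lab, r_lab = best
    return lambda row: l_lab if row[j] <= t else r_lab


def cross_val_predictions(X, y, k=5, seed=0):
    """Predict every row using a model trained on the other k-1 folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * len(X)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_stump([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            preds[i] = model(X[i])
    return preds


# Toy data (hypothetical): one informative feature, one constant feature.
X_toy = [[float(i), 5.0] for i in range(20)]
y_toy = ["down" if i < 10 else "up" for i in range(20)]

preds = cross_val_predictions(X_toy, y_toy, k=5, seed=0)
accuracy = sum(p == t for p, t in zip(preds, y_toy)) / len(y_toy)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Because every prediction comes from a model that never saw that row during training, the true/predicted pairs can be tallied into a confusion matrix.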
Company (1)
$AGN Confusion Matrix with 2 days worth of mood data
|       | down   | flat  | up     | total |
| down  | 55.1 % | 0.0 % | 44.9 % | 205   |
| flat  | 33.3 % | 0.0 % | 66.7 % | 3     |
| up    | 35.9 % | 0.0 % | 64.1 % | 234   |
| total | 198    | 0     | 244    | 442   |
Note: columns represent predictions, rows represent true classes
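To make the layout concrete, here is a minimal sketch of how such a table can be computed from true and predicted labels. Each row is divided by its own total, so the percentages in a row sum to 100, and a class with no true examples has no defined percentages (returned as `None` here). The function name and the toy labels are made up for illustration.

```python
from collections import Counter

LABELS = ["down", "flat", "up"]


def confusion_table(true_labels, predicted_labels, labels=LABELS):
    """Row-normalized confusion matrix.

    Returns (rows, col_totals), where each row is
    (true_class, [percent predicted as each label], row_total)
    and col_totals counts how often each label was predicted.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    rows = []
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        if row_total == 0:
            percents = [None] * len(labels)  # no true examples of this class
        else:
            percents = [100.0 * counts[(t, p)] / row_total for p in labels]
        rows.append((t, percents, row_total))
    col_totals = [sum(counts[(t, p)] for t in labels) for p in labels]
    return rows, col_totals


# Toy labels (hypothetical), five days of true vs. predicted movement.
true_toy = ["down", "down", "up", "up", "up"]
pred_toy = ["down", "up", "up", "up", "down"]

rows, col_totals = confusion_table(true_toy, pred_toy)
for name, percents, total in rows:
    cells = ["N/A" if p is None else f"{p:.1f} %" for p in percents]
    print(name, cells, total)
print("total", col_totals, sum(col_totals))
```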
$AGN Confusion Matrix with 5 days worth of mood data
|       | down   | flat  | up     | total |
| down  | 71.7 % | 0.0 % | 28.3 % | 205   |
| flat  | 66.7 % | 0.0 % | 33.3 % | 3     |
| up    | 23.7 % | 0.0 % | 76.3 % | 232   |
| total | 204    | 0     | 236    | 440   |
Note: columns represent predictions, rows represent true classes
$AGN Confusion Matrix with 8 days worth of mood data
|       | down   | flat  | up      | total |
| down  | 74.9 % | 0.0 % | 25.1 %  | 203   |
| flat  | 0.0 %  | 0.0 % | 100.0 % | 3     |
| up    | 23.5 % | 0.4 % | 76.1 %  | 230   |
| total | 206    | 1     | 229     | 436   |
Note: columns represent predictions, rows represent true classes
$IBM Confusion Matrix with 8 days worth of mood data
|       | down   | flat  | up     | total |
| down  | 82.8 % | 0.0 % | 17.2 % | 215   |
| flat  | N/A    | N/A   | N/A    | 0     |
| up    | 20.1 % | 0.0 % | 79.9 % | 249   |
| total | 228    | 0     | 236    | 464   |
Note: columns represent predictions, rows represent true classes
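The overall accuracy quoted earlier can be recovered from these tables: multiply each diagonal percentage by its row total to get the number of correct predictions per class, then divide by the grand total. A small sketch using the 8-day AGN and IBM figures above (the function name is just for illustration, and the results are approximate because the percentages are rounded):

```python
def overall_accuracy(diagonal_rows):
    """diagonal_rows: (percent correctly predicted, row total) per true class."""
    correct = sum(percent / 100.0 * total for percent, total in diagonal_rows)
    n = sum(total for _, total in diagonal_rows)
    return correct / n


# (diagonal %, row total) for down, flat, up with 8 days of mood data
agn_8_days = [(74.9, 203), (0.0, 3), (76.1, 230)]
# IBM has no flat days, so only the down and up rows contribute
ibm_8_days = [(82.8, 215), (79.9, 249)]

agn_accuracy = overall_accuracy(agn_8_days)  # about 0.75
ibm_accuracy = overall_accuracy(ibm_8_days)  # about 0.81
print(f"AGN: {agn_accuracy:.3f}  IBM: {ibm_accuracy:.3f}")
```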
Company (2)
$AGN Confusion Matrix with 1 day worth of mood data
|       | down    | flat    | up      | total |
| down  | 100.0 % | 0.0 %   | 0.0 %   | 460   |
| flat  | 0.0 %   | 100.0 % | 0.0 %   | 46    |
| up    | 0.0 %   | 0.0 %   | 100.0 % | 436   |
| total | 460     | 46      | 436     | 942   |
Note: columns represent predictions, rows represent true classes
Interesting,
Would you mind detailing your approach? Did you use a neural net as they did? Would you privately share the names of the companies providing the data?
Julien
I read too fast; you said you used a tree, sorry.
Could you please explain what each column in the mood dataset means, specifically s0, s1, s2, etc.? I am trying to understand the dataset in terms of its columns/fields, but little is said about that.
Thanks a lot
--
Mamoun
For example, using mood data (company 1) for the previous 2 days:
s0    s1   s2    s3   adj_close
c     c    c     c    d
                      class
18.0  5.0  25.0  4.0  up
The "s" attributes in the dataset refer to the number of positive or negative mentions on social media:
s0 = 18: positive mentions 2 days before closing price
s1 = 5: negative mentions 2 days before closing price
s2 = 25: positive mentions 1 day before closing price
s3 = 4: negative mentions 1 day before closing price
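The same layout can be sketched in code. Assuming the raw data is a date-ordered list of (positive mentions, negative mentions, price movement) per day, which is a guess at the input format rather than the actual feed, each row takes the counts from the previous `lags` days, oldest first, and the movement of the current day as the class. The function name, the movements of the first two days, and the later days' values are hypothetical.

```python
def build_rows(daily, lags=2):
    """Turn per-day (positive, negative, movement) tuples into training rows.

    Each row is ([s0, s1, ..., s(2*lags-1)], movement), where the features
    are the positive/negative counts of the `lags` preceding days and the
    class is the movement on the current day.
    """
    rows = []
    for i in range(lags, len(daily)):
        features = []
        for positive, negative, _ in daily[i - lags:i]:  # oldest day first
            features.extend([float(positive), float(negative)])
        rows.append((features, daily[i][2]))
    return rows


# Hypothetical daily data; the first two days reproduce the counts above.
daily = [
    (18, 5, "down"),
    (25, 4, "up"),
    (20, 6, "up"),
    (15, 7, "down"),
]

rows = build_rows(daily, lags=2)
print(rows[0])  # features [18.0, 5.0, 25.0, 4.0] with class "up"
```

With `lags=8` the same function would yield 16 "s" attributes per row, and the first 8 days of the series produce no rows because they lack a full history.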
Dear Dimitri,
Thank you for your later explanation of the attributes. There is only one question remaining: what does each data point (row) represent? How did you get, for example, more than 400 data points (rows) in the dataset when you use a 9-day history?
Mamoun
Each row represents the data for a particular day of the year. The data includes the closing stock price movement for that day and the sentiment data for the last x days. Each day has a date associated with it (even though it is not shown in the dataset). I could have added an extra column to the dataset to show the date. For example,
date        s0    s1   s2    s3   adj_close
d           c     c    c     c    d
                                  class
01/03/2010  18.0  5.0  25.0  4.0  up
02/03/2010  17.0  3.0  22.0  3.0  up
03/03/2010  19.0  5.0  29.0  4.0  down
04/03/2010  13.0  4.0  25.0  4.0  up
....
....
Dear Dimitri,
Thanks a lot for this introduction on using Weka from Python. Do you know if it fully supports creating classifiers and nested classifiers using methods like weka.core.Utils.splitOptions? It supports a command like:
weka.classifiers.meta.MultiScheme -X 0 -S 1 -B "weka.classifiers.rules.ZeroR " -B "weka.classifiers.meta.AdaBoostM1 -P 100 -S 1 -I 20 -W weka.classifiers.trees.DecisionStump" -B "weka.classifiers.trees.RandomForest -I 200 -K 30 -S 1 -num-slots 8" -B "weka.classifiers.meta.CostSensitiveClassifier -cost-matrix \"[0.0 1.0; 10.0 0.0]\" -S 1 -W weka.classifiers.trees.RandomForest -- -I 200 -K 0 -S 1 -num-slots 8" -B "weka.classifiers.rules.JRip -F 3 -N 3.0 -O 2 -S 1"
Thank you,
Xavier
Sorry, I was replying to one of your older posts.