A number of spam datasets are published on this webpage. Two types of datasets are available:
- Static Datasets: 5 datasets of 1000 emails (500 spam/500 legitimate), each extracted from one individual's email over a period of some months.
- Concept Drift Datasets: 2 datasets, each consisting of more than 10,000 emails collected by one individual over a period of approximately 2 years.
Each dataset is a collection of spam and legitimate email received by one individual.
No stop word removal or stemming has been performed on the emails, although feature selection using Information Gain has been applied to the static datasets. HTML markup was not removed but was included in the tokenisation (split on tag delimiters). The names and values of HTML attributes were tokenised separately. The body of the email and certain email headers were included in the tokenisation, including the subject line and the To: and CC: header fields.
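The tokenisation described above can be approximated with a short sketch. This is not the authors' tokeniser: it assumes that "tag delimiters" means the angle brackets `<` and `>`, and that attribute names and values are separated by splitting on `=` and quote characters. The function name `tokenise` is hypothetical.

```python
import re

# Assumption: split on whitespace, the tag delimiters < and >, and on
# '=' and quote characters so that HTML attribute names and values
# become separate tokens. No stemming, stop word removal or case
# folding is applied.
_SPLIT = re.compile(r"[\s<>=\"']+")


def tokenise(text):
    """Return the non-empty tokens of an email body or header field."""
    return [tok for tok in _SPLIT.split(text) if tok]


if __name__ == "__main__":
    print(tokenise('<a href="http://x.com">Click here</a>'))
```

Because the markup is kept rather than stripped, tag names and attribute values such as URLs end up as ordinary tokens alongside the visible text.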
Three types of features have been extracted:
- Word features, which represent a sequence of characters separated by white space or HTML tag delimiters, as discussed above.
- Character features, which represent the occurrence of a single character in the email.
- A small number of Structural features, which represent the structure of the email. These include the proportions of white space, punctuation, uppercase and lowercase characters, and the total number of characters in the email.
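As an illustration of the structural features listed above, the following sketch computes the four proportions and the character count for a piece of text. The exact definitions used to build the datasets are not given here, so the choice of `string.punctuation` for punctuation and the dictionary keys are assumptions.

```python
import string


def structural_features(text):
    """Compute simple structural features of an email's text.

    Assumption: proportions are taken over all characters in the text,
    and punctuation means the ASCII characters in string.punctuation.
    """
    total = len(text)

    def proportion(predicate):
        if total == 0:
            return 0.0
        return sum(1 for ch in text if predicate(ch)) / total

    return {
        "whitespace": proportion(str.isspace),
        "punctuation": proportion(lambda ch: ch in string.punctuation),
        "uppercase": proportion(str.isupper),
        "lowercase": proportion(str.islower),
        "total_chars": total,
    }


if __name__ == "__main__":
    print(structural_features("Hi, YOU!"))
```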
In the concept drift datasets, each feature is represented in the format F99999, where:
- F is the feature type: W is a word feature, C is a character feature and S is a structural feature.
- 99999 is a unique code that represents the feature.
In the static datasets, each feature is represented as F:XXXX, where F is the feature type and XXXX is the actual text of the feature.
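The two naming schemes can be told apart mechanically. The sketch below is one possible way to do so; the function name `parse_feature_name` and the returned tuple layout are hypothetical, and it assumes the numeric code consists only of digits.

```python
import re

FEATURE_TYPES = {"W": "word", "C": "character", "S": "structural"}

# Concept drift datasets: type letter followed by a numeric code.
_DRIFT = re.compile(r"([WCS])(\d+)")
# Static datasets: type letter, a colon, then the feature text itself.
_STATIC = re.compile(r"([WCS]):(.*)", re.DOTALL)


def parse_feature_name(name):
    """Return (feature_type, kind, payload) for a feature name.

    kind is "code" for the F99999 form and "text" for the F:XXXX form.
    """
    match = _DRIFT.fullmatch(name)
    if match:
        return FEATURE_TYPES[match.group(1)], "code", match.group(2)
    match = _STATIC.fullmatch(name)
    if match:
        return FEATURE_TYPES[match.group(1)], "text", match.group(2)
    raise ValueError(f"unrecognised feature name: {name!r}")


if __name__ == "__main__":
    print(parse_feature_name("W00042"))
    print(parse_feature_name("W:free"))
```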
Each email is represented as a line in a CSV text file. Each line contains the following fields:
- dd/mm/yyyy represents the date of the email,
- class represents the classification, either spam or nonspam,
- the remainder of the line is a repeated list of feature-value pairs.
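A line in this layout could be read along the following lines. This is only a sketch under stated assumptions: fields are comma-separated, the date comes first, the class second, and the remaining fields alternate between feature name and value. The function name `parse_email_line` is hypothetical, and values are kept as strings because their type is not specified here.

```python
import csv
from datetime import datetime


def parse_email_line(line):
    """Split one dataset line into (date, label, features).

    Assumptions: comma-separated fields, date in dd/mm/yyyy form,
    then the class, then alternating feature names and values.
    """
    fields = next(csv.reader([line.rstrip("\r\n")]))
    date = datetime.strptime(fields[0].strip(), "%d/%m/%Y").date()
    label = fields[1].strip()
    rest = fields[2:]
    # Pair up name/value fields; a dangling final field is ignored.
    features = {
        rest[i].strip(): rest[i + 1].strip()
        for i in range(0, len(rest) - 1, 2)
    }
    return date, label, features


if __name__ == "__main__":
    # Made-up line for illustration only, not taken from the datasets.
    print(parse_email_line("03/02/2003,spam,W00012,1,S00003,0.25"))
```

Feature text in the static datasets may itself contain commas or quotes, in which case the real files may need different handling from this sketch.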
The static datasets include approximately the top 700 features, selected using Information Gain. Details of the selection process are included in the reference below.
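For readers unfamiliar with Information Gain, the sketch below computes the standard textbook quantity for a single binary feature: the entropy of the class labels minus the expected entropy after splitting on the feature. It is not the authors' selection procedure, which is described in the reference; treating features as present/absent and the function names are assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def information_gain(present, labels):
    """Information Gain of a binary feature with respect to the class.

    present: list of bools, True if the feature occurs in the email.
    labels:  list of class labels, aligned with `present`.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    classes = sorted(set(labels))

    def class_counts(rows):
        return [sum(1 for lab in rows if lab == c) for c in classes]

    gain = entropy(class_counts(labels))
    for value in (True, False):
        subset = [lab for p, lab in zip(present, labels) if p == value]
        gain -= (len(subset) / n) * entropy(class_counts(subset))
    return gain


if __name__ == "__main__":
    labels = ["spam", "spam", "nonspam", "nonspam"]
    print(information_gain([True, True, False, False], labels))
    print(information_gain([True, False, True, False], labels))
```

Ranking all features by this score and keeping the highest-scoring ones is the usual way such a top-N list is produced.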
Each concept drift dataset includes the following files:
- SpamTraining.txt = all spam emails used as initial training data in the concept drift experiments performed using this dataset.
- NonspamTraining.txt = all legitimate emails used as initial training data in the concept drift experiments performed using this dataset.
- TestMMM99.txt = all emails used as 'test' data in the concept drift experiments using this dataset, where MMM represents the month and 99 the year in which the emails were originally received.
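Since the test files are named by month and year, they need to be ordered chronologically rather than alphabetically before replaying them. The sketch below assumes that MMM is a three-letter English month abbreviation and that all two-digit years fall in the same century; the helper name `chronological_key` and the example file names are hypothetical.

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

_TEST_FILE = re.compile(r"Test([A-Za-z]{3})(\d{2})\.txt")


def chronological_key(filename):
    """Return a (year, month) sort key for a TestMMM99.txt file name."""
    match = _TEST_FILE.fullmatch(filename)
    if not match:
        raise ValueError(f"not a test file name: {filename!r}")
    # list.index raises ValueError for an unknown month abbreviation.
    month = MONTHS.index(match.group(1).capitalize()) + 1
    return int(match.group(2)), month


if __name__ == "__main__":
    # Made-up file names for illustration only.
    files = ["TestMar03.txt", "TestDec02.txt", "TestJan03.txt"]
    print(sorted(files, key=chronological_key))
```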
Download the Datasets:
The static datasets are described in the following reference:
Delany SJ, Cunningham P (2004) An Analysis of Case-base Editing in a Spam Filtering System. In: Funk P, González Calero PA (eds.), Advances in Case-Based Reasoning (Proceedings of the Seventh European Conference on Case-Based Reasoning, ECCBR-04), LNAI 3155, p128-141, Springer Verlag.
The concept drift datasets were used in the following references:
Delany SJ, Cunningham P, Tsymbal A (2006) A Comparison of Ensemble and Case-base Maintenance Techniques for Handling Concept Drift in Spam Filtering. In: Sutcliffe G, Goebel R (eds.), Proc. 19th Int. Conf. on Artificial Intelligence FLAIRS'2006, p340-345, AAAI Press.
Delany SJ, Cunningham P, Tsymbal A, Coyle L (2005) A Case-based Technique for Tracking Concept Drift in Spam Filtering. Journal of Knowledge Based Systems 18 (4-5), p187-195, Elsevier.