ECUE Spam Datasets

A number of spam datasets are published on this webpage. Two types of datasets are available:

  • Static Datasets: 5 datasets of 1000 emails (500 spam/500 legitimate), each one extracted from an individual's email over the period of some months.
  • Concept Drift Datasets: 2 datasets each consisting of more than 10,000 emails collected over a period of approximately 2 years by an individual.
Each dataset is a collection of spam and legitimate email received by one individual. No stop word removal or stemming has been performed on the emails, although feature selection using Information Gain has been applied to the static datasets. HTML markup was not removed; it was included in the tokenisation, with tokens split on tag delimiters, and the name and value of each HTML attribute were tokenised separately. The tokenisation covered the body of the email and certain email headers, including the subject line and the To: and CC: header fields.

Three types of features have been extracted.

  • Word features which represent a sequence of characters separated by white space or HTML tag delimiters as discussed above.

  • Character features which represent the occurrence of a single character in the email.

  • A small number of Structural features representing the structure of the email. These include the proportion of white space characters, punctuation characters, uppercase characters and lowercase characters, and the total number of characters in the email.

In the concept drift datasets, each feature is written as F99999, where F is the feature type (W for a word feature, C for a character feature, S for a structural feature) and 99999 is a unique code that identifies the feature.
In the static datasets, each feature is written as F:XXXX, where F is the feature type and XXXX is the actual text of the feature.
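
The two naming schemes can be told apart mechanically. The following is a minimal sketch, not part of the published datasets: the function name `parse_feature_name` and the example feature names are hypothetical, and it assumes the type letter is always an uppercase W, C or S and that the concept drift code consists of digits only.

```python
import re

# Feature type letters as described above.
FEATURE_TYPES = {"W": "word", "C": "character", "S": "structural"}

def parse_feature_name(name):
    """Return (feature_type, identifier, is_static) for a feature name.

    Static form (assumed):        F:XXXX  -> identifier is the feature text
    Concept drift form (assumed): F99999  -> identifier is the numeric code
    """
    # Static datasets: type letter, a colon, then the literal feature text.
    if len(name) >= 2 and name[0] in FEATURE_TYPES and name[1] == ":":
        return FEATURE_TYPES[name[0]], name[2:], True
    # Concept drift datasets: type letter followed by a numeric code.
    match = re.fullmatch(r"([WCS])(\d+)", name)
    if match:
        return FEATURE_TYPES[match.group(1)], match.group(2), False
    raise ValueError(f"unrecognised feature name: {name!r}")

# Hypothetical feature names, for illustration only.
print(parse_feature_name("W00123"))
print(parse_feature_name("W:free"))
```

The static form is checked first so that feature text which itself begins with digits is not mistaken for a code.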

Each email is represented as a line in a csv text file. The format of the line is as follows:

dd/mm/yyyy,class,{feature:#occurrences,}

  • dd/mm/yyyy represents the date of the email,

  • class represents the classification, either spam or nonspam,

  • {feature:#occurrences,} represents a repeated list of feature-value pairs.
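
A line in this format could be read along the following lines. This is a sketch under stated assumptions rather than a reference parser: it assumes the three parts are comma-separated in the order listed, that each pair is written as feature:count with an integer count, and that feature names contain no commas (plausible for the coded F99999 names of the concept drift datasets, but not guaranteed for the F:XXXX text features of the static datasets). The function name `parse_email_line` and the sample line are hypothetical.

```python
from datetime import datetime

def parse_email_line(line):
    """Parse one email record into (date, label, {feature: occurrences})."""
    fields = line.strip().split(",")
    # First field: date of the email in dd/mm/yyyy form.
    date = datetime.strptime(fields[0].strip(), "%d/%m/%Y").date()
    # Second field: the classification, spam or nonspam.
    label = fields[1].strip()
    counts = {}
    for pair in fields[2:]:
        pair = pair.strip()
        if not pair:
            continue  # skip the empty field left by a trailing comma
        # Split on the last colon so a name such as "W:free" stays intact.
        feature, _, occurrences = pair.rpartition(":")
        counts[feature] = int(occurrences)
    return date, label, counts

# Hypothetical record, for illustration only.
sample = "14/02/2003,spam,W00012:3,C00045:1,S00002:7,"
date, label, counts = parse_email_line(sample)
print(date, label, counts)
```

Splitting each pair on its last colon, rather than its first, keeps static-style names that contain a colon usable with the same function.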

The static datasets include approximately the top 700 features, selected using Information Gain. Details of the selection process are included in the reference below.

Each concept drift dataset includes the following files:

  • SpamTraining.txt = all spam emails used as initial training data in the concept drift experiments performed using this dataset.

  • NonspamTraining.txt = all legitimate emails used as initial training data in the concept drift experiments performed using this dataset.

  • TestMMM99.txt = all emails used as 'test' data in the concept drift experiments using this dataset where MMM represents the month and 99 the year the emails were originally received.
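
Because the test files are named by month and year, they can be put into chronological order before replaying them in a drift experiment. The sketch below makes several assumptions that the description above does not confirm: that MMM is a three-letter English month abbreviation such as Jan, that 99 is a two-digit year, and that all files of one dataset fall within the same century. The function name `parse_test_filename` and the file names in the example are hypothetical.

```python
import re

# Assumed three-letter English month abbreviations, in calendar order.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def parse_test_filename(filename):
    """Return (two_digit_year, month_number) for a TestMMM99.txt name."""
    match = re.fullmatch(r"Test([A-Za-z]{3})(\d{2})\.txt", filename)
    if not match:
        raise ValueError(f"not a test file name: {filename!r}")
    # MONTHS.index raises ValueError for an unknown abbreviation.
    month = MONTHS.index(match.group(1).lower()) + 1
    return int(match.group(2)), month

# Hypothetical file names, sorted by (year, month).
names = ["TestMar03.txt", "TestNov02.txt", "TestJan03.txt"]
print(sorted(names, key=parse_test_filename))
```

Returning the year before the month lets the tuple serve directly as a sort key.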

Download the Datasets:

Dataset                   Size    Number of Unique Features   Number of Emails
Concept Drift Dataset 1   8.2MB   287,034                     10,983
Concept Drift Dataset 2   6.9MB   166,047                     11,905
Static Datasets           0.5MB   various                     1000 each (500 spam)


The static datasets are described in the following reference:

Delany SJ & Cunningham P (2004) An Analysis of Case-base Editing in a Spam Filtering System, In: Funk P & González Calero PA (eds.), Advances in Case-Based Reasoning (Proceedings of the Seventh European Conference on Case-Based Reasoning, ECCBR-04), LNAI 3155, p128-141, Springer Verlag.

The concept drift datasets were used in the following references:

Delany SJ, Cunningham P & Tsymbal A (2006) A Comparison of Ensemble and Case-base Maintenance Techniques for Handling Concept Drift in Spam Filtering, In: Sutcliffe G & Goebel R (eds.), Proc. 19th Int. Conf. on Artificial Intelligence FLAIRS'2006, p340-345, AAAI Press.

Delany SJ, Cunningham P, Tsymbal A & Coyle L (2005) A Case-based Technique for Tracking Concept Drift in Spam Filtering, Journal of Knowledge Based Systems 18 (4-5), p187-195, Elsevier.