ECUE Spam Concept Drift Datasets


 

The ECUE Spam Concept Drift Datasets each consist of more than 10,000 emails collected over a period of approximately 2 years. Each is a collection of spam and legitimate email received by an individual.

No stop word removal, stemming, or feature reduction has been performed on the emails before tokenisation. HTML markup was not removed but was included in the tokenisation (split on tag delimiters). The name and value of HTML attributes were tokenised separately. URLs and email addresses were parsed on the '\' and '.' delimiters respectively. The body of the email and certain email headers were included in the tokenisation, including the subject line and the To: and CC: header fields.

Three types of features have been extracted.

  • Word features which represent a sequence of characters separated by white space or HTML tag delimiters as discussed above.

  • Character features which represent the occurrence of a single character in the email.

  • A small number of Structural features representing the structure of the email. These include the proportions of white space characters, punctuation characters, uppercase characters and lowercase characters, and the total number of characters in the email.
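The character and structural features listed above can be illustrated with a short Python sketch. This is not the extraction code used to build the datasets: the function names, the dictionary keys and the use of string.punctuation as the definition of "punctuation" are all assumptions made for illustration.

```python
import string
from collections import Counter


def character_counts(text: str) -> Counter:
    """Count how often each single character occurs in the text."""
    return Counter(text)


def structural_features(text: str) -> dict:
    """Compute simple structural statistics for one email's text.

    Assumption: punctuation means the ASCII set in string.punctuation;
    the datasets may have used a different definition.
    """
    total = len(text)
    counts = {
        "whitespace": sum(ch.isspace() for ch in text),
        "punctuation": sum(ch in string.punctuation for ch in text),
        "uppercase": sum(ch.isupper() for ch in text),
        "lowercase": sum(ch.islower() for ch in text),
    }
    # Proportions are relative to the total number of characters;
    # an empty text yields 0.0 for every proportion.
    features = {name: (n / total if total else 0.0) for name, n in counts.items()}
    features["total_chars"] = total
    return features
```

For the seven-character string "Hi, you", this gives one white space character, one punctuation character, one uppercase character and four lowercase characters, each reported as a fraction of seven.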

As the emails are personal, the features are not stored in readable form. Each feature is instead represented by a code in the format X99999, where X is the feature type (W is a word feature, C is a character feature and S is a structural feature) and 99999 is a unique number that identifies the feature.

Each email is represented as one line in a CSV text file. The format of the line is as follows:

dd/mm/yyyy,class,{feature;#occurrences,}

where

  • dd/mm/yyyy represents the date of the email,

  • class represents the classification either spam or nonspam,

  • {feature:#occurrences,} represents a repeated list of feature-value pairs, each giving a feature code and the number of times that feature occurs in the email.
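A line in this format could be read with a sketch along the following lines. It is not part of the datasets. The format line and its description above show different separators between a feature and its value (';' and ':'), so the sketch accepts either. The names Email, parse_line and feature_type are hypothetical, and values are read as floats on the assumption that structural features may be stored as proportions.

```python
import re
from datetime import date, datetime
from typing import Dict, NamedTuple

# Feature type prefixes as described in the text.
FEATURE_TYPES = {"W": "word", "C": "character", "S": "structural"}


class Email(NamedTuple):
    received: date
    label: str                  # "spam" or "nonspam"
    features: Dict[str, float]  # feature code -> value


def parse_line(line: str) -> Email:
    """Parse one 'dd/mm/yyyy,class,{feature;#occurrences,}' line."""
    fields = [field.strip() for field in line.strip().split(",")]
    received = datetime.strptime(fields[0], "%d/%m/%Y").date()
    label = fields[1]
    features: Dict[str, float] = {}
    for pair in fields[2:]:
        if not pair:  # skip the empty field left by a trailing comma
            continue
        # Assumption: the separator is either ';' or ':'.
        code, value = re.split(r"[;:]", pair, maxsplit=1)
        features[code] = float(value)
    return Email(received, label, features)


def feature_type(code: str) -> str:
    """Map a feature code such as 'W00012' to its feature type."""
    return FEATURE_TYPES[code[0]]


# Made-up example line; the codes and values are not taken from the datasets.
sample = "14/02/2003,spam,W00012;3,C00045;17,S00001;0.25,"
email = parse_line(sample)
```

Keeping the features in a dictionary keyed by code reflects the sparse layout of the file, where each line lists only the features that occur in that email.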

The following files are included in each dataset:

  • SpamTraining.txt = all spam emails used as initial training data in the concept drift experiments performed using this dataset.

  • NonspamTraining.txt = all legitimate (nonspam) emails used as initial training data in the concept drift experiments performed using this dataset.

  • TestMMM99.txt = all emails used as 'test' data in the concept drift experiments performed using this dataset, where MMM represents the month and 99 the year in which the emails were originally received. There is one such file for each month of test data.
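Because concept drift experiments process the test data in time order, it can help to sort the TestMMM99.txt files chronologically rather than alphabetically. The sketch below assumes that MMM is a three-letter English month abbreviation (for example Jan) and that the two-digit year falls in the 2000s. Neither detail is stated above, and the function name month_of_test_file is hypothetical.

```python
import re
from datetime import date

# Assumption: months are three-letter English abbreviations.
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}
TEST_FILE_PATTERN = re.compile(r"^Test([A-Za-z]{3})(\d{2})\.txt$")


def month_of_test_file(filename: str) -> date:
    """Return the first day of the month encoded in a TestMMM99.txt name."""
    match = TEST_FILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a test file name: {filename!r}")
    month = MONTHS[match.group(1).capitalize()]
    year = 2000 + int(match.group(2))  # assumption: years are in the 2000s
    return date(year, month, 1)


# Made-up file names used only to show the ordering.
names = ["TestJan04.txt", "TestMar03.txt", "TestDec03.txt"]
ordered = sorted(names, key=month_of_test_file)
```

Sorting by the decoded date places TestMar03.txt before TestDec03.txt and TestJan04.txt, which a plain alphabetical sort would not do.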


Download the Datasets:

            Size     Number of Unique Features   Number of Emails
Dataset 1   8.2MB    287,034                     10,983
Dataset 2   6.9MB    166,047                     11,905

 


These datasets were introduced in the following references:

Delany SJ, Cunningham P, Tsymbal A (2006) A Comparison of Ensemble and Case-base Maintenance Techniques for Handling Concept Drift in Spam Filtering, In: G. Sutcliffe and R. Goebel (eds.), Proc. 19th Int. Conf. on Artificial Intelligence FLAIRS'2006, p340-345, AAAI Press.

Delany SJ, Cunningham P, Tsymbal A, Coyle L (2005) A Case-based Technique for Tracking Concept Drift in Spam Filtering, Journal of Knowledge Based Systems 18 (4-5), p187-195, Elsevier.

