ECUE Spam Concept Drift Datasets

The ECUE Spam Concept Drift Datasets each consist of more than 10,000 emails collected over a period of approximately 2 years. Each is a collection of spam and legitimate email received by an individual. No stop word removal, stemming, or feature reduction has been performed on the emails before tokenisation. HTML markup was not removed but was included in the tokenisation (split on tag delimiters). The name and value of HTML attributes were tokenised separately. URLs and email addresses were parsed on the '\' and '.' delimiters respectively. The body of the email and certain email headers were included in the tokenisation, including the subject line and the To: and CC: header fields. Three types of features have been extracted.
Because the emails are personal, each feature is encoded in the format X99999, where X is the feature type (W for a word feature, C for a character feature, S for a structural feature) and 99999 is a unique code that identifies the feature. Each email is represented as one line in a CSV text file. The format of the line is as follows: dd/mm/yyyy,class,{feature;#occurrences,}
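The line format above can be read with a few lines of Python. This is a minimal sketch and not part of the dataset distribution: the names Email, parse_line and feature_type are hypothetical, the class label is kept as an uninterpreted string because its possible values are not listed here, and the braces in the format are assumed to be notation for a repeated feature;#occurrences group rather than literal characters (they are stripped if present).

```python
from datetime import date, datetime
from typing import Dict, NamedTuple


class Email(NamedTuple):
    """One parsed line of a dataset file (hypothetical container)."""
    received: date
    label: str                # class field, kept exactly as written in the file
    features: Dict[str, int]  # feature code (e.g. "W00012") -> occurrence count


# Feature-type prefixes as described in the text.
FEATURE_TYPES = {"W": "word", "C": "character", "S": "structural"}


def parse_line(line: str) -> Email:
    """Parse 'dd/mm/yyyy,class,feature;count,feature;count,...'."""
    fields = line.strip().split(",")
    received = datetime.strptime(fields[0], "%d/%m/%Y").date()
    label = fields[1]
    features: Dict[str, int] = {}
    for item in fields[2:]:
        # Assumption: braces are notation only; drop them if they do appear.
        item = item.strip("{} ")
        if not item:
            continue  # tolerate a trailing comma at the end of the line
        code, count = item.split(";")
        features[code] = int(count)
    return Email(received, label, features)


def feature_type(code: str) -> str:
    """Map a feature code such as 'S00001' to its feature type name."""
    return FEATURE_TYPES[code[0]]


if __name__ == "__main__":
    # Made-up example line; the label "spam" and the codes are illustrative only.
    email = parse_line("03/02/2004,spam,W00012;3,C00007;1,S00001;2,")
    print(email.received, email.label, email.features)
```

Keeping the features as a dictionary from code to count preserves the sparse representation of the file; converting to a dense vector is left to whatever learner consumes the data.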
The following files are included in each dataset:
Download the Datasets:
These datasets were introduced in the following references:

Delany SJ, Cunningham P, Tsymbal A (2006) A Comparison of Ensemble and Case-base Maintenance Techniques for Handling Concept Drift in Spam Filtering. In: G. Sutcliffe and R. Goebel (eds.), Proc. 19th Int. Conf. on Artificial Intelligence FLAIRS'2006, p340-345, AAAI Press.

Delany SJ, Cunningham P, Tsymbal A, Coyle L (2005) A Case-based Technique for Tracking Concept Drift in Spam Filtering. Journal of Knowledge Based Systems 18 (4-5) p187-195, Elsevier.