HW4_Text-1

July 15, 2018 | Author: Utkarsh Shrivatava | Category: Statistical Classification, Areas Of Computer Science, Statistics, Applied Mathematics, Computing
Share Embed Donate


Short Description

Assignment...

Description

DATA MINING FOR BUSINESS ANALYTICS Section 99: Spring 2015 Due: Saur!a" Saur!a"## Apri$ 1%# 2015# &a' Homework #4

 _______________  (put your name above)

 Total grade: _______ out of ______  points

 This assignment will give you hands- on experience in building text classication models, using the application of email spam ltering !ou will use "e#a to convert the textual data (emails) into feature vectors and build text mining models for automatic spam ltering The email messages you will use were delivered to a particular server between $ %pr &'' and  *ul &'' The target variable represents whether an email is either spam or ham (non-spam) +ollow the directions and answer any uestions !our report should not be verbose, but should present your results clearly and professionally .aveat: /sing large sets of words, building the models and evaluating them with cross-validation reuires a lot of memory The assignment as#s you to save your data les before each "e#a run 0n the in-class lab at the beginning of the semester, you should have increased your heapsi1e 0f for some reason you get memory2heap errors anyway, try restarting and avoiding having any other applications open, except where indicated 0f you have problems for more than 3 or 4' minutes, contact the T% or me--don5t spend time being frustrated with that6 4 7n the blac#board, you will nd a le called spam_data_Text.arf , which contains all the emails in our dataset and is in a format ready-to-input by "e#a 0f you were to open this le in "ord8ad or some other capable text editor, you would see that each instance is 9ust a string of the email text (between single uotes), with the target variable at the end Thus the ar le has two variables for each instance: the text string, and the target variable (called ;class;), which ta#es on two values, spam and ham  To be able to build spam ltering predictive models, we need to convert the email texts into feature vectors and engineer the features !ou can do this by following the steps below: 4)

 
View more...

Comments

Copyright ©2017 KUPDF Inc.
SUPPORT KUPDF