Using Natural Language Processing to Filter Spam

E-mail describing Project
from an e-mail i sent to my brother

To      : troq
Subject : nlp: spam detection

ok, so on the same topic as the last e-mail i sent you, here is what i'm
doing so far:

Text is data, and there are 2 main types of that data: text characters and
text delimiters.  delimiters come in 2 types as well: those that separate
groups of characters into words and those that separate groups of words
into phrases.

I know that this eliminates a lot of the details of how text is
structured, like what part is the noun, what part is the verb, how these
relate, negated phrases, prepositional phrases, descriptive phrases, etc.,
but developing the heuristics to parse through that is too much for me
right now...  instead i want to keep this simple for now, and i'm not
too sure what difference the deep meaning makes to the identification of the
text.  maybe once i fall on my face i'll begin to understand, but for
now...

By doing frequency counts of the words and the phrases of a known type of
email, i should be able to develop a fingerprint of that type of email.
If i develop that fingerprint for multiple types of email (say,
spam/not-spam), then i should be able to compare any given e-mail to both
sets of data to see which type that particular e-mail is more like.  I'll
compare the frequencies of words and phrases, weighted by whether the
phrase/word occurs in both meta-sets of email and by how long the word/phrase
is.  If an email has more words/phrases in common with spam, then it is
more likely spam, and vice versa.

so, do you think this will work?



currently, i've written a script to run through my email, do the
word/phrase counting and insert the counts into a database.  next i work on
the comparison, and then i'll see whether or not i can get some good matches.

i'll have a webpage with current status shortly.

-f
http://www.blackant.net/

 

Code

---snip---

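	# split $_ (the text currently being read); splitting on ' ' skips
	# leading whitespace, so no empty "word" gets counted
	my @words = split(' ');

	foreach my $word (@words) {
		# count for this message (++ on a missing key starts from 0)
		$mail[$message]{$word}++;

		# count across all messages
		$allw{$word}++;
	}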

---more coming later---
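
The snippet above only covers single words. For the phrase side, here is a
minimal sketch of one way to count phrases, not necessarily how the original
script does it. It assumes a phrase is any run of 3 consecutive words (the
phrases in the result tables are all 3 words long) and that sentence
punctuation followed by whitespace acts as the phrase delimiter. The helper
name phrases and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of $n consecutive words in $text.
# Assumption: . ! ? ; : followed by whitespace (or end of text) separates
# phrases, so a phrase never spans two sentences.
sub phrases {
    my ($text, $n) = @_;
    my @found;
    foreach my $chunk (split /[.!?;:]+(?:\s+|\z)/, $text) {
        my @words = split ' ', $chunk;
        # slide a window of $n words across the chunk
        for my $i (0 .. @words - $n) {
            push @found, join ' ', @words[$i .. $i + $n - 1];
        }
    }
    return @found;
}

# Made-up sample text, only here to exercise the helper.
my $body = "To be removed from our mailing list please reply. "
         . "You will be removed from our mailing list.";

my %phrase_count;
$phrase_count{$_}++ for phrases($body, 3);

foreach my $p (sort keys %phrase_count) {
    printf "%-24s %d\n", $p, $phrase_count{$p};
}
```

Case is left alone here, so "To be removed" and "to be removed" would be
counted as different phrases; whether to fold case is a separate choice.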

 

Results

Word/phrase frequency counts from a selection of 32 spam messages:


+------+-------+
| word | count |
+------+-------+
| the  |   290 |
| to   |   274 |
| you  |   228 |
| and  |   206 |
| of   |   205 |
| a    |   159 |
| for  |   120 |
| is   |   109 |
| in   |   106 |
| your |   104 |
+------+-------+

+---------------------+-------+
| phrase              | count |
+---------------------+-------+
| year and Save       |    16 |
| ready to mail       |     9 |
| be removed from     |     9 |
| Move the name       |     8 |
| field is empty      |     8 |
| on the Internet     |     8 |
| to this message     |     8 |
| address in REPORT   |     8 |
| All rights reserved |     8 |
| 2 MONTHS FREE       |     7 |
+---------------------+-------+



Now we gather all the phrases from a spam email and compare their frequency
to the frequency of the same phrase in the selection of spam, and a
selection of good email:

Phrase                                 Email   Good    Spam
-----------------------------------------------------------
you will be                               1       1       2
free of charge                            1       0       1
from all of                               1       0       1
be able to                                1       3       3
you wish to                               1       0       2
LET US HELP                               1       0       2
it as a                                   1       1       0
every 100 days                            1       0       1
TO PARTICIPATE IN                         1       0       1
doubling every 100                        1       0       1
wish to be                                1       0       2
is doubling every                         1       0       1
YOU NEED to                               1       2       2
be PERMANENTLY REMOVED                    1       0       1
If you wish                               1       0       2

cool, results look promising.  The above shows that the email message in
question has more phrases in common with the selection of spam than with
the selection of good email, which makes sense b/c it is spam.
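
One way a comparison table like this could be produced is sketched here. The
counts for the four matching phrases are copied from the comparison table
just above; everything else (the hash names, the compare_rows helper, the
extra unseen phrase) is hypothetical, and the real script reads its counts
from a database rather than from literals. The sketch also assumes that
phrases found in neither selection are left out of the table, since every
row shown has at least one non-zero count.

```perl
use strict;
use warnings;

# Phrase counts for the two reference selections (a small excerpt).
my %good = ('you will be' => 1, 'be able to' => 3, 'it as a' => 1);
my %spam = ('you will be' => 2, 'be able to' => 3, 'free of charge' => 1);

# Phrase counts for the one message being examined. The last phrase is
# made up to show what happens when neither selection knows a phrase.
my %email = (
    'you will be'        => 1,
    'free of charge'     => 1,
    'be able to'         => 1,
    'it as a'            => 1,
    'some unseen phrase' => 1,
);

# Return [phrase, email count, good count, spam count] rows, skipping
# phrases that appear in neither selection (an assumption, see above).
sub compare_rows {
    my ($email, $good, $spam) = @_;
    my @rows;
    foreach my $p (sort keys %$email) {
        my $g = $good->{$p} // 0;    # // needs perl 5.10 or later
        my $s = $spam->{$p} // 0;
        next unless $g || $s;
        push @rows, [$p, $email->{$p}, $g, $s];
    }
    return @rows;
}

printf "%-36s %7s %7s %7s\n", 'Phrase', 'Email', 'Good', 'Spam';
printf "%-36s %7d %7d %7d\n", @$_ for compare_rows(\%email, \%good, \%spam);
```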

let's see if that is repeatable by running the routine against a different
spam message...

Phrase                                 Email   Good    Spam
-----------------------------------------------------------
mailing list please                       1       0       2
This offer is                             1       0       2
the link below                            1       0       1
To be removed                             1       0       1
our mailing list                          1       0       2
be removed from                           1       0       1
removed from our                          1       0       7
from our mailing                          1       0       1

great!  this spam's phrases show up only in the spam column, as
they should.  so it seems we can identify spam correctly.


now let's try running the same routine on a not-spam message...

Phrase                                 Email   Good    Spam
-----------------------------------------------------------
to use the                                1       1       0
you did not                               1       0       1
Be sure to                                1       1       0
it will be                                1       1       2
It does not                               1       0       3

not so good...  there are more phrases in common with the spam selection
than with the good selection, but this time that sucks b/c the message in
question is not spam.


let's try it on another good message...

Phrase                                 Email   Good    Spam
-----------------------------------------------------------
but I dont                                1       1       0
just in case                              2       0       1

damn.  still more in common with the spam selection...


ok, one more time (keep going until we get the results we want, right?)

Phrase                                 Email   Good    Spam
-----------------------------------------------------------
i could be                                1       1       0
i think i                                 1       1       0

yes, it works!  too bad we only correctly identified 1 of 3 messages that were
not-spam as being not-spam.


So what does this mean?

We identified 2 for 2 spam messages as spam.  We identified 1 for 3
not-spam messages as not-spam.


A few thoughts present themselves:

1- perhaps my selection of mail was not wide enough.  perhaps with a larger
selection of spam and a larger selection of not-spam, the phrase comparison
will be better.

2- perhaps matching by highest percentage is incorrect.  if we instead
interpret the above results by only considering a message spam if 75% or
more phrases are in common with the spam side, then all the above messages
will get matched correctly.

3- perhaps if i also did the word frequency counts the results would be more
accurate.
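
To make thought 2 concrete, here is a rough sketch of a cutoff rule run
against the five comparison tables above. How the percentage is actually
computed is not spelled out, so the scoring here is an assumption: each
phrase contributes its Email count times its Good or Spam count, and a
message is called spam when the spam side holds at least 75% of the total.
The spam_share name and the message labels are made up.

```perl
use strict;
use warnings;

# Each row is [email count, good count, spam count], copied in order from
# the five comparison tables above.
my %messages = (
    'spam #1' => [[1,1,2],[1,0,1],[1,0,1],[1,3,3],[1,0,2],[1,0,2],[1,1,0],[1,0,1],
                  [1,0,1],[1,0,1],[1,0,2],[1,0,1],[1,2,2],[1,0,1],[1,0,2]],
    'spam #2' => [[1,0,2],[1,0,2],[1,0,1],[1,0,1],[1,0,2],[1,0,1],[1,0,7],[1,0,1]],
    'good #1' => [[1,1,0],[1,0,1],[1,1,0],[1,1,2],[1,0,3]],
    'good #2' => [[1,1,0],[2,0,1]],
    'good #3' => [[1,1,0],[1,1,0]],
);

# Assumed scoring: weight each selection's count by how often the phrase
# occurs in the message, then return the spam side's share of the total.
sub spam_share {
    my @rows = @_;
    my ($good, $spam) = (0, 0);
    for my $r (@rows) {
        my ($e, $g, $s) = @$r;
        $good += $e * $g;
        $spam += $e * $s;
    }
    my $total = $good + $spam;
    return $total ? $spam / $total : 0;    # no matched phrases: call it 0
}

my $threshold = 0.75;    # the 75% cutoff from thought 2
for my $name (sort keys %messages) {
    my $share = spam_share(@{ $messages{$name} });
    printf "%-8s %5.1f%%  %s\n", $name, 100 * $share,
        $share >= $threshold ? 'spam' : 'not spam';
}
```

With this scoring, both spam messages land at or above the cutoff and all
three good messages fall below it, though the first spam message only just
clears it (22 of 29, about 76%).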


Only time, more coding and more testing will tell.

Update


thought #2 was the more accurate one.  upon further inspection of the data,
it seems the cut-off point is about 70%.  that is, if 70% or more of the
phrases matched in a message are spam phrases, then the mail is most likely
spam.  using this classification, so far i have had the following results:

Email    Correctly   Incorrectly
Type      Detected      Detected

Spam           16             0
Not Spam       16             0


So far, so good...