Semantic-aware unsolicited mail filtering with reduction of labelling efforts

Laorden Gómez, Carlos

dc.contributor.advisor	Álvarez Marañón, Gonzalo	es_ES
dc.contributor.author	Laorden Gómez, Carlos	es_ES
dc.contributor.other	Facultad de Ingeniería	es_ES
dc.contributor.other	SISTEMAS DE INFORMACION	es_ES
dc.date.accessioned	2024-02-20T10:01:55Z
dc.date.available	2024-02-20T10:01:55Z
dc.date.issued	2012-07-05
dc.description.abstract	Electronic mail is a powerful communication channel. Nevertheless, as happens with all useful media, it is prone to misuse. Spam has become a significant problem for e-mail users over the past decade; an enormous amount of spam arrives in people's mailboxes every day. Spam is also a major computer security problem: it is a medium for phishing (i.e., attacks that seek to acquire sensitive information from end-users) and for spreading malicious software (e.g., computer viruses, Trojan horses, spyware and Internet worms). To find a solution to this problem, the research community has made a great effort, with good results in solving text categorization problems. Thus, spam filtering systems have adapted different machine-learning techniques, providing a satisfactory evaluation of the e-mails' content. These techniques model the e-mails using the Vector Space Model (VSM), an algebraic approach for Information Filtering, Information Retrieval, indexing and "ranking". This model represents natural language documents mathematically as vectors in a multidimensional space, with good results. Still, the VSM assumes that all terms are independent, which, at least from the linguistic point of view, is not entirely correct. Therefore, it cannot support the linguistic phenomena found in natural languages. In a similar vein, the VSM is also affected by other characteristics of the text, such as word sense ambiguity. Indeed, today's attacks against Bayesian spam filters attempt to keep the content of spam e-mail visible to humans but obscured from filters. This could lead to misclassified legitimate e-mails and to spammers evading filtering. Furthermore, junk e-mails evolve at an incredible pace to adapt to the most effective classifiers and surpass the filters, thereby limiting how long spam collections and classifiers remain valid.
This is why obtaining properly labelled datasets for the training phase of the machine-learning methods used by anti-spam filters is a very complex task. Against this background, we propose to study the application of new techniques capable of overcoming the semantic limitations of current spam filtering systems. To this end, we propose (i) the application of a new representation based on the VSM, the enhanced Topic-based Vector Space Model (eTVSM), and (ii) a disambiguation pre-process that enhances the filtering capabilities of anti-spam systems. We also propose to reduce the labelling effort that machine-learning methods need to perform properly by applying (i) collective classification, which looks for connections between the different documents to optimise the classification, and (ii) anomaly detection, which generates a model based solely on one class and detects deviations from this model.	es_ES
dc.identifier.uri	http://hdl.handle.net/20.500.14454/1017
dc.language.iso	eng	es_ES
dc.publisher	Universidad de Deusto	es_ES
dc.subject	Mathematics	es_ES
dc.subject	Computer science	es_ES
dc.title	Semantic-aware unsolicited mail filtering with reduction of labelling efforts	es_ES
dc.type	doctoral thesis	es_ES

Files

Original bundle

Name: 2012laordseman.pdf
Size: 7.53 MB
Format: Adobe Portable Document Format

Collections

Doctoral theses