Semantic-aware unsolicited mail filtering with reduction of labelling efforts

dc.contributor.advisor: Álvarez Marañón, Gonzalo [es_ES]
dc.contributor.author: Laorden Gómez, Carlos [es_ES]
dc.contributor.other: Facultad de Ingeniería [es_ES]
dc.contributor.other: SISTEMAS DE INFORMACION [es_ES]
dc.date.accessioned: 2024-02-20T10:01:55Z
dc.date.available: 2024-02-20T10:01:55Z
dc.date.issued: 2012-07-05
dc.description.abstract: Electronic mail is a powerful communication channel. Nevertheless, as happens with all useful media, it is prone to misuse. Spam has become a significant problem for e-mail users over the past decade; an enormous amount of spam arrives in people's mailboxes every day. Spam is also a major computer security problem: it is a medium for phishing (i.e., attacks that seek to acquire sensitive information from end-users) and for spreading malicious software (e.g., computer viruses, Trojan horses, spyware and Internet worms). The research community has made a great effort to solve this problem, with good results obtained by treating it as a text categorization problem. Thus, spam filtering systems have adapted different machine-learning techniques, providing a satisfactory evaluation of the e-mails' content. These techniques model e-mails using the Vector Space Model (VSM), an algebraic approach for Information Filtering, Information Retrieval, indexing and ranking. This model represents natural language documents mathematically as vectors in a multidimensional space, with good results. Still, the VSM assumes that all terms are independent, which, at least from the linguistic point of view, is not entirely correct. Therefore, it cannot capture the linguistic phenomena found in natural languages. In a similar vein, the VSM is also affected by other characteristics of the text, such as word sense ambiguity. Indeed, today's attacks against Bayesian spam filters attempt to keep the content of spam e-mail visible to humans but obscured to filters. This can lead to legitimate e-mails being misclassified and to spammers evading filtering. Furthermore, junk e-mails evolve at an incredible pace to adapt to the most effective classifiers and bypass the filters, which limits how long spam collections and classifiers remain valid.
That is why obtaining properly labelled datasets for the training phase required by the machine-learning methods employed by anti-spam filters becomes a very complex task. In light of this background, we propose to study the application of new techniques capable of overcoming the semantic limitations of current spam filtering systems. To this end, we propose (i) the application of a new representation based on the VSM, the enhanced Topic-based Vector Space Model (eTVSM), and (ii) a disambiguation pre-process that enhances the filtering capabilities of anti-spam systems. Moreover, we propose to reduce the labelling effort that machine-learning methods need in order to perform properly by applying (i) collective classification, which looks for connections between the different documents to optimise the classification, and (ii) anomaly detection, which generates a model based solely on one class and detects deviations from this model. [es_ES]
dc.identifier.uri: http://hdl.handle.net/20.500.14454/1017
dc.language.iso: eng [es_ES]
dc.publisher: Universidad de Deusto [es_ES]
dc.subject: Mathematics [es_ES]
dc.subject: Computer science [es_ES]
dc.title: Semantic-aware unsolicited mail filtering with reduction of labelling efforts [es_ES]
dc.type: doctoral thesis [es_ES]
Files
Original bundle
Name: 2012laordseman.pdf
Size: 7.53 MB
Format: Adobe Portable Document Format