Our work on Cloud-based Spam URL Deduplication for Big Datasets Accepted in the International Journal of Cloud Computing (IJCC)

Congratulations to Shams Zawoad, Ragib Hasan, Gary Warner, and Md Munirul Haque for having their work on Cloud-based Spam URL Deduplication for Big Datasets accepted in the International Journal of Cloud Computing (IJCC).

Shams Zawoad, Ragib Hasan, Gary Warner, and Md Munirul Haque, “Towards a Cloud-based Approach for Spam URL Deduplication for Big Datasets”, International Journal of Cloud Computing (IJCC), 2(3), 2014, pp. 1-14.

Abstract
Spam emails are often used to advertise phishing websites and lure users into visiting them. URL blacklisting is a widely used technique for blocking malicious phishing websites. To prepare an effective blacklist, it is necessary to analyze possible threats and include the identified malicious sites in the blacklist. However, the number of URLs acquired from spam emails is quite large, and fetching and analyzing the content of so many websites is very expensive given limited computing and storage resources. Meeting these massive computing and storage requirements calls for a highly distributed and scalable architecture in which additional fetching and analysis resources can be provisioned on the fly. Moreover, there is a high degree of redundancy in the URLs extracted from spam emails, since multiple spam emails often contain the same URL; preserving the contents of all the corresponding websites therefore wastes significant storage. Additionally, fetching content from a fixed IP address introduces the possibility of being reverse-blacklisted by malicious websites. In this paper, we propose and develop CURLA – a Cloud-based spam URL Analyzer, built on top of Amazon Elastic Compute Cloud (EC2) and Amazon Simple Queue Service (SQS). CURLA deduplicates large numbers of spam-based URLs in parallel, which reduces the cost of establishing an equally capable local infrastructure. Our system builds a database of unique spam-based URLs and accumulates the content of these unique websites in a central repository. This database and website repository will be a valuable resource for identifying phishing and other counterfeit websites. We show the effectiveness of our architecture using real-life, large-scale spam-based URL data.
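The core deduplication idea described in the abstract can be sketched as follows. This is a minimal illustration, not CURLA's actual implementation: a local `queue.Queue` stands in for Amazon SQS, and hashing a crudely normalized URL with SHA-256 is a hypothetical choice of deduplication key.

```python
import hashlib
from queue import Queue

def url_hash(url: str) -> str:
    """Map a URL to a fixed-size key; duplicates hash to the same key.
    Lowercasing is a stand-in for real URL normalization."""
    return hashlib.sha256(url.strip().lower().encode("utf-8")).hexdigest()

def deduplicate(url_queue: Queue, seen: set) -> list:
    """Drain the queue (here playing the role of an SQS queue) and
    return only URLs whose key has not been seen before."""
    unique = []
    while not url_queue.empty():
        url = url_queue.get()
        key = url_hash(url)
        if key not in seen:
            seen.add(key)       # record the key so later copies are skipped
            unique.append(url)  # only first occurrence is kept for fetching
    return unique

q = Queue()
for u in ["http://spam.example/a", "HTTP://SPAM.EXAMPLE/a", "http://spam.example/b"]:
    q.put(u)

print(deduplicate(q, set()))  # the two variants of /a collapse to one entry
```

In the paper's architecture, many such consumers would run in parallel on EC2 instances, with the seen-key set kept in a shared database rather than in process memory.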