Introduction
The Internet is frequently used by criminals for illegal activities such as financial fraud, global voice phishing, online gambling, fake TV shopping, fraudulent prize-winning schemes, and spam SMS in social networks.
The dark side of the Internet has emerged and bedeviled the world.
In recent years, mainland Chinese citizens have lost more than 20 billion Yuan per year to global voice phishing, most of it carried out with the aid of phishing or fake websites hosted outside China. In addition, the widespread use of smartphones has stimulated a rapid increase in mobile and QR-code phishing, especially against elderly people with little knowledge of phishing.
More than 10,000 phishing websites were reported to the Anti-Phishing Alliance of China (APAC) per month on average from August 2011 to May 2017.
We launched this project to collect a dataset of malicious URLs, extract features, and provide sustainable malicious-URL detection support for anti-phishing research and industrial applications. We call for global cooperation and effort to fight the dark side of the Internet and make the online world a better place for all people.
System Development
The system is developed in Java EE with cutting-edge technologies such as a distributed cache (Hazelcast), MapReduce computing (Hazelcast), and a NoSQL database (Cassandra). The technical architecture of the system is shown in Fig.2.

Fig.2 Technical architecture of the learning system
The system consists of four layers.
(1) Distributed database layer. Cassandra was deployed on two machines. Thanks to the Gossip protocol, Cassandra does not need a central node (a minimal connection sketch appears after this list).
(2) Distributed cache layer. Almost all of the functional modules involve reading and writing data, so building caches for “hot data” is essential to reduce access response time. We built three types of cache for the system: shared cache, replicated cache, and local cache. a) The shared cache is used as a distributed big-data grid; all modules share the same data in it. b) A replicated cache is built on every server, so each server holds the same copy of the cached data; it stores the most frequently read but rarely updated data. c) The local cache stores data used only for frequent local computation. (A setup sketch for the three cache types follows this list.)
(3) Service layer. Most of the aforementioned core functions are encapsulated in the service layer. Distance metric learning and classifier training were implemented in Java according to the models and exposed as standard services; they are generally invoked by the crawling service when a new instance arrives. The URL and feature crawling services involve HTTP clients, XML/HTML parsing, and web-service invocation. Business logic here mainly refers to organizing the data used in the GUI and handling the events triggered by users. (An illustrative service contract follows this list.)
(4) Access interface layer. The system may be used by humans or invoked by other systems, so we provide two types of interface: web pages and a web-services API. The web part of the system was developed with JavaServer Faces: the pages are built as XHTML Facelets and exchange data with managed beans. Managed beans are Java classes that retrieve data from the service layer for the pages and propagate actions/events from the pages back to the service layer (a bean sketch follows this list). The web services provide common APIs for interaction with other systems.
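For the database layer, the following is a minimal connection sketch only: the node addresses, keyspace, and table names are hypothetical, and the DataStax Java driver is assumed rather than confirmed by the project.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class UrlStoreExample {
    public static void main(String[] args) {
        // Either Cassandra node can serve as a contact point; Gossip handles
        // membership, so no central/master node is required.
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.0.0.1", "10.0.0.2")   // hypothetical node addresses
                .build();
        Session session = cluster.connect("phishing");       // hypothetical keyspace

        // Read back a few stored URL records (table and columns are illustrative only).
        ResultSet rs = session.execute("SELECT url, label FROM url_instances LIMIT 10");
        for (Row row : rs) {
            System.out.println(row.getString("url") + " -> " + row.getString("label"));
        }

        cluster.close();
    }
}
```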
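For the cache layer, the sketch below shows one way the three cache types could be realized with Hazelcast; the map names are hypothetical and Hazelcast 3.x import paths are assumed.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.ReplicatedMap;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheSetupExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // a) Shared cache: one partitioned data grid whose entries are visible to all members.
        IMap<String, double[]> sharedFeatures = hz.getMap("urlFeatures");        // hypothetical name
        sharedFeatures.put("http://example.com", new double[]{0.1, 0.7});

        // b) Replicated cache: every member keeps a full copy; suited to frequently
        //    read but rarely updated data such as a feature dictionary.
        ReplicatedMap<String, Integer> featureIndex = hz.getReplicatedMap("featureIndex");
        featureIndex.put("domain_age", 0);

        // c) Local cache: a plain in-process map for data used only by local computation.
        Map<String, double[][]> localScratch = new ConcurrentHashMap<>();
        localScratch.put("kernelBlock", new double[64][64]);
    }
}
```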
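For the service layer, the sketch below illustrates what the boundary between the crawling service and a learning service might look like; every type and method name here is hypothetical and not the project's actual API.

```java
// Hypothetical service-layer contract: the crawling service hands each newly
// crawled URL instance to the classification service for labeling.

class UrlInstance {                       // placeholder holder for a crawled URL's features
    String url;
    double[] features;
}

enum Label { PHISHING, BENIGN }

interface ClassificationService {
    /** Classify a single crawled instance as phishing or benign. */
    Label classify(UrlInstance instance);
}

class CrawlingService {
    private final ClassificationService classifier;

    CrawlingService(ClassificationService classifier) {
        this.classifier = classifier;
    }

    /** Invoked after HTTP fetching and XML/HTML feature extraction of a new URL. */
    void onNewInstance(UrlInstance instance) {
        Label label = classifier.classify(instance);
        // ... persist the label, refresh caches, surface the result to the GUI ...
        System.out.println(instance.url + " classified as " + label);
    }
}
```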
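For the access interface layer, a Facelets page could reference #{reportBean.reports} and bind to a managed bean such as the one sketched below; the bean name, property, and stubbed data are hypothetical.

```java
import javax.faces.bean.ManagedBean;
import javax.faces.bean.ViewScoped;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

// Hypothetical managed bean: pulls data from the service layer for the page
// and propagates user actions back down to it.
@ManagedBean(name = "reportBean")
@ViewScoped
public class ReportBean implements Serializable {

    private List<String> reports;   // URLs flagged as phishing (stubbed for the sketch)

    public List<String> getReports() {
        if (reports == null) {
            // In the real system this would call a service-layer API.
            reports = Arrays.asList("http://phish.example", "http://fake.example");
        }
        return reports;
    }

    /** Action handler that a commandButton on the page could invoke. */
    public void refresh() {
        reports = null;
    }
}
```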
The technical architecture was deployed into a distributed environment as a cluster. The modules connected by double arrows were deployed in the same Tomcat application. With shared distributed caches and databases, each of the routines connected by the arrows in Fig.2 can be replicated and deployed as a member of the cluster. The different Tomcat instances were integrated behind a common Nginx server providing reverse-proxy service.
In terms of distributed computing, Hazelcast supports “Map-Reduce” and “ExecutorService” mechanisms, from which the Nyström method and DML benefit. Model training tasks, such as the search for the closest points in DML and the clustering step of the Nyström method, were encapsulated in independent tasks implemented with the ExecutorService and run on the distributed cluster. For space transformation, the kernel and DML matrix information was stored in shared memory, and the transformation of each individual instance was allocated to map tasks.
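The following is a minimal sketch of the ExecutorService pattern described above; the task, pool name, and result type are illustrative, and Hazelcast 3.x APIs are assumed.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

import java.io.Serializable;
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

// Hypothetical training task: a member scans its local chunk of instances for the
// point closest to a query vector; the caller collects results from the members.
class ClosestPointTask implements Callable<double[]>, Serializable {
    private final double[] query;

    ClosestPointTask(double[] query) {
        this.query = query;
    }

    @Override
    public double[] call() {
        // ... scan locally cached instances and return the nearest one ...
        return query;   // placeholder result for the sketch
    }
}

public class DistributedTrainingExample {
    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService exec = hz.getExecutorService("trainingPool");   // hypothetical pool name

        // Tasks are serialized, routed to a cluster member, and executed there;
        // the returned Future lets many such tasks run in parallel.
        Future<double[]> nearest = exec.submit(new ClosestPointTask(new double[]{0.3, 0.9}));
        System.out.println("closest point: " + Arrays.toString(nearest.get()));

        hz.shutdown();
    }
}
```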