Dr. Khan's research emphasizes both fundamentals and applications

With regard to data mining, Dr. Khan has studied clustering and classification problems, and has proposed and developed various data mining techniques for analyzing stream data (e.g., text streams and malware streams). Dr. Khan is researching supervised learning algorithms for classifying evolving data streams. Data streams are continuous flows of data; examples include network traffic, sensor data, and call center records. Their sheer volume and speed pose a great challenge to the data mining community. Dr. Khan's research group, together with the UIUC group, has presented a novel class detection technique that automatically detects previously unseen classes in data streams by analyzing and quantifying their clustering properties. Dr. Khan's group has applied this algorithm in various domains, such as intrusion detection, text mining, and geospatial information management.
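The core intuition behind clustering-based novel class detection can be illustrated with a toy sketch. This is a hypothetical simplification, not the published algorithm: instances falling outside the decision boundary (radius) of every known class cluster are buffered as outliers, and a sufficiently large, cohesive group of such outliers is flagged as a potential novel class.

```python
# Hypothetical sketch of clustering-based novel class detection in a data
# stream (NOT the published algorithm). Known classes are summarized as
# cluster centroids with a decision-boundary radius; cohesive outliers
# accumulating beyond every boundary signal a novel class.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class NovelClassDetector:
    def __init__(self, radius=1.0, q=3):
        self.clusters = []       # list of (centroid, label) for known classes
        self.radius = radius     # decision boundary around each cluster
        self.q = q               # outliers required before declaring novelty
        self.outliers = []

    def train(self, centroid, label):
        self.clusters.append((centroid, label))

    def classify(self, point):
        # Inside the boundary of the nearest known cluster -> known class.
        nearest = min(self.clusters, key=lambda c: dist(point, c[0]))
        if dist(point, nearest[0]) <= self.radius:
            return nearest[1]
        # Otherwise buffer the outlier and test whether the buffer is a
        # cohesive group that is far from every known class.
        self.outliers.append(point)
        if len(self.outliers) >= self.q and self._cohesive():
            return "novel"
        return "outlier"

    def _cohesive(self):
        # Mean pairwise distance among the outliers must be smaller than
        # their mean distance to the nearest known cluster.
        pts = self.outliers
        pairs = [dist(a, b) for i, a in enumerate(pts) for b in pts[i + 1:]]
        mean_out = sum(pairs) / len(pairs)
        mean_known = sum(min(dist(p, c[0]) for c in self.clusters)
                         for p in pts) / len(pts)
        return mean_out < mean_known
```

For example, after training on classes "A" near (0, 0) and "B" near (5, 5), a tight group of points near (10, 10) would eventually be reported as "novel" rather than forced into an existing class.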

Multi-label data is common in text streams: a single text document may cover multiple class labels at the same time. From a classification perspective, an immediate consequence of this characteristic is that traditional binary or multi-class classification techniques perform poorly on multi-label text data. Dr. Khan's group has proposed SISC (Semi-supervised Impurity-based Subspace Clustering) to address the multi-label issue. In addition, Dr. Khan is developing data mining tools for data-intensive applications using cloud computing. His stream mining work was initially funded by NASA, and he has recently received an additional grant from Raytheon to continue it. His work has been published or accepted in top conferences and journals, including ICDM, IJCAI, ECML/PKDD, and TKDE.
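A toy illustration (not the published SISC algorithm) of the impurity idea: when each document carries a *set* of labels rather than one, a cluster can be scored by how poorly its single most common label covers its members.

```python
# Toy cluster-impurity score for multi-label documents (illustrative only,
# not SISC's actual impurity measure). Each document carries a SET of
# labels; a cluster is pure if one label covers most of its members.
def cluster_impurity(label_sets):
    """Fraction of members NOT covered by the most common single label."""
    counts = {}
    for labels in label_sets:
        for lab in labels:
            counts[lab] = counts.get(lab, 0) + 1
    best = max(counts.values())
    return 1.0 - best / len(label_sets)

docs = [{"sports"}, {"sports", "politics"}, {"politics"}, {"sports"}]
print(cluster_impurity(docs))   # 0.25: "sports" covers 3 of 4 documents
```

Note that the second document legitimately belongs to two classes at once, which is exactly the situation a single-label classifier cannot represent.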

Dr. Khan's group has applied data mining techniques to malware detection. Dr. Khan has presented a scalable, multi-level feature extraction technique for detecting malicious executables. He has proposed a novel combination of three kinds of features at different levels of abstraction: binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. He has also proposed an efficient and scalable feature extraction technique and applied it to a large corpus of real benign and malicious executables. The features are extracted from the corpus and used to train a classifier, which achieves high accuracy and low false positive rates in detecting malicious executables. This work, along with the stream mining work, has been extended to antivirus defenses.

Polymorphic malware poses increasing challenges to effective signature-based antivirus protection. Antivirus defenses have nevertheless managed to stay largely ahead in the virus-antivirus coevolution race, because antivirus signature updates are specific and targeted, whereas polymorphic malware variation is random and undirected. Dr. Khan, along with Dr. Hamlen, has proposed a far more powerful malware mutation strategy that uses automated data mining techniques to adapt to specific signature updates fully automatically in the wild. The result is reactively adaptive malware that can survive signature updates without requiring re-propagation. Such malware could be extremely effective in sustaining long-term attacks against large distributed systems that rely on antivirus products for protection. This work was funded by AFOSR and published in conferences such as ICC and PAKDD and in journals such as CSI and Information Systems Frontiers (Springer).

With regard to the semantic web and scalable data management, cloud computing is the newest paradigm in the IT world and the focus of significant research. Companies hosting cloud computing services face the challenge of handling data-intensive applications (e.g., call detail records (CDRs) in the telecommunications industry). Semantic web technologies, which have been standardized by the World Wide Web Consortium (W3C), are an ideal candidate to be used together with cloud computing tools to provide a solution. One such standard is the Resource Description Framework (RDF). With the explosion of semantic web technologies, large RDF graphs are commonplace, and current frameworks do not scale to them. Dr. Khan's group has described a framework built on Hadoop, a popular open source framework for cloud computing, to store and retrieve large numbers of RDF triples. The group has described a scheme for storing RDF data in the Hadoop Distributed File System and presented an algorithm that generates the best possible plan for answering a SPARQL (SPARQL Protocol and RDF Query Language) query based on a cost model. The proposed framework can store large RDF graphs in Hadoop clusters built with cheap commodity-class hardware; it is scalable and efficient and, unlike traditional approaches, can easily handle billions of RDF triples. This work is funded by IARPA and Raytheon and has been accepted/published in premier journals and conferences such as TKDE and IEEE Cloud. Dr. Khan is extending this framework to other datasets (e.g., CDRs), funded by Tektronix.
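The storage idea can be illustrated with a simplified in-memory sketch. This is an assumption-laden miniature, not the Hadoop implementation: triples are partitioned into one "file" per predicate (here, a dict), so a query pattern with a bound predicate reads only that partition, and multi-pattern queries are answered by joining variable bindings. A cost-based planner would additionally order the patterns by selectivity.

```python
# Simplified in-memory sketch of predicate-partitioned RDF storage with a
# naive join over triple patterns (illustrative; the real system stores
# partitions as HDFS files and plans joins with a cost model).
from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.by_predicate = defaultdict(list)   # predicate -> [(subj, obj)]

    def add(self, s, p, o):
        self.by_predicate[p].append((s, o))

    def query(self, patterns):
        """Join patterns of the form (?var|const, predicate, ?var|const)."""
        bindings = [{}]
        # A cost-based planner would reorder patterns by selectivity first;
        # here they are evaluated in the order given.
        for s, p, o in patterns:
            new = []
            for b in bindings:
                for subj, obj in self.by_predicate[p]:
                    b2 = self._match(b, s, subj)
                    b2 = self._match(b2, o, obj) if b2 is not None else None
                    if b2 is not None:
                        new.append(b2)
            bindings = new
        return bindings

    @staticmethod
    def _match(b, term, value):
        # Variables (starting with '?') bind or must agree with a prior
        # binding; constants must match exactly.
        if term.startswith("?"):
            if term in b and b[term] != value:
                return None
            out = dict(b)
            out[term] = value
            return out
        return b if term == value else None
```

For instance, with triples ("alice", "knows", "bob"), ("bob", "knows", "carol"), and ("bob", "age", "30"), the two-pattern query `[("?x", "knows", "?y"), ("?y", "age", "30")]` joins to the single binding `{"?x": "alice", "?y": "bob"}`.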

Besides the cloud-based solution for semantic web data, Dr. Khan has developed a tool named RDFKB (Resource Description Framework Knowledge Base) for managing, persisting, and querying semantic web knowledge using relational database management systems. RDFKB provides a flexible data management schema that allows additions, deletions, and updates at all levels of the data model. RDFKB also supports 1) knowledge inference, 2) provenance and lineage, and 3) probabilities and uncertainty reasoning. Dr. Khan has shown through a variety of use cases and experiments that RDFKB enables each of these tasks, and that RDFKB outperforms all other semantic web repositories in experiments using 26 queries against 2 accepted benchmark datasets. This work is accepted/published in premier conferences including AAAI.
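The relational idea can be sketched in miniature. This is a hypothetical simplification, not RDFKB's actual schema: triples live in one RDBMS table, and inferred facts (here, a single rdfs:subClassOf step, assuming the subclass axiom is loaded first) are materialized at load time, so queries need no inference machinery at runtime.

```python
# Hypothetical miniature of storing RDF triples in an RDBMS and
# materializing inferred facts at load time (RDFKB's real schema and
# inference support are more elaborate; this layout is assumed).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT, inferred INTEGER)")

def add_triple(s, p, o):
    db.execute("INSERT INTO triples VALUES (?, ?, ?, 0)", (s, p, o))
    # Forward-chain one step: if o is a class with known superclasses,
    # materialize the entailed rdf:type facts immediately.
    if p == "rdf:type":
        supers = db.execute(
            "SELECT o FROM triples WHERE s = ? AND p = 'rdfs:subClassOf'", (o,)
        ).fetchall()
        for (sup,) in supers:
            db.execute("INSERT INTO triples VALUES (?, 'rdf:type', ?, 1)", (s, sup))

add_triple("Dog", "rdfs:subClassOf", "Animal")
add_triple("rex", "rdf:type", "Dog")
rows = db.execute("SELECT o FROM triples WHERE s='rex' AND p='rdf:type'").fetchall()
print(sorted(r[0] for r in rows))   # ['Animal', 'Dog'] -- plain SQL, no reasoner
```

Because the entailed fact is stored like any other row, provenance flags such as the `inferred` column can distinguish asserted from derived knowledge.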

With regard to geographic information management, social networks provide services that let users associate a location with their profiles. However, for privacy and security reasons, most users of social networking sites like Twitter are unwilling to specify their locations in their profiles. This creates a need for software that infers a user's location from the implicit attributes associated with him/her. Dr. Khan has developed a tool, Tweecolization, that predicts a user's location purely on the basis of his/her social network, using the strong theoretical framework of semi-supervised learning. Dr. Khan's group has performed extensive experiments to validate the system in terms of both accuracy and running time; experimental results show that Tweecolization outperforms the content-based geo-tagging approach on both measures.
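The semi-supervised intuition can be sketched as simple label propagation over the social graph. This is an illustrative stand-in, not the Tweecolization algorithm: known locations stay fixed, and unlabeled users iteratively adopt the majority location of their neighbors until the assignment stabilizes.

```python
# Illustrative label propagation over a social graph (a stand-in for the
# semi-supervised method, not the actual Tweecolization algorithm):
# labeled users keep their location; unlabeled users adopt their
# neighborhood's majority location until convergence.
from collections import Counter

def propagate_locations(edges, labels, iterations=10):
    """edges: {user: [neighbors]}; labels: {user: location} for known users."""
    guesses = dict(labels)
    for _ in range(iterations):
        updated = dict(guesses)
        for user, nbrs in edges.items():
            if user in labels:          # labeled users stay fixed
                continue
            votes = Counter(guesses[n] for n in nbrs if n in guesses)
            if votes:
                updated[user] = votes.most_common(1)[0][0]
        if updated == guesses:          # converged
            break
        guesses = updated
    return guesses
```

For example, a user with no declared location whose connections are mostly labeled "Dallas" is assigned "Dallas" after one round of propagation.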

Geographic information can be distributed across multiple sources with heterogeneous schemas. To facilitate retrieval of information from these sources, the first step is schema matching. Dr. Khan's group has proposed an innovative technique for aligning heterogeneous schemas in the geospatial domain. To solve this problem, 1:1 matches are derived between pairs of concepts/attributes, and the semantic similarity of each compared pair is then determined. Many widely used, state-of-the-art techniques rely on either syntactic measures, such as string matches between the names of the compared attributes, or shallow semantic techniques, such as a WordNet-based name match, to calculate semantic similarity. However, these techniques are fraught with uncertainty and error, since the name chosen for a concept or attribute is frequently not representative of its instances, or because the schema evolves over time without a corresponding name change being assigned to the attribute/concept. Dr. Khan addressed this problem in his application, GSim, with an alignment technique that examines the types of the instances of each compared attribute/concept: a semantic similarity measurement between two attributes is calculated by matching their instance type distributions. The similarity value is an information-theoretic measure known as entropy-based distribution (EBD) that is naturally normalized. This work is funded by NGA, IARPA, and Raytheon, and published or accepted in top venues such as ACM GIS, IEEE ICSC, and the Web Semantics journal.

After receiving tenure, Dr. Khan applied the fundamentals of his previous multimedia information management research to applications such as surveillance and security; this research was subsequently published at ACM SACMAT and IEEE ISI.
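The instance-type comparison behind GSim can be illustrated with a hedged stand-in. The exact EBD formula is defined in the cited papers and is not reproduced here; instead, this sketch uses a different but likewise normalized entropy-based measure (one minus the Jensen-Shannon divergence of the two type distributions, in [0, 1]), and the `simple_typer` heuristic is purely illustrative.

```python
# Stand-in for EBD-style instance-type matching (NOT the exact EBD formula
# from the GSim papers): attributes are compared by the distributions of
# their instance types, scored with a normalized entropy-based measure.
import math
from collections import Counter

def type_distribution(instances, typer):
    """Empirical distribution over instance types for one attribute."""
    counts = Counter(typer(x) for x in instances)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def js_similarity(p, q):
    """1 minus Jensen-Shannon divergence (base 2); 1.0 = identical mixes."""
    keys = set(p) | set(q)
    m = {k: (p.get(k, 0) + q.get(k, 0)) / 2 for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a.get(k, 0) > 0)
    return 1.0 - (kl(p, m) + kl(q, m)) / 2

def simple_typer(value):
    # Illustrative typer: digit strings -> "number", everything else "string".
    return "number" if str(value).replace(".", "", 1).isdigit() else "string"

# Two differently named road attributes with the same mix of instance types
# score as highly similar even though their names would not string-match.
road_a = type_distribution(["12", "7", "Main St"], simple_typer)
road_b = type_distribution(["3", "48", "Elm Ave"], simple_typer)
print(js_similarity(road_a, road_b))   # 1.0: identical type mixes
```

This is the advantage the paragraph describes: because only the instances' type mix matters, the score survives unrepresentative names and schema drift.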