Knowledge Discovery and Data Mining (KDD) Lab

Knowledge Discovery and Data Mining (KDD) is the nontrivial process of extracting implicit, novel, actionable, and useful information from large volumes of data. It has emerged as a unique combination of several fields of science and technology, including machine learning, computer programming, statistics, and database systems. KDD spans a wide range of applications in engineering (e.g. cybersecurity, intrusion detection in computer networks, flow classification, Web mining), business (e.g. fraud detection, risk analysis, decision support systems, market trend forecasting), medicine and population health (e.g. study of drug implications, disease outbreaks), bioinformatics (e.g. protein interactions, gene sequence analysis), and environmental science (e.g. flood prediction, satellite image processing), to name a few.

Dr. Raahemi has established the Knowledge Discovery and Data Mining (KDD) Lab at the University of Ottawa, hosting graduate students and researchers in the multidisciplinary areas of Digital Transformation and Innovation, Computer Science, and Computer Engineering. Students and researchers from the related fields of Mathematics and Statistics are also welcome in his research group.

The research projects in the KDD lab focus on the following two main areas:

1- Novel techniques in data analytics and machine learning, as well as emerging applications of data mining and machine learning in engineering, healthcare, and business. In particular, the projects focus on the study and development of advanced algorithms for (a) outlier detection in high-dimensional data; (b) big data analytics; and (c) stream data mining; as well as emerging applications of the proposed solutions in engineering (network security, intrusion detection, network flow classification), business (business analytics, fraud detection in financial statements), and healthcare (study of health coverage, predicting high-cost patients and risk of hospitalization, predicting immune-based diseases).

2- Information systems and technologies; Data communications networks and services; and applications of information systems in business and healthcare.

A partial list of the research projects in the KDD Lab includes:

Reliable AI for Outlier Detection in Unstructured Data with Applications in Engineering and Business (Supported by NSERC Discovery, 2023-2028)

Detecting outliers, especially in large-scale unstructured data, is a challenging and computationally complex problem. In this research program funded by NSERC Discovery, Dr. Raahemi explores, designs, and analyzes reliable AI algorithms for outlier detection in large-scale unstructured data. These include transformers and word embedding techniques, effective dimension reduction and data summarization methods, and advanced machine learning models (both text mining and data mining models), with the goal of reliable and accurate detection of outliers in vast amounts of documents containing both textual and numerical records. The new methods will be applied to emerging applications in engineering (namely, intrusion detection in computer networks) and business (namely, fraud detection in financial big data).
Ensuring that the AI solutions are unbiased, reliable, and trustworthy is another important research challenge that will be investigated in this research program at two levels: (a) data collection and pre-processing: the collected data must be inclusive, fair, and trustworthy in order to be reliable, and the collected and pre-processed data will be verified not to be biased towards specific organizations, regions, or minority groups; and (b) AI algorithm design: the AI algorithms must perform fairly and ethically to reach reliable conclusions.


Outlier Detection in High-Dimensional Big Data using Bio-Inspired Algorithms (Supported by NSERC Discovery, 2007-2022)

In this research program funded by NSERC Discovery, Dr. Raahemi explored innovative algorithms for feature engineering and analysis of large data using bio-inspired and machine learning approaches, with a particular focus on outlier detection. He investigated the competency of the proposed algorithms in various applications, including (a) intrusion detection systems and anomaly detection for network security; (b) protocol identification of Internet traffic for resource allocation and quality of service assurance; and (c) maritime vessel scheduling.


Smart Factory (Supported by Ontario Centre of Innovation (OCI) and Ciena Canada)

Following the Industry 4.0 trend of employing artificial intelligence and data analytics to improve business processes, and in collaboration with Ciena Canada, a large amount of numerical and textual data collected at the manufacturing test centre was analyzed using both text mining and data mining techniques to categorize the reasons for return (for faulty products) and optimize the test centre's processes. The solutions were implemented with Apache Spark on the Databricks cloud. Additionally, a digital dashboard was designed to visualize and monitor the test centre's activities and to produce reports on returns and the supply chain.

 
Cyber Threat and Malware Detection in Network Traffic using Big Data Analytics (supported by Bell Canada and MITACS)

Maintaining Quality of Service in the network requires traffic monitoring and security control measures. Classification of Internet traffic (e.g., peer-to-peer, web server, mail server, and attacks including malware, viruses, and worms) is a fundamental requirement in areas such as network provisioning, network security, traffic engineering, and network management.

In close collaboration with the Bell Canada Cyber Threat Intelligence (CTI) team, Dr. Raahemi and his group developed solutions, using big data analytics techniques, to classify cyber threats, including malware and attacks, based on their behavioral characteristics.

 

Metaheuristic Optimization in Maritime Vessel Scheduling:
Big-Data-Enabled Multi-Objective Modelling of Vessel Scheduling Recovery Problem
(supported by Larus Technologies and NSERC-CRD)

Seaborne shipping carries 90% of international trade, giving it a significant impact on the global economy. Due to limited differentiation of services, the main competition between stakeholders in this industry is cost-based.
This research explores multi-objective optimization techniques to address an optimization problem with three objectives:
-  minimize financial loss
-  minimize delay time
-  maximize average speed compliance

Traffic at ports, traffic on major world sea routes, and special atmospheric conditions at a geospatial location are the parameters affecting the sailing speed.
This research employs metaheuristic techniques to solve the optimization problems. In particular, we use distributed cooperative coevolution methods on the Apache Spark framework to increase the performance and quality of solutions.

Our proposed solution generated a Pareto front which reflects the trade-off among the three objectives.
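
As a rough illustration of what a Pareto front over these three objectives means, the following minimal Python sketch filters out dominated candidate schedules. It is not the lab's distributed cooperative coevolution method; the function names and the candidate scores are hypothetical values chosen only for this example.

```python
from typing import List, Tuple

# Each candidate schedule is scored as
# (financial_loss, delay_time, speed_compliance).
# The scores below are made-up illustration values, not project results.
Score = Tuple[float, float, float]


def dominates(a: Score, b: Score) -> bool:
    """Return True if a is no worse than b on every objective and
    strictly better on at least one. Loss and delay are minimized;
    speed compliance is maximized."""
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and better


def pareto_front(scores: List[Score]) -> List[Score]:
    """Keep only the candidates that no other candidate dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


candidates = [
    (100.0, 5.0, 0.90),
    (120.0, 3.0, 0.85),
    (130.0, 6.0, 0.80),  # worse than the first candidate on all objectives
    (90.0, 8.0, 0.95),
]
front = pareto_front(candidates)
```

Every schedule left in `front` trades one objective against another: none can be improved on one objective without giving up something on a different one.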

 
Estimating Bus Passengers' Origin-Destination of Travel Route using Data Analytics on Wi-Fi and Bluetooth Signals (supported by SMATS Traffic Solutions and OCE/NSERC-Engage)

The solutions we propose in this research improve the efficiency of public transportation systems by facilitating efficient bus scheduling and route planning, improving ride comfort, and lowering cities' operating costs.

The TrafficBox sensor collects mobile devices’ MAC addresses, Received Signal Strength Indication (RSSI), time stamps, and GPS data, and then stores them in its internal storage.

The main challenge in using Wi-Fi and Bluetooth sensors is distinguishing between passengers’ and non-passengers’ signals, as the sensors detect all the signals transmitted in the surrounding environment. To address this issue, and building on our previous experiment, we employed K-Means and Hierarchical clustering methods to automatically differentiate between passengers’ and others’ signals.
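
The clustering idea can be sketched with a small, self-contained K-Means in Python. This is only an illustration under stated assumptions: the two features per device (mean RSSI in dBm and minutes detected), the sample values, and the deterministic initialization are hypothetical and are not taken from the project's data or implementation.

```python
from math import dist
from typing import List, Tuple

# Hypothetical per-device features: (mean RSSI in dBm, minutes detected).
Point = Tuple[float, float]


def kmeans(points: List[Point], k: int = 2, iterations: int = 20) -> List[int]:
    """Plain K-Means (k >= 2) returning one cluster label per point.
    Centroids start at evenly spaced points of the sorted data so the
    result is deterministic for this small example."""
    ordered = sorted(points)
    step = (len(ordered) - 1) / (k - 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels


# Made-up devices: the first three are seen strongly and for a long time
# (likely on the bus), the last three weakly and briefly (likely passers-by).
devices = [(-45.0, 18.0), (-50.0, 25.0), (-48.0, 12.0),
           (-80.0, 1.0), (-85.0, 2.0), (-78.0, 0.5)]
labels = kmeans(devices, k=2)
```

In this toy data the strong, long-lived signals end up in one cluster and the weak, short-lived ones in the other; real sensor data would need feature scaling and validation against ground truth.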

 

Managing and Analysing Data for Concrete Building Infrastructure (Supported by Giatec Scientific and NSERC-Engage) 

Dr. Raahemi led a research project in collaboration with Giatec to collect and store the data generated by wireless sensors on a cloud infrastructure, then manage and analyze the data using data mining and machine learning techniques to detect anomalies and explore hidden patterns in the data.

 

Analyzing EEG signals for depression diagnosis (supported by IBM, Royal Ottawa Hospital and MITACS)

Dr. Raahemi and his team, in collaboration with researchers at the Royal Ottawa Hospital and supported by IBM Canada and MITACS, have undertaken a project to analyze electroencephalogram (EEG) signals collected from patients with major depressive disorder, in order to build predictive models and identify brain biomarkers for the diagnosis of depression.


Predicting Immune-based Disease with Reliable Data Mining on Population-Based Health Administrative Data (supported by Children's Hospital of Eastern Ontario Research Institute (CHEO-RI), ICES and MITACS)


The prevalence of immune-mediated chronic diseases has increased worldwide, including in Canada, over the past years. 

In the project sponsored by the Institute for Clinical Evaluative Sciences, the Children's Hospital of Eastern Ontario Research Institute, and MITACS, Dr. Raahemi and his colleagues are currently investigating exploratory data analysis and predictive modelling to build risk prediction models for chronic immune-mediated diseases such as IBD, asthma, multiple sclerosis, and type-1 diabetes.

The case study is implemented on real-world health data (ToH, OHIP, ICES) to tackle a rising population-based issue, immune-mediated diseases among children in Ontario, Canada, and the results are validated in consultation with domain experts.

 

Classification of Peer-to-Peer traffic using data mining techniques (supported by Alcatel Networks (now Nokia) and ORNEC)
    
Telecommunication equipment vendors and Internet Service Providers are very interested in solutions to classify Peer-to-Peer (P2P) traffic. P2P applications consume significant bandwidth and exhaust network resources, resulting in network congestion and affecting the availability, reliability, and quality of services.

Supported by Alcatel Networks, and in collaboration with its Research and Innovation Centre (R&I), we collected real Internet traffic, pre-processed the data, and prepared a training data set. Based on this data set, we built several models to identify P2P traffic, including decision tree, neural networks, incremental neural networks, incremental Tri-Training, fast decision tree, and concept-drift fast decision tree.