PhD Public Defense: Usman Ali

Title: A Framework for Concept Drift Detection and Adaptation for Classification Problems in Data Streams
PhD Public Defense: Usman Ali, Lecturer, Department of Computer Science, IBA
Advisor: Dr. Tariq Mahmood
External Examiners: Dr. Sohail Asghar (COMSATS University) | Dr. Ahmar Rashid (GIKI)
Date: January 12, 2025, at 2:00 PM
Venue: Tabba Conference Room, Tabba Academic Block, Second Floor [North Wing], Main Campus, IBA Karachi

Abstract
This research addresses the problem of unsupervised concept drift detection, i.e., detecting drift without ground-truth labels while maintaining high confidence that the detected drift is real and keeping false alarms low. To this end, we establish an autoencoder-based drift detection framework (which can be followed by any standard drift adaptation mechanism) for machine learning-based classification problems in data streams. In streaming environments, data characteristics and probability distributions are likely to change over time, a phenomenon called concept drift, which makes it difficult for machine learning models to predict accurately. In such non-stationary environments, concept drift must be detected and the model updated to maintain acceptable predictive performance. Existing approaches to drift detection have inherent limitations: supervised methods require ground-truth labels, and unsupervised methods suffer from a high false positive rate. This research presents a novel semi-supervised Autoencoder-based Drift Detection Method (AEDDM) that detects drift without ground-truth labels at detection time, yet with high confidence that the detected drift is real.

The AEDDM method works in batch mode and has three architectural components: an offline component (training phase), in which two autoencoders are trained on labelled data to learn the data distribution of each class, and two thresholds, the batch threshold and the count threshold, are computed from the reconstruction errors on the validation data; an ensemble component, which defines the sequential order of the autoencoders; and an online component, in which data arrives in batches and drift detection is performed on each batch as a whole by comparing its reconstruction loss values against the thresholds learned in the offline training phase.
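The offline and online phases described above can be illustrated with a minimal sketch. It is not the thesis implementation and rests on several assumptions that the abstract does not confirm: a PCA-based linear autoencoder stands in for the neural autoencoders; the ensemble step keeps the smallest per-class reconstruction error for each instance; the batch threshold bounds a batch's mean error and the count threshold bounds the number of high-error instances in a batch; and the names `LinearAutoencoder`, `fit_thresholds`, `detect_drift`, `margin`, and `instance_limit` are hypothetical.

```python
import numpy as np


class LinearAutoencoder:
    """PCA-based linear autoencoder (assumed stand-in for a neural one)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        # Rows of vt are the principal directions of the centred data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def reconstruction_error(self, X):
        centred = X - self.mean_
        recon = centred @ self.components_.T @ self.components_
        return np.mean((centred - recon) ** 2, axis=1)


def ensemble_error(autoencoders, X):
    # Assumption: score each instance by its best-fitting class autoencoder.
    return np.min([ae.reconstruction_error(X) for ae in autoencoders], axis=0)


def fit_thresholds(autoencoders, X_val, batch_size, rng, n_resamples=200, margin=1.2):
    """Offline phase: derive thresholds from validation reconstruction errors."""
    errors = ensemble_error(autoencoders, X_val)
    instance_limit = np.quantile(errors, 0.95)  # assumed per-instance cut-off
    batch_means, counts = [], []
    for _ in range(n_resamples):
        sample = rng.choice(errors, size=batch_size, replace=True)
        batch_means.append(sample.mean())
        counts.append(np.sum(sample > instance_limit))
    # `margin` is a hypothetical safety factor on the batch threshold.
    return instance_limit, margin * np.max(batch_means), int(np.max(counts))


def detect_drift(autoencoders, X_batch, instance_limit, batch_threshold, count_threshold):
    """Online phase: label-free check on one whole batch."""
    errors = ensemble_error(autoencoders, X_batch)
    too_many = np.sum(errors > instance_limit) > count_threshold
    return bool(errors.mean() > batch_threshold and too_many)


def make_class(rng, n, basis, offset, noise=0.05):
    # Synthetic class: 2-D latent structure embedded in 5-D plus small noise.
    return rng.normal(size=(n, 2)) @ basis + offset + noise * rng.normal(size=(n, 5))


rng = np.random.default_rng(0)
bases = [rng.normal(size=(2, 5)) for _ in range(2)]
offsets = [np.zeros(5), np.full(5, 4.0)]

# Offline: one autoencoder per class, trained on labelled data.
autoencoders = [
    LinearAutoencoder(2).fit(make_class(rng, 500, b, o)) for b, o in zip(bases, offsets)
]
X_val = np.vstack([make_class(rng, 200, b, o) for b, o in zip(bases, offsets)])
limits = fit_thresholds(autoencoders, X_val, batch_size=100, rng=rng)

# Online: unlabelled batches, one from the training concept and one shifted.
stable_batch = np.vstack([make_class(rng, 50, b, o) for b, o in zip(bases, offsets)])
drifted_batch = rng.normal(loc=3.0, scale=1.0, size=(100, 5))
stable_flag = detect_drift(autoencoders, stable_batch, *limits)
drift_flag = detect_drift(autoencoders, drifted_batch, *limits)
print(stable_flag, drift_flag)
```

In this sketch an alarm requires both thresholds to be exceeded, which is one plausible way to suppress false alarms from a few outlying instances; the actual decision rule used by AEDDM may differ.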

AEDDM is considered a semi-supervised drift detection method because its complete framework leverages both labelled and unlabelled data. Labelled training data is required for the initial training of the autoencoders in the offline phase, but no class labels are needed in the online detection phase. Thus, although drift is detected in a completely unsupervised way online, the framework as a whole is semi-supervised.

AEDDM is assessed on a combination of synthetic and real-world datasets that exhibit both sudden and gradual changes in data distribution. To evaluate its effectiveness, the method was tested with seven popular batch classifiers and with a Hoeffding Tree classifier in an online learning setting. The results indicate that AEDDM accurately identifies distributional changes that are likely to degrade classifier performance (real drift), while disregarding irrelevant changes (virtual drift). This ability to distinguish between true and false alarms, coupled with its adaptability to changing data distributions, makes AEDDM a valuable tool for maintaining classifier performance in dynamic environments.

Within the field of drift detection, AEDDM is one of the few detailed preliminary works that leverage deep learning, specifically autoencoders. It was designed around the characteristics of an ideal drift detector, following a careful review of supervised, semi-supervised, unsupervised, and deep learning-based techniques. It is probably the first method to integrate the strengths of each family: the drift it detects is real, as in supervised drift detection methods; available labelled data is fully leveraged for autoencoder training and threshold computation, as in semi-supervised methods; detection is performed in a completely unsupervised way, as in unsupervised methods; and deep learning is used to process multidimensional data, eliminating the need for feature selection or dimensionality reduction.