Recent advances in data collection techniques and the drop in storage prices allow organizations to collect vast amounts of data. Processing all of this data is a challenging task; in practice, only a small fraction of it is processed due to limitations in current systems and algorithms. Organizations typically deal with large amounts of data stored in relational databases or arriving continuously in data streams. Extracting models that summarize the data is therefore crucial for gaining insights that can help improve our lives.
At Utrecht University, I joined the very large data management group. My research centers on data stream mining, data cleaning, and the explainability of machine learning techniques.
As a postdoc at QCRI, we worked on extracting syntactic patterns that reveal important information about the data. We used the discovered patterns to detect disguised missing values, which are assumed to be represented by non-dominant patterns. We also used the patterns in the data to develop a new class of integrity constraints that help in discovering data inconsistencies, which we call pattern functional dependencies (PFDs).
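To illustrate the idea, here is a minimal sketch in Python, not our actual implementation: the generalization rules (letters to 'a', digits to 'd'), the function names, and the min_support threshold are all assumptions chosen for this example. It generalizes each value into a syntactic pattern and flags values whose pattern has low support in a column as candidate disguised missing values.

```python
import re
from collections import Counter

def syntactic_pattern(value: str) -> str:
    """Map a value to a syntactic pattern: runs of letters become 'a+',
    runs of digits become 'd+', other characters are kept literally."""
    generalized = ''.join(
        'a' if ch.isalpha() else 'd' if ch.isdigit() else ch
        for ch in value
    )
    generalized = re.sub(r'a+', 'a+', generalized)
    return re.sub(r'd+', 'd+', generalized)

def flag_non_dominant(column, min_support=0.05):
    """Return values whose pattern covers less than min_support of the
    column -- candidates for disguised missing values."""
    patterns = [syntactic_pattern(str(v)) for v in column]
    support = Counter(patterns)
    n = len(column)
    return [v for v, p in zip(column, patterns) if support[p] / n < min_support]

# Example: phone numbers where 'N/A' does not follow the dominant pattern.
phones = ['555-1234', '555-9876', '555-4321', 'N/A', '555-0000']
print(flag_non_dominant(phones, min_support=0.3))  # -> ['N/A']
```

In this toy column the dominant pattern is 'd+-d+', so the lone 'a+/a+' value surfaces as a candidate disguised missing value.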
During my PhD studies at KAUST, we worked on estimating the probability density function (PDF) of numerical data streams, which serves as a summarization model for the stream. The PDF reveals important information about the data in the stream: low-density regions can be treated as outliers, and large peaks in the PDF curve can be viewed as the centers of clusters in the data. We used the estimated PDF in three applications: i) outlier detection, ii) change detection, and iii) monitoring taxicab demand. My work at KAUST was supervised by Xiangliang Zhang (KAUST) and Suojin Wang (Texas A&M University).
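As a rough illustration of how a density estimate supports these applications, the sketch below uses SciPy's offline Gaussian KDE over a window of simulated stream data; our work used online estimators suited to streams, and the window contents, quantile threshold, and peak-height cutoff here are assumptions made for the example. It flags low-density points as outliers and reads cluster centers off the peaks of the estimated PDF.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Simulated window of a numerical stream: two clusters plus two injected outliers.
window = np.concatenate([
    rng.normal(0.0, 1.0, 500),
    rng.normal(8.0, 0.5, 300),
    [20.0, -15.0],                       # injected outliers
])

kde = gaussian_kde(window)               # density estimate over the current window
density = kde(window)                    # estimated density at each observed point

# Outlier detection: points that fall in low-density regions.
threshold = np.quantile(density, 0.01)   # flag the lowest 1% of densities
print(window[density < threshold])       # includes 20.0 and -15.0, plus extreme tails

# Cluster centers: prominent peaks of the estimated PDF curve.
grid = np.linspace(window.min(), window.max(), 1000)
pdf = kde(grid)
peaks, _ = find_peaks(pdf, height=0.05 * pdf.max())
print(grid[peaks])                       # approximately 0 and 8
```

The height cutoff on find_peaks keeps the two genuine modes while ignoring the tiny bumps that the isolated outliers induce in the estimated curve.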