Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is also available for digesting.

Pages

About me

Posts

Portfolio

Publications

Graph Based Anomaly Detection in Chimbuko: Feasible or Fallible?

Published in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23), 2023

A study on the applicability of graph-based deep learning methods for performance anomaly classification in the Chimbuko framework, a performance analytics tool used to monitor and improve large-scale supercomputing applications. Published at SC23.

Recommended citation: Chase Phelps, Ankur Lahiry, Tanzima Z Islam, and Christopher Kelly. (2023). "Graph Based Anomaly Detection in Chimbuko: Feasible or Fallible?" Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23). https://sc23.supercomputing.org/proceedings/tech_poster/tech_poster_pages/rpost184.html

Novel Representation Learning Technique Using Graphs for Performance Analytics

Published in 2023 International Conference on Machine Learning and Applications (ICMLA), 2023

A novel graph-based representation learning approach that transforms tabular HPC performance data into graphs, enabling Graph Neural Network techniques to capture complex relationships between features. Also available as arXiv preprint 2401.10799.

Recommended citation: Tarek Ramadan, Ankur Lahiry, and Tanzima Z Islam. (2023). "Novel Representation Learning Technique Using Graphs for Performance Analytics." 2023 International Conference on Machine Learning and Applications (ICMLA). IEEE, pp. 1311–1318. https://ieeexplore.ieee.org/document/10460066/

Reimagine Application Performance as a Graph: Novel Graph-Based Method for Performance Anomaly Classification in High-Performance Computing

Published in 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), 2024

A novel graph-based representation of application performance for performance anomaly classification using graph neural networks, capturing complex relationships among tasks and resources. Published at IEEE COMPSAC 2024.

Recommended citation: Chase Phelps, Ankur Lahiry, Tanzima Z Islam, and Line C Pouchard. (2024). "Reimagine Application Performance as a Graph: Novel Graph-Based Method for Performance Anomaly Classification in High-Performance Computing." 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, pp. 240–245. https://ieeexplore.ieee.org/document/10633643/

WANDER: An Explainable Decision-Support Framework for HPC

Published in arXiv preprint arXiv:2506.04049, 2025

An explainable decision-support framework for High Performance Computing (HPC) that uses graph-based and explainable AI methods to turn complex system logs into intuitive signals, enabling early detection and diagnosis of unusual performance behavior in large-scale HPC systems.

Recommended citation: Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, and Tanzima Z Islam. (2025). "WANDER: An Explainable Decision-Support Framework for HPC." arXiv preprint arXiv:2506.04049. https://arxiv.org/abs/2506.04049

Scalable GPU Performance Variability Analysis Framework

Published in arXiv preprint arXiv:2506.20674, 2025

A scalable framework for GPU log-analysis pipelines over large trace datasets, using distributed partitioning and parallel processing to reduce analysis time and memory overhead. It achieves a 67% improvement in scalability while enabling fast identification of performance variability, memory stalls, and system bottlenecks across repeated HPC runs.

Recommended citation: Ankur Lahiry, Ayush Pokharel, Seth Ockerman, Amal Gueroudji, Line Pouchard, and Tanzima Z Islam. (2025). "Scalable GPU Performance Variability Analysis Framework." arXiv preprint arXiv:2506.20674. https://arxiv.org/abs/2506.20674

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

Published in arXiv preprint arXiv:2510.18300, 2025

A distributed framework for causal modeling of performance variability in GPU traces, enabling identification of root causes behind performance anomalies in HPC workloads through causal inference over large-scale GPU trace data.

Recommended citation: Ankur Lahiry, Ayush Pokharel, Banooqa Banday, Seth Ockerman, Amal Gueroudji, Mohammad Zaeed, Tanzima Z Islam, and Line Pouchard. (2025). "A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces." arXiv preprint arXiv:2510.18300. https://arxiv.org/abs/2510.18300

COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC

Published in arXiv preprint arXiv:2604.22688, 2026

A unified decision-intelligence system for HPC environments that recommends performant configurations by balancing speed, cost, and reliability trade-offs, with explainable outputs and uncertainty-aware ranking. It scales to traces with 1.3B samples (126 GB) and achieves up to 100× faster training and 80× faster inference than state-of-the-art generative baselines.

Recommended citation: Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, Tanzima Z Islam, and Mohammad Zaeed. (2026). "COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC." arXiv preprint arXiv:2604.22688. https://arxiv.org/abs/2604.22688

Talks

Teaching