Publications
You can also find my articles on my Google Scholar profile.
Published in arXiv preprint arXiv:2604.22688, 2026
A unified decision-intelligence system for HPC environments that recommends performant configurations by balancing speed, cost, and reliability trade-offs, with explainable outputs and uncertainty-aware ranking. It scales to traces with 1.3B samples (126 GB) and achieves up to 100× faster training and 80× faster inference than state-of-the-art generative baselines.
Recommended citation: Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, Tanzima Z Islam, and Mohammad Zaeed. (2026). "COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC." arXiv preprint arXiv:2604.22688. https://arxiv.org/abs/2604.22688
Published in arXiv preprint arXiv:2510.18300, 2025
A distributed framework for causal modeling of performance variability in large-scale GPU trace data, enabling identification of the root causes behind performance anomalies in HPC workloads.
Recommended citation: Ankur Lahiry, Ayush Pokharel, Banooqa Banday, Seth Ockerman, Amal Gueroudji, Mohammad Zaeed, Tanzima Z Islam, and Line Pouchard. (2025). "A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces." arXiv preprint arXiv:2510.18300. https://arxiv.org/abs/2510.18300
Published in arXiv preprint arXiv:2506.20674, 2025
A scalable framework for GPU log-analysis pipelines over large trace datasets, using distributed partitioning and parallel processing to reduce analysis time and memory overhead. It achieves a 67% improvement in scalability while enabling fast identification of performance variability, memory stalls, and system bottlenecks across repeated HPC runs.
Recommended citation: Ankur Lahiry, Ayush Pokharel, Seth Ockerman, Amal Gueroudji, Line Pouchard, and Tanzima Z Islam. (2025). "Scalable GPU Performance Variability Analysis Framework." arXiv preprint arXiv:2506.20674. https://arxiv.org/abs/2506.20674
Published in arXiv preprint arXiv:2506.04049, 2025
An explainable decision-support framework for high-performance computing (HPC) that uses graph-based and explainable AI methods to turn complex system logs into intuitive signals, enabling early detection and diagnosis of unusual performance behavior in large-scale HPC systems.
Recommended citation: Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, and Tanzima Z Islam. (2025). "WANDER: An Explainable Decision-Support Framework for HPC." arXiv preprint arXiv:2506.04049. https://arxiv.org/abs/2506.04049
Published in 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), 2024
A novel graph-based representation of application performance for performance anomaly classification using graph neural networks, capturing complex relationships among tasks and resources. Published at IEEE COMPSAC 2024.
Recommended citation: Chase Phelps, Ankur Lahiry, Tanzima Z Islam, and Line C Pouchard. (2024). "Reimagine Application Performance as a Graph: Novel Graph-Based Method for Performance Anomaly Classification in High-Performance Computing." 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, pp. 240–245. https://ieeexplore.ieee.org/document/10633643/
Published in 2023 International Conference on Machine Learning and Applications (ICMLA), 2023
A novel graph-based representation learning approach that transforms tabular HPC performance data into graphs, enabling Graph Neural Network techniques to capture complex relationships between features. Also available as arXiv preprint 2401.10799.
Recommended citation: Tarek Ramadan, Ankur Lahiry, and Tanzima Z Islam. (2023). "Novel Representation Learning Technique Using Graphs for Performance Analytics." 2023 International Conference on Machine Learning and Applications (ICMLA). IEEE, pp. 1311–1318. https://ieeexplore.ieee.org/document/10460066/
Published in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23), 2023
A study on the applicability of graph-based deep learning methods for performance anomaly classification in the Chimbuko framework, a performance analytics tool used to monitor and improve large-scale supercomputing applications. Published at SC23.
Recommended citation: Chase Phelps, Ankur Lahiry, Tanzima Z Islam, and Christopher Kelly. (2023). "Graph Based Anomaly Detection in Chimbuko: Feasible or Fallible?" Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23). https://sc23.supercomputing.org/proceedings/tech_poster/tech_poster_pages/rpost184.html