Computational Agents for Systems’ Operations


The CERN accelerator complex is a safety-critical, interconnected system-of-systems where reliability is paramount. Traditional model-based maintenance struggles with evolving system behavior and rapidly growing, heterogeneous sensor data. CASO provides a scalable, adaptable framework that learns models directly from operational data to anticipate failures, recommend interventions, and coordinate decisions across subsystems.

CASO’s architecture supports:

  • Secure operations: Parameter exchange and learning without exposing sensitive data, aligned to stringent operational security requirements.
  • Edge readiness: Lightweight agents for resource-constrained devices and distributed subsystems.
  • Interoperability: Standardized interfaces for seamless integration with existing monitoring, alarms, and maintenance management workflows.
  • Prescriptive intelligence: Early anomaly detection, Remaining Useful Life (RUL) estimation, and optimized scheduling to minimize downtime and failure propagation.
  • Scalability & adaptability: Designed to evolve with equipment, sensors, and operating conditions across the LHC ecosystem.

With advanced analytics and intelligent agent coordination, CASO raises care-of-machine quality, reduces variation in maintenance practices, and maximizes physics program availability.

Our role in the project

CERN leads the design, implementation, and evolution of CASO across the accelerator complex. Responsibilities include secure systems integration with legacy and multi-vendor equipment, deployment of edge-capable agents, lifecycle MLOps for on-prem updates and monitoring, validation with domain experts, and continuous hardening in production environments. CERN coordinates pilots, benchmarks, documentation, and training to ensure sustainable adoption and measurable impact on availability and reliability. 

Success Stories deploying CASO at CERN

Deep Learning for Anomaly Detection in Particle Accelerators Technical Infrastructure

Author: Lorenzo Giusti (CERN MSc Thesis)

A BiLSTM autoencoder detects anomalies in time series from LHC collimators and UPS systems using reconstruction error thresholds. Reported performance includes 93.8% accuracy and F1 = 0.864, with detection up to one week before engineers’ official reports. The method is reported to be particularly robust where prediction-based methods struggle. The approach provides component- and system-level insights to target interventions.
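The thresholding step can be illustrated with a minimal sketch. It assumes the autoencoder already exists and returns reconstructions of the input windows; the function names, the mean-squared-error score, and the quantile-based threshold are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Mean squared error per window.

    x and x_hat have shape (windows, timesteps, features); x_hat is assumed
    to come from an already trained autoencoder (not shown here).
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return ((x - x_hat) ** 2).mean(axis=(1, 2))

def fit_threshold(normal_errors, quantile=0.99):
    """Assumed rule: a high quantile of the errors seen on normal data."""
    return float(np.quantile(np.asarray(normal_errors, dtype=float), quantile))

def flag_anomalies(errors, threshold):
    """A window is flagged when its error exceeds the threshold."""
    return np.asarray(errors, dtype=float) > threshold
```

In use, the threshold would be fitted once on errors from periods known to be healthy and then applied to new windows; the quantile trades missed detections against false alarms.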

A hierarchical architecture of autoencoders for the identification of abnormal condition in complex technical infrastructures

Authors: P. Baraldi, M. Bartesaghi, U. Gentile, X. Lu, L. Serio, E. Zio

A cascading, hierarchical AE framework combines component- and system-level models with Z-tests and decision logic to distinguish sensor, component, and system failures in LHC electrical transformers (2015-2018 data). It identified early sensor malfunctions and component deterioration with false alarm rates as low as 0.0025, improving classification of malfunction types and informing proactive maintenance (with noted sensitivity tuning needs).
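As a rough illustration of how Z-tests and decision logic might combine, the sketch below tests each sensor's autoencoder residuals against a healthy baseline and then applies a simple rule. The data layout, the critical value, and the rule itself (one deviating sensor means a sensor fault, several in one component mean a component fault, several components mean a system fault) are hypothetical; the paper's actual logic is not reproduced here.

```python
import math

def z_statistic(window, mu0, sigma0):
    """Z-test statistic for the mean of a residual window against a baseline."""
    n = len(window)
    return (sum(window) / n - mu0) / (sigma0 / math.sqrt(n))

def classify(residuals, baseline, z_crit=3.0):
    """Hypothetical decision logic.

    residuals: {component: {sensor: [residual values]}}
    baseline:  {component: {sensor: (mu0, sigma0)}} estimated on healthy data
    """
    deviating = {}
    for comp, sensors in residuals.items():
        flagged = [
            s for s, window in sensors.items()
            if abs(z_statistic(window, *baseline[comp][s])) > z_crit
        ]
        if flagged:
            deviating[comp] = flagged
    if not deviating:
        return "normal"
    if len(deviating) > 1:
        return "system"
    (flagged,) = deviating.values()
    return "sensor" if len(flagged) == 1 else "component"
```

The critical value `z_crit` plays the role of the sensitivity setting: lowering it catches degradation earlier at the cost of more false alarms.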

Association Rules Extraction for Functional Dependencies in Complex Technical Infrastructures

Authors: F. Antonello, P. Baraldi, A. Shokry, E. Zio, U. Gentile, L. Serio

Using large alarm databases, alarms are represented as a binary matrix and mined via Apriori to surface functional dependencies, especially cross-system relations often unknown to experts. Results on synthetic data and real LHC datasets reveal actionable dependencies for safety, resilience, maintenance planning, and root-cause analysis. Careful selection of minimum support/confidence mitigates spurious rules and computational cost.
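A compact sketch of Apriori on a binary alarm matrix may make the idea concrete. It assumes rows are time windows and columns are alarms (1 if the alarm fired in that window); this layout, the function names, and the thresholds are illustrative assumptions rather than the authors' implementation.

```python
from itertools import combinations
import numpy as np

def frequent_itemsets(matrix, min_support):
    """Level-wise Apriori search over alarm columns."""
    m = np.asarray(matrix, dtype=bool)

    def support(items):
        # Fraction of windows in which all alarms of the itemset fired.
        return float(m[:, list(items)].all(axis=1).mean())

    frequent = {}
    level = [frozenset([j]) for j in range(m.shape[1])]
    while level:
        kept = {s: support(s) for s in level}
        kept = {s: v for s, v in kept.items() if v >= min_support}
        frequent.update(kept)
        # Join: merge frequent k-itemsets that differ in exactly one alarm.
        candidates = {a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in kept
                        for sub in combinations(c, len(c) - 1))]
    return frequent

def rules(frequent, min_confidence):
    """Rules (antecedent, consequent, support, confidence) from frequent itemsets."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = supp / frequent[a]
                if conf >= min_confidence:
                    out.append((tuple(sorted(a)),
                                tuple(sorted(itemset - a)), supp, conf))
    return out
```

Raising `min_support` shrinks the candidate set quickly, which is why the choice of thresholds drives both the number of spurious rules and the running time.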

A novel association rule mining method for the identification of rare functional dependencies in Complex Technical Infrastructures from alarm data

Authors: F. Antonello, P. Baraldi, A. Shokry, E. Zio, L. Serio

A multi-support association rule approach identifies rare but critical functional dependencies across systems (validated on synthetic data and LHC Zone 8). Compared to single-support mining, it reduces computational effort and rule volume while preserving meaningful cross-system relations that guide reliability and availability improvements.
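The general multiple-minimum-support idea can be sketched as follows: each alarm gets its own minimum support, scaled to how often it occurs, and an itemset only has to meet the lowest threshold among its members, so combinations involving rare alarms are not discarded. The scaling rule, the parameters `beta` and `floor`, and the exhaustive enumeration are assumptions made for clarity; the sketch does not reproduce the paper's method or its efficiency gains.

```python
from itertools import combinations
import numpy as np

def item_min_supports(matrix, beta=0.5, floor=0.01):
    """Assumed rule: per-alarm threshold = max(beta * alarm frequency, floor)."""
    freq = np.asarray(matrix, dtype=bool).mean(axis=0)
    return np.maximum(beta * freq, floor)

def multi_support_itemsets(matrix, mis, max_len=3):
    """Keep itemsets whose support meets the lowest threshold of their alarms.

    Exhaustive enumeration up to max_len, for illustration only.
    """
    m = np.asarray(matrix, dtype=bool)
    found = {}
    for k in range(2, max_len + 1):
        for items in combinations(range(m.shape[1]), k):
            supp = float(m[:, list(items)].all(axis=1).mean())
            if supp > 0 and supp >= min(mis[j] for j in items):
                found[items] = supp
    return found
```

With a single global threshold set high enough to control rule volume among frequent alarms, a pair containing an alarm that fires in only a small fraction of windows would be dropped; here it survives because its own threshold is proportionally lower.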

A Novel Metric to Evaluate the Association Rules for Identification of Functional Dependencies in Complex Technical Infrastructures

Authors: F. Antonello, P. Baraldi, E. Zio, L. Serio

A ranking metric prioritizes meaningful functional dependencies (FDEPs) from alarm-mined rules, filtering spurious correlations and resisting statistical noise. Applied to 250k+ alarms from cryogenic, electric, and cooling systems, it surfaces rules aligned with expert knowledge, which improves maintenance planning and diagnostic accuracy and enables operator-support automation.

A Niching Augmented Evolutionary Algorithm for the Identification of Functional Dependencies in Complex Technical Infrastructures From Alarm Data

Authors: F. Antonello, P. Baraldi, E. Zio, L. Serio

Niching Augmented Genetic Algorithm (NAGA) couples niching GAs with augmentation to efficiently explore vast alarm-combination spaces, directly identifying diverse, spurious-free functional dependencies without small support thresholds or heavy post-processing. On synthetic and LHC datasets, NAGA cuts compute, scales better than standard association rule mining (ARM), and enhances accuracy. Future work integrates these dependencies into resilience analysis and maintenance planning.

Federated Learning Framework to support AI-Driven Prescriptive Maintenance in Large-Scale Cryogenic Infrastructures at CERN

Authors: P. Cacace, L. Serio, D. Reis Santos

The framework builds on CERN’s CAFEIN Federated Learning Platform: localized models are trained per compressor unit and aggregated into specialized global models for low- and high-pressure systems. Autoencoder-based anomaly detection with real-time data supports RUL estimation and optimized maintenance. The decentralized setup addresses unbalanced data, enhances prediction accuracy, reduces downtime, extends equipment lifespan, and lowers costs, and it is scalable to other cryogenic facilities.
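The aggregation step can be pictured with a minimal sketch that averages locally trained parameters within each pressure group, weighted by the number of local samples. This assumes a FedAvg-style rule; the source does not state which aggregation rule CAFEIN uses, and the data layout and names below are hypothetical rather than part of the CAFEIN API.

```python
import numpy as np

def federated_average(local_params, sample_counts):
    """Sample-weighted average of per-unit parameters, layer by layer.

    local_params: one list of layer arrays per compressor unit.
    """
    weights = np.asarray(sample_counts, dtype=float)
    weights /= weights.sum()
    return [
        sum(w * layer for w, layer in zip(weights, layers))
        for layers in zip(*local_params)
    ]

def aggregate_by_group(units):
    """Build one global model per group (e.g. low- or high-pressure).

    units: list of dicts with keys 'group', 'params', 'n_samples'.
    """
    groups = {}
    for unit in units:
        groups.setdefault(unit["group"], []).append(unit)
    return {
        name: federated_average([u["params"] for u in members],
                                [u["n_samples"] for u in members])
        for name, members in groups.items()
    }
```

Only parameters and sample counts leave each unit in this scheme, not the raw sensor data, and weighting by sample count is one simple way to account for units that contribute unequal amounts of data.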

PenguinGPT: A Virtual Assistant for Reliability & Prescriptive Maintenance

PenguinGPT is a domain-tuned assistant that combines retrieval-augmented generation (RAG) and LLMs with cryogenic engineering knowledge to analyze logs, sensors, and maintenance history. In LHC case studies, PenguinGPT reduced operational downtime, improved availability, and streamlined troubleshooting for the ultra-low-temperature operations required by superconducting magnets, moving maintenance from reactive to prescriptive in demanding thermal environments.

Collaborators