A woman-owned, HUB-certified and DIR contract company dedicated to helping our customers meet their Information Technology requirements.

  Remote Care

Proactive Monitoring and Administration for HPC


SAI provides shared, remote support for HPC systems, called Remote Care, at a fraction of the cost of dedicated, full-time, on-site support. Rather than taking a break-fix approach, SAI uses a managed service approach: it builds, configures, and upgrades systems; monitors them and reports performance metrics; and responds proactively to system events. With SAI's Cluster Managed Services, scientists and engineers can focus on using the system for its intended purpose instead of spending their valuable time supporting it themselves or relying on administrators without specialized cluster training and experience. Even sites with larger systems and dedicated support staff can augment that staff with SAI's services to provide more cost-effective system and user support.

SAI's Cluster Managed Services are priced according to system size and the Service Level Agreement (SLA) needed. SAI also offers consulting throughout your system's lifecycle, from planning and acquisition through decommissioning, as well as on-site services including system administration, user services, and application support. SAI has experienced computational scientists available to help optimize and parallelize numerical modeling codes and tune your system for maximum performance.

SAI's Remote Care is made up of three services:

  1. Remote Management - Supports the day-to-day operations needed to administer your cluster. Typical functions include adding, deleting, and modifying user accounts; changing permissions; performing backups and restorations; managing logs; installing system and application software and patches; performing upgrades; and managing system security.

  2. Hardware & Performance Monitoring - Proactive, automated monitoring and alarming for equipment temperatures, fan speeds, ECC memory errors, PCI bus errors, hard drive errors, CPU usage, memory utilization, disk I/O rates, disk usage high/low watermarks, interconnect performance, Ethernet network performance, and application metrics.

  3. Weekly System Operational Summary - A compilation and analysis of the data gathered during the previous week's 'Remote Management' and 'Hardware & Performance Monitoring' activities. The summary report focuses on meeting the customer's operational computing needs and proactively addressing issues before they can impact the productivity or availability of the system.
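
To make the "disk usage high/low watermarks" item in the monitoring service more concrete, here is a minimal sketch of how a watermark alarm is commonly built. It is an illustration only, not SAI's tooling: the names `WatermarkAlarm` and `disk_used_pct` and the 90%/80% thresholds are assumptions chosen for the example. It also assumes one common reading of the two watermarks, where the alarm is raised at the high watermark and cleared only once usage falls back to the low watermark, so that a value hovering near a single threshold does not trigger repeated alerts.

```python
import shutil
from dataclasses import dataclass


@dataclass
class WatermarkAlarm:
    """Disk usage alarm with hysteresis (thresholds are illustrative assumptions)."""
    high_pct: float = 90.0  # raise the alarm at or above this usage
    low_pct: float = 80.0   # clear the alarm at or below this usage
    active: bool = False

    def update(self, used_pct: float) -> bool:
        """Feed one usage sample; return True while the alarm is raised."""
        if used_pct >= self.high_pct:
            self.active = True
        elif used_pct <= self.low_pct:
            self.active = False
        # Between the two watermarks the previous state is kept.
        return self.active


def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    alarm = WatermarkAlarm()
    pct = disk_used_pct("/")
    print(f"/ is {pct:.1f}% full, alarm raised: {alarm.update(pct)}")
```

In practice a check like this would run on a schedule for each monitored filesystem, with the alarm state feeding whatever notification channel the site uses.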
