Host Restart

Incident Report for True North Ned Web Services

Resolved

The NWS NOC team has continued to monitor host performance over the last week and has determined the following root cause:

At 14:30 PDT on 10/25, one of the NWS Las Vegas cluster hosts rebooted during production. The VMs running on this host were automatically migrated via HA to other functional hosts and brought back online. After reviewing logs, we determined that a CPU wait state issue caused the kernel to halt, forcing the restart.

That evening we upgraded firmware across all cluster nodes and implemented BIOS adjustments recommended by AMD for situations like this. Once the work was complete, we verified that all cluster nodes joined properly and have continued to monitor them since.

Since the work performed on the evening of 10/25, we have not seen anything indicative of a possible failure. Our monitoring systems continue to operate as designed and will notify us if anything comes up.
Posted Oct 31, 2022 - 10:49 PDT

Update

We are continuing to monitor for any further issues.
Posted Oct 31, 2022 - 10:44 PDT

Update

The NWS NOC team is continuing to monitor host stability at this time. Cluster load is normal, and we have implemented additional stability improvements by means of firmware and configuration tweaks provided by AMD.
Posted Oct 27, 2022 - 09:56 PDT

Monitoring

We have identified a load-related issue that caused the host to restart. We are in the process of redistributing cluster load across additional nodes and will monitor everything upon completion. This work is currently underway.
Posted Oct 25, 2022 - 15:56 PDT

Investigating

A host restart in the NWS cluster has caused several VMs to be reallocated to other hosts. We are tracking the issue and verifying that all VMs are back online.
Posted Oct 25, 2022 - 14:31 PDT
This incident affected: Ned Web Services Las Vegas Overall Availability (Las Vegas Virtual Machine Availability).