Monitoring the Hardware for HyperScale X

HyperScale hardware is monitored to detect and report errors in the following components:

  • Disk Drives

  • Power Supply

  • Fan

  • NIC (Network)

Commvault Appliance HyperScale Hardware Alert

The Commvault Appliance HyperScale Hardware alert is triggered when a hardware error or failure is detected for any of the data drives (I/O errors, SMART errors), power supply, fan, or NIC, or when any of these components goes offline in the underlying hardware. You can configure the alert to send a notification to the CommCell administrator and hyperscalealerts@commvault.com when an error is detected. For instructions, see Configuring the Commvault Appliance HyperScale Hardware Alert.

On new installations of Commvault Platform Release 2023E, the alert is enabled by default for all HyperScale X nodes.

On existing appliances, when you upgrade to Commvault Platform Release 2023E, you must enable the new enhanced alerts after the upgrade. On the upgraded appliance nodes, the Dial Home for Hyperscale and Appliance Hardware Alert and Scale out disk health alerts remain enabled to monitor the hardware status and send alerts. To use the new Commvault Appliance HyperScale Hardware alert for hardware monitoring and the HyperScale platform alert, you must first disable the previously configured alerts and then enable the new alerts. For more information, see Disabling an Alert.

RefArch HyperScale Hardware Alert

The RefArch HyperScale Hardware alert is triggered when a hardware error or failure is detected for any of the data drives (I/O errors, SMART errors), power supply, fan, or NIC, or when any of these components goes offline in the underlying hardware.

On new installations of Commvault Platform Release 2023E, the alert is enabled by default for all Commvault HyperScale X nodes.

On existing installations, when you upgrade to Commvault Platform Release 2023E, you must enable the new enhanced alerts after the upgrade. On the upgraded nodes, the Dial Home for Hyperscale and Appliance Hardware Alert and Scale out disk health alerts remain enabled to monitor the hardware status and send alerts. To use the new RefArch HyperScale Hardware alert for hardware monitoring and the HyperScale platform alert, you must first disable the previously configured alerts and then enable the new alerts. For more information, see Disabling an Alert.

Enhanced HyperScale Data Drive Monitoring

The software detects disk failures by continuously monitoring SMART warnings, errors, and failures, as well as system log messages in the /var/log/messages file, for data drives within the storage pool. The software sends alerts for uncorrectable read and write counter errors and for any non-medium counter errors on data drives.

The software sends different alerts based on the type of data drive failure: predictive or real.

  • Predictive Data Drive Failure: The software monitors the SMART status of each drive during every health check cycle. If a predictive data drive failure is detected, such as an increase in uncorrectable read and write counter errors or non-medium counter errors between the previous and current health check cycles, a Warning: I/O Test Failure alert is sent.

    The software checks the errors again in the next health check cycle. If the counter error values have increased, the software sends the Warning: I/O Test Failure alert again. If the values have not increased, the alert condition is cleared.

    To determine whether the drives require proactive replacement, contact Commvault support.

    For predictive disk failures, the software does not change the status of the data drive and continues to report the status as Ready in the hardware monitoring report in the Command Center.

  • Real Data Drive Failure: If a real data drive failure occurs, the software sends a critical alert. You must contact Customer Support to open a case and obtain a replacement drive. The status of the data drive appears as Offline in the hardware monitoring report in the Command Center.

    For instructions to perform drive replacement, see Replacing Disks in HyperScale X Appliance Nodes.
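The cycle-to-cycle comparison described above can be illustrated with a short sketch. This is not Commvault code: the `DriveCounters` class, the `classify_cycle` function, the `drive_failed` flag, and the `"Critical"` label are hypothetical names chosen for illustration, and the sketch only assumes that each health check cycle yields a snapshot of the uncorrectable read, uncorrectable write, and non-medium error counters for a drive.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DriveCounters:
    """Hypothetical snapshot of one drive's error counters at one health check cycle."""
    uncorrectable_reads: int
    uncorrectable_writes: int
    non_medium_errors: int


def classify_cycle(
    previous: DriveCounters,
    current: DriveCounters,
    drive_failed: bool = False,
) -> Tuple[Optional[str], str]:
    """Return (alert, reported_status) for one health check cycle.

    drive_failed is an assumed input that stands in for however a real
    failure is detected; the documentation does not describe that check.
    """
    if drive_failed:
        # Real failure: critical alert, drive reported as Offline.
        return ("Critical", "Offline")

    # Predictive failure: any counter grew since the previous cycle.
    increased = (
        current.uncorrectable_reads > previous.uncorrectable_reads
        or current.uncorrectable_writes > previous.uncorrectable_writes
        or current.non_medium_errors > previous.non_medium_errors
    )
    if increased:
        # The warning is sent, but the drive status stays Ready.
        return ("Warning: I/O Test Failure", "Ready")

    # No increment: the alert condition is cleared.
    return (None, "Ready")


if __name__ == "__main__":
    # Four consecutive cycles for one drive (made-up counter values).
    cycles = [
        DriveCounters(0, 0, 0),
        DriveCounters(2, 0, 0),
        DriveCounters(2, 0, 1),
        DriveCounters(2, 0, 1),
    ]
    for prev, cur in zip(cycles, cycles[1:]):
        print(classify_cycle(prev, cur))
```

With the made-up values in the example, the first two comparisons produce the warning because a counter increased, and the third produces no alert because the counters are unchanged, which corresponds to the alert condition being cleared.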

Sample Output

Predictive Drive Failure

[Image: appliance_predictive_failure]

Actual Drive Failure

[Image: appliance_real_failure]

Dial Home for Hyperscale and Appliance Hardware Alert

The Dial Home for Hyperscale and Appliance Hardware Alert generates and sends a dial home (or call home) alert to the Administrator and hyperscalealerts@commvault.com when an error is detected in a disk drive, power supply, fan, or NIC, or when any of these components goes offline in the underlying hardware. This alert is generated by default for all HyperScale Appliances.

If the Appliance is installed only as a MediaAgent, then you must download the HyperScale Hardware Monitoring alert from the Commvault Store. To download the alert, see HyperScale Hardware Alerts.

You must configure the alert to send notifications to the cloud and to the HyperScale Appliance support team. For instructions, see Configuring the HyperScale Hardware Monitoring Alert.

On upgraded appliance nodes, the alert is still enabled to monitor the hardware status and send dial home alerts. To use the new Commvault Appliance HyperScale Hardware alert for hardware monitoring, you must first disable the dial home alert and then enable the new alert.

Additional Alerts

The following additional alerts are recommended to monitor your HyperScale entities. For more information about creating an alert, see Creating an Alert from the Alert Wizard.

  • Mount path Went Offline, Library Went Offline alerts

    Enable these alerts to generate a notification when a mount path or library goes offline. For more information about these alerts, see Predefined Alert Criteria - Device Status.

  • MediaAgent went offline alert

    Enable this alert to generate a notification when a MediaAgent goes offline. For more information about this alert, see Predefined Alert Criteria - MediaAgents.

  • Insufficient storage alert

    Enable this alert to generate a notification when there is insufficient disk space. For more information about this alert, see Predefined Alert Criteria - Library Management.

  • No DDB Space Reclamation from past N days alert

    Enable this alert to generate a notification when DDB space reclamation has not been performed in the past N days. For more information, see DDB Data Verification.
