System Health is a NetMRI feature to provide a view of the system health of the NetMRI appliance. NetMRI provides two visual inputs to notify and assist the administrator in responding to issues in the NetMRI appliance:

  • Report message banners at the top of the NetMRI screen provide quick notification of problems.
  • A Settings page, System Health, provides a more detailed list of the problems affecting the system, including Controller and Collector appliances in an Operations Center environment (where applicable).

System Health input categories include the following:

  • Hardware – Appliance hardware, including fans and power supplies, internal and external (ambient) temperatures, and RAID array status.
  • Software – The appliance's NetMRI software.
  • Network – Connectivity on MGMT and SCAN interfaces, and reachability to external database archive systems.
  • Storage – Available disk space and internal hard disk status.
  • Platform Capacity – Warnings about exceeding support capacity for devices, interfaces and end hosts.
  • Processing – Warnings about various causes of excessive demands on system resources.
  • Collector Connectivity – Operations Center Collector network reachability. Does not apply to standalone NetMRI appliances.
  • Configuration – Unassigned VRF notifications, which let the administrator know that a virtual routing and forwarding (VRF) network has been discovered and must be mapped to a network view.

Operations Center System Health Listings

System Health features for the NetMRI Operations Center environment will list all issues associated with the Operations Center appliance and for all of its associated Collectors.
All reported issues are the same for all alerts described in the previous topics; the main difference is that the System Health feature applies globally to all appliances and virtual appliances within the distributed Operations Center environment.


Note: Every message listed in the System Health page provides an Alert Code, similar to the following:

SOFT001

If you need to communicate with Customer Support for an issue, ensure that you provide this code to the support representative.


In this section, you will find descriptions for all alerts in the System Health category, descriptions of possible causes for the issue, and potential fixes for each alert.

System Health Color Coding

System Health alerts provide the following standard color coding in the System Health page under NetMRI Settings:

  • Green: indicates no issues currently present in the category.
  • Yellow: Warning. Warning health alerts appear when an issue arises that could lead to more severe problems in the future, or when a configuration issue should be addressed. For example, a disk utilization level of 70% in a NetMRI appliance, Operations Center, or a Collector in an Operations Center network will raise a Warning alert, as will a detected VRF network that is not yet mapped to a network view.
  • Red: Critical. An issue that needs to be addressed as soon as possible. Critical alerts occur in cases where, for example, storage utilization is at 90% or higher, or a system fan fails or is removed from the appliance.
  • Grey: Offline. Alerts colored Grey appear only for Operations Center Collectors that are offline due to expected causes, such as a Collector being taken offline for replacement or changes to configuration.

Banner System Health messages appear only in yellow (Warning) and red (Critical). Click directly on the banner text to display the System Health page with its alert listings.
To hide System Health banners from non-Admin NetMRI users, open the Settings –> General Settings –> Advanced Settings page and locate the Hide the system banners from non-admin users setting (on the last page of Advanced Settings, under User Administration). Click the Action icon and choose Edit, then choose Yes and click OK.
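
The color coding above can be sketched as a small classifier. This is only an illustration: the 70% and 90% figures are the examples quoted in the list above, and the names `Severity`, `classify_disk_utilization`, and `banner_visible` are hypothetical, not part of any NetMRI API. The actual thresholds used by the product may differ. Grey is omitted because it applies only to offline Collectors, not to utilization levels.

```python
from enum import Enum


class Severity(Enum):
    GREEN = "No issues"
    YELLOW = "Warning"
    RED = "Critical"


# Illustrative thresholds taken from the examples in the text;
# the real product thresholds are an assumption here.
WARNING_PCT = 70.0
CRITICAL_PCT = 90.0


def classify_disk_utilization(used_pct: float) -> Severity:
    """Map a disk utilization percentage to a color-coded severity."""
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError(f"utilization must be within 0-100, got {used_pct}")
    if used_pct >= CRITICAL_PCT:
        return Severity.RED
    if used_pct >= WARNING_PCT:
        return Severity.YELLOW
    return Severity.GREEN


def banner_visible(severity: Severity) -> bool:
    """Banners appear only for Warning (yellow) and Critical (red)."""
    return severity in (Severity.YELLOW, Severity.RED)
```

For example, `classify_disk_utilization(70)` yields `Severity.YELLOW`, for which `banner_visible` returns `True`.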

Categories of Health Status

System Health alerts also support notification subscriptions (see Subscribing to Notifications for more information). System Health notifications fall into the following general types: System Hardware Alert, Software Health Alert, Processing Health Alert, Storage Health Alert, Network Health Alert, Platform Capacity Health Alert, and Collector Connectivity Health Alert.
Individual alert types gather under the seven basic System Health categories. The following table provides a summary of the System Health alerts.

Health Alert Category

Alert Messages

General Notes

Hardware (see Details on Hardware Alerts for more information)

RAID Drive <X> Failed.

RAID Array Failed.

Fan <X> Failed.

Power Supply <X> Failed.

High Ambient Temperature.

High Internal Temperature.

RAID Battery Failed.

RAID Array Degraded.

This category applies only to hardware-based NetMRI systems and will not appear for virtual machine-based NetMRI instances.
RAID messages apply only to appliances that directly support RAID, including the NT-2200 and NT-4000 models.
NetMRI 1102-A models do not support hardware monitoring alerts.
NT-1400 and NT-2200 systems do not report Ambient Temperatures.
Double-clicking any hardware Issue that appears in this category opens the Settings –> Notifications –> Hardware Status page.

Network (see Details on Network Alerts for more information)

High rate of network errors on MGMT port.

Network link down on MGMT port.

High rate of network errors on SCAN port.

Network link down on SCAN port.

General network connectivity issues on the NetMRI appliance.

Platform Capacity (see Details on Platform Capacity Alerts for more information)

Number of interfaces <count> exceeds Platform Interface Limit of <limit>.
Number of end hosts <count> exceeds Platform SPM End Host Limit of <limit>.
Number of devices <count> exceeds Platform Total Device Limit of <limit>.

Reflects issues where the current level of discovered network devices, interfaces or end hosts is exceeding the platform limits for the appliance. Does not apply to licensed limits. Platform limit values can be located in the Settings icon –> Setup –> Settings Summary page.

Processing (see Details on Processing Alerts for more information)

Processing Capacity is being exceeded.

Processing Alerts reflect Issues where the system processing capacity is being exceeded in the current system configuration.

Software (see Details on Software Alerts for more information)

A software problem was detected.
A software problem was detected during Weekly Maintenance.

In all cases, contact Customer Support for assistance.

Storage (see Details on Storage Alerts for more information)

Low on disk space

Critically low on disk space

Cannot Connect to remote archive storage

Could not save archive to remote storage  <hostname>

Disk <X> Failed.

Low on Disk Space indicates that System Health recommends preventive action to increase available disk space in the appliance.
Critically Low on Disk Space indicates an impending failure due to insufficient disk space.

Collector Connectivity (see Details on Operation Center Collector Alerts for more information)

Connection to Collector <X> lost.
Collector <X> Reset.
Collector <X> is Rebooting.

Issues associated with Collector reachability and connectivity in an Operations Center deployment.

Configuration (see Details on Configuration Alerts for more information)

New unassigned VRF discovered.

Warning notification that a VRF network has been discovered and should be placed into a network view by the administrator.

Details on Software Alerts

System Health monitors the overall health and operation of the NetMRI software. It is used for reporting potentially important software issues to Customer Support that might otherwise go unnoticed by the user. In all cases, software problem messages should be reported to Customer Support along with the issue code.

Alert Message

User Action

Warning — A software problem was detected. Contact Support.

Contact Customer Support.

Critical — A critical software problem was detected.

Contact Customer Support.

Warning — A software problem was detected during Weekly Maintenance.

Contact Customer Support.

Details on Network Alerts

Network alerts apply to the MGMT and SCAN Ethernet interfaces on the NetMRI appliance.

Alert Message

User Action

High rate of network errors on MGMT port.

Check the network connection for the appliance MGMT port, including the neighboring interface configuration.

Critical — Network link down on MGMT port.

Check the network connection for the appliance MGMT port, including the neighboring interface configuration.

Warning — High rate of network errors on SCAN port.

Check the network connection for the appliance SCAN port, including the neighboring interface configuration.

Critical — Network link down on SCAN port.

Check the network connection for the appliance SCAN port, including the neighboring interface configuration.

Details on Platform Capacity Alerts

Platform Capacity alerts do not necessarily reflect a problem in the NetMRI system. Each NetMRI appliance has an advisory limit on the number of discovered interfaces, discovered devices and discovered end host devices that it is expected to support, based on the disk space and system processing capabilities inherent in the appliance model. These values are called the Platform Capacity and are also reflected in the NetMRI Configuration values shown under the Settings icon –> Setup –> Settings Summary page.
Unlike other System Alert categories, Platform Capacity warnings always appear whenever any of the three advisory system limits (number of managed interfaces, number of end host devices, number of discovered devices) is exceeded by the appliance. Note that the Processing category (also see Details on Processing Alerts) provides the same three warnings (along with others) in its alerts category. When any of these three limits is violated as the result of a processing issue, one of the Platform Capacity warnings will also appear in the notification. These limits are not enforced and the NetMRI appliance operates normally; excess devices continue to appear in the Discovered Devices table. (For related information, see Understanding Platform Limits, Licensing Limits and Effective Limits.)

Alert Message

User Action

Number of interfaces <count> exceeds Platform Interface Limit of <limit>.

The number of interfaces counted across all discovered devices exceeds the platform capacity of the appliance or Operations Center. Consider reducing the size of discovery ranges or move a discovery range to a different appliance.

Number of end hosts <count> exceeds Platform SPM End Host Limit of <limit>.

Number of SPM (Switch Port Manager) discovered end host devices exceeds the platform capacity of the appliance or Operations Center. Consider reducing the size of discovery ranges or move a discovery range to a different appliance.

Number of devices <count> exceeds Platform Total Device Limit of <limit>.

The total number of discovered devices exceeds the platform capacity of the appliance or Operations Center. Reduce the size of discovery ranges, and/or the number of seed routers or move a discovery range to a different appliance.
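
The three Platform Capacity checks can be sketched as a comparison of current counts against advisory limits. The message templates are copied from the table above; everything else (`EntityCounts`, `platform_capacity_warnings`, and the sample numbers) is a hypothetical illustration, not how NetMRI itself is implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EntityCounts:
    """Hypothetical container used for both current counts and advisory limits."""
    interfaces: int
    end_hosts: int
    devices: int


# Message templates as documented in the Platform Capacity alert table.
_TEMPLATES = (
    ("interfaces", "Number of interfaces {count} exceeds Platform Interface Limit of {limit}."),
    ("end_hosts", "Number of end hosts {count} exceeds Platform SPM End Host Limit of {limit}."),
    ("devices", "Number of devices {count} exceeds Platform Total Device Limit of {limit}."),
)


def platform_capacity_warnings(counts: EntityCounts, limits: EntityCounts) -> List[str]:
    """Return one warning per advisory limit that is exceeded."""
    warnings = []
    for field, template in _TEMPLATES:
        count = getattr(counts, field)
        limit = getattr(limits, field)
        if count > limit:  # advisory only: nothing is blocked or enforced
            warnings.append(template.format(count=count, limit=limit))
    return warnings
```

Each limit is evaluated independently, so exceeding only the interface limit produces only the interface warning.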

Details on Hardware Alerts

The Hardware Alerts category applies only to hardware-based NetMRI systems and will not appear for virtual machine-based NetMRI instances. Hardware issues may involve system fans, power supplies, physical hard drives, and RAID Controllers. Temperature alerts also appear under the Hardware category.
Issues associated with RAID appear only for systems that support RAID disk arrays.
Hardware alerts appear for the NetMRI NT-1400, NT-2200 and NT-4000 appliances. System Health monitors hardware elements such as system fans, the RAID controller status, ambient cooling and internal cooling.
Important subcategories of Hardware alerts include the following:

  • Cooling: Fan failures, high ambient temperatures (the temperatures outside of the unit are too high), high internal temperatures. NetMRI NT-1400 and NT-2200 systems do not have ambient temperature sensors and will not display the Ambient Temperature is high alert.
  • RAID: Applies only to NetMRI systems that support RAID disk arrays. Possible alerts include RAID Array Failed and RAID Drive <X> Failed.
  • Power Supply: Alerts include Power Supply <1|2> Failed.

Double-clicking any hardware alert opens the alert in the Settings –> Notifications –> Hardware Status page.

Alert Message

User Action

RAID Drive <X> Failed

Replace the hard disk with a replacement drive authorized by Infoblox.

RAID Array Failed

Contact Customer Support.

Fan <X> Failed

Replace the system fan. Appears only in systems where system fans are user-replaceable, as with the NetMRI NT-2200 and NT-4000 devices. Fan assemblies must be replaced with authorized Infoblox parts. Contact Customer Support if this message appears in systems where fans are not user-replaceable.

Power Supply <X> Failed

Check Power Supply operation. Message appears only for systems in which a redundant 1+1 power supply configuration is available and running in the device in question. (For a single-power-supply system, the appliance simply shuts down.) The alerts also allow for the possibility that a power supply is unplugged.

Ambient temperature is high. Internal temperature is high.

Both messages may appear for the same system, with internal temperature being affected by the ambient temperature. Reduce the ambient temperature where possible; if the Internal temperature remains high, look for a Fan Failed error message along with the Internal Temperature message. Contact Customer Support if an Internal Temperature is High issue persists when conditions are otherwise optimal.

Critical — RAID Battery failed.

Contact Customer Support.

RAID Array Degraded.

The RAID array is not fully operational due to a disk in the process of rebuilding or a disk being removed. If a disk has been removed in preparation for replacement, this issue will also appear, and will clear when the replacement is finished rebuilding. If you know that no disk replacement operation has been started with the appliance and this issue appears, contact Customer Support.

Details on Storage Alerts


Note: Disk space that is set aside for database archive creation is considered non-usable by the system.


The Storage health status provides a link to a special Storage Trend chart. To view it, click any link under the Storage category in the System Health page. Storage is particularly sensitive, for example, when NetMRI runs on a VMware VM and begins to run up against its disk storage limits, or on standalone NetMRI systems that run a single hard disk.

Alert Message

User Action

A software problem was detected. Contact Support.

Contact Customer Support.

Low on disk space

A Warning health issue (Low on Disk Space) appears when overall storage utilization exceeds levels considered safe for long-term operation; System Health recommends preventive action to increase available disk space in the appliance.
A Critical health issue (Critically Low on Disk Space) appears when overall storage utilization reaches a level that indicates an impending failure due to insufficient disk space.
To begin addressing this issue, remove any unneeded files from the administrator home directories through the NetMRI administrative shell.

Cannot connect to remote archive storage

Check reachability to the system providing the remote storage on the network.

Could not save archive to remote storage <hostname>

Check the operating state and configuration for the system providing the remote storage on the network. NOTE: While this alert is active, you can suppress it in the System Health page once you consider the issue solved. After the administrator remedies the error, NetMRI will not display the alert again unless the issue recurs during the next archiving attempt.

Disk <X> Failed.

Check the LEDs on the disk drives for the appliance and replace the disk drive. For information on the behavior of disk drive LEDs in your system, check the Infoblox Installation Guide for your appliance.

RAID Battery failed.

Contact Customer Support.

Warning — RAID Array Degraded.

Usually, in this case the RAID array is in a degraded state due to a disk in the process of rebuilding or a disk being removed. If a disk has been removed in preparation for replacement, this issue will also appear, and clears when the replacement is finished rebuilding. If you know that no disk replacement operation has been started with the appliance and this issue appears, contact Customer Support.


The Storage Trend chart provides a two-week sliding window, with measurements at 6-hour intervals along the horizontal X axis. The vertical Y axis reflects the total storage capacity. The trend chart example to the right shows available disk storage for a two-day period; the number of increments in the chart increases up to a two-week period, after which the chart acts as a moving window across the timeline. The latest measurement date appears on the far right.
As available disk space decreases, the trendline declines to the right and approaches the X axis. When the admin frees disk space in response to an alert, the chart line inclines upwards to the right.
When a storage issue appears, System Health checks data retention settings. If any data retention categories occupy a significant amount of otherwise usable disk space, and are set beyond factory defaults, NetMRI will display a request to change the data retention settings in response to this alert.
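
The sliding-window behavior of the Storage Trend chart described above can be sketched with a bounded queue: a two-week window at one measurement every 6 hours holds at most 14 × 4 = 56 samples. The class and method names below are hypothetical, and the sample values are made up for illustration; this is a sketch of the windowing idea, not NetMRI's implementation.

```python
from collections import deque
from typing import List

SAMPLE_INTERVAL_HOURS = 6
WINDOW_DAYS = 14
MAX_SAMPLES = WINDOW_DAYS * 24 // SAMPLE_INTERVAL_HOURS  # 56 samples


class StorageTrend:
    """Keeps the most recent two weeks of available-disk-space samples."""

    def __init__(self) -> None:
        # Once the deque is full, appending drops the oldest sample,
        # which gives the "moving window" effect.
        self._samples = deque(maxlen=MAX_SAMPLES)

    def record(self, available_gb: float) -> None:
        self._samples.append(available_gb)

    def samples(self) -> List[float]:
        """Oldest sample first, latest sample last (far right of the chart)."""
        return list(self._samples)

    def is_declining(self) -> bool:
        """True if the latest sample is lower than the oldest one in the window."""
        return len(self._samples) >= 2 and self._samples[-1] < self._samples[0]
```

Until 56 samples have been recorded the window simply grows; after that, each new measurement displaces the oldest one.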

Details on Processing Alerts

Processing alerts provide warning messages regarding excessive demand on system resources, including the following possible causes:

  • Specifying too many jobs to execute simultaneously on an appliance.
  • Requesting too many reports to run in a given time period, or scheduling too many reports concurrently.
  • Exceeding recommended limits on managed devices and interfaces.
  • Attempting to discover too large a network.
  • Deploying too many Policy rules.

Other warnings notify you when a NetMRI appliance exceeds its licensing limits.

Alert Message

User Action

System Processing capacity is being exceeded.

A number of causes may contribute to processing slowdowns on the appliance.
Some processing warnings reflect higher quantities of various network entities than can be supported by the hardware platform:

  • Number of interfaces <count> exceeds the recommended capacity of <limit>.
    Solution: Consider reducing discovery ranges.
  • Number of end hosts <count> exceeds the recommended capacity of <limit>.
    Solution: Consider reducing discovery ranges.
  • Number of devices <count> exceeds the recommended capacity of <limit>.
    Solution: Consider reducing discovery ranges.

If any one of these three processing warnings appears, a Platform Capacity message of the same type also appears.

Other processing warnings include the following:

  • Policy Rule deployment exceeds the recommended limit of <X>.
    Solution: Reduce the number of deployed Policy rules.
  • Executed jobs exceed the recommended limit of <X> per 24 hours.
    Solution: Reduce the number of scripted Jobs that execute over a 24-hour period.

The following messages are enforced on current platforms and appear only on appliances a) on which a Processing Capacity alert is present; b) that are over-provisioned with discovered devices beyond the licensed limit:

  • Number of Licensed Devices exceeds licensed platform limit of <X> devices.
    Solution: Un-license some network devices.

Appliances cannot have more licenses in use than the number of installed licenses. Appliances can have more installed licenses than the maximum allowed if they are grandfathered in from older deployments with higher licensed levels. These messages appear only if the number of licenses exceeds the maximum number of licenses allowed for the hardware platform. For more information, see Understanding Platform Limits, Licensing Limits and Effective Limits.

Details on Operation Center Collector Alerts

Collector alerts apply only to Operations Center deployments with one or more Collector systems, whether VM-based or physical appliances.

Alert Message

User Action

Connection to Collector <X> lost.

The VPN between the Operations Center appliance and the collector is not working, preventing the OC from reaching the Collector. This alert appears if connectivity to the Collector is unexpectedly lost, due to a failed VPN, a flawed or failed physical network connection, or an issue with the Collector instance.

Collector <X> Reset.

The Collector appliance or VM has sent a message to the administrator that it is being reset. This message appears when the VPN connection is administratively disconnected.

Collector <X> is Rebooting.

The Collector appliance or VM has sent a message to the administrator that it is being rebooted. This message appears after a Connection Lost message, for a grace period following an administrative reboot.


Details on Configuration Alerts

After discovering unassigned VRFs, NetMRI displays a warning alert in the main page with a hyperlink to open the System Health page to view details. The Unassigned VRF message includes a hyperlink to launch the Network View Editor, which is required to assign unassigned VRFs to a network view. You can suppress Unassigned VRF System Health alerts.

Alert Message

User Action

An unassigned VRF was detected.

Open the Network View editor, create a new network view if necessary, and assign the discovered VRF instances to it.

Collector Time Zone must match the Operation Center Time Zone.

Use the Admin Shell CLI on the collector to run the configure server command to adjust Time Zone settings.
