Isilon - The /var partition is near capacity


Issue: When the /var partition reaches 75%, 85%, or 95% of capacity, an event is logged and an alert is sent.

Fix: Rotate logs
If the /var partition returns to a normal usage level, review the list of recently written logs to determine if a specific log is rotating frequently. Rotation can resolve the full-partition issue by compressing or removing large logs and old logs, thereby automatically reducing partition usage.
Check the percentage of free inodes
Open an SSH connection to the node that reported the error and log in using the "root" account.
Run the following command:
df -i |grep var |grep -v crash

Output similar to the following appears:
Filesystem        1K-blocks   Used   Avail  Capacity  iused   ifree  %iused  Mounted on
/dev/mirror/var0    1013068  49160  882864        5%   1650  139276      1%  /var
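
The check can be scripted. The following is a minimal sketch, not an Isilon-supplied tool: the function name check_var_usage is hypothetical, and it assumes the FreeBSD-style df -i column layout shown above. It flags /var when block usage reaches 75% (the first alert level) or inode usage reaches 90%.

```shell
#!/bin/sh
# Sketch only: check_var_usage is a hypothetical helper name.
# Reads `df -i` output on stdin with the columns:
#   Filesystem 1K-blocks Used Avail Capacity iused ifree %iused Mounted on
check_var_usage() {
    awk '
        $NF == "/var" {
            cap = $5; sub(/%/, "", cap)    # block usage, "5%" -> 5
            ino = $8; sub(/%/, "", ino)    # inode usage, "1%" -> 1
            status = ""
            if (cap + 0 >= 75) status = status " CAPACITY>=75%"
            if (ino + 0 >= 90) status = status " INODES>=90%"
            if (status == "")  status = " OK"
            printf "%s capacity=%d%% iused=%d%%%s\n", $NF, cap, ino, status
        }
    '
}

# On a node (not executed here):
#   df -i | grep var | grep -v crash | check_var_usage
```

Matching on the last column being exactly /var also skips /var/crash, so the grep filters are optional when the helper is used.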

If the %iused value is 90% or higher, reduce the number of files in the /var partition using one of the methods described below:
Remove files that do not belong in the /var partition.
On the node that generated the alert, run the following command to list files in the /var partition that are larger than 5 MB (find counts -size in 512-byte blocks, so +10000 is about 5 MB):

find -x /var -type f -size +10000 -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

In the output, look for files that do not typically belong in the /var partition, such as a OneFS installer file, a log gather, or a user-created file.
Remove the files or move them to the /ifs directory. If you are unsure what to remove, contact Isilon Technical Support for assistance.
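
When many files are listed, sorting them by size makes the largest candidates easier to spot. The following is a minimal sketch with a hypothetical helper name, list_large_files. It swaps in -xdev (instead of the -x option used above) and du -k (instead of ls -lh) so the same function also runs on non-BSD systems. Whether the find on a given OneFS release accepts -xdev is an assumption to verify.

```shell
#!/bin/sh
# Sketch only: list_large_files is a hypothetical helper name.
# Usage: list_large_files DIR [BLOCKS]
#   BLOCKS is in 512-byte units, default 10000 (about 5 MB).
# Prints "<size in KB> <path>", largest first, without leaving DIR's filesystem.
list_large_files() {
    dir=$1
    blocks=${2:-10000}
    find "$dir" -xdev -type f -size +"$blocks" -exec du -k {} + 2>/dev/null | sort -rn
}

# On a node (not executed here):
#   list_large_files /var
```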
Determine if a process is holding a large file open

You can use the fstat command to list the open files on a node or in a directory, or to list the files that were opened by a particular process. A list of the open files can help you identify the processes that are writing large files. See article 16648, "How to use the fstat command to list the open files on a node".
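
As an illustration, the open files on the /var file system can be ranked by size. This is a sketch under stated assumptions: it assumes the standard FreeBSD fstat column layout (USER CMD PID FD MOUNT INUM MODE SZ|DV R/W) and the -f option that restricts output to one file system. Neither has been checked against the article referenced above, and the helper name largest_open_files is hypothetical.

```shell
#!/bin/sh
# Sketch only: largest_open_files is a hypothetical helper name.
# Reads fstat output on stdin, assumed columns:
#   USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
# Prints "<size> <pid> <command> <inode>" for the N largest open files.
largest_open_files() {
    awk 'NR > 1 && $8 ~ /^[0-9]+$/ { print $8, $3, $2, $6 }' |
        sort -rn |
        head -n "${1:-10}"
}

# On a node (not executed here):
#   fstat -f /var | largest_open_files 5
# An inode number from the output can then be mapped to a path with:
#   find -x /var -inum <inode>
```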

If neither of the above tasks resolves the issue, continue with the following solution:
Limit the rollover file size and compress the file
Open an SSH connection on any node in the cluster and log in using the "root" account.
Run the following commands to create a working copy of the /etc/newsyslog.conf file in /ifs and a backup in /etc:
cp /etc/newsyslog.conf /ifs/newsyslog.conf
cp /etc/newsyslog.conf /etc/newsyslog.bak

Open the /ifs/newsyslog.conf file in a text editor.
Locate the following line:
/var/log/wtmp 644 3 * @01T05 B

Change the line to:
/var/log/wtmp 644 3 10000 @01T05 ZB

These changes instruct the system to roll over the /var/log/wtmp file when it reaches 10 MB (the size field is in kilobytes) and to compress the rotated file with gzip (the Z flag).
Save and close the /ifs/newsyslog.conf file.
Run the following command to copy the updated file to all nodes on the cluster:
isi_for_array 'cp /ifs/newsyslog.conf /etc/newsyslog.conf'
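
The manual edit can also be scripted, which avoids typos when the change is repeated on several clusters. The following is a minimal sketch with a hypothetical helper name, set_wtmp_rollover. It rewrites only the /var/log/wtmp entry and passes every other line through unchanged.

```shell
#!/bin/sh
# Sketch only: set_wtmp_rollover is a hypothetical helper name.
# Reads a newsyslog.conf on stdin and writes the modified file to stdout.
# Entry fields: logfile mode count size when flags
set_wtmp_rollover() {
    awk '
        $1 == "/var/log/wtmp" && NF >= 6 {
            $4 = "10000"                  # size limit in KB (about 10 MB)
            if ($6 !~ /Z/) $6 = "Z" $6    # add the gzip flag, keep existing flags
        }
        { print }                         # the rewritten entry is re-joined with single spaces
    '
}

# On a node (not executed here):
#   set_wtmp_rollover < /etc/newsyslog.conf > /ifs/newsyslog.conf
#   diff /etc/newsyslog.conf /ifs/newsyslog.conf
# After distributing the file, each node's copy can be checked with:
#   isi_for_array 'grep wtmp /etc/newsyslog.conf'
```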

If other logs are rotating frequently, or if the preceding solutions do not resolve the issue, run the isi_gather_info command to gather logs, and then contact Isilon Technical Support for assistance.

Ref: EMC KB Article 000471789

/var/log/isi_phone_home.log can grow without bound and fill the /var partition, causing problems with CELOG event generation and other processes

Issue: /var/log/isi_phone_home.log can grow without bound and fill the /var partition, which causes problems with CELOG event generation and other operational issues with processes.

Cause: Log rotation does not automatically rotate the log file generated by isi_phone_home.

Fix: If /var is more than 85% full on any node, run the following command:
# isi_for_array 'truncate -s 0 /var/log/isi_phone_home.log'
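
To avoid truncating the log on nodes that are not affected, the threshold test can be made explicit. The following is a minimal sketch. The helper name truncate_if_full is hypothetical, and the usage percentage is passed in as an argument so the logic can be tried on any file before it is used on a node.

```shell
#!/bin/sh
# Sketch only: truncate_if_full is a hypothetical helper name.
# Usage: truncate_if_full LOGFILE USED_PCT [THRESHOLD]
# Empties LOGFILE only when USED_PCT is above THRESHOLD (default 85).
truncate_if_full() {
    logfile=$1
    used=$2
    threshold=${3:-85}
    if [ "$used" -gt "$threshold" ] && [ -f "$logfile" ]; then
        truncate -s 0 "$logfile"
        echo "truncated $logfile (usage ${used}%)"
    else
        echo "left $logfile unchanged (usage ${used}%)"
    fi
}

# On a node (not executed here):
#   used=$(df /var | awk '$NF == "/var" { sub(/%/, "", $5); print $5 }')
#   truncate_if_full /var/log/isi_phone_home.log "$used"
```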

Ref: EMC KB Article Number 000516735

Isilon Health Check script

#cd to the Isilon Support Directory
IsilonCluster-1# cd /ifs/data/Isilon_Support

#Copy the Script from EMC through FTP
IsilonCluster-1# curl --disable-epsv -O ftp.emc.com/pub/rcm/Isilon/tools/IOCA
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  499k  100  499k    0     0  70871      0  0:00:07  0:00:07 --:--:--  233k

#Run the Script
IsilonCluster-1# perl IOCA

Output similar to the following appears:
Isilon On-Cluster Analysis                        0.1206
Live Cluster Analysis                             Wed Sep 19 10:43:08 2018
Cluster Name                                      IsilonCluster
Cluster GUID                                      XXXXXXXXXXXXXXXXXXXXXX
Node Count                                        6
Current OneFS Version                             8.0.0.4
Contact Information                               PASS
Email Settings                                    PASS
System Partition Free Space                       PASS
Drive Support Package (1.26)                      INFO
FCO F042415EE                                     PASS
FCO F031617FC/KB469133                            PASS
Highly Recommended Patches                        PASS
Node Firmware (10.1.6)                            INFO
ETAs                                              PASS
Hardware Status                                   PASS
BMC/CMC Hardware Monitoring                       PASS
Boot Disks                                        PASS
BXE Nodes                                         PASS
  DETAILS: 22 nodes have BXE interfaces: 1-4,9-26
Drives Health                                     PASS
Drive Load                                        PASS
Drive Stall Timeout                               PASS
Duplicate Gateway Priority                        PASS
Processes                                         PASS
IB Interfaces Active                              PASS
Memory                                            PASS
Mirror Status                                     PASS
Node Compatibility                                PASS
Access Zones                                      PASS (3)
OneFS Version                                     PASS
KB507031                                          PASS
Authentication Status                             PASS
Cluster Capacity                                  PASS
Cluster Encoding                                  PASS (utf-8)
DialHome & Remote Connectivity                    PASS
  DETAILS: Current Service States:
  DETAILS:    ConnectEMC Service is Enabled
  DETAILS:    RemoteSupport (isi remotesupport) is enabled
Critical Events                                   PASS
File Sharing                                      PASS
HDFS                                              PASS
SPN List                                          PASS
Cluster Health Status                             PASS
IDI Errors                                        PASS
Jobs Status                                       PASS
Jobs History                                      PASS
Licenses                                          PASS
LWIOD Log                                         PASS
Listen Queue Overflows                            WARN
  WARN: Listen Queue Overflows count over 50,000 on the following nodes: 1
NFS                                               PASS
Kernel Open Files Count                           PASS
Storage Pools                                     PASS
Cluster Services                                  PASS
SmartConnect Service IP                           PASS
Snapshot                                          PASS
SyncIQ                                            PASS
Cluster Time Drift                                PASS
Cluster Time Sync                                 PASS
Cluster Time Zone                                 PASS (America/Los_Angeles)
Upgrade Agent Port                                PASS
Upgrade Status                                    PASS
Node Uptime                                       PASS (100 days)
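
Because most checks pass, it can help to filter the report down to the checks that need attention. The following is a minimal sketch with a hypothetical helper name, ioca_findings. It keeps each check whose status is WARN or FAIL, together with its indented detail lines. WARN appears in the output above. FAIL is an assumed status name, and the usage line assumes the script writes its report to standard output.

```shell
#!/bin/sh
# Sketch only: ioca_findings is a hypothetical helper name.
# Reads an IOCA report on stdin and prints only the checks with a WARN or
# FAIL status (FAIL is an assumed status name), plus their detail lines.
ioca_findings() {
    awk '
        /^[^ ]/ {
            # A check line starts in column 1; remember whether it needs attention.
            show = ($0 ~ /[ ](WARN|FAIL)$/ || $0 ~ /[ ](WARN|FAIL)[ ]/)
            if (show) print
            next
        }
        show && /^[ ]/ { print }    # indented detail lines of a kept check
    '
}

# On the cluster (not executed here):
#   perl IOCA | ioca_findings
```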



Physical server with Unity boot LUN rebooted during Unity SP reboot

Issue: A physical server that boots from a Unity LUN rebooted unexpectedly while a Unity storage processor (SP) was rebooting.

Errors:
Warning Host 1076      User1  The reason supplied by user XXXXX for the last unexpected shutdown of this computer is: Other (Unplanned) Reason Code: 0xa000000 Problem ID: Bugcheck String:
Error SHost       1001      Microsoft-Windows-WER-SystemErr           The computer has rebooted from a bugcheck. The bugcheck was: 0x000000d1 (0xffffe8013242b000, 0x0000000000000002, 0x0000000000000000, 0xfffff8017ad60e81). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: DDMMYY-35046-01.

Cause: This is a known issue between Microsoft security patches and PowerPath. One of the installed security patches caused the server to bluescreen and crash.

Fix:
The issue affects PowerPath on Windows Server 2012 R2 after the following recent Microsoft updates are installed: KB3185279, KB3185331, KB3192404, KB3197875, KB3197874, KB3205401. See the Windows Server 2012 R2 update history:
https://support.microsoft.com/en-in/help/24717/windows-8-1-windows-server-2012-r2-update-history
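
To confirm whether a server has any of the listed updates, a text export of its installed hotfixes can be searched for those IDs. The following is a minimal sketch with a hypothetical helper name, suspect_updates. It assumes the export was saved to a file and is examined on a system with a POSIX shell. One way to produce such an export, stated here as an assumption, is the PowerShell Get-HotFix cmdlet.

```shell
#!/bin/sh
# Sketch only: suspect_updates is a hypothetical helper name.
# Reads a text list of installed hotfixes on stdin and prints, one per line,
# the update IDs from this article that appear in it.
suspect_updates() {
    grep -E -o -i -w 'KB(3185279|3185331|3192404|3197875|3197874|3205401)' |
        tr 'a-z' 'A-Z' |
        sort -u
}

# Example (not executed here), where hotfixes.txt is the exported list:
#   suspect_updates < hotfixes.txt
```

Empty output means that none of the listed updates were found in the export.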

Follow this KB article to fix the issue:
490865: Windows 2012 R2 server crash pointing to EMC PowerPath driver EMCPMPX.SYS https://support.emc.com/kb/490865

++++++++++++++++++++++++++++++++++++++++++++++++++++
OS Name               Microsoft Windows Server 2012 R2 Standard
Version               6.3.9600 Build 9600
Other OS Description  Not Available
OS Manufacturer       Microsoft Corporation
System Name           HostName
System Manufacturer   Cisco Systems Inc
System Model          UCSB-B200-M3
System Type           x64-based PC

manfac: Cisco Systems, Inc.
sernum: FCH1824J0E9
 model: Cisco VIC FCoE HBA
descrp: Cisco VIC-FCoE Storport Miniport Driver
symblc: Cisco VIC FCoE HBA FW:2.1(3d) DRV:2.3.0.20

EMC powermt for PowerPath (c) Version 6.0 SP 2 (build 206)

*******************************************************************************
*                        Bugcheck Analysis                                    *
*******************************************************************************
Bugcheck code 000000D1
Arguments ffffe801`3242b000 00000000`00000002 00000000`00000000 fffff801`7ad60e81

RetAddr           : Args to Child                                                           : Call Site
fffff802`9d1e3ee9 : 00000000`0000000a ffffe801`3242b000 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff802`9d1e273a : 00000000`00000000 ffffe001`a31b6010 00000000`00000000 fffff801`7ad6a00c : nt!setjmpex+0x37d9
fffff801`7ad60e81 : fffff801`7ad74dbf ffffe801`270008d0 00000000`00000000 7fffffff`ffffffff : nt!setjmpex+0x202a
fffff801`7ad74dbf : ffffe801`270008d0 00000000`00000000 7fffffff`ffffffff 00000000`00000000 : EmcpMpx!EmcpMpxLogPlatfEvent+0x3f99
fffff801`7ad75236 : fffff801`7ad74f50 ffffe001`a31b68b0 ffffe001`a5400480 00000000`00000000 : EmcpMpx!EmcpMpxLogPlatfEvent+0x17ed7
fffff801`7ad62958 : ffffe001`a1e0f010 ffffe001`a0be8010 ffffe001`a0bd3be0 ffffe801`270008d0 : EmcpMpx!EmcpMpxLogPlatfEvent+0x1834e
fffff801`7ad5c1c0 : ffffe001`a16c42a0 ffffe801`270007d0 ffffe801`270008d0 ffffe801`331ae810 : EmcpMpx!EmcpMpxLogPlatfEvent+0x5a70
fffff801`7ac21dcf : ffffe001`a15e2010 00000000`00000016 00000000`00000002 00000000`00000000 : EmcpMpx!PxDsmLamUnregister+0x1d14
fffff801`7ac082d0 : 00000000`00010000 00000000`00000004 ffffe001`a16c41b0 00000000`ffffffff : MPIO!DsmGetVersion+0xb2b
fffff801`7ac08b13 : 00000000`00000000 ffffe801`331ae810 00000000`00000007 ffffe801`269d8410 : MPIO+0x82d0
fffff802`9d13043e : ffffe001`a1c4d650 ffffe801`331ae810 ffffe801`27801a01 ffffe801`c00000c0 : MPIO+0x8b13
fffff801`7b28f5b3 : ffffe801`331ae810 fffff801`7b291a00 00000000`00010000 ffffe001`a16c41b0 : nt!IoCompleteRequest+0x2fa
fffff801`7b291574 : ffffe801`2713eba0 ffffe001`a16c41b0 ffffe801`331ae810 fffff801`7b28d60e : storport!StorPortNotification+0x2173
fffff801`7b28e360 : ffffe001`a1655010 ffffe001`40200382 00000000`00010000 ffffe001`a1739400 :

Another option is to upgrade from PowerPath 6.0 SP2 to PowerPath 6.3, which includes all of the fixes described above.