SAN Troubleshooting
Rene Burema Brocade Communications March, 2008
Product Knowledge is Valuable
• Problem determination requires you to be able to identify
  – Products, associated port numbers, and LED status
  – Switch and port status
  – License requirements
  – Related compatibility information
• Available resources include
  – Brocade FOS Documentation
  – Brocade Connect and/or Brocade Partner Sites
  – Training materials including Products, FRUs and LEDs (Web-based training module associated with this course)
  – Brocade switch provider information including compatibility matrices
March 2008
SAN Troubleshooting Basics
® 2008 Brocade Communications Systems, Inc. All rights reserved.
Common SAN Problems
Many common SAN problems are related to the following (in alphabetical order):
• Configuration - Port, device, or switch is not correctly configured
  – Problems accessing a switch or connecting switches or end devices can be related to configuration problems
• Firmware Download - FTP configuration and release.plist confusion
• Licensing - Customers do not have the license to do what they are attempting
  – Problems connecting switches can be related to licensing problems
• Marginal Links - Bad or marginal cables/GBICs/SFPs
  – Problems related to performance or problems that occur when connecting switches or end devices can be related to marginal links
• Zoning - Zoning is not configured correctly
  – Problems that occur when end devices are not able to access each other can be related to zoning
What does the switch status tell you?
What can port status LEDs tell you?
Adding/Replacing a Switch in a Fabric and Resolving Fabric Segmentations
When to Add or Replace a Switch
• Faulty hardware
  – Components on a switch that are not FRUs
  – Motherboard, including FC ports
  – Damaged chassis
• Upgrading to new hardware
  – 2 Gbit/sec to 4 Gbit/sec
  – Port density
  – Increased availability
  – New features: FCR, FCIP, iSCSI
  – Replacing EOL hardware
• Growing your fabric
  – Increased port density per switch
  – Increased number of switches
• Whenever your switch provider recommends
Adding or Replacing a Switch
• Any switch added to an existing fabric must be configured properly
• LAN configuration information
• Fabric configuration information
• Your configuration plan should include a checklist that answers the following questions:
  – Are special port configurations required?
  – Are the correct license keys installed?
  – What versions of firmware are running in the fabric?
  – Will you be using any additional capabilities, e.g. ACLs, ADs, FCIP?
Adding or Replacing a Switch (cont.)
• Clear previous configuration from the switch
  – Zoning: cfgdisable; cfgclear; cfgsave
  – Switch configuration: configdefault
• Gather all required information for the new or replacement switch using a switch connection checklist
• Configure the new or replacement switch to join an existing fabric
Methods for Configuring
• Use appropriate Fabric OS commands or Web Tools to configure the new or replacement switch
• Use the configdownload command to copy a previously saved backup file to a new or replacement switch, or to restore a configuration to an existing switch
• The Fabric Manager baseline utility can copy the configuration of another switch or a previously saved configuration file
Merging Two Fabrics
• A successful merge will create a single fabric with four switches
Fabric Segmentation
• Fabric segmentation is generally caused by one of the following conditions:
  1. Licensing problems: Switches segment due to value line license limitations
  2. Zoning conflicts: The zoning configuration in both fabrics cannot be merged
  3. Admin Domain (AD) conflict: The AD configuration and/or AD zoning configurations cannot be merged
  4. Fabric parameters conflict: fabric.ops parameters do not match
  5. Port parameters conflict: ISL port settings are not compatible. FCIP tunnel settings must match.
  6. Domain ID overlap: Two or more switches have the same domain ID
  7. Access Control List (ACL): If the configuration is strict, all switches must comply
• In addition, all switches in a fabric with user-defined ADs 1-254, ACLs, and/or a zoning database size greater than 256K must support the Reliable Commit Service (RCS) protocol
Identify Fabric Segmentations
Primary sources for identifying fabric segmentations
• switchshow output
  – E_Port state will identify the state of all E_Ports – possible segmentation errors are: Domain Overlap, Zone Conflict or Op Mode Incompatible
• Switch error logs
  – errshow and errdump will capture fabric segmentation events
• fabstatsshow output
  – Lists all the criteria that are exchanged during the ELP process and flags any parameter that is mismatched between the two switches
• Fabric Manager
  – Fabric merge check will identify a fabric segmentation cause
switchshow Output
RSL1_ST01_B20:admin> switchshow
switchName: RSL1_ST01_B20
switchState: Online
switchMode: Native
switchRole: Principal
switchDomain: 1
switchId: fffc01
switchWwn: 10:00:00:05:1e:02:12:2c
zoning: ON (lab1)
Area Port Media Speed State
==============================
0 0 id N4 Online F-Port 10:00:00:00:c9:53:c6:c5
1 1 id N2 Online E-Port segmented, (domain overlap) (Trunk master)
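When many ports are involved, captured switchshow output can be scanned for segmented E_Ports with a short script. The following is a minimal sketch in Python, assuming the port table looks like the two rows shown on this slide; the function name and the row layout it expects are illustrative assumptions, not a Brocade tool or a documented format.

```python
import re

# Port table rows copied from the slide above.
SAMPLE = """\
Area Port Media Speed State
==============================
0 0 id N4 Online F-Port 10:00:00:00:c9:53:c6:c5
1 1 id N2 Online E-Port segmented, (domain overlap) (Trunk master)
"""

def segmented_ports(switchshow_text):
    """Return (port, reason) pairs for ports reported as segmented."""
    results = []
    for line in switchshow_text.splitlines():
        fields = line.split()
        # Port rows are assumed to start with numeric Area and Port columns.
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        if "segmented" not in line:
            continue
        match = re.search(r"segmented,?\s*\(([^)]*)\)", line)
        reason = match.group(1) if match else "unknown"
        results.append((int(fields[1]), reason))
    return results

print(segmented_ports(SAMPLE))
```

The reason string in parentheses is the starting point for the checks on the following slides.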
Error Logs Capture Segmentation Events
RSL1_ST01_B20:admin> errshow -r
Fabric OS: v5.1.0c
2006/08/15-11:52:12, [FABR-1001], 204,, WARNING, RSL1_ST01_B20, port 1, domain IDs overlap
2006/08/15-11:45:57, [FABR-1001], 203,, WARNING, RSL1_ST01_B20, port 1, incompatible VC translation link init, ensure it is set to 1 (2)
2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, RSL1_ST01_B20, port 1, Zone conflict
RSL1_ST10_B41:admin> errshow -r
Fabric OS: v5.2.0a
2007/01/31-12:50:27, [FABR-1001], 4,, WARNING, rsl1_st10_b41_1, port 8, ELP rejected by the other switch
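The FABR-1001 entries above can be pulled out of a saved error log with a short script. This is a minimal sketch in Python, assuming each entry follows the comma-separated layout of the sample lines on this slide; the regular expression and function name are illustrative assumptions, and other Fabric OS releases may format the log differently.

```python
import re

# Sample entries copied from the slide above.
SAMPLE = """\
2006/08/15-11:52:12, [FABR-1001], 204,, WARNING, RSL1_ST01_B20, port 1, domain IDs overlap
2006/08/15-11:45:57, [FABR-1001], 203,, WARNING, RSL1_ST01_B20, port 1, incompatible VC translation link init, ensure it is set to 1 (2)
2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, RSL1_ST01_B20, port 1, Zone conflict
"""

# Assumed layout: time, [FABR-1001], sequence,, severity, switch, port N, reason
PATTERN = re.compile(
    r"^(?P<time>\S+), \[FABR-1001\], \d+,, \w+, "
    r"(?P<switch>[^,]+), port (?P<port>\d+), (?P<reason>.+)$"
)

def segmentation_events(log_text):
    """Return (timestamp, port, reason) for every FABR-1001 entry found."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            events.append(
                (match.group("time"), int(match.group("port")), match.group("reason"))
            )
    return events

for event in segmentation_events(SAMPLE):
    print(event)
```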
fabstatsshow Output
RSL1_ST01_B20:admin> fabstatsshow
Description                     Count
-----------------------------------------
domain ID forcibly changed:     0
E_Port offline transitions:     7 (Last on port 14)
Reconfigurations:               6
Segmentations due to:
  Loopback:                     0
  Incompatibility:              8 < Identifies mismatch
  Overlap:                      0
  Zoning:                       0
  E_Port Segment:               0
What parameters would you compare next? fabric.ops
Licensing Conflicts
• Switches can be purchased with value line licenses
  – A value line 2 license enables the switch to exist in a two-domain fabric
  – A value line 4 license enables the switch to exist in a four-domain fabric
• Prior to Fabric OS v3.1.2/4.2, value line licensed switches in fabrics that exceeded the allowable number of domains segmented
• After Fabric OS v3.1.2/4.2, value line licensed switches in fabrics that exceed the allowable number of domains have a grace period
  – The switch is allowed to join the fabric, but Web Tools access is disabled after 45 days
  – The following messages continuously display at the CLI even with quietmode on:
0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-SIZE_EXCEEDED, 1, Critical fabric size (3) exceeds supported configuration (2). Switch status marginal. Contact Technical Support.
0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-WEBTOOL_LIFE, 1, Webtool will be disabled in 44 days 23 hours and 50 minutes
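The two numbers in the FABRIC-SIZE_EXCEEDED message show how far the fabric is over the value line limit. A minimal sketch in Python, assuming the message wording matches the sample above; the function name is hypothetical and the wording may differ between Fabric OS releases.

```python
import re

# Message text copied from the slide above.
MESSAGE = (
    "0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-SIZE_EXCEEDED, 1, "
    "Critical fabric size (3) exceeds supported configuration (2). "
    "Switch status marginal. Contact Technical Support."
)

def fabric_size_exceeded(message):
    """Extract actual and licensed domain counts, or None if not present."""
    match = re.search(
        r"fabric size \((\d+)\) exceeds supported configuration \((\d+)\)", message
    )
    if match is None:
        return None
    actual, supported = int(match.group(1)), int(match.group(2))
    return {"actual": actual, "supported": supported, "excess": actual - supported}

print(fabric_size_exceeded(MESSAGE))
```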
Identify Zoning Conflicts
• There are three general types of zoning conflicts:
Type 1. Configuration mismatch: the enabled zone configurations are different
  – Fabric A: cfgcreate "cfg4", "Red_Zone"
  – Fabric B: cfgcreate "cfg4", "Red_Zone; Blue_Zone"
sw4100:admin> cfgshow
Defined configuration:
Effective configuration:
cfg: cfg4
zone: Red_Zone; 1,4; 1,5
sw4900:admin> cfgshow
Defined configuration:
Effective configuration:
cfg: cfg4
zone: Red_Zone; 1,4; 1,5
zone: Blue_Zone; 2,8; 2,11
Identify Zoning Conflicts (cont.)
Type 2. Type mismatch: The name of a zone object (alias, zone, cfg) in one fabric is used for a different type of zone object in the other fabric
  – Fabric A: alicreate "Device1", "1,1"
  – Fabric B: zonecreate "Device1", "1,1; 2,3"
sw4100:admin> cfgshow
Defined configuration:
alias: Device1 1,1
Effective configuration:
No effective configuration
sw4900:admin> cfgshow
Defined configuration:
zone: Device1 1,1; 2,3
Effective configuration:
No effective configuration
Identify Zoning Conflicts (cont.)
Type 3. Content mismatch: The definition of a zone object in one fabric is different from a zone object with the same name in the other fabric (including the order of the zone members)
  – Fabric A: zonecreate "Green_Zone", "1,1; 2,3"
  – Fabric B: zonecreate "Green_Zone", "2,3; 1,1"
sw4100:admin> cfgshow
Defined configuration:
zone: Green_Zone 1,1; 2,3
Effective configuration:
No effective configuration
sw4900:admin> cfgshow
Defined configuration:
zone: Green_Zone 2,3; 1,1
Effective configuration:
No effective configuration
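The three conflict types can be expressed as a simple comparison of two zone databases. The following is a minimal sketch in Python, not the Fabric OS merge algorithm: the dictionary layout, the function name, and the choice to compare effective configurations only when both fabrics have one are assumptions made for illustration.

```python
def zoning_conflicts(fabric_a, fabric_b):
    """Classify conflicts between two zone databases.

    Each fabric is a dict with optional keys (hypothetical layout):
      "effective": {cfg_name: [zone names]}
      "defined":   {object_name: (object_type, [members])}
    """
    conflicts = []
    eff_a, eff_b = fabric_a.get("effective"), fabric_b.get("effective")
    # Type 1: both fabrics have an enabled configuration and they differ.
    if eff_a and eff_b and eff_a != eff_b:
        conflicts.append(("configuration", "effective configurations differ"))
    defined_a = fabric_a.get("defined", {})
    defined_b = fabric_b.get("defined", {})
    for name in sorted(defined_a.keys() & defined_b.keys()):
        type_a, members_a = defined_a[name]
        type_b, members_b = defined_b[name]
        if type_a != type_b:
            conflicts.append(("type", name))      # Type 2: same name, different object type
        elif list(members_a) != list(members_b):
            conflicts.append(("content", name))   # Type 3: member order counts
    return conflicts

# The three examples from the slides above.
type1 = zoning_conflicts(
    {"effective": {"cfg4": ["Red_Zone"]}},
    {"effective": {"cfg4": ["Red_Zone", "Blue_Zone"]}},
)
type2 = zoning_conflicts(
    {"defined": {"Device1": ("alias", ["1,1"])}},
    {"defined": {"Device1": ("zone", ["1,1", "2,3"])}},
)
type3 = zoning_conflicts(
    {"defined": {"Green_Zone": ("zone", ["1,1", "2,3"])}},
    {"defined": {"Green_Zone": ("zone", ["2,3", "1,1"])}},
)
print(type1, type2, type3)
```

Members are compared as ordered lists because the Type 3 slide states that member order is part of the definition.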
Identify Zoning Conflicts
• Begin by running the switchshow and errshow commands
  – Segmentations caused by zoning conflicts are noted as such
sw4100:admin> errshow -r
Fabric OS: v5.1.0c
2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, sw4100, port 1, Zone conflict
• To identify the cause of a zoning conflict, perform the following actions on both fabrics:
  – Display the current zone configuration in both fabrics (cfgshow)
  – Review the zone configurations in both fabrics for configuration, type, and content mismatches
  – Verify that the Advanced Zoning license is installed (licenseshow)
Identify Zoning Conflicts (cont.)
• Use the Fabric Manager 5.2 Fabric Merge check to analyze conflicts, and the offline zoning management tool to correct them
  – Copy the existing zoning configuration from an installed switch, and push it to the new switch
• defzone - check this setting before you connect
sw4100:admin> defzone --show
Default Zone Access Mode
committed - No Access
transaction - No Transaction
Resolve Zoning Conflicts
• Use Web Tools or zone editing commands to resolve the mismatches (ali*, cfg*, zone*, defzone*)
• To prevent zone conflicts, clear the zoning database on the new/replacement switch: cfgdisable, cfgclear, cfgsave
  – Set defzone parameters to match existing fabric
• Use Fabric Manager 5.2+ offline zoning capabilities
Incompatible Switch Parameters
• Incompatible switch parameters are reported as incompatibility
• To verify the flow control settings without disrupting the fabric, run the configshow command in both fabrics and look at the fabric.ops parameters:
  – R_A_TOV – fabric.ops.R_A_TOV
  – E_D_TOV – fabric.ops.E_D_TOV
  – Data field size – fabric.ops.dataFieldSize
  – Disable device probing – fabric.ops.mode.fcpprobedisable
  – Suppress class F traffic – fabric.ops.mode.noClassF
  – Per-frame route priority – fabric.ops.UseCsCtl
  – BB credit – fabric.ops.BBcredit
  – Interop mode – switch.interopMode
  – PID format – fabric.ops.mode.pidFormat
  – Long distance – fabric.ops.mode.longDistance
• You can also review these values by uploading the switch configuration file with the configupload command or Fabric Manager baseline
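Comparing the parameters listed above by eye is error-prone, so saved configshow output from both fabrics can be compared with a short script. This is a minimal sketch in Python, assuming each setting appears on its own line as key:value; that layout, the sample values, and the function names are assumptions for illustration.

```python
# Parameter names taken from the list above.
WATCHED_KEYS = [
    "fabric.ops.R_A_TOV",
    "fabric.ops.E_D_TOV",
    "fabric.ops.dataFieldSize",
    "fabric.ops.mode.fcpprobedisable",
    "fabric.ops.mode.noClassF",
    "fabric.ops.UseCsCtl",
    "fabric.ops.BBcredit",
    "switch.interopMode",
    "fabric.ops.mode.pidFormat",
    "fabric.ops.mode.longDistance",
]

def parse_config(text):
    """Parse 'key:value' lines (assumed layout) into a dict of strings."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return values

def mismatches(text_a, text_b, keys=WATCHED_KEYS):
    """Return {key: (value_a, value_b)} for watched keys that differ."""
    a, b = parse_config(text_a), parse_config(text_b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Hypothetical captures from two fabrics.
FABRIC_A = "fabric.ops.R_A_TOV:10000\nfabric.ops.E_D_TOV:2000\nfabric.ops.dataFieldSize:2112\n"
FABRIC_B = "fabric.ops.R_A_TOV:10000\nfabric.ops.E_D_TOV:5000\nfabric.ops.dataFieldSize:2112\n"

print(mismatches(FABRIC_A, FABRIC_B))
```

A key missing from one capture is reported as a mismatch against None, which flags incomplete captures as well as real differences.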
Incompatible Switch Parameters (cont.)
• To change these values at the command line (disruptively):
• First, disable the switch (switchdisable)
• Next, use the Fabric parameters menu in the configure command
sw4100:admin> switchdisable; configure
Configure...
Fabric parameters (yes, y, no, n): [no] yes
Domain: (1..239) [1]
R_A_TOV: (4000..120000) [10000]
E_D_TOV: (1000..5000) [2000]
WAN_TOV: (0..30000) [0]
MAX_HOPS: (7..19) [7]
Data field size: (256..2112) [2112]
Sequence Level Switching: (0..1) [0]
Disable Device Probing: (0..1) [0]
Switch PID Format: (1..2) [2] 1
Per-frame Route Priority: (0..1) [0]
BB credit: (1..16) [16]
• Finally, re-enable the switch (switchenable)
Incompatible Port Parameters
• Port-level parameters that are not set to the same values on both ends of a link will cause a segmentation:
  – Basic connections: Port speed, type, licensed, and enabled
  – Long-distance connections: Long distance mode, VC Link Init, ISL R_RDY mode, and FCIP tunnel configurations
• Verify the current settings by running the portcfgshow command
rsl1_st10_b41_1:admin> portcfgshow 8
Area Number:        8
Speed Level:        AUTO
Trunk Port          ON
Long Distance       LS
VC Link Init        ON
Desired Distance    40 Km
Locked L_Port       OFF
Locked G_Port       OFF
Disabled E_Port     OFF
ISL R_RDY Mode      OFF
RSCN Suppressed     OFF
Persistent Disable  OFF
NPIV capability     ON
Mirror Port         OFF
Incompatible Port Parameters (cont.)
• Fabric OS v5.2 Extended Fabrics long-distance modes were revised:
  – Modes L0, LE, LD, and LS are supported and can be configured on any FC port
  – Modes L0.5, L1 and L2 are supported, but cannot be configured
• When upgrading from Fabric OS v5.1 to v5.2, what happens to ports set to mode L0.5, L1, or L2?
  – The long-distance mode is still displayed in command line output (switchshow, etc.), but modes L0.5, L1, and L2 cannot be configured
  – To change the distance on these ports, use mode LD or LS
• When connecting a Fabric OS v5.2 switch to a pre-Fabric OS v5.2 switch, both ports on the link must have the same mode
  – Result: Use mode LS or LD
Incompatible Port Parameters (cont.)
• Change these settings with the following commands:
  – Port speed: portcfgspeed
  – Reset to defaults: portcfgdefault
  – Port type (L_Port only): portcfglport
  – Port type (E_Port or F_Port only): portcfggport
  – Port type (E_Port disabled): portcfgeport
  – Port disabled/enabled: portdisable, portenable
  – Port persistently disabled/enabled: portcfgpersistentdisable, portcfgpersistentenable
  – Long-distance mode, VC link initialization: portcfglongdistance
  – ISL R_RDY mode: portcfgislmode
• Verify settings are the same by invoking portcfgshow on both switches and comparing output
Domain ID Conflicts
Domain ID Conflicts (cont.)
• Duplicate domain IDs are reported as Domain Overlap or Overlap. To resolve domain ID conflicts, follow these steps:
  – In each fabric, display the assigned domain IDs with the fabricshow or switchshow command
  – Review the command output, and determine which switches need a new domain ID
  – Disable the switch (switchdisable), run the configure command to change the domain ID manually, then enable the switch (switchenable)
  – The switch will now join the fabric with the unique domain ID you assigned
• Option: set Insistent domain ID (required for FICON)
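Once the domain IDs of both fabrics have been collected, overlaps can be found before the fabrics are connected. A minimal sketch in Python; the switch names and the dictionary layout are hypothetical, and the IDs would be read by hand from the fabricshow or switchshow output.

```python
def overlapping_domains(fabric_a, fabric_b):
    """Return {domain_id: [(fabric, switch), ...]} for IDs used more than once.

    Each argument maps a switch name to its domain ID.
    """
    by_domain = {}
    for fabric_name, switches in (("A", fabric_a), ("B", fabric_b)):
        for switch, domain in switches.items():
            by_domain.setdefault(domain, []).append((fabric_name, switch))
    return {d: users for d, users in by_domain.items() if len(users) > 1}

# Hypothetical fabrics: both contain a switch with domain ID 1.
fabric_a = {"sw_a1": 1, "sw_a2": 2}
fabric_b = {"sw_b1": 1, "sw_b2": 3}

print(overlapping_domains(fabric_a, fabric_b))
```

Any domain ID reported here must be changed on one of the listed switches before the merge.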
End Device Troubleshooting
Run supportsave Before and After
• Run supportsave as soon as you experience a problem in your SAN
  – Critical data will be captured if supportsave is run right away
  – Run supportsave prior to all problem determination steps
  – If you are unable to resolve the problem, run supportsave again
• If you have to escalate the problem, send the escalation team both supportsave files
End Device Troubleshooting
End device troubleshooting requires answering the following questions:
• Is there light from the host or device? A powered-off or failed device may not provide light. Without light there will never be a login.
• Does the switch port speed configuration match the attached device speed configuration? Devices and switch ports typically autonegotiate. Verify that the switch port is not locked to a speed the device cannot handle.
• Are the transmission characters synchronized with the switch port?
• How far has the login process progressed? Did the device log in properly as a loop and/or fabric device?
• Are the FOS v5.2 ACLs, specifically Device Connection Control (DCC) policies, preventing the device from receiving a response to a login?
End Device Troubleshooting (cont.)
• With the maturation of Fibre Channel, most devices log in as point-to-point via a Fabric Login (FLOGI). Has this occurred?
  – Even if the device logs in as loop, it should still proceed to the FLOGI stage to get a Public Loop Address (24-bit address)
• If the end device logs in as loop or Fabric, it will be assigned a 24-bit address
  – Until then, it has no source ID (SID) with which to initiate communication in the fabric
End-to-End Device Connectivity
Use LLFD to Divide and Conquer
End-to-End Device Connectivity (cont.)
Link, Login, Fabric, Devices
Link – Physical and logical connection of device to switch
• Transmission of light/signal
• Negotiation of speed
• Synchronization of characters and words
  – Loop/Fabric initialization primitives
Login – Device to switch connectivity
• FLOGI to Fabric Port (FFFFFE)
• Security Policy Check – Device Connection Control policy (DCC_POLICY) Access Control List (ACL)
  – Switch responses:
    • Accept: Assign fabric unique 24-bit address
    • No response: Do not assign fabric address
• Port Login (PLOGI) to Name Server (FFFFFC)
End-to-End Device Connectivity (cont.)
Link, Login, Fabric, Devices
Fabric
• Name Server Registration (FFFFFC)
  – Device registers to local Name Server
  – Name Server is distributed within the fabric
  – If user-defined Virtual Fabric Admin Domains (ADs) are enabled, the Name Server will only show devices within the current AD
    • AD255 is the Physical Fabric view
    • AD0-AD254 will have a filtered view of the Name Server
• Device attribute data may be registered:
  – Device Model and Vendor
  – Firmware and Driver revisions
  – Host name
• SCR and RSCN to Fabric Controller (FFFFFD)
  – Initiators register using State Change Registration (SCR)
  – Initiators receive Registered State Change Notifications (RSCNs) from the Name Server
End-to-End Device Connectivity (cont.)
Link, Login, Fabric, Devices
Devices
• Initiator queries Name Server for available devices
  – Response contains devices within the effective zone configuration
  – FC devices are Type 8 (FCP)
  – Devices must successfully be logged into the fabric to exist within the Name Server
  – Initiators are zoned with targets
• Initiator PLOGI to each target device, based upon Name Server query results
• Process Login (PRLI) from initiator to target(s)
  – Provides the end-to-end connectivity for device communication
• Issue Report LUNs and Inquiry to each available device
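The Link, Login, Fabric, Devices sequence works as an ordered checklist: the first check that fails is where to focus. A minimal sketch in Python of that divide-and-conquer idea; the check names are hypothetical labels for the steps on these slides, and the observations would be filled in by hand from switch and host output.

```python
# Ordered stages and checks; the check names are illustrative labels only.
LLFD_STAGES = [
    ("Link", ["light", "speed_negotiated", "in_sync"]),
    ("Login", ["flogi_accepted", "plogi_name_server"]),
    ("Fabric", ["name_server_registered"]),
    ("Devices", ["target_in_zone", "plogi_target", "prli_target"]),
]

def first_failure(observations):
    """Return (stage, check) for the first check not confirmed, else None."""
    for stage, checks in LLFD_STAGES:
        for check in checks:
            if not observations.get(check, False):
                return stage, check
    return None

# Example: the link is up but the switch never accepted the FLOGI.
observed = {
    "light": True,
    "speed_negotiated": True,
    "in_sync": True,
    "flogi_accepted": False,
}
print(first_failure(observed))
```

Unrecorded checks are treated as failed, so the result always points at the earliest step that still needs to be verified.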
Troubleshooting End-to-End Device Connectivity
Start at the switch
• The switch contains a wealth of information concerning the condition of the fabric:
  – Devices that are logged into the fabric
  – Devices registered within the Name Server
  – Which devices are within the same zone
Don't forget about LUN Masking and Persistent Binding
• Storage array may implement LUN Masking
  – Initiator WWN (Port or Node) presented to array properly?
  – Correct LUNs made available to initiator by array?
• HBAs may use Persistent Binding to specify LUN WWN or 24-bit PID to OS device mapping
  – Target LUN WWN (Port or Node) or PID specified correctly in host file(s)
  – May require entry for new or replaced target LUNs
Troubleshooting End-to-End Device Connectivity (cont.)
• If previous steps have been verified, there should be end-to-end device connectivity and communication
• If there is no communication between end devices, use CLI commands to determine where the problem exists. Verify connectivity through the SAN first.
• If everything looks correct from switch CLI commands, use storage and host specific message logs and commands to isolate problems to the end point (initiator or target)
Troubleshooting Starts with switchshow
• The first command to enter when you start troubleshooting is switchshow. It shows whether:
  – Switch is online
  – SFP is installed in each port
  – Ports are licensed, e.g. Ports-On-Demand (POD)
  – End devices are online
• For remote devices, there are several commands to choose from, but start with nscamshow
  – Tells if remote devices are seen within the fabric
    • Name Server (ns*) commands are filtered by ADs in FOS v5.2+
    • If ADs are implemented, select AD255 (Physical Fabric View): rsl1_st15_b20_1:admin> ad --select 255
• Next get a view of the fabric configuration with cfgshow
• …or just get a supportsave
  – Super command script file. It gets all these commands and more!
Light/Signal
• Fibre Channel Layer 0 connectivity
  – The actual light transmitted and received over FC cabling
  – Use switchshow command to verify light/signal is being transmitted from a device. Use portflagsshow to see if LED is seen.
  – Additionally use sfpshow to verify SFP is not faulty
Light/Signal (cont.)
Successful light (still no speed/synchronization) output examples
• Use output of switchshow, portshow, and portflagsshow to verify light is being received:
Link – Speed Negotiation
• Speed Negotiation
  – Device and switch use special transmission characters to agree upon a transfer speed of 4 Gbit/sec, 2 Gbit/sec, or 1 Gbit/sec
  – Speed negotiation starts with the highest possible speed and negotiates down until a speed is agreed upon or the lowest possible speed is attempted without success
• CLI output information associated with the port when speed negotiation is successful:
  – switchshow: port speed will display the speed and State will display Online
  – portshow: port speed will display configured or negotiated speed
  – portflagsshow: Physical column output field will display No_Sync or In_Sync
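The negotiate-down behavior described above can be modeled in a few lines. This is a simplified sketch in Python of the outcome only, not of the transmission characters exchanged on the wire; the function name is hypothetical.

```python
def negotiate_speed(switch_speeds, device_speeds):
    """Return the highest speed (Gbit/sec) both sides support, or None.

    Tries the switch port's speeds from highest to lowest, mirroring the
    'start high and negotiate down' rule described above.
    """
    for speed in sorted(switch_speeds, reverse=True):
        if speed in device_speeds:
            return speed
    return None  # lowest speed attempted without success

# Auto-negotiating 4 Gbit/sec port with a 2 Gbit/sec device.
print(negotiate_speed((4, 2, 1), (2, 1)))
# Port locked to 4 Gbit/sec with the same device: no common speed.
print(negotiate_speed((4,), (2, 1)))
```

The second call illustrates why a port locked to a speed the device cannot handle never reaches synchronization.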
Link – Speed Negotiation (cont.)
Unsuccessful Speed Negotiation
• switchshow
1 1 id 2G No_Sync
• portshow 1 | grep portSpeed
portSpeed: 2Gbps
• portflagsshow 1
Offline No_Sync PRESENT
Ensure port is set to default values:
• portcfgdefault 1
Or manually set port to auto-negotiate speed:
• portcfgspeed 1 0
Physical Connectivity
• Physical connectivity between a device and a switch port includes light/signal, speed, and link negotiation processes
• After speed negotiation the connecting points have to synchronize
• Devices can get into a condition defined as marginal when they go into and out of sync
• Commands that help identify this issue include
  – porterrshow
  – The errshow output may also have relevant output
• Fabric Watch can greatly augment the event reporting found in the error log (RASLog)
Physical Connectivity (cont.)
porterrshow
• The porterrshow command is very helpful for getting a picture of all ports and their associated error and link related counters
• Using this information, you can quickly isolate problems down to a specific port
• A marginal link is defined as a degraded physical connection; it is not optimally passing data
  – The porterrshow, portstatsshow, and portshow output display counters that help monitor marginal ports
  – Symptoms include poor performance and occasional loss of connectivity
• A delta of the counters can help you isolate a problem to a port and/or the connected HBA or storage device
  – Note that you can clear the port counters using portstatsclear on a per-port/port-group basis (granularity is dependent on FOS version)
  – The link counters cannot be cleared without a reboot
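Because link counters cannot be cleared without a reboot, taking two snapshots and subtracting them is the practical way to see which ports are still accumulating errors. A minimal sketch in Python; the counter values are made up, and the snapshots are assumed to have been transcribed from two porterrshow runs into dictionaries keyed by port number.

```python
def counter_deltas(before, after):
    """Return {port: {counter: increase}} for counters that grew."""
    deltas = {}
    for port, counters in after.items():
        previous = before.get(port, {})
        grown = {
            name: value - previous.get(name, 0)
            for name, value in counters.items()
            if value - previous.get(name, 0) > 0
        }
        if grown:
            deltas[port] = grown
    return deltas

# Hypothetical snapshots taken some minutes apart.
before = {8: {"enc_in": 0, "crc_err": 2, "enc_out": 100}}
after = {
    8: {"enc_in": 5, "crc_err": 2, "enc_out": 180},
    9: {"enc_in": 0, "crc_err": 0, "enc_out": 0},
}

print(counter_deltas(before, after))
```

Ports that do not appear in the result had no growth between the two snapshots, whatever their absolute counter values.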
Physical Connectivity (cont.)
Use the porterrshow command for initial investigation of marginal links
portstatsclear clears the error statistics to the left of the dotted line. The other counters are cleared only by a reboot/fastboot.
Physical Connectivity (cont.)
Granularity on ports with high error counters:
• porterrshow
  – Less granularity
  – Good for quickly identifying port(s) of interest
• portstatsshow
  – Good for monitoring exact values of counters
Error Counters
Certain port counters can point to physical link layer issues:
• enc_in: This counter increments when 8b/10b encoding errors are detected within a frame. enc_in errors are always detected on the ingress port.
• crc_err: Indicates corruption within the frame. Always seen on ingress port but will be passed by the switch unaltered through the fabric (like a trail of bread crumbs).
• enc_in and/or crc_err = Possible bad media (SFP, cable, patch panel)
Error Counters (cont.)
• enc_out: 8b/10b encoding errors NOT associated with frames (IDLE, R_RDY, and various other primitives). This counter increments during speed negotiation prior to login. Locking a port to a speed supported by the end device can be used to isolate issues.
  – Possible bad media (SFP, cable, patch panel)
  – Can cause a performance problem due to buffer recovery
• disc_c3: Class 3 frame has been discarded because it is not routable to a destination address
  – Corrupted or not-online Destination ID (DID)
  – Timeout exceeded (Condor ASIC hold time exceeded)
  – Counter may increment when FC nodes and/or switches rapidly transition between online and offline; look at fabriclog -s output (described in the Logical Connectivity slide later)
Link Counters
These are point-to-point errors; they do not propagate through the fabric
• Link failures - error conditions that cause a port to drop out of an active state
  – Requires the reconnecting device to FLOGI back into fabric (No speed negotiation required, since the device does not lose synchronization)
• Loss of sync - occurs when bit and word synchronization on link is lost
• Loss of signal - occurs when light or an electrical signal is lost on a link
  – Requires the connected device to renegotiate speed and FLOGI back into fabric
• If you experience device connectivity and/or performance issues and rising link counters, look for
  – bad cables/SFPs/patch-panel connections
  – repeating cycles of online/offline states in fabriclog -s output
Device Initialization into Fabric
Device Initialization - Port Configuration
Device initialization could be affected by port configuration
• portcfgshow – display port status
Port Configuration (cont.)
• switchshow – display login status; F/L/E or G:
1 1 id N1 Online G-Port
• portcfglport – Lock port to L-Port to force Loop Initialization prior to FLOGI
portcfglport
• portcfggport – Lock to G-Port if HBA/storage has difficulties negotiating initial Loop Initialization
portcfggport
• portcfg mirrorport – A port configured as a mirror port will prevent HBA/storage login
portcfg mirrorport --enable
  – Disable mirror port configured to connect a device
portcfg mirrorport --disable
Login Services
Three different levels of login:
• Fabric Login (FLOGI) is used by an N_Port or NL_Port (Nx_Ports) to establish service parameters with the switch
  – The following information is implicitly captured and put into the Name Server during this process: type; COS; PID; PortName (port WWN); and NodeName (node WWN)
• N_Port Login (PLOGI) is used by one Nx_Port to establish service parameters with another N_Port or NL_Port
• Process Login (PRLI) is used by an upper-level process in one port to establish image pairs and service parameters with the corresponding upper-level process in the other port
  – For example, it can be used to establish the environment between related SCSI processes on an originating Nx_Port and a responding Nx_Port
Fabric Login (FLOGI)
• When devices first connect, their address is 000000 (unless they are loop devices, in which case their address is 0000pp)
• FLOGI is required before any frame can be sent through the fabric
• FLOGI is sent to well-known address FFFFFE (Fabric F_Port)
March 2008
SAN Troubleshooting Basics
® 2008 Brocade Communications Systems, Inc. All rights reserved.
57
57
Commands to Check FLOGI Status
• switchshow – A successful login displays an F_Port (including its WWN) or L_Port
• portshow – A successful login displays the fabric's viewpoint of the device
  – portFlags - a bit map and English translation of the port's login process
  – portState - Online
  – portPhys - In_Sync, receiving light and synchronized
  – portId - 24-bit Fabric Address, port identifier (PID) of device
  – portScn - F_Port; from the fabric's point of view, all end devices that successfully logged in are F_Ports
  – port WWN(s) of connected device(s) - an F_Port will have one WWN; an FL_Port can have multiple WWNs
  – Distance and Speed Configuration of the port
• portflagsshow – Lists the translation of all port login state flags; same as portshow portFlags output
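The portId field reported by portshow is the 24-bit PID. As a minimal sketch, assuming the standard Fibre Channel address layout of Domain, Area, and Port (AL_PA) bytes, which these slides do not spell out, a PID can be split as shown here. The function names and the lookup table are illustrative only; the two well-known addresses are the ones named in this deck.

```python
# Well-known addresses mentioned in this deck (FLOGI and SCR destinations).
WELL_KNOWN = {
    0xFFFFFE: "Fabric F_Port (FLOGI destination)",
    0xFFFFFD: "Fabric Controller (SCR destination)",
}


def split_pid(pid):
    """Split a 24-bit PID into (domain, area, port) bytes.

    Assumption: standard FC layout, bits 23-16 domain, 15-8 area, 7-0 port/AL_PA.
    """
    if not 0 <= pid <= 0xFFFFFF:
        raise ValueError("PID must fit in 24 bits")
    return (pid >> 16) & 0xFF, (pid >> 8) & 0xFF, pid & 0xFF


def describe_pid(pid):
    """Return a short human-readable description of a PID."""
    if pid in WELL_KNOWN:
        return WELL_KNOWN[pid]
    domain, area, port = split_pid(pid)
    return f"domain={domain:#04x} area={area:#04x} port={port:#04x}"


print(describe_pid(0x1400E8))  # a PID used in the fcping examples of this deck
print(describe_pid(0xFFFFFE))
```

Splitting the PID this way makes it easier to see which switch (domain) and which port (area) a device address belongs to when reading Name Server output.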
portshow
portstatsshow – BB Credit
portcamshow
• Hardware enforced
  – SID/DID zone tables are kept in ASIC
  – portcamshow
• Out of CAM Entries
  – Changes to Session-Based zoning
  – Resource issue - not an actual error condition
• portzoneshow – undocumented/unsupported command
  – Displays type of zoning (Hard, Session based) for each port
Logical Connectivity - fabriclog -s
• fabriclog -s supersedes the fabstateshow command
  – Use it to check for port Online/Offline transitions:
  – Port 1 transitioned from Offline to Online multiple times
  – Check physical connectivity for bad cable, SFPs, patch-panel, etc.
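The check described above, spotting a port that goes from Offline to Online repeatedly, can be sketched as a small counter. This is only an illustration: the input is a hypothetical, already-parsed list of (port, state) events, not real fabriclog output, and the threshold of three transitions is an arbitrary assumption.

```python
from collections import Counter


def count_online_transitions(events):
    """Count Offline -> Online transitions per port.

    `events` is a hypothetical pre-parsed sequence of (port, state) tuples
    in log order; it is not the actual fabriclog -s format.
    """
    last_state = {}
    flaps = Counter()
    for port, state in events:
        if last_state.get(port) == "Offline" and state == "Online":
            flaps[port] += 1
        last_state[port] = state
    return flaps


events = [
    (1, "Offline"), (1, "Online"),
    (2, "Online"),
    (1, "Offline"), (1, "Online"),
    (1, "Offline"), (1, "Online"),
]
flaps = count_online_transitions(events)

# Ports at or above the (assumed) threshold are candidates for a cable/SFP check.
suspects = [port for port, count in flaps.items() if count >= 3]
print(suspects)
```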
Fabric – Name Server
Successful port login and registration to Name Server
  – A port login (PLOGI) to the Name Server can be confirmed by looking at the Name Server information
  – Verify using the nsshow command
  – Unsuccessful port login means no information within the Name Server
Fabric – Name Server (cont.)
• Check for successful port login with -t option: device is an Initiator or Target
State Change Notification Services
• State Change Notification (SCN) - used for internal state change notifications, not external
  – This is the switch logging that the port is online or is an Fx_Port
  – This is not sent from the switch to the Nx_Ports!
• State Change Registration (SCR) – Nx_Port request to receive notification when something in the fabric changes
  – FC devices that choose to receive RSCNs must register for this service
    • Devices send a State Change Registration (SCR) to FFFFFD
    • Registration indicates that the device wants to be notified of changes
  – Devices register after PLOGI to Name Server
• Registered State Change Notification (RSCN) - issued by the Fabric Controller Service or an Nx_Port to devices that registered (issued an SCR requesting this notification)
  – Only sent to devices within an affected zone
Fabric Controller Services
• The Fabric Controller (FFFFFD) Service alerts a device that changes have occurred in the fabric by sending a Registered State Change Notification (RSCN) if:
  – Device registered to receive RSCN using an SCR
  – A new device has been added (within the same zone)
  – An existing device has been removed (within the same zone)
  – A zone has been changed
  – A switch name or IP address changed
  – The fabric reconfigured
• Registration is optional
  – SCSI initiators normally register
  – SCSI targets do not register
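The two delivery rules stated above (a device must have sent an SCR, and it must share a zone with the device that changed) can be modeled in a few lines. This is a simplified sketch, not how Fabric OS implements the Fabric Controller; the zone names, the data structures, and the function name are all hypothetical.

```python
def rscn_recipients(changed_pid, zones, registered):
    """Return the PIDs that would receive an RSCN about `changed_pid`.

    zones:      dict of zone name -> set of member PIDs (hypothetical model)
    registered: set of PIDs that previously sent an SCR
    """
    affected = set()
    for members in zones.values():
        if changed_pid in members:
            affected |= members
    affected.discard(changed_pid)  # the changed device does not notify itself
    return sorted(affected & registered)


zones = {
    "zone_a": {0x0A0100, 0x1400E8},
    "zone_b": {0x0A0200, 0x1400E2},
}
registered = {0x0A0100, 0x0A0200}  # initiators normally register; targets do not

recipients = rscn_recipients(0x1400E8, zones, registered)
print([hex(pid) for pid in recipients])
```

In this model the initiator in zone_b is registered but not notified, because the changed device is not in any of its zones.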
Changes Within the Fabric
• “Properly” written device drivers will do the following in response to an RSCN:
  – Query the Name Server for changes related to devices they are (or were) currently logged into
  – Initiate a port login for any new devices the Name Server has notified them of within their Virtual Fabric zoning configuration
• Sometimes it isn’t a device driver issue. Applications can fail if their I/O is not satisfied quickly. (“Quickly” is a relative term.)
  – If necessary, FOS gives the ability to suppress RSCNs per port:
Device Identification Commands
• Use switchshow, nsshow, nscamshow, nsallshow, and nodefind to identify devices in the fabric
• nsallshow lists all 24-bit PID addresses within the current fabric (Name Server view of current AD)
• nodefind lists Name Server information for:
  – Specified Alias
  – Specified WWN
  – Specified PID address
Devices - End-to-End Connectivity
• End-to-end device connectivity communication could be blocked on the switch by:
  – Zoning
  – AD configuration
  – Commands to check include: fcping, cfgshow, and ad --show
• End-to-end device connectivity flow
  – Nx_Port to Nx_Port communication
  – Initiator to target (similar to SCSI model)
  – PLOGI/PRLI from Nx_Port to Nx_Port
• Name Server Query
  – Initiators learn about “devices of interest”, based upon FC4 layer type (5 or 8): where 8 = FCP/SCSI, 5 = IP over FC
End-to-End Device Connectivity (cont.)
Use the fcping command to check for end device connectivity and zoning
• Response when device is not online:
    rsl1_st15_b41_1:admin> fcping 0x1400e8 0x0a0100
    fcping: Error destination port invalid
• Response when devices are online, but one does not respond to the fcping ELS ECHO frame:
    rsl1_st15_b20_1:admin> fcping 0x0a0000 0x1400e2
    Source: 0xa0000
    Destination: 0x1400e2
    Zone Check: Not Zoned
    Pinging 0xa0000 with 12 bytes of data:
    received reply from 0xa0000: 12 bytes time:650 usec
    5 frames sent, 5 frames received, 0 frames rejected, 0 frames timeout
    Round-trip min/avg/max = 567/618/674 usec
    Pinging 0x1400e2 with 12 bytes of data:
    Request timed out
    5 frames sent, 0 frames received, 0 frames rejected, 5 frames timeout
    Round-trip min/avg/max = 0/0/0 usec
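When fcping output is captured to a file, the per-device summary lines can be extracted with a short script. The sketch below assumes the output looks exactly like the sample on this slide; other Fabric OS releases may format it differently, and the function name is illustrative.

```python
import re

TARGET = re.compile(r"Pinging (0x[0-9a-fA-F]+) with")
SUMMARY = re.compile(
    r"(\d+) frames sent, (\d+) frames received, "
    r"(\d+) frames rejected, (\d+) frames timeout"
)


def parse_fcping(output):
    """Map each pinged PID to its frame counters, based on the slide's sample format."""
    results = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        target = TARGET.match(line)
        if target:
            current = target.group(1)
            continue
        summary = SUMMARY.match(line)
        if summary and current is not None:
            sent, received, rejected, timeout = map(int, summary.groups())
            results[current] = {
                "sent": sent,
                "received": received,
                "rejected": rejected,
                "timeout": timeout,
            }
    return results


SAMPLE = """\
Source: 0xa0000
Destination: 0x1400e2
Zone Check: Not Zoned
Pinging 0xa0000 with 12 bytes of data:
received reply from 0xa0000: 12 bytes time:650 usec
5 frames sent, 5 frames received, 0 frames rejected, 0 frames timeout
Round-trip min/avg/max = 567/618/674 usec
Pinging 0x1400e2 with 12 bytes of data:
Request timed out
5 frames sent, 0 frames received, 0 frames rejected, 5 frames timeout
Round-trip min/avg/max = 0/0/0 usec
"""

results = parse_fcping(SAMPLE)
silent = [pid for pid, counters in results.items() if counters["received"] == 0]
print(silent)  # devices that never answered the ELS ECHO
```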
Device to Device Login
• Don’t forget, devices do not only log into the fabric. Initiators will initiate PLOGIs and PRLIs to other end devices after:
  – Each device is Online in the switch database
  – Each device has registered with the Name Server
  – Devices are zoned together and within the same Virtual Fabric Administrative Domain (AD)
• The mechanism for devices to log in to each other through PLOGI is the same as used for device to switch login
• The switch acts as a “middle-man”
  – Passing PLOGI/PRLI requests and ACCEPT responses or
  – Discarding such requests if the devices are not zoned together or in the same AD
Port Configuration – End-to-End
Check port configuration for end-to-end device connectivity
• Use nszonemember as a final step to verify that:
  – End devices have logged into Name Server, are Online, and are zoned together within the same AD
    rsl1_st15_b20_1:admin> nszonemember 0x0a0100
    1 local zoned members:
    Type Pid COS PortName NodeName SCR
    N 0a0100; 2,3;10:00:00:00:c9:22:1f:23;20:00:00:00:c9:22:1f:23; 3
    FC4s: FCP
    NodeSymb: [30] "Emulex LP8000 FV3.90A7 DV6.02h"
    Fabric Port Name: 20:01:00:05:1e:02:0c:77
    Permanent Port Name: 10:00:00:00:c9:22:1f:23
    Device type: Physical Initiator
    Port Index: 1
    Share Area: No
    Device Shared in Other AD: No
…output continued on next slide
Port Configuration – End-to-End (cont.)
Check port configuration for end-to-end device connectivity
(nszonemember 0x0a0100 output continued…)
    1 remote zoned members:
    Type Pid COS PortName NodeName
    NL 1400e8; 3;21:00:00:04:cf:92:6a:58;20:00:00:04:cf:92:6a:58;
    FC4s: FCP
    PortSymb: [28] "SEAGATE ST318452FC 0004"
    Fabric Port Name: 20:00:00:05:1e:02:aa:7b
    Permanent Port Name: 21:00:00:04:cf:92:6a:58
    Device type: Physical Target
    Port Index: 0
    Share Area: No
    Device Shared in Other AD: No
• Verifies end-to-end zoning within the fabric
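The member record lines in the nszonemember output above are semicolon-separated, so they are easy to split into fields. The following sketch assumes the column order shown in the slide's header line (Type, Pid, COS, PortName, NodeName, and an optional SCR value); the function name and dictionary keys are illustrative.

```python
def parse_zone_member(line):
    """Parse one nszonemember record line, assuming the slide's column layout."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) < 4:
        raise ValueError("unexpected record format")
    port_type, pid = fields[0].split()
    return {
        "type": port_type,
        "pid": int(pid, 16),
        "cos": [int(value) for value in fields[1].split(",")],
        "port_name": fields[2],
        "node_name": fields[3],
        # The SCR column is absent for the remote member in the sample.
        "scr": int(fields[4]) if len(fields) > 4 and fields[4] else None,
    }


local = parse_zone_member(
    "N 0a0100; 2,3;10:00:00:00:c9:22:1f:23;20:00:00:00:c9:22:1f:23; 3"
)
remote = parse_zone_member(
    "NL 1400e8; 3;21:00:00:04:cf:92:6a:58;20:00:00:04:cf:92:6a:58;"
)
print(local["type"], hex(local["pid"]), local["scr"])
print(remote["type"], hex(remote["pid"]), remote["scr"])
```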
When to use an Analyzer?
• When all devices are logged into the fabric, zoning is configured properly, and hosts do not see their targets
• When there are I/O disruptions that cannot be isolated with RASLog (errdump) or porterrshow/portstatsshow
• When a problem exists within the payload of a transfer
• To monitor the health of a system for error statistics and performance problems (the switch also has relevant built-in diagnostic capabilities)
• To diagnose protocol problems
  – A complete look at the FC header and payload
  – Capture end-to-end protocol information (including ULPs)
• To troubleshoot extended Fabric communication
  – An FC analyzer can be installed between the switch and the gateway at each end
    • Is the transmission the same as the reception?
    • Can bit – char – word sync be established?
Port Mirroring - Configuration
• Decide location of mirror port; on same ASIC as SID or DID port
• Log in to the physical fabric using an Admin role account
• Follow these steps to use port mirroring to capture an FC analyzer trace:
  1. Configure the port as a mirror port by invoking the following command:
       portcfg mirrorport --enable
     • Verify the configuration, invoke portcfgshow and switchshow
  2. Connect an FC Analyzer to the mirror port and verify that it comes online
  3. Configure port mirroring connection between the SID & DID through the mirror port
       portmirror --add
     • The mirror port must be online
     • Verify mirror connection, invoke portmirror --show
  4. Start FC Analyzer capture, reproduce problem, stop capture and review output
  5. Remove the port mirror connection with the portmirror --delete command:
       portmirror --delete
  6. Remove the mirror port configuration (to allow other connections to this port):
       portcfg mirrorport --disable
Gathering Switch Support Data for Problem Determination and Escalation
Switch Support Data - Overview
• Up to this point, we have gathered details about a switch by running one CLI command at a time
• For long-term support of a switch, we need to begin gathering switch support data
  – Larger, file-oriented data that provides a broader view of the switch
  – Configuration of parameters
  – State of FRUs and ports, both currently and in the past
• There are several different types of switch support data that can be collected from a Brocade switch, router, or Director:
  – Switch error logs (RASLogs)
  – Audit logs
  – FFDC files
  – Panic dump and core files
  – Trace dump files
RASLog - Overview
• Starting in Fabric OS v4.4, the System Message Log began to be called the Reliability, Availability, and Serviceability Log (RASLog)
• RASLog error messages are defined in one of two groups
  – External messages – CRITICAL, ERROR, WARNING, and INFO can be viewed by admin-level users
  – Internal messages – DEBUG and PANIC cannot be viewed by admin-level users
• There is one RASLog stored in persistent memory
  – Up to 1024 external messages stored in a non-volatile circular buffer
  – In blade-based switches, each CP maintains a separate RASLog
• In Fabric OS v5.1+, certain security- and zoning-related commands cause an AUDIT flag to be added to error messages
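A circular buffer of 1024 entries means the oldest messages are silently overwritten once the log is full, which is one reason to forward messages to syslog. A minimal sketch of that behavior, using Python's deque as a stand-in for the switch's buffer (the message strings are placeholders, not real RASLog entries):

```python
from collections import deque

RASLOG_CAPACITY = 1024  # external message limit stated on this slide

# A deque with maxlen drops the oldest entry when a new one is appended to a full buffer.
raslog = deque(maxlen=RASLOG_CAPACITY)

for sequence in range(1, 1031):  # write 1030 messages into a 1024-slot buffer
    raslog.append(f"message {sequence}")

print(len(raslog))   # buffer never grows past its capacity
print(raslog[0])     # oldest surviving entry
print(raslog[-1])    # most recent entry
```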
RASLog - Standard Message Format
• Fabric OS v4.4+ error messages follow a standard format:
  – Start Delimiter (customizable): Start
  – Date (including year) and Time: 2006/03/08-11:59:32
  – Message Module and Numeric Instance: ZONE-3006
  – Sequence Number: 9
  – Audit Flag: AUDIT or FFDC (added in Fabric OS v5.1)
  – Severity Level (one of four levels): INFO
  – Switch Name: NDA-ST01-B48
  – Error description: User: admin, Role: admin, Event: cfgdisable, Status: success, Info: Current zone configuration disabled.
  – End Delimiter (customizable): End
    Start 2006/03/08-11:59:32, [ZONE-3006], 9, AUDIT, INFO, NDA-ST01-B48, User: admin, Role: admin, Event: cfgdisable, Status: success, Info: Current zone configuration disabled. End
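Because the message format is fixed, a RASLog line can be broken into its fields with a regular expression. The sketch below is built only from the field list and sample message on this slide; it assumes the default Start/End delimiters and treats the AUDIT/FFDC flag as optional. The pattern and function name are illustrative, not part of Fabric OS.

```python
import re
from datetime import datetime

RASLOG_LINE = re.compile(
    r"^Start (?P<timestamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}), "
    r"\[(?P<msg_id>[A-Z0-9]+-\d+)\], "
    r"(?P<sequence>\d+), "
    r"(?:(?P<flag>AUDIT|FFDC), )?"          # flag is only present on some messages
    r"(?P<severity>CRITICAL|ERROR|WARNING|INFO), "
    r"(?P<switch>[^,]+), "
    r"(?P<description>.*) End$"
)


def parse_raslog(line):
    """Return the fields of one RASLog message, or None if it does not match."""
    match = RASLOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y/%m/%d-%H:%M:%S")
    fields["sequence"] = int(fields["sequence"])
    return fields


SAMPLE = (
    "Start 2006/03/08-11:59:32, [ZONE-3006], 9, AUDIT, INFO, NDA-ST01-B48, "
    "User: admin, Role: admin, Event: cfgdisable, Status: success, "
    "Info: Current zone configuration disabled. End"
)
entry = parse_raslog(SAMPLE)
print(entry["msg_id"], entry["severity"], entry["switch"])
```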
RASLog - Management
• Use the following commands to view the RASLog associated with external messages:
  – Display all external messages in the error log with no page breaks – errdump (default display order: least-recent to most-recent)
  – Display all external messages in the error log with page breaks – errshow (default display order: least-recent to most-recent)
  – Use errdump -r or errshow -r to display error messages in reverse order: most-recent to least-recent
  – Clear all internal and external messages from the error log with the Admin-level errclear command
• Forward RASLog and Console log entries to a syslogd daemon on a host computer (syslogdipadd)
  – Especially important on dual-CP systems as host computer logs maintain a single, sequentially ordered, merged file for both CPs
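If the two CPs' logs have been collected separately rather than through syslog, the same single, sequentially ordered view can be rebuilt offline. This is a minimal sketch with made-up entries; it assumes each per-CP list is already in chronological order and that timestamps use the RASLog format, which sorts correctly as plain text.

```python
import heapq


def merge_cp_logs(*logs):
    """Merge per-CP logs (each sorted by timestamp) into one ordered list."""
    return list(heapq.merge(*logs, key=lambda entry: entry[0]))


# Hypothetical entries: (timestamp, CP, text). Not real switch output.
cp0_log = [
    ("2006/03/08-11:59:32", "CP0", "zone configuration disabled"),
    ("2006/03/08-12:05:10", "CP0", "port 1 offline"),
]
cp1_log = [
    ("2006/03/08-12:01:47", "CP1", "standby CP in sync"),
    ("2006/03/08-12:07:03", "CP1", "failover started"),
]

merged = merge_cp_logs(cp0_log, cp1_log)
for timestamp, cp, text in merged:
    print(timestamp, cp, text)
```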
Audit Log - Overview
• The RASLog was designed to capture abnormal, error-related messages – not high-frequency AUDIT events
• In Fabric OS v5.1 and earlier, error messages and AUDIT events are sent to the RASLog
• In Fabric OS v5.2+, error messages go to the RASLog, and all AUDIT events go only to a new Audit Log
Audit Log – Overview (cont.)
• The new Audit Log is designed for post-event audits and problem determination
  – Captured per Virtual Fabric AD
  – Configurable (off by default)
• For a given event it captures
  – Who (user), when (timestamp), what (SAN component), and which AD
  – Event type
  – Other event-specific information (description)
  – Format consistent with DMTF standard
• AUDIT messages are always sent to the console, and can be configured to go to syslog servers
Audit Log - Details
• Fabric OS v5.2+ continues to audit all Fabric OS v5.1 AUDIT messages
  – Secure Fabric OS configuration
  – Security related: SSL, RADIUS, Zone, and password strengthening configuration
• Fabric OS v5.2+ can also be configured to audit these tasks:
  – configdownload (not configupload)
  – firmwaredownload start, complete, and error messages encountered during download
  – User initiated security events related to ACLs
  – Fabric events related to command execution in other ADs (ad --exec)
• In an AD-aware fabric, Audit Log configuration is done from AD255
• Commands involved in configuring the Audit Log include:
  – auditcfg to enable auditing and define what gets audited (filters)
  – syslogdipadd to specify IP address of syslog server configured to receive audit messages
FFDC - Overview
• To minimize requests for problem recreation from certain Brocade-defined events, Fabric OS captures First Failure Data Capture (FFDC) data
  – Goal: Allow Brocade engineers to gain insight into problems that are transient, difficult-to-recreate, or difficult-to-solve
  – Triggered by error MSG_IDs that are selected by Brocade engineering
  – Messages are written to the console and the error log with an FFDC flag
• Automatically collects “supportshow-like” information (based on CLI commands) as readable text when the selected event occurs
  – A single FFDC event may create one or more FFDC files
  – Up to 4 MB for all FFDC files combined (if max size is reached, a RASLog message is generated, and periodic console messages are sent)
• FFDC files are stored on the switch, and transferred by supportsave (automatically deletes files) or savecore (does not automatically delete files)
FFDC - Configuring
• Enable and disable the FFDC functionality with the supportffdc command
  – Enabled by default - disable only if directed to do so by next-level support
    switch:admin> supportffdc --enable --disable --show
FFDC - Capturing
• The supportsave command uploads the FFDC data via FTP, and deletes it from the switch
  – File name indicates the triggering event, and date/time stamp (example: FSSM1005-2006-08-12-114707.ffdc)
• The savecore command also uploads the FFDC data via FTP (same file name), but does not delete it from the switch
    switch:admin> savecore
    following 1 directories contains core files:
    [ ]0: /core_files/ffdc_data
    Welcome to core files management utility.
    Menu
    1(or R): Remove all core files
    2(or F): FTP all core files
    3(or r): Remove marked files
    4(or f): FTP marked files
    5(or m): Mark Files for action
    6(or u): Un Mark Files for action
    9(or e): Exit
    Your choice:
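Since the FFDC file name encodes the triggering event and a time stamp, uploaded files can be indexed without opening them. The sketch below infers the layout `<event>-YYYY-MM-DD-HHMMSS.ffdc` from the single example on this slide; that layout is an assumption, and the function name is illustrative.

```python
import re
from datetime import datetime

FFDC_NAME = re.compile(
    r"^(?P<event>[A-Za-z0-9]+)-"
    r"(?P<date>\d{4}-\d{2}-\d{2})-"
    r"(?P<time>\d{6})\.ffdc$"
)


def parse_ffdc_name(filename):
    """Return (event, datetime) for an FFDC file name, or None if it does not match."""
    match = FFDC_NAME.match(filename)
    if match is None:
        return None
    stamp = datetime.strptime(match["date"] + match["time"], "%Y-%m-%d%H%M%S")
    return match["event"], stamp


event, captured_at = parse_ffdc_name("FSSM1005-2006-08-12-114707.ffdc")
print(event, captured_at.isoformat())
```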
Panic Dump and Core Files - Overview
• Fabric OS creates panic dump and core files when there are problems in the Fabric OS kernel
  – Generated when an important Fabric OS daemon no longer responds or terminates unexpectedly
  – Captures a snapshot of the current state of the switch at the time of the crash – no historical information retained
  – Panic dumps are text files, core file contents are encrypted
• In a dual-CP Director, each CP can create these files, so always check both CPs
Panic Dump and Core Files (cont.)
• To display panic dump files at the command line, enter the pdshow command
    switch:admin> pdshow
    Could not find any valid pd file!
• To upload (FTP) or delete (remove) panic dump and core files, use the savecore command
    switch:admin> savecore -l
    /core_files/panic/core.873
    /core_files/zoned/core.1234
    /core_files/zoned/core.5678
    /mnt/core_files/nsd/core.873
    /mnt/core_files/panic/core.873
    switch:admin> savecore -h 192.168.204.188 -u jsmith -d core_files_here -p password -f /core_files/zoned/,/mnt/core_files/nsd/
    /core_files/zoned//core.1234: 1.12 kB 382.60 B/s
    /core_files/zoned//core.5678: 1.12 kB 381.95 B/s
    /mnt/core_files/nsd//core.873: 1.12 kB 382.53 B/s
    Files transferred successfully!
Trace Dump - Overview
• The trace functionality is a proactive troubleshooting tool
  – Included in Fabric OS v4.4+ to aid Fabric OS debugging
  – Always running, maintaining a historic record of the current and past state of the switch – cannot be disabled
  – No impact on user data performance
• The results from the trace operation are stored in a trace dump file
  – Triggered by a panic; timeout; CRITICAL-level event; or a manual trigger
  – Binary file, retained in persistent memory
  – Can be uploaded automatically or manually via FTP
Trace Dump - Implementation
• Initiate or remove a trace dump file, or display trace dump status with the tracedump command
  – tracedump -n: Create a trace dump manually
  – tracedump -r: Remove (delete) a trace dump from the switch
• Use the traceftp command to manage the uploading (but not deleting) of trace dumps:
  – traceftp -n: Manually upload trace dumps via FTP
  – traceftp -e: Enable automatic FTP upload of trace dumps
  – traceftp -d: Disable automatic FTP upload of trace dumps
  – With traceftp -e, specify the FTP server to which trace dumps are uploaded with the supportftp command – must do this, or trace dump files will not be automatically uploaded
• Web Tools supports some of the traceftp command functionality
Capturing Switch Support Data - Overview
• There are several tools that you can use to capture switch support data:
  – supportshow
  – supportsave
  – Fabric Manager
  – SAN Health
Capturing Switch Support Data - supportshow
• supportshow is a script that executes groups of pre-selected Fabric OS and Linux commands, and displays their output at the CLI
• To simplify troubleshooting for the future, use the supportshow output to establish a switch baseline
  – Documents the switch configuration under good conditions
  – Future troubleshooting can start by comparing the current supportshow output with the baseline
• supportshow takes ADs into consideration:
  – Command is relevant only in AD0 (no user-defined ADs) or AD255 (with user-defined ADs)
  – AD must include the switch on which the command is run
  – Example supportshow response in non-AD0/AD255 context: Operation not allowed in AD1-AD254 context
Capturing Switch Support Data - supportsave
• To aid the capture of supportshow information, Fabric OS v4.4 introduced supportsave
  – Uploads supportshow in a text file whose name indicates the switch name (Director), CP slot (S0, S5), time stamp (200605200014), and SUPPORTSHOW
  – Also uploads FFDC files, as well as other information
    switch:admin> supportsave -h 192.168.1.1 -u anonymous -d tmp
    This command will collect RASLOG, TRACE, and supportShow (active CP only) information for the local CP and then transfer them to a FTP server. The operation can take several minutes.
    OK to proceed? (yes, y, no, n): [no] y
    ...
    Saving support information for module SUPPORTSHOW...
    ...rtSave_files/Director-S5-200605200014-SUPPORTSHOW: 1.11 MB 346.39 kB/s
• supportsave needs to be run on both the Active and Standby CPs
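The supportsave file name can be decoded in the same spirit, which helps confirm that files from both CPs were collected. The sketch below assumes the name layout described on this slide (switch name, CP slot, time stamp, module) and further assumes the 12-digit time stamp is YYYYMMDDHHMM; neither the pattern nor the function name comes from Brocade documentation.

```python
import re
from datetime import datetime

SUPPORTSAVE_NAME = re.compile(
    r"^(?P<switch>.+)-S(?P<slot>\d+)-(?P<stamp>\d{12})-(?P<module>[A-Z]+)$"
)


def parse_supportsave_name(filename):
    """Split a supportsave file name into switch, CP slot, time stamp, and module."""
    match = SUPPORTSAVE_NAME.match(filename)
    if match is None:
        return None
    return {
        "switch": match["switch"],
        "slot": int(match["slot"]),
        # Assumption: the stamp is year, month, day, hour, minute.
        "timestamp": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "module": match["module"],
    }


info = parse_supportsave_name("Director-S5-200605200014-SUPPORTSHOW")
print(info["switch"], info["slot"], info["timestamp"].isoformat(), info["module"])
```

On a dual-CP system, a complete collection would be expected to contain names for both slots (S0 and S5 in the slide's example).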
Capturing Switch Support Data – SAN Health
• Another tool that automates the documentation of a SAN is Brocade SAN Health
• SAN Health is a free utility that helps you create:
  – Comprehensive Documentation
  – Historical Performance Graphs
  – Detailed Topology Diagrams
  – Best Practice Recommendations
• SAN Health can be run against:
  – Brocade systems running any version of Fabric OS or XPath OS
  – McData systems running EOS 4.x+
Gathering Switch Support Data - Troubleshooting
• Before troubleshooting a Brocade switch, router, or Director, gather all the basic information that you can:
  – Document the current state of the switch with supportsave: RASLogs, numerous command outputs (supportshow)
  – Identify user actions taken in the past: Audit logs (if available)
• Validate the current state of the switch by reviewing supportshow:
  – Verify switch access settings (e.g. ipaddrshow)
  – Check FRU status (e.g. fanshow)
  – Validate firmware revisions (e.g. firmwareshow)
  – Check port status, port errors (e.g. porterrshow)
• Identify faults on the switch by checking the RASLog (errdump) for error-related messages
• As needed, compare time stamps between the RASLog and the Audit Log to determine whether user actions were a problem source
Gathering Switch Support Data – Escalating to Next-Level Support
• If you are escalating an issue to next-level support, gather all the basic and Brocade information from the switch by running supportsave:
  – RASLogs
  – supportshow
  – FFDC files
  – Trace dumps
  – Core files and panic dumps
  – AP blade details
• In addition, describe the problem in as much detail as possible:
  – Affected devices/ports/switches
  – SAN topology drawing
  – Previous course of action (timeline, commands run)
  – Details on recent changes to the fabric (additions/removal/configs)
• If available, also capture the Audit logs, so that past user actions can be identified
Fin