SAN Troubleshooting

Rene Burema, Brocade Communications, March 2008


Product Knowledge is Valuable

• Problem determination requires you to be able to identify:
  – Products, associated port numbers, and LED status
  – Switch and port status
  – License requirements
  – Related compatibility information
• Available resources include:
  – Brocade FOS documentation
  – Brocade Connect and/or Brocade Partner sites
  – Training materials, including Products, FRUs and LEDs (the web-based training module associated with this course)
  – Brocade switch provider information, including compatibility matrices

SAN Troubleshooting Basics

® 2008 Brocade Communications Systems, Inc. All rights reserved.


Common SAN Problems

Many common SAN problems fall into the following categories (in alphabetical order):

• Configuration: a port, device, or switch is not correctly configured
  – Problems accessing a switch, or connecting switches or end devices, can be related to configuration problems
• Firmware download: FTP configuration and release.plist confusion
• Licensing: customers do not have the license required for what they are attempting
  – Problems connecting switches can be related to licensing problems
• Marginal links: bad or marginal cables, GBICs, or SFPs
  – Problems related to performance, or problems that occur when connecting switches or end devices, can be related to marginal links
• Zoning: zoning is not configured correctly
  – Problems that occur when end devices cannot access each other can be related to zoning


What does the switch status tell you?


What can port status LEDs tell you?


Adding/Replacing a Switch in a Fabric and Resolving Fabric Segmentations


When to Add or Replace a Switch

• Faulty hardware
  – Components on a switch that are not FRUs
  – Motherboard, including FC ports
  – Damaged chassis
• Upgrading to new hardware
  – 2 Gbit/sec to 4 Gbit/sec
  – Port density
  – Increased availability
  – New features: FCR, FCIP, iSCSI
  – Replacing EOL hardware
• Growing your fabric
  – Increased port density per switch
  – Increased number of switches
• Whenever your switch provider recommends it


Adding or Replacing a Switch



• Any switch added to an existing fabric must be configured properly:
  – LAN configuration information
  – Fabric configuration information
• Your configuration plan should include a checklist that answers the following questions:
  – Are special port configurations required?
  – Are the correct license keys installed?
  – What versions of firmware are running in the fabric?
  – Will you be using any additional capabilities, e.g. ACLs, ADs, or FCIP?


Adding or Replacing a Switch (cont.)

• Clear the previous configuration from the switch:
  – Zoning: cfgdisable; cfgclear; cfgsave
  – Switch configuration: configdefault
• Gather all required information for the new or replacement switch using a switch connection checklist
• Configure the new or replacement switch to join the existing fabric


Methods for Configuring

• Use the appropriate Fabric OS commands or Web Tools to configure the new or replacement switch
• Use the configdownload command to copy a previously saved backup file to a new or replacement switch, or to restore a configuration to an existing switch
• Use the Fabric Manager baseline utility to copy the configuration of another switch or a previously saved configuration file


Merging Two Fabrics

• A successful merge will create a single fabric with four switches


Fabric Segmentation

• Fabric segmentation is generally caused by one of the following conditions:
  1. Licensing problems: switches segment due to value line license limitations
  2. Zoning conflicts: the zoning configurations in the two fabrics cannot be merged
  3. Admin Domain (AD) conflict: the AD configuration and/or AD zoning configurations cannot be merged
  4. Fabric parameters conflict: fabric.ops parameters do not match
  5. Port parameters conflict: ISL port settings are not compatible; FCIP tunnel settings must match
  6. Domain ID overlap: two or more switches have the same domain ID
  7. Access Control List (ACL): if the configuration is strict, all switches must comply
• In addition, all switches in a fabric with user-defined ADs 1-254, ACLs, and/or a zoning database larger than 256 KB must support the Reliable Commit Service (RCS) protocol


Identify Fabric Segmentations

Primary sources for identifying fabric segmentations:

• switchshow output
  – The E_Port state identifies the state of all E_Ports; possible segmentation errors are Domain Overlap, Zone Conflict, or Op Mode Incompatible
• Switch error logs
  – errshow and errdump capture fabric segmentation events
• fabstatsshow output
  – Counts segmentations by cause (loopback, incompatibility, overlap, zoning, E_Port segment)
• Fabric Manager
  – The fabric merge check identifies the cause of a fabric segmentation
  – It lists all the criteria exchanged during the ELP process and flags any parameter that is mismatched between the two switches


switchshow Output

    RSL1_ST01_B20:admin> switchshow
    switchName:     RSL1_ST01_B20
    switchState:    Online
    switchMode:     Native
    switchRole:     Principal
    switchDomain:   1
    switchId:       fffc01
    switchWwn:      10:00:00:05:1e:02:12:2c
    zoning:         ON (lab1)
    Area Port Media Speed State
    ==============================
      0    0   id    N4   Online   F-Port  10:00:00:00:c9:53:c6:c5
      1    1   id    N2   Online   E-Port  segmented, (domain overlap) (Trunk master)


Error Logs Capture Segmentation Events

    RSL1_ST01_B20:admin> errshow -r
    Fabric OS: v5.1.0c
    2006/08/15-11:52:12, [FABR-1001], 204,, WARNING, RSL1_ST01_B20, port 1, domain IDs overlap
    2006/08/15-11:45:57, [FABR-1001], 203,, WARNING, RSL1_ST01_B20, port 1, incompatible VC translation link init, ensure it is set to 1 (2)
    2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, RSL1_ST01_B20, port 1, Zone conflict

    RSL1_ST10_B41:admin> errshow -r
    Fabric OS: v5.2.0a
    2007/01/31-12:50:27, [FABR-1001], 4,, WARNING, rsl1_st10_b41_1, port 8, ELP rejected by the other switch
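The FABR-1001 entries above share a fixed comma-separated layout, which makes them easy to scan programmatically when many switches are involved. The following is a minimal Python sketch, assuming the field layout seen in these samples (other Fabric OS releases may format RASLog messages differently). The `CAUSE_KEYWORDS` table and the `parse_fabr1001` and `classify` helpers are hypothetical names for illustration, not a Brocade API.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# Keyword -> likely segmentation cause. Keywords come from the sample
# FABR-1001 messages; the mapping is an illustrative assumption,
# not an official Brocade message catalog.
CAUSE_KEYWORDS = [
    ("overlap", "Domain ID overlap"),
    ("zone conflict", "Zoning conflict"),
    ("incompatible", "Port/fabric parameter mismatch"),
    ("elp rejected", "ELP rejected: compare fabric.ops and port settings"),
]


@dataclass
class SegmentationEvent:
    timestamp: str
    switch: str
    port: int
    reason: str
    cause: str


def classify(reason: str) -> str:
    """Map a FABR-1001 reason string to a likely segmentation cause."""
    text = reason.lower()
    for keyword, cause in CAUSE_KEYWORDS:
        if keyword in text:
            return cause
    return "Unknown: check errdump and fabstatsshow"


def parse_fabr1001(lines: Iterable[str]) -> List[SegmentationEvent]:
    """Extract FABR-1001 segmentation events from errshow/errdump text."""
    events = []
    for line in lines:
        if "[FABR-1001]" not in line:
            continue
        # Seven commas separate the fixed fields; the free-text reason
        # (last field) may itself contain commas, so split at most 7 times.
        fields = [f.strip() for f in line.split(",", 7)]
        if len(fields) < 8:
            continue
        port = re.search(r"port (\d+)", fields[6])
        if port is None:
            continue
        events.append(SegmentationEvent(
            timestamp=fields[0],
            switch=fields[5],
            port=int(port.group(1)),
            reason=fields[7],
            cause=classify(fields[7]),
        ))
    return events


if __name__ == "__main__":
    log = [
        "2006/08/15-11:52:12, [FABR-1001], 204,, WARNING, RSL1_ST01_B20, port 1, domain IDs overlap",
        "2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, RSL1_ST01_B20, port 1, Zone conflict",
    ]
    for event in parse_fabr1001(log):
        print(event.switch, event.port, event.cause)
```

Keyword matching is deliberately crude; a reason that matches nothing is reported as unknown so it is not silently dropped.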


fabstatsshow Output

    RSL1_ST01_B20:admin> fabstatsshow
    Description                        Count
    ----------------------------------------
    domain ID forcibly changed:          0
    E_Port offline transitions:          7   (Last on port 14)
    Reconfigurations:                    6
    Segmentations due to:
        Loopback:                        0
        Incompatibility:                 8   <-- identifies mismatch
        Overlap:                         0
        Zoning:                          0
        E_Port Segment:                  0

What parameters would you compare next? The fabric.ops parameters.


Licensing Conflicts

• Switches can be purchased with value line licenses:
  – A value line 2 license enables the switch to exist in a two-domain fabric
  – A value line 4 license enables the switch to exist in a four-domain fabric
• Prior to Fabric OS v3.1.2/4.2, value line licensed switches in fabrics that exceeded the allowable number of domains segmented
• With Fabric OS v3.1.2/4.2 and later, value line licensed switches in fabrics that exceed the allowable number of domains have a grace period:
  – The switch is allowed to join the fabric, but Web Tools access is disabled after 45 days
  – The following messages are displayed continuously at the CLI, even with quietmode on:

    0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-SIZE_EXCEEDED, 1, Critical fabric size (3) exceeds supported configuration (2). Switch status marginal. Contact Technical Support.
    0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-WEBTOOL_LIFE, 1, Webtool will be disabled in 44 days 23 hours and 50 minutes


Identify Zoning Conflicts

There are three general types of zoning conflicts.

• Type 1. Configuration mismatch: the enabled zone configurations are different
  – Fabric A: cfgcreate "cfg4", "Red_Zone"
  – Fabric B: cfgcreate "cfg4", "Red_Zone; Blue_Zone"

    sw4100:admin> cfgshow        (Fabric A)
    Defined configuration:
    Effective configuration:
     cfg:   cfg4
      zone: Red_Zone   1,4; 1,5

    sw4900:admin> cfgshow        (Fabric B)
    Defined configuration:
    Effective configuration:
     cfg:   cfg4
      zone: Red_Zone   1,4; 1,5
      zone: Blue_Zone  2,8; 2,11


Identify Zoning Conflicts (cont.)

• Type 2. Type mismatch: the name of a zone object (alias, zone, or cfg) in one fabric is used for a different type of zone object in the other fabric
  – Fabric A: alicreate "Device1", "1,1"
  – Fabric B: zonecreate "Device1", "1,1; 2,3"

    sw4100:admin> cfgshow        (Fabric A)
    Defined configuration:
     alias: Device1  1,1
    Effective configuration:
     No effective configuration

    sw4900:admin> cfgshow        (Fabric B)
    Defined configuration:
     zone:  Device1  1,1; 2,3
    Effective configuration:
     No effective configuration


Identify Zoning Conflicts (cont.)

• Type 3. Content mismatch: the definition of a zone object in one fabric is different from a zone object with the same name in the other fabric (including the order of the zone members)
  – Fabric A: zonecreate "Green_Zone", "1,1; 2,3"
  – Fabric B: zonecreate "Green_Zone", "2,3; 1,1"

    sw4100:admin> cfgshow        (Fabric A)
    Defined configuration:
     zone:  Green_Zone  1,1; 2,3
    Effective configuration:
     No effective configuration

    sw4900:admin> cfgshow        (Fabric B)
    Defined configuration:
     zone:  Green_Zone  2,3; 1,1
    Effective configuration:
     No effective configuration
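The three zoning conflict types can be expressed as a comparison of two zone databases. This Python sketch assumes you have already transcribed each fabric's cfgshow output into the hypothetical dictionary model described in the comments; it is not a parser for cfgshow itself, and it simplifies the Type 1 rule by comparing effective configurations only when both fabrics have one.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory model of a zone database, transcribed from cfgshow:
#   defined:   {object_name: (object_type, [members])}, type "alias", "zone" or "cfg"
#   effective: (cfg_name, {zone_name: [members]}), or None if nothing is enabled
Defined = Dict[str, Tuple[str, List[str]]]
Effective = Optional[Tuple[str, Dict[str, List[str]]]]


def zoning_conflicts(def_a: Defined, eff_a: Effective,
                     def_b: Defined, eff_b: Effective) -> List[str]:
    """Classify zoning differences between two fabrics as Type 1/2/3 conflicts."""
    conflicts = []
    # Type 1: both fabrics have an enabled configuration, and they differ.
    # (Simplification: a fabric with no effective configuration is not flagged.)
    if eff_a is not None and eff_b is not None and eff_a != eff_b:
        conflicts.append(
            f"Type 1 configuration mismatch: effective cfg {eff_a[0]!r} vs {eff_b[0]!r}")
    # Types 2 and 3 only apply to object names defined in both fabrics.
    for name in sorted(def_a.keys() & def_b.keys()):
        type_a, members_a = def_a[name]
        type_b, members_b = def_b[name]
        if type_a != type_b:
            conflicts.append(f"Type 2 type mismatch: {name!r} is {type_a} vs {type_b}")
        elif members_a != members_b:
            # List comparison is order-sensitive, matching the member-order rule.
            conflicts.append(
                f"Type 3 content mismatch: {name!r} {members_a} vs {members_b}")
    return conflicts


if __name__ == "__main__":
    fabric_a = {"Device1": ("alias", ["1,1"]),
                "Green_Zone": ("zone", ["1,1", "2,3"])}
    fabric_b = {"Device1": ("zone", ["1,1", "2,3"]),
                "Green_Zone": ("zone", ["2,3", "1,1"])}
    for line in zoning_conflicts(fabric_a, None, fabric_b, None):
        print(line)
```

Using a plain list for members is what makes "1,1; 2,3" and "2,3; 1,1" count as different, as the Type 3 example requires.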


Identify Zoning Conflicts

• Begin by running the switchshow and errshow commands
  – Segmentations caused by zoning conflicts are reported as such:

    sw4100:admin> errshow -r
    Fabric OS: v5.1.0c
    2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, sw4100, port 1, Zone conflict

• To identify the cause of a zoning conflict, perform the following actions on both fabrics:
  – Display the current zone configuration (cfgshow)
  – Review the zone configurations for configuration, type, and content mismatches
  – Verify that the Advanced Zoning license is installed (licenseshow)


Identify Zoning Conflicts (cont.)

• Use the Fabric Manager 5.2 Fabric Merge check to analyze the conflict, and the offline zoning management tool to correct it
  – Copy the existing zoning configuration from an installed switch, and push it to the new switch
• defzone: check this setting before you connect the switch

    sw4100:admin> defzone --show
    Default Zone Access Mode
        committed - No Access
        transaction - No Transaction


Resolve Zoning Conflicts

• Use Web Tools or the zone editing commands (ali*, cfg*, zone*, defzone*) to resolve the mismatches
• To prevent zone conflicts, clear the zoning database on the new or replacement switch (cfgdisable, cfgclear, cfgsave)
  – Set the defzone parameters to match the existing fabric
• Use the Fabric Manager 5.2+ offline zoning capabilities


Incompatible Switch Parameters

• Incompatible switch parameters are reported as "Incompatibility" (for example, in fabstatsshow output)
• To verify these settings without disrupting the fabric, run the configshow command in both fabrics and compare the following parameters:
  – R_A_TOV: fabric.ops.R_A_TOV
  – E_D_TOV: fabric.ops.E_D_TOV
  – Data field size: fabric.ops.dataFieldSize
  – Disable device probing: fabric.ops.mode.fcpprobedisable
  – Suppress class F traffic: fabric.ops.mode.noClassF
  – Per-frame route priority: fabric.ops.UseCsCtl
  – BB credit: fabric.ops.BBcredit
  – Interop mode: switch.interopMode
  – PID format: fabric.ops.mode.pidFormat
  – Long distance: fabric.ops.mode.longDistance
• You can also review these values by uploading the switch configuration file with the configupload command, or with the Fabric Manager baseline utility
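Comparing two configshow (or configupload) listings by eye is error-prone, so a short diff over just these keys can help. A minimal Python sketch, assuming each line of the saved output has the form `key:value`; the key names are copied from the list above, and their exact spelling or case may vary between Fabric OS versions, so adjust `SEGMENTATION_KEYS` to match your output. The helper names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Parameters named on this slide; each must match across the two fabrics.
SEGMENTATION_KEYS = [
    "fabric.ops.R_A_TOV", "fabric.ops.E_D_TOV", "fabric.ops.dataFieldSize",
    "fabric.ops.mode.fcpprobedisable", "fabric.ops.mode.noClassF",
    "fabric.ops.UseCsCtl", "fabric.ops.BBcredit", "switch.interopMode",
    "fabric.ops.mode.pidFormat", "fabric.ops.mode.longDistance",
]


def parse_configshow(text: str) -> Dict[str, str]:
    """Parse 'key:value' lines (assumed configshow/configupload layout)."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, e.g. section headers
            values[key.strip()] = value.strip()
    return values


def fabric_param_mismatches(text_a: str, text_b: str,
                            keys: List[str] = SEGMENTATION_KEYS
                            ) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (value_a, value_b)} for every listed key that differs.
    A key missing on one side shows up as None."""
    a, b = parse_configshow(text_a), parse_configshow(text_b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Because this only reads saved text, it keeps the non-disruptive property of configshow; any changes still go through the disruptive configure procedure.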


Incompatible Switch Parameters (cont.)

• To change these values at the command line (disruptively):
  – First, disable the switch (switchdisable)
  – Next, use the Fabric parameters menu of the configure command:

    sw4100:admin> switchdisable; configure
    Configure...
      Fabric parameters (yes, y, no, n): [no] yes
        Domain: (1..239) [1]
        R_A_TOV: (4000..120000) [10000]
        E_D_TOV: (1000..5000) [2000]
        WAN_TOV: (0..30000) [0]
        MAX_HOPS: (7..19) [7]
        Data field size: (256..2112) [2112]
        Sequence Level Switching: (0..1) [0]
        Disable Device Probing: (0..1) [0]
        Switch PID Format: (1..2) [2] 1
        Per-frame Route Priority: (0..1) [0]
        BB credit: (1..16) [16]

  – Finally, re-enable the switch (switchenable)


Incompatible Port Parameters

• Port-level parameters will cause a segmentation if they are not set to the same values on both ends of the link:
  – Basic connections: port speed, type, licensed, and enabled
  – Long-distance connections: long distance mode, VC Link Init, ISL R_RDY mode, and FCIP tunnel configurations
• Verify the current settings by running the portcfgshow command:

    rsl1_st10_b41_1:admin> portcfgshow 8
    Area Number:          8
    Speed Level:          AUTO
    Trunk Port            ON
    Long Distance         LS
    VC Link Init          ON
    Desired Distance      40 Km
    Locked L_Port         OFF
    Locked G_Port         OFF
    Disabled E_Port       OFF
    ISL R_RDY Mode        OFF
    RSCN Suppressed       OFF
    Persistent Disable    OFF
    NPIV capability       ON
    Mirror Port           OFF
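The same idea applies to the port-level settings: capture portcfgshow for the port on each end of the ISL and diff the fields that must match. The sketch below assumes the two-column layout shown above, with label and value separated by at least two spaces; `ISL_FIELDS` and the helper names are illustrative assumptions, and FCIP tunnel settings are not covered.

```python
import re
from typing import Dict, List, Optional, Tuple

# Fields that this slide says must match on both ends of an ISL.
ISL_FIELDS = ["Speed Level", "Long Distance", "VC Link Init", "ISL R_RDY Mode"]


def parse_portcfgshow(text: str) -> Dict[str, str]:
    """Parse 'Name   Value' lines, assuming columns are separated by 2+ spaces."""
    settings = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        if len(parts) == 2:
            # Some labels end with ':' (e.g. 'Speed Level:'), others do not.
            settings[parts[0].rstrip(":")] = parts[1].strip()
    return settings


def isl_port_mismatches(text_a: str, text_b: str,
                        fields: List[str] = ISL_FIELDS
                        ) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """List (field, value_a, value_b) for fields that differ between two ports."""
    a, b = parse_portcfgshow(text_a), parse_portcfgshow(text_b)
    return [(f, a.get(f), b.get(f)) for f in fields if a.get(f) != b.get(f)]
```

The command prompt line (for example `portcfgshow 8`) contains only single spaces, so the parser skips it rather than treating it as a setting.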


Incompatible Port Parameters (cont.)

• In Fabric OS v5.2, the Extended Fabrics long-distance modes were revised:
  – Modes L0, LE, LD, and LS are supported and can be configured on any FC port
  – Modes L0.5, L1, and L2 are supported, but cannot be configured
• When upgrading from Fabric OS v5.1 to v5.2, what happens to ports set to mode L0.5, L1, or L2?
  – The long-distance mode is still displayed in command line output (switchshow, etc.), but modes L0.5, L1, and L2 cannot be configured
  – To change the distance on these ports, use mode LD or LS
• When connecting a Fabric OS v5.2 switch to a pre-Fabric OS v5.2 switch, both ports on the link must have the same mode
  – Result: use mode LS or LD
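The v5.2 long-distance rules above reduce to a few checks on the mode configured at each end of the link. This is a hypothetical Python helper that encodes only what this slide states (matching modes, L0.5/L1/L2 no longer configurable, LS or LD as the way forward); it is not a Fabric OS command and does not cover FCIP tunnels or VC Link Init.

```python
from typing import List

# Per this slide: on Fabric OS v5.2 these modes can be configured on any FC port...
CONFIGURABLE_V52 = {"L0", "LE", "LD", "LS"}
# ...while these are still supported and displayed, but can no longer be configured.
DISPLAY_ONLY_V52 = {"L0.5", "L1", "L2"}


def check_long_distance_link(mode_a: str, mode_b: str) -> List[str]:
    """Check the long-distance modes on the two ends of an ISL where at least
    one switch runs Fabric OS v5.2 (hypothetical helper, not a FOS command)."""
    problems = []
    if mode_a != mode_b:
        problems.append(f"mode mismatch ({mode_a} vs {mode_b}): set both ends to LS or LD")
    for mode in sorted({mode_a, mode_b} & DISPLAY_ONLY_V52):
        problems.append(f"{mode} cannot be configured on v5.2: use LS or LD to change the distance")
    for mode in sorted({mode_a, mode_b} - CONFIGURABLE_V52 - DISPLAY_ONLY_V52):
        problems.append(f"unknown mode {mode!r}")
    return problems
```

Sorting the set intersections keeps the messages in a stable order, which makes the output easier to compare across runs.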


Incompatible Port Parameters (cont.)

• Change these settings with the following commands:
  – Port speed: portcfgspeed
  – Reset to defaults: portcfgdefault
  – Port type (L_Port only): portcfglport
  – Port type (E_Port or F_Port only): portcfggport
  – Port type (E_Port disabled): portcfgeport
  – Port disabled/enabled: portdisable, portenable
  – Port persistently disabled/enabled: portcfgpersistentdisable, portcfgpersistentenable
  – Long-distance mode, VC link initialization: portcfglongdistance
  – ISL R_RDY mode: portcfgislmode
• Verify that the settings are the same by running portcfgshow on both switches and comparing the output


Domain ID Conflicts


Domain ID Conflicts (cont.)

• Duplicate domain IDs are reported as "Domain Overlap" or "Overlap"
• To resolve domain ID conflicts, follow these steps:
  1. In each fabric, display the assigned domain IDs with the fabricshow or switchshow command
  2. Review the command output, and determine which switches need a new domain ID
  3. Disable the switch (switchdisable), run the configure command to change the domain ID manually, then enable the switch (switchenable)
  4. The switch will now join the fabric with the unique domain ID you assigned
• Option: set an insistent domain ID (required for FICON)
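Before changing any domain IDs, it can help to list every ID that is used in both fabrics and pick unused replacements. A minimal Python sketch, assuming you have transcribed switch names and domain IDs from fabricshow into dictionaries; the 1..239 range comes from the configure prompt shown earlier, and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def overlapping_domains(fabric_a: Dict[str, int],
                        fabric_b: Dict[str, int]) -> Dict[int, List[str]]:
    """Each argument maps switch name -> domain ID (e.g. transcribed from
    fabricshow). Returns {domain_id: [switches]} for IDs used in both fabrics."""
    owners = defaultdict(list)
    for fabric in (fabric_a, fabric_b):
        for switch, domain in fabric.items():
            owners[domain].append(switch)
    shared = set(fabric_a.values()) & set(fabric_b.values())
    return {d: sorted(owners[d]) for d in sorted(shared)}


def lowest_free_domain(used: Set[int], low: int = 1, high: int = 239) -> int:
    """Lowest domain ID in the configure range (1..239) that is not in use."""
    for domain in range(low, high + 1):
        if domain not in used:
            return domain
    raise ValueError("no free domain ID in range")


if __name__ == "__main__":
    fabric_a = {"sw4100": 1, "sw200": 2}
    fabric_b = {"sw4900": 1, "sw3800": 3}
    print(overlapping_domains(fabric_a, fabric_b))
    used = set(fabric_a.values()) | set(fabric_b.values())
    print(lowest_free_domain(used))
```

Passing the union of both fabrics' IDs to `lowest_free_domain` ensures the replacement is unique in the merged fabric, not just in one half.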


End Device Troubleshooting


Run supportsave Before and After

• Run supportsave as soon as you experience a problem in your SAN
  – Critical data will be captured if supportsave is run right away
• Run supportsave before any problem determination steps
• If you are unable to resolve the problem, run supportsave again
  – If you have to escalate the problem, send the escalation team both supportsave files


End Device Troubleshooting

End device troubleshooting requires answering the following questions:

• Is there light from the host or device? A powered-off or failed device may not provide light. Without light, there will never be a login.
• Does the switch port speed configuration match the attached device speed configuration? Devices and switch ports typically autonegotiate. Verify that the switch port is not locked to a speed the device cannot handle.
• Are the transmission characters synchronized with the switch port?
• How far has the login process progressed? Did the device log in properly as a loop and/or fabric device?
• Are FOS v5.2 ACLs, specifically Device Connection Control (DCC) policies, preventing the device from receiving a response to a login?


End Device Troubleshooting (cont.)

• With the maturation of Fibre Channel, most devices log in as point-to-point devices via a Fabric Login (FLOGI). Has this occurred?
  – Even if the device logs in as a loop device, it should still proceed to the FLOGI stage to get a public loop address (24-bit address)
• If the end device logs in as a loop or fabric device, it will be assigned a 24-bit address
  – Until then, it has no source ID (SID) with which to initiate communication in the fabric


End-to-End Device Connectivity

Use LLFD (Link, Login, Fabric, Devices) to divide and conquer


End-to-End Device Connectivity (cont.): Link, Login, Fabric, Devices

Link: physical and logical connection of the device to the switch
• Transmission of light/signal
• Negotiation of speed
• Synchronization of characters and words
  – Loop/fabric initialization primitives

Login: device-to-switch connectivity
• FLOGI to the Fabric Port (FFFFFE)
• Security policy check: Device Connection Control policy (DCC_POLICY) Access Control List (ACL)
  – Switch responses:
    • Accept: assign a fabric-unique 24-bit address
    • No response: do not assign a fabric address
• Port Login (PLOGI) to the Name Server (FFFFFC)


End-to-End Device Connectivity (cont.): Link, Login, Fabric, Devices

Fabric
• Name Server registration (FFFFFC)
  – The device registers with the local Name Server
  – The Name Server is distributed within the fabric
  – If user-defined Virtual Fabric Admin Domains (ADs) are enabled, the Name Server shows only devices within the current AD
    • AD255 is the physical fabric view
    • AD0-AD254 have a filtered view of the Name Server
• Device attribute data may be registered:
  – Device model and vendor
  – Firmware and driver revisions
  – Host name
• SCR and RSCN to the Fabric Controller (FFFFFD)
  – Initiators register using State Change Registration (SCR)
  – Initiators receive Registered State Change Notifications (RSCNs) when Name Server entries change


End-to-End Device Connectivity (cont.): Link, Login, Fabric, Devices

Devices
• The initiator queries the Name Server for available devices
  – The response contains devices within the effective zone configuration
  – FCP (SCSI) devices are FC-4 Type 8
  – Devices must be successfully logged into the fabric to exist within the Name Server
  – Initiators are zoned with targets
• The initiator sends a PLOGI to each target device, based upon the Name Server query results
• Process Login (PRLI) from the initiator to the target(s)
  – Provides the end-to-end connectivity for device communication
• The initiator issues Report LUNs and Inquiry to each available device


Troubleshooting End-to-End Device Connectivity

Start at the switch
• The switch contains a wealth of information about the condition of the fabric:
  – Devices that are logged into the fabric
  – Devices registered within the Name Server
  – Which devices are within the same zone

Don't forget about LUN masking and persistent binding
• The storage array may implement LUN masking:
  – Is the initiator WWN (port or node) presented to the array properly?
  – Are the correct LUNs made available to the initiator by the array?
• HBAs may use persistent binding to map a target WWN or 24-bit PID to an OS device:
  – Is the target WWN (port or node) or PID specified correctly in the host file(s)?
  – New or replaced targets may require a new entry


Troubleshooting End-to-End Device Connectivity (cont.)

• If the previous steps have been verified, there should be end-to-end device connectivity and communication
• If there is no communication between end devices, use CLI commands to determine where the problem exists; verify connectivity through the SAN first
• If everything looks correct from the switch CLI commands, use storage- and host-specific message logs and commands to isolate the problem to the end point (initiator or target)
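The Link, Login, Fabric, Devices progression amounts to walking an ordered checklist and stopping at the first layer that fails. The Python sketch below encodes that order; the observation keys (`light`, `in_sync`, and so on) are hypothetical names for facts you record from the commands listed, and the hints paraphrase this section rather than any Brocade tool.

```python
from typing import Dict, Optional, Tuple

# Ordered LLFD checks: (layer, observation key, where this section says to look).
# The observation keys are hypothetical names for facts you record by hand.
LLFD_CHECKS = [
    ("Link", "light", "switchshow, sfpshow: is light being received?"),
    ("Link", "in_sync", "switchshow, portflagsshow: is the port In_Sync?"),
    ("Login", "flogi", "switchshow, portshow: F_Port/L_Port with a 24-bit PID?"),
    ("Fabric", "in_name_server", "nsshow, nscamshow: is the device registered?"),
    ("Fabric", "zoned", "cfgshow: are initiator and target in the same zone?"),
    ("Devices", "prli", "host/array logs: PLOGI/PRLI, LUN masking, persistent binding"),
]


def first_failure(observations: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Return (layer, hint) for the first failed check, or None if all pass.
    A missing observation counts as not yet verified, i.e. a failure."""
    for layer, key, hint in LLFD_CHECKS:
        if not observations.get(key, False):
            return layer, hint
    return None


if __name__ == "__main__":
    print(first_failure({"light": True, "in_sync": True, "flogi": False}))
```

Treating unrecorded facts as failures is a conservative choice: it forces you to verify the lower layers before looking higher up the stack.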


Troubleshooting Starts with switchshow

• The first command to enter when you start troubleshooting is switchshow. It shows whether:
  – The switch is online
  – An SFP is installed in each port
  – Each port is licensed (e.g. Ports-On-Demand, POD)
  – End devices are online
• For remote devices, there are several commands to choose from, but start with nscamshow
  – It tells you whether remote devices are seen within the fabric
  – Name Server (ns*) commands are filtered by ADs in FOS v5.2+. If ADs are implemented, select AD255 (physical fabric view):

    rsl1_st15_b20_1:admin> ad --select 255

• Next, get a view of the fabric configuration with cfgshow
• ...or just collect a supportsave
  – A "super command" script that gathers all these commands and more


Light/Signal

• Fibre Channel Layer 0 (FC-0) connectivity
  – The actual light transmitted and received over FC cabling
  – Use the switchshow command to verify that light/signal is being transmitted from a device; use portflagsshow to check whether the LED flag is set
• Additionally, use sfpshow to verify that the SFP is not faulty


Light/Signal (cont.): Successful Light (Still No Speed/Synchronization) Output Examples

• Use the output of switchshow, portshow, and portflagsshow to verify that light is being received


Link – Speed Negotiation

• Speed negotiation
  – The device and switch use special transmission characters to agree upon a transfer speed of 4 Gbit/sec, 2 Gbit/sec, or 1 Gbit/sec
  – Speed negotiation starts with the highest possible speed and negotiates down until a speed is agreed upon, or until the lowest possible speed has been attempted without success
• CLI output associated with the port when speed negotiation is successful:
  – switchshow: the Speed column displays the speed, and State displays Online
  – portshow: portSpeed displays the configured or negotiated speed
  – portflagsshow: the Physical column displays In_Sync (No_Sync if synchronization failed)
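The step-down behavior described above can be illustrated by modeling only its outcome: the highest speed both ends support wins, and a port locked to a speed the device cannot use never synchronizes. This Python sketch is a simplification under that assumption; it does not model the actual transmission-character exchange, and `negotiate` is a hypothetical helper.

```python
from typing import Optional, Set

SPEEDS_GBPS = [4, 2, 1]  # tried from highest to lowest, as this slide describes


def negotiate(switch_speeds: Set[int], device_speeds: Set[int]) -> Optional[int]:
    """Return the agreed speed in Gbit/sec, or None if no speed is agreed
    (the port stays No_Sync). A port locked to one speed is modeled as a
    one-element set; autonegotiation as the full set."""
    for speed in SPEEDS_GBPS:
        if speed in switch_speeds and speed in device_speeds:
            return speed
    return None


if __name__ == "__main__":
    print(negotiate({4, 2, 1}, {2, 1}))  # auto port, 2G device -> 2
    print(negotiate({4}, {2, 1}))        # port locked at 4G, 2G device -> None
```

The second case is the situation that setting the port back to autonegotiation (portcfgspeed with speed 0) is meant to fix.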


Link – Speed Negotiation (cont.)

Unsuccessful speed negotiation:

    switchshow
      1   1   id   2G   No_Sync

    portshow 1 | grep portSpeed
    portSpeed: 2Gbps

    portflagsshow 1
      Offline   No_Sync   PRESENT

• Ensure the port is set to default values:
  – portcfgdefault 1
• Or manually set the port to autonegotiate speed:
  – portcfgspeed 1 0


Physical Connectivity

• Physical connectivity between a device and a switch port includes the light/signal, speed, and link negotiation processes
• After speed negotiation, the connecting points have to synchronize
• Devices can get into a condition defined as marginal when they go into and out of sync
• Commands that help identify this issue include:
  – porterrshow
  – errshow, whose output may also be relevant
• Fabric Watch can greatly augment the event reporting found in the error log (RASLog)


Physical Connectivity (cont.): porterrshow

• The porterrshow command is very helpful for getting a picture of all ports and their associated error and link-related counters
  – Using this information, you can quickly isolate problems to a specific port
• A marginal link is a degraded physical connection; it is not passing data optimally
  – The porterrshow, portstatsshow, and portshow output display counters that help monitor marginal ports
  – Symptoms include poor performance and occasional loss of connectivity
• A delta of the counters can help you isolate a problem to a port and/or the connected HBA or storage device
  – You can clear the port counters with portstatsclear on a per-port or per-port-group basis (granularity depends on the FOS version)
  – The link counters cannot be cleared without a reboot
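Because a single porterrshow snapshot mixes old and new errors, comparing two snapshots taken some time apart shows which counters are actively rising. A minimal Python sketch, assuming you have recorded each snapshot as a per-port dictionary of integer counts (porterrshow abbreviates large values, so you may prefer portstatsshow for exact numbers); the counter names only roughly mirror porterrshow's columns and, like the helper names, are illustrative.

```python
from typing import Dict, List

Snapshot = Dict[int, Dict[str, int]]  # {port: {counter_name: value}}

# Counter names chosen for this sketch; they roughly mirror porterrshow's columns.
ERROR_COUNTERS = ["enc_in", "crc_err", "enc_out", "disc_c3",
                  "link_fail", "loss_sync", "loss_sig"]


def counter_deltas(before: Snapshot, after: Snapshot,
                   counters: List[str] = ERROR_COUNTERS) -> Snapshot:
    """Return {port: {counter: increase}} for counters that grew between two
    snapshots. Counters that went down (e.g. cleared with portstatsclear in
    between) are ignored rather than reported as negative."""
    deltas: Snapshot = {}
    for port, now in after.items():
        then = before.get(port, {})
        grown = {c: now.get(c, 0) - then.get(c, 0)
                 for c in counters if now.get(c, 0) > then.get(c, 0)}
        if grown:
            deltas[port] = grown
    return deltas


def worst_ports(deltas: Snapshot, top: int = 3) -> List[int]:
    """Ports ordered by total counter growth, largest first."""
    return sorted(deltas, key=lambda p: sum(deltas[p].values()), reverse=True)[:top]
```

Summing all counters in `worst_ports` is a blunt ranking; weighting, for example, crc_err above disc_c3 would be a reasonable refinement for a specific fabric.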


Physical Connectivity (cont.)

• Use the porterrshow command for the initial investigation of marginal links
• portstatsclear can be used to clear the error statistics to the left of the dotted line; the other counters are cleared on a reboot/fastboot


Physical Connectivity (cont.)

Granularity on ports with high error counters:
• porterrshow
  – Less granularity
  – Good for quickly identifying port(s) of interest
• portstatsshow
  – Good for monitoring exact values of counters


Error Counters

Certain port counters can point to physical link layer issues:
• enc_in: increments when 8b/10b encoding errors are detected within a frame; enc_in errors are always detected on the ingress port
• crc_err: indicates corruption within the frame; always seen on the ingress port, but the frame is passed unaltered through the fabric by the switch (like a trail of bread crumbs)
• enc_in and/or crc_err: possible bad media (SFP, cable, patch panel)


Error Counters (cont.)

• enc_out: 8b/10b encoding errors NOT associated with frames (IDLE, R_RDY, and various other primitives). This counter increments during speed negotiation, prior to login. Locking a port to a speed supported by the end device can be used to isolate issues.
  – Possible bad media (SFP, cable, patch panel)
  – Can cause a performance problem due to buffer recovery
• disc_c3: a Class 3 frame has been discarded because it could not be delivered to its destination address
  – Corrupted or not-online destination ID (DID)
  – Timeout exceeded (Condor ASIC hold time exceeded)
  – The counter may increment when FC nodes and/or switches rapidly transition between online and offline; look at the fabriclog -s output (described in the Logical Connectivity slide later)
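The counter descriptions on these slides can be condensed into a lookup that turns counter increases into first hypotheses. This Python sketch is a paraphrase of this section, not a diagnostic rule set; the `HINTS` wording and the `explain` helper are assumptions, and counters without a hint are simply skipped.

```python
from typing import Dict, List

# Hints paraphrased from the counter descriptions on these slides.
HINTS = {
    "enc_in": "8b/10b errors inside frames: suspect the SFP, cable, or patch panel",
    "crc_err": "corrupted frames: suspect media; follow the trail back to the first ingress port reporting them",
    "enc_out": "8b/10b errors outside frames: suspect media; try locking the port speed",
    "disc_c3": "class 3 discards: bad or offline DID, hold-time timeout, or flapping devices (check fabriclog -s)",
}


def explain(increases: Dict[str, int]) -> List[str]:
    """Return a hint for every known counter whose delta is positive,
    in alphabetical order of counter name."""
    return [f"{name}: {HINTS[name]}"
            for name in sorted(increases)
            if increases[name] > 0 and name in HINTS]


if __name__ == "__main__":
    for hint in explain({"crc_err": 12, "enc_in": 12, "disc_c3": 0}):
        print(hint)
```

Feeding it the per-port increases from two counter snapshots, rather than raw totals, keeps historical errors from producing misleading hints.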


Link Counters

These are point-to-point errors; they do not propagate through the fabric.
• Link failures: error conditions that cause a port to drop out of an active state
  – Require the reconnecting device to FLOGI back into the fabric (no speed negotiation is required, since the device does not lose synchronization)
• Loss of sync: occurs when bit and word synchronization on the link is lost
• Loss of signal: occurs when light or an electrical signal is lost on a link
  – Requires the connected device to renegotiate speed and FLOGI back into the fabric
• If you experience device connectivity and/or performance issues together with rising link counters, look for:
  – Bad cables, SFPs, or patch-panel connections
  – Repeating cycles of online/offline states in the fabriclog -s output


Device Initialization into Fabric


Device Initialization: Port Configuration

Device initialization can be affected by the port configuration:
• portcfgshow: displays the port configuration


Port Configuration (cont.)

• switchshow: displays the login status (F, L, E, or G):
    1   1   id   N1   Online   G-Port
• portcfglport: locks the port as an L_Port to force Loop Initialization prior to FLOGI
• portcfggport: locks the port as a G_Port if the HBA/storage has difficulties negotiating the initial Loop Initialization
• portcfg mirrorport: a port configured as a mirror port will prevent HBA/storage login
  – Enable: portcfg mirrorport --enable
  – To connect a device, disable the mirror port configuration: portcfg mirrorport --disable


Login Services

There are three different levels of login:

Fabric Login (FLOGI) is used by an N_Port or NL_Port (Nx_Port) to establish service parameters with the switch – The following information is implicitly captured and put into the Name Server during this process: type; COS; PID; PortName (port WWN); and NodeName (node WWN)

N_Port Login (PLOGI) is used by one Nx_Port to establish service parameters with another N_Port or NL_Port

Process Login (PRLI) is used by an upper-level process in one port to establish image pairs and service parameters with the corresponding upper-level process in the other port – For example, it can be used to establish the environment between related SCSI processes on an originating Nx_Port and a responding Nx_Port
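The login order described above (FLOGI to the fabric, PLOGI to the Name Server, then PLOGI/PRLI between end devices) can be modeled as a small state tracker, which helps when reasoning about where a device's login sequence stopped. This is a simplified, hypothetical Python sketch: the class and method names are illustrative and not part of Fabric OS, and it assumes the standard FC well-known addresses FFFFFE (Fabric F_Port) and FFFFFC (Name Server).

```python
from enum import Enum, auto
from typing import Set

# Standard Fibre Channel well-known addresses (assumed here for illustration)
FABRIC_F_PORT = 0xFFFFFE  # FLOGI destination
NAME_SERVER = 0xFFFFFC    # directory service / Name Server


class Stage(Enum):
    OFFLINE = auto()           # no FLOGI yet
    FABRIC_LOGGED_IN = auto()  # FLOGI accepted
    NS_LOGGED_IN = auto()      # PLOGI to the Name Server done


class LoginTracker:
    """Hypothetical tracker enforcing FLOGI -> PLOGI -> PRLI ordering for one Nx_Port."""

    def __init__(self) -> None:
        self.stage = Stage.OFFLINE
        self.plogi_peers: Set[int] = set()
        self.prli_peers: Set[int] = set()

    def flogi(self, did: int) -> None:
        if did != FABRIC_F_PORT:
            raise ValueError("FLOGI must be sent to 0xFFFFFE")
        self.stage = Stage.FABRIC_LOGGED_IN

    def plogi(self, did: int) -> None:
        # No frame may be sent through the fabric before FLOGI
        if self.stage is Stage.OFFLINE:
            raise RuntimeError("no frames may be sent before FLOGI")
        if did == NAME_SERVER:
            self.stage = Stage.NS_LOGGED_IN
        else:
            self.plogi_peers.add(did)

    def prli(self, did: int) -> None:
        # PRLI sets up an upper-level (e.g. SCSI) image pair over an existing PLOGI session
        if did not in self.plogi_peers:
            raise RuntimeError("PRLI requires a prior PLOGI to the same port")
        self.prli_peers.add(did)
```

Calling prli() for a PID that has not been through plogi() raises, mirroring the rule that a process login runs on top of an established N_Port login.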


Fabric Login (FLOGI)



When devices first connect, their address is 000000 (loop devices instead use 0000pp, where pp is the device's AL_PA)

FLOGI is required before any frame can be sent through the fabric



FLOGI is sent to well-known address FFFFFE (Fabric F_Port)
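Because every frame after FLOGI is addressed by a 24-bit PID, it helps to read a PID byte by byte. The sketch below assumes the usual Brocade split into domain, area, and port bytes (the port byte is the AL_PA for loop devices), plus the well-known fabric addresses; the function name and returned keys are hypothetical.

```python
from typing import Dict, Union

# Well-known addresses; FFFFFE and FFFFFD appear in this section, FFFFFC is the standard Name Server address
WELL_KNOWN = {
    0xFFFFFE: "Fabric F_Port (FLOGI)",
    0xFFFFFD: "Fabric Controller (SCR/RSCN)",
    0xFFFFFC: "Name Server",
}


def decode_pid(pid: int) -> Dict[str, Union[int, bool, str]]:
    """Split a 24-bit Fibre Channel address into domain/area/port bytes."""
    if not 0 <= pid <= 0xFFFFFF:
        raise ValueError("a PID is a 24-bit value")
    if pid in WELL_KNOWN:
        return {"well_known": WELL_KNOWN[pid]}
    return {
        "domain": (pid >> 16) & 0xFF,  # switch domain ID
        "area": (pid >> 8) & 0xFF,     # area, usually tied to the switch port
        "port": pid & 0xFF,            # AL_PA for loop devices, typically 00 for an N_Port
        # 000000 or 0000pp: the address a device uses before FLOGI assigns one
        "pre_flogi": (pid >> 8) == 0,
    }
```

For example, decode_pid(0x1400E8) yields domain 0x14, area 0x00, and port 0xE8, matching the NL_Port target that appears in the fcping and nszonemember examples later in this section.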


Commands to Check FLOGI Status

switchshow – A successful login displays an F_Port (including its WWN) or L_Port

portshow – A successful login displays the fabric's view of the device:
– portFlags – a bit map and English translation of the port's login process
– portState – Online
– portPhys – In_Sync: receiving light and synchronized
– portId – 24-bit fabric address, the port identifier (PID) of the device
– portScn – F_Port; from the fabric's point of view, all end devices that successfully logged in are F_Ports
– port WWN(s) of connected device(s) – an F_Port has one WWN; an FL_Port can have multiple WWNs
– Distance and speed configuration of the port

portflagsshow – lists the translation of all port login state flags; same as the portshow portFlags output


portshow


portstatsshow – BB Credit


portcamshow

Hardware-enforced zoning – SID/DID zone tables are kept in the ASIC – display them with portcamshow

Out of CAM entries – the port changes to session-based zoning – this is a resource issue, not an actual error condition

portzoneshow – undocumented/unsupported command – displays the type of zoning (hard, session-based) for each port


Logical Connectivity – fabriclog -s

fabriclog -s supersedes the fabstateshow command – use it to check for port Online/Offline transitions:

– Port 1 transitioned from Offline to Online multiple times – check physical connectivity for bad cables, SFPs, patch panels, etc.
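Spotting a port that keeps cycling between Offline and Online comes down to counting transitions per port. The fabriclog -s output layout is not reproduced in this text, so the sketch below assumes the log has already been parsed into hypothetical (port, from_state, to_state) tuples; the threshold of 3 is an arbitrary illustration, not a Brocade recommendation.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, str, str]  # (port, from_state, to_state), hypothetical pre-parsed form


def flapping_ports(events: Iterable[Event], threshold: int = 3) -> List[int]:
    """Return ports with at least `threshold` Offline -> Online transitions."""
    comeups = Counter(
        port for port, old, new in events
        if old == "Offline" and new == "Online"
    )
    return sorted(port for port, count in comeups.items() if count >= threshold)
```

Any port this returns is a candidate for the physical checks listed above (cable, SFP, patch panel).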


Fabric – Name Server

Successful port login and registration to the Name Server
– A port login (PLOGI) to the Name Server can be confirmed by looking at the Name Server information
– Verify using the nsshow command
– An unsuccessful port login means there is no information for the device in the Name Server


Fabric – Name Server (cont.)

Check for successful port login with the nsshow -t option, which also shows whether the device is an Initiator or a Target


State Change Notification Services

State Change Notification (SCN) – SCNs are used for internal state change notifications, not external ones
– This is the switch logging that the port is online or is an Fx_Port
– SCNs are not sent from the switch to the Nx_Ports!

State Change Registration (SCR) – an Nx_Port request to receive notification when something in the fabric changes
– FC devices that choose to receive RSCNs must register for this service
• Devices send a State Change Registration (SCR) to FFFFFD
• Registration indicates that the device wants to be notified of changes
– Devices register after PLOGI to the Name Server

Registered State Change Notification (RSCN) – issued by the Fabric Controller service or an Nx_Port to devices that registered (issued an SCR requesting this notification) – only sent to devices within an affected zone


Fabric Controller Services

The Fabric Controller (FFFFFD) service alerts devices to fabric changes by sending a Registered State Change Notification (RSCN). A device receives RSCNs only if it registered for them using an SCR; RSCNs are sent when:
– A new device has been added (within the same zone)
– An existing device has been removed (within the same zone)
– A zone has been changed
– A switch name or IP address has changed
– The fabric has reconfigured

Registration is optional
– SCSI initiators normally register
– SCSI targets do not register
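For device additions and removals, the delivery rule above (registered devices only, within the affected zone) reduces to a set calculation. This is a simplified Python model, not how the Fabric Controller is implemented: zones are assumed to be plain sets of PIDs, the changed device is assumed not to be notified about itself, and fabric-wide triggers such as zone changes or fabric reconfiguration are not modeled.

```python
from typing import Iterable, Set


def rscn_recipients(changed_pid: int, registered: Set[int],
                    zones: Iterable[Set[int]]) -> Set[int]:
    """Devices that would receive an RSCN when `changed_pid` is added or removed.

    registered: PIDs that sent an SCR to the Fabric Controller (FFFFFD)
    zones: zone memberships as sets of PIDs (hypothetical simplified zoning)
    """
    affected: Set[int] = set()
    for members in zones:
        if changed_pid in members:
            affected |= members        # everyone zoned with the changed device
    affected.discard(changed_pid)      # in this model the changed device is not notified
    return affected & registered       # only devices that registered via SCR
```

With an initiator (registered) and a target (not registered) zoned together, a change to the target notifies the initiator, while a change to the initiator notifies nobody, which is why a SCSI target is not told when its initiator comes and goes.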


Changes Within the Fabric

“Properly” written device drivers do the following in response to an RSCN:
– Query the Name Server for changes related to devices they are (or were) logged into
– Initiate a port login to any new devices the Name Server reports within their Virtual Fabric zoning configuration

Sometimes it isn’t a device driver issue. Applications can fail if their I/O is not satisfied quickly (“quickly” is a relative term).
– If necessary, FOS provides the ability to suppress RSCNs per port.


Device Identification Commands

Use switchshow, nsshow, nscamshow, nsallshow, and nodefind to identify devices in the fabric

nsallshow lists all 24-bit PID addresses within the current fabric (Name Server view of the current AD)

nodefind lists Name Server information for:
– A specified alias
– A specified WWN
– A specified PID address


Devices - End-to-End Connectivity

End-to-end device connectivity can be blocked on the switch by:
– Zoning
– AD configuration
– Commands to check include fcping, cfgshow, and ad --show

End-to-end device connectivity flow
– Nx_Port to Nx_Port communication
– Initiator to target (similar to the SCSI model)
– PLOGI/PRLI from Nx_Port to Nx_Port

Name Server query – initiators learn about “devices of interest” based on the FC-4 type (5 or 8), where 8 = FCP/SCSI and 5 = IP over FC
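The Name Server query by FC-4 type amounts to a filter over registered entries. In this sketch the entries are hypothetical dictionaries standing in for Name Server records; only the type codes 8 (FCP/SCSI) and 5 (IP over FC) come from the text above.

```python
from typing import Dict, List

FC4_IP = 0x05   # IP over FC
FC4_FCP = 0x08  # FCP/SCSI


def devices_of_interest(ns_entries: List[Dict], fc4_type: int = FC4_FCP) -> List[int]:
    """Return PIDs of Name Server entries registered for the given FC-4 type."""
    # Each entry is assumed to look like {"pid": 0x1400E8, "fc4_types": {8}}
    return [entry["pid"] for entry in ns_entries if fc4_type in entry["fc4_types"]]
```

A SCSI initiator would typically ask only for type 8, so a device that registered only for IP over FC never shows up in its list.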


End-to-End Device Connectivity (cont.)

Use the fcping command to check end-device connectivity and zoning

Response when a device is not online:
    rsl1_st15_b41_1:admin> fcping 0x1400e8 0x0a0100
    fcping: Error destination port invalid

Response when both devices are online but one does not respond to the fcping ELS ECHO frame:
    rsl1_st15_b20_1:admin> fcping 0x0a0000 0x1400e2
    Source: 0xa0000
    Destination: 0x1400e2
    Zone Check: Not Zoned

    Pinging 0xa0000 with 12 bytes of data:
    received reply from 0xa0000: 12 bytes time:650 usec
    5 frames sent, 5 frames received, 0 frames rejected, 0 frames timeout
    Round-trip min/avg/max = 567/618/674 usec

    Pinging 0x1400e2 with 12 bytes of data:
    Request timed out
    5 frames sent, 0 frames received, 0 frames rejected, 5 frames timeout
    Round-trip min/avg/max = 0/0/0 usec
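When fcping output is collected for many host/target pairs, the per-destination summary lines can be checked mechanically. The parser below is a sketch that assumes the output layout shown above (a "Pinging <PID> with ..." line followed later by an "N frames sent, ..." summary); other Fabric OS releases may format the output differently, and the function names are hypothetical.

```python
import re
from typing import Dict, List

# One block per destination: "Pinging <pid> ..." up to its frame-count summary line
BLOCK = re.compile(
    r"Pinging (0x[0-9a-fA-F]+) with \d+ bytes of data:.*?"
    r"(\d+) frames sent, (\d+) frames received, "
    r"(\d+) frames rejected, (\d+) frames timeout",
    re.DOTALL,
)
FIELDS = ("sent", "received", "rejected", "timeout")


def fcping_summary(output: str) -> Dict[str, Dict[str, int]]:
    """Map each pinged PID to its sent/received/rejected/timeout frame counts."""
    return {
        m.group(1): dict(zip(FIELDS, map(int, m.groups()[1:])))
        for m in BLOCK.finditer(output)
    }


def unresponsive(output: str) -> List[str]:
    """PIDs that answered none of the ELS ECHO frames sent to them."""
    return [pid for pid, counts in fcping_summary(output).items()
            if counts["received"] == 0]
```

Run against the second transcript above, unresponsive() reports 0x1400e2, the device that timed out on all five frames.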


Device to Device Login

Don’t forget, devices do not only log into the fabric. Initiators initiate PLOGIs and PRLIs to other end devices after:
– Each device is Online in the switch database
– Each device has registered with the Name Server
– The devices are zoned together and within the same Virtual Fabric Administrative Domain (AD)

The mechanism for devices to log in to each other through PLOGI is the same as that used for device-to-switch login

The switch acts as a “middle-man”
– Passing PLOGI/PRLI requests and ACCEPT responses, or
– Discarding such requests if the devices are not zoned together or not in the same AD


Port Configuration – End-to-End

Check port configuration for end-to-end device connectivity

Use nszonemember as a final step to verify that:
– End devices have logged into the Name Server, are Online, and are zoned together within the same AD

    rsl1_st15_b20_1:admin> nszonemember 0x0a0100
    1 local zoned members:
    Type Pid COS PortName NodeName SCR
    N 0a0100; 2,3;10:00:00:00:c9:22:1f:23;20:00:00:00:c9:22:1f:23; 3
        FC4s: FCP
        NodeSymb: [30] "Emulex LP8000 FV3.90A7 DV6.02h"
        Fabric Port Name: 20:01:00:05:1e:02:0c:77
        Permanent Port Name: 10:00:00:00:c9:22:1f:23
        Device type: Physical Initiator
        Port Index: 1
        Share Area: No
        Device Shared in Other AD: No

(output continued below)


Port Configuration – End-to-End (cont.)

Check port configuration for end-to-end device connectivity (nszonemember 0x0a0100 output continued…)

    1 remote zoned members:
    Type Pid COS PortName NodeName
    NL 1400e8; 3;21:00:00:04:cf:92:6a:58;20:00:00:04:cf:92:6a:58;
        FC4s: FCP
        PortSymb: [28] "SEAGATE ST318452FC 0004"
        Fabric Port Name: 20:00:00:05:1e:02:aa:7b
        Permanent Port Name: 21:00:00:04:cf:92:6a:58
        Device type: Physical Target
        Port Index: 0
        Share Area: No
        Device Shared in Other AD: No

Verifies end-to-end zoning within the fabric
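Since the final check is that an initiator and a target appear together in the nszonemember output, the "Device type" lines can be checked programmatically. This is a sketch that assumes the output layout shown in the two listings above; the helper names are hypothetical.

```python
import re
from typing import List, Tuple


def zoned_members(output: str) -> List[Tuple[str, str]]:
    """Pair each Permanent Port Name with its Device type, in output order."""
    wwns = re.findall(r"Permanent Port Name:\s*([0-9a-fA-F:]+)", output)
    kinds = re.findall(r"Device type:\s*(.+)", output)
    return list(zip(wwns, (kind.strip() for kind in kinds)))


def initiator_and_target_zoned(output: str) -> bool:
    """True if at least one Initiator and one Target are listed."""
    # "Physical Initiator" -> "Initiator", "Physical Target" -> "Target"
    roles = {kind.split()[-1] for _, kind in zoned_members(output)}
    return {"Initiator", "Target"} <= roles
```

Applied to the combined local and remote listings above, this returns True: the Emulex HBA is reported as an initiator and the Seagate disk as a target.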


When to use an Analyzer?

When all devices are logged into the fabric, zoning is configured properly, and hosts still do not see their targets

When there are I/O disruptions that cannot be isolated with the RASLog (errdump) or porterrshow/portstatsshow

When a problem exists within the payload of a transfer

To monitor the health of a system for error statistics and performance problems (the switch also has relevant built-in diagnostic capabilities)

To diagnose protocol problems
– A complete look at the FC header and payload
– Capture end-to-end protocol information (including ULPs)

To troubleshoot extended fabric communication
– An FC analyzer can be installed between the switch and the gateway at each end
• Is the transmission the same as the reception?
• Can bit, character, and word sync be established?


Port Mirroring - Configuration

Decide on the location of the mirror port: on the same ASIC as the SID or DID port

Log in to the physical fabric using an Admin role account

Follow these steps to use port mirroring to capture an FC analyzer trace:

1. Configure the port as a mirror port:
    portcfg mirrorport --enable
2. Verify the configuration with portcfgshow and switchshow
– Connect an FC analyzer to the mirror port and verify that it comes online
3. Configure the port mirroring connection between the SID and DID through the mirror port:
    portmirror --add
– The mirror port must be online
4. Verify the mirror connection:
    portmirror --show
5. Start the FC analyzer capture, reproduce the problem, stop the capture, and review the output; then remove the port mirror connection:
    portmirror --delete
6. Remove the mirror port configuration (to allow other connections to this port):
    portcfg mirrorport --disable


Gathering Switch Support Data for Problem Determination and Escalation


Switch Support Data - Overview

Up to this point, we have gathered details about a switch by running one CLI command at a time

For long-term support of a switch, we need to begin gathering switch support data:
– Larger, file-oriented data that provides a broader view of the switch
– Configuration of parameters
– State of FRUs and ports, both currently and in the past

Several different types of switch support data can be collected from a Brocade switch, router, or Director:
– Switch error logs (RASLogs)
– Audit logs
– FFDC files
– Panic dump and core files
– Trace dump files


RASLog - Overview

Starting in Fabric OS v4.4, the System Message Log is called the Reliability, Availability, and Serviceability Log (RASLog)

RASLog messages are defined in one of two groups:
– External messages – CRITICAL, ERROR, WARNING, and INFO – can be viewed by admin-level users
– Internal messages – DEBUG and PANIC – cannot be viewed by admin-level users

There is one RASLog stored in persistent memory
– Up to 1024 external messages are stored in a non-volatile circular buffer
– In blade-based switches, each CP maintains a separate RASLog

In Fabric OS v5.1+, certain security- and zoning-related commands cause an AUDIT flag to be added to error messages
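The non-volatile circular buffer described above behaves like a bounded queue: once 1024 external messages are stored, each new message pushes out the oldest one. The Python sketch below only illustrates that retention behavior in memory and says nothing about how Fabric OS persists the log; the class name is hypothetical, and dump(reverse=True) imitates the most-recent-first order of errdump -r.

```python
from collections import deque
from typing import List

RASLOG_CAPACITY = 1024  # external messages retained, per the text above


class CircularRasLog:
    """Hypothetical in-memory model of a fixed-size log that evicts the oldest entry."""

    def __init__(self, capacity: int = RASLOG_CAPACITY) -> None:
        # A deque with maxlen silently drops the oldest entry when full
        self._messages = deque(maxlen=capacity)

    def log(self, message: str) -> None:
        self._messages.append(message)

    def dump(self, reverse: bool = False) -> List[str]:
        # Default: least recent to most recent; reverse=True: most recent first
        return list(reversed(self._messages)) if reverse else list(self._messages)
```

After 1030 messages, the first six are gone, which is why an error that happened long before a burst of log activity may no longer be in the RASLog and why forwarding to syslog matters.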


RASLog - Standard Message Format

Fabric OS v4.4+ error messages follow a standard format:
– Start delimiter (customizable): Start
– Date (including year) and time: 2006/03/08-11:59:32
– Message module and numeric instance: ZONE-3006
– Sequence number: 9
– Audit flag: AUDIT or FFDC (added in Fabric OS v5.1)
– Severity level (one of four levels): INFO
– Switch name: NDA-ST01-B48
– Error description: User: admin, Role: admin, Event: cfgdisable, Status: success, Info: Current zone configuration disabled.
– End delimiter (customizable): End

    Start 2006/03/08-11:59:32, [ZONE-3006], 9, AUDIT, INFO, NDA-ST01-B48, User: admin, Role: admin, Event: cfgdisable, Status: success, Info: Current zone configuration disabled. End
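Because the fields always appear in the same order, a RASLog line can be parsed with one regular expression. The sketch below assumes the comma-separated layout of the example above, treats the AUDIT/FFDC flag as optional (it appears only on some messages), and takes the customizable start/end delimiters as parameters; the function name and result keys are hypothetical.

```python
import re
from datetime import datetime
from typing import Dict, Optional

RASLOG_RE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}), "
    r"\[(?P<module>[A-Z0-9]+)-(?P<number>\d+)\], "
    r"(?P<seq>\d+), "
    r"(?:(?P<flag>AUDIT|FFDC), )?"                    # optional audit/FFDC flag
    r"(?P<severity>CRITICAL|ERROR|WARNING|INFO), "
    r"(?P<switch>[^,]+), "
    r"(?P<text>.*)$"
)


def parse_raslog(line: str, start: str = "Start", end: str = "End") -> Optional[Dict]:
    """Split one RASLog message into its fields; return None if it does not match."""
    line = line.strip()
    # Strip the (customizable) start and end delimiters, if present
    if start and line.startswith(start):
        line = line[len(start):].lstrip()
    if end and line.endswith(end):
        line = line[: -len(end)].rstrip()
    m = RASLOG_RE.match(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "time": datetime.strptime(d["ts"], "%Y/%m/%d-%H:%M:%S"),
        "msg_id": f"{d['module']}-{d['number']}",
        "seq": int(d["seq"]),
        "flag": d["flag"],          # None when the message carries no flag
        "severity": d["severity"],
        "switch": d["switch"],
        "text": d["text"],
    }
```

For the example line above, this yields msg_id ZONE-3006, sequence 9, flag AUDIT, severity INFO, and switch NDA-ST01-B48, which makes it easy to filter an errdump capture by severity or message module.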


RASLog - Management

Use the following commands to view the external RASLog messages:
– errdump – display all external messages in the error log with no line breaks (default order: least recent to most recent)
– errshow – display all external messages in the error log with line breaks (default order: least recent to most recent)
– errdump -r / errshow -r – display error messages in reverse order: most recent to least recent
– errclear – clear all internal and external messages from the error log (Admin level)

Forward RASLog and console log entries to a syslogd daemon on a host computer (syslogdipadd)
– Especially important on dual-CP systems, as the host computer log maintains a single, sequentially ordered, merged file for both CPs


Audit Log - Overview

The RASLog was designed to capture abnormal, error-related messages – not high-frequency AUDIT events

In Fabric OS v5.1 and earlier, both error messages and AUDIT events are sent to the RASLog

In Fabric OS v5.2+, error messages go to the RASLog, and all AUDIT events go only to a new Audit Log


Audit Log – Overview (cont.)

The new Audit Log is designed for post-event audits and problem determination
– Captured per Virtual Fabric AD
– Configurable (off by default)
– For a given event it captures:
• Who (user), when (timestamp), what (SAN component), and which AD
• Event type
• Other event-specific information (description)
– Format consistent with the DMTF standard

AUDIT messages are always sent to the console, and can be configured to go to syslog servers


Audit Log - Details

Fabric OS v5.2+ continues to audit all Fabric OS v5.1 AUDIT messages:
– Secure Fabric OS configuration
– Security-related: SSL, RADIUS, Zone, and password strengthening configuration

Fabric OS v5.2+ can also be configured to audit these tasks:
– configdownload (not configupload)
– firmwaredownload start, complete, and error messages encountered during download
– User-initiated security events related to ACLs
– Fabric events related to command execution in other ADs (ad --exec)

In an AD-aware fabric, Audit Log configuration is done from AD255

Commands involved in configuring the Audit Log include:
– auditcfg – enable auditing and define what gets audited (filters)
– syslogdipadd – specify the IP address of a syslog server configured to receive audit messages


FFDC - Overview

To minimize requests for problem recreation for certain Brocade-defined events, Fabric OS captures First Failure Data Capture (FFDC) data
– Goal: allow Brocade engineers to gain insight into problems that are transient, difficult to recreate, or difficult to solve
– Triggered by error MSG_IDs that are selected by Brocade engineering
– Messages are written to the console and the error log with an FFDC flag

FFDC automatically collects “supportshow-like” information (based on CLI commands) as readable text when the selected event occurs
– A single FFDC event may create one or more FFDC files
– Up to 4 MB for all FFDC files combined (if the maximum size is reached, a RASLog message is generated and periodic console messages are sent)

FFDC files are stored on the switch, and transferred by supportsave (automatically deletes the files) or savecore (does not automatically delete the files)


FFDC - Configuring

Enable and disable the FFDC functionality with the supportffdc command
– Enabled by default – disable only if directed to do so by next-level support
    switch:admin> supportffdc --enable
    switch:admin> supportffdc --disable
    switch:admin> supportffdc --show


FFDC - Capturing

The supportsave command uploads the FFDC data via FTP, and deletes it from the switch
– The file name indicates the triggering event and a date/time stamp (example: FSSM1005-2006-08-12-114707.ffdc)

The savecore command also uploads the FFDC data via FTP (same file name), but does not delete it from the switch
    switch:admin> savecore
    following 1 directories contains core files:
    [ ]0: /core_files/ffdc_data
    Welcome to core files management utility.
    Menu
    1(or R): Remove all core files
    2(or F): FTP all core files
    3(or r): Remove marked files
    4(or f): FTP marked files
    5(or m): Mark Files for action
    6(or u): Un Mark Files for action
    9(or e): Exit
    Your choice:
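The FFDC file name pairs the triggering message ID with a time stamp, which makes it easy to line uploaded files up against RASLog entries. The sketch infers the layout <module><number>-YYYY-MM-DD-HHMMSS.ffdc from the single example above, so treat that layout as an assumption; the function name is hypothetical, and it returns the ID in the MODULE-NUMBER form used in RASLog messages.

```python
import re
from datetime import datetime
from typing import Tuple

# Layout inferred from one example: FSSM1005-2006-08-12-114707.ffdc
FFDC_NAME = re.compile(
    r"^(?P<module>[A-Z]+)(?P<number>\d+)-(?P<stamp>\d{4}-\d{2}-\d{2}-\d{6})\.ffdc$"
)


def parse_ffdc_name(name: str) -> Tuple[str, datetime]:
    """Return (message ID, capture time) encoded in an FFDC file name."""
    m = FFDC_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognized FFDC file name: {name}")
    msg_id = f"{m['module']}-{m['number']}"
    taken = datetime.strptime(m["stamp"], "%Y-%m-%d-%H%M%S")
    return msg_id, taken
```

For the example file, this gives the ID FSSM-1005 and the time 2006-08-12 11:47:07, which can then be matched against the FFDC-flagged message in errdump.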


Panic Dump and Core Files - Overview

Fabric OS creates panic dump and core files when there are problems in the Fabric OS kernel
– Generated when an important Fabric OS daemon no longer responds or terminates unexpectedly
– Captures a snapshot of the current state of the switch at the time of the crash – no historical information is retained
– Panic dumps are text files; core file contents are encrypted

In a dual-CP Director, each CP can create these files, so always check both CPs


Panic Dump and Core Files (cont.)

To display panic dump files at the command line, enter the pdshow command:
    switch:admin> pdshow
    Could not find any valid pd file!

To upload (via FTP) or delete panic dump and core files, use the savecore command:
    switch:admin> savecore -l
    /core_files/panic/core.873
    /core_files/zoned/core.1234
    /core_files/zoned/core.5678
    /mnt/core_files/nsd/core.873
    /mnt/core_files/panic/core.873
    switch:admin> savecore -h 192.168.204.188 -u jsmith -d core_files_here -p password -f /core_files/zoned/,/mnt/core_files/nsd/
    /core_files/zoned//core.1234: 1.12 kB 382.60 B/s
    /core_files/zoned//core.5678: 1.12 kB 381.95 B/s
    /mnt/core_files/nsd//core.873: 1.12 kB 382.53 B/s
    Files transferred successfully!


Trace Dump - Overview

The trace functionality is a proactive troubleshooting tool
– Included in Fabric OS v4.4+ to aid Fabric OS debugging
– Always running, maintaining a historical record of the current and past state of the switch – cannot be disabled
– No impact on user data performance

The results from the trace operation are stored in a trace dump file
– Triggered by a panic, a timeout, a CRITICAL-level event, or a manual trigger
– Binary file, retained in persistent memory
– Can be uploaded automatically or manually via FTP


Trace Dump - Implementation

Create or remove a trace dump file, or display trace dump status, with the tracedump command
– tracedump -n: create a trace dump manually
– tracedump -r: remove (delete) a trace dump from the switch

Use the traceftp command to manage uploading (but not deleting) trace dumps:
– traceftp -n: manually upload trace dumps via FTP
– traceftp -e: enable automatic FTP upload of trace dumps
– traceftp -d: disable automatic FTP upload of trace dumps
– With traceftp -e, specify the FTP server to which trace dumps are uploaded with the supportftp command – this is required, or trace dump files will not be uploaded automatically

Web Tools supports some of the traceftp command functionality


Capturing Switch Support Data - Overview

There are several tools that you can use to capture switch support data:
– supportshow
– supportsave
– Fabric Manager
– SAN Health


Capturing Switch Support Data – supportshow

supportshow is a script that executes groups of pre-selected Fabric OS and Linux commands and displays their output at the CLI

To simplify future troubleshooting, use the supportshow output to establish a switch baseline
– Documents the switch configuration under good conditions
– Future troubleshooting can start by comparing the current supportshow output with the baseline

supportshow takes ADs into consideration:
– The command is relevant only in AD0 (no user-defined ADs) or AD255 (with user-defined ADs)
– The AD must include the switch on which the command is run
– Example supportshow response in a non-AD0/AD255 context: Operation not allowed in AD1-AD254 context
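Comparing a current supportshow capture against the baseline is essentially a text diff. A minimal sketch using Python's standard difflib, assuming both captures were saved as text files (the paths and function names are hypothetical); expect noise from lines that change on every run, such as counters, uptimes, and time stamps, which you may want to filter out before diffing.

```python
import difflib
from pathlib import Path
from typing import List


def diff_against_baseline(baseline: str, current: str) -> List[str]:
    """Unified diff of two supportshow captures; an empty list means no change."""
    return list(difflib.unified_diff(
        baseline.splitlines(),
        current.splitlines(),
        fromfile="baseline",
        tofile="current",
        lineterm="",
    ))


def diff_files(baseline_path: str, current_path: str) -> List[str]:
    # Hypothetical file locations for two saved supportshow text captures
    return diff_against_baseline(
        Path(baseline_path).read_text(errors="replace"),
        Path(current_path).read_text(errors="replace"),
    )
```

Lines prefixed with "-" existed only in the baseline and lines prefixed with "+" appear only in the current capture, so a changed FRU or port state stands out immediately.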


Capturing Switch Support Data – supportsave

To aid the capture of supportshow information, Fabric OS v4.4 introduced supportsave
– Uploads the supportshow output in a text file whose name indicates the switch name (Director), CP slot (S0, S5), time stamp (200605200014), and SUPPORTSHOW
– Also uploads FFDC files, as well as other information

    switch:admin> supportsave -h 192.168.1.1 -u anonymous -d tmp
    This command will collect RASLOG, TRACE, and supportShow (active CP only) information for the local CP and then transfer them to a FTP server. The operation can take several minutes.
    OK to proceed? (yes, y, no, n): [no] y
    ...
    Saving support information for module SUPPORTSHOW...
    ...rtSave_files/Director-S5-200605200014-SUPPORTSHOW: 1.11 MB 346.39 kB/s

supportsave needs to be run on both the Active and Standby CPs
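Because supportsave has to be run on both CPs, one quick way to confirm that a collection is complete is to read the CP slot out of each uploaded file name. The pattern below is inferred from the single example name above (switch name, CP slot, YYYYMMDDHHMM time stamp, module) and is an assumption; names produced by fixed-port switches or other Fabric OS releases may differ. The function and type names are hypothetical.

```python
import re
from datetime import datetime
from typing import Iterable, NamedTuple, Set

# <switch>-<CP slot>-<YYYYMMDDHHMM>-<MODULE>, inferred from one example name
NAME = re.compile(r"^(?P<switch>.+)-(?P<slot>S\d+)-(?P<stamp>\d{12})-(?P<module>[A-Z_]+)$")


class SupportFile(NamedTuple):
    switch: str
    slot: str
    taken: datetime
    module: str


def parse_supportsave_name(name: str) -> SupportFile:
    m = NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognized supportsave file name: {name}")
    return SupportFile(m["switch"], m["slot"],
                       datetime.strptime(m["stamp"], "%Y%m%d%H%M"), m["module"])


def cp_slots_collected(names: Iterable[str]) -> Set[str]:
    """CP slots for which at least one supportsave file is present."""
    return {parse_supportsave_name(n).slot for n in names}
```

If cp_slots_collected() returns only one slot for a dual-CP Director, supportsave was most likely run on just one CP.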


Capturing Switch Support Data – SAN Health

Another tool that automates the documentation of a SAN is Brocade SAN Health

SAN Health is a free utility that helps you create:
– Comprehensive documentation
– Historical performance graphs
– Detailed topology diagrams
– Best practice recommendations

SAN Health can be run against:
– Brocade systems running any version of Fabric OS or XPath OS
– McDATA systems running EOS 4.x+


Gathering Switch Support Data – Troubleshooting

Before troubleshooting a Brocade switch, router, or Director, gather all the basic information that you can:
– Document the current state of the switch with supportsave: RASLogs, numerous command outputs (supportshow)
– Identify user actions taken in the past: Audit logs (if available)

Validate the current state of the switch by reviewing supportshow:
– Verify switch access settings (e.g. ipaddrshow)
– Check FRU status (e.g. fanshow)
– Validate firmware revisions (e.g. firmwareshow)
– Check port status and port errors (e.g. porterrshow)

Identify faults on the switch by checking the RASLog (errdump) for error-related messages

As needed, compare time stamps between the RASLog and the Audit Log to determine whether user actions were a problem source
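Correlating the two logs means asking, for each error, which user actions happened shortly before it. A minimal sketch, assuming both logs have already been reduced to (timestamp, text) pairs in the RASLog date format shown earlier and that all sources report the same clock; the 10-minute window and the function name are arbitrary illustrations.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

TS_FORMAT = "%Y/%m/%d-%H:%M:%S"  # e.g. 2006/03/08-11:59:32
Entry = Tuple[str, str]          # (timestamp, message text)


def audits_preceding_errors(errors: List[Entry], audits: List[Entry],
                            window: timedelta = timedelta(minutes=10)
                            ) -> List[Tuple[str, List[str]]]:
    """For each error, list the audit events logged within `window` before it."""
    parsed = [(datetime.strptime(ts, TS_FORMAT), text) for ts, text in audits]
    result = []
    for ts, text in errors:
        when = datetime.strptime(ts, TS_FORMAT)
        nearby = [a for at, a in parsed if when - window <= at <= when]
        result.append((text, nearby))
    return result
```

An error whose list is empty is less likely to have been caused by a recent user action; a non-empty list (for example, a cfgdisable a few minutes before devices lost access) is a strong lead.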


Gathering Switch Support Data – Escalating to Next-Level Support

If you are escalating an issue to next-level support, gather all the basic and Brocade information from the switch by running supportsave:
– RASLogs
– supportshow
– FFDC files
– Trace dumps
– Core files and panic dumps
– AP blade details

In addition, describe the problem in as much detail as possible:
– Affected devices/ports/switches
– SAN topology drawing
– Previous course of action (timeline, commands run)
– Details on recent changes to the fabric (additions/removals/configuration changes)

If available, also capture the Audit logs, so that past user actions can be identified


Fin
