HP-UX Performance and Tuning (H4262S)


HP Training
Student guide

HP-UX Performance and Tuning
H4262S C.00

Copyright 2004 Hewlett-Packard Development Company, L.P.

The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

This is an HP copyrighted work that may not be reproduced without the written permission of HP. You may not use these materials to deliver training to any person outside of your organization without the written permission of HP.

UNIX® is a registered trademark of the Open Group.

Printed in the USA

HP-UX Performance and Tuning
Student guide
May 2004

Contents

Overview ... 1

Module 1 — Introduction
1–1.  SLIDE: Welcome to HP-UX Performance and Tuning ... 1-2
1–2.  SLIDE: Course Outline ... 1-3
1–3.  SLIDE: System Performance ... 1-4
1–4.  SLIDE: Areas of Performance Problems ... 1-6
1–5.  SLIDE: Performance Bottlenecks ... 1-8
1–6.  SLIDE: Baseline ... 1-10
1–7.  SLIDE: Queuing Theory of Performance ... 1-12
1–8.  SLIDE: How Long Is the Line? ... 1-13
1–9.  SLIDE: Example of Queuing Theory ... 1-14
1–10. SLIDE: Summary ... 1-16
1–11. LAB: Establishing a Baseline ... 1-17
1–12. LAB: Verifying the Performance Queuing Theory ... 1-19

Module 2 — Performance Tools
2–1.  SLIDE: HP-UX Performance Tools ... 2-2
2–2.  SLIDE: HP-UX Performance Tools (Continued) ... 2-3
2–3.  SLIDE: Sources of Tools ... 2-4
2–4.  SLIDE: Types of Tools ... 2-6
2–5.  SLIDE: Criteria for Comparing the Tools ... 2-8
2–6.  SLIDE: Data Sources ... 2-10
2–7.  SLIDE: Performance Monitoring Tools (Standard UNIX) ... 2-11
2–8.  TEXT PAGE: iostat ... 2-12
2–9.  TEXT PAGE: ps ... 2-14
2–10. TEXT PAGE: sar ... 2-16
2–11. TEXT PAGE: time, timex ... 2-18
2–12. TEXT PAGE: top ... 2-19
2–13. TEXT PAGE: uptime, w ... 2-21
2–14. TEXT PAGE: vmstat ... 2-22
2–15. SLIDE: Performance Monitoring Tools (HP Specific) ... 2-25
2–16. TEXT PAGE: glance ... 2-26
2–17. TEXT PAGE: gpm ... 2-28
2–18. TEXT PAGE: xload ... 2-30
2–19. SLIDE: Data Collection Performance Tools (Standard UNIX) ... 2-31
2–20. TEXT PAGE: acct Programs ... 2-32
2–21. TEXT PAGE: sar ... 2-34
2–22. SLIDE: Data Collection Performance Tools (HP-Specific) ... 2-36
2–23. TEXT PAGE: MeasureWare/OVPA and DSI Software ... 2-37
2–24. TEXT PAGE: PerfView/OVPM ... 2-39
2–25. SLIDE: Network Performance Tools (Standard UNIX) ... 2-41
2–26. TEXT PAGE: netstat ... 2-42
2–27. TEXT PAGE: nfsstat ... 2-44
2–28. TEXT PAGE: ping ... 2-46
2–29. SLIDE: Network Performance Tools (HP-Specific) ... 2-48
2–30. TEXT PAGE: lanadmin ... 2-49


2–31. TEXT PAGE: lanscan ... 2-51
2–32. TEXT PAGE: nettune (HP-UX 10.x Only) ... 2-53
2–33. TEXT PAGE: ndd (HP-UX 11.x Only) ... 2-55
2–34. TEXT PAGE: NetMetrix (HP-UX 10.20 and 11.0 Only) ... 2-57
2–35. SLIDE: Performance Administrative Tools (Standard UNIX) ... 2-58
2–36. TEXT PAGE: ipcs, ipcrm ... 2-59
2–37. TEXT PAGE: nice, renice ... 2-61
2–38. SLIDE: Performance Administrative Tools (HP-Specific) ... 2-63
2–39. TEXT PAGE: getprivgrp, setprivgrp ... 2-64
2–40. TEXT PAGE: rtprio ... 2-66
2–41. TEXT PAGE: rtsched ... 2-67
2–42. TEXT PAGE: scsictl ... 2-69
2–43. TEXT PAGE: serialize ... 2-71
2–44. TEXT PAGE: fsadm ... 2-72
2–45. TEXT PAGE: getext, setext ... 2-74
2–46. TEXT PAGE: newfs, tunefs, vxtunefs ... 2-75
2–47. TEXT PAGE: Process Resource Manager (PRM) ... 2-77
2–48. TEXT PAGE: Work Load Manager (WLM) ... 2-78
2–49. TEXT PAGE: Web Quality of Service — WebQoS ... 2-79
2–50. SLIDE: System Configuration and Utilization Information (Standard UNIX) ... 2-80
2–51. TEXT PAGE: bdf, df ... 2-81
2–52. TEXT PAGE: mount ... 2-83
2–53. SLIDE: System Configuration and Utilization Information (HP-Specific) ... 2-84
2–54. TEXT PAGE: diskinfo ... 2-85
2–55. TEXT PAGE: dmesg ... 2-86
2–56. TEXT PAGE: ioscan ... 2-88
2–57. TEXT PAGE: vgdisplay, pvdisplay, lvdisplay ... 2-90
2–58. TEXT PAGE: swapinfo ... 2-92
2–59. TEXT PAGE: sysdef ... 2-93
2–60. TEXT PAGE: kmtune, kcweb ... 2-95
2–61. SLIDE: Application Profiling and Monitoring Tools (Standard UNIX) ... 2-96
2–62. TEXT PAGE: prof, gprof ... 2-97
2–63. TEXT PAGE: Application Response Measurement (ARM) Library Routines ... 2-98
2–64. SLIDE: Application Profiling and Monitoring Tools (HP-Specific) ... 2-99
2–65. TEXT PAGE: Transaction Tracker ... 2-100
2–66. TEXT PAGE: caliper — HP Performance Analyzer ... 2-101
2–67. SLIDE: Summary ... 2-103
2–68. LAB: Performance Tools Lab ... 2-104

Module 3  GlancePlus 3-1. SLIDE: This Is GlancePlus................................................................................................3-2 3-2. SLIDE: GlancePlus Pak Overview...................................................................................3-4 3-3. SLIDE: gpm and glance .................................................................................................3-6 3-4. SLIDE: glance — The Character Mode Interface ......................................................3-8 3-5. SLIDE: Looking at a glance Screen ..............................................................................3-11 3-6. SLIDE: gpm — The Graphical User Interface ..............................................................3-13 3-7. SLIDE: Process Information ..........................................................................................3-15 3-8. SLIDE: Adviser Components .........................................................................................3-17 3-9. SLIDE: adviser Bottleneck Syntax Example............................................................3-18 3-10. SLIDE: The parm File .....................................................................................................3-19


3–11. SLIDE: GlancePlus Data Flow ... 3-21
3–12. SLIDE: Key GlancePlus Usage Tips ... 3-23
3–13. SLIDE: Global, Application, and Process Data ... 3-24
3–14. SLIDE: Can't Solve What's Not a Problem ... 3-25
3–15. SLIDE: Metrics: "No Answers without Data" ... 3-26
3–16. SLIDE: Summary ... 3-27
3–17. SLIDE: HP GlancePlus Guided Tour ... 3-28
3–18. LAB: gpm and glance Walk-Through ... 3-29

Module 4  Process Management 4–1. SLIDE: The HP-UX Operating System............................................................................ 4-2 4–2. SLIDE: Virtual Address Process Space (PA-RISC) ....................................................... 4-4 4–3. SLIDE: Virtual Address Process Space (IA-64) ............................................................. 4-6 4–4. SLIDE: Physical Process Components........................................................................... 4-7 4–5. SLIDE: The Life Cycle of a Process ................................................................................ 4-9 4–6. SLIDE: Process States .................................................................................................... 4-11 4–7. SLIDE: CPU Scheduler................................................................................................... 4-14 4–8. SLIDE: Context Switching ............................................................................................. 4-16 4–9. SLIDE: Priority Queues .................................................................................................. 4-17 4–10. SLIDE: Nice Values......................................................................................................... 4-19 4–11. SLIDE: Parent-Child Process Relationship.................................................................. 4-20 4–12. SLIDE: glance — Process List.................................................................................... 4-21 4–13. SLIDE: glance — Individual Process......................................................................... 4-23 4–14. SLIDE: glance — Process Memory Regions ............................................................. 4-24 4–15. SLIDE: glance — Process Wait States....................................................................... 4-25 4–16. LAB: Process Management ............................................................................................ 4-26 Module 5  CPU Management 5–1. SLIDE: Processor Module................................................................................................ 5-2 5–2. SLIDE: Symmetric Multiprocessing................................................................................ 5-4 5–3. SLIDE: Cell Module .......................................................................................................... 5-5 5–4. SLIDE: Multi-Cell Processing .......................................................................................... 5-6 5–5. SLIDE: CPU Processor..................................................................................................... 5-8 5–6. SLIDE: CPU Cache ......................................................................................................... 5-11 5–7. SLIDE: TLB Cache .......................................................................................................... 5-12 5–8. SLIDE: TLB, Cache, and Memory ................................................................................. 5-14 5–9. SLIDE: HP-UX — Performance Optimized Page Sizes............................................... 5-16 5–10. SLIDE: CPU — Metrics to Monitor Systemwide......................................................... 5-19 5–11. SLIDE: CPU — Metrics to Monitor per Process ......................................................... 5-21 5–12. SLIDE: Activities that Utilize the CPU ......................................................................... 5-23 5–13. 
SLIDE: glance — CPU Report .................................................................................... 5-25 5–14. SLIDE: glance — CPU by Processor ......................................................................... 5-26 5–15. SLIDE: glance — Individual Process......................................................................... 5-27 5–16. SLIDE: glance — Global System Calls ..................................................................... 5-28 5–17. SLIDE: glance — System Calls by Process............................................................... 5-29 5–18. SLIDE: sar Command ................................................................................................... 5-30 5–19. SLIDE: timex Command .............................................................................................. 5-32 5–20. SLIDE: Tuning a CPU-Bound System — Hardware Solutions .................................. 5-33 5–21. SLIDE: Tuning a CPU-Bound System — Software Solutions.................................... 5-35 5–22. SLIDE: CPU Utilization and MP Systems..................................................................... 5-36


5–23. SLIDE: Processor Affinity ... 5-37
5–24. LAB: CPU Utilization, System Calls, and Context Switches ... 5-38
5–25. LAB: Identifying CPU Bottlenecks ... 5-40

Module 6  Memory Management 6–1. SLIDE: Memory Management ..........................................................................................6-2 6–2. SLIDE: Memory Management — Paging ........................................................................6-4 6–3. SLIDE: Paging and Process Deactivation.......................................................................6-5 6–4. SLIDE: The Buffer Cache .................................................................................................6-7 6–5. SLIDE: The syncer Daemon ..........................................................................................6-9 6–6. SLIDE: IPC Memory Allocation .....................................................................................6-10 6–7. SLIDE: Memory Metrics to Monitor — Systemwide ...................................................6-12 6–8. SLIDE: Memory Metrics to Monitor — per Process ...................................................6-14 6–9. SLIDE: Memory Monitoring vmstat Output...............................................................6-16 6–10. SLIDE: Memory Monitoring glance — Memory Report...........................................6-18 6–11. SLIDE: Memory Monitoring glance — Process List.................................................6-19 6–12. SLIDE: Memory Monitoring glance — Individual Process......................................6-20 6–13. SLIDE: Memory Monitoring glance — System Tables.............................................6-21 6–14. SLIDE: Tuning a Memory-Bound System — Hardware Solutions ............................6-23 6–15. SLIDE: Tuning a Memory-Bound System — Software Solutions ..............................6-24 6-16: SLIDE: PA-RISC Access Control ...................................................................................6-26 6–17. SLIDE: The serialize Command..............................................................................6-28 6–18. LAB: Memory Leaks ........................................................................................................6-30 Module 7  Swap Space Performance 7–1. SLIDE: Swap Space Management — Simple View ....................................................... 7-2 7–2. SLIDE: Swap Space — After a New Process Executes ............................................... 7-4 7–3. SLIDE: The swapinfo Command ................................................................................. 7-5 7–4. SLIDE: Swap Space Management — Realistic View.................................................... 7-7 7–5. SLIDE: Swap Space — After a New Process Executes ............................................... 7-8 7–6. SLIDE: Swap Space — When Memory Equals Data Swapped.................................. 7-10 7–7. SLIDE: Swap Space — When Swap Space Fills Up ................................................... 7-11 7–8. SLIDE: Pseudo Swap ..................................................................................................... 7-12 7–9. SLIDE: Total Swap Space Calculation — with Pseudo Swap................................... 7-14 7–10. SLIDE: Example Situation Using Pseudo Swap ......................................................... 7-16 7–11. SLIDE: Swap Priorities .................................................................................................. 7-17 7–12. SLIDE: Swap Chunks ..................................................................................................... 7-18 7–13. 
SLIDE: Swap Space Parameters ................................................................................... 7-19 7–14. SLIDE: Summary ............................................................................................................ 7-21 7–15. LAB: Monitoring Swap Space ....................................................................................... 7-22 Module 8  Disk Performance Issues 8–1. SLIDE: Disk Overview ......................................................................................................8-2 8–2. SLIDE: Disk I/O — Read Data Flow................................................................................8-4 8–3. SLIDE: Disk I/O — Write Data Flow (Synchronous) ....................................................8-6 8–4. SLIDE: Disk Metrics to Monitor — Systemwide ...........................................................8-8 8–5. SLIDE: Disk Metrics to Monitor — Per Process..........................................................8-10 8–6. SLIDE: Activities that Create a Large Amount of Disk I/O.........................................8-12 8–7. SLIDE: Disk I/O Monitoring sar –d Output...............................................................8-14 8–8. SLIDE: Disk I/O Monitoring sar –b Output...............................................................8-16 8–9. SLIDE: Disk I/O Monitoring glance — Disk Report......................................................8-18


8–10. SLIDE: Disk I/O Monitoring glance — Disk Device I/O ... 8-19
8–11. SLIDE: Disk I/O Monitoring glance — Logical Volume I/O ... 8-20
8–12. SLIDE: Disk I/O Monitoring glance — System Calls per Process ... 8-21
8–13. SLIDE: Tuning a Disk I/O-Bound System — Hardware Solutions ... 8-22
8–14. SLIDE: Tuning a Disk I/O-Bound System — Perform Asynchronous Meta-data I/O ... 8-24
8–15. SLIDE: Tuning a Disk I/O-Bound System — Load Balance across Disk Controllers ... 8-26
8–16. SLIDE: Tuning a Disk I/O-Bound System — Load Balance across Disk Drives ... 8-28
8–17. SLIDE: Tuning a Disk I/O-Bound System — Tune Buffer Cache ... 8-30
8–18. LAB: Disk Performance Issues ... 8-33

Module 9  HFS File System Performance 9–1. SLIDE: HFS File System Overview ................................................................................. 9-2 9–2. SLIDE: Inode Structure .................................................................................................... 9-5 9–3. SLIDE: Inode Data Block Pointers ................................................................................. 9-6 9–4. SLIDE: How Many Logical I/Os Does It Take to Access /etc/passwd? ................. 9-8 9–5. SLIDE: File System Blocks and Fragments ................................................................. 9-10 9–6. SLIDE: Creating a New File on a Full File System ..................................................... 9-13 9–7. SLIDE: HFS Metrics to Monitor — Systemwide ......................................................... 9-15 9–8. SLIDE: Activities that Create a Large Amount of File System I/O ............................ 9-17 9–9. SLIDE: HFS I/O Monitoring bdf Output ...................................................................... 9-18 9–10. SLIDE: HFS I/O Monitoring glance — File System I/O ........................................... 9-19 9–11. SLIDE: HFS I/O Monitoring glance — File Opens per Process.............................. 9-20 9–12. SLIDE: Tuning a HFS I/O-Bound System — Tune Configuration for Workload ..... 9-22 9–13. SLIDE: Tuning a HFS I/O-Bound System — Use Fast Links...................................... 9-25 9–14. LAB: HFS Performance Issues ...................................................................................... 9-27 Module 10  VxFS Performance Issues 10–1. SLIDE: Objectives ........................................................................................................... 10-2 10–2. SLIDE: JFS History and Version Review...................................................................... 10-5 10–3. SLIDE: JFS Extents......................................................................................................... 10-9 10–4. SLIDE: Extent Allocation Policies .............................................................................. 10-11 10–5. SLIDE: JFS Intent Log .................................................................................................. 10-13 10–6. SLIDE: Intent Log Data Flow....................................................................................... 10-16 10–7. SLIDE: Understand Your I/O Workload ..................................................................... 10-18 10–8. SLIDE: Performance Parameters ................................................................................ 10-20 10–9. SLIDE: Choosing a Block Size..................................................................................... 10-21 10–10. SLIDE: Choosing an Intent Log Size ........................................................................... 10-23 10–11. SLIDE: Intent Log Mount Options............................................................................... 10-25 10–12. SLIDE: Other JFS Mount Options ............................................................................... 10-27 10–13. SLIDE: JFS Mount Option: mincache=direct...................................................... 10-31 10-14. SLIDE: JFS Mount Option: mincache=tmpcache ................................................. 10-33 10–15. SLIDE: Kernel Tunables ............................................................................................... 10-35 10–16. 
SLIDE: Fragmentation.................................................................................................. 10-37 10–17. TEXT PAGE: Monitoring and Repairing File Fragmentation .................................. 10-40 10–18. SLIDE: Using setext .................................................................................................. 10-50 10–19. SLIDE: I/O Tunable Parameters .................................................................................. 10-52


10–20. SLIDE: vxtunefs Command for Tuning VxFS ... 10-54
10–21. SLIDE: /etc/vx/tunefstab Configuration ... 10-56
10–22. SLIDE: Taking Snapshots and Performance ... 10-58
10–23. LAB: JFS File System Tuning ... 10-60

Module 11  Network Performance 11–1. SLIDE: The OSI Model ....................................................................................................11-2 11–2. SLIDE: NFS Read/Write Data Flow ...............................................................................11-4 11–3. SLIDE: NFS on HP-UX with UDP ..................................................................................11-6 11–4. SLIDE: NFS on HP-UX with TCP...................................................................................11-7 11–5. SLIDE: biod on Client .....................................................................................................11-9 11–6. SLIDE: TELNET.............................................................................................................11-11 11–7. SLIDE: FTP.....................................................................................................................11-13 11–8. SLIDE: Metrics to Monitor — NFS..............................................................................11-15 11–9. SLIDE: Metrics to Monitor — Network ......................................................................11-18 11–10. SLIDE: Determining the NFS Workload .....................................................................11-20 11–11. SLIDE: NFS Monitoring — nfsstat Output ............................................................11-23 11–12. SLIDE: Network Monitoring — lanadmin Output ..................................................11-28 11–13. SLIDE: Network Monitoring — netstat –i Output ................................................11-31 11–14. SLIDE: glance — NFS Report ......................................................................................11-32 11–15. SLIDE: glance — NFS System Report.........................................................................11-33 11–16. SLIDE: glance — Network by Interface Report.........................................................11-34 11–17. SLIDE: Tuning NFS .......................................................................................................11-35 11–18. SLIDE: Tuning the Network .........................................................................................11-37 11–19. SLIDE: Tuning the Network (Continued)...................................................................11-39 11–20. LAB: Network Performance.........................................................................................11-41 Module 12  Tunable Kernel Parameters 12–1. SLIDE: Kernel Parameter Classes .................................................................................12-2 12–2. SLIDE: Tuning the Kernel...............................................................................................12-5 12–3. SLIDE: Kernel Parameter Categories............................................................................12-8 12–4. SLIDE: File System Kernel Parameters ........................................................................12-9 12–5. SLIDE: Message Queue Kernel Parameters ...............................................................12-11 12–6. SLIDE: Semaphore Kernel Parameters.......................................................................12-13 12–7. SLIDE: Shared Memory Kernel Parameters ...............................................................12-15 12–8. SLIDE: Process-Related Kernel Parameters ..............................................................12-17 12–9. 
SLIDE: Memory-Related Kernel Parameters..............................................................12-19 12–10. SLIDE: LVM-Related Kernel Parameters ....................................................................12-21 12–11. SLIDE: Networking-Related Kernel Parameters........................................................12-22 12–12. SLIDE: Miscellaneous Kernel Parameters..................................................................12-23 Module 13  Putting It All Together 13–1. SLIDE: Review of Bottleneck Characteristics .............................................................13-2 13–2. SLIDE: Performance Monitoring Flowchart ................................................................13-4 13–3. SLIDE: Review — Memory Bottlenecks .......................................................................13-6 13–4. SLIDE: Correcting Memory Bottlenecks ......................................................................13-7 13–5. SLIDE: Review — Disk Bottlenecks .............................................................................13-8 13–6. SLIDE: Correcting Disk Bottlenecks.............................................................................13-9 13–7. SLIDE: Review — CPU Bottlenecks ...........................................................................13-11 13–8. SLIDE: Correcting CPU Bottlenecks...........................................................................13-12 13–9. SLIDE: Final Review — Major Symptoms..................................................................13-13


Appendix A — Applying GlancePlus Data
A–1.  TEXT PAGE: Case Studies — Using GlancePlus ... A-2

Solutions


Overview

Course Description

This course introduces students to the various aspects of monitoring and tuning their systems. Students are taught which tools to use, what symptoms to look for, and what remedial actions to take. The course also covers HP GlancePlus/gpm and HP PerfRx. The course is designed to:

• Introduce the subject of performance and tuning.
• Describe how the system works.
• Identify what tools we can use to look at performance.
• Identify the symptoms we may encounter and what they indicate.

Course Goals

• To educate the students on HP-UX performance monitoring
• To enable them to identify bottlenecks and potential problems
• To learn the appropriate remedial actions to take

Student Performance Objectives

Module 1 — Introduction

• List characteristics of a system yielding good user response time.
• List characteristics of a system yielding high data throughput.
• List three generic areas most often analyzed for performance.
• List the four most common bottlenecks on a system.

Module 2 — Performance Tools

• Identify various performance tools available on HP-UX.
• Categorize each tool as either real time or data collection.
• List the major features of the performance tools.
• Compare and contrast the differences between the tools.

Module 3 — GlancePlus

• Compare GlancePlus with other performance monitoring/management tools.
• Start up the GlancePlus terminal interface (glance) and graphical user interface (gpm).


Module 4 — Process Management

• Describe the components of a process.
• Describe how a process executes, and identify its process states.
• Describe the CPU scheduler.
• Describe a context switch and the circumstances under which context switching occurs.
• Describe, in general, the HP-UX priority queues.

Module 5 — CPU Management

• Describe the components of the processor module.
• Describe how the TLB and CPU cache are used.
• List four CPU-related metrics.
• Identify how to monitor CPU activity.
• Discuss how best to use the performance tools to diagnose CPU problems.
• Specify appropriate corrections for CPU bottlenecks.

Module 6 — Memory Management

• Describe how the HP-UX operating system performs memory management.
• Describe the main performance issues that involve memory management.
• Describe the UNIX buffer cache.
• Describe the sync process.
• Identify the symptoms of a memory bottleneck.
• Identify global and process memory metrics.
• Use performance tools to diagnose memory problems.
• Specify appropriate corrections for memory bottlenecks.
• Describe the function of the serialize command.


Module 7 — Swap Space Performance

• Describe the difference between swap usage and swap reservation.
• Interpret the output of the swapinfo command.
• Define and configure pseudo swap.
• Define and configure swap space priorities.
• Define and configure swchunk and maxswapchunks.

Module 8 — Disk Performance Issues

• List three ways disk space can be used.
• List disk device files.
• Identify disk bottlenecks.
• Identify kernel system parameters.

Module 9 — File System Performance

• List three ways file systems are used.
• List basic file system data structures.
• Identify file system bottlenecks.
• Identify kernel system parameters.

Module 10 — VxFS Performance

• Understand JFS structure and version differences.
• Explain how to enhance JFS performance.
• Set block sizes to improve performance.
• Set intent log size and rules to improve performance.
• Understand and manipulate synchronous and asynchronous I/O.
• Identify JFS tuning parameters.
• Understand and control fragmentation issues.
• Evaluate the overhead of online backup snapshots.


Module 11 — NFS Performance

• List factors directly related to network performance.
• Describe how to determine network workloads (server and client).
• Evaluate UDP and TCP transport options.
• Identify a network bottleneck.
• List possible solutions for a network performance problem.

Module 12 — Tunable Kernel Parameters

• Identify which tunable parameters belong to which category.
• Identify tunable kernel parameters that could impact performance.
• Tune both static and dynamic tunable parameters.

Module 13 — Putting It All Together

• Identify and characterize some network performance problems.
• List some useful tools for measuring network performance problems and state how they might be applied.
• Identify bottlenecks on other common system devices not associated directly with the CPU, disk, or memory.


Student Profile and Prerequisites

The student should be well versed in UNIX and able to perform the usual duties of a system administrator. Students should have completed HP-UX System and Network Administration I and HP-UX System and Network Administration II prior to attending this course, or have equivalent experience on another manufacturer's equipment.

NOTE: The course Inside HP-UX (H5081S) is not a formal prerequisite for HP-UX Performance and Tuning, but it should be considered a co-requisite for the serious HP-UX performance specialist. (The suggested order is Inside HP-UX, then HP-UX Performance and Tuning, but because the two courses have a synergistic relationship, the order is not absolute.)

Curriculum Path

Fundamentals of UNIX (H51434S)
        |
HP-UX System and Network Administration I (H3064S)
        |
HP-UX System and Network Administration II (H3065S)

    OR

HP-UX Administration for the Experienced UNIX Administrator (H5875S)
        |
Inside HP-UX (H5081S) (recommended)
        |
HP-UX Performance and Tuning (H4262S)


Agenda

The following agenda is only a guideline; the instructor may vary it if desired. The course runs until the afternoon of the fifth day. The last hour or so can be used to demonstrate more fully the performance offerings, such as HP PRM and HP PerfView.

Day 1
1 — Introduction
2 — Performance Tools

Day 2
3 — GlancePlus
4 — Process Management
5 — CPU Management

Day 3
6 — Memory Management
7 — Swap Space Performance
8 — Disk Performance Issues

Day 4
9 — File System Performance
10 — VxFS Performance
11 — NFS Performance

Day 5
12 — Tunable Kernel Parameters
13 — Putting It All Together


Module 1  Introduction Objectives Upon completion of this module, you will be able to do the following: •

List characteristics of a system yielding good user response time.



List characteristics of a system yielding high data throughput.



List three generic areas most often analyzed for performance.



List the four most common bottlenecks on a system.


1–1. SLIDE: Welcome to HP-UX Performance and Tuning

Student Notes

Welcome to the HP-UX Performance and Tuning course. This course is designed to provide a high-level understanding of common performance problems and common bottlenecks found on an HP-UX system. The course uses HP performance tools to view the activity currently on the system. While many tools can be used to analyze this activity, the course primarily utilizes the glance tool, which is specifically tailored for HP-UX systems.


1–2. SLIDE: Course Outline

• Introduction to Performance
• Performance Tools – Overview
• GlancePlus
• Process Management
• CPU Management
• Memory Management
• Swap Space Performance Issues
• Disk and File System Performance Issues
• HFS Performance Issues
• VxFS Performance Issues
• Network Performance Issues
• Tuning the Kernel
• Putting It All Together – Performance Recap

Student Notes

Topics covered in this course include:

• System Internals: This module includes information on how the system components (CPU, memory, file systems, and network) function and interact with each other. Just as a mechanic cannot tune a car's engine without understanding how it works, a system administrator cannot tune system resources properly without a good understanding of how those resources work.

• Performance Tools: Many performance tools are available with HP-UX. Some tools come as standard equipment; others are add-on products. Some tools provide real-time monitoring; others perform data collection. We will review all of the tools.

• Specialty Areas: These modules cover areas of special interest to customers in particular types of environments. Three specialty areas are covered at a high level: NFS and networking, databases, and application profiling.


1–3. SLIDE: System Performance

[Slide graphic: users and management both interact with a computer system; users measure it by response time, management by system throughput.]

Student Notes

Different computer systems have different requirements. Some systems may need to provide quick response time; other systems may need to provide a high level of data throughput.

Response Time — User's Perspective

Response time is the time between the instant the user presses the return key or the mouse button and the receipt of a response from the computer. Users often use response time as a criterion of system performance.

A system that yields good response time is typically not 100% utilized. Often there are free CPU cycles, along with low utilization of disk drives, and no swapping or paging. Because the system resources are not constantly utilized, the resources are usually available immediately when a user executes a task, yielding quick response time. Users want low utilization of resources in order to experience optimal response time.


Throughput — IT or MIS Management Perspective

Throughput is the number of transactions accomplished in a fixed time period. Management is often interested in how many compilations or reports can be generated in a specific amount of time. Many systems use benchmarks (like SPECmarks or TPC), which measure, in general, how many operations or transactions a system can perform per minute.

A system that yields high throughput is typically 100% utilized. There are no free CPU cycles; there are always jobs in the CPU run queue; the disk drives are constantly utilized; and there is often pressure on memory. Because the system resources are constantly in use, the amount of work produced typically yields good system throughput. Management wants high utilization of resources to maximize system throughput.

Question: Is it possible to get both good response time and high system throughput?


1–4. SLIDE: Areas of Performance Problems

[Slide graphic: a layered stack with the application on top of the operating system, which runs on top of the hardware.]

Student Notes

This slide shows a hierarchical view of a computer system. The base of a computer is its hardware. The operating system is built on top of the hardware (that is, the operating system depends on the hardware in order to run). The application programs are built on top of the operating system (OS). All three of these areas can have performance problems.

Hardware

The hardware moves data within the computer system. If the hardware is slow then, no matter how finely tuned the OS and applications are, the system will still be slow. Ultimately, the system is only as fast as the hardware can move the data.

Items affecting the speed of the hardware include CPU clock speed, amount and speed of memory, type of disk controller (Fast/Wide SCSI or Single-Ended SCSI), and type of network card (FDDI or Ethernet).


Operating System

The operating system runs on top of the hardware and controls how the hardware is utilized. The operating system decides which process runs on the CPU, how much memory to allocate for the buffer cache, whether I/O to the disks is performed synchronously or asynchronously, and so on. If the operating system is not configured properly, the performance of the system will be poor.

Items affecting how the operating system performs include process priorities and their nice values, the tunable OS parameters, the mount options used for file systems, and the configuration of network and swap devices.
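Each of the configuration items just listed can be inspected directly from the shell. The commands below are a minimal sketch using standard HP-UX commands (kmtune applies to HP-UX 11.x; output varies by system):

   # kmtune          list current tunable kernel parameter settings
   # mount -v        show the mount options in effect for each file system
   # ps -efl         show process priorities (PRI) and nice values (NI)
   # swapinfo -t     show swap device configuration and usage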

Applications

The applications run on top of the operating system. Application programs include software such as database management systems, electronic design applications, and accounting-based applications. The performance of an application depends on the operating system and hardware, but also on how the application is coded and how the application itself is configured.

Items affecting the performance of the application include how the application data is laid out on disk, how many users are currently trying to use the application, and how efficiently the application uses the system's resources.

Question: In which of these three areas are most performance problems located?


1–5. SLIDE: Performance Bottlenecks

[Slide graphic: processes flow through the CPU run queue to the CPU and through the disk I/O queue to the disk, and also draw on memory and the network.]

System Bottleneck Areas
• CPU
• Memory
• Disk
• Network

Student Notes

Poor performance often results because a given resource cannot handle the demand being placed upon it. When the demand for a resource exceeds the availability of that resource, a bottleneck exists for the resource. Common resource bottlenecks are:

CPU      A CPU bottleneck occurs when the number of processes wanting to execute is consistently more than the CPU can handle. Basic symptoms of a CPU bottleneck are high CPU utilization and, consistently, multiple jobs in the CPU run queue.

Memory   A memory bottleneck occurs when the processes on the system will not all fit into memory at one time (that is, there are more processes than memory can hold). When this happens, pages of memory must be copied out to the swap partition on disk to free space in memory. Basic symptoms of a memory bottleneck are high memory utilization and consistent I/O activity to the swap partition on disk.


Disk     A disk bottleneck occurs when the amount of I/O to a specific disk is more than the disk can handle. Basic symptoms of a disk bottleneck include high utilization of a disk drive and multiple I/O requests consistently in the disk I/O queue.

Network  A network bottleneck occurs when the amount of time needed to perform network-based transactions is consistently greater than expected. Basic symptoms of a network bottleneck include network collisions, network request timeouts, and packet retransmissions.
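Each of the four bottleneck areas can be given a quick spot check from the command line. The commands below are a sketch; the thresholds noted are common rules of thumb rather than definitions from this course, and each tool is covered in detail in later modules.

   # sar -q 5 3      CPU: is runq-sz consistently above the number of CPUs?
   # vmstat 5 3      Memory: are page-outs (the po column) consistently non-zero?
   # sar -d 5 3      Disk: is %busy high while avque stays above 1?
   # netstat -i      Network: are collisions or input/output errors climbing?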


1–6. SLIDE: Baseline

[Slide graphic: response time for the same workload, shown as the best possible (baseline) response time and then the response time with five, ten, and fifteen users; response time grows as users are added.]

Student Notes

To quantify good versus poor performance, a customer needs to know the best possible response time for a given workload. The procedure for determining this best possible response time is known as baselining.

To calculate the baseline (the best possible response time) for a particular workload, the workload must be run when there is no other activity on the system. The intent is that when all resources are free, the workload will execute as quickly as possible, yielding the best possible response time.

Once the baseline value is known, a relative measure is available for determining how poorly the workload is performing. For example, assume a baseline value of 5 seconds for the workload shown on the slide. When five users are on the system, the response time for the workload increases to 7 seconds. The relative comparison shows the workload taking 40% (or 2 seconds) more time when five users are on the system. We have just quantified the relative effect of having five users on the system for this particular workload.
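The relative comparison above is simple arithmetic. As a sketch, the calculation can be scripted so the same measure is produced after every timing run (the 5- and 7-second figures are just the example values from this page; substitute your own timex results):

   # awk -v base=5 -v loaded=7 'BEGIN {
         printf "degradation: %.0f%% (%.1f seconds slower)\n",
                (loaded - base) / base * 100, loaded - base
     }'
   degradation: 40% (2.0 seconds slower)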


The slide illustrates the typical behavior for a given workload: as more users concurrently utilize the system, the response time for the workload gets worse.

NOTE: In this class we will run baseline metrics using simplified "workload" simulation programs. Results will vary greatly with your applications.


1–7. SLIDE: Queuing Theory of Performance

[Slide graph: response time (from the baseline X up to 4X) plotted against percent utilization (25, 50, 75, 100); the curve rises slowly at first, then steeply as utilization approaches 100%.]

Student Notes

The queuing theory of performance states that the average response time of a given resource is directly linked to the average utilization of that resource.

The slide shows a baseline value of X seconds for a given resource. According to the queuing theory, users experience this response time while the resource has an average utilization of 0 to 25%. When the average utilization of the resource reaches 75%, the average response time doubles. As the average utilization approaches 100%, the average response time quadruples. The bottom line: as the average utilization of a resource increases, the average response time gets worse and worse.

Why does the average response time become poor as the average utilization of a resource increases?
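The course does not attach a formula to this curve, but a common first-order model for a single service center is the M/M/1 queuing result R = S / (1 - U), where S is the unloaded (baseline) service time and U is the average utilization. The sketch below assumes that model; note that it doubles response time at 50% utilization and quadruples it at 75%, so the slide's curve should be read as schematic rather than exact.

   # awk 'BEGIN {
         n = split("0.25 0.50 0.75 0.90 0.95", u, " ")
         for (i = 1; i <= n; i++)
             printf "utilization %3.0f%%  ->  response time %4.1fX\n",
                    u[i] * 100, 1 / (1 - u[i])
     }'
   utilization  25%  ->  response time  1.3X
   utilization  50%  ->  response time  2.0X
   utilization  75%  ->  response time  4.0X
   utilization  90%  ->  response time 10.0X
   utilization  95%  ->  response time 20.0X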


1–8. SLIDE: How Long Is the Line?

[Slide graphic: a line of jobs waiting for a system resource; the line starts behind the job currently being serviced.]

Student Notes

The reason the average response time gets so poor as average resource utilization increases is that the line waiting for the resource gets longer. As resource utilization increases, the number of jobs waiting on the resource also increases. When poor performance is experienced, it is most often because the queue has grown long: jobs spend most of their time waiting in line for the resource (CPU, memory, network, or disk), as opposed to being serviced by the resource.

The slide shows four people waiting in line for a resource (think of a line in a bank with one teller). If it takes 5 minutes to service one customer, the fourth person in line will wait 15 minutes before reaching the resource. Adding the 5 minutes needed to service the request brings the total response time to 20 minutes for the last person in line, as opposed to 5 minutes if the line had been empty. There is also some overhead from "switching" from one customer to the next, though it is minimal in this example because the customers are handled serially.
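With strictly serial service, the teller arithmetic generalizes: the person in position n waits (n - 1) service times and completes after n service times. A sketch using the numbers from this example (5-minute service, four people in line):

   # awk 'BEGIN {
         service = 5                      # minutes per customer
         for (pos = 1; pos <= 4; pos++)
             printf "person %d: waits %2d min, done after %2d min\n",
                    pos, (pos - 1) * service, pos * service
     }'
   person 1: waits  0 min, done after  5 min
   person 2: waits  5 min, done after 10 min
   person 3: waits 10 min, done after 15 min
   person 4: waits 15 min, done after 20 min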


1–9. SLIDE: Example of Queuing Theory

# sar -d 5 5

15:31:55   device   %busy   avque   r+w/s   blks/s   avwait   avserv
15:32:00   c0t6d0      81     3.4      31      248    59.31    21.20
           c0t5d0       5      .5       1       32     0.65    23.58
15:32:05   c0t6d0      84     3.5      34      245    71.64    24.04
           c0t5d0       3      .5       2        8     0.25    17.93
15:32:10   c0t6d0      68     2.9      31      248    51.36    18.55
           c0t5d0       1      .5       0        6     0.48    19.18
15:32:15   c0t6d0      71     2.7      30       30    62.88    24.16
           c0t5d0       0      .5       1        3     0.65    29.25
15:32:20   c0t6d0      69     2.7      29       29    61.70    24.14
           c0t5d0       0      .5       1        3     0.65    29.25

Student Notes

The slide above provides an example of the queuing theory applied to disk drives, as reported by the sar tool. The four fields to focus on are:

%busy    The percentage utilization of each disk
avque    The average number of I/O requests in the queue for that disk
avwait   The average time a request spends waiting in that disk's queue
avserv   The average time taken to service an I/O request (not including the wait time)

Analyzing the data shows a baseline of around 20 milliseconds to service an I/O request (the approximate average of the avserv column). The first line item shows a disk that is 81% utilized. Its total response time is the average wait plus the average service time, approximately 80 milliseconds, or four times the 20-millisecond baseline. In fact, in every snapshot the busy disk's requests spend longer waiting in the queue than being serviced. To see why the wait time is so high, look at the avque column: the queue is largest when the device is most busy. This is the basic concept of the performance queuing theory.
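To make this analysis repeatable, the avwait and avserv columns can be summed per sample with a small pipeline. This is a rough sketch: the field positions assume the sar -d layout shown on the slide (continuation lines omit the timestamp) and may differ between HP-UX releases.

   # sar -d 5 5 | awk '
       $1 ~ /^c[0-9]/ || $2 ~ /^c[0-9]/ {
           o = ($1 ~ /^c[0-9]/) ? 0 : 1   # offset of 1 when a timestamp is present
           printf "%-8s %3d%% busy, response %6.2f ms\n",
                  $(1+o), $(2+o), $(6+o) + $(7+o)
       }'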


1–10. SLIDE: Summary

Summary

•  Objective for the system:
   –  Provide fast response time to users, or
   –  Maximize throughput of the system
•  Three performance problem areas:
   –  Hardware
   –  Operating System
   –  Application
•  Performance bottlenecks:
   –  CPU
   –  Disk
   –  Memory
   –  Network
•  Need for baselines
•  Performance queuing theory

Student Notes

To summarize this module, systems are tuned for response time or for throughput. This class focuses on tuning for the best possible response time. Areas that affect response time are the speed of the hardware, the configuration of the operating system, and the configuration of the application. This class focuses on the configuration of the operating system. Common bottlenecks with computer systems include CPU, memory, disk, and network. This class discusses all four bottlenecks. Baselines are an important measurement tool for quantifying performance. In the lab for this module, the student will establish CPU and disk I/O baselines. Finally, the queuing theory of performance states that the average response time increases as the average utilization of a resource increases. This is an important concept, which will be revisited throughout this course.


1–11. LAB: Establishing a Baseline

Directions

The following lab exercise establishes baselines for three CPU-bound applications and one disk-bound application. The objective is to time how long these applications take when there is no activity on the system. The same applications will be executed later in the course when other bottleneck activity is present, and the impact of those bottlenecks on user response time will be measured through these applications.

1. Change directory to /home/h4262/baseline.

    # cd /home/h4262/baseline

2. Compile the three C programs long, med, and short by running the BUILD script.

    # ./BUILD

3. Time the execution of the long program. Make sure there is no activity on the system.

    # timex ./long

   Record Execution Time    real:         user:         sys:

4. Time the execution of the med program. Make sure there is no activity on the system.

    # timex ./med

   Record Execution Time    real:         user:         sys:

5. Time the execution of the short program. Make sure there is no activity on the system.

    # timex ./short

   Record Execution Time    real:         user:         sys:

6. Time the execution of the diskread program.

    # timex ./diskread

   Record Execution Time    real:         user:         sys:

7. For the long, med, and short programs, the real time is (approximately) the sum of the user and sys times. This is not the case with diskread. Explain why.
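A quick way to see the distinction behind this question (an illustrative sketch, not part of the lab; sleep is used because, like diskread, it blocks without consuming CPU):

    # timex sleep 5

Here real will be about 5 seconds while user and sys remain near zero, because the process spends its time blocked rather than executing on the CPU.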


1–12. LAB: Verifying the Performance Queuing Theory

Directions

The performance queuing theory states that as the number of jobs in a queue increases, so does the response time of the jobs waiting to use that resource. This lab uses the short program compiled from /home/h4262/baseline/prime_short.c.

1. In terminal window 1, monitor the CPU queue with the sar command.

    # sar -q 5 200

2. In a second terminal window, time how long it takes for the short program to execute.

    # timex ./short &

   How long did the program take to execute? _________________
   How does this compare to the baseline measurement from earlier? _____
   What is the CPU queue size? _______

3. Time how long it takes for three short programs to execute.

    # timex ./short & timex ./short & timex ./short &

   How long did the slowest program take to execute? ___________________
   How did the CPU queue size change from step 2? ___________________

4. Time how long it takes for five short programs to execute (a loop-based way to launch the copies is sketched after question 6).

    # timex ./short & timex ./short & timex ./short & \
      timex ./short & timex ./short &

   How long did the slowest program take to execute? _____________________
   How did the CPU queue size change from step 3? _____________________

5. Is the relationship between elapsed execution (real) time and the number of running programs linear?

6. Comment on the overhead of switching from one process to another.
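When repeating step 4 with larger counts, launching the copies from a loop avoids typing errors. A minimal sketch, assuming the POSIX shell:

    # for i in 1 2 3 4 5
    > do
    >     timex ./short &
    > done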


Module 2 – Performance Tools

Objectives

Upon completion of this module, you will be able to do the following:

•  Identify various performance tools available on HP-UX.
•  Categorize each tool as either real time or data collection.
•  List the major features of the performance tools.
•  Compare and contrast the differences between the tools.


2–1. SLIDE: HP-UX Performance Tools

HP-UX Performance Tools

Student Notes

Many performance tools are available, each for a different purpose. The HP-UX operating system offers more than 50 performance-related tools. Some tools provide real-time performance information, such as how busy the CPU is right now. Other tools collect data in the background and maintain a history of performance information. This module addresses these tools and the different functions they perform.


2–2. SLIDE: HP-UX Performance Tools (Continued)

HP-UX Performance Tools

Objectives:
•  Identify the various performance tools available on HP-UX
•  Demonstrate their mechanics
•  Discuss their features
•  Compare and contrast the differences between the tools

Student Notes

The objective of this module is to highlight the performance tools available with HP-UX, to categorize them by function, and to describe how each tool is used. The module is intended to be a quick reference of performance tools, which the student can refer to when selecting a tool for a specific task.

NOTE: This module does not discuss how to interpret the output of the tools. Interpretation of the metrics is provided in later modules.


2–3. SLIDE: Sources of Tools

Sources of Tools

•  Standard tools
   –  Tools found on UNIX systems, including HP-UX
   –  Tools frequently found on other UNIX systems
•  HP-UX-specific tools
   –  Tools found only on HP-UX
•  Optional tools
   –  Tools licensed and sold separately (generally available only on HP-UX)

Student Notes

Three types of tools are presented in this module:

•  Standard Tools

   Standard tools are those frequently found on many UNIX systems, including HP-UX. The advantage of the standard tools is that their results can be compared with those collected on other UNIX platforms. This provides an "apples to apples" comparison, which is desirable when comparing systems. The output from these standard tools (and some of their options) may vary slightly among UNIX systems. In addition, differences between the various UNIX implementations can affect the reliability of the metrics output by the tools. Therefore, be careful to check the results with other tools, or seek help, before basing important tuning decisions on the value of a single metric.

•  HP-Specific Tools

   HP-specific tools are found only on the HP-UX operating system. They are often tailored specifically to HP-UX implementation details and are generally not found on other UNIX implementations, whose internals differ from HP's. Some of the HP-specific tools come with the base OS; others are purchased as optional tools.

•  Optional Tools

   Optional tools are added to the operating system in addition to the standard tools. Some, such as the HP-PAK (Programmers Analysis Kit), may be included with add-on software, such as compilers for HP-UX. Others, like GlancePlus, PerfView, MeasureWare, NetMetrix, PRM (Process Resource Manager), and WLM (Work Load Manager), are purchased individually or in small bundles (the GlancePlus Pak also includes a MeasureWare agent). Optional tools are typically licensed from HP. They offer many advantages over the standard tools, including:

   –  ease of use
   –  accuracy
   –  granularity
   –  low overhead
   –  additional metrics


2–4. SLIDE: Types of Tools

Types of Tools

•  Run-Time Monitoring
•  Data Collection Performance
•  Network Monitoring
•  Performance Administration
•  System Configuration and Utilization
•  Application Profiling and Monitoring

Student Notes

The tools covered in this section fall into six main categories:

•  Run-Time Monitoring Tools

   These tools report the performance of the system now. The information is current and provides a real-time perspective on the state of the system at the present moment.

•  Data Collection Performance Tools

   These tools collect performance data in the background, summarize or average the data into a summary record, and log the summary record to one or more files on disk. They do not typically provide real-time data.

•  Network Monitoring Tools

   These tools monitor performance, status, and packet errors on the network. They include both monitoring and configuration tools related to network management.

•  Performance Administrative Tools

   A system administrator can use these tools to manage the performance of a system. They typically do not report any data, but allow the current configuration of the system (and its components) to be changed to help improve performance.

•  System Configuration and Utilization Information Tools

   These tools report current system configurations (such as LVM and file systems). They also report resource utilization statistics, such as disk and file system space and the number of processes.

•  Application Profiling and Monitoring Tools

   These tools provide in-depth analysis of the behavior of a program. They monitor and trace the execution of a process, and report the resources used and calls made during its execution.


2–5. SLIDE: Criteria for Comparing the Tools

Criteria for Comparing the Tools

•  Source of data
•  Scope
•  Additional cost versus no cost
•  Intrusiveness
•  Accuracy
•  Ease of use
•  Portability
•  Metrics available
•  Data collection and storage
•  Permissions required

Student Notes

Each tool has strengths and weaknesses, advantages and disadvantages, and unique features. Items to consider when selecting a tool include:

Source of Data   The collected data can come from a variety of sources, including the kernel, an application, or a specific daemon (such as the midaemon).

Scope            The scope determines the level of detail provided by the tool. Most of the standard tools do not show process-level metrics. For example, they display global disk I/O rates, but do not show which process is generating the I/O or the disk on which the I/O is concentrated.

Cost             The cost sometimes determines whether a tool is an option. Many of the HP-specific tools have additional costs associated with them. (Many of these tools have evaluation copies available for a trial period.)

Intrusiveness    The intrusiveness relates to the overhead associated with running the tool; some tools carry significant overhead of their own. A large user community running top, for example, may generate a large amount of "monitoring" overhead on the system. Another example is the ps command. It has little impact on most systems due to the low frequency at which it is executed; while it runs, however, ps places fairly high overhead on the system.

Accuracy         The accuracy of the tool relates to the reliability of the data being reported. Many standard UNIX tools, like vmstat and sar, have been ported from other UNIX systems. The registers that they monitor may not always correspond to the registers that the kernel updates.

Others           Other factors can have a significant impact on the tool you decide to use, including familiarity, the metrics available, the permissions required, and portability.

As the tools are presented in the upcoming pages, many of these items will be addressed.


2–6. SLIDE: Data Sources

Data Sources

(Slide diagram: kernel instrumentation trace buffers feed the midaemon, which supplies data through a shared memory segment and the measurement interface library to glance and scopeux; scopeux writes logfiles read by the utility, extract, and pv tools. Standard tools such as iostat, sar, ps, and vmstat read kernel memory via /dev/kmem or pstat().)

Student Notes

The standard tools read information from the UNIX counters and registers maintained in kernel memory (accessible via the /dev/kmem device file and the pstat() system call). These counters and registers are updated 10 times a second as a standard part of most UNIX system implementations. The data in the counters and registers is generally adequate for most performance jobs, but does not provide enough detail when in-depth tuning is needed.

The optional tools for HP-UX use an additional source called kernel instrumentation (KI). The KI interface provides information beyond the UNIX kernel counters and registers, gathering performance data on a system call basis, with every system call generated by every process being traced. The KI interface uses a proprietary measurement interface library to derive the additional metrics. These tools are frequently revised and updated to provide the highest levels of accuracy with the lowest possible overhead. The optional tools, such as Glance and MeasureWare, are KI-based when running on HP-UX systems, although they are available for other vendors' systems as well.

Additional information about KI-based tools (also known as resource and performance management (RPM) tools) can be obtained from the RPM web site at: www.hp.com/go/rpm


2–7. SLIDE: Performance Monitoring Tools (Standard UNIX)

Performance Monitoring Tools (Standard UNIX)

                Global Metrics   Process Details   Alarming Capability
    iostat      Yes              No                No
    ps          No               Yes               No
    sar         Yes              No                No
    time        No               Some              No
    timex       Some             Some              No
    top         Yes              Some              No
    uptime, w   Some             Some              No
    vmstat      Yes              No                No

Student Notes

The slide shows the run-time performance monitoring tools included with HP-UX. These tools provide current information about the performance of the system, and they are standard UNIX performance tools found on most other UNIX implementations.

The Global Metrics column indicates whether the tool shows aggregate resource utilization without differentiating between specific resources. The Process Details column indicates whether the tool shows the resources being used by a single PID. The Alarming Capability column indicates whether the tool can send an alarm when one of the metrics exceeds a user-defined threshold.


2–8. TEXT PAGE: iostat

The iostat command reports I/O statistics for each active disk on the system.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page
Interval:         >= 1 second
Data Source:      Kernel registers/counters
Type of Data:     Global
Metrics:          Physical Disk I/O
Logging:          Standard output device
Overhead:         Varies, depending on the output interval
Unique Features:  Terminal I/O
Full Pathname:    /usr/bin/iostat
Pros and Cons:    + statistics by physical disk drive
                  - limited statistics
                  - poorly documented and cryptic headings

Syntax

    iostat [-t] [interval [count]]

    -t          Report terminal statistics as well as disk statistics
    interval    Display successive summaries at this frequency
    count       Repeat the summaries this number of times

Key Metrics

The iostat metrics include:

    bps     Blocks (kilobytes) transferred per second
    sps     Number of seeks per second
    msps    Average milliseconds per seek

With the advent of new disk technologies, such as data striping, where a single data transfer is spread across several disks, the average milliseconds per seek becomes impossible to compute accurately. At best it is only an approximation, varying greatly based on several dynamic system conditions. For this reason, and to maintain backward compatibility, the milliseconds per seek (msps) field is set to the value 1.0.

Examples

    # iostat 5 2
      device     bps     sps    msps
      c0t6d0       0     0.0     1.0
      c0t6d0    1100    34.6     1.0

    # iostat -t 5 1
          tty               cpu
      tin  tout     us  ni  sy  id
        0     0      2   0   1  98

      device     bps     sps    msps
      c0t6d0       0     0.0     1.0


2–9. TEXT PAGE: ps

The ps command displays information about selected processes running on the system. The command has many options for reducing the amount of output.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page
Interval:         on demand
Data Source:      in-core process table
Type of Data:     per process
Metrics:          state, priority, nice values, PIDs, times, ...
Logging:          Standard output device
Overhead:         Varies, depending on the number of processes
Unique Features:  Wait channel and run queue of processes
Full Pathname:    /usr/bin/ps
Pros and Cons:    + familiarity
                  + options for altering output
                  - minimal information
                  - no averaging or summarization (i.e., no global metrics)

Syntax

    ps [-aAcdefHjlP] [-C cmdlist] [-g grplist] [-G gidlist] [-n namelist]
       [-o format] [-R prmgrplist] [-s sidlist] [-t termlist] [-u uidlist]
       [-U uidlist]

Key Metrics

The ps metrics include:

    ADDR     The memory address of the process, if resident; otherwise, the disk address.
    C        Recent processor utilization, used for CPU scheduling (0-255).
    F        Flags associated with the process (octal, additive):
                 0    Process is on the swap device
                 1    Process is in core memory
                 2    Process is a system process
                 4    Process is locked in memory
                 (and many more)
    NI       The nice value for the process; used in priority computation.
    PPID     The process ID number of the parent process.
    PID      The process ID number of this process.
    PRI      The priority of the process.
    S        The state of the process:
                 I    Process is being created (very rarely seen)
                 S    Process is sleeping
                 R    Process is currently runnable
                 T    Process is stopped (rare)
                 Z    Process is terminated (a.k.a. zombie process)
    STIME    Starting time of the process.
    SZ       The size in 4-KB memory pages.
    TIME     The cumulative execution time of the process.
    TTY      The controlling terminal for the process.
    WCHAN    The address of a structure representing the event or resource for which the process is waiting or sleeping.

Example

    # ps -fu daemon
         UID   PID  PPID  C    STIME TTY   TIME COMMAND
      daemon  1171  1170  0 13:03:42 ?     3:10 /usr/bin/X11/X :0
      daemon  1565  1171  0 17:47:47 ?     0:00 pexd /tmp/to_pexd_1171.2 /dev/ttyp2

    # ps -lu daemon
      F S  UID   PID  PPID  C PRI NI     ADDR   SZ   WCHAN TTY   TIME COMD
      1 S    1  1171  1170  1 154 20   dbea00  697  3ace9c ?     3:10 X
      1 S    1  1565  1171  0 154 20  10e6900  115  3ace9c ?     0:00 pexd
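Because ps performs no sorting or summarization of its own, its output is often piped through standard filters. The -o option shown in the syntax above selects columns (on HP-UX it is honored when the UNIX95/XPG4 environment variable is set); the following sketch, with an illustrative column list, approximates a one-shot "top CPU consumers" report:

    # UNIX95= ps -e -o pcpu,pid,user,args | sort -rn | head -10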


2–10. TEXT PAGE: sar

The sar command collects and reports on many different system activities (and system areas), including CPU, buffer cache, disk, and others. Related commands include sadc, sa1, and sa2; these provide the data collection functionality of sar and are addressed with the data collection commands.

Tool Source:      Standard UNIX (System V)
Documentation:    man page and kernel source
Interval:         >= 1 second
Data Source:      /dev/kmem registers/counters
Type of Data:     Global
Metrics:          CPU, Disk, and Kernel resources
Logging:          Standard output device, or file on disk
Overhead:         Varies, depending on the output interval
Unique Features:  Disk I/O wait time, kernel table overflows, buffer cache hit ratios
Full Pathname:    /usr/sbin/sar
Pros and Cons:    + familiarity
                  + performs both real-time and data collection functions
                  - no per-process information
                  - no paging information; only designed for swapping (no longer done on HP-UX)

Syntax

    sar [-ubdycwaqvmpAMSP] [-o file] t [n]

Metric-related options:

    -u       CPU utilization
    -q       Run queue and swap queue lengths and utilization
    -b       Buffer cache stats
    -d       Disk utilization
    -y       TTY utilization
    -c       System call rates
    -w       Swap activity
    -v       Kernel table utilization
    -m       Semaphore and message queue utilization
    -a       File access system routine utilization
    -A       Everything!
    -M       Per-processor breakdown (used with -u and/or -q)
    -P/-p    Per-processor-set breakdown (used with -Mu and/or -Mq)

Key Metrics

The sar command has many metrics. Below are some sample metrics based on the CPU and disk reports.

CPU Report (-u)

The CPU report displays the utilization of the CPU and the percentage of time spent in the different modes:

    %usr     Percentage of time the system spent in user mode
    %sys     Percentage of time the system spent in system mode
    %wio     Percentage of time processes were waiting for (disk) I/O
    %idle    Percentage of time the system was idle

Disk Report (-d)

The disk report displays activity on each block device (i.e., disk drive):

    device   Logical name of the device (device file name)
    %busy    Percentage of time the device was busy servicing a request
    avque    Average number of I/O requests pending for the device
    r+w/s    Number of I/O requests per second (includes reads and writes)
    blks/s   Number of 512-byte blocks transferred (to and from) per second
    avwait   The average amount of time I/O requests wait in the queue before being serviced
    avserv   The average amount of time spent servicing an I/O request (includes seek, rotational latency, and data transfer times)

Examples

    # sar -u 5 4
    HP-UX r3w14 B.10.20 C 9000/712    10/14/97

    08:32:24     %usr    %sys    %wio   %idle
    08:32:29       64      36       0       0
    08:32:34       61      39       0       0
    08:32:39       61      39       0       0
    08:32:44       61      39       0       0

    Average        61      39       0       0

    # sar -d 5 4
    HP-UX r3w14 B.10.20 C 9000/712    10/14/97

    08:32:24    device   %busy   avque   r+w/s   blks/s   avwait   avserv
    08:32:29    c0t6d0   19.36    0.55      20     1341     6.37    14.27
    08:32:34    c0t6d0   26.40    0.58      27     1687     7.10    15.00
    08:32:39    c0t6d0   21.00    0.54      23     1528     5.48    14.09
    08:32:44    c0t6d0   21.00    0.54      23     1528     5.48    14.09

    Average     c0t6d0   22.44    0.56      23     1552     6.34    14.45
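The -o and -f options from the syntax section let the same data be collected once and reported in several ways later. A brief sketch (the file name is arbitrary):

    * Collect one hour of one-minute samples into a binary file
    # sar -o /tmp/sar.data 60 60

    * Replay the CPU report, then the disk report, from the same collection
    # sar -u -f /tmp/sar.data
    # sar -d -f /tmp/sar.data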


2–11. TEXT PAGE: time, timex

Description

The time and timex commands report the elapsed (wall clock) time, the time spent in system mode, and the time spent in user mode for a specific invocation of a program. The timex command is an enhanced version of time, and can report additional statistics related to the resources used during the execution of the command.

Tool Source:      Standard UNIX (System V)
Documentation:    man page and kernel source
Interval:         Process completion
Data Source:      Kernel registers/counters
Type of Data:     Process
Metrics:          CPU (user, system, elapsed)
Logging:          Standard output device
Overhead:         Minimal
Unique Feature:   Timing how long a process executes
Full Pathname:    /usr/bin/timex
Pros and Cons:    + minimal overhead
                  - cannot be used on already running processes

Syntax

    time command
    timex [-o] [-p[fhkmrt]] [-s] command

    -o    List the amount of I/O performed by the command (requires a pacct file to be present)
    -s    List the system activity (sar data) present during execution of the command (requires a sar file to be present)

Example

    # timex find / 2>&1 >/dev/null | tee -a perf.data

    real       39.49
    user        1.47
    sys        11.24


2–12. TEXT PAGE: top

Description

The top command displays a real-time list of the CPU consumers (processes) on the system, sorted with the greatest consumers at the top of the list.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page
Interval:         >= 1 second
Data Source:      Kernel registers/counters
Type of Data:     Global, Process
Metrics:          CPU, Memory
Logging:          Standard output device
Overhead:         Varies, depending on presentation interval
Unique Feature:   Real-time list of top CPU consumers
Full Pathname:    /usr/bin/top
Pros and Cons:    + quick look at global and process CPU data
                  - limited statistics
                  - uses curses for terminal output

Syntax

    top [-s time] [-d count] [-n number] [-q]

    -s time      Set the delay between screen updates
    -d count     Set the number of screen updates to "count", then exit
    -n number    Set the number of processes to be displayed
    -q           Run quick (runs the top command with a nice value of zero)

Key Metrics

The top metrics include:

    SIZE     Total size of the process in KB. This includes text, data, and stack.
    RES      Resident size of the process in KB. This includes text, data, and stack.
    %WCPU    Average (weighted) CPU usage since top started.
    %CPU     Current CPU usage over the current interval.

Example

    * Start top with a 10-second update interval
    # top -s 10

    * Start top and display only 5 screen updates, then exit
    # top -d 5

    * Start top and display only the top 15 processes
    # top -n 15

    * Start top and let it run continuously
    # top

    System: r3w14                            Fri Oct 17 10:24:23 1997
    Load averages: 0.55, 0.37, 0.25
    115 processes: 113 sleeping, 2 running
    Cpu states:
    LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
    0.55   9.9%   0.0%   2.0%  88.1%   0.0%   0.0%   0.0%   0.0%

    Memory: 24204K (15084K) real, 46308K (33432K) virtual, 2264K free  Page# 1/9

    TTY    PID USERNAME PRI NI   SIZE   RES STATE    TIME %WCPU  %CPU COMMAND
    ?      680 root     154 20  1328K  468K sleep   33:23 12.36 12.34 snmpdm
    ?      728 root     154 20   340K  136K sleep   18:20  5.82  5.81 mib2agt
    ?     1141 root     154 20 12784K 3708K sleep   84:06  4.47  4.47 netmon
    ?     1071 root      80 20  1264K  568K run      0:19  3.00  2.99 pmd
    ?     3892 root     179 20   308K  296K run      0:00  2.59  0.34 top

    * To go to the next/previous page, type "j" or "k" respectively
    * To go to the first page, type "t"

NOTE: The two values preceding real and virtual memory are the memory allocated for all processes and, in parentheses, the memory allocated for processes that are currently runnable or that have executed within the last 20 seconds.

NOTE: swait and block are relevant for SMP systems and will be 0.0% on single-processor systems. swait is the time a processor spends "spinning" while waiting for a spinlock; block is the time a processor spends "blocked" while waiting for a kernel-level semaphore.


2–13. TEXT PAGE: uptime, w

The uptime command shows how long a system has been up, who is logged in, and what they are doing. The w command is linked to uptime and prints the same output as uptime -w, displaying a summary of the current activity on the system.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page
Interval:         on demand
Data Source:      Kernel registers/counters and /etc/utmp
Type of Data:     Global
Metrics:          Load averages, number of logged-on users
Logging:          Standard output device
Overhead:         Varies, depending on the number of users logged in
Unique Feature:   Easiest way to see the time since last reboot and the load averages
Full Pathname:    /usr/bin/uptime
Pros and Cons:    + quick look at the load averages and how long the system has been up
                  - limited statistics

Syntax

    uptime [-hlsuw] [user]
    w [-hlsuw] [user]

    -h    Suppress the first line and the header line
    -l    Print long listing
    -s    Print short listing
    -u    Print only the utilization lines; do not show user information
    -w    Print what each user is doing; same as the w command

Example

    # uptime
     11:23am  up 3 days, 22:22,  7 users,  load average: 0.62, 0.37, 0.30

    # uptime -w
     11:23am  up 3 days, 22:22,  7 users,  load average: 0.57, 0.37, 0.30
    User     tty       login@   idle   JCPU   PCPU   what
    root     console   9:26am   94:20                /usr/sbin/getty console
    root     pts/0     9:26am       5                /sbin/sh
    root     pts/3     9:26am    1:57                /sbin/sh
    root     pts/4    10:16am       2      2         vi tools_notes
    root     pts/5     9:43am                        script


2–14. TEXT PAGE: vmstat

The vmstat command reports virtual memory statistics about processes, virtual memory, and CPU activity.

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man page, include files
Interval:         >= 1 second
Data Source:      Kernel registers/counters
Type of Data:     Global
Metrics:          CPU, Memory
Logging:          Standard output device
Overhead:         Varies, depending on presentation interval
Unique Feature:   Cumulative VM statistics since last reboot
Full Pathname:    /usr/bin/vmstat
Pros and Cons:    + minimal overhead
                  - poorly documented
                  - cryptic headings
                  - lines wrap on an 80-column character display
                  - statistics can bleed together

Syntax

    vmstat [-dnS] [interval [count]]
    vmstat -f | -s | -z

    -d    Include disk I/O information
    -n    Print in a format more easily viewed on an 80-column display
    -S    Include swapping information
    -f    Print the number of processes forked since boot, the number of pages used by all forked processes, and the average pages per forked process
    -s    Print virtual memory summary information
    -z    Zero the summary registers

Key Metrics

The vmstat metrics include:

Process metrics:
    r      In run queue
    b      Blocked for resource (I/O, paging, and so on)
    w      Runnable or short sleeper (< 20 sec.) but swapped

VM metrics:
    avm    Active virtual pages
    free   Number of pages on the free list
    re     Page reclaims
    at     Address translation faults
    pi     Pages paged in
    po     Pages paged out
    fr     Pages freed by vhand, per second
    sr     Pages surveyed (dereferenced) by vhand, per second

Fault metrics:
    in     Device interrupts per second
    sy     System calls per second
    cs     CPU context switch rate (switches/second)

CPU metrics:
    us     User mode utilization
    sy     System mode utilization
    id     Idle time

Examples

    # vmstat -n 5 2
    VM
    memory           page                             faults
       avm   free   re  at  pi  po  fr  de  sr    in    sy    cs
      7589    728    0   0   0   0   0   0   0   140   490    30
      7670    692    0   0   0   0   0   0   0   235  4959   170
    CPU
    cpu         procs
    us sy id    r   b  w
     2  1 97    0  74  0
    47 11 42    0  75  0

    # vmstat -nS 5 2
    VM
    memory           page                             faults
       avm   free   si  so  pi  po  fr  de  sr    in    sy    cs
      7984    584    0   0   0   0   0   0   0   140   490    30
      7972    549    0   0   0   0   0   0   0   203   462    53
    CPU
    cpu         procs
    us sy id    r   b  w
     2  1 97    0  75  0
     1  1 98    0  76  0

    # vmstat -f
    3949 forks, 497929 pages, average= 126.09

    # vmstat -s
             0 swap ins
             0 swap outs
             0 pages swapped in
             0 pages swapped out
       1116471 total address trans. faults taken
        346175 page ins
          7976 page outs
        200675 pages paged in
         16824 pages paged out
        213104 reclaims from free list
        216129 total page reclaims
           110 intransit blocking page faults
        587961 zero fill pages created
        303212 zero fill page faults
        248573 executable fill pages created
         67077 executable fill page faults
             0 swap text pages found in free list
         80233 inode text pages found in free list
           166 revolutions of the clock hand
        106769 pages scanned for page out
         13236 pages freed by the clock daemon
      75633551 cpu context switches
    1612387244 device interrupts
       1137948 traps
     247228805 system calls


2–15. SLIDE: Performance Monitoring Tools (HP Specific)

Performance Monitoring Tools (HP Specific)

              Global Metrics   Process Details   Alarming Capability
    glance    Yes              Yes               Yes
    gpm       Yes              Yes               Yes
    xload     Yes              No                No

Student Notes

This slide shows the HP-specific, run-time performance monitoring tools for HP-UX. Currently, glance and gpm are available for HP-UX. Both are optional and can be purchased separately; if you are running 11i (any version), both glance and gpm are included with the Enterprise and Mission Critical Operating Environments.

The glance and gpm tools provide real-time monitoring capabilities specific to the HP-UX operating system. Both tools provide access to performance data not available with the standard UNIX tools, and both use the midaemon (i.e., the KI interface) to collect performance data, yielding much more accurate performance results.

xload is an X Windows application that graphically shows the recent length of the CPU's run queue. It consists of a window that displays vertical lines representing the average number of processes in the run queue over the previous intervals. The default interval size is 8 seconds.


2–16. TEXT PAGE: glance

The glance tool is available for HP-UX and is the recommended (and preferred) performance monitoring tool for HP-UX systems (character-based display). It shows information that cannot be seen with any of the standard UNIX monitoring tools, and its data is considered more reliable because the source is the midaemon, as opposed to the kernel counters and registers.

NOTE: Free evaluation copies of glance and gpm can be obtained for trial periods. The phone number to obtain an evaluation copy is (800) 237-3990.

Tool Source:      HP
Documentation:    man page and on-line help
Interval:         >= 2 seconds
Data Source:      midaemon
Type of Data:     Global, Process, and Application
Metrics:          CPU, Memory, Disk, Network, and Kernel resources
Logging:          Standard output device, screen shots to a file
Overhead:         Varies, depending on presentation interval and number of processes
Unique Features:  Per-process (and global) system call rates
                  Extensive on-line help for the metrics
                  Sort by CPU usage, memory usage, or disk I/O usage
                  Files opened per process
Full Pathname:    /opt/perf/bin/glance
Pros and Cons:    + extensive per-process information
                  + extensive global information
                  + more accurate than standard UNIX tools
                  - uses the "curses" display library
                  - relatively slow startup
                  - not bundled with the OS (prior to 11i)

Syntax

    glance [-j interval] [-p [dest]] [-f dest] [-maxpages numpages] [-command]
           [-nice nicevalue] [-nosort] [-lock] [-adviser_off] [-adviser_only]
           [-bootup] [-iterations count] [-syntax filename] [-all_trans]
           [-all_instances] [-disks ] [-kernel ] [-nfs ] [-pids ] [-no_fkeys]


Key Metrics

The glance tool includes reports for the following areas:

    Hot Key   GLANCE PLUS REPORT       FUNCTION
    a         CPU by Processor         All CPUs performance stats
    c         CPU Report               CPU utilization stats
    d         Disk Report              Disk I/O stats
    g         Process List             Global process stats
    h                                  Help
    i         I/O by Filesystem        I/O by filesystem
    l         Network by LAN           LAN stats
    m         Memory Report            Memory stats
    n         NFS Report               NFS stats
    s         Process selection        Single process information
    t         System Table Report      OS table utilization
    u         Disk Report              Disk queue length
    v         I/O by Logical Volume    Logical Volume Mgr stats
    w         Swap Detail              Swap stats
    z                                  Zero all stats
    A         Application List
    B         Global Waits
    D         DCE Activity
    F         Process Open Files
    G         Process Threads
    H         Alarm History
    I         Thread Resource
    J         Thread Wait
    K         DCE Process List
    L         Process System Calls
    M         Process Memory Regions
    N         NFS Global Activity
    P         PRM Group List
    R         Process Resources
    T         Transaction Tracker
    W         Process Wait States
    Y         Global System Calls
    Z         Global Threads
    ?         Help with options
                                       Update screen with new data
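The -j, -iterations, and -f options listed in the syntax above can be combined to run glance unattended. A sketch, assuming -f dest names an output file as the syntax line suggests:

    # glance -j 60 -iterations 10 -f /tmp/glance.report

This collects ten one-minute updates into /tmp/glance.report and then exits.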

See Module 3 for a more complete discussion of glance and gpm.


2–17. TEXT PAGE: gpm

The gpm (GlancePlus Monitor) tool is a graphical version of glance. All the benefits of using glance apply to gpm.

NOTE: Free evaluation copies of Glance and gpm can be obtained for a 90-day trial period. The phone number to obtain an evaluation copy is (800) 237-3990.

Tool Source:      HP
Documentation:    man page and on-line help
Interval:         >= 1 second
Data Source:      midaemon
Type of Data:     Global, Process, Application
Metrics:          CPU, Memory, Disk, Network, kernel resources
Logging:          Standard output device and screen shots to a file
Overhead:         Varies, depending on presentation interval and number of processes
Unique Features:  Alarming capabilities
                  Performance advisor
Full Pathname:    /opt/perf/bin/gpm
Pros and Cons:    + extensive per-process information
                  + extensive global information
                  + more accurate than standard UNIX tools
                  - no selection for printing graphs
                  - not bundled with the OS (prior to 11i)

Syntax

    gpm [-nosave] [-rpt [rptname]] [-sharedclr] [-nice nicevalue] [-lock]
        [-disks ] [-kernel ] [-lfs ] [-nfs ] [-pids ] [Xoptions]

Glance and GPM Advantages

Both Glance and GPM:

•  Use the same metrics
•  Use the midaemon and kernel registers/counters as data sources
•  Have adjustable presentation intervals
•  Have the ability to renice processes
•  Provide alarming capability (via /var/opt/perf/advisor.syntax)
•  Provide per-CPU metrics
•  Can be configured to monitor application performance (that is, groups of processes)

Glance Advantages

Advantages of using Glance include:

•  It is independent of X-Windows.
•  It uses less overhead.

GPM Advantages

Advantages of using gpm include:

•  It has customizable advisor syntax, which generates color-coded alarms.
•  It has the ability to kill processes.
•  Reports are customizable.
•  More comprehensive online documentation is available.

See Module 3 for a more complete discussion of glance and gpm.


2–18. TEXT PAGE: xload

xload is a graphical tool that displays the average length of the run queue over recent 10-second intervals. Because it is displayed in its own window on a graphics terminal, the window can be resized to show good detail and many intervals at once.

Tool Source:      HP
Documentation:    man page
Interval:         10 seconds (default)
Data Source:      Kernel registers
Type of Data:     Global
Metrics:          Run queue length
Logging:          none
Overhead:         Very little
Unique Feature:   Visual representation of run queue lengths
Full Pathname:    /usr/contrib/bin/X11/xload
Pros and Cons:    + visual representation of run queue lengths in real time
                  + expandable window for greater time and detail
                  + self-scaling
                  - no scale labels
                  - no per-processor information

Syntax

    xload [-toolkitoption ...] [-scale integer] [-update seconds]
          [-hl|-highlight color] [-jumpscroll pixels] [-label string]
          [-nolabel] [-lights]

Example

    xload -update 30


2–19. SLIDE: Data Collection Performance Tools (Standard UNIX)

Data Collection Performance Tools (Standard UNIX)

               Global Metrics   Process Details   Alarming Capability
    acctcom    Some             Some              No
    sar        Yes              No                No

Student Notes

This slide shows the standard UNIX data collection tools included with HP-UX. Data collection tools gather performance data and other system-activity information, and store this data to a file on the system. Few standard UNIX tools perform data collection by default. The two most common are the acct (system accounting) suite of tools and sar (via the sadc and sa1 programs), the system activity reporter.


2–20. TEXT PAGE: acct Programs

The system accounting programs are primarily a financial tool, designed to charge for time and resources used on the system. Information such as connect time, pages printed, disk space used for file storage, and commands executed (and the resources used by those commands) is collected and stored by the acct commands. Although generally not considered a performance tool, the accounting commands can provide useful data in certain situations.

Tool Source:      Standard UNIX (System V)
Documentation:    man pages
Interval:         on demand
Data Source:      Kernel registers and other kernel routines
Type of Data:     System resources used, on a per-user basis
Metrics:          Connect time, disk space used, others
Logging:          Binary file /var/adm/acct/pacct
Overhead:         Medium to large (up to 33%), depending on the number of users and amount of activity
Unique Features:  Shows the amount of system resources being consumed by each user on the system.
                  Logs every command executed by every user on the system.
Full Pathname:    /usr/sbin/acct/[acct_command]
Pros and Cons:    + provides information to charge users for system use
                  + extensive system utilization information kept
                  - extremely large overhead, especially on an active system
                  - poor documentation

Syntax

    /usr/sbin/acct/acctdisk
    /usr/sbin/acct/acctdusg [-u file] [-p file]
    /usr/sbin/acct/accton [file]
    /usr/sbin/acct/acctwtmp reason
    /usr/sbin/acct/closewtmp
    /usr/sbin/acct/utmp2wtmp
    and many more ...


System Accounting Notes

•  System Accounting can be started:
   –  Manually: run the /usr/sbin/acct/startup command.
   –  Automatically at boot time: edit the /etc/rc.config.d/acct file and set the START_ACCT parameter to one (for example, START_ACCT=1).
•  Only terminated processes are reported (see the example after this list).
•  Accounting reports include:
   –  CPU time accounting
   –  Disk accounting
   –  Memory accounting
   –  Connect time accounting
   –  User command history
   –  Several more
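Once accounting has been started, the acctcom command (listed on the preceding data collection slide) reads the pacct file and prints one line per terminated process. A minimal sketch, assuming the pacct location shown under Logging above:

    # /usr/sbin/acct/startup
    # acctcom /var/adm/acct/pacct | tail -20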


2–21. TEXT PAGE: sar

The sar tool comes with additional programs that assist in collecting and storing performance data. The performance data is kept for one month before being overwritten with new data; because collected data is overwritten each month, monitoring the size of the files is unnecessary.

The sadc program is a data collector that runs in the background, usually started by sar or sa1. The sa1 program is a convenient shell script for collecting and storing sar data to a log file under /var/adm/sa. This script is typically run from root's cron file and collects (by default) three system snapshots per hour. The sa2 program is a convenient shell script for converting collected sar data (binary format) into readable ASCII report files. The report files are typically stored in /var/adm/sa. The sa2 script is also normally run from root's cron file.

Tool Source:      Standard UNIX (System V)
Documentation:    man page
Interval:         >= 1 second
Data Source:      Kernel registers
Type of Data:     Global
Metrics:          CPU, Disk, Kernel resources
Logging:          Binary file under /var/adm/sa
Overhead:         Varies, depending on snapshot interval
Unique Feature:   Only standard UNIX performance data collector
Full Pathname:    /usr/sbin/sar
Pros and Cons:    + familiarity
                  + relatively low overhead
                  - no per-process information
                  - accuracy not as good as MeasureWare/OVPA

Syntax

    sar [-ubdycwaqvmAMS] [-o file] t [n]
    sar [-ubdycwaqvmAMS] [-s time] [-e time] [-i sec] [-f file]

Some data-collection-related options:

    -s    The start time of the desired data
    -e    The end time of the desired data
    -i    The size of the reporting interval in seconds
    -o    The file to write the data to
    -f    The file to read the data from


Configure Data Collection through cron Jobs

To set up sar data collection, add the following to root's cron file (an example of reading back the collected data follows below):

    0 * * * 0,6      /usr/lbin/sa/sa1
    0 8-17 * * 1-5   /usr/lbin/sa/sa1 1200 3
    0 18-7 * * 1-5   /usr/lbin/sa/sa1
    5 18 * * 1-5     /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -u
    5 18 * * 1-5     /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -b
    5 18 * * 1-5     /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 3600 -q

Create the /var/adm/sa directory:

    # mkdir /var/adm/sa

Some systems recommend adding the above entries to adm's cron file instead of root's. On those systems, be sure to give all users write access to the /var/adm/sa directory:

    # chmod a+w /var/adm/sa
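With these entries in place, sa1 writes one binary file per day under /var/adm/sa, named sa followed by the day of the month, and sar -f replays it. A brief sketch (the 14th is used as an example day):

    * Today's CPU data
    # sar -u -f /var/adm/sa/sa`date +%d`

    * Run queue data for the morning of the 14th
    # sar -q -s 9:00 -e 12:00 -f /var/adm/sa/sa14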


2–22. SLIDE: Data Collection Performance Tools (HP-Specific)

Data Collection Performance Tools (HP-Specific)

                              Global Metrics   Process Details   Alarming Capability
    MeasureWare/OVPA          Yes              Yes               Yes
    PerfView/OVPM             Yes              Yes               Yes
    Data Source Integration   User Definable   User Definable    User Definable

Student Notes

This slide shows the HP-specific data collection performance tools that can be added to an HP-UX system. The MeasureWare/OVPA (OpenView Performance Agent) and PerfView/OVPM (OpenView Performance Manager) tools are available for HP-UX systems as optional, separately purchasable products. These tools significantly enhance a customer's ability to track performance trends and review historical performance data about a system. The standard UNIX tools collect little to no per-process information and have no alarming capabilities. With the MeasureWare/OVPA and PerfView/OVPM tools, global and per-process information is collected. In addition, alarms can be set to notify a user when a collected metric exceeds a defined threshold.

Recently, PerfView was renamed OpenView Performance Manager and MeasureWare was renamed OpenView Performance Agent. There were no other significant changes made to the products.


2–23. TEXT PAGE: MeasureWare/OVPA and DSI Software

MeasureWare/OVPA is the recommended and preferred tool for collecting performance data on an HP-UX system. MeasureWare/OVPA collects all the global and process statistics, consolidates the data into a 5-minute summary, and writes the record to a circular log file. Processes can be grouped into applications, and various thresholds are available for determining which processes are included in the summary. OVPA version 3.x is identical to MeasureWare; OVPA version 4.x serves the same purpose, but has a new user interface.

Included with MeasureWare/OVPA is a product/tool called Data Source Integration (DSI). DSI allows custom, application-specific metrics to be defined and collected via the MeasureWare/OVPA product. This custom information can include database statistics, networking statistics collected with NetMetrix, or MIB information from a networking device (router or gateway) collected with SNMP.

Tool Source:      HP
Documentation:    man pages, manual, on-line help
Interval:         1-minute and 5-minute summaries
Data Source:      midaemon
Type of Data:     Global, Process, Application
Metrics:          CPU, Memory, Disk, Network, Other
Logging:          Circular binary files under /var/opt/perf/datafiles
Overhead:         Varies with the number of processes and number of application definitions
Unique Features:  Parameter file to define the extent of data collection
                  Circular, compact log file format
Full Pathname:    /opt/perf/bin/mwa
Pros and Cons:    + extensive global information
                  + extensive per-process information
                  + customize data collection with DSI
                  - requires another tool (PerfView/OVPM) for graphical analysis
                  - not included with the base OS

Syntax

    mwa [action] [subsystem] [parms]

in which action is one of (a short usage example follows below):

    start      Start all or part of MeasureWare/OpenView Performance Agent (default).
    stop       Stop all or part of MeasureWare/OpenView Performance Agent.
    restart    Reinitialize all or part of MeasureWare/OpenView Performance Agent.
               This option causes some processes to be stopped and restarted.
    status     List the status of all or part of the MeasureWare/OpenView Performance Agent processes.
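A typical check-and-restart sequence using the actions above (a sketch):

    * List the state of the collection processes
    # mwa status

    * Reinitialize collection after a configuration change
    # mwa restart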

MeasureWare/OVPA and Data Source Integration Notes

•  The MeasureWare/OVPA agent for HP-UX is part of the RPM (Resource and Performance Management) set of performance tools. To find the complete list of available RPM products, visit the RPM web site at: www.hp.com/go/rpm
•  MeasureWare/OVPA is designed for use with the PerfView/OVPM Analyzer tool and features extensive alarming syntax.
•  The utility and extract programs for MeasureWare/OVPA provide many features for the analysis and management of the MeasureWare/OVPA log files.
•  The MeasureWare/OVPA agent is fully integrated with the OpenView product line and is capable of sending alarm messages to the PerfView/OVPM Monitor, Network Node Manager, and IT Operations.
•  The MeasureWare/OVPA agent is available for a large number of UNIX platforms, including AIX, Solaris, NCR System VR4, Microsoft Windows NT, and more.
•  Data Source Integration (DSI) is one of the most powerful features of MeasureWare/OVPA. DSI provides the ability to log data from any data source, as long as that source writes its output to stdout.
•  HP sells additional agents that use data source integration to allow monitoring of databases, network operating systems (for example, Windows NT and NetWare), and the Network Response Monitoring metrics (a facility of NetMetrix).
•  Data can be imported from such operating environments as SAP/R3 and Baan.

See the course B5136S – “Performance Management with HP OpenView” for a more complete discussion of MeasureWare/OVPA.


2–24. TEXT PAGE: PerfView/OVPM

The PerfView/OVPM tool allows collected MeasureWare/OVPA information to be viewed in a feature-rich GUI. Graphs, charts, alarms, and other details are easily viewed with the PerfView/OVPM tool. As with the MeasureWare product, OVPM version 3 is identical to PerfView, whereas OVPM version 4 has the same functionality but a new user interface.

Tool Source:      HP
Documentation:    man pages, manual, online help
Interval:         on demand
Data Source:      MeasureWare/OVPA log files
Type of Data:     Global, Process, and Application
Metrics:          CPU, Memory, Disk, Network, others
Logging:          To central monitoring workstation
Overhead:         Varies with the number of systems being analyzed and the number of systems sending alarms
Unique Features:  Many predefined graph templates
                  Access to any system currently running the MeasureWare/OVPA agent
Full Pathname:    /opt/perf/bin/pv
Pros and Cons:    + centralized and automated performance monitoring
                  + can view data from DSI sources
                  + graphs can be saved in a worksheet format
                  - does not come standard with the OS

Syntax

    pv [options]

PerfView/OVPM Notes

Three components make up the PerfView/OVPM product:

PerfView/OVPM Analyzer

•  The PerfView/OVPM Analyzer allows the performance administrator to easily access data from any MeasureWare/OVPA agent.
•  By default, the last 8 days of data are pulled in to be analyzed, but any amount of collected data can be retrieved.
•  The Analyzer allows you to compare multiple systems against a specific metric, which is also useful for load balancing.
•  The graphs produced by the Analyzer can be stored, or printed to any PostScript or PCL printer.
•  As with all of the RPM products, the Analyzer is fully integrated with Network Node Manager and IT Operations.

PerfView/OVPM Monitor

•  The PerfView/OVPM Monitor receives alarms sent by MeasureWare/OVPA agents.
•  It allows you to filter alarms by severity and type.
•  The Monitor is an optional module and may not be required if you are also running Network Node Manager or IT Operations.

PerfView/OVPM Planner

•  The PerfView/OVPM Planner allows you to use collected MeasureWare/OVPA data to see performance trends.
•  The more data you provide to the Planner, and the shorter the projection period, the more accurate the reports will be.
•  The Planner is not a true capacity-planning tool in that it does not provide modeling or simulation capability.

See the course B5136S – “Performance Management with HP OpenView” for a more complete discussion of PerfView/OVPM.


2–25. SLIDE: Network Performance Tools (Standard UNIX)

Network Performance Tools (Standard UNIX)

               Resource                                                          Super User Access Required
    netstat    Various LAN statistics                                            No
    nfsstat    Network File Sharing statistics                                   No
    ping       Test network connectivity and packet round-trip response time     No

Student Notes

This slide shows the standard UNIX networking performance tools included with HP-UX. Networking performance tools monitor performance and errors on the network. The standard UNIX networking tools primarily allow for monitoring of performance; the HP-specific tools will introduce the ability to tune some networking parameters to better meet the needs of a system's networking environment.

NOTE: By default, super user (or root) access is not needed to monitor networking status.


2–26. TEXT PAGE: netstat

The netstat command displays general networking statistics. Information displayed includes:

•  active sockets per protocol
•  network data structures (such as route tables)
•  LAN card configuration and traffic

Tool Source:      Standard UNIX (BSD 4.x)
Documentation:    man pages and manual
Interval:         on demand
Data Source:      Kernel registers and LAN card
Type of Data:     Global
Metrics:          Network, LAN I/O, Sockets
Logging:          Standard output device
Overhead:         Varies, depending on network activity
Unique Features:  Shows established and listening sockets.
                  Shows traffic going through the LAN interface card.
                  Shows the amount of memory allocated to networking.
Full Pathname:    /usr/bin/netstat
Pros and Cons:    + provides lots of information on networking configuration
                  - provides lots of metrics; not all metrics are documented well

Syntax

    netstat [-aAn] [-f address-family] [system [core]]
    netstat [-f address-family] [-p protocol] [system [core]]
    netstat [-gin] [-I interface] [interval] [system [core]]

Examples

Display network connections:

    # netstat -n
    Active Internet connections
    Proto Recv-Q Send-Q  Local Address         Foreign Address        (state)
    tcp        0      0  156.153.192.171.1128  156.153.192.171.1129   ESTABLISHED
    tcp        0      0  156.153.192.171.1129  156.153.192.171.1128   ESTABLISHED
    tcp        0      0  156.153.192.171.947   156.153.192.171.1105   ESTABLISHED
    Active UNIX domain sockets
    Address  Type    Recv-Q  Send-Q  Inode   Conn    Refs  Nextref  Addr
    c6f300   dgram        0       0  844afc  0          0  0        /var/tmp/psb_front_socket
    c87e00   dgram        0       0  844c4c  0          0  0        /var/tmp/psb_back_socket
    de4f00   stream       0       0  0       f75240     0  0        /var/spool/sockets/X11/0
    f71200   stream       0       0  0       f75280     0  0

Display network interface information:

    # netstat -in
    Name  Mtu   Network        Address          Ipkts  Ierrs  Opkts  Oerrs  Coll
    ni0*     0  none           none                 0      0      0      0     0
    ni1*     0  none           none                 0      0      0      0     0
    lo0   4608  127            127.0.0.1         6745      0   6745      0     0
    lan0  1500  156.153.192.0  156.153.192.171    156      0      0      0     0

Display network interface traffic:

    # netstat -I lan0 5
    (lan0)-> input packets  output packets    (Total)-> input packets  output packets
                       188             172                       6973            6785
                         2               1                          2               1
    . . .

Display protocol status:

    # netstat -s
    tcp:
            2244 packets sent
                    1191 data packets (217208 bytes)
                    4 data packets (5840 bytes) retransmitted
                    692 ack-only packets (276 delayed)
                    318 control packets
            2277 packets received
                    1288 acks (for 195140 bytes)
                    144 duplicate acks
                    1360 packets (236775 bytes) received in-sequence
                    0 completely duplicate packets (0 bytes)
                    83 out-of-order packets (0 bytes)
                    0 discarded for bad header offset fields
                    0 discarded because packet too short
            134 connection requests
            120 connection accepts
            243 connections established (including accepts)
    udp:
            0 bad checksums
            164 socket overflows
            0 data discards
    ip:
            460730 total packets received
            0 bad header checksums
            0 with ip version unsupported
            2253 fragments received
            2670 packets not forwardable
            0 redirects sent
    icmp:
            1989 calls to generate an ICMP error message
            Output histogram:
                    echo reply: 727
                    destination unreachable: 1989
            727 responses sent
    arp:
            0 Bad packet lengths
            0 Bad headers
    probe:
            0 Packets with missing sequence number
            0 Memory allocations failed
    igmp:
            0 messages received with bad checksum
            10939700 membership queries received
            10969833 membership queries received with incorrect field(s)
            0 membership reports received


2–27. TEXT PAGE: nfsstat

The nfsstat command displays network file system (NFS) statistics. Categories of NFS information include:

• server statistics
• client statistics
• RPC statistics
• performance detail statistics

Tool Source:      Sun Microsystems
Documentation:    man pages
Interval:         on demand
Data Source:      Kernel registers
Type of Data:     Global
Metrics:          NFS, RPC
Logging:          Standard output device
Overhead:         Varies, depending on NFS activity
Unique Feature:   Shows RPC calls, retransmissions, and timeouts.
Full Pathname:    /usr/bin/nfsstat
Pros and Cons:    + reports both client and server activity
                  - limited documentation

Syntax

nfsstat [ -cmnrsz ]

Examples

To reset all nfsstat counters to zero:
# nfsstat -z

To display server/client RPC and NFS statistics:
# nfsstat          (this defaults to nfsstat -cnrs)

Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0


Server nfs:
calls      badcalls
0          0
Version 2: (0 calls)
null       getattr    setattr    root       lookup     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
commit
0 0%

Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds
20         0          0          0          0
badverfs   timers     cantconn   nomem      interrupts
0          17         0          0          0
Connectionless oriented:
calls      badcalls   retrans    badxids    timeouts   waits      newcreds
20         0          0          0          0          0          0
badverfs   timers     toobig     nomem      cantsend   bufulocks
0          17         0          0          0          0

Client nfs:
calls      badcalls   clgets     cltoomany
20         0          20         0
Version 2: (20 calls)
null       getattr    setattr    root       lookup     readlink   read
0 0%       18 90%     0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    statfs
0 0%       0 0%       1 5%       1 5%
Version 3: (0 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
write      create     mkdir      symlink    mknod      remove     rmdir
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
commit
0 0%
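A common first pass on a suspected NFS problem is to compare the client-side retrans and timeouts counters against total calls. A quick sketch (the 5% threshold is a general rule of thumb, not an HP-specified limit):

# nfsstat -cr          # client RPC statistics only

If timeouts or retrans grow beyond roughly 5% of calls, suspect an overloaded server or a lossy network path; badxids growing alongside timeouts points at a slow server rather than lost packets.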


2–28. TEXT PAGE: ping

The ping command sends an ICMP echo packet to a host and times how long it takes for the echo packet to return. This command is often used to test connectivity to another system. Specific details of the implementation include:

• An ICMP echo packet is sent once a second.
• Upon receipt of the echo packet, the round-trip time is displayed.
• The IP route taken can be displayed (via the -o option).

Tool Source:      Public Domain
Documentation:    man pages
Interval:         on demand
Data Source:      NIC and ICMP packets
Type of Data:     Network
Metrics:          Packet transmission
Logging:          Standard output device
Overhead:         minimal; one packet transmission per second
Unique Features:  Shows round-trip times between systems. Shows route taken to
                  and from the second system.
Full Pathname:    /usr/sbin/ping
Pros and Cons:    + familiarity
                  + understood by all UNIX-based (and TCP/IP-based) systems
                  - limited functionality

Syntax

ping [-oprv] [-i address] [-t ttl] host [-n count]

Examples

Send two ICMP echo packets to host star1:
# ping star1 -n 2
PING star1: 64 byte packets
64 bytes from 156.153.193.1: icmp_seq=0. time=1. ms
64 bytes from 156.153.193.1: icmp_seq=1. time=0. ms
----star1 PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/0/1


Send one ICMP packet and display the IP path taken:
# ping -o 156.152.16.10 -n 1
PING 156.152.16.10: 64 byte packets
64 bytes from 156.152.16.10: icmp_seq=0. time=337. ms
----156.152.16.10 PING Statistics----
1 packets transmitted, 1 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 337/337/337
1 packets sent via:
  15.63.200.2    - [ name lookup failed ]
  15.68.88.4     - [ name lookup failed ]
  156.152.16.1   - [ name lookup failed ]
  156.152.16.10  - [ name lookup failed ]
  15.68.88.43    - [ name lookup failed ]
  15.63.200.1    - [ name lookup failed ]


2–29. SLIDE: Network Performance Tools (HP-Specific)

                 Resource                                      Super User Access Required
lanadmin         Layer 2 Networking Statistics and NIC Reset   Yes
lanscan          LAN Hardware and Software Status              No
nettune (10.x)   Change Kernel Networking Parameters           Yes
ndd (11.x)       Change Kernel Networking Parameters           Yes
NetMetrix        Collects network performance data using       Yes
                 RMON LAN probes

Student Notes

This slide shows the HP-specific networking performance tools included with HP-UX. The first three tools listed (lanadmin, lanscan, and ndd/nettune) come standard with the base OS. The NetMetrix product is an additional product. The HP-specific networking tools display additional networking information and allow tuning of various networking parameters.


2–30. TEXT PAGE: lanadmin

The lanadmin command tests, displays statistics for, and allows modifications to LAN cards on the HP-UX system. Specific capabilities include:

• Resetting the LAN card and executing the LAN card self-tests
• Displaying and clearing LAN card statistics
• Changing the LAN card speed, the MTU size, and the link level address

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      Kernel registers and Network Interface Card
Type of Data:     Network
Metrics:          Packet transmission status and errors
Logging:          Standard output device
Overhead:         minimal
Unique Feature:   Allows LAN interface card to be reset.
Full Pathname:    /usr/sbin/lanadmin
Pros and Cons:    + provides extensive transmission statistics
                  + allows tuning of parameters that normally require source code changes
                  - many statistics have little to no documentation

Syntax

/usr/sbin/lanadmin [-e] [-t]
/usr/sbin/lanadmin [-a] [-A station_addr] [-m] [-M mtu_size] [-R] [-s]
                   [-S speed] NetMgmtID

-e    Echo the input commands on the output device.
-t    Suppress the display of the command menu before each command prompt.

Example

# lanadmin
Test Selection mode.
        lan      = LAN Interface Administration
        menu     = Display this menu
        quit     = Terminate the Administration
        verbose  = Display command menu

Enter command: lan


LAN Interface test mode. LAN Interface Net Mgmt ID = 4
        clear    = Clear statistics registers
        display  = Display LAN Interface status and statistics registers
        end      = End LAN Interface Administration, return to Test Selection
        menu     = Display the menu
        ppa      = PPA Number of the LAN Interface
        quit     = Terminate the Administration, return to shell
        nmid     = Network Management ID of the LAN Interface
        reset    = Reset LAN Interface to execute its selftest
        specific = Go to Driver specific menu

Enter command: display

Network Management ID          = 4
Description                    = lan0 Hewlett-Packard LAN Interface Hw Rev 0
Type (value)                   = ethernet-csmacd(6)
MTU Size                       = 1500
Speed                          = 10000000
Station Address                = 0x8000935c9bd
Administration Status (value)  = up(1)
Operation Status (value)       = up(1)
Last Change                    = 14465
Inbound Octets                 = 3606105787
Inbound Unicast Packets        = 2767086
Inbound Non-Unicast Packets    = 88379016
Inbound Discards               = 0
Inbound Errors                 = 464396
Inbound Unknown Protocols      = 7114206
Outbound Octets                = 458391388
Outbound Unicast Packets       = 2842387
Outbound Non-Unicast Packets   = 2874
Outbound Discards              = 0
Outbound Errors                = 0
Outbound Queue Length          = 0
Specific                       = 655367

Ethernet-like Statistics Group
Index                          = 4
Alignment Errors               = 0
FCS Errors                     = 0
Single Collision Frames        = 21353
Multiple Collision Frames      = 42774
Deferred Transmissions         = 281589
Late Collisions                = 0
Excessive Collisions           = 0
Internal MAC Transmit Errors   = 0
Carrier Sense Errors           = 0
Frames Too Long                = 0
Internal MAC Receive Errors    = 0


2–31. TEXT PAGE: lanscan

The lanscan command displays the LAN card configuration and status. Items displayed include:

• Hardware address of LAN card slot
• Link level address of card
• Hardware status and interface status
• Other status and configuration information

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      Network Interface Card
Type of Data:     Network
Metrics:          Interface status, Link Level Address
Logging:          Standard output device
Overhead:         minimal
Unique Feature:   Shows Link Level Address of system.
Full Pathname:    /usr/sbin/lanscan
Pros and Cons:    + provides additional status information about network interface cards
                  - no performance information

Syntax

lanscan [-ainv] [system [core]]

-a    Display station addresses only. No headings.
-i    Display interface names only. No headings.
-n    Display Network Management IDs only. No headings.
-v    Verbose output. Two lines per interface. Includes displaying of extended
      station address and supported encapsulation methods.

Examples

Output from a 10.x system:
# lanscan
Hardware  Station         Crd  Hardware  Net-Interface   NM  MAC    HP DLPI  Mjr
Path      Address         In#  State     NameUnit State  ID  Type   Support  Num
2/0/2     0x080009D2C2DE  0    UP        lan0     UP     4   ETHER  Yes      52

Output from an 11.x system:
# lanscan
Hardware  Station         Crd  Hdw    Net-Interface  NM  MAC    HP-DLPI  DLPI
Path      Address         In#  State  NamePPA        ID  Type   Support  Mjr#
2/0/2     0x08000978BDB0  0    UP     lan0 snap0     1   ETHER  Yes      119


2–32. TEXT PAGE: nettune (HP-UX 10.x Only)

The nettune command allows modifications to be made to network parameters, which in previous releases were not modifiable. This command was not included with any HP-UX 11.x release. Parameters that can be modified with nettune include:

• arp configuration
• socket buffer sizes
• enable or disable IP forwarding

CAUTION: Use caution when making modifications with the tool. It is possible to hurt network performance severely or disable the LAN card when using this tool.

Tool Source:      HP
Documentation:    man pages, nettune help options (-?, -l, -h)
Interval:         on demand
Data Source:      Kernel registers and NIC
Type of Data:     Global
Metrics:          LAN tunable parameters
Logging:          Standard output device
Overhead:         minimal
Unique Features:  Change values of network parameters that cannot otherwise be
                  changed. Change TCP send and receive buffer sizes without
                  need for source code.
Full Pathname:    /usr/contrib/bin/nettune
Pros and Cons:    + provides ability to modify networking behavior without needing source code
                  + provides access to tunable parameters normally not available
                  - can have a negative impact on performance if used the wrong way
                  - minimal documentation

Syntax

nettune [-w] object [parm...]
nettune -h [-w] [object]
nettune -l [-w] [-b size] [object [parm...]]
nettune -s [-w] object [parm...] value...

-h    (help) Print all information related to the object. This information
      provides helpful hints about changing the value of an object.
-l    (list) Print information regarding changing the value of object.


-s    (set) Set object to value. An object may require more than one value.
-w    Display warning messages (for example, 'value truncated'). These are
      normally discarded when the command is successful.

Examples

To get help information on all defined objects:
nettune -h
arp_killcomplete:   The number of seconds that an arp entry can be in the
                    completed state between references. When a completed arp
                    entry is unreferenced for this period of time, it is
                    removed from the arp cache.
. . .

To get help information on all TCP-related objects:
nettune -h tcp
tcp_receive:   The default socket buffer size in bytes for inbound data.
tcp_send:      The default socket buffer size in bytes for outbound data.
. . .

To set the value of the ip_forwarding object to 1:
nettune -s ip_forwarding 1

To get the value of the tcp_send object (socket send buffer size):
nettune tcp_send


2–33. TEXT PAGE: ndd (HP-UX 11.x Only)

The ndd command allows the examination and modification of several tunable parameters that affect networking operation and behavior. It accepts arguments on the command line or may be run interactively. The -h option displays all the supported and unsupported tunable parameters that ndd provides.

CAUTION: ndd was ported to HP-UX and contains references to some parameters that have not been implemented on the HP-UX OS at this time. Reference the man page when in doubt. (Just because you can display a symbol's value and set it does not necessarily mean that the HP-UX kernel references the symbol!)

The ndd utility accesses kernel parameters through the use of "pseudo device files". These pseudo device files are referred to as a network device on the ndd command line and selected from the following list:

/dev/arp     For ARP cache-related values
/dev/ip      For IP routing and forwarding parameters
/dev/rawip   Default IP time-to-live header value
/dev/tcp     Transmission Control Protocol (connection based) parameters
/dev/udp     User Datagram Protocol (connectionless) parameters

Tool Source:      HP
Documentation:    man pages, ndd -h (for help options)
Interval:         on demand
Data Source:      network device pseudo device files (referenced above)
Type of Data:     Global
Metrics:          LAN tunable parameters
Logging:          Standard output device
Overhead:         minimal
Unique Feature:   Change values of network parameters that cannot otherwise be changed
Full Pathname:    /usr/bin/ndd
Pros and Cons:    + provides ability to modify networking behavior without needing source code
                  + provides access to tunable parameters normally not available
                  - can have a negative impact on performance if used the wrong way
                  - minimal documentation

Syntax

ndd -get network_device parameter
ndd -set network_device parameter value
ndd -h   sup[ported]
ndd -h   unsup[ported]
ndd -h   [parameter]
ndd -c

At boot: The file /etc/rc.config.d/nddconf contains tunable parameters that
will be set automatically each time the system boots.

Examples

To list the contents of the "arp cache":
ndd -get /dev/arp arp_cache_report

To get help information on all supported tunable parameters:
ndd -h supported

To get a detailed description of the tunable parameter ip_forwarding:
ndd -h ip_forwarding

To get the current value of the tunable parameter ip_forwarding:
ndd -get /dev/ip ip_forwarding

To set the value of the default TTL parameter for UDP to 128:
ndd -set /dev/udp udp_def_ttl 128

To re-read the configuration file /etc/rc.config.d/nddconf without rebooting the system:
ndd -c
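For reference, entries in /etc/rc.config.d/nddconf are indexed shell-style variables; the following is a minimal sketch making ip_forwarding persistent across boots (the index simply increments for each additional tunable — verify the exact format against the comments in the file itself):

TRANSPORT_NAME[0]=ip
NDD_NAME[0]=ip_forwarding
NDD_VALUE[0]=1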


2–34. TEXT PAGE: NetMetrix (HP-UX 10.20 and 11.0 Only)

The NetMetrix product makes use of LAN probes to collect network traffic information. The LAN probes attach to the physical network and collect detailed information regarding the packets that pass through the probe. Tools available with NetMetrix include:

• packet decoders
• network alarming capabilities
• reports, including top packet-generating systems
• data collection for trending

Tool Source:      HP
Documentation:    man pages, NRF (Network Response Facility) manual
Interval:         on demand
Data Source:      LAN probes
Type of Data:     LAN traffic
Metrics:          number of packets through cross-section of network
Logging:          NetMetrix binary file
Overhead:         Varies, depending on the number of LAN probes
Unique Feature:   Provides statistics regarding traffic on the entire network
Pros and Cons:    + Statistics regarding total packet traffic
                  - Additional cost
                  - Requires LAN probes

NetMetrix Notes

• NetMetrix makes use of highly sophisticated devices (LAN probes) capable of collecting large amounts of detailed network information.
• NetMetrix is a truly distributed network management product that makes use of "mid-level managers" for data storage and alarming.
• There are a number of modules available with NetMetrix.
• NetMetrix's Internet Response Manager (IRM) and Internet Response Agent (IRA) fully integrate with HP OpenView products to provide a complete system and network management solution.


2–35. SLIDE: Performance Administrative Tools (Standard UNIX)

         Resource                                                Super User Access Required
ipcs     List Semaphores, Message Queues, and Shared Memory      No
         Segments
ipcrm    Destroy Semaphores, Message Queues, and Shared Memory   Yes
         Segments
nice     Setting Process Priorities                              Yes
renice   Modifying Process Priorities                            Yes

Student Notes

This slide shows the standard UNIX administrative performance tools included with HP-UX. These tools are used to tune or modify system resources to improve the performance of a system. They are typically used to change or tune a system component, as opposed to viewing or displaying characteristics about the component. Only the root user is allowed to use these commands, as making these modifications affects performance for all users on the system.

NOTE: The ipcs program is really a performance-monitoring command; however, because it is usually run in conjunction with ipcrm, it is covered here to emphasize the relationship between the two commands.


2–36. TEXT PAGE: ipcs, ipcrm

The ipcs command displays information about active interprocess communication facilities. With no options, ipcs displays information in short format about message queues, shared memory segments, and semaphore sets that are currently active in the system.

The ipcrm command removes one or more specified message-queue, semaphore-set, or shared-memory identifiers.

Tool Source:      Standard UNIX (System V)
Documentation:    man pages
Interval:         on demand
Data Source:      Kernel registers
Type of Data:     Global, limited process
Metrics:          semaphore sets, message queues, shared memory
Logging:          Standard output device
Overhead:         varies, depending on the IPC resources in use
Unique Feature:   Shows the size, owner, and last user of message queues and shared memory segments.
Full Pathname:    /usr/bin/ipcs and /usr/bin/ipcrm
Pros and Cons:    + shows orphan IPC entries
                  + shows size of message queues and shared memory segments
                  - process information limited to owner and last user

Syntax

ipcrm [-m shmid] [-q msqid] [-s semid]
ipcs [-mqs] [-abcopt] [-C corefile] [-N namelist]

-m    Display information about active shared memory segments.
-q    Display information about active message queues.
-s    Display information about active semaphore sets.
-b    Display largest-allowable-size information
-c    Display creator's login name and group name
-o    Display information on outstanding usage
-p    Display process number information
-t    Display time information


Examples

# ipcs -s
IPC status from /dev/kmem as of Fri Oct 17 12:56:36 1997
T     ID    KEY         MODE         OWNER    GROUP
Semaphores:
s      0    0x2f180002  --ra-ra-ra-  root     sys
s      3    0x412000a9  --ra-ra-ra-  root     root
s      4    0x00446f6e  --ra-r--r--  root     root
s      6    0x01090522  --ra-r--r--  root     root
s      7    0x013d8483  --ra-r--r--  root     root
s    200    0x4c1c2f79  --ra-r--r--  daemon   daemon

# ipcrm -s 7

# ipcs -s
IPC status from /dev/kmem as of Fri Oct 17 12:57:42 1997
T     ID    KEY         MODE         OWNER    GROUP
Semaphores:
s      0    0x2f180002  --ra-ra-ra-  root     sys
s      3    0x412000a9  --ra-ra-ra-  root     root
s      4    0x00446f6e  --ra-r--r--  root     root
s      6    0x01090522  --ra-r--r--  root     root
s    200    0x4c1c2f79  --ra-r--r--  daemon   daemon
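Building on the "orphan IPC entries" point above, a leftover shared memory segment from a crashed application can usually be spotted by its attach count. A brief sketch:

# ipcs -mo            # -o adds outstanding usage (NATTCH for shared memory)

A segment whose NATTCH stays at 0 while its owning application is no longer running is a candidate for removal with ipcrm -m shmid.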


2–37. TEXT PAGE: nice, renice

The nice command executes a command at a nondefault CPU scheduling priority. (The name is derived from being "nice" to other system users by running large programs at a weaker priority.) The renice command alters the nice value of an existing process.

Tool Source:      Standard UNIX (System V)
Documentation:    man pages
Interval:         on demand
Data Source:      process table
Type of Data:     processes
Metrics:          priority
Logging:          standard output device
Overhead:         minimal
Unique Feature:
Full Pathname:    /usr/bin/nice and /usr/bin/renice
Pros and Cons:    + allows less important processes to run in the background
                  + allows more important processes to run in the foreground
                  - not an intuitive interface or syntax

Syntax

nice [-n newoffset_from_default_20] command [command_args]
renice [-n newoffset_from_current_value] [-g|-p|-u] id ...

An unsigned newoffset increases the system nice value for the command or process, causing it to run at a weaker priority. A negative value requires superuser privileges and assigns a lower system nice value (stronger priority) to the process.

Examples

# ps -l
 F S  UID   PID  PPID  C  PRI NI  ADDR     SZ   WCHAN    TTY    TIME COMD
 1 S    0  6044  6042  1  158 20  ff6680    85  87cec0   ttyp2  0:00 sh
 1 R    0  8286  6044  6  179 20  1003d80   22  -        ttyp2  0:00 ps

# nice sh
# ps -l
 F S  UID   PID  PPID  C  PRI NI  ADDR     SZ   WCHAN    TTY    TIME COMD
 1 S    0  6044  6042 11  158 20  ff6680    85  87cec0   ttyp2  0:00 sh
 1 S    0  8290  8287  0  158 30  ff1680    85  100d3e0  ttyp2  0:00 sh
 1 R    0  8293  8290  4  199 30  feae80    22  -        ttyp2  0:00 ps

# exit

# nice -10 sh
# ps -l
 F S  UID   PID  PPID  C  PRI NI  ADDR     SZ   WCHAN    TTY    TIME COMD
 1 S    0  6044  6042  0  158 20  ff6680    85  87cec0   ttyp2  0:00 sh
 1 R    0  8297  8294  7  199 30  ff1280    22  -        ttyp2  0:00 ps
 1 S    0  8294  6044 10  158 30  fea380   121  87e0c0   ttyp2  0:00 sh

# nice -5 ps -l
 F S  UID   PID  PPID  C  PRI NI  ADDR     SZ   WCHAN    TTY    TIME COMD
 1 S    0  6044  6042  0  158 20  ff6680    85  87cec0   ttyp2  0:00 sh
 1 R    0  8304  8294 10  210 35  1003e80   22  -        ttyp2  0:00 ps
 1 S    0  8294  6044 10  158 30  fea380   121  87e0c0   ttyp2  0:00 sh

# nice -n 30 sh
# ps -l
 F S  UID   PID  PPID  C  PRI NI  ADDR     SZ   WCHAN    TTY    TIME COMD
 1 S    0  6044  6042  0  158 20  ff6680    85  87cec0   ttyp2  0:00 sh
 1 S    0  8305  8294 19  158 39  fb3300   121  87d6c0   ttyp2  0:00 sh
 1 S    0  8294  6044  6  158 30  fea380   121  87e0c0   ttyp2  0:00 sh
 1 R    0  8308  8305  4  220 39  feae80    22  -        ttyp2  0:00 ps

# exit
# nice -n -30 sh
# ps -l
 F S  UID   PID  PPID  C  PRI NI  ADDR     SZ   WCHAN    TTY    TIME COMD
 1 S    0  6044  6042  0  158 20  ff6680    85  87cec0   ttyp2  0:00 sh
 1 S    0  8306  8294  1  158 30  f86200   121  87dc40   ttyp2  0:00 sh
 1 S    0  8309  8306  7  158  0  fea380   121  87e0c0   ttyp2  0:00 sh
 1 R    0  8312  8309  6  139  0  1003980   22  -        ttyp2  0:00 ps
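The examples above exercise nice only; renice applies the same offset idea to a process that is already running. A brief sketch (the PID and user name are illustrative only):

# renice -n 10 -p 8294        # weaken PID 8294 by 10 nice units
# renice -n 5 -u operator     # weaken all of user operator's processes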


2–38. SLIDE: Performance Administrative Tools (HP-Specific)

                  Resource                                 Super User Access Required
getprivgrp        List system privileged groups            No
setprivgrp        Allocate special system privileges       Yes
rtprio            Set real time process priority (HP)      Privileged Access
rtsched           Set POSIX real time process priority     Privileged Access
scsictl           Set parameters on SCSI devices           Yes
serialize         Mark a program to run serially           Privileged Access
fsadm             Online JFS management tool               Yes
getext            Display JFS extent attributes            No
setext            Sets/changes JFS extent attributes       Yes
newfs             Create a file system                     Yes
tunefs/vxtunefs   Change a file system's attributes        Yes
PRM/WLM           Process Resource Mgr/Work Load Mgr       Yes
WebQoS            Web Quality of Service                   Yes

Student Notes

This slide shows the HP-specific administrative performance tools available on HP-UX systems. Many of the tools shown on the slide come standard with the base OS. The only tools that are add-on products are PRM, WLM, WebQoS, and Advanced JFS (getext, setext, and fsadm). These HP-specific tools were developed to allow modifications and performance enhancements to the functionality unique to the HP-UX operating system.


2–39. TEXT PAGE: getprivgrp, setprivgrp

The getprivgrp command lists the access privileges of privileged groups. The setprivgrp command sets the access privileges of privileged groups. If a group_name is supplied, access privileges are listed for that group only. The superuser is a member of all groups.

Access privileges include RTPRIO, RTSCHED, MLOCK, CHOWN, LOCKRDONLY, SETRUGID, MPCTL, SPUCTL, and SERIALIZE.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      /etc/group and kernel data structures
Type of Data:     users and groups
Metrics:          privilege access
Logging:          Standard output device
Overhead:         minimal
Unique Feature:   Gives non-root users access to privileges normally requiring root access.
Full Pathname:    /usr/bin/getprivgrp and /usr/sbin/setprivgrp
Pros and Cons:    + ability to assign additional privileges to groups
                  - requires additional system management
                  - cannot give privileges to a single user; must assign privileges to groups

Syntax

getprivgrp [-g|group_name]
setprivgrp [-g|group_name] [privileges]

-g    Specify global privileges that apply to all groups.

Examples

# getprivgrp
global privileges: CHOWN

# setprivgrp class CHOWN SERIALIZE RTPRIO
# getprivgrp
global privileges: CHOWN
class: RTPRIO CHOWN SERIALIZE
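To make such privileges survive a reboot, they are conventionally recorded in /etc/privgroup, which the startup scripts apply with the -f option. A sketch (the group name class carries over from the example above; confirm against setprivgrp(1M)):

# cat /etc/privgroup
class RTPRIO CHOWN SERIALIZE
# setprivgrp -f /etc/privgroup        # apply every entry in the file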


Notes

Group privileges that can be modified are:

RTPRIO       Can use the rtprio() call to set real-time priorities.
RTSCHED      Can use the sched_setparam() and sched_setscheduler() calls to
             set POSIX.4 real-time priorities.
MLOCK        Can use plock() to lock process text and data into memory, and
             the shmctl() SHM_LOCK function to lock shared memory segments.
CHOWN        Can use chown() to change file ownership.
LOCKRDONLY   Can use lockf() to set locks on files that are open for reading only.
SETRUGID     Can use setuid() and setgid() to change, respectively, the real
             user ID and real group ID of a process.
SERIALIZE    Can use serialize() to force the target process to run serially
             with other processes that are also marked by this system call.
MPCTL        Can use mpctl() to lock a process or a thread to a specific
             processor on SMP systems. If processor sets are available, can be
             used to lock a process or a thread to a specific processor set.
SPUCTL       Can use spuctl() to enable and disable specific processors on SMP
             systems. (V-class, T-class, N-class, L-class, and Superdome only)


2–40. TEXT PAGE: rtprio

The rtprio command executes a specified command with a real-time priority, or changes the real-time priority of a currently executing process with a specific PID. Real-time priorities range from zero (strongest) to 127 (weakest). Real-time processes are not subject to priority degradation and are considered of greater importance than all non-real-time processes.

CAUTION: Special care should be taken when using this command. It is possible to lock out other processes (including system processes) when using this command.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      process table
Type of Data:     process
Metrics:          process priority
Logging:          none
Overhead:         varies, depending on the activity of the process
Unique Feature:   assigns real-time priority to a process
Full Pathname:    /usr/bin/rtprio
Pros and Cons:    + Can significantly improve the performance of a program
                  - Can severely impact the performance of the system (if used incorrectly)

Syntax

rtprio priority command [arguments]
rtprio priority -pid
rtprio -t command [arguments]
rtprio -t -pid

-t    Execute command with a timeshare (non-real-time) priority, or change the
      currently executing process pid from a possibly real-time priority to a
      timeshare priority.

Examples

Execute file a.out at a real-time priority of 100:
rtprio 100 a.out

Set the currently running process PID 24217 to a real-time priority of 40
(note that, per the syntax above, an existing PID is given with a leading dash):
rtprio 40 -24217


2–41. TEXT PAGE: rtsched

The rtsched command executes commands with POSIX or HP-UX real-time priority, or changes the real-time priority of a currently executing process PID. All POSIX real-time priority processes are of greater scheduling importance than processes with HP-UX real-time or HP-UX timeshare priority. Neither POSIX nor HP-UX real-time processes are subject to degradation.

POSIX real-time processes can be scheduled with one of three different POSIX scheduling policies: SCHED_FIFO, SCHED_RR, or SCHED_RR2. The number of POSIX real-time priority queues is tunable between the values of 32 and 512, and these priorities show up as a negative number between -1 and -512 when viewed with the ps -ef or ps -el commands.

CAUTION: Special care should be taken when using this command. It is possible to lock out other processes (including system processes) when using this command.

Tool Source:      HP
Documentation:    man pages (also see rtsched(2))
Interval:         on demand
Data Source:      process table
Type of Data:     process
Metrics:          process priority
Logging:          none
Overhead:         varies, depending on the activity of the process
Unique Feature:   assigns real-time priority to a process
Full Pathname:    /usr/bin/rtsched
Pros and Cons:    + Can significantly improve the performance of a program
                  - Can severely impact the performance of the system (if used incorrectly)

Syntax

rtsched -s scheduler -p priority command [arguments]
rtsched [-s scheduler] -p priority pid

-s    Specifies which scheduler to use: SCHED_FIFO (POSIX real-time), SCHED_RR
      (POSIX real-time), SCHED_RR2 (POSIX real-time), SCHED_RTPRIO (HP-UX
      real-time), or SCHED_HPUX (HP-UX timeshare).
-p    Specifies the priority to use.


Examples

Execute file a.out at a POSIX real-time priority of 4:
rtsched -s SCHED_FIFO -p 4 a.out

Set the currently running process pid 24217 to a real-time priority of 20:
rtsched -s SCHED_RR -p 20 24217


2–42. TEXT PAGE: scsictl

The scsictl command provides a mechanism for controlling a SCSI device. It can be used to query mode parameters, set configurable mode parameters, and perform SCSI commands. The operations are performed in the same order as they appear on the command line.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      SCSI disks
Type of Data:     disks
Metrics:          immediate reporting, I/O queue
Logging:          standard output device
Overhead:         minimal
Unique Feature:   Provides control over the behavior of an individual SCSI disk
Full Pathname:    /usr/sbin/scsictl
Pros and Cons:    + can improve performance by modifying the drive behavior
                  - not all SCSI devices support the command
                  - could misconfigure a disk, causing data to be lost in the event of a system crash

Syntax

scsictl [-akq] [-c command]... [-m mode[=value]]... device

-a             Display the status of all mode parameters available.
-m mode        Display the status of the specified mode parameter:
               ir           For devices that support immediate reporting, this
                            displays the immediate reporting status.
               queue_depth  For devices that support a queue depth greater
                            than the system default, this mode controls how
                            many I/Os the driver will attempt to queue to the
                            device at any one time.
-m mode=value  Set the mode parameter mode to value. The available mode
               parameters and values are listed above.


Examples

To display a list of all of the mode parameters, turn immediate_report on, and redisplay the value of immediate_report:

scsictl -a -m ir=1 -m ir /dev/rdsk/c0t6d0

will produce the following output:

immediate_report = 0; queue_depth = 8; immediate_report = 1


2–43. TEXT PAGE: serialize

The serialize command is used to force the target process to run serially with other processes also marked by this command. Once a process has been marked by serialize, the process stays marked until process completion, unless serialize is reissued.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      process table
Type of Data:     process
Metrics:          priority
Logging:          standard output device
Overhead:         minimal
Unique Feature:   decreases CPU and memory contention problems using standard functionality
Full Pathname:    /usr/bin/serialize
Pros and Cons:    + allows system to behave more efficiently when CPU and memory resources are scarce
                  - minimal documentation
                  - only helps when CPU and memory resources are scarce

Syntax

serialize command [command_args]
serialize [-t] [-p pid]

-t    Indicates the process specified by pid should be returned to timeshare
      scheduling.

Examples

Use serialize to force a database application to run serially with other processes marked for serialization:
serialize database_app

Force a currently running process with a PID value of 215 to run serially with other processes marked for serialization:
serialize -p 215

Return a process previously marked for serialization to normal timeshare scheduling. The PID of the target process for this example is 174:
serialize -t -p 174


2–44. TEXT PAGE: fsadm

The fsadm command is designed to perform selected administration tasks on HFS (10.20 or later) and JFS file systems. These tasks may differ between file system types. For HFS file systems, fsadm allows conversions between largefiles and nolargefiles. For VxFS file systems, fsadm allows file system resizing, extent (and directory) reorganization, and largefiles/nolargefiles conversions.

Tool Source:      Veritas and HP
Documentation:    man pages
Interval:         on demand
Data Source:      File system superblock and header structures
Type of Data:     file system header and data
Metrics:          fragmentation
Logging:          standard output device
Overhead:         Medium to large (up to 33%), depending on the number of users and amount of activity
Unique Features:  Can defragment a file system, improving performance (JFS).
                  Can increase the size of a file system while it is mounted (JFS).
Full Pathname:    /usr/sbin/fsadm
Pros and Cons:    + provides greater manageability of file systems
                  - many features (including defragmentation) are only available for JFS
                  - requires purchasing the AdvancedJFS or OnlineJFS product

Syntax

/usr/sbin/fsadm [-F vxfs|hfs] [-V] [-o largefiles|nolargefiles] mount_point|special
/usr/sbin/fsadm [-F vxfs] [-V] [-b newsize] [-r rawdev] mount_point
/usr/sbin/fsadm [-F vxfs] [-V] [-d] [-D] [-s] [-v] [-a days] [-t time]
                [-p passes] [-r rawdev] mount_point

Examples

HFS Example

Convert a nolargefiles HFS file system to a largefiles HFS file system:
fsadm -F hfs -o largefiles /dev/vg02/lvol1

Display relevant HFS file system statistics:
fsadm -F hfs /dev/vg02/lvol1


JFS Example

Increase the size of the /var file system to 100 MB while it is mounted and online:
lvextend -L 100 /dev/vg00/lvol7
fsadm -F vxfs -b 102400 /var

Display fragmentation statistics for the /home file system:
fsadm -D -E /home
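The -D -E example above only reports fragmentation; the corresponding reorganization pass uses the lowercase flags. A sketch (verify the flags against fsadm_vxfs(1M) on your release):

fsadm -F vxfs -d -e /home      # reorganize directories (-d) and extents (-e)
fsadm -F vxfs -D -E /home      # re-check fragmentation afterward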


2–45. TEXT PAGE: getext, setext

The getext command displays extent attribute information for files on a JFS file system. The setext command allows attributes related to JFS file systems, and files within a JFS file system, to be modified and tuned.

Tool Source:      Veritas
Documentation:    man pages
Interval:         on demand
Data Source:      JFS file system
Type of Data:     File system metadata structures
Metrics:          File system space allocation
Logging:          standard output device
Overhead:         minimal
Unique Feature:   Allows attributes of JFS files to be set
Full Pathname:    /usr/sbin/getext and /usr/sbin/setext
Pros and Cons:    + can improve file system performance by modifying file attributes
                  - requires purchase of the AdvancedJFS or OnlineJFS product

Syntax

/usr/sbin/getext [-V] [-f] [-s] file...
/usr/sbin/setext [-V] [-e extent_size] [-r reservation] [-f flag] file

Example

Display file attributes for the file file1:

getext file1
file1:  Bsize  1024  Reserve  36  Extent Size  3  align noextend

The above output indicates a file with 36 blocks of reservation, a fixed extent size of 3 blocks, and all extents aligned to 3-block boundaries; the file cannot be extended once the current reservation is exhausted.
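The example shows getext only; a matching setext invocation that would produce attributes like those above is sketched here (values illustrative; whether flags are repeated with -f or comma-separated should be checked against setext(1M)):

setext -r 36 -e 3 -f align -f noextend file1

This reserves 36 blocks for the file, fixes the extent size at 3 blocks, aligns extents, and prevents extension beyond the reservation.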


2–46. TEXT PAGE: newfs, tunefs, vxtunefs

The newfs command is a "friendly" front end to the mkfs command. The newfs command calculates the appropriate parameters and then builds the file system by invoking mkfs. The tunefs command displays detailed configuration information for an HFS file system and allows some of the file system parameters to be modified.

Tool Source:      BSD 4.x, modified by HP; Veritas
Documentation:    man pages
Interval:         not applicable, on demand
Data Source:      file system header and superblock
Type of Data:     file system metadata structures
Metrics:          Block size, Fragment size, Minimum free space
Logging:          standard output
Overhead:         minimal
Unique Feature:   Allows file system parameters to be displayed and set.
Full Pathname:    /usr/sbin/newfs, /usr/sbin/tunefs, /usr/sbin/vxtunefs
Pros and Cons:    + File system parameters can be viewed and tuned for optimal performance
                  - To tune many parameters, a re-initialization of the file system is required

Syntax

/usr/sbin/newfs [-F FStype] [-o specific_options] [-V] special
/usr/sbin/tunefs [-A] [-v] [-a maxcontig] [-d rotdelay] [-e maxbpg]
                 [-m minfree] special-device
/usr/sbin/vxtunefs

Notes

The initial file system parameters are set when the file system is first created with newfs. A small set of these parameters can be changed after the file system is created with tunefs. vxtunefs changes the attributes of a JFS file system while the file system is mounted.

NOTE: The tunefs command works only for HFS file systems. JFS file systems use other commands (getext, setext, vxtunefs).
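No vxtunefs example appears on this page; typical usage is sketched below (the parameter and value are illustrative; see vxtunefs(1M) for the full tunable list):

# vxtunefs /var                           # display current VxFS tunables for /var
# vxtunefs -o read_pref_io=65536 /var     # set preferred read I/O size to 64 KB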


Examples

Create a file system on vg01 called lvol1:

newfs -F hfs -b 16384 -f 2048 /dev/vg01/rlvol1
mkfs (hfs): Warning - 2 sector(s) in the last cylinder are not allocated.
mkfs (hfs): /dev/vg01/rlvol1 - 20480 sectors in 133 cylinders of 7 tracks, 22 sectors
        21.0Mb in 9 cyl groups (16 c/g, 2.52Mb/g, 384 i/g)
Super block backups (for fsck -b) at:
 16, 2512, 5008, 7504, 10000, 12496, 14992, 17488, 19728

View the file system's configuration parameters:

tunefs -v /dev/vg01/rlvol1
super block last mounted on:
magic    95014   clean    FS_CLEAN    time    Fri Nov 28 07:02:58 1997
sblkno   8       cblkno   16          iblkno  24          dblkno   48
sbsize   2048    cgsize   2048        cgoffset 16         cgmask   0xfffffff8
ncg      9       size     10240       blocks  9858
bsize    16384   bshift   14          bmask   0xffffc000
fsize    2048    fshift   11          fmask   0xfffff800
frag     8       fragshift 3          fsbtodb 1
minfree  10%     maxbpg   38          maxcontig 1
rotdelay 0ms     rps      60
csaddr   48      cssize   28672       csshift 10          csmask   0xfffffc00
ntrak    7       nsect    22          spc     154         ncyl     133
cpg      16      bpg      154         fpg     1232        ipg      384
nindir   4096    inopb    128         nspf    2
nbfree   1230    ndir     2           nifree  3452        nffree   9
cgrotor  0       fmod     0           ronly   0           fname    fpack
cylinders in last group  5
blocks in last group     48

For VxFS file systems use:

# fsdb -F vxfs /dev/vgNN/rlvolN
> 8192 B
> p S


2–47. TEXT PAGE: Process Resource Manager (PRM)

Process Resource Manager (PRM) allows the administrator to guarantee that important processes will receive the amount of memory, disk I/O bandwidth, and CPU time required to meet your performance objectives. PRM works in conjunction with the standard HP-UX scheduler to improve response times for critical applications. PRM provides state-of-the-art resource allocation that has long been missing in the UNIX environment.

Tool Source:      HP
Documentation:    PRM man pages (prmconfig)
Interval:         on demand
Data Source:      kernel registers and counters
Type of Data:     process groups as defined by the PRM configuration file
Metrics:          CPU time, memory, and disk I/O bandwidth allocated to groups of processes
Logging:          standard output, glance, gpm, PerfView/OVPM, MeasureWare/OVPA
Overhead:         PRM only applies to time-shared processes. Real-time processes are not affected.
Unique Features:  Allows the system administrator to control which groups of
                  processes receive a certain percentage of the CPU's time,
                  memory paging, and/or disk I/O request preference:
                  CPU (per PRM group) entitlement and capping
                  DISK (per PRM group per VG) entitlement
                  Memory (per PRM group) entitlement, capping, and selection method
                  Application (per PRM group)
Full Pathname:    /usr/sbin/prmconfig
Pros and Cons:    + Greater control of resource distribution
                  - Optional product; does not come standard with the OS. If you
                    are running 11i in the Enterprise or Mission Critical
                    Operating Environments, PRM is included.

See the course U5447S – “HP-UX Resource Management with PRM & WLM” for a more complete discussion of PRM.
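As orientation only, PRM is driven by a configuration file of colon-separated records that prmconfig loads. The sketch below shows the general shape of CPU records; it is an assumption to be verified against prmconfig(1), not a definitive syntax reference:

# Hypothetical PRM CPU records: group name : PRM group ID : CPU shares
OTHERS:1:50::
batch:2:25::
online:3:25::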


2–48. TEXT PAGE: Work Load Manager (WLM)

The Work Load Manager (WLM) sits on top of PRM and tunes it as necessary to meet the desired performance goals. The goals are defined in a configuration file in the form of Service Level Objectives (SLOs). The administrator defines these goals in the file and then lets WLM "tweak" PRM until the goals are reached, or are approached as closely as possible.

Tool Source:      HP
Documentation:    WLM man pages (wlmd)
Interval:         on demand
Data Source:      kernel registers and counters
Type of Data:     process groups as defined by the WLM configuration file
Metrics:          As defined in the WLM configuration file
Logging:          Data can be sent to an EMS (Event Monitoring System)
Overhead:         Data collection of defined metrics and adjustment of the PRM configuration
Unique Features:  Allows the system administrator to define what Service Level
                  Objectives are desired on the system and lets WLM "tune" the
                  system (via PRM) to obtain performance as close to those
                  objectives as possible:
                  CPU (per WLM group) entitlement
                  DISK (per WLM group per VG) entitlement
                  Memory (per WLM group) entitlement
                  Application (per WLM group)
Full Pathname:    /opt/wlm/bin/wlmd
Pros and Cons:    + Greater control of CPU distribution
                  - Optional product; does not come standard with the OS.

See the course U5447S – “HP-UX Resource Management with PRM & WLM” for a more complete discussion of WLM.
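For orientation, activating a WLM configuration comes down to starting the daemon against the SLO file. A sketch (the file name is illustrative, and the -a flag should be confirmed against wlmd(1M)):

# /opt/wlm/bin/wlmd -a /etc/wlm/myconfig.wlm    # activate the SLO configuration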


2–49. TEXT PAGE: Web Quality of Service — WebQoS

WebQoS is an example of the growing number of system performance management/enhancement products focused on specific server applications and environments. The modern paradigm for application server management requires looking past simple performance metrics and forces us to start thinking a little outside the box. Do all requests received by a Web server warrant the same level of service? WebQoS allows the administrator to make decisions on the service level, based on several different criteria:

• admission control
• user differentiation
• activity differentiation
• application differentiation

A discussion of the specifics of this product is beyond the scope of this class.

Tool Source:       HP
Typical metrics:   Number of concurrent users and response times
Purpose:           Maximize successful customer interactions and peak throughput
Pros and Cons:     + Greater control of Web server resources tuned to specific client requests
                   - Optional product; does not come standard with the OS.


2–50. SLIDE: System Configuration and Utilization Information (Standard UNIX)

        Resource                                      Portability
bdf     Local and remote mounted file system space    Some
df      Mounted file system space                     Yes
mount   Local and remote file system mounts           Yes

Student Notes

This slide shows the standard UNIX tools for displaying system configuration and utilization information on an HP-UX system. System configuration and utilization tools are those that display configurations of LVM disks, file systems, and kernel resources.


2–51. TEXT PAGE: bdf, df

The bdf command displays the amount of free disk space available. If no file system is specified, the free space on all of the normally mounted file systems is printed. Free inode information can be displayed by using the -i option.

The df command displays the number of free 512-byte blocks and free inodes available for file systems by examining the counts kept in the superblock or superblocks. Blocks can be displayed in 1-KB units by using the -k option.

Tool Source:      df: Standard UNIX (System V); bdf: Standard UNIX (Berkeley 4.x)
Documentation:    man pages
Interval:         on demand
Data Source:      File system superblocks
Type of Data:     Disk space resources
Metrics:          Disk space utilization
Logging:          Standard output
Overhead:         Minimal
Unique Feature:   Shows how much disk space is being utilized.
Full Pathname:    /usr/bin/bdf, /usr/bin/df
Pros and Cons:    + Easy to use
                  - minimal tuning statistics

Syntax

/usr/bin/bdf [-b] [-i] [-l] [-t type | [filesystem|file] ...]
/usr/bin/df [-befgiklnv] [-t|-P] [-o specific_options] [-V] [special|directory]...

Examples — bdf Command

# bdf /usr
Filesystem       kbytes   used     avail   %used  Mounted on
/dev/vg00/lvol7  307200   279059   -9635   103%   /usr

# bdf -i /
Filesystem       kbytes  used   avail  %used  iused  ifree  %iuse  Mounted on
/dev/vg00/lvol3  40960   25093  14869  63%    3284   3960   45%    /

# bdf -ib /home
Filesystem       kbytes  used  avail  %used  iused  ifree  %iuse  Mounted on
/dev/vg00/lvol4  53248   3586  46546  7%     513    12407  4%     /home
Swapping         53248   0     40546  0%                          /home/paging

# ll /home/paging
total 0


Examples — df Command

# df
/home   (/dev/vg00/lvol4 ):    93062 blocks    12403 i-nodes
/opt    (/dev/vg00/lvol5 ):   177124 blocks    23598 i-nodes
/tmp    (/dev/vg00/lvol6 ):    90010 blocks    11982 i-nodes
/usr    (/dev/vg00/lvol7 ):    52732 blocks     7011 i-nodes
/var    (/dev/vg00/lvol8 ):   100122 blocks    13320 i-nodes
/stand  (/dev/vg00/lvol1 ):    23596 blocks     5358 i-nodes


2–52. TEXT PAGE: mount

The mount command is used to mount file systems on the system. Other users can use mount to list mounted file systems. If mount is invoked without any arguments, it lists all of the mounted file systems from the file system mount table, /etc/mnttab.

Tool Source:      Standard UNIX (System V)
Documentation:    man pages
Interval:         on demand
Data Source:      kernel mount table and /etc/mnttab file
Type of Data:     file system
Metrics:          file system type and mount options
Logging:          the file /etc/mnttab and standard output
Overhead:         minimal
Unique Feature:   used to mount HFS, JFS, and NFS file systems
Full Pathname:    /sbin/mount
Pros and Cons:    + displays valuable data regarding how file systems are mounted
                  - different options depending on the type of file system being mounted

Syntax

/usr/sbin/mount [-l] [-p|-v]

Examples

# mount -p
/dev/root        /       vxfs  log       0  0
/dev/vg00/lvol1  /stand  hfs   defaults  0  0
/dev/vg00/lvol6  /usr    vxfs  delaylog  0  0
/dev/vg00/lvol5  /tmp    vxfs  delaylog  0  0
/dev/vg00/lvol4  /opt    vxfs  delaylog  0  0
/dev/dsk/c0t4d0  /disk   hfs   defaults  0  0
/dev/vg00/lvol7  /var    vxfs  delaylog  0  0

# mount -v
/dev/root on / type vxfs log on Thu Sep 11 12:15:08 1997
/dev/vg00/lvol1 on /stand type hfs defaults on Thu Sep 11 12:15:11 1997
/dev/vg00/lvol6 on /usr type vxfs delaylog on Thu Sep 11 12:17:06 1997
/dev/vg00/lvol5 on /tmp type vxfs delaylog on Thu Sep 11 12:17:07 1997
/dev/vg00/lvol4 on /opt type vxfs delaylog on Thu Sep 11 12:17:07 1997
/dev/dsk/c0t4d0 on /disk type hfs defaults on Thu Sep 11 12:17:08 1997
/dev/vg00/lvol7 on /var type vxfs delaylog on Thu Sep 11 12:17:23 1997


2–53. SLIDE: System Configuration and Utilization Information (HP-Specific)

            Resource                                       Portability
diskinfo    Size and model of local disk drives            No
dmesg       I/O tree and memory details                    Some
ioscan      I/O tree and addressing                        No
vgdisplay   Local volume group contents/attributes         No
pvdisplay   Local physical volume contents/attributes      No
lvdisplay   Local logical volume contents/attributes       No
swapinfo    Swap space utilization                         No
sysdef      Sizes and values of kernel tables and parms    Some
kmtune      Query, set, or reset system parameters         Some
kcweb       Query, set, or reset system configuration      Some

Student Notes

This slide shows the HP-specific commands for displaying system configuration and utilization information. All the commands on the slide come standard with the base OS; none are add-on products. These commands display the configuration and utilization of HP-specific subsystems. Many of these commands have corresponding commands on other UNIX systems that perform similar functions.


2–54. TEXT PAGE: diskinfo

The diskinfo command determines whether the character special file named by character_devicefile is associated with a SCSI, CS/80, or Subset/80 disk drive. If so, diskinfo summarizes the disk's characteristics. Both the size of the disk and the bytes per sector represent formatted media.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      controller on disk
Type of Data:     disk specific
Metrics:          disk capacity, sector size
Logging:          standard output
Overhead:         minimal
Unique Feature:   shows model number and manufacturer of disk
Full Pathname:    /usr/sbin/diskinfo
Pros and Cons:    + can determine size and manufacturer of disk without having to open the system
                  - minimal tuning information

Syntax

/usr/sbin/diskinfo [-b|-v] character_devicefile

The diskinfo command displays information about the following characteristics of disk drives:

• vendor name, manufacturer of the drive (SCSI only)
• product identification number or ASCII name
• type, CS/80 or SCSI classification for the device
• size of disk, specified in bytes
• sector size, specified as bytes per sector

Example

# diskinfo /dev/rdsk/c0t6d0
SCSI describe of /dev/rdsk/c0t6d0:
             vendor: QUANTUM
         product id: PD425S
               type: direct access
               size: 416575 Kbytes
   bytes per sector: 512


2–55. TEXT PAGE: dmesg

The dmesg command looks in a system buffer for recently printed diagnostic messages and prints them on the standard output. The messages are those printed by the system when unusual events occur (such as when system tables overflow or file systems fill up).

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      kernel diagnostic buffer
Type of Data:     system diagnostic messages
Metrics:          kernel startup information
Logging:          standard output device
Overhead:         minimal
Unique Feature:   displays kernel diagnostic messages
Full Pathname:    /sbin/dmesg
Pros and Cons:    + Allows kernel diagnostic messages to be recalled
                  - Diagnostic messages can be lost, since the kernel buffer is a fixed size

Syntax

/usr/sbin/dmesg [-]

If the - argument is specified, dmesg computes (incrementally) the new messages since the last time it was run and places these on the standard output. This is typically used with cron (see cron(1)) to produce the error log /var/adm/messages by running the command

/usr/sbin/dmesg - >> /var/adm/messages

every 10 minutes.
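The matching root crontab entry for that 10-minute interval would look like the following sketch (the schedule field is the only assumption here; the command itself is taken from above):

0,10,20,30,40,50 * * * * /usr/sbin/dmesg - >> /var/adm/messages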

Example

# dmesg
Oct 17 12:39
vuseg=1815000
inet_clts:ok
inet_cots:ok
        1 graph3
        2 bus_adapter
    2/0/1 c720
  2/0/1.0 tgt
2/0/1.0.0 stape
  2/0/1.2 tgt
2/0/1.2.0 sdisk
  2/0/1.3 tgt
2/0/1.3.0 stape
  2/0/1.4 tgt
2/0/1.4.0 sdisk
  2/0/1.7 tgt
2/0/1.7.0 sctl
    2/0/2 lan2
    2/0/3 hil
    2/0/4 asio0
    2/0/5 asio0
    2/0/6 CentIf
    2/0/7 c720
  2/0/7.5 tgt
2/0/7.5.0 sdisk
  2/0/7.6 tgt
2/0/7.6.0 sdisk
  2/0/7.7 tgt
2/0/7.7.0 sctl
    2/0/8 audio
        4 eisa
    4/0/4 lan2
        8 processor
        9 memory
System Console is on the ITE
Networking memory for fragment reassembly is restricted to 36265984 bytes
Logical volume 64, 0x3 configured as ROOT
Logical volume 64, 0x2 configured as SWAP
Logical volume 64, 0x2 configured as DUMP
    Swap device table:  (start & size given in 512-byte blocks)
        entry 0 - major is 64, minor is 0x2; start = 0, size = 819200
    Dump device table:  (start & size given in 1-Kbyte blocks)
        entry 0 - major is 31, minor is 0x26000; start = 68447, size = 393217
Starting the STREAMS daemons.
B2352B HP-UX (B.10.20) #1: Sun Jun 9 08:03:38 PDT 1996
Memory Information:
    physical page size = 4096 bytes, logical page size = 4096 bytes
    Physical: 393216 Kbytes, lockable: 302512 Kbytes, available: 349504 Kbytes
Using 1932 buffers containing 15360 Kbytes of memory.
SCSI: Request Timeout -- lbolt: 7543017, dev: cd000001
lbp->state: 0
lbp->offset: ffffffff
lbp->uPhysScript: 2a24000
From most recent interrupt:
ISTAT: 06, SIST0: 04, SIST1: 00, DSTAT: 80, DSPS: 00000006
lsp: 1febc00  bp->b_dev: cd000001
scb->io_id: 57b13
scb->cdb: 08 00 00 08 00 00
lbolt_at_timeout: 7544517, lbolt_at_start: 7543017
lsp->state: 30d
lsp->uPhysScript: 196e000   lsp->upScript: 196a000
lsp->upActivePtr: 196a000   lsp->uActiveAdjust: 0
lsp->upSavedPtr: 196a000    lsp->uSavedAdjust: 0
lsp->upPeakPtr: 196a000     lsp->uPeakAdjust: 0
lbp->owner: 1febc00
scratch_lsp: 0
Pre-DSP script dump [1b20020]:
78051800 00000000 78030000 00000000
0e000002 02a24700 80000000 00000000
Script dump [1b20040]:
9f0b0000 00000006 98080000 00000005
98080000 00000001 58000008 00000000


2–56. TEXT PAGE: ioscan

The ioscan command scans system hardware, usable I/O system devices, or kernel I/O system data structures as appropriate, and lists the results. For each hardware module on the system, ioscan displays the hardware path to the hardware module, the class of the hardware module, and a brief description.

By default, ioscan scans the system and lists all reportable hardware found. The types of hardware reported include processors, memory, interface cards, and I/O devices. Scanning the hardware may cause drivers to be unbound and others bound in their place in order to match actual system hardware. Entities that cannot be scanned are not listed.

On very large systems, ioscan operates much faster with the -k option. This forces ioscan to read kernel structures built at boot time, rather than sending fresh inquiries to each hardware module.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      SCSI devices
Type of Data:     status and hardware address
Metrics:          hardware status
Logging:          standard output
Overhead:         minimal
Unique Feature:   polls SCSI bus to retrieve status of SCSI devices
Full Pathname:    /usr/sbin/ioscan
Pros and Cons:    + Displays hardware addresses and corresponding device filenames
                  - Minimal performance data

Syntax

/usr/sbin/ioscan [-k|-u] [-d driver|-C class] [-I instance] [-H hw_path] \
                 [-f[-n]|-F[-n]] [devfile]

Examples

# ioscan -f
Class     I  H/W Path   Driver       S/W State  H/W Type   Description
===========================================================================
bc        0             root         CLAIMED    BUS_NEXUS
graphics  0  0          graph3       CLAIMED    INTERFACE  Graphics
ba        0  2          bus_adapter  CLAIMED    BUS_NEXUS  Core I/O Adapter
ext_bus   0  2/0/1      c720         CLAIMED    INTERFACE  Built-in SCSI
target    0  2/0/1.0    tgt          CLAIMED    DEVICE
disk      0  2/0/1.0.0  sflop        CLAIMED    DEVICE     TEAC FC-1 HF 07
target    1  2/0/1.1    tgt          CLAIMED    DEVICE
tape      0  2/0/1.1.0  stape        CLAIMED    DEVICE     HP HP35470A
target    2  2/0/1.2    tgt          CLAIMED    DEVICE
disk      1  2/0/1.2.0  sdisk        CLAIMED    DEVICE     TOSHIBA CD-ROM XM-3301TA
target    5  2/0/1.5    tgt          CLAIMED    DEVICE
disk      4  2/0/1.5.0  sdisk        CLAIMED    DEVICE     QUANTUM FIREBALL1050S
target    6  2/0/1.6    tgt          CLAIMED    DEVICE
disk      5  2/0/1.6.0  sdisk        CLAIMED    DEVICE     QUANTUM PD425S
target    7  2/0/1.7    tgt          CLAIMED    DEVICE
ctl       0  2/0/1.7.0  sctl         CLAIMED    DEVICE     Initiator
lan       0  2/0/2      lan2         CLAIMED    INTERFACE  Built-in LAN
hil       0  2/0/3      hil          CLAIMED    INTERFACE  Built-in HIL
tty       0  2/0/4      asio0        CLAIMED    INTERFACE  Built-in RS-232C
tty       1  2/0/5      asio0        CLAIMED    INTERFACE  Built-in RS-232C
ext_bus   1  2/0/6      CentIf       CLAIMED    INTERFACE  Built-in Parallel Interface
audio     0  2/0/8      audio        CLAIMED    INTERFACE  Built-in Audio
processor 0  8          processor    CLAIMED    PROCESSOR  Processor
memory    0  9          memory       CLAIMED    MEMORY     Memory

# ioscan -fC disk
Class  I  H/W Path   Driver  S/W State  H/W Type  Description
=========================================================================
disk   5  2/0/1.6.0  sdisk   CLAIMED    DEVICE    QUANTUM PD425S

# ioscan -fnC disk
Class  I  H/W Path   Driver  S/W State  H/W Type  Description
=========================================================================
disk   5  2/0/1.6.0  sdisk   CLAIMED    DEVICE    QUANTUM PD425S
                     /dev/dsk/c0t6d0  /dev/rdsk/c0t6d0
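As noted in the description, -k answers from the kernel structures built at boot instead of probing the hardware, so it is fast and generates no bus traffic; it combines with the other flags:

# ioscan -kfnC disk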


2–57. TEXT PAGE: vgdisplay, pvdisplay, lvdisplay

The vgdisplay command displays information about volume groups. If a specific vg_name is specified, information for just that volume group is displayed. The pvdisplay command displays information about specific physical volumes (disks) within an LVM volume group. The lvdisplay command displays information about specific logical volumes within an LVM volume group.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      LVM header structures and /etc/lvmtab
Type of Data:     LVM configuration
Metrics:          mirroring, striping, other I/O policies
Logging:          standard output device
Overhead:         minimal
Unique Feature:   shows LVM configuration information
Full Pathname:    /usr/sbin/vgdisplay, /usr/sbin/pvdisplay, /usr/sbin/lvdisplay

Pros and Cons:
+ Only commands for viewing LVM configurations
- Minimal tuning capabilities

Syntax

/sbin/vgdisplay [-v] [vg_name ...]
/sbin/lvdisplay [-k] [-v] lv_path ...
/sbin/pvdisplay [-v] [-b BlockList] pv_path ...

Examples

# vgdisplay
--- Volume groups ---
VG Name                     /dev/vg00
VG Write Access             read/write
VG Status                   available
Max LV                      255
Cur LV                      9
Max PV                      16
Cur PV                      2
Max PE per PV               1016
VGDA                        4
PE Size (Mbytes)            4
Total PE                    726
Alloc PE                    279
Free PE                     447
Total PVG                   0

# pvdisplay /dev/dsk/c0t5d0
--- Physical volumes ---
PV Name                     /dev/dsk/c0t5d0
VG Name                     /dev/vg00
PV Status                   available
Allocatable                 yes
VGDA                        2
Cur LV                      7
PE Size (Mbytes)            4
Total PE                    249
Free PE                     0
Allocated PE                249
Stale PE                    0
IO Timeout                  default

# lvdisplay /dev/vg00/lvol1
--- Logical volumes ---
LV Name                     /dev/vg00/lvol1
VG Name                     /dev/vg00
LV Permission               read/write
LV Status                   available/syncd
Mirror copies               0
Consistency Recovery        MWC
Schedule                    parallel
LV Size (Mbytes)            48
Current LE                  12
Allocated PE                12
Stripes                     0
Stripe Size (Kbytes)        0
Bad block                   off
Allocation                  strict/contiguous
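To also see how each logical extent maps onto physical extents (useful when verifying striping or mirror placement), add the -v option shown in the syntax above; the listing is lengthy:

# lvdisplay -v /dev/vg00/lvol1 | more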


2–58. TEXT PAGE: swapinfo

The swapinfo command prints information about device and file-system paging space. This information includes reserved space as well as used swap space.

NOTE: The term swap refers to an obsolete implementation of virtual memory; HP-UX actually implements virtual memory by way of paging rather than swapping. This command and others retain names derived from "swap" for historical reasons.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      kernel swap tables
Type of Data:     swap space
Metrics:          swap used, swap reserved, swap space configurations
Logging:          standard output device
Overhead:         minimal
Unique Feature:   Command can total all configured swap space into a one-line summary. Displays pseudo-swap information (if configured).
Full Pathname:    /usr/sbin/swapinfo

Pros and Cons:
+ Provides valuable swap space configuration information
- Minimal documentation on pseudo-swap

Syntax

/usr/sbin/swapinfo [-mtadfnrMqw]

Examples

# swapinfo -t
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev      159744   19868  139876   12%       0       -    1  /dev/vg00/lvol2
reserve       -   51220  -51220
memory    42112   15300   26812   36%
total    201856   86388  115468   43%       -       0    -
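The -m option rescales the same report into Mbytes, which is often easier to read; combined with -t it still prints the one-line total:

# swapinfo -mt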


2–59. TEXT PAGE: sysdef

The sysdef command analyzes the currently running system and reports on its tunable configuration parameters.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      /stand/vmunix and the currently running kernel
Type of Data:     tunable kernel parameters
Metrics:          current configuration of kernel parameters
Logging:          standard output device
Overhead:         minimal
Unique Feature:   shows current value and possible range of values
Full Pathname:    /usr/sbin/sysdef

Pros and Cons:
+ Shows current setting of kernel parameters
- Reboot required to change most parameters

Syntax

/usr/sbin/sysdef [kernel [master]]

Example

# /usr/sbin/sysdef
NAME                  VALUE    BOOT  MIN-MAX         UNITS   FLAGS
acctresume            4        -     -100-100        -       -
acctsuspend           2        -     -100-100        -       -
allocate_fs_swapmap   0        -     -               -       -
bufpages              2841     -     0-              Pages   -
create_fastlinks      0        -     0-1             -       -
dbc_max_pct           50       -     -               -       -
dbc_min_pct           5        -     -               -       -
default_disk_ir       1        -     0-1             -       -
dskless_node          0        -     -               -       -
eisa_io_estimate      768      -     -               -       -
eqmemsize             15       -     -               -       -
file_pad              10       -     -               -       -
fs_async              0        -     -               -       -
hpux_aes_override     0        -     -               -       -
maxdsiz               16384    -     256-655360      Pages   -
maxfiles              60       -     30-2048         -       -
maxfiles_lim          1024     -     30-2048         -       -
maxssiz               2048     -     256-655360      Pages   -
maxswapchunks         256      -     1-16384         -       -
maxtsiz               16384    -     256-655360      Pages   -
maxuprc               75       -     3-              -       -
maxvgs                10       -     -               -       -
msgmap                2555904  -     -               -       -
nbuf                  4788     -     -               -       -
ncallout              292      -     306-            -       -
ncdnode               150      -     -               -       -
ndilbuffers           30       -     -               -       -
netisr_priority       -1       -     -1-127          -       -
netmemmax             5378048  -     -               -       -
nfile                 800      -     -               -       -
nflocks               200      -     -               -       -
ninode                476      -     -               -       -
no_lvm_disks          0        -     -               -       -
nproc                 276      -     -               -       -
npty                  60       -     -               -       -
nstrpty               60       -     -               -       -
nswapdev              10       -     1-25            -       -
nswapfs               10       -     1-25            -       -
public_shlibs         1        -     -               -       -
remote_nfs_swap       0        -     0-1             -       -
rtsched_numpri        32       -     -               -       -
sema                  0        -     -               -       -
semmap                4128768  -     -               -       -
shmem                 0        -     -               -       -
shmmni                200      -     3-1024          -       -
streampipes           0        -     0-              -       -
swapmem_on            1        -     0-1             -       -
swchunk               2048     -     2048-16384      kBytes  -
timeslice             10       -     -1-2147483648   Ticks   -
unlockable_mem        801      -     0-              Pages   -

Name    The name of the parameter
Value   The current value of the parameter
Boot    The value of the parameter at boot time
Min     The minimum allowed value of the parameter
Max     The maximum allowed value of the parameter
Units   The units by which the parameter is measured
Flags   Further describe the parameter
        M  Parameter may be modified without rebooting

A comparable command, introduced at HP-UX 11.00, is kmtune(1m).


2–60. TEXT PAGE: kmtune, kcweb

The kmtune command is used to query, set, or reset system parameters. kmtune displays the value of all system parameters when used without any options or with the -S or -l option. kmtune reads the master files and the system description files of the kernel and kernel modules. On 11i v2, kmtune is a front end to kctune and will eventually be replaced entirely by it. kctune is part of a new, larger utility called kcweb.

Tool Source:      HP
Documentation:    man pages
Interval:         on demand
Data Source:      /stand/vmunix and the currently running kernel
Type of Data:     tunable kernel parameters
Metrics:          current configuration of kernel parameters
Logging:          standard output device
Overhead:         minimal
Unique Feature:   works with dynamic and static kernel modules
Full Pathname:    /usr/sbin/kmtune

Syntax

/usr/sbin/kmtune [-l] [[-q name] ...] [-S system_file]
/usr/sbin/kmtune [[-s name{+|=}value] ...] [[-r name] ...] [-S system_file]

Examples

# /usr/sbin/kmtune
Parameter           Value
===================================================================
NSTRBLKSCHED        2
NSTREVENT           50
NSTRPUSH            16
NSTRSCHED           0
. . .

# /usr/sbin/kmtune -l -q maxdsiz
Parameter:  maxdsiz
Value:      0x04000000
Default:    0x04000000
Minimum:    -
Module:     -
Version:    (11i only)
Dynamic:    (11i only)
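To change a value rather than query it, kmtune -s stages the new setting in the system file (static tunables still require a kernel rebuild and reboot), and on 11i v2 kctune can apply dynamic tunables immediately. The value shown is illustrative only:

# kmtune -s maxdsiz=0x10000000
# kctune maxdsiz=0x10000000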


2–61. SLIDE: Application Profiling and Monitoring Tools (Standard UNIX)

Resource  Description                                          Super User Access Required
prof      Application Profiler                                 No
gprof     Enhanced Application Profiler                        No
arm       Define and measure response time of transactions
          for an application                                   No

Student Notes This slide shows the standard UNIX application profiling performance tools included with HP-UX. Application profiling tools provide in-depth details regarding the execution of a program, including the number of times each subroutine is called and the amount of time spent in each subroutine.


2–62. TEXT PAGE: prof, gprof

The prof and gprof tools are used to ascertain the library routines being called during the execution of a program. The prof utility profiles the execution of an application by displaying the names of the routines being called, the number of times the different routines were called, and how much time was spent in each routine.

The gprof utility is an enhanced version of prof. It shows all the information available with prof, plus it displays a call graph tree, which details the call hierarchy of the routines. The call graph tree lets you see which parent routines called which child routines.

Tool Source:      standard UNIX (System V)
Documentation:    man pages
Interval:         on demand
Data Source:      kernel routines called by the application
Type of Data:     function call flow
Metrics:          time spent in each function, number of times function was called
Logging:          binary file mon.out
Overhead:         significant delays in the execution of the application
Unique Feature:   shows the flow of the function calls
Full Pathname:    /usr/bin/prof

Pros and Cons:
+ Shows where an application is spending its time
- Requires access to source code
- Requires application to be recompiled

Syntax

prof [-tcan] [-ox] [-g] [-z] [-h] [-s] [-m mdata] [prog]
gprof [options] [a.out [gmon.out ...]]

Examples

cc -p prog.c -o program
./program
prof program

cc -G prog.c -o program
./program
gprof program


2–63. TEXT PAGE: Application Response Measurement (ARM) Library Routines

Description

The ARM library routines allow you to define and measure response time of transactions in any application that uses a programming language that can call a 'C' function. The ARM library is named "libarm" and is provided in two versions, an archive version and a shared library version. It is strongly recommended that you use the shared (sometimes referred to as dynamic) library version. In-depth discussion of this product is beyond the scope of this class.

NOTE: arm is a cross-platform tool and functionally replaces the ttd discussed in the next section. glance and gpm work equally well with either arm or ttd.

Documentation:        man 3 arm
Interval:             configurable
Platforms supported:  HP-UX, IBM AIX, Sun Solaris, NCR

Pros and Cons:
+ Integrates with PerfView/MWA and other distributed management/monitoring tools
- Requires source code modification

Syntax:

The six function calls used by arm are:

arm_init     Return a unique ID based on application and user.
arm_getid    Return a unique ID based on a transaction name.
arm_start    Mark the beginning of a specific transaction.
arm_update   Provide information or show progress of a specific transaction.
arm_stop     Mark the end of a specific transaction.
arm_end      Mark the end of an application.
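A minimal C sketch of the call sequence may help. The application and transaction names are hypothetical, error handling is omitted, and the argument lists follow the ARM 2.0 convention documented in man 3 arm; link against libarm:

#include <arm.h>   /* ARM API header; ARM_GOOD and the ID typedefs live here */

int main(void)
{
    /* IDs shown as plain ints for brevity; see man 3 arm for exact typedefs.
       The names "demo_app" and "demo_tran" are illustrative only. */
    int appl_id, tran_id, handle;

    appl_id = arm_init("demo_app", "*", 0, 0, 0);            /* register application  */
    tran_id = arm_getid(appl_id, "demo_tran", "", 0, 0, 0);  /* name the transaction  */

    handle = arm_start(tran_id, 0, 0, 0);                    /* begin timing          */
    /* ... the work being measured goes here ... */
    arm_stop(handle, ARM_GOOD, 0, 0, 0);                     /* end timing, status OK */

    arm_end(appl_id, 0, 0, 0);                               /* application finished  */
    return 0;
}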


2–64. SLIDE: Application Profiling and Monitoring Tools (HP-Specific)

Resource  Description                                              Super User Access Required
ttd       Tracks how much time is spent between specific
          lines of code in a program                               No
caliper   A runtime performance analyzer for programs compiled
          with C, C++ and Fortran 90 compilers on Itanium systems  No

Student Notes

This slide shows some HP-specific application profiling tools included with HP-UX. Currently, the Transaction Tracker (ttd) and caliper are available for monitoring application behavior and performance.

In 10.20, there was a tool called puma which came with all standard programming language compilers (like C, Pascal, and Fortran). The puma tool allowed profiling data to be collected without modifying the application source code or, in many cases, recompiling the application. puma has been excluded from the more recent releases of HP-UX.

The Transaction Tracker allows a programmer to time how long a program is spending within a certain area of code. The Transaction Tracker requires that the source code be modified to include the starting point and the stopping point. The Transaction Tracker is included as part of the MeasureWare/OVPA product. Transaction Tracker is HP-UX specific; arm (discussed earlier) is the generic version of the Transaction Tracker.

caliper is thread-aware, MP-aware, and features an easy command-line interface.


2–65. TEXT PAGE: Transaction Tracker

Description

The Transaction Tracker is a set of function calls that allow a programmer to time the execution of a particular body of code (referred to as a "transaction"). The function calls are inserted into the source code to mark where a particular transaction begins and ends. Glance and gpm can then be used to monitor how many times the transaction is called, and how long it takes for the transaction to complete.

Tool Source:      HP
Documentation:    MeasureWare Users manual
Interval:         every time a Transaction Tracker function call is invoked within the program
Data Source:      the ttd process
Type of Data:     application execution times
Metrics:          times to one hundredth of a second
Logging:          binary file /var/opt/perf/datafiles/logtrans
Overhead:         medium to large, depending on number of transactions being timed
Unique Feature:   shows amount of time spent in a particular body of code
Full Pathname:    function calls defined in /opt/perf/include/tt.h

Pros and Cons:
+ Integrated with glance and gpm; makes it easy to monitor how long transactions take
- Cannot be used within shell programs; C programs only (or programs which can call C routines)

Syntax

The four function calls used by Transaction Tracker are:

tt_getid   Names the transaction and returns a unique identifier.
tt_start   Signals the start of a unique transaction.
tt_end     Signals the end of the transaction.
tt_abort   Ends the transaction without recording times for the transaction.


2–66. TEXT PAGE: caliper — HP Performance Analyzer

Description

HP Caliper is a general-purpose performance analysis tool for applications on Itanium®-based HP-UX systems. HP Caliper allows you to understand the performance of your application and to identify ways to improve its run-time performance. HP Caliper works with any Itanium-based binary and does not require your applications to have any special preparation to enable performance measurement.

The two primary ways to use HP Caliper are:

• As a performance analysis tool.
• As a profile-based optimization (PBO) tool invoked by HP compilers.

The latest version of HP Caliper is available on the HP Caliper home page at http://www.hp.com/go/hpcaliper/.

Overview

HP Caliper helps you dynamically measure and improve the performance of your native Itanium-based applications in three ways:

• Commands to measure the overall performance of your program.
• Commands to drill down to identify performance parameters of specific functions in your program.
• A simple way to optimize the performance of your program based on its specific execution profile.

HP Caliper does not require special compilation of the program being analyzed and does not require any special link options or libraries. HP Caliper selectively measures the processes, threads, and load modules of your application. An application's load modules are the main executable and all shared libraries it uses. HP Caliper uses a combination of dynamic instrumentation of code and the performance monitoring unit (PMU) in the Itanium processor. HP Caliper uses the least-intrusive method available to gather performance data.

Supported Target Programs

HP Caliper includes support for:

• Programs compiled for Itanium- and Itanium 2-based systems. HP Caliper does not measure programs compiled for PA-RISC processors.
• Code generated by native and cross HP aC++, C++ and Fortran compilers, including inlined functions and C++ exceptions.
• Programs compiled with optimization or debug information, or both. This includes support for both the +objdebug and +noobjdebug options.
• Both ILP32 (+DD32) and LP64 (+DD64) programs, both 32-bit and 64-bit ELF formats.
• Archive-, minshared- or shared-bound executables.
• Both single- and multi-threaded applications, including MxN threads.
• Applications that fork() or vfork() or exec() themselves or other executables.
• Shell scripts and the programs they spawn.

Features

HP Caliper is simple to run because it uses a single command for all measurements. You specify the type of measurement and the target program as command-line arguments. For example, to measure the total number of CPU cycles used by a program named myprog, just type:

caliper total_cpu myprog

HP Caliper features include:

• Multiple performance measurements, each of which can be customized through configuration files.
• All reports are available in text format and comma-delimited (CSV) format, and most reports are also available in HTML format for easier browsing.
• Performance data can be correlated to your source program by line number.
• Easy inclusion and exclusion of specific load modules, such as libc, when measuring performance.
• Both per-thread and aggregated thread reports for most measurements.
• Performance data reported by function, sorted to show hot spots.
• Support for multi-process selection capabilities.
• The ability to save performance data in files that you can use to aggregate data across multiple runs to generate reports without having to re-run HP Caliper.
• The ability to attach and detach to running processes for certain measurements.
• The ability to restrict PMU measurements to specific regions of your programs.
• Limited support for dynamically generated code.


2–67. SLIDE: Summary

Summary

• Different categories of performance tools
• Standard UNIX tools versus HP-specific tools
• Separately purchasable tools
• Kernel register-based tools versus midaemon-based tools

Student Notes

To summarize this module, there are many performance tools for many different purposes. The objective of this module was to highlight all the performance tools available with HP-UX, to categorize them by function, and to describe how each tool works. In general, you should become most familiar with these tools:

sar
vmstat
top
glance/gpm (if available)

These will tend to be your most commonly used tools. Other tools tend to be useful in more specialized situations. Remember, never try to rely on just one tool to do everything. No tool will tell you everything, and every tool will mislead you somewhere down the line. No tool is perfect. That's why you need to be familiar with multiple tools.


2–68. LAB: Performance Tools Lab

Lab

Before we continue with a more focused discussion of glance and gpm, let's spend some time exploring the generic UNIX and HP-UX-specific tools discussed so far. As you answer the following questions, try to categorize each tool as to its type and scope.

Student Notes

The goal of this lab is to gain familiarity with performance tools. A secondary goal is to become familiar with the metrics reported by the tools, although these will be explored in depth over the coming days.

Directions

Set up. Change directories to:

# cd /home/h4262/tools

Execute the setup script:

# ./RUN

Use glance (or gpm if you have a bit-mapped display), sar, top, vmstat, and any other available tools to answer the following questions. List as many as possible, and include the appropriate OPTION or SCREEN, which will give the requested information.

Specific numbers are not the important goal of this lab. The goal is to gain familiarity with a variety of performance tools. Always investigate what the basic UNIX tools can tell you before running glance or gpm. You may want to run through this lab with the solution from the back of this book for more guidance and discussion.


1. How many processes are running on the system? Which tools can you use to determine this?

2. Are there any real-time priority processes running? If so, list the name and priority. What tools can you use to determine this?

3. Are there any nice'd processes on the system? If so, list the name and priority for each. What tools can you use to determine this?

4. Are there any zombie processes on the system? If so, how many are there? What tools can you use to determine this?

5. What is the length of the run queue? What are the load averages? What tools can you use to determine this?


6. How many system processes are running? What tools can you use to determine this?

NOTE: A system process is defined as a process whose data space is the kernel's data space (such as swapper, vhand, statdaemon, unhashdaemon, and supsched). ps reports their size as zero.

There are three ways this can be determined. If you get stuck on this question, move on. Don't spend more than a few minutes trying to answer this question.

7. What percentage of time is the CPU spending in different states? What tools can you use to determine this?

8. What is the size of memory? What is the size of free memory? What tools can you use to determine this?

9. What is the size of the swap area(s)? What is the percentage of swap utilization? What tools can you use to determine this?


10. What is the size of the kernel’s incore inode table? How much of the inode table is utilized? What tools can you use to determine this?

11. Are there any CPU-bound processes running (processes using a “lot” of CPU)? If so, what is the name of the process? What steps did you take to determine this?

12. Are there any processes running which are using a “lot” of memory? (A "lot" is relative, i.e. a large RSS size compared to other processes.) If so, what is the name of the process? What steps did you take to determine this? Is memory utilization changing?

13. Are there any processes running which are doing any disk I/O? If so, what is the name of the process? What steps did you take to determine this? What are the I/O rates of the disk bound processes? What files are open by this (these) process(es)? NOTE:

No processes are really doing a lot of physical disk I/O. However, lab_proc3 is doing a LOT of logical I/O.


14. What is the current rate of semaphore or message queue usage? What tools can you use to determine this?

15. Is there any paging or swapping occurring? What tools can you use to determine this?

16. What is the system call rate? What tools can you use to determine this?

17. What is the buffer cache hit ratio? What tools can you use to determine this?

18. What is the tty I/O rate? What tools can you use to determine this?

19. Are there any traps (interrupts) occurring? What tools can you use to determine this?


20. What information can you collect about network traffic? What tools can you use to determine this?

21. What information can be gathered on CPUs in an SMP environment? What tools can you use to determine this?

22. What information can be gathered on Logical Volumes? What tools can you use to determine this?

23. What information can be gathered on Disk I/O? What tools can you use to determine this?

24. Shut down the simulation by entering: # ./KILLIT


Module 3  GlancePlus

Objectives

Upon completion of this module, you will be able to do the following:

• Compare GlancePlus with other performance monitoring/management tools.
• Start up the GlancePlus terminal interface (glance) and graphical user interface (gpm).


3-1. SLIDE: This Is GlancePlus

This is GlancePlus

Features
• Motif-based interface that offers exceptional ease-of-learning and ease-of-use
• State-of-the-art, award-winning online Help system
• Rules-based diagnostics that use customizable system performance rules to identify system performance problems and bottlenecks
• Alarms that are triggered when customizable system performance thresholds are exceeded
• Tailor information gathering and display to suit your needs
• Integrated into OpenView environments

Capabilities
• Get detailed views of CPU, disk, and memory resource activity
• View disk I/O rates and queue lengths by disk device to determine if your disk loads are well balanced
• Monitor virtual memory I/O and paging
• Measure NFS activity
• And much more ...

Student Notes

GlancePlus is a performance monitoring diagnostic tool. GlancePlus software visually gives you the useful, accurate information you need to pinpoint potential or existing problems involving your system's CPU, memory, disk, or network utilization.

To help you monitor and interpret your system's performance data, GlancePlus software includes a rules-based adviser. Whenever threshold levels for measurements such as CPU utilization or disk I/O rates are exceeded, the adviser notifies you with on-screen alarms. The adviser also applies rules to key performance measurements and symptoms and then gives you information to help you uncover bottlenecks or other performance problems.

NOTE: GlancePlus is integrated into OpenView Windows at the menu bar level.


GlancePlus offers a viewpoint into many of the critical resources that need to be measured in the open system environment.

Benefits

• Save time and effort managing your system resources
• Better understand your computing environment
• Satisfy your end users' system performance needs quickly
• Leverage from a standard interface across vendor platforms

The features in the product yield a performance monitoring diagnostic solution that offers many benefits to the user. GlancePlus offers a tool that will make your analysis activities easier and quicker to perform. This will save you time. The display of various types of information will also allow you to get a better understanding of your own environment. The same GUI on the Motif version is used on all the supported platforms, which provides a leverage point for a standard user interface across several UNIX platforms. Many times, just by cursory use of the product, people will discover certain things about their systems. You do not have to have a performance problem to use GlancePlus. This simple cursory use of the product has let many people gain a better understanding of their systems. This helps out when a problem does exist. Knowing what is normal can help identify what has become abnormal in your environment.


3-2. SLIDE: GlancePlus Pak Overview

GlancePlus Pak Overview

Central management system (PerfView):
PerfView Planner    Forecasting and capacity planning
PerfView Monitor    Central alarm monitoring and event management
PerfView Analyzer   Performance analysis and correlation

Managed node (networks, systems, Internet, applications, databases):
MeasureWare         Performance data collection and alarming
GlancePlus          Online performance monitoring and diagnostics

Student Notes The view here is from the heights. For our purposes, we will focus our discussion on the capabilities of glance and gpm and the information and reports they can produce from a running HP-UX system. Also understand that GlancePlus may be used in conjunction with MeasureWare/OVPA to enhance and extend its capabilities. Many of you may have purchased glance in the GlancePlus Pak, which includes a license to run glance, gpm and to configure and run the MeasureWare/OVPA Agent (mwa) on your system. The GlancePlus and MeasureWare/OVPA Agent products can be purchased separately or combined in the GlancePlus Pak. The Pak also includes (as of C.03.58.00 June 2002 application release) some event monitoring and graphical configuration components.


The components share a common measurement infrastructure and metric definitions, and they use similar alarming mechanisms for applications.

GlancePlus Pak component interfaces:

GlancePlus        /opt/perf/bin/gpm, /opt/perf/bin/glance
MeasureWare/OVPA  /opt/perf/bin/extract, /opt/perf/bin/utility
PerfView/OVPM     /opt/perf/bin/pv

Complete information on the configuration and use of MWA/OVPA and PerfView/OVPM is fully covered in the Hewlett-Packard Education Services' course: PerfView MeasureWare (catalog number B5136).


3-3. SLIDE: gpm and glance

gpm and glance

Student Notes

GlancePlus provides dual user interfaces:

The gpm GUI
• See history of activity of the system with multiple window capability
• Monitor your system while doing other work
• Use alarms, symptoms and color to assist with monitoring

The glance Character Mode
• Monitor performance remotely over a slow datacom line
• Use when no high resolution monitor is available
• Creates less load on the system being monitored


Notes on starting the user interfaces: gpm and glance

Starting the GUI:

# gpm [options]

Starting the character-based interface:

# glance [options]

-nosave      Do not save the current configuration at the next exit
-j interval  Preset the number of seconds between screen refreshes
-rpt         Specify one or more additional report windows
-p dest      Specify the continuous print option destination
-sharedclr   Share color scheme with other applications
-lock        Allows glance to lock itself into memory
-nice        Set the gpm or glance nice value
Xoptions     Use X-Toolkit options such as -display
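For example, to start the character-mode interface with a 5-second refresh interval and without saving configuration changes on exit (both options are listed above):

# glance -j 5 -nosave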


3-4. SLIDE: glance — The Character Mode Interface

glance — The Character Mode Interface

Student Notes With glance you can run on almost any terminal or workstation, over a serial interface and relatively slow data communication links, and with lower resource requirements. The default Process List screen is shown in the above screen capture, and provides general data on system resources and active processes. In addition, the user may “drill down” to more specific levels of detail in areas of CPU, memory, disk I/O, network, NFS system calls, swap, and system table screens. Specific details on a per-process level are also available through the individual process screens. For your convenience, the next two pages contain a hot key quick reference guide for the glance character mode interface.


Glance Hot Key Quick Reference

Top Level Screen Hot Keys

Hot Key   Screen Displayed/Description
a         CPU By Processor
c         CPU Report
d         Disk Report
g         Process List
i         I/O By File System
l         Network By Interface
m         Memory Report
n         NFS By System
t         System Tables Report
u         I/O By Disk
v         I/O By Logical Volume
w         Swap Space
A         Application List
B         Global Waits
D         DCE Global Activity
G         Process Threads
H         Alarm History
I         Thread Resources
J         Thread Wait
K         DCE Process List
N         NFS Global Activity
P         PRM Group List
T         Transaction Tracker
Y         Global System Calls
Z         Global Threads
?         Commands Menu


Secondary Level Screen Hot Keys

Hot Key   Screen Displayed/Description
S         Select a NFS system/Disk/Application/Trans/Thread
s         Select a single process
F         Process Open Files
L         Process System Calls
M         Process Memory Regions
R         Process Resources
W         Process Wait States

Miscellaneous Screen Hot Keys

Hot Key   Screen Displayed/Description
b         Scroll page backward
f         Scroll page forward
h         Online HELP
j         Adjust refresh interval
o         Adjust process threshold
p         Print toggle (start|stop auto-printing)
e/q       Quit GlancePlus
r         Refresh the current screen
y         Renice a process
z         Reset statistics to zero
>         Display next logical screen
<         Display previous logical screen
!         Invoke a shell


3-5. SLIDE: Looking at a glance Screen

Looking at a glance Screen

Student Notes

Above is an example of an easy and common performance problem: a runaway looping process. Why is the global CPU utilization < 100%, although the sum of the individual process CPU utilizations > 100%? Hint: Is this a UP or MP system?

Also note that / (slashes) are used in glance reports to separate current metric values from cumulative averages.

NOTE: For the record, there were two CPUs on this system.


On a three-way multiprocessor system with two processes in the same application looping, each process can use nearly 100% of one of the CPUs. Over a 10-second interval, each uses nearly 10 seconds of CPU time, so the application uses nearly 20 seconds of CPU time in 10 seconds of elapsed time. Process CPU utilization is 100% for each of the two looping processes, but global CPU utilization would be 66%.

On HP-UX 11.0, processes can have multiple threads, each of which can consume CPU time independently of the others. On a four-way MP system, with one process that has three threads looping, the process as a whole uses 300% of a CPU. The application and global CPU utilization would report the CPU utilization at 75%.


3-6. SLIDE: gpm — The Graphical User Interface

gpm — The Graphical User Interface

Student Notes gpm presents the same metrics as character-mode glance in graphical form. Significant global metrics, as well as bottleneck adviser symptom status and alarms are shown in the main window. The process list, as well as other reports, is available via menu selections. The process list is very customizable (and customizations are preserved) with filters, sorting, highlights, chosen metrics, and column rearrangement. The online User’s Guide is very useful. The ? button on every window is a shortcut into the on-item help, which is useful especially for metric definitions.


This is another screen shot of the gpm interface.

Note the icon reflecting an adviser alarm.


3-7. SLIDE: Process Information

Process Information

• Detailed data on each active process
• CPU data
• Disk I/O data
• Memory use
• Wait reasons
• Open files

Process Features
• Access via Main Reports selection Process List
• Each process has:
  - Process Resources
  - Open Files

Student Notes The Process Information screen in gpm presents the user with detailed information on each active process (including CPU utilization, disk I/O data, memory usage, wait state reasons, open() file information, and so on). This screen also allows the user to select a specific process and "drill down" to greater detail via the Reports selection menu.

Resource Diagnostic Monitoring GlancePlus provides an abundant set of performance metrics to help analyze the current system. Careful thought and consideration have been given to ensure that the proper metrics are displayed. The product with its Motif GUI offers a way to efficiently display performance information, without overloading the customer with screen after screen of detailed data.


Customizable GUI

GlancePlus uses the power of Motif and its industry-leading approach to display technology to provide the user with a powerful graphical user interface that can be customized to fit your needs. Fonts, color, window size and more are configuration options. Additional configuration choices are available in "list" windows to allow easy manipulation of column tabular data for display and sort uses.

The gpm Process List and GlancePlus Main screen provide a pull-down menu to access the numerous, detailed Report screens. These reports allow a logical approach to the extensive amount of system resource and process-specific data:

• Resource History Window
• CPU Info
• Memory Info
• Disk Info
• Network Info
• System Info
• Global Info
• Swap Space
• Wait States
• Transaction Tracking
• Application List
• PRM Group List
• Process List
• Thread List

The next level of each report contains additional graphs and tables.


3-8. SLIDE: Adviser Components

Adviser Components

• Adviser Windows
  - Symptom History
  - Symptom Status/Snapshot
  - Alarm History
  - Adviser Syntax
• Button Label Colors
  - Alarm Button for Alarm Statements
  - Graph Buttons for Symptom Statements
• Icon Border Color (in OpenView)
  - Changes to Red or Yellow on Alarms

Student Notes GlancePlus supports performance alarms and a rules-based adviser to help automate the interpretation of performance data. The alarm rules can be customized by the user to reflect local system characteristics. Note: Both interfaces will report alarms, and the same syntax is used for alarms in glance and gpm. Alarms are configured through the /var/opt/perf/advisor.syntax file.


3-9. SLIDE: adviser Bottleneck Syntax Example

adviser Bottleneck Syntax Example

# The following symptoms are used by the default Alarm Window
# Bottleneck alarms. They are re-evaluated every interval and the
# probabilities are summed. These summed probabilities are checked
# by the bottleneck alarms. The buttons on the gpm main window will
# turn yellow when a probability exceeds 50% for an interval, and
# red when a probability exceeds 90% for an interval. You may edit
# these rules to suit your environment:

symptom CPU_Bottleneck type=CPU
  rule GBL_CPU_TOTAL_UTIL > 75 prob 25
  rule GBL_CPU_TOTAL_UTIL > 85 prob 25
  rule GBL_CPU_TOTAL_UTIL > 90 prob 25
  rule GBL_PRI_QUEUE      >  3 prob 25

alarm CPU_Bottleneck > 50 for 2 minutes
  start
    if CPU_Bottleneck > 90 then
      red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
    else
      yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  repeat every 10 minutes
    if CPU_Bottleneck > 90 then
      red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
    else
      yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  end
    reset alert "End of CPU Bottleneck Alert"

Student Notes The bottleneck alarms are a little complex. The CPU bottleneck symptom definition and corresponding alarm is shown. Just because a resource is fully utilized doesn’t mean that it is a bottleneck. It is only a bottleneck if there is activity that is hindered waiting for that resource. Therefore, utilization alone is not a good bottleneck indicator. Both utilization and queue lengths are combined to define the symptom probability. Some of the key metrics for performance analysis are the ones we use in the default syntax to define bottleneck alarms.
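As a worked example of how these rules combine: if GBL_CPU_TOTAL_UTIL is 92% and GBL_PRI_QUEUE is 4 for an interval, all four rules fire and the CPU_Bottleneck probability sums to 100%. Once the symptom has stayed above 50 for 2 minutes the alarm starts, and because the value exceeds 90 the alert is red rather than yellow.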


3-10. SLIDE: The parm File

The parm File

application = application_name
file = file_name, ...
user = user_name, ...
group = group_name, ...
priority = range

application = and its associated parameters define the logical groupings used to define each application on the machine. Examples:

application=Real Time
priority=0-127

application=Prog Dev Group 1
file=vi,xdb,abb,ld,lint
user=bill,debbie

application=Prog Dev Group 2
file=vi,xdb,abb,ld,lint
user=ted,rebecc,test*

application=Compilers
file=cc,ccom,pc,pascomp

parm file application definitions are used by both GlancePlus and MeasureWare. A .parm file in a user's $HOME directory will override the system parm file.

Student Notes By now you are starting to see the range and scope of the performance metric data that glance and gpm display. While this is invaluable when it comes to understanding the behavior of a single process, many times what we really need is to evaluate and baseline the performance of an entire application suite. This could be achieved by adding up the individual metrics of all processes within the application suite, but this could be a daunting task for all but the simplest of applications. Through the use of the configuration file /var/opt/perf/parm, glance and gpm can help to collect all the metrics from the individual processes within an application suite and present the information in a concise manner for your review. One challenge is in the definition of what constitutes an application. To address this issue, the parm file has several different methods for describing which processes belong to which application definition. Application member processes can be defined by their UID, the front-store file from which they were fork()'d , the priority at which they execute, their GID, or any combination of the above. This provides a very versatile framework for application profiling.


NOTE:

glance and gpm share the same application definitions (via the parm configuration file) as mwa.

# /var/opt/perf/parm for host system "garat"
id = garat

# Parameters for what data classes scopeux will log:
log global application process dev=disk,lvm transaction

# Parameters to control maximum size of scopeux logfiles:
size global=10, application=5, process=2, device=1, transaction=1.5

# Thresholds which determine what process data scopeux will log:
threshold cpu = 1, disk = 1, nonew, nokilled

# Web server:
application = WWW
user = www or file = httpd

# Untrustworthy users:
application = HighRisk
user = fred,barney,root

The order in which applications are defined is very important. Once a process meets the definition of an application, its data will be contributed to that application's metrics. Care must be taken to assure that ambiguity is avoided in the definition of applications.
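A short sketch of the ordering pitfall (the Everything definition and its wildcard are hypothetical; HighRisk is from the example above). Definitions are matched top-down, so a catch-all placed first starves everything after it:

# BAD ORDER: the wildcard definition catches every process first,
# so the HighRisk application below never receives any data.
application = Everything
user = *

application = HighRisk
user = fred,barney,root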


3-11. SLIDE: GlancePlus Data Flow

GlancePlus Data Flow

The slide diagrams the product's data flow: glance (terminal display plus adviser output) and gpm (Motif display) both read the adviser definitions, and they share application definitions from the parm file. Each obtains its metrics from shared memory maintained by the midaemon, which in turn collects from the HP-UX kernel and its KI (kernel instrumentation) trace interfaces.

Student Notes Without going into a lot of detail, note that both interfaces share a common instrumentation source and common application definitions. Instrumentation comes partly from interfaces also accessed by standard UNIX utilities such as vmstat, and partly from special HP-UX KI trace-based instrumentation. There is no generally available API to these interfaces. They are written specifically for use by GlancePlus and MeasureWare/OVPA.
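Because both interfaces depend on the midaemon for their trace-based metrics, a quick sanity check when Glance misbehaves is to confirm the daemon is running (/opt/perf/bin is the product's executable directory, per the next page):

# ps -ef | grep midaemon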


Significant Directories

/opt/perf               Product files from installation media
/opt/perf/bin           Executables
/opt/perf/ReleaseNotes  Release Notes
/opt/perf/examples      Supplementary configuration examples
/opt/perf/paperdocs     Electronic versions of documentation
/var/opt/perf           Product and configuration files created during and after installation

Always check ReleaseNotes for version-specific information. (New for C.02.30 and later releases: example configuration files) Config files come from /opt/perf/newconfig if they don’t already exist under /var/opt/perf. Compare new default parm file with that on your system if you are updating from a previous release. The directory /var/opt/perf contains the status and data files.
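One direct way to make the comparison suggested above after updating (assuming the shipped default keeps the name parm under newconfig):

# diff /opt/perf/newconfig/parm /var/opt/perf/parm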


3-12. SLIDE: Key GlancePlus Usage Tips

Key GlancePlus Usage Tips

• Use it for "What's going on right now."
• The gpm online help is very useful, especially on-item help.
• Drill down from higher level reports to more detailed resource reports.
• Understand what the adviser is telling you.
• Sort, filter, and choose metrics in gpm; especially the Process List.
• In character-mode glance use:
  - ? screen to navigate
  - h for help
  - o screen for setting thresholds and process list sorting
• Edit the adviser alarms to be right for you.
• Adjust the update interval to control CPU overhead.
• Process details including thread lists, wait states, memory regions, open files, and system call reports can be used to impress your programming staff! 8^)

Student Notes


3-13. SLIDE: Global, Application, and Process Data

Global, Application, and Process Data

• Global metrics reflect system-wide activity (sum of all applications).
• Process metrics reflect specific per-process (including thread) activity.
• Application metrics sum activity for a set of processes. They keep track of activity for all processes, however short-lived, even if they are not reported individually.
• Glance updates all metric values at the same time. MeasureWare summarizes Global, Application, and other class data over 5-minute intervals and summarizes Process data over 1-minute intervals.
• Multiprocessor effects: Global and Application CPU percentages reflect normalization over the number of processors (percentage of availability for the entire system). Process and thread-level CPU percentages are not normalized by the number of processors.

Student Notes It is important to understand the interrelationships among metric classes.


3-14. SLIDE: Can't Solve What's Not a Problem

Can't Solve What's Not a Problem!

• A looping process by itself is not a problem.
• Know what's "normal" for your environment.
• Keep historical performance data for reference.
• Measure response times.
• Use the tools to find out what is affecting performance.
• Isolate bottlenecks and address them when there is a problem.
• When tuning, make only one change at a time and then measure its effect.
• Document everything you do!
• Optimize your time resource: don't fix what isn't broken; sometimes more hardware is the cheapest answer; set yourself up to react quicker next time.

Student Notes One of the hardest skills is to determine what to measure and how to interpret its significance. After all, if the user’s response time is satisfactory, then oftentimes there is “no problem” even if an operation metric is higher than normal.


3-15. SLIDE: Metrics: "No Answers without Data"

Metrics: "No Answers without Data"

• Rate and utilization metrics are more useful than counts and times, because they are independent of the collection interval.
• Cumulative metrics measure over the total duration of collection.
• Most metrics are broken down into subsets by type. Work from the top down.
• Blocked states reflect individual process or thread wait reasons. Global queue metrics are derived from process blocked states.
• CPU is a "symmetric" resource. The scheduler will balance load on the multiprocessor, whereas disk and network interface activity depend on where data is located.
• Memory utilization is not as important as paging activity and buffer cache sizing.

Student Notes

CPU utilization and disk I/O rates compare well on different summarization intervals, whereas CPU time and I/O counts are always larger when the collection interval grows.

Examples of breakdowns: the global disk I/O rate is a sum of the BYDSK_ metrics; each class in turn breaks down activity between reads and writes, and between file system, raw, and system access. For disk bottlenecks, it is often useful to correlate between the DSK, FS, and LV classes.

Memory utilization is frequently nearly 100% with dynamic buffer cache. If page-outs occur, or in raw disk access environments, shrink the buffer cache to avoid paging.

Programmers frequently don't know they can view specific system-call metrics, as well as memory region and open file information, on a per-process basis.


3-16. SLIDE: Summary

Summary

• Don't try to understand all the capabilities and extensions to the tools, just the ones of most use to you.
• Start with developing an understanding of what is "normal" on your systems.
• Refine and develop alarms customized for your environment.
• Work from examples in documentation, gpm online help, config files, and example directories.

Student Notes

Remember that performance tuning is an art, and the following two rules apply to most engagements:

Rule #1: When answering a question about computer system performance, the initial answer is always, "It depends."

Rule #2: Performance tuning always involves a trade-off.

Suggested reading: HP-UX Tuning and Performance by Robert F. Sauers and Peter S. Weygant, available through the Hewlett-Packard Professional Books, Prentice Hall Press (ISBN 0-13-102716-6)


3-17. SLIDE: HP GlancePlus Guided Tour

HP GlancePlus Guided Tour

Topics

• Main Window
• CPU Bottlenecks
• Memory Bottlenecks
• Configuration Information
• Alarm and Symptoms

Student Notes To take the guided tour of GlancePlus, run the gpm GUI and select Help on the menu bar. Next, select the Guided Tour option. This will introduce you to the product. It features captured “windows” of the actual product, with annotations to help point out the important features of certain screens or windows. Quick Tip: gpm provides an excellent online Help system. Click the right mouse button for the On-Item Help feature. For help in glance, press the h key.


3-18. LAB: gpm and glance Walk-Through

Directions

The following lab is intended to familiarize the student with gpm and glance. To achieve this result, the lab will "walk the student through" a number of windows and tasks in both the ASCII and X-Windows versions of gpm and glance.

The Graphical Version: GlancePlus

1. Log in. If you have not already done so, please log into the system with the user name and password provided by your instructor.

2. Start GlancePlus. From a terminal window, invoke GlancePlus by entering gpm.

# gpm

In a few seconds gpm will come up. The first thing will be a license notification informing you that you are starting a trial version of GlancePlus, along with ordering and technical support information. On the gpm Main screen, you will see four graphs for CPU, Memory, Disk, and Networking. By default, the graphs are in the resource history format. This means that for each interval (configurable) there will be a data point on the graph, up to the maximum number of intervals (also configurable).

3. Interval Customizations. Click on Configure in the menu bar, and select Measurement. Set the sample interval to 10 seconds and the number of graph points to 50. This will allow you to see up to 500 seconds of system history. Click on OK.

NOTE: This setting will be saved for you in your home directory in a file called $HOME/.gpmhp-system_name. This means that all GlancePlus users will have their customizations saved.

Start a program from another window:

# cd /home/h4262/cpu/lab1
# ./RUN &

4. Main Window. Below each graph within the GlancePlus Main window, you will find a button. These buttons display the status color of adviser symptoms. This is a powerful feature of GlancePlus that we will investigate later. Clicking on one of these buttons displays details of that particular graph. To view the advisor symptoms from the main window, select:

Adviser -> Edit Adviser Syntax

This will display the definitions of the current symptoms being monitored by GlancePlus. Close the Edit Adviser Syntax window.


View CPU details: Click the CPU button. To view a detailed report regarding the CPU, select:

Reports -> CPU Report

Select:

Reports -> CPU by Processor

This is a useful report, even on a single processor system.

5. On Line Help. One method for accessing online help within GlancePlus is to click on the question mark (?) button. The cursor changes to a ?. Click on the column heading, NNice CPU %. This opens a new window describing the NNice CPU % column. View descriptions for other columns, including the SysCall CPU %. When finished viewing online help for columns, click on the question mark one more time. This returns the cursor to normal.

6. Alarms and Symptoms. A symptom is some characteristic of a performance problem. GlancePlus comes with predefined symptoms, or the user can define his own. An alarm is simply a notification that a symptom has been detected. From the main window, select:

Adviser -> Symptom History

For each defined symptom, a history of that particular symptom is displayed graphically. The duration is dependent on the glance history buffers, which are user-definable. Close the window.

Click on the ALARM button in the main window. This displays a history of all the alarms that have occurred since GlancePlus was started. Up to 250 alarms can be displayed. Close the window.

7. Process Details. Close all windows except for the main window. Select:

Reports -> Process List

This shows the "interesting" processes on the system (interesting in terms of size and/or activity). To customize this listing, select:

Configure -> Choose Metrics

This will display an astonishing number of metrics, which can be chosen for display in this report. This is also a quick way to get an overview of all of the process-related


This will display an astonishing number of metrics that can be chosen for display in this report; it is also a quick way to get an overview of all of the process-related metrics available in GlancePlus. Note that the familiar ? button is also available from this window. Use the scroll bar to find the metric PROC_NICE_PRI. Select this metric and click on OK. Close this window by clicking on OK.

8. Customizations.

Most display windows can be customized to sort on any metric, and to arrange the metrics in any user-defined order. To define the sort fields, select:

Configure -> Sort Fields

The sort order is determined by the order of the columns. Placing a particular metric into column one makes it the first sort field. If multiple entries have the same value within this field, then the second column is used to determine the order between those entries. If further sorting is needed, then the third column is used, and so forth down the line.

To sort on cumulative CPU percentage, click on the column heading CPU % Cum. The cursor will become a crosshair. Scroll the window back to column one, and click on column one. This makes CPU % Cum the first sort field. Arrange the sort order so that CPU % is followed by CPU % Cum. Click Done when finished. This sort order is automatically saved, so the next time processes are viewed, this will remain the sort order.

In a similar fashion, the order of the columns can also be arranged. To define the column order, select:

Configure -> Arrange Columns

Select a column to be moved (for example, CPU % Cum). The cursor will become a crosshair. Scroll the window to the location where the column is to be inserted, and click on the column where it is to be inserted. Arrange the first four columns to be in the following order: Process Name, CPU %, CPU % Cum, Res Mem. Click Done when finished. This display order is automatically saved, so the next time processes are viewed, this will remain the display order.

9. More Customizations.

It is possible to modify the definition of interesting processes by selecting:

Configure -> Filters

An easy way to limit the processes shown is to AND all the conditions (the default is to OR the conditions). In the Configure Filters window, select AND logic, then click on OK. A much smaller list of processes should be displayed.

Return to the Configure Filters window. Modify the filter definition for CPU % Cum as follows:

Change Enable Filter to ON
Change Filter Relation to >=
Change Filter Value to 3.0
Change Enable Highlight to ON
Change Highlight Relation to >=
Change Highlight Value to 3.0
Change Highlight Color to any LOUD color

Reset the logic condition back to OR, then click OK. Verify the filter took effect.

10. Administrative Capabilities.

There are two administrative capabilities within GlancePlus: if working as root, processes in the Process List screen can be killed or reniced.

In the Process List window, select the proc8 process. To access the admin tools, select:

Admin -> Renice

Use the slider to set the new nice value for this process to +19, then click OK. Note the impact on this process.

Now, select the proc8 process again. Select:

Admin -> Kill

Click OK, and note the process is no longer present.

11. Process Details.

Detailed metrics can be obtained on a per-process basis. To view process details, go to the Process List window and double-click on any process. Much of the detail in this report will be explained in the Process Management section of the course. The Reports menu provides much valuable information about the process, including the Files Open and the System Calls being generated. After surveying the information available through this window, close it and return to the Main window.

There are many other features available in GlancePlus; it provides close to 1000 metrics. Notice that when you iconify the GlancePlus Main window, all of the other windows are closed and the GlancePlus active icon is displayed. Alarms and histograms are displayed in this active icon. Exploding this icon will again open up all previously open windows.

12. Exit GlancePlus.

From the Main window, select:

File -> Exit GlancePlus


13. Glance, the ASCII Version.

From a terminal window that has not been resized, type glance.

NOTE: Never run glance or gpm in the background.

If you are accessing the ASCII version of glance from an X terminal window, make sure you start up an hpterm window to enable the full glance softkeys. Do not resize the window, as ASCII glance expects a standard terminal size. You can make the hpterm window longer, but never wider (although making it longer is frequently of no use).

# hpterm &

In the new window:

# glance

Display a list of keyboard functions by typing ?. This brings up a help screen showing all of the command keystrokes that can be used from the ASCII version of GlancePlus. Explore these to familiarize yourself with the interface.

14. Display Main Process Screen.

Type g to go to the Main Process Screen. This lists all interesting processes on the system. Retrieve online help related to this window by typing h, which brings up a help menu. Select:

Current Screen Metrics

Use the cursor keys to select CPU Util.

NOTE: This metric has two values. Use the online help to distinguish the difference between the two values. Use the space bar or the "Page Down" key to toggle to the next page of help.

Exit the online help CPU Util description by typing e. Exit the Current Screen Metrics topics by typing e.

From the main Help menu, select:

Screen Summaries

Use the cursor keys to select Global Bars. From this help description, explain what R, S, U, N, and A mean in the CPU Util bar.

Exit the online help Global Bars description by typing e. Exit the Screen Summaries topics by typing e. Exit the main Help menu by typing e.

At any time, you can exit help completely, no matter how deep you are, by pressing the F8 key.


15. Modify Interesting Process Definition.

From the main Process List window (select g), view the interesting processes. What makes these processes interesting?

Type o and select 1 (one) to view the process threshold screen. Cursor down to the Sort Key field, and indicate that the processes should be sorted by CPU usage. Before confirming that the other options are correct, note that any CPU usage (greater than zero), or any disk I/Os, will cause a process to be considered interesting.

Run the KILLIT command to stop all lab loads.

16. Glance Reports.

This is the free-form part of the lab. Spend the rest of your lab time going through the various Glance screens and GlancePlus windows. Use the table below to produce the different performance reports. Feel free to use this time to ask the instructor "How Do I . . .?" types of questions.

Glance Command   Function                       GlancePlus (gpm) "Report"
*a               All CPUs Performance Stats     CPU by Processor
 b               Back one screen
*c               CPU Utilization Stats          CPU Report
*d               Disk I/O Stats                 Disk Report
 e               Exit
 f               Forward one screen
*g               Global Process Stats           Process List
 h               Help
*i               I/O by Filesystem              I/O by Filesystem
 j               Change update interval
*l               Lan Stats                      Network by LAN
*m               Memory Stats                   Memory Report
*n               NFS Stats                      NFS Report
 o               Change Threshold Options
 p               Print current screen
 q               Quit
 r               Redraw screen
*s               Single process information     Process List, double-click process
*t               OS Table Utilization           System Table Report
*u               Disk Queue Length              Disk Report, double-click disk
*v               Logical Volume Mgr Stats       I/O by Logical Volume
*w               Swap Stats                     Swap Detail
 y               Renice process                 Administrative Capabilities
 z               Zero all Stats
 !               Shell escape
 ?               Help with options
                 Update screen data


Module 4 – Process Management

Objectives

Upon completion of this module, you will be able to do the following:

• Describe the components of a process.
• Describe how a process executes, and identify its process states.
• Describe the CPU scheduler.
• Describe a context switch and the circumstances under which context switching occurs.
• Describe, in general, the HP-UX priority queues.


4–1. SLIDE: The HP-UX Operating System

The HP-UX Operating System

[Slide diagram: the layered structure of HP-UX. User-level programs enter the kernel through the system call interface (the gateway). Within the kernel level sit the file subsystem (with its buffer cache) and the process control subsystem (scheduler, interprocess communication, and memory management), which feed the character and block I/O subsystems and device drivers. Below these, the hardware control interface communicates with the hardware devices at the hardware level.]

Student Notes

The main purpose of an operating system is to provide an environment where processes can execute. This includes scheduling processes for time on the CPU, managing the memory assigned to processes, allowing processes to read data from disk, and many other things. When processes execute within the HP-UX operating system, there are two modes that they can be in: user mode and kernel (system) mode.

User Mode and Kernel Mode

User mode refers to instructions that do not require the assistance of the kernel program in order to execute. These include numeric calculations, string manipulations, looping constructs, and many others. In general, it is good when a process can spend the majority of its time in user mode, because it implies the CPU is executing instructions that are related to the process, as opposed to instructions related to the kernel.

Kernel mode refers to time spent in the kernel executing instructions on behalf of the process. Processes access the kernel through system calls, often referred to as the System Call Interface. Examples include performing I/O, creating new processes, and expanding data space.


Kernel mode is also used for “background” activities, performed by the kernel on behalf of processes. Examples include page faulting the program's text or data in from disk, initializing and growing a process's data space, paging a portion of the process to swap space, performing file system reads and writes, and many other things. In general, when a process spends too much time in kernel mode, it is considered bad for performance. This is because too much time (overhead) is being spent to manage the environment in which the process executes, and not enough time on executing the actual process itself (which is user mode).

Performance Tools

Almost all performance tools that track CPU utilization distinguish between time spent by the CPU in user mode versus time spent in kernel mode. On a good, healthy system with plenty of memory resources, a typical ratio between user mode and kernel mode time is 4:1. This means the process spends 75-80% of its execution in user mode and 20-25% in kernel mode.

Another general rule of thumb is that kernel mode CPU time should not exceed 50%. When this happens, it generally means too much time is being spent managing the system (i.e. memory and swap space management, context switching), and not enough is being spent executing process code.
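You can see this split for a single command with the standard time or timex utilities. A minimal sketch (the figures below are purely illustrative, and the output format varies by shell and release):

# timex ./RUN

real   30.25
user   24.10
sys     5.90

Here user time is roughly 80% of the total CPU time (user + sys), in line with the 4:1 guideline above.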


4–2. SLIDE: Virtual Address Process Space (PA-RISC)

Process Virtual Address Space (PA-RISC)

[Slide diagram: the PA-RISC process virtual address space. In 32-bit mode there are four quadrants of 1 GB each, holding text, data, and two quadrants of shared objects. In 64-bit mode the same four-quadrant layout applies, but each quadrant is 4 TB.]

Student Notes

Each process views itself as starting at address 0 and ending at the maximum address addressable by 32 or 64 bits. This address space is known as the virtual address space for a process. The virtual address space is a logical addressing scheme used internally by the process to reference related instructions and data variables. The physical memory address locations cannot be used, because a program does not know where in physical memory it will be loaded. In fact, a program could be loaded at different memory locations each time it executes.

The Four Quadrants (32-bit)

Each process segments its virtual address space into four quadrants, with each quadrant containing 1 GB of address space. The first quadrant is reserved for the program's instructions (also known as text). Though an address range of 1 GB is reserved for text, very rarely does the program need all these addresses. Most of the time, only a fraction (often less than 10%) of this space is needed to address the program's text.


The second quadrant holds the program's private data variables. Again, 1 GB of address space is reserved for data variables, and in general only a fraction of this space is used. Since this quadrant is limited to 1 GB of address space, a maximum global data size of approximately 900 MB is imposed. (In HP-UX, changes were made to allow the global data to use addresses in other quadrants for private data, thereby increasing its maximum size to 3.9 GB.)

64-Bit HP-UX 11.00 Update With the introduction of HP-UX 11.00 and its 64-bit operating system, the virtual address space changes dramatically. A 32-bit process running under the 64-bit kernel is given the same space allocations as under a 32-bit kernel. With a 64 bit process, the addressable space increases to 16 Terabytes. This limits each quadrant to 4 TB (for a total of 16 TB of virtual address space), but the capability exists to increase this address space, if necessary, in future releases. Notice also that the locations of the various components of the process have been shifted among the quadrants.


4–3. SLIDE: Virtual Address Process Space (IA-64)

Process Virtual Address Space (IA-64)

[Slide diagram: the IA-64 process virtual address space. For a 32-bit process, the first four octants mimic the PA-RISC quadrants, with only 1 GB used per octant for text, data, and shared objects. For a 64-bit process, the eight 2 EB octants hold shared objects, text, data, and further shared objects. In both cases the kernel occupies the last octant.]

Student Notes

There is no 32-bit kernel running on the IA-64 processor. The virtual address space is always 16 EB in size, although it may not all be used or allocated while a particular process is running. The space is divided into eight equal-sized octants; each octant is 2 EB in size. When executing a PA-RISC 32-bit process, the first four octants are set up just like the PA-RISC 32-bit virtual address space, using only 1 GB out of each octant to simulate the four original quadrants. The last octant holds the kernel and all of its related structures.

64-Bit Processes

With a 64-bit process, the virtual address space changes dramatically. The first two octants become the equivalent of the first PA-RISC quadrant and hold shared objects. The third octant holds the text. The fourth and fifth octants are reserved for any process private data, and the sixth and seventh octants contain more shared objects. Only the last octant is laid out exactly the same for both 32-bit and 64-bit processes.


4–4. SLIDE: Physical Process Components

Physical Process Components

[Slide diagram: a kernel proc table entry in the OS tables pointing to the process components in memory: text, data, stack, and uarea, plus shared components such as library text (LibTxt), shared memory (ShMem), and memory-mapped regions (MemMap).]

Student Notes

Each process executing in memory contains an entry in the kernel's process table. The entry in the proc table then references the locations of the program's four main components: text, data, stack, and uarea. The text segment contains the program's executable code. The data segment contains the program's global data structures and variables. The stack area contains the program's local data structures and variables. The uarea is an extension of the proc table entry. In a multithreaded process, each thread will have its own uarea.

Other components that may or may not be associated with a process are shared libraries, shared memory segments, and memory-mapped files.

The text and initialized global data segments of the process are taken from the executed program file on disk during process startup. In an attempt to save on startup time, the uninitialized global data segments and the stack area are zero-filled, and no pages of a program are loaded at startup. Copying the entire text and data into memory would generate long startup latency. This latency problem is avoided in HP-UX by demand paging the program's text and data as needed.


Using this demand paging approach, the program is loaded into memory in smaller pieces (pages) on an as-needed basis. One page on HP-UX 10.X is equal to 4 KB. On HP-UX 11.00, the page size is variable (meaning the initial program could page in sizes greater than 4 KB).
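A quick way to confirm the base page size on a system is the POSIX getconf interface (on some releases the variable is spelled PAGESIZE rather than PAGE_SIZE):

# getconf PAGE_SIZE
4096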


4–5. SLIDE: The Life Cycle of a Process

The Life Cycle of a Process

[Slide diagram: a process starts by paging its text in from the file system (through the buffer cache) into main memory, then cycles between waiting on resources (the stop sign), sitting in the CPU run queue (the triangle), and executing on the CPU (the rectangle), with pages moving to and from the swap device, until the process ends.]

Student Notes

The life cycle of a process can be generalized by the above slide. When a process is born (or starts), its text must be paged in from the file system on disk (on demand) in order to be executed. (Remember, the operating system only pages in a text page when it determines that a process needs that particular page in order to execute.) In addition, space must be reserved on the swap partition for the process in the event it needs to page portions of the data area out to swap.

Once the swap space is reserved and the process is initialized, the process can begin executing on the CPU. As the process executes, it often performs actions that require it to wait. These actions include reading data from the disk or the network, waiting for a user to enter a response at a terminal window, or waiting on a shared resource (like semaphores). Once the item the process is waiting on becomes available, the process puts itself in the CPU run queue so it can begin executing again.

This is the standard cycle that a process goes through: wait for a resource, enter the CPU run queue when the resource is available, and execute on the CPU. The waiting on a resource is symbolized in the slide as the octagon (or stop sign). The entering of the CPU run queue is symbolized by the triangle, and the execution on the CPU is indicated by the CPU in the rectangle.


An advantage of the glance performance tool is that it displays, on a per-process basis (or system-wide), the various reasons why a process is blocked rather than running on the CPU.


4–6. SLIDE: Process States

Process States

[Slide diagram: process state transitions. A process is set up via fork (SIDL), becomes runnable in memory or on the swap device (SRUN), executes in kernel or user mode, sleeps on events in memory or on the swap device (SSLEEP), can be stopped by a debugger or job control (SSTOP), and on exit becomes a zombie (SZOMB).]

Student Notes

The process table entry contains the process state. This state is logically divided into several categories of information used for scheduling, identification, memory management, synchronization, and resource accounting.

There are five major process states:

SRUN     The process is running or is runnable, in kernel mode or user mode, in memory or on the swap device.
SSLEEP   The process is waiting for an event in memory or on the swap device.
SIDL     The process is being set up via fork.
SZOMB    The process has released all system resources except for the process table entry. This is the final process state.
SSTOP    The process has been stopped by job control or by process tracing and is waiting to continue.
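These states can be observed with the standard tools. In ps -el output, the S column (the second field) carries a one-letter state code; typical codes include S (sleeping), W (waiting), R (running), I (intermediate), Z (zombie), and T (stopped), but check ps(1) on your release for the full list. For example, to list any zombie processes:

# ps -el | awk '$2 == "Z"'     # state code is the second column of ps -el output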


Most processes, except the currently executing process, are placed in one of three queues within the process table: a run queue, a sleep queue, or a deactivation queue. Processes that are in a runnable state (ready for CPU) are placed on a run queue, processes that are blocked awaiting an event are located on a sleep queue, and processes that are temporarily out of the scheduling mix are placed on a deactivation queue. Deactivated processes typically occur only during a system memory management crisis.

Processes either terminate voluntarily through an exit system call or involuntarily as a result of a signal. In either case, process termination causes a status code to be returned to the parent of the terminating process. This termination status is returned to the parent process using a version of the wait() system call.

Within the kernel, a process terminates by calling the exit() routine. The exit() routine completes the following tasks: cancels any pending timers, releases virtual memory resources, closes open file descriptors, and handles stopped or traced child processes. Next, the process is taken off the list of active processes and is placed on a list of zombie processes, which is finally changed to the no-process state. The exit() routine continues by recording the termination status in the proc structure, bundling up the process's accumulated resource usage for accounting purposes, and notifying the deceased process's parent. If a process in the SZOMB state is found, the wait() system call will copy the termination status from the deceased process and then reclaim the associated process structure. The process table entry is taken off the zombie list and returned to the freeproc list.

As of HP-UX 10.10, the concept of a thread was introduced into the kernel. Processes became an environment in which one (or more) threads could execute. Each thread was visible to, and manageable by, the kernel separately. When this occurred, processes could be in any of the following states:

SINUSE   The process structure is being used to define one or more threads.
SIDL     The process is being set up via fork.
SZOMB    The process has released all system resources except for the process table entry. This is the final process state.

Whereas threads took on the previous states of the process:

TSRUN    The thread is running or is runnable, in kernel mode or user mode, in memory or on the swap device.
TSSLEEP  The thread is waiting for an event in memory or on the swap device.
TSIDL    The thread is being set up via fork.
TSZOMB   The thread has released all system resources except for the thread table entry. This is the final thread state.


TSSTOP   The thread has been stopped by job control or by process tracing and is waiting to continue.

The generic UNIX tools have no awareness of threads, so they continue to report process states and all other metrics from the viewpoint of the process. Only the HP-specific tools (such as glance, gpm, PerfView/OVPM, and MeasureWare/OVPA) have the ability to look at individual threads and report their metrics separately from the process. Of course, the vast majority of processes are single-threaded. In those cases, there is no practical difference between the reports of the various tools.


4–7. SLIDE: CPU Scheduler

CPU Scheduler

The CPU scheduler handles:

• Context switches
• Interrupts

[Slide diagram: the CPU scheduler in the kernel dispatching among processes in memory: Proc A pri=156, Proc B pri=220, Proc C pri=172, and Proc D pri=186.]

Student Notes

Once the required data is available in memory, the process waits for the CPU scheduler to assign it CPU time. CPU scheduling forms the basis for the multitasking, multiuser operating system. By switching the CPU between processes, for example while some are waiting for other events such as I/O, the operating system can function more productively.

HP-UX uses a round-robin scheduling mechanism. The CPU lets each process run for a preset maximum amount of time, called a quantum or timeslice (default = 1/10th of a second), until the process completes or is preempted to let another process run. Of course, the process can always voluntarily surrender the CPU before its timeslice expires when it realizes that it cannot continue. The CPU saves the status of the first process in a context and switches to the next process.

When a process is switched out due to its timeslice expiring, it drops to the bottom of the run queue to wait for its next turn. If it is preempted by a stronger priority process, it is placed back onto the front of the run queue. If it voluntarily gives up the CPU, it goes onto one of the sleep queues until the resource it is waiting for becomes available. When that resource does become available, the process moves to the end of the run queue.
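The run queue that the scheduler services can be watched with the standard sar command; sar -q reports the average run-queue length (runq-sz) and the percentage of time the queue was occupied (%runocc). A sketch taking five 5-second samples:

# sar -q 5 5      # runq-sz column: average number of runnable processes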


As a multitasking system, HP-UX requires some way of changing from process to process. It does this by interrupting the CPU to shift to the kernel. The clock interrupt handler is the system software that processes clock interrupts. It performs several functions related to CPU usage including gathering system and accounting statistics and signaling a context switch. System performance is affected by how rapidly and efficiently these activities occur.

Terms

CPU scheduler            Schedules processes for CPU usage.
System clock             Maintains the system timing.
Clock interrupt handler  Executes the clock interrupt code and gathers system accounting statistics.
Context switching        Interrupts the currently running process and saves information about the process so that it can begin to run after the interrupt, as if it had never stopped.


4–8. SLIDE: Context Switching

Context Switching

A context switch occurs when:

• A timeslice expires, i.e. a thread accumulates 10 clock ticks (forced)
• A preemption occurs, i.e. a stronger priority thread is runnable (forced)
  - if the stronger thread is RT, preemption is immediate
  - if the stronger thread is not RT, preemption occurs at the next convenient time
• A thread becomes non-computable (voluntary), i.e.
  - it goes to sleep
  - it is stopped
  - it exits

Student Notes

A context switch is the mechanism by which the kernel stops the execution of one process and begins execution of another. A context switch occurs under the circumstances shown on the slide.

There are two types of context switches: forced and voluntary. A forced context switch occurs when the process is forced to give up the CPU before it is ready. Examples include timeslice expiration or a stronger priority process becoming runnable. A voluntary context switch occurs when the process itself gives up the CPU without using its full timeslice. This happens when the process exits, puts itself to sleep (waiting on a resource), or puts itself into a stopped state (debugging).

The glance tool distinguishes between forced and voluntary context switches on a per-process basis.
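System-wide context-switch rates can also be watched from the standard tools; on HP-UX, sar -w reports process switches per second in its pswch/s column, and vmstat prints a context-switch (cs) column. A sketch taking five 5-second samples:

# sar -w 5 5      # pswch/s: process (context) switches per second
# vmstat 5 5      # cs column: context switches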


4–9. SLIDE: Priority Queues

Priority Queues

[Slide diagram: the HP-UX priority spectrum from -32 to 255. POSIX real-time priorities (-32 to -1, rtsched) and HP-UX real-time priorities (0 to 127, rtprio) use queues one priority wide. Time-share priorities (128 to 255) use queues four priorities wide (128-131, ..., 152-155, ..., 172-175, 176-179, 180-183, ..., 252-255), with markers at PSWP (128), PZERO (153), and PUSER (178) separating system-level (nonsignalable) from user-level (signalable) priorities.]

Student Notes

Every process has a priority associated with it at creation time. These priorities determine the order in which processes execute on the CPU. Processes with stronger priorities always execute before processes with weaker priorities. In UNIX, stronger priorities are represented by smaller numbers and weaker priorities are represented by larger numbers.

HP-UX uses adjustable priorities to schedule its time slicing for general timeshare processes generated by all users (priorities 128-255). By that we mean a process's priority can be adjusted, up or down, by the kernel, according to how "favored" a process might be. In general, the more a process executes, the less favorably it will be treated by the kernel. However, since HP-UX also supports real-time processing, it must include priority-based scheduling for those processes (priorities 0-127). As of HP-UX 10.X, support is also provided for POSIX real-time processes (priorities -32 through -1). The /usr/include/sys/param.h file contains some extra information on the priorities used in the system.

Each processor in an HP system has its own run queue. Each run queue is further broken down into multiple priority queues, to make it easier for that processor to select the most deserving process to run.


Real-Time Process Priorities

Real-time priority queues are one priority wide, i.e. each queue represents one priority value. The strongest-priority real-time process preempts all others (of weaker priority) and runs until it sleeps, exits, is preempted by a stronger real-time process, or is timesliced by an equal-priority real-time process. Equal-priority real-time processes run in a round-robin fashion.

A process can be made to run with a real-time priority by using the rtprio(1) or rtsched(1) command. The rtsched command can also be used to disable timeslicing for a particular process, by assigning it a different scheduling policy. Because a real-time process will execute at the expense of all time-share processes, make sure that you consider the impact on your users before invoking the command. A CPU-bound, real-time process will halt all other use of the system. A POSIX real-time process (ttisr) runs on HP-UX at priority -32.
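As a sketch of the two interfaces (the program name and priority values below are arbitrary examples; see rtprio(1) and rtsched(1) for the exact options on your release):

# rtprio 64 ./cpu_hog                   # start the hypothetical ./cpu_hog at HP-UX real-time priority 64
# rtsched -s SCHED_RR -p 20 ./cpu_hog   # or start it under the POSIX round-robin real-time policy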

Time-Share Process Priorities

Timeshare priority queues are four priorities wide, i.e. each priority queue represents four adjacent priority values. For example, the first timeshare priority queue is used by processes with priorities of 128, 129, 130, and 131. Timeshare processes are grouped into system and user processes. Priorities 128-177 are reserved for runnable system processes and sleeping processes (both system and user), and priorities 178-255 are for runnable user processes.

A nice value is assigned to each timeshare process and is used by the kernel in the calculation of a new priority for the process, helping determine how to "adjust" its priority. Nice values have no effect on real-time processes.


4–10. SLIDE: Nice Values

Nice Values

[Slide diagram: two time-share processes oscillating between priorities 177 and 255 as they alternately run and sleep. ProcA (nice=20) regains priority quickly and runs often; ProcB (nice=39) regains priority more slowly and spends most of its time sleeping.]

Student Notes

Time-shared processes are all initially assigned the priority of the parent when they are spawned. The user can modify how much the kernel "favors" a process with the nice value.

Timeshare processes lose priority as they execute, and regain priority as they wait their turns. The rate at which a process loses priority is linear, but the rate at which it regains priority is exponential. A process's nice value is used as a factor in calculating how fast a process regains priority. The nice value is the only control a user has to give greater or less favor to a time-share process.

The default nice value is 20. Therefore, to make a process run at a weaker priority, it should be assigned a higher nice value (maximum value 39). The superuser can assign a lower nice value to a process (minimum value 0), effectively giving it a stronger priority.
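In practice, the nice value is set with the standard nice and renice commands. A sketch (the PID shown is a hypothetical example):

# nice -n 10 ./long &    # start at nice 30 (20 + 10), a weaker priority
# renice -n -5 2345      # superuser only: strengthen PID 2345 (nice 20 becomes 15)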


4–11. SLIDE: Parent-Child Process Relationship

Parent-Child Process Relationship

[Slide diagram: a chain of processes in memory, each referenced by the kernel OS tables: ksh spawns sam, which leads through sh, su, a login ksh, glance, and another sh to csh.]

Student Notes

One item to keep in mind related to process management is the relationship between parent and child processes. Every process started from a terminal window on the system has a parent process that spawns it. The parent process does not terminate once a child is spawned. Instead, it goes to sleep waiting for the child to terminate. If a child process does not exit properly (for example, if it spawns a new process rather than exiting to its parent), then the system could end up with many processes sleeping in memory and using proc table entries unnecessarily.

The example in the slide shows a ksh shell that spawns a sam process. Within sam, the system administrator shells out to su to a regular user. Once in the login shell, the user starts glance. From within glance, they shell out, and now decide they'd rather be in a csh shell. This string of events causes eight different processes to be started. If the user decides to return to sam by typing sam, would the previous sam process be reactivated, or would a new sam process be spawned? (Answer: A new sam process is spawned.)
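The parent-child chain is easy to trace with the standard ps command, since ps -ef prints the PID (second field) and PPID (third field) of every process. A sketch using a hypothetical PID:

# ps -ef | awk '$2 == 2345 || $3 == 2345'    # show PID 2345 and its direct children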


4–12. SLIDE: glance — Process List

glance – Process List

[Slide: a glance Process List screen (B3692A GlancePlus B.10.12) showing the four global utilization bars (CPU Util 22%, Disk Util 1%, Mem Util 91%, Swap Util 25%, each with current, average, and high values) and, for each interesting process, its name, PID, PPID, priority, user name, CPU utilization (current/average and cumulative), disk I/O rate, resident set size, and thread count.]

Student Notes

The next four slides are designed to illustrate how the management of processes can be monitored through glance. Topics just covered (like kernel versus user CPU time, process components, process wait states, nice values, and process priorities) can all be viewed through glance.

The first global bar graph, which displays on every glance screen, is CPU Util. This displays how the CPU is being distributed:

• S = System or kernel time
• N = User time (executing processes whose nice value has been set greater than 20, i.e. 21-39)
• U = User time (executing processes with a nice value of 20)
• A = User time (executing processes whose nice value has been set less than 20, i.e. 0-19); in other words, anti-nice
• R = Real time (executing processes with priorities 127 and less)


The Process List screen (g key), as shown on the slide, can be used to see process priorities. The order in which the processes are displayed can be configured (o key) to display by CPU usage, memory usage, or disk I/O activity. In HP-UX version 10.X, the thread count column was the blocked on column. The blocked on information can still be obtained by looking at the individual processes’ resource summary screens.


4–13. SLIDE: glance — Individual Process

glance – Individual Process

[Slide: a glance Resource Usage screen for PID 16013 (netscape, PPID 12988, user sohrab) showing CPU usage split into user/nice/RT, system, interrupt, and context-switch components; logical, physical, file system, VM, system, and raw reads and writes; the scheduler (HPUX), priority (154), and nice value (24); dispatches (1307), forced (460) and voluntary (814) context switches; the wait reason (SLEEP); total RSS/VSS; faults; and the process start time.]

Student Notes

From the Process List screen, an individual process can be selected for further analysis (s key). The above slide shows some of the additional details available when analyzing a process further.

Items of interest from the Individual Process screen include the process's nice value, the number of forced versus voluntary context switches, the current wait reason, and the parent PID.


4–14. SLIDE: glance — Process Memory Regions

glance – Process Memory Regions

[Slide: a glance Memory Regions screen for PID 16013 (netscape) listing each region (NULLDR, TEXT, DATA, MEMMAP, STACK, UAREA, LIBTXT) with its shared/private type, reference count, RSS, VSS, locked size, and associated file name, plus summary RSS/VSS totals for text (4.3mb/9.5mb), data (5.8mb/8.6mb), stack (28kb/28kb), shared memory, and other regions.]

Student Notes

From the Individual Process screen, the memory regions (i.e. process components) corresponding to that process can be viewed (M key). The above slide shows the memory regions for the currently selected process.

Items of interest from the Memory Regions screen include the location of the process's text, data, stack, and uarea, along with each region's shared/private flag, its resident set size and virtual set size, and its reference count. If the process is associated with memory-mapped files (MEMMAP), shared libraries (LIBTXT), or shared memory segments (SHMEM), these will be displayed.

In HP-UX version 11.X, glance no longer displays the addresses of each memory region. However, gpm still does.


4–15. SLIDE: glance — Process Wait States

glance – Process Wait States

[Slide: a glance Wait States screen for PID 14205 (netscape) showing, for each event category (IPC, cache, CDROM IO, disk IO, graphics, inode, job control, LAN, message, NFS, pipe, priority, RPC, semaphore, sleep, socket, stream, system, terminal, virtual memory, and other), the percentage of time the process was blocked on it; here 77.2% sleeping, 13.7% CPU utilization, and 9.1% waiting on priority. The C key toggles cumulative/interval values and the % key toggles percent/absolute values.]

Student Notes

From the Process List screen, the process wait states can be viewed (W key). The above slide shows the categories of wait states and where/what the selected process has waited on.

Items of interest from the Process Wait States screen include the percentage of time the process has spent in each of the possible wait state categories.


4–16. LAB: Process Management

Directions

The following lab is designed to manage a group of processes. This includes observing the parent-child relationship and modifying process nice values (and thus, indirectly, priorities) with the nice/renice commands.

Modifying Process Priorities

This portion of the lab uses glance to monitor and modify nice values of competing processes.

1. Change directory to /home/h4262/baseline.

# cd /home/h4262/baseline

2. Start seven long processes in the background.

# ./long & ./long & ./long & ./long & ./long & ./long & ./long &

3. Start a glance session. Answer the following questions.

How much CPU time is each long process receiving? _______sec, _______%

How are the processes being context switched (forced or voluntary)? _______________

How many times over the interval is the process being dispatched? ____________

What is the ratio of system CPU time to user CPU time? __________

What are the processes being blocked on? _________________

What are the nice values for the processes? _________

4. Select one of the processes and favor it by giving it a more favorable nice value.

What is the PID of the process being favored? __________

To change the process's nice value, enter (substituting the PID you chose):

# renice -n -5 <PID>

Watch that process's percentage of the CPU over several display intervals with glance or top. What effect did it have on the process? _____________________________
____________________________________________________________________


5. Select another long process and set its nice value to 30 (the default of 20 plus an increment of 10), again substituting the PID:

# renice -n 10 <PID>

What effect did that have on that process? ___________________________________
______________________________________________________________________

6. You can either let the processes finish up on their own as the next module is covered, or you can kill them now with:

# kill $(ps -el | grep long | cut -c18-22)


Module 5 – CPU Management

Objectives

Upon completion of this module, you will be able to do the following:

• Describe the components of the processor module.
• Describe how the TLB and CPU cache are used.
• List four CPU related metrics.
• Identify how to monitor CPU activity.
• Discuss how best to use the performance tools to diagnose CPU problems.
• Specify appropriate corrections for CPU bottlenecks.


5–1. SLIDE: Processor Module

Processor Module

[Slide diagram: a processor module containing the CPU, TLB, cache, and coprocessor, connected by internal busses and attached to the system bus.]

Student Notes

A typical HP processor module consists of a central processing unit (CPU), a cache, a translation lookaside buffer (TLB), and a coprocessor. These components are connected via internal processor busses, with the entire processor module being connected to the system bus.

The cache is made up of very high-speed memory chips. Cache can be accessed in one CPU cycle. Its contents are instructions and data that recently have been, or are anticipated to soon be, used by the CPU. Cache size varies between processors. The size of the cache can have a big effect on system performance.

The translation lookaside buffer (TLB) is used to translate virtual addresses into physical addresses. It is a high-speed cache whose entries consist of pairs of recently accessed virtual addresses and their associated physical addresses, along with access rights and an access ID. The TLB is a subset of a system-wide translation table (page directory) that is held in memory. TLB size also affects system performance, and different HP 9000 processors have different TLB sizes.


The address translations kept in the TLB enable us to locate the appropriate data and instructions in memory. Memory is accessed via the physical address; without the translation in the TLB, we would not be able to find the information in memory.

Note these other points regarding the TLB:

• Each process has a unique virtual address space.
• Each TLB entry refers to a page of memory, not a single location. In all 64-bit architectures used by HP, pages are fundamentally 4 KB in size, but can be any multiple of 4 KB under various circumstances, to reduce the number of entries needed in the TLB.


5–2. SLIDE: Symmetric Multiprocessing

Symmetric Multiprocessing

[Slide diagram: two identical processor modules, each with its own CPU, TLB, cache, and coprocessor, sharing the system bus.]

Student Notes

Symmetric Multiprocessing (SMP) refers to systems containing two or more processor units. SMP is implemented on all Hewlett-Packard workstations and servers capable of supporting more than one CPU. Each processor on an SMP system has exactly the same characteristics, including the same processing unit, the same CPU cache design, and the same size translation lookaside buffer (TLB).


5–3. SLIDE: Cell Module

Cell Module

[Slide diagram: a cell module containing four processors and local memory on a cell internal bus, with I/O buses attached.]

Student Notes

A more recent design of HP systems is based on the "cell" architecture. In a cell, there are multiple processors, some memory, and some I/O buses. Each cell can act as an independent SMP system, or as part of a collection of cells forming a larger SMP system.

Each processor in the cell has the same access speed (or latency) to the memory within the same cell. However, if one of those processors has to access a location in the memory of a different cell, the latency is greater. Each processor within the cell does have its own cache memory and TLB.

Each processor has equal access to the I/O buses that are part of the same cell. They may also have access (with somewhat greater delays) to the I/O of other cells in the same system.


5–4. SLIDE: Multi-Cell Processing

Multi-Cell Processing

[Slide diagram: four cells, each with four processors, local memory, and I/O, joined by a high-speed memory interconnect.]

Student Notes

The best example HP currently has of an SMP system using the cell architecture is the Superdome. Here we find 4 cells, each with four processors, some memory, and some I/O buses.

Each cell could be configured (using Node Partitioning, or NPars) into a separate and individual system, capable of booting its own operating system. It would be functionally apart from the other cells. The only way that the operating system on that cell could communicate with the software running on any other cell would be through a network interface.

On the other hand, multiple cells could be configured to act as a unit. They would pool their resources and boot a single operating system, seamlessly acting as one SMP system.

This architecture gives the customer and the system administrator tremendous flexibility in how to set up their hardware. They could even change it relatively easily from one configuration to another as their needs change.

On a wider range of systems, you may be using Virtual Partitioning (VPars). These are similar to NPars, but are not limited to cell boundaries and are handled entirely by software. A system could use both NPars and VPars at the same time. Using software, processors can be moved from one VPar to another.


Finally, on an even wider range of systems, we have the concept of processor sets (psets). Multiple psets could exist within the same partition (either NPar or VPar). Each pset would be set aside for use by a particular application or group of applications. Using software, psets can be created and removed, and processors can be moved from one pset to another.
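As a sketch of the pset interface (assuming the psrset(1M) command available on HP-UX 11i; verify the options against the man page for your release, and note that the processor numbers, pset ID, and program name here are arbitrary examples):

# psrset -c 2 3          # create a new pset from processors 2 and 3
# psrset -e 1 ./myjob    # run the hypothetical ./myjob inside pset 1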


5–5. SLIDE: CPU Processor

CPU Processor

[Slide diagram: the register sets inside the CPU: shadow registers, general registers, control registers, space registers, special function unit registers, the process status word, coprocessor registers, and the instruction address queues, alongside the TLB, cache, and coprocessor.]

Student Notes

The CPU ultimately is responsible for your system speed. The kernel loads the process text for the CPU to execute. The processor module has many registers, which assist in the execution of instructions. The definition of all these registers is beyond the scope of this course. The primary objective of this module is to focus on CPU clock speed, the size of the CPU cache, and the effects of the TLB on overall system performance.

Each HP 9000 server and workstation has a chip at its heart. The latest PA-RISC chip is the 64-bit PA-8xxx. HP has also introduced systems using the 64-bit IA-64 Itanium chip. A selection of the range of current systems is listed on the following pages. Note the difference not only in clock speeds, but also in cache size.

The following tables list the specifics of several HP-UX servers and workstations. It is very difficult to keep a list of this nature up to date in training materials, but it has been included merely to demonstrate the wide variety of system characteristics present in the HP computing products family.


Business Servers

Model                            No. of CPUs   Clock Speed   Max. RAM (GB)   Cache                 I/O Slots
rp3410-2 (PA-8800)               2             800 MHz       6               1.5MB(L1), 32MB(L2)   2 PCI (64-bit)
rp3440-4 (PA-8800)               4             1 GHz         24              1.5MB(L1), 32MB(L2)   4 PCI (64-bit)
rp4440-8 (PA-8800)               8             1 GHz         64              1.5MB(L1), 32MB(L2)   6 PCI
rp7420-16 (PA-8800, 2 cells)     16            1 GHz         64              1.5MB(L1), 32MB(L2)   15 PCI
rp8420-32 (PA-8800, 4 cells)     32            1 GHz         128             1.5MB(L1), 32MB(L2)   16 PCI
Superdome (PA-8800, 16 cells)    128           1 GHz         1024            1.5MB(L1), 32MB(L2)   192 PCI
rx1600 (Itanium 2)               2             1 GHz         16              1.5MB(L3)             0/1/1 PCI *
rx2600 (Itanium 2)               2             1.5 GHz       24              6MB(L3)               0/4/0 PCI *
rx4640 (Itanium 2)               4             1.5 GHz       64              6MB(L3)               0/4/2 PCI *
rx5670 (Itanium 2)               4             1.5 GHz       96              6MB(L3)               0/6/3 PCI *
rx7620 (Itanium 2, 2 cells)      8             1.5 GHz       64              6MB(L3)               15 PCI
rx8620 (Itanium 2, 4 cells)      16            1.5 GHz       128             6MB(L3)               16+16 PCI (128-bit)
Superdome (Itanium 2, 16 cells)  64            1.5 GHz       512             6MB(L3)               0/128/64 PCI *

* The slash notation m/n/p counts 32-bit, 64-bit, and 128-bit PCI slots (see the note under the Workstations table).


Workstations

Model                No. of CPUs   Clock Speed   Max. RAM (GB)   Cache (KB)   I/O Slots
B2600 (PA-8600)      1             500 MHz       4               512/1024     2/2/0 PCI *
B3700 (PA-8700)      1             750 MHz       8               768/1536     2/3/1 PCI *
C3750 (PA-8700+)     1             875 MHz       8               768/1536     2/3/1 PCI *
J6750 (PA-8700+)     2             875 MHz       16              768/1536     0/0/3 PCI *
zx2000 (Itanium 2)   1             1.4 GHz       8               1536(L3)     5 PCI - 1 AGP
zx6000 (Itanium 2)   2             1.5 GHz       24              6144(L3)     3 PCI - 1 AGP

* 2/3/1 means 2 32-bit PCIs, 3 64-bit PCIs, and 1 128-bit PCI.

All Itanium 2 processors include 32KB of L1 cache and 256KB of L2 cache.

To determine the specifics of your system, refer on-line to http://www.hp.com/go/enterprise, select "Products Index" and scroll down to select your system platform name [i.e. J-Class (HP 9000)]. This will display the "Product Information" screen for the selected hardware.


5–6. SLIDE: CPU Cache

CPU Cache

[Slide diagram: the CPU fetching its next instruction from the on-chip cache; on a miss, the instruction is loaded from the process text in memory across the system bus.]

Student Notes

The CPU loads instructions from memory and runs multiple instructions per cycle. To minimize the time that the CPU spends waiting for instructions and data, the CPU uses a cache. The cache is a very high-speed memory that can be accessed in one CPU cycle, with its contents being a subset of the contents of main memory. As the CPU requires instructions and data, they are loaded into the cache.

The size of the cache has a large bearing on how busy the CPU is kept. The larger the cache, the more likely it is that it will contain the instructions and data to be executed.

Most current processors support multi-level caches. The Level 1 cache (L1) is the fastest, operating at the same speed as the CPU; it is relatively small. The Level 2 cache (L2) operates at one-half the speed of the CPU and is somewhat larger. The IA-64 has a Level 3 cache (L3) that is even larger and slower.


5–7. SLIDE: TLB Cache

TLB Cache

[Slide diagram: the CPU presenting a virtual address (VA) from the instruction address queues to the TLB, which translates it to a physical address (PA); on a TLB miss, the VA/PA pair is fetched from the page directory (PDIR) in memory before the instruction can be fetched from cache or memory.]

Student Notes

All 32-bit programs view their address space as starting at address 0 and ending at address 4 GB. All addresses referenced by the program are referenced relative to this address space. This is referred to as the program's virtual address space. A program's physical address is the address location in physical memory where the program is loaded at execution time.

When the CPU executes a program, it is presented with the virtual address containing the instruction to be executed. In order to fetch this instruction from physical memory, the CPU must convert the virtual address (VA) into the corresponding physical address (PA). To do this, the CPU checks the TLB. If the VA->PA translation is present, it then knows the PA in memory of the instruction. If the VA is not present, it then needs to fetch the information from the PDIR (Page DIRectory) table in memory. This memory fetch of the PDIR table is relatively expensive from a performance standpoint.

Once the PA is known, the CPU then checks the instruction cache on the CPU for the PA. If the PA is present, it loads the instruction straight from the instruction cache. If not, it must fetch the instruction from memory, which is relatively expensive (performance-wise).


The size of the TLB ranges from 96 to 160 entries (each entry points to a
variable-sized memory page) on PA-RISC and IA-64 processors.


5–8. SLIDE: TLB, Cache, and Memory

TLB, Cache, and Memory

TLB      Cache    Memory    Consequence
----     -----    ------    -----------------------------
Hit      Hit      Hit       1 CPU cycle fetch
Hit      Miss     Hit       Data/instruction memory fetch
Miss     X        Hit       PDIR memory fetch
Miss     X        Miss      Page fault

X = Don't Care

Student Notes

The slide shows some of the permutations of hits and misses on memory, cache,
and the TLB, as well as the consequences of each.

The best situation is when the VA has an entry in the TLB, and the
corresponding PA has an entry in the CPU cache. This allows the instruction or
data to be presented to the CPU in one clock cycle.

The next-best scenario would be to have a hit on the TLB, but a miss on the
CPU cache. An example number of clock cycles to fetch a PA from memory to the
CPU cache is 50 clock cycles.

Another scenario would be to have a miss on the TLB, but a hit on the CPU
cache. The miss on the TLB requires the PDIR table in memory to be searched,
and an appropriate entry to be loaded into the TLB. This takes a variable
number of cycles to perform. On one model the average was 131 clock cycles.
Therefore, a miss on the TLB is more expensive than a miss on CPU cache.

A miss on both the TLB and the CPU cache would translate into 131 + 50, or 181
clock cycles on average, to access the instruction or data that the CPU needs.
This could have been accessed in 1 clock cycle had the VA been in the TLB and
the PA been in CPU cache.


The worst scenario, performance-wise, is not having the instruction or data loaded in memory at all. In this case, a page fault would occur to retrieve the information from disk. Assuming a 1-GHz clock, a 10-ms disk transfer rate, and an idle disk drive, this would correspond to 10,000,000 clock cycles to access the data or instruction.
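At the clock speed used in this example, the relative costs are easy to verify
with a little shell arithmetic (these are the course's illustrative figures,
not measurements of any particular processor):

# echo "131 + 50" | bc                (TLB miss plus cache miss: 181 cycles)
# echo "0.010 * 1000000000" | bc      (10-ms disk wait at 1 GHz: 10,000,000 cycles)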


5–9. SLIDE: HP-UX — Performance Optimized Page Sizes

HP-UX 11.00 — Performance Optimized Page Sizes (POPS)

VA

HP-UX 10.x Fixed Page Size – 4 KB

PA

0 8192 4096 65536 8192 128000 12288 256000 16384 512000

0

Filesystem

8192

VA 0 4096 8192 12288 16384

65536 128000 256000

TLB (on CPU)

512000

File Memory

VA

HP-UX 11.00

PA

0 8192 16384 512000

0

Filesystem

8192

VA 0

Variable Page Size Range: 4 KB – 64 MB

512000

TLB (on CPU)

Memory

File

16384

Student Notes

HP-UX 11.00 is the first release of the operating system to have general
support for performance optimized page sizes (POPS), also known as variable
page sizes. Partial support for variable memory page sizes has existed since
HP-UX 10.20. HP-UX 11.00 allows customers to configure executables to use
specific performance optimized page sizes, based on the program's text and
data sizes. Page sizes can be selected from a range of 4 KB to 4 GB. The use
of performance optimized page sizing can significantly increase performance of
applications that have very large data or instruction sets.

NOTE:  Performance-optimized page sizing works on PA-8000-based and
       IA-64-based systems.

Fixed Page Sizes (Prior to 11.00)

Prior to HP-UX 11.00, all page sizes were fixed at 4 KB. As a program
executed, each 4 KB page would be mapped into physical memory, and a TLB entry
would be created to map the virtual address corresponding to that page to the
physical memory address. Selected models had a few "Block" TLB entries, which
could map multiple pages into a single entry, if the pages were contiguous in
both virtual and physical address spaces. These entries were reserved for
mapping the kernel, the I/O pages, and other segments that were locked into
memory.

At some point, the TLB would become full, and the virtual-to-physical address
mapping would only be stored in the PDIR table in memory, not in the TLB on
the CPU. This meant that if a virtual address needed to be translated, there
would be a chance that the address would not have an entry in the TLB, and
time would have to be spent to look up the address within the PDIR table in
memory. This handling of the TLB miss was expensive in terms of performance.

Performance Optimized Page Sizes (11.00 and Beyond)

With the release of HP-UX 11.00, support for variable page sizes is available.
With POPS, a larger portion of the process's virtual address space can be
referenced within a single page or within a few large pages. Therefore, a
larger portion of the process can be referenced with far fewer TLB entries.
Below are two tables showing what sizes of pages are available in the PA-RISC
and the IA-64 architectures.

PA-RISC    IA-64
-------    -----
4K         4K
16K        8K
64K        16K
256K       64K
1M         256K
4M         1M
16M        4M
64M        16M
256M       64M
1G         256M
-          4G

Affecting Page Sizes

There are two methods of affecting page size in a process. One is through
tunable kernel parameters. vps_pagesize determines what the "default" page
size will be with no other information. The size is given in 1 KB units, and
the setting is typically 4 (that is, a 4 KB page). vps_ceiling determines how
large the kernel can "promote" a page size for a process, if it notices that a
process is getting a very large number of TLB misses. The default setting for
this is 16 (that is, 16 KB).

The second method is done by the system administrator. A command, chatr, can
be used to provide the kernel with a hint of what page sizes would work best
with this process. Following is an example of this command.

chatr -pi 16 -pd 256 /opt/app/bin/app

The above command would hint to the kernel that this process would best
execute with 16K pages for the instructions (text) and 256K pages for the
data. This hint would be stored in the header of the executable file and be
visible to the kernel whenever the program was invoked. The kernel would do
its best to see that the hint is followed. However, if memory pressure exists,
the kernel may not be able to honor the request and may end up "demoting" the
size of the page to be able to manage it in memory.

There is a third tunable parameter, vps_chatr_ceiling, that determines the
maximum value a chatr command can assign to an executable file.
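As a quick sketch of how these settings can be inspected (assuming an HP-UX
11.x system where the kmtune command is available; /opt/app/bin/app is the
example path from above):

# chatr /opt/app/bin/app           (display the executable's attributes,
                                    including any stored page-size hints)
# kmtune -q vps_pagesize           (query the default page-size tunable)
# kmtune -q vps_ceiling            (query the promotion ceiling)
# kmtune -q vps_chatr_ceiling      (query the maximum a chatr hint may use)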


5–10. SLIDE: CPU — Metrics to Monitor Systemwide

CPU — Metrics to Monitor Systemwide

•  User CPU utilization
•  Nice/Anti-Nice utilization
•  Real time processes
•  System CPU utilization
•  System call rate
•  Context switch rate
•  Idle CPU utilization
•  CPU run queues (load averages)

Student Notes

The load on the CPU can be monitored in a number of different ways. There are
multiple tools and multiple metrics that monitor CPU performance.

User CPU Utilization

This is the percentage of time the CPU spent running in user mode. This
corresponds to executing code within user processes, as opposed to code within
the kernel. It is better to see user CPU utilization higher than system CPU
utilization (preferably two to three times higher).

Nice/Anti-Nice Utilization

This is the percentage of time the CPU spent running user processes with nice
values of 21-39 (Nice) or 0-19 (Anti-Nice). This is typically included in USER
CPU utilization, but some tools, like glance, track this separately to see how
much CPU time is being spent on weaker or stronger priority processes.

Real Time Processes

This is the amount of time spent executing real time processes that are
running on the system. Real time processes get the CPU immediately when they
are ready to execute, and can have a big impact on the performance of
time-shared processes.

System CPU Utilization

This is the percentage of time the CPU spent running in system (or kernel)
mode. This corresponds to executing code within the kernel. We have to have
some kernel time just to do minimum management on the system. However,
excessive time spent managing the system is bad for performance. Excessive
system CPU utilization is considered to be when system utilization is greater
than the user utilization.

System Call Rate

The system call rate is the rate at which system calls are being generated by
the user processes. Every system call causes a switch to occur between user
mode and system (or kernel) mode. A high system call rate typically
corresponds to a high system CPU utilization. If the system call rate is high,
it is recommended to investigate which system calls are being generated, the
frequency of each system call, and the average duration of each system call.

Context Switch Rate

This is the number of times the CPU switched processes (on average) per
second. This is typically included in system CPU utilization, but some tools,
like glance, track this separately.

Idle CPU

This is the percentage of time the CPU spent doing nothing (i.e. it did not
execute any user or kernel code). It is good to see some, even lots, of idle
CPU time. A non-idle CPU means the CPU run queue is never exhausted (or
emptied), which means processes are always having to wait before reaching the
CPU. The size of the line (CPU run queue) grows as idle CPU time approaches 0.

CPU Run Queues/Load Average

Both these terms reference the same thing. This is the number of processes in
the CPU run queue. For best performance, the average load in the CPU run queue
should not exceed three.
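Most of these metrics can be captured with the standard tools covered earlier;
a minimal sketch that samples them side by side for one minute (the output
file names here are arbitrary):

# sar -u 5 12 > /var/tmp/sar_u.out &     (user/system/wio/idle utilization)
# sar -q 5 12 > /var/tmp/sar_q.out &     (run queue length and occupancy)
# vmstat 5 12 > /var/tmp/vmstat.out &    (system call and context switch rates)
# wait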


5–11. SLIDE: CPU — Metrics to Monitor per Process

CPU — Metrics to Monitor per Process

•  Process priority
•  Process nice value
•  Amount of CPU user time
•  Amount of CPU system time

Student Notes

Individual processes vary greatly in terms of the load they place on the CPU.
Metrics to monitor on an individual process include the following.

Process Priority

This is the priority of the process. If the priority is 127 or less, we know
it is a real time process. If the priority is 128-177, either it is a system
process, or it is a user process that is sleeping. If the priority is 178-255,
then we know the process is executing in USER mode.

Process Nice Value

This is the nice value associated with the process. This only applies to
time-share processes. This value determines how fast the process regains
priority while it is waiting for the CPU. Small nice values (0-19) should be
given to more important processes allowing them to regain priority quickly.
Large nice values (21-39) should be given to less important processes, causing
them to regain priority slowly.


User CPU Time vs. System CPU Time

This is the percentage of time the individual process spent in user mode (i.e.
having the CPU execute user code) and system mode (i.e. having the CPU execute
kernel code). This is helpful in determining where the CPU spends its time
when executing the process: user code or kernel code. It is generally
desirable to see more time in user code.
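Both the priority and nice values are visible in a standard process listing.
As a rough illustration, the following awk filter (a hypothetical helper,
assuming PRI is the seventh column of ps -el output on your release) lists
only the real-time processes:

# ps -el | awk 'NR == 1 || $7 <= 127'    (header line plus priorities <= 127)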


5–12. SLIDE: Activities that Utilize the CPU

Activities that Utilize the CPU

•  Process management
•  File system I/O
•  Memory management activities
•  System calls
•  Applications (for example, CAD-CAM and database processes)
•  Batch jobs

Student Notes

Examples of activities that place a load on the CPU include the following.

System Activities

System activities are those activities which execute in kernel mode. Examples
of system activities include system processes and user processes executing
system calls.

•  Process startup
•  Process scheduling
•  File system and raw I/O
•  Memory management
•  Handling of system calls


User Activities

User activities are those activities that execute in user mode.

•  CAD/CAM applications
•  Database processing
•  Client/server applications
•  Compute-bound applications
•  Background jobs (i.e. batch jobs)


5–13. SLIDE: glance — CPU Report

glance — CPU Report

B3692A GlancePlus B.10.12    05:00:42   e2403roc    9000/856  Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |                                            25%     20%    47%
Disk Util  |                                            12%      6%    23%
Mem  Util  |                                            85%     83%    85%
Swap Util  |                                            18%     18%    18%
--------------------------------------------------------------------------------
CPU REPORT                                                         Users=    4
State            Current    Average       High       Time    Cum Time
--------------------------------------------------------------------------------
User                18.9        6.0       32.3       0.96        3.61
Nice                 0.0        2.4        5.7       0.00        1.47
Negative Nice        0.4        0.8       16.2       0.02        0.51
RealTime             0.4        0.4        0.7       0.02        0.22
System               3.3        7.0       16.2       0.17        4.21
Interrupt            1.8        1.7        2.7       0.09        1.02
ContextSwitch        0.6        0.7        1.4       0.03        0.40
Traps                0.0        0.0        0.0       0.00        0.00
Vfaults              0.0        0.7        3.6       0.00        0.45
Idle                74.6       80.2       91.2       3.79       48.18

Top CPU user: PID 2097, dthelpview    19.5% cpu util
Active CPUs: 1                                                     Page 1 of 2

Student Notes

The glance CPU report (c key) provides details on where the CPU is spending
its time from a global perspective.

•  User mode: This is time spent by the CPU in user mode for all processes on
   the system. This includes processes with a nice value of 20 (user),
   processes with nice values between 21-39 (nice), processes with nice values
   between 0-19 (negative nice), and realtime priority processes.

•  System mode: This is time spent by the CPU in system mode for all processes
   on the system. It includes time spent handling general system calls
   (system), and time spent handling interrupts, context switches, traps, and
   Vfaults (virtual faults).

•  Load Average: This is the number of jobs in the CPU run queue averaged over
   three time intervals. It includes the average length of the run queue over
   the last 1 minute, the last 5 minutes, and the last 15 minutes. The CPU
   load average data is viewable on page 2 of this glance report. Also on page
   two are the System Call Rate, the Interrupt Rate, and the Context Switch
   Rate.


5–14. SLIDE: glance — CPU by Processor

glance — CPU by Processor

B3692A GlancePlus B.10.12    05:13:18   e2403roc    9000/856  Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |                                            25%     20%    47%
Disk Util  |                                            12%      6%    23%
Mem  Util  |                                            85%     83%    85%
Swap Util  |                                            18%     18%    18%
--------------------------------------------------------------------------------
CPU BY PROCESSOR                                                   Users=    4
CPU  State     Util    LoadAvg(1/5/15 min)    CSwitch    Last Pid
--------------------------------------------------------------------------------
 0   Enable    25.4     0.6/ 0.4/ 0.3           72187        1061
                                                                   Page 1 of 2

CPU  Util   User  Nice  NNice  RealTm   Sys  Intrpt  CSwitch  Trap  Vfault
--------------------------------------------------------------------------------
 0   25.4   20.7   0.0    0.0     0.0   4.7     0.0      0.0   0.0     0.0
                                                                   Page 2 of 2

Student Notes

The glance CPU-by-processor report (a key) provides details on a per CPU
basis.

CPU Utilization: This is the CPU utilization for the specific processor. If
two or more processors exist on the system, the Global CPU Util bar graph
shows an average CPU utilization. That is, a CPU that is 100% utilized and a
second CPU that is 0% utilized will display 50% CPU utilization. This report
displays utilization on a per processor basis.

Load Average: This is the number of processes, on average, in the CPU run
queue over the last 1 minute, 5 minutes, and 15 minutes. This report displays
CPU run queue information on a per processor basis.

Page two of this display shows the Utilization broken down into User mode,
Nice, Negative Nice, Realtime, System, Interrupts, Context Switches, Trap and
Virtual Faults on a per-processor basis.


5–15. SLIDE: glance — Individual Process

glance — Individual Process

B3692A GlancePlus B.10.12    15:17:52   e2403roc    9000/856  Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |                                            22%     29%    51%
Disk Util  |                                             1%      7%    13%
Mem  Util  |                                            91%     91%    91%
Swap Util  |                                            25%     24%    35%
--------------------------------------------------------------------------------
Resource Usage for PID: 16013, netscape   PPID: 12988   euid: 520   User: sohrab
--------------------------------------------------------------------------------
CPU Usage (sec) :  3.38    Log Reads :  166    Wait Reason    : SLEEP
User/Nice/RT CPU:  2.43    Log Writes:   75    Total RSS/VSS  : 22.4mb/ 28.3mb
System CPU      :  0.73    Phy Reads :    4    Traps / Vfaults: 414/ 8
Interrupt CPU   :  0.14    Phy Writes:   61    Faults Mem/Disk: 0/ 0
Cont Switch CPU :  0.08    FS Reads  :    4    Deactivations  : 0
Scheduler       :  HPUX    FS Writes :   29    Forks & Vforks : 0
Priority        :   154    VM Reads  :    0    Signals Recd   : 339
Nice Value      :    24    VM Writes :    0    Mesg Sent/Recd : 775/ 1358
Dispatches      :  1307    Sys Reads :    0    Other Log Rd/Wt: 3924/ 957
Forced CSwitch  :   460    Sys Writes:   32    Other Phy Rd/Wt: 0/ 0
VoluntaryCSwitch:   814    Raw Reads :    0    Proc Start Time
Running CPU     :     0    Raw Writes:    0    Fri Feb  6 15:14:45 1998
CPU Switches    :     0    Bytes Xfer:  410kb

Student Notes

The glance individual process report (s key followed by the PID) displays CPU
usage for an individual process, and the distribution of CPU time when
executing the process (user, system, interrupt, context switch). Ideally, a
process should spend more time in User/Nice/RT mode than in any of the other
three modes.

Also displayed on a per-process basis are the Priority and Nice values for the
selected process. In addition, the total number of forced context switches
(time slice expiration or process preemptions) and voluntary context switches
(process putting itself to sleep) are displayed.


5–16. SLIDE: glance — Global System Calls

glance — Global System Calls

B3692A GlancePlus B.10.12    05:17:52   e2403roc    9000/856  Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |                                            25%     20%    47%
Disk Util  |                                            12%      6%    23%
Mem  Util  |                                            85%     83%    85%
Swap Util  |                                            18%     18%    18%
--------------------------------------------------------------------------------
GLOBAL SYSTEM CALLS                                                Users=    4
System Call Name       ID     Count     Rate     CPU Time      Cum CPU
--------------------------------------------------------------------------------
syscall-0               0        16      3.1      0.05921      2.19037
fork                    2         0      0.0      0.00000      0.01398
read                    3       105     20.5      0.00210      0.07625
write                   4        47      9.2      0.00208      0.13624
open                    5        16      3.1      0.00143      0.03146
close                   6        16      3.1      0.00040      0.00848
wait                    7         1      0.1      0.00011      0.00031
time                   13        46      9.0      0.00023      0.00446
chmod                  15         0      0.0      0.00000      0.00009
ioctl                  54       503     57.8      0.00900      0.79813
poll                  269       277     48.5      0.00983      1.83466

Cumulative Interval:  87 secs                                      Page 1 of 7

Student Notes

The glance global system calls report (Y key) displays all the system calls
that have been executed system-wide. When system CPU utilization is high, this
report can be used to identify on which system calls the CPU is spending most
of its time.


5–17. SLIDE: glance — System Calls by Process

glance — System Calls by Process

B3692A GlancePlus B.10.12    05:39:20   e2403roc    9000/856  Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util  |                                            22%     29%    51%
Disk Util  |                                             1%      7%    13%
Mem  Util  |                                            91%     91%    91%
Swap Util  |                                            25%     24%    35%
--------------------------------------------------------------------------------
System Calls for PID: 1822, netscape     PPID: 1775    euid: 503    User: roc
                                        Elapsed                        Elapsed
System Call Name    ID   Count    Rate     Time    Cum Ct   CumRate    CumTime
--------------------------------------------------------------------------------
read                 3     477    93.5   0.16884      742      49.1    0.24275
write                4     219    42.9   0.02831      352      23.3    0.06787
open                 5      63    12.3   0.01396       99       6.5    0.02491
close                6       9     1.7   0.00046       20       1.3    0.00104
time                13      34     6.6   0.00031       89       5.8    0.00083
brk                 17      27     5.2   0.00171       45       2.9    0.00264
lseek               19      69    13.5   0.00150      135       8.9    0.00304
stat                38       4     0.7   0.00131       13       0.8    0.00415
ioctl               54     636   124.7   0.01463     1167      77.2    0.02813
utssys              57       0     0.0   0.00000        3       0.1    0.00013

Cumulative Interval:  15 secs                                      Page 1 of 3

Student Notes

While examining an individual process, the system calls generated by that
particular process can be viewed using the L key. When the system time
utilization is high for an individual process, this report can be used to view
the specific system calls the process is performing, how many times the system
calls are being invoked, and (most importantly) how much time is being spent
by the CPU to execute the system calls. The read() and write() system calls
often take the most time, as they require physical I/O to the disk drives.


5–18. SLIDE: sar Command

sar Command

$ sar option

Options:

-u    CPU Utilization (usr, sys, wio, idle)
-q    Queue lengths/utilization (run, swap)
-M    Above information in per-processor format
-c    System calls

Student Notes

The sar command can be used to display global statistics on several important
CPU operations.

Using the -u option, information can be displayed on the time the system spent
in User mode, System mode, Waiting for (disk) I/O, and idle. The Waiting for
(disk) I/O is not reported by any other tool. Other tools simply lump it in
with idle time. An example of the sar output with the -u option is shown
below:

# sar -u 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24    %usr    %sys    %wio   %idle
08:32:29      64      36       0       0
08:32:34      61      39       0       0
08:32:39      61      39       0       0
08:32:44      61      39       0       0

Average       61      39       0       0


Using the -q option, information can be displayed on the length and
utilization of the run queue and the swap queue. We are most interested at
this time in the run queue. An example of the sar output with the -q option is
shown below:

# sar -q 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:33:24   runq-sz  %runocc  swpq-sz  %swpocc
08:33:29         8      100        0        0
08:33:34         8      100        0        0
08:33:39         8      100        0        0
08:33:44         8      100        0        0

Average          8      100        0        0

The -M option is always used in conjunction with -u and/or -q. It causes the
metrics to be broken down by processor, so you can see how each processor is
being utilized.

The -c option shows the total number of system calls being executed per second
and singles out four specific system calls for further detail. They are the
read(), write(), fork(), and exec() system calls. Also reported on this
display is the average number of characters transferred in or out each second.
An example of this output follows:

# sar -c 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:33:24  scalls/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
08:33:29       332        3        9    0.00    0.00    38630     2657
08:33:34       435        4       24    0.00    0.00    30310     2662
08:33:39       270        3       14    0.00    0.00     6758        0
08:33:44       524       20       15    0.20    0.20    73523        0

Average        390        7       15    0.05    0.05    37187     1331


5–19. SLIDE: timex Command

timex Command

$ timex prime_med

real       25.65
user       20.71
sys         3.43

Student Notes

The timex command can be used to benchmark how long the execution of a
particular process takes in seconds. The command measures:

•  real time – the amount of elapsed time from when the program started to
   when the program completed (sometimes referred to as the "wall clock"
   time).

•  user time – the amount of time spent by the program executing in user mode.

•  sys time – the amount of time spent by the program executing in kernel
   mode.

The example on the slide shows a total of 25.65 seconds elapsed from when the
program prime_med started to when it completed. The execution spent 20.71
seconds executing in user mode and 3.43 seconds executing in kernel mode. The
difference between user + system and real time is attributed to time the
process spent not running on the CPU. The process may not get CPU time either
because it was waiting on some resource (like disk or CPU) or because it was
in a sleep state waiting for an event (like waiting for a child process to
finish executing).
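The off-CPU time for the slide's example can be derived directly from the
three reported numbers with a little shell arithmetic:

# echo "25.65 - (20.71 + 3.43)" | bc     (time not spent on the CPU)
1.51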


5–20. SLIDE: Tuning a CPU-Bound System — Hardware Solutions

Tuning a CPU-Bound System — Hardware Solutions

•  Upgrade to a faster processor
•  Upgrade the system with a larger data/instruction cache
•  Add a processor to a multiprocessor system
•  Spread applications to multiple systems

Student Notes

Practically speaking, the easiest performance gains are usually achieved by
adding more and faster hardware. This could be upgrading to a faster
processor, upgrading to a processor with more cache, adding another processor,
or buying another system and off-loading some applications to the second
system.

Upgrading to a faster processor may be possible with a simple module swap,
but, more than likely, it would involve upgrading your entire system to a
newer model. Some system models support two or three different processor
speeds, and yours may not have the fastest available processors. If so, you
may be able to upgrade the system's processors to faster versions without
touching the rest of the system.

Nowadays, it's unlikely that you'll be able to upgrade the cache memory or TLB
to larger sizes. Each processor chip comes with a predetermined amount of
cache and a fixed-size TLB. Only by going to a different processor chip (and
thus a different model) will you be able to affect the cache memory and TLB
sizes.

If your system is not yet at its full complement of processors, it may relieve
your workload to add more processors. If you have a cell-based architecture,
you may be able to add more processors to each cell, or even add more cells.
Some servers come with extra processors


installed, but not enabled. These systems have a feature called ICOD (Instant Capacity On Demand). By simply contacting HP, these disabled processors can be enabled, giving you more processing power with a minimum of time. If, at a later date, those processors are no longer needed, they can be disabled in a similar fashion. Finally, if you have a system which is heavily loaded and another system which is lightly loaded, it may be possible to transfer some of the tasks from the busy system to the one which is less busy. The disadvantage of these solutions is that most of them cost money.


5–21. SLIDE: Tuning a CPU-Bound System — Software Solutions

Tuning a CPU-Bound System — Software Solutions

•  Nice less important processes
•  Anti-nice more important processes
•  Consider using rtprio or rtsched on most important processes
•  Run batch jobs during non-peak hours
•  Consider using PRM/WLM
•  Consider using the processor affinity call mpctl()
•  Optimize/recompile application

Student Notes

If the easiest performance gains come from upgrading the hardware, the
greatest gains are likely to come from improving the software. A system with
the fastest and most current hardware can still run slowly if the software is
not configured properly.

One way to improve the performance of specific processes is to improve the
priority of those processes. You can do this by improving the process's nice
value or by making the process a real-time process. Or, you can reduce the
nice value of other processes. Be careful when promoting a process to real
time. If the process is not well-behaved, it can take over your entire system.
By well-behaved, we mean that it is not compute bound and it is free of
serious bugs.

Running batch jobs at non-peak hours has been a standard performance solution
for many years on many systems. Other software performance improvements can be
realized by using PRM (Process Resource Manager), WLM (Workload Manager), or
the mpctl() system call.
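For example (the program paths and PID below are hypothetical; check nice(1),
renice(1M), and rtprio(1) on your system for the exact option syntax before
relying on these forms):

# nice -n 10 /opt/batch/bin/report &     (start a less important job with a
                                          weaker nice value)
# renice -n 5 -p 4321                    (weaken an already-running process)
# rtprio 64 /opt/app/bin/critical &      (start a well-behaved process at
                                          real-time priority 64)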


5–22. SLIDE: CPU Utilization and MP Systems

CPU Utilization and MP Systems

[Slide figure: two processors, each with its own TLB, cache, and coprocessor,
sharing a process in memory across the system bus.]

Is each processor pulling its weight? The sar -uqM command string can help you
monitor the CPU loading on the individual processors in an MP system.

Student Notes

The sar command can be utilized to report CPU utilization for the overall
system on a per-processor basis (when the -u and -M options are specified). In
addition the -q option will report average run queue length while occupied,
and percent of time occupied. Both of these metrics can assist in the
evaluation of CPU loading and should be considered before making processor
affinity calls.

top can also show you how your CPU resource is being distributed over the
system. It automatically breaks down the load and utilization percentages on a
per-processor basis when invoked.

Remember, when you are running a system that supports Partitions (NPars or
VPars), these tools only show you what is happening within a partition, as
each partition has booted its own copy of the operating system and is acting
as an independent system.


5–23. SLIDE: Processor Affinity

Processor Affinity

[Slide figure: two processors, each with its own TLB, cache, and coprocessor,
sharing a process in memory across the system bus; an mpctl() call assigns the
process to processor 2.]

The mpctl() system call assigns the calling process to a specific processor.

Student Notes

The mpctl() system call provides a means for determining how many processors
are installed in the system (or partition), how many processors are in this
pset, and assigning processes or threads to run on specific processors (also
known as processor affinity) or within specific psets, and much, much more.
Refer to the man page for mpctl() on your system.

Much of the functionality of this capability is highly dependent on the
underlying hardware. An application that uses this system call should not be
expected to be portable across architectures or implementations.

Processor sets are supported by the pset() system call. If your version of the
operating system supports psets, refer to the man page for pset() for full
details.
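From the shell you can at least confirm how many processors the partition
contains, and on releases that ship the mpsched command (an assumption to
verify on your system) a process can be bound without writing any code:

# ioscan -fkC processor             (list the installed processors)
# mpsched -c 1 /opt/app/bin/app     (assumed mpsched syntax: launch the
                                     hypothetical app bound to processor 1;
                                     see mpsched(1M) if present)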


5–24. LAB: CPU Utilization, System Calls, and Context Switches

Directions

General Setup

Create a working data file in a separate file system (on a separate disk, if
possible). If another disk is available:

# vgdisplay -v | grep Name    (Note which disks are already in use by LVM)
# ioscan -fnC disk            (Note any disks not mentioned above, select one)
# pvcreate -f
# vgextend vg00

In either case:

# lvcreate -n vxfs vg00
# lvextend -L 1024 /dev/vg00/vxfs
# newfs -F vxfs /dev/vg00/rvxfs
# mkdir /vxfs
# mount /dev/vg00/vxfs /vxfs
# prealloc /vxfs/file

The lab programs are under /home/h4262/cpu/lab0

# cd /home/h4262/cpu/lab0

The tests should be run on an otherwise idle system — otherwise results are
unpredictable. If the executables are missing, generate them by typing:

# make all

CPU Utilization: System Call Overhead

Use the dd command to size the read and write operations. Thus their number
can be varied to change the number of system calls used to transfer the same
amount of information. Then we can see the overhead of the system call
interface.

The first command loads the entire file into buffer cache.

# timex dd if=/stand/vmunix of=/dev/null bs=64k

Now we take our measurements.

# timex dd if=/stand/vmunix of=/dev/null bs=64k

real __________    user __________    system ____________


# timex dd if=/stand/vmunix of=/dev/null bs=2k

real __________    user __________    system ____________

# timex dd if=/stand/vmunix of=/dev/null bs=64

real __________    user __________    system ____________

System Calls and Context Switches

This lab shows you the maximum system call and context switch rates that your
system can take. Three programs are supplied:

•  syscall     loads the system with system calls of one type
•  filestress  (shell script) generates file system-related system calls
•  cs          loads the system with context switches

1. What is the system call rate when your system is "idle"? ________________

2. Run filestress in the background. What is the system call rate now? What
   system calls are generated by filestress? Take an average with sar over
   about 40 seconds, i.e.

   # sar -c 10 4

3. Terminate the filestress process by entering the following commands:

   # kill $(ps -el | grep find | cut -c24-28)
   # kill $(ps -el | grep find | cut -c18-22)

4. Run the syscall program and again answer question 2. Is the system call
   rate lower or higher than with filestress? Why?

   _____________________________________________________________________

   Kill the syscall program before proceeding.

   # kill $(ps -el | grep syscall | cut -c18-22)

5. Using cs, compare the number of context switches on an idle system and a
   loaded system.

   Idle ________        Loaded ______________

6. Kill the cs program, remove the /vxfs/file, and dismount the /vxfs
   filesystem.

   # kill $(ps -el | grep cs | cut -c18-22)
   # rm -f /vxfs/file
   # umount /vxfs


5–25. LAB: Identifying CPU Bottlenecks

Directions

The following labs are designed to show the symptoms of a CPU bottleneck.

Lab 1

1. Change directory to /home/h4262/cpu/lab1

   # cd /home/h4262/cpu/lab1

2. Start the processes running in the background.

   # ./RUN

3. Start a glance session and answer the following questions.

   What is the CPU utilization? _______
   What are the nice values of the processes receiving the most CPU time? _______
   What is the average number of jobs in the CPU run queue? ______

4. Characterize the 8 lab processes that are running (proc1-8). Which are CPU
   hogs? Memory hogs? Disk I/O hogs etc. Identify processes that you think are
   in pairs.

   ________________________________________________________________________
   ________________________________________________________________________
   ________________________________________________________________________
   ________________________________________________________________________

5. Determine the impact of this load on user processes. Time how long it
   takes for the short baseline to execute.

   # timex /home/h4262/baseline/short &

   How long did the program take to execute? _______

6. Compare your results to the baseline established in the lab exercise in
   module 1, step 5.

7. End the CPU load by executing the KILLIT script.

   # ./KILLIT


Lab 2

1. Change directory to /home/h4262/cpu/lab2.

   # cd /home/h4262/cpu/lab2

2. Start the processes running in the background.

   # ./RUN

3. In one terminal window, start glance. In a second terminal window run

   # sar -u 5 200

   Answer the following questions:

   What does glance report for CPU utilization? _______
   What does sar report for CPU utilization? ________
   What is the priority of the process receiving the most CPU time? _______
   How much time is the process spending in the sigpause system call? ______
   How is the process being context switched (forced or voluntary)? ______

4. Determine the impact of this load on user processes. Time how long it
   takes for the short baseline to execute.

   # timex /home/h4262/baseline/short &

   How long did the program take to execute? _______

5. End the CPU load by executing the KILLIT script.

   # ./KILLIT


Module 6 — Memory Management

Objectives

Upon completion of this module, you will be able to do the following:

•  Describe how the HP-UX operating system performs memory management.
•  Describe the main performance issues that involve memory management.
•  Describe the UNIX buffer cache.
•  Describe the sync process.
•  Identify the symptoms of a memory bottleneck.
•  Identify global and process memory metrics.
•  Use performance tools to diagnose memory problems.
•  Specify appropriate corrections for memory bottlenecks.
•  Describe the function of the serialize command.


6–1. SLIDE: Memory Management

Memory Management

[Slide figure: virtual memory spans both physical memory and swap space on
disk.]

Student Notes

Memory management refers to the subsystem within the kernel that is
responsible for managing the main memory (also known as RAM) of the computer.
When managing main memory, the kernel allocates memory pages (default size is
4 KB) to processes as they need space. When main memory runs low on free
space, the kernel will try to free up some pages in memory by copying those
pages out to swap space on disk. The swap space can be thought of as an
extension of main memory (like an overflow area) that is used when main memory
becomes full. Processes paged out to the swap area cannot be referenced again
until they are paged back in to main memory.

The term virtual memory refers to how much memory the kernel perceives as
being available for allocation to processes. When the kernel allocates space
to a process, it must track that page for the life of the process. Virtual
memory includes main memory and swap space, as pages allocated to processes
may be moved to swap space.

Example

In the slide, there are three different processes being tracked: a one-page
process, a two-page process, and a three-page process. The one-page process
started in main memory and was subsequently paged to swap space. The two-page
process is entirely resident in main memory. And the three-page process has
been partially paged to swap space (two of three pages are on swap). From a
virtual memory standpoint, the three processes are taking up six pages of
memory: three pages in main memory and three pages on swap.

The preceding example is pretty simple. Reality is a little more complex.
Processes actually consist of two basic types of pages, text and data. The
data pages have "write" capabilities and thus their contents must be preserved
when they are moved out of memory (to swap space). The text pages cannot be
modified by the executing program. They are initially read in from the file
system. If the memory manager should want to release the space that a text
page is taking, it does not have to copy it out to swap, or even back to the
file system.


6–2. SLIDE: Memory Management — Paging

Memory Management — Paging

[Slide figure: a grid of memory pages being swept by the vhand process's
two-handed clock – the reference hand clears reference bits and the free hand
frees pages that are still unreferenced. Legend: 1 = page is being referenced;
0 = page is NOT being referenced; F = memory page freed by the vhand process.]

Student Notes

The vhand daemon is responsible for keeping a minimum amount of memory free on
the system at all times. The vhand daemon does this by monitoring free pages
and trying to keep their number above a threshold to ensure sufficient memory
for efficient demand paging.

The vhand daemon utilizes a "two-handed" clock algorithm as seen on the slide.
The first hand (also known as the "reference" hand or "age" hand) clears
reference bits on a group of pages in an active part of memory. If the bits
are still clear by the time the second hand (also known as the "free" hand or
"steal" hand) reaches them, the pages are paged out.

The kernel automatically keeps an appropriate distance between the hands,
based on the available paging bandwidth, the number of pages that need to be
stolen, the number of pages already scheduled to be freed, and the frequency
in which vhand runs. In essence, the distance between the hands determines how
"aggressive" vhand is behaving. It behaves more aggressively as the memory
pressure increases.


6–3. SLIDE: Paging and Process Deactivation

Paging and Process Deactivation

[Slide figure: free memory pages in non-kernel memory plotted against the
paging/scanning rate. Below LOTSFREE, paging begins, with the possibility of
stabilization; below DESFREE, paging continues at the maximum rate, with no
possibility of stabilization; below MINFREE, process deactivation begins to
occur.]

Student Notes

The system uses a combination of paging and deactivation to manage the amount
of free memory. A minimum amount of free memory is needed to allow the demand
paging system to work properly. No paging occurs until the free memory falls
below a threshold called LOTSFREE. Upon falling below LOTSFREE, paging will
occur at a minimum level – becoming more aggressive as the number of free
pages decreases. If the demand for memory continues, then paging will
continue. However, if the demand for memory subsides, then there is a
possibility that the amount of free memory will stabilize below the LOTSFREE
threshold.

If free memory falls below a second threshold called DESFREE, then there is no
possibility of stabilization (until free memory goes back above DESFREE) and
the paging rate becomes much more aggressive compared to the initial paging
rate.

Finally, if free memory falls below MINFREE, then process deactivation begins.
A process is chosen by the kernel to be deactivated, and it is placed on the
deactivation queue. Because the process is deactivated (therefore its pages
are not being referenced) vhand will be able to page all its pages (including
the uarea) out to the swap partition. The process will be reactivated
automatically once free memory rises above MINFREE. When a process is
reactivated, only the uarea is immediately paged in. Other pages are faulted
in as needed.

Below are the default formulae for LOTSFREE, DESFREE, and MINFREE.
(NKM = Non-Kernel Memory)

             NKM <= 32 MB       32 MB < NKM <= 2 GB    NKM > 2 GB
LOTSFREE     1/8 of NKM         1/16 of NKM            64 MB
DESFREE      1/16 of NKM        1/64 of NKM            12 MB
MINFREE      1/2 of DESFREE     1/4 of DESFREE         5 MB

NOTE:  The values of LOTSFREE, DESFREE, and MINFREE were made tunable kernel
       parameters in HP-UX 11.00. Prior to the 11.00 release, these values
       were fixed and could not be changed. It is recommended by HP, however,
       that the parameters not be tuned manually.


6–4. SLIDE: The Buffer Cache

The Buffer Cache

•  Pool of memory designed to retain the most commonly accessed files from
   disk
•  Used only for file system I/O (not raw I/O)
•  Size of buffer cache controlled by dbc_min_pct and dbc_max_pct

[Slide figure: a process accesses a file through the buffer cache in memory
rather than directly from the file system on disk.]

Student Notes

The buffer cache exists to speed up file system I/O. The system tries to
minimize disk access by going to disk as infrequently as possible, because
disk access is often a bottleneck on most systems. Therefore, the most
recently- or commonly-accessed files from disk persist in the portion of
memory called the buffer cache. It is called dynamic because the size of the
buffer cache grows or shrinks dynamically, depending on competing requests for
system memory. Its minimum size is governed by the tunable parameter
dbc_min_pct, and it cannot grow larger than the size specified in dbc_max_pct.
These two parameters are expressed as percentages of total physical memory on
the system.

Let's say dbc_min_pct is set to 10, while dbc_max_pct is 50. This means that
initially 10% of physical memory is allocated to the buffer cache. As the
system needs more space to buffer files read in from disk, the buffer cache
will allocate more memory, and this will continue until it occupies 50% of
memory, its maximum size. Later, when the system requires more memory for
another use, say processes, the buffer cache could shrink an appropriate
amount, but will never be less than the 10% minimum value. Therefore, a larger
buffer cache is able to hold more files and will minimize their access time
but will leave less memory available for other uses.


NOTE:  The buffer cache is dynamic in nature only when two other tunable
       parameters, bufpages and nbuf, are both set to their default values
       of 0.

Another example: if dbc_min_pct and dbc_max_pct are both set to the same value, say 20, the kernel will always use exactly that percentage of physical memory for the buffer cache.
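A minimal sketch of checking the bounds and converting them to megabytes,
assuming an HP-UX 11.x system where kmtune is available and, as an example,
128 MB of physical memory:

# kmtune -q dbc_min_pct
# kmtune -q dbc_max_pct
# echo "128 * 10 / 100" | bc      (minimum cache at dbc_min_pct=10: 12 MB)
# echo "128 * 50 / 100" | bc      (maximum cache at dbc_max_pct=50: 64 MB)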


6–5. SLIDE: The syncer Daemon

The syncer Daemon

•  All entries stay in the buffer cache for a minimum of 30 seconds before
   being flushed.
•  The syncer daemon runs once every 6 seconds and flushes 20% of the buffer
   cache to disk.

[Slide figure: the syncer daemon flushes modified buffer cache entries in
memory out to the file on disk.]

Student Notes

For disk writes, data flows from the buffer cache to disk. How does it get to
the buffer cache? The kernel writes data to it. The syncer process takes care
of flushing data in the buffer cache to the files on the disk. When a user
edits a file, makes changes to that file, and saves the changes, those changes
do not go to disk right away. The kernel writes the data to the buffer cache,
and some time later (within 60 seconds) the data finally arrives at the disk.
This time period is chosen as a balance between ensuring that the file system
is fairly up-to-date in case of a crash and efficiently performing disk I/O.

There are many applications that do not rely on the operating system's
built-in processes to flush data to disk, but instead take over that operation
themselves. In other words, they create their own buffers and manage the
flushing at appropriate intervals. A common example is a database application
that needs to guarantee the completion of a transaction within a specified
time interval.


6–6. SLIDE: IPC Memory Allocation

IPC Memory Allocation

# ipcs -mob
IPC status from /dev/kmem as of Sat Feb 14 06:53:27 1998
T      ID     KEY         MODE         OWNER    GROUP    NATTCH    SEGSZ
Shared Memory:
m       5     0x06347849  --rw-rw-rw-  root     root          0    77384
m       7     0x000c0568  --rw-------  root     root          2   131516

[Slide figure: two processes, each built from text, data, shared memory, and
shared library segments; both processes' shared memory segments map to the
same shared memory area in physical memory.]

Student Notes

UNIX implements interprocess communications using different mechanisms. Three
mechanisms that require additional system memory are semaphores, shared
memory, and message queues.

•  Semaphores are used to synchronize memory resources between competing
   processes.

•  Shared memory segments are resources capable of holding (in memory) large
   amounts of data that can be shared between processes.

•  Message queues hold strings of information (messages) that can be
   transferred between processes. Two types of processes that utilize message
   queues are networking and database processes.

Shared memory provides a mechanism to reduce interprocess communication costs
significantly. Two processes that are ready to share data map the same portion
of shared memory into their addressable space. Changes made to the shared
memory are seen immediately by all processes and do not require kernel
services. So from a kernel perspective, other than initially setting up the
shared memory, there is very low cost in using shared memory.


On the slide, each process has a shared memory segment that references one and
the same shared memory area. The more processes that allocate shared memory
segments, the higher the memory usage.

The shared memory segments in physical memory can be viewed with the
ipcs -mob command or a reporting tool like glance. From time to time, they
might have to be cleaned up or removed manually if an application terminates
ungracefully. This is done by the superuser with the ipcrm command.

A worthwhile baseline measurement for a system administrator is to run the
ipcs -mob command during a quiet period. It is also eye opening to repeat this
command when the system is at its busiest.
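A simple way to take and later compare that baseline (the file path here is
arbitrary):

# ipcs -mob > /var/adm/ipcs.baseline          (capture during a quiet period)
# ipcs -mob | diff /var/adm/ipcs.baseline -   (at peak load, show what changed)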


6–7. SLIDE: Memory Metrics to Monitor — Systemwide

Memory Metrics to Monitor — Systemwide

•  Is vhand Active?
   –  Pages scanned by vhand (SR)
   –  Pages freed by vhand (FR)
   –  Pages paged out

•  Is swapper Active?
   –  Processes deactivated (SO)
   –  Amount of free memory relative to
      -  lotsfree
      -  desfree
      -  minfree

•  Size of dynamic buffer cache
•  Size of IPC Shared memory segments

Student Notes

The utilization of memory can be monitored in a number of different ways.
There are multiple tools and multiple metrics that monitor memory usage. The
first metrics you want to look at are those that will tell you whether vhand
is active.

Pages Scanned by vhand

This is the number of pages the vhand process has scanned (i.e. dereferenced
with the reference hand) when looking for pages to free in memory. This tells
you that vhand is actively scanning pages in an attempt to free them up. There
is some memory pressure.

Pages Freed by vhand

This is the number of pages the vhand process has freed (i.e. the reference
bit was still dereferenced when the free hand looked at it). The ratio between
pages scanned and pages freed indicates how successful the vhand process is
when looking for memory pages to free.


Amount of Paging

This indicates the level of disk activity to the swap partition. If a
consistent amount of paging to swap space is occurring, then performance is
impacted (most likely significantly).

Next, check to see if the swapper is active.

Process Deactivations

This indicates that processes are being deactivated, meaning free memory has
fallen below the MINFREE threshold. There is severe memory pressure.

Amount of Free Memory

This indicates the severity of the free memory situation. If free memory has
fallen below LOTSFREE, then we know some paging has taken place. vhand is
active. If it is below DESFREE, then the situation is more severe, and much
more paging is occurring. vhand is aggressively active. Finally, if free
memory is below MINFREE, then a high level of paging and process deactivation
is occurring. vhand and swapper are both active.

To determine what the values are for lotsfree, desfree, and minfree, use the
following commands:

# echo "lotsfree/D" | adb -k /stand/vmunix /dev/mem
# echo "desfree/D" | adb -k /stand/vmunix /dev/mem
# echo "minfree/D" | adb -k /stand/vmunix /dev/mem

The settings for these three values in the kernel will then be displayed in 4K
pages. You can then compare them to the current size of the free page list.
These values will not change, unless you change the size of Non-Kernel Memory.
(Remember the formulas shown earlier?)

Size of Dynamic Buffer Cache

This is the amount of memory being consumed by the buffer cache. If memory is
full and the buffer cache is large, it will most likely cause paging, since
the buffer cache typically shrinks slower than the rate at which new memory is
needed. Heavy disk I/O demands may prevent the buffer cache from shrinking at
all.

Size of IPC Shared Memory Segments

This is the amount of memory used for interprocess communications. Of special
interest will be the number and sizes of shared memory segments, as these can
be quite large, especially if graphical applications or a database management
system is being used.


6–8. SLIDE: Memory Metrics to Monitor — per Process

Memory Metrics to Monitor — per Process

•  Size of RSS/VSS
•  Size of text, data, and stack segments
•  Number of shared memory segments
•  Amount of time blocked on virtual memory

Student Notes

Individual processes vary greatly in terms of the amount of memory they use.
Metrics to monitor memory utilization on a per-process basis include the
following:

Size of RSS/VSS

The Resident Set Size (RSS) for a process is the portion of the process (in
KB) that is currently resident in physical memory. Since the entire process
does not have to be resident in memory in order to execute, this shows how
much of the process is actually resident in memory.

The Virtual Set Size (VSS) for a process is the total size of the process (in
KB). This indicates that if the entire process were to be loaded, this is how
much memory the entire process would consume. Very rarely is the entire
process resident in memory. If the entire process were in memory, then the RSS
value would be equal to the VSS value.
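Outside of glance, the XPG4 (UNIX95) version of ps can report a per-process
virtual set size. A rough one-liner, assuming the vsz format specifier is
supported on your release (the column is reported in KB):

# UNIX95= ps -e -o vsz,pid,comm | sort -rn | head    (ten largest VSS values)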

Size of Text, Data, and Stack Segments

These are the RSS and VSS sizes for the three main components of a process.
Since every process has a single text, data, and stack segment, these values
should be monitored, especially for large processes. The data segment is the
most likely to be large.


Each of these three segments has a maximum size to which it can grow – limited
by tunable kernel parameters. They are maxtsiz, maxdsiz, and maxssiz for a
32-bit process. They are maxtsiz_64bit, maxdsiz_64bit, and maxssiz_64bit for a
64-bit process. If a process tries to grow one of these segments beyond its
maximum size, then the process terminates (and in some cases "core dumps").

Number and Size of Shared Memory Segments

These are the shared memory segments to which a process is attached. The
maximum size of a shared memory segment is limited by the kernel parameter
shmmax. The number of shared memory segments a process can attach to is
limited by the kernel parameter shmseg.

Amount of Time Spent Blocked on Virtual Memory

This is the amount of time the process was prevented from executing because it
was waiting (or blocked) on a text or data page to be paged in.


6–9. SLIDE: Memory Monitoring vmstat Output

Memory Monitoring vmstat Output

#=> vmstat -n 5
VM
memory           page                                    faults
   avm    free   re   at   pi   po   fr   de   sr      in      sy     cs
  9140    3824    3    4    0    0    0    0    0     675     824    140
CPU
cpu           procs
us  sy  id    r    b    w
 9   5  86    1  100    0
...

Student Notes

A useful command for viewing virtual memory statistics is vmstat. The slide shows vmstat's output being updated every 5 seconds. When viewing vmstat's output, always keep an eye on the po (pages paged out per second) column. Ideally you want this to be zero, indicating that no paging out is occurring. The fr (pages freed per second) and sr (pages scanned per second) columns show the actual behavior of the vhand paging daemon.

Output Headings

procs
   r     In run queue
   b     Blocked for resources (I/O, paging, and so on)
   w     Runnable or short sleeper (less than 20 seconds) but deactivated

memory
   avm   Active virtual pages (pages of processes that ran during the last 20 seconds)
   free  Size of the free list (in 4-KB pages)

page
   re    Page reclaims per second
   at    Address translation faults (page faults) per second
   pi    Pages paged in per second
   po    Pages paged out per second
   fr    Pages freed per second
   de    Anticipated short-term memory shortfall
   sr    Pages scanned by the clock algorithm per second

faults
   in    Non-clock device interrupts per second
   sy    System calls per second
   cs    CPU context switches per second

cpu
   us    Percentage of time the CPU spent in user mode
   sy    Percentage of time the CPU spent in system mode
   id    Percentage of time the CPU was idle

with -S option
   si    Processes reactivated per second
   so    Processes deactivated per second
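If you need some of these global figures from a program instead of from vmstat's display, HP-UX provides them through pstat(2). The sketch below assumes the pstat_getdynamic() call with the psd_free and psd_avg_1_min fields named in <sys/pstat.h>; treat those names as assumptions and verify them on your release.

/* Sketch: a crude free-memory sampler in the spirit of vmstat's
 * "free" column.  psd_free and psd_avg_1_min are assumed from
 * pstat(2)/<sys/pstat.h>; verify them on your OS release.        */
#include <stdio.h>
#include <unistd.h>
#include <sys/pstat.h>

int main(void)
{
    struct pst_dynamic pd;
    long pgkb = sysconf(_SC_PAGE_SIZE) / 1024;   /* KB per page */

    for (;;) {
        if (pstat_getdynamic(&pd, sizeof(pd), 1, 0) != 1) {
            perror("pstat_getdynamic");
            return 1;
        }
        printf("free: %ld KB   load(1 min): %.2f\n",
               (long)pd.psd_free * pgkb, (double)pd.psd_avg_1_min);
        sleep(5);                                /* 5-second interval */
    }
}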


6–10. SLIDE: Memory Monitoring glance — Memory Report

Memory Monitoring glance — Memory Report

B3692A GlancePlus B.10.12      17:33:59  e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                          |    22%    29%    51%
Disk Util                                          |     1%     7%    13%
Mem  Util                                          |    91%    91%    91%
Swap Util                                          |    25%    24%    35%
--------------------------------------------------------------------------------
MEMORY REPORT                                                         Users=  19
Event             Current  Cumulative  Current Rate  Cum Rate  High Rate
--------------------------------------------------------------------------------
Page Faults            78         287           7.5      24.3      139.3
Paging Requests         3          21           0.2       1.7       12.0
KB Paged In          52kb       336kb           5.0      28.4      189.3
KB Paged Out          0kb         0kb           0.0       0.0        0.0
Reactivations           0           0           0.0       0.0        0.0
Deactivations           0           0           0.0       0.0        0.0
KB Reactivated        0kb         0kb           0.0       0.0        0.0
KB Deactivated        0kb         0kb           0.0       0.0        0.0
VM Reads                3           6           0.2       0.5        2.0
VM Writes               0           0           0.0       0.0        0.0

Total VM : 78.9mb    Sys Mem :  10.6mb    User Mem: 78.0mb    Phys Mem: 128.0mb
Active VM: 23.4mb    Buf Cache: 19.1mb    Free Mem: 20.3mb          Page 1 of 1

Student Notes glance has extensive memory monitoring abilities. Like vmstat, it can give paging statistics, in addition to showing if any processes are being deactivated. Remember, this is an indication of severe memory shortage. There is other valuable information on this report, such as the statistics at the bottom showing the current Dynamic Buffer Cache size, the current amount of Free Memory, and the total Physical Memory in the system.


6–11. SLIDE: Memory Monitoring glance — Process List

Memory Monitoring glance — Process List

B3692A GlancePlus B.10.12      14:52:27  e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                          |    22%    29%    51%
Disk Util                                          |     1%     7%    13%
Mem  Util                                          |    91%    91%    91%
Swap Util                                          |    25%    24%    35%
--------------------------------------------------------------------------------
PROCESS LIST                                                          Users=  11
                                     User      CPU Util    Cum    Disk           Thd
Process Name     PID   PPID  Pri    Name      (100 max)    CPU   IO Rate    RSS  Cnt
--------------------------------------------------------------------------------
netscape       16013  12988  154  sohrab     12.9/14.0    64.9  0.0/ 0.6  14.7mb   1
supsched          18      0  100  root        2.9/ 2.1   942.6  0.0/ 0.0    16kb   1
lmx.srv         1219   1121  154  root        1.6/ 0.9   389.4  0.5/ 0.0   2.7mb   1
glance         15726  15396  156  root        0.6/ 0.9     2.0  0.0/ 0.2   4.0mb   1
statdaemon         3      0  128  root        0.6/ 0.7   302.1  0.0/ 0.0    16kb   1
midaemon        1051   1050   50  root        0.4/ 0.4   201.4  0.0/ 0.0   1.3mb   2
ttisr              7      0  -32  root        0.4/ 0.3   121.0  0.0/ 0.0    16kb   1
dtterm         15559  15558  154  roc         0.4/ 0.4     1.6  0.0/ 0.0   6.2mb   1
rep_server      1098   1084  154  root        0.2/ 0.1    23.7  0.0/ 0.0   2.0mb   1
syncer           325      1  154  root        0.2/ 0.0    20.2  0.1/ 0.0   1.0mb   1
xload          13569  13531  154  al          0.2/ 0.0     2.4  0.0/ 0.0   2.6mb   1
                                                                      Page 1 of 13

Student Notes The glance Process List report can be used to monitor process statistics, including how much memory processes are currently consuming. The highlighted column, RSS (Resident Set Size), shows memory being used on a per-process basis. Very simply put, this helps to identify the "memory hogs" on the system. For example, the process called netscape has an RSS of 14.7 MB, while statdaemon is minimal. Other large processes include glance, xload, and dtterm. What do all these processes have in common? They are all GUI (graphical user interface) programs running as windows in a graphical window environment. Moral: programs that open their own windows are relatively memory-intensive and should be minimized. Users should be encouraged not to leave several windows open on their screens if they do not have a continuing need for them.


6–12. SLIDE: Memory Monitoring glance — Individual Process

Memory Monitoring glance — Individual Process

B3692A GlancePlus C.03.70.00   15:52:03  r206c42    9000/800   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                          |    15%    15%    15%
Disk Util                                          |     1%     0%     2%
Mem  Util                                          |    96%    96%    96%
Swap Util                                          |    15%    15%    15%
--------------------------------------------------------------------------------
Resources   PID: 28030, glance   PPID: 27993   euid: 0   User: root
--------------------------------------------------------------------------------
CPU Usage (util):   0.1   Log Reads :    1   Wait Reason    : STRMS
User/Nice/RT CPU:   0.1   Log Writes:    0   Total RSS/VSS  : 3.6mb/ 5.6mb
System CPU      :   0.0   Phy Reads :    0   Traps / Vfaults: 1/ 10
Interrupt CPU   :   0.0   Phy Writes:    0   Faults Mem/Disk: 6/ 0
Cont Switch CPU :   0.0   FS Reads  :    0   Deactivations  : 0
Scheduler       :  HPUX   FS Writes :    0   Forks & Vforks : 0
Priority        :   154   VM Reads  :    0   Signals Recd   : 0
Nice Value      :    10   VM Writes :    0   Mesg Sent/Recd : 0/ 0
Dispatches      :     6   Sys Reads :    0   Other Log Rd/Wt: 38/ 172
Forced CSwitch  :     0   Sys Writes:    0   Other Phy Rd/Wt: 0/ 0
VoluntaryCSwitch:     4   Raw Reads :    0   Proc Start Time
Running CPU     :     0   Raw Writes:    0   Tue Mar 16 15:49:14 2004
CPU Switches    :     0   Bytes Xfer:  0kb

C - cum/interval toggle        % - pct/absolute toggle          Page 1 of 1

Student Notes

The glance Individual Process report displays memory usage for an individual process, including the RSS and VSS sizes for the process. Also displayed on a per-process basis are the VM reads and VM writes being performed by the process. These indicate how much paging from/to the swap device the individual process is performing. If performance is poor for an individual process, this is a good field to check.


6–13. SLIDE: Memory Monitoring glance — System Tables

Memory Monitoring glance — System Tables

B3692A GlancePlus C.03.70.00   15:58:40  r206c42    9000/800   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                          |    15%    15%    15%
Disk Util                                          |     0%     0%     4%
Mem  Util                                          |    96%    96%    96%
Swap Util                                          |    15%    21%    45%
--------------------------------------------------------------------------------
SYSTEM TABLES REPORT                                                  Users=   1
System Table              Available   Requested      Used      High
--------------------------------------------------------------------------------
Inode Cache (ninode)           2884          na       645       645
Shared Memory                12.5gb                 11.1mb
Message Buffers               800kb          na       0kb       0kb
Buffer Cache                314.4mb          na   314.4mb        na
Buffer Cache Min             32.0mb
Buffer Cache Max            320.0mb
DNLC Cache                     8004

Model: 9000/800/A400-6X   OS Name: HP-UX   OS Release: B.11.11   OS Kernel Type: 64 bits
Phys Memory : 640.0mb     Network Interfaces : 2
Number CPUs : 1           Number Swap Areas  : 2
Number Disks: 2           Avail Volume Groups: 1
Mem Region Max Page Size: 1024mb                                    Page 2 of 2

Student Notes

The glance System Tables report displays the size of kernel tables in memory and the current utilization of these tables. It is important not to set the size of these tables too large, as the tables are memory resident (and the bigger the table, the more memory it consumes). Yet it is even more important that enough resources be allocated so that the kernel does not have to wait for a resource to become free (or even error out) when a particular resource is requested. The Available column displays the total size of the particular table, and the Used column shows how many entries within the table are currently being used. In general, the Used value should not be close to the Available value. If it is, then the kernel is close to running out of that particular resource. The High column shows the high-water mark for the resource since glance has been running. Also of interest in this report are the buffer cache statistics, especially the Buffer Cache line, which shows the current size of the buffer cache.


NOTE:

There are two pages to this report. Shown here is the second page of this report. More system tables are shown on the first page.


6–14. SLIDE: Tuning a Memory-Bound System — Hardware Solutions

Tuning a Memory-Bound System — Hardware Solutions

• Add more physical memory
• Reduce usage of X terminals

Student Notes

An obvious hardware solution to a memory bottleneck is to add more physical memory. While this solution requires an outlay of money, it may pay for itself quickly by saving the system administrator hours of time looking for ways to reduce memory consumption. If adding more memory is not an option, then a second hardware suggestion is to look at the use of X terminals on the system. An X terminal typically consumes a large portion of memory: roughly 3 to 4 MB of memory for light application usage, and as much as 10 to 20+ MB for heavy application usage. These figures do not take into account any additional RAM that the system will use for window managers or any other X-related overhead.


6–15. SLIDE: Tuning a Memory-Bound System — Software Solutions

Tuning a Memory-Bound System — Software Solutions

• Look for unnecessary processes:
  – Extra windows
  – Screen savers
  – Long strings of child processes
• Reduce dbc_max_pct (maximum size of the dynamic buffer cache).
• Identify programs with memory leaks.
• Check for unreferenced shared memory segments.
• Use the serialize command to reduce process thrashing.
• Use PRM to prioritize memory allocation.

Student Notes

Quite often, users will run X-windows type programs to enhance the look of their desktop. Examples include an X-eyes program, a bouncing ball program, or fancy screen savers. All of these graphical programs consume system resources, including memory.

The biggest consumer of memory will most likely be the buffer cache. We saw earlier that if the buffer cache is dynamic, it will grow to its maximum size, as long as memory is available. The problem arises when a process needs additional memory and free memory is below LOTSFREE: the buffer cache is slow to shrink (if it shrinks at all!), causing paging to occur among the processes. To prevent this situation, the tunable parameter dbc_max_pct should be set to limit the maximum size to which the buffer cache can grow. A recommendation for dbc_max_pct is 25 or less.

Programs with memory leaks will allocate memory and then stop using it, without returning it to the system for use elsewhere. These programs may require you to shut them down periodically to release the memory. They may even require you to reboot the system occasionally to reclaim the memory. There are a number of third-party tools, such as Purify, that will help you locate memory leaks in applications.
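To make the failure mode concrete, here is a minimal, contrived sketch of a leaking process (this is not the source of the lab's leak programs, which is not shown here): the pointer to each allocation is overwritten on the next pass, so the memory can never be freed and the data segment grows until maxdsiz is hit.

/* A contrived memory leak: each pass allocates 1 MB, touches it so the
 * pages become resident, then abandons the pointer.  malloc() returns
 * NULL once the data segment reaches maxdsiz.                          */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        char *p = malloc(1024 * 1024);
        if (p == NULL) {                 /* maxdsiz reached */
            perror("malloc");
            return 1;
        }
        memset(p, 1, 1024 * 1024);       /* fault the pages in */
        sleep(1);
        /* no free(p): the 1 MB just allocated is now leaked */
    }
}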


Unreferenced shared memory segments can also be a problem. An application sets one up and then forgets to deallocate it when the application exits. Here is a possible procedure for locating abandoned shared memory segments.

First, look for any shared memory segments that have no processes attached to them:

# ipcs -ma

Note which shared memory segments have a "0" in the NATTCH column. If they are owned by root, let them stay. Otherwise, write down their ID numbers and their CPID numbers.

Second, one at a time, find out whether the creating process still exists:

# ps -el | grep <CPID>

If it does, it is probably just a quiescent segment. But if not, the segment is probably abandoned. Finally, remove the segment:

# ipcrm -m <ID>

The serialize command will be discussed later in this chapter. You may wish to use PRM to control your memory resource and its allocation.
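Returning to the abandoned-segment procedure: the check-and-remove step can also be scripted. Here is a small sketch using only the standard System V shmctl(2) calls; the segment ID on the command line is whatever ipcs printed, and the program name and -r flag are hypothetical choices for this illustration.

/* Sketch: check (and optionally remove) one shared memory segment.
 * usage: ./shmcheck <shmid> [-r]    -- <shmid> comes from `ipcs -ma` */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(int argc, char *argv[])
{
    struct shmid_ds ds;
    int shmid;

    if (argc < 2) {
        fprintf(stderr, "usage: %s shmid [-r]\n", argv[0]);
        return 2;
    }
    shmid = atoi(argv[1]);

    if (shmctl(shmid, IPC_STAT, &ds) == -1) {
        perror("shmctl(IPC_STAT)");
        return 1;
    }
    printf("shmid %d: %lu bytes, NATTCH = %lu\n", shmid,
           (unsigned long)ds.shm_segsz, (unsigned long)ds.shm_nattch);

    /* remove only when requested and no process is attached */
    if (argc > 2 && ds.shm_nattch == 0) {
        if (shmctl(shmid, IPC_RMID, NULL) == -1) {
            perror("shmctl(IPC_RMID)");
            return 1;
        }
        printf("segment %d removed\n", shmid);
    }
    return 0;
}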


6-16: SLIDE: PA-RISC Access Control

PA-RISC Access Control

[Slide graphic: the most frequently used Access ID keys are resident in a control register; the remaining Access ID keys for a memory resource are stored in the kernel tables.]

Student Notes

Since we are discussing system memory and performance, there is one other topic we should think about: hardware-based memory page access control. The processor architecture has several features that ensure a process thread cannot access areas of physical memory that are not part of its process space. An in-depth discussion of page access control is presented in the HP-UX training course "Inside HP-UX" (course number H5081S), and we won't attempt to recreate it here. There is one particular aspect of this hardware feature that we will spend some time discussing, though, and that is protection IDs. Every discrete region of virtual memory assigned to a process (text space, private data space, shared memory space, shared library data space, etc.) is assigned a unique ID "key", called an Access Key. Any process attempting to access that memory space must have a copy of a matching ID "key", called a Protection Key. To speed things up, the most frequently or most likely used Protection Keys are kept in processor registers. (These registers are part of a process thread's "context" and are preserved across switches and interrupts.) The hardware performs the protection check as part of the actual memory access instruction.


Now here is the catch: there is only room in the control registers for a limited number of frequently used Protection Keys. The rest are stored in kernel space in memory management tables, which are accessed when a protection ID fault occurs. The fault handler will search for and find these other "keys" when they are needed, but at the cost of CPU cycles! To better understand the dynamics of this process, consider the following analogy:

The Key Ring

I have many keys to many locks around my home and office. It is not practical to carry all of my keys around with me all the time due to their bulk and weight. To solve this problem I have two key rings. One is small and has only those keys that I need on a daily basis: my car key, house key, desk key, and garage key. The other key ring is large and bulky with dozens of other miscellaneous keys: my workshop, tool boxes, garden shed, lawnmower (wish I could lose that one!), boat ignition, etc. This method is a blessing and a curse. When I need to start the car or unlock the front door, the key I need is readily available in my pocket and I can quickly gain access. When I actually have time to go fishing, it is always a hassle to go find my utility key ring and remember to take the boat key with me. (Once I actually hauled the boat all the way to the lake, several miles away from my home, only to realize that I had not remembered the boat key!) To somewhat address this problem, I now move the boat key to my everyday key ring during the summer months (replacing the snow-blower key) and reverse the procedure in the fall.

The HP-UX kernel performs a similar process every time a protection ID fault occurs: the fault handler moves the key it had to search for into the register context of the thread, replacing the least recently used key. PA-RISC 1.x has room for 4 keys in the register context, while PA-RISC 2.x has room for 8 keys. IA-64 has room for at least 16 keys. Depending on how frequently a process moves from one memory region to another, the number of protection ID faults will vary. With the larger number of protection registers in the later processors, protection register thrashing has become much less of a problem than it has been in the past. It should also be noted that shared library regions on 11.x were modified to use a type of "skeleton" key, i.e., a key that always matches, so that attempted access to them will never result in a protection ID fault.


6–17. SLIDE: The serialize Command

The serialize Command

[Slide graphic: kernel and OS tables plus 500 MB of available memory; four processes (I, J, K, L) on the swap space, each CPU bound, large (400 MB), and running at timeshare priority.]

Student Notes The serialize (1) command can help if a system has a number of large processes and is experiencing memory pressures. The serialize command will allow these big processes to run one after another, instead of running all at the same time. By running the processes sequentially, rather than in parallel, the CPU can spend more time executing the process code (i.e. user mode) and less time managing the competing processes (i.e. kernel mode).

Thrashing On systems with very demanding memory needs (for example, systems that run many large processes), the paging daemons can become so busy moving pages in and out that the system spends too much time paging and not enough time running processes. When this happens, system performance degrades rapidly, sometimes to such a degree that nothing seems to be happening. At this point, the system is said to be thrashing, meaning it is doing more overhead work than productive work.


How serialize Helps Reduce Thrashing

All processes marked via the serialize command will run serially with other processes marked the same way. The serialize command addresses the problem caused when a group of large processes all try to make forward progress at once, which degrades throughput. In such a case, each process constantly faults in its working set, only to have the pages stolen when another process starts running. By using the serialize command to run large processes one at a time, the system can make more efficient use of the CPU, as well as system memory.

Let's look at the example on the slide. We have a system with 500 MB of available memory. We are trying to execute four processes. Each process is CPU bound, has large memory requirements (400 MB), and has a timeshare priority level. The first process (I) executes. As it executes, its pages are faulted into memory. At the end of its timeslice (typically 100 ms), it is switched out and process J is started. As it executes, it pages in a large number of pages, forcing the pages belonging to process I to be paged out. 100 ms later, process J is switched out and process K starts up, pulling its pages into memory and pushing the other processes' pages out. The system spends so much time pulling pages in and pushing pages out that it literally has no time left to perform any useful work.

The culprit here is the timeslice. OK, we could simply disable timeslicing altogether via the tunable parameter (timeslice). But that may be overkill, more than we want to do. After all, it's just these four processes that are causing the thrashing. A better solution would be to "serialize" these processes. When you do that, each process executes until it either voluntarily gives up the CPU or it is preempted by a stronger priority process, which will happen much less frequently than the timeslice! Thus more real work will get done and much less paging will be needed. In 10.20, the kernel was given the authority to serialize processes automatically, if it detects that memory thrashing is taking place and it can identify which processes are responsible for the thrashing.
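As an illustration of usage (the syntax here follows our reading of serialize(1); confirm the options against the manual page on your release): # serialize ./bigjob starts bigjob already marked for serialized execution, while # serialize -p <pid> marks a process that is already running. Marked processes then run serially only with respect to other marked processes; everything else on the system is scheduled as usual.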


6–18. LAB: Memory Leaks

There are several performance issues related to memory management: memory leaks, swapping/paging, protection ID thrashing, and so on. Let's investigate a few of them.

1. Change directories to /home/h4262/memory/leak:

   # cd /home/h4262/memory/leak

   Memory leaks occur when a process requests memory (typically through the malloc() or shmget() calls) but doesn't free the memory once it finishes using it. The five processes in this directory all have memory leaks to different degrees.

2. Before starting the background processes, look up the current value for maxdsiz, using the kmtune command on 11i v1 and the kctune command on 11i v2.

   On the rp2430:
   # kmtune -lq maxdsiz

   On the rx2600:
   # kctune -avq maxdsiz

   The default maxdsiz on 11i v2 is 1 GB. This will make proc1 very slow in reaching its limits. You can change maxdsiz to a more reasonable number for this lab exercise:

   # kctune maxdsiz=0x10000000
   WARNING: The automatic 'backup' configuration currently contains the
            configuration that was in use before the last reboot of this system.
        ==> Do you wish to update it to contain the current configuration
            before making the requested change? n
   NOTE:    The backup will not be updated.
          * The requested changes have been applied to the currently
            running system.
   Tunable            Value       Expression  Changes
   maxdsiz  (before)  1073741824  Default     Immed
            (now)     0x10000000  0x10000000

Also take some vmstat readings to satisfy yourself that the system is not under memory pressure. How much free memory do you have?

   # vmstat 2 2


3. Use the RUN script to start the background processes: # ./RUN

4. Open another window. Start glance. Sort the processes by CPU utilization (should be the default), and answer the following questions — fairly quickly, before the memory leaks get too large.

   • What is the current amount of free memory?
   • What is the size of the buffer cache?
   • Is there any paging to the swap space?
   • How much swap space is currently reserved?
   • Which process has the largest Resident Set Size (RSS)?
   • What is the data segment size of the process with the largest RSS?

5. After several minutes, the proc1 process should reach its maximum data size. If your maxdsiz is set to 1 GB, this could take a while. Please be patient. Observe the behavior of the system when this occurs.

   • What happens when the process reaches its maximum data size?
   • Why does disk utilization become so high at this point?

6. As the other processes grow toward their maximum data segment size, continue to monitor the following:

   • Free memory
   • Swap space reserved
   • The size of the processes' data segments
   • The RSS of the processes
   • The number of page-outs/page-ins to the swap space


7. Run the two baseline programs, short and diskread. # timex /home/h4262/baseline/short # timex /home/h4262/baseline/diskread How does the performance of these programs compare to their earlier runs?

8. When finished monitoring the behavior of processes with memory leaks, clean up the processes.

   • Exit glance.
   • Execute the KILLIT script:

     # ./KILLIT

   • If you changed maxdsiz, change it back:

     # kctune maxdsiz=0x40000000
     WARNING: The automatic 'backup' configuration currently contains the
              configuration that was in use before the last reboot of this system.
          ==> Do you wish to update it to contain the current configuration
              before making the requested change? n
     NOTE:    The backup will not be updated.
            * The requested changes have been applied to the currently
              running system.
     Tunable            Value       Expression  Changes
     maxdsiz  (before)  0x10000000  0x10000000  Immed
              (now)     0x40000000  0x40000000


Module 7 — Swap Space Performance

Objectives

Upon completion of this module, you will be able to do the following:

• Describe the difference between swap usage and swap reservation.
• Interpret the output of the swapinfo command.
• Define and configure pseudo swap.
• Define and configure swap space priorities.
• Define and configure swchunk and maxswapchunks.


7–1. SLIDE: Swap Space Management — Simple View

Swap Space Management — Simple View

[Slide graphic: 20 one-MB processes fill the memory available for processes; the 55 MB swap area shows Reserved: 20 MB, Used: 0 MB. A new program on disk wants to execute; there is not enough space for the program to fit into memory.]

Student Notes

The purpose of "swap space" is to relieve the pressure on memory when memory becomes too full. When "free memory" falls below a certain threshold, processes (or parts of processes) are written out to the swap partition on disk in order to free up space in memory for other processes. For simplicity, the above slide assumes each process is 1 MB in size and the amount of available memory for process execution is 20 MB. The slide also assumes (for simplicity) that each process "reserves" 1 MB on the swap partition each time it executes. Therefore, since 20 processes are currently present in memory (as shown on the slide), 20 MB of swap space has been reserved, 1 MB for each process. The HP-UX operating system "reserves" swap space for each process that executes on the system. The reservation of swap space is done so that the operating system knows how much swap space "potentially" may be needed for all the processes currently running on the system. For example, if all the processes in memory were to be swapped out, the operating system would know it had enough swap space to perform that function.


Analogy

A good analogy for swap space reservation is a hotel that takes room reservations. When a hotel takes a reservation, it subtracts one from the count of available rooms. If a hotel had 55 rooms and it took 20 reservations, then it would have only 35 rooms still available, even though none of the 55 rooms were currently occupied. The same holds true for swap space. In the above example, a total of 55 MB of swap space exists, and 20 MB of the space is "reserved" by processes currently running in memory, even though none of the processes are currently using the swap space they have reserved. To take the analogy even further, the hotel does not earmark a particular room to satisfy a reservation. Room assignments are done when the occupant shows up at the front desk. Likewise, a swap reservation is not associated with a particular block out on the swap device. Only when the kernel actually wants to move a page in memory out to the swap device does it select a block. It knows it has the swap space available; it just doesn't know where it is until it needs to use it.
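The reservation behavior is easy to see from a program. The following is a minimal sketch (not one of the lab programs): it repeatedly allocates memory without ever touching it, and each allocation reserves backing swap, so swapinfo's reserve figure climbs even though nothing has been paged out. On a real system the loop may instead stop early when the data segment hits maxdsiz.

/* Sketch: watch `swapinfo -mt` in another window while this runs.
 * Each malloc() causes swap to be *reserved*; no page-out occurs
 * because the memory is never touched.                             */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    unsigned long mb = 0;

    for (;;) {
        if (malloc(16UL * 1024 * 1024) == NULL) {   /* 16 MB */
            printf("reservation refused after %lu MB\n", mb);
            return 0;   /* no swap left to reserve (or maxdsiz hit) */
        }
        mb += 16;
        printf("reserved about %lu MB so far\n", mb);
        sleep(2);       /* time to look at swapinfo */
    }
}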

Current Situation In the above slide, all the memory is in use by the 20 processes. Now assume a new program from disk wants to execute. What happens? How does it fit in memory if all the memory is in use?


7–2. SLIDE: Swap Space — After a New Process Executes

Swap Space — After a New Process Executes

[Slide graphic: Reserved: 20 MB, Used: 1 MB. Numbered arrows show (1) a process being written out to swap, (2) the swap counters being updated, (3) the new program reserving swap space, and (4) the new program being copied into memory.]

Student Notes

Below is the basic sequence of steps that occurs when a new process wants to execute and there is not enough memory available:

1. The operating system selects a process (or portion of a process) to be written out to the swap partition on disk. The process selected is one that is not expected to execute in the near future.

2. Once the process is written to the swap partition, the amount of swap space used is incremented accordingly and the amount of swap space reserved is decremented by the same amount.

3. The new program that wants to execute reserves swap space for itself. The amount of swap space reserved is incremented accordingly.

4. The new program is copied into memory and the operating system initializes the process. The new process uses the physical memory that was just freed.


7–3. SLIDE: The swapinfo Command

The swapinfo Command

# swapinfo -mt
             Mb      Mb      Mb   PCT   START/      Mb
TYPE      AVAIL    USED    FREE  USED    LIMIT  RESERVE  PRI  NAME
dev          32       1      31    3%        0        -    1  /dev/vg00/lvol2
localfs      23       0      23     -     none        0    1  /home/paging
reserve       -      20     -20
total        55      21      34   38%        -        0    -

Student Notes

The swapinfo command displays important swap-related information, including how much swap space is used and how much swap space is reserved. With today's systems, we recommend that you always use the -m option to display all spaces in MB rather than the default KB. The swapinfo -mt command shows information related to device (raw) swap partitions and file system swap space and their totals, including:

Mb AVAIL      The total amount of swap space available. For file system swap, this value may vary, as more swap space is needed.

Mb USED       The current amount of swap space being used.

Mb FREE       The current amount of swap space free. Mb FREE plus Mb USED is equal to Mb AVAIL.

PCT USED      The percentage of swap space in use on that device.

START/LIMIT   Applies only to file system swap. Specifies the starting block within the file system of the paging file. The LIMIT specifies the maximum size to which the paging file can grow.

Mb RESERVE    Applies only to file system swap, and is only applicable when no limit is given to the maximum size of the paging file. In these situations, this value specifies how much file system space to reserve for user files on the file system.

PRI           The priority of the swap area. The highest priority swap areas are used first. The swap priorities range from 0-10. (Note: stronger priority swap areas have smaller priority numbers.)

The swapinfo command also shows how much swap space all the processes on the system are currently reserving. This is indicated by the reserve entry. The columns described above for device and file system swap do not apply to the reserve entry in the output of the swapinfo command.

In the example, there are 32 MB of device swap on a raw disk and 23 MB of swap in the /home file system, making a total of 55 MB. 1 MB is in use on the device swap and 20 MB are reserved, leaving 34 MB available.
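Worked from the sample output above: Mb AVAIL total = 32 + 23 = 55 MB; Mb USED total = 1 (dev) + 20 (reserve) = 21 MB; Mb FREE total = 55 - 21 = 34 MB; and PCT USED = 21 / 55, roughly 38%, which is exactly what the total line reports.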


7–4. SLIDE: Swap Space Management — Realistic View

Swap Space Management — Realistic View

[Slide graphic: Initial allocation - Reserved: 0 MB, Used: 0 MB, Swap Avail: 55 MB. Current allocation, with 20 processes in memory - Reserved: 20 MB, Used: 0 MB, Swap Avail: 35 MB. A new program wants to execute; there is not enough memory for the program to fit.]

Student Notes

An earlier slide implied that specific space was allocated on a swap device for each process running in memory. The analogy was of a hotel subtracting one from the count of available rooms when a customer phoned in for a reservation. As mentioned earlier, specific space is not allocated on a swap device for a reservation. Instead, a variable called SWAP_AVAIL is maintained. The SWAP_AVAIL variable is initialized when the system boots to equal the total amount of swap space available. As each new process begins executing, this variable is decremented according to the amount of swap space the process would need if its entire contents were to be swapped out. When a process terminates, it returns the amount of swap space it reserved back to the SWAP_AVAIL variable. The slide above shows what the SWAP_AVAIL variable would contain when 20 MB worth of processes is executing on the system. Each process has caused the SWAP_AVAIL variable to be decremented, but no specific space has been allocated on the swap partition. No specific swap space is allocated until processes need to be paged out, as shown on the next slide.
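Before moving to the next slide, here is the SWAP_AVAIL bookkeeping in miniature, as a conceptual C sketch rather than actual kernel source; it simply restates the rules described above.

/* Conceptual sketch only -- not HP-UX kernel source. */
long swap_avail;                  /* set at boot to total swap space   */

int reserve_swap(long need)       /* process start or growth           */
{
    if (swap_avail < need)
        return -1;                /* reservation refused: cannot start */
    swap_avail -= need;
    return 0;
}

void release_swap(long amount)    /* process exit or shrink            */
{
    swap_avail += amount;
}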


7–5. SLIDE: Swap Space — After a New Process Executes

Swap Space — After a New Process Executes

[Slide graphic: Current allocation - Reserved: 20 MB, Used: 1 MB, Swap Avail: 34 MB. Numbered arrows show (1) the page-out to the swap area, (2) the new program reserving swap, and (3) the program being copied into memory.]

Student Notes

This is an updated description of the sequence of events that occurs when a program is being executed and not enough memory is available:

• The operating system selects a process (or portion of a process) to be written out to the swap partition on disk. Since no specific swap space has been reserved, swap space is allocated from the strongest priority swap device, first available block.

• Once the process is written to the swap partition, the amount of swap space used is incremented accordingly, and the old program "unreserves" its swap space by incrementing the SWAP_AVAIL variable.

• Then the new program decrements SWAP_AVAIL to reserve its swap space. In effect, the amount of swap space reserved is decremented by the amount of space being moved out to swap space and then incremented by the new reservation amount. In the slide, the process being swapped out causes the USED swap to become 1 MB, causing SWAP_AVAIL to become 34 MB. Then the old process releases its 1 MB reservation, causing SWAP_AVAIL to increase back to 35 MB. Finally, the new process starts up and causes SWAP_AVAIL to decrease from 35 to 34 MB.

• The new program is copied into memory, and the operating system initializes the process after it has confirmed that it can successfully reserve the needed swap for the new process (SWAP_AVAIL does not go negative when the swap reservation is made).


7–6. SLIDE: Swap Space — When Memory Equals Data Swapped

Swap Space — When Memory Equals Data Swapped

[Slide graphic: Current allocation - Reserved: 20 MB, Used: 20 MB, Swap Avail: 15 MB. The 20 MB of available memory now holds a mix of the original and the newer processes, while 20 MB of process pages sit on the swap area.]

Student Notes The above slide shows the state of the system and the current swap space allocations when 20 MB (or all of available memory) has been paged out to the swap partition. The swap partition contains 20 MB worth of processes, which is the size of available memory. The initial 20 MB of processes is shaded in gray, to distinguish them from the second 20 MB of processes, which are filled with black. With this color code, we can see only 4 MB of the original processes are still loaded in memory, everything else (including 4 MB of the 21st to 40th processes) has been paged to the swap partition. The swap space allocation reflects 20 MB worth of processes that have reserved swap space, and 20 MB that is currently in use. This would be analogous to stating that a hotel received 40 room reservations, and 20 of those reservations are currently being used. The SWAP_AVAIL variable is down to 15 MB, because the total amount of swap space is 55 MB and 40 MB of that space is reserved or in use.


7–7. SLIDE: Swap Space — When Swap Space Fills Up

Swap Space — When Swap Space Fills Up

[Slide graphic: Current allocation - Reserved: 20 MB, Used: 35 MB, Swap Avail: 0 MB. A new program wants to execute, and the system responds: ERROR: no more swap space.]

Q: Could this error have been prevented?
A: YES!! Use pseudo swap.

Student Notes The above slide shows the situation when SWAP_AVAIL equals 0 MB. In this situation, the error message, ERROR: no swap space available is displayed, even though there is swap space to page an existing process to the swap partition and thus free up memory for a new program to load. The reason the system reports no swap space is available is because 35 MB of memory have been paged out, and the remaining 20 MB of swap space are reserved by the existing processes currently executing in memory.

Could this error have been prevented? From a resource perspective, the new program should be able to execute, because memory is available for the new process. A tunable OS parameter, referred to as pseudo swap, would have allowed the program to execute under these conditions.


7–8. SLIDE: Pseudo Swap

Pseudo Swap

Definition:  Pseudo swap is fictitious, make-believe swap space. It does NOT exist physically, but logically the operating system recognizes it.

Purpose:     Pseudo swap allows more swap space to be made available than physically exists.

Benefit:     Pseudo swap adds "75% of physical memory" to the amount of swap space that the operating system thinks is available. This lessens swap space requirements (especially helpful on large memory systems).

NOTE:        Pseudo swap is NOT allocated in memory!

Student Notes

Pseudo swap is HP's solution for large memory customers who do not wish to purchase a large amount of disk to use for swap space. The justification for purchasing large memory systems is to prevent paging and swapping; therefore, the argument becomes, "Why purchase a lot of device swap space if the system is not expected to page or swap?" Pseudo swap is swap space that the operating system recognizes, but in reality it does not exist. Pseudo swap is make-believe swap space. It does not exist in memory; it does not exist on disk; it does not exist anywhere. However, the operating system does recognize it, which means more swap space can be reserved than physically exists. The purpose of pseudo swap is to allow more processes to run in memory than could be supported by the swap device(s). It allows the operating system (specifically the SWAP_AVAIL variable) to recognize more swap space, thereby allowing additional processes to start when all physical swap has been reserved. By having the operating system recognize more swap space than physically exists, large memory customers can now operate without having to purchase large amounts of swap space, which they will most likely never use. The size of pseudo swap is dependent on the amount of memory in the system. Specifically, the size is (approximately) 75% of physical memory. This means the SWAP_AVAIL variable


will have an additional amount (75% of physical memory) added to its content. This additional amount allows more processes to start when the physical swap has been completely reserved. NOTE:

Pseudo swap is enabled through a tunable OS parameter called swapmem_on. If the value of swapmem_on is 1, then pseudo swap is enabled (turned on). If the value of swapmem_on is 0, then pseudo swap is disabled (turned off).
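You can confirm the current setting with the same tools the lab at the end of this module uses: # kmtune -q swapmem_on on 11i v1, or # kctune -q swapmem_on on 11i v2.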

Analogy A good analogy for pseudo swap is an airline overbooking a flight. Airlines know that customers sometimes don’t show up for their flight. If they reserved only enough seats for the plane, they would likely depart with a plane that wasn’t full – lost revenue. So they reserve more seats than actually exist on the plane, betting that a certain percentage of customers won’t show. That way they can fly a plane that is much closer to full and get more revenue. Of course, they are occasionally wrong.


7–9. SLIDE: Total Swap Space Calculation — with Pseudo Swap

Total Swap Space Calculation — with Pseudo Swap

  Memory Size            = 32 MB
  Pseudo Swap (x 0.75)   = 24 MB
  Physical Swap          = 55 MB
  -----------------------------
  Total Swap             = 79 MB

Student Notes The above slide shows how “Total Available Swap Space” (also known as SWAP_AVAIL) is calculated with pseudo swap turned on. The SWAP_AVAIL variable is calculated as all of the configured physical swap space (device and file system swap) PLUS 75% of physical memory (pseudo swap). (The calculation of the size of pseudo swap is actually more complex than given here. The resultant value of pseudo swap can vary anywhere from 67% to 88% of physical memory. But we’ll use 75% as a pretty typical figure.) In our example, the total amount of physical swap was 55 MB, and the amount of physical memory was 32 MB. Since the size of pseudo swap is estimated at 75% of physical memory, the pseudo swap size in our example is 24 MB.


This means the Total Available Swap Space (SWAP_AVAIL) is:

     55 MB  (Physical Swap)
   + 24 MB  (Pseudo Swap)
   -------
     79 MB  (Total Avail Swap)


7–10. SLIDE: Example Situation Using Pseudo Swap

Example Situation Using Pseudo Swap

[Slide graphic:
 Allocation without pseudo swap - Reserved: 20 MB, Used: 35 MB, Swap Avail: 0 MB.
 Allocation with pseudo swap    - Reserved: 20 MB, Used: 35 MB, Swap Avail: 24 MB.]

New program wants to execute; not enough memory for program to fit. With pseudo swap turned ON, the program can now execute!

Student Notes

The above slide revisits our previous situation with pseudo swap turned ON. In our previous situation, we had swap space of 55 MB, of which 35 MB was in use and the remaining 20 MB was reserved. With pseudo swap turned OFF, we saw that no new processes could start, because no physical swap space was available for reservation purposes. With pseudo swap turned ON, the total available swap space is 79 MB (not 55 MB). Therefore, when the system runs out of physical swap, it still has 24 MB (due to pseudo swap) which it thinks it can allocate and therefore can reserve. Consequently, the operating system is able to support more processes without having to allocate more physical swap space. This is important for large memory customers who do not want to purchase a lot of swap space on disk in order to support the large memory.


7–11. SLIDE: Swap Priorities

Swap Priorities

Equal priorities (two swap devices, both priority 1):
   1st chunk of swap - disk 1, chunk 1
   2nd chunk of swap - disk 2, chunk 1
   3rd chunk of swap - disk 1, chunk 2
   4th chunk of swap - disk 2, chunk 2
   The 5th chunk will be allocated on disk 1.

Unequal priorities (disk 1 at priority 1, disk 2 at priority 2):
   1st chunk of swap - disk 1, chunk 1
   2nd chunk of swap - disk 1, chunk 2
   3rd chunk of swap - disk 1, chunk 3
   4th chunk of swap - disk 1, chunk 4
   The 5th chunk will also be allocated on disk 1.

Student Notes When the HP-UX operating system needs to page something from memory to a swap device, it selects the smallest-numbered, strongest-priority swap device. A system administrator can define a priority number for each swap device on the system. The priority numbers range from 0 to 10, with 0 being the strongest priority, and 10 being the weakest priority. If multiple swap devices are available when the system needs to page out to swap, the strongest priority swap device is used. The slide shows two examples. The first example illustrates how the system behaves when two equal priority swap devices are available. In this situation, the system alternates between the two swap devices, with the first chunk of swap being allocated on swap device #1, and the second chunk of swap being allocated on swap device #2. The second example illustrates how the system behaves when two unequal priority swap devices are available. In this situation, the system will continue to allocate chunks of swap from the lowest-numbered (strongest priority) swap device. Only when that device is 100% full will the system begin allocating chunks from the second swap device.


7–12. SLIDE: Swap Chunks

Swap Chunks

[Slide graphic: two priority-1 swap devices; chunks 1 and 3 are allocated on the first device, chunks 2 and 4 on the second.]

Space on the swap device is allocated to the kernel in increments called swap chunks. The default swap chunk size is 2 MB.

Student Notes

A swap chunk is the amount of space that the operating system allocates on swap devices. The default swap chunk size is 2 MB. In the above example, two equal priority swap devices are available to the system. The system will allocate the first swap chunk on swap device #1; this chunk will be 2 MB by default. Once this swap chunk has been filled by 512 pages (page size = 4 KB), the system will allocate a second swap chunk on swap device #2. The system continues alternating swap space between the two devices in swap chunk increments. Swap chunks are also the unit in which swap space is allocated on file system swap devices. With file system swap devices, the operating system will only allocate swap space on the file system if the space is needed; if it does not need the swap space, then it does not allocate space. When it does need swap space, it allocates the file system swap space in swap chunk sizes. Files are created, each of a size equal to a swap chunk, and named hostname.N, where N is a number from 0 on up.


7–13. SLIDE: Swap Space Parameters

Swap Space Parameters

DEV_BSIZE       Device block size. This is the size (in bytes) of a block on the disk. The default size is 1024 bytes.

swchunk         This is the number of blocks to allocate to the kernel when it needs swap space. The default is to allocate swap space to the kernel in 2-MB increments, so the default value is 2048. The maximum value is 65,536.

maxswapchunks   This is the maximum number of swap chunks that can be allocated to the kernel. The default value is 256. The maximum value is 16,384.

Total swap space recognized by the kernel = maxswapchunks x swchunk x DEV_BSIZE

Defaults:   256 x 2048 x 1024 = 512 MB

Student Notes

There are two configurable parameters and one fixed, non-configurable parameter that affect swap space configurations and allocations.

DEV_BSIZE       The size in bytes of a block of disk space. The default size is 1 KB. It is not configurable.

swchunk         The number of blocks (of size DEV_BSIZE) to associate with a "chunk" of swap space, referred to as a swap chunk. The default value is 2048 blocks, or 2 MB. The maximum value is 65,536, or 64 MB.

maxswapchunks   The maximum number of swap chunks that will be recognized systemwide. The default value is 256. The maximum value is 16,384.

Using these defaults, the maximum amount of swap space that the operating system recognizes is 512 MB. This means if a system is configured physically for 1 GB of swap space, only 512 MB of the 1 GB will be used by the system. In order for the system to use the other 512 MB, the tunable OS parameter maxswapchunks needs to be increased to 512.
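As a quick check of that arithmetic: each swap chunk is swchunk x DEV_BSIZE = 2048 x 1024 bytes = 2 MB, so 1 GB of physical swap is 1024 MB / 2 MB per chunk = 512 chunks, which is why maxswapchunks must be raised from 256 to 512 in this example.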


If you were to install HP-UX on a system that had 2 GB of physical memory, the installation process would automatically increase maxswapchunks to accommodate the larger memory. In this example, it would set maxswapchunks to 1024. However, if you were to add more memory at a later date (without reinstalling the kernel), you would have to manually tune maxswapchunks to be able to allocate enough swap space and use all of your available memory. Or, use pseudo swap. In 11.23 (11i v2), maxswapchunks has been eliminated and no longer becomes an issue.


7–14. SLIDE: Summary

Summary

• Swap space reservation
• Pseudo swap
• Swap priorities
• Swap chunks
• Swap space parameters

Student Notes To summarize this module, all processes must reserve swap space by decrementing a variable called SWAP_AVAIL when they initialize. If this variable cannot be decremented, the process will not be able to start. To allow this variable to recognize more swap space than physically exists, setting a tunable parameter, swapmem_on, to 1 will turn on pseudo swap. This allows more processes to execute than the amount of swap space can support. This is not considered a problem on large memory systems, because these machines are not expected to swap. If a system does need to swap, it will swap to the lowest-numbered (strongest) priority swap device first. The priority of a swap device is specified when the device is activated. If two swap devices have the same priority, the system will alternate between the two devices. Swap chunks are the unit of disk space by which swap space is allocated. By default, the size of a swap chunk is 2 MB. By default, the system recognizes a maximum of 512 MB of swap space. If more swap space exists, the tunable parameter, maxswapchunks, must be increased, in order for the additional swap space to be recognized. If maxswapchunks is already set to the maximum value, then increase the value of swchunk.


7–15. LAB: Monitoring Swap Space

Preliminary Steps

A portion of this lab requires you to interact with the ISL and boot menus, which can only be accomplished via a console login. If you are using remote lab equipment, access your system's console interface via the GSP/MP. You may get some "file system full" messages while you are shutting down the system. You can ignore these messages.

Directions

The following lab illustrates swap reservation, configures and de-configures pseudo swap, and adds additional swap partitions with different swap priorities.

1. Use the swapinfo command to display the current swap space statistics on the system. List the MB Avail and MB Used for the following three items:

              MB Available     MB Used
   dev        ____________     ____________
   reserve    ____________     ____________
   memory     ____________     ____________

2. To see total swap space available and total swap space reserved, enter:

   # swapinfo -mt

   What is the total swap space available (including pseudo swap)?

   What is the total space reserved?


3. Start a new shell process by typing sh. Re-execute the swapinfo command and verify whether any additional swap space was reserved when the new shell process started. In this case, the difference is going to be pretty small, so let's not use the -m option. Upon verification, exit the shell.

   Is the swap space returned upon exiting the shell process?

4. Start glance and observe the Global bars at the top of the display for the duration of this step. Start a large memory process and note how much the Current Swap Utilization percentage increases in glance. Type:

   # /home/h4262/memory/paging/mem256 &

   Use the process that most closely matches your physical memory size. This should reserve a large amount of swap space. Start as many mem256 processes as possible. For best results, wait until each swap reservation is complete, by observing the incremental increases in Current Swap Utilization in glance. The system will get slower and slower as you start more mem256 processes.

   What was the maximum number of mem256 processes that could be started?

   What prevented an additional mem256 process from being started?

   Kill all mem256 processes to restore performance.

5. Recompile the kernel, disabling pseudo swap. Use the following procedure:

   11i v1 or earlier:
   # cd /stand/build
   # /usr/lbin/sysadm/system_prep -s system
   # echo "swapmem_on 0" >> system
   # mk_kernel -s ./system
   # cd /
   # shutdown -ry 0


11i v2 and later:
   # cd /
   # kctune swapmem_on=0
   NOTE:    The configuration being loaded contains the following change(s)
            that cannot be applied immediately and which will be held for
            the next boot:
            -- The tunable swapmem_on cannot be changed in a dynamic fashion.
   WARNING: The automatic 'backup' configuration currently contains the
            configuration that was in use before the last reboot of this system.
        ==> Do you wish to update it to contain the current configuration
            before making the requested change? no
   NOTE:    The backup will not be updated.
          * The requested changes have been saved, and will take effect
            at next boot.
   Tunable              Value  Expression
   swapmem_on  (now)        1  Default
               (next boot)  0  0
   # shutdown -ry 0

6. Reboot from the new kernel.

   Press any key to interrupt the boot process.
   Main menu> boot pri isl
   Interact with IPL> y
   ISL> hpux (;0)/stand/build/vmunix_test

7. Once the system reboots, log in and execute swapinfo.

   Is there a memory entry? Why or why not?

   Will the same number of mem256 processes be able to execute as earlier?

   How many mem256 processes can be started now?

   Kill all mem256 processes to restore performance.

8. If you have a two-disk system, add the second disk to vg00 (if this was not already done in a previous exercise) and build a second swap logical volume on it. This lvol should be the same size as the primary swap volume. If you do not have a second disk, continue this lab at question 13.

   If you did not add the second disk earlier:


   # vgdisplay -v | grep Name      (note the physical disks used by vg00)
   # ioscan -fnC disk              (note which disks are unused)
   # pvcreate -f /dev/rdsk/<unused disk>
   # vgextend /dev/vg00 /dev/dsk/<unused disk>

To create the new swap device on the second disk:

# lvcreate -n swap1 /dev/vg00
# lvextend -L 512 /dev/vg00/swap1

Note that in our case the primary swap is 512 MB. Check swapinfo on your system and match the size of the new swap device to the primary swap.

9. Now add the new logical volume to swap space. Ensure that the priority is the same as that of the primary swap, then check your work:

# swapon -p 1 /dev/vg00/swap1
swapon: Device /dev/vg00/swap1 contains a file system. Use -e to page
        after the end of the file system, or -f to overwrite the file
        system with paging.

Oops! Problem 1: swapon is being overly cautious. If you get this message, the memory manager has detected what appears to be a file system already on the device (probably left over from some previous use). You need to override:

# swapon -p 1 -f /dev/vg00/swap1
swapon: The kernel tunable parameter "maxswapchunks" needs to be
        increased to add paging on device /dev/vg00/swap1.

Oops! Problem 2: the kernel cannot deal with this amount of swap. If you get this message, the tunable parameter maxswapchunks is set too small to accommodate all of the new swap space, and it must be increased before rebooting. If you have this problem, use sam to double maxswapchunks, or recompile the kernel manually with the procedure below. (In 11i v2, maxswapchunks has been obsoleted and will not have to be modified.)

# cd /stand/build
# echo "maxswapchunks 512" >> system
# mk_kernel -s ./system
# cd /
# shutdown -ry 0
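Before rebooting, it may be worth confirming the value the kernel is actually using. A quick check, assuming the 11i v1 kmtune interface is available on your system:

# kmtune -q maxswapchunks     (query the current setting of the tunable)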

10. If you had to rebuild the kernel to increase maxswapchunks, reboot the system. Otherwise, skip to step 11.

Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

And now add the new swap device:

# swapon -p 1 -f /dev/vg00/swap1

Verify that the new swap space has been recognized by the kernel:

# swapinfo -mt

Done!

11. Start enough mem256 processes to make the system start paging.

12. Measure the disk I/O to see what is happening with swap space. Go to step 15 when you have finished.
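One simple way to watch the paging traffic (a sketch; the interval and count are arbitrary):

# sar -d 5 20       (watch %busy, avque, and blks/s on the disks holding the swap areas)
# swapinfo -mt      (compare the USED column of each device swap entry as paging proceeds)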

13. If you have a single-disk system, create three additional swap devices with sizes of 20 MB.

# lvcreate -L 20 -n swap1 vg00
# lvcreate -L 20 -n swap2 vg00
# lvcreate -L 20 -n swap3 vg00

List the current amount of swap space in use. If 10 MB is currently in use on a single swap device, and we activate an equal-priority swap device, what is the distribution if an additional 10 MB is paged out?

A) The distribution would be 10 MB and 10 MB.
B) The distribution would be 15 MB and 5 MB.

Prior to activating these swap devices, make note of the amount of swap space currently in use. When the new swap devices are activated with equal priority, all new paging activity will be spread evenly over these swap devices.
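To verify your answer once paging begins, a quick check (sketch):

# swapinfo -mt      (watch the USED column for each dev entry; equal-priority devices should fill evenly)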

14. Activate the newly created swap devices. Activate two with a priority of 1, and the third with a priority of 2.

# swapon -p 1 /dev/vg00/swap1
# swapon -p 2 /dev/vg00/swap2
# swapon -p 1 /dev/vg00/swap3

Start enough mem256 processes to make the system start paging.

Is the new paging activity being distributed evenly across the paging devices?

15. When finished with the lab, reboot the system as normal (do not boot vmunix_test) to re-enable pseudo swap and remove the additional swap devices.

For 11i v1 and earlier, follow this procedure:

# cd /
# shutdown -ry 0

For 11i v2 and later, follow this procedure:

# cd /
# kctune swapmem_on=1
# shutdown -ry 0

Module 8 — Disk Performance Issues

Objectives

Upon completion of this module, you will be able to do the following:

• List three ways disk space can be used.
• List disk device files.
• Identify disk bottlenecks.
• Identify kernel system parameters.

8–1. SLIDE: Disk Overview

Disk Overview

[Slide diagram: a disk shown from three perspectives: the physical view (tracks on platters), the logical view (a linear series of data blocks), and the internal cylinder view (cylinders 0 through N-1)]

Student Notes

Disks are used to store data for the operating system and the applications. A disk can be used several different ways, but they boil down to just two: file system and raw. If a disk holds a file system, several structures are built on the disk (using the data blocks of the disk) to support the software in the kernel, which needs to access and manage the file system files and their contents. If the disk is to be used raw (such as a device swap space or an application database), no kernel structures are built out on the disk. The related code simply reads, manages, and organizes the data blocks as it sees fit.

There are several types of file systems available with the HP-UX 10.x and 11.x releases. The two primary types of local file systems are HFS (High performance File System), which was the original file system for HP-UX and has been continually enhanced since, and JFS (Journaled File System), which was introduced with the HP-UX 10.01 release and continues to grow in popularity and functionality. In the near future, you should see another type of file system become available for HP-UX: the Advanced File System (AdvFS), ported over from Tru64 UNIX.

In later modules, we will discuss the performance issues that pertain to each of the available file systems. In this module, we'll address the issues pertaining to all disks.

Physical View

From a physical disk perspective, the disk drive upon which a file system is placed contains sectors, tracks, platters, and read/write heads. A key behavior of almost all disk drives is that the read/write heads move in parallel across the platters, in such a way that each read/write head is over the same track within each platter at the same time.

To maximize the I/O throughput of the disk, it is desirable to minimize the amount of head movement. To help achieve this goal, all the sectors in a cylinder are addressed in sequential order.

Cylinder Analogy

Consider a health spa or gym with three floors. Each floor contains a jogging track, and the three jogging tracks are located directly above or beneath one another from floor to floor. From this point of view, a cylinder would be all the same lanes from each floor's jogging track. In other words, all lane 1 tracks would make up cylinder 1; all lane 2 tracks would make up cylinder 2; and so on.

By organizing space on disks in cylinders, the software can logically distribute its sectors across all platters of the disk evenly and uniformly. For example, in the slide above, the first 6 sectors would be allocated as follows:

block #1: Platter #1, Track #1, Sector #1
block #2: Platter #1, Track #1, Sector #2
block #3: Platter #1, Track #1, Sector #3
block #4: Platter #1, Track #1, Sector #4
block #5: Platter #1, Track #1, Sector #5
block #6: Platter #1, Track #1, Sector #6

By allocating disk space in this manner, a multiple-block read (say, 6 blocks) could be satisfied in one operation.

Logical View

From a logical view, each cylinder is simply a repository for a certain amount of data, which can be read or written without having to move the heads. This data area is further broken down into blocks. The block is the most fundamental unit of data that can be read from or written to the disk. We mentioned in an earlier chapter a value in the kernel called DEV_BSIZE. It is equal to 1024 bytes. This is the block size from the kernel's perspective.

The disk can be viewed as simply a series of blocks running from block 0 to block N-1, where N is the total number of blocks on the disk. The closer two blocks are to each other, the more likely they will be in the same cylinder. If they are in the same cylinder, a minimum amount of time is needed to read or write both blocks.
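To relate this logical view to a real disk, you can ask the drive for its geometry. A quick check (the device file is a placeholder; substitute one of your own disks):

# diskinfo -v /dev/rdsk/c0t6d0    (reports the size, bytes per sector, and other device details)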

8–2. SLIDE: Disk I/O — Read Data Flow

Disk I/O — Read Data Flow

1. Process issues read system call (logical I/O generated).
2. Block to be read is not in buffer cache; physical I/O is issued.
3. Block on disk is accessed through seek, latency, and transfer.
4. Data is read into buffer cache, completing physical I/O request.
5. Data is returned to process, completing the logical I/O and system call.

[Slide diagram: process, buffer cache, disk I/O queue, and file system in memory, with the five steps above marked along the read data path (seek, latency, and transfer at the disk)]

Student Notes

Up to this point, we have looked at I/O from the standpoint of the disk. The following slide illustrates disk I/O activities from the standpoint of memory and the process initiating the I/O. The assumption here is that we are dealing with a disk that has a file system on it, so the buffer cache becomes a factor in the operation. If this were a raw disk, the buffer cache would be bypassed by all I/O operations.

Asynchronous vs. Synchronous Reads

There are two possible approaches to doing reads: synchronous and asynchronous. By default, any read will be synchronous; i.e., the process will wait (and sleep, if necessary) until the data can be transferred to the data area of the process.

If the read is asynchronous, the process informs another driver (an asynchronous I/O driver) that it will need certain data in the future. The driver fetches the data from the disk and places it in the buffer cache while the process continues with other operations. When the data is in the cache, the driver signals the process and the read is then executed. The data is guaranteed to be in the buffer, and the process never has to sleep. Asynchronous reads are significantly more difficult to program, so they are used only in the more sophisticated applications.

Buffered Read Data Flow

The flow diagram on the slide highlights the main actions from the time a process issues a read() system call to when the data is returned to the process.

1. A process issues the read() system call. This is viewed by the kernel as a logical I/O, meaning the kernel will satisfy the request any way it can, either through the buffer cache or by performing a physical I/O.

2. The buffer cache is searched for the data blocks being requested. If the data block is found in the buffer cache, the read() system call returns with the corresponding data. If the data block is not found, the requesting process goes to sleep and a physical I/O request is generated to read the data block into the buffer cache. We will assume the data block was not found.

NOTE: Logical I/Os may or may not generate corresponding physical I/Os. The goal of the buffer cache is to handle as many logical I/Os with as few physical I/Os as possible.

3. The physical read is performed because the data was not in the buffer cache. Because physical I/O involves movement of the disk head (seek time), waiting for the data on the platter to rotate under the disk head (latency time), and moving the data from the platter into memory (transfer time), the cost of a physical I/O is high from a performance standpoint. Physical I/Os are the most time-consuming operations that the kernel performs. If the disk I/O queue is long (3 or more requests), the time spent waiting to be serviced can be longer than the time to actually service the I/O request.

4. Once the physical I/O request returns, the data is stored in the buffer cache so that future I/O requests for the same file system block can be satisfied without having to perform another physical I/O. This step completes the physical I/O initiated by the kernel.

5. The final step is to return the data to the original calling process that issued the read(). The sleeping process is awakened and transfers the desired data from the buffer (in buffer cache) to the data area of the process. Then the process returns from the read() system call. This step completes the logical I/O initiated by the process.

Raw Read Data Flow

If the read operation is raw, the buffer cache is bypassed. Data is transferred directly from the disk to the data area of the calling process. All raw reads are synchronous and therefore result in the process sleeping until the data has been read in.
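One way to observe the difference between buffered and raw reads from the shell (a sketch; the device file is a placeholder, and the commands only read, never write):

# timex dd if=/dev/dsk/c0t6d0 of=/dev/null bs=64k count=1000     (block device: goes through the buffer cache)
# timex dd if=/dev/rdsk/c0t6d0 of=/dev/null bs=64k count=1000    (raw device: bypasses the buffer cache)

Running the first command twice is typically much faster the second time, because the blocks are already in the buffer cache; the raw read shows no such benefit.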

8–3. SLIDE: Disk I/O — Write Data Flow (Synchronous)

Disk I/O — Write Data Flow (Synchronous)

1. Process issues write system call.
2. Block is assigned on disk, and image for block is allocated in buffer cache.
3. Once data is written to buffer cache, a physical I/O to disk is generated.
4. Data is written to disk controller cache.
5. Data is then transferred from the disk controller to the corresponding platter.
6. Upon completion of I/O, the disk controller sends an acknowledgment to the kernel.
7. Write system call returns to process.

[Slide diagram: process, buffer cache, disk I/O queue, and disk controller cache, with the seven steps above marked along the write data path]

Student Notes

As with reads, there are two methods for performing write() system calls: asynchronous and synchronous. Although the default write operation is asynchronous (the writing process does not sleep waiting for the write to complete), it is quite simple for a program to choose synchronous writes. It can do so by simply setting a flag on the open file before issuing the write. This can be done when the file is opened or at some later time.

Synchronous Writes

The slide shows the data flow of a synchronous write, from the time the write() system call is issued to when the write call returns to the process.

1. The process issues a synchronous write() system call.

2. Assuming the process is writing to a new file data block, a new file system block is allocated on disk and an image of that block is allocated in the buffer cache.

3. Once the data is copied from the data area of the process to the buffer cache, an I/O request is placed in the disk I/O queue for that particular disk. The calling process goes to sleep until the write is reported to be complete.

4. When the physical write is performed, the data is first copied from the buffer cache to the firmware cache on the disk drive controller.

NOTE: Most SCSI disk drive controllers can be configured to return an I/O-complete acknowledgment at this point, rather than waiting for the data to be transferred to the physical platters. This condition is called "immediate reporting".

5. The data is transferred from the disk controller cache to the platter. This operation is often the most time-consuming part of the write, as it involves seek, latency, and data transfer operations.

6. Once the data has been successfully transferred to the platters, the disk drive controller returns an I/O-complete acknowledgment to the kernel (assuming this was not done in step 4 with immediate reporting).

7. The kernel, upon receiving the I/O-complete acknowledgment, wakes the sleeping process, which then returns from the write call.

Asynchronous Writes

An asynchronous write does not wait for the data to get to the disk. An asynchronous write system call returns immediately upon the data being written to the buffer cache. In the diagram on the slide, the write call would return following step 2.

The advantage of asynchronous writes is performance: the process does not have to wait for the physical I/O. The disadvantage is lack of data integrity. Because the process continues executing before the data is written to disk, it can perform additional actions that are dependent upon the data being written successfully. If for some reason the data does not get written (a disk goes offline or a disk head crashes), the additional actions can leave the system in an inconsistent state.

For example, assume a database record is written asynchronously. Because it is written asynchronously, the database process continues its execution. A subsequent action is to update a corresponding entry in another table of the database located on another disk. Assume the first asynchronous write is posted to a busy disk with a long queue, and the subsequent write is posted to a disk with an empty queue. The second write finishes before the first write begins! If the system were to crash after the second write, but before the first write, the database would be out of sync and corrupted, because the second write assumed that the first write succeeded.

There is no "signaling" to the writing process to let it know that a write has completed. For that, the process would have to do synchronous writes.

8–4. SLIDE: Disk Metrics to Monitor — Systemwide

Disk Metrics to Monitor — Systemwide

• Utilization of disk drives
• Disk I/O queue length
• Amount of physical I/O to
  – Device (i.e., disk)
  – Logical volume
  – File system
• Buffer cache hit ratio

Student Notes

When monitoring disk I/O activity, the main metrics to monitor are:

• Percent utilization of the disk drives: As utilization of the disk drives increases, so does the amount of time it takes to perform an I/O. According to the performance queuing theory, it takes twice as long to perform an I/O when the disk is 50% busy as it does when the disk is idle. Therefore, we consider that a disk may be experiencing a bottleneck if the disk is 50% busy or more.

• Requests in the disk I/O queue: The number of requests in the disk I/O queue is one of the best indicators of a disk performance problem. If the average number of requests is above two, then requests are forced to wait in the queue longer than the amount of time needed to service their own requests. If the average number of requests is three or greater, you should also see that the average wait time for a request is greater than the average service time.

• Amount of physical I/O: If the amount of disk activity is high, it is important to investigate on which disk, which logical volume, and which file system the activity is occurring.

• Buffer cache hit ratio: One reason disk activity could be high is that read or write requests are not finding corresponding disk blocks in the buffer cache. As a result, physical I/O requests are being generated to the disk.

The read cache hit ratio on the buffer cache indicates how frequently read data is found in the buffer cache. The minimum read hit ratio should be 90% or higher for optimal performance. Less than 90% indicates the buffer cache may be too small, causing (potential) excess disk activity. It may also indicate that the application is not using the buffer cache in an efficient manner, e.g. doing a lot of random I/O or very large I/O.

The write cache hit ratio on the buffer cache indicates how frequently a write to a buffer does not trigger a physical read or write to the disk. (If only a portion of a block is being written, and the image of that block is not already in a buffer, it may be necessary to read the original contents of the block into buffer cache before modifying it with the new write data.) The minimum write cache hit ratio should be 70% or higher for optimal performance. Less than 70% indicates the buffer cache may be too small, causing (potential) excess disk activity. Again, the fault may lie with the application's use of the buffer cache.
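A quick way to watch these two ratios over time (a sketch; the interval and count are arbitrary):

# sar -b 5 12       (the %rcache and %wcache columns are the read and write hit ratios)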

8–5. SLIDE: Disk Metrics to Monitor — Per Process

Disk Metrics to Monitor — Per Process

• Amount of physical and logical I/O being performed on a per-process basis
• Type and amount of system calls (I/O-related) being generated by processes performing large amounts of I/O
• Paging to swap device (VM reads/writes) on a per-process basis
• Files opened by processes performing large amounts of I/O

Student Notes

On a per-process basis, it is important to identify which processes are generating large amounts of disk I/O. Metrics that help to identify I/O activity on a per-process basis are:

• Amount of physical and logical I/O: This indicates "how much" I/O the process is performing. For processes performing large amounts of I/O, the additional three metrics shown below should be investigated.

• Type and amount of I/O-related system calls being generated: For each process performing high I/O, the number of read(), write(), and other I/O-related calls should be inspected.

• Amount of VM reads and VM writes: If the I/O activity being generated is due to paging (VM reads and VM writes), then the problem is probably not a disk I/O problem, but more likely a memory problem.

• Files opened with heavy access: For each process performing large amounts of file system I/O, the names of the files to which they are reading or writing should be inspected. For files receiving high I/O activity, consider relocating these files to other disks that are less busy. To determine how "random" the I/O requests are, refresh the glance display repeatedly while looking at the list of open files for that process, and inspect how quickly the offset into each file changes and whether it is monotonically increasing or varies up and down.

8–6. SLIDE: Activities that Create a Large Amount of Disk I/O

Activities that Create a Large Amount of Disk I/O

• Buffer cache misses
• Synchronous I/O
• Accessing sequentially with a small block size
• Accessing many files on a single disk
• Accessing many disk drives from a single disk controller card

Student Notes

Common causes of disk-related performance problems are shown on the slide.

• Buffer cache misses cause physical I/O to occur. When the appropriate buffer is not found in the buffer cache, a physical I/O is triggered. By the way, a buffer cache can be too large as well: a very large buffer cache takes more time to search to see if the appropriate buffer exists! More on how to properly size a buffer cache will be given later in this module.

• Synchronous I/O forces the write system calls to wait until the I/O physically completes. Very good for data integrity, very poor for performance.

• Sequential access, with a small block size, causes excessive amounts of physical I/O.

• Accessing lots of files on one disk, versus many disks, creates an imbalance of disk drive utilization. This leads to performance problems with the busy disks and underutilization of the less busy disks.

• Accessing lots of disks on the same disk controller creates contention problems on the SCSI bus. You can determine this by noticing that multiple disks on the same controller have request queues that are consistently three or greater in length, and that the average time a request waits to be serviced is greater than the average time it takes to actually service the request. The individual disks may not show a disk utilization of 50% or greater! If this situation occurs, it would be best to split the busiest disks onto separate controllers.
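To see which disks share a controller before splitting them up (a sketch; output varies by system):

# ioscan -fnC disk    (the hardware path groups each disk under its controller instance)
# sar -d 5 6          (for disks on the same controller, compare avwait with avserv)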

8–7. SLIDE: Disk I/O Monitoring sar -d Output

Disk I/O Monitoring sar -d Output

# sar -d 5 6

05:23:50   device   %busy   avque   r+w/s   blks/s   avwait   avserv
05:23:55   c1t5d0    0.60    0.50       2       35     1.55     5.07
           c0t4d0   62.40   10.51      46     2783   127.97   152.92
           c0t5d0   33.20    2.76      16     1226    42.89   143.96
           c0t6d0   54.80    8.10      31     2166   242.52   193.15
05:24:00   c1t5d0    1.20    0.50       3       39     1.97     6.72
           c0t4d0   63.80   10.84      48     2943   129.23   159.47
           c0t5d0   39.20    2.94      19     1427    38.85   154.55
           c0t6d0   61.80   19.60      36     2371   331.15   208.49
05:24:05   c1t5d0    2.20    0.50       3       45     3.85    13.04
           c0t4d0   56.40   18.40      39     2392   234.33   163.10
           c0t5d0   35.60    2.69      17     1258    39.96   138.81
           c0t6d0   62.80   18.41      36     2643   192.28   178.66
05:24:10   c1t5d0    0.20    0.50       2       35     1.01     4.86
           c0t4d0   68.60   13.00      51     3118   154.68   159.02
           c0t5d0   33.80    3.25      16     1226    47.82   147.32
           c0t6d0   60.00    5.72      33     2301   238.43   203.88
05:24:15   c0t4d0   24.40    4.25      15      823    60.83   180.68
           c0t5d0   23.00    3.46      14      851    43.33   118.87
           c0t6d0   50.60   18.77      28     1846   306.13   233.36
05:24:20   c1t6d0    0.60    0.50       0        2     4.63    11.53
           c1t5d0    1.40    1.17       2       23     9.85    21.50

Student Notes

The sar -d report shows disk activity on a per-disk-drive (spindle) basis. The key fields within this report are:

%busy     Indicates the average percent utilization of the disk over the interval (5 seconds in the slide).

avque     Indicates the average number of requests in the disk I/O queue.

avwait    Indicates the average amount of time a request spends waiting in the disk I/O queue.

avserv    Indicates the average amount of time to service a disk I/O request.

The sar -d report on the slide shows that when the disk had the most requests in the queue (19.60 and 18.77), the average wait time was at its highest. The slide also shows that there are five disk drives spread across two disk controllers. One disk controller (c0) appears to have two busy drives (t4 and t6) and a relatively low-usage drive (t5). Disk controller (c1) has two disks that are mainly idle. One performance solution here would be to balance the disk activity across the two controllers by moving one disk (say c0t4) over to the less busy disk controller (c1).

8–8. SLIDE: Disk I/O Monitoring sar -b Output

Disk I/O Monitoring sar -b Output

#=> sar -b 10 20
HP-UX e2403roc B.10.20 U 9000/856    02/09/98

05:51:04  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
05:51:14        0        0        0        1        1       25        0        0
05:52:04        0        0        0        0        1       85        0        0
05:52:14        0        0        0        1        8       87        0        0
05:52:24        0        0        0        0        4      100        0        0
05:52:34        0        0        0        0        1      100        0        0
05:52:54        1       68       99        0        0       33        0        0
05:53:04        7    11936      100        1        2       13        0        0
05:53:14        6    19506      100        1        1        0        0        0
05:53:24       28    24147      100        1        2       65        0        0
05:53:34       64    16659      100        0       14       99        0        0
05:53:44      118      118        0        2        3       46        0        0
05:53:54        0        0        0        3        3        0        0        0
05:54:04        0        0        0       18       19        4        0        0
05:54:14      179      179        0       18       18        3        0        0
05:54:24      179      179        0       13       14        4        0        0

Average        29     3639       99        3        5       39        0        0

Student Notes

The sar -b report shows disk activity related to the buffer cache. The key fields within this report are:

bread/s    Indicates the average number of physical I/O reads per second over the interval. The term bread refers to block reads.

lread/s    Indicates the average number of logical I/O reads per second over the interval.

%rcache    Indicates the average percent read cache hit rate. This shows what percentage of read requests were satisfied through the buffer cache. Ideally, this value should be consistently 90% or greater.

bwrit/s    Indicates the average number of physical I/O writes per second over the interval. The term bwrit refers to block writes.

lwrit/s    Indicates the average number of logical I/O writes per second over the interval.

%wcache    Indicates the average percent write cache hit rate. This shows what percentage of write requests were satisfied through the buffer cache. Ideally, this value should be consistently 70% or greater.

The sar -b report on the slide shows the two extreme situations. The first extreme is a 100% cache hit rate, which occurs when there are lots of logical I/O requests and all requests are satisfied through the buffer cache, rather than having to go to disk. This is a very desirable condition.

The other extreme is a 0% cache hit ratio. This occurs when every logical I/O request requires a physical I/O from disk. In this case, the number of physical reads or writes is equal to the number of logical reads or writes. This is most undesirable.

8–9. SLIDE: Disk I/O Monitoring glance — Disk Report

Disk I/O Monitoring glance — Disk Report

B3692A GlancePlus B.10.12     06:16:25    e2403roc    9000/856  Current Avg High
--------------------------------------------------------------------------------
Cpu Util                                                        |100%  100% 100%
Disk Util                                                       | 83%   22%  84%
Mem Util                                                        | 94%   95%  96%
Swap Util                                                       | 21%   21%  22%
--------------------------------------------------------------------------------
DISK REPORT                                                            Users=  4
Req Type     Requests     %    Rate    Bytes   Cum Req     %  Cum Rate  Cum Byte
--------------------------------------------------------------------------------
Local
 Logl Rds          68   2.7    13.6      5kb      1260   7.8       9.6     3.2mb
 Logl Wts        2455  97.3   491.0   19.2mb     14798  92.2     112.9   114.8mb
 Phys Rds          10   1.7     2.0     80kb       189   5.1       1.4     1.8mb
 Phys Wts         565  98.3   113.0   18.9mb      3520  94.9      26.8   112.4mb
  User            571  99.3   114.2   18.9mb      3448  93.0      26.3   112.2mb
  Virt Mem          0   0.0     0.0      0kb        66   1.8       0.5     968kb
  System            4   0.7     0.8     32kb       195   5.3       1.4     1.2mb
  Raw               0   0.0     0.0      0kb         0   0.0       0.0       0kb
Remote
 Logl Rds           0   0.0     0.0      0kb         0   0.0       0.0       0kb
 Logl Wts           0   0.0     0.0      0kb         0   0.0       0.0       0kb
 Phys Rds           0   0.0     0.0      0kb         1 100.0       0.0       0kb
 Phys Wts           0   0.0     0.0      0kb         0   0.0       0.0       0kb

Student Notes

The glance disk report (d key) shows local and remote I/O activity. The I/O distribution can be viewed from the following:

• Logical Perspective (logical reads and logical writes)
• Physical Perspective (physical reads and physical writes)
• I/O Type Perspective (User, Virtual Mem, System, Raw)

Items of interest in this report include the number of logical I/O requests (reads and writes), the number of physical I/O requests (reads and writes), and the ratio between the two. In the slide, disk utilization is 94% (very high), with the majority of the I/Os being writes (92%) as opposed to reads. It is also interesting to note that the logical-to-physical write ratio is 14,798 / 3,520, or approximately 4:1, which is an acceptable write performance ratio.

8–10. SLIDE: Disk I/O Monitoring glance — Disk Device I/O

Disk I/O Monitoring glance — Disk Device I/O

B3692A GlancePlus B.10.12     06:31:12    e2403roc    9000/856  Current Avg High
--------------------------------------------------------------------------------
Cpu Util                                                        |100%  100% 100%
Disk Util                                                       | 83%   22%  84%
Mem Util                                                        | 94%   95%  96%
Swap Util                                                       | 21%   21%  22%
--------------------------------------------------------------------------------
IO BY DISK                                                             Users=  4
Idx  Device        Util    Qlen      KB/Sec          Logl IO      Phys IO
--------------------------------------------------------------------------------
  1  56/52.6.0     0/ 0     0.0       0.0/   1.8     na/ na     0.0/ 0.2
  2  56/52.5.0     1/ 1     0.0      16.0/   5.1     na/ na     2.0/ 0.7
  3  56/36.4.0    78/ 9    18.2    1584.8/ 178.4     na/ na    48.0/ 5.6
  4  56/36.5.0    52/ 6     3.8     932.8/ 120.5     na/ na    24.0/ 3.0
  5  56/36.6.0    68/ 9    10.6    1172.8/ 154.9     na/ na    35.8/ 4.6
  6  56/52.2.0     0/ 0     0.0       0.0/   0.0    0.0/ 0.0   0.0/ 0.0

Top disk user: PID 3280, disc    106.4 IOs/sec
S - Select a Disk

Student Notes

The glance disk device report (u key) shows current and average utilization of each disk drive on the system. The report also shows the current I/O queue length for each disk. This display shows basically the same information as sar -d.

In the slide, three disks show utilization greater than 50% and queue lengths greater than 3. This is normally a valid reason for further investigation. The 10.6 and 18.2 queue lengths are high, but because the average utilization of both drives is 9%, this may just be a spike in disk activity. In this case, monitor the situation further to see if the high queue lengths persist or if they were just spikes in disk usage.

8–11. SLIDE: Disk I/O Monitoring glance — Logical Volume I/O

Disk I/O Monitoring glance — Logical Volume I/O

B3692A GlancePlus B.10.12     06:34:41    e2403roc    9000/856  Current Avg High
--------------------------------------------------------------------------------
Cpu Util                                                        |100%  100% 100%
Disk Util                                                       | 83%   22%  84%
Mem Util                                                        | 94%   95%  96%
Swap Util                                                       | 21%   21%  22%
--------------------------------------------------------------------------------
IO BY LOGICAL VOLUME                                                   Users=  4
Idx  Vol Group/Log Volume     Open LVs      LV Reads     LV Writes
--------------------------------------------------------------------------------
  1  /dev/vg00                      10      0.0/ 0.0      0.0/  0.0
  2  /dev/vg00/group                        0.0/ 0.0      0.0/  0.0
  3  /dev/vg00/lvol3                        0.0/ 0.0      0.2/  0.0
  4  /dev/vg00/lvol2                        0.0/ 0.0      0.0/  0.0
  5  /dev/vg00/lvol1                        0.0/ 0.0      0.0/  0.0
  9  /dev/vg00/lvol7                        0.0/ 0.0      0.0/  0.0
 10  /dev/vg00/lvol4                        0.0/ 0.0      0.0/  0.0
 12  /dev/vg01                       2      0.0/ 0.0      0.0/  0.0
 13  /dev/vg01/lvol1                        0.0/ 0.0    105.6/ 19.2

Open Volume Groups:  2                               S - Select a Volume

Student Notes

The glance logical volume report (v key) shows disk activity on a per-logical-volume basis. Only physical I/O activity (not logical I/O activity) is shown with this report.

In the previous slide, we saw high activity across three disk drives (drives 4, 5, and 6). The logical volume report on the slide shows all this activity is being performed against one logical volume (/dev/vg01/lvol1), which implies that the logical volume is spread across three disks (a good idea, since the I/O to the logical volume is so high).

8–12. SLIDE: Disk I/O Monitoring glance — System Calls per Process

Disk I/O Monitoring glance — System Calls per Process

B3692A GlancePlus B.10.12     06:48:15    e2403roc    9000/856  Current Avg High
--------------------------------------------------------------------------------
Cpu Util                                                        |100%  100% 100%
Disk Util                                                       | 83%   22%  84%
Mem Util                                                        | 94%   95%  96%
Swap Util                                                       | 21%   21%  22%
--------------------------------------------------------------------------------
System Calls for PID: 4055, disc    PPID: 2410    euid: 0    User: root
                                              Elapsed                    Elapsed
System Call Name       ID   Count    Rate        Time   Cum Ct  CumRate  CumTime
--------------------------------------------------------------------------------
write                   4     377   754.0     0.10650    12851    477.7  4.10153
open                    5       3     6.0     0.05910      100      3.7  0.61923
close                   6       3     6.0     0.00006      100      3.7  0.00225
lseek                  19       0     0.0     0.00000       75      2.7  0.00204
ioctl                  54       3     6.0     0.00007      100      3.7  0.00259
vfork                  66       0     0.0     0.00000       25      0.9  0.34908
sigprocmask           185       0     0.0     0.00000       50      1.8  0.00088
sigaction             188       0     0.0     0.00000      150      5.5  0.01340
waitpid               200       0     0.0     0.00000       25      0.9  1.47745

Cumulative Interval: 27 secs

Student Notes

The glance system calls report (L key), available only from the select process report (s key), shows the names of the system calls being generated by the selected process. The system calls report can be viewed for individual processes (as shown on the slide) or globally for all processes on the system (Y key).

Significant system calls, which typically consume a lot of time, are the file I/O-related calls, such as read(), write(), open(), and close(). In the slide, the write() system call is being invoked heavily by the selected process (754 times/second) and has accounted for 4.1 seconds of the CPU's time over a 27-second period (approximately 15%).

8–13. SLIDE: Tuning a Disk I/O-Bound System — Hardware Solutions

Tuning a Disk I/O-Bound System — Hardware Solutions

• Add additional disk drives (and off-load busy drives).
• Add additional controller cards (and balance disk drive load across controllers).
• Add faster disk drives.
• Implement disk striping.
• Implement disk mirroring.

Student Notes

The hardware solutions on the above slide will help to lessen the performance impact of high disk I/O on a system.

• Add more disk drives and load balance across disks. This spreads the amount of I/O over more drives, decreasing the average number of I/O requests for each disk. Many smaller disks are better than a few large disks.

• Add more disk controllers and balance load across disk controllers. This spreads the amount of I/O over more controllers, decreasing the likelihood that any one disk controller will become overloaded with I/O requests.

• Add faster disk drives. This decreases the amount of time it takes to service an I/O request, which decreases the amount of time requests spend waiting in the disk I/O queue.

• Implement disk striping. This increases the number of disk heads having access to the striped data (the more disks striped across, the more heads accessing the data simultaneously). It also allows for "overlapping seeks," meaning that one disk head can be seeking the next block while a second disk head is reading the current data block.

• Implement disk mirroring. This can increase read performance, as either the primary or the mirrored copy of the data can be read. In fact, the data will be read from whichever disk has the fewest I/Os pending against it. However, mirroring will negatively impact write performance: in order to maintain the integrity of the mirrors, duplicate writes must be done to each copy of the mirrored volume/disk. Mirroring is primarily a data integrity feature, but under the right circumstances (read-intensive data) it can improve performance as well.

8–14. SLIDE: Tuning a Disk I/O-Bound System — Perform Asynchronous Meta-data I/O

Tuning a Disk I/O-Bound System — Perform Asynchronous I/O

• Configure individual disk drives to behave "somewhat" asynchronously with the immediate reporting feature of SCSI disk controllers.
• Configure immediate reporting with the scsictl command.

[Slide diagram: process, buffer cache, disk I/O queue, and disk controller cache in memory, showing the write path to the disk]

Student Notes

Asynchronous I/O significantly improves write performance over synchronous I/O, because the write requests (and thus the requesting processes) do not have to wait for the data to be written to the disk platters.

Immediate Reporting for Selected Disks

Immediate reporting can be turned on at boot time by setting the tunable parameter default_disk_ir to ON. An alternative to turning on default_disk_ir is to enable certain disk controllers selectively to report immediately to the kernel when the data reaches the disk controller cache. For normal writes, the disk waits until data is transferred from the controller cache to the disk platters before returning to the kernel. By setting immediate reporting to ON for individual disk controllers, processes do not have to wait for the seek or latency times when writing to those disks.

The scsictl command can be used to turn immediate reporting ON (1) for a particular SCSI disk. The default for immediate reporting is OFF (0).

Examples

To view the device settings for the controller at SCSI adapter address 0 and SCSI target address 6:

# /usr/sbin/scsictl -m ir /dev/rdsk/c0t6d0
immediate_report = 0

To change the value of immediate reporting to ON:

# /usr/sbin/scsictl -m ir=1 /dev/rdsk/c0t6d0

To view the changes in the device settings:

# /usr/sbin/scsictl -a /dev/rdsk/c0t6d0
immediate_report = 1; queue_depth = 8

8–15. SLIDE: Tuning a Disk I/O-Bound System — Load Balance across Disk Controllers

Tuning a Disk I/O-Bound System — Load Balance across Disk Controllers

[Slide diagram: a system with two disk controllers, C0 and C1; the disks on C0 form physical volume group PVG1 and the disks on C1 form PVG2, both within volume group vg01]

Student Notes

Another potential solution to a disk I/O performance problem is to spread the write requests across the disk controllers as evenly as possible. This helps ensure no one controller becomes overloaded with I/O requests.

Mirroring Logical Volumes

A popular feature of LVM is the ability to mirror logical volumes to separate disk drives. This involves writing one copy of the data to the primary disk and one copy to the mirrored disk. When the primary disk and mirror disk are on the same disk controller, a performance bottleneck often results, because the disk controller has to service the writes for both the primary and mirrored data.

Physical Volume Groups

Physical volume groups (PVGs) allow disk drives to be grouped based on the disk controller to which they're attached. Used in conjunction with LVM mirroring, this ensures the mirrored data not only goes to a different disk, but also to a different PVG (that is, a different disk controller).

How to Set Up PVGs

The PVG groups are defined in the /etc/lvmpvg file. This file can be manually edited, or updated with the -g option to the vgcreate and vgextend commands. A sample /etc/lvmpvg file, based on the four disks on the slide, is:

VG /dev/vg01
PVG PV_group0
/dev/dsk/c0t6d0
/dev/dsk/c0t5d0
PVG PV_group1
/dev/dsk/c2t5d0
/dev/dsk/c2t4d0
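As a sketch (the device files are placeholders matching the sample above), the same grouping could be created with the -g option instead of editing the file by hand:

# vgextend -g PV_group0 /dev/vg01 /dev/dsk/c0t6d0 /dev/dsk/c0t5d0
# vgextend -g PV_group1 /dev/vg01 /dev/dsk/c2t5d0 /dev/dsk/c2t4d0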

The command to configure LVM mirroring for different PVGs is lvchange. The strict option to this command, -s, contains the following three arguments: y

This indicates all mirrored copies must reside on different disks.

n

This indicates mirrored copies can reside on the same disk as the primary copy.

g

This indicates all mirrored copies must reside with different PVGs.

For example, to configure /dev/vg01/lvol1 to mirror to different PVG: lvchange -s g /dev/vg01/lvol1
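If the logical volume has no mirror yet, one could then be added (a sketch; assumes MirrorDisk/UX is installed and vg01 contains disks in both PVGs):

# lvextend -m 1 /dev/vg01/lvol1    (add one mirror copy; PVG-strict allocation places it in the other controller's PVG)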

8–16. SLIDE: Tuning a Disk I/O-Bound System — Load Balance across Disk Drives

Tuning a Disk I/O-Bound System — Load Balance across Disk Drives

[Slide diagram: volume group vg01 shown without striping (one disk holding extents 1-6 at 100% utilization, with other disks at 52%, 20%, and 5%) and with striping (the extents alternate across two disks, 1 3 5 7 9 11 and 2 4 6 8 10 12, evening the load across drives)]

Student Notes

Balancing the disk activity so that the utilization across drives is approximately the same helps to ensure that no one disk becomes overloaded with I/O requests (that is, 50% or greater utilization, with three or more requests in the disk queue).

The slide illustrates a situation in which one disk is heavily utilized (100%) while another disk is only 5% utilized. One potential solution is to stripe the heavily utilized logical volume on the first disk to both disks.

LVM Striping

The ability to stripe a logical volume across multiple disks (at a file system block level) was introduced into LVM at the HP-UX 10.01 release. A logical volume must be configured for striping at the time of creation. Once a logical volume is created, it cannot be striped without recreating the logical volume.

The command to create a striped logical volume is lvcreate. The striping-related syntax for this command is:

lvcreate -i [number of disks] -I [stripe size in KB] -L [size in MB] vg_name

Example:

# lvcreate -i 2 -I 8 /dev/vg01
# lvextend -L 50 /dev/vg01/lvol2 /dev/dsk/c0t5d0 /dev/dsk/c0t4d0
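A complete one-line variant (a sketch; the volume name, stripe size, and size are illustrative):

# lvcreate -i 2 -I 64 -L 100 -n stripevol /dev/vg01    (100-MB volume striped across 2 disks in 64-KB stripes)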

8–17. SLIDE: Tuning a Disk I/O-Bound System — Tune Buffer Cache

Tuning a Disk I/O-Bound System — Tune Buffer Cache

[Slide diagram: memory divided into kernel and OS tables, a fixed buffer cache of 5% (dbc_min_pct=5%), additional buffer cache of 0 - 45% (dbc_max_pct=50%), and the user process and shared memory area]

Student Notes

With the introduction of HP-UX 10.0, the buffer cache became dynamic, growing and shrinking between a minimum size and a maximum size.

NOTE: Space for the buffer cache is allocated in two different areas of memory: the minimum size is created in the O/S area of memory, and anything above the minimum size is allocated from the user process area.

How the Buffer Cache Grows

As the kernel reads in files from the file system, it will try to store the data in the buffer cache. If memory is available and the buffer cache has not reached its maximum size, the kernel will grow the buffer cache to make room for the new data. As long as there is memory available, the kernel will keep growing the buffer cache until it reaches its maximum size (50% of memory, by default). If memory is not available, or the buffer cache is at its maximum size when new data is read, the kernel will select the buffer cache entries that are least likely to be needed in the future, and reallocate those entries to store the new data.

The main point is that if there is available memory, the buffer cache will grow into this memory until there is no memory left (or until the buffer cache reaches its maximum size).

How the Buffer Cache Shrinks

As memory falls below LOTSFREE, the vhand paging daemon wakes up and begins paging out 4-KB pages of memory. The eligible pages include process segments (text, data, and stack), shared memory segments, and the buffer cache. In other words, the buffer cache is shrunk by having vhand page out its pages. The buffer cache is treated by vhand as just another structure in memory with pages that it can dereference and free. Like process text pages, buffer cache pages are not written out to the swap space; but, since their contents may have been modified, they may need to be flushed out to the file system before being placed back on the free page list.

NOTE: The kernel global value dbc_steal_factor determines how aggressively the vhand daemon steals buffer cache pages in comparison to process pages. A value of 16 says to treat buffer cache pages no differently than process pages; the default value of 48 says to steal buffer cache pages three times as aggressively! However, if the buffer cache is referencing those pages, vhand will find few buffers to free up.

Buffer Cache Performance Implications

Because the buffer cache grows quickly into free memory and shrinks slowly (by requiring vhand to page it out), one consideration is to limit the maximum size to which the buffer cache can grow. The default maximum size is 50% of total memory. This was probably a fairly reasonable number when the parameter was introduced, but with the very large memory systems existing nowadays, it is probably much too high. By setting the dbc_max_pct tunable kernel parameter to a smaller number (say 20 or 25), the buffer cache can still grow to a significant size, but will not be so large that it takes a long time to shrink when more processes become ready to execute.

Prior to HP-UX 11i, there was a definite performance penalty for having a buffer cache that was too large: it took a long time to search the cache to determine if the needed buffer was already there. Improvements in the search algorithm in 11i have reduced that penalty significantly.
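As a sketch of lowering that ceiling (the value is illustrative; on releases where dbc_max_pct is a static tunable, a kernel rebuild and reboot are required for the change to take effect):

# kmtune -s dbc_max_pct=25    (stage the new maximum for the next kernel build)
# mk_kernel                   (rebuild the kernel with the staged value)
# shutdown -ry 0              (reboot to make the new kernel take effect)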

Fixed vs. Dynamic Buffer Cache

Should you use a fixed-size buffer cache or a dynamic buffer cache? If your buffer cache requirements are constant over time, of course you should use a fixed-size buffer cache. Simply set the dbc_min_pct and dbc_max_pct parameters to the same value.

If your buffer cache requirements change over time, do they change rapidly or slowly? There is some overhead associated with growing and shrinking the buffer cache, and shrinking the buffer cache is not a very fast operation. If your buffer cache requirements change slowly over time, it would be best to use a dynamic buffer cache: the overhead of growing and shrinking would be spread out and become relatively insignificant.

If, however, your buffer cache requirements change rapidly over time, you probably would be better served with a fixed-size buffer cache, properly sized to give you adequate buffers most of the time. Only on relatively rare occasions would the buffer cache be a bottleneck, and only for short periods. In the long run, your performance would be better than trying to deal with the rapidly changing needs using a dynamic buffer cache.

Sizing Buffer Cache

Here is a set of recommendations for properly sizing your buffer cache.

1. Are you getting at least a 90% read cache hit rate and a 70% write cache hit rate? If so, your buffer cache may already be larger than necessary. If you are experiencing no memory pressure and no apparent disk bottlenecks, leave the buffer cache as it is.

2. If you are experiencing memory pressure or apparent disk bottlenecks, try shrinking the size of your buffer cache. Adjust dbc_max_pct down in increments of no more than 10% at a time, until your performance figures fall to 90%/70%.

3. If you are not getting 90%/70% performance from your buffer cache, it may be too small, or your application may be using it in an inefficient manner. Try increasing its size. If the figures improve, keep increasing the size until either you reach 90%/70% or your performance ceases to improve. Leave the size there.

4. If increasing the size of the buffer cache does not produce an immediate improvement in performance, your application may need to be tuned to use the buffer cache more efficiently. However, your buffer cache may still be larger than it needs to be. After you have tuned your application, recheck your buffer cache performance, as above.

8–18. LAB: Disk Performance Issues

Directions

The following lab illustrates a number of performance issues related to disks.

1. A file system is required for this lab. One was created in an earlier exercise. Mount it now.

# mount /dev/vg00/vxfs /vxfs

We also need to ensure that the controller does not have SCSI immediate reporting enabled. Enter the following command and check your current state (fill in the device file name as appropriate):

# scsictl -m ir /dev/rdsk/cXtXdX      (to report current "ir" status)

If the current immediate_report = 1, then enter the following:

# scsictl -m ir=0 /dev/rdsk/cXtXdX    (ir=1 to set, ir=0 to clear)

2. Copy the lab files to the file system.

# cp /home/h4262/disk/lab1/disk_long /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /vxfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

4. Open a second terminal window and start glance. While in glance, display the Disk Report (d key). Zero out the data with the z key. From the first window, time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real:                    glance Disk Report
user:                    Logl Rds:
sys:                     Phys Rds:

5. At this point, all 20 MB of data is resident in the buffer cache. Re-execute the same command and record the results below:

# timex cat file* > /dev/null

real:                    glance Disk Report
user:                    Logl Rds:
sys:                     Phys Rds:

NOTE: The conclusion is that I/O is much faster coming from the buffer cache than having to go to disk to get the data.

6. The sar -d report. Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the file system (and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

7. The glance I/O by Disk report. Exit from the sar -d report, and start glance again. While in glance, display the I/O by Disk report (u key). From the first window, re-execute disk_long, timing the execution. Record results below:

# ./disk_long

glance I/O by Disk Report
Util:
Qlen:

8. The glance I/O by File System report. Reset the data with the z key, and display the I/O by File System report (i key). From the first window, re-execute disk_long, timing the execution. Record results below:

# ./disk_long

glance I/O by File System Report
Logl I/O:
Phys I/O:

9. Performance tuning — immediate reporting. Ensure the immediate reporting option is set for the disk on which the file system is located. If immediate reporting is not set, set it.

# scsictl -m ir /dev/rdsk/cXtXdX      (to report current "ir" status)
# scsictl -m ir=1 /dev/rdsk/cXtXdX    (ir=1 to set, ir=0 to clear)

Purge the contents of the buffer cache.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

10. The sar -d report. Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program (which writes 400 MB to the file system and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

How do the results of this step compare to the results in step 6?
________________________________________________________________

Module 9 — HFS File System Performance

Objectives

Upon completion of this module, you will be able to do the following:

• List three ways HFS file systems are used.
• List basic HFS file system data structures.
• Identify HFS file system bottlenecks.
• Identify HFS kernel system parameters.

9–1. SLIDE: HFS File System Overview

HFS File System Overview

[Slide diagram: the physical view (tracks grouped into cylinder groups), the logical view (cylinder groups 1 through N, led by the primary superblock), and the internal cylinder group view (data blocks, redundant superblock, cylinder group header, inode table, and more data blocks)]

Student Notes

The HFS model is a foundation for all other file system variants. We will begin our discussion of file system performance using the HFS file system model.

The HP-UX File System

The HFS file system strategically lays out its data structures on disk to most efficiently utilize the geometry of the disk. The design of the HFS file system can best be explained by looking at the file system from three perspectives.

Physical View

From a physical disk perspective, the disk drive upon which a file system is placed contains sectors, tracks, platters, and disk heads. A key behavior of almost all disk drives is that the disk heads move in parallel across the platters, in such a way that each disk head is over the same track within each platter at the same time.

To maximize the file system I/O throughput of the disk, it is desirable to have as many file blocks close to each other as possible, to minimize the time it takes to read or write the various blocks of a file. To help achieve this goal, the blocks on the disk are allocated to the HFS file system in units called cylinder groups. A cylinder group is all the tracks, from every platter, of several adjacent cylinders, grouped together.

Cylinder Group Analogy

Consider a health spa or gym with three floors. Each floor contains a jogging track, and the three jogging tracks are located directly above or beneath one another from floor to floor. From this point of view, a cylinder group would be the same group of lanes from each floor's jogging track. In other words, all lane 1, 2, and 3 tracks would make up cylinder group 1; all lane 4, 5, and 6 tracks would make up cylinder group 2; and so on.

By organizing space on disks in cylinder group units, the HFS file system can logically keep all the blocks of a given file close to each other. For example, in the slide above, the first 6 blocks of a file might be allocated as follows:

File block #1: Platter #1, Track #1, Sector #1
File block #2: Platter #1, Track #1, Sector #2
File block #3: Platter #1, Track #3, Sector #5
File block #4: Platter #1, Track #3, Sector #6
File block #5: Platter #2, Track #7, Sector #10
File block #6: Platter #3, Track #9, Sector #7

By allocating file system space in this manner, a multiple-block read (say, 6 blocks) can be completed with fewer than six separate reads. In the example above, file blocks 1 and 2 could be read with one read operation, followed by a head switch (no carriage movement) to track 3, another read for file blocks 3 and 4, a short seek to the next cylinder and a head switch to read file block 5, and the same again for file block 6. Four reads could then retrieve the six blocks. The more contiguous the blocks that make up the file, the more efficient the reads and writes can be.

Logical View From a logical perspective, an HFS file system contains a series of cylinder groups. Even though the physical cylinder groups are laid out from top to bottom, spanning all the platters, logically we view the cylinder groups as horizontal units going from left to right. The HFS file system is made up of multiple cylinder groups, where the number of cylinder groups is dependent on the size of the file system. In the slide, we assume the HFS file system takes the whole disk; therefore, there are N cylinder groups in the sample file system. Typically, they are numbered from 0 to N-1. A critical data structure contained within every HFS file system is the primary superblock. The primary superblock is located at the start of every HFS file system at the start of the first cylinder group, and contains the critical header information for the HFS file system. Data structures contained within the superblock include the free block list, the mount flag, the starting address of each cylinder group, and much more.


Internal Cylinder Group View Within each cylinder group, the following data structures exist: Data blocks

The data blocks are where files are stored within the cylinder group. The data blocks are distributed in such a way that a portion of the data blocks come before the cylinder group header structures and the rest come after the cylinder group header structures. This ensures that the cylinder group header structures are randomly placed throughout the cylinder groups.

Redundant Superblock

A redundant copy of the primary superblock is contained within each cylinder group. These redundant copies are kept to protect against the loss of the primary superblock. The locations of the redundant superblocks can be viewed by displaying the contents of the /etc/sbtab file. Should the primary superblock become lost or corrupted, the file system could still be recovered by executing the fsck command and specifying the location of one of the alternate superblocks.
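For example, recovery from a lost primary superblock might look like the following sketch (the logical volume name is hypothetical, and the alternate superblock address must be taken from that file system's entries in /etc/sbtab):

# cat /etc/sbtab
# fsck -F hfs -b <alternate address from sbtab> /dev/vg00/rlvol3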

Cylinder Group Header

The cylinder group header contains the header information for the cylinder group. This information includes the free blocks within the cylinder group, the starting addresses of the inode tables for that group, and a list of free inodes for the local inode table.

Inode Table

The inode table contains all the inodes (file header structures) for files located within the cylinder group. Every file within a file system is managed by an inode, which describes the attributes and location of the file. The inode table is divided into equal-sized sections and a section is stored in each cylinder group. Inodes within a cylinder group point to files usually contained within the same cylinder group.


9–2. SLIDE: Inode Structure

Inode Structure

(Slide: the inode table within a cylinder group, with one inode expanded to show its fields: Type, Permissions, Links, Owner, Group, Size, Atime, Mtime, Ctime, and the Data Block Pointers that locate the file's data blocks on disk.)

Student Notes
An inode contains all the header information for a particular file. Every file has a corresponding inode, usually located within the same cylinder group as the file. Fields contained within the inode include:

• File type
• File access permissions
• Number of hard links to the file
• Owner and group of the file
• Size of the file in bytes
• Time stamps (file access, file modification, inode changes)
• Data block pointers (direct and indirect)

NOTE:

Although the size of the inode differs from one type of file system to another, the basic types of data contained are virtually the same; the main differences are in the data pointer structures.


9–3. SLIDE: Inode Data Block Pointers

Inode Data Block Pointers

(Slide: direct access, where the inode points straight at the data blocks (2 logical I/Os needed to access each 8 KB of data); single indirection, where the inode points to an inode extension that points to the data blocks (3 logical I/Os per 8 KB); and double indirection, where the inode points through two levels of inode extensions (4 logical I/Os per 8 KB).)

Student Notes One of the structures within each HFS inode is the array of data block pointers that reference the data blocks within the file. The size of the data block pointer array is 15 entries, meaning there are a maximum of 15 file system block addresses within the array. The first 12 addresses within the data block pointer array are “direct access” addresses. The thirteenth entry is a “single indirection” block address, the fourteenth is a “double indirection” block address, and the fifteenth (and last) entry is a “triple indirection” block address.

Direct Access A direct access address points directly to a file's data block. When accessing a file using a “direct access” address, a minimum of two logical I/Os are needed: one I/O to access the file's inode (containing the direct access address), and one I/O to access the file's corresponding data block.

H4262S C.00 9-6 © 2004 Hewlett-Packard Development Company, L.P.

http://education.hp.com

Module 9 HFS File System Performance

Single Indirection Single indirection implies the address within the inode references a block on disk that acts as an inode extension block. The inode extension block, in turn, contains addresses that point to the file's corresponding data blocks. It should be noted that three logical I/Os are needed to access a file's data blocks using single indirection: one I/O for the file's inode, one I/O for the inode extension block, and one I/O for the data block itself.

Double Indirection Double indirection means access to a file's data blocks requires going through two inode extension blocks. The first inode extension block references the address of a second inode extension block, which contains addresses referencing the file's data blocks. Double indirection is needed only for files above 16 MB (with a default block size of 8 KB). When accessing files requiring double indirection, a total of four logical I/Os are required: an I/O for the file's inode, an I/O for each of the two inode extension blocks, and an I/O for the file's data block.

Triple Indirection Triple indirection (not shown on the slide) adds one more level of indirection when accessing a file's data blocks. Triple indirection is only needed to access files larger than 32 GB (with a default block size of 8KB). NOTE:

Every level of indirection adds an additional logical I/O when accessing the file's data. In the case of triple indirection, five logical I/Os are needed compared to two I/Os for direct access data blocks.

As you can see, the performance of an HFS file system tends to favor small files (12 blocks or less), and tends to penalize large files that have to use single, double, or even triple indirection. You can delay this performance degradation somewhat by building the file systems with larger block sizes. (More on that later in the module.)
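As a quick arithmetic check of these thresholds (a sketch assuming the default 8-KB block size and 4-byte block addresses, so each extension block holds 8192 / 4 = 2048 addresses):

# echo "direct : $((12 * 8)) KB"                           (12 pointers x 8 KB = 96 KB)
# echo "single : $((2048 * 8 / 1024)) MB"                  (16 MB)
# echo "double : $((2048 * 2048 * 8 / 1024 / 1024)) GB"    (32 GB)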


9–4. SLIDE: How Many Logical I/Os Does It Take to Access /etc/passwd?

How Many Logical I/Os Does It Take to Access /etc/passwd?

(Slide: the lookup chain. Inode 2, the root directory inode, points to directory block 74, which lists "etc" as inode 504. Inode 504, the /etc directory inode, points to directory block 717, which lists "passwd" as inode 1824 (and "host" as inode 1123). Inode 1824 points to block 2240, which holds the contents of /etc/passwd itself. Each inode carries the usual fields: Type, Permissions, Links, Owner, Group, Size, Atime, Mtime, Ctime.)

Student Notes The above slide illustrates how a file within the HFS file system is accessed. It may surprise some people when they find out how many logical I/Os are needed to access the /etc/passwd file.

Starting from the Top When the full pathname of a file is specified for access (as in /etc/passwd), the kernel starts with the only inode it knows: the inode of the root directory of the root file system. Inode number 2 is always the inode of the root directory of any file system. “/” symbolizes (in the kernel) the root directory of the root file system. Using the slide as an example, after reading inode 2 of the root file system (first logical I/O), the kernel discovers that the contents of the root directory (the listing of the files contained in that directory) are located at file system block 74. Upon reading block 74 (second logical I/O), the names of the files in the root directory and their corresponding inode numbers are known. Directories are primarily listings of file names and the numbers of the inodes that manage them.


From this information, the kernel discovers the inode for the etc directory (in "/") is 504. Inode 504 is then read (third logical I/O) and from that the kernel learns the etc directory is located at file system block 717. Block 717 is read (fourth logical I/O) and the file names and inodes contained within that directory are now known. One of the entries within block 717 is the passwd file and its corresponding inode number 1824. Inode 1824 is read (fifth logical I/O), and from this the kernel finally learns that block 2240 is the one that contains the contents of the /etc/passwd file. Block 2240 is read (sixth logical I/O) and the kernel finally has the data it set out to access. So, the answer to the question at the top of the slide, "How many logical I/Os does it take to access /etc/passwd?" is . . . 6.
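You can trace part of this chain yourself with the ls -i option, which prints the inode number associated with each name (the numbers on your system will differ from those on the slide):

# ls -id /              (inode of the root directory; always 2)
# ls -id /etc           (inode of the etc directory)
# ls -i  /etc/passwd    (inode of the passwd file)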


9–5. SLIDE: File System Blocks and Fragments

File System Blocks and Fragments

(Slide: a row of 8-KB file system blocks, each divided into 1-KB fragments, ending at the end of the disk.
Top half, initial allocation:
  Block 1: FileA, FileB, FileB, FileC, FileC
  Block 2: FileD x4
  Block 3: FileE x5
  Block 4: FileF x6
Bottom half, after FileA, FileC, and FileD each grow by 1 fragment:
  Block 1: FileB, FileB, FileA, FileA
  Block 2: FileD x5
  Block 3: FileE x5, FileC x3
  Block 4: FileF x6)

Student Notes The concept of blocks and fragments was introduced when the HFS file system was designed. There is always a tradeoff when managing a resource based on a fixed allocation unit size (the file system "block" in this case). If the block size is large we can manage them with fewer pointers (system overhead) but if it is too large there is an opportunity for inefficient utilization of the resource (very small files still require a block). In the case of the HFS file system this concern was addressed by making the block capable of uniform subdivision. The fragment was created for this purpose.

Definitions Sector

A sector is the smallest unit of space addressable on the physical disk. The sector size is used when the disk is formatted to appropriately place timing markers on the platter. The default sector size for HP-UX and most UNIX systems is 512 bytes.

Fragment

A fragment is the increment in which space is allocated to files within the HFS file system. The default fragment size is 1 KB. This can be tuned when the HFS file system is initially created. Allowable sizes are 1K, 2K, 4K, and 8K.


File System Block

A file system block is the minimum amount of data transferred to/from the disk when performing a disk I/O on an HFS file system. The default file system block size is 8 KB. This can be tuned when the HFS file system is initially created. Allowable sizes are 4K, 8K, 16K, 32K, and 64K.
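Both values are fixed when the file system is created; to see what an existing HFS file system uses, the superblock can be displayed (a sketch: the device name is hypothetical, and the exact field names in the output may vary by release):

# tunefs -v /dev/vg00/rlvol5 | grep -i size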

Example — Top Half The top half of the slide shows the allocation of disk space when the following six files are created (assuming only 5 file system blocks are free within the HFS file system). File A (size 1 KB):

The kernel searches for the first free fragment. On the slide, the first fragment in the first file system block is allocated.

File B (size 2 KB)

The kernel searches for the first 2-KB continuous fragment that is available. This is in the same file system block in which FileA was allocated. The fact that FileA has already been allocated in this file-system block does not matter. Multiple files can be allocated within the same file system block. The first basic rule is: Best fit on “close”.

File C (size 2 KB):

The kernel searches for the first 2-KB continuous fragment available. This is in the same file system block as FileA and FileB. Hence, FileC is allocated 2 KB from this same file-system block. If any of these three files are accessed, then all three files are read into the file system buffer cache as a single unit.

File D (size 4 KB): The kernel searches for the first four contiguous 1KB fragments available (within the same file system block). This is in the second file system block. The kernel does not allocate 3 fragments from the first file system block and 1 fragment from the second file system block, because that would require two logical I/Os to read in the entire 4 KB. This is inefficient, as only one I/O is required if the file is contained within the same file system block. The second basic rule is: If the size of a file is 8 KB or less, the kernel will fit the entire file within a single file-system block. File E (size 5 KB) and File F (size 6 KB): The kernel searches for the first available file-system block that can hold the entire file. On the slide, FileE is allocated in file system block 3, and FileF is allocated in file system block 4.


Example — Bottom Half
The bottom half of the slide illustrates how the growth of three files affects allocation within the HFS file system.

FileA (1 KB -> 2 KB): When FileA grows, it cannot grow into the next fragment because FileB is occupying this spot. Therefore, the kernel relocates FileA to the first free 2 KB that is within the same file system block. (Why transfer another block into memory at this point?)

FileC (2 KB -> 3 KB): When FileC grows, it cannot grow into the next fragment, because FileA is now in that spot. Therefore, the kernel relocates FileC to the first free 3 KB that is in a different file system block. (It can no longer fit into the first block.) It selects block three because that block has a space exactly suited for the three-fragment FileC. The third basic rule is: if a file owns multiple fragments within the same block, they must be contiguous.

FileD (4 KB -> 5 KB): When FileD grows, it simply grows into the next fragment because it is still free.


9–6. SLIDE: Creating a New File on a Full File System

Creating a New File on a Full File System

(Slide: New FileG (4 KB) is created.
  Block 1: FileB, FileB, FileA, FileA
  Block 2: FileD x5, FileC x3
  Block 3: FileE x5
  Block 4: FileF x6
  Block 5: FileG x4   (end of disk)
What happens when new FileH (1 KB) is created?)

Student Notes As an HFS file system becomes full, the performance impact of creating a new file becomes significant. This is due to the behavior of the kernel when creating a new file: When a new file is created on an HFS file system, the kernel tries to allocate a block-sized buffer in the buffer cache for the file to grow into. When the file is closed, the kernel allocates the file's fragments to an already allocated file system block, if possible.

FileG Is Created In the example on the slide, FileG is opened/created as a new file. Not knowing the size to which FileG will grow, the kernel allocates a block-sized buffer in buffer cache for FileG to grow into. When FileG is closed, the kernel searches for a set of four contiguous 1KB fragments in a block. Since there are no shared blocks that have four contiguous fragments, the file is written to a new, empty block.

What Happens When FileH Is Created? The impact of creating new files on a full file system can be seen when FileH is created. When FileH is opened for creation, the kernel allocates a block-sized buffer in buffer cache.


As it turns out, FileH is closed after writing only 1 KB worth of data. Upon closure, FileH is moved to file system block 1, first fragment. NOTE:

Performance on HFS file systems typically degrades when free space falls below 10%, due to the length of time it takes to find free file system blocks for new files. For this reason, it is recommended that MINFREE always be 10% or greater, even for large file systems (greater than 4 GB).

The fourth basic rule is: No fragment belonging to another file will be moved to make room for this file.
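The reserved percentage can be adjusted on an existing file system with tunefs; for example (hypothetical device name):

# tunefs -m 10 /dev/vg00/rlvol5    (keep MINFREE at 10%)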


9–7. SLIDE: HFS Metrics to Monitor — Systemwide

HFS Metrics to Monitor — Systemwide • Utilization of the file systems • File system I/O queue lengths • Amount of physical I/O to the file systems • File system free space • Open files for each process

Student Notes
When monitoring disk I/O activity, the main metrics to monitor are:

• Percent utilization of the file systems: As utilization of the file system increases, so does the amount of time it takes to perform an I/O. According to the performance queuing theory, it takes twice as long to perform an I/O when the file system is 50% busy as it does when the file system is idle.

• Requests in the file system I/O queue: The number of requests in the file system I/O queue is one of the best indicators of a file system performance problem. If the average number of requests is three or greater, then requests are having to wait in the queue longer than the amount of time needed to service those requests. (A sar example follows this list.)

• Amount of physical I/O: If the amount of file system activity is high, it is important to investigate on which file system the activity is occurring.

• File system free space: As an HFS file system becomes full (greater than 90%), it takes longer and longer to find an available free fragment for a new file or to grow an existing file. This creates additional disk activity, leading to slow file system performance.


• Files opened with heavy access: For each process performing large amounts of file system I/O, the names of the files to which they are reading or writing should be inspected. For files receiving high I/O activity (hit frequently; inspect how quickly the offset to each file changes), consider relocating these files to other disks that are less busy.
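Several of these metrics can be watched with standard tools. For example, sar -d reports per-device utilization (%busy), the average number of queued requests (avque), and the average wait and service times (avwait and avserv):

# sar -d 5 5    (five 5-second samples; look for avque of 3 or more and a high %busy)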


9–8. SLIDE: Activities that Create a Large Amount of File System I/O

Activities that Create a Large Amount of File System I/O • File writes on an almost full file system • Long, inefficient PATH variables • Deep subdirectory structures • Accessing large files sequentially with a small READ block size • Accessing many files on a single disk

Student Notes
Common causes of disk-related performance problems are shown on the slide.

• Full file systems cause excessive I/O due to locating free fragments.

• Long, inefficient PATH variables cause excessive directory I/O (especially when the command is found in the last directory within the PATH variable).

• Deep subdirectories cause lots of logical I/Os (two logical I/Os for each subdirectory in the full path name).

• Sequential file access, with a small file system block size, causes excessive amounts of physical I/O.

• Accessing lots of files on one file system, versus many, creates an imbalance of utilization. This leads to performance problems with the busy file systems and underutilization with the others.


9–9. SLIDE: HFS I/O Monitoring bdf Output

HFS I/O Monitoring bdf Output

# bdf
Filesystem          kbytes    used    avail  %used  Mounted on
/dev/root            81920    38018   40901    48%  /
/dev/vg00/lvol1      47829    22403   20643    52%  /stand
/dev/vg00/lvol6     286720   257116   28003    90%  /usr
/dev/vg00/lvol4     360448   346127   13444    96%  /opt
/dev/dsk/c0t4d0    1177626  1113204       0   100%  /disk
/dev/vg00/lvol7     122880   102098   19257    84%  /var
/dev/vg00/lvol5      53248    22589   28549    44%  /tmp

Student Notes The bdf report shows how much file system space is being used (and how much is free) for all file systems currently mounted on the system. The key fields are: avail

Indicates the amount of disk space available on the file system (in KB).

%used

Indicates the percentage of disk space used.

The slide shows there are three file systems with 90% usage or more, and one of the file systems is at 100% utilization. Recall that when an HFS file system becomes full, performance on that file system suffers due to fragments being moved. The good news is that the amount of free space held back by the file system parameter MINFREE is already subtracted from these values. In fact, if you compare the kbytes, used, and avail columns, you'll see that something is missing: used + avail do not add up to kbytes. The difference is MINFREE. For example, look at /stand. Clearly, 22403 + 20643 does not equal 47829. In fact, (22403 + 20643) / 47829 equals 90%, indicating that MINFREE must be set to 10% for this file system.


9–10. SLIDE: HFS I/O Monitoring glance — File System I/O

HFS I/O Monitoring glance — File System I/O

B3692A GlancePlus B.10.12       06:39:52   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util                                                        |100%  100%  100%
Disk Util                                                        | 83%   22%   84%
Mem  Util                                                        | 94%   95%   96%
Swap Util                                                        | 21%   21%   22%
--------------------------------------------------------------------------------
IO BY FILE SYSTEM                                                Users=    4
Idx  File System    Device               Type     Logl IO        Phys IO
--------------------------------------------------------------------------------
  1  /              /dev/root            vxfs     0.3/  0.6      0.0/  0.0
  2  /stand         /dev/vg00/lvol1      hfs      0.0/  0.0      0.0/  0.0
  3  /var           /dev/vg00/lvol9      vxfs     1.0/  1.8      0.1/  0.3
  4  /usr           /dev/vg00/lvol8      vxfs     9.2/  2.8      1.5/  0.6
  5  /tmp           /dev/vg00/lvol7      vxfs     0.0/  0.0      0.1/  0.0
  6  /opt           /dev/vg00/lvol6      vxfs     0.0/  0.0      0.0/  0.0
  7  /home.lvol5    /dev/vg00/lvol5      vxfs     0.0/  0.0      0.0/  0.0
  8  /export        /dev/vg00/lvol4      vxfs     0.0/  0.0      0.0/  0.0
  9  /disk          /dev/vg01/lvol1      vxfs   463.8/ 86.4    105.8/ 20.1
 10  /cdrom         /dev/dsk/c1t2d0      cdfs     0.0/  0.0      0.0/  0.0
 11  /net           e2403roc:(pid604)    nfs      0.0/  0.0      0.0/  0.0

Top disk user: PID 3603, disc    104.0 IOs/sec
S - Select a Disk

Student Notes The glance file system I/O report (i key) shows activity on a per file system basis. Only total I/O activity (not reads versus writes) is shown with this report. This report is similar to the logical volume report (discussed in the previous module) except this report shows logical I/O compared to physical I/O, and does not distinguish between read and write activities. The logical volume report shows reads compared against writes, but does not distinguish between logical and physical activities. From the report on the slide, we note that all the file system activity is being performed against one file system. Note:

The file system I/O report shows I/O activity for all types of mounted file systems, including CDFS file systems and NFS-mounted file systems.


9–11. SLIDE: HFS I/O Monitoring glance — File Opens per Process

HFS I/O Monitoring glance — File Opens per Process

B3692A GlancePlus B.10.12       06:44:39   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util                                                        |100%  100%  100%
Disk Util                                                        | 83%   22%   84%
Mem  Util                                                        | 94%   95%   96%
Swap Util                                                        | 21%   21%   22%
--------------------------------------------------------------------------------
Open Files for PID: 3911, disc      PPID: 2410   euid: 0   User: root

                                                 Open    Open
FD   File Name                      Type         Mode    Count      Offset
--------------------------------------------------------------------------------
 0   /dev/pts/1                     chr          rd/wr   6        13582826
 1   /dev/pts/1                     chr          rd/wr   6        13582826
 2   /dev/pts/1                     chr          rd/wr   6        13582826
 3                                  reg          read    1              85
 4   /stand/file5                   reg          write   1           32768
10   /dev/null                      chr          read    2               0

Student Notes
The glance open files report (F key), available only from the select process report (s key), shows the names of files opened by the currently selected process. Sometimes the full path name of the file is shown. Otherwise, the inode number and device name are shown, and you would have to translate that information into the filename.

NOTE:

To determine the full pathname of a file, given its inode number and logical volume name, use the ncheck command:
ncheck -F vxfs -i [inode #] [device name]
Another way to determine the full pathname of a file, given its inode number and logical volume name, is to use the find command:
find [mountpoint of device] -inum [inode #] -xdev


To determine whether I/O activity is occurring against a file, enter the open file report for a particular process, and press Return multiple times in succession. Watch the offset field for each file. If the offset field is constantly changing, it indicates the file is currently being accessed.

Performance Scenario A system is experiencing slow performance due to high file system utilization. Upon further investigation, not all file systems are heavily utilized. In fact, some show no activity at all. By sorting the processes within glance by disk I/O activity, then selecting those processes to obtain further details, you can determine which files are getting the majority of the activity. To take advantage of the underutilized file system, move the heavily accessed files to this file system and create a symbolic link to the file from its original location, thereby removing a heavily accessed file from a busy file system and putting it on an underutilized file system.


9–12. SLIDE: Tuning an HFS I/O-Bound System — Tune Configuration for Workload

Tuning an HFS I/O-Bound System — Tune Configuration for Workload Tune the following parameters, based on workload: • File System Block and Fragment Sizes • Blocks per cylinder group (maxbpg) • File system mount options • The mkfs options when creating the file system • The tunefs options can modify parameters on existing file systems Tune other configurations, based on workload: • Optimize $PATH variables • Use flat directory structures when possible • Ensure sufficient freespace exists on file systems

Student Notes Every workload and every application is different. Each has different resource requirements and each places different demands on the system. There is no one configuration that is optimal for all applications. For example, CAD-CAM applications stress memory (and graphics); accounting applications that do forecasting stress the CPU; NFS-based applications stress the disks (and the network); and RDBMS applications stress all resources.

File System Blocks and Fragments
Tips and notes for choosing the sizes of file system blocks and fragments follow:

Fragments
• Fragment sizes can be 1, 2, 4, or 8 KB.
• Fragments can be 1/8, 1/4, 1/2, or equal to the file system block size.
• For large files which are opened and closed a lot during their growth, large fragments are recommended.
• For file systems with lots of small files, small fragments are recommended.


File System Blocks
• File system block sizes can be 4, 8, 16, 32, or 64 KB.
• For file systems with large files, large file system blocks are recommended.
• For file systems with large files, increase maxbpg (maximum blocks per group).
• For applications which perform a lot of sequential I/O (with read-aheads and write-behinds), large file system blocks are recommended.
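For example, a file system intended for large, sequentially accessed files might be created with the largest fragment and block sizes (device name hypothetical; these sizes cannot be changed after creation):

# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rlvol5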

HFS Mount Options The mount options affect performance by specifying when files on the file system are updated. These options can be specified in the options column of the /etc/fstab file. The HFS-specific mount options include: behind

Enable, when possible, asynchronous writes to disk. This is the default for workstations. It does not use the sync daemon.

delayed

Enable delayed or buffered writes to disk. This is the default for servers. It does use the sync daemon.

fs_async

Enable relaxed (asynchronous) posting of file system metadata (changes to the superblocks, inodes, etc.). This option may improve file system performance, but increases exposure to file system corruption in the event of power failure.

no_fs_async Force rigorous (synchronous) posting of file system metadata to disk. This is the default.
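For example, a hypothetical /etc/fstab entry selecting one of these options, and the equivalent manual mount:

/dev/vg00/lvol4  /data  hfs  fs_async  0  2

# mount -F hfs -o fs_async /dev/vg00/lvol4 /data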

mkfs Options
mkfs is usually not executed directly, but is called by newfs -F hfs instead. File system tuning is best accomplished when the file system is created. The workload for a file system should be well understood and dedicated before serious attempts are made to tune one. Many options are also dependent on the type of physical device on which a file system is being created. The HFS-specific options include:

size           The size of the file system in DEV_BSIZE blocks (the default is the entire device).

largefiles     The maximum size of a file can be up to 128 GB.

nolargefiles   The maximum size of a file will be limited to 2 GB.

ncpg           The number of cylinders per cylinder group (range 1-32, the default is 16).

minfree        The minimum percentage of free disk space reserved for non-root processes (the default is 10%). Beginning with HP-UX 10.20 the bdf command does not conceal this free space and as a result will report free disk space accurately. This means that a file system cannot show 111% utilization anymore.


nbpi           The number of bytes per inode. This value determines how many inodes are allocated given a file system of a certain size. (The default is 6144.)

tunefs Options
Some parameters can be changed once the file system has been created, with tunefs(1M). These are minfree and maxbpg. minfree is explained above.

maxbpg         The maximum number of data blocks that a large file can use out of a cylinder group before it is forced to continue its growth in a different cylinder group. This value does not apply to any file whose size is 12 blocks or less.

tunefs can also be used to display the superblock contents of an HFS file system:
# tunefs -v /dev/…/…
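For example, on a file system holding a few very large files, maxbpg might be raised so each file can take more blocks from a cylinder group before being forced onward (hypothetical device name):

# tunefs -e 2048 /dev/vg00/rlvol4    (raise maxbpg to 2048 blocks per cylinder group)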

Other Configurations Optimize $PATH

The PATH variable in a user's environment specifies a list of directories to search when a command is entered. Having an excessive number of directories or duplicate directories to search can increase disk access, particularly when the user makes a mistake typing a command. This problem can be greatly exacerbated if the user's PATH variable contains directories that are mounted automatically with the NFS automount utility, causing the network mount of a file system because of a typographical error. Use Flat Directory Structures
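A quick way to spot duplicate PATH entries is to split the variable on colons and report repeats; a minimal POSIX shell sketch:

# echo "$PATH" | tr ':' '\n' | awk 'seen[$0]++ { print "duplicate:", $0 }'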

Long directory path names create more work for the system because each directory file and its associated inode entry require a disk I/O in order to bring them into memory. Recall that six logical I/Os were required to read the /etc/passwd file. Conversely, you don’t want thousands of files in the same directory, as it would take many I/O operations to read and search the directory. Ensure Sufficient Freespace

As the file system becomes full (greater than 90%), the kernel begins to take longer and longer to find available free fragments. The algorithm gets very lengthy when the file system free space falls below 10%. Of course, if you do not have any files that grow and you are not adding any new files, this would waste 10% of your file system free space for no reason.


9–13. SLIDE: Tuning an HFS I/O-Bound System — Use Fast Links

Tuning an HFS I/O-Bound System — Use Fast Links

(Slide: Standard symbolic link: inode 12 for /usr/data -> /data holds the usual fields (Type, Permissions, Links, Owner, Group, Size, ATime, MTime, CTime) plus a pointer to data block 74, whose contents are the referenced path "/data". HP fast link: inode 12 stores the characters of the referenced path "/data" directly within the inode, so no separate data block is needed.)

Student Notes There are two ways symbolic links can be stored on HFS file systems.

Standard Symbolic Links Standard symbolic links are implemented in the same way as they are on other UNIX systems. The inode for the symbolic link points to a data block on disk, and the contents of the data block contains the name of the file being referenced by the symbolic link. In the example on the slide, /usr/data is the symbolic link with an inode number of 12. The contents of inode 12 contain an address pointer to data block 74, and the contents of data block 74 contain the name of the file being referenced (in the example, /data). Two logical I/Os are required to resolve the symbolic link, one I/O to retrieve the inode and one I/O to retrieve the data block containing the referenced name.

HP Fast Links HP fast links allow symbolic links to be resolved with one logical I/O instead of two. HP fast links store the name of the referenced file in the inode of the symbolic link itself, rather than in a data block that the inode references. In the example, when the inode (12) of the symbolic link is retrieved, the contents of the inode contain the name of the referenced file.


HP fast links can be configured by setting the tunable OS parameter create_fastlinks to 1, and recompiling the kernel. Upon booting from the new kernel, all future symbolic links created will use HP fast links. No existing standard symbolic links will be automatically converted to fast symbolic links. The standard symbolic links would have to be removed and then recreated to convert them. Fast symbolic links will only work for link destinations that can be expressed in 59 characters or less as this is the limit of the space within the inode where the fast link information is stored. If a symbolic link contains more than 59 characters, it will be stored as a standard symbolic link, regardless of the value of create_fastlinks.

Transition Links
Saving one logical I/O when accessing a symbolic link may not seem significant, until considering that HP-UX makes heavy use of transition links (which are an implementation of symbolic links). Transition links allow an HP-UX file system to contain older 9.x directory paths. The 9.x directory names are symbolic links that point to the correct, current location (for example, /bin -> /usr/bin). Many HP-UX installations have applications (including HP-UX applications) which rely on and make heavy use of transition links. A quick performance gain for all HP-UX systems is to convert these transition links from standard symbolic links to HP fast links. The procedure for making this conversion is:
1. Recompile the kernel to use HP fast links (i.e. set create_fastlinks to 1).
2. Shut down and reboot the system.
3. Execute tlremove to remove all the transition links from the system. Over 500 links will be removed.
4. Execute tlinstall to reinstall (that is, recreate) the transition links. When the links are reinstalled, they will be created as HP fast links.
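On an HP-UX 11i v1 system, step 1 of this procedure might look like the following sketch (kmtune is assumed here; later releases use kctune instead):

# kmtune -q create_fastlinks      (query the current value)
# kmtune -s create_fastlinks=1    (stage the new value)
# mk_kernel                       (rebuild the kernel, then reboot)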


9–14. LAB: HFS Performance Issues

Directions
The following lab illustrates a number of performance issues related to HFS file systems.

1. A 512 MB HFS file system is required for this lab. Use the mount and bdf commands to determine if such a file system is available.
# mount -v
# bdf
If there is no such HFS file system available, create one using the commands below:
# lvcreate -n hfs vg00
# lvextend -L 512 /dev/vg00/hfs /dev/dsk/cXtYdZ    (second disk)
# newfs -F hfs /dev/vg00/rhfs
# mkdir /hfs
# mount /dev/vg00/hfs /hfs

2. Copy the lab files to the newly created HFS file system.
# cp /home/h4262/disk/lab1/disk_long /hfs
# cp /home/h4262/disk/lab1/make_files /hfs
Next, execute the make_files program to create five 4-MB ASCII files.
# cd /hfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.
# cd /
# umount /hfs
# mount /dev/vg00/hfs /hfs
# cd /hfs


4. Time how long it takes to read the files with the cat command. Record the results below:
# timex cat file* > /dev/null
real:
user:
sys:

5. In a second window start:
# sar -d 5 200
From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).
# timex ./disk_long
• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

6. Performance tuning — recreate the file system with larger fragment and file system block sizes.
Tuning the size of the fragments and file system blocks can improve performance for sequentially accessed files. The procedure for creating a new file system with customized fragments of 8 KB and file system blocks of 64 KB is shown below:
# lvcreate -n custom-lv vg00
# lvextend -L 512 /dev/vg00/custom-lv /dev/dsk/cXtYdZ
# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustom-lv
# mkdir /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs

7. Copy the lab files to the customized HFS file system, execute the make_files program, and purge the buffer cache.
# cp /hfs/disk_long /cust-hfs
# cp /hfs/make_files /cust-hfs
# cd /cust-hfs
# ./make_files
# cd /
# umount /cust-hfs


# mount /dev/vg00/custom-lv /cust-hfs
# cd /cust-hfs

8. Time how long it takes to read the files with the cat command. Record the results below:
# timex cat file* > /dev/null
real:
user:
sys:
How do the results of step 8 compare to the default HFS block and fragment results from step 4?
_______________________________________________________________________

9. Performance tuning — change file system mount options.
The manner in which the file system is mounted can impact performance. The fs_async mount option can improve performance, but data (metadata) integrity is not as reliable in the event of a crash, and fsck could run into difficulties.
# cd /
# umount /hfs
# mount -o fs_async /dev/vg00/hfs /hfs
# cd /hfs

10. In a second window start:
# sar -d 5 200
From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).
# timex ./disk_long
• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?
How do the results of step 10 compare to the default mount options in step 5?
_____________________________________________________________________



Module 10 – VxFS Performance Issues

• Understand JFS structure and version differences
• Explain how to enhance JFS performance
• Set block sizes to improve performance
• Set Intent-Log size and rules to improve performance
• Understand and manipulate synchronous and asynchronous I/O
• Identify JFS tuning parameters
• Understand and control fragmentation issues
• Evaluate the overhead of online backup snapshots


10–1. SLIDE: Objectives

Objectives Upon completion of this lesson, you will be able to: • Understand JFS structure and version differences • Explain how to enhance JFS performance • Set block sizes to improve performance • Set Intent-Log size and rules to improve performance • Understand and manipulate synchronous and asynchronous I/O • Identify JFS tuning parameters • Understand and control fragmentation issues • Evaluate the overhead of online backup snapshots

Student Notes
Upon completion of this module, you will be able to do the following:

• Understand JFS structure and version differences
These course notes are based on the JFS Version 3.5 file system, built on Version 4 disk layout. The next few slides will describe the basic differences between versions and relate them to HP-UX releases. HP JFS 3.5 and HP OnlineJFS 3.5 are available for HP-UX 11i and later systems. The standard (base) version of HP JFS has been bundled with HP-UX since release 10.01. The "advanced" HP OnlineJFS is a purchasable product with additional administrative features for higher availability and tunable performance. These notes will make clear which features belong to the base product and which belong to the OnlineJFS version. The Operating Environment delivery model of HP-UX 11i includes JFS as follows:

HP-UX 11i OE                      BaseJFS 3.3
HP-UX 11i Enterprise OE           OnlineJFS 3.3
HP-UX 11i Mission Critical OE     OnlineJFS 3.3


You can download JFS 3.5 for HP-UX 11i for free from the HP Software Depot (http://www.software.hp.com), or you can request a free JFS 3.5 CD from the Software Depot. You can purchase HP OnlineJFS 3.3 (product number B3929CA for servers and product number B5118CA for workstations) for HP-UX 11.0 or HP-UX 11i from your HP sales representative. JFS 3.5 is included with HP-UX 11i systems.

• Explain how to enhance JFS performance
The HFS file system uses block based allocation schemes, which provide adequate random access and latency for small files but limit throughput for larger files. As a result, the HFS file system is less than optimal for commercial environments. VxFS addresses this file system performance issue through an alternative allocation scheme and increased user control over allocation, I/O, and caching policies.

• Set block sizes to improve performance
It is often advantageous to match the block size of a file system to the I/O size of the application. We will show you how!

• Set Intent Log size to improve performance
The JFS intent log provides for rapid fsck recovery after a system crash. In general the intent log is not protecting your data; the focus is on structural integrity and not data integrity! Fast fsck comes at a price, and that price is performance. Setting the correct intent log size is important as it cannot be changed once a file system is created.

• Understand and manipulate synchronous and asynchronous I/O
Programmers and database providers do different types of I/Os to obtain the best possible balance between data integrity and performance. We will investigate all the "gray" areas and tune the JFS file system to meet our administrative and performance goals, which might be quite different from those of the programmer!

• Identify JFS tuning parameters
The JFS is tunable through mount options, the command line, configuration files, and kernel parameters. We will learn where and how to tune.

• Understand and control fragmentation issues
The extent based file allocation design of JFS is ideal for performance of large files. One weakness of this approach is the potential fragmentation of files and free space over the life of the file system. In general this will only occur in dynamic, work-file-oriented JFS file systems (e.g. a mail server) and is unlikely in fixed "large file" file systems where major I/O rates occur to static files (e.g. a database). We will investigate ways of measuring and fixing fragmentation.


• Evaluate the overhead of online backup snapshots
OnlineJFS supports online backups via snapshot mounts. We will discuss the performance issues involved when working with snapshots.


10–2. SLIDE: JFS History and Version Review

JFS History and Version Review • JFS introduced in 1995 with HP-UX 10.01 • Version 2 structure at introduction • Version 3 structure at 10.20 allows 1TB files • Version 4 structure at 11.00 allows more tunable controls and supports ACLs • Do not use V4 structure on 11.00 for /, /usr, /opt, /var • vxupgrade(1M) tool can migrate up through versions (not down!) • 11i delivers JFS 3.5 software on V4 structure • Differences between Base JFS 3.5 and OnlineJFS 3.5

Student Notes
The HP-UX Journaled File System (JFS) was introduced by HP in August 1995, on the HP-UX 10.01 release. The journaled file system attempts to improve on the high-performance file system (HFS) by offering the following enhancements:

• Extent-based allocation of disk space
• Fast file system recovery through an Intent Log
• Greater control and flexibility of file system behavior through new mount options and tunable options


Disk Layout Versions

Version 1
The Version 1 disk layout was never used in HP-UX.

Version 2
The Version 2 disk layout has the following changes and features:
• Many internal JFS structures are dynamic files themselves.
• Internal "filesets" separate data files (User Fileset) from structural files (Structural Fileset).
• Allocation units now contain data and data map structures only; inode tables are elsewhere.
• Inode allocation is dynamic and cannot run out.
• Optional support for quotas.

Version 3
The Version 3 disk layout offers additional support for:
• Files up to one terabyte.
• File systems up to one terabyte.
• Indirect inode extent maps that can address variant length file extents. V2 restricts all indirect extents to the size of the first indirect extent. Hence large files and sparse files are possible with less overhead.

Version 4
Version 4 is the latest disk layout:
• The Version 4 disk layout supports Access Control Lists.
• The Version 4 disk layout does not include significant physical changes from the Version 3 disk layout. Instead, the policies implemented for Version 4 are different, allowing for performance improvements, file system shrinking, and other enhancements.
• HP-UX 11i with the Version 4 layout now supports both files and file systems up to 2 TB in size.

Table: Matching HP-UX version to JFS version

HP-UX Release          VxFS Version    Supported Disk Layouts    Default Disk Layout
10.10                  2.3             2                         2
10.20                  3.0             2,3                       3
11.00 with JFS 3.1     3.1             2,3                       3
11.00 with JFS 3.3     3.3             2,3,4                     3
11i v1                 3.3             2,3,4                     4
11i v2                 3.5             2,3,4                     4

vxupgrade(1M) The vxupgrade command can upgrade an existing Version 3 VxFS file system to the Version 4 layout while the file system remains online. vxupgrade can also upgrade a Version 2 file system to the Version 3 layout. See vxupgrade(1M) for details on upgrading VxFS file systems. You cannot downgrade a file system that has been upgraded.
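For example (the mount point is hypothetical), running vxupgrade with no version argument reports the current layout, and -n performs the upgrade on the mounted file system:

# vxupgrade /data         (report the disk layout version of /data)
# vxupgrade -n 4 /data    (upgrade /data to the Version 4 layout)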


NOTE:

You cannot upgrade the root (/) or /usr file systems to Version 4 on an 11.00 system running JFS 3.3. Additionally, we do not advise upgrading the /var or /opt file systems to Version 4 on an 11.00 system. These core file systems are crucial for system recovery. The HP-UX 11.00 kernel and emergency recovery media were built with an older version of JFS that does not recognize the Version 4 disk layout. If these file systems were upgraded to Version 4, your system might have errors booting with the 11.00 kernel as delivered, or booting with the emergency recovery media.

Comparing Base and Advanced JFS

Table: Comparing Base and OnlineJFS

Feature                                                       JFS 3.5   OnlineJFS 3.5
extent-based allocation                                          *            *
extent attributes                                                *            *
fast file system recovery                                        *            *
access control list (ACL) support                                *            *
enhanced application interface                                   *            *
enhanced mount options                                           *            *
improved synchronous write performance                           *            *
support for large files (up to two terabytes)                    *            *
support for large file systems (up to two terabytes)             *            *
enhanced I/O performance                                         *            *
support for BSD-style quotas                                     *            *
unlimited number of inodes                                       *            *
file system tuning [vxtunefs(1M)]                                *            *
online administration                                                         *
ability to reserve space for a file and set fixed
  extent sizes and allocation flags                                           *
online snapshot file system for backup                                        *
direct I/O, supporting improved database performance                          *
data synchronous I/O                                                          *
DMAPI (Data Management API)                                                   *

How to tell if JFS 3.5 is installed
To determine if a vmunix file has JFS 3.5 compiled into it, you can run:
what /stand/vmunix | grep libvxfs.a
or
nm /stand/vmunix | grep vx_work
If you get output from either of these two commands, then the vmunix file has JFS 3.5 compiled into it, e.g.:


# what /stand/vmunix | grep libvxfs.a
$Revision: libvxfs.a: CUPI80_BL2000_1108_2 Wed Nov  8 10:59:22 PST 2000 $

# nm /stand/vmunix | grep vx_work
[13585]  |   9746968|    8|OBJT |LOCAL|0| .rodata|S$704$vx_worklist_gettag
[13587]  |   9746976|    8|OBJT |LOCAL|0| .rodata|S$705$vx_worklist_enqueue
[13589]  |   9746984|    8|OBJT |LOCAL|0| .rodata|S$706$vx_worklist_thread
[13591]  |   9746992|    8|OBJT |LOCAL|0| .rodata|S$707$vx_worklist_process
[13593]  |   9747000|    8|OBJT |LOCAL|0| .rodata|S$708$vx_workthread_set
[34118]  |    991200|  232|FUNC |GLOB |0|   .text|vx_worklist_enqueue
[27664]  |   1940288|   96|FUNC |GLOB |0|   .text|vx_worklist_gettag
[23820]  |  13229528|   40|OBJT |GLOB |0|    .bss|vx_worklist_high
[22805]  |  13182888|   16|OBJT |GLOB |0|    .bss|vx_worklist_lk
[36804]  |  13762256|   40|OBJT |GLOB |0|    .bss|vx_worklist_low
[39078]  |   1744792|  436|FUNC |GLOB |0|   .text|vx_worklist_process
[33997]  |   1745344|  196|FUNC |GLOB |0|   .text|vx_worklist_thread
[23090]  |  12350056|    8|OBJT |GLOB |0|   .sbss|vx_worklist_thread_sv
[36954]  |   1745232|   84|FUNC |GLOB |0|   .text|vx_worklist_wakeup
[31238]  |   2034928|   48|FUNC |GLOB |0|   .text|vx_workthread_create
[13579]  |   7215680|  232|FUNC |LOCAL|0|   .text|vx_workthread_set


10–3. SLIDE: JFS Extents

JFS Extents

(Slide: a JFS inode whose data pointers are (start, length) pairs describing extents on disk: Extent 1: start 40, length 128; Extent 2: start 200, length 64; Extent 3: start 8, length 5. Different files occupy the blocks between the extents.)

Student Notes JFS allocates space to files in the form of extents - adjacent blocks of disk space treated as a unit. Extents can vary in size from a single block (minimum 1 KB in size) to many megabytes. Organizing file storage in this manner allows JFS to better support large I/O requests, with more efficient reading and writing to continuous disk space areas. JFS extents are represented by a starting block number and a block count. In the example on the slide, the first extent starts at block 40 and contains a length of 128 blocks (or 128 KB, assuming blocks are 1KB in size). When the file grew past the 128 KB size, JFS tried to increase the size of the last extent. Since another file was already occupying this location, a new extent was allocated, starting at block 200. This extent grew to a size of 64 KB, before encountering another file. At this point, a third extent was allocated at block 8. Initially, 8 KB were allocated to the third extent, but upon closing the file, any space not used by the last extent is returned to the operating system. Since only 5 KB were used, the extra 3 KB were returned.


Direct and Indirect Extents in Version 2 Disk Layout Unlike the HFS inode, the vxfs inode is 184 (rather than 128) bytes long and contains direct and indirect pointers. In the HFS inode, the pointers address data blocks (8K by default) with 12 direct pointers and 3 additional indirect pointers for single, double and triple indirection. In reality triple indirection is rarely needed. Mapping large files in HFS is complex due to the levels of indirection needed to address many 8K blocks. The JFS (vxfs) inode has 10 direct pointers and three additional pointers for single, double, and triple indirect addressing. The pointers no longer address single blocks of data but rather large extents of data. It is unlikely that any indirect pointers will be needed at all as the 10 direct pointers can define large spaces due to the variant length of the extents themselves.

Version 3 and Version 4 Extent Mapping: "Typed Extents"
The above discussion is true only for the Version 2 disk layout. In addition to the above, in V3/V4 we also have "Typed Extents", which allow any level of indirection, so very large files can be built from many small extents if required (this is not desirable, however!). Version 2 also imposes the limit that all indirect extents be the same size (direct extents can be variable in length). In V3/V4 we can have indirect extents of any size mix. V3/V4 will always attempt to use the simplest approach: the 10 direct pointers are used when possible, and inodes are converted to typed indirect mapping when the file exceeds the capability of 10 direct extents.


10–4. SLIDE: Extent Allocation Policies

Extent Allocation Policies • Disk space allocation: the Block Size can be 1K, 2K, 4K, 8K. • Extents are predefined in free space - “Power of 2 Rule” • Preferred allocation rules • Largest single extent is 16MB (with 8K block size). • Full use of Single Indirection in default HFS would also be 16MB • VxFS supports large files without indirection.

Student Notes

Disk Space Allocation: The Block Size
Disk space is allocated by the system in 1024-byte device blocks (DEV_BSIZE). An integral number of device blocks are grouped together to form a file system block. VxFS supports file system block sizes of 1024, 2048, 4096, and 8192 bytes. The default block size is:
• 1024 bytes for file systems less than 8 gigabytes;
• 2048 bytes for file systems less than 16 gigabytes;
• 4096 bytes for file systems less than 32 gigabytes;
• 8192 bytes for file systems 32 gigabytes or larger.

The block size may be specified as an argument to the mkfs or newfs utility and may vary between VxFS file systems mounted on the same system. VxFS allocates disk space to files in extents. An extent is a set of contiguous blocks (up to 2048 blocks in size).
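For example, creating a VxFS file system with a 4-KB block size on a hypothetical logical volume:

# newfs -F vxfs -b 4096 /dev/vg00/rlvol6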


Extents in Free Space - “Power of 2 Rule” Free space is described by bitmaps in each allocation unit. The allocation units are split into 16 “sections”. Each section has a series of bitmaps that represent all the possible extents with sizes from 1 block to 2048 blocks by powers of 2. The first bitmap represents all the blocks in the section as one block extents, the second as two block extents, the third as four block extents, etc. The first bitmap, of 2048 bits, represents the section as 2048 one-block extents. The second bitmap, of 1024 bits, represents the section as 1024 two-block extents. This continues for all powers of 2 up to the single bit that represents one 2048 block extent. The file system uses this bitmapping scheme to find an available extent closest in size to the space required. This keeps files as contiguous as possible for faster performance. The largest possible extent on a file in a VxFS file system (with the largest block size of 8 KB) is 2048 * 8 KB = 16 MB.

Preferred Allocation

The following rules are satisfied wherever possible, starting with the preferred rules at the top and working down to less preferred rules.

• Allocate files using contiguous extents of blocks
• Attempt to allocate each file in one extent of blocks
• If not possible, attempt to allocate all extents for a file close to each other
• If possible, attempt to allocate all extents for a file in the same allocation unit

An allocation unit is an amount of contiguous (and therefore close together) file system space equal to 32 MB in size. It is roughly analogous to the HFS cylinder group, but is not dependent on the geometry of the disk drive in any way.


10–5. SLIDE: JFS Intent Log

JFS Intent Log

[Slide diagram: a timeline showing metadata updates in memory (i.e. superblock or inode table updates), JFS intent log writes, periodic syncs, and a system crash.]

Student Notes

A key advantage of JFS is that all file system transactions are written to an Intent Log. The logging of file system transactions helps to ensure the integrity of the file system, and allows the file system to be recovered quickly in the event of a system crash.

How the Intent Log Works When a change is made to a file within the file system, such as a new file being created, a file being deleted, or a file being updated, a number of updates must be made to the superblock, inode table, bit maps, and other structures for that file system. These changes are called metadata updates. Typically, there are multiple metadata updates, which take place every time a change is made to a file. With JFS, after every successful file change (also called a transaction), all the metadata updates related to that transaction get written out to a JFS Intent Log. The purpose of the Intent Log is to hold all completed transactions that have not yet been flushed out to disk. If the system were to crash, the file system could quickly be recovered by checking the file system and applying all transactions in the intent log. Since only completed transactions are logged, there is no risk of a file change being only partially updated (i.e. only some metadata


updates related to the transaction being logged, and other metadata updates related to the same transaction not being logged). The logging of only COMPLETED transactions prevents the file system from being out-of-sync due to a crash occurring in the middle of a transaction. Either the entire transaction is logged or none of the transaction is logged. This allows the JFS intent log to be used in a recovery situation as opposed to a standard fsck. The JFS recovery is done in seconds, as opposed to a standard fsck that (on a big file system) could take minutes, or even hours.

Example

Using the example on the slide, assume that each file transaction requires from one to four metadata updates. After each successful file transaction, all the related metadata updates are written to the JFS intent log. After 30 seconds, all the metadata updates are written out to disk by the sync daemon, and a corresponding DONE record is written to the JFS intent log for each JFS transaction that was flushed during the sync. The system can now reuse that space in the JFS intent log for new JFS transactions.

When a crash occurs (in our example, in the middle of a file transaction), the uncompleted transaction never has any metadata written to the JFS intent log; therefore only one transaction is in the JFS intent log since the last sync. Only this transaction needs to be redone, and then the file system is recovered and in a stable state. Compare this with having to do a standard fsck.

Performance Impacts

The intent log size is chosen when a file system is created and cannot be subsequently changed. The mkfs utility uses a default intent log size of 1024 blocks. The default size is sufficient for most workloads. If the system is used as an NFS server, for intensive synchronous write workloads, or for dynamic "work file loads" with many metadata changes, performance may be improved using a larger log size.

File data is not normally written to the intent log. However, if the application has designated to do synchronous writes and the writes are 32 KB or smaller, the file data will be written to the intent log along with the metadata. This behavior can be modified by mount options (discussed later in this module).

With larger intent log sizes, recovery time is proportionately longer and the file system may consume more system resources (such as memory) during normal operation. There are several system performance benchmark suites for which VxFS performs better with larger log sizes. As with block sizes, the best way to pick the log size is to try representative system loads against various sizes and pick the fastest.

The performance degradation occurs when the entire JFS intent log becomes filled with pending JFS transactions. In these situations, all new JFS transactions must wait for DONE records to arrive for the existing JFS transactions. Once the DONE records arrive, the space used by the corresponding transactions can be freed and reused for new transactions. Having to wait for DONE records to arrive can significantly decrease performance with JFS. In these cases, it is suggested the JFS file system be reinitialized with a larger JFS intent log.


CAUTION: Network file systems (NFS) can generate a large number of metadata updates if accessed concurrently by multiple systems. For JFS file systems being exported for network access via NFS, it is strongly recommended these file systems have an intent log size of 16 MB (the maximum size for the intent log).


10–6. SLIDE: Intent Log Data Flow

Intent Log Data Flow

[Slide diagram: a process issues a system call; (1) the in-memory superblock, inodes, and bitmaps are updated; (2) a JFS transaction is packaged; (3) the transaction is written to the on-disk intent log; (4) the modified structures pass through the buffer cache to the allocation units on disk; (5) a DONE record is written to the intent log. The numbered steps are described below.]

Student Notes

The slide above shows a graphical representation of how JFS transactions are processed. A system call is issued (for example, a write call).

1. All in-memory data structures related to the transaction are updated. These in-memory structures include the superblock, the inode table, and the bitmaps.

2. Once the in-memory structures are updated, a JFS transaction is packaged containing the modifications to the in-memory structures. This packaged transaction contains all the data needed to reproduce the transaction (should that be necessary).

3. Once the JFS transaction is created, it is written to the intent log. (When it is written depends on mount options.) At this point, control is returned to the system call.

4. Since the transaction is now stored on disk (in the intent log), there is no hurry to flush the in-memory data structures to their corresponding disk-based data structures.


Therefore, the in-memory structures are transferred to the buffer cache, and the sync daemon flushes out these transactions within the next 30 seconds.

5. After the metadata structures are flushed out, a DONE record is written to the intent log, indicating the transaction has been updated on disk and that the corresponding transaction no longer needs to be kept in the intent log.


10–7. SLIDE: Understand Your I/O Workload

Understand Your I/O Workload

• Is it data-intensive?
  − few files, large chunks being shuffled around
• Is it attribute-intensive?
  − many files, small chunks being shuffled
• Is the access pattern random or sequential I/O?
  − Check for read(), write(), and lseek() system calls
• What is the bandwidth and size of the I/Os?
  − Are these consistent?
• Spindles Win Prizes!
  − LVM or VxVM Stripes
  − Use XP Disk Arrays

Student Notes

Understand Your I/O Workload

Tuning the file system's parameters to optimize performance can only be done effectively when you know what type of I/O the application is doing. It would be wrong to tune for a large block size and maximum contiguous space allocation if the application does many small random I/Os to many small files.

Data Intensive?

Commercial data base applications generally deal with very large files in the table space and large I/Os to those files. Any high degree of small random I/O should be taken care of by the data base's own buffers (System Global Area) and the HP-UX buffer cache (if it is being used). We may choose to increase the block size in this situation and tune for maximum read ahead/write behind. The following slides will cover this type of tuning.


Attribute Intensive?

Some applications generate many small I/Os to many small files. In this situation a large block size and maximum read ahead/write behind would be inappropriate, generating more I/O than is necessary. A mail server or web server could be regarded as such an application.

Sequential or Random I/O?

We need to characterize the I/O from an application as sequential or random. In general, sequential I/O will benefit from a larger block size and contiguous files, while random I/O will require a smaller block size to increase the number of blocks that can be maintained in the buffer cache. With sequential I/O we are more interested in maximizing the MB/sec throughput of the disk (as seen with sar or glance, etc.). With random I/O we will be looking at the "I/Os per second" metrics associated with the disk (r+w/s in sar). Remember that the fastest random I/Os we do are the ones that never go to the disk(!) because they are in the buffer cache (we hope).

The Direct I/O feature of OnlineJFS 3.5 is an attempt to recognize when I/Os to a file are very large and sequential. Direct I/O will then attempt to bypass the buffer cache to benefit the generator of the large I/Os in question. Most applications are not designed to handle their own buffering and will lose a great deal of performance if they attempt to use Direct I/O.

Disk Bandwidth

In the end, we can only get so much performance out of a single spindle. Modern fast disks (10,000+ RPM, 5ms access time) can only provide an absolute maximum of approximately 10 MB/s for very sequential I/O and around 150 I/Os per second. Once your file system is extracting these sorts of numbers (or even 50% of them!), you can consider that the hardware has become the limiting factor. Stop tuning and buy more disks! Remember that "spindles win prizes". LVM or VxVM striping will help in this situation, as single-spindle performance is aggregated across the number of spindles. Using expensive RAID technology like the HP XP256, XP512, or XP1024 Disk Arrays will also improve apparent "spindle" performance. The author has seen a single XP512 logical device provide a sustained 60MB/s read performance for sequential I/O, and over 1500 I/Os per second for a single-threaded random application test to a single logical device.
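To see where a disk sits relative to these ceilings, sar's disk report is a quick check (a sketch; the 5-second interval and count of 12 are arbitrary choices):

# sar -d 5 12

In the output, r+w/s corresponds to the "I/Os per second" metric discussed above, and blks/s (reported in 512-byte blocks) divided by 2048 gives the MB/s throughput for the device.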


10–8. SLIDE: Performance Parameters

Performance Parameters

Things that an administrator can change to optimize JFS:

• Choosing a Block Size
• Choosing an Intent Log Size
• Choosing Mount Options
• Kernel Tunables
  − Internal Inode Table Size
• Monitoring Free Space and Fragmentation
• Changing extent attributes on individual files
• I/O Tuning
  − Tunable VxFS I/O Parameters
    - Command Line
    - Configuration file (/etc/vx/tunefstab)

Student Notes

We will discuss the following choices over the next slides. Note that some parameters can only be set when the file system is created.

• At file system creation time (only):
  − Choosing a Block Size
  − Choosing an Intent Log Size

• After file system creation:
  − Choosing Mount Options
  − Kernel Tunables
  − Kernel Inode Table Size
  − Monitoring Free Space and Fragmentation
  − Changing extent attributes on individual files
  − I/O Tuning
  − Tunable VxFS I/O Parameters


10–9. SLIDE: Choosing a Block Size

Choosing a Block Size

• Choose the right block size for the application.
• Consider maximum block size (8K) for a large file data base
  − Small files will waste space
  − System overhead will be less
  − Files approaching 1GB are "large"
• Consider minimum block size (1K) for a small file mail server or web server
  − More system overhead if files are large
• Use a large block size for a sequential I/O application
• Use a small block size for a random I/O application

Student Notes

You specify the block size when a file system is created; it cannot be changed later. The standard HFS file system defaults to a block size of 8K with a 1K fragment size. This means that space is allocated to small files (up to 12 blocks) in 1K increments. Allocations for larger files are done in 8K increments. Because many files are small, the fragment facility saves a large amount of space compared to allocating space 8K at a time.

The unit of allocation in VxFS is a block. There are no fragments, because storage is allocated in extents that consist of one or more blocks. The smallest block size available is 1K, which is also the default block size for VxFS file systems created on devices of less than 8 gigabytes.

Choose a block size based on the type of application being run. For example, if there are many small files, a 1K block size may save space. For large file systems, with relatively few files, a larger block size is more appropriate. The trade-offs of specifying larger block sizes are: 1) a decrease in the amount of space used to hold the free extent bitmaps for each allocation unit, 2) an increase in the maximum extent size, and 3) a decrease in the number of extents used per file versus an increase in the amount of space wasted at the end of files that are not a multiple of the block size.


Larger block sizes use less disk space in file system overhead, but consume more space for files that are not a multiple of the block size. The easiest way to judge which block sizes provide the greatest system efficiency is to try representative system loads against various sizes and pick the fastest.

Specifying the Block Size

The following newfs command creates a VxFS file system with the maximum block size and support for large files.

# newfs -F vxfs -b 8192 -o largefiles /dev/vgjfs/rlvol1

The block size for files on the file system represents the smallest amount of disk space that can be allocated to a file. It must be a power of 2 selected from the range 1024 to 8192. The default is 1024 for file systems less than 8 gigabytes, 2048 for file systems less than 16 gigabytes, 4096 for file systems less than 32 gigabytes, and 8192 for larger file systems.


10–10. SLIDE: Choosing an Intent Log Size

Choosing an Intent Log Size

• Intent log size cannot be changed after file system creation
• mkfs applies a default log size of 1024 blocks
• Performance may improve with a larger log size
  − An NFS server will benefit from a 16MB (largest) log size
  − Synchronous write intensive applications

Student Notes

The intent log size is chosen when a file system is created and cannot be changed afterwards. The default intent log size chosen by mkfs is 1024 blocks and is suitable in most situations. For some types of applications (an NFS server or intensive synchronous write loads), performance may be improved by increasing the size of the intent log. Note that recovery time will also be proportionally longer as the log size increases. Memory requirements for log maintenance will also increase as the log size increases. Ensure that the log size is not more than 50% of the physical memory size of the system, or fsck will not be able to fix it after a system crash. The ideal log size for NFS is 2048 with a file system block size of 8192.


Specifying the Intent Log Size

To create a VxFS file system with a default block size and a 16MB intent log:

# newfs -F vxfs -o logsize=16384 /dev/vgjfs/rlvol1

"-o logsize=" specifies the number of file system blocks to allocate for the transaction-logging area. It must be in the range of 32 to 16384 blocks. The minimum number for Version 2 disk layouts is 32 blocks. The minimum number for Version 3 and Version 4 disk layouts is the number of blocks that make the log no less than 256K. If the file system is:

• greater than or equal to 8MB, the default is 1024 blocks
• greater than or equal to 2MB, and less than 8MB, the default is 128 blocks
• less than 2MB, the default is 32 blocks

While logsize is specified in blocks, the maximum size of the intent log is 16384 KB. This means the maximum values for logsize are:

• 16384 for a block size of 1024 bytes
• 8192 for a block size of 2048 bytes
• 4096 for a block size of 4096 bytes
• 2048 for a block size of 8192 bytes


10–11. SLIDE: Intent Log Mount Options

Intent Log Mount Options

• Full logging*                        log
• Delayed logging                      delaylog
• Temporary logging                    tmplog
• No logging                           nolog
• Disallow small sync I/Os in log      nodatainlog (50% perf cost!)
• Force clear new file blocks          blkclear (10% perf cost!)

*Note: only the first option is the default for mount

Student Notes

JFS offers mount options to delay or disable transaction logging to the intent log. This allows the system administrator to make trade-offs between file system integrity and performance. Following are the logging options:

Full logging (log)
File system structural changes are logged to disk before the system call returns to the application (synchronously). If the system crashes, fsck(1M) will complete logged operations that have not completed.

Delayed logging (delaylog)
Some system calls return before the intent log is written. This improves the performance of the system, but some changes are not guaranteed until a short time later when the intent log is written. This mode approximates traditional UNIX system guarantees for correctness in case of system failure.

Temporary logging (tmplog)
The intent log is almost always delayed. This improves performance, but recent changes may disappear if the system crashes. This mode is only recommended for temporary file systems.

No logging (nolog)
The intent log is disabled. The other three logging modes provide for fast file system recovery; nolog does not provide fast file system recovery. With nolog mode, a full structural check must be performed after a crash. This may result in loss of substantial portions of the file system, depending upon activity at the time of the crash. Usually, a nolog file system should be rebuilt with mkfs(1M) after a crash. The nolog mode should only be used for memory resident or very temporary file systems.

nodatainlog
The nodatainlog mode should be used on systems with disks that do not support bad block revectoring. Normally, a VxFS file system uses the intent log for synchronous writes. The inode update and the data are both logged in the transaction, so a synchronous write only requires one disk write instead of two. When the synchronous write returns to the application, the file system has told the application that the data is already written. If a disk error causes the data update to fail, then the file must be marked bad and the entire file is lost. If a disk supports bad block revectoring, then a failure on the data update is unlikely, so logging synchronous writes should be allowed. If the disk does not support bad block revectoring, then a failure is more likely, so the nodatainlog mode should be used. A nodatainlog mode file system should be approximately 50 percent slower than a standard mode VxFS file system for synchronous writes. Other operations are not affected.

blkclear
The blkclear mode is used in increased data security environments. The blkclear mode guarantees that uninitialized storage never appears in files. The increased integrity is provided by clearing extents on disk when they are allocated to a file. Extending writes are not affected by this mode. A blkclear mode file system should be approximately 10 percent slower than a standard mode VxFS file system, depending on the workload.
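As a sketch of how these logging options are applied at mount time (the volume and mount point names here are hypothetical):

# mount -F vxfs -o delaylog /dev/vg01/lvol3 /data
# mount -F vxfs -o remount,log /dev/vg01/lvol3 /data

The first command mounts the file system with delayed logging; the second switches the same file system back to full logging with an online remount.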


10–12. SLIDE: Other JFS Mount Options

Other JFS Mount Options

mincache options (buffer cache)     convosync options (synchronous I/O)
• closesync*                        • closesync
• direct                            • direct
• dsync                             • dsync
• unbuffered                        • unbuffered
• tmpcache                          • delay

* NOTE: This is the only additional option available with BaseJFS; all other options require OnlineJFS.

Student Notes

Understanding asynchronous, data synchronous (O_DSYNC), and fully synchronous (O_SYNC) application I/O

When an application program opens a file with the open() system call, the programmer makes a decision on how the I/Os will occur between the application memory and the file system. The following three options are available, in order, ranging from highest performance (lowest integrity) to lowest performance (best integrity). In this discussion, "integrity" refers to the potential damage to file system structures and customer data during a system crash.

1. Asynchronous I/O (Standard Mode): High performance / Low integrity

   In asynchronous mode, all application I/Os are done to buffer cache, including data and inode modifications. The write() system call will return quickly to the application, which can continue "in faith" that the data will make it to the disk. Data integrity will be fully compromised by a system crash, and new "just created" files may even disappear.


2. Data Synchronous I/O (O_DSYNC): Low performance / Good integrity

   If the file is opened with the O_DSYNC flag, the file is in "Data Synchronous" mode. In this situation, write() system calls that modify data do not return until the disk has acknowledged the receipt of the data. However, some inode changes (time stamps, etc.) are still performed asynchronously and may not have arrived at the disk in the case of a system crash.

3. Synchronous I/O (O_SYNC): Lowest performance / Best integrity

   Fully synchronous behavior is obtained by opening the file with O_SYNC. All operations are now synchronous, and write() system calls block for both data and inode modifications. Minimal damage will now occur in the event of a system crash.

mincache vs. convosync

mincache manipulates the behavior of the buffer cache. All of the mincache options except mincache=closesync require the OnlineJFS product (see slide). convosync ("convert osync") changes the behavior of data synchronous (O_DSYNC) and synchronous (O_SYNC) writes. All convosync options require OnlineJFS. The mincache and convosync options generally control the integrity of the user data, where the log options (log, delaylog, tmplog, nolog) control the integrity of the metadata only.

mincache

mincache=closesync      Flush data to disk synchronously when the file is closed.

The mincache=closesync mode is useful in desktop environments where users are likely to shut off the power on the machine without halting it first. In this mode, any changes to the file are flushed to disk synchronously when the file is closed. To improve performance, most file systems do not synchronously update data and inode changes to disk. If the system crashes, files that have been updated within the past minute are in danger of losing data. With the mincache=closesync mode, if the system crashes or is switched off, only files that are currently open can lose data. A mincache=closesync mode file system should be approximately 15 percent slower than a standard mode VxFS file system, depending on the workload.

mincache=direct         Bypass the buffer cache for all data and inode changes; forces fully synchronous behavior and totally skips the buffer cache.

mincache=unbuffered     Bypass the buffer cache for data only. Inode changes are cached. Forces data synchronous-like behavior with no data in cache.

mincache=dsync          Equivalent to normal data synchronous behavior. Write does not return until data is on disk, but data does go through the buffer cache.

The mincache=direct, mincache=unbuffered, and mincache=dsync modes are used in environments where applications are experiencing reliability problems caused by the kernel buffering of I/O and delayed flushing of non-synchronous I/O. The mincache=direct and mincache=unbuffered modes guarantee that all nonsynchronous I/O requests to files will be handled as if the VX_DIRECT or VX_UNBUFFERED caching advisories had been specified. The mincache=dsync mode guarantees that all nonsynchronous I/O requests to files will be handled as if the VX_DSYNC caching advisory had been specified. Refer to vxfsio(7) for explanations of VX_DIRECT, VX_UNBUFFERED, and VX_DSYNC. The mincache=direct, mincache=unbuffered, and mincache=dsync modes also flush file data on close as mincache=closesync does.

mincache=tmpcache       Speeds up file growth by breaking data initialization rules.

The -o mincache=tmpcache option only affects write extending calls and is not available to files performing synchronous I/O. Write extending calls refer to write calls that cause new file system blocks to be assigned to the file, extending the size of the file in blocks. The normal behavior for write extending calls is to write the new user data first, and insist on metadata being written only after the user data. Write extending calls are expensive from a performance standpoint, because the write call has to wait for the user data and the metadata to be written. A non-extending write call only requires the call to wait for the metadata. With the -o mincache=tmpcache option, write extending calls do not have to wait for the user data to be written. This option allows the metadata to be written before user data (and the write call to return before the user data is written), significantly improving performance.

CAUTION: The -o mincache=tmpcache option significantly increases the likelihood of non-initialized file system blocks (i.e. junk) appearing in files during a system crash. This is due to the file pointing to data blocks before the data is actually there. If the system crashes between the file's inode being updated (done first) and the user data being written (done second), then uninitialized data will appear in the file. The tmpcache option should only be used for memory resident or very temporary file systems.


convosync

NOTE: Use of the convosync=dsync option violates POSIX guarantees for synchronous I/O.

The "convert osync" (convosync) mode has five values: convosync=closesync, convosync=direct, convosync=dsync, convosync=unbuffered, and convosync=delay.

The convosync=closesync mode converts synchronous and data synchronous writes to non-synchronous writes and flushes the changes in the file to disk when the file is closed.

The convosync=delay mode causes synchronous and data synchronous writes to be delayed rather than to take effect immediately. No special action is performed when closing a file. This option effectively cancels any data integrity guarantees normally provided by opening a file with O_SYNC. See open(2), fcntl(2), and vxfsio(7) for more information on O_SYNC.

Caution! Extreme care should be taken when using the convosync=closesync or convosync=delay modes, because they actually change synchronous I/O into non-synchronous I/O. This may cause applications that use synchronous I/O for data reliability to fail if the system crashes and synchronously written data is lost.

The convosync=direct and convosync=unbuffered modes convert synchronous and data synchronous reads and writes to direct reads and writes, bypassing the buffer cache. The convosync=dsync mode converts synchronous writes to data synchronous writes. As with closesync, the direct, unbuffered, and dsync modes flush changes in the file to disk when it is closed. These modes can be used to speed up applications that use synchronous I/O. Many applications that are concerned with data integrity specify O_SYNC in order to write the file data synchronously. However, this has the undesirable side effect of updating inode times and therefore slowing down performance. The convosync=dsync, convosync=unbuffered, and convosync=direct modes alleviate this problem by allowing applications to take advantage of synchronous writes without modifying inode times as well.

NOTE: Before using convosync=dsync, convosync=unbuffered, or convosync=direct, make sure that all applications that use the file system do not require synchronous inode time updates for O_SYNC writes.
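As an illustration, a file system whose application opens files with O_SYNC purely for data integrity might be mounted as follows (hypothetical device and mount point; requires OnlineJFS):

# mount -F vxfs -o delaylog,convosync=dsync /dev/vg01/lvol4 /oradata

This converts the application's O_SYNC writes to data synchronous writes, avoiding the inode time update overhead described above.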


10–13. SLIDE: JFS Mount Option: mincache=direct

JFS Mount Option: mincache=direct

[Slide diagram: two data-flow panels for an Oracle process with an SGA database cache. "Data Flow with default mount options" shows I/O passing through the HP-UX buffer cache to the ORACLE database; "Data Flow with mount option mincache=direct" shows the buffer cache bypassed.]

Student Notes

The above slide illustrates the impact of setting the -o mincache=direct option. By default, all JFS file system I/O goes through the system's buffer cache. When an application does its own caching (e.g. an Oracle database application), there are two levels of caching. One cache is managed by the application; the other cache is managed by the kernel. Using two caches is inefficient from both a performance and a memory usage standpoint (data exists in both caches). When the file system is mounted with the -o mincache=direct option, the system's buffer cache is bypassed and the data is written directly to disk. This improves performance and keeps the buffer cache available for other file systems that do not go through an application cache.


CAUTION: Use of the -o mincache=direct option can lead to a significant decrease in performance if used in the wrong situation. This option should only be used if:

1. An application creates and maintains its own data cache, and
2. All the files on the file system are cached in the application's data cache.

If there are files being accessed on the mounted file system that are not being cached by the application, this option should not be used.

NOTE: This option is only available with the OnlineJFS product.
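A typical /etc/fstab entry for such a self-caching database file system might look like the following (the device, mount point, and option mix are illustrative only):

/dev/vg01/lvol5 /oradata vxfs delaylog,mincache=direct,convosync=direct 0 2

With this entry, both normal and synchronous I/O to /oradata bypass the buffer cache, leaving the cache free for file systems that are not cached by an application.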


10-14. SLIDE: JFS Mount Option: mincache=tmpcache

JFS Mount Option: mincache=tmpcache

[Slide diagram: two panels, "default" and "mincache=tmpcache", each showing a process, the buffer cache, a JFS transaction, and the on-disk superblock, intent log, allocation units, inode, and file. In the default panel, (1) file data is written before (2) the JFS transaction; in the tmpcache panel, (1) the JFS transaction is written to the intent log before (2) the file data.]

Student Notes

By default, when a process performs a write extending call, the new data is written to disk before the file's inode is updated. In the slide above, the left side shows the default behavior:

1. Write data to the newly allocated file system block.
2. Write the JFS transaction metadata out to the disk. The system call returns.

The advantage of this behavior is that uninitialized data will not be found within the file should a system crash occur. This is important from a data integrity standpoint. The disadvantage of this behavior is slow performance, because the JFS transaction must wait for the user data I/O to complete before it can be written to the intent log.

Behavior with the -o mincache=tmpcache Option

Performance can be improved (at the expense of data integrity) by mounting file systems with the -o mincache=tmpcache option. This option allows the JFS transactions to be written to the intent log before the user data is written to the file. In the slide, the right side shows the tmpcache behavior:


1. Write the JFS transaction out to disk. (The system call returns.)
2. Write data to the newly allocated file system block.

The advantage of this behavior is that performance of write extending calls is fast. The system does not wait for the user data to be written to disk. The disadvantage is that data integrity of the file is jeopardized, especially if the file is being updated at the time of a system crash. By updating the file's inode first, the file points to uninitialized data blocks which contain unknown data. The uninitialized file system blocks are expected to be initialized soon after the inode is updated; however, there still exists a small window of time when the file's inode references unknown data. If the system crashes during this small window, then the file will still be referencing the uninitialized data after the crash.

CAUTION:

The -o mincache=tmpcache option should only be used for memory resident or very temporary file systems.
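A hedged example for a scratch area that can simply be rebuilt after a crash (the names are hypothetical):

# mount -F vxfs -o tmplog,mincache=tmpcache /dev/vg01/lvol6 /scratch

Combining tmplog with mincache=tmpcache trades integrity for speed on both the metadata and the user data side, which is only defensible for genuinely temporary data.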


10–15. SLIDE: Kernel Tunables

Kernel Tunables

• VxFS inodes are cached in memory, separate from HFS.
• The kernel parameter ninode has no effect on VxFS.
• When vx_ninode is zero (the default), the inode cache is set in proportion to system memory (see table).
• vx_ncsize sets the directory name lookup cache (1KB)

Student Notes

Internal Inode Table Size

VxFS caches inodes in an inode table (see the table below, "Inode Table Size"). There is a tunable in VxFS called vx_ninode that determines the number of entries in the inode table. A VxFS file system obtains the value of vx_ninode from the system configuration file used for making the kernel (/stand/system, for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode is set to zero. The kernel then computes a value based on the system memory size.


Total Memory in Mbytes    Maximum Number of Inodes
         8                          400
        16                         1000
        32                         2500
        64                         6000
       128                         8000
       256                        16000
       512                        32000
      1024                        64000
      2048                       128000
      8192                       256000
     32768                       512000
    131072                      1024000

If the available memory is a value between two entries, the value of vx_ninode is interpolated.

Other VxFS Kernel Parameters

vx_ncsize    Controls the size of the DNLC (directory name lookup cache) in the kernel. Recent directory path names are stored in memory to improve performance. This parameter is set in DNLC entries. The size of the DNLC is set to the sum of ninode and vx_ncsize.
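Because these are static tunables, fixing vx_ninode to a specific value means changing the kernel configuration and rebuilding; a minimal sketch, assuming a target of 128000 cache entries:

# kmtune -q vx_ninode
# kmtune -s vx_ninode=128000

A kernel rebuild and reboot (mk_kernel, kmupdate, shutdown, as shown in Module 12) are then required for the new value to take effect.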


10–16. SLIDE: Fragmentation

Fragmentation

• Keep file system free space over 10%
• Maintain free space distribution goals
  − Monitor with df(1M) or fsadm(1M)
• Repack files and free space with fsadm -e
  − Reduces the number of extents in large files
  − Makes small files contiguous (one extent)
  − Moves small, recently used files closer to inode structures
  − Optimizes free space into larger extents
• Repack directories with fsadm -d
  − Removes empty entries from directories
  − Places recently used files at the beginning of directory lists
  − Packs small directories directly in the inode if possible

Student Notes

• Keep file system free space over 10%

  In general, VxFS works best if the percentage of free space in the file system does not get below 10 percent. This is because file systems with 10 percent or more free space have less fragmentation and better extent allocation. Regular use of the df(1M) command to monitor free space is desirable. Full file systems should therefore have some files removed, or should be expanded (see fsadm(1M) for a description of online file system expansion).

• Maintain free space distribution goals

  Three factors can be used to determine the degree of fragmentation:

  − percentage of free space in extents of less than 8 blocks in length
  − percentage of free space in extents of less than 64 blocks in length
  − percentage of free space in extents of 64 blocks or greater

  An unfragmented file system will have the following characteristics:


  − less than 1% of free space in extents < 8 blocks in length
  − less than 5% of free space in extents < 64 blocks in length
  − more than 5% of total file system size available as free extents in lengths of 64 or more blocks

  A fragmented file system will have the following characteristics:

  − greater than 5% of free space in extents < 8 blocks in length
  − more than 50% of free space in extents < 64 blocks in length
  − less than 5% of total file system size available as free extents in lengths of 64 or more blocks

Using df(1M)

The following example shows how to use df to map free space:

# df -F vxfs -o s /usr
/usr   (/dev/vg00/lvol7 ) :
Free Extents by Size
    1:    823      2:    206      4:     55      8:    206
   16:    158     32:     61     64:     48    128:     43
  256:     23    512:     14   1024:      3   2048:      3
 4096:      1   8192:      1  16384:      0  32768:      0
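fsadm can report on fragmentation and then reorganize the file system online; a sketch (the extent and directory reorganization passes require OnlineJFS):

# fsadm -F vxfs -E -D /usr
# fsadm -F vxfs -e -d /usr

The first command prints extent (-E) and directory (-D) fragmentation reports; the second performs the corresponding extent (-e) and directory (-d) reorganizations.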

• Repack files and free space

  fsadm -e has the following goals for files and free data space:

  − Make "small" files contiguous (one extent)
  − Reduce the number of extents in large files
  − Move small, recently used files closer to inode structures
  − Optimize free space into larger extents

# timex cat /vxfs/file* > /dev/null

Record results:

Real: _____________ User: ____________ Sys: ____________

5. Now that the data is in the client's buffer cache, time how long it takes to read the exact same files again.

   # timex cat /vxfs/file* > /dev/null

   Record results:

   Real: _____________ User: ____________ Sys: ____________

   Moral: Try to have a buffer cache on the client system big enough for a lot of data to be cached. Also, the biod daemons will help by prefetching data.


6. Test to see if fewer biod daemons will change the initial performance.

   # cd /
   # umount /vxfs
   # kill $(ps -e | grep biod | cut -c1-7)
   # /usr/sbin/biod 4
   # mount server_hostname:/vxfs /vxfs
   # timex cat /vxfs/file* > /dev/null

   Record results:

   Real: _____________ User: ____________ Sys: ____________

7. Once finished, remove the files and umount the file system.

   # rm /vxfs/file*
   # umount /vxfs


Lab 2 – Network Write Performance

The following lab has the client perform many writes to an NFS file system. The following parameters will be investigated:

• Number of biod daemons
• NFS version 2 versus NFS version 3
• TCP versus UDP

During this lab, the monitoring tools shown below should be used on the client and server.

CLIENT                                 SERVER
# nfsstat -c                           # nfsstat -s
# glance NFS report (n key)            # glance NFS report (n key)
# glance Global Process (g key)        # glance Global Process (g key)
  - monitor biod daemons                 - monitor nfsd daemons
                                       # glance Disk report (d key)
                                         - monitor Remote Rds/Wrts

1. From the NFS client, mount the NFS file system as a version 2 file system.

   # mount -o vers=2 server_hostname:/vxfs /vxfs

2. Terminate all the biod daemons on the client.

   # kill $(ps -e | grep biod | cut -c1-7)

3. Time how long it takes to copy the vmunix file to the mounted NFS file system. The first command buffers the file.

   # cat /stand/vmunix > /dev/null
   # timex cp /stand/vmunix /vxfs

   Record results:

   Real: _____________ User: ____________ Sys: ____________

4. Now, start up the biod daemons, and retry timing the copy.

   # /usr/sbin/biod 4
   # timex cp /stand/vmunix /vxfs

   Record results:

   Real: _____________ User: ____________ Sys: ____________


5. Change the mount options to version 3 and retime the transfer:

   # cd /
   # umount /vxfs
   # mount -o vers=3 server_hostname:/vxfs /vxfs
   # cd /
   # timex cp /stand/vmunix /vxfs

   Record results:

   Real: _____________ User: ____________ Sys: ____________

6. Compare the speed of FTP to NFS. Transfer the file to the server using the ftp utility.

   # ftp server_hostname
   ftp> put /stand/vmunix /vxfs/vmunix.ftp

   How long did the FTP transfer take? _________ Explain the difference in performance.

7. Test the potential performance benefit of turning off the new TCP feature of HP-UX 11i. First, mount the file system with the UDP protocol rather than the default TCP.

   # umount /vxfs
   # mount -o vers=3 -o proto=udp server_hostname:/vxfs /vxfs

   Perform the copy test again and compare the results with the TCP version 3 mount data from step 5. Is UDP quicker than TCP?

   # timex cp /stand/vmunix /vxfs


Module 12 – Tunable Kernel Parameters

Objectives

Upon completion of this module, you will be able to do the following:

• Identify which tunable parameters belong to which category
• Identify tunable kernel parameters that could impact performance
• Tune both static and dynamic tunable parameters


12–1. SLIDE: Kernel Parameter Classes

Kernel Parameter Classes

• Static
  − requires a kernel rebuild and a reboot
• Dynamic
  − changes take place immediately
  − changes survive a reboot
• Automatic
  − constantly being tuned by the kernel
  − can be set manually to a fixed value

Student Notes

There are a number of tunable parameters within the kernel that can have a big impact on performance. Changing some of these parameters may require that a new kernel be compiled. As of 11i v1, about 12 parameters were converted to dynamically tunable parameters; that is, their values could be changed without rebuilding the kernel and without rebooting the system. As of 11i v2, there are now around 36 dynamically tunable parameters, plus a few traditional parameters that are now tuned by the kernel, so no manual tuning of them need be done at all.

Static kernel parameters have been around since UNIX was first designed. In order to change one of these parameters, it was necessary to alter the contents of a system configuration file, system, rebuild the kernel using this altered configuration file, move the new kernel into place, and reboot the system to activate the new kernel. This tended to be time consuming and forced the system to become unavailable for a time.

Recently, with HP-UX 11i v1, a few kernel parameters were converted to dynamic tuning. These parameters could be altered, using SAM or kmtune, and the changes would become effective immediately. There was no longer a need to rebuild the kernel or reboot the system. However, this only applied to those few kernel parameters. The vast majority of kernel parameters were still static. The dozen parameters that were made dynamically tunable,


were ones that tended to be tuned by system administrators more frequently, but were relatively easy to convert to dynamic. More recently, with HP-UX 11i v2, several more parameters were converted to dynamic tuning. These parameters were also tuned fairly frequently by system administrators, but were more difficult to convert to dynamic. At the same time, a new class of parameters was introduced: automatic. These parameters are tuned by the kernel, constantly, in response to changing conditions in the system. However, the system administrator can override the automatic handling by the kernel and force the parameter to some fixed value, if needed.

At HP-UX 11i v1, the following kernel parameters became dynamic:

core_addshmem_read, core_addshmem_write, maxfiles_lim, maxtsiz, maxtsiz_64bit, maxuprc, msgmax, msgmnb, scsi_max_qdepth, semmsl, shmmax, shmseg

At HP-UX 11i v2, the following additional kernel parameters became dynamic:

aio_listio_max, aio_max_ops, aio_monitor_run_sec, aio_prio_delta_max, aio_proc_thread_pct, aio_proc_threads, aio_req_per_thread, alloc_fs_swapmap, alwaysdump, dbc_max_pct, dbc_min_pct, dontdump, fs_symlinks, ksi_alloc_max, max_acct_file_size, max_thread_proc, maxdsiz, maxdsiz_64bit, maxssiz, maxssiz_64bit, nfile, nflocks, nkthread,


nproc, nsysmap, nsysmap64, physical_io_buffers, shmmni, vxfs_ifree_timelag

Also at HP-UX 11i v2, the following kernel parameters are obsolete or automatic:

bootspinlocks, clicreservedmem, maxswapchunks, maxusers, mesg, ncallout, netisr_priority, nni, ndilbuffers, sema, semmap, shmem, spread_UP_drivers


12–2. SLIDE: Tuning the Kernel

Tuning the Kernel

• Use system_prep, kmtune, or kctune to view current values of tunable kernel parameters.
• Use SAM (or the new km/kc commands) to tune kernel parameters.
• Tune only one parameter at a time.
• Do not make parameters unnecessarily large.
• Use glance to monitor system table sizes (ensure the highest value is not equal to the total table size).
• Some kernel parameters are dynamic (no reboot); see kmtune and kctune.

Student Notes

Some general rules and notes regarding tuning and recompiling the kernel:

• View the existing tunable parameters with the kctune command (HP-UX 11i v2), the kmtune command (HP-UX 11.00 and 11i v1), or the sysdef or system_prep commands (HP-UX 10.x). You can also use SAM with any version of HP-UX to view the current values. Examples of output are shown below.

• Use the System Administration Manager (SAM) to tune the kernel parameters and rebuild the systems. SAM has the advantage of displaying all available tunable parameters, their current values, and a range of acceptable values. SAM also knows which parameters can be tuned dynamically and will make changes to them immediately. As of HP-UX 11i v2, SAM calls a separate utility to do the actual tuning.

• When tuning performance by modifying kernel parameters, modify only one value with each kernel rebuild. By changing several parameters at once, you may cloud the picture and make it much more difficult to determine what helped and what hurt the system's performance.


• Avoid setting the tunable parameters too large. Many of the parameters create in-core memory data structures whose size is dependent upon the value of the tunable parameter (for example, nproc sets the size of the process table). Generally, it is a good rule of thumb to increase or decrease a parameter by no more than 20% while trying to find the best setting for it. Of course, if you are changing a parameter's value to accommodate some new application you are installing, always follow the manufacturer's suggested changes.

• Use glance to monitor system table sizes. Ensure the system tables are not running out of entries. In general, there should be around 20% of unused entries in any table. This will ensure that you have enough entries to handle any high demand periods.

The step-by-step procedure for tuning and recompiling the kernel manually on HP-UX 11.X is shown below:

1. Log in as superuser.

2. Change directory.
   cd /stand/build

3. Create a system file from your current kernel.
   /usr/lbin/sysadm/system_prep -v -s system

4. Modify the /stand/build/system file as desired.

5. Build the kernel.
   /usr/sbin/mk_kernel -s system

6. Save your old system and kernel files, just in case you want to go back.
   cp /stand/system /stand/system.prev
   cp /stand/vmunix /stand/vmunix.prev
   cp -r /stand/dlkm /stand/dlkm_vmunix.prev

7. Schedule the kernel update on the next reboot.
   kmupdate

8. Shut down and reboot from your new kernel.
   /sbin/shutdown -ry 0

Understanding Dynamic Kernel Variables

kctune(1M), kmtune(1M), or sam can be used "on the fly" to modify some kernel variables. Any changes take place immediately without the need to reboot. In HP-UX 11i v2, kmtune still exists, but simply calls kctune.


Example using kmtune to set and then activate a new value for a dynamic kernel variable:

# kmtune -q shmseg
Parameter        Current  Dyn  Planned  Module  Version
=====================================================
shmseg               120   Y       120
# kmtune -s shmseg=155
# kmtune -l -q shmseg
Parameter:  shmseg
Current:    120
Planned:    155
Default:    120
Minimum:
Module:
Version:
Dynamic:    Yes
# kmtune -u shmseg
shmseg has been set to 155 (0x9b).
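On HP-UX 11i v2, the same operation can be performed with kctune directly; a minimal sketch:

# kctune shmseg
# kctune shmseg=155

Because shmseg is dynamic, the new value takes effect immediately and no separate activation step (like kmtune -u) is required.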


12–3. SLIDE: Kernel Parameter Categories

Kernel Parameter Categories

• File system
• Message queues
• Semaphores
• Shared memory
• Process
• Swap
• LVM
• Networking
• Miscellaneous

Student Notes

The next few slides will present the tunable kernel parameters in these categories.


12–4. SLIDE: File System Kernel Parameters

File System Kernel Parameters

Kernel Parameter   Default   Description
dbc_min_pct        5         Minimum size of dynamic buffer cache (dbc)
dbc_max_pct        50        Maximum size of dynamic buffer cache (dbc)
nbuf               0         Number of buffer headers (in 10.x and above, use DBC)
bufpages           0         Number of 4-KB buffer pages (in 10.x and above, use DBC)
fs_async           0         If on (1), forces all meta-data writes to disk to be asynchronous
maxfiles           60        Soft limit to the number of files a process can have open
maxfiles_lim       1024      Hard limit to the number of files a process can have open
nfile              formula   Size of file table in memory
ninode             formula   Size of inode table in memory
nflocks            200       Size of file-lock table in memory
vx_ncsize          1024      Size of vxfs directory name lookup cache (DNLC)

Student Notes

dbc_min_pct

dbc_min_pct specifies the minimum size that the system's buffer cache may shrink to as a percentage of physical memory. It is now dynamic in 11i v2.

dbc_max_pct

dbc_max_pct specifies the maximum size that the system's buffer cache may grow to as a percentage of physical memory. It is now dynamic in 11i v2.

nbuf

nbuf is used to specify the number of file system buffer cache headers. Set nbuf to zero if you want to use the system's ability to grow and shrink this important table dynamically, based on demand. It is not yet obsolete, but expect it to be so in a future release.

bufpages

bufpages specifies the number of 4-KB pages in memory that will be allocated for the file system buffer cache. Like nbuf, this parameter should be set to zero if you want to use the dynamic form of buffer cache allocation. If this value is non-zero, enough nbufs (one for every two bufpages) will be created as well, unless otherwise specified. It is not yet obsolete, but expect it to be so in a future release.


fs_async

fs_async specifies that file system data structures may be posted to disk asynchronously. While this can speed file system performance for some applications, it increases the risk that a file system will be corrupted in the event of system power loss.

maxfiles

maxfiles specifies the soft limit to the number of files that a single program may have open at one time. A program may exceed this soft limit up to the value of maxfiles_lim. In 11i v2, maxfiles is computed at boot and is set to 512, if memory is less than 1 GB. Otherwise it’s set to 2048.

maxfiles_lim

maxfiles_lim is the hard limit to the number of files that a single program can open up at one time. This parameter was made dynamic in 11i v1 and the default value was set to 4096.

nfile

nfile is the size of the file table in memory, and therefore defines the maximum number of files that may be open at any one time on the system. Every process uses at least three file descriptors. Be generous with this number, as the required memory is minimal. nfile depends on the parameters nproc, maxusers, and npty. This parameter was made dynamic in 11i v2 and was no longer dependent on maxusers. Its value is computed at boot time and is set to 16384 if memory is less than 1 GB; otherwise it’s set to 65536.

ninode

ninode is the size of the HFS in-core inode table. By caching inodes in memory the amount of physical I/O is decreased when accessing files. Each unique HFS file open on the system has a unique inode. This table is hashed for performance. At boot time in 11i v2, it’s set to 4880, if memory is less than 1GB; otherwise it’s set to 8196.

nflocks

nflocks is the number of file locks available on the system. File locks are a kernel service to enable applications to safely share files. Databases or other applications that make use of the lockf() system call can be large consumers of file locks. Note that one file may have several locks associated with it. This parameter was made dynamic in 11i v2 - at boot time, if memory is less than 1 GB, it’s set to 1200; otherwise it’s set to 4096.

vx_ncsize

Along with ninode, this parameter controls the size of the DNLC (directory name lookup cache). Recent directory path names are stored in memory to improve performance. This parameter is set in bytes. This parameter has been obsoleted in 11i v2. VxFS 3.5 now uses its own internal DNLC.
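To check whether tables such as the file, inode, and process tables are approaching their configured sizes, sar's table report is a quick cross-check (the interval and count are arbitrary):

# sar -v 5 5

The proc-sz, inod-sz, and file-sz columns show current versus maximum entries, and the ov columns count overflows; any non-zero overflow suggests the corresponding parameter is too small.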


12–5. SLIDE: Message Queue Kernel Parameters

Message Queue Kernel Parameters

Kernel Parameter   Default   Description
mesg               1         Enable or disable IPC messaging (700 only)
msgmap             formula   Size of message-free-space map
msgmax             8192      Maximum size in bytes of an individual message
msgmnb             16384     Maximum size in bytes of message queue space
msgmni             50        Maximum number of message queue identifiers
msgseg             2048      Number of segments in the system message buffer
msgssz             8         Size in bytes of segments to be allocated for messages
msgtql             40        Size of message header space (1 header per message)

Student Notes

Message queues are used by applications to transfer a small to medium amount of information from exactly one process to another process. This information could be in the form of a structure, a string, a numerical value, or any combination thereof. SVIPC message queues have been around for a long time. They are controlled by a number of tunable kernel parameters.

mesg

mesg when set (mesg = 1) enables the message queue services in the kernel. This parameter is obsolete as of 11i v2.

msgmap

msgmap specifies the size of the free-space map used in allocating message buffer segments for messages.

msgmax

msgmax specifies the maximum size in bytes of an individual message. This parameter is dynamic at HP-UX 11i v1.

msgmnb

msgmnb specifies the maximum total space consumed by all messages in a queue. This parameter is dynamic at HP-UX 11i v1.

msgmni

msgmni specifies the maximum number of message queue identifiers allowed on the system at one time. Each message queue has an associated message queue identifier stored in non-swappable kernel memory. In 11i v2, the default was raised to 512.

msgseg

msgseg is the number of segments in the system-wide message buffer. In 11i v2, the default was raised to 8192.

msgssz

msgssz is the size in bytes of each message buffer segment. In 11i v2, the default was raised to 96.

msgtql

msgtql is the total number of messages that can reside on the system at any one time. In 11i v2, the default was raised to 1024.

Any of these parameters could affect the performance of an application, simply by virtue of not having enough message queue resources available when needed. However, the msgssz and msgseg parameters also control the size of an in-memory message buffer that is shared by all SVIPC message queues. It needs to be large enough to handle all the messages that may be pending at any one time but, by the same token, should not be much larger than that, or it takes up far more memory than is necessary. It is not dynamic; it is fixed in size.

HP-UX 11.x also provides POSIX message queues. There are no tunable parameters for them. POSIX message queues have been shown to consistently out-perform SVIPC message queues.
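The shared buffer holds msgssz × msgseg bytes: 8 × 2048 = 16 KB with the classic defaults, and 96 × 8192 = 768 KB with the 11i v2 defaults. Current queue usage can be inspected with the standard System V ipcs command (the exact column layout varies by release):

# ipcs -qa      # list all message queues; compare CBYTES and QBYTES
                # against msgmnb, and QNUM against msgtql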


12–6. SLIDE: Semaphore Kernel Parameters

Semaphore Kernel Parameters

Kernel Parameter   Default   Description
sema               1         Enable or disable Semaphore code (700 only)
semaem             16384     Maximum amount a semaphore can be changed by undo
semmap             formula   Size of free-space map used for allocating new semaphores
semmni             64        Maximum number of sets of semaphores
semmns             128       Maximum number of semaphores, system-wide
semmnu             30        Maximum number of processes that can have undo operations pending on a given semaphore
semume             10        Maximum number of semaphores that a given process can have undo operations pending on
semvmx             32767     Maximum value a semaphore is allowed to reach
semmsl             2048      Maximum number of semaphores in a given set

Student Notes

Semaphores are another form of interprocess communication. Semaphores are used mainly to keep processes properly synchronized, to prevent collisions when accessing shared data structures. Semaphores are typically incremented or decremented by a process to block other processes while it is performing a critical operation or using a shared resource. When finished, it decrements or increments the value, allowing blocked processes to then access the resource. Semaphores can be configured as binary semaphores, with only two values (0 and 1), or they can serve as general semaphores (counters), where one process increments/decrements the semaphore and one or more cooperating processes decrement/increment it. SVIPC semaphores have been around for a long time. They are controlled by several tunable parameters.

sema

sema (Series 700 only) enables or disables IPC semaphores at system boot time. This parameter is obsolete as of 11i v2.

semaem

semaem is the maximum value by which a semaphore can be changed in a semaphore undo operation.

http://education.hp.com

H4262S C.00 12-13  2004 Hewlett-Packard Development Company, L.P.

Module 12 Tunable Kernel Parameters

semmap

semmap is the size of the free-semaphores resource map for allocating requested sets of semaphores. This parameter is obsolete as of 11i v2.

semmni

semmni is the maximum number of sets of IPC semaphores allowed on the system at any given time. In 11i v2, the default was raised to 2048.

semmns

semmns is the total system-wide number of individual IPC semaphores available to system users. In 11i v2, the default was raised to 4096.

semmnu

semmnu is the maximum number of processes that can have undo operations pending on any given IPC semaphore on the system. In 11i v2, the default was raised to 256.

semume

semume is the maximum number of IPC semaphores on which a given process can have undo operations pending. In 11i v2, the default was raised to 100.

semvmx

semvmx, the maximum value any given IPC semaphore is allowed to reach, prevents undetected overflow conditions.

semmsl

Until 11i v2, semmsl was an untunable value in the kernel. It specified the maximum number of semaphores that could be allocated to a specific semaphore set. In 10.X it was set to 500. In 11.00, it was set to 2048. Now it is a dynamic tunable.
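Dynamic tunables such as semmsl can be changed without a reboot. On 11i v2 this is done with the kctune command (kmtune on earlier 11i releases); the value below is purely illustrative:

# kctune semmsl           # display the current value of the tunable
# kctune semmsl=4096      # raise a dynamic tunable immediately, no reboot needed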

Any of these parameters could affect the performance of an application, simply by virtue of not having enough semaphore resources available when needed.

HP-UX 11.x also provides POSIX semaphores. There are no tunable parameters for them. POSIX semaphores have been shown to consistently out-perform SVIPC semaphores.


12–7. SLIDE: Shared Memory Kernel Parameters

Shared Memory Kernel Parameters

Kernel Parameter   Default   Description
shmem              1         Enable or disable Shared Memory (700 only)
shmmax             64 MB     Maximum shared memory segment size
shmmni             200       Maximum number of total shared memory segments
shmseg             120       Maximum number of shared memory segments that a single process may attach

Student Notes

Shared memory is reserved memory space for storing data shared between or among cooperating processes. Sharing a common memory space eliminates the need for copying or moving data to a separate location before it can be used by other processes, reducing processor time and overhead, as well as memory consumption. Shared memory is allocated in swappable, shared memory space. Data structures for managing shared memory are located in the kernel. Shared memory segments are much preferred by memory-intensive applications, such as databases, since they can be very large and can be accessed without using system calls. SVIPC shared memory uses the following tunable parameters.

shmem

shmem, when set to true, enables the shared memory subsystem at boot time. This parameter is obsolete in 11i v2.

shmmax

shmmax specifies the maximum shared memory segment size. This parameter is dynamic as of 11i v1. In 11i v2, the default was raised to 1 GB.

shmmni

shmmni specifies the maximum number of shared memory segments allowed on the system at any one time. This parameter is dynamic in 11i v2, and the default was raised to 400.

http://education.hp.com

H4262S C.00 12-15  2004 Hewlett-Packard Development Company, L.P.

Module 12 Tunable Kernel Parameters

shmseg

shmseg specifies the maximum number of shared memory segments that can be simultaneously attached (shmat()) to a single process. This parameter is dynamic as of 11i v1. In 11i v2, the default was raised to 300.

Any of these parameters could affect the performance of an application, simply by virtue of not having enough shared memory resources available when needed.

HP-UX 11.x also provides POSIX shared memory. There are no tunable parameters for it. POSIX shared memory segments are implemented through the memory-mapped file architecture, so they can be affected by some of the file system tunable parameters described earlier.
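Existing segments can be compared against these limits with ipcs; the flags are standard System V options:

# ipcs -ma      # list all shared memory segments; compare SEGSZ against
                # shmmax, NATTCH against shmseg, and the total number of
                # segments against shmmni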


12–8. SLIDE: Process-Related Kernel Parameters

Process-Related Kernel Parameters

Kernel Parameter              Default   Description
maxdsiz / maxdsiz_64bit       256 MB    Maximum 32- and 64-bit process data segment size
maxssiz / maxssiz_64bit       8 MB      Maximum 32- and 64-bit process stack size
maxtsiz / maxtsiz_64bit       64 MB     Maximum 32- and 64-bit process text segment size
maxressiz / maxressiz_64bit   8 MB      Maximum 32- and 64-bit process RSE stack size (IA-64 only)
maxuprc                       50        Maximum number of concurrent processes per user ID
nproc                         formula   Maximum number of processes system-wide
timeslice                     8         Maximum time a process can have the CPU before yielding to the next highest priority; set in "ticks" (10 ms)

Student Notes

These parameters:

•  Manage the number of processes on the system and processes per user, to keep system resources effectively distributed among users for optimal overall system operation.

•  Manage allocation of CPU time to competing processes at equal and different priority levels.

•  Allocate virtual memory among processes, protecting the system and competing users against unreasonable demands of abusive or run-away processes.

maxdsiz

maxdsiz defines the maximum size of the static data storage segment of an executing 32-bit process. In 11i v2, this default has been raised to 1 GB.

maxdsiz_64bit

maxdsiz_64bit defines the maximum size of the static data storage segment of an executing 64-bit process. In 11i v2, this default has been raised to 4 GB.

maxssiz

maxssiz defines the maximum size of the dynamic storage segment (DSS), also called the stack segment, of an executing 32-bit process.


maxssiz_64bit

maxssiz_64bit defines the maximum size of the dynamic storage segment (DSS), also called the stack segment, of an executing 64-bit process. In 11i v2, this default has been raised to 256 MB.

maxtsiz

maxtsiz defines the maximum size of the shared text segment (program storage space) of an executing 32-bit process; maxtsiz_64bit applies to 64-bit processes on HP-UX 11.

maxressiz

maxressiz defines the maximum size of the register stack engine (RSE), also called the RSE stack segment, of an executing 32-bit process. This parameter is only found on an IA-64 kernel.

maxressiz_64bit

maxressiz_64bit defines the maximum size of the register stack engine (RSE), also called the RSE stack segment, of an executing 64-bit process. This parameter is only found on an IA-64 kernel.

maxuprc

maxuprc establishes the maximum number of simultaneous processes available to each user on the system. The user ID number identifies a user. The superuser is immune to this limit. In 11i v2, this default is now set to 256.

nproc

nproc specifies the maximum total number of processes that can exist simultaneously in the system. This parameter has been made dynamic in 11i v2, and the new default setting is 4200.

timeslice

The timeslice interval is the amount of time one thread is allowed to accumulate before the CPU is given to the next thread at the same priority. The value of timeslice is specified in units of (10 millisecond) clock ticks.


12–9. SLIDE: Memory-Related Kernel Parameters

Memory-Related Kernel Parameters

Kernel Parameter     Default      Description
vps_ceiling          16           Maximum automatic page size (kbytes) the kernel selects
vps_chatr_ceiling    1048576      Maximum page size (kbytes) usable with chatr
vps_pagesize         4            Default page size used without a chatr specification
swapmem_on           1            Enable or disable pseudo swap
nswapdev             10           Maximum number of device swap areas
nswapfs              10           Maximum number of file system swap areas
swchunk              2048         Size in DEV_BSIZE (1-KB) units of swap space chunks
maxswapchunks        256          Maximum number of swchunk units
page_text_to_local   0            Enable or disable process text being swapped locally

Student Notes

Configurable kernel parameters for memory paging enforce operating rules and limits related to virtual memory (swap space).

vps_ceiling

This parameter is provided as a means to minimize lost cycle time caused by TLB (translation look-aside buffer) misses on systems using newer processors, such as the PA-8000 and the Itanium family, that have smaller TLBs and may not have a hardware TLB walker. If a user application does not use the chatr command to specify a page size for program text and data segments, the kernel selects a page size that, based on system configuration and object size, appears to be suitable. This is called transparent selection.

vps_chatr_ceiling


User applications can use the chatr command to specify a page size for program text and data segments, providing some flexibility for improving overall performance, depending on system configuration and object size. The specified size is then


compared to the page-size limit defined by vps_chatr_ceiling, which is set in the kernel at system boot time. If the value specified is larger than vps_chatr_ceiling, vps_chatr_ceiling is used.

vps_pagesize

vps_pagesize specifies the default user-page size (in kbytes) that is used by the kernel if the user application does not use the chatr command to specify a page size.

swapmem_on

swapmem_on enables or disables the creation of pseudo-swap, which is swap space designed to increase the apparent total swap space, so that real swap can be used completely and large-memory systems do not need corresponding swap space.
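As an illustration, an application binary can request larger text and data page sizes with chatr; the kernel then honors the request only up to vps_chatr_ceiling. The binary name and the 64K value here are examples only:

# chatr +pi 64K +pd 64K ./myapp    # request 64-KB instruction and data pages
                                   # (capped by vps_chatr_ceiling)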

nswapdev

nswapdev specifies an integer value equal to the number of physical disk devices that can be configured for device swap up to the maximum limit of 25.

nswapfs

nswapfs specifies an integer value equal to the number of file systems that can be made available for file-system swap, up to the maximum limit of 25.

swchunk

swchunk defines the chunk size for swap. This value must be an integer power of two. When the system needs swap space, one swap chunk is obtained from a device or file system. When that chunk has been used and another is needed, a new chunk is obtained. If the swap space is full or if there is another swap space at the same priority, the new chunk is taken from a different device or file system, thus distributing swap use over several devices.

maxswapchunks

maxswapchunks specifies the maximum amount of configurable swap space on the system. In 11i v2 this parameter is obsolete.
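On releases where maxswapchunks still exists, swchunk and maxswapchunks together bound the total configurable swap space. A worked example with the classic defaults, assuming a DEV_BSIZE of 1 KB:

2048 (swchunk) × 256 (maxswapchunks) × 1 KB = 512 MB of configurable swap

Current swap configuration and usage can be checked with:

# swapinfo -tm     # show device, file system, and pseudo-swap usage in MB, with totals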

page_text_to_local

page_text_to_local allows NFS clients to write the text segment to local swap and retrieve it later. This eliminates two separate text-segment data transfers to and from the NFS server, thus improving NFS client program performance. This parameter does not seem to be defined in 11i v2, even though it has not been identified as an obsolete parameter.


12–10. SLIDE: LVM-Related Kernel Parameters

LVM-Related Kernel Parameters

Kernel Parameter   Default   Description
maxvgs             10        Maximum number of volume groups on the system
no_lvm_disks       0         Enable or disable system use of LVM (0 = false, LVM disks exist; 1 = true, no LVM disks exist)

Student Notes

Two configurable kernel parameters are provided that relate to kernel interaction with the logical volume manager.

maxvgs

maxvgs defines the maximum number of volume groups configured by the logical volume manager on the system.

no_lvm_disks

The no_lvm_disks flag notifies the kernel that no LVM disks exist on the system, i.e., LVM is disabled. This parameter does not seem to be defined in 11i v2, although it is not identified as an obsolete parameter.


12–11. SLIDE: Networking-Related Kernel Parameters

Networking-Related Kernel Parameters

Kernel Parameter   Default      Description
netisr_priority    1            Priority to assign to the network packet processing daemon (-1 means handle on an interrupt basis; best packet processing performance)
netmemmax          10% of mem   Amount of memory, in bytes, to be allocated for the IP packet fragmentation reassembly queue

Student Notes

Two configurable kernel parameters are related to the kernel's interaction with the networking subsystems:

netisr_priority

netisr_priority sets the real-time interrupt priority for the networking interrupt service routine daemon. By default, it is set to 1 on Uniprocessor systems and 100 on Multiprocessor systems. This parameter is obsolete in 11i v2.

netmemmax

netmemmax specifies how much memory is reserved for use by networking for holding partial Internet protocol (IP) messages which are typically held in memory for up to 30 seconds. When messages are transmitted using Internet protocol, they are sometimes broken into multiple, "partial" messages (fragments). netmemmax simply establishes a maximum amount of memory that can be used for storing network-message fragments until they are reassembled. This parameter does not seem to be defined in 11i v2, although it is not identified as an obsolete parameter.


12–12. SLIDE: Miscellaneous Kernel Parameters

Miscellaneous Kernel Parameters

Kernel Parameter   Default   Description
create_fastlinks   0         Enable or disable creation of fast symbolic links
default_disk_ir    1         Enable or disable immediate reporting on all disks
maxusers           32        Maximum number of simultaneous users expected
ncallout           formula   Maximum number of timeouts (for example, alarms) pending
npty               60        Maximum number of concurrent pseudo-tty connections
rtsched_numpri     32        Number of distinct POSIX real-time priorities
unlockable_mem     0         Minimum amount of memory reserved for use by the paging system

Student Notes

The following parameters are more or less unrelated.

create_fastlinks

When create_fastlinks is non-zero, it causes the system to create HFS symbolic links in a manner that reduces the number of disk-block accesses by one for each symbolic link in a pathname lookup.

default_disk_ir

default_disk_ir enables or disables immediate reporting. With Immediate Reporting ON, disk drives that have data caches return from a write() system call when the data is cached, rather than returning after the data is written on the media. This sometimes enhances write performance, especially for sequential transfers. In 11i v2, this parameter is set to 0, by default.
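Immediate reporting can also be queried and set per device with scsictl; the device file below is hypothetical:

# scsictl -m ir /dev/rdsk/c1t2d0      # query the immediate-reporting mode of one disk
# scsictl -m ir=1 /dev/rdsk/c1t2d0    # enable immediate reporting for that disk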

maxusers

maxusers does not itself determine the size of any structures in the system; instead, the default value of other global system parameters depends on the value of maxusers. When other configurable parameter values are defined in terms of maxusers, the kernel is made smaller and more efficient by minimizing wasted space due to improperly balanced resource allocations. In 11i v2, the use of maxusers has been eliminated from the formula of every parameter that was dependent on it. Changing its value has no effect on 11i v2.

ncallout

ncallout specifies the maximum number of timeouts that can be scheduled by the kernel at any given time. A general rule is that one callout per process should be allowed unless you have processes that use multiple callouts. In 11i v2 this parameter is obsolete.

npty

npty specifies the maximum number of pseudo-tty data structures available on the system.

rtsched_numpri

rtsched_numpri specifies the number of distinct priorities that can be set for POSIX real-time processes running under the realtime scheduler.

unlockable_mem

unlockable_mem defines the minimum amount of memory that always remains available for virtual memory management and system overhead.


Module 13 — Putting It All Together

Objectives

Upon completion of this module, you will be able to do the following:

•  Identify and characterize some network performance problems.

•  List some useful tools for measuring network performance problems and state how they might be applied.

•  Identify bottlenecks on other common system devices not associated directly with the CPU, disk, or memory.


13–1. SLIDE: Review of Bottleneck Characteristics

Review of Bottleneck Characteristics

CPU:      • High CPU utilization

Disk:     • High CPU utilization
          • High disk utilization

Memory:   • High CPU utilization
          • High disk utilization
          • High memory utilization (with swapping)

Student Notes

The above slide recaps the characteristics related to the three main performance bottlenecks.

CPU Bottlenecks

CPU bottlenecks often exhibit the following characteristics:

•  High CPU usage due to lots of processes competing for the CPU

•  Large number of processes in the CPU run queue

•  No disk bottleneck problems; disk utilization is low, few to no I/O requests in the disk queues

•  No memory bottleneck problems; vhand not needing much, no paging to swap devices

Disk Bottlenecks

Disk bottlenecks often exhibit the following characteristics:

•  High CPU usage due to the disk device drivers constantly executing to perform the I/O, and user/system processes continually running to submit the I/O requests

•  High disk utilization due to lots of I/O requests being continually submitted

•  No memory bottleneck problems; vhand not needing much, no paging to swap devices

Memory Bottlenecks

Memory bottlenecks often exhibit the following characteristics:

•  High CPU usage (system) due to vhand constantly running to free memory pages, the kernel spending lots of time in the memory management subsystem, and the device drivers for the disk writing memory pages to and from swap

•  High disk utilization due to memory pages being constantly written to and from the swap devices

•  High memory utilization (with swapping) due to free memory falling below LOTSFREE, DESFREE, and MINFREE

Given the above recap, in what order should the three main bottlenecks be checked? When arriving on the scene of an unknown system, where do you start? It would be wise to look first for the bottleneck with the most specific symptoms. Since the memory bottleneck is the only one to show signs of memory pressure, look for it first. Once you have eliminated that, look for disk bottlenecks. Finally, look for CPU bottlenecks.


13–2. SLIDE: Performance Monitoring Flowchart

Performance Monitoring Flowchart

1. Start glance. Look at the memory utilization bar graph. Is memory utilization > 95%?
   If yes, is there activity on the swap device? If yes: potential memory bottleneck.

2. Otherwise, look at the disk utilization bar graph. Is disk utilization > 50%?
   If yes, are there disk I/O requests in the queue? If yes: potential disk bottleneck.

3. Otherwise, look at the CPU utilization bar graph. Is CPU utilization > 90%?
   If yes, are there requests in the CPU run queue? If yes: potential CPU bottleneck.

4. Otherwise, look for other kinds of bottlenecks, e.g. network.

Student Notes

The above performance monitoring flowchart assumes glance is being used as the performance-monitoring tool. If glance is not available, the same information can be obtained from a variety of other tools, such as sar and vmstat.

The flowchart starts by first looking for symptoms of a memory bottleneck:

•  Is memory utilization high?
•  Is there activity to the swap device?

Memory bottlenecks are checked for first, since memory bottlenecks often exhibit symptoms of high disk and CPU utilization, which could initially be mistaken for disk or CPU bottlenecks.

If the system is not bottlenecked on memory, the second bottleneck checked for through the flowchart is a disk bottleneck:

•  Is disk utilization high?
•  Are there disk I/O requests in the disk queue?


Disk bottlenecks are checked for second, as disk bottlenecks often exhibit symptoms of high CPU utilization, but not high memory utilization.

If the system is not bottlenecked on disk, the final bottleneck to check for is a CPU bottleneck:

•  Is CPU utilization high?
•  Are there processes in the CPU run queue?

CPU bottlenecks are checked for after memory and disk bottlenecks, as CPU bottlenecks do not exhibit high memory or disk utilization. If none of these situations appear to exist, then it is time to check the less common bottlenecks. Networks would be a good possibility, but don't neglect other hardware or even software resources, such as file locks and semaphores.


13–3. SLIDE: Review — Memory Bottlenecks

Review — Memory Bottlenecks

Look at the memory utilization bar graph. Is memory utilization > 95%?

If yes, is there activity on the swap device?
   (m) Mem Report   – Look at VM writes.
   (d) Disk Report  – Look at Virt Memory.
   (v) I/O by LV    – Look at swap devices.
   (w) Swap Space   – Look at Used (ignore pseudo).

If yes: potential memory bottleneck.

Student Notes

The primary symptoms of a memory bottleneck include high memory utilization and activity to the swap device. The glance reports that show activity on the swap device include:

(m) Memory Report      - shows current number of VM reads/writes
(d) Disk Report        - shows VirtMem I/O
(v) I/O by log. volume - shows I/O to the swap logical volumes
(w) Swap Space Report  - shows currently used swap space

Also look at vhand and swapper as processes. Are they accumulating any CPU time? Look at the output of vmstat -S. Are pages being paged out? Are processes being swapped out?
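A quick command-line cross-check of these symptoms (the intervals and counts are arbitrary examples):

# vmstat -S 5 5     # watch si/so for swap-ins and swap-outs
# vmstat 5 5        # watch po (page-outs) and the free memory column
# swapinfo -tm      # show swap usage in MB, with totals
# ps -ef | grep -e vhand -e swapper    # is either daemon accumulating CPU time?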


13–4. SLIDE: Correcting Memory Bottlenecks

Correcting Memory Bottlenecks

•  Reduce maximum size of dynamic buffer cache.
•  Identify programs with large resident set size (RSS).
•  Use the serialize command to reduce thrashing.
•  Use PRM or WLM to prioritize memory allocations.
•  Add more physical memory.

Student Notes

The above slide reviews some of the ways to correct a memory bottleneck:

•  Limit the maximum size of the dynamic buffer cache. This can help to prevent unnecessary paging during periods when the dynamic buffer cache needs to shrink.

•  Identify programs (and users) taking up large amounts of memory, and investigate whether the memory usage is warranted or whether the process has memory leaks.

•  Consider using the serialize command to keep several memory-intensive programs from competing with each other.

•  Consider using the Process Resource Manager (PRM) or Work Load Manager (WLM) to favor memory allocation to important processes.

•  Adding more physical memory will always help a memory-constrained system.
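Two of these remedies map directly to commands. The buffer cache ceiling is the dbc_max_pct tunable, and serialize timeshares large jobs against each other; the values and program names below are illustrative only:

# kmtune -q dbc_max_pct       # display the buffer cache ceiling (percent of memory)
# kmtune -s dbc_max_pct=20    # stage a lower ceiling for the next kernel build/boot
# serialize ./bigjob1 &       # run memory-intensive jobs serialized
# serialize ./bigjob2 &       #   so they do not thrash against each other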


13–5. SLIDE: Review — Disk Bottlenecks

Review — Disk Bottlenecks

Look at the disk utilization bar graph. Is disk utilization > 50%?

If yes, are there disk I/O requests in the queue?
   (u) I/O by Disk    – Look at File System activity.
   (B) Global Waits   – Look at Blocked on Disk I/O.
   (d) Disk Report    – Look at Logical I/O to Physical I/O ratio.

If yes: potential disk bottleneck.

Student Notes

The primary symptoms of a disk bottleneck include high disk utilization and multiple I/O requests in the disk queue. The glance reports that show disk I/O related activity include:

(u) I/O by Phys. Disk - shows current number of reads/writes
(B) Global Waits      - shows percentage of processes blocked on Disk I/O
(d) Disk Report       - shows Logical I/O and Physical I/O activity

Also check the output of sar -u (%wio), sar -d, and sar -b (for read cache hit rate and write cache hit rate).
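For example (5-second samples, 5 iterations):

# sar -u 5 5    # a high %wio with low %idle means the CPU is waiting on disk
# sar -d 5 5    # per-device %busy and avque; watch for %busy > 50 with avque > 3
# sar -b 5 5    # buffer cache hit ratios (%rcache and %wcache)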


13–6. SLIDE: Correcting Disk Bottlenecks

Correcting Disk Bottlenecks

•  Load balance across disk drives and disk controllers.
•  Consider asynchronous instead of synchronous I/O.
•  Tune file system block and fragment/extent size.
•  Tune file system (vxfs and hfs) mount options.
•  Tune vxfs file systems with vxtunefs.
•  Tune buffer cache for better hit ratios.
•  Add additional and faster disk drives and controllers.

Student Notes

The above slide reviews some of the ways to correct a disk bottleneck:

•  Spread the I/O activity, as evenly as possible, over the disk drives and disk controllers.

•  Consider using asynchronous I/O so applications do not have to wait for a physical I/O to complete. The trade-off here is a greater exposure to data loss in the event of a system failure.

•  For HFS file systems, increase the fragment and file system block size if large files are being accessed in a sequential manner. For VxFS file systems, increase the block size to improve read-ahead and write-behind. Consider using a fixed extent size.

•  Look at customizing file system mount options (especially for VxFS file systems). Recall that, by default, VxFS is mounted to favor integrity, and HFS is mounted to favor performance.

•  Consider using vxtunefs to tune the performance of VxFS. Match preferred I/O size and read-ahead to the physical stripe depth (see the example after this list).


•  Verify (and tune) the hit ratio on the file system buffer cache. The ratio of logical reads to physical reads should be a minimum of 10 to 1. The ratio of logical writes to physical writes should be a minimum of 3 to 1.

•  Add bigger, better, faster disks and disk controllers.
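A hedged vxtunefs sketch (the mount point and values are examples; match read_pref_io to the stripe unit size and read_nahead to the number of stripe columns):

# vxtunefs /mnt/data                         # print the current VxFS tunables
# vxtunefs -o read_pref_io=65536 /mnt/data   # set the preferred read size (bytes)
# vxtunefs -o read_nahead=4 /mnt/data        # set the read-ahead multiplier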


13–7. SLIDE: Review — CPU Bottlenecks

Review — CPU Bottlenecks

Look at the CPU utilization bar graph. Is CPU utilization > 90%?

If yes, are there processes in the CPU run queue?
   (a) CPU by Proc    – Look at Load Average.
   (g) Global Report  – Look at Processes Blocked on priority.

If yes: potential CPU bottleneck.

Student Notes

The primary symptoms of a CPU bottleneck include high CPU utilization and multiple processes in the CPU run queue. The glance reports that show CPU activity include:

(a) CPU by Processor - shows CPU load average over the last 1, 5, and 15 minutes
(c) CPU Report       - shows CPU activities
(g) Process Report   - shows CPU hogs in order (see note)

Note: Make sure you are looking at processes in CPU order. Use the Thresholds Page (o) of glance and set "CPU" as the sort criteria.

Also check sar -u and sar -q. Use the -M option if you have a multiprocessor.
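For example:

# sar -q 5 5     # runq-sz > 3 with %runocc near 100 indicates a run-queue backlog
# sar -uM 5 5    # per-processor utilization on a multiprocessor system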


13–8. SLIDE: Correcting CPU Bottlenecks

Correcting CPU Bottlenecks

•  Use nice to reduce priority of less important processes.
•  Use nice to improve priority of more important processes.
•  Use rtprio or rtsched on most important processes.
•  Run batch jobs during non-peak hours.
•  Add another (or faster) processor.

Student Notes

The above slide reviews some of the ways to correct a CPU bottleneck:

•  Use the nice or renice commands on lower priority processes (set the nice value to 21-39). As a rule of thumb, favor I/O-bound programs over CPU-bound programs. I/O-bound programs will block frequently, allowing the CPU-bound programs to run.

•  Use the nice or renice command on higher priority processes (set the nice value to 0-19).

•  Use the rtprio or rtsched commands on the highest priority processes. BE CAREFUL! A poorly written process could take over your system and render it useless.

•  Schedule large batch jobs, long compiles, and other CPU-intensive activity for non-peak hours.

•  Add an additional CPU or a faster CPU to the system.
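Hedged examples of these commands (PIDs, priorities, and program names are illustrative; check the man pages, as option syntax varies slightly between releases):

# nice -n 19 ./batchjob &     # start a batch job at the lowest timeshare priority
# renice -n 10 -p 4321        # lower the priority of a running process
# rtprio 64 ./critical_app    # run a process at real-time priority 64 (use with care)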


13–9. SLIDE: Final Review — Major Symptoms

Final Review — Major Symptoms

Memory Bottleneck:    Both vhand and swapper active

Disk Bottleneck:      Disk utilization > 50%
                      Request queues > 3

CPU Bottleneck:       CPU utilization > 90%
                      Run queues > 3 per processor

Network Bottleneck:   Collisions/out-bound packets > 5%

All conditions sustained over time!

Student Notes

Let's summarize the major bottlenecks and their symptoms:

Memory Bottleneck: You know that you have a memory bottleneck if both vhand and swapper are active. This indicates severe memory pressure!

Disk Bottleneck: A disk bottleneck will be characterized by disk utilization of at least 50% and at least 3 requests waiting in the request queue. If a controller is the bottleneck, you will see multiple disks with lengthy queues on that controller. Their utilization may not be 50%! The queues are more important than the utilization.

CPU Bottleneck: If all of your CPUs are at least 90% busy and they each have run queues with 3 or more processes in them, you have a CPU bottleneck. If one or more of the processors has empty (or mostly empty) queues, either you are at the limit of your CPU resource, or something is unbalancing the loads on your processors.

Network Bottleneck: If the ratio between your collisions/sec and your packets-out/sec is greater than 5%, you have a network bottleneck.
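The collision ratio can be computed from the netstat interface counters:

# netstat -i     # per-interface Ipkts, Opkts, and Coll counters; the ratio
                 # Coll / Opkts sustained above 5% suggests a saturated segment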


As with any bottleneck symptom, it must be a constant condition, sustained over time, to be considered a true bottleneck. Otherwise, it is a momentary spike, which we will keep an eye on but otherwise ignore.


Appendix A — Applying GlancePlus Data

This module is an optional self-study for students.

Objectives

Upon completion of this module, you will be able to do the following:

•  Use case studies to demonstrate how GlancePlus screens can be used to analyze system performance.

•  Observe how a performance specialist approaches a tuning task.


A–1. TEXT PAGE: Case Studies — Using GlancePlus

The case studies stylized in this module come from the logbooks of HP-UX Performance Specialists and are presented for your consideration. The goal is to help you prepare for your own tasks and adventures. The examples show you possibilities and are not intended to be exact recommendations or solutions to situations that you may encounter. These examples may cause you to think up new questions, in addition to answering some of the classic tuning scenarios. As in most endeavors, there is often much to be gained from reviewing someone else's actions and trying to reverse-engineer their solutions.

An Approach to Monitoring System Behavior

The best approach to monitoring your system's performance is to become familiar with how your system usually behaves. This helps you recognize whether a sudden shift in activity is normal or a sign of a potential problem.

The first screen that appears when you start GlancePlus in character mode summarizes system-wide activity and lists all processes that exceed the usage thresholds set for the system. The information on this screen tells you if a resource is being used excessively or a process is monopolizing available resources. The Global screen is the usual starting point for any review of system activity and performance. You can use the statistics on the Global screen to monitor system-wide activity, or you may need to refer to the detailed data screens to focus on specific areas of system usage. The examples in this chapter highlight the use of all GlancePlus screens.

GlancePlus provides you with valuable information, but optimal use of this information depends on how well you understand your system's operation and what is the normal or usual behavior for that system. As you use GlancePlus to review your system's performance, you will learn to recognize patterns that differ from this norm—patterns that may indicate a problem.

Bottlenecks

A bottleneck is the most common type of problem on any system. It occurs whenever a hardware or software resource cannot meet the demands placed on it, and processes must wait until the resource becomes available. This results in blocks and long queues.

Your system handles processes much like a freeway system handles traffic. During normal hours, the freeway adequately carries the traffic load, and cars can travel at optimum speed. But, during rush hour, when too many cars try to access the freeway, the lanes become clogged and traffic can slow to a halt. The freeway becomes bottlenecked. Similarly, a bottleneck can occur on your system if the processes you are running need more CPU time than is available or more memory than is configured for the system. A bottleneck also can occur if there isn't enough disk I/O bandwidth to move data, or if swap space isn't configured optimally.

A bottleneck can be a temporary problem that is easily fixed. The solution may be to rearrange workloads, such as rescheduling batch programs to run late at night. Solving a disk bottleneck may require only spreading disk loads among all the available disks.


A recurring bottleneck, however, can indicate a long-term situation that is worsening. Perhaps the system was configured to serve fewer users than are now using it, or workloads have gradually increased beyond the system's capacity. The only solution may be a hardware upgrade, but how do you know? If you can identify a bottleneck correctly, you can avoid randomly tuning the system (which can worsen the problem), and you can avoid adding extra hardware that doesn't help performance. You can also avoid expending resources solving a corollary bottleneck—one that is caused by the primary bottleneck.

Characteristics of Bottlenecks

Common system bottlenecks have several general characteristics or symptoms. By comparing these symptoms with the statistics on your GlancePlus screens, you can analyze the performance of your system and detect potential or existing bottlenecks. Although a single symptom may not indicate a problem, a combination of symptoms generally reflects a bottleneck situation.

Symptoms of a CPU Bottleneck

•  Long run queue without available idle time
•  High activity in user mode
•  Reasonable activity in system mode (high activity may indicate other bottlenecks as well)
•  Many processes frequently blocked on priority

Symptoms of a Memory Bottleneck

•  High swapping activity
•  High paging activity
•  Very little free memory available
•  High disk activity on swap devices
•  High CPU usage in system mode

Symptoms of a Disk Bottleneck

•  High disk activity
•  CPU is idle, waiting for I/O requests to complete
•  High rate of physical reads/writes
•  Long disk queues

Symptoms of Other I/O Bottlenecks

•  High LAN activity
•  Low I/O throughputs

You may discover that solving one bottleneck uncovers or creates another. It is possible to have more than one bottleneck on a system. In fact, changing workloads are constantly reflected in changing system performance. The goal is not to seek a final solution, but to seek optimal performance at any given time.


Evaluating System Activity

One afternoon Doug noticed that system response had slowed. He ran GlancePlus and looked at the Global screen to view system-wide activity. He saw that the CPU usage was near 100%. Although this is not necessarily a problem, he decided to check it out.

Doug then looked at the process summary section of the Global screen, which lists all processes that exceed the usage thresholds set for the system. He noted that a single process accounted for a majority of the near 100% CPU usage. Wanting more information on that particular process, he checked it on the Individual Process screen, which provides detailed information about a specific process. Reviewing that screen, Doug noticed the process was doing no I/O and was spending all its time in user code. This suggested that the process might be trapped in a CPU loop. After identifying the user's name he telephoned the user to find out if the process could be killed.

In this situation, the CPU use for the system did not drop after the user terminated the looping process, because other processes took up the slack. However, response time improved because other processes did not need to wait as often to be given their share of CPU time.

Evaluating CPU Usage

Dean was checking the system one afternoon when he noticed a sudden slowdown in system response time. He ran GlancePlus and looked at the Global screen to view system-wide activity. He saw that the CPU usage was near 100%. The other system resources, such as Memory, Disk, and Swap, showed much less use. Further checking revealed that several processes were blocked due to another process using the CPU (PRI), which meant they were waiting for higher priority processes to finish executing.

Dean accessed the CPU Detail screen to see how CPU time was allocated to different states (activities). He discovered that real-time activities were using a much larger percentage of the CPU than other activities.

Dean returned to the Global screen to check priorities. One user was running with a priority of 127, an RTPRIO real-time processing priority. Dean knew that this particular program is CPU-intensive and that running it at such a high priority would keep other processes from executing. Already it was causing system performance to degrade. He reset the priority for that process to a lower timeshare priority by using the GlancePlus renice command. This allowed other processes more consistent access to the CPU, and system response time improved.

Evaluating Wait States

Jose's system was running fine until he installed a new application. Now, every time the application runs, response time degrades. Since the application is the only change to the system, Jose starts by checking how it is using the system. Looking at the Glance Individual Process screen, he sees that CPU utilization is about 7 percent, so that isn't the problem. Next, he checks overall CPU utilization on the system; it's averaging about 48 percent, which means there is sufficient CPU resource to accommodate the new application. Jose checks disk I/Os and notices the application is processing about 5 I/Os per second, most of which is virtual memory I/O. That looks slow to Jose, so he looks at the Wait States screen to find out what the process is waiting on.

Jose learns that the process is spending about 7 percent of its time utilizing the CPU (executing), 27 percent of its time waiting for terminal input, and 66 percent of its time waiting on virtual memory. That's a significant amount of time. Jose checks other processes on the system and discovers that they are experiencing similar waits for virtual memory. He realizes that the new application overloads the system's memory. He makes copies of the relevant screens so he can explain the situation to his manager.

Evaluating Disk Usage

Vivian's company often runs processes that tax available memory. She keeps track of the situation by checking the Disk Detail screen, which displays both logical and physical I/O requests for all mounted disk devices. It also categorizes the physical requests as User, Virtual Memory, System, and Raw requests. This screen shows her when large numbers of physical read and write requests are occurring, a situation that results from excessive page faults by processes.

Vivian also checks the virtual memory request rate, since that also will be high when system demand is taxing its physical memory capacity. By paying attention to which processes are active when the virtual memory activity is high, Vivian can make intelligent decisions about redistributing activities to balance the system load. This helps increase overall throughput for the system.

Evaluating Memory Usage

Terri's system was experiencing a slowdown in system response time. She checked the Global screen to get an overall picture. All four system resources (CPU, Disk, Memory, and Swap) were near 100%. A large portion of the disk bar activity showed virtual memory activity. In addition, the swapper system daemon appeared to be running continuously. Terri realized that this indicated a possible memory bottleneck.

She checked the Memory Detail screen, which provides statistics on memory management events such as page faults, number of pages paged-in and paged-out, and the number of accesses to virtual memory. The screen indicated that Free Memory was 0.0 MB, indicating a lack of usable memory, and the Swap In/Outs showed a rate above 1 per second. Concluding that the problem was a memory bottleneck, Terri returned to the Global screen to study the active processes.

She knew that a memory bottleneck can be relieved by adding more memory or by reducing the memory demands of active processes. In this case, she suspected that high swap rates were caused by the large Resident Set Sizes (RSS) for the most active processes. One test program showed a large RSS that appeared to grow at a constant rate. Examining this situation more closely, Terri discovered the program had a "memory leak." It allocated memory using malloc() but did not free up memory using free(). The process's memory allocation increased steadily, causing memory pressure on the system. She talked with the developer, who studied the program code and found the memory leak. The test program was changed and recompiled to use far less memory, thus alleviating the memory bottleneck and improving system response time.


Evaluating I/O by File System

Ingrid noticed that system performance degraded drastically when the system was doing swapping. Looking at the Global screen, she observed that the swapper process was running and that virtual memory use counted for a high percentage of the disk utilization. She checked the Disk I/O by File System screen to verify which disk was busiest. The Disk I/O by File System screen provides details of the I/O rates for each file system or mounted-disk partition. This information is useful for balancing disk loads.

When she looked at the Disk I/O by File System screen, Ingrid saw that one disk was being utilized more than all other disks on the system. The disk most utilized was a swap disk. Ingrid decided to add additional swap disk areas to the system to alleviate the load on that one disk. She also might have considered allocating dynamic swap areas on existing underutilized file systems.

Evaluating Disk Queue Lengths

Ray had already determined that his system had a disk I/O bottleneck. By reading the Global screen, he noticed that the disk utilization was almost always at 100%. He had checked the Disk I/O by File System screen, which showed that several disks were being heavily utilized. What Ray wanted to find was a way to ease the situation.

He studied the Disk Queue Lengths screen, which shows how well disks are able to process I/O requests. He wanted to determine which of the busy disks had the longest delays for service. He knew that "busy" disks did not necessarily have a long queue length. High disk utilization is not a problem unless processes must wait to use the disk. For example, using a high percentage of the lines on a telephone system is not a problem unless calls cannot get through. Ray also knew that long queue lengths meant several disk requests must wait while that drive is servicing other requests. For example, if all phone lines are busy, incoming calls must wait to connect.

Once he had a clear picture of the situation, Ray reduced the large queue lengths by moving several files to different file systems to distribute the workload more evenly.

Evaluating NFS Activity

Paul works on a system that is used as a network file system server. One local disk is NFS-mounted from several different nodes on the LAN. One afternoon, Paul noticed poor response time on the system. The file system mounted by remote systems was very active.

Paul reviewed the NFS Detail screen, which provides current statistics on in-bound and out-bound network file system (NFS) activity for a local system. He wanted to determine which remote system was using the disk the most. He observed a large In-bound Reads rate from one system. This led him to examine that remote system to find out why it was overutilizing the NFS-mounted disk. His examination pinpointed the situation to a single user on the remote system. The user was making repeated, unnecessary greps to files on the NFS-mounted disk. Paul explained the problem to the user and worked with her to lessen the heavy disk use. This reduced the load on the NFS server and improved overall response time.

Evaluating LAN Activity

Lee noticed a slow response time for applications using datacom services to access data across the local area network. He checked the LAN Detail screen to see what was causing the problem. The LAN Detail screen describes four functions for each local area network card configured on the system. On networked systems, this information can show potential bottlenecks caused by heavy LAN activity.

Lee noticed that the Collision and Errors rates were higher than usual. This information led him to investigate whether processes were competing for LAN resources or overloading the LAN software or hardware. In this case, an application that was improperly written using netipc() was causing a bottleneck. Once this program was stopped, other programs using the LAN were able to improve their response time.

Evaluating System Table Utilization

When Debbie was running a program on the system, the program failed, giving this error message:

   sh: fork failed - too many processes

To decide whether or not to reconfigure the value of nproc in her kernel, Debbie needed to find out how much room she had in the Process Table. She referred to the System Table Utilization Detail screen, which provides information on the use and configured sizes of several important internal system tables. The information on this screen provides feedback on how well the kernel configuration is tuned to the average workload on the system. Debbie confirmed that she had indeed run out of room in the Proc Table. She knew that usually the system buffer cache is fully utilized. Other tables can be proactively monitored in order to reconfigure the appropriate kernel variable before a limit is reached.

Evaluating an Individual Process

Cliff noticed that one process seemed to be running quite slowly. He ran GlancePlus and looked at the detail information for that process. He then examined the statistics on the Single Process Detail screen, which provides detailed information about a specific process.

Cliff knew that if the process was running slowly because of a memory shortage, he would see an increase in context switches and fault counts. He noticed that the I/O read and write counts were large and that the process was doing a lot of I/O. He checked what the process had been blocking on and noticed a high percentage for Disk I/O blocks. He suspected that the process was slowed because of competition for disk throughput capacity.


Had the process shown a high percentage of being blocked on priority, it would have meant the process was ready to run but was unable to do so because the CPU was being used by processes with higher dispatching priorities.

Evaluating Open Files

Kathryn is developing an application for communicating with remote systems. When a request is received, the application opens a socket and sends the specified data. However, when Kathryn tests the application, no data is received by the remote system. To find out what happened she checks the Glance Open Files screen. When she looks for the opened socket, she discovers that it never opened. She returns to her application to look for the coding error.

Evaluating Memory Regions

One day while reviewing Glance's Global Summary screen, Nancy notices that several processes have very large resident set sizes. Could this mean a potential problem with the applications' memory usage? She wonders if she should begin planning to increase physical memory size to accommodate additional users in the future. She knows current system performance is fine and memory size seems adequate, but she wants to prevent any future degradation in performance.

Before making any decisions she reviews Glance's Memory Regions screen to analyze the situation more closely. She discovers that all of the affected processes have a shared memory region of about 200 KB. When added to the private DATA and TEXT regions, this accounts for the large resident set sizes. By checking the virtual address of the shmem region, she determines that the same shared memory region is being used by the processes. Because it is a shared region, it is physically in memory only once, but Glance displays it for each process attached to the shared memory region. Nancy smiles when she sees this, because it means that no problem exists. By using the shared memory region the processes are using far less memory than it appears.

Evaluating All CPUs Statistics

Rosalie works on a multiprocessor system. While checking the All CPUs screen one day she noticed that one CPU seemed to be consistently busier than the others. Realizing that overall system throughput would be improved if the load were balanced among the processors, she decided to investigate the situation.

As she studied the All CPUs screen she noticed that one PID always seemed to be the last PID executing on CPU 1, her busiest CPU. When Rosalie checked mpctl(), she saw that the process had been assigned to CPU 1. Using mpctl -f, she reassigned the process to be a floater, so that the system could determine which processor should run the process. Rosalie then checked for other processes that had been assigned to CPU 1 and reassigned them as floating processes. After doing so, she rechecked the All CPUs screen and observed that the load appeared more even among all the processors, thus alleviating a potential bottleneck on any single CPU.

Evaluating Activity on Logical Volumes

Lately when Yuki uses GlancePlus to check his system, he notices that the Global Disk Utilization bar — displayed in the top portion of every Glance screen — is often close to 100%. Yuki's system has multiple disk drives, but he knows that the global disk utilization figure indicates the activity on the busiest disk.

Yuki would like to spread the disk I/O more evenly among the drives to avoid potential I/O bottlenecks. With that goal in mind, he first checks the Disk Detail screen. It shows that logical disk activity is high. For details, he goes to the Logical Volumes screen, where he notices a high write activity on logical volume /dev/vg00/lvol12. Getting out of Glance and into the UNIX shell, he types vgdisplay -v /dev/vg00 to ascertain the physical disk names associated with the volume.

Back in Glance, Yuki views the Disk Queue Lengths screen to determine the busiest disks in the volume. Then he checks the Disk Detail screen to find out whether disk activity was caused by system or user activity. Yuki notices that the Virtual Memory physical accesses are low, indicating application rather than system activity. He checks the Open Files screen to find which application was creating so many writes to the disk. Voila! Fred is running his baseball pool again! Yuki pays a visit to Fred. After discussing Fred's I/O needs, Yuki returns to his console to balance the I/O load, using LVM commands to rearrange the logical volumes.

Now it's time to grab your toolbox, pop the hood, and take a look.

Good Luck!

Solutions

1–11. LAB: Establishing a Baseline

Directions

The following lab exercise establishes baselines for three CPU-bound applications and one disk-bound application. The objective is to time how long these applications take when there is no activity on the system. The same applications will be executed later in the course while other bottleneck activity is present, and the impact of those bottlenecks on user response time will be measured through these applications.

1. Change directory to /home/h4262/baseline.

   # cd /home/h4262/baseline

2. Compile the three C programs long, med, and short by running the BUILD script.

   # ./BUILD

3. Time the execution of the long program. Make sure there is no activity on the system.

   # timex ./long

   Record execution time:

   real: ________   user: ________   sys: ________

Answer: Varies with system configuration; on the order of tens of seconds to minutes. Example output follows from an rp2430 server:

# timex ./long
The last prime number is : 49999

real       3:37.89
user       3:35.68
sys           0.12

Example output follows from an rx2600 server:

# timex ./long
The last prime number is : 99991

real       2:53.24
user       2:51.74
sys           0.06
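For reference, long, med, and short are pure CPU loads. The following is a minimal sketch of the kind of prime search they perform; the actual sources live under /home/h4262/baseline and may differ in detail, and the bound here is a placeholder:

    /* Trial-division prime search: pure user-mode CPU work, no system calls
     * until the final printf, which is why real ~= user + sys for these. */
    #include <stdio.h>

    int main(void)
    {
        long n, d, last = 2;
        for (n = 3; n < 50000; n += 2) {   /* bound is a placeholder */
            for (d = 3; d * d <= n; d += 2)
                if (n % d == 0)
                    break;
            if (d * d > n)
                last = n;
        }
        printf("The last prime number is : %ld\n", last);
        return 0;
    }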


4. Time the execution of the med program. Make sure there is no activity on the system.

   # timex ./med

   Record execution time:

   real: ________   user: ________   sys: ________

Answer: Varies with system configuration; should be about one half of long. Example output follows from an rp2430 server:

# timex ./med
The last prime number is : 49999

real       1:52.68
user       1:51.55
sys           0.08

Example output follows from an rx2600 server:

# timex ./med
The last prime number is : 99991

real       1:33.71
user       1:33.02
sys           0.04

5. Time the execution of the short program. Make sure there is no activity on the system.

   # timex ./short

   Record execution time:

   real: ________   user: ________   sys: ________

Answer: Varies with system configuration; should be about one eighth to one tenth of med. Example output follows from an rp2430 server:

# timex ./short
The last prime number is : 49999

real         10.88
user         10.70
sys           0.05

Example output follows from an rx2600 server:


# timex ./short
The last prime number is : 99991

real          8.56
user          8.49
sys           0.03

6. Time the execution of the diskread program.

   # timex ./diskread

   Record execution time:

   real: ________   user: ________   sys: ________

Answer: Varies with system configuration; on the order of tens of seconds. Example output follows from an rp2430 server:

# timex ./diskread
DiskRead: System        : [HP-UX]
DiskRead: RawDisk       : [/dev/rdsk/c1t15d0]
DiskRead: Start reading : 1024MB
1024+0 records in
1024+0 records out

real         28.01
user          0.02
sys           0.53

Example output follows from an rx2600 server:

# timex ./diskread
DiskRead: System        : [HP-UX]
DiskRead: RawDisk       : [/dev/rdsk/c2t1d0s2]
DiskRead: Start reading : 2048MB
2048+0 records in
2048+0 records out

real         28.69
user          0.01
sys           0.13

7. In the case of the long, med, and short programs, the real time is (approximately) the sum of the usr and sys times. This is not the case with diskread. Explain why.

Answer: We first assume that there is no other load on the system. In the case of a classic number-crunching CPU hog (and long, med, and short are all classic number crunchers), there will be no system calls (except for the final terminal output), and the program only needs CPU time in usr mode.


As there is only one process, there is no waiting. This is shown by the real time being very close to the sum of the sys and usr times for the process. long, med, and short only do calculations and make no call on kernel resources during their execution, so the usr time is very high compared to the sys time.

This is not the case for diskread. The program makes very little demand on the CPU, shown by the sum of usr and sys being quite small compared to the real or "wall clock" time. The huge difference between real time and usr+sys time shows that the program is waiting on disk I/O most of the time. Also note that sys is much higher than usr, meaning that the program is bound on system calls (disk I/O) rather than computation when it does execute.
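A rough sketch of a diskread-style program follows; the actual lab source is not shown in this guide, and the path and byte counts are placeholders. The point is that the process spends most of its life asleep in read(), waiting for the disk, so real time far exceeds usr+sys:

    /* Sequentially read a large file or raw device.  Raw devices generally
     * want reads in multiples of the sector size, which 64 KB satisfies. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static char buf[64 * 1024];

    int main(int argc, char *argv[])
    {
        long i;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file or raw device>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        for (i = 0; i < 16384; i++)        /* 16384 x 64 KB = 1 GB */
            if (read(fd, buf, sizeof(buf)) <= 0)
                break;
        close(fd);
        return 0;
    }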


1–12. LAB: Verifying the Performance Queuing Theory

Directions

The performance queuing theory states that as the number of jobs in a queue increases, so will the response time of the jobs waiting to use that resource. (This lab uses the short program compiled from /home/h4262/baseline/prime_short.c.) Example figures below are from the test servers.

1. In terminal window 1, monitor the CPU queue with the sar command.

   # sar -q 5 200

2. In a second terminal window, time how long it takes for the short program to execute.

   # timex ./short &

Answer:

rp2430:
# timex ./short &
[1] 10050
# The last prime number is : 49999

real         10.85
user         10.70
sys           0.05

rx2600:
# timex ./short &
[1] 6486
root@r265c145:/home/h4262/baseline
# The last prime number is : 99991

real          8.59
user          8.50
sys           0.03

How long did the program take to execute? 8 to 11 secs.

How does this compare to the baseline measurement from earlier? A little longer, due to the overhead of sar.


3. Time how long it takes for three short programs to execute.

   # timex ./short & timex ./short & timex ./short &

   How long did the slowest program take to execute? _____________________
   How did the CPU queue size change from first window? __________________

Answer:

rp2430:
# timex ./short & timex ./short & timex ./short &
[1] 10203
[2] 10205
[3] 10206
# The last prime number is : 49999

real         29.86
user         10.68
sys           0.01

The last prime number is : 49999

real         32.07
user         10.67
sys           0.01

The last prime number is : 49999

real         32.35
user         10.67
sys           0.01

rx2600:
# timex ./short & timex ./short & timex ./short &
[1] 6690
[2] 6692
[3] 6694
# The last prime number is : 99991

real         25.08
user          8.48
sys           0.00

The last prime number is : 99991

real         25.56
user          8.48
sys           0.00

The last prime number is : 99991

real         25.60
user          8.48
sys           0.00

How long did the slowest program take to execute?

25 to 34 secs, around three times longer than one occurrence of the program. If you have a multiprocessor, the time will be distributed over the number of processors, with the lower limit being the time a single process would take. For example, if your system had two processors, the slowest process would complete in one-half the time it would take on a single-processor system. Since we are only running three processes here (not including sar), three or more processors would show the same results.

How did the CPU queue size change from first window?

sar -q shows that the average CPU queue length (first field) increases by three times when three programs are run concurrently.

4. Time how long it takes for five short programs to execute.

   # timex ./short & timex ./short & timex ./short & \
     timex ./short & timex ./short &

   How long did the slowest program take to execute? _________
   How did the CPU queue size change from first window? ________

Answer:

rp2430:
# timex ./short & timex ./short & timex ./short & \
  timex ./short & timex ./short &
[1] 10212
[2] 10214
[3] 10216
[4] 10218
[5] 10220
# The last prime number is : 49999

real         53.98
user         10.68
sys           0.01

The last prime number is : 49999

real         54.08
user         10.68
sys           0.01

The last prime number is : 49999

real         54.08
user         10.68
sys           0.01

The last prime number is : 49999

real         54.08
user         10.67
sys           0.01

The last prime number is : 49999

real         54.15
user         10.68
sys           0.01

rx2600:
# timex ./short & timex ./short & timex ./short & \
  timex ./short & timex ./short &
[1] 6737
[2] 6739
[3] 6741
[4] 6743
[5] 6745
# The last prime number is : 99991

real         42.52
user          8.49
sys           0.00

The last prime number is : 99991

real         42.59
user          8.48
sys           0.00

The last prime number is : 99991

real         42.56
user          8.48
sys           0.00

The last prime number is : 99991

real         42.67
user          8.48
sys           0.00

The last prime number is : 99991

real         42.75
user          8.48
sys           0.00

How long did the slowest program take to execute?

43 to 54 secs. If you have a multiprocessor, the time will be distributed over the number of processors, with the lower limit being the time a single process would take. For example, if your system had two processors, the slowest process would complete in one-half the time it would take on a single-processor system. Since we are only running five processes here (not including sar), five or more processors would show the same results.

How did the CPU queue size change from first window?

It increased by 5 while the test was running.

5. Is the relationship between elapsed execution (real) time and the number of running programs linear?

Answer: Yes, very much so. The fastest program in the last case (where 5 programs are running) takes five times longer than with one program. You can draw a graph and go to 10 programs if you are unsure! Typing the command with more than 10 occurrences gets a little tedious, but you will find a linear relationship in any case.

6. Comment on the overhead of switching from one process to another.

Answer: The overhead of task switching is very low. If it were not, the relationship in the above tests would not be linear. If there is an overhead, we will not see it unless there are hundreds of processes being switched.
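The whole experiment can also be scripted in a few lines of C. The following is a minimal sketch, not part of the course materials, that forks N CPU-bound children and measures wall-clock time; on one processor you should see roughly N times the single-process time, confirming the linear relationship:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    static void burn(void)                 /* a fixed amount of CPU work */
    {
        volatile double x = 0.0;
        long i;
        for (i = 0; i < 200000000L; i++)
            x += 1.0;
    }

    int main(int argc, char *argv[])
    {
        int n = (argc > 1) ? atoi(argv[1]) : 1;
        time_t start = time(NULL);
        int i;

        for (i = 0; i < n; i++)            /* start n competing CPU hogs */
            if (fork() == 0) { burn(); _exit(0); }
        while (wait(NULL) > 0)             /* wait for all of them */
            ;
        printf("%d processes: %ld seconds elapsed\n",
               n, (long)(time(NULL) - start));
        return 0;
    }

Running it with 1, 3, 5, and 10 as the argument gives the graph suggested above without retyping the timex command line.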


2–68. LAB: Performance Tools Lab

The goal of this lab is to gain familiarity with performance tools. A secondary goal is to become familiar with the metrics reported by the tools, although they will be explored in depth during the next days.

Directions

Set up:

Change directories to the lab directory:    # cd /home/h4262/tools
Execute the setup script:                   # ./RUN

Use glance (or gpm if you have a bit-mapped display), sar, top, vmstat, and any other available tools to answer the following questions. List as many as possible, and include the appropriate OPTION or SCREEN which will give the requested information. Specific numbers are not the important goal of this lab; the goal is to gain familiarity with a variety of performance tools. Always investigate what the basic UNIX tools can tell you before running glance or gpm. You may want to run through this lab with the solution from the back of this book for more guidance and discussion.

These results were obtained on a C200 workstation running 11i. Remember, the absolute numbers are not important here, but you should be drawing similar conclusions.

1. How many processes are running on the system? Which tools can you use to determine this?

Answer:

top      Gives the number of running processes in the summary portion of the screen:
         119 processes: 96 sleeping, 17 running, 6 zombies
ps       ps -e | wc -l, and subtract 1 for the headers and 1 each for ps and wc.
glance   Look at the table screen (t page) and see the current size of the proc table.
sar      sar -v 2 10, and look at the proc-sz field.
gpm      Gives the count at the top of the Process List report.
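A process count can also be taken programmatically. The following is a hedged sketch (not from the course) using HP-UX's pstat(2) interface, which the measurement tools themselves build on; it assumes the documented pstat_getproc() call from <sys/pstat.h>:

    /* Count processes by walking the kernel process table in batches. */
    #include <stdio.h>
    #include <sys/param.h>
    #include <sys/pstat.h>

    int main(void)
    {
        struct pst_status pst[64];
        int idx = 0, count = 0, n;

        /* fetch the process table 64 entries at a time */
        while ((n = pstat_getproc(pst, sizeof(pst[0]), 64, idx)) > 0) {
            count += n;
            idx = pst[n - 1].pst_idx + 1;   /* continue after the last entry */
        }
        printf("%d processes\n", count);
        return 0;
    }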

2. Are there any real-time priority processes running? If so, list the name and priority. What tools can you use to determine this?

Answer: syncer, midaemon, lab_proc2, sometimes swapper. ttisr and prm3d will also be seen on 11/11i systems running at pri -32. This is the POSIX real-time range, which is even higher than the normal UNIX real-time priorities.

glance   Global screen, PRI column (turn off all filters)
top      PRI column; top also automatically lists the processes by CPU utilization
gpm      Use the filters to select the real-time priority range
ps -el   CPU hogs often have large C counts

. . .

Select: Adviser -> Edit Adviser Syntax

This will display the definitions of the current symptoms being monitored by GlancePlus. Close the Edit Adviser Syntax window.


View CPU details: Click the CPU button. To view a detailed report regarding the CPU, select:

    Reports -> CPU Report

Select:

    Reports -> CPU by Processor

This is a useful report, even on a single-processor system.

5. On Line Help. One method for accessing online help within GlancePlus is to click on the question mark (?) button. The cursor changes to a ?. Click on the column heading NNice CPU %. This opens a new window describing the NNice CPU % column. View descriptions for other columns, including the SysCall CPU %. When finished viewing online help for columns, click on the question mark one more time. This returns the cursor to normal.

6. Alarms and Symptoms. A symptom is some characteristic of a performance problem. GlancePlus comes with predefined symptoms, or the user can define his own. An alarm is simply a notification that a symptom has been detected. From the main window, select:

    Adviser -> Symptom History

For each defined symptom, a history of that particular symptom is displayed graphically. The duration is dependent on the glance history buffers, which are user-definable. Close the window.

Click on the ALARM button in the main window. This displays a history of all the alarms that have occurred since GlancePlus was started. Up to 250 alarms can be displayed. Close the window.

7. Process Details. Close all windows except for the main window. Select:

    Reports -> Process List

This shows the "interesting" processes on the system (interesting in terms of size and/or activity). To customize this listing, select:

    Configure -> Choose Metrics


This will display an astonishing number of metrics which can be chosen for display in this report. This is also a quick way to get an overview of all of the process-related metrics available in GlancePlus. Note that the familiar ? button is also available from this window. Use the scroll bar to find the metric PROC_NICE_PRI. Select this metric and click on OK. Close this window by clicking on OK.

8. Customizations. Most display windows can be customized to sort on any metric, and to arrange the metrics in any user-defined order. To define the sort fields, select:

    Configure -> Sort Fields

The sort order is determined by the order of the columns. Placing a particular metric into column one makes it the first sort field. If multiple entries have the same value within this field, then the second column is used to determine the order between those entries. If further sorting is needed, then the third column is used, and so forth down the line.

To sort on Cumulative CPU Percentage, click on the column heading CPU % Cum. The cursor will become a crosshair. Scroll the window back to column one, and click on column one. This makes CPU % Cum the first sort field. Arrange the sort order so that CPU % is followed by CPU % Cum. Click Done when finished. This sort order is automatically saved so that the next time processes are viewed, this will remain the sort order.

In a similar fashion, the order of the columns can also be arranged. To define the column order, select:

    Configure -> Arrange Columns

Select a column to be moved (for example, CPU % Cum). The cursor will become a crosshair. Scroll the window to the location where the column is to be inserted. Click on the column where the column is to be inserted. Arrange the first four columns to be in the following order: Process Name, CPU %, CPU % Cum, Res Mem. Click Done when finished. This display order is automatically saved so that the next time processes are viewed, this will remain the display order.

9. More Customizations. It is possible to modify the definition of interesting processes by selecting:

    Configure -> Filters

An easy way to limit the processes shown is to AND all the conditions (the default is to OR the conditions). In the Configure Filters window, select AND logic, then click on OK. A much smaller list of processes should be displayed.

Return to the Configure Filters window. Modify the filter definition for CPU % Cum as follows:

    Change Enable Filter to ON


    Change Filter Relation to >=
    Change Filter Value to 3.0
    Change Enable Highlight to ON
    Change Highlight Relation to >=
    Change Highlight Value to 3.0
    Change Highlight Color to any LOUD color

Reset the logic condition back to OR, then click OK. Verify the filter took effect.

10. Administrative Capabilities. There are two administrative capabilities with GlancePlus. If working as root, processes in the Process List screen can be killed or reniced. In the Process List window, select the proc8 process. To access the Admin tools, select:

    Admin -> Renice

Use the slider to set the new nice value for this process to be +19, then click OK. Note the impact on this process. Now, select the proc8 process again. Select:

    Admin -> Kill

Click OK, and note the process is no longer present.

11. Process Details. Detailed metrics can be obtained on a per-process basis. To view process details, go to the Process List window and double-click on any process. Much of the detail in this report will be explained in the Process Management section of the course. The Reports menu provides much valuable information about the process, including the Files Open and the System Calls being generated. After surveying the information available through this window, close it and return to the Main window.

There are many other features available in GlancePlus; there are close to 1000 metrics available with it. Notice that when you iconify the GlancePlus Main window, all of the other windows are closed and the GlancePlus active icon is displayed. Alarms and histograms are displayed in this active icon. Exploding this icon will again open up all previously open windows.

12. Exit GlancePlus. From the Main window, select:

    File -> Exit GlancePlus


13. Glance, the ASCII version. From a terminal window which has not been resized, type glance.

NOTE:  Never run glance or gpm in the background.

If you are accessing the ASCII version of glance from an X terminal window, make sure you start up an hpterm window to enable full glance softkeys. Do not resize the window, as ASCII glance expects a standard terminal size. You can make the hpterm window longer, but never wider; however, making it longer is frequently of no use.

    # hpterm &

In the new window:

    # glance

Display a list of keyboard functions by typing ?. This brings up a help screen showing all of the command keystrokes that can be used from the ASCII version of GlancePlus. Explore these to familiarize yourself with the interface.

14. Display Main Process Screen. Type g to go to the Main Process Screen. This lists all interesting processes on the system. Retrieve online help related to this window by typing h, which brings up a help menu. Select:

    Current Screen Metrics

Use the cursor keys to select CPU Util.

NOTE:  This metric has two values. Use the online help to distinguish the difference between the two values. Use the space bar or the Page Down key to toggle to the next page of help.

Exit the online help CPU Util description by typing e. Exit the Screen Summary topics by typing e.

From the main Help menu, select:

    Screen Summaries

Use the cursor keys to select Global Bars. From this help description, explain what R, S, U, N, and A mean in the CPU Util Bar.

Exit the online help Global Bar description by typing e. Exit the Screen Summary topics by typing e. Exit the main Help menu by typing e. At any time, you can exit help completely, no matter how deep you are, by pressing the F8 key.


15. Modify Interesting Process Definition. From the main Process List window (select g), view the interesting processes. What makes these processes interesting? Type o and select 1 (one) to view the process threshold screen. Cursor down to the Sort Key field, and indicate to sort the processes by CPU usage. Before confirming the other options are correct, note that any CPU usage (greater than zero) or any disk I/Os will cause the process to be considered interesting.

Run the KILLIT command to stop all lab loads.

16. Glance Reports. This is the free-form part of the lab. Spend the rest of your lab time going through the various Glance screens and GlancePlus windows. Use the table below to produce the different performance reports. Feel free to use this time to ask the instructor "How Do I . . . ?" types of questions.

Glance                                              GlancePlus (gpm)
COMMAND  FUNCTION                                   "REPORT"
 *a      All CPUs Performance Stats                 CPU by Processor
  b      Back one screen
 *c      CPU Utilization Stats                      CPU Report
 *d      Disk I/O Stats                             Disk Report
  e      Exit
  f      Forward one screen
 *g      Global Process Stats                       Process List
  h      Help
 *i      I/O by Filesystem                          I/O by Filesystem
  j      Change update interval
 *l      Lan Stats                                  Network by LAN
 *m      Memory Stats                               Memory Report
 *n      NFS Stats                                  NFS Report
  o      Change Threshold Options
  p      Print current screen
  q      Quit
  r      Redraw screen
 *s      Single process information                 Process List, double-click process
 *t      OS Table Utilization                       System Table Report
 *u      Disk Queue Length                          Disk Report, double-click disk
 *v      Logical Volume Mgr Stats                   I/O by Logical Volume
 *w      Swap Stats                                 Swap Detail
  y      Renice process                             Administrative Capabilities
  z      Zero all Stats
  !      Shell escape
  ?      Help with options
         Update screen data


4–16. LAB: Process Management

Directions

The following lab is designed to manage a group of processes. This includes observing the parent-child relationship and modifying process nice values (and thus, indirectly, priorities) with the nice/renice commands.

Modifying Process Priorities

This portion of the lab uses glance to monitor and modify nice values of competing processes.

1. Change directory to /home/h4262/baseline.

   # cd /home/h4262/baseline

2. Start seven long processes in the background.

   # ./long & ./long & ./long & ./long & ./long & ./long & ./long &
   [1] 15722
   [2] 15723
   [3] 15724
   [4] 15725
   [5] 15726
   [6] 15727
   [7] 15728

3. Start a glance session. Answer the following questions.

   How much CPU time is each long process receiving? _________ sec ________ %

Answer:

Hint: Change the sample period to 10 secs (hit the j key). This will give you more time to think and makes "per second" calculations easier!

The CPU should be balanced between the seven processes, with each getting around 14% of the CPU (i.e. 5/7 seconds each for a 5-second interval and 10/7 seconds each for a 10-second interval). This is seen in the CPU Util field of the main glance window. Notice that the programs all have similar priority, around 248-249, which is towards the bottom of the pile.

If you have a multiprocessor, the processes will quickly distribute themselves among all available processors. However, the overall metrics should stay the same, with the exception of the overall length of time that the processes take.


How are the processes being context switched (forced or voluntary)? ______________

Answer:

Select one of the long processes using the glance s key. Make sure the PID being suggested is the right one, or enter the correct PID. In the first column of info you will find the "Forced CSwitch" and "Voluntary CSwitch" metrics. You will notice that (almost!) all context switches are forced when you compare the two figures. This is normal for a CPU hog process: it never leaves the CPU of its own accord and is always told to leave by the scheduler. We saw 7.7-9.6 context switches per second for the period for each of the processes on an rp2430. All of the context switches were forced. On a multiprocessor, there would be the same number of context switches taking place; however, fewer processes would be sharing the same processor.

How many times over the interval is the process being dispatched? ___________

Answer:

Again, we can look to the first column of the selected process' resource summary page. Find the "Dispatches" metric. This is a measure of how often the process is getting onto the CPU, with the sum of Forced CSwitch + Voluntary CSwitch measuring how often the process gets switched out. On a multiprocessor, each processor would have fewer processes wanting its resource, so each process would be selected more often.

What is the ratio of system CPU time to user CPU time? ____________

Answer:

Look to the first column of the selected process info again and you will find the "System CPU" metric. This will be zero or close to zero on any system. By using the C (upper case) key we can switch between metrics for the last interval (10 seconds if you are following the solutions) or the total over the period of tracking. It makes no difference how you look at it: these processes do not make system calls. They are typical CPU hogs that crunch numbers and do nothing else. All the CPU is User/Nice/RT.

What are the processes being blocked on? __________________

Answer: PRI (priority)


The process has been blocked because it is timeslicing with all the other processes. Each time it is switched out, it is placed at the end of the queue in true round-robin fashion. Thus, it is no longer the most eligible process to run, and the scheduler has chosen another.

For more stats, go to the Wait States page for this process (softkey F2 or hit W); notice that the process is blocked on Priority for 80-90% (6/7) of the time, and the rest of the time it is on the CPU. There are no other active wait states. The seven long processes are in a circular fight to get to the top of the pile(s).

What are the nice values for the processes? _______

Answer: 24

A Bourne-based shell (Bourne, Korn, POSIX, bash) always places background processes at a nice level 4 higher than the calling shell. The standard nice value of our shell is 20, so the child background jobs inherit 24 as the nice value. One exception is the C shell, which runs background processes at the same nice value as the shell.

4. Select one of the processes and favor it by giving it a more favorable nice value.

   What is the PID of the process being favored? ____________

Answer:

To change the process's nice value, enter:

   # renice -n -5 <PID>

Be careful! This forces a negative offset of 5 from 20 (the standard nice value), not from the current nice value (24). The nice value in this case will end up at 15, which is more favorable than the others, still at 24. Watch that process's percentage of the CPU over several display intervals with glance or top.

What effect did it have on the process? _______________________________________

Answer:

The effect on the process is that it will race away from the others, consuming approximately 50-60% of the CPU! This might take a little time to settle down at 50-60%; give it several intervals to complete its adjustment.

5. Select another long process and set the nice value to 30.

   # renice -n 10 <PID>

What effect did that have on that process? ____________________________________

Answer:

This really turns the process into a loser! The priority of the process drops to 251-252, preventing the process from getting much action. If you select the process and look in the first column of the Process Resource page, you will see that it is being dispatched, but not very often. You will see the process getting less than 2% of the CPU, but not much more. Each of the other processes will take up the excess, with the majority of the excess going to the process with the nice value of 15.

6. You can either let the processes finish up on their own as the next module is covered, or you can kill them now with:

   # kill $(ps -el | grep long | cut -c18-22)
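As an aside, the renice adjustments above can also be made programmatically. The following is a hedged sketch using getpriority()/setpriority(); on HP-UX these take an offset from the default nice value (displayed as 20 by glance and top), so an offset of -5 should correspond to a displayed nice value of 15. Negative offsets require root:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <sys/types.h>

    int main(int argc, char *argv[])
    {
        pid_t pid;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <offset>\n", argv[0]);
            return 1;
        }
        pid = (pid_t)atoi(argv[1]);
        /* equivalent in spirit to: renice -n <offset> <pid> */
        if (setpriority(PRIO_PROCESS, pid, atoi(argv[2])) == -1) {
            perror("setpriority");
            return 1;
        }
        printf("new nice offset for %d: %d\n",
               (int)pid, getpriority(PRIO_PROCESS, pid));
        return 0;
    }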


5-24. LAB: CPU Utilization, System Calls, and Context Switches

Directions

General Setup

Create a working data file in a separate file system (on a separate disk, if possible).

If another disk is available:

   # vgdisplay -v | grep Name      (Note which disks are already in use by LVM)
   # ioscan -fnC disk              (Note any disks not mentioned above; select one)
   # pvcreate -f <disk device>
   # vgextend vg00 <disk device>

In either case:

   # lvcreate -n vxfs vg00
   # lvextend -L 1024 /dev/vg00/vxfs
   # newfs -F vxfs /dev/vg00/rvxfs
   # mkdir /vxfs
   # mount /dev/vg00/vxfs /vxfs
   # prealloc /vxfs/file <size>

The lab programs are under /home/h4262/cpu/lab0.

   # cd /home/h4262/cpu/lab0

The tests should be run on an otherwise idle system; otherwise results are unpredictable. If the executables are missing, generate them by typing:

   # make all

CPU Utilization: System Call Overhead

Use the dd command to size the read and write operations. Their number can thus be varied to change the number of system calls used to transfer the same amount of information, which lets us see the overhead of the system call interface.

The first command loads the entire file into buffer cache.

   # timex dd if=/stand/vmunix of=/dev/null bs=64k

Now we take our measurements.

   # timex dd if=/stand/vmunix of=/dev/null bs=64k
   real ________   user ________   system ________

   # timex dd if=/stand/vmunix of=/dev/null bs=2k
   real ________   user ________   system ________

   # timex dd if=/stand/vmunix of=/dev/null bs=64
   real ________   user ________   system ________

Answer:

Results for an rp2430:

# timex dd if=/stand/vmunix of=/dev/null bs=64k
282+1 records in
282+1 records out

real          0.04
user          0.00
sys           0.03

# timex dd if=/stand/vmunix of=/dev/null bs=2k
9055+1 records in
9055+1 records out

real          0.15
user          0.02
sys           0.12

# timex dd if=/stand/vmunix of=/dev/null bs=64
289765+1 records in
289765+1 records out

real          3.82
user          0.56
sys           2.95

Results for an rx2600:

# timex dd if=/stand/vmunix of=/dev/null bs=64k
728+1 records in
728+1 records out

real          0.03
user          0.00
sys           0.03

# timex dd if=/stand/vmunix of=/dev/null bs=2k
23299+1 records in
23299+1 records out

real          0.18
user          0.02
sys           0.13

# timex dd if=/stand/vmunix of=/dev/null bs=64
745575+1 records in
745575+1 records out

real          4.57
user          0.54
sys           3.39

Notice that the last case is much slower due to the number of system calls being made. The block size is a factor of 1000 smaller than in the first case, causing 1000 times more calls to the read() and write() system calls. Try a sar -c 2 10 in another window while the test is being run to see this effect. None of these effects have anything to do with physical disk I/O, as the whole vmunix file is coming from buffer cache. Prove this to yourself with a sar -b 2 10 while the test is being run, and notice the 100% read cache hit rate.
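What dd is doing here can be written out in a few lines. The following is a minimal sketch (assumed, not the lab source) of the same copy loop with a selectable buffer size; smaller buffers mean more read()/write() calls for the same data, and therefore more system CPU time:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        char buf[65536];
        size_t bufsize;
        ssize_t n;
        int fd;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <file> <bufsize up to 65536>\n", argv[0]);
            return 1;
        }
        bufsize = (size_t)atol(argv[2]);
        if (bufsize == 0 || bufsize > sizeof(buf))
            bufsize = sizeof(buf);

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        while ((n = read(fd, buf, bufsize)) > 0)    /* one call per bufsize bytes */
            write(STDOUT_FILENO, buf, (size_t)n);   /* and one more to write it  */
        close(fd);
        return 0;
    }

Run under timex with a 64-byte buffer and output sent to /dev/null, it should show the same inflated sys time as the bs=64 dd run above.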

System Calls and Context Switches

This lab shows you the maximum system call and context switch rates that your system can take. Three programs are supplied (a sketch of a cs-style load generator follows this list):

•  syscall      loads the system with system calls of one type
•  filestress   (shell script) generates file system-related system calls
•  cs           loads the system with context switches
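The actual lab sources are not reproduced in this guide, but a hedged sketch of how a cs-style load generator might work is shown below: two processes ping-pong a byte over a pair of pipes, so each read() blocks and every round trip forces at least two context switches.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int ab[2], ba[2];
        char c = 'x';

        if (pipe(ab) == -1 || pipe(ba) == -1) { perror("pipe"); exit(1); }

        if (fork() == 0) {                  /* child: echo everything back */
            for (;;) {
                if (read(ab[0], &c, 1) != 1) _exit(0);
                write(ba[1], &c, 1);
            }
        }
        for (;;) {                          /* parent: each round trip is   */
            write(ab[1], &c, 1);            /* at least two context switches */
            if (read(ba[0], &c, 1) != 1) break;
        }
        return 0;
    }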

1. What is the system call rate when your system is "idle"? ________________

Answer: Around 400-500 on our test systems.

# sar -c 2 2                    (rp2430)

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:18:56  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
11:18:58      602        3        1    0.00    0.00    203272     8151
11:19:00      264        4        1    0.00    0.00      4096      512

Average       434        3        1    0.00    0.00    103741     4341

# sar -c 2 2                    (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/06/04

10:57:02  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
10:57:04      719        3        1    0.00    0.00    260840        0
10:57:06      434        3        1    0.00    0.00      4096     4096

Average       577        3        1    0.00    0.00    132668     2043

2. Run filestress in the background. What is the system call rate now? What system calls are generated by filestress? Take an average with sar over about 40 seconds, i.e.:

   # sar -c 10 4

Answer: Around 20000-30000 on our test systems.

# sar -c 10 4                   (rp2430)

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:19:43  scall/s  sread/s  swrit/s   fork/s   exec/s    rchar/s  wchar/s
11:19:53    17423     3112     1158   130.07   130.07   29710218   147104
11:20:03    12420     3577     2627    63.40    63.40   32159540     8192
11:20:13    23240     4227     1337   192.60   192.60   39581900    17818
11:20:23    26279     3884      700   212.10   212.00   40309248   134963

Average     19840     3700     1456   149.54   149.51   35438766    77037

# sar -c 10 4                   (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/06/04

11:02:40  scall/s  sread/s  swrit/s   fork/s   exec/s    rchar/s  wchar/s
11:02:50    39624     4530     1619   290.51   290.51   92426384    77746
11:03:00    28069     5618     3883   171.70   171.60   69435392    80282
11:03:10    27178     5214     3320   189.40   189.40   67771592    62259
11:03:20    31592     5057     2814   222.70   222.60   72799840    91750

Average     31618     5105     2909   218.60   218.55   75612445    78009

What system calls are generated by filestress?

Answer: read() and write().

3. Terminate the filestress process by entering the following commands:

   # kill $(ps -el | grep find | cut -c24-28)
   # kill $(ps -el | grep find | cut -c18-22)

4. Run the syscall program and again answer question 2. Is the system call rate lower or higher than with filestress? Why?

Answer: The syscall rate is higher than with filestress. Non-blocking system calls produce rates up to 138,000 per second on an rp2430 and up to 290,000 on an rx2600.

# sar -c 10 4                   (rp2430)

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:36:11  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
11:36:21   137619        2        0    0.00    0.00     42863     3376
11:36:31   136788        2        0    0.00    0.00      4506     1946
11:36:41   137887        2        0    0.00    0.00      5734     3277
11:36:51   138224        2        0    0.00    0.00      3686     1229

Average    137629        2        0    0.00    0.00     14171     2457

# sar -c 10 4                   (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/06/04

11:15:51  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
11:16:01   287322       27        1    0.50    0.40     60560     4092
11:16:11   288439        7        1    0.00    0.00    233472    20480
11:16:21   289239        9        1    0.00    0.00     27853     4096
11:16:31   290331        4        0    0.00    0.00     14746     3277

Average    288832       12        1    0.12    0.10     84104     7985

The syscall program uses the open() and close() system calls and does no I/O as such. These system calls do not block the process, which turns into a CPU hog, only blocking on Priority in the glance Wait States page.

Kill the syscall program before proceeding.

   # kill $(ps -el | grep syscall | cut -c18-22)

5. Using cs, compare the number of context switches on an idle system and a loaded system.

   Idle ________    Loaded ______________

Answer:

# sar -w 2 2                    (rp2430)

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:39:27  swpin/s bswin/s swpot/s bswot/s pswch/s
11:39:29     0.00     0.0    0.00     0.0      86
11:39:31     0.00     0.0    0.00     0.0      83

Average      0.00     0.0    0.00     0.0      85

# ./cs &
# sar -w 2 2

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:41:43  swpin/s bswin/s swpot/s bswot/s pswch/s
11:41:45     0.00     0.0    0.00     0.0   47733
11:41:47     0.00     0.0    0.00     0.0   47471

Average      0.00     0.0    0.00     0.0   47602

# sar -w 2 2                    (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/06/04

11:22:07  swpin/s bswin/s swpot/s bswot/s pswch/s
11:22:09     0.00     0.0    0.00     0.0     150
11:22:11     0.00     0.0    0.00     0.0     177

Average      0.00     0.0    0.00     0.0     164

# ./cs &
# sar -w 2 2

HP-UX r265c145 B.11.23 U ia64    04/06/04

11:22:57  swpin/s bswin/s swpot/s bswot/s pswch/s
11:22:59     0.00     0.0    0.00     0.0   81912
11:23:01     0.00     0.0    0.00     0.0   82728

Average      0.00     0.0    0.00     0.0   82319

Notice that we go from an idle context switch rate (pswch/s) of approximately 100 per second up to 47000 or 82000! Additionally, you can look at the glance CPU Report (c) and note how much of the CPU time is spent doing context switching (about 15%).

6. Kill the cs program, remove /vxfs/file, and dismount the /vxfs filesystem.

   # kill $(ps -el | grep cs | cut -c18-22)
   # rm -f /vxfs/file
   # umount /vxfs


5–25. LAB: Identifying CPU Bottlenecks

Directions

The following labs are designed to show symptoms of a CPU bottleneck.

Lab 1

1. Change directory to /home/h4262/cpu/lab1.

   # cd /home/h4262/cpu/lab1

2. Start the processes running in the background.

   # ./RUN

3. Start a glance session and answer the following questions.

   What is the CPU utilization? _______

   Answer: At or near 100%.

   What are the nice values of the processes receiving the most CPU time? _______

   Answer: 10

   What is the average number of jobs in the CPU run queue? ______

   Answer: Varies with configuration; should be approximately 3-5.

   # uptime
   12:05pm  up 4 days, 19:38,  7 users,  load average: 4.73, 3.31, 2.26

4. Characterize the 8 lab processes that are running (proc1-8). Which are CPU hogs? Memory hogs? Disk I/O hogs, etc.? Identify processes that you think are in pairs.

Glance global (g) page output (rp2430):

PROCESS LIST                                                       Users=    1
                                       User   CPU Util    Cum     Disk     Thd
Process Name    PID  PPID  Pri  Name ( 100% max)   CPU   IO Rate    RSS    Cnt
--------------------------------------------------------------------------------
proc8         27425     1  215  root  50.1/49.4   138.6  0.0/ 0.0  168kb     1
proc3         27420     1  221  root  48.4/49.2   138.0  0.0/ 0.0  168kb     1
prm3d          1462     1  168  root   0.0/ 0.2  1125.1  0.0/ 0.0 26.6mb    19
proc5         27422     1  168  root   0.0/ 0.2     0.5  4.0/ 4.0  168kb     1
proc2         27419     1  168  root   0.0/ 0.2     0.5  3.8/ 3.9  168kb     1

Glance global (g) page output (rx2600):

PROCESS LIST                                                       Users=    1
                                       User   CPU Util    Cum     Disk     Thd
Process Name    PID  PPID  Pri  Name ( 100% max)   CPU   IO Rate    RSS    Cnt
--------------------------------------------------------------------------------
proc3         26194     1  219  root  50.8/49.3    81.5  0.0/ 0.0  268kb     1
proc8         26199     1  216  root  48.5/49.4    81.6  0.0/ 0.0  268kb     1
scopeux        2105     1  127  root   0.0/ 0.0    13.3  0.0/ 0.0 20.7mb     1
prm3d          2139     1  168  root   0.0/ 0.1    77.5  0.0/ 0.0 49.5mb    19
ia64_corehw    2989     1  154  root   0.0/ 0.1    65.9  1.1/ 0.0  1.8mb     1
proc2         26193     1  168  root   0.0/ 0.1     0.2  7.6/ 7.7  256kb     1
proc5         26196     1  168  root   0.0/ 0.1     0.2  7.6/ 5.8  256kb     1

proc3 and proc8 are the main CPU hogs. They have been run with nice values of 10! The process pair accounts for almost 100% of the CPU between them. With the same CPU rates and RSS (Resident Set Size), it is likely that these are identical programs. Selecting one of these processes in glance reveals no disk I/O and a context switch profile which is always forced.

proc5 and proc2 also manage to execute, with 0.2% CPU utilization each. Again, these look like a pair. If you select one of these programs and look at the Process Resource page, you can see a small amount of write disk I/O, most of which is logical. The main Wait Reason for this process is SLEEP. It would appear that these processes do a small amount of disk I/O and then call sleep() and pause for some time intentionally.

proc1 and proc7 are a pair. On selecting one of these we see a nice value of 39! These processes find it nearly impossible to get CPU, with the favored pair of proc3 and proc8 taking all the CPU resource. If you watch the Dispatches metric on the Process Resource page, they can be seen to get one or two slices of CPU very infrequently. You should also see that for every Dispatch (these are rare), there is always an accompanying Forced CSwitch. You can conclude that these processes would be CPU hogs if they were not so crippled by their own high nice values and the aggression of proc3 and proc8.

proc4 and proc6 are the last pair. They have standard nice values of 20 and seem to do nothing but call the sleep() system call. They are being dispatched slightly more frequently than proc1 and proc7, and they are always subject to Voluntary CSwitch. These processes are not CPU hogs. They also do no disk I/O of any kind.

None of the above processes had any significant memory size.

5. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

   # timex /home/h4262/baseline/short &

   How long did the program take to execute? _______

Answer:

# timex /home/h4262/baseline/short &        (rp2430)
The last prime number is : 49999

real         56.44
user         10.66
sys           0.01

# timex /home/h4262/baseline/short &        (rx2600)
# The last prime number is : 99991

real       1:02.38
user          8.48
sys           0.00

6. Compare your results to the baseline established in the lab exercise in Module 1.

   Answer: Total execution time is over 5 times slower!

7. End the CPU load by executing the KILLIT script.

   # ./KILLIT


Lab 2

1. Change directory to /home/h4262/cpu/lab2.

   # cd /home/h4262/cpu/lab2

2. Start the processes running in the background.

   # ./RUN

3. In one terminal window, start glance. In a second terminal window, run:

   # sar -u 5 200

   Answer the following questions:

   What does glance report for CPU utilization? _______

Answer: Should be greater than 50% (the more, the merrier!).

Output of rp2430 glance (g) page below:

PROCESS LIST                                                       Users=    1
                                       User   CPU Util    Cum     Disk     Thd
Process Name    PID  PPID  Pri  Name ( 100% max)   CPU   IO Rate    RSS    Cnt
--------------------------------------------------------------------------------
proc2         27761     1    1  root  92.0/92.3   723.2  0.0/ 0.0  168kb     1
prm3d          1462     1  168  root   0.0/ 0.2  1137.2  0.0/ 0.0 26.6mb    19

Output of rp2430 glance (a) page below:

CPU BY PROCESSOR                                                   Users=    1
CPU  State   Util   LoadAvg(1/5/15 min)   CSwitch   Last Pid
--------------------------------------------------------------------------------
 0   Enable  93.2     0.5/ 0.6/ 1.7          1724      27761

Output of rx2600 glance (g) page below:

PROCESS LIST                                                       Users=    1
                                       User   CPU Util    Cum     Disk     Thd
Process Name    PID  PPID  Pri  Name ( 100% max)   CPU   IO Rate    RSS    Cnt
--------------------------------------------------------------------------------
proc2         26469     1    1  root  71.5/71.6    47.1  0.0/ 0.0  288kb     1
prm3d          1462     1  168  root   0.0/ 0.2  1137.2  0.0/ 0.0 26.6mb    19

Output of rx2600 glance (a) page below:

CPU BY PROCESSOR                                                   Users=    1
CPU  State   Util   LoadAvg(1/5/15 min)   CSwitch   Last Pid
--------------------------------------------------------------------------------
 0   Enable  73.1     0.0/ 0.2/ 0.8          1432      26469

What does sar report for CPU utilization? ________


Answer: sar reports the CPU is mostly idle; utilization is less than 10%.

# sar -u 5 200

HP-UX r206c42 B.11.11 U 9000/800    03/16/04

13:45:58    %usr    %sys    %wio   %idle
13:46:03       4       2       0      94
13:46:08       0       1       0      99
13:46:13       1       1       0      98
13:46:18       0       0       0     100
13:46:23       1       1       0      98

This is very strange; the tools totally disagree with each other. sar is reporting over 90% idle, with glance reporting over 80% busy! They cannot both be right. Which one do you trust?

The output of top is also confused. It sees the busy process but still reports 90% idle!

(rp2430)
Load averages: 0.50, 0.56, 1.41
112 processes: 99 sleeping, 13 running
Cpu states:
 LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
 0.50   0.6%   0.0%   2.2%  97.2%   0.0%   0.0%   0.0%   0.0%

Memory: 91236K (64076K) real, 365020K (299140K) virtual, 30120K free   Page# 1/8

TTY      PID USERNAME PRI NI   SIZE   RES STATE    TIME %WCPU  %CPU COMMAND
pts/tb 27761 root       1 20  1664K  148K sleep   14:49 92.56 92.40 proc2

(rx2600)
Load averages: 0.03, 0.12, 0.68
128 processes: 107 sleeping, 20 running, 1 zombie
Cpu states:
 LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
 0.03   0.2%   0.0%   0.0%  99.8%   0.0%   0.0%   0.0%   0.0%

Memory: 197664K (154768K) real, 608492K (523032K) virtual, 23516K free   Page# 1/10

TTY      PID USERNAME PRI NI   SIZE   RES STATE    TIME %WCPU  %CPU COMMAND
tty1p0 26469 root       1 20  3304K  252K sleep    4:08 71.77 71.64 proc2

What is the priority of the process receiving the most CPU time? _______

Answer: The proc2 process is the culprit, and it is running with the high UNIX real-time priority of 1.

How much time is the process spending in the sigpause system call? ______

Answer: Now this is where the clues start!

H4262S C.00 Solutions-44  2004 Hewlett-Packard Development Company, L.P.

http://education.hp.com

Solutions

The Wait States for proc2 show that it is blocked on SLEEP when it is not running. This wait state is the result of the process putting itself to sleep. To see the system calls that the process is calling, hit the F6 softkey or the L key once you have selected the process. glance will collect the data and present it after about 10-20 seconds.

rp2430:

System Calls    PID: 27761, proc2    PPID: 1    euid: 0    User: root

                                          Elapsed                    Elapsed
System Call Name     ID   Count   Rate       Time  Cum Ct  CumRate   CumTime
--------------------------------------------------------------------------------
sigpause            111     449   99.7    0.35218    1497     74.1   1.17095
sigcleanup          139     450  100.0    0.00166    1500     74.2   0.00553

rx2600:

System Calls    PID: 26469, proc2    PPID: 1    euid: 0    User: root

                                          Elapsed                    Elapsed
System Call Name     ID   Count   Rate       Time  Cum Ct  CumRate   CumTime
--------------------------------------------------------------------------------
sigpause            111     525  100.9    1.49255    1500     74.2   4.26847
sigcleanup          139     525  100.9    0.00143    1500     74.2   0.00408

The sigpause() call is causing the sleep blocks that we see in the Wait States page. The interesting thing is that the rate at which the program calls sigpause() is always 100 times per second; that is 10 ms (milliseconds) between calls. How can a program be so coordinated with the wall clock, and what is it using to achieve this synchronization? Can you tell what it is yet?

How is the process being context switched (forced or voluntary)? ______

Answer:

Review the Resource Summary page again for proc2 and you will see that all the context switches are Voluntary. This is not the expected case for a CPU hog. How is it that a process can use so much CPU and never be seen by the scheduler and thrown off the CPU?

The Bottom Line

If you examine the code of the lab, you will see that the process arms a trap waiting for the system hardware clock (the tick) to pop. When this occurs, the program wakes up and wastes CPU for an amount of time that your instructor has tuned to be just under 10 ms (see waste.c). The program then arms the trap again and voluntarily goes to sleep, waiting for the next hardware tick. Remember, the UNIX scheduler analyzes system activity on the hardware tick intervals, and our program has done a good job of never being around at those times! It's a free lunch.
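The following is a hedged sketch of the trick waste.c appears to play; the actual lab source is not reproduced in this guide. It wakes on a 10 ms interval timer, burns CPU for slightly less than one clock tick, then blocks again before the scheduler samples on the next tick. The lab program showed up in glance as sigpause() calls; plain pause() serves the same role here, and the busy-loop count is a placeholder that must be tuned per CPU:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    static void tick(int sig)
    {
        (void)sig;                       /* the handler only interrupts pause() */
    }

    int main(void)
    {
        struct itimerval it;
        volatile long i;

        signal(SIGALRM, tick);
        it.it_interval.tv_sec = 0;
        it.it_interval.tv_usec = 10000;  /* fire every 10 ms, like the clock tick */
        it.it_value = it.it_interval;
        setitimer(ITIMER_REAL, &it, NULL);

        for (;;) {
            pause();                     /* voluntary context switch, off the CPU */
            for (i = 0; i < 500000; i++) /* burn just under 10 ms of user CPU;  */
                ;                        /* placeholder count, tune per system  */
        }
    }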


The standard UNIX tools (sar and top, for example) feed on the scheduler's internal statistics for measurement data, and so they get the wrong story. glance, however, uses the midaemon, which recalculates performance stats every time a process returns from a system call. And you cannot play this game without system calls.

4. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

   # timex /home/h4262/baseline/short &

   How long did the program take to execute? _______

Answer:

# timex /home/h4262/baseline/short &        (rp2430)
The last prime number is : 49999

real       2:32.86
user         10.88
sys           0.07

# timex /home/h4262/baseline/short &        (rx2600)
# The last prime number is : 99991

real         30.86
user          8.51
sys           0.01

Our old benchmark figure was around 10 seconds (real), so this is significantly slower. This program is running in the gaps that the proc2 process is leaving. You could further modify waste.c to use more of the tick period.

5. End the CPU load by executing the KILLIT script.

   # ./KILLIT


6–18. LAB: Memory Leaks

There are several performance issues related to memory management: memory leaks, swapping/paging, protection ID thrashing, and so on. Let's investigate a few of them.

1. Change directories to /home/h4262/memory/leak:

   # cd /home/h4262/memory/leak

Memory leaks occur when a process requests memory (typically through the malloc() or shmget() calls) but doesn't free the memory once it finishes using it. The five processes in this directory all have memory leaks to different degrees. (A minimal sketch of such a leak follows step 2 below.)

The following solution data came from an rp2430 server with 640MB of physical memory and 2GB of device swap, and an rx2600 server with 1012MB of physical memory and 2GB of device swap. The rp2430 was running HP-UX 11i v1 and the rx2600 was running 11i v2.

2. Before starting the background processes, look up the current value for maxdsiz using the kmtune command on 11i v1 and the kctune command on 11i v2.

On the rp2430:

   # kmtune -lq maxdsiz

Answer: Varies with configuration; probably 64MB if you are pre-11i and 256MB for 11i v1.

   # kmtune -lq maxdsiz
   Parameter:  maxdsiz
   Current:    0x10000000
   Pending:    0x10000000
   Default:    0x10000000
   Minimum:    -
   Module:     -
   Version:    -
   Dynamic:    No

The number is in hex. Converting it to decimal gives 268435456 = 256MB.

On the rx2600:

   # kctune -avq maxdsiz

Answer: Varies with configuration; probably 1GB for 11i v2.

   # kctune -avq maxdsiz
   Tunable             maxdsiz
   Description         Maximum size of the data segment of a 32-bit process (bytes)
   Module              vm
   Current Value       1073741824 [Default]
   Value at Next Boot  1073741824 [Default]
   Value at Last Boot  1073741824
   Default Value       1073741824
   Constraints         maxdsiz >= 262144

The solution data also includes changing maxdsiz to 0x10000000 (256MB) with kctune:

   Do you wish to update it to contain the current configuration before
   making the requested change? n
   NOTE: The backup will not be updated.
   * The requested changes have been applied to the currently running system.
   Tunable             Value        Expression   Changes
   maxdsiz   (before)  1073741824   Default      Immed
             (now)     0x10000000   0x10000000
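The following is a minimal sketch (not the lab source) of the kind of leak the proc programs exhibit: memory is malloc()ed in a loop and never free()d, so the data segment (VSS) grows until it hits the maxdsiz limit. The allocation size and pacing are placeholders:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            char *p = malloc(1024 * 1024);      /* grab 1 MB */
            if (p == NULL) {                    /* data segment limit reached */
                fprintf(stderr, "malloc failed: maxdsiz or swap exhausted\n");
                return 1;
            }
            memset(p, 0, 1024 * 1024);          /* touch it so it becomes resident */
            sleep(1);                           /* leak ~1 MB/second; never freed */
        }
    }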

Also take some vmstat readings to satisfy yourself that the system is not under memory pressure. How much free memory do you have?

rp2430:

# vmstat 2 2
     procs         memory                 page                     faults      cpu
  r  b  w     avm   free  re  at  pi  po  fr  de  sr    in    sy   cs  us sy  id
  3  0  0   75182  92519   3   0   0   0   0   0   0   104   408  138   1  0  99
  3  0  0   75182  92465   3   0   1   0   0   0   0   106   214   75   0  0 100

We have around 92000 free pages, which equates to 368MB.

rx2600:

# vmstat 2 2
     procs         memory                  page                      faults      cpu
  r  b  w     avm   free   re   at   pi  po  fr  de  sr    in     sy   cs  us sy id
  2  0  0  124095  97927  466  165  298   0   0   0   2  1134  47856  476  14 19 67
  2  0  0  124095  96427  137   26   69   0   0   0  26   536  21509  470   3  3 94

We have around 97000 free pages, which equates to 388MB.

3. Use the RUN script to start the background processes:

   # ./RUN

4. Open another window. Start glance. Sort the processes by CPU utilization (should be the default), and answer the following questions fairly quickly, before the memory leaks get too large. Go for the m page of glance for the best info. You have to be quick off the mark after starting the leak programs!

MEMORY REPORT                                                       Users=    1
Event            Current  Cumulative  Current Rate  Cum Rate  High Rate
-------------------------------------------------------------------------------
Page Faults          588        1301         113.0     116.1      137.1
Page In                1          33           0.1       2.9        6.1
Page Out               0           0           0.0       0.0        0.0
KB Paged In          0kb        36kb           0.0       3.2        6.9
KB Paged Out         0kb         0kb           0.0       0.0        0.0
Reactivations          0           0           0.0       0.0        0.0
Deactivations          0           0           0.0       0.0        0.0
KB Deactivated       0kb         0kb           0.0       0.0        0.0
VM Reads               0           3           0.0       0.2        0.5
VM Writes              0           0           0.0       0.0        0.0

Total VM : 384.9mb   Sys Mem  : 182.3mb   User Mem: 96.9mb    Phys Mem: 640.0mb
Active VM: 342.1mb   Buf Cache:  32.4mb   Free Mem: 328.4mb

• What is the current amount of free memory?

  Answer: Varies with configuration. Already this has dropped to 328.4MB.

• What is the size of the buffer cache?

  Answer: Varies with configuration. In our case this is 32.4MB.

• Is there any paging to the swap space?

  Answer: Varies with configuration. No, not in the last sample; see KB Paged Out above.

• How much swap space is currently reserved?

  Answer: Varies with configuration. Get this from swapinfo; again, you need to do this just after the programs start. In our case, around 249MB.

# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI NAME
dev        2048       0    2048    0%       0       -   1  /dev/vg00/lvol2
reserve       -     379    -379
memory     1013     330     683   33%
total      3061     709    2352   23%       -       0   -

The total swap space "used" (used = really used + reserved) is the figure on the total line. More detail on swap management is in Module 7; for now, take the bottom-line total figure above.

• Which process has the largest Resident Set Size (RSS)?

  Answer: proc1. You can see that from the global process list in glance (the g key). As you watch it, it will grow until vhand kicks in and limits its RSS; however, the VSS will continue to grow. Select that process (with s) and observe the RSS/VSS figure.

PROCESS LIST                                                       Users=    1
                                       User   CPU Util   Cum     Disk      Thd
Process Name    PID  PPID  Pri  Name ( 100% max)   CPU  IO Rate     RSS    Cnt
-------------------------------------------------------------------------------
proc1          3267     1  168  root   0.0/ 0.2    1.0  0.0/ 0.0  275.8mb    1
proc2          3268     1  168  root   0.0/ 0.1    0.4  0.0/ 0.0  114.6mb    1
proc3          3269     1  168  root   0.0/ 0.0    0.2  0.0/ 0.0   56.7mb    1
proc4          3270     1  168  root   0.0/ 0.0    0.1  0.0/ 0.0   27.7mb    1
alarmgen       3277  3276  168  root   0.0/ 0.0    0.1  1.3/ 0.1    1.6mb    6
vhand             2     0  128  root   0.4/ 0.2    2.0 81.7/44.2     64kb    1

Resources    PID: 3267, proc1    PPID: 1    euid: 0    User: root
-------------------------------------------------------------------------------
CPU Usage (util):   0.0   Log Reads :    0   Wait Reason    : SLEEP
User/Nice/RT CPU:   0.0   Log Writes:    0   Total RSS/VSS  : 275.7mb/479.1mb
System CPU      :   0.0   Phy Reads :    0   Traps / Vfaults:     0/  542
Interrupt CPU   :   0.0   Phy Writes:    0   Faults Mem/Disk:     0/    0
Cont Switch CPU :   0.0   FS Reads  :    0   Deactivations  :     0
Scheduler       :  HPUX   FS Writes :    0   Forks & Vforks :     0
Priority        :   168   VM Reads  :    0   Signals Recd   :     0
Nice Value      :    20   VM Writes :    0   Mesg Sent/Recd :     0/    0
Dispatches      :     5   Sys Reads :    0   Other Log Rd/Wt:     0/    0
Forced CSwitch  :     0   Sys Writes:    0   Other Phy Rd/Wt:     0/    0
VoluntaryCSwitch:     5   Raw Reads :    0   Proc Start Time:
Running CPU     :     0   Raw Writes:    0   Tue Apr  6 14:29:16 2004
CPU Switches    :     0   Bytes Xfer:  0kb

• What is the data segment size of the process with the largest RSS?

  Answer: Select the memory regions page for proc1 with the M key.

Memory Regions    PID: 3267, proc1    PPID: 1    euid: 0    User: root

Type           RefCt      RSS      VSS  Locked  File Name
-------------------------------------------------------------------------------
NULLDR/Shared     87      4kb      4kb     0kb
TEXT  /Shared      2      4kb      4kb     0kb  /home/.../leak/proc1
DATA  /Priv        1  301.0mb  716.2mb     0kb  /home/.../leak/proc1
MEMMAP/Priv        1      0kb     16kb     0kb  /usr/lib/tztab
MEMMAP/Priv        1      4kb      4kb     0kb
MEMMAP/Priv        1      4kb      8kb     0kb
MEMMAP/Priv        1      0kb      8kb     0kb
MEMMAP/Priv        1     24kb     28kb     0kb
MEMMAP/Priv        1     40kb     40kb     0kb  /usr/lib/hpux32/libc.so.

Text  RSS/VSS:  4kb/ 4kb    Data RSS/VSS: 301mb/716mb    Stack RSS/VSS: 4kb/ 8kb
Shmem RSS/VSS:  0kb/ 0kb    Other RSS/VSS: 1.6mb/3.2mb

The data segment size in this example is 301/716 MB and growing!

5. After several minutes, the proc1 process should reach its maximum data size. If your maxdsiz is set to 1 GB, this could take a while. Please be patient. Observe the behavior of the system when this occurs.

• What happens when the process reaches its maximum data size?

  Answer: This is going to take several minutes. The maxdsiz limit is probably either 256 MB or 1 GB on the test system. Be careful! maxdsiz is a limit on the VSS (Virtual Set Size), not the RSS (Resident Set Size). The system starts doing a LOT of disk I/O. Look for the large "F" bar in the Disc Util global meter.
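To check which limit applies on your own test system before starting the leak programs, you can query the tunable directly. The kctune form is for 11i v2 and later; kmtune -q is the equivalent on earlier releases (release availability is an assumption to verify on your system):

# kctune maxdsiz          (11i v2 and later)
# kmtune -q maxdsiz       (11i v1 and earlier)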

• Why does disk utilization become so high at this point?

  Answer: The kernel is dumping the core file of the user process in our case. You will probably run out of disk space in the /home file system. You may want to remove the /home/h4262/memory/leak/core file (a quick check-and-cleanup sketch follows below). Remember, it is not the process that is doing the disk I/O; it is the kernel that is doing it to produce the core file.
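If /home does fill up, a cleanup might look like the following sketch; the core path is simply the lab's leak directory mentioned above:

# bdf /home                             (check free space in /home)
# ls -l /home/h4262/memory/leak/core    (confirm the core file is the culprit)
# rm /home/h4262/memory/leak/core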

6. As the other processes grow towards their maximum data segment size, continue to monitor the following:

• Free memory

# vmstat 2 2
       procs          memory               page                faults      cpu
    r  b  w     avm   free   re  at   pi   po  fr  de   sr    in    sy   cs  us sy  id
    2  0  0  321403  91118   54  19   79  285  16   0  359   548  4962  326   2  3  95
    2  0  0  321403  90413    1   0  115   12   0   0    0   397   552  191   0  0 100

Not a lot of free memory now. The system is under memory pressure and is paging out to stabilize the memory system.

• Swap space reserved

# swapinfo -tm
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev        2048    715   1333  35%       0       -   1 /dev/vg00/lvol2
reserve       -    341   -341
memory     1013    340    673  34%
total      3061   1396   1665  46%       -       0   -

Swapspace is up to 46% utilization!

• The size of the processes' data segments

All the proc(n) processes continue to grow (see VSS), just like proc1 did, and they are aborted in the same way when they cross the line (maxdsiz).

• The RSS of the processes

The running memory hog processes compete for the limited real memory resource. We didn't have a lot free at the start of the test, and the lab procs all want to grow to the maxdsiz limit. They cannot all fit together, so they fight. This is a classic memory thrash situation.

• The number of page-outs/page-ins to the swap space

This depends on when you look! These figures were taken while proc2 was still on the move and free memory was approaching its minimum.

# vmstat 2 10
       procs          memory               page                 faults      cpu
    r  b  w     avm   free   re  at  pi   po   fr  de    sr    in   sy   cs  us sy  id
    2  0  0  166464   2692    0   0   0    0    0   0     0   103  173   82   0  0 100
    2  1  0  170444   1649    0   0  13    0    0   0     0   123  209   92   0  0 100
    2  1  0  170444   1028    0   0   8    5    4   0  1256   122  189   88   0  6  94
    2  1  0  170444   1146    8   0   6  101  109   0  9869   225  176  129   0  5  95
    2  1  0  170444   1392   12   0   5  263   69   0  9659   316  175  112   0  0 100
    2  1  0  170444   1366   12   0   5  312   44   0  8186   331  190  156   0  0 100
    1  0  0  169455   1090    9   0   5  304   28   0  6410   316  209  201   0  0 100
    1  0  0  169455   1112    6   0   3  351   31   0  5334   359  193  163   0  1  99
    1  0  0  169455   1048    3   0   2  332   19   0  3902   339  180  133   5  0  95
    1  0  0  169455   1600    5   0   0  396   12   0  2576   370  240  119   0  4  96
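If you want to watch just the key paging columns while the test runs, a simple filter over vmstat output is sketched below. The field numbers assume the column layout shown above (free is field 5, pi field 8, po field 9, sr field 12); the first couple of lines printed will be header fragments:

# vmstat 5 | awk '{printf "%10s %6s %6s %8s\n", $5, $8, $9, $12}'     (free, pi, po, sr)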

7. Run the two baseline programs, short and diskread.

# timex /home/h4262/baseline/short
# timex /home/h4262/baseline/diskread

rp2430:

# timex /home/h4262/baseline/short
The last prime number is : 49999

real       12.00
user       10.86
sys         0.02

# timex /home/h4262/baseline/diskread
DiskRead: System        : [HP-UX]
DiskRead: RawDisk       : [/dev/rdsk/c1t15d0]
DiskRead: Start reading : 1024MB
1024+0 records in
1024+0 records out

real       31.79
user        0.02
sys         0.53

rx2600:

# timex /home/h4262/baseline/short &
# The last prime number is : 99991

real        8.54
user        8.48
sys         0.00

# timex /home/h4262/baseline/diskread &
[1] 3841
root@r265c145:/home/h4262/memory/leak #
DiskRead: System        : [HP-UX]
DiskRead: RawDisk       : [/dev/rdsk/c2t1d0s2]
DiskRead: Start reading : 2048MB
2048+0 records in
2048+0 records out

real       29.60
user        0.01
sys         0.16

How does the performance of these programs compare to their earlier runs?

Answer: short takes a little longer. The CPU is not under much pressure at this time, so compute-bound processes will not be affected (unless they need memory!). It is a different story for diskread: in the first test case, it took noticeably longer due to the disk load already in progress from the paging activity. It is not good to have swap space on your application disks!

8. When finished monitoring the behavior of processes with memory leaks, clean up the processes.

• Exit glance.
• Execute the KILLIT script:

# ./KILLIT



If you changed maxdsiz, change it back:

# kctune maxdsiz=0x40000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
     ==> Do you wish to update it to contain the current configuration before
         making the requested change? n
   NOTE: The backup will not be updated.
       * The requested changes have been applied to the currently running
         system.
Tunable            Value       Expression  Changes
maxdsiz  (before)  0x10000000  0x10000000  Immed
         (now)     0x40000000  0x40000000

7–15. LAB: Monitoring Swap Space

Preliminary Steps

A portion of this lab requires you to interact with the ISL and boot menus, which can only be accomplished via a console login. If you are using remote lab equipment, access your system's console interface via the GSP/MP. You may get some "file system full" messages while you are shutting down the system. You can ignore these messages.

Directions

The following lab illustrates swap reservation, configures and de-configures pseudo swap, and adds additional swap partitions with different swap priorities.

1. Use the swapinfo -m command to display the current swap space statistics on the system. List the MB Avail and MB Used for the following three items:

              MB Available    MB Used
   dev            512             0
   reserve          -           139
   memory          451            27

Answer: Varies with configuration; examples below.

(rp2430)

# swapinfo -m
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev         512      0    512   0%       0       -   1 /dev/vg00/lvol2
reserve       -    139   -139
memory      451     27    424   6%

(rx2600)

# swapinfo -m
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev        2048     75   1973   4%       0       -   1 /dev/vg00/lvol2
reserve       -    189   -189
memory     1013    339    674  33%

2. To see total swap space "available" and total swap space "reserved", enter:

# swapinfo -mt

What is the total swap space "available" (including pseudo swap)?

Answer: Varies with configuration; in our case it is 963 MB (rp2430) or 3061 MB, about 3 GB (rx2600), as seen in the totals below.

(rp2430)

# swapinfo -tm
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev         512      0    512   0%       0       -   1 /dev/vg00/lvol2
reserve       -    139   -139
memory      451     27    424   6%
total       963    166    797  17%       -       0   -

(rx2600)

# swapinfo -mt
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev        2048     74   1974   4%       0       -   1 /dev/vg00/lvol2
reserve       -    190   -190
memory     1013    339    674  33%
total      3061    603   2458  20%       -       0   -

What is the total space "reserved"?

Answer: Varies with configuration. Swap space is first reserved, and then it may (or may not) be used by the process that reserved it. The bottom line is that reserved swap space is no more available than used swap space, so the only figures that really matter here are the totals underlined (166 MB and 603 MB). This space is unavailable to any other process.

3. Start a new shell process by typing sh. Re-execute the swapinfo command and verify whether any additional swap space was reserved when the new shell process started. In this case, the difference is going to be pretty small, so let's not use the -m option. Upon verification, exit the shell.

Is the swap space returned upon exiting the shell process?

Answer: It should, and it does. But you have to be careful when you look. It is easy for some other activity on the system to "spoil" the results. You may want to try it 2 or 3 times to see if your results change. What SHOULD happen is that the reserve USED entry increases and then decreases by exactly the same amount.

rp2430:

# swapinfo
              Kb      Kb       Kb  PCT  START/    Kb
TYPE       AVAIL    USED     FREE USED   LIMIT RESERVE PRI NAME
dev       524288       0   524288   0%       0       -   1 /dev/vg00/lvol2
reserve        -  144444  -144444
memory    462248   28384   433864   6%

# sh
# swapinfo
              Kb      Kb       Kb  PCT  START/    Kb
TYPE       AVAIL    USED     FREE USED   LIMIT RESERVE PRI NAME
dev       524288       0   524288   0%       0       -   1 /dev/vg00/lvol2
reserve        -  144768  -144768
memory    462248   28384   433864   6%

# exit
# swapinfo
              Kb      Kb       Kb  PCT  START/    Kb
TYPE       AVAIL    USED     FREE USED   LIMIT RESERVE PRI NAME
dev       524288       0   524288   0%       0       -   1 /dev/vg00/lvol2
reserve        -  144444  -144444
memory    462248   28388   433860   6%

rx2600:

# swapinfo
               Kb      Kb       Kb  PCT  START/    Kb
TYPE        AVAIL    USED     FREE USED   LIMIT RESERVE PRI NAME
dev       2097152   75652  2021500   4%       0       -   1 /dev/vg00/lvol2
reserve         -  194900  -194900
memory    1037064  346740   690324  33%

# sh
# swapinfo
               Kb      Kb       Kb  PCT  START/    Kb
TYPE        AVAIL    USED     FREE USED   LIMIT RESERVE PRI NAME
dev       2097152   75652  2021500   4%       0       -   1 /dev/vg00/lvol2
reserve         -  195540  -195540
memory    1037064  346740   690324  33%

# exit
# swapinfo
               Kb      Kb       Kb  PCT  START/    Kb
TYPE        AVAIL    USED     FREE USED   LIMIT RESERVE PRI NAME
dev       2097152   75652  2021500   4%       0       -   1 /dev/vg00/lvol2
reserve         -  194900  -194900
memory    1037064  346740   690324  33%
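To reduce the chance of background activity spoiling the comparison, you can capture just the reserve line at each stage; the grep pattern is simply the row label in the swapinfo output shown above:

# swapinfo | grep reserve             (before)
# sh -c 'swapinfo | grep reserve'     (while the extra shell exists)
# swapinfo | grep reserve             (after the shell has exited)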

If you see that some swap was reserved and not released, then there is something else going on in the background that is skewing the figures.

4. Start glance and observe the Global bars at the top of the display for the duration of this step. Start a large memory process and note how much the Current Swap Util. percentage increases in glance. Type:

# /home/h4262/memory/paging/mem256 &

This should reserve a large amount of swap space. Start as many mem256 processes as possible. For best results, wait until each swap reservation is complete by observing the incremental increases in Current Swap Util. in glance. The system will get slower and slower as you start more mem256 processes.

What was the maximum number of mem256 processes that can be started?

Answer: Varies with configuration; depends on your swap space.

On the rp2430, after 12 copies of mem256 the test system swap space was almost gone. Below is what happened when the 13th process was introduced.

# swapinfo -tm
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev         512    461     51  90%       0       -   1 /dev/vg00/lvol2
reserve       -     51    -51
memory      451    399     52  88%
total       963    911     52  95%       -       0   -

# /home/h4262/memory/paging/mem256&
[13] 2864
# exec(2): insufficient swap or memory available.
[13] +  Done(9)                 /home/h4262/memory/paging/mem256&

On the rx2600, after 37 copies of mem256 the test system swap space was almost gone. Below is what happened when the 38th process was introduced.

# swapinfo -tm
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev        2048   1978     70  97%       0       -   1 /dev/vg00/lvol2
reserve       -     70    -70
memory     1013    991     22  98%
total      3061   3039     22  99%       -       0   -

# ./mem256&
[38] 4159
exec(2): insufficient swap or memory available.

What prevented an additional mem256 process from being started?

Answer: "Insufficient swap or memory available"

Kill all mem256 processes to restore performance; one way to do that in a single command is sketched below.
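This is only a sketch, using the same ps/grep idiom this course uses later for the biod daemons; adjust the cut column range if your ps output differs:

# kill $(ps -e | grep mem256 | cut -c1-7)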

5. Recompile the kernel, disabling pseudo-swap. Use the following procedure:

11i v1 and earlier:

# cd /stand/build
# /usr/lbin/sysadm/system_prep -s system
# echo "swapmem_on 0" >> system
# mk_kernel -s system
# cd /
# shutdown -ry 0

11i v2 and later:

# cd /
# kctune swapmem_on=0
NOTE: The configuration being loaded contains the following change(s) that
      cannot be applied immediately and which will be held for the next boot:
   -- The tunable swapmem_on cannot be changed in a dynamic fashion.
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
     ==> Do you wish to update it to contain the current configuration before
         making the requested change? no
   NOTE: The backup will not be updated.
       * The requested changes have been saved, and will take effect at next
         boot.
Tunable     Value            Expression
swapmem_on  (now)        1   Default
            (next boot)  0   0

# shutdown -ry 0

6. Reboot from the new kernel.

rp2430:

Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

rx2600: (Nothing special needs to be done.)

7. Once the system reboots, log in and execute swapinfo. Is there a memory entry? Why or why not?

Answer: No. Pseudo-swap has been disabled.
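You can also confirm that the running kernel really has pseudo-swap disabled by querying the tunable (kctune on 11i v2 and later; kmtune -q is assumed to be the equivalent on earlier releases):

# kctune swapmem_on       (11i v2 and later)
# kmtune -q swapmem_on    (11i v1 and earlier)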

Will the same number of mem256 processes be able to execute as earlier?

Answer: No.

How many mem256 processes can be started now?

Answer: Varies with configuration.

On the rp2430, only 6 processes could be started successfully. On the rx2600, only 27 processes could be started successfully.

Kill all mem256 processes to restore performance.

8. If you have a two-disk system, add the second disk to vg00 (if this was not already done in a previous exercise) and build a second swap logical volume on it. This lvol should be the same size as the primary swap volume. If you do not have a second disk, continue this lab at question 13.

If you did not add the second disk earlier:

# vgdisplay -v | grep Name        (Note the physical disks used by vg00)
# ioscan -fnC disk                (Note which disk is unused by LVM)
# pvcreate -f /dev/rdsk/cXtYdZ
# vgextend /dev/vg00 /dev/dsk/cXtYdZ

To create the new swap device on the second disk:

# lvcreate -n swap1 /dev/vg00
# lvextend -L 512 /dev/vg00/swap1 /dev/dsk/cXtYdZ     (second disk)

NOTE: In our case the primary swap was 512 MB. Check swapinfo on your system and match the size of the new swap device to the primary swap.

9. Now add the new logical volume to swap space. Ensure that the priority is the same as the primary swap. Check your work.

# swapon -p 1 /dev/vg00/swap1

Answer:

# swapinfo -tm
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev         512      0    512   0%       0       -   1 /dev/vg00/lvol2
reserve       -    130   -130
total       512    130    382  25%       -       0   -

# swapon -p 1 /dev/vg00/swap1
swapon: Device /dev/vg00/swap1 contains a file system. Use -e to page
        after the end of the file system, or -f to overwrite the file
        system with paging.

Oops! Problem 1: swapon is being overly cautious. If you get this message, the memory manager has detected what appears to be a file system already on the device (probably left over from some previous use). You need to override:

# swapon -p 1 -f /dev/vg00/swap1
swapon: The kernel tunable parameter "maxswapchunks" needs to be increased
        to add paging on device /dev/vg00/swap1.

Oops! Problem 2: the kernel cannot deal with this amount of swap. If you get this message, the tunable parameter maxswapchunks is set too small to accommodate all of the new swap space. We need to modify maxswapchunks and reboot. If you have this problem, use sam to double maxswapchunks. In 11i v2, maxswapchunks has been obsoleted and will not have to be modified.

Recompile the kernel (if necessary) to increase maxswapchunks. Use the following procedure:

11i v1 and earlier (ONLY!)

# cd /stand/build
# echo "maxswapchunks 512" >> system
# mk_kernel -s system
# cd /
# shutdown -ry 0

10. If you had to rebuild the kernel to increase maxswapchunks, reboot the system. Otherwise, skip to step 11.

11i v1 and earlier (ONLY!)

Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

And now add the new swap device:

# swapon -p 1 -f /dev/vg00/swap1

Verify that the new swap space has been recognized by the kernel:

# swapinfo -mt                                (rp2430)
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev         512      0    512   0%       0       -   1 /dev/vg00/lvol2
dev         512      0    512   0%       0       -   1 /dev/vg00/swap1
reserve       -    141   -141
total      1024    141    883  14%       -       0   -

# swapinfo -tm                                (rx2600)
             Mb     Mb     Mb  PCT  START/    Mb
TYPE      AVAIL   USED   FREE USED   LIMIT RESERVE PRI NAME
dev        2048     86   1962   4%       0       -   1 /dev/vg00/lvol2
dev        2048      0   2048   0%       0       -   1 /dev/vg00/swap1
reserve       -    158   -158
total      4096    244   3852   6%       -       0   -

Done!

11. Start enough mem256 processes to make the system start paging.

Answer: This depends on how much memory you have, but on an rp2430 with 640 MB, I found that 8 processes got things paging nicely! On an rx2600, 10 should do nicely.

# vmstat 2 2
       procs          memory                page                faults      cpu
    r  b  w     avm   free   re  at   pi   po  fr  de    sr    in   sy   cs  us sy id
    9  0  0  180106   5064   34   0  192  340  99   0  3136   339  213  471 100  0  0
    9  0  0  180106   5056   23   0  122  217  63   0  2006   216  191  355 100  0  0

Note the system is paging constantly in the vmstat output and free memory is very low.

12. Measure the disk I/O to see what is happening with swap space. Go to question 15 when you have finished.

Answer: The I/O should be balanced across both disks!

# sar -d 5 2                                  (rp2430)

HP-UX r206c41 B.11.11 U 9000/800    03/18/04

14:22:12   device    %busy  avque  r+w/s  blks/s  avwait  avserv
14:22:17   c1t15d0   87.03  24.73    409   12222   33.45   13.86
           c3t15d0   60.68  23.21    406   12093   31.03    9.24
14:22:22   c1t15d0   82.60  22.01    395   12209   28.53   12.26
           c3t15d0   72.20  19.57    385   11976   25.00   10.57

Average    c1t15d0   84.82  23.39    402   12216   31.03   13.08
Average    c3t15d0   66.43  21.43    396   12034   28.10    9.89

# sar -d 5 2                                  (rx2600)

HP-UX r265c145 B.11.23 U ia64       04/07/04

11:28:10   device    %busy  avque  r+w/s  blks/s  avwait  avserv
11:28:15   c2t1d0     9.38   0.50     25     542    0.00    6.05
           c2t0d0     3.79   0.50     14     271    0.01    4.71
11:28:20   c2t1d0    21.40   6.75     79    2373    2.85    5.35
           c2t0d0     6.60  10.42     47    1229    3.86    3.94

Average    c2t1d0    15.38   5.25     52    1456    2.16    5.51
Average    c2t0d0     5.19   8.13     31     750    2.97    4.12

This has doubled the effective performance of swap space. The results would be even better if the swap disks were on different controllers.

13. If you have a single-disk system. Create three additional swap devices with sizes of 20 MB.

# lvcreate -L 20 -n swap1 vg00
# lvcreate -L 20 -n swap2 vg00
# lvcreate -L 20 -n swap3 vg00

Prior to activating these swap devices, make note of the amount of swap space currently in use. When the new swap devices are activated with equal priority, all new paging activity will be spread evenly over these swap devices.

List the current amount of swap space in use.

Answer: Varies with configuration. Use swapinfo -tm.

If 10 MB is currently in use on a single swap device, and we activate an equal-priority swap device, what is the distribution if an additional 10 MB is paged out?

A) The distribution would be 10 MB and 10 MB.
B) The distribution would be 15 MB and 5 MB.

Answer: B. vhand does not consider what the previous utilization was: the new 10 MB is split 5 MB/5 MB across the two devices, leaving 10 + 5 = 15 MB on the old device and 5 MB on the new one.
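Once paging starts in the next step, the distribution across the swap devices can be watched by repeating a filtered swapinfo and comparing the Mb USED column of each dev entry; a minimal sketch:

# swapinfo -m | grep ^dev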

14. Activate the newly created swap devices. Activate two with a priority of 1, and the third with a priority of 2.

# swapon -p 1 /dev/vg00/swap1
# swapon -p 2 /dev/vg00/swap2
# swapon -p 1 /dev/vg00/swap3

Start enough mem256 processes to make the system start paging.

Answer: This depends on how much memory you have, but on a 640 MB system I found that 8 processes got things paging nicely!

# vmstat 2 2
       procs          memory              page                faults      cpu
    r  b  w     avm   free   re  at  pi  po  fr  de    sr    in   sy   cs  us sy id
   10  0  0  175597   6489   12   8   2  31  11   0   467     0  271   58  26  4 70
   10  0  0  175597   6414   20   0  27  87  22   0  1316   103  300  254 100  0  0

Note the system is paging constantly in the vmstat output and free memory is very low.

Is the new paging activity being distributed evenly across the paging devices?

Answer: No. It is confined to lvol2 (primary swap), swap1, and swap3, the priority-1 devices. swap2, at priority 2, will not be used until the priority-1 devices are exhausted.

15. When finished with the lab, reboot the system as normal (do not boot vmunix_test) to re-enable pseudo-swap and remove the additional swap devices.

For 11i v1 and earlier, follow this procedure:

# cd /
# shutdown -ry 0

For 11i v2 and later, follow this procedure:

# cd /
# kctune swapmem_on=1
# shutdown -ry 0

8–18. LAB: Disk Performance Issues

Directions

The following lab illustrates a number of performance issues related to disks.

1. A file system is required for this lab. One was created in an earlier exercise. Mount it now.

# mount /dev/vg00/vxfs /vxfs

We also need to ensure that the controller does not have SCSI immediate reporting enabled. Enter the following command and check your current state (fill in the device file name as appropriate):

# scsictl -m ir /dev/rdsk/cXtXdX        (to report current "ir" status)

If the current immediate_report = 1, then enter the following:

# scsictl -m ir=0 /dev/rdsk/cXtXdX      (ir=1 to set, ir=0 to clear)

2. Copy the lab files to the file system.

# cp /home/h4262/disk/lab1/disk_long  /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /vxfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

4. Open a second terminal window and start glance. While in glance, display the Disk Report (d key). Zero out the data with the z key. From the first window, time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

   real: _____   user: _____   sys: _____

   glance Disk Report:   Logl Rds: _____   Phys Rds: _____

Answer:

# timex cat file* > /dev/null      (rp2430)

real        0.73
user        0.01
sys         0.11

   glance Disk Report:   Logl Rds: 2560   Phys Rds: 500

# timex cat file* > /dev/null      (rx2600)

real        0.34
user        0.00
sys         0.06

   glance Disk Report:   Logl Rds: 2560   Phys Rds: 2560

5. At this point, all 20 MB of data is resident in the buffer cache. Re-execute the same command and record the results below:

# timex cat file* > /dev/null

   real: _____   user: _____   sys: _____

   glance Disk Report:   Logl Rds: _____   Phys Rds: _____

Answer:

# timex cat file* > /dev/null      (rp2430)

real        0.06
user        0.01
sys         0.05

   glance Disk Report:   Logl Rds: 2560   Phys Rds: 0

# timex cat file* > /dev/null      (rx2600)

real        0.02
user        0.00
sys         0.02

   glance Disk Report:   Logl Rds: 2560   Phys Rds: 0

NOTE: The conclusion is that I/O is much faster coming from the buffer cache than having to go to disk to get the data.
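The cache hit rate behind this difference can be watched directly with sar's buffer report; a %rcache close to 100 means reads are being satisfied from the buffer cache rather than the disk:

# sar -b 5 5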

6. The sar -d report. Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the VxFS file system (and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

Answer: The disk got over 80% busy. The average number of requests in the I/O queue reached around 53 on the rp2430 and 442 on the rx2600. The average wait time of a request was around 65 ms on the rp2430 and 182 ms on the rx2600. The task took around 12.5 seconds on the rp2430 and 7.5 seconds on the rx2600.

7. The glance I/O by Disk report. Exit from the sar -d report and start glance again. While in glance, display the I/O by Disk report (u key). From the first window, re-execute disk_long. Record the results below:

# ./disk_long

   glance I/O by Disk Report:   Util: _____   Qlen: _____

Answer: Utilization reached 86% and queue length reached 55 on the rp2430. Utilization reached 85% and queue length reached 414 on the rx2600.

8. The glance I/O by File System report. Reset the data with the z key, and display the I/O by File System report (i key). From the first window, re-execute disk_long. Record results below:

# ./disk_long

   glance I/O by File System Report:   Logl I/O: _____   Phys I/O: _____

Answer: Logical I/Os reached 4059 and Physical I/Os reached 806 on the rp2430. Logical I/Os reached 4702 and Physical I/Os reached 1528 on the rx2600.

9. Performance tuning — immediate reporting. Ensure the immediate reporting option is set for the disk that the file system is located on. If immediate reporting is not set, set it.

# scsictl -m ir /dev/rdsk/cXtXdX        (to report current "ir" status)
# scsictl -m ir=1 /dev/rdsk/cXtXdX      (ir=1 to set, ir=0 to clear)

Purge the contents of buffer cache.

# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs

10. The sar -d report. Exit glance, and in the second window start:

# sar -d 5 200

From the first window, execute the disk_long program (which writes 400 MB to the file system and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

How do the results of step 10 compare to the results in step 6?
________________________________________________________________

9–14. LAB: HFS Performance Issues

Directions

The following lab illustrates a number of performance issues related to HFS file systems.

1. A 512 MB HFS file system is required for this lab. Use the mount and bdf commands to determine if such a file system is available.

# mount -v
# bdf

If there is no such HFS file system available, create one using the commands below:

# lvcreate -n hfs vg00
# lvextend -L 512 /dev/vg00/hfs /dev/dsk/cXtYdZ     (second disk)
# newfs -F hfs /dev/vg00/rhfs
# mkdir /hfs
# mount /dev/vg00/hfs /hfs

2. Copy the lab files to the newly created HFS file system.

# cp /home/h4262/disk/lab1/disk_long  /hfs
# cp /home/h4262/disk/lab1/make_files /hfs

Next, execute the make_files program to create five 4-MB ASCII files.

# cd /hfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /hfs
# mount /dev/vg00/hfs /hfs
# cd /hfs

4. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

   real: _____   user: _____   sys: _____

Answer:

# timex cat file* > /dev/null      (rp2430)

real        1.04
user        0.01
sys         0.16

# timex cat file* > /dev/null      (rx2600)

real        0.45
user        0.00
sys         0.05

The cat command took 1.04 seconds to complete on the rp2430 and 0.45 seconds on the rx2600.

5. In a second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

Answer:

# sar -d 5 200                                (rp2430)

HP-UX r206c41 B.11.11 U 9000/800    03/23/04

11:53:15   device    %busy     avque  r+w/s  blks/s   avwait  avserv
11:53:20   c1t15d0    5.20      0.50     13      66     5.09    4.54
           c3t15d0   33.60   6922.08    950   15049   629.53   14.85
11:53:25   c1t15d0    7.57      0.50     10      36     5.40    6.82
           c3t15d0   55.98   5215.11   1758   27980  2113.38   13.70
11:53:30   c1t15d0    2.01      0.50      6      44     3.92    5.01
           c3t15d0  100.00   8156.62   2983   47696  2591.43   16.45
11:53:35   c1t15d0    8.00      5.80     18     108    25.31   18.95
           c3t15d0   84.20   1237.19    558    8670  1555.06   17.68
11:53:40   c1t15d0    6.00      0.50     15      76     4.69    4.72
           c3t15d0   71.20   7379.94   2168   34537  1322.90   14.77
11:53:45   c1t15d0    0.20      0.50      1       5     0.08    8.35
           c3t15d0   25.80   2375.50    950   15206  3478.83   14.42
11:53:50   c3t15d0    9.20      0.50     16     258     5.06    5.21

The disk got up to 100% busy. The average number of requests in the request queue was about 5200. The average wait time in the request queue was about 1950 ms.

# timex ./disk_long

real       22.76
user        4.57
sys         3.45

The operation completed in 22.76 seconds.

# sar -d 5 200                                (rx2600)

HP-UX r265c145 B.11.23 U ia64       04/07/04

13:20:25   device    %busy     avque  r+w/s  blks/s   avwait  avserv
13:20:30   c2t1d0     4.39      0.50     27     706     0.00    1.67
           c2t0d0    27.15      0.50     90     756     0.00    3.04
13:20:35   c2t1d0    41.00    104.29    245    4026   173.18   12.76
           c2t0d0    99.20  24004.63   3322   53129  2127.15    2.35
13:20:40   c2t1d0     1.40      0.50      3      51     0.00    4.62
           c2t0d0   100.00  20020.69   3895   62320  6436.22    2.04
13:20:45   c2t1d0     4.00      0.50     13     287     0.00    5.68
           c2t0d0    57.20   5030.77   2097   33482  9701.92    2.06
13:20:50   c2t1d0     2.40      0.50      7     164     0.00    6.94
13:20:55   c2t1d0     1.40      0.50      2      34     0.00    9.94

The disk got up to 100% busy. The average number of requests in the request queue was about 50,000. The average wait time in the request queue was about 6100 ms.

# timex ./disk_long

real       16.87
user        0.83
sys         1.96

The operation completed in 16.87 seconds.

6. Performance tuning — recreate the file system with larger fragment and file system block sizes. Tuning the size of the fragments and file system blocks can improve performance for sequentially accessed files. The procedure for creating a new file system with customized fragments of 8 KB and file system blocks of 64 KB is shown below:

# lvcreate -n custom-lv vg00
# lvextend -L 512 /dev/vg00/custom-lv /dev/dsk/cXtYdZ
# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustom-lv
# mkdir /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs

7. Copy the lab files to the customized HFS file system, execute the make_files program, and purge the buffer cache.

# cp /hfs/disk_long  /cust-hfs
# cp /hfs/make_files /cust-hfs
# cd /cust-hfs
# ./make_files
# cd /
# umount /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs
# cd /cust-hfs

8. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

   real: _____   user: _____   sys: _____

Answer:

# timex cat file* > /dev/null      (rp2430)

real        0.84
user        0.01
sys         0.10

# timex cat file* > /dev/null      (rx2600)

real        0.43
user        0.00
sys         0.03

The cat command took 0.84 seconds to complete on the rp2430 and 0.43 seconds on the rx2600.

How do the results of step 8 compare to the default HFS block and fragment results from step 4?
_______________________________________________________________________

Answer: The larger block and fragment sizes resulted in I/O operations which were almost 20% faster on the rp2430 and marginally faster on the rx2600.

9. Performance tuning — change file system mount options. The manner in which the file system is mounted can impact performance. The fsasync mount option can improve performance, but data (metadata) integrity is not as reliable in the event of a crash, and fsck could run into difficulties.

# cd /
# umount /hfs
# mount -o fsasync /dev/vg00/hfs /hfs
# cd /hfs

10. In a second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

• How busy did the disk get?
• What was the average number of requests in the I/O queue?
• What was the average wait time in the I/O queue?
• How much real time did the task take?

Answer:

# sar -d 5 200                                (rp2430)

HP-UX r206c41 B.11.11 U 9000/800    03/23/04

12:08:22   device    %busy     avque  r+w/s  blks/s   avwait  avserv
12:08:27   c1t15d0    6.20      0.50      9      38     4.18    6.19
           c3t15d0   61.20   5592.30   2120   33818  1376.80   13.94
12:08:32   c1t15d0    7.00      0.50     16      81     4.31    5.28
           c3t15d0   58.60   7186.64   1675   26765  1295.53   17.00
12:08:37   c1t15d0    8.40      3.94     24     146    20.12   13.03
           c3t15d0   92.80   4986.82   1860   29579  2678.62   16.11
12:08:42   c1t15d0    6.60      0.50     17     120     4.84    3.79
           c3t15d0  100.00  15588.44   2344   37493  2943.35   16.95
12:08:47   c3t15d0   71.20   5725.86   2292   36664  6159.69   15.69

The disk got up to 100% busy. The average number of requests in the request queue was about 7800. The average wait time in the request queue was about 2900 ms.

# timex ./disk_long

real       17.17
user        4.61
sys         3.72

The operation completed in 17.17 seconds.

# sar -d 5 200                                (rx2600)

HP-UX r265c145 B.11.23 U ia64       04/07/04

13:39:39   device    %busy     avque  r+w/s  blks/s    avwait  avserv
13:39:44   c2t1d0     1.00      0.50      4      67      0.00    2.51
           c2t0d0    46.11  22190.48   1274   20184   1026.94    2.54
13:39:49   c2t1d0     2.00      0.50      5      77      0.00    5.94
           c2t0d0   100.00  30303.60   3684   58941   4021.91    2.15
13:39:54   c2t1d0     3.20      5.20      9     141     11.85   12.77
           c2t0d0    99.80  11176.41   3888   62008   8740.46    2.05
13:39:59   c2t1d0     0.80      0.50      2      30      0.00    4.42
           c2t0d0     5.60    716.00    287    4562  11067.58    1.51
13:40:04   c2t1d0     4.00      0.50      9      43      0.00    4.45

The disk got up to 100% busy. The average number of requests in the request queue was about 17500. The average wait time in the request queue was about 6100 ms.

# timex ./disk_long

real       14.46
user        0.86
sys         3.04

The operation completed in 14.46 seconds.

How do the results of step 10 compare to the default mount options in step 5?
_____________________________________________________________________

Answer: With fsasync turned on, the operation was about 25% faster on the rp2430 and 14% faster on the rx2600.

10–23. LAB: JFS File System Tuning

Directions

The following lab exercise compares the performance of JFS with different mount options. The mount options used with JFS can have a big impact on JFS performance.

1. Mount a JFS file system to be used for this lab under /vxfs.

# mount /dev/vg00/vxfs /vxfs

2. Because the above mount command specified no special mount options, the default mount options are used. Use the mount -v command to view the default options, including the option for transaction logging type.

What type of transaction logging does JFS use by default?

Answer: Full logging.
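To verify which logging mode a mounted JFS file system is actually using, check the active mount options; the exact option string varies by release, but the intent-log mode appears in the list:

# mount -v | grep vxfs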

3. Change directory to /vxfs. Time the execution of the disk_long program, which writes 400 MB of data to the file system in 20 MB increments. After each 20 MB is written, the files are deleted. Run the command three times and record the middle results.

# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration; live data from test:

# timex ./disk_long      (rp2430)

real       12.34
user        4.82
sys         3.45

# timex ./disk_long      (rx2600)

real        9.49
user        0.90
sys         1.62

If you look back to the HFS results, you will see that this is faster. See question 5 from the previous lab; the test time there was 23 seconds (rp2430) or 17 seconds (rx2600)!

4. Remount the JFS file system using the delaylog option. This helps the performance of non-critical transactions. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o delaylog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration; should be faster than before:

# timex ./disk_long      (rp2430)

real       10.93
user        4.85
sys         3.52

# timex ./disk_long      (rx2600)

real        9.23
user        0.90
sys         1.64

Based on the results, does the disk_long program perform any non-critical transactions?

Answer: The answer is yes; the disk_long program is performing some non-critical transactions. This is seen by some improvement in time to execute. Since the programs write data in 1 MB increments (that's it), just about every JFS transaction is critical, so mounting with delaylog versus full log does not greatly affect performance in this case. It will in other cases.

5. Remount the JFS file system using the tmplog option. This causes the system call to return after the JFS transaction is updated in memory (step 1 from the lecture) and before the transaction is written to the intent log. Run the command three times and record the middle results.

# cd /
# umount /vxfs

# mount -o tmplog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration; live test data:

# timex ./disk_long      (rp2430)

real       10.08
user        4.82
sys         3.40

# timex ./disk_long      (rx2600)

real       10.35
user        0.90
sys         1.60

Based on the results, why does the disk_long program show little or no improvement when mounted with tmplog?

Answer: The disk_long program shows little performance improvement because the program is performing extending write calls. When an "extending write" call is issued, by default JFS writes the user data first, before writing the JFS transaction to the intent log. As a result, even JFS file systems mounted with tmplog or nolog will still have to wait for the user data to be written to disk. This waiting for the user data to be written hurts the performance of JFS.
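To convince yourself that the penalty really comes from extending writes, you could compare an extending write with an in-place rewrite of the same file. This is only a sketch: the file name is arbitrary, and conv=notrunc makes the second dd rewrite the existing blocks instead of extending the file:

# timex dd if=/dev/zero of=/vxfs/ddtest bs=1024k count=20                  (extending writes)
# timex dd if=/dev/zero of=/vxfs/ddtest bs=1024k count=20 conv=notrunc     (in-place rewrites)
# rm /vxfs/ddtest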

6. Remount the JFS file system using the mincache=tmpcache option. This allows the JFS transaction to be created without having to wait for the user data to be written in extending write calls. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o mincache=tmpcache /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration; live test data. Fastest yet!

# timex ./disk_long      (rp2430)

real        9.13
user        4.51
sys         2.69

# timex ./disk_long      (rx2600)

real        9.51
user        0.90
sys         1.65

Answer: When the mincache=tmpcache option is specified, under 2 MB out of 400 MB is physically written to disk. When this option is not specified, all 400 MB (400 out of 400) is physically written to disk. Major performance improvements should be seen when using this option, especially for applications doing lots of "extending write" calls (like the one in the lab).

7. Remount the JFS file system using the mincache=direct option. This option requires all user data and all JFS transactions to bypass the buffer cache and go directly to disk. Run the command just once and record the results.

# cd /
# umount /vxfs
# mount -o mincache=direct /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long

Record results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration; live test data. Not very impressive!

# timex ./disk_long      (rp2430)

real     7:36.75
user        5.15
sys         5.41

# timex ./disk_long      (rx2600)

real     3:06.72
user        0.90
sys         2.45

Based on the results, why does the disk_long program show such poor performance when mounted with mincache=direct? When would this option be appropriate to use?

Answer: The performance is poor because system calls have to wait while user data and JFS transactions are written out to disk. Normally, the JFS transactions are written to buffer cache, and the system calls do not have to wait for the transaction to be written to disk. This option is appropriate when the application performs its own caching, as with an RDBMS (for example, Oracle).

8. Dismount the VxFS file system.

# umount /vxfs

11–20. LAB: Network Performance

Directions

The following two labs investigate network read and write performance. The labs use NFS and are performed against the JFS file system created in the JFS module.

Lab 1: Network Read Performance

To perform this lab, two systems are needed: an NFS server and an NFS client. Pair up with another student in the class for this lab.

1. Make sure the JFS file system on the server contains the make_files program. Execute the make_files program to create files for the client to access.

# mount /dev/vg00/vxfs /vxfs
# cp /home/h4262/disk/lab1/make_files /vxfs
# cd /vxfs
# ./make_files

2. Export the JFS file system so the client can mount it.

# exportfs -i -o root= /vxfs
# exportfs

3. From the client system, mount the NFS file system.

# umount /vxfs
# mount server_hostname:/vxfs /vxfs

4. Time how long it takes to read the 20 MB of files from the mounted file system. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration; live test data below.

# timex cat /vxfs/file* > /dev/null      (rp2430)

real        1.80
user        0.01
sys         0.07

# timex cat /vxfs/file* > /dev/null      (rx2600)

real        1.17
user        0.00
sys         0.02

5. Now that the data is in the client's buffer cache, time how long it takes to read the exact same files again. Record the results:

# timex cat /vxfs/file* > /dev/null

Record results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration; live data below. Much faster once buffered.

# timex cat /vxfs/file* > /dev/null      (rp2430)

real        0.05
user        0.01
sys         0.04

# timex cat /vxfs/file* > /dev/null      (rx2600)

real        0.02
user        0.00
sys         0.01

Moral: Try to have a big enough buffer cache on the client system for a lot of data to be cached. Also, biod daemons will help by prefetching data.

6. Test to see if fewer biod daemons will change the initial performance.

# cd /
# umount /vxfs
# kill $(ps -e | grep biod | cut -c1-7)
# /usr/sbin/biod 4
# mount server_hostname:/vxfs /vxfs
# timex cat /vxfs/file* > /dev/null

Record results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration, but no significant change here. Large sequential access appears to be independent of the number of biods. Not what theory suggests? Well, this depends!

# timex cat /vxfs/file* > /dev/null      (rp2430)

real        1.80
user        0.01
sys         0.07

# timex cat /vxfs/file* > /dev/null      (rx2600)

real        1.15
user        0.00
sys         0.02

7. Once finished, remove the files and umount the file system.

# rm /vxfs/file*
# umount /vxfs

Lab 2: Network Write Performance

The following lab has the client perform many writes to an NFS file system. The following parameters will be investigated:

• Number of biod daemons
• NFS version 2 versus NFS version 3
• TCP versus UDP

During this lab, the monitoring tools shown below should be used on the client and server:

CLIENT                                   SERVER
# nfsstat -c                             # nfsstat -s
# glance NFS report (n key)              # glance NFS report (n key)
# glance Global Process (g key)          # glance Global Process (g key)
  - monitor biod daemons                   - monitor nfsd daemons
# glance Disk report (d key)
  - monitor Remote Rds/Wrts

1. From the NFS client, mount the NFS file system as a version 2 file system.

# mount -o vers=2 server_hostname:/vxfs /vxfs

2. Terminate all the biod daemons on the client.

# kill $(ps -e | grep biod | cut -c1-7)

3. Time how long it takes to copy the vmunix file to the mounted NFS file system. Record the results. The first command buffers the file.

# cat /stand/vmunix > /dev/null
# timex cp /stand/vmunix /vxfs

Record results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration.

# timex cp /stand/vmunix /vxfs      (rp2430)

real       33.95
user        0.00
sys         0.44

# timex cp /stand/vmunix /vxfs      (rx2600)

real       20.64
user        0.00
sys         0.38

4. Now start up the biod daemons and retry timing the copy. Record the results:

# /usr/sbin/biod 4
# timex cp /stand/vmunix /vxfs

Record results:   Real: _____   User: _____   Sys: _____

Answer: Varies with configuration; the test data shows marked improvement. The biods are providing the "write behind" service, which reduces the wait time experienced by the cp command.

# timex cp /stand/vmunix /vxfs      (rp2430)

real       29.27
user        0.00
sys         0.16

# timex cp /stand/vmunix /vxfs      (rx2600)

real       16.53
user        0.00
sys         0.14

5. Change the mount options to version 3 and retime the transfer:

# cd /
# umount /vxfs
# mount -o vers=3 server_hostname:/vxfs /vxfs
# cd /
# timex cp /stand/vmunix /vxfs

Record results:

Real: _____   User: _____   Sys: _____

Answer: Interesting; it would appear that version 3 mounting is far better than version 2. The results were obtained using the same 4 biods started in question 4.

# timex cp /stand/vmunix /vxfs      (rp2430)

real        2.63
user        0.00
sys         0.18

# timex cp /stand/vmunix /vxfs      (rx2600)

real        4.13
user        0.00
sys         0.13

6. Compare the speed of FTP to NFS. Transfer the file to the server using the ftp utility.

# ftp server_hostname
ftp> put /stand/vmunix /vxfs/vmunix.ftp

How long did the FTP transfer take? _________

Explain the difference in performance.

Answer: The data below shows that ftp is well optimized to perform data transfer. The good news is that version 3 of NFS keeps up with it; remember that at 11i, NFS is using TCP/IP and not UDP/IP.

# ftp r265c69                                 (rp2430)
Connected to r265c69.cup.edunet.hp.com.
220 r265c69.cup.edunet.hp.com FTP server (Version 1.1.214.4(PHNE_23950) Tue May 22 05:49:01 GMT 2001) ready.
Name (r265c69:root):
331 Password required for root.
Password:
230 User root logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put /stand/vmunix /vxfs/vmunix.ftp
200 PORT command successful.
150 Opening BINARY mode data connection for /vxfs/vmunix.ftp.
226 Transfer complete.
27573440 bytes sent in 2.55 seconds (10554.31 Kbytes/s)
ftp>

# ftp r265c145                                (rx2600)
Connected to r265c145.
220 r265c145.cup.edunet.hp.com FTP server (Revision 1.1 Version wuftpd-2.6.1 Tue Jul 15 07:42:07 GMT 2003) ready.
Name (r265c145:root):
331 Password required for root.
Password:
230 User root logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put /stand/vmunix /vxfs/vmunix.ftp
200 PORT command successful.
150 Opening BINARY mode data connection for /vxfs/vmunix.ftp.
226 Transfer complete.
47716848 bytes sent in 4.03 seconds (11557.24 Kbytes/s)
ftp>

7. Test the potential performance benefit of turning off the new TCP feature of HP-UX 11i. First, mount the file system with the UDP protocol rather than the default TCP.

# umount /vxfs
# mount -o vers=3 -o proto=udp server_hostname:/vxfs /vxfs

Perform the copy test again and compare the results with the TCP version 3 mount data in step 5. Is UDP quicker than TCP?

# timex cp /stand/vmunix /vxfs

Answer:

# timex cp /stand/vmunix /vxfs      (rp2430)

real        2.44
user        0.00
sys         0.15

# timex cp /stand/vmunix /vxfs      (rx2600)

real        4.08
user        0.00
sys         0.13

It would appear that UDP is marginally quicker than TCP, but the difference is very small and probably not worth the risk. NFS version 3 with TCP on HP-UX 11i provides good performance and reliability.
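To double-check which transport and NFS version a given mount is actually using, nfsstat can report the per-mount options on releases where it supports the -m flag (an assumption worth verifying on older 11i systems):

# nfsstat -m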
