Enterprise IT Troubleshooting: Cross-Functional IT Problem Solving


Enterprise IT Troubleshooting

Copyright 2020 by Norbert Monfort & Robert Fortunato

 

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of the trademark claim, the designations have been printed with initial capital letters or in all capitals. The author and publisher have taken great care in the preparation of this book but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

All rights reserved. No part of this book may be used or reproduced in any manner whatsoever without the written permission of the author, except in the case of brief quotations embodied in critical articles or reviews.

 

Preface

All major corporations or enterprises worldwide need a large amount of personnel to support their computer systems. One of the most stressful moments for these support personnel is when a system is not working and thousands of employees and customers cannot do their jobs. Unfortunately, this occurs all too often. If such outages are not addressed quickly, large amounts of revenue and/or productivity could be lost, clients could be lost, and reputational damage could occur.

Sometimes the issue is straightforward, but often, it is not. The complexity of modern-day Enterprise IT shops has grown substantially over the past few decades and the trend seems destined to continue. This complexity has led to specialized skills and positions such as network engineer, server engineer, database administrator, storage engineer, web administrator, cloud engineer, security engineer, front-end developer, backend developer, database developer, etc. Even within these disciplines, some specialize in a particular product (e.g., SQL Server DBA vs. Oracle DBA, Windows server engineer vs. Linux server engineer, Java developer vs. .NET developer). This makes sense from an efficiency perspective, but when a problem occurs that crosses these disciplinary boundaries, incidents can linger and damage the enterprise.

The inspiration for this book is to address the above challenge by looking at Enterprise IT problems in a new way: crossing technology disciplinary boundaries and providing troubleshooting best practices. The goal is to show how to isolate complex problems and arrive at quicker resolutions. This book is quite technical but does not concentrate on deep diving into any one technical discipline. Instead, it focuses on the tools and techniques across all the disciplines that are most useful in a crisis. This book is born out of its authors' decades of experience in tackling complex Enterprise IT problems, and it includes dozens of real-life complex Enterprise IT problems and how they were resolved.

This book was also designed to support a college course developed and taught by the authors. In support of this course, the authors created a set of tools that are collectively called “Clouducate”.

 

Clouducate includes Amazon Web Services Command Line Interface (AWS CLI) scripts and .NET code to create an environment (basically a small datacenter – an AWS Virtual Private Cloud) in which the skills taught in this book can be exercised. There are also custom assignments that break these applications such that the student can work at resolving the problem in a simulation of what happens in real Enterprise IT shops. These tools are freely available for download and explained in the final chapter.

 

Table of Contents

Chapter 1 – Enterprise IT Organization
1.1  How Enterprise IT shops are organized
1.2  Enterprise IT focus areas for this book
1.3  Technology and Organizational Interdependencies in Enterprise IT
1.4  Emerging Industry Trends Affecting Enterprise IT Organizations

Chapter 2 – Technology Basics
2.1  Operating System Basics
2.1.1  The Process
2.1.2  Multitasking, Multiprocessing, Multithreading
2.1.3  Virtual Memory
2.1.4  Server Virtualization
2.2  Filesystems and Storage
2.3  Programming Basics
2.4  Networking Basics
2.4.1  Layer 2 Networking
2.4.2  Layer 3 Networking
2.4.3  Layer 4 Networking
2.4.4  DHCP, DNS, NAT & PAT
2.4.5  Firewalls
2.4.6  Network Security Products
2.4.7  Transport Encryption
2.5  Database Basics
2.5.1  Database Performance Considerations

Chapter 3 – ITSM and Troubleshooting Best Practices
3.1 – ITSM & ITIL
3.2 – Troubleshooting Goals
3.3 – Troubleshooting Principles (a.k.a. The 10 Commandments of Troubleshooting)
3.4 – Troubleshooting Steps
3.5 – Identifying & Mitigating Limitations
3.6 – Application Development & Infrastructure in Troubleshooting

Chapter 4 – IT Problem Fundamentals
4.1 – Dimensions of Problems/Constraints
4.2 – Hard vs. Soft Constraints
4.3 – CPU vs. RAM vs. I/O vs. Network (the big four) problems/constraints
4.4 – Log Analysis

Chapter 5 – Operating System Troubleshooting Tools
5.1 – Operating System Overview and Common Operating Systems
5.2 – Key Statistics & Tools Overview
5.3 – Windows OS Tools
5.4 – Linux OS Tools
5.5 – Agent-based Monitoring (ITIM Tools)
5.6 – Hypervisor Overview and Tools
5.7 – Cloud Tools
5.8 – Time Synchronization
5.9 – Troubleshooting Operating Systems – Summary

Chapter 6 – Network Troubleshooting Tools
6.1 – Summary of Network Layers
6.2 – Layers 1 & 2 in Enterprise IT Environments
6.3 – Layer 3 Tools and Considerations
6.4 – Name Resolution
6.5 – Layer 4 Protocols
6.6 – Packet Capturing and Analysis
6.7 – Troubleshooting Network Connectivity – Summary

Chapter 7 – Application Protocols
7.1 – The HTTP Protocol
7.2 – HTTP Tools
7.3 – HTTP Web Services
7.4 – SSL/TLS
7.5 – AAA Concepts
7.6 – Other Application Protocols

Chapter 8 – Hosting Platforms
8.1 – Databases
8.2 – Web and Application Servers
8.3 – Web and Application Server Configuration
8.4 – Application Logs and Stack Traces
8.5 – PaaS – Platform as a Service
8.6 – Browsers
8.7 – Hosted Desktops or VDI (Virtual Desktop Infrastructure)

Chapter 9 – Application Architecture Techniques
9.1 – Load Balancing
9.2 – Global Server Load Balancing (GSLB)
9.3 – Content Delivery Networks (CDNs)
9.4 – Message Queuing/Message Bus
9.5 – Enterprise Service Bus
9.6 – API Gateways
9.7 – Job Scheduling
9.8 – Network Segmentation
9.9 – Tunneling
9.10 – Parallelization

Chapter 10 – Cloud Computing Fundamentals
10.1 – What is Cloud Computing
10.2 – Deployment Models
10.3 – Service Models
10.4 – Security in the Cloud
10.5 – Cloud Providers
10.6 – Troubleshooting Issues in the Cloud

Chapter 11 – Application Architecture Patterns
11.1 – Tiering and Layering
11.2 – Terminal Emulation
11.3 – Client-Server
11.4 – Web-based
11.5 – SOA and Microservices
11.6 – Cloud-Native
11.7 – Business to Business Integration

Chapter 12 – Complex Troubleshooting Techniques and Examples
12.1 – Follow the Packet Method
12.2 – Problem #1 – No One Can Log In!!
12.3 – Problem #2 – Can’t See Image…Sometimes
12.4 – Problem #3 – Why is Email So Slow?
12.5 – Problem #4 – System Keeps Freezing, but Where’s the Problem?
12.6 – Problem #5 – Broken Portal…And No Background
12.7 – Problem #6 – Another Portal…Another Mystery

Chapter 13 – Clouducate
13.1 – AWS Glossary and Concepts
13.2 – Clouducate Environment and AWSVPCB Scripts
13.3 – Clouducate Components and Interdependencies
13.4 – AWSVPCB Documentation
13.5 – Havoc Circus Documentation
13.6 – Clouducate Getting Started

 

About the Authors

Norbert Monfort has over 33 years of experience working in Enterprise IT organizations at Fortune 500 companies. Norbert is currently Vice President, IT Technical Transition and Innovation at Assurant (a global Fortune 300 company with operations in 23 countries). In his current role, Norbert is charged with developing and executing a transformation strategy for the IT organization and leading all Innovation efforts. On the transformation side, Norbert directs efforts to ensure the resiliency and reliability of Assurant’s most critical applications worldwide while driving their migration to new technologies in the cloud such as FaaS (Functions as a Service) and PaaS (Platforms as a Service, like database, application hosting, API management, service buses, etc.). On the innovation side, Norbert directs efforts to implement new technologies to provide efficiencies across the enterprise like Machine Learning, Robotic Process Automation, Natural Language Processing, etc. Norbert has also been a member of the AI World Executive Advisory Board since 2018 – AI World is a worldwide provider of AI conferences focused on AI's applicability within large corporate enterprises and governments. Norbert is a frequent panelist and speaker at innovation conferences such as AI World. Before his current role, Norbert led the infrastructure design engineering team at Assurant and was charged with leading a team of engineers developing the roadmap for all infrastructure components across the enterprise including local area networking (switches, firewalls, wireless), wide area networking (routers, circuits), compute (servers, virtualization), storage (SAN, NAS devices), databases (SQL Server, Oracle), web hosting (load balancers, web/application servers, proxy servers, etc.), telephony, and specialty compute (z/OS, iSeries, etc.). Before that, Norbert led the engineering operations for various areas (networking, DBAs, web hosting, telephony, solutions architecture, etc.). Before that, Norbert was a DBA (SQL & Oracle), UNIX administrator (Solaris, AIX, HP-UX), and mainframe (z/OS) systems engineer. Norbert also has programming experience, having automated many operational processes over the years. As a result, Norbert has coded in over 20 programming languages on various platforms. Norbert received his MS degree in Information Technology and BS in Computer Science from FIU

 

(Florida International University). He is currently an adjunct faculty member at FIU. As an employer of college interns for several years, Norbert recognized that it took several months for new hires to even begin to understand the complexities of a corporate IT environment. As such, he proposed a new course (CTS-4743 Enterprise IT Troubleshooting) to the FIU faculty. Norbert and co-author Robert Fortunato developed the course objectives, structure, materials, exams, and assignments. The course was accepted and approved by the state of Florida as a new offering and is now being taught regularly at FIU. The course assignments use automation to build a small data center in AWS for each student (using AWS Educate credits), including load balancers, Domain Name Servers (DNS), web servers, database servers, firewalls, subnets, etc., with flaws causing issues with web applications and batch applications for students to solve.

 

Robert J. Fortunato Jr. has over 20 years of experience in enterprise-scale IT across the Fortune 500 and public sector higher-ed spaces. Robert is currently the Director of Infrastructure Products, DevSecOps Automation Enablement, at Assurant – a global Fortune 300 company with operations in 23 countries. In this role, Robert focuses on enabling the direct provisioning, use, and management of enterprise technology by its consumers. This role also entails recruiting, developing, and retaining talent for the global technology organization through a collaborative partnership with university system STEM programs. Both areas focus heavily on the use and development of software-driven and software-defined capabilities across the full stack of infrastructure technologies in the cloud and on-premise spaces. Before his current role, Robert led several teams of engineers in both the operational break/fix support and strategic solution design of all aspects of Assurant’s global technology infrastructure. The scope of these roles covered global WAN/LAN networks and network services, block and protocol storage, compute and virtualization hosting, OS platforms, database and web hosting capabilities, and cloud computing services. Before that, Robert was the principal architect and manager of Assurant’s global SQL Server platform, which supported over 15,000 databases distributed in several key worldwide datacenters hosting applications that individually processed over 75,000 peak transactions per second against a single SQL instance. Before Assurant, Robert was at Miami Dade College (MDC), the second-largest institution of higher education in the United States and the largest college in the state of Florida. At MDC, Robert led several key efforts to replace mainframe-based ERP systems with modernized distributed technology systems. These systems were responsible for processing over 100 million dollars in annual financial aid for students. They also serviced Miami Dade College’s 100,000+ student constituency and 6,000+ faculty, staff, and administrative team members. At MDC, Robert was also a senior member of a product-oriented research and development team. He first worked as an application developer, then as a systems developer, and then as an infrastructure systems engineer in the distributed technologies space. To that end, Robert has written, architected, and managed extensive software systems and infrastructure platforms. Some of those systems were ahead-of-their-time homegrown implementations of modern-day technologies like OAuth, SAML, multifactor authentication, and web-based microservices.

 

Robert possesses several undergraduate degrees in Computer Science, holds a Master’s Degree in Information Technology, has deferred conferral of his Master’s Degree in Computer Science given his postgraduate studies, and is close to completing his Ph.D. in Computer Science. Robert also has the pleasure of serving as an invitational speaker and industry professional advisor for several university-based courses in the graduate and undergraduate Computer Science and Information Technology programs at Florida International University. His long-term goal and passion are teaching, but in a manner that brings theory and practice much closer together in the minds of individuals. To that end, Robert, like Norbert Monfort, saw the need for the CTS-4743 course, curriculum, and this textbook based on their professional observations of newly hired interns recruited into Assurant.

 

Chapter 1 – Enterprise IT

In this first chapter, we simply want to briefly describe Enterprise IT, the disciplines included within an Enterprise IT organization, how Enterprise IT is typically organized, and the interdependencies between the Enterprise IT disciplines as they relate to IT troubleshooting. So, what is Enterprise IT? Simply put, it’s an IT organization that supports a large company. These would be Fortune 1000 companies or those of similar size that include multiple business units, have multiple company locations, and possibly even do business in multiple countries. All of these large enterprises have similar IT organizational structures. Understanding these structures helps you understand how to navigate the organization when troubleshooting an issue.

1.1 How Enterprise IT shops are organized

Enterprise IT shops generally have the same major disciplines as follows:

◆  Application Development – This discipline is responsible for developing and supporting the applications/systems that internal users and customers of the enterprise utilize. This includes internally developed websites, services, fat client applications, etc. There can be multiple application development groups or just one, depending on the organization of the company. However, for security and compliance reasons, application development is typically organizationally separate from other groups.

◆  Infrastructure Support – This is a combination of many disciplines that support the backend technologies that allow applications to function properly. There are some core disciplines that all infrastructure teams would include and then some tangential disciplines which may or may not be included under the Infrastructure Support umbrella depending on how the company is organized. The following are the disciplines commonly included under Infrastructure Support:

☐  Core Disciplines – Networking, Servers, Storage, Database Administration, Web Hosting, Cloud.

☐  Tangential Disciplines (sometimes separate from Infrastructure Support) – Desktop support (including virtual desktops), Voice/Video systems support, Mainframe or other specialty compute support, Shared enterprise platform product support (for example, support of email/calendaring systems, instant messaging systems, reporting systems, document management systems, etc.), Service Management areas (Help Desk, change managers, incident managers, etc.).

◆  Security – Information security groups protect the enterprise from cyber threats, internal or external. This group is typically separated into its own group within the Enterprise IT organization and often includes BC/DR. In the past, you could sometimes see Security under Infrastructure Support; however, this is now rare due to perceived conflicts of interest.

◆  Project/Program Management – Project & program management groups typically work with all the Enterprise IT disciplines and thus are usually separate.

◆  Miscellaneous – Several other groups are often found within Enterprise IT organizations depending on the size of the organization and the focus of the company. These include Architecture, Innovation/Transformation, Data Analytics, etc.

This book will focus on the experience from an Infrastructure Support personnel perspective and will include aspects of Application Development and Security since these are crucial when troubleshooting Enterprise IT problems. An important note is that even technology companies (e.g., Apple, Google, Microsoft, Cisco) have similar internal IT organizations separate from their product development organizations.

Sample Enterprise IT Organizational Chart

 

1.2 Enterprise IT focus areas for this book

In terms of the focus areas for this book, let’s describe what each discipline typically supports:

Network Engineering – Typically responsible for routers, switches, firewalls, DNS, Dynamic Host Configuration Protocol (DHCP) configurations, the Internet Protocol (IP) addressing scheme for the company, Virtual Private Network (VPN) devices, and Wide Area Network (WAN) circuits with 3rd party carriers.

Server Engineering – Typically responsible for operating system configuration/standardization (including server build automation and naming standards), virtualization products/configuration, server platform security (e.g., Active Directory), server utilization monitoring, and capacity planning.

Storage Engineering – Typically responsible for NAS (Network Attached Storage) and SANs (Storage Area Networks). SANs in particular require setup of their architecture, utilization monitoring, and replication. Backups/restores are often within this same group but sometimes separated into a specialized group.

Database Administration – Typically responsible for databases (architecture, creating, maintaining, securing, backing up/restoring, utilization monitoring, and performance tuning). Note that DBAs (Database Administrators) are usually granted access to schedule their database backups

utilizing the standard backup solution supported by the Storage Engineering team.

 

Web Hosting – Sometimes under Server Engineering and/or Network Engineering, but often separate. These engineers are typically responsible for web servers, application servers, load balancers, and web security products.

Incident/Problem Management – Typically responsible for organizing a response to an incident by getting the right personnel involved and regularly communicating to stakeholders. These personnel also follow up on long-lasting problems as well.

1.3 Technology and Organizational Interdependencies in Enterprise IT

The above diagram depicts the interdependencies between the various infrastructure disciplines within an Enterprise IT shop. Key observations about this visual are as follows:

1)  The server is at the center of everything. Ultimately, all applications run on servers, and all other services are geared to support this. Thus, all groups work with Server Engineers in one way or another.

2)  Application development teams work directly with DBAs and Web Hosting Engineers (and to a lesser degree Server Engineers) and less directly, if ever, with Storage Engineers and Network Engineers. This is true because application developers need to define their tables on databases and deploy their applications to web servers.

3)  Database Engineers work closer with Storage Engineers than with Network Engineers or Web Hosting Engineers. This is true because databases are heavily dependent on fast storage given that they support large amounts of it.

4)  Web Hosting Engineers work closer with Network Engineers than with Storage Engineers or DBAs. This is true because web servers are often locally and geographically load-balanced, which introduces the need to interact with network engineering.

5)  Incident and Problem Management needs to work with all of these disciplines when addressing problems that arise.

This overlapping of infrastructure disciplines leads to a variety of possible organizational implementations. There’s no one right way to organize your Enterprise IT organization – the key is understanding all the roles needed and making sure they are addressed.

1.4 Emerging Industry Trends Affecting Enterprise IT Organizations

Software-defined infrastructure – The concept of having the computing infrastructure mostly or completely under the control of software through APIs exposed by the hardware manufacturers. The goal is to minimize manual human involvement. This includes SDC (Software-Defined Compute, a.k.a. virtualization), SDN (Software-Defined Networking), SD-WAN, SDS (Software-Defined Storage), SDDC, etc. Software-defined infrastructure enables infrastructure as code.

Infrastructure as code – The process of managing and provisioning computer data centers through machine- and human-readable definition files (e.g., YAML), rather than physical hardware configuration or interactive configuration tools. The resulting definition files are managed as code files, including source code control, version control, etc. Infrastructure as code automates and enables cloud computing and DevSecOps.

Cloud computing – Shared pools of configurable computer system resources and higher-level services that can be rapidly provisioned and scaled with minimal management effort, often over the Internet. This allows enterprises to leverage economies of scale from cloud providers while focusing on core business competencies instead of managing IT infrastructure. Varieties include SaaS, Serverless computing, PaaS & IaaS.
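To make the first two trends concrete, the following is a minimal sketch of software-defined infrastructure driven from code, written in Python against the AWS boto3 library (an assumption of this example; the book's own Clouducate tooling uses AWS CLI scripts instead). The region and CIDR blocks are arbitrary illustrative values.

```python
# Provision a small piece of a datacenter purely through API calls.
# Assumes the third-party boto3 package and valid AWS credentials.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a Virtual Private Cloud (a software-defined network boundary).
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Carve a subnet out of the VPC, again entirely in software.
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
print("Created", vpc_id, "with subnet", subnet["Subnet"]["SubnetId"])
```

In a true infrastructure-as-code workflow, definitions like these would live in version-controlled files (e.g., YAML templates) rather than ad hoc scripts, so environments can be rebuilt repeatably.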

 

DevSecOps or DevOps (Product-based teams) – A software development methodology that combines software development (Dev) with information technology operations (Ops) and sometimes Security best practices (Sec). The goal of DevOps is to shorten the systems development life cycle while also delivering features, fixes, and updates frequently and in close alignment with business objectives. The DevOps approach is to include automation and event monitoring at all steps of the software build.

The coverage of these emerging trends in the book is minimal. However, one chapter covers Cloud Computing and another covers Clouducate, which utilizes AWS to build a mini-enterprise environment for students to test out their troubleshooting skills. Clouducate itself can be considered partially "infrastructure as code" since scripts are used to build out the assignment environments inclusive of all the supporting infrastructure components.

 

Chapter 1 – Review/Key Questions

1)  What are the various groups that make up an Enterprise IT shop, especially the groups that we are focusing on in the class?
2)  Which “IT Infrastructure Support” teams are most likely to work directly with the application development teams?
3)  Which disciplines are typically considered "IT Infrastructure Support"?
4)  What are the emerging technologies that affect Enterprise IT shops?

 

Chapter 2 – Technology Basics

In this chapter, we cover some of the foundational technology concepts needed to fully comprehend later concepts that build upon these foundations. This is not intended to be a deep dive into these areas, as each could be the subject of a book on its own, but instead an overview of the most critical components that need to be understood when troubleshooting Enterprise IT problems. We will start with operating systems & virtualization, then move on to some high-level programming concepts, then networking & network security, and finally, some database concepts. After this chapter, you should have a basic understanding of these foundational concepts.

2.1 Operating System Basics

As mentioned in chapter 1, the server is at the center of everything since applications execute there. Servers, and in fact all computing, rely on operating systems. Understanding the fundamentals of operating systems is paramount to being able to troubleshoot Enterprise IT problems. In later chapters, we will dig deep into diagnostic tools that can be used to identify issues within operating systems, but in this section, we will cover the basics that need to be understood.

The Big Four

While there are now countless components and subcomponents that make up our modern complex operating systems, the four main parts remain key to understanding how things work. We will continue to refer to these “big four”

throughout the book. These four parts are as follows:

CPU(s) – Central Processing Unit. Performs all of the operations carried out by the computer (e.g., arithmetic, I/O, memory manipulations, etc.). The faster and more CPUs a computer has, the more it can process within a period of time. However, the other factors below can limit how much can be processed regardless of the CPU capacity available.

RAM – Random Access Memory (volatile, fast-access memory). Working storage for process data manipulation and buffering. A lack of RAM can cause excessive paging and dramatic performance degradation. Additional RAM can be used to “cache” more items that reside on permanent storage for faster retrieval and to avoid delays.

Permanent Storage – Nonvolatile storage for persistent data. This can be mechanical spinning drives or solid-state devices (e.g., flash drives). I/O to permanent storage is orders of magnitude slower than access to RAM, and thus reducing the I/O required by an application greatly improves its performance. Programs that perform I/O are placed in a “wait state” by the operating system. Also, I/O to the same disks can cause extreme delays. Solid-state devices are orders of magnitude faster than traditional disk drives, but also more expensive.

Network – Protocol-encapsulated communication between one or more hosts. Calls out over the network to another device can be as slow as or often slower than accessing permanent storage, yet new architectures encourage more (not less) network traffic. The key to solid performance for network communication is to ensure there isn’t congestion or bottlenecks across the various links between devices.

2.1.1 The Process

The “Process” is the primary logical unit of an operating system. A process is an instance of a computer program that is being executed. A process contains the program code and its activity, including current memory usage. Every program, including operating system components, executes within the context of a process. All applications execute within one or more processes. Also, it’s important to understand that several processes can be running the same program.

For example, when you start your browser on your PC and visit a website, you start a process. If you then start another browser session to visit another website, you are starting another process that’s running the same program (your browser). However, each process is unique with different RAM and activity because it has performed different functions. Another key concept is that each process executes “as if” it has exclusive access to the CPU(s) and a large amount of the computer’s RAM (virtual address space). While there are many processes active simultaneously on your computer, the individual processes do not worry about this – they just

 

execute their program. Similarly in terms of RAM, for 32-bit addressing systems, processes behave as if they have access to all the 4GB of addressable RAM and for 64-bit addressing systems all of the 16 exabytes of addressable RAM (more than we will have available on any system for quite some time). We will discuss this concept of a virtual address space more in the next section. Yet another key point is that processes can either execute in "Kernel Mode" or "User Mode" and they switch between the two modes depending on the function being executed at the time. Kernel mode is used for accessing shared system resources (e.g. calls to the operating system to communicate to the disk, RAM, network, etc.). User mode is for everything else. When a process is running in “User Mode”, it has no way to access the virtual address space of another process – this is known as “memory protection”. Lastly, each process is given a unique “Process ID” or PID and processes have a hierarchical relationship (e.g., parent-child). The process that initiates another process is called the parent process and can be identified in most systems as the Parent PID (PPID). COMMANDS:

To find the processes running on a Windows machine with their associated PIDs, use “Task Manager” and click on the “Details” tab. To find the processes running on a Mac or Linux machine, type “top” from the terminal.
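For scripted or cross-platform checks, the same process list can be pulled programmatically. Below is a minimal Python sketch assuming the third-party psutil package is installed; it is offered only as an illustration, not as a tool prescribed by the book.

```python
# List running processes with their PID, parent PID (PPID), and name --
# roughly the information shown by Task Manager's "Details" tab or by "top".
import psutil  # third-party: pip install psutil

for proc in psutil.process_iter(attrs=["pid", "ppid", "name"]):
    info = proc.info  # dict containing only the requested attributes
    print(f"PID={info['pid']:>7}  PPID={info['ppid']:>7}  NAME={info['name']}")
```

The PPID column makes the parent-child hierarchy visible: every process listed was started by the process whose PID matches its PPID.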

 

2.1.2 Multitasking, Multiprocessing & Multithreading

These terms may all sound like they are similar, but they each mean very different things. Understanding these concepts well is key when troubleshooting issues on a system.

Multitasking

Multitasking is also called timesharing. This entails the operating system providing time slices on a single CPU to running processes based on runnability, prioritization, and other factors. The key here is that if you only have one CPU, then only one process can be running at a given moment in time. Of course, there are many processes on a system that need to run concurrently, and this is simulated by giving each process a very small slice of time before swapping it out for the next process.

 Multiprocessing Multiprocessing is timesharing across multiple CPUs (or CPU Cores) such that multiple processes can run simultaneously. In other words, multiprocessing is simply multitasking, but across multiple CPUs.

Multithreading

All processes contain at least one thread (thread

 

0); however, applications can be written such that one process can use more than one thread and thus utilize more than one CPU at a time. Note that the application would need to have been written to support multithreading. Most enterprise systems (databases, web servers, mail servers, etc.) support multithreading and often provide configuration options for how many threads to execute. Conversely, Enterprise IT homegrown applications are rarely multithreaded, although they may utilize a platform (e.g., an application server) that allows them to multithread by supporting many single-threaded requests at a time.

2.1.3 Virtual Memory

Memory management is a key function of any operating system, and a lack of RAM can very easily cause issues on a server or even a PC. Thus, another critical concept to understand when troubleshooting Enterprise IT problems is how virtual memory works.

Virtual Address Space

As mentioned earlier, each process executes “as if” it has exclusive access to a large amount of the computer’s RAM. An operating system achieves this by providing each process with a virtual address space. This virtual address space contains all the memory addresses that can be accessed by the process, even if they are not currently available in RAM. Of course, there are many processes on a computer and nowhere near enough RAM to allow each process to have that large an amount of RAM allocated to it. Thus, the concept of virtual memory allows an operating system to share the physical memory space among many processes.

It’s important to understand that virtual address spaces are divided into small

 

chunks called pages (usually 4K in size but this could vary). In the figure below, you will note how virtual address space pages are mapped to physical memory. While a process has access to all the addressable memory it needs, most processes only use a small fraction of the address space. Whatever addresses are used are mapped to a memory map table that’s associated with the process and indicates where in the real RAM a particular virtual memory address is located. Even though only a small fraction of the virtual address space is needed by most processes, the amount needed can rarely fit entirely in the real RAM. Thus, as virtual memory address pages age (e.g. haven’t been used in a while), they may be “paged out” to the swap area, which utilizes permanent storage. Since permanent storage is many times slower than RAM, if a process needs to retrieve those “paged out” virtual address pages again, it will pay a severe performance cost.

 

Paging vs. Swapping vs. Thrashing

Paging, swapping, and thrashing are all terms associated with the process of having virtual address memory pages change from being in RAM to being placed in the swap area and vice versa. It’s important to understand how all of these are defined.

Paging is the allocation of the virtual address space into smaller, addressable chunks, for memory management. A page fault occurs when a process requests a “page” from the virtual address space memory that is not resident in physical memory. This causes the operating system to “page in” the memory contents from permanent storage (usually segments of the program code).

Swapping is the process of evicting a physical-memory-resident page of the virtual address space to free up physical memory that will be occupied by another page of virtual address space, possibly from a different process (hence the term swapping: swap one page for another). Victim pages that are swapped out are typically chosen via LRU (least recently used) algorithms. This is occurring all the time on all operating systems and in and of itself is not a problem as long as it’s limited.

Thrashing is when almost no effective work is being accomplished on a system because so much time is being spent swapping memory pages back and forth between RAM and the swap area.
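As a quick illustration of these concepts, the sketch below (assuming the third-party psutil package; mmap is part of the Python standard library) reads the OS page size and the cumulative swap counters that indicate how much paging activity has occurred.

```python
# Inspect the page size and swap activity -- a first check when thrashing is suspected.
import mmap
import psutil  # third-party: pip install psutil

print(f"OS page size: {mmap.PAGESIZE} bytes")  # commonly 4096 bytes (4K)

swap = psutil.swap_memory()
print(f"Swap in use: {swap.percent}% of {swap.total // (1024 ** 2)} MiB")
# sin/sout are cumulative bytes swapped in from and out to the swap area since boot.
# Counters that climb rapidly while swap usage stays high are a symptom of thrashing.
print(f"Swapped in: {swap.sin} bytes, swapped out: {swap.sout} bytes")
```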

2.1.4 Server Virtualization

The vast majority of servers within an Enterprise IT environment are now virtual. A virtual machine or VM is a type of virtualization whereby an operating system is emulated within another operating system such that multiple independent systems are running on the same underlying hardware (shared CPUs, RAM, network interface cards, storage cards, etc.). Since virtualization is so heavily used within Enterprise IT, understanding how it works is important to be able to resolve problems.

Hypervisors

A hypervisor or virtual machine manager (VMM) is the underlying operating system supporting the emulation of the guest operating systems. Most of these operating systems are a variant of Linux under the covers (Hyper-V is

 

an exception as it runs on Windows), but administrators mostly utilize the graphical user interface provided by the hypervisor product to perform various functions, such as:

☐  Create new virtual servers on a specific physical host server by providing an operating system image and the sizing of the VM (e.g., CPUs, RAM).
☐  Adjust the resources of existing virtual servers (CPUs, RAM).
☐  Monitor virtual servers to determine how much CPU each guest is using and what the overall CPU usage of the physical machine is.
☐  Shut down, start, recycle, or even delete virtual servers.

It’s important to understand that there are two types of hypervisors:

Bare metal hypervisors – These are installed directly on the hardware and include Dell’s VMware, Microsoft’s Hyper-V, Citrix’s XenServer, Nutanix AHV, and open-source options (KVM, Xen). This is the type of hypervisor used within Enterprise IT environments.

Hosted hypervisors – These run within an operating system like Windows, Linux, or macOS and include VMware Workstation, Oracle’s VirtualBox, Parallels for Mac, and Microsoft Virtual PC. While this type of hypervisor is useful for testing, it is not used within Enterprise IT environments.

In the diagram below, you can see the contrast between the traditional method of deploying an operating system (now rarely used in Enterprise IT environments), bare metal virtual machines, and hosted virtual machines (not used within Enterprise IT environments).

 

We will dig deeper into virtual machines and hypervisors in a later chapter, but for now, we want to cover the advantages and disadvantages of server virtualization to provide a better understanding of the concepts.

Advantages of Virtual Machines (VMs)

☐  Hardware Decoupling (Migration) – If there is a hardware problem or lack of resources on some physical hardware, the server can be quickly moved.
☐  Automatic Failover – If the physical server fails, then all the guests on it can be automatically moved to other physical machines in a cluster.
☐  Adjustable Resource Needs – If additional RAM or CPU resources are needed, they can quickly be added as long as capacity exists on the same physical machine.
☐  Optimal Utilization of Resources – CPUs and RAM can be oversubscribed such that the overall usage of hardware is optimal even if the systems rarely use all that they need.

Disadvantages of Virtual Machines (VMs)

☐  Overhead of Emulation – There is a certain amount of “wasted” computing resources as the emulation of the hardware takes additional CPU cycles.
☐  Complexity of Performance Diagnosis – Due to oversubscription, a lack of CPU or RAM (e.g., thrashing) may not be apparent on the guest machine but may be occurring on the host machine.

It is generally accepted that the benefits of server virtualization far outweigh the disadvantages for Enterprise IT servers. Also, server virtualization has allowed cloud computing to flourish.

2.2 Filesystems and Storage

As mentioned earlier, one of the big four in terms of operating systems is permanent storage. All servers and applications rely on permanent storage. Typical enterprises today have petabytes or even exabytes of permanent storage, and the trend is for these amounts to continue to increase over time. Thus, it’s not surprising that issues with permanent storage can be the root of Enterprise IT problems, whether caused by slow performance or the

 

failure of a device. As such, it’s important to understand the options available within an Enterprise IT environment for permanent storage. In this section, we will briefly discuss these options.

Storage Options

There are a few critical dimensions to storage options, such as the underlying storage hardware, the access method, and the connection method. Below we will briefly discuss each.

Types of storage hardware:

Mechanical spinning disks – These are less expensive, but also provide slower response times. There are still many mechanical disks used within Enterprise IT environments, especially for storing larger amounts of data that are infrequently accessed such as backups and lightly used file shares.

Solid State Disk (SSD) – These are costlier but offer much better performance. SSD has become heavily used within Enterprise IT environments for any highly used storage such as those supporting databases.

Tape – While tape was heavily used for backups in the past, it has become exceedingly rare within Enterprise IT. Using mechanical storage for backups and then replicating the backup data to mechanical storage at another location is now the preferred option for offsite backups. Another emerging trend is backing up to the cloud, albeit this is still rare with larger Enterprise IT

environments.

Storage Access Methods:

Block Mode Access – This type of access means that the operating system will read and write blocks from the storage device. This is necessary for many applications, such as database servers, that read and write only portions of the large database files they operate on. For such applications, reading an entire large file from disk would be very inefficient (a sketch of the contrast follows this list).

File Mode Access – This type of access means that the operating system will read and write an entire file from the permanent storage location. This would suffice for most common PC applications like Word, Excel, etc. as well as server software like web servers and application servers that need to process entire files (e.g., a web server that sends entire files to the browser).
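The difference between the two access styles can be imitated at the application level with ordinary file I/O. The sketch below uses only the Python standard library; "data.bin" and the block size are hypothetical values chosen for illustration, and real block-mode access happens at the storage-protocol layer rather than through these calls.

```python
# Contrast block-style partial reads with a whole-file read.
BLOCK_SIZE = 4096          # illustrative block size; real devices and filesystems vary
FILENAME = "data.bin"      # hypothetical large file

# Block-style: seek to an offset and read a single block, the way a database
# engine touches only a small region of a very large data file.
with open(FILENAME, "rb") as f:
    f.seek(10 * BLOCK_SIZE)          # jump directly to the 11th block
    one_block = f.read(BLOCK_SIZE)   # read just that block

# File-style: read the entire file in one pass, the way a web server
# serves a whole static file to a browser.
with open(FILENAME, "rb") as f:
    whole_file = f.read()
```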

 

Storage Connectivity Options:

Direct Attached Storage (DAS) – This is typically block-mode storage that’s attached to the server via a direct cable connection. Most servers have some kind of local storage; however, the

amount varies greatly. Even with the move to blade servers over the past decade, most came with local storage or at least locally shared storage at the chassis level. A recent trend to move to hyper-converged infrastructure (HCI) has increased the amount of local storage that comes with most servers.

Storage Area Network (SAN) – This is storage that utilizes similar storage cards as are used with DAS, but they connect to a dedicated network (typically Fibre Channel). This dedicated network then connects to an array of storage servers that front-end a very large amount of shared storage. A SAN acts like DAS in that it’s block-mode storage.

Network Attached Storage (NAS) – This is storage that is accessed via an IP network (typically Ethernet). As with a SAN, when connecting to a NAS, you connect to an array of storage servers that front-end a very large amount of shared storage. Unlike a SAN, NAS devices typically offer file-mode access as opposed to block mode. Often, access is encapsulated by an abstraction protocol whereby you access files or objects. Cloud storage like Amazon’s S3, Google Drive, Microsoft’s OneDrive, DropBox, etc. has taken NAS to an extreme by serving storage over the Internet.

The I/O performance your server will experience is influenced by each of these dimensions as well as the speed of the connection (e.g., Fibre Channel, Ethernet, etc.) and the amount of contention at various levels (the disk itself, the storage servers for SAN/NAS, and the network in between, whether Fibre Channel, Ethernet, or something else). Storage performance is a complicated topic in and of itself and not a focus of this book, but it can certainly play a role in Enterprise IT problems.

Filesystems

Filesystems are a method of storing files such that the OS can track and secure them. Windows & Linux work with hierarchical filesystems, but they

 

are configured slightly differently. Linux filesystems have only one “root” directory (“/”) and all additional storage is mounted within subdirectories. Also, Linux has a broader array of filesystem types/options. Meanwhile, Windows has multiple “root” directories associated with drive letters (e.g., c:\, d:\, etc.) and NTFS is pretty much the only filesystem type used, albeit you may find FAT32 on some older systems. Both Linux and Windows allow for the association of security at the directory or file level.

2.3 Programming Basics

While this book focuses on Enterprise IT troubleshooting from the perspective of infrastructure support staff, exceptional troubleshooting requires some level of programming knowledge. After all, it’s the application that users care about. So, while it’s important to understand all the infrastructure that helps these applications function properly, a certain level of understanding about the application programming is necessary as well.

Understanding how applications function internally and connect to other applications and other platforms (e.g., databases) is critical and will be a focus of this book. This brief section introduces some high-level concepts that are important to understanding how to troubleshoot Enterprise IT applications.

Platforms

Many applications execute within a platform such as Microsoft’s IIS, IBM’s WebSphere, Tomcat, Oracle’s WebLogic, Node.js, etc. Platforms offer applications many services out of the box (e.g., security, load balancing, versioning, protocol abstraction, etc.) that simplify their development tasks. With the introduction of the cloud, the number of platform offerings has grown exponentially in recent years. We will discuss some of these platforms further in later chapters.

Frameworks

Many applications utilize frameworks, which are software providing generic functionality to ease the burden of common programming tasks. Using frameworks simplifies application development by providing pre-built code incorporated into user-written code. Some common examples include .NET, Java, React.js, AngularJS, Spring, Hibernate, Struts, etc. Unfortunately, compatibility between the operating system, platform & framework versions needs to be verified to avoid problems. Frameworks are outside the scope of this book.

 

Call Stack

Within a running process, each thread maintains its own call stack. A call stack is the history of called procedures/subroutines such that when a subroutine ends, the previous routine that called it will resume. This is critical to understand because when an unhandled exception (error) occurs, the entire call stack is unwound and typically provided in a log. Modern applications that utilize platforms and frameworks can create very deep and difficult-to-follow call stacks. This will be discussed more in depth in a later chapter.

Instrumentation

Instrumentation is the ability to monitor or measure the level of an application’s performance, diagnose errors, and write trace information. This is critical to troubleshooting Enterprise IT issues. Instrumentation can be implemented by writing to log files, writing to a database, or writing to a 3rd party tool. This topic will also be discussed more thoroughly in a later chapter.
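The two ideas can be seen together in a few lines of Python (standard library only; the function and logger names are made up for illustration): an exception raised deep in a call chain unwinds the call stack, and the logging module records the full stack trace as instrumentation output.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders")

def load_order(order_id):
    # Simulated failure deep in the call stack.
    raise ValueError(f"order {order_id} not found")

def process_order(order_id):
    log.info("processing order %s", order_id)  # instrumentation: trace information
    return load_order(order_id)

try:
    process_order(42)
except ValueError:
    # Writes the message plus the full stack trace (process_order -> load_order),
    # which is the kind of entry you find in application logs after an
    # unhandled exception unwinds the call stack.
    log.exception("failed to process order")
```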

2.4 Networking Basics

Networking is crucial to all Enterprise IT supported applications today, and understanding the basics of how networks function is critical to being able to perform troubleshooting. To understand how networks function today, it’s useful to briefly understand how they developed to this point.

Before the late 1960s, the networking resources needed for any communication were pre-allocated and reserved before communications began. In that way, communications were guaranteed once established and if there wasn’t capacity available, the failure would occur before any communications began (e.g. a busy signal). This pre-allocating of networking resources would later be coined “circuit switching”. The big disadvantage to

 

circuit switching was that this pre-allocation of networking resources meant that it was difficult to scale the number of systems that could communicate with each other simultaneously.

Packet Switching

In the 1960s there were several theories and eventually implementations of a new form of communications that would become known as “packet switching”. The concept of packet switching was that you could break up communications into smaller packets of information and send them to their destination via any of several paths. These packets can then be appropriately reassembled at the destination. The primary advantage of this was to allow exponentially more communication vectors between systems because there was no need to pre-allocate any resources. You just needed to start communicating with whichever system you wished to communicate with. Of course, there are still physical limitations (e.g. bandwidth), but you could grow those as needed.

Today, just about all communications around the world utilize packet switching and this book only focuses on this type of networking.

 ISO & TCP/IP Models

With the introduction of packet switching, many incompatible vendor-specific protocols flooded the market in the 1970s. So, while packet

 

switching allowed many systems to communicate with each other concurrently, the protocols used were different, so communications were still very limited. To provide some order to all these various protocols, ISO (the International Standards Organization) developed the OSI (Open Systems Interconnection) model, which was published in 1980. The intent was to begin to drive vendors to common nomenclature and eventually to standard communications. Ironically, as the Internet exploded in the late 1980s and early 1990s, the TCP/IP protocol used on the Internet became the de facto standard, and eventually, the use of all other proprietary protocols began to shrink. Today, TCP/IP is by far the dominant protocol used worldwide by all types of systems. There are still some proprietary protocols in use between some systems, but these are rare and niche deployments. Even though it’s only a theoretical model, the OSI model is still referred to today in literature and enterprise technology conversations. The model still has relevance because the layers are still theoretically needed in many communications. TCP/IP has its own communications model, which is much simpler (e.g. fewer layers) because it only focuses on the lower layers of the communications stack. While the higher layers of the OSI model are relevant, as far as TCP/IP is concerned they are simply part of the “application” layer. As such, these higher layers are not as standardized around the world as the lower layers.

Encapsulation

A key concept regarding these networking models is “encapsulation”. This is the idea where the higher layer of the communication is encapsulated by the

lower layer as it moves down to the actual media that will send the communication. Once the communication is received at the other end, each layer is unwrapped and pushed up the stack until only the data sent by the application is received by the recipient application. This allows application developers to avoid thinking about all these details and simply send messages to the other application. This idea of standardized network layering also allows network devices in between the communicating devices to manage or even alter the outer layers (e.g. layers 2 through 4) without impacting how applications work. This has allowed for the seamless introduction of many features such as the guaranteed ordering of packets, optimization of window sizes, network

 

address translation, etc. over the years without needing to rearchitect any higher-level applications. We will discuss some of these features later in this chapter and many others in future chapters. Below is a graphic comparing the OSI model to the TCP/IP model. As you can see, the TCP/IP model has fewer layers and the higher layer (Application) is pretty much irrelevant to the TCP/IP model. As the name implies, the focus of the TCP/IP model is with layer 3 (IP) & 4 (TCP) of the OSI model. This is not to diminish the importance of all the layers, but the other layers are not core to the TCP/IP model. Due to this, layers 5-7 are typically not discussed within the context of Enterprise IT networking and are generally lumped together. For this reason, from a networking perspective, this book only focuses on layers 2-4. However, a later chapter is devoted to important application protocols (layers 5-7) like HTTP/S, TLS/SSL, etc.

The graphic below depicts the encapsulation that occurs at each layer and the standard names used to describe each layer's encapsulated data.
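Encapsulation can also be observed in code. The sketch below assumes the third-party scapy packet-manipulation library (not something the book requires); it stacks an application payload inside a TCP segment, an IP packet, and an Ethernet frame, mirroring the layering in the graphic. The addresses and ports are illustrative only.

```python
# Build an Ethernet frame layer by layer to show encapsulation.
from scapy.all import Ether, IP, TCP, Raw  # third-party: pip install scapy

payload = Raw(load=b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")  # application data
segment = TCP(sport=54321, dport=80) / payload                      # layer 4: segment
packet = IP(src="192.0.2.20", dst="192.0.2.10") / segment           # layer 3: packet
frame = Ether() / packet                                            # layer 2: frame

frame.show()  # prints each layer of the resulting frame, outermost first
```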

 

2.4.1 Layer 2 Networking

While layer 2 networking will come into play at times when troubleshooting Enterprise IT issues, it’s rare. Most problems occur at higher layer protocols where situations are more complex. Still, a brief review of layer 2 in the Enterprise IT environment is warranted. The predominant layer 2 protocol in Enterprise IT shops is Ethernet and as such, that will be the focus of this section and any other layer 2 discussions.

MAC Addressing

First, it’s important to understand that the addressing for layer 2 is based on MAC (media access control) addresses. MAC addresses are made up of 6

 

pairs of hex numbers (e.g. 00:A0:D2:39:4D:74), which are typically assigned to the network interface controller (NIC) of any device on the network. However, MAC addresses can be changed by the operating system to any value. There’s a lot that we could discuss just on the topic of MAC addresses, but the key here is that they need to be unique only within their local network segment and not necessarily globally.

Duplex Communications

The next key concept of layer 2 communications that is relevant to Enterprise IT environments is duplex communications. There are two types of duplex communications: half-duplex and full-duplex. Half-duplex is like a walkie-talkie in that only one device can transmit at a time, while full-duplex is more like a telephone call where both parties can communicate simultaneously. In most modern Ethernet networks, all devices auto-negotiate their duplex communication mode to full-duplex and thus it’s rarely a problem. Despite this, a misconfiguration or a bug can still occur whereby the NIC and the switch port are not both talking full-duplex. This is a difficult problem to detect that presents itself as inferior performance for the device in question.

Collision & Broadcast Domains

Another key concept in layer 2 communications is the idea of a “collision domain”. Ethernet connectivity started with a simple device called an Ethernet hub whereby devices plug into a shared wire and listen for activity. If activity arrives that is meant for the device (e.g. the destination MAC address matches their MAC address), then the device retrieves the data. Whenever a device wants to send a message, it senses whether there was any

incoming traffic and if not, sends the message. The problem is that between the time the device senses for incoming traffic and the message is sent, another device could have sent a message causing a collision that would corrupt both messages. When these collisions occur, both devices randomly calculate when to try to send a message again and then retry. This continues until the message is sent without a collision. This collision issue greatly limited the concurrent transmittable capacity of devices on the same hub. Ethernet hubs are now rare in an Enterprise IT environment, but still common for home usage or within smaller businesses. The solution to this concurrency issue was the Ethernet switch. The Ethernet switch (as opposed to a hub) is a more intelligent device that does not

 

transmit data across a shared wire. Instead, an Ethernet switch only sends data to the port to which the destination MAC address belongs. It is said that Ethernet switches end collision domains. This leads us to a few questions:

1)  How do Ethernet switches know which MAC addresses are attached to which ports? Well, each switch creates an in-memory MAC address table over time. As it begins to see packets come from a port, it associates the “from” MAC address with the port that it came from.

2)  What if the switch receives an Ethernet frame for a destination MAC address that it does not yet know? In this case, the switch will simulate a hub in that it will send the frame to all of its ports. This is called a broadcast, and Ethernet switches do not end broadcast domains.

VLANs

The last critical concept to understand as it relates to layer 2 networking is the

idea of virtual LANs or VLANs. When Ethernet switches were first created, the “broadcast domain” (or LAN – local area network) was all the ports attached to the switch as it was simulating an Ethernet hub and simply eliminating the collisions. However, within an Enterprise IT environment, this was very limiting as you had to connect all of the servers or edge devices associated with the same broadcast domain to the same switch. So, the concept of VLANs was created. A VLAN decouples the binding between Ethernet switch and “broadcast domain”. Thus, the broadcast domain is no longer confined to the physical switch. Instead, devices across several switches can all be considered part of the same VLAN. Conversely, devices on the same switch could be a part of different broadcast domains. This further allows for Enterprise IT shops to dynamically place devices into certain restricted VLANs. For example, if someone with a non-company laptop connects to a network port within an enterprise, the switch can detect that it’s not a company-controlled device and place it in a VLAN that only has access to the Internet (e.g. a guest VLAN) and no access to any internal devices, which are connected to different VLANs. Devices are not allowed to communicate with a device on another VLAN without first going through a layer 3 device as they need to change from one network segment to another. Below is a graphic depicting a couple of switches with multiple VLANs. In this example, only the devices on the same VLAN, even though they are on
different switches, can communicate with each other without going through a layer 3 device (e.g. a router).

Question 1 - Do switches know the IP addresses of the devices connected to them?

2.4.2 Layer 3 Networking

Layer 3 networking gets a lot more complicated than layer 2 and thus, unsurprisingly, comes up more often with Enterprise IT problems. While Ethernet switches are the primary devices involved with layer 2, routers and firewalls are the primary devices that manage layer 3 communications. Having said that, within Enterprise IT, large network core switches perform both layer 2 & layer 3 communications, and every operating system contains a local routing table and can act like a router, further complicating the situation. The bottom line is that many devices are involved in layer 3 communications.
IP Addressing
While layer 2 deals with MAC addresses that only need to be unique within their local segment, layer 3 deals with IP addresses that need to be unique within their network. While layer 2 network segments will rarely be more than a few hundred devices, layer 3 networks can be as small as your home with a few devices or as large as an enterprise with thousands of devices, or even as large as the Internet with billions of devices. So, the uniqueness of an IP address can get complicated.

On top of that, there are two types of IP addresses (IPv4 and IPv6). IPv4 utilizes 4 octets (8 bits each, or values 0-255) displayed in decimal format with dots in between, such as 131.94.130.43. This provides a total of 32 bits, or roughly 4 billion possible values. This may seem like a lot, but by the early 1990s, as the Internet exploded, it was evident that we would run out of IP addresses
very soon. So, IPv6 was developed and completed in 1998. IPv6 uses 8 sections of 4 hex numbers (e.g. FE80:CD00:0000:0CDE:1257:0000:211E:729C) or 128-bit addressing, which allows for far more addressing than the world would ever need. The problem was that switching to IPv6 was costly and complicated. So, the industry kept inventing ways to make IPv4 last longer. We've been so successful that IPv4 is still the dominant IP addressing scheme within not only Enterprise IT organizations but also cloud computing environments. IPv6 is used heavily by mobile devices and embedded systems like security systems, thermostats, refrigerators, etc., but not within Enterprise IT. IPv6 is even rare within home networks. Network providers allow IPv6 devices to seamlessly communicate with IPv4 devices through tunneling (discussed in a later chapter), further reducing the need to eliminate IPv4. Because IPv6 is still rare within Enterprise IT, we will not be discussing it any further in this book.
Subnetting
The next topic of interest with layer 3 networking is the concept of subnetting. Subnetting is the practice of grouping IP addresses that begin with the same numbers (i.e. an IP address range). There are two ways to denote a subnet as follows:
1)  CIDR (Classless Inter-Domain Routing) notation – Looks like 131.94.1.0/24, where the number after the "/" (1 to 32) denotes the number of bits in the address that are a part of the subnet. In this case, any address starting with 131.94.1. would be a part of this subnet.
2)  Subnet Mask or netmask – Looks like 131.94.1.0 255.255.255.0, where the second set of numbers (255.255.255.0) indicates which bits of the IP address make up the subnet. As with the CIDR notation example, this indicates that the first 24 bits define the IP address range.
Both of these notations are used interchangeably (a short example of working with them follows the list of uses below). However, CIDR notation is the most common method of denoting subnets in firewalls and routers, while netmasks are more common on host devices (PCs & Servers).

 

Subnetting is used for several purposes. The following are the most common uses, but not a comprehensive list:
1)  For hosts, your local "subnet" is the range of IP addresses that you can speak to directly with a MAC address (e.g. those devices within your VLAN). If you wish to communicate with an IP address outside of your subnet, then you need to go to a layer 3 device, most often your local gateway or local router.
2)  Subnets are often used in firewall rules to control access, and the subnets can be grouped into a larger subnet if it makes sense to do so for the rule.
3)  Subnets are often used in route tables on hosts, routers, or firewalls to help the device determine where next to send a packet to eventually get to its destination.
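As a quick, hands-on illustration of the notation above, the short Python sketch below uses the standard library's ipaddress module (the addresses shown are arbitrary examples, not ones taken from any particular enterprise) to confirm that the CIDR and netmask forms describe the same range and to test whether a host address falls inside a subnet:

    import ipaddress

    # The same subnet written in CIDR notation and with a netmask.
    net_cidr = ipaddress.ip_network("131.94.1.0/24")
    net_mask = ipaddress.ip_network("131.94.1.0/255.255.255.0")
    print(net_cidr == net_mask)                              # True - same 256-address range

    # Membership test: is a given host address part of the subnet?
    print(ipaddress.ip_address("131.94.1.43") in net_cidr)   # True
    print(ipaddress.ip_address("131.94.2.43") in net_cidr)   # False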

COMMANDS: To find the IP address(es) and MAC address(es) assigned to a Windows machine, there are several options, but the most comprehensive is issuing the command "ipconfig /all". This also provides your DNS servers (discussed later in this chapter), default gateway, whether DHCP is enabled, etc. A similar, but less comprehensive, command on Linux or MAC would be "ifconfig".
The below is a graphic depicting which network devices continue or end collision and broadcast domains. Note that while switches do not end broadcast domains within a VLAN/subnet (broadcasts do not traverse VLANs), routers and firewalls do as they are considered layer 3 devices. This is why you need to go through a router (your default gateway) to communicate outside your subnet.

 

 Reserved IP Addresses

Another layer 3 networking topic of interest is "reserved IP addresses". There are several ranges of IP addresses that are segregated for special purposes. The full list is available at https://en.wikipedia.org/wiki/Reserved_IP_addresses, but below are a few of special note:
1)  Loopback IP or "localhost" – 127.0.0.1 – The entire 127.0.0.0/8 range is reserved, but this specific IP address has become the standard for a host to communicate back to itself. When using this IP address, the request does not go out to the network but instead is sent back into your local machine's IP stack "as if" it had come from outside. This is useful for testing.
2)  Private subnets – 10.0.0.0/8 (10.x.x.x), 172.16.0.0/12 (172.16.x.x – 172.31.x.x), 192.168.0.0/16 (192.168.x.x), others – Private subnets are not routable on the Internet. You cannot communicate directly with another device on the Internet when using a private IP address. You would need to go through a router that would change your IP to a public IP for you to be able to communicate on the Internet (more on this later in this chapter). While there are other reserved private IP address ranges, the ones denoted above are typically the only ones used within an enterprise or even a local residence.
3)  Multicast – 224.0.0.0/4 (224.x.x.x-239.x.x.x) – Multicast is the concept that you send packets to a group of IPs that have subscribed to a multicast group. Multicast is rare within an Enterprise IT environment
and thus will not be discussed further.
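The reserved ranges above can also be recognized programmatically. The minimal Python sketch below (an illustration only; the sample addresses are arbitrary) uses the standard library's ipaddress module to classify a few addresses:

    import ipaddress

    for ip in ["127.0.0.1", "10.20.30.40", "192.168.1.5", "224.0.0.251", "131.94.130.43"]:
        addr = ipaddress.ip_address(ip)
        # The flags reflect the reserved ranges described above.
        print(ip, "loopback:", addr.is_loopback, "private:", addr.is_private, "multicast:", addr.is_multicast)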

Routing
The most complex aspect of layer 3 networking is network routing. There are multiple theories and many books have been written on this topic. For this book, we will try to keep it simple. All devices that communicate using TCP/IP have a "routing table" defined. These routing tables let your local IP stack know where to send a packet next based on its destination. For example, below is a sample of a simple routing table and what each entry means:
-  Route anything else to 192.168.0.1
-  Route packets for 127.x.x.x to 127.0.0.1
-  Route packets for 192.168.0.x to 192.168.0.100
-  Route packets for 192.168.0.100 to 127.0.0.1
-  Route packets for 192.168.0.1 to 192.168.0.100
This may seem a bit confusing, but it's simple once you understand some basic principles:
1)  This is looked at for each packet and is driven by the destination IP the packet is trying to get to.
2)  The more specific the "network destination" (based on the netmask), the more priority it gets. So, the 0.0.0.0 entry gets the least priority and the last 2 entries get the highest priority.
3)  You can have duplicate "network destination" entries, in which case the "metric" determines the priority. Typically, the lower the metric (sometimes called cost), the higher the priority.
Here are some additional items to help you understand this routing table:
-  In this particular case, the IP address of the host is 192.168.0.100. This is why there is an entry stating that if the destination IP is 192.168.0.100, it should be sent to the loopback IP.
-  The subnet of this VLAN is 192.168.0.0/24 or 192.168.0.x. This is why there is an entry stating that 192.168.0.x should be sent to 192.168.0.100 as this would communicate out to the VLAN.
-  There is only one way for this host to communicate to anything outside of its VLAN and that is to route the packet to 192.168.0.1, which is its default gateway/router.
This routing table is simple, but on network routers and firewalls, routing tables can get very complex. The principles remain the same, but the number of entries and interfaces are much more abundant. The way these routing tables get built is an even more complicated topic as there are multiple routing protocols to make this happen such as BGP, OSPF, IGP, RIP, EIGRP (Cisco proprietary), etc. and sometimes an enterprise uses a mix of them. Discussing these is outside of the scope of this book.
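To show how such a table is consulted, here is a simplified Python sketch of the longest-prefix-match behavior described in principle #2 (a model for illustration only – real IP stacks also weigh metrics, interface state, and more):

    import ipaddress

    # (network destination, next hop) entries mirroring the sample table above
    routes = [
        ("0.0.0.0/0",        "192.168.0.1"),    # default route
        ("127.0.0.0/8",      "127.0.0.1"),
        ("192.168.0.0/24",   "192.168.0.100"),
        ("192.168.0.100/32", "127.0.0.1"),
        ("192.168.0.1/32",   "192.168.0.100"),
    ]

    def next_hop(destination):
        dest = ipaddress.ip_address(destination)
        # Keep only routes whose network contains the destination,
        # then pick the most specific one (longest prefix wins).
        matches = [(ipaddress.ip_network(net), hop) for net, hop in routes
                   if dest in ipaddress.ip_network(net)]
        return max(matches, key=lambda m: m[0].prefixlen)[1]

    print(next_hop("8.8.8.8"))         # 192.168.0.1  (default gateway)
    print(next_hop("192.168.0.55"))    # 192.168.0.100 (out to the local VLAN)
    print(next_hop("192.168.0.1"))     # 192.168.0.100 (most specific /32 entry wins)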

COMMANDS: To display the local host routing table on a Windows machine, use the “route print” command. On a Linux machine, it would be the “route” command.

 

Address Resolution Protocol (ARP)
ARP translates IP addresses to MAC addresses. ARP is part of the IPv4 protocol suite (IPv6 uses a different, but similar protocol called NDP – Neighbor Discovery Protocol). As mentioned earlier, when communicating to a device within your subnet or VLAN, communications occur directly with
MAC addressing as there is no need to route through a layer 3 device. To avoid constantly asking for the MAC addresses of the devices within your subnet, these translations are cached on each host. MAC addresses can be obtained in two ways:
1)  Broadcast ARP Request – Whenever a host wants to communicate with an IP within its subnet and it does not yet have the MAC address for that IP cached in its ARP table, it sends a broadcast ARP request. This request is propagated by the switch to all devices within the subnet (i.e. broadcast domain). This ARP request is ignored by all devices except the one that owns the IP address on the request. This device replies with its MAC address.
2)  Gratuitous ARP – A device can simply broadcast its IP address and associated MAC address to all devices on the subnet. This is a common method for failing over a service from one device to another and is covered further later in the book.
Windows, Linux, and MAC OS all contain an arp command. Simply type "arp -a" from a command prompt to view the contents of your system's ARP cache.
2.4.3 Layer 4 Networking

Layer 4 networking (the transport layer) provides a transition from simply getting packets to their destination host (layers 1-3) to getting messages to their destination process on the host. As such, layer 4 networking is heavily involved in troubleshooting Enterprise IT problems. This section will focus on the major aspects of layer 4 networking.
Layer 4 Protocols
There are many layer 4 protocols, but within an enterprise, the vast majority of traffic will utilize either the TCP or UDP layer 4 protocols. As a result, these are the protocols that we will focus on in this book. ICMP is another layer 4 protocol used in Enterprise IT, but mostly by troubleshooting tools like ping and traceroute, so we will briefly cover it in a later chapter.

 

Ports
While layer 2 relies on MAC addresses and layer 3 relies on IP addresses, layer 4 relies on ports. A port is local to the host and is opened by a running program (or process) on that host. Thus, when a packet is sent to a port, it is being sent to a specific process. Each IP address on a host can utilize port
numbers 0-65535. Here are some key facts about these port numbers:

➢  When a process opens a port, the layer 4 protocol (e.g. TCP or UDP) must be provided in addition to the port number.
➢  Only one process (running program) on a machine can utilize a specific port, IP address & protocol combination.
➢  Ports below 1024 (1-1023) are reserved for system (root/administrator) processes and are usually used for "well-known" services; applications usually bind to these ports in passive mode (LISTENING), waiting for active connections from clients.
➢  Ports from 1025 to 49151 are considered "registered" ports that can be reserved at the request of an organization and published by IANA for use by servers.
➢  Ports from 49152 up to 65535 are called ephemeral ports and are used as a pool for the OS to use to make client connections to servers; the client needs to know the server-side port, but the local client port is a random ephemeral port.
In summary, a unique layer 4 conversation involves a 5-tuple set of values:
1)  Layer 4 protocol
2)  Source IP address
3)  Source port
4)  Destination IP address
5)  Destination port
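To make the 5-tuple concrete, the short Python sketch below (an illustration only, using the loopback address; it is not tied to any particular application) opens a listening TCP socket on a port chosen by the OS, connects to it, and prints both ends of the resulting conversation. Note how the client side is assigned a random ephemeral port:

    import socket

    # Server side: bind to an OS-chosen port on the loopback address and listen.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))          # port 0 lets the OS pick a free port
    server.listen(1)
    server_ip, server_port = server.getsockname()

    # Client side: connect; the OS assigns a random ephemeral source port.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect((server_ip, server_port))
    conn, _ = server.accept()

    # The 5-tuple: protocol (TCP), source IP/port, destination IP/port.
    print("TCP", client.getsockname(), "->", client.getpeername())

    client.close(); conn.close(); server.close()

While this sketch is running, the same conversation would show up in the netstat output described next.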

COMMANDS: To view the conversations on your PC or most Linux machines (like MAC OS), you can use the netstat command. If you use "netstat -apno", you can also see the associated process (identified by the PID) on your localhost that's involved with this conversation. Any output line that states "ESTABLISHED" indicates an active conversation between
your host and another host, while any output line that states “LISTENING” means that your local process is waiting for other systems to initiate a conversation with them. Question 2 – In a conversation, how do you know which is the client (the
host that initiated the conversation) and the server (the host that was listening)?
TCP Protocol
The most popular layer 4 protocol by far is TCP (Transmission Control Protocol). TCP is a connection-oriented protocol that also provides flow control, reliability, and virtual circuit (packet ordering) capabilities so that the application can simply focus on reading/writing messages to the other application it's communicating with. A TCP conversation starts with a 3-packet handshake initiated by the client and can be ended by either the client or the server. Because of the reliability features that TCP provides, it's used by
most applications such as HTTP (web), FTP, SSH, SMTP (email), databases, etc. We will perform a deeper dive into TCP in a later chapter.
UDP Protocol
The second most popular layer 4 protocol, and the only other one of interest within this book, is UDP (User Datagram Protocol). UDP provides no services to the application layer; applications need to order packets themselves (if desired) and have no guarantee of packet delivery. So, why would any application want to use UDP? While there are exceptions (e.g. NFS v2), there are two common reasons for using UDP:
1)  Realtime communications – In real-time communication, there's no reason to wait for a packet to be properly ordered because it's useless if it's late. Video and voice conversations are prime examples. It's better to get a pixelated video or jitter on a call than to have the communications freeze.
2)  Single request/packet applications – If an application only needs to have one packet returned and it can come from several responders, then TCP's overhead makes no sense. DNS and DHCP, discussed next, are examples of this.
Application Layers

As mentioned earlier, OSI network layers 5-7 are not as easy to identify because they are not as standardized by the TCP/IP model. Many applications
only use layer 7, while others may use two or all three. Due to this, from a network perspective, typically only layers 2-4 are discussed. However, some application protocols (layers 5-7) will be covered in a later chapter.
2.4.4 NAC, DHCP, DNS, NAT & PAT
There are many other important networking concepts applicable within the Enterprise IT space. For now, we will focus on reviewing just a few of them, namely NAC, DHCP, DNS, NAT, and PAT. All of these are heavily used within the Enterprise IT space, so they are essential to understand at least at a high level. Having said that, we will only deep dive into DNS later in this book.
NAC
NAC (Network Access Control) is considered a layer 2 security mechanism. NAC allows network administrators to place a device that connects to the
enterprise within a specific VLAN depending on the device's attributes. The most common attribute looked for is whether the device is a corporate-managed device or not. Most enterprises do not allow non-corporate-managed devices access to their internal network, even when plugged into a switch port within one of the company's buildings. Thus, when such a device connects to the network, the NAC solution detects this and automatically places the device in a "guest network" that only has access to the Internet. While NAC is an interesting topic, a deeper dive is considered outside of this book's scope.
DHCP

DHCP (Dynamic Host Configuration Protocol) is used to automatically provide a host with an IP address and other relevant network connection information. DHCP is typically only used for edge devices (e.g. PCs, VoIP phones, IoT devices within a home, etc.). Servers typically have static IP addresses and thus do not need DHCP to get an IP address or other info. DHCP kicks in after NAC has determined which VLAN a device should reside in. At that point, the device sends a DHCP broadcast on the VLAN to request an IP address and the DHCP server responds with an IP and other relevant information like DNS server(s), DNS search suffixes (DNS is covered next), default gateway, etc. DHCP addresses are considered “leased” to the host and thus can change
over time. The lease period is typically a week or two, so if a PC connects to the network often, its IP address may remain the same for an extended period. The DHCP server on any VLAN is typically the default gateway/router device. While DHCP is also an interesting topic, a deeper dive is considered outside the scope of this book.
DNS
DNS (Domain Name Services) is used to convert English names (e.g. www.fiu.edu) to IP addresses. For a client device to speak to a server, it needs to know the server's IP address. However, IP addresses are difficult to remember and easy to mistake, so DNS was created to bridge the gap and allow humans and applications to use names rather than IP addresses to initiate communications. DNS provides an additional benefit in that it provides a layer of virtualization on top of IP addressing. In this way, you can
have multiple IP addresses in very different locations that provide the same service because they all respond to the same DNS name. DNS names are hierarchical with periods “.” separating the levels. Name searches start on the right-hand side of the DNS name and move left. For example, “.edu” is considered the top-level domain for www.fiu.edu. The left-most level is the hostname (e.g. “www” in the above example) and the rest is called the domain (e.g. fiu.edu). Domains are purchased by organizations for use on the Internet, but there are also internal domains that are used within the enterprise only. For example, server1.cead.prd, where “cead.prd” is only resolvable within the enterprise. Authoritative DNS servers maintain the master copy of the database of entries with multiple types of records (NS, A, CNAME, MX, PTR, SOA, etc.). All DNS server entries have a TTL (Time To Live) to denote how long other DNS servers should cache the entry and consider it valid. DNS is a complicated topic and we will be discussing it further in a future chapter as it is critical to troubleshooting issues within an enterprise environment. COMMANDS: To find the IP address returned for a DNS name, you can use
the nslookup command on both Windows and Linux/MAC OS. The nslookup command can take several parameters and we will look into this further in a
later chapter, but the simplest use is "nslookup DNSname", where DNSname is replaced with the name you wish to resolve (e.g. "nslookup www.fiu.edu").
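Name resolution can also be exercised from code. The following minimal Python sketch (it assumes the machine has working DNS and network access; www.fiu.edu is simply the example name used above) asks the operating system's resolver for the addresses behind a name, much like nslookup does:

    import socket

    # Resolve a DNS name to its IP address(es), similar to what nslookup reports.
    for family, _, _, _, sockaddr in socket.getaddrinfo("www.fiu.edu", 443, proto=socket.IPPROTO_TCP):
        print(sockaddr[0])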

NAT & PAT
NAT (Network Address Translation) is used extensively in all enterprises and thus understanding it is important when dealing with problems. There are several types of NAT implementations, but the most common in enterprises (as well as home Internet deployments) is a one-to-many implementation, which is also called PAT (Port Address Translation). NAT converts one IP address to another midstream during a conversation. The two hosts involved in the conversation are unaware that this IP address conversion is occurring. There are several reasons for implementing NAT, but the most common are:
1)  Allow many private IP addresses to access Internet resources since private IP addresses are not routable on the Internet. This also allows you to use just a few public IP addresses to have thousands of users access the Internet – remember that public IPv4 addresses are now quite difficult to come by. This use case is the primary reason that IPv4 has survived and IPv6 remains rare within enterprises.
2)  Hide your internal IP address from external parties for security reasons.
3)  Avoid IP conflicts with external partners. For example, suppose you are communicating with a business partner through a private link. In that case, there is a good possibility that they use similar private IP address ranges (e.g., 10.x.x.x) that would conflict with yours. By using a NAT, you hide your true IP address from theirs.
Regardless, the point is that if NAT is involved, the IP address used to connect to your destination resource is not the IP address that the destination resource owns; it's changed along the way. The graphic below represents the most common use case, internal users accessing Internet resources through a PAT.

 

In this graphic, the machines on the Internet are only aware that they are communicating with IP address 124.20.1.3 (known as the "hiding IP address"). They are unaware that this one IP address is representing hundreds or thousands of users behind a NAT. This NAT translation usually occurs at a router or firewall. The router or firewall maintains the table shown above to map all the conversations traversing through it so that it knows what to change as each packet goes through. Note that with PAT, the conventional layer 4 port usage does not apply. The router or firewall will use any port during the translation, so it can support 65,535 active conversations with just one hiding IP address. However, even that's not enough in large enterprises, so a small pool of IP addresses (called a PAT pool) is sometimes needed to support all the conversations from within an enterprise to the Internet. A couple of additional points about NAT. While the above example is to the Internet, NAT is also used for communications to external partners, and in that use case, both sides (your company and the external partner) may end up using NAT. Lastly, NAT is considered a layer 3 function as the router or firewall does not establish a new layer 4 connection with either side of the NAT. The only intent is to change the layer 3 IP headers. The rest of the encapsulated packet is untouched.
COMMANDS: To see NAT in action, perform the following:
-  Find your PC's IP address ("ipconfig" for Windows or "ifconfig | grep inet" for Linux/MAC)
-  Go to ipchicken.com to see what IP address this site thinks it's communicating to.
-  Make sure your smartphone is connected to your WiFi and visit ipchicken.com again. You should see the same IP address noted by ipchicken and it would be different from the IP on your PC.
-  Now, disconnect your smartphone from your WiFi and try again. You should see a different IP address because now your cell phone carrier is performing the NAT instead of your home router, thus it's a different IP.
-  You can also try with whatismyipaddress.com, which gives you IPv6 addresses as well.

2.4.5 Firewalls
Firewalls are also heavily used within an enterprise. As such, firewalls feature prominently in enterprise troubleshooting. Also, there are several
types of firewalls, and firewall functionality has been evolving over the past decade. In this brief section, we will cover the basics of firewalls and what to look out for.
Stateful Firewalls
All firewalls within the enterprise today are stateful firewalls. A stateful firewall maintains a table in memory of all the IP conversations traversing it to be able to determine whether to allow traffic or not. This is important because firewall rules are based on where the traffic is initiated (e.g., inbound or outbound connection attempt). Firewalls are layer 4 aware and thus track all conversations regardless of the layer 4 protocol (e.g. TCP, UDP, etc.). Since the conversation tracking is layer 4 aware, the conversations tracked include the 5-tuple values of a layer 4 conversation (source IP, source port, destination IP, destination port, layer 4 protocol).

For example, at home, you can initiate an outbound conversation with www.fiu.edu and thus the firewall allows packets started from your computer to continue to www.fiu.edu. However, www.fiu.edu needs to reply to your browser. So, how does the firewall know to allow those packets through to your computer? The firewall knows to allow them through because it finds the 5-tuple conversation values in its in-memory table and knows that these packets from www.fiu.edu are in response to the conversation that your computer started. Conversely, if www.fiu.edu were to attempt to send an
unsolicited packet (e.g. not part of any conversation your computer started) to your computer, the firewall will deny it because it is not aware of the conversation and inbound conversations to your computer from the Internet are not allowed. Of course, a firewall cannot maintain all conversations indefinitely within its in-memory table, so if a conversation is gracefully ended or not used over a certain period of time, the entry in the firewall's table is removed. Due to this, many TCP conversations utilize periodic keep-alive packets to avoid such a timeout from occurring. Below is a graphical example where server 170.20.1.6 (using source port 50003) initiated a TCP conversation with an Internet host, 198.98.93.2, on destination port 443 and the firewall recorded the conversation. Thus, when 198.98.93.2 sends a packet back to the firewall from source port 443 to 170.20.1.6:50003, the traffic is allowed. However, if 198.98.93.2:443 initiates a TCP conversation to 170.20.1.6:80 (the standard HTTP port), the firewall rejects it because this is not a conversation it is aware of.

Firewalls and OSI Layers
As mentioned above, there are several types of firewalls and their relationship to the OSI model can be confusing. First, it's important to distinguish between the layers of the data packets that a firewall can inspect vs. the layer at which the firewall itself operates. Most firewalls are layer 4 aware, as with
the above example. This means that the rules used to allow or deny traffic include the IP, port, protocol, and initiation direction (inbound or outbound).

 

However, firewalls themselves are considered either layer 2 or layer 3 devices. A layer 2 firewall is essentially firewall capabilities added to a switch. This is useful because a layer 2 firewall can stop traffic to and from a device within your local subnet. Layer 2 firewalls are not common in the Enterprise IT environment. The more common type of firewall is a layer 3 firewall. Layer 3 firewalls cannot interfere with traffic to or from your local subnet. They can only get in between traffic that crosses to another subnet. Given that subnets are typically small within an Enterprise IT environment, this is not usually a limiting factor. Also, the added complexity of a layer 2 firewall is rarely justified.
Firewall Rules

At the core of any firewall are its rules. Firewall rules are in essence an ordered table that includes 4 of the 5-tuple conversation elements (source IP, destination IP, destination port, layer 4 protocol) and whether to allow or deny the traffic. A key concept is that this table is ordered. As such, when examining whether to allow a packet to go through, the firewall begins to read the table at the top and stops whenever a rule is "hit". If later on in the table there's a rule that would do something different with the packet, it is ignored because the firewall never gets to that entry. Another important firewall rule concept is subnetting. You can create rules for one specific IP or for an entire subnet or even for a bunch of subnets (this applies to both source and destination). As an example, below there's a rule that allows 170.40.0.0/16 access to 170.20.1.10/32 on port 443. This rule is allowing devices in both the 170.40.1.0/24 and 170.40.2.0/24 networks access to 170.20.1.10/32 on port 443. An extreme example of this is the keyword "ANY", which is sometimes displayed as 0.0.0.0/0 or a subnet of all possible IPs. Yet another key concept is the idea of using groups. IP addresses can be placed in a group and then applied to one rule as either source or destination. In this way, you can re-use this group of IP addresses for multiple purposes
 
examples of groups. Both can include specific IPs and/or subnets. Although less common, services (port numbers) can also be grouped for the same purpose. Lastly, as a best practice, all firewalls contain a "deny ANY" as the last rule to indicate that unless a packet explicitly matched an allow rule, it is to be blocked. Many firewalls may have only one deny rule, this last deny ANY rule.
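To illustrate the ordered, first-match-wins evaluation just described, here is a small Python sketch (a simplified model only – it is not any vendor's rule syntax and ignores source ports, protocols, and groups) that walks an ordered rule list ending in a deny-ANY rule:

    import ipaddress

    # Ordered rules: (source subnet, destination subnet, destination port, action)
    rules = [
        ("170.40.0.0/16", "170.20.1.10/32", 443, "allow"),
        ("0.0.0.0/0",     "0.0.0.0/0",      None, "deny"),   # final deny ANY
    ]

    def evaluate(src_ip, dst_ip, dst_port):
        for src_net, dst_net, port, action in rules:
            if (ipaddress.ip_address(src_ip) in ipaddress.ip_network(src_net)
                    and ipaddress.ip_address(dst_ip) in ipaddress.ip_network(dst_net)
                    and (port is None or port == dst_port)):
                return action          # first matching rule wins; later rules are never read
        return "deny"

    print(evaluate("170.40.2.15", "170.20.1.10", 443))   # allow
    print(evaluate("170.40.2.15", "170.20.1.10", 80))    # deny (falls through to deny ANY)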

2.4.6 Network Security Products
Firewalls are not the only network security product that you will find within an Enterprise IT environment; there are plenty more. These products need to be understood at least at a high level to evaluate the potential for them to be involved in an Enterprise IT problem. Having said that, further discussion about these types of products is outside the scope of this book.
Deep Packet Inspection
Many network security products rely on "Deep Packet Inspection" to accomplish their goals. This entails having a device analyze the full body of
all network packets (through layer 7) that traverse it for certain content, typically as it relates to data security (e.g. detecting malware, data exfiltration, common attacks).
Intrusion Detection System (IDS) & Intrusion Prevention System (IPS)

IDS devices use defined rules (typically including layer 5-7 data) to report on potential threats detected. This would include identifying fingerprints (known strings of data that are part of malware) or even malicious patterns of behavior. Because IDS devices only report on what they find, they do not cause issues. However, they may be a good source to help you identify an issue if it's malware related. IPS devices also use defined rules (typically including layer 5-7 data) to detect potential threats, but IPS devices can either just report on the threat like an IDS or actively work to block the threat by dropping packets or resetting TCP connections. Because IPS devices can take action on potential threats, they can also accidentally take action on regular traffic or even slow down regular traffic if they are placed "in-line" as is often the case. Thus, understanding the location and support staff of IPS devices is important in an Enterprise IT environment.
Web Application Firewall (WAF)
WAF devices use defined rules to inspect HTTP traffic for known vulnerabilities and actively prevent them. WAF devices act like a proxy or man-in-the-middle for web traffic to protect the website. Because WAF devices can take action on potential threats, they can also accidentally take
action on acceptable web requests and inadvertently cause problems for web applications. Thus, it's important to understand which web applications reside behind a WAF and which don't, as well as the support staff that can identify when the WAF has taken preventative actions.
Web Content Filtering
Web content filters limit the websites that users are allowed to visit based on the categorization of the site (e.g. type of content) or whether it has been blacklisted due to certain detected behavior. Web content filtering is typically directed at the user community and not at protecting servers. Thus, any issues introduced by web content filters typically impact the end-user devices.

While the blocking of unapproved websites rarely causes a problem, if web content filters become overwhelmed, they can impact the performance of
your users even when accessing legitimate websites.
Network Data Loss Prevention (DLP)
Network DLP devices inspect traffic to stop or at least limit the exfiltration of sensitive information. As with IPS devices, these devices may have rules
implemented to simply report on suspected problems or take actions to actively stop suspicious activity. Network DLP systems are rarely the cause of Enterprise IT problems, but they are another factor to be queried depending on how they were implemented.
Next-Generation Firewalls
Next-generation firewalls include the capabilities of stateful firewalls plus IPS, Web Content Filtering, DLP, malware detection, WAF, and other ways to determine whether to allow/disallow traffic (e.g., userid, user group, application, etc.). These devices have been slowly replacing regular firewalls over the past few years within Enterprise IT environments and will likely
become the new standard in the future. It's important to note that just because a next-generation firewall can perform all of these functions doesn't mean that they are always set up to do so. The more you ask any device to do, the more horsepower it needs, so there are limits and support considerations to putting everything into one device.
2.4.7 Transport Encryption
A great deal of network traffic within an Enterprise IT environment is encrypted. Thus, understanding how encryption works is sometimes critical to determining the root of a problem. While we will dig deeper into TLS/SSL in a later chapter, below are some foundational basics.
Symmetric Encryption
These are algorithms that require both parties involved in the conversation to share the same secret key. You encrypt and decrypt the data with the same key. Examples of symmetric encryption algorithms include DES, TripleDES, AES & RC5. The size of the key determines how hard it is to break the encryption (as of 2020, 128-bit and 256-bit key sizes are common and sufficient). Symmetric encryption performs many times faster than asymmetric encryption and thus is needed to transmit any material amount of
data. downside symmetric encryption is the difficulty in sharing a secretThe without othersofintercepting it. This is resolved with asymmetric

 

encryption.
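As a brief illustration of the shared-key model, the sketch below uses the third-party cryptography package (an assumption – it is not part of the Python standard library and would need to be installed separately). Whoever holds the same key can decrypt, which is exactly the key-distribution problem noted above:

    from cryptography.fernet import Fernet   # third-party package: cryptography

    key = Fernet.generate_key()       # the shared secret both parties must hold
    cipher = Fernet(key)

    token = cipher.encrypt(b"transfer $500 to account 12345")
    print(cipher.decrypt(token))      # the same key decrypts the message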

Asymmetric Encryption
These are algorithms that require a pair of keys to operate. If you encrypt with one key, you can only decrypt with the other and vice versa. Examples of asymmetric algorithms include Diffie-Hellman (DH), RSA & Elliptic Curve (EC). As with symmetric algorithms, the size of the key determines how hard it is to break the encryption (as of 2020, 2048-bit & 3072-bit key sizes are common for DH & RSA, while a 256-bit key size is common for EC). Because asymmetric algorithms have two keys, one can be shared with
the public (your public key) and the other kept private (your private key). In this way, you can securely communicate with others while ensuring that no one else can intercept the transmission. However, because asymmetric algorithms are much more resource-intensive than symmetric algorithms, they are usually only used to share a symmetric key in a protected fashion.

Hashing Algorithms
Hashing algorithms are one-way transformations of data to a set output size regardless of the size of the input. Unlike encryption, you cannot get back to the original data from the hashed output. However, the same input always results in the same output. Multiple inputs can in theory produce the same output (a hash collision), although the larger the hash's output size, the less likely this is to occur. So, if you can't get the original data back, what use is a hashing algorithm? Well, it has many uses, including the storing of passwords in many systems (even if you see the stored hashed password, you don't know the real password), indexing & retrieving items in a database (the hash points to an area on disk), and guaranteeing that data has not been altered (the hash of the original data is passed along with that data so that hashing the
retrieved data should match the hash). In terms of encryption, hashing algorithms are mostly used to ensure that data was not altered. Common hashing algorithms include MD5, SHA-1, and SHA-2. SHA-2 is now the recommended standard. The most typical output size for SHA-2 is 256 bits, but 512 and others are also used.
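A quick Python illustration using the standard library's hashlib module (illustration only): the same input always produces the same fixed-size digest, while even a one-character change produces a completely different value.

    import hashlib

    print(hashlib.sha256(b"Enterprise IT").hexdigest())    # 64 hex characters = 256 bits
    print(hashlib.sha256(b"Enterprise IT!").hexdigest())   # tiny change, unrelated digest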

SSL/TLS
SSL, which was later renamed TLS (although "SSL" is still used colloquially), is an encryption protocol that utilizes all three primary categories of
encryption (asymmetric algorithms, symmetric algorithms, and hash functions) to facilitate widespread use of encryption. We will go into greater detail on how TLS works in a later chapter.
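As a small illustration (assuming outbound HTTPS access is available; www.fiu.edu is just the example host used earlier), the Python standard library can reveal what a TLS connection actually negotiated – the protocol version, the symmetric cipher suite, and the certificate presented by the server:

    import socket, ssl

    hostname = "www.fiu.edu"
    context = ssl.create_default_context()          # uses the system's trusted CA store

    with socket.create_connection((hostname, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            print(tls.version())                    # e.g. TLSv1.2 or TLSv1.3
            print(tls.cipher())                     # negotiated cipher suite
            print(tls.getpeercert()["subject"])     # identity asserted by the certificate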

Certificates
While asymmetric algorithms made it possible for two devices that know nothing about each other to communicate securely, they also introduced a new problem. While this guarantees that the communication will be private, it makes no guarantee that you are communicating with whom you think you are communicating (e.g., how do you know to trust that a website is truly "fiu.edu"?). This is where certificates come into play. All devices "trust" certain certificate authorities (these are typically pre-installed and later updated via security patches). These certificate authorities provide certificates to companies and vet the companies as they do so. In this way, you and the company share a common trust. This will be further discussed in a later chapter.
2.5 Database Basics
Most Enterprise IT applications utilize databases. As such, databases are quite often at the heart of Enterprise IT problems. Understanding database concepts is important to be able to troubleshoot Enterprise IT issues. In this
brief section, we will cover some of the basic concepts important to understanding how databases are used within an Enterprise IT environment.

 

Also, we will dig deeper into databases in a future chapter. Having said that, databases are a complex topic worthy of an entire book, so only a few key concepts related to troubleshooting will be in scope for this book. Database systems provide several services that make them essential for applications, such as the following:
Access optimization – Determining the best path to retrieving requested data. This is particularly important when you have a great deal of data spread across many tables.
Security – User permissions to read or update data can be controlled.
Replication – The ability to automatically copy data to another location/server.
Atomicity – Ensures that each transaction with multiple changes will either succeed or fail as a single unit. The need for this is sometimes unclear, but a prime example is a bank money transfer. The operation includes the application subtracting the dollar amount from one account and adding it to another. If a failure occurs after the first step, you need to make sure the money remains in the original account and isn't simply lost. In other words, the entire transaction must be successful, or the entire transaction must roll back (a brief illustration follows this list).
Consistency – Ensures that the database remains valid according to defined rules. This is another service that may be unclear at first. However, there are many simple examples such as making sure that certain columns remain unique within a table (e.g., account number, SSN, credit card number, etc.), making sure that required values are provided and not left blank/null, making sure that a value is stored in the right format (e.g., a date is entered in a date column, a number is entered in a numeric column, etc.), etc. More complex examples include referential integrity or parent/child relationship enforcement (e.g. an employee cannot be assigned to a department that does not exist).
Isolation – Ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. Again, this service can be confusing on the surface. The concept here is that
even though transactions are executing concurrently, they do not impact each other. This is accomplished by locking rows to avoid transactions inadvertently affecting each other.
Durability – Guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure. This is a simpler service to understand. In essence, it states that even if a database crashes, it will recover as you would expect, with any successful transactions remaining in place. This is accomplished by always writing to a log before writing to the database.
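To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module (a toy, single-file database rather than an enterprise RDBMS, but the commit/rollback behavior mirrors the bank-transfer example above):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    db.executemany("INSERT INTO accounts VALUES (?, ?)", [("checking", 100), ("savings", 0)])
    db.commit()

    try:
        # The debit and the matching credit belong to one transaction.
        db.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'checking'")
        # ... imagine a crash or error happens here, before the credit is applied ...
        raise RuntimeError("simulated failure mid-transfer")
    except Exception:
        db.rollback()                 # the debit is undone; no money is lost

    print(db.execute("SELECT * FROM accounts").fetchall())   # balances unchanged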

2.5.1 Database Performance Considerations
Database performance can be impacted by multiple factors and, given that most database servers are accessed by multiple applications, a problem with a database server can have broad implications. Also, the fact that multiple application servers serving the same application often connect to the same database server concurrently creates the opportunity for conflicts. While we will dig deeper into what factors could impact database performance in a future chapter, the following are the most typical root causes of performance problems experienced with database servers:
☐  Locking – Concurrent access to the database can lead to locking issues if applications are not designed/tested appropriately.
☐  Query Performance – Larger databases can become bottlenecked by poorly executing queries that unnecessarily require excessive I/O; execution plan analysis should help to identify indexes to add or other improvement opportunities.
☐  Lack of Resources – Database performance can dramatically improve with additional RAM as large portions of the database can be cached in memory; CPU resources could also be needed to improve performance.
☐  I/O Performance – Poor I/O performance can greatly impact database performance; use of SSD drives has been a growing trend with databases for this reason.

 

Chapter 2 – Review/Key Questions

1)  What is the primary logical unit of an operating system?
2)  What is the difference between a process and a thread?
3)  What are the advantages of virtualization?
4)  What's the difference between multitasking, multiprocessing, and multithreading?
5)  What's the difference between paging, swapping, and thrashing?
6)  What are the types of storage available to Enterprise IT?
7)  How are filesystems similar between Linux and Windows?
8)  What's the difference between a framework and a platform?
9)  What is instrumentation used for?
10)  What do DNS, DHCP, ARP and NAT stand for and what do these technologies do?
11)  Which OSI layer works with MAC addresses? Which OSI layer works with IP addresses? Which OSI layer works with ports?
12)  What's the difference between symmetric and asymmetric encryption algorithms?
13)  What's a call stack and at what level does it exist?
14)  What are the performance considerations for RDBMS databases?

 

Chapter 3 – ITSM & Troubleshooting Best Practices

In this chapter, we will briefly discuss ITSM and ITIL, and more deeply two disciplines within ITIL called Incident Management and Problem Management. After that, we will cover a set of goals, principles, and steps that are recommended for all troubleshooting efforts. Then, we will discuss identifying when you have limitations during a troubleshooting effort that you need to address, and finally provide a brief discussion on the role of application development and infrastructure personnel when troubleshooting Enterprise IT problems.
3.1 ITSM & ITIL
ITSM (Information Technology Service Management) refers to the entirety of activities that are performed by an organization to design, plan, deliver, operate, and control information technology (IT) services. This is
considered a "process approach" towards IT management. ITSM is a general concept and needs to be made more concrete through a documented standard or framework. Many standards and frameworks can be used as a reference for ITSM, but the most common is ITIL. ITIL (Information Technology Infrastructure Library) was developed in the UK in the 1980s. ITIL is a set of detailed practices for IT service management (ITSM) that focuses on aligning IT services with the needs of the business. ITIL is a large topic that is out of scope for this book; however, a small portion of ITIL is directly relevant to Enterprise IT Troubleshooting, specifically "Incident Management" and "Problem Management". Some other topics like "Change Management" (the documenting of all changes made to the environment) and "Configuration Management" (the documenting of all IT components in the environment) are tangentially relevant to IT Troubleshooting, so we mention them briefly later in the chapter, but do not focus on them directly.
Incident Management
Incident Management is the discipline around dealing with incidents that are currently affecting normal business operations. The primary goal of Incident Management is to restore normal service operation as quickly as possible and
to minimize the impact on business operations. The restoration of service must be per your Service Level Agreements (e.g. the systems are not only
 

available but also as responsive as expected). MTTR (Mean Time to Repair) is a key metric tracked during Incident Management as you want to minimize it over time (e.g. get better at detecting root cause and correcting it). Incidents are usually categorized by impact to the business and thus priority to have it low resolved. wouldhow be P1, P2,want P3, P4 or critical, medium, – eachCategorizations organization decides they to name them.high, Typically, P1 (highest priority) and often P2 incidents call for the assembly of a cross-section of technical resources led by an incident manager. Such a team would include Application Development personnel, various infrastructure personnel (e.g., Network, Server, Storage, Database, Web, etc.), and even business-side personnel – whatever is needed to resolve the incident as quickly as possible. Incidents could even include non-service interrupting issues (lowest priority) such as the failure of a redundant component (e.g. redundant disk drive or server) whereby the business is not impacted, but at risk was another failure to occur. Low priority incidents could also include individual personnel issues (e.g. PC problems). This book will focus on P1/P2 type issues that impact multiple personnel and/or systems  – we will not focus on single-user PC type issues, although best practices still still apply for those Problem Management  Problem Management is the discipline that involves identifying the root cause of incidents after the fact and developing ways to avoid or mitigate the recurrence of incidents. The primary objectives are to prevent incidents from happening, to eliminate recurring incidents, and to minimize the impact of

incidents that cannot be prevented. As such, problems are related to incidents. Often one problem record is related to multiple incident records. Not all incidents need to be followed up by a problem record, but most should. Reasons for creating problem records include:
1)  Incidents for which no root cause is identified (e.g. a server reboot got users working again but no one knows why). This is needed to follow up and avoid recurrence.
2)  Incidents that have occurred multiple times without known cause and thus at risk of recurring (e.g. a network connection failing sporadically).
3)  Incidents where the root cause was determined, but that have only occurred in one location, and yet the condition is present elsewhere. This is so that you can proactively correct other locations and prevent
future incidents (e.g. rolling out a patch to other servers). 4)  Incidents where the root cause was determined, but detection of the incident or corrective action (MTTR) took too long. A problem record is needed to mitigate future similar issues by catching them sooner or correcting them quicker.

To the right is a graphical view of the typical incident and problem management processes:

 

3.2 Troubleshooting Goals
Per the ITIL disciplines of Incident & Problem Management, it is important to remember what we are trying to achieve as we troubleshoot issues. To that end, the below troubleshooting goals are provided in order of importance/urgency, but all should be considered.

1)  Get user(s) working again – As you work through an incident, it's critical that getting business operations back to normal is paramount in the short term. Thus, if a viable workaround is available, it should be implemented immediately. There are many possibilities, but here are a few: 1) if only one server in a cluster is having problems, shut it down; 2) if rebooting a server or service may get users working, then do so; 3) if the users or service can be moved to another server or location quickly, then move them; 4) if providing limited functionality (e.g. some functions do not work, but the majority do) is an option, then provide it. The ability to do some work is better than allowing the complete productivity loss to persist. The point is to minimize the business impact as much as possible. This is the primary goal of incident management.
2)  Save evidence for root cause determination – While getting user(s) working again is crucial, it needs to be balanced with the ability to save evidence to determine root cause (e.g. save logs, create a memory dump). With recurring issues, this is particularly critical as in the long term productivity will be greatly impacted by recurrences, offsetting any value of getting the user working again quickly. While this goal serves problem management, it needs to occur during incident management.
3)  Determine root cause – Don't mistake symptoms for the root cause. Just because you reboot and the problem temporarily goes away, it does not mean that you solved the problem. Without understanding the root cause of an issue thoroughly, there is no way to put in place measures to avoid the problem in the future. This goal is more aligned with problem management.
4)  Take action to avoid future occurrences – Once the root cause is determined, re-occurrences can and should be avoided; not just for the particular situation of the given problem, but also for any related
situations (e.g., a problem that affected one server/database/location that has not yet affected others, but may – proactive measures should be taken to avoid those future possible issues). This goal is at the heart of problem management.

Anecdotal Story – Recycle it and forget it
For the second time in two days, a worker process on an application server in a cluster of 3 started taking up 100% CPU, causing that application to become unresponsive and even negatively impacting other applications on that server. Because the application impacted is highly sensitive, the Web Hosting engineer immediately killed and restarted the worker process taking
up all the CPU. Everything returned to normal. Was this the proper action to take? Why or why not? At first glance, this appears to be a perfectly reasonable action to take. It aligns with the primary goal of incident management and troubleshooting goal #1 (get users working again). However, since this has now occurred twice, it is likely to continue to recur. Thus, while recycling immediately achieves the short-term goal of getting users working again, it does not achieve the goal of minimizing business impact because recurrences will persist. The best action to take in this situation is more subtle. First, in order to comply with troubleshooting goal #1, the engineer could simply take the troubled server out of the load-balanced cluster of 3 without touching the server itself. This allows for a deeper investigation of the problem, including calling the developer, perhaps vendor support (e.g. Microsoft if this was a Microsoft-based application), and running diagnostics (e.g. taking memory dumps to capture which modules are taking up the CPU). This is a more thoughtful approach that would provide an opportunity to determine the root cause and avoid recurrences.

During the troubleshooting process, it is easy to get off-track and fall into

 
common pitfalls. As a result, we developed these principles that should be adhered to in order to move the process along rapidly and effectively. These are in order of applicability and importance as some principles only apply to certain situations.
1)  Do not make things worse – The criticality of the situation must guide you in terms of the risks that should be taken and when. If systems are partially working or the change required to fix a problem could impact more critical and as of yet unaffected systems, then risky maneuvers should be avoided until a more opportune time. For example, if only a handful of users out of hundreds are not able to work and the fix could impact everyone, then it may be wiser to wait until after all the users are done for the day. Conversely, if the most critical systems are completely down anyway, then performing risky maneuvers is much more acceptable.
2)  Ensure clarity of symptoms/evidence/actions – Miscommunication about the symptom of a problem or the nature of certain evidence or actions can derail a troubleshooting situation for a lengthy period of time. Share screens or at least screenshots so that the group can all see the same thing. Also share error log text, names of servers, etc. so that miscommunication is kept to a minimum.
3)  Suppress any assumptions – Assumptions come in many forms. All conclusions must be confirmed. Some example assumptions that can be problematic:
-  Assuming something is NOT a problem. Do not eliminate any potential failure points without proving they are not the cause. For example, if you cannot connect to a process, do not assume that the process is up and running just because you have not received an alarm. Verify that the process is running and that it is listening for work.
-  Assuming that something was already checked. If you have not received confirmation that a particular item was checked, confirm it. For example, do not assume that the monitoring system or error logs were checked.
-  Assuming the skillset or thoroughness of an engineer. Until confirmed, question all engineering staff about the information needed.

 

4)  Remember Occam's Razor – William of Occam was a Franciscan friar and philosopher who lived around 1300 AD. Occam's Razor – how to "shave" a problem – says that the simplest conjecture, the one with the fewest assumptions and variables, is probably the right one. Don't move on to unlikely causes for a problem until the simple possibilities are eliminated.
5)  Use reductive problem isolation techniques – Perform experiments to eliminate large potential problem areas. This is particularly important when there are many possibilities as to where a problem could be occurring. For example, a failing or slow-performing website could traverse several network links, firewalls, routers, switches, load balancers, etc. If you test this website locally on one of the web servers and the error or slow performance persists, then the entire path from the user(s) to the webserver(s) can be eliminated as part of the problem. Focus can then be on the webserver(s) and any components the webserver(s) communicate with (e.g., a backend database).
6)  Only change one variable at a time – While there is a tendency to make many changes at once to quickly resolve a problem, identifying the root cause could be a serious challenge if this principle is not obeyed. If a problem is resolved after you made multiple changes, how do you know which one corrected the issue? This leaves you at a disadvantage in terms of avoiding such problems in the future or even knowing how to mitigate recurrence.
7)  Leadership/engagement needed with larger teams – If your troubleshooting effort is being addressed by a large team (more than 3), then someone should be identified as leading the effort to ensure that someone is driving decisions. Also, the larger the group, the more important that all personnel remain fully engaged and provide theories – you never know where the answer will come from. Theories should be encouraged, even if discarded quickly.
8)  If actions were working before, what changed? – Identifying the precise moment when something "stopped" working is important, especially if your organization keeps a history of changes (change management), which is a best practice. Identifying potential changes that caused this new behavior offers a viable shortcut to solving the problem. For example, if a code change was recently deployed, then simply backing out that code change may be the best (and perhaps only) way
8)  If things were working before, what changed? – Identifying the precise moment when something "stopped" working is important, especially if your organization keeps a history of changes (change management), which is a best practice. Identifying a change that caused the new behavior offers a viable shortcut to solving the problem. For example, if a code change was recently deployed, then simply backing out that code change may be the best (and perhaps only) way to resolve the issue.
9)  If this works in a different environment, what's different? – If a similar environment is working, then a comparison of the two environments provides a potential shortcut to solving the problem. For example, if this is working in your test or staging environment, what's different? If this is working at another location, are there any differences?
10)  With difficult, long-lasting issues, document your steps – While not always necessary, once you realize that the problem is not straightforward, documenting your steps will keep others from repeating the same actions and wasting valuable time.

Anecdotal Story – Jumping the gun?
Two mid-sized clients (out of hundreds) call in that they cannot reach one of your applications over the Internet.

You find out that they both share the same cloud provider. Further, after analysis, it is determined that their cloud provider made a change on their end that no longer allows name resolution from DNS servers that do not support EDNS (Extension Mechanisms for DNS). Your current DNS servers are not set up to support EDNS. A workaround is put in place by your clients to use host file entries, and this gets them working again. Also, you have fairly new DNS servers already in place, and you schedule a move of the websites that these clients depend on to the new DNS servers that evening. More investigation reveals that today is DNS Flag Day – a grassroots movement by the IT industry to push for DNS servers to require EDNS support or be rejected (just like what happened with this cloud provider). Since migrating all websites to the new DNS servers will take weeks, if not months, there is now a concern that there will be a flood of these problems over the next few days as other providers do the same as this one. To keep other clients from having the same problem, a quick attempt is made to configure the legacy DNS servers to support EDNS. Was this a wise move? Why or why not?
Of course, the intentions were good, but this move risked violating the first principle of the troubleshooting 10 commandments (do not make things worse). Since the only affected clients were now working and the workaround put in place was already scheduled to be made permanent, there was no immediate issue at all.

 

Thus, performing a maneuver that entailed any risk was unwarranted. If this was to be attempted, at the very least it should have been performed off-hours, when usage would be very low. Attempting such a change in the middle of the day, without even having tested it beforehand, was too great a risk, even with the fear of more issues coming. You can always react to such an avalanche of incidents when they arrive. As you would suspect, this change caused a greater outage for most clients for a short time before being backed out. It was decided to accelerate the migration to the new DNS servers instead through automation. As it turned out, a few other clients did suffer the same problem, but they were dealt with quickly as they came up, using the already identified workaround, and were fixed permanently the next evening. This is a lesson learned that reinforced the need to adhere to the first principle of troubleshooting.

3.4 Troubleshooting Steps
Now that we've covered troubleshooting goals and principles, it's time to discuss the steps to take when troubleshooting an Enterprise IT problem. While every problem is different, in general, the steps below are solid guidelines to follow.
1)  Verify that a problem actually exists – Just because something is not working doesn't mean that it's supposed to be working. The error detected may not actually be causing a business interruption, or the user may not be allowed to perform that function without additional permissions, or new, valid restrictions may have been put in place, or simply the user may be performing the actions incorrectly (a.k.a. user error).

We've even seen cases where a P2 incident was run for an hour because someone saw an alert in a log, just to find out that the alert had been occurring for months.
2)  Ensure a clear understanding of the problem and its history – What errors are appearing? Under what circumstances? When did this start? Who is affected (e.g., number of users, applications, locations)? Can you easily replicate the problem, or is it sporadic? Are there any system monitoring alarms that may be related? How bad is the business impact (e.g., how many users are impacted, or is a critical client impacted)? All of this may require some experimentation to determine the boundaries of the problem (e.g., what's working and what is not working) but creates a solid foundation for everyone to continue forward. The answers to these questions should be documented in the incident record.

 

3)  Isolate the cause of the problem – This is the core of the process, and thus we break it down further. Of course, this could be a lengthy and iterative process depending on the problem. Does anyone recall this occurring in the past? If so, find the documentation and review it. "Follow the packet" – understand what systems talk to which systems, how they reach the next system, and how to confirm that each linkage is working as expected. This helps you isolate which system the problem is located on. It requires application development personnel who understand how the application works as well as infrastructure personnel who know each hop within the infrastructure. Aside from the obvious errors reported by the user, question where else errors may be present. For example, are there application or platform log files or system monitoring tools that can be reviewed? If so, look in those areas. Look up any identified error messages in the documentation or on the Internet (Google knows all). Most often you are not the first to have run into the problem, so seeing if others have experienced it is a good start. Document what you find, but make sure it is applicable – use the information found as a guide, not definitive proof of your problem (not everything Google knows is correct). Based on all the evidence at hand (error messages, monitoring alarms, past experience, expertise available, recent changes, differences between other environments that work, etc.), theorize a possible cause (remember the principle of reductive problem isolation – shrinking the problem space). Experiment on your theory but keep it minimalistic – just enough to prove or disprove the theory. Eliminating possible causes is essential to being able to isolate the root cause. If progress stalls, then follow the workload step by step to identify error points, and/or engage vendor(s), and/or question whether settings can be changed to experiment, or whether tools (e.g., Wireshark) can be used to collect more data. Do whatever it takes to get more information and keep things moving. Iterate until the problem is isolated.

 

4)  Devise the best approach to address the problem – Taking into consideration the goals and principles previously discussed, determine the best way to address the problem once isolated. This may require backing out a change, restarting services, rebooting devices, making configuration changes, or applying a vendor fix. It's key to understand the risks associated with all potential actions to determine when they should occur (e.g., now, after hours, or over the weekend).
5)  Apply corrective action and confirm results – Once corrective action is applied, always confirm the problem is completely solved (remember the principle of confirming assumptions). Do not end the troubleshooting session without confirmation from all parties. There may be multiple or more complex issues at play, and resolving for one party may not resolve for the other. Also, when dealing with performance issues, a situation may be improved but not completely solved by some corrective action. Under any of these circumstances, further analysis may be needed.
6)  Ensure that corrective actions are persisted and applied globally – Do what needs to occur to avoid similar problems elsewhere or to keep the problem from recurring. Ensure that corrective action is employed accordingly.

Anecdotal Story – Calls on the rise
One day, your company's primary call center experiences a sudden and unexpected increase in calls. Hold times increase dramatically, and by noon customers are waiting for over 30 minutes. Management has to call on outsourcers to increase their allocation of agents to help. Finally, in the afternoon, a few agents note that customers say they cannot complete their requests on the self-service website that is supposed to handle over 50% of the requests. A P2 incident is declared and the team is brought together to determine the issue. Within five minutes, they determine it was a problem with a new code deployment that had occurred the night before. The code is backed out and the website is functional again, allowing call volumes to decrease to normal.

 

The developer assigned to the problem record notes the root cause of the code problem and that it was corrected, and closes the problem record. Was this the proper action to take? Why or why not?
This is more of a problem management anecdotal story than an incident management one. In fact, from an incident management perspective, everything was performed well, and you could not ask for a much quicker MTTR. At first glance, even the problem record details seem adequate, but a deeper look reveals a major gap – the mitigation of future occurrences. Unforeseen code deployment problems will inevitably occur from time to time, as testing all aspects of a system in lower environments is difficult. So, what was done in the above problem record to mitigate a future occurrence of a code deployment problem? Nothing, and that's the gap. It's important to note that the company did suffer in terms of reputational damage (customers waiting longer than necessary for service) and unnecessary expenses (additional outsourcer costs). So, what could have been done? The below are just some examples. The point is to train support personnel to think through these possibilities when performing problem management.
-  Identify why this problem was not detected during lower environment testing. Perhaps some additional functional tests can be added to the testing procedures.
-  Implement smoke testing of the application after deployment to ensure that problems are detected soon after deployment as opposed to many hours later by customers complaining.

-  Implement monitoring on the website to detect the issue automatically and alert support personnel so that they can address it before too many customers are impacted.
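As an illustration of the smoke-testing suggestion above, a deployment pipeline could run a small script such as the following immediately after each release. This is only a sketch; the URLs and expected strings are hypothetical and would be replaced with checks appropriate to the application:

# smoke_test.py - run immediately after a deployment to catch obvious breakage
# (illustrative sketch; URLs and expected content are placeholders)
import sys
import urllib.request

CHECKS = [
    # (url, substring expected somewhere in the response body)
    ("https://selfservice.example.com/health", "OK"),
    ("https://selfservice.example.com/login", "Sign in"),
]

def check(url: str, expected: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            ok = resp.status == 200 and expected in body
    except Exception as exc:  # connection errors, timeouts, HTTP errors
        print(f"FAIL {url}: {exc}")
        return False
    print(("PASS" if ok else "FAIL") + f" {url} (HTTP {resp.status})")
    return ok

if __name__ == "__main__":
    results = [check(u, e) for u, e in CHECKS]
    sys.exit(0 if all(results) else 1)

A non-zero exit code from a script like this can fail the deployment job and alert support minutes after the release, rather than hours later via customer complaints.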

3.5 Identifying & Mitigating Limitations
Even if you have a strong grasp of the IT troubleshooting best practices discussed, it is still important to recognize when there are limitations that need to be addressed within the IT troubleshooting team. These limitations can cause severe delays or even an inability to resolve the problem at hand.
1)  Technical Experience/Knowledge – IT troubleshooting requires experience and knowledge in each discipline involved. You need to know where to look for errors, what conditions could cause a particular problem, what questions to ask, how to make the changes needed to resolve the problem or simply to experiment, etc. Experience also provides the ultimate shortcut – having experienced the problem before and knowing how to resolve it. If you have insufficient technical experience or knowledge in a particular discipline needed to address the problem, you need to recognize this and compensate by adding personnel with that experience to the team. For example, if the problem appears to be related to a SAN, then you need an engineer knowledgeable about that specific SAN infrastructure to be engaged. This engineer would have to be able to issue commands to validate and/or address the problem. If you do not have this experience readily available, then calling on a vendor may be necessary. This is particularly true when the issue appears to be with a vendor's product as opposed to your company's proprietary code or configuration. Therefore, it is important to make sure that all products used to support any significant business function have vendor support.
2)  Technical Complexity – Many Enterprise IT issues transcend multiple areas of expertise. This increases the complexity of the problem, as issues can exist within the interactions of these areas and not solely within one area. Without a solid understanding of how each area intersects, it may be difficult to properly theorize what the problem could be – a critical step in the troubleshooting process. If such a gap exists, you need to recognize it and compensate by either engaging personnel with these skills or focusing questions at the intersection of technologies, ensuring that the experts in these areas concentrate on this interaction. For example, if an application is having difficulty connecting to a database, you need the developer to provide the database connection string as well as the exact error, and then have someone who understands the implications of the error. If the error is authentication-related, then network connectivity can be ruled out, and the focus can shift to a DBA to determine the reason for the security problem.

3)  Environmental Knowledge – Even if the problem-solving team has the proper level of technical expertise across all the disciplines involved in the problem, there may still exist a gap in terms of knowledge of the environment. Historical/tribal knowledge about the applications, networks, platforms, etc. involved with the problem is often critical to developing theories as to the possible cause. If such experience does not exist, then you need to recognize it and compensate by doing one of the following:
-  Find personnel with this environmental knowledge.
-  Look for environmental documentation (e.g., technology/application architecture diagrams, networking subnet listings, past problem incidents, etc.).
-  Use reductive problem isolation techniques while carefully reviewing all environmental configurations before applying any changes. This will slow down the process but avoid making the situation worse.

Environmental Knowledge
What does it mean to have environmental knowledge? Enterprise IT environments are extremely complex, and many have a long history. Thus, understanding how the network is configured, how load balancers are configured, how authentication works, etc. is critically important. It also helps to understand why items are configured as they are, to avoid making a change that will break other systems.

Here are just some of the questions that someone with environmental knowledge would be able to answer:
☐  How is the network configured?
➢  How many subnets are there – is there a list?
➢  Where are the firewalls located?
➢  Which subnets are behind firewalls and which are not?
☐  How is virtualization set up?
☐  How is the storage architected?
☐  What platforms do you use (e.g., application server technology, database technology, message bus technology, etc.)?
☐  What security products are implemented (e.g., IPS, WAF, content filtering, etc.)?
☐  What kind of infrastructure and application monitoring is in place?
☐  Who are the experts in each area of expertise?
☐  How are changes tracked?
☐  Application architecture
➢  Two tiers, three tiers, or more?
➢  What servers are involved?
➢  Where are the users?
➢  What kinds of platforms are being used?
➢  How is application security implemented?

3.6 Application Development & Infrastructure in Troubleshooting
As noted in chapter 1, in Enterprise IT organizations, Application Development and Infrastructure teams are separated for several reasons, including segregation of duties. As such, these groups tend to have distinct cultures. However, when it comes to troubleshooting, these groups need to work closely together. This section discusses this topic.

Sometimes when a problem occurs, the root cause is easily detected (e.g., a server crashes, a network circuit fails, or a new code deployment has a bug); however, often this is not the case. Often, it's unknown where the root of the problem lies. When this occurs, troubleshooting is optimally performed by both Application Development and Infrastructure personnel together. The following are the reasons why:
☐  You should avoid assuming that a problem is either application or infrastructure related until the issue is diagnosed successfully. Any such assumption could lengthen the troubleshooting effort because it will lead to all efforts being focused on what may not be the problem.
☐  Even if the problem is an application code issue, infrastructure personnel are needed for the following:
➢  Application Development personnel are generally not familiar with the infrastructure and all the components that could impact their application.
➢  Application Development personnel do not have access to production (and often some pre-production) systems, so they are reliant on infrastructure personnel to aid with debugging problems.
☐  Even if the problem is infrastructure related, Application Development personnel are needed to do the following:
➢  Explain what their code is attempting to do.
➢  Test their code under different circumstances as changes are made, or to capture trace information (e.g., a network trace).
➢  "Instrument" their code – that is, place debugging statements to capture needed information at various stages of their application's processes.
The bottom line is that both types of personnel are most often needed to resolve problems, and everyone involved should understand this and work as a team.
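To make the "instrument their code" point concrete, the following minimal sketch (the function names and the simulated database call are hypothetical) shows the kind of timestamped debugging statements a developer can add so that both development and infrastructure staff can see exactly where time is being spent:

# instrumentation sketch: timestamped logging around an external call
# (illustrative only; fetch_orders() and query_database() are hypothetical)
import logging
import time

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("orders")

def query_database(customer_id: int) -> list:
    time.sleep(0.05)          # stand-in for the real query
    return [("order-1", customer_id)]

def fetch_orders(customer_id: int) -> list:
    log.debug("fetch_orders start customer_id=%s", customer_id)
    start = time.perf_counter()
    try:
        return query_database(customer_id)   # placeholder for the real DB call
    except Exception:
        log.exception("fetch_orders failed customer_id=%s", customer_id)
        raise
    finally:
        log.debug("fetch_orders finished in %.1f ms",
                  (time.perf_counter() - start) * 1000)

if __name__ == "__main__":
    fetch_orders(42)

The point is not the specific logging library but that entry, exit, duration, and failure are all captured with timestamps that can be lined up against infrastructure logs.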

Application-focused vs. Infrastructure-focused Troubleshooting Approaches
There are generally two approaches to troubleshooting an issue: an application-focused approach that looks at the problem from the perspective of an impacted application, and an infrastructure-focused approach that looks at the infrastructure components involved for errors or problems.

When troubleshooting a problem that is only affecting one application, it is best practice to follow an application-focused approach. That is, you focus on what's wrong with the application and look for symptoms directly related to that application. That doesn't mean that you ignore infrastructure components that could be causing the issue; it's just that unless you know there's a problem with a specific infrastructure component, you let the application inquiry lead you there. However, when troubleshooting broad issues that impact multiple applications, the decision is not so clear. In such situations, the priority is isolating the problem space as much as possible. Either approach may help accomplish this depending on the circumstances. Below is more detail on these two approaches:
◆  Application-focused approach – Start with the errors being generated by the applications (focusing on one at a time) and then work to isolate the component (e.g., database, server, network connection, etc.) that may be causing the problem. This can be accomplished by looking at application logs, application instrumentation tools, application configurations, platform (e.g., application or database server) logs, etc.
☐  PRO – More focused on the problem at hand and thus less likely to chase false leads
☐  CON – May take longer to find the cause if the problem is sporadic, as with capacity issues
◆  Infrastructure-focused approach – Start by checking the monitoring tools for alarms, checking servers for capacity issues (CPU, swapping, etc.) on all related infrastructure components, and/or looking for errors in system logs.
☐  PRO – May provide a quick resolution if the problem has been detected by a tool or is obvious in a log
☐  CON – May lead to following red herrings, as system monitors and logs often contain errors & warnings not associated with the problem at hand

Both approaches are appropriate and can be followed concurrently, but the pitfalls of each must be kept in mind to avoid extending the outage longer than necessary.

 

Chapter 3 – Review/Key Questions

1)  What are the differences between Incident Management and Problem Management?
2)  What are the primary goals of troubleshooting Enterprise IT problems?
3)  What are the principles to adhere to while troubleshooting Enterprise IT problems?
4)  What are the steps to take when troubleshooting Enterprise IT problems?
5)  What are the potential limitations to look out for when troubleshooting Enterprise IT problems?
6)  What does "environmental knowledge" mean?
7)  What are the differences between an "application-focused" and an "infrastructure-focused" approach to troubleshooting? Why are both disciplines necessary?

 

Chapter 4 – IT Problem Fundamentals

In this chapter, we will cover what we call the "IT Problem Fundamentals". These fundamentals include the various dimensions of IT problems and system constraints. We will also take a deeper dive into hard vs. soft constraints and CPU vs. RAM vs. I/O vs. Network problems/constraints. Finally, we will review a little about how to analyze log files. This is a transitional chapter that discusses IT problems in more general terms before deep diving into tools in the next chapter.

4.1 – Dimensions of Problems/Constraints
Identifying the root cause of an Enterprise IT problem is often difficult due to the complexity of the environment. When a problem arises, it's important to look at what's common about the reported symptoms. These symptoms can be looked at in terms of dimensions, which can help you isolate the root cause.

A common factor in IT problems is resource limitations. All environments have capacity constraints; however, the limitation impacting your specific problem is not always simple to determine. Thus, it's critical to understand the possible bottlenecks/constraints that you can run into.

Dimensions of problems & constraints
There are many dimensions to IT problems and constraints, but in this chapter, we will discuss a few and then deep dive into a couple of these dimensions. Here's a list of some of the dimensions of IT problems and constraints:
Single system vs. Multiple systems vs. System interaction
Operating System vs. Application/Product/User
Sporadic vs. Persistent
Temporary vs. Permanent
Hard vs. Soft
CPU vs. I/O vs. RAM vs. Network

Single System vs. Multiple systems vs. System interaction

 

As you tackle an IT problem, one dimension of the problem to look at is whether this is a single system problem, multiple system problem, or system interaction problem. What’s the difference? Single system problems/constraints are isolated to a single server. Only the application(s) and/or product(s) that reside on or depend on this server would exhibit issues. No other systems are impacted. This typically occurs with isolated systems that do not share components with other applications.

 

When multiple systems are impacted by a particular problem, the problem or constraint could be one of the impacted systems or something "outside" of the impacted systems. It could be an issue with the network, storage, or perhaps another system (e.g., Active Directory or a load balancer) that impacts an entire environment. The key to problems like these is to find what's common about the impacted systems.

Occasionally a problem/constraint is not with a particular server or network device, but instead with the interaction between systems (e.g., a specific application server unable to connect to a database server). With problems like these, both of the systems at play and the network in between need to be analyzed. Is there a firewall in between? Is a network link failing in between? Was there a recent change on either server preventing name resolution or authentication?

Operating System vs. Application/Product/User
Another important dimension to understand is whether the problem/constraint is at the operating system level, impacting all of the applications on that server, or at the application, product, or user level, whereby only one application, product, or user is impacted and the rest of the applications on the server are performing fine. Operating system problems/constraints include overall CPU, I/O, RAM, host network bandwidth, OS configuration, etc. Per Troubleshooting Principle #1, since such issues impact all applications/products on the server, this influences the range of actions to be considered (e.g., you are more likely to consider a reboot of the system since everyone is impacted already anyway). Conversely, with specific application/product/user limitations, there are still overall system resources available and only the one application, product, or user is impacted; thus, remedies would likely be more restricted to avoid impacting those still working fine (e.g., more likely to recycle an app pool worker process taking 100% CPU than to reboot the whole server).

Sporadic vs. Persistent
Yet another dimension of IT problems to consider is whether the issue is persistent and thus easily replicated, or sporadic and thus not easy to replicate.

Persistent issues are typically easier to diagnose as they can be tested to see if a change has impacted the situation. If you change something and it has an effect, you can be fairly certain that it’s due to your change. This is not the case with sporadic issues. Sporadic issues can be quite challenging especially if a pattern is not able to be determined. The key to handling sporadic issues is to figure out how to “catch” the problem in the act such that a comparison can be made to a time when the issue does not occur. An example would be to create a trace that only triggers upon the condition you are looking for like a failure or long-running transaction. You can set up such traces on web servers and database servers.
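A conditional trace of this kind does not have to come from a vendor tool; as a minimal sketch, application code can capture detail only when the condition of interest occurs. The 2-second threshold and the traced function below are illustrative:

# conditional_trace.py - capture detail only for slow or failing calls
# (sketch; the 2-second threshold and handle_request() are illustrative)
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
SLOW_SECONDS = 2.0

def trace_if_slow(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            logging.exception("TRACE %s failed args=%r kwargs=%r",
                              func.__name__, args, kwargs)
            raise
        finally:
            elapsed = time.perf_counter() - start
            if elapsed >= SLOW_SECONDS:
                logging.warning("TRACE %s took %.2fs args=%r kwargs=%r",
                                func.__name__, elapsed, args, kwargs)
    return wrapper

@trace_if_slow
def handle_request(request_id: str) -> None:
    time.sleep(0.1)   # stand-in for real work; occasionally slow in production

if __name__ == "__main__":
    handle_request("abc-123")

Because the wrapper is silent for normal calls, it can be left running in production until the sporadic condition is "caught" in the act.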

 

Temporary vs. Permanent
A more subtle dimension of an IT problem is temporary vs. permanent. When dealing with a sudden increase in demand, for example, it's important to determine if the increase is temporary (e.g., related to a temporary spike in activity or perhaps the deployment of troublesome code) or more permanent. For example, if you deploy new code and it starts to take 100% of the CPU, then increasing the number of CPUs may not help at all, as this is likely a temporary issue due to a code change. Backing out the code would resolve the problem.

Similarly, if there's a spike in activity due to the time of year (e.g., Black Friday), then temporarily adding horsepower by taking it from lesser-used (e.g., non-production) systems, as opposed to permanently purchasing horsepower, may be more appropriate. On the other hand, if the spike appears to be permanent, then plans must be made accordingly.

Anecdotal Story – Everything's slow, but just for one site
Early in the morning, personnel from several departments at one large site in the company report that their systems are extremely slow when logging in, opening the browser, and accessing any file on their PC – it takes as long as a minute at times. No other locations are calling in with issues. Below is a diagram to help visualize the issue. What questions do you ask to get to the root of the problem?

Since there are many users within one location being impacted, this is a multiple systems issue, and the problem is unlikely to be with any specific user, but instead with something that they all rely on. The problem seems to be persistent, so if we find something to change or test, we should be able to be confident that we've identified the issue.

 

Following the troubleshooting principles laid out in chapter 3, the following are some questions that could be asked:
-  Were any changes performed overnight? – The answer is no.
-  Are all users within this location impacted? – The answer is no. Some users are performing fine, but others are not.
-  What's common about all the users impacted? – After some brief research, it's found that all the impacted users share the same file server for their home directory. This home directory is used as the location of the browser cache, which explains why any browser activity is slow.
-  Now that we know what's common, you investigate the file server and find that it's at 100% CPU. This confirms that you've likely identified the immediate cause.
What actions should you take next? Because so many users are impacted and it's difficult to do anything on the server (it is very slow to respond), the decision is to reboot the server and contact Microsoft. After the reboot, CPU usage is normal again and users are performing fine. The incident is resolved, but a problem record is created. Through the problem record, a look back at the history of the CPU usage of this server indicates that it has never been very high; however, the server was patched a few days ago. This appears to have been a temporary problem, likely related to a recently introduced bug. Sure enough, after conversations with Microsoft, it's discovered that the latest patch introduced a bug that only occurs under certain circumstances. An additional patch is applied to avoid recurrence, and this patch is also applied to other file servers that are on the same patch level to avoid them experiencing the same issue.

4.2 Hard vs. Soft Constraints
A very important dimension of IT problems is to distinguish between hard and soft constraints and to avoid hard constraints as much as possible. A soft constraint can be adjusted dynamically, requiring a simple configuration change and perhaps recycling the system or a component of the system in order to alleviate the constraint. Conversely, a hard constraint deals with the physical world, such that additional hardware needs to be purchased or implemented for resolution. Hard constraints take much longer to resolve and are thus avoided as much as possible.

 

Examples of Hard Constraints:
Lack of CPU power or RAM
Lack of storage capacity
Lack of network bandwidth

How to deal with hard constraints
1)  Use virtualization to adjust resource usage – You can reduce the capacity of non-production or lower-priority systems/components and provide it to the systems suffering hard constraints. This is typically a temporary remedy but could be an effective one.
2)  Optimize your application – Identify the cause of the high utilization within the application to see if it can be tuned to reduce its capacity needs.
3)  Acquire more resources – This can take time and is not a quick fix but may be necessary if #1 does not buy you enough time to resolve the issue with #2.
The key is to avoid hard constraint issues through capacity planning and load testing.

 

Anecdotal Story – Now a small site has an issue
You get a call from a small remote location that their core application, residing in your primary data center, is horribly slow. Most other local functions are performing fine. All of the other locations that use the same system remotely are performing fine, and all the servers involved with this core application in the primary data center seem to have minimal utilization – plenty of CPU, no swapping, low I/O, and low network utilization. Below is a visual diagram of the situation. What do you ask about next?

Since this core application is performing well when accessed elsewhere, per the troubleshooting principle of reductive problem isolation, you can reduce the scope of the problem to a system interaction issue between "Site D" and the core application in the primary data center. So, what components would be in play? The obvious component to start with is the network. In this case, the communication from Site D to the data center would traverse the "Corp Network". It's important to understand that most private corporate networks today are MPLS, albeit with some VPN access over the Internet for smaller sites. As with Internet access from your home, the bandwidth into your corporate network varies at each site. Smaller sites need less bandwidth and thus pay for smaller circuits. Given this, the following questions would be good to start with:
-  What's the capacity of the corporate network circuit at Site D? – 20 Mbps
-  What's the utilization of the corporate network circuit at Site D? – 100% inbound, 15% outbound
-  What's the capacity of the corporate network circuit at the Data Center? – 1 Gbps
-  What's the utilization of the corporate network circuit at the Data Center? – 40% inbound, 50% outbound

Based on the above information, it's easy to conclude that the problem is inbound bandwidth saturation on the corporate network circuit at Site D. So, now, what can you do about it? Well, the next step is to identify the type of traffic that's taking up all of this inbound bandwidth and where it's coming from. Most Enterprise IT network teams can do this by utilizing tools that capture "NetFlow" traffic from various locations. In this case, these tools indicate that an SCCM server is staging large amounts of desktop installations, including upcoming patches, to the SCCM server at Site D, overwhelming the circuit. The incident management team contacts the SCCM administration team to have them cancel this transmission, and the performance problem ends. The incident is resolved, but of course, problem management of this issue continues. The SCCM engineer is now assigned the problem. The data that was being transmitted to Site D (about 100GB) still needs to get there, but without impacting the users. So, what can be done? Well, the answer is within the SCCM tool itself. This tool allows you to "throttle" your transmission traffic to a certain amount during certain times of the day and a different amount at other times. Thus, the engineer sets the throttle to 5 Mbps during the day (or 25% of the circuit) and 15 Mbps (or 75% of the circuit) after hours (7 PM-7 AM). This allows the transmission to complete without impacting the users. The engineer goes on to check how transmissions are set up for other smaller sites and sets up similar throttling rules to avoid impact in the future.
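As a rough sanity check on such a throttling plan, the arithmetic can be scripted. The figures below come from the story above (about 100GB to move, 5 Mbps during the day and 15 Mbps at night); the even 12-hour split is an approximation:

# rough transfer-time estimate for the throttled SCCM staging job
# (back-of-the-envelope sketch using the figures from the story above)
PAYLOAD_GB = 100
DAY_LIMIT_MBPS, NIGHT_LIMIT_MBPS = 5, 15        # throttle settings
DAY_HOURS, NIGHT_HOURS = 12, 12                 # 7 AM-7 PM vs. 7 PM-7 AM

def gb_per_hour(mbps: float) -> float:
    # megabits/second -> gigabytes/hour (1 GB is roughly 8000 megabits)
    return mbps * 3600 / 8000

per_day = DAY_HOURS * gb_per_hour(DAY_LIMIT_MBPS) + \
          NIGHT_HOURS * gb_per_hour(NIGHT_LIMIT_MBPS)
print(f"~{per_day:.0f} GB/day -> roughly {PAYLOAD_GB / per_day:.1f} days "
      f"to move {PAYLOAD_GB} GB")

At these rates the staging job still finishes in about a day, which confirms that throttling does not defer the work unreasonably.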

Soft Constraints
Sometimes the constraint causing a problem is not due to the physical limitations of the systems involved. Most systems have configuration settings that impose artificial limitations to keep one process from impacting others or the overall system. Some examples of these soft constraints include:

 

The number of CPUs or amount of RAM allocated to a virtual machine guest operating system – In this case, there may be plenty of CPU horsepower available to be used on the underlying physical machine or cluster, but the virtual server may be artificially constrained. NOTE: If the host machine (or machines, if you have a cluster) is suffering from a lack of CPU or RAM, then you have a hard constraint.
User limits, such as how many processes, how much RAM or disk space, how many open files, semaphores, etc. a userid can utilize – This is done to keep one user from overrunning the entire system; however, when a system is dedicated to one process (e.g., a database server), then these limits should not apply to that process.
The number of connections allowed – Some products limit the number of connections or processes allowed to utilize them.
Licensing limitations – Many software products impose limits on concurrent requests, network bandwidth, degrees of parallelism, RAM, CPU, etc.
Resource governors and quotas – Some products (especially database engines) include the ability to specify user-defined limits on resource consumption. Resources limited can include CPU, I/O, RAM, certain types of queries, files in use, etc.
Etc. – There are countless other soft constraints, and looking out for these is critical when troubleshooting.

Anecdotal Story – Why won't it go faster?
You have a time-sensitive process that is called a few times a minute. It needs to respond within a few seconds each time it's called. You acquire a large 20-CPU-core server with a lot of memory for the Postgres database server that is used by this process. In fact, the entire database now fits in RAM. Still, the process is taking over 30 seconds to execute even though the overall CPU % is less than 10%, there is no swapping, and the I/O rate & I/O response time are very low. What else do you look at?

At first, this problem may seem quite befuddling. However, if you start to break it down, you realize that the scope is rather constrained. First, it's a single system type issue and no other system is at play. It's also a persistent problem and the response is consistently poor. It's also a single application issue, as the overall operating system is not constrained at all.

 

Given all of these factors, it would appear that some artificial constraint may be at play. So, what to do next? Well, you can look at how much CPU the Postgres DB process is taking. When you do this, you notice that each time a request comes in, one process takes 100% of a single CPU, but all the other processes remain low in terms of CPU utilization. Thus, only one of the 20 CPUs is being heavily utilized, while the others are mostly idle. What does this tell you about the process? Simple – it's not multi-threaded. You could throw another 100 CPUs at this server and you could not speed up this single request, because it's only going to use one of the CPUs. A quick look through the Postgres documentation shows you that the default with Postgres is that each request only uses one CPU (e.g., no parallel processing of a query); however, this can be adjusted. You adjust the setting to allow up to 8 CPUs to be used by a single request, and the requests speed up by a factor of 8, completing consistently in about 4 seconds. Now, you know that this can be tweaked further if necessary in the future to improve performance, depending on how many concurrent requests you plan to receive.
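The "one CPU pegged, the rest idle" pattern described above can be spotted quickly with a short script. The sketch below assumes the third-party psutil package is installed (pip install psutil) and simply samples per-CPU utilization for a few seconds; the 90% and 25% thresholds are illustrative:

# percpu_check.py - spot a single-threaded bottleneck: one core pegged, rest idle
# (sketch; assumes the third-party psutil package is installed)
import psutil

# Sample utilization per logical CPU over a 5-second window.
per_cpu = psutil.cpu_percent(interval=5, percpu=True)
for cpu, pct in enumerate(per_cpu):
    flag = "  <-- hot" if pct > 90 else ""
    print(f"cpu{cpu:02d}: {pct:5.1f}%{flag}")

busy = sum(1 for pct in per_cpu if pct > 90)
if busy == 1 and sum(per_cpu) / len(per_cpu) < 25:
    print("Pattern suggests a single-threaded process is the bottleneck.")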

4.3 CPU vs. RAM vs. I/O vs. Network (the big four)
Perhaps the most important dimension of an IT problem, however, is determining the source of a constraint on a system. The source of a constraint is typically one of the big four (CPU, RAM, I/O, or Network). In this section, we will review how to identify each of these constraints and some potential remedies for each as well.

CPU constrained (aka CPU bound)
Before we continue with what it means to be CPU constrained, let's review the three general states your CPU can be in:
Idle – This means that the CPU has nothing to do. Please note that on Linux systems, the "I/O wait" state is a variant of the idle state, indicating that while the CPU is idle, processes are waiting on I/O requests that otherwise would have work for the CPU.
Running a user space program – This means that the CPU is running the core code for an application, like a web server, an email server, or a database server. On servers, most CPU time should be user time.
Running the kernel – This means that the CPU is servicing interrupts or managing resources (e.g., swapping RAM, performing I/O, etc.).

So, what does it mean to be CPU constrained? This occurs when there aren't enough CPU cycles available across all the CPUs to satisfy the load and continue to perform as expected. The result is typically a combination of high (or even flatlined – 100%) CPU usage (user and kernel time combined) as well as a backlog of processes waiting to run (the CPU wait queue). Note that if the process wait queue is 0, then even with a high CPU percentage you are not yet constrained, but you are running close to it. If you are CPU constrained, resolutions could include the following:
Application tuning – This typically takes time to implement, so it is not generally a quick fix.
Increasing CPU power on the server (vertical scaling) – This is a particularly reasonable path to take when utilizing a virtual server where there is capacity available on the physical host or at least on another host in the cluster.
Spreading the workload across multiple systems (assuming this is possible) – This is particularly possible when you've combined several functions into one server. For example, if you have the application, database, email services, etc. all on one server, then it's simple to spread these components onto their own virtual servers.
Adding systems (horizontal scaling – if allowed by the application/product architecture) – Some types of platforms (e.g., web or application servers) are easily scaled horizontally by adding a load balancer up front and then adding web/application servers as needed to handle the load. Other types of platforms are more difficult (e.g., database servers) but may be scalable depending on the specific product in use.
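As a quick way to distinguish "busy" from "constrained", the load average (a rough proxy for the CPU run queue on Linux/Unix) can be compared to the number of CPUs. The thresholds in this sketch are illustrative, not definitive:

# cpu_pressure.py - quick check: is the box CPU constrained or just busy?
# (Linux/Unix sketch; the load average is only a proxy for the run queue)
import os

cpus = os.cpu_count() or 1
load1, load5, load15 = os.getloadavg()   # 1/5/15-minute load averages

print(f"logical CPUs: {cpus}, load averages: {load1:.1f} {load5:.1f} {load15:.1f}")
if load5 > cpus:
    print("Sustained run queue exceeds CPU count -> likely CPU constrained.")
elif load5 > cpus * 0.8:
    print("Running hot but not yet queuing heavily -- close to constrained.")
else:
    print("CPU does not appear to be the bottleneck.")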

 

Anecdotal Story – No more CPUs to give
One morning, many calls start to come in from a couple of large logistics centers that the system they use is extremely slow. You look at the central database server, which has been running at nearly 90% CPU for a few months, and it's now at 100% capacity with many processes queuing. You find out that a new large client started to send in work today and that this is now the new normal. The problem is that this server is the only guest OS on a host within a virtualized cluster and it's been given all 20 CPUs on that host – there's no more CPU to give on that server. What do you do?

At first, this may look like a dire hard constraint situation, and to some degree it is. Some of the immediate ideas you may have for such a situation do not work out, like:
-  Because only the database resides on this server, splitting the work to other servers doesn't help.
-  Unfortunately, the database architecture you are using does not lend itself to being spread across multiple servers, so horizontal scaling (adding another database server) is not possible.
-  Since all the host machines in a cluster of hypervisors typically have the same number of CPUs, shutting down work on any other server in the cluster does nothing for you, because this server can only run on one host and it already has all the CPUs available on that host.
-  Tuning the application is a possibility, but this will take time, and the business is suffering and in big trouble right now.

So, what can you do now? Well, this is a pretty bad situation that clearly shows the importance of capacity planning. In other words, knowing that new work was coming to a system that was already somewhat constrained should have raised red flags earlier, and allowing any critical system to run at over 90% is dangerous, as any little bump in work can lead to a crisis. Luckily, this is not a complete hard constraint because the database is on a virtual server that can be moved to a bigger box (vertical scaling) with minimal downtime. In large enterprises, there is usually a larger box available somewhere, since systems get bigger all the time and replacements are in constant flux. In this case, a 28-core machine was available. It was supposed to be used for another project, but new hardware could be ordered and would only minimally delay that other project.

 

Thus, this 28-core machine was added to the hypervisor cluster, the database server was v-motioned to the new host, and finally, 8 additional CPUs were added to the database server through the hypervisor, providing short-term relief and resolving the incident – but of course, problem management did continue. Even after the CPU additions, the database server was still running way too high on CPU for such a critical system and more new work was coming soon, so through problem management, a longer-term solution was addressed. This included tuning the application queries to the point that the overall utilization dropped by 33%, planning for even more CPU additions in the future, and implementing more formal capacity planning for this server.

I/O Constrained (aka I/O bound)
Being I/O constrained means that a system is not performing up to expectations because there is too much I/O going on for it to be responsive enough. I/O constraints are best detected by an increase in I/O latency – this assumes that you know the baseline. Typically, I/O latency should be less than 10ms (even as low as 1ms or 2ms) on direct-attached or SAN storage. Anything above 50ms is generally considered unacceptable; however, it's important to understand what "normal" has been in the past to gauge whether anything is worse.
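A rough I/O latency estimate can be derived from the operating system's cumulative disk counters. The sketch below assumes the third-party psutil package and uses the 50ms guideline mentioned above purely as an illustrative flag, not a hard rule:

# io_latency.py - estimate average I/O latency per disk over a short window
# (sketch; assumes psutil is installed; the *_time counters are in milliseconds)
import time
import psutil

INTERVAL = 10  # seconds

before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL)
after = psutil.disk_io_counters(perdisk=True)

for disk, b in before.items():
    a = after[disk]
    ios = (a.read_count - b.read_count) + (a.write_count - b.write_count)
    busy_ms = (a.read_time - b.read_time) + (a.write_time - b.write_time)
    if ios == 0:
        continue
    latency = busy_ms / ios   # rough average time per I/O
    note = "  <-- investigate" if latency > 50 else ""
    print(f"{disk}: {ios} I/Os, avg latency ~{latency:.1f} ms{note}")

Because the result is an approximation, it is most useful when compared against the same measurement taken when the system was known to be healthy.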

Another key indicator of a system being I/O constrained is a backlog of I/O requests, generally called I/O queue depth. The CPU statistics of an I/O constrained system would typically show a high idle or I/O wait state % as well, indicating that processes are waiting on I/O requests. Resolutions for I/O constrained systems vary widely, but could include:
Tune the application to reduce I/O load – This could be a simple fix in some cases (e.g., adding RAM to a database) or more complex in other cases.
Increase the bandwidth between the system and the storage devices – This may require a hardware purchase and implementation and thus could take time.
Spread the workload across multiple systems (if possible) – This is particularly possible when you've combined several functions into one server. For example, if you have the application, database, email services, etc. all on one server, then it's simple to spread these components onto their own virtual servers.
Increase the number of drives to spread out the I/O – This is generally only applicable if mechanical disks are in use, where hot-spot disks could severely impact performance.
Add cache to the SAN – This would typically require the purchase and implementation of hardware.
Change to faster drives (e.g., SSD) – If capacity happens to be available on faster media (e.g., SSD), then this may be able to be performed fairly quickly.

RAM Constrained (aka memory bound)
Being RAM constrained means that a server is unable to provide the expected performance because it does not have sufficient RAM to support what the applications are requesting.

The primary symptom of being memory-bound is a high swapping rate (thrashing). When this occurs, the CPU will exhibit high kernel usage time. Note that 100% RAM utilization is not necessarily a problem, as many modern systems utilize RAM to cache filesystem reads and thus utilize all the RAM they can. The key metric to look at is usually the swapping rate. However, there are exceptions, as with database systems where the swapping rate may not be high but the database's lack of buffer/cache space greatly hinders its performance. This condition cannot be detected from any operating system metric; instead, it can only be detected with database system metrics like a low cache-hit ratio or a low PLE (Page Life Expectancy), indicating that most queries are having to get their data from disk. Resolutions are almost identical to those for being CPU bound and include:
Application tuning – This typically takes time to implement, so it is not generally a quick fix.
Increasing RAM on the server (vertical scaling) – This is a particularly reasonable path to take when utilizing a virtual server where there is RAM capacity available on the physical host or at least on another host in the cluster.
Spreading the workload across multiple systems (assuming this is possible) – This is particularly possible when you've combined several functions into one server. For example, if you have the application, database, email services, etc. all on one server, then it's simple to spread these components onto their own virtual servers.
Adding systems (horizontal scaling – if allowed by the application/product architecture) – Some types of platforms (e.g., web or application servers) are easily scaled horizontally by adding a load balancer up front and then adding web/application servers as needed to handle the load. Other types of platforms are more difficult (e.g., database servers) but may be scalable depending on the specific product in use.
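Since the swapping rate is the key OS-level metric, a quick check of swap-in/swap-out activity over a short window can confirm or rule out thrashing. This sketch assumes the psutil package; the 10MB threshold is arbitrary and for illustration only:

# swap_check.py - is the server actively swapping (thrashing) right now?
# (sketch; assumes psutil; sin/sout are cumulative bytes swapped since boot)
import time
import psutil

INTERVAL = 10  # seconds

s1 = psutil.swap_memory()
time.sleep(INTERVAL)
s2 = psutil.swap_memory()

in_mb = (s2.sin - s1.sin) / 1_048_576
out_mb = (s2.sout - s1.sout) / 1_048_576
print(f"swap-in: {in_mb:.1f} MB, swap-out: {out_mb:.1f} MB over {INTERVAL}s "
      f"(swap {s2.percent:.0f}% used)")
if in_mb + out_mb > 10:
    print("Active swapping detected -- the server may be RAM constrained.")
else:
    print("Little or no swapping; remember that DB buffer/cache starvation "
          "will not show up here.")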

Anecdotal Story – Slow performance after upgrade
An application was moved to a new database server over the weekend.

Testing on the weekend after the change was all good; in fact, the performance was better than before. However, now on Monday morning, with a full load of users on the new database server, performance has become horrible. Backing out the change would cause a full-day outage, so the business wants to avoid it. The database server has a low swapping rate and 90+% CPU usage, but a processor queue length of 0. In terms of I/O, the I/O response time is low, but the I/O rate and I/O queue depth are much higher than in the past. What would you ask next to try to get to the root cause?

First, it seemed pretty clear that this was a single system type issue, so the focus could remain strictly on the database server. Also, with a high I/O rate and high I/O queue depth, the initial diagnosis would certainly be that this server is I/O constrained, and to some degree, that is correct. The server is indeed trying to perform more I/O than it can while still responding to users as expected. However, you need to ask the next question, which is why? It's not because the I/O subsystem is slower than usual, since the I/O response time is still low. It's instead because the database server is trying to perform much more I/O than it ever did before. Why is that? Well, since this is a new database server and the old database server was performing well, per our troubleshooting best practices, a good start is to compare the new server to the old server.

 

The CPU count (16), RAM (256GB), and even the SAN backend are all the same. The database version changed, but there's no evidence that the queries are any less efficient; in fact, testing had shown they were more efficient. So, what else can be checked? Since the database software installation was new, the configuration of the database was reviewed, and a difference was found. The old database server software was configured to utilize 196GB for its buffer cache, but the new server was only configured to utilize 64GB. Thus, while there was plenty of RAM still available at the operating system level, the database was not configured to use it, so more queries were having to read straight from disk than before. When the number of users was low, as was the case during the weekend, this was not a problem, but as the number of active users grew, the amount of buffer space in the DB was no longer sufficient and more I/Os had to go directly to disk. Once the database was configured to match the old server, the I/O rate diminished and the system's performance improved dramatically. The incident could be resolved, but a follow-up problem management record was created to review how this kind of configuration mismatch could be avoided in the future.

Network Constrained (aka network bound)
Being network constrained means that the server or application cannot perform up to expectations due to the inability of the network to keep up with the demand. Unlike other constraints, a network constraint is not easily detected on a server.

This is because most network constraints will only be perceived as poor performance by users of the server, but the actual constraint is not on the server itself. Network constraints are more likely to be on long-distance circuits, security devices (e.g., firewalls), switch uplinks, etc., where there is oversubscription that could become a bottleneck. To resolve network constraints, the root cause needs to be determined via network monitoring. Some things to keep in mind:
The network adapter on the server should be analyzed to make sure the constraint isn't local (e.g., maxed out on local adapter capacity); however, this is very rare.
The shared nature of networks leads to a great deal of oversubscription in uplinks to the core and in bandwidth over the WAN/Internet.
The constraints are almost always upstream from the server's NIC. Assuming the NIC capacity is fine, analysis is needed upstream to see if any links or WAN/Internet circuits are at capacity.
If only communications to a particular server or segment are delayed, then analysis of that server and/or segment is warranted.
Resolutions will vary greatly depending on where the constraint lies.
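To quickly rule out the local adapter, throughput per NIC can be sampled and compared against the known link speed. The sketch below assumes the psutil package, and the 1 Gbps link speed is a placeholder that would need to match the actual adapter:

# nic_throughput.py - rule out the local network adapter as the constraint
# (sketch; assumes psutil; set LINK_MBPS to the adapter's actual speed)
import time
import psutil

INTERVAL = 10          # seconds
LINK_MBPS = 1000       # placeholder: 1 Gbps adapter

n1 = psutil.net_io_counters(pernic=True)
time.sleep(INTERVAL)
n2 = psutil.net_io_counters(pernic=True)

for nic, b in n1.items():
    a = n2[nic]
    rx_mbps = (a.bytes_recv - b.bytes_recv) * 8 / INTERVAL / 1_000_000
    tx_mbps = (a.bytes_sent - b.bytes_sent) * 8 / INTERVAL / 1_000_000
    pct = max(rx_mbps, tx_mbps) / LINK_MBPS * 100
    print(f"{nic}: rx {rx_mbps:.1f} Mbps, tx {tx_mbps:.1f} Mbps (~{pct:.0f}% of link)")

If the adapter is nowhere near its capacity, the investigation moves upstream to uplinks, firewalls, and WAN/Internet circuits, typically using the network team's monitoring tools.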

The gray constraint – network-attached storage (I/O or network bound)
If the network delay is in accessing storage (e.g., a NAS device through NFS), then is it an I/O constraint or a network constraint? How about when connecting to a SAN network? Note that both appear as a backlog of I/O requests (I/O queue depth) on the OS, and the CPU statistics would show a high I/O wait state %. In either case, the problem could be with the storage itself, the network, or even the server providing the storage – analysis of all components is needed. In essence, you need to treat this kind of constraint as if it could be either I/O or network and cover both aspects.

Anecdotal Story – Slow at night
Performance for some 24x7 web services is becoming poor at night (alarm thresholds are exceeded, as reported by your monitoring tools). These servers reside in your DMZ network area. You note that all the servers involved have low CPU utilization at night, a low swapping rate, and I/O response times that remain low all night as well. The associated internal database server is the same. Below is a visual representation of the problem. What do you check next?

 

At first, this doesn’t seem to make sense. Why would performance during a low activity period in the middle of the night be worse than during the high activity period of midday? Also, what has changed recently to have this start to happen? Following our troubleshooting best practices, what’s different or better said what’s happening during the time of the slowdowns that’s not happening at other times? This question led the engineers to look at the backup schedules. Sure enough, one particular backup schedule matched the problem timeframe. This backup schedule performed a backup of all the devices in the DMZ (several dozen) all during the same period of time. These backups all had to traverse the firewall to hit the backup server on the internal network. Because they were all executing at the same time, they would overload the firewall and slow down the requests from the application servers in the DMZ to the database servers on the internal network. It was further discovered that recently 4 additional servers had been added to the DMZ, apparently pushing the firewall bandwidth over the edge. Now that the problem was discovered, what could be done to fix it? Well, the immediate action was to split the DMZ backups across multiple backup schedules throughout the evening and early morning instead of all at the same time. This corrected the immediate issue, but there remained the potential of this recurring once more servers were added. Thus, a project was initiated to change how the backups were performed to avoid going through the firewall altogether and instead, performing direct SAN access backups whereby the backup software reads off the SAN instead of through the server that uses the SAN.

 

4.4 Log Analysis Basics
Reviewing log information is a critical component of most troubleshooting exercises. While each log is different, there are some basics common across all log analysis exercises. In this brief section, we wanted to cover some of these basics.

First, it's important to understand what types of logs exist, what kind of information is available in each type of log, and what to look out for:
Operating System – Operating system logs contain information about the overall OS, so they may not be specific to your issue – beware of red herrings.
Platform/Product – Platforms (databases, web/application servers, and similar products/platforms) have their own logs that provide more specific information about the applications utilizing the platform/product. However, they still may not be specific to the problem at hand, so make sure the timings of any errors align.
Application – Custom applications often generate their own logging. These logs would be the most specific to review.
Log aggregation – Some Enterprise IT shops may have deployed a log aggregation tool that brings several logs together into one interface. If this does not exist, then logs need to be reviewed on each server, and there could be several servers involved in any particular problem, making the review a bit cumbersome.

So, what should you look for in logs? Here are some brief general guidelines (a search example follows this list):
Timing – If you know when the problem occurred/started, then finding the location of that time is a good start, as you need to make sure any messages align to the time of the problem. Look out for time zone differences, as some logs contain dates in the local time zone and others use UTC.
Keywords – If you don't know what you are looking for, simply looking for keywords like "error", "exception", "timed out", etc. is a good start. Use string search tools like "grep" (Linux) or "find"/"findstr" (Windows).
Pattern Variance – Look for a variance in the pattern of the log messages. If everything looks uniform and then there's a strange range of messages, this would be good to review.
Repetition – Look for the errors to repeat for every occurrence of the problem, if applicable.
Vendor Documentation – Major vendors provide documentation on most error codes – reviewing this documentation often yields valuable information.
Search Google – When reviewing Operating System and occasionally Platform/Product logs, it's good to search Google for the meaning of messages to make sure they could apply to your situation.
Ask a Developer – When reviewing Application and occasionally Platform/Product logs (if the error is application-generated or a stack trace), then developer involvement is necessary for interpretation.
Look back in time – Many errors in logs are common and not necessarily related to the issue you are investigating. Always remember to look at older log entries to make sure the error messages are not a regular occurrence.
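To make the keyword guideline above concrete, here is a minimal sketch of a first-pass search. The log path, application name, and timestamp are placeholders only; substitute your own file and incident window:

# Linux: case-insensitive scan for common failure keywords (hypothetical path)
grep -iE "error|exception|timed out" /var/log/myapp/app.log

# Narrow to the incident window first, then filter for keywords
grep "2022-12-05 03:" /var/log/myapp/app.log | grep -iE "error|exception"

On Windows, findstr offers a rough equivalent (the space-separated strings are treated as an OR): findstr /i "error exception" C:\logs\app.log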

 

Chapter 4 – Review/Key Questions

1) What are the dimensions of Enterprise IT problems and/or constraints to consider when troubleshooting?

2) What's the difference between a hard and a soft constraint?
3) What are the symptoms of a system being CPU constrained and what are some possible mitigations?
4) What are the symptoms of a system being Memory/RAM constrained and what are some possible mitigations?
5) What are the symptoms of a system being I/O constrained and what are some possible mitigations?
6) What are some of the best practices to consider when analyzing logs?

 

Chapter 5 – Operating Systems Tooling

In this chapter, we will cover the primary tools used and primary statistics to review when troubleshooting an operating system. We will focus on the Windows and Linux operating systems since they dominate the enterprise technology landscape. However, we will start with a brief review of the main features provided by an operating system.

5.1 Operating System Overview and Common Operating Systems

Operating systems provide many services to allow applications to focus on their primary purpose. These services include:
Hardware management – Applications don't need to worry about how to communicate with the specific device that's connected to their server. Applications only need to call a common operating system routine. The operating system worries about the details of communicating with the hardware through device drivers provided by manufacturers.
Memory management & protection – Applications are provided a virtual address space such that they simply utilize the memory provided without having to worry about any other process accessing or modifying their address space. Applications also need not worry about whether there's enough RAM to satisfy their needs, as the operating system will swap in the pages needed.
CPU scheduling (multi-tasking, multi-processing, multi-threading) – Operating systems manage multiple applications all executing concurrently regardless of how many CPUs there are to serve them. Operating systems time-slice the CPU across whatever number of CPUs are available to simulate concurrent activity.
Interrupt handling – Operating systems manage the complexity of hardware devices interrupting with needs like mouse moves, mouse clicks, keyboard taps, disk drives returning data, network devices returning data, etc., and then pass along the necessary information to the appropriate application.
Security – Operating systems are tasked with protecting data and programs such that only authorized personnel can access them.
Etc.

General-purpose operating systems like Windows, Linux, Apple iOS, Android, etc. are designed to allow any kind of application to execute and include user interface services. In this module, we will be focusing on the tools that can be used to troubleshoot (detect constraints) on general-purpose operating systems and specifically on Windows, Linux, and hypervisors. In the next module, we will focus on what we argue should be considered purpose-built mini operating systems (a.k.a. hosting platforms).

Common Operating Systems
Operating systems exist in all computing devices. Thus, it's no surprise that

there are many operating systems used within an Enterprise IT environment. Below is a summary of these operating systems and a little about what they are used for. While all of these operating systems vary in terms of features, administration, and purpose, they all have many things in common, such as the concept of processes, threads & virtual memory, constraints with CPU, RAM, I/O & network, TCP/IP communications, etc.

General Purpose Operating Systems
Most operating systems in the enterprise are considered general-purpose operating systems as opposed to special-purpose operating systems. General-purpose operating systems can be used for just about any application purpose. These

operating systems host application servers, database servers, file servers, print servers, etc. The following is a general grouping of these operating

 

systems:
Windows & Linux – In most enterprises, 90+% of all troubleshooting issues will revolve around Windows or Linux servers. For that reason, these are the only two that are covered in this book. Note that several Linux variants could be found in the enterprise, but most troubleshooting tools are common among them.
Legacy UNIX servers – These are a precursor to Linux. Oracle Solaris, IBM AIX, and HP-UX are the most common. Most of the Linux tools will work on these OS' as well.
Legacy IBM Systems – IBM mainframes (z/OS & z/VSE) still exist in many older companies, as well as IBM's predominant legacy midrange platform (i – aka OS400, iSeries, AS400). These are often referred to as "green screen" systems as they are

character-based, but they are still powerful data processing engines in many companies. We address these only briefly in the application architecture pattern module.
Other Legacy Systems – While exceedingly rare, other legacy operating systems from HP, Unisys, and even the defunct DEC company may still be found in the enterprise.
Apple MacOS – Many enterprises support MacOS for their employees' workstations or at least allow MacOS as part of a BYOD (bring your own device) policy. MacOS is a UNIX (BSD-derived) variant rather than a Linux one, but many of the same command-line concepts apply; MacOS-specific tools are more commonly used than the Linux-style tools.

Special-Purpose Operating Systems
Network routers, firewalls & switches, hypervisors, SANs, load balancers, etc. all utilize special-purpose operating systems (as opposed to general-purpose like the above), most of which are rarely accessed directly – administration consoles are typically the only interface used. Many are Linux variants under the covers, and you may be able to run limited Linux commands if a command prompt is available. Some (especially network devices) have a completely proprietary command-line interface.

 

5.2 Key Statistics and Tools Overview

There are countless metrics available for most operating systems and especially for Windows and Linux, which are the focus of this book. However, there are a few key statistics to focus on when troubleshooting issues and in particular when looking for operating system constraints or, in other words, bottlenecks that could be causing performance degradation.

In this section, we will focus on the key statistics to look for in order to identify whether there is a CPU, networking, memory, or I/O constraint on the overall system. However, even if there is a constraint, this does not mean you have a problem. The key is whether performance is acceptable. If performance is NOT acceptable, then determining where the constraint is located is warranted. Having a baseline of these metrics from when performance is acceptable, for comparison, is a great way to determine if you have a problem. Note that just because the overall system is not constrained doesn't mean that an individual process is not constrained.

CPU related key statistics
To determine if a system is CPU constrained, the following statistics are critical (in order of importance):
Processor Queue Length – How many processes are waiting for CPU. This should generally be less than the number of CPUs on the system but is certainly a problem once it consistently exceeds 2x the number of CPUs on the system.
CPU Utilization % – This is a good gauge of the capacity you have available for growth, but not necessarily a gauge of CPU constraint unless "flatlined" at 100%, in which case there's a high likelihood of CPU constraint.
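The tools that expose these two statistics are covered in the following sections, but as a quick illustration, both can be sampled from the command line. The intervals and sample counts below are arbitrary choices, and the Windows commands are assumed to be run from a PowerShell prompt:

# Windows: sample the processor queue length and overall CPU, 12 samples 5 seconds apart
typeperf "\System\Processor Queue Length" "\Processor(_Total)\% Processor Time" -si 5 -sc 12

# Linux: the "r" column of vmstat approximates the run/processor queue
vmstat 5 12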

If an individual process is taking excessive CPU that seems to be unwarranted, it can be suspended or terminated, but best practice is to obtain a memory dump before terminating the process to allow for diagnosis.

Memory related key statistics
To determine if a system is RAM/memory constrained, the following statistics are critical (in order of importance). However, often a combination of statistics is needed to confirm a constraint:

 

Major faults – Depending on the operating system or tool being used, major faults are a key "per process" statistic to review because they indicate that a page is needed and not available in memory for that process; the page needs to be retrieved from disk. Major faults will occur from time to time and especially at the start of a process, but if a process is experiencing a large number of major faults (a hundred or more per second consistently), then that process is likely memory constrained. NOTE – minor faults (or just "faults", depending on the tool) are not a concern.
Paging swap rate (in & out) – How many pages are being swapped in and out per second from the "swap/page file" at the operating system level. This is a similar metric to major faults; however, the paging rate is at the operating system level as opposed to the process level as with major faults. The acceptable paging rate varies depending on the speed of your disk, but generally, if it gets over a hundred per second consistently, then there could be a memory constraint. NOTE: Initial page faults (demand paging) are normal and not a problem (e.g. loading the executable from the disk where the program is stored). However, this distinction (initial vs. secondary page fault) is not always clear when reading statistics from tools, so be careful to understand what you are measuring – in general, pay more attention to the paging-out rate than the paging-in rate. Also, on Windows, most of an executable is loaded pre-emptively using "prefetch logic" and this will also skew numbers.
System or Kernel CPU % – If the "system" process or kernel CPU percentage is consistently over 25%, this is another sign that the swapping rate may be getting too high; this could also be caused by heavy context switching, so alone it is not a good measure.
Swap space used – The more swap space used, the higher the likelihood of swapping.
RAM Utilization – Although least in terms of importance, the amount of free RAM can indicate how much RAM is available for growth, and free RAM is usually an indicator that the system

 

is NOT RAM constrained.

I/O related key statistics
To determine if a system is I/O constrained, the following statistics are critical (in order of importance). However, often a combination of statistics is needed to confirm a constraint:
I/O Queue Length – How many I/O requests are pending for any particular drive. The amount that would constitute a constraint depends on the speed of the I/O for that device; however, if the queue length is greater than 5 consistently, then there is some concern.
CPU I/O Wait % – If the system has no work to do other than wait for I/O, this is a clear sign of an I/O constraint. Anything over 25% is a possible sign, but other factors need to be considered.
Disk Utilization % – The percentage of time that a particular drive is utilized. While not a clear indicator, if drive utilization is consistently at 100%, then you are likely I/O constrained.
I/O Response time – Local disk should respond in under 5ms and usually within 2ms. SAN and NAS-based I/O could be as much as 10ms, although often less. If I/O response time is consistently above 20ms, then you may have a constraint within the storage infrastructure causing I/Os to delay. NOTE: This does not apply to remote or cloud-based storage where longer response times are expected and completely dependent on distance.

Networking related key statistics
Since networking involves other systems, this is the least likely constraint to detect with local operating system tools. Most networking issues are not rooted within individual hosts, but instead at some chokepoint within the network (e.g., WAN circuit, core switch uplink, firewall, load balancer, IPS device, etc.). Still, if the overall network bandwidth utilization is consistently at the capacity limit of the NIC (network interface card) of the server, then there could be a local constraint. Also, noting the amount of bandwidth being used, and by which processes, is helpful for comparison to baselines.

Operating System Tool Overview
There are many operating system tools and they monitor a variety of metrics.

 

It's important to understand the kind of metric that any particular tool is providing. Is the metric for the entire server or a particular process or storage device? Is the information simply configuration information or dynamic system metrics? Below are the different kinds of information provided by these operating system tools:
Overall System Stats – Many tools look at how the overall system looks in terms of CPU, RAM (paging), I/O, and Network.
Process Specific Stats – Many tools look at how individual processes are utilizing CPU, RAM (paging), I/O, and Network.
System Configuration – Some tools report on the current system configuration (e.g. IP address, DNS server, disk volumes, CPUs, etc.).
Process Activity – Some tools can trace the activity of a process - what it's doing external to itself (e.g. talking to the network, opening a file, opening a registry key, starting another program, etc.).
Internal process information – Some tools allow you to acquire a memory dump so that the vendor or a developer can determine certain characteristics like:
Why is the process taking so much CPU?
Why is the process taking up so much RAM?
Why is the process crashing?
Logging Activity – Some tools simply provide an interface into the system and application logs.

Some tools can do more than one of these functions and ALL of these functions can be important depending on the problem that you are troubleshooting. It's important to understand the kind of information you can glean from each tool.

5.3 Windows Tools

Most (but not all) Windows tools used for troubleshooting are GUI-based as opposed to text-based. Having said this, understanding how to operate in the Windows command-line interface (cmd.exe) is still critical to be proficient with troubleshooting on Windows. Also, basic Windows performance

 

troubleshooting can be performed with a single tool that comes standard on all machines – "Task Manager". For more in-depth Windows performance troubleshooting, "Resource Monitor" would be the best tool to use. Some Windows tools may require administrative privileges, but most do not. Before going any further, it's important to note that the rest of this chapter will assume that you can log into the server to perform the tasks required. For Windows, logging into a server is typically performed via RDP (Remote Desktop Protocol). The Remote Desktop client is automatically available on Windows machines and can be installed on a Mac as well (Microsoft Remote Desktop 10 or higher in the App Store). This interface allows the user to work with the Windows desktop of the server as if it were his/her own machine. Now, assuming local RDP access, from a process perspective, when dealing with a performance problem on Windows, it's important to determine the overall health of the system first – is it suffering from CPU, RAM, I/O, or Networking constraints? If the overall system is constrained, then the focus shifts to determining which process(es) are causing the constraints – this can be challenging if a system is running many processes. Having a baseline is important to be able to more quickly spot an anomalous situation. If the overall system is not constrained, then the focus shifts to whether the process(es) in question are constrained. For example, is there a process that's taking up one and only one CPU all the time? If so, that process could be single-threaded and thus constrained even though the overall system is fine. Another example could be that a process may be 32-bit and thus limited to 4GB of RAM and could be constrained even though the overall system is fine. Lastly, a process could be limited by product soft constraints (e.g. only so many concurrent sessions/users or some other kind of throttling).

Task Manager
Task Manager is a tool that comes pre-loaded on all Windows operating systems. It's the first tool to look at in terms of determining the state of a Windows machine as it provides a consolidated view of most key aspects. In other words, it's a great place to start looking to see if there are any constraints on the system. To start Task Manager, simply right-click on the Taskbar and then select Task Manager.

 

Task Manager will open on the "Processes" tab, which can give you a quick glimpse at the big four (CPU, RAM, Disk & Network) and which processes are using the most of them. Please note that you can sort by any of these measures by clicking on the column header (e.g. by clicking on "CPU" or "Memory"). Also, from this tab, you can end a process or create a memory dump; however, it is not recommended to create a memory dump from Task Manager as it creates a mini-dump that is not as useful. Process Explorer or the Debug Diagnostics Tool is preferred for memory dumps.

While the Processes tab provides a lot of info, the "Performance" tab provides a graphical view of the total usage of the big four over the last minute. While this timeframe is limited, it provides a little more insight. Also, this tab includes a link to "Resource Monitor", which will be covered next. The key purpose of these two tabs is to begin to determine whether the system is constrained and, if so, what type of constraint it is. Once you determine this, then identifying which process(es) may be causing the constraint is key to figuring out the next steps.

The last tab of interest in Task Manager is the “Details” tab. This tab

 

provides a wealth of information and allows you to better understand the nature of any process. For example, the "Processes" tab doesn't provide the PID, nor does it provide the name of the service, if that's what is of interest. However, in the Details tab, you can find all of this out and more. The first key step is to expose all of the columns. To do this, you need to right-click on any column header and then click on "Select Columns". You should then select all the columns. As with the Processes tab, you can sort by any column simply by clicking on the column header. So, what are some key columns to look at? Here's a brief, non-comprehensive list:
- PID – Useful for finding the process in other tools
- CPU – Current CPU usage
- Username – If identifying the security context is needed
- Working set (memory) – Current RAM in use
- Commit size – Size of the virtual address space
- PF Delta – How many page faults are occurring per second
- I/O columns – Note these are totals since the process started, not the current I/O rate
- Command Line – This allows you to identify details like the service name and how the process was started
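Much of the same detail can also be pulled without the GUI, which is handy when scripting or when an RDP session is slow. A small sketch follows; the PID (4321) is just a placeholder, and wmic, while still present on current Windows Server builds, is deprecated in favor of the PowerShell CIM cmdlets:

# Map each process to the Windows services it hosts (useful for svchost.exe instances)
tasklist /svc

# Show the full command line for one process (PID 4321 is hypothetical)
wmic process where "ProcessId=4321" get ProcessId,CommandLine

# PowerShell alternative to wmic
Get-CimInstance Win32_Process -Filter "ProcessId=4321" | Select-Object ProcessId, CommandLine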

Resource Monitor
The next Windows tool to focus on is Resource Monitor. As mentioned in the section above, Resource Monitor can be reached from the "Performance" tab in Task Manager. You can also simply type Resource Monitor on the Windows search bar. Resource Monitor allows you to drill a little deeper into each of the big four, but it's particularly interesting for Disk and Networking information. Resource Monitor has 5 tabs, including an Overview tab and then one tab for each of the big four (CPU, Memory, Disk & Network). Each tab provides a list of the active processes, and if you right-click on a process you can "end", "suspend" or "resume" the process. As stated earlier, suspending an unknown troublesome process is a better short-term approach than ending the process as it allows you to restore service while you investigate. The "Overview" tab gives you a summary of each of the other tabs, but if you are going into Resource Monitor, you should already have an idea of what you are looking for (e.g. confirmation of a particular kind of constraint), so the Overview tab is typically not very useful. The "CPU" tab displays multiple boxes. The top box shows all of the processes and their current CPU usage, PID, & number of threads. As with Task Manager, you can sort by any column by clicking on the column header. This top box also allows you to select a process to see the file handles it currently has open – not a feature you may need often, but good to be aware of. Other than the file handles, this does not provide too much more info than is available on Task Manager.

The next tab on Resource Monitor is the "Memory" tab. This lists memory-related stats for all the running processes. These are similar to the key ones called out above to view with Task Manager. Hard Faults/sec is a key indicator of a memory constraint, as there should not be many for any process. However, it's important to note that the process suffering the most Hard Faults/sec is not necessarily the process causing issues – it could simply be a victim. As with other screens, each column can be sorted.

 

The next tab on Resource Monitor is the "Disk" tab. This tab again lists all the processes with their associated key disk I/O statistics. The second box shows the I/O per file. If you select a process from the first box, then the files that specific process is reading from and writing to appear in the second box. A key metric to review in this second box is "Response Time" – as stated earlier, anything consistently over 20ms could be cause for concern. The last box provides overall statistics for the disk drives themselves. This last box is most telling in terms of whether the system is I/O constrained or not, as it provides "Active %" and "Disk Queue". As stated earlier, if a drive is consistently at or near 100% active and the disk queue is consistently above 5, then there's reason to be concerned.
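When a baseline of these disk indicators is needed rather than a live view, the same counters behind Resource Monitor can be captured from the command line with typeperf. The interval and sample count below are arbitrary; note that "Avg. Disk sec/Transfer" is reported in seconds, so 0.020 corresponds to the 20ms threshold mentioned above:

# Sample queue length and per-transfer latency for all physical disks, 24 samples 5 seconds apart
typeperf "\PhysicalDisk(_Total)\Current Disk Queue Length" "\PhysicalDisk(_Total)\Avg. Disk sec/Transfer" -si 5 -sc 24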

The last tab on Resource Monitor is the “Network” tab. The reason for viewing this tab is not to look for a networking constraint as it’s rare that a networking constraint could be detected by Resource Monitor, but instead to view the wealth of networking information provided by this tab. As with the other tabs, the first box is a list of the processes with their associated networking statistics and like the other tabs, you can select a process to look at the details of its networking activity in the below boxes. However, if you do not choose a process, you can see all of the information for all of the processes in the below boxes. This includes how many bytes each process is sending/receiving to/from each IP address, what ports/protocols are being

 

used, how much packet loss is being experienced, what the latency is with each connection, and which processes are listening on which ports/protocols. This is very valuable information when troubleshooting network issues.
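On Windows 8/Server 2012 and later, much of this listener and connection detail is also available from PowerShell when the Resource Monitor GUI is not an option; netstat, covered under Other Tools below, is the older command-line equivalent. The remote port below is only an example:

# Listening TCP endpoints with the owning process ID
Get-NetTCPConnection -State Listen | Select-Object LocalAddress, LocalPort, OwningProcess

# Established connections to a given remote port (443 used as an example)
Get-NetTCPConnection -State Established -RemotePort 443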

Other Tools

Process Explorer – Downloadable from Microsoft. This tool provides a process tree view (displaying the parents of processes – this is a unique feature); you can use this tool to discover the open file handles, registry keys, threads, semaphores, etc. in use by a process (also available from Resource Monitor). This tool is also useful to suspend, terminate or resume processes as well as perform "full" memory dumps for analysis.

Process Monitor (procmon) – Downloadable from Microsoft. This tool shows real-time file system, Network, Registry, and process/thread activity (tracing of activity is unique to this tool, but only useful in specific circumstances). Use with caution as it can impact performance; start as "procmon /Noconnect" to allow for configuring exactly which PID, user, activity, etc. you are looking for.

Perfmon – Comes with Windows. Can be used to monitor any operating system statistic, including a critical one not readily available elsewhere (processor queue length – found under "\System" – or pages input/sec & pages output/sec – found under "\Memory"). Use this tool with caution as it can take up a significant amount of resources.

"System Information" – Type this on the Windows search bar to launch a GUI with system information including the number and type of CPUs, OS version, RAM, etc.

XPerf, XPerf Viewer, WPR, WPA, SPA – The XPerf tools are older (not supported for newer OS versions), while WPR (Windows Performance Recorder), WPA (Windows Performance Analyzer), and SPA (Server Performance Advisor) are for Windows 2012 and beyond. All of these tools record activity and collect tons of information from your system for analysis. While useful for working with support, other tools may be easier to use and less impactful on your server.

netstat (netstat -abno) – This command-line tool allows you to see what ports are opened by which processes (information also available from Resource Monitor, but using netstat is less impactful from a resources perspective).

ipconfig (ipconfig /all) – A command-line tool that displays IP interfaces with associated IP addresses.

route (route print) – A command-line tool that displays (and updates) the local network routing table.

Of the above commands, the command-line tools ipconfig and route are the most interesting to look at a little more deeply. These tools will not indicate any kind of constraint, but they provide very valuable networking configuration information for the system that you are on.

ipconfig /all

 

The ipconfig command provides the current IP configuration for the system that you are using. This includes the IP address, MAC address, subnet mask, DNS, DHCP, WINS, etc. for the host and each network interface. Some key concepts to realize when reviewing ipconfig output are that not all interfaces are actively connected (e.g. your Ethernet may not be connected if you are on WiFi, and your Bluetooth may not be connected to an IP network either) AND there are often "virtual" interfaces on Windows systems (e.g. a VPN connection). Let's look at an example output:

Windows IP Configuration

   Host Name . . . . . . . . . . . . : Laptop8030N5R
   Primary Dns Suffix  . . . . . . . : fiu.edu
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : fiu.edu
                                       cs.fiu.edu

Ethernet adapter Local Area Connection* 13:

   Connection-specific DNS Suffix  . : fiu.edu
   Description . . . . . . . . . . . : Juniper Networks Virtual Adapter
   Physical Address. . . . . . . . . : 02-05-85-7F-EB-80
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   IPv4 Address. . . . . . . . . . . : 10.110.32.26(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.255
   Default Gateway . . . . . . . . . : 0.0.0.0
   DNS Servers . . . . . . . . . . . : 10.32.3.23
                                       10.20.4.31
   NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter Ethernet:

   Media State . . . . . . . . . . . : Media disconnected
   Connection-specific DNS Suffix  . : cs.fiu.edu
   Description . . . . . . . . . . . : Intel(R) Ethernet Connection (4) I219-LM
   Physical Address. . . . . . . . . : 80-CE-62-9F-1D-82
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes

Wireless LAN adapter Wi-Fi:

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Intel(R) Dual Band Wireless-AC 8265
   Physical Address. . . . . . . . . : E4-70-B8-26-96-4D
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   IPv4 Address. . . . . . . . . . . : 192.168.1.77(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 192.168.1.254
   DNS Servers . . . . . . . . . . . : 192.168.1.254

 

   NetBIOS over Tcpip. . . . . . . . : Enabled

An interesting question that may be asked from reading this output is: which DNS server will be used, since two different interfaces are providing DNS servers? For the answer to that question, you need to look at the next command – route print.

route print
The "route" command has a few options, including "print", "add", "change" and "delete". As the options imply, the "add", "change" and "delete" options allow you to make changes to the routes defined on your Windows system. The "print" option, meanwhile, simply provides an output indicating what routes are currently defined. This "print" option is critical to understanding how traffic flows out from your Windows system. As with the ipconfig command, the output has several sections, including a listing of the interfaces with their interface numbers (useful for applying any changes), then a listing of the routes, and finally any persistent routes – persistent routes are saved in the registry and re-applied after every reboot. Below, let's review a sample output:
=======================================================

Interface List
 13...02 05 85 7f eb 80 ......Juniper Networks Virtual Adapter
 12...80 ce 62 9f 1d 82 ......Intel(R) Ethernet Connection (4) I219-LM
 20...e4 70 b8 26 96 4d ......Intel(R) Dual Band Wireless-AC 8265
  1...........................Software Loopback Interface 1
=======================================================

IPv4 Route Table
=======================================================
Active Routes:

 

Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0    192.168.1.254    192.168.1.77     291
          0.0.0.0          0.0.0.0         On-link     10.110.32.26       1
         3.7.35.0  255.255.255.128    192.168.1.254    192.168.1.77     291
     3.21.137.128  255.255.255.128    192.168.1.254    192.168.1.77     291
        8.5.128.0    255.255.254.0    192.168.1.254    192.168.1.77     291
     10.110.32.26  255.255.255.255         On-link     10.110.32.26     256
     52.197.97.21  255.255.255.255    192.168.1.254    192.168.1.77     291
    52.202.62.192  255.255.255.192    192.168.1.254    192.168.1.77     291
     52.215.168.0  255.255.255.128    192.168.1.254    192.168.1.77     291
       64.69.74.0    255.255.255.0    192.168.1.254    192.168.1.77     291
        127.0.0.0        255.0.0.0         On-link        127.0.0.1     331
        127.0.0.1  255.255.255.255         On-link        127.0.0.1     331
  127.255.255.255  255.255.255.255         On-link        127.0.0.1     331
      192.168.1.0    255.255.255.0         On-link     192.168.1.77     291
     192.168.1.77  255.255.255.255         On-link     192.168.1.77     291
    192.168.1.255  255.255.255.255         On-link     192.168.1.77     291
=======================================================

 

Persistent Routes:
  Network Address          Netmask  Gateway Address  Metric
          0.0.0.0          0.0.0.0    192.168.1.254  Default

So, given the above information, which DNS servers will be used? Well, first we need to look at the IP addresses of each of the DNS servers. The VPN interface notes its DNS servers as 10.32.3.23 & 10.20.4.31, neither of which has a specific path, so they would fall under the 0.0.0.0 route in the routing table above – this has a metric of "1" associated with it. The other DNS server, which is associated with the physical wireless adapter, is 192.168.1.254. This IP would be covered by the 192.168.1.0 route in the routing table above – this has a metric of "291" associated with it. Since the DNS servers associated with the VPN interface have a lower metric (1 vs. 291), those DNS servers would be used instead of the ones associated with the wireless adapter.

Event Viewer
Aside from identifying constraints with Task Manager & Resource Monitor and identifying network configuration information with ipconfig & route print, another critical action when reviewing a Windows machine is reviewing the logs. The main tool for this is "Event Viewer" (just type "Event Viewer" on the Windows Search bar to access it). There are 2 primary folders to look into for log information (there are others): "Windows Logs" and "Applications and Services Logs". You would typically start with the System, Application, and Security logs (in that order) under "Windows Logs"; however, pertinent information can also be located within the "Applications and Services Logs". Any events that appear to be related to the issue at hand can then be researched online for relevance.

Each event in a log entry contains the following information: Date, Time,

User, Computer, Event ID (A Windows identification number that specifies

 

the event type), Source (program or component that caused the event), Type (the type of event, including information, warning, error, security success audit or security failure audit). For example, an information event might appear as:
Information   5/16/2018 8:41:15 AM   Service Control Manager   7036   None
A warning event might look like this:
Warning   5/11/2018 10:29:47 AM   Kernel-Event Tracing   1   Logging
By comparison, an error event might appear as:
Error   5/16/2018 8:41:15 AM   Service Control Manager   7001   None
A critical event might resemble:
Critical   5/11/2018 8:55:02 AM   Kernel-Power   41   (63)
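On a busy server, filtering from PowerShell is often faster than scrolling the Event Viewer GUI. A minimal sketch, assuming you only care about Critical (level 1) and Error (level 2) events from the System log over the last two hours; adjust the log name and time window to your incident:

# Critical and Error events from the System log over the last two hours
Get-WinEvent -FilterHashtable @{ LogName='System'; Level=1,2; StartTime=(Get-Date).AddHours(-2) } |
    Select-Object TimeCreated, Id, ProviderName, Message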

Whenever reviewing logs, you must be aware that many log messages may seem like a problem but may have nothing to do with your problem. There are many backend processes always executing, and some of those may generate errors that perhaps should be addressed but are not necessarily related to the issue you are troubleshooting. You need to be wary of going down rabbit holes. Vendor documentation and Google searches are a good resource for determining the applicability of most messages, but confirmation/remedy may require contacting vendor support. Vendor support could be someone other than Microsoft depending on the SOURCE of the error. Although not as common, developers may write to an application log; for any such messages, application development support would need to provide the interpretation of the message.

Regedit
Another important aspect of the Windows operating system that an administrator needs to be familiar with is the Windows Registry. The Registry is a hierarchical database where most products and core operating system functions store configuration information (just type "Regedit" on the

Windows Search bar to access it). Unfortunately, the registry is large, and its organization can be confusing and complex, so a deep dive into the Registry

 

is outside the scope of this book. Still, it's important to note because some vendor instructions may point you to entries in the Registry and even call for you to modify the Registry. The Registry can be accessed with the regedit tool (regedit.exe) that is pre-installed on all Windows machines. The regedit tool should be used with great care.
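When vendor instructions reference a registry value, it is often safer to read it from the command line than to browse with the editor, and to export the key before making any change. The vendor key and file path below are examples only:

# Read a single value without opening the editor
reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion" /v ProductName

# Back up a key to a .reg file before a vendor-directed modification (hypothetical key/path)
reg export "HKLM\SOFTWARE\SomeVendor\SomeProduct" C:\temp\somevendor-backup.reg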

PowerShell
PowerShell is a task automation framework from Microsoft which includes a sophisticated scripting language. Most Enterprise IT environments use PowerShell to perform all sorts of automated support tasks, such as setting up a new server or changing a specific configuration setting on a group of servers. From a troubleshooting perspective, PowerShell includes many useful standalone commands called "cmdlets" that can be used to test connectivity, verify settings, etc. So, while PowerShell is beyond the scope of this book, it is recommended for anyone that plans to be a Windows Server administrator.

Anecdotal Story – A failing import

A batch import process is failing, but the purchased application is not providing any detailed information as to the reason for the problem. You find out that the file storage location from which the import grabs its files was moved over the weekend as part of a large project. You change the import location from the name “oldsan” to the new IP address (10.80.200.14), but it’s still failing. Backing out this change is not possible without impacting a lot of other systems, so you need to figure out why this is broken. You call the vendor, but historically the vendor has been poor about responding. What can you do in the meantime to figure out what’s happening?

 

This is an interesting problem because it is not related to a capacity issue (CPU, RAM, I/O, and even Network are not constrained). Instead, it's simply a failure with an unknown reason. So, what tools can be used in such a situation? The primary tool that should always be considered when errors are occurring in applications is "Event Viewer", as most products will write out to logs and/or the error could generate a system error. Another possible tool is "Procmon" since this particular process is easily duplicated and can be traced by name (the name of the failing import process). Reviewing the "Event Viewer" log, there was an error generated indicating that the network path "\\;RdpDr\;:0\10.80.200.14\groups1\business\prod\...." could not be accessed. This seemed like a very odd path, but it appeared related to our problem because the IP address 10.80.200.14 appeared in the message. A quick search on the Internet indicated that this occurs when trying to access a Windows Netbios name that does not exist. At this point, it dawns on you that you changed the location from a name (oldsan) to an IP address (10.80.200.14) thinking that there was no difference, but that may not be the case if the application is expecting a Netbios name. You changed the location from the IP address to the new name (newsan) and the process began to work. A day later the vendor confirmed what you were suspecting.

5.4 Linux Tools

Linux has a wide variety of commands to allow an administrator to detect constraints, and almost all are free. The vast majority of these commands are issued from the command line on Linux (the shell). Becoming proficient with

a Linux shell (bash is the most common) is a pre-requisite to being able to debug system issues on Linux. Most Linux systems have similar tools

 

available and pre-installed, but for some tasks adding a tool is useful. Most tools are available for most flavors of Linux and are freely downloadable. Learn how to find and install tools on the Linux flavor you are using (e.g. apt-get, zypper, yum) – NOTE: Most enterprises have standards in terms of which tools to use and how to install them, so ask before you act. As with Windows, when troubleshooting in Linux, it's important to determine the overall health of the system first – is it suffering from CPU, RAM, I/O, or Networking constraints? If the overall system is constrained, then the focus shifts to determining which process(es) are causing the constraints – this can be challenging if a system is running many processes. If the overall system is not constrained, then the focus shifts to whether the process(es) in question are constrained (e.g. single-threaded, process-specific RAM limitations, or product soft constraints).

Some basic commands
ls -al – Display a list of files in the current working directory
cd – Change directory; this command accepts 1 parameter (the directory you want to move to)
pwd – Display the current working directory
cat – Display the content of a file; this command accepts 1 parameter (e.g. cat filename)
grep – Search for a string in a file or set of files; this command accepts several parameters, but in its simplest form it accepts the string you want to search for followed by the file(s) you want to search in (e.g., grep text filename)
man – Short for manual – provides the available options for any command (e.g. man grep)
sudo – Runs the command as administrator – aka "root" (assuming the user is allowed) (e.g. sudo grep text filename)

Commands useful to detect CPU & RAM constraints
top – The top command should be the first command to look at when trying to determine if a Linux system is constrained in any way. It provides CPU & RAM usage information for the overall system and the top processes in terms of CPU usage; it also shows the load average (a key metric to determine the wait queue) and uptime. While the top command accepts some parameters, it's generally started with no parameters. Below is a screenshot of the output from "top":

 

Here's an explanation of all that the top command displays:
The time the system has been running since the last reboot
Number of users
Load average – What does "load average" mean? These 3 numbers indicate the average number of processes that are actively using the CPU or waiting for I/O over the past 1 minute, 5 minutes, and 15 minutes. If this number is consistently higher than the number of CPUs on your system, then your overall system may be constrained, albeit you can't determine the type of constraint simply from this metric.
Number of active processes
Number of processes in each state
Overall CPU percentages – us=user time, sy=kernel (system) time, ni=prioritized time (rare), id=idle time, wa=I/O wait time, hi=HW interrupt time (rare), si=SW interrupt time (rare), st=steal time (time stolen by the hypervisor). The key is idle time (id): if it is consistently 0, then you are constrained. If I/O wait time (wa) is taking a great deal of the time (greater than 25%), then you could be I/O constrained. If kernel time (sy) is taking a great deal of the time, then swapping may be the issue. Any of the others taking all of the time would indicate a CPU constraint.
RAM in KB
Swap space in KB
Top CPU-usage processes, with their virtual address space size (VIRT) and RAM used (RES)

If you are CPU constrained, then this list is key to determining which processes are causing the problem.

ps – The next command of note on Linux from a troubleshooting perspective is the ps command. This command lists information about ALL current processes. There are countless options for the ps command. Any information needed about running processes can be extracted with this command (similar to the Details tab in Task Manager on Windows) – it's meant to help you better understand any given process. The most commonly used options are aux (e.g. ps -aux). The information available with this command is detailed below. Another common tack with the ps command is to pipe it to grep to only look for a specific process (e.g. ps -aux | grep http). From a RAM perspective, minor & major page faults by a process can be reported with the following ps options: ps -eo min_flt,maj_flt,pid,cmd.
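Building on the ps options just shown, piping through sort surfaces the processes generating the most major faults; in this field order, maj_flt is column 2, and the head count of 15 is arbitrary:

# Processes sorted by major page faults (column 2), highest first
ps -eo min_flt,maj_flt,pid,cmd | sort -k 2 -n -r | head -15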

Below is the output of the following ps command: ps -aux | grep -E 'PID|docker'
First, about the command issued: this is piping the output of the ps command to grep with the -E option for extended regular expression support and then providing two strings to look for in the output (PID & docker). The reason to look for both is to retain the first line of the output, which contains the column titles, making it easier to understand the output. As for the output, it's similar to the output provided by top, but a key

distinction is that ps will display ALL processes while top only displays the

 

top CPU users. Also, ps can provide a lot more info depending on the options provided. In this case, we can see the COMMAND issued to start the process, the START date of the process, and the CPU TIME it has taken in minutes and seconds since it was started. The ps command can be used to find the process(es) using the most RAM with the following options: ps -eo rss,vsize,pid,args | grep -v COMMAND | sort -k 1 -n -r | more
This produces a list of processes sorted by "rss" (RAM size) as the sort key.

vmstat – Another very common troubleshooting command is vmstat. This command displays system CPU, RAM & swapping rates over a time interval provided as a parameter in seconds (e.g. vmstat 1 will display the information every second). Below is the output displayed by the vmstat command:

The "procs" section shows the number of running (r) processes and the number of blocked (b) processes. The blocked processes are in essence the processor queue and thus a critical metric to determine if the system is CPU constrained – a value consistently 2 times the number of CPUs or more is another indicator of CPU constraint. The "memory" section displays the available memory. The "swap" section is key to determining whether the system is RAM constrained, as it displays the memory pages swapped in (si) and swapped out (so) – if these values are consistently in the hundreds or more, then you may be RAM constrained. The "io" section displays total bytes read from storage (bi) and bytes written to storage (bo); however, there are better tools to determine whether the system is I/O constrained, as this command does not provide enough context around I/O. The "cpu" section is similar to what was discussed with the top command.

sar – Yet another popular command is the sar (system activity report) command. This command also has countless parameters and can report on just about any system metric desired. Some administrators pre-create a series of sar commands with different parameters to utilize when an analysis of a server is needed. However, covering all of these options is beyond the scope


 

of this book. One example of the sar command is to list out some RAM statistics over a time interval provided as a parameter in seconds. Below is the output of the following command: sar -B 1

The key column to focus on would be the “majflt/s” column as this is a clear indicator of whether the system is RAM constrained. If this value is consistently in the hundreds or more, then the system may be RAM constrained. lscpu – The lscpu command lists the CPUs, their state, and type. It’s a simple, but useful command. There are a few options to the command, but the basic command with no options is usually sufficient. Below is the output of the command: lscpu

This output indicates that this server has 1 CPU of type Intel Xeon Gold clocked at 2.30GHz, although this VM has it measured at 2164.672 MHz (slightly slower). cat /etc/os-release – Determining the version of the operating system is also important during troubleshooting and this can be achieved by simply listing

out the contents of the /etc/os-release file. The following is a typical output

 

although it will vary by OS and installation procedures:

strace – The strace command traces the activity of a specific process (similar to procmon in Windows, but it needs to be attached to a running process or used to start a process). To attach to a running process, you would use the following format: strace -o output_file -p pid -yy. To start a program with strace, use the following format: strace -o output_file program. While this command is useful in terms of troubleshooting specific issues, as with procmon, covering it in detail is outside the scope of this book.

gcore – The gcore command creates a memory dump of the specified PID. As with Windows, this is very useful for debugging problematic processes. There are only a couple of options for this command and typically the -o (output file) option is the only one needed. Sample command: gcore -o output_file pid

kill – The last command to cover in this section is the kill command. As the name implies, the Linux "kill" command is used to terminate, suspend or resume processes. Examples:
kill pid – Terminate the process with the PID provided (e.g. kill 945)
kill -STOP pid – Suspend the process with the PID provided (e.g. kill -STOP 945)
kill -CONT pid – Resume the process with the PID provided (e.g. kill -CONT 945)
kill -9 pid – Force terminate the process with the PID provided (e.g. kill -9 945)

Commands useful to detect I/O constraints
iostat – The iostat command displays I/O activity by disk over a specified interval of time in seconds. This command has many options, but -xt is the most commonly used. For example, iostat -xt 1 will provide the stats shown below every 1 second. This is an excellent command to use to determine whether the system is I/O constrained and which mount point (disk) is I/O constrained. The key metrics to focus on would be the following:

 

avgqu-sz – This provides the average size of the disk queue for each disk. This is a critical indicator of how much of an I/O constraint each disk has. An average disk queue consistently greater than 5 is a reason for some concern.
await – This provides the average wait time (in milliseconds) for each I/O, including the time spent waiting in the disk queue. This is another key indicator, as an average wait time consistently greater than 10ms is subpar and greater than 20ms is likely a problem.
%util – While the least critical indicator, if this % is at 100% consistently, then it is also a cause for concern.

iotop – The iotop command displays overall system I/O and the “top” processes in terms of I/O utilization. This is the next command to look at in terms of investigating I/O constraints as it gives you an indication as to which process(es) may be causing the I/O constraint.
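iotop generally requires root, and a couple of its options make the output easier to act on; pidstat (from the same sysstat package as sar and iostat) is a reasonable fallback when iotop is not installed. The 5-second interval is arbitrary:

# Only show processes actually performing I/O, per process (not per thread), with accumulated totals
sudo iotop -oPa

# Per-process disk I/O statistics every 5 seconds
pidstat -d 5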

df – The df command displays the disk space available and in use. This command is more relevant when disk space is the issue versus there being an I/O constraint, but, obviously, running out of disk space is a serious issue in and of itself. There are several options to the df command, but -k (show the output in kilobytes) is the most common (e.g. df -k).

 

du – The du command displays the amount of disk space in use by each file or directory starting from a given path. The du command is critical in terms of finding which file(s) and/or directories are taking up all of the space. There are many options to the du command, but the most common is the -d option to indicate how deep to traverse the given path. For example, issuing the command du -d 1 awsvpcb-scripts indicates that you only want to see the amount of space used in the awsvpcb-scripts directory and one level of directories below it.
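A common pattern is to combine du with sort to see the largest directories first; /var is used here purely as an example path, and sort -h (human-readable sort) assumes GNU coreutils:

# Largest first-level directories under /var, biggest first (permission errors suppressed)
du -h -d 1 /var 2>/dev/null | sort -h -r | head -10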

/dev directory – While not a command, it's important to note that to match the mount points from the "df" command to the devices in the "iostat" command, you need to refer to the /dev directory. Simply issue the following ls command: ls -l /dev/mapper/* to view this mapping (sample below):

Commands useful to detect Network issues
netstat – The netstat command provides many options to display all current network connections (source/destination IPs & ports and associated processes). This is similar to the netstat command on Windows and the "Network" tab on Resource Monitor on Windows. One common use of netstat is to find which processes are listening on which ports on the local machine using netstat -tulnp. Sample output is shown below:

A broader use of netstat is to identify all active connections (not just the listeners), and for that the following can be used: netstat -apnotu. Sample

output below:

 

iftop – The iftop command displays the most active network connections in a bar graph format; includes source and destination IP addresses and port numbers. This is useful if you suspect that the system is network constrained, albeit this is typically rare at the host level and more likely at some oversubscribed point in the network (e.g. WAN link, firewall port, switch uplink). Below is some sample output of the iftop command:

ifconfig – The ifconfig command displays network interfaces on the server with their current configuration. This command can also be used to modify the configuration of each network interface. This command does not provide as much information as the ipconfig command on Windows but is as close a

comparison as we have. Below is some sample output of the ifconfig command:

 

/etc/resolv.conf – This is not a command, but instead a file that contains DNS

configuration information (DNS servers and suffix list). This is a critical file to review if there are issues with name resolution. Below is a sample output for cat /etc/resolv.conf :

route – The route command displays the local network routing table on the

server (e.g. what interface should be used for what destination IP and the default gateway) and also allows you to make modifications. This is analogous to the same command on Windows, but the options and syntax of the command are different.

Linux Logs
As with Windows, simply looking for constraints is not enough to analyze the health of a server. All the metrics may be fine and still the server is having problems. Another key aspect of troubleshooting on a Linux server is to review the system logs and, if appropriate, some of the application logs. Although not necessarily respected by all products and applications, the standard convention on Linux systems is to place all logs in the "/var/log" directory. The core Linux system logs reside in this directory.
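A quick first pass over this standard location, before digging into the individual files described below, might look like the following. Note that Debian/Ubuntu systems use /var/log/syslog instead of /var/log/messages, and journalctl is only available on systemd-based distributions:

# Recent entries in the general system log, filtered for likely problem keywords
tail -n 500 /var/log/messages | grep -iE "error|fail|panic"

# On systemd-based systems: errors and worse from the last hour
journalctl -p err --since "1 hour ago"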

 

The "messages" file in the /var/log directory is considered the general system log. This file should be monitored for critical errors. As with most Linux files, it is text-based and can be viewed using "tail", "cat" or "more". The "grep" command would also be a good option to search for a string. Other log files can be reviewed as applicable, including the "boot.msg" file for bootup messages. /var/log/messages will mostly be informational, like the following:
Mar 11 10:45:01 MyServerName syslog-ng[2303]: Configuration reload request received, reloading configuration;
Mar 11 10:45:01 MyServerName syslog-ng[2303]: New configuration initialized;
Mar 11 10:45:02 MyServerName run-crons[48583]: logrotate: OK
However, there could be key errors like this one indicating a network link going down:
Mar 30 06:32:45 MyServerName kernel: [541322.867110] e1000e: eth0 NIC Link is Down
OR even serious failures like kernel panics (hard crashes of the OS) like this:
[ 1561.519959] [Hardware Error]: Machine check: Processor context corrupt
[ 1561.519960] Kernel panic - not syncing: Fatal Machine check
[ 1561.519962] Pid: 0, comm: swapper/5 Tainted: P M C O 3.2.0-35generic #55-Ubuntu
[ 1561.519963] Call Trace:
[ 1561.519964] [] panic+0x91/0x1a4
[ 1561.519971] [] mce_panic.part.14+0x18b/0x1c0
As with Windows, vendor documentation and Google searches are a good resource for determining the applicability of messages, but confirmation/remedy may require contacting vendor support. Vendor support may not be the support for the operating system. Third-party products may write to the system log as well, although more often each product has its own log.

Linux Configuration Information
While Windows has a centralized repository called the registry to hold configuration information for the operating system and installed products,

Linux has a standard convention of using the /etc directory instead. As with

 

the use of "/var/log", this convention is used by most installed products. Typically, each product would create a subdirectory under /etc for its configuration files. As with many other aspects of Linux, these files are typically text-based.

Anecdotal Story – Just give it more horsepower
Users call in about very poor performance on a critical web application. A colleague notes that the load average on the application server is 15 and there are only 4 CPUs on the server, so we need to add CPUs. Is there anything else that should be checked to confirm this diagnosis and, if so, what?

This seems like a reasonable first impression, as a load average of 15 on a 4-CPU server is a clear indication that the server is constrained. However, there are several flaws with simply adding CPUs based on this one metric. Here are a couple of these flaws:
1) The system may not be CPU constrained – The load average metric does not mean that the system is CPU constrained – it just means that the system is constrained by something. It could be RAM or I/O constrained instead of CPU constrained, and adding CPUs may not help at all.
2) There could simply be a runaway or problematic process taking up all the resources (CPU or otherwise) – If the problem is a looping process or something like that, then you can add all the CPUs that you want and the problem will not go away.

So, before adding CPUs, it would be wise to 1) verify the type of constraint using the tools provided (e.g. vmstat, sar, iostat, etc.) and then 2) verify that the problem isn't simply one runaway process that should be suspended for analysis – a memory dump with gcore would be a possible next step in this case. If neither of these is the case, then adding CPUs may indeed be warranted, albeit analysis as to why this started to occur all of a sudden would also be warranted.

5.5 Agent-Based Monitoring (ITIM Tools)

 

While it is often necessary to connect directly to a server for troubleshooting as described in the previous sections, there are also monitoring solutions at most Enterprises that allow you to monitor resources (to varying degrees of detail) from a centralized, typically web-based interface. These tools typically require an “agent” (small executable) to be installed on each server so that it can report information to a centralized location. These tools are often classified as ITIM (IT Infrastructure Monitoring) tools and they will typically monitor more than just operating systems. There are many tools in this space from vendors like CA, BMC, Solarwinds, Dynatrace, AppDynamics, IBM, Microsoft, Microfocus, Kaseya, etc. These tools are quite useful in terms of allowing you to quickly scan the performance and state of many systems as well as see the historical trend of utilization, however, once you have honed in on a system that is challenged, then the effectiveness of the tool (how much detail can you get from it?) will determine whether direct access to the server for further troubleshooting is warranted – It is often the case that direct access is preferred. When at tool a problem with a server, the first step woulddifferent typicallynow be to reviewlooking the ITIM to determine if something is materially

versus when there wasn't a problem, as this kind of historical comparison is not possible with the local

 

operating system tools.

5.6 Hypervisor Overview and Tools

A hypervisor is typically a general-purpose (GP) operating system that includes applications that can host other operating systems. While it is generally best to only utilize the provided GUI to manage the hypervisor, under the covers, hypervisors utilize Windows or Linux kernels and can be accessed as such. A guest OS is simply a single multi-threaded process within the hypervisor. This means that the entire guest OS is provided with virtual memory and scheduled for CPU by the hypervisor OS in the same way as any other process. Furthermore, a process within the guest OS is provided with virtual memory and scheduled for CPU by the guest OS, which is, in turn, getting it from the hypervisor. The hypervisor provides a "Hardware Abstraction Layer" or "virtual hardware" to fool the guest OS such that it thinks it is accessing real memory, real storage, a real network card, etc. A couple of important benefits to virtualization are "live migration" and "automatic failover". For these to be possible, there needs to be a cluster of physical hosts connected to the same storage and using the same hypervisor software. Given the high rate of virtualization in most enterprises, troubleshooting often requires not just seeing how the Virtual Server is performing, but also the underlying host. To this end, all Hypervisors offer built-in monitoring

interfaces with their products. This allows the administrator to determine if

 

there is a capacity constraint (RAM or CPU) at the host level and if so, which VM guest(s) are taking up most of the resources. It's important to establish if it's just the guest OS that's constrained and not the host as the remedy to either situation is very different. If the host machine is constrained (CPU or RAM), then moving guests off that host to a lesser-used host is the remedy as opposed to adding CPU or RAM to the guest server. On the other hand, if the host machine has plenty of resources and it's the guest OS that's constrained, then the remedy could be to add virtual CPUs or RAM to the guest server.

Containers – The Mini Hypervisors

 

Containerization has begun to gain popularity, especially among the large cloud providers. Containerization is simply another form of virtualization, where instead of virtualizing an entire operating system, you virtualize a mini-OS that's just enough to support a particular application. Unlike hypervisors, containers run on top of a general-purpose host OS (typically Linux) and do not perform hardware management. In fact, containers are most effective when the OS they reside on is not virtualized to avoid double virtualization. Docker is the predominant container provider, but there are others like Apache Mesos & Windows containers. Container providers only serve containers on a single host OS, but to truly unleash the power of containerization, you need to create a cluster (similar to a virtualization cluster) whereby if one machine fails, the others take over. To create a cluster, you need management software like Kubernetes from Google or Docker Swarm. Despite its growing popularity, containerization has yet to take off with most Enterprise IT shops and thus will not be covered further in this book.

 

5.7 Cloud Tools

Running your applications in the cloud means that you are giving up some level of control to the cloud provider to take advantage of the benefits they can provide. This in turn means that monitoring your services will also

require different tooling. As such, all cloud providers offer some level of monitoring of their services. Many of these tools are somewhat limited but have been improving over time. Amazon Web Services (AWS), for example, includes monitoring for all services. These can be viewed as follows:
- Individually within the "monitoring" tab of each instance or service as selected in the AWS console
- Or within a customized view/dashboard created in the "CloudWatch" service
While high-level monitoring (overall CPU, RAM, Network, and I/O usage) is available, detailed per-process or connection info is not available. Logging information is also available from within the "CloudWatch" service. Alarms can be created to provide email notifications or take automated actions when thresholds are surpassed, or certain information is detected in the logs.
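The same CloudWatch metrics can also be pulled from the AWS Command Line Interface, which is handy during an incident bridge. The sketch below is illustrative only – the instance ID and time range are made up:

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2021-03-11T10:00:00Z --end-time 2021-03-11T12:00:00Z \
  --period 300 --statistics Average

This returns the average CPU utilization of one EC2 instance in 5-minute intervals, which can quickly show whether utilization changed around the time a problem was reported.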

5.8 Time Synchronization

While it's not immediately obvious, time synchronization across an IT Enterprise is critical, in particular for security-related functions. For example, Kerberos (used for Windows authentication/authorization) will fail if server

times are more than 5 minutes apart from each other. NTP (Network Time Protocol) is a hierarchical system of time sources. Each level of this hierarchy

is termed a stratum and is assigned a number starting with zero for the

 

reference clock at the top. A server synchronized to a stratum n server runs at stratum n + 1. The number represents the distance from the reference clock and is used to prevent cyclical dependencies in the hierarchy. NTP uses UDP port 123.

General best practices:
- Have a small number of servers (e.g. DNS servers) that query at least a Stratum 1 server on the Internet for time; this becomes your internal time authority
- Do NOT allow VM guests to acquire time from the host
- Windows best practice – point the PDC to your internal time authority and have all BDCs get time from the PDC
- Linux best practice – point each server to the internal time authority
Stratum 0 devices are not typically publicly accessible, however, there are many Stratum 1 NTP servers publicly available. A list is available at http://support.ntp.org/bin/view/Servers/StratumOneTimeServers. Time.Windows.Com (Windows non-domain environment default) is a Stratum 2 service.
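When time drift is suspected during troubleshooting, each platform provides a quick way to check synchronization status (which Linux command applies depends on whether the ntpd or chrony implementation is in use on that system):

w32tm /query /status     (Windows – shows the stratum, time source, and last successful sync)
ntpq -p                  (Linux with ntpd – lists the configured peers and current offsets)
chronyc tracking         (Linux with chrony – shows the reference source and the current clock offset)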

 

5.9 Troubleshooting Operating Systems Summary

This brief section is a summary of the chapter. The basic premise of this chapter was that there is a problem and you need to look at a Windows or Linux server to determine if the problem is detectable from an operating systems perspective. These steps are just meant to provide a basic guideline as to the typical things to look for:
1) Identify all the servers that could be involved in the problem as all of them should be checked. This alone can be a challenging step if you do not understand all of the application dependencies, but this is a critical step.
2) If you have access to an ITIM product, then FOR EACH SERVER check CPU (non-idle % utilization), RAM (% utilization, paging rate/hard faults), and I/O (disk queue, disk utilization %, disk free space %, I/O response time). It's important to see if any of these key metrics have changed recently and specifically since the reported problem began.
3) If you do not have access to an ITIM product, then log in to each server and check for the same using the tools covered in this chapter. If symptoms indicate that the system is CPU, RAM or I/O constrained, then determine what processes may be causing the constraint.
4) If specific processes seem to be the problem, then look up the process on the Internet to see if it's a well-known process. If it is a well-known process, then research carefully to make sure that you do not make things worse by suspending or killing the process. Here are a couple of well-known generic processes on Windows that host different workloads depending on the configuration. For both of these, taking a memory dump before performing any other actions would be a good idea. Also, for both of these, restarting can be performed, but through specific tools,

as follows:

 

w3wp.exe is the standard IIS worker process on Windows, so you need to find the app pool associated with it and speak to a developer. Restarting such a process should be done within IIS Manager (discussed in a later chapter). Performing a restart is

generally OK, but it will briefly impact current user sessions. svchost.exe is the standard service process on Windows, so you need to find the service using Task Manager details (commands that help map these generic processes to their app pool or service are shown right after this list). After identifying the service, you need to use "services.msc" to stop or restart the service.
5) If the process(es) are not well known from Internet searches (e.g. likely a homegrown application), then suspending the process as opposed to killing it would be a better immediate action allowing for research.
6) If there are no constraints, then look in the logs to see if any errors align with the issue at hand. It's important to line up the errors from a timing perspective to avoid chasing red herrings as there are always errors in the logs and most are not related to the problem at hand.
7) If you find an error in the log, research it with vendor documentation or the Internet to see if there's a relation, but also check with the developers to see if the error aligns with what is happening. Contact the vendor if there is any chance that this could be an issue with their product.
8) If no errors in the log appear to align or there's no progress identifying the root cause, look to see what programs were recently installed to see if anything aligns to the time the issue began.
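As a practical aside, two built-in Windows commands make the mapping described in step 4 much faster (the exact output varies by Windows version, but both commands are standard):

tasklist /svc /fi "imagename eq svchost.exe"     (lists every svchost.exe PID and the services it hosts)
%windir%\system32\inetsrv\appcmd.exe list wp     (lists each w3wp.exe worker process and its IIS application pool)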

 

Chapter 5 – Review/Key Questions

1) What are the metrics to consider when trying to determine if a system is CPU constrained?
2) What are the metrics to consider when trying to determine if a system is Memory/RAM constrained?
3) What are the metrics to consider when trying to determine if a system is I/O constrained?
4) What are the tools on Windows that can help you identify whether a system is constrained or not?
5) Which metrics can be determined using which tool on Windows?
6) What are the tools on Linux that can help you identify whether a system is constrained or not?
7) Which metrics can be determined using which tool on Linux?
8) Which commands allow you to view network configuration on Windows and Linux?
9) How can you review system log information on Windows or Linux?
10) What are the steps you should follow to inspect for problems on a server?
11) What are the advantages of ITIM tools?

 

Chapter 6 – Network Tooling

In this chapter, we will cover the tools that can be used to troubleshoot problems from the network perspective. A key concept to understand is that network tools are not strictly used to detect network problems. On the contrary, most of the time network tools are used to detect non-networking issues. The reason is that if you understand what you are looking for in a networking conversation, then it can tell you a great deal about what is going on with the applications involved. To get us started with networking, we will start with a brief review of the OSI layers that we covered back in chapter 2.

6.1 Summary of the Network Layers

It's important to understand the philosophy behind each layer so that as you troubleshoot, you obtain insights that could help you find a solution. For example, layers 1 & 2 are about getting to the next device on the same

network. It's all about the immediate next communication and thus is more concerned with the physical world (electrical signaling) than much else. Meanwhile, layer 3 is about one host (IP address) reaching another host (IP address). Depending on the network in between, the path could be quite complex and may need to traverse several types of networking devices (switches, routers, firewalls, etc.). However, in the end, it's simply about how to best get one host to talk to another. Once you get to Layer 4, the focus shifts away from the network to the hosts involved in the communications. The IP stack on each host determines which process to send the communication to and the rules of the conversation. Lastly, there are layers 5-7, where the focus shifts to the applications speaking with each other. These applications determine the rules for their conversation.

 

6.2 Layers 1 & 2 in Enterprise IT Environments

Before we move on to the common troubleshooting topics of layer 3 & layer 4, it's important to understand how layer 1 and layer 2 connectivity typically occurs within an Enterprise IT environment. This section is broken up into two distinct topics – LAN & WAN. LAN (Local Area Networking) connectivity, as the name implies, focuses on one location or site while WAN (Wide Area Networking) focuses on the long-distance connectivity between locations.

LAN (Local Area Networking)
The diagram below is a representation of a typical Enterprise IT local area network for a location that includes servers (on the left) and a sizable number

of users (on the right). Of course, some locations only have one or the other, but it's not uncommon for there to be locations that have both. There are several key points to call out in the diagram below as follows:
- Redundancy – Note that all the switches at the center have fully redundant connectivity. This is also true for the top of rack switches, firewalls, routers/circuits to the Internet, and routers/circuits to the internal network. The only part of an Enterprise IT network that is typically NOT redundant is the user access switches, as the expense is rarely worth the added benefit.
- Layer 2/3 Switches – The core network switches are usually both layer 2 and layer 3 devices. This means that VLANs can span across them, but also, they can route traffic to avoid extra hops. For instance,

if a server needs to talk to another server that resides in a different

 

VLAN/subnet, that communication only needs to go through the server switch, not to the core. The top of rack switches and user access switches, however, are typically only layer 2 type devices, so communication does have to go to the next level to be routed to another

subnet.
- Soft Center Design – The diagram below is of a traditional design with firewalls only at the edge of the network. Some modern zero-trust networks place the firewalls at the center of the network controlling all traffic between all subnets. This is still not the most common design, but it is gaining traction in some industries.

 

 Local Area Networking Speeds

Network bandwidth - Measured in bits per second as opposed to Bytes per second (NOTE: Storage throughput is measured in Bytes per second). WAN measurements are also in bits per second. With computer systems, people think in bytes, not bits, but with networking it's bits; for example, a 10Gb port can move at most roughly 1.25 GBytes of data per second. All of the below is in bits per second accordingly.
Host Machines:
- Server port speeds are typically 1Gb to 40Gb
- Server rack backplane network speed is usually 10Gb to 160Gb
- Workstation port speeds are typically 100Mb to 1Gb
Intermediate devices:
- Top of rack switch port speeds are typically 1Gb to 40Gb
- User access switch port speeds are usually 1Gb
Devices at the center:
- Server, Core & Distribution switch speeds are usually 1Gb to 100Gb (these are usually similar type devices)
Devices at the edge:
- Firewall port speeds are typically 10Gb to 40Gb
- Router port speeds are typically 1Mb to 10Gb
Oversubscription – Networks are always designed with fairly high levels of oversubscription. This means that if all host machines (servers or workstations) attempted to consistently transmit at their full capacity (or even half of it), they would overwhelm all the networking devices in the LAN. This is why networking issues are rarely on a host machine, but instead somewhere in the network, especially the lower bandwidth long-distance circuits.

WAN (Wide Area Networking)
The diagram below is a representation of a typical Enterprise IT wide area network including multiple data centers, large corporate locations, smaller remote offices, and even work from home offices. Of course, the number and

typical size of offices vary greatly depending on the type and size of the

business.

 

WAN Concepts and Considerations









Point-to-point private circuit* - These are private & dedicated circuits with guaranteed bandwidth. However, the circuit only allows for one location to communicate with one other location.
MPLS (Private mesh network)* - These are also private circuits with guaranteed bandwidth, however, they are a part of a large shared provider network and can communicate to any location that has a circuit on the same network.
Internet circuit (can add VPN for encrypted privacy) – These are circuits that connect you to the Internet. While you can purchase guaranteed bandwidth, Internet speeds to any given destination can never be guaranteed and prioritization of traffic is not possible. VPN connectivity can be added for privacy between locations through encryption.
Bandwidth (bits per second) vs. Latency (~1 ms per 50 miles + router queuing = RTT) – It's important to understand that bandwidth and latency are two very different concepts even though they can both impact performance. Bandwidth is simply how many bits you can send through the circuit per second – if your need for bits per second

exceeds your bandwidth, then you will suffer performance issues. Meanwhile, latency is how long it takes a packet to get to the

 

destination and back. Thus, latency is dependent on distance, so it’s different for each destination. Latency is not an important consideration for real-time UDP conversations because applications are not waiting for packets to return. However, TCP conversations need









acknowledgments to continue sending, so latency can cause performance issues.
Symmetrical vs. Asymmetrical – Most at-home Internet connections are asymmetric in that the download speed is much greater than the upload speed. These types of connections are less expensive and can be used at smaller Enterprise IT locations, however, business class symmetric Internet circuits (same speed for download and upload) are more common and definitely what's used in large offices and data centers.
Centralized Internet Access – For security reasons, many Enterprise IT environments do not allow outbound Internet access from all locations that have Internet circuits. Instead, these Internet circuits are simply used as a way to provide communications back to the centralized data centers via VPN where the Internet traffic can be analyzed by security devices like IPS (intrusion prevention systems), Web Content Filters, Malware detection, and DLP (data leak prevention) software.
Circuit redundancy – Redundancy is even more important for long-distance circuits due to the higher probability of issues (e.g. wire cuts). However, secondary circuits do not need to be the same size or type (e.g. an MPLS circuit could be backed up by an Internet VPN connection).
Network Optimization Tools – Many Enterprise IT environments utilize special devices at the edge of each of their locations to optimize the traffic by using compression, caching, and local acknowledgments to hide the impact of latency. These tools can dramatically improve performance, but also introduce issues, so knowing that they are involved is important.

 Layer 1 & 2 Network Monitoring

 

Most Enterprise IT organizations monitor bandwidth utilization across long-distance circuits as well as major uplinks into server switches, distribution switches, core switches, firewalls, and routers. Except for low-cost Internet circuits, all bandwidth is symmetrical (LAN & WAN), so bandwidth could be constrained in one direction and not the other (e.g. inbound traffic could be high, but outbound traffic could be low). Once bandwidth regularly exceeds 80% utilization in either direction, an analysis should be considered as any burst in activity can cause performance issues. Beyond 90% utilization, negative effects are likely to begin. During troubleshooting, it's critical to have access to network monitoring to rule out bandwidth constraints somewhere in the network as a possible root cause. If high utilization somewhere in the network is a possible root cause, then the next step would be to identify the conversations causing the high usage. Most of these monitoring tools can detect the conversation driving the utilization as well. So, what are the symptoms of high utilization? With TCP conversations, bandwidth limitations will cause high rates of retransmissions or out-of-order packets. This not only causes extra work but also limits window size increases, which means that acknowledgments are needed more often. All of this begins to exponentially impact performance. With UDP traffic, bandwidth limitations will cause timeouts (DNS) and jitter

(VoIP/Video). QoS (Quality of Service) will limit the impact on time-sensitive real-time communications even at 100% utilization, but router

 

buffer overrun can occur if bandwidth remains at 100% continuously, making QoS a moot point since the packet never gets into the router to be prioritized.

6.3 Layer 3 Tools and Considerations

Layers 1 & 2 are the first place to look in terms of networking issues as high utilization is typically easy to identify and thus easy to rule out. Having said that, many networking issues occur at the higher layers, and detecting issues at these higher layers is not as simple. Please review the chapter 2 review on layer 3 for a foundation as we will pick up from there in this chapter. As noted above, layer 3 involves getting packets from one IP address to another. If the two IP addresses are within the same subnet, then this becomes more of a layer 2 conversation as ARP (Address Resolution Protocol) is used to identify the MAC address of the other host and then communication occurs in that manner without needing to go through a gateway/router. This brings us to an interesting layer 3 concept that is implemented using ARP – virtual or floating IP addresses.

Virtual IP Addresses
All devices must have an IP address to communicate with the rest of the network. This device IP address can either be statically applied (as with servers) or dynamically provided w/DHCP (as with PCs & mobile devices). However, devices on the network can advertise multiple IP addresses, not just one. Anything other than the base IP address for the device is typically

considered a virtual IP address or VIP. VIPs are most commonly used to represent a "cluster" of multiple redundant servers that all serve the same purpose. Load balancers, for example, use a VIP to represent a "pool of servers". VIPs can "float" (sometimes called "Floating IP addresses") from one device to another. This is common for failover purposes for the cluster that they are representing. It is simple for a VIP to move from one device to another as long as they reside in the same broadcast domain – a device simply performs a gratuitous ARP (as opposed to responding to an ARP request) declaring that its MAC address on its port on the switch is now to be considered this IP address.

Let’s take a look at an example.

 

In the above example, there are two servers (Saturn & Neptune) that can serve the application. The type of service being provided by these servers is irrelevant. The key is that the application server (192.168.128.110) needs the service provided. The application server does not know anything about Saturn or Neptune; it only connects to the VIP or floating IP address 198.168.138.130 to get the service it needs. This VIP would likely be referenced by a DNS name (e.g. service.acme.com). Under normal operations, the Saturn server responds to the VIP by replying to ARP requests for the MAC address associated with this VIP. Any configuration or real data stored on Saturn is constantly replicated to Neptune such that Neptune is always up to date and a duplicate of Saturn. There is also a heartbeat occurring every few seconds between the two servers such that each knows that the other is alive and well. If Saturn were to fail for some reason, then Neptune would detect the failed heartbeat and declare itself the new master by performing a gratuitous ARP to the local VLAN/subnet stating that its MAC address is now the owner of VIP address 198.168.138.130. All future requests will now start to go to Neptune instead of Saturn. So, what could go wrong with a VIP failover? First, it's important to note that this mechanism works perfectly when the primary/master server (Saturn in this case) fails with a hard failure. Unfortunately, in the real world, systems

often brown-out instead of fail. That is, the system is still up and running,

 

but responding so slowly that it might as well be down. In such a case, several things can go wrong, such as:
1) Saturn is almost unusable but working enough to get the small heartbeat to Neptune within whatever the timeout threshold is set to. In this case, a failover never occurs.
2) Saturn could be suffering sporadic issues whereby it's not available for some time and then available again. In this case, Neptune may take over temporarily and data could be lost because requests were transferred to Neptune temporarily. Worse, this could happen again and again.
3) The communication between Saturn and Neptune fails and they both think they should be primary because they think the other server is down. This can cause what's called a "split-brain" situation, which would likely be the worst scenario.
Of course, vendors that produce products that use this technology are aware of these potential issues and attempt to implement mitigations to keep them from occurring, however, it's important to understand them as they may still occur.

Testing Connectivity and Routing
As noted earlier, the key function of layer 3 is to "route" a packet from one host to another. In the previous chapter in the Windows section, we discussed in detail what a routing table looks like on a server. Such routing tables exist on layer 3 network devices (e.g. core switches, routers & firewalls) across the

enterprise. Also, these routing table entries are shared by these devices via a "routing protocol" such as BGP, iBGP, EIGRP, OSPF, RIP, etc. The intricacies of these routing protocols are beyond the scope of this book, however, it's important to understand the key concept that routes are shared throughout the network. For example, a bad route injected into some portion of the network can affect other parts of the network. Route entries include a "cost" or "metric" as noted in the previous chapter. It's not just how to get a packet to a certain destination, but which is the preferred path. Thus, if a router at one location is inadvertently changed to make it the preferred path for an IP range it does not

own, then this route could propagate throughout the network and suddenly, all packets to that IP range will, in essence, disappear as they will route constantly to a router that sends them in a circle until the packet expires.

 

So, what commands can we use to determine whether at least this routing is working as expected? Well, there are a couple of commands that can prove the routing is working, but even if they fail, it does not necessarily mean that the routing is broken. If these commands do not provide a definitive answer, then network engineers are needed to trace the routes from one device to another until the problem is found.

ping
The ping command is a simple utility to verify that layer 3 (host to host routing) is working between hosts. The command has a few options to control the size of the packet, how many to send (or to send until a ctrl-c is hit), etc. The ping command accepts either an IP address or a DNS name (e.g. ping www.acme.com or ping 10.10.2.3). The command sends a packet to the destination host and then waits for a reply. If a reply is not received in a certain amount of time, then it's considered a failure. Typically, 4 or 5 attempts are made by the ping command, albeit this can be overridden with an option.
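For instance (the destination address here is illustrative), the count option differs slightly between platforms:

ping -c 10 10.10.2.3     (Linux – send 10 echo requests instead of the default)
ping -n 10 10.10.2.3     (Windows – same idea; Windows defaults to 4 attempts)
ping -t 10.10.2.3        (Windows – ping continuously until ctrl-c is hit)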

While the ping command is very useful and can be used to confirm layer 3 is working as expected, if it fails, that does not mean that there is a problem. The reason is that pings can be blocked by firewalls, so a failed ping doesn't necessarily mean that host-to-host routing is broken. The ping command uses the ICMP protocol (this protocol is not communicating between processes so there is no concept of a port) and for security purposes, this protocol is often blocked.

traceroute or tracert
The traceroute command on Linux and the tracert command on Windows are essentially the same. The traceroute command has several options, but typically you only need to pass in the destination host (DNS name or IP as with ping). The traceroute command will then send packets with a TTL of ever-increasing amounts. In terms of IP packets, TTL is reduced each time that a router touches a packet. When TTL hits 0, an error is sent back to the original host. This allows the traceroute command to identify each stop along the path from its starting point to the destination IP.

By default, tracert on Windows uses ICMP (the Linux traceroute actually defaults to UDP probes, though it can be told to use ICMP), and as discussed with ping, these probes can be blocked by firewalls. So, as with ping, just because a traceroute command fails, it does not mean that routing is broken between the hosts.

 

The key with traceroute is to analyze the route if the command does not fail or analyze whatever portion of the route did not fail to make sure that it’s routing as expected. Even if the entire route does not populate, the portion that does can provide clues as to whether an appropriate path is being taken. Lastly, just because a portion of the route does not populate does not mean that at some point the rest of the route will not start up again. This is because routers and firewalls can be configured to not provide TTL expired responses for security reasons and yet still allow the packets through. Thus, once the packets get to a device that does respond again, the trail continues. Below is a sample traceroute command:

The first line indicates the destination that is trying to be reached (my.fiu.edu – 13.35.118.68). Then, each entry is the next hop on the path and approximately how long the RTT (round trip time) is to that hop. Whenever you see * * *, this means that there was no response from that hop – this would typically be a firewall. Despite the firewalls and/or non-responding routers, you can see that eventually, we arrived at the destination. An interesting note regarding this traceroute is that a couple of the hops went to different IP addresses (hops 7 & 8). This indicates that there are multiple devices you can traverse at this stage and they are both viable.

Asymmetric Routing
Before moving away from layer 3, there's one last topic that needs to be addressed – asymmetric routing. This is when the path from one host to another is different on the way back. In other words, when you go from 10.1.2.3 to 198.168.1.2, the path the packets take is different from when the packets go in the opposite direction (from 198.168.1.2 to 10.1.2.3). Asymmetric routing is not necessarily a problem except when a firewall sits in between one of the paths and not the other. Under such conditions, the

firewall will only see half the conversation and likely reject the packets, causing a failure in the communications.

 

The best way to detect asymmetric routing is to perform a traceroute in both directions (e.g. from both hosts involved in the conversation). If the routes are different, then a network engineer needs to determine if a firewall is in either path stopping the flow.

6.4 Name Resolution

Resolving a DNS name to an IP address seems like it should be a simple topic, however, due to how DNS is implemented across products, operating systems, and the Internet, it gets quite complicated, and thus many things can go wrong. Whenever debugging connectivity issues, the first step should be making sure that name resolution is working as expected. In this section, we will look at all the steps involved with name resolution and the things that can go wrong. The first thing to understand about name resolution is that several factors

need to be considered as follows:
1) DNS Servers – Most resolution occurs through DNS, so it's important to understand which local DNS servers are being used by your system. This was covered in chapter 5. Are these local DNS servers accessible? If there's a problem resolving a name, do you have another DNS server you can try instead? If the name you are trying to resolve is public, then you can try Google's DNS, which is IP 8.8.8.8.
2) Host file entries - All operating systems provide a mechanism to override the IP address for any given name using a hosts file. In Enterprise IT, the use of hosts file entries is discouraged for anything other than temporary testing situations as they can lead to unexpected results as things change. On Linux, the hosts file is /etc/hosts and on Windows, it's located at C:\Windows\System32\drivers\etc\hosts. On both platforms, a # makes the line a comment. Also, an IP address can be associated with multiple names by simply separating the names with a space as with the gargoyle server in the sample below:
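(The original sample file is not reproduced here; the entries below are a representative reconstruction with made-up addresses, keeping the gargoyle example from the original.)

# hosts file – temporary overrides only
10.10.20.15   webtest01.acme.com
10.10.20.31   gargoyle gargoyle.acme.com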

3)  DNS Cache – All operating systems and some products cache DNS names to improve performance. While this certainly improves

performance, it also introduces a great deal of complexity in terms of knowing how long changes will take to propagate across all host

 

machines. The amount of time that any particular DNS entry is cached is determined by the TTL (time-to-live) property associated with that name in the DNS database. Thus, each DNS name can have a unique TTL.
4) DNS Registrar – Since most enterprises need to communicate on the Internet, they need to make their DNS names available on the Internet. To do this, your DNS name needs to be registered through a third party such that it is added to one of the root servers on the Internet (e.g. ".com", ".net", ".edu", etc.). Also, your DNS servers need to be known and accessible on the Internet. The whois command allows you to query registered DNS names as well as registered IP ranges.
5) WINS – While rarely an issue nowadays, the legacy Windows name service (WINS) is still present in many enterprises and needs to be a consideration as it will also resolve a name to an IP address.

whois
The whois service provides registration information for the domain given (e.g. fiu.edu). This service is best accessed through freely available hosted solutions like https://whois.icann.org or https://www.whois.com/whois (there are many of them). Here's sample output for FIU:
Domain Name: FIU.EDU

Registrant:
   Florida International University
   Tamiami Trail
   Miami, FL 33199
   US
Administrative Contact:
   …..
Technical Contact:
   Domain Admin
   …..
Name Servers:

   NAMESERVER2.FIU.EDU
   NAMESERVER1.FIU.EDU

 

Domain record activated: 20-Nov-1989
Domain record last updated: 26-Sep-2020
Domain expires: 31-Jul-2021
The key output to look for is the list of "Name Servers:" as these will be queried to determine the IP address to search for. In this example, they are nameserver1.fiu.edu and nameserver2.fiu.edu.

Name Resolution Walkthrough
Let's assume that we try to access my.fiu.edu from our PC. What are the steps that occur in the background when we do this in order for that name to be translated to an IP address? Below are 10 common steps.

 

DNS Cache
As you can tell from the above, DNS cache plays a big role in name resolution. Due to this, knowing how to view and/or clear DNS cache on a particular system is important. The following are commands that are useful

on Windows and Linux to accomplish this:
ipconfig /displaydns - Windows command line statement to display the DNS cache
ipconfig /flushdns - Windows command line statement to flush (erase) the DNS cache
nscd -i hosts or rndc restart - Linux commands (depending on implementation) to flush (erase) the DNS cache on the localhost. NOTE: There are other implementations, but nscd & rndc are the most common. Neither the nscd nor rndc DNS caching implementation on Linux provides the capability of easily displaying the current DNS cache, so flushing it is the only option.

nslookup
The nslookup command-line tool (Linux or Windows) resolves a name to an IP address using DNS. This tool includes many options for debugging purposes. Sample usage: nslookup my.fiu.edu. An important consideration is that nslookup only deals with DNS resolution, thus host file entries are not considered. So, just because a name resolves to a certain IP address using

nslookup does not necessarily mean that this is the IP address that would be resolved by other applications since host file entries would override the DNS result. The nslookup command offers more than simply name resolution and actually provides a command prompt to allow for other commands. If you simply type nslookup, then you are greeted with the nslookup command prompt. The following commands are available from this prompt:
? – Display all commands available
ls – list name server records (several options)

set nosearch – do not append domain search list (by default, when you perform a name server lookup, several lookups are performed

 

by appending domain names to the name you provided – this facilitates the use of short names, which are considered bad practice)
set d2 – debug mode to display all DNS requests and responses (one of many other set options)
server NAME – change your local DNS server for subsequent commands to the NAME of the server provided. The NAME can be a DNS name or IP address (e.g. you can use 8.8.8.8 to try Google's DNS).
NAME – perform a DNS lookup for the NAME provided

Anecdotal stories
1) You perform an nslookup command on a name and get IP 1.1.1.1, but when you perform a ping command you get back a different IP address

– what could be happening?
nslookup my.fiu.edu – answers 1.1.1.1
ping my.fiu.edu – answers 1.2.3.4
Since nslookup only looks up the name using DNS, there must be something causing ping to pick up a different IP address. The only likely possibility would be a hosts file entry. There's likely a hosts file entry as follows:
1.2.3.4 my.fiu.edu
2) You accidentally changed the IP address for an internal name and then changed it back a few hours later, but some servers are still failing to resolve the name properly. What could be happening and what can you

3) Per your instructions, provider changes thepair Name Servers for resolving your domainyour on the Internet to a new of DNS servers. You check whois to confirm the change and then turn off your old

 

Name Servers. Soon, you start receiving complaints from many customers that they cannot reach any of your websites. What could be happening and what can you do about it?
Again, DNS caching is the cause. All those Name Servers on the Internet still have your old DNS server IP addresses cached as the resolver for your domain. It will take time, likely hours, for your new DNS servers to be propagated throughout the Internet. So, what do you do? Switching the entries back to the old DNS servers doesn't make sense as you eventually need to make this change. Also, some machines on the Internet may have already cached your new DNS servers, so you would still have a problem. The answer is quite simple. Just start up your old DNS servers. This way both the old and the new are available for name resolution until everyone on the Internet refreshes their cache to reflect the new servers.

Short names vs. Fully Qualified Names (FQN)
Up to now, we have discussed fully qualified names (FQNs) like my.fiu.edu or server_name.fiu.edu. Using FQNs is a best practice. However, within many enterprises, there is the ability to use "short names". This is typically to allow for simply typing in the server name without needing to append the domain. While this is certainly convenient, it can lead to issues depending on how it's accomplished. The two most common methods of accomplishing this are by using a domain search list or WINS (Windows Internet Naming Service).

 

Domain Search List or DNS Suffix Search List
The local IP configuration on a PC or server can include (and often does) a "Domain search list" or "DNS Suffix Search List". These are analogous terms. This is a list of domains that will be automatically appended to

whatever name you attempt to access in order for a DNS lookup to be performed against the derived FQN. This list is ordered, as once a name is found, the searches cease. For example, if your Domain Search List includes "fiu.edu" and "cs.fiu.edu" in that order, then if you attempt to access the name "my_server", your local IP stack will automatically attempt to find "my_server.fiu.edu" and then if not found, attempt to find "my_server.cs.fiu.edu". Within enterprises, on PCs, this list is typically configured within DHCP as additional IP configuration settings added when a computer joins the network, while on servers, this is typically included in build scripts or manually assigned. The "ipconfig /all" command on Windows or "cat /etc/resolv.conf" on Linux can be used to check the suffix list.

WINS (Windows Internet Naming Service)
WINS is a legacy Windows naming service for NetBIOS names. This technology is still supported by Microsoft but considered deprecated and should not be deployed any longer. However, many enterprises may still have it active. By default, WINS resolution occurs only if DNS resolution fails.

Unlike DNS, WINS is NOT hierarchical. A single unambiguous name is defined. In other words, the "short name" is the fully qualified name within a WINS environment. Windows machines (PCs and Servers) can be configured to refer to a WINS server similar to how they refer to a DNS server. Most often, Windows Domain Controllers provide the WINS service similar to how they could provide the DNS service. As with DNS, the WINS server can be defined through DHCP (for PCs), build scripts, or manually. If WINS is enabled on a Windows machine, then the C:\Windows\System32\drivers\etc\lmhosts file would be queried before performing a WINS lookup. This is analogous to the hosts file mentioned earlier, but only for WINS short names. WINS uses port 42 (UDP & TCP)

for replication between domain controllers and either port 137 (UDP) or 1512 (UDP & TCP) for name resolution. As with DNS, the “ipconfig /all” command on Windows will show WINS configuration.
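When a WINS or NetBIOS short name is suspected of resolving to a stale address, the nbtstat utility built into Windows can show what has been cached locally; for example:

nbtstat -c     (displays the local cache of NetBIOS names and the IP addresses they resolved to)
nbtstat -R     (purges and reloads that cache)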

 

6.5 Layer 4 Protocols

This next section is all about layer 4 protocols and communications, in particular, the TCP and UDP protocols. Please review the basics of layer 4 communications in chapter 2 as a refresher. Key review topics include:




5-tuple uniqueness - The idea that a conversation requires 5 values that must make a unique set (source IP, destination IP, source port, destination port, and protocol. Port values – High port values (called ephemeral ports) are dynamically assigned by operating systems to be used by the client-side of the communication to initiate the connection to a known remote port. Port/Process Relationship – Processes open ports and thus ports are directly related to processes by the operating system. When communication comes into a machine for a port, then that means it’s going to be picked up by a specific process. If the destination port is not

currently listening, then the process associated with it is likely down.

Listeners and Port Access
For one process to establish a connection with another process, one of the two processes (the server) needs to be listening for communication on a specific and known (or at least known to the client) port. The ports that are listening on a host can be detected in several ways as follows:
- On the host itself, you can use the "netstat" command (-tulnp options for Linux and -anbo options for Windows) and look for the "LISTENING" ports (Resource Monitor on Windows can be used as well). Note that the same port can be opened by multiple processes on multiple IP addresses if the host has multiple IP addresses.
- From a remote host, you can use tools like "nmap" (free download – most useful if you don't know the port number you are looking for), "PortQryUI" (free download – best if you know the port), telnet (only useful for TCP and if you know the port), and several others. Remote host connectivity would fail if blocked by a firewall. One port may work, and another could be blocked. As with the "ping" utility, if ANY port works from one host to another host, then you have verified that layer 3 connectivity is good between these two hosts. If you are then having trouble

with another port, then the problem is at the port level (e.g. a firewall is blocking the port, or the port is not active on the destination host).
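For example (the port number here is illustrative), a quick check for a specific listener on each platform might look like this:

netstat -tulnp | grep 1433        (Linux – shows the listening socket and the owning process, if run as root)
netstat -anbo | findstr 1433      (Windows – run from an elevated prompt; -b adds the executable name)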

 

nmap
The nmap utility is available for Linux and Windows and can be used to determine if ports are accessible from one host to another. This tool is only viable if used from the client host that is attempting to access the server.

The reverse does not make sense as there is no specific port on the client-side to target. You can only simulate connectivity from the client-side. This utility is especially useful if you simply want to see what ports are accessible and are not necessarily looking for a specific port. If a port is not detected, then either there is no process on the remote host listening on that port or there is a firewall blocking access to that port from this source. This utility is available for free download at https://nmap.org/download.html. Below is a screenshot of the GUI, however, a command-line version is available as well.
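From the command line, a scan of a handful of ports is equally simple (the target name and ports here are only illustrative):

nmap -p 80,443,1433 server01.acme.com

Ports reported as "open" have a listening process reachable from this client; "filtered" generally means a firewall is dropping the probes, and "closed" means the host answered but nothing is listening on that port.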

PortQryUI
The PortQryUI tool is another method of checking whether access to a port is available from a particular host to another host. However, PortQryUI is only

available on Windows. As with nmap, this tool can only be run from the client machine. Below is a screenshot of the tool:

 

The following are the possible results from PortQryUI:
- LISTENING – Port responded to the TCP handshake
- NOT LISTENING – Server responded that the port is not listening (the process is down, or the port is firewalled)
- FILTERED – No response from the server (the server is down, or the port is firewalled)
- COULD NOT RESOLVE – Could not get an IP from the server name

PowerShell
As mentioned in the previous chapter, PowerShell is a broad Windows automation framework that is generally beyond the scope of this book. However, the Test-NetConnection cmdlet in PowerShell is a good alternative to PortQry for Windows machines for TCP connections. At this time, Test-NetConnection does not support UDP. To use PowerShell, you should open a cmd prompt "as administrator". At the prompt, you type: powershell. This starts the PowerShell engine.

The Test-NetConnection cmdlet accepts several parameters, but the key ones to remember are -ComputerName, -Port, and -InformationLevel. The following

 

is an example of the command and its output:
Test-NetConnection -ComputerName my.fiu.edu -Port 443 -InformationLevel Detailed
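(The output screenshot from the original is not reproduced here; a representative, abridged result looks roughly like the following, with the addresses being illustrative.)

ComputerName     : my.fiu.edu
RemoteAddress    : 13.35.118.68
RemotePort       : 443
InterfaceAlias   : Ethernet0
SourceAddress    : 192.168.128.110
TcpTestSucceeded : True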

If the "TcpTestSucceeded" line indicates "True", then connectivity to the destination IP/Port (in this case my.fiu.edu:443) was successful.

telnet
While nmap is available on Linux to test remote port access, there are other tools available as well. One of the most commonly used is telnet since it's natively available on all Linux systems. By default telnet attempts to connect using port 23, however, this can be overridden to attempt to connect to any port such as port 443 - telnet my.fiu.edu 443. If the connection fails, then you

will likely get a blank screen, but the results will vary depending on the application. Regardless, it's pretty obvious if the connection is a success, and at that point, you can simply hit ctrl-c to break the connection. A key point to keep in mind is that the telnet test only works for TCP, not UDP, since telnet itself communicates over TCP. If you need to test UDP, then you would need to use nmap or acquire some other tool that supports this.

NetCat
NetCat, which is usually called with the nc command, is a utility freely available for Linux and Windows, but typically only used on Linux. NetCat

can test UDP and TCP connections and can even pass data into the connection to retrieve raw HTTP output. There are many options to NetCat, but the most

 

often used are -v (verbose), -z (only establish the handshake and then end the connection) and -w (timeout in seconds). In addition to the options, the destination host and port must be provided. Below is an example of a successful and a failed connection:
SUCCESSFUL CONNECTION
bash-4.4$ nc -vz -w 2 my.fiu.edu 80
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 13.35.118.102:80.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
FAILED CONNECTION
bash-4.4$ nc -vz -w 2 13.35.118.102 81
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.

TCP
A key point to understand about TCP is that not all TCP implementations are the same. In fact, TCP has been an area of research over the years and there are several competing versions of the protocol. The version being used depends on the operating system or even which version of the operating system. Even though there are several versions of TCP, they all follow the same rules in terms of how to communicate with each other. The variations in their implementations are mostly around how they handle congestion control, which we will cover in a couple of sections below.

The TCP Handshake and Acknowledgements
With UDP, there is no handshake, data just starts to flow from client to server. The received data and any associated responses, if any, are sent directly to the application without any logic interrupting the flow. With UDP, speed is of utmost importance. The application protocol (layer 5-7) completely dictates the rules of the conversation and manages the conversation.

With TCP, there’s a lot more at play. It all starts with a 3-way handshake at the beginning of the conversation that establishes the rules for that specific

conversation. Some of the key rules established include the following:
What's the MSS in either direction?

 

MSS – Max Segment Size is the largest allowed single TCP segment, in bytes.

What's the Window Scaling Factor in either direction? This allows the window size (the amount of data that can be sent without an acknowledgment being received) to be increased from the default 65KB to about 1GB, which is more appropriate to obtain decent speeds on high bandwidth circuits.



Is SACK or ECN possible? SACK – Selective Acknowledgements allow for the recipient to acknowledge the packets it received even if some were missed. This is an optimization for larger circuits with larger window sizes to avoid unnecessary retransmissions. ECN – Explicit Congestion Notifications are explained in the next section.



What are the starting sequence numbers in both directions?

Since TCP is guaranteeing packet delivery, acknowledgments (ACKs) have to be received for each packet or set of packets. The ACK number represents the latest packet received in order. Packets are only accepted in order. The TCP protocol also includes several key flags including, but not limited to, the following (NOTE: more than one flag can be set on each packet):
- SYN (start a conversation)
- FIN (end a conversation)
- ACK (acknowledge receipt of a packet)
- RST (rejected connection)
- PSH (immediately send the packet to the application – no more packets coming for this application message)
The below is a graphical display of the TCP handshake:

 

The first thing that happens (not shown here) is that the client opens a client port and is provided an ephemeral port by the operating system to be able to begin the conversation. The handshake then begins with the client sending a SYN packet to the known port on the remote server. If there is a process listening on that remote port and the packet was able to get there, then there is a SYN-ACK response (e.g., both flags are enabled). Finally, the client sends an ACK with the initial payload. The rules of the conversation mentioned above (MSS, Window Scale, SACK, ECN, etc.) are all sent with the SYN packets only and in both directions.

Comprehension Questions:
How many senders & receivers are there in a TCP conversation?
2 – While there is only one client that initiates the conversation, both sides are sending information and both sides are receiving information. This is important because both sides are looking for their previous packets to be acknowledged and expect to receive their packets in order.
If an out-of-order packet is received, what does the recipient do?
The recipient sends a duplicate acknowledgment of the last in-order packet it received. This tells the sender that a packet was missed, so it resends the missed packet(s). This slows the conversation but is necessary to make

sure no information is lost.
Senders need to receive acknowledgments to continue sending. What if they do not get one? How long does a sender wait for an

 

ACK and what does it do if an ACK is not received?
This varies by the implementation of TCP, but generally, the sender is keeping track of the average RTT (round trip time) between ACKs. So, it knows what a reasonable wait time should be. If that wait time is doubled, then the sender retransmits the last sent packet to see if it gets a reaction from the receiver. It will do this a certain number of times (again implementation specific as to the exact number of retransmissions) with longer wait times in between and then end the conversation if nothing has come back.
Since TCP provides all these services, why use UDP at all?
TCP's need for acknowledgments to receive packets in order can slow a conversation. If there is no need for packets to be received in order or at all (e.g. real-time communications), then this overhead is unnecessary and even problematic. Thus, most real-time communications like live video or voice use UDP. Also, single packet protocols like DNS and DHCP use UDP as well since there's no need to order the packets.

TCP Flow and Congestion Control
TCP not only guarantees the reliable delivery of packets in order, but it also provides for flow control and congestion control to avoid issues on the network. However, these topics are only relevant when transmitting a large amount of data as with a file transfer or something similar. In the below brief section, we will discuss both of these topics.

Flow control is designed to avoid overwhelming the receiving host. This is accomplished by having the receiving host advertise its "receive window size". The sender will avoid sending more bytes than would fill the receive buffer without receiving an ACK. The receiver can then adjust the receive window with each ACK.

 

The receiver can even stop the sender from sending any data temporarily by setting the receive window to 0. In fact, this is a common method for cloud providers to throttle traffic in order to keep users within their allotted levels of service (e.g. transmission bits per second). Congestion control is designed to avoid overwhelming the network path between the sender and the receiver. This is accomplished via several mechanisms depending on the version of TCP. Regardless of the mechanism, it's the sender that controls the congestion window size, not the receiver. The basic premise is simple. Start the window size (amount of data to send without receiving an ACK) small and grow it exponentially until one of the following occurs:
- A timeout (ACK not received in time - based on RTT)
- A duplicate ACK is received, meaning that an out-of-order packet was received on the other side, implying a dropped packet occurred somewhere in between.
- The ECE flag is set. This is done by routers when a packet almost missed getting into the router's buffer (e.g. the router's buffer is almost full)

 

The above graph is typical of TCP conversations sending a great deal of data. The sender continues to increase the window size (amount of data sent in between ACKs) until one of the above happens, at which time the sender reduces the window size to avoid making things worse and then starts to climb it again. TCP algorithms vary greatly on how much to slow down and how quickly to speed up.

Anecdotal Story – Network error? Maybe not…
You implement a firewall change one evening and the next morning you receive a call from a support person. They are receiving an error on a server attempting to communicate with another server that resides within the network zone to which you applied firewall changes. The error states that there is a "network error" connecting to the remote server. However, when you check the firewall, it appears that the traffic should be allowed. What are some possible next steps to take?

There are several things you can try, but testing basic connectivity from the client machine is likely the best way to go. From the client machine, you can test the connection with PortQryUI, nmap, or telnet. You use PortQryUI and receive back "NOT LISTENING". At this point, the next appropriate step is to go to the destination server and verify whether the port is listening there. Sure enough, the port was not listening, and the reason was quite simple. The Windows service on the remote server was down. It had crashed the night before and there was no monitoring on it. Once the service was started, communications worked as expected. The key takeaway from this brief anecdote is that an application can claim a "network error" when the problem is simply that the remote-side process is down. There's no simple way for the client to differentiate between a true network error and the process being down, so always check for the obvious first. Of course, from a problem management perspective, adding monitoring/alerting on this Windows service is also warranted.

6.6 Packet Capturing and Analysis
Because of the heterogeneous nature of modern computer environments (various types of operating systems speaking to each other), most communication protocols have publicly documented standards. This allows a wealth of information to be derived from reading network conversations because the communication is not a black box. Reading information from the network interface is called "packet capturing" or "using a sniffer" and requires kernel-level access on host machines.

A plethora of tools have been developed to analyze these packet captures for all sorts of purposes, such as:
-  Security – looking for known bad signatures, the exfiltration of data, etc.
-  Monitoring – determining end-user performance, identifying errors, throughput, capacity
-  Network mapping – identifying which applications/servers depend on each other and for what purpose
-  Troubleshooting – determining if packets are arriving, how long they take to return information, why errors are occurring, etc.

What "packets" are captured depends on where the "sniffer" resides (the capture point), as follows:
-  On a server or PC – only data coming to and from your port would be available
-  On the switch – a span port can be used to capture all traffic traversing the switch
-  On a firewall or router interface – all traffic traversing that network interface

It's critical to understand where on the network a packet capture was taken to be able to analyze it properly. Also, the packet capture point can be used as a form of reductive problem isolation by eliminating areas of the network as being part of the problem. For example, review the below:

If packets are being dropped, in which section is it occurring? By packet capturing at various points, you can reduce the problem area. Further reduction within the "local network" on either side can occur as well. While packet captures provide a wealth of information, it's important to realize that encryption does limit the ability to troubleshoot using packet captures. There are two types of transmission encryption and they limit your visibility differently:

1)  Payload-only encryption (TLS/SSL) – This is the most common and affects a lot of traffic, including HTTPS (encrypted web traffic). Email, file transfers, database connections, etc. are all often encrypted at this level. When there is such encryption, only the metadata is available (layer 2, 3, and 4 information); layers 5-7 are what's encrypted. However, because layer 4 is visible, you can see TCP retransmissions, duplicate ACKs, etc., so you can get an idea of the stability of the connection.
2)  Transport encryption (VPN, IPsec) – With transport encryption, layers 3 & 4 of the true communication are also encrypted. All you see is the layer 3 & 4 of the encrypted tunnel, which provides no useful information at all. The saving grace of most VPN solutions is that they create virtual interfaces that can usually be captured. So, while the Ethernet interface capture is useless, you can point your capture to the virtual interface instead and get useful information.

Packet Capturing on Linux
Wireshark (discussed in the next section) can be used to perform packet capturing on a Windows, Linux, or MacOS system. However, when dealing with Linux servers, the "tcpdump" command is a simpler method of performing a packet capture. There are ports of tcpdump for Windows, but packet capturing with Wireshark is typically simpler on Windows and MacOS. You need administrative privileges (e.g., root) to execute tcpdump (and any other packet capturing tool on Linux). The below is a summary of some of the tcpdump options. There are many more options – refer to the man page for more information.
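As a hedged illustration of typical usage (the interface name, host, port, and file path below are assumptions), a capture limited to one conversation and written to a file for later analysis in Wireshark might look like:

$> tcpdump -ni eth0 -w /tmp/conv.pcap host 10.1.2.3 and port 443   # -n: no name resolution, -i: interface, -w: write raw packets to a file
$> tcpdump -r /tmp/conv.pcap -c 20                                 # read the first 20 packets back from the file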

Wireshark

 

Wireshark is an open-source packet analysis tool. Its graphical user interface is the most popular method for analyzing packet captures. It can read captures in just about any standard format (pcap being the most common). Wireshark can be used for analysis regardless of where the packet capture took place; however, Wireshark includes a local capture component as well in case you need to perform a capture locally. Wireshark provides "capture filters" used to limit the packets you capture and write to the pcap file – this is only applicable with a local capture. Wireshark also provides "display filters" to limit the packets displayed on the screen – applicable on any capture you are analyzing. The filters available (e.g., protocol, port, IP address, etc.) are the same for both. There are hundreds of filters and they can be combined with operators (e.g., and/or). Wireshark understands all popular protocols and breaks down the sections of the protocol for you, although the raw packet is also available. Wireshark has a ton of other useful features, such as:
-  The combining of multi-packet application messages (e.g., HTTP) into one logical record
-  The ability to track conversations (TCP or UDP) such that they can be graphed (for throughput, round-trip time, etc.), colorized, used as a filter, etc.
-  The ability to find a packet by searching for text in the packet headers or body
-  The automatic colorization of retransmissions, "out of order" packets, and other conditions worth reviewing

Analyzing Network Traffic with Wireshark
Analyzing network traffic with Wireshark can be fairly simple or get quite complicated depending on the issue. To be effective in analyzing traffic, you need two basic things:
1)  Understand what you are seeing – If you look at a capture and do not understand what each row of data means, then you aren't going to be able to figure anything out. So, understanding what the values in the packets (especially the metadata) mean is crucial.
2)  Understand how to filter to find what you want – Even the smallest captures have hundreds of packets and it's not surprising to see captures with thousands of packets. Finding the conversation you care about in all of this can be very challenging, so understanding the filtering & search options well is also key to being able to make use of a capture.

Lastly, it's important to note that even a skilled packet capture analyst may struggle to find certain issues. It all depends on how obvious the problem is and how well the analyst understands the type of conversation being had. Below, I will review a few captures, starting with a simple capture that does not have a problem, then a simple capture that has a problem, and finally a more complicated problem to decipher from a capture. Let's begin. Please review the capture below and try to answer the questions below it.

Questions:

-  Which IP initiated the conversation (i.e., the client IP) and how do you know?
-  Which IP is the server running on?
-  What ephemeral port number is the client using?
-  What port is the server listening on?
-  How many sequence (seq) numbers are there at play?
-  How many acknowledgment (ack) numbers are there at play?
-  How is the next sequence (seq) number calculated?
-  How is the acknowledgment (ack) number determined?
-  What does the Win value mean?
-  What does WS mean?
-  What does SACK_PERM=1 mean?

Answers:

-  Which IP initiated the conversation (i.e., the client IP) and how do you know? 10.100.196.5. You can tell because you see that it is the "Source" on the SYN packet and the "Destination" on the SYN-ACK packet. Also, its port number (37716) is higher than the other IP's port (80), thus it's more likely to be the one using an ephemeral port. This is important to note as you may not catch the beginning of a conversation (the SYN, SYN-ACK) in a packet capture, but you should still be able to determine the client. Lastly, this is obviously a call to a web server (80 is the default HTTP port) and this is the IP that's the "Source" on the GET request, which we know would come from the client.
-  Which IP is the server running on? 10.100.14.129
-  What ephemeral port number is the client using? 37716
-  What port is the server listening on? 80
-  How many sequence (seq) numbers are there at play? 2 – one for each sender (remember that there are 2 senders and 2 receivers)
-  How many acknowledgment (ack) numbers are there at play? 2 – one for each receiver
-  How is the next sequence (seq) number calculated? A couple of key concepts need to be understood to grasp the answer to this question. TCP is considered a streaming protocol; that is, TCP does not care about the format of the payload data it's sending – it's just sending bytes. So, it counts these bytes throughout the conversation to identify the sequence and acknowledgment numbers. The real sequence number starting point is randomly generated, but Wireshark changes it to 0 for simplicity's sake. With that understood, the sequence number tracks the payload bytes (not overhead) sent so far, so the next sequence number is the previous sequence number plus the payload length of the latest packet sent. For example, if a segment starts at relative sequence number 1 and carries 200 payload bytes, the next sequence number is 201. A key point is that the "length" displayed by Wireshark includes the payload plus the overhead of layers 2 (Ethernet frame), 3 (IP packet header), and 4 (TCP segment header), so the increment will be less than the length you see. This total overhead can vary but is typically 50–80 bytes in length. You can find the actual payload length by looking at the details of the packet.
-  How is the acknowledgment (ack) number determined? The acknowledgment number is defined as the next expected byte of data, so it's the last in-order sequence number it has seen + 1.
-  What does the Win value mean? This is the receive window discussed in the Flow Control section. A key point is that if the capture being analyzed with Wireshark does not include the initial handshake, then this number could be incorrect as Wireshark is unaware of the window scaling factor.
-  What does WS mean? This is the window scaling factor, which is only included in the SYN packets.
-  What does SACK_PERM=1 mean? This indicates that Selective Acknowledgements are allowed. This is also only shared in the SYN packets.

Now that we understand a little about what we are seeing in a Wireshark capture, let's take a look at a simple example of a Wireshark capture that includes a failure.

Questions:

-  Which IP is the client and how do you know?
-  What's the IP of the server?
-  What's the port(s) of the client?
-  What's the port of the server?
-  What went wrong with this conversation(s)?
-  What could be the problem?

Answers:

-  Which IP is the client and how do you know? 10.110.32.178, because this is the IP that sent the SYN packets. An interesting note is that the port of the client is lower than that of the server, which is rare. The reason for this is that the client is connected through a VPN, and in that case the VPN assigns the port numbers and the range is much broader than when the operating system provides ephemeral ports.
-  What's the IP of the server? 167.79.187.254
-  What's the port(s) of the client? There are actually 3 because three conversations were initiated. The ports are 2698, 2699, and 2700.
-  What's the port of the server? 9080
-  What went wrong with this conversation(s)? All three conversations failed to complete the handshake because they did not get a SYN-ACK back from the server. Instead, they received an RST (rejection). In fact, the client tried to send SYN packets again for all 3 conversations, and even a third time for one of them, but the result was always the same.
-  What could be the problem? There are two possible issues when you see this kind of behavior:
   -  The port is being blocked (rejected) by a firewall. Firewalls can drop or reject packets. Within an internal network, rejects are more common, but on the Internet, drops are more common.
   -  There is no process listening on port 9080 on IP 167.79.187.254. In this case, this was the answer, but it could have been a firewall, so more investigation would be needed if you weren't certain.

Lastly, let's take a look at a much more complicated scenario. The below is a capture broken into both sides of the conversation for a clearer picture. The symptom being experienced in the below capture is that the client times out when trying to connect to an Azure SQL Instance.

Questions:

-  Which IP is the client?
-  Which IP is the server?
-  What port is the client using?
-  What port is the server using?
-  Where does the conversation seem to break down?
-  Where can you assume the problem lies (e.g., the client or the server or perhaps the network in between)?

Here's some additional material you can reference to try to figure this out: https://www.connectionstrings.com/all-sql-server-connection-string-keywords/

Answers:

-  Which IP is the client? 172.18.2.23, as it performs the SYN.
-  Which IP is the server? 191.236.155.178
-  What port is the client using? 52656
-  What port is the server using? 1433
-  Where does the conversation seem to break down? There appears to be a 9-second gap (from the 21-second mark to the 30-second mark) after the client issues the first pre-login message. There's an immediate ACK from the server, but it contains no data, so it's not a response from the application, just an ACK from the TCP/IP stack letting the client know that it received the packet. The actual response comes at the 30-second mark. The same pattern occurs at the 31-second mark as the pre-login ends and the TLS exchange begins. The server provides an immediate ACK, but the application does not respond until the 45-second mark. The Microsoft documentation link provided indicates that the default connection timeout is 15 seconds and is controlled by the client. Thus, once we hit the 36-second mark, the client ends the conversation, and the server's response at the 45-second mark is rejected.
-  Where can you assume the problem lies (e.g., the client or the server or perhaps the network in between)? Since both delays in the conversation come from the server end of the conversation, the problem is not on the client. Also, since in both cases there are immediate ACKs sent to the client, the issue is not on the network either. The issue is strictly with the server application, which is the Azure SQL service. After this analysis was sent to Microsoft, they agreed that the Azure SQL service was exceeding threshold limits and thus was being throttled and not responding immediately to requests. The tier of service was upgraded to the next level to increase the thresholds to a level where performance was not impacted.
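For reference, the same kind of analysis can be scripted with tshark, Wireshark's command-line counterpart. A hedged sketch (the capture file name is an assumption; the addresses and port come from the example above):

$> tshark -r azure-sql.pcap -Y "ip.addr==172.18.2.23 && tcp.port==1433"   # isolate the one conversation
$> tshark -r azure-sql.pcap -Y "tcp.port==1433" -T fields -e frame.time_relative -e ip.src -e tcp.len   # timestamps, sender, and payload size make long gaps easy to spot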

 

6.7 Troubleshooting Network Connectivity - Summary
To close this chapter on network tools, here is a recommended sequence of steps for troubleshooting a communications issue between two systems. The key to this troubleshooting is to identify the two systems involved in the conversation.
1)  Identify the IP addresses, subnets, and ports for the source (client) and destination (server) in the conversation, then test mostly from the client machine. This step alone can be challenging if it's unclear how the application works, but it is a critical step.
2)  Is the destination name resolving at all? If not, then why? Do you have DNS servers defined properly? Is the name simply not in DNS? (You can use nslookup and ping – note that the ping may fail, but at this point you are testing name resolution only.)
3)  Is the IP address that the name resolved to the correct IP? If not, why? Are you referring to the correct DNS servers? Is DNS set up correctly? Is the IP being overridden by the hosts file? (Again, use nslookup and ping; however, ping can be blocked by firewalls, so its failure is not necessarily indicative of a problem.)
4)  If the above does not expose any issues, then does connectivity to the destination port work? (You can use PortQry, PowerShell Test-NetConnection, nmap, and telnet on the client machine.)
5)  If connectivity to the destination port is not working, why not? Is the port being blocked by a firewall? Is the process running on the remote end? (You can use netstat to see if the port is listening on the destination, and services.msc, Task Manager, or ps to see if the expected process is even running.)
6)  If the above does not expose any issues, then is the conversation between the systems having logical issues? Are there high rates of retransmissions or duplicate packets? Does the server or client simply not respond, or not respond for an extensive period of time? (You can use Wireshark for this – tracert/traceroute is an option as well if you suspect asymmetric routing.)
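A hedged walk-through of those steps from a Linux client, assuming a destination of appserver.example.com on port 8443 (the host, port, and bracketed values are all placeholders):

$> nslookup appserver.example.com          # steps 2 & 3: does the name resolve, and to the expected IP?
$> ping -c 3 appserver.example.com         # optional reachability check; remember ICMP may be blocked
$> nmap -p 8443 appserver.example.com      # step 4: can we reach the destination port?
# On the destination server itself (step 5):
$> ss -tlnp | grep 8443                    # is anything listening on the port?
$> ps -ef | grep <expected-process>        # is the expected process even running?
# Step 6, if the port connects but behavior is still odd:
$> sudo tcpdump -ni eth0 -w conv.pcap host <client-ip> and port 8443   # capture the conversation for Wireshark analysis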

 

Chapter 6 – Review/Key Questions

1)  What logical entity (e.g., device, host, process, application) communicates at each OSI layer?
2)  What are the primary types of network devices on the LAN within an Enterprise IT environment?
3)  What are the primary types of WAN circuits used within Enterprise IT environments?
4)  What are the typical steps to determining name resolution on a system?
5)  What five values make up a unique network conversation?
6)  What tools can be used to test network connectivity between two hosts or processes on Windows?
7)  What tools can be used to test network connectivity between two hosts or processes on Linux?
8)  What's the difference between TCP and UDP?
9)  What's the difference between TCP flow control and congestion control?
10)  At what point within the network can you perform a network capture?
11)  On a network capture, how do you determine which IP address is the client?
12)  What are some of the features available to you within Wireshark?
13)  What are the steps to troubleshooting communications between two processes?

 

Chapter 7 – Application Protocols

In this chapter, we will cover network layers 5-7 (a.k.a. application protocols). Of course, there are way too many application protocols for an entire book to cover (let alone a chapter), so we will narrow our focus to the two most prominent (HTTP and SSL/TLS). We will go fairly deep on these protocols and then skim over some other highly used protocols like SOAP, REST (not actually a protocol), SMTP/POP3, FTP, LDAP, Kerberos, SAML & OAUTH. While each application protocol is unique, there are common traits that will be reviewed. Overall, the intent is to provide exposure to application protocols in general such that when others are encountered, they do not appear completely foreign.

7.1 The HTTP Protocol

The HTTP (Hypertext Transfer Protocol) has become the most common inter-application communication protocol in the world. HTTPS is simply encrypted HTTP. There are two primary ways in which HTTP (or HTTPS) is used:
-  Browser-based Usage – This is where HTML, JavaScript, and the like are delivered to your browser for user interaction.
-  API-based Usage – This is where an application is communicating with another application without user interface information being transferred.

These two methods are often used in combination to deliver a solution, but each HTTP conversation is usually only of one type or the other. HTTP uses port 80 by default. HTTPS uses port 443 by default.

HTTP Protocol Structure
The HTTP protocol is a request-response protocol. This means that the client sends requests to the server and each request is provided a response from the server – it's a 1-to-1 relationship. The client can send any of 9 "request methods" and receive any of about 100 "response codes" (these are reviewed later on). Also, there are over 50 headers that can be included in the request or response – this is where things like cookies are provided. Finally, some requests and responses have a "body" that includes the data being sent to the server in the request (e.g., data input by the user) or data being provided by the server to the client in the response (e.g., HTML, a downloaded file, etc.).

HTTP Versions
HTTP was initially developed in 1991 (version 0.9) when the Internet was still nascent. The idea was simply to share documents with links to other documents. The concept of HTML documents that included these links was introduced at the same time. Of course, these concepts quickly took off and there was a flurry of activity with HTTP through 1997 (HTTP version 1.1).

While the use of HTTP continued to explode and the richness of the browser UI continued to evolve quickly, the HTTP protocol itself did not change much for the next 18 years. This speaks to the simplicity and flexibility of the protocol. Having said that, since 2015 (version 2.0), there have been major changes to the protocol, and they've all been about speeding up the user (browser) experience. Google has driven most of these changes. HTTP 0.9, 1.0, 1.1, and 2.0 all utilize TCP as their transport protocol. HTTP 1.1 (1997) is still the most used version, but HTTP 2.0 (discussed in the next section) is slowly gaining popularity. HTTP 3.0, which builds on HTTP 2.0, was proposed in November 2018 and is being lightly used today, but it will take time for it to become widely accepted for multiple reasons, the primary one being that it is UDP-based and thus transfers all of the reliability responsibility to the application layer (browser & web server). Given the lack of acceptance, HTTP 3.0 is out of scope for this book.

One of the key items that each version has been trying to address over the years is increasing browser performance. Since HTML pages typically refer to many objects (e.g., several images, JavaScript files, CSS files, etc.), optimizing access is critical. To that end, several advancements have been made over time, such as:
-  Persistent Connections – The concept of persistent connections is to keep these TCP connections open so that multiple requests can use the same connection.
-  Pipelining – The concept of pipelining is that you do not wait for the reply of one object before requesting the next.
-  Multiplexing – The concept of multiplexing replaces pipelining over a single connection – new with HTTP 2.0.

 

HTTP 2.0
HTTP 2.0 was accepted as an IETF standard in 2015 with RFC 7540 (https://tools.ietf.org/html/rfc7540). HTTP 2.0 is based on the SPDY protocol developed by Google to improve web performance. HTTP 2.0 is still not as widely used as HTTP 1.1 but is gaining broader acceptance. As of Dec. 2020, about 50% of Internet websites support HTTP 2.0; however, within the enterprise, that percentage is much smaller. As operating systems are modernized, this will change over time.
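A hedged way to check whether a particular site will negotiate HTTP 2.0 is curl's write-out variable for the negotiated version (the host name below is a placeholder):

$> curl -sI --http2 -o /dev/null -w "%{http_version}\n" https://www.example.com/   # prints 2 if HTTP/2 was negotiated, otherwise 1.1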

There are several significant changes with HTTP 2.0, including:

-  It is a binary protocol (not text-based) & compresses the headers
-  Allows multiple "streams" via one TCP connection (multiplexing)
-  Adds the ability to interleave requests and responses in the same message
-  Adds flow control & prioritization

HTTP 2.0 can only be used if both the client and server agree to use it during the initial HTTP connection. HTTP 2.0 still uses the same request methods (GET, POST, etc.), response codes, and URI syntax as previous versions. HTTP 2.0 is typically only transmitted via encrypted channels.

HTTP Request Methods
The HTTP protocol (regardless of version) supports several "request methods", but two are by far the most used:

-  GET – Requests a URL. There is no body to this message (everything must be in the URL); however, cookies can be sent to the server in headers. This is typically a read-only operation.
-  POST – Sends information to a URL and includes the information in the body of the message. This is the most common method for web service APIs.

HTTP Response Codes
The HTTP protocol (regardless of version) provides for about 100 response codes, but they are grouped into five categories as follows:
-  1xx (Informational): The request was received, continuing process
-  2xx (Successful): The request was successfully received, understood, and accepted
-  3xx (Redirection): Further action needs to be taken to complete the request
-  4xx (Client Error): The request contains bad syntax or cannot be fulfilled
-  5xx (Server Error): The server failed to fulfill a valid request

Some common response codes are as follows:
-  200: This is the response code for most calls – successful completion of the request.
-  302/304: This tells the client that it should attempt a request to another URL, which would be provided in the "location" header. The client will then attempt a GET for that URL.
-  400: This is typically caused by a malformed request.
-  401: The server is letting the client know that it is not authorized for this request.
-  404: The server is letting the client know that the requested object does not exist.
-  500: The server had a failure in its execution; this is a general, but common, error. You would need to get more info from logs or other tools to understand the root cause.
-  504: The server is acting like a proxy (discussed more later) and the back-end service it is calling did not answer in time.
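When troubleshooting, it is often enough to see just the response code and timing. A hedged example using curl's write-out variables (the URLs are placeholders):

$> curl -s -o /dev/null -w "%{http_code}\n" https://www.example.com/missing-page                # e.g., prints 404
$> curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://www.example.com/api/health   # response code plus total request time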

 

HTTP Headers
We have covered HTTP request methods and HTTP responses; however, a lot of what happens in HTTP is driven by headers, and there are a lot of them. HTTP headers have been added throughout the history of the HTTP protocol. If a client or server does not understand a particular header because it's new and was not supported when the client or server was created, that's fine, as the header is simply ignored. This has allowed the HTTP protocol to add features throughout the years without having to change dramatically. There are over 50 request headers and over 50 response headers, so we aren't going to cover them all; however, they can be found here if you wish to review them -> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

Here are some common HTTP Request Headers:
-  Host – The host in the URL request (e.g., my.fiu.edu)
-  Connection – Tells the server to keep the TCP conversation active because more requests are coming
-  Cookie – Provides the name of the cookie and the value of the cookie for the server to read; there could be several of these Cookie headers in a request
-  Referer – Provides the URL from which this request was linked
-  X-Forwarded-For – Provides the IP address of the original client as opposed to the proxy (more on this later)

Here are some common HTTP Response Headers:
-  Location – The URL for the client to access on a redirect response
-  Set-Cookie – Provides the name of the new cookie to save on the client, the value of the cookie, the domain associated with the cookie (e.g., fiu.edu), when it should expire, and other details; there could be several of these Set-Cookie headers in a response
-  Content-Type – Lets the client know the format of the content included in the body

Absolute vs. Relative Pathing
As noted earlier, the entire reason HTTP was started was to link documents together. While clicking on a link can now do a lot more than simply take you to another document, it is still true that just about all HTTP activity involves going to another URL, whether that URL is another document or something that will execute some code. Thus, the way that you provide a path to any URL is important to understand, as there are a few options.

Absolute pathing is the simplest to understand as you are providing all the information for a URL explicitly. In fact, this is needed when you first access any new website. An example of absolute pathing would be: https://my.fiu.edu/students/index.html. Relative pathing requires a starting point (e.g., the current web page you are viewing). If you are not on a web page because you just started your browser, for example, then relative pathing is not applicable. So, let's assume that you are on the web page https://my.fiu.edu/students/index.html and want to go to https://my.fiu.edu/students/grades.html, then consider the following URLs:
-  /students/grades.html - WORKS
-  grades.html - WORKS
-  /grades.html - FAILS (resolves to https://my.fiu.edu/grades.html which is incorrect)

So, why would you use relative pathing? The main reason is to avoid being tied to a DNS name or protocol (e.g., HTTP or HTTPS). You want your application to work no matter how a user gets to it. If you use absolute pathing, then you are hardcoding the protocol and DNS name that the user needs to use.

More HTTP Request & Response examples
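As a hedged illustration (the host and path are placeholders), curl -v prints the raw request and response headers of a simple GET, which is a convenient way to see the structure described above:

$> curl -v http://www.example.com/index.html
# Request sent (abridged):
#   GET /index.html HTTP/1.1
#   Host: www.example.com
#   Accept: */*
# Response received (abridged):
#   HTTP/1.1 200 OK
#   Content-Type: text/html; charset=UTF-8
#   Set-Cookie: SESSIONID=abc123; Path=/
#   (HTML body follows)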

 

HTTP Session & Cookies
HTTP is a sessionless protocol. This means that HTTP has no concept of a user session – yet this is a critical component of any user-based application. User session information is needed to understand who the user is, what permissions the user has, and what information the user has already provided (e.g., what's in their cart on an e-commerce site). This user information is called "session state". Session state is typically only needed for browser-based applications but is occasionally used for API calls as well.

So, how do applications maintain session state information if it's not native to HTTP? A critical mechanism used by HTTP-based applications is storing information in cookies (small bits of data stored in the browser's memory and/or on the client's system). Cookies are often used for authentication & session-state identification. It's important to note that cookies do not contain any session state information other than a random string called a session ID. This session ID is simply the key used to find an entry held in the application server's RAM or in a database where all the session state information is maintained.
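A hedged illustration of the mechanics (the URLs, form fields, and cookie names are assumptions): curl can store the Set-Cookie value returned by a login and replay it on later requests, which is essentially what a browser does with a session ID.

$> curl -c cookies.txt -d "user=alice&pass=secret" https://app.example.com/login   # -c saves any Set-Cookie values (e.g., the session ID)
$> curl -b cookies.txt https://app.example.com/cart                                # -b sends the saved cookies back with the next request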

Where an application stores its session state information (the application server's RAM or an external database) is a serious consideration, especially when deploying within a cluster of application servers. This is because if the session state is stored in the RAM of an application server, then once a user session begins, the user must remain on that application server. The user cannot be sent to another application server as the session information is not there. However, if the session information is stored in a database that all the application servers access, then a user does not need to remain on the same server through his/her session. We will discuss this further when we discuss load balancers. Below is a chart of the pros/cons of each approach:

 

7.2 HTTP Tools
Several tools can be used to diagnose problems with HTTP-based applications. We will go over just a few of them in this section.

wget
Allows you to make an HTTP GET request from the command line on Linux or Windows (rarely used on Windows since the browser is available). The wget command is designed to be used without user interaction (e.g., for a batch job). The most common usage is to download a file, but the command has many options, including the ability to crawl websites using the -r (recursive) option. Typical usage: wget http://www.a.com/file.Z

Fiddler
Fiddler is a freely downloadable local proxy that can intercept all HTTP/HTTPS traffic and analyze it. The downside of Fiddler is that since it is not inside the browser, it cannot automatically decrypt TLS traffic – permission would need to be provided and ugly security messages would appear in the browser. The positive of Fiddler is that it captures more than just browser HTTP traffic and provides the ability to create an output trace file that can be viewed remotely by engineers.

 

 In-browser Developer Tools

All modern browsers have in-browser developer tools. These tools allow for the tracing of browser traffic, and you can view encrypted traffic since the tool sees the data after decryption occurs. These tools can also see when the browser is using its cache to acquire data for the page as opposed to calling the web server. These tools are not quite as sophisticated as Fiddler but are catching up quickly and are typically good enough. At this time, none allows for the creation of a trace file for remote viewing, but this will likely be available in the future.

In the above example of the Chrome in-browser developer tool, you can see how every HTTP request can be seen in detail by clicking on the request. For example, the initial request to http://my.fiu.edu received a 301 response code and provided a "location" header with the value https://my.fiu.edu, forcing the browser to go get that page instead. Once that page was downloaded, it contained links to a slew of other pages, which were subsequently requested of the server. At the bottom is a nice feature called the "console" that most of these tools have added, indicating possible issues with the website as it relates to security rules implemented by the browser.

Java Applet Console
While the popularity of Java applets is diminishing, there is still a good chance of running into some within an Enterprise IT shop. Java applets are defined in HTML or JavaScript, downloaded by the browser, and then launched through a pre-installed plugin (Java has to be separately installed on your PC for this to work).

When Java applets are launched, a separate process is started. The in-browser developer tools do not have access to see what is going on within Java applets. To view what happens within a Java applet, you need to utilize the "Java Console". To start the Java Console, you need to run "Configure Java" from the Windows search bar. This will open the "Java Control Panel". Click on the "Advanced" tab, then you can set the "Java console" setting to "Show console" and select "Enable tracing" and/or "Enable logging". After this is completed, the Java Console will appear whenever a Java applet is started by the browser and all actions will be logged for debugging purposes.

Anecdotal Story – All systems good, why are we slow?
A critical internal application is suddenly taking over a minute to load each page, slowing down major operations. All the backend servers appear to be fine (no CPU, disk, RAM, or network constraints). DBAs have confirmed that the database has no locking and response is acceptable. You perform a Wireshark capture on the backend server and find that it is responding immediately to all HTTP requests from the client. What are some possible next steps to take?

This is a perfect example of where you should utilize the in-browser developer tool to figure out which request(s) are taking a long time. Since nothing in the backend seems to be an issue and no other applications are impacted, you need more information to determine the next step, and the in-browser developer tool could provide that. You take this next step and find that the web page was being held up making a call to some analytics tools on the Internet. The call would eventually time out and allow the application to load, but the delay would then recur once the user attempted to go to another page. The failing URL call was researched, and it turned out that the external analytics site had been blocked for some reason by the firewall. This was causing the requests to time out. Once the site was allowed through the firewall again, the application began to perform normally once more. The problem had nothing to do with the application itself or any of the infrastructure that the application was using, but instead with an external call made by the client (browser). Without looking at this from the client side, this would never have been determined.

7.3 HTTP Web Services
As mentioned earlier, HTTP is not just used for browsing; it's also the most used API protocol in the world by far. APIs are used for program-to-program communication. The two most used API methods are REST (or RESTful) and SOAP.

 

SOAP is a standard protocol and is rigidly defined to use XML. In fact, SOAP can be used over more than just HTTP, albeit HTTP is by far the most common transport. SOAP is also older than REST, which was started to avoid the rigidity of SOAP. REST is not a protocol at all, but instead more of an architectural style of API programming. REST can only be used over HTTP but can use any kind of text formatting (JSON being the most popular). The simplicity of REST has made it much more popular over the past 5 to 10 years and it is now typically the API method of choice. HTTP RESTful APIs simply utilize the standard HTTP request methods. The most used are POST & GET, but PUT, DELETE, OPTIONS & PATCH will also be seen.

Testing API Calls
When problems are occurring, it's important to test components separately to validate their responses and timing. HTTP websites can be tested easily with a browser or the wget command from the command line, but how do you test API calls? Well, the first thing you need to do is understand the input considerations for the API call (e.g., what information do you need to pass into the API?). All APIs should have these well documented, as the typical reason for an API is for others that are not necessarily on your development team to call the API. Refer to this documentation to test. Once the input requirements are known, several tools can be used to test APIs. The most used today are curl (command-line tool), Postman, and SoapUI.

curl
The curl command is a freely available and highly flexible command-line utility for Linux, Windows & MacOS to access all sorts of protocols (not just HTTP) without user intervention. It's designed to facilitate data transfers in any kind of batch process. In terms of HTTP, it's useful for testing any HTTP request method, with POSTs being the most common to test. Sample usage:

$> curl --header "Content-Type: application/json" --header "Cookie: SessionID=219ffwef90f" --request POST --data '{"username":"xyz","password":"xyz"}' http://localhost:3000/api/login

The above example is a POST request method being sent to http://localhost:3000/api/login with two request headers (Content-Type & Cookie) and a body (the --data portion). There are many options to the curl command, so there's very little that can't be tested. Please read the man page for more details or use curl --help.

Postman
Postman is a freely available GUI for Windows, Linux & MacOS (downloadable at www.postman.com). There are "pay for" versions of Postman with some additional features and support. Postman allows you to format any kind of HTTP request and save it for future use as well as share it with others.

SoapUI
SoapUI is similar to Postman. SoapUI is a freely available GUI for Windows, MacOS, or Linux (downloadable at www.soapui.org). The free version has limited functionality; the licensed version (SoapUI Pro) has more features. As with Postman, this tool allows you to format almost any kind of HTTP request. There is also community sharing available.

Anecdotal Story – It's not me, it's you!
An internal application is having sporadic performance problems when executing a function that performs a RESTful web service call to an external service. The external provider claims that the issue is not with their service and must be something on our end. What are some possible next steps to take?

 

This is a great example of how you can use reductive problem-isolation techniques to help you figure out what's going on. The external provider is claiming that the issue is somewhere within your environment, so eliminate that variable: test the web service from the Internet, outside of your environment. You do this and record the same sporadic performance issues. Provided with this evidence, the external provider looked closer at their systems and realized that there was a server in the segment utilized by your company that was overloaded. They eliminated the server from their load balancer and performance stabilized.

7.4 SSL/TLS
SSL/TLS was made possible by the discovery of asymmetric encryption algorithms in the 1970s. SSL was created in 1994 and has been updated several times, with the most current version being TLS 1.3. TLS is considered a "session layer" protocol used for many applications (not just HTTPS). Other examples of TLS-based protocols include FTPS, LDAPS, SMTPS, POP3S, etc.

Certificate Trusts
Before we dive into the SSL/TLS protocol, it's important to understand the concept of certificate trust relationships. All operating systems (PCs, servers, smartphones, etc.) come pre-installed with a set of trusted Certificate Authorities (CAs) from companies like Verisign, Sectigo, GoDaddy, etc. These trusted CAs have been vetted by the operating system makers and thus added to their distribution. As security patches are rolled out by the operating system vendors, the list of certificate authorities can change (typically new ones are added).

 

As an administrator of your operating system, whether it be a PC, server, or smartphone, you can add certificate authorities to your system as trusted. Enterprises often add private certificate authorities as trusted to their systems. Any certificate presented by a website or any other operating system must be signed by a trusted CA for it to be accepted. The operating system simply uses the public key of the CA, which it has in its trusted certificate store, to decrypt the signature on the certificate and confirm its validity.
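As a hedged example, this trust check can be performed by hand on a Linux host (the CA bundle path below is typical of Debian/Ubuntu systems and the certificate file name is a placeholder):

$> openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt server.crt   # reports OK if the chain ends at a trusted CA
$> openssl x509 -in server.crt -noout -issuer -subject -dates             # who issued the certificate, to whom, and for what validity period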

 

SSL/TLS Protocol Overview
With TLS, the client & server negotiate the cipher suite to use (asymmetric, symmetric & hash algorithms). The client then uses asymmetric encryption to privately share material for creating shared keys in both directions by encrypting it with the server's public key. Shared keys are then used for symmetric encryption of the payload in both directions.

In the diagram above, after the initial TCP handshake (SYN, SYN/ACK, ACK), the TLS process does the following:
1)  The client initiates the conversation with a ClientHello. This ClientHello message will include which TLS version the client supports, the cipher suites supported, and a random string (for key generation later on).
2)  Next, the server replies with a ServerHello message, which includes the server's SSL certificate, the server's chosen cipher suite, and a random string (also for key generation).
3)  The client verifies the server's SSL certificate with the certificate authority that issued it. This confirms that the server is who it says it is, and that the client is interacting with the actual owner of the domain.
4)  The client sends another random string of bytes, the "premaster secret", but this string is encrypted with the public key and can only be decrypted with the private key by the server. (The client gets the public key from the server's SSL certificate.)
5)  The server decrypts the premaster secret. Both client and server generate session keys from the random strings and the premaster secret. They should arrive at the same results.
6)  Both send "Change Cipher Spec" messages to warn the other that the rest of the communication will be using the symmetric keys, and finally both send "finished" messages encrypted with the symmetric session keys to verify all is working as expected.

Some steps are skipped in the above to keep it succinct and there are variations to the above process with certain ciphers (e.g., Diffie-Hellman), but the general idea is the same. While this process works well, there have been subtle issues that have been exploited over the years and thus new versions of SSL/TLS have been deployed to adjust for these vulnerabilities. Today, only TLS 1.2 and TLS 1.3 are considered free of known vulnerabilities. Most enterprises still utilize older versions of TLS internally, but due to compliance issues, most externally facing services should be on at least TLS 1.2. Below is a timeline of the versions:

What can go wrong with SSL/TLS?
As with any technology, things can go wrong with SSL/TLS. In fact, in large enterprises, it's quite common for there to be issues related to SSL/TLS. Regardless of the root cause, the actual error could be a warning by the browser for web-based HTTPS applications or a web service call failure. Below are the most common issues, but this is not a comprehensive list:
-  Expired or Revoked Certificate – SSL/TLS certificates typically expire annually or every two years, or can be revoked if compromised. If an administrator fails to update a certificate on the server or load balancer before its expiration, then it could no longer be accepted by the client and errors could occur. This is by far the most common error and it can even occur with private certificates used within products. It's critical to have a solid process for ensuring that certificates are replaced before their expiration date.
-  Certificate Not Issued to Name – Certificates are issued to DNS names. If the DNS name used to access the site is not on the certificate, then the certificate is not valid for this name. This error can occur if someone tries to test temporarily with a new name or there's a change in name for a site. It's important to note that certificates can be issued to support any name within a domain (e.g., *.fiu.edu or a list of names).
-  Untrusted Certificate Authority – This is more likely when deploying a private certificate authority, but the concept could occur under various circumstances. If the client does not trust the certifier of the certificate, it cannot trust the public key provided. Note that on rare occasions, your TLS public CA provider may change the CA it's using and there's a chance that some older clients may not yet trust it.
-  Incompatible Protocol Support – This is especially possible if either side is running older software that does not support newer TLS versions and newer implementations are no longer allowing SSLv3, TLS 1.0, or TLS 1.1.
-  Incompatible Cipher Suites – This is especially possible if either side is running older software that does not support newer ciphers and newer implementations are no longer allowing certain older encryption & hashing algorithms and/or smaller key sizes. For example, in 2018, Chrome stopped accepting any certificates using a SHA-1 hashing algorithm.

SSL/TLS File Types
There are many types of files related to SSL/TLS. The primary file types are noted below. It's important to note that some file types include only the public key (e.g., .crt, .cer, .der), some include only the private key (e.g., .key), and some can include either or both (e.g., .pem, .pfx, .p12, .pkcs12). If a file contains a private key, then it is typically password protected.
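OpenSSL can convert between most of these formats. A hedged sketch (the file names are placeholders):

$> openssl x509 -in certificate.der -inform der -out certificate.pem -outform pem   # DER (binary) to PEM (base64 text)
$> openssl pkcs12 -in bundle.pfx -out bundle.pem -nodes                             # PFX/PKCS#12 to PEM; prompts for the file's password, -nodes leaves the key unencrypted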

SSL/TLS Tools
As with any technology, there are tools necessary to diagnose issues with SSL/TLS. The key is being able to view the contents of a certificate and what, if any, flaws exist with it (e.g., expiration date passed, DNS name not issued to this cert, the cert is not trusted, etc.). The most common method for viewing a certificate is through the browser, typically by clicking on the padlock, which is usually near the URL.

Upon clicking on the certificate, you are typically presented with the following type of interface:

 

This interface provides all the details associated with the certificate. On the General tab, you can see the "Issued to:" field, which indicates the DNS names for which this certificate can be used; this list can be lengthy. Also, the "Issued by:" field indicates the name of the Certificate Authority (CA). Lastly, the "Valid from" field provides the date range within which this certificate is valid. The "Details" tab has a lengthy list of attributes for the certificate, and the "Certification Path" provides the list of certificate authorities in the certificate chain. It's important to note that a CA can create subordinate intermediary CAs that can then issue the actual certificate. Your operating system only needs to trust one CA in the chain. The following tools can also be used to view/convert SSL/TLS certificates or, in the case of Wireshark, to view the handshake between client and server:
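A hedged command-line alternative to the browser for inspecting a remote certificate and testing protocol support (the host name is a placeholder):

$> openssl s_client -connect www.example.com:443 -servername www.example.com </dev/null | openssl x509 -noout -subject -issuer -dates
$> openssl s_client -connect www.example.com:443 -tls1_2 </dev/null   # check whether the server will accept a specific TLS version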

 

Anecdotal Story – Secure handshake denied

A call comes in that some web service calls to an external provider are suddenly failing. The developer says that they have made no code changes on their side and that the failures just started in the morning. The developer has tested accessing the external web services directly through Postman and they function properly. Also, there are other applications on the same application servers that make calls to the same services and they are working fine. However, a clue is identified by the web hosting engineer, who finds "TLS Handshake Failure" messages in the application logs. What are some possible next steps?

Let’s focus on the troubleshooting principles underlined in the below section.

What Changed?
Given that nothing changed on the internal side, the next question asked was whether something had changed on the external service side. Also, given that the error indicated that TLS was the problem, looking at the external service provider's SSL certificate seemed like the next logical step. This is best done by simply taking the URL used by the application to make the web service call and putting it in the browser. Most web services will provide you with an informational page about their service, but most importantly, you can now inspect the SSL certificate by clicking on the padlock in the browser. Upon reviewing the SSL certificate in the browser, you notice that its "valid from" date was the day before – this was a new SSL certificate. Thus, you have detected a change. Still, while the certificate is new, it appears valid. Also, since other applications on the same server are working, we know that the certificate authority is trusted on these servers. A call is made to the external service provider, but they do not appear to be very responsive, so our investigation needs to continue. What can be done next?

Compare what works with what doesn't work
How can we compare a successful call to the service with a failure? Since the TLS handshake is failing, a network capture from the application server that has both successful and unsuccessful calls would be the place to start. Unfortunately, this server is heavily used, and a network capture could negatively impact it if we try to capture all packets. Thus, we need to limit the packets captured to avoid making things worse. This is done by simply limiting the capture to network traffic that includes the destination IP, or in this case the destination network (104.18.0.0), with the following tcpdump command:

$> tcpdump -ni eth0 net 104.18

You have users test both the working and non-working applications that call the same service and compare the network captures with Wireshark. You first look at the failure (shown below):

 

As you can see, after the "client hello", the server responds with a "Handshake Failure (40)". This is a generic message from the server stating that it cannot communicate given the information provided by the client. So, you then look at the "client hello" of this failing conversation (shown below):

The "client hello" seems normal enough. In fact, it's sending 85 cipher suite options to the server to choose from. What more does it want? Well, the next step is to look at the "client hello" that works to see what's different. At first glance, the successful "client hello" seems identical, but upon closer inspection, there's a difference at the end of the packets, as shown below:

 

The difference was subtle, but the successful calls were including the "Server Name Indication" (SNI) extension of TLS, which included the DNS name we were targeting. The failing transactions were not including this extension. This led you to review the SSL certificate once more, and you notice that there were multiple DNS names in the "Details" tab, indicating that the certificate could be used for multiple names and thus there was a potential that if the name was not passed along, the server could reject the request. To further prove your theory, you use the OpenSSL command to test a TLS connection with and without SNI (server name indication) as follows:

$> openssl s_client -connect www.acmeproc8.com:443 #Fails
$> openssl s_client -connect www.acmeproc8.com:443 -servername www.acmeproc8.com #Works

With suspicions confirmed, the next step was to have the developer review the applications that were working against the failing application, and sure enough, the working applications were newer and were passing along the SNI with the requests, while the failing transactions were not. A simple code change was made to get the failing application working again.

7.5 AAA Concepts
Understanding how the security of all systems involved in a particular problem works is often critical when troubleshooting, as misconfigured access will cause errors. In data security parlance, AAA stands for Authentication, Authorization & Accounting – when troubleshooting, just the first two elements are typically the most relevant.

Authentication

 

Authentication is all about answering the question "who is the user?". When troubleshooting, there are some special considerations, such as:
-  How is the user or process authenticated to the various systems involved (e.g., web services, database, message queue, etc.)? Is the authentication by userid/password, certificate, token, multi-factor, or something else?
-  Has the user been authenticated properly everywhere (e.g., is the user's identity accepted)?
-  When dealing with a service, daemon, or batch job (a non-user-related process), under what userid is the process running? This will impact what the process is allowed to do, what files it can open, etc.

Authorization
Authorization is all about answering the question "is the user allowed to perform the requested task?". When troubleshooting, there are some special considerations, such as:
-  After authentication, how is the user's access defined in the various systems involved? Is it by group membership, direct access provided to the resource, or something else?
-  Be wary of hierarchical limitations. On occasion, in order to access something, you need to have access to something else. The classic example is a directory: if you do not have access to the c:/usera directory, then even if you are given access to the c:/usera/t/t.txt file, you will not be able to access it.

Active Directory
Active Directory is Microsoft's proprietary directory service that is used to configure and protect all components within a Microsoft-centric environment.

 

AD is a complex tool and its operations are well outside the scope of this book, but awareness of AD, and the need to have personnel well-versed in this product when working within a Microsoft-centric environment, must be noted. AD is hierarchical with the following levels: Forest (contains one or more AD domain trees), Trees (contain domains), Domains (local database container), Organizational Unit (directory structure within the database), and Object (user, group, machine, printer, file server, policy, etc.). AD can be accessed via the LDAP protocol, but it also utilizes Kerberos and acts as a CA. There are a few special considerations with Active Directory in an enterprise environment, as follows:
-  Domain controllers service access requests for all devices/users. There is one Primary Domain Controller (PDC) and many Backup Domain Controllers (BDCs) per domain, and there is constant replication between them.
-  Site preferences dictate which domain controller serves which locations.
-  In large environments, replication to all locations can take an hour or more. This is a key point when troubleshooting issues related to updates to AD. It may take quite a while for the update to get to the DC servicing your requests – good practice is to directly update the DC servicing you to avoid delay.

Anecdotal Story – Can't get web apps started
Multiple Microsoft-based IIS applications suddenly begin to fail. Users are getting a 503 error in the browser. You look in the Event Viewer log and see that the AppPool failed and is failing to restart. All you see is the following error in the log – "A process serving application pool 'MyApp(default)(4.0) (pool)' suffered a fatal communication error with the Windows Process Activation Service". What are some possible next steps to take?

Reviewing the troubleshooting principles, you ask whether anything has changed and from the search on the ITSM tool, it did not appear so. Another key question is what’s common about the failing app pools besides the error message? The answer to this is that all the app pools failing were associated with the same application development team. While that’s a clue, in and of itself it did not provide further guidance. So, you need to investigate the error message itself. This is an example of where Google can be your friend. Research of this error message indicated that there are a variety of possible causes, but many centered on issues with permissions. Since these app pools were all associated with the same application development team, could they all be using the same userid? This leads you to check IIS (the Microsoft webserver) to see what userid was being used for the failing app pools and they were all indeed the same. This causes you to look for this userid in Active Directory (AD). A glance in AD shows you that the account had been locked due to too many invalid login attempts. The userid was unlocked, and the app pools were able to start up again. Not done yet! Per troubleshooting best practices, you could not simply walk away from this problem after fixing it because you had not identified the true root cause. Why was this userid locked and how can you avoid this from recurring? A

problem record was created, and further investigation occurred. An important thing to understand is that app pools and Windows services are not supposed to use a regular userid. These processes are supposed to use a service account that is specifically set up without the ability to login to a Windows machine

 

such that it cannot be used for anything other than to be the userid for a specific process. This also prevents anyone from locking the account by trying to log in with it. As it turned out, this service account had not been blocked from being able to log in, so someone had been trying to log in unsuccessfully and caused this issue. The service account permissions were altered to avoid recurrence.
Authentication and Authorization Protocols
As mentioned in the previous section, many issues in Enterprise IT can revolve around authentication and authorization. As such, an awareness of the most common protocols is appropriate; however, fully discussing these protocols is outside the scope of this book.

7.6 Other Application Protocols
While we covered HTTP and SSL/TLS in depth and briefly discussed AAA concepts, there are many other application protocols in use within an Enterprise IT environment. While it is impossible to cover them all in any depth, we thought it appropriate to at least mention some of the most common protocols.
Other common protocols

 

Chapter 7 – Review/Key Questions

1)  What’s the structure of the HTTP protocol?  

2) What’s been the focus of changes to the HTTP protocol over the years? 3)  What are the primary HTTP request methods and how do they differ? 4)  What are the categories of HTTP response codes? 5)  What are the advantages and disadvantages of storing session state in a database versus RAM? 6)  What tools can be used on the client-side to view the details of each HTTP request and response? 7)  What are the steps to a TLS handshake? 8)  What is a Certificate Authority and how does it impact TLS? 9)  What can go wrong with a TLS handshake? 10) 

What tools can be used to work with certificates and TLS?

11)  What’s the difference between authentication and authorization? 12)  How can Active Directory (AD) synchronization potentially impact troubleshooting efforts?

 

Chapter 8 – Hosting Platforms

In this chapter, we will cover hosting platforms like database, web, and application servers. These components are heavily involved in the execution of Enterprise IT applications and thus understanding them well is critical to troubleshooting problems. These hosting platforms can be considered mini purpose-built operating systems in that they provide several services similar to those of operating systems, but to serve a very specific purpose. These hosting platforms have sessions (analogous to processes) and perform CPU scheduling (thread management), memory management (buffer caching), security, transaction management, and other services on behalf of applications so that the applications can focus on business logic. While hosting platforms can be considered mini purpose-built operating systems, they differ from general-purpose operating systems in many important ways:
-  They do not manage hardware and must execute within a general-purpose operating system.
-  They cannot host any type of application – they are limited to managing specific types of applications or sessions.
In this chapter, we will focus on database, web, and application servers as well as PaaS and VDI.
8.1 Databases
There are several types of database platforms used within the enterprise. Understanding which type you are dealing with and the monitoring tools available for it is crucial to provide any assistance in terms of troubleshooting. The most common types of databases are:

-  Relational (RDBMS – ACID properties) – Microsoft SQL Server, Oracle, MySQL, Postgres, IBM DB2, Azure SQL, Amazon Aurora
-  NoSQL (BASE properties) – MongoDB, Azure Cosmos, AWS DynamoDB, Couchbase, Riak, CouchDB, Apache Cassandra, and HBase
-  Key-Value Pair (typically used for session state) – Memcache, Redis
-  Hierarchical (embedded in certain products – rarely standalone) – IMS, IDMS
For RDBMS & NoSQL databases, a DBA (Database Administrator) is typically charged with managing the database and should certainly be involved in troubleshooting any potential issue with a database. Each database product is unique in terms of how it's organized, what tools to use to monitor/troubleshoot, and even what techniques improve performance or alleviate issues. This is why experienced product-specific DBAs are important. RDBMS databases are still by far the most used in enterprises. Thus, the rest of this section will focus on troubleshooting RDBMS databases.
RDBMS Troubleshooting

So, what do you do if there is a problem within an RDBMS database? Well, first, you need to understand the scope of the problem. Is the overall system slow or just some functions? What do the OS stats look like (e.g., high CPU, high I/O, thrashing, or all good)? Here are some general high-level guidelines to start with:
-  If the increase is sudden, ask about recent changes to the application or the database, as backing out a recent change is often the most direct solution path.
-  If the overall system is fine, but a particular function is having difficulties, then perform a trace of a session performing that function to identify any performance issues within those commands. The key is to identify the problematic SQL query or queries.
-  If CPU is high (consistently near 100%), then look at the top queries using CPU to see if they can be optimized. Temporarily adding CPU horsepower to the database server may be needed until the situation is stabilized.
-  If CPU is fine, but I/O wait is high, then look at top queries using I/O to see if they can be optimized. Temporarily adding RAM to

the database server may be needed until the situation is stabilized. Note that the database configuration may need to be adjusted to utilize the additional memory. Another possibility is poorly performing storage, so

 

checking the I/O response time is also warranted.
-  If thrashing (high swap rate) is occurring, then the database configuration may need to be adjusted to lower RAM needs and/or RAM needs to be added to the server.
-  If the system looks fine, but several functions are being impacted, then look for the following possibilities: blocking issues within the database (this is very common, and DBAs should have scripts or tools ready to find blocking issues); soft constraint limits like the number of concurrent sessions allowed; or an out of space condition with either a data file or the transaction log (if the database cannot write to either, then no updates can occur on the entire database).
Problematic Database Queries
All major RDBMS vendors provide the ability to find "top queries". This can be done because each SQL query is stored in the "procedure cache" of a database with its execution plan. As such, queries can be viewed on an individual query basis (CPU time, elapsed time, I/O, etc. per execution) or on a combined basis (sum of all executions of the query). The latter (combined basis) is typically the most appropriate way to address an overall CPU or I/O usage issue. While the top queries may already be optimal despite their high utilization, most often that is not the case.

You can identify these top queries with a database-specific monitoring tool (e.g. OEM for Oracle) or a generic ITIM tool with such functionality. Alternatively, a DBA can simply run a query to obtain such information. Below is a sample query for identifying top queries in SQL Server based on "total logical reads", which is a good way to look at it since the more logical 8 KB pages (whether in RAM or on disk) a query needs to read, the more work it is doing and the more CPU it requires. To focus only on CPU consumption, simply uncomment the last line (-- ORDER BY qs.total_worker_time DESC -- CPU time) and comment out the current "order by" line (ORDER BY qs.total_logical_reads DESC -- logical reads). Note that you would need DBA privileges to execute this command. A similar SQL query can be used to find the same information for an Oracle instance, although OEM is much better if available.

 

SQL Server Query to Identify Top Queries:
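A minimal sketch of such a query, assuming the standard sys.dm_exec_query_stats and sys.dm_exec_sql_text dynamic management views (the column list and TOP value can be adjusted as needed), might look like the following:
SELECT TOP 20
    qs.total_logical_reads,      -- total 8 KB pages read (from RAM or disk) across all executions
    qs.total_worker_time,        -- total CPU time (microseconds) across all executions
    qs.execution_count,
    qs.total_elapsed_time,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_logical_reads DESC -- logical reads
-- ORDER BY qs.total_worker_time DESC -- CPU time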

So, what can you do once you find these top queries? First of all, such investigations should include a DBA and the application development team, as familiarity with the database and application is crucial to this level of troubleshooting. Corrective action could include the below:
-  Update the statistics of the tables involved to make sure the optimizer makes the best decision in terms of table access and join order. This is usually something that can be done quickly and is unlikely to cause further harm, so it is a good first step.
-  Perform execution plan analysis to determine where most of the cost is. This is an iterative process as you re-check the execution plan and its effectiveness when any changes are introduced. Having production-like data in a non-production database is key to being able to do this without impacting production.
-  If the execution plan analysis indicates that indexes could help, then add indexes to avoid full table scans, create more selective intermediate result sets, or select an optimal join order. Adding indexes is generally a quick, non-disruptive activity with a minimal downside as long as too many unnecessary indexes aren't added.

-  If necessary, add hints or a profile (a static execution plan for a particular query) to force the optimizer to access the data in a certain manner. This is typically only employed by the most senior DBAs and developers in rare situations where the optimizer

 

refuses to take the most efficient path.
Database Blocking
Aside from long-running queries, the second most common issue that occurs with RDBMS databases is blocking. This occurs because RDBMS databases are designed to preserve the "isolation" of each transaction such that no transaction is impacted by another. It's important to note that blocking is always occurring within large, heavily hit RDBMS databases. Some very short-lived blocking is normal; however, anything long-lasting is typically a problem.

As with top queries, you can identify these blocked and blocking database sessions with a database-specific monitoring tool (e.g. OEM for Oracle) or a generic ITIM tool with such functionality. Alternatively, a DBA can simply run a query to obtain such information. Below is a sample query for identifying blocked sessions, the associated blocking sessions, the database involved, and the SQL text of the blocked session in SQL Server.
SQL Server Query to Identify Blocked and Blocking Database Sessions:
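A minimal sketch of such a query, assuming the standard sys.dm_exec_requests and sys.dm_exec_sql_text dynamic management views, could look like this:
SELECT
    r.session_id AS blocked_session_id,
    r.blocking_session_id,              -- the session holding the lock
    r.wait_type,
    r.wait_time,                        -- how long (in ms) the request has been waiting
    DB_NAME(r.database_id) AS database_name,
    st.text AS blocked_sql_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE r.blocking_session_id <> 0;       -- only show sessions that are currently blocked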

When investigating blocking SQL, the programming rationale for the blocking needs to be understood. Thus, again, DBA and application developer involvement is crucial. Corrective action could include:
-  Killing a user SQL session (a developer or business user) running an ad-hoc query that is blocking active application users. This should be rare but does occur.
-  Killing a batch job that is running at the wrong time. It is common for batch jobs, which are typically meant to execute overnight or

on weekends, to accidentally run during the day and cause a blocking issue. You may need to change the schedule of the job to avoid recurrence.
-  Having reports run against a read-only copy of the database if they are causing blocking issues.

 

-  Changing the isolation levels of sessions being blocked to reduce blocking. If possible, have application designs utilize optimistic locking and leverage snapshot isolation for concurrent access.
-  Changing the code of the blocking session to commit more frequently such that locks are held for a shorter time (once a SQL session commits work, locks – in particular exclusive locks – are released).
-  Ensuring application screens do not wait for users before releasing update or exclusive locks. This is rare and unlikely unless introduced with a recent code change.
Other potential RDBMS problems
Other things to look for in terms of troubleshooting database issues (performance or otherwise) include:

-  Storage Space – Ensuring that the database has sufficient space for growth, transaction logs, and temporary tables is critical. NOTE: Transaction log purging is dependent on when transactions commit as well as other database configuration settings, so a long-running job without commits can cause the database to run out of log space.
-  Errors in the logs – Database log files can contain errors that could provide important clues to problems. For example, deadlocks could be occurring, causing sessions to be killed randomly, and this would appear in the database logs. Corrective action would vary depending on the errors found, but if there is any doubt as to how to approach the situation, then contacting vendor support is advised.
-  Low cache hit ratio – Is there a low cache hit ratio (under 95%) or low buffer memory life expectancy? This could be an indicator that additional RAM would improve performance. Assuming that the top queries have been tuned, then corrective action would be to add RAM to the system and to the database (adding RAM simply to the OS alone would not help – the database needs to utilize it

and this is dependent on the database configuration).
-  I/O performance – Is the I/O response time within an expected range (typically less than 10 ms, but this depends greatly on the type of storage)? Corrective action would include an analysis of the storage system to identify where the bottleneck is (e.g.

 

connectivity to the storage or retrieval time within the disk itself – especially if it is a mechanical disk). If the bottleneck is with the disk itself, then re-balancing the disk could provide some relief. Occasionally, additional disks need to be added for performance reasons even though the capacity is not needed.

Oracle and SQL Server
Oracle and Microsoft's SQL Server are by far the two most common databases found within Enterprise IT environments. Because of this, we will cover these in a little more depth over the next few sections.
Database Processes
Oracle and SQL Server both run as Windows Services on Windows servers. To see if a particular database instance is executing on Windows, start services.msc and look for services whose names begin with SQLServer or Oracle to identify them. Both Oracle and SQL Server have only one primary Windows service per database server instance, albeit there are other services to support adjacent features.

Oracle and SQL Server both run as daemons on Linux. On Linux, Oracle launches a new process for each connection, so if you perform a ps -ef | grep oracle, you could find a lot of processes associated with one database instance. Meanwhile, SQL Server follows the same model on Linux as on Windows, and new processes are not launched with each database connection. Having said that, it is very rare to find SQL Server running on Linux in an Enterprise IT environment due to its limited distribution & support.
Database Tools
Above we mentioned a little about how to find these issues (blocking sessions, top queries, etc.), but what tools would normally be used for this? This varies greatly by product. Third-party monitoring tools are available for Oracle, SQL Server, MySQL, DB2, and Postgres (and a few others). These are not often used with Oracle due to the sophistication already built into the Oracle OEM tool (reviewed below). Below are the product-specific offerings for the two top products (Oracle & SQL Server).

Oracle – OEM 

Oracle provides a tool called OEM (Oracle Enterprise Manager), which is a web-based tool that allows you to view all aspects of the database. This tool does require additional licenses but is typically worth it if you are investing in

 

Oracle database technology. Oracle automatically takes hourly snapshots of the database statistics such that you can go back in time to see which queries took up the most CPU time, performed the most I/O, etc. as well as what other conditions took place for any period of time. Within OEM, the ADDM (Automatic Database Diagnostic Monitor) component recommends errors to look at, queries to tune, and even configuration settings to adjust. The AWR (Automatic Workload Repository) tool within OEM allows you to create detailed reports on the statistics for a given period of time. Oracle also can trace a particular session to find the top queries in that session. Sample screenshot of OEM:

Microsoft SSMS
Microsoft provides a tool called SSMS (SQL Server Management Studio), which includes solid monitoring capabilities with a feature called "Activity Monitor" and with many customizable reports. This is a fat client application that only works on Windows operating systems; however, there is no charge for SSMS – it's freely downloadable from Microsoft. There is a less feature-rich tool called Azure Data Studio for macOS and Linux distributions, but as of 2020, it's quite lacking as compared to SSMS.

SQL Server has other tools and features outside of SSMS. For example, SQL Server provides Dynamic Management Views (DMVs) that can be queried with SQL statements and include all sorts of database usage statistics. A feature called "Query Store" allows you to identify problem queries over time. Database Tuning Advisor (DTA) gives recommendations on queries provided

 

from the active procedure cache, a query store, or a SQL Trace. Sample Screenshot of Activity Monitor within SSMS:
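As a brief illustration of querying a DMV directly (a sketch only – sys.dm_os_wait_stats is just one of many such views), the following lists the wait types the instance has spent the most time on, which can hint at whether CPU, I/O, or locking is the dominant bottleneck:
-- Top waits accumulated since the last SQL Server restart
SELECT TOP 10
    wait_type,
    wait_time_ms,
    waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;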

Database Logs
As with operating systems, key information can be found in log files

generated by the database engine. These are text-based files and they should be reviewed as part of any troubleshooting effort that is “error” related. These can also be reviewed for database-wide “performance” related

issues, but they are less likely to provide anything useful in those circumstances. In Oracle, the logs can be accessed through the OEM tool and/or directly from the filesystem (with "cat", "tail", "more", "grep", etc.). To find the location of the file on the filesystem, the following query can be run: select * from v$diag_info where name='Diag Trace'

In SQL Server, the logs can be accessed from SSMS and/or directly from the filesystem (with “notepad”, “find”, “more”, etc.). While it could vary by

 

installation, SQL Server logs can usually be found at "%Program Files%\Microsoft SQL Server\MSSQLvn.instanceName\MSSQL\Log" where vn is the version of SQL and instanceName is the name of the instance. Messages can be informational or errors – all can be researched with freely available online Oracle or Microsoft documentation. Some examples:
ORA-1652: unable to extend temp segment by 16 in tablespace TEMP
ORA-00060: Deadlock detected.
2019-03-27 20:03:36.86 spid18s The Database Mirroring endpoint is in disabled or stopped state.
Anecdotal Story – Application frozen but all systems are lightly utilized?
Calls come in from several sites that a critical application is not responding to users. All the servers involved (web, application & database) seem fine from a CPU, RAM, I/O, and network perspective – in fact, they all appear to have unusually low utilization for this time of day. What would you ask next to try to get to the root cause?

As always, you start with the question – have there been any recent changes? To this question, the application developer admits that there was a change implemented the night before, but not for the website. The change was related to code for batch jobs that run throughout the day. These batch jobs do not execute on the same servers as the website, but they do access the same database.

This leads you to review the database for contention. The DBA looked for blocking in the database and found dozens of blocked database sessions. The “lead blocker” for all of these was the batch job that had been changed the night before. The developer was asked if the batch job could be canceled and

 

the code rolled back. The answer was affirmative. The batch job was canceled, and the web application began to perform normally once more. The code was rolled back, and the batch job was restarted without incident. Upon further analysis, the developer had attempted to optimize the batch job, but in doing so reduced concurrency. While testing had proven that the optimizations allowed the batch job to run quicker, it was never tested while users exercised the online system at the same time. Thus, this issue was not discovered during testing. A critical takeaway for the development team is to make sure that some level of online testing occurs when changes are made to batch jobs.
8.2 Web & Application Servers
The next hosting platforms that we will cover are web and application

servers. These are combined into the same section because it's often difficult to differentiate where a web server ends and an application server begins. In fact, it's rare to find an application server that does not include at the very least a lightweight web server.
Web Servers
Web servers are defined as processes that accept HTTP & HTTPS connections. These processes can serve static content (e.g. text, images, video, etc.) and optionally dynamic application server content. Web servers

can be loosely coupled to backend application servers by acting as a "reverse proxy" (we will cover proxy servers more in-depth in chapter 10) or more tightly integrated by using plugins that create child processes that behave as application servers. Some web servers can act as both a "reverse proxy" and an application server.

 

A reverse proxy configuration would typically connect to a tightly integrated backend configuration, as almost all application servers today come with at least a lightweight web server. The most common web servers in the market today are Apache, IIS, NGINX, LiteSpeed, and GWS (Google Web Server). Web servers can handle anywhere from less than 100 to thousands of requests per second. The amount depends on the server capacity (in particular CPU and RAM), whether it is acting as an application server as well as a web server, and its configuration settings. Most web servers launch child processes or threads for each new request. As discussed in the previous chapter, while the HTTP protocol is sessionless (sessions must be implemented within the application), it does include the concept of a "persistent connection" (to keep TCP connections open for performance purposes). As a result, each child process or thread maintains its connection to the client until the connection is closed or times out. Below is a simple diagram of how this works with Apache.

The numbers of child processes, threads, and concurrent connections are all

 

limited by the web server's configuration and should be monitored accordingly to determine if any of these "soft constraints" are being approached. Understanding the configuration parameters of a web server is critical to maximizing the amount of work it can do on a particular server.
Application Servers
Application servers allow for the execution of business logic. As mentioned earlier, most often, application servers also include at least a thin HTTP (web) server. Application servers are typically designed to support a specific language or framework (e.g. .NET, Java, JavaScript, Perl, Python, etc.). The most common application servers are IIS (.NET Runtime), Apache Tomcat (Java), Red Hat JBoss (Java), Oracle WebLogic (Java), Node.js (JavaScript), Apache mod_perl (Perl), and Apache mod_python (Python); however, this is just a small subset.

Most application servers include dozens of configuration settings. Some of these configuration settings have similar characteristics to those of web servers (since they typically contain a thin HTTP server), but other settings vary greatly due to the language/framework-specific nature of application servers. Expertise with the specific application server in question during a troubleshooting session is critical. Application instrumentation is also very useful when debugging application server issues as it shows where a failure is occurring and/or where a performance problem exists in the code. Application instrumentation can be implemented with a 3rd party tool (e.g. Dynatrace, New Relic, SolarWinds, AppDynamics, etc.) plugged into the application server or via output generated by the application (e.g. application logs).
Web & Application Server Troubleshooting
If a web or application server is constrained, whether CPU, RAM (swapping), I/O (unlikely), or network (also unlikely), the following process could be followed: If the spike in activity is sudden, ask about recent changes as

backing out a code change is often the quickest method of resolution. If web/application servers are load-balanced, determine if only one server in the group is having a problem. A quick solution may

 

simply be to eliminate the one bad server from the mix. However, a thorough analysis of the problem server is still needed as the issue could resurface, and the server would likely need to eventually be re-added to maintain redundancy or fulfill capacity needs. As with troubleshooting a constraint on any operating system, identify whether there is a particular process that's causing the constraint. Once the process is identified, obtaining a memory dump and recycling the process would be the best immediate step (note that in the case of high CPU, it may be best to take 3 memory dumps a minute apart to allow the vendor to analyze which thread is taking the CPU time). If the problem/constraint is not isolated to one process, then expanding capacity could be warranted. Adding CPUs, RAM, I/O, or network bandwidth to the server could help, or alternatively, adding a server (especially if already load balanced) may be best. Web and application servers can easily scale horizontally when placed behind a load balancer. Overall application tuning is another option, but it is often a lengthier process in terms of identifying significant improvement, thus at least temporarily adding capacity is typically required. If there's no visible constraint (e.g. CPU/RAM) and yet application performance is poor, then the following steps could be followed: If available, utilize application instrumentation to determine what part of the code is driving the high utilization or poor performance. IMPORTANT NOTE: The cause of poor application performance may be calls to a database or calls to another service on another set of servers altogether. In other words, the problem may not be with the primary application server at all.

Identify the process of the application that's experiencing issues and review the configuration settings for that application (e.g. userid, max threads, max connections, etc.) to see if perhaps there's a soft constraint in play.

 

If the problem is with a downstream database or service call, the application server could end up queuing requests to the point where it begins to impact applications not related to the initial problem, but that share the same web/application server. Thus, recycling the web/application server may be necessary to get other applications working. Finally, if the problem is not performance related, but instead the application is failing with an error, then the review of system logs, application logs, platform logs, and once again application instrumentation tools would be most appropriate. Instrumentation tools and web access logs in particular provide a wealth of information and should be reviewed. The next sections focus on instrumentation tools and these logs.
Application Instrumentation Tools
We've now mentioned application instrumentation tools a few times and they will come up again in later chapters as well. The reason is that these tools are very useful for diagnosing problems, whether they be code-related or infrastructure-related. In this brief section, we will provide screenshots of one particular instrumentation tool called Dynatrace; however, this tool is only used for illustrative purposes. There are many ways to instrument your applications; the key is not how, but rather that you can get to the level of information that helps you identify where the problem is located.

A unique feature of instrumentation tools is that they allow you to see how your application is performing over time as with the below diagram:

 

The problem with the above is that it’s summarizing all of the web pages and service calls for your application into one set of values. However, you can also see which pages are being executed most often and/or performing the worst as shown below:

The above diagram allows you to focus on the specific pages that could be a problem. Also, you can look at your application through the lens of how each server is doing as with the below:

The above diagram lets you see if any specific server is an issue. You can even combine these views by selecting a specific web page or service to see if only that page/service is having an issue with a specific server. Searchability is a key to instrumentation tools as well, including the ability to look for requests that failed (e.g. did not return a 2XX HTTP response) or perhaps that took longer than a certain amount of time. The selection criteria are only limited by the information available on the HTTP Headers. Below is a sample of the selection criteria:

 

You can also combine multiple search criteria to focus further on the potential problem requests. Once you have identified the subset of requests that you’d like to inspect, you can then drill into them to see what took the most time and/or failed. Below is an example of such detail:

The above is just a portion of all the work performed by this one request, but it is clear that the majority of the time is being spent on one call to the database. If other requests appear to be taking a long time on this step, then an inspection of this SQL stored procedure would be appropriate to see if improvements could be made. The above discussion on the typical capabilities of instrumentation tools is just a small example of what these powerful tools can provide. It is important to become familiar with the capabilities of whichever tool is available within your enterprise in order to be most effective at identifying where within an application a performance issue or even a failure condition exists.
Web and Application Server Logs

Web & Application server logs are typically the most relevant to review in terms of troubleshooting error conditions, however, they can also be useful when troubleshooting performance issues as they allow you to validate response times. This next section will focus on the various web/application server logs and how they can be used in troubleshooting.

 

Apache & NGINX Logs
Apache & NGINX both have two types of logs – error logs and access logs. Both of these log types are interesting for review. Both Apache & NGINX can have multiple logs of each type depending on the configuration of the webserver. An important point to understand is that web servers can have "virtual hosts" defined within them and then different logs per virtual host. To find the correct log, you need to search for "log" in the Apache or NGINX config files. The organization of web configuration files can get a little complex, so it's important to fully understand the setup.

As the name indicates, "error logs" contain mostly error messages, although informational and warning messages will appear as well. If you are only using Apache or NGINX as a web server (not an application server), then these are typically sparse logs. While many of the errors are simple to understand, referencing documentation and performing general Internet searches to fully comprehend their meaning is warranted. Below are some sample lines for Apache:
[Thu Jun 22 14:20:55 2006] [notice] Apache/2.0.46 (Red Hat) DAV/2 configured -- resuming normal operations
[Fri Dec 16 02:25:55 2005] [error] [client 1.2.3.4] Client sent malformed Host header
[Tue Mar 08 10:34:31 2005] [error] (11)Resource temporarily unavailable: fork: Unable to fork new process
The "access logs", however, are the most often referenced. Apache and NGINX access logs are configurable. There are several formats and you can decide which fields to include in the logs. As the name indicates, access logs contain all attempts to access the webserver. This is typically a large log that needs to be rolled over and deleted periodically. Below are some sample lines from the access log:
127.0.0.1 - peter [9/Feb/2017:10:34:12 -0700] "GET /sample-image.png HTTP/2" 200 1479
198.28.29.43 - - [28/Jul/2006:10:27:32 -0300] "GET /hidden/ HTTP/1.0" 404 7218
204.153.23.12 - - [05/Feb/2012:17:11:55 +0000] "GET / HTTP/1.1" 200 140 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.5 Safari/535.19"

 

What do these messages mean? Well, it would be ideal to view the header of the web access log as it contains the field names included in the log, in order. In this way, there's no need to guess. Having said that, you can usually figure it out without needing the header. For example, in terms of the first sample line above, the fields appear to be as follows:
1)  The IP address of the client (127.0.0.1)
2)  Unknown, as there's no value in any entry (-)
3)  Likely the userid for any "authenticated requests" (peter)
4)  The date/time of the request ([9/Feb/2017:10:34:12 -0700])
5)  The HTTP request line (GET /sample-image.png HTTP/2)
6)  The HTTP response code (200)
7)  The time in milliseconds that it took to complete the request (1479)
How can this information be useful when troubleshooting? There are so many ways that this information can be useful that it's impossible to quantify. First, this allows you to see how many requests are coming in (are there more than usual?). It allows you to see how many requests are getting errors, how many are succeeding, how long requests are taking, what specific HTTP requests are failing, whether a particular client (IP address) is succeeding or failing, etc. Please note that instrumentation tools provide this information in a much easier to digest format, but this is a good secondary source.
Microsoft IIS Logs
Microsoft IIS servers also have access logs and error logs. The path of the error logs is typically "C:\Windows\System32\LogFiles\HTTPERR". The location of the access logs is configurable via a GUI called IIS Manager. Thus, to find the location of the access log you need to query the tool (screenshot below). Also, as with Apache and NGINX, there could be multiple access logs as one is created per IIS WebSite (analogous to Apache Virtual Hosts).

Here's how to find IIS access logs:
-  Start the IIS Manager GUI
-  Click on "Logging"
-  The next screen will have the directory of the IIS access logs

 

The fields included in Microsoft IIS access logs are also configurable, and the logs contain all attempts to access the webserver. As with Apache and NGINX, these are typically large logs that need to be rolled over and deleted periodically. Below are some sample lines from two different logs:
#Fields: time c-ip cs-method cs-uri-stem sc-status cs-version
17:42:15 172.16.25.25 GET /default.html 200 HTTP/1.0
#Fields: date time c-ip cs-method cs-uri-stem cs-uri-query s-port cs-username cs(User-Agent) cs(Referer) sc-status time-taken cs-version X-Forwarded-For
2016-09-13 21:45:10 10.100.4.11 GET /webapp2/getDetails.aspx item=9823&warehouse=2 80 ghr093 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/+Safari/537.36 /webapp2/inventory.aspx 500 5502 HTTP/2 52.0.274.116
What do these messages mean? Since the fields are provided in this case, there's no need to guess as the "#Fields:" line tells you what the value of each field means.
Is one version more useful than the other? Of course, the second access log

with more information is more useful. Of course, the more verbose the log, the more space it takes, but in general it is best to include more rather than less information in the access logs.
What additional information is available in the second version? Date, cs-uri-

 

query (part after ? in the URL), s-port (server port number), cs-username, cs(User-Agent), cs(Referer), time-taken, X-Forwarded-For
What is cs(Referer)? The web page from which the user clicked the link to make this request.
What is X-Forwarded-For? The IP address of the client before going through a proxy or load balancer. This is discussed more in chapter 10.
Why would including cs-uri-query be a potential security risk? This is the portion of the URL that includes parameters passed into the application. Thus, it could potentially include sensitive information (e.g. account numbers), although it is best practice not to include any sensitive information in URI query parameters.
What is the time-taken measurement in? Milliseconds
8.3 Web & Application Server Configuration
While the administration of web and application servers is out of scope for this book due to its complexity, we did want to provide some basics on IIS and NGINX because identifying how applications are defined within these platforms is critical for troubleshooting.
IIS
IIS (Internet Information Services) is Microsoft's web and application server platform. IIS is a good example of a tightly integrated model and is designed to serve as both a web server and an application server. Unlike SQL Server, IIS does not run as a Windows Service. IIS is a feature of the operating system that simply needs to be enabled to become active. Once activated, it executes within the SYSTEM PID itself. However, any application installed on IIS is launched as a separate process called w3wp.exe.

There could be multiple w3wp.exe processes running at the same time; these are called “application pools” or “app pools” for short. An app pool can include 1 or more applications – this is configurable through IIS Manager.



IIS Manager
IIS Manager is a desktop application that serves as the management interface for the local IIS server. You can launch IIS Manager by typing "IIS

 

Manager” on the Windows Search bar or by launching “Administrative Tools” and selecting it from the list of tools. You can manage multiple IIS servers from the same interface, but this is rare. Most of the time, administrators only manage the local IIS server via an RDP connection. Regardless of the way it’s launched, the interface will look something like the below:

From this interface if you click on “Application Pools” on the left-hand side, you are presented with the below:

You will note that while there are four application pools created, there is only one application and it is associated with the MyFIU application pool. From this interface, you can stop, start, or recycle an application pool (note that a recycle starts a new w3wp.exe process, and only when it's ready does it end the existing one – this limits impact to active users). You can also view the applications associated with the application pool and, finally, you can view the "Advanced Settings" for the app pool. You cannot stop, start, or recycle an "application", only an "application pool". Of course, if an app pool only has one application, then it's the same thing. However, in Enterprise IT environments, many app pools have multiple applications.

If you click on Advanced Settings , you can review some critical settings for the app pool:

 

Above, the “Identity” property is selected, and this brings up the “Application Pool Identity” dialog box which lets you choose the userid that should be used to run the app pool process. The current setting is “ApplicationPoolIdentity”, which creates a virtual account with the same name as the app pool. This is a nice feature for app pools that do not need integrated security into other services as it’s a no-privilege account. However, in most Enterprise IT environments, integrated security is used to allow connectivity to SQL Server and other services without the need to provide a password. Thus, often the “Custom Account” option is chosen instead, and an Active Directory service account is chosen. Other interesting parameters include .NET version, Queue Length, Start Mode (OnDemand – start the w3wp.exe process for this app pool when the first request comes in OR AlwaysRunning – start the w3wp.exe process immediately when IIS starts), Idle Time-out and Maximum Worker Processes. While app pools are a critical concept to understand in IIS, there is another critical concept – “sites”. A site is associated with an app pool and to the directory where the application resides.

 

It's important to understand that sites can be stopped, started, and restarted separately from the app pools. However, unlike app pools, there is no process associated with a site. When a site is started or stopped, it's only a logical setting within IIS that's changed. Another key concept is that sites can be limited to only respond to certain DNS names and/or ports.

Aside from these concepts, there are several key configuration items you can review within a site’s definitions including the location of the logs, how the session state is defined, the database connection strings used, whether there

are any HTTP redirects in configured, etc. To the right is a screenshot of the Session State settings below is a screenshot connection strings. Both of these settings and are particularly important of to the know when troubleshooting.

 

 

 

IISRESET
Before moving away from IIS, it's important to note the IISRESET command. This is a command-line utility that can be used to stop, start, and/or restart all of the app pools with a single command. Example usage: iisreset /restart
NGINX
The configuration and administration of NGINX or Apache (which normally run on Linux) are very different from that of IIS. NGINX administration is almost entirely handled through command line executables and driven by text-based configuration files. NGINX does include a web-based monitoring GUI and there have been some 3rd party Apache configuration GUIs developed, but these are rarely used in the enterprise.

The NGINX executable is normally installed at /usr/sbin/nginx, but this is configurable. The main configuration file for NGINX is usually located at /etc/nginx/nginx.conf, but this is also configurable, especially if multiple instances are running on a server. Also, there are usually additional configuration files located in /etc/nginx/conf.d. Lastly, log files are usually located in /var/log/nginx. Below is a sample primary NGINX configuration file:
user nginx;
worker_processes 1;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    worker_connections 1024;
}
http {
    include /etc/nginx/mime.types;

    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '

 

                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    keepalive_timeout 65;
    include /etc/nginx/conf.d/*.conf;
}
Some interesting parameters include user (the userid that NGINX will execute as), worker_processes (the number of worker processes), worker_connections (the maximum number of TCP connections per worker process), log_format main (what to include in the log), access_log, and include (stipulates other config files to review). However, notice that there are no website names or server port numbers. This is because the primary configuration file typically does not include application-specific entries called "virtual hosts". These entries are included in the /etc/nginx/conf.d directory independently so that each application can be managed separately. Below is a sample of an "included" configuration file for a specific application:

server {
    listen 80 default_server;

    server_name rainforest.awsvpcb.edu localhost;
    location / {
        proxy_pass http://localhost:5000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection keep-alive;
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;

        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    # redirect server error pages to the static page /50x.html
    #

 

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
}
Some interesting parameters include the port number (80), the server_name (there are two of them – rainforest.awsvpcb.edu and localhost), and proxy_pass (this means that for this application, NGINX is acting simply as a proxy, sending all requests to http://localhost:5000). There are many more configuration settings, but covering all of them is well outside the scope of this book.
NGINX Starting and Stopping
Starting and stopping services on Linux varies depending on the Linux

distribution and version; however, the most common implementation is something called systemd. With this implementation, you use the "systemctl" command to perform activities against system services like NGINX. NOTE: The systemctl command must be run as root, so it should be prefixed by the "sudo" command. Covering systemd thoroughly is outside the scope of this book, but we can provide some sample commands:
-  sudo systemctl #This command will display all defined services and their current state.
-  sudo systemctl cat nginx #This command will provide the service definition for NGINX
-  sudo systemctl stop nginx #This command stops the NGINX service
-  sudo systemctl start nginx #This command starts the NGINX service
-  sudo systemctl restart nginx #This command restarts the NGINX service
8.4 Application Logs & Stack Traces
Aside from platform-level error and access logs, it is best practice for applications to generate logs. This is part of the instrumentation that is necessary to troubleshoot problems. Since many applications execute on hosting platforms, these logs are often found on the individual application

servers, but the best practice would be that they were sent to a remote log store. The locations of these logs depend on the application configuration. Thus, the application development team needs to provide this information. Some application development teams store application log information in

 

databases or aggregation tools like Graylog & Kibana. This is a best practice because it allows for troubleshooting from a centralized location. Some messages could be informational, but others can include unhandled exceptions that generate a "stack trace". Stack traces are quite important to understand properly, so we will deep dive into how to read them in the next section.
Stack traces
So, what is a stack trace? As mentioned earlier, all threads maintain a "call stack". This is a list of the procedures that have been called so that when one ends, the procedure that called it can pick up where it had left off.

For example, review the following pseudocode: -  Main; t_variable=1; call Proc_A()



-  Proc_A; x_variable=34; call Proc_B()

As Proc_B executes, the call stack includes Proc_A and Main; each call on the stack is encapsulated by a frame that contains the call and relevant local variables.

When Proc_B completes, then its variables and other relevant entries in the “stack” are discarded and the stack frame pointer returns to where Proc_A called Proc_B so that Proc_A can continue with its current instruction’s execution. If a failure occurs in Proc_B, and if Proc_A “catches” that exception, it can determine what to do (e.g. send a nice message to the user or the log

explaining what happened or perhaps retry).

 

However, if Proc_A does not catch the error, then it would flow up to Main. If Main subsequently does not catch the error, then it becomes an “unhandled exception” and would typically be output to standard output/error, which generally is either the screen (for a user-driven application) or a log for hosted applications.

This resulting output is called a "stack trace" – it's an unwinding of the call stack indicating where each procedure was when the error occurred. Thus, the error is given at the top of the trace and the procedure that failed is also at the top.
Here's a simple stack trace to review:
Unhandled Exception: System.InvalidOperationException: fail_on_purpose
at stacktrace.Program.d__4.MoveNext() in /private/tmp/stacktrace/Program.cs:line 34
at stacktrace.Program.Proc_B() in /private/tmp/stacktrace/Program.cs:line 10
at stacktrace.Program.Proc_A() in /private/tmp/stacktrace/Program.cs:line 23
at stacktrace.Program.Main(String[] args) in /private/tmp/stacktrace/Program.cs:line 10
What program failed? Program.cs
Within which procedure did it fail? Proc_B
At what line number did it fail? 34

Which procedure had called it? Proc_A
Unfortunately, most stack traces in an Enterprise IT shop are not so simple. The main reason is the heavy use of frameworks and support code within the hosting platforms. Often, the key piece of information for application

 

developers is somewhere in the middle of the stack trace as opposed to the top. Below is a sample of the kind of stack trace that is common in Enterprise IT shops.

What does the first section represent? This portion is code related to a framework (Spring) being used by the application.
What does the second section represent? This portion is the custom code written by the developers within the enterprise.
What does the third section represent? This portion is the code provided by the application platform (IBM WebSphere).
What's the error exception? java.net.SocketTimeoutException: Read timed out
What's the failing program? WebServiceTemplate.java
What's the failing procedure? sendAndReceive
At what line number? 561
Is this useful to the developer? A little – the developer is calling sendAndReceive and the error exception is important, but they do not necessarily have the code for the framework, nor would they be changing it,

so the line number doesn't help.
What's the truly useful program, procedure, and line number for the developer? Program: PdfGenerationServiceImpl.java, Procedure: sendAndReceive, Line number: 67

 

Anecdotal Story – All servers and the database are good, but stuff is still slow?
Alerts begin to come in from your monitoring tools that a critical external

facing web service is performing poorly (taking longer than the allowed SLA). The performance seems to be consistently poor. You look at the application and database servers that it resides on and they have low CPU, I/O, RAM, and network utilization. Also, the DBA notes that there is no blocking on the database server. What do you do next to attempt to isolate the cause?

Well, as usual, the first question was whether anything had changed. After a quick review of the changes, nothing appeared to be related. This is a great example of where instrumentation is critical. Without a tool or some kind of logging letting you know what part of the web service call is taking time, it’s impossible to know where the slowdown could be. Luckily, you had an instrumentation tool that broke down how much time was being spent by each database call and web service call. This information allows you to notice that all of the time seemed to be spent calling a web service that resides on other servers.

Once this was understood, the server engineers changed their focus to these other servers and DBAs changed their focus to the database used by this other service. Once this was done, the DBAs found that the database server being used by this other service was indeed constrained heavily. A review of top queries identified the problematic queries and they all appeared to be

 

centered around the same large table. The DBAs then realized that automatic statistics updates on this table had been disabled some time ago and never re-enabled. The DBAs ran an update of the statistics on the large table and performance immediately began to improve. A key point that we will continue to stress in the upcoming chapters is that applications are rarely self-contained nowadays. They make calls to other services all the time. So, it's important not to assume that the web service or application that users or your monitoring systems are complaining about is the root of the issue.
8.5 PaaS – Platform as a Service
While we will cover Cloud Computing in a later chapter, we thought it important to introduce the concept of PaaS in this chapter. PaaS is defined as a category of cloud computing services that provides a platform allowing customers to develop, run, and manage applications without the complexity of building and maintaining the infrastructure typically associated with developing and launching an app. In essence, as a developer, you deploy your application to the PaaS service, exposing a URL for others to consume your application without you needing to worry about the OS, server, storage, or networking infrastructure.

A specific PaaS offering would be for a specific type of database, programming language/framework (Java, .NET, etc.), or middleware (e.g. queuing). With PaaS, you do not have visibility into some of the constraints that may occur behind the scenes. You need to depend on the cloud provider to ensure that sufficient capacity is available or, in the case of some solutions, you need to increase the amount of capacity that you've purchased. Access to the operating system and host platform is typically not provided. Application instrumentation is provided with some PaaS services, but not others. Instrumentation is even more important in a PaaS environment given the lack of visibility into the underlying infrastructure. PaaS trades a

loss of visibility and control for simplicity of deployment and support.

 

More on PaaS and other Cloud Computing offerings in chapter 10.
8.6 Browsers
Yes, modern browsers are an application platform. They are simply a client-side platform as opposed to a server-side platform. Browsers allow not only for the execution of HTML (HTML5 includes the ability to execute code), JavaScript, CSS, & XML but also countless plugins of all types. Browsers perform multi-tasking (launching multiple processes that communicate with each other through IPC) and multi-threading.

From a troubleshooting perspective, the key is that, as with all platforms, the version and type of the browser can influence the performance and even functionality of the application. Testing with multiple browsers and even versions is always warranted. For example, recent security updates from several browsers around when cookies will be accepted based on the SameSite attribute have caused issues with certain older applications.
8.7 Hosted Desktops or VDI (Virtual Desktop Infrastructure)
Hosted Desktops (also called VDI – Virtual Desktop Infrastructure) are

another common platform in today’s enterprise environments. A Hosted Desktop commonly involves a browser-based connection to a desktop environment which includes an office productivity suite alongside other desktop applications. The desktop is hosted, run, delivered, and supported from a central location, usually a secure data center.

 

There are several reasons to use hosted desktops:
-  Colocation of fat client applications with their back-end resources (e.g. database) while still maintaining the ability to publish the client to remote consumers. Many applications require the database they utilize to be physically close for performance reasons. When the application on the desktop connects to a database, this means that the desktop needs to be in the same location as the database. If users are remote, this becomes a problem, and hosted desktops solve it.
-  Allow 3rd party client machines to access internal desktop-based applications securely from an external connection. This is mostly a security consideration, but a critical requirement for many enterprises.
-  Allow for the deployment of BYOD (Bring Your Own Device) within an enterprise whereby the client machine has no direct access to internal systems.
-  Simplify the management of desktops across the enterprise as they are all updated within the data center.
-  Provide backup/disaster recovery for the enterprise desktop environment.

The most common hosted desktop solutions for Windows are from Citrix, VMware, and Microsoft. macOS and Linux machines can be used as “hosted desktops” through the X-Windows protocol, but this is not common in most enterprises. Hosted Desktops are typically Windows virtual machines on a physical server, so troubleshooting constraints is much like troubleshooting any Windows server within a virtualized environment. Monitoring the hypervisor is key to determining constraints, even more so than on other servers, because hosted desktops are often oversubscribed on the physical host. Modern hosted desktop environments are complex, with several services involved, thus when dealing with hosted desktops, having an experienced engineer on hand is critical to ensure that the issue is not at that

level.

 

Chapter 8 – Review/Key Questions

1)  What are the potential problems that can impact database servers?
2)  What are some actions you can take to address poorly executing queries?
3)  What are some actions you can take to address blocking?
4)  What’s the difference between web and application servers?
5)  What are the potential problems that can impact web/application servers and what can you do to address them?
6)  What kind of information can you find in a web access log?
7)  What kind of information could an instrumentation tool provide?
8)  What kind of information can you find about an application in IIS Manager?
9)  Where can you find information about your NGINX applications or websites?
10)  On a stack trace, where is the most relevant information?
11)  What kind of code executes on a web browser?
12)  What are the advantages of using a Virtual Desktop Infrastructure (VDI)?

 

Chapter 9 – Application Architecture Techniques

In this chapter, we will cover architecture techniques used in support of Enterprise IT applications. Many applications utilize these techniques to provide a complete and secure solution. Thus, it’s critical to understand these higher-level techniques, how they are used, what can go wrong with them, and how to troubleshoot them. While the products used to implement these techniques vary from enterprise to enterprise, the basic concept behind these techniques is common and can generally be applied to any implementation.

9.1 Proxying

A proxy is a program that establishes itself in between two hosts communicating with each other. A key point to understand is that a proxy maintains two distinct TCP conversations, each with its own SYN, SYN/ACK, ACK handshake with the host on its side. The hosts on either side cannot guarantee that anything sent to the proxy will continue through to the other conversation; the proxy controls this completely. A “man-in-the-middle” attack is essentially a secret and unauthorized proxy. Again, the key is to understand that a proxy can stop or alter the communication stream (the minimal relay sketch below illustrates this two-connection behavior). Conversely, a NAT (Network Address Translation – discussed in chapter 2) is a layer-3 translation and thus not considered a proxy, but instead a routing function, as it does not interfere with layers 4-7 and does not respond to a SYN. So, why would you purposefully use a proxy? There are several use cases for a proxy. Below are some common ones:


 

*Both web and reverse proxies are very common in the Enterprise
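To make the two-conversation point concrete, here is a minimal, illustrative relay written in Python. It is only a sketch (the listening port and backend address are hypothetical placeholders), but it shows that the proxy completes one TCP handshake with the client and a completely separate one with the backend, and that nothing flows between the two sides unless the proxy copies it.

```python
# Minimal illustrative TCP proxy: accept a client connection, open a
# second, separate TCP connection to the backend, and relay bytes in
# both directions. Note that there are two distinct TCP conversations.
import socket
import threading

LISTEN_ADDR = ("0.0.0.0", 8080)          # where clients connect (hypothetical)
BACKEND_ADDR = ("backend.example", 80)   # the real server (hypothetical)

def pipe(src, dst):
    # Copy bytes one way until the sender closes or an error occurs.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()

def handle(client_sock):
    # This connect() is a brand-new SYN/SYN-ACK/ACK handshake with the
    # backend, independent of the handshake the client performed with us.
    backend_sock = socket.create_connection(BACKEND_ADDR)
    threading.Thread(target=pipe, args=(client_sock, backend_sock), daemon=True).start()
    pipe(backend_sock, client_sock)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(LISTEN_ADDR)
listener.listen(5)
while True:
    conn, _addr = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```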

 

Web Proxy
Below is an example of a web proxy because the clients (users) are internal to the enterprise network and the server is out on the Internet.

Reverse Proxy
Below is an example of a reverse proxy because the clients (users) are out on the Internet and the servers are internal to the enterprise network.

Bypass Proxy
Below is an example of a bypass proxy because it is purposefully intended to bypass normal enterprise (or in this case school) controls.

Web Proxy Considerations

 

The purpose of a Web Proxy is to perform “content filtering”. That is, the enterprise does not want to allow users to access certain websites, in order to avoid security risks as well as legal risks. Some web proxies provide additional benefits, like performance enhancements by caching static content locally, as well as inspecting the traffic to detect malware, certain types of attacks, and data exfiltration (sensitive data being sent out of your network). Regardless of its functionality, the intent is that when users within an enterprise attempt to access any website, they are directed to a web proxy to verify that they are allowed to access the site. So, how do you redirect all web traffic to a web proxy? Well, there are several ways to do this, but here are some common methods:

VPN to Cloud Service – One way to do this is to force all Internet traffic from user VLANs to an Internet-based cloud service that performs this web proxy functionality. These types of services have become quite popular with enterprises over the past decade and will likely become the most common approach in the near future.

WCCP Redirect – This is a Cisco proprietary protocol, but it can be simulated by other vendors. Routers redirect all HTTP/HTTPS requests to a defined proxy.

 

Browser config (PAC file or proxy defined) – Assuming that users are blocked from making changes, the browser can be set up to use a PAC file (a script that dynamically determines proxy based on current address) or to use a hardcoded proxy address.
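When diagnosing proxy-related issues on a client machine, it can help to see which proxy settings the machine is actually picking up. As a rough sketch, Python’s standard library can report the proxy configuration it detects from environment variables (and, on Windows, the registry); this is only one view of the configuration and does not cover WCCP-style redirection, which is invisible to the client.

```python
# Quick check of which proxy settings this client machine is picking up.
# urllib honors environment variables such as HTTP_PROXY/HTTPS_PROXY,
# and on Windows it also reads the registry proxy settings.
import urllib.request

proxies = urllib.request.getproxies()
print("Configured proxies:", proxies or "none detected")
```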

So, now that you have all users being directed to your proxy, whether on-premise or in the cloud, there’s yet another consideration – encryption. Remember that some proxies provide additional benefits like malware detection and blocking data exfiltration. Well, how can the proxy do that if it cannot read the data? To get around this, web proxies oftentimes decrypt the data, analyze it, and then re-encrypt it on the other side. How is this done? Simple: by distributing a trusted private Certificate Authority (CA) to all PCs such that the proxy can produce a certificate with any name and be trusted. Let’s walk through this:

When a client is trying to establish a TLS connection to a website, the proxy

 

intercepts the request and replies back to the client with its own certificate for the destination website, which was created by the trusted private corporate CA. The PC trusts this certificate even though it was just created on the fly for this Internet website. The proxy then establishes a normal TLS connection to the real website. This process is important to understand because if your enterprise is set up in this way, then you cannot inspect an outside certificate from your PC because you never see it! You would need to connect from outside of your corporate network to inspect the certificate of an outside entity.
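One practical way to see whether this kind of interception is in play is to look at the certificate actually presented for a public site from inside the network and compare it with what an outside connection sees. The sketch below (the hostname is just a placeholder) prints the issuer of the presented certificate; behind an intercepting proxy, the issuer will be the private corporate CA, and if that CA is not in the local trust store the handshake will simply fail, which is itself a telling sign.

```python
# Fetch and inspect the certificate actually presented for a site.
# Run from inside and outside the corporate network and compare issuers.
import socket
import ssl

host = "www.example.com"  # placeholder hostname
ctx = ssl.create_default_context()
with socket.create_connection((host, 443), timeout=5) as tcp:
    with ctx.wrap_socket(tcp, server_hostname=host) as tls:
        cert = tls.getpeercert()

print("Issuer: ", cert.get("issuer"))
print("Subject:", cert.get("subject"))
print("Expires:", cert.get("notAfter"))
```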

9.2 Load Balancing

A load balancer (specifically, a network load balancer) is a reverse proxy where there are multiple backend servers. There are generally two types of network load balancers – Layer 4 and Layer 7 (referring to the OSI model). Layer 4 load balancers only inspect up to layer 4 (protocol & port #) to determine which backend server(s) to send traffic to. Layer 4 load balancers are not commonly used in the enterprise anymore. Load balancers can be generic (support many types of applications) or product-specific (included in a product to support just that product – e.g., database servers come with a built-in load balancer). For the rest of this section, we will be discussing generic Layer 7 network load balancers as they are the most common in the enterprise.

Load Balancer Primary Services
Load balancers can provide several services (the list varies depending on the sophistication and version of the load balancer). The primary services are as follows:

Provide a Virtual IP (VIP) per pool – This is the primary service provided by all load balancers. Each “pool” of load-balanced backend services (not servers since we are talking about ports on a server) has a VIP/port assigned (this is what clients would connect to). There are typically many pools & VIPs behind a load

 

balancer.

Health Checking – This is another core service provided by all load balancers. The load balancer needs to know which backend services, called “members”, in each pool are healthy. There are two methods for validating the health of a member, as follows:
ACTIVE – This is by far the most common approach. Provide the load balancer with a command that executes every few seconds and must return with certain output to indicate that the member is healthy. The health check command could be as simple as a ping or as complex as an HTTP GET request that runs a series of checks within the application. A balance needs to be struck between

the thoroughness of the keep-alive, the processing power it takes to execute, and how often it can be run. The most common approach is a TCPPing to the backend port to at least make sure the port is responding.
PASSIVE – This is a less common approach. This entails

 

simply having the load balancer look for conditions in the stream of data passing through it (e.g. consider an HTTP service down if an HTTP 5xx response code is seen). This is a more complicated approach because it’s difficult to know when to re-establish a member as valid, and often errors may only be experienced by one specific page – so bringing down the entire application because one page is failing doesn’t make sense.
Why is a simple ping likely not a good enough health check? While ping is sometimes used, it is considered poor practice. At the very least, a TCPPing to the backend port should be performed. The main reason is that ping only tests whether the backend server is up and running, not the backend service (in most cases the backend web/application server). If you only use ping as a health check, then if the web or application server goes down, requests will continue to go to that server/pool member because the ping will continue to be successful. With a TCPPing, at least if the backend web or application server goes down, the health check will fail and the load balancer will stop sending traffic to that server/pool member (a minimal sketch combining a TCPPing-style check with round-robin selection follows this list of primary services).
Load Balancing – This is another primary load balancer service, although the number of options provided varies significantly depending on the sophistication of the load balancer. Load balancers need to send requests to the backend members of the pool in some kind of order. Below are the most common methods:
ROUND ROBIN – each pool member gets the same number of requests one after another. This is by far the most common approach and usually the default.
WEIGHTED ROUND ROBIN – some pool members get a higher percentage of requests.
LEAST CONNECTIONS – to make sure the number of connections across pool members is balanced, requests always go to the pool member with the fewest active connections.
LEAST RESPONSE TIME – to ensure that the most responsive pool members get the most requests, send more traffic to the member responding fastest to the health checks.

 

Sticky Sessions – This is yet another primary service provided by all load balancers because it is common to require clients to “stick” to one pool member after their initial connection. In essence, this is the ability to have clients always go to the same pool member when traversing the load balancer. The most common need for this is with HTTP applications where the sessions are kept in RAM on the application server. There are two common methods for maintaining “sticky sessions”, as follows:
COOKIES – The load balancer injects a cookie into the HTTP connection to help it track the session and determine which member needs to service each request. This is only applicable for HTTP connections, but those are the most common.
IP AFFINITY – Track based on the client IP and always send the same client IP to the same pool member. There needs to be an idle timeout associated with this method;

typically, an hour or so.
TLS/SSL Offloading – This is the last of the primary services provided by load balancers. Since load balancers terminate connections with the client, all of them offer TLS/SSL offloading (AKA SSL termination). That is, the TLS connection is

 

established with the load balancer. This means that the load balancer presents the required certificates and maintains the associated private keys. The back-end connection to the pool members can also be encrypted separately, but often is not.
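To tie the health check and load balancing method discussions together, below is a minimal sketch (member addresses are hypothetical) of a TCPPing-style active health check followed by round-robin selection among the members that pass. Real load balancers do this continuously and far more efficiently; the sketch only illustrates the logic.

```python
# Sketch: TCP-level health check ("TCPPing") of each pool member,
# then round-robin rotation over the members that pass the check.
import itertools
import socket

POOL = [("10.0.1.11", 8080), ("10.0.1.12", 8080), ("10.0.1.13", 8080)]

def tcp_ping(host, port, timeout=2.0):
    # Healthy only if the backend port actually accepts a connection;
    # an ICMP ping alone would not prove the service is listening.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

healthy = [member for member in POOL if tcp_ping(*member)]
if not healthy:
    raise SystemExit("no healthy pool members")

rotation = itertools.cycle(healthy)  # round robin over healthy members
for _ in range(5):
    print("next request goes to:", next(rotation))
```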

Additional Load Balancer Services
Aside from the primary load balancer services detailed above, some of the more sophisticated load balancers provide additional, advanced services such as:

High availability pairing – Each load balancer is actually a pair of load balancer devices that check with each other constantly, and if the active one goes down, the passive one immediately takes over. When this failover occurs, all VIPs (pools) automatically move to the other load balancer,

 

usually using “gratuitous ARP” to accomplish this.
Rule Scripting – A scripting language may be included to allow for the manipulation or selective logging of the data in transit. While this must be used judiciously so as to not overwhelm the load balancer, it is a very powerful feature that is commonly used in many enterprise environments.
HTTP Header Manipulations – HTTP headers can be dropped or added as it makes sense. For example, the X-Forwarded-For HTTP header is commonly added so that the original client IP address is passed along to the web/application server (a short sketch of an application reading this header appears below).
Web Application Firewall (WAF) – A special type of load balancer is a WAF (web application firewall). This kind of device inspects the traffic passing through it for known security issues (e.g. SQL injection, cross-site scripting, etc.). If hosted in the cloud, this kind of device can also help ward off DoS attacks.
Some popular load balancer vendors include F5, Citrix, Cisco, and Radware, but there are many more. Also, cloud providers (Amazon, Microsoft, etc.) include native cloud-based load balancers that continue to improve in functionality.
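As a small illustration of the X-Forwarded-For header mentioned above, the sketch below shows a toy WSGI application that reports the original client IP rather than the load balancer’s address. This is illustrative only; real applications would use their framework’s equivalent and should only trust the header when it is known to be set by their own load balancer.

```python
# Sketch: read the X-Forwarded-For header a load balancer adds so the
# original client IP can be logged instead of the load balancer's IP.
from wsgiref.simple_server import make_server

def app(environ, start_response):
    forwarded_for = environ.get("HTTP_X_FORWARDED_FOR", "")
    # The left-most entry is normally the original client.
    client_ip = forwarded_for.split(",")[0].strip() or environ.get("REMOTE_ADDR", "unknown")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [f"client ip: {client_ip}\n".encode()]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```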

Troubleshooting load balancers
From a troubleshooting perspective, there are several factors to keep in mind when dealing with a load-balanced application, as follows: 1)  Health Checks: a. The first step should always be to validate whether all pool members are still considered active or if any have been marked down due to a failed health check. If the health checks are failing, then all focus needs to shift to these health

checks and why they are failing. b. It’s important to note that sporadic health check failures could be happening, so just because all is green when you first look at the health checks, doesn’t mean they are not the problem. Checking the load balancer logs is also recommended to make sure that no failures have recently

 

occurred. c. Even if the health checks aren’t failing, it’s important to understand the nature of the health check for the pool being used by the application in question because this will inform you as to what “healthy” means to the load balancer. For example, if the health check is simply a ping, then you can’t assume that the backend web servers are even up and running because they may be down, and the load balancer would not know. However, if the health check is a call to the application, then if the health check is green, you can assume the processes are at least up and running. 2)  Constraints - It’s also important to understand that load balancers are systems that can be constrained like any other system. Under the covers, most load balancers are simply running some variant of Linux with similar shell prompts and commands available. So, verifying that the load balancer itself is not constrained is a critical step. 3)  Session Stickiness - It’s important to understand the session stickiness of the application in question because this will inform you as to whether a user will remain on one server or bounce around during one session. 4)  How is each server doing independently? a. Could the problem only be with one of the backend servers/pool members? Remember that unless the health check is sophisticated, a poorly performing pool member is likely to continue to get work sent to it. This is important because if it’s just one server causing a problem, then a quick fix is to take the problem server/pool member out of the pool (this assumes that there is enough capacity with the remaining servers/pool members to support the user load). b. If an instrumentation tool is available, it can be checked to

see if there are more errors or poorer response times from any of the pool members or if things are balanced out. c. When using the load balanced name to access the application, how can you determine which backend pool member is being used? It is best for the application itself to

 

provide this information by displaying it somewhere on the screen or in a hidden field; however, another option is creating a web page specifically to display this information. Regardless, it is good to ask if there is a way to do this. d. Lastly, the ability to troubleshoot by accessing each backend server directly via SSH or RDP is an important tactic as well since it allows you to eliminate the load balancer and all the networking from the client to the web/application server. It also allows you to compare how each pool member is doing independently. 5)  Access Logs – Some load balancers (e.g. AWS’ ELBs) can log all access requests to them, similar to how web servers save to their access logs. Reviewing these access logs can be informative while diagnosing a problem.

Anecdotal Story – Canary in the Coal Mine

Alarms are triggered because several critical web services are exceeding their required SLA (less than 2 seconds) sporadically. There are multiple web/application servers, but they all appear fine from a CPU, memory, and disk perspective. There are also database servers in the mix, but they also appear to be fine – no constraints or blocking. What are some possible next steps to take?

Gathering additional information would be the next step to take. What additional information would be helpful? Well, one question that should always be asked is whether any changes had been implemented for this application or any component involved in this application’s environment.

 

Another key question is whether any other applications were experiencing any slowdown, since missing a 2-second SLA only sporadically is such a subtle problem that it could simply be the canary in a coal mine (i.e., the first indication of a bigger problem). The answers to these questions were quite revealing. First, a brief study of the instrumentation tool showed that other applications were indeed performing slightly slower as well, but the difference was so minor (1-2 seconds slower) that users did not seem to be complaining about it, or at least not yet. In addition to that, it was discovered that there was a change implemented the night before on the load balancer. The change to the load balancer was to enable a script that would inspect all traffic and record information on any traffic that was not TLS 1.2. The intent of the change was to identify this traffic and shut it down for security reasons. However, given that the load balancer managed so much traffic, this could have been putting a burden on it. The web hosting team logged into the Linux shell of the load balancer and noticed that it was at 100% CPU utilization. The script was then disabled, and the performance returned to normal. However, that did not end the story. The load balancer had been at or near 100% utilization for a few hours, and only when this critical transaction began to breach its tight SLA did anyone even realize that there was a problem. That’s not optimal. The load balancer is a critical component, and alerting on it when it becomes overloaded is an important control to have in place. Through problem management, monitoring and alerting for this condition was added such that it would be caught sooner were it to recur.

9.3 Global Server Load Balancing (GSLB)

The previous section was all about the most common form of load balancing – that is, “local” load balancing. It’s called local load balancing because the load balancers typically reside on the same network and location as the

backend pool members (e.g. they are on the same LAN). Such load balancing is all about providing services within one data center. However, what if your application must survive even if the primary data center from which it executes fails? This is a rather common requirement for large enterprises. Many critical

 

applications need to be able to switch to another data center within minutes or seconds. The problem is that while switching many applications from one location to another during a major event could be done manually, doing so would take a great deal of time. To help companies overcome this problem, a new technology was developed about 20 years ago called Global Server Load Balancing or GSLB. The purpose of GSLBs is to allow applications (or pools, in load balancing terms) to automatically fail over to another location. GSLB is performed via name resolution (DNS) as opposed to using a reverse proxy as with local load balancers. As a result, GSLB features are more limited due to the technology involved. Health checking and automated failover are available, but session stickiness, WAF, TLS, and Rule Scripting are not applicable. So, how does it work?

As shown on the diagram to the right, a GSLB is simply a DNS server with some additional

functionality. The DNS names that need to be able to failover to another location are served by the GSLB. The GSLB is constantly performing health checks against two or more local load balancer VIP pools to verify that they are functioning. The clients simply look up the DNS name, the request goes to the GSLB and it responds with the IP address of one of the two locations depending on its configuration and which sites are responding properly to the

 

“health checks”. Because health checking is involved and failover is included, the TTL on the associated DNS names must be kept small (typically 30 seconds). This is needed to minimize the duration of DNS caching on the client side. The DNS caching must be kept short-lived so that if a failover occurs, the clients will recognize it, because they will have to do another DNS lookup and the GSLB can then respond with the new IP address. Besides automatic failover, GSLBs also offer full redundancy by synchronizing updates across geographically dispersed copies of themselves. In this way, no one location can cause a GSLB failure. GSLBs can also perform load balancing based on round robin and even load statistics gathered from the local load balancer. However, if session stickiness is required or a specific site is preferred (both are common), then load balancing cannot be performed and the GSLB is only used for site failover. Let’s take a look at a couple of questions more deeply to understand GSLBs better.

Why can’t GSLBs handle session stickiness?
First, why is it that if session stickiness is required, the GSLB can’t do load balancing between the two (or more) locations? Remember that session stickiness is needed because the session state of the application is in RAM on one specific web/application server. Thus, all requests need to be sent to that same server and thus that same location. However, the GSLB is communicating with the client via a DNS request and response, not HTTP.

Thus, no headers or similar information can be sent to the GSLB to inform it that it needs to send all future requests for this name to a specific location to maintain session stickiness. Since the GSLB has no way of knowing where to send the client, if session stickiness is required, then the GSLB must always send users to the same location unless that location’s health check fails. Of course, if the health check fails, the session would be lost as requests are sent to the other location, but that’s OK because the primary location is down anyway.

Why are GSLBs typically only used for site failover?
Secondly, assuming session stickiness is NOT needed (e.g. the session state

information is stored in a database), why would it be common to prefer to send all clients to one location over another and use the GSLB only for failovers? This comes down to performance. Even today, it is still quite

 

common that, to sustain solid performance, it’s preferred to keep the application code (in this case running on the web/application server) physically close (e.g. in the same data center) to the backend database it is using. In the case of RDBMS databases, while they can have read-only copies at the other locations, it’s rare for them to be able to have the primary read/write database in multiple locations. Thus, since the database can only be in one location, for performance reasons, it’s usually better to have the application execute from the same location. It’s important to note that this second point about preferring to have just one primary GSLB location even when session stickiness is not needed is not absolute. If the application uses a highly distributed NoSQL database, for instance, then it could very well run from any location, accessing the local copy of the NoSQL database. Also, many services do not access databases at all, but instead other services that can be remote without impacting performance. Lastly, even if the application is using an RDBMS, it may be architected such that the application can reside physically distant from the database without noticeable performance impact. Having said all of this, in most enterprises, it is still most common to require that the application reside physically close to the database, and thus GSLBs are not often able to be used for load balancing. It is also important to stress that normal DNS servers do not provide GSLB functionality. The standard enterprise DNS server will typically delegate only the applicable DNS names to the GSLB (typically only a small set of DNS names are globally load-balanced). GSLBs are typically tightly coupled with the local load balancer (e.g. same product set and same top vendor list). Below is a comparison between Local Load Balancer and GSLB features:

 

Troubleshooting GSLBs
Since GSLBs are DNS servers, troubleshooting them is similar to troubleshooting DNS name resolution issues. Also, since GSLBs are also

load balancers, there are some similarities with how you troubleshoot local load balancers as well. Below are some things to look for:
1)  As with any name resolution activity, host files and DNS caching issues can get in the way of resolving to the correct IP.
2)  As with local load balancers, health checks can fail, so the first thing to check is whether all locations/pool VIPs are active to the GSLB. If not, then the focus should be on the health check and why it may be failing.
3)  As with local load balancers, one other factor to keep in mind is that health checks could periodically fail, causing sporadic issues. To diagnose this, you need access to the GSLB’s log to validate that the health checks have not been failing.

9.4 Content Delivery Networks

A content delivery network is a geographically dispersed set of reverse proxy servers. The intent is to reduce strain on centralized web servers and network latency for websites that are accessed worldwide (e.g. when you are in Europe and attempt to hit www.google.com, you are hitting a server in Europe). This is achieved through regional DNS cache injection, eDNS, or Anycast; basically, DNS tricks to send you to the closest location.

CDNs began strictly as a means of pushing static content (e.g. video, images, music, JavaScript, HTML, etc.) quicker and to offload stress on centralized servers. However, they became so popular with larger worldwide websites, that they began to expand their offerings. CDNs have now evolved into full reverse proxies that automatically pull updates, support dynamic content retrieval from backend servers, and can even include security features like DDoS prevention and WAF functionality. Given their current implementations, CDNs are useful in addition to a GSLB.

In such a scenario, the GSLB would only be responsible for the DNS resolution of the backend VIP (Origin Server)

 

in the diagram below, while the CDN would be responsible for the DNS resolution of a different name that would act as the reverse proxy for end users. Since this is still a quickly evolving field, this may very well change over time. The most popular vendors of CDN services include Akamai, AWS CloudFront, Cloudflare, Verizon, Limelight, and Imperva.

Troubleshooting CDNs
More so than with GSLBs and local load balancers, troubleshooting CDN issues requires the vendor to get involved. While there could be problems with the configurations entered into the CDN’s portal, diagnosing issues is not simple

because access to the globally dispersed network of servers is limited to nonexistent. The complexity and scale of these networks are what make them attractive, but this also means that there’s increased reliance on the vendor to make sure everything is operating as expected. We can add CDNs to the chart we built before comparing local load balancers and GSLBs to provide a finer contrast as follows:

 

Anecdotal Story – Disaster Test Failure

You’re involved in a disaster recovery test where the application being tested is being switched from one location to another. The application’s database is failed over to the other location using a SQL Server Always-On Availability Group failover and the web/application servers are switched to utilize the ones at the other location by changing the preferred site on the GSLB. Everything seems to have gone well with the database as it can be accessed outside the application and everything seems to have worked with the GSLB as you test doing a nslookup and the DR site’s load-balanced IP address responds. You then test the application, and all works well. However, when the users start to test the application, some of them report getting 500 errors on the browser. Where do you start to diagnose this?

 

Since a GSLB change was made, the first step would be to verify the GSLB configuration to make sure that the failover to the DR site was configured correctly and that no other mistake was made. Everything looks good, but you check the GSLB logs to make sure the health checks aren’t sporadically failing, which would cause some users to get the wrong IP; however, the log looks clean. You decide to ask one of the users with the problem to ping the www.acme.com name and, to your surprise, the IP address returned is that of Site #1 (the primary site), not Site #2 (the DR site being tested). Soon, you can confirm that all working users are going to Site #2 (the DR site) and non-working users are going to Site #1 (the original, non-DR site). Why would this be happening? You realize that the users that are failing are the ones that were using the system just before the switch to the DR site, and a thought comes to mind – what’s the TTL for www.acme.com? The answer is 30 minutes. Thus, the users having problems still have the old IP address cached on their PCs. You flush the DNS cache on their PCs, and they begin to work as expected. The DR test is a success, but the work is not done. There are two things to fix:
1)  The TTL for this DNS name within the GSLB must be changed to something smaller (e.g. 30 seconds).
2)  Even though it was not good that the users were being routed to the original non-DR site, why were they getting 500 errors? This still should have worked, albeit potentially with some performance problems because the database was physically distant from the application servers. On this second point, it was discovered that the firewall in front of the database server at the DR location (Site #2) was blocking the web/application servers at Site #1 from access. This was corrected as well.
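A quick way to confirm what TTL a GSLB is handing out for a name is to query it directly. The sketch below uses the third-party dnspython package (assumed to be installed via pip install dnspython) against the fictitious name from the story; nslookup or dig would show the same information.

```python
# Check the TTL and addresses currently handed out for a GSLB-managed name.
# Requires the third-party dnspython package (pip install dnspython).
import dns.resolver

answer = dns.resolver.resolve("www.acme.com", "A")  # fictitious name from the story
print("TTL handed out:", answer.rrset.ttl, "seconds")
for record in answer:
    print("resolves to:", record.address)
```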

 

9.5 Message Queuing/Message Bus

Message queues are a popular technique for supporting asynchronous communications; that is, communication that does not expect or require immediate feedback to a request. This does not diminish the criticality of such requests; it’s just that the timing of the response is not expected to be immediate. Email is a prime example of asynchronous communications; however, email is generally not considered a good message queuing solution for program-to-program asynchronous communications. While email is used by some systems for this type of communication, it is most appropriate when user interaction is involved. A key feature of message queuing systems is the guarantee of message delivery. This guarantee stems from the fact that the message will not be removed from the source system until confirmation of successful receipt comes from the destination system. Messages can be placed in any of several queues by producers (or publishers) and then extracted from the queue by consumers (or subscribers). There could be multiple consumers and/or producers for any given queue. A message bus is a similar concept but implies a centralized process that sits in between the systems managing the queues.
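As a concrete illustration of the delivery guarantee described above, here is a minimal sketch using RabbitMQ via the pika client library (both assumed to be available; the queue name is hypothetical). The broker does not discard the message until the consumer explicitly acknowledges it.

```python
# Sketch of guaranteed delivery with RabbitMQ via pika: the message stays
# in the (durable) queue until the consumer explicitly acknowledges it.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)  # hypothetical queue name

# Producer: publish a persistent message.
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=b"order #123",
    properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
)

# Consumer: acknowledge only after the work succeeds; if the consumer dies
# before the ack, the broker re-delivers the message.
def handle(ch, method, properties, body):
    print("processing", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="orders", on_message_callback=handle)
channel.start_consuming()
```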

Historically, message queuing solutions ride on either TCP or HTTP; however, they use proprietary protocols to add/remove items from the queue. The one exception is JMS (Java Message Service), which is a well-known standard for communication between Java-based applications. There are other documented standards such as AMQP, STOMP & MQTT, but none have broad adoption. There are many products in this space, including RabbitMQ,

 

Microsoft MQ (built into the OS as a feature like IIS), IBM MQ, Apache ActiveMQ, Oracle AQ, JBoss Messaging, Amazon SQS, and Azure Service Bus. Architecturally, the message queues typically reside locally on the application server, and messages are sent either directly to another application server that has a queue listening or to a centralized group of servers from which consumers pick up their messages or have messages forwarded to them. From a troubleshooting perspective, it’s important to understand the architecture of any message queuing involved within an application to be able to determine if there are any bottlenecks in between and where they may reside.

9.6 Enterprise Service Bus

An Enterprise Service Bus (ESB), which can also be called an Application Integration Platform, takes the message queuing concept to a higher level by introducing the ability to transform the message in various ways during transport. As such, an ESB is not constrained to a single protocol but instead is designed to be able to ingest and communicate with many different protocols.

ESBs grew in popularity with the SOA (Service Oriented Architecture) movement as they allowed for different types of applications to communicate with each other without requiring a rewrite of

 

any existing service. ESBs provide for both asynchronous and synchronous communications depending on the rules established for each bus. ESBs are typically architected as a centralized solution and often have redundancy built in as a result. ESBs are fairly complex and further details are outside the scope of this course, but as with message queuing, it’s important to understand the architecture in place to support the applications that you are troubleshooting. Some of the most popular vendors of ESBs include IBM (WebSphere), Mulesoft, Microsoft (BizTalk), WSO2, and Oracle.

9.7 API Gateways

API gateways are a technology born out of the Microservices movement. API gateways act as a reverse proxy and provide the ability to transform API calls in various ways similar to an ESB; however, API gateways are more commonly focused on just SOAP & REST calls. Some API gateways also provide ESB functionality, as these two technologies are close in terms of functionality.

API gateways allow developers to invoke multiple back-end services and aggregate the results when responding to a single API call into the gateway. In addition to exposing microservices, popular API gateway features include functions such as authentication, security policy enforcement, load balancing,

monitoring & SLA management. As with ESBs, API gateways are typically architected as a centralized solution and often have redundancy built in as a result. API gateways are fairly complex and further details are outside the scope of this course, but as with ESBs, it’s important to understand the architecture in place to support the applications that you are troubleshooting.

 

Some of the most popular vendors of API gateways include Google (Apigee), Mulesoft, IBM (WebSphere), WSO2, and Software AG.

9.8 Job Scheduling

Another common technique within Enterprise IT shops is job scheduling (also called batch job scheduling if there are several jobs with dependencies – e.g. a batch). This can also be called event-based scheduling. Job scheduling entails the execution of unattended processes based on various conditions such as date/time, the successful completion or failure of another job, the creation of a file, the addition of a row in a table, or any combination of these. Due to its unattended nature, this kind of processing requires thoughtful error checking and notification – how will someone know that something has gone wrong?
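As a sketch of the kind of error checking and notification an unattended job needs, the example below wraps a job in a try/except block and emails the full stack trace on failure. The SMTP host and addresses are hypothetical placeholders; an enterprise scheduler would typically provide its own alerting on a non-zero exit code as well.

```python
# Sketch: an unattended (scheduled) job that reports failures instead of
# silently swallowing them. Host and addresses are placeholders.
import smtplib
import traceback
from email.message import EmailMessage

def notify(subject, body):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "batch-jobs@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content(body)
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)

def nightly_job():
    # ... the real unattended work would go here ...
    pass

if __name__ == "__main__":
    try:
        nightly_job()
    except Exception:
        notify("Nightly job FAILED", traceback.format_exc())
        raise  # also exit non-zero so the scheduler records the failure
```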

Job scheduling capabilities come standard with operating systems (e.g. Windows Task Scheduler on Windows and cron on Linux), however, in Enterprise IT shops, there is typically a much more sophisticated scheduler employed that can execute jobs on both platforms (and sometimes others as well). These systems have a central controller (or automation engine) that

manages all of the schedules with their dependencies, triggers, etc. The unattended jobs then run on other servers that have agents that communicate with the central controller. From a troubleshooting perspective, the key is to query whether there are any event-based or batch job processes involved with the applications in scope during a troubleshooting exercise. Not considering unattended processes can

 

cause delays in determining the root cause.

9.9 Network Segmentation

A common practice in corporate networks is the segmentation of the network into smaller, typically secured subnets. The reasons for doing this are as follows:
-  Improved Security (firewall enforcement) – Reducing the attack surface for critical sections of the network such as the DMZ (portion exposed to the Internet), the Business Partner Segment (portion exposed to business partners), PCI (the portion that credit card data traverses), and AAA (the portion where authentication and authorization appliances reside).
-  Controlling Visitor Access (firewall enforcement – usually in combination with NAC) – Creating a segment that does not have access to the interior network for wireless access by visitors.
-  Reduced Congestion (VLAN enforcement) – Avoiding broadcasts being sent throughout the entire network.
-  Containing Network Problems (VLAN enforcement) – Limiting the effect of local failures.

When dealing with network segmentation, consider each firewalled area as an independent network. It’s also important to note that the emerging trend of micro-segmentation enabled by SDDC (Software-Defined Data Center) will further complicate this in the future. From a troubleshooting perspective, the key is to understand in which segment each device involved in the problem is located and whether there is any restriction (e.g. firewall enforcement) that could be preventing access to some other device.

9.10 Tunneling

Network tunneling allows for communications between two devices that

otherwise could not communicate, by encapsulating (or wrapping) their conversation within another protocol. The most common purposes are as follows:
-  Allow private networks to communicate over the Internet, a.k.a. VPN. Sample uses: a home PC to a company’s private network (typically SSL VPN, which is a host-to-host VPN); one company to another or one company location to another (typically IPsec, which is a site-to-site VPN).
-  Allow a device using one protocol to connect to a device on another (e.g. IPv6 over IPv4 and vice versa).
-  Allow the communication of protocols that are not typically allowed through a firewall.

There are many tunneling protocols (SSL VPN, IPsec, GRE, PPTP, etc.). Tunnels do not need to be encrypted, but in practice almost all are.

From a troubleshooting perspective, when tunnels are involved, it’s important to clearly understand the IP addresses involved in the conversation being diagnosed. Typically, it’s the IP address inside the tunnel that you should focus on as that’s what is trying to reach resources on the other side of the tunnel. Also, when dealing with SSL VPN, reviewing the routing table of a

PC as described in chapter 5 is a good practice as well to make sure that the IP addresses trying to be reached are actually routing over the tunnel and not the Internet.

9.11 Parallelization

Parallelization is the idea of performing multiple subtasks concurrently to

 

improve the overall performance of the main task. This is a technique used in many areas (most of which we have already touched on). Primary examples include:
-  Database processing – The concept of breaking large queries into several smaller queries, running them concurrently, and then putting the results back together to find the final result. Most modern RDBMS systems offer this capability and even decide automatically when best to use it.
-  Communications – Creating multiple concurrent requests for data over communication protocols is another common use case. Browsers have been doing this for decades.
-  Load Balancing – Load balancing is a form of parallelization as it allows multiple requests to be serviced by multiple machines at the same time.
-  Big data – Big data processing is impossible without extreme parallelization. This is due to a fundamental need to process Terabytes or Petabytes of data in any reasonable amount of time.

For parallelization to function properly, you need to be able to break up the task into discrete independent shards of work. From a troubleshooting perspective, it’s important to question whether parallelization is possible to improve performance for any particular problem.
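As a small illustration of breaking work into independent shards and running them concurrently, the sketch below fetches several (hypothetical) URLs in parallel using Python’s thread pool. The same pattern applies whenever the units of work do not depend on each other.

```python
# Sketch: run independent shards of work concurrently with a thread pool.
# The URLs are hypothetical; any independent unit of work could be used.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = [
    "https://example.com/report?page=1",
    "https://example.com/report?page=2",
    "https://example.com/report?page=3",
]

def fetch(url):
    # Each shard is fully independent of the others.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=3) as pool:
    for url, size in pool.map(fetch, URLS):
        print(url, "->", size, "bytes")
```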

 

Chapter 9 – Review/Key Questions

1)  What are some use cases for a proxy server?
2)  How do Enterprise IT shops force the use of a web proxy?
3)  What are the common features of a load balancer?
4)  What are some advanced features of a load balancer?
5)  What are the differences between a local load balancer and a GSLB?
6)  What functionality does a CDN provide?
7)  What functionality does message queuing provide?
8)  What’s the difference between message queuing, ESBs, and API Gateways?
9)  What are the things to look out for with batch job scheduling?
10)  What functionality does tunneling provide?
11)  What are some of the use cases for network segmentation?

 

Chapter 10 – Cloud Computing Fundamentals

In this chapter, we will cover cloud computing fundamentals. While most enterprises still have a great deal of computing set up in proprietary data centers, the move to the cloud is an active pursuit for most enterprises. Some enterprises have even completed their move to the cloud. However, the cloud is not just one thing. There are many services and countless ways to set up your environment. While getting into all those details is outside the scope of this book, the goal of this chapter is to provide a decent high-level understanding of the options available and the considerations that need to go into selecting these options.

10.1 What is Cloud Computing?

There are many definitions of Cloud Computing. Here are a few:

“Cloud computing is the on-demand delivery of compute power, database, storage, applications, and other IT resources via the internet with pay-as-you-go pricing … over the Internet.” [1]

“IT resources as an elastic, utility-like service from a cloud ‘provider’ (including storage, computing, networking, data processing and analytics, application development, machine learning, and even fully managed services).” [2]

“[C]loud computing is the delivery of computing services—including servers, storage, databases, networking, software, analytics, and intelligence—over the Internet … [paying] only for cloud services you use” [3]

“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

“Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user … available to many users over the Internet … typically [using] a "pay-as-you-go" model.” [5]

 

Here's the definition that we landed on:

Cloud computing is a software-defined, elastically scalable, multitenant, provider-managed technology stack that is consumed as a network-connected utility in a self-service and on-demand manner, for which the consumer pays for usage in a metered manner for the allocation/consumption of pooled resources in service of the consumer.

It’s a mouthful of terms, no doubt … let’s take it one “byte” at a time by tackling the key words called out in this definition as follows:
-  Software-defined – Via a collection of APIs and UIs for programmatically provisioning, managing, and consuming software-based abstractions of IT resources.
-  Elastically Scalable – Through a transparent ability to scale up/down and in/out with minimal, if any, reconfiguration of consuming applications.
-  Multitenant – Servicing of multiple independent tenants, likely of a heterogeneous variety in terms of service consumption and workload characteristics.
-  Provider Managed – Configuration options for a service are furnished as governed features for consumer use, with no “system-level” administrative access.
-  Technology Stack – The IT resources necessary to deliver a business product/service, including, but not limited to, disks, servers, databases, firewalls, load balancers, app runtimes, APIs, and full-blown applications.
-  Network Connected – Services are accessed over IP and are remote from the perspective of the consumer (remote meaning not directly attached to the consuming user’s computing devices).
-  Utility – Services are useful, provide benefit, and are both convenient and practical.
-  Self-Service – Provisioning, management, and consumption are user-driven and user-centric via provider-furnished APIs and UIs.
-  On-Demand – Delivery of functionality is at the behest of the consuming user.
-  Metered – For any given resource or service, units of consumption measurement are defined and accumulated over time such that consumers pay for what has been used according to the cloud provider (a.k.a. pay-as-you-go).
-  Pooled Resources – The aggregation of provider assets into a collective pool that can be purposed and repurposed dynamically to address the needs of the tenancy.

As noted earlier, cloud computing is software-defined. It is a system and line of business application whose business is IT infrastructure and application hosting capabilities. It is the abstraction that is key. Cloud computing is premised on the virtualization of assets and resources such that they can be programmatically manipulated.
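As a small illustration of what “software-defined” means in practice, the sketch below lists EC2 instances through the AWS API using the boto3 SDK (assumed to be installed, with credentials already configured). The same resources that were once racked and cabled by hand are here just objects returned by an API call.

```python
# Sketch: querying infrastructure through an API rather than a console.
# Assumes boto3 is installed and AWS credentials are already configured.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_instances()
for reservation in response.get("Reservations", []):
    for instance in reservation.get("Instances", []):
        print(instance["InstanceId"], instance["State"]["Name"])
```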

10.2 Deployment Models

As mentioned earlier, there are many ways to deploy to the cloud. We will start with the various deployment models available to enterprises. Deployment models center around how the enterprise users connect to cloud resources and how much the resources are shared with other entities.

Primary Deployment Models
For example, Public Cloud computing refers to a cloud computing model where services are offered by third-party vendors independently of

corporate/Enterprise IT. Access is via the Internet or a Cloud Exchange Point (CXP) / Cloud Exchange Network (CXN). The tenancy is generally open to anyone willing to pay for and consume services. Capital costs and management overhead are offloaded to third-party vendors. Public Cloud Deployment Model

 

Meanwhile, Private Cloud computing refers to a cloud computing model where services are offered by corporate/Enterprise IT. Access is via the WAN or LAN. The tenancy is generally restricted to applications and employees of the company. Capital costs and management overhead are incurred by the enterprise.

 

Private Cloud Deployment Model

Hybrid Cloud computing refers to a cloud computing model where services are offered by both third-party vendors and corporate/Enterprise IT. Workloads can be placed with a public cloud provider, in the private cloud, or in a combination of both. Hybrid cloud combines all of the characteristics of both Public and Private Cloud deployments and offers flexibility of deployment when dealing with client or regulatory constraints on data locality and residency.

Hybrid Cloud Computing Model

Finally, Multicloud computing refers to a cloud computing model where

 

services are offered by multiple third-party vendors independently of corporate/Enterprise IT. It has all of the characteristics of Public Cloud deployments. It allows for vendor agnosticism and fit-for-purpose cloud usage. It enables competitive pricing amongst multiple vendors offering cloud services. This deployment model typically occurs in enterprises more so through acquisition than purposefully, as it’s difficult to easily migrate workloads from one cloud provider to another.

Multicloud Deployment Model

Other deployment models
Aside from these well-known deployment models, there are also emerging deployment models that provide for exciting opportunities in the future. These deployment models include “Fog computing”, which is a relatively new paradigm that focuses on offloading or moving processing and storage to the endpoint devices consuming cloud services and providing application services to users or other applications. Internet of Things (IoT) processing is

a good example of Fog computing. Then there’s “Edge computing”, another relatively new paradigm that focuses on moving data and computation as close as possible to the endpoint devices that will perform the computation. Content Delivery Networks (CDNs) and cloud computing appliances are good examples of Edge computing.

 

Finally, there’s “Grid computing”, which is a long-standing paradigm that focuses on aggregating highly distributed resources, usually diverse and spread across geographic boundaries, to solve a problem through a common software package or library. The Great Internet Mersenne Prime Search (GIMPS) and distributed.net’s hash and encryption algorithm cracking software are two great examples of grid computing. Note, this should not be confused with HPC Grid computing, which usually aggregates homogenous assets into a type of supercomputer composed of large numbers of nodes, usually in a few geographic locations.

10.3 Service Models

As touched on briefly in a previous chapter, there are several “service models” as it relates to the cloud. The difference between these models is all

about how much control you as a customer wish to maintain and/or how much you are willing to pay to avoid sharing resources. We will focus on the difference between the four primary choices that enterprises have today, although it’s important to note that some services straddle the fence in between these offerings. These choices are On-Premises, IaaS, PaaS, and SaaS as illustrated below:

Primary service models
Now, let’s dig into the three cloud offerings in a little more detail, starting with IaaS. IaaS is the most basic of all cloud service models available today. It is a fully managed compute, storage, and network service exposing only logical IT asset abstractions. The compute handoff interface is the guest VM and guest VM OS. The storage handoff interface is a raw block or object interface. The

 

network handoff interface is at Layer 3+ (routing, subnetting, IP addressing, logical firewalls/ACLs, and load balancers). IaaS is billed at the resource level, per reservation or consumption. Here’s a summary of the advantages and disadvantages of this service model:

Next, let’s take a look at PaaS. PaaS is the most application-centric of all cloud service models available today. It is a fully managed app hosting and development runtime service exposing only application-centric abstractions. For example, data platform handoff occurs at the database, data pipeline, or analytics model (i.e. no server, instance, or shard to install and manage). Application runtime platform handoff occurs at the programming language runtime (i.e. no framework, web server, or load balancer to install and manage). PaaS is billed at the transaction unit or resource level, per reservation or consumption. CPU, memory, disk, network, and software resources are encapsulated as service units. Here’s a summary of the advantages and disadvantages of this service model:

**typically** implies that the app is not constantly executing transactions or

service requests. At scale and under continual load, IaaS or on-premise may be cheaper. Lastly, let’s take a look at the SaaS service model. SaaS is the most native and utilitarian of all cloud service models available today. It is a fully managed application exposing only business process functionality. Service handoff typically occurs in a browser or client application UI. Consumable

 

APIs are often exposed for integration into custom applications. A SaaS platform may consist entirely of APIs for use in other software applications. SaaS is billed at the subscriber seat count level, typically with tiered service per seat type. All application and infrastructure services are packaged up as part of the seat cost. APIs are often billed per transaction or per sized block of transactions. Here’s a summary of the advantages and disadvantages of this service model:

Now, if we overlay these service models on top of the deployment models discussed in the previous section, then we can depict the below as an overall view of cloud options:

This diagram generalizes the choices made as you move from Private to Public on the deployment model exchanging control for economies of scale

 

and from IaaS to SaaS on the service model exchanging flexibility for abstraction.

Other service models
As with deployment models, there are other service models available depending on the technology and, at times, the marketing at play. While there are many, we’ve decided to call out at least a few.

Everything/Anything as a Service (XaaS) refers to any technology service that can be packaged up and presented such that it meets the definition of Cloud Computing as laid out in the experiential definition. The “as a Service” term proliferates in the age of cloud computing given the market opportunities.

Serverless Computing (a.k.a. Functions as a Service, FaaS) is a runtime-based cloud service model paradigm where a cloud provider abstracts the entirety of the stack such that application developers only have to worry about event-driven programming of logic for a function or subroutine. FaaS is arguably the evolution of PaaS into the most discrete unit of consumption and delivery.

Unified Communications as a Service (UCaaS) and Network as a Service (NaaS) are a relatively new set of service paradigms being seen in the industry where telephony, telepresence, collaboration, messaging, and data communications are being packaged up and managed as SaaS-like services.

10.4 Security in the Cloud

Cloud Security is sufficiently broad and deep of a topic that it can easily take up an entire book. Suffice it to say that it is out of scope for this book; however, some key aspects of cloud security can be reviewed. Cloud Security can be broken up into two broad areas that rely on similar cybersecurity principles and technologies (i.e. encryption, authentication, intrusion detection/prevention,

antivirus, etc.). -  The security measures cloud providers use to protect the fabric and its tenants from threats originating from other tenants or externally. -  The security measures cloud tenants use to protect logical cloud resources from external threats, accidental disclosure, or tenant internal vectors.

 

The easiest way to think of this is that cloud providers are responsible for securing the cloud whereas consumers of the cloud are responsible for securing what they put into the cloud. You need to secure assets and resources the same way they would be secured on-premise. For example, with IaaS, the cloud provider is not going to patch your systems to ensure that known vulnerabilities do not impact your business. The cloud provider will also not make sure that your internal systems are firewalled from the Internet; they may provide you with a mechanism to do it, but you need to make sure it’s configured properly. With PaaS, cloud providers will not protect you if you have not coded your applications to prevent common attacks like SQL injection or Cross-site scripting. Cloud providers will not ensure that proper complexity passwords are used. The key takeaway is that the discipline of security is as important, if not more so, when you deploy to the cloud.
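As an illustration of the tenant side of that split, routine checks like the following still belong to you with IaaS. This is a minimal sketch; the security group ID is a hypothetical placeholder and the package manager depends on your distribution:

    # Check for unpatched packages on an IaaS Linux VM you own
    sudo yum check-update

    # Confirm an AWS security group is not open to the entire Internet (0.0.0.0/0)
    aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
        --query "SecurityGroups[].IpPermissions[].IpRanges[].CidrIp"

The provider supplies the building blocks (patch repositories, security groups, firewall constructs), but verifying that they are actually applied and configured correctly remains the consumer’s job.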

10.5 Cloud Providers
So, who are the “cloud providers”? Well, there are hundreds, if not thousands, now and the solutions they provide are as varied as the problems they try to solve. The largest growing and most varied type of service model by far is SaaS. This is because SaaS offerings are usually quite specific, thus any startup can create a narrow-focused cloud offering that solves a very specific problem. There are SaaS providers that solve business needs like managing HR, Finance, Sales/Marketing, Contact Center, etc. Other SaaS providers solve technology needs like Email, Video Conferencing, Storage, VPN, User Authentication/Authorization, Content Delivery, Content Filtering, Backup/Restore, etc. Some provide solutions to more generic needs like collaboration, customer surveying, payment processing, etc. Nowadays, it’s difficult to think of a problem that is not solvable or at least possibly solvable

with a SaaS solution. So, what about IaaS and PaaS? This is where the market thins. Providing IaaS and PaaS is more difficult because it requires much more complex engineering and requires a great deal more capital on the part of the provider to handle the dynamic and unpredictable nature of the capacity needs of these services. IaaS and PaaS also require work on the part of the customer to

 

secure and properly configure their systems. This means that with IaaS and PaaS, engineers on the enterprise side of the house need to be well-versed with the cloud provider’s systems and capabilities. Due to this, enterprises are more likely to focus on one IaaS/PaaS provider to limit the training and expertise needed on their side. Because of all of the above, it is much less likely for a new and/or smaller company to break into the general IaaS and PaaS markets. Some niche PaaS offerings could certainly arise, but large enterprises would most likely focus on the larger players in the market. This is why Amazon’s AWS (about 32% of overall market share as of Q3/2020) and Microsoft’s Azure (about 19% market share) dominate. There are other players in this market like Google’s GCP, Oracle’s OCI, and IBM’s Cloud, but they have struggled to obtain anywhere near the market share enjoyed by AWS and Azure in terms of enterprise clients and this is unlikely to change in the foreseeable future. Due to this, IT professionals should focus on learning about the intricacies of Amazon’s AWS and Microsoft’s Azure IaaS and PaaS offerings as these are likely to dominate the enterprise IT space for years to come.
10.6 Troubleshooting Issues in the Cloud
So, now that you have a broad, high-level understanding of what the cloud is all about, how do you troubleshoot issues in the cloud? Well, it depends on the deployment and service model. The more of the environment that is under the

control of the Enterprise IT area, the more it will be responsible for troubleshooting issues. For example, with SaaS offerings, there’s not much troubleshooting that you can do from the enterprise perspective; the enterprise engineering team is completely dependent on the SaaS provider. Thus, with SaaS offerings, having the ability to contact the cloud provider’s support team

and get an immediate response is critical. At the other end of the spectrum is IaaS, where the troubleshooting skills needed are very similar to those needed in an on-premise environment. Under this model, since the Enterprise IT shop owns the operating systems, web servers, load balancers, application servers, database servers, etc., all of the troubleshooting tools reviewed in the previous chapters apply in the same

 

way as a traditional on-premise environment. However, there are some additional skills needed as well even with IaaS. The networking setup is unique with each cloud provider, so network engineering teams need to become very familiar with the intricacies of the cloud provider’s terminology and configuration options. Similarly, VMware or Hyper-V is replaced with the cloud provider’s hypervisor interface and offerings, so server engineering teams need to familiarize themselves with these. Still, most problems will be solved with similar tools as on-prem.
As expected, PaaS is somewhere in the middle. Because access to the operating systems is not available with PaaS, most of the tools discussed in this book are simply not available to be used. However, the concepts learned in this book do still apply and can be used with the proprietary interfaces available with these PaaS offerings. The following are some basics that need to be understood about your cloud provider’s PaaS offerings to properly support them from an engineering perspective and troubleshoot issues when they occur:
-  Configuration Options: The configuration of PaaS offerings can be complex and will include concepts like sizing (CPU, RAM, storage), subnetting, load balancing, session state, DNS, etc. It’s important to learn the intricacies of what each cloud provider’s options mean for each service.
-  Thresholds/Throttling: Cloud providers will only allow you to use what you pay for, especially in terms of PaaS offerings. Thus, if you can only perform 100 I/Os per second, 1,000 transactions per minute, transmit 100Mb per second, etc., then that is all you will be able to do. The problem with these thresholds (which are based on the tier of service you have purchased) is that sometimes it is not clear that you have exceeded them. For example, Azure SQL has thresholds on the amount of I/O to any specific file. So, even though your CPU is low and perhaps your overall I/O isn’t too high, if you have crossed the I/O threshold for one particularly critical file, you will be throttled in terms of that I/O and begin to experience performance issues.
-  Diagnostic Tools: It’s also important to understand what diagnostic tools are available for these PaaS offerings. For starters, both AWS and Azure provide decent monitoring and alerting capabilities. These must be well understood so that you know where to look for potential problems like constraints. Also, cloud providers often offer special diagnostic tools that need to be understood. For example, many Azure services, like ASE (App Service Environment), which provides similar services to IIS, provide for “console access”. This is in essence a very limited command-line interface to the system providing this service. On this console, you can issue commands like ping, tcpping (similar to test-netconnection in PowerShell), nameresolver (similar to nslookup), etc. to diagnose network connectivity issues as you would under any other circumstance (see the example after this list).
-  Instrumentation: Many instrumentation tools can be implemented with cloud-deployed applications and this is critical to be able to identify where in the code the failures or performance problems lie.
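As an illustration, here are the kinds of commands that might be issued from such a PaaS console to verify connectivity from an Azure App Service to a backend database. The hostname and port are hypothetical placeholders:

    nameresolver mydb.database.windows.net
    tcpping mydb.database.windows.net:1433

The first command shows whether, and to what IP address, the platform resolves the database name; the second shows whether the app’s sandbox can reach that host on its port. These are the same checks you would perform with nslookup and Test-NetConnection on a server you own, just done through the provider’s console equivalents.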

While there is certainly a lot more that would need to be discussed to provide decent coverage of this topic, such information is outside the scope of this book.

 

Anecdotal Story – Why is this copy taking forever?
A development team rolled out a change overnight for a critical application and all seemed well with initial testing. However, once the user load

increased, the application began to experience errors and the decision was made to roll back the code to the previous version to stabilize the situation. The rollback procedure was expected to take about 10 minutes simply due to the large amount of data that had to be transferred to the cloud provider. After 30 minutes, however, the transfer is still not even halfway done, and the user community is suffering from the problem. The developer explains that the data is copied from the cloud-based source code repository to a gateway server and then copied from there to the external PaaS service that hosts the application. The network engineer checks the Internet circuit and there’s no bandwidth constraint. What can you do to figure out the reason for the slow transfer? Below is a diagram of the transfer:

Given that the only thing going on is a copy of data, looking more deeply at the network traffic appears to be the way to start. Also, since the Gateway Server is right in the middle of the data transfer, it would be the best place to capture the network traffic as you can determine which leg of the traffic is

causing the issue. Note that there are two TCP conversations at play:
1)  From the Internet-based Source Code Repository to the Gateway Server
2)  From the Gateway Server to the External PaaS Service
A Wireshark capture is performed on the Gateway Server for a short time (30 seconds) and reviewed. The TCP Stream from the Source Code Repository to the Gateway Server was clean of any TCP slowdowns (e.g. Retransmissions, Duplicate Acknowledgements, ECN flag turned on, etc.). However, the TCP

 

Stream from the Gateway Server to the External PaaS Service had many significant delays. The delays were all due to the window size being set to 0 by the External PaaS Service. The client-side Gateway process appeared to be periodically probing with a duplicate acknowledgment until finally the window size would open, more data was sent, and then the window size again went to 0 and the cycle started again.

If you remember our discussions on TCP conversations, there are two distinct ways in which TCP conversations can be slowed down:
1)  Flow Control - The receiver is overwhelmed and limits the receive window to slow down the flow of data to it. This is typically rare because NIC cards rarely get overwhelmed.
2)  Congestion Control - The sender notices network congestion by seeing duplicate acknowledgments (indicating that some packets did not arrive at the receiver) or experiences a timeout causing it to perform a retransmission. This is more common as this is what occurs when bandwidth is constrained somewhere within the network.
Since the receive window is being restricted in this case, it’s the receiver that’s slowing down the conversation (flow control), not the network in between. Since the receiver is the External PaaS Service, this simply

appeared to be an artificial constraint being caused by throttling due to some threshold being exceeded. The cloud provider was immediately contacted, and they confirm that this was being caused by the fact that you were exceeding the amount of data per second that is allowed to be transmitted into the PaaS Service given the tier of service you were paying for. This had not occurred during the overnight code migration because the service was not being used during this time. By the

 

time the cloud provider had confirmed the diagnosis, the transmission had ended – well over an hour after it had started. Thus, there was no immediate action needed; however, a review and adjustment of this threshold by increasing the tier of service was performed after the fact.
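For reference, the TCP symptoms described in this story can be pulled out of a capture quickly with Wireshark display filters, shown here via tshark (Wireshark’s command-line counterpart); the capture file name is just a placeholder:

    # Receiver-driven flow control (window size set to 0 by the receiver)
    tshark -r capture.pcap -Y "tcp.analysis.zero_window"

    # Congestion-related symptoms (lost packets and retransmissions)
    tshark -r capture.pcap -Y "tcp.analysis.retransmission || tcp.analysis.duplicate_ack"

Seeing only zero-window events, as in this case, points at the receiver rather than the network in between.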

 

Chapter 10 – Review/Key Questions

1)  What are the core features of cloud computing?
2)  What are the differences between the cloud deployment models?
3)  What are the differences between the cloud service models?
4)  In terms of security, what are cloud providers responsible for and what are you responsible for?
5)  Which service model has broad usage across many cloud providers?
6)  Which are the primary cloud providers of the IaaS and PaaS service models?
7)  How does the troubleshooting of SaaS, PaaS, and IaaS differ?
8)  What are some key items to remember in terms of troubleshooting PaaS offerings?

 

Chapter 11 – Application Architecture Patterns

In this chapter, we will cover common application architectures used within Enterprise IT environments and what they mean from a troubleshooting perspective. It’s important to note that these architecture patterns are generalities as applications and application ecosystems can utilize multiple of these patterns. Still, being able to identify these patterns helps in the troubleshooting process.
11.1 Tiering and Layering
Application architecture is a complex topic. In this chapter, we will be focusing mostly on the “tiering” of applications as opposed to a similar, but different topic called layering. We do recognize that there aren’t any standard industry definitions for the terms “tiering” and “layering” and yet they are used often when explaining application architectures. The definitions we’ve landed on are based on what seems to be the most popular way to look at these terms. However, the entire intent behind introducing these terms is simply to inform about the application architecture patterns that follow. Application layering generally refers to the separation of application development responsibility within the business application. For example, separate teams/individuals could “own” the different layers of the application stack. Application tiering on the other hand generally refers to the physical deployment of the application across separate processes and/or machines. Thus, application layering influences, but does not dictate, application tiering. Here’s one way to look at these concepts:

In the above, you can see a clear alignment of application layers with application tiers. Some developers could manage the presentation layer (e.g.

 

Javascript), which happens to execute on the client tier (e.g. browser). Other developers manage the business rules layer (.NET backend code) which happens to execute on the business tier (application servers). Yet another group of developers could manage the storage layer (e.g. developing stored procedures) which resides on the storage tier (database servers). However, here’s a different view on this topic:

In this diagram, you can see that all of these application layers can reside on one physical machine or even within one process. This doesn’t mean that the application layers no longer exist, just that the application tiers have melted away to the point that they are not relevant anymore. As we review these architecture patterns, it’s important to understand that these variations will exist, and on complex incidents, probing the development team about the structure of their application is important to make sure everyone understands what may be occurring.
11.2 Terminal Emulation
The first architecture pattern that we will review is an old one that has

evolved over the years: terminal emulation. Character-based applications (a.k.a. green-screen applications) were the original user interfaces starting back in the 1950s and they are still around in Enterprise IT shops today. The most popular of these operating systems are Linux (Telnet/SSH interface), IBM’s z/OS (a.k.a. OS/390 or mainframe), and IBM’s i (a.k.a. OS/400, AS/400 or iSeries), but there are others. For this section, we will focus on z/OS and i as the Linux interface (e.g. PuTTY) has already been discussed.

 

IBM’s z/OS predecessors communicated with 3270 dumb terminals, while IBM’s i predecessors communicated with 5250 dumb terminals. While these looked similar, they were not compatible, thus neither were their protocols. These physical dumb terminals (keyboard and screen only, no CPU) no longer exist in the enterprise, however, their protocols do. TN3270 (for z/OS) and TN5250 (for i) were created to allow the simulation of these terminals over a TCP/IP network. Both TN3270 and TN5250 ride on the Telnet protocol and thus use port 23 or 992 (TLS) by default. Of course, ports can be changed by the administrator. Terminal emulation-based applications are typically 1- or 2-layer applications (depends on the design). Regardless of the number of layers, however, they are deployed on at least 2 tiers. Typical IBM i main screen:

Terminal emulation could be the simplest of application architectures,

however, enterprises may evolve it into something more complex and thus more than 2 tiers. In its simplest form, a client-side emulator could be a fat-client application directly communicating with the back-end system (2-tier) using TN3270 or TN5250 as shown below:

 

However, since distributed fat-client applications incur some administrative overhead and exposing TN3270 and TN5250 over the Internet is often discouraged, a common practice is to have the client-side emulator be a Java applet served from a web server (3-tier) as shown below:

Within some enterprises there could even be a complex server application in between combining multiple screens into one modern user interface (this would be n-tier) to gain user efficiency - shown below:

From a troubleshooting perspective, it’s critical to understand the architecture (simple or complex) underlying your terminal emulation-based application

 

and all the servers involved.
Anecdotal Story – Are you blocking me?

A client calls and says that they made a change to their network (they changed their web proxy) and now they cannot connect to your mainframe over the TN3270 web-based applet you provided. They are now going out to the Internet via a different IP address and think you need to allow their new IP addresses access through your firewall. However, you check your firewall and do not see any rules pertaining to the exposed applet URL. They are skeptical and suggest that perhaps something else in your network has the IP-restriction. What are some possible next steps to take?

Whenever there is disagreement about the possible cause of a problem, especially when cooperation is needed between two parties, the first thing that’s needed is to prove or disprove one of the competing theories. In this case, how could you disprove the client’s theory that something with your network is causing this problem? Well, since the IPs are not IP-filtered out to the Internet, they should be accessible from anywhere on the Internet. Thus, a simple way to prove the issue is not within your network is to test from the Internet. Sure enough, a test from the Internet proves successful, so the client now realizes that the issue is within their network. Regardless, since they are the client, you still need to help them resolve the issue. So, what would be the next step? You try browser developer tools to see if you can catch the error, but all it shows is the Java applet being

downloaded successfully and executing; the failure is occurring within the Java applet, so it’s obscured from the browser developer tools. So, then what’s the next step? You need to look at what’s occurring inside the Java applet. As mentioned in chapter 7, when Java is locally installed on a machine, the “Java Console” can be enabled to display all of the activity that occurs within a Java applet. Once this is done, an error appears within the console. The error indicates

 

that a request through a web proxy server was failing. This information is relayed to the client and they realize that the web proxy server in question was the old one they had just turned off. They then realize that while the browser’s settings had been changed to use the new proxy server, Java was still configured to use the old one. In the Java Control Panel, the client changes the configuration to use the browser’s settings as opposed to hardcoding the proxy server. Once this change is made, the application begins to work as expected. Since the problem was with another entity (a customer), there is no need for problem management beyond documenting this incident.
11.3 Client-Server

In terms of simplicity, the next level up would be the client-server application architecture pattern, which is a 2-layer model and also 2-tier in its simplest form. While generically, the client-server

architecture relates to any request-response pair of communicating programs with the client being the initiator of the communication, the rest of this section will focus on a full business application deployed with the client-server architecture pattern. Unlike the terminal emulation pattern, the client side of a client-server architecture is not a simple emulator displaying information generated by the server. Instead, the client side contains business logic and complex UI components.

 

The server side is typically a database server that stores the information in a centralized location such that multiple clients all share and update it. Most client-server applications are optimal when there is a low-latency link between the client and server machines (e.g. they are co-located in the same building). This is due to the chattiness between the programs. One of the key reasons for the development of database stored procedures in the 1990s was to reduce the chattiness between the components by placing more business logic in the database resulting in less back and forth. Due to the need for low latency between the client and the server in this type of architecture, a common solution for allowing remote access involves the use of a VDI (Virtual Desktop) server (3-tier approach). The actual client application process executes on the VDI server that’s in close physical proximity to the database server. The remote client machine uses a screen

sharing protocol like Windows RDP or Citrix’s ICA typically delivered over a browser (HTTPS). Only the screen changes are relayed back to the remote client machine.

 

From a troubleshooting perspective, the client-server model has several points of concern as follows:
-  Client-side operating system dependency is a problem as applications usually only support running on one OS (e.g. Windows) and sometimes only up to a certain version or patch level of that OS (e.g. Windows 7, etc.). This applies to VDI clients and regular clients.
-  Client applications typically have dependencies such as database drivers of a certain version that need to be installed and properly configured on every client machine. This applies to VDI clients and regular clients.
-  When deploying new versions of the client application (e.g. code release), maintaining consistency across all systems is critical. This applies to VDI clients and regular clients.
-  The VDI infrastructure itself can become a bottleneck if not properly sized.
Anecdotal Story – I’m not working, but my neighbor is fine

You start receiving calls from some users that they are receiving errors with their client-server

application, which is served off a VDI. The errors are indicating a failure executing some logic on the client. However, many users are working just fine. What are some possible next steps to take? As with most problems, a good question to start with is whether there were any changes implemented recently. Sure enough, the answer is that yes, there was a major deployment of a new version of the application the night before. This deployment included changes to the database and the client. In fact, the

 

first thing the client application does when it starts is validate the version of the database to make sure it’s compatible, and this is where the failure is occurring. It is confirmed that all of the working clients had a successful client deployment while the failing ones still had the old version. To rectify this, the new client version is installed for all the users who called in with the problem and they are corrected one by one. Again, this does not conclude the activity with this incident as it needs to be determined why this occurred. A review of the client deployment list discovers that a recent migration of users to a new set of VDIs inadvertently removed them from the deployment list for this application. This is corrected to make sure future deployments do not experience the same problem.
11.4 Web-based

The web-based application architecture is typically 3-layered and 3-tiered in its basic form with a browser on the client machine accessing a web/application server, which subsequently accesses a

 

database server. Modern web-based applications include sophisticated client-side functionality using HTML5, JavaScript, and frameworks such as React.JS & Angular to provide a rich GUI experience. The client-side code is downloaded from the webserver when the site is hit by the browser, eliminating fat-client deployment issues. The web-based application architecture solves some of the problems inherent with the client-server architecture, such as:
-  Remote client performance – Because the chattiness between the business logic and the database can remain co-located in a data center, there is no need for a VDI to support remote clients.
-  Client-side simplification – Because all you need on the client is a browser, dependencies on an OS, OS version, or database driver configuration disappear, as do issues with the deployment of new releases; NOTE: Applications may still have browser-specific issues.
While the standard web-based application architecture with 3 tiers is fairly simplistic, in practice it can get complicated to allow it to scale, add redundancy, and ensure security. This leads to an n-tier physical implementation. Additions can include:
-  Load balancers may be added to scale & add redundancy to the webserver layer.
-  Web and application server functions may be separated for security.
-  Firewalls, WAFs, and 3rd party authentication products may be added to secure the web layer.
-  GSLBs may be added to provide location redundancy.
-  CDNs may be added to provide optimal global performance.
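When a chain like this misbehaves, a quick way to see what a client is actually getting back through the load balancer/WAF/CDN path is a simple HTTP probe from the command line (the URL is a hypothetical placeholder):

    curl -skI https://app.example.com/

Here -s silences progress output, -k skips certificate validation for testing, and -I requests headers only, letting you check the HTTP status code plus any server, cache, or routing headers the intermediate tiers add to the response.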

The below is an example of a more complex n-tier web-based application architecture pattern:

 

Anecdotal Story – Big client can’t use the nice new application
You’ve rolled out your new web-based application to a new client, but your application is failing to load on their browsers. You ensure that you are allowing them through your firewall and even see traffic from their IP addresses come through to your web and application servers. You also confirm they are being successfully authenticated. You even test some of their userids yourself and they appear to be working fine. The other few clients using your new application are not having any problems. What are some possible next steps to take?

This is a good example where using the browser developer tools is a good start. You do just this and immediately realize something a little odd. Their Chrome browser’s developer tools interface looked different from yours. You then look at the version of Chrome and realize that it was quite old – almost 2

years old. When you query about this, the client explains that this particular department had not been able to upgrade their Chrome browser due to an issue with another application once they went past a certain version of Chrome. You ask them to try Firefox instead and your application begins to function as expected. Later testing confirmed that your new application implicitly required at least a fairly recent version of Chrome due to the client-side functionality

 

employed. The key takeaway from this incident is that when deploying applications to be consumed by entities outside of your company, it’s important to understand the browser type and version requirements to avoid issues.
11.5 SOA and Microservices
Service-Oriented Architecture (SOA) is a style of software design whereby you break up the business logic portion of your application into smaller independent portions that communicate with each other. A service has 4 properties according to several definitions of SOA:

-  It logically represents a business activity with a specified outcome.
-  It is self-contained.
-  It is a black box for its consumers.
-  It may consist of other underlying services.
Microservices architecture is a modern interpretation of SOA. SOA & Microservices are multi-layered application development approaches where the business logic responsibility can be distributed broadly. Each “service” can utilize multiple tiers & layers, making the identification of tiers & layers for the entire application quite complicated. It is often simpler to focus on the tiers & layers of similarly deployed services, while also documenting the

 

relationship between the services. Putting together a complete picture of a SOA or Microservice-based application can get overly complex as shown below, and the needed details (DNS Names, IP addresses, VIPs, Database names, etc.) can grow to an unwieldy amount. Thus, from a troubleshooting perspective, documenting the flow of the particular process/function having a problem is the key as opposed to focusing on the entire application. Also, as mentioned before, application instrumentation tools become invaluable when deploying this kind of architecture. The below diagram is a limited example of what a SOA and/or Microservices based application can look like if mapped out:

Anecdotal Story – Where’s the slowdown in all of this?
One morning, you receive a call that users of a critical Internet-facing web application are receiving sporadic errors stating that their transaction could not be completed at this time. There are multiple application development and support teams involved because this is a SOA-based application. The

web, application, and database servers have been checked and none seemed constrained. What do you do next in terms of the troubleshooting process?

 

As always, asking if there were any recent changes is the best way to start. While there was a new release of some supporting services, this occurred well over a week ago and performance had been fine since then. There had been no changes since to any related component as far as anyone knew. The next question was whether there was an instrumentation tool available for this application and the answer was yes, so this immediately became the next avenue of investigation. Upon reviewing the instrumentation tool, you can focus on the transactions that took longer and notice that with all of them, there was a particular service call that was taking the majority of the time. The focus then shifts to the application servers and database server supporting this sporadically slow web service call. The web service call was supported by 5 application servers and one of them appeared to be CPU constrained (100% CPU utilization). You remove the constrained server from the load balancer as the other servers could easily handle a 20% increase in work. Once this is completed, performance stabilizes, and the user complaints end. However, your work was not done.

The application server that is now out of the load balancer pool has to be analyzed. The process is still at 100% utilization even though no more requests were coming to it. A look at the ITIM tool shows that this server had been at 100% utilization since late the night before even though there was barely any activity overnight. Three memory dumps, one minute apart, are taken and sent to Microsoft for analysis. Then, the application pool was

 

recycled. After the recycling, the server’s CPU utilization returned to near 0 and the server is added back into the load balancer pool. The application remains stable. The analysis of the memory dumps indicates that a framework being used within the application was taking up all of the CPU. Some investigation finds that this framework has a known bug whereby, under certain rare conditions, several threads can go into CPU loops. The application development team quickly obtains the new version of the framework, which corrects this bug, implements it in their code, and deploys it to production to avoid recurrence.
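For reference, memory dumps like the three described above are commonly captured with a tool such as Sysinternals ProcDump. The line below is a sketch that mirrors the timing in this story and assumes the constrained process is an IIS worker process (w3wp.exe):

    procdump -ma -s 60 -n 3 w3wp.exe

The -ma switch writes full memory dumps, -s 60 spaces them 60 seconds apart, and -n 3 takes three of them – exactly the kind of evidence a vendor needs to determine where the CPU is being burned.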

11.6 Cloud-Native
A “cloud-native” application is a program that is designed specifically for a cloud computing architecture. While there is no industry agreed-upon definition, this is usually interpreted as an application able to execute by strictly utilizing PaaS (Platform as a Service) offerings. In other words, no operating system access is needed for any of its components. The code can utilize any of dozens of platforms (.NET, Java, Containers, Functions, etc.) while the data is also PaaS-based whether an RDBMS service like AWS RDS or Azure

SQL or No-SQL offerings or perhaps just cloud-based blob storage. Cloud-Native applications are more often than not Microservices based, but they do not need to be. The key is that they can utilize PaaS services without needing to rely on accessing the underlying operating system. Applications that require an IaaS component (even though IaaS is a cloud computing offering) would not be considered “cloud-native”.

 

A prominent and quickly growing type of Cloud Native application is one that utilizes Serverless architecture or FaaS (Functions as a Service). The primary FaaS platforms today are AWS’ Lambda Functions and Azure Functions. Architecturally, FaaS-based applications are simply a specific implementation of Microservices. Behind the scenes, these FaaS offerings typically utilize Containers. Multiple programming languages are supported by these FaaS offerings.
From a troubleshooting perspective, with Cloud Native applications the following should be considered:
-  You have no control over or visibility to the underlying infrastructure (e.g. CPU, RAM, I/O, etc.). Thus, vendor support is critical.
-  Application instrumentation and telemetry (how long things are taking) becomes critical as well for you to be able to detect when and where issues are occurring.
-  Becoming familiar with limitations like per-call execution time, soft limits based on the class of service purchased, etc. is of utmost importance as your cloud provider will throttle your usage as you reach these thresholds, causing performance issues.
-  Latency & “Cold starts” are a common problem with some of these services, especially if lightly used, because the cloud provider will shut them down to save on costs. Periodic automated calls to these services can avoid this situation (see the example after this list).
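As a sketch of the “periodic automated call” idea, a simple scheduled probe is often enough to keep a lightly used function warm. The URL below is a hypothetical function endpoint:

    # crontab entry: call the function every 5 minutes so the provider keeps an instance running
    */5 * * * * curl -s https://myapp-func.example.net/api/ping > /dev/null

Many FaaS platforms also offer built-in scheduled triggers or premium/provisioned tiers that address cold starts without an external caller.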

 

11.7 Business to Business Integration
Most Enterprise IT shops need to support integration with other companies, whether it be clients, customers, or vendors. There are several dimensions to such integrations, and each has implications when it comes to troubleshooting issues. The primary dimensions are as follows:
1)  Type of integration needed, such as:
a. Access to an internal application(s) and/or reports
b. Method of access (VDI or web-based) is a sub-component of this
c. Access to service(s) (e.g. RESTful or SOAP web services)
d. Ability to share files (typically via FTPS or SFTP)
e. Access to a phone system (typically only needed for outsourced vendors)
2)  Type of connectivity needed or decided upon. There are typically four options as follows:
a. Use the Internet with TLS encryption
b. Same as option a, but with IP filtering as well (e.g. only certain IP addresses allowed to access your exposed services). This means that if the client changes their IP address, they cannot get in.
c. VPN – Private, encrypted tunnel over the Internet. The sharing of keys is needed to set this up.
d. Private connection (e.g. MPLS)
3)  Method of authentication (please note that this is not a complete list):
a. Userid/password – You could simply provide your business partner with a userid and password and maintain these

carefully in some repository. This would only make sense if you only needed to share a small number of userids with your business partners (perhaps one per partner).
b. Federation – If your business partner needs to have a subset of their personnel log into one of your applications or vice versa, then a common approach is federation. With federation, the entity owning the application “trusts” that the other entity, who owns the users, will ensure that their users

 

are properly authenticated. In other words, the business partner’s user logs into their own systems and then presents a token via SAML, OAuth, or some other mechanism to the owner of the application, who trusts that this is the user claimed by the token. These tokens can also include other information like what authority the user should have within the application.
c. Multi-factor – Options a or b can be a part of a multi-factor authentication mechanism where the second factor is typically a code sent to your mobile device.
d. Certificates – In some cases, a client certificate can be provided to the business partner for web service calls (see the example after this list).
e. Tokens – This is mostly for web service calls as well, where a temporary token is provided after some other kind of authentication occurs.
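As an example of verifying the client-certificate option above, a mutual-TLS handshake against a partner’s endpoint can be tested directly with OpenSSL. The hostname and file names are hypothetical placeholders:

    openssl s_client -connect partner-gateway.example.com:443 -cert client.crt -key client.key

If the handshake completes and the service responds, the certificate exchange itself can be ruled out when troubleshooting a failing web service call.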

 

Chapter 11 – Review/Key Questions

1)  What are the differences between application tiering and layering?
2)  What are the characteristics of the terminal emulation application architecture pattern?
3)  What are the potential problems with the client-server application architecture pattern?
4)  Which issues of the client-server application architecture pattern do VDIs address?
5)  Which issues of the client-server application architecture pattern are addressed by the web-based application architecture pattern?
6)  How is it best to troubleshoot SOA & Microservices-based applications?
7)  What’s the key feature of the cloud-native application architecture pattern?
8)  From a troubleshooting perspective, what are some of the things to look out for with cloud-native services like FaaS?
9)  What are the key elements to consider with business to business integration?

 

Chapter 12 – Troubleshooting Techniques and Examples

Now that we’ve covered most of the technologies in use within Enterprise IT environments, we will try to put it all together with one final chapter that’s almost entirely dedicated to walking through how to apply the troubleshooting techniques discussed in the book using real-life, complex examples of Enterprise IT problems. In this chapter, we will also introduce one final troubleshooting technique that the authors have found valuable over the years called “Follow the packet”.
12.1 Follow the Packet Method
When troubleshooting a problem that involves a complex infrastructure and/or application pattern, it’s often necessary to step through the path of the “packet” as you would step through lines of code when debugging a program. We call this the “Follow the Packet” method. This is not a recognized industry term but provides a nice visual and comes from years of firsthand experience. So, how does this work? Consider the below infrastructure.

Let’s say that there is a failure showing up on the client that is of unknown origin. To identify the problem, you follow the servers, network equipment, and platforms that the application request is supposed to traverse when executing the failing function. You then check each to see if there is an error at any level and/or confirm that the request/packet traversed this far and was

successful. Now, let’s walk through this process in some detail using the numbers provided in the diagram above:
1)  On the client itself, identify the URL requested when the failure occurs and the destination IP address the client used (see the example after this list). Tools: nslookup, ping, browser developer tools
2)  Did we see the packet traverse the firewall? Was it blocked? Is the firewall overloaded? Tools: Packet capture or logging on the firewall, firewall monitoring tool
3)  Which load balancer is active? Any load issues? Is there session stickiness for this pool? What mechanism is used for the session stickiness? What’s the load balancing method and health check method? Are all the pool members active? Can we validate that the packet traversed the load balancer? Tools: Load balancer configuration UI, load balancer logs
4)  Which web server did the request/packet use? Any constraints? Any errors in the web access logs? Any successful HTTP requests? Which authentication server is used? Any problems reaching the authentication server on whatever port it’s using? Tools: Instrumentation tool, ITIM tool, local OS tools & logs, web access logs, PortQry, packet capture – if applicable
5)  Any errors on the authentication server? Any load issues? Tools: ITIM tool, OS tools & logs, authentication product’s log, packet capture – if applicable
6)  Any errors on the application server? Any successful transactions? What’s the connection string used to connect to databases? Any issues connecting to the database server on whatever port it’s using? Any telemetry on the service being used to indicate where the failure occurs or does it succeed? Tools: Instrumentation tool, ITIM tool, local OS tools & logs, application & web access logs, packet capture – if applicable
7)  Any problems with the database server? Any blocking, deadlocks, or poor response? Tools: local OS tools & logs, database scripts to check for blocking and/or long-running queries, database logs, packet capture – if applicable
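As a small illustration of step 1, the client-side facts can usually be gathered in seconds from a command line (the URL is a hypothetical placeholder):

    nslookup app.example.com            # which IP address is the client actually resolving?
    ping app.example.com                # basic reachability, if ICMP is allowed
    curl -vk https://app.example.com/   # shows the TLS handshake, HTTP status, and response headers

With the resolved IP address and the observed error in hand, you know exactly which path to follow through the rest of the diagram.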

12.2 Problem #1 – No one can log in!!
Now, let’s start to tackle some complicated and/or tricky problems. Consider the following internal (not Internet-facing) application/infrastructure diagram and steps taken to log into the application (note that these steps align with the diagram following them):
1)  User accesses load balancer
2)  Load Balancer chooses application server and sets cookie-based session affinity
3)  The application server redirects the user to a cloud-based service
4)  Cloud service, which uses AWS to host its services, challenges the user to authenticate, validates credentials, and replies with an authentication token to the client
5)  The client sends the authentication token to the load balancer, which passes it along to the designated application server, which happens to be IBM WebSphere Application Server
6)  Application server issues a REST API request to the cloud service for the authorization associated with the authentication token provided
7)  Cloud service replies with an authorization token to the App server to know what kind of access the user should have
8)  App server serves up the appropriate page to the client
One morning, you come in and no users can log into your application. In fact, there are several applications experiencing issues trying to log in with your cloud-based provider. However, many applications are fine. You get a hold of your provider and are given the following information:

-  They are seeing the requests come in from the client (authentication), but not from the server (authorization)
-  It started the night before, but not for all servers, as many are working just fine
This is then confirmed by the developers as they are seeing timeouts calling the REST API to obtain the authorization token. Below is an updated diagram with the failing steps noted in red. What do you do next?

 

As with any troubleshooting effort for an unknown problem, asking some basic questions helps to inform as to what to do next. Here are some relevant questions and their answers:
-  Were there any relevant changes made to any components involved? This question applies to the application development team, internal infrastructure personnel, and the cloud-based authentication service since it’s calls to their service that appear to be failing. – The answer from all parties was “no”.
-  When did the problem start? – The first calls came in at 8 AM when people came into work, but the vendor started seeing symptoms (no requests from servers) the night before.
-  How’s the CPU & RAM on the application servers in question? – Normal, which is well below 50%.
-  What’s common about the applications failing and those not? – Most failing apps are WebSphere-based, but there was also one IIS application that now seems to have started working. Also, most failing applications are using the default login screen, but one isn’t (most applications use the default login screen, so this could be a red herring).
-  What’s the DNS Name and IP of the cloud service? – Acme.auth.com, 18.209.113.161-163
-  Is the firewall blocking the WebSphere servers’ calls to acme.auth.com? – A packet capture is performed on the firewall and the servers that are working fine are seen successfully sending/receiving packets from the cloud service IPs, but for the servers failing, no traffic to the cloud service IPs is seen.
What do you do next? This is now confusing and doesn’t seem to make sense. How could it be that the firewall is not seeing traffic to the IPs of the cloud provider (acme.auth.com) from the failing application servers? At this point, you need to validate everything you know. Since the only clue is the error provided by the developers that they can’t access the cloud provider and the firewall is indeed seeing no traffic between these devices, you need to focus your energy on determining what’s going on in the communication between the application server and the cloud provider. First, validate that the packet capture is working from the operating system. An engineer logs into one of the failing backend application servers (these are Linux servers running WebSphere) and performs the following command: telnet 18.209.113.161 443. This traffic is seen on the firewall packet capture and the handshake is successful, so the packet capture is working as expected. That proved that it worked with the IP address, but maybe there’s a name resolution issue. The engineer issues the following command: telnet acme.auth.com 443. The result is the same (it’s successful and seen by the firewall), so DNS resolution is working as expected on the Linux server. This continues to be puzzling, but it seems clear that if there is a communication problem it’s only occurring within the application. The next step is to get a capture on the application server as the user attempts and fails to log in. The engineer issues the following command: tcpdump -i

eth0 -s 65535 -w capture.pcap. The application is tested and the tcpdump is stopped with ctrl-c. The “capture.pcap” file is then sent to a PC for analysis with Wireshark. In Wireshark, you look for requests to the IP addresses of the cloud service using the following display filter: ip.addr == 18.209.113.161 or ip.addr == 18.209.113.162 or ip.addr == 18.209.113.163. Below is what was found:

 

So, this aligns with what was being reported by the firewall capture. The application server was indeed NOT calling these IP addresses, but why not? The code clearly shows it’s calling acme.auth.com and you know that these are the IP addresses that resolve to that name. Assuming that the application server is calling something that’s timing out, you look for any attempted handshakes on port 443 that had problems. You input the following display filter in Wireshark: tcp.port==443 && tcp.analysis.flags. Below is what was found:

This shows a long list of attempts to communicate with 34.203.255.208 on port 443 (SYN packets only, with no SYN-ACK replies), but what is that and why is the application server trying to communicate with that IP address? What is going on here? Well, the next obvious step is to perform a whois search on this IP address to see what organization owns it. This search shows that this IP address is owned by AWS and associated with the same AWS region as the 18.209.113.161-163 IP addresses. At this point, a theory begins to form. What if AWS changed the cloud provider’s IP addresses without telling them? This is something that we’ve experienced with Azure ourselves, so it wasn’t

a stretch. If AWS had done this, then if the application server had cached the old IP address for the name acme.auth.com (assuming it’s 34.203.255.208), it could still be trying to hit it. At this point, the most prudent approach was to bounce the application servers. If they had cached the old IP address, this would clear it. This action is performed and sure enough, the application begins to work. Also, the firewall now starts to see traffic go through it from these servers to the cloud

 

service IP addresses. As with all problems, while the immediate problem is fixed, you are not done yet. You ask the cloud provider to validate with AWS whether their IP addresses were changed. Later that afternoon, the cloud provider confirms that there was maintenance performed by AWS that evening that did indeed change the IP addresses of many of their services, including yours. Another critical question was why the TTL didn’t kick in and the cache expire, as it appears to have done on the Windows machines and other versions of WebSphere. Well, after a little research, it is discovered that there is a bug in the version of WebSphere you are currently running: if you set the DNS caching time to 0 as you had (meaning never cache DNS), the application server instead caches the DNS entries forever. A quick fix is to change the setting to 1 second to avoid the bug, but a plan is also put in place to upgrade the WebSphere servers.
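For reference, in Java-based application servers such as WebSphere, JVM-level DNS caching is generally governed by the standard networkaddress.cache.ttl security property (or the older sun.net.inetaddr.ttl system property). The quick fix described above amounts to something like the following sketch; the exact setting name and location in a given WebSphere version may differ:

    # java.security property: cache successful DNS lookups for 1 second instead of forever
    networkaddress.cache.ttl=1

    # or, as a JVM argument
    -Dsun.net.inetaddr.ttl=1

Keeping a small positive TTL lets the JVM pick up provider-side IP address changes without the unbounded caching behavior described in this incident.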

 

12.3 Problem #2 – Can’t See Image… Sometimes
Now, let’s tackle another tricky problem. As with the last problem, let’s first understand how the application works. Consider the following internal (not Internet-facing) application/infrastructure diagram and steps taken to acquire an image:
1)  User accesses load balancer using app.acme.com
2)  Load Balancer chooses app server and sets cookie-based session affinity
3)  App server challenges the user to authenticate and validates credentials
4)  App server verifies user authentication
5)  When the user wants to access an image, the app server authenticates the user against the image server related to it (e.g. image01.acme.com)
6)  Image server replies with authentication token and URL to access the image (e.g. http://image01.acme.com)
7)  App server sends token (cookie) and URL to the client. URL is of the backend image server (e.g. http://image01.acme.com)
8)  The client accesses the URL (e.g. http://image01.acme.com) while providing the token (cookie) and retrieves the image (see the example after these steps)
9)  The Image server accepts authentication and retrieves the image from the backend database
10)  The image server replies to the client with the image
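To make steps 8-10 concrete, and to give yourself a way to reproduce a single image fetch outside of the application, the client’s request is essentially equivalent to something like the following. The cookie name, token value, and image path are hypothetical placeholders:

    curl -v -b "authtoken=abc123" http://image01.acme.com/images/12345

Being able to issue one image request by hand against a specific image server is exactly what makes an intermittent failure like the one in this story easier to isolate.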

This is all working quite well, but an audit discovers a problem. Because the image servers contain some sensitive information, they need to be placed

 

behind the load balancer and a firewall such that client machines cannot access them directly. The vendor says that this is not uncommon and provides a solution that’s been used at other companies. The solution entails simply updating the configuration files to redirect the user to a new DNS name (images.acme.com) that will resolve to a VIP on the load balancer when an image needs to be retrieved. The solution is tested in the staging environment (which is supposed to be identical to production) and it works. The solution is then deployed to production, but unfortunately, about half the time, users fail to retrieve an image. Instead, the user is prompted to log in again. The login works, but the extra step is killing productivity, and yet backing out of the change is unacceptable from a security perspective. Below is the updated diagram with the same steps altered to the new reality and the problem steps noted in red. How do you work the problem? What questions should be asked?

The key is to understand where the failure is occurring. So, using the follow the packet method and troubleshooting best practice, here are some critical questions to ask, along with their answers:
1)  At which step is the login prompt occurring? It seems to be occurring when the image is called up by the user (steps 8-10). This makes sense as this is what changed.
2)  Before the change, how did the client know which image server to retrieve the image from? The application server would pass along the name of the image server (e.g. image01.acme.com) to the client.
3)  After the change, how does the client know the name of the image server to retrieve the image from? The application server now passes the DNS name (images.acme.com) that goes to the load balancer instead of the backend server name.
4)  How does the load balancer determine which image server to send the client to? Round robin.
5)  How does the user authenticate to the backend image server? The userid & password is passed along from the application server to the image server (step 5) and the image server replies with an authentication token (step 6).
So, if you simply walk through this logically, about half the time the backend image server is rejecting the user’s tokens. Why? Especially when an identical configuration works in the staging environment. It doesn’t seem to make sense. This is where, per troubleshooting best practice, you need to put aside assumptions. In this case, the assumption is that the staging environment was set up in the same way as production. You should deduce that there is a difference between the two environments. So, putting that assumption aside, you need to ask yourself another key question – are the tokens generated by each image server compatible (e.g. is a token generated by one image server valid on the other image server)? The vendor says that they should be, but you press them as to how this is enforced. The vendor explains that you simply need to make sure that the server key generated on one server is copied to the other per the installation instructions. You then compare the keys on both servers and find that they are different.

You then check the staging environment and the keys on both servers are the same. The step of copying the key from one server to the other was accidentally skipped a few years ago during the initial installation. Since there was no load balancer in front of the image servers, this mistake never caused a problem. However, now that users could be authenticated by the application server against one image server but then later be sent to a different image server by the load balancer, this old mistake came back to be a problem.
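A quick way to confirm a mismatch like this is to compare a checksum of the key material on each image server. The file path below is purely hypothetical – use whatever location the vendor’s installation guide specifies:

    md5sum /opt/imageserver/conf/server.key

Run the same command on each image server; if the checksums differ, tokens issued by one server will not be trusted by the other, which matches the roughly 50% failure rate seen here.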

 

The key is copied from one server to the other and the problem is solved. In this case, since the root-cause mistake occurred years ago, there wasn’t any specific problem management follow-up.
12.4 Problem #3 – Why is Email so slow?
Now, let’s work on a very different problem. You get a call that a critical email process is backing up; there are thousands of emails backlogged and it’s getting worse. This is a problem because the system is supposed to send a confirmation email to customers after they perform certain functions, and when the confirmation is not received, they begin to call into the Contact Center, causing hold times to exceed your client SLAs.

As usual, you first need to understand the environment, so you ask questions to create a mental picture of the situation. How does this function work and what servers are involved? Remember that you want to be able to follow the packet. Order of operations:
1)  The source system sends a message through Azure Service Bus (API call) with information about what email needs to be sent, including the template to use and destination email.
2)  A set of services called “Comms Manager” is running on a Service Fabric cluster and listening to the Azure Service Bus to pick up messages as they arrive.
3)  There are two instances of these services running (VM10 & VM04) with each executing up to 10 threads. Thus, up to 20 emails can be worked concurrently. This is typically more than enough to keep up.
4)  These services log everything to a centralized service that can be queried using Kibana.
5)  These services put the emails together by referencing the requested template from the SAN and then executing a send-mail request to a mail-relay load-balanced DNS name (mailrelay.acme.com).
6)  The load-balanced DNS name (mailrelay.acme.com) contains 4 mail-relay servers (a*****p004, a*****p005, m*****p003 & m*****p004) that validate the emails do not contain a virus and do not violate any other security protocols before sending them out.
Below is a visual diagram of the above environment:

 

So, what questions do you ask? As usual, you ask whether there were any changes relevant to any of these applications or infrastructure recently? The answer is no; there are no registered changes to the source system, Comms Manager, email servers, load balancer, or any network device in play. So, then let’s follow the packet. Can you tell if the source system is sending messages promptly to the Azure Service Bus? The developers can query the Azure Service Bus and the answer is yes, messages seem to be sent immediately into the Azure Service Bus – no delay there. So, then, are the two Comms Manager services picking up the messages from the Azure Service Bus in a timely fashion? To this, the developers answer no, the messages appear to be piling up at the Service Bus and not being picked up fast enough by the Comms Manager services. Since the Comms Manager services are not picking up messages as they used to, are the application servers that they reside on constrained? The server engineers use OS tools to confirm that CPU and RAM/paging on the VM04 and VM10 servers are minimal (actually less than typical). Since there appears to be capacity on the application servers, does launching

 

additional threads/services speed up the process? This was attempted, and launching additional threads/services does speed things up, but only for a short while; then the slowdown occurs again. Could the slowdown be further back in the process (e.g. the load balancer or email servers), something that’s not letting the Comms Manager services move quickly? The web hosting engineers check the load balancer and it does not appear to have high CPU, and other applications are not complaining. The messaging (email) engineers do not have direct access to the operating system of the mail relay appliances. Instead, they access the health of the system through a proprietary interface. This proprietary interface indicates that all is well with these appliances. Also, no other email users or applications are complaining. Is there any instrumentation or telemetry available from the development side? The development team provides information showing the slowdowns in their log. The emails take anywhere from milliseconds to minutes to send and both servers (VM04 and VM10) are affected. The interface to the log is provided below:

As can be seen, emails from both servers are taking seconds to complete, and with hundreds being sent every minute, there’s no way to keep up at this speed. The key is figuring out where the bottleneck is occurring. At this point, the possibilities still include:

 

The Communications Manager code itself has some kind of bug
The Service Fabric servers (VM04 & VM10) themselves have some undetectable issue
The SAN is slow or having sporadic issues
The network between the Service Fabric servers and the Load Balancer has an issue
The Load Balancer itself has some undetectable issue
The network in between the Load Balancer and the email servers has an issue
The email relay servers have some issue that’s not showing up on the proprietary monitor
The connection from the email servers to the Internet has an issue
How do you reduce the possibilities? You need more information/evidence. How do you get more information? There are several options, but looking at the communication streams between systems with a network capture is likely as good a start as any. But where do you put the capture? Since the mail relay appliances are not easy to get into, the easiest place would be one of the application servers – you choose VM10 and collect over a minute of traffic before stopping the capture. Ok, now that you have a network capture from VM10, what do you look for? Well, the developers also shared some other logging that they collect that shows the email address of each request. Thus, you could simply look in the Wireshark capture for these email addresses to see if these TCP conversations experienced any issues (e.g. duplicate packets, retransmissions, etc.). Below is a screenshot of the first email send conversation:

This conversation clearly shows that there were no network layer issues (e.g.

 

duplicate packets, retransmissions, etc.), but there was a significant delay (28 seconds) after the TCP handshake. The delay was from the server: the SMTP protocol requires the server to begin the conversation with a 220 response code when it’s ready, and the server took 28 seconds to supply this response code. You can also see the server name (a*****p004). Lastly, while a lot less impactful, there were two other noticeable delays in the conversation, and both were caused by the client waiting for the server to respond. OK, so what does this tell you? It tells you that the problem appears to be the mail relay appliances, but you know that some emails complete in milliseconds, so why are some taking longer when the mail relay appliances seem fine and no one else is complaining of email delays? Well, now that you have this trace, you might as well look for other examples of slow emails. Below is the next example:

This next example follows the same pattern. Curiously, the mail relay appliance is the same one as in the first example, even though there are supposed to be four of these appliances evenly balanced. You check two more slow emails and they are all going to the same appliance. So, what’s going on? Is this appliance the problem or are all requests going to the same

appliance? You decide to search for the other appliances and there are requests in the capture that go to the other 3 appliances, but all of those are completed in less than a second! It appears that you may have your culprit – it’s the a*****p004 appliance. You eliminate this appliance as a pool member from the load balancer VIP pool and sure enough, requests begin to speed up, and in fewer than 30 minutes the backlog had cleared completely.
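Incidentally, the pattern you spotted by clicking through conversations can be confirmed across the whole capture in one command with tshark (Wireshark's command-line counterpart). This is only a sketch: the capture file name is made up, and it assumes a standard Wireshark build where the SMTP and TCP fields below are available.

# For every SMTP 220 greeting in the capture, print the TCP stream number, the
# relay that sent it, and how long after the TCP connection began it arrived.
# The healthy relays answer in well under a second; the bad appliance takes 10-30 seconds.
tshark -r vm10-capture.pcap -o tcp.calculate_timestamps:TRUE \
  -Y 'smtp.response.code == 220' \
  -T fields -e tcp.stream -e ip.src -e tcp.time_relative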

 

While you have solved the incident, your work is not done. How come the appliance monitoring tool did not identify this issue? What is wrong with this appliance? Why did it start now? How come other systems/users didn’t experience this issue? Etc. It’s important to fully understand what happened to be able to avoid it in the future. So, after some research, it is identified that this appliance was being used for a mass email campaign. In fact, the email campaign asked to use just this one appliance to avoid impacting everyone. Since the 3 other appliances usually run well below 50% capacity, having one get a little overwhelmed should not have been a problem. Well, it wasn’t for most applications because they send very few emails, so a delay of a few seconds with every fourth email did not cause them to backlog. However, the volume of this particular application was such that a delay with every fourth email was a problem and created a significant backlog. But why would that be the case if the other 3 appliances had plenty of capacity? For this, you have to think through the process a bit more deeply. Remember that the load balancer is set to round robin, which means that every fourth request will hit the bad appliance and be delayed for about 10-30 seconds. Also, note that there are 20 threads sending emails and the emails usually take less than a second to be sent. So, after attempting to send out the first 20 emails, 5 of the 20 threads (every fourth thread) are now waiting and yet only a few seconds have passed. After the next 20 emails, which only 15 threads try to handle, 5 more threads are hung. After just 80 emails and a few seconds, all the threads will now be waiting. Of course, after 10-30 seconds or so the initial emails to the bad appliance would begin to finish, but only 3 emails would be able to be sent by each thread before it would get hung up again. Unless the volume is high, as with this application, this kind of delay would not have an impact, but given the volume, there was a significant impact.
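The immediate fix above was to pull the misbehaving appliance out of the VIP pool. The exact mechanics depend on the load balancer product in use; purely as an illustration, on an AWS Classic ELB (the load balancer type used in the Clouducate labs later in this book) taking a backend out of rotation and putting it back would look roughly like this, with made-up load balancer and instance names:

# Take the suspect backend out of rotation (names and IDs are placeholders).
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name mailrelay-vip --instances i-0abc1234def567890

# Once the appliance has been fixed, return it to the pool.
aws elb register-instances-with-load-balancer \
  --load-balancer-name mailrelay-vip --instances i-0abc1234def567890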

What about the proprietary monitoring interface for the appliance, why did it not alert on the delays? Well, the appliance’s monitoring was not designed to measure how long it takes to send out an email. Instead, it was designed to indicate if there was a throughput issue, and there wasn’t. In fact, this appliance was pushing out more emails than any other appliance on a per-second basis. It’s just that the vast majority of the emails were for this special campaign and not regular processing.

 

So, the takeaways of this incident were as follows: 1)  Large email campaigns should be registered as a change in the system as this could have shortened your analysis time. 2)  When performing such large email campaigns, it’s not just a good practice to send all the traffic through one appliance, but also to take this appliance off the load balancer temporarily. 3)  Reach out to the vendor for improved monitoring/alerting for these appliances such that any constraint (e.g. high CPU, paging, a slowdown in average email send time, etc.) is detectable and even generates alarms.

12.5 Problem #4 – System keeps freezing, but where’s the problem?

You get a call that an application is on occasion freezing. When this occurs, the problem often rectifies itself, but at other times the recycling of IIS is needed to correct the issue. Whenever this problem occurs, the application is non-responsive to all web users and yet batch applications continue to execute without a problem. This has been happening on occasion for a while but has been getting worse. How do you start?

As before, understanding the environment is key to getting started. After some queries, the below diagram is created:

 

Now, what questions do you ask next? As usual, it’s important to consider whether any changes could induce this problem. However, the problem appears to occur in the middle of the day and no changes are ever scheduled for that time of day. Server engineers are asked to see if there are any constraints (CPU, RAM, I/O) on the database, application, tokenization, or authentication servers whenever this problem occurs. The answer is no; in fact, utilization is lower than normal when the problem occurs. Are there any errors in the system (event viewer) logs on any of the servers?

No, nothing out of the ordinary was found. Could there be blocking in the database when the problem occurs? During the last two incidents, no blocking was detected in the database. Could there be any network issues during these problem events? Network engineers note that none have been noticed and no other applications have had issues during these times. Is there any application instrumentation available? Yes, two types:

 

The application is monitored by Dynatrace, which provides a good history of each request.
The application is actually a purchased product and it includes very detailed application logs.
Are there any issues or delays with the batch processing, which occurs all the time? No, batch processing continues without interruption or delay, and the batch server uses the same application server. Are there any web users not impacted? Can any users do anything? All users are impacted and cannot perform any function when this occurs – their browsers do not respond at all. What do you do next? Since there do not appear to be any system-related issues, this is a case where application instrumentation is critical and luckily, there are a couple of good avenues to follow – Dynatrace and the application logs. So, you start with Dynatrace, and below is the first clue:

The diagram is displaying all the transactions served up by the application server from just before 11:30 AM to 3 PM. At first glance, it appears that performance was fine throughout this period except for a very brief period of time around 1:40 PM. However, upon closer inspection, there’s something else that is clearly out of sorts. The average number of transactions per minute seems to always be in the range of several hundred, but between about 12:20 PM and the time of the large spike in performance at 1:40 PM, the number of transactions per minute is very low. This makes sense since all

 

the users claimed they could not work during this time. Thus, all of those transactions that were completed during this time must have been from the batch server. So, this is interesting data and worth a closer look. If you look at all the transactions that end during this spike in performance at around 1:40 PM, you notice that they all end at about the same time. You will note below that the transactions that started sooner, took the longest. This would make sense as if the entire system was frozen, then whenever a user tried to do something from that point on, that action would have simply spun and waited. Then, whatever ended to correct the situation seemed to have done this for all transactions at the same time, so the ones that had been spinning the longest started the soonest.

While this is interesting, where do you go from here? Well, since nothing

seems to have been completed after the longest transaction started at about 12:18 PM, then looking more deeply into this transaction seems to make sense. Clicking on this transaction within the Dynatrace tool provides the detailed screenshot of this one transaction below:

 

While the details of this transaction do not reveal a great deal of information, a query parameter does provide a clue. The SID query parameter is likely referring to a SessionID and something that can be searched in the application log along with the date and time. All of this provides some general information as to what happened, namely that after this transaction began nothing worked for over an hour until this same transaction ended. Still, there’s not enough to formulate a plan forward. So, you need to look into the application log for more information. As you start to review the application log, you need to remember the best practices as they relate to logs. The first of these is to synchronize the time. The application server is in the Central Time Zone, while the data in Dynatrace was in the Eastern Time Zone, thus you need to subtract one hour from the Dynatrace timestamps to have them match what you will find in the log. This means that the transaction you are interested in would appear at about 11:18

in the application log. After a brief search, the entry in the log is found as shown below:

Aug 22 2019 11:18:23.614  7456  3064  CoeWareBaseInit, QueryString: Fascst1=4&Key0=8&Key2=388218&Key4=285&Key6=345808&Key8=345808&...

You will note that the timestamp is about right and the SID, as well as the rest of the query string, matches that of what was seen in Dynatrace.
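As a quick sanity check on the time zone math, GNU date on a Linux workstation can do the conversion for you. This is just a sketch; it assumes GNU coreutils date and standard zone names.

# Convert 12:18 PM Eastern (the Dynatrace timestamp) to Central (the server's log time).
TZ="America/Chicago" date -d 'TZ="America/New_York" 2019-08-22 12:18:00'
# Prints roughly: Thu Aug 22 11:18:00 CDT 2019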

 

A review of the application’s documentation indicates that the two numbers after the timestamp (7456 and 3064) are the PID and Thread ID of the application. A look back in the log from before this transaction shows that there are multiple entries per minute (sometimes per second) for this combination of PID and Thread ID. In fact, this PID and Thread ID seem to be serving all of the web-based application users. There are other Thread IDs, but only one PID. The other Thread IDs seem to be serving the batch jobs. Armed with this information, you then look for the next message for this combination of PID and Thread ID, but there is no other entry until 12:41 as follows:

Aug 22 2019 12:41:10.114  7456  3064  1  Exception ID:1271085297
Aug 22 2019 12:41:10.114  7456  3064  1  Exception caught and logged. Stack for exception:

The next few lines include a stack trace. After that, the flow of entries for this PID and Thread ID combination begins again as normal, with quite a few every minute. So, what can you surmise from this? Given that there are no entries in the log for this PID and Thread ID combination during this window, and this aligns with users getting no response during this time, it appears that while the PID is multi-threaded, there’s only one thread dedicated to the web users, and if a request gets stuck, then all users must wait for that request to finish. So, what can you do about this?

 

The first step is to relay all of this information to the vendor to see what their opinion of the situation is. They responded that the request that hung the system was a report with parameters that were too broad and thus would take too long to complete. This is a known flaw in the system in that a user can request a report such as this and freeze all operations. So, what’s the remedy? There are three remedies as follows: 1)  User Training - Users need to be made aware that when running certain reports, they need to provide limiting criteria. 2)  Increase the number of processes for the App Pool – As you noted

when reviewing the log, there was only one PID and only one Thread ID within that PID supporting the web users. So, you can simply increase the number of processes for the App Pool such that if a user happens to run such a report again, work can continue with other processes. 3)  Add another application server and place it behind a load balancer – Again, since the problem is that there is only one thread that serves web users, adding another application server provides another process as

 

well as adds more redundancy to an application with a growing number of users. An updated diagram is shown below:

12.6 Problem #5 – Broken Portal… and no background

You get a call that an old portal used by a fairly important client is failing – they are receiving 500 errors when attempting to log in. There is a project ongoing to replace this portal. Unfortunately, very little is known about this old environment, so you need to figure it out yourself by questioning those on the call. How do you start? As usual, the first question should be whether a change has been made

recently? The team replies that yes, a change was made the morning prior, but they don’t think it could be related because there were no problems all day yesterday. Also, they explain that the clients use the URL http://portal.acme.com, which is what is now failing, to access the portal from the Internet and that was not changed. Internally, the business users test with a different URL (http://intra-portal.acme.com), and this DNS was changed to a new VIP on the internal load balancer that points to a new set of servers that they will be migrating to at the end of the week. Also, this new internal URL is working fine. Regardless, out of an abundance of caution the change

 

was backed out and intra-portal.acme.com was changed to point to the old servers once more, but the problem persists. Since the goal is to use the “follow the packet” method, it’s important to ask about each layer that a request would traverse. You start with the DNS names of the portal (portal.acme.com) and use nslookup to find the external IP address that your client is using. The IP address is part of the subnet used for load balancer VIPs, so your Web Hosting engineer can now look for this VIP in the external load balancer’s configuration. Once the Web Hosting engineer finds the VIP configuration, he notes that there are three pool members and they are Apache servers. He also notes that these Apache servers are protected by the old authentication servers and there is a new external VIP pointing to new Apache servers ready to go for this weekend, but no change has been done yet to impact this. So, this is a good start, and given the previous information about the internal load balancer VIP already provided as well as the database server that is not planned to be changed, you can put together the below diagram:

However, you have not completed “following the packet” because you don’t

yet know how the packet gets from the Apache servers to the internal load balancer VIP. So, you ask the Web Hosting engineer to look at the configuration files on these old Apache servers to see how this virtual host is defined and the following is discovered:

ProxyPass "/" "http://intra-portal.acme.com/"
ProxyPassReverse "/" "http://intra-portal.acme.com/"
ServerName portal.acme.com

 

This means that when a request comes in for the name “portal.acme.com” it is passed along to “intra-portal.acme.com”. This is a big find! It makes it clear why the change to intra-portal.acme.com broke the external DNS name as well. When requests come in from the Internet to portal.acme.com they are eventually routed to intra-portal.acme.com, so a change to the inside impacts the outside. This explains the 500 error perfectly because the old Apache servers use different authentication servers than the new internal IIS servers (remember that the plan was to move to the new Apache servers later in the week). Since the users are authenticating with one set of servers and then hit a backend that expects something different, they fail. Also, it’s easy to see why the internal users aren’t impacted: they only use the internal VIP, so they never mix and match incompatible authentication servers. While this explains a lot, there’s still the mystery as to why the problem did not immediately occur when the change was done the day before and why it did not rectify itself after the change was undone. It seems like there’s a delayed reaction to the changes, but why? Well, since the changes are DNS changes, DNS caching must be the first suspect. At this point, the Web Hosting engineer tells the team that there is a known issue with DNS caching persisting for hours regardless of the TTL on these old Apache servers. And that’s the last clue to resolve the mystery. You ask the Web Hosting engineer to bounce the Apache servers and once completed, the problem is resolved. As usual, the work is not yet done. Problem Management demands that you ask the question: how could this have been prevented or mitigated? The only key takeaway from this incident is that before making any change to an environment, especially an unfamiliar one, the environment must be fully

understood. If you had mapped out the environment before the change, it would have been clear that this change would impact the client and needed to wait, or better, be done differently if testing of the new servers was needed. For example, a completely new and temporary DNS name could have been created (e.g. new-intra-portal.acme.com) and pointed at the new internal load balancer VIP for internal testing, as opposed to touching the currently in-use name.
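A lightweight way to map this kind of DNS dependency before (and after) a change is to resolve every name in the chain and, when a reverse proxy like Apache is involved, force it to re-resolve its backends. The commands below are a sketch using the names from this story; apachectl may be wrapped by a service manager on a given server.

# What the Internet-facing name resolves to (should be the external load balancer VIP).
nslookup portal.acme.com

# What the internal name resolves to right now (the old reverse proxies forward to this).
nslookup intra-portal.acme.com

# The "bounce" from the story: a graceful restart makes the old Apache proxies
# re-resolve intra-portal.acme.com instead of using their stale cached answer.
apachectl graceful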

 

12.7 Problem #6 – Another Portal… another mystery

You get a call that another old portal used by a fairly important client is failing. There is a project to replace this one as well, but it’s being replaced by a website the client is implementing, so there is no team working on it; however, the environment is well known to you as it’s identical to the other portal. Users are getting a 500 error after they log in – again. What questions do you ask?

The obvious first question is whether anything has changed, and the answer is yes again. However, this time the change was not just to this one portal, but to a large number of applications that all share the same VIP. The change was a required security change to place a WAF (Web Application Firewall) in front of any external-facing application that could include sensitive information. Everyone suspects that this change broke the application, but it’s unclear how. The WAF has active monitoring that displays “blocks” – that is, any time the WAF actively impacts the application due to a perceived threat. However, there have not been any “blocks” triggered, so no one understands why this is failing. Backing out the change has been considered, but given that this protection was put in place for hundreds of websites and only this one is having a problem, it’s preferred that this be solved. Since you are familiar with the environment now, you map it out again but include the change (in red below). What do you do next?

 

Using the Follow the Packet Method, you need to isolate the HTTP request that gets a 500 error and see where it’s occurring and what you can glean from it. With that in mind, you start with the IIS servers to see if the 500 errors are being generated there and you discover that indeed they are present in the logs. The 500 errors are HTTP GET requests on an ASPX page, but the developer claims that this particular ASPX page only accepts HTTP POST commands, so a GET would get a 500 error just like this. So, how could adding a WAF in front of the website cause what should be a POST to change to a GET? You then search the log to see if there are any POSTs for this particular ASPX page and find that since the WAF went into place, not a single POST has reached the IIS servers for this ASPX page. However, before the WAF was in place, all the requests associated with this ASPX page were HTTP POSTs, not GETs. Given this, what could you do next? At this point, there’s not much more to glean from the IIS logs. You need to go to the other side of the transaction to get more information – the browser. If HTTP GET requests are being sent to this ASPX page, they need to be coming from the browser, so you need to understand why. You start the developer tools and reproduce the error. You see that there is indeed an initial HTTP POST for this ASPX page, but the response to that POST is an HTTP 302 Redirect and the redirect tells the browser to issue a GET to the same ASPX page. Bingo, there’s the problem! For some reason the WAF is

intercepting the original HTTP POST request and redirecting the client to issue a GET request instead, but why? Also, the POST has a bunch of information in the body of the request that’s needed to pass on to the application, but all of that is lost with the GET request as it has no body to it. So, now the question is why is the WAF doing this? Closer inspection of the original HTTP POST shows that the URL of the request is http://portal.acme.com/setup/submit.aspx. However, all the other requests were HTTPS, not HTTP. The security engineer is told this and then

 

realizes that there is an HTTP to HTTPS redirect in place. This redirect is only expected to be needed with the initial request to the site, so the redirect forcing a subsequent GET was not expected to be a problem. This appears to be why the WAF is issuing the redirect – simply to change HTTP to HTTPS. So, this has gotten us closer, but why is the application using HTTP instead of HTTPS at this point? Relative pathing of the HTML links should have made this an HTTPS request. The developer inspects the code and realizes that the URL is hardcoded with an absolute path, not relative pathing – this is why it’s HTTP, which is a bad thing over the Internet. The security engineer disables the HTTP to HTTPS redirects on the WAF as a short-term fix. Meanwhile, the developer quickly works to change http://portal.acme.com/setup/submit.aspx to simply /setup/submit.aspx such that security can re-enable the redirect. While the problem is now fixed, problem management demands that you look to see how this could have been avoided. Two solutions are provided: 1)  The development team begins work on implementing a source-code validation tool to look for violations of best practices like the use of absolute pathing. 2)  Another gap was that this particular portal did not have a working non-production version accessible over the Internet for testing before implementing the WAF. This must be a requirement going forward.
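This browser behavior is easy to reproduce from the command line with curl, which, like a browser, downgrades a followed 302 from POST to GET. A sketch using the URL from this story and a made-up form body:

# -v shows the WAF's 302 response; -L follows it.
# Note how the second request becomes a GET and the POST body is dropped.
curl -v -L -d 'field1=value1&field2=value2' \
  http://portal.acme.com/setup/submit.aspx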

 

Chapter 12 – Review/Key Questions

1)  What’s the basic premise behind the “Follow the Packet” method of troubleshooting? 2)  If something that was working broke all of a sudden, what’s a good first question to ask? 3)  When you begin to troubleshoot, why is it good to put together a diagram of the components involved? 4)  Why is understanding which users or functions are impacted important early on in the process? 5)  How can the instrumentation of the application help in the diagnosis of the problem?

 

Chapter 13 – Clouducate

While we hope that this book provides a solid understanding of how Enterprise IT environments work and provides the tools for troubleshooting issues within such environments, there’s nothing like experiencing problems firsthand. For this reason, the authors created a set of tools, collectively called Clouducate, that create a virtual data center in the cloud with a couple of web-based applications. These environments are then configured with various errors from assignment to assignment to allow students to utilize all the tools covered in this book to try to figure out the problems. At this time, these tools require the use of AWS and a small amount of spend on AWS services (about $40-$50 per semester if care is taken to stop services when not using them) unless education credits can be acquired from AWS. Please note that AWS Educate (https://aws.amazon.com/education/awseducate/) does provide “starter accounts”, but Clouducate cannot work with these “starter accounts” at this time due to the way we simulate a real enterprise environment. We do wish to be able to support AWS Educate starter accounts in the future.

13.1 AWS Glossary and Concepts

Before we get into how Clouducate works, we thought it would be best to review AWS terms and concepts first, as understanding at least some AWS basics is a key part of being able to troubleshoot the issues created by Clouducate.

◆ 

AWS Region – One or more AWS data centers within a nearby area (within 250 miles) – We will be using “us-east-2” ◆  AWS Availability zone – One datacenter within a region - We will be

 

using 1 AZ in this course – “us-east-2A” ◆  AWS VPC (Virtual Private Cloud) – Private network within AWS includes large private address space; can span Availability zones, but not regions ☐  We will be using this as it simulates an Enterprise environment ◆  AWS VPC Subnet – Portion of VPC addresses used to group instances & services; Cannot span availability zones ◆  IAM User – Need to create a user (any name) to administrate your services as opposed to the account login ◆  User Access Key ID and Access Secret Key (associated to IAM user) –  Needed for AWS CLI (command-line interface) scripts ◆  S3 Storage – AWS’ Simple Storage Service (object storage via Internet access) ☐  We will be using this service for Load Balancer logs ◆  EC2 (Elastic Compute Cloud) Instance – A VM server in AWS (can be associated with a VPC subnet and a security group) ◆  ec2-user – Userid to login to your Linux EC2 instances ◆  Administrator – Userid to login to your Windows instances ◆  EC2 instance states – running, stopped, and terminated (deleted) ◆  Routing table – Associated to VPC subnets – how to route network requests ◆  Security Group – Associated to EC2 instances/AWS services – firewall rules for access in & out of instance/service NAT Instance – A pre-configured EC2 instance that allows outbound access to the Internet ◆  Internet Gateway - Allows Internet access out of VPC ◆  Client VPN Endpoint – Allows VPN access into VPC - Associated to a VPC subnet ◆ 

◆  Route 53 – AWS DNS service - Associated to VPC IP addresses (inbound endpoints) ☐  Can include private zone (e.g. AWSVPCB.edu) for DNS resolution within VPC ◆  ELB (Elastic Load Balancer) – AWS’ load balancer service (there are multiple options – we will utilize the Classic LB) - Associated to subnet & security group

13.2 Clouducate Environment and AWSVPCB Scripts

 

As we began to work on Clouducate, we wanted to keep the architecture simple, but not too simple. For example, to one extreme, we could create servers in AWS and simply expose them to the Internet like the below diagram:

However, this would be completely irrelevant to what an Enterprise IT environment looks like. Just some of the problems with a setup like this would be as follows:
All servers are exposed to the Internet (no private addressing)
No local DNS servers
No zones
No firewalls
No load balancing
No subnetting
No internal routing
No VPN

In general, this would be unrealistic in terms of simulating any kind of significant real-world Enterprise IT troubleshooting. So, we developed a more complex environment as follows:

 

This setup includes six subnets (dotted lined boxes), four of these subnets are protected by a security group (solid lined boxes) limiting access in and out of the subnet to simulate firewalls, all private addressing except for the exit point (AWS Internet Gateway), VPN connectivity, internal DNS, internal routing, and finally applications with load balancers, web/application servers and a database server. While the complexity of this setup is still minor compared to real-world Enterprise IT environments, it at least provides enough to allow for students

to learn how to use some of the tools that are used in real enterprise environments. In fact, there are four distinct environment setups across nine assignments. These environments increase in complexity over the course as the students learn more concepts. The below are the environment diagrams for the assignments:

 

Each student creates their own environment to allow each of them to work independently, although the course also calls for group assignments. In order for a student to create their own AWS environment, they first need to find a Linux or MacOS machine onto which they can install the AWS CLI (version 2) as well as the Clouducate AWSVPCB shell scripts and supporting files. The Clouducate AWSVPCB scripts do not support Windows at this time. The below is a description of the AWSVPCB scripts and what they do:

 

◆ 

◆ 

5 one-time scripts as follows and 1 recovery script:
☐  AWSVPCB.CONFIGURE & AWSVPCB.TEST – Only needed at the start of the course, but can be run again, if needed.
☐  AWSVPCB.VPC.CREATE – This script creates your VPC, Internet Gateway, route tables, subnets, security groups, and Client VPN endpoint. It also registers all the AWS unique IDs generated during the builds.
➢  This script will fail if an “AWS-VPCB” tagged VPC already exists.
➢  You should only need to run this script once. If you need to start from scratch, you should run AWSVPCB.VPC.DESTROY before re-running this.
☐  AWSVPCB.VPC.DESTROY – This script will destroy the registered VPC and everything in it.
➢  You should only need to run this at the end of the course; running this will require replacing the VPN config.
☐  AWSVPCB.VPC.REGISTER – This script may never be needed. It will find your AWS-VPCB VPC in AWS and register all its components for the other scripts to be able to work as expected.
4 multi-use assignment scripts as follows:
☐  AWSVPCB.ASSIGNMENT.CREATE # – Destroys existing instances and ELB targets if any exist, then adjusts VPC settings and creates assignment instances and the ELB (if applicable) based on the number passed in as a parameter; destroys any existing work you’ve done on the assignment.
☐  AWSVPCB.ASSIGNMENT.START – Starts instances, associates the Client VPN endpoint to the subnet, creates the Route 53 DNS Zone, DNS entries (including saved entries if the assignment was previously stopped) and DNS inbound endpoints; can be run multiple times without destroying work.
☐  AWSVPCB.ASSIGNMENT.STOP – Stops instances, disassociates the Client VPN endpoint from the subnet, saves DNS entries, deletes the Route 53 DNS Zone and inbound endpoints; can be run multiple times without destroying work.
☐  AWSVPCB.ASSIGNMENT.DESTROY – Destroys the existing ELB and instances. Called by AWSVPCB.ASSIGNMENT.CREATE; destroys all the work you’ve done on the assignment.

In the course taught by the authors, $100 in AWS credits are provided to each student. However, care must be taken by the students to avoid overrunning these credits. The below is guidance provided in terms of the usage of these credits. Credit Usage Implications of Scripts: The below assumes a new AWS account. If you have an account older than 12 months, you will be charged a small amount after VPC.CREATE and a little more after ASSIGNMENT.CREATE. If you are in this position, be more mindful of your spend status, but you should still have enough credits for the course. ◆  AWSVPCB.VPC.CREATE - No charges after this script is run. ◆  AWSVPCB.ASSIGNMENT.CREATE - After running this script for later assignments that use the ELB, a few cents will begin to be billed to your account daily. ◆  AWSVPCB.ASSIGNMENT.START - You will not be able to use your services until you run this script. After you run this script, you will begin to draw down on your credits at a much steeper rate (several dollars per day). The course is designed to allow you hundreds of hours of uptime, but not 24x7 for days. If you do not run the AWSVPCB.ASSIGNMENT.STOP script whenever you are done working on your assignment, you will run out of credits. ◆  AWSVPCB.ASSIGNMENT.STOP - Will stop the necessary services so that you are no longer using a heavy amount of credits while saving all of your changes. While your changes are saved, you will not be able to work until you perform an ASSIGNMENT.START again. ◆  AWSVPCB.VPC.DESTROY - If you find that you need to destroy your VPC for a while to save on credits by running this script, that is

fine, but please note that you will need to re-import your VPN configuration after recreating your VPC.

13.3 Clouducate Components and Interdependencies

At a high level, Clouducate is currently made up of two independent codebases that coordinate with each other to create a dynamic, highly flexible environment. These two components are the AWSVPCB scripts, which were discussed in the previous section because that’s what the students are exposed to, and Havoc Circus, which allows for the dynamic alteration of the

 

environment to create problems for the students to solve. Both of these components utilize JSON configuration files that are retrieved from the cloud to build each assignment uniquely. There are several intersection points between these components such as the domain name for the environment (i.e. the DNS zone – defaults to awsvpcb.edu), the AMIs (AWS Machine Images) which host the applications/databases, DNS servers which need to be configured in the AMIs, Assignment number which is passed between these components through a DNS TXT record and finally the passwords for the instances and databases which are included in the AWSVPCB support files for the students. The AWSVPCB scripts are written as Bash shell scripts. These scripts create the AWS VPC and all of its components as described in the previous section. These scripts are accompanied by several critical files including the passwords for all the systems, SSL Certificate Authority certificate that needs to be trusted for VPN & SSL to the web sites to work, SSL certificate for the VPN and the websites, and the JSON files needed for each assignment. Havoc Circus is a Windows service written in C-Sharp. The AMI Havoc Circus executes on needs to have SSH tokens for any Linux server it needs to modify as well as access to a SQL Server database where it maintains its state. As such, Havoc Circus is typically installed on the SQL Server that is used to support all the applications in all the assignments. Havoc Circus also retrieves the JSON assignment settings from the cloud at execution time. All of the above is graphically displayed in the below diagram:

 

13.4 AWSVPCB Documentation

AWSVPCB (AWS Virtual Private Cloud Builder) is a set of BASH Shell scripts designed to create a small Enterprise IT-like environment in AWS to provide college-level students with hands-on experience individually working on IT problems. These scripts work in conjunction with the Havoc Circus utility which provides AWS images (AMIs) with pre-loaded applications and a method for automatically changing the environment for assignment purposes. These scripts were originally developed by Norbert Monfort and Robert Fortunato for the CTS-4743 Enterprise IT Troubleshooting course

taught at FIU (Florida International University) in Miami, Florida. However, these scripts and the associated Havoc Circus C#.NET packages are open source and free to be used for any purpose. These two components make up what is called the "Clouducate Suite", which we would like to see expand to include additional tools in the future. Both of these components are made generally available as open source. The rest of this section will focus on AWSVPCB.

STRUCTURE OF AWSVPCB

The AWSVPCB scripts require a specific directory structure. You can simply download the awsvpcb-scripts.zip file to install this directory structure to get

started. The following is an explanation of the directories: 1. The root directory of the scripts - All the scripts that are meant to be executed by the students reside in this directory and are all fully

 

CAPITALIZED. All the other directories are support directories relevant to the instructor, but not to the students. The root directory also includes the vpcb-config file which provides for a few configuration settings for the scripts. 2. The "procs" directory includes all of the detail code for the scripts as well as some dynamic variable files (collectively called the registry) that start empty but are modified as the scripts load the JSON files and build out the AWS components. All "code" resides in this directory and the root directory. 3. The "secfiles" directory includes all the supporting information like:
The certificates and private keys used by the load balancers
The certificates and private keys used by the VPN (client & server-side)
The private key used for the AWS instances
The certificate authority certificate for all the above
The passwords for all of the instances so that the students can log in
The password for the database so that students can log in
The dynamically generated OVPN file for the students to import into their PC for VPN connectivity
The dynamic JSON files and templates used to create and save DNS entries
The downloaded VPC and Assignment-level JSON files used to build the VPC(s) and assignments (these are in sub-

directories) 4. The "tempfiles" directory includes temporary dynamically generated files 5. The "logs" directory includes all of the output for all the scripts run

CONFIG FILE (vpcb-config) SETTINGS

There aren't many settings to worry about in the vpcb-config file and most are self-explanatory, but here's the list:

 

COURSE & SEMESTER - These two parameters go hand in hand to determine the name of the log group to be created in Cloudwatch. If you happen to have multiple sections for a particular course within a semester, then it's recommended that you add the section number to the COURSE parameter. DOMAIN - This is used to append to all of your server and ELB names. Consider this your fake company's domain name. The default is "awsvpcb.edu" and certificates for the VPN and a couple of ELB (load balancer) names are provided for this domain, as well as the CA cert (ca.crt), which is all used to build the VPC and ELBs. However, if you wish to create new ELB names or use a different domain, then you would need to create your own CA and replace all the relevant files in the "secfiles" directory. Changing the domain also implies that you have created your own AMIs and Havoc Circus assignment files compatible with the new domain name. AWSCMD - This parameter should not be changed for now. It was set up to provide for the ability to change the aws command in the future. ENABLE_AWS_LOGGING - This parameter can be set to "yes" or "no". If set to "yes", then an AWS access key and secret key with access to your Cloudwatch logs must be provided. MANIFEST_LOCATION - This parameter can be set to "aws" or "local". If set to "aws", then an AWS access key and secret key with access to your S3 bucket where the VPC and Assignment JSON files are located must be provided. AWS_REGION - The AWS region where you want your VPC built. This must be the same region that you have instructed your students to select when running the "aws configure" command after the AWS CLI installation. AWS_AZ - The AWS availability zone where you want your VPC built. LOGGING_ACCESS_KEY - The AWS access key to be used for logging into AWS.

 

LOGGING_SECRET_KEY - The AWS secret key to be used for logging into AWS. MANIFEST_ACCESS_KEY - The AWS access key to be used to download VPC and assignment manifest JSON files from AWS. MANIFEST_SECRET_KEY - The AWS secret key to be used to download VPC and assignment manifest JSON files from AWS. MANIFEST_S3BUCKET - The AWS S3Bucket where the VPC and assignment manifest JSON files are located in AWS. DIAGLOG_S3BUCKET - The AWS S3Bucket where diagnostic information will be placed; the LOGGING_ACCESS_KEY must have access to write to this S3Bucket.

MAIN EXECUTABLE SCRIPTS

AWSVPCB.CONFIGURE - Reads vpcb-config and configures the base settings within the "procs" directory. Can be rerun as often as needed but does not need to be run unless a change is made to the vpcb-config file. AWSVPCB.TEST – Simply tests basic connectivity to AWS. AWSVPCB.VPC.CREATE – This script optionally accepts a numeric parameter (the VPC number; 0 is the default if no number is provided). The script expects a vpc# directory (where # is the VPC number) to exist in the "secfiles" directory or the AWS manifest S3bucket where the vpc.json file can be found and loaded. Using

this vpc.json file, this script creates a VPC with an associated Internet Gateway, NAT instance, route tables, subnets, security groups, an S3Bucket for ELB logs, a Client VPN endpoint, and an OVPN file for the OpenVPN client, and registers all AWS unique IDs. This script will fail if an “AWS-VPCB” tagged VPC already exists in the target AWS account. To rerun this script, AWSVPCB.VPC.DESTROY must be run first. AWSVPCB.VPC.DESTROY – This script will destroy the registered VPC and everything in it.

 

AWSVPCB.VPC.REGISTER – This script compares the AWS object IDs registered with what exists in AWS to adjust the registry files in the "procs" directory appropriately. The script requests the user to provide an optional assignment number such that all of the assignment components are also registered properly. AWSVPCB.ASSIGNMENT.CREATE # – This script accepts a mandatory assignment number (no default). The script then stops and destroys existing assignment instances, DNS entries, and ELB targets if any exist. The script then expects an assignment# directory (where # is the assignment number) to exist in the "secfiles" directory or AWS manifest S3bucket where the awsvpcb.assignment#.json can be found and loaded. The script then creates assignment instances and firewall rules. AWSVPCB.ASSIGNMENT.START – Starts instances, Creates ELB (if applicable and just during the first start), Associates Client VPN endpoint to subnet, Creates Route 53 DNS zone and entries, Creates Route 53 Inbound endpoints. This script can run multiple times without destroying any work. AWSVPCB.ASSIGNMENT.STOP – Stops instances, Saves Route 53 DNS entries, Destroys Route 53 DNS zone, Disassociate Client VPN endpoint from the subnet, Deletes Route 53 Inbound Endpoints. This script can run multiple times without destroying work. AWSVPCB.ASSIGNMENT.DESTROY – Destroys existing assignment ELB and instances. This script is called by AWSVPCB.ASSIGNMENT.CREATE and destroys all the work done on the assignment. AWSVPCB.DIAGLOG - This script gathers all the relevant

information from the student's AWSVPCB directories and AWS VPC and sends it to an S3 bucket for review. All the files are placed in the S3Bucket defined in the vpcb-config file. AWSVPCB.MANIFEST.DISPLAY - This script prompts the user for which manifest to display (VPC or Assignment and the relevant #). The script then reads the designated manifest and displays what was extracted and would be loaded into the registry of the awsvpcb-scripts. This is a good way to test whether your JSON is properly formatted and accepted. Note that this does not

 

validate the JSON, just displays what loading it would do.

VPC JSON FILES

The awsvpcb-scripts.zip file includes a sample JSON configuration file in the "secfiles/vpc0" directory. AWSVPCB allows for the definition of multiple VPCs (default is vpc0); however, there can only be one defined in AWS for a given AWS account at one time. Also, because each time you create a VPC a new OVPN file is generated, it is recommended that you limit the number of VPCs used in one course. For CTS-4743, for example, we only use one VPC for the entire semester and use the assignment JSON files to adjust the environment. VPCs can be created, registered, and destroyed. Registering a VPC entails comparing the dynamic files within the "procs" directory with what is actually in AWS. All VPC JSON parameters are required, albeit the number of subnets, PossibleInstanceNames, and PossibleELBs is variable. The sample VPC JSON file includes all the parameters currently available. The below is a summary of these parameters:

VPC-VPCCIDR(required): The range of IPs available for the VPC in CIDR notation Subnets(required): The number of subnets is variable, but the DEFAULT and PUBLIC subnets are required and should not be touched Subnets-SubnetName(required): The name of the subnet (no spaces allowed) Subnets-SubnetCIDR(required): The IP range for the subnet in CIDR notation Subnets-SecurityGroup(required): "yes" or "no" as to whether this subnet has an equivalently named Security group controlling its

inbound and outbound traffic Subnets-RoutingTable(required): "DEFAULT" or "PUBLIC" - all subnets other than the PUBLIC, should use the DEFAULT routing table PossibleInstanceNames(required): The number of PossibleInstanceNames is the variable, but at least one must exist. This is necessary to allow AWSVPCB.VPC.REGISTER script to re-calibrate the registry for the scripts with what exists in AWS. Simply list the possible instance names that may exist in any

 

assignment to be used with this VPC. PossibleELBs(required): The number of PossibleELBs is variable, but at least one must exist. This is necessary to allow the AWSVPCB.VPC.REGISTER script to re-calibrate the registry for the scripts with what exists in AWS. Simply list the possible ELB names that may exist in any assignment to be used with this VPC. DNSIPAddresses(required): This is a list of the IP addresses that will be defined in the AWS Route 53 resolver (DNS server). All AMIs should have these IPs in their config to appropriately resolve your private domain's DNS names to IPs. NATDefinition(required): This is unlikely to need to be changed as this is simply defined to allow Internet access from within the VPC. NATDefinition-IPAddress(required): IP address for the NAT instance. NATDefinition-Subnet(required): Subnet for the NAT instance. NATDefinition-AMI(required): AMI (AWS Machine Image) for the NAT instance. SAMPLE AWSVPCB VPC JSON INCLUDED IN ZIP FILE (secfiles/vpc0/vpc.json) {   "AWSVPCB": {

"VPC":   {   "VPCCIDR": "172.31.0.0/16"   }   },

 {   "Subnets": [   {   "SubnetName": "DEFAULT",   "SubnetCIDR": "172.31.131.0/24",        

"SecurityGroup": "yes", "RoutingTable": "DEFAULT" }, {

                                                       

"SubnetName": "WEB", "SubnetCIDR": "172.31.128.0/24", "SecurityGroup": "yes", "RoutingTable": "DEFAULT" }, { "SubnetName": "DB", "SubnetCIDR": "172.31.129.0/24", "SecurityGroup": "yes", "RoutingTable": "DEFAULT" }, { "SubnetName": "ELB", "SubnetCIDR": "172.31.130.0/24", "SecurityGroup": "yes", "RoutingTable": "DEFAULT" }, { "SubnetName": "CLIENTVPN", "SubnetCIDR": "172.31.16.0/24", "SecurityGroup": "no", "RoutingTable": "DEFAULT" }, { "SubnetName": "PUBLIC", "SubnetCIDR": "172.31.132.0/24", "SecurityGroup": "yes", "RoutingTable": "PUBLIC"

 

  }   ]   },  {   "PossibleInstanceNames": [        

{

"InstanceName": "IIS1" }, {

 

  "InstanceName": "IIS2"   },   {   "InstanceName": "LINUX1"   },   {   "InstanceName": "LINUX2"   },   {   "InstanceName": "MSSQL"   }   ]   },  {   "PossibleELBs": [   {   "ELBName": "myfiu"   },   {   "ELBName": "rainforest"   }   ]   },  {   "DNSIPAddresses": [   {   "IPAddress": "172.31.131.10"   },

  {   "IPAddress": "172.31.131.11"   }   ]   },  {    

"NATDefinition": { "IPAddress": "172.31.132.151",

 

  "Subnet": "PUBLIC",   "AMI": "ami-****"   }   },  {   "VPNDefinition":   {   "CACert": "ca.crt",   "ConfigFile": "AWSVPCB-client-config.ovpn",   "ClientCIDR": "172.31.8.0/22"   }   },  { "ServerPrivateKey": "privkey"  } } SSIGNMENT MANIFEST JSON FILES The awsvpcb-scripts.zip file includes sample assignment JSON configuration files in the "secfiles/assignment#" directories (#=1-3). AWSVPCB allows for the definition of multiple assignments (there is no default). Assignments can be created, started, stopped, and destroyed. Starting and stopping an assignment preserves all changes made. The option to start and stop an

assignment is necessary to allow a student to step away and not unnecessarily use up their allotted AWS credits. The three sample assignment JSON files include all the parameters currently available. Some of these parameters are optional. The below is a summary of these parameters:

Instances(required): The number of instances is variable, but at least one must exist Instances-InstanceName(required): The name of the instance (no spaces allowed) Instances-InstanceIP(required): The IP address for this instance. Must be within the IP range of the subnet Instances-InstanceSubnet(required): The subnet for the instance. Must be a subnet defined in the VPC.json file Instances-InstanceAMI(required): The AMI (AWS Machine Image) for this instance. If the AMI is private, then it must be

 

shared with the student's AWS account Instances-InstanceType(required): The type/size of the instance. While anything can be chosen here, please be aware that only type t2.micro is currently covered by the AWS "free tier", which allows for many more hours of usage without chewing up a lot of AWS credits. Instances-StartPreference(optional): Although this can be used for any purpose, it is only necessary for the server that executes the Havoc Circus service and should be set to "first" ("last" is also an available option, but will cause problems if used for the instance running the Havoc Circus service) Instances-StopPreference(optional): Although this can be used for any purpose, it is only necessary for the server that executes the Havoc Circus service and should be set to "last" ("first" is also an available option, but will cause problems if used for the instance running the Havoc Circus service) FirewallRules(optional): Syntactically, Firewall Rules do not need to be provided and there is no limit as to how many are provided. However, if rules are not provided, then no access will be allowed into a security group, so generally, at least one inbound and one outbound rule per security group is needed to make things functional. FirewallRules-SecurityGroup(required): The security group to which this particular firewall rule applies FirewallRules-RuleType(required): Must be "inbound" or "outbound" FirewallRules-Protocol(required): Can be "all", "udp", "tcp" or "icmp"

FirewallRules-Port(required): Can be "all" or specific port number or port range with a hyphen in between (e.g. "137-139") FirewallRules-SourceGroup(required): The source (for "inbound" rules) or destination (for "outbound" rules) for this firewall rule in CIDR notation DNSEntriesFile(required): This is the location of the JSON file that includes the DNS entries to apply to Route 53 when the assignment is created. The path is relative to the "secfiles" directory and/or the AWS S3 bucket where the manifest exists

 

ELBs(optional): Only required if ELBs are to be defined. NOTE: A log is automatically created for each ELB within an S3 bucket in the student's AWS account that will include all the requests into that ELB (equivalent to a web access log). There is also a DNS entry automatically created for the ELB.
ELBs-ELBName(required): The name of the ELB. There should be an SSL certificate (.crt file) and private key (.pkey file) in the "secfiles" directory with this name and the chosen domain appended (e.g., rainforest.awsvpcb.edu.crt and rainforest.awsvpcb.edu.pkey, where rainforest is the ELBName and awsvpcb.edu is the DOMAIN).
ELBs-ListenerProtocol(optional): The default is HTTPS. At this time, no other option is supported, although support for HTTP is being built to avoid the need for certificates.
ELBs-ListenerPort(optional): The default is 443. Any value recognized by AWS is accepted.
ELBs-InstanceProtocol(optional): The default is HTTP. Any value recognized by AWS is accepted.
ELBs-InstancePort(optional): The default is 80. Any value recognized by AWS is accepted.
ELBs-ELBSubnet(optional): The default is ELB. The name chosen must be a valid subnet as defined in the VPC JSON file.
ELBs-HealthCheckTarget(optional): The default is TCP:80. This is what AWS uses to validate that the target instances are active. Any value recognized by AWS is accepted.
ELBs-HealthCheckInterval(optional): The default is 5. This indicates how often AWS will execute the health check (in secs). Any value recognized by AWS is accepted.
ELBs-HealthCheckTimeout(optional): The default is 3. This indicates how long (in secs) before AWS considers a health check as timed out (failed). Any value recognized by AWS is accepted.
ELBs-HealthCheckUnhealthyThreshold(optional): The default is 2. This indicates how many health checks must fail for the target instance to be taken offline. Any value recognized by AWS is accepted.
ELBs-HealthCheckHealthyThreshold(optional): The default is 2. This indicates how many health checks must succeed for the target instance to be brought back online. Any value recognized by AWS is accepted.
ELBs-EnableSessionStickiness(optional): The default is N for "no". This indicates whether the load balancer should enable cookie-based session stickiness such that sessions remain on the same backend server. N for "no" or Y for "yes" are accepted.
ELBs-ELBInstances(required): The number of instances is variable, but at least one instance is required.
ELBs-ELBInstances-InstanceName(required): The name of a target instance for the ELB. Must be an instance defined in the assignment.

ONE OF THE THREE SAMPLE AWSVPCB ASSIGNMENT JSON FILES INCLUDED IN ZIP FILE (secfiles/assignment1/awsvpcb.assignment1.json)

{
  "AWSVPCB": [
    {
      "Instances": [
        {
          "InstanceName": "LINUX1",
          "InstanceIP": "172.31.128.43",
          "InstanceSubnet": "WEB",
          "InstanceAMI": "ami-****",
          "InstanceType": "t2.micro"
        },
        {
          "InstanceName": "LINUX2",
          "InstanceIP": "172.31.128.44",
          "InstanceSubnet": "WEB",
          "InstanceAMI": "ami-****",
          "InstanceType": "t2.micro"
        },
        {
          "InstanceName": "MSSQL",
          "InstanceIP": "172.31.129.75",
          "InstanceSubnet": "DB",
          "InstanceAMI": "ami-****",
          "StartPreference": "last",
          "StopPreference": "first",
          "InstanceType": "t3.small"
        }
      ]
    },
    {
      "FirewallRules": [
        { "SecurityGroup": "DEFAULT", "RuleType": "outbound", "Protocol": "all", "Port": "all", "SourceGroup": "0.0.0.0/0" },
        { "SecurityGroup": "PUBLIC", "RuleType": "outbound", "Protocol": "all", "Port": "all", "SourceGroup": "0.0.0.0/0" },
        { "SecurityGroup": "WEB", "RuleType": "outbound", "Protocol": "all", "Port": "all", "SourceGroup": "0.0.0.0/0" },
        { "SecurityGroup": "DB", "RuleType": "outbound", "Protocol": "all", "Port": "all", "SourceGroup": "0.0.0.0/0" },
        { "SecurityGroup": "ELB", "RuleType": "outbound", "Protocol": "all", "Port": "all", "SourceGroup": "0.0.0.0/0" },
        { "SecurityGroup": "DEFAULT", "RuleType": "inbound", "Protocol": "all", "Port": "all", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "PUBLIC", "RuleType": "inbound", "Protocol": "all", "Port": "all", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "udp", "Port": "137-138", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "tcp", "Port": "80", "SourceGroup": "172.31.16.0/24" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "tcp", "Port": "80", "SourceGroup": "172.31.130.0/24" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "tcp", "Port": "135-139", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "tcp", "Port": "49152-65535", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "tcp", "Port": "3389", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "tcp", "Port": "445", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "tcp", "Port": "22", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "WEB", "RuleType": "inbound", "Protocol": "tcp", "Port": "5000", "SourceGroup": "172.31.16.0/24" },
        { "SecurityGroup": "DB", "RuleType": "inbound", "Protocol": "tcp", "Port": "1433", "SourceGroup": "172.31.16.0/24" },
        { "SecurityGroup": "DB", "RuleType": "inbound", "Protocol": "tcp", "Port": "1433", "SourceGroup": "172.31.128.0/24" },
        { "SecurityGroup": "DB", "RuleType": "inbound", "Protocol": "tcp", "Port": "1434", "SourceGroup": "172.31.16.0/24" },
        { "SecurityGroup": "DB", "RuleType": "inbound", "Protocol": "tcp", "Port": "3389", "SourceGroup": "172.31.0.0/16" },
        { "SecurityGroup": "ELB", "RuleType": "inbound", "Protocol": "all", "Port": "all", "SourceGroup": "172.31.0.0/16" }
      ]
    },
    {
      "DNSEntriesFile": "assignment1/awsvpcb.assignment1.DNS.json"
    },
    {
      "ELBs": [
        {
          "ELBName": "rainforest",
          "ListenerProtocol": "HTTPS",
          "ListenerPort": "443",
          "InstanceProtocol": "HTTP",
          "InstancePort": "80",
          "ELBSubnet": "ELB",
          "HealthCheckTarget": "TCP:80",
          "HealthCheckInterval": "5",
          "HealthCheckTimeout": "3",
          "HealthCheckUnhealthyThreshold": "2",
          "HealthCheckHealthyThreshold": "2",
          "EnableSessionStickiness": "Y",
          "ELBInstances": [
            { "InstanceName": "LINUX1" },
            { "InstanceName": "LINUX2" }
          ]
        }
      ]
    }
  ]
}

ASSIGNMENT DNS JSON FILES
The awsvpcb-scripts.zip file includes sample assignment DNS JSON configuration files in the "secfiles/assignment#" directories (#=1-3). AWSVPCB allows each assignment to include its own set of DNS entries for the servers and/or other components. The DNS JSON file must be configured within the assignment JSON file. Thus, there could be one centralized DNS JSON config file used for multiple assignments, if desired. The format of the DNS JSON configuration file is dictated by AWS as it is loaded directly without modification. Any parameters accepted by AWS would be accepted in this file. Please refer to https://docs.aws.amazon.com/cli/latest/reference/route53/change-resource-record-sets.html for available JSON elements.

ONE OF THE THREE SAMPLE AWSVPCB ASSIGNMENT DNS JSON FILES INCLUDED IN ZIP FILE (secfiles/assignment1/awsvpcb.assignment1.DNS.json)

{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": { "Name": "nat.awsvpcb.edu", "Type": "A", "TTL": 300, "ResourceRecords": [ { "Value": "172.31.132.151" } ] }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": { "Name": "iis1.awsvpcb.edu", "Type": "A", "TTL": 300, "ResourceRecords": [ { "Value": "172.31.128.98" } ] }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": { "Name": "iis2.awsvpcb.edu", "Type": "A", "TTL": 300, "ResourceRecords": [ { "Value": "172.31.128.99" } ] }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": { "Name": "linux1.awsvpcb.edu", "Type": "A", "TTL": 300, "ResourceRecords": [ { "Value": "172.31.128.43" } ] }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": { "Name": "linux2.awsvpcb.edu", "Type": "A", "TTL": 300, "ResourceRecords": [ { "Value": "172.31.128.44" } ] }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": { "Name": "mssql.awsvpcb.edu", "Type": "A", "TTL": 300, "ResourceRecords": [ { "Value": "172.31.129.75" } ] }
    }
  ]
}

13.5 Havoc Circus Documentation
Havoc Circus is a set of C#/.NET packages used to run the HavocCircus Windows service, which allows you to dynamically change your VPC based on configuration files in JSON format. Havoc Circus was designed to be highly flexible such that almost any scenario can be created within your VPC. Havoc Circus must be installed on a Windows Server-based AMI that is part of an assignment as built by AWSVPCB within your VPC. Havoc Circus is pre-installed within the mssql AMI (one of the three publicly available Clouducate AMIs). Havoc Circus also requires a SQL Server database called HavocMonkey to be available within your VPC – the mssql sample AMI also includes this database.

Havoc Circus Components
Havoc Circus includes the following components: the CTS4743 Havoc Circus Windows service with its associated JSON configuration file and the HavocMonkey SQL Server database, the HavocCircus assignment directories with their associated JSON configuration files, and the sample AMIs that provide two sample web applications as well as two sample batch jobs. These components also integrate with the AWSVPCB scripts in several ways. This integration is called out as each component is reviewed below.

CTS4743 Havoc Circus Windows Service
The CTS4743 Havoc Circus Windows service is installed by default in the C:\Program Files\CTS 4373\Services\Havoc Circus directory. If you create your own image, it can be installed in any directory; however, you need to make sure that all the JSON configuration files are also adjusted accordingly. This directory contains all the files needed to run the Windows service, although the HavocMonkey database also needs to be available somewhere within your VPC. Of particular note is the HavocCircus.exe.config file, which is a JSON formatted configuration file for the service. Note that there is also a staging directory used during the assignment modifications and this is necessary as well.

HavocCircus.exe.config
There is only one configuration file for the Havoc Circus Windows service. This is a standard .NET configuration file. Thus, the modifiable values are in the "appSettings" section. All of the "keys" in this section are required.

Below is a description of what these keys are used for:
connectionString: This is the SQL Server connection string to the HavocMonkey database. The default refers to the database that comes pre-installed with the mssql AMI.
manifestFileName: The default is manifest.json. This is the suffix for the assignment manifest file that HavocCircus will load after appending the assignment name at the front (e.g., assignment1.manifest.json).
processExecutionThreadSleepTime: ????
s3BucketName: This is the name of the S3 bucket where the manifest and all associated assignment support files are located.
s3IAMAccessKey: This is the AWS IAM Access Key used to retrieve the manifest and all associated assignment support files from the designated S3 bucket.
s3IAMSecretKey: This is the AWS IAM Secret Key used to retrieve the manifest and all associated assignment support files from the designated S3 bucket.
scpClientExecutable: This is the location of the executable used to copy files to any Linux system within your VPC.
sshClientExecutable: This is the location of the executable used to execute commands on any Linux system within your VPC.
sshPrivateKey: This is the private key used to authenticate to any Linux system within your VPC.
startupProcessingDelayThreadSleepTime: This is how many milliseconds HavocCircus will wait after starting up before executing the HavocCircus commands from the manifest file.
triggerTXTRecordName: This is the DNS name of the TXT record populated by AWSVPCB to let HavocCircus know the assignment that is executing. This is not easily changeable at this time as it is not variable within the AWSVPCB scripts yet. Thus, it must remain as the default trigger.awsvpcb.edu.
workingDirectory: This is the directory where all the HavocCircus files and the staging directory are located.

A sample HavocCircus.exe.config is included on the mssql AMI at C:\Program Files\CTS 4373\Services\Havoc Circus\HavocCircus.exe.config.
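Because AWSVPCB signals the active assignment to HavocCircus through this trigger TXT record, a quick sanity check when an assignment does not seem to take effect is to resolve the record from a machine connected to the VPC's VPN. A minimal sketch, using the default record name noted above:

# Which assignment is HavocCircus being told to run?
nslookup -type=TXT trigger.awsvpcb.edu
# or, where dig is available:
dig +short TXT trigger.awsvpcb.edu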

Havoc Circus Sample AMIs
Havoc Circus and AWSVPCB come with 4 publicly available AMIs. These AMIs will change over time, so they are updated in our publicly available AWS S3 bucket public.awsvpcb.edu. The AWSVPCB VPC and ASSIGNMENT JSON files will be updated within this bucket as new AMIs are published and the code is changed. Note that the administrator passwords for all AMIs are included in the secfiles directory of the AWSVPCB scripts as well as in the sample Havoc Circus assignment JSON files. Below is a summary of the 4 publicly available AMIs and what each one contains.

NAT Instance AMI – used by nat.awsvpcb.edu
This is the simplest of the AMIs as it contains only configuration information to allow for Internet access out of your VPC. This AMI participates in the public routing table, while all other AMIs participate in the private routing table. This AMI contains both a private IP and a public IP. It is the only AMI with a public IP and it routes Internet traffic out the VPC's Internet Gateway.

LINUX AMI – used by linux1.awsvpcb.edu and linux2.awsvpcb.edu
This AMI includes two key services – nginx & kestrel-rainforest. Both of these services can be started/stopped/restarted using systemctl as root (e.g., sudo systemctl start kestrel-rainforest). The nginx service listens on port 80 and its configuration is set up to proxy all traffic to the kestrel-rainforest service, which listens on port 5000. The kestrel-rainforest service hosts a dotnet application called rainforest that is an implementation of Microsoft's eShopOnWeb (https://github.com/dotnet-architecture/eShopOnWeb). The kestrel-rainforest service is configured to utilize a couple of databases on the mssql.awsvpcb.edu server. This application requires session stickiness on the load balancer to perform properly, as demonstrated in assignment 1.
This AMI also includes a batch job that can be used for assignments. The batch job is located at /var/batch/catalog_exporter on the AMI. This batch job creates an export of the items available on the site and can be executed with the following command: dotnet CatalogExporter.dll. The output of this job is sent to the export subdirectory. This batch job is also configured to utilize the same database on the mssql.awsvpcb.edu server.

IIS AMI – used by iis1.awsvpcb.edu and iis2.awsvpcb.edu
This AMI includes IIS and an implementation of Microsoft's Contoso University web application (https://github.com/dotnet/AspNetCore.Docs/tree/master/aspnetcore/data/ef-mvc/intro/samples). IIS is configured with just one site and only has one app pool running. The Contoso University web application is configured to utilize a database on the mssql.awsvpcb.edu server. IIS listens on both ports 80 and 5000. This application does not require session stickiness on the load balancer to perform properly, as demonstrated in assignment 2.
This AMI also includes a batch job that can be used for assignments. The batch job is located at C:\Batch\StudentGenerator on the AMI. As the name implies, the batch job randomly generates students and imports them into the Contoso University database. The batch job accepts a parameter to indicate the number of students to generate. This batch job is also configured to utilize the same database on the mssql.awsvpcb.edu server.

MSSQL AMI – used by mssql.awsvpcb.edu
This AMI includes the CTS4743 Havoc Circus Windows service and SQL Server 2019 containing 4 databases (2 for eShopOnWeb, 1 for Contoso University, and 1 for Havoc Circus). SQL Server listens on the default 1433 port; however, the DAC Admin port TCP 1434 is also enabled and available within the VPC depending on the firewall rules created by the AWSVPCB assignment JSON. The sa password for this instance is included in the secfiles directory of the AWSVPCB scripts.
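When an assignment misbehaves, a few quick checks on these AMIs can localize the failing tier before digging deeper. Below is a minimal sketch, using the service names and ports described above (nc may not be present on every image):

# On linux1/linux2: is the web tier healthy?
sudo systemctl status nginx kestrel-rainforest     # both should show active (running)
curl -sI http://localhost:5000 | head -n 1         # is Kestrel answering directly?
curl -sI http://localhost:80 | head -n 1           # is nginx proxying to Kestrel?

# From linux1/linux2: can the app tier reach SQL Server on the DB tier?
nc -zv mssql.awsvpcb.edu 1433                      # any TCP connectivity test will do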

13.6 CLOUDUCATE GETTING STARTED
You can use the default awsvpcb-scripts provided to build a VPC with VPN connectivity that includes publicly available Havoc Circus AMIs with one or two applications loaded. This will use static JSON files pre-loaded in the awsvpcb "secfiles" directory for the VPC and Assignments (1-3 are included) to provide you with examples of what can be configured with AWSVPCB. This will also use static assignment JSON files loaded on the Havoc Circus AMIs that provide examples of what you can do with Havoc Circus.

GETTING STARTED PROOF OF CONCEPT (POC) PROCEDURE

1. CREATE TEST STUDENT AWS ACCOUNT - This will be where the VPC is created.
   NOTE: At this time, this cannot be an AWS Educate Starter account as such accounts do not support VPN connectivity.
   NOTE: A credit card must be provided. AWS credits should be obtained and added to this account to avoid unnecessary charges. However, the charges for going through this POC are minimal (under $10) as everything is destroyed in the end and it is not kept up and running for hours.
2. DECIDE which PC you are planning to use to access your VPC (Linux, MAC, or Windows) through the VPN.
3. INSTALL OpenVPN on the selected PC - This will be used later when the OVPN file is created.
   ON MAC: OpenVPN Connect version 3.1 or higher from the MAC App Store
   ON WINDOWS: https://www.ovpn.com/en/guides/windows-openvpn-gui
4. DECIDE where you are going to run the scripts (must be a Linux or MAC machine) - if using MAC or Linux for the PC where OpenVPN is installed, then it can be the same machine.
5. INSTALL the AWS CLI (version 2) on the chosen machine where the scripts will run - follow the instructions provided by AWS (https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) and test with the following command to make sure you have the proper access: aws ec2 describe-instances
6. DOWNLOAD the awsvpcb-scripts file at https://github.com/clouducate/awsvpcb/blob/main/awsvpcb-scripts.zip and EXTRACT it onto the machine where the AWS CLI is installed.
7. CD to the root of the awsvpcb scripts directory.
8. RUN ./AWSVPCB.CONFIGURE
9. RUN ./AWSVPCB.TEST to confirm all is working as expected.
10. RUN ./AWSVPCB.VPC.CREATE
11. IF NECESSARY, DOWNLOAD the AWSVPCB-client-config.ovpn file created in the "secfiles" directory by the AWSVPCB.VPC.CREATE script to the PC where you installed the OpenVPN client. If the scripts reside on the same machine as your OpenVPN client, then this is unnecessary.
12. IMPORT the AWSVPCB-client-config.ovpn into the OpenVPN client.
13. IF NECESSARY, DOWNLOAD the ca.crt file in the "secfiles" directory to the PC where you installed the OpenVPN client. Again, if the scripts reside on the same machine as your OpenVPN client, then this is unnecessary.
14. ADD the ca.crt root certificate to the trusted certificate store on your PC (Windows, MAC, or Linux).
15. IF NECESSARY, DOWNLOAD other support files from the "secfiles" directory to the PC, including: iis1.rdp, iis1.password, iis2.rdp, iis2.password, mssql.rdp, mssql.password, mssql.sa.password, privkey.ppk (if you plan to use putty), privkey.pem (if you plan to use SSH).
16. RUN ./AWSVPCB.ASSIGNMENT.CREATE 1
17. RUN ./AWSVPCB.ASSIGNMENT.START
18. CONNECT with OpenVPN to your AWS VPC using the OVPN connection you just imported.
19. TEST the following (a few command-line spot checks that complement these tests are sketched after this procedure):
    RDP to the iis1.awsvpcb.edu server using the iis1.rdp file in the "secfiles" directory. The password is in the iis1.password file. If using a MAC for the VPN connectivity, then download Microsoft Remote Desktop 10 or higher from the MAC App Store.
    RDP to the mssql.awsvpcb.edu server using the mssql.rdp file in the "secfiles" directory. The password is in the mssql.password file. If using a MAC for the VPN connectivity, then download Microsoft Remote Desktop 10 or higher from the MAC App Store.
    SSH to the linux1.awsvpcb.edu server using the privkey.pem file (password is hardcoded to cts4743) or use putty (you need to add privkey.ppk to the session definition - same password of cts4743).
    Use SSMS (Windows) or Azure Data Studio (MAC) to connect to the SQL Server instance on mssql.awsvpcb.edu. You can log in as "sa" using the password in the mssql.sa.password file.
    Use a browser to connect to https://rainforest.awsvpcb.edu (assignment 1), https://myfiu.awsvpcb.edu (assignment 2), or http://myfiu.awsvpcb.edu AND http://rainforest.awsvpcb.edu (assignment 3). NOTE: Assignments 1 & 2 use the AWS ELB and thus utilize TLS (https) and http is disabled.
20. RUN ./AWSVPCB.ASSIGNMENT.STOP - this would only be necessary if you wanted to get back to this later, else you can run ./AWSVPCB.ASSIGNMENT.DESTROY
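If any of the tests in step 19 fail, a handful of command-line checks from the VPN-connected machine will usually isolate whether DNS, the ELB, or a backend instance is at fault. Below is a minimal sketch using the assignment 1 hostnames; the S3 bucket holding the ELB access logs is whatever AWSVPCB created in your account, so the bucket name shown is a placeholder:

# Is private DNS resolving through the VPN?
nslookup rainforest.awsvpcb.edu
nslookup mssql.awsvpcb.edu

# Is the ELB terminating TLS and forwarding to a healthy instance?
curl -skI https://rainforest.awsvpcb.edu | head -n 1

# Is a backend instance reachable directly on its instance port?
curl -sI http://linux1.awsvpcb.edu | head -n 1

# Pull the ELB access logs for closer inspection (bucket name is a placeholder)
aws s3 ls
aws s3 cp s3://your-elb-log-bucket/ ./elb-logs --recursive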

NEXT STEPS FOR YOUR COURSE
1. MAP OUT TARGET ENVIRONMENT
   What AMIs do you plan to use? Are the default AMIs provided sufficient or do you want to take those and create custom AMIs?
   What applications will the students be testing? Are the default rainforest and myfiu applications sufficient or do you want to build/use others?
   What do you want your VPC to look like (IP range, subnets, possible instances, possible ELBs)?
   What do you want your assignments to look like (instances, firewall rules, DNS entries, ELBs)? NOTE: To use Havoc Circus, at least one of the AMIs must be Windows with the Havoc Circus service pre-loaded and pointing to your Havoc Circus manifest location, either locally on the server or in AWS.
   What do you want your assignments to do (e.g., set up a security vulnerability or break the environment in some way)?
2. CHOOSE DOMAIN
   Use the default awsvpcb.edu domain - This limits you to two ELB names (myfiu and rainforest), but other than that, there are no other limitations.
   OR
   Create your own domain and provide the following:
      A public CA cert to import into the student client machines, or use a public CA (e.g., Sectigo)
      Client & server VPN certs and private keys
      ELB certs and private keys
   All of these would need to be placed in the "secfiles" directory with the appropriate name (NOTE: the VPN server and client names are currently hardcoded, so you need to replace the files in the secfiles directory).
3. SETUP INSTRUCTOR AWS ACCOUNT (optional). This would be needed if you wanted to do any of the following:
   Host custom AMIs for the assignments that will not be publicly available
   House Havoc Circus assignment dependency files and manifest JSON files
   House AWSVPCB VPC and assignment JSON files
   If you wish to enable AWS logging, then you would need an IAM user with CloudWatch access
   If you wish to be able to gather diagnostic information, then the same IAM user needs access to be able to create a new S3 bucket
4. CREATE your VPC and assignment AWSVPCB JSON files.
5. CREATE Havoc Circus assignment JSON files.
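If you create your own domain, it can save a debugging cycle to confirm that each certificate and private key pair you drop into "secfiles" actually match and have not expired. Below is a minimal sketch using openssl, assuming the .crt/.pkey files are PEM-encoded and the keys are RSA (file names follow the rainforest example above; your ELB names will differ):

# Inspect the ELB certificate: the subject should match the ELB DNS name and the dates should be current
openssl x509 -in secfiles/rainforest.awsvpcb.edu.crt -noout -subject -dates

# Confirm the certificate and private key belong together (the two hashes must match)
openssl x509 -noout -modulus -in secfiles/rainforest.awsvpcb.edu.crt | openssl md5
openssl rsa -noout -modulus -in secfiles/rainforest.awsvpcb.edu.pkey | openssl md5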
