PaaS Under the Hood
Platform as a Service Under the Hood Episodes 1-5 dotcloud.com
INTRODUCTION
Building a Platform as a Service (PaaS) is rewarding work. We get to make the life of a developer easier. A PaaS helps developers deploy, scale, and manage their applications, without requiring them to become hardcore systems administrators themselves.
As with many problems, the toughest part about managing applications in the cloud is not building the PaaS itself. The challenge lies in being able to scale the applications. To give you a sense of the complexity: each minute, millions of HTTP requests are routed through the platform. Not only does our PaaS collect millions of metrics, we also aggregate, process, and analyze those metrics to look for abnormal patterns. Apps are constantly deployed and migrated on our platform.
For economies of scale, virtually all PaaS providers pack applications densely onto their physical machines. How does a PaaS provider solve the following issues?
• How is application isolation accomplished?
• How does the platform handle data isolation?
• How does the platform deal with resource contention?
• How does the platform deploy and run apps efficiently?
• How does the platform provide security and resiliency?
• How does the platform handle the load from the millions of HTTP requests?
One key element is lightweight virtualization: the use of virtual environments (called containers) to provide isolation characteristics comparable to full-blown virtual machines, but with much less overhead. In this area, the dotCloud platform relies on Linux Containers (LXC).
In the following five episodes, we will dive into some of the internals of the dotCloud platform, or more specifically, the Linux kernel features used by dotCloud.
Episode 1: Kernel Namespaces
Simplifying complexity takes a lot of work. At dotCloud, we take highly complex processes, such as deploying and scaling web applications in the cloud, and make them appear as simple workflows to developers and DevOps. How do we accomplish such a feat? In this eBook, we will show you how dotCloud works under the hood. We will expose the mechanics behind kernel-level virtualization and high-throughput network routing. We will cover other technologies, such as metrics collection and memory optimization, in later eBooks. A developer once said, “Diving into the inner workings of a PaaS is like going to Disneyland, you’ll uncover a world of wonder.”
Episode 1: Namespaces
Each time a new Linux Container (LXC) is created, the name of the container is filed under the /cgroup directory. For example, a new container named “sanfrancisco” is filed under the directory /cgroup/sanfrancisco. It is therefore easy to think that containers rely mainly on control groups. Although cgroups are useful to Linux Containers (we will cover cgroups more thoroughly in Episode 2), namespaces provide an even more vital function: they isolate the resources of processes. This isolation is the real magic behind Linux Containers! There are five namespaces, each covering a different resource: pid, net, ipc, mnt, and uts.
The pid namespace
The pid namespace is the most useful technology for basic isolation. Each pid namespace has its own process numbering. Different pid namespaces form a hierarchy, and the kernel keeps track of all the namespaces. A “parent” namespace can see and act upon its “child” namespaces. A “child” namespace cannot perform any action on its “parent”. The pid namespace follows a few principles:
• Each pid namespace has its own “PID 1” init-like process
• Processes residing in a namespace cannot affect processes residing in a parent or sibling namespace with system calls like kill or ptrace, because process IDs are only meaningful inside a given namespace
• If a pseudo-filesystem like proc is mounted by a process within a pid namespace, it will only show the processes belonging to that namespace
• Numbering is different in each namespace, which means that a process in a child namespace has multiple PIDs: one in its own namespace and a different PID in its parent namespace. The top-level pid namespace can see all processes running in all namespaces, under different PIDs. A process can have more than two PIDs if there are more than two levels of hierarchy in the namespaces.
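The numbering rules above can be illustrated with a small toy model. This is only a sketch in Python, not kernel code: the PidNamespace class and the process names are hypothetical, and the model only mimics how one process receives a separate PID at every level of the hierarchy.

```python
import itertools


class PidNamespace:
    """Toy model of a pid namespace (illustrative only, not a kernel API)."""

    def __init__(self, parent=None):
        self.parent = parent
        self._counter = itertools.count(1)  # every namespace numbers from 1
        self.processes = {}  # PID as seen from this namespace -> process name

    def spawn(self, name):
        """Start a process here; return its PID in this namespace and in each ancestor."""
        pids = []
        namespace = self
        while namespace is not None:
            pid = next(namespace._counter)
            namespace.processes[pid] = name
            pids.append(pid)
            namespace = namespace.parent
        return pids  # innermost namespace first, top-level namespace last


def demo():
    top = PidNamespace()
    top.spawn("host-init")  # PID 1 at the top level
    container = PidNamespace(parent=top)
    nested = PidNamespace(parent=container)
    container_init = container.spawn("container-init")
    nested_init = nested.spawn("nested-init")
    return container_init, nested_init, sorted(top.processes.values())
```

In this model, the init process of the nested namespace is PID 1 in its own namespace, yet it carries two more PIDs in the namespaces above it, and the top-level namespace lists every process.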
The net namespace
With the pid namespace, you can start processes in multiple isolated environments called “containers”. What if you need to run a separate instance of the Apache web server in each container? Generally, only one process can listen on port 80/TCP at a time. You could configure each Apache instance to listen on a different port, or you could use the net namespace, which has been designed for networking.
Each net namespace can have its own network interfaces. Even lo, the loopback interface supporting 127.0.0.1, is different in each net namespace. It is possible to create a pair of special interfaces that appear in two different net namespaces, allowing a net namespace to talk to the outside world. A typical container will have its own loopback interface (lo), as well as one end of such a special interface pair, generally named eth0. The other end of the pair will be in the “original” namespace, and will bear a poetic name like veth42xyz0. It is then possible to put those special interfaces together within an Ethernet bridge (to achieve switching between containers), or to route packets between them, etc. This is similar to the Xen networking model.
Each net namespace has its own local meaning for INADDR_ANY, a.k.a. 0.0.0.0. When your Apache process binds to *:80 (INADDR_ANY, port 80) within its namespace, it will only receive connections directed to the IP addresses and interfaces of that namespace. That allows you to run multiple Apache instances, each in its own pid and net namespace, with their default configuration listening on port 80, and each will remain individually addressable. Each net namespace also has its own routing table, and its own iptables chains and rules.
The ipc namespace
The ipc namespace won’t appeal to many of you, unless you passed UNIX 101 when engineering schools still taught classes on IPC (InterProcess Communication). IPC provides semaphores, message queues, and shared memory segments. While still supported by virtually every UNIX flavor, those features are considered by many as obsolete, superseded by POSIX semaphores, POSIX message queues, and mmap. Nonetheless, some programs, such as PostgreSQL, still use IPC.
What’s the connection with namespaces? Each IPC resource is accessed through a globally unique 32-bit ID. While IPC implements permissions on the resource itself, an application could be surprised if it failed to access a given resource because it has already been claimed by another process in a different container. The app doesn’t know anything about other containers! Meet the ipc namespace. Processes within a given ipc namespace cannot access (or even see) the IPC resources living in other ipc namespaces. You can now safely run a PostgreSQL instance in each container without fear of IPC key collisions.
The mnt namespace
chroot is a mechanism to sandbox a process (and its children) within a given directory. The mnt namespace takes the chroot concept even further. As its name implies, the mnt namespace deals with mount points. Processes living in different mnt namespaces can see different sets of mounted file systems and different root directories. If a file system is mounted in an mnt namespace, it will be accessible only to the processes within that namespace. It will not be visible to processes in other namespaces.
At first, this sounds useful, since the mnt namespace allows you to sandbox each container within its own directory, hidden from other containers. But is it really necessary? If each container is chroot’ed in a different directory, container C1 already won’t be able to access or see container C2’s file system, right? Relying on chroot alone has downsides, though. Inspecting /proc/mounts in a container will show the mount points of all containers. Also, those mount points will be relative to the original namespace, which can give away hints about the layout of your system. Seeing paths from the global namespace may also confuse applications that rely on the paths in /proc/mounts.
The mnt namespace makes the situation much cleaner, allowing each container to have its own mount points, and to see only those mount points, with their paths correctly translated to the actual root of the namespace.
The uts namespace
Finally, the uts namespace deals with one important detail: the hostname that is “seen” by a group of processes. Each uts namespace can hold a different hostname, and changing the hostname through the sethostname system call only affects processes running in the same namespace.
Creating namespaces
Namespace creation is achieved with the clone system call. This system call supports a number of flags, allowing you to specify whether the new process should run within its own pid, net, ipc, mnt, and uts namespaces. When creating a new container, the following steps take place. A new process is started in new namespaces. Its network interfaces, including the special pair of interfaces used to talk with the outside world, are configured. It then executes an init-like process.
When the last process within a namespace exits, the associated resources (IPC, network interfaces...) are automatically reclaimed. If, for some reason, you want those resources to survive after the termination of the last process of the namespace, you can use mount --bind to retain the namespace for future use, because each namespace is materialized by a special file in /proc/$PID/ns. Not all namespaces can be retained this way, however. Up to kernel 3.4, only the ipc, net, and uts namespaces appear there; the mnt and pid namespaces do not. This presents a problem that we will address in the next section.
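As a rough illustration of the clone flags mentioned above, the following Python sketch combines the flag values defined in the Linux header linux/sched.h and lists the namespace files visible for a process. The helper names namespace_flags and visible_namespaces are hypothetical, and the sketch does not create any namespace (doing so requires privileges); it only shows how the flags compose and where the namespace files live.

```python
import os

# Flag values as defined in the Linux header <linux/sched.h>.
CLONE_FLAGS = {
    "mnt": 0x00020000,  # CLONE_NEWNS
    "uts": 0x04000000,  # CLONE_NEWUTS
    "ipc": 0x08000000,  # CLONE_NEWIPC
    "pid": 0x20000000,  # CLONE_NEWPID
    "net": 0x40000000,  # CLONE_NEWNET
}


def namespace_flags(names):
    """Combine the clone flags for the requested kinds of namespaces."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def visible_namespaces(pid="self"):
    """Map each entry of /proc/<pid>/ns to its link target, if the directory exists."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}
    return {
        entry: os.readlink(os.path.join(ns_dir, entry))
        for entry in sorted(os.listdir(ns_dir))
    }
```

For example, namespace_flags(["pid", "net"]) gives the value to pass to clone for a process that needs its own pid and net namespaces while sharing everything else with its parent. Which entries visible_namespaces returns depends on the kernel version, as explained above.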
Attaching to Existing Namespaces
It is also possible to “enter” a namespace, by attaching a process to an existing namespace. Here are some use cases for attaching to existing namespaces:
• Setting up network interfaces “from the outside”, without relying on scripts inside the container
• Running arbitrary commands to retrieve information about the container (for example, executing netstat)
• Obtaining a shell within a container
Attaching a process to existing namespaces requires two things:
• The setns system call (which exists only since kernel 3.0, or with patches for older kernels)
• The namespace must appear in /proc/$PID/ns
We mentioned earlier that only the ipc, net, and uts namespaces appear in /proc/$PID/ns, and that the mnt and pid namespaces do not. Only a patched kernel will allow you to attach to existing mnt and pid namespaces. Combining the necessary patches can be fairly tricky, because it involves resolving conflicts between AUFS and GRSEC. AUFS and GRSEC will be covered in Episodes 3 and 4 respectively. To avoid running an overly patched kernel, there are three suggested workarounds.
• You can run sshd in your containers, and pre-authorize a special SSH key to execute your commands. This is one of the easiest solutions to implement. But if sshd crashes, or is stopped (either intentionally or by accident), you may be locked out of the container. Also, if you want to squeeze the memory footprint of your containers as much as possible, you might want to get rid of sshd. If the latter is your main concern, you can run a low-profile SSH server like dropbear, or start the SSH service from inetd or a similar service.
• If you want something simpler than SSH (or something different from SSH, to avoid interference with custom sshd configurations), you can open a backdoor. An example would be to run socat TCP-LISTEN:222,fork,reuseaddr EXEC:/bin/bash,stderr from init in your containers. Make sure that port 222/tcp is properly firewalled.
• An even better solution is to embed this “control channel” within your init process. Before changing its root directory, the init process could set up a UNIX socket on a path located outside the container root directory. When it changes its root directory, it will retain its open file descriptors, and therefore the control socket.
How dotCloud uses namespaces
In previous releases, the dotCloud platform used vanilla LXC (Linux Containers), which made implicit use of namespaces. From the beginning, we deployed kernel patches that allowed us to attach arbitrary processes to existing namespaces. We found this approach to be the most convenient and reliable way to deploy, control, and orchestrate containers. As the dotCloud platform evolved, we stripped down the vanilla LXC containers, but we still use namespaces to isolate applications from each other.
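The third workaround described earlier (a control channel embedded in the init process) can be sketched as follows. This is a minimal illustration in Python under several assumptions: the function names and the one-command-per-connection protocol are hypothetical, the chroot step is only indicated by a comment because it requires root privileges, and nothing here reflects how dotCloud actually implements its control channel.

```python
import os
import socket
import subprocess
import tempfile
import threading


def open_control_socket(path):
    """Create the listening UNIX socket; an init would do this before changing root."""
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(1)
    return listener


def serve_one_command(listener):
    """Accept one connection, run the command it sends, and reply with the output."""
    conn, _ = listener.accept()
    with conn:
        command = conn.recv(4096).decode()
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        conn.sendall((result.stdout + result.stderr).encode())


def send_command(path, command):
    """Client side: connect from outside the container and read the whole reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(command.encode())
        client.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = client.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()


def demo():
    with tempfile.TemporaryDirectory() as outside_root:
        path = os.path.join(outside_root, "control.sock")
        listener = open_control_socket(path)
        # A real init would call os.chroot() here; the open socket would survive it.
        server = threading.Thread(target=serve_one_command, args=(listener,))
        server.start()
        output = send_command(path, "echo hello")
        server.join()
        listener.close()
        return output
```

Like the socat backdoor, such a channel executes whatever it receives, so access to the socket path has to be restricted, for instance with file system permissions on the directory that holds it.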
Episode 2: cgroups
Control groups, or “cgroups”, are a set of mechanisms to measure and limit resource usage for groups of processes. Conceptually, they work somewhat like the ulimit shell command or the setrlimit system call. However, ulimit and setrlimit set resource limits for a single process, whereas cgroups allow you to set resource limits for groups of processes.
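For comparison, here is what the per-process mechanism looks like from Python, whose standard resource module wraps getrlimit and setrlimit. The helper name lower_open_files_limit is hypothetical and the choice of RLIMIT_NOFILE is only an example; the point of this sketch is that the limit applies to the calling process, not to a group.

```python
import resource


def lower_open_files_limit(new_soft):
    """Lower this process's soft limit on open files; return the resulting (soft, hard) pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY:
        new_soft = min(new_soft, soft)  # only ever lower the limit, never raise it
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

A limit set this way is attached to one process at a time, whereas a cgroup limit is shared by every process placed in the group.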
Pseudo-FS Interface
The easiest way to manipulate control groups is through the cgroup file system. Assuming that it has been mounted on /cgroup, creating a new group named polkadot is as easy as mkdir /cgroup/polkadot. When you create this (pseudo) directory, it instantly gets populated with many (pseudo) files to manipulate the control group. You can then move one (or many) processes into the control group by writing their PID to the right control file, for example, echo 4242 > /cgroup/polkadot/tasks.
When a process is created, it will be in the same group as its parent. If the init process of a container has been placed in a control group, all the processes of the container will also be in the same control group.
Destroying a control group is as easy as rmdir /cgroup/polkadot. However, the processes within the cgroup have to be moved to other groups first. Otherwise, rmdir will fail, just like when trying to remove a non-empty directory.
Technically, control groups are split into many subsystems. Each subsystem is responsible for a set of files in /cgroup/polkadot, and the file names are prefixed with the subsystem name. For instance, the files cpuacct.stat, cpuacct.usage, and cpuacct.usage_percpu are the interface for the cpuacct subsystem. The available subsystems are detailed in the next section.
The subsystems can be used together or independently. In other words, you can decide that each control group will have limits and counters for all the subsystems. Alternatively, each subsystem can have different control groups. To illustrate the latter case, a process can be in the polkadot control group for memory control and in the bluesuedeshoe control group for CPU control; polkadot and bluesuedeshoe then live in completely separate hierarchies.
What can be Controlled?
Many things! We’ll highlight the ones we think are the most useful.
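Before looking at individual subsystems, the pseudo-filesystem operations described above can be sketched in Python. The function names are hypothetical, and the demo runs against a temporary directory standing in for /cgroup, so it only mimics the file operations: on a real cgroup mount, the kernel creates the tasks file itself and moves the process when a PID is written to it.

```python
import os
import tempfile


def create_group(root, name):
    """Create a control group; on a real cgroup mount the kernel populates the directory."""
    path = os.path.join(root, name)
    os.mkdir(path)
    return path


def add_task(group_path, pid):
    """Move a process into the group by writing its PID to the tasks file."""
    with open(os.path.join(group_path, "tasks"), "w") as tasks:
        tasks.write(f"{pid}\n")


def destroy_group(group_path):
    """Remove the group; like rmdir, this fails if the group is not empty."""
    os.rmdir(group_path)


def demo():
    with tempfile.TemporaryDirectory() as root:  # stand-in for /cgroup
        group = create_group(root, "polkadot")
        add_task(group, 4242)
        with open(os.path.join(group, "tasks")) as tasks:
            return tasks.read().split()
```

On a machine where the cgroup file system is mounted on /cgroup, the same helpers would be called with root="/cgroup" and would need the appropriate privileges.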
Memory
You can limit the amount of RAM and swap space that can be used by a group of processes. The memory controller accounts for the memory used by the processes for their private use (their Resident Set Size, or RSS), but also for the memory used for caching purposes. This is actually quite powerful, because traditional tools such as ps or analysis of /proc do not have a way to identify the cache memory usage incurred by specific processes. This can make a big difference, for instance, with databases. A database typically consumes very little memory for processing but a large chunk of cache (complex queries would consume a lot of memory, but we assume here that we are not running any). To perform optimally, your whole database (or at least your “active set”, the data that you refer to most often) should fit into memory.
Setting a memory limit for the processes inside a cgroup is as easy as echo 1000000000 > /cgroup/polkadot/memory.limit_in_bytes (the value will be rounded to a multiple of the page size). To check the current usage for a cgroup, inspect the pseudo-file memory.usage_in_bytes in the cgroup directory. You can gather very detailed (and very useful) information from memory.stat.
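As a sketch of how the memory.stat information might be consumed, the Python snippet below parses the "name value" lines of such a file and computes how much of a group's memory is cache rather than RSS. The helper names are hypothetical and the sample figures are made up for illustration (a database-like profile with a small RSS and a large cache); they are not measurements.

```python
def parse_memory_stat(text):
    """Parse the 'name value' lines of a memory.stat pseudo-file into a dict."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        stats[key] = int(value)
    return stats


def cache_share(stats):
    """Fraction of the accounted memory that is cache rather than RSS."""
    return stats["cache"] / (stats["cache"] + stats["rss"])


# Illustrative values only: 700 MB of cache and 50 MB of RSS.
SAMPLE_MEMORY_STAT = """cache 734003200
rss 52428800
swap 0
"""
```

With these sample values, more than 90% of the memory charged to the group is cache, which is exactly the kind of usage that ps alone would not attribute to the database processes.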
CPU
You might already be familiar with scheduler priorities, and with the nice and renice commands. Once again, control groups let you define the amount of CPU that should be shared by a group of processes, instead of by a single process. You can give each cgroup a relative number of CPU shares, and the kernel will make sure that each group of processes gets access to the CPU in proportion to the number of shares you gave it. Setting the number of shares is as simple as echo 250 > /cgroup/polkadot/cpu.shares. Remember that those shares are just relative numbers. If you multiply everyone’s shares by 10, the end result will be exactly the same. This control group also gives statistics in cpu.stat.
CPU Sets
This is different from the cpu controller. In systems with multiple CPUs (i.e., the vast majority of servers, desktop and laptop computers, and even phones today!), the cpuset control group lets you define which processes can use which CPU. This can be useful to reserve a full CPU for a given process or group of processes. Those processes will receive a fixed amount of CPU cycles, and they might also run faster because there will be less thrashing at the level of the CPU cache.
On systems with Non-Uniform Memory Access (NUMA), the memory is split into multiple banks, and each bank is tied to a specific CPU (or set of CPUs in a multi-core system). There is a penalty for accessing RAM that is tied to another CPU, so you can use a cpuset to bind a process and its memory to a specific CPU and avoid that penalty. This works for a group of processes or a group of CPUs, too.
Block I/O
The blkio controller provides a lot of information about the disk accesses (or, technically, block device requests) performed by a group of processes.
This is very useful, because I/O resources are much harder to share than CPU or RAM. A system has a given, known, and fixed amount of RAM. It has a fixed number of CPU cycles every second. Even on systems where the number of CPU cycles can change (such as tickless systems, or virtual machines), this is not an issue, because the kernel slices CPU time into shares of, e.g., 1 millisecond, and there is a given, known, and fixed number of milliseconds in every second. I/O bandwidth, however, can be unpredictable, or rather, the predictions aren’t very useful.
A hard disk with a 10 ms average seek time will be able to process about 100 random requests of 4 kB per second; but if the requests are sequential, typical desktop hard drives can easily sustain 80 MB/s transfer rates, which means 20,000 requests of 4 kB per second. The average throughput (measured in IOPS, I/O Operations Per Second) will be somewhere between those two extremes. But as soon as the application performs a task that requires a lot of scattered, random I/O operations, the performance will drop dramatically. The system can give you some guaranteed performance, but this guaranteed performance is so low that it is not helpful. That is exactly the problem with AWS EBS, by the way. It’s like a highway that guarantees you will be able to go above a given speed, except that this speed is 5 mph. Not very helpful in practice, is it?
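The arithmetic behind the two extremes quoted above can be written down in a few lines of Python. The function names are hypothetical, and the model is deliberately simplistic: it assumes that every random request pays one full average seek and that 1 MB equals 1000 kB.

```python
def random_iops(avg_seek_ms):
    """Requests per second when every request pays one average seek."""
    return 1000 / avg_seek_ms


def sequential_iops(transfer_mb_per_s, request_kb):
    """Requests per second when the disk streams at its sequential transfer rate."""
    return transfer_mb_per_s * 1000 / request_kb  # 1 MB taken as 1000 kB
```

With the figures from the text, random_iops(10) gives 100 requests per second and sequential_iops(80, 4) gives 20,000, a factor of 200 between the guaranteed floor and the best case.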
That’s why SSD storage is becoming increasingly popular. An SSD has virtually no seek time, and can therefore sustain random I/O as fast as sequential I/O. The available throughput is therefore predictably good under almost any load. There are, however, some workloads that can cause problems. For instance, writing and rewriting a whole disk will cause performance to drop dramatically. This is because read and write operations are fast, but erase, which must be performed at some point before a write, is slow. An example would be using an SSD to manage video on demand for hundreds of HD channels simultaneously. The disk will sustain the write throughput until it has written every block once. When it needs to erase, the performance will drop below acceptable levels.
Going back to dotCloud, what’s the purpose of the blkio controller in a PaaS environment? The blkio controller metrics help detect applications that are putting an excessive strain on the I/O subsystem. The controller also lets you set limits, which can be expressed in number of operations and/or bytes per second, with different limits for read and write operations. This allows you to set thresholds so that no single app can significantly degrade performance for other apps. Furthermore, once an I/O-intensive app has been identified, its quota can be adapted to reduce its impact on other apps.
It’s Not Only for Containers
As we mentioned, cgroups are convenient for containers, since it is very easy to map each container to a cgroup. But there are many other uses for cgroups. The systemd service manager is able to put each service in a different cgroup. This allows you to keep track of all the subprocesses started by a given service, even when they use the “double-fork” technique to detach from their parent and re-attach to init. It also allows fine-grained tracking and control of the resources used by each service.
It is also possible to run a system-wide daemon to automatically classify processes into cgroups. This can be particularly useful on multi-user systems, to limit and/or meter the resources of each user appropriately, or to run specific programs in a special cgroup when you know that those programs are prone to excessive resource use.
dotCloud & Control Groups
Thanks to cgroups, we can meter very accurately the resource usage of each container, and therefore of each unit of each service for each application. Our metrics collection system uses collectd, along with our in-house lxc plugin. Metrics are streamed to a custom storage cluster, and can be queried and streamed by the rest of the platform using our ZeroRPC protocol. We will be writing a more in-depth article on our metrics collection system in the future.
We also use cgroups to allocate resource quotas for each container. For instance, when you use vertical scaling on dotCloud, you are actually setting limits for memory, swap usage, and CPU shares.
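To make the CPU shares part concrete, here is a small Python sketch of how relative shares translate into fractions of CPU time. The function name and the share values are hypothetical, and the model assumes that every group is busy at the same time; an idle group's share is simply available to the others.

```python
def cpu_fractions(shares):
    """Fraction of CPU time each group gets when all groups compete for the CPU."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


def demo():
    shares = {"polkadot": 250, "bluesuedeshoe": 750}
    scaled = {name: value * 10 for name, value in shares.items()}
    return cpu_fractions(shares), cpu_fractions(scaled)
```

Both calls in demo return the same fractions (a quarter of the CPU for polkadot, three quarters for bluesuedeshoe), which illustrates that shares are purely relative numbers.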
Episode 3: AUFS
AUFS (which initially stood for “Another Union File System”) provides fast provisioning while retaining full flexibility and ensuring disk and memory savings.
AUFS is a union file system, which merges two directory hierarchies together. On the dotCloud platform, we use AUFS to combine a large, read-only file system containing a ready-to-run system image with a writable layer on top of it. The resulting file system looks like the large read-only one, except that you can now write anywhere on it, and only the changed files are stored. LiveCDs and bootable USB sticks are common examples of this use case. AUFS allows us to have a common base image for all applications and a separate read-write layer, unique to each app.
Storage Savings
Let’s assume that the base image takes up 1 GB of disk space. In reality, it is more than that, since we’re talking about a full server file system, containing everything a dotCloud app could potentially need, such as Python, Ruby, Perl, Java, a C compiler and libraries, and so on. If the entire image had to be cloned each time a dotCloud application is deployed, each new deployment would use 1 GB of disk space. With AUFS, a new deployment typically uses less than 1 MB of disk space, which lets us save on storage costs.
Faster Deployments
Copying the whole base image would not only use up precious disk space, it would also take time: up to a minute or so, depending on the disk speed. The copy would also put a significant I/O load on the disk. On the other hand, creating a new “pseudo-image” using AUFS takes a fraction of a second, and virtually no I/O at all. AUFS is therefore a much better solution than copying an entire image every time.
Better Memory Usage
Virtually all operating systems use a feature called the buffer cache to make disk access faster. Without it, your system could run 10x, 100x, or even 1000x slower, because it would have to access the disk even to run simple commands, for example, when listing your files with ls! As we will see, AUFS also lets us rake in big savings on this buffer cache.
Every single application loads from disk a number of common files and components, such as the libc standard library and the /bin/sh standard shell, as well as a lot of common infrastructure like crond, sshd, and the local Mail Transfer Agent, just to name a few. Additionally, all applications of the same type load the same files. For example, all Python applications load the Python interpreter. If each app were running from its own copy of the image, identical copies of those common files would be present multiple times in memory, within the buffer cache. With AUFS, those common files are in the base image, so the Linux kernel knows to load them only once in memory. This typically saves tens of MB for each app.
Easier Upgrades
If you are familiar with storage technology, you might argue that snapshots and copy-on-write devices already have the advantages mentioned above. That’s true. However, with those systems, it is not possible to update the base image and have the changes reflected in the lightweight “clones” (the snapshots). AUFS, on the other hand, lets you do whatever you want with the base image. The changes will be immediately visible in the AUFS mount points using the base image. This makes it easy to do software upgrades, even while the applications are running, just like in a typical single-server environment, except that you can upgrade thousands of servers all at once.
Allows Arbitrary Changes
All those things can also be done without AUFS. For a decade, skilled UNIX systems administrators have been deploying machines (workstations, X terminals, servers...) with a read-only root file system, allowing read-write access through ad hoc mount points. After all, with some clever configuration and tuning, you don’t need to write anywhere except places like /tmp, /var/run, /var/lock, and of course /home. The latter can be a traditional read-write file system, and the former can even use tmpfs mounts.
Use Cases for AUFS
Because it allows for arbitrary changes to the file system, AUFS offers many advantages. Let’s suppose you need an extra package, or maybe you want to upgrade the version of Python or Ruby. On a system without AUFS, one with only a shared read-only root file system and distinct writable mount points, you have two alternatives:
• Upgrade the read-only base image (and potentially affect all other users of the image)
• Install whatever you need into a specific writable mount point like /home, /tmp, or equivalent. That means a manual install, potentially introducing side effects or conflicts with previously installed versions
With AUFS, since the entire “read-only” root file system is still writable, you can just apt-get install whatever you need. The read-only base file system won’t be affected, because all the changes will be written onto your own private layer.
Other Union File Systems
In addition to AUFS, we considered many file systems with properties similar to those outlined above. We opted for AUFS because, for what we need to do at dotCloud, we believed that it was the most mature and stable solution at the time of our evaluation.
Caveats
However, technology is constantly evolving, and no solution is ever a perfect match for our changing requirements. We are currently using AUFS 3. When we were using AUFS 2, it had significant issues, notably with mmap. However, the other union file systems performed even worse on that specific issue. We worked around those issues by mounting read-write volumes at strategic places, such as the data directories of MySQL, PostgreSQL, MongoDB, and Redis, and the home directory where the application code is executed. This gave us the required stability, and we were able to leverage the flexibility provided by AUFS without the downside.
AUFS at dotCloud
Technically, the main feature that benefits from AUFS is our custom package installation system. If you need a particular library that is not included in our base image but exists in the Ubuntu package repository, installing it in your service is a breeze: use the systempackages option in your dotcloud.yml file. AUFS allows the package to be installed into your service without ever touching the base image used by other applications.
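The layering behaviour described in this episode can be summarized with a toy model. The Python sketch below is not AUFS (which works on directory trees inside the kernel): the UnionMount class, the paths, and the file contents are all hypothetical, dictionaries stand in for file systems, and file deletion is left out.

```python
class UnionMount:
    """Toy union of a shared read-only base and a private writable layer."""

    def __init__(self, base):
        self.base = base  # shared dict: path -> content, never written through a mount
        self.layer = {}   # private changes made through this mount

    def exists(self, path):
        return path in self.layer or path in self.base

    def read(self, path):
        """Look in the private layer first, then fall back to the base image."""
        if path in self.layer:
            return self.layer[path]
        return self.base[path]

    def write(self, path, content):
        """All writes land in the private layer; the base is left untouched."""
        self.layer[path] = content


def demo():
    base = {"/usr/bin/python": "interpreter v1"}
    app1, app2 = UnionMount(base), UnionMount(base)
    app1.write("/usr/lib/libfoo.so", "libfoo")   # an extra package for app1 only
    base["/usr/bin/python"] = "interpreter v2"   # upgrade of the shared base image
    return app1, app2, base
```

In this model, the package installed by app1 is invisible to app2 and to the base image, while the upgrade of the base image is immediately visible through both mounts.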
Episode 4: GRSEC
GRSEC is a security patch for the Linux kernel. Security features in GRSEC help detect and deter malicious code.
GRSEC is a fairly large patch for the Linux kernel, providing strong security features that prevent many kinds of attacks (or “exploits”) and detect suspicious activity, such as people probing for new exploits and/or known system vulnerabilities. There are many features in GRSEC, so our goal is to provide an overview of the features relevant to dotCloud.
Randomize Address Space
Many exploits rely on the fact that the base address of the heap or the stack is always the same. Consider the following example, a classic scenario for an attack on a remote service:
• A bug is found in the service. Some index is not checked properly, and can be used to alter the stack and cause a jump to an arbitrary address (when a function returns)
• The stack is altered to introduce some malicious code
• A pointer to this malicious code is placed on the stack as well
• The bug is triggered. The service jumps to the malicious code and executes it
If the address space is randomized, it is much more difficult for an attacker to exploit the system: the attacker has to locate the malicious code in memory before being able to jump to it.
Prevent Execution of Arbitrary Code
There are two steps to make sure that arbitrary code can’t make it inside a running program. First, program code must be loaded in an area that is marked by the memory management unit as read-only. This prevents code from modifying itself. Self-modifying code is sometimes referred to as polymorphic code. There are legitimate use cases for polymorphic code. However, it is more often associated with dubious intentions. Second, the heap and the stack must be marked as non-executable. After all, they’re supposed to contain data structures, function parameters, and return addresses, but no opcodes.
On architectures that support it, the heap and the stack regions should be marked as non-executable at the hardware level, effectively preventing accidental or intentional execution of code located there. At this point, no memory is both executable and writable.

We mentioned that there are some legitimate uses for memory regions with both write and exec permissions. When does that happen, and what can be done about it? The most common case is on-the-fly code generation for optimization purposes, as done by Java and other runtimes with a JIT (Just-In-Time) compiler. The good news is that GRSEC lets you flag specific executables and allow them to write to their code region or execute their data region. This reduces the security of those specific processes, but the risk is contained: to exploit them, an attacker needs a bug in e.g. the JVM itself, not in your program. Bugs in the JVM are likely to be found and fixed much faster than bugs in your own program. This is not a comment about the quality of anyone’s code. It’s more about the number of users in the Java community and their scrutiny of the JVM.
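To make the “no memory is both executable and writable” property more tangible, here is a minimal Python sketch that scans a process memory map for regions carrying both the w and x permission bits. It assumes the text format Linux exposes in /proc/<pid>/maps; the sample map is made up for illustration (the anonymous rwx region stands in for what a JIT-enabled process might show), and the helper names are hypothetical. It is not part of GRSEC or of dotCloud’s tooling.

```python
import re
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str  # e.g. "r-xp"
    path: str   # backing file, "[heap]", "[stack]", or "" for anonymous memory


# One line of /proc/<pid>/maps: "start-end perms offset dev inode [path]"
_MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+\S+\s+\S+\s+\S+\s*(.*)$"
)


def parse_maps(text: str) -> List[Mapping]:
    """Parse the text of a maps file into Mapping records."""
    mappings = []
    for line in text.splitlines():
        match = _MAPS_LINE.match(line.strip())
        if match:
            start, end, perms, path = match.groups()
            mappings.append(Mapping(int(start, 16), int(end, 16), perms, path))
    return mappings


def writable_and_executable(mappings: List[Mapping]) -> List[Mapping]:
    """Return the regions that are both writable and executable."""
    return [m for m in mappings if "w" in m.perms and "x" in m.perms]


# Made-up sample, for illustration only.
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /usr/bin/service
0060a000-0060b000 rw-p 0000a000 08:01 1234 /usr/bin/service
01c3e000-01c5f000 rw-p 00000000 00:00 0 [heap]
7f2c4c000000-7f2c4c021000 rwxp 00000000 00:00 0
7ffd1a2b3000-7ffd1a2d4000 rw-p 00000000 00:00 0 [stack]
"""

if __name__ == "__main__":
    for m in writable_and_executable(parse_maps(SAMPLE_MAPS)):
        label = m.path or "(anonymous)"
        print(f"W+X region {m.start:x}-{m.end:x} {m.perms} {label}")
```

On a real Linux system, the same functions could be fed the contents of /proc/self/maps instead of the sample text.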
Audit Suspicious Activity

Another interesting security feature of GRSEC is the ability to log specific events. For instance, it can record in the kernel log each time a process is terminated by SIGSEGV (a.k.a. segmentation fault). What’s the point? Potential attackers will likely run a number of known exploits in an attempt to gain escalated privileges. Many of those exploits will hopefully fail. Often, the failure results in the process causing a segmentation violation and being killed by SIGSEGV. Any C programmer will tell you that there are legitimate cases where programs are terminated by SIGSEGV. But if the system detects many different programs, started by the same user, all being killed in the same way, it is a telltale sign that someone is trying to break into the system.

If you’re not familiar with those concepts, think of scratches around a padlock. A few scratches on the surface don’t mean anything. But if the padlock is full of dents, you can bet that someone is trying to pick it!

GRSEC logs many other similar events. The kernel logs can then be analyzed in real time to detect suspicious patterns. This allows you to lock out malicious users or, alternatively, monitor them closely to see what they’re doing. GRSEC is also useful for forensics if someone does successfully breach the system: the GRSEC logs record how the system was exploited, which is valuable to whoever is trying to close the security gap.

Compile-time Security Features

GRSEC also plays its part during kernel compilation. It enables a compiler plugin which “constifies” some kernel structures: it automatically adds the const keyword to all structures containing only function pointers (unless they carry a special “non const” marker to opt out of the process).
In other words, instead of being mutable by default unless marked const, function tables are now const by default, unless specified otherwise. Accordingly, attempts to modify function tables are detected at compile time. The rationale is to make sure that any code that manipulates a function table is closely audited before the function table is marked “non const”. Why the emphasis on function tables? Because if they can be altered, they are a convenient way for a potential attacker to jump to arbitrary code (recall the technique explained at the beginning of Episode 4). Marking those data structures as const helps at compile time, but also later when the kernel is running, because those data structures are laid out in a memory region that the memory management unit makes read-only. This not only reduces exposure to attacks, but can also make it harder for successful attackers to cover their tracks by hijacking existing function tables.

...And Many More

As mentioned at the start of this episode, this is just a quick overview. If you want to learn about other features, you can check GRSEC’s website. If you want to quench your thirst for technical details, you can follow these four steps to get a full listing of all the GRSEC features and a description of each one:
• Get the kernel sources
• Apply the GRSEC patch set
• Run make menuconfig
• Navigate to the compilation options related to GRSEC

Almost every feature of GRSEC can be enabled or disabled at compilation time, and is therefore listed there. The help text provided with each compilation option is fairly informative.
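Once a kernel has been configured this way, the choices end up in the kernel .config file, so they can also be listed without opening the menu. The Python sketch below filters such a file for security-related options. The .config line syntax (“CONFIG_X=y” and “# CONFIG_X is not set”) is standard; the CONFIG_GRKERNSEC and CONFIG_PAX prefixes and the option names in the sample are assumptions about how the patch names its options, so check them against your own patched tree.

```python
import re
from typing import Dict, Tuple

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def security_options(
    config_text: str,
    prefixes: Tuple[str, ...] = ("CONFIG_GRKERNSEC", "CONFIG_PAX"),  # assumed prefixes
) -> Dict[str, str]:
    """Map each matching option to its value ("n" when explicitly not set)."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            name, value = match.groups()
        else:
            match = _UNSET.match(line)
            if not match:
                continue  # comment or blank line
            name, value = match.group(1), "n"
        if name.startswith(prefixes):
            options[name] = value
    return options


# Illustrative sample only; the option names are assumptions.
SAMPLE_CONFIG = """\
CONFIG_X86_64=y
CONFIG_GRKERNSEC=y
CONFIG_GRKERNSEC_SIGNAL=y
CONFIG_PAX_ASLR=y
# CONFIG_PAX_MPROTECT is not set
CONFIG_LOCALVERSION="-example"
"""

if __name__ == "__main__":
    for name, value in sorted(security_options(SAMPLE_CONFIG).items()):
        print(f"{name} = {value}")
```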
In addition to GRSEC, dotCloud has built in additional layers of security. Each service runs in its own container. The benefits of container isolation were explained in Episode 2 on cgroups. We do not allow dotCloud users to have root access. “No root access” means that users cannot SSH as root, cannot log in as root, and cannot get a root shell through sudo. All processes run under a regular, non-privileged UID. Furthermore, SUID binaries are restricted to a set of well-known, well-audited programs, like ping. Each of those security layers is strong. We believe that combining them provides a more than adequate level of security for massively scaled, multi-tenant platforms.
Episode 5: Distributed Routing
The dotCloud platform is powered by hundreds of servers, some of them running more than one thousand containers. The majority of these containers are HTTP servers and they handle millions of HTTP requests every day to power the applications hosted on our platform.
All the HTTP traffic is bound to a group of special machines called the “gateways”. The gateways parse HTTP requests and route them to the appropriate backends. When there are multiple backends for a single service, the gateways also handle load balancing and failover. Last but not least, the gateways also forward HTTP logs to be processed by the metrics cluster.

[Figure: HTTP Routing Layer. Labels: visitors (technically HTTP clients); HTTP routing layer (load balancers); dotCloud platform; dotCloud app cluster.]
This “HTTP routing layer”, as we call it, runs on an elastic number of dedicated machines. When the load is low, 3 machines are enough to deal with the traffic. When spikes or DoS attacks happen, we scale up to 6, 10, or even more machines to maintain availability and keep latency low. All HTTP requests are bound to this routing layer, which is a cluster of identical HTTP load balancers. Each time we create, update (e.g. scale), or delete an application on dotCloud, the configuration of those load balancers has to be updated. The “master source” for all the configuration is stored in a Riak cluster, working in tandem with a Redis cache. The configuration is modified using basic commands:
• Create an HTTP entry
• Add/remove a frontend (virtual host)
• Add/remove a backend (container)

The commands are passed through a ZeroRPC API. Each update done through the API propagates through the platform; the next sections describe the mechanisms used.

Version 1: Nginx + ZeroRPC

As you probably know, a start-up must be lean, agile, and many other things. It also needs to be pragmatic: the right solution is not always the ideal one, but the one that lets you ship on time. That’s why the first iteration of our routing layer had some shortcomings, as we will see. But it functioned properly while supporting tens of thousands of apps.

Nginx powered the first version of dotCloud’s routing layer. Each modification to an app caused the central “vhosts” service to push the full configuration to all the load balancers, using ZeroRPC. Obviously, as the number of apps grew, the size of the configuration grew as well. Sending differential updates would have been more efficient. But at least, when a load balancer lost a few configuration messages, there was no special case to handle: the next update would contain the full configuration and provide all the necessary information. The configuration was transmitted in a compact, compressed format.
Then, each load balancer would transform this abstract configuration into an Nginx configuration file and tell Nginx to reload it. Nginx is well designed: while loading a new configuration, it keeps serving requests with the old one, which means that no HTTP request is lost during a configuration update.
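As a rough illustration of this first design, the Python sketch below keeps an abstract routing table that supports the basic commands listed earlier (create an entry, add/remove a frontend, add/remove a backend) and regenerates the complete Nginx configuration text from it. The class and method names are hypothetical and the generated configuration is deliberately minimal; this is not dotCloud’s actual “vhosts” service, only a sketch of the “full configuration on every change” approach.

```python
from typing import Dict, List


class RouteTable:
    """In-memory stand-in for the abstract routing configuration (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, List[str]]] = {}

    def create_entry(self, name: str) -> None:
        self._entries.setdefault(name, {"frontends": [], "backends": []})

    def add_frontend(self, name: str, vhost: str) -> None:
        self._entries[name]["frontends"].append(vhost)

    def remove_frontend(self, name: str, vhost: str) -> None:
        self._entries[name]["frontends"].remove(vhost)

    def add_backend(self, name: str, address: str) -> None:
        self._entries[name]["backends"].append(address)

    def remove_backend(self, name: str, address: str) -> None:
        self._entries[name]["backends"].remove(address)

    def render_nginx(self) -> str:
        """Regenerate the whole configuration, as Version 1 did on every change."""
        blocks = []
        for name in sorted(self._entries):
            entry = self._entries[name]
            if not entry["frontends"] or not entry["backends"]:
                continue  # nothing routable yet
            servers = "\n".join(f"    server {addr};" for addr in entry["backends"])
            blocks.append(f"upstream {name} {{\n{servers}\n}}")
            blocks.append(
                "server {\n"
                "    listen 80;\n"
                f"    server_name {' '.join(entry['frontends'])};\n"
                "    location / {\n"
                f"        proxy_pass http://{name};\n"
                "    }\n"
                "}"
            )
        return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    table = RouteTable()
    table.create_entry("myapp")
    table.add_frontend("myapp", "www.example.com")
    table.add_backend("myapp", "10.0.0.1:8080")
    table.add_backend("myapp", "10.0.0.2:8080")
    print(table.render_nginx())
```

Because render_nginx() walks every entry, its cost grows with the number of apps, which is the scaling concern this design runs into.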
Nginx also handles load balancing and failover well. When a backend server dies, Nginx detects it, removes it from the pool, periodically tries it again, and re-adds it to the pool once it has recovered.

This setup had two issues:
• Nginx did not support the WebSocket protocol, which was one of the top features requested by our users at that time
• Nginx had no support for dynamic reconfiguration, which means that each configuration update required the whole configuration file to be regenerated and reloaded

At some point, the load balancers started to spend a significant amount of CPU time reloading Nginx configurations. There was no significant impact on running applications, but it required deploying more and more powerful instances as the number of apps increased. Although Nginx was still fast and efficient, we had to find a more dynamic alternative.

Version 2: Node.js + Redis + WebSocket = Hipache

We spent some time looking at several languages and technologies to solve this issue. We needed the following features:
• Ability to add, update, and remove virtual hosts dynamically, at a very low cost
• Support for the WebSocket protocol
• Great flexibility and control over the routed requests: we want to be able to trigger actions, log events, etc., during different steps of the routing

After looking around, we finally decided to implement our own proxy solution. We did not implement everything from scratch: we based our proxy on the node-http-proxy library developed by NodeJitsu, which included everything needed to route a request efficiently with the appropriate level of control. The new routing layer would therefore be written in JavaScript, using Node.js and NodeJitsu’s library. We added several features:
• Use of multi-core machines, by spreading the load across multiple workers
• Ability to store HTTP routes in Redis, allowing live configuration updates
• Passive health-checking (when a backend is detected as being down, it is removed from the rotation)
• Efficient logging of requests
• Memory footprint monitoring: if a leak causes the memory usage of a worker to go beyond a given threshold, the worker is gracefully recycled
• Independence from other dotCloud technologies (like ZeroRPC), to make the proxy fully re-usable by third parties (the code being, obviously, open source)

After several months of engineering and intensive testing, we released the source code of Hipache: our new distributed proxy solution!

Behind the scenes, integrating Hipache into the dotCloud platform was very straightforward, thanks to our service-oriented architecture. We simply wrote a new adapter which consumed virtual host configurations from the existing ZeroRPC service, and used them to update Hipache’s configuration in Redis. No refactoring or modification of the platform was necessary.

Here’s a side comment about dynamic configuration and latency. Storing the configuration in an external system (like Redis) means that you have to make the following trade-off:
• You can look up the configuration at each request, but this requires a round-trip to Redis each time, which adds latency
• You can cache the configuration locally, but you will have to wait a bit for your changes to take effect, or implement a complex cache-busting mechanism

We implemented a cache mechanism to avoid hitting Redis at each request. It turned out to be unnecessary, because requests to a local Redis are very, very fast. The difference between direct lookups and cached lookups was less than 0.1 ms, which was in fact below the error margin of our measurements.
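The trade-off between direct and cached lookups can be sketched in a few lines of Python. The example below is not Hipache’s code: a plain dictionary stands in for Redis, the CachedRoutes class and its 5-second TTL are hypothetical, and the clock is injectable so that the staleness window can be shown without actually waiting.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


class CachedRoutes:
    """Small TTL cache in front of a configuration lookup (hypothetical)."""

    def __init__(
        self,
        lookup: Callable[[str], Iterable[str]],
        ttl: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # e.g. a function that queries Redis
        self._ttl = ttl
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}
        self.misses = 0  # number of round-trips to the backing store

    def backends_for(self, host: str) -> List[str]:
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # possibly stale, but no round-trip
        self.misses += 1
        backends = list(self._lookup(host))
        self._cache[host] = (now, backends)
        return backends


if __name__ == "__main__":
    store = {"www.example.com": ["http://10.0.0.1:8080"]}  # stand-in for Redis
    fake_now = [0.0]
    routes = CachedRoutes(lambda h: store.get(h, []), clock=lambda: fake_now[0])

    print(routes.backends_for("www.example.com"))
    store["www.example.com"].append("http://10.0.0.2:8080")  # configuration change
    fake_now[0] = 1.0
    print(routes.backends_for("www.example.com"))  # still the old answer
    fake_now[0] = 6.0
    print(routes.backends_for("www.example.com"))  # refreshed after the TTL
```

The second lookup shows the cost of caching: a configuration change stays invisible until the TTL expires, which is the delay that direct lookups against a local Redis avoid.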
Version 3: Active Health Checks

Hipache has a simple embedded health-check system. When a request fails because of some backend issue (TCP errors, HTTP 5xx responses, etc.), the backend is flagged as dead and remains in this state for 30 seconds. During those 30 seconds, no request is sent to the backend; then it goes back to the normal state. If it is still faulty, it is immediately re-flagged as dead on the next failed request. This mechanism is simple and it works, but it has three caveats:
• If a backend is frozen, we still send requests to it until it gets marked as dead
• When a backend is repaired, it can take up to 30 seconds to mark it live again
• A backend which is permanently dead still receives a few requests every 30 seconds

To address those three problems, we implemented active health checks. The health checker continuously monitors the state of the backends by doing the HTTP equivalent of a simple “ping”. As soon as a backend stops replying correctly to ping requests, it is marked as dead. As soon as it starts replying again, it is marked as live. The HTTP “pings” can be sent every few seconds, so a change in a backend’s state is detected much faster.

To implement the active health checker, we considered multiple solutions (Node.js, Python+gevent, Twisted) and finally decided to write it in Go. Go was chosen for several reasons:
• The health checker is massively concurrent (hundreds, and even thousands, of HTTP connections can be “in flight” at a given time)
• Go programs can be compiled and deployed as a single, stand-alone binary
• We have other tools doing massively concurrent queries, and this was an excellent occasion to do some comparative benchmarks (we will be publishing the benchmarks in future eBooks)

The active health checker is completely optional.
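The interplay between the passive 30-second rule and active pings can be modelled with a small state tracker. The Python sketch below is only a model under stated assumptions: the HealthTracker class, its method names, and the injectable clock are hypothetical, no HTTP request is actually sent, and it is neither Hipache’s nor hchecker’s implementation (the real checker is written in Go).

```python
import time
from typing import Callable, Dict, List

DEAD_SECONDS = 30.0  # passive rule: a failed backend sits out for 30 seconds


class HealthTracker:
    """Tracks which backends may receive traffic (hypothetical model)."""

    def __init__(
        self,
        clock: Callable[[], float] = time.monotonic,
        dead_seconds: float = DEAD_SECONDS,
    ) -> None:
        self._clock = clock
        self._dead_seconds = dead_seconds
        self._dead_until: Dict[str, float] = {}

    def report_failure(self, backend: str) -> None:
        """Passive check: a proxied request failed (TCP error, HTTP 5xx...)."""
        self._dead_until[backend] = self._clock() + self._dead_seconds

    def record_ping(self, backend: str, ok: bool) -> None:
        """Active check: apply the result of a periodic HTTP ping."""
        if ok:
            self._dead_until.pop(backend, None)  # live again immediately
        else:
            self.report_failure(backend)

    def is_live(self, backend: str) -> bool:
        deadline = self._dead_until.get(backend)
        return deadline is None or self._clock() >= deadline

    def live_backends(self, backends: List[str]) -> List[str]:
        return [b for b in backends if self.is_live(b)]


if __name__ == "__main__":
    fake_now = [100.0]
    health = HealthTracker(clock=lambda: fake_now[0])
    pool = ["10.0.0.1:8080", "10.0.0.2:8080"]

    health.report_failure("10.0.0.1:8080")
    print(health.live_backends(pool))  # first backend is out of rotation
    fake_now[0] += 30.0
    print(health.live_backends(pool))  # back in rotation, healthy or not
```

With passive checks alone, a backend returns to the pool only when its 30 seconds elapse; record_ping() shows how an active checker can flip the state in either direction as soon as a ping result arrives.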
You don’t need the health checker to run Hipache, and you can plug it on top of an existing Hipache installation without modifying the Hipache configuration: it detects the configuration and updates it directly through the Redis instance used by Hipache itself. In other words, it gets along perfectly fine with Hipache’s embedded passive health-checking system, and running it simply improves dead backend detection. And of course, hchecker is open source, just like Hipache.

What’s next?

Since this HTTP routing layer is a major part of the dotCloud infrastructure, we’re constantly trying to find ways to make it better. Recently we did some research and tests to see if there was a way to implement dynamic routing with Nginx. In fact, we aimed for an even higher goal: we wanted to route requests with Nginx using configuration rules stored in Redis, in the format currently used by Hipache. This would allow us to re-use many components that rely on the same configuration format, such as the Redis feeder and the active health checker.

Guess what: we found something! Less than one year ago, when we started to think about the design of Hipache and began its implementation, we looked at the Nginx Lua module. It has improved a lot since then, and it may be an ideal candidate. We started an experimental project which lets Nginx mimic Hipache by using the same Redis configuration format. Nginx deals with the request proxying, while the routing logic is all in Lua. We used the excellent lua-resty-redis module to talk to Redis from Nginx. This open source project is called hipache-nginx.

Some preliminary benchmarks show that under high load, hipache-nginx can be 10x faster than the original Hipache in Node.js. The benchmarks have to be refined, but it appears that hipache-nginx can deliver the same performance as hipache-nodejs with 10x fewer resources. So, while the code is still experimental, it shows that there is plenty of room for improvement in the current HTTP routing layer.
Even if it will probably only affect apps handling 10,000-100,000 requests per second, it is still worth investigating.
CONCLUSION
As you can see, building a PaaS like dotCloud or Heroku requires specific knowledge of fundamental technologies. Of course, you may not choose to implement any of the specific technologies that we’ve implemented in dotCloud. Our aim was to expose the underlying technologies that provide isolation between apps, rapid deployment, protection against security threats, and distributed routing. In other words, if you are serious about building a robust platform, you may want to become familiar with those types of technologies. Or, alternatively, you could rely on an existing proven platform like dotCloud.

Join dotCloud’s Technical Community
• Sign up for your own account
• Join the technical discussions in our open forums
• Read our blog

Have a technical question? Email us:
[email protected]
Author’s Biography

Jérôme Petazzoni, PaaS Under the Hood, Episodes 1-4
Jérôme is a senior engineer at dotCloud, where he rotates between Ops, Support and Evangelist duties and has earned the nickname of “master Yoda”. In a previous life he built and operated large scale Xen hosting back when EC2 was just the name of a plane, supervised the deployment of fiber interconnects through the French subway, built a specialized GIS to visualize fiber infrastructure, specialized in commando deployments of large-scale computer systems in bandwidth-constrained environments such as conference centers, and various other feats of technical wizardry. He cares for the servers powering dotCloud, helps our users feel at home on the platform, and documents the many ways to use dotCloud in articles, tutorials and sample applications. He’s also an avid dotCloud power user who has deployed just about anything on dotCloud - look for one of his many custom services on our GitHub repository.
Connect with Jérôme on Twitter! @jpetazzo

Sam Alba, PaaS Under the Hood, Episode 5
As dotCloud’s first engineering hire, Sam was part of the tiny team that shipped our first private beta in 2010. Since then, he has been instrumental in scaling the platform to tens of millions of unique visitors for tens of thousands of developers across the world, leaving his mark on every major feature and component along the way. Today, as dotCloud’s first Director of Engineering, he manages our fast-growing engineering team, which is another way to say he sits in meetings so that the other engineers don’t have to. When not sitting in a meeting, he maintains several popular open source projects, including Hipache and Cirruxcache and other projects also ending in “-ache”. In a previous life, Sam supported Fortune 500s at Akamai, built the web infrastructure at several startups, and wrote software for self-driving cars in a research lab at INRIA.
Follow Sam on Twitter @sam_alba