Linux tools for analyzing System Performance

Share on:

Background

In today's contemporary world of enterprise software where massively used applications are expected to scale and run seamlessly at extreme high loads e.g. Scaling Hotstar for 25.3 million users, system performance becomes one of the key tenant of architecting high throughput, low latency applications along with capability of ease in scaling as per business / end consumer needs .

System performance is a very broad term as it would encompass entire gambit of computer system i.e. all the software and hardware components that comes within the path of a user request. In this article I will be mainly covering key tools to analyze system resources -

  1. Four main logical components of a physical computer - CPU, RAM, Networking, and Disk I/O
  2. Operating System

App - Top Down View

CPU

CPU is mainly responsible for executing all software and are most often probable hot spot for system performance issue. Key Linux tools that can help in finding deeper insights -

top

Shows top CPU consuming processes along with its CPU usage

CPU - top

Above numbers pertaining to CPU utilization means -

  • us (user time)—The percentage of time the CPU spent in user space.
  • sy (system time)—The percentage of time the CPU spent in kernel space.
  • ni (nice time)—The percentage of time spent on low-priority processes.
  • id (idle time)—The percentage of time the CPU spent doing literally nothing. (It can’t stop!)
  • wa (I/O wait time)—The percentage of time the CPU spent waiting on I/O.
  • hi (hardware interrupts)—The percentage of time the CPU spent servicing hardware interrupts.
  • si (software interrupts)—The percentage of time the CPU spent servicing software interrupts.
  • st (steal time)—The percentage of time a hypervisor stole the CPU to give it to someone else. This kicks in only in virtualized environments.

Niceness

It mainly indicates how happy a process is to give CPU cycles to other more high priority process i.e. how nice it is with others :) Allowed values range from -20 to 19

mpstat

Its multi processor statistics tool tha can emit metrics per CPU. It is similar to top but with load split separately for each processor.

CPU - mpstat

Note

When your application slows down considerably due to other background processes than one may use niceness property to set higher relative property for application process. This has major limitation, as it does not control how much CPU to allocate to the process In order to overcome above limitation one can use control groups which mainly allows user to specify exact magnitude of resources i.e. CPU, Memory, I/O that kernel should allocate to group of processes

RAM

RAM determines the operating capacity of a system at any given time. Read speed of RAM determines two things -

  1. How fast your CPU can load data into its cache
  2. How much of system performance drops when CPU is forced to read from RAM instead of from its cache

Key Linux tools that can provide deeper insights -

free

It shows utilization of RAM.

RAM - free

'Available' column as shown above is a key metric to check RAM utilization

top

It gives an overview of the memory along with CPU utilization (as discussed in previous section) of system. By default, the output is sorted by the value of field %CPU utilization of process

RAM - top

vmstat

It mainly emits virtual memory statistics

RAM - vmstat

Key fields to understand from performance standpoint :

  • r -> Indicates no. of processes running / waiting to run which implicitly helps us to know the saturation level of the system
  • b -> Indicates no. of processes
  • in -> Total no. of interrupts
  • cs -> Total no. of context switches

oomkill-bpfcc

It basically works by tracing oom_kill_process and emitting information whenever out of memory happens. In order to leverage this utility, open a terminal and execute below command and keep the terminal open.

1  sudo oomkill-bpfcc

Whenever OOM occurs below output can be seen

RAM - ram-oomkil-bpfcc

Network

With Distributed Systems, network plays a huge role in System Performance. Typically speaking network would mainly comprise of -

  1. Hardware - which would mainly include routers, NIC, switches etc
  2. Software - which would mainly include OS kernel that may comprise of device drivers, protocols etc.

Note - At software level, protocols can be further categorized into 2 :

  1. Protocols at kernel level i.e. TCP, UDP etc
  2. Protocols at application level i.e. HTTP One can refer my older posts w.r.t application / kernel level protocols and ways to optimize it -

Key tools that can help in gaining kernel level metrics

sar

It is a tool to collect system metrics

Network - sar

Activate sar

Edit the file /etc/default/sysstat by changing ENABLED from "false" to "true" and then restart its service by below command

1  sudo service sysstat restart

sar options and its metrics understanding

Various options provided by Linux version -

  1. -n DEV: Network interface statistics
  2. -n EDEV: Network interface errors
  3. -n TCP: TCP statistics
  4. -n ETCP: TCP error statistics
  5. -n SOCK: Socket usage
  6. -n IP: IP datagram statistics
  7. -n EIP: IP error statistics

Network - sar-tcp

tcptop

It is a tool that emits TCP throughput by host and its processes

Network - tcptop

  • RX_KB : Received traffic in KB
  • TX_KB : Traffic sent in KB

Disk I/O

Disk I/O can have huge impact on system performance and this in turn may lead to high latency of application. A very common scenario that might lead to performance issues - System is waiting for Disk I/O to get completed and as a result CPU is idle due to blocking I/O operation

Key tools that can help in gaining kernel level metrics

df

It mainly shows all the file systems mounted along with its size and used space

Disk - df

iostat

Emits summary of I/O statistics per disk

Disk - iostat

Note -

  • Use -x to get extended metrics
  • Few key metrics to understand the throughput / avg size of read/write - r/s, w/s, rKb/s, rKb/s
  • aqu-sz : Measurement of saturation which indicates queue length of requests
  • %util : Indicates utilization i.e. percentage of time device spent doing some work. A high value may indicate saturation

biotop

Emits summary of disk I/O by process

Disk - biotop

It mainly helps to identify source of load being written to disk is coming from. It in a way helps to identify the root cause of high consumption of resources

Operating System

Operating System is a program on which application along with its binaries runs. It also manages system along with hardware, CPU, memory etc.

Few tools that can help to get key metrics -

opensnoop

Allows you to see all the files being opened by all the processes on your system, in almost near to real time

1  sudo opensnoop-bpfcc

execsnoop

It list of all the processes being started on the machine by listening for call to exec variant in the kernel

1  sudo execsnoop-bpfcc

Other handy Linux tools from Performance Engineering standpoint

Uptime

It mainly indicates -

  • how long system has been up
  • average load on the system which mainly helps in understanding the pattern (increasing or decreasing load)

Misc - uptime

Note:

  1. The three numbers in Load Average represent a moving window sum average of processes competing over CPU time to run over 1, 5, and 15 minutes. The numbers are exponentially scaled, so a number twice as large doesn’t mean twice as much load
  2. If the numbers are decreasing sharply, it might mean that the program which was eating up all the resources has already completed its execution
  3. Increasing number would mostly indicate rising load on the system

Exit code of applications

One can return exit code returned by previous command

1  echo $?

Misc - exitcode

Note - codes in range 128-192 are decoded by using 128 + n, where n is the number of the kill signal

Misc - exitcode kill

Quick check on memory and CPU utilization

1  top -n1 -o+%MEM

Misc - Top mem cpu

dmesg

Linux utility that displays kernel messages

misc-dmesg Misc - dmesg

Conclusion

Architecting and developing low latency, high throughput enterprise applications needs an overall understanding of not only software but also about the underlying OS behavior along with its key components viz. CPU, RAM, Network and Disk. In this article I tried to show how quickly one can use some of the commands to get insights from OS level whilst troubleshooting performance issues.

P.S : List of commands and their output shown above are based on Chaos Engineering

comments powered by Disqus