Run Control Group Version 2 on Oracle Linux
Introduction
Control Group (cgroup) is a Linux kernel feature for limiting, prioritizing, and allocating resources such as CPU time, memory, and network bandwidth for running processes.
This tutorial guides you through limiting the CPU time for user processes using cgroup v2.
Objectives
In this tutorial, you will learn how to:
- Enable control group version 2
- Set a soft CPU limit for a user process
- Set a hard CPU limit for a user process
Prerequisites
Minimum of a single Oracle Linux system
Each system should have Oracle Linux installed and configured with:
- A non-root user account with sudo access
- Access to the Internet
Deploy Oracle Linux
Note: If running in your own tenancy, read the linux-virt-labs GitHub project README.md and complete the prerequisites before deploying the lab environment.
Open a terminal on the Luna Desktop.
Clone the linux-virt-labs GitHub project.
git clone https://github.com/oracle-devrel/linux-virt-labs.git
Change into the working directory.
cd linux-virt-labs/ol
Install the required collections.
ansible-galaxy collection install -r requirements.yml
Deploy the lab environment.
ansible-playbook create_instance.yml -e localhost_python_interpreter="/usr/bin/python3.6"
The free lab environment requires the extra variable localhost_python_interpreter, which sets ansible_python_interpreter for plays running on localhost. This variable is needed because the environment installs the RPM package for the Oracle Cloud Infrastructure SDK for Python, located under the python3.6 modules.
The default deployment shape uses the AMD CPU and Oracle Linux 8. To use an Intel CPU or Oracle Linux 9, add -e instance_shape="VM.Standard3.Flex" or -e os_version="9" to the deployment command.
Important: Wait for the playbook to run successfully and reach the pause task. At this stage of the playbook, the installation of Oracle Linux is complete, and the instances are ready. Take note of the previous play, which prints the public and private IP addresses of the nodes it deploys and any other deployment information needed while running the lab.
Create a Load-generating Script
Open a terminal and connect via SSH to the ol-node-01 instance.
ssh oracle@<ip_address_of_instance>
Create the foo.exe script.
echo '#!/bin/bash
/usr/bin/sha1sum /dev/zero' > foo.exe
Copy the foo.exe script to a location in your $PATH and set the proper permissions.
sudo mv foo.exe /usr/local/bin/foo.exe
sudo chown root:root /usr/local/bin/foo.exe
sudo chmod 755 /usr/local/bin/foo.exe
Fix the SELinux labels after copying and changing permissions on the foo.exe script.
sudo /sbin/restorecon -v /usr/local/bin/foo.exe
Note: Oracle Linux runs with SELinux set to enforcing mode by default. You can verify this by running sudo sestatus.
Create a Load-generating Service
Create the foo.service file.
echo '[Unit]
Description=the foo service
After=network.target

[Service]
ExecStart=/usr/local/bin/foo.exe

[Install]
WantedBy=multi-user.target' > foo.service
Copy the foo.service file to the default systemd scripts directory and set the proper permissions.
sudo mv foo.service /etc/systemd/system/foo.service
sudo chown root:root /etc/systemd/system/foo.service
sudo chmod 644 /etc/systemd/system/foo.service
Fix the SELinux labels.
sudo /sbin/restorecon -v /etc/systemd/system/foo.service
Reload the systemd daemon so it recognizes the new service.
sudo systemctl daemon-reload
Start foo.service and check its status.
sudo systemctl start foo.service
sudo systemctl status foo.service
Create Users
Creating additional users lets you run the load-generating script under different accounts with different CPU weights.
Create users and set passwords.
sudo useradd -u 8000 ralph
sudo useradd -u 8001 alice
echo "ralph:oracle" | sudo chpasswd
echo "alice:oracle" | sudo chpasswd
Allow SSH connections.
Copy the SSH key from the oracle user account for the ralph user.
sudo mkdir /home/ralph/.ssh
sudo cp /home/oracle/.ssh/authorized_keys /home/ralph/.ssh/authorized_keys
sudo chown -R ralph:ralph /home/ralph/.ssh
sudo chmod 700 /home/ralph/.ssh
sudo chmod 600 /home/ralph/.ssh/authorized_keys
Repeat for the alice user.
sudo mkdir /home/alice/.ssh
sudo cp /home/oracle/.ssh/authorized_keys /home/alice/.ssh/authorized_keys
sudo chown -R alice:alice /home/alice/.ssh
sudo chmod 700 /home/alice/.ssh
sudo chmod 600 /home/alice/.ssh/authorized_keys
Open a new terminal and verify both SSH connections work.
ssh -l ralph -o StrictHostKeyChecking=accept-new <ip_address_of_instance> true
The -o StrictHostKeyChecking=accept-new option automatically accepts previously unseen keys but refuses connections for changed or invalid host keys. This option is a safer subset of the behavior of StrictHostKeyChecking=no. The true command runs on the remote host and always returns a value of 0, which indicates that the SSH connection was successful. If there are no errors, the terminal returns to the command prompt after running the SSH command.
Repeat for the other user.
ssh -l alice -o StrictHostKeyChecking=accept-new <ip_address_of_instance> true
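If you want to confirm that a connection actually succeeded, an optional check (not part of the lab steps) is to print the exit status of the SSH command; 0 means the remote true command ran.
ssh -l alice -o StrictHostKeyChecking=accept-new <ip_address_of_instance> true
echo $?
# 0 indicates success; any non-zero value means the connection or the remote command failed.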
Exit the current terminal and switch to the other existing terminal connected to ol-node-01.
Enable Control Group Version 2
Note: Oracle Linux 9 and later ship with cgroup v2 enabled by default.
For Oracle Linux 8, you must manually configure the boot kernel parameters to enable cgroup v2 as it mounts cgroup v1 by default.
If you are not using Oracle Linux 8, skip to the next section.
Update grub with the cgroup v2 systemd kernel parameter.
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
You can instead specify only your current boot entry by running sudo grubby --update-kernel=/boot/vmlinuz-$(uname -r) --args="systemd.unified_cgroup_hierarchy=1".
Confirm the changes.
cat /etc/default/grub |grep systemd.unified_cgroup_hierarchy
Reboot the instance for the changes to take effect.
sudo systemctl reboot
Note: Wait a few minutes for the instance to restart.
Reconnect to the ol-node-01 instance using SSH.
Verify that Cgroup v2 is Enabled
Check the cgroup controller list.
cat /sys/fs/cgroup/cgroup.controllers
The output should return similar results:
cpuset cpu io memory hugetlb pids rdma
Check the cgroup2 mounted file system.
mount |grep cgroup2
The output should return similar results:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
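Another quick, optional check is to ask for the file system type at the cgroup mount point. On a cgroup v2 system this prints cgroup2fs, while a v1 layout typically reports tmpfs.
# Print the file system type of the cgroup mount point.
stat -fc %T /sys/fs/cgroup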
Inspect the contents of the cgroup mounted directory.
ll /sys/fs/cgroup
Example output:
total 0
-r--r--r--.  1 root root 0 Mar 13 21:20 cgroup.controllers
-rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.max.depth
-rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.max.descendants
-rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.procs
-r--r--r--.  1 root root 0 Mar 13 21:20 cgroup.stat
-rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.subtree_control
-rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.threads
-rw-r--r--.  1 root root 0 Mar 13 21:20 cpu.pressure
-r--r--r--.  1 root root 0 Mar 13 21:20 cpuset.cpus.effective
-r--r--r--.  1 root root 0 Mar 13 21:20 cpuset.mems.effective
drwxr-xr-x.  2 root root 0 Mar 13 21:20 init.scope
-rw-r--r--.  1 root root 0 Mar 13 21:20 io.pressure
-rw-r--r--.  1 root root 0 Mar 13 21:20 memory.pressure
drwxr-xr-x. 87 root root 0 Mar 13 21:20 system.slice
drwxr-xr-x.  4 root root 0 Mar 13 21:24 user.slice
The output shows the root control group at its default location. The directory contains interface files, most prefixed with cgroup, and directories related to systemd that end in .scope and .slice.
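If you want to separate the two kinds of entries yourself, a simple filtered listing works (an optional aside, using standard shell globbing):
# Kernel interface files at the cgroup root (the cgroup.* core files).
ls /sys/fs/cgroup/cgroup.*
# Child cgroups managed by systemd (.slice and .scope directories).
ls -d /sys/fs/cgroup/*.slice /sys/fs/cgroup/*.scope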
Work with the Virtual File System
Before we get started, we need to learn a bit about the cgroup virtual file system mounted at /sys/fs/cgroup.
Show which CPUs participate in the cpuset for everyone.
cat /sys/fs/cgroup/cpuset.cpus.effective
The output shows a range starting at 0 that indicates the system's effective CPUs, which consist of a combination of CPU cores and threads.
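As an optional cross-check, the range should agree with the CPU count reported by nproc; for example, 0-1 corresponds to two CPUs.
# Number of CPUs available to the scheduler.
nproc
# The same CPUs as seen by the root cgroup.
cat /sys/fs/cgroup/cpuset.cpus.effective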
Show which controllers are active.
cat /sys/fs/cgroup/cgroup.controllers
Example output:
cpuset cpu io memory hugetlb pids rdma misc
It's good to see the cpuset controller present as we'll use it later in this tutorial.
Show processes spawned by oracle.
First, we need to determine oracle's user ID (UID).
who
id
Example output:
[oracle@ol-node-01 ~]$ who
oracle   pts/0        2022-03-13 21:23 (10.39.209.157)
[oracle@ol-node-01 ~]$ id
uid=1001(oracle) gid=1001(oracle) groups=1001(oracle),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
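If you only need the numeric UID for the next step, id can print it on its own:
# Print just the UID of the current user (1001 for oracle in this lab).
id -u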
Using the UID, we can find the oracle user's slice.
cd /sys/fs/cgroup/user.slice
ls
Example output:
[oracle@ol-node-01 ~]$ cd /sys/fs/cgroup/user.slice
[oracle@ol-node-01 user.slice]$ ls
cgroup.controllers      cgroup.subtree_control  memory.events        memory.pressure      pids.max
cgroup.events           cgroup.threads          memory.events.local  memory.stat          user-0.slice
cgroup.freeze           cgroup.type             memory.high          memory.swap.current  user-1001.slice
cgroup.max.depth        cpu.pressure            memory.low           memory.swap.events   user-989.slice
cgroup.max.descendants  cpu.stat                memory.max           memory.swap.max
cgroup.procs            io.pressure             memory.min           pids.current
cgroup.stat             memory.current          memory.oom.group     pids.events
Systemd assigns every user a slice named user-<UID>.slice. So, what's under that directory?
cd user-1001.slice
ls
Example output:
[oracle@ol-node-01 user.slice]$ cd user-1001.slice/
[oracle@ol-node-01 user-1001.slice]$ ls
cgroup.controllers  cgroup.max.descendants  cgroup.threads  io.pressure      user-runtime-dir@1001.service
cgroup.events       cgroup.procs            cgroup.type     memory.pressure
cgroup.freeze       cgroup.stat             cpu.pressure    session-3.scope
cgroup.max.depth    cgroup.subtree_control  cpu.stat        user@1001.service
This is the top-level cgroup for the oracle user. However, there are no processes listed in cgroup.procs. So, where is the list of user processes?
cat cgroup.procs
Example output:
[oracle@ol-node-01 user-1001.slice]$ cat cgroup.procs
[oracle@ol-node-01 user-1001.slice]$
When oracle opened the SSH session at the beginning of this tutorial, the user session created a scope sub-unit. Under this scope, we can check the cgroup.procs for a list of processes spawned under that session.
Note: The user might have multiple sessions based on the number of connections to the system; therefore, replace the 3 in the sample below as necessary.
cd session-3.scope
ls
cat cgroup.procs
Example output:
[oracle@ol-node-01 user-1001.slice]$ cd session-3.scope/
[oracle@ol-node-01 session-3.scope]$ ls
cgroup.controllers  cgroup.max.depth        cgroup.stat             cgroup.type   io.pressure
cgroup.events       cgroup.max.descendants  cgroup.subtree_control  cpu.pressure  memory.pressure
cgroup.freeze       cgroup.procs            cgroup.threads          cpu.stat
[oracle@ol-node-01 session-3.scope]$ cat cgroup.procs
3189
3200
3201
54217
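Rather than guessing the session number, you can also ask the kernel directly which cgroup the current shell belongs to; the path it prints includes the session scope. This is an optional shortcut, not part of the original steps.
# Prints something like 0::/user.slice/user-1001.slice/session-3.scope
cat /proc/self/cgroup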
Now that we have found the processes the hard way, we can use systemd-cgls to show the same information in a tree-like view.
Note: When run from within the virtual file system, systemd-cgls limits the cgroup output to the current working directory.
cd /sys/fs/cgroup/user.slice/user-1001.slice
systemd-cgls
Example output:
[oracle@ol-node-01 user-1001.slice]$ systemd-cgls
Working directory /sys/fs/cgroup/user.slice/user-1001.slice:
├─session-3.scope
│ ├─ 3189 sshd: oracle [priv]
│ ├─ 3200 sshd: oracle@pts/0
│ ├─ 3201 -bash
│ ├─55486 systemd-cgls
│ └─55487 less
└─user@1001.service
  └─init.scope
    ├─3193 /usr/lib/systemd/systemd --user
    └─3195 (sd-pam)
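As an optional variation, systemd-cgls also accepts a control group path as an argument, so you can inspect a slice without changing directories first (exact path handling may vary slightly between systemd versions):
# Show only the subtree of the oracle user's slice.
systemd-cgls /user.slice/user-1001.slice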
Limit the CPU Cores Used
With cgroup v2, systemd has complete control of the cpuset controller. This level of control enables an administrator to schedule work on only a specific CPU core.
Check CPUs for user.slice.
cd /sys/fs/cgroup/user.slice
ls
cat ../cpuset.cpus.effective
Example output:
[oracle@ol-node-01 cgroup]$ cd /sys/fs/cgroup/user.slice/
[oracle@ol-node-01 user.slice]$ ls
cgroup.controllers      cgroup.subtree_control  memory.events        memory.pressure      pids.max
cgroup.events           cgroup.threads          memory.events.local  memory.stat          user-0.slice
cgroup.freeze           cgroup.type             memory.high          memory.swap.current  user-1001.slice
cgroup.max.depth        cpu.pressure            memory.low           memory.swap.events   user-989.slice
cgroup.max.descendants  cpu.stat                memory.max           memory.swap.max
cgroup.procs            io.pressure             memory.min           pids.current
cgroup.stat             memory.current          memory.oom.group     pids.events
[oracle@ol-node-01 user.slice]$ cat ../cpuset.cpus.effective
0-1
The cpuset.cpus.effective shows the actual cores used by the user.slice. If a parameter does not exist in the specific cgroup directory, or we don't set it, the value gets inherited from the parent, which in this case happens to be the top-level cgroup root directory.
Restrict the system slice and the user 0, 1001, and 989 slices to CPU core 0.
cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
sudo systemctl set-property system.slice AllowedCPUs=0
cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
Example output:
[oracle@ol-node-01 user.slice]$ cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
cat: /sys/fs/cgroup/system.slice/cpuset.cpus.effective: No such file or directory
[oracle@ol-node-01 user.slice]$ sudo systemctl set-property system.slice AllowedCPUs=0
[oracle@ol-node-01 user.slice]$ cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
0
Note: The No such file or directory indicates that by default, the system slice inherits its cpuset.cpus.effective value from the parent.
sudo systemctl set-property user-0.slice AllowedCPUs=0
sudo systemctl set-property user-1001.slice AllowedCPUs=0
sudo systemctl set-property user-989.slice AllowedCPUs=0
Restrict the ralph user to CPU core 1.
sudo systemctl set-property user-8000.slice AllowedCPUs=1
cat /sys/fs/cgroup/user.slice/user-8000.slice/cpuset.cpus.effective
Example output:
[oracle@ol-node-01 ~]$ sudo systemctl set-property user-8000.slice AllowedCPUs=1
[oracle@ol-node-01 ~]$ cat /sys/fs/cgroup/user.slice/user-8000.slice/cpuset.cpus.effective
1
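You can also confirm the setting from systemd's point of view rather than through the virtual file system (an optional check, assuming your systemd version exposes the property through systemctl show):
# Expected to print something like: AllowedCPUs=1
systemctl show -p AllowedCPUs user-8000.slice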
Open a new terminal and connect via SSH as ralph to the ol-node-01 system.
ssh ralph@<ip_address_of_instance>
Test using the foo.exe script.
foo.exe &
Verify the results.
top
Once top is running, hit the 1 key to show the CPUs individually.
Example output:
top - 18:23:55 up 21:03,  2 users,  load average: 1.03, 1.07, 1.02
Tasks: 155 total,   2 running, 153 sleeping,   0 stopped,   0 zombie
%Cpu0  :  6.6 us,  7.0 sy,  0.0 ni, 84.8 id,  0.0 wa,  0.3 hi,  0.3 si,  1.0 st
%Cpu1  : 93.0 us,  6.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.3 si,  0.0 st
MiB Mem :  14707.8 total,  13649.1 free,    412.1 used,    646.6 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  13993.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 226888 ralph     20   0  228492   1808   1520 R  99.7   0.0 199:34.27 sha1sum
 269233 root      20   0  223724   6388   1952 S   1.3   0.0   0:00.04 pidstat
   1407 root      20   0  439016  41116  39196 S   0.3   0.3   0:17.81 sssd_nss
   1935 root      20   0  236032   3656   3156 S   0.3   0.0   0:34.34 OSWatcher
   2544 root      20   0  401900  40292   9736 S   0.3   0.3   0:10.62 ruby
      1 root      20   0  388548  14716   9508 S   0.0   0.1   0:21.21 systemd
...
Type q to quit top.
An alternate way to check the processor running a process is with ps.
ps -eo pid,psr,user,cmd | grep ralph
Example output:
[ralph@ol-node-01 ~]$ ps -eo pid,psr,user,cmd | grep ralph
 226715   1 root     sshd: ralph [priv]
 226719   1 ralph    /usr/lib/systemd/systemd --user
 226722   1 ralph    (sd-pam)
 226727   1 ralph    sshd: ralph@pts/2
 226728   1 ralph    -bash
 226887   1 ralph    /bin/bash /usr/local/bin/foo.exe
 226888   1 ralph    /usr/bin/sha1sum /dev/zero
 269732   1 ralph    ps -eo pid,psr,user,cmd
 269733   1 ralph    grep --color=auto ralph
The psr column is the CPU number of the cmd or actual process.
Exit and close the current terminal and switch to the other existing terminal connected to ol-node-01.
Kill the foo.exe job.
sudo pkill sha1sum
Adjust the CPU Weight for Users
Time to have alice join in the fun. She has some critical work to complete, so we'll give her twice the normal priority on the CPU.
Assign alice to the same CPU as ralph.
sudo systemctl set-property user-8001.slice AllowedCPUs=1
cat /sys/fs/cgroup/user.slice/user-8001.slice/cpuset.cpus.effective
Set CPUWeight.
sudo systemctl set-property user-8001.slice CPUWeight=200
cat /sys/fs/cgroup/user.slice/user-8001.slice/cpu.weight
The default weight is 100, so 200 is twice that number.
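To see why that gives alice roughly two-thirds of the shared core, remember that weights are relative: each contender receives its weight divided by the sum of the competing weights. A quick back-of-the-envelope check:
# Expected split of CPU 1 when alice (weight 200) and ralph (weight 100) both run foo.exe:
echo "alice: $(( 200 * 100 / (200 + 100) ))%   ralph: $(( 100 * 100 / (200 + 100) ))%"
# alice: 66%   ralph: 33%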
Open a new terminal and connect via SSH as ralph to the ol-node-01 system.
ssh ralph@<ip_address_of_instance>
Run foo.exe as ralph.
foo.exe &
Open another new terminal and connect via SSH as alice to the ol-node-01 system.
ssh alice@<ip_address_of_instance>
Run foo.exe as alice.
foo.exe &
Verify via top that alice is getting the higher priority.
top
Once top is running, hit the 1 key to show the CPUs individually.
Example output:
top - 20:10:55 up 25 min,  3 users,  load average: 1.29, 0.46, 0.20
Tasks: 164 total,   3 running, 161 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 96.5 id,  0.0 wa,  0.0 hi,  3.2 si,  0.3 st
%Cpu1  : 92.4 us,  7.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15715.8 total,  14744.6 free,    438.5 used,    532.7 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  15001.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7934 alice     20   0   15800   1768   1476 R  67.0   0.0   0:36.15 sha1sum
   7814 ralph     20   0   15800   1880   1592 R  33.3   0.0   0:34.60 sha1sum
      1 root      20   0  388476  14440   9296 S   0.0   0.1   0:02.22 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
...
Switch to the terminal logged in as the oracle user.
Load the system.slice using the foo.service.
sudo systemctl start foo.service
Look now at the top output, which is still running in the alice terminal window. See that the foo.service consumes CPU 0 while the users split CPU 1 based on their weights.
Example output:
top - 19:18:15 up 21:57,  3 users,  load average: 2.15, 2.32, 2.25
Tasks: 159 total,   4 running, 155 sleeping,   0 stopped,   0 zombie
%Cpu0  : 89.1 us,  7.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.3 si,  2.6 st
%Cpu1  : 93.7 us,  5.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.3 si,  0.0 st
MiB Mem :  14707.8 total,  13640.1 free,    420.5 used,    647.2 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  13984.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 280921 root      20   0  228492   1776   1488 R  93.4   0.0   0:07.74 sha1sum
 279185 alice     20   0  228492   1816   1524 R  65.6   0.0   7:35.18 sha1sum
 279291 ralph     20   0  228492   1840   1552 R  32.8   0.0   7:00.30 sha1sum
   2026 oracle-+  20   0  935920  29280  15008 S   0.3   0.2   1:03.31 gomon
      1 root      20   0  388548  14716   9508 S   0.0   0.1   0:22.30 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.10 kthreadd
...
Assign a CPU Quota
Lastly, we will cap the CPU time for ralph.
Return to the terminal logged in as the oracle user.
Set the quota to 5%.
sudo systemctl set-property user-8000.slice CPUQuota=5%
The change takes effect immediately, as seen in the top output, which still runs in the alice user terminal.
Example output:
top - 19:24:53 up 22:04,  3 users,  load average: 2.21, 2.61, 2.45
Tasks: 162 total,   4 running, 158 sleeping,   0 stopped,   0 zombie
%Cpu0  : 93.0 us,  4.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.0 si,  1.7 st
%Cpu1  : 91.7 us,  5.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  1.0 hi,  1.0 si,  0.7 st
MiB Mem :  14707.8 total,  13639.4 free,    420.0 used,    648.4 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  13984.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 280921 root      20   0  228492   1776   1488 R  97.4   0.0   6:26.75 sha1sum
 279185 alice     20   0  228492   1816   1524 R  92.1   0.0  12:21.12 sha1sum
 279291 ralph     20   0  228492   1840   1552 R   5.3   0.0   8:44.84 sha1sum
      1 root      20   0  388548  14716   9508 S   0.0   0.1   0:22.48 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.10 kthreadd
...
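Under the hood, systemd translates CPUQuota into the cgroup v2 cpu.max interface file: 5% of the default 100000-microsecond period is 5000 microseconds. You can confirm what was written from the oracle terminal (the exact numbers assume the default period):
cat /sys/fs/cgroup/user.slice/user-8000.slice/cpu.max
# Expected output with the default period:
# 5000 100000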
Revert the cap on the ralph user using the oracle terminal window, which should still be in the /sys/fs/cgroup/user.slice directory.
echo "max 100000" | sudo tee -a user-8000.slice/cpu.max
The quota gets written to the cpu.max file, and the defaults are max 100000.
Example output:
[oracle@ol-node-01 user.slice]$ echo "max 100000" | sudo tee -a user-8000.slice/cpu.max
max 100000
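As an optional alternative, you can reset the quota through systemd itself rather than writing to cpu.max directly (assuming your systemd version supports clearing a property with an empty assignment):
sudo systemctl set-property user-8000.slice CPUQuota=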
You now know how to enable cgroup v2, limit users to a specific CPU when the system is under load, and lock them to using only a percentage of that CPU.
Next Steps
Thank you for completing this tutorial. Hopefully, these steps have given you a better understanding of installing, configuring, and using control group version 2 on Oracle Linux.