Run Control Group Version 2 on Oracle Linux

Introduction

Control Group (cgroup) is a Linux kernel feature for limiting, prioritizing, and allocating resources such as CPU time, memory, and network bandwidth for running processes.

This tutorial guides you through limiting the CPU time for user processes using cgroup v2.

Objectives

In this tutorial, you will learn how to:

  • Enable control group version 2
  • Set a soft CPU limit for a user process
  • Set a hard CPU limit for a user process

Prerequisites

  • Minimum of a single Oracle Linux system

  • Each system should have Oracle Linux installed and configured with:

    • A non-root user account with sudo access
    • Access to the Internet

Deploy Oracle Linux

Note: If running in your own tenancy, read the linux-virt-labs GitHub project README.md and complete the prerequisites before deploying the lab environment.

  1. Open a terminal on the Luna Desktop.

  2. Clone the linux-virt-labs GitHub project.

    git clone https://github.com/oracle-devrel/linux-virt-labs.git
  3. Change into the working directory.

    cd linux-virt-labs/ol
  4. Install the required collections.

    ansible-galaxy collection install -r requirements.yml
  5. Deploy the lab environment.

    ansible-playbook create_instance.yml -e localhost_python_interpreter="/usr/bin/python3.6"

    The free lab environment requires the extra variable localhost_python_interpreter, which sets ansible_python_interpreter for plays running on localhost. This variable is needed because the environment installs the RPM package for the Oracle Cloud Infrastructure SDK for Python, which places its modules under python3.6.

    The default deployment uses an AMD CPU shape and Oracle Linux 8. To use an Intel CPU, add -e instance_shape="VM.Standard3.Flex"; to use Oracle Linux 9, add -e os_version="9" to the deployment command, as shown in the combined example below.
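
    For example, a deployment that selects both of these options would look similar to the following:

    ansible-playbook create_instance.yml -e localhost_python_interpreter="/usr/bin/python3.6" -e instance_shape="VM.Standard3.Flex" -e os_version="9"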

    Important: Wait for the playbook to run successfully and reach the pause task. At this stage of the playbook, the installation of Oracle Linux is complete, and the instances are ready. Take note of the previous play, which prints the public and private IP addresses of the nodes it deploys and any other deployment information needed while running the lab.

Create a Load-generating Script

  1. Open a terminal and connect via SSH to the ol-node-01 instance.

    ssh oracle@<ip_address_of_instance>
  2. Create the foo.exe script.

    echo '#!/bin/bash
    
    /usr/bin/sha1sum /dev/zero' > foo.exe
  3. Copy the foo.exe script to a location in your $PATH and set the proper permissions.

    sudo mv foo.exe /usr/local/bin/foo.exe
    sudo chown root:root /usr/local/bin/foo.exe
    sudo chmod 755 /usr/local/bin/foo.exe
  4. Fix the SELinux labels after copying and changing permissions on the foo.exe script.

    sudo /sbin/restorecon -v /usr/local/bin/foo.exe
    

    Note: Oracle Linux runs with SELinux set to enforcing mode by default. You can verify this by running sudo sestatus.
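
    For a quick check of both the SELinux mode and the restored file label, you can also run the following; the exact context string may differ, but the file should carry a standard executable label such as bin_t:

    getenforce
    ls -Z /usr/local/bin/foo.exe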

Create a Load-generating Service

  1. Create the foo.service file.

    echo '[Unit]
    Description=the foo service
    After=network.target
    
    [Service]
    ExecStart=/usr/local/bin/foo.exe
    
    [Install]
    WantedBy=multi-user.target' > foo.service
  2. Copy the foo.service script to the default systemd scripts directory and set the proper permissions.

    sudo mv foo.service /etc/systemd/system/foo.service
    sudo chown root:root /etc/systemd/system/foo.service
    sudo chmod 644 /etc/systemd/system/foo.service
  3. Fix the SELinux labels.

    sudo /sbin/restorecon -v /etc/systemd/system/foo.service
  4. Reload the systemd daemon so it recognizes the new service.

    sudo systemctl daemon-reload
  5. Start foo.service and check its status.

    sudo systemctl start foo.service
    sudo systemctl status foo.service
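
    If you want to confirm where systemd placed the new service in the cgroup hierarchy, you can query its ControlGroup property; on a default installation, it typically lands under system.slice:

    systemctl show -p ControlGroup foo.service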

Create Users

Creating additional users lets you run the load-generating script under different accounts and assign each account a different CPU weight.

  1. Create users and set passwords.

    sudo useradd -u 8000 ralph
    sudo useradd -u 8001 alice
    echo "ralph:oracle" | sudo chpasswd
    echo "alice:oracle" | sudo chpasswd
  2. Allow SSH connections.

    Copy the SSH key from the oracle user account for the 'ralph' user.

    sudo mkdir /home/ralph/.ssh
    sudo cp /home/oracle/.ssh/authorized_keys /home/ralph/.ssh/authorized_keys
    sudo chown -R ralph:ralph /home/ralph/.ssh
    sudo chmod 700 /home/ralph/.ssh
    sudo chmod 600 /home/ralph/.ssh/authorized_keys
  3. Repeat for the alice user.

    sudo mkdir /home/alice/.ssh
    sudo cp /home/oracle/.ssh/authorized_keys /home/alice/.ssh/authorized_keys
    sudo chown -R alice:alice /home/alice/.ssh
    sudo chmod 700 /home/alice/.ssh
    sudo chmod 600 /home/alice/.ssh/authorized_keys
  4. Open a new terminal and verify both SSH connections work.

    ssh -l ralph -o StrictHostKeyChecking=accept-new <ip_address_of_instance> true

    The -o StrictHostKeyChecking=accept-new option automatically accepts previously unseen keys but refuses connections for changed or invalid host keys. This option is a safer subset of the behavior of StrictHostKeyChecking=no. The true command runs on the remote host and always exits with a status of 0, so a zero exit status from ssh indicates that the connection succeeded. If there are no errors, the terminal returns to the command prompt after running the SSH command.

  5. Repeat for the other user.

    ssh -l alice -o StrictHostKeyChecking=accept-new <ip_address_of_instance> true
  6. Exit the current terminal and switch to the other existing terminal connected to ol-node-01.
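
    Once back in that terminal, you can optionally confirm that both accounts were created with the expected UIDs (8000 and 8001):

    id ralph
    id alice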

Enable Control Group Version 2

Note: Oracle Linux 9 and later ship with cgroup v2 enabled by default.

Oracle Linux 8 mounts cgroup v1 by default, so you must manually configure the kernel boot parameters to enable cgroup v2.

If you are not using Oracle Linux 8, skip to the next section.

  1. Update grub with the cgroup v2 systemd kernel parameter.

    sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"

    You can instead specify only your current boot entry by running sudo grubby --update-kernel=/boot/vmlinuz-$(uname -r) --args="systemd.unified_cgroup_hierarchy=1".

  2. Confirm the changes.

    cat /etc/default/grub | grep systemd.unified_cgroup_hierarchy
  3. Reboot the instance for the changes to take effect.

    sudo systemctl reboot

    Note: Wait a few minutes for the instance to restart.

  4. Reconnect to the ol-node-01 instance using SSH.
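
    Once reconnected, you can optionally confirm that the running kernel booted with the new parameter before verifying cgroup v2 itself:

    grep systemd.unified_cgroup_hierarchy /proc/cmdline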

Verify that Cgroup v2 is Enabled

  1. Check the cgroup controller list.

    cat /sys/fs/cgroup/cgroup.controllers

    The output should return similar results: cpuset cpu io memory hugetlb pids rdma

  2. Check the cgroup2 mounted file system.

    mount | grep cgroup2

    The output should return similar results: cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

  3. Inspect the contents of the cgroup mounted directory.

    ll /sys/fs/cgroup

    Example output:

    total 0
    -r--r--r--.  1 root root 0 Mar 13 21:20 cgroup.controllers
    -rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.max.depth
    -rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.max.descendants
    -rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.procs
    -r--r--r--.  1 root root 0 Mar 13 21:20 cgroup.stat
    -rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.subtree_control
    -rw-r--r--.  1 root root 0 Mar 13 21:20 cgroup.threads
    -rw-r--r--.  1 root root 0 Mar 13 21:20 cpu.pressure
    -r--r--r--.  1 root root 0 Mar 13 21:20 cpuset.cpus.effective
    -r--r--r--.  1 root root 0 Mar 13 21:20 cpuset.mems.effective
    drwxr-xr-x.  2 root root 0 Mar 13 21:20 init.scope
    -rw-r--r--.  1 root root 0 Mar 13 21:20 io.pressure
    -rw-r--r--.  1 root root 0 Mar 13 21:20 memory.pressure
    drwxr-xr-x. 87 root root 0 Mar 13 21:20 system.slice
    drwxr-xr-x.  4 root root 0 Mar 13 21:24 user.slice

    The output shows the root control group at its default location. The directory contains interface files, all prefixed with cgroup, plus systemd-related directories whose names end in .scope and .slice.
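
    To list only the systemd-related child cgroups at this level, you can filter on those directory names; the exact set of slices varies from system to system:

    ls -d /sys/fs/cgroup/*.slice /sys/fs/cgroup/*.scope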

Work with the Virtual File System

Before we get started, we need to learn a bit about the cgroup virtual file system mounted at /sys/fs/cgroup.

  1. Show which CPUs participate in the cpuset for everyone.

    cat /sys/fs/cgroup/cpuset.cpus.effective

    The output shows a range starting at 0 that indicates the system's effective CPUs, which consist of a combination of CPU cores and threads.

  2. Show which controllers are active.

    cat /sys/fs/cgroup/cgroup.controllers

    Example output:

    cpuset cpu io memory hugetlb pids rdma misc

    It's good to see the cpuset controller present as we'll use it later in this tutorial.

  3. Show processes spawned by oracle.

    First, we need to determine oracle's user id (UID).

    who
    id

    Example output:

    [oracle@ol-node-01 ~]$ who
    oracle   pts/0        2022-03-13 21:23 (10.39.209.157)
    [oracle@ol-node-01 ~]$ id
    uid=1001(oracle) gid=1001(oracle) groups=1001(oracle),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

    Using the UID, we can find the oracle user's slice.

    cd /sys/fs/cgroup/user.slice
    ls

    Example output:

    [oracle@ol-node-01 ~]$ cd /sys/fs/cgroup/user.slice
    [oracle@ol-node-01 user.slice]$ ls
    cgroup.controllers      cgroup.subtree_control  memory.events        memory.pressure      pids.max
    cgroup.events           cgroup.threads          memory.events.local  memory.stat          user-0.slice
    cgroup.freeze           cgroup.type             memory.high          memory.swap.current  user-1001.slice
    cgroup.max.depth        cpu.pressure            memory.low           memory.swap.events   user-989.slice
    cgroup.max.descendants  cpu.stat                memory.max           memory.swap.max
    cgroup.procs            io.pressure             memory.min           pids.current
    cgroup.stat             memory.current          memory.oom.group     pids.events

    Systemd assigns every user a slice named user-<UID>.slice. So, what's under that directory?

    cd user-1001.slice
    ls

    Example output:

    [oracle@ol-node-01 user.slice]$ cd user-1001.slice/
    [oracle@ol-node-01 user-1001.slice]$ ls
    cgroup.controllers  cgroup.max.descendants  cgroup.threads  io.pressure        user-runtime-dir@1001.service
    cgroup.events       cgroup.procs            cgroup.type     memory.pressure
    cgroup.freeze       cgroup.stat             cpu.pressure    session-3.scope
    cgroup.max.depth    cgroup.subtree_control  cpu.stat        user@1001.service

    This is the top-level cgroup for the oracle user. However, there are no processes listed in cgroup.procs. So, where is the list of user processes?

    cat cgroup.procs

    Example output:

    [oracle@ol-node-01 user-1001.slice]$ cat cgroup.procs
    [oracle@ol-node-01 user-1001.slice]$

    When oracle opened the SSH session at the beginning of this tutorial, the user session created a scope sub-unit. Under this scope, we can check the cgroup.procs for a list of processes spawned under that session.

    Note: The user might have multiple sessions based on the number of connections to the system; therefore, replace the 3 in the sample below as necessary. You can also list session IDs with loginctl, as shown at the end of this section.

    cd session-3.scope
    ls
    cat cgroup.procs

    Example output:

    [oracle@ol-node-01 user-1001.slice]$ cd session-3.scope/
    [oracle@ol-node-01 session-3.scope]$ ls
    cgroup.controllers  cgroup.max.depth        cgroup.stat             cgroup.type   io.pressure
    cgroup.events       cgroup.max.descendants  cgroup.subtree_control  cpu.pressure  memory.pressure
    cgroup.freeze       cgroup.procs            cgroup.threads          cpu.stat
    [oracle@ol-node-01 session-3.scope]$ cat cgroup.procs
    3189
    3200
    3201
    54217

    Now that we have found the processes the hard way, we can use systemd-cgls to show the same information in a tree-like view.

    Note: When run from within the virtual filesystem, systemd-cgls limits the cgroup output to the current working directory.

    cd /sys/fs/cgroup/user.slice/user-1001.slice
    systemd-cgls

    Example output:

    [oracle@ol-node-01 user-1001.slice]$ systemd-cgls
    Working directory /sys/fs/cgroup/user.slice/user-1001.slice:
    ├─session-3.scope
    │ ├─ 3189 sshd: oracle [priv]
    │ ├─ 3200 sshd: oracle@pts/0
    │ ├─ 3201 -bash
    │ ├─55486 systemd-cgls
    │ └─55487 less
    └─user@1001.service
      └─init.scope
        ├─3193 /usr/lib/systemd/systemd --user
        └─3195 (sd-pam)
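
    As mentioned in the note above, a simpler way to discover your session ID (the N in session-<N>.scope) is with loginctl; list the active sessions and, if you like, inspect a single one by its ID:

    loginctl list-sessions
    loginctl session-status <session_id>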

Limit the CPU Cores Used

With cgroup v2, systemd has complete control of the cpuset controller. This level of control enables an administrator to schedule work on only a specific CPU core.

  1. Check CPUs for user.slice.

    cd /sys/fs/cgroup/user.slice
    ls
    cat ../cpuset.cpus.effective

    Example output:

    [oracle@ol-node-01 cgroup]$ cd /sys/fs/cgroup/user.slice/
    [oracle@ol-node-01 user.slice]$ ls
    cgroup.controllers      cgroup.subtree_control  memory.events        memory.pressure      pids.max
    cgroup.events           cgroup.threads          memory.events.local  memory.stat          user-0.slice
    cgroup.freeze           cgroup.type             memory.high          memory.swap.current  user-1001.slice
    cgroup.max.depth        cpu.pressure            memory.low           memory.swap.events   user-989.slice
    cgroup.max.descendants  cpu.stat                memory.max           memory.swap.max
    cgroup.procs            io.pressure             memory.min           pids.current
    cgroup.stat             memory.current          memory.oom.group     pids.events
    [oracle@ol-node-01 user.slice]$ cat ../cpuset.cpus.effective
    0-1

    The cpuset.cpus.effective file shows the actual cores available to user.slice. If a parameter does not exist in a specific cgroup directory, or is not set, its value is inherited from the parent, which in this case is the top-level cgroup root directory.

  2. Restrict the system and user 0, 1001, and 989 slices to CPU core 0.

    cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
    sudo systemctl set-property system.slice AllowedCPUs=0
    cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective

    Example output:

    [oracle@ol-node-01 user.slice]$ cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
    cat: /sys/fs/cgroup/system.slice/cpuset.cpus.effective: No such file or directory
    [oracle@ol-node-01 user.slice]$ sudo systemctl set-property system.slice AllowedCPUs=0
    [oracle@ol-node-01 user.slice]$ cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
    0

    Note: The No such file or directory error indicates that, by default, the system slice inherits its cpuset.cpus.effective value from the parent.

    sudo systemctl set-property user-0.slice AllowedCPUs=0
    sudo systemctl set-property user-1001.slice AllowedCPUs=0
    sudo systemctl set-property user-989.slice AllowedCPUs=0
  3. Restrict the ralph user to CPU core 1.

    sudo systemctl set-property user-8000.slice AllowedCPUs=1
    cat /sys/fs/cgroup/user.slice/user-8000.slice/cpuset.cpus.effective

    Example output:

    [oracle@ol-node-01 ~]$ sudo systemctl set-property user-8000.slice AllowedCPUs=1
    [oracle@ol-node-01 ~]$ cat /sys/fs/cgroup/user.slice/user-8000.slice/cpuset.cpus.effective
    1
  4. Open a new terminal and connect via ssh as ralph to the ol-node-01 system.

    ssh ralph@<ip_address_of_instance>
  5. Test using the foo.exe script.

    foo.exe &

    Verify the results.

    top

    Once top is running, hit the 1 key to show the CPUs individually.

    Example output:

    top - 18:23:55 up 21:03,  2 users,  load average: 1.03, 1.07, 1.02
    Tasks: 155 total,   2 running, 153 sleeping,   0 stopped,   0 zombie
    %Cpu0  :  6.6 us,  7.0 sy,  0.0 ni, 84.8 id,  0.0 wa,  0.3 hi,  0.3 si,  1.0 st
    %Cpu1  : 93.0 us,  6.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.3 si,  0.0 st
    MiB Mem :  14707.8 total,  13649.1 free,    412.1 used,    646.6 buff/cache
    MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  13993.0 avail Mem
    
       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    226888 ralph     20   0  228492   1808   1520 R  99.7   0.0 199:34.27 sha1sum
    269233 root      20   0  223724   6388   1952 S   1.3   0.0   0:00.04 pidstat
       1407 root      20   0  439016  41116  39196 S   0.3   0.3   0:17.81 sssd_nss
       1935 root      20   0  236032   3656   3156 S   0.3   0.0   0:34.34 OSWatcher
       2544 root      20   0  401900  40292   9736 S   0.3   0.3   0:10.62 ruby
          1 root      20   0  388548  14716   9508 S   0.0   0.1   0:21.21 systemd
    ...

    Type q to quit top.

  6. Use an alternate method to check which processor is running a process.

    ps -eo pid,psr,user,cmd | grep ralph

    Example output:

    [ralph@ol-node-01 ~]$ ps -eo pid,psr,user,cmd | grep ralph
     226715   1 root     sshd: ralph [priv]
     226719   1 ralph    /usr/lib/systemd/systemd --user
     226722   1 ralph    (sd-pam)
     226727   1 ralph    sshd: ralph@pts/2
     226728   1 ralph    -bash
     226887   1 ralph    /bin/bash /usr/local/bin/foo.exe
     226888   1 ralph    /usr/bin/sha1sum /dev/zero
     269732   1 ralph    ps -eo pid,psr,user,cmd
     269733   1 ralph    grep --color=auto ralph

    The psr column shows the CPU number that each process (the cmd column) is currently running on.
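
    While psr shows where a process is currently running, you can also query which CPUs a process is allowed to run on; substitute the PID of the sha1sum process from the ps output above:

    taskset -cp <pid_of_sha1sum>

    With the AllowedCPUs setting applied to user-8000.slice, the reported affinity list should be 1.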

  7. Exit and close the current terminal and switch to the other existing terminal connected to ol-node-01.

  8. Kill the foo.exe job.

    sudo pkill sha1sum
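
    Changes made with systemctl set-property are persistent unit properties, so you can review them at any time. The exact formatting may vary by systemd version, but the output should report the CPU assigned earlier:

    systemctl show -p AllowedCPUs user-8000.slice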

Adjust the CPU Weight for Users

Time to have alice join in the fun. She has some critical work to complete, so we'll give her twice the normal CPU priority.

  1. Assign alice to the same CPU as ralph.

    sudo systemctl set-property user-8001.slice AllowedCPUs=1
    cat /sys/fs/cgroup/user.slice/user-8001.slice/cpuset.cpus.effective
  2. Set CPUWeight.

    sudo systemctl set-property user-8001.slice CPUWeight=200
    cat /sys/fs/cgroup/user.slice/user-8001.slice/cpu.weight

    The default weight is 100, so a value of 200 gives alice twice ralph's share. When both users compete for the same CPU, alice should receive roughly 200/(200+100), or about 67%, of the CPU time and ralph about 33%, which matches the top output below.

  3. Open a new terminal and connect via SSH as ralph to the ol-node-01 system.

    ssh ralph@<ip_address_of_instance>
  4. Run foo.exe as ralph.

    foo.exe &
  5. Open another new terminal and connect via SSH as alice to the ol-node-01 system.

    ssh alice@<ip_address_of_instance>
  6. Run foo.exe as alice.

    foo.exe &
  7. Verify via top that alice is getting the higher priority.

    top

    Once top is running, hit the 1 key to show the CPUs individually.

    Example output:

    top - 20:10:55 up 25 min,  3 users,  load average: 1.29, 0.46, 0.20
    Tasks: 164 total,   3 running, 161 sleeping,   0 stopped,   0 zombie
    %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 96.5 id,  0.0 wa,  0.0 hi,  3.2 si,  0.3 st
    %Cpu1  : 92.4 us,  7.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    MiB Mem :  15715.8 total,  14744.6 free,    438.5 used,    532.7 buff/cache
    MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  15001.1 avail Mem 
    
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
       7934 alice     20   0   15800   1768   1476 R  67.0   0.0   0:36.15 sha1sum  
       7814 ralph     20   0   15800   1880   1592 R  33.3   0.0   0:34.60 sha1sum  
          1 root      20   0  388476  14440   9296 S   0.0   0.1   0:02.22 systemd  
          2 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
    ...
  8. Switch to the terminal logged in as the oracle user.

  9. Load the system.slice using the foo.service.

    sudo systemctl start foo.service

    Now look at the top output, which is still running in the alice terminal window. Notice that foo.service consumes CPU 0 while the users split CPU 1 according to their weights.

    Example output:

    top - 19:18:15 up 21:57,  3 users,  load average: 2.15, 2.32, 2.25
    Tasks: 159 total,   4 running, 155 sleeping,   0 stopped,   0 zombie
    %Cpu0  : 89.1 us,  7.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.3 si,  2.6 st
    %Cpu1  : 93.7 us,  5.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.3 si,  0.0 st
    MiB Mem :  14707.8 total,  13640.1 free,    420.5 used,    647.2 buff/cache
    MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  13984.3 avail Mem
    
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     280921 root      20   0  228492   1776   1488 R  93.4   0.0   0:07.74 sha1sum
     279185 alice     20   0  228492   1816   1524 R  65.6   0.0   7:35.18 sha1sum
     279291 ralph     20   0  228492   1840   1552 R  32.8   0.0   7:00.30 sha1sum
       2026 oracle-+  20   0  935920  29280  15008 S   0.3   0.2   1:03.31 gomon
          1 root      20   0  388548  14716   9508 S   0.0   0.1   0:22.30 systemd
          2 root      20   0       0      0      0 S   0.0   0.0   0:00.10 kthreadd
    ...
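
    You can also read the effective weights directly from the cgroup file system; with the settings above, ralph's slice should show the default of 100 and alice's should show 200:

    cat /sys/fs/cgroup/user.slice/user-8000.slice/cpu.weight
    cat /sys/fs/cgroup/user.slice/user-8001.slice/cpu.weight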

Assign a CPU Quota

Lastly, we will cap the CPU time for ralph.

  1. Return to the terminal logged in as the oracle user.

  2. Set the quota to 5%.

    sudo systemctl set-property user-8000.slice CPUQuota=5%

    The change takes effect immediately, as seen in the top output, which still runs in the alice user terminal.

    Example output:

    top - 19:24:53 up 22:04,  3 users,  load average: 2.21, 2.61, 2.45
    Tasks: 162 total,   4 running, 158 sleeping,   0 stopped,   0 zombie
    %Cpu0  : 93.0 us,  4.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.7 hi,  0.0 si,  1.7 st
    %Cpu1  : 91.7 us,  5.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  1.0 hi,  1.0 si,  0.7 st
    MiB Mem :  14707.8 total,  13639.4 free,    420.0 used,    648.4 buff/cache
    MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  13984.7 avail Mem
    
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     280921 root      20   0  228492   1776   1488 R  97.4   0.0   6:26.75 sha1sum
     279185 alice     20   0  228492   1816   1524 R  92.1   0.0  12:21.12 sha1sum
     279291 ralph     20   0  228492   1840   1552 R   5.3   0.0   8:44.84 sha1sum
          1 root      20   0  388548  14716   9508 S   0.0   0.1   0:22.48 systemd
          2 root      20   0       0      0      0 S   0.0   0.0   0:00.10 kthreadd
    ...
  3. Revert the cap on the ralph user using the oracle terminal window.

    echo "max 100000" | sudo tee -a user-8000.slice/cpu.max

    The quota is stored in the cpu.max file, and its default values are max 100000. The relative path in this command assumes your current working directory is still /sys/fs/cgroup/user.slice.

    Example output:

    [oracle@ol-node-01 user.slice]$ echo "max 100000" | sudo tee -a user-8000.slice/cpu.max
    max 100000
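
    The file holds two values: the allowed quota and the period, both in microseconds. While the 5% cap was in place, it would have contained something like 5000 100000; after the reset, you can confirm it is back to the default:

    cat /sys/fs/cgroup/user.slice/user-8000.slice/cpu.max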

    In this tutorial, you enabled cgroup v2, limited users to a specific CPU while the system was under load, and capped a user to a percentage of that CPU.

Next Steps

Thank you for completing this tutorial. Hopefully, these steps have given you a better understanding of installing, configuring, and using control group version 2 on Oracle Linux.
