1. Set up cgroup controls to be passed into the program using command-line arguments
The file sr_container.c has an array ‘cgroups’ that holds all the cgroup controls for the newly created container. This array holds structs of type ‘cgroups_control’, ideally one entry per cgroup control (memory, cpu, cpuset, blkio). Each struct has a double pointer ‘settings’ that points to a collection of structs of type ‘cgroup_setting’. Each of those holds one setting specific to that cgroup controller.
Ex:
cgroup-controller: memory
cgroup-settings: memory.limit_in_bytes, memory.kmem.limit_in_bytes, tasks
So you must fill in the ‘cgroups’ array with ‘cgroups_control’ elements, each with a list of relevant settings as shown above. The given code has an example for the ‘blkio’ controller. Note that every control has the ‘tasks’ setting, which ensures the process is added to the tasks list of that cgroup.
You must update the main() of the given program to support more flags. These flags let the user of the program set cgroup controls when running it, and you must fill in the above array with the corresponding values. Look at how the arguments are currently handled in main() and extend that code to fetch the additional flags and update the array. The flags to be supported are given as comments in the template code. Note that the 4th flag was changed from blkio-weight to memory.
In addition to the cgroup controls, one more flag must be supported to provide the program with a hostname. The value of this flag must be stored in the ‘hostname’ attribute of the ‘child_config’ struct created at the beginning of main().
2. Implement the child process creation logic
Fill in the missing portion of the code in main() [in sr_container.c] (lines 171 – 186) to create a child process with namespace isolation for the following namespaces: Network, Cgroup, PID, IPC, Mount, UTS (don’t add the User namespace).
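One common way to get this kind of isolation is clone() with one CLONE_NEW* flag per namespace. The sketch below is only an illustration of that approach: the helper names and the stack size are made up for this example, and the template may already provide a stack and a child function for you to use.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

#define CHILD_STACK_SIZE (1024 * 1024) /* assumed size for this sketch */

/* The six namespaces listed in the handout; CLONE_NEWUSER is left out. */
int namespace_flags(void)
{
    return CLONE_NEWNET | CLONE_NEWCGROUP | CLONE_NEWPID |
           CLONE_NEWIPC | CLONE_NEWNS | CLONE_NEWUTS;
}

/* Hypothetical helper: run child_fn(arg) in new namespaces.
 * Returns the child's PID, or -1 on failure. Creating these namespaces
 * needs privileges, so expect -1 when not running as root. */
pid_t spawn_isolated(int (*child_fn)(void *), void *arg)
{
    assert(child_fn != NULL);
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;
    /* The stack grows downward on most architectures, so pass its top.
     * SIGCHLD lets the parent wait for the child with waitpid(). */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      namespace_flags() | SIGCHLD, arg);
    /* Without CLONE_VM the child runs on its own copy of this memory,
     * so the parent's copy can be released right away. */
    free(stack);
    return pid;
}
```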
3. Changing root using pivot_root()
Complete the method switch_child_root() in the sr_container_helpers.c file using the pivot_root() system call. For info on the arguments to use with pivot_root(), see: http://man7.org/linux/man-pages/man2/pivot_root.2.html
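glibc does not provide a pivot_root() wrapper, so the call is usually made through syscall(). The following is a minimal sketch of one typical sequence, not the template's method: the names sr_pivot_root() and switch_root_sketch() and the ‘.old_root’ directory are invented for this example, and it assumes the caller is already in a new mount namespace whose mounts are private.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
static int sr_pivot_root(const char *new_root, const char *put_old)
{
    return (int)syscall(SYS_pivot_root, new_root, put_old);
}

/* Hypothetical sketch: make new_root the root of the current mount
 * namespace. Returns 0 on success and -1 on the first failing step. */
int switch_root_sketch(const char *new_root)
{
    char put_old[PATH_MAX];
    assert(new_root != NULL);

    /* pivot_root() needs new_root to be a mount point, so bind-mount
     * the directory onto itself. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) == -1)
        return -1;

    /* put_old must be a directory at or below new_root. */
    int n = snprintf(put_old, sizeof(put_old), "%s/.old_root", new_root);
    if (n < 0 || (size_t)n >= sizeof(put_old))
        return -1;
    if (mkdir(put_old, 0700) == -1)
        return -1;

    if (sr_pivot_root(new_root, put_old) == -1)
        return -1;
    if (chdir("/") == -1)
        return -1;

    /* The old root is now visible at /.old_root; detach and remove it
     * so the container cannot reach the host file system. */
    if (umount2("/.old_root", MNT_DETACH) == -1)
        return -1;
    return rmdir("/.old_root");
}
```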
4. Setting capabilities to the container
For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials (usually: effective UID, effective GID, and supplementary group list).
In recent kernel versions, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Thus with this new feature the kernel can control privileges allowed to a traditional super-user process.
Capabilities basically subdivide the property of being “root”. We can restrict certain kinds of access for a process even though it has root privileges. For example, we may allow a process to configure network devices (CAP_NET_ADMIN) but disallow reading all files (CAP_DAC_OVERRIDE). However, not all of the properties of being root are subdivided into capabilities. Some properties are still accessible after dropping capabilities.
Read here for more info: http://man7.org/linux/man-pages/man7/capabilities.7.html (You can complete the assignment with just the description in this handout.)
In this assignment we want some of these harmful or unnecessary capabilities to be disabled in our SRContainer. The capabilities that must be disabled are:
CAP_AUDIT_CONTROL, CAP_AUDIT_READ, CAP_AUDIT_WRITE, CAP_BLOCK_SUSPEND, CAP_DAC_READ_SEARCH, CAP_FSETID, CAP_IPC_LOCK, CAP_MAC_ADMIN, CAP_MAC_OVERRIDE, CAP_MKNOD, CAP_SETFCAP, CAP_SYSLOG, CAP_SYS_ADMIN, CAP_SYS_BOOT, CAP_SYS_MODULE, CAP_SYS_NICE, CAP_SYS_RAWIO, CAP_SYS_RESOURCE, CAP_SYS_TIME, CAP_WAKE_ALARM
Disabling capabilities involves 2 steps:
Dropping the said capability from the ambient capability set of the process.
Clearing the said capability from the inheritable capability set of the process.
You can read more about the different capability sets of a process in the man page. However, the description that follows should be sufficient to complete the assignment.
Ex: Say you want to disable the capabilities CAP_MKNOD and CAP_SYS_BOOT:
Use prctl() to drop the capabilities from the AMBIENT set
Use cap_get_proc() to get the capability sets of the process
Use cap_set_flag() to clear the capabilities from the INHERITABLE set
Use cap_set_proc() to set the cleared set back to the process
Use the approach shown above to complete the setup_child_capabilities() method in sr_container_helpers.c.
You can test if this works by simply running “mknod <SOME_NAME> p”. If the capabilities have been set properly then this should fail.
To test if the capabilities were set properly you can do the following:
Copy the binary 'capsh' found inside the [/sbin] folder of the docker container into the [/sbin] folder of the 'rootfs' you downloaded to run containers:
cp /sbin/capsh $ROOTFS/sbin/
Now if you run 'capsh --print' [inside our SNR_CONTAINER] without this method implemented (i.e., capabilities not being filtered), the output for [Bounding set] will list many capabilities.
But after properly implementing this method (filtering capabilities), running the same command inside your SNR_CONTAINER container will show a smaller set of capabilities for [Bounding set].
5. Disabling system calls inside a container
In addition to disabling capabilities, we also want to restrict processes inside our SRContainer from using certain system calls that could lead to a vulnerable state. Seccomp is a kernel feature that can be used to achieve this. It lets you control which system calls a process and all its children have access to. It also lets you set the action to take (kill the process, raise a signal, just allow it, etc.) when a process tries to execute such a system call. The intent is to let untrusted processes use the resources provided by the kernel with restricted access, without abusing them.
In this assignment we will use this seccomp kernel feature to limit the system calls allowed to the processes within our SRContainer. Support for the seccomp feature is provided by the libseccomp library.
The steps for setting up this system-call restriction are as follows:
Create a system-call filtering context with a default behavior for all system calls.
Set up filters on this context for certain system calls that must be handled differently.
Set any attributes that apply to the created seccomp context.
Load the newly configured context into the kernel.
Release any memory allocated for the seccomp context that was just configured. This does not affect the context that was loaded into the kernel.
See the detailed description below of each of these steps.
(Trust me you can just use this as a one-to-one template to finish this part of the assignment)
STEP-1:
STEP-2:
In the last example (2nd image) we want to capture calls to unshare() only if the ‘CLONE_NEWUSER’ flag is used.
So in the call to seccomp_rule_add(), in its 4th argument we say we want “one” argument match on the call to unshare.
We include what this match is in the 5th argument of seccomp_rule_add().
SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_NEWUSER, CLONE_NEWUSER)
SCMP_A0 -
Matches on the 0th argument of unshare(). With SCMP_A1 the match would be on the 1st argument. Notice that arguments are 0-indexed, like arrays in C.
SCMP_CMP_MASKED_EQ -
Says that the comparison is not an exact match but a check on a MASKED argument. This is because the argument to unshare() can be an OR of many flags: CLONE_FS | CLONE_FILES | CLONE_VM | CLONE_NEWUSER.
3rd Argument: The mask for validation
4th Argument: What it must be equal to
Similarly, you can write rules that match specific arguments of a system call when filtering it.
STEP-3:
Set the filter attribute value of SCMP_FLTATR_CTL_NNP.
STEP-4:
Load the created context into the kernel. You can simply re-use Step-3 and 4 as is.
Your task (should you choose to accept it) is to:
Complete the method setup_syscall_filters() in the sr_container_helpers.c file to STOP our SRContainer from invoking the following system calls. Any process that attempts to run these system calls must be killed (SCMP_ACT_KILL SCMP_FAIL). All other system calls must be allowed.
ptrace
mbind
migrate_pages
move_pages
unshare (Only restrict if the CLONE_NEWUSER flag is used)
clone (Only restrict if the CLONE_NEWUSER flag is used)
chmod (Only restrict if the S_ISUID or S_ISGID flags are used for the “mode” argument)
You can test if this works by simply writing a C program which tries to use one of the system calls above.
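Such a test program could look roughly like this. It runs the attempt in a forked child so the parent survives and can report what happened. All function names are made up for this sketch. Add a main() that prints the result of probe_syscall(), then build and copy the binary the same way as capsh.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run attempt() in a child process. Returns the number of the signal
 * that killed the child, 0 if the child exited normally, -1 on error. */
int probe_syscall(void (*attempt)(void))
{
    assert(attempt != NULL);
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        attempt();
        _exit(0); /* reached only if the call was not fatal */
    }
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* A call the filter is supposed to block. */
void try_unshare_newuser(void)
{
    (void)unshare(CLONE_NEWUSER);
}

/* Control case: a call the filter is supposed to allow. */
void try_getpid(void)
{
    (void)getpid();
}
```

Inside a correctly filtered container, probe_syscall(try_unshare_newuser) would be expected to report a fatal signal (SIGSYS when the action is SCMP_ACT_KILL), while probe_syscall(try_getpid) should return 0.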
Instructions to copy your code into the host-container environment.
You must first copy the template code folder “A3Template” to the cs310 server.
scp -r <path_in_your_pc>/A3Template <socs_uname>@cs310.cs.mcgill.ca:~
You can compile the program by simply running ‘make container’ with the given Makefile, or use the complete ‘gcc’ command:
gcc -o SNR_CONTAINER -g -Wall -Werror sr_container.c sr_container_helpers.c sr_container_utils.c -lseccomp -lcap
Then, you must copy the built executable into your own docker-container environment.
That is, the container you created with ‘docker run’ for Phase-1.
docker cp ~/A3Template/SNR_CONTAINER <container_name>:/home
Now, if you go into your container using:
docker exec -it <container_name> /bin/bash
You should see your executable and you can run it with the correct flags.
Do not copy your entire A3Template code into your docker container. Only copy the built executable.
What to submit:
sr_container.c (with your changes)
sr_container_helpers.c (with your changes)
No need to submit any other files since you will not have to change them.
Rubric (This Phase accounts for 40%)
1. Setting up cgroups/hostname flags: 7%
2. Implementing child process logic: 7%
3. Proper usage of pivot_root(): 6%
4. Implementing capabilities: 10%
5. Implementing syscall filtering: 10%