I have never dug into low level things like cpu architectures etc. and decided to give it a try when I learned about cpu.land.
I already was aware of the existence of user and kernel mode but while I was reading site it came to me that “I still can harm my system with userland programs so what does it mean to switch user mode for almost everything other than kernel and drivers?” also we still can do many things with syscalls, what is that stopping us(assuming we want to harm system of course) from damaging our system.
[edit1]: grammar mistakes
The reason behind kernel mode/user mode separation is to require all user-land programs to have to go through the kernel to do any modification to the system. In other words, would it not be for syscalls, the only thing a user land program could do would be to burn CPU cycles. And even then, the kernel can still preempt it any time to let other, potentially more important programs, run instead.
So if a program can harm your system from userland, it’s because the kernel allowed it, every time. Which is why we currently see a slow move toward sandboxing everything. Basically, the idea of sandboxing is to give the kernel enough information about the running program so that we can tailor which syscalls it can do and with which arguments. For example: you want to prevent an application from accessing the network? Prevent it from allocating sockets through the associated syscall.
The reason for this slow move is historical really: introducing all those protections from the get go would require a lot of development time to start with, but it had to be built unpon non-existant security layers and not break all programs in the process. CPUs were not even powerful enough to waste cycles on such concerns.
Now, to better understand user mode/kernel mode, you have to realize that there are actually more modes than this. I can only speak for the ARM architecture because it’s the one I know, but x86 has similar mechanisms. Basically, from the CPU perspective, you have several privilege levels. On x86 those are called rings, on ARM, they’re called Exception Level. On ARM, a CPU has up to four of those, EL3 to EL0. They also have names based on their purpose (inherited from ARMv7). So EL3 is firmware level, EL2 is hypervisor, EL1 is system and EL0 is user. A kernel typically run on EL2 and EL1. EL3 is reserved for the firmware/boot process to do the most basic setup, partly required by the other ELs. EL2 is called hypervisor because it allows to have several virtual EL1 (and even EL2). In other words, a kernel running at EL2 can run several other kernels at EL1: this is virtualization and how VMs are implemented. Then you have your kernel/user land separation with most of the kernel (and driver) logic running at EL1 and the user programs running at EL0.
Each level allocates resources for the sub-level (under the form of memory map, as memory maps, which do not necessarily map to RAM, are also used to talk to devices). Would a level try to access a resource (memory address) it has no rights to, an exception would be raised to the upper level, which would then decide what to do: let it through or terminate the program (the later translates to a kernel panic/BSOD when the program in question is the kernel itself or a segmentation fault/bus error for user land programs).
This mechanism is fairly easy to understand with the swap mechanism: the kernel allows your program to access some page in memory when asked through brk or mmap, used by malloc. But then, when the system is under memory pressure, and it turns out your program has not used that memory region for a little while, the kernel swaps it out. Which means your program is now forbidden from accessing this memory. When the program tries to access that memory again, the kernel is informed of the action through a exception raised (unintentionally) by your program. The kernel then swaps back the memory region from disk, allows your program to access the memory region again, and then let the program resume to a state prior to the memory access (that it will then re-attempt without even realizing).
So basically, a level is fully responsible for what a sub-level does. In theory, you could have no protection at all: EL1 (the kernel) could allow EL0 to modify all the memory EL1 has access to (again, those are memory maps, that can also map to devices, not necessarily RAM). In practice, the goal of EL1 is to let nothing through without being involved itself: the program wants to write something on the disk: syscall, wants more memory: syscall, wants to draw something on the screen: syscall, use the network: syscall, talk to another program: syscall.
But the reason is not only security. It is also, and most importantly, abstraction. For example, when talking to a USB device, a user program does not have to know the USB protocol. This is implemented once in the kernel and then userland programs can use that to deal with all the annoying stuff such as timings, buffers, interruptions and so on. So the syscalls were initially designed for that: build a library of functions all user programs can re-use without having to re-implement them, or worse, without having to deal with the specifics of every device/vendor: this is the sole responsibility of the kernel.
So there you have it: a user program cannot harm the computer without going through the kernel first. But the kernel allows it nonetheless because it was not initially designed as a security feature. The security concerns came afterward and were initially implemented with users, which are mostly enough for servers, and where root has nearly as many privileges as the kernel itself (because the kernel allows it). Those are currently being improved under the form of sandboxes, for which the work started a while ago, with every OS (and CPU architecture) having its own implementation. But we are only seeing widespread adoption by userland since fairly recently on desktop. Partly thanks to the push from smartphones where application-level privileges (to access the camera for example) were born AFAIK.
Nowadays, CPUs are powerful enough to even have security features to try to protect a userland program from itself: from buffer overflow, return address manipulation and the like. If you’re interested, I recommend you look at the concept of pointer authentication.
Great write up, very informative. Thank you.
I’d say that the separation is not just about security. It’s also about performance and stability through separation of duties. The kernelb does a lot of work that is but directly related to current activity in userland. A good portion of this is to keep hardware in a state where userland programs can reliably run without having to reimplement low level functionality.
Additionally, when it comes to performance, it’s worth looking at monolith vs microkernels. It may seem counter-intuitive but, for general-purpose computing, monolithic kernels outperform microkernels by a wide margin. This is due to the near exponential increase in context switching that ends up being required by microkernels for the sorts of tasks needed for such use cases.
I appreciate every single second you spent on this, I hope more people will benefit
This was very informative, thanks !