
Contents
- 1 Introduction
- 2 A high level view of implementing a threaded kernel
- 2.1 Scheduling, priorities and queues
- 2.2 Other considerations on intervals and interrupts
- 3 Implementing kernel threads
- 3.1 Allocating thread stacks & structures
- 3.2 The thread structure & cpu state
- 3.3 Creating a new thread
- 3.4 Cleaning up dead threads
- 4 Next topics…
1 – Introduction
To thread or not to thread? That is the question! Having a multi-threaded kernel does provide some compelling advantages. If your kernel plans to support multiple cpus, it’s pretty much a necessity. Even on a single-cpu system, having support for multiple threads enables us to have blocking events handled “in the background” without hanging up the entire kernel while we wait.
Of course, it’s quite possible to have a single threaded kernel on a single-cpu system and indeed, this was the model for many operating systems of yesterday. Also, having a multi-threaded environment means that one has to be wary of race conditions, data access synchronization, deadlocks and so forth. Modern machine performance characteristics (and user expectations) do make a single-threaded kernel rather anachronistic, especially if you’re writing a “general purpose” operating system kernel.
2 – A high level view of implementing a threaded kernel
Implementing a multi-threaded kernel isn’t actually that difficult, at least not as far as the actual thread mechanism itself is concerned. All you really need to do is be able to save and restore the thread context when required – and as we’re talking about kernel threads, it’s really just a matter of restoring the cpu register state and resuming where the thread last left off.
There are essentially two types of threading models; a cooperative model, where threads surrender the cpu voluntarily, and a pre-emptive model, where the context switching is interval driven, usually based on timer interrupts. In reality, the pre-emptive model also includes a cooperative element, for example when a thread voluntarily surrenders the cpu because it’s waiting on something.

Using the pre-emptive model, especially driven by the interval timer, really helps make thread switching easier. When an interrupt fires, the cpu saves a minimal current execution state on the stack and vectors through the interrupt descriptor table to the appropriate handler. When you return from the interrupt, the cpu restores the saved state from the stack and picks up where it left off.
This mechanism is almost ideal for our purposes – you have access to all the cpu state for the interrupted thread available, so as long as each thread has its own stack, all you need to do is save this state (basically all the registers the cpu didn’t save), choose the next thread to run, point to its stack, reload its state and get the cpu to resume it by “returning” from the interrupt.
Note: I’ll cover interrupts, the interrupt descriptor table (IDT) and vectoring to handlers in a separate article. The rest of this article assumes you’re already familiar with that aspect (well, you need to be anyway if you plan to do preemptive scheduling).
2.1 – Scheduling, priorities and queues
Although we’ve looked at switching and resuming threads at a high level, there’s a lot more that has to happen in practice. For example, if a thread was blocked on a resource, then it was probably put on a “sleep” queue – a linked list of blocked threads – and when that resource was available, the waiting thread(s) would be “woken” by marking them as runnable and putting them on the “run” queue – another linked list of threads to choose from when resuming. This means we have to select from a list of threads to decide what to resume next.
As the switch, select, resume sequence is interval driven, we need to get this done quickly. We don’t want to be processing a thread queue for so long that the next timer interval expires, especially as we have timer related work to do too (like updating our counters, waking up other threads that have been sleeping for a specified interval, etc.).
In a fairly quiet or idle system, most threads will be sleeping – waiting for something to happen (network packet, disk I/O completion, keyboard press, etc.). This means the run queue will have little or nothing on it. However, if we have a busy system and/or if it takes longer than one “clock tick” to do all this work, we’ll spend our entire time doing this and little else – not to mention introducing time lag into any timers we update based on “clock ticks” (timer interrupts)! Worse, on the 80×86 (PC) architecture, the programmable interval timer (PIT) that drives these interrupts fires at IRQ0, which could effectively pin all lower priority interrupts until we acknowledge it. Speed is the key!
One way to help with this is to assign priorities to threads and order the “run queue” by priority. This makes it quicker to pick the next, most appropriate, thread to resume. We can also acknowledge the IRQ0 interrupt early, to prevent pinning the system at a high interrupt level – although this means a little extra code to detect when a subsequent timer interrupt comes in while we’re still finishing off our scheduling, so we don’t end up “chasing our tail” and crashing spectacularly.
2.2 – Other considerations on intervals and interrupts
The above summarizes a well-known approach to threading and the timer interval in this model is often set to something like 10 milliseconds (100 Hz). A 10ms interval is usually more than enough to perform the aforementioned functions on a modern processor. However, the timer is programmable, so you can have longer or shorter intervals. Longer intervals will drive scheduling more slowly and will increase the granularity of any in-kernel timers you maintain. Shorter intervals will do the opposite and, especially for pseudo-realtime kernels, could be desirable if you want a higher time resolution…but this makes the work you do in the interval (between timer interrupts) even more time-critical.
Implementing kernel threads
This next section looks at some practical examples of how you might approach the implementation of threads in your own kernel. Note that this is going to be a relatively straightforward set of examples derived from my own kernel code – you will almost certainly have to adapt the examples to suit your own needs.
I’m going to try and break this out into sub-sections to avoid tackling the entire topic at once. Bear in mind that all these facets do fit together and have inter-dependencies, so watch for these (although I’ll try to point them all out). Finally, the code I’m referencing here (like this document) is all my own work and almost certainly contains bugs and errors that I hope to iron out over time
3.1 – Allocating thread stacks and structures
We will need some form of data structure to keep track of all the thread-specific state information and we need to allocate memory for these “thread structures”. We are also going to need to ensure each thread has a usable stack and this will need to be allocated from somewhere too. It’s up to you how big you make a thread stack but it needs to be large enough to prevent a stack overflow condition.
This typically happens when the code a thread is executing uses up more stack space than is available, often through many layers of nested function calls, recursion or by having very large “automatic” variables in the code. Consider this example…
void
read_data()
{
char buffer[1024];
[...]
}
In the above case, the function read_data() declares a local variable called buffer which is 1 KB in size. Space for non-persistent, local function variables is typically allocated from the stack by adjusting the stack pointer by the required amount (write something like the above, compile it and then look at the instructions the compiler generates – on most platforms, you should see a stack space reservation).
This causes a problem for kernel threads because variables like these can quickly eat up precious stack space. We need to use such declarations with caution but we also need to ensure that our thread stacks are sufficiently sized for reasonable use.
Side note: I’ve debugged numerous real-world OS kernel crashes where the engineer who wrote some kernel context code didn’t consider this and caused a thread to overflow its stack. Dynamic kernel thread stacks are really hard to implement and most kernels don’t bother. Bear this in mind when writing kernel code – especially file systems, layered over I/O, bus and device code. It’s amazing how quickly a few layers of inefficient code can eat a stack!
In my kernel, what I do is allocate a regular, single page (4 KB) for the thread stack, steal a chunk at the top for the thread structure itself and then set the stack pointer below it. For me, this strikes a nice balance between memory efficiency and sufficient sizing for later stack use.
3.2 – The thread structure and cpu state
Ok, with all that theory covered, let’s start putting it into practice and look at how we create a new thread in our kernel. Let’s start with the thread structure itself. I’m going to share the thread structure I currently use in my kernel with you, partly to illustrate what sort of information it contains but also because code examples later on will refer heavily to this structure.
typedef struct thread {
trapregs_t regs; /* save/resume register set */
struct thread *next, /* next thread in nucleus */
*prev, /* prev thread in nucleus */
*q_prev, /* prev thread in sched queue */
*q_next; /* next thread in sched queue */
volatile int state;
uint32_t priority, /* thread priority */
timeout; /* timeout - tick countdown */
void *wchan; /* whatever object we're waiting on */
uint64_t tick_on, /* tick when last put on CPU */
tick_off; /* tick when last taken off CPU */
vaddr_t start; /* start function address */
void *arg; /* argument to start function */
void *stack; /* thread stack */
uint32_t stksize; /* size of thread stack */
uint32_t rescnt;
uint8_t flags; /* internal implementation flags */
} thread_t;
Note in particular the trapregs_t regs structure. This is basically where the cpu state gets saved to and restored from when switching threads. As I mentioned further above, we use the interrupt event as an opportunity to save the cpu state of the interrupted thread so that it can be resumed later. It is very important that the order that you save registers in the interrupt handler called from the IDT matches your saved register state structure otherwise you’ll reload the wrong values into the cpu registers when you try to resume a thread and that is a very bad thing. Here’s my trapregs_t structure for your reference – note the comment!
/*
* WARNING
* =======
* The trapregs_t structure below is also used to hold saved register state
* for thread switching by the scheduler (this struct is referenced by the
* thread_t structure). The ordering of the registers in this structure
* is CRUCIAL to the handle_trap() code in interrupt.s and the resume() code
* in resume.s (although genofs should get the offsets right for you).
*
* Most importantly, handle_trap() does some diddling with the stack to
* exchange pushed trap handler vectors and type codes with registers that
* need to be passed as a parameter - for example, trap() takes a trapregs_t
* structure as an argument and its the handle_trap() routine that sets up
* this structure on the stack (hence the importance of ordering).
*/
typedef struct {
uint32_t ss, /* 0x00 */
ds, /* 0x04 */
es, /* 0x08 */
fs, /* 0x0c */
edi, /* 0x10 */
esi, /* 0x14 */
ebp, /* 0x18 */
esp, /* 0x1c */
edx, /* 0x20 */
eax, /* 0x24 */
ebx, /* 0x28 */
ecx, /* 0x2c */
eip, /* 0x30 */
cs, /* 0x34 */
eflags; /* 0x38 */
} trapregs_t;
This next bit of code is taken from my interrupt descriptor table (IDT) module. Except for some very specific exception types, all interrupts vector through a common piece of code that saves the cpu state on the stack in a known format (that’ll be trapregs_t) and then calls a handler. I’m including this here to help you see how it marries with the aforementioned register state structure because I suspect you’ll need a similar mechanism in your own implementation.
; Okay, handle_trap is the main "gateway" to the higher level trap handler. ; Each trap vector calls this subroutine to make the more complex call to ; the main trap handler for that type. This function saves registers on ; the stack so that the trap handlers are all called like this: ; ; handler(type, error_code, registers) ; ; Naturally, we don't want to duplicate this code for every vector - that's ; why each vector just pushes it's type and handler address before calling ; handle_trap. ; ; BEWARE of the ordering here. It is depended upon by the trapregs_t structure ; in nucleus/include/i386/regs.h (and, of course, we need to restore registers ; in the same order we saved them after calling the handler). ; global handle_trap:function (handle_trap.end - handle_trap) handle_trap: xchg eax, [esp] ; swap eax with address of handler xchg ebx, [esp + 4] ; swap ebx with trap type xchg ecx, [esp + 8] ; swap ecx with error code push edx ; save rest of registers mov edx, esp add edx, 16 ; account for stack pushes so far push edx ; push original esp at time of trap push ebp push esi push edi push fs push es push ds push ss push ecx ; push error code push ebx ; push type call eax ; call handler add esp, 8 ; skip past the pushed type and error pop ss ; restore registers pop ds pop es pop fs pop edi pop esi pop ebp ; note the restoration of eax, ebx and ecx below may look ; counter-intuitive at first. The reason for this particular ; ordering is that we exchanged eax, ebx and ecx with values on ; the stack and that means that we need to "pop" them of the ; stack in that same order ; ; As an extra confusion, the next thing on the stack is the original ; stack pointer - but we can't reload esp now as we've got more ; "popping" to do, so we stash it in ecx temporarily (which is the ; last value on the stack to restore and do an "exchange" of the ; values before popping off the esp - et voila, all ok! ; pop ecx ; ecx is actually the original esp pop edx pop eax pop ebx xchg ecx, [esp] ; exchange original esp in ecx with the pop esp ; saved ecx on the stack, then pop esp iret .end:
As you can see, we set up the stack so that whatever handler we call gets an exception (trap) type, an error code (if applicable) and a pointer to the saved register state on the stack. This is exactly what we use in our thread structure. We’ll revisit all of this when we get to the section on context switching between threads a little later. First, let’s look at how we create a thread in the first place.
3.3 – Creating a new thread
Still with me? Great! With all the previous information in mind, creating the thread itself is actually quite easy. A new thread needs a starting point a pointer to a function with an optional pointer to arguments, and some state initialization – which includes putting it on either the “run” queue (if we want the thread to start running asap) or on the “wait” queue, for those cases where we actually want the new thread to be created in a wait-state (i.e. for some later event).
All we need to do is allocate space for the stack and thread structure, set up the thread’s cpu state so that a “resume” would actually kick us off at the start function and then queue the sucker up. Have a look at my thread creation code as an example. First the entire function for reference and then I’ll walk through bits of it…
thread_t *
thread_create(thread_t *tp, void *start, void *arg, int state, uint32_t pri)
{
vaddr_t va = 0;
static int init = 0;
if (init == 0) {
init = 1;
atom_init(&thr_atom);
}
/*
* thread stacks, if not pre-allocated by our caller, come out
* of the low memory area (where virtual and physical addresses
* can be the same). We can steal some of it for the thread
* structure too.
*/
if (tp == NULL || tp->stack == NULL) {
if ((vmmu_map_vpage(NULL, &va, NULL, PP_WRITEABLE)) != 0) {
puts("couldn't map page\n");
return (NULL);
}
if (tp == NULL) {
tp = (thread_t *)(va + PAGESIZE - sizeof (thread_t));
memset(tp, 0, sizeof (thread_t));
tp->stksize = PAGESIZE - sizeof (thread_t);
} else
tp->stksize = PAGESIZE;
tp->stack = (void *)va;
tp->flags |= T_STKALLOC;
}
/*
* paranoia: if we have a stack but no stack size, that's a bug
* and this can happen if the caller supplies a pre-created thread
* struct with a stack but forgot to fill in the stack size!
*/
if (tp->stack && tp->stksize == 0) {
error(E_WARN, "WARNING: thread_create: stack without stack size\n");
return (NULL);
}
/*
* set up essential registers - we start a thread by "resuming" it
*/
tp->regs.ss = get_ss();
tp->regs.es = get_es();
tp->regs.ds = get_ds();
tp->regs.fs = get_fs();
tp->regs.esp = (uint32_t)tp->stack + tp->stksize - 0x14;
*(uint32_t *)tp->regs.esp = tp->start = (vaddr_t)start;
*(uint32_t *)(tp->regs.esp + 0x4) = tp->regs.cs = get_cs();
*(uint32_t *)(tp->regs.esp + 0x8) = get_eflags();
*(uint32_t *)(tp->regs.esp + 0xc) = (uint32_t)thread_dead;
*(uint32_t *)(tp->regs.esp + 0x10) = (uint32_t)(tp->arg = arg);
tp->priority = pri;
tp->prev = NULL;
atomic_enter(&thr_atom);
if ((tp->next = nucleus_threads) != NULL)
nucleus_threads->prev = tp;
nucleus_threads = tp;
atomic_exit(&thr_atom);
if ((tp->state = state) == T_RUN) {
timer_off();
schedule_queue(tp, T_RUN, &runq);
timer_on();
}
stat_thrcreate++;
return (tp);
}
Ok, so let’s start at the beginning. The thread_create() function is called with a few parameters…
thread_t *
thread_create(thread_t *tp, void *start, void *arg, int state, uint32_t pri)
{
The first, tp, is a pointer to a thread structure to populate. Usually this is NULL to tell us to allocate a new one but on rare occasions we might already have space for the thread structure set aside (the most obvious example being “thread zero”, which is what the start-up code already has defined when the kernel first starts initializing).
The next two arguments, start and arg are a pointer to the start function and a pointer to any arguments to that function respectively. Sometimes it’s useful to pass the new thread’s start function some arguments. Rather than try to handle a variable number of arguments, it’s easier just to have the arguments packed into a structure and pass a single pointer to that.
The next argument, state, tells us what state the thread should be initialized with and this also implies which dispatch queue it should be queued up on. Finally, the pri argument gives us the thread priority, for ordering it’s place on the dispatch queue. Most threads will have the same base priority, so that they content equally for cpu time. However, threads intended for interrupt or time critical processing may get a higher priority while the “idle thread” (for when there’s nothing else to do) will get a lower priority, so it doesn’t get in the way of threads waiting to do some real work.
The next few lines just test if this is the first ever call and if so, we initialize an atomic lock to protect the global linked list of kernel threads (I call my kernel a “nucleus” because I’m working towards a microkernel architecture). After this, we have the thread stack and structure allocation code…first we check if either the thread pointer is NULL or, if it is valid, whether its stack pointer is set. We use the page allocator to give us an entire page for a thread stack, if required. We steal space for the thread structure from the top of this stack if necessary…
/*
* thread stacks, if not pre-allocated by our caller, come out
* of the low memory area (where virtual and physical addresses
* can be the same). We can steal some of it for the thread
* structure too.
*/
if (tp == NULL || tp->stack == NULL) {
if ((vmmu_map_vpage(NULL, &va, NULL, PP_WRITEABLE)) != 0) {
error(E_WARN, "couldn't map page\n");
return (NULL);
}
if (tp == NULL) {
tp = (thread_t *)(va + PAGESIZE - sizeof (thread_t));
memset(tp, 0, sizeof (thread_t));
tp->stksize = PAGESIZE - sizeof (thread_t);
} else
tp->stksize = PAGESIZE;
tp->stack = (void *)va;
tp->flags |= T_STKALLOC;
}
The T_STKALLOC flag is really for debugging purposes and I use it to note when we’ve had to allocate a thread and/or stack because the caller didn’t supply one.
The next important part is setting up the initial cpu state in the thread’s save register structure, so that we can “resume” the thread into the start function. Pay particular attention to the second half where we cobble together a stack frame…
/* * set up essential registers - we start a thread by "resuming" it */ tp->regs.ss = get_ss(); tp->regs.es = get_es(); tp->regs.ds = get_ds(); tp->regs.fs = get_fs(); tp->regs.esp = (uint32_t)tp->stack + tp->stksize - 0x14; *(uint32_t *)tp->regs.esp = tp->start = (vaddr_t)start; *(uint32_t *)(tp->regs.esp + 0x4) = tp->regs.cs = get_cs(); *(uint32_t *)(tp->regs.esp + 0x8) = get_eflags(); *(uint32_t *)(tp->regs.esp + 0xc) = (uint32_t)thread_dead; *(uint32_t *)(tp->regs.esp + 0x10) = (uint32_t)(tp->arg = arg);
That first part simply sets up some segment registers and its fine to use current register values as both the thread and this routine are within the kernel context. It’s the latter half that is crucial. Here, we set up the stack frame so that the thread “resumes” straight into the start function, along with the required stack frame register values.
Note: we cunningly set up the stack to make it look like we were called by thread_dead(). This is a function that retires a thread by marking itself free, putting itself on the “reap” queue (to be harvested and recycled for another thread) and surrenders the cpu. We do this in case the start function we’re calling happens to return – we don’t want to “return” to some indeterminate address!
The rest of the code is quite straightforward. We note the requested thread priority, add it to the global list of kernel threads and, if the requested state is runnable, we queue it up on the scheduler’s “run” queue…
tp->priority = pri;
tp->prev = NULL;
atomic_enter(&thr_atom);
if ((tp->next = nucleus_threads) != NULL)
nucleus_threads->prev = tp;
nucleus_threads = tp;
atomic_exit(&thr_atom);
if ((tp->state = state) == T_RUN) {
timer_off();
schedule_queue(tp, T_RUN, &runq);
timer_on();
}
stat_thrcreate++;
return (tp);
}
That’s it! We have a new thread, ready to rock. Hope that all made some sense at least.
3.4 – Cleaning up dead threads
You may be interested in the code that cleans up retired threads and this comprises two parts. One is the thread_dead() function that was mentioned earlier, which is only called when a thread voluntarily retires itself (either via explicit call or by returning from the start function). The second is thread_reap() which actually runs as its own “reaper” thread. This thread just waits until the scheduler sees retired threads on the “reap” queue and pokes it back to life. Here’s the code for those two functions…
void
thread_dead()
{
/*
* put the current thread on the reap queue for cleaning up
* later
*/
yield(T_FREE, &dead_reapq);
/* we should never return from the above yield call */
error(E_WARN, "thread_dead: thread %p returned\n", curthread);
/* hang this thread */
while (1)
yield(T_FREE, &dead_reapq);
}
void
thread_reap(void *arg)
{
thread_t *tp;
while (1) {
/*
* wait for some work to do - the scheduler will wake
* us up if there are threads on the reap queue
*/
yield(T_WAIT, &waitq);
/* work through the reap queue, freeing dead threads */
atomic_enter(&thr_atom);
while ((tp = dead_reapq) != NULL) {
timer_off();
schedule_dequeue(tp, &dead_reapq);
timer_on();
if (tp->prev)
tp->prev->next = tp->next;
else
nucleus_threads = tp->next;
if (tp->next)
tp->next->prev = tp->prev;
if (tp->flags & T_STKALLOC)
vmmu_unmap_vpage(NULL, (vaddr_t)tp->stack);
stat_thrreap++;
}
atomic_exit(&thr_atom);
}
}
Next topics…
There’s still more to cover, specifically the scheduling code that handles switching threads on and off cpu, as well as synchronization primitives. I’ll cover these in separate articles!
Please do leave feedback and comments. Again, this article and the code therein is a work in progress, so feel free to point out bugs and errors you think you may have seen…just don’t flame me for them being there – hehe!