Symmetric multiprocessing (SMP)

Allocating resources in a multicore design can be difficult, especially when multiple software components are unaware of how other components are employing those resources. Symmetric multiprocessing addresses the issue by running only one copy of the BlackBerry 10 OS on all of the system's CPUs. Because the OS has insight into all system elements at all times, it can allocate resources on the multiple CPUs with little or no input from the application designer. Moreover, BlackBerry 10 OS provides built-in standardized primitives, such as pthread_mutex_lock(), pthread_mutex_unlock(), pthread_spin_lock(), and pthread_spin_unlock(), that let multiple applications share these resources safely and easily.
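As a rough illustration of the primitives named above, the following sketch has several threads update one shared counter under pthread_mutex_lock() and pthread_mutex_unlock(). The names NUM_WORKERS, INCREMENTS, worker(), and run_counter_demo() are hypothetical and chosen only for this example; only the pthread calls come from the text.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4        /* hypothetical worker count */
#define INCREMENTS  100000   /* hypothetical iterations per worker */

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;  /* protected by counter_lock */

/* Each worker adds INCREMENTS to the shared counter, one step at a time. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        int rc = pthread_mutex_lock(&counter_lock);
        assert(rc == 0);
        (void)rc;
        shared_counter++;                     /* critical section */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs all workers to completion; returns the final count, or -1 if a
 * thread could not be created. */
long run_counter_demo(void)
{
    pthread_t tid[NUM_WORKERS];
    int started = 0;

    shared_counter = 0;
    while (started < NUM_WORKERS &&
           pthread_create(&tid[started], NULL, worker, NULL) == 0)
        started++;

    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return (started == NUM_WORKERS) ? shared_counter : -1;
}
```

Because every increment happens with the mutex held, the result is the same no matter which CPUs the workers are scheduled on. Without the lock, increments from threads running concurrently on different processors could be lost.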

By running only one copy of BlackBerry 10 OS, SMP can dynamically allocate resources to specific applications rather than to CPUs, thereby enabling greater utilization of available processing power. It also lets system tracing tools gather operating statistics and application interactions for the multiprocessing system as a whole, giving you valuable insight into how to optimize and debug applications.

For instance, the System Profiler in the IDE can track thread migration from one CPU to another, as well as OS primitive usage, scheduling events, application-to-application messaging, and other events, all with high-resolution timestamping. Application synchronization also becomes much easier since you use standard OS primitives rather than complex IPC mechanisms.

BlackBerry 10 OS lets the threads of execution within an application run concurrently on any CPU, making the entire computing power of the chip available to applications at all times. BlackBerry 10 OS's preemption and thread prioritization capabilities help you ensure that CPU cycles go to the application that needs them the most.

The BlackBerry 10 OS's microkernel approach

SMP is typically associated with high-end operating systems such as Unix and Windows NT running on high-end servers. These large monolithic systems tend to be quite complex, the result of many person-years of development. Since these large kernels contain the bulk of all OS services, the changes to support SMP are extensive, usually requiring large numbers of modifications and the use of specialized spinlocks throughout the code. BlackBerry 10 OS, on the other hand, contains a very small microkernel surrounded by processes that act as resource managers, providing services such as file systems, character I/O, and networking. Because only the QNX Neutrino microkernel has to be modified, all other OS services gain the full advantage of SMP without any coding changes. If these service-providing processes are multithreaded, their threads are scheduled among the available processors. Even a single-threaded server benefits from an SMP system, because its thread is scheduled on the available processors alongside other servers and client processes.

As a testament to this microkernel approach, the SMP-enabled BlackBerry 10 OS kernel/process manager adds only a few kilobytes of additional code. The SMP versions are designed for these main processor families:

  • ARM (procnto-smp, procnto-v6)
  • x86 (procnto-smp)

The x86 version can boot on any system that conforms to the Intel MultiProcessor Specification (MP Spec) with up to 32 Pentium (or better) processors. BlackBerry 10 OS also supports Intel's Hyper-Threading Technology found in P4 and Xeon processors.

The procnto-smp manager also functions on a single-processor (non-SMP) system. With the cost of building a dual-processor Pentium motherboard very nearly the same as that for a single-processor motherboard, it's possible to deliver cost-effective solutions that can be scaled in the field by the simple addition of a second CPU. The fact that the OS itself is only a few kilobytes larger also allows SMP to be seriously considered for small CPU-intensive embedded systems, not just high-end servers.

Booting an x86 SMP system

The QNX Neutrino microkernel itself contains very little hardware- or system-specific code. The code that determines the capabilities of the system is isolated in a startup program, which is responsible for initializing the system, determining available memory, and so on. Information gathered is placed into a memory table available to the microkernel and to all processes (on a read-only basis). The startup-bios program is designed to work on systems compatible with the Intel MP Spec (version 1.4 or later). This startup program is responsible for:

  • Determining the number of processors
  • Determining the address of the local and I/O APIC
  • Initializing each additional processor

After a reset, only one processor executes the reset code. This processor is called the boot processor (BP). For each additional processor found, the BP running the startup-bios code:

  • Initializes the processor
  • Switches it to 32-bit protected mode
  • Allocates the processor its own page directory
  • Sets the processor spinning with interrupts disabled, waiting to be released by the kernel
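The last step, in which each additional processor spins until the kernel releases it, can be modeled in portable C. This is only a sketch of the handshake, not the startup-bios code: POSIX threads stand in for the additional processors, a C11 atomic flag stands in for the release signal, and every name (NUM_SECONDARY, released, parked, running, secondary_cpu(), boot_demo()) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_SECONDARY 3      /* hypothetical count of additional processors */

static atomic_int released;  /* 0: keep spinning, 1: released by the "kernel" */
static atomic_int parked;    /* stand-ins that have reached the spin loop */
static atomic_int running;   /* stand-ins that have left the spin loop */

/* Stand-in for an additional processor once the boot processor has set it up. */
static void *secondary_cpu(void *arg)
{
    (void)arg;
    atomic_fetch_add(&parked, 1);
    while (atomic_load_explicit(&released, memory_order_acquire) == 0)
        ;                    /* spin, waiting to be released */
    atomic_fetch_add(&running, 1);
    return NULL;
}

/* Plays the role of the boot processor and then the kernel. Returns the
 * number of stand-in processors running after release, or -1 if a thread
 * could not be created. */
int boot_demo(void)
{
    pthread_t cpu[NUM_SECONDARY];
    int started = 0;

    atomic_store(&released, 0);
    atomic_store(&parked, 0);
    atomic_store(&running, 0);

    while (started < NUM_SECONDARY &&
           pthread_create(&cpu[started], NULL, secondary_cpu, NULL) == 0)
        started++;

    /* Wait until every stand-in processor is parked in its spin loop. */
    while (atomic_load(&parked) < started)
        sched_yield();

    /* Nothing may leave the spin loop before the release. */
    assert(atomic_load(&running) == 0);

    /* The "kernel" releases all parked processors at once. */
    atomic_store_explicit(&released, 1, memory_order_release);
    for (int i = 0; i < started; i++)
        pthread_join(cpu[i], NULL);

    return (started == NUM_SECONDARY) ? atomic_load(&running) : -1;
}
```

On real hardware the spin loop runs with interrupts disabled on the additional processor itself. The thread model cannot show that part, only the ordering between parking and release.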

How the SMP microkernel works

When the additional processors have been released and are running, all processors are considered peers for the scheduling of threads.

Scheduling
The scheduling policy follows the same rules as on a uniprocessor system. That is, the highest-priority thread is running on an available processor. If a new thread becomes ready to run as the highest-priority thread in the system, it is dispatched to the appropriate processor. If more than one processor is selected as a potential target, then the QNX Neutrino microkernel tries to dispatch the thread to the processor where it last ran. This affinity is used as an attempt to reduce thread migration from one processor to another, which can affect cache performance.
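The dispatch rule described above can be sketched as a small function. This is a simplified model and not the microkernel's scheduler: it assumes that the "potential targets" are the processors running the lowest-priority thread, that larger numbers mean higher priority, and that ties not resolved by affinity go to the lowest-numbered processor. The function pick_cpu() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Chooses a processor for a thread that has just become ready.
 *   running_prio[i]  priority of the thread currently running on CPU i
 *   new_prio         priority of the newly ready thread
 *   last_cpu         CPU the thread last ran on, or -1 if it has never run
 * Returns the CPU index to dispatch to, or -1 if the thread does not
 * outrank any running thread and must wait. */
int pick_cpu(const int running_prio[], size_t ncpus, int new_prio, int last_cpu)
{
    assert(running_prio != NULL || ncpus == 0);
    if (ncpus == 0)
        return -1;

    /* Find the lowest priority currently running anywhere. */
    int lowest = running_prio[0];
    for (size_t i = 1; i < ncpus; i++)
        if (running_prio[i] < lowest)
            lowest = running_prio[i];

    if (new_prio <= lowest)
        return -1;           /* nothing to preempt */

    /* Affinity: prefer the CPU the thread last ran on if it is a target. */
    if (last_cpu >= 0 && (size_t)last_cpu < ncpus &&
        running_prio[last_cpu] == lowest)
        return last_cpu;

    /* Otherwise take the first processor running the lowest priority. */
    for (size_t i = 0; i < ncpus; i++)
        if (running_prio[i] == lowest)
            return (int)i;

    return -1;               /* not reached */
}
```

With running priorities {10, 5, 5, 20}, a priority-15 thread that last ran on CPU 2 goes back to CPU 2, while the same thread with a history on CPU 0 goes to CPU 1, because CPU 0 is running higher-priority work than CPUs 1 and 2.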

In an SMP system, the scheduler has some flexibility in deciding exactly how to schedule the other threads, with an eye towards optimizing cache usage and minimizing thread migration. This could mean that some processors are running lower-priority threads while a higher-priority thread is waiting to run on the processor it last ran on. The next time a processor that's running a lower-priority thread makes a scheduling decision, it chooses the higher-priority one.

In any case, the realtime scheduling rules that were in place on a uniprocessor system are guaranteed to be upheld on an SMP system.

Kernel locking
In a uniprocessor system, only one thread is allowed to execute within the microkernel at a time. Most kernel operations are short in duration (typically a few microseconds on a Pentium-class processor). The microkernel is also designed to be completely preemptible and restartable for those operations that take more time. This design keeps the microkernel lean and fast without the need for large numbers of fine-grained locks. Placing many locks in the main code path through a kernel noticeably slows the kernel down, because each lock typically involves processor bus transactions, which can cause processor stalls.

In an SMP system, BlackBerry 10 OS maintains this philosophy of only one thread in a preemptible and restartable kernel. The microkernel may be entered on any processor, but only one processor is granted access at a time.

For most systems, the time spent in the microkernel represents only a small fraction of the processor's workload. Therefore, while conflicts can occur, they should be more the exception than the norm. This is especially true for a microkernel where traditional OS services like file systems are separate processes and not part of the kernel itself.

Interprocessor interrupts (IPIs)
The processors communicate with each other through IPIs (interprocessor interrupts). The kernel uses IPIs to schedule and control threads across multiple processors. For example, an IPI to another processor is often needed when:
  • A higher-priority thread becomes ready
  • A thread running on another processor is hit with a signal
  • A thread running on another processor is canceled
  • A thread running on another processor is destroyed

Critical sections

To control access to data structures that are shared between them, threads and processes use the standard POSIX primitives of mutexes, condvars, and semaphores. These work without change in an SMP system. Many realtime systems also need to protect access to shared data structures between an interrupt handler and the thread that owns the handler. The traditional POSIX primitives used between threads aren't available for use by an interrupt handler. There are two solutions here:

  • Remove all work from the interrupt handler and do all the work at thread time instead. Given the fast thread scheduling, this is a very viable solution.
  • Disable interrupts around the critical section. In a uniprocessor system running the BlackBerry 10 OS, an interrupt handler may preempt a thread, but a thread never preempts an interrupt handler. This allows the thread to protect itself from the interrupt handler by disabling and enabling interrupts for very brief periods of time.

The thread on a non-SMP system protects itself with code of the form:

InterruptDisable();

/* Critical section */

InterruptEnable();

Or:

InterruptMask(intr, id);   /* id: value returned by InterruptAttach(), or -1 */

/* Critical section */

InterruptUnmask(intr, id);

Unfortunately, this code fails on an SMP system since the thread may be running on one processor while the interrupt handler is concurrently running on another processor!

One solution is to lock the thread to a particular processor (see Bound multiprocessing (BMP)).

A better solution is to use an exclusion lock that's available to both the thread and the interrupt handler. This is provided by the following primitives, which work on both uniprocessor and SMP machines:

InterruptLock(intrspin_t *spinlock)
With interrupts disabled, attempt to acquire a spinlock, a variable shared between the interrupt handler and the thread. The code spins in a tight loop until the lock is acquired. Because any other processor that wants the lock keeps spinning while it is held, the lock must be released as soon as possible (typically within a few lines of C code without any loops).
InterruptUnlock(intrspin_t *spinlock)
Release the lock and reenable interrupts.

On a non-SMP system, there's no need for a spinlock: disabling interrupts alone keeps the interrupt handler out of the critical section.
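A minimal sketch of how a thread and an interrupt handler might share data with these primitives follows. The stats structure and the functions on_interrupt() and take_snapshot() are hypothetical. So that the sketch also builds off-target, it supplies portable stand-ins for intrspin_t, InterruptLock(), and InterruptUnlock() when not compiled for QNX Neutrino; these stand-ins only spin on an atomic variable and do not disable interrupts, so they are not equivalent to the real calls. On the target, the calling thread is assumed to have already obtained I/O privileges.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __QNXNTO__
#include <sys/neutrino.h>
#else
/* Off-target stand-ins (assumption): spin on an atomic variable only. */
#include <stdatomic.h>
typedef struct { atomic_int held; } intrspin_t;

static void InterruptLock(intrspin_t *s)
{
    while (atomic_exchange_explicit(&s->held, 1, memory_order_acquire) != 0)
        ;                               /* spin until the lock is free */
}

static void InterruptUnlock(intrspin_t *s)
{
    atomic_store_explicit(&s->held, 0, memory_order_release);
}
#endif

/* Static storage, so the spinlock starts out zeroed (unlocked). */
static intrspin_t stats_lock;

/* Data shared between the interrupt handler and the thread. */
static struct {
    unsigned long events;
    unsigned long bytes;
} stats;

/* Intended to be called from the interrupt handler. */
void on_interrupt(unsigned long nbytes)
{
    InterruptLock(&stats_lock);
    stats.events++;                     /* short critical section, no loops */
    stats.bytes += nbytes;
    InterruptUnlock(&stats_lock);
}

/* Called from the thread: copy out a consistent snapshot and reset. */
void take_snapshot(unsigned long *events, unsigned long *bytes)
{
    assert(events != NULL && bytes != NULL);
    InterruptLock(&stats_lock);
    *events = stats.events;
    *bytes = stats.bytes;
    stats.events = 0;
    stats.bytes = 0;
    InterruptUnlock(&stats_lock);
}
```

Both critical sections are a few assignments long, in keeping with the requirement that the lock be released as soon as possible. Any longer processing of the snapshot happens in the thread after InterruptUnlock() returns.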

For more information, see Multicore processing.

Last modified: 2015-05-07


