The impact of developing for multicore

Although the actual changes to the way you set up the processor to run SMP are fairly minor, the fact that you're running on a multicore system can have a major impact on your software!

The main thing to keep in mind is this: in a single processor environment, it may be a nice design abstraction to pretend that threads execute in parallel; under a multicore system, they really do execute in parallel! (With BMP, you can make your threads run on a specific CPU.)

Thread affinity

One issue that often arises in a multicore environment can be put like this: Can one processor handle the GUI, another handle the database, and the other two handle the realtime functions?

The answer is: Yes, absolutely.

This is done through the magic of thread affinity, the ability to associate certain programs (or even threads within programs) with a particular processor or processors.

Thread affinity works like this. When a thread starts up, its affinity mask (or runmask) is set to allow it to run on all processors. By default, then, a thread doesn't inherit an affinity mask from the thread that created it, so it's up to the thread itself to call ThreadCtl() with the _NTO_TCTL_RUNMASK control flag to set its runmask:

if (ThreadCtl( _NTO_TCTL_RUNMASK, (void *)my_runmask) == -1) {
    /* An error occurred. */
}

The runmask is simply a bitmap; each bit position indicates a particular processor. For example, the runmask 0x05 (binary 00000101) allows the thread to run on processors 0 (the 0x01 bit) and 2 (the 0x04 bit).
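The bit arithmetic described above can be sketched in plain C, with no QNX calls. The helper names runmask_from_cpus() and runmask_allows() are hypothetical, invented for this illustration, and the sketch assumes a 32-bit runmask:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: build a runmask with one bit set per CPU index.
 * Assumes every index is in the range 0..31 (a 32-bit runmask). */
unsigned runmask_from_cpus(const unsigned *cpus, size_t count)
{
    unsigned mask = 0;
    for (size_t i = 0; i < count; i++) {
        assert(cpus[i] < 32);
        mask |= 1u << cpus[i];
    }
    return mask;
}

/* Hypothetical helper: nonzero if the runmask lets a thread run on cpu. */
int runmask_allows(unsigned mask, unsigned cpu)
{
    assert(cpu < 32);
    return (mask >> cpu) & 1u;
}
```

For CPUs 0 and 2, runmask_from_cpus() yields 0x05, which could then be passed to ThreadCtl() with _NTO_TCTL_RUNMASK in the same way as my_runmask in the earlier call.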

If you use _NTO_TCTL_RUNMASK, the runmask is limited to the size of an int (currently 32 bits). Threads created by the calling thread don't inherit the specified runmask.

If you want to support more processors than fit in an int, or you want to set the inherit mask, you need to use the _NTO_TCTL_RUNMASK_GET_AND_SET_INHERIT command described below.

The <sys/neutrino.h> file defines some macros that you can use to work with a runmask:

RMSK_SET(cpu, p)
Set the bit for cpu in the mask pointed to by p.
RMSK_CLR(cpu, p)
Clear the bit for cpu in the mask pointed to by p.
RMSK_ISSET(cpu, p)
Determine if the bit for cpu is set in the mask pointed to by p.

The CPUs are numbered from 0. These macros work with runmasks of any length.
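To illustrate how a mask of any length can be indexed, here is a rough portable sketch. The mask_words(), mask_set(), mask_clr(), and mask_isset() helpers are hypothetical stand-ins written for this example; they aren't the definitions from <sys/neutrino.h>, and they assume the mask is an array of unsigned with CPU 0 in the lowest bit of element 0:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of bits in one mask element. */
#define MASK_WORD_BITS (sizeof(unsigned) * CHAR_BIT)

/* Hypothetical: number of unsigned elements needed for num_cpus bits. */
size_t mask_words(size_t num_cpus)
{
    assert(num_cpus > 0);
    return (num_cpus - 1) / MASK_WORD_BITS + 1;
}

/* Hypothetical: set the bit for cpu in the mask pointed to by p. */
void mask_set(size_t cpu, unsigned *p)
{
    p[cpu / MASK_WORD_BITS] |= 1u << (cpu % MASK_WORD_BITS);
}

/* Hypothetical: clear the bit for cpu in the mask pointed to by p. */
void mask_clr(size_t cpu, unsigned *p)
{
    p[cpu / MASK_WORD_BITS] &= ~(1u << (cpu % MASK_WORD_BITS));
}

/* Hypothetical: nonzero if the bit for cpu is set in the mask. */
int mask_isset(size_t cpu, const unsigned *p)
{
    return (p[cpu / MASK_WORD_BITS] >> (cpu % MASK_WORD_BITS)) & 1u;
}
```

With a 32-bit unsigned, CPU 34 lands in bit 2 of element 1, so a 64-CPU system would need a two-element array for each mask.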

Bound multiprocessing (BMP) is a variation on SMP that lets you specify which processors a process or thread and its children can run on. To specify this, you use an inherit mask.

To set a thread's inherit mask, you use ThreadCtl() with the _NTO_TCTL_RUNMASK_GET_AND_SET_INHERIT control flag. Conceptually, the structure that you pass with this command is as follows:

struct _thread_runmask {
    int size;
    unsigned runmask[size];
    unsigned inherit_mask[size];
};

If you set the runmask member to a nonzero value, ThreadCtl() sets the runmask of the calling thread to the specified value. If you set the runmask member to zero, the runmask of the calling thread isn't altered.

If you set the inherit_mask member to a nonzero value, ThreadCtl() sets the calling thread's inheritance mask to the specified value(s); if the calling thread then creates any children by calling pthread_create(), fork(), spawn(), vfork(), or exec(), the children inherit this mask. If you set the inherit_mask member to zero, the calling thread's inheritance mask isn't changed.

If you look at the definition of _thread_runmask in <sys/neutrino.h>, you see that it's actually declared like this:

struct _thread_runmask {
    int         size;
/*  unsigned    runmask[size];      */
/*  unsigned    inherit_mask[size]; */
};

This is because the number of elements in the runmask and inherit_mask arrays depends on the number of processors in your multicore system. You can use the RMSK_SIZE() macro to determine how many unsigned integers you need for the masks; pass the number of CPUs (found in the system page) to this macro.

Here's a code snippet that shows how to set up the runmask and inherit mask:

unsigned    num_elements = 0;
int         *rsizep, masksize_bytes, size;
unsigned    *rmaskp, *imaskp;
void        *my_data;

/* Determine the number of array elements required to hold
 * the runmasks, based on the number of CPUs in the system. */
num_elements = RMSK_SIZE(_syspage_ptr->num_cpu);

/* Determine the size of the runmask, in bytes. */
masksize_bytes = num_elements * sizeof(unsigned);

/* Allocate memory for the data structure that we'll pass
 * to ThreadCtl(). We need space for an integer (the number
 * of elements in each mask array) and the two masks
 * (runmask and inherit mask). */

size = sizeof(int) + 2 * masksize_bytes;
if ((my_data = malloc(size)) == NULL) {
    /* Not enough memory. */
    …
} else {
    memset(my_data, 0x00, size);

    /* Set up pointers to the "members" of the structure. */
    rsizep = (int *)my_data;
    rmaskp = (unsigned *)(rsizep + 1);
    imaskp = rmaskp + num_elements;

    /* Set the size. */
    *rsizep = num_elements;

    /* Set the runmask. Call this macro once for each processor
       the thread can run on. */
    RMSK_SET(cpu1, rmaskp);

    /* Set the inherit mask. Call this macro once for each
       processor the thread's children can run on. */
    RMSK_SET(cpu1, imaskp);

    if ( ThreadCtl( _NTO_TCTL_RUNMASK_GET_AND_SET_INHERIT,
                   my_data) == -1) {
        /* Something went wrong. */
        
    }
}

You can also use the -C and -R options to the on command to launch processes with a runmask (assuming they don't set their runmasks programmatically); for example, use on -C 1 io-pkt-v4 to start io-pkt-v4 and lock all threads to CPU 1. This command sets both the runmask and the inherit mask.

You can also use the same options to the slay command to modify the runmask of a running process or thread. For example, slay -C 0 io-pkt-v4 moves all of io-pkt-v4's threads to run on CPU 0. If you use the -C and -R options, slay sets the runmask; if you also use the -i option, slay also sets the process's or thread's inherit mask to be the same as the runmask.

Multicore and synchronization primitives

Standard synchronization primitives (barriers, mutexes, condvars, semaphores, and all of their derivatives, for example, sleepon locks) are safe to use on a multicore box. You don't have to do anything special here.

Multicore and FIFO scheduling

A common single-processor trick for coordinated access to a shared memory region is to use FIFO scheduling between two threads running at the same priority. The idea is that one thread accesses the region and then calls SchedYield() to give up its use of the processor. Then, the second thread runs and accesses the region. When it's done, the second thread also calls SchedYield(), and the first thread runs again. Since there's only one processor, both threads cooperatively share that processor.

This FIFO trick won't work on an SMP system, because both threads may run simultaneously on different processors. Use a proper thread synchronization primitive instead (for example, a mutex), or use BMP to tie both threads to the same CPU.
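As a minimal sketch of the mutex approach, assuming POSIX threads, the following code has two threads update a shared value without relying on scheduling order. The names region_lock, shared_region, worker(), and run_two_workers() are hypothetical, and a plain counter stands in for the shared memory region:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000

static pthread_mutex_t region_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_region;   /* stands in for the shared memory region */

/* Each worker updates the shared region only while holding the mutex. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&region_lock);
        shared_region++;
        pthread_mutex_unlock(&region_lock);
    }
    return NULL;
}

/* Run two workers to completion and return the final value. */
long run_two_workers(void)
{
    pthread_t a, b;
    int rc;

    shared_region = 0;
    rc = pthread_create(&a, NULL, worker, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, worker, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_region;
}
```

Because every update happens under the mutex, run_two_workers() returns 2 * ITERATIONS whether the two threads share one processor or run on different ones.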

Multicore and interrupts

The following method is closely related to the FIFO scheduling trick. On a single-processor system, a thread and an interrupt service routine are mutually exclusive, because the ISR runs at a higher priority than any thread. Therefore, the ISR can preempt the thread, but the thread can never preempt the ISR. So the only protection required is for the thread to indicate that during a particular section of code (the critical section) interrupts should be disabled.

Obviously, this scheme breaks down in a multicore system, because again the thread and the ISR could be running on different processors.

The solution in this case is to use the InterruptLock() and InterruptUnlock() calls to ensure that the ISR won't preempt the thread at an unexpected point. But what if the thread preempts the ISR? The solution is the same: use InterruptLock() and InterruptUnlock() in the ISR as well.

You should always use InterruptLock() and InterruptUnlock(), both in the thread and in the ISR. The small amount of extra overhead on a single-processor box is negligible.

Multicore and atomic operations

If you want to perform simple atomic operations, such as adding a value to a memory location, it isn't necessary to turn off interrupts to ensure that the operation won't be preempted. Instead, use the functions provided in the C include file <atomic.h>, which let you perform the following operations with memory locations in an atomic manner:

Function                 Operation
atomic_add()             Add a number
atomic_add_value()       Add a number and return the original value of *loc
atomic_clr()             Clear bits
atomic_clr_value()       Clear bits and return the original value of *loc
atomic_set()             Set bits
atomic_set_value()       Set bits and return the original value of *loc
atomic_sub()             Subtract a number
atomic_sub_value()       Subtract a number and return the original value of *loc
atomic_toggle()          Toggle (complement) bits
atomic_toggle_value()    Toggle (complement) bits and return the original value of *loc

The *_value() functions may be slower on some systems, so don't use them unless you really want the return value.
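The functions in <atomic.h> are specific to this OS, so the following sketch swaps in the standard C11 <stdatomic.h> interface to show the same idea portably: two threads add to one counter without a mutex and without disabling interrupts. Note that atomic_fetch_add() is the C11 function, not one of the functions in the table, and that counter, adder(), and count_atomically() are hypothetical names:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define ADDS_PER_THREAD 100000u

static atomic_uint counter;

/* Each thread adds 1 to the shared counter ADDS_PER_THREAD times. */
static void *adder(void *arg)
{
    (void)arg;
    for (unsigned i = 0; i < ADDS_PER_THREAD; i++) {
        /* Atomic read-modify-write; the returned old value is ignored. */
        atomic_fetch_add(&counter, 1u);
    }
    return NULL;
}

/* Run two adder threads to completion and return the final count. */
unsigned count_atomically(void)
{
    pthread_t a, b;
    int rc;

    atomic_store(&counter, 0u);
    rc = pthread_create(&a, NULL, adder, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, adder, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&counter);
}
```

No update is lost even when both threads run at the same time on different processors, so count_atomically() returns 2 * ADDS_PER_THREAD.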

Last modified: 2015-05-07


