<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jaideep more</title>
    <description>The latest articles on DEV Community by Jaideep more (@jaideepmore).</description>
    <link>https://dev.to/jaideepmore</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1116751%2Fd42a6273-f252-400e-8114-8613063232cb.jpeg</url>
      <title>DEV Community: Jaideep more</title>
      <link>https://dev.to/jaideepmore</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jaideepmore"/>
    <language>en</language>
    <item>
      <title>Memory Coherence in Shared Virtual Memory Systems</title>
      <dc:creator>Jaideep more</dc:creator>
      <pubDate>Mon, 17 Jul 2023 10:24:52 +0000</pubDate>
      <link>https://dev.to/jaideepmore/memory-coherence-in-shared-virtual-memory-systems-1c2p</link>
      <guid>https://dev.to/jaideepmore/memory-coherence-in-shared-virtual-memory-systems-1c2p</guid>
      <description>&lt;p&gt;The article serves as a summary of the paper - &lt;a href="https://courses.mpi-sws.org/ds-ws18/papers/li-dsm.pdf"&gt;Memory Coherence in Shared Virtual Memory Systems&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Choices for Memory Coherence:
&lt;/h2&gt;

&lt;p&gt;Our design goals require that the shared virtual memory be coherent. Coherence can be maintained if the shared virtual memory satisfies the following single constraint:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;A processor is allowed to update a piece of data only while no other processor is updating or reading it.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units and the strategy for maintaining coherence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Granularity:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Memory Contention:&lt;/strong&gt; When two or more threads try to access the same memory location simultaneously, contention occurs, and one or more threads may be blocked or delayed in their execution. This can lead to performance degradation and even deadlock in some cases.&lt;/p&gt;

&lt;p&gt;The possibility of contention pushes one toward relatively small memory units.&lt;/p&gt;

&lt;p&gt;A suitable compromise on granularity is the typical page size used in conventional virtual memory implementations. Part of the justification for using page-size granularity is that memory references in sequential programs generally have a high degree of locality.&lt;/p&gt;

&lt;p&gt;Although memory references in parallel programs may behave differently from those in sequential ones, a single process remains a sequential program and should exhibit a high degree of locality.&lt;/p&gt;

&lt;p&gt;In addition, such a choice allows us to use existing page-fault schemes. This can be done by setting the access rights to the pages so that memory accesses that could violate memory coherence cause a page fault. Thus the memory coherence problem can be solved in a modular way in the page fault handlers.&lt;/p&gt;
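&lt;p&gt;As a rough illustration of this idea, the following Python sketch (names and access levels are assumptions for illustration, not the paper's code) models page access rights and a fault handler that upgrades them:&lt;/p&gt;

```python
# Illustrative sketch only: model a page's access rights and let an access
# with insufficient rights raise a "page fault" that is handled in software.
NIL, READ, WRITE = "nil", "read", "write"   # access rights for a page

class PageFault(Exception):
    pass

class Page:
    def __init__(self):
        self.access = NIL
        self.data = 0

def read(page):
    if page.access not in (READ, WRITE):    # no read rights: fault
        raise PageFault("read fault")
    return page.data

def read_with_fault_handler(page):
    try:
        return read(page)
    except PageFault:
        # A real handler would fetch the page and run the coherence
        # protocol; here we simply grant read access to show the flow.
        page.access = READ
        return read(page)
```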

&lt;h3&gt;
  
  
  Memory Coherence Strategies:
&lt;/h3&gt;

&lt;p&gt;The strategies may be classified by how one deals with page synchronisation and ownership.&lt;/p&gt;

&lt;h4&gt;
  
  
  Page Synchronisation:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invalidation:&lt;/strong&gt; On a write fault, the fault handler copies the true page and invalidates all other copies of the page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write Back:&lt;/strong&gt; On a write fault, the fault handler writes to all copies of the page. This is clearly expensive, since every write operation causes a write-back; it is therefore not suitable for loosely coupled multiprocessors.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Page Ownership:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Static:&lt;/strong&gt; The same processor always owns a page; other processors must send their updates to the owner. Since every write by a non-owner generates a write fault, this is not suitable for our system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic:&lt;/strong&gt; Ownership of a page moves between processors. Ownership can be tracked by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Centralised managers&lt;/li&gt;
&lt;li&gt;Distributed managers&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solutions for Memory Coherence Problems:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AkJUTDbf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pk5ygs7p2al0hbanvod6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AkJUTDbf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pk5ygs7p2al0hbanvod6.png" alt="Strategy comparison" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Centralised Manager Algorithms:
&lt;/h2&gt;

&lt;p&gt;The responsibility for the synchronisation of pages is given to the individual processors/owners.&lt;/p&gt;

&lt;p&gt;Each processor maintains a data structure &lt;code&gt;ptable&lt;/code&gt; that stores, for every page:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;access&lt;/code&gt;: the type of access the processor has to the page.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;copy_set&lt;/code&gt;: the list of processor ids that have copies of the page.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lock&lt;/code&gt;: for synchronising requests to the page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A manager stores only the page ownership information in the &lt;code&gt;owner&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;The following diagrams provide flowcharts of the fault handlers and their corresponding services.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fault handlers&lt;/em&gt; are functions invoked when a read/write operation is performed on a page the processor does not own.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Services&lt;/em&gt; are functions that are invoked when a page owner receives a read/write request from the manager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ua-QoprA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zgxhr6bkyg6iq4665pfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ua-QoprA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zgxhr6bkyg6iq4665pfs.png" alt="Flowchart for Centralised manager" width="800" height="795"&gt;&lt;/a&gt;&lt;/p&gt;
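&lt;p&gt;The flow above can be sketched in Python (a simplified, single-threaded model with assumed names; the real algorithm also involves locking and message passing):&lt;/p&gt;

```python
# Minimal sketch of the improved centralised manager: the manager keeps only
# an `owner` table, while each owner keeps the page's copy_set and performs
# the invalidation itself.
owner = {}     # manager's table: page -> owning processor id
ptable = {}    # (processor, page) -> {"access": ..., "copy_set": set()}

def init_page(page, first_owner, n_procs):
    owner[page] = first_owner
    for p in range(n_procs):
        ptable[(p, page)] = {"access": None, "copy_set": set()}
    ptable[(first_owner, page)]["access"] = "write"

def read_fault(proc, page):
    o = owner[page]                             # ask the manager for the owner
    ptable[(o, page)]["copy_set"].add(proc)     # owner records the new copy
    ptable[(proc, page)]["access"] = "read"

def write_fault(proc, page):
    o = owner[page]
    # The owner invalidates every other copy, then ownership is transferred.
    for p in ptable[(o, page)]["copy_set"]:
        ptable[(p, page)]["access"] = None
    ptable[(proc, page)]["copy_set"] = set()
    ptable[(proc, page)]["access"] = "write"
    if o != proc:
        ptable[(o, page)]["access"] = None
    owner[page] = proc
```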

&lt;h2&gt;
  
  
  Distributed Manager Algorithms:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fixed Distributed Manager Algorithm:
&lt;/h3&gt;

&lt;p&gt;Every processor is given a predetermined subset of the pages to manage. The most straightforward approach is to distribute the pages evenly in a fixed manner to all processors.&lt;/p&gt;

&lt;p&gt;When a fault occurs on a page &lt;code&gt;p&lt;/code&gt;, the faulting processor asks processor &lt;code&gt;H(p)&lt;/code&gt;, where &lt;code&gt;H&lt;/code&gt; is the fixed mapping from pages to their managers, who the true owner of the page is, and then proceeds as in the centralised manager algorithm.&lt;/p&gt;
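&lt;p&gt;One common choice for the mapping, shown here as an assumed illustration rather than anything mandated by the paper, is to hash the page number:&lt;/p&gt;

```python
# Distribute manager duty evenly over the processors by page number
# (an illustrative fixed distribution).
N_PROCESSORS = 4

def H(p):
    """Map page p to the processor that manages it."""
    return p % N_PROCESSORS
```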

&lt;p&gt;&lt;em&gt;It is, however, difficult to find a good static distribution that fits all applications well.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Broadcast Distributed Manager Algorithm:
&lt;/h3&gt;

&lt;p&gt;An obvious way of eliminating the centralised manager is by using broadcast mechanisms. With this strategy, each processor manages precisely those pages it owns, and faulting processors send broadcasts into the network to find the page's owner.&lt;/p&gt;

&lt;p&gt;Thus the &lt;code&gt;owner&lt;/code&gt; table is eliminated, and the ownership information is stored in each processor's &lt;code&gt;ptable&lt;/code&gt;, which in addition to &lt;code&gt;access&lt;/code&gt;, &lt;code&gt;copy_set&lt;/code&gt; and &lt;code&gt;lock&lt;/code&gt; fields also has an &lt;code&gt;owner&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;More precisely, when a read fault occurs, the faulting processor &lt;code&gt;P&lt;/code&gt; sends a broadcast read request, and the true owner of the page responds by adding &lt;code&gt;P&lt;/code&gt; to the page's &lt;code&gt;copy_set&lt;/code&gt; field and sending a copy of the page to &lt;code&gt;P&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Similarly, when a write fault occurs, the faulting processor sends a broadcast write request, and the true owner of the page gives up ownership and sends back the page and its &lt;code&gt;copy_set&lt;/code&gt;. When the requesting processor receives the page and the &lt;code&gt;copy_set&lt;/code&gt;, it will invalidate all copies.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Broadcasting every time a page fault occurs leads to the communication system being a potential bottleneck.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Distributed Manager Algorithm:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pIZ9-6u---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dh1uhszx0v72sky1dis5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pIZ9-6u---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dh1uhszx0v72sky1dis5.png" alt="Flowchart for Distributed manager" width="800" height="782"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The heart of a dynamic distributed manager algorithm is to attempt to keep track of the ownership of all pages in each processor's local &lt;code&gt;ptable&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;To do this, the &lt;code&gt;owner&lt;/code&gt; field is replaced with another field, &lt;code&gt;prob_owner&lt;/code&gt;, whose value can be either &lt;code&gt;null&lt;/code&gt; or the "probable" owner of the page.&lt;/p&gt;

&lt;p&gt;The information that &lt;code&gt;prob_owner&lt;/code&gt; contains is not necessarily correct at all times, but if incorrect, it will at least provide the beginning of a sequence of processors through which the true owner can be found.&lt;/p&gt;
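&lt;p&gt;A minimal sketch of following the &lt;code&gt;prob_owner&lt;/code&gt; chain (having the true owner point at itself is a modelling assumption for illustration):&lt;/p&gt;

```python
# Each processor keeps a guess at the page's owner; following the guesses
# from any processor must eventually reach the true owner.
prob_owner = {}   # processor id -> its guess of the page's owner

def find_true_owner(start):
    """Follow prob_owner links from `start` until a processor names itself."""
    p = start
    while prob_owner[p] != p:     # the true owner points at itself
        p = prob_owner[p]
    return p
```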

</description>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Programming with Threads</title>
      <dc:creator>Jaideep more</dc:creator>
      <pubDate>Mon, 10 Jul 2023 09:27:43 +0000</pubDate>
      <link>https://dev.to/jaideepmore/programming-with-threads-1df8</link>
      <guid>https://dev.to/jaideepmore/programming-with-threads-1df8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article serves as a summary for the paper - &lt;a href="https://courses.mpi-sws.org/ds-ws18/papers/birrell-threads-csharp.pdf"&gt;Programming with Threads&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;A "thread" is a straightforward concept: a single sequential flow of control. Having multiple threads in a program means that at any instant the program has multiple points of execution. &lt;/p&gt;

&lt;p&gt;The programmer can mostly view the threads as executing simultaneously, as if the computer were endowed with as many processors as there are threads. The programmer must be aware, however, that the computer might not in fact execute all of the threads simultaneously.&lt;/p&gt;

&lt;p&gt;Each thread executes on a separate call stack with its own local variables, while the off-stack (global) variables are shared among all the threads of the program. The programmer is responsible for using appropriate synchronisation mechanisms to ensure that the shared memory is accessed in a manner that gives the correct answer.&lt;/p&gt;

&lt;p&gt;A thread facility allows us to write programs with multiple simultaneous points of execution, synchronising through shared memory. In this article we discuss the basic thread and synchronisation primitives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Concurrency?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Use of multiple processors (the obvious reason).&lt;/li&gt;
&lt;li&gt;Driving slow devices such as disks or networks. In these cases, an efficient program should do other useful work while waiting for the device to produce its next event.&lt;/li&gt;
&lt;li&gt;A third source of concurrency is human users: when your program is performing some lengthy task for the user, it should still be responsive.&lt;/li&gt;
&lt;li&gt;We can deliberately add concurrency to a program in order to reduce the latency of operations. Some of the work incurred by a method call can be deferred if it does not affect the result of the call: when you add or remove something in a balanced tree, you can happily return to the caller before re-balancing the tree, with the re-balancing done in a separate thread.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Design of thread facility:
&lt;/h2&gt;

&lt;p&gt;In general there are four major mechanisms:&lt;/p&gt;

&lt;h3&gt;
  
  
  Thread Creation:
&lt;/h3&gt;

&lt;p&gt;Creating and starting a thread is called “forking”. Most forked threads are daemon threads.&lt;/p&gt;
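&lt;p&gt;In Python's &lt;code&gt;threading&lt;/code&gt; module (the paper's examples are in C#), forking looks like this minimal sketch:&lt;/p&gt;

```python
# Fork a thread, join it, and mark it as a daemon so it cannot keep the
# process alive on its own.
import threading

results = []

def worker(n):
    results.append(n * n)    # runs on the new thread's own call stack

t = threading.Thread(target=worker, args=(7,), daemon=True)  # fork
t.start()
t.join()                     # wait for the forked thread to finish
```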

&lt;h3&gt;
  
  
  Mutual Exclusion:
&lt;/h3&gt;

&lt;p&gt;Threads interact through access to shared memory. To avoid the errors that arise when more than one thread accesses a shared variable, we use a mutual exclusion tool: it specifies a particular region of code that only one thread can execute at any time.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;lock&lt;/code&gt; statement locks the given object, then executes the contained statements, then unlocks the object.&lt;/p&gt;

&lt;p&gt;A thread executing inside the &lt;code&gt;lock&lt;/code&gt; statement is said to “hold” the given object’s lock. If another thread attempts to lock the object when it is already locked, the second thread blocks (enqueued on the object’s lock) until the object is unlocked.&lt;/p&gt;

&lt;p&gt;In general, we achieve mutual exclusion on a set of variables by associating them (mentally) with a particular object. We then write our program so that it accesses those variables only from a thread that holds that object’s lock.&lt;/p&gt;
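&lt;p&gt;A short Python sketch of this discipline (the paper uses C#'s &lt;code&gt;lock&lt;/code&gt; statement; Python's &lt;code&gt;with lock:&lt;/code&gt; plays the same role here):&lt;/p&gt;

```python
# Associate the shared variable with a lock and touch it only while holding
# the lock; a hundred concurrent deposits then serialise correctly.
import threading

balance_lock = threading.Lock()
balance = 0                    # shared variable protected by balance_lock

def deposit(amount):
    global balance
    with balance_lock:         # acquire, run the body, release
        balance += amount

threads = [threading.Thread(target=deposit, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```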

&lt;h3&gt;
  
  
  Waiting for Events:
&lt;/h3&gt;

&lt;p&gt;Often the programmer needs to express more complicated scheduling policies. This requires a mechanism that allows a thread to block itself until some condition becomes true.&lt;/p&gt;

&lt;p&gt;This mechanism is generally called a "condition variable". It provides the following operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Wait(object)&lt;/code&gt;: atomically unlocks the object and blocks the thread; when awakened, the thread re-locks the object before returning.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Pulse(object)&lt;/code&gt;: awakens one thread that is waiting on the object.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PulseAll(object)&lt;/code&gt;: awakens all threads that are waiting on the object.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Interrupt:
&lt;/h3&gt;

&lt;p&gt;The final mechanism is for interrupting a particular thread. If &lt;code&gt;threadA&lt;/code&gt; is blocked waiting for a condition, and &lt;code&gt;threadB&lt;/code&gt; calls &lt;code&gt;threadA.Interrupt()&lt;/code&gt;, then &lt;code&gt;threadA&lt;/code&gt; will resume execution by re-locking the object and throwing a &lt;code&gt;ThreadInterruptedException&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Locks: Accessing Shared Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Unprotected Data:
&lt;/h3&gt;

&lt;p&gt;The simplest bug related to locks occurs when we fail to protect some mutable data and then we access it without synchronisation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;lock&lt;/code&gt; statement enforces serialisation of threads, so that at any time only one thread executes the statements inside the &lt;code&gt;lock&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Invariants:
&lt;/h3&gt;

&lt;p&gt;Programmers find it easier to think of the lock as protecting the invariant of the associated data. &lt;/p&gt;

&lt;p&gt;An invariant is a boolean function of the data that must be true whenever the associated lock is not held. A thread may violate the invariant temporarily while it holds the lock, but it must restore the invariant before releasing it.&lt;/p&gt;

&lt;p&gt;Releasing the lock while the data is in a transient, inconsistent state will inevitably lead to confusion if another thread can acquire the lock in that state.&lt;/p&gt;
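&lt;p&gt;A small, assumed example: two counters whose sum is the invariant. The transfer temporarily breaks the invariant, but only while the lock is held:&lt;/p&gt;

```python
# Invariant: a + b == 100 whenever the lock is free. The transfer violates
# it between the two updates, which is safe because the lock is held then.
import threading

lock = threading.Lock()
a, b = 70, 30

def invariant():
    return a + b == 100

def transfer(amount):
    global a, b
    with lock:
        a -= amount      # invariant briefly false here...
        b += amount      # ...and restored before the lock is released

transfer(10)
```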

&lt;h3&gt;
  
  
  Deadlocks involving only locks:
&lt;/h3&gt;

&lt;p&gt;The most effective rule for avoiding deadlocks is to have a partial order for the acquisition of locks in our programs.&lt;/p&gt;
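&lt;p&gt;One way to impose such a partial order in Python (an illustrative sketch; ordering locks by &lt;code&gt;id()&lt;/code&gt; is just one possible convention):&lt;/p&gt;

```python
# Always acquire a pair of locks in a globally consistent order (here, by
# id()), so no two threads can wait on each other cyclically.
import threading

def lock_both(m1, m2):
    """Acquire two locks in a globally consistent order."""
    first, second = sorted((m1, m2), key=id)
    first.acquire()
    second.acquire()
    return first, second

def unlock_both(first, second):
    second.release()
    first.release()
```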

&lt;h3&gt;
  
  
  Poor performance through lock conflicts:
&lt;/h3&gt;

&lt;p&gt;Whenever a thread is holding a lock, it is preventing another thread from making progress.&lt;/p&gt;

&lt;p&gt;The best way to reduce lock conflicts is to lock at a finer granularity, which introduces complexity. It is a trade-off inherent in concurrent computation.&lt;/p&gt;

&lt;p&gt;The most typical example where locking granularity is important is a class that manages a set of objects, for example a set of open buffered files.&lt;/p&gt;

&lt;p&gt;The simplest strategy is to use a single global lock for all operations, but this would prevent simultaneous operations on two different files. A better way to use the locks is to have one global lock that protects the global data structures of the class, plus object-specific locks that protect the data specific to each instance.&lt;/p&gt;
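&lt;p&gt;A sketch of this two-level scheme in Python (class and method names are illustrative assumptions):&lt;/p&gt;

```python
# A global lock guards the table of open files; each file carries its own
# lock, so writes to different files can proceed in parallel.
import threading

class FileTable:
    def __init__(self):
        self.global_lock = threading.Lock()   # protects the table itself
        self.files = {}                       # name -> (lock, list of lines)

    def open(self, name):
        with self.global_lock:                # held only for table updates
            if name not in self.files:
                self.files[name] = (threading.Lock(), [])
            return self.files[name]

    def append(self, name, line):
        file_lock, lines = self.open(name)
        with file_lock:                       # per-file lock for the write
            lines.append(line)
```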

&lt;p&gt;There is an interaction between locks and the thread scheduler that can produce insidious performance problems. The thread scheduler decides which of the non-blocked threads should be given a processor to run on; generally, this decision is based on a priority associated with each thread.&lt;/p&gt;

&lt;p&gt;Lock conflicts can lead to "priority inversion", in which even the thread with the highest priority fails to make progress. For example, consider three threads A, B and C with priorities A &gt; B &gt; C:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C is running (e.g., because A and B are blocked somewhere); 
C locks object M;  
B wakes up and pre-empts C(i.e., B runs instead of C since B has higher priority); 
B embarks on some very long computation;  
A wakes up and pre-empts B (since A has higher priority); 
A tries to lock M, but can’t because it’s still locked by C;  
A blocks, and so the processor is given back to B;  
B continues its very long computation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The best solution to this problem lies in the thread scheduler: ideally, it should temporarily raise thread C's priority to whatever is needed (for example, to A's priority) so that thread A can eventually make progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait and Pulse:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scheduling shared Resources:
&lt;/h3&gt;

&lt;p&gt;When we want to schedule &lt;strong&gt;the way in which multiple threads access some shared resource&lt;/strong&gt;, then we want to make threads block by waiting on an object. Simple mutual exclusion is not sufficient in such cases.&lt;/p&gt;

&lt;p&gt;Use the following general pattern, which is strongly recommended for all uses of condition variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while(!expression) Monitor.Wait(object);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main reason for advocating this pattern is to make your program more obviously, and more robustly, correct. With this style, it is immediately clear that the expression is true before the following statements are executed.&lt;/p&gt;

&lt;p&gt;This programming convention allows us to verify correctness by local inspection, which is always preferred over global inspection (looking for all places where &lt;code&gt;pulse(object)&lt;/code&gt; is called).&lt;/p&gt;
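&lt;p&gt;In Python, the same pattern uses &lt;code&gt;threading.Condition&lt;/code&gt;, whose &lt;code&gt;wait&lt;/code&gt;/&lt;code&gt;notify&lt;/code&gt; correspond to &lt;code&gt;Wait&lt;/code&gt;/&lt;code&gt;Pulse&lt;/code&gt;; a one-slot buffer serves as an assumed example:&lt;/p&gt;

```python
# The while-not-expression pattern: every waiter re-checks its condition
# after each wake-up, so the statements after the loop only run when the
# expression is true.
import threading

cond = threading.Condition()
slot = []                       # empty, or holds exactly one item

def put(item):
    with cond:
        while len(slot) == 1:   # re-check the expression after every wake-up
            cond.wait()
        slot.append(item)
        cond.notify()           # Pulse: a consumer may now proceed

def get():
    with cond:
        while len(slot) == 0:
            cond.wait()
        item = slot.pop()
        cond.notify()           # Pulse: a producer may now proceed
        return item
```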

&lt;h3&gt;
  
  
  Using &lt;code&gt;PulseAll()&lt;/code&gt;:
&lt;/h3&gt;

&lt;p&gt;A simple example where &lt;code&gt;PulseAll()&lt;/code&gt; is useful is when we want to awaken multiple threads because the resource we have just made available can be used by several of them.&lt;/p&gt;

&lt;p&gt;One use of &lt;code&gt;PulseAll()&lt;/code&gt; is when you want to simplify your program by awakening multiple threads, even though you know that not all of them can make progress.&lt;/p&gt;

&lt;p&gt;If we always program in the recommended style mentioned above then the correctness of our program will be unaffected if we replace all &lt;code&gt;Pulse&lt;/code&gt; with &lt;code&gt;PulseAll&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This use trades slightly poorer performance for greater simplicity.&lt;/p&gt;
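&lt;p&gt;A sketch of this trade-off in Python, where &lt;code&gt;notify_all&lt;/code&gt; plays the role of &lt;code&gt;PulseAll&lt;/code&gt; (the one-shot ready flag is an assumed example). Since every waiter re-checks its condition in a loop, waking all of them is always correct:&lt;/p&gt;

```python
# Three threads wait for a shared flag; one notify_all wakes all of them,
# and each re-checks the condition before proceeding.
import threading

cond = threading.Condition()
ready = False
awakened = []

def waiter(name):
    with cond:
        while not ready:        # the recommended re-check loop
            cond.wait()
        awakened.append(name)

threads = [threading.Thread(target=waiter, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

with cond:
    ready = True
    cond.notify_all()           # PulseAll: wake every waiting thread
for t in threads:
    t.join()
```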

&lt;h3&gt;
  
  
  Spurious Lock Conflicts:
&lt;/h3&gt;

&lt;p&gt;A potential source of excessive scheduling overhead comes from situations where a thread is awakened from waiting on an object, and before doing useful work the thread blocks trying to lock an object.&lt;/p&gt;

&lt;p&gt;Example: a thread awakens another thread with &lt;code&gt;Pulse()&lt;/code&gt; while still holding the lock on which the awakened thread will immediately block. This costs two extra reschedule operations, which is a significant expense.&lt;/p&gt;

&lt;p&gt;If getting the best performance is important, we need to consider carefully whether a newly awakened thread will necessarily block on some other object shortly after it starts running. If so we need to arrange to defer the wake-up to a more suitable time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starvation:
&lt;/h3&gt;

&lt;p&gt;Whenever we have a program that is making scheduling decisions, we must worry about how fair these decisions are; in other words, are all threads equal or are some more favoured?&lt;/p&gt;

&lt;p&gt;When you are locking an object, this consideration is dealt with for you by the threads implementation, typically by a first-in-first-out rule for each priority level.&lt;/p&gt;

&lt;p&gt;The most extreme form of unfairness is “starvation”, where some thread will never make progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concluding Remarks
&lt;/h2&gt;

&lt;p&gt;A successful program must be useful, correct, live and efficient. Our use of concurrency can impact each of these. We have discussed quite a few techniques in the previous sections that will help us in achieving these goals.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>concurrency</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
