The concept of parallel computing as a general organizational scheme; parallel computing systems.

Parallel computing processes and systems (Lecture 13)

Types of parallelism

Parallel data processing comes in two forms: pipelining and true parallelism.

Parallel processing. If a certain device performs one operation per unit of time, then it will perform a thousand operations in a thousand units. If we assume that there are five identical independent devices capable of operating simultaneously, then a system of five devices can perform the same thousand operations not in a thousand, but in two hundred units of time.

Pipeline processing. What is needed to add two real numbers represented in floating-point form? A whole series of small operations: comparing exponents, aligning exponents, adding mantissas, normalizing, and so on. The processors of the first computers performed all these "micro-operations" for each pair of arguments one after another until the final result was obtained, and only then proceeded to the next pair of operands. The idea of pipeline processing is to isolate the individual stages of a general operation so that each stage, having completed its work, passes the result to the next stage while simultaneously receiving a new portion of input data. There is an obvious gain in processing speed from overlapping operations that were previously spread out in time. Suppose an operation consists of five micro-operations, each performed in one unit of time. A single indivisible serial device will process 100 pairs of arguments in 500 units. If each micro-operation is separated into its own stage (also called a segment) of a pipeline device, then by the fifth unit of time the first five pairs of arguments will be at different stages of processing, and the entire set of one hundred pairs will be processed in 5 + 99 = 104 units of time; the speedup compared to the serial device is almost fivefold (equal to the number of pipeline stages).
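To make the arithmetic above concrete, here is a minimal sketch in C++ (the variable names are illustrative) that computes the serial and pipelined processing times for s stages, n argument pairs, and a unit stage time:

```cpp
#include <iostream>

int main() {
    const int s = 5;    // number of pipeline stages (micro-operations)
    const int n = 100;  // number of argument pairs to process
    const int t = 1;    // duration of one micro-operation, in abstract time units

    // Single indivisible serial device: every pair passes through all s stages in turn.
    const int serial_time = s * n * t;              // 5 * 100 = 500 units

    // Pipelined device: after the pipeline fills (s units), one result leaves per unit.
    const int pipeline_time = (s + (n - 1)) * t;    // 5 + 99 = 104 units

    std::cout << "serial:    " << serial_time << " units\n";
    std::cout << "pipelined: " << pipeline_time << " units\n";
    std::cout << "speedup:   "
              << static_cast<double>(serial_time) / pipeline_time << "\n";  // ~4.8
    return 0;
}
```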

It would seem that pipeline processing could simply be replaced by ordinary parallelism: duplicate the main device as many times as there are pipeline stages. But by increasing the number of devices fivefold, we significantly increase both the amount of equipment and its cost.

Implementation of parallel systems

Computer performance has grown exponentially from 1945 to the present (averaged over roughly ten-year periods). Computer architecture has undergone significant changes over that time, moving from serial to parallel.

Computer performance is directly related to the time required to perform basic operations and to the number of such basic operations that can be performed simultaneously. The execution time of a single simple instruction is ultimately bounded.

It is easy to conclude that speed cannot be increased by raising the processor clock frequency alone; relying on a single ever-faster processor ultimately leads to a dead end. Another strategy is to exploit internal parallelism within the processor chip, but such technology is very expensive. Modern supercomputers are therefore largely based on the idea of using a large number of relatively inexpensive, readily available processors.

This also includes systems such as: supercomputers equipped with thousands of processors; networks of workstations; multiprocessor workstations, etc.

A multicomputer is a set of von Neumann machines (nodes) interconnected by a network. Each computer runs its own program. These programs can access local memory and can send and receive messages over the network. Messages used for communication between computers are equivalent to read or write operations on remote memory. In an idealized network, the delivery time of a message between machines does not depend on the distance between nodes or on network traffic, but it does depend on the length of the message being sent.

The defining property of the multicomputer model is that accessing local memory (in the same node) is less expensive than accessing remote memory (located in another node); that is, reads and writes are less expensive than sending or receiving messages. It is therefore desirable that local data be accessed much more frequently than remote data. This fundamental property of software is called locality. The value of locality depends on the ratio of the cost of remote access to that of local access.

Other machine models. Let us look at the most important computer architectures. A multicomputer is very similar to what is often called a distributed-memory MIMD (Multiple Instruction, Multiple Data) computer. MIMD means that each processor can process a separate instruction stream over its own local data. Distributed memory means that memory is distributed among the processors. The main difference between a real MIMD computer and the multicomputer model is that in a real machine the cost of delivering a message between two nodes may depend on node location and on network traffic. The main representatives of this class are the IBM SP, Intel Paragon, Thinking Machines CM-5, Cray T3D, Meiko CS-2, and nCUBE.


Another class of parallel computer is the multiprocessor, or shared-memory MIMD computer. In a multiprocessor, all processors share access to a common memory, usually through a bus or a hierarchy of buses. In the idealized parallel random access machine (PRAM) model, often used in theoretical studies of parallel algorithms, any processor can access any memory element in the same amount of time. In practice this architecture usually requires some form of memory hierarchy. The number of shared-memory accesses is reduced by storing copies of frequently accessed data in a cache associated with each processor.

Accessing this cache is much faster than accessing shared memory, hence locality is very important. Programs designed for multicomputers can run just as efficiently on multiprocessors because shared memory allows efficient message passing. Representatives of this class are Silicon Graphics Challenge, Sequent Symmetry and many multiprocessor workstations.

A more specialized class of parallel computers are SIMD (Single Instruction, Multiple Data) computers. In SIMD machines, all processors execute the same instruction stream over different pieces of data. This approach can reduce the complexity of both software and hardware, but it makes sense only for specialized problems characterized by a high degree of regularity, such as image processing and certain kinds of numerical simulation. Algorithms suitable for multicomputers cannot, in general, be executed efficiently on SIMD computers.

Neurocomputational systems.

A neurocomputing device is a system whose functioning is maximally focused on the implementation of neural network algorithms. The main difference between neurocomputers and other computing systems is the provision of high parallelism of calculations through the use of a specialized neural network logical basis or specific architectural solutions. Using the ability to represent neural network algorithms for implementation on a neural network logical basis is the main prerequisite for a sharp increase in the performance of neurocomputers.

Currently, the development of digital neurocomputers is most actively carried out in the following areas:

· software emulation of neural network algorithms based on the use of conventional computing facilities and software for modeling neural networks;

· software and hardware emulation of neural networks based on standard computing tools with a plug-in virtual neural network unit that performs basic neurooperations and software that performs general control functions;

· hardware implementation of neural networks.

Although the greatest effect in implementing neural network algorithms can be achieved only with neurocomputers of the third kind, their widespread use is limited by the high cost of the technology. For example, the Synaps1 neurocomputer, one representative of the third direction, has a multiprocessor architecture and an original memory subsystem design, and uses signal processors and special MA16 matrix signal processors to perform computational operations. Thanks to this, the performance of the neurocomputer amounts to several billion multiplications and additions per second. The software of this system includes the Synaps1 operating system with a library of neuroalgorithms, as well as a basic neural network library, a compiler for the neuroalgorithm programming language nAPL (a set of library functions for C++), and other tools. Applied research has shown that using neurocomputers of the third direction makes it possible to increase the performance of conventional computing systems by at least three orders of magnitude and to simulate neural networks with millions of connections. For example, Synaps1 can simulate a neural network with 64 million synapses using various activation functions.

Two classes of computing systems that are sometimes used as parallel computers are the local area network (LAN), in which computers located in physical proximity (for example, in the same building) are connected by a fast network, and the wide area network (WAN), which connects geographically remote computers. Although systems of this type introduce additional problems, such as security and reliability, they can for many purposes be considered multicomputers, albeit with a high cost of remote access.

Difficulties of using parallel systems

The enormous performance of parallel computers and supercomputers is more than offset by the difficulties of using them.

You have a program and access to, say, a 256-processor computer. What are you expecting? Yes, it is clear: you quite legitimately expect that the program will execute 256 times faster than on a single processor. But this is precisely what most likely will not happen.

Amdahl's Law. Suppose that in a program the fraction of operations that must be performed sequentially equals f, where 0 <= f <= 1 (the fraction is understood not as a static share of the lines of code, but as a share of the operations performed during execution). The extreme values of f correspond to completely parallel (f = 0) and completely sequential (f = 1) programs. Then the speedup S that can be obtained on a computer with p processors for a given value of f is bounded by Amdahl's law: S <= 1 / (f + (1 - f)/p). For example, if 9/10 of a program executes in parallel and 1/10 remains sequential, then a speedup of more than 10 times is impossible in principle, regardless of the quality of the implementation of the parallel part of the code and of the number of processors used (a factor of 10 is reached only when the execution time of the parallel part is zero).

A consequence of Amdahl's law: in order to speed up the execution of a program by a factor of q, it is necessary to speed up by at least a factor of q no less than a (1 - 1/q) fraction of the program. Therefore, if you want to speed up a program 100-fold compared with its sequential version, you must obtain at least that speedup for no less than 99% of the code!
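As a quick check on such figures, here is a minimal sketch of Amdahl's law as a C++ function (the function name is illustrative):

```cpp
#include <iostream>

// Amdahl's law: upper bound on speedup for sequential fraction f on p processors.
double amdahl_speedup(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main() {
    // f = 1/10 sequential: never more than 10x, however many processors are used.
    std::cout << amdahl_speedup(0.1, 256) << "\n";     // ~9.66
    std::cout << amdahl_speedup(0.1, 1000000) << "\n"; // approaches 10 in the limit
    // A fully parallel program (f = 0) scales linearly with p.
    std::cout << amdahl_speedup(0.0, 256) << "\n";     // 256
    return 0;
}
```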

Thus, making a parallel computing system work with maximum efficiency on a specific program is not an easy task, since it is necessary to carefully coordinate the structure of programs and algorithms with the architectural features of parallel computing systems.

Parallel Systems Programming

The von Neumann machine model assumes that the processor executes a sequence of instructions. In addition to various arithmetic operations, instructions can specify the addresses of data to be read from or written to memory and/or the address of the next instruction to be executed. While it is possible to program a computer directly in terms of this basic model, the method is prohibitively complex for most purposes, because we would have to keep track of millions of memory locations and orchestrate the execution of thousands of machine instructions. Consequently, a modular design technique is applied: complex programs are built from simple components, and components are structured in terms of higher-level abstractions (such as data structures, iterative loops and procedures). Abstractions such as procedures make it easier to exploit modularity by allowing objects to be manipulated without disturbing their internal structure. This is how high-level languages such as Fortran, C, Ada and Java work: they allow development in terms of these abstractions, which are translated automatically into executable code.

Parallel programming introduces additional sources of complexity: if we had to program at the lowest level, we would not only have to deal with a larger number of executed instructions, but also manage the execution of thousands of processors and coordinate millions of interprocessor interactions. Therefore, abstraction and modularity are at least as important as in sequential programming. In fact, modularity can be highlighted as the fourth fundamental requirement for parallel software, in addition to parallelism, scalability, and locality.

The main abstractions used in parallel programming come down to tasks and channels:

1. A parallel computation consists of one or more tasks. Tasks execute in parallel. The number of tasks may change during program execution.

2. A task encapsulates a sequential program and local memory. In addition, a set of input and output ports defines its interface to its environment.

3. A task can perform four basic actions in addition to reading and writing in local memory: send a message to its output ports, receive a message from its input ports, create new tasks and destroy (complete) a task.

4. The send operation is asynchronous: it completes immediately. The receive operation is synchronous: it blocks execution of the task until a message is received.

5.I/O pairs can be linked by queued messages called channels. Channels can be created and deleted, and references to channels (ports) can be included in messages so that connectivity changes dynamically.

6. Tasks can be mapped to physical processors in various ways; the mapping used does not affect the semantics of the program. In particular, several tasks can be mapped to a single processor (one could also imagine a single task being mapped to several processors, but this possibility is not considered here).
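The following is a minimal sketch of these abstractions, with tasks modeled as C++ threads and a channel as a blocking message queue; the class and function names are illustrative, not part of any standard API. Send returns immediately (asynchronous), while receive blocks until a message arrives, which is exactly the data dependency a channel expresses.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A channel: a message queue with asynchronous send and blocking receive.
class Channel {
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(int msg) {                       // completes immediately
        { std::lock_guard<std::mutex> lk(m_); q_.push(msg); }
        cv_.notify_one();
    }
    int receive() {                            // blocks until a message is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        int msg = q_.front();
        q_.pop();
        return msg;
    }
};

int main() {
    Channel ch;
    // Producer task: computes values and sends them to its output port.
    std::thread producer([&ch] { for (int i = 1; i <= 3; ++i) ch.send(i * i); });
    // Consumer task: cannot proceed until the data it depends on has arrived.
    std::thread consumer([&ch] { for (int i = 0; i < 3; ++i) std::cout << ch.receive() << "\n"; });
    producer.join();
    consumer.join();
    return 0;
}
```

On a multicomputer the same channel would be implemented as interprocessor communication rather than a queue in shared memory; the program text of the tasks would not change.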

The task abstraction requires the locality property: data contained in a task's local memory is "close"; other data is "remote". The channel abstraction provides a mechanism for specifying which data of one task must be computed before another task can begin running (that is, it expresses a data dependency). The task and channel model also has some other properties:

Performance. Sequential programming abstractions, such as procedures and data structures, are effective because they can be mapped simply and efficiently onto the von Neumann computer. Tasks and channels map similarly directly onto a multicomputer. A task represents a piece of code that can be executed sequentially on a single processor. If two tasks that share a channel are mapped to different processors, the channel is implemented as interprocessor communication; if they are mapped to the same processor, a more efficient mechanism can be used.

Independence of mapping. Since tasks communicate using the same mechanism (channels) regardless of where they are located, the result computed by a program does not depend on where its tasks execute. Therefore, algorithms can be designed and implemented without worrying about the number of processors on which they will run; in fact, algorithms are often designed to create many more tasks than there are processors. This is a simple way to achieve scalability: as the number of processors increases, the number of tasks per processor decreases, but the algorithm itself does not need to be modified. Creating more tasks than processors also helps to mask communication delays, since other computations can be performed while a task waits for remote data.

Modularity. In modular programming, different program components are developed separately as independent modules and then combined to form a complete program. Interaction between modules is limited to clearly defined interfaces. Consequently, modular implementations can be changed without modifying other components, and the properties of a program can be determined from the specification of its modules and the code that connects those modules together. When modular development is successfully applied, software complexity is reduced and code reuse is facilitated.

Determinism. An algorithm or program is deterministic if, when executed with a given input, it always produces the same output; it is nondeterministic if multiple executions on the same input may produce different outputs. While nondeterminism is sometimes useful and should be supported, a parallel programming model that makes it easier to write deterministic programs is highly desirable. Deterministic programs tend to be easier to understand. Also, when checking correctness, only one execution sequence of a parallel program needs to be considered, rather than all possible ones.


Parallel computing is a method of organizing computer computations in which programs are developed as a set of interacting computational processes that run in parallel (simultaneously). The term covers both the issues of parallelism in programming and the creation of efficient hardware implementations. The theory of parallel computing constitutes a branch of the applied theory of algorithms.

There are various ways to implement parallel computing. For example, each computing process may be implemented as an operating system process, or computing processes may be a collection of threads of execution within a single OS process. Parallel programs can be physically executed either sequentially on a single processor - alternating in turn the steps of each computational process, or in parallel - allocating one or more processors (located nearby or distributed in a computer network) to each computational process.

The main difficulty in designing parallel programs is to ensure the correct sequence of interactions between different computing processes, as well as the coordination of resources shared between processes.

In some parallel programming systems, data transfer between components is hidden from the programmer (for example, by means of a promise mechanism), while in others it must be specified explicitly. Explicit interactions can be divided into two types: interaction through shared memory and interaction through message passing.

Message-passing parallel systems are often easier to understand than shared memory systems and are generally considered a superior method of parallel programming. There is a wide range of mathematical theories for the study and analysis of message passing systems, including the actor model and various types of process calculus. Messaging can be efficiently implemented on symmetric multiprocessors with or without shared coherent memory.

Shared-memory parallelism and message-passing parallelism have different performance characteristics. Typically (but not always), the per-process memory overhead and the task-switching time are lower in message-passing systems, but passing a message is more expensive than a procedure call. These differences are often outweighed by other factors that affect performance.

Plaksin M.A.

National Research University Higher School of Economics (Perm branch), Perm, Ph.D., Associate Professor of the Department of Information Technologies in Business, mapl@list.ru

"SUPERCOMPUTERS" VS "PARALLEL PROGRAMMING". “PARALLEL PROGRAMMING” VS “COLLABORATIVE ACTIVITY”. HOW TO STUDY THE TOPIC “PARALLEL COMPUTING” IN SECONDARY SCHOOL?

KEYWORDS

Computer science, parallel programming, parallel computing, parallel algorithms, supercomputers, primary school, secondary school, TRIZformashka.

ABSTRACT

The article is devoted to the issue of including the topic “parallel computing” in the school computer science course. A number of problems arising in this case are mentioned, the purpose of studying the topic, the selection of material, some proposals for teaching methods, mechanisms for testing the proposed methodology and the accumulated experience are considered. The question of the place of this material in the curriculum is not addressed.

The current stage of development of computer science is associated with the massive spread of parallelism of calculations at all levels (multi-machine clusters, multiprocessor computers, multi-core processors).

The massive spread of parallelism has serious consequences that have yet to be identified and analyzed. Let's start by listing some theoretical problems.

The modern theory of algorithms was created with the concept of a sequential algorithm in mind. How will the refusal to require a sequence of steps affect the concept of an algorithm?

For at least the last 20 years, the concept of “algorithm” has been introduced in schools inextricably linked with the concept of “performer”. This is natural for a sequential algorithm. What to do with a parallel algorithm? Is it performed by one performer or a group of performers? To be specific, let’s take the computer training program “Tank Crew” as an example. In this program, the student is required to program the actions of a tank crew consisting of three people: a gunner, a driver and a loader. Each of them has its own command system. In order to complete a combat mission (hit all targets), all crew members must act in concert. For an example of the playing field for the Tank Crew program, see Fig. 1.

Question: should these three actors be considered as independent performers or as three components (devices) of one complex performer? For the tank crew, the second option seems more natural, since not a single character is able to complete the task on his own. But what if the game becomes more complicated, and a combat mission is assigned to two tanks at once? For three tanks? Three members of one crew can well be considered as three parts of one performer. But each crew is obviously an independent performer. This means that a parallel algorithm for several tanks will be executed by a group of executors at once. It turns out that for a parallel algorithm it is necessary to consider both possibilities: the execution of parallel actions by one executor and by a group of executors. In the case of a tank crew, drawing the line is easy. The performer is the one who is able to solve the task. This executor may consist of several components, each of which performs a certain part of the task, but cannot independently complete the entire task without the help of other components. But whether the separation of “whole performers” and parts of a complex performer will always be as simple is impossible to say now.

Fig.1. Fragment of the playing field of the Tank Crew program

Identification of the parts of the performer that are capable of independent actions requires that these parts be named somehow. Moreover, the name must allow recursion, since the acting parts of the performer themselves may have a complex structure.

It is necessary to agree on a term to designate a group of cooperating performers. The term "team" is not suitable: in Russian the same word also means "command", so it is associated with the "executor command system" and with "CPU commands". "Collective of performers"? "Brigade of performers"?


Fig.2. Fragment of a program for the "Tank Crew" (an example of command lines)

The traditional concept of the "executor command system" (SKI) and the concept of the command itself require refinement. If we believe that the three members of a tank crew form a single performer, then what should be considered the SKI of this performer? And what counts as a command? Or should the concept of the SKI be kept for each character? Then it is no longer a system of commands of the EXECUTOR, but a system of commands of one of the components of the performer (for which there is as yet no name).

It is convenient to extend the concept of a command to that of a "line of commands". For an example of tank crew command lines, see Fig. 2. However, the concept of a "line of commands" works well only for linear algorithms. In other cases the lines are formed dynamically, and it is impossible to depict them in the form of a visual table.

Among the properties of algorithms, a new practically significant characteristic stands out: the ability to parallelize. A clarifying question is about the possible degree of parallelization (to what extent does it make sense to increase the number of processors when executing a given algorithm).

A separate issue is methods for parallelizing existing sequential algorithms.

Until recently, parallel programming was the province of a small number of highly skilled systems programmers. Today it is becoming part of professional competence. But parallel programming technology differs significantly from traditional sequential programming. In support of this statement, following L.L. Bosova, we quote the leading Russian specialist in the field of parallel computing, V.V. Voevodin:

"...The mastery of parallel-architecture computing technology... by young specialists comes with great difficulty. In our opinion, this is because acquaintance with parallel computing, like education in this area in general, does not begin where it should begin. Moreover, what one needs to start with is not covered in any courses at all. The need to solve problems quickly on parallel-architecture computing technology forces users to change their entire habitual style of interaction with computers. Compared, for example, with personal computers and workstations, almost everything changes: other programming languages are used, most algorithms are modified, users are required to provide numerous non-standard and difficult-to-obtain characteristics of the problems being solved, the interface ceases to be friendly, and so on. An important fact is that failure to take the new operating conditions fully into account can significantly reduce the efficiency of using new and, moreover, quite expensive equipment."

"It is only important that the student learn as early as possible that there are other ways of organizing computational processes besides the sequential execution of "operation after operation", that the most powerful modern computing technology is built on these other methods, and that only with such technology is it possible to solve large industrial and scientific problems, etc. It is important, first of all, to draw students' attention as early as possible to the need for a critical attitude toward the philosophy of sequential computing. After all, it is this philosophy that they have to deal with throughout their entire education, both at school and at university. And it is precisely this philosophy that hinders understanding of the features of working on parallel-architecture computers."

Today we need methods for mass training in parallel programming technology. The author of this article believes that in the learning process the time has come for a revolution in the relationship between sequential and parallel programming. Until now, we first taught sequential programming, and then - parallelization of sequential algorithms. Now we need to raise the question of teaching parallel programming right away. A sequential algorithm should be considered as a certain part of a parallel algorithm, which does not require communication with its other parts. How to do this is an open question. There are still some ideas that need practical implementation and testing. It is hoped that in a year at the next conference it will be possible to discuss the results obtained.

Thirty years ago, the beginning of mass computerization of production required raising the level of computer literacy of the population. This led to the introduction of computer science into the school curriculum in 1985. But the computer science course in its Soviet (later Russian) version was not reduced to "push-button computer science" - to mastering the technology of working with application software packages and computer games. It began to change the thinking style of the younger generation. First of all, this concerned algorithmic thinking, precision and rigor. Then the computer science course incorporated elements of logic and systems analysis. All of this later greatly simplified the spread of the project approach, a technology much needed in the 21st century. The talk now is that over the next decade parallel algorithms should become an element of the general culture of thinking. Question: how will mastering the concept of a parallel algorithm affect the thinking of the next generation, and what will the restructuring of consciousness "in a parallel way" lead to?

The massive spread of parallel information processing makes it urgent to move the relevant concepts into the category of publicly accessible and general cultural ones. Familiarity with parallel algorithms should become part of literacy, just as basic concepts of algorithm theory have become over the past quarter century. This can be done only in one way - by including relevant topics in the school computer science course. This means that we need a methodology for initial acquaintance with parallel programming at the high school level.

Historically, the first attempt to include the topic of parallel computing in the school computer science course was made twenty years ago. Twenty years ago, in a course called "Algorithmics", a "Construction Director" was described who commanded the parallel actions of several teams building a structure out of rectangular and triangular blocks. Moreover, a software implementation was created for this performer. Alas! This wonderful methodological development was not in demand in the mid-90s. It was almost twenty years ahead of its time!

Today, the situation is such that the topic of parallel computing in high school is primarily related to the topic of supercomputers. It is on supercomputers that the authors of various methodological developments focus the attention of students, even when this is not necessary. Suffice it to say that the corresponding section in the journal “Informatics at School” is called “Supercomputer Education at School.” This situation has both positive and negative sides. Among the positive aspects are:

The topic of supercomputers is of interest in society, including among students. This interest repeats at the modern level the interest that half a century ago was caused by large machines - the supercomputers of their time;

Organizational support from the supercomputing community. Every summer, the Faculty of Computational Mathematics and Cybernetics of Moscow State University hosts the Summer Supercomputer Academy. And every summer, within the framework of this Academy, a school track for computer science teachers is organized. Training is provided free of charge. Nonresident students are provided with housing on very favorable terms. At the Russian Supercomputing Days conference in September 2015, a school section and master class for computer science teachers were organized. Consistent organizational work led to the identification and formation of a group of teachers interested in promoting this topic;

The presence of a bright, charismatic leader, such as Vladimir Valentinovich Voevodin - Doctor of Physical and Mathematical Sciences, Professor, Corresponding Member of the Russian Academy of Sciences, Deputy Director of the Research Computing Center of Moscow State University;

Interest and support (including material) from the Russian representative office of Intel and the strategic development manager of Intel, Igor Olegovich Odintsov.

The disadvantage of the “supercomputer” approach is that it narrows the scope of parallel computing. Supercomputers themselves are, as a rule, inaccessible to schoolchildren (unless in large cities they can be seen on excursions). The tasks they are aimed at solving are too complex for schoolchildren and, in most cases, do not have immediate practical significance and are not of practical interest.

A natural extension of the supercomputing field is the study of parallel programming. Currently, to execute parallel programs it is not at all necessary to have a supercomputer. A multi-core processor or video card with a set of graphics accelerators is enough. And this is already available to almost everyone. Among the works in this direction, we note the candidate’s dissertation of M.A. Sokolovskaya on the methodology of teaching future computer science teachers the basics of parallel programming and the experience of E.Yu. Kiseleva on mastering CUDA technology by schoolchildren.

According to the author of this article, focusing on supercomputers and parallel programming significantly impoverishes and complicates the topic of parallel computing and distracts students from many important and accessible issues. The purpose of the topic "parallel computing" in high school is not to teach "real" parallel programming (the relevant language constructs, programming languages and technologies), but to familiarize students with the corresponding set of concepts and with the features of parallel work. The world around us and inside us is a complex parallel system, and this system itself provides plenty of material for mastering the concepts and mechanisms of parallelism. No complex artificial constructs such as MPI and OpenMP technologies are needed for this. School computer science should foster thinking in a "parallel way". And then let the university build professional knowledge, skills and abilities into this thinking. At school, it makes sense to focus not on getting acquainted with supercomputers and studying parallel programming, but on mastering the mechanisms of "joint activity" that are constantly and widely used in life. The course proposes to cover the following questions:

1) Collaboration of several executors (digging a ditch with several diggers) and parallelization “inside” one executor in the presence of several processing devices (reading and eating an apple). In computer science, this will be a multi-machine complex and a multi-core processor.

2) Types of parallelism: true parallelism and pseudo-parallelism (one processor executes several programs in parts).

3) Performers of the same type (diggers) and different types (tank crew).

4) Works of the same type and different types.

5) The ratio of “executors - jobs”: 1 performer - 1 job, 1 performer - N jobs (pseudo-parallel execution or true parallelism in the presence of several processing devices for different jobs), N performers - 1 job, N performers - N jobs.

6) Coordination of the activities of performers. Types of approval: by parts of work, by time, by results of activities, by resources.

7) Resources. Resources are shared and non-shared, consumable and reusable. Recycling of consumed resources (“garbage collection” in the broad sense).

8) Performing the same work by one performer and a group of performers. Dependence of work speed on the number of performers. Dependence of the cost of work on the number of performers. Nonlinear increase in work speed with an increase in the number of performers. Critical path. Optimal number of performers. Optimal loading of performers. Optimal procedure. Load balancing.

9) Competition between performers for resources. Blocking. Clinch (deadlock).

10) Mechanisms for coordinating the actions of performers.

11) Pseudo-parallel execution of processes on a computer (division of one resource - the processor) between executor-processes.

12) Suitability of algorithms for parallelization. Possible degree of parallelization. The existence of algorithms that cannot be parallelized.

Please note that the above list represents the private opinion of the author of the article and is open for discussion, addition and correction. Moreover, in the author’s opinion, it would be very useful for the “supercomputer community” to formulate a “social order” for the school: what kind of knowledge and skills it wants to see in school graduates. How should a graduate of the “supercomputer world” school differ from a graduate of today? If there is an order, there will be a result. Fresh example. On the first day of Russian Supercomputing Days-2015, two reports voiced the idea that the speed of modern supercomputers is determined not by the power of processors (which is the focus of public attention), but by the speed of RAM. It is this that becomes the bottleneck, the throughput of which determines the productivity of the entire system. As a result, on the second day of the conference, participants in the teacher’s master class tested a game invented by the author of this article, demonstrating the interaction of the central processor, RAM and cache memory. The order and form of presentation of the material is an open question.

The material should be demonstrated using examples not related to the operation of a computer. Performers must manipulate material objects.

As much of the training as possible should be in the nature of business (organizational and activity) games.

Fulfilling these requirements will make it easier to understand the material being studied. This will be useful both when using this technique in computer science lessons at school (including elementary!), and when teaching adults: computer science teachers and students. A schoolchild, a school teacher, a student of a non-core specialty will be able to stop at the level of familiarization and understanding. The professional student will have to take the next step and move from acquaintance to studying these mechanisms at a professional level. But this is already a step beyond the methods of initial familiarization with the topic.

The author of this article began work on preparing a methodology for studying parallel computing in 2013 during the preparation of the TRIZformashka-2013 competition and continued in subsequent years.

("TRIZformashka" is an interregional Internet competition in computer science, systems analysis and TRIZ. It is held annually in the second half of March. The age of participants ranges from the first grade of school to the fourth year of university. Geography - from Vladivostok to Riga. The average number of participants is 100 teams (300 people), the maximum - 202 teams (more than 600 people). The competition website is www.trizformashka.ru.) Back in 2013, the goal of the work was formulated as follows:

1. Within two to three years, prepare a description of performers, a set of games and tasks related to parallel computing;

2. Offer them (in parts, annually) to the participants of the competition;

3. Analyze their reaction (assess the number of solvers, their age, the success of the solutions, typical errors, detected inaccuracies in the formulation of problems, etc.). The TRIZformashka competition turned out to be a convenient tool for debugging problems, since it made it possible to obtain reactions from all ages (from first-graders to fourth-year university students), from different regions and from different educational institutions.

Over the past years, the following set of methodological tools and platforms for their testing have been prepared.

1. Tasks on parallelism, starting from 2013, were included in the “TRIZformashka” competition (starting from 2013, the competition has the subtitle “Parallel Computing”). A list of task types is given below;

2. A chapter on parallelism has been prepared for the new version of the computer science textbook for grade 4. The material was tested in 3rd and 4th grades of Lyceum No. 10 in Perm;

3. The computer game “Tank Crew” has been developed and used since 2014 in the TRIZformashka competition;

4. A number of games have been developed and tested, which reflect the following issues:

Coordination of the activities of performers. Various types of approval;

Performing the same work by one performer and a group of performers. Dependence of work speed on the number of performers. Nonlinear increase in work speed with an increase in the number of performers. Critical path. Optimal number of performers. Optimal loading of performers. Optimal procedure;

Resources. Shared and non-shared resources;

Competition between performers for resources. Blocking. Clinch (deadlock).

The following types of problems were proposed and tested:

1. Tasks on types of approval. (What types of coordination exist in the school cafeteria?);

2. Game "Tank Crew". Assignment to construct a parallel algorithm;

3. Performer "Construction". Several work crews simultaneously build a structure from horizontal and vertical beams. The tasks include executing a given algorithm, developing a new algorithm, finding errors in a given algorithm, and investigating algorithms (comparing construction times with different algorithms, comparing construction costs, assessing the possibility of saving by redistributing labor, etc.);

4. Competition for resources. Three little pigs each cook their own lunch. For each piglet it is indicated what dishes he prepares, what resources (equipment, utensils, etc.) he needs for this and for how long these resources should be used. It is necessary to draw up a work schedule for each piglet, if he cooks alone in the kitchen, if they cook in pairs, if all three cook at once. Cooking time should be minimized;

5. Network diagram. A network diagram is given. It is required to depict (schematically) the structure that will be built, determine how many days it will take for construction with a particular number of teams, what part of the work will be completed by a certain time;

6. Tiered-parallel forms. Planning work according to various criteria. The work assignment, worker productivity, and payment rules are given. It is necessary to determine the number of workers needed to complete the work in a given time, determine the period of work for a given number of workers, determine the number of workers needed to minimize the cost of work;

7. Gantt charts. The text describes the work plan for the reconstruction of the workshop: duration and mutual sequence of actions, required workers. It is necessary to determine the deadline for the completion of the object, the change in the deadline due to certain changes in the workforce, and the list of workers involved on a specific date.

8. Coordination of repetitive work. Let the task be given to produce a batch of devices in a minimum time, provided that each device must be processed on different equipment; there are different amounts of equipment with different productivity. It is necessary to plan the start and operation time of each equipment and minimize downtime.

Today we have the following results:

1. An approach has been formulated for studying the topic of “parallel computing”: to go not from the problems of computer science, but “from life”, to focus on “joint activity”;

2. A list of issues has been formulated that are proposed to be reflected in the initial course of parallel computing;

3. Some classes of problems are formulated. Based on the accumulated experience, you can evaluate what kind of problems should be invented;

4. A set of problems of the named classes has been prepared. The problems were tested in the TRIZformashka competitions in 2013, 2014 and 2015 and/or in elementary school (in classes with third- and fourth-grade students of Lyceum No. 10 in Perm);

5. A set of business games has been prepared. The games have been tested in primary schools and at a number of events for teachers. In particular, they were presented at the school track of the Summer Supercomputer Academy of Moscow State University in 2014, at a master class for teachers at Russian Supercomputing Days-2015, at several other conferences (including at the IT-education-2015 conference of the APKIT association) and other events for computer science teachers;

6. A set of texts about parallelism has been prepared for a fourth grade textbook. The texts were tested at Lyceum No. 10 in Perm;

7. The computer game “Tank Crew” has been prepared. The game was tested in the TRIZformashka competitions in 2014 and 2015;

8. The TRIZformashka competition has proven itself as a testing platform;

9. The task of “castling” in the process of teaching algorithmization is formulated: immediately teach parallel programming, presenting a sequential algorithm as part of a parallel one. There are thoughts on how this idea can be implemented. There is an opportunity to try out these ideas during the current school year (for students in grades 4-5);

10. There is a need, desire and opportunity to continue working.

Literature

1. Algorithms: grades 5-7: Textbook and problem book for general education. educational institutions /A.K. Zvonkin, A.G. Kulakov, S.K. Lando, A.L. Semenov, A.Kh. Shen. - M.: Bustard, 1996.

2. Bosova L.L. Parallel algorithms in primary and secondary schools. //Informatics at school. 2015, no. 2. P.24-27.

3. Voevodin V.V. Computational mathematics and the structure of algorithms: Lecture 10 on why it is difficult to solve problems on parallel-architecture computing systems and what additional information one needs to know to overcome these difficulties successfully: a textbook. M.: MSU Publishing House, 2010.

4. Gavrilova I.V. First trip to the “parallel world”. //Informatics at school. 2015, No. 6. P.16-19.

5. Dieter M.L., Plaksin M.A. Parallel computing in school computer science. Game "Construction". //Informatics at school: past, present and future.: materials of the All-Russian. scientific method. conf. on the use of ICT in education, February 6-7, 2014 /Perm. state national research univ. - Perm, 2014. - P.258-261.

6. Ivanova N.G., Plaksin M.A., Rusakova O.L. TRIZformashka. // Informatics. No. 05. Retrieved 10/10/2015.

14. Plaksin M.A. Computer science: a textbook for 4th grade: in 2 parts / M.A. Plaksin, N.G. Ivanova, O.L. Rusakova. - M.: BINOM. Knowledge Laboratory, 2012.

15. Plaksin M.A. About the method of initial acquaintance with parallel computing in high school. //Informatics at school: past, present and future.: materials of the All-Russian. scientific method. conf. on the use of ICT in education, February 6-7, 2014 /Perm. state national research univ. - Perm, 2014. - P.256-258.

16. Plaksin M.A. A set of business games for introducing parallel computing in elementary school. // Teaching information technologies in the Russian Federation: materials of the Thirteenth Open All-Russian Conference "IT-Education-2015" (Perm, May 14-15, 2015). Perm State National Research University, - Perm, 2015. P.60-62.

17. Plaksin M.A., Ivanova N.G., Rusakova O.L. A set of tasks for getting acquainted with parallel computing in the TRIZformashka competition. //Teaching information technologies in the Russian Federation: materials of the Thirteenth Open All-Russian Conference “IT Education-2015” (Perm, May 14-15, 2015). Perm State National Research University, - Perm, 2015. pp. 232-234.

18. Sokolovskaya M.A. Methodological system for teaching the basics of parallel programming to future computer science teachers.: abstract. dis. ...cand. ped. Sciences, Krasnoyarsk, 2012.

The concept of parallel computing

FUNDAMENTALS OF PARALLEL COMPUTING

Lecture No. 6


Parallel or concurrent computing can be understood as processes of problem solving in which several computational operations can be performed at the same time.

Parallel computing forms the basis of supercomputing technologies and high-performance computing

Parallel processing

If a certain device performs one operation per unit of time, then it will perform a thousand operations in a thousand units. If we assume that there are five identical independent devices capable of operating simultaneously, then a system of five devices can perform the same thousand operations not in a thousand, but in two hundred units of time.

Similarly, a system of N devices will perform the same work in 1000/N units of time. Similar analogies can be found in life: if one soldier digs up a garden in 10 hours, then a company of fifty soldiers with the same abilities, working simultaneously, will cope with the same work in 12 minutes - the principle of parallelism in action!

The pioneer in parallel processing of data streams was Academician A.A. Samarsky, who performed the calculations necessary to simulate nuclear explosions in the early 50s. Samarsky solved this problem by seating several dozen young ladies with adding machines at tables. The young ladies transferred data to each other simply in words and put down the necessary numbers on adding machines. Thus, in particular, the evolution of the blast wave was calculated.

There was a lot of work, the young ladies were tired, and Alexander Andreevich walked among them and encouraged them. This, one might say, was the first parallel system. Although the calculations for the hydrogen bomb were carried out masterfully, their accuracy was very low, because there were few nodes in the grid used, and the calculation time was too long.

Pipeline processing

The idea of ​​pipeline processing is to isolate individual stages of performing a general operation, and each stage, having completed its work, would pass the result to the next one, while simultaneously receiving a new portion of input data. We get an obvious gain in processing speed by combining previously spaced operations.

Suppose an operation consists of five micro-operations, each performed in one unit of time. A single indivisible serial device will process 100 pairs of arguments in 500 units. If each micro-operation is separated into its own stage (also called a segment) of a pipeline device, then by the fifth unit of time the first five pairs of arguments will be at different stages of processing, and the entire set of one hundred pairs will be processed in 5 + 99 = 104 units of time; the speedup compared to the serial device is almost fivefold (equal to the number of pipeline stages).



Models of parallel computers (Flynn classification)

· “One command stream - one data stream” (SISD - “Single Instruction Single Data”)

This class corresponds to the von Neumann architecture. SISD computers are ordinary, "traditional" sequential computers, in which only one operation is performed on one data element (numeric or other) at any given moment. Most modern personal computers belong to this category.

· "One command stream - many data streams" (SIMD - "Single Instruction - Multiple Data")

SIMD (Single Instruction, Multiple Data) is a principle of computer computation that provides parallelism at the data level. SIMD computers consist of one instruction processor (control module), called the controller, and several data processing modules, called processing elements. The control module receives, analyzes and executes instructions.

If an instruction involves data, the controller sends the instruction to all processing elements, and it is executed on several or all of them. Each processing element has its own memory for storing data. One advantage of this architecture is that the computation logic can be implemented more efficiently. SIMD processors are also called vector processors.
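As an illustration of data-level parallelism, here is a minimal sketch: the same multiply-add is applied to every element of the arrays, which is exactly the loop shape that SIMD (vector) hardware executes with vector instructions. The `omp simd` pragma (OpenMP 4.0 and later) merely asks the compiler to vectorize the loop; without OpenMP support it is ignored and the loop runs sequentially.

```cpp
#include <iostream>
#include <vector>

int main() {
    std::vector<float> a(1024, 1.5f), b(1024, 2.5f), c(1024);

    // One instruction stream, many data elements: each iteration applies
    // the same multiply-add to a different piece of data.
    #pragma omp simd
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] * b[i] + 1.0f;

    std::cout << c[0] << "\n";  // 4.75
    return 0;
}
```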

· “Many command streams - one data stream” (MISD - “Multiple Instruction - Single Data”)

There are practically no computers of this class and it is difficult to give an example of their successful implementation. One of the few is a systolic array of processors, in which the processors are located at the nodes of a regular lattice, the role of edges of which is played by interprocessor connections. All processor elements are controlled by a common clock generator. In each operating cycle, each processing element receives data from its neighbors, executes one command and transmits the result to its neighbors.

Arrays of PEs with direct connections between neighboring PEs are called systolic. Such arrays are extremely efficient, but each of them is oriented toward a very narrow class of problems. Let us consider how a systolic array can be built for a particular problem. Suppose, for example, we want to create a device for computing the matrix D = C + AB, where all matrices are band matrices of order n. Matrix A has one diagonal above and two diagonals below the main one; matrix B has one diagonal below and two diagonals above the main one; matrix C has three diagonals above and below the main one. Let each PE be able to perform the scalar operation c + ab and simultaneously transmit data. Each PE must therefore have three inputs a, b, c and three outputs a, b, c. The input (in) and output (out) data are related by the relations

a_out = a_in, b_out = b_in, c_out = c_in + a_in * b_in.

If at the moment of the operation some data has not arrived, we assume it is zero. Let us further assume that all PEs are located in a plane and each of them is connected to six neighbors. If the data are arranged as shown in the figure, the circuit will compute the matrix D.

The array operates in clock cycles. During each clock cycle, all data is moved to neighboring nodes in the directions indicated by the arrows.

The figure shows the state of the systolic array at some moment in time. In the next clock cycle all data will move by one node, and the elements a11, b11, c11 will end up in the same PE, located at the intersection of the dashed lines. Therefore, the expression c11 + a11*b11 will be evaluated. In the same clock cycle, the data a12 and b21 will come close to the PE located at the top of the systolic array.

In the next clock cycle all data will again move one node in the direction of the arrows, and the upper PE will receive a12 and b21 together with the result of the previous operation of the PE located below, i.e. c11 + a11*b11. Therefore, the expression c11 + a11*b11 + a12*b21 will be evaluated. This is the element d11 of the matrix D.

Continuing the step-by-step examination of the process, one can verify that at the PE outputs corresponding to the upper boundary of the systolic array, elements of the matrix D are output every three clock cycles, with elements of the same diagonal appearing at the same output. The computation of the entire matrix D is completed in about 3n cycles. The load on each systolic cell is then asymptotically equal to 1/3.
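Below is a minimal sketch of the PE update rule used in this walk-through; the numeric values are invented for illustration, and the movement of data through the whole array is not simulated here.

```cpp
#include <iostream>

// Data carried by one processing element during one clock cycle.
struct PEData { double a, b, c; };

// One cycle of a systolic cell: pass a and b through unchanged, accumulate c.
PEData pe_step(const PEData& in) {
    return { in.a, in.b, in.c + in.a * in.b };  // a_out = a_in, b_out = b_in, c_out = c_in + a_in*b_in
}

int main() {
    // Two consecutive cycles on the path of c11: first add a11*b11, then a12*b21.
    PEData step1 = pe_step({ /*a11*/ 2.0, /*b11*/ 3.0, /*c11*/ 1.0 });  // c = 1 + 2*3 = 7
    PEData step2 = pe_step({ /*a12*/ 4.0, /*b21*/ 5.0, step1.c });      // c = 7 + 4*5 = 27 -> d11
    std::cout << step2.c << "\n";
    return 0;
}
```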

"Many command streams - many data streams" (MIMD - "Multiple Instruction - Multiple Data")

This category of computer architectures is the richest, if we bear in mind the number of examples of its successful implementation. It includes symmetric parallel computing systems, workstations with multiple processors, clusters of workstations, etc.

The enormous performance of parallel computers and supercomputers is more than offset by the difficulties of using them. Let's start with the simplest things. You have a program and access to, say, a 256-processor computer. What are you expecting? Yes, it is clear: you quite legitimately expect that the program will execute 256 times faster than on a single processor. But this is precisely what most likely will not happen.

Parallel computing is a method of organizing computer computing in which programs are developed as a set of interacting computing processes running simultaneously.

There are different ways to implement parallel computing: each computing process can be implemented as an operating system process, or computing processes can be a collection of threads of execution within a single process. A thread (or more correctly a thread of execution) is the smallest unit of processing whose execution can be assigned by the operating system kernel. Multiple threads of execution can exist within the same process and share resources such as memory, whereas processes do not share these resources. Parallel programs can be physically executed either sequentially on a single processor - alternating in turn the steps of each computational process, or in parallel - allocating one or more processors (located nearby or distributed in a computer network) to each computational process.

The main challenge in designing parallel programs is ensuring the correct sequence of interactions between different computing processes, as well as the sharing of resources such as RAM or peripheral devices.

In some parallel programming systems, data transfer between components is hidden from the programmer, while in others it must be specified explicitly. Explicit interactions can be divided into two types:

1. Interaction through shared memory (for example, in Java or C#). This kind of parallel programming usually requires some form of locking (mutual exclusion) to coordinate the threads with one another.

2. Interaction through message passing. Messages can be exchanged asynchronously or by the rendezvous method, in which the sender is blocked until its message is delivered. Asynchronous message passing can be reliable (with delivery guaranteed) or unreliable. Message-passing parallel systems are often easier to understand than shared-memory systems and are commonly considered a superior method of parallel programming. Message passing can be implemented efficiently on symmetric multiprocessors, with or without shared coherent memory.

There are quite a few different parallel programming technologies. They differ not so much in programming languages as in architectural approaches to building parallel systems. For example, some technologies build parallel solutions from several computers (of the same or of different types), while others assume work on a single machine with several processor cores. The main software tools for creating parallel programs at present are:

1. OpenMP, used in shared-memory parallel systems (for example, modern computers with multi-core processors);

2. MPI (Message Passing Interface) is a standard for message transfer systems between parallel executing processes, used in the development of programs for supercomputers;

3. POSIX Threads is a standard for implementing threads of execution;

4. The Windows operating system has built-in support for multi-threaded applications for C++ at the API level;

5. PVM (Parallel Virtual Machine) allows you to combine heterogeneous computers connected by a network into a common computing resource.

Systems based on several computers are classified as distributed computing systems. Such solutions have been in use for quite some time. The most prominent example of distributed computing technology is MPI (Message Passing Interface). MPI is the most widespread standard interface for data exchange in parallel programming, and implementations exist for a huge number of computer platforms. MPI provides the programmer with a unified mechanism for interaction between the branches of a parallel application, regardless of the machine architecture (uniprocessor or multiprocessor, with shared or distributed memory) and of the relative location of the branches (on the same processor or on different ones).

Since MPI is intended primarily for systems with distributed memory, using it to organize a parallel process in a system with shared memory is rather cumbersome and often impractical. However, nothing prevents you from building MPI solutions for a single machine.

Parallel programming systems for working on a single machine, however, began to develop relatively recently. These are not fundamentally new ideas, of course, but it was with the arrival of multi-core systems on the personal computer and mobile device market that technologies such as OpenMP received significant development.

It is very important that a parallel programming technology support the ability to make a program parallel gradually. Ideally, a parallel program would be written parallel from the start, perhaps in some functional language in which the question of parallelization does not arise at all. In practice, however, an existing sequential program usually has to be parallelized step by step in order to increase performance. In this case OpenMP is a very good choice. It lets you pick the places in the application that most need parallelization and make them parallel first. The development of the parallel version can be interrupted, intermediate versions of the program can be released, and the work can be resumed as needed. This is one reason why OpenMP in particular has become quite popular.

OpenMP (Open Multi-Processing) is a set of compiler directives, library procedures, and environment variables that are designed for programming multithreaded applications on shared memory multiprocessor systems.

The OpenMP specification is being developed by several major hardware and software manufacturers, whose work is regulated by a non-profit organization called the OpenMP Architecture Review Board (ARB).

The first version appeared in 1997, intended for the Fortran language. The C/C++ version was developed in 1998. In 2008, OpenMP 3.0 was released. The OpenMP interface has become one of the most popular parallel programming technologies. OpenMP is successfully used both in programming supercomputer systems with a large number of processors, and in desktop user systems or, for example, in the Xbox 360.

OpenMP implements parallel computing using multithreading, in which a “master” thread creates a set of slave threads and the task is distributed among them. The threads are assumed to be executed in parallel on a machine with multiple processors (the number of processors does not have to be greater than or equal to the number of threads).

Tasks performed by threads in parallel, as well as the data required to perform these tasks, are described using special preprocessor directives of the corresponding language - pragmas. For example, a section of Fortran code that must be executed by several threads, each of which has its own copy of the variable N, is preceded by the following directive: !$OMP PARALLEL PRIVATE(N)
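In C/C++ the same idea is expressed with a pragma. A minimal sketch, where the variable name n and the printed output are purely illustrative:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n;

    // Each thread in the team gets its own (uninitialized) copy of n
    #pragma omp parallel private(n)
    {
        n = omp_get_thread_num();   // the private copy holds this thread's number
        printf("thread %d\n", n);
    }
    return 0;
}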

The number of created threads can be regulated both by the program itself by calling library procedures, and externally, using environment variables.

The key elements of OpenMP are:

1. constructs for creating threads (parallel directive);

2. constructs for distributing work between threads (DO/for and section directives);

3. constructs for managing work with data (shared and private expressions for defining the memory class of variables);

4. constructs for synchronizing threads (critical, atomic and barrier directives);

5. runtime support library procedures (for example, omp_get_thread_num);

6. environment variables (for example, OMP_NUM_THREADS).

OpenMP uses the fork-join model of parallel execution. An OpenMP program begins as a single thread of execution, called the initial thread. When a thread encounters a parallel construct, it creates a new team of threads consisting of itself and some number of additional threads, and becomes the master of the new team. All members of the new team (including the master) execute the code inside the parallel construct. There is an implicit barrier at the end of the parallel construct, after which only the master thread continues to execute user code. Parallel regions may be nested inside one another, in which case each thread of the original region becomes the master thread of its own team. Nested regions may in turn contain regions at a deeper level of nesting.
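A small sketch of this fork-join behaviour with one nested region; the team sizes are arbitrary, and whether the nested region actually gets extra threads depends on the implementation and on the nested-parallelism settings:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("initial thread only\n");          // sequential part

    #pragma omp parallel num_threads(2)       // fork: a team of 2 threads
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)   // nested region: each thread becomes a master
        {
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }                                         // implicit barrier, then join

    printf("initial thread again\n");         // only the master continues
    return 0;
}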

The number of threads in a team running in parallel can be controlled in several ways. One is the OMP_NUM_THREADS environment variable. Another is a call to the omp_set_num_threads() procedure. A third is the num_threads clause used together with the parallel directive.
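A short sketch combining these ways of setting the team size; the values 4 and 2 are arbitrary, and the num_threads clause takes precedence for its own region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // 1. Environment variable, set before the program starts:
    //    OMP_NUM_THREADS=4
    // 2. Library call:
    omp_set_num_threads(4);

    // 3. num_threads clause on the parallel directive (overrides the above for this region):
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}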

In the following program, two arrays (a and b) are added element by element by ten threads.

#include <stdio.h>
#include <omp.h>

#define N 100

int main(int argc, char *argv[])
{
    int i;
    float a[N], b[N], c[N];

    omp_set_dynamic(0);      // prevent the OpenMP runtime from changing the number of threads during execution
    omp_set_num_threads(10); // set the number of threads to 10

    // initialize the arrays
    for (i = 0; i < N; i++) {
        a[i] = i * 1.0f;
        b[i] = i * 2.0f;
    }

    // calculate the sum of the arrays, distributing the iterations among the threads
    #pragma omp parallel for shared(a, b, c) private(i)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("%f\n", c[N - 1]);
    return 0;
}

This program can be compiled with gcc 4.4 and later using the -fopenmp flag. Obviously, if you remove the inclusion of the omp.h header file and the calls to the OpenMP configuration functions, the program can be compiled by any C compiler as an ordinary sequential program.

OpenMP is supported by many modern compilers:

1. Sun Studio compilers support the official specification - OpenMP 2.5 - with improved performance under Solaris OS; Linux support is planned for the next release.

2. Visual C++ 2005 and higher supports OpenMP in Professional and Team System editions.

3. GCC 4.2 supports OpenMP, and some distributions (such as Fedora Core 5 gcc) have included support in their versions of GCC 4.1.

4. Intel C++ Compiler, including a version of Intel Cluster OpenMP for programming in distributed memory systems.

Message Passing Interface (MPI) is a programming interface (API) for passing messages between processes performing a common task. It was developed by William Gropp, Ewing Lusk and others.

MPI is the most common data exchange interface standard for parallel programming, and there are implementations for a large number of computer platforms. Used in developing programs for clusters and supercomputers. The main means of communication between processes in MPI is passing messages to each other. MPI standardization is carried out by the MPI Forum. The MPI standard describes a message passing interface that must be supported both on the platform and in user applications. There are currently a large number of free and commercial implementations of MPI. There are implementations for Fortran 77/90, C and C++ languages.

MPI is primarily aimed at systems with distributed memory, where the cost of data transfer is high, while OpenMP targets systems with shared memory (multi-core processors with a shared cache). The two technologies can be used together in order to make optimal use of multi-core systems in a cluster.
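A minimal sketch of such hybrid use, with MPI between processes and OpenMP threads inside each process; the requested thread-support level and the absence of error handling are simplifying assumptions:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    // Ask MPI for a thread-support level compatible with OpenMP regions
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Inside each MPI process, the OpenMP threads share that process's memory
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}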

The first version of MPI was developed in 1993-1994, and MPI 1 was released in 1994.

Most modern MPI implementations support version 1.1. The MPI standard version 2.0 is supported by most modern implementations, although some features may not be implemented in full. The standard covers, in particular:

sending and receiving messages between separate processes;

collective interactions of processes;

interactions in process groups;

implementation of process topologies;

dynamic process generation and process management;

one-way communications (Get/Put);

parallel input and output;

extended collective operations (processes can perform collective operations not only within one communicator, but also within several communicators).

Version MPI 2.1 was released in early September 2008.

The basic mechanism of communication between MPI processes is the transmission and reception of messages. The message contains transmitted data and information that allows the receiving party to selectively receive them:

1. sender - rank (number in the group) of the message sender;

2. recipient - recipient's rank;

3. tag - can be used to distinguish different types of messages;

4. communicator - identifier of the process group.

Reception and transmission operations can be blocking or non-blocking. For non-blocking operations, functions are defined for checking readiness and waiting for the operation to complete.
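A short sketch of a blocking send paired with a non-blocking receive, also showing the sender rank, receiver rank, tag and communicator listed above; it assumes at least two processes, and the tag value 7 is arbitrary:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        // Blocking send: destination rank 1, tag 7, communicator MPI_COMM_WORLD
        MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Request req;
        // Non-blocking receive: returns immediately ...
        MPI_Irecv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &req);
        // ... so the operation has to be completed explicitly
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}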

Another communication mechanism is remote memory access (RMA), which allows the memory region of a remote process to be read and modified. A local process can transfer a memory area of a remote process (within a window specified by the processes) into its own memory and back, and can also combine the data transferred to the remote process with data already in its memory (for example, by summation). All remote memory access operations are non-blocking; however, blocking synchronization functions must be called before and after their execution.
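A minimal sketch of one-sided access: process 0 writes directly into a window exposed by process 1, with fence calls providing the synchronization mentioned above; it assumes at least two processes, and the window layout is illustrative:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Every process exposes one int as a window
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              // synchronization before the access epoch
    if (rank == 0) {
        int value = 123;
        // Put the value into the window of process 1 at displacement 0
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);              // synchronization after the access epoch

    if (rank == 1)
        printf("rank 1 window now holds %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}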

Below is an example program for calculating π in C using MPI:

// Include the necessary headers
#include <stdio.h>
#include <math.h>

// Include the MPI header file
#include "mpi.h"

// Function for intermediate calculations (the integrand 4/(1+x^2))
double f(double a)
{
    return (4.0 / (1.0 + a*a));
}

// Main function of the program
int main(int argc, char **argv)
{
    // Declaring variables
    int n, myid, numprocs, i, namelen;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    // Initialize the MPI subsystem
    MPI_Init(&argc, &argv);

    // Get the size of the communicator MPI_COMM_WORLD
    // (total number of processes within the task)
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    // Get the rank of the current process within
    // the communicator MPI_COMM_WORLD
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    MPI_Get_processor_name(processor_name, &namelen);

    // Display the process number in the shared pool
    fprintf(stdout, "Process %d of %d is on %s\n", myid, numprocs, processor_name);

    n = 0;
    if (myid == 0) {
        // The root process asks for the number of intervals
        startwtime = MPI_Wtime();
        fprintf(stdout, "Enter the number of intervals: (0 quits) ");
        if (scanf("%d", &n) != 1) {
            fprintf(stdout, "No number entered; quitting\n");
            n = 0;
        }
    }

    // Broadcast the number of intervals to all processes
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (n > 0) {
        h = 1.0 / (double)n;
        sum = 0.0;

        // Calculate the points assigned to this process
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += f(x);
        }
        mypi = h * sum;

        // Collect the partial results from all processes and add them up
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        // If this is the root process, output the result
        if (myid == 0) {
            printf("PI is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
            endwtime = MPI_Wtime();
            printf("wall clock time = %f\n", endwtime - startwtime);
        }
    }

    // Release the MPI subsystem
    MPI_Finalize();
    return 0;
}

The most common MPI implementations today are:

MPICH is the most widespread free implementation; it runs on UNIX systems and Windows NT.

LAM/MPI (http://www.lam-mpi.org) is another free implementation of MPI. It supports heterogeneous configurations and the Globus package, and satisfies IMPI (Interoperable MPI). Various communication systems (including Myrinet) are supported.

WMPI - MPI implementation for Windows

MPI/PRO for Windows NT - commercial implementation for Windows NT

Intel MPI - commercial implementation for Windows/Linux

Microsoft MPI is included in the Compute Cluster Pack SDK. Based on MPICH2, but includes additional job management capabilities. MPI-2 specification is supported.

HP-MPI - commercial implementation from HP

SGI MPT - paid MPI library from SGI

Mvapich - free MPI implementation for Infiniband

Open MPI - free implementation of MPI, successor to LAM/MPI

Oracle HPC ClusterTools - free implementation for Solaris SPARC/x86 and Linux based on Open MPI

MPJ - MPI for Java

POSIX Threads is the POSIX standard for threads of execution, defining an API for creating and managing them.

Libraries implementing this standard (and the functions of this standard) are usually called Pthreads (the functions are prefixed with "pthread_"). Although the best-known variants are for Unix-like operating systems such as Linux or Solaris, there is also an implementation for Microsoft Windows (Pthreads-w32).

Pthreads defines a set of types and functions in the C programming language. The header file is pthread.h.

Data types:

1. pthread_t – thread descriptor;

2. pthread_attr_t – list of thread attributes.

Thread control functions:

1. pthread_create() – creating a thread;

2. pthread_exit() – termination of the thread (must be called by the thread function upon termination);

3. pthread_cancel() – cancel the thread;

4. pthread_join() – block the execution of a thread until another thread specified in the function call terminates;

5. pthread_detach() – release the resources occupied by the thread (if the thread is running, the resources will be released after its completion);

6. pthread_attr_init() – initialize the thread attribute structure;

7. pthread_attr_setdetachstate() – indicate to the system that after the thread terminates, it can automatically release the resources occupied by the thread;

8. pthread_attr_destroy() – free memory from the thread attribute structure (destroy the descriptor).

Thread synchronization functions:

1. pthread_mutex_init(), pthread_mutex_destroy(), pthread_mutex_lock(), pthread_mutex_trylock(), pthread_mutex_unlock() (a usage sketch follows after this list);

2. pthread_cond_init(), pthread_cond_signal(), pthread_cond_wait().
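A hedged sketch of how the mutex functions can protect a shared counter; the counter and the number of increments are purely illustrative:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;
static long counter = 0;

static void *increment(void *arg)
{
    int i;
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   // only one thread may update the counter at a time
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_mutex_init(&lock, NULL);
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_mutex_destroy(&lock);

    printf("counter = %ld\n", counter);  // 200000 when the accesses are protected
    return 0;
}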

Example of using threads in C:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void wait_thread(void)
{
    time_t start_time = time(NULL);

    while (time(NULL) == start_time) {
        /* do nothing except chew CPU slices for up to one second */
    }
}

static void *thread_func(void *vptr_args)
{
    int i;

    for (i = 0; i < 20; i++) {
        fputs("b\n", stderr);
        wait_thread();
    }
    return NULL;
}

int main(void)
{
    int i;
    pthread_t thread;

    // Create a second thread that prints "b"
    if (pthread_create(&thread, NULL, thread_func, NULL) != 0) {
        return EXIT_FAILURE;
    }

    // The main thread prints "a"
    for (i = 0; i < 20; i++) {
        puts("a");
        wait_thread();
    }

    // Wait for the child thread to finish
    if (pthread_join(thread, NULL) != 0) {
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

The program uses two threads that print messages to the console: one prints "a", the other "b". The output is interleaved as a result of switching between the threads or of genuinely concurrent execution on multiprocessor systems.

The C program creates one new thread to print "b" and the main thread prints "a". The main thread (after printing "aaaaa….") waits for the child thread to complete.

Control questions

  1. What is a parallel program?
  2. What is the difference between a process and a thread of execution?
  3. Can a program create 5 threads when running on a quad core processor?
  4. What are the features of shared memory parallel programs?
  5. What software tools exist for developing parallel programs?
  6. Why has OpenMP become widespread when creating programs for PCs, and not, for example, MPI?