Cluster systems for high-availability applications. Efficient cluster solutions

(Incidentally, an inexpensive and capable cluster can be assembled from Xbox 360 or PS3 consoles: their processors are roughly comparable to Power, and for a million you can buy more than one console.)

With this in mind, let us note some interesting price options for building a high-performance system. It must, of course, be multiprocessor. Intel uses Xeon processors for such tasks, while AMD uses Opterons.

If there is a lot of money


Separately, we note the extremely expensive but high-performing line of processors for the Intel Xeon LGA1567 socket.
The top processor in this series is the E7-8870 with ten 2.4 GHz cores; its price is $4616. For such CPUs, HP and Supermicro produce eight-processor (!) server chassis. Eight 10-core Xeon E7-8870 2.4 GHz processors with Hyper-Threading support 8 × 10 × 2 = 160 threads, which appear in the Windows Task Manager as one hundred and sixty CPU load graphs - a 10×16 matrix.

In order for eight processors to fit in the chassis, they are placed not directly on the motherboard but on separate boards that plug into the motherboard. The photo shows four boards with processors installed in the motherboard (two per board); this is Supermicro's solution. In the HP solution, each processor has its own board. The HP solution costs two to three million, depending on the processors, memory and other configuration. The Supermicro chassis costs $10,000, which is more attractive. In addition, Supermicro allows four coprocessor expansion cards to be installed in PCI-Express x16 slots (and there is still room left for an Infiniband adapter to build a cluster out of these machines), while HP allows only two. Thus, for building a supercomputer, the eight-processor Supermicro platform is more attractive. The following photo from the exhibition shows a supercomputer assembled with four GPU boards.


However, all of this is very expensive.
Which is cheaper?
There is, however, the prospect of assembling a supercomputer from more affordable processors: AMD Opteron (socket G34) and Intel Xeon (LGA2011 and LGA1366).

To choose a specific model, I compiled a table in which I calculated the price/(number of cores × frequency) indicator for each processor. Processors with a frequency below 2 GHz were excluded from the calculation, as were Intel processors with a bus below 6.4 GT/s.

Model | Cores | Frequency, GHz | Price, $ | Price/core, $ | Price/core/GHz, $

AMD
6386 SE | 16 | 2.8 | 1392 | 87 | 31
6380 | 16 | 2.5 | 1088 | 68 | 27
6378 | 16 | 2.4 | 867 | 54 | 23
6376 | 16 | 2.3 | 703 | 44 | 19
6348 | 12 | 2.8 | 575 | 48 | 17
6344 | 12 | 2.6 | 415 | 35 | 13
6328 | 8 | 3.2 | 575 | 72 | 22
6320 | 8 | 2.8 | 293 | 37 | 13

INTEL
E5-2690 | 8 | 2.9 | 2057 | 257 | 89
E5-2680 | 8 | 2.7 | 1723 | 215 | 80
E5-2670 | 8 | 2.6 | 1552 | 194 | 75
E5-2665 | 8 | 2.4 | 1440 | 180 | 75
E5-2660 | 8 | 2.2 | 1329 | 166 | 76
E5-2650 | 8 | 2.0 | 1107 | 138 | 69
E5-2687W | 8 | 3.1 | 1885 | 236 | 76
E5-4650L | 8 | 2.6 | 3616 | 452 | 174
E5-4650 | 8 | 2.7 | 3616 | 452 | 167
E5-4640 | 8 | 2.4 | 2725 | 341 | 142
E5-4617 | 6 | 2.9 | 1611 | 269 | 93
E5-4610 | 6 | 2.4 | 1219 | 203 | 85
E5-2640 | 6 | 2.5 | 885 | 148 | 59
E5-2630 | 6 | 2.3 | 612 | 102 | 44
E5-2667 | 6 | 2.9 | 1552 | 259 | 89
X5690 | 6 | 3.46 | 1663 | 277 | 80
X5680 | 6 | 3.33 | 1663 | 277 | 83
X5675 | 6 | 3.06 | 1440 | 240 | 78
X5670 | 6 | 2.93 | 1440 | 240 | 82
X5660 | 6 | 2.8 | 1219 | 203 | 73
X5650 | 6 | 2.66 | 996 | 166 | 62
E5-4607 | 6 | 2.2 | 885 | 148 | 67
X5687 | 4 | 3.6 | 1663 | 416 | 115
X5677 | 4 | 3.46 | 1663 | 416 | 120
X5672 | 4 | 3.2 | 1440 | 360 | 113
X5667 | 4 | 3.06 | 1440 | 360 | 118
E5-2643 | 4 | 3.3 | 885 | 221 | 67

The models with the minimum ratio are highlighted in bold italics; the most powerful AMD processors, which in my opinion are closest in performance to the Xeons, are underlined.

Thus, my choice of processors for a supercomputer is Opteron 6386 SE, Opteron 6344, Xeon E5-2687W and Xeon E5-2630.
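To make the selection metric explicit, here is a minimal sketch in C that recomputes the price-per-core and price-per-core-per-GHz figures for the four chosen models; the prices and clock speeds are simply copied from the table above and are illustrative only.

```c
#include <stdio.h>

/* A few rows taken from the table above; values are illustrative. */
struct cpu { const char *model; int cores; double ghz; double price; };

int main(void) {
    struct cpu list[] = {
        { "Opteron 6386 SE", 16, 2.8, 1392 },
        { "Opteron 6344",    12, 2.6,  415 },
        { "Xeon E5-2687W",    8, 3.1, 1885 },
        { "Xeon E5-2630",     6, 2.3,  612 },
    };
    for (int i = 0; i < 4; i++) {
        double per_core     = list[i].price / list[i].cores;  /* $ per core         */
        double per_core_ghz = per_core / list[i].ghz;         /* $ per core per GHz */
        printf("%-16s %6.0f $/core %6.1f $/(core*GHz)\n",
               list[i].model, per_core, per_core_ghz);
    }
    return 0;
}
```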

Motherboards

PICMG
It is impossible to install more than four two-slot expansion cards on regular motherboards. There is another architecture - the use of backplanes, such as the BPG8032 PCI Express Backplane.


Such a board carries PCI Express expansion cards and one processor board, somewhat similar to those installed in the eight-processor Supermicro servers discussed above, except that these processor boards must comply with PICMG industry standards. Standards develop slowly, and such boards often do not support the latest processors. The most powerful processor board of this kind currently available is the Trenton BXT7059 SBC with two Xeon E5-2448L.

Such a system without a GPU will cost at least $5,000.

Ready-made TYAN platforms
For approximately the same amount you can purchase a ready-made TYAN FT72B7015 platform for building supercomputers. It can hold up to eight GPUs and two LGA1366 Xeons.
"Regular" server motherboards
For LGA2011
Supermicro X9QR7-TF - this motherboard accommodates four expansion cards and four processors.

Supermicro X9DRG-QF - this board is specially designed for assembling high-performance systems.

For Opteron
Supermicro H8QGL-6F - this board allows you to install four processors and three expansion cards.

Strengthening the platform with expansion cards

This market is almost completely dominated by Nvidia, which, in addition to gaming video cards, also produces computing cards. AMD has a smaller market share, and Intel entered this market relatively recently.

Distinctive features of such coprocessors are a large amount of on-board memory, fast double-precision calculations, and energy efficiency.

Model | FP32, Tflops | FP64, Tflops | Price, thousand $ | Memory, GB
Nvidia Tesla K20X | 3.95 | 1.31 | 5.5 | 6
AMD FirePro S10000 | 5.91 | 1.48 | 3.6 | 6
Intel Xeon Phi 5110P | - | 1 | 2.7 | 8
Nvidia GTX Titan | 4.5 | 1.3 | 1.1 | 6
Nvidia GTX 680 | 3 | 0.13 | 0.5 | 2
AMD HD 7970 GHz Edition | 4 | 1 | 0.5 | 3
AMD HD 7990 Devil 13 | 2x3.7 | 2x0.92 | 1.6 | 2x3

The top solution from Nvidia is the Tesla K20X, based on the Kepler architecture. These are the cards installed in the world's most powerful supercomputer, Titan. Recently, however, Nvidia released the GeForce Titan graphics card. Older models (for example, the GTX 680) had FP64 performance cut to 1/24 of FP32, but for Titan the manufacturer promises fairly high double-precision performance. Solutions from AMD are also good, but they are built on a different architecture, which can create difficulties when running calculations optimized for CUDA (an Nvidia technology).

The solution from Intel, the Xeon Phi 5110P, is interesting because all the cores of the coprocessor are based on the x86 architecture, so no special code optimization is required to run calculations. But my favorite among the coprocessors is the relatively inexpensive AMD HD 7970 GHz Edition: theoretically, this video card offers the maximum performance per unit of cost.
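As a rough sanity check of the "performance per cost" claim, the sketch below divides each card's double-precision rating by its price, both taken from the table above (the Xeon Phi's FP32 figure is missing there, so only FP64 is compared); the numbers are illustrative.

```c
#include <stdio.h>

/* FP64 Tflops and price (thousand $) from the table above. */
struct gpu { const char *model; double fp64_tflops; double price_k; };

int main(void) {
    struct gpu list[] = {
        { "Tesla K20X",          1.31, 5.5 },
        { "FirePro S10000",      1.48, 3.6 },
        { "Xeon Phi 5110P",      1.00, 2.7 },
        { "GTX Titan",           1.30, 1.1 },
        { "HD 7970 GHz Edition", 1.00, 0.5 },
    };
    for (int i = 0; i < 5; i++)
        printf("%-22s %.2f FP64 Tflops per $1000\n",
               list[i].model, list[i].fp64_tflops / list[i].price_k);
    return 0;
}
```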

Combining computers into a cluster

To increase system performance, several computers can be combined into a cluster, which will distribute the computing load between the computers included in the cluster.

Using regular Gigabit Ethernet as the network interface to connect the computers is too slow; Infiniband is most often used for this purpose. An Infiniband host adapter is inexpensive relative to the server: on eBay, for example, such adapters sell from $40, and an X4 DDR adapter (20 Gb/s) costs about $100 with delivery to Russia.

At the same time, switching equipment for Infiniband is quite expensive. And as mentioned above, a classic star as a computer network topology is not the best choice.

However, InfiniBand hosts can be connected to each other directly, without a switch. Then, for example, this option becomes quite interesting: a cluster of two computers connected via infiniband. Such a supercomputer can easily be assembled at home.

How many video cards do you need?

In the most powerful supercomputer of our time, Cray Titan, the ratio of processors to “video cards” is 1:1, that is, it has 18688 16-core processors and 18688 Tesla K20X.

In Tianhe-1A, a Chinese Xeon-based supercomputer, the ratio is different: two six-core processors per Nvidia M2050 card (a weaker card than the K20X).

We will take this ratio as the optimum for our builds (because it is cheaper), that is, 12-16 processor cores per GPU. In the table below, the practically feasible options are indicated in bold, and the most successful ones from my point of view are underlined.

GPUs | Cores (12-16 per GPU) | 6-core CPUs | 8-core CPUs | 12-core CPUs | 16-core CPUs
2 | 24-32 | 4-5 | 3-4 | 2-3 | 2
3 | 36-48 | 6-8 | 5-6 | 3-4 | 2-3
4 | 48-64 | 8-11 | 6-8 | 4-5 | 3-4

If a system with the chosen processor-to-GPU ratio can take additional computing devices on board, we will add them to increase the power of the build.

So how much does it cost?

The options presented below are supercomputer chassis without RAM, hard drives and software. All of them use the AMD HD 7970 GHz Edition graphics adapter; it can be replaced with another one as the task requires (for example, a Xeon Phi). Where the system allows, one AMD HD 7970 GHz Edition is replaced with a three-slot AMD HD 7990 Devil 13.
Option 1 on Supermicro H8QGL-6F motherboard


Component | Qty | Price, $ | Sum, $
Motherboard Supermicro H8QGL-6F | 1 | 1200 | 1200
CPU AMD Opteron 6344 | 4 | 500 | 2000
CPU cooler Thermaltake CLS0017 | 4 | 40 | 160
Case SC748TQ-R1400B (1400 W PSU) | 1 | 1000 | 1000
Graphics accelerator AMD HD 7970 GHz Edition | 3 | 500 | 1500
Total | | | 5860

Theoretically, the performance will be about 12 Tflops.
Option 2 on the TYAN S8232 motherboard, cluster


This board does not support the Opteron 63xx series, so 62xx is used. In this option, two computers are connected into a cluster via Infiniband X4 DDR using two cables. Theoretically, the connection speed will be limited by the speed of PCIe x8, that is, 32 Gb/s. Two power supplies are used; instructions on how to make them work together can be found on the Internet.
Component | Qty | Price, $ | Sum, $
Motherboard TYAN S8232 | 1 | 790 | 790
CPU AMD Opteron 6282 SE | 2 | 1000 | 2000
CPU cooler Noctua NH-U12DO A3 | 2 | 60 | 120
Case Antec Twelve Hundred Black | 1 | 200 | 200
Power supply FSP AURUM PRO 1200W | 2 | 200 | 400
Graphics accelerator AMD HD 7970 GHz Edition | 2 | 500 | 1000
Graphics accelerator AX7990 6GBD5-A2DHJ | 1 | 1000 | 1000
Infiniband adapter X4 DDR | 1 | 140 | 140
Infiniband cable X4 DDR | 1 | 30 | 30
Total (per node) | | | 5680

For a cluster of this configuration, two such nodes are needed, at a total cost of $11,360. Its power consumption at full load will be about 3000 W. Theoretically, performance will be up to 31 Tflops.
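A note on where these "theoretical" figures presumably come from: if they are read as the sum of the single-precision GPU ratings from the table above (my assumption, not stated explicitly in the text), the arithmetic works out as in this small sketch.

```c
#include <stdio.h>

/* FP32 ratings (Tflops) from the GPU table above. Treating the 12 and 31
   Tflops figures as sums of the cards' single-precision ratings is an
   assumption, not something stated explicitly in the text. */
int main(void) {
    double hd7970 = 4.0;        /* AMD HD 7970 GHz Edition            */
    double hd7990 = 2 * 3.7;    /* AMD HD 7990 Devil 13 (dual GPU)    */

    double option1 = 3 * hd7970;                 /* 3 cards in one chassis          */
    double option2 = 2 * (2 * hd7970 + hd7990);  /* 2 nodes, each 2x7970 + 1x7990   */

    printf("Option 1: ~%.0f Tflops FP32\n", option1);  /* ~12 */
    printf("Option 2: ~%.0f Tflops FP32\n", option2);  /* ~31 */
    return 0;
}
```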

Cluster technologies have long been available to ordinary organizations. This became possible thanks to the use of low-cost Intel-based servers, standard communication tools and widely used operating systems in entry-level clusters. Cluster solutions on Microsoft platforms are aimed primarily at combating operator errors and hardware and software failures, and they are an effective means of solving these problems.

With the development of computer technology, the degree of its integration into the business processes of enterprises and the activities of organizations has increased sharply. Maximizing the time during which computing resources are available has therefore become an increasingly pressing problem. Server reliability is becoming one of the key factors in the successful operation of companies with a developed network infrastructure. This is especially important for large enterprises in which dedicated systems support production processes in real time, for banks with extensive branch networks, and for call centers of telephone operators that use decision-support systems. All such enterprises need servers that operate continuously, providing information 24 hours a day without interruption.

The cost of equipment downtime for an enterprise is constantly growing, since it includes the cost of lost information, lost profits, the cost of technical support and recovery, customer dissatisfaction, and so on. How does one create a reliable system, and how much does solving this problem cost? There are a number of methods for calculating the cost of a minute of downtime for a given enterprise; based on this calculation, the most appropriate solution with the best price-to-functionality ratio can then be selected.

There are many options and tools for building a reliable computing system. RAID disk arrays and redundant power supplies, for example, "insure" against the failure of individual system components and allow requests for information to continue being processed during such failures. Uninterruptible power supplies maintain system functionality in the event of a power failure. Multiprocessor motherboards keep the server running if one processor fails. However, none of these options will save you if the entire computing system fails. This is where clustering comes to the rescue.

Historically, the first step towards the creation of clusters is considered to be the widely used "hot" standby systems. One or two such systems, being part of a network of several servers, do not perform any useful work but are ready to start functioning as soon as one of the main systems fails. Thus, the servers duplicate each other in case one of them fails or breaks down. It would be better, however, if, when several computers are combined, they not only duplicated each other but also performed other useful work, distributing the load among themselves. In many cases, clusters are the best choice for such systems.

Initially, clusters were used only for powerful computing and supporting distributed databases, especially where increased reliability was required. Later they began to be used for the Web service. However, the decline in prices for clusters has led to the fact that such solutions are increasingly being used for other needs. Cluster technologies have finally become available to ordinary organizations - in particular, thanks to the use of inexpensive Intel servers, standard communications tools and common operating systems (OS) in entry-level clusters.

Cluster solutions on Microsoft platforms are focused primarily on combating hardware and software failures. The failure statistics for such systems are well known: only 22% of incidents are directly caused by failures of equipment, the OS, server power, and the like. To eliminate these factors, various technologies are used to increase server fault tolerance (redundant and hot-swappable disks, power supplies, boards in PCI slots, etc.). However, the remaining 78% of incidents are usually caused by application failures and operator errors, and cluster solutions are an effective means of combating this problem.

Clusters allow you to build a unique architecture with sufficient performance and resistance to hardware and software failures. Such a system is easily scalable and upgradeable using universal means, is based on standard components, and comes at a reasonable price, significantly lower than the price of a unique fault-tolerant computer or a massively parallel system.

The term “cluster” implies fault tolerance, scalability, and manageability. You can also give a classic definition of a cluster: “a cluster is a parallel or distributed system consisting of several interconnected computers and at the same time used as a single, unified computer resource.” A cluster is a combination of several computers that, at a certain level of abstraction, are managed and used as a single whole. Each cluster node (a node is usually a computer that is part of a cluster) has its own copy of the OS. Let us recall that systems with SMP and NUMA architecture, having one shared copy of the OS, cannot be considered clusters. A cluster node can be either a single-processor or a multiprocessor computer, and within one cluster computers can have different configurations (different numbers of processors, different amounts of RAM and disks). Cluster nodes are connected to each other either using regular network connections (Ethernet, FDDI, Fiber Channel), or via non-standard special technologies. Such intra-cluster or inter-node connections allow nodes to communicate with each other regardless of the external network environment. Through intracluster channels, nodes not only exchange information, but also monitor each other’s performance.

There is a broader definition of a cluster: “a cluster is a system that operates as a single whole, guarantees high reliability, has centralized management of all resources and a common file system, and, in addition, provides configuration flexibility and ease of expanding resources.”

As already noted, the main purpose of a cluster is to provide a high level of availability (High Availability, HA) compared to a disparate set of computers or servers, as well as a high degree of scalability and ease of administration. Increased availability ensures that user-critical applications run for as long as possible. Critical applications include all applications on which the company's ability to make a profit, provide a service or perform other vital functions directly depends. Typically, a cluster ensures that if a server or application stops functioning normally, another server in the cluster will continue to perform its tasks and will take on the role of the faulty server (or run a copy of the faulty application) in order to minimize user downtime caused by the malfunction.

Availability is usually measured as the percentage of time the system spends in an operational state out of the total operating time. Different applications require different levels of availability from the computing system. Availability can be increased by various methods; the choice depends on the cost of the system and the cost of downtime to the enterprise. There are quite cheap solutions that focus primarily on reducing downtime after a fault has occurred. More expensive ones keep the system functioning and serving users even when one or more of its components fail. As availability increases, the price of the system grows nonlinearly, and so does the cost of supporting it. Relatively inexpensive systems do not offer a high level of fault tolerance - no more than 99%, which means the enterprise's information infrastructure will be inoperable for roughly four days a year, and that is before planned downtime for maintenance or reconfiguration is added.

A high degree of availability implies a solution that can continue to function, or restore operation, after most errors without operator intervention. The most advanced (and naturally the most expensive) fault-tolerant solutions can provide 99.999% availability (i.e. no more than about 5 minutes of downtime per year).
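A quick sketch of the arithmetic behind these availability figures, converting an availability percentage into downtime per year (a 365-day year is assumed):

```c
#include <stdio.h>

int main(void) {
    double levels[] = { 99.0, 99.9, 99.99, 99.999 };   /* availability, % */
    double minutes_per_year = 365.0 * 24.0 * 60.0;

    for (int i = 0; i < 4; i++) {
        double down = (100.0 - levels[i]) / 100.0 * minutes_per_year;
        printf("%.3f%% availability -> %.1f hours (%.0f minutes) of downtime per year\n",
               levels[i], down / 60.0, down);
    }
    return 0;
}
```

For 99% this gives about 87.6 hours (roughly 3.7 days) per year, and for 99.999% about 5.3 minutes, matching the figures quoted above.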

Between single server systems with mirrored disk subsystems (or RAID disk arrays) and fault-tolerant systems, cluster solutions provide the “golden mean”. In terms of availability, they are close to fault-tolerant systems at a disproportionately lower cost. These solutions are ideal for situations where only very minor unplanned downtime can be tolerated.

In a cluster system, failure recovery is controlled by special software and hardware. Cluster software can automatically detect a single hardware or software failure, isolate it and restore the system. Specially designed routines select the fastest recovery method and bring services back up in the shortest possible time. Using the built-in development tools and programming interface, you can create custom programs that identify, isolate and resolve failures occurring in user-developed applications.

An important advantage of clustering is scalability. A cluster allows you to flexibly increase the computing power of the system by adding new nodes to it without interrupting the work of users. Modern cluster solutions provide for automatic load distribution between cluster nodes, as a result of which one application can run on several servers and use their computing resources. Typical applications running on clusters are:

  • databases;
  • enterprise resource planning (ERP) systems;
  • message processing tools and mail systems;
  • tools for processing transactions via the Web, and Web servers;
  • customer relationship management (CRM) systems;
  • file and print sharing services.

So, a cluster unites several servers connected to each other by a special communication channel, often called the system network. Cluster nodes monitor each other's health and exchange specific information, for example about the cluster configuration; they also transfer data between shared storage devices and coordinate their use.

Health monitoring is carried out using a special heartbeat ("pulse") signal. Cluster nodes transmit this signal to each other to confirm that they are functioning normally. In small clusters heartbeat signals travel over the same channels as the data; in large cluster systems dedicated lines are allocated for them. The cluster software must receive a heartbeat signal from each server within a certain time interval; if it does not arrive, the server is considered down and the cluster is automatically reconfigured. Conflicts between servers are also resolved automatically, for example when, at cluster startup, a "leading" server or group of servers must be chosen to form the new cluster.
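As an illustration of the heartbeat idea only (not the code of any particular cluster product), here is a minimal sketch of a monitor that declares a peer node down if no heartbeat datagram arrives within a timeout; the port number and timeout are arbitrary, and a peer would simply send a small datagram to this port every couple of seconds. Error handling is omitted for brevity.

```c
/* Minimal heartbeat monitor sketch: listens for UDP "heartbeat" datagrams
   from a peer node and reports the peer as failed if none arrive within
   TIMEOUT_SEC. Port and timeout are arbitrary illustrative values. */
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define HB_PORT     9999
#define TIMEOUT_SEC 5

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(HB_PORT);
    bind(sock, (struct sockaddr *)&addr, sizeof addr);

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        struct timeval tv = { TIMEOUT_SEC, 0 };

        if (select(sock + 1, &rfds, NULL, NULL, &tv) > 0) {
            char buf[64];
            recvfrom(sock, buf, sizeof buf, 0, NULL, NULL);
            printf("heartbeat received, peer is alive\n");
        } else {
            /* No heartbeat within the interval: in a real cluster this is
               where reconfiguration / failover would be initiated. */
            printf("no heartbeat for %d s, peer considered down\n", TIMEOUT_SEC);
        }
    }
}
```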

To organize a cluster communication channel, conventional network technologies (Ethernet, Token Ring, FDDI, ATM), shared input/output buses (SCSI or PCI), the high-speed Fibre Channel interface, or the specialized technologies CI (Computer Interconnect), DSSI (Digital Storage System Interconnect) and Memory Channel can be used.

The DSSI interface is designed for access to drives and for interaction between systems. It is similar to the multi-host SCSI-2 protocol, but has greater performance and the ability to organize communication between computers. DSSI clusters support system reliability enhancements, resource sharing, distributed file system, and transparency. From a management and security perspective, a DSSI cluster appears to be a single domain.

The CI interface is a dual serial bus with an exchange rate of up to 70 Mbit/s. It is connected to the computer's I/O system through an intelligent controller capable of supporting both dual- and single-bus operation, depending on the access reliability requirements of the particular computer. All communication lines of the CI interface are connected at one end to the CI integrator, a special device that monitors connections to nodes and cluster configurations.

Memory Channel technology allows you to create a highly efficient communication environment that provides high-speed (up to 100 MB/s) message exchange between servers in a cluster.

The requirements for the speed of the communication channel depend on the degree of integration of the cluster nodes and the nature of the applications. If, for example, applications in different nodes do not interact with each other and do not simultaneously access disk drives, then the nodes exchange only control messages confirming their health, along with information about changes in the cluster configuration (adding new nodes, redistributing disk volumes, etc.). This type of exchange does not require significant interconnect resources and can easily be handled by a simple 10 Mbit Ethernet channel.

There are a huge number of real cluster configurations: some combine several clusters, sometimes together with additional devices. Each option meets the requirements of different applications and naturally differs in both cost and implementation complexity. Cluster topologies such as star, ring, N-to-N, etc. are widely used. But no matter how complex and exotic a cluster is, it can be classified by two criteria:

the organization of the RAM of the cluster nodes;

the degree of availability of input/output devices, primarily disks.

As for RAM, there are two options: either all cluster nodes have independent RAM, or they have a common shared memory. The availability of cluster I/O devices is primarily determined by the ability to use external memory with shared disks, which implies that any node has transparent access to the file system of the shared disk space. In addition to the shared disk subsystem, cluster nodes may have local disks, but in this case they are used mainly for loading the OS on the node. Such a cluster must have a special subsystem called a Distributed Lock Manager (DLM) to eliminate conflicts when simultaneously writing to files from different cluster nodes. In systems without DLM, applications cannot work on the same data in parallel, and shared disk memory, if available, is assigned to one node at a time.

In clusters that do not support concurrent access to external memory, all nodes are completely autonomous servers. In the case of two nodes, shared memory on disks is accessed using a shared I/O bus (Figure 1). For each node, such a bus ends in a disk array. At any given time, only one node owns the shared file system. If one of the servers fails, control of the bus and shared disks passes to the other node.

Fig. 1. Building a two-node cluster.

For companies that have an integrated information system, where only part of the resources are used to run reliability-critical applications, the “active-standby” cluster construction scheme can be used (Fig. 2). In the simplest case, such a system includes an active server that runs the most important applications, and a backup machine that solves less critical tasks. If the active server fails, all its applications are automatically transferred to the standby one, where applications with the lowest priority cease to function. This configuration eliminates the slowdown of critical applications - users simply will not notice any changes (a special case of this scheme is the “passive-standby” configuration, in which the backup server does not bear any load and is in standby mode).

Fig. 2. Building an "active-standby" cluster.

There is also an "active-active" configuration, which implies that all servers in the cluster execute individual applications of the same high priority; the computing resources of the backup server are used in everyday work. The advantage of this approach is that the user has a highly available system at his disposal (the server is duplicated) and at the same time can use all the computing resources of the cluster. This reduces the total cost of the system per unit of computing power. When applications fail, they are transferred from the failed machine to the remaining ones, which, of course, affects overall performance. Active-active clusters can only exist as dedicated systems, on which low-priority tasks such as supporting office work cannot be run. Additionally, when building clusters with an active standby server, you can have fully duplicated servers with their own separate disks. This creates the need to constantly copy data from the primary server to the backup server, which ensures that in the event of a failure the backup server will have correct data. Since the data is completely duplicated, the client can have access to any server, which allows us to talk about load balancing in such a cluster. In addition, the nodes of such a cluster can be geographically dispersed, which makes the configuration disaster-resistant. This approach provides a very high level of availability, but also has a number of disadvantages:

The need to constantly copy data (this means that part of the computing and network resources will be continuously spent on synchronization);

Even the fastest network interface between servers within a cluster does not eliminate delays in the transfer of information, which can ultimately lead to desynchronization: if one server fails, not all transactions performed on its disk will be reflected on the disk of the second server.

In a cluster without resource sharing (Fig. 3), the servers are connected to one disk array, but each controls its own set of disks. If one of the nodes fails, the remaining server takes over management of its disks. This method eliminates the need for constant data synchronization between the servers and thereby frees up additional computing and network resources. But in this configuration the disks become a single point of failure, so drives using RAID technology are usually used here.

Fig. 3. Building a cluster without shared resources.

In systems with full sharing of resources (Fig. 4), all servers in the cluster have simultaneous access to the same disks. This approach requires carefully designed software that allows multiple nodes to access the same media. As in the previous case, the disks can be a single point of failure, so RAID arrays are advisable here too. In this option there is no need for constant data synchronization between servers, which frees up additional computing and network resources.

Fig. 4. Building a cluster with shared resources.

All programs executed by a cluster can be divided into several categories. Almost any ordinary program can be run on any cluster node, and the same program can be run on different nodes; however, each copy of the program must use its own resource (file system), since the file system is assigned to a specific node. In addition to ordinary programs, there are so-called true cluster applications. Such programs are effectively distributed among the cluster nodes, and the parts of the program running on different nodes interact with one another. True cluster programs make it possible to parallelize the load on the cluster. An intermediate position is occupied by applications designed to work in a cluster: unlike true cluster programs, they do not use explicit parallelism; such a program is, in essence, ordinary, but it can use some cluster capabilities, primarily those related to resource migration.

All cluster solutions on Microsoft platforms are focused primarily on combating hardware and software failures. Special software is what combines servers into clusters. Many modern enterprise applications and operating systems have built-in clustering support, but the smooth functioning and transparency of the cluster can only be guaranteed by special middleware. It is responsible for:

the coordinated work of all servers;

resolving conflicts arising in the system;

forming the cluster and reconfiguring it after failures;

distributing the load across the cluster nodes;

restoring the applications of failed servers on the available nodes (failover, the migration procedure);

monitoring the state of the hardware and software environment;

making it possible to run any application on the cluster without prior adaptation to the new hardware architecture.

Cluster software usually has several predefined scenarios for restoring system functionality, and can also provide the administrator with the ability to customize such scenarios. Failure recovery can be supported both for the node as a whole and for its individual components - applications, disk volumes, etc. This function is automatically initiated in the event of a system failure, and can also be launched by the administrator if, for example, he needs to disable one of the nodes for reconfiguration.

In addition to increased reliability and performance, several additional requirements are placed on cluster solutions in modern computing systems:

they must provide a unified external representation of the system;

high-speed backup and data recovery;

parallel access to the database;

the ability to transfer the load from faulty nodes to healthy ones;

the means to configure a high level of availability and guaranteed recovery after a disaster.

Of course, the use of several cluster nodes that simultaneously access the same data increases the complexity of the procedure for backing up and subsequently restoring information. Transferring the load from a failed node to a healthy one is the main mechanism for ensuring continuous operation of applications, subject to optimal use of cluster resources. For effective collaboration between cluster systems and the DBMS, the system must have a distributed lock manager, which ensures consistent changes to the database when a sequence of requests arrives from different cluster nodes. Setting up a cluster configuration while simultaneously ensuring high application availability is a rather complex process (this is due to the difficulty of defining the rules by which applications are transferred from failed cluster nodes to healthy ones). The cluster system must make it easy to transfer applications from one cluster node to another and to restore a failed application on another node. The system user is not required to know that he is working with a cluster system, so to users the cluster should look like a single computer: it must have a single file system for all nodes, a single IP address and a single system kernel.

The most reliable are distributed clusters. Even the most reliable systems can fail if, for example, a fire, earthquake, flood, or terrorist attack occurs. Given the global scale of modern business, such events should not harm it, so the cluster can (or should) be distributed.

All leading computer companies (Compaq, Dell, Hewlett-Packard, IBM, Sun Microsystems) offer their own cluster solutions. The leading position in the UNIX cluster segment is occupied by IBM, which is actively promoting its DB2 database; Sun is actively promoting its Sun Cluster solution. Compaq is recognized as one of the most active players (both in the number of platforms certified for clusters and in the variety of the cluster solutions themselves); it offers an almost complete range of clusters on Windows platforms for a department or remote branch, for applications in the corporation's infrastructure, and for large data-processing centers. The Compaq TrueCluster Server cluster solution best meets the modern requirements that companies place on such technology. The new software allows, for example, a database to be installed on several interconnected servers. The need for such consolidation arises, for example, when large capacity is required or when downtime in the event of a server failure must be reduced, which is achieved by transferring operations to another server in the cluster. This makes it possible to significantly reduce the cost of hardware platforms, making it economically feasible to build clusters of inexpensive servers of standard architecture even for relatively small enterprises. Compaq and Oracle are collaborating on technology and business to create a more scalable, manageable, reliable and cost-effective clustered database platform. In addition, Oracle has partnered with Dell and Sun Microsystems, which offer customers pre-configured and tested systems running Oracle clustering software. Dell, for example, ships clustering software on tested Windows and Linux servers.

In the corporate systems market, clusters play a key role. In many cases, cluster solutions simply do not have a worthy alternative. The real high availability and wide scalability of cluster information systems allows them to successfully solve increasingly complex problems, and with growing needs, it is easy to increase the computing power of the platform at a cost level acceptable for ordinary enterprises.

Cluster computing systems have become a continuation of the development of ideas embedded in the architecture of MPA systems. If in an MPA system a processor module acts as a complete computing node, then in cluster systems commercially available computers are used as such computing nodes. The development of communication technologies, namely the emergence of high-speed network equipment and special software libraries, for example, MPI (Message Passing Interface), which implements the mechanism for transmitting messages using standard network protocols, has made cluster technologies generally available. Currently, many small cluster systems are being created by combining the computing power of laboratory or classroom computers.

An attractive feature of cluster technologies is that, in order to achieve the required performance, they make it possible to build heterogeneous systems, that is, to combine computers of various types into a single computing system, ranging from personal computers to powerful supercomputers. Cluster technologies have become widespread as a means of creating supercomputer-class systems from mass-produced components, which significantly reduces the cost of the computing system. In particular, one of the first projects, implemented in 1998, was The COst effective COmputing Array (COCOA), in which 25 dual-processor personal computers with a total cost of about $100,000 were used to create a system with performance equivalent to a 48-processor Cray T3D costing several million dollars.

Lyle Long, a professor of aerospace engineering at Penn State University, believes that relatively cheap cluster computing systems could well serve as an alternative to expensive supercomputers in scientific organizations. Under his leadership, the COCOA cluster was built at the university. Within the framework of this project, 25 DELL workstations were combined, each of which includes two Pentium II/400 MHz processors, 512 MB of RAM, a 4 GB SCSI hard drive and a Fast Ethernet network adapter. To connect the nodes, a 24-port Baynetworks 450T switch with one expansion module is used. The installed software includes the RedHat Linux operating system, Fortran 90 and HPF compilers from the Portland Group, a freely distributed implementation of MPI - Message Passing Interface Chameleon (MPICH) - and the DQS queuing system.

In a paper presented at the 38th Aerospace Sciences Meeting and Exhibit, Long describes a parallel version of a computational program, with load balancing, used to predict helicopter noise levels at various locations. For comparison, this program was run on three different 48-processor computers to calculate the noise at 512 points. On the Cray T3E system the calculation took 177 seconds, on the SGI Origin2000 system 95 seconds, and on the COCOA cluster 127 seconds. Thus, clusters are a very effective computing platform for tasks of this class.

Another advantage of cluster systems over supercomputers is that their owners do not have to share processor time with other users, as in large supercomputer centers. In particular, COCOA provides more than 400 thousand hours of processor time per year, while in supercomputing centers it can be difficult to obtain 50 thousand hours.

Of course, there is no question of complete equivalence between these systems. As is known, the performance of distributed-memory systems depends very strongly on the performance of the interconnect, which can be characterized by two parameters: latency, the delay when sending a message, and throughput, the rate of information transfer. For the Cray T3D computer, for example, these parameters are 1 μs and 480 MB/s, while for a cluster using Fast Ethernet as the interconnect they are 100 μs and 10 MB/s. This partly explains the very high cost of supercomputers: with parameters like those of the cluster under consideration, there are not many tasks that can be solved effectively on a sufficiently large number of processors.
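The comparison implies a simple cost model: the time to deliver a message is roughly the latency plus the message size divided by the throughput. A minimal sketch with the figures quoted above shows how badly short messages in particular suffer on a Fast Ethernet cluster.

```c
#include <stdio.h>

/* t = latency + size / bandwidth, using the figures quoted above. */
static double xfer_time_us(double latency_us, double bw_mb_s, double size_bytes) {
    return latency_us + size_bytes / (bw_mb_s * 1e6) * 1e6;  /* result in microseconds */
}

int main(void) {
    double sizes[] = { 8, 1024, 1024 * 1024 };   /* message sizes in bytes */
    for (int i = 0; i < 3; i++) {
        double t3d  = xfer_time_us(1.0,   480.0, sizes[i]);   /* Cray T3D      */
        double feth = xfer_time_us(100.0,  10.0, sizes[i]);   /* Fast Ethernet */
        printf("%8.0f bytes: Cray T3D %10.1f us, Fast Ethernet %10.1f us\n",
               sizes[i], t3d, feth);
    }
    return 0;
}
```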

Based on the above, we will give a definition: a cluster is a connected set of full-fledged computers used as a single computing resource. Both identical (homogeneous clusters) and different (heterogeneous clusters) computers can be used as cluster nodes. By its architecture, a cluster computing system is loosely coupled. To create clusters, either simple single-processor personal computers or two- or four-processor SMP servers are usually used. In this case, no restrictions are imposed on the composition and architecture of nodes. Each node can run its own operating system. The most commonly used standard operating systems are Linux, FreeBSD, Solaris, Tru64 Unix, and Windows NT.

The literature notes four advantages achieved by clustering a computing system:

absolute scalability;

incremental scalability;

high availability factor;

price/performance ratio.

Let us explain each of the above features of a cluster computing system.

The absolute scalability property means that it is possible to create large clusters that exceed the computing power of even the most powerful single computers. A cluster can contain dozens of nodes, each of which is a multiprocessor.

The incremental scalability property means that a cluster can be expanded by adding new nodes in small portions. Thus, the user can start with a small system and expand it as needed.

Since each cluster node is an independent computing machine or system, the failure of one of the nodes does not lead to loss of cluster functionality. In many systems, fault tolerance is automatically supported by the software.

And finally, cluster systems provide a price/performance ratio that supercomputers cannot match. Clusters of any performance can be created using standard "building blocks", and the cost of the cluster will be lower than that of a single computing machine with equivalent processing power.

Thus, at the hardware level, a cluster is a collection of independent computing systems connected by a network. Solutions can be simple, based on Ethernet hardware, or complex, with high-speed networks with a throughput of hundreds of megabytes per second.

An integral part of the cluster is specialized software, which is tasked with keeping computations running when one or more nodes fail. Such software redistributes the computing load when one or more cluster nodes fail and restores the computations after a node failure. In addition, when the cluster has shared disks, the cluster software maintains a single file system.

Classification of cluster system architectures

The literature describes various ways to classify cluster systems. The simplest classification is based on the way disk arrays are used: together or separately.

Figures 5.5.1 and 5.5.2 show the structures of two-node clusters, whose coordination is ensured by a high-speed line used for message exchange. This can be a local network, also used by computers not included in the cluster, or a dedicated line. In the case of a dedicated line, one or more cluster nodes will have access to a local or global network, thereby ensuring communication between the server cluster and remote client systems.

The difference between the presented clusters is that in the case of a local network the nodes use local disk arrays, while in the case of a dedicated line the nodes share a single redundant array of independent disks, the so-called RAID (Redundant Array of Independent Disks). A RAID consists of several disks managed by a controller, interconnected by high-speed channels and perceived by the external system as a single whole. Depending on the type of array used, varying degrees of fault tolerance and performance can be provided.

Fig. 5.5.1. Cluster configuration without shared disks

Fig. 5.5.2. Cluster configuration with shared disks

Let's look at the most common types of disk arrays:

RAID0 (striping) is a disk array of two or more hard drives with no redundancy. Information is divided into data blocks and written to both (or all) disks simultaneously. The advantage is a significant increase in performance. The disadvantage is that the reliability of RAID0 is obviously lower than that of any single disk and decreases as the number of disks in the array grows, since the failure of any one disk renders the entire array inoperable.

RAID1 (mirroring) is an array of at least two disks. Its advantages are an acceptable write speed, a gain in read speed when requests are parallelized, and high reliability: the array works as long as at least one of its disks is functioning, and the probability of both disks failing at once equals the product of their individual failure probabilities. In practice, if one of the disks fails, immediate action must be taken to restore redundancy; for this it is recommended to use hot-spare disks with any RAID level (except zero). The advantage of this approach is constant availability; the disadvantage is that you pay for two hard drives to get the usable capacity of one.
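A small numerical sketch of the "product of probabilities" point, using an assumed (purely illustrative) annual failure probability per disk and assuming independent failures:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative assumption: each disk has a 3% probability of failing
       within a year, and failures are independent. */
    double p_disk = 0.03;

    double p_raid0 = 1.0 - (1.0 - p_disk) * (1.0 - p_disk); /* any of two disks fails -> array lost */
    double p_raid1 = p_disk * p_disk;                       /* both disks must fail  -> array lost */

    printf("single disk failure probability : %.4f\n", p_disk);
    printf("RAID0 (2 disks) array loss      : %.4f\n", p_raid0);
    printf("RAID1 (2 disks) array loss      : %.4f\n", p_raid1);
    return 0;
}
```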

RAID10 is a mirrored array in which data is written sequentially to several disks, as in RAID0. This architecture is a RAID0 array whose segments are RAID1 arrays instead of individual disks. Accordingly, an array of this level must contain at least four disks. RAID10 combines high fault tolerance and performance.

A more complete picture of cluster computing systems is given by the classification of clusters according to the clustering methods used, which determine the main functional features of the system:

clustering with passive redundancy;

clustering with active redundancy;

stand-alone servers;

servers with connections to all disks;

servers with shared disks.

Clustering with passive redundancy is the oldest and most universal method. One of the servers takes on the entire computing load, while the other remains inactive but ready to take over if the main server fails. The active (primary) server periodically sends heartbeat messages to the backup (secondary) server. If heartbeat messages stop arriving, which is interpreted as a failure of the primary server, the secondary server takes over. This approach improves availability but not performance. Moreover, if the only form of communication between the nodes is message exchange and the two servers do not share disks, the backup server has no access to the databases managed by the primary server.

Passive redundancy is not typical for clusters. The term “cluster” refers to a set of interconnected nodes that actively participate in the computing process and together create the illusion of one powerful computing machine. This configuration is commonly referred to as a system with an active secondary server, and there are three clustering methods: standalone servers, servers without disk sharing, and servers with disk sharing.

In the first method, each cluster node is treated as an independent server with its own disks, and none of the disks in the system are shared. The scheme provides high performance and high availability, but requires special software to schedule the distribution of client requests across the servers so as to achieve balanced and efficient use of all of them. It is also necessary that, if one of the nodes fails during the execution of an application, another node in the cluster can pick up and complete that application; for this, the data in the system must be constantly backed up so that each server has access to all the most recent data. Because of these overheads, high availability is achieved only at the expense of performance.

To reduce communication overhead, most clusters now consist of servers connected to shared disks, usually represented by a RAID disk array (see Figure 5.5.2).

One variation of this approach assumes that disk sharing is not used. Shared disks are divided into partitions, and each cluster node is allocated its own partition. If one node fails, the cluster can be reconfigured so that access rights to its shared disk partition are transferred to another node.

Another option is that multiple servers share access to shared disks over time, so that any node has access to all partitions of all shared disks. This approach requires some kind of locking mechanism to ensure that only one of the servers has access to the data at any given time.

Clusters provide high availability - they do not have a single operating system and shared memory, i.e. there is no problem of cache coherence. In addition, special software in each node constantly monitors the performance of all other nodes. This control is based on the periodic sending by each node of the “I’m still awake” signal. If a signal is not received from a certain node, then such a node is considered to be out of order; it is not given the ability to perform I/O, its disks and other resources (including network addresses) are reassigned to other nodes, and programs running on it are restarted on other nodes.

Cluster performance scales well as nodes are added. A cluster can run multiple individual applications, but scaling a single application requires its parts to communicate by exchanging messages. However, one cannot ignore that interactions between cluster nodes take much longer than in traditional computing systems. The ability to grow the number of nodes almost unlimitedly and the lack of a single operating system make cluster architectures extremely scalable. Systems with hundreds and thousands of nodes have been successfully used.

When developing clusters, two approaches can be distinguished. The first is the creation of small cluster systems: the cluster combines fully functional computers that continue to operate as independent units, for example classroom computers or laboratory workstations. The second approach is the purposeful creation of powerful computing resources. The system units of the computers are compactly mounted in special racks, and one or more fully functional computers, called host computers, are allocated to manage the system and run tasks. In this case there is no need to equip the computers of the compute nodes with graphics cards, monitors, disk drives and other peripherals, which significantly reduces the cost of the system.

Many technologies have been developed for combining computers into a cluster. The most widely used at the moment is Ethernet, due to its ease of use and the low cost of communication equipment. However, the price for this is a clearly insufficient data exchange rate.

The developers of the ScaLAPACK subroutine package, designed for solving linear algebra problems on multiprocessor systems in which the share of communication operations is large, formulate the requirement for a multiprocessor system as follows: "The speed of interprocessor exchange between two nodes, measured in MB/s, must be at least 1/10 of the peak performance of a compute node, measured in MFLOPS."
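A sketch of that rule of thumb in code; the peak node performance used here is a hypothetical figure chosen only for illustration.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical peak performance of one compute node, in MFLOPS
       (roughly the class of a dual Pentium II/400 node -- illustrative only). */
    double node_peak_mflops = 800.0;

    /* ScaLAPACK rule of thumb: exchange rate (MB/s) >= peak MFLOPS / 10. */
    double required_mb_s = node_peak_mflops / 10.0;

    printf("Required inter-node exchange rate: %.0f MB/s\n", required_mb_s);
    printf("Fast Ethernet (~10 MB/s) %s this requirement\n",
           10.0 >= required_mb_s ? "meets" : "falls short of");
    return 0;
}
```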

Cluster topologies

Let's consider the topologies characteristic of the so-called “small” clusters, consisting of two to four nodes.

The cluster-pairs topology is used when organizing two- or four-node clusters (Fig. 5.5.3). Nodes are grouped in pairs, disk arrays are attached to both nodes of a pair, and each node of the pair has access to all the disk arrays of that pair. One of the nodes in the pair is used as a backup for the other.

A four-node cluster pair is a simple extension of a two-node topology. From the point of view of administration and configuration, both cluster pairs are considered as a single whole.

This topology can be used to organize clusters with high data availability, but fault tolerance is implemented only within a pair, since the information storage devices belonging to a pair do not have a physical connection with another pair.

Fig. 5.5.3. The cluster-pairs topology

The N+1 topology allows you to create clusters of two, three and four nodes (Fig. 5.5.4). Each disk array is connected to only two cluster nodes. The disk arrays are organized according to the RAID1 (mirroring) scheme. One server is connected to all the disk arrays and serves as a backup for all the other (primary or active) nodes. The standby server can be used to provide high availability when paired with any of the active nodes.

The topology is recommended for organizing clusters with high data availability. In configurations where it is possible to dedicate one node to redundancy, this topology can reduce the load on active nodes and ensure that the load of a failed node can be replicated on the standby node without loss of performance. Fault tolerance is provided between any of the primary nodes and the backup node. At the same time, the topology does not allow for global fault tolerance, since the main cluster nodes and their information storage systems are not connected to each other.

The N×N topology is similar to the N+1 topology, allowing clusters of two, three and four nodes to be created, but unlike it, it offers greater flexibility and scalability (Fig. 5.5.5).

Fig. 5.5.5. The N×N topology

Only in this topology do all cluster nodes have access to all disk arrays, which, in turn, are built according to the RAID1 (mirroring) scheme. The scalability of the topology manifests itself in the ease of adding nodes and disk arrays to the cluster without changing the connections in the system.

Fig. 5.5.6. The fully separate access topology

The fully separate access topology connects each disk array to only one cluster node (Fig. 5.5.6). It is recommended only for applications characterized by a fully separate access architecture.

Review questions

1. Give the definition of a cluster computing system.

2. Name the main advantages and disadvantages of cluster computing systems.

3. What classifications of cluster computing systems do you know?

4. What cluster system topologies do you know? Name their advantages and disadvantages.

Literature

1. Architectures and topologies of multiprocessor computing systems / A.V. Bogdanov, V.V. Korkhov, V.V. Mareev, E.N. Stankova. - M.: INTUIT.RU, 2004. - 176 p.

2. Microprocessor systems: a textbook for universities / E.K. Alexandrov, R.I. Grushvitsky, M.S. Kupriyanov et al.; edited by D.V. Puzankov. - St. Petersburg: Politekhnika, 2002. - 935 p.

This page was written so that it could be useful not only to users of the NIVC computing clusters, but also to everyone who wants to get an idea of how a computing cluster operates. Solutions to typical problems of NIVC cluster users are presented on a separate page.

What is a computing cluster?

In general, a computing cluster is a set of computers (computing nodes) connected by some communication network. Each computing node has its own RAM and runs its own operating system. The most common is the use of homogeneous clusters, that is, those where all nodes are absolutely identical in their architecture and performance.

You can read more about how a computing cluster is structured and operates in the book by A. Latsis “How to build and use a supercomputer.”

How do programs run on a cluster?

For each cluster there is a dedicated computer - the head machine (front-end). This machine has software installed that controls how programs run on the cluster. The actual computing processes of users are launched on computing nodes, and they are distributed so that there is no more than one computing process per processor. You cannot run computing processes on the head machine of the cluster.

Users have terminal access to the head machine of the cluster; there is no need for them to log into the cluster nodes. Programs are launched on the cluster in so-called "batch" mode: the user has no direct, "interactive" interaction with the program, and the program cannot wait for keyboard input or display output directly on the screen. Moreover, a user's program can run while the user is not connected to the cluster.

What operating system is installed?

A computing cluster, as a rule, runs one of the varieties of the Unix OS - a multi-user, multi-tasking network operating system. In particular, at the Moscow State University Research Computing Center, clusters operate under Linux OS, a freely distributed version of Unix. Unix has a number of differences from Windows, which usually runs on personal computers, in particular these differences relate to the user interface, work with processes and the file system.

You can read more about the features and commands of the UNIX OS here:

  • Linux installation and first steps (book by Matt Welsh, translated into Russian by A. Solovyov).
  • UNIX operating system (information and analytical materials on the CIT-Forum server).

How is user data stored?

All cluster nodes have access to a common file system located on the file server. That is, a file can be created, for example, on the host machine or on some node, and then read under the same name on another node. Writing to the same file simultaneously from different nodes is impossible, but writing to different files is possible. In addition to the shared file system, there may be local disks on cluster nodes. They can be used by programs to store temporary files. After the end (more precisely, immediately before the end) of the program, these files must be deleted.

What compilers are used?

There are no specialized parallel compilers for clusters. Ordinary optimizing compilers for the C and Fortran languages are used - GNU, Intel or others that can produce executable programs for Linux. As a rule, parallel MPI programs are compiled with special wrapper scripts (mpicc, mpif77, mpif90, etc.), which are add-ons to the existing compilers and link in the necessary libraries.
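As an illustration of how these wrapper scripts are used, here is a minimal MPI program in C. The compilation command shown in the comment is only an example of a typical mpicc invocation; the exact options depend on the compiler installed on a particular cluster.

    /* hello_mpi.c - a minimal MPI program. The wrapper script supplies the
       MPI headers and libraries, so it is compiled like an ordinary C file,
       for example:
           mpicc -O2 -o hello_mpi hello_mpi.c
    */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* number (rank) of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        printf("Process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }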

How to use the capabilities of the cluster?

There are several ways to use the computing power of a cluster.

1. Run many single-processor tasks. This can be a reasonable option if you need to run many independent computational experiments with different input data, where the completion time of each individual calculation is not critical and all the data fits in the memory available to a single process.

2. Run ready-made parallel programs. For some tasks, free or commercial parallel programs are available, which you can use on a cluster if necessary. As a rule, for this it is enough that the program is available in source code and implemented using the MPI interface in C/C++ or Fortran. Examples of freely distributed parallel programs implemented using MPI: GAMESS-US (quantum chemistry), POVRay-MPI (ray tracing).

3. Call parallel libraries from your programs. For some areas, such as linear algebra, libraries are available that solve a wide range of standard subproblems using parallel processing. If such subproblems account for most of the program's computational work, using a parallel library makes it possible to obtain a parallel program practically without writing any parallel code of your own. An example of such a library is SCALAPACK. A Russian-language manual on using this library, with examples, can be found on the numerical analysis server of the Research Computing Center of Moscow State University. The parallel FFTW library for computing fast Fourier transforms (FFTs) is also available. Information about other parallel libraries and programs implemented with MPI can be found at http://www-unix.mcs.anl.gov/mpi/libraries.html.

4. Create your own parallel programs. This is the most labor-intensive but also the most universal approach. There are two main options: 1) insert parallel constructs into existing sequential programs; 2) create a parallel program from scratch.

How do parallel programs work on a cluster?

Parallel programs on a computing cluster operate in the message-passing model. This means that the program consists of many processes, each of which runs on its own processor and has its own address space. Direct access to the memory of another process is impossible, and data exchange between processes is done through receive and send operations. That is, a process that must receive data calls the Receive operation and indicates which process it should receive the data from, while a process that must pass data to another calls the Send operation and indicates which process the data should be sent to. This model is implemented by the standard MPI interface. There are several implementations of MPI, including free and commercial ones, portable and network-specific.

As a rule, MPI programs are built according to the SPMD (single program, multiple data) model: all processes share the same program code, but different processes hold different data and choose their actions depending on the sequence number (rank) of the process.
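The following schematic sketch in C (not taken from any of the manuals listed below) illustrates both points: every process runs the same code, and processes 0 and 1 exchange an array with the send and receive operations. It must be run with at least two processes.

    /* SPMD sketch: the behaviour of each process is selected by its rank. */
    #include <stdio.h>
    #include <mpi.h>

    #define COUNT 100

    int main(int argc, char **argv)
    {
        int rank;
        double data[COUNT];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < COUNT; i++)          /* prepare the data */
                data[i] = i;
            MPI_Send(data, COUNT, MPI_DOUBLE, 1, 0,  /* send it to process 1 */
                     MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, COUNT, MPI_DOUBLE, 0, 0,  /* receive it from process 0 */
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Process 1 received %d values\n", COUNT);
        }

        MPI_Finalize();
        return 0;
    }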

  • Lecture 5. Parallel programming technologies. Message Passing Interface.
  • Computational workshop on MPI technology (A.S. Antonov).
  • Manual "Parallel programming using MPI technology" (A.S. Antonov).
  • MPI: The Complete Reference (in English).
  • Chapter 8: Message Passing Interface in Ian Foster's book "Designing and Building Parallel Programs" (in English).

Where can I see examples of parallel programs?

Schematic examples of MPI programs can be found here:

  • Course by Vl.V. Voevodin "Parallel data processing". Appendix to lecture 5.
  • Examples from A.S. Antonov's manual "Parallel programming using MPI technology".

Is it possible to debug parallel programs on a personal computer?

Developing MPI programs and testing their functionality is possible on an ordinary PC. You can run several MPI processes even on a single-processor computer and thus check that the program works. Preferably this should be a PC running Linux, where the MPICH package can be installed; it is also possible on a Windows computer, but more difficult.

How labor-intensive is it to program computational algorithms using MPI and are there alternatives?

The MPI set of functions is sometimes called a "parallel assembler" because it is a relatively low-level programming system. For a novice user, programming a complex parallel algorithm with MPI and debugging the resulting program can be quite labor-intensive. There are also higher-level programming systems, in particular the Russian developments DVM and NORMA, which allow the user to describe the task in terms familiar to him and generate MPI code as output, and therefore can be used on almost any computing cluster.

How to speed up calculations on a cluster?

First, you need to speed up calculations on a single processor as much as possible, for which you can take the following measures.

1. Selection of compiler optimization options. You can read more about compiler options here:

  • Intel C++ and Fortran compilers (Russian page on our website).

2. Using optimized libraries. If some standard operations, such as matrix multiplication, take up a significant portion of the program's running time, it makes sense to use ready-made optimized routines for these operations rather than programming them yourself. The BLAS (Basic Linear Algebra Subprograms) library was developed for performing linear algebra operations on matrices and vectors. Its calling interface has effectively become a standard, and there are now several implementations that are well optimized and tuned to particular processor architectures. One such implementation is a freely distributed library which, when installed, is automatically tuned to the characteristics of the processor. Intel offers the MKL library, an optimized implementation of BLAS for Intel processors and SMP computers based on them; see also the article about selecting MKL options. A minimal sketch comparing a hand-written multiplication loop with a call to an optimized BLAS routine is given after this list.


3. Elimination of swapping (automatic dumping of data from memory to disk). Each process should store no more data than the RAM available to it (in the case of a dual-processor node, this is approximately half of the node's physical memory). If it is necessary to work with a large amount of data, it may be advisable to organize work with temporary files or use several computing nodes, which together provide the required amount of RAM.

4. More efficient use of cache memory. If the sequence of operations in the program can be changed, modify it so that operations on the same data, or on consecutively stored data, are also performed consecutively rather than scattered. In some cases it makes sense to change the order of loops in nested loop constructs. In some cases it is possible to organize the computation over blocks of data that fit entirely into the cache (the loop-order comparison in the sketch after this list illustrates the first point).

5. More efficient work with temporary files. For example, if a program creates temporary files in the current directory, it makes sense to switch to using the local disks of the nodes. If two processes run on a node, each creating temporary files, and the node has two local disks, it is desirable that the two processes create their files on different disks.

6. Use the most appropriate data types. For example, in some cases it may be appropriate to use 32-bit single-precision floats or even integers instead of 64-bit doubles.
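To illustrate points 2 and 4 above, here is a minimal sketch in C that multiplies two matrices, first with a hand-written loop whose innermost index runs over consecutively stored elements (C stores matrices row by row), and then with a call to an optimized BLAS routine. The header name cblas.h and the linking options are assumptions that depend on which BLAS implementation is installed.

    /* Matrix multiplication C = A*B: a cache-friendly hand-written loop
       and the same operation through the BLAS interface (dgemm). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>    /* header name depends on the BLAS implementation */

    #define N 512

    /* Loop order i-k-j keeps the innermost loop on consecutive elements
       of B and C; the result is accumulated, so C must start at zero. */
    static void matmul_loops(const double *A, const double *B, double *C)
    {
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                double a = A[i * N + k];
                for (int j = 0; j < N; j++)
                    C[i * N + j] += a * B[k * N + j];
            }
    }

    /* The same operation as a single BLAS call: C = 1.0*A*B + 0.0*C. */
    static void matmul_blas(const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);
    }

    int main(void)
    {
        double *A = malloc(N * N * sizeof(double));
        double *B = malloc(N * N * sizeof(double));
        double *C = calloc(N * N, sizeof(double));

        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

        matmul_loops(A, B, C);
        printf("loops: C[0] = %g\n", C[0]);

        matmul_blas(A, B, C);               /* overwrites C because beta = 0 */
        printf("dgemm: C[0] = %g\n", C[0]);

        free(A); free(B); free(C);
        return 0;
    }

On most systems the optimized dgemm will be noticeably faster than even the cache-friendly hand-written loop.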

For more information on fine-grained program optimization, see the Optimization Guide for Intel Processors and other related materials on the Intel website.

How to evaluate and improve the quality of parallelization?

To speed up parallel programs, it is worth reducing the overhead of synchronization and data exchange; one acceptable approach is to overlap asynchronous transfers with computation. To avoid idle processors, the computation should be distributed between processes as evenly as possible, and in some cases dynamic load balancing may be necessary.
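Below is a minimal sketch in C of such overlapping, using non-blocking MPI operations; the ring exchange pattern is only an illustration, not a recipe for any particular application.

    /* Start a non-blocking exchange with the neighbouring processes, do the
       work that does not depend on the incoming data, then wait for the
       exchange to complete. */
    #include <stdio.h>
    #include <mpi.h>

    #define COUNT 10000

    int main(int argc, char **argv)
    {
        int rank, size;
        double outbuf[COUNT], inbuf[COUNT], local[COUNT];
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;            /* neighbours in a ring */
        int prev = (rank + size - 1) % size;

        for (int i = 0; i < COUNT; i++) { outbuf[i] = rank; local[i] = i; }

        /* start the exchange ... */
        MPI_Irecv(inbuf, COUNT, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(outbuf, COUNT, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... do the computation that does not need the incoming data ... */
        double s = 0.0;
        for (int i = 0; i < COUNT; i++)
            s += local[i] * local[i];

        /* ... and only then wait for the messages. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("Process %d: local sum = %g, first received value = %g\n",
               rank, s, inbuf[0]);

        MPI_Finalize();
        return 0;
    }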

An important indicator of whether parallelism is implemented efficiently in a program is the load on the computing nodes on which it runs. If the load on all or some of the nodes is far from 100%, the program is using the computing resources inefficiently: it either creates large communication overheads or distributes the computation unevenly between processes. Users of the MSU Research Computing Center can check the node load through the web interface for viewing the status of the nodes.

In some cases, in order to understand the reason for the low performance of a program and which specific places in the program need to be modified in order to achieve increased performance, it makes sense to use special performance analysis tools - profilers and tracers.

You can read more about improving the performance of parallel programs in the book by V.V. Voevodin and Vl.V. Voevodin.

Cluster technologies have become a logical continuation of the development of ideas embedded in the architecture of MPP systems. If the processor module in an MPP system is a complete computing system, then the next step suggests itself: why not use ordinary commercially available computers as such computing nodes. The development of communication technologies, namely the emergence of high-speed network equipment and special software, such as the MPI system, which implements a message transfer mechanism over standard network protocols, has made cluster technologies generally available. Today it is not difficult to create a small cluster system by combining the computing power of computers in a separate laboratory or classroom.

An attractive feature of cluster technologies is that, to achieve the required performance, they make it possible to combine computers of various types into unified computing systems, from personal computers to powerful supercomputers. Cluster technologies have become widespread as a means of building supercomputer-class systems from mass-produced components, which significantly reduces the cost of the computing system. In particular, the COCOA project was one of the first to be implemented: from 25 dual-processor personal computers with a total cost of about $100,000, a system was built with performance equivalent to a 48-processor Cray T3D costing several million US dollars.

Of course, there is no need to talk about complete equivalence of these systems. As noted in the previous section, the performance of distributed-memory systems depends strongly on the performance of the communication environment. The communication environment can be characterized fairly fully by two parameters: latency (the delay when sending a message) and throughput (the rate of data transfer). For the Cray T3D these parameters are 1 μs and 480 MB/s respectively, while for a cluster that uses Fast Ethernet as its communication medium they are about 100 μs and 10 MB/s. This partly explains the very high cost of supercomputers: with parameters like those of the cluster in question, there are not many tasks that can be solved efficiently on a sufficiently large number of processors.
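As a rough illustration using these figures (and ignoring protocol overheads), sending a 1 MB message takes about 1 μs + 1 MB / 480 MB/s ≈ 2 ms on the Cray T3D, but about 100 μs + 1 MB / 10 MB/s ≈ 100 ms on the Fast Ethernet cluster, that is, roughly fifty times longer.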

To put it briefly, a cluster is a connected set of full-fledged computers used as a single computing resource. The advantages of a cluster system over a set of independent computers are obvious. First, many dispatching systems for batch job processing have been developed that allow a job to be submitted to the cluster as a whole rather than to any individual computer; these systems automatically distribute jobs to free computing nodes, or buffer them if there are none, which allows for a more uniform and efficient loading of the computers. Second, it becomes possible to share the computing resources of several computers to solve a single problem.


To create clusters, either simple single-processor personal computers or two- or four-processor SMP servers are usually used. In this case, no restrictions are imposed on the composition and architecture of nodes. Each node can run its own operating system. The most commonly used standard operating systems are Linux, FreeBSD, Solaris, Tru64 Unix, Windows NT. In cases where the cluster nodes are heterogeneous, we talk about heterogeneous clusters.

When creating clusters, two approaches can be distinguished. The first approach is used when creating small cluster systems. A cluster combines fully functional computers that continue to operate as independent units, for example, classroom computers or laboratory workstations. The second approach is used in cases where a powerful computing resource is purposefully created. Then the computer system units are compactly placed in special racks, and one or more full-function computers, called host computers, are allocated to manage the system and run tasks. In this case, there is no need to equip the computers of the computing nodes with graphic cards, monitors, disk drives and other peripheral equipment, which significantly reduces the cost of the system.

Many technologies have been developed for connecting computers into a cluster. The most widely used at the moment is Fast Ethernet, owing to the ease of use and low cost of the communication equipment. However, the price for this is a clearly insufficient exchange speed: this equipment provides a maximum transfer speed between nodes of 10 MB/s, while the transfer speed to and from RAM is 250 MB/s and higher. The developers of the ScaLAPACK subroutine package, intended for solving linear algebra problems on multiprocessor systems in which the share of communication operations is large, formulate the following requirement for a multiprocessor system: "The speed of interprocessor exchange between two nodes, measured in MB/s, must be at least 1/10 of the peak performance of a computing node, measured in Mflops" (http://rsusu1.rnd.runnet.ru/tutor/method/m1/liter1.html). Thus, if Pentium III 500 MHz class computers (peak performance 500 Mflops) are used as computing nodes, Fast Ethernet equipment provides only 1/5 of the required speed. This situation can be partially corrected by moving to Gigabit Ethernet technology.

A number of companies offer specialized cluster solutions based on faster networks, such as SCI from Scali Computer (~100 MB/s) and Myrinet (~120 MB/s). Manufacturers of high-performance workstations (Sun, HP, Silicon Graphics) have also actively supported cluster technologies.
