Palabos User Manual Translation and Study (4): Data Processors for Non-Local Operations and Couplings Between Blocks

Data processors for non-local operations, and couplings between blocks

Chapter 16 of the Palabos user documentation.

Preface

Original text

Dynamics classes and data processors are the nuts and bolts of a Palabos program. Conceptually speaking, you can view dynamics classes as an implementation of a raw “lattice Boltzmann paradigm”, whereas data processors are rather a realization of a more general data-parallel multi-block paradigm. Dynamics classes are (relatively) easy, because lattice Boltzmann is easy as long as there are no boundaries, refined grids, parallel programs, or other advanced structural ingredients. Data processors can be a little harder to understand, because they have the difficult task of handling “everything else” that is not taken care of by dynamics classes. They implement all non-local ingredients of a model, they execute operations on scalar-fields and tensor-fields, and they create couplings between all types of blocks. And, just like dynamics objects, they handle reduction operations, and they must be efficient and inherently parallelizable.
The importance of data processors is particularly obvious when you are working with scalar-fields and tensor-fields. Unlike block-lattices, these data fields do not have intelligent cells with a local reference to a dynamics object; data processors are therefore the only efficient way of performing collective operations on them. As an example, consider a field of type TensorField3D<T,3> that represents a velocity field (each element is a velocity vector of type Array<T,3>). Such a field is for example obtained from a LB simulation by calling the function computeVelocity. It might be interesting to compute space-derivatives of the velocity through finite differences.
The only morally right way of doing this is through a data processor, because it is the only approach which is parallelizable, scalable, efficient, and forward-compatible with future versions of Palabos. This is exactly what Palabos does when you call one of the functions like computeVorticity() or computeStrainRate() (see the appendix [tensor-field -> tensor-field]). Note, once more, that you could also evaluate the finite difference scheme by writing a simple loop over the space indices of the tensor-field. This would produce the correct result in serial and in parallel, and it would even be pretty efficient in serial, but it is not parallelizable (in parallel, efficiency is lost instead of gained).
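As an illustration, such a post-processing chain might look like the following sketch (assuming a 3D MultiBlockLattice3D<T,Descriptor> named lattice; the wrapper functions are those of Palabos' data-analysis interface):

std::auto_ptr<MultiTensorField3D<T,3> > velocity = computeVelocity(lattice);
std::auto_ptr<MultiTensorField3D<T,3> > vorticity = computeVorticity(*velocity);

Both calls are dispatched internally to data processors, so the computation remains parallel and scalable.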
All in all, it should have become clear that while data processors are powerful objects, they also need to address a certain amount of complexity and are therefore conceptually less simple than other components of Palabos. Consequently, data processors are most certainly the part of Palabos' user interface which is toughest to understand, and this section is filled with ad-hoc rules which you just need to learn at some point. To alleviate this a bit, the section starts with two relatively simple topics which teach you how to solve many tasks in Palabos through the available helper functions, and thus avoid explicitly writing data processors in many cases. The remainder of this section contains many links to existing implementations in Palabos, and you are strongly encouraged to actually have a look at these examples to assimilate the theoretical concepts.

Using helper functions to avoid explicitly writing data processors

Data processors can be (arbitrarily) split into three categories, according to the use which is made of them. The first category is about setting up a simulation: assigning dynamics objects to a sub-domain of the lattice, initializing the populations, and so on. These methods are explained in the sections Initial values of density and velocity and Defining boundary conditions, and the functions are listed in the appendix Mutable (in-place) operations for simulation setup and other purposes. The second category embraces data processors which are added to a lattice to implement a physical model. Many models are predefined, as presented in section Implemented fluid models. Finally, the last category is for the evaluation of the data, and for the execution of short post-processing tasks, as in the above example of the computation of a velocity gradient. Examples are found in the section Data evaluation, and the available functions are listed in the appendix Non-mutable operations for data analysis and other purposes.

Convenience wrappers for local operations

Imagine that you have to perform a local initialization task on a block-lattice, i.e. a task for which you don’t need to access the value of surrounding neighbors, and for which the available Palabos functions are insufficient. As a lightweight alternative to writing a classical data-processing functional, you can implement a class which inherits from OneCellFunctionalXD and implements the virtual method (here in 2D)

virtual void execute(Cell<T,Descriptor>& cell) const;

In the body of this method, simply perform the desired action on the argument cell. If the action depends on the space position, you can instead inherit from OneCellIndexedFunctionalXD and implement the method (again illustrated for the 2D case)

virtual void execute(plint iX, plint iY, Cell<T,Descriptor>& cell) const;

An instance of this one-cell functional is then applied to the lattice through a function call like

// Case of a plain one-cell functional.
apply(lattice, domain, new MyOneCellFunctional<T,Descriptor>);
// Case of an indexed one-cell functional.
applyIndexed(lattice, domain, new MyOneCellIndexedFunctional<T,Descriptor>);
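
For illustration, here is a hedged sketch of such a one-cell functional, which (re-)initializes every cell at equilibrium with unit density and zero velocity (the class name is ours; iniCellAtEquilibrium is assumed from Palabos' cell utilities, and clone() follows the usual Palabos convention):

template<typename T, template<typename U> class Descriptor>
class IniEquilibriumFunctional2D : public OneCellFunctional2D<T,Descriptor> {
public:
	virtual IniEquilibriumFunctional2D<T,Descriptor>* clone() const
	{
		return new IniEquilibriumFunctional2D<T,Descriptor>(*this);
	}
	virtual void execute(Cell<T,Descriptor>& cell) const
	{
		// Replace the populations by their equilibrium values
		// for density 1 and velocity 0.
		Array<T,Descriptor<T>::d> u;
		u.resetToZero();
		iniCellAtEquilibrium(cell, (T)1, u);
	}
};

It would then be applied like the plain one-cell functional above:

apply(lattice, lattice.getBoundingBox(), new IniEquilibriumFunctional2D<T,Descriptor>);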

This method is used in the example program located in examples/showCases/multiComponent2d. Here, a customized initialization process is required to get access to the external scalars of a lattice for a thermal simulation.
These one-cell functionals are less general than the usual data-processing functionals, because they cannot be used for non-local operations. Furthermore, they tend to be numerically somewhat less efficient, because Palabos needs to perform a virtual function call to the method execute of the one-cell functional on each cell of the domain. However, this loss of efficiency is usually completely negligible during the initialization stage, where it is important to have code which scales on a parallel machine, but not to optimize the code for a gain of a microsecond. In such cases you should prefer code which, as with the one-cell functionals, is shorter and easier to read.


Commentary

The above are very convenient ways of using data processors.
By writing the operation as a plain function and letting a built-in class invoke it through a function pointer, or by using the ready-made data processors, you can avoid writing data processors yourself; many of the boundary-condition operations are implemented in this way.
By writing a subclass of OneCellFunctionalXD and implementing the virtual function execute(), you can perform all kinds of operations on local cells through a call to applyIndexed(). This offers more flexibility than the function-pointer approach, and many of the result-collection functions are written in this way.

Writing data processors

Original text

A common way to execute an operation on a matrix is to write some sort of loop over all elements of the matrix, or at least over a given sub-domain, and to execute the operation on each cell. If the memory of the matrix is subdivided into smaller components, as is the case for Palabos' multi-blocks, and these components are distributed over the nodes of a parallel machine, then your loop also needs to be subdivided into corresponding smaller loops. The purpose of the data processors in Palabos is to perform this subdivision automatically for you. Palabos then provides the coordinates of the subdivided domains, and requires you to execute the operation on these domains instead of the original full domain.
As a developer of a data processor, you are almost always in touch with so-called data-processing functionals, which provide a simplified interface by performing a part of the repetitive tasks behind the scenes. There exist many different types of data-processing functionals, as listed in the next section. For the sake of illustration, we now consider the case of a data processor which acts on a single 2D block-lattice or multi-block-lattice, and which does not perform any data reduction (it returns no value). Let's say that the aim of the operation is to exchange the values of f[1] and f[5] on a given set of lattice cells. This could (and, for the sake of code clarity, should) be done with the simple one-cell functional introduced in Section Convenience wrappers for local operations, but we want to do the real thing now, and get to the core of data functionals.
The data-processing functional for this job must inherit from BoxProcessingFunctional2D_L (the L indicates that the data processor acts on a single lattice), and implement, among other methods described below, the virtual method process:

template<typename T, template<typename U> class Descriptor>
class Invert_1_5_Functional2D : public BoxProcessingFunctional2D_L<T,Descriptor> {
public:
// ... implement other important methods here.
	void process(Box2D domain, BlockLattice2D<T,Descriptor>& lattice)
	{
		for (plint iX=domain.x0; iX<=domain.x1; ++iX) {
			for (plint iY=domain.y0; iY<=domain.y1; ++iY) {
				Cell<T,Descriptor>& cell = lattice.get(iX,iY);
				// Use the function swap from the C++ STL to swap the two values.
				std::swap(cell[1], cell[5]);
			}
		}
	}
};

The first argument of the function process corresponds to the domain computed by Palabos after sub-dividing the original domain to fit the smaller components of the original block. The second argument corresponds to the lattice on which the operation is to be performed. This argument is always an atomic-block (i.e. it is always a block-lattice and never a multi-block-lattice), because by the time process is called, Palabos has already subdivided the original block and is accessing its internal, atomic sub-blocks. If you compare this to the procedure of writing a one-cell functional as shown in Section Convenience wrappers for local operations, you will see that the additional work you need to do in the present case is to write the loop over the space indices of the domain yourself. Having to write out these loops by hand all the time is tiring, especially when you write many data processors, and it is error-prone. But this is the price to pay for optimal efficiency, and in the field of computational physics, efficiency counts just a bit more than in other domains of software engineering and must, unfortunately, be weighed against elegance from time to time.
Another advantage over one-cell functionals is the possibility to implement non-local operations. In the following example, the value of f[1] on each cell is swapped with f[5] on its right neighbor:

void process(Box2D domain, BlockLattice2D<T,Descriptor>& lattice)
{
	for (plint iX=domain.x0; iX<=domain.x1; ++iX) {
		for (plint iY=domain.y0; iY<=domain.y1; ++iY) {
			Cell<T,Descriptor>& cell = lattice.get(iX,iY);
			Cell<T,Descriptor>& partner = lattice.get(iX+1,iY);
			std::swap(cell[1], partner[5]);
		}
	}
}

You can do this without the risk of accessing cells outside the range of the lattice if you respect two rules:

  1. On nearest-neighbor lattices (D2Q9, D3Q19, etc.), you can be non-local by one cell but no more (you may write lattice.get(iX+1,iY) but not lattice.get(iX+2,iY)). On a lattice with extended neighborhood you can also extend the distance at which you access neighboring cells in a data processor. The amount of allowed non-locality is determined by the constant Descriptor::vicinity.
  2. Non-local operations are not allowed in data processors which act on the communication envelope (Section The methods you need to override explains what this means).

To conclude this sub-section, let's summarize the things you are allowed to do in a data processor, and the things you are not. You are allowed to access the cells in the provided range (plus the nearest neighbors, or a few more neighbors according to the lattice topology), to read them and to modify them. The operation which you perform can be space dependent, but this space dependency must be generic and cannot depend on the specific coordinates of the argument domain provided to the function process. This is an extremely important point, which we shall state as Rule 0 of data processors:

Rule 0 of data processors:
A data processor must always be written in such a way that executing the data processor on a given domain has the same effect as splitting the domain into two sub-domains, and then executing the data processor consecutively on each of these sub-domains.

In practice, this means that you are not allowed to make any logical decision based on the parameters x0, x1, y0, or y1 of the argument domain, or directly based on the indices iX or iY. Instead, these local indices must first be converted to global ones, independent of the sub-division of the data processor, as explained in Section Absolute and relative position and sketched below.
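
The following sketch illustrates the distinction (a hypothetical variation of the swap example above; getLocation() is introduced in Section Absolute and relative position):

void process(Box2D domain, BlockLattice2D<T,Descriptor>& lattice)
{
	// Position of this atomic-block inside the overall multi-block.
	Dot2D location = lattice.getLocation();
	for (plint iX=domain.x0; iX<=domain.x1; ++iX) {
		for (plint iY=domain.y0; iY<=domain.y1; ++iY) {
			// OK: the decision depends on a global coordinate, which
			// is independent of how the domain was sub-divided.
			plint globalX = iX + location.x;
			if (globalX < 100) {
				Cell<T,Descriptor>& cell = lattice.get(iX,iY);
				std::swap(cell[1], cell[5]);
			}
			// NOT OK: if (iX==domain.x0) { ... } -- this would violate
			// Rule 0, because it depends on the sub-division.
		}
	}
}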

Categories of data-processing functionals

Depending on the type of blocks on which a data processor is applied, there exist different types of processing functionals, as listed below:

Class BoxProcessingFunctionalXD<T>
	  void processGenericBlocks(BoxXD domain, std::vector<AtomicBlockXD<T>*> atomicBlocks);

This class is practically never used. It is the fall-back option when everything else fails. It handles an arbitrary number of blocks of arbitrary type, which were cast to the generic type AtomicBlockXD. Before use, you need to cast them back to their real type.

Class LatticeBoxProcessingFunctionalXD<T,Descriptor>
	  void process(BoxXD domain, std::vector<BlockLatticeXD<T,Descriptor>*> lattices);

Use this class to process an arbitrary number of block-lattices, and potentially create a coupling between them. This data-processing functional is for example used to define a coupling between an arbitrary number of lattices for the Shan/Chen multi-component model defined in the files src/multiPhysics/shanChenProcessorsXD.h and .hh. This type of data-processing functional is not very frequently used either, as the two-block versions listed below are more appropriate in most cases.

Class ScalarFieldBoxProcessingFunctionalXD<T>
	  void process(BoxXD domain, std::vector<ScalarFieldXD<T>*> fields);

Same as above, applied to scalar-fields.

Class TensorFieldBoxProcessingFunctionalXD<T,nDim>
	  void process(BoxXD domain, std::vector<TensorFieldXD<T,nDim>*> fields);

Same as above, applied to tensor-fields.

Class BoxProcessingFunctionalXD_L<T,Descriptor>
	  void process(BoxXD domain, BlockLatticeXD<T,Descriptor>& lattice);

Data processor acting on a single lattice.

Class BoxProcessingFunctionalXD_S<T>
	  void process(BoxXD domain, ScalarFieldXD<T>& field);

Data processor acting on a single scalar-field.

Class BoxProcessingFunctionalXD_T<T,nDim>
	  void process(BoxXD domain, TensorFieldXD<T,nDim>& field);

Data processor acting on a single tensor-field.

Class BoxProcessingFunctionalXD_LL<T,Descriptor1,Descriptor2>
	  void process(BoxXD domain,
			  BlockLatticeXD<T,Descriptor1>& lattice1,
			  BlockLatticeXD<T,Descriptor2>& lattice2);

Data processor for processing and/or coupling two lattices with potentially different descriptors. Similarly, there is an SS version for two scalar-fields, and a TT version for two tensor-fields with potentially different dimensionality nDim.

Class BoxProcessingFunctionalXD_LS<T,Descriptor>
	  void process(BoxXD domain,
	  		  BlockLatticeXD<T,Descriptor>& lattice,
	  		  ScalarFieldXD<T>& field);

Data processor for processing and/or coupling a lattice and a scalar-field. Similarly, there is an LT and an ST version for the lattice-tensor and the scalar-tensor case.
For each of these processing functionals, there exists a “reductive” version (e.g. ReductiveBoxProcessingFunctionalXD_L) for the case where the data processor performs a reduction operation and returns a value.
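
As a hedged illustration of one of these interfaces, the following sketch of an LS functional writes the local density of a lattice into a scalar-field (the class name is ours; the remaining methods are discussed in the next subsection):

template<typename T, template<typename U> class Descriptor>
class DensityToScalarFunctional2D : public BoxProcessingFunctional2D_LS<T,Descriptor> {
public:
	void process(Box2D domain, BlockLattice2D<T,Descriptor>& lattice,
	             ScalarField2D<T>& density)
	{
		// The two blocks may interpret local coordinates differently;
		// see Section Absolute and relative position below.
		Dot2D offset = computeRelativeDisplacement(lattice, density);
		for (plint iX=domain.x0; iX<=domain.x1; ++iX) {
			for (plint iY=domain.y0; iY<=domain.y1; ++iY) {
				density.get(iX+offset.x, iY+offset.y) =
					lattice.get(iX,iY).computeDensity();
			}
		}
	}
	// ... clone(), appliesTo() and getModificationPattern() as usual.
};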

The methods you need to override

In addition to the method process, a data-processing functional must override three methods. The use of these three methods is now illustrated with the class Invert_1_5_Functional2D introduced at the beginning of this section:

BlockDomain::DomainT appliesTo() const
{
	return BlockDomain::bulk;
}
void getModificationPattern(std::vector<bool>& isWritten) const
{
	isWritten[0] = true;
}
Invert_1_5_Functional2D<T,Descriptor>* clone() const
{
	return new Invert_1_5_Functional2D<T,Descriptor>(*this);
}

To start with, you need to provide the method clone(), which is paradigmatic in Palabos (see Section Programming with Palabos). Next, you need to tell Palabos which of the blocks treated by the data processor are being modified. In the present case, there is only one block. In the general case, the size of the vector isWritten is equal to the number of involved blocks, and you must assign a true/false flag to each of them. Among other things, this information is exploited by Palabos to decide whether an inter-process communication for the block is needed after execution of the data processor.
The third method, appliesTo, is used to decide whether the data processor acts only on the actual domain of the simulation (BlockDomain::bulk) or whether it also includes the communication envelopes (BlockDomain::bulkAndEnvelope). Remember that the atomic-blocks which are the components of a multi-block are extended by a single cell layer (or several cell layers for extended lattices) to handle communication between blocks. This envelope overlaps with the bulk of another atomic-block, and the information is duplicated between the corresponding bulk and envelope cells. It is this envelope which makes it possible to implement a non-local data processor without incurring the danger of accessing out-of-range data. Normally, it is sufficient to execute a data processor on the bulk of the atomic-blocks, and it is better to do so, in order to avoid the restrictions listed below when using the envelopes. This is sufficient because a communication is automatically initiated between the envelopes and the bulk of neighboring blocks to update the values in the envelope if needed. Including the envelope is only needed if (1) you assign a new dynamics object to some or all of the cells (as is done when you call the function defineDynamics), or (2) you modify the internal state of the dynamics object (as is done when you call the function defineVelocity to assign a new velocity value on Dirichlet boundary nodes). In these cases, including the envelope is necessary because the nature and the content of dynamics objects are not transferred during the communication step between atomic-blocks. The only information which is transferred is the cell data (the particle populations and the external scalars).
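For instance, a functional which assigns new dynamics objects would plausibly override appliesTo as follows (a sketch under the assumptions just stated):

BlockDomain::DomainT appliesTo() const
{
	// Dynamics objects are not transferred during communication, so the
	// assignment must also be executed on the envelope cells.
	return BlockDomain::bulkAndEnvelope;
}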
If you decide to include the envelope in the application area of the data processor, you must however respect the two following rules; otherwise, undefined behavior will arise.

  1. The data processor must be entirely local, because there are no additional envelopes available to cope with non-local data access.
  2. The data processor can have write access to at most one of the involved blocks (the vector isWritten returned from the method getModificationPattern() can have the value true at most at one place).

Absolute and relative position

The coordinates iX and iY used in the space loop of a data processor are pretty useless for anything other than the execution of the loop, because they represent local variables of an atomic-block, which is itself situated at a random position inside the overall multi-block. To make decisions depending on a space position, the local coordinates must therefore first be converted to global ones:

// Access the position of the atomic-block inside the multi-block.
Dot2D relativePosition = lattice.getLocation();
// Convert local coordinates to global ones.
plint globalX = iX + relativePosition.x;
plint globalY = iY + relativePosition.y;

An example is provided in the directory examples/showCases/boussinesqThermal2d/. This conversion is a bit awkward, which is once again a good reason to use the one-cell functionals presented in Section Convenience wrappers for local operations: they do the job automatically for you.
Similarly, if you execute a data processor on more than one block, the relative coordinates are not necessarily the same in all involved blocks. Measured in global coordinates, the argument domain of the method process always overlaps with all of the involved blocks; this is guaranteed by the algorithm implemented in Palabos. However, the multi-blocks on which the data processor is applied do not necessarily work with the same internal data distribution, and potentially have a different interpretation of local coordinates. The argument domain of the method process is always provided in the local coordinates of the first atomic-block. To get the coordinates of the other blocks, a corresponding conversion must be applied:

Dot2D offset_0_1 = computeRelativeDisplacement(lattice0, lattice1);
Dot2D offset_0_2 = computeRelativeDisplacement(lattice0, lattice2);
plint iX1 = iX + offset_0_1.x;
plint iY1 = iY + offset_0_1.y;
plint iX2 = iX + offset_0_2.x;
plint iY2 = iY + offset_0_2.y;

Again, this process is illustrated in the example in examples/showCases/boussinesqThermal2d/. This displacement needs to be computed if any of the following conditions holds (if you are unsure, it is best to compute the displacement by default):

  1. The multi-blocks on which the data processor is applied don’t have the same data distribution, because they were constructed differently.
  2. The multi-blocks on which the data processor is applied don’t have the same data distribution, because they don’t have the same size. This is the case for all functions like computeVelocity, which computes the velocity on a sub-domain of the lattice. It uses a data-processor which acts on the original lattice (which is big) and the velocity field (which can be smaller because it has the size of the sub-domain).
  3. The data processor includes the envelope. In this case, a relative displacement stems from the fact that bulk nodes are coupled with envelope nodes from a different atomic-block. This is one more reason why it is generally better not to include the envelope in the application domain of a data processor.

Executing, integrating, and wrapping up data-processing functionals

There are basically two ways of using a data processor. In the first case, the processor is executed just once, on one or more blocks, through a call to the function executeDataProcessor. In the second case, the processor is added to a block through a call to the function addInternalProcessor, and then adopts the role of an internal data processor. An internal data processor is part of the block and can be executed as many times as wished by calling the method executeInternalProcessors of this block. This approach is typically chosen when the data processing step is part of the algorithm of the fluid solver. As examples, consider the non-local parts of boundary conditions, the coupling between components in a multi-component fluid, or the coupling between the fluid and the temperature field in a thermal code with Boussinesq approximation. In a block-lattice, internal processors have a special role, because the method executeInternalProcessors is automatically invoked at the end of the method collideAndStream() and of the method stream(). This behavior is based on the assumption that collideAndStream() represents a full lattice Boltzmann iteration cycle, and stream(), if used, stands at the end of such a cycle. The internal processors are therefore considered to be part of a lattice Boltzmann iteration and are executed at the very end, after the collision and the streaming step.
For convenience, the function calls to executeDataProcessor and addInternalProcessor have been redefined for each type of data-processing functional introduced in Section Categories of data-processing functionals; the new functions are called applyProcessingFunctional and integrateProcessingFunctional, respectively. To execute, for example, a data-processing functional of type BoxProcessingFunctional2D_LS on the whole domain of a given lattice and scalar-field (they can be either of multi-block or atomic-block type), the function call has the form

applyProcessingFunctional (
	new MyFunctional<T,Descriptor>, lattice.getBoundingBox(),
	lattice, scalarField );

All predefined data-processing functionals in Palabos are additionally wrapped in a convenience function, in order to simplify the syntax. For example, one of the three versions of the function computeVelocity for 2D fields is defined in the file src/multiBlock/multiDataAnalysis2D.hh as follows:

template<typename T, template<typename U> class Descriptor>
void computeVelocity( MultiBlockLattice2D<T,Descriptor>& lattice,
					  MultiTensorField2D<T,Descriptor<T>::d>& velocity,
					  Box2D domain )
{
	applyProcessingFunctional (new BoxVelocityFunctional2D<T,Descriptor>, domain, lattice, velocity );
}

Execution order of internal data processors

There are different ways to control the order in which internal data processors are executed by the function call executeInternalProcessors(). First of all, each data processor is attributed to a processor level, and these processor levels are traversed in increasing order, starting with level 0. By default, all internal processors are attributed to level 0, but you have the possibility to put them into any other level, specified as the last, optional parameter of the function addInternalProcessor or integrateProcessingFunctional. Inside a processor level, the data processors are executed in the order in which they were added to the block. In addition to imposing an order of execution, the attribution of data processors to a given level has an influence on the communication pattern inside multi-blocks. As a matter of fact, communication is not performed immediately after the execution of a data processor with write access, but only when switching from one level to the next. In this way, all MPI communication required by the data processors within one level is bundled and executed more efficiently. To clarify the situation, let us write down the details of one iteration cycle of a block-lattice which has data processors at level 0 and at level 1 and automatically executes them at the end of the function call collideAndStream:

  1. Execute the local collision, followed by a streaming step.
  2. Execute the data processors at level 0. No communication has been made so far. Therefore, the data processors at this level have only a restricted ability to perform non-local operations, because the cell data in the communication envelopes is erroneous.
  3. Execute a communication between the atomic-blocks of the block-lattice to update the envelopes. If any other, external blocks (lattice, scalar-field or tensor-field) were modified by any of the data processors at level 0, update the envelopes in these blocks as well.
  4. Execute the data processors at level 1.
  5. If the block-lattice or any other, external blocks were modified by any of the data processors at level 1, update the envelopes correspondingly.

Although this behavior may seem a bit complicated, it leads to an intuitive behavior of the program and offers a general way to control the execution of data processors. It should be specially emphasized that if a data processor B depends on data produced previously by another data processor A, you must make sure that a proper causality relation between A and B is implemented. In all cases, B must be executed after A. Additionally, if B is non-local (and therefore accesses data on the envelopes) and A is a bulk-only data processor, a communication step is required between the execution of A and B. Therefore, A and B must be defined on different processor levels.
If you execute data processors manually, you can choose to execute only the processors of a given level, by indicating the level as an optional parameter of the method executeInternalProcessors(plint level). It should also be mentioned that a processor level can have a negative value. The advantage of a negative processor level is that it is not executed automatically through the default function call executeInternalProcessors(). It can only be executed manually through the call executeInternalProcessors(plint level). It makes sense to exploit this behavior for data processors which are executed often, but not at every iteration step. Calling applyProcessingFunctional each time would be somewhat less efficient, because an overhead is incurred by the decomposition of the data processor over internal atomic-blocks.
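
To make this concrete, here is a hedged sketch of how such levels might be set up (the functional class names are placeholders for user-defined functionals of the types listed above):

// Level 0: runs right after collision and streaming; the envelopes have
// not been updated yet, so this functional should remain purely local.
integrateProcessingFunctional (
	new MyLocalFunctional2D<T,Descriptor>,
	lattice.getBoundingBox(), lattice, 0 );

// Level 1: runs after the communication step which follows level 0, and
// may therefore safely read nearest-neighbor data written at level 0.
integrateProcessingFunctional (
	new MyNonLocalFunctional2D<T,Descriptor>,
	lattice.getBoundingBox(), lattice, 1 );

// Level -1: never executed automatically; triggered manually on demand.
integrateProcessingFunctional (
	new MyDiagnosticsFunctional2D<T,Descriptor>,
	lattice.getBoundingBox(), lattice, -1 );

lattice.collideAndStream();             // executes levels 0 and 1
lattice.executeInternalProcessors(-1);  // executes level -1 only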

Commentary

This part is not recommended for anyone with little understanding of Palabos. If you write such code just to use it, without understanding the internal code architecture, essentially every data processor you write will be wrong; for program writing I still recommend the methods from the Preface part above. The background for this part requires familiarity with design patterns such as the composite, strategy, and command patterns, as well as some knowledge of multi-block partitioning and data communication in MPI. Once you can write these kinds of programs, you should be able to implement any LBM model, but the time cost is considerable; either way, I do not recommend that beginners, or anyone in a hurry to publish, attempt it.
In fact, many of the data processors in the existing example code are written exactly in the style described in the manual above. Understanding some of it helps you make better use of the data-processor operations in the existing examples, without needing to write your own.
In addition, you can execute the required data processors in one shot at the collision-and-streaming stage, using the applyProcessingFunctional family of functions. I generally use this for tricks such as forcing in an extra inlet that injects some fluid, and so on...
Each model's module also contains data-processor classes which are already written but never used in the example cases. If you want to use them, you have to dig into the source files of the model you are interested in, find them, and call them in the standard way; a lot can be done with them.
