[For complete, up-to-date TBB information visit: http://www.ThreadingBuildingBlocks.org]
Summary
Template function performs parallel iteration over a range of values.
Syntax
template<typename Range, typename Body>
void parallel_for ( const Range& range, const Body& body );
Header
#include "tbb/parallel_for.h"
Description
A parallel_for<Range,Body> represents parallel execution of Body over each value in Range. Type Range must model the Range concept ( 3.2). The body must model the requirements in Table 7.
Table 7: Requirements for parallel_for Body
Pseudo-Signature
Semantics
Body::Body( const Body& )
Copy constructor
Body::~Body()
Destructor
void Body::operator()( Range& range ) const
Apply body to range
A parallel_for recursively splits the range into subranges to the point such that is_divisible() is false for each subrange, and makes copies of the body for each of these subranges. For each such body/subrange pair, it invokes Body::operator(). The invocations are interleaved with the recursive splitting, in order to minimize space overhead and efficiently use cache.
Some of the copies of the range and body may be destroyed after parallel_for returns. This late destruction is not an issue in typical usage, but is something to be aware of when looking at execution traces or writing range or body objects with complex side effects.
When worker threads are available ( 8.2), parallel_for executes iterations is non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.
When no worker threads are available, parallel_for executes iterations from left to right in the following sense. Imagine drawing a binary tree that represents the recursive splitting. Each non-leaf node represents splitting a subrange r by invoking the splitting constructor Range(r,split()). The left child represents the updated value of r. The right child represents the newly constructed object. Each leaf in the tree represents an indivisible subrange. The method Body::operator() is invoked on each leaf subrange, from left to right.
Complexity
If the range and body take O(1) space, and the range splits into nearly equal pieces, then the space complexity is O(P log(N)), where N is the size of the range and P is the number of threads.
using namespace tbb;
template<typename Iterator>
struct ParallelMergeRange {
static size_t grainsize;
Iterator begin1, end1; // [begin1,end1) is 1st sequence to be merged
Iterator begin2, end2; // [begin2,end2) is 2nd sequence to be merged
Iterator out; // where to put merged sequence
bool empty() const {return (end1-begin1)+(end2-begin2)==0;}
bool is_divisible() const {
return std::min( end1-begin1, end2-begin2 ) > grainsize;
}
ParallelMergeRange( ParallelMergeRange& r, split ) {
if( r.end1-r.begin1 < r.end2-r.begin2 ) {
std::swap(r.begin1,r.begin2);
std::swap(r.end1,r.end2);
}
Iterator m1 = r.begin1 + (r.end1-r.begin1)/2;
Iterator m2 = std::lower_bound( r.begin2, r.end2, *m1 );
begin1 = m1;
begin2 = m2;
end1 = r.end1;
end2 = r.end2;
out = r.out + (m1-r.begin1) + (m2-r.begin2);
r.end1 = m1;
r.end2 = m2;
}
ParallelMergeRange( Iterator begin1_, Iterator end1_,
Iterator begin2_, Iterator end2_,
Iterator out_ ) :
begin1(begin1_), end1(end1_),
begin2(begin2_), end2(end2_), out(out_)
{}
};
template<typename Iterator>
size_t ParallelMergeRange<Iterator>::grainsize = 1000;
template<typename Iterator>
struct ParallelMergeBody {
void operator()( ParallelMergeRange<Iterator>& r ) const {
std::merge( r.begin1, r.end1, r.begin2, r.end2, r.out );
}
};
template<typename Iterator>
void ParallelMerge( Iterator begin1, Iterator end1, Iterator begin2, Iterator end2, Iterator out ) {
parallel_for(
ParallelMergeRange<Iterator>(begin1,end1,begin2,end2,out),
ParallelMergeBody<Iterator>()
);
}
Because the algorithm moves many locations, it tends to be bandwidth limited. Speedup varies, depending upon the system.
Summary
Template function performs parallel iteration over a range of values, with the splitting of the range guided by the Partitioner parameter.
Syntax
template<typename Range, typename Body, typename Partitioner>
void parallel_for ( const Range& range, const Body& body, const Partitioner& partitioner );
Header
#include "tbb/parallel_for.h"
Description
A parallel_for<Range,Body,Partitioner> represents parallel execution of Body over each value in Range. Type Range must model the Range concept ( 3.2). The body must model the requirements in Table 7. Type Partitioner must model the Partitioner concept ( 3.3).
Example
This example shows a simple use of the Partitioner concept with a parallel_for. The code shown below is an extension of the simple example presented in the previous subsection. An auto_partitioner is used to guide the splitting of the range.
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;
struct Average {
float* input;
float* output;
void operator()( const blocked_range<int>& range ) const {
for( int i=range.begin(); i!=range.end(); ++i )
output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
}
};
// Note: The input must be padded such that input[-1] and input[n]
// can be used to calculate the first and last output values.
void ParallelAverage( float* output, float* input, size_t n ) {
Average avg;
avg.input = input;
avg.output = output;
parallel_for( blocked_range<int>( 0, n ), avg, auto_partitioner() );
}
Two important changes should be noted: (1) the call to parallel_for takes a third argument, an auto_partitioner object, and (2) the blocked_range constructor is not provided with a grainsize parameter.
In addition to the constructors described in Sections 3.2.1 3.2.2 and , the blocked_range and blocked_range2d template classes now define additional constructors that initialize all grainsize parameters to 1. In both of these classes, the grainsize is used to designate a size above which a range is considered to be divisible.
Table 8 provides guidance for selecting between the simple_partitioner and auto_partitioner classes.
Table 8: Guidance for Selecting a Partitioner
Partitioner Type
Discussion
simple_partitioner
Recursively splits a range until it is no longer divisible. The Range::is_divisible function is wholly responsible for deciding when recursive splitting halts. When used with classes such as blocked_range and blocked_range2d, the selection of an appropriate grainsize is therefore critical to allow concurrency while limiting overheads (see the discussion in Section 3.2.1).
auto_partitioner
Guides splitting decisions based on the work stealing behavior of the task scheduler. When used with classes such as blocked_range and blocked_range2d, the selection of an appropriate grainsize is less important. Subranges that are larger than the grain size are used unless load imbalances are detected. Therefore acceptable performance may often be achieved by simply using the default grain size of 1.
TIP: Ranges larger than grain size may be passed to the body when using an auto_partitioner. The body should not therefore use the value of grain size as an upper bound on the size of the range (for allocating temporary storage for example).
[For complete, up-to-date TBB information visit: http://www.ThreadingBuildingBlocks.org]