One problem with instruction pipelining is that a "conditional branch" prevents knowing which instruction is next. For example:
    for k = 1:n
        y(k) = y(k) + a*x(k)
    end for

is turned into a sequence of instructions:
          k = 1
    (*)   load reg1 with x(k)
          load reg2 with y(k)
          reg2 = a*reg1 + reg2
          store reg2 into y(k)
          k = k + 1
          if (k < n or k = n) go to (*)

Here reg1 and reg2 designate registers. Which instruction will follow the "if" statement? That is unknown until after the test is evaluated, introducing a "bubble" into the pipeline.
One common technique to help enhance pipelining is loop unrolling. These examples will use an unrolling depth of 4, but in practice deeper unrolling is more effective. Determining an optimal unrolling depth usually requires experimentation and some knowledge of the target CPU's resources. Unrolling turns the saxpy code
    for k = 1:n
        y(k) = y(k) + a*x(k)
    end for

into two loops:
    for k = 1:mod(n,4)
        y(k) = a*x(k) + y(k)
    end for
    for k = mod(n,4)+1:4:n        (the middle 4 in the triplet means "in steps of 4")
        y(k)   = a*x(k)   + y(k)
        y(k+1) = a*x(k+1) + y(k+1)
        y(k+2) = a*x(k+2) + y(k+2)
        y(k+3) = a*x(k+3) + y(k+3)
    end for

How does this help? The four statements in the body of the second loop are independent of one another, so they can occupy different stages of the pipeline simultaneously, and the loop-control overhead (increment and test) is paid once per four updates instead of once per update.
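The same transformation can be sketched in C. This is a minimal illustration of the technique, not code from any particular library; the function names are mine.

```c
#include <assert.h>

/* Straightforward saxpy: y = y + a*x */
void saxpy(int n, float a, const float *x, float *y) {
    for (int k = 0; k < n; k++)
        y[k] += a * x[k];
}

/* Depth-4 unrolled saxpy: a cleanup loop handles the n mod 4
   leftover elements, then the main loop advances in steps of 4,
   exactly as in the two-loop pseudocode above. */
void saxpy_unrolled(int n, float a, const float *x, float *y) {
    int r = n % 4;
    for (int k = 0; k < r; k++)       /* cleanup loop */
        y[k] += a * x[k];
    for (int k = r; k < n; k += 4) {  /* main loop, 4 independent updates */
        y[k]   += a * x[k];
        y[k+1] += a * x[k+1];
        y[k+2] += a * x[k+2];
        y[k+3] += a * x[k+3];
    }
}
```

Since each element is updated by the same single operation in both versions, the two functions produce bitwise-identical results.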
As another example, consider the dotproduct of vectors x and y:
    s = 0.0
    for k = 1:n
        s = s + x(k)*y(k)
    end for

Straightforward unrolling turns the code into
    s = 0.0
    for k = 1:mod(n,4)
        s = s + x(k)*y(k)
    end for
    for k = mod(n,4)+1:4:n
        s = s + x(k  )*y(k  ) + x(k+1)*y(k+1)
              + x(k+2)*y(k+2) + x(k+3)*y(k+3)
    end for

This reduces loop overhead, but still won't execute rapidly. The problem is the dependency between iterations: an iteration cannot begin before the earlier one is completely finished, because the variable s must be written on iteration k before it can be read on iteration k+4. We can remove that dependency by using four temporary variables, and sum them up after the loop finishes:
    s  = 0.0
    s1 = 0.0
    s2 = 0.0
    s3 = 0.0
    for k = 1:mod(n,4)
        s = s + x(k)*y(k)
    end for
    for k = mod(n,4)+1:4:n
        s  = s  + x(k  )*y(k  )
        s1 = s1 + x(k+1)*y(k+1)
        s2 = s2 + x(k+2)*y(k+2)
        s3 = s3 + x(k+3)*y(k+3)
    end for
    s = s + s1 + s2 + s3

This runs faster, but there is something "wrong" with it that was not encountered with the saxpy operation. In particular, an optimizing compiler cannot make this change without risking a change in the results of your computations. What is the problem?
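A C sketch of the four-accumulator version follows; again the function name is mine and this is only an illustration of the technique.

```c
/* Dot product with four independent partial sums. The four
   accumulators break the read-after-write dependency on a single s,
   so successive products can be in flight in the pipeline at once. */
double dot_unrolled(int n, const double *x, const double *y) {
    double s = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int r = n % 4;
    for (int k = 0; k < r; k++)       /* cleanup loop */
        s += x[k] * y[k];
    for (int k = r; k < n; k += 4) {  /* main loop */
        s  += x[k]   * y[k];
        s1 += x[k+1] * y[k+1];
        s2 += x[k+2] * y[k+2];
        s3 += x[k+3] * y[k+3];
    }
    return s + s1 + s2 + s3;          /* combine partial sums */
}
```

Note that the additions now occur in a different order than in the simple loop, which is precisely the issue the question above is pointing at.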
The important question is, does pipelining really help much? The answer is "yes", but for simple cases like daxpy or dotproduct you should let a compiler handle the details of optimization using unrolling.
In the next graph the computational rate in Mflops is plotted against the length of the vector on which dotproduct and saxpy (vector update) operations are performed. The curves whose labels carry the prefix "un" come from unrolling the loops of the un-"un"ed versions.
That does provide a good boost, but consider the effects of compiler optimization on a quad-core i7 Intel machine. The next four plots show the ratio of computational rate of unrolled to standard loops for three kernels:
As soon as the optimization flag -O2 is applied, the hand-unrolled code is always slower than the straightforward loops without unrolling. Even at -O0 the speedup is at most a factor of 2.5, which matches the general rule that floating point units are pipelined to a depth of 2-3. In any case, the conclusion is: for simple vector operations, let the GNU foundation (or Intel, or the Portland Group, or IBM) do the heavy lifting and handle this optimization for you.
Unrolling does improve the performance, but why bother if compilers take care of it so well? The answer is that compilers cannot reliably recognize the opportunities to improve pipelining, especially when data dependencies intervene. Pipelining opportunities can occur anywhere, and not just in loops implementing basic vector operations. For example, a code for solving linear systems of equations with a tridiagonal coefficient matrix stores the three nonzero diagonals in arrays c( ), d( ), and e( ). During the solve, "pivoting" involves swapping entries in the vectors, which in one package is done as:
    t = c(kp1)
    c(kp1) = c(k)
    c(k) = t
    t = d(kp1)
    d(kp1) = d(k)
    d(k) = t
    t = e(kp1)
    e(kp1) = e(k)
    e(k) = t

[The index variable kp1 is k+1.] The performance problem is that the single temporary variable t is reused in three consecutive blocks, so the blocks must execute one after another. Instead, use three temporaries:
    t1 = c(k+1)
    c(k+1) = c(k)
    c(k) = t1
    t2 = d(k+1)
    d(k+1) = d(k)
    d(k) = t2
    t3 = e(k+1)
    e(k+1) = e(k)
    e(k) = t3

This change made the overall code run 30% faster. Always consider pipelining effects, both within loops and elsewhere, when developing high-performance kernels. The question is how much effort to put into that, and the analytical model for pipelining given later will help answer it.
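In C the three-temporary swap looks as follows; array and index names follow the text, but the function itself is my own sketch, not code from the package being described.

```c
/* Swap row k with row k+1 in the three diagonals of a tridiagonal
   matrix. Each swap uses its own temporary, so the three swaps carry
   no dependency on one another and can overlap in the pipeline. */
void swap_rows(double *c, double *d, double *e, int k) {
    double t1 = c[k+1]; c[k+1] = c[k]; c[k] = t1;
    double t2 = d[k+1]; d[k+1] = d[k]; d[k] = t2;
    double t3 = e[k+1]; e[k+1] = e[k]; e[k] = t3;
}
```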
    C:  x = 2^0 × 0.1011    y = 2^1 × 0.1100
    S:  x = 2^1 × 0.0101    y = 2^1 × 0.1100
    A:  z = 2^1 × 1.0001
    N:  z = 2^2 × 0.1000

"Normalizing" the result means making a "1" bit the first significant bit, as the IEEE floating point representation requires. In that case the leading 1 bit does not need to be explicitly stored. However, here the leading 1 bit has been explicitly written.
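The four stages can be checked with a short C sketch. This is a toy model of my own, not real IEEE arithmetic: a number is a pair (exp, mant) where mant is a 4-bit integer representing the fraction 0.b1b2b3b4, so the value is 2^exp × mant/16.

```c
/* Toy 4-stage floating point adder: Compare, Shift, Add, Normalize. */
typedef struct { int exp; int mant; } toyfp;   /* value = 2^exp * mant/16 */

toyfp toy_add(toyfp x, toyfp y) {
    /* C: compare exponents; S: shift the smaller operand's mantissa
       right until the exponents agree (low bits are truncated). */
    if (x.exp < y.exp) { x.mant >>= (y.exp - x.exp); x.exp = y.exp; }
    else               { y.mant >>= (x.exp - y.exp); y.exp = x.exp; }
    /* A: add the aligned mantissas. */
    toyfp z = { x.exp, x.mant + y.mant };
    /* N: renormalize so the mantissa again fits in 4 fraction bits. */
    while (z.mant >= 16) { z.mant >>= 1; z.exp += 1; }
    return z;
}
```

Running it on the worked example above, x = 2^0 × 0.1011 is (0, 11) and y = 2^1 × 0.1100 is (1, 12); the result comes out as (2, 8), i.e. 2^2 × 0.1000, matching the N line.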
For a sequence of n additions like this, the operations are
    z1 = x1 + y1
    z2 = x2 + y2
     .    .    .
    zn = xn + yn

This generates the following timing chart:
    Cycle    C            S            A            N            Out
      1      x1,y1
      2      x2,y2        x1,y1
      3      x3,y3        x2,y2        x1,y1
      4      x4,y4        x3,y3        x2,y2        x1,y1
      5      x5,y5        x4,y4        x3,y3        x2,y2        z1
      6      x6,y6        x5,y5        x4,y4        x3,y3        z2
      7      x7,y7        x6,y6        x5,y5        x4,y4        z3
      .        .            .            .            .           .
      n      xn,yn        xn-1,yn-1    xn-2,yn-2    xn-3,yn-3    zn-4
     n+1                  xn,yn        xn-1,yn-1    xn-2,yn-2    zn-3
     n+2                               xn,yn        xn-1,yn-1    zn-2
     n+3                                            xn,yn        zn-1
     n+4                                                         zn

Important features to notice:
Starting up a pipeline costs time, and that needs to be accounted for. Let
ts = sequential time = n*s*τ (where s is the number of pipeline stages and τ is the time for one cycle)
tp = piped time = setup + (time for first result) + (one cycle) × (number of remaining operations) = tsetup + s*τ + (n-1)*τ
which is again a linear function in n . What is the x-intercept of this linear function? Try graphing it, identifying all the usual linear parameters like intercepts and slope.
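The two timing formulas can be compared directly in C. The numbers here (s = 4 stages, τ = 1, a 3-cycle setup) are illustrative assumptions, not measurements.

```c
/* ts = n*s*tau: every operation takes all s stages sequentially. */
double t_seq(int n, int s, double tau) {
    return n * s * tau;
}

/* tp = setup + s*tau + (n-1)*tau: after the setup cost and the first
   result emerging from the s stages, one result completes per cycle. */
double t_pipe(int n, int s, double tau, double setup) {
    return setup + s * tau + (n - 1) * tau;
}
```

Both are linear in n, but the pipelined version has slope τ instead of s·τ, so for all but the shortest vectors it wins; for n = 1 the pipeline's setup cost makes it slower.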
rinf is the asymptotic rate of computation for an infinitely long vector. The actual rate of computation for a vector of length n is

    r(n) = rinf/(1 + ρ)

where ρ = n½/n. Taking n = n½ gives ρ = 1 and r = rinf/2: n½ is the vector length that achieves half of the asymptotic rate, explaining why it has ½ as a subscript. The model gives the rate of computation versus vector length. Since machines can vary in their peak speed, let's normalize the graph by plotting (r/rinf) versus (n/n½).
Some values have been marked: when (n/n½) = 1, then (r/rinf) = 0.5 -- which just restates the definition of n½. To get 90% of the asymptotic peak performance, the vector length must have (n/n½) = 9, or equivalently, n = 9*n½.
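These marked values follow directly from the model r(n) = rinf/(1 + n½/n), as a small C check confirms:

```c
#include <math.h>

/* Normalized rate r/rinf as a function of the ratio n/n_half,
   from r(n) = rinf/(1 + n_half/n). */
double normalized_rate(double n_over_nhalf) {
    return 1.0 / (1.0 + 1.0 / n_over_nhalf);
}
```

At n/n½ = 1 this gives 0.5, and at n/n½ = 9 it gives 0.9, reproducing the two marked points.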
[The Matlab script that created the image above is an example of how to add bells and whistles to a plot in Matlab.]
Questions: How can n½ and rinf be computed for a given machine? What problem might be encountered? Why was the start-up time modeled, but not the time it takes to drain a pipeline?
which gives a linear system Ax = b, where A is the k × 2 matrix
        | n1  1 |
        | n2  1 |
    A = | n3  1 |
        |  .  . |
        |  .  . |
        | nk  1 |

The observations can (and should) have repetitions: many timings can be made for a single vector size like n = 100 and added as rows in A and timings in b. Here b is the k × 1 vector b = (t1, t2, ..., tk)^T, and x is the 2 × 1 vector x = (1/rinf, n½/rinf)^T.
Since there are only two unknowns and usually k >> 2 observations, this is a least squares problem: find the value of x that minimizes the two-norm of Ax − b. Matlab has an easy way of doing this: x = A\b. Then rinf and n½ can be extracted from the two components of the solution x.
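For concreteness, here is a C sketch of the same fit. Since there are only two unknowns, it solves the 2×2 normal equations AᵀAx = Aᵀb directly (a simple stand-in for Matlab's A\b; the function name is mine).

```c
#include <math.h>

/* Least-squares fit of t ~ x0*n + x1 via the 2x2 normal equations,
   then rinf = 1/x0 and n_half = x1/x0, following the model
   t = (n + n_half)/rinf used in the text. */
void fit_pipe_model(int k, const double *n, const double *t,
                    double *rinf, double *nhalf) {
    double s11 = 0, s12 = 0, s22 = 0, b1 = 0, b2 = 0;
    for (int i = 0; i < k; i++) {
        s11 += n[i] * n[i];  s12 += n[i];  s22 += 1.0;
        b1  += n[i] * t[i];  b2  += t[i];
    }
    double det = s11 * s22 - s12 * s12;
    double x0 = (s22 * b1 - s12 * b2) / det;  /* = 1/rinf      */
    double x1 = (s11 * b2 - s12 * b1) / det;  /* = n_half/rinf */
    *rinf  = 1.0 / x0;
    *nhalf = x1 / x0;
}
```

On synthetic timings generated from rinf = 2 and n½ = 10 the fit recovers both parameters exactly; on real, noisy timings the repetitions mentioned above help average the noise out.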