Vectorization in Julia

Julia is a new language for technical computing that combines interactive scripting convenience with high performance. Version 0.3 was released Aug. 20, 2014, and introduces experimental support for vectorizing loops, which can significantly improve the performance of some kernels. This article explains how to use the vectorization feature effectively. It is based on material that I first presented at JuliaCon 2014.

What Is Vectorization?

"Vectorization" has two different meanings in Julia, both related to operating on chunks of data:

  1. Writing your code in terms of operations on whole arrays. For example, writing d=a+b-c where the variables all denote array objects. See the Julia Devectorize package (home of the @devec macro) for more information about this style of code.
  2. Compiler transformations that improve performance by using SIMD (Single Instruction Multiple Data) instructions that operate on chunks of data. For example, hardware with Intel® Advanced Vector Extensions (Intel® AVX) can do eight 32-bit floating-point additions at once.

The first definition concerns how you write your code. The second definition concerns code generation. This article concerns the second definition.
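
To make the distinction concrete, here is a minimal sketch of both styles (the arrays and values here are made up for illustration):

a = rand(Float32,1000)
b = rand(Float32,1000)
c = rand(Float32,1000)

# First meaning: whole-array style. The loop is implicit in the array operations.
d = a + b - c

# Second meaning: an explicit loop that the compiler may turn into SIMD
# instructions. The rest of this article is about making such loops vectorize.
d = similar(a)
for i = 1:length(a)
    d[i] = a[i] + b[i] - c[i]
end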

Julia 0.3 has vectorization capabilities that can exploit SIMD instructions when executing loops, under the right conditions. Sometimes you have to give it a little nudge. Here is an example function and its invocation on two random arrays of 1003 32-bit floating-point values:

function axpy(a,x,y)
    @simd for i=1:length(x)
        @inbounds y[i] += a*x[i]
    end
end

n = 1003
x = rand(Float32,n)
y = rand(Float32,n)
axpy(1.414f0, x, y)
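
If you want to measure the effect yourself, here is a rough timing sketch. The warm-up call matters because the first call includes JIT compilation time, and the repetition loop matters because one pass over 1003 elements is too quick to time reliably:

axpy(1.414f0, x, y)       # warm up: the first call triggers JIT compilation
@time for k = 1:10^5      # time many repetitions of the compiled code
    axpy(1.414f0, x, y)
end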

The @simd and @inbounds macros will be explained later. When the call to axpy happens, the Julia just-in-time (JIT) compiler generates code for an instance of axpy specialized to the arguments' types. For serial execution (that is, non-SIMD code), each iteration of the instance behaves as if the code were written like this:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n=length(x)
    i = 1
    @inbounds while i<=n
        t1 = x[i]
        t2 = y[i]
        t3 = a*t1
        t4 = t2+t3
        y[i] = t4
        i += 1
    end
end

Here each assignment in the loop represents one machine instruction.

The @simd macro gives the compiler license to vectorize without checking whether it will change the program's visible behavior. The vectorized code will behave as if the code were written to operate on chunks of the arrays. For example, for hardware with 4-wide execution units, Julia might generate code as if the source were written as follows, using fictional Julia operations + and * that operate on tuples:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n=length(x)
    i = 1
    # Vectorized loop - four logical iterations per physical iteration
    @inbounds while i+3<=n
        t1 = (x[i],x[i+1],x[i+2],x[i+3])   # Load tuple
        t2 = (y[i],y[i+1],y[i+2],y[i+3])   # Load tuple
        t3 = a*t1                          # Scalar times tuple
        t4 = t2+t3                         # Tuple add
        (y[i],y[i+1],y[i+2],y[i+3]) = t4   # Tuple store
        i += 4                             # Finished 4 logical iterations
    end
    # Scalar loop for remaining iterations
    @inbounds while i<=n
        t1 = x[i]
        t2 = y[i]
        t3 = a*t1
        t4 = t2+t3
        y[i] = t4
        i += 1
    end
end

As long as the contents of array y do not overlap the contents of array x, the result will be the same as the serial code. However, if there is overlap, the results might differ, because @simd gives the compiler license to reorder operations. Here is a diagram showing the original order:

[Figure: Serial execution of axpy]

@simd transposes the order of operations for chunks of iterations, like this:

[Figure: @simd execution of axpy]

The horizontal arrows are academic here, since with 4-wide execution hardware, each row of operations happens simultaneously. However, @simd gives license to "vectorize" across chunks of iterations wider than the execution hardware. For example, the order above is also valid for 2-wide execution hardware. In practice, the compiler often uses chunks that are wider than the execution hardware, so that multiple operations can be overlapped. So do not assume that the chunk size for transposition is the natural hardware size. Furthermore, @simd actually grants the compiler even more reordering latitude, as I will discuss later.

Implicit vs. Explicit Vectorization

Vectorization is a program transform. For any program transform, a compiler has to ask three questions:

  • Is the transform possible?
  • Is the transform legal?
  • Is the transform profitable?

The part of a compiler that answers these questions for vectorization is called the vectorizer. The first question relates to the capabilities of the vectorizer and hardware. Some vectorizers can handle only very simple loops. Others can deal with complicated control flow. If the vectorizer determines that it does not know how to vectorize the loop, there is no point in asking the next two questions.

The meaning of "legal" depends on the language specification. Julia currently does not have a formal specification, so in practice "legal" means that the visible program behavior is identical to that of unvectorized code. Sometimes the behavior will be slightly different in a way that the programmer did not care about, but the vectorizer must nonetheless assume vectorization is not legal unless the programmer grants explicit license.

The meaning of "profitable" can depend on context. For purposes here, "profitable" means "runs faster", though in other contexts it might be "takes less code space in memory". Vectorization often, but not always, improves performance at the cost of increasing code space.

Vectorization can be either implicit or explicit. In implicit vectorization, the compiler proves that the transposition of operations is legal. For Julia, the hard part of the proof is proving that no output array overlaps with any input or output array. If the compiler can otherwise vectorize the code, but cannot prove the absence of overlap, it may generate the vector code anyway, with a run-time check that jumps to the remainder loop if overlap is detected. Here is our example with a run-time check:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n=length(x)
    i = 1
    if !overlap(x,y)                           # "overlap" is fictional function
        # Vectorized loop - four logical iterations per physical iteration
        @inbounds while i+3<=n
            t1 = (x[i],x[i+1],x[i+2],x[i+3])   # Load tuple
           ...
            (y[i],y[i+1],y[i+2],y[i+3]) = t4   # Tuple store
            i += 4                             # Finished 4 logical iterations
        end
    end
    # Scalar loop for remaining iterations
    @inbounds while i<=n
        t1 = x[i]
        ...
        y[i] = t4
        i += 1
    end
end

In this example, the run-time check is relatively cheap since only two arrays are involved. You won't see a noticeable performance difference from adding @simd to the example, since all it would do is remove the run-time check. However, the cost of the check can grow quadratically with the number of arrays referenced by the loop body. Given M output arrays and N input arrays, the cost is O(M*(M+N)). Furthermore, a run-time check is impractical in cases that involve tricky subscripting patterns, such as:

for i=1:n
    t = w[j[i]]    # "gather"
    w[k[i]] = t    # "scatter"
end

Here, proving that transposition of chunks does not change the visible program behavior amounts to detailed inspection of j[i] and k[i] that, in the absence of special hardware support, can be slower than just executing the code serially.

In explicit vectorization, you as the programmer guarantee that vectorization is legal. In Julia, that's done by prefixing a for-loop with @simd. Be careful using it: if you use @simd on a loop that was not legal to vectorize, your results may be wrong.

What You Promise With @simd

For some technical reasons, @simd actually grants more latitude than just permission to transpose evaluations. It also tells the compiler that all iterations are independent in both of the following senses:

  • No iteration reads or writes a location that is written by another iteration.
  • No iteration waits on another iteration.

The second sense is what distinguishes an @simd loop from the multi-threaded loops found in some other languages. Currently, it's an academic point for Julia until it acquires shared-memory multi-threading capabilities. To summarize, when you use @simd, you promise that iterations do not communicate with each other.
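
As a concrete illustration of a loop that breaks this promise, consider a running sum, where each iteration reads a location written by the previous iteration (the function name here is mine, for illustration):

# NOT a candidate for @simd: iteration i reads x[i-1], which iteration i-1 wrote.
# Marking this loop with @simd could silently produce wrong results.
function runningsum!(x)
    for i = 2:length(x)
        @inbounds x[i] += x[i-1]
    end
    x
end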

Reductions

Reductions are an exception to the no-communication rule for @simd. Reduction operations will work as long as the vectorizer recognizes them as such. The rules for making them recognizable need to be formalized, but for now use +=, *=, &=, |=, or $=, or expand the op= into its equivalent syntactic form. For example, "s += expr" can be written as "s = s + expr". The reduction variable should be a local variable. Here is an example reduction:

function summation(x)
    s = zero(x[1])
    @simd for i=1:length(x)
        @inbounds s += x[i]
    end
    s
end

Integer min/max reductions also work. If the compiler fails to recognize your reduction, it will refuse to vectorize the code.

For 4-wide SIMD execution, the vectorized code acts as if it were written like this:

function summation(x::Array{Float32,1})
    n = length(x)
    t = (0f0,0f0,0f0,0f0)          # Initialize partial sums
    i=1
    @inbounds while i+3<=n
        # Four logical iterations per physical iteration
        t += (x[i],x[i+1],x[i+2],x[i+3])
        i += 4
    end
    s = (t[1]+t[2]) + (t[3]+t[4])  # Merge partial sums
    @inbounds while i<=n
        s += x[i]
        i += 1
    end
    s
end

Vectorization of reductions not only reorders loads and stores, it also reorders the reduction operations. Here is a drawing of how data flows through a serial summation:

[Figure: Serial summation]

Here is how it flows through a vectorized summation:

[Figure: SIMD summation]

Since floating-point addition is commutative, but not associative, the reordering may cause a different result. Whether the result from the reordered summation is more or less accurate than from the original serial order depends on the addends being summed. In fact, given random addends, the vector order probably gives a slightly more accurate result. Summing rand(Float32,1000), the odds are better than 6 to 1 that the four-way vector summation gives a more accurate result than the original serial code.

Of course, under some conditions, the serial sum may be more accurate. For example, if you are sorting your addends by magnitude before adding them, and there are a few addends with relatively large magnitude which cancel each other, the vectorized summation will likely give a less accurate result. But that kind of tricky case tends to be easily identifiable by noticing the sort.

The Julia compiler does implicit vectorization of reductions only if the result is the same as the serial code, as it will be for integers. Otherwise you need to use @simd. The following table summarizes the situation.

Which Reductions Vectorize in Julia 0.3
                    Integer                   Float32 and Float64
Implicit            +, *, &, |, $, min, max   none
Explicit (@simd)    +, *, &, |, $, min, max   +, *

Floating-point min/max reductions might work in some future version of Julia. The issue is that LLVM recognizes floating-point min reductions for min(x,y) defined as x<y?x:y, but Julia uses a trickier definition of min that conforms with the IEEE floating-point standard:

min{T<:FloatingPoint}(x::T, y::T) = ifelse((y < x) | (x != x), y, x)

The expression x!=x is true when x is an IEEE NaN (Not a Number) value.
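
A quick check at the REPL confirms this behavior:

julia> x = NaN32
NaN32

julia> x != x        # true only when x is NaN
true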

Speedup Surprise

The reassociation allowed by @simd sometimes improves speed by more than you would think possible. For example, I've seen @simd speed up the summation example by 12x on an Intel 4th Generation Core processor, even though the hardware has only 8-wide execution units. This is possible because the reassociation lets the compiler do the summation with conceptual 32-wide vector operations, each synthesized from four 8-wide operations. The compiler then overlaps those 8-wide operations so that the hardware can pipeline them efficiently, in a way that was not possible for the serial code.

Inspecting Whether Code Vectorizes

There is currently no feedback from the compiler on whether code vectorized. Feedback might be possible in future versions of Julia, since the most recent version of the LLVM vectorizer implements such feedback. In lieu of feedback, the best thing is to learn how to skim LLVM code to check whether vectorization is happening as you expect. You don't have to be a compiler expert, since the vectorizer leaves some footprints in the code.

To inspect the LLVM code for the earlier axpy example, invoke the macro @code_llvm like this:

julia> @code_llvm axpy(1.414f0, x, y)

where the arguments are the same as the earlier example. Since code generation pays attention only to the types, not the values of the arguments, you can also use:

julia> @code_llvm axpy(0.0f0, Float32[],Float32[])

The macro dumps the LLVM code for the top-level expression. There's also a similar function code_llvm that is invoked like this:

julia> code_llvm(axpy,(Float32,Array{Float32,1},Array{Float32,1}))

The function code_llvm takes a function as its first argument and a tuple of argument types as its second argument.

Because Julia uses a just-in-time (JIT) compiler, the LLVM output depends on your processor. Indeed, one of the benefits of a JIT is that you can get code tailored to your processor. What I get on an Intel 4th Generation Core i7 processor for @code_llvm axpy(1.414f0, x, y) is too long to include completely here. See the attached .txt file if you are interested in seeing it all. For skimming purposes, here are the relevant lines (... denotes deleted portions):

  ...
vector.ph:                                        ; preds = %if
  %broadcast.splatinsert12 = insertelement <8 x float> undef, float %0, i32 0
  %broadcast.splat13 = shufflevector <8 x float> %broadcast.splatinsert12, <8 x float> undef, <8 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  ...
  %wide.load = load <8 x float>* %24, align 4
  ...
  %wide.load9 = load <8 x float>* %26, align 4
  ...
  %wide.load10 = load <8 x float>* %28, align 4
  ...
  %wide.load11 = load <8 x float>* %30, align 4
  %31 = fmul <8 x float> %wide.load10, %broadcast.splat13
  %32 = fmul <8 x float> %wide.load11, %broadcast.splat13
  %33 = fadd <8 x float> %wide.load, %31
  %34 = fadd <8 x float> %wide.load9, %32
  store <8 x float> %33, <8 x float>* %24, align 4
  store <8 x float> %34, <8 x float>* %26, align 4
  ...
  br i1 %35, label %middle.block, label %vector.body
  ...

The footprints to look for are the labels prefixed with vector, and types of the form <n x float>. Those labels are added by the vectorizer for the vectorized loop. The operations on <n x float> values are the SIMD instructions. In this example, they correspond to AVX instructions on 8-wide vectors. Note that the vectorizer used a chunk size of 16 here, using pairs of 8-wide instructions for each conceptually 16-wide operation.

By the way, Julia has more features for inspecting code at the different levels used by the Julia compiler. For example, you can use code_native to see the native code generated by the compiler. Leah Hanson's blog is a good introduction to the various levels of "introspection".
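
For instance, to see the actual machine instructions for axpy, and to spot SIMD instructions directly, you can call code_native with the same tuple of argument types used earlier for code_llvm:

julia> code_native(axpy,(Float32,Array{Float32,1},Array{Float32,1}))

On hardware with Intel AVX, look for instructions such as vmulps and vaddps operating on ymm registers.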

Vectorization Recommendations for Julia 0.3

Earlier, I described how vectorization is a transform, and to decide whether to perform a transform, a compiler must answer three questions:

  • Is it possible?
  • Is it legal?
  • Is it profitable?

@simd provides a way to force the answer about legality. But you also have to make vectorization possible within the capabilities of the compiler, and profitable for the given hardware. So you have to learn how to cater to the limitations of the vectorizer. The following sections describe the catering rules and their rationale. A summary of those rules will be presented afterwards.

Trip Count Must Be Obvious

The trip count of a loop is the number of times the body is executed. The vectorized code needs to be able to calculate that trip count before commencing the loop, so that it knows how many chunk iterations to execute. Fortunately, Julia has concise for-loop syntax from which the trip count is obvious when the loop is over a range object such as m:n. Stick with that form and the trip count should not be an issue.

The trip count is an issue when a for-loop is applied to iterable objects other than ranges, because a regular Julia for-loop is really just a glorified while-loop that marches an iterator across the object until the iterator says "no more". For the sake of being able to compute a trip count, @simd causes a for-loop to execute differently. Given a loop of the form:

    @simd for i = r
        …
    end

execution of @simd assumes that:

  • length(r) returns the trip count.
  • first(r) returns the first index value.
  • step(r) returns the stride between successive index values.

A regular for-loop does not assume such expressions are valid. See Section 2.11 of this paper for how regular for-loops are handled.
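
These assumptions are easy to check at the REPL for a range object:

julia> r = 3:2:11
3:2:11

julia> length(r), first(r), step(r)
(5,3,2)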

The Loop Body Should Be Straight-Line Code

The current vectorizer for Julia generally requires that the loop body not contain branches or function calls. But practically all operations in Julia are function calls, further complicated by run-time dispatch (that is, run-time overload resolution). Fortunately, the Julia compiler is good at eliminating calls to short functions by inlining them, as long as it can infer the argument types at compilation time. So the key is learning how to write type-stable code, which lets the inference mechanism work well. See John Myles White's article to learn about type-stable code.
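
For example, here is a sketch of the kind of type instability that defeats inference, along with a stable rewrite (the function names are mine, for illustration):

# Type-unstable: s starts as an Int, then becomes a Float32 on the first
# iteration when x is a Float32 array, so inference cannot give s one type.
function unstable_sum(x)
    s = 0
    for i = 1:length(x)
        @inbounds s += x[i]
    end
    s
end

# Type-stable: s has the element type of x from start to finish.
function stable_sum(x)
    s = zero(eltype(x))
    for i = 1:length(x)
        @inbounds s += x[i]
    end
    s
end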

Code with constructs that might throw exceptions also contains branches, and so will not vectorize. This is why the @inbounds notation is currently necessary. It turns off subscript checking that might throw an exception. Be sure that your subscripts are in bounds when using @inbounds, otherwise you can corrupt your Julia session.

Short conditional expressions involving &&, ||, or ?: will sometimes vectorize. Here is an example using ?: that vectorized with Julia 0.3 when I tried it on an Intel® 4th Generation Core i7 processor:

function clip(x, a, b)
    @simd for i=1:length(x)
        @inbounds x[i] = x[i]<a ? a : (x[i]>b ? b : x[i])
    end
end

# Shows that code vectorizes for Float32 arrays.
@code_llvm clip(Float32[],0.0f0,0.0f0)

Vectorization of p?q:r depends on the compiler's ability to figure out whether it can legally and profitably replace it with the equivalent of Julia's ifelse(p,q,r). It's not always legally the same, because ifelse evaluates both expressions q and r regardless of the value of p. If you intend a loop body to be vectorizable, consider writing it with ifelse instead of ?:, like this:

function clip(x, a, b)
    @simd for i=1:length(x)
        @inbounds x[i] = ifelse(x[i]<a,a,ifelse(x[i]>b,b,x[i]))
    end
end
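
The difference in evaluation is easy to demonstrate at the REPL (the exact error output may vary by version):

julia> true ? 1 : error("skipped")     # ?: evaluates only the chosen branch
1

julia> ifelse(true, 1, error("boom"))  # ifelse evaluates both arguments first
ERROR: boom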

The current Julia definitions of min and max use ifelse, so this example could be written more simply using min and max, like this:

function clip(x, a, b)
    @simd for i=1:length(x)
        @inbounds x[i] = max(a,min(b,x[i]))
    end
end

Subscripts Should Be Unit Stride

The amount that an array subscript changes between iterations is called its stride. The current vectorizer targets loops that have unit-stride access patterns. In practice, that means that for a @simd loop with index i, you want array subscripts to either be:

  • loop_invariant_value
  • i
  • i + loop_invariant_value
  • i - loop_invariant_value

The vectorizer will tolerate an occasional non-unit-stride index, such as 2i, but be warned that the resulting code may be slow. For example, I tried the following code, which has an access with stride 2:

function stride2(a, b, x, y)
    @simd for i=1:length(y)
        @inbounds y[i] = a * x[2i] + b
    end
end

@code_llvm stride2(0.0f0,0.0f0,Float32[],Float32[])

On an Intel 4th Generation Core i7 processor, the LLVM output showed that the compiler synthesized the load x[2i] from a bunch of scalar loads, so clumsily that removing the @simd actually speeds up the example by about 1.4x. I'm hoping future versions of Julia will do better, but for now I recommend sticking with unit-stride subscripts when using @simd.

When working with nested loops on two-dimensional arrays, use @simd on the inner loop and make that loop index the leftmost subscript of arrays. This rule comes naturally to Fortran programmers, but rubs against habit for C/C++ programmers. Here is an example:

function updateV(irange, jrange, U, Vx, Vy, A)
    for j in jrange
        @simd for i in irange
            @inbounds begin
                Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-U[i,j])
                Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-U[i,j])
            end
        end
    end
end

# Shows that code vectorizes for Float32
R = 1:8
A = rand(Float32,9,9)
@code_llvm updateV(R,R,A,A,A,A)

Use 32-Bit Floating-Point Arithmetic

You'll notice that all the examples that I've shown use Float32 instead of Float64. As far as I can tell, the LLVM 3.3 vectorizer in Julia 0.3 on Intel hardware refuses to vectorize 64-bit floating-point math. Hopefully the situation will improve in the future.
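
You can check this on your own machine by dumping the LLVM code for a Float64 instance of axpy and skimming for the vectorizer's footprints described earlier; in my runs with Julia 0.3, the vector.body label and <n x double> types never appear:

# Same axpy as before, instantiated for Float64. Skim the output for
# "vector.body" and types such as "<4 x double>"; their absence means
# the loop did not vectorize.
@code_llvm axpy(1.414, Float64[], Float64[])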

Summary Recommendations for Effective Vectorization in Julia

For implicit or explicit vectorization:

  • No cross-iteration dependencies
  • Straight-line loop body only. Use ifelse for conditionals.
  • Use @inbounds
  • Make sure that all calls are inlined. Write type-stable code.
  • Use unit-stride subscripts
  • Use 32-bit floating-point arithmetic
  • Reduction variables should be local variables.

For implicit vectorization, the additional constraints are:

  • Access no more than about 4 arrays inside the loop.
  • Do not use floating-point reductions.

Otherwise, use explicit vectorization by marking your loop with @simd.

Future Directions

@simd is a first step in adding vectorization to Julia. There is much room for improvement, particularly if Julia is to match the performance of statically typed languages. Some possible future improvements are:

  • Report to the user why a loop failed to vectorize.
  • Vectorize 64-bit arithmetic.
  • Vectorize complex arithmetic.
  • Vectorize tuple math. This was prototyped, but took too much compilation time for Julia 0.3. With changes to LLVM it may be practical in the future.
  • Vectorize loops without @inbounds. The semantics of @simd are designed to allow doing subscript checking (and throwing an exception if necessary) before the loop really starts. Or perhaps the checks could be vectorized.
  • Vectorize loop bodies that have complicated control flow. This may become more important and practical with instruction set extensions such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) that support masking.
  • Add support for limited forms of cross-iteration dependencies that are useful in practice. See my notes for more details.

LLVM hackers looking for projects: please take note!

Conclusion

There is a joke that vectorizers in the 1970s did not work well, but they taught programmers how to write code that could be vectorized by simple vectorizers. Such is the case with the current Julia implementation: learn how to cater to the vectorizer and it can deliver. Remember too that Julia is an interactive "scripting" language. That such a language can often be at least half as fast as vectorized static languages such as C/C++/Fortran, without restricting that performance to a set of precompiled library routines, is amazing.

Acknowledgments

Jeff Bezanson pointed out ifelse as a way to avoid branches. Jim Cownie suggested corrections and improvements to an earlier draft. Elliot Saba, Patrick O'Leary, and Jacob Quinn corrected errors in the first public draft. Jacob pointed out the convenience of using @code_llvm instead of code_llvm. Thanks go to the LLVM project for a framework that enables new languages to get going quickly, and to the Julia people for starting a modern language for technical computing and making it open source.

