Vector API Developer Program for Java* Software

Introduction

Big data applications, distributed deep learning, and artificial intelligence solutions can run directly on top of existing Spark or Apache Hadoop* clusters and benefit from efficient scale-out. To provide the data parallelism these applications need, OpenJDK Project Panama offers the Vector API. Vector API Developer Program for Java* software provides a broad set of methods that enrich the machine learning and deep learning experience for Java developers.

This article introduces the Vector API to Java developers. It shows how to start using the API in Java programs, provides examples of vector algorithms, and gives step-by-step details on how to build the Vector API and Java applications that use it. It also provides a detailed tutorial on how to implement vector code for your own algorithms in Java for faster performance.

What is SIMD?

Single Instruction, Multiple Data (SIMD) allows the same operation to be performed on multiple data points simultaneously, exploiting data-level parallelism in the application. Modern CPUs provide advanced SIMD instruction sets, such as AVX2 and AVX-512 (sometimes called AVX3), that accelerate these operations.

Big data applications (for example, Apache Flink, the Apache Spark machine learning libraries, and Intel BigDL) and data analytics and deep learning training workloads run highly data-parallel algorithms. Robust SIMD support in Java opens up ways to expand some of these areas.

What is Vector API?

Vector API Developer Program for Java* software makes it possible to write compute-intensive applications and machine learning and artificial intelligence algorithms in Java without the performance overhead of the Java Native Interface (JNI) or the maintenance burden of non-portable native code. It introduces a set of methods for data-parallel operations on sized vector types, so you can program directly in Java without any knowledge of the underlying CPU. The JVM JIT compiler maps these low-level APIs to SIMD instructions on modern CPUs for the desired performance acceleration; otherwise, the default VM implementation is used to map Java bytecodes to hardware instructions.

Vector Interface

Vector API interface looks as below:

Vector Type

The vector type (Vector<E,S>) takes an element type 'E' and a shape 'S', which is the bitwise length of the vector. As of recent development, Project Panama supports creating vectors with the following element types and shapes.

 Element types:  Byte, Short, Integer, Long, Float, and Double
 Shape types (bit-size): 128, 256, and 512

Vector shapes are chosen to map closely to the largest SIMD registers available on the CPU platform.
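
As a rough illustration of how an element type and a shape combine, the number of lanes in a vector is the shape's bit size divided by the element's bit size. The sketch below is plain Java and does not use the Vector API; the class and method names (LaneCount, lanes) are hypothetical and exist only for this example.

```java
// Hypothetical helper, not part of the Vector API: computes how many
// elements (lanes) fit in a vector of a given bit size.
final class LaneCount {
    static int lanes(int shapeBits, int elementBits) {
        if (elementBits <= 0 || shapeBits % elementBits != 0) {
            throw new IllegalArgumentException("shape must be a multiple of the element size");
        }
        return shapeBits / elementBits;
    }

    public static void main(String[] args) {
        int[] shapes = {128, 256, 512};
        for (int bits : shapes) {
            System.out.println(bits + "-bit shape: "
                    + lanes(bits, Float.SIZE) + " float lanes, "
                    + lanes(bits, Double.SIZE) + " double lanes");
        }
    }
}
```

In the Vector API itself this number is what species.length() reports, which is why the loops later in this article step by species.length().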

Vector Operations

Basic vector-vector functionality is available for all of these vector types. Typical arithmetic and trigonometric operations are also available in masked form. A mask is used for conditional, if-else style operations.

The examples section shows how to use a vector mask in a program.

	public abstract class DoubleVector<S extends Vector.Shape<Vector<?,?>>> implements Vector<Double,S> {
	    Vector<Double, S> add (Vector<Double, S> v2);
	    Vector<Double, S> add (Vector<Double, S> o, Mask<Double, S> m);
	    Vector<Double, S> mul (Vector<Double, S> v2);
	    Vector<Double, S> mul (Vector<Double, S> o, Mask<Double, S> m); …
	    Vector<Double, S> sin ();
	    Vector<Double, S> sin (Mask<Double, S> m);
	    Vector<Double, S> sqrt (); …
	}

Vector API also provides more advanced vector operations, often needed in Financial Services Industry (FSI) and Machine Learning applications.

	public abstract class IntVector<S extends Vector.Shape<Vector<?,?>>> implements Vector<Integer,S> {
	    int sumAll ();
	    void intoArray (int [] a, int ix);
	    void intoArray (int [] is, int ix, Mask<Integer, S> m);
	    Vector<Integer, S> fromArray (int [] fs, int ix);
	    Vector<Integer, S> blend (Vector<Integer, S> o, Mask<Integer, S> m);
	    Vector<Integer, S> shuffle (Vector<Integer, S> o, Shuffle<Integer, S> s);
	    Vector<Integer, S> fromByte (byte f); …
	}

Performance Speed Up in Machine Learning

Basic Linear Algebra Subprograms (BLAS)

With a Vector API implementation, BLAS level I, II, and III routines can achieve a 3-4 times performance speed-up.

BLAS level I and II routines are commonly used in the Apache Spark machine learning libraries. They apply to classification and regression with linear models and decision trees, collaborative filtering and clustering, and dimensionality reduction problems. BLAS level III routines such as GEMM are widely used in solving the deep learning and neural network problems found in artificial intelligence.

*OpenJDK Project Panama source build 09182017. Java HotSpot 64-bit Server VM (mixed mode). OS version: CentOS 7.3 64-bit

Intel® Xeon® Platinum 8180 processor (using 512-byte and 1024-byte chunks of floating-point data).

JVM options: -XX:+UnlockDiagnosticVMOptions -XX:-CheckIntrinsics -XX:TypeProfileLevel=121 -XX:+UseVectorApiIntrinsics

Image Processing Filtering

Using Vector API, Sepia filtering can be done up to 6X faster.

Writing Vector Code

Using Vector API in Java*

The Vector interface is bundled as part of the jdk.incubator.vector package. To begin with the Vector API, import the following in your program. Depending on the vector type, you can choose to import FloatVector, IntVector, and so on.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.Vector;
import jdk.incubator.vector.Shapes;

Vector type (Vector<E, S>) takes two parameters.

‘E’: the element type, broadly supporting int, float and double primitive types.

‘S’ specifies shape or bitwise size of the vector.

Before using vector operations, the programmer must first create a species instance that captures both the element type and the vector shape. Vectors of that particular element type and shape can then be created from it.

	private static final FloatVector.FloatSpecies<Shapes.S256Bit> species = (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance (Float.class, Shapes.S_256_BIT);
	IntVector.IntSpecies<Shapes.S512Bit> ispec = (IntVector.IntSpecies<Shapes.S512Bit>) Vector.speciesInstance(Integer.class, Shapes.S_512_BIT);

From here on, users can create vector instances of the FloatVector<Shapes.S256Bit> and IntVector<Shapes.S512Bit> types.

Simple Vector Loops

In this section we provide a flavor of Vector API programming. Detailed tips and tricks on how to write vector algorithms are provided in the white paper Vector API: writing own-vector algorithms in Java* for performance. Sample vector code for BLAS and FSI routines can be found in the subsequent sections.

The first example shows vector addition of two arrays. The program uses vector operations such as fromArray() and intoArray() to load vectors from arrays and store them back, and the vector add() operation for the arithmetic.

	public static void AddArrays (float [] left, float [] right, float [] res, int i) {
	    FloatVector.FloatSpecies<Shapes.S256Bit> species = (FloatVector.FloatSpecies<Shapes.S256Bit>)
	            Vector.speciesInstance (Float.class, Shapes.S_256_BIT);
	    FloatVector<Shapes.S256Bit> l  = species.fromArray (left, i);
	    FloatVector<Shapes.S256Bit> r  = species.fromArray (right, i);
	    FloatVector<Shapes.S256Bit> lr = l.add(r);
	    lr.intoArray (res, i);
	}

Vector loops should be written by querying for vector size using species.length (). Consider the scalar loop below which adds arrays A and B and stores the result into array C.

	for (int i = 0; i < C.length; i++) {
	    C[i] = A[i] + B[i];
	}

The vectorized loop looks like the one below:

	public static void add (int [] C, int [] A, int [] B) {
	    IntVector.IntSpecies<Shapes.S256Bit> species =
	            (IntVector.IntSpecies<Shapes.S256Bit>) Vector.speciesInstance(Integer.class, Shapes.S_256_BIT);
	    int i;
	    for (i = 0; (i + species.length()) < C.length; i += species.length ()) {
	        IntVector<Shapes.S256Bit> av = species.fromArray (A, i);
	        IntVector<Shapes.S256Bit> bv = species.fromArray (B, i);
	        av.add(bv).intoArray(C, i);
	    }
	    for (; i < C.length; i++) { // Cleanup loop
	        C[i] = A[i] + B[i];
	    }
	}
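
To make the split between the vector loop and the cleanup loop concrete, the following plain-Java sketch models the same structure without the Vector API. The constant LANES stands in for species.length() and the inner loop stands in for one vector load, add, and store; the class name StripMinedAdd is hypothetical. This sketch uses the bound i + LANES <= length, whereas the loops in this article use a strict <; both produce the same results, and the strict form simply leaves an exactly fitting final block to the cleanup loop.

```java
// Plain-Java model of a vector loop plus cleanup loop (no Vector API used).
final class StripMinedAdd {
    // Stand-in for species.length(); 8 matches a 256-bit vector of int.
    static final int LANES = 8;

    // Adds a and b into c and returns how many elements the blocked loop handled.
    static int add(int[] c, int[] a, int[] b) {
        int i = 0;
        // Main loop: each iteration models one vector load, add, and store.
        for (; i + LANES <= c.length; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                c[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        int blocked = i;
        // Cleanup loop: remaining elements that do not fill a whole vector.
        for (; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
        return blocked;
    }

    public static void main(String[] args) {
        int n = 19;
        int[] a = new int[n];
        int[] b = new int[n];
        int[] c = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        int blocked = add(c, a, b);
        System.out.println("blocked elements: " + blocked + ", cleanup elements: " + (n - blocked));
    }
}
```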

One can also write this program in a length-agnostic manner, independent of the vector size. The following program parameterizes the vector code by the Shape instead of specifying it explicitly.

	public class AddClass<S extends Vector.Shape<Vector<?, ?>>> {
	    private final FloatVector.FloatSpecies<S> spec;
	    AddClass (FloatVector.FloatSpecies<S> v) { spec = v; }
	    // vector routine for add
	    void add (float [] A, float [] B, float [] C) {
	        int i = 0;
	        for (; i + spec.length () < C.length; i += spec.length ()) {
	            FloatVector<S> av = spec.fromArray (A, i);
	            FloatVector<S> bv = spec.fromArray (B, i);
	            av.add (bv).intoArray(C, i);
	        }
	        // clean up loop
	        for (; i < C.length; i++) C[i] = A[i] + B[i];
	    }
	}

Operations in a conditional statement can be written in vector form using a mask. We first obtain a mask for the condition.

The scalar routine is shown below:

	for (int i = 0; i < SIZE; i++) {
	    float res = b[i];
	    if (a[i] > 1.0) {
	        res = res * a[i];
	    }
	    c[i] = res;
	}

The vector routine with a mask:

	public void useMask (float [] a, float [] b, float [] c, int SIZE) {
	    FloatVector.FloatSpecies<Shapes.S256Bit> species = (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance (Float.class, Shapes.S_256_BIT);
	    FloatVector<Shapes.S256Bit> tv = species.broadcast (1.0f);
	    int i = 0;
	    for (; i + species.length() < SIZE; i += species.length()) {
	        FloatVector<Shapes.S256Bit> rv = species.fromArray (b, i);
	        FloatVector<Shapes.S256Bit> av = species.fromArray (a, i);
	        Vector.Mask<Float,Shapes.S256Bit> mask = av.greaterThan (tv); // lanes where a[i] > 1.0
	        rv.mul (av, mask).intoArray(c, i);
	    }
	    // tail processing: scalar form of the same conditional
	    for (; i < SIZE; i++) {
	        c[i] = (a[i] > 1.0f) ? b[i] * a[i] : b[i];
	    }
	}
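
The masked routine can be modeled lane by lane in plain Java, which may help when checking a vectorized version against the scalar loop. The sketch below does not use the Vector API: a boolean array plays the role of the mask, and LANES stands in for species.length(). It assumes that lanes whose mask bit is unset keep the value loaded from b, which is what the scalar routine requires; the class name MaskedMulModel is hypothetical.

```java
// Plain-Java model of a masked multiply (no Vector API used).
final class MaskedMulModel {
    // Stand-in for species.length(); 8 matches a 256-bit vector of float.
    static final int LANES = 8;

    static void useMask(float[] a, float[] b, float[] c, int size) {
        boolean[] mask = new boolean[LANES];
        int i = 0;
        for (; i + LANES <= size; i += LANES) {
            // Models av.greaterThan(tv): one comparison result per lane.
            for (int lane = 0; lane < LANES; lane++) {
                mask[lane] = a[i + lane] > 1.0f;
            }
            // Models the masked multiply: set lanes get b * a,
            // unset lanes are assumed to keep the value of b.
            for (int lane = 0; lane < LANES; lane++) {
                c[i + lane] = mask[lane] ? b[i + lane] * a[i + lane] : b[i + lane];
            }
        }
        // Tail: scalar form of the same conditional.
        for (; i < size; i++) {
            c[i] = (a[i] > 1.0f) ? b[i] * a[i] : b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {0.5f, 2f, 1f, 3f, 0f, 4f, 1.5f, 2f, 5f, 0.25f};
        float[] b = new float[a.length];
        java.util.Arrays.fill(b, 2f);
        float[] c = new float[a.length];
        useMask(a, b, c, a.length);
        System.out.println(java.util.Arrays.toString(c));
    }
}
```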

Tutorial: writing own-vector algorithms

The white paper Vector API: writing own-vector algorithms in Oracle Java for faster performance provides several tips and tricks on writing Java code using the Vector API, and also goes over some ways to increase performance.

These examples should give you some guidelines and best practices for vector programming in Oracle Java*, to help you to write successful vector versions of your own compute-intensive algorithms.

See the attached PDF for more information.

Tutorial: all you need to know about Vector API

Getting Started

Building Vector API

This section assumes that users are familiar with basic Linux utilities.

Set up a JDK8 binary as JAVA_HOME

Project Panama requires JDK8 on the system. JDK can be downloaded from this location.

# export JAVA_HOME=/pathto/jdk1.8-u91
# export PATH=$JAVA_HOME/bin:$PATH

Download and build Panama Sources

One can download the Project Panama sources using the Mercurial source control management tool.

# hg clone http://hg.openjdk.java.net/panama/panama/
# source get_source.sh
# ./configure
# make all

Building your own application using Panama JDK

We need to copy the vector.jar file to the location of the Java application, working from the parent directory that contains the Panama sources.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class HelloVectorApi {
    public static void main(String[] args) {
        IntVector.IntSpecies<Shapes.S128Bit> species =
                (IntVector.IntSpecies<Shapes.S128Bit>) Vector.speciesInstance(
                        Integer.class, Shapes.S_128_BIT);
        int val = 1;
        IntVector<Shapes.S128Bit> hello = species.broadcast(val);
        if (hello.sumAll() == val * species.length()) {
            System.out.println("Hello Vector API!");
        }
    }
}
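
For readers who want to see what HelloVectorApi checks before a Panama build is available, here is a plain-Java model of the two operations it uses. It does not use the Vector API: an int array stands in for the vector, and the names HelloLanesModel, broadcast, and sumAll are hypothetical stand-ins that only mirror the behavior described above (every lane holds the same value, and sumAll adds up all lanes).

```java
import java.util.Arrays;

// Plain-Java model of HelloVectorApi (no Vector API used).
final class HelloLanesModel {
    // Models species.broadcast(val): every lane holds the same value.
    static int[] broadcast(int lanes, int val) {
        int[] v = new int[lanes];
        Arrays.fill(v, val);
        return v;
    }

    // Models sumAll(): a horizontal sum over all lanes.
    static int sumAll(int[] v) {
        int sum = 0;
        for (int x : v) {
            sum += x;
        }
        return sum;
    }

    public static void main(String[] args) {
        int lanes = 128 / Integer.SIZE; // int lanes in a 128-bit vector
        int val = 1;
        if (sumAll(broadcast(lanes, val)) == val * lanes) {
            System.out.println("Hello Vector API!");
        }
    }
}
```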

Run your application

Currently, these are the experimental flags used by Project Panama.

/pathto/panama/build/linux-x86_64-normal-server-release/images/jdk/bin/java --add-modules=jdk.incubator.vector -XX:TypeProfileLevel=121 HelloVectorApi

IDE Configurations

Configuring IntelliJ for development for OpenJDK Panama

1) Create a new project. If this is a fresh installation of IntelliJ or you have no projects open, press "Create New Project" in the window that comes up.

Otherwise, File > New > Project... will have the same effect.

2) In the "New Project" window that comes up, make sure to select Java on left side. You will also have to select the Panama build as your Project SDK.

If Panama build has not been set up as a Project SDK, press the "New..." button on right side. Otherwise, go to step 4.

3) The window that comes up will be named "Select Home Directory for JDK". The path you want to select is /path/to/panama/build/linux-x86_64-normal-server-release/images/jdk. Press OK.

4) Press Next. At this point you can select Create project from template. Go ahead and select "Command Line App" and click Next again.

5) Give your project a name and location and click Finish.

6) Once project is created, a few more steps are needed to use Vector API successfully. Go to File > Project Structure ...

7) Make sure that "Project" is selected in left pane. Change "Project language level:" to "9 - Modules, private methods in interfaces etc.". Finally, press OK.

8) In the left pane that shows directory structure, right click on "src" folder. Navigate to New > module-info.java

9) Inside this file, add following line "requires jdk.incubator.vector;". Save the file.

10) Go back to Main.java. Add your desired code that uses API. For an example, please see HelloVectorApi.java.

11) Before running application, you will need to edit the run configuration. Press on button with class name next to "play" button. You should see "Edit Configurations...". Click on that.

12) In the VM options, add "-XX:TypeProfileLevel=121 -XX:+UseVectorApiIntrinsics". Both of these are likely to become optional in the future. If you want to turn off the optimization that converts Vector API calls to optimized x86 intrinsics (for example, for stability reasons), use "-XX:-UseVectorApiIntrinsics" instead.

13) Press the "play" button to build and run the application. At the bottom of the screen, in the terminal window, you should see "Hello Vector API!" or whatever output you made the application print.

Vector Examples

BLAS Machine Learning

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

BLAS-I

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.Vector;
import jdk.incubator.vector.Shapes;
import java.lang.Math;

public class BLAS  {



    static void VecDaxpy(double[] a, int a_offset, double[] b, int b_offset, double alpha) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        DoubleVector<Shapes.S512Bit> alphaVec = spec.broadcast(alpha);
        int i = 0;
        for (; (i + a_offset+ spec.length()) < a.length && (i + b_offset + spec.length()) < b.length; i += spec.length()) {
            DoubleVector<Shapes.S512Bit> bv = spec.fromArray(b, i + b_offset);
            DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, i + a_offset);
            bv.add(av.mul(alphaVec)).intoArray(b, i + b_offset);
        }

        for (; i+a_offset < a.length && i+b_offset<b.length; i++) b[i + b_offset] += alpha * a[i + a_offset]; //tail
    }

    static void VecDaxpyFloat(float[] a, int a_offset, float[] b, int b_offset, float alpha) {
        FloatVector.FloatSpecies<Shapes.S256Bit> spec= (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance(Float.class, Shapes.S_256_BIT);
        FloatVector<Shapes.S256Bit> alphaVec = spec.broadcast(alpha); // loop-invariant, so broadcast once
        int i = 0;
        for (; (i + a_offset+spec.length()) < a.length && (i+b_offset+spec.length())<b.length; i += spec.length()) {
            FloatVector<Shapes.S256Bit> bv = spec.fromArray(b, i + b_offset);
            FloatVector<Shapes.S256Bit> av = spec.fromArray(a, i + a_offset);
            bv.add(av.mul(alphaVec)).intoArray(b, i + b_offset);
        }

        for (; i+a_offset < a.length && i+b_offset<b.length; i++) b[i + b_offset] += alpha * a[i + a_offset]; //tail
    }


    static double VecDdot(double[] a, int a_offset, double[] b, int b_offset) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);

        int i = 0; double sum = 0;
        for (; (i + a_offset + spec.length()) < a.length && (i + b_offset+ spec.length()) < b.length; i += spec.length()) {
            DoubleVector<Shapes.S512Bit> l = spec.fromArray(a, i + a_offset);
            DoubleVector<Shapes.S512Bit> r = spec.fromArray(b, i + b_offset);
            sum+=l.mul(r).sumAll();
        }
        for (; (i + a_offset < a.length) && (i + b_offset < b.length); i++) sum += a[i+a_offset] * b[i+b_offset]; //tail
        return sum; // the dot product of the overlapping ranges
    }

    static float VecDdotFloat(float[] a, int a_offset, float[] b, int b_offset) {
        FloatVector.FloatSpecies<Shapes.S256Bit> spec= (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance(Float.class, Shapes.S_256_BIT);

        int i = 0; float sum = 0;
        for (; i+a_offset + spec.length() < a.length && i+b_offset+spec.length()<b.length; i += spec.length()) {
            FloatVector<Shapes.S256Bit> l = spec.fromArray(a, i + a_offset);
            FloatVector<Shapes.S256Bit> r = spec.fromArray(b, i + b_offset);
            sum+=l.mul(r).sumAll();
        }
        for (; i+a_offset < a.length && i+b_offset<b.length; i++) sum += a[i+a_offset] * b[i+b_offset]; //tail
        return sum; // the dot product of the overlapping ranges
    }
}
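
When experimenting with vector routines like these, it can help to keep a scalar reference next to them and compare results. The following plain-Java sketch gives such references for the DAXPY and DDOT operations, using the same offset convention as the routines above. The class and method names (BlasLevel1Reference, daxpy, ddot) are hypothetical and are not part of the Vector API or of any BLAS library.

```java
// Scalar reference versions of DAXPY and DDOT for checking vector code.
final class BlasLevel1Reference {
    // b[bOffset + i] += alpha * a[aOffset + i] for every index valid in both arrays.
    static void daxpy(double[] a, int aOffset, double[] b, int bOffset, double alpha) {
        for (int i = 0; i + aOffset < a.length && i + bOffset < b.length; i++) {
            b[i + bOffset] += alpha * a[i + aOffset];
        }
    }

    // Sum of a[aOffset + i] * b[bOffset + i] over every index valid in both arrays.
    static double ddot(double[] a, int aOffset, double[] b, int bOffset) {
        double sum = 0.0;
        for (int i = 0; i + aOffset < a.length && i + bOffset < b.length; i++) {
            sum += a[i + aOffset] * b[i + bOffset];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {4.0, 5.0, 6.0};
        System.out.println("ddot  = " + ddot(a, 0, b, 0));
        daxpy(a, 0, b, 0, 2.0);
        System.out.println("daxpy = " + java.util.Arrays.toString(b));
    }
}
```

Because a vector dot product adds the products in a different order than the scalar loop, floating-point results may differ slightly, so comparing with a small tolerance is usually safer than testing for exact equality.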

BLAS-II (DSPR)

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Vector;
import jdk.incubator.vector.Shapes;

public class BLAS_II {

    public static void VecDspr(String uplo, int n, double alpha, double[] x, int _x_offset, int incx, double[] ap, int _ap_offset) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);

        double temp = 0.0;
        int i = 0;
        int ix = 0;
        int j = 0;
        int jx = 0;
        int k = 0;
        int kk = 0;
        int kx = 0;
        kk = 0; // 0-based start of the first column in the packed array
        if (uplo.equals("U")) {
            // *        Form  A  when upper triangle is stored in AP.
            if (incx == 1) {
                for (j=0; j<n; j++) {
                    if (x[j+_x_offset] != 0.0) {
                        temp = alpha*x[j+_x_offset];
                        DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                        for (i=0, k=kk; i+spec.length()<=j && i + _x_offset + spec.length() < x.length && k + _ap_offset + spec.length() < ap.length; i+= spec.length(), k+=spec.length()) {
                            DoubleVector<Shapes.S512Bit> av = spec.fromArray(ap, k+_ap_offset);
                            DoubleVector<Shapes.S512Bit> xv = spec.fromArray(x, i+_x_offset);
                            av.add(xv.mul(tv)).intoArray(ap,k+_ap_offset);
                        }
                        for (; i<=j && i + _x_offset < x.length && k + _ap_offset <ap.length; i++, k++) {
                            ap[k+_ap_offset]=ap[k+_ap_offset]+x[i+_x_offset]*temp;
                        }
                    }
                    kk = kk + j + 1; // packed upper column j holds j + 1 elements
                }
                }
            }
        } else {
            // *        Form  A  when lower triangle is stored in AP.
            if (incx == 1) {
                for (j=0; j<n; j++) {
                    if (x[j+_x_offset] != 0.0) {
                        temp = alpha*x[j+_x_offset];
                        DoubleVector<Shapes.S512Bit> tv=spec.broadcast(temp);
                        k = kk;
                        for (i=j; i+spec.length()<n && i + _x_offset + spec.length() < x.length && k + _ap_offset + spec.length() < ap.length; i+=spec.length(), k+=spec.length()) {
                            DoubleVector<Shapes.S512Bit> av = spec.fromArray(ap, k+_ap_offset);
                            DoubleVector<Shapes.S512Bit> xv = spec.fromArray(x, i+_x_offset);
                            av.add(xv.mul(tv)).intoArray(ap,k+_ap_offset);
                        }
                        for (; i<n && i + _x_offset < x.length && k + _ap_offset <ap.length; i++, k++) {
                            ap[k+_ap_offset] = ap[k+_ap_offset]+x[i+_x_offset]*temp;
                        }
                    }
                    kk = kk+n-j;
                }
            }
        }
    }

}
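
The index arithmetic behind packed storage is easy to get wrong, so a small scalar reference can be useful when checking a vectorized DSPR. The sketch below assumes the standard BLAS packed layout for the upper triangle (columns stored one after another, column j holding rows 0 through j) with 0-based indices and no offsets. The class and method names (PackedUpperReference, packedUpperIndex, dsprUpper) are hypothetical.

```java
// Scalar reference for the packed upper-triangle rank-1 update (no Vector API used).
final class PackedUpperReference {
    // Index of element (i, j), with i <= j, in 0-based column-major packed upper storage.
    static int packedUpperIndex(int i, int j) {
        return i + j * (j + 1) / 2;
    }

    // A := alpha * x * x^T + A, updating only the packed upper triangle.
    static void dsprUpper(int n, double alpha, double[] x, double[] ap) {
        for (int j = 0; j < n; j++) {
            double temp = alpha * x[j];
            for (int i = 0; i <= j; i++) {
                ap[packedUpperIndex(i, j)] += x[i] * temp;
            }
        }
    }

    public static void main(String[] args) {
        int n = 3;
        double[] x = {1.0, 2.0, 3.0};
        double[] ap = new double[n * (n + 1) / 2];
        dsprUpper(n, 1.0, x, ap);
        System.out.println(java.util.Arrays.toString(ap));
    }
}
```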

BLAS-II (DSYR)

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class BLAS2DSYR {


    public static void VecDsyr(String uplo, int n, double alpha, double[] x, int _x_offset, int incx, double[] a, int _a_offset, int lda) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        double temp = 0.0;
        int i = 0;
        int ix = 0;
        int j = 0;
        int jx = 0;
        int kx = 0;

        if (uplo.equals("U") && incx == 1) {
            for (j=0; j<n; j++) {
                if (x[j+_x_offset] != 0.0) {
                    temp=alpha*x[j+_x_offset];
                    DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                    for (i=0; (i+spec.length())<=j && i+_x_offset+spec.length()<x.length && i+j*lda+_a_offset+spec.length()<a.length; i+= spec.length()) {
                        DoubleVector<Shapes.S512Bit> xv = spec.fromArray(x, i+_x_offset);
                        DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, i+j*lda+_a_offset);
                        av.add(xv.mul(tv)).intoArray(a,i+j*lda+_a_offset);
                    }
                    for (; i<=j && i+j*lda+_a_offset<a.length && i+_x_offset<x.length; i++) {
                        a[i+j*lda+_a_offset] = a[i+j*lda+_a_offset]+x[i+_x_offset]*temp;
                    }
                }
            }

        } else if (uplo.equals("L") && incx == 1) {
            for (j = 0; j < n; j++) {
                if (x[j+_x_offset] != 0.0) {
                    temp=alpha*x[j+_x_offset];
                    DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                    for (i=j; (i+spec.length())<n && i+_x_offset+spec.length()<x.length && i+j*lda+_a_offset+spec.length()<a.length; i+=spec.length()) {
                        DoubleVector<Shapes.S512Bit> xv = spec.fromArray(x,i+_x_offset);
                        DoubleVector<Shapes.S512Bit> av = spec.fromArray(a,i+j*lda+_a_offset);
                        av.add(xv.mul(tv)).intoArray(a,i+j*lda+_a_offset);
                    }
                    for (; i<n && i+j*lda+_a_offset<a.length && i+_x_offset<x.length; i++) {
                        a[i+j*lda+_a_offset]=a[i+j*lda+_a_offset]+x[i+_x_offset]*temp;
                    }
                }
            }

        }

    }
}

BLAS-III (DSYR2K)

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class BLAS3DSYR2K {


    public void VecDsyr2k(String uplo, String trans, int n, int k, double alpha, double[] a, int _a_offset, int lda, double[] b, int _b_offset, int ldb, double beta, double[] c, int _c_offset, int Ldc) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        double temp1 = 0.0;
        double temp2 = 0.0;
        int i = 0;
        int info = 0;
        int j = 0;
        int l = 0;
        int nrowa = 0;
        boolean upper = false;
        if (trans.equals("N")) {
            nrowa = n;
        } else {
            nrowa = k;
        }              //  Close else.
        DoubleVector<Shapes.S512Bit> zeroVec = spec.broadcast(0.0D);
        DoubleVector<Shapes.S512Bit> betaVec = spec.broadcast(beta);
        upper = uplo.equals("U");
        if (alpha == 0.0) {
            if (upper) {
                if (beta == 0.0) {
                    for (j = 0; j < n; j++) {
                        i = 0;
                        for (; (i + spec.length()) < j; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < j; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0;
                        }
                    }
                } else {
                    for (j = 0; j < n; j++) {
                        i = 0;
                        for (; (i + spec.length()) < j; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < j; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                }
            }

            //lower
            else {
                if (beta == 0.0) {
                    for (j = 0; j < n; j++) {
                        i = j;
                        for (; i + spec.length() < n; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0;
                        }
                    }
                } else {
                    for (j = 0; j < n; j++) {
                        i = j;
                        for (; i + spec.length() < n; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) { // scalar tail for the same column j
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                }
            }
            return; // alpha == 0: C only needs the scaling by beta done above
        }
        //start operations
        if (trans.equals("N")) {
            // *        Form  C := alpha*A*B**T + alpha*B*A**T + C.
            if (upper) {
                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        i = 0;
                        for (; i + spec.length() < j; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < j; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0;
                        }

                    } else if (beta != 1.0) {
                        i = 0;
                        for (; i + spec.length() < j; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < j; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }

                    for (l = 0; l < k; l++) {
                        if ((a[j + l * lda + _a_offset] != 0.0) || (b[j + l * ldb + _b_offset] != 0.0)) {
                            temp1 = alpha * b[j + l * ldb + _b_offset]; DoubleVector<Shapes.S512Bit> tv1 = spec.broadcast(temp1);
                            temp2 = alpha * a[j + l * lda + _a_offset]; DoubleVector<Shapes.S512Bit> tv2 = spec.broadcast(temp2);
                            i = 0;
                            for (; (i + spec.length()) < j; i += spec.length()) {
                                DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                                DoubleVector<Shapes.S512Bit> bV = spec.fromArray(b, i + l * ldb + _b_offset);
                                DoubleVector<Shapes.S512Bit> aV = spec.fromArray(a, i + l * lda + _a_offset);
                                cV.add(aV.mul(tv1)).add(bV.mul(tv2)).intoArray(c, i + j * Ldc + _c_offset);
                            }
                            for (; i < j; i++) {
                                c[i + j * Ldc + _c_offset] = c[i + j * Ldc + _c_offset] + a[i + l * lda + _a_offset] * temp1 + b[i + l * ldb + _b_offset] * temp2;
                            }
                        }
                    }
                }
            } else {

                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        i = j;
                        for (; (i + spec.length()) < n; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0;
                        }
                    } else if (beta != 1.0) {
                        i = j;
                        for (; (i + spec.length()) < n; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                    for (l = 0; l < k; l++) {
                        if ((a[j + l * lda + _a_offset] != 0.0) || (b[j + l * ldb + _b_offset] != 0.0)) {
                            temp1 = alpha * b[j + l * ldb + _b_offset]; DoubleVector<Shapes.S512Bit> tv1 = spec.broadcast(temp1);
                            temp2 = alpha * a[j + l * lda + _a_offset]; DoubleVector<Shapes.S512Bit> tv2 = spec.broadcast(temp2);
                            i = j;
                            for (; i + spec.length() < n; i += spec.length()) {
                                DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                                DoubleVector<Shapes.S512Bit> aV = spec.fromArray(a, i + l * lda + _a_offset);
                                DoubleVector<Shapes.S512Bit> bV = spec.fromArray(b, i + l * ldb + _b_offset);
                                cV.add(aV.mul(tv1)).add(bV.mul(tv2)).intoArray(c, i + j * Ldc + _c_offset);
                            }
                            for (; i < n; i++) {
                                c[i + j * Ldc + _c_offset] = c[i + j * Ldc + _c_offset] + a[i + l * lda + _a_offset] * temp1 + b[i + l * ldb + _b_offset] * temp2;
                            }
                        }
                    }
                }
            }
        } else {

// *        Form  C := alpha*A**T*B + alpha*B**T*A + C.
            if (upper) {
                for (j = 0; j < n; j++) {
                    for (i = 0; i <= j; i++) { // the upper triangle includes the diagonal element C(j,j)
                        temp1 = 0.0;
                        temp2 = 0.0;
                        l = 0;
                        for (; l + spec.length() < k; l += spec.length()) {
                            DoubleVector<Shapes.S512Bit> aV1 = spec.fromArray(a, l + i * lda + _a_offset);
                            DoubleVector<Shapes.S512Bit> bV1 = spec.fromArray(b, l + j * ldb + _b_offset);
                            DoubleVector<Shapes.S512Bit> aV2 = spec.fromArray(a, l + j * lda + _a_offset);
                            DoubleVector<Shapes.S512Bit> bV2 = spec.fromArray(b, l + i * ldb + _b_offset);
                            temp1 += aV1.mul(bV1).sumAll();
                            temp2 += aV2.mul(bV2).sumAll();
                        }
                        for (; l < k; l++) {
                            temp1 = temp1 + a[l + i * lda + _a_offset] * b[l + j * ldb + _b_offset];
                            temp2 = temp2 + b[l + i * ldb + _b_offset] * a[l + j * lda + _a_offset];
                        }
                        if (beta == 0.0) {
                            c[i + j * Ldc + _c_offset] = alpha * temp1 + alpha * temp2;
                        } else {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset] + alpha * temp1 + alpha * temp2;
                        }
                    }
                }
            } else {
                for (j = 0; j < n; j++) {
                    for (i = j; i < n; i++) {
                        temp1 = 0.0;
                        temp2 = 0.0;
                        l = 0;
                        for (; l+spec.length() < k; l+=spec.length()) {
                            DoubleVector<Shapes.S512Bit> aV1=spec.fromArray(a,l + i * lda + _a_offset);
                            DoubleVector<Shapes.S512Bit> bV1=spec.fromArray(b,l + j * ldb + _b_offset);
                            DoubleVector<Shapes.S512Bit> bV2=spec.fromArray(b,l + i * ldb + _b_offset);
                            DoubleVector<Shapes.S512Bit> aV2=spec.fromArray(a,l + j * lda + _a_offset);
                            temp1+=aV1.mul(bV1).sumAll();
                            temp2+=aV2.mul(bV2).sumAll();
                        }
                        for (; l < k; l++) {
                            temp1 = temp1 + a[l + i * lda + _a_offset] * b[l + j * ldb + _b_offset];
                            temp2 = temp2 + b[l + i * ldb + _b_offset] * a[l + j * lda + _a_offset];
                        }
                        if (beta == 0.0) {
                            c[i + j * Ldc + _c_offset] = alpha * temp1 + alpha * temp2;
                        } else {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset] + alpha * temp1 + alpha * temp2;
                        }
                    }
                }
            }
        }

    }

    static public void VecDsyr2kFloat(String uplo, String trans, int n, int k, float alpha, float[] a, int _a_offset, int lda, float[] b, int _b_offset, int ldb, float beta, float[] c, int _c_offset, int Ldc) {
        FloatVector.FloatSpecies<Shapes.S256Bit> spec= (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance(Float.class, Shapes.S_256_BIT);
        float temp1 = 0.0f;
        float temp2 = 0.0f;
        int i = 0;
        int info = 0;
        int j = 0;
        int l = 0;
        int nrowa = 0;
        boolean upper = false;
        if (trans.equals("N")) {
            nrowa = n;
        } else {
            nrowa = k;
        }              //  Close else.
        //FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
       // FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
        upper = uplo.equals("U");
        if (alpha == 0.0) {
            if (upper) {
                if (beta == 0.0) {
                    for (j = 0; j < n; j++) {
                        i = 0;
                        for (; (i + spec.length()) < j; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i <= j; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0f;
                        }
                    }
                } else {
                    for (j = 0; j < n; j++) {
                        i = 0;
                        for (; (i + spec.length()) <= j; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i <= j; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                }
            }

            //lower
            else {
                if (beta == 0.0) {
                    for (j = 0; j < n; j++) {
                        i = j;
                        for (; i + spec.length() < n; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0f;
                        }
                    }
                } else {
                    for (j = 0; j < n; j++) {
                        i = j;
                        for (; i + spec.length() < n; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        // Scalar tail for the remaining rows of column j; it must run inside the j loop.
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                }
            }
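            // With alpha == 0 only the scaling of C is required; return so beta is not applied a second time below.
            return;
        }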
        //start operations
        if (trans.equals("N")) {
            // *        Form  C := alpha*A*B**T + alpha*B*A**T + C.
            if (upper) {
                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        i = 0;
                        for (; i + spec.length() <= j; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i <= j; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0f;
                        }

                    } else if (beta != 1.0) {
                        i = 0;
                        for (; i + spec.length() <= j; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i <= j; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }

                    for (l = 0; l < k; l++) {
                        if ((a[j + l * lda + _a_offset] != 0.0) || (b[j + l * ldb + _b_offset] != 0.0)) {
                            temp1 = alpha * b[j + l * ldb + _b_offset];
                            temp2 = alpha * a[j + l * lda + _a_offset];
                            i = 0;
                            for (; (i + spec.length()) <= j; i += spec.length()) {
                                FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                                FloatVector<Shapes.S256Bit> bV = spec.fromArray(b, i + l * ldb + _b_offset);
                                FloatVector<Shapes.S256Bit> aV = spec.fromArray(a, i + l * lda + _a_offset);
                                FloatVector<Shapes.S256Bit> tv1 = spec.broadcast(temp1);
                                FloatVector<Shapes.S256Bit> tv2 = spec.broadcast(temp2);
                                cV.add(aV.mul(tv1)).add(bV.mul(tv2)).intoArray(c, i + j * Ldc + _c_offset);
                            }
                            for (; i <= j; i++) {
                                c[i + j * Ldc + _c_offset] = c[i + j * Ldc + _c_offset] + a[i + l * lda + _a_offset] * temp1 + b[i + l * ldb + _b_offset] * temp2;
                            }
                        }
                    }
                }
            } else {

                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        i = j;
                        for (; (i + spec.length()) < n; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0f;
                        }
                    } else if (beta != 1.0) {
                        i = j;
                        for (; (i + spec.length()) < n; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                    for (l = 0; l < k; l++) {
                        if ((a[j + l * lda + _a_offset] != 0.0) || (b[j + l * ldb + _b_offset] != 0.0)) {
                            temp1 = alpha * b[j + l * ldb + _b_offset];
                            temp2 = alpha * a[j + l * lda + _a_offset];
                            i = j;
                            for (; i + spec.length() < n; i += spec.length()) {
                                FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                                FloatVector<Shapes.S256Bit> aV = spec.fromArray(a, i + l * lda + _a_offset);
                                FloatVector<Shapes.S256Bit> bV = spec.fromArray(b, i + l * ldb + _b_offset);
                                FloatVector<Shapes.S256Bit> tv1 = spec.broadcast(temp1);
                                FloatVector<Shapes.S256Bit> tv2 = spec.broadcast(temp2);
                                cV.add(aV.mul(tv1)).add(bV.mul(tv2)).intoArray(c, i + j * Ldc + _c_offset);
                            }
                            for (; i < n; i++) {
                                c[i + j * Ldc + _c_offset] = c[i + j * Ldc + _c_offset] + a[i + l * lda + _a_offset] * temp1 + b[i + l * ldb + _b_offset] * temp2;
                            }
                        }
                    }
                }
            }
        } else {

// *        Form  C := alpha*A**T*B + alpha*B**T*A + C.
            if (upper) {
                for (j = 0; j < n; j++) {
                    for (i = 0; i <= j; i++) { // the upper triangle includes the diagonal element C(j,j)
                        temp1 = 0.0f;
                        temp2 = 0.0f;
                        l = 0;
                        for (; l + spec.length() < k; l += spec.length()) {
                            FloatVector<Shapes.S256Bit> aV1 = spec.fromArray(a, l + i * lda + _a_offset);
                            FloatVector<Shapes.S256Bit> bV1 = spec.fromArray(b, l + j * ldb + _b_offset);
                            FloatVector<Shapes.S256Bit> aV2 = spec.fromArray(a, l + j * lda + _a_offset);
                            FloatVector<Shapes.S256Bit> bV2 = spec.fromArray(b, l + i * ldb + _b_offset);
                            temp1 += aV1.mul(bV1).sumAll();
                            temp2 += aV2.mul(bV2).sumAll();
                        }
                        for (; l < k; l++) {
                            temp1 = temp1 + a[l + i * lda + _a_offset] * b[l + j * ldb + _b_offset];
                            temp2 = temp2 + b[l + i * ldb + _b_offset] * a[l + j * lda + _a_offset];
                        }
                        if (beta == 0.0) {
                            c[i + j * Ldc + _c_offset] = alpha * temp1 + alpha * temp2;
                        } else {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset] + alpha * temp1 + alpha * temp2;
                        }
                    }
                }
            } else {
                for (j = 0; j < n; j++) {
                    for (i = j; i < n; i++) {
                        temp1 = 0.0f;
                        temp2 = 0.0f;
                        l = 0;
                        for (; l+spec.length() < k; l+=spec.length()) {
                            FloatVector<Shapes.S256Bit> aV1=spec.fromArray(a,l + i * lda + _a_offset);
                            FloatVector<Shapes.S256Bit> bV1=spec.fromArray(b,l + j * ldb + _b_offset);
                            FloatVector<Shapes.S256Bit> bV2=spec.fromArray(b,l + i * ldb + _b_offset);
                            FloatVector<Shapes.S256Bit> aV2=spec.fromArray(a,l + j * lda + _a_offset);
                            temp1+=aV1.mul(bV1).sumAll();
                            temp2+=aV2.mul(bV2).sumAll();
                        }
                        for (; l < k; l++) {
                            temp1 = temp1 + a[l + i * lda + _a_offset] * b[l + j * ldb + _b_offset];
                            temp2 = temp2 + b[l + i * ldb + _b_offset] * a[l + j * lda + _a_offset];
                        }
                        if (beta == 0.0) {
                            c[i + j * Ldc + _c_offset] = alpha * temp1 + alpha * temp2;
                        } else {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset] + alpha * temp1 + alpha * temp2;
                        }
                    }
                }
            }
        }

    }

} // End class.

BLAS-III(DGEMM)

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class BLAS3GEMM {



    void VecDgemm(String transa, String transb, int m, int n, int k, double alpha, double[] a, int a_offset, int lda, double[] b, int b_offset, int ldb, double beta, double[] c, int c_offset, int ldc) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        double temp = 0.0;
        int i = 0;
        int info = 0;
        int j = 0;
        int l = 0;
        int ncola = 0;
        int nrowa = 0;
        int nrowb = 0;
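        // A flag is true when the corresponding matrix is used untransposed ("N").
        boolean nota = transa.equals("N");
        boolean notb = transb.equals("N");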
        DoubleVector<Shapes.S512Bit> zeroVec = spec.broadcast(0.0);

        if (m == 0 || n == 0 || ((alpha == 0 || k == 0) && beta == 1.0))
            return;
        //double temp=0.0;
        if (alpha == 0.0) {
            if (beta == 0.0) {
                for (j = 0; j < n; j++) {
                    for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                        zeroVec.intoArray(c, i + j * ldc + c_offset);
                    }
                    for (; i < m && i + j * ldc + c_offset<c.length; i++) {
                        c[i + j * ldc + c_offset] = 0.0;
                    }
                }
            }

            //beta!=0.0
            else {
                for (j = 0; j < n; j++) {
                    DoubleVector<Shapes.S512Bit> bv = spec.broadcast(beta);
                    for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                        DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                        cv.mul(bv).intoArray(c, i + j * ldc + c_offset);
                    }
                    for (; i < m && i + j * ldc + c_offset<c.length; i++) c[i + j * ldc + c_offset] = beta * c[i + j * ldc + c_offset];
                }
            }

            // C has been fully updated for alpha == 0; return so beta is not applied again below.
            return;
        }

        if (notb) {
            if (nota) {

// *           Form  C := alpha*A*B + beta*C.

                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * ldc + c_offset);
                        }
                        for (; i < m; i++) {
                            c[i + j * ldc + c_offset] = 0.0;
                        }
                    } else if (beta != 1.0) {
                        DoubleVector<Shapes.S512Bit> bv = spec.broadcast(beta);
                        for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                            cv.mul(bv).intoArray(c, i + j * ldc + c_offset);
                        }
                        for (; i < m && (i + j * ldc + c_offset)<c.length; i++) c[i + j * ldc + c_offset] = beta * c[i + j * ldc + c_offset];
                    }

                    for (l = 0; l < k; l++) {
                        if (b[l + j * ldb + b_offset] != 0.0) {
                            temp = alpha * b[l + j * ldb + b_offset];
                            DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                            for (i = 0; (i + spec.length()) < m && (i + l * lda + a_offset+spec.length())<a.length && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                                DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, i + l * lda + a_offset);
                                DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                                cv.add(av.mul(tv)).intoArray(c, i + j * ldc + c_offset); //tv.fma(av, cv).toDoubleArray(c, i+j*ldc+c_offset);
                            }
                            for (; i < m && (i + l * lda + a_offset)<a.length && (i + j * ldc + c_offset)<c.length; i++)
                                c[i + j * ldc + c_offset] = c[i + j * ldc + c_offset] + temp * a[i + l * lda + a_offset];
                        }
                    }
                }
            } else {
                // *           Form  C := alpha*A**T*B + beta*C
                for (j = 0; j < n; j++) {
                    for (i = 0; i < m; i++) {
                        temp = 0.0;
                        for (l = 0; (l + spec.length()) < k && (l + i * lda + a_offset+spec.length())<a.length && (l + j * ldb + b_offset+spec.length())<b.length; l += spec.length()) {
                            DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, l + i * lda + a_offset);
                            DoubleVector<Shapes.S512Bit> bv = spec.fromArray(b, l + j * ldb + b_offset);
                            temp += av.mul(bv).sumAll();
                        }
                        for (; l < k && l + i * lda + a_offset<a.length && l + j * ldb + b_offset<b.length; l++) temp = temp + a[l + i * lda + a_offset] * b[l + j * ldb + b_offset];

                        if (beta == 0.0) {
                            c[i + j * ldc + c_offset] = alpha * temp;
                        } else {
                            c[i + j * ldc + c_offset] = alpha * temp + beta * c[i + j * ldc + c_offset];
                        }
                    }
                }

            }
        } else {
            if (nota) {
                // *           Form  C := alpha*A*B**T + beta*C
                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * ldc + c_offset);
                        }
                        for (; i < m && (i + j * ldc + c_offset)<c.length; i++) {
                            c[i + j * ldc + c_offset] = 0.0;
                        }
                    } else if (beta != 1.0) {
                        DoubleVector<Shapes.S512Bit> bv = spec.broadcast(beta);
                        for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                            cv.mul(bv).intoArray(c, i + j * ldc + c_offset);
                        }
                        for (; i < m && i + j * ldc + c_offset<c.length; i++) {
                            c[i + j * ldc + c_offset] = beta * c[i + j * ldc + c_offset];
                        }
                    }

                    for (l = 0; l < k; l++) {
                        if (b[j + l * ldb + b_offset] != 0.0) {
                            temp = alpha * b[j + l * ldb + b_offset];
                            DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                            for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length && (i + l * lda + a_offset+spec.length())<a.length; i += spec.length()) {
                                DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                                DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, i + l * lda + a_offset);
                                cv.add(tv.mul(av)).intoArray(c, i + j * ldc + c_offset); //tv.fma(av, cv).toDoubleArray(c, i + j * ldc + c_offset);
                            }
                            for (; i < m && (i + j * ldc + c_offset)<c.length && (i + l * lda + a_offset)<a.length; i++)
                                c[i + j * ldc + c_offset] = c[i + j * ldc + c_offset] + temp * a[i + l * lda + a_offset];
                        }
                    }
                }
            } else {
                // *           Form  C := alpha*A**T*B**T + beta*C
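                // Consecutive l values of B(j,l) are ldb elements apart, so they cannot be read with one
                // contiguous fromArray; stage spec.length() of them in a small buffer first.
                double[] bRow = new double[spec.length()];
                for (j = 0; j < n; j++) {
                    for (i = 0; i < m; i++) {
                        temp = 0.0;
                        for (l = 0; (l + spec.length()) < k && (l + i * lda + a_offset+spec.length())<a.length && (j + (l + spec.length() - 1) * ldb + b_offset)<b.length; l += spec.length()) {
                            for (int t = 0; t < spec.length(); t++) {
                                bRow[t] = b[j + (l + t) * ldb + b_offset];
                            }
                            DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, l + i * lda + a_offset);
                            DoubleVector<Shapes.S512Bit> bv = spec.fromArray(bRow, 0);
                            temp += av.mul(bv).sumAll();
                        }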
                        for (; l < k && (l + i * lda + a_offset)<a.length && (j + l * ldb + b_offset)<b.length; l++) {
                            temp = temp + a[l + i * lda + a_offset] * b[j + l * ldb + b_offset];
                        }

                        if (beta == 0.0) {
                            c[i + j * ldc + c_offset] = alpha * temp;
                        } else {
                            c[i + j * ldc + c_offset] = alpha * temp + beta * c[i + j * ldc + c_offset];
                        }

                    }
                }
            }
        }

    }

}
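
One way to sanity-check a vectorized GEMM such as VecDgemm is to compare its output with a plain scalar loop on a small matrix. The sketch below is such a baseline for the untransposed case only. The names DgemmReferenceCheck, dgemmNN and maxAbsDiff are illustrative assumptions and are not part of the Vector API or of the listing above; column-major storage with zero offsets is assumed.

```java
import java.util.Arrays;

// Hypothetical helper class, used only to validate a vectorized result.
final class DgemmReferenceCheck {

    // Scalar column-major C := alpha*A*B + beta*C (no transposes, zero offsets).
    static void dgemmNN(int m, int n, int k, double alpha, double[] a, int lda,
                        double[] b, int ldb, double beta, double[] c, int ldc) {
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                double sum = 0.0;
                for (int l = 0; l < k; l++) {
                    sum += a[i + l * lda] * b[l + j * ldb];
                }
                // As in BLAS, C is not read when beta is zero.
                double scaled = (beta == 0.0) ? 0.0 : beta * c[i + j * ldc];
                c[i + j * ldc] = alpha * sum + scaled;
            }
        }
    }

    // Largest absolute element-wise difference between two arrays of equal length.
    static double maxAbsDiff(double[] x, double[] y) {
        double max = 0.0;
        for (int i = 0; i < x.length; i++) {
            max = Math.max(max, Math.abs(x[i] - y[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        // Column-major 2x2 inputs: A = [1 2; 3 4], B = [5 6; 7 8], C = all ones.
        double[] a = {1, 3, 2, 4};
        double[] b = {5, 7, 6, 8};
        double[] c = {1, 1, 1, 1};
        dgemmNN(2, 2, 2, 1.0, a, 2, b, 2, 1.0, c, 2);
        System.out.println(Arrays.toString(c)); // [20.0, 44.0, 23.0, 51.0]
    }
}
```

In practice the array produced by the vectorized routine would be passed to maxAbsDiff together with the baseline result and compared against a small tolerance, since vector reductions may sum in a different order.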

Financial Services (FSI) algorithms

GetOptionPrice

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class FSI_getOptionPrice  {


    public static double getOptionPrice(double Sval, double Xval, double T, double[] z, int numberOfPaths, double riskFree, double volatility)
    {
        double val=0.0 , val2=0.0;
        double VBySqrtT = volatility * Math.sqrt(T);
        double MuByT = (riskFree - 0.5 * volatility * volatility) * T;

        //Simulate Paths
        for(int path = 0; path < numberOfPaths; path++)
        {
            double callValue  = Sval * Math.exp(MuByT + VBySqrtT * z[path]) - Xval;
            callValue = (callValue > 0) ? callValue : 0;
            val  += callValue;
            val2 += callValue * callValue;
        }

        double optPrice=0.0;
        optPrice = val / numberOfPaths;
        return (optPrice);
    }


    public static double VecGetOptionPrice(double Sval, double Xval, double T, double[] z, int numberOfPaths, double riskFree, double volatility) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        double val = 0.0, val2 = 0.0;

        double VBySqrtT = volatility * Math.sqrt(T);
        DoubleVector<Shapes.S512Bit> VByVec = spec.broadcast(VBySqrtT);
        double MuByT = (riskFree - 0.5 * volatility * volatility) * T;
        DoubleVector<Shapes.S512Bit> MuVec = spec.broadcast(MuByT);
        DoubleVector<Shapes.S512Bit> SvalVec = spec.broadcast(Sval);
        DoubleVector<Shapes.S512Bit> XvalVec = spec.broadcast(Xval);
        DoubleVector<Shapes.S512Bit> zeroVec =spec.broadcast(0.0D);

        //Simulate Paths
        int path = 0;
        for (; (path + spec.length()) < numberOfPaths; path += spec.length()) {
            DoubleVector<Shapes.S512Bit> zv = spec.fromArray(z, path);
            DoubleVector<Shapes.S512Bit> tv = MuVec.add(VByVec.mul(zv)).exp(); //Math.exp(MuByT + VBySqrtT * z[path])
            DoubleVector<Shapes.S512Bit> callValVec = SvalVec.mul(tv).sub(XvalVec);
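            // Keep positive payoffs and zero the rest: blend takes the lane from its argument where the mask is set.
            callValVec = zeroVec.blend(callValVec, callValVec.greaterThan(zeroVec));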
            val += callValVec.sumAll();
            val2 += callValVec.mul(callValVec).sumAll();
        }
        //tail
        for (; path < numberOfPaths; path++) {
            double callValue = Sval * Math.exp(MuByT + VBySqrtT * z[path]) - Xval;
            callValue = (callValue > 0) ? callValue : 0;
            val += callValue;
            val2 += callValue * callValue;
        }
        double optPrice = 0.0;
        optPrice = val / numberOfPaths;
        return (optPrice);
    }
}
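
A quick way to check a pricing kernel like the one above is to feed it inputs whose result is known by hand. The sketch below repeats the scalar formula under a hypothetical class name (OptionPriceCheck) and uses z.length in place of the numberOfPaths parameter; both are assumptions made for brevity. With a zero rate and zero volatility the exponent vanishes, so every path pays S - X.

```java
// Hypothetical helper class; not part of the listing above.
final class OptionPriceCheck {

    // Scalar baseline: mean of max(S * exp(muT + sigma*sqrt(T)*z) - X, 0) over all paths.
    static double scalarPrice(double sVal, double xVal, double t, double[] z,
                              double riskFree, double volatility) {
        double vBySqrtT = volatility * Math.sqrt(t);
        double muByT = (riskFree - 0.5 * volatility * volatility) * t;
        double sum = 0.0;
        for (double zi : z) {
            double callValue = sVal * Math.exp(muByT + vBySqrtT * zi) - xVal;
            sum += Math.max(callValue, 0.0);
        }
        return sum / z.length;
    }

    public static void main(String[] args) {
        // 19 paths: deliberately not a multiple of the vector length,
        // so a vectorized version would also have to run its tail loop.
        double[] z = new double[19];
        double price = scalarPrice(110.0, 100.0, 1.0, z, 0.0, 0.0);
        System.out.println(price); // 10.0
    }
}
```

Running VecGetOptionPrice on the same inputs and comparing within a small tolerance exercises both the vector loop and the scalar tail.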

BinomialOptions

import jdk.incubator.vector.*;

public class FSI_BinomialOptions  {



    public static void VecBinomialOptions(double[] stepsArray, int STEPS_CACHE_SIZE, double vsdt, double x, double s, int numSteps, int NUM_STEPS_ROUND, double pdByr, double puByr) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        IntVector.IntSpecies<Shapes.S512Bit> ispec = (IntVector.IntSpecies<Shapes.S512Bit>) Vector.speciesInstance(Integer.class, Shapes.S_512_BIT);

        //   double stepsArray [STEPS_CACHE_SIZE];
        DoubleVector<Shapes.S512Bit> sv = spec.broadcast(s);
        DoubleVector<Shapes.S512Bit> vsdtVec = spec.broadcast(vsdt);
        DoubleVector<Shapes.S512Bit> xv = spec.broadcast(x);
        DoubleVector<Shapes.S512Bit> pdv = spec.broadcast(pdByr);
        DoubleVector<Shapes.S512Bit> puv = spec.broadcast(puByr);
        DoubleVector<Shapes.S512Bit> zv = spec.broadcast(0.0D);
        IntVector<Shapes.S512Bit> inc = ispec.fromArray(new int[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}, 0);
        IntVector<Shapes.S512Bit> nSV = ispec.broadcast(numSteps);
        int j;
        for (j = 0; (j + spec.length()) < STEPS_CACHE_SIZE; j += spec.length()) {
            IntVector<Shapes.S512Bit> jv = ispec.broadcast(j);
            Vector<Double,Shapes.S512Bit> tv = jv.add(inc).cast(Double.class).mul(spec.broadcast(2.0D)).sub(nSV.cast(Double.class));
            DoubleVector<Shapes.S512Bit> pftVec=sv.mul(vsdtVec.mul(tv).exp()).sub(xv);
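            // Keep positive profits and zero the rest: blend takes the lane from its argument where the mask is set.
            zv.blend(pftVec, pftVec.greaterThan(zv)).intoArray(stepsArray,j);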
        }
        for (; j < STEPS_CACHE_SIZE; j++) {
            double profit = s * Math.exp(vsdt * (2.0D * j - numSteps)) - x;
            stepsArray[j] = profit > 0.0D ? profit : 0.0D;
        }

        for (j = 0; j < numSteps; j++) {
            int k;
            for (k = 0; k + spec.length() < NUM_STEPS_ROUND; k += spec.length()) {
                DoubleVector<Shapes.S512Bit> sv0 = spec.fromArray(stepsArray, k);
                DoubleVector<Shapes.S512Bit> sv1 = spec.fromArray(stepsArray, k + 1);
                pdv.mul(sv1).add(puv.mul(sv0)).intoArray(stepsArray, k); //sv0 = pdv.fma(sv1, puv.mul(sv0)); sv0.intoArray(stepsArray,k);
            }
            for (; k < NUM_STEPS_ROUND; ++k) {
                stepsArray[k] = pdByr * stepsArray[k + 1] + puByr * stepsArray[k];
            }
        }
    }

}
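
The vectorized routine above has two phases: it fills stepsArray with the payoff at expiry, then repeatedly folds neighbouring entries together. A scalar version of the same two phases, sketched below under the hypothetical names BinomialReferenceCheck and scalarBinomial, can serve as a baseline when validating the vector code. It assumes, as the listing does, that the array holds at least NUM_STEPS_ROUND + 1 elements.

```java
// Hypothetical helper class; mirrors the scalar tail loops of the listing above.
final class BinomialReferenceCheck {

    static void scalarBinomial(double[] steps, int cacheSize, double vsdt, double x, double s,
                               int numSteps, int numStepsRound, double pdByr, double puByr) {
        // Phase 1: payoff at expiry for each node, clamped at zero.
        for (int j = 0; j < cacheSize; j++) {
            double profit = s * Math.exp(vsdt * (2.0 * j - numSteps)) - x;
            steps[j] = profit > 0.0 ? profit : 0.0;
        }
        // Phase 2: fold neighbouring nodes together, once per time step.
        // steps[k + 1] is read before it is overwritten because k ascends.
        for (int j = 0; j < numSteps; j++) {
            for (int k = 0; k < numStepsRound; k++) {
                steps[k] = pdByr * steps[k + 1] + puByr * steps[k];
            }
        }
    }

    public static void main(String[] args) {
        // One step with vsdt = ln 2: node 0 pays max(50 - 100, 0) = 0, node 1 pays about 100.
        double[] steps = new double[2];
        scalarBinomial(steps, 2, Math.log(2.0), 100.0, 100.0, 1, 1, 0.5, 0.5);
        System.out.println(steps[0]); // approximately 50, up to floating-point rounding
    }
}
```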