A Vector API for Java: content summary

Binding to Machine Instruction

    static final MethodType MT_L4_BINARY = (, , );

    private static final MethodHandle MHm256_vaddps = ( mm256_vaddps,
        MT_L4_BINARY,
        requires(AVX),
        new Register[][]{xmmRegistersSSE, xmmRegistersSSE, xmmRegistersSSE},
        (Register[] regs) -> {
            Register out = regs[0];
            Register in1 = regs[1];
            Register in2 = regs[2];
            int[] vex = vex_prefix(rBit(out), X_LOW, bBit(in2), M_0F, W_LOW, in1, L_256, PP_NONE);
            return vex_emit(vex, 0x58, modRM(out, in2));
        });

Annotations: Registers via JVMCI; Desired Register Masks; MethodHandle Type; Feature-checking predicate; Macroized x86 encoding.

Software and Services Group

Checked Invocation

    private static Long4 vaddps_naive(Long4 a, Long4 b) {
        float[] res = new float[8];
        for (int i = 0; i < 8; i++) {
            res[i] = getFloat(a, i) + getFloat(b, i);
        }
        return long4FromFloatArray(res, 0);
    }

    public static Long4 vaddps(Long4 a, Long4 b) {
        try {
            Long4 res = (Long4) (a, b);
            assert assertEquals(res, vaddps_naive(a, b));
            return res;
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

Annotations: Pure Java equivalent function (vaddps_naive); Typesafe invocation point (vaddps).

A Small Example

    public static float[] proc(float[] left, float[] right, float[] res) {
        if ( != ) {
            throw new UnsupportedOperationException("Arrays unequal.");
        } else if ( % 8 != 0) {
            throw new UnsupportedOperationException("Length must be n*8");
        }
        for (int i = 0; i; i += 8) {
            addArrays(left, right, res, i);
        }
        return res; // Convenience
    }

Annotation: Loop Kernel.

Small Example (cont’d)

    // Isolated for code quality purposes in prototype
    public static void addArrays(float[] left, float[] right, float[] res, int i) {
        // VMOVDQU ymmX, YMMWORD PTR …
        Long4 l = (left, i);
        Long4 rr = (l, right, i);
        // VMOVDQU YMMWORD PTR …, ymmX
        (res, i, rr);
    }

Annotations: Scaled load; Scaled store; vaddps reg, YMMWORD PTR ...
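The loop structure of the small example can be sketched in plain Java for readers who do not have the prototype's Long4 type. This is a scalar stand-in under stated assumptions, not the prototype code: the class name AddArraysScalar and the LANES constant are hypothetical, and the length checks compare left.length, right.length, and res.length, which is a guess because the operands of those conditions are missing in the source. Each addArrays call handles one block of eight floats, the work the slides map to a single 256-bit vaddps.

```java
// Hypothetical scalar stand-in for the slides' Long4-based kernel.
// It mirrors the proc/addArrays split but uses no vector instructions.
final class AddArraysScalar {
    // Eight 32-bit floats fill one 256-bit register.
    static final int LANES = 8;

    static float[] proc(float[] left, float[] right, float[] res) {
        // Assumed checks: the source elides the operands being compared.
        if (left.length != right.length || left.length != res.length) {
            throw new UnsupportedOperationException("Arrays unequal.");
        } else if (left.length % LANES != 0) {
            throw new UnsupportedOperationException("Length must be n*8");
        }
        // Loop kernel: one call per block of eight elements.
        for (int i = 0; i < left.length; i += LANES) {
            addArrays(left, right, res, i);
        }
        return res; // returned for convenience, as in the slides
    }

    // Scalar equivalent of: load 8 floats, add 8 floats, store 8 floats.
    static void addArrays(float[] left, float[] right, float[] res, int i) {
        for (int lane = 0; lane < LANES; lane++) {
            res[i + lane] = left[i + lane] + right[i + lane];
        }
    }
}
```

A version like this also serves the same purpose as vaddps_naive in the checked-invocation slide: a pure Java reference against which a vectorized result can be compared.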
Generating C2 Code

    java -XaddExports: -XaddExports: -XX:+UnlockDiagnosticVMOptions -XX:-UseSuperWord -XX:LoopMaxUnroll=1 -XX:PrintAssemblyOptions=intel -XX:CompileCommand=option,*AddArraysLong4PS::addArrays,PrintAssembly -cp build AddArraysLong4PS

Annotations: Snippets!!!!!; Generated Code.

Performance of This Example

- Compared to the scalar implementation
  - Disabled SuperWord and loop unrolling
- We see a ~40% reduction in clock cycles spent in the loop kernel with the vectorized version.
- This workload is a prototype PoC; we need more advanced workloads that better leverage vectorization.
  - Bigger, more intensive workloads to come
- Wall clock time indicates overhead coming from outside of the loop kernel vs. the scalar version: more work to do!

The Vector API

Java Needs an Abstraction for Vectors

- Vector ISA extensions are powerful, expressive, and deep.
  - Most instructions have many different forms and support differing operand sizes
  - NxM problems abound for API writers
- Needs to capture the essence of vectorization in the spirit of Java
  - Platform independence: Snippets are too low level
  - Meaningful static checking
  - Familiar patterns to abstract operational complexity

Vector API

- Intended API to encompass the CodeSnippets implementation
- Proposed by John Rose*. Work continues within the Panama Project
- interface Vector<E, S ext
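The "meaningful static checking" goal can be illustrated with a small generic interface that carries an element type and a shape type, loosely echoing the E and S parameters of the truncated declaration above. This is a hypothetical sketch and not the Panama Vector API: every name here (SketchShape, S128, S256, SketchVector, FloatSketchVector) is invented for illustration, and the lanes are held in an ordinary float array rather than a machine register.

```java
// Hypothetical sketch only: none of these types belong to Project Panama.
interface SketchShape {
    int lanes(); // number of float lanes the shape holds
}

final class S128 implements SketchShape {
    public int lanes() { return 4; } // 128 bits / 32-bit floats
}

final class S256 implements SketchShape {
    public int lanes() { return 8; } // 256 bits / 32-bit floats
}

// E is the element type, S the shape; both are part of the static type.
interface SketchVector<E, S extends SketchShape> {
    S shape();
    E lane(int i);
    SketchVector<E, S> add(SketchVector<E, S> other);
}

final class FloatSketchVector<S extends SketchShape> implements SketchVector<Float, S> {
    private final S shape;
    private final float[] lanes;

    private FloatSketchVector(S shape, float[] lanes) {
        this.shape = shape;
        this.lanes = lanes;
    }

    // Copies shape.lanes() floats starting at offset (the "load").
    static <S extends SketchShape> FloatSketchVector<S> fromArray(S shape, float[] src, int offset) {
        if (offset < 0 || offset + shape.lanes() > src.length) {
            throw new IndexOutOfBoundsException("block does not fit in source array");
        }
        float[] copy = java.util.Arrays.copyOfRange(src, offset, offset + shape.lanes());
        return new FloatSketchVector<>(shape, copy);
    }

    public S shape() { return shape; }

    public Float lane(int i) { return lanes[i]; }

    // Lane-wise addition; the parameter type forces both operands to share S.
    public SketchVector<Float, S> add(SketchVector<Float, S> other) {
        float[] out = new float[shape.lanes()];
        for (int i = 0; i < out.length; i++) {
            out[i] = lanes[i] + other.lane(i);
        }
        return new FloatSketchVector<>(shape, out);
    }
}
```

With these types, adding a FloatSketchVector<S256> to a FloatSketchVector<S128> is rejected by the compiler, which is the kind of check a raw Long4-based snippet cannot express.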