A Vector API for Java: content summary
Binding to Machine Instruction

    static final MethodType MT_L4_BINARY = …(…, …, …);

    private static final MethodHandle MHm256_vaddps = …(
        "mm256_vaddps", MT_L4_BINARY, requires(AVX),
        new Register[][]{xmmRegistersSSE, xmmRegistersSSE, xmmRegistersSSE},
        (Register[] regs) -> {
            Register out = regs[0];
            Register in1 = regs[1];
            Register in2 = regs[2];
            int[] vex = vex_prefix(rBit(out), X_LOW, bBit(in2), M_0F, W_LOW, in1, L_256, PP_NONE);
            return vex_emit(vex, 0x58, modRM(out, in2));
        });

Slide callouts: MethodHandle type (MT_L4_BINARY); feature-checking predicate (requires(AVX)); desired register masks (the Register[][] argument); registers via JVMCI (regs); macroized x86 encoding (vex_prefix, vex_emit).

Software and Services Group

Checked Invocation

    // Pure Java equivalent function.
    private static Long4 vaddps_naive(Long4 a, Long4 b) {
        float[] res = new float[8];
        for (int i = 0; i < 8; i++) {
            res[i] = getFloat(a, i) + getFloat(b, i);
        }
        return long4FromFloatArray(res, 0);
    }

    // Typesafe invocation point.
    public static Long4 vaddps(Long4 a, Long4 b) {
        try {
            Long4 res = (Long4) …(a, b);
            assert assertEquals(res, vaddps_naive(a, b));
            return res;
        } catch (Throwable e) {
            throw new Error(e);
        }
    }

A Small Example

    public static float[] proc(float[] left, float[] right, float[] res) {
        if (… != …) {
            throw new UnsupportedOperationException("Arrays unequal.");
        } else if (… % 8 != 0) {
            throw new UnsupportedOperationException("Length must be n*8");
        }
        // Loop kernel
        for (int i = 0; i < …; i += 8) {
            addArrays(left, right, res, i);
        }
        return res; // Convenience
    }

Small Example (cont'd)

    // Isolated for code quality purposes in prototype
    public static void addArrays(float[] left, float[] right, float[] res, int i) {
        // Scaled load: VMOVDQU ymmX, YMMWORD PTR …
        Long4 l = …(left, i);
        // vaddps reg, YMMWORD PTR ...
        Long4 rr = …(l, right, i);
        // Scaled store: VMOVDQU YMMWORD PTR …, ymmX
        …(res, i, rr);
    }
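The encoding step in the binding code (vex_prefix, then vex_emit with opcode 0x58 and a ModRM byte) can be worked through by hand. The following is a minimal sketch under stated assumptions, not the prototype's code: the helper names vexPrefix, modRM, encodeVaddpsYmm, and hex are hypothetical, registers are plain integers 0 to 15 rather than JVMCI Register objects, and only the three-byte (0xC4) VEX form with register-to-register operands is covered. In VEX terms, M_0F corresponds to map-select value 1, L_256 to L = 1, and PP_NONE to pp = 0.

```java
final class VexSketch {
    // Hypothetical helper: builds a three-byte VEX prefix (0xC4 form).
    // The R, X, B and vvvv fields are stored inverted, as the VEX format requires.
    static int[] vexPrefix(int r, int x, int b, int mmmmm, int w, int vvvv, int l, int pp) {
        int byte1 = ((~r & 1) << 7) | ((~x & 1) << 6) | ((~b & 1) << 5) | (mmmmm & 0x1F);
        int byte2 = ((w & 1) << 7) | ((~vvvv & 0xF) << 3) | ((l & 1) << 2) | (pp & 3);
        return new int[] {0xC4, byte1, byte2};
    }

    // Hypothetical helper: register-direct ModRM byte (mod = 11).
    static int modRM(int reg, int rm) {
        return 0xC0 | ((reg & 7) << 3) | (rm & 7);
    }

    // Encodes "vaddps ymm<out>, ymm<in1>, ymm<in2>" for register numbers 0..15.
    static byte[] encodeVaddpsYmm(int out, int in1, int in2) {
        int[] vex = vexPrefix(
                out >> 3,  // R: high bit of the destination register
                0,         // X: unused without a SIB index register
                in2 >> 3,  // B: high bit of the r/m register
                0x01,      // map select: the 0F opcode map
                0,         // W = 0
                in1,       // vvvv: first source register
                1,         // L = 1 selects 256-bit operands
                0);        // pp = 0: no implied SIMD prefix
        return new byte[] {
            (byte) vex[0], (byte) vex[1], (byte) vex[2],
            (byte) 0x58,                 // ADDPS opcode in the 0F map
            (byte) modRM(out, in2)
        };
    }

    // Formats bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // vaddps ymm0, ymm1, ymm2
        System.out.println(hex(encodeVaddpsYmm(0, 1, 2))); // prints "c4 e1 74 58 c2"
    }
}
```

Assemblers usually pick the shorter two-byte form (c5 f4 58 c2) for this instruction when no extended registers are involved; the three-byte form shown here encodes the same operation and is the one that matches the eight-argument prefix call on the slide.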

Generating C2 Code

    java -XaddExports:… -XaddExports:… -XX:+UnlockDiagnosticVMOptions -XX:-UseSuperWord -XX:LoopMaxUnroll=1 -XX:PrintAssemblyOptions=intel -XX:CompileCommand=option,*AddArraysLong4PS::addArrays,PrintAssembly -cp build AddArraysLong4PS

Callouts on the slide: "Snippets!" and "Generated Code".

Performance of This Example

- Compared to a scalar implementation, with SuperWord and loop unrolling disabled.
- We see a ~40% reduction in clock cycles spent in the loop kernel with the vectorized version.
- This workload is a prototype PoC; we need more advanced workloads that better leverage vectorization. Bigger, more intensive workloads to come.
- Wall clock time indicates overhead coming from outside of the loop kernel vs. the scalar version: more work to do!

The Vector API

Java Needs an Abstraction for Vectors

- Vector ISA extensions are powerful, expressive, and deep.
- Most instructions have many different forms and support differing operand sizes.
- NxM problems abound for API writers.
- Needs to capture the essence of vectorization in the spirit of Java:
  - Platform independence (snippets are too low level)
  - Meaningful static checking
  - Familiar patterns to abstract operational complexity

Vector API

- Intended API to encompass the CodeSnippets implementation.
- Proposed by John Rose*.
- Work continues within the Panama Project.

    interface Vector<E, S ext…
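
To make the "meaningful static checking" goal concrete, here is a minimal sketch of how an element type and a shape type carried as generic parameters let the compiler reject mismatched vectors. Everything in it is an assumption made for illustration: the names Vec, Shape, S128, S256, and FloatVec are hypothetical and are not the Panama API, and the lanes are held in a plain float[] with scalar arithmetic rather than bound to machine instructions.

```java
import java.util.Arrays;

final class VectorSketch {
    // Hypothetical shape marker: the vector width becomes part of the static type.
    interface Shape { int bits(); }
    static final class S128 implements Shape { public int bits() { return 128; } }
    static final class S256 implements Shape { public int bits() { return 256; } }

    // Hypothetical vector abstraction parameterized by element type and shape.
    interface Vec<E, S extends Shape> {
        Vec<E, S> add(Vec<E, S> other);
        int length();
    }

    // Array-backed float vector; a real implementation would map to vector registers.
    static final class FloatVec<S extends Shape> implements Vec<Float, S> {
        private final float[] lanes;

        private FloatVec(float[] lanes) { this.lanes = lanes; }

        // Loads shape.bits() / 32 floats starting at offset.
        static <S extends Shape> FloatVec<S> fromArray(S shape, float[] src, int offset) {
            int n = shape.bits() / Float.SIZE;
            return new FloatVec<>(Arrays.copyOfRange(src, offset, offset + n));
        }

        @Override
        public FloatVec<S> add(Vec<Float, S> other) {
            FloatVec<S> o = (FloatVec<S>) other; // the only implementation in this sketch
            float[] r = new float[lanes.length];
            for (int i = 0; i < r.length; i++) {
                r[i] = lanes[i] + o.lanes[i];
            }
            return new FloatVec<>(r);
        }

        // Stores all lanes into dst starting at offset.
        void intoArray(float[] dst, int offset) {
            System.arraycopy(lanes, 0, dst, offset, lanes.length);
        }

        @Override
        public int length() { return lanes.length; }
    }

    // The small example's loop kernel, rewritten against the sketch.
    static float[] addArrays(float[] left, float[] right) {
        if (left.length != right.length) {
            throw new IllegalArgumentException("Arrays unequal.");
        }
        if (left.length % 8 != 0) {
            throw new IllegalArgumentException("Length must be n*8");
        }
        S256 shape = new S256();
        float[] res = new float[left.length];
        for (int i = 0; i < left.length; i += 8) {
            FloatVec<S256> l = FloatVec.fromArray(shape, left, i);
            FloatVec<S256> r = FloatVec.fromArray(shape, right, i);
            l.add(r).intoArray(res, i);
            // l.add(FloatVec.fromArray(new S128(), right, i)) would not compile:
            // FloatVec<S128> is not a Vec<Float, S256>.
        }
        return res;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        float[] b = {8, 7, 6, 5, 4, 3, 2, 1};
        System.out.println(Arrays.toString(addArrays(a, b)));
    }
}
```

The design point is that a width mismatch becomes a compile-time type error instead of a runtime failure, while the calling code stays independent of any particular instruction set.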