gcc - How do you load/store from/to an array of doubles with GNU C Vector Extensions?
I'm using GNU C vector extensions, not Intel's `_mm_*` intrinsics.
I want the same thing as Intel's `_mm256_loadu_pd` intrinsic. Assigning values one at a time is slow: GCC produces code with 4 load instructions, rather than a single `vmovupd` (which `_mm256_loadu_pd` generates).
```c
typedef double vector __attribute__((vector_size(4 * sizeof(double))));

int main(int argc, char **argv) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    vector v;

    /* slow: one element at a time */
    v[0] = a[0];
    v[1] = a[1];
    v[2] = a[2];
    v[3] = a[3];
}
```

I want this:
```c
v = (vector)(a);
```

or

```c
v = *((vector*)(a));
```

but neither works. The first fails with "can't convert value to vector", while the second segfaults.
Update: I see you're using GNU C's native vector syntax, not Intel intrinsics. Are you avoiding Intel intrinsics for portability to non-x86? GCC currently does a bad job compiling code that uses GNU C vectors wider than the target machine supports. (You'd hope it would just use two 128b vectors and operate on each separately, but apparently it's worse than that.)
Anyway, this answer shows how you can use Intel x86 intrinsics to load data into GNU C vector-syntax types.
First of all, looking at compiler output at less than `-O2` is a waste of time if you're trying to learn what will compile to good code. Your `main()` will optimize to just a `ret` at `-O2`.
Besides that, it's not totally surprising that you get bad asm from assigning elements of a vector one at a time.
Aside: normal people call the type `v4df` (vector of 4 double float) or something, not `vector`, so they don't go insane when also using C++ `std::vector`. For single-precision, `v8sf`. IIRC, GCC uses type names like that internally for `__m256d`.
On x86, Intel intrinsic types (like `__m256d`) are implemented on top of GNU C vector syntax (which is why you can write `v1 * v2` in GNU C instead of writing `_mm256_mul_pd(v1, v2)`). You can convert freely between `__m256d` and `v4df`, as I've done here.
I've wrapped both sane ways to do this in functions, so we can look at their asm. Notice how we're not loading from an array we define inside the same function, so the compiler won't optimize the load away.
I put them on the Godbolt compiler explorer so you can look at the asm with various compile options and compiler versions.
```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));
#include <immintrin.h>

// note the return types.  gcc6.1 compiles this with no warnings, even at -Wall -Wextra
v4df load_4_doubles_intel(const double *p) { return _mm256_loadu_pd(p); }
    // vmovupd ymm0, YMMWORD PTR [rdi]   # tmp89,* p
    // ret

v4df avx_constant() { return _mm256_setr_pd( 1.0, 2.0, 3.0, 4.0 ); }
    // vmovapd ymm0, YMMWORD PTR .LC0[rip]
    // ret
```

If the args to `_mm_set*` intrinsics aren't compile-time constants, the compiler will do the best it can to make efficient code that gets all the elements into a single vector. It's usually best to do that, rather than writing C that stores to a tmp array and loads from it, because that's not always the best strategy. (A store-forwarding failure on multiple narrow stores forwarding to a wide load costs ~10 cycles (IIRC) of latency on top of the usual store-forwarding delay. If your doubles are already in registers, it's usually best to just shuffle them together.)
See *Is it possible to cast floats directly to __m128 if they are 16 byte aligned?* for a list of various intrinsics for getting a single scalar into a vector. The x86 tag wiki has links to Intel's manuals, and to the intrinsics finder.
Load/store GNU C vectors without Intel intrinsics:
I'm not sure how you're "supposed" to do that. This Q&A suggests casting a pointer to the memory you want to load, and using a vector type like `typedef char __attribute__ ((vector_size (16), aligned (1))) unaligned_byte16;` (note the `aligned(1)` attribute).
You get a segfault from `*(v4df *)a` because presumably `a` isn't aligned on a 32-byte boundary, but you're using a vector type that assumes natural alignment. (Just like `__m256d` would if you dereference a pointer to it instead of using load/store intrinsics to communicate alignment info to the compiler.)