gcc - How do you load/store from/to an array of doubles with GNU C Vector Extensions?
I'm using GNU C vector extensions, not Intel's `_mm_*` intrinsics.

I want the same thing Intel's `_mm256_loadu_pd` intrinsic does. Assigning the values one at a time is slow: GCC produces code with 4 load instructions, rather than 1 single `vmovupd` (which `_mm256_loadu_pd` generates).
```c
typedef double vector __attribute__((vector_size(4 * sizeof(double))));

int main(int argc, char **argv) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    vector v;
    /* load a into v, one element at a time */
    v[0] = a[0];
    v[1] = a[1];
    v[2] = a[2];
    v[3] = a[3];
}
```

I want to do this:
```c
v = (vector)(a);
```

or

```c
v = *((vector *)(a));
```

but neither works: the first fails to compile with "can't convert value to a vector", while the second segfaults.
Update: I see you're using GNU C's native vector syntax, not Intel intrinsics. Are you avoiding Intel intrinsics for portability to non-x86? GCC currently does a bad job compiling code that uses GNU C vectors wider than the target machine supports. (You'd hope it would just use two 128-bit vectors and operate on each separately, but apparently it's worse than that.)

Anyway, this answer shows how you can use Intel x86 intrinsics to load data into GNU C vector-syntax types.
First of all, looking at compiler output at less than `-O2` is a waste of time if you're trying to learn what compiles to good code. Your `main()` will optimize to just a `ret` at `-O2`.

Besides that, it's not totally surprising that you get bad asm from assigning the elements of a vector one at a time.
Aside: normal people call the type `v4df` (vector of 4 double float) or something like that, not `vector`, so they don't go insane when also using C++ `std::vector`. For single-precision, `v8sf`. IIRC, GCC uses type names like that internally for `__m256d`.
On x86, Intel intrinsic types (like `__m256d`) are implemented on top of GNU C vector syntax (which is why you can write `v1 * v2` in GNU C instead of `_mm256_mul_pd(v1, v2)`). You can convert freely between `__m256d` and `v4df`, as I've done here.
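To make the `v1 * v2` point concrete, here is a minimal sketch using only GNU C vector syntax, no `<immintrin.h>` (the `v4df_mul` name is mine, for illustration):

```c
/* GNU C vector of 4 doubles, same layout as __m256d on x86 */
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

/* Element-wise multiply: GCC compiles this to vmulpd with -mavx,
   or a pair of mulpd instructions on baseline x86-64. */
static v4df v4df_mul(v4df x, v4df y) {
    return x * y;
}
```

The same syntax works for `+`, `-`, `/`, comparisons, and indexing with `v[i]`, which is the whole appeal of the native vector extensions over spelling out each intrinsic.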
I've wrapped both sane ways to do this in functions, so we can look at their asm. Notice that they're not loading from an array defined inside the same function, so the compiler can't optimize the load away.

I put them on the Godbolt compiler explorer so you can look at the asm with various compile options and compiler versions.
```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));
#include <immintrin.h>

// note the return types.  gcc6.1 compiles these with no warnings, even at -Wall -Wextra
v4df load_4_doubles_intel(const double *p) { return _mm256_loadu_pd(p); }
```

```asm
    vmovupd ymm0, YMMWORD PTR [rdi]   # tmp89, *p
    ret
```

```c
v4df avx_constant() { return _mm256_setr_pd( 1.0, 2.0, 3.0, 4.0 ); }
```

```asm
    vmovapd ymm0, YMMWORD PTR .LC0[rip]
    ret
```

If the args to `_mm_set*` intrinsics aren't compile-time constants, the compiler will do the best it can to make efficient code that gets all the elements into a single vector. Doing that is usually better than writing C that stores to a tmp array and loads from it, because that's not always the best strategy. (A store-forwarding failure from multiple narrow stores feeding a wide load costs ~10 cycles (IIRC) of latency on top of the usual store-forwarding delay. If your doubles are already in registers, it's usually best to shuffle them together.)
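GNU C also lets you build a vector from scalars with a compound-literal-style initializer, which leaves the "get the elements into one vector" strategy up to the compiler, just like `_mm256_setr_pd` with non-constant args. A small sketch (the `make_v4df` name is mine):

```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

/* Build a vector from 4 scalars.  With runtime args, GCC generates
   shuffles/unpacks rather than a store-to-array-and-reload sequence. */
static v4df make_v4df(double a, double b, double c, double d) {
    return (v4df){ a, b, c, d };
}
```

With compile-time-constant args this folds to a single load from `.rodata`, same as the `avx_constant()` example above.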
See *Is it possible to cast floats directly to `__m128` if they are 16 byte aligned?* for a list of various intrinsics for getting a single scalar into a vector. The x86 tag wiki has links to Intel's manuals, and to their intrinsics finder.
Load/store GNU C vectors without Intel intrinsics:

I'm not sure how you're "supposed" to do that. This Q&A suggests casting a pointer to the memory you want to load, using a vector type like `typedef char __attribute__ ((vector_size (16), aligned (1))) unaligned_byte16;` (note the `aligned (1)` attribute).
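As a concrete sketch of that approach for doubles (the `v4df_u` typedef and the helper-function names are mine, not from the linked Q&A):

```c
/* Naturally-aligned (32-byte) vector of 4 doubles */
typedef double v4df __attribute__ ((vector_size (4 * sizeof(double))));

/* Same layout, but alignment lowered to 1, so dereferencing a pointer
   to it is safe at any address.  GCC emits an unaligned load/store. */
typedef double v4df_u __attribute__ ((vector_size (4 * sizeof(double)),
                                      aligned (1)));

static v4df load_4_doubles(const double *p) {
    return *(const v4df_u *)p;     /* unaligned vector load */
}

static void store_4_doubles(double *p, v4df v) {
    *(v4df_u *)p = v;              /* unaligned vector store */
}
```

Note that GCC's vector types are allowed to alias their underlying element type, so the pointer casts here don't violate strict aliasing the way an arbitrary struct cast would.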
You get a segfault from `*(v4df *)a` because presumably `a` isn't aligned on a 32-byte boundary, but you're using a vector type that does assume natural alignment. (Just like `__m256d` would if you dereferenced a pointer to it instead of using load/store intrinsics to communicate the alignment info to the compiler.)
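Conversely, if you raise the array's alignment to 32 bytes explicitly, the plain pointer-cast load becomes safe. A minimal sketch (the `load_from_aligned` name is mine):

```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

static v4df load_from_aligned(void) {
    /* aligned(32) guarantees the cast below doesn't fault */
    static double a[4] __attribute__((aligned(32))) = {1.0, 2.0, 3.0, 4.0};
    return *(v4df *)a;   /* aligned vector load: vmovapd with -mavx */
}
```

This is the same trade-off as `_mm256_load_pd` vs. `_mm256_loadu_pd`: the aligned form is only correct when you actually control the data's alignment.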