There is more than one way to skin a cat. When converting from AOS to SOA and back there are many variations. How many instructions you have available to schedule in your even/odd pipes, how many spare registers you have to spend, and the latency of the combination of instructions used will dictate which you should use.
The goal is to first find all variations where you can trade even for odd or odd for even instructions -- and minimize the number of registers used in the process (including shuffle masks used).
To specify shuffle masks, I'll use A-D for elements 1 through 4 of the first parameter and a-d for elements 1 through 4 of the second parameter. '0' is special: it means put a zero in the output for that element.
For example, s_ABab would take the first 2 elements in parameter 1 and the first 2 elements in parameter 2 and put them side by side into the output register.
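To make the notation concrete, here is a rough C sketch of how a mask name could be expanded into the 16 control bytes that shufb reads, plus a byte-level model of shufb itself. The function names `make_shuffle_mask` and `shufb_model` are my own inventions for illustration, and the sketch assumes 4-byte elements; treat it as a desk-checking aid, not as SPU code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Expand a 4-letter mask name such as "ABab" into 16 shufb control bytes.
 * A-D select words 0-3 of the first operand  (source bytes 0x00-0x0F),
 * a-d select words 0-3 of the second operand (source bytes 0x10-0x1F),
 * '0' uses control byte 0x80, which makes shufb emit a 0x00 byte. */
static void make_shuffle_mask(const char *name, uint8_t mask[16])
{
    assert(strlen(name) == 4);
    for (int word = 0; word < 4; ++word) {
        char c = name[word];
        for (int byte = 0; byte < 4; ++byte) {
            uint8_t ctl;
            if (c >= 'A' && c <= 'D') {
                ctl = (uint8_t)((c - 'A') * 4 + byte);
            } else if (c >= 'a' && c <= 'd') {
                ctl = (uint8_t)(0x10 + (c - 'a') * 4 + byte);
            } else {
                assert(c == '0');
                ctl = 0x80;
            }
            mask[word * 4 + byte] = ctl;
        }
    }
}

/* Byte-level model of "shufb rt, ra, rb, rc". */
static void shufb_model(const uint8_t ra[16], const uint8_t rb[16],
                        const uint8_t rc[16], uint8_t rt[16])
{
    uint8_t tmp[16]; /* lets rt alias an input */
    for (int i = 0; i < 16; ++i) {
        uint8_t c = rc[i];
        if ((c & 0xC0) == 0x80)      tmp[i] = 0x00; /* 10xxxxxx */
        else if ((c & 0xE0) == 0xC0) tmp[i] = 0xFF; /* 110xxxxx */
        else if ((c & 0xE0) == 0xE0) tmp[i] = 0x80; /* 111xxxxx */
        else tmp[i] = (c & 0x10) ? rb[c & 0x0F] : ra[c & 0x0F];
    }
    memcpy(rt, tmp, 16);
}
```

With this model, s_ABab expands to control bytes 0x00-0x07 followed by 0x10-0x17, which is the "first two words of each parameter, side by side" behaviour described above.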
Example - AOS to SOA, 1 element - 0 even, 3 odd, 1 shuffle mask, 9 cycles
Let's first consider the simple case of 4 input registers (in1 through in4), where we are interested in combining the first element of each into a single out register.
|shufb t1, in1, in2, s_ACac||t1 = in1.x, ?, in2.x, ?|
|shufb t2, in3, in4, s_ACac||t2 = in3.x, ?, in4.x, ?|
|shufb out, t1, t2, s_ACac||out = in1.x, in2.x, in3.x, in4.x|
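If you want to convince yourself that reusing s_ACac for all three shuffles really lands the four x values in order, a word-granular C model is enough. This is only a sketch: `vec4`, `shuf` and `gather_x_3shuf` are hypothetical names of mine, and `shuf` stands in for shufb with a mask named as in the text.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } vec4; /* one 128-bit register as 4 words */

/* Word-granular stand-in for shufb, driven by a mask name like "ACac". */
static vec4 shuf(vec4 p1, vec4 p2, const char *mask)
{
    vec4 r;
    for (int i = 0; i < 4; ++i) {
        char c = mask[i];
        if (c >= 'A' && c <= 'D') {
            r.w[i] = p1.w[c - 'A'];
        } else if (c >= 'a' && c <= 'd') {
            r.w[i] = p2.w[c - 'a'];
        } else {
            assert(c == '0');
            r.w[i] = 0;
        }
    }
    return r;
}

/* Three shuffles, one mask: gather in1.x..in4.x into one register. */
static vec4 gather_x_3shuf(vec4 in1, vec4 in2, vec4 in3, vec4 in4)
{
    vec4 t1 = shuf(in1, in2, "ACac"); /* in1.x, in1.z, in2.x, in2.z */
    vec4 t2 = shuf(in3, in4, "ACac"); /* in3.x, in3.z, in4.x, in4.z */
    return shuf(t1, t2, "ACac");      /* in1.x, in2.x, in3.x, in4.x */
}
```

The '?' slots in the listing are the z elements that come along for the ride; the final shuffle never reads them.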
Example - AOS to SOA, 1 element - 1 even, 2 odd, 2 shuffle masks, 7 cycles
This variation on the above splits up the even/odd pipe usage a bit at the cost of more masks.
|shufb t1, in1, in2, s_Aa00||t1 = in1.x, in2.x, 0, 0|
|shufb t2, in3, in4, s_00Aa||t2 = 0, 0, in3.x, in4.x|
|or out, t1, t2||out = in1.x, in2.x, in3.x, in4.x|
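The same kind of word-granular C model works for this variation. Again, this is a sketch under my own naming (`vec4`, `shuf`, `or4`, `gather_x_2shuf_or` are all hypothetical); the point is just to show that the two zero-filled halves never overlap, so a plain or merges them.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } vec4; /* one 128-bit register as 4 words */

/* Word-granular stand-in for shufb, driven by a mask name like "Aa00". */
static vec4 shuf(vec4 p1, vec4 p2, const char *mask)
{
    vec4 r;
    for (int i = 0; i < 4; ++i) {
        char c = mask[i];
        if (c >= 'A' && c <= 'D') {
            r.w[i] = p1.w[c - 'A'];
        } else if (c >= 'a' && c <= 'd') {
            r.w[i] = p2.w[c - 'a'];
        } else {
            assert(c == '0');
            r.w[i] = 0;
        }
    }
    return r;
}

/* Stand-in for the even-pipe "or" instruction. */
static vec4 or4(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.w[i] = a.w[i] | b.w[i];
    return r;
}

/* Two shuffles with two masks, then one or. */
static vec4 gather_x_2shuf_or(vec4 in1, vec4 in2, vec4 in3, vec4 in4)
{
    vec4 t1 = shuf(in1, in2, "Aa00"); /* in1.x, in2.x, 0, 0 */
    vec4 t2 = shuf(in3, in4, "00Aa"); /* 0, 0, in3.x, in4.x */
    return or4(t1, t2);               /* in1.x, in2.x, in3.x, in4.x */
}
```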
Example - SOA to AOS, 1 element - 0 even, 3 odd, 0 shuffle masks, 6 cycles
This example converts back from SOA to AOS, still working on 1 element. The first output is just in itself, so only out2 through out4 need an instruction.
|shlqbyi out2, in, 4||out2 = in.y, in.z, in.w, 0|
|shlqbyi out3, in, 8||out3 = in.z, in.w, 0, 0|
|shlqbyi out4, in, 12||out4 = in.w, 0, 0, 0|
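Only the first word of each output matters for this conversion. As a sanity check, here is a small C sketch that models shlqbyi as a left shift by whole bytes with zero bytes shifted in on the right. It is restricted to multiples of 4 bytes, which is all this example needs, and `vec4`, `shl_bytes` and `soa_to_aos_x` are hypothetical names of mine.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } vec4; /* one 128-bit register as 4 words */

/* Model of shlqbyi for byte counts that are multiples of 4:
 * words move toward element 0 and zeros fill in from the right. */
static vec4 shl_bytes(vec4 in, int bytes)
{
    assert(bytes >= 0 && bytes % 4 == 0);
    int words = bytes / 4;
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.w[i] = (i + words < 4) ? in.w[i + words] : 0;
    return r;
}

/* Spread in.x, in.y, in.z, in.w into the first word of four registers. */
static void soa_to_aos_x(vec4 in, vec4 out[4])
{
    out[0] = in;                /* no instruction needed */
    out[1] = shl_bytes(in, 4);  /* first word is in.y */
    out[2] = shl_bytes(in, 8);  /* first word is in.z */
    out[3] = shl_bytes(in, 12); /* first word is in.w */
}
```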
Know a transpose that I didn't list? Find a better one? Post in the comments and I'll update the post.
... will be on 2 element transposes.