I'm currently writing NEON code for the Qt PorterDuff SourceOver implementation. At the start, one has to decide between inline assembly, a separate .S file, or the ARM NEON intrinsics.
I have chosen to go with the ARM NEON intrinsics embedded in C++ code, for a couple of simple reasons. First, it is portable across GCC and RVCT; a .S file or inline assembly would not work with RVCT, which is what the Symbian people use. Second, I get type safety. The NEON registers can be treated as holding 8-bit, 16-bit, 32-bit or 64-bit signed or unsigned values. In low-level assembly you might pick the wrong operation, and the mistake is hard to see; with the intrinsics you get a compiler warning about it. One downside is that with some easy things I can make my compiler abort with an internal compiler error... but this will change over time.
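To make the type-safety point concrete, here is a minimal sketch of a SourceOver loop written with intrinsics. It is not the actual Qt code. It assumes premultiplied ARGB32 pixels on a little-endian target, and the names `mul_div255`, `source_over_pixel` and `source_over` are made up for this example. The NEON path is only compiled when the compiler defines `__ARM_NEON`; everywhere else the scalar reference handles all pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

// Rounded x * a / 255 without a division.
static inline uint8_t mul_div255(uint8_t x, uint8_t a)
{
    unsigned t = unsigned(x) * a + 128;
    return static_cast<uint8_t>((t + (t >> 8)) >> 8);
}

// Scalar reference: result = src + dst * (255 - src_alpha) / 255 per channel.
// Assumes premultiplied input, so the 8-bit sum does not overflow.
static inline uint32_t source_over_pixel(uint32_t d, uint32_t s)
{
    const uint8_t inv = static_cast<uint8_t>(255 - (s >> 24));
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const uint8_t sc = static_cast<uint8_t>(s >> shift);
        const uint8_t dc = static_cast<uint8_t>(d >> shift);
        out |= uint32_t(static_cast<uint8_t>(sc + mul_div255(dc, inv))) << shift;
    }
    return out;
}

#ifdef SKETCH_HAVE_NEON
// Same arithmetic as mul_div255, on eight 8-bit lanes at once.
static inline uint8x8_t mul_div255_neon(uint8x8_t x, uint8x8_t a)
{
    uint16x8_t t = vmull_u8(x, a);              // widening multiply: 8 x 16-bit
    t = vaddq_u16(t, vdupq_n_u16(128));
    t = vaddq_u16(t, vshrq_n_u16(t, 8));
    // vadd_u8(t, x) would be diagnosed here: t is uint16x8_t, not uint8x8_t.
    return vshrn_n_u16(t, 8);                   // narrow back to 8 x 8-bit
}
#endif

void source_over(uint32_t *dst, const uint32_t *src, size_t count)
{
    assert(dst != nullptr && src != nullptr);
    size_t i = 0;
#ifdef SKETCH_HAVE_NEON
    // Eight pixels per iteration; vld4_u8 splits them into B, G, R, A planes
    // (byte order of 0xAARRGGBB in little-endian memory).
    for (; i + 8 <= count; i += 8) {
        const uint8x8x4_t s = vld4_u8(reinterpret_cast<const uint8_t *>(src + i));
        uint8x8x4_t d = vld4_u8(reinterpret_cast<const uint8_t *>(dst + i));
        const uint8x8_t inv = vmvn_u8(s.val[3]);    // 255 - alpha
        for (int c = 0; c < 4; ++c)
            d.val[c] = vadd_u8(s.val[c], mul_div255_neon(d.val[c], inv));
        vst4_u8(reinterpret_cast<uint8_t *>(dst + i), d);
    }
#endif
    // Scalar tail (or the whole buffer when NEON is not available).
    for (; i < count; ++i)
        dst[i] = source_over_pixel(dst[i], src[i]);
}
```

The point of the sketch is the types: `vmull_u8` hands back a `uint16x8_t`, and feeding that into an intrinsic that expects `uint8x8_t` is something the compiler can complain about, whereas in hand-written assembly the same register would simply be reinterpreted.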
Next is the myth that GCC is crap and that the intrinsics are badly "scheduled". From looking at the generated assembly, the code is mostly arranged the way I wanted it to be. In one simple operation GCC put an LDR right in between the NEON load and store operations; with a simple change to the code this LDR was gone, and I should not see any of the described hazards.
Right now my ARM NEON code is slower than the C code that uses tricks, but that is entirely my fault, and I have some things I can try to make it faster. To be more specific, the ARM NEON code is four frames faster than the old C code that did not use tricks.