I think this is an example where C++ really shines. The code is not much longer than the Python code (44 vs. 33 lines, although the output of the Python version is much nicer), and on my computer it is about 140x faster than Python 2.7.1.
We wrote a distributed N-queens solver to test the Gifu Prefectural Computing Grid, which a former employer of mine helped develop, and those ~35 lines are literally all the production C code I have ever written. (Most of the grid ran in "graphical" mode, though, which a) solved in Java and b) solved only as fast as Swing could update the visual board. Once I realized that, I patched it to sleep frequently and run the C solver in the background anyway.)
This was ~5 years ago, and I don't remember the exact magnitude of the speedup of C over Java. I do remember it being rather less than I was expecting.
That's why you write a C extension for CPython when you need the performance. And if you don't need the performance, you stay sane by not having to write stuff in C++.
Using a library to compute the permutations, I guess? I haven't touched C++ for years, but I'd be surprised if the 44 lines include a permutations implementation.
For the curious, STL <algorithm> is one of those things you wish you had known about ages ago. Usually my code is much cleaner after I go through it and replace the relevant parts with STL calls.
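To make the permutations point concrete, here is a rough sketch of how a brute-force N-queens counter could lean on std::next_permutation from <algorithm>. This is not the 44-line program mentioned upthread (I haven't seen it); the function names diagonals_clear and count_queens_solutions are made up for illustration, and I'm assuming the plain "try every permutation of columns" approach rather than anything smarter.

```cpp
#include <algorithm>
#include <cstdlib>
#include <numeric>
#include <vector>

// cols[r] is the column of the queen placed in row r. Because cols is a
// permutation of 0..n-1, rows and columns are distinct by construction,
// so only the diagonals need checking.
// (Hypothetical helper name, not taken from the code discussed in the thread.)
bool diagonals_clear(const std::vector<int>& cols) {
    const int n = static_cast<int>(cols.size());
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (std::abs(cols[i] - cols[j]) == j - i)
                return false;  // two queens share a diagonal
    return true;
}

// Counts N-queens solutions by brute force over all n! column permutations.
// (Hypothetical function name; assumes n is small enough for n! to be feasible.)
int count_queens_solutions(int n) {
    std::vector<int> cols(n);
    std::iota(cols.begin(), cols.end(), 0);  // start from the sorted order 0, 1, ..., n-1
    int count = 0;
    do {
        if (diagonals_clear(cols)) ++count;
        // next_permutation returns false once it wraps back to sorted order
    } while (std::next_permutation(cols.begin(), cols.end()));
    return count;
}
```

Calling count_queens_solutions(8) from main should give 92, the well-known count for the 8x8 board. The whole permutation machinery is one library call, which is why the C++ version can stay close to the Python line count.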
The optimized version of the PyPy code is around 17x faster than the first version of the Python code. That makes your C++ code about 8x faster than the PyPy code (140 / 17 ≈ 8), but I consider any Python code within an order of magnitude of C++ good.