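That’s similar to how emoji skin color is implemented: a skin-tone modifier character (a color swatch) directly follows the face, and clients that support the convention draw the pair as a single glyph. Apple’s multi-person emoji, such as families, go a step further and combine several emoji using the existing zero-width joiner, a character designed for scripts whose letters may or may not join up to adjacent letters. Clients that don’t support the convention will just show the components one after another, e.g. a face and then the swatch.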
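To make the joining and fallback behavior concrete, here is a minimal Python sketch. The helper names `join_emoji` and `fallback_display`, and the `supported` set standing in for "sequences this client has a single glyph for", are hypothetical and only for illustration; real text renderers make this decision inside the font and shaping engine.

```python
from __future__ import annotations

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def join_emoji(*parts: str) -> str:
    """Glue emoji together with ZWJ to request a single combined glyph."""
    return ZWJ.join(parts)

def fallback_display(sequence: str, supported: set[str]) -> list[str]:
    """Return the glyphs a client would draw, in order.

    `supported` is a hypothetical set of joined sequences the client can
    render as one glyph. Anything else falls back to its components.
    """
    if sequence in supported:
        return [sequence]
    return [part for part in sequence.split(ZWJ) if part]

woman = "\U0001F469"  # WOMAN
girl = "\U0001F467"   # GIRL
family = join_emoji(woman, girl)

new_client = fallback_display(family, {family})  # one combined glyph
old_client = fallback_display(family, set())     # woman, then girl
```

The string itself is identical in both cases; only the client's knowledge of the sequence decides whether one glyph or several are drawn.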
The reason I’m imagining a new control character is that it’s unrealistic to expect every client to support every possible combination of even two emoji, let alone three or more. It would therefore be good if you could explicitly specify what to display as a fallback: only the first emoji, all of them, a completely different character, or nothing. Older clients would still display all the component characters, though, and there’s no way round that, because existing Unicode doesn’t provide a way to suppress the display of characters. (It includes the ASCII backspace character, but does not define it to mean “remove the previous character from the string”.)
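The point about backspace can be checked quickly in Python. This is only a small illustration using the standard `unicodedata` module; the sample string is made up.

```python
import unicodedata

text = "ab\bc"  # U+0008 BACKSPACE sits between "b" and "c"

# The backspace is stored as an ordinary code point; nothing is deleted.
length = len(text)
still_has_b = "b" in text

# Unicode classifies U+0008 as a generic control character (category Cc)
# and assigns it no editing semantics.
category = unicodedata.category("\b")
```

What a terminal or text field does on receiving that character is up to the application, not something the string itself encodes.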