With 2D raster art (or 3D voxel art, for that matter) there's very little you can do to stem the sort of combinatorial explosion you're describing. You can reduce the number of base variations by imposing certain standards -- say, "the hands are always placed in this position when the player carries a two-handed weapon, and in this position when the player carries a one-handed weapon" -- and then animate each weapon individually to match those hand positions.
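To make the hand-position standard concrete, here is a minimal sketch of how a weapon sprite could be placed against a shared set of hand anchors. Everything in it (the `HAND_ANCHORS` table, the `handle` offsets, the pixel coordinates, and the function name) is a made-up illustration, not something from a particular engine.

```python
# Hypothetical data: one hand anchor (x, y) per animation frame, per grip style.
# Every character animation is drawn so the hand lands on these points.
HAND_ANCHORS = {
    "one_handed": [(10, 14), (11, 13), (12, 14)],
    "two_handed": [(8, 12), (9, 12), (10, 13)],
}

# Hypothetical data: each weapon declares its grip style and the pixel inside
# its own sprite where the hand should hold it.
WEAPONS = {
    "sword": {"grip": "one_handed", "handle": (2, 14)},
    "staff": {"grip": "two_handed", "handle": (1, 10)},
}

def weapon_draw_position(weapon_name, frame):
    """Return the top-left position at which to draw the weapon sprite so that
    its handle pixel lines up with the character's hand on this frame."""
    weapon = WEAPONS[weapon_name]
    hand_x, hand_y = HAND_ANCHORS[weapon["grip"]][frame]
    handle_x, handle_y = weapon["handle"]
    return (hand_x - handle_x, hand_y - handle_y)
```

With this layout, adding a new one-handed weapon means drawing one weapon sprite and giving it a handle point, rather than redrawing every character animation.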
Another trick is to reduce the number of variants by not encoding final colors into the images directly -- then you only have to animate the wizard's robe once, and you can color it differently inside a pixel shader, or with palette tricks.
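As a rough sketch of the palette approach: the sprite stores small palette indices instead of final colors, and the actual colors are looked up at draw time. The sprite data, the palettes, and `apply_palette` below are all hypothetical; in a real game the lookup would more likely happen in a pixel shader or in hardware, not in a Python loop.

```python
# Hypothetical indexed sprite: each pixel is a palette index, not a color.
# 0 = transparent, 1 = shadow/outline, 2 = main fabric.
ROBE_SPRITE = [
    [0, 1, 1, 0],
    [1, 2, 2, 1],
    [1, 2, 2, 1],
    [0, 1, 1, 0],
]

# Hypothetical palettes mapping index -> RGBA. One animation, many colors.
RED_ROBE = {0: (0, 0, 0, 0), 1: (96, 0, 0, 255), 2: (200, 32, 32, 255)}
BLUE_ROBE = {0: (0, 0, 0, 0), 1: (0, 0, 96, 255), 2: (32, 32, 200, 255)}

def apply_palette(indexed_sprite, palette):
    """Resolve an indexed sprite into RGBA pixels using the given palette."""
    return [[palette[index] for index in row] for row in indexed_sprite]

red_pixels = apply_palette(ROBE_SPRITE, RED_ROBE)
blue_pixels = apply_palette(ROBE_SPRITE, BLUE_ROBE)
```

The robe animation is authored once as index data; each new color costs only a new palette table.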
There's no way to avoid the combinatorial nature of the number of assets you have. It's just math: the total is the product of the number of options on each axis. All you can do is A) reduce the number of variables in the equation (fewer axes of customization), and B) reduce the magnitude of those variables (fewer customizations within each axis). The above are two ways of doing that.
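To put rough numbers on that, here is a small sketch that counts sprites when every combination is baked into its own image. The axis names and counts are invented purely for illustration.

```python
from math import prod

def baked_count(axes):
    """Number of images needed if every combination of options is pre-drawn."""
    return prod(axes.values())

# Hypothetical customization axes and how many options each has.
axes = {"armor": 5, "weapon": 10, "color": 8, "direction": 8, "frame": 6}

full = baked_count(axes)  # 5 * 10 * 8 * 8 * 6 = 19200

# (A) Remove an axis entirely, e.g. apply color at runtime instead of baking it.
no_color = baked_count({k: v for k, v in axes.items() if k != "color"})  # 2400

# (B) Shrink an axis, e.g. support 4 facing directions instead of 8.
fewer_directions = baked_count({**axes, "direction": 4})  # 9600
```

Removing an axis divides the total by that axis's size, while shrinking an axis only scales it down proportionally, which is why eliminating whole axes tends to pay off most.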
3D rendering can help: you have 3D objects and animations (eliminating the need for rotation as one of the axes), textures and shaders can be changed independently of geometry, and skeletal animation separates geometry from animation. Still, it has its own set of problems. For example, if you animate a character for using a sword, but he's holding a long staff instead, the animation might cause the staff to awkwardly impale its holder mid-animation. In other words, even with the benefit of skeletal animation, you still might need one set of animations for wielding a sword and a different set for wielding a staff. You still might need different animation sets for different characters, too, depending on how different their body styles are. In general, 3D is less affected by combinatorial explosion, but it's not a magic bullet.