Good question. I think the answer lies somewhere in between, since both approaches have pros and cons. A model trained on more images with fewer instances tends to generalize better, but it might perform poorly on certain instances while overfitting to others. On the other hand, a model trained on more instances but fewer images might cover those specific instances well yet generalize poorly to unseen cases. A good rule of thumb is to always look at the actual real-world distribution and try to match it in your training, validation, and test data. Of course, once the model is deployed, you need to keep monitoring that distribution and re-collect, re-annotate, and retrain as needed. HTH.
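In case it's useful, here is a rough sketch of what the distribution check could look like in practice. It compares the per-class share of annotated instances in the training set against a sample collected after deployment, using total variation distance as one possible measure. The label lists, the class names, and the `DRIFT_THRESHOLD` value are all made up for illustration; you would plug in your own annotations and pick a threshold that suits your data.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class_name: fraction of all instances} for a list of labels."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    0.0 means identical class shares, 1.0 means no overlap at all.
    """
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Hypothetical data: one label per annotated instance.
train_labels = ["car"] * 70 + ["person"] * 25 + ["bicycle"] * 5
field_labels = ["car"] * 40 + ["person"] * 40 + ["bicycle"] * 20

train_dist = class_distribution(train_labels)
field_dist = class_distribution(field_labels)
drift = total_variation_distance(train_dist, field_dist)

# Arbitrary cut-off, chosen only for this example.
DRIFT_THRESHOLD = 0.1
needs_review = drift > DRIFT_THRESHOLD

print(f"drift = {drift:.2f}, needs review: {needs_review}")
```

The same comparison can be run between your train, validation, and test splits, or on other properties such as instances per image or object size, depending on what matters for your use case.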