Social media data is rich in user-generated content and social information; most user-generated content is text or multimedia. Because social media is a new source of data, social media research faces novel challenges. We discuss one such challenge: evaluation dilemmas. One evaluation dilemma is that there is often no ground truth for evaluating research findings on social media. Without ground truth, how can we perform credible and reproducible evaluation? An associated dilemma is that we frequently resort to crowdsourcing mechanisms such as Amazon's Mechanical Turk for evaluation tasks. Employing even a small group of Turkers incurs cost, yet a small group may be too small to be representative, and large-scale evaluation can be very costly. Can we find alternative ways of evaluation that are more objective, reproducible, or scalable? We use case studies to illustrate these dilemmas and show how to overcome the associated challenges in mining big social media data.
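When only a handful of Turkers label each item, one standard way to gauge whether the group is reliable is to measure inter-annotator agreement. The sketch below is a minimal illustration, not part of any study discussed here: the `fleiss_kappa` helper and the toy label counts are hypothetical, assuming each item is rated by the same number of annotators.

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an N x k matrix of category counts.

    ratings[i, j] = number of annotators who assigned item i to category j.
    Assumes every item is rated by the same number of annotators.
    """
    ratings = np.asarray(ratings, dtype=float)
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()

    # Per-item agreement: fraction of concordant annotator pairs.
    p_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the marginal category proportions.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical data: 5 items labeled "relevant"/"irrelevant" by 3 Turkers each.
counts = [[3, 0], [3, 0], [0, 3], [3, 0], [2, 1]]
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")  # ~0.659
```

A kappa near zero would signal that the small group agrees no better than chance, suggesting the evaluation needs more annotators or clearer task guidelines; a high kappa lends some credibility even to a small, inexpensive pool.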