If I told you that a tumor DNA sequencing study found that 25% of lung cancer patients have a mutation in the gene KRAS, would that truly mean that if I gathered together every person on this planet who has lung cancer, 25% of them would have a KRAS mutation? There are of course countless confounding factors that would likely make it not so, with age, race/ethnicity, and gender being a few. These traits are likely much more diverse in the global population than they could ever be in a study population. And although we all consider it obvious that there are many such factors to account for when performing biomedical research, I wonder how much we actually bother considering them in practice. When we perform an experiment with a well-known cancer cell line that researchers have been using since the 1970s, for example, do we ever stop to think about who the cell line came from? What if we found that the majority of cell lines came from one specific ethnic group, age group, or gender? Would we still consider our findings scientifically sound, or relevant to general populations? Or if you read a paper and see in the final figure that the authors were able to reproduce their findings in multiple cell lines or in a handful of patient samples, are you satisfied that their results have clinical relevance? Or do you question how representative and relevant those samples are to the target patient population?
Take this example. The Cancer Genome Atlas (TCGA) is a huge, multi-million dollar sequencing effort involving 20 different collaborating institutions across the U.S. and Canada to obtain genomic, transcriptomic, and epigenetic information from multiple cancer types. In total, TCGA obtained data from almost 11,000 patients. This effort has resulted in a slew of high-profile, high-impact papers and has provided enormous datasets that countless groups are mining and using to inform their research. These data have allowed researchers to discover many new targets that are being investigated for development into new cancer therapies. A vast cohort of researchers has invested a great deal of time and effort into follow-up studies directed by findings from TCGA. But who are the actual patients represented in these datasets, the people from whom we are drawing all of these sweeping biological conclusions? According to a brief report published in JAMA Oncology, 77% of the patients enrolled in the study were White, 12% were Black, and most other ethnic groups each accounted for less than 5%. Interestingly, this mirrors the composition of the U.S. population fairly well: according to the U.S. Census Bureau, as of 2015, 77% of the U.S. population is White, 13% is Black, and most other ethnic groups are at 5% or less. One dramatic difference to note, however, is that about 17% of the U.S. population identifies as Hispanic, but only 3% of the patients in the TCGA dataset are Hispanic.
One could maybe argue that the over-representation of certain ethnic groups in TCGA datasets is justified because it fairly reflects the racial composition of the U.S. population, and thus we are focusing on understanding the majority of patients. But even, for example, the teeny 1% minority of the U.S. population classified as American Indian/Alaska Native is still equivalent to about 3 million people: not at all a small number deserving of being ignored! Furthermore, cancer is not an American disease but a global one, and since the U.S. is arguably one of the most powerful, influential, and wealthiest nations in the world, it might be fair to hope that its research is invested in finding cures for a larger cohort of human beings, and not just the subset that happens to be the majority in this country. The JAMA report further shows that for virtually all the cancer types it examined, only the White patient group, and no other ethnic group, had enough TCGA samples to detect a 10% mutational frequency rate. In short, TCGA datasets may have so poorly represented certain ethnic groups that we could be missing out entirely on important biology that drives their cancers. Thus, although TCGA is typically thought of as a huge dataset that is representative of a diverse population, the reality is that it may only be highlighting the biology of a specific subset of individuals.
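To get a feel for why group size matters here, consider a deliberately simplified sketch. It is not the method used in the JAMA report; it only asks how likely it is that a mutation present in 10% of patients shows up at least once among n sequenced tumors, assuming patients are independent. The function names and the 90% confidence threshold are my own choices for illustration.

```python
def prob_at_least_one(n_patients: int, mutation_freq: float) -> float:
    """Chance that at least one of n_patients carries the mutation,
    assuming each patient independently carries it with mutation_freq."""
    return 1.0 - (1.0 - mutation_freq) ** n_patients


def min_patients(mutation_freq: float, confidence: float) -> int:
    """Smallest group size at which the mutation is seen at least once
    with probability >= confidence (an illustrative threshold)."""
    n = 0
    while prob_at_least_one(n, mutation_freq) < confidence:
        n += 1
    return n


if __name__ == "__main__":
    # Group size needed to see a 10%-frequency mutation at least once,
    # 90% of the time.
    print(min_patients(0.10, 0.90))  # prints 22

    # Hypothetical group sizes, not taken from TCGA.
    for n in (5, 10, 25, 50):
        print(n, round(prob_at_least_one(n, 0.10), 3))
```

Even under this generous definition of "detect" (a single observation), a group needs roughly two dozen patients per cancer type. Confidently estimating the frequency, or distinguishing it from background mutation, takes considerably more, which is why a group making up only a few percent of a cohort can easily fall short.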
Obviously, it is much easier to point fingers and complain than it is to actually do something to address the problem. And the issue, of course, is much more complex than what I discuss here. White patients, Black patients, and Asian patients aren’t exactly homogeneous groups in and of themselves: the diversity within a socially defined “race” is not something to be dismissed either. Regardless, this is certainly an important issue and one that we need to discuss more. TCGA proudly claims on its website that it obtained data from 33 cancer types, including 10 rare cancer types, but I hope that in the future we can make similar claims about types of people.
Peer edited by Tamara Vital.