Cells and Samples Have Race Too!


Science should reflect the diversity of its subjects.

If I told you that a tumor DNA sequencing study found that 25% of lung cancer patients have a mutation in the gene KRAS, would that truly mean that if I gathered together every person on this planet who has lung cancer, 25% of them would have a KRAS mutation? There are, of course, countless confounding factors that would likely make it not so: age, race/ethnicity, and gender, to name a few. These traits are likely far more diverse in the global population than they could ever be in a study population. And although we all consider it obvious that there are many such factors to account for when performing biomedical research, I wonder how much we actually bother to consider them in practice. When we perform an experiment with a well-known cancer cell line that researchers have been using since the 1970s, for example, do we ever stop to think about who the cell line came from? What if we found that the majority of cell lines came from one specific ethnic group, age group, or gender? Would we still consider our findings scientifically sound, or relevant to general populations? Or if you read a paper and see in the final figure that the authors were able to reproduce their findings in multiple cell lines or in a handful of patient samples, are you satisfied that their results have clinical relevance? Or do you question how representative and relevant those samples are to the target patient population?

Take this example. The Cancer Genome Atlas (TCGA) is a huge, multi-million dollar sequencing effort involving 20 different collaborating institutions across the U.S. and Canada to obtain genomic, transcriptomic, and epigenetic information from multiple cancer types. In total, TCGA obtained data from almost 11,000 patients. This effort has resulted in a slew of high-profile, high-impact papers and has provided enormous datasets that countless groups are mining and using to inform their research. These data have allowed researchers to discover many new targets that are now being investigated as the basis for new cancer therapies. A vast cohort of researchers has invested a great deal of time and effort in following up on findings from TCGA. But who are the actual patients represented in these datasets, from whom we are drawing all of these sweeping biological conclusions? According to a brief report published in JAMA Oncology, 77% of the patients enrolled in the study were White, 12% were Black, and most other ethnic groups were each represented at less than 5%. Interestingly, this mirrors the composition of the U.S. population fairly well: according to the U.S. Census Bureau, as of 2015, 77% of the U.S. population is White, 13% is Black, and most other ethnic groups are at 5% or less. One dramatic difference to note, however, is that about 17% of the U.S. population identifies as Hispanic, but only 3% of the patients in the TCGA dataset are Hispanic.


Cancer is a global disease, and requires study of all patient groups.

One could maybe argue that the over-representation of certain ethnic groups in TCGA datasets is justified because it fairly reflects the racial composition of the U.S. population, and thus we are focusing on understanding the majority of patients. But even the teeny 1% of the U.S. population classified as American Indian/Alaska Native, for example, is still equivalent to about 3 million people: hardly a number small enough to be ignored! Furthermore, cancer is not an American disease but a global one, and since the U.S. is arguably one of the most powerful, influential, and wealthy nations in the world, it might be fair to hope that research in the U.S. is invested in finding cures for a larger cohort of human beings, and not just the subset that happens to be the majority in this country. The JAMA report further shows that, for virtually all of the cancer types it examined, only the White patient group had enough TCGA samples to detect a mutation present at a 10% frequency; no other ethnic group did. In short, TCGA datasets may represent certain ethnic groups so poorly that we could be missing out entirely on important biology that drives their cancers. Thus, although TCGA is typically thought of as a huge dataset that is representative of a diverse population, the reality is that it may only be highlighting the biology of a specific subset of individuals.
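To make the sample-size point concrete, here is a rough sketch. It is not the method used in the JAMA report. It assumes a simple binomial model in which each patient independently carries a given mutation with a fixed probability, and it asks how many patients a subgroup needs before we would expect to see at least a handful of mutated samples. The function names, the threshold of 5 mutated samples, and the 90% target are all illustrative assumptions.

```python
from math import comb


def prob_at_least(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing k or more
    mutated samples among n patients when the true frequency is p."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def min_samples(p: float, k: int, power: float, n_max: int = 10_000):
    """Smallest n with P(X >= k) >= power, or None if not reached by n_max.

    k and power are assumed thresholds chosen for illustration only.
    """
    for n in range(k, n_max + 1):
        if prob_at_least(n, p, k) >= power:
            return n
    return None


if __name__ == "__main__":
    # Compare a 10% mutation frequency with a rarer 5% one.
    for freq in (0.10, 0.05):
        n = min_samples(freq, k=5, power=0.90)
        print(f"frequency {freq:.0%}: about {n} patients needed")
```

Under this simplified model, the number of patients required grows as the mutation becomes rarer. That is the intuition behind the concern: a group that makes up only a few percent of roughly 11,000 patients, split further across 33 cancer types, can easily fall below whatever threshold a real analysis requires.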


Diversity within race needs to be considered in scientific studies.

Obviously, it is much easier to point fingers and complain than it is to actually do something to address the problem. And the issue, of course, is much more complex than even what I discuss here. White patients, Black patients, and Asian patients aren’t exactly homogeneous groups in and of themselves: the diversity within a socially defined “race” is not something to be dismissed either. Regardless, this is certainly an important issue and one that we need to discuss more. TCGA proudly claims on its website that it obtained data from 33 cancer types, including 10 rare cancer types, but I hope in the future we can make similar claims about types of people.

Peer edited by Tamara Vital.

The Trouble with Reproducibility in Science

As scientists, many of us have read a paper, been inspired by the glamorous data, carefully followed the methods section in order to replicate the results in our own hands, and failed to validate the original results. I’ve often attributed these issues to my own inexperience and naiveté as a young scientist, but over the past several years, the irreproducibility of published data has become a widespread problem. This lack of reproducibility could be perceived as a manifestation of poor experimental design and faulty interpretation of results by researchers. However, this seems counterintuitive in that so much of a scientist’s reputation rests upon the quality of his or her publication record.

Just how rampant is the reproducibility problem?

A 2012 study led by C. Glenn Begley (then the head of cancer research at Amgen, Inc.) probed the boundaries of reproducibility in cancer literature by investigating 53 landmark publications from reputable labs and high impact journals. Despite closely following the methods sections of those publications, and even consulting with the authors and sharing reagents, Begley et al. found that the data in 47 of the 53 publications could not be reproduced; only 6 held up under scrutiny. A similar study performed at Bayer Healthcare in Germany replicated only 25% of the publications examined. These reproducibility issues do not only plague the clinical sciences. The field of psychology recently came under scrutiny during an effort called the ‘Reproducibility Project: Psychology.’ Of 100 published studies, only 39 could be reproduced by independent researchers. These facts are at once shocking, depressing, and infuriating, especially when considering preclinical publications that spawn countless secondary publications, which may lead to expensive and faulty clinical trials that inevitably fail. Unfortunately, the increasing number of flawed publications has led to a precipitous decline in the public’s trust in science and medicine.

What’s causing all of these issues?
