As the amount of digital health data continues to grow – providing more and more insight into our health – some epidemiologists have begun to wonder whether they are still needed to identify, describe and vet solutions for problems in public health. Doesn’t the data speak for itself?
In an article in the 2021 Annual Review of Public Health, Margaret Handley, Paul Wesson, Gilmer Valdes, Kristefer Stojanovski and Yulin Hswen make the case that the methods of epidemiology are changing in response to the explosion of data, but that the field’s fundamental principles are more important than ever. The core challenge of epidemiology has always been not simply to obtain data, but to obtain data that accurately reflects the health of a given population. And, as a growing number of scholars argue, Big Data has largely failed to represent marginalized groups accurately.
In fact, the book Automating Inequality spurred Handley to propose that Annual Reviews – a prestigious set of journals highlighting top-level issues across a range of scientific disciplines – include an overview of how health research using Big Data can minimize its risk of baking in inequality. Once greenlit, Handley recruited what she calls a “dream team” of co-authors, including UCSF early career faculty members Wesson and Hswen.
The team decided that addressing potential distortions in data sets – which might underrepresent those who are less digitally connected, to take an obvious example – must be a primary consideration when using Big Data in research.
“We wanted to elevate the thinking and not just the techniques,” Handley explained.
While Big Data is generally defined in terms of five V’s – volume, variety, velocity, veracity and value – the authors argue for a sixth V – virtue – as a reminder that one of a researcher’s first questions should be whether a given dataset accurately reflects diversity and inequities. Addressing that question may mean limiting the types of questions asked of Big Data, or employing innovative statistical methods to account for the data’s shortcomings.
How to handle social media?
Social media posts have long appealed to researchers: they’re abundant (volume) and provide immediate information (velocity). But critics point out that big social data from online platforms may not reflect all demographics or generalize to all populations. Hswen, whose work focuses on using social media data responsibly, sees online social data as a powerful early-alert system for public health and as a way to capture more organic, authentic information.
In the section of the article on this issue, Hswen and her colleagues consider food-borne illness as an example of social media data successfully predicting disease outbreaks. When people get food poisoning, they may not know to report it to the Food and Drug Administration. Those who suspect they got sick at a restaurant are more likely to post a searing review on Yelp, sometimes graphically describing their symptoms. If a disproportionate number of these reviews pop up in a single area, it offers an early warning that a food-borne pathogen may be circulating. If a number of reviews mention a specific food item – chicken, say – researchers can also begin to look for the source. All of this can happen before the FDA has enough data to know there’s a problem.
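To make the mechanism concrete, here is a minimal sketch of the kind of spike detection such a system might use – a simplified illustration, not the authors’ actual pipeline. The symptom keywords, review data and flagging threshold below are all invented.

```python
from collections import Counter

# Illustrative only: terms, data and threshold are invented, not drawn
# from the Annual Review article.
SYMPTOM_TERMS = {"food poisoning", "vomiting", "nausea", "sick after eating"}

def mentions_symptoms(text):
    """True if a review mentions any symptom keyword."""
    text = text.lower()
    return any(term in text for term in SYMPTOM_TERMS)

def flag_areas(reviews, baseline, min_excess=5):
    """Flag areas whose weekly count of symptom-mentioning reviews
    exceeds that area's historical baseline by more than min_excess.

    reviews: iterable of (area, review_text) pairs for the current week
    baseline: dict mapping area -> typical weekly symptom-review count
    """
    counts = Counter(area for area, text in reviews if mentions_symptoms(text))
    return {area: n for area, n in counts.items()
            if n > baseline.get(area, 0) + min_excess}

# A disproportionate cluster of symptom reviews in one area gets flagged.
this_week = [
    ("downtown", "Great tacos, will come back!"),
    ("downtown", "Terrible vomiting all night after the chicken special."),
    ("downtown", "My whole table got food poisoning here."),
]
print(flag_areas(this_week, baseline={"downtown": 0}, min_excess=1))
# -> {'downtown': 2}
```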
“We’re actually getting additional information that is not being covered by traditional forms, and we could get it much earlier,” Hswen explained.
The missing population problem
Virtue in Big Data research may mean, as Hswen argues, recognizing when your data is good enough to give a clear signal. But it may also mean recognizing that your data falls short. Wesson focuses on groups that are traditionally hard to find and count. These include people whose health risks stem from socially stigmatized behaviors, and people who aren’t likely to be reached by a given study’s methods. One can see how a WhatsApp survey wouldn’t reach one’s grandfather, and a landline survey wouldn’t reach one’s nieces and nephews.
In a section of the article on “data augmentation,” the researchers, led by Wesson, propose some ways to improve representation of these groups in Big Data analyses.
“Depending on to what degree the population that is traditionally absent from data sets is present in some way or another in the specific data being used,” Wesson explained, researchers may be able to “triangulate who that population is and what their characteristics are by taking different types of data sets and weaving them together to try to get the bigger picture from incomplete snapshots.”
If a data set doesn’t include enough members of a group to draw meaningful conclusions about them, researchers can bring in a second data set that does. Looking at multiple, unconnected data sets, researchers can also employ multiple systems estimation (MSE), a method that originated in wildlife biology. MSE uses the degree of overlap between lists representing two independent counts to estimate what percent of a population has been accounted for: heavy overlap suggests the lists have captured most of the population, while sparse overlap suggests many people remain uncounted. MSE only estimates the size of the missing population – the group’s health behaviors and risks remain unknown – but it at least makes a study more transparent about whom its findings apply to.
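In its simplest two-list form, MSE reduces to the Lincoln–Petersen estimator familiar from capture-recapture studies: if one list counts n_A people, another counts n_B, and m people appear on both, the estimated population size is n_A × n_B / m. Here is a minimal sketch of that arithmetic; the registries and counts are invented for illustration, and the names (lincoln_petersen, registry_a) are ours, not the authors’.

```python
def lincoln_petersen(list_a, list_b):
    """Two-list multiple systems estimation (the Lincoln-Petersen
    estimator): total ~ |A| * |B| / |A & B|, assuming the two lists
    are independent samples of the same population."""
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between lists; cannot estimate size")
    return len(a) * len(b) / overlap

# Invented example: two registries that each partially cover a hidden group.
registry_a = {f"person{i}" for i in range(200)}        # 200 people on list A
registry_b = {f"person{i}" for i in range(170, 320)}   # 150 people, 30 shared

total = lincoln_petersen(registry_a, registry_b)       # 200 * 150 / 30
print(total)                                           # 1000.0
observed = len(registry_a | registry_b)                # 320 distinct people
print(f"lists cover ~{observed / total:.0%} of the estimated population")
# -> lists cover ~32% of the estimated population
```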
“A major theme of the article is thinking about how the data got to you in the first place,” Wesson said. “Big Data is great, but that doesn’t mean we can ignore the fundamental principles of epidemiology.”