For decades, one of the mainstay tools for survey research has been the landline telephone. A whole science and a multimillion-dollar industry are based on recruiting and surveying research participants by phone. It’s the way we measure who people will vote for, what products they buy – and our attitudes about health and medicine.
But today, fewer people have landlines, and laws that ban robotic dialing of cellphones mean they aren’t easily used for this kind of research. The people with landlines are disproportionately white, older and rural, so surveying them doesn’t lead to a sample that is representative of the entire population. But there is a place where we can find people who these landline surveys are missing: Twitter.
Our team – a public health researcher who studies vaccine refusal, a computer scientist who develops computational tools, and a systems integration expert – is working on a method to analyze tweets in real time. We plan to analyze attitudes about vaccines to figure out how to use Twitter as a social science research tool to provide an accurate, real-time sample of a much larger population.
What do we know about people on Twitter?
Wait, you say: Aren’t people already doing research using Twitter? Yes, but at the moment most Twitter research has significant limitations.
Researchers who lack computer science expertise may find tweets difficult to gather, categorize and quantify. Social scientists, in particular, need to be able to sort people by demographics and other characteristics.
But it is not always easy to glean that kind of information about people on Twitter. At the moment, we don’t have a good method for analyzing all of the information that people put out there every day. Our research team is trying to develop a way to do just that. The idea is that in the future other researchers can use the same method to study social science questions on social media too.
What can public opinion research do for health?
Think back to the early days of the AIDS epidemic, and how critical to the prevention effort it was for researchers to understand and correct people’s misconceptions about transmission of the disease, or to identify which groups were engaging in risky behaviors so they could be approached with tailored messages.
Imagine what a skewed picture we would have had if researchers had been talking about attitudes, beliefs and practices with too many older, white people in rural areas and not enough younger people and minorities in cities.
Policymakers and public health officials depend on accurate data to make decisions, and in a crisis, researchers can’t afford to get it wrong. This is why we are trying to find a way to use Twitter to fill in the gaps in our current surveys.
Getting a representative sample
Today we know that for a survey to be accurate, it needs to poll a large enough group of people. It also needs to sufficiently represent groups of people who tend to participate less often in research, such as men compared with women, or African-Americans compared with white Americans, and whose attitudes, beliefs and behaviors are therefore under-represented in traditional research.
In the early days of social research in the 1920s and 1930s researchers typically just found people available on the street, or mailed surveys to people whose addresses were easy to get because they had telephones, owned cars, or subscribed to magazines. These samples were often great indicators of what mostly white, affluent Americans thought, but were wildly inaccurate when it came to taking the pulse of the nation.
But a few years before World War II, pollsters like Gallup, Harris and Roper started using a new technique called sampling. This careful selection of survey participants allowed pollsters to interview a small number of people and generalize their views to accurately represent people across the country, using estimates based on statistical probability.
Like those early polls, our current survey methods aren’t that good at generating a representative sample of the population to measure opinions and attitudes. We think our approach to using social media will go a long way toward correcting this sampling problem.
Surveys need to go where people are: online
Young people and minorities are the heaviest users of social media platforms like Twitter. And groups of all ages, races and socioeconomic levels are becoming active social media users. We can use the wealth of information they produce to fill in the gaps in existing social research surveys. But first, we need to figure out how to do that.
To develop a method for using Twitter to understand public opinion in real time, our team is going to take millions of tweets related to vaccinations, such as tweets from the 2009-2010 H1N1 (swine flu) pandemic, and compare them to survey data collected during the pandemic, such as research about vaccine attitudes.
Then we are going to compare thousands of existing survey responses about H1N1 with these tweets until we can parse out patterns in the Twitter data that resemble proven patterns from the surveys.
Using geolocation, language recognition algorithms and existing knowledge of group attitudes and narratives, we are going to match data from Twitter with responses in surveys, paying special attention to demographic groups that are well represented online. If they match, then in the future we can go straight to the quicker, cheaper Twitter to get the information we’re looking for.
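One way to picture what "matching" could mean here is a group-by-group comparison of how often a given attitude shows up in each source. The sketch below is only an illustration under stated assumptions, not our actual method: the group labels, the counts, the 5-percentage-point tolerance and the names `GroupEstimate` and `compare_sources` are all hypothetical, and it assumes each tweet has already been assigned to a demographic group and classified as expressing concern about the vaccine or not.

```python
from dataclasses import dataclass


@dataclass
class GroupEstimate:
    concerned: int  # people (or accounts) expressing concern about the vaccine
    total: int      # everyone observed in the group

    @property
    def share(self) -> float:
        # Proportion of the group expressing concern
        return self.concerned / self.total


def compare_sources(survey, twitter, tolerance=0.05):
    """Return {group: (survey_share, twitter_share, matches)} for groups in both sources."""
    result = {}
    for group in survey.keys() & twitter.keys():
        s = survey[group].share
        t = twitter[group].share
        # "matches" is True when the two shares differ by no more than the tolerance
        result[group] = (s, t, abs(s - t) <= tolerance)
    return result


# Made-up counts for illustration only; these are not real survey or Twitter results.
survey = {"group_a": GroupEstimate(120, 200), "group_b": GroupEstimate(45, 150)}
twitter = {"group_a": GroupEstimate(580, 1000), "group_b": GroupEstimate(450, 1000)}

comparison = compare_sources(survey, twitter)
for group in sorted(comparison):
    s, t, ok = comparison[group]
    print(f"{group}: survey={s:.2f} twitter={t:.2f} match={ok}")
```

With these invented numbers, the two sources agree for the first group and disagree for the second. A real comparison would also need to account for sampling error in both sources, which a fixed tolerance does not capture.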
We want to get to the point where we can say “Here’s what we know from surveys: that during the swine flu pandemic, for example, Hispanic parents were far more worried about the vaccine than African-American or white parents, but still vaccinated their children in much higher numbers,” and then use algorithms, language recognition software and other analytical tools to detect the same attitudes from the same group on Twitter.
Such research could complement data from existing surveys, fill in our gaps in knowledge about groups who are often under-represented in surveys, and possibly be generalized to a larger population. And once we do all that comparison, going forward we could apply the same principles about group identity and opinion formation to Twitter data evolving in real time.
What will this data be good for?
In an unfolding public health situation, we can gather and analyze that data – statistically – the same way we could analyze the much slower, much more expensive survey data. This can help public health officials understand where and how to target messages on the fly. Although we are starting with attitudes and behaviors about vaccination, our tool could be used in the same way for any other health issue.
What this means to public health researchers, and to the taxpayers who often fund their studies, is reliable data, delivered much faster and at a much lower cost than what we have now. The bonus is that, because of the current demographics of Twitter, we will also have information on those hard-to-reach younger people and racial and ethnic minorities.