Background: The lack of publicly available and culturally relevant data sets on African American and bilingual/Spanish-speaking Hispanic adults’ disease prevention and health promotion priorities presents a major challenge for researchers and developers who want to create and test personalized tools built on and aligned with those priorities. Personalization depends on prediction and performance data. A recommender system (RecSys) could predict the most culturally and personally relevant preventative health information and serve it to African American and Hispanic users via a novel smartphone app. However, early in a user’s experience, a RecSys can face the “cold start problem” of serving untailored and irrelevant content before it learns user preferences. For underserved African American and Hispanic populations, who are consistently being served health content targeted toward the White majority, the cold start problem can become an example of algorithmic bias. To avoid this, a RecSys needs population-appropriate seed data aligned with the app’s purposes. Crowdsourcing provides a means to generate population-appropriate seed data.
Objective: Our objective was to identify and test a method to address the lack of culturally specific preventative personal health data and sidestep the type of algorithmic bias inherent in a RecSys not trained on data from the population of focus. We did this by collecting a large amount of data quickly and at low cost from members of the population of focus, thereby generating a novel data set based on prevention-focused, population-relevant health goals. We seeded our RecSys with data collected anonymously via Amazon Mechanical Turk (MTurk) from self-identified Hispanic and self-identified non-Hispanic African American/Black adult respondents.
Methods: MTurk provided the crowdsourcing platform for a web-based survey in which respondents completed a personal profile and a health information–seeking assessment and provided data on family health history and personal health history. Respondents then selected their top 3 health goals related to preventable health conditions and, for each goal, reviewed the top 3 information returns and rated each on its importance, its personal utility, whether it should be added to their personal health library, and their satisfaction with the quality of the information returned. This paper reports the article ratings because our intent was to assess the benefits of crowdsourcing to seed a RecSys. The analysis of the data from health goals will be reported in future papers.
Results: The MTurk crowdsourcing approach generated 985 valid responses from 485 (49%) self-identified Hispanic and 500 (51%) self-identified non-Hispanic African American adults over the course of only 64 days at a cost of US $6.74 per respondent. Respondents rated 92 unique articles to inform the RecSys.
Conclusions: Crowdsourcing platforms such as MTurk offer researchers a quick, low-cost means to avoid the cold start problem for algorithms and to sidestep bias and low relevance for an intended population of app users. Seeding a RecSys with responses from people like the intended users allows for the development of a digital health tool that can recommend information to users based on similar demography, health goals, and health history. This approach minimizes potential initial gaps in algorithm performance; allows for quicker algorithm refinement in use; and may deliver a better user experience to individuals seeking preventative health information to improve health and achieve health goals.
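The seeding idea can be illustrated with a minimal sketch. This is not the RecSys described in this paper. It assumes a simple mean-rating ranker with demographic fallback, and the data rows, group labels, article IDs, 1-5 rating scale, and the function names `mean_ratings` and `cold_start_recommend` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical seed ratings: (group, health_goal, article_id, rating).
# The 1-5 scale and every value below are illustrative assumptions.
SEED = [
    ("hispanic", "diabetes", "a1", 5),
    ("hispanic", "diabetes", "a2", 3),
    ("hispanic", "diabetes", "a1", 4),
    ("african_american", "diabetes", "a2", 5),
    ("african_american", "diabetes", "a3", 4),
    ("african_american", "hypertension", "a4", 5),
]


def mean_ratings(rows):
    """Return {article_id: mean rating} for the given seed rows."""
    totals = defaultdict(lambda: [0, 0])  # article_id -> [sum, count]
    for _, _, article, rating in rows:
        totals[article][0] += rating
        totals[article][1] += 1
    return {article: s / n for article, (s, n) in totals.items()}


def cold_start_recommend(seed, group, goal, k=3):
    """Rank articles for a brand-new user using only seed data.

    Prefer ratings from seed respondents who share the user's group and
    goal; fall back to the goal alone, then to all seed ratings.
    """
    matched = [r for r in seed if r[0] == group and r[1] == goal]
    if not matched:
        matched = [r for r in seed if r[1] == goal]
    if not matched:
        matched = list(seed)
    means = mean_ratings(matched)
    # Highest mean first; ties broken by article ID for determinism.
    return sorted(means, key=lambda a: (-means[a], a))[:k]


print(cold_start_recommend(SEED, "hispanic", "diabetes"))
```

In this sketch, a new user with no interaction history still receives articles ranked by respondents with a similar profile, which is the behavior that seed data is meant to provide before the algorithm learns individual preferences.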