Sources of Big Data in Medicine

A simple definition of big data in medicine is “the totality of data related to patient healthcare and well-being” (Raghupathi 2014). But what exactly are these types of data, and where do they come from?

The following is a broad overview of the types and sources of big data of interest to health care providers, researchers, payers, policymakers, and industry. These categories are not mutually exclusive, because the same data can originate from a variety of sources.

Nor is this list exhaustive, because the practical application of big data analytics will surely continue to expand.

Clinical Information Systems

These are traditional sources of clinical data that health care providers are accustomed to viewing.

  • Electronic health records (EHRs) collect, store, and display information such as demographics, past medical history, active medical problems, immunizations, allergies, medications, vital signs, results from laboratory and radiology tests, pathology reports, progress notes created by health care providers, and administrative and financial documents.

  • Electronic medical records (EMRs) are not identical to EHRs; they usually pertain to data held by a particular physician.

  • Health information exchanges serve as hubs between disparate clinical information systems.

  • Patient registries are maintained by health care organizations on their own patients and are often linked to the EHR. Other registries track immunizations, cancer, trauma, and other public health issues on a wider geographic scale.

  • Patient portals allow patients to access personal health information stored in a health care organization’s EHR. Some patient portals also allow users to request prescription refills and exchange secure electronic messages with the health care team.

  • Clinical data warehouses aggregate patient-level data from multiple clinical information systems, such as EHRs and the other sources listed above.
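
To make the idea of a clinical data warehouse more concrete, here is a minimal sketch in Python of grouping records from several systems under one patient identifier. The source names, field names, and values are all hypothetical and invented for illustration; a real warehouse would also have to handle patient matching, data standards, and access controls.

```python
from collections import defaultdict

# Hypothetical extracts from three source systems. Every field name and
# value here is invented for illustration only.
ehr_rows = [
    {"patient_id": "P001", "allergy": "penicillin"},
    {"patient_id": "P002", "allergy": None},
]
lab_rows = [
    {"patient_id": "P001", "test": "HbA1c", "value": 7.2},
    {"patient_id": "P001", "test": "LDL", "value": 130},
]
registry_rows = [
    {"patient_id": "P002", "registry": "immunization"},
]


def build_warehouse(**sources):
    """Group rows from every source system under the patient they belong to."""
    grouped = defaultdict(lambda: defaultdict(list))
    for source_name, rows in sources.items():
        for row in rows:
            # Copy the row without its key so the source data stays untouched.
            record = {k: v for k, v in row.items() if k != "patient_id"}
            grouped[row["patient_id"]][source_name].append(record)
    # Convert the nested defaultdicts to plain dicts for predictable lookups.
    return {pid: dict(by_source) for pid, by_source in grouped.items()}


warehouse = build_warehouse(ehr=ehr_rows, labs=lab_rows, registry=registry_rows)
```

Looking up `warehouse["P001"]` then returns that patient's EHR and lab records side by side, which is the patient-level view described above.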

    Claims Data From Payers

    Public payers (e.g. Medicare) and private payers have large repositories of claims data on their beneficiaries.

    Research Studies

    Research databases contain information about study participants, experimental treatments, and clinical outcomes. Large studies are usually sponsored by pharmaceutical companies or government agencies. An application of personalized medicine is to match individual patients with effective treatments, based on patterns in clinical trials data.

    This approach goes beyond the usual application of evidence-based medicine, in which a health care provider determines whether a patient shares broad characteristics (e.g. age, gender, race, clinical status) with trial participants. With big data analytics, it is possible to select a treatment based on much more granular information, such as the genetic profile of a patient’s cancer (see below).

    Clinical decision support systems (CDSS) have also been developing rapidly and now represent a large part of artificial intelligence (AI) in medicine. They use patient data to assist clinicians with their decision-making and are often combined with EHRs.

    Genetic Databases

    The repository of human genetic information continues to accumulate at a rapid pace.

    Since the Human Genome Project was completed in 2003, the cost of human DNA sequencing has fallen a million-fold. The Personal Genome Project (PGP), launched in 2005 by Harvard Medical School, seeks to sequence and publicize the complete genomes of 100,000 volunteers from around the world. The PGP is itself a prime example of a big data project because of the sheer volume and variety of its data: a personal genome contains about 100 gigabytes of data, and in addition to sequencing genomes, the PGP collects data from EHRs, surveys, and microbiome profiles.
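
For a rough sense of scale, the figures quoted above (about 100 gigabytes per genome, 100,000 volunteers) can be multiplied out. The short Python calculation below is a back-of-the-envelope sketch that assumes decimal units and ignores compression as well as the EHR, survey, and microbiome data.

```python
# Figures taken from the text above; decimal (SI) units are an assumption.
GB_PER_GENOME = 100          # approximate size of one personal genome
TARGET_VOLUNTEERS = 100_000  # the project's stated goal
GB_PER_PB = 1_000_000        # 1 petabyte = 10^6 gigabytes in decimal units

total_gb = GB_PER_GENOME * TARGET_VOLUNTEERS
total_pb = total_gb / GB_PER_PB

print(f"{total_gb:,} GB = {total_pb:g} PB")  # prints "10,000,000 GB = 10 PB"
```

Under these assumptions, the genome data alone would come to roughly 10 petabytes.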

    A number of companies offer direct-to-consumer genetic sequencing for health, personal traits, and pharmacogenetics on a commercial basis.

    This personal information could be subjected to big data analytics, and the services that collect it are subject to regulation. 23andMe stopped offering health-related genetic reports to new customers as of November 22, 2013, to comply with the U.S. Food and Drug Administration. In 2015, however, the company resumed offering certain health components of its saliva-based genetic test, this time with the FDA’s approval.

    Public Records

    The government keeps detailed records of events related to health, such as immigration, marriage, birth, and death. The U.S. Census has collected vast amounts of information every 10 years since 1790. The Census’ statistics website had 370 billion cells as of 2013, with approximately 11 billion more added yearly.

    Web Searches

    Web search information gathered by Google and other web search providers could provide real-time insights related to a population’s health. The value of big data from web search patterns might be improved by combining it with traditional sources of health data.

    Social Media

    Facebook, Twitter, and other social media platforms generate a rich variety of data around the clock, giving a view into the locations, health behaviors, emotions, and social interactions of users. 

    The application of social media big data to public health has been referred to as digital disease detection or digital epidemiology. Twitter, for example, has been used to analyze influenza epidemics among the general population.
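
As a toy illustration of how such a signal might be extracted, the Python sketch below counts posts per week that mention flu-related terms. The posts, dates, and keyword list are invented, and simple keyword matching is an assumption made for brevity; it is not the method used in any particular study.

```python
from collections import Counter
from datetime import date

# Invented example posts; all text and dates are hypothetical.
posts = [
    (date(2024, 1, 2), "Stuck in bed with the flu, fever all night"),
    (date(2024, 1, 3), "Great run this morning!"),
    (date(2024, 1, 9), "Whole office is coughing, think I have influenza"),
    (date(2024, 1, 10), "Flu shot done"),
    (date(2024, 1, 11), "fever and chills, ugh"),
]

FLU_TERMS = ("flu", "influenza", "fever")  # hypothetical keyword list


def weekly_flu_mentions(posts, terms=FLU_TERMS):
    """Count posts per (ISO year, ISO week) that contain any flu-related term."""
    counts = Counter()
    for posted_on, text in posts:
        lowered = text.lower()
        # Plain substring matching: crude, and it also matches unrelated
        # words that happen to contain a term (e.g. "fluent").
        if any(term in lowered for term in terms):
            iso = posted_on.isocalendar()
            counts[(iso[0], iso[1])] += 1
    return dict(counts)


mentions = weekly_flu_mentions(posts)
```

Note that the post about a flu shot is counted even though it does not describe an illness, which hints at why raw keyword counts need careful validation before they are read as disease activity.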

    The World Well-Being Project, which started at the University of Pennsylvania, is another example of studying social media to better understand people’s experiences and health. The project brings together psychologists, statisticians, and computer scientists who analyze the language people use online, for instance, in Facebook status updates. The scientists are observing how users’ language relates to their health and happiness.

    The Internet of Things (IoT)

    Massive troves of health-related information are also collected and stored on mobile and home devices.

    • Smartphones: Thousands of mHealth apps capture information on the user’s physical activity, nutritional intake, sleep patterns, emotions, and other parameters. Native cell phone apps (e.g. GPS, email, texting) can also give clues about an individual’s health status.

    • Wearable monitors and devices: Pedometers, accelerometers, glasses, watches, and chips embedded under the skin also gather health-related information and can send it to the cloud.

    • Telemedicine devices allow health care providers to monitor patients’ parameters such as blood pressure, heart rate, respiratory rate, oxygenation, temperature, ECG tracings, and weight.

    Financial Transactions

    Patients’ credit card transactions are included in the predictive models used by Carolinas HealthCare System to identify patients who are at high risk of being readmitted to the hospital.

    Ethical and Privacy Implications

    Gathering and accessing health care data can carry important ethical and privacy implications. New sources of big data can improve our understanding of what affects individual and population health, but the risks need to be carefully considered and monitored.


    Carolinas HealthCare System. How Carolinas HealthCare System is Turning Big Data Into Better Care.

    Conway M, O’Connor D. Social media, big data, and mental health: current advances and ethical implications. Current Opinion in Psychology 2016;9:77-82.

    Fernandes L et al. Big Data, Bigger Outcomes. Journal of AHIMA 2012;83(10):38-43.

    Lazer D et al. The Parable of Google Flu: Traps in Big Data Analysis. Science 2014;343(6176):1203-1205.

    Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Information Science and Systems 2014;2:3. doi:10.1186/2047-2501-2-3.
