Introduction: Big Data, Data Mining, and Knowledge Discovery:
Data mining provides tools for visualizing relationships in the data and automates the process of discovering predictive information in massive databases.
Data mining tools scan databases and identify previously hidden patterns.
Big Data to Knowledge (BD2K) - Launched in 2013 by the National Institutes of Health (NIH) to support the research and development of innovative and transformative approaches and tools to maximize and hasten the integration of big data and data science into biomedical research.
The BD2K program supports initial efforts toward making data sets Findable, Accessible, Interoperable, and Reusable (FAIR).
Much of our big data are unstructured; unstructured big data reside largely in text files, which represent more than 75% of an organization’s data. Such data are not contained in databases and can easily be overlooked; moreover, it is difficult to discern trends and patterns in them.
Big data - A collection of data that is huge in size and growing exponentially with time. Such data are so large and complex that traditional data management tools cannot store or process them efficiently.
More data means more knowledge, greater insights, smarter ideas and expanded opportunities for organizations to harness and learn from their data.
Data mining - The process of using software to sort through data to discover patterns and ascertain or establish relationships. This information can then be used to increase profits or decrease costs or both. In health care, it is being used to improve efficiency and quality, resulting in better healthcare practices and improved patient outcomes.
Data mining projects help organizations discover interesting knowledge. These projects can be predictive, exploratory, or focused on data reduction.
Data mining focuses on producing a solution that generates useful forecasts through a four-phase process:
Problem identification - Initial phase of data mining. The problem must be defined, and everyone involved must understand the objectives and requirements of the data mining process they are initiating.
Data exploration - Begins with exploring and preparing the data for the data mining process. This phase might include data access, cleansing, sampling, and transformation; based on the problem being solved, data might need to be transformed into another format. The goal of this phase is to identify the relevant or important variables and determine their nature.
Pattern discovery - A complex phase of data mining. In this phase, different models are applied to the same data to choose the best model for the data set being analyzed. It is imperative that the model chosen should identify the patterns in the data that will support the best predictions. This phase ends with a highly predictive, consistent pattern-identifying model.
Knowledge deployment or application of knowledge to new data to forecast or generate predictions - Takes the pattern and model identified in the pattern discovery phase and applies them to new data to test whether they can achieve the desired outcome. The model achieves insight by following the rules of a decision tree to generate predictions or estimates of the expected outcome.
Data mining is an analytic, logical process with the ultimate goal of forecasting or predicting. It mines or unearths concealed predictive information, constructing a picture or view of the data that lends insight into future trends, actions, or behaviors.
Data mining develops a model that uses an algorithm to act on a data set for one situation when the organization knows the outcome and then applies this same model to another situation when the outcome is not known—an extension known as scoring.
Scoring - The data mining process of applying a model to new data.
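As a concrete sketch, scoring can be illustrated with a toy model in Python; the records, field names, and the simple threshold rule here are invented for illustration, not taken from any real system:

```python
# Minimal illustration of "scoring": fit a simple model on records where
# the outcome is known, then apply the same model to records where it is not.
# Data, field names, and the threshold rule are invented.

def train_threshold_model(records):
    """Learn the length-of-stay cutoff that best separates readmitted patients."""
    best_t, best_acc = None, -1.0
    for t in sorted({r["los"] for r in records}):
        correct = sum((r["los"] >= t) == r["readmitted"] for r in records)
        acc = correct / len(records)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def score(model_threshold, new_records):
    """Scoring: apply the trained model to new, unlabeled records."""
    return [r["los"] >= model_threshold for r in new_records]

training = [
    {"los": 2, "readmitted": False},
    {"los": 3, "readmitted": False},
    {"los": 7, "readmitted": True},
    {"los": 9, "readmitted": True},
]
threshold = train_threshold_model(training)          # outcome known
predictions = score(threshold, [{"los": 1}, {"los": 8}])  # outcome unknown
print(threshold, predictions)
```

The first call corresponds to the situation where the organization knows the outcome; the second applies the same model where the outcome is not known, which is the scoring step.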
Data mining is a dynamic, iterative process that is adjusted as new information surfaces. It is a robust, predictive, proactive information and knowledge tool that, when used correctly, empowers organizations to predict and react to specific characteristics of and behaviors within their systems.
Data mining is also known as knowledge discovery and data mining (KDD), knowledge discovery in data, and knowledge discovery in databases.
Knowledge discovery is key because data mining looks at data from different vantage points, aspects, and perspectives and brings new insights to the data set.
The healthcare sector has discovered data mining through the realization that knowledge discovery can help improve:
Healthcare policy making.
Healthcare practices.
Disease prevention.
Detection of disease outbreaks.
Prevention of sequelae.
Prevention of in-hospital deaths.
On the business side, healthcare organizations use data mining to detect falsified or fraudulent insurance claims.
KDD and Research:
“Big data” does not just refer to size but rather “is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach.”
Data Mining Concepts:
Bagging - The use of voting and averaging in predictive data mining to synthesize the predictions from many models or methods, or the use of the same type of model on different data. Bagging is particularly useful for stabilizing the otherwise unpredictable results obtained when complex models are applied to small data sets.
Boosting - A means of increasing the power of the models generated by weighting the combinations of predictions from those models into a predicted classification. This iterative process uses voting or averaging to combine the different classifiers.
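The voting-and-averaging idea behind bagging can be sketched in a few lines of Python; the data set and the simple mean-threshold learner are invented for illustration:

```python
import random

# Sketch of bagging: train the same simple model on bootstrap resamples
# of the data, then combine the models' predictions by majority vote.
# The data and the weak "mean threshold" learner are invented.

random.seed(0)
data = [(x, x >= 5) for x in range(10)]  # (value, label) pairs

def fit_mean_threshold(sample):
    """A weak learner: classify values at or above the sample mean as positive."""
    return sum(x for x, _ in sample) / len(sample)

models = []
for _ in range(25):
    boot = [random.choice(data) for _ in data]  # bootstrap resample
    models.append(fit_mean_threshold(boot))

def bagged_predict(x):
    votes = sum(x >= t for t in models)  # each model casts one vote
    return votes > len(models) / 2       # majority vote decides

print([bagged_predict(x) for x in (1, 9)])
```

Each resampled model differs slightly, but the vote across all 25 models gives a stable combined prediction.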
Data reduction - Shrinks large data sets into manageable, smaller data sets. One way to accomplish this is via aggregation of the data or clustering.
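Aggregation-style data reduction can be sketched as follows; the records and field names are invented for illustration:

```python
from collections import defaultdict

# Sketch of data reduction by aggregation: collapse row-level records
# into one summary row per diagnosis. Records and fields are invented.

records = [
    {"dx": "diabetes", "los": 4},
    {"dx": "diabetes", "los": 6},
    {"dx": "asthma",   "los": 2},
]

groups = defaultdict(list)
for r in records:
    groups[r["dx"]].append(r["los"])

# Three rows reduce to one summary row per diagnosis.
summary = {dx: {"n": len(v), "mean_los": sum(v) / len(v)}
           for dx, v in groups.items()}
print(summary)
```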
Drill-down analysis - Typically begins by identifying variables of interest to drill down into the data. You could identify a diagnosis and drill down, for example, to determine the ages of those diagnosed or the number of males. You could then continue to drill down and expose even more of the data.
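A drill-down of the kind described above can be sketched as successive filters; the patient records are invented for illustration:

```python
# Sketch of drill-down analysis: start with the full data set, then
# progressively narrow the view to expose more detail. Data invented.

patients = [
    {"dx": "diabetes", "sex": "M", "age": 54},
    {"dx": "diabetes", "sex": "F", "age": 61},
    {"dx": "diabetes", "sex": "M", "age": 47},
    {"dx": "asthma",   "sex": "F", "age": 30},
]

with_dx = [p for p in patients if p["dx"] == "diabetes"]  # drill: diagnosis
males = [p for p in with_dx if p["sex"] == "M"]           # drill: sex
ages = sorted(p["age"] for p in males)                    # expose ages
print(len(with_dx), len(males), ages)
```

Each step reuses the previous subset, mirroring how an analyst drills from a diagnosis down to the sex and ages of those diagnosed.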
Exploratory data analysis - An approach or philosophy that uses mainly graphical techniques to gain insight into a data set. Its goal varies based on the purpose of the analysis, but it can be applied to the data set to extract variables, detect outliers, or identify patterns.
Feature selection - Reduces inputs to a manageable size for processing and analysis because the model either chooses or rejects an attribute based on its usefulness for analysis.
Machine learning - A subset of artificial intelligence that permits computers to learn either inductively or deductively. Inductive machine learning is the process of reasoning and making generalizations or extracting patterns and rules from huge data sets—that is, reasoning from a large number of examples to a general rule. Deductive machine learning moves from premises that are assumed true to conclusions that must be true if the premises are true.
Meta-learning - Combines the predictions from several models. It is helpful when several different models are used in the same project. The predictions from the different classifiers or models can be included as inputs to the meta-learner. The goal is to synthesize these predicted classifications to generate a final best predicted classification, a process also referred to as stacking.
Predictive data mining - Identifies the data mining project as one with the goal of identifying a model that can predict classifications.
Stacking - Synthesizes the predictions from several models.
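Stacking and meta-learning can be sketched as follows; the base models, the accuracy-weighted voting scheme, and the data are all invented for illustration:

```python
# Sketch of stacking / meta-learning: predictions from several base models
# are combined at a meta level, here by a vote weighted by each model's
# accuracy on held-out data. Models, weights, and data are invented.

validation = [(2, False), (4, False), (6, True), (8, True), (7, True)]

base_models = [
    lambda x: x > 5,        # base model 1: threshold rule
    lambda x: x % 2 == 1,   # base model 2: parity rule (weak here)
]

# Weight each base model by its accuracy on the validation set.
weights = [
    sum(m(x) == y for x, y in validation) / len(validation)
    for m in base_models
]

def stacked_predict(x):
    # The meta-level combiner synthesizes the base predictions.
    score = sum(w * m(x) for w, m in zip(weights, base_models))
    return score > sum(weights) / 2

print(weights, [stacked_predict(x) for x in (3, 9)])
```

The more accurate base model earns a larger vote, so the final classification leans on the stronger classifier while still using both.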
Data Mining Techniques:
The commonly used techniques in data mining are neural networks, decision trees, rule induction, algorithms, and the nearest neighbor method.
Neural networks - Represent nonlinear predictive models. These models learn through training and resemble the structure of biological neural networks; that is, they model the neural behavior of the human brain. Neural networks are a way to bridge the gap between computers and humans. Neural networks go through a learning process or training on existing data so that they can predict, recognize patterns, associate data, or classify data.
Decision trees - Named because the sets of decisions form a tree-shaped structure. The decisions generate rules for classifying a data set. Classification and regression trees (CART) and chi-square automatic interaction detection (CHAID) are two commonly used types of decision tree methodologies.
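The rule-generating character of a decision tree can be seen by writing a toy tree as nested if–then statements; the features and thresholds here are invented for illustration:

```python
# A toy decision tree written out as the nested rules it generates.
# The triage features and cutoffs are invented for illustration.

def triage(patient):
    if patient["temp"] >= 38.0:        # first split: fever?
        if patient["age"] >= 65:       # second split: older patient?
            return "high risk"
        return "moderate risk"
    return "low risk"

print(triage({"temp": 38.5, "age": 70}))
```

Each path from the root to a leaf is one classification rule, which is exactly what a tree-building method such as CART produces.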
Rule induction - Based on statistical significance. Rules are extracted from the data using if–then statements, which become the rules.
Algorithms - Typically computer-based recipes or methods used to develop data mining models. To create the model, the data set is first analyzed by the algorithm, which looks for specific patterns and trends. Based on the results of this analysis, the algorithm defines the parameters of the data mining model. The identified parameters are then applied to the entire data set to mine it for patterns and statistics.
Nearest neighbor analysis - Classifies each record in a data set based on a select number of its nearest neighbors. This technique is also known as k-nearest neighbors (k-NN).
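A minimal k-nearest neighbors classifier can be sketched as follows; the one-dimensional data and labels are invented for illustration:

```python
from collections import Counter

# Minimal k-nearest neighbors: classify a record by the majority label
# among its k closest records. Data and labels are invented.

def knn_classify(train, point, k=3):
    by_distance = sorted(train, key=lambda t: abs(t[0] - point))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]  # majority label

train = [(1, "low"), (2, "low"), (3, "low"),
         (8, "high"), (9, "high"), (10, "high")]
print(knn_classify(train, 2.5), knn_classify(train, 9.5))
```

Real applications use multi-dimensional distance measures, but the principle is the same: each record takes the majority class of its k nearest neighbors.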
Text mining is to text what data mining is to numerical data. Because text in health care is not always consistent, owing to the lack of a generally accepted terminology structure, it is more difficult to analyze. Text documents are analyzed by extracting key words or phrases.
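Key-word extraction of this kind can be sketched with simple word counts; the clinical notes and the stop-word list are invented for illustration:

```python
from collections import Counter
import re

# Sketch of text mining by key-word extraction: count word frequencies
# in free-text notes after dropping common stop words. Notes invented.

notes = [
    "Patient reports chest pain radiating to left arm.",
    "Chest pain resolved after medication; patient stable.",
]
stop = {"patient", "to", "after", "the", "a"}

words = re.findall(r"[a-z]+", " ".join(notes).lower())
keywords = Counter(w for w in words if w not in stop)
print(keywords.most_common(2))
```

Even this crude count surfaces "chest" and "pain" as the dominant terms, which is the basic idea behind extracting key words or phrases from inconsistent free text.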
Online analytic processing - Generates different views of the data in multidimensional databases. These perspectives range from simplistic views such as descriptive statistics, frequency tables, or comparative summaries to more complicated analyses requiring various forms of cleansing the data such as removing outliers.
Brushing - A technique in which the user manually chooses specific data points or observations or subsets of data on an interactive data display. These selected data can be visualized in two-dimensional or three-dimensional surfaces as scatter plots. Brushing is also known as graphical exploratory data analysis.
Data Mining Models:
A data mining model is more than an algorithm applied to data. Specifically, the data mining model consists of a mining structure plus an algorithm. The data mining model remains empty until the algorithm processes and analyzes the data provided by the mining structure.
Cross-Industry Standard Process for Data Mining (CRISP-DM):
The CRISP-DM model follows a series of steps that move from understanding the business problem to understanding, preparing, and modeling the business data collected and analyzed.
The six steps are:
Business understanding.
Data understanding.
Data preparation.
Modeling - Involves selecting the modeling methods and their application to the prepared data set.
Evaluation
Deployment.
The CRISP-DM model employs a process that has been proven to make data mining projects both faster and more effective. Using this model helps avoid common mistakes when assessing business problems and applying data mining techniques.
Six Sigma:
Six Sigma - A data-driven method to eliminate defects, avoid waste, or assess quality control issues. It aims to decrease discrepancies in business and manufacturing processes through dedicated improvements.
Six Sigma uses five steps, known as DMAIC:
Define.
Measure.
Analyze.
Improve.
Control.
Sample, Explore, Modify, Model, Assess (SEMMA):
SEMMA - The process of Sampling, Exploring, Modifying, Modeling, and Assessing (SEMMA) large amounts of data to uncover previously unknown patterns that can be used to business advantage. This model is similar to Six Sigma but concentrates more on the technical activities characteristically involved in data mining.
Benefits of Knowledge Discovery and Data Mining:
KDD can enhance the business aspects of healthcare delivery and help improve patient care. Examples of how KDD can be applied effectively follow:
A durable medical equipment company analyzed its recent sales and enhanced its targeting of hospitals and clinics that yielded the highest return on investment.
Several plastic surgery suites were bought by the same group of surgeons. They wanted to know how those organizations were the same and how they were different. They ran analytics for disparities while looking for patterns and trends that led them to develop standardized policies and modify treatment plans.
Analytic techniques were used in the clinical trials of a new oral contraceptive to aid in monitoring trends and disparities.
Hidden patterns and relationships between death and disease in selected populations can be uncovered.
Government spending on certain aspects of health care or specific disease conditions can be analyzed to discover patterns and relationships and to distinguish between the real versus desired outcomes from the investment.
Patient data can be analyzed to identify effective treatments and discover patterns or relationships in the data to predict inpatient length of stay (LOS).
Data can be analyzed to help detect medical insurance fraud.
Data Mining and Electronic Health Records:
EHR data mining can:
Help manage population health.
Assist with and inform administrative processes.
Provide metrics for quality improvement.
Support value-based reimbursement.
Provide data for registry software that helps with population health management.
Uses of EHR data in physician practices:
Demographic analytics support efficient diagnosing.
Combining disparate data types creates opportunities to strengthen financial planning.
Tracking the patient flow enables productivity improvements.
Improving system performance in an interconnected world. Information in an EHR comes from diverse sources: hospitals, prior doctors, third-party payers, and other organizations.
Comparing your organization’s performance to peers and national standards allows you to discover strengths and weaknesses in your operation.
Registries are also being used to identify care gaps in patient populations in a physician practice or in a healthcare organization. Some EHRs have a built-in registry function whereas others interface with third-party registries. Registries are designed to:
Provide lists of subpopulations, such as patients with hypertension and diabetes.
Identify patients with care gaps based on evidence-based guidelines.
Support outreach to patients who have care gaps.
Provide feedback on how each physician is doing on particular types of care, such as the percentage of their diabetic patients who have their HbA1c levels or blood pressure under control.
Generate quality reports for the practice.
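A registry-style care-gap query can be sketched as follows; the field names, dates, and the one-year outreach rule are invented for illustration:

```python
import datetime

# Sketch of a registry care-gap query: list diabetic patients whose last
# HbA1c result is more than a year old (or missing), flagging them for
# outreach. Field names, dates, and the one-year rule are invented.

today = datetime.date(2024, 6, 1)
patients = [
    {"name": "A", "dx": ["diabetes"], "last_hba1c": datetime.date(2024, 2, 10)},
    {"name": "B", "dx": ["diabetes"], "last_hba1c": datetime.date(2022, 11, 3)},
    {"name": "C", "dx": ["asthma"],   "last_hba1c": None},
]

gap_list = [
    p["name"] for p in patients
    if "diabetes" in p["dx"]                       # subpopulation list
    and (p["last_hba1c"] is None
         or (today - p["last_hba1c"]).days > 365)  # evidence-based gap rule
]
print(gap_list)  # patients due for outreach
```

The same pattern, a subpopulation filter plus a guideline-based date check, underlies the registry functions listed above.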
The EHR can also be used to improve administrative processes.
Ethics of Data Mining:
Practitioners engaging in data mining must ensure that such data are deidentified and that confidentiality is maintained. Because most data mining depends on the aggregation of data, maintaining individual patient confidentiality should be relatively straightforward.