More than 50 percent of a data scientist’s time is spent cleaning data, according to the CrowdFlower 2017 Data Scientist Report.
Michael Garel, director of data strategy for Accruent, gave a presentation at AAMI Exchange in Cleveland this weekend in which he argued that this estimate is actually "quite low."
“From what I’ve seen, data scientists spend most, if not 80 or 90 percent, of their time cleaning data,” he said. “Everybody thinks this data scientist role is the greatest ever. It’s really processing a lot of data.”
Cleaning data refers to the process of detecting and correcting corrupt or inaccurate records so that genuine analytic insights can emerge from the information.
With over 500 million work orders — more than 230 million in healthcare — Accruent utilizes a number of tools in data analytics, machine learning and deep learning to clean data and uncover insights for completing hospital equipment work orders more efficiently. Which tools to use comes down to what information the user is trying to uncover, the type of work order and the variables involved.
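To make the idea of data cleaning concrete, here is a minimal sketch on invented work-order records. The field names, synonym map, and values are hypothetical illustrations, not Accruent's actual schema or methods:

```python
# A toy data-cleaning pass on made-up work-order records: normalize
# free-text equipment names, map known spelling variants, and drop
# duplicates that only differ in formatting.

def clean_work_orders(records):
    """Return de-duplicated records with normalized equipment names."""
    synonyms = {"iv pump": "infusion pump"}  # illustrative variant map
    seen, cleaned = set(), []
    for rec in records:
        name = " ".join(rec["equipment"].lower().split())  # trim, collapse spaces
        name = synonyms.get(name, name)                    # map known variants
        key = (rec["order_id"], name)
        if key not in seen:                                # skip duplicates
            seen.add(key)
            cleaned.append({"order_id": rec["order_id"], "equipment": name})
    return cleaned

raw = [
    {"order_id": 1, "equipment": "  IV Pump "},
    {"order_id": 1, "equipment": "iv pump"},   # duplicate once normalized
    {"order_id": 2, "equipment": "Ventilator"},
]
print(clean_work_orders(raw))
```

Real pipelines at this scale would of course rely on the analytics and machine learning tools described below rather than hand-written rules, but the goal is the same: corrupt and inconsistent records in, analyzable records out.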
In his presentation, entitled “Big Data Insights on Capital Equipment from 500 Million Work Orders,” Garel examined specific uses and scenarios that a few of these tools are best suited for addressing:
Data science is the ability to comprehend and process data, extract value from it, and visualize and communicate it. Applying data analytics can be helpful for this, depending on the type of scenario users are faced with.
- Descriptive analytics – Describes what has happened in the past to understand current conditions, and to visualize and communicate insights extracted from data to peers or management.
- Predictive analytics – Predicts what will happen in the future. This form is inherently probabilistic in nature and utilizes historical data to anticipate future performance, events and results.
- Prescriptive analytics – Maps out recommendations for next steps to achieve objectives and goals.
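The three flavors above can be sketched in a few lines of Python on hypothetical monthly work-order counts (the numbers are invented for illustration, not Accruent data):

```python
# Descriptive, predictive, and prescriptive analytics on toy monthly
# counts of completed work orders.
from statistics import mean

months = [1, 2, 3, 4, 5, 6]
completed = [100, 110, 120, 130, 140, 150]

# Descriptive: summarize what happened in the past
avg = mean(completed)

# Predictive: fit a least-squares trend line and extrapolate to month 7
mx, my = mean(months), mean(completed)
slope = (sum((x - mx) * (y - my) for x, y in zip(months, completed))
         / sum((x - mx) ** 2 for x in months))
forecast = slope * 7 + (my - slope * mx)

# Prescriptive: a simple rule recommending a next step from the forecast
action = "add a technician shift" if forecast > 155 else "keep current staffing"
print(avg, forecast, action)
```

The dividing lines are the same at any scale: descriptive work reads the history, predictive work extends it probabilistically, and prescriptive work turns the prediction into a recommendation.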
Machine learning is data analysis that automates analytical model building. While most software requires training on where to look, the aim of this technology is to uncover hidden insights without explicitly being programmed where to search. It can instead learn from data, by identifying patterns and by making predictions.
- Supervised learning – Using tagged data, the machine is trained to identify features in select images and is then applied to images not used in its training. This is especially helpful for risk assessment, fraud detection, and image and speech recognition.
- Unsupervised learning – The computer relies on patterns in all images to identify variables, without "knowing" what the image is. This is best for anomaly detection and differentiating stand-alone activities from continuous ones.
- Reinforcement learning – The computer learns from its mistakes again and again until it completes the task error free and without being directed. This is often used in robotics and navigation.
- Semi-supervised learning – Groups of different variables are assigned in different ways, with the algorithm learning the optimal approach over time. Used in speech recognition, web page classification, and image recognition and classification.
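The supervised/unsupervised contrast can be illustrated with a toy example on made-up 1-D sensor readings (no real data or libraries involved): a supervised model learns from tagged examples, while an unsupervised one finds the groups on its own.

```python
# Supervised vs. unsupervised learning on toy 1-D readings.

def nearest_centroid_fit(points, labels):
    """Supervised: learn one centroid per tagged class."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: sum(ps) / len(ps) for lab, ps in groups.items()}

def nearest_centroid_predict(centroids, p):
    """Classify a new reading by its closest learned centroid."""
    return min(centroids, key=lambda lab: abs(centroids[lab] - p))

def two_means(points, iters=10):
    """Unsupervised: split untagged points into two clusters (1-D k-means)."""
    c1, c2 = min(points), max(points)
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(a) / len(a), sum(b) / len(b)
    return sorted([c1, c2])

readings = [1.0, 1.2, 5.0, 5.3]

# Supervised: tags ("normal"/"fault") guide the training...
centroids = nearest_centroid_fit(readings, ["normal", "normal", "fault", "fault"])
print(nearest_centroid_predict(centroids, 4.8))

# ...while unsupervised clustering recovers two groups from the same
# readings with no tags at all.
print(two_means(readings))
```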
A more accurate and faster form of machine learning, deep learning does not require as much upfront work, as the tools and frameworks are already built in. It can train networks and adjust the variables within them. The downside is that it requires greater amounts of training data than standard machine learning.
- Convolutional networks – Extract complex features of the data at each level to determine the output. This is well suited for noisier images that contain features other than the specific information one is looking for.
- Recurrent networks – A form of deep learning that stores information in context nodes so that the machine can learn data sequences and output another sequence. This is especially helpful for language translation.
- Generative adversarial networks (GANs) – Composed of two neural networks, one of which generates synthetic images while the other tries to distinguish them from real ones. By competing, the networks teach one another to tell fake data from real data and to generate data that is indistinguishable from the real thing. With this, users could potentially create a healthy-looking 2D or 3D scan image for a person with a tumor to deceive radiologists and oncologists. “All kinds of bad applications can happen from this. We certainly have to find a way around this because it is just too easy to do at this point,” said Garel.
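As a toy illustration (not from the presentation), the convolution operation at the heart of convolutional networks can be sketched in plain Python: a small kernel slides across a noisy 1-D signal and responds strongly where the feature it encodes appears.

```python
# The convolution step behind convolutional networks, reduced to 1-D:
# a kernel slides over the signal and produces a feature map.

def conv1d(signal, kernel):
    """Valid (no-padding) 1-D convolution of signal with kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Standard activation: keep positive responses, zero out the rest."""
    return [max(0.0, x) for x in xs]

signal = [0.1, 0.0, 0.2, 0.1, 3.0, 3.1, 2.9, 0.1, 0.0]  # noisy, one jump
edge_kernel = [-1.0, 1.0]                                 # responds to upward steps

features = relu(conv1d(signal, edge_kernel))
print(features)  # the strongest response marks where the jump occurs
```

A real convolutional network stacks many such filters across layers and learns their weights from data, which is how it pulls specific features out of noisy images.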
He added that, used correctly, all of these and other available tools can help speed up data cleaning and ensure faster completion of work orders, supporting workflow and quality patient care.
“Our hypothesis is that we can apply data science to clean up enough of this data so we can actually make it useful to generate results. That’s the goal.”