Modern business intelligence (BI) platforms have made report building look and feel easy. Analysts can take practically any piece of data available today and analyze it in a desktop or web-based application. Data funneled from different types of sources can be combined into convenient data models and ultimately used to provide valuable, actionable insights.
However, one of the most important steps in this data journey, and one frequently overlooked by business analysts, is data clean-up (cleansing), sometimes called data wrangling. The importance of this step lies in a premise often repeated among BI professionals: “garbage in, garbage out”. This simply means that unrefined, raw data will most likely produce skewed, inaccurate results compared to reports that use healthy, pre-scrubbed data.
This first of two articles discusses the importance of data quality, while the next article lays out seven types of data cleaning required in nearly every data project. Meant for both business analysts and business intelligence developers, the articles will serve as a foundational guide to better business insight, regardless of the domain.
Why do businesses need “healthy” data?
The basic idea behind good data quality for any BI reporting is that it:
- Decreases data maintenance time/cost
- Decreases data processing time/cost
- Makes data modeling easier
- Improves report quality
- Improves decision-making processes
Therefore, a significant amount of effort is put into elevating data to the level where data-caused errors are minimized and accurate reporting becomes more likely. It is often reported that, on average, data analysts and data scientists spend 60-80% of their time scrubbing and organizing data, and only 20-40% of their time providing insight by creating data products. Fortunately, the data management industry is applying new toolsets to reverse that lopsided ratio, a topic covered in a future Verstand Insight piece.
What is “healthy” data?
Before diving into the actual data clean-up steps, it is important to know the desired outcome of the process. So-called “healthy” or “clean” data is free of incomplete, inconsistent, incorrect, irrelevant, and redundant information and is structured so that it can be used directly or transformed further. Healthy data exhibits a strong connection between its structure and its meaning. In this data, all pieces are logically linked so that:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
On the other hand, “dirty” data is riddled with inconsistencies, errors, and missing information, which makes the development process more complex and time-consuming and makes it almost impossible to base informed decisions on the data.
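To make the three structural rules above concrete, here is a minimal sketch in plain Python. The table, its field names (`store`, `jan`, `feb`), and the helper `to_tidy` are hypothetical and exist only for illustration; they are not taken from any client dataset. The "wide" input hides the variable "month" in its column headers, and the reshaped output gives each variable its own column and each observation its own row.

```python
# Hypothetical "wide" sales table: the month is encoded in the column names,
# so one row holds several observations at once.
wide_rows = [
    {"store": "A", "jan": 100, "feb": 120},
    {"store": "B", "jan": 90, "feb": None},  # None marks a missing value
]


def to_tidy(rows, id_field, variable_name, value_name):
    """Reshape wide rows into one row per (id, variable) observation."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_field:
                continue  # the identifier is repeated on every output row
            tidy.append(
                {id_field: row[id_field], variable_name: key, value_name: value}
            )
    return tidy


tidy_rows = to_tidy(
    wide_rows, id_field="store", variable_name="month", value_name="sales"
)

# With one observation per row, incomplete records are easy to isolate.
missing = [r for r in tidy_rows if r["sales"] is None]
```

Once the data is in this shape, checks for missing or inconsistent values can be applied to a single column rather than to every month column separately.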
The team at Verstand AI understands that the significance of data quality is greater than ever before. With ever-increasing data volume, businesses without established data cleansing techniques can find themselves in very difficult situations in a matter of days or even hours. Thus, data clean-up strategies are one of the first steps we implement for our clients.
For example, one of our customers, a nationwide retail chain, ingests data from more than 20 different sources. This diverse data is first stored in “raw” tables, then undergoes cleansing and other transformations. At the end of this process, the “healthy” data is stored in ready-to-use tables that can be queried directly by business users or used in BI reporting. Using this “clean” data, the Verstand AI team creates user-friendly Power BI and Looker reports, as well as versatile data models for self-service analytics that allow business users to build their own ad-hoc reports.
Data clean-up strategies such as this one create a powerful base structure with data accuracy, validity, integrity, consistency, and completeness as its major pillars. Having a structurally sound base allows our clients to build up their information pool without fear of introducing “dirty” data into their databases and reports or of incurring unnecessary data maintenance and storage costs. This ultimately leads not only to more accurate reporting and decision-making but also to improved business processes and lower overall operational costs.
In our next article, you’ll discover seven of the most common characteristics of “dirty” data and how to efficiently cleanse datasets to support better decision-making across the business.
For more detail on this subject or to discuss how Verstand AI can help you ensure your data is in a state to provide the best results for your company, contact us at insights@verstand.ai.