In the previous article, “‘Healthy’ Data: A Must for BI Reporting”, we explained the importance of data quality in business intelligence reporting, as well as the concepts of “healthy” (“clean”) and “dirty” data.
Imagine you are an analyst asked to create a new report using “freshly” ingested data from a source system. Before diving into creating the front end of the report, you will need to make sure data is ready for modeling. For this, you may turn to a data engineer or embark on the data clean-up journey on your own.
In this post, we will describe seven of the most used data clean-up techniques that will help make your “dirty” data squeaky “clean”. Keep in mind there are various levels of data “messiness”; therefore, not all steps will have to be applied on every single “dirty” dataset. Some will have only minor consistency and redundancy issues while others may have major structural issues.
To be able to determine which steps need to be completed, it is important to first inspect and audit the data. By examining the raw data, you will have a better picture of the type of data you are dealing with. Once the data audit is complete, the clean-up process begins.
1. Remove Duplicates
Most BI (Business Intelligence) applications today have the ETL (extract, transform, load) capability, which makes it easy to detect and remove duplicates. For example, in PowerBI (Power Query Editor) this can be done by selecting one or more columns of the table, right-clicking on one of them and selecting the “Remove Duplicates” option. Simple as that!
You can also remove duplicates in a database, such as Google Big Query, by creating a script that identifies duplicated values and subsequently updates the table by excluding those table rows. However, prior to making these changes, consult with the source system owner and make sure those edits make sense and are acceptable.
2. Remove Irrelevant Fields
Once data is free of duplicates, it is important to make another pass and remove any irrelevant information that may impact the quality of future reporting. For example, if you are building an accounting report that deals with sales and budget information only you may want to remove customer information from your dataset, as it is not pertinent to the information business users will look for in these reports. Decluttering your dataset will result in smoother data modeling, decreased data maintenance, and increased report performance.
3. Fix Incorrect Data
Incorrect data can happen for many varied reasons. The most common cases are:
a) wrong values being entered at the source (incorrect data flowing in the system)
example: salesperson typing an invalid SKU on an order
b) invalid/incorrect value as a result of data transformation
example: 6-digit zip code being created from an original text field (United States zip codes are 5 numerical digits long)
In the first case, it may be difficult to determine if a wrong (valid) SKU is entered, however, checks can be set in place to make sure invalid values are not flowing in the database. To fix those values, more investigative work may need to be completed, but once that is done, simply update the SKU value for the correct one.
In the second case, to fix the issue, a more detailed examination of the function that is causing the incorrect zip code length will be needed. Once the error is identified, fix the function, and run the script to update the zip code values in the data.
Fixing incorrect data is a critical process in data engineering. It ensures that accurate data is flowing in the system and therefore maximizing the number of factual conclusions that can be deducted from the reports it is feeding.
4. Handle Missing Data
Absence of data is the trickiest to handle by far. The reason behind this is that we really do not know if we should be replacing it or what we should be replacing it with. As with the incidents of incorrect data, this can happen for multiple reasons. It can be simply missing from the point of input or caused by some data transformation. Here are a few ways of solving this problem:
a) If the missing data is caused by a transformation – fix the function that is causing this problem.
b) If the number of missing values is miniscule you may drop those rows or exclude those rows not to corrupt the results.
c) If the number of missing values is still low (based on your discretion), you may want to replace those values with mean, median, or similar values that would not skew the results too much; or may replace those values with similar values from another similar dataset.
If pointing out what is missing can be construed as informative, flag this data, and use it in the reporting. Sometimes it is valuable to see what is not there. It may provide more context for the overall analysis.
5. Handle Outliers
What are outliers? You can think of them as values in a dataset that are much different than the rest of the data. They add variability to your data, which can skew and distort results that can ultimately lead to wrong conclusions and business decisions.
Your instinct may tell you to remove them; however, before eliminating them, make sure you understand what is causing them, and if they can provide additional information on the subject-matter or on the data collection process.
For example, let’s say you are working with patient age information that was typed in manually. By just looking at the below list, we can see 2 apparent outliers – 128 years old and -10 years old.
One value (-10) is impossible, and the other value (128 years) is highly improbable. One way of fixing this is by going back to the original record and fixing the values (this should be done only if we can get the accurate information), otherwise you may want to make those values null or flag them so that they can be included/excluded from the reporting. The only case this should be done is when you have determined that the data is incorrect and there is no way of going back and fixing it.
6. Fix Structural Errors
Structured data is a foundation of every good data model and, as the term implies, has an easy identifiable structure which makes drawing inferences from it more user friendly. It is presented as a row/column data frame mostly seen in relational databases. These types of datasets contain data that is defined, formatted, and highly organized.
Therefore, when conducting the step of identifying structural errors in our raw data, we should be looking for any types of inconsistencies that could be defined as structural errors. Some examples of these errors are typos, inconsistent capitalization, mislabeled categories, invalid column types and formats, unclear breaks between values/columns. Most of the modern BI applications will have the necessary transformation tools to identify and fix these types of errors.
The plan of action for this data clean-up step is identifying structure issues and updating our data transformation steps to prevent those issues in the future.
7. Data Standardization
Standardization is another key step in the data wrangling workflow. Specifically, the notion of standardizing data means to take the raw data from one source and transform it into one common data format in another data vessel.
We can see from the above example that each dataset contains the Punch-In Time date time column in a different format. To standardize this data, we need to decide on the format that will be our standard moving forward and implement it. This seems like a trivial example. However, things can get a lot more complicated which makes this step often time-consuming. If you are working with many tables in a data model, going through each one making sure the data types and formats align across all tables can be somewhat of a daunting task, however, necessary.
Having the same data type and format across multiple tables is paramount when working with primary and foreign keys in data models. For example, if we are joining two tables on a date column, we need to make sure they are both in the same date format so that we can avoid additional CAST functions in the join that may slow down queries. In another example, if you are joining two tables using a numerical ID value, you want to make sure the column in both tables is formatted as text or number. Again, this helps to avoid unnecessary conversions on the join level that can affect query and ultimately report performance.
Going through the data clean-up workflow can sometimes be a challenging task but it is worthwhile. As mentioned in the first article: accurate, well structured, “healthy” data is vital for building high-performing data models and reports. Implementation of this workflow sets companies up for success in today’s ever-changing world of big data. It does not only improve decision-making processes but also extends the life of a whole business by reducing costs, improving overall performance and productivity.
For more detail on this subject or to discuss how Verstand can help you ensure your data is in a state to provide the best results for your company, contact us at firstname.lastname@example.org.