The 7 Deadly Sins Of Enterprise Data Quality
Atchison Frazer looks at the issues that can lead to problems with data quality
Enterprise data quality problems can be categorised into three main areas:
- Processes that bring data into a repository, manually or otherwise. These can introduce problems either because the incoming data is already incorrect or because of errors in the extraction and loading processes.
- Processes that manipulate data already in the database, which can be routine or brought about by upgrades, updates and a range of ad-hoc activities.
- Processes that cause data to become inaccurate or degrade over time without any physical changes having been made. This usually happens when the real-world objects the data describes change while the data collection processes stay the same.
All these complex processes are an essential part of data processing and cannot simply be switched off to avoid problems. The only way to maintain data integrity is to make certain that each of these processes works as intended and to avoid the seven major causes of data quality problems.
- Initial Data Conversion
Most databases begin with the conversion of data from a pre-existing source. Data conversion never goes as seamlessly as intended: some parts of the original datasets fail to convert to the new database, others mutate during the process, and the source data itself is rarely perfect to begin with. To avoid problems, more time should be spent profiling the data than on coding the transformation algorithms.
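As a rough illustration, a minimal profiling pass over a source extract might look like the sketch below. The dataset and column names are assumptions made for the example; the point is simply to surface duplicates, nulls and unparseable values before any transformation code is written.

```python
import pandas as pd

# Tiny stand-in for a source extract; a real conversion would read the legacy system.
source = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "country": ["UK", "United Kingdom", None, "FR"],
    "signup_date": ["2021-03-01", "not a date", "2022-07-14", None],
})

profile = {
    "rows": len(source),
    "duplicate_ids": int(source["customer_id"].duplicated().sum()),
    "null_counts": source.isna().sum().to_dict(),
    "distinct_countries": int(source["country"].nunique()),
    # errors="coerce" turns unparseable dates into NaT so they can be counted
    "bad_or_missing_dates": int(
        pd.to_datetime(source["signup_date"], errors="coerce").isna().sum()
    ),
}

for check, value in profile.items():
    print(f"{check}: {value}")
```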
- System Consolidation
When combining old systems with new ones, or phasing systems out, data consolidation is crucial. Problems arise most often when consolidations are unplanned and therefore rushed.
- Batch Feeds
Batch feeds are large data exchanges that happen between systems on a regular basis. Each feed carries a large volume of data, and a bottleneck in one feed can cause problems for subsequent feeds. This can be avoided by using a tool that detects process errors and stops them from causing performance problems.
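One common safeguard, sketched below under an assumed schema and file name, is to validate each batch before loading it and to quarantine anything that fails, so a bad feed does not hold up or corrupt the feeds that follow.

```python
import csv
from pathlib import Path

REQUIRED_FIELDS = {"order_id", "amount", "currency"}  # assumed schema


def validate_batch(path: Path) -> list[str]:
    """Return a list of error messages; an empty list means the batch is loadable."""
    errors = []
    with path.open(newline="") as fh:
        reader = csv.DictReader(fh)
        missing = REQUIRED_FIELDS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for line_no, row in enumerate(reader, start=2):
            try:
                float(row["amount"])
            except ValueError:
                errors.append(f"line {line_no}: non-numeric amount {row['amount']!r}")
    return errors


batch = Path("orders_batch.csv")  # hypothetical incoming feed
if batch.exists():
    errors = validate_batch(batch)
    if errors:
        # Quarantine the feed rather than letting it block subsequent batches.
        print(f"Batch rejected with {len(errors)} error(s)")
    else:
        print("Batch passed validation; safe to load")
```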
- Real-time Interfaces
Real-time interfaces are used to exchange data between systems. When data arrives in one database, triggers fire that send it on to the databases downstream. This fast propagation of data is a recipe for disaster if there is nothing at the other end reacting to potential problems. The ability to respond to such problems as soon as they arise is key to stopping errors from spreading and causing more harm.
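A minimal sketch of that idea is shown below, assuming a hypothetical message format and downstream hook: every incoming record is validated at the interface, and anything suspect is diverted to a dead-letter list rather than propagated downstream.

```python
import json

dead_letter: list[dict] = []  # suspect messages held back for review


def handle_message(raw: str, forward_downstream) -> None:
    """Validate an incoming record before it fans out to downstream systems."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        dead_letter.append({"raw": raw, "reason": "unparseable"})
        return
    # The field names checked here are assumptions made for the example.
    if not record.get("account_id") or record.get("balance", 0) < 0:
        dead_letter.append({"raw": record, "reason": "failed validation"})
        return
    forward_downstream(record)


# A well-formed message is forwarded; a malformed one is held back.
handle_message('{"account_id": "A-1", "balance": 120.0}', print)
handle_message('{"balance": -5}', print)
print(f"{len(dead_letter)} message(s) diverted instead of propagated")
```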
- Data Processing
Data processing comes in many forms, from routine user transactions to periodic calculations and adjustments. Ideally, all of these processes would run like clockwork, but the underlying data changes, and the programs themselves change, evolve and are sometimes corrupted.
- Data Scrubbing
Data scrubbing is aimed at improving data quality. In the early days, cleansing was done manually, which was relatively safe. Today, with the added complexity of Big Data, automated cleansing tools have emerged that make corrections in bulk.
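The risk is that an overly broad rule can corrupt records as quickly as it corrects them, so bulk cleansing benefits from an audit trail. The sketch below, using hypothetical rules and column names, counts what each rule touched for exactly that reason.

```python
import pandas as pd

# Tiny stand-in dataset; the columns and rules are assumptions for the example.
df = pd.DataFrame({
    "phone": ["020 7946 0000", "N/A", "0207 946 0001"],
    "country": ["UK", "United Kingdom", "uk"],
})

audit = {}

# Rule 1: treat placeholder strings as genuinely missing values.
placeholders = df["phone"].isin(["N/A", "unknown", ""])
audit["phone_placeholders_blanked"] = int(placeholders.sum())
df.loc[placeholders, "phone"] = pd.NA

# Rule 2: normalise country spellings to a single code.
country_map = {"UK": "GB", "uk": "GB", "United Kingdom": "GB"}
audit["countries_normalised"] = int(df["country"].isin(country_map).sum())
df["country"] = df["country"].replace(country_map)

print(df)
print(audit)
```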
- Data Purging
With data purging, old data is removed from a system to make room for new data. It is a normal process once retention limits are reached and the old data is no longer required. Data quality problems occur when still-relevant data is purged by accident, whether because of errors in the database or because the purging program simply fails. Infrastructure performance monitoring solutions can help ensure that such errors do not disrupt business operations.
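A purge routine can also carry a simple guard of its own, as in the sketch below; the table name, column names and thresholds are assumptions for illustration. The idea is to count how many rows the retention cut-off would remove and to abort rather than delete if the number looks implausible.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 365        # assumed retention policy
MAX_PURGE_FRACTION = 0.10   # abort if more than 10% of rows would disappear

# In-memory table standing in for a real audit or events store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
now = datetime.now(timezone.utc)
conn.executemany(
    "INSERT INTO events (created_at) VALUES (?)",
    [((now - timedelta(days=d)).isoformat(),) for d in (10, 30, 400, 500)],
)

cutoff = (now - timedelta(days=RETENTION_DAYS)).isoformat()
total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
to_purge = conn.execute(
    "SELECT COUNT(*) FROM events WHERE created_at < ?", (cutoff,)
).fetchone()[0]

if total and to_purge / total > MAX_PURGE_FRACTION:
    # A suspiciously large purge is more likely a bad cut-off than a real backlog.
    print(f"Purge aborted: {to_purge} of {total} rows would be removed")
else:
    conn.execute("DELETE FROM events WHERE created_at < ?", (cutoff,))
    conn.commit()
    print(f"Purged {to_purge} rows older than {RETENTION_DAYS} days")
```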
Most data quality issues can be mitigated with the breadth of knowledge and robust technology needed to correlate intelligence across silos. That correlation provides deeper insight for organisations adopting hybrid-cloud infrastructures that need to control data quality, plan efficiently and optimise their infrastructure.