Data Cleansing: Processing for Removing Errors
Data cleansing is the process of removing errors from the input stream and is part of the integration process. It is perhaps one of the most critical steps in building a data warehouse. If the cleansing process is faulty, the best thing that could happen is that the decision maker will not trust the data and the warehouse will fail. If that's the best thing, what could be worse? The worst thing is that the warehouse could provide bad information and the strategist could trust it. This could mean the development of a corporate strategy that fails. The stakes are indeed high.
A good cleansing process, however, can improve the quality of not only the data within the warehouse, but the operational environment as well. The extraction log records errors detected in the data cleansing process. The data administrator in turn examines this log to determine the source of the errors. At times, the data administrator will detect errors that originated in the operational environment. Some of these errors could be due to a problem with the application or something as simple as incorrect data entry. In either case, the data administrator should report these errors to those responsible for operational data quality. Some errors will be due to problems with the metadata. Perhaps the cleansing process did not receive a change to the metadata. Perhaps the metadata for the cleansing process was incorrect or incomplete. The data administrator must determine the source of this error and take corrective action. In this way, the data warehouse can be seen as improving the quality of the data throughout the entire organization.
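To make the feedback loop above concrete, here is a minimal sketch of how extraction log entries might be routed once the data administrator has judged where each error came from. Everything named here is an assumption for illustration: the `LogEntry` fields, the three source labels, and the `ROUTING` table do not come from any particular warehouse product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogEntry:
    """One error recorded in the (hypothetical) extraction log."""
    record_id: str
    field: str
    message: str
    suspected_source: str  # assumed labels: "application", "data_entry", "metadata"


# Assumed routing: operational errors go back to whoever owns operational
# data quality; metadata errors stay with the data administrator.
ROUTING = {
    "application": "operational data quality team",
    "data_entry": "operational data quality team",
    "metadata": "data administrator",
}


def route_errors(log):
    """Group log entries by the party that should act on them."""
    routed = defaultdict(list)
    for entry in log:
        # Unrecognized sources fall back to the data administrator for review.
        owner = ROUTING.get(entry.suspected_source, "data administrator")
        routed[owner].append(entry)
    return dict(routed)
```

Calling `route_errors` on the log for one load gives the data administrator a per-owner worklist, which is one simple way the warehouse can feed corrections back to the operational side.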
There is some debate as to the appropriate action for the cleansing process to take when errors are detected in the input data stream. Some purists feel the warehouse should not incorporate records with errors. The errors in this case should be reported to the operational environment, where they will be corrected and then resubmitted to the warehouse. Others feel that the records should be corrected whenever possible and incorporated into the warehouse. Errors are still reported to the operational environment, but it is the responsibility of those maintaining the operational systems to take corrective action. The concern is making sure that the data in the warehouse reflects what is seen in the operational environment. A disagreement between the two environments could lead to a lack of confidence in the warehouse.
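The two positions can be sketched as a single cleansing routine with a switchable policy. This is only an illustration under stated assumptions: the validation rule (a two-letter uppercase `state` code), the `try_correct` helper, and the policy names `"reject"` and `"correct"` are all hypothetical. In both policies every error is still reported to the operational environment; the policies differ only in whether a fixable record is loaded.

```python
def find_errors(record):
    """Return a list of error messages for a record (hypothetical rule)."""
    errors = []
    state = record.get("state", "")
    if not (len(state) == 2 and state.isalpha() and state.isupper()):
        errors.append("state: expected a two-letter uppercase code")
    return errors


def try_correct(record):
    """Return a corrected copy if the error is mechanically fixable, else None."""
    state = record.get("state", "").strip()
    if len(state) == 2 and state.isalpha():
        fixed = dict(record)
        fixed["state"] = state.upper()
        return fixed
    return None


def cleanse(records, policy):
    """Apply the 'reject' (purist) or 'correct' policy to an input stream.

    Returns (loaded, reported): the records to load into the warehouse and
    the (record, errors) pairs to report to the operational environment.
    """
    if policy not in ("reject", "correct"):
        raise ValueError(f"unknown policy: {policy!r}")
    loaded, reported = [], []
    for record in records:
        errors = find_errors(record)
        if not errors:
            loaded.append(record)
            continue
        reported.append((record, errors))  # errors are reported under both policies
        if policy == "correct":
            fixed = try_correct(record)
            if fixed is not None:
                loaded.append(fixed)
    return loaded, reported
```

Note that under the `"correct"` policy the warehouse copy of a record can differ from the operational copy until the source system is fixed, which is exactly the disagreement between environments that the debate is about.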