Data Cleaning
I want to go through the Wikipedia series on Machine Learning and Data mining. Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
Data cleansing or data cleaning is the process of identifying and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset, table, or database. It involves detecting incomplete, incorrect, or inaccurate parts of the data and then replacing, modifying, or deleting the affected data. Data cleaning can be performed interactively with data-wrangling tools, or as batch processing through scripts or a data-quality firewall.
High-quality data needs to pass a set of quality criteria. Those include:
- Validity: The degree to which the measures conform to defined business rules or constraints. When modern database technology is used to design data-capture systems, validity is fairly easy to ensure: invalid data arises mainly in legacy contexts (where constraints were not implemented in software) or where inappropriate data-capture technology was used. Data constraints fall into the following categories (a constraint-checking sketch follows this list):
- Data-Type Constraints
- Range Constraints
- Mandatory Constraints (Not Null)
- Unique Constraints
- Accuracy: The degree of conformity of a measure to a standard or true value.
- Completeness: The degree to which all required measures are known. Incompleteness is almost impossible to fix with data-cleansing methodology.
- Consistency: The degree to which a set of measures is equivalent across systems.
- Uniformity: The degree to which data measures are specified using the same units of measure in all systems.
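
To make the constraint categories above concrete, here is a minimal sketch that checks a toy table against data-type, range, not-null, and unique constraints using pandas. The column names (customer_id, age, email) and the specific rules are assumptions I picked for illustration, not rules from any particular system.

```python
import pandas as pd

# Toy customer table; the columns and the rules below are illustrative assumptions.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, -5, "twenty", None],
    "email": ["a@example.com", "b@example.com", "b@example.com", None],
})

# Coerce age to numeric once; values that cannot be parsed become NaN.
age_num = pd.to_numeric(df["age"], errors="coerce")

violations = {
    # Data-type constraint: age must be numeric.
    "age_not_numeric": df[age_num.isna() & df["age"].notna()],
    # Range constraint: age must lie in a plausible interval.
    "age_out_of_range": df[(age_num < 0) | (age_num > 120)],
    # Mandatory (not-null) constraint: every record needs an email.
    "email_missing": df[df["email"].isna()],
    # Unique constraint: customer_id must not repeat.
    "customer_id_not_unique": df[df["customer_id"].duplicated(keep=False)],
}

for rule, rows in violations.items():
    print(rule, "->", len(rows), "violating row(s)")
```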
Process
- Data auditing: The data is audited with the use of statistical and database methods to detect anomalies and contradictions: this eventually indicates the characteristics of the anomalies and their locations.
- Workflow specification: The detection and removal of anomalies are performed by a sequence of operations on the data known as the workflow.
- Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified.
- Post-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness.
- Parsing: for the detection of syntax errors (a parsing-and-transformation sketch follows this list).
- Data transformation: Data transformation allows the mapping of the data from its given format into the format expected by the appropriate application.
- Duplicate elimination: Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that brings duplicate entries closer together for faster identification (a duplicate-detection sketch follows this list).
- Statistical methods: By analyzing the data using values such as the mean, standard deviation, and range, or clustering algorithms, an expert can find values that are unexpected and thus erroneous (an outlier-detection sketch follows this list).
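
As a small illustration of parsing and transformation, here is a sketch that treats a date field as the raw input: values that match none of the accepted patterns are rejected as syntax errors, and accepted values are transformed into ISO 8601. The accepted formats are an assumption for the example.

```python
from datetime import datetime
from typing import Optional

# Formats this example accepts; an illustrative assumption, not a fixed standard.
ACCEPTED_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def parse_and_transform(raw: str) -> Optional[str]:
    """Parse a raw date string and transform it to ISO 8601, or return None on a syntax error."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next accepted format
    return None  # no accepted format matched: treat the value as a syntax error

print(parse_and_transform("March 5, 2021"))  # -> 2021-03-05
print(parse_and_transform("05/03/2021"))     # -> 2021-03-05 (day/month order assumed)
print(parse_and_transform("not a date"))     # -> None
```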
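
For duplicate detection, a common trick is to sort records by a blocking key so that likely duplicates of the same entity end up next to each other, and then compare only adjacent records. In this sketch both the key (normalized name plus postcode) and the equality test are assumptions chosen for illustration; real matchers use richer keys and fuzzier comparisons.

```python
# Sort by a blocking key so candidate duplicates become neighbours,
# then compare only adjacent records instead of all pairs.
records = [
    {"id": 1, "name": "Ada Lovelace",  "postcode": "SW1A"},
    {"id": 2, "name": "ada lovelace ", "postcode": "SW1A"},
    {"id": 3, "name": "Alan Turing",   "postcode": "CB2"},
]

def blocking_key(rec):
    # Illustrative key: normalized name + postcode.
    return (rec["name"].strip().lower(), rec["postcode"].strip().upper())

records.sort(key=blocking_key)

duplicates = []
for prev, curr in zip(records, records[1:]):
    if blocking_key(prev) == blocking_key(curr):
        duplicates.append((prev["id"], curr["id"]))

print(duplicates)  # -> [(1, 2)]
```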
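
And for the statistical methods, a minimal sketch: flag values that fall more than two standard deviations from the mean. The threshold and the sample values are assumptions; note that an extreme value inflates both the mean and the standard deviation, which is why robust alternatives (median, interquartile range) are often preferred in practice.

```python
from statistics import mean, stdev

# Illustrative measurements with one implausible value.
ages = [34, 29, 41, 38, 27, 33, 240, 36, 31]

mu = mean(ages)
sigma = stdev(ages)

# Flag values more than 2 standard deviations from the mean (assumed threshold).
outliers = [x for x in ages if abs(x - mu) > 2 * sigma]
print(outliers)  # -> [240]
```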