Data Cleaning

I want to go through the Wikipedia series on Machine Learning and Data mining. Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Date Created:
1 22

References



Notes


Data cleansing or data cleaning is the process of identifying and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset, table, or database. It involves detecting incomplete, incorrect, or inaccurate parts of the data and then replacing, modifying, or deleting the affected data. Data cleaning can be performed interactively using data wrangling tools, or through batch processing, often via scripts or a data quality firewall.

High-quality data needs to pass a set of quality criteria. Those include:

  • Validity: The degree to which the measures conform to defined business rules or constraints. When modern database technology is used to design data-capture systems, validity is fairly easy to ensure: invalid data arises mainly in legacy contexts (where constraints were not implemented in software) or where inappropriate data-capture technology was used. Data constraints fall into the following categories:
    • Data-Type Constraints
    • Range Constraints
    • Mandatory Constraints (Not Null)
    • Unique Constraints
  • Accuracy: The degree of conformity of a measure to a standard or true value.
  • Completeness: The degree to which all required measures are known. Incompleteness is almost impossible to fix with data cleaning methodology.
  • Consistency: The degree to which a set of measures are equivalent across systems.
  • Uniformity: The degree to which a set of data measures are specified using the same units of measure in all systems.
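
The four constraint categories listed under Validity can be checked mechanically. Below is a minimal sketch in plain Python, assuming records arrive as dictionaries. The function name `find_violations`, the rule format, and the sample records are all hypothetical and invented for illustration; a real system would more likely enforce these rules in the database schema itself.

```python
def find_violations(records, schema, unique_fields=()):
    """Return a list of (row_index, field, problem) tuples.

    `schema` is a hypothetical rule format mapping field name -> dict with
    optional keys "type", "min", "max" (inclusive range) and "required".
    """
    violations = []
    seen = {field: set() for field in unique_fields}
    for i, row in enumerate(records):
        for field, rule in schema.items():
            value = row.get(field)
            # Mandatory constraint (not null)
            if value is None:
                if rule.get("required", False):
                    violations.append((i, field, "missing"))
                continue
            # Data-type constraint
            expected = rule.get("type")
            if expected is not None and not isinstance(value, expected):
                violations.append((i, field, "wrong type"))
                continue
            # Range constraint
            if "min" in rule and value < rule["min"]:
                violations.append((i, field, "out of range"))
            elif "max" in rule and value > rule["max"]:
                violations.append((i, field, "out of range"))
        # Unique constraint: a value may appear only once per field
        for field in unique_fields:
            value = row.get(field)
            if value is None:
                continue
            if value in seen[field]:
                violations.append((i, field, "duplicate"))
            seen[field].add(value)
    return violations


# Made-up sample data, one problem per bad row.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},     # range violation
    {"id": 2, "age": "41"},   # wrong type, and a repeated id
    {"id": 4, "age": None},   # missing mandatory value
]
schema = {
    "id": {"type": int, "required": True},
    "age": {"type": int, "min": 0, "max": 130, "required": True},
}
violations = find_violations(records, schema, unique_fields=("id",))
for row_index, field, problem in violations:
    print(row_index, field, problem)
```

The sketch only reports violations. Whether to replace, modify, or delete the affected values is a separate decision.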

Process

  • Data auditing: The data is audited with the use of statistical and database methods to detect anomalies and contradictions: this eventually indicates the characteristics of the anomalies and their locations.
  • Workflow specification: The detection and removal of anomalies are performed by a sequence of operations on the data known as the workflow.
  • Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified.
  • Post-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness.
  • Parsing: for the detection of syntax errors.
  • Data transformation: Data transformation allows the mapping of the data from its given format into the format expected by the appropriate application.
  • Duplicate elimination: Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that would bring duplicate entries closer together for faster identification.
  • Statistical methods: By analyzing the data using the values of mean, standard deviation, range, or clustering algorithms, it is possible for an expert to find values that are unexpected and thus erroneous.
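
The last two items in the list above can be illustrated with a short sketch using only the Python standard library. It rests on several assumptions: the normalization rule in `normalize_key`, the cutoff of 2 standard deviations, and the sample data are all made up for illustration, and real duplicate detection usually needs fuzzier matching than an exact comparison of keys.

```python
import statistics


def normalize_key(name):
    """Build a crude matching key: lowercase, keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def adjacent_duplicates(records, field):
    """Sort by the normalized key so likely duplicates become neighbors,
    then compare each record only with the one directly before it."""
    ordered = sorted(records, key=lambda r: normalize_key(r[field]))
    pairs = []
    for prev, curr in zip(ordered, ordered[1:]):
        if normalize_key(prev[field]) == normalize_key(curr[field]):
            pairs.append((prev["id"], curr["id"]))
    return pairs


def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` sample standard
    deviations away from the mean (the threshold is an assumed cutoff)."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Made-up sample data.
customers = [
    {"id": 1, "name": "Acme Corp."},
    {"id": 2, "name": "Globex"},
    {"id": 3, "name": "acme corp"},
]
duplicate_pairs = adjacent_duplicates(customers, "name")
print(duplicate_pairs)

ages = [31, 29, 35, 33, 30, 32, 28, 34, 310]
suspect_ages = flag_outliers(ages)
print(suspect_ages)
```

Flagged values are only candidates for review. As the note says, it still takes an expert to decide whether an unexpected value is actually erroneous.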

