Published on January 15, 2020
Records and information managers should know: What is data cleaning (aka data cleansing)? Learn some tips and facts on how to identify and clean data.
What is data cleaning? Records and information managers should be able to answer that question. Data cleaning (or data cleansing) keeps information current, accurate and complete. Good data hygiene ensures that people can easily access the information when they need it. An organization would no more keep dirty data around than you would store garbage in a suitcase or gift wrap it for a friend.
So, what are clean data, dirty data and data cleaning? And how does an organization separate the crisp data from the dingy?
Telling Clean Data From Dirty Data
Clean data has no errors. Dirty data has errors of commission, such as improper punctuation, incorrect information and duplicate data. Dirty data also has errors of omission where the data is old or incomplete.
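To make the distinction concrete, here is a minimal Python sketch of an audit that flags one error of commission (duplicates) and two errors of omission (missing fields and stale entries). The sample records, the field names, the `audit` function and the 2015 cutoff date are all hypothetical, chosen only for illustration. A real records system would need its own rules.

```python
from datetime import date

# Hypothetical sample records; the field names are illustrative assumptions.
records = [
    {"id": 1, "name": "Ada Lopez", "email": "ada@example.com", "updated": date(2019, 11, 2)},
    {"id": 2, "name": "Ada Lopez", "email": "ada@example.com", "updated": date(2019, 11, 2)},
    {"id": 3, "name": "Ben Ito", "email": "", "updated": date(2012, 5, 9)},
]

def audit(records, required=("name", "email"), stale_before=date(2015, 1, 1)):
    """Return a dict mapping each problem record's id to a list of issues."""
    problems = {}
    seen = set()
    for rec in records:
        issues = []
        # Error of commission: same required-field values as an earlier record.
        key = tuple(rec.get(field) for field in required)
        if key in seen:
            issues.append("duplicate")
        seen.add(key)
        # Error of omission: a required field is empty or absent.
        for field in required:
            if not rec.get(field):
                issues.append(f"missing {field}")
        # Error of omission: the record has not been updated since the cutoff.
        if rec["updated"] < stale_before:
            issues.append("stale")
        if issues:
            problems[rec["id"]] = issues
    return problems

report = audit(records)
print(report)
```

The audit only reports problems. Deciding whether to merge, fix or delete a flagged record is left to the records manager.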
What Is Data Cleaning?
Data cleaning is an ongoing process that uses data wrangling or text analysis software to remove errors, so that the information stays correct, appropriately inclusive and up to date, and therefore useful.
Ready to clean that data? Then, capture these must-have data-cleaning facts, basics and tips.
- Dirty data is typically found in information as it appears when people and systems first digitize or record it in some fixed form. For example, data in audio recordings, images, and typed or handwritten text can be dirty.
- When people and systems first create it, the data can lack metadata, such as attributes that describe it.
- Records and information managers discuss clean data in relation to data health and data integrity.
- RIM managers discuss dirty data in relation to data clutter.
- Data cleaning can mean removing HTML markup, such as tags and character codes, from the text.
- Data cleaning can include sifting for stop words, which are common natural language words like “if,” “and” or “but.”
- Data cleansing can require deleting punctuation marks and other characters that appear in the original format but create confusion in the new one.
- Because language errors can alter the meaning, data cleaning can include checks on grammar, spelling and usage.
- The organization should consider its data cleansing needs before it acquires data wrangling or text analysis software. Different software packages have different data cleaning strengths.
- Once the organization corrals the data, it must detect any absent information, errors and contradictions before it can cleanse the data. The more disparate the data types and sources, the more challenging the cleaning process.
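Several of the text-cleaning tips above (removing HTML, deleting punctuation and sifting for stop words) can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not a substitute for dedicated data wrangling or text analysis software. The `clean_text` function and the sample sentence are hypothetical, the stop-word set holds only the three examples named above, and the tag-stripping pattern is deliberately naive.

```python
import html
import re
import string

# Only the three stop words named in the tips; real lists are much longer.
STOP_WORDS = {"if", "and", "but"}

def clean_text(raw):
    """Strip HTML, punctuation and stop words; return lowercase word list."""
    # Replace anything that looks like an HTML tag with a space (naive pattern).
    text = re.sub(r"<[^>]+>", " ", raw)
    # Convert character references such as &amp; back to plain characters.
    text = html.unescape(text)
    # Delete ASCII punctuation marks.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercase, split on whitespace and drop stop words.
    return [word for word in text.lower().split() if word not in STOP_WORDS]

sample = "<p>Clean data, if kept current &amp; complete, is useful; but dirty data is not.</p>"
tokens = clean_text(sample)
print(tokens)
```

The order of the steps matters here: punctuation is removed only after the HTML is handled, since the tags and character references themselves rely on punctuation characters.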
Happy data cleaning!