Data Preprocessing for ML Models
Data preprocessing, put simply, is the set of transformation techniques that convert raw data into a format suitable for building and training machine learning models. It is one of the most crucial steps for enhancing data quality so that meaningful information and insights can be extracted. During this step, we make sure the data is in the format the ML model requires and that this formatted data is free from noise, complete, and consistent. Here is why we can't feed raw data straight into an ML model:
- Raw data is highly vulnerable to missing values, noise, outliers, and inconsistencies because of its huge size, multiple sources, and varied gathering methods.
- Poor-quality data will negatively affect the ML model's performance.
- Preprocessing techniques must be applied to the data to improve its quality and usefulness.
Now we will move on to the data preprocessing techniques, step by step:
1. Data Quality Assessment:
In this step we evaluate the data to identify whether it meets the quality required for our machine learning model. We also make sure the data is of the required quantity and type so that it supports our intended use in developing the model. During DQA we look for the following issues in the data:
- Mismatched data types
- Mixed data values
- Data outliers
- Missing data
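The checks above can be sketched in a few lines of plain Python. This is a minimal, hypothetical example: the records and the "age" column are made up, and the outlier check is a simple domain rule rather than a statistical test.

```python
# A toy data-quality assessment over a list of records,
# assuming each record is a dict with a hypothetical "age" field.
records = [
    {"age": 25}, {"age": 31}, {"age": "31"},   # mixed types: int vs str
    {"age": None},                             # missing value
    {"age": 29}, {"age": 27}, {"age": 250},    # 250 breaks the domain rule
]

values = [r["age"] for r in records]

# Mismatched / mixed data types: which Python types appear in the column?
types_seen = {type(v).__name__ for v in values if v is not None}

# Missing data: count the null entries
n_missing = sum(v is None for v in values)

# Outliers: here a simple domain rule (a human age should be 0..120)
nums = [v for v in values if isinstance(v, (int, float))]
out_of_range = [v for v in nums if not (0 <= v <= 120)]

print(sorted(types_seen))   # ['int', 'str']
print(n_missing)            # 1
print(out_of_range)         # [250]
```

In a real project you would run these checks per column over the whole dataset (for example with pandas' `dtypes` and `isna()`), but the idea is the same.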
2. Data Cleaning:
As the name suggests, data cleaning is the process of removing corrupt or irrelevant data by editing, correcting, and structuring the data within the dataset. Though this step can seem tedious, it is essential for getting powerful, meaningful insights from the data and making the best use of the ML model. One principle behind data processing applies here:
If you train ML models on bad, uncleaned data, you can't expect the end analysis to be trustworthy; any conclusions drawn from it will be misleading.
Data Cleaning Steps & Techniques:
- Step 1: Remove irrelevant data
- Step 2: Deduplicate your data
- Step 3: Fix structural errors
- Step 4: Deal with missing data
- Step 5: Filter out data outliers
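Here is a toy sketch of how these steps might look in plain Python. The rows, column names, and imputation choice (mean) are all made-up assumptions for illustration; note that Step 3 is applied before Step 2 so that near-duplicates collapse properly.

```python
# Hypothetical raw rows with an unused "note" column
rows = [
    {"name": "Ann",  "age": 25,   "note": "ignore me"},
    {"name": "ann ", "age": 25,   "note": ""},   # duplicate after fixing case/space
    {"name": "Bob",  "age": None, "note": ""},   # missing age
    {"name": "Cara", "age": 130,  "note": ""},   # outlier age
]

# Step 1: remove irrelevant data (drop the unused "note" column)
rows = [{k: v for k, v in r.items() if k != "note"} for r in rows]

# Step 3: fix structural errors (trim whitespace, normalize case),
# done before deduplication so near-duplicates become exact duplicates
for r in rows:
    r["name"] = r["name"].strip().title()

# Step 2: deduplicate
seen, unique = set(), []
for r in rows:
    key = (r["name"], r["age"])
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Step 4: deal with missing data (here: impute with the mean of known ages)
known = [r["age"] for r in unique if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in unique:
    if r["age"] is None:
        r["age"] = mean_age

# Step 5: filter out outliers with a simple domain rule
clean = [r for r in unique if 0 <= r["age"] <= 120]

print([r["name"] for r in clean])   # ['Ann', 'Bob']
```

Libraries like pandas give you these operations directly (`drop_duplicates`, `fillna`, boolean filtering), but the logic underneath is the same.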
3. Data Transformation:
With data cleaning we started editing the raw data; now we move on to reformatting the data that our ML model will take as input. Data transformation changes the format, structure, or values of the data and converts it into clean, usable data.
Here are the steps involved in data transformation process:
- Smoothing: Smoothing is a process used to remove the unnecessary, corrupt or meaningless data or ‘noise’ in a dataset. Smoothing improves the algorithm’s ability to detect useful patterns in data.
- Aggregation: Data aggregation is gathering data from a number of sources and storing it in a single, unified format. Aggregation improves data quality by summarizing large amounts of data and revealing information about groups or clusters within it.
- Discretization: Discretization is a transformation method that breaks continuous data up into small intervals (bins). Although real-world data is often continuous, many mining algorithms and frameworks can only handle discrete values.
- Attribute construction: In attribute construction, new attributes are generated and applied in the mining process from the existing set of attributes. It improves mining efficiency by simplifying the original data.
- Generalization: Generalization converts low-level data attributes into high-level ones using a concept hierarchy. For example, ages in numerical form (22, 52) are converted into categorical values (young, old).
- Normalization: Normalization is an important data transformation step in which numeric values are scaled to fall within a given range (for example, 0 to 1), so that no attribute dominates the model simply because of its scale.
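A few of these transformations can be sketched in plain Python. The series and ages below are invented, the age cut-offs are arbitrary assumptions, and the smoothing here is a simple 3-point moving average:

```python
# Smoothing: a 3-point moving average to damp noise in a series
series = [10, 12, 50, 11, 13]          # 50 is a noisy spike
smooth = [sum(series[i:i + 3]) / 3 for i in range(len(series) - 2)]

# Discretization + generalization: numeric ages -> labeled intervals
ages = [22, 52, 35]
labels = ["young" if a < 30 else "middle" if a < 50 else "old" for a in ages]

# Normalization: min-max scaling into the range [0, 1]
lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]

print(labels)                          # ['young', 'old', 'middle']
print([round(s, 2) for s in scaled])   # [0.0, 1.0, 0.43]
```

Notice how the moving average pulls the spike of 50 down toward its neighbors, and how min-max scaling always maps the smallest value to 0 and the largest to 1.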
4. Data Reduction:
Data reduction techniques shrink the representation of the dataset to a much smaller volume while maintaining the integrity of the original data. Data reduction not only makes analysis easier and more accurate but also cuts down on data storage, and it helps ensure proper memory utilization while the ML model is processing. We can use the following data reduction techniques:
- Data Cube aggregation:
Construct a data cube by applying aggregation operations to the data without losing the necessary information.
- Attribute Subset Selection:
Reduce the dataset size by removing redundant or irrelevant features and dimensions. We can use correlation analysis to decide which attributes to exclude.
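The correlation idea can be sketched like this: compute Pearson correlation between features and drop any feature that is nearly a copy of one we already kept. The feature names, values, and the 0.95 threshold are all hypothetical choices for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

features = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],   # redundant: same info as height_cm
    "weight_kg": [55, 70, 65, 90],
}

# Keep a feature only if it is not highly correlated with an already-kept one
kept = []
for name, col in features.items():
    if all(abs(pearson(col, features[k])) <= 0.95 for k in kept):
        kept.append(name)

print(kept)   # ['height_cm', 'weight_kg']
```

`height_in` is just `height_cm` in different units, so it correlates almost perfectly with it and gets dropped, while `weight_kg` carries new information and is kept.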
5. Dimensionality Reduction:
It is also known as data compression, which uses encoding mechanisms to reduce the size of the dataset. Common techniques include:
- Principal Component Analysis (PCA)
- Wavelet Transforms
- Multi-dimensional Scaling
- Self-Organizing Maps
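Of these, PCA is the most common, so here is a minimal sketch with NumPy. The 2-D points are made up so that they lie almost on a line, which means a single principal component should capture nearly all of the variance:

```python
import numpy as np

# Hypothetical 2-D points lying almost on a line
X = np.array([[2.0, 2.1], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1], [6.0, 6.0]])

# 1. Center the data around its mean
Xc = X - X.mean(axis=0)

# 2. Eigen-decompose the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# 3. Keep the top principal component and project onto it: 2-D -> 1-D
pc1 = eigvecs[:, -1]
projected = Xc @ pc1

# Fraction of the total variance the first component explains
explained = eigvals[-1] / eigvals.sum()

print(projected.shape)   # (5,)
```

Since the points are nearly collinear, `explained` comes out close to 1, which is exactly when dropping the second dimension loses almost nothing.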
I have simplified the data preprocessing steps that I have learned. I know each step above could be a full blog post of its own, but I just wanted to outline the steps so that you can figure out which direction to look in during preprocessing.