Data Pre-processing Process for ML Models

#Machine Learning

Data preprocessing, put simply, is the set of transformations that turn raw data into a format suitable for building and training machine learning models. It is one of the most crucial steps for improving data quality so that meaningful information and insights can be extracted. During this step we make sure the data is in the format the ML model requires and that the formatted data is complete, consistent and as free from noise as possible. Here is why we can't use raw data directly for an ML model:

Raw data is highly prone to missing values, noise, outliers and inconsistencies because of its large size, its multiple sources and the way it is gathered.
Poor-quality data negatively affects the ML model.
Preprocessing techniques must be applied to the data to improve its quality.

Now let's move on to the data preprocessing techniques, one step at a time:

1. Data Quality Assessment:

In this step we evaluate the data to determine whether it meets the quality required for our machine learning model. We also make sure the data is of the required quantity and type to support its intended use in developing the model. During data quality assessment (DQA) we look for the following issues:

Mismatched data types
Mixed data values
Data outliers
Missing data
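
The four checks above can be sketched in a few lines of Python. This is only an illustration using the standard library: the records, the column names and the 1.5 * IQR outlier rule are assumptions made for the example, not part of any particular dataset or tool.

```python
from statistics import quantiles

# Hypothetical records for illustration; None marks a missing value.
rows = [
    {"age": 25, "income": 42000, "city": "Pokhara"},
    {"age": "31", "income": 51000, "city": "Kathmandu"},  # age stored as text
    {"age": 40, "income": None, "city": "Kathmandu"},     # missing income
    {"age": 38, "income": 48000, "city": "kathmandu"},    # inconsistent casing
    {"age": 29, "income": 950000, "city": "Lalitpur"},    # suspicious income
    {"age": 35, "income": 46000, "city": "Pokhara"},
]

def assess(rows, column):
    """Count missing values and list the Python types found in one column."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
    }

def has_mixed_casing(values):
    """True if some text values differ only by letter case."""
    return len(set(values)) != len({v.lower() for v in values})

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], a common rule of thumb."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

incomes = [r["income"] for r in rows if r["income"] is not None]
print(assess(rows, "age"))                            # mismatched data types
print(assess(rows, "income"))                         # missing data
print(has_mixed_casing([r["city"] for r in rows]))    # mixed data values
print(iqr_outliers(incomes))                          # data outliers
```

On a real project a dataframe library would usually do this work, but the questions asked of the data are the same.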

2. Data Cleaning:

The name says what we do in this step. Data cleaning is the process of removing corrupt or irrelevant data by editing, correcting and structuring the data within the dataset. Though this step seems tedious, it is essential for getting meaningful insights from the data and for making the best use of the ML model. Here is one principle behind data preprocessing:

If you train ML models on bad, uncleaned data, you can't expect the results to be trustworthy, and any conclusion drawn from them will be misleading.

Data Cleaning Steps & Techniques:

Step 1: Remove irrelevant data
Step 2: Deduplicate your data
Step 3: Fix structural errors
Step 4: Deal with missing data
Step 5: Filter out data outliers
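
As a rough sketch, the five steps could look like this in plain Python. The records, the column names, the median fill for missing income and the 1.5 * IQR outlier rule are all assumptions chosen for the example; the right choices depend on the dataset.

```python
from statistics import median, quantiles

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": "25", "income": 42000, "city": " Pokhara ", "note": "n/a"},
    {"id": 1, "age": "25", "income": 42000, "city": " Pokhara ", "note": "n/a"},  # exact duplicate
    {"id": 2, "age": "31", "income": 51000, "city": "kathmandu", "note": ""},
    {"id": 3, "age": "40", "income": None, "city": "Kathmandu", "note": ""},
    {"id": 4, "age": "29", "income": 950000, "city": "Lalitpur", "note": ""},
    {"id": 5, "age": "35", "income": 46000, "city": "Pokhara", "note": ""},
    {"id": 6, "age": "38", "income": 48000, "city": "Lalitpur", "note": ""},
]

KEEP = ("id", "age", "income", "city")  # columns assumed relevant to the model

def clean(rows):
    # Step 1: remove irrelevant data (here, the free-text "note" column).
    rows = [{k: r[k] for k in KEEP} for r in rows]

    # Step 2: deduplicate, keeping the first occurrence of each record.
    seen, unique = set(), []
    for r in rows:
        key = tuple(r.items())
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Step 3: fix structural errors (stray whitespace, casing, numbers stored as text).
    for r in unique:
        r["city"] = r["city"].strip().title()
        r["age"] = int(r["age"])

    # Step 4: deal with missing data by filling income with the median of known incomes.
    fill = median(r["income"] for r in unique if r["income"] is not None)
    for r in unique:
        if r["income"] is None:
            r["income"] = fill

    # Step 5: filter out outliers using the 1.5 * IQR rule.
    q1, _, q3 = quantiles([r["income"] for r in unique], n=4, method="inclusive")
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [r for r in unique if low <= r["income"] <= high]

cleaned = clean(raw)
for record in cleaned:
    print(record)
```

Here the duplicate of record 1 and the record with the extreme income are dropped, and the missing income is filled in. Whether to fill, drop or flag a missing value is a judgment call that should be made per column.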

3. Data Transformation:

With data cleaning we started to edit the raw data; now we move on to reformatting the data into the form our ML model will take as input. Data transformation changes the format, structure or values of the data and converts them into clean, usable data.

Here are the steps involved in data transformation process:

Smoothing: Smoothing is a process used to reduce random variation, or 'noise', in a dataset. Smoothing improves the algorithm's ability to detect useful patterns in the data.
Aggregation: Data aggregation is gathering data from a number of sources and presenting it in a single, summarized format. It can improve the quality of the data by giving a combined view of groups of records instead of many scattered ones.
Discretization: Discretization is a transformation method that breaks up continuous data into a small number of intervals. It is useful because some algorithms and tools work more easily with discrete values than with continuous ones.
Attribute construction: In attribute construction, new attributes are generated from the existing set of attributes and used in the mining process. It can improve mining efficiency by simplifying the original data.
Generalization: Generalization converts low-level data attributes to high-level data attributes by using a concept hierarchy. For example, ages in numerical form in the raw data (22, 52) are converted into the categorical values (Young, Old).
Normalization: Normalization is an important step in data transformation. Here numeric values are rescaled to fall within a given range, for example 0 to 1, so that attributes measured on different scales become comparable.
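
A few of these transformations are short enough to sketch directly. The snippet below is a minimal illustration in plain Python: a trailing moving average stands in for smoothing, a single cut-off at age 40 is an assumed boundary for the Young/Old example, and the household columns are hypothetical.

```python
def moving_average(values, window=3):
    """Smoothing: replace each point with the mean of a trailing window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def discretize(value, edges, labels):
    """Discretization / generalization: map a number to the label of its interval.

    `labels` must have one more entry than `edges`.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

def min_max(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column, nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Smoothing a noisy series.
smoothed = moving_average([1, 2, 9, 4, 5], window=3)

# Generalization of ages; the cut-off at 40 is an assumption for this example.
age_groups = [discretize(a, edges=[40], labels=["Young", "Old"]) for a in (22, 52)]

# Normalization to [0, 1].
scaled = min_max([10, 20, 30])

# Attribute construction: derive a new attribute from two existing ones.
households = [{"income": 60000, "members": 4}, {"income": 45000, "members": 3}]
for h in households:
    h["income_per_member"] = h["income"] / h["members"]

print(smoothed, age_groups, scaled, households)
```

Other choices are just as valid, for instance a centered window for smoothing or standardization to zero mean and unit variance instead of min-max scaling.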

4. Data Reduction:

Data reduction produces a representation of the dataset that is much smaller in volume while maintaining the integrity of the original dataset. It can make the analysis easier and, in some cases, more accurate, and it also cuts down on data storage. It helps make sure memory is used properly while the ML model processes the data. We can use the following techniques for data reduction:

Data cube aggregation:
Construct a data cube by applying aggregation operations to the data without losing the necessary information.
Attribute subset selection:
Reduce the dataset size by removing redundant features (dimensions) and irrelevant attributes. Correlation analysis can be used to decide which attributes to exclude.
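
One simple way to use correlation analysis for attribute subset selection is to drop any attribute that is almost perfectly correlated with one already kept. The sketch below assumes numeric, non-constant columns; the column names, the data and the 0.95 threshold are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with a column already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns: height_in repeats the information in height_cm.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "weight_kg": [65, 50, 80, 60, 75],
}

selected = select_features(columns)
print(selected)
```

This greedy pass depends on column order and only catches pairwise, linear redundancy, so it is a starting point and not a complete feature selection method.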

5. Dimensionality Reduction:

Dimensionality reduction, sometimes described as a form of data compression, uses an encoding or transformation to represent the dataset with fewer features. Common techniques include:

Principal Component Analysis (PCA)
Wavelet Transforms
Multi-dimensional Scaling
Self-Organizing Map (SOM)
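
To give a feel for the first technique in the list, here is a minimal PCA sketch using NumPy's singular value decomposition. The synthetic data, the function name `pca` and the choice of one component are assumptions for the example; in practice a library implementation would normally be used.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components.

    Returns the projected data and the share of variance each kept component explains.
    """
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]

# Synthetic data: three features that are all noisy copies of one hidden variable t.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.1, size=200),
    -t + rng.normal(scale=0.1, size=200),
])

Z, ratio = pca(X, n_components=1)
print(Z.shape)   # 200 samples, 1 feature instead of 3
print(ratio)     # share of the total variance kept by that one feature
```

Because the three columns carry nearly the same information, a single component keeps almost all of the variance. Real data is rarely this redundant, so the explained variance ratio is worth checking before deciding how many components to keep.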

I have simplified the data preprocessing steps that I have learned. I know each of the steps above could be a full blog post on its own, but I just wanted to outline them so that you can find out which direction to look in during preprocessing.
