If you have ever wondered about the distinction between exploratory data analysis (EDA), preprocessing, and feature engineering, we are here to clear things up. All three terms are essential to machine learning, and getting the hang of the concepts is not difficult.
Exploratory Data Analysis
Exploratory data analysis (EDA) is a set of methods used in the preliminary stages of analysis. Exploratory analysis is designed to establish priorities, develop operational definitions, and improve the final design. It includes basic descriptive statistics, data visualization, and correlation analysis.
EDA means examining data closely to derive actionable information from it, such as anomalies or other inconsistencies. It involves analyzing and summarizing data sets, often large ones, usually in the form of charts and graphs.
As an essential step in a data science project, it can take up a large share of the timeline; figures of 70-80% are often quoted. The better you know your data set, the better you can use it.
Although EDA is applied according to the situation and the types of data available, there are basic techniques that become the foundation of data analysis:
- Univariate analysis. As the name implies, univariate analysis is when variables are analyzed individually. Whether a variable is categorical or continuous, if we examine it independently of others, it is called univariate analysis.
- Bivariate analysis. This studies the relationship between two variables in a data set, either between two predictor variables or between a predictor and the target variable. Strong relationships between predictors, if they exist, can cause problems during model development, such as redundant or unstable model estimates (multicollinearity).
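The two techniques above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the variables area_m2 and price_k are made-up toy data, and in a real project a library such as pandas would normally do this work.

```python
import math
import statistics

# Hypothetical toy data: two numeric variables observed on the same rows.
area_m2 = [30, 45, 50, 62, 75, 80, 95, 120]
price_k = [90, 130, 150, 170, 210, 220, 270, 330]

def summarize(values):
    """Univariate analysis: describe one variable on its own."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def pearson(xs, ys):
    """Bivariate analysis: linear (Pearson) correlation of two variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

area_summary = summarize(area_m2)
r = pearson(area_m2, price_k)
print(area_summary)
print(f"Pearson r between area and price: {r:.3f}")
```

A correlation close to 1 or -1 between two predictors is the kind of relationship worth flagging before modeling.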
In short, EDA uncovers knowledge from your data by summarizing and visualizing it. This process allows us to create new features (more about feature engineering later) and incorporate that knowledge into machine learning models.
Preprocessing
Preprocessing prepares your data for analysis by making it fit the requirements of the task at hand.
It is one of the most crucial stages of data mining. Without it, further analysis is in most cases impossible, because the analytical algorithms will either not run or produce incorrect results. In other words, the GIGO principle applies: garbage in, garbage out.
Data preprocessing covers two areas: cleaning and optimization.
Cleaning removes the factors that reduce data quality and interfere with analytical algorithms. It includes handling duplicates, inconsistencies, and dummy values, filling in gaps, smoothing, suppressing noise, and correcting anomalous values. In addition, the cleaning process repairs violations of the data structure and converts incorrect formats.
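A few of these cleaning steps can be sketched as follows. The records, the valid age range, and the choice to fill gaps with the median are all assumptions made for illustration; the right strategy depends on the data and the task.

```python
import statistics

# Hypothetical raw records: (customer_id, age). None marks a gap,
# -1 is a dummy placeholder, and 430 is an obviously anomalous entry.
raw = [(1, 34), (2, None), (3, 29), (3, 29), (4, -1), (5, 430), (6, 41)]

def clean(records, low=0, high=120):
    # 1. Drop exact duplicates while keeping the original order.
    seen, unique = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)

    # 2. Treat gaps, dummy values, and out-of-range values as missing.
    def valid(age):
        return age is not None and low <= age <= high

    # 3. Fill every missing value with the median of the valid ones.
    fill = statistics.median(age for _, age in unique if valid(age))
    return [(cid, age if valid(age) else fill) for cid, age in unique]

cleaned = clean(raw)
print(cleaned)
```

Replacing anomalous values with the median is only one option; dropping the rows or flagging them for review may be more appropriate in practice.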
Data optimization, as an element of preprocessing, includes dimensionality reduction and the identification and exclusion of insignificant features. The main difference between the two is that the problems removed by cleaning significantly reduce the accuracy of the results or make the analytical algorithms unusable, whereas optimization adapts the data to the specific task and increases the efficiency of the analysis.
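One simple way to exclude insignificant features is to drop columns that barely vary. The sketch below assumes a hypothetical column-wise table and uses a variance threshold; it is only one of many feature-selection approaches and says nothing about how a feature relates to the target.

```python
import statistics

# Hypothetical feature table stored column-wise.
columns = {
    "age":        [34, 29, 41, 52, 38],
    "country_id": [1, 1, 1, 1, 1],   # constant: carries no information
    "income_k":   [40, 35, 62, 80, 51],
}

def drop_low_variance(table, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in table.items()
        if statistics.pvariance(values) > threshold
    }

reduced = drop_low_variance(columns)
print(list(reduced))
```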
Data preprocessing can be performed at several points: when offloading data from primary sources and OLTP systems, in the data warehouse, and in the analytics platform.
Feature Engineering
Feature engineering is the most creative data preparation phase of machine learning. It is done after the sample has been created and the data-cleaning process is finished.
Features can be extracted from any type of data, including text, images, and geodata. When processing textual information, tokenization is performed first, followed by lemmatization and vectorization (conversion of words into numeric vectors).
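The text pipeline described here can be sketched roughly as below. The LEMMAS lookup table is a hypothetical stand-in for a real lemmatizer from an NLP library, and bag-of-words counting is just one simple way to turn words into numeric vectors.

```python
import re
from collections import Counter

# Hypothetical lemma table; real projects would use an NLP library instead.
LEMMAS = {"cats": "cat", "sat": "sit", "mats": "mat", "sitting": "sit"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lemmatize(tokens):
    """Map each token to its base form when the table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

def vectorize(docs):
    """Turn documents into bag-of-words count vectors over a shared vocabulary."""
    processed = [lemmatize(tokenize(doc)) for doc in docs]
    vocab = sorted({tok for doc in processed for tok in doc})
    vectors = []
    for doc in processed:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

docs = ["The cats sat on the mats.", "A cat is sitting on the mat."]
vocab, vectors = vectorize(docs)
print(vocab)    # ['a', 'cat', 'is', 'mat', 'on', 'sit', 'the']
print(vectors)  # [[0, 1, 0, 1, 1, 1, 2], [1, 1, 1, 1, 1, 1, 1]]
```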
In the case of images, we often analyze not only the content of an image as a set of pixels of different colors but also the metadata of the image file, such as date of capture, resolution, camera model, etc.
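As a rough illustration, metadata like this can be turned into model-ready features. The meta dictionary and its field names are hypothetical; how the metadata is actually read from an image file is left out of this sketch.

```python
from datetime import datetime

# Hypothetical metadata record, as it might be read from an image file.
meta = {
    "captured_at": "2023-07-15 18:42:10",
    "width": 4000,
    "height": 3000,
    "camera_model": "ExampleCam X1",
}

def metadata_features(m):
    """Derive simple features from the capture date and resolution."""
    taken = datetime.strptime(m["captured_at"], "%Y-%m-%d %H:%M:%S")
    return {
        "hour": taken.hour,
        "is_weekend": taken.weekday() >= 5,  # Saturday=5, Sunday=6
        "megapixels": m["width"] * m["height"] / 1_000_000,
        "aspect_ratio": round(m["width"] / m["height"], 2),
        "camera_model": m["camera_model"],
    }

features = metadata_features(meta)
print(features)
```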
At this stage you can train many versions of your model on differently transformed data sets; the goal is to find the transformations that produce the most accurate model.
The types of transformations considered in feature development are often inspired by discoveries made during the EDA phase when examining a dataset.
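The idea of comparing models trained on different transformations of the same data can be sketched with a one-feature linear fit. The data below is made up so that the target grows roughly with the logarithm of the feature, and the error is measured on the training data for brevity; a real comparison would use held-out data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(xs, ys):
    """Mean squared error of the fitted line on the same data."""
    slope, intercept = fit_line(xs, ys)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data where y grows roughly with log2(x).
xs = [1, 2, 4, 8, 16, 32, 64]
ys = [0.1, 1.0, 2.1, 2.9, 4.0, 5.1, 5.9]

# Candidate versions of the feature: raw and log-transformed.
candidates = {"raw": xs, "log2": [math.log2(x) for x in xs]}
errors = {name: mse(feature, ys) for name, feature in candidates.items()}
best = min(errors, key=errors.get)
print(errors)
print("best transformation:", best)
```

A scatter plot made during EDA would already hint at the logarithmic shape, which is how the exploratory phase feeds into the choice of candidate transformations.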
Conclusion
These three terms come up constantly in data preparation. We have focused mainly on EDA because it is the first critical step toward understanding the data set. Data preprocessing covers acquiring and cleaning the data before it is handed to data scientists, who perform feature engineering to create an optimal set of features for the subsequent modeling stage.