Preparing data for machine learning

Data preparation process

The more disciplined you are in managing your data, the more consistent and better results you are likely to get. The process of preparing data for a machine learning algorithm can be summarized in three steps:

Step 1: Select the Data

Step 2: Preprocess Data

Step 3: Transform the Data

Step 1: Select the Data

In this step, you will select the subset of all available data that you will be working with. There is always a strong desire to include all available data, which follows the adage "more is better". This may or may not be true.

You need to consider what data you really need to solve the question or problem you are working on. Make some assumptions about the data you need, and be careful to record those assumptions so you can test them later if you need to.

Below are some questions to help you think about this process:

What is the extent of the data available to you? For example, across time, database tables, and connected systems. Make sure you have a clear picture of everything you can use.

What data is not available that you wish you had? For example data that has not been recorded or cannot be recorded. You may be able to obtain or simulate this data.

What data do you not need to solve the problem? It is almost always easier to exclude data than to include data. Note which data you excluded and why.


It is only in small problems, such as competition or toy datasets, that the data has already been selected for you.
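The selection step above can be sketched with pandas. This is a minimal illustration, not a prescribed workflow; the column names (`age`, `income`, `region`, `raw_notes`) are hypothetical.

```python
import pandas as pd

# Hypothetical raw data: assume "raw_notes" is a column we decided we do not need.
df = pd.DataFrame({
    "age": [25, 40, 31],
    "income": [48000, 72000, 55000],
    "region": ["north", "south", "north"],
    "raw_notes": ["ok", "vip", "ok"],
})

# Keep only the features we assume are relevant to the problem.
selected = df[["age", "income", "region"]]

# Record what was excluded and why, so the assumption can be tested later.
excluded = ["raw_notes"]  # excluded: free text, assumed not predictive
```

Keeping an explicit record of excluded columns makes it cheap to revisit the assumption later if the model underperforms.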




Step 2: Preprocess Data

Once you have selected your data, you need to consider how you are going to use it. The goal of preprocessing is to get the selected data into a form that you can work with.

Three common data preprocessing steps are formatting, cleaning and sampling:

Formatting: The data you have selected may not be in a format that is suitable to work with. The data may be in a relational database and you would like it in a flat file, or it may be in a proprietary file format and you would like it in a relational database or a text file.

Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are incomplete and do not carry the information you believe you need to address the problem; these instances may need to be removed. Additionally, some attributes may contain sensitive information, and these attributes may need to be anonymized or removed from the data entirely.

Sampling: There may be far more selected data available than you need. More data can mean much longer running times for algorithms and larger computational and memory requirements. You can take a smaller representative sample of the selected data, which may be much faster for exploring and prototyping solutions before considering the whole dataset.

It is very likely that the machine learning tools you use will influence the preprocessing you perform, and you will probably revisit this step several times.
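The three preprocessing steps above can be sketched on a toy dataset with pandas. This is a hedged example: the CSV content, column names, and sample size are all invented for illustration.

```python
from io import StringIO

import pandas as pd

# Formatting: assume the raw data arrived as CSV text; parse it into a table.
raw_csv = "user,amount\nana,10\nben,\ncara,7\nana,3\nben,5\ncara,2\n"
df = pd.read_csv(StringIO(raw_csv))

# Cleaning: drop incomplete instances (one row has a missing amount).
clean = df.dropna(subset=["amount"])

# Sampling: take a smaller random subset for fast exploration and prototyping.
sample = clean.sample(n=3, random_state=0)
```

In practice the formatting step might instead be a database export or a file-format conversion, and cleaning often involves imputation rather than row removal; the shape of the workflow is the same.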


Step 3: Transform Data

The final step is to transform the preprocessed data. The specific algorithm you are working with and your knowledge of the problem domain will influence this step, and you will likely need to revisit different transformations of your preprocessed data as you work on the problem.

Three common data transformations are scaling, attribute decomposition, and attribute aggregation. This step is also called feature engineering.

Scaling: The preprocessed data may contain attributes with a mixture of scales for different quantities, such as dollars, kilograms, and sales volumes. Many machine learning methods prefer data attributes to share a common scale, such as between 0 and 1 for the smallest and largest value of a given feature. Consider any feature scaling you may need to perform.
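Min-max scaling, the 0-to-1 rescaling described above, can be written in a few lines of NumPy. A minimal sketch with made-up dollar values:

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array so its smallest value maps to 0 and largest to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

dollars = [100.0, 250.0, 400.0]  # hypothetical feature on a dollar scale
scaled = min_max_scale(dollars)  # smallest -> 0.0, largest -> 1.0
```

Libraries such as scikit-learn provide the same transformation (e.g. a `MinMaxScaler`) with the added convenience of remembering the fitted min and max so new data can be scaled consistently.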

Decomposition: There may be features that represent a complex concept that when broken down into component parts may be more useful to the machine learning method. An example is a date which may have day and time components which in turn may be further split. Perhaps only the time of day is relevant to the solution of the problem. Consider what feature decomposition you can do.
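The date example above can be sketched with pandas: a single timestamp column is decomposed into simpler component features. The timestamps and derived column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-03-01 09:15", "2022-03-02 18:40"]),
})

# Decompose the complex datetime into component features the model can use.
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 .. Sunday=6
df["hour"] = df["timestamp"].dt.hour
```

If only the time of day matters to the problem, the original `timestamp` column could then be dropped entirely.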

Aggregation: There may be features that can be aggregated into a single feature that is more meaningful to the problem you are trying to solve. For example, each time a customer logs into the system there may be a data instance; these can be aggregated into a count of logins, allowing the individual instances to be discarded. Consider what kind of feature aggregation you could perform.
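The login example above maps directly onto a pandas group-by: one row per login event becomes one count per customer. The customer names are, of course, invented.

```python
import pandas as pd

# One row per login event (the raw, instance-level data).
logins = pd.DataFrame({
    "customer": ["ana", "ben", "ana", "ana", "ben"],
})

# Aggregate the events into a single count feature per customer,
# after which the individual event rows can be discarded.
login_counts = logins.groupby("customer").size().rename("n_logins")
```

The resulting `n_logins` series has one row per customer, ready to be joined back onto a customer-level feature table.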

You can spend a lot of time engineering features from your data, and this can be very beneficial to algorithm performance. Start small and build on the skills you learn.


