2. 10 FEATURE ENCODING TECHNIQUES EVERY DATA SCIENTIST MUST KNOW
1- LABEL ENCODING
Label encoding is intuitive and easy to understand. It converts labels into numeric form so that they become machine-readable. Machine learning algorithms can then better decide how those labels should be handled. It is an important pre-processing step for structured datasets in supervised learning.
Example:
Suppose we have a column Height in some dataset with the values tall, medium and short. Label encoding maps these to integers, where 0 is the label for tall, 1 is the label for medium and 2 is the label for short height.
Limitation of label Encoding
Label encoding converts the data into machine-readable form, but it assigns a unique number (starting from 0) to each class of data. This may introduce spurious ordering during training: a label with a higher value may be treated as having higher priority than a label with a lower value, even when no such order exists in the data.
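The Height example above can be sketched with pandas; a minimal illustration (the column and values are the article's hypothetical example, and pandas assigns codes in alphabetical order rather than by meaning):

```python
import pandas as pd

# Hypothetical toy dataset with a categorical Height column.
df = pd.DataFrame({"Height": ["tall", "medium", "short", "medium", "tall"]})

# Label encoding: map each unique class to an integer code.
# pandas' categorical codes behave like sklearn's LabelEncoder here,
# assigning codes alphabetically: medium=0, short=1, tall=2.
df["Height_encoded"] = df["Height"].astype("category").cat.codes

print(df)
```

Note that "tall" receives the largest code purely because of alphabetical order, which is exactly the kind of arbitrary ordering the limitation above warns about.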
2- ONE-HOT ENCODING
Sometimes in datasets we encounter columns that contain categorical features (string values); for example, a Gender column will have categorical values like Male and Female. These labels have no specific order of preference, and since the data consists of string labels, a machine learning model cannot work on it directly.
One approach to this problem is label encoding, where we assign a numerical value to each label, for example mapping Male and Female to 0 and 1. But this can add bias to our model, as it will start giving higher preference to the Female label because 1 > 0, when ideally both labels are equally important. To deal with this issue we use the One-Hot Encoding technique.
It's not that one-hot encoding is outright better than label encoding; like many things in machine learning, it isn't the right choice in every situation. It simply fixes the spurious-ordering problem you encounter when label encoding is applied to nominal categorical data.
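The Gender example above can be one-hot encoded in a single call; a minimal sketch using pandas (the values are the article's hypothetical example):

```python
import pandas as pd

# Hypothetical Gender column from the example above.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# One-hot encoding: one binary column per category, no implied order.
one_hot = pd.get_dummies(df["Gender"], prefix="Gender")
print(one_hot)
```

Each row now has a 1 in exactly one of the Gender columns, so neither category is numerically "greater" than the other.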
3- BINARY ENCODING
In the context of feature encoding, binary encoding first converts each category to an ordinal integer and then writes that integer in binary, storing each binary digit in its own column. A feature with n categories therefore needs only about log2(n) columns, compared with n columns for one-hot encoding, which makes binary encoding attractive for high-cardinality categorical features. It can be seen as a compromise between ordinal encoding (compact but order-implying) and one-hot encoding (order-free but wide).
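A minimal sketch of binary encoding done by hand (the colour categories are illustrative; in practice a library such as category_encoders' BinaryEncoder does this for you):

```python
import math
import pandas as pd

# Hypothetical colour feature with four categories.
categories = ["red", "green", "blue", "yellow"]
df = pd.DataFrame({"Colour": ["red", "blue", "yellow", "green"]})

# Step 1: ordinal-encode each category to an integer (1-based here).
codes = {cat: i + 1 for i, cat in enumerate(categories)}
n_bits = math.ceil(math.log2(len(categories) + 1))  # 3 bits for codes 1..4

# Step 2: split each integer into its binary digits, one column per bit
# (bit 0 is the most significant bit).
for bit in range(n_bits):
    df[f"Colour_bit{bit}"] = df["Colour"].map(codes).apply(
        lambda x: (x >> (n_bits - 1 - bit)) & 1
    )
print(df)
```

Four categories fit in three bit-columns here, instead of the four columns one-hot encoding would need; the saving grows quickly as cardinality increases.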
4- HASH ENCODING
Hashing involves computing a fixed-length mathematical summary of data; the input data can be any size. In contrast to encoding, hashing cannot be reversed: it is not possible to take a hash and convert it back to the original data. Hashing is commonly used to verify the integrity of data, often referred to as a checksum. If two pieces of identical data are hashed using the same hash function, the resulting hashes will be identical. If the two pieces of data are different, the resulting hashes will almost certainly differ.
As an example, say Sachin wants to send Dhoni a file and verify that Dhoni receives the exact same file, with no changes occurring in transfer. Sachin emails Dhoni the file along with a hash of the file. After Dhoni downloads the file, he can verify it is identical by computing the hash of the file and checking that the result matches the hash Sachin provided.
An example of a hash function is SHA-512.
In addition to verifying the integrity of data, hashing is the recommended data transformation technique in authentication processes for computer systems and applications. It is recommended to never store passwords and instead store only the hash of the "salted password". A salt is a random string appended to a password that only the authentication system knows; this guarantees that if two users have the same password, the stored hashes are different.
When a user inputs a password into a web application, the password is sent to the web server. The web server appends the salt to the password, performs a hash function on the salted password, and compares the output hash with the hash stored in the database for that user. If the hashes match, the user is granted access. Hashing ensures that, in the event of a breach or a malicious insider, the original passwords can never be retrieved. Salting ensures that, if a breach does occur, an attacker cannot determine which users have the same passwords.
In Summary:
Encoding: Reversible transformation of data format, used to preserve the usability of data.
Hashing: This is a one-way summary of data, cannot be reversed, used to validate the integrity of data.
Encryption: Secure encoding of data, used to protect the confidentiality of data.
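Applied to features, hash encoding (the "hashing trick") uses a hash function to map each category into one of a fixed number of buckets, so the number of output columns stays bounded no matter how many categories appear. A minimal sketch using a stable cryptographic digest (the City values and bucket count are illustrative; sklearn's FeatureHasher is a production-grade alternative):

```python
import hashlib
import pandas as pd

N_BUCKETS = 8  # fixed output dimensionality, chosen up front

def hash_bucket(category: str, n_buckets: int = N_BUCKETS) -> int:
    # Python's built-in hash() is salted per process, so a
    # cryptographic digest is used here for reproducibility.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

df = pd.DataFrame({"City": ["Mumbai", "Chennai", "Delhi", "Mumbai"]})
df["City_hashed"] = df["City"].apply(hash_bucket)
print(df)
```

Identical categories always land in the same bucket, but unrelated categories may collide in one bucket; that loss of information is the price paid for a fixed number of columns.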
5- TARGET/MEAN ENCODING
From a mathematical point of view, mean encoding represents a probability of your target variable, conditional on each value of the feature. In a way, it embodies the target variable in its encoded value. Target encoding is where you average the target value by category.
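Averaging the target by category can be sketched with a pandas groupby; a minimal illustration with a hypothetical binary target (in practice you would compute the means on training folds only, to avoid target leakage):

```python
import pandas as pd

# Hypothetical data: mean-encode City using the binary target.
df = pd.DataFrame({
    "City":   ["A", "A", "B", "B", "B", "C"],
    "target": [ 1,   0,   1,   1,   0,   1 ],
})

# Replace each category with the mean of the target within that category,
# i.e. an estimate of P(target = 1 | City).
means = df.groupby("City")["target"].mean()
df["City_encoded"] = df["City"].map(means)
print(df)
```

City A becomes 0.5, B becomes 2/3 and C becomes 1.0, directly embodying the conditional probability described above.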
6- DUMMY ENCODING
Dummy coding provides one way of using categorical predictor variables in various kinds of estimation models, such as linear regression. Dummy coding uses only ones and zeros to convey all of the necessary information on group membership.
There are two related ways to encode such categorical variables. Say a categorical variable has n values: one-hot encoding converts it into n variables, while dummy encoding converts it into n-1 variables. If we have k categorical variables, each with n values, one-hot encoding ends up with kn variables, while dummy encoding ends up with kn-k variables.
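The n-1 behaviour can be sketched with pandas' drop_first option; a minimal illustration with a hypothetical three-category feature:

```python
import pandas as pd

df = pd.DataFrame({"Colour": ["red", "green", "blue", "green"]})

# Dummy encoding: n - 1 columns. drop_first drops one category,
# which becomes the implicit baseline (a row of all zeros).
dummies = pd.get_dummies(df["Colour"], prefix="Colour", drop_first=True)
print(dummies)
```

With three categories we get two columns; a row of all zeros unambiguously identifies the dropped baseline category, which is why the extra column in one-hot encoding is redundant for models like linear regression.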
7- EFFECT ENCODING
Effect coding provides one way of using categorical predictor variables in various kinds of estimation models, such as linear regression. This encoding technique is also known as Deviation Encoding or Sum Encoding. Effect encoding is similar to dummy encoding, with one small difference: where dummy coding uses only 0 and 1 to represent the data, effect encoding uses three values, namely 1, 0 and -1.
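A minimal sketch of effect encoding built on top of dummy encoding (the colour categories are illustrative; "blue" is the baseline only because drop_first drops the alphabetically first category):

```python
import pandas as pd

df = pd.DataFrame({"Colour": ["red", "green", "blue", "green"]})

# Start from dummy encoding (n - 1 columns, one category dropped)...
effect = pd.get_dummies(df["Colour"], prefix="Colour",
                        drop_first=True).astype(int)

# ...then recode the dropped (baseline) category's rows as -1 everywhere.
baseline_rows = df["Colour"] == "blue"
effect.loc[baseline_rows, :] = -1
print(effect)
```

The -1 rows are what distinguish effect coding from dummy coding: in a regression, coefficients are then interpreted as deviations from the grand mean rather than from the baseline category.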
8- BASE N ENCODING
The Base-N encoder encodes the categories into arrays of their base-N digits. A base of 1 is equivalent to one-hot encoding (not truly base 1, but useful), a base of 2 is equivalent to binary encoding, and setting N to the number of actual categories is equivalent to vanilla ordinal encoding.
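A minimal hand-rolled sketch of the idea (the categories are illustrative; category_encoders' BaseNEncoder is the usual library implementation). With base 2 it reproduces binary encoding:

```python
import math
import pandas as pd

def base_n_encode(series: pd.Series, base: int) -> pd.DataFrame:
    """Sketch of Base-N encoding: ordinal-encode, then write each
    code in base `base`, one column per digit (most significant first)."""
    codes = series.astype("category").cat.codes + 1  # 1-based ordinal codes
    n_digits = max(1, math.ceil(math.log(codes.max() + 1, base)))
    out = {}
    for d in range(n_digits):
        power = base ** (n_digits - 1 - d)
        out[f"{series.name}_{d}"] = (codes // power) % base
    return pd.DataFrame(out)

df = pd.DataFrame({"Colour": ["red", "green", "blue", "yellow"]})
print(base_n_encode(df["Colour"], base=2))
```

Raising the base trades fewer columns for more distinct values per column, interpolating between one-hot-like width and ordinal-like compactness.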
9- ORDINAL ENCODING
When categorical features in the dataset contain variables with an intrinsic natural order, such as Low, Medium and High, these must be encoded differently from nominal variables (where there is no intrinsic order, e.g. Male or Female). This can be achieved in PyCaret using the ordinal features parameter within setup, which accepts a dictionary of feature names and their levels in increasing order from lowest to highest.
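Outside PyCaret, the same idea is a simple explicit mapping; a minimal sketch with a hypothetical Priority column (the dictionary plays the same role as the ordered levels passed to PyCaret's setup):

```python
import pandas as pd

# Ordinal encoding: an explicit mapping that preserves the natural order.
order = {"Low": 0, "Medium": 1, "High": 2}

df = pd.DataFrame({"Priority": ["Low", "High", "Medium", "High"]})
df["Priority_encoded"] = df["Priority"].map(order)
print(df)
```

Because the analyst supplies the order, the numeric codes carry real meaning here, unlike the arbitrary alphabetical codes produced by plain label encoding.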
10-FREQUENCY ENCODING
Frequency encoding replaces each category with how often it occurs in the dataset, either as a raw count or as a relative frequency. It captures how common a category is, which can itself be predictive, but note that two different categories appearing equally often become indistinguishable after encoding.
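A minimal sketch using pandas (the City values are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Delhi", "Mumbai"]}
)

# Frequency encoding: replace each category with its relative frequency.
freq = df["City"].value_counts(normalize=True)
df["City_freq"] = df["City"].map(freq)
print(df)
```

Delhi appears in half the rows, so it encodes to 0.5; Chennai, appearing once in six rows, encodes to 1/6.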