Link Search Menu Expand Document

Transformations Parameters

Solutions

Transformation Type

Transformation Use Case Equation / Calculation Method
Log Transformation Stabilizing variance, normalizing data, useful for data with exponential changes (e.g., economic data). y = log(x)
Square Root Transformation Stabilizing variance, normalizing data, suitable for data sets with non-negative values (counts, areas). y = √x
Box-Cox Transformation Generalized form for stabilizing variance, making data more normal-like, applicable in various scenarios. y(λ) = (x^λ -1)/λ, for x > 0 and λ ≠ 0
Z-Score/Standard Score Standardizing data to have mean of 0 and standard deviation of 1, used in outlier detection and normalization. z = (x-µ)/σ
Min-Max Scaling Scaling data to a fixed range (0-1), useful in scale-sensitive algorithms (neural networks, k-NN). Xscaled = (X - Xmin) / (Xmax - Xmin)
Normalization (L1, L2 norms) Scaling individual samples, essential for algorithms needing comparable scales (support vector machines).  
Difference Transformation Making time-series data stationary by subtracting previous observation from the current one. Δxt = xt - xt-1
Categorical Encoding Converting categorical data into a numeric format for use in mathematical models (One-Hot, Label Encoding). One-Hot: Binary vectors, Label: Integer encoding
Binning/Discretization Transforming continuous variables into discrete bins, simplifying complex relationships in data. Interval: Equal-Width Quantile: Equal Frequency Count Tree-Based: Decision Tree

Encoding Type

Type Data Type Details
Label Encoding Ordinal data Unique labels with Equal interval
Ordinal Encoding Ordinal data Same as label encoding with ordered data
Target-guided Encoding Ordinal data Equal interval
Polynomial Encoding Ordinal data Non-Equal interval
Helmert Encoding Ordinal data Non-Equal interval
Sum/ Count Encoding Ordinal data Contains count or sum for each category, Non-Equal interval
Backward Different Encoding Ordinal data Non-Equal interval
One-hot Encoding Nominal data Convert all categories into each column with 0 1, < 15 Cardinality, not suitable for decision-tree based algorithm
Dummy Encoding Nominal data Same as One-hot encoding with drop one feature (randomly)
Effect Encoding Nominal data Same as Dummy encoding with alter row with zeros to all -1
Mean Encoding Nominal data  
Binary Encoding Ordinal / Nominal data converts data into more several columns in binary labels, each colums contain 0 and 1 to label each category in binary, Some info loss acceptable for lower dimesionality
BaseN Encoding Ordinal / Nominal data  
Feature Hashing Encoding Ordinal / Nominal data Some info loss acceptable for lower dimesionality
Frequency Encoding Ordinal / Nominal data  
LeaveOneOut Encoding Ordinal / Nominal data info loss is not acceptable, the encoder can handle overfitting / response leakage
Target Encoding Ordinal / Nominal data info loss is not acceptable, the encoder can’t handle overfitting / response leakage
Weights of Evidence Encoding Ordinal / Nominal data info loss is not acceptable, the encoder can’t handle overfitting / response leakage
James-Stein Encoding Ordinal / Nominal data info loss is not acceptable, the encoder can’t handle overfitting / response leakage
M-Estimator Encoding Ordinal / Nominal data info loss is not acceptable, the encoder can’t handle overfitting / response leakage
Generalzied Linear Mixed Model Encoding Ordinal / Nominal data  
CatBoost Encoding Ordinal / Nominal data  
RareLabel Encoding Ordinal / Nominal data  

Handling Missing Data

Type Type Type Details
Common data Deletion Deletion Row-wise Deletion
Common data Deletion Deletion Columns-wise Deletion
Time-Series data Imputation Mean, Median, Mode, Random Sample Imputation Data without Trend and without seasonality
Time-Series data Imputation Linear Interpolation Data with Trend and without seasonality
Time-Series data Imputation Seasonal adjustment + interpolation Data with Trend and with seasonality
Common data Imputation Make NA as level Categorical
Common data Imputation Logistic Regression Categorical
Common data Imputation Mean, Median, Mode, Linear Regression Numerical