Log Link vs Log Transformation in R — The Difference that Misleads Your Entire Data Analysis


Normal distributions are among the most commonly used, but much real-world data, unfortunately, is not normal. When faced with extremely skewed data, it's tempting to apply a log transformation to normalize the distribution and stabilize the variance. I recently worked on a project analyzing the energy consumption of training AI models, using data from Epoch AI [1]. There is no official data on the energy usage of each model, so I estimated it by multiplying each model's power draw by its training time. The new variable, Energy (in kWh), was highly right-skewed, with some extreme, overdispersed outliers (Fig. 1).
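A rough sketch of that calculation is below; Power_draw_W is a hypothetical column name used for illustration only, and the actual fields in the Epoch AI dataset may differ:

# Energy (kWh) = power draw (W) x training time (h) / 1000
# Power_draw_W is a hypothetical column name, assumed to be in watts
df$Energy_kWh <- df$Power_draw_W * df$Training_time_hour / 1000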

Figure 1. Histogram of Energy Consumption (kWh)

To deal with this skewness and heteroskedasticity, my first instinct was to apply a log transformation to the Energy variable. The distribution of log(Energy) looked much more normal (Fig. 2), and a Shapiro-Wilk test failed to reject normality (p ≈ 0.5).
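A quick sketch of that check, assuming the data frame and column name used throughout this post:

# Inspect the log-scale distribution and test for normality
hist(log(df$Energy_kWh), breaks = 30, main = "log(Energy_kWh)")
shapiro.test(log(df$Energy_kWh))  # here, p ~ 0.5: no evidence against normality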

Figure 2. Histogram of log of Energy Consumption (kWh)

Modeling Dilemma: Log Transformation vs Log Link

The visualization looked good, but when I moved on to modeling, I faced a dilemma: should I model the log-transformed response variable (log(Y) ~ X), or should I model the original response variable with a log link function (Y ~ X with link = "log")? I also considered two distributions, Gaussian (normal) and Gamma, and crossed each distribution with each log approach. This gave me four different models, shown below, all fitted with R's glm():

# 1) Gaussian family, log link: models log(E[Energy]) on the original response
all_gaussian_log_link <- glm(
  Energy_kWh ~ Parameters + Training_compute_FLOP + Training_dataset_size +
    Training_time_hour + Hardware_quantity + Training_hardware,
  family = gaussian(link = "log"), data = df
)

# 2) Gaussian family, log-transformed response: models E[log(Energy)]
all_gaussian_log_transform <- glm(
  log(Energy_kWh) ~ Parameters + Training_compute_FLOP + Training_dataset_size +
    Training_time_hour + Hardware_quantity + Training_hardware,
  data = df
)

# 3) Gamma family, log link (no global intercept, hence + 0)
all_gamma_log_link <- glm(
  Energy_kWh ~ Parameters + Training_compute_FLOP + Training_dataset_size +
    Training_time_hour + Hardware_quantity + Training_hardware + 0,
  family = Gamma(link = "log"), data = df
)

# 4) Gamma family (default inverse link), log-transformed response
all_gamma_log_transform <- glm(
  log(Energy_kWh) ~ Parameters + Training_compute_FLOP + Training_dataset_size +
    Training_time_hour + Hardware_quantity + Training_hardware + 0,
  family = Gamma(), data = df
)

Model Comparison: AIC and Diagnostic Plots

I compared the four models using the Akaike Information Criterion (AIC), an estimator of prediction error. Typically, the lower the AIC, the better the model fits.

AIC(all_gaussian_log_link, all_gaussian_log_transform, all_gamma_log_link, all_gamma_log_transform)

                           df       AIC
all_gaussian_log_link      25 2005.8263
all_gaussian_log_transform 25  311.5963
all_gamma_log_link         25 1780.8524
all_gamma_log_transform    25  352.5450

Among the four models, the ones with a log-transformed outcome have much lower AIC values than the ones with a log link. Since the difference in AIC between the log-transformed and log-link models was substantial (311 and 352 versus 1780 and 2005), I also examined the diagnostic plots to further validate that the log-transformed models fit better. (In hindsight, these AIC values are not directly comparable in the first place: the log-transformed models fit a different response variable, log(Y), than the log-link models, which fit Y.)

Figure 4. Diagnostic plots for the log-linked Gaussian model. The Residuals vs Fitted plot suggests linearity despite a few outliers. However, the Q-Q plot shows noticeable deviations from the theoretical line, suggesting non-normality.
Figure 5. Diagnostic plots for the log-transformed Gaussian model. The Q-Q plot shows a much better fit, supporting normality. However, the Residuals vs Fitted plot has a dip to -2, which may suggest non-linearity.
Figure 6. Diagnostic plots for the log-linked Gamma model. The Q-Q plot looks acceptable, yet the Residuals vs Fitted plot shows clear signs of non-linearity.
Figure 7. Diagnostic plots for the log-transformed Gamma model. The Residuals vs Fitted plot looks good, with a small dip to -0.25 at the start. However, the Q-Q plot shows some deviation at both tails.

Based on the AIC values and diagnostic plots, I decided to move forward with the log-transformed Gamma model, since it had the second-lowest AIC value and its Residuals vs Fitted plot looked better than that of the log-transformed Gaussian model.
I then explored which explanatory variables were useful and which interactions were significant. The final model I selected was:

glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity + 
    Training_hardware + 0, family = Gamma(), data = df)

Interpreting Coefficients

However, when I started interpreting the model's coefficients, something felt off. Since only the response variable was log-transformed, the effects of the predictors are multiplicative, and we need to exponentiate the coefficients to convert them back to the original scale. A one-unit increase in x multiplies the outcome y by exp(β); equivalently, each additional unit in x leads to a (exp(β) − 1) × 100% change in y [2].
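For instance, with a hypothetical coefficient of β = 0.05:

beta <- 0.05
exp(beta)               # multiplicative effect on y: ~1.051
(exp(beta) - 1) * 100   # ~5.1% increase in y per one-unit increase in x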

Looking at the model's results table below, Training_time_hour, Hardware_quantity, and their interaction term are continuous variables, so their coefficients represent slopes. Meanwhile, since I specified +0 in the model formula, all levels of the categorical variable Training_hardware act as intercepts, meaning that each hardware type serves as the intercept β₀ when its corresponding dummy variable is active.

> glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity + 
    Training_hardware + 0, family = Gamma(), data = df)

Coefficients:
                                                 Estimate Std. Error t value Pr(>|t|)    
Training_time_hour                             -1.587e-05  3.112e-06  -5.098 5.76e-06 ***
Hardware_quantity                              -5.121e-06  1.564e-06  -3.275  0.00196 ** 
Training_hardwareGoogle TPU v2                  1.396e-01  2.297e-02   6.079 1.90e-07 ***
Training_hardwareGoogle TPU v3                  1.106e-01  7.048e-03  15.696  < 2e-16 ***
Training_hardwareGoogle TPU v4                  9.957e-02  7.939e-03  12.542  < 2e-16 ***
Training_hardwareHuawei Ascend 910              1.112e-01  1.862e-02   5.969 2.79e-07 ***
Training_hardwareNVIDIA A100                    1.077e-01  6.993e-03  15.409  < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB         1.020e-01  1.072e-02   9.515 1.26e-12 ***
Training_hardwareNVIDIA A100 SXM4 80 GB         1.014e-01  1.018e-02   9.958 2.90e-13 ***
Training_hardwareNVIDIA GeForce GTX 285         3.202e-01  7.491e-02   4.275 9.03e-05 ***
Training_hardwareNVIDIA GeForce GTX TITAN X     1.601e-01  2.630e-02   6.088 1.84e-07 ***
Training_hardwareNVIDIA GTX Titan Black         1.498e-01  3.328e-02   4.501 4.31e-05 ***
Training_hardwareNVIDIA H100 SXM5 80GB          9.736e-02  9.840e-03   9.894 3.59e-13 ***
Training_hardwareNVIDIA P100                    1.604e-01  1.922e-02   8.342 6.73e-11 ***
Training_hardwareNVIDIA Quadro P600             1.714e-01  3.756e-02   4.562 3.52e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000         1.538e-01  3.263e-02   4.714 2.12e-05 ***
Training_hardwareNVIDIA Quadro RTX 5000         1.819e-01  4.021e-02   4.524 3.99e-05 ***
Training_hardwareNVIDIA Tesla K80               1.125e-01  1.608e-02   6.993 7.54e-09 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB   1.072e-01  1.353e-02   7.922 2.89e-10 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB  9.444e-02  2.030e-02   4.653 2.60e-05 ***
Training_hardwareNVIDIA V100                    1.420e-01  1.201e-02  11.822 8.01e-16 ***
Training_time_hour:Hardware_quantity            2.296e-09  9.372e-10   2.450  0.01799 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.05497984)

    Null deviance:    NaN  on 70  degrees of freedom
Residual deviance: 3.0043  on 48  degrees of freedom
AIC: 345.39

When converting the slopes to a percent change in the response variable, the effect of each continuous variable was almost zero, even slightly negative:
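Plugging the printed estimates into the conversion, as a quick back-of-the-envelope check:

(exp(-1.587e-05) - 1) * 100   # Training_time_hour: ~ -0.0016% per hour
(exp(-5.121e-06) - 1) * 100   # Hardware_quantity:  ~ -0.0005% per chip
exp(1.396e-01)                # e.g., Google TPU v2 intercept: ~1.15 kWh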

All the intercepts also converted back to only around 1 kWh on the original scale. The results didn't make any sense, since at least one of the slopes should grow along with the huge energy consumption. I wondered whether a log-linked model with the same predictors might yield different results, so I fit the model again:

glm(formula = Energy_kWh ~ Training_time_hour * Hardware_quantity + 
    Training_hardware + 0, family = Gamma(link = "log"), data = df)

Coefficients:
                                                 Estimate Std. Error t value Pr(>|t|)    
Training_time_hour                              1.818e-03  1.640e-04  11.088 7.74e-15 ***
Hardware_quantity                               7.373e-04  1.008e-04   7.315 2.42e-09 ***
Training_hardwareGoogle TPU v2                  7.136e+00  7.379e-01   9.670 7.51e-13 ***
Training_hardwareGoogle TPU v3                  1.004e+01  3.156e-01  31.808  < 2e-16 ***
Training_hardwareGoogle TPU v4                  1.014e+01  4.220e-01  24.035  < 2e-16 ***
Training_hardwareHuawei Ascend 910              9.231e+00  1.108e+00   8.331 6.98e-11 ***
Training_hardwareNVIDIA A100                    1.028e+01  3.301e-01  31.144  < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 40 GB         1.057e+01  5.635e-01  18.761  < 2e-16 ***
Training_hardwareNVIDIA A100 SXM4 80 GB         1.093e+01  5.751e-01  19.005  < 2e-16 ***
Training_hardwareNVIDIA GeForce GTX 285         3.042e+00  1.043e+00   2.916  0.00538 ** 
Training_hardwareNVIDIA GeForce GTX TITAN X     6.322e+00  7.379e-01   8.568 3.09e-11 ***
Training_hardwareNVIDIA GTX Titan Black         6.135e+00  1.047e+00   5.862 4.07e-07 ***
Training_hardwareNVIDIA H100 SXM5 80GB          1.115e+01  6.614e-01  16.865  < 2e-16 ***
Training_hardwareNVIDIA P100                    5.715e+00  6.864e-01   8.326 7.12e-11 ***
Training_hardwareNVIDIA Quadro P600             4.940e+00  1.050e+00   4.705 2.18e-05 ***
Training_hardwareNVIDIA Quadro RTX 4000         5.469e+00  1.055e+00   5.184 4.30e-06 ***
Training_hardwareNVIDIA Quadro RTX 5000         4.617e+00  1.049e+00   4.401 5.98e-05 ***
Training_hardwareNVIDIA Tesla K80               8.631e+00  7.587e-01  11.376 3.16e-15 ***
Training_hardwareNVIDIA Tesla V100 DGXS 32 GB   9.994e+00  6.920e-01  14.443  < 2e-16 ***
Training_hardwareNVIDIA Tesla V100S PCIe 32 GB  1.058e+01  1.047e+00  10.105 1.80e-13 ***
Training_hardwareNVIDIA V100                    9.208e+00  3.998e-01  23.030  < 2e-16 ***
Training_time_hour:Hardware_quantity           -2.651e-07  6.130e-08  -4.324 7.70e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1.088522)

    Null deviance: 2.7045e+08  on 70  degrees of freedom
Residual deviance: 1.0593e+02  on 48  degrees of freedom
AIC: 1775

This time, Training_time_hour and Hardware_quantity would increase the total energy consumption by about 0.18% per additional hour and 0.07% per additional chip, respectively. Meanwhile, their interaction would decrease energy use by a tiny 2.7 × 10⁻⁵%. These results made more sense, as Training_time_hour can reach up to 7,000 hours and Hardware_quantity up to 16,000 units.
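The same back-of-the-envelope conversion on the log-link estimates shows how small per-unit percentages compound into very large multiplicative effects over the observed ranges:

(exp(1.818e-03) - 1) * 100    # Training_time_hour: ~ +0.18% per hour
(exp(7.373e-04) - 1) * 100    # Hardware_quantity:  ~ +0.07% per chip
(exp(-2.651e-07) - 1) * 100   # interaction: ~ -2.7e-05% per unit
exp(1.818e-03 * 7000)         # compounded over 7,000 hours: ~340,000-fold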

To visualize the differences better, I created two plots comparing the predictions (shown as dashed lines) from both models. The left panel uses the log-transformed Gamma GLM, whose dashed lines are nearly flat and close to zero, nowhere near the fitted solid lines of the raw data. The right panel uses the log-linked Gamma GLM, whose dashed lines align much more closely with the actual fitted lines.

library(dplyr)
library(ggplot2)
library(patchwork)  # for combining p1 + p2

# glm3 = the final log-transformed Gamma model; glm3_alt = its log-link counterpart
test_data <- df[, c("Training_time_hour", "Hardware_quantity", "Training_hardware")]
prediction_data <- df %>%
  mutate(
    pred_energy1 = exp(predict(glm3, newdata = test_data)),            # back-transformed
    pred_energy2 = predict(glm3_alt, newdata = test_data, type = "response")
  )
y_limits <- c(min(df$Energy_kWh, prediction_data$pred_energy1, prediction_data$pred_energy2),
              max(df$Energy_kWh, prediction_data$pred_energy1, prediction_data$pred_energy2))

p1 <- ggplot(df, aes(x = Hardware_quantity, y = Energy_kWh, color = Training_time_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = prediction_data, aes(y = pred_energy1), method = "lm", se = FALSE,
              linetype = "dashed", linewidth = 1) +
  scale_y_log10(limits = y_limits) +
  labs(x = "Hardware Quantity", y = "log of Energy (kWh)") +
  theme_minimal() +
  theme(legend.position = "none")
p2 <- ggplot(df, aes(x = Hardware_quantity, y = Energy_kWh, color = Training_time_group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = prediction_data, aes(y = pred_energy2), method = "lm", se = FALSE,
              linetype = "dashed", linewidth = 1) +
  scale_y_log10(limits = y_limits) +
  labs(x = "Hardware Quantity", color = "Training Time Level") +
  theme_minimal() +
  theme(axis.title.y = element_blank())
p1 + p2

Why Log Transformation Fails

To understand why the log-transformed model fails to capture the underlying effects the way the log-linked one does, let's walk through what happens when we apply a log transformation to the response variable.

Let's say Y equals some function of X plus an error term:

Y = f(X) + ε

When we apply a log transformation to Y, we are actually compressing both f(X) and the error together:

log(Y) = log(f(X) + ε)

This means we are modeling an entirely new response variable, log(Y). When we plug in our own function g(X), the model's linear predictor in my case, it is trying to capture the combined effects of both the "shrunken" f(X) and the error term:

g(X) ≈ log(f(X) + ε)

In contrast, when we use a log link, we are still modeling the original Y, not a transformed version. Instead, the model exponentiates our own function g(X) to predict Y:

Ŷ = exp(g(X))

The model then minimizes the difference between the actual Y and the predicted Ŷ. That way, the error term stays intact on the original scale:

Y = exp(g(X)) + ε
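A minimal simulation makes the consequence concrete. This is a toy example with made-up lognormal data, not the Epoch AI dataset: back-transformed predictions from the log-transformed fit systematically underestimate the original-scale mean (Jensen's inequality), while the log-link fit targets that mean directly.

set.seed(42)
x <- runif(500, 0, 10)
y <- exp(0.5 * x + rnorm(500))                # positive, right-skewed response

m_transform <- glm(log(y) ~ x)                          # models E[log(Y)]
m_link      <- glm(y ~ x, family = Gamma(link = "log")) # models log(E[Y])

mean(y)                                    # the target on the original scale
mean(exp(predict(m_transform)))            # underestimates the mean by ~40% here
mean(predict(m_link, type = "response"))   # tracks the mean much more closely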

Conclusion

Log-transforming a variable is not the same as using a log link, and it may not always yield reliable results. Under the hood, a log transformation alters the variable itself and distorts both the variation and the noise. Understanding this subtle mathematical difference behind your models is just as important as searching for the best-fitting model.


[1] Epoch AI. Notable AI Models. Retrieved from https://epoch.ai/data/notable-ai-models

[2] University of Virginia Library. Interpreting Log Transformations in a Linear Model. Retrieved from https://library.virginia.edu/data/articles/interpreting-log-transformations-in-a-linear-model
