Regression Models and Analysis of Variance

Author

Michał Janiga

1 Analysis of Variance (ANOVA) for Exchange Rates

For the closing rate of the Euro (EURPLN):
- Draw box plots illustrating the exchange rate distributions for the last six months of 2025 (separately for each month from July to December 2025).
- Using the analysis of variance method, verify the hypothesis of equality of the mean exchange rates across these six months. Comment on the obtained results.
- Perform a post-hoc analysis of deviations from the mean using Tukey’s and Bonferroni’s methods. Comment on the obtained results.
- Conduct tests comparing the mean exchange rates across these six months without assuming normality of their distributions. Comment on the obtained results.

1.1 Data Preparation and Retrieval

show/hide
# Helper function to load data from stooq.pl
read_rates <- function(symbol, dateStart, dateEnd) {
  if (file.exists(".env")) readRenviron(".env")
  apiKey <- Sys.getenv("STOOQ_API_KEY")
  
  # Formatting dates to the YYYYMMDD format required by the stooq API
  d1 <- format(dateStart, "%Y%m%d")
  d2 <- format(dateEnd, "%Y%m%d")
  
  webLink <- paste0('https://stooq.pl/q/d/l/?s=', symbol, 
                    '&i=d&d1=', d1, '&d2=', d2, '&apikey=', apiKey)
  
  fileName <- paste0('data/', symbol, '_', d1, '_', d2, '.csv')
  if (!file.exists(fileName)) {
    download.file(webLink, fileName, quiet = TRUE)
  }
  
  df_symbol <- read.csv(fileName)
  # Map Polish column headers to English for consistency
  names(df_symbol)[names(df_symbol) == "Data"] <- "Date"
  names(df_symbol)[names(df_symbol) == "Zamkniecie"] <- "Close"
  df_symbol$Date <- as.Date(df_symbol$Date)
  
  # Optional local filtering for certainty
  df_symbol <- df_symbol[which(df_symbol$Date >= dateStart & df_symbol$Date <= dateEnd),]
  
  df_symbol <- subset(df_symbol, select = c(Date, Close))
  names(df_symbol) <- c('Date', symbol)
  df_symbol
}

# Fetching data for EURPLN (July - December 2025)
dateStart <- as.Date("2025-07-01")
dateEnd <- as.Date("2025-12-31")
df_eur <- read_rates('EURPLN', dateStart, dateEnd)

# Data preparation - adding a column with the month number for grouping
df_eur$Month <- format(df_eur$Date, "%m")

1.2 Box Plots

show/hide
# Box plots for individual months
boxplot(EURPLN ~ Month, data = df_eur, 
        main = "EURPLN exchange rate distribution in the 2nd half of 2025",
        xlab = "Month", ylab = "Closing rate [PLN]",
        names = c("Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
        col = "lightblue")
grid()

1.3 Analysis of Variance (ANOVA)

show/hide
# Conducting the Analysis of Variance (ANOVA)
aov_res <- aov(EURPLN ~ Month, data = df_eur)
summary(aov_res)
             Df  Sum Sq  Mean Sq F value Pr(>F)    
Month         5 0.02364 0.004728   51.27 <2e-16 ***
Residuals   125 0.01153 0.000092                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
show/hide
tab_anova <- summary(aov_res)[[1]]
p_raw <- tab_anova["Month", "Pr(>F)"]
p_value <- format.pval(p_raw, eps = 0.001, digits = 3)

Comment: The obtained p-value \(Pr(>F) <0.001\) is significantly smaller than the assumed significance level \(\alpha = 0.05\). This provides a solid basis to reject the null hypothesis of equal mean EURPLN exchange rates in the analyzed months (July–December 2025). This result indicates that the mean exchange rates of the currency in the studied periods differed from one another in a statistically significant way. The extremely high F-statistic value confirms that the month is a factor that strongly differentiates the level of the closing exchange rate. To determine precisely which months differ from each other, a post-hoc analysis (e.g., Tukey’s test) must be conducted.

1.4 Post-Hoc Analysis (Tukey and Bonferroni)

show/hide
# Tukey's HSD method
tukey_res <- TukeyHSD(aov_res)
print(tukey_res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = EURPLN ~ Month, data = df_eur)

$Month
              diff          lwr          upr     p adj
08-07  0.004397660 -0.003990956  0.012786277 0.6537979
09-07  0.002453202 -0.005835158  0.010741561 0.9558889
10-07 -0.007185217 -0.015380966  0.001010531 0.1212538
11-07 -0.018106435 -0.026603968 -0.009608902 0.0000001
12-07 -0.033689526 -0.041977885 -0.025401167 0.0000000
09-08 -0.001944459 -0.010423579  0.006534662 0.9855265
10-08 -0.011582878 -0.019971494 -0.003194261 0.0014979
11-08 -0.022504095 -0.031187796 -0.013820395 0.0000000
12-08 -0.038087186 -0.046566307 -0.029608066 0.0000000
10-09 -0.009638419 -0.017926778 -0.001350060 0.0127134
11-09 -0.020559636 -0.029146526 -0.011972747 0.0000000
12-09 -0.036142727 -0.044522673 -0.027762781 0.0000000
11-10 -0.010921217 -0.019418751 -0.002423684 0.0039818
12-10 -0.026504308 -0.034792667 -0.018215949 0.0000000
12-11 -0.015583091 -0.024169980 -0.006996202 0.0000092
show/hide
plot(tukey_res, las = 1)

show/hide
# Bonferroni correction method
bonf_res <- pairwise.t.test(df_eur$EURPLN, df_eur$Month, p.adj = 'bonf')
print(bonf_res)

    Pairwise comparisons using t tests with pooled SD 

data:  df_eur$EURPLN and df_eur$Month 

   07      08      09      10      11     
08 1.0000  -       -       -       -      
09 1.0000  1.0000  -       -       -      
10 0.1859  0.0016  0.0152  -       -      
11 1.3e-07 1.5e-10 3.0e-09 0.0045  -      
12 < 2e-16 < 2e-16 < 2e-16 1.1e-14 9.4e-06

P value adjustment method: bonferroni 

Comment: The post-hoc analysis using Tukey’s and Bonferroni’s methods yields highly consistent results and allows for a precise identification of month pairs between which significant differences occurred. The tests confirm that the summer months (July, August, September) form a cluster with a similar exchange rate level (no significant differences, Bonferroni p-value equal to 1.000). Significant shifts begin in October (which differs significantly from August and September). In turn, November and December differ in a highly significant manner (p < 0.001) from all preceding months. These findings, combined with the negative mean differences observed in Tukey’s test, clearly document a downward trend in the EURPLN rate (appreciation of the Polish złoty) across the entire studied half-year. Tukey’s plot visualizes this trend, as the majority of confidence intervals are shifted to the left of the zero line.

1.5 Kruskal-Wallis Test (Without Assuming Normality)

show/hide
# Conducting the Kruskal-Wallis test
kruskal_res <- kruskal.test(EURPLN ~ Month, data = df_eur)
print(kruskal_res)

    Kruskal-Wallis rank sum test

data:  EURPLN by Month
Kruskal-Wallis chi-squared = 79.224, df = 5, p-value = 1.22e-15

Comment: The Kruskal-Wallis test result is fully consistent with the ANOVA findings. The extremely low p-value allows us to reject the null hypothesis of equal EURPLN exchange rate distributions across the analyzed months. This confirms that the previously observed differences are not an artifact of potential deviations from the normality assumption, but rather reflect real and statistically significant changes in the currency rate level during the second half of 2025. Mirroring the ANOVA results, the high chi-squared statistic indicates a strong influence of the time factor (month) on exchange rate volatility.


2 Multiple Regression Modeling in the Forex Market

Using linear regression on data from the last four months (January to April 2026), determine the relationships of:
- the EURPLN exchange rate relative to the USDPLN, GBPPLN, AUDPLN, and JPYPLN exchange rates,
- the KGHM (KGH) stock price relative to the EURPLN, USDPLN, GBPPLN, AUDPLN, and JPYPLN exchange rates.
Evaluate the significance of each explanatory variable in the constructed models. Assume a significance level of \(\alpha = 0.05\).

2.1 Data Preparation and Retrieval

show/hide
# Setting the date range for the first four months of 2026
dateStart2 <- as.Date("2026-01-01")
dateEnd2 <- as.Date("2026-04-30")

# List of symbols to download
symbols <- c('EURPLN', 'USDPLN', 'GBPPLN', 'AUDPLN', 'JPYPLN', 'KGH')

# Fetching data using the previously defined read_rates function
data_list <- lapply(symbols, function(s) read_rates(s, dateStart2, dateEnd2))

# Merging all data frames into a single data frame by the Date column
df_reg <- Reduce(function(x, y) merge(x, y, by = "Date"), data_list)

2.2 Linear Regression

show/hide
# Model 1: Relationship of the EURPLN rate relative to other currencies
model_eur <- lm(EURPLN ~ USDPLN + GBPPLN + AUDPLN + JPYPLN, data = df_reg)
summary(model_eur)

Call:
lm(formula = EURPLN ~ USDPLN + GBPPLN + AUDPLN + JPYPLN, data = df_reg)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.020873 -0.004665 -0.001045  0.004684  0.018486 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.24351    0.14581  15.387  < 2e-16 ***
USDPLN       0.23745    0.02930   8.105 6.38e-12 ***
GBPPLN       0.12323    0.04673   2.637   0.0101 *  
AUDPLN       0.13492    0.01736   7.770 2.82e-11 ***
JPYPLN       8.33776    4.20687   1.982   0.0511 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.00754 on 77 degrees of freedom
Multiple R-squared:  0.9255,    Adjusted R-squared:  0.9217 
F-statistic: 239.3 on 4 and 77 DF,  p-value: < 2.2e-16
show/hide
# Model 2: Relationship of the KGHM stock price relative to exchange rates
model_kgh <- lm(KGH ~ EURPLN + USDPLN + GBPPLN + AUDPLN + JPYPLN, data = df_reg)
summary(model_kgh)

Call:
lm(formula = KGH ~ EURPLN + USDPLN + GBPPLN + AUDPLN + JPYPLN, 
    data = df_reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.195  -9.524   0.427   9.927  36.430 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2044.94     573.95   3.563 0.000637 ***
EURPLN        -168.55     222.23  -0.758 0.450519    
USDPLN        -267.93      77.77  -3.445 0.000932 ***
GBPPLN          52.81      95.16   0.555 0.580513    
AUDPLN          71.80      45.23   1.587 0.116558    
JPYPLN      -21480.08    8410.28  -2.554 0.012650 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.7 on 76 degrees of freedom
Multiple R-squared:  0.6445,    Adjusted R-squared:  0.6211 
F-statistic: 27.56 on 5 and 76 DF,  p-value: 8.355e-16

2.3 Significance Assessment of Variables

Commentary on the EURPLN Model: Based on the performed linear regression and the analysis of the \(p\)-values (Pr(>|t|)) at the significance level of \(\alpha = 0.05\), we can conclude that:

  • The variables that have a statistically significant effect on the EURPLN exchange rate are USDPLN (\(p = 6.38 \cdot 10^{-12}\)), GBPPLN (\(p = 0.0101\)), and AUDPLN (\(p = 2.82 \cdot 10^{-11}\)).
  • The JPYPLN exchange rate (\(p = 0.0511\)) lies on the verge of significance; however, under a strict assumption of \(\alpha = 0.05\), we do not consider it a significant variable in this specific model.
  • The Adjusted Coefficient of Determination (\(Adjusted R^2\)) is 0.9217, meaning that the model explains 92.17% of the variance in the Euro exchange rate relative to the Polish złoty in the analyzed period, demonstrating an excellent model fit.

Commentary on the KGH Model: Analysis of the regression model for KGHM stock indicates that:

  • The significant explanatory variables (\(p < 0.05\)) for the KGHM stock price are USDPLN (\(p = 0.000932\)) and JPYPLN (\(p = 0.012650\)).
  • For the US Dollar, a negative impact on the company’s valuation was observed (Estimate coefficient \(\approx -267.93\)). This indicates that an increase in the USDPLN rate (depreciation of the Polish złoty against the Dollar) is statistically associated with a decline in KGHM stock prices during the analyzed period.
  • Other currencies, including the EURPLN rate (\(p = 0.4505\)), do not show any statistically significant impact on the company’s stock price.
  • The Adjusted Coefficient of Determination (\(Adjusted R^2\)) of 0.6211 suggests a moderate fit of the model—currency factors explain approximately 62% of the variance in the KGHM price. This suggests a strong influence of non-currency factors, such as global copper prices or general stock market sentiment.

3 Development and Evaluation of Predictive Sales Models

The file sprzedaz.txt contains data on a company’s advertising expenses (in thousands of PLN) and the sales value of its products (in millions of PLN) across various quarters.
- Using linear regression, determine the relationship between sales value and advertising expenses. Plot the data points and the regression line on a single chart.
- Calculate the forecasted sales values if advertising expenses are 300, 500, and 700 thousand PLN. Estimate the standard deviation of the error associated with the forecasted sales values for each level of advertising expenses.
- For the data in the sprzedaz.txt file, investigate whether a quadratic relationship would provide a better model for the dependency between advertising expenses (in thousands of PLN) and sales value (in millions of PLN). Compare the values of the coefficient of determination \(R^2\) for both models. Plot the corresponding line representing this relationship on the chart containing the data and the linear regression line determined in the previous step.

3.1 Linear Regression Plot

show/hide
# Loading data
sales <- read.csv("data/sprzedaz.txt")

# Linear regression model
lin_model <- lm(Income ~ Advert, data = sales)

# Scatter plot with regression line
plot(sales$Advert, sales$Income, 
     xlab = "Advertising expenses [thousand PLN]", 
     ylab = "Sales value [million PLN]", 
     main = "Relationship between Sales and Advertising Expenses",
     pch = 19, col = "blue")
abline(lin_model, col = "red", lwd = 2)
legend("topleft", legend = "Regression line", col = "red", lwd = 2)

3.2 Forecasting and Prediction Error (300, 500, 700 thousand PLN)

show/hide
new_data <- data.frame(Advert = c(300, 500, 700))
lin_pred <- predict(lin_model, newdata = new_data, se.fit = TRUE)

# Calculating the standard deviation of the prediction error for an individual observation
# Prediction error accounts for the mean estimation error and the residual variance
se_pred_error <- sqrt(lin_pred$se.fit^2 + lin_pred$residual.scale^2)

# Compiling the forecast results
forecast_results <- data.frame(
  Advertising_Expenses = new_data$Advert,
  Forecasted_Sales = lin_pred$fit,
  Prediction_Error_Estimate = se_pred_error
)
print(forecast_results)
  Advertising_Expenses Forecasted_Sales Prediction_Error_Estimate
1                  300         9.574201                 0.6388024
2                  500        12.502700                 0.6250514
3                  700        15.431199                 0.6768939

3.3 Quadratic Model and Statistical Verification

show/hide
quad_model <- lm(Income ~ Advert + I(Advert^2), data = sales)

# 1. Comparison of R^2 and Adjusted R^2
sum_lin <- summary(lin_model)
sum_quad <- summary(quad_model)

r2_lin <- sum_lin$r.squared
r2_adj_lin <- sum_lin$adj.r.squared

r2_quad <- sum_quad$r.squared
r2_adj_quad <- sum_quad$adj.r.squared

# 2. ANOVA Test - checking if the addition of the quadratic term is significant
test_anova <- anova(lin_model, quad_model)
p_value_anova <- test_anova$"Pr(>F)"[2]

# Displaying the comparison results
stat_comparison <- data.frame(
  Model = c("Linear", "Quadratic"),
  R_Squared = c(r2_lin, r2_quad),
  Adj_R_Squared = c(r2_adj_lin, r2_adj_quad)
)
print(stat_comparison)
      Model R_Squared Adj_R_Squared
1    Linear 0.9140610     0.9095379
2 Quadratic 0.9326505     0.9251672
show/hide
cat("\nANOVA test result (p-value for change):", p_value_anova)

ANOVA test result (p-value for change): 0.03879825
show/hide
# 3. Comparative plot
plot(sales$Advert, sales$Income, 
     xlab = "Advertising expenses [thousand PLN]", 
     ylab = "Sales value [million PLN]", 
     main = "Comparison of Linear and Quadratic Models",
     pch = 19, col = "blue")
abline(lin_model, col = "red", lwd = 2)

# Adding the quadratic model curve
x_grid <- seq(min(sales$Advert), max(sales$Advert), length.out = 100)
y_quad <- predict(quad_model, newdata = data.frame(Advert = x_grid))
lines(x_grid, y_quad, col = "darkgreen", lwd = 2)

legend("topleft", 
       legend = c("Data", "Linear Regression", "Quadratic Regression"), 
       col = c("blue", "red", "darkgreen"), 
       pch = c(19, NA, NA), 
       lwd = c(NA, 2, 2))

3.4 Conclusions and Analysis

Forecasted Sales Values: Based on the linear model, the following sales forecasts (in millions of PLN) were determined for the given advertising expenses (in thousands of PLN): - For expenses of 300 thousand PLN: the forecast is 9.57 million PLN (standard deviation of the prediction error: 0.64).
- For expenses of 500 thousand PLN: the forecast is 12.5 million PLN (standard deviation of the prediction error: 0.63).
- For expenses of 700 thousand PLN: the forecast is 15.43 million PLN (standard deviation of the prediction error: 0.68).

Comparison and Model Selection:

To eliminate the risk of overfitting and confirm the validity of the model selection, an in-depth analysis was conducted:

  1. Adjusted Coefficient of Determination (\(R^2\)): An increase was observed from 0.9095 (linear model) to 0.9252 (quadratic model). Since the adjusted coefficient accounts for the number of parameters, its increase unambiguously confirms that the non-linear model describes the data structure better, rather than merely fitting the noise.
  2. ANOVA Test: Comparing both models with an F-test revealed a significant difference (\(p = 0.0388\)). Since the \(p\)-value is substantially smaller than the adopted significance level of 0.05, we reject the hypothesis that the linear model is sufficient.
  3. Visual Conclusions: The plot confirms that the linear regression line (red) systematically deviates from the data at extreme levels of advertising expenses, whereas the quadratic curve (green) naturally tracks the slight arc formed by the data points.

Summary: All applied statistical criteria confirm that the quadratic model is more adequate and statistically justified. The higher precision of this model stems from the non-linear nature of the relationship between advertising and sales in this specific dataset.