Fast Linear Regression - Optimizing Linear Regression to Run in Constant Time in Circular Buffers

Abstract

Linear regression is a fundamental statistical tool used to model the relationship between a dependent variable and one or more independent variables. Recomputing the regression parameters from scratch for every new observation can be computationally intensive, especially with large windows or real-time data streams. This paper presents an optimized approach that updates a simple linear regression over a sliding window in constant time per new data point, using a circular buffer of fixed size. By combining incremental calculations with precomputed constants, the method reduces the cost of each update from $O(n)$ to $O(1)$, making it well suited to real-time applications. We provide mathematical derivations, explain the assumptions, and discuss potential applications.

Introduction

Linear regression is widely used in fields such as economics, engineering, and the natural sciences to model and analyze relationships between variables. In real-time systems, data is generated continuously, and regression models must be updated efficiently without recomputing from scratch. Recomputing a linear regression over a window of $n$ points takes $O(n)$ time per update, which becomes a bottleneck in systems with high data throughput or limited computational resources.

Circular buffers offer a way to handle streaming data by maintaining a fixed-size window of the most recent data points. This paper explores how to optimize linear regression computations to run in constant time $O(1)$ using circular buffers. The approach involves precomputing certain constants and updating sums incrementally, thus avoiding redundant calculations.

This paper provides detailed mathematical derivations, explains the assumptions, and outlines the implementation.

Mathematical Background

Linear Regression Fundamentals

Simple linear regression fits a linear model to a set of data points $(x_i, y_i)$ by minimizing the sum of squared differences between the observed values and the values predicted by the model.

The linear model is defined as:

y_i = \beta_0 + \beta_1 x_i + \epsilon_i,

where:

$y_i$ is the dependent variable,
$x_i$ is the independent variable,
$\beta_0$ is the intercept,
$\beta_1$ is the slope,
$\epsilon_i$ is the error term.

The parameters $\beta_0$ and $\beta_1$ are estimated using the least squares method. The formulas for the slope $\beta_1$ and intercept $\beta_0$ are:

\beta_1 = \frac{S_{xy}}{S_{xx}},

\beta_0 = \bar{y} - \beta_1 \bar{x},

where:

$\bar{x}$ and $\bar{y}$ are the sample means of $x_i$ and $y_i$, respectively,
$S_{xx}$ is the sum of squared deviations of $x_i$ from their mean,
$S_{xy}$ is the sum of the products of deviations of $x_i$ and $y_i$ from their means.

The sums $S_{xx}$ and $S_{xy}$ are calculated as:

S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2,

S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).
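As a point of reference, the formulas above can be evaluated directly in a single pass over the data. The following is a minimal sketch of that $O(n)$ baseline; the function name `fit_line_direct` and the sample data are illustrative assumptions, not part of the method described in this paper.

```python
def fit_line_direct(xs, ys):
    """Least-squares intercept and slope computed directly in O(n)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of products of deviations.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_1 = s_xy / s_xx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


# Illustrative data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
beta_0, beta_1 = fit_line_direct(xs, ys)
```

Every call walks all $n$ points, which is the cost the rest of the paper aims to avoid.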

Computational Challenges

Computing $S_{xx}$ and $S_{xy}$ requires iterating over all data points, resulting in $O(n)$ time complexity for each update. In real-time systems with streaming data, recalculating these sums for every new data point is inefficient. There is a need for methods that can update the regression parameters incrementally, preferably in constant time.

Proposed Method

Assumptions and Requirements

Fixed Window Size: The data is processed in a sliding window of fixed size $n$. This window moves forward as new data arrives, discarding the oldest data.
Equally Spaced Independent Variable: The independent variable $x_i$ represents equally spaced indices, such as time steps. Specifically, $x_i = i$ is the position within the current window, where $i$ ranges from $1$ (the oldest value) to $n$ (the newest value).
Real-Time Data Stream: The method is designed for applications where data arrives sequentially, and regression parameters need to be updated in real time.

Circular Buffers

A circular buffer is a fixed-size data structure that wraps around upon reaching its end, overwriting the oldest data. It efficiently manages streaming data by maintaining a window of the most recent $n$ data points.

In this context, we use a circular buffer to store the most recent $n$ values of the dependent variable $y_i$. The independent variable $x_i$ corresponds to fixed indices $1$ to $n$ for the current window.
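A circular buffer of this kind can be sketched in a few lines. The class below is an illustrative assumption (the name `CircularBuffer` and its methods are not taken from any library): `push` returns the value it overwrites, which is the $y_{\text{old}}$ that a sliding-window update needs.

```python
class CircularBuffer:
    """Fixed-size buffer that overwrites its oldest entry when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self._data = [None] * capacity
        self._capacity = capacity
        self._head = 0    # index of the slot that will be written next
        self._count = 0   # number of values currently stored

    def __len__(self):
        return self._count

    def is_full(self):
        return self._count == self._capacity

    def push(self, value):
        """Store value; return the evicted oldest value, or None if nothing was evicted."""
        evicted = self._data[self._head] if self.is_full() else None
        self._data[self._head] = value
        self._head = (self._head + 1) % self._capacity
        if not self.is_full():
            self._count += 1
        return evicted

    def values(self):
        """Return the stored values ordered from oldest (x = 1) to newest."""
        start = (self._head - self._count) % self._capacity
        return [self._data[(start + i) % self._capacity] for i in range(self._count)]


buf = CircularBuffer(3)
evicted = [buf.push(v) for v in [10, 20, 30, 40, 50]]
```

Both `push` and `is_full` run in constant time. Returning `None` for "nothing evicted" is adequate for numeric samples but would be ambiguous if `None` were itself a valid sample.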

Precomputing $S_{xx}$

Since $x_i = i$, the set of $x$ values is the same for every position of the window, so $\bar{x}$ and $S_{xx}$ are constants that can be precomputed once.

Computing $\bar{x}$

The mean of $x_i$ from $i = 1$ to $n$ is:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} \left( \frac{n(n + 1)}{2} \right ) = \frac{n + 1}{2}.

Computing $S_{xx}$

We compute $S_{xx}$ as:

S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2.

Using the formula for the sum of squares of the first $n$ natural numbers:

\sum_{i=1}^{n} x_i^2 = \frac{n(n + 1)(2n + 1)}{6}.

Substituting back into $S_{xx}$:

S_{xx} = \frac{n(n + 1)(2n + 1)}{6} - n \left( \frac{n + 1}{2} \right )^2.

Simplifying:

S_{xx} = \frac{n(n + 1)(2n + 1)}{6} - n \left( \frac{(n + 1)^2}{4} \right ) = n \left( \frac{(n + 1)(2n + 1)}{6} - \frac{(n + 1)^2}{4} \right ).

Further simplifying:

S_{xx} = n(n + 1) \left( \frac{2n + 1}{6} - \frac{n + 1}{4} \right ) = n(n + 1) \left( \frac{4(2n + 1) - 6(n + 1)}{24} \right ).

Simplify the numerator:

4(2n + 1) - 6(n + 1) = 8n + 4 - 6n - 6 = 2n - 2.

Thus:

S_{xx} = n(n + 1) \left( \frac{2n - 2}{24} \right ) = n(n + 1)(n - 1) \left( \frac{1}{12} \right ).

So:

S_{xx} = \frac{n(n + 1)(n - 1)}{12}.
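The closed form can be checked numerically against the defining sum. The sketch below uses exact rational arithmetic so that the comparison is not affected by rounding; the function names are illustrative.

```python
from fractions import Fraction


def s_xx_direct(n):
    """Sum of squared deviations of x = 1..n from their mean, by direct summation."""
    x_bar = Fraction(n + 1, 2)
    return sum((Fraction(i) - x_bar) ** 2 for i in range(1, n + 1))


def s_xx_closed_form(n):
    """Closed form n(n + 1)(n - 1) / 12."""
    return Fraction(n * (n + 1) * (n - 1), 12)


# Compare the two expressions exactly for a range of window sizes.
for n in range(1, 50):
    assert s_xx_direct(n) == s_xx_closed_form(n)
```

For $n = 1$ both expressions give $0$, so the slope $S_{xy} / S_{xx}$ is defined only for windows of at least two points.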

Incremental Update of $S_{xy}$

Our goal is to update $S_{xy}$ efficiently as new data points are added and old data points are removed, without recalculating from scratch.

Maintaining Sums

We maintain the following sums:

Sum of $y_i$:

\sum y_i.

Sum of $x_i y_i$:

\sum x_i y_i.

Updating the Sums

When the window slides, the oldest data point $y_{\text{old}}$ at position $x_{\text{old}} = 1$ is removed, every retained value moves one position toward the start of the window (the value at position $i$ moves to position $i - 1$), and the new data point $y_{\text{new}}$ enters at position $x_{\text{new}} = n$. Writing $\sum y_i$ and $\sum x_i y_i$ for the sums over the window before the slide (so both still include $y_{\text{old}}$), the sums are updated as:

Sum of $y_i$:

\sum y_i' = \sum y_i + y_{\text{new}} - y_{\text{old}}.

Sum of $x_i y_i$:

\sum x_i y_i' = \sum_{i=2}^{n} (i - 1) y_i + n y_{\text{new}} = \sum x_i y_i - \sum y_i + n y_{\text{new}}.

The second equality holds because $\sum_{i=1}^{n} (i - 1) y_i = \sum x_i y_i - \sum y_i$ and its $i = 1$ term is zero. The term $- \sum y_i$ accounts for the position shift of the retained values. Because it uses the sum from before the slide, $\sum x_i y_i$ must be updated before $\sum y_i$.

Updating $\bar{y}$

The mean of $y_i$ is updated as:

\bar{y}' = \frac{\sum y_i'}{n} = \bar{y} + \frac{y_{\text{new}} - y_{\text{old}}}{n}.

Updating $S_{xy}$

The updated $S_{xy}$ is calculated as:

S_{xy}' = \sum x_i y_i' - n \bar{x} \bar{y}'.

Substituting the updated sums:

S_{xy}' = \left( \sum x_i y_i - \sum y_i + n y_{\text{new}} \right ) - n \bar{x} \left( \bar{y} + \frac{y_{\text{new}} - y_{\text{old}}}{n} \right ).

Simplify $n \bar{x} \bar{y}'$:

n \bar{x} \bar{y}' = n \bar{x} \bar{y} + \bar{x} ( y_{\text{new}} - y_{\text{old}} ).

Substitute back into $S_{xy}'$, using $S_{xy} = \sum x_i y_i - n \bar{x} \bar{y}$:

S_{xy}' = S_{xy} - \sum y_i + n y_{\text{new}} - \bar{x} ( y_{\text{new}} - y_{\text{old}} ).

Group the terms in $y_{\text{new}}$ and $y_{\text{old}}$:

S_{xy}' = S_{xy} - \sum y_i + ( n - \bar{x} ) y_{\text{new}} + \bar{x} y_{\text{old}}.

Compute $n - \bar{x}$ and recall $\bar{x}$:

n - \bar{x} = n - \frac{n + 1}{2} = \frac{n - 1}{2}, \quad \bar{x} = \frac{n + 1}{2}.

Thus, the incremental update becomes:

S_{xy}' = S_{xy} - \sum y_i + \left( \frac{n - 1}{2} \right ) y_{\text{new}} + \left( \frac{n + 1}{2} \right ) y_{\text{old}},

where $\sum y_i$ is the sum over the window before the slide.

Final Incremental Update Formulas

Update $S_{xy}$ (using $\sum y_i$ from before the slide):

S_{xy}' = S_{xy} - \sum y_i + \left( \frac{n - 1}{2} \right ) y_{\text{new}} + \left( \frac{n + 1}{2} \right ) y_{\text{old}}.

Update $\sum y_i$:

\sum y_i' = \sum y_i + y_{\text{new}} - y_{\text{old}}.

Compute $\beta_1$:

\beta_1' = \frac{S_{xy}'}{S_{xx}}.

Compute $\beta_0$:

\beta_0' = \bar{y}' - \beta_1' \bar{x}.

These formulas allow us to update the regression parameters in constant time as new data arrives.
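A constant-time update can be checked against a direct recomputation over every window of a random stream. The sketch below uses the update $S_{xy}' = S_{xy} - \sum y_i + \frac{n - 1}{2} y_{\text{new}} + \frac{n + 1}{2} y_{\text{old}}$, where $\sum y_i$ is the sum before the slide; the $- \sum y_i$ term accounts for each retained value moving one position toward the start of the window. The function names, window size, and random data are illustrative assumptions.

```python
import random


def s_xy_direct(window):
    """S_xy for x = 1..n computed from scratch in O(n)."""
    n = len(window)
    x_bar = (n + 1) / 2
    y_bar = sum(window) / n
    return sum((i - x_bar) * (y - y_bar) for i, y in enumerate(window, start=1))


def slide(s_xy, sum_y, y_new, y_old, n):
    """O(1) update of (S_xy, sum of y) when the window advances by one sample.

    sum_y is the sum over the window before the slide (it still includes y_old).
    """
    s_xy = s_xy - sum_y + (n - 1) / 2 * y_new + (n + 1) / 2 * y_old
    sum_y = sum_y + y_new - y_old
    return s_xy, sum_y


random.seed(0)
n = 8
stream = [random.uniform(-10, 10) for _ in range(200)]

# Seed the running values from the first full window, then slide.
window = stream[:n]
s_xy, sum_y = s_xy_direct(window), sum(window)
max_error = 0.0
for y_new in stream[n:]:
    y_old = window[0]
    s_xy, sum_y = slide(s_xy, sum_y, y_new, y_old, n)
    window = window[1:] + [y_new]  # O(n) copy, used only for the check
    max_error = max(max_error, abs(s_xy - s_xy_direct(window)))
```

The list copy and the call to `s_xy_direct` exist only to verify the result; the update itself touches four numbers per step.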

Implementation

Algorithm Overview

The implementation involves maintaining the following variables:

Circular Buffer: A fixed-size buffer storing the most recent $n$ values of $y_i$.
Sum of $y_i$: Maintained as $\sum y_i$.
Sum $S_{xy}$ : Maintained using the incremental update formula.
Precomputed Constants: $\bar{x}$ and $S_{xx}$ .

Pseudocode

Initialization:

Set buffer size $n$ (at least $2$, so that $S_{xx} > 0$).
Precompute $\bar{x} = \frac{n + 1}{2}$ .
Precompute $S_{xx} = \frac{n(n + 1)(n - 1)}{12}$ .
Initialize $\sum y_i = 0$ and $S_{xy} = 0$ .
Initialize an empty circular buffer of size $n$ .

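Update Function (called when a new $y_{\text{new}}$ arrives):

If the buffer is full:

Read $y_{\text{old}}$, the oldest value in the buffer, and overwrite it with $y_{\text{new}}$.
Update $S_{xy} \leftarrow S_{xy} - \sum y_i + \left( \frac{n - 1}{2} \right ) y_{\text{new}} + \left( \frac{n + 1}{2} \right ) y_{\text{old}}$, using the value of $\sum y_i$ from before this update. The $- \sum y_i$ term accounts for every retained value moving one position toward the start of the window.
Update $\sum y_i \leftarrow \sum y_i + y_{\text{new}} - y_{\text{old}}$.

Else:

Insert $y_{\text{new}}$ into the buffer.
Update $\sum y_i \leftarrow \sum y_i + y_{\text{new}}$.
Recalculate $S_{xy}$ from scratch over the values stored so far (since $n$ is not yet reached). The value computed when the buffer becomes full seeds the incremental update.

Compute $\beta_1 = \frac{S_{xy}}{S_{xx}}$.
Compute $\bar{y} = \frac{ \sum y_i }{n}$.
Compute $\beta_0 = \bar{y} - \beta_1 \bar{x}$.

Notes:

When the buffer is not yet full, we cannot use the incremental update for $S_{xy}$ reliably, so $S_{xy}$ is computed directly until the buffer is full.
The precomputed $\bar{x}$ and $S_{xx}$ assume a full window, so the three parameter computations above are valid only once the buffer holds $n$ values. A fit over a partially filled buffer of $k$ values ($k \geq 2$) requires replacing $n$ with $k$ in $\bar{x}$, $S_{xx}$, and $\bar{y}$.
Once the buffer is full, all updates can be performed in constant time.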
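The procedure can be sketched as a small Python class. This is one possible reading of the pseudocode, not a reference implementation: the class name `SlidingLinearRegression` is an assumption, `collections.deque` with `maxlen` stands in for the circular buffer, and the sketch returns no estimate until the window is full. The full-window update is $S_{xy} \leftarrow S_{xy} - \sum y_i + \frac{n - 1}{2} y_{\text{new}} + \frac{n + 1}{2} y_{\text{old}}$, with $\sum y_i$ taken before the slide so that the shift of the retained values is accounted for.

```python
from collections import deque


class SlidingLinearRegression:
    """Least-squares line over the last n samples, with x = 1..n inside the window."""

    def __init__(self, n):
        if n < 2:
            raise ValueError("window size must be at least 2")
        self.n = n
        self.x_bar = (n + 1) / 2                 # precomputed mean of x
        self.s_xx = n * (n + 1) * (n - 1) / 12   # precomputed S_xx
        self.buffer = deque(maxlen=n)            # stands in for the circular buffer
        self.sum_y = 0.0
        self.s_xy = 0.0

    def update(self, y_new):
        """Add a sample; return (beta_0, beta_1), or None until the window is full."""
        if len(self.buffer) == self.n:
            y_old = self.buffer[0]
            # Uses sum_y from before the slide, so s_xy is updated first.
            self.s_xy += (-self.sum_y
                          + (self.n - 1) / 2 * y_new
                          + (self.n + 1) / 2 * y_old)
            self.sum_y += y_new - y_old
            self.buffer.append(y_new)  # maxlen discards y_old automatically
        else:
            self.buffer.append(y_new)
            self.sum_y += y_new
            if len(self.buffer) < self.n:
                return None
            # Window just became full: seed S_xy with one direct O(n) pass.
            self.s_xy = sum((i - self.x_bar) * y
                            for i, y in enumerate(self.buffer, start=1))
        beta_1 = self.s_xy / self.s_xx
        beta_0 = self.sum_y / self.n - beta_1 * self.x_bar
        return beta_0, beta_1


# Illustrative stream: samples of the line y = 3 + 0.5 t at t = 1..20.
reg = SlidingLinearRegression(5)
fits = [reg.update(3 + 0.5 * t) for t in range(1, 21)]
```

The intercept is expressed in window coordinates (it is the fitted value at $x = 0$, one step before the oldest sample), so it changes as the window slides even when the underlying line does not. On long streams the running sums can accumulate floating-point rounding error; reseeding them from the buffer at intervals is one way to bound it.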

Results and Performance Analysis

Time Complexity

The optimized method ensures that each update, which adds the newest data point and removes the oldest, runs in $O(1)$ time once the buffer is full. This is a significant improvement over recomputing the sums in $O(n)$ time per update, especially in systems where $n$ is large or updates are frequent.

Space Complexity

The space complexity of this optimized linear regression method is $O(n)$, where $n$ is the size of the circular buffer.

This is because:

The circular buffer stores $n$ values of $y_i$, requiring $O(n)$ space.
We maintain a constant number of additional variables ($\sum y_i$, $S_{xy}$, $\bar{x}$, $S_{xx}$), each requiring $O(1)$ space.

Importantly, memory use is fixed by the buffer size and does not grow with the number of data points processed over time. This is in contrast to naive implementations that store all historical data, leading to unbounded growth in memory usage.

The fixed space requirement makes this method particularly suitable for embedded systems or any application with limited memory resources.

Applications

Real-Time Monitoring: Systems that require immediate insights from data streams, such as IoT sensors or network traffic analysis.
Financial Markets: High-frequency trading platforms where rapid data processing is critical.
Embedded Systems: Devices with limited computational resources benefit from the reduced overhead.
Process Control: Industrial systems that need to adjust parameters in real time based on sensor data.

Assumptions and Limitations

The method assumes that the independent variable $x_i$ is a sequence of equally spaced values (e.g., time steps), and specifically that $x_i = i$ for $i = 1, 2, \dots, n$.
The buffer size $n$ is fixed; changing $n$ would require recomputing the precomputed constants.
The method as derived applies to simple linear regression with a single independent variable.
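Because the fit is expressed in window indices, a caller whose samples are equally spaced in time may want the line in time units instead. The sketch below shows that conversion under the assumption that sample $i$ of the window was taken at $t = t_{\text{oldest}} + (i - 1) \Delta t$; the function name and parameters are hypothetical and not part of the method itself.

```python
def to_time_units(beta_0, beta_1, t_oldest, dt):
    """Convert a fit in window index x = 1..n to a fit in time t.

    Assumes sample i in the window was taken at t = t_oldest + (i - 1) * dt.
    Returns (intercept, slope) such that y = intercept + slope * t.
    """
    # Substituting x = (t - t_oldest) / dt + 1 into y = beta_0 + beta_1 * x.
    slope = beta_1 / dt
    intercept = beta_0 + beta_1 * (1 - t_oldest / dt)
    return intercept, slope


# Illustrative values: a window fit with beta_0 = 3.0 and beta_1 = 0.5,
# whose oldest sample was taken at t = 10.0 with spacing dt = 0.25.
intercept, slope = to_time_units(3.0, 0.5, t_oldest=10.0, dt=0.25)
```

The slope scales by $1 / \Delta t$, and the intercept absorbs the offset between window position and absolute time.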

Conclusion

This paper presents a method to optimize linear regression computations using circular buffers, achieving constant time updates after the initial buffer is filled. By precomputing constants and incrementally updating sums, the approach reduces computational demands, making it suitable for real-time applications. We have provided detailed mathematical derivations, clarified the assumptions, and explained how to apply the method.

Future work may explore extending this method to multiple regression, handling non-equally spaced independent variables, or applying similar incremental techniques to other statistical models.

Note: This paper presents an algorithmic approach to optimize linear regression computations using circular buffers and iterative updates. This is not a general purpose linear regression algorithm but rather a highly optimized version of linear regression for a specific use case.