Fast Linear Regression - Optimizing Linear Regression to Run in Constant Time with Circular Buffers

01 August 2024

Authored by Punit Arani

Abstract

Linear regression is a fundamental statistical tool used to model the relationship between a dependent variable and one or more independent variables. Traditional methods for computing linear regression parameters can be computationally intensive, especially with large datasets or real-time data streams. This paper presents an optimized approach to perform linear regression in constant time and fixed space complexity using circular buffers. By leveraging incremental calculations and precomputed constants, the proposed method significantly reduces computational overhead, making it ideal for real-time applications. We provide mathematical derivations, explain the assumptions, and discuss potential applications.

Introduction

Linear regression is widely used in fields such as economics, engineering, and the natural sciences to model and analyze relationships between variables. In real-time systems, data is continuously generated, and regression models must be updated efficiently without recomputing from scratch. Traditional algorithms for linear regression have a time complexity of $O(n)$ per update, which becomes a bottleneck in systems with high data throughput or limited computational resources.

Circular buffers offer a way to handle streaming data by maintaining a fixed-size window of the most recent data points. This paper explores how to optimize linear regression computations to run in constant time, $O(1)$, using circular buffers. The approach involves precomputing certain constants and updating sums incrementally, thus avoiding redundant calculations.

This paper provides detailed mathematical derivations, explains the assumptions, and outlines the implementation.

Mathematical Background

Linear Regression Fundamentals

Simple linear regression aims to fit a linear model to a set of data points $(x_i, y_i)$ by minimizing the sum of squared differences between the observed values and the values predicted by the model.

The linear model is defined as:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$

where:

  • $y_i$ is the dependent variable,
  • $x_i$ is the independent variable,
  • $\beta_0$ is the intercept,
  • $\beta_1$ is the slope,
  • $\epsilon_i$ is the error term.

The parameters $\beta_0$ and $\beta_1$ are estimated using the least squares method. The formulas for the slope $\beta_1$ and intercept $\beta_0$ are:

$$\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x},$$

where:

  • $\bar{x}$ and $\bar{y}$ are the sample means of $x_i$ and $y_i$, respectively,
  • $S_{xx}$ is the sum of squared deviations of $x_i$ from its mean,
  • $S_{xy}$ is the sum of the products of deviations of $x_i$ and $y_i$ from their means.

The sums $S_{xx}$ and $S_{xy}$ are calculated as:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$
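
As a point of reference, the direct $O(n)$ evaluation of these formulas is straightforward. The following Python sketch (the function name and plain-list inputs are illustrative choices) is the baseline that the rest of the paper improves on:

```python
def linreg_direct(xs, ys):
    """Fit y = b0 + b1*x by ordinary least squares in O(n) time."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Points on the line y = 2x + 1 recover b0 = 1, b1 = 2.
print(linreg_direct([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)
```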

Computational Challenges

Computing $S_{xx}$ and $S_{xy}$ requires iterating over all data points, resulting in $O(n)$ time complexity for each update. In real-time systems with streaming data, recalculating these sums for every new data point is inefficient. There is a need for methods that can update the regression parameters incrementally, preferably in constant time.

Proposed Method

Assumptions and Requirements

  • Fixed Window Size: The data is processed in a sliding window of fixed size $n$. This window moves forward as new data arrives, discarding the oldest data.
  • Equally Spaced Independent Variable: The independent variable $x_i$ represents equally spaced indices, such as time steps. Specifically, $x_i = i$, where $i$ ranges from $1$ to $n$.
  • Real-Time Data Stream: The method is designed for applications where data arrives sequentially and regression parameters need to be updated in real time.

Circular Buffers

A circular buffer is a fixed-size data structure that wraps around upon reaching its end, overwriting the oldest data. It efficiently manages streaming data by maintaining a window of the most recent $n$ data points.

In this context, we use a circular buffer to store the most recent $n$ values of the dependent variable $y_i$. The independent variable $x_i$ corresponds to the fixed indices $1$ to $n$ for the current window.
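
For illustration, Python's collections.deque with a maxlen argument provides exactly this behavior; the short sketch below is only one possible realization of the buffer:

```python
from collections import deque

n = 4
buf = deque(maxlen=n)  # fixed-size circular buffer for the y values

for y in [10, 11, 12, 13, 14, 15]:
    oldest = buf[0] if len(buf) == n else None  # value about to be evicted
    buf.append(y)  # once full, appending overwrites the oldest entry
    print(list(buf), "evicted:", oldest)
# The final window holds the most recent n values: [12, 13, 14, 15]
```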

Precomputing $S_{xx}$

Since $x_i = i$, and the values of $x_i$ do not change as the window slides, we can precompute $S_{xx}$ and $\bar{x}$.

Computing $\bar{x}$

The mean of $x_i$ from $i = 1$ to $n$ is:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} \cdot \frac{n(n + 1)}{2} = \frac{n + 1}{2}.$$

Computing $S_{xx}$

We compute $S_{xx}$ as:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2.$$

Using the formula for the sum of squares of the first $n$ natural numbers:

$$\sum_{i=1}^{n} x_i^2 = \frac{n(n + 1)(2n + 1)}{6}.$$

Substituting back into $S_{xx}$:

$$S_{xx} = \frac{n(n + 1)(2n + 1)}{6} - n \left( \frac{n + 1}{2} \right)^2.$$

Simplifying:

$$S_{xx} = \frac{n(n + 1)(2n + 1)}{6} - n \cdot \frac{(n + 1)^2}{4} = n \left( \frac{(n + 1)(2n + 1)}{6} - \frac{(n + 1)^2}{4} \right).$$

Further simplifying:

$$S_{xx} = n(n + 1) \left( \frac{2n + 1}{6} - \frac{n + 1}{4} \right) = n(n + 1) \cdot \frac{4(2n + 1) - 6(n + 1)}{24}.$$

Simplify the numerator:

$$4(2n + 1) - 6(n + 1) = 8n + 4 - 6n - 6 = 2n - 2.$$

Thus:

$$S_{xx} = n(n + 1) \cdot \frac{2n - 2}{24} = n(n + 1)(n - 1) \cdot \frac{1}{12}.$$

So:

$$S_{xx} = \frac{n(n + 1)(n - 1)}{12}.$$
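
This closed form is easy to sanity-check against the direct sum; the quick verification sketch below compares the two for a few window sizes:

```python
def s_xx_direct(n):
    """S_xx computed directly from the definition, with x_i = i."""
    x_bar = (n + 1) / 2
    return sum((i - x_bar) ** 2 for i in range(1, n + 1))

def s_xx_closed(n):
    """The precomputed closed form derived above."""
    return n * (n + 1) * (n - 1) / 12

for n in (2, 5, 10, 100):
    assert abs(s_xx_direct(n) - s_xx_closed(n)) < 1e-9
print("closed form matches direct sum")
```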

Incremental Update of $S_{xy}$

Our goal is to update $S_{xy}$ efficiently as new data points are added and old data points are removed, without recalculating from scratch.

Maintaining Sums

We maintain the following sums:

  • Sum of $y_i$: $\sum y_i$.
  • Sum of $x_i y_i$: $\sum x_i y_i$.

Updating the Sums

When the window slides, a new data point $y_{\text{new}}$ enters at position $x_{\text{new}} = n$, the oldest data point $y_{\text{old}}$ at position $x_{\text{old}} = 1$ is removed, and every surviving value shifts one position to the left (its index decreases by one). The sums are updated as:

  • Sum of $y_i$:
$$\sum y_i' = \sum y_i + y_{\text{new}} - y_{\text{old}}.$$
  • Sum of $x_i y_i$:
$$\sum x_i y_i' = \sum x_i y_i + n\, y_{\text{new}} - \sum y_i.$$

The subtraction of the old $\sum y_i$ captures both effects of the slide: removing $y_{\text{old}}$ at position $1$ contributes $-y_{\text{old}}$, and shifting each surviving value down by one index contributes $-\left( \sum y_i - y_{\text{old}} \right)$; see the verification sketch below.
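
The identity for $\sum x_i y_i$ can be checked numerically; in this small sketch the window contents are arbitrary example values:

```python
ys = [7.0, 3.0, 5.0, 9.0]   # current window, positions 1..n
n = len(ys)
y_new = 4.0

sum_xy = sum((i + 1) * y for i, y in enumerate(ys))  # sum of x_i * y_i
sum_y = sum(ys)

shifted = ys[1:] + [y_new]  # window after the slide
direct = sum((i + 1) * y for i, y in enumerate(shifted))
incremental = sum_xy + n * y_new - sum_y

assert abs(direct - incremental) < 1e-9  # both give the same sum
```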

Updating $\bar{y}$

The mean of $y_i$ is updated as:

$$\bar{y}' = \frac{\sum y_i'}{n} = \bar{y} + \frac{y_{\text{new}} - y_{\text{old}}}{n}.$$

Updating $S_{xy}$

The updated $S_{xy}$ is calculated as:

$$S_{xy}' = \sum x_i y_i' - n \bar{x} \bar{y}'.$$

Substituting the updated sums:

$$S_{xy}' = \left( \sum x_i y_i + n\, y_{\text{new}} - \sum y_i \right) - n \bar{x} \left( \bar{y} + \frac{y_{\text{new}} - y_{\text{old}}}{n} \right).$$

Simplify $n \bar{x} \bar{y}'$:

$$n \bar{x} \bar{y}' = n \bar{x} \bar{y} + \bar{x} ( y_{\text{new}} - y_{\text{old}} ).$$

Substitute back into $S_{xy}'$, using $S_{xy} = \sum x_i y_i - n \bar{x} \bar{y}$:

$$S_{xy}' = S_{xy} + ( n - \bar{x} )\, y_{\text{new}} + \bar{x}\, y_{\text{old}} - \sum y_i.$$

Compute $n - \bar{x}$ using $\bar{x} = \frac{n + 1}{2}$:

$$n - \bar{x} = n - \frac{n + 1}{2} = \frac{n - 1}{2}.$$

Thus, the incremental update becomes:

$$S_{xy}' = S_{xy} + \frac{n - 1}{2}\, y_{\text{new}} + \frac{n + 1}{2}\, y_{\text{old}} - \sum y_i,$$

where $\sum y_i$ is the sum before the update. Every quantity on the right-hand side is already maintained, so the update runs in constant time.
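
The same kind of numerical check confirms the incremental $S_{xy}$ formula against a direct recomputation over the slid window (again with arbitrary example values):

```python
def s_xy_direct(ys):
    """S_xy computed from the definition, with x_i = i for the window."""
    n = len(ys)
    x_bar = (n + 1) / 2
    y_bar = sum(ys) / n
    return sum((i + 1 - x_bar) * (y - y_bar) for i, y in enumerate(ys))

ys = [7.0, 3.0, 5.0, 9.0]
n, y_new, y_old = len(ys), 4.0, ys[0]
sum_y = sum(ys)  # sum before the update

s_xy_inc = (s_xy_direct(ys)
            + (n - 1) / 2 * y_new + (n + 1) / 2 * y_old - sum_y)
assert abs(s_xy_inc - s_xy_direct(ys[1:] + [y_new])) < 1e-9
```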

Final Incremental Update Formulas

  • Update $S_{xy}$ first, using the pre-update $\sum y_i$:
$$S_{xy}' = S_{xy} + \frac{n - 1}{2}\, y_{\text{new}} + \frac{n + 1}{2}\, y_{\text{old}} - \sum y_i.$$
  • Then update $\sum y_i$:
$$\sum y_i' = \sum y_i + y_{\text{new}} - y_{\text{old}}.$$
  • Compute $\beta_1$:
$$\beta_1' = \frac{S_{xy}'}{S_{xx}}.$$
  • Compute $\beta_0$:
$$\beta_0' = \bar{y}' - \beta_1' \bar{x}.$$

These formulas allow us to update the regression parameters in constant time as new data arrives.

Implementation

Algorithm Overview

The implementation involves maintaining the following variables:

  • Circular Buffer: A fixed-size buffer storing the most recent $n$ values of $y_i$.
  • Sum of $y_i$: Maintained as $\sum y_i$.
  • Sum $S_{xy}$: Maintained using the incremental update formula.
  • Precomputed Constants: $\bar{x}$ and $S_{xx}$.

Pseudocode

Initialization:

  1. Set buffer size $n$.
  2. Precompute $\bar{x} = \frac{n + 1}{2}$.
  3. Precompute $S_{xx} = \frac{n(n + 1)(n - 1)}{12}$.
  4. Initialize $\sum y_i = 0$ and $S_{xy} = 0$.
  5. Initialize an empty circular buffer of size $n$.

Update Function (called when a new $y_{\text{new}}$ arrives):

  1. If the buffer is full:
  • Read $y_{\text{old}}$, the oldest value in the buffer.
  • Update $S_{xy} \leftarrow S_{xy} + \frac{n - 1}{2}\, y_{\text{new}} + \frac{n + 1}{2}\, y_{\text{old}} - \sum y_i$ (using the current, pre-update $\sum y_i$).
  • Update $\sum y_i \leftarrow \sum y_i + y_{\text{new}} - y_{\text{old}}$.
  • Insert $y_{\text{new}}$ into the buffer, overwriting $y_{\text{old}}$.
  2. Else:
  • Insert $y_{\text{new}}$ into the buffer.
  • Update $\sum y_i \leftarrow \sum y_i + y_{\text{new}}$.
  • Recalculate $S_{xy}$ from scratch (since the window has not yet reached size $n$).
  3. Compute $\beta_1 = \frac{S_{xy}}{S_{xx}}$.
  4. Compute $\bar{y} = \frac{\sum y_i}{n}$.
  5. Compute $\beta_0 = \bar{y} - \beta_1 \bar{x}$.

Notes:

  • When the buffer is not yet full, we cannot use the incremental update for $S_{xy}$ reliably. We compute $S_{xy}$ directly until the buffer is full.
  • Once the buffer is full, all updates can be performed in constant time.
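
Putting the pieces together, a complete Python sketch of the method follows. The class name and interface are illustrative choices; the update logic mirrors the pseudocode above, including the direct $S_{xy}$ recomputation during warm-up:

```python
from collections import deque

class RollingLinearRegression:
    """Sliding-window simple linear regression with O(1) updates once
    the window of size n is full. Assumes x_i = i for i = 1..n."""

    def __init__(self, n):
        self.n = n
        self.buf = deque(maxlen=n)              # circular buffer of y values
        self.x_bar = (n + 1) / 2                # precomputed mean of x
        self.s_xx = n * (n + 1) * (n - 1) / 12  # precomputed S_xx
        self.sum_y = 0.0
        self.s_xy = 0.0

    def update(self, y_new):
        n = self.n
        if len(self.buf) == n:
            y_old = self.buf[0]
            # O(1) update; uses sum_y from before this point arrived.
            self.s_xy += ((n - 1) / 2 * y_new
                          + (n + 1) / 2 * y_old - self.sum_y)
            self.sum_y += y_new - y_old
            self.buf.append(y_new)  # deque evicts y_old automatically
        else:
            # Warm-up: window not yet full, so recompute S_xy directly.
            self.buf.append(y_new)
            self.sum_y += y_new
            m = len(self.buf)
            x_bar, y_bar = (m + 1) / 2, self.sum_y / m
            self.s_xy = sum((i + 1 - x_bar) * (y - y_bar)
                            for i, y in enumerate(self.buf))

    def params(self):
        """Return (b0, b1); valid once the buffer is full."""
        b1 = self.s_xy / self.s_xx
        b0 = self.sum_y / self.n - b1 * self.x_bar
        return b0, b1

# Points on y = 2x + 1: once the window is full, the slope is 2 and the
# intercept is measured against the window's own indices 1..n.
reg = RollingLinearRegression(n=4)
for y in [3, 5, 7, 9, 11, 13]:
    reg.update(y)
print(reg.params())  # (5.0, 2.0): window [7, 9, 11, 13] at x = 1..4
```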

Results and Performance Analysis

Time Complexity

The optimized method ensures that each update operation (adding a new data point and evicting the oldest) runs in $O(1)$ time after the buffer is full. This is a significant improvement over traditional $O(n)$ methods, especially in systems where $n$ is large or updates are frequent.
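
One rough way to observe the difference is to time both strategies on the same stream. The micro-benchmark below is an illustrative sketch: it assumes the RollingLinearRegression class sketched in the Implementation section, and absolute timings will vary by machine.

```python
import time
from collections import deque

def naive_stream(ys, n):
    """Recompute the window fit from scratch on every update: O(n) per point."""
    buf = deque(maxlen=n)
    x_bar = (n + 1) / 2
    s_xx = n * (n + 1) * (n - 1) / 12
    for y in ys:
        buf.append(y)
        if len(buf) == n:
            y_bar = sum(buf) / n
            s_xy = sum((i + 1 - x_bar) * (v - y_bar) for i, v in enumerate(buf))
            b1 = s_xy / s_xx  # slope for the current window (discarded)

stream = [float(i % 97) for i in range(50_000)]

t0 = time.perf_counter()
naive_stream(stream, n=256)
t1 = time.perf_counter()

reg = RollingLinearRegression(n=256)  # class sketched above
for y in stream:
    reg.update(y)
t2 = time.perf_counter()

print(f"naive: {t1 - t0:.3f}s   incremental: {t2 - t1:.3f}s")
```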

Space Complexity

The space complexity of this optimized linear regression method is $O(n)$, where $n$ is the size of the circular buffer.

This is because:

  1. The circular buffer stores $n$ values of $y_i$, requiring $O(n)$ space.
  2. We maintain a constant number of additional variables ($\sum y_i$, $S_{xy}$, $\bar{x}$, $S_{xx}$), each requiring $O(1)$ space.

Importantly, the space requirement stays fixed at $O(n)$ regardless of how many data points are processed over time. This is in contrast to naive implementations that store all historical data, leading to unbounded growth in memory usage.

The fixed space requirement makes this method particularly suitable for embedded systems or any application with limited memory resources.

Applications

  • Real-Time Monitoring: Systems that require immediate insights from data streams, such as IoT sensors or network traffic analysis.
  • Financial Markets: High-frequency trading platforms where rapid data processing is critical.
  • Embedded Systems: Devices with limited computational resources benefit from the reduced overhead.
  • Process Control: Industrial systems that need to adjust parameters in real time based on sensor data.

Assumptions and Limitations

  • The method assumes that the independent variable $x_i$ is a sequence of equally spaced values (e.g., time steps), and specifically that $x_i = i$ for $i = 1, 2, \dots, n$.
  • The buffer size $n$ is fixed; changing $n$ would require recomputing the precomputed constants and the maintained sums.
  • The method applies to simple (univariate) linear regression with a single independent variable.

Conclusion

This paper presents a method to optimize linear regression computations using circular buffers, achieving constant time updates after the initial buffer is filled. By precomputing constants and incrementally updating sums, the approach reduces computational demands, making it suitable for real-time applications. We have provided detailed mathematical derivations, clarified the assumptions, and explained how to apply the method.

Future work may explore extending this method to multiple regression, handling non-equally spaced independent variables, or applying similar incremental techniques to other statistical models.


Note: This paper presents an algorithmic approach to optimizing linear regression computations using circular buffers and incremental updates. It is not a general-purpose linear regression algorithm but rather a highly optimized version of linear regression for a specific use case.