Wednesday, December 03, 2008

Groovy Best Fit Line

An engineer on my team at work asked if I could help with coding an algorithm for calculating a line that best fits a given set of data points. I thought, "what a great excuse to practice some Groovy", and got started right away.

A few Google searches led me to this 10-yr-old page on Regression Functions by Stefan Waner from Hofstra University. Stefan outlined the exact algorithm I was looking for.

Here is my implementation in Groovy:
// Prints the start/end points of a line that best fits 4 sample data points
pts = [[1, 1.5], [2, 1.6], [3, 2.1], [4, 3.0]]
println bestFit(pts)

/**
* Given a set of points, uses a linear regression algorithm to find the start and end points
* of a line that best fits the set of points.
* Returns the two points as [[x1, y1], [x2, y2]].
* The algorithm is from
* http://people.hofstra.edu/stefan_waner/realworld/calctopic1/regression.html.
*/
def bestFit(pts) {
// Find sums of x, y, xy, x^2
n = pts.size()
xSum = pts.collect() {p -> p[0]}.sum()
ySum = pts.collect() {p -> p[1]}.sum()
xySum = pts.collect() {p -> p[0]*p[1]}.sum()
xSqSum = pts.collect() {p -> p[0]*p[0]}.sum()

// Find m and b such that y = mx + b
// m is the slope of the line and b is the y-intercept
m = (n*xySum - xSum*ySum) / (n*xSqSum - xSum*xSum)
b = (ySum - m*xSum) / n

// Find start and end points based on the left-most and right-most points
x1 = pts.collect() {p -> p[0]}.min()
y1 = m*x1 + b
x2 = pts.collect() {p -> p[0]}.max()
y2 = m*x2 + b

[[x1, y1], [x2, y2]]
}

Running this script prints the following:
[[1, 1.3], [4, 2.8]]


You gotta love Groovy!