
Linear Regression from Scratch: Unveiling the Secrets of Gradient Descent in Code

At its heart, Linear Regression is a mathematical model that answers one simple question: “How does one variable depend on another?” Imagine you have data points on a scatterplot, and you suspect that one variable influences the other. Linear Regression searches for the line that best fits these points, like a snug-fitting glove for your data.
This magical line, known as the “regression line,” describes the relationship between your variables in the simplest way possible — straight and linear. It’s as if you’re saying, “Hey data, I think you behave in a straight-line fashion.”
Hypothesis Function:
At the heart of Linear Regression lies the Hypothesis Function, the blueprint for our detective’s investigation. It’s a straightforward mathematical expression that captures the relationship between the independent variable (usually denoted as “x”) and the dependent variable (usually denoted as “y”). In its most basic form, the linear hypothesis function for a simple linear regression is:  

h(x) = θ₀ + θ₁x

Here, “h(x)” represents the predicted value of “y”, while “θ₀” and “θ₁” are the parameters that the detective wants to find. These parameters determine the y-intercept and the slope of the regression line, allowing us to make predictions. You may also know this equation in another form from high school:

y = mx + c 


In this form: “y” is the dependent variable. “x” is the independent variable. “m” is the slope of the line, indicating how steep or shallow the line is. “c” is the y-intercept, representing the point where the line intersects the y-axis. 
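
To make this concrete in code, here is a minimal sketch of the hypothesis function in Python, assuming NumPy; the names predict, theta0 and theta1 are illustrative choices, not part of any standard API:

import numpy as np

def predict(x, theta0, theta1):
    # Hypothesis function: h(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

# Example: a line with y-intercept 1.0 and slope 2.0
x = np.array([0.0, 1.0, 2.0, 3.0])
print(predict(x, theta0=1.0, theta1=2.0))  # [1. 3. 5. 7.]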
Cost Function:
Now, the detective needs a way to evaluate how well the line fits the data. The Cost Function measures the discrepancy between the predictions made by the Hypothesis Function and the actual data points. In the context of linear regression, the Mean Squared Error (MSE) is a common choice for the Cost Function:

J(θ₀, θ₁) = (1/n) * ∑(ŷᵢ − yᵢ)²
Here, “n” is the number of data points, the summation symbol (∑) denotes the sum across all data points, “ŷᵢ” represents the prediction for the i-th data point, and “yᵢ” is the actual value. The Cost Function quantifies the “cost” of our predictions, and the goal is to minimize this cost by adjusting the parameters θ₀ and θ₁. If we replace ŷᵢ with the prediction equation:
J(θ₀, θ₁) = (1/n) * ∑(θ₀ + θ₁xᵢ − yᵢ)²
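
Here is a sketch of how this cost could be computed in Python, again assuming NumPy; compute_cost is an illustrative name:

import numpy as np

def compute_cost(x, y, theta0, theta1):
    # Mean Squared Error: J = (1/n) * sum((h(x_i) - y_i)^2)
    n = len(y)
    predictions = theta0 + theta1 * x        # h(x_i) for every data point
    squared_errors = (predictions - y) ** 2
    return squared_errors.sum() / n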
Gradient Descent:
The hero of our story comes to the rescue in the quest for parameter optimization. Gradient Descent is a method used to find the values of θ₀ and θ₁ that minimize the Cost Function. Think of it as our detective’s journey downhill, following the steepest slope to reach the lowest point, which corresponds to the optimal parameters. Gradient Descent works through an iterative process: at each step, it calculates the gradient of the Cost Function with respect to θ₀ and θ₁ and updates these parameters to reduce the cost. The process continues until the cost reaches a minimum, signaling that we’ve found the best-fitting line.
Derivative of the Cost Function (J) with respect to θ₀:
The Cost Function measures how well the linear regression model fits the data. To optimize the model, we need to find its derivatives with respect to the parameters θ₀ and θ₁. Let’s start with θ₀. The derivative of J with respect to θ₀ is denoted as ∂J/∂θ₀:

∂J/∂θ₀ = (2/n) * ∑(h(xᵢ) − yᵢ)

This derivative tells us how the cost changes as we adjust the y-intercept of the regression line.
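
In code, this partial derivative might look like the following sketch (x and y are assumed to be NumPy arrays, as in the earlier snippets; the function name is illustrative):

def gradient_theta0(x, y, theta0, theta1):
    # dJ/dtheta0 = (2/n) * sum(h(x_i) - y_i) for the MSE cost above
    n = len(y)
    errors = (theta0 + theta1 * x) - y
    return 2.0 * errors.sum() / n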
Derivative of the Cost Function (J) with respect to θ₁:
Next, let’s find the derivative of the Cost Function with respect to θ₁:

∂J/∂θ₁ = (2/n) * ∑[(h(xᵢ) − yᵢ) * xᵢ]

This derivative tells us how the cost changes as we adjust the slope of the regression line.
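
And a corresponding sketch for this second partial derivative, under the same assumptions:

def gradient_theta1(x, y, theta0, theta1):
    # dJ/dtheta1 = (2/n) * sum((h(x_i) - y_i) * x_i) for the MSE cost above
    n = len(y)
    errors = (theta0 + theta1 * x) - y
    return 2.0 * (errors * x).sum() / n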
Updating Parameters using Gradient Descent:
Now, with the derivatives in hand, we can update the parameters θ₀ and θ₁ using the Gradient Descent algorithm. The update rule for each parameter is as follows:

θ₀ := θ₀ − α * ∂J/∂θ₀
θ₁ := θ₁ − α * ∂J/∂θ₁

Here, “α” is the learning rate, which controls the step size in each iteration. By iteratively applying these update rules, Gradient Descent guides θ₀ and θ₁ toward the values that minimize the Cost Function. This process continues until the cost converges to a minimum and the regression line fits the data as closely as possible.
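
Putting the pieces together, here is a minimal sketch of the full training loop in Python, assuming NumPy and the (2/n) gradients derived above; the function name, the toy data, and the hyperparameter values are illustrative choices:

import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=5000):
    # Fit theta0 and theta1 by repeatedly stepping opposite to the MSE gradient.
    theta0, theta1 = 0.0, 0.0
    n = len(y)
    for _ in range(num_iters):
        errors = (theta0 + theta1 * x) - y      # h(x_i) - y_i
        grad0 = 2.0 * errors.sum() / n          # dJ/dtheta0
        grad1 = 2.0 * (errors * x).sum() / n    # dJ/dtheta1
        theta0 -= alpha * grad0                 # simultaneous update, step size alpha
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data that roughly follows y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)  # should end up close to 1 and 2

If α is too large, the updates can overshoot the minimum and diverge; if it is too small, convergence becomes very slow. A small experiment like the one above is a reasonable way to sanity-check the learning rate before trusting the fitted line.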