linear regression
commit ab48065a61
4 changed files with 261 additions and 0 deletions
.gitignore (vendored, Normal file, 10 lines)
@@ -0,0 +1,10 @@
**/*.aux
**/*.fdb_latexmk
**/*.fls
**/*.log
**/*.out
**/*.pdf
**/*.gz
**/*.toc
**/*.swp
**/*.dvi
linear_regression.py (Normal file, 60 lines)
@@ -0,0 +1,60 @@
import pandas as pd


def predict(x, w, b):
    # linear model: y = w * x + b
    return x * w + b


def loss(predicted, actual):
    # squared error for a single sample
    return (actual - predicted) ** 2


def mse(predicted, actual):
    # mean squared error over all samples
    total = 0.0
    for i, p in enumerate(predicted):
        total += loss(p, actual[i])
    return total / len(predicted)


def slope_weight(inputs, predicted, actual):
    # negative partial derivative of the MSE with respect to the weight:
    # (2 / N) * sum((y_i - y) * x_i)
    total = 0.0
    for i, p in enumerate(predicted):
        total += (actual[i] - p) * inputs[i]
    return total / len(predicted) * 2


def slope_bias(predicted, actual):
    # negative partial derivative of the MSE with respect to the bias:
    # (2 / N) * sum(y_i - y)
    total = 0.0
    for i, p in enumerate(predicted):
        total += actual[i] - p
    return total / len(predicted) * 2


data = pd.read_csv("california_housing_train.csv")

w = 0.0
b = 0.0
prev = 0.0
delta = 10.0

# learning rate
beta = 0.01

median_income = data['median_income']
median_house_value = data['median_house_value']

# standardize both columns (zero mean, unit variance) so that
# a single learning rate works well for both parameters
mu_input = median_income.mean()
sigma_input = median_income.std()

mu_actual = median_house_value.mean()
sigma_actual = median_house_value.std()

median_income = (median_income - mu_input) / sigma_input
median_house_value = (median_house_value - mu_actual) / sigma_actual

# gradient descent: stop once the loss barely changes between iterations
while delta > 1e-6:
    inputs = median_income.tolist()
    actual = median_house_value.tolist()
    predicted = [predict(x, w, b) for x in inputs]

    current = mse(predicted, actual)
    delta = abs(current - prev)

    # step against the gradient, scaled by the learning rate
    # (slope_weight and slope_bias already return the negative derivatives)
    w += slope_weight(inputs, predicted, actual) * beta
    b += slope_bias(predicted, actual) * beta
    prev = current

# `w` and `b` now hold the trained weight and bias
linear_regression.tex (Normal file, 140 lines)
@@ -0,0 +1,140 @@
Linear regression is a technique used to find the relationship between variables. If we have two variables and their relationship is linear, a function of the form $y=mx+q$ can be found by \textit{training} a model. Let's say we have the values shown in Table \ref{table:linear_regr_1}.

\begin{table}[h!]
\centering
\begin{tabular}{| c | c |}
\hline
Pounds in 1000s & Miles per gallon\\
\hline
3.5 & 18\\
\hline
3.69 & 15\\
\hline
3.44 & 18\\
\hline
3.43 & 16\\
\hline
3.34 & 15\\
\hline
3.42 & 14\\
\hline
2.37 & 24\\
\hline
\end{tabular}
\caption{Relationship between miles per gallon and weight of a car.}
\label{table:linear_regr_1}
\end{table}

If we plot this data using the pounds as $x$ and the miles per gallon as $y$, we get something like Figure \ref{fig:linear_regr_1}.

\begin{figure}[h]
\centering
\begin{tikzpicture}
\begin{axis}[%
x=5cm,
y=0.25cm,
scatter/classes={%
a={mark=o,draw=black}}]

% 1. The Scatter Plot
\addplot[scatter,only marks,%
scatter src=explicit symbolic]%
table[meta=label] {
x y label
3.5 18 a
3.69 15 a
3.44 18 a
3.43 16 a
3.34 15 a
3.42 14 a
2.37 24 a
};

\end{axis}
\end{tikzpicture}
\caption{Scatter plot with previously defined data}
\label{fig:linear_regr_1}
\end{figure}

\begin{figure}[h]
\centering
\begin{tikzpicture}
\begin{axis}[%
x=5cm,
y=0.25cm,
scatter/classes={%
a={mark=o,draw=black}}]

% 1. The Scatter Plot
\addplot[scatter,only marks,%
scatter src=explicit symbolic]%
table[meta=label] {
x y label
3.5 18 a
3.69 15 a
3.44 18 a
3.43 16 a
3.34 15 a
3.42 14 a
2.37 24 a
};

\addplot [red, thick, domain=2:4] {-6.80*x + 39.66};

\end{axis}
\end{tikzpicture}
\caption{Scatter plot with linear regression line}
\label{fig:scatter_regression}
\end{figure}

Our goal is to find the line equation that, given the pounds, returns the \textit{most probable} miles per gallon. In this case, the line would look something like Figure \ref{fig:scatter_regression}.
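As a quick sanity check, reading the parameters off the fitted line in Figure \ref{fig:scatter_regression} ($w \approx -6.80$, $b \approx 39.66$), a car weighing $3000$ pounds ($x = 3$) would be predicted to do roughly $-6.80 \cdot 3 + 39.66 \approx 19$ miles per gallon.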
\subsection{Training}
The process used to find the $m$ and $q$ parameters is called training. In machine learning $m$ is usually referred to as $w$ (weight) and $q$ as $b$ (bias), so our linear function becomes $y = wx+b$. To find these values we go through an iterative process that checks how ``wrong'' our results are and updates the weight and bias accordingly, until changing them further no longer changes the result in a significant way. When this happens, we say that the model has \textit{converged}.
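As a rough sketch of this loop (a toy example with made-up data, using the gradient expressions derived in the following subsections; the actual implementation used in these notes is listed at the end of this section):
\begin{lstlisting}[language=Python]
# toy data generated from y = 2x + 1, so we know what w and b should converge to
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, beta = 0.0, 0.0, 0.1
prev, delta = 0.0, 1.0

while delta > 1e-9:  # stop once the loss barely changes: the model has converged
    preds = [w * x + b for x in xs]
    mse = sum((y - p) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # nudge w and b in the direction that lowers the loss
    w += 2 / len(xs) * sum((y - p) * x for p, y, x in zip(preds, ys, xs)) * beta
    b += 2 / len(xs) * sum(y - p for p, y in zip(preds, ys)) * beta
    delta, prev = abs(mse - prev), mse

print(w, b)  # approaches 2 and 1
\end{lstlisting}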
\subsection{Loss}
We said that we have to find how ``wrong'' our predictions are; in machine learning this ``wrongness'' is called the \textit{loss}. There are different ways to calculate the loss. In linear regression we can fall back on two common ones: the mean absolute error and the mean squared error.
\begin{equation}
\text{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - y \rvert
\end{equation}
\begin{equation}
\text{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - y)^2
\end{equation}
where $N$ is the number of samples, $y_i$ is the actual value in our dataset and $y$ is the value we predicted.

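As a small illustration (a standalone sketch with made-up values, separate from the implementation at the end of this section), both losses can be computed directly from a list of predictions and a list of actual values:
\begin{lstlisting}[language=Python]
def mae(predicted, actual):
    # mean absolute error: average of |y_i - y|
    return sum(abs(a - p) for p, a in zip(predicted, actual)) / len(predicted)

def mse(predicted, actual):
    # mean squared error: average of (y_i - y)^2
    return sum((a - p) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

print(mae([16.0, 20.0], [18.0, 24.0]))  # 3.0
print(mse([16.0, 20.0], [18.0, 24.0]))  # 10.0
\end{lstlisting}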
\subsection{Gradient descent}
After computing the loss, we have to find the direction in which our parameters should be updated. To do this, we calculate the slope of the loss by taking its partial derivative with respect to both the weight and the bias.\\
\begin{center}
First, we put $u=(y_i-y)^2$, so that $\text{MSE} = \frac{1}{N}\sum_{i=1}^{N} u$.
\end{center}
\begin{equation}
\frac{\partial \text{MSE}}{\partial b} = \frac{\partial \text{MSE}}{\partial u} \frac{\partial u}{\partial y} \frac{\partial y}{\partial b}
\end{equation}
\begin{center}
for the bias and
\end{center}
\begin{equation}
\frac{\partial \text{MSE}}{\partial w} = \frac{\partial \text{MSE}}{\partial u} \frac{\partial u}{\partial y} \frac{\partial y}{\partial w}
\end{equation}
\begin{center}
for the weight. Since $\frac{\partial u}{\partial y} = -2(y_i - y)$, $\frac{\partial y}{\partial b} = 1$ and $\frac{\partial y}{\partial w} = x_i$, these evaluate to
\end{center}
\begin{equation}
\frac{\partial \text{MSE}}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}(y_i - y)
\end{equation}
\begin{center}
and
\end{center}
\begin{equation}
\frac{\partial \text{MSE}}{\partial w} = -\frac{2}{N}\sum_{i=1}^{N}(y_i - y) \, x_i
\end{equation}
Now that we know the slope of the loss, we can update our parameters by multiplying the slope by a small number called the \textit{learning rate} and subtracting the result from the weight and the bias.
\begin{equation}
w = w - \frac{\partial \text{MSE}}{\partial w} \cdot \beta
\end{equation}

\begin{equation}
b = b - \frac{\partial \text{MSE}}{\partial b} \cdot \beta
\end{equation}
We then repeat this process until the model \textit{converges}.
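Tying the two update equations to code (a minimal standalone sketch with made-up values; note that the helper functions in the listing below return the \textit{negative} of these derivatives, which is why they are added rather than subtracted there):
\begin{lstlisting}[language=Python]
# one gradient-descent step on a tiny made-up dataset
x      = [1.0, 2.0, 3.0]
actual = [2.0, 4.0, 6.0]
w, b, beta = 0.0, 0.0, 0.01

predicted = [w * xi + b for xi in x]
N = len(predicted)

# partial derivatives of the MSE with respect to w and b
grad_w = -2 / N * sum((a - p) * xi for p, a, xi in zip(predicted, actual, x))
grad_b = -2 / N * sum(a - p for p, a in zip(predicted, actual))

# step against the gradient, scaled by the learning rate
w -= grad_w * beta
b -= grad_b * beta
\end{lstlisting}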
\subsection{Python implementation}
\lstinputlisting[language=Python]{linear_regression.py}
main.tex (Normal file, 51 lines)
@@ -0,0 +1,51 @@
\documentclass{article}

\usepackage{listings}
\usepackage{xcolor}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage[margin=0.5in]{geometry}
\usepackage{pgfplots}
\usepackage{amsmath}

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}

\lstdefinestyle{python}{
backgroundcolor=\color{backcolour},
commentstyle=\color{codegreen},
keywordstyle=\color{magenta},
numberstyle=\tiny\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\footnotesize,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbers=left,
numbersep=5pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}

\lstset{style=python}

\title{Machine Learning notes}
\author{Lorenzo Torres}
\date{2025}

\begin{document}
\maketitle
\tableofcontents
\section{Linear regression}
\input{linear_regression}
\section{Logistic regression}
\section{Classification}
\section{Neural Networks}
\section{Vector sets}
\section{Large language models}
\end{document}