
IML Lab1 report

Author: Ruixuan Huang Student ID: PB20111686

Aims

  1. Implement a logistic regression framework and train it on the given dataset
  2. Use the given dataset to test the trained model
  3. Illustrate the loss curve of one training process
  4. Give a comparison table for different parameters
  5. Give the best accuracy on the test data

Methods

Step 1. Data processing

The original dataset (loan.csv) is a typical binary classification dataset. Before using it to train models, I noted that some cases have missing attributes. My choice here is to discard these cases with df.dropna().

Then we should convert the non-numerical descriptions of some attributes into numerical ones. I noted that the values of each to-be-converted attribute have an ordinal relationship, so I simply use df[{attr_name}].map() for the conversion.

In addition, I normalized the variables (except Loan_Status). This is because the numerical precision of Python also has to be considered: some values in the original data are so large that the sigmoid function outputs exactly 1.0 in every iteration, so $\boldsymbol\beta$ cannot update.
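A minimal sketch of this preprocessing, assuming pandas. The column names and ordinal mappings below are illustrative examples only, and min-max scaling is one possible normalization; the report does not specify the exact scheme:

```python
import pandas as pd

df = pd.read_csv("loan.csv")

# Discard cases with missing attributes
df = df.dropna()

# Convert ordinal non-numerical attributes with map(); the column
# names and value orders here are illustrative, not the full list
df["Education"] = df["Education"].map({"Not Graduate": 0, "Graduate": 1})
df["Loan_Status"] = df["Loan_Status"].map({"N": 0, "Y": 1})

# Min-max normalize every feature column except the label Loan_Status,
# so that large raw values cannot saturate the sigmoid at 1.0
feature_cols = [c for c in df.columns if c != "Loan_Status"]
df[feature_cols] = (df[feature_cols] - df[feature_cols].min()) / (
    df[feature_cols].max() - df[feature_cols].min()
)
```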

I decided to use the hold-out method to train the model, so my strategy for splitting the dataset can be described as follows:

  1. Use df.sample(frac=1.0) to shuffle the dataset
  2. Use df.sort_values("Loan_Status", inplace=True) to place the positive and negative cases in two consecutive blocks, which facilitates the data splitting
  3. In the positive block, take its top 40% of cases and add them to X_train and y_train; for negative cases the rate is 80%. This is because the ratio of positive to negative cases in the original dataset is about 2:1 (see the sketch after this list)
  4. After executing the above three procedures, I obtain X_train, X_test, y_train, y_test
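A sketch of this splitting strategy, continuing from the preprocessed df above; the 80%/40% fractions follow step 3:

```python
import pandas as pd

# 1. shuffle, 2. sort so negatives and positives form consecutive blocks
df = df.sample(frac=1.0)
df.sort_values("Loan_Status", inplace=True)

n_neg = int((df["Loan_Status"] == 0).sum())  # negatives occupy the first block

# 3. top 80% of negatives and top 40% of positives go to the training set
n_pos = len(df) - n_neg
train = pd.concat([df.iloc[:int(0.8 * n_neg)],
                   df.iloc[n_neg:n_neg + int(0.4 * n_pos)]])
test = df.drop(train.index)

# 4. split off the label column
X_train = train.drop(columns="Loan_Status").values
y_train = train["Loan_Status"].values
X_test = test.drop(columns="Loan_Status").values
y_test = test["Loan_Status"].values
```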

Step 2. Implement the LR class

According to Eq. (3.27) in the textbook, our objective is to minimize

$$
\ell(\boldsymbol{\beta})=\sum_{i=1}^m\left(-y_i\boldsymbol{\beta}^\top\hat{\boldsymbol{x}}_i+\ln\left(1+e^{\boldsymbol{\beta}^\top\hat{\boldsymbol{x}}_i}\right)\right)
$$

I assume fit_intercept=True, so here $\boldsymbol{\beta}=(\boldsymbol{w};b)$ and $\hat{\boldsymbol{x}}=(\boldsymbol{x};1)$.

My code also handles the case fit_intercept=False, but in this report I will only discuss the case fit_intercept=True.

I use $\displaystyle\sigma(z)=\frac{1}{1+e^{-z}}$ as the sigmoid function, so the result of one prediction is $\sigma(\boldsymbol{\beta}^\top\hat{\boldsymbol{x}})$.

Now we implement the training (fit) method. The cost function is

$$
\mathrm{cost}(\boldsymbol\beta)=\frac1m\left[\ell(\boldsymbol\beta)+\gamma\|\boldsymbol\beta\|_1\,\mathbb{I}(\text{L1})+\frac\gamma2\|\boldsymbol\beta\|_2^2\,\mathbb{I}(\text{L2})\right]
$$

where $\mathbb{I}(\text{L1})$ and $\mathbb{I}(\text{L2})$ indicate which penalty (L1 or L2) is selected.

I use gradient descent to minimize the cost function and find the optimal $\boldsymbol\beta$. According to Eq. (3.30), the gradient is

$$
\nabla_{\boldsymbol{\beta}}\,\mathrm{cost}(\boldsymbol\beta)=\frac1m\left\{-\sum_{i=1}^m\left[y_i-\sigma(\boldsymbol{\beta}^\top\hat{\boldsymbol{x}}_i)\right]\hat{\boldsymbol{x}}_i+\gamma\,\mathrm{sgn}(\boldsymbol\beta)\,\mathbb{I}(\text{L1})+\gamma\boldsymbol\beta\,\mathbb{I}(\text{L2})\right\}
$$

where $\mathrm{sgn}$ is applied element-wise.

In each iteration, if the absolute value of the change in the cost function is less than the parameter tol, the loop breaks, which means the training is done. Otherwise $\boldsymbol\beta$ is updated according to

$$
\boldsymbol\beta\leftarrow\boldsymbol\beta-\alpha\nabla_{\boldsymbol{\beta}}\,\mathrm{cost}(\boldsymbol\beta)
$$

Here $\alpha$ is the parameter lr. The training has a maximum number of iterations, set by the parameter max_iter. One more thing: in order to draw a loss curve, the fit method returns an array of cost-function values recorded during training.
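A sketch of the class under the definitions above. The parameter names (lr, tol, max_iter, penalty, gamma, fit_intercept) mirror those described in this report; the clipping inside the sigmoid and the use of np.logaddexp are extra numerical safeguards I added, in the spirit of the precision issue noted in Step 1:

```python
import numpy as np

class LR:
    def __init__(self, lr=0.1, tol=1e-5, max_iter=1000,
                 penalty="L1", gamma=0.01, fit_intercept=True):
        self.lr, self.tol, self.max_iter = lr, tol, max_iter
        self.penalty, self.gamma = penalty, gamma
        self.fit_intercept = fit_intercept

    @staticmethod
    def _sigmoid(z):
        # clip |z| so that exp() cannot overflow float64
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def _augment(self, X):
        # beta = (w; b) and x_hat = (x; 1) when fit_intercept=True
        if self.fit_intercept:
            return np.hstack([X, np.ones((X.shape[0], 1))])
        return X

    def _cost(self, X_hat, y):
        z = X_hat @ self.beta
        # Eq. (3.27); logaddexp(0, z) = ln(1 + e^z), computed stably
        l = np.sum(-y * z + np.logaddexp(0.0, z))
        if self.penalty == "L1":
            l += self.gamma * np.sum(np.abs(self.beta))
        elif self.penalty == "L2":
            l += self.gamma / 2.0 * np.sum(self.beta ** 2)
        return l / len(y)

    def fit(self, X, y):
        X_hat = self._augment(np.asarray(X, dtype=float))
        y = np.asarray(y, dtype=float)
        self.beta = np.zeros(X_hat.shape[1])
        costs = [self._cost(X_hat, y)]
        for _ in range(self.max_iter):
            # gradient from Eq. (3.30) plus the penalty term
            grad = -X_hat.T @ (y - self._sigmoid(X_hat @ self.beta))
            if self.penalty == "L1":
                grad = grad + self.gamma * np.sign(self.beta)
            elif self.penalty == "L2":
                grad = grad + self.gamma * self.beta
            self.beta -= self.lr * grad / len(y)
            costs.append(self._cost(X_hat, y))
            # stop when the cost barely changes
            if abs(costs[-1] - costs[-2]) < self.tol:
                break
        return costs  # loss-curve values recorded during training
```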

The last step is the implementation of the predict method. It uses the trained $\boldsymbol\beta$ to calculate a result for each row of X_test according to

$$
p(\hat{\boldsymbol x}_i)=\sigma(\boldsymbol{\beta}^\top\hat{\boldsymbol{x}}_i)
$$

The predict method then returns an array of the results for all test rows.
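Continuing the class sketch above, the method stays a thin wrapper around the trained parameters:

```python
    # inside class LR, continuing the sketch above
    def predict(self, X):
        # probability sigma(beta^T x_hat) for every row of X
        X_hat = self._augment(np.asarray(X, dtype=float))
        return self._sigmoid(X_hat @ self.beta)
```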

Step 3. Train and test in the framework

Now I have the split dataset (X_train, X_test, y_train, y_test) and the LR class. I simply call the fit method on X_train and y_train, then use matplotlib to plot the loss curve from the return value of fit.

In the test process I simply call the predict method on X_test, take its return value, and compare each element with the corresponding label in y_test. In this process I use a confusion matrix to display the result of the trained model on the test dataset.
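A minimal sketch of this train-and-test loop, reusing the LR class and the split data from the sketches above. The 0.5 decision threshold is my assumption, since the report leaves the probability-to-label conversion implicit:

```python
import matplotlib.pyplot as plt
import numpy as np

model = LR(lr=0.1, tol=1e-5, max_iter=1000, penalty="L1", gamma=0.01)
costs = model.fit(X_train, y_train)

# loss curve: cost value against iteration number
plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("cost")
plt.show()

# threshold the probabilities at 0.5, then count the confusion matrix
pred = (model.predict(X_test) >= 0.5).astype(int)
TP = int(np.sum((pred == 1) & (y_test == 1)))
TN = int(np.sum((pred == 0) & (y_test == 0)))
FP = int(np.sum((pred == 1) & (y_test == 0)))
FN = int(np.sum((pred == 0) & (y_test == 1)))
print(f"accuracy = {(TP + TN) / (TP + TN + FP + FN):.2%}")
```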

Step 4. Using different parameters

To build the comparison table, I repeat the steps above, changing one parameter each time and using accuracy $\left(\displaystyle\frac{TP+TN}{TP+TN+FP+FN}\right)$ as the metric. For lr, there is an extra metric: the trend of the loss curve.

Results

Loss curve

The loss curve of one training process is as follows. The y-axis represents the value of the cost function; the x-axis represents the number of iterations.

[Figure: loss curve of one training run]

Comparison table

The baseline parameters I picked are:

  • Positive-to-negative ratio in the training set (PNR): 1:1
  • Ratio of the training set size to the whole dataset (TAR): 50%
  • Learning rate (lr): 0.1
  • Penalty and gamma: $\gamma=0.01$, L1

Define some variables as follows:

| mark | meaning |
| --- | --- |
| $c_0$ | the number of negative cases in the dataset |
| $c_1$ | the number of positive cases in the dataset |
| $t_0$ | the number of negative cases in X_train |
| $t_1$ | the number of positive cases in X_train |

The four variables above satisfy the following formulas:

$$
TAR=\frac{t_0+t_1}{c_0+c_1}\qquad PNR=\frac{t_1}{t_0}
$$

Given a certain TAR and PNR, we get

$$
t_0=\frac{TAR\times(c_0+c_1)}{1+PNR}\qquad t_1=PNR\times t_0
$$

These results make it easier to vary TAR and PNR when observing their effect on training, freeing us from calculating the size of the training set by hand. One thing to notice is that the split must satisfy

$$
t_1\le c_1\quad\textbf{and}\quad t_0\le c_0
$$
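A sketch of this size computation, with PNR passed as the numeric ratio $t_1/t_0$:

```python
c0 = int((df["Loan_Status"] == 0).sum())  # negative cases in the dataset
c1 = int((df["Loan_Status"] == 1).sum())  # positive cases in the dataset

def split_sizes(TAR, PNR):
    # t0 and t1 as derived above; e.g. PNR = 1.9 means 1.9:1
    t0 = int(TAR * (c0 + c1) / (1 + PNR))
    t1 = int(PNR * t0)
    # the requested split must fit inside the available cases
    assert t1 <= c1 and t0 <= c0
    return t0, t1

t0, t1 = split_sizes(TAR=0.5, PNR=1.0)  # the baseline setting
```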

Then, each time, I change one parameter on the basis of these baseline values.


PNR

[Figure: accuracy under different PNR values]

| PNR | 1:1 | 1.3:1 | 1.6:1 | 1.9:1 | 2.2:1 | 2.5:1 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 61.79% | 77.79% | 80.04% | 81.69% | 79.83% | 79.00% |
  • Other parameters: $TAR=0.5$, $lr=0.1$, $tol=10^{-5}$, $\gamma=0.01$, L1
  • Each PNR case was repeated 10 times; the result is the average.

TAR

[Figure: accuracy under different TAR values]

| TAR | 20% | 25% | 30% | 35% | 40% | 45% | 50% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 63.76% | 64.74% | 66.21% | 63.99% | 67.63% | 64.82% | 63.93% |
  • Other parameters: $PNR=1{:}1$, $lr=0.1$, $tol=10^{-5}$, $\gamma=0.01$, L1
  • Each TAR case was repeated 10 times; the result is the average.

lr

[Figure: accuracy under different lr values]

| lr | 0.01 | 0.05 | 0.1 | 0.2 | 0.4 | 0.8 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 62.51% | 66.48% | 63.05% | 65.06% | 65.02% | 60.12% |
  • Other parameters: $PNR=1{:}1$, $TAR=50\%$, $tol=10^{-5}$, $\gamma=0.01$, L1
  • Each lr case was repeated 10 times; the result is the average.

[Figure: loss curves for each lr case]

From these loss curves we can intuitively conclude that the larger the value of lr, the faster the loss function decreases.


penalty and gamma

penalty = "L1"

[Figure: accuracy under different gamma values, penalty = "L1"]

penalty = "L2"

[Figure: accuracy under different gamma values, penalty = "L2"]

| gamma | 1 | 1e-1 | 1e-2 | 1e-3 | 1e-4 | 1e-5 | 1e-6 | 0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L1 | 67.78% | 64.71% | 61.50% | 61.64% | 61.36% | 61.36% | 63.31% | 64.43% |
| L2 | 63.73% | 64.30% | 64.99% | 67.08% | 65.13% | 65.13% | 64.29% | |
  • Other parameters: $PNR=1{:}1$, $TAR=50\%$, $tol=10^{-5}$, $lr=0.1$
  • Each $\gamma$ case was repeated 3 times; the result is the average.

Best accuracy

From the results above, the best accuracy is 81.94%, obtained by combining the best-performing value of each parameter from the experiments above:

$$
PNR=1.9{:}1\quad TAR=0.4\quad lr=0.05\quad tol=10^{-5}\quad penalty=\text{L2}\quad \gamma=10^{-3}
$$

Its confusion matrix is

[Figure: confusion matrix of the best model on the test set]
