This post is a learning-log entry for the 365天深度学习训练营 (365-day deep learning training camp); original author: K同学啊.
Preface

The LSTM is a classic model, most often used for sequence-data prediction, because it is good at mining contextual information in the data. This post applies an LSTM to diabetes prediction (a binary classification problem), solving it with an LSTM + Linear architecture. I previously approached diabetes prediction with a random forest in 机器学习/数据分析案例—糖尿病预测; later I plan a fuller project that combines classical machine learning (random forest, SVM, and so on) with a deep-learning LSTM, so feel free to follow along. For an LSTM walkthrough, see 深度学习基础–LSTM学习笔记 (notes on Li Mu's《动手学深度学习》). Bookmarks and follows are welcome; I will keep updating.

Table of contents

1. Data import and preprocessing
    1. Data import
    2. Data statistics
    3. Distribution analysis
    4. Correlation analysis
2. Standardization and splitting
3. Building the model
4. Training functions
    1. The training function
    2. The test function
    3. Setting the hyperparameters
5. Model training
6. Results
7. Prediction

1. Data import and preprocessing
1. Data import
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# font setup
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']   # display Chinese characters
plt.rcParams['axes.unicode_minus'] = False     # display minus signs

# the dataset is small, so the CPU is enough
device = 'cpu'

data_df = pd.read_excel('./dia.xls')
data_df.head()
```

The columns are: 卡号 (card ID), 性别 (sex), 年龄 (age), 高密度脂蛋白胆固醇 (HDL cholesterol), 低密度脂蛋白胆固醇 (LDL cholesterol), 极低密度脂蛋白胆固醇 (VLDL cholesterol), 甘油三酯 (triglycerides), 总胆固醇 (total cholesterol), 脉搏 (pulse), 舒张压 (diastolic pressure), 高血压史 (hypertension history), 尿素氮 (blood urea nitrogen), 尿酸 (uric acid), 肌酐 (creatinine), 体重检查结果 (weight-check result), and 是否糖尿病 (diabetes, the label).

|   | 卡号 | 性别 | 年龄 | 高密度脂蛋白胆固醇 | 低密度脂蛋白胆固醇 | 极低密度脂蛋白胆固醇 | 甘油三酯 | 总胆固醇 | 脉搏 | 舒张压 | 高血压史 | 尿素氮 | 尿酸 | 肌酐 | 体重检查结果 | 是否糖尿病 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18054421 | 0 | 38 | 1.25 | 2.99 | 1.07 | 0.64 | 5.31 | 83 | 83 | 0 | 4.99 | 243.3 | 50 | 1 | 0 |
| 1 | 18054422 | 0 | 31 | 1.15 | 1.99 | 0.84 | 0.50 | 3.98 | 85 | 63 | 0 | 4.72 | 391.0 | 47 | 1 | 0 |
| 2 | 18054423 | 0 | 27 | 1.29 | 2.21 | 0.69 | 0.60 | 4.19 | 73 | 61 | 0 | 5.87 | 325.7 | 51 | 1 | 0 |
| 3 | 18054424 | 0 | 33 | 0.93 | 2.01 | 0.66 | 0.84 | 3.60 | 83 | 60 | 0 | 2.40 | 203.2 | 40 | 2 | 0 |
| 4 | 18054425 | 0 | 36 | 1.17 | 2.83 | 0.83 | 0.73 | 4.83 | 85 | 67 | 0 | 4.09 | 236.8 | 43 | 0 | 0 |
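A practical note, not from the original post: pandas delegates legacy `.xls` files to the `xlrd` engine, which is installed separately (`pip install xlrd`); `.xlsx` files use `openpyxl` instead. If `read_excel` raises an ImportError, pinning the engine makes the dependency explicit:

```python
# Not from the original post: make the .xls engine dependency explicit.
# Requires `pip install xlrd`; a .xlsx file would use engine='openpyxl'.
data_df = pd.read_excel('./dia.xls', engine='xlrd')
```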
2. Data statistics
```python
data_df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1006 entries, 0 to 1005
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   卡号          1006 non-null   int64
 1   性别          1006 non-null   int64
 2   年龄          1006 non-null   int64
 3   高密度脂蛋白胆固醇   1006 non-null   float64
 4   低密度脂蛋白胆固醇   1006 non-null   float64
 5   极低密度脂蛋白胆固醇  1006 non-null   float64
 6   甘油三酯        1006 non-null   float64
 7   总胆固醇        1006 non-null   float64
 8   脉搏          1006 non-null   int64
 9   舒张压         1006 non-null   int64
 10  高血压史        1006 non-null   int64
 11  尿素氮         1006 non-null   float64
 12  尿酸          1006 non-null   float64
 13  肌酐          1006 non-null   int64
 14  体重检查结果      1006 non-null   int64
 15  是否糖尿病       1006 non-null   int64
dtypes: float64(7), int64(9)
memory usage: 125.9 KB
```

```python
data_df.describe()
```

Summary statistics (transposed here for readability):

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 卡号 | 1006 | 1.838279e+07 | 6.745088e+05 | 1.805442e+07 | 1.807007e+07 | 1.807036e+07 | 1.809726e+07 | 2.026124e+07 |
| 性别 | 1006 | 0.598410 | 0.490464 | 0 | 0 | 1 | 1 | 1 |
| 年龄 | 1006 | 50.288270 | 16.921487 | 20 | 37.25 | 50 | 60 | 93 |
| 高密度脂蛋白胆固醇 | 1006 | 1.152201 | 0.313426 | 0.42 | 0.92 | 1.12 | 1.32 | 2.50 |
| 低密度脂蛋白胆固醇 | 1006 | 2.707475 | 0.848070 | 0.84 | 2.10 | 2.68 | 3.22 | 7.98 |
| 极低密度脂蛋白胆固醇 | 1006 | 0.998311 | 0.715891 | 0.14 | 0.68 | 0.85 | 1.09 | 11.26 |
| 甘油三酯 | 1006 | 1.896720 | 2.421403 | 0.35 | 0.88 | 1.335 | 2.0875 | 45.84 |
| 总胆固醇 | 1006 | 4.857624 | 1.029973 | 2.41 | 4.20 | 4.785 | 5.38 | 12.61 |
| 脉搏 | 1006 | 80.819085 | 12.542270 | 41 | 72 | 79 | 88 | 135 |
| 舒张压 | 1006 | 76.886680 | 12.763173 | 45 | 67 | 76 | 85 | 119 |
| 高血压史 | 1006 | 0.173956 | 0.379260 | 0 | 0 | 0 | 0 | 1 |
| 尿素氮 | 1006 | 5.562684 | 1.646342 | 2.21 | 4.45 | 5.34 | 6.3675 | 18.64 |
| 尿酸 | 1006 | 339.345427 | 84.569846 | 140.8 | 280.85 | 333 | 394 | 679 |
| 肌酐 | 1006 | 64.106362 | 29.338437 | 30 | 51.25 | 62 | 72 | 799 |
| 体重检查结果 | 1006 | 1.609344 | 0.772327 | 0 | 1 | 2 | 2 | 3 |
| 是否糖尿病 | 1006 | 0.444334 | 0.497139 | 0 | 0 | 0 | 1 | 1 |
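Two things worth noting in these statistics: the label mean of 0.444 means the classes are roughly balanced, and a few maxima (e.g. 肌酐 at 799) hint at extreme values. The balance is quick to confirm (my addition, not in the original post):

```python
# Roughly 56% negative / 44% positive, so plain accuracy is a usable metric here.
print(data_df['是否糖尿病'].value_counts(normalize=True))
```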
3. Distribution analysis
```python
# count missing values
data_df.isnull().sum()
```

```
卡号            0
性别            0
年龄            0
高密度脂蛋白胆固醇     0
低密度脂蛋白胆固醇     0
极低密度脂蛋白胆固醇    0
甘油三酯          0
总胆固醇          0
脉搏            0
舒张压           0
高血压史          0
尿素氮           0
尿酸            0
肌酐            0
体重检查结果        0
是否糖尿病         0
dtype: int64
```
```python
# distribution / outlier analysis
feature_name = {
    '性别': '性别',
    '年龄': '年龄',
    '高密度脂蛋白胆固醇': '高密度脂蛋白胆固醇',
    '低密度脂蛋白胆固醇': '低密度脂蛋白胆固醇',
    '极低密度脂蛋白胆固醇': '极低密度脂蛋白胆固醇',
    '甘油三酯': '甘油三酯',
    '总胆固醇': '总胆固醇',
    '脉搏': '脉搏',
    '舒张压': '舒张压',
    '高血压史': '高血压史',
    '尿素氮': '尿素氮',
    '肌酐': '肌酐',
    '体重检查结果': '体重检查结果',
    '是否糖尿病': '是否糖尿病'
}

# boxplot display
plt.figure(figsize=(20, 20))

for i, (col, col_name) in enumerate(feature_name.items(), 1):
    plt.subplot(4, 4, i)
    # draw the boxplot, grouped by the diabetes label
    sns.boxplot(x=data_df['是否糖尿病'], y=data_df[col])
    # titles and labels
    plt.title(f'{col_name}的箱线图', fontsize=10)
    plt.ylabel('数值', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()
```
Outlier analysis (after consulting references):
The dataset is small and these clinical features are influenced by many factors, so we assume here that there are no outliers (with more data, this could be analyzed further).
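If you later want an explicit screen, the usual 1.5×IQR rule is a minimal starting point. A sketch of what that could look like (my addition, not part of the original analysis; 甘油三酯 is just an example column):

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for one numeric column.
col = '甘油三酯'  # any numeric feature works here
q1, q3 = data_df[col].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (data_df[col] < q1 - 1.5 * iqr) | (data_df[col] > q3 + 1.5 * iqr)
print(f'{col}: {mask.sum()} potential outliers out of {len(data_df)}')
```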
Distribution comparison between diabetic and non-diabetic samples:
The boxplots show that the two groups differ on age, HDL cholesterol, LDL cholesterol, triglycerides, diastolic pressure, hypertension history, blood urea nitrogen, and related factors.
4. Correlation analysis
```python
plt.figure(figsize=(15, 10))
sns.heatmap(data_df.corr(), annot=True, fmt='.2f')
plt.show()
```
HDL cholesterol shows a negative correlation with the diabetes label, so that feature is dropped.
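To make that decision more concrete than eyeballing the heatmap, you can rank every feature's correlation with the label directly (my addition, not in the original post):

```python
# Correlation of every column with the diabetes label, sorted high to low.
print(data_df.corr()['是否糖尿病'].sort_values(ascending=False))
```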
2. Standardization and splitting
The time step (seq) is set to 1, so each sample is a single feature vector rather than a true sequence.
```python
# feature selection
x = data_df.drop(['卡号', '高密度脂蛋白胆固醇', '是否糖尿病'], axis=1)
y = data_df['是否糖尿病']

# standardize the features (their scales differ widely);
# y needs no scaling for a binary classification problem
sc = StandardScaler()
x = sc.fit_transform(x)

# convert to tensors
x = torch.tensor(np.array(x), dtype=torch.float32)
y = torch.tensor(np.array(y), dtype=torch.int64)

# train/test split, 8 : 2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# reshape to [batch_size, seq, features]; skipping this also works,
# since it is equivalent to the default of seq = 1
x_train = x_train.unsqueeze(1)
x_test = x_test.unsqueeze(1)

# check the shapes
x_train.shape, y_train.shape
```

```
(torch.Size([804, 1, 13]), torch.Size([804]))
```
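One caveat, not addressed in the original post: the scaler is fit on the full dataset before the split, so test-set statistics leak into training. At this scale it rarely changes the outcome, but a stricter variant would fit the scaler on the training portion only; a sketch under that assumption:

```python
# Stricter alternative: split the raw features first, then fit the scaler
# on the training portion only and reuse it for the test portion.
x_raw = data_df.drop(['卡号', '高密度脂蛋白胆固醇', '是否糖尿病'], axis=1).values
x_tr_raw, x_te_raw, y_tr, y_te = train_test_split(
    x_raw, data_df['是否糖尿病'].values, test_size=0.2, random_state=42)
sc2 = StandardScaler().fit(x_tr_raw)          # statistics from training data only
x_tr = torch.tensor(sc2.transform(x_tr_raw), dtype=torch.float32).unsqueeze(1)
x_te = torch.tensor(sc2.transform(x_te_raw), dtype=torch.float32).unsqueeze(1)
```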
```python
# build the datasets
batch_size = 16

train_dl = DataLoader(TensorDataset(x_train, y_train),
                      batch_size=batch_size,
                      shuffle=True)
test_dl = DataLoader(TensorDataset(x_test, y_test),
                     batch_size=batch_size,
                     shuffle=False)

for X, Y in train_dl:
    print(X.shape)
    print(Y.shape)
    break
```

```
torch.Size([16, 1, 13])
torch.Size([16])
```

3. Building the model
```python
class Model_lstm(nn.Module):
    def __init__(self):
        super().__init__()
        '''
        Architecture:
        1. two LSTM layers
        2. one Linear layer
        '''
        self.lstm1 = nn.LSTM(input_size=13, hidden_size=200,
                             num_layers=1, batch_first=True)
        self.lstm2 = nn.LSTM(input_size=200, hidden_size=200,
                             num_layers=1, batch_first=True)
        # classification head
        self.lc1 = nn.Linear(200, 2)

    def forward(self, x):
        out, hidden1 = self.lstm1(x)
        # pass lstm1's final hidden state in as lstm2's initial state
        out, _ = self.lstm2(out, hidden1)
        out = self.lc1(out)
        return out

model = Model_lstm().to(device)
model
```

```
Model_lstm(
  (lstm1): LSTM(13, 200, batch_first=True)
  (lstm2): LSTM(200, 200, batch_first=True)
  (lc1): Linear(in_features=200, out_features=2, bias=True)
)
```

```python
model(torch.randn(8, 1, 13)).shape
```

```
torch.Size([8, 1, 2])
```
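Because seq is fixed at 1, flattening the model's `[batch, 1, 2]` output with `view(-1, 2)` (as the train/test functions below do) is equivalent to taking the last time step. A quick sanity check (my addition, not in the original post):

```python
# With seq == 1, flattening the sequence axis is the same as taking
# the last time step, so view(-1, 2) is safe here.
out = model(torch.randn(8, 1, 13).to(device))
print(torch.equal(out.view(-1, 2), out[:, -1, :]))  # True
```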
4. Training functions

1. The training function
```python
def train(dataloader, model, loss_fn, opt):
    size = len(dataloader.dataset)
    num_batch = len(dataloader)
    train_acc, train_loss = 0.0, 0.0

    for X, y in dataloader:
        X, y = X.to(device), y.to(device)
        pred = model(X).view(-1, 2)   # [batch, 1, 2] -> [batch, 2]
        loss = loss_fn(pred, y)

        # gradient step
        opt.zero_grad()
        loss.backward()
        opt.step()

        train_loss += loss.item()
        # count the highest-probability class matches
        train_acc += (pred.argmax(1) == y).type(torch.float).sum().item()

    train_acc /= size
    train_loss /= num_batch
    return train_acc, train_loss
```
2. The test function
```python
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batch = len(dataloader)
    test_acc, test_loss = 0.0, 0.0

    with torch.no_grad():   # no gradients needed during evaluation
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X).view(-1, 2)
            loss = loss_fn(pred, y)

            test_loss += loss.item()
            # count the highest-probability class matches
            test_acc += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_acc /= size
    test_loss /= num_batch
    return test_acc, test_loss
```

Note that `torch.no_grad()` only disables gradient tracking; switching layers such as dropout or batch norm into inference mode is handled separately by the `model.eval()` call in the training loop below.

3. Setting the hyperparameters
```python
learn_rate = 1e-4
opt = torch.optim.Adam(model.parameters(), lr=learn_rate)
loss_fn = nn.CrossEntropyLoss()
```
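A note on the loss: `nn.CrossEntropyLoss` applies `log_softmax` internally, which is why the model ends in a bare `Linear` layer with no softmax, and why `y` was converted to `torch.int64` class indices earlier. A tiny illustration (my addition, not in the original post):

```python
# CrossEntropyLoss expects raw logits of shape [N, C] and int64 labels of shape [N].
demo_logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
demo_labels = torch.tensor([0, 1])
print(nn.CrossEntropyLoss()(demo_logits, demo_labels))  # a scalar loss
```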
5. Model training

```python
epochs = 50
train_acc, train_loss, test_acc, test_loss = [], [], [], []

for i in range(epochs):
    model.train()
    epoch_train_acc, epoch_train_loss = train(train_dl, model, loss_fn, opt)

    model.eval()
    epoch_test_acc, epoch_test_loss = test(test_dl, model, loss_fn)

    train_acc.append(epoch_train_acc)
    train_loss.append(epoch_train_loss)
    test_acc.append(epoch_test_acc)
    test_loss.append(epoch_test_loss)

    # progress output
    template = ('Epoch:{:2d}, Train_acc:{:.1f}%, Train_loss:{:.3f}, '
                'Test_acc:{:.1f}%, Test_loss:{:.3f}')
    print(template.format(i + 1, epoch_train_acc*100, epoch_train_loss,
                          epoch_test_acc*100, epoch_test_loss))

print('---------------Done---------------')
```

```
Epoch: 1, Train_acc:58.5%, Train_loss:0.677, Test_acc:75.7%, Test_loss:0.655
Epoch: 2, Train_acc:71.0%, Train_loss:0.643, Test_acc:77.2%, Test_loss:0.606
Epoch: 3, Train_acc:75.2%, Train_loss:0.590, Test_acc:79.7%, Test_loss:0.533
Epoch: 4, Train_acc:76.9%, Train_loss:0.524, Test_acc:80.2%, Test_loss:0.469
Epoch: 5, Train_acc:77.5%, Train_loss:0.481, Test_acc:79.7%, Test_loss:0.436
Epoch: 6, Train_acc:78.4%, Train_loss:0.470, Test_acc:79.7%, Test_loss:0.419
Epoch: 7, Train_acc:78.6%, Train_loss:0.452, Test_acc:80.7%, Test_loss:0.412
Epoch: 8, Train_acc:78.5%, Train_loss:0.449, Test_acc:80.7%, Test_loss:0.406
Epoch: 9, Train_acc:78.7%, Train_loss:0.444, Test_acc:80.7%, Test_loss:0.400
Epoch:10, Train_acc:79.0%, Train_loss:0.435, Test_acc:81.2%, Test_loss:0.395
Epoch:11, Train_acc:78.4%, Train_loss:0.428, Test_acc:81.2%, Test_loss:0.391
Epoch:12, Train_acc:79.1%, Train_loss:0.428, Test_acc:81.2%, Test_loss:0.388
Epoch:13, Train_acc:79.0%, Train_loss:0.421, Test_acc:80.7%, Test_loss:0.385
Epoch:14, Train_acc:79.2%, Train_loss:0.415, Test_acc:81.7%, Test_loss:0.382
Epoch:15, Train_acc:79.1%, Train_loss:0.415, Test_acc:81.7%, Test_loss:0.379
Epoch:16, Train_acc:79.7%, Train_loss:0.422, Test_acc:81.7%, Test_loss:0.377
Epoch:17, Train_acc:79.5%, Train_loss:0.410, Test_acc:81.7%, Test_loss:0.375
Epoch:18, Train_acc:79.2%, Train_loss:0.406, Test_acc:81.7%, Test_loss:0.374
Epoch:19, Train_acc:80.3%, Train_loss:0.407, Test_acc:82.2%, Test_loss:0.372
Epoch:20, Train_acc:80.1%, Train_loss:0.409, Test_acc:81.2%, Test_loss:0.370
Epoch:21, Train_acc:80.2%, Train_loss:0.397, Test_acc:80.7%, Test_loss:0.368
Epoch:22, Train_acc:81.0%, Train_loss:0.399, Test_acc:81.7%, Test_loss:0.367
Epoch:23, Train_acc:80.7%, Train_loss:0.396, Test_acc:81.2%, Test_loss:0.365
Epoch:24, Train_acc:81.0%, Train_loss:0.401, Test_acc:81.7%, Test_loss:0.363
Epoch:25, Train_acc:81.1%, Train_loss:0.392, Test_acc:82.2%, Test_loss:0.363
Epoch:26, Train_acc:81.2%, Train_loss:0.385, Test_acc:82.2%, Test_loss:0.362
Epoch:27, Train_acc:80.6%, Train_loss:0.392, Test_acc:82.2%, Test_loss:0.361
Epoch:28, Train_acc:80.5%, Train_loss:0.382, Test_acc:81.2%, Test_loss:0.358
Epoch:29, Train_acc:81.1%, Train_loss:0.386, Test_acc:81.7%, Test_loss:0.358
Epoch:30, Train_acc:80.7%, Train_loss:0.380, Test_acc:82.2%, Test_loss:0.358
Epoch:31, Train_acc:81.5%, Train_loss:0.378, Test_acc:81.7%, Test_loss:0.357
Epoch:32, Train_acc:80.6%, Train_loss:0.373, Test_acc:81.2%, Test_loss:0.356
Epoch:33, Train_acc:81.3%, Train_loss:0.373, Test_acc:81.7%, Test_loss:0.357
Epoch:34, Train_acc:80.8%, Train_loss:0.378, Test_acc:81.7%, Test_loss:0.354
Epoch:35, Train_acc:81.5%, Train_loss:0.372, Test_acc:81.2%, Test_loss:0.355
Epoch:36, Train_acc:81.5%, Train_loss:0.368, Test_acc:81.2%, Test_loss:0.354
Epoch:37, Train_acc:81.2%, Train_loss:0.368, Test_acc:80.7%, Test_loss:0.354
Epoch:38, Train_acc:81.2%, Train_loss:0.369, Test_acc:81.2%, Test_loss:0.353
Epoch:39, Train_acc:81.7%, Train_loss:0.365, Test_acc:81.2%, Test_loss:0.354
Epoch:40, Train_acc:81.5%, Train_loss:0.363, Test_acc:81.2%, Test_loss:0.355
Epoch:41, Train_acc:81.7%, Train_loss:0.358, Test_acc:81.2%, Test_loss:0.354
Epoch:42, Train_acc:81.7%, Train_loss:0.355, Test_acc:81.2%, Test_loss:0.353
Epoch:43, Train_acc:81.3%, Train_loss:0.353, Test_acc:80.7%, Test_loss:0.354
Epoch:44, Train_acc:82.0%, Train_loss:0.355, Test_acc:80.7%, Test_loss:0.354
Epoch:45, Train_acc:81.7%, Train_loss:0.353, Test_acc:79.7%, Test_loss:0.354
Epoch:46, Train_acc:82.1%, Train_loss:0.354, Test_acc:80.2%, Test_loss:0.354
Epoch:47, Train_acc:82.0%, Train_loss:0.349, Test_acc:80.2%, Test_loss:0.356
Epoch:48, Train_acc:82.1%, Train_loss:0.350, Test_acc:80.2%, Test_loss:0.356
Epoch:49, Train_acc:82.0%, Train_loss:0.345, Test_acc:80.7%, Test_loss:0.355
Epoch:50, Train_acc:81.8%, Train_loss:0.344, Test_acc:80.7%, Test_loss:0.355
---------------Done---------------
```

6. Results
```python
from datetime import datetime
current_time = datetime.now()

epochs_range = range(epochs)

plt.figure(figsize=(12, 3))

plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_acc, label='Training Accuracy')
plt.plot(epochs_range, test_acc, label='Test Accuracy')
plt.legend(loc='lower right')
plt.title('Training Accuracy')
plt.xlabel(current_time)   # timestamp the figure

plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_loss, label='Training Loss')
plt.plot(epochs_range, test_loss, label='Test Loss')
plt.legend(loc='upper right')
plt.title('Training Loss')

plt.show()
```
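If you want a number out of the curves rather than a picture, the recorded lists can give the best-test-loss epoch directly (my addition, not in the original post):

```python
# Locate the epoch with the lowest test loss from the recorded curves
# (`test_loss` is the list collected during training; np is imported above).
best_epoch = int(np.argmin(test_loss)) + 1
print(f'Lowest test loss {min(test_loss):.3f} at epoch {best_epoch}')
```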
7. Prediction
```python
# take the first test sample and restore the [batch, seq, features] shape
test_x = x_test[0].reshape(1, 1, 13)

pred = model(test_x.to(device)).reshape(-1, 2)
res = pred.argmax(1).item()
print(f'预测结果: {res}, (1: 患病; 0: 不患病)')
```

```
预测结果: 1, (1: 患病; 0: 不患病)
```

(Prediction: 1, where 1 = diabetic and 0 = not diabetic.)
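The post ends with a single-sample prediction; to sanity-check the model on the whole test split at once, something like this should work (my sketch, reusing the objects defined above):

```python
# Batch prediction over the entire test set, reusing the trained model.
model.eval()
with torch.no_grad():
    all_pred = model(x_test.to(device)).view(-1, 2).argmax(1).cpu()
acc = (all_pred == y_test).float().mean().item()
print(f'Test-set accuracy: {acc:.3f}')  # should match the final Test_acc above
```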