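Summary
This damned SPI device driver hung our automated power-cycle stability test! I analyzed the dumps from two hangs with the crash tool, and in the end the log printed just before the hang was the most useful part. I was furious!
Article
A case study: an automated power-cycle stability test hang caused by an SPI device driver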
Environment
Hardware: an ARM SoC
Software: Linux
Symptom: the product hangs during the automated power-cycle stability test.
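Analysis
I used the crash tool to analyze the dumps from two hangs and extracted the log printed just before each one, shown below. The two backtraces differ slightly, but the cause is the same: both end with a NULL pointer dereference inside the call to complete, which triggers a kernel panic.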
log 1:
[ 1.092790] c2 Unable to handle kernel NULL pointer dereference at virtual address 00000004
[ 1.092796] c2 pgd = c0004000
[ 1.092802] c0 [00000004] *pgd=00000000
[ 1.092812] c2 Internal error: Oops: 5 [#1] PREEMPT SMP ARM
...
[ 1.094412] c0 Backtrace:
[ 1.094430] c2 [<c07e971c>] (_raw_spin_lock_irqsave) from [<c006e2a8>] (complete+0x1c/0x4c)
[ 1.094446] c2 [<c006e28c>] (complete) from [<c04318d4>] (spi_complete+0x10/0x14)
[ 1.094468] c2 [<c04318c4>] (spi_complete) from [<c0432328>] (spi_finalize_current_message+0x228/0x26c)
[ 1.094479] c2 [<c0432100>] (spi_finalize_current_message) from [<c04327c8>] (spi_transfer_one_message+0x45c/0x484)
[ 1.094503] c2 [<c043236c>] (spi_transfer_one_message) from [<c0432e98>] (__spi_pump_messages+0x6a8/0x6e8)
[ 1.094531] c2 [<c04327f0>] (__spi_pump_messages) from [<c04330fc>] (__spi_sync+0x208/0x270)
[ 1.094559] c2 [<c0432ef4>] (__spi_sync) from [<c0433178>] (spi_sync+0x14/0x18)
[ 1.094586] c2 [<c0433164>] (spi_sync) from [<c0440744>] (dm9051_r_reg+0x4c/0x7c)
[ 1.094595] c2 [<c04406f8>] (dm9051_r_reg) from [<c0441e30>] (dm9051_probe+0x6e4/0x82c)
[ 1.094605] c2 [<c044174c>] (dm9051_probe) from [<c0431764>] (spi_drv_probe+0x8c/0xa8)

log 2:
[ 1.271300] c3 Unable to handle kernel NULL pointer dereference at virtual address 00000004
[ 1.271305] c3 pgd = c0004000
[ 1.271312] c0 [00000004] *pgd=00000000
[ 1.271323] c3 Internal error: Oops: 5 [#1] PREEMPT SMP ARM
...
[ 1.414585] c0 Backtrace:
[ 1.414603] c3 [<c07e97f4>] (_raw_spin_lock_irqsave) from [<c006e2a8>] (complete+0x1c/0x4c)
[ 1.414618] c3 [<c006e28c>] (complete) from [<c0431fb8>] (spi_complete+0x28/0x34)
[ 1.414643] c3 [<c0431f90>] (spi_complete) from [<c0432350>] (spi_finalize_current_message+0x230/0x278)
[ 1.414666] c3 [<c0432120>] (spi_finalize_current_message) from [<c04327f4>] (spi_transfer_one_message+0x45c/0x484)
[ 1.414693] c3 [<c0432398>] (spi_transfer_one_message) from [<c0432ec4>] (__spi_pump_messages+0x6a8/0x6e8)
[ 1.414725] c3 [<c043281c>] (__spi_pump_messages) from [<c0432f1c>] (spi_pump_messages+0x18/0x1c)
[ 1.414759] c3 [<c0432f04>] (spi_pump_messages) from [<c0042abc>] (kthread_worker_fn+0xc0/0x14c)
[ 1.414771] c3 [<c00429fc>] (kthread_worker_fn) from [<c00429e8>] (kthread+0x110/0x124)
[ 1.414798] c3 [<c00428d8>] (kthread) from [<c000f208>] (ret_from_fork+0x14/0x2c)
Walking through the kernel SPI core code in drivers/spi/spi.c:
// Calls complete
static void spi_complete(void *arg)
{
    complete(arg);
}
// Assigns the complete and context members of the spi_message
static int __spi_sync(struct spi_device *spi, struct spi_message *message, int bus_locked)
{
    DECLARE_COMPLETION_ONSTACK(done);
    ...
    message->complete = spi_complete;
    message->context = &done;
    message->spi = spi;
    if (status == 0) {
        ...
        __spi_pump_messages(master, false);
        ...
        wait_for_completion(&done);
        status = message->status;
    }
    message->context = NULL;
    return status;
}
// Invokes the spi_message's complete callback, i.e. spi_complete. The argument passed in is the spi_message's context, i.e. the completion variable "done" defined inside __spi_sync.
void spi_finalize_current_message(struct spi_master *master)
{
    ...
    spin_lock_irqsave(&master->queue_lock, flags);
    mesg = master->cur_msg;
    spin_unlock_irqrestore(&master->queue_lock, flags);
    ...
    if (mesg->complete)
        mesg->complete(mesg->context);
}
So it is fairly certain that the problem occurs when mesg->complete(mesg->context); runs with mesg->context equal to NULL. Now look at part of the log just before the hang:
// cpu2 starts an SPI transfer and completes it
[ 1.271151] c2 70a00000.spi: __spi_sync
[ 1.271172] c2 spi_master spi0: spi_finalize_current_message
[ 1.271180] c2 [spi_complete]
// cpu2 starts another SPI transfer and completes it
[ 1.271192] c2 70a00000.spi: __spi_sync
[ 1.271212] c2 spi_master spi0: spi_finalize_current_message
[ 1.271219] c2 [spi_complete]
// cpu2 starts yet another SPI transfer, but before it completes, cpu1 starts a new SPI transfer
[ 1.271227] c2 70a00000.spi: __spi_sync
[ 1.271247] c2 spi_master spi0: spi_finalize_current_message
[ 1.271249] c1 70a00000.spi: __spi_sync // cpu1 starts a new transfer
[ 1.271261] c2 [spi_complete]
[ 1.271281] c1 dm9051_r_reg: spi_sync() failed
// cpu3 runs the transfer-done completion, but the mesg->context passed in is NULL, which causes the crash
[ 1.271288] c3 spi_master spi0: spi_finalize_current_message
[ 1.271293] c3 [spi_complete]
[ 1.271300] c3 Unable to handle kernel NULL pointer dereference at virtual address 00000004
__spi_sync contains the following statements:
__spi_pump_messages(master, false); // process the spi message and start the transfer
wait_for_completion(&done);         // wait for the transfer to finish
message->context = NULL;            // set this spi message's context to NULL
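In other words, once the completion fires, __spi_sync sets the context of the spi message that has just finished transferring to NULL.
If the spi message that cpu1 passes in for its new transfer has the same address as the one used on cpu2, then the completion path for cpu1's transfer can end up reading that NULL context.
The spi message is passed in by the dm9051 driver. The driver's SPI call path is shown below: it builds every spi message from dm->spi_msg1, a single instance stored in the driver's device structure and shared by all callers, rather than from a per-call local variable. That is what causes the problem.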
static u8 dm9051_r_reg(struct dm9051_net *dm, u16 reg_addr)
{
    struct spi_transfer *xfer = &dm->spi_xfer1;
    struct spi_message *msg = &dm->spi_msg1;
    u8 r_cmd[2] = {reg_addr, 0x00};
    u8 r_data[2] = {0x00, 0x00};
    int ret;

    xfer->tx_buf = r_cmd;
    xfer->rx_buf = r_data;
    xfer->len = 2;
    ret = spi_sync(dm->spidev, msg);
    if (ret < 0)
        DM_MSG0("dm9051_r_reg: spi_sync() failed\n");
    return r_data[1];
}

static int dm9051_probe(struct spi_device *spi)
{
    ...
    spi_message_init(&dm->spi_msg1);
    spi_message_add_tail(&dm->spi_xfer1, &dm->spi_msg1);
    ...
}
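To make the aliasing concrete, here is a small single-threaded replay of one interleaving that is consistent with the explanation above. It is a sketch, not a reconstruction of the exact kernel timing. Every name in it (mock_completion, mock_spi_message, sync_begin, sync_end, finalize, replay) is a hypothetical stand-in; the helpers only mirror the assignments that __spi_sync and spi_finalize_current_message make to complete and context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_completion { unsigned int done; };

struct mock_spi_message {
    void (*complete)(void *context);
    void *context;
};

static bool saw_null_context;

/* Stand-in for spi_complete(): records a NULL context instead of crashing. */
static void mock_spi_complete(void *context)
{
    if (context == NULL) {
        saw_null_context = true;   /* the kernel would oops here */
        return;
    }
    ((struct mock_completion *)context)->done++;
}

/* First half of __spi_sync: fill in the message before queuing it. */
static void sync_begin(struct mock_spi_message *m, struct mock_completion *done)
{
    m->complete = mock_spi_complete;
    m->context = done;
}

/* Tail of __spi_sync: runs once wait_for_completion() has returned. */
static void sync_end(struct mock_spi_message *m)
{
    m->context = NULL;
}

/* What spi_finalize_current_message does with the message. */
static void finalize(struct mock_spi_message *m)
{
    if (m->complete)
        m->complete(m->context);
}

/* Replays: A submits, A finishes, B submits, A wakes up, B finishes.
 * Returns true if a completion callback was handed a NULL context. */
static bool replay(struct mock_spi_message *msg_a, struct mock_spi_message *msg_b)
{
    struct mock_completion done_a = {0}, done_b = {0};

    saw_null_context = false;
    sync_begin(msg_a, &done_a);  /* caller A submits its message */
    finalize(msg_a);             /* A's transfer finishes, done_a is signalled */
    sync_begin(msg_b, &done_b);  /* caller B submits before A has woken up */
    sync_end(msg_a);             /* A wakes up and clears its context */
    finalize(msg_b);             /* B's transfer finishes */

    assert(done_a.done == 1);
    return saw_null_context;
}
```

Calling replay(&shared, &shared) models the driver's single dm->spi_msg1 and reports a NULL context; calling it with two distinct messages does not, because A clearing its own context no longer touches B's message.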
Fix
Follow the implementation of spi_read and spi_write in include/linux/spi/spi.h and build a new spi message before every read or write.
static inline int spi_read(struct spi_device *spi, void *buf, size_t len)
{
    struct spi_transfer t = {
        .rx_buf = buf,
        .len = len,
    };
    struct spi_message m;

    spi_message_init(&m);
    spi_message_add_tail(&t, &m);
    return spi_sync(spi, &m);
}
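Applied to dm9051_r_reg, the same pattern might look like the sketch below. This is not the driver author's actual patch. So that it compiles outside the kernel, the SPI types and spi_sync here are hypothetical userspace stand-ins (a fake bus that returns a register value in the second received byte); in the real driver they come from linux/spi/spi.h. Only the body of dm9051_r_reg is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint8_t u8;
typedef uint16_t u16;

/* ---- Hypothetical userspace stand-ins for the kernel SPI API ---- */
struct spi_device { u8 regs[256]; };   /* fake register file */
struct spi_transfer { const void *tx_buf; void *rx_buf; unsigned len; };
struct spi_message { struct spi_transfer *xfer; };
struct dm9051_net { struct spi_device *spidev; };

static void spi_message_init(struct spi_message *m) { m->xfer = NULL; }

static void spi_message_add_tail(struct spi_transfer *t, struct spi_message *m)
{
    m->xfer = t;
}

/* Fake bus: tx byte 0 selects a register, rx byte 1 receives its value. */
static int spi_sync(struct spi_device *spi, struct spi_message *m)
{
    const u8 *tx;
    u8 *rx;

    if (!m->xfer || m->xfer->len < 2)
        return -1;
    tx = m->xfer->tx_buf;
    rx = m->xfer->rx_buf;
    rx[0] = 0x00;
    rx[1] = spi->regs[tx[0]];
    return 0;
}

/* ---- The sketch: message and transfer are per-call locals ---- */
static u8 dm9051_r_reg(struct dm9051_net *dm, u16 reg_addr)
{
    u8 r_cmd[2] = { (u8)reg_addr, 0x00 };
    u8 r_data[2] = { 0x00, 0x00 };
    struct spi_transfer xfer = {
        .tx_buf = r_cmd,
        .rx_buf = r_data,
        .len = 2,
    };
    struct spi_message msg;   /* no longer shared through dm */
    int ret;

    spi_message_init(&msg);
    spi_message_add_tail(&xfer, &msg);
    ret = spi_sync(dm->spidev, &msg);
    if (ret < 0)
        fprintf(stderr, "dm9051_r_reg: spi_sync() failed\n"); /* DM_MSG0 in the driver */
    return r_data[1];
}
```

With the message on the caller's stack, two concurrent readers can no longer overwrite or clear each other's context. The stack buffers r_cmd and r_data are kept as in the original function; whether they are safe for DMA on a given controller is a separate question this sketch does not address.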
————————————————-
Author: bigfish99
blog: https://www.cnblogs.com/bigfish0506/
WeChat official account: 大魚内嵌式