Bottom Line Up Front
Deep diving into the source code of mount.cifs to understand cifs_mount return codes in dmesg led to a bug report and a fix coming in a future kernel update.
Scenario
Today after applying updates, including a new kernel, I noticed files disappearing from Rhythmbox. My music collection is stored on a network share, and for some reason it wasn’t mounted. I have it set up using autofs[1] to mount on demand rather than making it persistent. Traversing to the directory from the command line showed the share wouldn’t mount, so I looked at dmesg and saw the following:
$ sudo dmesg -T | grep cifs_mount
[Thu Jul 21 09:49:27 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:49:38 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:51:19 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:52:26 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:53:07 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:54:00 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:54:15 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:54:57 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:56:09 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:56:38 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:57:39 2022] CIFS: VFS: cifs_mount failed w/return code = -95
[Thu Jul 21 09:57:54 2022] CIFS: VFS: cifs_mount failed w/return code = -22
[Thu Jul 21 09:58:36 2022] CIFS: VFS: cifs_mount failed w/return code = -22
I am mounting a share from an Ubuntu 20.04 server on a Pop!_OS 22.04 client, and with the recent kernel update I suspected there might be some sort of version mismatch at play. A quick search brought me to a StackOverflow post[2] that didn’t have a direct answer, but it had enough information to point me in a direction. I tried mounting the share from the command line, as suggested in the post, and it was successful. A closer look at the autofs configuration for the share helped narrow down a possible cause:
Media -fstype=cifs,rw,credentials=/home/e0D69NsW/.config/smbcredentials,uid=1000,vers=3,dir_mode=0755,file_mode=0644 ://198.51.100.97/media
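For comparison, the manual mount I ran looked roughly like this; the mount point is illustrative and the options are reconstructed from the autofs map, so treat it as a sketch rather than a transcript:
$ sudo mkdir -p /mnt/media
$ sudo mount -t cifs -o rw,credentials=/home/e0D69NsW/.config/smbcredentials,uid=1000,dir_mode=0755,file_mode=0644 //198.51.100.97/media /mnt/media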
Mounting from the command line succeeded, and nothing about the configuration had changed, so it had to be something else. I noticed that when mounting from the command line I excluded the vers= option. Per the man page for mount.cifs, if this option is excluded it will attempt to negotiate the highest SMB2+ version supported by both the client and server. I removed the vers option, changing the configuration to the following:
Media -fstype=cifs,rw,credentials=/home/e0D69NsW/.config/smbcredentials,uid=1000,dir_mode=0755,file_mode=0644 ://198.51.100.97/media
After restarting autofs, the mount worked again, solving the immediate problem.
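Restarting the service and then touching the mount point is enough to confirm the change (this assumes autofs runs under systemd; the automount path below is a placeholder for wherever your master map mounts the share):
$ sudo systemctl restart autofs
$ ls /path/to/automount/Media    # any access triggers the on-demand mount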
Diving Deeper
I’d seen the cifs_mount failed errors many times before and generally ignored them. This time around I wanted to understand the return codes, and I wasn’t fully convinced the problem was caused by just a version mismatch. A little digging led me to mount.h[3] in the cifs-utils git repository:
/* exit status - bits below are ORed */
#define EX_USAGE	1	/* incorrect invocation or permission */
#define EX_SYSERR	2	/* out of memory, cannot fork, ... */
#define EX_SOFTWARE	4	/* internal mount bug or wrong version */
#define EX_USER		8	/* user interrupt */
#define EX_FILEIO	16	/* problems writing, locking, ... mtab/fstab */
#define EX_FAIL		32	/* mount failure */
#define EX_SOMEOK	64	/* some mount succeeded */
The most common return code I saw was 22, a value not listed in mount.h. The comment just above the exit status definitions provides a hint: the bits are ORed together. I arrive at 22 using two bitwise OR[4] operations: 2|4 is 6, and 6|16 is 22. The corresponding exit statuses are EX_SYSERR (2), EX_SOFTWARE (4), and EX_FILEIO (16). EX_SOFTWARE seemed the most likely cause, as using the wrong version could trigger the error, but it raised the question of whether there are any bugs that explain the behavior.
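The arithmetic is easy to sanity-check in a shell (purely illustrative):
$ echo $(( 2 | 4 ))        # EX_SYSERR | EX_SOFTWARE
6
$ echo $(( 2 | 4 | 16 ))   # adding EX_FILEIO
22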
Going back to my original search, I found a chain of links starting with an Arch Linux forum post[5] on a similar topic that led to a kernel bug report[6], and ultimately to a Debian bug report[7]. The kernel bug report included additional debugging commands:
echo 'module cifs +p' > /sys/kernel/debug/dynamic_debug/control    # enable dynamic debug for the cifs module
echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control  # and for everything under fs/cifs
echo 1 > /proc/fs/cifs/cifsFYI                                      # turn on verbose cifs logging
echo 1 > /sys/module/dns_resolver/parameters/debug                  # debug output from the DNS resolver
dmesg --clear                                                       # start with an empty ring buffer
tcpdump -s 0 -w trace.pcap port 445 & pid=$!                        # capture SMB traffic in the background
mount ...                                                           # reproduce the failing mount
kill $pid                                                           # stop the capture
dmesg > trace.log                                                   # save the kernel log
I ran the commands and compared the output of my trace.log to that in the bug report. They seemed to match, at least as much as they could coming from different systems. lore[8] doesn’t allow deep linking, so I couldn’t download the packet capture and compare it to mine. As I read through the Debian bug report I learned this is a client-side problem being fixed in the 5.18.13 kernel. I will need to wait for the update unless I want to compile my own kernel (and I don’t), but I have a workaround in the meantime.
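When the update does arrive, confirming it is just a matter of comparing the running kernel against 5.18.13 (a sketch; package names and output formats vary by distribution):
$ uname -r                                      # the fix is expected in 5.18.13 and later
$ apt list --upgradable 2>/dev/null | grep -i linux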
Wrapping Up
These kinds of problems don’t happen every day, but they’re part and parcel of using Linux on both the server and the desktop. Linux and open source software may not come with a monetary cost, but there is certainly a time investment required. Learning how to troubleshoot and dig down to the root cause is an essential skill, and looking beyond the common sources of information (web searches, documentation, bug reports) is key to building it. In this example the source code for a package related to the problem at hand (cifs-utils) provided validation of what I thought the issue was. I could have stopped once I had my workaround, but digging further led me to bug reports confirming there was a client-side bug and that a fix is on the way. That’s enough information for me to close this out and move on to something more interesting.