Btrfs error device


Hi, I have a problem.
There is a machine that is used as a router and also as a container server. Its logs show the following:

[  848.648139] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 18, gen 0
[  849.141432] BTRFS warning (device sda2): checksum error at logical 12112596992 on dev /dev/sda2, physical 12112596992, root 290, inode 2153392, offset 4214378496, length 4096, links 1 (path: var/lib/lxd/disks/default.img)
[  849.141465] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 19, gen 0
[  849.195020] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 20, gen 0
[  852.190183] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 21, gen 0
[  866.312699] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 22, gen 0
[  870.094738] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 23, gen 0
[  874.623532] BTRFS info (device sda2): scrub: not finished on devid 1 with status: -125
[  915.548043] BTRFS info (device sda2): scrub: started on devid 1
[  926.183099] kauditd_printk_skb: 14 callbacks suppressed
[  926.183110] audit: type=1130 audit(1642170476.243:210): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  926.183129] audit: type=1131 audit(1642170476.243:211): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  963.509864] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 24, gen 0
[  964.184943] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 25, gen 0
[  964.414589] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 26, gen 0
[  966.260093] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 27, gen 0
[  966.619578] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 28, gen 0
[  967.107493] BTRFS warning (device sda2): checksum error at logical 12112596992 on dev /dev/sda2, physical 12112596992, root 290, inode 2153392, offset 4214378496, length 4096, links 1 (path: var/lib/lxd/disks/default.img)
[  967.107505] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 29, gen 0
[  967.165297] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 30, gen 0
[  970.167824] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 31, gen 0
[  984.112427] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 32, gen 0
[  987.712951] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 33, gen 0
[ 1018.115785] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 34, gen 0
[ 1018.144848] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
[ 1019.375785] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 36, gen 0
[ 1019.527438] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 37, gen 0
[ 1025.579311] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0
[ 1038.929269] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 39, gen 0
[ 1042.125975] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 40, gen 0
[ 1043.770749] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 41, gen 0

The other btrfs disks all seem fine. And the corrupt counter keeps increasing. The problem only shows up on the disk that holds the operating system. It's an SSD.

There are various possibilities here, ranging from the worst case (the SSD is dying and sooner or later will either go read-only or stop being detected altogether) to a kernel bug related to data corruption.
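To tell whether the counter really keeps growing, the per-device counters can be read with `btrfs device stats <mountpoint>` (and reset with `-z` once the underlying cause is fixed). As a small sketch, here is a shell filter that pulls the corrupt counter out of log lines like the ones above, so a cron job could alert on growth; the sample input is taken straight from the excerpt:

```shell
#!/bin/sh
# Sketch: extract the "corrupt" counter from BTRFS error lines like the
# ones above. The live counters can also be read with
# `btrfs device stats <mountpoint>` and reset with `btrfs device stats -z`.
extract_corrupt() {
  # Print the number that follows "corrupt " in each matching line.
  sed -n 's/.*errs: wr [0-9]*, rd [0-9]*, flush [0-9]*, corrupt \([0-9]*\),.*/\1/p'
}

# Sample input taken from the log excerpt above:
printf '%s\n' \
  'BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 41, gen 0' \
  | extract_corrupt
# → 41
```

A steadily rising corrupt count with wr/rd/flush at zero points at data that fails its checksum on read rather than at I/O failures, which fits the checksum warnings in the log.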

The btrfs Wiki glossary says this about what a generation is:

  • generation

    An internal counter which updates for each transaction. When a metadata block is written (using copy on write), current generation is stored in the block, so that blocks which are too new (and hence possibly inconsistent) can be identified.

Another entry mentions that

Under normal circumstances the generation numbers must match. A mismatch can be caused by a lost write after a crash (i.e. a dangling block "pointer"; software bug, hardware bug), or a misdirected write (the block was never written to that location; software bug, hardware bug).
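As a sketch of how to inspect these generation numbers yourself: `btrfs inspect-internal dump-super` prints the superblock, including its generation fields. The device name below is the one from the question, and the snippet is written as a dry run that only echoes the command; clear `RUN` to execute it for real.

```shell
#!/bin/sh
# Dry-run sketch: print the superblock, which contains the generation
# fields. Set RUN= (empty) to actually execute; needs root and btrfs-progs.
RUN=echo
$RUN btrfs inspect-internal dump-super /dev/sda2
# In real output, look for the "generation" and "chunk_root_generation"
# lines and compare them with the numbers in the kernel messages.
```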

It doesn't tell me very much, and others are asking too; see for example this mail thread.

Actually, the last post in that thread mentions something half useful: a "generation error" is "an indication that blocks have not been written", which basically echoes what the Wiki says.

So, with that information we can draw some conclusions:

  • The btrfs filesystem is not fully documented (user-side) with explanations of the output from its tools (the Wiki even says "For now, most of the information exists in people's heads.")

  • There were a few errors writing meta information to disk, which, yes, could indicate a problem.

By answering this question, I hope that some btrfs guru pops up and gives you a proper answer to the question "What do I do about it?".

Your next port of call may be asking on a btrfs mailing list, such as the one mentioned in the Wiki (I would do this now if I were you).

Hello All,

I was looking thru dmesg and saw the following BTRFS error.

[   19.473684] BTRFS error (device sda1): bad tree block start 6207684608 6207668224

However, when I run btrfs filesystem show (to check if the filesystem is corrupted), everything seems fine (no drives missing):

Label: none  uuid: 194543af-f0aa-4ba7-9867-0e3cabced52b
        Total devices 2 FS bytes used 3.02GiB
        devid    1 size 54.02GiB used 5.03GiB path /dev/sdb1
        devid    2 size 54.00GiB used 5.03GiB path /dev/sda1

Is the error at the top something that I need to worry about? I tried researching the issue but found no relevant information on why this error is showing or what I should do. Any insight into what this error is and/or what should be done about it would be helpful.

Relevant Info:

OS: ubuntu 18.04 LTS server

Kernel: 4.15.0-43-generic

EDIT:

Output of btrfs check on both drives in array.

NOTE: The output was generated from an Arch live USB booted on the problem system.

Kernel on Arch boot USB: 4.19.4-arch1-1-ARCH

# btrfs check --readonly /dev/sde1
Opening filesystem to check...
Checking filesystem on /dev/sde1
UUID: 194543af-f0aa-4ba7-9867-0e3cabced52b
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 3245326336 bytes used, no error found
total csum bytes: 2236316
total tree bytes: 104366080
total fs tree bytes: 95895552
total extent tree bytes: 4980736
btree space waste bytes: 24032179
file data blocks allocated: 4631064576
 referenced 2890346496

# btrfs check --readonly /dev/sdd1
Opening filesystem to check...
Checking filesystem on /dev/sdd1
UUID: 194543af-f0aa-4ba7-9867-0e3cabced52b
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 3245326336 bytes used, no error found
total csum bytes: 2236316
total tree bytes: 104366080
total fs tree bytes: 95895552
total extent tree bytes: 4980736
btree space waste bytes: 24032179
file data blocks allocated: 4631064576
 referenced 2890346496
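One thing worth noting: `btrfs check --readonly` walks the metadata trees but does not re-read every copy of every block, so a bad copy on one mirror can pass unnoticed. A scrub reads all copies and, on RAID1, repairs a bad one from its mirror. A dry-run sketch (the mount point is a placeholder; adjust to your system, and clear `RUN` to execute for real):

```shell
#!/bin/sh
# Dry-run sketch: scrub the mounted filesystem so btrfs re-reads every
# block (both RAID1 copies) and repairs bad ones from the good mirror.
# /mnt/data is a placeholder mount point. Set RUN= to execute (needs root).
RUN=echo
$RUN btrfs scrub start -Bd /mnt/data   # -B: run in foreground, -d: per-device stats
$RUN btrfs device stats /mnt/data      # check the error counters afterwards
```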

Thanks,

moyam01

A small amount of backstory:

I have a small media filesystem, on which I store various movies and TV shows that are used for my HTPC setup. This was originally set up, using btrfs, on a 1TB WD external drive.

Later, I decided to purchase another drive, to give this filesystem RAID1 mirroring capabilities. This drive is a Seagate Barracuda (2TB, BARRACUDA 7200.14 FAMILY). Unfortunately, this was not a good choice of drive. The drive started developing large amounts of read errors shortly afterwards, although BTRFS was able to correct them.

Recently, the amount of read errors from this drive has spiked, with its condition steadily worsening. BTRFS is now starting to crash:

kernel: blk_update_request: I/O error, dev sda, sector 2991635296
kernel: BTRFS error (device sdc1): bdev /dev/sda3 errs: wr 0, rd 18, flush 0, corrupt 0, gen 0
kernel: ata3: EH complete
kernel: BTRFS info (device sdc1): csum failed ino 73072 extent 1531717287936 csum 3335082470 wanted 3200325796 mirror 0
kernel: ------------[ cut here ]------------
kernel: kernel BUG at fs/btrfs/extent_io.c:2309!
kernel: invalid opcode: 0000 [#1] PREEMPT SMP 
kernel: CPU: 1 PID: 3136 Comm: kworker/u8:3 Tainted: G           O    4.5.3-1-ARCH #1
kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
kernel: task: ffff88001b5c4740 ti: ffff88005f0e4000 task.ti: ffff88005f0e4000
kernel: RIP: 0010:[<ffffffffa0081736>]  [<ffffffffa0081736>] btrfs_check_repairable+0xf6/0x100 [btrfs]
kernel: RSP: 0018:ffff88005f0e7cc0  EFLAGS: 00010282

I’d like to remove the faulty drive from the RAID1 array, going back to no redundancy on a single drive. Unfortunately, there seems to be a lack of documentation on how to do this.

I am aware that one can run the following:

sudo btrfs balance start -dconvert=single /media

to convert the data profile to single mode, but I'm unsure as to just WHERE the data will be placed. As one of the drives is failing, I'd like to ensure that BTRFS doesn't dutifully erase all the data on the good drive and place a single copy on the bad one. Instead, I'd like it to act as if the other drive never existed (that is, convert back to my old setup).

This doesn’t work:

$ sudo btrfs device delete /dev/sda3 /media
ERROR: error removing device '/dev/sda3': unable to go below two devices on raid1

What am I to do? Help would be greatly appreciated.

TL;DR: started with 1 drive in BTRFS single, added another drive, made it RAID1, other drive is now erroring, how do I return to just one drive (SPECIFICALLY the known good one) with single?
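One sequence that is commonly suggested for this situation (a sketch, not verified against your exact layout, and worth taking a backup first): convert the data, metadata and system chunk profiles down to single, which lifts the two-device minimum of raid1, then remove the bad device. `btrfs device remove` migrates any remaining extents off the device being removed, so the surviving data ends up on the drive that stays. Written here as a dry run; the device and mount point are the ones from the question.

```shell
#!/bin/sh
# Dry-run sketch of shrinking a btrfs RAID1 pair back to one device.
# /dev/sda3 is the failing device and /media the mount point, as in the
# question. Set RUN= to execute for real (needs root; back up first!).
RUN=echo
# Convert data, metadata and system chunks to the single profile.
# -f is required because this reduces metadata redundancy:
$RUN btrfs balance start -f -dconvert=single -mconvert=single -sconvert=single /media
# The raid1 two-device minimum no longer applies, so the failing device
# can now be removed; btrfs migrates remaining extents off it:
$RUN btrfs device remove /dev/sda3 /media
```

Once down to one device, metadata can be converted back to the usual dup profile with another balance (`-mconvert=dup`) if desired.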

BTRFS-RESCUE(8) BTRFS BTRFS-RESCUE(8)

NAME

btrfs-rescue — recover a damaged btrfs filesystem

SYNOPSIS

btrfs rescue <subcommand> <args>

DESCRIPTION

btrfs rescue is used to try to recover a damaged btrfs
filesystem.

SUBCOMMAND

chunk-recover [options]
<device>
Recover the chunk tree by scanning the devices

Options

-y
assume an answer of yes to all questions.
-h
help.
-v
(deprecated) alias for global -v option

NOTE:

Since chunk-recover will scan the whole device, it
will be very slow especially executed on a large device.

fix-device-size
<device>
fix device size and super block total bytes values that do not match

Kernel 4.11 started checking the device size more strictly, and this might mismatch the stored value of total bytes. See the exact error
message below. Newer kernels will refuse to mount a filesystem where
the values do not match. This error is not fatal and can be fixed. This
command will fix the device size values if possible.

BTRFS error (device sdb): super_total_bytes 92017859088384 mismatch with fs_devices total_rw_bytes 92017859094528

The mismatch may also exhibit as a kernel warning:

WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
clear-uuid-tree
<device>
Clear UUID tree, so that kernel can re-generate it at next read-write
mount.

Since kernel v4.16 more sanity checks are performed, and
sometimes non-critical trees like the UUID tree can cause problems and
reject the mount. In such a case, clearing the UUID tree may make the
filesystem mountable again without much risk, as it's rebuilt from
other trees.

super-recover [options]
<device>
Recover bad superblocks from good copies.

Options

-y
assume an answer of yes to all questions.
-v
(deprecated) alias for global -v option
zero-log
<device>
clear the filesystem log tree

This command will clear the filesystem log tree. This may fix
a specific set of problems when the filesystem mount fails due to log
replay. See below for sample stack traces that may show up in the system
log.

The common case where this happens was fixed a long time ago,
so it is unlikely that you will see this particular problem, but the
command is kept around.

NOTE:

Clearing the log may lead to loss of changes that were
made since the last transaction commit. This may be up to 30 seconds (default
commit period) or less if the commit was implied by other filesystem
activity.

One can determine whether zero-log is needed according to
the kernel backtrace:

? replay_one_dir_item+0xb5/0xb5 [btrfs]
? walk_log_tree+0x9c/0x19d [btrfs]
? btrfs_read_fs_root_no_radix+0x169/0x1a1 [btrfs]
? btrfs_recover_log_trees+0x195/0x29c [btrfs]
? replay_one_dir_item+0xb5/0xb5 [btrfs]
? btree_read_extent_buffer_pages+0x76/0xbc [btrfs]
? open_ctree+0xff6/0x132c [btrfs]

If the errors look like the above, then zero-log should be used
to clear the log and the filesystem may be mounted normally again. The
keywords to look for are 'open_ctree', which indicates the failure happens during mount, and
function names that contain replay, recover or
log_tree.

EXIT STATUS

btrfs rescue returns a zero exit status if it succeeds. Non-zero
is returned in case of failure.

AVAILABILITY

btrfs is part of btrfs-progs. Please refer to the
documentation at https://btrfs.readthedocs.io or wiki
http://btrfs.wiki.kernel.org for further information.

SEE ALSO

btrfs-check(8), btrfs-scrub(8),
mkfs.btrfs(8)

This document (7018181) is provided subject to the disclaimer at the end of this document.

SUSE Linux Enterprise Server 15
SUSE Linux Enterprise Server 12
SUSE Linux Enterprise Server 11

NOTE: This is a live document. Content may change without further notice!

Filesystem errors are not uncommon, yet they need to be resolved to ensure a safe and stable system.
This document concentrates on errors seen with the BTRFS filesystem on SUSE Linux Enterprise.

Note that with some of these errors we have to look in two directions:
What actually caused the corruption?
Is there a bug in the btrfs tool or kernel driver that prevents fixing that corruption?

Repairing a filesystem does not necessarily mean recovering the data; it means fixing the filesystem itself, not its content, at least not in every case.

Let’s start with some best practices first.
Whenever a BTRFS filesystem contains errors, this TID is a good starting point.

Let's have a look at some typical errors seen in the past:

WARNING: CPU: 2 PID: 452 at ../fs/btrfs/extent-tree.c:3731 btrfs_free_reserved_data_space_noquota+0xe8/0x100 [btrfs]()

The good thing is, it's a WARNING, not a fatal error.
WARNINGs like this one, e.g. regarding quota, are typically runtime-only issues that are fixed by BTRFS after the WARNING is issued. Not a serious problem.

Yet, such an issue should be reported to SUSE Support for closer examination.

If you see a message like:

BTRFS: Transaction aborted (error -2)

followed by a stack trace which looks like:

[<ffffffffa041277b>] __btrfs_abort_transaction+0x4b/0x120 [btrfs]
[<ffffffffa0445f87>] __btrfs_unlink_inode+0x367/0x3c0 [btrfs]
[<ffffffffa04499e7>] btrfs_unlink_inode+0x17/0x40 [btrfs]
[<ffffffffa0449a76>] btrfs_unlink+0x66/0xb0 [btrfs]
kernel: BTRFS warning (device sdb3): __btrfs_unlink_inode:3802: Aborting unused transaction(No such entry).

and the filesystem mounted read only, a possible way to fix this issue is to try:

mount -t btrfs -o recovery,ro /dev/<device_name> /<mount_point>

If this does not work, perform a backup then use the latest SUSE kernel and btrfs tools to start the repair procedure.

WARNING: Using '--repair' can further damage a filesystem instead of helping if it can't fix your particular issue.

It is extremely important that you ensure a backup has been created before invoking '--repair'. If in any doubt, open a support request first, before attempting a repair. Use this flag at your own risk. If you do not perform a backup, or use an old kernel and old btrfs tools to attempt the repair, you may cause your data to be lost permanently. Use the latest quarterly update for the latest version of SUSE Linux Enterprise Server. For example, at the time of this writing that would be SLE-15-SP2-Online-x86_64-QU2-Media1.iso or SLE-15-SP2-Full-x86_64-QU2-Media1.iso.

btrfs check --repair /dev/<device_name>
btrfs scrub start -Bf /dev/<device_name>

As a last resort cleaning the transaction log can be done with:

btrfs rescue zero-log /dev/<device_name>

More fatal issues are seen if the filesystem spits out tons of messages into the logs, slows down considerably or even goes readonly.

If the root filesystem is affected, reboot the system into an independent rescue system from DVD, ISO image, USB pendrive, etc. Use the latest available rescue system; the more recent, the better.

What to do if:

  • A bad tree root is found at mount time: use "-o recovery". This attempts to autocorrect that error.
  • Weird ENOSPC issues seen: mount with "-o clear_cache", which will drop the btrfs cache.
  • Quota issues prevent mounting: needs the latest available btrfsprogs to fix. See Section "Additional information".
  • Quota issues seen during normal operation: run 'btrfs quota rescan'.
  • Only if everything else fails, run 'btrfs check' and research whether repair could possibly fix the issue.
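The bullet points above map to commands roughly as follows. This is a dry-run sketch: the device and mount point are placeholders, and note that on recent kernels the 'recovery' mount option has been renamed 'usebackuproot'.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps listed above. Placeholders:
# /dev/sdX2 (btrfs device) and /mnt (mount point). Set RUN= to execute.
RUN=echo
# Bad tree root at mount time - try the backup tree roots, read-only first
# ("recovery" on older kernels, "usebackuproot" on newer ones):
$RUN mount -t btrfs -o ro,recovery /dev/sdX2 /mnt
# Weird ENOSPC issues - drop the free space cache on the next mount:
$RUN mount -t btrfs -o clear_cache /dev/sdX2 /mnt
# Quota issues during normal operation:
$RUN btrfs quota rescan /mnt
# Last resort - read-only check first; --repair only after a backup:
$RUN btrfs check /dev/sdX2
```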

WARNING: If in doubt, open a case with Support. 'btrfs check --repair' run with a version which can't fix the particular issue might make things worse.

In some cases it may be necessary to use a more recent btrfs tools version from the latest service pack for the major version of SLE that you are using to repair a damaged filesystem from within that OS.

Example: For SLE15 GA, SP1 or SP2 it may work out to use the latest SP2 version of btrfsprogs.

Download the SP2 QU2 ISO image from the customer center and mount it to the /mnt directory on the system with the broken filesystem.

Extract the btrfs tool from the btrfsprogs RPM:

rpm2cpio /mnt/Module-Basesystem/x86_64/btrfsprogs-4.19.1-8.6.2.x86_64.rpm | cpio -id ./usr/sbin/btrfs

Then use ./usr/sbin/btrfs check --repair /dev/<defective btrfs device>

NOTE: This only works for a btrfs filesystem which is not mounted, it doesn’t work for the root filesystem of a running system.
To repair that, reboot the system and boot the rescue system from latest quarterly update ISO for the latest major release of SUSE Linux Enterprise Server.

Further: as said above, use with care. If in doubt, run "./usr/sbin/btrfs check /dev/<defective btrfs device>" (without --repair) and send the output to SUSE Support for advice.

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented «AS IS» WITHOUT WARRANTY OF ANY KIND.

I have a weird issue with a kernel image that broke during uninstallation. It looks like a hard drive error; however, I am not 100% sure.

I could maybe ignore this, but as it is related to an installed Linux image, I can no longer remove that image. Running apt upgrade now returns:

Fetched 54,0 MB in 9s (5.581 kB/s)                                                                                                                                                                                        
(Reading database ... 652815 files and directories currently installed.)
Removing linux-image-4.14.13-041413-generic (4.14.13-041413.201801101001) ...
dpkg: error processing package linux-image-4.14.13-041413-generic (--remove):
 unable to securely remove '/lib/modules/4.14.13-041413-generic/kernel/fs/ntfs/ntfs.ko.dpkg-tmp': Input/output error
Errors were encountered while processing:
 linux-image-4.14.13-041413-generic

And it simply aborts the upgrade. I could not get apt to stop trying to remove this package with any advice given in this community. There are always input/output errors popping up.
I tried to run smartmontools on the SSD (it is an NVMe device) but it simply returns nothing. It doesn't even really recognize the device (unknown model), even though I am using a recent version of smartmontools.
Any advice would be highly appreciated. I will probably soon send the SSD back to the seller, as I cannot work with a broken state like this.

dmesg:

[ 7505.464153] BTRFS error (device dm-0): bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 0, gen 4
[ 7505.464156] BTRFS error (device dm-0): unable to fixup (regular) error at logical 49416634368 on dev /dev/dm-0
[ 7505.464159] BTRFS error (device dm-0): bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 0, gen 5
[ 7505.464160] BTRFS error (device dm-0): unable to fixup (regular) error at logical 49416552448 on dev /dev/dm-0
[ 7505.465547] BTRFS warning (device dm-0): checksum/header error at logical 49416962048 on dev /dev/dm-0, sector 44088704: metadata leaf (level 0) in tree 257
[ 7505.465549] BTRFS warning (device dm-0): checksum/header error at logical 49416962048 on dev /dev/dm-0, sector 44088704: metadata leaf (level 0) in tree 257
[ 7505.465551] BTRFS error (device dm-0): bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 0, gen 6
[ 7505.465554] BTRFS error (device dm-0): unable to fixup (regular) error at logical 49416962048 on dev /dev/dm-0
[ 7611.150366] BTRFS error (device dm-0): parent transid verify failed on 49416634368 wanted 192617 found 340661
[ 7611.150569] BTRFS error (device dm-0): parent transid verify failed on 49416634368 wanted 192617 found 340661
[ 7616.836634] BTRFS error (device dm-0): parent transid verify failed on 49416634368 wanted 192617 found 340661
[ 7616.836837] BTRFS error (device dm-0): parent transid verify failed on 49416634368 wanted 192617 found 340661

btrfs scrub:

scrub status for 59deeb9a-a07e-4aaa-91d6-dd5a12d7959c
    scrub started at Wed Feb 14 08:50:48 2018 and finished after 00:00:08
    total bytes scrubbed: 22.60GiB with 3 errors
    error details: verify=3
    corrected errors: 0, uncorrectable errors: 3, unverified errors: 0
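Since smartmontools did not recognize the NVMe device, two other avenues are worth trying (a dry-run sketch; `/dev/nvme0` is a placeholder device name): older smartctl releases need the transport type spelled out with `-d nvme`, and the nvme-cli package reads the drive's SMART log through the NVMe admin interface directly.

```shell
#!/bin/sh
# Dry-run sketch: read SMART health from an NVMe SSD. /dev/nvme0 is a
# placeholder device name. Set RUN= to execute for real (needs root).
RUN=echo
# Older smartctl releases need the transport type spelled out:
$RUN smartctl -d nvme -a /dev/nvme0
# Or use nvme-cli, which queries the NVMe admin interface directly:
$RUN nvme smart-log /dev/nvme0
# In real output, look at media_errors, critical_warning and
# percentage_used to judge the drive's health.
```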

Uh-oh, looks like the problem is back and worse than before. As background: I swapped the SAS cable attached to my IBM M1015 controller (in JBOD non-IT mode) and it seemed to be OK. But after around 24 hrs, the same thing happened. I also noticed that the Samsung 850 EVO was intermittently running hot (46-50 degrees).

My mistake was assuming that rebooting the server would be fine. Upon rebooting, I didn't check whether the Samsung 850 was detected (it wasn't, and Cache 2 was 'unassigned'). I started the array without knowing and found out an hour later, when the cache pool had been rebalancing. I did a safe shutdown and attached my Samsung 850 directly via SATA to my motherboard controller instead. After a second boot, the drive showed up and I tried adding it back to the cache. But there was a warning saying that the drive would be wiped.

So what I did was start the array again and mount the Samsung 850 via the Unassigned Devices plugin. I then checked what files were on the Cache 2 drive: appdata, system, downloads, etc. When I tried to copy the files from Cache 2 to the array, I noticed all my shares had disappeared. Also, Cache 1 shows a file system error on the 'Main' tab. I then checked the individual array disks via Midnight Commander, and fortunately the folders/files are there. So now I'm facing several problems, to summarize:

1) Constantly failing Cache 2 drive (Samsung 850 EVO 1TB)

-did BTRFS scrub and the disk was fine after the reboots

-connected originally via IBM M1015 JBOD SAS then moved to SATA MB controller

-in a drive pool with a slow Crucial M4 SSD (256GB); maybe the speed difference in the pooled disks caused issues?

-perhaps I fixed the drop outs after swapping the SSD controller?

-or were the drops caused by high SSD temps?

2) Accidentally started array with unassigned Cache 2 and had a partial rebalance

-in the unRAID main tab, it still shows that the cache pool is 1.3TB in size, i.e. the total of the 2 devices, although the Samsung was dropped

-how do I safely add it back into the pool without wiping the data?

3) Shares all disappeared

-/mnt/users is gone, but the individual disks are still there

I guess one question is whether these issues are all linked. And is there a way to quickly and safely add my dropped Cache 2 SSD back into the pool? Many thanks again for the help!

recover a damaged btrfs filesystem

Examples (TL;DR)

  • Rebuild the filesystem metadata tree (very slow): sudo btrfs rescue chunk-recover path/to/partition
  • Fix device size alignment related problems (e.g. unable to mount the filesystem with a super total bytes mismatch): sudo btrfs rescue fix-device-size path/to/partition
  • Recover a corrupted superblock from correct copies (recover the root of the filesystem tree): sudo btrfs rescue super-recover path/to/partition
  • Recover from interrupted transactions (fixes log replay problems): sudo btrfs rescue zero-log path/to/partition
  • Create a /dev/btrfs-control control device when mknod is not installed: sudo btrfs rescue create-control-device

