ZFS 是最早由昇陽電腦開發,並在 2005 年發布的先進檔案系統。
ZFS 被描述為「終極檔案系統」,穩定、快速、安全並面向未來。
ZFS 的特點包括:儲存池(被稱為 "zpool" 的整合卷管理系統)、寫時複製、快照、資料完整性校驗和自動修復(scrubbing)、RAID-Z、最大 16 Exabyte 檔案大小,以及最大 256×10¹⁵ Zettabyte 儲存,且對檔案系統(資料集)或檔案的數量沒有限制 [1]。
ZFS 採用通用開發與散布許可證(CDDL)授權,其與 GPL 不相容,因此 ZFS 不可能被納入 Linux 核心中。然而,這並不妨礙第三方開發者開發並發布原生的 Linux 核心模組,比如 OpenZFS (以前被稱為 ZFS on Linux)。
由 ZFS 不包含於核心中產生的問題:
- OpenZFS 專案必須主動跟上 Linux 核心版本。在 OpenZFS 釋出穩定版本後,由 Arch ZFS 維護者進行發布。
- 有時會因為 OpenZFS 不支援新核心版本而無法正常進行滾動更新。
安裝
安裝 ZFS 核心模組有兩種方式:一種是安裝適用於特定核心版本的包,另一種是使用為已安裝的核心構建模組的 DKMS 包。具體步驟請參考下方。
ZFS 的使用者空間工具由 zfs-utilsAUR 提供,它是所有 ZFS 核心模組軟體包的依賴。
核心特定包
從 archzfs 倉庫或 Arch 使用者倉庫安裝:
- zfs-linuxAUR —— 適用於 linux包 的穩定版本。
- zfs-linux-gitAUR —— 適用於 linux包 的開發版本(支援更新的核心版本)。
- zfs-linux-ltsAUR —— 適用於 linux-lts包 的穩定版本。
- zfs-linux-hardenedAUR —— 適用於 linux-hardened包 的穩定版本。
- zfs-linux-zenAUR —— 適用於 linux-zen包 的穩定版本。
通過在命令列中執行 zpool status 來測試安裝是否成功。如果出現 "insmod" 錯誤,請嘗試執行 depmod -a。
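例如:
$ zpool status
# depmod -a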
DKMS
為了在每次核心升級時自動重新編譯 ZFS 模組,使用者可以使用 DKMS。
從 archzfs 倉庫或 Arch 使用者倉庫安裝:
- zfs-dkmsAUR —— 支援動態核心模組的穩定版本。(可能與最新的穩定版核心不相容,建議使用 linux-lts包 核心)
- zfs-dkms-staging-gitAUR —— 獲得 zfs 穩定分支上的最新修補程式,以及對 Arch 中最新核心軟體包的後向支援。
- zfs-dkms-gitAUR —— 支援動態核心模組的開發版本。
可以將這些軟體包加入 pacman.conf 中的 IgnorePkg 條目,以防在進行定期更新時升級這些軟體包。要編譯上述 DKMS 包提供的 ZFS 模組,需同時安裝系統核心所對應的標頭檔包(例如 linux包 對應的 linux-headers包,linux-lts包 對應的 linux-lts-headers包,等等)。在 DKMS 包或核心更新後,DKMS pacman hook 會讓核心模組自動重新進行編譯。
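下面是一個假設性的 /etc/pacman.conf 片段,僅作示意(軟體包名稱請按實際安裝的包調整):
/etc/pacman.conf
[options]
IgnorePkg = zfs-dkms zfs-utils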
根分割區為 ZFS
嘗試使用 ZFS
如果有使用者希望在不會造成資料遺失的情況下,用諸如 ~/zfs0.img、~/zfs1.img、~/zfs2.img 等簡單檔案作為虛擬塊裝置(在 ZFS 術語中被稱為 VDEV)來試驗 ZFS,可以參閱嘗試使用 ZFS 文章。這篇文章涵蓋了一些常見的任務,如建立一個 RAIDZ 陣列、故意破壞資料並恢復、快照資料集等。
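下面是一個極簡的示意流程,假設使用三個 2 GiB 的稀疏映像檔,且儲存池名 test 與路徑僅作示例(試驗結束後可用 zpool destroy test 清理):
# truncate -s 2G /root/zfs0.img /root/zfs1.img /root/zfs2.img
# zpool create test raidz /root/zfs0.img /root/zfs1.img /root/zfs2.img
# zpool status test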
組態
ZFS 的開發者認為,ZFS 是一個「零管理」的檔案系統;因此,組態 ZFS 非常容易。組態主要通過 zfs 和 zpool 兩個命令完成。
自動啟動
為了達到ZFS所謂的「零管理」狀態,你可能會希望在系統啟動時自動匯入儲存池。
- zfs.target —— 其它單元的依賴參考點
- zfs-import.target —— 提供正確的啟動順序 [3]
- zfs-import-cache.service —— 用於匯入儲存池
zfs-import-cache.service 通過讀取 /etc/zfs/zpool.cache 檔案來匯入儲存池。為每一個你想通過 zfs-import-cache.service 自動匯入的儲存池執行如下命令:
# zpool set cachefile=/etc/zfs/zpool.cache pool
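隨後啟用相關單元與目標,例如(單元名稱以實際安裝的 OpenZFS 版本為準,僅作示意):
# systemctl enable zfs-import-cache.service zfs-import.target zfs.target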
要在不使用 /etc/fstab
的前提下自動掛載 ZFS 檔案系統,有兩種選擇:
- 啟用 zfs-mount.service 服務
- 使用 zfs-mount-generator
使用 zfs-mount.service 服務
為了在啟動時自動掛載 ZFS,你需要啟用 zfs-mount.service。
- 此方法對於單獨的 /var 資料集無效,因為它不能被提前掛載,你應該改用 zfs-mount-generator 方式。更多資訊請參見 OpenZFS issue #3768。
- It appears that the ZFS module is loaded too late for using this method, see BBS#274044 and Talk:ZFS#zfs hook in mkinitcpio.conf. The workaround is to use the zfs mkinitcpio hook.
使用 zfs-mount-generator
你也可以用 zfs-mount-generator 在啟動時為你的 ZFS 檔案系統生成 systemd 掛載單元。systemd 會根據掛載單元自動掛載檔案系統,無需使用 zfs-mount.service
。具體操作如下:
- 建立 /etc/zfs/zfs-list.cache 目錄。
- 啟用必要的 ZFS Event Daemon(ZED)指令碼(稱為 ZEDLET),以建立可掛載的 ZFS 檔案系統列表。(如果使用的是 OpenZFS >= 2.0.0,這個連結會被自動建立。)
# ln -s /usr/lib/zfs/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
- 啟用 zfs.target 目標,並啟動/啟用 ZFS Event Daemon(zfs-zed.service)。這個服務負責執行上一步提到的指令碼。
- 在 /etc/zfs/zfs-list.cache 目錄下建立一個以你的儲存池命名的空白檔案。只有當這個檔案存在時,ZEDLET 才會更新檔案系統列表。
# touch /etc/zfs/zfs-list.cache/<pool-name>
- 檢查檔案 /etc/zfs/zfs-list.cache/<pool-name> 的內容。如果該檔案為空,確保 zfs-zed.service 處於執行狀態,並執行以下命令修改某個檔案系統的 canmount 屬性:
zfs set canmount=off zroot/fs1
修改這個屬性會讓 ZFS 觸發一個由 ZED 擷取的事件,ZED 繼而執行 ZEDLET 指令碼來更新 /etc/zfs/zfs-list.cache 中的檔案。如果 /etc/zfs/zfs-list.cache 中的檔案已經更新,可以用如下命令將該檔案系統的 canmount 屬性改回:
zfs set canmount=on zroot/fs1
你需要為系統裡的每一個 ZFS 儲存池在 /etc/zfs/zfs-list.cache 目錄下建立對應的檔案。確保已經啟用了相關單元和目標。
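例如(僅作示意,單元名稱以實際版本為準):
# systemctl enable zfs-zed.service zfs.target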
儲存池
在建立 ZFS 檔案系統之前,並不一定要先給它分割區。推薦將 ZFS 指向整個硬碟 (例如 /dev/sdx
而不是像 /dev/sdx1
的單個分割區),這將自動建立一個 GPT(GUID 分割區表),並在磁碟末尾為傳統引導程式添加一個 8 MB 的保留分割區。但是,如果你想要建立具有不同冗餘屬性的多個卷,你可以指定一個分割區,或現有檔案系統中的一個檔案。
對於具有 4 KiB 磁區大小的先進格式化磁碟,建議使用 12 的 ashift 值以獲得最佳效能。為了與傳統系統相容,先進格式化磁碟會類比 512 位元組的磁區大小,這導致 ZFS 有時會為 ashift 選項使用一個不理想的值。一旦池被建立,改變 ashift 選項的唯一方法就是重新建立池。與此同時,使用 12 的 ashift 值也會減少可用容量。參見 OpenZFS FAQ: 效能考慮、先進格式化磁碟,以及 ZFS 和先進格式化磁碟。
辨識磁碟
OpenZFS 建議在建立少於 10 個裝置的 ZFS 儲存池時使用裝置 ID [4]. 使用塊裝置持久化命名#通過 id 和 通過路徑來確定要用於建立 ZFS 池的驅動器列表。
磁碟 ID 應該類似於以下內容:
$ ls -lh /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0JKRR -> ../../sdc
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0JTM1 -> ../../sde
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0KBP8 -> ../../sdd
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0KDGY -> ../../sdb
如果使用核心裝置名稱(/dev/sda、/dev/sdb、...)建立儲存池,ZFS 可能會在啟動時間歇性地無法檢測到 zpool。
使用 GPT 標籤
通過使用 GPT 分割區,磁碟標籤和 UUID 也可以用於 ZFS 掛載。ZFS 驅動器有標籤,但 Linux 在啟動時無法讀取這些標籤。與 MBR 分割區不同,GPT 分割區直接支援 UUID 和標籤,與分割區內的格式無關。讓 ZFS 使用磁碟分割區而不是整個磁碟可帶來兩個好處。作業系統不會從 ZFS 已寫入分割區磁區的任何不可預測資料中生成偽分割區號,而且如果有需要,你還可以很容易地給固態硬碟組態預留空間 (OP),並給機械硬碟組態少量預留空間,以確保 zpool 可以將磁區數略微不同的型號替換到你的鏡像。這樣,就可以零成本地用現有的技術和工具來組態與控制 ZFS。
使用 gdisk 將全部或部分驅動器劃分為單一分割區。gdisk 不會自動為分割區命名,所以如果需要分割區標籤,請使用 gdisk 命令 "c" 為分割區添加標籤。比起 UUID,你可能更喜歡標籤的一些原因是:標籤容易控制,標籤可以使你每個磁碟的用途一目了然,而且標籤更短、更容易輸入,這些都是在伺服器宕機和高壓力時的優勢。GPT 分割區標籤有足夠的空間,可以儲存大多數國際字元(參見 GUID 磁碟分割表的分割區表項,LBA 2–33),允許以有組織的方式對大型資料池進行標記。
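例如,也可以用 sgdisk 以非互動方式建立並命名一個佔據整塊磁碟的分割區(其中 zfsdata1 與 /dev/sdX 僅為示例):
# sgdisk --new=1:0:0 --change-name=1:zfsdata1 /dev/sdX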
使用 GPT 分割區的驅動器具有類似如下所示的標籤和 UUID:
$ ls -l /dev/disk/by-partlabel
lrwxrwxrwx 1 root root 10 Apr 30 01:44 zfsdata1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Apr 30 01:44 zfsdata2 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Apr 30 01:59 zfsl2arc -> ../../sda1
$ ls -l /dev/disk/by-partuuid
lrwxrwxrwx 1 root root 10 Apr 30 01:44 148c462c-7819-431a-9aba-5bf42bb5a34e -> ../../sdd1
lrwxrwxrwx 1 root root 10 Apr 30 01:59 4f95da30-b2fb-412b-9090-fc349993df56 -> ../../sda1
lrwxrwxrwx 1 root root 10 Apr 30 01:44 e5ccef58-5adf-4094-81a7-3bac846a885f -> ../../sdc1
$ UUID=$(lsblk --noheadings --output PARTUUID /dev/sdXY)
建立 ZFS 池
要建立 ZFS 池,請使用如下命令:
# zpool create -f -m <mount> <pool> [raidz(2|3)|mirror] <ids>
建立儲存池前可先閱讀#先進格式化磁碟一節,確認是否需要在建立時指定 ashift。
- create: 建立池的子命令。
- -f: 強制建立池。這是為了忽略「EFI 標籤錯誤」。參見不包含 EFI 標籤。
- -m: 池的掛載點。如果沒有指定掛載點,池將被掛載到 /<pool>。
- pool: 池的名稱。
- raidz(2|3)|mirror: 這是從裝置清單中建立出的虛擬裝置的類型,RAID Z 是單盤奇偶校驗(與 RAID 5 類似),RAID Z2 是 2 盤奇偶校驗(與 RAID 6 類似),RAID Z3 是 3 盤奇偶校驗。另外還有鏡像,它類似於 RAID 1 或 RAID 10,但不限於 2 個裝置。如果不指定裝置類型,每個裝置將被添加為一個與 RAID 0 類似的 vdev。在建立之後,可以在每個單盤 vdev 上添加一個裝置來轉換為鏡像,這對於遷移資料很有用。
- ids: 池中包含的驅動器或分割區的 ID。
使用單個 RAID-Z vdev 建立池:
# zpool create -f -m /mnt/data bigdata \
               raidz \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1
使用兩個鏡像 vdev 建立池:
# zpool create -f -m /mnt/data bigdata \
               mirror \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
               mirror \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1
先進格式化磁碟
在池建立時,應始終使用 ashift=12, 但具有 8K 磁區的固態硬碟除外(此時應使用 ashift=13)。在 512 位元組硬碟構成的 vdev 上使用 4k 磁區不會導致效能問題,但不能反過來在 4k 硬碟上使用 512 位元組磁區。鑑於 ashift 無法在池建立後更改,建議即使在純 512 位元組硬碟構成的儲存池上也使用 4k 磁區,以保證對未來使用 4k 硬碟進行替換或擴容的相容性。由於探測 4k 硬碟的機制並不是很準確,在建立儲存池時應顯式指定 -o ashift=12
選項。更多資訊請參考 OpenZFS FAQ。
使用 ashift=12 及單個 raidz vdev 建立儲存池:
# zpool create -f -o ashift=12 -m /mnt/data bigdata \
               raidz \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1
建立相容 GRUB 的儲存池
預設情況下,zpool create 會為儲存池啟用所有特性。如果使用 GRUB 時將 /boot
放置到了 ZFS 下,就需要將 GRUB 不支援的特性全部禁用,否則 GRUB 將無法讀取池中的資料。ZFS 內建了相容性檔案(參見 /usr/share/zfs/compatibility.d
),可以幫助建立僅包含部分特性集的儲存池,其中就包括了 grub2。
可以通過如下命令建立包含部分特性集的儲存池:
# zpool create -o compatibility=grub2 $POOL_NAME $VDEVS
驗證儲存池狀態
如果命令成功執行,則不會有任何輸出。使用 mount 命令會顯示儲存池已被掛載。使用 zpool status
會顯示儲存池已被建立:
# zpool status -v
  pool: bigdata
 state: ONLINE
  scan: none requested
config:

	NAME                                       STATE     READ WRITE CKSUM
	bigdata                                    ONLINE       0     0     0
	  raidz1-0                                 ONLINE       0     0     0
	    ata-ST3000DM001-9YN166_S1F0KDGY-part1  ONLINE       0     0     0
	    ata-ST3000DM001-9YN166_S1F0JKRR-part1  ONLINE       0     0     0
	    ata-ST3000DM001-9YN166_S1F0KBP8-part1  ONLINE       0     0     0
	    ata-ST3000DM001-9YN166_S1F0JTM1-part1  ONLINE       0     0     0

errors: No known data errors
這時候建議通過重新啟動來驗證下 ZFS 儲存池是否在啟動時會被掛載。在向儲存池傳輸任何資料前,最好先處理完所有報錯。
匯入通過 ID 建立的儲存池
如果儲存池沒有成功自動掛載,就需要手動進行匯入。此時切勿直接使用 zpool import pool!這一操作將使用 /dev/sd? 進行儲存池匯入,會導致硬碟順序改變後出現問題,甚至可能在插入 USB 儲存裝置後重新啟動就被觸發。使用下列任一命令來匯入儲存池,以與建立儲存池時的方式保持一致:
# zpool import -d /dev/disk/by-id bigdata
# zpool import -d /dev/disk/by-partlabel bigdata
# zpool import -d /dev/disk/by-partuuid bigdata
最後檢查儲存池的狀態:
# zpool status -v bigdata
刪除儲存池
ZFS 可輕鬆地刪除已掛載的儲存池,並移除所有關於 ZFS 裝置的元資料。
要刪除儲存池:
# zpool destroy <pool>
要刪除資料集:
# zfs destroy <pool>/<dataset>
接下來檢查下儲存池狀態:
# zpool status
no pools available
匯出儲存池
如果要在另一個系統上使用儲存池,就要先將其匯出。另外,如果儲存池是從 archiso 匯入的,也需要先將其匯出,因為 archiso 和使用中系統的 hostid 不同。如果儲存池沒有被匯出,那 zpool 會拒絕將其匯入。可以使用 -f
進行強制匯入,但該操作並不規範。
嘗試匯入未被匯出的儲存池將出現報錯,稱儲存池在被其它系統占用。該錯誤可能會在啟動階段出現,並將系統置入 busybox 控制台,而修復需要使用 archiso 進行操作:要麼匯出儲存池,要麼在核心啟動選項中添加 zfs_force=1
參數(不建議)。詳細資訊請查閱 #On boot the zfs pool does not mount stating: "pool may be in use from other system"。
通過以下命令匯出儲存池:
# zpool export <pool>
擴充現有 zpool
可以通過如下命令將一個裝置(單個分割區或硬碟)添加到現有 zpool:
# zpool add <pool> <device-id>
可以通過如下命令匯入由多個裝置構成的儲存池:
# zpool import -d <device-id-1> -d <device-id-2> <pool>
或者更簡單的:
# zpool import -d /dev/disk/by-id/ <pool>
添加裝置為鏡像
可以將裝置(分割區或硬碟)作為鏡像附加到現有裝置上(與 RAID1 類似):
# zpool attach <pool> <device-id|mirror> <new-device-id>
你可以將新裝置添加到現有鏡像 vdev 中(例如從 2 盤鏡像變為 3 盤鏡像)或將其附加到單個裝置上以構成新的鏡像 vdev。
重新命名 zpool
可以用以下兩步重新命名已建立的 zpool:
# zpool export oldname
# zpool import oldname newname
更換掛載點
可以通過如下命令修改 zpool 的掛載點:
# zfs set mountpoint=/foo/bar poolname
升級 zpool
在使用更新版本的 zfs
模組時,zpools 可能會顯示一條更新提示:
$ zpool status -v
  pool: bigdata
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(5) for details.
- 低版本的 zfs 模組無法匯入高版本的 zpool。
- 在涉及重要資料時,建議在執行 zpool upgrade 前先建立一份備份。
使用如下命令來升級名為 bigdata 的 zpool:
# zpool upgrade bigdata
使用如下命令來升級所有 zpool:
# zpool upgrade -a
建立資料集
相對於在儲存池中建立資料夾,使用者可以選擇在儲存池中建立資料集。除了快照外,資料集還提供了如配額控制等更強大的控制功能。在建立並掛載資料集前,需確保儲存池中不存在與資料集同名的資料夾。以下命令可用於建立資料集:
# zfs create <存储池名>/<数据集名>
可以對資料集應用 ZFS 特定屬性。例如,你可以對資料集中的資料夾設定配額限制:
# zfs set quota=20G <存储池名>/<数据集名>/<文件夹>
如需了解更多 ZFS 命令,可查閱 zfs(8) 或 zpool(8)。
原生加密
ZFS 支援如下幾種加密方式:aes-128-ccm、aes-192-ccm、aes-256-ccm、aes-128-gcm、aes-192-gcm 及 aes-256-gcm。當加密設定為 on 時,將使用 aes-256-gcm 進行加密。See zfs-change-key(8) for a description of the native encryption, including limitations.
支援下列幾種金鑰格式:passphrase、raw、hex。
One can also specify/increase the default iterations of PBKDF2 when using passphrase
with -o pbkdf2iters <n>
, although it may increase the decryption time.
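For example, a passphrase-encrypted dataset with a larger iteration count might be created as follows (a sketch; the iteration value and names are only examples):
# zfs create -o encryption=on -o keyformat=passphrase -o pbkdf2iters=1000000 <存储池名>/<数据集名>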
- To import a pool with keys, one needs to specify the
-l
flag, without this flag encrypted datasets will be left unavailable until the keys are loaded. See #Importing a pool created by id. - Native ZFS encryption has been made available in the stable 0.8.0 release or newer. Previously it was only available in development versions provided by packages like zfs-linux-gitAUR, zfs-dkms-gitAUR or other development builds. Users who were only using the development versions for the native encryption, may now switch to the stable releases if they wish.
- The default encryption suite was changed from
aes-256-ccm
toaes-256-gcm
in the 0.8.4 release.
使用如下命令建立通過密碼短語加密的資料集:
# zfs create -o encryption=on -o keyformat=passphrase <存储池名>/<数据集名>
使用金鑰而不是密碼短語進行加密:
# dd if=/dev/random of=/path/to/key bs=32 count=1 iflag=fullblock
# zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///path/to/key <存储池名>/<数据集名>
The easy way to make a key in human-readable form (keyformat=hex
):
# od -Anone -x -N 32 -w64 /dev/random | tr -d '[:blank:]' > /path/to/hex.key
驗證金鑰位置:
# zfs get keylocation <存储池名>/<数据集名>
更改金鑰位置:
# zfs set keylocation=file:///path/to/key <存储池名>/<数据集名>
你也可以下列任意一條命令手動載入金鑰:
# zfs load-key <存储池名>/<数据集名>   # load key for a specific dataset
# zfs load-key -a                       # load all keys
# zfs load-key -r zpool/dataset         # load all keys in a dataset
掛載加密資料集:
# zfs mount <存储池名>/<数据集名>
啟動時解鎖/掛載:systemd
可以使用 systemd 單元在啟動時自動解鎖資料集。例如,可以建立如下服務來解鎖特定的資料集:
/etc/systemd/system/zfs-load-key@.service
[Unit]
Description=Load %I encryption keys
Before=systemd-user-sessions.service zfs-mount.service
After=zfs-import.target
Requires=zfs-import.target
DefaultDependencies=no

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/bash -c 'until (systemd-ask-password "Encrypted ZFS password for %I" --no-tty | zfs load-key %I); do echo "Try again!"; done'

[Install]
WantedBy=zfs-mount.service
接下來為每個加密資料集啟動/啟用該服務 (例如 zfs-load-key@pool0-dataset0.service
)。注意,-
在 systemd 單元中的定義為 /
,詳細資料可參考 systemd-escape(1)
。
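可以用 systemd-escape 直接生成對應的單元名稱,例如:
$ systemd-escape --template=zfs-load-key@.service pool0/dataset0
zfs-load-key@pool0-dataset0.service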
Before=systemd-user-sessions.service ensures that systemd-ask-password is invoked before the local IO devices are handed over to the desktop environment.
另一種方法是載入所有可能用到的金鑰:
/etc/systemd/system/zfs-load-key.service
[Unit]
Description=Load encryption keys
DefaultDependencies=no
After=zfs-import.target
Before=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/zfs load-key -a
StandardInput=tty-force

[Install]
WantedBy=zfs-mount.service
接下來啟動/啟用 zfs-load-key.service
。
登入時解鎖:PAM
If you are not encrypting the root volume, but only the home volume or a user-specific volume, another idea is to wait until login to decrypt it. The advantages of this method are that the system boots uninterrupted, and that when the user logs in, the same password can be used both to authenticate and to decrypt the home volume, so that the password is only entered once.
First set the mountpoint to legacy to avoid having it mounted by zfs mount -a
:
zfs set mountpoint=legacy zroot/data/home
Ensure that it is in /etc/fstab so that mount /home
will work:
/etc/fstab
zroot/data/home /home zfs rw,xattr,posixacl,noauto 0 0
Alternatively, you can keep using ZFS mounts if you use both:
zfs set canmount=noauto zroot/data/home
zfs set org.openzfs.systemd:ignore=on zroot/data/home
The first will stop ZFS automatically mounting it, and the second systemd, but you will still be able to manually (or through the following scripts) mount it. If you have child datasets, org.openzfs.systemd:ignore=on
will be inherited, but you will need to set canmount=noauto
on each as it is not inheritable, otherwise they will try to mount without a mountpoint.
On a single-user system, with only one /home
volume having the same encryption password as the user's password, it can be decrypted at login as follows: first create /usr/local/bin/mount-zfs-homedir
/usr/local/bin/mount-zfs-homedir
#!/bin/bash
set -eu

# $PAM_USER will be the username of the user, you can use it for per-user home volumes.
HOME_VOLUME="zroot/data/home"

if [ "$(zfs get keystatus "${HOME_VOLUME}" -Ho value)" != "available" ]; then
    PASSWORD=$(cat -)
    zfs load-key "${HOME_VOLUME}" <<< "$PASSWORD" || true
fi

# This will also mount any child datasets, unless they use a different key.
echo "$(zfs list -rHo name,keystatus,mounted "${HOME_VOLUME}")" | while IFS=$'\t' read -r NAME KEYSTATUS MOUNTED; do
    if [ "${MOUNTED}" != "yes" ] && [ "${KEYSTATUS}" == "available" ]; then
        zfs mount "${NAME}" || true
    fi
done
do not forget to make it executable; then get PAM to run it by adding the following line to /etc/pam.d/system-auth:
/etc/pam.d/system-auth
auth optional pam_exec.so expose_authtok /usr/local/bin/mount-zfs-homedir
Now it will transparently decrypt and mount the /home volume when you log in anywhere: on the console, via ssh, etc.
SSH
A caveat is that since your ~/.ssh
directory is not mounted, if you log in via ssh, you must use password authentication the first time rather than relying on ~/.ssh/authorized_keys
.
If you do not wish to enable (insecure) password authentication, you can instead move ~/.ssh/authorized_keys
to a new location. Make /etc/ssh/user_config/
and inside it a folder for each user, owned by that user and with 700
permissions. Then move each user's authorized_keys
into their respective folders, and edit the system sshd configuration:
/etc/ssh/sshd_config
AuthorizedKeysFile /etc/ssh/user_config/%u/authorized_keys
Then restart sshd.service
. You can also optionally make a link for each user from ~/.ssh/authorized_keys
to the new location so users can still edit it as they are used to.
This will let you log in, but your home partition will not be mounted, and you will need to do so manually. There are multiple options to work around this:
SSH Key & Password when required
It is possible to set up PAM to only prompt for a password via SSH when it is necessary to decrypt your home partition. You will need to enable both publickey
and keyboard-interactive
authentication methods:
/etc/ssh/sshd_config
PubkeyAuthentication yes
KbdInteractiveAuthentication yes
AuthenticationMethods publickey,keyboard-interactive

## Example of excluding a certain user who does not have an encrypted home directory.
#Match User nohome
#	KbdInteractiveAuthentication no
#	AuthenticationMethods publickey
Note the comma in AuthenticationMethods publickey,keyboard-interactive: it means that both authentication methods are required to log in with SSH. The very similar AuthenticationMethods publickey keyboard-interactive means either one is sufficient, which would let someone bypass your public key auth.
Why keyboard-interactive and not password? password is done client-side, so even if the auth is skipped, the user is still prompted and the password is just thrown away. With keyboard-interactive the user does not get prompted at all when we skip it.
This will mean it asks for the password after validating the key, but using PAM we can stop it asking for the password when not needed. We make a script that will fail when the key is not available to us:
/usr/local/bin/require-encrypted-homedir
#!/bin/bash
set -eu

HOME_VOLUME="zroot/data/home" # You can use $PAM_USER to use the username in the volume for a per-user solution.

if [ "$(zfs get keystatus "${HOME_VOLUME}" -Ho value)" != "available" ]; then
    exit 27 # PAM_TRY_AGAIN
elif [[ "${SSH_AUTH_INFO_0:-""}" =~ ^"publickey " ]]; then
    exit 0
else
    # If this happens, it implies a configuration error: either you are allowing auth without a public
    # key, or have enabled this in a non-SSH PAM service. Both are dangerous and this should block it,
    # but if you see it, fix your configuration.
    exit 3 # PAM_SERVICE_ERR
fi
And make it executable.
Now we want to configure PAM to call this, and skip asking for the password if the script succeeds because we already have the key available. Add this line above the existing auth line(s) you want to skip (all of them unless you have something else set up) for the SSH service:
/etc/pam.d/sshd
auth sufficient pam_exec.so /usr/local/bin/require-encrypted-homedir
Add this to /etc/pam.d/sshd, not /etc/pam.d/system-auth as above: you do not want local users without a public key to be able to skip the password. There is a safeguard in the script against this, but it is still best to be careful.
Public key authentication is handled internally by sshd. This means that the script we are adding here will never be run for private keys and they cannot be skipped; however, we still do a check for defence-in-depth to try and ensure a key has been checked.
With this, you will be prompted for a password only when the key is not loaded.
SSH Key & Password
A simpler option is to just enable both methods, meaning your key still gets checked, but then you have to type the password too, which will decrypt your home partition.
/etc/ssh/sshd_config
PubkeyAuthentication yes
PasswordAuthentication yes
AuthenticationMethods publickey,password
Note the comma in AuthenticationMethods publickey,password: it means that both authentication methods are required to log in with SSH. The very similar AuthenticationMethods publickey password means either one is sufficient, which would let someone bypass your public key auth.
This works (and will not let anyone authenticate with just a password), but has the downside of requiring your password every time.
You can also specify something like:
AuthenticationMethods publickey password,publickey
This allows clients to use either just a public key, or a public key and a password. Which the client will do will be based on the PreferredAuthentications
option. -o PreferredAuthentications=password,publickey
will ask for the password, while -o PreferredAuthentications=publickey
will not. This is more manual than automated fallback, but has less moving parts, and avoids asking you every time if you prefer publickey
by default (you can use host-specific options on clients to simplify setting these options).
交換卷
- 如果您的系統記憶體壓力較大,則不管剩餘多少交換空間可用,將 zvol 用作交換卷都可能會導致檔案系統鎖起。這個問題現在正在 OpenZFS issue #7734 中調查。
- zvol 上的交換空間不支援從休眠中喚醒,如果嘗試從休眠中恢復將會導致儲存池損壞。可能的解決方案見:https://github.com/openzfs/zfs/issues/260#issuecomment-758782144
ZFS 不允許使用交換檔案,但您可以將一個 ZFS 卷 (ZVOL) 用作交換空間。需要注意的是必須將 ZVOL 的塊大小設定為系統的 PAGESIZE,後者可以通過執行 getconf PAGESIZE
命令來獲得(對於 x86_64 系統來說,其預設值為 4KiB)。關閉 ZVOL 上的寫入快取也可以讓系統在低記憶體狀態下更好執行。
建立一個 8 GiB 的 ZFS 卷:
# zfs create -V 8G -b $(getconf PAGESIZE) -o compression=zle \
      -o logbias=throughput -o sync=always \
      -o primarycache=metadata -o secondarycache=none \
      -o com.sun:auto-snapshot=false <pool>/swap
將其格式化為交換空間:
# mkswap -f /dev/zvol/<pool>/swap
# swapon /dev/zvol/<pool>/swap
要將其永久自動掛載,編輯 /etc/fstab
。ZVOLs 支援垃圾回收,這對 ZFS 的塊分配器有潛在幫助,同時當交換空間仍有剩餘時有助於減少其他資料集上的磁碟碎片。
在 /etc/fstab
中添加如下行:
/dev/zvol/<pool>/swap none swap discard 0 0
存取控制列表(Access Control Lists,ACL)
要對資料集使用 ACL,請使用如下命令:
# zfs set acltype=posixacl <nameofzpool>/<nameofdataset>
# zfs set xattr=sa <nameofzpool>/<nameofdataset>
出於效能原因,建議將 xattr 屬性設定為 sa [5]。
鑑於資料集會繼承 ACL 參數,最好是對 zpool 啟用 ACL。預設模式為 restricted
,你可能會需要修改其設定:aclinherit=passthrough
[6];但要注意的是,aclinherit
不影響 POSIX ACL [7]:
# zfs set aclinherit=passthrough <nameofzpool>
# zfs set acltype=posixacl <nameofzpool>
# zfs set xattr=sa <nameofzpool>
Databases
ZFS, unlike most other file systems, has a variable record size, or what is commonly referred to as a block size. By default, the recordsize on ZFS is 128KiB, which means it will dynamically allocate blocks of any size from 512B to 128KiB depending on the size of file being written. This can often help fragmentation and file access, at the cost that ZFS would have to allocate new 128KiB blocks each time only a few bytes are written to.
Most RDBMSes work in 8KiB-sized blocks by default. Although the block size is tunable for MySQL/MariaDB, PostgreSQL, and Oracle database, all three of them use an 8KiB block size by default. For both performance concerns and keeping snapshot differences to a minimum (for backup purposes, this is helpful), it is usually desirable to tune ZFS instead to accommodate the databases, using a command such as:
# zfs set recordsize=8K <pool>/postgres
These RDBMSes also tend to implement their own caching algorithm, often similar to ZFS's own ARC. In the interest of saving memory, it is best to simply disable ZFS's caching of the database's file data and let the database do its own job:
Note that the L2ARC needs primarycache to function, because it is fed with data evicted from primarycache. If you intend to use the L2ARC, do not set the option below, otherwise no actual data will be cached on L2ARC.
# zfs set primarycache=metadata <pool>/postgres
ZFS uses the ZIL for crash recovery, but databases are often syncing their data files to the file system on their own transaction commits anyway. The end result of this is that ZFS will be committing data twice to the data disks, and it can severely impact performance. You can tell ZFS to prefer to not use the ZIL, and in which case, data is only committed to the file system once. However, doing so on non-solid state storage (e.g. HDDs) can result in decreased read performance due to fragmentation (OpenZFS Wiki) -- with mechanical hard drives, please consider using a dedicated SSD as ZIL rather than setting the option below. In addition, setting this for non-database file systems, or for pools with configured log devices, can also negatively impact the performance, so beware:
# zfs set logbias=throughput <pool>/postgres
These can also be done at file system creation time, for example:
# zfs create -o recordsize=8K \
             -o primarycache=metadata \
             -o mountpoint=/var/lib/postgres \
             -o logbias=throughput \
              <pool>/postgres
Please note: these kinds of tuning parameters are ideal for specialized applications like RDBMSes. You can easily hurt ZFS's performance by setting these on a general-purpose file system such as your /home directory.
/tmp
If you would like to use ZFS to store your /tmp directory, which may be useful for storing arbitrarily-large sets of files or simply keeping your RAM free of idle data, you can generally improve performance of certain applications writing to /tmp by disabling file system sync. This causes ZFS to ignore an application's sync requests (eg, with fsync
or O_SYNC
) and return immediately. While this has severe application-side data consistency consequences (never disable sync for a database!), files in /tmp are less likely to be important and affected. Please note this does not affect the integrity of ZFS itself, only the possibility that data an application expects on-disk may not have actually been written out following a crash.
# zfs set sync=disabled <pool>/tmp
Additionally, for security purposes, you may want to disable setuid and devices on the /tmp file system, which prevents some kinds of privilege-escalation attacks or the use of device nodes:
# zfs set setuid=off <pool>/tmp
# zfs set devices=off <pool>/tmp
Combining all of these for a create command would be as follows:
# zfs create -o setuid=off -o devices=off -o sync=disabled -o mountpoint=/tmp <pool>/tmp
Please note, also, that if you want /tmp on ZFS, you will need to mask (disable) systemd's automatic tmpfs-backed /tmp (tmp.mount), else ZFS will be unable to mount your dataset at boot-time or import-time.
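For example:
# systemctl mask tmp.mount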
Transmitting snapshots with ZFS Send and ZFS Recv
It is possible to pipe ZFS snapshots to an arbitrary target by pairing zfs send
and zfs recv
. This is done through standard output, which allows the data to be sent to any file, device, across the network, or manipulated mid-stream by incorporating additional programs in the pipe.
Below are examples of common scenarios:
Basic ZFS Send
First, create a snapshot of some ZFS filesystem:
# zfs snapshot zpool0/archive/books@snap
Now send the snapshot to a new location on a different zpool:
# zfs send -v zpool0/archive/books@snap | zfs recv zpool4/library
The contents of zpool0/archive/books@snap
are now live at zpool4/library
To and from files
First, create a snapshot of some ZFS filesystem:
# zfs snapshot zpool0/archive/books@snap
Write the snapshot to a gzip file:
# zfs send zpool0/archive/books@snap | gzip > /tmp/mybooks.gz
Use zfs send with the -w flag if you wish to preserve encryption during the send.
Now restore the snapshot from the file:
# gzcat /tmp/mybooks.gz | zfs recv -F zpool0/archive/books
Send over ssh
First, create a snapshot of some ZFS filesystem:
# zfs snapshot zpool1/filestore@snap
Next we pipe our "send" traffic over an ssh session running "recv":
# zfs send -v zpool1/filestore@snap | ssh $HOST zfs recv coldstore/backups
The -v
flag prints information about the datastream being generated. If you are using a passphrase or passkey, you will be prompted to enter it.
Incremental Backups
You may wish to update a previously sent ZFS filesystem without retransmitting all of the data over again. Alternatively, it may be necessary to keep a filesystem online during a lengthy transfer, and it is now time to send writes that were made since the initial snapshot.
First, create a snapshot of some ZFS filesystem:
# zfs snapshot zpool1/filestore@initial
Next we pipe our "send" traffic over an ssh session running "recv":
# zfs send -v -R zpool1/filestore@initial | ssh $HOST zfs recv coldstore/backups
Once changes are written, make another snapshot:
# zfs snapshot zpool1/filestore@snap2
The following will send the differences that exist locally between zpool1/filestore@initial and zpool1/filestore@snap2 and create an additional snapshot for the remote filesystem coldstore/backups:
# zfs send -v -R -i zpool1/filestore@initial zpool1/filestore@snap2 | ssh $HOST zfs recv coldstore/backups
Now both zpool1/filestore and coldstore/backups have the @initial and @snap2 snapshots.
On the remote host, you may now promote the latest snapshot to become the active filesystem:
# zfs rollback coldstore/backups@snap2
調校
通用
可以使用參數進一步調整 ZFS 池和資料集。
要檢索當前 ZFS 池的參數狀態,請執行以下操作:
# zfs get all <pool>
要檢索指定資料集的參數狀態,請執行以下操作:
# zfs get all <pool>/<dataset>
要禁用預設啟用的存取時間功能(atime),請執行以下操作:
# zfs set atime=off <pool>
要禁用特定資料集的存取時間功能(atime),請執行以下操作:
# zfs set atime=off <pool>/<dataset>
除了完全關閉 atime 之外,您還可以使用 relatime
。這為ZFS帶來了預設的 ext4/XFS atime 語意,只有在修改或更改時間發生變化時,或者存取時間在過去24小時內沒有變化時,才更新存取時間。這是 atime=off
和 atime=on
之間的折衷。該屬性只在 atime
為 on
時生效:
# zfs set atime=on <pool>
# zfs set relatime=on <pool>
壓縮功能則是對資料的透明壓縮。ZFS 支援數種不同的壓縮演算法,目前預設採用 lz4 。gzip 比較適合用於那些不頻繁寫入並且可壓縮率較高的資料。請參考 OpenZFS Wiki 以獲得更多資訊。
要啟用壓縮,請執行:
# zfs set compression=on <pool>
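也可以明確指定某個資料集使用的壓縮演算法,例如(演算法與資料集名稱僅作示例):
# zfs set compression=lz4 <pool>/<dataset>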
若要將池和/或資料集的屬性重設為預設狀態,請使用 zfs inherit
:
# zfs inherit -rS atime <pool>
# zfs inherit -rS atime <pool>/<dataset>
-r 標誌將遞迴重設儲存池中的所有資料集。
Scrubbing
當 ZFS 在讀取資料的過程中檢測到錯誤時,它會在可能的情況下靜默修復資料,將正確的資料寫回磁碟並記錄紀錄檔,使你可以獲得儲存池中錯誤的概覽。ZFS 沒有 fsck 一類的工具,但提供了稱為 scrubbing 的特性:它會遍歷儲存池中的所有資料,並驗證是否所有塊都可被正常讀取。
要對儲存池執行 scrub:
# zpool scrub <pool>
要中斷執行中的 scrub:
# zpool scrub -s <pool>
多久需要執行一次呢?
根據 Oracle 的部落格文章 Disk Scrub - Why and When?:
- 這一問題對支援人員來說有難度,因為最貼切的回答是「看情況」。所以,在我給出一個較通用的回答前,有些可以用來建立更貼合你需求的答案的提示。
- 你最舊的備份的有效期是多久?對資料執行 scrub 操作的頻率應至少與你最舊備份的有效期相當,以確保還原點可用。
- 你通常多久會碰到一次磁碟故障?While the recruitment of a hot-spare disk invokes a "resilver" -- a targeted scrub of just the VDEV which lost a disk -- you should probably scrub at least as often as you experience disk failures on average in your specific environment.
- 你多久會讀取一次磁碟上最舊的資料?你應偶爾進行一次 scrub,以防止舊資料在你不知道的情況下出現位腐壞。
- 如果針對上述任一問題的答案是「我不知道」,那最通用的回答是:你應至少每月對 zpool 執行一次 scrub 操作。這一周期對多數用例來說都較為合適,提供了足以在各種高負載環境下完成執行的時間,並快於大型 zpools(192+ 磁碟)出現磁碟故障的時間。
根據 Aaron Toponce 的 ZFS Administration Guide,他建議對消費級磁碟每周執行一次 scrub。
根據服務或定時器執行
較新版本的 OpenZFS 已自帶 zfs-scrub-weekly@pool-to-scrub.timer 和 zfs-scrub-monthly@pool-to-scrub.timer,可直接啟用對應單元。此外,也可以按如下方式使用 systemd 定時器/服務來自動對儲存池執行 scrub。
要對特定儲存池執行每月 scrubbing:
/etc/systemd/system/zfs-scrub@.timer
[Unit]
Description=Monthly zpool scrub on %i

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target
/etc/systemd/system/zfs-scrub@.service
[Unit]
Description=zpool scrub on %i

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/usr/bin/zpool scrub %i

[Install]
WantedBy=multi-user.target
啟用/啟動 zfs-scrub@pool-to-scrub.timer
單元以為特定 zpool 啟用月度 scrubbing。
啟用 TRIM
要檢查你的 vdev 是否支援 TRIM,你可以通過 -t
為 zpool status
輸出添加 TRIM 資訊:
$ zpool status -t tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

	NAME                                     STATE     READ WRITE CKSUM
	tank                                     ONLINE       0     0     0
	  ata-ST31000524AS_5RP4SSNR-part1        ONLINE       0     0     0  (trim unsupported)
	  ata-CT480BX500SSD1_2134A59B933D-part1  ONLINE       0     0     0  (untrimmed)

errors: No known data errors
ZFS 可以手動或通過 autotrim
定時對支援的裝置進行 TRIM。
對 zpool 手動進行 TRIM:
# zpool trim <zpool>
為資料池中所有支援的 vdev 啟用自動 TRIM:
# zpool set autotrim=on <zpool>
由於 autotrim 與完整的 zpool trim 的操作有所不同,你可能會想偶爾手動執行 TRIM。
要使用 systemd 定時器/服務對特定儲存池每月執行一次完整的 zpool trim:
/etc/systemd/system/zfs-trim@.timer
[Unit]
Description=Monthly zpool trim on %i

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target
/etc/systemd/system/zfs-trim@.service
[Unit]
Description=zpool trim on %i
Documentation=man:zpool-trim(8)
Requires=zfs.target
After=zfs.target
ConditionACPower=true
ConditionPathIsDirectory=/sys/module/zfs

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/bin/sh -c '\
if /usr/bin/zpool status %i | grep "trimming"; then\
exec /usr/bin/zpool wait -t trim %i;\
else exec /usr/bin/zpool trim -w %i; fi'
ExecStop=-/bin/sh -c '/usr/bin/zpool trim -s %i 2>/dev/null || true'

[Install]
WantedBy=multi-user.target
啟用/啟動 zfs-trim@pool-to-trim.timer
單元以對特定儲存池啟用 TRIM。
SSD Caching
If your pool has no configured log devices, ZFS reserves space on the pool's data disks for its intent log (the ZIL, also called SLOG). If your data disks are slow (e.g. HDD) it is highly recommended to configure the ZIL on solid state drives for better write performance and also to consider a layer 2 adaptive replacement cache (L2ARC). The process to add them is very similar to adding a new VDEV.
All of the below references to device-id are the IDs from /dev/disk/by-id/*
.
ZIL
To add a mirrored ZIL:
# zpool add <pool> log mirror <device-id-1> <device-id-2>
Or to add a single device ZIL (unsafe):
# zpool add <pool> log <device-id>
Because the ZIL device stores data that has not been written to the pool, it is important to use devices that can finish writes when power is lost. It is also important to use redundancy, since a device failure can cause data loss. In addition, the ZIL is only used for sync writes, so may not provide any performance improvement when your data drives are as fast as your ZIL drive(s).
L2ARC
使用如下命令添加 L2ARC:
# zpool add <pool> cache <device-id>
L2ARC 是唯讀快取,所以不需要任何冗餘。從 ZFS 2.0.0 版本開始,L2ARC 可以在重新啟動後保留。[8]
L2ARC 通常只在熱資料量比系統記憶體更大,但又小到能放入 L2ARC 的情況下有用。L2ARC 由系統記憶體中的 ARC 進行索引,每條記錄(預設為 128KiB)消耗 70 位元組記憶體。所以,對應的記憶體用量可用以下公式計算:
(L2ARC 大小) / (記錄大小) × 70 位元組
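例如,假設 L2ARC 大小為 256 GiB、記錄大小為預設的 128 KiB,則大約需要 256 GiB / 128 KiB × 70 位元組 = 2,097,152 × 70 位元組 ≈ 140 MiB 的 ARC 記憶體來索引 L2ARC。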
因此,由於 L2ARC 占用了 ARC 的記憶體空間,在某些情況下它會造成儲存效能的下降。
ZVOLs
ZFS volumes (ZVOLs) can suffer from the same block size-related issues as RDBMSes, but it is worth noting that the default recordsize for ZVOLs is 8 KiB already. If possible, it is best to align any partitions contained in a ZVOL to your recordsize (current versions of fdisk and gdisk by default automatically align at 1MiB segments, which works), and file system block sizes to the same size. Other than this, you might tweak the recordsize to accommodate the data inside the ZVOL as necessary (though 8 KiB tends to be a good value for most file systems, even when using 4 KiB blocks on that level).
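For example, a ZVOL meant to back a VM disk might be created with an explicit block size (a sketch; the name, size and volblocksize value are only examples):
# zfs create -V 32G -o volblocksize=16K <pool>/vmdisk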
RAIDZ and Advanced Format physical disks
Each block of a ZVOL gets its own parity disks, and if you have physical media with logical block sizes of 4096B, 8192B, or so on, the parity needs to be stored in whole physical blocks, and this can drastically increase the space requirements of a ZVOL, requiring 2× or more physical storage capacity than the ZVOL's logical capacity. Setting the recordsize to 16k or 32k can help reduce this footprint drastically.
See OpenZFS issue #1807 for details.
I/O Scheduler
While ZFS is expected to work well with modern schedulers, including mq-deadline and none, experimenting with manually setting the I/O scheduler on ZFS disks may yield performance gains. The ZFS recommendation is that "[...] users leave the default scheduler unless you're encountering a specific problem, or have clearly measured a performance improvement for your workload" [9].
排障
Creating a zpool fails
If the following error occurs then it can be fixed.
# the kernel failed to rescan the partition table: 16
# cannot label 'sdc': try using parted(8) and then provide a specific slice: -1
One reason this can occur is because ZFS expects pool creation to take less than 1 second[10][11]. This is a reasonable assumption under ordinary conditions, but in many situations it may take longer. Each drive will need to be cleared again before another attempt can be made.
# parted /dev/sda rm 1
# parted /dev/sda rm 1
# dd if=/dev/zero of=/dev/sdb bs=512 count=1
# zpool labelclear /dev/sda
A brute force creation can be attempted over and over again, and with some luck the zpool creation will take less than 1 second. One cause of slow creation can be slow burst reads/writes on a drive. By reading from the disk in parallel with zpool creation, it may be possible to increase burst speeds.
# dd if=/dev/sda of=/dev/null
This can be done with multiple drives by saving the above command for each drive to a file on separate lines and running
# cat $FILE | parallel
Then run ZPool creation at the same time.
ZFS is using too much RAM
By default, ZFS caches file operations (ARC) using up to half of available system memory on the host. To adjust the ARC size, add the following to the Kernel parameters list:
zfs.zfs_arc_max=536870912 # (for 512MiB)
In case the default value of zfs_arc_min (1/32 of system memory) is higher than the specified zfs_arc_max, you also need to add the following to the Kernel parameters list:
zfs.zfs_arc_min=268435456 # (for 256MiB, needs to be lower than zfs.zfs_arc_max)
You may also want to increase zfs_arc_sys_free
instead (in this example to 8GiB):
# echo $((8*1024**3)) > /sys/module/zfs/parameters/zfs_arc_sys_free
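Alternatively, module parameters such as zfs_arc_max can be set persistently via modprobe.d (a sketch; the value is only an example):
/etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=536870912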
For a more detailed description, as well as other configuration options, see Gentoo:ZFS#ARC.
ZFS should release ARC as applications reserve more RAM, but some applications still get confused, and reported free RAM is always wrong. But in case all your applications work as intended and you have no problems, there is no need to change ARC settings.
Does not contain an EFI label
The following error will occur when attempting to create a zfs filesystem,
/dev/disk/by-id/<id> does not contain an EFI label but it may contain partition
The way to overcome this is to use the -f flag with the zpool create command.
No hostid found
An error that occurs at boot with the following lines appearing before initscript output:
ZFS: No hostid found on kernel command line or /etc/hostid.
This warning occurs because the ZFS module does not have access to the spl hostid. There are two solutions for this. Either place the spl hostid in the 核心參數 in the boot loader, for example by adding spl.spl_hostid=0x00bab10c.
The other solution is to make sure that there is a hostid in /etc/hostid
, and then regenerate the initramfs image. Which will copy the hostid into the initramfs image.
Pool cannot be found while booting from SAS/SCSI devices
In case you are booting from SAS/SCSI based devices, you might occasionally get boot problems where the pool you are trying to boot from cannot be found. A likely reason for this is that your devices are initialized too late in the process, meaning zfs cannot find any devices at the time when it tries to assemble your pool.
In this case you should force the scsi driver to wait for devices to come online before continuing. You can do this by putting this into /etc/modprobe.d/zfs.conf
:
/etc/modprobe.d/zfs.conf
options scsi_mod scan=sync
Afterwards, regenerate the initramfs.
This works because the zfs hook will copy the file at /etc/modprobe.d/zfs.conf
into the initcpio which will then be used at build time.
On boot the zfs pool does not mount stating: "pool may be in use from other system"
Unexported pool
If the new installation does not boot because the zpool cannot be imported, chroot into the installation and properly export the zpool. See #Emergency chroot repair with archzfs.
Once inside the chroot environment, load the ZFS module and force import the zpool,
# zpool import -a -f
now export the pool:
# zpool export <pool>
To see the available pools, use,
# zpool status
It is necessary to export a pool because of the way ZFS uses the hostid to track the system the zpool was created on. The hostid is generated partly based on the network setup. During the installation in the archiso the network configuration could be different generating a different hostid than the one contained in the new installation. Once the zfs filesystem is exported and then re-imported in the new installation, the hostid is reset. See Re: Howto zpool import/export automatically? - msg#00227.
If ZFS complains about "pool may be in use" after every reboot, properly export pool as described above, and then regenerate the initramfs in normally booted system.
Incorrect hostid
Double check that the pool is properly exported. Exporting the zpool clears the hostid marking the ownership. So during the first boot the zpool should mount correctly. If it does not there is some other problem.
Reboot again, if the zfs pool refuses to mount it means the hostid is not yet correctly set in the early boot phase and it confuses zfs. Manually tell zfs the correct number, once the hostid is coherent across the reboots the zpool will mount correctly.
Boot using zfs_force and write down the hostid. This one is just an example.
$ hostid
0a0af0f8
This number has to be added to the 核心參數 as spl.spl_hostid=0x0a0af0f8. Another solution is writing the hostid inside the initramfs image; see the installation guide explanation about this.
Users can always ignore the check adding zfs_force=1
in the 核心參數, but it is not advisable as a permanent solution.
Devices have different sector alignment
Once a drive has become faulted it should be replaced A.S.A.P. with an identical drive.
# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -f
but in this instance, the following error is produced:
cannot replace ata-ST3000DM001-9YN166_S1F0KDGY with ata-ST3000DM001-1CH166_W1F478BD: devices have different sector alignment
ZFS uses the ashift option to adjust for physical block size. When replacing the faulted disk, ZFS is attempting to use ashift=12
, but the faulted disk is using a different ashift (probably ashift=9
) and this causes the resulting error.
For Advanced Format disks with 4 KiB block size, an ashift
of 12
is recommended for best performance. See OpenZFS FAQ: Performance Considerations and ZFS and Advanced Format disks.
Use zdb to find the ashift of the zpool (for example, zdb | grep ashift), then use the -o argument to set the ashift of the replacement drive:
# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -o ashift=9 -f
Check the zpool status for confirmation:
# zpool status -v
  pool: bigdata
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jun 16 11:16:28 2014
	10.3G scanned out of 5.90T at 81.7M/s, 20h59m to go
	2.57G resilvered, 0.17% done
config:

	NAME                                   STATE     READ WRITE CKSUM
	bigdata                                DEGRADED     0     0     0
	  raidz1-0                             DEGRADED     0     0     0
	    replacing-0                        OFFLINE      0     0     0
	      ata-ST3000DM001-9YN166_S1F0KDGY  OFFLINE      0     0     0
	      ata-ST3000DM001-1CH166_W1F478BD  ONLINE       0     0     0  (resilvering)
	    ata-ST3000DM001-9YN166_S1F0JKRR    ONLINE       0     0     0
	    ata-ST3000DM001-9YN166_S1F0KBP8    ONLINE       0     0     0
	    ata-ST3000DM001-9YN166_S1F0JTM1    ONLINE       0     0     0

errors: No known data errors
Pool resilvering stuck/restarting/slow?
According to ZFS issue #840, this is a known issue since 2012 with ZFS-ZED which causes the resilvering process to constantly restart, sometimes get stuck and be generally slow for some hardware. The simplest mitigation is to stop zfs-zed.service until the resilver completes.
Your boot time can be significantly impacted if you update your initramfs (e.g. when doing a kernel update) while you have additional but non-permanently attached pools imported, because these pools will get added to your initramfs zpool.cache and ZFS will attempt to import these extra pools on every boot, regardless of whether you have exported them and removed them from your regular zpool.cache.
If you notice ZFS trying to import unavailable pools at boot, first run:
$ zdb -C
To check your zpool.cache for pools you do not want imported at boot. If this command is showing (a) additional, currently unavailable pool(s), run:
# zpool set cachefile=/etc/zfs/zpool.cache zroot
To clear the zpool.cache of any pools other than the pool named zroot. Sometimes there is no need to refresh your zpool.cache, but instead all you need to do is regenerate the initramfs.
ZFS Command History
ZFS logs changes to a pool's structure natively as a log of executed commands in a ring buffer (which cannot be turned off). The log may be helpful when restoring a degraded or failed pool.
# zpool history zpool
History for 'zpool':
2023-02-19.16:28:44 zpool create zpool raidz1 /scratch/disk_1.img /scratch/disk_2.img /scratch/disk_3.img
2023-02-19.16:31:29 zfs set compression=lz4 zpool
2023-02-19.16:41:45 zpool scrub zpool
2023-02-19.17:00:57 zpool replace zpool /scratch/disk_1.img /scratch/bigger_disk_1.img
2023-02-19.17:01:34 zpool scrub zpool
2023-02-19.17:01:42 zpool replace zpool /scratch/disk_2.img /scratch/bigger_disk_2.img
2023-02-19.17:01:46 zpool replace zpool /scratch/disk_3.img /scratch/bigger_disk_3.img
小技巧
建立帶有 ZFS 支援的 Archiso 映像
建立 Arch Linux live CD/DVD/USB 映像的具體步驟在 Archiso 中已有描述。如需在映像中加入 ZFS 支援,可以選擇手動構建 AUR 上的 PKGBUILDs,或是在映像中加入非官方使用者倉庫中的預構建包。
使用 AUR 自行構建 ZFS 包
參考正常流程自行構建你需要的 ZFS 包。如果你不確定需要哪個包,zfs-dkmsAUR 和 zfs-utilsAUR 可以滿足在 Archiso 映像上的多數使用需求。下一步需要建立一個本地倉庫,並將該倉庫添加到新組態的 pacman 設定檔中。
將構建出的包添加到要安裝的包列表中。下面的例子假設你僅想安裝 zfs-dkmsAUR 和 zfs-utilsAUR 包:
packages.x86_64
...
zfs-dkms
zfs-utils
如果你添加了任何 DKMS 包,請確保你同時添加了 ISO 所用核心對應的標頭檔包(預設核心對應為 linux-headers包)。
使用 archzfs 非官方使用者倉庫
將 archzfs 非官方使用者倉庫添加到新 Archiso 組態中的 pacman.conf
檔案中。
將 archzfs-linux
軟體包組添加到要安裝的軟體包包列表中(archzfs
倉庫提供的包僅支援 x86_64 架構):
packages.x86_64
...
archzfs-linux
收尾
無論你使用了哪種方法,最後都需要構建 ISO 映像。
Automatic snapshots
zrepl
The zreplAUR package provides a ZFS automatic replication service, which could also be used as a snapshotting service much like snapper.
For details on how to configure the zrepl daemon, see the zrepl documentation. The configuration file should be located at /etc/zrepl/zrepl.yml
. Then, run zrepl configcheck
to make sure that the syntax of the config file is correct. Finally, enable zrepl.service
.
sanoid
sanoidAUR is a policy-driven tool for taking snapshots. Sanoid also includes syncoid
, which is for replicating snapshots. It comes with systemd services and a timer.
Sanoid only prunes snapshots on the local system. To prune snapshots on the remote system, run sanoid there as well with prune options. Either use the --prune-snapshots
command line option or use the --cron
command line option together with the autoprune = yes
and autosnap = no
configuration options.
ZFS Automatic Snapshot Service for Linux
The zfs-auto-snapshot-gitAUR package provides a shell script to automate the management of snapshots, with each named by date and label (hourly, daily, etc), giving quick and convenient snapshotting of all ZFS datasets. The package also installs cron tasks for quarter-hourly, hourly, daily, weekly, and monthly snapshots. Optionally adjust the --keep parameter
from the defaults depending on how far back the snapshots are to go (the monthly script by default keeps data for up to a year).
To prevent a dataset from being snapshotted at all, set com.sun:auto-snapshot=false on it. Likewise, set more fine-grained control by label; if, for example, no monthlies are to be kept on a snapshot, set com.sun:auto-snapshot:monthly=false.
To have snapshots taken even while a pool is being scrubbed, remove --skip-scrub from the ExecStart line. Consequences not known, someone please edit.
Once the package has been installed, enable and start the selected timers (zfs-auto-snapshot-{frequent,daily,weekly,monthly}.timer
).
建立共享
NFS
首先,確保系統已經安裝並組態了 NFS。注意:無需編輯 /etc/exports
。對於 NFS 共享,確保已經啟動 nfs-server.service
和 zfs-share.service
。
要將儲存池共享到網路:
# zfs set sharenfs=on 存储池名
要將資料集共享到網路:
# zfs set sharenfs=on 存储池名/数据集名
要為特定 IP 段啟用讀寫權限:
# zfs set sharenfs="rw=@192.168.1.100/24,rw=@10.0.0.0/24" 存储池名/数据集名
要確認資料集是否已成功匯出:
# showmount -e `hostname`
Export list for hostname: /path/of/dataset 192.168.1.100/24
要確認當前匯出狀態的詳細資訊:
# exportfs -v
/path/of/dataset 192.168.1.100/24(sync,wdelay,hide,no_subtree_check,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)
要通過 ZFS 檢視當前 NFS 共享列表:
# zfs get sharenfs
SMB
The usershare path must be /var/lib/samba/usershares, and the only supported sharesmb options are on and off. Enabling guest access via sharesmb=guest_ok=y is not supported.
When sharing through SMB, using usershares in /etc/samba/smb.conf will allow ZFS to set up and create the shares. See Samba#Enable Usershares for details.
/etc/samba/smb.conf
[global]
usershare path = /var/lib/samba/usershares
usershare max shares = 100
usershare allow guests = yes
usershare owner only = no
Create and set permissions on the user directory as root
# mkdir /var/lib/samba/usershares
# chmod +t /var/lib/samba/usershares
To make a pool available on the network:
# zfs set sharesmb=on nameofzpool
To make a dataset available on the network:
# zfs set sharesmb=on nameofzpool/nameofdataset
To check if the dataset is exported successfully:
# smbclient -L localhost -U%
	Sharename                     Type      Comment
	---------                     ----      -------
	IPC$                          IPC       IPC Service (SMB Server Name)
	nameofzpool_nameofdataset     Disk      Comment: path/of/dataset
SMB1 disabled -- no workgroup available
To view the current SMB share list by ZFS:
# zfs get sharesmb
Encryption in ZFS using dm-crypt
Before OpenZFS version 0.8.0, ZFS did not support encryption directly (See #Native encryption). Instead, zpools can be created on dm-crypt block devices. Since the zpool is created on the plain-text abstraction, it is possible to have the data encrypted while having all the advantages of ZFS like deduplication, compression, and data robustness. Furthermore, utilizing dm-crypt will encrypt the zpools metadata, which the native encryption can inherently not provide.[12]
dm-crypt, possibly via LUKS, creates devices in /dev/mapper
and their name is fixed. So you just need to change zpool create
commands to point to that names. The idea is configuring the system to create the /dev/mapper
block devices and import the zpools from there. Since zpools can be created in multiple devices (raid, mirroring, striping, ...), it is important all the devices are encrypted otherwise the protection might be partially lost.
For example, an encrypted zpool can be created using plain dm-crypt (without LUKS) with:
# cryptsetup open --type=plain --hash=sha256 --cipher=aes-xts-plain64 --offset=0 \
             --key-file=/dev/sdZ --key-size=512 /dev/sdX enc
# zpool create zroot /dev/mapper/enc
In the case of a root filesystem pool, the mkinitcpio.conf
HOOKS line will enable the keyboard for the password, create the devices, and load the pools. It will contain something like:
HOOKS=(... keyboard encrypt zfs ...)
Since the /dev/mapper/enc
name is fixed no import errors will occur.
Creating encrypted zpools works fine. But if you need encrypted directories, for example to protect your users' homes, ZFS loses some functionality.
ZFS will see the encrypted data, not the plain-text abstraction, so compression and deduplication will not work. The reason is that encrypted data always has high entropy, making compression ineffective, and even the same input produces different output (thanks to salting), making deduplication impossible. To reduce the unnecessary overhead it is possible to create a sub-filesystem for each encrypted directory and use eCryptfs on it.
For example to have an encrypted home: (the two passwords, encryption and login, must be the same)
# zfs create -o compression=off -o dedup=off -o mountpoint=/home/<username> <zpool>/<username>
# useradd -m <username>
# passwd <username>
# ecryptfs-migrate-home -u <username>
<log in user and complete the procedure with ecryptfs-unwrap-passphrase>
Emergency chroot repair with archzfs
To get into the ZFS filesystem from live system for maintenance, there are two options:
- Build custom archiso with ZFS as described in #Create an Archiso image with ZFS support.
- Boot the latest official archiso and bring up the network. Then enable archzfs repository inside the live system as usual, sync the pacman package database and install the archzfs-archiso-linux package.
To start the recovery, load the ZFS kernel modules:
# modprobe zfs
Import the pool:
# zpool import -a -R /mnt
Mount the boot partition and EFI system partition (if any):
# mount /dev/sda2 /mnt/boot
# mount /dev/sda1 /mnt/efi
Chroot into the ZFS filesystem:
# arch-chroot /mnt /bin/bash
Check the kernel version:
# pacman -Qi linux
# uname -r
uname will show the kernel version of the archiso. If they are different, run depmod (in the chroot) with the correct kernel version of the chroot installation:
# depmod -a 3.6.9-1-ARCH (version gathered from pacman -Qi linux but using the matching kernel modules directory name under the chroot's /lib/modules)
This will load the correct kernel modules for the kernel version installed in the chroot installation.
Regenerate the initramfs. There should be no errors.
Bind mount
Here a bind mount from /mnt/zfspool to /srv/nfs4/music is created. The configuration ensures that the zfs pool is ready before the bind mount is created.
fstab
See systemd.mount(5) for more information on how systemd converts fstab into mount unit files with systemd-fstab-generator(8).
/etc/fstab
/mnt/zfspool /srv/nfs4/music none bind,defaults,nofail,x-systemd.requires=zfs-mount.service 0 0
Monitoring / Mailing on Events
See ZED: The ZFS Event Daemon for more information.
An email forwarder, such as S-nail, is required to accomplish this. Test it to be sure it is working correctly.
Uncomment the following in the configuration file:
/etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="root" ZED_EMAIL_PROG="mailx" ZED_NOTIFY_VERBOSE=0 ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
Update 'root' in ZED_EMAIL_ADDR="root"
to the email address you want to receive notifications at.
If you are keeping your mailrc in your home directory, you can tell mail to get it from there by setting MAILRC
:
/etc/zfs/zed.d/zed.rc
export MAILRC=/home/<user>/.mailrc
This works because ZED sources this file, so mailx
sees this environment variable.
If you want to receive an email no matter the state of your pool, you will want to set ZED_NOTIFY_VERBOSE=1. You will need to do this temporarily for testing.
Start and enable zfs-zed.service
.
With ZED_NOTIFY_VERBOSE=1
, you can test by running a scrub as root: zpool scrub <pool-name>
.
Wrap shell commands in pre & post snapshots
Since it is so cheap to make a snapshot, we can use this as a measure of security for sensitive commands such as system and package upgrades. If we make a snapshot before, and one after, we can later diff these snapshots to find out what changed on the filesystem after the command executed. Furthermore we can also rollback in case the outcome was not desired.
znp
E.g.:
# zfs snapshot -r zroot@pre
# pacman -Syu
# zfs snapshot -r zroot@post
# zfs diff zroot@pre zroot@post
# zfs rollback zroot@pre
A utility that automates the creation of pre and post snapshots around a shell command is znp.
E.g.:
# znp pacman -Syu
# znp find / -name "something*" -delete
and you would get snapshots created before and after the supplied command, and also output of the commands logged to file for future reference so we know what command created the diff seen in a pair of pre/post snapshots.
Remote unlocking of ZFS encrypted root
As of PR #261, archzfs
supports SSH unlocking of natively-encrypted ZFS datasets. This section describes how to use this feature, and is largely based on dm-crypt/Specialties#Busybox based initramfs (built with mkinitcpio).
- Install mkinitcpio-netconf包 to provide hooks for setting up early user space networking.
- Choose an SSH server to use in early user space. The options are mkinitcpio-tinyssh包 or mkinitcpio-dropbear包, and are mutually exclusive.
- If using mkinitcpio-tinyssh包, it is also recommended to install tinyssh包 or tinyssh-convert-gitAUR. This tool converts an existing OpenSSH hostkey to the TinySSH key format, preserving the key fingerprint and avoiding connection warnings. The TinySSH and Dropbear mkinitcpio install scripts will automatically convert existing hostkeys when generating a new initcpio image.
- Decide whether to use an existing OpenSSH key or generate a new one (recommended) for the host that will be connecting to and unlocking the encrypted ZFS machine. Copy the public key into
/etc/tinyssh/root_key
or/etc/dropbear/root_key
. When generating the initcpio image, this file will be added toauthorized_keys
for the root user and is only valid in the initrd environment. - Add the
ip=
核心參數 to your boot loader configuration. Theip
string is highly configurable. A simple DHCP example is shown below.ip=:::::eth0:dhcp
- Edit
/etc/mkinitcpio.conf
to include thenetconf
,dropbear
ortinyssh
, andzfsencryptssh
hooks before thezfs
hook:HOOKS=(... netconf <tinyssh>|<dropbear> zfsencryptssh zfs ...)
- Regenerate the initramfs.
- Reboot and try it out!
Changing the SSH server port
By default, mkinitcpio-tinyssh包 and mkinitcpio-dropbear包 listen on port 22
. You may wish to change this.
For TinySSH, copy /usr/lib/initcpio/hooks/tinyssh
to /etc/initcpio/hooks/tinyssh
, and find/modify the following line in the run_hook()
function:
/etc/initcpio/hooks/tinyssh
/usr/bin/tcpserver -HRDl0 0.0.0.0 <new_port> /usr/sbin/tinysshd -v /etc/tinyssh/sshkeydir &
For Dropbear, copy /usr/lib/initcpio/hooks/dropbear
to /etc/initcpio/hooks/dropbear
, and find/modify the following line in the run_hook()
function:
/etc/initcpio/hooks/dropbear
/usr/sbin/dropbear -E -s -j -k -p <new_port>
Unlocking from a Windows machine using PuTTY/Plink
First, we need to use puttygen.exe
to import and convert the OpenSSH key generated earlier into PuTTY's .ppk private key format. We will call it zfs_unlock.ppk
for this example.
The mkinitcpio-netconf process above does not set up a shell (nor do we need one). However, because there is no shell, PuTTY will immediately close after a successful connection. This can be disabled in the PuTTY SSH configuration (Connection > SSH > [X] Do not start a shell or command at all), but it still does not allow us to see stdout or enter the encryption passphrase. Instead, we use plink.exe
with the following parameters:
plink.exe -ssh -l root -i c:\path\to\zfs_unlock.ppk <hostname>
The plink command can be put into a batch script for ease of use.
Enabling bclone support
To use cp --reflink
and other commands needing bclone support, it is necessary to upgrade the feature flags if coming from a version prior to 2.2.2. This will allow the pool to have support for bclone. This is done with zpool upgrade
, if the status of the pool shows this is possible.
It is also required to enable a module parameter, otherwise userspace apps will not be able to use this feature. You can do this by putting this into /etc/modprobe.d/zfs.conf
:
/etc/modprobe.d/zfs.conf
options zfs zfs_bclone_enabled=1
Check that it is working, and how much space is being saved, with the command: zpool get all POOLNAME | grep clon
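For example (file names are only illustrative; cp is the GNU coreutils version):
$ cp --reflink=always bigfile bigfile.clone
$ zpool get all POOLNAME | grep clon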
參考
- Aaron Toponce's 17-part blog on ZFS
- OpenZFS releases
- OpenZFS FAQ
- FreeBSD Handbook - The Z File System
- Oracle Solaris ZFS Administration Guide
- ZFS Best Practices Guide
- ZFS Troubleshooting Guide
- How Pingdom uses ZFS to back up 5TB of MySQL data every day
- Tutorial on adding the modules to a custom kernel
- How to create cross platform ZFS disks under Linux
- How-To: Using ZFS Encryption at Rest in OpenZFS (ZFS on Linux, ZFS on FreeBSD, …)
- Archzfs iso download page: Frequently updated and downloadable archzfs linux iso with full OpenZFS support since 2016