From a4103262b01a1b8704b37c01c7c813df91b7b119 Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
Date: Sun, 18 Sep 2022 01:59:58 -0600
Subject: [PATCH 01/29] mm: x86, arm64: add arch_has_hw_pte_young()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patch series "Multi-Gen LRU Framework", v14.

What's new
==========
1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
   Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
   machines. The old direct reclaim backoff, which tries to enforce a
   minimum fairness among all eligible memcgs, over-swapped by about
   (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
   pulls the plug on swapping once the target is met, trades some
   fairness for curtailed latency:
   https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
3. Fixed minor build warnings and conflicts. More comments and nits.
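
To put the over-swap estimate in item 2 into numbers, here is a minimal
sketch. It assumes 4 KiB pages and a reclaim target of 32 pages;
old_backoff_overswap() is a hypothetical helper written for
illustration, not a kernel function, and DEF_PRIORITY is set to 12 to
mirror the kernel's definition.

```c
#include <stdint.h>

/* Mirrors the kernel's value; redefined here so the sketch stands alone. */
#define DEF_PRIORITY 12

/*
 * Hypothetical helper, not a kernel function: estimate, in pages, how
 * far the old backoff could overshoot the reclaim target, following
 * (total_mem >> DEF_PRIORITY) - nr_to_reclaim. Clamped at zero for
 * machines too small for the formula to apply.
 */
static inline uint64_t old_backoff_overswap(uint64_t total_pages,
					    uint64_t nr_to_reclaim)
{
	uint64_t scanned = total_pages >> DEF_PRIORITY;

	return scanned > nr_to_reclaim ? scanned - nr_to_reclaim : 0;
}
```

For a 1 TiB machine (2^28 pages of 4 KiB), old_backoff_overswap(1ULL << 28, 32)
gives 65536 - 32 = 65504 pages, i.e., roughly 256 MiB swapped out
beyond the target under these assumptions.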

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
    its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
  Apache Cassandra      Memcached
  Apache Hadoop         MongoDB
  Apache Spark          PostgreSQL
  MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
  fs_fio_bench_hdd_mq      pft
  fs_lmbench               pgsql-hammerdb
  fs_parallelio            redis
  fs_postmark              stream
  hackbench                sysbenchthread
  kernbench                tpcc_spark
  memcached                unixbench
  multichase               vm-scalability
  mutilate                 will-it-scale
  nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/

Real-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
  I have Archlinux with 8G RAM + zswap + swap. While developing, I
  have lots of apps opened such as multiple LSP-servers for different
  langs, chats, two browsers, etc... Usually, my system gets quickly
  to a point of SWAP-storms, where I have to kill LSP-servers,
  restart browsers to free memory, etc, otherwise the system lags
  heavily and is barely usable.

  1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
  patchset, and I started up by opening lots of apps to create memory
  pressure, and worked for a day like this. Till now I had not a
  single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
  getting to the point of 3G in SWAP before without a single
  SWAP-storm.

Vaibhav from IBM reported [12]:
  In a synthetic MongoDB Benchmark, seeing an average of ~19%
  throughput improvement on POWER10(Radix MMU + 64K Page Size) with
  MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
  three different request distributions, namely, Exponential, Uniform
  and Zipfian.

Shuang from U of Rochester reported [13]:
  With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
  and [9.26, 10.36]% higher throughput, respectively, for random
  access, Zipfian (distribution) access and Gaussian (distribution)
  access, when the average number of jobs per CPU is 1; 95% CIs
  [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
  throughput, respectively, for random access, Zipfian access and
  Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
  With Memcached allocating ~100GB of byte-addressable Optane,
  performance improvement in terms of throughput (measured as queries
  per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of ChromeOS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.

The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. ChromeOS [19]
5. Liquorix [20]
6. OpenWrt [21]
7. post-factum [22]
8. XanMod [23]

[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://openwrt.org
[22] https://codeberg.org/pf-kernel
[23] https://xanmod.org

Summary
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have demonstrated
   similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
   will likely be put to use for both personal computers and data
   centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult if not
   impossible to achieve similar effects with other approaches.

This patch (of 14):

Some architectures automatically set the accessed bit in PTEs, e.g., x86
and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault following
the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty page
faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it should
not be used in architecture-independent code that involves correctness,
e.g., to determine whether TLB flushes are required (in combination with
the accessed bit).
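
The override pattern and the kind of decision described above can be
sketched in user-space C as follows. Only arch_has_hw_pte_young() and
its false default come from this patch; ptes_per_pass() and its
batching policy are hypothetical, shown purely to illustrate the idea
of spreading the work out, and are not how MGLRU itself is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * An architecture header would define this macro to override the stub,
 * as the x86 hunk of this patch does. It is left undefined here, so the
 * generic fallback below is compiled in.
 */
#ifndef arch_has_hw_pte_young
static inline bool arch_has_hw_pte_young(void)
{
	return false;	/* assume access through an old PTE faults */
}
#endif

/*
 * Hypothetical policy helper, not part of this patch: decide how many
 * PTEs to clear the accessed bit in during one pass. With hardware
 * support, clearing does not cause later page faults, so the whole
 * range is processed at once; otherwise the work is capped at "batch"
 * PTEs per pass to avoid a burst of faults.
 */
static inline size_t ptes_per_pass(bool hw_young, size_t nr_ptes, size_t batch)
{
	assert(batch > 0);

	if (hw_young || batch > nr_ptes)
		return nr_ptes;

	return batch;
}
```

A caller would pass the capability as a hint only, e.g.,
ptes_per_pass(arch_has_hw_pte_young(), nr_ptes, 64), in keeping with
the note above that the result must not be relied on for correctness.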

Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/arm64/include/asm/pgtable.h | 14 ++------------
 arch/x86/include/asm/pgtable.h   |  6 +++---
 include/linux/pgtable.h          | 13 +++++++++++++
 mm/memory.c                      | 14 +-------------
 4 files changed, 19 insertions(+), 28 deletions(-)

--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -999,23 +999,13 @@ static inline void update_mmu_cache(stru
  * page after fork() + CoW for pfn mappings. We don't always have a
  * hardware-managed access flag on arm64.
  */
-static inline bool arch_faults_on_old_pte(void)
-{
-	WARN_ON(preemptible());
-
-	return !cpu_has_hw_af();
-}
-#define arch_faults_on_old_pte arch_faults_on_old_pte
+#define arch_has_hw_pte_young cpu_has_hw_af
 
 /*
  * Experimentally, it's cheap to set the access flag in hardware and we
  * benefit from prefaulting mappings as 'old' to start with.
  */
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
-	return !arch_faults_on_old_pte();
-}
-#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
+#define arch_wants_old_prefaulted_pte cpu_has_hw_af
 
 #endif /* !__ASSEMBLY__ */
 
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_c
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
 {
-	return false;
+	return true;
 }
 
 #endif	/* __ASSEMBLY__ */
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef arch_has_hw_pte_young
+/*
+ * Return whether the accessed bit is supported on the local CPU.
+ *
+ * This stub assumes accessing through an old PTE triggers a page fault.
+ * Architectures that automatically set the access bit should overwrite it.
+ */
+static inline bool arch_has_hw_pte_young(void)
+{
+	return false;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -121,18 +121,6 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
-	/*
-	 * Those arches which don't have hw access flag feature need to
-	 * implement their own helper. By default, "true" means pagefault
-	 * will be hit on old pte.
-	 */
-	return true;
-}
-#endif
-
 #ifndef arch_wants_old_prefaulted_pte
 static inline bool arch_wants_old_prefaulted_pte(void)
 {
@@ -2791,7 +2779,7 @@ static inline int cow_user_page(struct p
 	 * On architectures with software "accessed" bits, we would
 	 * take a double page fault, so mark it accessed here.
 	 */
-	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+	if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
 		pte_t entry;
 
 		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);